1) Message boards : Number crunching : Cache limitations on Intel X computers (Message 143314)
Posted 3 days ago by Grebuloner (Project donor)
No, I don't know what the difference between exclusive and non-inclusive caches is.

"An inclusive cache contains everything in the cache underneath it and has to be at least the same size as the cache underneath (and usually a lot bigger), compared to an exclusive cache which has none of the data in the cache underneath it. The benefit of an inclusive cache means that if a line in the lower cache is removed due it being old for other data, there should still be a copy in the cache above it which can be called upon. The downside is that the cache above it has to be huge – with Skylake-S we have a 256KB L2 and a 2.5MB/core L3, meaning that the L2 data could be replaced 10 times before a line is evicted from the L3.

A non-inclusive cache is somewhat between the two, and is different to an exclusive cache: in this context, when a data line is present in the L2, it does not immediately go into L3. If the value in L2 is modified or evicted, the data then moves into L3, storing an older copy. (The reason it is not called an exclusive cache is because the data can be re-read from L3 to L2 and still remain in the L3). This is what we usually call a victim cache, depending on if the core can prefetch data into L2 only or L2 and L3 as required. In this case, we believe the SKL-SP core cannot prefetch into L3, making the L3 a victim cache similar to what we see on Zen, or Intel’s first eDRAM parts on Broadwell. Victim caches usually have limited roles, especially when they are similar in size to the cache below it (if a line is evicted from a large L2, what are the chances you’ll need it again so soon), but some workloads that require a large reuse of recent data that spills out of L2 will see some benefit."

Anandtech article

I'm sure an extensive set of tests could find the total usable amount, but it sounds like the total amount of unique data cached at any one time depends on how the scheduler sees the software (or how the software sees the scheduler?).
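As a rough illustration of the quoted description, here is a minimal sketch of the bounds on unique cached data under each policy. It is a simplified model, not a measurement: the function name and the policy labels are made up for this example, and the 9960X figures (16 cores, 1 MB L2 per core, 22 MB L3) are assumed from the published specs rather than taken from the article.

```python
def unique_cache_bounds_mb(cores, l2_per_core_mb, l3_total_mb, policy):
    """Return (low, high) bounds on unique data held in L2 + L3, in MB.

    Simplified model: ignores L1, associativity and replacement details.
    """
    l2_total = cores * l2_per_core_mb
    if policy == "inclusive":
        # Every L2 line is duplicated in L3, so L3 alone sets the limit.
        return (l3_total_mb, l3_total_mb)
    if policy == "exclusive":
        # No line lives in both levels, so the capacities simply add.
        return (l2_total + l3_total_mb, l2_total + l3_total_mb)
    if policy == "non-inclusive":
        # Lines re-read from L3 into L2 may stay in L3, so the overlap
        # can be anything from none to complete.
        return (max(l2_total, l3_total_mb), l2_total + l3_total_mb)
    raise ValueError(f"unknown policy: {policy}")


if __name__ == "__main__":
    # Assumed 9960X figures: 16 cores, 1 MB L2 per core, 22 MB shared L3.
    low, high = unique_cache_bounds_mb(16, 1.0, 22.0, "non-inclusive")
    print(f"unique data between {low} MB and {high} MB")
```

Where a real workload lands between the two bounds is exactly the part that would need testing.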
2) Message boards : Number crunching : Max FFT size for each LLR project? (Message 143247)
Posted 5 days ago by Grebuloner (Project donor)
Or a 9980XE running 1 task of 14 threads.
Time of 10 hours....

lol, if AMD takes 20+ hours for this, some "hypothetical" 4950X will not help much. There is more than a 50% difference!! Even an "AMD 8950" will not help ))))
AMD must make radical changes to the architecture and forget "Cinebench marketing", and I say that as the owner of a 1950/2990WX-trash CPU on the dead X399 platform. TRX40 is already a dead platform, and AMD is making too much noise about AMD Pro CPUs and the WRX80 chipset and motherboards.

The soon to be released 4950X could be fast?

On the Intel front, it is a strange world that the 9960X is so much better than the 9980XE at PG and yet it probably isn't known because I suspect I have the only 2 9960Xs at PG.

Latest stats on SOB (not a lot of data I know):
I've run 4 x 16 thread tasks on 2 x 9960X - average time 26,674 sec - 3.24 SOB tasks / day
I've run 2 x 16 thread tasks on a 9980XE - average time 31,403 sec - 2.75 SOB tasks / day
The 9960Xs are consistently faster and have more throughput than the 9980XE for all tasks at PG.
If the 10980XE is faster than the 9960X, it wouldn't be by much. And it is 50% more expensive.
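The tasks / day figures above follow directly from the average run times. A minimal sketch of that arithmetic, assuming each machine runs one 16-thread SOB task at a time (the function name is only for illustration):

```python
SECONDS_PER_DAY = 86_400


def tasks_per_day(avg_seconds, concurrent_tasks=1):
    """Throughput of one machine given the average run time per task."""
    return concurrent_tasks * SECONDS_PER_DAY / avg_seconds


if __name__ == "__main__":
    rate_9960x = tasks_per_day(26_674)   # average over the two 9960Xs
    rate_9980xe = tasks_per_day(31_403)
    print(f"9960X:  {rate_9960x:.2f} SOB tasks / day")
    print(f"9980XE: {rate_9980xe:.2f} SOB tasks / day")
    print(f"9960X advantage: {100 * (rate_9960x / rate_9980xe - 1):.1f}%")
```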

Stats on SGS (I didn't write down how many tasks / day - just the difference between computers - I think it was around 3500 tasks / day for the 9960Xs)
The new 9960X was doing 16 single thread tasks of SGS as was the 9980XE. 18.8% faster / task than the 9980XE.
The old 9960X was doing 16 single thread tasks of SGS. 17.3% faster / task than the 9980XE.

There is one thing that needs to be said about Woo (and probably Cul too) re: FFT size:
On the 9960X, running 2 tasks of 8 threads each gives times of about 19,500 sec.
If you allocate 2 threads for the GPU and run 2 tasks of 7 threads the times are around 27,000 - 28,000 sec. If you run 1 task of 14 threads (with 2 threads for GPU) you get times of about 11,900 sec.
The times on the 9980XE with 2 x 8 thread tasks and 2 threads for GPU are around 23,500 sec.
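To compare these Woo configurations on throughput rather than run time, the quoted times can be turned into tasks per day. This is only a sketch of the arithmetic: the labels are shorthand for the setups described above, and 27,500 sec is an assumed midpoint of the 27,000 - 28,000 sec range.

```python
SECONDS_PER_DAY = 86_400

# (label, concurrent tasks, approximate seconds per task) from the post.
# 27_500 is an assumed midpoint of the quoted 27,000 - 28,000 sec range.
CONFIGS = [
    ("9960X, 2 tasks x 8 threads", 2, 19_500),
    ("9960X, 2 tasks x 7 threads + GPU", 2, 27_500),
    ("9960X, 1 task x 14 threads + GPU", 1, 11_900),
    ("9980XE, 2 tasks x 8 threads + GPU", 2, 23_500),
]


def tasks_per_day(concurrent, seconds_per_task):
    """Tasks completed per day when `concurrent` tasks run side by side."""
    return concurrent * SECONDS_PER_DAY / seconds_per_task


def rank_configs(configs):
    """Return (label, tasks/day) pairs, highest throughput first."""
    rated = [(label, tasks_per_day(n, secs)) for label, n, secs in configs]
    return sorted(rated, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    for label, rate in rank_configs(CONFIGS):
        print(f"{label}: {rate:.2f} tasks / day")
```

On these numbers, dropping from 8 to 7 threads per task costs far more than the one lost thread would suggest, and a single 14-thread task beats two 7-thread tasks.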

The 9980XE has 2 MB more L2 cache (which won't make any difference if you are running 16 threads for LLR / sieve on any of these computers) and 2.75 MB more L3 cache.

I think the 9960X is a much better CPU because that was the intended design? Intel added 2 more cores in response to AMD and created the xx80XE CPUs.

Apologies if I've gone really off topic.

I'll have my 10980XE up and running in the next week, so I'll be able to compare for you, though it will be air cooled initially (Noctua D15 with 3 fans), so that may affect the comparison depending on how you're cooling yours (and indeed, the ability to cool effectively seems to be the biggest determiner of performance with all modern CPUs).

With regard to the 16 vs. 18 core systems, Intel never actually intended to release parts with more than 10 cores when they were first working on Skylake-X. Fortunately for us, AMD had already released the 16 core 1950X, blindsiding them, and the 12-18 core parts were quickly added to the launch slides without any details because Intel was still figuring them out.

Silicon-wise, 18 cores for the HCC die worked with the 4x5 tiling scheme (two tiles go to the memory controllers, so 4x5-2 = 18), which is also why things top out at weird numbers in general: LCC 10C (all Intel initially intended for HEDT): 3x4-2, XCC 28C: 5x6-2.
3) Message boards : Number crunching : Rtx 30 series (Message 143134)
Posted 10 days ago by Grebuloner (Project donor)
FP64 is still at 1/32, and has official performance specs from Nvidia. I've only seen a couple of deep architecture reviews, and they don't really cover it (I'm waiting for Anandtech's take, since they usually cover these things). It does seem that the fixed FP64 hardware is gone, but I remember reading somewhere about the A100 that the cards can virtually combine two FP32s into a single FP64 with little overhead.

For PG, I'd think that 30+ TFLOPS of FP32 OCL3+ is going to win out over 1 TFLOPS of FP64 OCL.

For my own part, I'm looking forward to my extra November paycheck and December safety/vacation payout bonus checks to get a sweet, sweet 3090 to join my new 10980XE. Then I'll start updating the old 900s/1000s with either 3070s or even cheaper used 2080(ti)/Ss, work overtime willing. The massive performance increase at the same price shocked the used market, in that a $1200 GPU is now slower than a $500 one. Turing was priced so high that Pascal held its value. Thankfully, that is no longer the case.
4) Message boards : Problems and Help : PPSE tasks taking much longer than expected (Message 143070)
Posted 12 days ago by Grebuloner (Project donor)
So, once I set my "% of CPU to use" to 50%, I will effectively double the speed of my computer? And that will be the highest speed that my computer will ever be able to achieve?
It seems quite counterintuitive to me that reducing the CPU usage will increase the speed... but then again, I know nothing about computing and very little about maths because I am only a medical student

More than double, actually. The decreased power usage should allow for higher sustained CPU frequency. Since you are a medical student, I will make a poor attempt at an analogy (I'm a math guy, but grew up in a medical household):

Think of a core of your CPU as your brain. You are scheduled to perform surgery on two patients. 100% CPU (using hyperthreads) would be doing a little operating on one patient while keeping your eye on the other, then stopping to wash and change to spend a little time operating on the other patient with your eyes on the first, and back and forth until they're done. Lots of time/resources wasted on the washing and changing in between, efficiency lost to looking at the wrong body, plus reorienting yourself to where you were when you stopped.

50% CPU (just physical cores) is operating on only one patient to completion while the other is still in pre-op. Only one section of time lost to the switch, so ultimately, more work gets done, with less stress (CPU heat) on you.

The hyperthreads don't count in PG primefinding (LLR/GFN) because it is so specialized. Sieving projects are less specialized and work better with 100% CPU enabled.

BTW, since you are probably on a laptop, make sure there is plenty of space around the fans to keep cool air going in unrestricted. You might need to elevate the body off the desk to help with this.
5) Message boards : Number crunching : International Bacon Day Challenge (Message 142894)
Posted 16 days ago by Grebuloner (Project donor)
Generally speaking it is good to run 100% of CPU time with 50% of the processors to take into account hyper-threading. For the Bacon Day challenge though, the PPSE tasks are recommended to be single threaded. Is there any reason why I should not increase the percentage of processors? Heat will not be an issue.....

If I haven't misunderstood the question, it is best to not use hyperthreading for LLR. So 1 thread = 1 physical core. I am not sure if the latest AMD CPUs have changed that in any way.

I agree, no hyperthreading. For PPSE, is there a reason not to use more than 50% of available threads? Can I set the number of cores to 90%?

It's all about hardware resources. LLR very efficiently uses all the resources of every core for one thread. Hyperthreading/SMT is about taking advantage of unused resources. Since there are none, it causes a net decrease in throughput, so keeping the % set to the number of physical cores available is going to be the best option.
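To put numbers on that setting, here is a minimal sketch of how a CPU percentage maps to a thread count. It assumes the client truncates the result and always keeps at least one CPU; that rounding rule is an assumption of this sketch, not something stated in this thread, and both function names are illustrative.

```python
import math


def usable_cpus(logical_cpus, percent):
    """CPUs the client would use for a given '% of CPUs' setting.

    Assumes truncation with a floor of one CPU (an assumption here).
    """
    return max(1, math.floor(logical_cpus * percent / 100))


def percent_for_physical_cores(physical_cores, logical_cpus):
    """Percentage that leaves one task thread per physical core."""
    return 100 * physical_cores / logical_cpus


if __name__ == "__main__":
    # Example: 8 physical cores exposed as 16 logical CPUs by SMT.
    print(percent_for_physical_cores(8, 16))  # percentage to enter
    print(usable_cpus(16, 50))                # threads used at 50%
    print(usable_cpus(16, 90))                # threads used at 90%
```

Under that assumption, 90% on a 16-thread CPU would run 14 tasks on 8 physical cores, so most cores would be shared by two LLR tasks.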
6) Message boards : Seventeen or Bust : The SOB Double Check will end... aka The Way Too Early Prediction Thread (Message 142856)
Posted 17 days ago by Grebuloner (Project donor)
The double-checking probably is in regard to work already done by Rieselsieve? Are these double-checks double-checked at PG as well? Or are these special tasks that are only run once and, if they give the same result as Rieselsieve, are considered authoritative?

It was a double-check of salvaged work residues that had not yet been checked (or were not recorded as checked) during the original RS time. There were also plenty of previously completed ranges for which no residues were found and regular-style 2 user WUs were produced. The original double check thread has more info on the original situation.

Residues are built from steps in the calculation. The probability that two computers make the exact same error(s) in the same place(s) of a workunit is infinitesimal.*

*A case could be made that two identical CPUs with identical hardware bugs would do the wrong thing and match (see: Pentium FDIV bug), but it's been decades since an error of that magnitude has surfaced in retail hardware. The high diversity of modern CPUs minimizes the possibility that these bugs go undiscovered, as a CPU type with PG-significant errors would consistently fail validation across the domain. (In fact, PG recently saw something like this with AMD Navi GPUs, though it was really a software incompatibility issue, which happens, particularly with AMD).
7) Message boards : Number crunching : International Bacon Day Challenge (Message 142803)
Posted 19 days ago by Grebuloner (Project donor)
Edit: Should we start a new thread so as not to inundate this one?


My friend's Bacon number is 2, but I've never worked in the industry, so I can't claim 3 :-(
8) Message boards : General discussion : Mathematical Properties of Infinity (Message 142536)
Posted 31 days ago by Grebuloner (Project donor)

Assuming 1 byte per character. Or have 1 bit, or Planck volume, that flashes on/off the required number of times to indicate the number.

Currently doing to 10M. See you in 42 hours.

You know you can do it by hand in a couple minutes, right? Actually, already having done 1-1M, a few seconds.
9) Message boards : General discussion : Mathematical Properties of Infinity (Message 142519)
Posted 32 days ago by Grebuloner (Project donor)
Oh man, thank you mackeral. You have my brain and math degree in full gear!

Let us assume that, since we have the magic to store great amounts of data in nearly infinitesimally sized spaces, time is of less importance to us (if at all), so storing the information as compactly as possible is far more important than the time required to do the compacting:

A computer would determine the smallest (digitwise) n-gonal sum of prime P (maximum n terms required, so at most 5 pentagonal numbers, 12 dodecagonal numbers, etc.) up to, say, the (P-2)-gonal value. (P-gonal is the number itself, and (P-1)-gonal has the simple sum of its 2nd and 1st terms, (P-1) + 1, which always takes more digits.) Then, whichever n-gon progression has the fewest digits in total, counting the digits of n itself, is written out using n+1 terms. The n-gonal number formulas are easily developed. And even though 10^185 is a big value, think about how many digits there are if you wrote out the numbers between 1 and 1,000,000: magnitudes more storage is required than most people realize.
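For a sense of scale on that last point, the digit count for writing out 1 through N can be done one digit-length block at a time (9 one-digit numbers, 90 two-digit numbers, and so on). A minimal sketch, with a function name chosen just for this example:

```python
def digits_written(n):
    """Total decimal digits needed to write out every integer from 1 to n."""
    total = 0
    length = 1   # digits per number in the current block
    start = 1    # first number with `length` digits
    while start <= n:
        end = min(n, start * 10 - 1)          # last number in this block
        total += (end - start + 1) * length
        start *= 10
        length += 1
    return total


if __name__ == "__main__":
    print(digits_written(1_000_000))    # 5888896
    print(digits_written(10_000_000))   # 68888897
```

So the numbers up to one million already take nearly 5.9 million digits written out long hand.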

So 17 could be written as {3,1,3,4} for triangular (3-gonal) numbers 1,3,4 or {4,1,4} for square (4-gonal) numbers 1,4 (the two 0s being omitted), etc. Ultimately, the greatest storage savings could reasonably be achieved, assuming that the "very few" primes of known compact formats (like Proths) are already determined in their own reduced formats.
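A minimal sketch of how such an encoding could be read back, assuming the list layout {n, index, index, ...} used above, with the first entry giving the polygon's number of sides and zero terms omitted. The standard formula for the k-th s-gonal number is ((s-2)k^2 - (s-4)k)/2; the function names are illustrative only.

```python
def polygonal(sides, index):
    """The index-th polygonal number with the given number of sides."""
    return ((sides - 2) * index * index - (sides - 4) * index) // 2


def decode(encoded):
    """Turn [sides, i1, i2, ...] back into the number it represents."""
    sides, *indices = encoded
    return sum(polygonal(sides, k) for k in indices)


if __name__ == "__main__":
    print(decode([3, 1, 3, 4]))   # triangular: 1 + 6 + 10
    print(decode([4, 1, 4]))      # square: 1 + 16
    print(decode([16, 2, 1]))     # (P-1)-gonal for P = 17: 16 + 1
```

All three lists decode to 17, and the last one shows why the (P-1)-gonal form never saves digits.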

Now, if you'll excuse me, there's a very popular gif meme for people who like doing stuff like this (most of us on PG, I would think) that I need to play at myself before stuffing myself into a locker. :)
10) Message boards : General discussion : Mathematical Properties of Infinity (Message 142516)
Posted 32 days ago by Grebuloner (Project donor)
The secondary problem is that the data storage capacity of the universe (given the likely finite volume) is finite, so even if the most dense storage medium were to be employed (1 data unit per Planck volume), occupying every point in the universe, that's a hard limit of about 10^185 divided by the total number of digits to that point.

If you write it out long hand. I'm sure we can push that up somewhat if we use more condensed notation.

Random thought: is there a way to prove that you can/can't express any single natural number in a fixed limited (finite) amount of data? (If so, what would the minimum amount of data be?) As numbers get bigger, they will tend to require more data to represent their values. But we can use methods to reduce that, e.g. the biggest known prime is often written as 2^82,589,933-1, and not as the 24+ million decimal digits long hand. So I guess the resulting question then is, can we increase the density of representing numbers faster than the expanded number grows, such that it could be contained in a finite amount of storage?

To a degree, to be sure, but if you have a consecutive list of all known primes up to the point that the universe is full, most (almost all?) will be in a format that cannot be represented in a single, extremely reduced form, like the primes we search for can be (which is why we are able to determine their primality so easily). Of course, if you have the magic to store and read that much information that densely without containers, I'm sure you have the magic to set the information carrying particle to enough distinguishable states to contain more info than a single bit (much like MLC/TLC NAND).
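The size of the saving in the quoted 2^82,589,933-1 example can be checked without expanding the number, since 2^p - 1 has floor(p * log10(2)) + 1 decimal digits (2^p is never a power of 10, so subtracting 1 does not change the length). A short sketch, with an illustrative function name:

```python
import math


def mersenne_decimal_digits(p):
    """Number of decimal digits of 2**p - 1, without building the number."""
    return math.floor(p * math.log10(2)) + 1


if __name__ == "__main__":
    long_hand = mersenne_decimal_digits(82_589_933)
    compact = len("2^82,589,933-1")
    print(f"{long_hand:,} digits long hand vs {compact} characters compact")
```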

Your random thought had me running to my college Number Theory book and notes! I think the answer would be "sometimes."

Among many options contained in the Fermat polygonal number theorem, my favorite (because it's the second easiest) is the Lagrange four-square theorem. It states that any natural number (N) can be written as the sum of the squares of four integers (a,b,c,d): N=a^2+b^2+c^2+d^2. Square numbers have around twice the number of digits of their roots, very roughly speaking. So if the 4 numbers chosen for squaring (assuming that it isn't a number with only 1 set) together have fewer digits than the output, total data of the number can be reduced to an {a,b,c,d} representation. Because of the great distance between squares as numbers get larger, I would expect that the difference between number of digits of N and the least-digit set of valid {a,b,c,d} would graph like an expanding sine wave.
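The {a,b,c,d} idea is easy to try on small numbers. Below is a brute-force sketch, only practical for small N, that lists every four-square representation and picks the one with the fewest digits when zeros are omitted. The function names and the "skip zeros" digit count are assumptions of this example.

```python
from math import isqrt


def four_square_reps(n):
    """Yield every (a, b, c, d) with a >= b >= c >= d >= 0 and
    a*a + b*b + c*c + d*d == n."""
    for a in range(isqrt(n), -1, -1):
        rest_a = n - a * a
        for b in range(min(a, isqrt(rest_a)), -1, -1):
            rest_b = rest_a - b * b
            for c in range(min(b, isqrt(rest_b)), -1, -1):
                rest_c = rest_b - c * c
                d = isqrt(rest_c)
                if d <= c and d * d == rest_c:
                    yield (a, b, c, d)


def encoded_digits(rep):
    """Digits needed to store a representation, omitting zero terms."""
    return sum(len(str(x)) for x in rep if x)


def best_four_squares(n):
    """Representation of n with the fewest stored digits."""
    return min(four_square_reps(n), key=encoded_digits)


if __name__ == "__main__":
    for n in (17, 100, 9_999):
        rep = best_four_squares(n)
        print(n, rep, f"{encoded_digits(rep)} digits vs {len(str(n))}")
```

Perfect squares such as 100 come out ahead, while a number like 17 only breaks even, which fits the "sometimes" answer above.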

Now, if latency is not an issue, consider this thought: Fermat's PNT is based on n n-gonal numbers. So, 3 triangular numbers can also be used. If the numbers were represented as their sum of triangular numbers using a notation system where each x_i is the index of the corresponding triangular number, you could have something like 17={1,3,4}, and the retrieval system does a separate calculation (x*(x+1))/2 to turn 1, 3, 4 into 1, 6, 10 and produce the number. Such a system would have more potential for data savings after a certain point, but I would still doubt whether it wins out in the vast voids between very large triangular values.
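Here is a small sketch of the encoding side of that idea: a brute-force search for the index triples whose triangular numbers sum to N. It is only meant for small N, and the function names are invented for the example.

```python
from math import isqrt


def triangular(x):
    """The x-th triangular number, x*(x+1)/2."""
    return x * (x + 1) // 2


def triangular_index(m):
    """Return k with triangular(k) == m, or None if m is not triangular."""
    root = isqrt(8 * m + 1)
    if root * root == 8 * m + 1:
        return (root - 1) // 2
    return None


def triangular_reps(n):
    """All index triples (i, j, k), i >= j >= k >= 0, whose triangular
    numbers sum to n."""
    reps = []
    i = 0
    while triangular(i) <= n:
        for j in range(i + 1):
            rest = n - triangular(i) - triangular(j)
            if rest < 0:
                break
            k = triangular_index(rest)
            if k is not None and k <= j:
                reps.append((i, j, k))
        i += 1
    return reps


if __name__ == "__main__":
    for rep in triangular_reps(17):
        values = [triangular(x) for x in rep]
        print(rep, "->", values, "=", sum(values))
```

For 17 this finds the {1,3,4} encoding from the post (as indices 4, 3, 1) plus one alternative.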

When having these thoughts, myself, I like to do an internal thought experiment of "how would this look with 3 digit numbers, and how would it look with 30000000000000 digit numbers?"

Copyright © 2005 - 2020 Rytis Slatkevičius (contact) and PrimeGrid community.
Generated 18 Sep 2020 | 15:22:59 UTC