Max FFT size for each LLR project?
mackerel Volunteer tester
 Send message
Joined: 2 Oct 08 Posts: 2652 ID: 29980 Credit: 570,442,335 RAC: 10,182
                              
|
I think there was a post in the past that had FFT sizes for each project, but I can't find it again. Anyone know where it was? It is possible it would be out of date also, so is there some way I can find current values short of looking at random units?
Why? Ryzen 3000 (Zen 2) was launched yesterday, and comes with at least 32MB of L3 cache. I expect to get my sample on Tuesday and bench it. Based on what I currently understand of it, it has potential to be the best choice for LLR use in terms of a balance of power consumption, compute performance, and pricing. FFT size will allow a more educated guess into the optimal running configuration, which I can then test with benchmarks. | |
|
Michael Goetz Volunteer moderator Project administrator
 Send message
Joined: 21 Jan 10 Posts: 14036 ID: 53948 Credit: 476,023,430 RAC: 209,677
                               
|
I think there was a post in the past that had FFT sizes for each project, but I can't find it again. Anyone know where it was? It is possible it would be out of date also, so is there some way I can find current values short of looking at random units?
Why? Ryzen 3000 (Zen 2) was launched yesterday, and comes with at least 32MB of L3 cache. I expect to get my sample on Tuesday and bench it. Based on what I currently understand of it, it has potential to be the best choice for LLR use in terms of a balance of power consumption, compute performance, and pricing. FFT size will allow a more educated guess into the optimal running configuration, which I can then test with benchmarks.
Current values:
+-------+----------------------------------------+------+------+
| appid | user_friendly_name                     | min  | max  |
+-------+----------------------------------------+------+------+
|     2 | Sophie Germain (LLR)                   |  128 |  128 |
|     3 | Woodall (LLR)                          | 1440 | 1920 |
|     4 | Cullen (LLR)                           | 1536 | 1920 |
|     7 | 321 (LLR)                              |  800 |  800 |
|     8 | Prime Sierpinski Problem (LLR)         | 1920 | 2048 |
|    10 | PPS (LLR)                              |  192 |  192 |
|    13 | Seventeen or Bust                      | 2560 | 2880 |
|    15 | The Riesel Problem (LLR)               |  720 | 1008 |
|    18 | PPSE (LLR)                             |  120 |  120 |
|    19 | Sierpinski/Riesel Base 5 Problem (LLR) |  560 |  720 |
|    20 | Extended Sierpinski Problem            | 1280 | 1280 |
|    21 | PPS-Mega (LLR)                         |  200 |  256 |
|    30 | Generalized Cullen/Woodall (LLR)       | 1440 | 1792 |
+-------+----------------------------------------+------+------+
The SoB line includes post-DC n=31M candidates.
____________
My lucky number is 75898^524288+1 | |
|
mackerel Volunteer tester
 Send message
Joined: 2 Oct 08 Posts: 2652 ID: 29980 Credit: 570,442,335 RAC: 10,182
                              
|
Thanks Michael, looks like they will comfortably fit in the L3 cache so it could give some interesting performance numbers, all going well. Will write more once I've tested, hopefully tomorrow. | |
|
Jay Send message
Joined: 27 Feb 10 Posts: 136 ID: 56067 Credit: 65,749,514 RAC: 11,920
                    
|
Thanks Michael, looks like they will comfortably fit in the L3 cache so it could give some interesting performance numbers, all going well. Will write more once I've tested, hopefully tomorrow.
I'm looking forward to reading what you learn. I'm very interested in AVX-512 performance on these. I've been waiting for a couple years to upgrade to something with good AVX-512 and a good amount of cache. | |
|
|
Thanks Michael, looks like they will comfortably fit in the L3 cache so it could give some interesting performance numbers, all going well. Will write more once I've tested, hopefully tomorrow.
I have been intrigued myself by how well a 64 MB cache would handle large WUs, now that AMD's AVX2 implementation is at parity with Intel's.
I was reading the Anandtech review on the matter and came across this paragraph (emphasis mine):
What immediately catches the eye when switching between the two results is the new 16MB L3 cache capacity which doubles upon the 8MB of Matisse. We have to remind ourselves that even though the whole chip contains 64MB of L3 cache, this is not a unified cache and a single CPU core will only see its own CCX’s L3 cache before going into main memory, which is in contrast to Intel’s L3 cache where all the cores have access to the full amount.
The 3700X has 2 CCXs of 4 cores each, and the 3900X is supposedly 4 CCXs with 3 active cores each. I think the cache exclusivity (and multithreaded LLR's scaling problem over multiple sockets) will mean that maximum efficiency comes from choosing thread and task counts per CCX, while PSP, SoB, and possibly Cul/Woo will still need to work with main memory. All of this of course relies on Windows 10 1903 getting the process assignments right.
I am very much looking forward to your results! And I think we all are hoping that buying an AMD system no longer means making PG compromises...
____________
Eating more cheese on Thursdays. | |
|
|
I'm very interested in AVX-512 performance on these. I've been waiting for a couple years to upgrade to something with good AVX-512 and a good amount of cache.
AVX-512 is still exclusive to HEDT/Server Intel processors (will supposedly hit mainstream platforms whenever Ice Lake arrives). I can't find anything even rumor-based that it will come to AMD in Zen 3.
____________
Eating more cheese on Thursdays. | |
|
mackerel Volunteer tester
 Send message
Joined: 2 Oct 08 Posts: 2652 ID: 29980 Credit: 570,442,335 RAC: 10,182
                              
|
The more I learn, the less great this CPU series seems.
The bandwidth between the cores and IO is asymmetric, with only half speed writes.
All data leaving a CCX goes through IO die, even if the destination is another CCX on the same die.
That the L3 cache is not unified means running 8 cores on one task might not be so great. Treating it as 2x4 cores is probably better.
Still, my CPU has been dispatched so I hope to bench tomorrow evening. | |
|
Jay Send message
Joined: 27 Feb 10 Posts: 136 ID: 56067 Credit: 65,749,514 RAC: 11,920
                    
|
AVX-512 is still exclusive to HEDT/Server Intel processors (will supposedly hit mainstream platforms whenever Ice Lake arrives). I can't find anything even rumor-based that it will come to AMD in Zen 3.
Dagnabit! I nearly pulled the trigger on a Xeon Phi 7290 a month ago because of AVX-512 and cache. But I decided to wait to see what AMD was bringing. | |
|
mackerel Volunteer tester
 Send message
Joined: 2 Oct 08 Posts: 2652 ID: 29980 Credit: 570,442,335 RAC: 10,182
                              
|
OK, I won't start a new thread for now. Too much data and not enough conclusions. I don't have an image host to put the chart here, but it is at the following link along with a description of the testing.
https://linustechtips.com/main/topic/1080453-ryzen-3600-vs-8086k-for-prime-number-finding/
The question here is probably how fast is Zen 2? Well, the provisional result is they have matched Intel in FPU performance, and with optimisations to how you run tasks, the bigger cache means it can maintain higher performance at bigger FFT sizes.
Note the following is a prediction of throughput based on the Prime95 benchmark data. Do not complain to me if you see differently in reality, as I have also seen discrepancies between theory and reality. That will be for later testing to confirm or otherwise.
I'm comparing my 8086k and the 3600 with a better cooler than it comes with. Clocks did seem to fluctuate with temperature on the 3600.
For small tasks (SGS, PPSE, PPS) running one per core the 8086k takes a small lead, presumably from its higher clock, but the 3600 is close behind.
Mega is in an awkward spot for the 8086k, as it isn't fully efficient one per core, nor running -t3. Maybe -t2 would be better but that was not a tested condition. This size is no problem for the 3600 and it should be faster.
Into the mid sizes (TRP, SR5, 321, ESP) it is better to run the 8086k with 1 task of 6 threads, as all the cache is needed to feed it. The 3600 differs here: two tasks of 3 cores each were optimal. This is probably because the 32MB cache is actually split into two separate 16MB regions, so one task per region gives the best performance. Running a single task with 6 cores was significantly below that.
The bigger tasks are where the differences really show. GCW, Woo, Cul, PSP and SoB are beyond the 8086k cache, and performance drops down towards what the RAM speed affords. I was only using DDR4-3000, which is wholly inadequate to feed it; it would take about double that. The 3600, still at two tasks of 3 cores each, could do all of these except SoB with high performance, up to 50% faster than the 8086k. For leading-edge SoB work the relative performance of the 3600 drops, presumably because it now exceeds the 16MB chunk of L3 cache and is partially hitting RAM again. Still, running a single task on 6 threads here gives significantly greater performance than the 8086k.
To recap:
Small units: run 1 task per core on either CPU
Medium units: run 1 task 6 cores on Intel, 2 tasks 3 cores on 3600.
Large units: 1 task 6 cores on both systems, but 3600 will be much faster.
I didn't run 3 tasks of 2 cores each on either system. It should be viable on 8086k and may fill the small performance hole observed at Mega size units. It probably isn't a good idea on the 3600 as you can't divide two cache chunks into 3 equal units, but the bigger cache allows you to jump from 1 to 3 cores per task anyway.
Similar principles could be applied to 8, 12, 16 core models.
I would add that the CPU ran hotter than I'd expect, given it wasn't taking that much power in the greater scheme of things, and I had a cooler upgraded over stock. It was running ballpark 80°C. There is speculation that although the power might not be high, the die is relatively small, so power density is the problem. I'm not sure there is a good solution for that. Maybe the 8-core-per-die models will have more of a cooling problem.
Note I'm running PPSE on the 3600, 1 per core, and will leave it going overnight. If you look up the system, ignore the first 12 units done, as I mistakenly left in -t2 from previous usage and was also doing things with affinity. It is looking slightly faster than a 6700k I'm using as reference, even though the 6700k is at 4 GHz and the 3600 averaged around 50 MHz under that. | |
|
Michael Goetz Volunteer moderator Project administrator
 Send message
Joined: 21 Jan 10 Posts: 14036 ID: 53948 Credit: 476,023,430 RAC: 209,677
                               
|
Would it be fair to say...
1) It looks like AMD finally built a CPU with full AVX performance.
2) The huge cache is a big advantage for those FFTs where the FFT fits in the Ryzen cache but not in the Intel cache.
3) Ryzen is now competitive with Intel unless you go with a (dual unit) AVX-512 CPU.
Is that correct?
____________
My lucky number is 75898^524288+1 | |
|
GregC Send message
Joined: 12 Nov 18 Posts: 57 ID: 1077873 Credit: 2,275,773,739 RAC: 6,436,926
                   
|
Wow mackerel, thank you so much for that. Very encouraging as my 3600X is arriving Thursday. | |
|
|
Wow mackerel, thank you so much for that.
Yes! Thanks!
____________
Reno, NV
| |
|
mackerel Volunteer tester
 Send message
Joined: 2 Oct 08 Posts: 2652 ID: 29980 Credit: 570,442,335 RAC: 10,182
                              
|
To Michael's questions:
1, the AVX performance is now comparable to Intel consumer CPUs. I still need further testing to get a more precise figure on that.
2, yes, the cache helps it work faster in more situations. I've seen similar with the desktop Broadwell CPUs, which had 128 MB L4 cache. Basically it didn't matter what ram you had attached to it. I was running it single channel for a time when I had a shortage and performance remained high.
3, same as #1, but a two-unit AVX-512 CPU has more potential, as well as more heat resulting from that. Even at stock I don't consider my 7800X to be safe, as it hits 100°C with a high-end air cooler.
A later test will be performance per watt, and I suspect the 3600 will be ahead in that too. | |
|
|
What are the units of the FFT sizes in the table?
Do those sizes represent all the data needed for the FFT plus the additional software prefetched data?
I think there was a post in the past that had FFT sizes for each project, but I can't find it again. Anyone know where it was? It is possible it would be out of date also, so is there some way I can find current values short of looking at random units?
Why? Ryzen 3000 (Zen 2) was launched yesterday, and comes with at least 32MB of L3 cache. I expect to get my sample on Tuesday and bench it. Based on what I currently understand of it, it has potential to be the best choice for LLR use in terms of a balance of power consumption, compute performance, and pricing. FFT size will allow a more educated guess into the optimal running configuration, which I can then test with benchmarks.
Current values:
+-------+----------------------------------------+------+------+
| appid | user_friendly_name                     | min  | max  |
+-------+----------------------------------------+------+------+
|     2 | Sophie Germain (LLR)                   |  128 |  128 |
|     3 | Woodall (LLR)                          | 1440 | 1920 |
|     4 | Cullen (LLR)                           | 1536 | 1920 |
|     7 | 321 (LLR)                              |  800 |  800 |
|     8 | Prime Sierpinski Problem (LLR)         | 1920 | 2048 |
|    10 | PPS (LLR)                              |  192 |  192 |
|    13 | Seventeen or Bust                      | 2560 | 2880 |
|    15 | The Riesel Problem (LLR)               |  720 | 1008 |
|    18 | PPSE (LLR)                             |  120 |  120 |
|    19 | Sierpinski/Riesel Base 5 Problem (LLR) |  560 |  720 |
|    20 | Extended Sierpinski Problem            | 1280 | 1280 |
|    21 | PPS-Mega (LLR)                         |  200 |  256 |
|    30 | Generalized Cullen/Woodall (LLR)       | 1440 | 1792 |
+-------+----------------------------------------+------+------+
The SoB line includes post-DC n=31M candidates.
| |
|
Michael Goetz Volunteer moderator Project administrator
 Send message
Joined: 21 Jan 10 Posts: 14036 ID: 53948 Credit: 476,023,430 RAC: 209,677
                               
|
What are the units of the FFT sizes in the table?
Do those sizes represent all the data needed for the FFT plus the additional software prefetched data?
Those are the FFT sizes (in K, as in 2K == 2048) that LLR chooses for its calculation. That's the number of elements in each array.
Each element is a complex (i.e., real + imaginary) number consisting of 2 64-bit double precision floating point numbers. That's 8 bytes for each number. I believe you need space to store 3 copies of that so you can do a multiply operation such as C = A * B, so the total memory usage should be 24 times the FFT size.
So, for example, for the largest SoB FFT of 2880K, the memory usage for the FFT calculations is 24 * 2880 * 1024 or 70778880. That's 67.5 MB.
For the smallest PPSE FFT of 120K, that's 24 * 120 * 1024 or 2.8 MB.
There's other memory used as well, but it's insignificant. What's important is that the FFT storage fits in cache, since cache is a LOT faster than main memory. If the entire FFT fits in cache the task can run a lot faster.
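As a rough illustration of that arithmetic, a minimal Python sketch (the 24-bytes-per-element figure is the estimate from this post; the ~8 bytes per element that comes up later in the thread is the practical rule of thumb):

# Approximate FFT working set, assuming 24 bytes per element
# (3 operands of 8-byte doubles, per the estimate above).
BYTES_PER_ELEMENT = 24

def fft_footprint_mib(fft_size_k):
    return fft_size_k * 1024 * BYTES_PER_ELEMENT / (1024 * 1024)

for name, fft_k in [("SoB (largest)", 2880), ("PPSE (smallest)", 120)]:
    print(name, fft_k, "K ->", round(fft_footprint_mib(fft_k), 1), "MiB")
# SoB (largest) 2880 K -> 67.5 MiB
# PPSE (smallest) 120 K -> 2.8 MiB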
____________
My lucky number is 75898^524288+1 | |
|
|
What are the units of the FFT sizes in the table?
Do those sizes represent all the data needed for the FFT plus the additional software prefetched data?
Those are the FFT sizes (in K, as in 2K == 2048) that LLR chooses for its calculation. That's the number of elements in each array.
Each element is a complex (i.e., real + imaginary) number consisting of 2 64-bit double precision floating point numbers. That's 8 bytes for each number. I believe you need space to store 3 copies of that so you can do a multiply operation such as C = A * B, so the total memory usage should be 24 times the FFT size.
So, for example, for the largest SoB FFT of 2880K, the memory usage for the FFT calculations is 24 * 2880 * 1024 or 70778880. That's 67.5 MB.
For the smallest PPSE FFT of 120K, that's 24 * 120 * 1024 or 2.8 MB.
There's other memory used as well, but it's insignificant. What's important is that the FFT storage fits in cache, since cache is a LOT faster than main memory. If the entire FFT fits in cache the task can run a lot faster.
So the software prefetching is only bringing the 3 copies for the "C = A * B" into the caches?
I thought that the code might also be prefetching the next 3 copies in to prepare for the next iteration.
thanks
| |
|
Vato Volunteer tester
 Send message
Joined: 2 Feb 08 Posts: 861 ID: 18447 Credit: 869,814,206 RAC: 1,183,251
                           
|
the gwnum library that LLR uses will modify in place where necessary.
so it's a relatively static and localised data structure (i think).
especially if it lives fully in your cache!
____________
| |
|
mackerel Volunteer tester
 Send message
Joined: 2 Oct 08 Posts: 2652 ID: 29980 Credit: 570,442,335 RAC: 10,182
                              
|
Each element is a complex (i.e., real + imaginary) number consisting of 2 64-bit double precision floating point numbers. That's 8 bytes for each number. I believe you need space to store 3 copies of that so you can do a multiply operation such as C = A * B, so the total memory usage should be 24 times the FFT size.
I found 8x to work well for estimating performance. On the complex part, I had long wondered what that does as it is a benchmark option in P95, but I never used it. I took a data set recently but haven't plotted it against non-complex results yet. I think I'll do that now...
Edit: The complex benchmark run looks almost the same as the non-complex one. Values are slightly lower, but the falls occur at the same FFT sizes so this doesn't change any conclusions. | |
|
|
Hi,
There seems to be a bit of disagreement in regards to calculating the approximate L3 footprint of the various PG WUs. The difference between multiplying by 8 or 24 is pretty large! Is there a reason why 8x seems to work in practical terms when 24x appears to be the theoretical answer? | |
|
mackerel Volunteer tester
 Send message
Joined: 2 Oct 08 Posts: 2652 ID: 29980 Credit: 570,442,335 RAC: 10,182
                              
|
I've only ever seen and used 8x up to this thread and I don't see any evidence to use otherwise. | |
|
|
Hi,
There seems to be a bit of disagreement in regards to calculating the approximate L3 footprint of the various PG WUs. The difference between multiplying by 8 or 24 is pretty large! Is there a reason why 8x seems to work in practical terms when 24x appears to be the theoretical answer?
There is a lot of empirical data showing that running the CPU at 50% (no hyper-threading) yields the "best" or "near best" performance. I am not sure how long this has been the assumption, but I am also not sure how often the bottleneck is actually the cache size.
On the i9-9980XE I am running, it looks more like the memory fill buffer limit is the problem. Fill-buffer-unavailable event counts spike at 50% CPU load and the number of cache line evictions stays flat. That implies there is room in the cache (no evictions), but memory read/write traffic is too high.
The information available about when a software prefetch consumes a fill buffer and/or generates a CPU stall is confusing to me, and I think it changes between CPUs. I think Intel was even talking about changing the CPU behavior on Skylake so that a software prefetch is converted into a NOP if no fill buffers are available.
It looks more like gwnum is over-tuned for a single WU, which causes multiple WUs to choke the bus ... stalling the other WUs.
I wish it were easier to build the Linux64 gwnum.a from source, then I could do some testing/analysis.
| |
|
mackerel Volunteer tester
 Send message
Joined: 2 Oct 08 Posts: 2652 ID: 29980 Credit: 570,442,335 RAC: 10,182
                              
|
The basic operating state for gwnum is for a single core on a single task. Running multiple tasks, and/or multi-threaded tasks takes a little more consideration.
I think George had said in the past the multi-thread code isn't the best, and breaks up the work into smaller bits before re-assembling them. So the practical result of that is, smaller tasks don't scale well running multiple threads.
With normalised data from Prime95 benchmark it is possible to see how different scenarios behave. In general, if the total footprint of all running tasks is less than the L3 size, you generally get good performance. This could be simplified as FFT size * 8 * number of tasks running < L3 cache size. If you exceed L3 cache size, then ram bandwidth enters into the equation. Ram bandwidth shortage is the biggest limitation for bigger tasks. Dual channel is wholly inadequate for >4 core fast cores (most Intel Core CPUs, Zen 2 Ryzen). Running a single task multi-thread helps in this scenario.
HT is a complicated matter. I've only occasionally seen hints that, in some limited scenarios, it can give an uplift in performance compared to not having/using it. In general it doesn't seem to give any significant boost but still increases power consumption. There are also scenarios where it can be used to lessen losses e.g. Windows scheduler has sucked in past, don't know if it changed since then. Running multiple single thread tasks one per core without affinity could lower throughput by ~10%. Affinity resolves that, turning off HT resolves that, or running more threads than cores resolves that (at higher power usage). | |
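As a quick way to apply the FFT size x 8 bytes x tasks rule of thumb from the post above, a small Python sketch; the cache figures (12 MiB L3 on an 8086k, 16 MiB per CCX on a Ryzen 3600) are the published sizes, and the example FFT/task combinations are only illustrations:

# Rule of thumb: total footprint ~= FFT size (K elements) * 1024 * 8 bytes
# * number of concurrent tasks, compared against the L3 (or per-CCX) cache.
def footprint_mib(fft_size_k, num_tasks):
    return fft_size_k * 1024 * 8 * num_tasks / (1024 * 1024)

def fits(fft_size_k, num_tasks, cache_mib):
    return footprint_mib(fft_size_k, num_tasks) <= cache_mib

print(footprint_mib(256, 6), fits(256, 6, 12))    # 12.0 True  - Mega one-per-core on an 8086k, right at the edge
print(footprint_mib(1440, 2), fits(1440, 2, 12))  # 22.5 False - two ESP-size tasks exceed the 8086k cache
print(footprint_mib(1440, 1), fits(1440, 1, 16))  # 11.25 True - one ESP-size task per 16 MiB Ryzen CCX fits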
|
|
The basic operating state for gwnum is for a single core on a single task. Running multiple tasks, and/or multi-threaded tasks takes a little more consideration.
I think George had said in the past the multi-thread code isn't the best, and breaks up the work into smaller bits before re-assembling them. So the practical result of that is, smaller tasks don't scale well running multiple threads.
With normalised data from Prime95 benchmark it is possible to see how different scenarios behave. In general, if the total footprint of all running tasks is less than the L3 size, you generally get good performance. This could be simplified as FFT size * 8 * number of tasks running < L3 cache size. If you exceed L3 cache size, then ram bandwidth enters into the equation. Ram bandwidth shortage is the biggest limitation for bigger tasks. Dual channel is wholly inadequate for >4 core fast cores (most Intel Core CPUs, Zen 2 Ryzen). Running a single task multi-thread helps in this scenario.
HT is a complicated matter. I've only occasionally seen hints that, in some limited scenarios, it can give an uplift in performance compared to not having/using it. In general it doesn't seem to give any significant boost but still increases power consumption. There are also scenarios where it can be used to lessen losses e.g. Windows scheduler has sucked in past, don't know if it changed since then. Running multiple single thread tasks one per core without affinity could lower throughput by ~10%. Affinity resolves that, turning off HT resolves that, or running more threads than cores resolves that (at higher power usage).
I think we are all 100% in agreement.
I understand the task that George tackled and solved. I understand completely what he was saying about limitations and why he was saying it. I also understand that running an empirical test is the easiest way to get a "reasonable best" performance.
I am just making one additional point that I think prefetching is aggravating the "ram bandwidth" problem ... not helping for threads or multiple WU. I think the "ram bandwidth" problem is triggered and aggravated earlier than "L3 cache full" by gwnum prefetching of data into L1 that is already in the L2 and L3 caches and not in DRAM. I am getting a huge spike in FILL BUFFER NOT AVAILABLE events and very few last level cache (LLC) evictions.
If the data is already in the cache hierarchy, the L1 prefetch instruction consumes a line fill buffer which is the read path from DRAM. All T0 and T1 prefetches would cause this problem. These instructions will insert dead spots in the data pipe from memory and starve the caches every time they prefetch cached data. Other threads or WU really needing data from DRAM will be stalled while the unneeded prefetches are completed.
I have been unable to use Linux performance tools to look at the run-time performance of sllr64 because of the way it is built. All the Linux tools print garbage for the sllr64 code and give up on run-time disassembly.
I could do more analysis if I could build sllr64 (especially on Linux), but I haven't had much success.
| |
|
mackerel Volunteer tester
 Send message
Joined: 2 Oct 08 Posts: 2652 ID: 29980 Credit: 570,442,335 RAC: 10,182
                              
|
Been thinking about it a bit. If you can't test with llr, is it possible to try with prime95? Might be worth a post on mersenneforum. | |
|
Crun-chi Volunteer tester
 Send message
Joined: 25 Nov 09 Posts: 3245 ID: 50683 Credit: 152,646,050 RAC: 18,212
                         
|
To refresh this with new data: I bought a 3900X and am comparing it with an i5-9500K (six cores, without HT).
Ryzen
Prime95 64-bit version 29.8, RdtscTiming=1
Timings for 480K all-complex FFT length (12 cores, 4 workers): 0.57, 0.58, 0.58, 0.58 ms. Throughput: 6926.95 iter/sec.
Timings for 480K all-complex FFT length (12 cores, 6 workers): 1.62, 0.85, 1.62, 0.85, 0.57, 0.57 ms. Throughput: 7067.52 iter/sec.
Timings for 480K all-complex FFT length (12 cores, 12 workers): 1.64, 1.63, 1.62, 1.62, 1.62, 1.62, 1.63, 1.64, 1.62, 1.62, 1.62, 1.63 ms. Throughput: 7375.00 iter/sec.
Timings for 480K all-complex FFT length (12 cores hyperthreaded, 4 workers): 0.53, 0.53, 0.53, 0.53 ms. Throughput: 7544.68 iter/sec.
Timings for 480K all-complex FFT length (12 cores hyperthreaded, 6 workers): 1.52, 0.78, 1.52, 0.78, 0.53, 0.53 ms. Throughput: 7651.88 iter/sec.
Timings for 480K all-complex FFT length (12 cores hyperthreaded, 12 workers): 1.52, 1.52, 1.52, 1.53, 1.52, 1.53, 1.52, 1.52, 1.53, 1.53, 1.53, 1.53 ms. Throughput: 7869.45 iter/sec.
Intel
Prime95 64-bit version 29.8, RdtscTiming=1
Timings for 480K FFT length (6 cores, 1 worker): 0.36 ms. Throughput: 2785.83 iter/sec.
Timings for 480K FFT length (6 cores, 2 workers): 0.67, 0.66 ms. Throughput: 3002.62 iter/sec.
Timings for 480K FFT length (6 cores, 3 workers): 1.48, 1.47, 1.43 ms. Throughput: 2055.82 iter/sec.
Timings for 480K FFT length (6 cores, 6 workers): 4.13, 3.84, 3.88, 3.95, 3.91, 3.79 ms. Throughput: 1532.76 iter/sec.
Test case: 480K.
First, look at where you get the most output: Ryzen with 12 workers, Intel with 2 workers.
The Ryzen average is 1.53 ms per iteration per worker; the Intel average is 0.67 ms.
Let's say a 480K candidate takes 5,000,000 iterations.
Ryzen: 5,000,000 * 1.53 ms = 7,650,000 ms = 7,650 seconds per result per worker; with 12 workers in parallel that is one result every 637.5 seconds.
A day has 86,400 seconds, so the Ryzen will make about 135 results per day.
Intel: 5,000,000 * 0.67 ms = 3,350 seconds per result per worker; with 2 workers that is one result every 1,675 seconds, or about 51 results per day.
At 768K: Intel 0.55 ms, 1 worker on 6 cores.
Ryzen 768K all-complex FFT, 12 cores, 4 workers. Average times: 0.84, 0.84, 0.84, 0.84 ms. Total throughput: 4745.09 iter/sec.
Intel: 7,500,000 * 0.55 ms = 4,125 seconds per result; 86,400 / 4,125 = about 20.9 results per day (2.7M digits each).
Ryzen: 7,500,000 * 0.84 ms = 6,300 seconds per result per worker; divided by 4 workers that is 1,575 seconds, so 86,400 / 1,575 = about 54.9 per day (2.7M digits each).
Both CPUs ran at the same clock and the RAM is at the same speed. Since the Intel part has only 6 cores you would need 2 CPUs, 2 motherboards and 2 power supplies, and you would still have about 1/3 lower output.
AVX-512 would cover that gap and add an additional boost, but those Intel CPUs cost more than the AMD.
So for this money I think the AMD Ryzen 3900X is a very nice CPU.
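For anyone who wants to redo that arithmetic with their own benchmark numbers, a small Python sketch of the calculation above; the iteration counts are the same rough assumptions used in this post, not exact figures for real candidates:

# Results per day from a Prime95 benchmark line: ms_per_iter is the
# per-worker timing, and the workers run in parallel on one CPU.
def results_per_day(ms_per_iter, workers, iterations):
    seconds_per_result = ms_per_iter * iterations / 1000   # one worker
    effective_seconds = seconds_per_result / workers        # whole CPU
    return 86400 / effective_seconds

print(results_per_day(1.53, 12, 5_000_000))   # Ryzen, 480K: ~135.5 per day
print(results_per_day(0.67, 2, 5_000_000))    # Intel, 480K: ~51.6 per day
print(results_per_day(0.84, 4, 7_500_000))    # Ryzen, 768K: ~54.9 per day
print(results_per_day(0.55, 1, 7_500_000))    # Intel, 768K: ~20.9 per day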
____________
92*10^1585996-1 NEAR-REPDIGIT PRIME :) :) :)
4 * 650^498101-1 CRUS PRIME
2022202116^131072+1 GENERALIZED FERMAT
Proud member of team Aggie The Pew. Go Aggie! | |
|
|
Prime95 64-bit version 29.8, RdtscTiming=1
Timings for 480K all-complex FFT length (12 cores, 4 workers): 0.57, 0.58, 0.58, 0.58 ms. Throughput: 6926.95 iter/sec.
Timings for 480K all-complex FFT length (12 cores hyperthreaded, 12 workers): 1.52, 1.52, 1.52, 1.53, 1.52, 1.53, 1.52, 1.52, 1.53, 1.53, 1.53, 1.53 ms. Throughput: 7869.45 iter/sec.
for the 4 thread test were you running 3x4threads? For smaller tasks (ie not SoB) I find 1x8 is significantly slower than 2x4 on my 3700x. Your results show the same if you were only running 1x4. | |
|
Crun-chi Volunteer tester
 Send message
Joined: 25 Nov 09 Posts: 3245 ID: 50683 Credit: 152,646,050 RAC: 18,212
                         
|
Prime95 64-bit version 29.8, RdtscTiming=1
Timings for 480K all-complex FFT length (12 cores, 4 workers): 0.57, 0.58, 0.58, 0.58 ms. Throughput: 6926.95 iter/sec.
Timings for 480K all-complex FFT length (12 cores hyperthreaded, 12 workers): 1.52, 1.52, 1.52, 1.53, 1.52, 1.53, 1.52, 1.52, 1.53, 1.53, 1.53, 1.53 ms. Throughput: 7869.45 iter/sec.
for the 4 thread test were you running 3x4threads? For smaller tasks (ie not SoB) I find 1x8 is significantly slower than 2x4 on my 3700x. Your results show the same if you were only running 1x4.
I don't understand what you're asking.
3 cores per worker, 4 workers = 12 cores.
So every worker has 3 cores, and the 4 workers run in parallel at the same time.
And for 480K, 12 workers with 2 threads each (1 real core + 1 SMT thread) gave me the fastest output.
____________
92*10^1585996-1 NEAR-REPDIGIT PRIME :) :) :)
4 * 650^498101-1 CRUS PRIME
2022202116^131072+1 GENERALIZED FERMAT
Proud member of team Aggie The Pew. Go Aggie! | |
|
mackerel Volunteer tester
 Send message
Joined: 2 Oct 08 Posts: 2652 ID: 29980 Credit: 570,442,335 RAC: 10,182
                              
|
Interesting results Crun-chi. At 480K, the sensible combinations of cores/workers fit in the L3 cache, so as you see, higher throughput efficiency comes from using fewer cores per worker. I find it particularly interesting that the with-SMT results are a good percentage higher still. I don't think I ever tested that on Zen 2 systems. Will have to see if I can repeat it on my single-CCD models.
I do have a 7920X (12 cores, with AVX-512) so I could try to run a set on that for comparison. The system is currently set up for gaming and is behaving a bit oddly under P95, so I'll need to check the settings on it. As a fixed clock was mentioned, what was the clock for the test? | |
|
Crun-chi Volunteer tester
 Send message
Joined: 25 Nov 09 Posts: 3245 ID: 50683 Credit: 152,646,050 RAC: 18,212
                         
|
I think the reason is: 480K is too small a task for such a big cache, so 12 x 1 is still the better option.
____________
92*10^1585996-1 NEAR-REPDIGIT PRIME :) :) :)
4 * 650^498101-1 CRUS PRIME
2022202116^131072+1 GENERALIZED FERMAT
Proud member of team Aggie The Pew. Go Aggie! | |
|
|
Prime95 64-bit version 29.8, RdtscTiming=1
Timings for 480K all-complex FFT length (12 cores, 4 workers): 0.57, 0.58, 0.58, 0.58 ms. Throughput: 6926.95 iter/sec.
Timings for 480K all-complex FFT length (12 cores hyperthreaded, 12 workers): 1.52, 1.52, 1.52, 1.53, 1.52, 1.53, 1.52, 1.52, 1.53, 1.53, 1.53, 1.53 ms. Throughput: 7869.45 iter/sec.
for the 4 thread test were you running 3x4threads? For smaller tasks (ie not SoB) I find 1x8 is significantly slower than 2x4 on my 3700x. Your results show the same if you were only running 1x4.
I don't understand what you're asking.
3 cores per worker, 4 workers = 12 cores.
So every worker has 3 cores, and the 4 workers run in parallel at the same time.
And for 480K, 12 workers with 2 threads each (1 real core + 1 SMT thread) gave me the fastest output.
right, that answers my question.
so you're saying multithreading was a waste of time for you and 12 workers/cores had the highest throughput?
It's interesting as I've found for any of the "multithreading recommended" tasks that 4 cores each is fastest for the smaller ones and using all 8 cores for SoB is best.
| |
|
mackerel Volunteer tester
 Send message
Joined: 2 Oct 08 Posts: 2652 ID: 29980 Credit: 570,442,335 RAC: 10,182
                              
|
It's interesting as I've found for any of the "multithreading recommended" tasks that 4 cores each is fastest for the smaller ones and using all 8 cores for SoB is best.
It's complicated, and it also depends on what your optimisation is towards.
For starters, each task takes a certain amount of data. If this data fits into the CPU caches, you have performance practically not limited by ram bandwidth. The cores can do their best. For smaller tasks, this is easy to achieve. For bigger tasks, especially if running multiple of them, you can exceed the CPU's cache and then your ram speed comes into play. By using multi-threading, you can run fewer tasks, and thus the total tasks use less total cache.
But we have another effect acting in the opposite way. Running multi-threaded is not ideal either. If you have 2 cores working together on 1 task, it does not run at twice the throughput of 2 cores each working on a single task itself (if the limits in previous paragraph are not in play). The efficiency drops as you add more cores to a task. If throughput is the optimisation, you want to run with as few cores per task as you can.
The above two effects are working against each other. As such, I have a rough guideline that for best throughput, you want to pick the tasks/cores combination that fills up the CPU cache as best possible, without exceeding it. If you go either side of that, you may lose throughput.
Note I'm using throughput as the measure here. There is another consideration for prime finding tasks, which is that the first person to report gets the credit for discovering the prime. This means, people may choose to run more cores than is optimal for throughput, but the lower time increases the chance of being first to report. It is a decision each user will have to make.
There's one more factor specifically for Ryzen, at least the models that exist up to today. Although they appear to be one CPU, they are best seen as multiple individual CCX. Each CCX has a partition of cache associated with it. They can't access another CCX's cache except indirectly by going back to ram, which could be horribly bandwidth limited. Generally speaking if you can keep work within a CCX, that is best. If not, then using the whole CPU can still help. | |
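Putting those two opposing effects together, an illustrative Python sketch of the guideline above: among the worker/thread splits whose total footprint (FFT size x 8 bytes x workers) fits the cache, prefer the one with the fewest threads per worker, applied per CCX on Ryzen. The 3-core, 16 MiB CCX in the example is an assumption matching a Ryzen 3600, so adjust it for your own CPU:

# Pick the split with the fewest threads per worker that still keeps the
# total footprint within the cache; fall back to one worker on all cores.
def best_split(fft_size_k, cores, cache_mib):
    candidates = []
    for threads in range(1, cores + 1):
        if cores % threads:
            continue
        workers = cores // threads
        footprint = fft_size_k * 1024 * 8 * workers / (1024 * 1024)
        if footprint <= cache_mib:
            candidates.append((threads, workers))
    return min(candidates, default=(cores, 1))

print(best_split(480, 3, 16))    # (1, 3): three single-threaded workers fit in one CCX
print(best_split(1440, 3, 16))   # (3, 1): only a single 3-thread worker fits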
|
|
It's interesting as I've found for any of the "multithreading recommended" tasks that 4 cores each is fastest for the smaller ones and using all 8 cores for SoB is best.
It's complicated, and it also depends on what your optimisation is towards.
true, in my case I'm talking about throughput.
The 4 projects I've checked so far in throughput terms are:
SoB: 8 threads > 2 x 4 threads.
ESP: 8 threads = 2 x 4 threads
GCW: 8 threads < 2 x 4 threads.
SR5: 8 threads < (4 x 2 threads = 2 x 4 threads). | |
|
Crun-chi Volunteer tester
 Send message
Joined: 25 Nov 09 Posts: 3245 ID: 50683 Credit: 152,646,050 RAC: 18,212
                         
|
right, that answers my question.
so you're saying multithreading was a waste of time for you and 12 workers/cores had the highest throughput?
It's interesting as I've found for any of the "multithreading recommended" tasks that 4 cores each is fastest for the smaller ones and using all 8 cores for SoB is best.
I never said that. I said that MT gives me some small improvement, up to some limit.
____________
92*10^1585996-1 NEAR-REPDIGIT PRIME :) :) :)
4 * 650^498101-1 CRUS PRIME
2022202116^131072+1 GENERALIZED FERMAT
Proud member of team Aggie The Pew. Go Aggie! | |
|
Crun-chi Volunteer tester
 Send message
Joined: 25 Nov 09 Posts: 3245 ID: 50683 Credit: 152,646,050 RAC: 18,212
                         
|
This is a Prime95 benchmark up to 960K.
I will never go that high, but I selected it so the boundary where the Ryzen does better with 4 workers of 3 cores becomes visible.
https://www.dropbox.com/s/l1jdjjmpx3q6rch/results.bench.txt?dl=0
Notice the configuration at every FFT size with 6 workers of 2 cores: the ratio between the workers is nearly the same across the whole range, e.g.
3.15, 1.59, 3.15, 1.60, 1.07, 1.07 ms
Workers 1 and 3 have the same value,
workers 2 and 4 have the same value,
workers 5 and 6 have the same value.
Maybe it is a bug in hwloc, maybe it is how Ryzen is designed: I don't know.
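For anyone comparing configurations from a results.bench.txt like this one, a small Python sketch that picks the best throughput per FFT size; the line format in the regular expression is taken from the benchmark excerpts quoted in this thread, so adjust it if your file differs:

import re

# Matches lines like:
# Timings for 480K all-complex FFT length (12 cores, 4 workers): ... Throughput: 6926.95 iter/sec.
PATTERN = re.compile(r"Timings for (\d+K).*?\((.*?)\):.*?Throughput: ([\d.]+) iter/sec")

def best_configs(path):
    best = {}
    with open(path) as f:
        for line in f:
            m = PATTERN.search(line)
            if not m:
                continue
            fft, config, throughput = m.group(1), m.group(2), float(m.group(3))
            if throughput > best.get(fft, (0.0, ""))[0]:
                best[fft] = (throughput, config)
    return best

for fft, (throughput, config) in best_configs("results.bench.txt").items():
    print(fft, round(throughput, 2), "iter/sec with", config)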
____________
92*10^1585996-1 NEAR-REPDIGIT PRIME :) :) :)
4 * 650^498101-1 CRUS PRIME
2022202116^131072+1 GENERALIZED FERMAT
Proud member of team Aggie The Pew. Go Aggie! | |
|
|
Having just acquired a 3950x I'm a little puzzled by the performance.
If I've understood everything above then an ESP task @ 1440k will fit into the cache so on my 3700x I've been running 2 of these with 4 threads and they take about 24K seconds to complete.
On the 3950x they're taking 39K seconds each when running 4 even though the 3950x is running at 4.2GHz vs 4GHz of the 3700x.
Are the tasks actually hitting main RAM and running into bandwidth limitations?
| |
|
Crun-chi Volunteer tester
 Send message
Joined: 25 Nov 09 Posts: 3245 ID: 50683 Credit: 152,646,050 RAC: 18,212
                         
|
Run Prime95 benchmark, set 1440K and see what will produce best output ( fastest)
That is best answer you can get because then you will see what is best on your computer
____________
92*10^1585996-1 NEAR-REPDIGIT PRIME :) :) :)
4 * 650^498101-1 CRUS PRIME
2022202116^131072+1 GENERALIZED FERMAT
Proud member of team Aggie The Pew. Go Aggie! | |
|
|
Run Prime95 benchmark, set 1440K and see what will produce best output ( fastest)
That is best answer you can get because then you will see what is best on your computer
I did that. It suggests to run 4 tasks with 4 threads as expected.
My question is why those tasks are taking 39,000 seconds on the 3950x vs 24,000 seconds running 2 tasks on the 3700x, which in theory is just half a 3950x. | |
|
|
Run Prime95 benchmark, set 1440K and see what will produce best output ( fastest)
That is best answer you can get because then you will see what is best on your computer
I did that. It suggests to run 4 tasks with 4 threads as expected.
My question is why those tasks are taking 39000seconds on the 3950x vs 24000 seconds running 2 tasks on the 3700x which in theory is just half a 3950x.
Could it be that 4 tasks of 1440k FFT will overload the cache?
____________
My lucky number is 6219*2^3374198+1
| |
|
|
Run Prime95 benchmark, set 1440K and see what will produce best output ( fastest)
That is best answer you can get because then you will see what is best on your computer
I did that. It suggests to run 4 tasks with 4 threads as expected.
My question is why those tasks are taking 39000seconds on the 3950x vs 24000 seconds running 2 tasks on the 3700x which in theory is just half a 3950x.
Could it be that 4 tasks of 1440k FFT will overload the cache?
the 3950x has double the cache of the 3700x. I'll try running 4xSR5 tasks next as that should rule out the cache being overloaded. | |
|
|
Having just acquired a 3950x I'm a little puzzled by the performance.
If I've understood everything above then an ESP task @ 1440k will fit into the cache so on my 3700x I've been running 2 of these with 4 threads and they take about 24K seconds to complete.
On the 3950x they're taking 39K seconds each when running 4 even though the 3950x is running at 4.2GHz vs 4GHz of the 3700x.
Are the tasks actually hitting main RAM and running into bandwidth limitations?
Your problem is likely Windows. Tasks run fastest when they stay within a chiplet on Zen 2. That way they don't have to transfer data over the bus within the CPU. For some reason, Windows is horrible at keeping a multithreaded task on a single chiplet with Zen 2. Linux is much better, with results being much quicker.
See the thread here:
http://www.primegrid.com/forum_thread.php?id=9063&nowrap=true#139003
http://www.primegrid.com/forum_thread.php?id=9063&nowrap=true#139603
____________
Reno, NV
| |
|
|
Having just acquired a 3950x I'm a little puzzled by the performance.
If I've understood everything above then an ESP task @ 1440k will fit into the cache so on my 3700x I've been running 2 of these with 4 threads and they take about 24K seconds to complete.
On the 3950x they're taking 39K seconds each when running 4 even though the 3950x is running at 4.2GHz vs 4GHz of the 3700x.
Are the tasks actually hitting main RAM and running into bandwidth limitations?
Your problem is likely windows. Tasks run fastest when they stay on within a chiplet on Zen2. That way they don't have to transfer data over the bus within the CPU. For some reason, Windows is horrible at keeping a multithreaded task on a single chiplet with Zen2. Linux is much better, with results being much quicker.
See the thread here:
http://www.primegrid.com/forum_thread.php?id=9063&nowrap=true#139003
http://www.primegrid.com/forum_thread.php?id=9063&nowrap=true#139603
Thanks, that's interesting, although I'd expect to see the same happening with the 3700x. I guess it could be that the effects are more pronounced the more cores you have.
Looks like I'll need to give Mint a go and see what the results are.
| |
|
Crun-chi Volunteer tester
 Send message
Joined: 25 Nov 09 Posts: 3245 ID: 50683 Credit: 152,646,050 RAC: 18,212
                         
|
Having just acquired a 3950x I'm a little puzzled by the performance.
If I've understood everything above then an ESP task @ 1440k will fit into the cache so on my 3700x I've been running 2 of these with 4 threads and they take about 24K seconds to complete.
On the 3950x they're taking 39K seconds each when running 4 even though the 3950x is running at 4.2GHz vs 4GHz of the 3700x.
Are the tasks actually hitting main RAM and running into bandwidth limitations?
Your problem is likely windows. Tasks run fastest when they stay on within a chiplet on Zen2. That way they don't have to transfer data over the bus within the CPU. For some reason, Windows is horrible at keeping a multithreaded task on a single chiplet with Zen2. Linux is much better, with results being much quicker.
See the thread here:
http://www.primegrid.com/forum_thread.php?id=9063&nowrap=true#139003
http://www.primegrid.com/forum_thread.php?id=9063&nowrap=true#139603
His problem is maybe Windows. Why do I say that: because we don't know whether the 3700x runs under Linux or Windows, and if it runs under Windows, why doesn't it show the same effect as the 3950x?
____________
92*10^1585996-1 NEAR-REPDIGIT PRIME :) :) :)
4 * 650^498101-1 CRUS PRIME
2022202116^131072+1 GENERALIZED FERMAT
Proud member of team Aggie The Pew. Go Aggie! | |
|
|
I just created a mint usb and ran some 4xESP and 4xSR5 tasks.
After 5 minutes running for each batch the ETCs were:
ESP: 27000 (windows 39000)
SR5: 7500 (windows 9500)
That's a massive improvement so thanks Zombie + Crun-chi. | |
|
Michael Goetz Volunteer moderator Project administrator
 Send message
Joined: 21 Jan 10 Posts: 14036 ID: 53948 Credit: 476,023,430 RAC: 209,677
                               
|
I just created a mint usb and ran some 4xESP and 4xSR5 tasks.
After 5 minutes running for each batch the ETCs were:
ESP: 27000 (windows 39000)
SR5: 7500 (windows 9500)
That's a massive improvement so thanks Zombie + Crun-chi.
With most (probably all) conjecture projects, the FFT size varies depending on the K.
Different FFT sizes incur significant changes in run times above and beyond the effect of fitting in the L3 cache.
Make sure you're comparing tests with the same FFT size, otherwise the results you get may be misleading.
Right now, ESP has two different FFT sizes:
1474560
1572864
SR5 has 5 different FFT sizes:
688128
786432
819200
737280
655360
Also, with SR5, different algorithms are used for +1 and -1, so you don't want to be comparing S5 tests to R5 tests.
Especially with conjecture projects, if you want to get information that you can be confident about, it's important to either run manual tests, so you are certain that you're comparing equivalent numbers, or, at the very least, verify that the FFT size and c (+1 or -1) are identical.
____________
My lucky number is 75898^524288+1 | |
|
|
3700x: 5800 seconds, Windows-managed.
3950x: 8400 seconds, Windows-managed.
3950x: 6000 seconds after I discovered Process Lasso and told it to give each LLR task an equal CPU share, which confines each task to a single CCX. | |
|
|
An update for anyone following this with a Ryzen. For the moment I'm still using Windows, as I haven't had time to get Linux up and running for a proper test.
With Process Lasso confining tasks to a CCX I'm now seeing runtimes 10% faster than on the 3700x.
e.g. 3700x: ESP 1440K ~24000 seconds
3950x: ESP 1440K ~22000 seconds
This means the problem was also present on the 3700x, but there's clearly far less of a performance hit crossing CCXs within a CCD than crossing CCDs.
If you want to try out Process Lasso you can find it here: link.
What you need to do is go into Options, CPU, Configure instance balancer.
Add "primegrid_cllr.exe" as the process and select "equal CPUs per instance". Note that this is only going to work if you're using all your real cores. | |
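For anyone who would rather script that pinning themselves (a later post mentions doing exactly that), a rough Python sketch using the third-party psutil library. The process name is the one given above; the 4-cores-per-CCX layout and the assumption that SMT is off (so logical CPU numbers match physical cores) are mine, so adjust both for your own CPU:

import psutil

CORES_PER_CCX = 4   # assumed CCX size for this sketch

def pin_llr_tasks(name="primegrid_cllr.exe"):
    # Give each running LLR task its own CCX by setting CPU affinity.
    tasks = [p for p in psutil.process_iter(["name"]) if p.info["name"] == name]
    physical = psutil.cpu_count(logical=False)
    for i, proc in enumerate(tasks):
        first = (i * CORES_PER_CCX) % physical
        cores = list(range(first, first + CORES_PER_CCX))
        proc.cpu_affinity(cores)   # pin this task to one CCX's cores
        print("pinned PID", proc.pid, "to cores", cores)

pin_llr_tasks()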
|
|
Thanks for the follow-up remarks Sheridan. A lot of people have noticed indeed the rather disappointing but still impressive results from Ryzen 3-gen when not setting affinity and not checking the exact CCX/CCD architecture per CPU. I know that some who have Ryzen are using a program written by ATN, which in essence assigns/or sets affinity for some general PG or PGS programs. It refreshes every x seconds and checks whether it needs to bind more processes to certain artificial nodes. Btw: I have not dived into Lasso yet, but will do! | |
|
Crun-chi Volunteer tester
 Send message
Joined: 25 Nov 09 Posts: 3245 ID: 50683 Credit: 152,646,050 RAC: 18,212
                         
|
Thanks for the follow-up remarks Sheridan. A lot of people have noticed indeed the rather disappointing but still impressive results from Ryzen 3-gen when not setting affinity and not checking the exact CCX/CCD architecture per CPU. I know that some who have Ryzen are using a program written by ATN, which in essence assigns/or sets affinity for some general PG or PGS programs. It refreshes every x seconds and checks whether it needs to bind more processes to certain artificial nodes. Btw: I have not dived into Lasso yet, but will do!
Or just install linux :)
____________
92*10^1585996-1 NEAR-REPDIGIT PRIME :) :) :)
4 * 650^498101-1 CRUS PRIME
2022202116^131072+1 GENERALIZED FERMAT
Proud member of team Aggie The Pew. Go Aggie! | |
|
|
Thanks for the follow-up remarks Sheridan. A lot of people have noticed indeed the rather disappointing but still impressive results from Ryzen 3-gen when not setting affinity and not checking the exact CCX/CCD architecture per CPU. I know that some who have Ryzen are using a program written by ATN, which in essence assigns/or sets affinity for some general PG or PGS programs. It refreshes every x seconds and checks whether it needs to bind more processes to certain artificial nodes. Btw: I have not dived into Lasso yet, but will do!
Or just install linux :)
:/
I have this whole Linux intro class waiting for me, but I chose to abandon it since the Ubuntu version used was too old. It was 10.2 or something.
Is there a big difference between the different versions of Ubuntu and the different types of Linux? (E.g. Mint, Red Hat, ...)
Also, where can I get the ISOs for those different types of Linux, for example Mint? They don't seem to be on the website...?
____________
My lucky number is 6219*2^3374198+1
| |
|
|
Mint is here.
In my experience they're all much the same - fine once they're up and running the way you want but a pain to get to that point unless terminals and arcane commands are your thing (I use ubuntu for development work). | |
|
Crun-chi Volunteer tester
 Send message
Joined: 25 Nov 09 Posts: 3245 ID: 50683 Credit: 152,646,050 RAC: 18,212
                         
|
I use a Debian net install: since I have written my own scripts for mprime, setting up a Linux box takes me 30 minutes. Everything I do, I do remotely over a Linux shell, from Windows.
Set up, run and forget :)
____________
92*10^1585996-1 NEAR-REPDIGIT PRIME :) :) :)
4 * 650^498101-1 CRUS PRIME
2022202116^131072+1 GENERALIZED FERMAT
Proud member of team Aggie The Pew. Go Aggie! | |
|
|
Re FFT sizes, the GCW project spans a wide range of FFTs, most of which will run fastest with 4 threads, but 1920K would be faster with 8.
Is there any way we could get an FFT range selection alongside the number of threads option? | |
|
Michael Goetz Volunteer moderator Project administrator
 Send message
Joined: 21 Jan 10 Posts: 14036 ID: 53948 Credit: 476,023,430 RAC: 209,677
                               
|
Re FFT sizes, the GCW project spans a wide range of FFTs, most of which run will run fastest with 4 threads but 1920K would be faster with 8.
Is there any way we could get an FFT range selection alongside the number of threads option?
No. Anything is possible, but this would involve a lot of work.
It would also take a lot to convince me that permitting this would be a good thing.
____________
My lucky number is 75898^524288+1 | |
|
|
You mean just wanting an extra 20% throughput isn't a good enough reason? ;-)
As only a single subproject is affected at this time, if it's a lot of work I agree it doesn't make sense to do it. | |
|
mikey Send message
Joined: 17 Mar 09 Posts: 1895 ID: 37043 Credit: 822,906,633 RAC: 417,053
                     
|
Thanks for the follow-up remarks Sheridan. A lot of people have noticed indeed the rather disappointing but still impressive results from Ryzen 3-gen when not setting affinity and not checking the exact CCX/CCD architecture per CPU. I know that some who have Ryzen are using a program written by ATN, which in essence assigns/or sets affinity for some general PG or PGS programs. It refreshes every x seconds and checks whether it needs to bind more processes to certain artificial nodes. Btw: I have not dived into Lasso yet, but will do!
Or just install linux :)
:/
I have this whole linux intro class waiting for me but I chose to abandon it since the linux ubuntu version was too old. Was 10.2 or smth.
Is there a big difference between the diff versions of linux ubuntu and the diff types of linux? (Ex. mint, red hat, ...)
Also, where can I get the ISOs for those diff types of linux, for example mint? They don't seem to be on the website...?
Yes, there are differences between each version; to see most of the versions try here https://distrowatch.com/ and browse around. People say Linux Mint is the most Windows-like, but it still takes some getting used to. Ubuntu is one of the most used, and a lot of the other versions are made from it, Mint included, but Ubuntu can be hard for long-time Windows users. No version of Linux can really be completely controlled without the command line, and that's where most of the learning comes in. For instance, if you put a space between some things you can delete things with no way to get them back without a backup, or a reload and starting over.
Back to FFT stuff: is Linux in general faster than Windows, or is it that most/some versions of Linux are faster while other versions are slower? | |
|
|
Or just install linux :)
Having done that last night and had a disaster (again), I'd say don't.
Four hours in, the tasks were looking to be faster than unoptimised Windows but around 10% slower than with Process Lasso (I've written my own now). Then it just suddenly became really unresponsive and all the tasks crashed.
| |
|
Crun-chi Volunteer tester
 Send message
Joined: 25 Nov 09 Posts: 3245 ID: 50683 Credit: 152,646,050 RAC: 18,212
                         
|
After over a month of experimenting, new data is here.
This is real data (under mprime Linux)
[Worker #1 Jul 5 22:36:45] Starting PRP test of 4*53^1608950+1 using all-complex FMA3 FFT length 640K, Pass1=512, Pass2=1280, clm=2, 6 threads
[Worker #1 Jul 6 01:51:01] 4*53^1608950+1 is not prime. RES64: 985B646C621A8710. Wh8: A6218917,00000000
24 + 60 + 60 + 51 = 195 min elapsed
[Worker #2 Jul 20 05:50:36] Starting PRP test of 4*53^1613198+1 using all-complex FMA3 FFT length 640K, Pass1=512, Pass2=1280, clm=2, 6 threads
[Worker #2 Jul 20 09:51:40] 4*53^1613198+1 is not prime. RES64: 6F5E857A6F1844E4. Wh8: C751AA47,00000000
10 + 60 + 60 + 60 + 51 = 241 min elapsed
-------------------------------------------------------------------------------------------------------------
AFTER FLASHING A NEW AGESA BIOS AND DISABLING HT
[Worker #4 Jul 25 17:32:05] Starting PRP test of 4*53^1618262+1 using all-complex FMA3 FFT length 640K, Pass1=512, Pass2=1280, clm=2, 3 threads
[Worker #4 Jul 25 19:32:30] 4*53^1618262+1 is not prime. RES64: B08543B377E72BF8. Wh8: A496D1D7,00000000
120 min (and a few seconds more)
Conclusions:
1. If your board maker releases a new AGESA, flash it. In my case it reduced CPU current draw by nearly 10 A.
2. Don't lower the voltage too much: the CPU will work and will be stable, but as you can see the performance is terrible.
3. Turning HT off gives me more than another 6 A reduction in CPU current, so where before I averaged 80 °C, I now cannot pass 66 °C.
Times are way better than before.
This month was time well spent on tuning the CPU ... now just to find some primes :)
____________
92*10^1585996-1 NEAR-REPDIGIT PRIME :) :) :)
4 * 650^498101-1 CRUS PRIME
2022202116^131072+1 GENERALIZED FERMAT
Proud member of team Aggie The Pew. Go Aggie! | |
|
mackerel Volunteer tester
 Send message
Joined: 2 Oct 08 Posts: 2652 ID: 29980 Credit: 570,442,335 RAC: 10,182
                              
|
The data is a bit confusing, what was the running configuration of the earlier two states? In particular, were you using 50% of threads or all of them? I have seen using the extra SMT threads for LLR work increases power consumption somewhat, and in most cases doesn't result in more throughput.
Also it is unclear what was done when, is the improvement due to the new AGESA, the removal of SMT, or perhaps a combination of both? Do you know the AGESA version (not bios version, as that will vary on motherboard) as it will give an indication of what changes happened at what time. | |
|
Crun-chi Volunteer tester
 Send message
Joined: 25 Nov 09 Posts: 3245 ID: 50683 Credit: 152,646,050 RAC: 18,212
                         
|
The data is a bit confusing, what was the running configuration of the earlier two states? In particular, were you using 50% of threads or all of them? I have seen using the extra SMT threads for LLR work increases power consumption somewhat, and in most cases doesn't result in more throughput.
Also it is unclear what was done when, is the improvement due to the new AGESA, the removal of SMT, or perhaps a combination of both? Do you know the AGESA version (not bios version, as that will vary on motherboard) as it will give an indication of what changes happened at what time.
I think I marked clearly (with bold red colour) when :)
First I had HT on, and since I don't use BOINC (this is prime95/mprime) I used all cores, not 50%.
Then I disabled HT, flashed the new BIOS with the new AGESA, and got the results below the red text :)
In my case AMD AGESA Combo-AM4 V2 1.0.0.2 is the patch I used!
The frequency is the same 3.8 GHz and the voltage is set to 1.1 V
____________
92*10^1585996-1 NEAR-REPDIGIT PRIME :) :) :)
4 * 650^498101-1 CRUS PRIME
2022202116^131072+1 GENERALIZED FERMAT
Proud member of team Aggie The Pew. Go Aggie! | |
|
|
I've just discovered something important for Ryzen users who want to run projects with an FFT size that goes to RAM (i.e. anything >= 1920K).
If you have 16GB of RAM you'll probably want to throw it away and get yourself 32GB.
Why? Because 8GB RAM sticks are currently single rank whereas 16GB sticks are dual rank.
Running Prime95 with 16GB of RAM (single rank sticks) at FFT size 3200K on a 3950x returns a best performance of 626 iter/sec when running 2 tasks with 8 threads each.
Running Prime95 with 32GB of RAM (dual rank sticks) at FFT size 3200K on a 3950x returns a best performance of 1015 iter/sec when running 2 tasks with 8 threads each.
To put that in PrimeGrid terms, a 3200K SoB task would take 28 hours with single rank RAM.
With dual rank it will take around 20 hours.
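As a sanity check on those hours, a small Python sketch of the conversion from benchmark throughput to task time. The ~31M iteration count for a leading-edge SoB candidate comes from the n=31M figure mentioned earlier in the thread, and benchmark throughput is an upper bound, so the dual-rank estimate comes out a little more optimistic than the ~20 hours quoted here:

# Convert total Prime95 throughput (summed over workers) into hours per task.
def hours_per_task(total_iter_per_sec, workers, iterations=31_000_000):
    per_task = total_iter_per_sec / workers    # iter/sec for a single task
    return iterations / per_task / 3600

print(round(hours_per_task(626, 2), 1))    # single rank: ~27.5 hours
print(round(hours_per_task(1015, 2), 1))   # dual rank:   ~17.0 hours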
| |
|
Nick  Send message
Joined: 11 Jul 11 Posts: 2301 ID: 105020 Credit: 9,972,165,410 RAC: 26,800,586
                            
|
With dual rank it will take around 20 hours.
That is exactly the same throughput as a 9960X running 1 task of 12 threads.
Or a 9980XE running 1 task of 14 threads.
Time of 10 hours. | |
|
|
With dual rank it will take around 20 hours.
That is exactly the same throughput as a 9960X running 1 task of 12 threads.
Or a 9980XE running 1 task of 14 threads.
Time of 10 hours.
There's probably a bit more time to squeeze out of it if I get bored enough to try some overclocking.
Looking forward to the 4950x next month which is supposed to be 20% faster. ;-) | |
|
Nick  Send message
Joined: 11 Jul 11 Posts: 2301 ID: 105020 Credit: 9,972,165,410 RAC: 26,800,586
                            
|
Looking forward to the 4950x next month which is supposed to be 20% faster. ;-)
That sounds like some really good performance. I'll race you for throughput in SOB on a 9960X with 16 thread tasks? | |
|
mackerel Volunteer tester
 Send message
Joined: 2 Oct 08 Posts: 2652 ID: 29980 Credit: 570,442,335 RAC: 10,182
                              
|
I've just discovered something important for Ryzen users who want to run projects with an FFT size that spills out to ram (i.e. anything >= 1920K).
If you have 16GB of ram you'll probably want to throw it away and get yourself 32GB.
Why? Because 8GB ram sticks are currently single rank, whereas 16GB sticks are dual rank.
I've done a lot of testing of dual rank vs single rank in the past; it turned out to be the main reason two systems I had performed so differently even though their ram speeds were not that far apart. Where ram bandwidth is limiting, it applies to Intel as much as AMD.
There is one further thing to consider: it isn't just module rank. You can equivalently put two single rank modules on a channel, so on a dual channel system that means you can fit 4 single rank modules. Even if you have 2x8GB there's no need to bin them; just get another pair for 4 modules. Especially with high performance ram there is some risk that the loading from more modules exceeds what the IMC can run stably, but generally speaking a small reduction in ram speed is still outweighed by having dual rank or 2 DPC (DIMMs Per Channel).
Personally, on Ryzen systems I'd just try to run tasks that fit inside a CCX and avoid the ram problem altogether. The only time I'd run massive tasks is if a challenge requires it, but that is infrequent.
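To illustrate what I mean by "fit inside a CCX", here is a minimal sketch. It assumes roughly 8 bytes per FFT element for the main data array only; the real LLR working set is somewhat larger, so treat anything close to the limit as likely to spill:

def fft_data_mb(fft_k):
    # fft_k is the FFT length in "K" (1024) elements, at ~8 bytes per element
    return fft_k * 1024 * 8 / 1024**2

for fft_k in (768, 1920, 2880):
    mb = fft_data_mb(fft_k)
    print(f"{fft_k}K FFT ~ {mb:.0f} MB -> {'fits' if mb <= 16 else 'spills'} against a 16 MB Zen 2 CCX")
| |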
|
|
You're right about ram speed; possibly it's because of the timings, as 3200 CL16 and 3600 CL18 are about the same performance-wise. | |
|
|
You're right about ram speed; possibly it's because of the timings, as 3200 CL16 and 3600 CL18 are about the same performance-wise.
Just like 2666 C19 and 2400 C16.
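That follows if you convert the CAS latency into nanoseconds. A rough sketch (the data rate is in MT/s, so the actual clock is half of it):

def cas_ns(data_rate_mts, cl):
    # first-word latency: CL cycles divided by the memory clock (half the data rate), in ns
    return cl / (data_rate_mts / 2) * 1000

for rate, cl in ((3200, 16), (3600, 18), (2666, 19), (2400, 16)):
    print(f"DDR4-{rate} CL{cl}: {cas_ns(rate, cl):.1f} ns")

By that measure 3200 CL16 and 3600 CL18 both land at 10 ns, and the 2666/2400 pair are within about 1 ns of each other.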
____________
My lucky number is 6219*2^3374198+1
| |
|
mackerel Volunteer tester
 Send message
Joined: 2 Oct 08 Posts: 2652 ID: 29980 Credit: 570,442,335 RAC: 10,182
                              
|
You're right about ram speed; possibly it's because of the timings, as 3200 CL16 and 3600 CL18 are about the same performance-wise.
In my testing, timings don't make a big difference at all. Ram speed, and whether you can get dual rank or 2 DPC, make a far bigger difference than the timings do.
It might be somewhere on this forum, but in the past I ran a test on both AMD and Intel systems using a few ram kits I happened to have. In short, the fastest I had were some dual rank 3200 modules. Second, but notably slower, were single rank modules running at 3600. Those modules were actually rated at 4000, with profiles at 3600 and 4000. They were slower again at 4000 than at 3600, and only in that case do I suspect a secondary or tertiary timing was to blame. While not tested, the primary timings seemed to scale reasonably between the two speeds, so I don't believe they were a major factor. Again, I could demonstrate this on both Intel and AMD systems, so it can't be put down to the Zen 2 system going async above 3600. | |
|
|
Or a 9980XE running 1 task of 14 threads.
Time of 10 hours....
lol, if AMD takes 20+ hours for this, some "hypothetical" 4950X will not help much... there is more than a 50% difference!! Even an "AMD 8950" will not help ))))
AMD needs to make radical changes to the architecture and forget the "Cinebench marketing", and I say that as the owner of a 1950X/2990WX (trash CPU) on the dead X399 platform. TRX40 is already a dead platform, and AMD makes too much noise about its Pro CPUs and the WRX80 chipset and motherboards. | |
|
Nick  Send message
Joined: 11 Jul 11 Posts: 2301 ID: 105020 Credit: 9,972,165,410 RAC: 26,800,586
                            
|
Or a 9980XE running 1 task of 14 threads.
Time of 10 hours....
lol, if AMD takes 20+ hours for this, some "hypothetical" 4950X will not help much... there is more than a 50% difference!! Even an "AMD 8950" will not help ))))
AMD needs to make radical changes to the architecture and forget the "Cinebench marketing", and I say that as the owner of a 1950X/2990WX (trash CPU) on the dead X399 platform. TRX40 is already a dead platform, and AMD makes too much noise about its Pro CPUs and the WRX80 chipset and motherboards.
The soon to be released 4950X could be fast?
On the Intel front, it is a strange world that the 9960X is so much better than the 9980XE at PG and yet it probably isn't known because I suspect I have the only 2 9960Xs at PG.
Latest stats on SOB (not a lot of data I know):
I've run 4 x 16 thread tasks on 2 x 9960X - average time 26,674 sec - 3.24 SOB tasks / day
I've run 2 x 16 thread tasks on a 9980XE - average time 31,403 sec - 2.75 SOB tasks / day
The 9960Xs are consistently faster and have more throughput than the 9980XE for all tasks at PG.
If the 10980XE is faster than the 9960X, it wouldn't be by much. And it is 50% more expensive.
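For anyone checking the arithmetic, those tasks/day figures are just the seconds in a day divided by the average task time, per machine, running one task at a time. A quick sketch with my numbers plugged in:

def tasks_per_day(avg_task_secs, concurrent_tasks=1):
    return 86400 / avg_task_secs * concurrent_tasks

print(f"9960X,  1 x 16-thread SoB at a time: {tasks_per_day(26674):.2f} tasks/day")   # ~3.24
print(f"9980XE, 1 x 16-thread SoB at a time: {tasks_per_day(31403):.2f} tasks/day")   # ~2.75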
Stats on SGS (I didn't write down how many tasks / day - just the difference between computers - I think it was around 3500 tasks / day for the 9960Xs)
The new 9960X was doing 16 single thread tasks of SGS as was the 9980XE. 18.8% faster / task than the 9980XE.
The old 9960X was doing 16 single thread tasks of SGS. 17.3% faster / task than the 9980XE.
There is one thing that needs to be said about Woo (and probably Cul too) re: FFT size:
On the 9960X, 2 tasks each of 8 threads runs times of about 19,500 sec.
If you allocate 2 threads for the GPU and run 2 tasks of 7 threads the times are around 27,000 - 28,000 sec. If you run 1 task of 14 threads (with 2 threads for GPU) you get times of about 11,900 sec.
The times on the 9980XE with 2 x 8 thread tasks and 2 threads for GPU are around 23,500 sec.
The 9980XE has 2 MB more L2 cache (which won't make any difference if you are running 16 threads for LLR / sieve on any of these computers) but has 2.75 MB more L3 cache.
I think the 9960X is a much better CPU because that was the intended design? Intel added 2 more cores in response to AMD and created the xx80XE CPUs.
Apologies if I've gone really off topic. | |
|
|
Or a 9980XE running 1 task of 14 threads.
Time of 10 hours....
lol, if AMD takes 20+ hours for this, some "hypothetical" 4950X will not help much... there is more than a 50% difference!! Even an "AMD 8950" will not help ))))
AMD needs to make radical changes to the architecture and forget the "Cinebench marketing", and I say that as the owner of a 1950X/2990WX (trash CPU) on the dead X399 platform. TRX40 is already a dead platform, and AMD makes too much noise about its Pro CPUs and the WRX80 chipset and motherboards.
The soon to be released 4950X could be fast?
On the Intel front, it is a strange world that the 9960X is so much better than the 9980XE at PG and yet it probably isn't known because I suspect I have the only 2 9960Xs at PG.
Latest stats on SOB (not a lot of data I know):
I've run 4 x 16 thread tasks on 2 x 9960X - average time 26,674 sec - 3.24 SOB tasks / day
I've run 2 x 16 thread tasks on a 9980XE - average time 31,403 sec - 2.75 SOB tasks / day
The 9960Xs are consistently faster and have more throughput than the 9980XE for all tasks at PG.
If the 10980XE is faster than the 9960X, it wouldn't be by much. And it is 50% more expensive.
Stats on SGS (I didn't write down how many tasks / day - just the difference between computers - I think it was around 3500 tasks / day for the 9960Xs)
The new 9960X was doing 16 single thread tasks of SGS as was the 9980XE. 18.8% faster / task than the 9980XE.
The old 9960X was doing 16 single thread tasks of SGS. 17.3% faster / task than the 9980XE.
There is one thing that needs to be said about Woo (and probably Cul too) re: FFT size:
On the 9960X, 2 tasks each of 8 threads runs times of about 19,500 sec.
If you allocate 2 threads for the GPU and run 2 tasks of 7 threads the times are around 27,000 - 28,000 sec. If you run 1 task of 14 threads (with 2 threads for GPU) you get times of about 11,900 sec.
The times on the 9980XE with 2 x 8 thread tasks and 2 threads for GPU are around 23,500 sec.
The 9980XE has 2 MB more L2 cache (which won't make any difference if you are running 16 threads for LLR / sieve on any of these computers) but has 2.75 MB more L3 cache.
I think the 9960X is a much better CPU because that was the intended design? Intel added 2 more cores in response to AMD and created the xx80XE CPUs.
Apologies if I've gone really off topic.
I'll have my 10980XE up and running in the next week, so I'll be able to compare for you, though it will be air cooled initially (Noctua D15 with 3 fans), so that may affect the comparison depending on how you're cooling yours (and indeed, the ability to cool effectively seems to be the biggest determinant of performance with all modern CPUs).
In regards to the 16 vs. 18 core systems, Intel actually never intended to release parts with more than 10 cores when they were first working on Skylake-X. Fortunately for us, AMD had already released the 16 core 1950X, blindsiding them, and the 12-18 core parts were quickly added to the launch slides without any details, because Intel was still figuring them out.
Silicon-wise, 18 cores worked for the HCC die with its 4x5 tiling scheme (two tiles go to the memory controllers), and that is also why things top out at odd numbers in general: LCC is 10C (all Intel initially intended for HEDT), i.e. 3x4-2, and XCC is 28C, i.e. 5x6-2.
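A tiny sketch of that tiling arithmetic, assuming (as above) that 2 tiles of each mesh go to the memory controllers:

# Skylake-X core counts from the mesh tiling, minus 2 tiles for the memory controllers
dies = {"LCC": (3, 4), "HCC": (4, 5), "XCC": (5, 6)}
for name, (rows, cols) in dies.items():
    print(f"{name}: {rows}x{cols} mesh - 2 IMC tiles = {rows * cols - 2} cores")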
____________
Eating more cheese on Thursdays. | |
|
|
Or a 9980XE running 1 task of 14 threads.
Time of 10 hours....
lol, if AMD takes 20+ hours for this, some "hypothetical" 4950X will not help much... there is more than a 50% difference!! Even an "AMD 8950" will not help ))))
AMD needs to make radical changes to the architecture and forget the "Cinebench marketing", and I say that as the owner of a 1950X/2990WX (trash CPU) on the dead X399 platform. TRX40 is already a dead platform, and AMD makes too much noise about its Pro CPUs and the WRX80 chipset and motherboards.
What are you talking about? I can run at least 2 tasks in 20 hours. That's exactly the same throughput as running 1 task in 10, with a CPU that costs 30% less than a 9980XE. | |
|
Crun-chi Volunteer tester
 Send message
Joined: 25 Nov 09 Posts: 3245 ID: 50683 Credit: 152,646,050 RAC: 18,212
                         
|
Or a 9980XE running 1 task of 14 threads.
Time of 10 hours....
What nonsense, throwing resources away like that...
____________
92*10^1585996-1 NEAR-REPDIGIT PRIME :) :) :)
4 * 650^498101-1 CRUS PRIME
2022202116^131072+1 GENERALIZED FERMAT
Proud member of team Aggie The Pew. Go Aggie! | |
|
Nick  Send message
Joined: 11 Jul 11 Posts: 2301 ID: 105020 Credit: 9,972,165,410 RAC: 26,800,586
                            
|
Or a 9980XE running 1 task of 14 threads.
Time of 10 hours....
What nonsense, throwing resources away like that...
Really? Nonsense indeed. I can't fit 2 SoB tasks in cache. You do know about running tasks within cache?
Since tasks have reached that size I am now using MORE threads on 1 task at a time.
I will use the 9960X as an example as that is a better CPU.
I have read that running 2 x 8 thread tasks on a 3950X gets 2 tasks completed in 20 hours - 72,000 sec. Is this a waste of resources?
On either 9960X I can complete 2 tasks, 1 at a time with 16 threads, in 53,000 sec. How much throughput do you expect from this CPU?
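To put numbers on "throwing away resources", this is the comparison I am making. The times are the ones quoted in this thread, so treat them as rough:

# Throughput of the two strategies discussed above, in tasks/day
configs = {
    "3950X, 2 x 8-thread tasks at once":   (72000, 2),   # 2 tasks finish together in ~20 hours
    "9960X, 1 x 16-thread task at a time": (53000, 2),   # 2 tasks back to back in ~53,000 sec
}
for name, (secs_for_two_tasks, tasks) in configs.items():
    print(f"{name}: {86400 * tasks / secs_for_two_tasks:.2f} tasks/day")
| |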
|
|
Hi,
There seems to be a bit of disagreement in regards to calculating the approximate L3 footprint of the various PG WUs. The difference between multiplying by 8 or 24 is pretty large! Is there a reason why 8x seems to work in practical terms when 24x appears to be the theoretical answer?
I have read through this thread several times, and I am not sure I have seen a direct answer to this question. Did I miss it, or is it still unanswered?
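For what it's worth, here is the little sketch I have been using to compare the two rules of thumb. The 8x figure assumes 8 bytes per FFT element for the data array alone, and 24x is the larger multiplier mentioned earlier; I am not claiming either is the exact LLR working set:

# Compare the two L3-footprint estimates for a given FFT size (in "K" elements)
def footprint_mb(fft_k, bytes_per_element):
    return fft_k * 1024 * bytes_per_element / 1024**2

for fft_k in (2048, 2880):
    print(f"{fft_k}K FFT: ~{footprint_mb(fft_k, 8):.0f} MB at 8x vs ~{footprint_mb(fft_k, 24):.0f} MB at 24x")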
____________
Reno, NV
| |
|
|
I think there was a post in the past that had FFT sizes for each project, but I can't find it again. Anyone know where it was? It is possible it would be out of date also, so is there some way I can find current values short of looking at random units?
Why? Ryzen 3000 (Zen 2) was launched yesterday, and comes with at least 32MB of L3 cache. I expect to get my sample on Tuesday and bench it. Based on what I currently understand of it, it has potential to be the best choice for LLR use in terms of a balance of power consumption, compute performance, and pricing. FFT size will allow a more educated guess into the optimal running configuration, which I can then test with benchmarks.
Current values:
+-------+----------------------------------------+------+------+
| appid | user_friendly_name | min | max |
+-------+----------------------------------------+------+------+
| 2 | Sophie Germain (LLR) | 128 | 128 |
| 3 | Woodall (LLR) | 1440 | 1920 |
| 4 | Cullen (LLR) | 1536 | 1920 |
| 7 | 321 (LLR) | 800 | 800 |
| 8 | Prime Sierpinski Problem (LLR) | 1920 | 2048 |
| 10 | PPS (LLR) | 192 | 192 |
| 13 | Seventeen or Bust | 2560 | 2880 |
| 15 | The Riesel Problem (LLR) | 720 | 1008 |
| 18 | PPSE (LLR) | 120 | 120 |
| 19 | Sierpinski/Riesel Base 5 Problem (LLR) | 560 | 720 |
| 20 | Extended Sierpinski Problem | 1280 | 1280 |
| 21 | PPS-Mega (LLR) | 200 | 256 |
| 30 | Generalized Cullen/Woodall (LLR) | 1440 | 1792 |
+-------+----------------------------------------+------+------+
The SoB line includes post-DC n=31M candidates.
Is there a current version of this table? If not, could an up-to-date version be posted or linked? Now that speed is no longer the goal, it is way more important to maximize throughput. To do that, we need to know the FFT size so we don't overload our cache.
____________
Reno, NV
| |
|
stream Volunteer moderator Project administrator Volunteer developer Volunteer tester Send message
Joined: 1 Mar 14 Posts: 1051 ID: 301928 Credit: 563,881,658 RAC: 1,342
                         
|
Is there a current version of this table?
+---------+--------+--------+
| project | minfft | maxfft |
+---------+--------+--------+
| 321 | 864 | 864 |
| CUL | 1600 | 1920 |
| DIV | 384 | 384 |
| ESP | 1440 | 1536 |
| GCW | 1680 | 2016 |
| MEG | 256 | 256 |
| PPS | 120 | 200 |
| PSP | 2304 | 2304 |
| SGS | 128 | 128 |
| SOB | 2880 | 3072 |
| SR5 | 640 | 800 |
| TRP | 800 | 1120 |
| WOO | 1600 | 2016 |
+---------+--------+--------+
PPS and PPSE are combined in this table for technical reasons. PPS has the larger FFT (200K) and PPSE the smaller (120K). | |
|
|
Looking forward to the 4950x next month which is supposed to be 20% faster. ;-)
That sounds like some really good performance. I'll race you for throughput in SOB on a 9960X with 16 thread tasks?
Just read this:
On Zen 2, each CCD houses two core complexes (CCXs), whereby each complex is comprised of four cores that share 16MB of L3 cache. According to the AMD document, Zen 3's composition is completely different - there's only one CCX inside each CCD. The CCX possesses eight cores that can either run in single-thread (1T) or two-thread SMT (simultaneous multithreading) mode (2T), amounting up to 16 threads per complex. Since there's only one CCX now, all eight CPU cores can now directly access the 32MB of shared L3 cache.
If that's right it's going to be more than 20% faster on SoB, as it will be running purely in L3 rather than flipping things between CCXs and ram.
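A quick sanity check of that claim, using the 8-bytes-per-element approximation from earlier in the thread (an optimistic lower bound, not the full working set):

sob_max_fft_k = 3072                            # largest SoB FFT from the table above
data_mb = sob_max_fft_k * 1024 * 8 / 1024**2    # ~8 bytes per FFT element, data only
print(f"~{data_mb:.0f} MB of FFT data vs 32 MB of L3 -> {'fits' if data_mb <= 32 else 'spills'}")
| |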
|