Message boards : 321 Prime Search : Multi-threading: Max # of threads for each task Setting?

What should I set this to? I have Windows 10 and an i5-3470 (4c/4t) with 24 GB of RAM. I also have a Ryzen 3900XT (12c/24t). How do you gauge thread count for these and other computers? Any insight would be appreciated.

For the i5-3470 I would run 1 task on 4 cores, and for the 3900XT I would run 4 tasks on 3 cores each. These tasks presently use about 7 MB of CPU cache each, so the 3470 will be a bit slow, but the 3900XT should run them pretty quickly with that setting.
You will want to make sure the tasks stay on the same CCX (on the 3900XT, a CCX is a set of 3 adjacent cores) to maximize efficiency on the Ryzen. You can use Process Lasso (free, at https://bitsum.com/) or, if you hop on the Discord server, someone may have a link to another program written by the same person who made the LLR2 software. Ryzen 3rd-gen CPUs are much more productive and efficient if you can keep a task on the same CCX.
As far as optimal configurations go, that is a function of FFT length and cache size. If you look at the output of a task after you run it (by clicking the task on your tasks summary page), you can see the FFT size. Multiply that by 8 and you get the CPU cache, in MB, used during the test. For 321 the FFT size is currently 864K, so each task uses about 7 MB of cache. The i5-3470 has 6 MB, and each CCX of the 3900XT has 16 MB.
Some examples: for PPSE tasks, with an FFT length of 120K (~1 MB of cache), the most efficient configuration is 1 task on each core (4 tasks total for the 3470 and 12 for the 3900XT). For PPS-Mega the FFT size is currently 256K, or about 2 MB per task. For optimal throughput you would probably run 2 tasks with 2 threads each on the 3470 and 12 tasks with 1 thread each on the 3900XT, staying within each CPU's cache limits.
Hope this helps, and good luck!
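To make that rule of thumb concrete, here is a minimal sketch in Python; the function names are my own, for illustration, and not part of LLR or BOINC:

# Rule of thumb from the post: cache per task ~= FFT length * 8 bytes
# (one 8-byte double per FFT element), so 864K -> ~6.75 MB.

def cache_per_task_mb(fft_length_k: int) -> float:
    """Approximate LLR working set in MB for an FFT length given in K."""
    return fft_length_k * 1024 * 8 / 2**20

def suggest_config(cores: int, cache_mb: float, fft_length_k: int):
    """Pick (tasks, threads per task): run as many concurrent tasks as
    fit in cache, constrained to divisors of the core count so every
    core is used and threads split evenly."""
    per_task = cache_per_task_mb(fft_length_k)
    fits_in_cache = max(1, int(cache_mb // per_task))
    for tasks in range(min(cores, fits_in_cache), 0, -1):
        if cores % tasks == 0:
            return tasks, cores // tasks

print(suggest_config(4, 6, 864))    # i5-3470, 321:         (1, 4)
print(suggest_config(3, 16, 864))   # one 3900XT CCX, 321:  (1, 3), i.e. 4 tasks x 3 threads
print(suggest_config(4, 6, 256))    # i5-3470, PPS-Mega:    (2, 2)
print(suggest_config(3, 16, 120))   # one 3900XT CCX, PPSE: (3, 1), i.e. 12 tasks x 1 thread

This reproduces the configurations suggested above for both CPUs.
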
or, if you hop on the Discord server, someone may have a link to another program written by the same person who made the LLR2 software.
I think you are referring to AffinityWatcher by Pavel Atnashev (user 914937). /JeppeSN

Hi, are you talking about L2 or L3 CPU cache? I ask because my CPU has only 2 MB of L2 but 45 MB of L3. Should I be calculating for L2 or L3?

Generally speaking, it's L3, but there is an "it depends" based on cache inclusiveness. AMD and consumer Intel chips store a copy of L2 in L3, so only L3 matters (and in Ryzen 3k/5k, it's L3 per CCX). Skylake-X/SP and up (HEDT/server) have a mostly non-inclusive hierarchy, so it's a little less than L2+L3.
What CPU are you using that mixes itsy bitsy L2 with massive L3?
____________
Eating more cheese on Thursdays.
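As a sketch of that distinction in Python: the 0.9 margin is my own stand-in for "a little less than", not a measured value, and the second example borrows the Gold 6254 figures that appear later in the thread (18 x 1 MB L2, 24.75 MB L3); Zen's victim L3 is treated as non-inclusive per the follow-up further down.

def effective_cache_mb(l2_mb: float, l3_mb: float, inclusive_l3: bool) -> float:
    # Inclusive L3 (consumer Intel): L3 holds a copy of L2, so only L3 counts.
    # Non-inclusive/victim L3 (Skylake-X/SP and newer; Zen, see below):
    # a little less than L2 + L3 -- 0.9 is an arbitrary illustration margin.
    return l3_mb if inclusive_l3 else 0.9 * (l2_mb + l3_mb)

print(effective_cache_mb(2.0, 45.0, inclusive_l3=True))     # inclusive: just the 45 MB L3
print(effective_cache_mb(18.0, 24.75, inclusive_l3=False))  # non-inclusive: ~38.5 MB
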
What CPU are you using that mixes itsy bitsy L2 with massive L3?
So I'm guessing 6 cores per task is optimal for me? The CPU is an E5-2686 v3.

Hmm, according to Task Manager my system has the following available:
L1: 2.2 MB
L2: 9 MB
L3: 90 MB

Might you be running a two-socket system? 36 cores total across 2 CPUs? (Your computers are hidden.)
You can fit 6 tasks in each CPU's L3 with room to spare, so I'd suggest 3 threads/task.
Not fully comparable, of course, but I found much better throughput doing 321 with 6x3t instead of 3x6t on my 18-core 10980XE.
____________
Eating more cheese on Thursdays.
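For explicitness, the sizing arithmetic behind that suggestion, assuming 18 cores and 45 MB of L3 per socket for this dual-socket E5 v3:

fft_k = 864                           # current 321 FFT length
task_mb = fft_k * 1024 * 8 / 2**20    # ~6.75 MB working set per task
l3_mb, cores = 45, 18                 # per socket (half of 90 MB / 36 cores)
tasks = int(l3_mb // task_mb)         # 6 tasks fit per socket, with room to spare
print(tasks, cores // tasks)          # -> 6 tasks x 3 threads each
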
I don't know.
I have not done extensive testing so far, since the runtime of the WUs forbids it, but I currently have the following 4 WUs: two running with 2 threads per WU (the upper two) and two running with 1 thread per WU (the lower two).
This is a Xeon E5-2690 with 8 cores, 256 kB of L2 cache per core and 20 MB of shared L3 cache for the whole CPU, so yes, this would be above the cache limit.
One can see that the 2-threaded WUs are slower than the single-threaded ones.
When those WUs are finished I will give the last found 321 prime (3*2^1832496+1) a shot (since I know the outcome, but it is longer still) and run it 8 times in parallel versus 1 time 8-threaded, 2 times 4-threaded, and 4 times 2-threaded.
I also got a Xeon Gold 6254 with 1 MB of L2 cache per core and 24.75 MB of L3 cache per CPU and will run it on that one too; a single-threaded baseline is fired up for now and seems to be running at 3.4 GHz.
With 4 processes in parallel it would need 30 MB of cache; since the L3 cache is a non-inclusive victim cache it should be able to utilize 28.75 MB of cache and, as per the Specification Update, should still run at 3.4 GHz.
I bind it to the second NUMA node to avoid cache copies between nodes as well as foreign-node RAM access.
Hyperthreading is not deactivated, but you have to use what you have got.
$ numactl -N 1 ./llr64 -d -q"3*2^16408818+1" -a1 & disown
$ Starting Proth prime test of 3*2^16408818+1
Using all-complex FMA3 FFT length 960K, Pass1=384, Pass2=2560, a = 5
3*2^16408818+1, bit: 30000 / 16408819 [0.18%]. Time per bit: 3.825 ms.
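As a quick sanity check, the reported time per bit projects the total runtime directly:

bits = 16408818        # exponent of 3*2^16408818+1
ms_per_bit = 3.825     # reported by llr64 at 0.18%
seconds = bits * ms_per_bit / 1000
print(f"{seconds:.0f} s (~{seconds / 3600:.1f} h)")   # ~62764 s, ~17.4 h
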
Ravi Fernando (Project administrator, Volunteer tester, Project scientist):
When those WUs are finished I will give the last found 321 prime (3*2^1832496+1) a shot
Do you mean 3*2^16408818+1?

Do you mean 3*2^16408818+1?
Yes, my bad, wrong buffer ([ctrl]-[v] vs. middle mouse button).
The code has the right one, obviously.

So here goes:
Xeon Gold 6254, 1 WU @ 1 thread:
3*2^16408818+1 is prime! (4939547 decimal digits) Time : 62610.163 sec.
Xeon Gold 6254, 4 WUs @ 1 thread each:
3*2^16408818+1 is prime! (4939547 decimal digits) Time : 63368.355 sec.
3*2^16408818+1 is prime! (4939547 decimal digits) Time : 63656.892 sec.
3*2^16408818+1 is prime! (4939547 decimal digits) Time : 63385.569 sec.
3*2^16408818+1 is prime! (4939547 decimal digits) Time : 63500.971 sec.
Xeon Gold 6254, 1 WU @ 4 threads:
3*2^16408818+1 is prime! (4939547 decimal digits) Time : 18658.265 sec.
Xeon E5-2690, 4 WUs @ 1 thread each:
3*2^16408818+1 is prime! (4939547 decimal digits) Time : 103037.590 sec.
3*2^16408818+1 is prime! (4939547 decimal digits) Time : 103037.493 sec.
3*2^16408818+1 is prime! (4939547 decimal digits) Time : 103038.159 sec.
3*2^16408818+1 is prime! (4939547 decimal digits) Time : 103036.720 sec.
Xeon E5-2690, 1 WU @ 2 threads:
3*2^16408818+1 is prime! (4939547 decimal digits) Time : 26465.674 sec.
Throughput/day (4 WUs @ 1 thread vs. 1 WU multithreaded):
Xeon Gold 6254 => 5.24 vs. 4.63
Xeon E5-2690 => 3.35 vs. 3.25
I will go for 8 WUs @ 1 thread vs. 1 WU @ 8 threads now.
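The throughput arithmetic here appears to be (parallel WUs) x 86400 s / (average time per WU); a short Python check that reproduces the 4.63 and 3.35 figures above:

def throughput_per_day(times_sec):
    # WUs completed per day when len(times_sec) WUs run in parallel.
    n = len(times_sec)
    return n * 86400 / (sum(times_sec) / n)

print(round(throughput_per_day([18658.265]), 2))               # Gold 6254, 1 WU @ 4T: 4.63
print(round(throughput_per_day([103037.590, 103037.493,
                                103038.159, 103036.720]), 2))  # E5-2690, 4 WUs @ 1T: 3.35
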
I managed to dig up the following additional CPUs:
Xeon Gold 6128 (6 cores @ 3.6 GHz turbo; 19.25 MB L3)
Xeon Gold 5217 (8 cores @ 3.0 GHz turbo; 11 MB L3)
Xeon E7-4880 v2 (15 cores @ ? turbo; 37.5 MB L3)
Let's see...

Yves Gallot (Volunteer developer, Project scientist):
AMD and consumer Intel chips store a copy of L2 in L3, so only L3 matters (and in Ryzen 3k/5k, it's L3 per CCX).
Zen's L3 cache is a victim cache, so the size is L2 + L3... no?

Zen's L3 cache is a victim cache, so the size is L2 + L3... no?
For long-running "heavy" tasks on a non-inclusive hierarchy, almost, yes.

Yeah, the Xeon Gold 5217 (8 cores @ 3.0 GHz turbo; 11 MB L3) is currently working on 8 WUs in parallel with a "Time per bit: 16.570 ms"; that is a lot slower than the E5-2690 at 9.317 ms, and a lot slower than the Xeon Gold 6254.
It is running at 3.0 GHz but only drawing 68 W as per RAPL, whereas it was drawing 85 W (max TDP) when doing one WU with 8 threads.
Projected finishing time is around 3 days, yielding a throughput of 2.54/day; throughput was 7.22/day with 1 WU @ 8 threads and 4.52/day with 1 WU @ 4 threads.
But Fujitsu was cheap and fitted only one 32 GB memory module instead of populating all six channels.
Will post back when finished.

One memory channel? Oy vey! That leaves quite a lot of performance potential on the table.
Also, the Gold 5000 series has only a single AVX-512 unit, which when used for PG makes it slower than the AVX2/FMA3 optimization. You might want to do an additional run and see what the difference is.
____________
Eating more cheese on Thursdays.

Does LLR make use of AVX-512? I only see "using FMA3" or the like in the output.

It does. The performance hit on CPUs with a single AVX-512 unit has been well documented, so I wonder if bypassing it and using the faster FMA3 path on affected chips (like that Xeon) was baked into the code?
(Copied from a stderr output on my Cascade Lake):
LLR Program - Version 3.8.23, using Gwnum Library Version 29.8
LLR command line: primegrid_cllr.exe -d -oDiskWriteTime=1 -oThreadsPerTest=1 llr.in
Using zero-padded AVX-512 FFT length 128K, Pass1=128, Pass2=1K, clm=1
____________
Eating more cheese on Thursdays.

Ah, no problem. 3.8.24 seems to be from July 2020.
I have a USB stick for my test runs with version 3.8.21 on it (see my post below or above, depending on sort order).
Since I have a fair amount of test data for "697*2^530150+1" across different CPUs, I will not change that for the time being (also given that I have now done runs worth days with that version).
An initial quick run with 3.8.24 on the Gold 6254:
1 WU @ 18 threads: decidedly slower at 0.692 ms per bit (16408817)
1 WU @ 4 threads: decidedly faster at 0.930 ms per bit

So far, the results (seconds per WU, throughput, and speedup compared to 1 WU @ 1 thread) show that the Gold 52xx is severely hampered by its cache (or the lack thereof).