Message boards : Number crunching : Multi-threaded LLR on a Threadripper 1950X
Azmodes Volunteer tester
I don't want to derail the Hallowoo challenge thread further, so I'll move this to its own topic instead.
Azmodes wrote: So is multi-threading actually worth it, throughput-wise?
I tested it with the LLR program through the command prompt and in this case apparently not. Should I have turned off hyperthreading? (It's a TR 1950X)
https://i.imgur.com/NHwqb5f.jpg (EDIT: the first "h per day" should read 10.4)
I waited until the 20,000th iteration before entering the times.
Michael Goetz wrote: It's exceptionally beneficial on larger tasks, generally anything larger than PPS-MEGA. On a typical quad-core, by running a single 4-thread task instead of four individual single-thread tasks, you're using 25% of the memory, which means your cache hit rate is higher. Hence, more speed and better overall throughput. The difference is significant, especially on Intel FMA3-capable CPUs, because the extreme performance of the CPU makes the memory delays even more significant.
The faster the CPU, the larger the gain you'll see from multi-threading. But even slow CPUs should see an improvement. The memory problem was evident even back when we were running Core-2 CPUs 8 years ago.
xii5ku wrote: This is, of course, what I see too on all of my CPUs (currently all Intel). (Edit: And it was clearly showing in my own measurements, some with random tasks, others with fixed tasks.)
But here is a transcript of Azmodes' screenshot:
#tasks    iterations     ms per iteration    tasks per day    #threads    h per task
   1      17,016,603          2.199              2.309            16          10.4
   2      17,016,603          3.409              2.979             8          16.1
   4      17,016,603          6.653              3.053             4          31.4
   8      17,016,603         13.271              3.061             2          62.7
  16      17,016,603         23.403              3.471             1         110.6
What is going on? A mistake in the test, or does AMD Zen behave so much differently?
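(For reference, the two derived columns follow directly from the timings: hours per task = ms/iteration x 17,016,603 iterations, and tasks per day = #tasks x 24 h / hours per task. For the single 16-thread task that is 2.199 ms x 17,016,603 ≈ 37,420 s ≈ 10.4 h and 24 / 10.4 ≈ 2.31 tasks per day; for the sixteen single-thread tasks, 23.403 ms x 17,016,603 ≈ 110.6 h each, but with 16 in flight that is 16 x 24 / 110.6 ≈ 3.47 tasks per day.)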
mackerel wrote: Based on my testing, primarily with Intel quad cores (excluding Broadwell and Skylake-X), I see three performance zones:
1. Small tasks, where one task per core does not exceed the L3 cache. Here it is more advantageous to run one per core, as multi-thread overhead can significantly reduce throughput.
2. Multi-thread task(s) substantially filling the L3 cache get a speedup compared to one per core, as they are no longer fighting for RAM bandwidth. With sufficient RAM bandwidth, this may not be seen.
3. Single- or multi-thread tasks that significantly exceed the L3 cache. These will be RAM-bandwidth limited.
Broadwell with its L4 cache has practically unlimited RAM bandwidth, negating case 3. Skylake-X with its cache structure I'm less clear about. It doesn't seem to scale as well as I'd hope in practice, but I don't have enough data to make any detailed observations. I suspect it may behave similarly to Zen in some respects.
Both Zen and Skylake-X are unlike traditional Intels with inclusive cache. Instead they have exclusive or non-inclusive cache, respectively. Data is not duplicated in L2 and L3. I suspect there is extra data shuffling because of that, but you potentially have a bigger effective cache. I'm also unclear what happens if more than one core needs the same data.
In the case of Threadripper, we have an additional potential problem: there are two CCXs per die, and two dies per socket. One task running on all cores may be impacted by having to cross those boundaries. Two tasks of 8 cores each, assuming the scheduler is smart enough to put them on different dies, would effectively operate like two 8-core consumer Ryzen CPUs in terms of operation and RAM bandwidth (assuming all channels are populated). The 1950X has 32 MB of L3 cache in total, so that should support 4 simultaneous Woo tasks without being limited by the RAM much, if at all. Based on this, I would suspect 2 or 4 tasks, dividing the cores equally between them, would be most efficient. That 8 tasks are comparable, and 16 even better, I'm not sure how to explain.
One possible reason is that the Ryzen family's AVX units are much weaker than Intel's, roughly half the performance. Because of that you could feed about double the number of Zen cores compared to Intel ones for a given number of RAM channels, especially with the more limited clocks at higher core counts. What speed RAM was the system using? If 3200 or faster, that could be fairly close to not being limited. I'm not sure this is a complete explanation, but it's a starting point towards one.
I was also under the impression that it's faster as well as actually more efficient, but the above test results made me scratch my head. I know AMD can't hold a candle to modern-ish Intel CPUs when it comes to LLR, but I thought they would still benefit from multi-threading.
As for the RAM, 32 GB worth of these.
I do think I set up the multi-threading parameters correctly. I based it on what Michael told me once and scaled it up and down for the other thread/task ratios accordingly. Here are the batch files for 4x4 and 16x1, along with the LLR executable and four/sixteen folders titled 0, 1, 2, 3, etc. in the same directories:
bench4.bat wrote: start "Core 0" runLLR.bat 0 %1 %2 %3 %4 %5 %6 %7 %8 %9
start "Core 1" runLLR.bat 1 %1 %2 %3 %4 %5 %6 %7 %8 %9
start "Core 2" runLLR.bat 2 %1 %2 %3 %4 %5 %6 %7 %8 %9
start "Core 3" runLLR.bat 3 %1 %2 %3 %4 %5 %6 %7 %8 %9
bench4 llr64 -d -t4 -q"[number]"
bench16 wrote: start "Core 0" runLLR.bat 0 %1 %2 %3 %4 %5 %6 %7 %8 %9
start "Core 1" runLLR.bat 1 %1 %2 %3 %4 %5 %6 %7 %8 %9
start "Core 2" runLLR.bat 2 %1 %2 %3 %4 %5 %6 %7 %8 %9
start "Core 3" runLLR.bat 3 %1 %2 %3 %4 %5 %6 %7 %8 %9
start "Core 4" runLLR.bat 4 %1 %2 %3 %4 %5 %6 %7 %8 %9
start "Core 5" runLLR.bat 5 %1 %2 %3 %4 %5 %6 %7 %8 %9
start "Core 6" runLLR.bat 6 %1 %2 %3 %4 %5 %6 %7 %8 %9
start "Core 7" runLLR.bat 7 %1 %2 %3 %4 %5 %6 %7 %8 %9
start "Core 8" runLLR.bat 8 %1 %2 %3 %4 %5 %6 %7 %8 %9
start "Core 9" runLLR.bat 9 %1 %2 %3 %4 %5 %6 %7 %8 %9
start "Core 10" runLLR.bat 10 %1 %2 %3 %4 %5 %6 %7 %8 %9
start "Core 11" runLLR.bat 11 %1 %2 %3 %4 %5 %6 %7 %8 %9
start "Core 12" runLLR.bat 12 %1 %2 %3 %4 %5 %6 %7 %8 %9
start "Core 13" runLLR.bat 13 %1 %2 %3 %4 %5 %6 %7 %8 %9
start "Core 14" runLLR.bat 14 %1 %2 %3 %4 %5 %6 %7 %8 %9
start "Core 15" runLLR.bat 15 %1 %2 %3 %4 %5 %6 %7 %8 %9
bench16 llr64 -d -t1 -q"[number]"
runllr.bat wrote: cd %1
path ..;
del z*
del llr.ini
llr64 -v
%2 %3 %4 %5 %6 %7 %8 %9
Do I have to modify something else for higher/lower core counts? At least the CPU load and number of running processes seem to confirm that it works for all variations. No other tasks or intensive processes were run in tandem. I can also confirm that actual Woodall tasks I completed on this host align with the predictions for 4x4. *shrugs*
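(For reference, scaling the same pattern to the other ratios should only need the matching number of start lines and the right -t value; e.g. the 2x8 case presumably looks something like the hypothetical sketch below, assuming two work folders named 0 and 1 sit next to the batch files:

rem bench2.bat (hypothetical): two instances, 8 threads each
start "Core 0" runLLR.bat 0 %1 %2 %3 %4 %5 %6 %7 %8 %9
start "Core 1" runLLR.bat 1 %1 %2 %3 %4 %5 %6 %7 %8 %9

called as: bench2 llr64 -d -t8 -q"[number]"

Nothing in runLLR.bat itself should need to change, since it only switches into the given folder, cleans it, and passes the remaining arguments through.)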
I guess I'll try turning off SMT and do another test run.
____________
Long live the sievers.
+ Encyclopaedia Metallum: The Metal Archives +
Azmodes Volunteer tester
No non-HT data yet, but here are some rough estimates from running various instances of the GFN21 app (3.3.4) through the command line:
#tasks x threads    time est. (h)    tasks per day
     1 x 16              34              0.706
     2 x 8               45              1.067
     4 x 4               83              1.157
     8 x 2              155              1.239
    16 x 1              288              1.333
Same pattern.
____________
Long live the sievers.
+ Encyclopaedia Metallum: The Metal Archives +
mackerel Volunteer tester
I assume you're running Windows from the batch files... I have seen some not-so-smart things when HT is on with a multi-socket Intel system, and have to assume similar may happen with a single Threadripper. It will be interesting to see the SMT-off results, as that should negate some of the potential for Windows doing silly things.
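(A possible way to take the Windows scheduler out of the equation would be to pin each instance explicitly: cmd's start command accepts /NODE and /AFFINITY switches. A rough, hypothetical sketch for two 8-thread tasks, assuming SMT is off so each die exposes 8 logical processors, the board is in NUMA mode so each die appears as a node, and the folder/argument layout of the bench batch files above is reused:

rem bench2numa.bat (hypothetical): one 8-thread LLR instance per die.
rem With /NODE, the /AFFINITY mask is interpreted relative to that node,
rem so FF covers the node's first 8 logical processors. The llr64 child
rem process inherits the affinity of the cmd window started here.
start "Die 0" /NODE 0 /AFFINITY FF runLLR.bat 0 %1 %2 %3 %4 %5 %6 %7 %8 %9
start "Die 1" /NODE 1 /AFFINITY FF runLLR.bat 1 %1 %2 %3 %4 %5 %6 %7 %8 %9

called as: bench2numa llr64 -d -t8 -q"[number]"

In UMA mode there is only one node, so plain /AFFINITY masks such as FF and FF00 would be the rough equivalent. Whether pinning actually changes anything on Threadripper is an open question here.)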
Azmodes Volunteer tester
Yes, Windows 10 64-bit. I'll try and get some non-SMT results posted this weekend.
____________
Long live the sievers.
+ Encyclopaedia Metallum: The Metal Archives +
Azmodes Volunteer tester
With SMT (AMDese for hyperthreading) turned off in the BIOS:
LLR
#tasks    iterations     ms per iteration    tasks per day    #threads    h per task
   1      17,016,603          1.741              2.916            16           8.2
   2      17,016,603          3.056              3.323             8          14.4
   4      17,016,603          6.258              3.245             4          29.6
   8      17,016,603         13.143              3.091             2          62.1
  16      17,016,603         22.252              3.651             1         105.2
GFN21
#tasks x threads    time est. (h)    tasks per day
     1 x 16              24              1.000
     2 x 8               34.5            1.391
     4 x 4               70.5            1.362
     8 x 2              139              1.381
    16 x 1              267              1.438
A bit different, but single-threading is still the most efficient. *shrugs*
EDIT: Okay, ummm, so when I only run one instance with 4 threads (as opposed to four), duration is 43 hours (2.233 tasks per day if it were 4 tasks) for GFN21 and 4.077 msecs per iteration (4.982 tasks per day if it were 4 tasks) for LLR. Diminishing returns and such, but should these times really differ by that much? I mean, CPU load is obviously only a quarter of when I was running four at a time. I'm confused. :(
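(For reference, the 4.982 figure is just the solo timing extrapolated to four copies: 4.077 ms x 17,016,603 iterations ≈ 19.3 h per task, and 4 x 24 / 19.3 ≈ 4.98 tasks per day. That extrapolation assumes all four instances could hold the solo speed, which is presumably exactly what they cannot do once they have to share the L3 and the RAM bandwidth, per mackerel's case 3 above; hence the measured 6.258 ms per iteration when four run together.)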
____________
Long live the sievers.
+ Encyclopaedia Metallum: The Metal Archives +
Have any tests been done to show the difference between the two Memory Access Modes (UMA/NUMA)?
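(For anyone wanting to check which mode a board is currently in: besides the Task Manager NUMA-node view mentioned later in the thread, Sysinternals Coreinfo can dump the node layout from a command prompt, e.g. coreinfo -n, which should report one node in UMA/"Distributed" mode and two in NUMA/"Local" mode on a 1950X. Coreinfo isn't part of the test setup here, just a convenient way to verify the setting.)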
Azmodes Volunteer tester
Since I have no idea what that is, I think the answer is "no". :P
____________
Long live the sievers.
+ Encyclopaedia Metallum: The Metal Archives +
Azmodes Volunteer tester
Do we have any Threadripper (or Ryzen in general) crunchers here willing to do some testing of their own and post their results here?
____________
Long live the sievers.
+ Encyclopaedia Metallum: The Metal Archives +
Azmodes wrote: Do we have any Threadripper (or Ryzen in general) crunchers here willing to do some testing of their own and post their results here?
I have a 1950X with good fast RAM at 3200 MHz and CAS 14.
I'm willing to run a few tests. What tests are you interested in? Please give me specifics.
Azmodes Volunteer tester
Thanks. Well, basically LLR and GFN21 multi-threading, to see if there is a similar pattern (i.e. that multi-threading is less efficient, and whether SMT off helps or not). See the tables for the parameters and combinations I used (I chose a Woodall-level n value for LLR).
____________
Long live the sievers.
+ Encyclopaedia Metallum: The Metal Archives +
@Azmodes
Can you check how the TR 1950X behaves on SoB, like you previously did for GFN21?
Azmodes Volunteer tester
Yes, I can do some testing over the weekend.
____________
Long live the sievers.
+ Encyclopaedia Metallum: The Metal Archives +
Did you have time to check SoB?
Azmodes Volunteer tester
Sorry, slipped my mind... Here it is.
#tasks    iterations     ms per iteration    tasks per day    #threads    h per task
   1      28,493,124          2.038              1.488            16          16.1
   2      28,493,124          3.748              1.618             8          29.7
   4      28,493,124          7.846              1.546             4          62.1
   8      28,493,124         15.530              1.562             2         122.9
  16      28,493,124         29.252              1.643             1         233.7
I tested 21181*2^28493110+1. Running Windows 10 Professional 64-bit, CPU at 3.75 GHz (SMT on) with 32 GB of quad-channel DDR4 RAM at 2666 MHz.
____________
Long live the sievers.
+ Encyclopaedia Metallum: The Metal Archives +
Thanks for the numbers.
It's not as fast as I thought it would be. :(
Azmodes Volunteer tester
If anyone's still interested, I did a test with NUMA (i.e. I set memory access mode to "local" in the AMD tuning tool, then rebooted; task manager does show two NUMA nodes):
#tasks    iterations     ms per iteration    tasks per day    #threads    h per task
   1      28,493,124          3.344              0.907            16          26.5
   2      28,493,124          3.488              1.738             8          27.6
   4      28,493,124          7.237              1.676             4          57.3
Number tested is again 21181*2^28493110+1.
____________
Long live the sievers.
+ Encyclopaedia Metallum: The Metal Archives +