Message boards : Proth Prime Search : PPS-MEGA: Smaller FFT, longer crunch time?
Is this normal behavior?
I'm crunching PPS-MEGA and noticed that tasks with a 240K FFT take 28-33% longer to crunch than those with a 256K FFT.
For instance, typical tasks:
https://www.primegrid.com/result.php?resultid=1003602409
BOINC llr wrapper (version 8.00)
Using Jean Penne's llr (64 bit)
LLR Program - Version 3.8.23, using Gwnum Library Version 29.8
LLR command line: primegrid_cllr.exe -d -oDiskWriteTime=1 -oThreadsPerTest=4 llr.in
Using all-complex FMA3 FFT length 256K, Pass1=128, Pass2=2K, clm=2, 4 threads, a = 7
Run time: 1,076.00 seconds
https://www.primegrid.com/result.php?resultid=1003588100
BOINC llr wrapper (version 8.00)
Using Jean Penne's llr (64 bit)
LLR Program - Version 3.8.23, using Gwnum Library Version 29.8
LLR command line: primegrid_cllr.exe -d -oDiskWriteTime=1 -oThreadsPerTest=4 llr.in
Using all-complex FMA3 FFT length 240K, Pass1=1280, Pass2=192, clm=2, 4 threads, a = 3
Run time: 1,442.00 seconds
Am I missing something?
Edit: Maybe those different Pass1, Pass2 and "a" values do affect the crunch time?
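Just to put a number on these two particular results (the run times are taken straight from the logs above; a quick Python check, nothing more):
# Run times from the two result logs quoted above.
t_256k = 1076.00  # seconds, 256K FFT task
t_240k = 1442.00  # seconds, 240K FFT task
slowdown = (t_240k / t_256k - 1) * 100
print(f"240K FFT took {slowdown:.0f}% longer than 256K FFT")  # ~34%, roughly in line with the 28-33% range seen across more tasks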
____________
"Accidit in puncto, quod non contingit in anno."
Something that does not occur in a year may, perchance, happen in a moment. | |
|
Crun-chi Volunteer tester
Joined: 25 Nov 09 Posts: 3250 ID: 50683 Credit: 152,646,050 RAC: 10,054
|
This is OK. You are missing the fact that you are using 4 cores on a candidate that only has a 240K/256K FFT.
So your resources are better exploited on the 256K candidate, and hence it has a lower computing time.
But that is just my theory :)
____________
92*10^1585996-1 NEAR-REPDIGIT PRIME :) :) :)
4 * 650^498101-1 CRUS PRIME
2022202116^131072+1 GENERALIZED FERMAT
Proud member of team Aggie The Pew. Go Aggie! | |
|
Michael Goetz Volunteer moderator Project administrator
Joined: 21 Jan 10 Posts: 14044 ID: 53948 Credit: 482,306,103 RAC: 564,923
|
Guess: FFT sizes that are a power of two are faster than those that aren't.
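Just arithmetic on the two lengths (not a claim about gwnum internals), but it does show which one is the "nice" size:
# Factor both FFT lengths to see which is a pure power of two.
def factorize(n):
    factors, p = {}, 2
    while p * p <= n:
        while n % p == 0:
            factors[p] = factors.get(p, 0) + 1
            n //= p
        p += 1
    if n > 1:
        factors[n] = factors.get(n, 0) + 1
    return factors

print(factorize(256 * 1024))  # {2: 18}             -> 256K = 2^18, a pure power of two
print(factorize(240 * 1024))  # {2: 14, 3: 1, 5: 1} -> 240K = 2^14 * 3 * 5, not a power of two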
____________
My lucky number is 75898^524288+1 | |
|
composite Volunteer tester
Joined: 16 Feb 10 Posts: 1172 ID: 55391 Credit: 1,221,586,795 RAC: 1,513,610
|
Guess: FFT sizes that are a power of two are faster than those that aren't.
This sounds like a potential optimization - step up FFT to the next power of two. Care to run some experiments? | |
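For what it's worth, "next power of two" here is easy to compute (purely illustrative; whether LLR/gwnum can actually be told to do this is a question for the developers):
# Smallest power of two >= n, applied to the two FFT lengths in this thread.
def next_power_of_two(n):
    return 1 << (n - 1).bit_length()

print(next_power_of_two(240 * 1024) // 1024)  # 256 -> stepping up a 240K FFT lands on 256K
print(next_power_of_two(256 * 1024) // 1024)  # 256 -> 256K is already a power of two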
|
Michael Goetz Volunteer moderator Project administrator
Joined: 21 Jan 10 Posts: 14044 ID: 53948 Credit: 482,306,103 RAC: 564,923
|
Guess: FFT sizes that are a power of two are faster than those that aren't.
This sounds like a potential optimization - step up FFT to the next power of two. Care to run some experiments?
Thank you for volunteering! :)
As an aside... this is something better discussed with the people who, you know, actually develop this software. If you do find something that could be optimized, they're the ones who would benefit from that knowledge. If there's a question that can only be answered by a developer, it's beneficial to actually ask the developers.
____________
My lucky number is 75898^524288+1 | |
|
288larsson Volunteer tester
Joined: 17 Apr 10 Posts: 136 ID: 58815 Credit: 5,991,452,870 RAC: 3,262,552
|
Hi,
llr 3.8.23 mostly uses FFT length 240K;
llr 3.8.21 mostly uses FFT length 256K.
Tested on this host:
http://www.primegrid.com/results.php?hostid=946202 | |
|
|
aww, I think 256k is faster
host 946571
____________
My lucky number is 6219*2^3374198+1
| |
|
composite Volunteer tester
Joined: 16 Feb 10 Posts: 1172 ID: 55391 Credit: 1,221,586,795 RAC: 1,513,610
|
I just noticed this too, on 2 machines, using BOINC tasks over a span of 7 to 9 hours.
i5-4590T (4 cores, BOINC 100% CPU)
1 task @ 4 threads
240K FFT (average of 11 tasks) 1370 sec run time, 5099 sec CPU time
256K FFT (average of 7 tasks) 1802 sec run time, 6733 sec CPU time
240K FFT takes 32% more run time and 32% more CPU time than 256K FFT
i7-5820K (6 cores, HT on and BOINC 50% CPU)
3 tasks @ 2 threads each + AP27 on GPU
240K FFT (average of 11 tasks) 2645 sec run time, 4343 sec CPU time
256K FFT (average of 9 tasks) 3647 sec run time, 5535 sec CPU time
240K FFT takes 38% more run time and 27% more CPU time than 256K FFT
To prove or refute Crun-chi's conjecture that the larger FFT is better at exploiting multi-core hardware,
we would need to run PPS-MEGA tasks on a single-core system (no HT).
Does anyone have the CPU and the patience to try this?
FMA3 almost certainly isn't available on single-core hardware.
Without saying that it proves anything, we can test 1 task with 1 thread on multicore systems,
thanks to the recently introduced PrimeGrid preferences for cores and tasks.
I will report my results in a subsequent post.
It seems counterintuitive that a shorter FFT would be slower.
Is this effect similar to using a shorter word size for large number computations?
The appropriate test of this would be to try an FFT size of 280K vs 256K.
In the end, if we can't understand why the 256K FFT is faster than the 240K FFT,
we should just use what we know works better.
"Shut up and calculate", as N. David Mermin said (often misattributed to Richard Feynman).
| |
|
mackerel Volunteer tester
Joined: 2 Oct 08 Posts: 2652 ID: 29980 Credit: 570,442,335 RAC: 5,621
|
I had previously made observations in a more generalised sense. In short, multi-thread scaling does seem to "work better" with larger FFT sizes, up to the point where you run out of cache and become RAM bandwidth limited. Looking the other way, as FFTs get smaller, efficiency continues to fall. As a result of these factors, the sweet spot seems to be balancing threads/tasks to fit in your CPU cache without exceeding it.
These particular tasks have somewhat smaller FFTs. A 256K FFT is kinda borderline for 2 MB/core L3 cache CPUs, so on those 1 or 2 threads per task is probably optimal for throughput. For an i5 with only 1.5 MB/core, 2 threads per task is probably better than 1. In either case, 4 is right out (unless you only care about run time and not throughput).
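To put rough numbers on the cache argument (back-of-the-envelope only: this assumes one 8-byte double per FFT element and ignores gwnum's extra tables and padding, so the real footprint is somewhat larger):
# Approximate size of the main FFT data array, assuming 8-byte doubles.
for label, length in (("240K", 240 * 1024), ("256K", 256 * 1024)):
    mib = length * 8 / 2**20
    print(f"{label} FFT: ~{mib:.3f} MiB of transform data")
# ~1.875 MiB for 240K and ~2.000 MiB for 256K -- right at the 2 MB/core L3
# boundary mentioned above, and already over the ~1.5 MB/core of the i5.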
https://linustechtips.com/main/topic/1080453-ryzen-3600-vs-8086k-for-prime-number-finding/
Look at the 8086k 1w and 2w (workers=tasks) lines.
I'm now wondering if there is a way to force a (bigger) FFT size in LLR. If so, it would be interesting to manually run a 240k FFT test at 256k and compare. | |
|
composite Volunteer tester
Joined: 16 Feb 10 Posts: 1172 ID: 55391 Credit: 1,221,586,795 RAC: 1,513,610
|
I ran a bunch of single-thread tasks one at a time on my 6-core system.
The effect is still there, but much less pronounced than with multiple threads.
Averaging over 11 tasks for 240K FFT and 8 tasks for 256K FFT:
240K FFT used 6% more run time and 11% more CPU time than 256K FFT.
The run time is skewed on a couple of tasks. Probably the internet was unavailable for a time. | |
|
|
composite, are these "live" tasks that you are sent on BOINC from PrimeGrid? (What subproject?)
Maybe the k values determine what FFT size is used, and determine the run time as well?
Would it not be better to test one fixed candidate every time (same k and n)?
Not sure I know what I am talking about…
/JeppeSN | |
|
|
I have noticed similar behavior. This is on an i7-9700K running at a steady 4.5 GHz with dual-channel 3200 MHz RAM. Running 4 tasks at 2 threads per task.
https://www.primegrid.com/result.php?resultid=1034733243
BOINC llr wrapper (version 8.04)
Using Jean Penne's llr (64 bit)
LLR Program - Version 3.8.23, using Gwnum Library Version 29.8
LLR command line: primegrid_cllr.exe -d -oDiskWriteTime=1 -oThreadsPerTest=2 llr.in
Using all-complex FMA3 FFT length 240K, Pass1=1280, Pass2=192, clm=2, 2 threads, a = 3
Run Time: 2,510.74
https://www.primegrid.com/result.php?resultid=1034728451
BOINC llr wrapper (version 8.04)
Using Jean Penne's llr (64 bit)
LLR Program - Version 3.8.23, using Gwnum Library Version 29.8
LLR command line: primegrid_cllr.exe -d -oDiskWriteTime=1 -oThreadsPerTest=2 llr.in
Using all-complex FMA3 FFT length 256K, Pass1=128, Pass2=2K, clm=2, 2 threads, a = 7
Run Time: 1,546.69
____________
| |
|
composite Volunteer tester
Joined: 16 Feb 10 Posts: 1172 ID: 55391 Credit: 1,221,586,795 RAC: 1,513,610
|
Yes my tests were with live tasks, PPS-MEGA.
The k is only available to us if we peek in BOINC's slot directories, and that has to be done while the task is running. Technically feasible, but I'm not interested in doing this at the moment. | |
|
|
True. But when testing, you could start LLR from the command line with the same number, on different FFT sizes, and on both computers. Or you could do it with both a small k candidate and a large k. That might shed some light on why big FFT sizes seem to give shorter run-times on the live tasks. /JeppeSN | |
|
KEP
Joined: 10 Aug 05 Posts: 303 ID: 110 Credit: 13,001,669 RAC: 23,247
|
I seem to remember, from back when running x threads per task became possible, that George Woltman declared that there is a penalty when running smaller FFT length tasks: the cos/sin calculations are spread across n cores, and before the next cos/sin (or maybe it was mul/mod) calculation can be done, each and every core has to return its part of the calculation. For reasons I either haven't heard about or have plainly forgotten, those x parts do not complete their sin/cos calculations at the same time, which leaves one or more cores idle for a short time before the next cos/sin calculation.
I'm not sure if that is the explanation, but on my i5-4670 running at 3.4 GHz I see no gains from using all 4 cores per mega test compared to running 1 core per test with 4 tests simultaneously. I haven't experimented with 2 cores and 2 tests, but when running CRUS work at the megadigit level, 2 cores per test with 2 simultaneous tests is the most productive. | |
|
composite Volunteer tester
Joined: 16 Feb 10 Posts: 1172 ID: 55391 Credit: 1,221,586,795 RAC: 1,513,610
|
...those x parts do not complete their sin/cos calculations at the same time, which leaves one or more cores idle for a short time before the next cos/sin calculation.
This is a well-known effect of using multiple cooperating CPUs when they need to synchronize. It's related to the number of cores, not the FFT size. It's one reason for diminishing returns when adding more threads to a task. | |
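A toy model of that synchronization cost, with made-up numbers purely for illustration (not measured gwnum behaviour), shows why each extra thread helps less:
# Toy model of diminishing returns from per-iteration thread synchronization.
WORK = 1.0   # compute time per FFT iteration on one core (arbitrary units, assumed)
SYNC = 0.04  # fixed synchronization cost per iteration when multithreaded (assumed)

for threads in (1, 2, 3, 4, 6, 8):
    t = WORK / threads + (SYNC if threads > 1 else 0.0)
    print(f"{threads} threads: speedup {WORK / t:.2f}x (ideal {threads}x)")
# 1.00x, 1.85x, 2.68x, 3.45x, 4.84x, 6.06x -- the fixed sync cost becomes a
# larger share of the shrinking per-thread work, so returns diminish.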
|
|
With my 9900KS, the 240K FFT length PPS Mega tasks take over 2x longer than the 256K FFT length tasks. Even DIV tasks are faster than the 240K FFT length PPS Mega.
What's even more weird: trying to run any current length PPS Mega task using the independent LLR app on the command line, every single task runs with a 256K FFT length.
Could something cause the BOINC LLR app to occasionally run using a "wrong" FFT length? | |
|
Bur Volunteer tester
Joined: 25 Feb 20 Posts: 515 ID: 1241833 Credit: 415,561,849 RAC: 23,663
|
I also observed this behavior; apparently it was never really resolved why it happens?
I also noticed this difference: Pass1=128 is small and Pass2=2K is large for the fast 256K tasks, while the slower 240K tasks have a larger Pass1=1280 and a smaller Pass2=192.
Also, a = 3 for the faster task and a = 5 for the slower task.
What is the meaning of these parameters? Maybe they are the reason for the difference in CPU times?
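One thing that can be checked directly from the numbers in the logs quoted in this thread (just an observation about the values, not a description of gwnum internals):
# Pass1/Pass2 values as they appear in the result logs quoted in this thread.
cases = {
    "256K": (128, 2048),  # Pass1=128,  Pass2=2K
    "240K": (1280, 192),  # Pass1=1280, Pass2=192
}
for fft, (p1, p2) in cases.items():
    print(f"{fft}: {p1} * {p2} = {p1 * p2} = {p1 * p2 // 1024}K")
# Both products equal the stated FFT length.

So Pass1/Pass2 appear to describe how the transform is split into two passes, and my guess is that "a" is the small base LLR picked for the Proth test itself (a = 3 or a = 7 in the logs above) rather than an FFT setting, but that is something the developers would have to confirm.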
____________
1281979 * 2^485014 + 1 is prime ... no further hits up to: n = 5,700,000 | |
|