Message boards : Problems and Help : Is there a table or chart that shows the optimal CPU cache for each project?
I think this would go a long way toward answering the eternal "how many threads should I be running?" question for each project.
Michael Goetz (Volunteer moderator, Project administrator)
Joined: 21 Jan 10 Posts: 14045 ID: 53948 Credit: 486,261,643 RAC: 686,196
It changes over time as the task sizes change.
Just take a look at the task page for a completed task. Look at the output at the bottom of the page. It lists the FFT size. Multiply that by 8, and that's the amount of cache you need for one task.
Or just ask about a specific sub-project and somebody can look up what the current FFT sizes are. This is usually a good idea for conjecture projects and GCW, which have different FFT sizes depending on the K or B.
That being said, here's the current (maximum) FFT sizes for active LLR tasks. Multiply these numbers by 8:
+---------+---------------+
| project | max(fft_size) |
+---------+---------------+
| 321 | 1048576 |
| CUL | 2359296 |
| ESP | 2359296 |
| GCW | 2621440 |
| MEG | 262144 |
| PPSE | 131072 |
| PPS | 262144 |
| PSP | 2949120 |
| SGS | 131072 |
| SOB | 3932160 |
| SR5 | 983040 |
| TRP | 1310720 |
| WOO | 2293760 |
+---------+---------------+
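A minimal sketch (Python, just illustrating the arithmetic described above) that turns these FFT sizes into per-task cache footprints; the numbers are copied from the table and will drift as task sizes grow:

# Per-task cache footprint following the "FFT size x 8 bytes" rule of thumb.
# FFT sizes copied from the table above; they change over time.
fft_sizes = {
    "321": 1048576, "CUL": 2359296, "ESP": 2359296, "GCW": 2621440,
    "MEG": 262144, "PPSE": 131072, "PPS": 262144, "PSP": 2949120,
    "SGS": 131072, "SOB": 3932160, "SR5": 983040, "TRP": 1310720,
    "WOO": 2293760,
}
for project, fft in sorted(fft_sizes.items()):
    print(f"{project:5s} {fft * 8 / 2**20:6.1f} MiB per task")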
____________
My lucky number is 75898^524288+1
So if I take MEG 262144 x 8 = 2,097,152
What does that number tell me? I need 2MB L3 per work unit? Those are bytes?
Monkeydee (Volunteer tester)
Joined: 8 Dec 13 Posts: 548 ID: 284516 Credit: 1,739,968,607 RAC: 3,318,067
So if I take MEG 262144 x 8 = 2,097,152
What does that number tell me? I need 2MB L3 per work unit? Those are bytes?
Exactly
____________
My Primes
Badge Score: 4*2 + 6*2 + 7*1 + 8*11 + 9*1 + 11*3 + 12*1 = 169
Michael Goetz (Volunteer moderator, Project administrator)
Joined: 21 Jan 10 Posts: 14045 ID: 53948 Credit: 486,261,643 RAC: 686,196
So if I take MEG 262144 x 8 = 2,097,152
What does that number tell me? I need 2MB L3 per work unit? Those are bytes?
Exactly. For a PPS-MEG task, you need 2 MB of cache per task to get maximum performance.
____________
My lucky number is 75898^524288+1
So if I take MEG 262144 x 8 = 2,097,152
What does that number tell me? I need 2MB L3 per work unit? Those are bytes?
Exactly. For a PPS-MEG task, you need 2 MB of cache per task to get maximum performance.
Thank you Michael!
composite (Volunteer tester)
Joined: 16 Feb 10 Posts: 1174 ID: 55391 Credit: 1,236,214,790 RAC: 2,189,600
That's fine, but you actually need more cache than minimum because the O/S, and even the BOINC client itself, has a habit of stealing some of your CPU cycles and cache lines.
Ideally, we would be running a BOINC-client based O/S that would function similar to DOS.
mackerel (Volunteer tester)
Joined: 2 Oct 08 Posts: 2652 ID: 29980 Credit: 570,442,335 RAC: 3,386
That's fine, but you actually need more cache than minimum because the O/S, and even the BOINC client itself, has a habit of stealing some of your CPU cycles and cache lines.
At a first level it doesn't need to be that complicated, and these figures are better seen as guidelines than as a hard rule.
I started trying to figure this out in the pre-Ryzen era. Intel CPUs had inclusive caches at the time, so only L3 needed to be considered. The rule of thumb I came up with is that you want to fill the L3 cache as best you can without exceeding it. Due to inefficiencies with multi-threading, too many threads relative to the size of the task gave poor performance. It isn't all or nothing, but a transition around the cache boundary.
With Ryzen, and non-inclusive cache CPUs from Intel (Skylake-X, Ice Lake mobile, Rocket Lake desktop, and derivatives of such), I'm seeing hints that it may be correct to also include L2 on top of L3. Generally speaking L2 is much smaller so doesn't make much difference, outside of Skylake-X where L2 is almost as big as L3. It seems to work on Skylake-X, but I never looked at it in depth for other CPUs where L2 is relatively small.
Also note that with many Ryzen configurations, the L3 cache isn't a single lump but must be considered as separate chunks. Up to and including Zen 2, a CCX was at most 4 cores with its own slice of L3, so an 8-core CPU was best seen as two 4-core CPUs. With Zen 3 the CCX was increased to 8 cores, so you only have to worry about this on models above 8 cores.
IMO the OS overhead isn't a major factor. Sure, there may be background tasks stealing some CPU cycles here and there, but if they take some cache they also get displaced when done, and it doesn't affect performance much unless you're chasing a small fractional throughput benefit. If you were to dedicate a farm to doing this and nothing else, maybe a stripped-down Linux install might make sense, but for the masses it won't make a significant difference.
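As a rough sketch of that rule of thumb (assuming a single unified L3 and ignoring OS/BOINC overhead), the check is just integer division of the cache size by the per-task footprint:

# How many tasks fit in L3 without exceeding it, per the rule of thumb above.
# Assumes one unified L3 block and ignores OS/BOINC overhead.
def tasks_that_fit(l3_bytes, fft_size):
    footprint = fft_size * 8          # ~8 bytes per FFT point
    return l3_bytes // footprint

# Example: 16 MiB L3 with PPS-MEG tasks (FFT 262144 -> 2 MiB each) -> 8 tasks
print(tasks_that_fit(16 * 2**20, 262144))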
Jay
Joined: 27 Feb 10 Posts: 136 ID: 56067 Credit: 65,857,807 RAC: 9,485
It changes over time as the task sizes change.
...
That being said, here's the current (maximum) FFT sizes for active LLR tasks. Multiply these numbers by 8:
[FFT size table quoted above]
How does FFT change? I assume it doesn't increase by 1. Is there any record over time of FFT sizes that could be used to interpolate future growth?
When I purchase a new machine I don't want to just get something that is adequate for right now, but something that will have enough cache for near-future FFT sizes as well.
mackerel (Volunteer tester)
Joined: 2 Oct 08 Posts: 2652 ID: 29980 Credit: 570,442,335 RAC: 3,386
How does FFT change? I assume it doesn't increase by 1.
The way I understand it is that FFT sizes need to get bigger when you do tests on bigger numbers. This is to keep the required accuracy so the result does not have errors. There are some sweet-spot FFT sizes, and those are what get used.
As for buying a CPU, generally I wouldn't worry about it unless you really intend to optimise PrimeGrid work over all else. Even then, you can consider which projects you're interested in. Running a task fast is not the same as getting the most tasks done in a given time.
Jay
Joined: 27 Feb 10 Posts: 136 ID: 56067 Credit: 65,857,807 RAC: 9,485
The way I understand it is that FFT sizes need to get bigger when you do tests on bigger numbers. This is to keep the required accuracy so the result does not have errors. There are some sweet-spot FFT sizes, and those are what get used.
As for buying a CPU, generally I wouldn't worry about it unless you really intend to optimise PrimeGrid work over all else. Even then, you can consider which projects you're interested in. Running a task fast is not the same as getting the most tasks done in a given time.
That's exactly what I do hope to do.
Yes, FFT increases with bigger numbers being tested. I focus on SOB tasks. Based on the chart posted earlier I'm looking at 3932160*8 = 31,457,280. I'm going to round up and call it 32 MB per task.
Maybe I can find a processor with 32 MB and it will be adequate for today's tasks. But what happens when FFT increases? The first question is how much it increases at a time. Will the next increase bump it up to 33 MB? Or to 36 MB? Or double it to 64 MB? Or something else? I assume those "sweet spot FFT sizes" are known. What's the next one? What are the next 20?
If that information is known, the next question is how often does FFT increase? Does it increase with each individual task? Does it increase when tasks double in size? Will the next "sweet spot FFT size" come into play when tasks are just a bit bigger than they are now, or not until they are a lot bigger than now?
I understand that I'm concerned with future happenings. I also know that making predictions is difficult (especially about the future). Which is why I'm hoping there is a chart of FFT vs time, or a general rule that someone can look at and interpolate from. Can I say FFT doubles every month? Every year? Every 3 years? etc.
Then I can say I'll likely need XXX MB L3 cache to fit the tasks expected for the next Y years. Then I can look at CPUs and select one based on that info.
Yves Gallot (Volunteer developer, Project scientist)
Joined: 19 Aug 12 Posts: 846 ID: 164101 Credit: 306,576,688 RAC: 5,420
How does FFT change? I assume it doesn't increase by 1.
FFT size is of the form 2^n · 3^p · 5^q · 7^r where p, q and r are "small". The optimal set of parameters depends on the implementation: 2^4 may run faster than 3 · 5 on some architectures even though the size is about the same, etc.
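As an illustration (a sketch only; which of these lengths a given gwnum build actually provides is up to the library), the candidate sizes can be enumerated directly, which shows why the table above jumps between particular values:

# Enumerate FFT lengths of the form 2^n * 3^p * 5^q * 7^r with small p, q, r.
def smooth_sizes(limit, max_pow3=2, max_pow5=2, max_pow7=1):
    sizes = set()
    for p in range(max_pow3 + 1):
        for q in range(max_pow5 + 1):
            for r in range(max_pow7 + 1):
                n = 3**p * 5**q * 7**r
                while n <= limit:       # multiply in powers of two
                    sizes.add(n)
                    n *= 2
    return sorted(sizes)

# Candidate sizes between 2M and 4M points, the range SoB currently sits in
print([s for s in smooth_sizes(4 * 2**20) if s >= 2 * 2**20])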
composite (Volunteer tester)
Joined: 16 Feb 10 Posts: 1174 ID: 55391 Credit: 1,236,214,790 RAC: 2,189,600
Does the FFT memory access pattern already maximize the cache hit ratio?
A project's memory access pattern is relevant when you have less than the optimal amount of cache. For instance, if your cache is a little smaller than what you need to fit the whole working set, then moving up to the next larger cache size may not be worth the marginal increase in speed if the cache hit ratio is already over, say, 99%.
At the opposite extreme, a cache hit rate of 0% makes the cache useless. This happens when an array is larger than the cache and you repeatedly cycle through it from beginning to end: with an LRU replacement policy, this access pattern evicts existing cache entries just before they are needed. If the algorithm allows it, an improved access pattern would be to reverse the direction of stepping through the array on each cycle, thus avoiding cache misses except at both ends of the array.
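A toy simulation of that last point (a fully associative LRU cache with purely illustrative sizes): cycling forward through an array of 120 "lines" with room for only 100 misses on essentially every access after the first sweep, while reversing direction on alternate sweeps only misses near the ends of the array.

from collections import OrderedDict

def lru_misses(accesses, cache_lines):
    cache, misses = OrderedDict(), 0
    for line in accesses:
        if line in cache:
            cache.move_to_end(line)           # refresh recency on a hit
        else:
            misses += 1
            cache[line] = True
            if len(cache) > cache_lines:
                cache.popitem(last=False)     # evict the least recently used line
    return misses

array_lines, cache_lines, sweeps = 120, 100, 10
forward = list(range(array_lines)) * sweeps
pingpong = []
for s in range(sweeps):
    pingpong += range(array_lines) if s % 2 == 0 else range(array_lines - 1, -1, -1)
print(lru_misses(forward, cache_lines), lru_misses(pingpong, cache_lines))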
That's exactly what I do hope to do.
...
Then I can say I'll likely need XXX MB L3 cache to fit the tasks expected for the next Y years. Then I can look at CPUs and select one based on that info.
Easy rule of thumb: buy the CPU with the biggest cache that fits in your budgets (initial outlay and power bill). In terms of currentishly available options, if you want it to last a long time, think AMD 5900X/5950X (64MB 12/16c) or take a chance on the supposed-to-be-released-already 5800X3D (96MB 8c). If you've got the money to burn, go high-core count Cascade Lake-X desktop or -SP Xeons, Ice Lake-SP Xeons, or Epyc 7xx3 (ordered by increasing cost). Honorable mention to Threadripper (up to 256MB cache), but it's stuck on last gen architecture.
____________
Eating more cheese on Thursdays.
Yves Gallot (Volunteer developer, Project scientist)
Joined: 19 Aug 12 Posts: 846 ID: 164101 Credit: 306,576,688 RAC: 5,420
Does the FFT memory access pattern already maximize the cache hit ratio?
The basic FFT algorithm is not cache-efficient, but the implementations in gwnum (LLR2) or genefer are optimized for the modern cache hierarchy.
A simple test for beginners is to compare, on current CPUs, the recursive implementation of the Cooley-Tukey algorithm with the iterative one. The iterative version is expected to run faster because function calls are replaced by indexing, but this is not true because of the memory access pattern.
The first memory-efficient algorithm was implemented in 1989 on Cray supercomputers ("FFTs in External or Hierarchical Memory").
Today many algorithms exist, but they are all based on the same method: replace the array of size n with a matrix of roughly sqrt(n) x sqrt(n) and perform the convolution in two passes.
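For illustration only (this is the textbook "four-step" structure, not the actual gwnum or genefer code), the two-pass idea can be sketched with numpy: row FFTs, a twiddle multiply, then column FFTs on an n1 x n2 matrix.

import numpy as np

def four_step_fft(x, n1, n2):
    # A[j1, j2] = x[j1 + n1*j2]: view the length n1*n2 vector as a matrix
    a = x.reshape(n2, n1).T
    b = np.fft.fft(a, axis=1)                          # pass 1: n1 FFTs of length n2
    j1 = np.arange(n1).reshape(-1, 1)
    k2 = np.arange(n2).reshape(1, -1)
    b = b * np.exp(-2j * np.pi * j1 * k2 / (n1 * n2))  # twiddle factors
    c = np.fft.fft(b, axis=0)                          # pass 2: n2 FFTs of length n1
    return c.reshape(-1)                               # X[k2 + n2*k1] = C[k1, k2]

x = np.random.rand(1024) + 1j * np.random.rand(1024)
print(np.allclose(four_step_fft(x, 32, 32), np.fft.fft(x)))   # True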
mackerel (Volunteer tester)
Joined: 2 Oct 08 Posts: 2652 ID: 29980 Credit: 570,442,335 RAC: 3,386
FFT sizes going up depends on the distribution of candidates to test and how many people are going through those tests. At best it is a moving target but I'd say it is relatively slow. It is not going to double in a month, and I wouldn't expect a doubling in the magnitude of a year unless it was a very new subproject and things are moving fast. SoB feels slow, in more ways than one. Actually, you could look up known SoB primes and run the test on them manually to see what FFT was used. Combined with discovery date you have some kind of history, and can attempt extrapolation at your own risk.
You can see the FFT steps if you run a Prime95 benchmark on your system. LLR uses the same FFT code so they can be carefully used to try and work out performance. Note you may see different FFT sizes used on different CPUs, e.g. Ryzen vs Intel.
Here's some testing I did in the past that tries to show the performance considerations graphically. Skimming through it again, I think that was before AVX-512 support.
https://linustechtips.com/topic/838473-cpu-performance-with-prime95-293/
mackerel (Volunteer tester)
Joined: 2 Oct 08 Posts: 2652 ID: 29980 Credit: 570,442,335 RAC: 3,386
Easy rule of thumb: buy the CPU with the biggest cache that fits in your budgets (initial outlay and power bill). In terms of currentishly available options, if you want it to last a long time, think AMD 5900X/5950X (64MB 12/16c) or take a chance on the supposed-to-be-released-already 5800X3D (96MB 8c). If you've got the money to burn, go high-core count Cascade Lake-X desktop or -SP Xeons, Ice Lake-SP Xeons, or Epyc 7xx3 (ordered by increasing cost). Honorable mention to Threadripper (up to 256MB cache), but it's stuck on last gen architecture.
Keep in mind that the desire for cache is to allow the CPU cores to run at their maximum potential without being held back by ram performance. If the cache requirement is met, performance can scale with cores. If work goes beyond the cache, then ram performance will become relevant.
A big caution with AMD CPUs: the L3 cache may not be unified. The 12/16-core Zen 3 models are two CCXs, so are best seen as two lumps of 32MB, not one lump of 64MB. I have to say I've not done any testing on anything newer than Zen 2, but to my understanding of the architecture there is no way to directly share data between L3 cache segments (CCXs) without going out to RAM. The internal bandwidth of AMD CPUs is generally comparable to dual-channel RAM bandwidth, so it would get congested fast if multiple segments try to do so at the same time. It still shows some improvement, but doesn't scale anywhere near where it could had the cache been unified.
The upcoming 5800X3D does have 96MB of unified cache, but only 8 cores, so those 8 cores are practically never cache-limited. Note that with such a large cache you will probably get more throughput running two SoB tasks with 4 cores each than one task with 8, as a task is more efficient the fewer threads you use on it. Three tasks might give even higher throughput still, but you'd have to find a way to manage a 3+3+2 core split to try it, and it might not be great long term given how close it is to the cache limit already.
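A rough planning sketch of that reasoning (treating each CCX's L3 as its own budget; the core and cache figures passed in are assumptions, not a parts database):

def plan(ccx_count, cores_per_ccx, l3_per_ccx_bytes, fft_size):
    per_task = fft_size * 8                      # ~8 bytes per FFT point
    fit = l3_per_ccx_bytes // per_task           # tasks that fit in one CCX's L3
    if fit == 0:
        return "one task per CCX already spills out of L3 into RAM"
    tasks = min(fit, cores_per_ccx)              # no point running more tasks than cores
    return f"{tasks * ccx_count} task(s) total, up to {cores_per_ccx // tasks} thread(s) each"

# 5950X-style layout: 2 CCXs with 8 cores and 32 MiB L3 each, SoB FFT size
print(plan(2, 8, 32 * 2**20, 3932160))    # -> 2 task(s) total, up to 8 thread(s) each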
composite (Volunteer tester)
Joined: 16 Feb 10 Posts: 1174 ID: 55391 Credit: 1,236,214,790 RAC: 2,189,600
Does the FFT algorithm use the whole memory allocation throughout the calculation, or does it have a peaking usage pattern where releasing some of the memory part way through the calculation would benefit other tasks that need cache? If the latter, two tasks could be synchronized, without actually releasing the memory, with a shared mutex so that one task is allowed to use more cache while the other needs less. In this situation the mutex acts like a traffic cop at an intersection: the cache's bandwidth usage would be maximized rather than becoming congested by simultaneous maximum demand.
Keep in mind that the desire for cache is to allow the CPU cores to run at their maximum potential without being held back by ram performance. If the cache requirement is met, performance can scale with cores. If work goes beyond the cache, then ram performance will become relevant.
A big caution with AMD CPUs: the L3 cache may not be unified. The 12/16 core Zen 3 models are two CCX, so are best seen as two lumps of 32MB, not one lump of 64MB. I'll have to say I've not done any testing on them since Zen 2, but to my understanding of the architecture there is no way to directly share data between each L3 cache segment (CCX) without going out to ram. Internal bandwidth of AMD CPUs is generally comparable to dual channel ram bandwidths, so it would get congested fast if multiple segments try to do so at the same time. It still shows some improvement but doesn't scale near where it could be had the cache been unified.
The upcoming 5800X3D does have 96MB of unified cache but only 8 cores. You have practically unlimited performance of those 8 cores. Note with such a large cache, you will probably get more throughput running two SoB tasks with 4 cores each than one task of 8, as it is more efficient the fewer threads you use on a single task. 3 tasks might even be higher throughput still but you'd have to find a way to manage 3+3+2 core split to try it, and it might not be great long term given how close it is to the limit already.
Can't remember where it popped up, but someone on here did do some tests with big tasks across CCX's and while it wasn't the best scaling, it was still noticeably faster (by throughput) than running 1 task per CCX and going to main memory.
I am very tempted by the 5800X3D myself, but the sad reality is that I can pick up a used Skylake-X/Cascade Lake-X setup with more cores for less than it would probably cost, and with AVX-512 plus quad-channel memory it would vastly outperform the AMD, even if the task is bigger than the cache.
____________
Eating more cheese on Thursdays.
mikey
Joined: 17 Mar 09 Posts: 1915 ID: 37043 Credit: 835,132,210 RAC: 806,003
That's fine, but you actually need more cache than minimum because the O/S, and even the BOINC client itself, has a habit of stealing some of your CPU cycles and cache lines.
Ideally, we would be running a BOINC-client based O/S that would function similar to DOS.
I've always wondered why someone doesn't build one from Linux for the BOINC community. I realize the amount of time and energy to build one isn't small, but being able to focus just on BOINC and not all the other stuff modern-day OSes load would be much more efficient.
mackerel (Volunteer tester)
Joined: 2 Oct 08 Posts: 2652 ID: 29980 Credit: 570,442,335 RAC: 3,386
Can't remember where it popped up, but someone on here did do some tests with big tasks across CCX's and while it wasn't the best scaling, it was still noticeably faster (by throughput) than running 1 task per CCX and going to main memory.
If you're running in cache then you shouldn't be going to memory?
I'll try to illustrate my best understanding:
Say you wanted to do SoB. Based on the table earlier, this will just fit in 32MB of cache. If you have a full fat Zen 3 CPU and not one of the APUs, you have 32MB per CCX (up to 8 cores). Running one task on that is optimal. If you have a higher Zen 3 CPU with 12 or 16 cores, you have two CCX. Without owning one, my prediction is running two SoB tasks, one per CCX, is optimal and higher throughput than one task using both CCX. Note this probably requires you to use something like Process Lasso to keep each task on a single CCX. Without it, Windows does bad things and you can get worse performance than expected. I'm looking at best performance with correct configuration, which should not be confused with less optimal configurations appearing better than another unoptimised configuration.
Say SoB FFT sizes increase so that 32MB is no longer enough for one. Then I'd expect one task using both CCX to become optimal.
Now if you had an 8 core Zen 2 CPU, you have two CCX with 4 cores and 16MB each. In that case, you would run one task even though you are crossing the CCX barrier, because you don't have nearly enough resource to run it unconstrained.
I am very tempted by the 5800X3D, myself, but the sad reality is that I can pick up a used Skylake-X/Cascade Lake-X setup with more cores for less than it would probably cost, and having AVX-512+quad channel memory, would vastly outperform the AMD, even if the task is bigger than the cache.
It is a trade-off between many factors. As fast as the Skylake-X family is, it is relatively power-inefficient even if it has high peak rates when fed adequately. Intel dropping official AVX-512 support from Alder Lake doesn't help, but if rumours turn out right AMD will be making its AVX-512 debut with Zen 4, so that might be one to keep an eye out for. Note that even single-unit AVX-512 implementations seem to give a benefit in this workload, even if not as much as dual-unit implementations.
Does the FFT memory access pattern already maximize the cache hit ratio?
Basic FFT algorithm is not efficient but implementations of gwnum (LLR2) or genefer are optimized for modern cache hierarchy.
A simple test for beginners is the comparison on current CPUs of the recursive implementation of Cooley-Tukey algorithm and the iterative one.
The iterations are expected to run faster because function calls are replaced by indexing but this is not true because of memory access pattern.
The first memory efficient algorithm was implemented in 1989 on Cray supercomputers FFTs in External or Hierarchical Memory.
Today many algorithms exist but they are all based on the same method: replace the array of size n with a matrix ~ sqrt(n) x sqrt(n) and perform a convolution with two passes.
In modern meteorology, direct and inverse spherical-harmonics transforms make intensive use of the FFTW library to solve the Navier-Stokes equations in the spectral domain across multiple nodes, sockets and cores.
The "ECMWF HPC2020" European facility is dedicated to weather prediction and climate change.
FFTW notably features auto-tuning functionality to adapt to all kinds of topologies and architectures...
Keeping your hands on the software implementation details is very often a valuable approach too!
References:
https://www.fftw.org/ and https://events.ecmwf.int/event/169/timetable/
https://www.ecmwf.int/sites/default/files/medialibrary/2022-01/NL-170-C3-DellAcqua-Figure-2.jpg
composite (Volunteer tester)
Joined: 16 Feb 10 Posts: 1174 ID: 55391 Credit: 1,236,214,790 RAC: 2,189,600
https://www.ecmwf.int/sites/default/files/medialibrary/2022-01/NL-170-C3-DellAcqua-Figure-2.jpg
I am surprised that this computing facility doesn't use a hot-aisle/cold-aisle design to increase energy efficiency.
https://datacenterenclosure.com/hot-and-cold-aisles-in-your-data-center-what-to-know/
https://www.ecmwf.int/sites/default/files/medialibrary/2022-01/NL-170-C3-DellAcqua-Figure-2.jpg
I am surprised that this computing facility doesn't use hot aisle or cool aisle design to increase energy efficiency.
https://datacenterenclosure.com/hot-and-cold-aisles-in-your-data-center-what-to-know/
The answer is watercooling. Some sort of geek-inspired watercooling... at another scale!
The data centre cooling system relies on three sources, operating alternately and together, based on the season: geothermal wells, adiabatic dry coolers and chillers.
https://www.ecmwf.int/en/newsletter/170/computing/ecmwfs-new-data-centre-italy
A heat exchanger at the bottom of the rack connects to the building’s water cooling system.
https://www.ecmwf.int/en/newsletter/163/computing/hpc2020-ecmwfs-new-high-performance-computing-facility
Yves Gallot (Volunteer developer, Project scientist)
Joined: 19 Aug 12 Posts: 846 ID: 164101 Credit: 306,576,688 RAC: 5,420
In modern meteorology, direct and inverse spherical-harmonics transforms make intensive use of the FFTW library to solve the Navier-Stokes equations in the spectral domain across multiple nodes, sockets and cores.
The "ECMWF HPC2020" European facility is dedicated to weather prediction and climate change.
FFTW notably features auto-tuning functionality to adapt to all kinds of topologies and architectures...
Keeping your hands on the software implementation details is very often a valuable approach too!
FFT libraries are not efficient for intensive computing.
One step is "direct transform" => "spectral domain operations" => "inverse transform" => "spatial/temporal domain operations". For the prime number search this step is: "direct transform" => "pointwise product" => "inverse transform" => "carry propagation".
Because of cache hierarchy, two passes are more efficient to perform the transform. Then the flow is:
"transform pass 1" => "transform pass 2" => "product" => "inverse transform pass 2" => "inverse transform pass 1" => "carry".
"transform pass 2" => "product" => "inverse transform pass 2" can be gather together then memory is accessed once rather than thrice. The new flow is "transform pass 1" => "convolution pass 2" => "inverse transform pass 1" => "carry".
Now we can also gather together the end of one step "inverse transform pass 1" => "carry" and the beginning of the next step "transform pass 1". Finally we have "transform pass 1" => "convolution pass 2" => "convolution pass 1" => "convolution pass 2" => ... => "convolution pass 2" => "inverse transform pass 1" => "carry".
We have two passes for one step rather than six. FFTW implements "FFT pass 1" => "FFT pass 2" but the sub-functions are internal and not accessible. Then we can't optimize the implementation of a convolution with it.
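As a structural sketch of that flow (hypothetical stub functions purely for illustration; the real gwnum/genefer kernels are far more involved):

def transform_pass1(x): pass          # forward FFT, pass 1 (rows of the ~sqrt(n) x sqrt(n) matrix)
def convolution_pass2(x): pass        # fused: forward pass 2 + pointwise product + inverse pass 2
def convolution_pass1(x): pass        # fused: inverse pass 1 + carry + next step's forward pass 1
def inverse_transform_pass1(x): pass  # inverse FFT, pass 1 (tail of the final step)
def carry_propagation(x): pass        # final carry propagation

def iterate_squarings(x, steps):
    # Two memory passes per squaring step instead of six.
    transform_pass1(x)
    convolution_pass2(x)
    for _ in range(steps - 1):
        convolution_pass1(x)
        convolution_pass2(x)
    inverse_transform_pass1(x)
    carry_propagation(x)
    return x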
Another issue is that the FFT is just one of the transforms we can implement; many others exist. Each transform is related to a polynomial: the polynomial of the FFT is x^n - 1. For GFN, the transform related to x^n + 1 is more efficient than a "weighted" FFT. If you are familiar with signal processing, an appropriate Z-transform is more efficient than FFT => spectral-domain filter => inverse FFT. FFT libraries are useful for tests but not for the final implementation.
Of course FFTW's algorithms were analyzed. In our case, the auto-tuning functionality generates an implementation which is close to the LLR2 or genefer patterns. We have a single architecture (x64) and the topology is almost always the same, so a single implementation and a single set of parameters are sufficient.
Clearly, the question has been analysed in depth! And these optimisations are uncommon... in my field.
Out of pure curiosity, I had a quick look at the multi-threaded portion of the genefer source and the 2015 article.
https://www.researchgate.net/publication/307642059_Genefer_Programs_for_Finding_Large_Probable_Generalized_Fermat_Primes
Is the methodology above described in an article complementary to the 2015 one (the 2014 one)?
My tools are derived from the following works, but are not as subtly optimised as yours.
https://www.researchgate.net/scientific-contributions/Paul-N-Swarztrauber-70757609
https://sites.ecmwf.int/docs/atlas/getting_started/install_transi/