Message boards :
Generalized Fermat Prime Search :
genefer 3.3.4
Author |
Message |
Yves Gallot Volunteer developer Project scientist Send message
Joined: 19 Aug 12 Posts: 843 ID: 164101 Credit: 306,554,426 RAC: 5,458

|
A beta version of genefer 3.3.4 (64-bit Windows) is available at
https://app.assembla.com/spaces/genefer/subversion/source/HEAD/trunk/bin/windows
The AVX and FMA3 implementations are based on a two-pass transform for n = 9, 13, 17 and 21 (the previous version was recursive). This is expected to reduce the number of memory accesses. n = 9, 13, 17 are useful for testing and n = 21 is the target of this implementation (and the Winter Solstice Challenge!).
A multi-threaded version is available (-nt flag).
On my laptop (i5-7300HQ), the estimated time for GFN-21 is:
genefer_windows64.exe -x fma3 -q 250000^2097152+1
genefer 3.3.3-3 124 h
genefer 3.3.4-dev 108 h
genefer 3.3.4-dev -nt 2 65 h
genefer 3.3.4-dev -nt 3 58 h
genefer 3.3.4-dev -nt 4 58 h
2 threads are efficient but not 3 or 4... I don't know why.
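To compare these numbers with results from other machines, the estimated times can be turned into a speedup and a parallel efficiency relative to the single-threaded run. This is only a minimal sketch; the function name `scaling` is illustrative and not part of genefer. With the times above, 2 threads come out at roughly 83% efficiency and 4 threads at under 50%.

```python
# Estimated GFN-21 times in hours, genefer 3.3.4-dev on the i5-7300HQ (from the list above).
times_h = {1: 108, 2: 65, 3: 58, 4: 58}

def scaling(times):
    """Return {threads: (speedup, efficiency)} relative to the 1-thread time."""
    base = times[1]
    return {nt: (base / t, base / t / nt) for nt, t in times.items()}

for nt, (speedup, efficiency) in scaling(times_h).items():
    print(f"-nt {nt}: speedup {speedup:.2f}x, efficiency {efficiency:.0%}")
```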
I'm interested in benchmarks on other hardware.
Thanks, Yves
| |
|
RafaelVolunteer tester
 Send message
Joined: 22 Oct 14 Posts: 918 ID: 370496 Credit: 611,432,322 RAC: 719,902
                         
|
Took me a while to find the 3.3.3-3 version, so here's a quick link for everyone else:
https://app.assembla.com/spaces/genefer/subversion/source/HEAD/tags/3.3.3/windows/genefer_windows64.exe?_format=raw
Anyway, here are my tests:
- i5 4590 @ 3500 MHz, 2x8GB 2133 MHz DDR3
genefer 3.3.3-3 134 h
genefer 3.3.4-dev 129 h
genefer 3.3.4-dev -nt 2 67 h 50 min
genefer 3.3.4-dev -nt 3 48 h 20 min
genefer 3.3.4-dev -nt 4 41 h 50 min
- i5 6600k @ 4100 MHz, 2x8GB 3000 MHz DDR4
genefer 3.3.3-3 87 h 30 min
genefer 3.3.4-dev 80 h 20 min
genefer 3.3.4-dev -nt 2 42 h
genefer 3.3.4-dev -nt 3 29 h 50 min
genefer 3.3.4-dev -nt 4 25 h 50 min
As a bonus, I tried running 2 nt2 tasks at once on the 4590. It jumped from ~68h to an average of 81h per WU, so it's basically equivalent to running with 4 threads. Not sure if this is good, because it means you can just use nt4 and not really lose that much performance, or bad, because there's a clear bottleneck somewhere (RAM, most likely) that severely limits performance.
As a last note, I ran the same test on my GTX 960, and to my surprise, it would take 33h, i.e. longer than running with 3 threads on my Skylake machine. Granted, the 960 is not the flashiest thing, but it's still really impressive to see a midrange CPU (by today's standards) beat a low-midrange GPU. | |
|
Honza Volunteer moderator Volunteer tester Project scientist Send message
Joined: 15 Aug 05 Posts: 1963 ID: 352 Credit: 6,419,927,798 RAC: 2,655,298
                                      
|
Windows 10 x64, i7 8700k.
genefer 3.3.3-3 79 h 10 min
genefer 3.3.4-dev 74 h 00 min
genefer 3.3.4-dev -nt 2 39 h 20 min
genefer 3.3.4-dev -nt 3 27 h 00 min
genefer 3.3.4-dev -nt 4 21 h 50 min
genefer 3.3.4-dev -nt 5 19 h 50 min
genefer 3.3.4-dev -nt 6 19 h 30 min
Fury Nano would need 16 h 30 min.
Running 2 CPU tests each using 3 threads would be faster.
To compare CPU vs GPU, I used the 1620000^131072+1 test.
CPU i7 8700K 6 threads - Estimated time for 1620000^131072+1 is 0:05:24
GPU Fury Nano - Estimated time for 1620000^131072+1 is 0:04:58
Running 2 CPU tests, each using 3 threads, would take about 7 minutes for the pair, i.e. 3 min 30 s per test, which is faster than the GPU.
____________
My stats | |
|
mackerel Volunteer tester
 Send message
Joined: 2 Oct 08 Posts: 2652 ID: 29980 Credit: 570,442,335 RAC: 5,621
                              
|
Just a quick test before bed.
i7-8086k stock (clocks will vary depending on active cores), HT disabled, ram 3000C14 dual channel single rank
genefer_windows64.exe -x fma3 -q "250000^2097152+1"
genefer 3.3.3-3 76:10 h
genefer 3.3.4-dev 70:10 h
genefer 3.3.4-dev -nt 2 37:40 h
genefer 3.3.4-dev -nt 3 27:50 h
genefer 3.3.4-dev -nt 4 24:30 h
genefer 3.3.4-dev -nt 5 23:40 h
genefer 3.3.4-dev -nt 6 23:00 h
What is the relationship between n or N and FFT size again?
Edit: found an old post, N = FFT size, so that would mean a data set of around 16MB, too big for most CPUs' L3, but an 8+ core CPU (assuming 2MB per core) might get better scaling. We're probably RAM bandwidth limited.
Honza, what clock is your 8700k running at, and what is the ram configuration in that system? I'd assume faster ram than I have... | |
|
|
Estimates for an i7-6700k @ 4.4 GHz, HT disabled, 3200 C16 dual-channel memory:
1 thread 74:10h
2 threads 39:30
3 threads 27:40
4 threads 23:20
I tried running two 2-thread tasks simultaneously, with a 48:20 estimated time to complete. The throughput of one 4-thread task looks to be better.
I am definitely liking the MT option. This takes CPUs into the realm of GPU performance, with the exception of the more recent high end GPUs, and uses less power. | |
|
Michael Goetz Volunteer moderator Project administrator
 Send message
Joined: 21 Jan 10 Posts: 14044 ID: 53948 Credit: 482,306,103 RAC: 564,923
                               
|
I've got a guess.
The tests were run on n=21, so this is a big number, and it's not going to fit into cache, even when running just 1 task. Memory speed is therefore important.
When running LLR with multithreading, we see a big advantage since it seems to alleviate the penalty due to slow memory and cache misses. For this to happen, you need to have better locality of reference when running a single task than when running multiple tasks. That implies that LLR and/or gwnum is very good about locality of reference, keeping memory references to the same "something" such that cache is used efficiently, even between multiple threads. I'm guessing George and possibly Jean put a lot of effort into making it work like that.
If correct, this would suggest that it might be possible to further optimize the code to improve the cache utilization.
____________
My lucky number is 75898^524288+1 | |
|
mackerel Volunteer tester
 Send message
Joined: 2 Oct 08 Posts: 2652 ID: 29980 Credit: 570,442,335 RAC: 5,621
                              
|
Based on my testing with Prime95 and LLR, it looks like best performance when running multi-threaded is when the total working data fits inside the L3 cache. Above that, you run into ram bandwidth limiting. Much below that, you lose efficiency due to the multi-thread process in some way. It remains to be seen if gfn scales similarly, although with the more limited FFT size choice it might be less clear.
One test we haven't done is 4x single thread gfn units, compared to the 4 thread one. Where ram bandwidth limited, both should give approximately the same throughput. If scaling test were to be done on a large work unit like PSP, I'd expect to see similar scaling there. | |
|
Honza Volunteer moderator Volunteer tester Project scientist Send message
Joined: 15 Aug 05 Posts: 1963 ID: 352 Credit: 6,419,927,798 RAC: 2,655,298
                                      
|
Honza, what clock is your 8700k running at, and what is the ram configuration in that system? I'd assume faster ram than I have...
It was my home setup, 4x4GB DDR4-2400.
Office machine is also i7 8700K, but 4x8GB DDR4-2133, tests on this configuration see below.
Single thread is around 4.5 GHz; 2 threads are ~4.4 GHz; 6 threads are ~4.3 GHz.
Note that the more threads are used, the slower this configuration is compared to my home machine.
It seems slower memory has a bigger impact with more threads.
I *think* this might also be why Yves sees a drop in performance with his dual-channel i5 7300HQ, along with what Mike described about CPU cache misses.
genefer 3.3.3-3 82 h 30 min
genefer 3.3.4-dev 77 h 00 min
genefer 3.3.4-dev -nt 2 41 h 10 min
genefer 3.3.4-dev -nt 3 29 h 20 min
genefer 3.3.4-dev -nt 4 26 h 10 min
genefer 3.3.4-dev -nt 5 24 h 50 min
genefer 3.3.4-dev -nt 6 19 h 30 min
____________
My stats | |
|
Yves Gallot Volunteer developer Project scientist Send message
Joined: 19 Aug 12 Posts: 843 ID: 164101 Credit: 306,554,426 RAC: 5,458

|
Thanks for the benchmarks. All processors (Haswell, Skylake, Kaby Lake, Coffee Lake) show the same behaviour:
- genefer 3.3.4 runs faster than genefer 3.3.3 (single-threaded). This should be especially true if more than one task is running, because of better cache usage.
- 2 threads are fast, and it may be faster to run n "genefer -nt 2" tasks than 2n "genefer" tasks on 2n cores.
- 3 threads or more are inefficient.
What is the relationship between n or N and FFT size again?
[...]
The tests were run on n=21, so this is a big number, and it's not going to fit into cache, even when running just 1 task. Memory speed is therefore important.
We have n = 21, N = 2^21, data size = N * 8 = 16 MB.
We also have the cos/sin table (read only): 10 MB (I can try to reduce this).
2^21 real numbers = (2^10)^2 complex numbers.
This is a complex matrix 1024*1024.
A 2 pass algorithm processes first 1024 columns and then 1024 lines.
For data locality, 4 columns are copied to a scratch area, lines are naturally local.
1024 complex numbers = 16 kB, 4 columns = 64 kB, locality in L1/L2 is not the problem.
The shared L3 cache is not large enough for n = 21. Then we have to read 16*2 + 10 = 42 MB and write 16*2 = 32 MB per step.
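The sizes quoted above can be reproduced with a few lines of arithmetic. This sketch assumes 8-byte doubles, 16-byte complex numbers, that "MB" means 2^20 bytes, and that each of the two passes reads and writes the whole data set once while the cos/sin table is read once per step; that last point is an interpretation of the "16*2 + 10" figure, not something confirmed from the source code.

```python
MIB = 1024 * 1024

n = 21
N = 2 ** n                         # transform length (real numbers)
data_bytes = N * 8                 # 8 bytes per real number
table_bytes = 10 * MIB             # cos/sin table, allocated size as quoted

side = 2 ** 10                     # 2^21 reals = (2^10)^2 complex numbers
column_bytes = side * 16           # 1024 complex numbers, 16 bytes each
scratch_bytes = 4 * column_bytes   # 4 columns copied to the scratch area

# Assumed: two passes over the data per step, table read once per step.
read_bytes = 2 * data_bytes + table_bytes
write_bytes = 2 * data_bytes

print(f"data:    {data_bytes // MIB} MB")
print(f"column:  {column_bytes // 1024} kB, scratch: {scratch_bytes // 1024} kB")
print(f"traffic: read {read_bytes // MIB} MB, write {write_bytes // MIB} MB per step")
```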
The degree of freedom is how lines and columns are distributed to the threads. In this version, it's cut into n parts (vertically or horizontally, n is the number of threads), I will try different slicing.
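For readers unfamiliar with the two-pass idea (columns first, then lines), here is a small textbook illustration on an ordinary complex DFT. It is not genefer's code: genefer's transform is a different one, and the helper names (`dft`, `two_pass_dft`, `slices`) are made up for this sketch. The `slices` helper mimics the "cut into n parts" distribution of columns or lines to threads.

```python
import cmath

def dft(v):
    """Naive O(n^2) DFT, used here for the short column/line transforms."""
    n = len(v)
    return [sum(v[m] * cmath.exp(-2j * cmath.pi * m * k / n) for m in range(n))
            for k in range(n)]

def two_pass_dft(x, rows, cols):
    """DFT of length rows*cols computed as a column pass, then a line pass."""
    n = rows * cols
    a = [list(x[r * cols:(r + 1) * cols]) for r in range(rows)]  # rows x cols matrix
    # Pass 1: transform each column, then multiply by the twiddle factors.
    for c in range(cols):
        col = dft([a[r][c] for r in range(rows)])
        for k1 in range(rows):
            a[k1][c] = col[k1] * cmath.exp(-2j * cmath.pi * c * k1 / n)
    # Pass 2: transform each line (lines are contiguous, hence naturally local).
    out = [0j] * n
    for k1 in range(rows):
        line = dft(a[k1])
        for k2 in range(cols):
            out[k1 + rows * k2] = line[k2]
    return out

def slices(count, parts):
    """Split range(count) into `parts` contiguous chunks, one per thread."""
    bounds = [count * p // parts for p in range(parts + 1)]
    return [range(bounds[p], bounds[p + 1]) for p in range(parts)]

x = [complex(i, (i * i) % 5) for i in range(16)]
error = max(abs(u - v) for u, v in zip(two_pass_dft(x, 4, 4), dft(x)))
print(f"max difference vs direct DFT: {error:.2e}")
print("columns per thread with 4 threads:", slices(1024, 4))
```

A different slicing (for example interleaved instead of contiguous chunks) would only change `slices`; the two passes themselves stay the same.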
| |
|
Yves Gallot Volunteer developer Project scientist Send message
Joined: 19 Aug 12 Posts: 843 ID: 164101 Credit: 306,554,426 RAC: 5,458

|
One test we haven't done is 4x single thread gfn units, compared to the 4 thread one. Where ram bandwidth limited, both should give approximately the same throughput. If scaling test were to be done on a large work unit like PSP, I'd expect to see similar scaling there.
i5-7300 HQ (4+2GB DDR4-2400) / genefer_windows64.exe -x fma3 -q 250000^2097152+1
genefer 3.3.4-dev 108 h
genefer 3.3.4-dev -nt 2 65 h
genefer 3.3.4-dev -nt 3 58 h
genefer 3.3.4-dev -nt 4 58 h
genefer 3.3.4-dev (2 tasks) 137 h => 1 GFN-21 / 68 h
genefer 3.3.4-dev (3 tasks) 183 h => 1 GFN-21 / 61 h
genefer 3.3.4-dev (4 tasks) 238 h => 1 GFN-21 / 59 h
genefer 3.3.4-dev -nt 2 (2 tasks) 120 h => 1 GFN-21 / 60 h
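The "1 GFN-21 / x h" figures are simply the per-task time divided by the number of tasks running at once. A minimal sketch using the numbers from the table above (the helper name `hours_per_result` is illustrative):

```python
def hours_per_result(hours_per_task, concurrent_tasks):
    """Average wall-clock hours per completed test when tasks run side by side."""
    return hours_per_task / concurrent_tasks

# (label, estimated hours per task, tasks running at once)
configs = [
    ("1 task, -nt 3", 58, 1),
    ("1 task, -nt 4", 58, 1),
    ("2 tasks, 1 thread each", 137, 2),
    ("3 tasks, 1 thread each", 183, 3),
    ("4 tasks, 1 thread each", 238, 4),
    ("2 tasks, -nt 2 each", 120, 2),
]

for label, hours, tasks in configs:
    print(f"{label}: 1 GFN-21 / {hours_per_result(hours, tasks):.1f} h")

# min() keeps the first entry on a tie, so -nt 3 is reported ahead of -nt 4.
best = min(configs, key=lambda c: hours_per_result(c[1], c[2]))
print("best throughput:", best[0])
```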
The best throughput is "1 task genefer 3.3.4-dev -nt 3": 1 GFN-21 / 58 h and one core is free to run geneferocl. | |
|
mackerel Volunteer tester
 Send message
Joined: 2 Oct 08 Posts: 2652 ID: 29980 Credit: 570,442,335 RAC: 5,621
                              
|
Since my laptop has the same CPU but different ram, for comparison:
i5-7300 HQ (2x 8GB DDR4-2133) / genefer_windows64.exe -x fma3 -q "250000^2097152+1"
(Is it just me, or does it refuse to run unless I put the number to be tested in quotes?)
genefer 3.3.4-dev 102 h (107h on 2nd run)
genefer 3.3.4-dev -nt 2 56 h
genefer 3.3.4-dev -nt 3 41 h
genefer 3.3.4-dev -nt 4 36 h
genefer 3.3.4-dev (2 tasks) 114 h => 1 GFN-21 / 57 h
genefer 3.3.4-dev (3 tasks) 124 h => 1 GFN-21 / 41.3 h
genefer 3.3.4-dev (4 tasks) 146 h => 1 GFN-21 / 36.5 h
genefer 3.3.4-dev -nt 2 (2 tasks) 72.7 h => 1 GFN-21 / 36.3 h
I'm not sure how memory runs when not using the same modules. Some have speculated you get partial dual channel for the overlap region, but I never confirmed this. Two of the same module type ensures dual channel could be operated, with good improvement as seen here.
Edit: I just realised I'm running mismatched ram myself. My laptop came with 1x8GB 2400 (inaccessible), and I later added 1x8GB 2133 taken from my old laptop when it died. I have previously confirmed that I got the expected ram bandwidth increase from running dual channel, allowing for the lower speed (synthetic tests, Prime95). Turns out the original module is single rank, and the old module is dual rank. I didn't think to look until now... | |
|
mackerel Volunteer tester
 Send message
Joined: 2 Oct 08 Posts: 2652 ID: 29980 Credit: 570,442,335 RAC: 5,621
                              
|
We have n = 21, N = 2^21, data size = N * 8 = 16 MB.
We also have the cos/sin table (read only): 10 MB (I can try to reduce this).
I recall hearing about the sin/cos tables for gwnum too, but I never knew its size. At least for gwnum, performance does seem to relate the FFT size to L3 (assuming inclusive cache), without considering the table.
Time allowing, I'm curious how this might scale on other CPUs too. I will aim to try 7800X (non-inclusive mesh cache, 6x1MB L2 + 8.25MB L3), Ryzen 1700 (exclusive cache, 8x0.5MB L2 + 16MB L3), and Xeon E5-2683 v3 (35MB L3 cache). | |
|
Yves Gallot Volunteer developer Project scientist Send message
Joined: 19 Aug 12 Posts: 843 ID: 164101 Credit: 306,554,426 RAC: 5,458

|
I recall hearing about the sin/cos tables for gwnum too, but I never knew its size. At least for gwnum, performance does seem to relate the FFT size to L3 (assuming inclusive cache), without considering the table.
A little trick can reduce its size.
Let N = 2^20 and a table exp(2.Pi.I k / N), where 0 <= k < N.
Let S= 2^10 and k = j * S + i with 0 <= i, j < S. Then we have
exp(2.Pi.I k / N) = exp(2.Pi.I j / S) * exp(2.Pi.I i /N).
Then two tables of 2^10 numbers and one complex product can replace a table of 2^20 numbers.
This is not efficient with a single threaded program which is CPU limited but this approach should be tested in the multi-threaded version.
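The identity is easy to check numerically. The sketch below builds the two small tables and compares the product against the directly computed value; the names `coarse`, `fine` and `twiddle` are illustrative only.

```python
import cmath

S = 2 ** 10
N = S * S  # 2^20

# Two tables of 2^10 entries each instead of one table of 2^20 entries.
coarse = [cmath.exp(2j * cmath.pi * hi / S) for hi in range(S)]  # exp(2.Pi.I j / S)
fine = [cmath.exp(2j * cmath.pi * lo / N) for lo in range(S)]    # exp(2.Pi.I i / N)

def twiddle(k):
    """Rebuild exp(2.Pi.I k / N) from the two small tables with one complex product."""
    hi, lo = divmod(k, S)  # k = hi * S + lo
    return coarse[hi] * fine[lo]

worst = max(abs(twiddle(k) - cmath.exp(2j * cmath.pi * k / N))
            for k in range(0, N, 997))
print(f"largest deviation on the sampled k: {worst:.2e}")
```

The trade-off is the one described above: 2 * 2^10 table entries replace 2^20, at the cost of one extra complex multiplication per lookup.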
| |
|
mackerel Volunteer tester
 Send message
Joined: 2 Oct 08 Posts: 2652 ID: 29980 Credit: 570,442,335 RAC: 5,621
                              
|
i7-7800X (DDR4-3000, quad channel, 2 rank per channel, mesh clock overclocked from 2000 default to 3000) / genefer_windows64.exe -x fma3 -q "250000^2097152+1"
genefer 3.3.4-dev 91:50 h
genefer 3.3.4-dev -nt 2 47:30 h
genefer 3.3.4-dev -nt 3 32:20 h
genefer 3.3.4-dev -nt 4 26:20 h
genefer 3.3.4-dev -nt 5 21:20 h
genefer 3.3.4-dev -nt 6 21:10 h
Looks like here we get reasonable scaling to 5 threads but the 6th doesn't help.
genefer 3.3.4-dev -nt 3 (2 tasks) 37:10 h => 1 GFN-21 / 18.6 h
genefer 3.3.4-dev -nt 2 (3 tasks) 55:20 h => 1 GFN-21 / 18.4 h
Looking at the remaining time as it ran, it was slowly going up, so this might be a bit optimistic. I didn't wait to see if it eventually stabilised. | |
|
mackerel Volunteer tester
 Send message
Joined: 2 Oct 08 Posts: 2652 ID: 29980 Credit: 570,442,335 RAC: 5,621
                              
|
R7 1700 (DDR4-2666, dual channel, 2 rank per channel) / genefer_windows64.exe -x fma3 -q "250000^2097152+1"
genefer 3.3.4-dev 154:00 h
genefer 3.3.4-dev -nt 2 80:30 h
genefer 3.3.4-dev -nt 3 61:10 h
genefer 3.3.4-dev -nt 4 46:40 h
genefer 3.3.4-dev -nt 5 42:10 h
genefer 3.3.4-dev -nt 6 36:40 h
genefer 3.3.4-dev -nt 7 32:00 h
genefer 3.3.4-dev -nt 8 28:40 h
| |
|
compositeVolunteer tester Send message
Joined: 16 Feb 10 Posts: 1172 ID: 55391 Credit: 1,221,462,325 RAC: 1,507,837
                        
|
We also have the cos/sin table (read only): 10 MB (I can try to reduce this).
I haven't looked at the code so some of these suggestions may already be in use, or not useful at all:
Can you put the table into a shared library? The single copy in memory would avoid the duplication of cache usage by multiple simultaneous tasks and keeps portions of the table resident in cache longer.
Does the algorithm use trigonometric transformations and symmetries to reduce the size of the table? A L3 cache miss takes far more time than many of these extra simple operations.
sin^2 x = 1 - cos^2 x
sin -x = - sin x
cos -x = cos x
The latter two allow the table to contain only the first quadrant.
Do you need to use both sin x and cos x in the same calculation? Use a 2-D array so that cos x and sin x are in adjacent memory locations and thus loaded into the same cache line. | |
|
compositeVolunteer tester Send message
Joined: 16 Feb 10 Posts: 1172 ID: 55391 Credit: 1,221,462,325 RAC: 1,507,837
                        
|
Here's a thought on shared cache usage by multiple threads using the same code path when table access is linear and the table size is bigger than L3 cache. When one of the threads leads the others in table access and it stalls on cache misses, the other threads eventually catch up to it. Ultimately all the threads always stall together on the same cache miss and they remain synchronized this way. The whole CPU is idle while the threads are stalled, so throughput is limited by RAM bandwidth.
A fix for this is to have each thread use an alternate code path having a different table access pattern. For example, one access pattern can be linear through all rows, and another can be all even rows followed by all odd rows. With current generations of CPU significant speedup may occur with just 2 access patterns. | |
|
Yves Gallot Volunteer developer Project scientist Send message
Joined: 19 Aug 12 Posts: 843 ID: 164101 Credit: 306,554,426 RAC: 5,458

|
My table contains only the first quadrant, its size is
[1024^2 / 4 + (1024^2 / 16 + 1024^2 / 64 + ... ) * 3 ] * sizeof(Complex)
~ 8 MB (10 MB is the allocated space, not the used data size).
The * 3 is because of the radix-4 FFT, I have to store w, w^2 and w^3. Daniel J. Bernstein (and gwnum) replaced w^3 with w^-1 = conj(w) but this is not possible with my transform which is not a FFT.
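The "~ 8 MB" figure follows from the formula above. The post elides where the series ends ("..."), so this sketch assumes it continues by factors of 4 down to a single entry, and it assumes sizeof(Complex) is 16 bytes (two doubles). Both are assumptions, not taken from the source code.

```python
SIZEOF_COMPLEX = 16  # assumption: two 8-byte doubles

def table_entries(side=1024):
    """Entries in [side^2/4 + (side^2/16 + side^2/64 + ...) * 3]."""
    total = side * side
    entries = total // 4       # first term
    level = total // 16
    while level >= 1:          # assumed end of the series: down to 1 entry
        entries += 3 * level   # w, w^2 and w^3 for the radix-4 stages
        level //= 4
    return entries

entries = table_entries()
size_mb = entries * SIZEOF_COMPLEX / 2 ** 20
print(f"{entries} entries, {size_mb:.3f} MB")
```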
I tried to use the relations sin(pi/2 - x) = cos(x) and cos(pi/2 - x) = sin(x) to reduce the size but it's worse, probably because some data are read in reversed order.
My table is bit-reversed then is read in order with prefetching.
I just found that single-threaded code runs faster with prefetching but that 4 threads run faster without. Presumably some prefetches evict data from the L3 cache that other threads are still working on. I'm working on this issue.
I'm going to try 'alternate code path', thanks for your help. | |
|
mackerel Volunteer tester
 Send message
Joined: 2 Oct 08 Posts: 2652 ID: 29980 Credit: 570,442,335 RAC: 5,621
                              
|
E5-2683 v3 ES (14 cores, 35 MB L3 cache, quad channel 2133 ram, lower turbo clocks than retail sample at 2.3 GHz all cores active)
threads - estimated duration
01 167:00
02 81:20
03 57:10
04 44:10
05 36:30
06 30:30
07 26:00
08 23:10
09 21:50
10 19:50
11 17:50
12 16:50
13 16:40
14 15:10
I found it odd that 13 threads was largely unchanged from 12 threads, yet 14 still gave more of an improvement. I did retest those multiple times and it was repeatable.
2x7 29:40 ea - 14:50 per unit (I didn't check if times increased like observed yesterday) | |
|
|
Better than 50% reduction in time going from 1 to 2 threads. Nice.
| |
|
mackerel Volunteer tester
 Send message
Joined: 2 Oct 08 Posts: 2652 ID: 29980 Credit: 570,442,335 RAC: 5,621
                              
|
Better than 50% reduction in time going from 1 to 2 threads. Nice.
I'd take that with caution, as I didn't run it multiple times to check. The only checking I did was for 12-14 threads given the oddity at 13. | |
|
compositeVolunteer tester Send message
Joined: 16 Feb 10 Posts: 1172 ID: 55391 Credit: 1,221,462,325 RAC: 1,507,837
                        
|
When you prefetch memory, just access the first memory location in a block that loads into a cache line. The rest of the locations in the block load automatically without referencing them. Is that how your prefetching works?
Here's another version of alternate code paths that should work using only a single thread.
The interesting thing about cache filling highlighted by the speculative execution vulnerabilities is that alternate code paths cause memory access even when a code branch is not followed. So it isn't necessary to use another thread to prefetch, merely a bit mask test on a register value that has a known result to control the branch path. The code path not used does the prefetching, say an integer checksum into a register over the first memory word from each block to be prefetched. This effectively instructs the invisible hardware in the core to do the prefetching.
EDIT: Bear in mind how much computation you can do on the code branch used for each memory fetch in the code branch not used. | |
|
Yves Gallot Volunteer developer Project scientist Send message
Joined: 19 Aug 12 Posts: 843 ID: 164101 Credit: 306,554,426 RAC: 5,458

|
genefer 3.3.4 - release candidate 1 (64-bit Windows) is available at https://app.assembla.com/spaces/genefer/subversion/source/HEAD/trunk/bin/windows
AVX and FMA3 implementations for GFN-21 are faster and multithreading is supported.
This version is also faster than the beta.
It has been running for a few days on a Ryzen 7 1800X (with 6 threads) and one task has been validated: http://www.primegrid.com/workunit.php?wuid=577928463
Testing can now be extended.
Note that a processor is about as fast as a GTX 1050/1060. Both have a TDP of about 100 W, so Flops/Watt are roughly equivalent. | |
|
robish Volunteer moderator Volunteer tester
 Send message
Joined: 7 Jan 12 Posts: 2223 ID: 126266 Credit: 7,973,006,583 RAC: 5,441,534
                               
|
Awesome. Great work!
____________
My lucky number 1059094^1048576+1 | |
|
Honza Volunteer moderator Volunteer tester Project scientist Send message
Joined: 15 Aug 05 Posts: 1963 ID: 352 Credit: 6,419,927,798 RAC: 2,655,298
                                      
|
Note that it does perform better with 1 or 2 threads, but not great with 3 threads and terribly with 4-6 threads.
With 6 threads, Task Manager shows ~90% CPU utilization.
Tested on the 250000^2097152+1 candidate, not the one used in the BOINC environment.
genefer 3.3.4-RC1 75 h 40 min
genefer 3.3.4-RC1 -nt 2 39 h 00 min
genefer 3.3.4-RC1 -nt 3 28 h 40 min
genefer 3.3.4-RC1 -nt 4 26 h 20 min
genefer 3.3.4-RC1 -nt 5 25 h 50 min
genefer 3.3.4-RC1 -nt 6 26 h 10 min
Windows 10 x64, i7 8700k, Office machine 4x8GB DDR4-2133 (slower one comparing to my home host)
genefer 3.3.3-3 82 h 30 min
genefer 3.3.4-dev 77 h 00 min
genefer 3.3.4-dev -nt 2 41 h 10 min
genefer 3.3.4-dev -nt 3 29 h 20 min
genefer 3.3.4-dev -nt 4 26 h 10 min
genefer 3.3.4-dev -nt 5 24 h 50 min
genefer 3.3.4-dev -nt 6 19 h 30 min
____________
My stats | |
|
Yves Gallot Volunteer developer Project scientist Send message
Joined: 19 Aug 12 Posts: 843 ID: 164101 Credit: 306,554,426 RAC: 5,458

|
Note that it does perform better with 1 or 2 threads, but not great with 3 threads and terribly with 4-6 threads.
Bad news. But you're a bit pessimistic: 1-3 threads are faster, 4 threads is equivalent, 5-6 threads are worse.
On Ryzen 7 we have
nt 1: 147 h
nt 2: 77.5 h
nt 4: 41 h
nt 6: 31 h
nt 8: 25.5 h
Note that one Intel *Lake core is roughly equivalent to two Ryzen cores:
Intel processors have two 256-bit FPUs per core, while AMD has two 128-bit FPUs per core.
Is 25 hours the memory limit? Eight cores are required on the Ryzen 7 to reach it, but only 4 on the i7 8700k. Could the prefetching in the new version be inefficient when too many cores are used? Was the "19 h 30 min" estimate of the previous version correct?
I'm interested in llr 321 or PSP performances on this processor. | |
|
|
AVX performance from a 3930k with plenty of memory bandwidth:
250000^(2^21)+1
3.3.3-3: 132h26 (1t)
3.3.4-RC1:
150h20 1t - six tasks: 180h each average
76:00 2t - three tasks: 88:00 each average
51:40 3t - two tasks: ~60:20 each
41:50 4t
36:30 5t
35:10 6t
30:20 12t (using HT threads here)
Running 3 tasks with 2 threads each seems to give the best throughput, at 6 WUs per 176h, though the margin over the 1-thread and 3-thread results is small (2%ish). Even so, the current app provides the best overall performance.
____________
Eating more cheese on Thursdays. | |
|
Yves Gallot Volunteer developer Project scientist Send message
Joined: 19 Aug 12 Posts: 843 ID: 164101 Credit: 306,554,426 RAC: 5,458

|
AVX performance from a 3930k with plenty of memory bandwidth: [...] but the current app provides the best overall performance.
Thanks, I'm going to fix it.
The new AVX code is faster on Intel >= 5th Gen (which is useless, because FMA is available on these processors) but slower on 4th Gen.
So I will revert the single-threaded AVX version to the 3.3.3 code.
| |
|
|
I have posted Linux and Mac binaries at https://app.assembla.com/spaces/genefer/subversion/source/HEAD/trunk/bin
I am working on an issue with the Mac related to multithreading, so don't be surprised if it doesn't work. The Linux version should work OK, but I haven't tested as I've only got a single-core VM. If anyone can try it that would be great.
Cheers
- Iain
____________
Twitter: IainBethune
Proud member of team "Aggie The Pew". Go Aggie!
3073428256125*2^1290000-1 is Prime! | |
|
|
Hi folks,
Watch this space for a proper test campaign, but you can now find a complete set of release-candidate binaries:
https://app.assembla.com/spaces/genefer/subversion/source/HEAD/trunk/bin/windows
https://app.assembla.com/spaces/genefer/subversion/source/HEAD/trunk/bin/mac
https://app.assembla.com/spaces/genefer/subversion/source/HEAD/trunk/bin/linux
Please do download and have a play with them. I would be interested to see if anyone observes any significant difference between the performance of Yves' and my builds (either serial or multi-threaded), as our toolchains are slightly different.
Cheers
- Iain
____________
Twitter: IainBethune
Proud member of team "Aggie The Pew". Go Aggie!
3073428256125*2^1290000-1 is Prime! | |
|
Honza Volunteer moderator Volunteer tester Project scientist Send message
Joined: 15 Aug 05 Posts: 1963 ID: 352 Credit: 6,419,927,798 RAC: 2,655,298
                                      
|
Iain's binary.
genefer 3.3.4-Iain 74 h 40 min
genefer 3.3.4-Iain -nt 2 40 h 00 min
genefer 3.3.4-Iain -nt 3 30 h 40 min
genefer 3.3.4-Iain -nt 4 27 h 00 min
genefer 3.3.4-Iain -nt 5 26 h 10 min
genefer 3.3.4-Iain -nt 6 26 h 30 min
Note that it does perform better with 1 or 2 threads, but not great with 3 threads and terribly with 4-6 threads.
With 6 threads, Task Manager shows ~90% CPU utilization.
Tested on the 250000^2097152+1 candidate, not the one used in the BOINC environment.
genefer 3.3.4-RC1 75 h 40 min
genefer 3.3.4-RC1 -nt 2 39 h 00 min
genefer 3.3.4-RC1 -nt 3 28 h 40 min
genefer 3.3.4-RC1 -nt 4 26 h 20 min
genefer 3.3.4-RC1 -nt 5 25 h 50 min
genefer 3.3.4-RC1 -nt 6 26 h 10 min
Windows 10 x64, i7 8700k, Office machine 4x8GB DDR4-2133 (slower one comparing to my home host)
genefer 3.3.3-3 82 h 30 min
genefer 3.3.4-dev 77 h 00 min
genefer 3.3.4-dev -nt 2 41 h 10 min
genefer 3.3.4-dev -nt 3 29 h 20 min
genefer 3.3.4-dev -nt 4 26 h 10 min
genefer 3.3.4-dev -nt 5 24 h 50 min
genefer 3.3.4-dev -nt 6 19 h 30 min
____________
My stats | |
|
|
OK, so it's time to do some more extensive testing to prepare for putting the new app into production. I've set up a google sheet here:
https://docs.google.com/spreadsheets/d/1UoLSRjhng9p_rlf64A8xVfa0dCJ8XyEzGrZb5xD9-Ko/edit?usp=sharing
Binaries can be obtained from:
Windows: https://app.assembla.com/spaces/genefer/subversion/source/HEAD/trunk/bin/windows
Mac: https://app.assembla.com/spaces/genefer/subversion/source/HEAD/trunk/bin/mac
Linux: https://app.assembla.com/spaces/genefer/subversion/source/HEAD/trunk/bin/linux
As usual, there are manual tests (run at the command line) and BOINC tests which require setting up an app_info.xml file. In addition, for this release we are testing the new multithreading (enabled by the -nt T flag, where T is the number of threads), which requires an app_config.xml file similarly to LLR. If you need help setting these up, just ask!
The new multithreaded transform only affects 64-bit builds, running the FMA3 or AVX transforms on n=9,13,17,21 - the other tests (including OCL) are only included to check that the builds still work on a range of machines. Detailed testing of the multithreaded code is targeted at n=17 (manual tests) and n=21 (BOINC) 64-bit CPUs.
Please post in the thread if you want to reserve tests, and test results (or links to WUs). Manual credit will be awarded for all completed tests!
Thanks in advance!
- Iain
____________
Twitter: IainBethune
Proud member of team "Aggie The Pew". Go Aggie!
3073428256125*2^1290000-1 is Prime! | |
|
RafaelVolunteer tester
 Send message
Joined: 22 Oct 14 Posts: 918 ID: 370496 Credit: 611,432,322 RAC: 719,902
                         
|
I've got a 32bit CPU with HT running Windows 10 32bit, so I can take care of the 4 manual tests. | |
|
288larsson Volunteer tester
 Send message
Joined: 17 Apr 10 Posts: 136 ID: 58815 Credit: 5,991,452,870 RAC: 3,262,552
                                   
|
hi B10
genefer 3.3.4 (Windows/CPU/64-bit)
Copyright 2001-2018, Yves Gallot
Copyright 2009, Mark Rodenkirch, David Underbakke
Copyright 2010-2012, Shoichiro Yamada, Ken Brazier
Copyright 2011-2014, Michael Goetz, Ronald Schneider
Copyright 2011-2018, Iain Bethune
Genefer is free source code, under the MIT license.
Supported transform implementations: fma3 avx sse4 sse2 x87
Command line: genefer_windows64-334-2 -q 2041898^65536+1 -nt 2
Low priority change succeeded.
Testing 2041898^65536+1...
Using FMA3 transform (2 threads)
Starting initialization...
Initialization complete (0.095 seconds).
Estimated time for 2041898^65536+1 is 0:08:02
2041898^65536+1 is a probable prime. (413535 digits) (err = 0.4375) (time = 0:08:31) 12:15:48
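Since estimated times in this thread were sometimes seen to drift while a test ran, it can be useful to compare the estimate with the final time. The sketch below parses the two relevant lines of the output above; the regular expressions are written against this sample only and are an assumption about the general output format.

```python
import re

log = """Estimated time for 2041898^65536+1 is 0:08:02
2041898^65536+1 is a probable prime. (413535 digits) (err = 0.4375) (time = 0:08:31) 12:15:48"""

def to_seconds(hms):
    """Convert an h:mm:ss string to seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds

estimate = to_seconds(re.search(r"Estimated time for \S+ is (\d+:\d+:\d+)", log).group(1))
actual = to_seconds(re.search(r"\(time = (\d+:\d+:\d+)\)", log).group(1))

print(f"estimate {estimate} s, actual {actual} s, "
      f"actual/estimate = {actual / estimate:.3f}")
```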
| |
|
288larsson Volunteer tester
 Send message
Joined: 17 Apr 10 Posts: 136 ID: 58815 Credit: 5,991,452,870 RAC: 3,262,552
                                   
|
Hi C15
geneferocl 3.3.4 (Windows/OpenCL/32-bit)
Copyright 2001-2018, Yves Gallot
Copyright 2009, Mark Rodenkirch, David Underbakke
Copyright 2010-2012, Shoichiro Yamada, Ken Brazier
Copyright 2011-2014, Michael Goetz, Ronald Schneider
Copyright 2011-2018, Iain Bethune
Genefer is free source code, under the MIT license.
Command line: geneferocl_windows334-2 -q 157476^65536+1
Normal priority change succeeded.
Checking available transform implementations...
A benchmark is needed to determine best transform, testing available transform implementations...
Testing OCL transform...
Running on platform 'AMD Accelerated Parallel Processing', device 'gfx900', vendor 'Advanced Micro Devices, Inc.', version 'OpenCL 1.2 AMD-APP (2671.3)' and driver '2671.3 (PAL,HSAIL)'.
64 computeUnits @ 1630MHz, memSize=3072MB, cacheSize=16kB, cacheLineSize=64B, localMemSize=32kB, maxWorkGroupSize=256.
Testing OCL2 transform...
Testing OCL3 transform...
Testing OCL4 transform...
Testing OCL5 transform...
Benchmarks completed (11.723 seconds).
Testing 157476^65536+1...
Using OCL4 transform
Starting initialization...
Initialization complete (0.115 seconds).
Estimated time for 157476^65536+1 is 0:00:57
157476^65536+1 is composite. (RES=9f64b3f0d545615c) (340605 digits) (err = 0.0000) (time = 0:01:01) 12:25:57
| |
|
Honza Volunteer moderator Volunteer tester Project scientist Send message
Joined: 15 Aug 05 Posts: 1963 ID: 352 Credit: 6,419,927,798 RAC: 2,655,298
                                      
|
Reserving B15, B23-B26, B32
EDIT: B32 is ongoing...
____________
My stats | |
|
Honza Volunteer moderator Volunteer tester Project scientist Send message
Joined: 15 Aug 05 Posts: 1963 ID: 352 Credit: 6,419,927,798 RAC: 2,655,298
                                      
|
B15
>geneferocl_windows.exe -q "157476^65536+1"
geneferocl 3.3.4 (Windows/OpenCL/32-bit)
Copyright 2001-2018, Yves Gallot
Copyright 2009, Mark Rodenkirch, David Underbakke
Copyright 2010-2012, Shoichiro Yamada, Ken Brazier
Copyright 2011-2014, Michael Goetz, Ronald Schneider
Copyright 2011-2018, Iain Bethune
Genefer is free source code, under the MIT license.
Command line: geneferocl_windows.exe -q 157476^65536+1
Normal priority change succeeded.
Checking available transform implementations...
A benchmark is needed to determine best transform, testing available transform implementations...
Testing OCL transform...
Running on platform 'NVIDIA CUDA', device 'GeForce GTX 1070', vendor 'NVIDIA Corporation', version 'OpenCL 1.2 CUDA' and driver '398.82'.
15 computeUnits @ 1784MHz, memSize=8192MB, cacheSize=240kB, cacheLineSize=128B, localMemSize=48kB, maxWorkGroupSize=1024.
Testing OCL2 transform...
Testing OCL3 transform...
Testing OCL4 transform...
Testing OCL5 transform...
Benchmarks completed (2.517 seconds).
Testing 157476^65536+1...
Using OCL4 transform
Starting initialization...
Initialization complete (0.061 seconds).
Estimated time for 157476^65536+1 is 0:00:56
157476^65536+1 is composite. (RES=9f64b3f0d545615c) (340605 digits) (err = 0.0000) (time = 0:00:56) 12:45:38
____________
My stats | |
|
Honza Volunteer moderator Volunteer tester Project scientist Send message
Joined: 15 Aug 05 Posts: 1963 ID: 352 Credit: 6,419,927,798 RAC: 2,655,298
                                      
|
B25
>genefer_windows64.exe -q "1722230^131072+1" -nt 3
genefer 3.3.4 (Windows/CPU/64-bit)
Copyright 2001-2018, Yves Gallot
Copyright 2009, Mark Rodenkirch, David Underbakke
Copyright 2010-2012, Shoichiro Yamada, Ken Brazier
Copyright 2011-2014, Michael Goetz, Ronald Schneider
Copyright 2011-2018, Iain Bethune
Genefer is free source code, under the MIT license.
Supported transform implementations: fma3 avx sse4 sse2 x87
Command line: genefer_windows64.exe -q 1722230^131072+1 -nt 3
Low priority change succeeded.
Testing 1722230^131072+1...
Using FMA3 transform (3 threads)
Starting initialization...
Initialization complete (0.141 seconds).
Estimated time for 1722230^131072+1 is 0:08:45
1722230^131072+1 is a probable prime. (817377 digits) (err = 0.4375) (time = 0:09:18) 12:49:04
____________
My stats | |
|
Honza Volunteer moderator Volunteer tester Project scientist Send message
Joined: 15 Aug 05 Posts: 1963 ID: 352 Credit: 6,419,927,798 RAC: 2,655,298
                                      
|
B24
>genefer_windows64.exe -q "1722230^131072+1" -nt 2
genefer 3.3.4 (Windows/CPU/64-bit)
Copyright 2001-2018, Yves Gallot
Copyright 2009, Mark Rodenkirch, David Underbakke
Copyright 2010-2012, Shoichiro Yamada, Ken Brazier
Copyright 2011-2014, Michael Goetz, Ronald Schneider
Copyright 2011-2018, Iain Bethune
Genefer is free source code, under the MIT license.
Supported transform implementations: fma3 avx sse4 sse2 x87
Command line: genefer_windows64.exe -q 1722230^131072+1 -nt 2
Low priority change succeeded.
Testing 1722230^131072+1...
Using FMA3 transform (2 threads)
Starting initialization...
Initialization complete (0.145 seconds).
Estimated time for 1722230^131072+1 is 0:11:20
1722230^131072+1 is a probable prime. (817377 digits) (err = 0.4375) (time = 0:11:12) 12:51:12
____________
My stats | |
|
Honza Volunteer moderator Volunteer tester Project scientist Send message
Joined: 15 Aug 05 Posts: 1963 ID: 352 Credit: 6,419,927,798 RAC: 2,655,298
                                      
|
B23
>genefer_windows64.exe -q "1722230^131072+1" -nt 1
genefer 3.3.4 (Windows/CPU/64-bit)
Copyright 2001-2018, Yves Gallot
Copyright 2009, Mark Rodenkirch, David Underbakke
Copyright 2010-2012, Shoichiro Yamada, Ken Brazier
Copyright 2011-2014, Michael Goetz, Ronald Schneider
Copyright 2011-2018, Iain Bethune
Genefer is free source code, under the MIT license.
Supported transform implementations: fma3 avx sse4 sse2 x87
Command line: genefer_windows64.exe -q 1722230^131072+1 -nt 1
Low priority change succeeded.
Testing 1722230^131072+1...
Using FMA3 transform
Starting initialization...
Initialization complete (0.251 seconds).
Estimated time for 1722230^131072+1 is 0:27:50
1722230^131072+1 is a probable prime. (817377 digits) (err = 0.4375) (time = 0:18:25) 12:58:58
____________
My stats | |
|
288larsson Volunteer tester
 Send message
Joined: 17 Apr 10 Posts: 136 ID: 58815 Credit: 5,991,452,870 RAC: 3,262,552
                                   
|
hi C10
genefer 3.3.4 (Linux/CPU/64-bit)
Copyright 2001-2018, Yves Gallot
Copyright 2009, Mark Rodenkirch, David Underbakke
Copyright 2010-2012, Shoichiro Yamada, Ken Brazier
Copyright 2011-2014, Michael Goetz, Ronald Schneider
Copyright 2011-2018, Iain Bethune
Genefer is free source code, under the MIT license.
Supported transform implementations: fma3 avx sse4 sse2 x87
Command line: ./genefer_linux64-334-2 -q 2041898^65536+1 -nt 2
Low priority change succeeded.
Testing 2041898^65536+1...
Using FMA3 transform (2 threads)
Starting initialization...
Initialization complete (0.040 seconds).
Estimated time for 2041898^65536+1 is 0:03:59
2041898^65536+1 is a probable prime. (413535 digits) (err = 0.4375) (time = 0:04:00) 12:50:25
D16
geneferocl 3.3.4 (Linux/OpenCL/64-bit)
Copyright 2001-2018, Yves Gallot
Copyright 2009, Mark Rodenkirch, David Underbakke
Copyright 2010-2012, Shoichiro Yamada, Ken Brazier
Copyright 2011-2014, Michael Goetz, Ronald Schneider
Copyright 2011-2018, Iain Bethune
Genefer is free source code, under the MIT license.
Command line: ./geneferocl_linux64-334-2 -q 157476^65536+1
Normal priority change succeeded.
Checking available transform implementations...
A benchmark is needed to determine best transform, testing available transform implementations...
Testing OCL transform...
Running on platform 'NVIDIA CUDA', device 'GeForce GTX 780 Ti', vendor 'NVIDIA Corporation', version 'OpenCL 1.2 CUDA' and driver '390.48'.
15 computeUnits @ 1084MHz, memSize=3018MB, cacheSize=240kB, cacheLineSize=128B, localMemSize=48kB, maxWorkGroupSize=1024.
Testing OCL2 transform...
Testing OCL3 transform...
Testing OCL4 transform...
Testing OCL5 transform...
Benchmarks completed (9.054 seconds).
Testing 157476^65536+1...
Using OCL transform
Starting initialization...
Initialization complete (0.025 seconds).
Estimated time for 157476^65536+1 is 0:01:10
157476^65536+1 is composite. (RES=9f64b3f0d545615c) (340605 digits) (err = 0.0029) (time = 0:01:11) 12:53:09
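The composite tests in this thread are effectively cross-checked by their residue: 157476^65536+1 reports RES=9f64b3f0d545615c on all three GPUs shown so far. A small sketch of how pasted result lines could be compared automatically. The regular expression is inferred only from the lines in this thread, not from the genefer source, and the names `parse_result` and `residues_agree` are hypothetical.

```python
import re
from typing import Iterable, Optional

# Pattern inferred from the final result lines posted in this thread.
RESULT_RE = re.compile(
    r"^(?P<candidate>\d+\^\d+\+1) is (?P<verdict>composite|a probable prime)\."
    r"(?: \(RES=(?P<res>[0-9a-fA-F]+)\))?"
    r" \((?P<digits>\d+) digits\)"
    r" \(err = (?P<err>\d+\.\d+)\)"
    r" \(time = (?P<time>[\d:]+)\)"
)


def parse_result(line: str) -> Optional[dict]:
    """Return the fields of a result line, or None if the line does not match."""
    m = RESULT_RE.match(line.strip())
    if m is None:
        return None
    fields = m.groupdict()
    fields["digits"] = int(fields["digits"])
    fields["err"] = float(fields["err"])
    return fields


def residues_agree(lines: Iterable[str]) -> bool:
    """True if every composite result for a given candidate has the same residue."""
    seen = {}
    for line in lines:
        r = parse_result(line)
        if r is None or r["res"] is None:
            continue  # not a result line, or a probable prime (no residue printed)
        if seen.setdefault(r["candidate"], r["res"]) != r["res"]:
            return False
    return True
```

Feeding it the three result lines for 157476^65536+1 (AMD gfx900, GTX 1070, GTX 780 Ti) should report agreement even though the err values differ between transforms.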
| |
|
Honza Volunteer moderator Volunteer tester Project scientist Send message
Joined: 15 Aug 05 Posts: 1963 ID: 352 Credit: 6,419,927,798 RAC: 2,655,298
                                      
|
B26
>genefer_windows64.exe -q "1722230^131072+1" -nt 5
genefer 3.3.4 (Windows/CPU/64-bit)
Copyright 2001-2018, Yves Gallot
Copyright 2009, Mark Rodenkirch, David Underbakke
Copyright 2010-2012, Shoichiro Yamada, Ken Brazier
Copyright 2011-2014, Michael Goetz, Ronald Schneider
Copyright 2011-2018, Iain Bethune
Genefer is free source code, under the MIT license.
Supported transform implementations: fma3 avx sse4 sse2 x87
Command line: genefer_windows64.exe -q 1722230^131072+1 -nt 5
Low priority change succeeded.
Testing 1722230^131072+1...
Using FMA3 transform (5 threads)
Starting initialization...
Initialization complete (0.122 seconds).
Estimated time for 1722230^131072+1 is 0:06:54
1722230^131072+1 is a probable prime. (817377 digits) (err = 0.4375) (time = 0:07:14) 13:08:11
____________
My stats | |
|
288larsson Volunteer tester
 Send message
Joined: 17 Apr 10 Posts: 136 ID: 58815 Credit: 5,991,452,870 RAC: 3,262,552
                                   
|
Hi C23 C24 C25 C26
genefer 3.3.4 (Linux/CPU/64-bit)
Copyright 2001-2018, Yves Gallot
Copyright 2009, Mark Rodenkirch, David Underbakke
Copyright 2010-2012, Shoichiro Yamada, Ken Brazier
Copyright 2011-2014, Michael Goetz, Ronald Schneider
Copyright 2011-2018, Iain Bethune
Genefer is free source code, under the MIT license.
Supported transform implementations: fma3 avx sse4 sse2 x87
Command line: ./genefer_linux64-334-2 -q 1722230^131072+1 -nt 1
Low priority change succeeded.
Testing 1722230^131072+1...
Using FMA3 transform
Starting initialization...
Initialization complete (0.116 seconds).
Estimated time for 1722230^131072+1 is 0:14:40
1722230^131072+1 is a probable prime. (817377 digits) (err = 0.4375) (time = 0:14:38) 13:21:35
pip@6700k:~/genefer3-3-3$ ./genefer_linux64-334-2 -q "1722230^131072+1" -nt 2
genefer 3.3.4 (Linux/CPU/64-bit)
Copyright 2001-2018, Yves Gallot
Copyright 2009, Mark Rodenkirch, David Underbakke
Copyright 2010-2012, Shoichiro Yamada, Ken Brazier
Copyright 2011-2014, Michael Goetz, Ronald Schneider
Copyright 2011-2018, Iain Bethune
Genefer is free source code, under the MIT license.
Supported transform implementations: fma3 avx sse4 sse2 x87
Command line: ./genefer_linux64-334-2 -q 1722230^131072+1 -nt 2
Low priority change succeeded.
Testing 1722230^131072+1...
Using FMA3 transform (2 threads)
Starting initialization...
Initialization complete (0.116 seconds).
Estimated time for 1722230^131072+1 is 0:07:46
1722230^131072+1 is a probable prime. (817377 digits) (err = 0.4375) (time = 0:07:48) 13:29:35
pip@6700k:~/genefer3-3-3$ ./genefer_linux64-334-2 -q "1722230^131072+1" -nt 3
genefer 3.3.4 (Linux/CPU/64-bit)
Copyright 2001-2018, Yves Gallot
Copyright 2009, Mark Rodenkirch, David Underbakke
Copyright 2010-2012, Shoichiro Yamada, Ken Brazier
Copyright 2011-2014, Michael Goetz, Ronald Schneider
Copyright 2011-2018, Iain Bethune
Genefer is free source code, under the MIT license.
Supported transform implementations: fma3 avx sse4 sse2 x87
Command line: ./genefer_linux64-334-2 -q 1722230^131072+1 -nt 3
Low priority change succeeded.
Testing 1722230^131072+1...
Using FMA3 transform (3 threads)
Starting initialization...
Initialization complete (0.115 seconds).
Estimated time for 1722230^131072+1 is 0:05:29
1722230^131072+1 is a probable prime. (817377 digits) (err = 0.4375) (time = 0:05:30) 13:37:42
pip@6700k:~/genefer3-3-3$ ./genefer_linux64-334-2 -q "1722230^131072+1" -nt 4
genefer 3.3.4 (Linux/CPU/64-bit)
Copyright 2001-2018, Yves Gallot
Copyright 2009, Mark Rodenkirch, David Underbakke
Copyright 2010-2012, Shoichiro Yamada, Ken Brazier
Copyright 2011-2014, Michael Goetz, Ronald Schneider
Copyright 2011-2018, Iain Bethune
Genefer is free source code, under the MIT license.
Supported transform implementations: fma3 avx sse4 sse2 x87
Command line: ./genefer_linux64-334-2 -q 1722230^131072+1 -nt 4
Low priority change succeeded.
Testing 1722230^131072+1...
Using FMA3 transform (4 threads)
Starting initialization...
Initialization complete (0.115 seconds).
Estimated time for 1722230^131072+1 is 0:04:17
1722230^131072+1 is a probable prime. (817377 digits) (err = 0.4375) (time = 0:04:20) 13:42:39
| |
|
|
C38
I'll be working on C40 and C41 later, if no one has done those by then.
____________
| |
|
|
C38
I'll be working on C40 and C41 later, if no one has done those by then.
C40
____________
| |
|
Honza Volunteer moderator Volunteer tester Project scientist Send message
Joined: 15 Aug 05 Posts: 1963 ID: 352 Credit: 6,419,927,798 RAC: 2,655,298
                                      
|
B32
>genefer_windows64.exe -q "10037266^131072+1" -x x87
genefer 3.3.4 (Windows/CPU/64-bit)
Copyright 2001-2018, Yves Gallot
Copyright 2009, Mark Rodenkirch, David Underbakke
Copyright 2010-2012, Shoichiro Yamada, Ken Brazier
Copyright 2011-2014, Michael Goetz, Ronald Schneider
Copyright 2011-2018, Iain Bethune
Genefer is free source code, under the MIT license.
Supported transform implementations: fma3 avx sse4 sse2 x87
Command line: genefer_windows64.exe -q 10037266^131072+1 -x x87
Low priority change succeeded.
Testing 10037266^131072+1...
Using x87 (80-bit) transform
Starting initialization...
Initialization complete (0.253 seconds).
Estimated time for 10037266^131072+1 is 6:22:00
10037266^131072+1 is a probable prime. (917716 digits) (err = 0.0127) (time = 4:53:41) 17:41:44
____________
My stats | |
|
288larsson Volunteer tester
 Send message
Joined: 17 Apr 10 Posts: 136 ID: 58815 Credit: 5,991,452,870 RAC: 3,262,552
                                   
|
hi C32
genefer 3.3.4 (Linux/CPU/64-bit)
Copyright 2001-2018, Yves Gallot
Copyright 2009, Mark Rodenkirch, David Underbakke
Copyright 2010-2012, Shoichiro Yamada, Ken Brazier
Copyright 2011-2014, Michael Goetz, Ronald Schneider
Copyright 2011-2018, Iain Bethune
Genefer is free source code, under the MIT license.
Supported transform implementations: fma3 avx sse4 sse2 x87
Command line: ./genefer_linux64-334-2 -q 10037266^131072+1
Low priority change succeeded.
FMA3 transform is past its b limit.
AVX (Intel) transform is past its b limit.
SSE4 transform is past its b limit.
SSE2 transform is past its b limit.
Testing 10037266^131072+1...
Using x87 (80-bit) transform
Starting initialization...
Initialization complete (0.130 seconds).
Estimated time for 10037266^131072+1 is 3:58:00
10037266^131072+1 is a probable prime. (917716 digits) (err = 0.0117) (time = 3:54:46) 18:37:07
| |
|
|
Using the newest exe that Iain linked, these are the estimated times I got for 250000^2097152+1, the candidate people were using earlier in this thread.
Threadripper 1950X with 3200 MHz CAS 14 RAM. Default CPU clocks.
-nt 1 143 hours
-nt 2 78:10
-nt 3 54:08
-nt 4 40:40
-nt 5 35:00
-nt 6 30:00
-nt 7 26:20
-nt 8 23:30
-nt 9 22:40
-nt 10 20:00
-nt 11 18:40
-nt 12 17:40
-nt 13 17:30
-nt 14 15:40
-nt 15 15:20
-nt 16 17:00 dropped to 15:41 after running a few seconds.
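To make the scaling in this table easier to read, here is a minimal sketch that turns the estimates into speedup and per-thread efficiency. It assumes the entries are hours:minutes (so "143 hours" is entered as 143:00) and uses the settled 15:41 value for -nt 16; the helper names are made up for illustration.

```python
def to_minutes(hhmm: str) -> int:
    """Convert an 'H:MM' estimate (hours:minutes, as assumed here) to minutes."""
    hours, minutes = hhmm.split(":")
    return int(hours) * 60 + int(minutes)


# Estimated times from the table above, keyed by the -nt value.
ESTIMATES = {
    1: "143:00", 2: "78:10", 3: "54:08", 4: "40:40",
    5: "35:00", 6: "30:00", 7: "26:20", 8: "23:30",
    9: "22:40", 10: "20:00", 11: "18:40", 12: "17:40",
    13: "17:30", 14: "15:40", 15: "15:20",
    16: "15:41",  # value after the estimate settled
}


def scaling(estimates: dict) -> list:
    """Return (threads, speedup vs. 1 thread, efficiency per thread) tuples."""
    base = to_minutes(estimates[1])
    rows = []
    for nt in sorted(estimates):
        speedup = base / to_minutes(estimates[nt])
        rows.append((nt, speedup, speedup / nt))
    return rows


if __name__ == "__main__":
    for nt, speedup, eff in scaling(ESTIMATES):
        print(f"-nt {nt:2d}  speedup {speedup:5.2f}x  efficiency {eff:4.0%}")
```

With these inputs the speedup is roughly 3.5x at -nt 4 and roughly 9.1x at -nt 16, so the per-thread efficiency falls off as more threads are added.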
I ran -q "2041898^65536+1" -nt 1 through the 64 bit app in Windows and it said it was probable prime. I wasn't sure if I should put my name in green in that cell or not. | |
|
|
I ran -q "2041898^65536+1" -nt 1 through the 64 bit app in Windows and it said it was probable prime. I wasn't sure if I should put my name in green in that cell or not.
Thanks, 288larsson already did that test, but always good to do a bit more testing!
____________
Twitter: IainBethune
Proud member of team "Aggie The Pew". Go Aggie!
3073428256125*2^1290000-1 is Prime! | |
|
|
I sent a request through the spreadsheet to get edit access. I have a Genefer 21 running with 14 threads. And it is using the correct app and currently 42% of my CPU as reported by Windows 10 math. Estimated completion time is in 16 hours. Link to the WU (I currently can't view it, but the number matches): http://www.primegrid.com/workunit.php?wuid=580282641
Below are my app_info.xml and app_config.xml. Always interested in constructive feedback.
<app_info>
<app>
<name>genefer</name>
<user_friendly_name>Genefer</user_friendly_name>
</app>
<file_info>
<name>genefer_windows64.exe</name>
<executable/>
</file_info>
<app_version>
<app_name>genefer</app_name>
<version_num>319</version_num>
<api_version>7.6.33</api_version>
<file_ref>
<file_name>genefer_windows64.exe</file_name>
<main_program/>
</file_ref>
</app_version>
</app_info>
<app_config>
<app_version>
<app_name>genefer</app_name>
<cmdline>-nt 14</cmdline>
<avg_ncpus>14</avg_ncpus>
</app_version>
</app_config>
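Since the thread count appears twice in app_config.xml (in the cmdline and in avg_ncpus), it can be convenient to generate the file so the two values never drift apart. A minimal sketch using Python's standard library; it only emits the elements shown in the config above, and the function name `build_app_config` is hypothetical.

```python
import xml.etree.ElementTree as ET


def build_app_config(app_name: str, threads: int) -> str:
    """Build an app_config.xml body with matching -nt and avg_ncpus values."""
    root = ET.Element("app_config")
    version = ET.SubElement(root, "app_version")
    ET.SubElement(version, "app_name").text = app_name
    ET.SubElement(version, "cmdline").text = f"-nt {threads}"
    ET.SubElement(version, "avg_ncpus").text = str(threads)
    if hasattr(ET, "indent"):  # pretty-printing is available from Python 3.9
        ET.indent(root)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(build_app_config("genefer", 14))
```

Calling `build_app_config("genefer", 14)` reproduces the structure of the app_config.xml posted above.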
| |
|
|
Want to help here, downloaded the Winx64 binary. What option do I select? I just double clicked it. | |
|
|
Want to help here, downloaded the Winx64 binary. What option do I select? I just double clicked it.
All the win64 tests are now taken care of. However, once you've downloaded the binary you can run it from the command line (example arguments are provided in the google sheet).
- Iain
____________
Twitter: IainBethune
Proud member of team "Aggie The Pew". Go Aggie!
3073428256125*2^1290000-1 is Prime! | |
|
|
I have a Genefer 21 running with 14 threads. And it is using the correct app and currently 42% of my CPU as reported by Windows 10 math.
Thanks, I added it to the google sheet. Thanks for posting the app_info and app_config too!
- Iain
____________
Twitter: IainBethune
Proud member of team "Aggie The Pew". Go Aggie!
3073428256125*2^1290000-1 is Prime! | |
|
Rafael Volunteer tester
 Send message
Joined: 22 Oct 14 Posts: 918 ID: 370496 Credit: 611,432,322 RAC: 719,902
                         
|
First batch of tasks is in
B9
C:\Users\Paulo\Desktop\Arquivos\Genefer>genefer_windows32.exe -q "2041898^65536+1"
genefer 3.3.4 (Windows/CPU/32-bit)
Copyright 2001-2018, Yves Gallot
Copyright 2009, Mark Rodenkirch, David Underbakke
Copyright 2010-2012, Shoichiro Yamada, Ken Brazier
Copyright 2011-2014, Michael Goetz, Ronald Schneider
Copyright 2011-2018, Iain Bethune
Genefer is free source code, under the MIT license.
Supported transform implementations: sse2 x87 f64
Command line: genefer_windows32.exe -q 2041898^65536+1
Low priority change succeeded.
Testing 2041898^65536+1...
Using SSE2 transform
Resuming 2041898^65536+1 from a checkpoint (778895 iterations left)
Estimated time remaining for 2041898^65536+1 is 3:12:00
Testing 2041898^65536+1... 502964 steps to go (2:05:10 remaining)
maxErr exceeded for 2041898^65536+1, 0.5000 > 0.4500
maxErr exceeded while using SSE2; switching to x87 (80-bit).
Testing 2041898^65536+1...
Using x87 (80-bit) transform
Resuming 2041898^65536+1 from a checkpoint (505705 iterations left)
Estimated time remaining for 2041898^65536+1 is 3:44:00
Testing 2041898^65536+1... 503511 steps to go (3:45:45 remaining)
Successful computation progress with x87 (80-bit); switching back to SSE2.
Testing 2041898^65536+1...
Using SSE2 transform
Resuming 2041898^65536+1 from a checkpoint (503473 iterations left)
maxErr exceeded for 2041898^65536+1, 0.5000 > 0.4500
maxErr exceeded while using SSE2; switching to x87 (80-bit).
Testing 2041898^65536+1...
Using x87 (80-bit) transform
Resuming 2041898^65536+1 from a checkpoint (503473 iterations left)
Estimated time remaining for 2041898^65536+1 is 3:44:00
Testing 2041898^65536+1... 501280 steps to go (3:44:52 remaining)
Successful computation progress with x87 (80-bit); switching back to SSE2.
Testing 2041898^65536+1...
Using SSE2 transform
Resuming 2041898^65536+1 from a checkpoint (501242 iterations left)
Estimated time remaining for 2041898^65536+1 is 2:01:00
Testing 2041898^65536+1... 283407 steps to go (1:09:06 remaining)
maxErr exceeded for 2041898^65536+1, 0.5000 > 0.4500
maxErr exceeded while using SSE2; switching to x87 (80-bit).
Testing 2041898^65536+1...
Using x87 (80-bit) transform
Resuming 2041898^65536+1 from a checkpoint (283810 iterations left)
Estimated time remaining for 2041898^65536+1 is 2:06:00
Testing 2041898^65536+1... 281588 steps to go (2:06:41 remaining)
Successful computation progress with x87 (80-bit); switching back to SSE2.
Testing 2041898^65536+1...
Using SSE2 transform
Resuming 2041898^65536+1 from a checkpoint (281586 iterations left)
Estimated time remaining for 2041898^65536+1 is 1:11:00
Testing 2041898^65536+1... 232700 steps to go (0:56:17 remaining)
maxErr exceeded for 2041898^65536+1, 0.5000 > 0.4500
maxErr exceeded while using SSE2; switching to x87 (80-bit).
Testing 2041898^65536+1...
Using x87 (80-bit) transform
Resuming 2041898^65536+1 from a checkpoint (236102 iterations left)
Estimated time remaining for 2041898^65536+1 is 1:45:00
Testing 2041898^65536+1... 233880 steps to go (1:45:10 remaining)
Successful computation progress with x87 (80-bit); switching back to SSE2.
Testing 2041898^65536+1...
Using SSE2 transform
Resuming 2041898^65536+1 from a checkpoint (233877 iterations left)
Estimated time remaining for 2041898^65536+1 is 0:56:40
Testing 2041898^65536+1... 232694 steps to go (0:56:06 remaining)
maxErr exceeded for 2041898^65536+1, 0.5000 > 0.4500
maxErr exceeded while using SSE2; switching to x87 (80-bit).
Too many errors with SSE2; Calculation will proceed using only more accurate transforms.
Testing 2041898^65536+1...
Using x87 (80-bit) transform
Resuming 2041898^65536+1 from a checkpoint (233877 iterations left)
Estimated time remaining for 2041898^65536+1 is 1:44:00
2041898^65536+1 is a probable prime. (413535 digits) (err = 0.4375) (time = 9:39:39) 06:17:42
So far, so good. The problem comes next, on the B21 test. The spreadsheet tells me it should complete with SSE2, but it hits maxErr and falls back to x87. It still reaches the correct result, but I figure I should mention it anyway:
C:\Users\Paulo\Desktop\Arquivos\Genefer>genefer_windows32.exe -q "1722230^131072"+1" -nt 1
genefer 3.3.4 (Windows/CPU/32-bit)
Copyright 2001-2018, Yves Gallot
Copyright 2009, Mark Rodenkirch, David Underbakke
Copyright 2010-2012, Shoichiro Yamada, Ken Brazier
Copyright 2011-2014, Michael Goetz, Ronald Schneider
Copyright 2011-2018, Iain Bethune
Genefer is free source code, under the MIT license.
Supported transform implementations: sse2 x87 f64
Command line: genefer_windows32.exe -q 1722230^131072+1 -nt 1
Low priority change succeeded.
Testing 1722230^131072+1...
Using SSE2 transform
Starting initialization...
Initialization complete (2.793 seconds).
Estimated time for 1722230^131072+1 is 23:00:00
Testing 1722230^131072+1... 2663044 steps to go (22:37:19 remaining)
maxErr exceeded for 1722230^131072+1, 0.5000 > 0.4500
maxErr exceeded while using SSE2; switching to x87 (80-bit).
Testing 1722230^131072+1...
Using x87 (80-bit) transform
Resuming 1722230^131072+1 from a checkpoint (2664224 iterations left)
Estimated time remaining for 1722230^131072+1 is 42:00:00
Testing 1722230^131072+1... 2663183 steps to go (42:18:22 remaining)
Successful computation progress with x87 (80-bit); switching back to SSE2.
Testing 1722230^131072+1...
Using SSE2 transform
Resuming 1722230^131072+1 from a checkpoint (2663173 iterations left)
maxErr exceeded for 1722230^131072+1, 0.5000 > 0.4500
maxErr exceeded while using SSE2; switching to x87 (80-bit).
Testing 1722230^131072+1...
Using x87 (80-bit) transform
Resuming 1722230^131072+1 from a checkpoint (2663173 iterations left)
Estimated time remaining for 1722230^131072+1 is 42:00:00
Testing 1722230^131072+1... 2662133 steps to go (42:18:35 remaining)
Successful computation progress with x87 (80-bit); switching back to SSE2.
Testing 1722230^131072+1...
Using SSE2 transform
Resuming 1722230^131072+1 from a checkpoint (2662123 iterations left)
Estimated time remaining for 1722230^131072+1 is 22:40:00
Testing 1722230^131072+1... 2652688 steps to go (22:27:47 remaining)
maxErr exceeded for 1722230^131072+1, 0.4688 > 0.4500
maxErr exceeded while using SSE2; switching to x87 (80-bit).
Testing 1722230^131072+1...
Using x87 (80-bit) transform
Resuming 1722230^131072+1 from a checkpoint (2654240 iterations left)
Estimated time remaining for 1722230^131072+1 is 41:50:00
Testing 1722230^131072+1... 2653200 steps to go (42:07:40 remaining)
Successful computation progress with x87 (80-bit); switching back to SSE2.
Testing 1722230^131072+1...
Using SSE2 transform
Resuming 1722230^131072+1 from a checkpoint (2653189 iterations left)
Estimated time remaining for 1722230^131072+1 is 22:50:00
Testing 1722230^131072+1... 2652696 steps to go (22:40:25 remaining)
maxErr exceeded for 1722230^131072+1, 0.4688 > 0.4500
maxErr exceeded while using SSE2; switching to x87 (80-bit).
Too many errors with SSE2; Calculation will proceed using only more accurate transforms.
Testing 1722230^131072+1...
Using x87 (80-bit) transform
Resuming 1722230^131072+1 from a checkpoint (2653189 iterations left)
Estimated time remaining for 1722230^131072+1 is 41:50:00
1722230^131072+1 is a probable prime. (817377 digits) (err = 0.4375) (time = 43:31:27) 01:50:58
I'm running the rest of the tests right now, but as you can see from completion times, it'll take a while. | |
|
|
The spreadsheet tells me it should complete with SSE2, but it hits maxErr and falls back to x87. It still reaches the correct result, but I figure I should mention it anyway.
Yes, I also observed this with the 32-bit build. If I remember correctly, some transforms have a slightly different implementation (there are some x86_64-specific optimisations) that also happens to make them more accurate, i.e. they have a higher b limit. As a result, the 64-bit test runs fine, and the 32-bit build switches back and forth between x87 and the faster transforms. Nothing to worry about.
- Iain
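
The switching pattern visible in the 32-bit logs (fall back to x87 when the round-off error exceeds 0.45, return to SSE2 once x87 has made progress, give up on SSE2 after repeated failures) can be summarised as a small state machine. This is only a sketch of the behaviour as it appears in the logs, not genefer's actual code: the class name, the `step` method and the `max_fast_failures` value are assumptions, and the logs do not state the real retry limit.

```python
MAX_ERR_LIMIT = 0.45  # threshold printed in the logs ("0.5000 > 0.4500")


class TransformSelector:
    """Toy model of the fast/accurate transform switching seen in the logs."""

    def __init__(self, fast="SSE2", safe="x87", max_fast_failures=4):
        # max_fast_failures is a hypothetical parameter chosen for illustration.
        self.fast = fast
        self.safe = safe
        self.max_fast_failures = max_fast_failures
        self.failures = 0
        self.fast_disabled = False
        self.current = fast

    def step(self, err: float) -> str:
        """Report the round-off error of the last block; return the next transform."""
        ok = err <= MAX_ERR_LIMIT
        if self.current == self.fast:
            if not ok:
                # maxErr exceeded: count the failure and fall back.
                self.failures += 1
                if self.failures >= self.max_fast_failures:
                    self.fast_disabled = True  # "Too many errors with SSE2"
                self.current = self.safe
        elif ok and not self.fast_disabled:
            # The accurate transform got past the trouble spot; retry the fast one.
            self.current = self.fast
        return self.current


if __name__ == "__main__":
    selector = TransformSelector()
    # SSE2 errors as in the B21 log; the 0.01 values for x87 are made up.
    for err in [0.5, 0.01, 0.5, 0.01, 0.4688, 0.01, 0.4688, 0.01]:
        print(err, "->", selector.step(err))
```

In the real logs each fallback also resumes from a checkpoint, which the sketch leaves out.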
____________
Twitter: IainBethune
Proud member of team "Aggie The Pew". Go Aggie!
3073428256125*2^1290000-1 is Prime! | |
|
Rafael Volunteer tester
 Send message
Joined: 22 Oct 14 Posts: 918 ID: 370496 Credit: 611,432,322 RAC: 719,902
                         
|
B22 done
C:\Users\Paulo\Desktop\Arquivos\Genefer>genefer_windows32.exe -q "1722230^131072"+1" -nt 2
genefer 3.3.4 (Windows/CPU/32-bit)
Copyright 2001-2018, Yves Gallot
Copyright 2009, Mark Rodenkirch, David Underbakke
Copyright 2010-2012, Shoichiro Yamada, Ken Brazier
Copyright 2011-2014, Michael Goetz, Ronald Schneider
Copyright 2011-2018, Iain Bethune
Genefer is free source code, under the MIT license.
Supported transform implementations: sse2 x87 f64
Command line: genefer_windows32.exe -q 1722230^131072+1 -nt 2
Low priority change succeeded.
Testing 1722230^131072+1...
Using SSE2 transform
Starting initialization...
Initialization complete (3.026 seconds).
Estimated time for 1722230^131072+1 is 23:00:00
Testing 1722230^131072+1... 2663051 steps to go (22:38:11 remaining)
maxErr exceeded for 1722230^131072+1, 0.5000 > 0.4500
maxErr exceeded while using SSE2; switching to x87 (80-bit).
Testing 1722230^131072+1...
Using x87 (80-bit) transform
Resuming 1722230^131072+1 from a checkpoint (2664264 iterations left)
Estimated time remaining for 1722230^131072+1 is 42:00:00
Testing 1722230^131072+1... 2663226 steps to go (42:21:18 remaining)
Successful computation progress with x87 (80-bit); switching back to SSE2.
Testing 1722230^131072+1...
Using SSE2 transform
Resuming 1722230^131072+1 from a checkpoint (2663215 iterations left)
maxErr exceeded for 1722230^131072+1, 0.5000 > 0.4500
maxErr exceeded while using SSE2; switching to x87 (80-bit).
Testing 1722230^131072+1...
Using x87 (80-bit) transform
Resuming 1722230^131072+1 from a checkpoint (2663215 iterations left)
Estimated time remaining for 1722230^131072+1 is 42:00:00
Testing 1722230^131072+1... 2662176 steps to go (42:21:50 remaining)
Successful computation progress with x87 (80-bit); switching back to SSE2.
Testing 1722230^131072+1...
Using SSE2 transform
Resuming 1722230^131072+1 from a checkpoint (2662166 iterations left)
Estimated time remaining for 1722230^131072+1 is 23:50:00
Testing 1722230^131072+1... 2652698 steps to go (22:33:21 remaining)
maxErr exceeded for 1722230^131072+1, 0.4688 > 0.4500
maxErr exceeded while using SSE2; switching to x87 (80-bit).
Testing 1722230^131072+1...
Using x87 (80-bit) transform
Resuming 1722230^131072+1 from a checkpoint (2654323 iterations left)
Estimated time remaining for 1722230^131072+1 is 41:50:00
Testing 1722230^131072+1... 2653282 steps to go (42:06:13 remaining)
Successful computation progress with x87 (80-bit); switching back to SSE2.
Testing 1722230^131072+1...
Using SSE2 transform
Resuming 1722230^131072+1 from a checkpoint (2653271 iterations left)
Estimated time remaining for 1722230^131072+1 is 22:40:00
Testing 1722230^131072+1... 2652709 steps to go (22:31:50 remaining)
maxErr exceeded for 1722230^131072+1, 0.4688 > 0.4500
maxErr exceeded while using SSE2; switching to x87 (80-bit).
Too many errors with SSE2; Calculation will proceed using only more accurate transforms.
Testing 1722230^131072+1...
Using x87 (80-bit) transform
Resuming 1722230^131072+1 from a checkpoint (2653271 iterations left)
Estimated time remaining for 1722230^131072+1 is 42:00:00
1722230^131072+1 is a probable prime. (817377 digits) (err = 0.4375) (time = 43:38:29) 21:31:24 | |
|
Rafael Volunteer tester
 Send message
Joined: 22 Oct 14 Posts: 918 ID: 370496 Credit: 611,432,322 RAC: 719,902
                         
|
I closed the cmd window by accident, so I've lost the data on the transform switching, but the last test completed successfully.
10037266^131072+1 is a probable prime. (917716 digits) (err = 0.0127) (time = 49:29:08) 23:00:33 | |
|
Honza Volunteer moderator Volunteer tester Project scientist Send message
Joined: 15 Aug 05 Posts: 1963 ID: 352 Credit: 6,419,927,798 RAC: 2,655,298
                                      
|
>geneferocl_windows.exe -q "250000^2097152+1"
geneferocl 3.3.4 (Windows/OpenCL/32-bit)
Running on platform 'NVIDIA CUDA', device 'GeForce RTX 2080', vendor 'NVIDIA Corporation', version 'OpenCL 1.2 CUDA' and driver '411.63'.
46 computeUnits @ 1860MHz, memSize=8192MB, cacheSize=736kB, cacheLineSize=128B, localMemSize=48kB, maxWorkGroupSize=1024.
Testing OCL2 transform...
Testing OCL3 transform...
Testing OCL4 transform...
Testing OCL5 transform...
Benchmarks completed (17.039 seconds).
Testing 250000^2097152+1...
Using OCL4 transform
Starting initialization...
Initialization complete (11.637 seconds).
Estimated time for 250000^2097152+1 is 7:15:00
____________
My stats | |
|
|
C40 validated.
____________
| |
|
288larsson Volunteer tester
 Send message
Joined: 17 Apr 10 Posts: 136 ID: 58815 Credit: 5,991,452,870 RAC: 3,262,552
                                   
|
Hi
E38 http://www.primegrid.com/result.php?resultid=935319520
E40 http://www.primegrid.com/result.php?resultid=935116556
E41 http://www.primegrid.com/result.php?resultid=934518695 | |
|
Michael Goetz Volunteer moderator Project administrator
 Send message
Joined: 21 Jan 10 Posts: 14044 ID: 53948 Credit: 482,306,103 RAC: 564,923
                               
|
Hi
E38 http://www.primegrid.com/result.php?resultid=935319520
E40 http://www.primegrid.com/result.php?resultid=935116556
E41 http://www.primegrid.com/result.php?resultid=934518695
I'm now officially impressed. Looking at the workunit for that third result (GFN21), your CPU version of the task was about a third faster than its GPU wingman. http://www.primegrid.com/workunit.php?wuid=581485233
____________
My lucky number is 75898^524288+1 | |
|
Yves Gallot Volunteer developer Project scientist Send message
Joined: 19 Aug 12 Posts: 843 ID: 164101 Credit: 306,554,426 RAC: 5,458

|
E41 http://www.primegrid.com/result.php?resultid=934518695
Looking at the workunit for that third result (GFN21), your CPU version of the task was about a third faster than its GPU wingman.
11 hours is faster than a GTX 1080!
... I must work on AVX-512. | |
|
robish Volunteer moderator Volunteer tester
 Send message
Joined: 7 Jan 12 Posts: 2223 ID: 126266 Credit: 7,973,006,583 RAC: 5,441,534
                               
|
E41 http://www.primegrid.com/result.php?resultid=934518695
Looking at the workunit for that third result (GFN21), your CPU version of the task was about a third faster than its GPU wingman.
11 hours is faster than a GTX 1080!
... I must work on AVX-512.
10 cores? Wow that's impressive! How? More efficient code? What wizardry is this? :)
____________
My lucky number 1059094^1048576+1 | |
|
Michael Goetz Volunteer moderator Project administrator
 Send message
Joined: 21 Jan 10 Posts: 14044 ID: 53948 Credit: 482,306,103 RAC: 564,923
                               
|
E41 http://www.primegrid.com/result.php?resultid=934518695
Looking at the workunit for that third result (GFN21), your CPU version of the task was about a third faster than its GPU wingman.
11 hours is faster than a GTX 1080!
... I must work on AVX-512.
10 cores? Wow that's impressive! How? More efficient code? What wizardry is this? :)
It's a $999 Core(TM) i9-7900X CPU. 10 cores, 20 threads, 140W.
____________
My lucky number is 75898^524288+1 | |
|
robish Volunteer moderator Volunteer tester
 Send message
Joined: 7 Jan 12 Posts: 2223 ID: 126266 Credit: 7,973,006,583 RAC: 5,441,534
                               
|
E41 http://www.primegrid.com/result.php?resultid=934518695
Looking at the workunit for that third result (GFN21), your CPU version of the task was about a third faster than its GPU wingman.
11 hours is faster than a GTX 1080!
... I must work on AVX-512.
10 cores? Wow that's impressive! How? More efficient code? What wizardry is this? :)
It's a $999 Core(TM) i9-7900X CPU. 10 cores, 20 threads, 140W.
Thanks Michael, on shopping list :)
____________
My lucky number 1059094^1048576+1 | |
|
|
You have an app_info.xml for this?
____________
DeleteNull | |
|
Yves Gallot Volunteer developer Project scientist Send message
Joined: 19 Aug 12 Posts: 843 ID: 164101 Credit: 306,554,426 RAC: 5,458

|
... I must work on AVX-512.
Early versions of Skylake-X had a single FP-512 unit, so their throughput is 1 FP-512 operation or 2 FP-256 operations per clock cycle (AVX-512 was not a real improvement).
But newer Skylake-X CPUs have two FP-512 units. The 10-core i9-7900X can support 10*16 DP calculations per cycle (by comparison, the i7-8700 can support 6*8).
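The per-cycle figures above can be reproduced with a little arithmetic. This is only a sketch: it assumes two vector FP units per core and one double-precision lane per 64 bits of vector width, and the function name and parameters are made up for illustration (they are not part of genefer).

```python
def dp_lanes_per_cycle(cores: int, vector_bits: int, fp_units_per_core: int = 2) -> int:
    """Double-precision lanes the whole chip can issue per clock cycle.

    Assumption: every core has `fp_units_per_core` full-width vector FP units.
    """
    lanes_per_unit = vector_bits // 64  # a double is 64 bits wide
    return cores * fp_units_per_core * lanes_per_unit


i9_7900x = dp_lanes_per_cycle(cores=10, vector_bits=512)  # 10 * 16
i7_8700 = dp_lanes_per_cycle(cores=6, vector_bits=256)    # 6 * 8
print(i9_7900x, i7_8700, round(i9_7900x / i7_8700, 2))
```

The ratio ignores clock frequency and memory bandwidth, so it is better read as an upper bound on the relative throughput than as a prediction of run time.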
| |
|
mackerel Volunteer tester
 Send message
Joined: 2 Oct 08 Posts: 2652 ID: 29980 Credit: 570,442,335 RAC: 5,621
                              
|
Early versions of Skylake-X had a single FP-512 unit, so their throughput is 1 FP-512 operation or 2 FP-256 operations per clock cycle (AVX-512 was not a real improvement).
But newer Skylake-X CPUs have two FP-512 units. The 10-core i9-7900X can support 10*16 DP calculations per cycle (by comparison, the i7-8700 can support 6*8).
Early communication from Intel was that Skylake-X had a single AVX-512 unit, but once people got their hands on them they saw it had two. As far as I'm aware, they all have two units.
Looking forward to any performance increases, but I hope we don't hit a RAM bandwidth limitation too hard.
Similar Xeons may have one or two units. It will be interesting to see what Intel's strategy will be for future CPUs including AVX-512.
Edit: looking up Intel's page for the 7800X as an example, it does explicitly list having two AVX-512 units:
https://ark.intel.com/products/123589/Intel-Core-i7-7800X-X-series-Processor-8_25M-Cache-up-to-4_00-GHz | |
|
Yves Gallot Volunteer developer Project scientist Send message
Joined: 19 Aug 12 Posts: 843 ID: 164101 Credit: 306,554,426 RAC: 5,458

|
E41 http://www.primegrid.com/result.php?resultid=934518695
Is Linux better than Windows with multithreading?
Windows is always doing something, so I noticed that you should not run on all cores.
And on an 8-core Ryzen with -nt 7, if no other application is running the computation time is about 99,000 seconds, but if geneferocl (GFN20) is running at the same time, the computation time is about 117,000 seconds.
The performance is very sensitive to other applications even if the overall load is always lower than 100%. | |
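To put those two timings in perspective, here is a quick calculation using only the numbers quoted in the post (the variable names are just for illustration):

```python
alone_s = 99_000          # -nt 7 on the 8-core Ryzen, nothing else running
with_gpu_app_s = 117_000  # same run with geneferocl (GFN20) alongside

slowdown = with_gpu_app_s / alone_s - 1
extra_hours = (with_gpu_app_s - alone_s) / 3600
print(f"{slowdown:.1%} slower, {extra_hours:.1f} h extra")
```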
|
Honza Volunteer moderator Volunteer tester Project scientist Send message
Joined: 15 Aug 05 Posts: 1963 ID: 352 Credit: 6,419,927,798 RAC: 2,655,298
                                      
|
Is Linux better than Windows with multithreading?
Yes.
This is usually heard from Linux guys but anyway...
It was observed using the LLR app with many-core AMD CPUs, and on Intel dual-CPU systems as well.
The more cores, the bigger the difference, it seems.
____________
My stats | |
|
Michael Goetz Volunteer moderator Project administrator
 Send message
Joined: 21 Jan 10 Posts: 14044 ID: 53948 Credit: 482,306,103 RAC: 564,923
                               
|
Genefer 3.3.4 is now live.
It is BOINC version 3.20.
OCL apps are unchanged and remain as 3.19 (3.3.3).
____________
My lucky number is 75898^524288+1 | |
|
|
Thanks everyone for helping out with testing! Manual credit has now been applied to your account!
Cheers
- Iain
____________
Twitter: IainBethune
Proud member of team "Aggie The Pew". Go Aggie!
3073428256125*2^1290000-1 is Prime! | |
|
Michael Goetz Volunteer moderator Project administrator
 Send message
Joined: 21 Jan 10 Posts: 14044 ID: 53948 Credit: 482,306,103 RAC: 564,923
                               
|
This wasn't needed during testing with app_info, but now that we're in production and you can use app_config instead, you'll need to include a <plan_class> tag:
<app_config>
  <app_version>
    <app_name>genefer</app_name>
    <cmdline>-nt 4</cmdline>
    <plan_class>cpuGFN21</plan_class>
    <avg_ncpus>4</avg_ncpus>
  </app_version>
</app_config>
App names and plan_class names can be found on our applications page.
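If you want to sanity-check an app_config.xml before dropping it into the project directory, something like the following could help. It is only a sketch: the function name is made up, and the rule that <avg_ncpus> should equal the -nt thread count is an assumption based on the example above, not something BOINC enforces.

```python
import re
import xml.etree.ElementTree as ET

APP_CONFIG = """\
<app_config>
  <app_version>
    <app_name>genefer</app_name>
    <cmdline>-nt 4</cmdline>
    <plan_class>cpuGFN21</plan_class>
    <avg_ncpus>4</avg_ncpus>
  </app_version>
</app_config>
"""


def check_app_versions(xml_text: str) -> list:
    """Return a list of human-readable problems; an empty list means no issue found."""
    problems = []
    root = ET.fromstring(xml_text)
    for av in root.iter("app_version"):
        name = av.findtext("app_name", default="?")
        # The post says <plan_class> is required when using app_config.
        if not av.findtext("plan_class"):
            problems.append(f"{name}: missing <plan_class>")
        # Assumption: the CPU budget should match genefer's thread count.
        match = re.search(r"-nt\s+(\d+)", av.findtext("cmdline", default=""))
        ncpus = av.findtext("avg_ncpus")
        if match and ncpus is not None and float(ncpus) != int(match.group(1)):
            problems.append(f"{name}: -nt {match.group(1)} but avg_ncpus is {ncpus}")
    return problems


print(check_app_versions(APP_CONFIG))
```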
____________
My lucky number is 75898^524288+1 | |
|
Jay Send message
Joined: 27 Feb 10 Posts: 136 ID: 56067 Credit: 65,857,807 RAC: 16,115
                    
|
Based on my testing with Prime95 and LLR, it looks like the best multi-threaded performance comes when the total working data fits inside the L3 cache. Above that, you run into RAM bandwidth limits. Much below that, you lose efficiency to multi-threading overhead in some way.
Based on this, is there an easy place I can refer to in order to see the size of the "total working data" for each project here (but especially interested in SOB) and how fast it's been increasing?
Thanks,
Jay
| |
|
|
Forgive me if this was asked and answered.
Is the code for multithreading dependent on AVX/FMA3 being present?
I tried running on a few Macs. With FMA3 it ran happily on multiple cores. With only SSE4 available, it would only run single-threaded.
| |
|
Yves Gallot Volunteer developer Project scientist Send message
Joined: 19 Aug 12 Posts: 843 ID: 164101 Credit: 306,554,426 RAC: 5,458

|
Is the code for multithreading dependent on AVX/FMA3 being present?
I tried running on a few Macs. With FMA3 it ran happily on multiple cores. With only SSE4 available, it would only run single-threaded.
Yes, AVX is required for the multithreaded version.
I'm working on a more generic version that will be able to run on any vector size (128, 256 or 512 bits).
| |
|
|
For those who are also running a GPU GFN21, you may want to consider an <app> section as well. This is from a dual-GPU machine, so <max_concurrent> is set to 3, but presumably it would be the number of GPUs + however many CPU instances you want (in my case 1).
<app>
  <name>genefer</name>
  <max_concurrent>3</max_concurrent>
</app>
<app_version>
  <app_name>genefer</app_name>
  <cmdline>-nt 4</cmdline>
  <plan_class>cpuGFN21</plan_class>
  <avg_ncpus>4</avg_ncpus>
</app_version>
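In a complete app_config.xml, both sections sit under a single <app_config> root. Here is a rough sketch that generates such a file from the "number of GPUs + CPU instances" rule described above. The function and its parameters are hypothetical; the app name, plan class and tags are the ones from this thread.

```python
import xml.etree.ElementTree as ET


def build_app_config(num_gpus: int, cpu_instances: int, threads: int) -> str:
    """Build app_config.xml text for genefer with one <app> and one <app_version>."""
    root = ET.Element("app_config")

    # Limit concurrent genefer tasks: one per GPU plus the CPU instances.
    app = ET.SubElement(root, "app")
    ET.SubElement(app, "name").text = "genefer"
    ET.SubElement(app, "max_concurrent").text = str(num_gpus + cpu_instances)

    # Multi-threaded CPU app: pass -nt and budget the same number of CPUs.
    av = ET.SubElement(root, "app_version")
    ET.SubElement(av, "app_name").text = "genefer"
    ET.SubElement(av, "cmdline").text = f"-nt {threads}"
    ET.SubElement(av, "plan_class").text = "cpuGFN21"
    ET.SubElement(av, "avg_ncpus").text = str(threads)

    ET.indent(root, space="  ")  # requires Python 3.9 or later
    return ET.tostring(root, encoding="unicode")


# Dual-GPU machine with one 4-thread CPU instance, as in the post.
print(build_app_config(num_gpus=2, cpu_instances=1, threads=4))
```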
This wasn't needed during testing with app_info, but now that we're in production and you can use app_config instead, you'll need to include a <plan_class> tag:
<app_config>
  <app_version>
    <app_name>genefer</app_name>
    <cmdline>-nt 4</cmdline>
    <plan_class>cpuGFN21</plan_class>
    <avg_ncpus>4</avg_ncpus>
  </app_version>
</app_config>
App names and plan_class names can be found on our applications page.
| |
|
Yves Gallot Volunteer developer Project scientist Send message
Joined: 19 Aug 12 Posts: 843 ID: 164101 Credit: 306,554,426 RAC: 5,458

|
For those who are also running a GPU GFN21, you may want to consider an <app> section as well.
Running one GPU app and one "cpuGFN21 -nt 4" app on a computer, I noticed that after a reboot (Windows update) BOINC downloads 4 CPU tasks ("store at least" = "store up to an additional" = 0). Once they have been tested, it just downloads one task at a time until the next reboot...
Does "max_concurrent" fix this issue? | |
|
dukebgVolunteer tester
 Send message
Joined: 21 Nov 17 Posts: 242 ID: 950482 Credit: 23,670,125 RAC: 0
                  
|
For those who are also running a GPU GFN21, you may want to consider an <app> section as well.
Running one GPU app and one "cpuGFN21 -nt 4" app on a computer, I noticed that after a reboot (Windows update) BOINC downloads 4 CPU tasks ("store at least" = "store up to an additional" = 0). Once they have been tested, it just downloads one task at a time until the next reboot...
Does "max_concurrent" fix this issue?
It can, yes, but a better solution may be to set your "use CPUs" to 25%.
The issue is basically a BOINC bug. It happens when you run out of tasks of this kind. As far as I remember without looking at the code, the config values the scheduler knows about are stored with the configs of the current tasks, in those previously loaded variables or something. So if there are no current tasks, it assumes each one takes 1 CPU and schedules a download of one task per core. Maybe a reboot clears those internal variables too. | |
|