Genefer 3.2.0 testing thread
Hi everyone,
I have nearly completed a new version of genefer (currently called 3.2.0beta). You can download Windows, Linux and Mac binaries from https://www.assembla.com/code/genefer/subversion/nodes. Look in the trunk/bin/<os> directory for the latest set of binaries. The code is reasonably well tested now (thanks especially to Ron and Roger), and I'd like to see it tested standalone, with PRPNet, or with BOINC using app_info.xml.
What you should expect to see:
- Instead of many different apps (genefx64, genefersse3, geneferavx etc.), genefer 3.2.0 has only a 32-bit and a 64-bit CPU app. Each contains a range of different transforms and will use the fastest available transform for the particular b^2^n+1 being tested.
- Note that this choice of transform might not always be what you expect (e.g. SSE2 can be faster than SSE3 on some CPUs, and SSE3 may be better than AVX, especially on AMD CPUs).
- The performance of genefer 3.2.0 should be very similar to 3.1.2 except for some minor effects caused by changing compilers and by running 64-bit vs 32-bit. If you see big differences in performance, please let me know.
- The b 'limits' may change slightly between 3.1.2 and 3.2.0, but in any case they are not hard limits, and you may be able to complete tests successfully beyond the b limit for a given N. If a round-off error occurs, genefer 3.2.0 will automatically try a slower transform with a higher b limit.
- The CUDA and OpenCL apps are essentially unchanged from 3.1.2.
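The "fastest transform first, fall back on round-off error" behaviour described in the list above can be sketched roughly as follows. This is only an illustration under stated assumptions, not genefer's actual code: `pick_order`, `run_with_fallback` and the `run_test` callback are hypothetical names, the timings are example values, and the 0.45 threshold is taken from the "0.5000 > 0.4500" messages in genefer's log output.

```python
from typing import Callable, Dict, List, Optional

MAX_ERR = 0.45  # round-off threshold as printed in genefer's "maxErr exceeded" messages


def pick_order(bench_times: Dict[str, float]) -> List[str]:
    """Return transform names sorted fastest first (smallest estimated time)."""
    return sorted(bench_times, key=lambda name: bench_times[name])


def run_with_fallback(
    bench_times: Dict[str, float],
    run_test: Callable[[str], float],
) -> Optional[str]:
    """Try each transform, fastest first, until one stays at or under MAX_ERR.

    run_test is a hypothetical callback: it runs the test with the named
    transform and returns the largest round-off error it observed.
    """
    for name in pick_order(bench_times):
        if run_test(name) <= MAX_ERR:
            return name  # this transform finished without excessive round-off
    return None  # every available transform exceeded the threshold


if __name__ == "__main__":
    # Example timings only (arbitrary units).
    times = {"sse3": 44152, "sse2": 32853, "x87": 106023, "default": 53909}
    # Pretend both SIMD transforms overflow and the remaining ones are fine.
    fake_errors = {"sse2": 0.5, "sse3": 0.5, "default": 0.30, "x87": 0.01}
    print(run_with_fallback(times, fake_errors.__getitem__))
```

The sketch simply moves to the next-fastest transform; the real app may order its fallbacks differently.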
Please post your test results to this thread (both successes and failures, please).
Cheers
- Iain
____________
Twitter: IainBethune
Proud member of team "Aggie The Pew". Go Aggie!
3073428256125*2^1290000-1 is Prime!
| |
|
Roger Volunteer developer Volunteer tester
Joined: 27 Nov 11 Posts: 1138 ID: 120786 Credit: 268,668,824 RAC: 0
|
Roger,
Work is actively proceeding on a combined genefer CPU app which supports all the transforms in a single executable. We'll want to put in support for that single executable into PRPNet, rather than supporting multiple versions of genefer. Possibly two versions, 64-bit and 32-bit, but no more than that.
Received exe of the alpha combined app from Iain Bethune. First version crashed. Latest version looks like it works!
Testing continues:
Testing 771564^524288+1...
Using SSE2 transform
@Iain: It says it's using the SSE2 transform on my CPU (AMD 1100T). Is that the same as genefx64? genefx64 has a B-limit of 735,000 at N=19 for Intel and AMD, and if so my current test of b=771564 should maxErr.
Is there an SSE3 code path for the combined genefer? SSE3 genefer has a B-limit of 885,000 for Intel and 870,000 for AMD at N=19.
At N=19, b=741302 SSE3 is by far the fastest for my CPU.
At N=20, b=142972 genefx64 is the fastest for my CPU.
At N=22, b=9928 SSE3 is again the fastest for my CPU.
http://www.primegrid.com/forum_thread.php?id=4889&nowrap=true#64127
I can hear people saying that N=22 is not practical on a CPU. That's not the point: we're testing software.
|
Michael Goetz Volunteer moderator Project administrator
Joined: 21 Jan 10 Posts: 14043 ID: 53948 Credit: 481,060,436 RAC: 494,925
|
Roger wrote:
@Iain: It says it's using the SSE2 transform on my CPU (AMD 1100T). Is that the same as genefx64? genefx64 has a B-limit of 735,000 at N=19 for Intel and AMD, and if so my current test of b=771564 should maxErr.
Some of the questions I can answer, even though Iain did all the work on the combined version.
Genefx64 is an older program, and the transforms are different even if they use the same instruction set.
As for b limits, genefx64 is a different program from the new genefer and will therefore have different limits. On top of that, the limits themselves are only estimates, not hard limits. Some tests will succeed beyond the reported limits, and some won't.
I'll let Iain handle the other questions.
____________
My lucky number is 75898^524288+1
|
Roger Volunteer developer Volunteer tester
Joined: 27 Nov 11 Posts: 1138 ID: 120786 Credit: 268,668,824 RAC: 0
|
Well then let's test it:
>genefer_windows64_Copy.exe -x SSE2 -l
genefer 3.2.0beta-0 (Windows/CPU/64-bit)
Supported transform implementations: default 128 sse2 x87 sse3
Copyright 2001-2013, Yves Gallot
Copyright 2009, Mark Rodenkirch, David Underbakke
Copyright 2010-2012, Shoichiro Yamada, Ken Brazier
Copyright 2011-2013, Iain Bethune, Michael Goetz, Ronald Schneider
Command line: genefer_windows64_Copy.exe -x SSE2 -l
Priority change succeeded.
Generalized Fermat Number b Limits
Running limits test for transform implementation "SSE2"
Checking limit for m = 256. Testing b = 7715000
I still have to run the limit and benchmark tests with -x sse2 and -x sse3. Results to follow.
|
rroonnaalldd Volunteer developer Volunteer tester
Joined: 3 Jul 09 Posts: 1213 ID: 42893 Credit: 34,634,263 RAC: 0
|
"genefersse3.64.limits" wrote: Generalized Fermat Number b Limits
The upper bound m = 256, b = 7600000, Err = 0.2812
Starting b = 9470000, Err b = 7605000, Err = 0.2969, 5 Err b = 0
The upper bound m = 512, b = 6595000, Err = 0.2812
Starting b = 7690000, Err b = 6600000, Err = 0.3125, 5 Err b = 0
The upper bound m = 1024, b = 5250000, Err = 0.2812
Starting b = 6250000, Err b = 5255000, Err = 0.3438, 5 Err b = 0
The upper bound m = 2048, b = 4355000, Err = 0.2812
Starting b = 5070000, Err b = 4360000, Err = 0.3125, 5 Err b = 0
The upper bound m = 4096, b = 3485000, Err = 0.2891
Starting b = 4120000, Err b = 3490000, Err = 0.3125, 5 Err b = 0
The upper bound m = 8192, b = 2885000, Err = 0.2969
Starting b = 3340000, Err b = 2890000, Err = 0.3125, 5 Err b = 0
The upper bound m = 16384, b = 2340000, Err = 0.2969
Starting b = 2720000, Err b = 2345000, Err = 0.3125, 5 Err b = 0
The upper bound m = 32768, b = 1955000, Err = 0.2812
Starting b = 2200000, Err b = 1960000, Err = 0.3125, 5 Err b = 0
The upper bound m = 65536, b = 1600000, Err = 0.2969
Starting b = 1790000, Err b = 1605000, Err = 0.3125, 5 Err b = 0
The upper bound m = 131072, b = 1305000, Err = 0.2812
Starting b = 1450000, Err b = 1310000, Err = 0.3125, 5 Err b = 0
The upper bound m = 262144, b = 1065000, Err = 0.2969
Starting b = 1180000, Err b = 1070000, Err = 0.3125, 5 Err b = 0
The upper bound m = 524288, b = 890000, Err = 0.2969
Starting b = 960000, Err b = 895000, Err = 0.3594, 5 Err b = 0
The upper bound m = 1048576, b = 720000, Err = 0.2969
Starting b = 780000, Err b = 725000, Err = 0.3125, 5 Err b = 0
The upper bound m = 2097152, b = 595000, Err = 0.2969
Starting b = 630000, Err b = 600000, Err = 0.3281, 5 Err b = 0
The upper bound m = 4194304, b = 495000, Err = 0.2969
Starting b = 510000, Err b = 500000, Err = 0.3281, 5 Err b = 0
real 1m26.560s
user 1m25.633s
sys 0m0.812s
"genefersse2.64.limits" wrote: Generalized Fermat Number b Limits
The upper bound m = 256, b = 6005000, Err = 0.2656
Starting b = 9470000, Err b = 6010000, Err = 0.3125, 5 Err b = 8085000
The upper bound m = 512, b = 5170000, Err = 0.2812
Starting b = 7690000, Err b = 5175000, Err = 0.3125, 5 Err b = 7385000
The upper bound m = 1024, b = 4235000, Err = 0.2812
Starting b = 6250000, Err b = 4240000, Err = 0.3125, 5 Err b = 6065000
The upper bound m = 2048, b = 3470000, Err = 0.2812
Starting b = 5070000, Err b = 3475000, Err = 0.3125, 5 Err b = 4530000
The upper bound m = 4096, b = 2905000, Err = 0.2812
Starting b = 4120000, Err b = 2910000, Err = 0.3125, 5 Err b = 3735000
The upper bound m = 8192, b = 2180000, Err = 0.2812
Starting b = 3340000, Err b = 2185000, Err = 0.3125, 5 Err b = 2990000
The upper bound m = 16384, b = 1860000, Err = 0.2812
Starting b = 2720000, Err b = 1865000, Err = 0.3125, 5 Err b = 2530000
The upper bound m = 32768, b = 1540000, Err = 0.2812
Starting b = 2200000, Err b = 1545000, Err = 0.3125, 5 Err b = 1990000
The upper bound m = 65536, b = 1240000, Err = 0.2812
Starting b = 1790000, Err b = 1245000, Err = 0.3125, 5 Err b = 1675000
The upper bound m = 131072, b = 1060000, Err = 0.2812
Starting b = 1450000, Err b = 1065000, Err = 0.3125, 5 Err b = 1335000
The upper bound m = 262144, b = 870000, Err = 0.2812
Starting b = 1180000, Err b = 875000, Err = 0.3125, 5 Err b = 1115000
The upper bound m = 524288, b = 735000, Err = 0.2812
Starting b = 960000, Err b = 740000, Err = 0.3125, 5 Err b = 925000
The upper bound m = 1048576, b = 615000, Err = 0.2812
Starting b = 780000, Err b = 620000, Err = 0.3125, 5 Err b = 755000
The upper bound m = 2097152, b = 495000, Err = 0.2812
Starting b = 630000, Err b = 500000, Err = 0.3125, 5 Err b = 595000
The upper bound m = 4194304, b = 435000, Err = 0.3125
Starting b = 510000, Err b = 440000, Err = 0.3438, 5 Err b = 0
real 5m25.765s
user 5m24.124s
sys 0m1.220s
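For anyone who wants to compare limits output between machines or transforms, here is a minimal parsing sketch. It assumes only the "The upper bound m = ..., b = ..., Err = ..." line format shown above; the function name `parse_limits` is made up for illustration.

```python
import re
from typing import Dict

# Matches lines such as: "The upper bound m = 256, b = 7600000, Err = 0.2812"
LIMIT_RE = re.compile(r"The upper bound m = (\d+), b = (\d+), Err = ([0-9.]+)")


def parse_limits(text: str) -> Dict[int, int]:
    """Map each transform length m to the reported upper bound on b."""
    return {int(m): int(b) for m, b, _err in LIMIT_RE.findall(text)}


if __name__ == "__main__":
    sse3 = parse_limits("The upper bound m = 524288, b = 890000, Err = 0.2969")
    sse2 = parse_limits("The upper bound m = 524288, b = 735000, Err = 0.2812")
    # Difference between the two estimated limits at m = 524288.
    print(sse3[524288] - sse2[524288])  # prints 155000
```

Bear in mind that these bounds are estimates, so a difference between two runs is indicative rather than exact.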
____________
Best wishes. Knowledge is power. by jjwhalen
| |
|
Roger Volunteer developer Volunteer tester
Joined: 27 Nov 11 Posts: 1138 ID: 120786 Credit: 268,668,824 RAC: 0
|
"genefersse3.64.limits" (AMD 1100T) wrote:
Generalized Fermat Number b Limits
The upper bound m = 256, b = 7600000, Err = 0.2812
Starting b = 9470000, Err b = 7605000, Err = 0.2969, 5 Err b = 0
The upper bound m = 512, b = 6595000, Err = 0.2812
Starting b = 7690000, Err b = 6600000, Err = 0.3125, 5 Err b = 0
The upper bound m = 1024, b = 5250000, Err = 0.2812
Starting b = 6250000, Err b = 5255000, Err = 0.3438, 5 Err b = 0
The upper bound m = 2048, b = 4355000, Err = 0.2812
Starting b = 5070000, Err b = 4360000, Err = 0.3125, 5 Err b = 0
The upper bound m = 4096, b = 3485000, Err = 0.2891
Starting b = 4120000, Err b = 3490000, Err = 0.3125, 5 Err b = 0
The upper bound m = 8192, b = 2885000, Err = 0.2969
Starting b = 3340000, Err b = 2890000, Err = 0.3125, 5 Err b = 0
The upper bound m = 16384, b = 2340000, Err = 0.2969
Starting b = 2720000, Err b = 2345000, Err = 0.3125, 5 Err b = 0
The upper bound m = 32768, b = 1955000, Err = 0.2812
Starting b = 2200000, Err b = 1960000, Err = 0.3125, 5 Err b = 0
The upper bound m = 65536, b = 1600000, Err = 0.2969
Starting b = 1790000, Err b = 1605000, Err = 0.3125, 5 Err b = 0
The upper bound m = 131072, b = 1345000, Err = 0.2969
Starting b = 1450000, Err b = 1350000, Err = 0.3438, 5 Err b = 0
The upper bound m = 262144, b = 1085000, Err = 0.2969
Starting b = 1180000, Err b = 1090000, Err = 0.3125, 5 Err b = 0
The upper bound m = 524288, b = 870000, Err = 0.2812
Starting b = 960000, Err b = 875000, Err = 0.3125, 5 Err b = 0
The upper bound m = 1048576, b = 725000, Err = 0.2969
Starting b = 780000, Err b = 730000, Err = 0.3281, 5 Err b = 0
The upper bound m = 2097152, b = 595000, Err = 0.2969
Starting b = 630000, Err b = 600000, Err = 0.3125, 5 Err b = 0
The upper bound m = 4194304, b = 490000, Err = 0.2969
Starting b = 510000, Err b = 495000, Err = 0.3281, 5 Err b = 0
Testing genefersse2.64.limits is super slow. Needs looking at.
Command line: genefer_windows64.exe -x SSE2 -b
Priority change succeeded.
Generalized Fermat Number Bench
Running benchmarks for transform implementation "SSE2"
6008024^256+1 Time: 0 us/mul. Err: 0.2500 1736 digits
4913974^512+1 Time: 29.3 us/mul. Err: 0.2500 3427 digits
4019150^1024+1 Time: 0 us/mul. Err: 0.2500 6763 digits
3287270^2048+1 Time: 7.81 us/mul. Err: 0.2500 13347 digits
2688666^4096+1 Time: 30.5 us/mul. Err: 0.2500 26336 digits
2199064^8192+1 Time: 143 us/mul. Err: 0.3125 51956 digits
1798620^16384+1 Time: 307 us/mul. Err: 0.1562 102481 digits
1471094^32768+1 Time: 651 us/mul. Err: 0.1562 202102 digits
1203210^65536+1 Time: 1.4 ms/mul. Err: 0.1562 398482 digits
984108^131072+1 Time: 3.02 ms/mul. Err: 0.1562 785521 digits
804904^262144+1 Time: 7.19 ms/mul. Err: 0.1406 1548156 digits
658332^524288+1 Time: 17.5 ms/mul. Err: 0.1328 3050541 digits
538452^1048576+1 Time: 36.4 ms/mul. Err: 0.1406 6009544 digits
440400^2097152+1 Time: 75.2 ms/mul. Err: 0.1250 11836006 digits
360204^4194304+1 Time: 198 ms/mul. Err: 0.1250 23305854 digits
>genefer_windows64.exe -x SSE2 -b3
genefer 3.2.0beta-0 (Windows/CPU/64-bit)
Supported transform implementations: default 128 sse2 x87 sse3
Copyright 2001-2013, Yves Gallot
Copyright 2009, Mark Rodenkirch, David Underbakke
Copyright 2010-2012, Shoichiro Yamada, Ken Brazier
Copyright 2011-2013, Iain Bethune, Michael Goetz, Ronald Schneider
Command line: genefer_windows64.exe -x SSE2 -b3
Priority change succeeded.
Running benchmarks for transform implementation "SSE2"
14^32768+1 37557 digits 0 days 0.0 hours (0.64 ms/mul, 32769 iterations) 77 GFLOPS
75898^32768+1 159916 digits 0 days 0.0 hours (0.66 ms/mul, 32782 iterations) 77 GFLOPS
700000^32768+1 191533 digits 0 days 0.0 hours (0.64 ms/mul, 163853 iterations) 387 GFLOPS
5000000^32768+1 219512 digits 0 days 0.0 hours (0.66 ms/mul, 196623 iterations) 464 GFLOPS
14^65536+1 75113 digits 0 days 0.0 hours (1.40 ms/mul, 65537 iterations) 326 GFLOPS
75898^65536+1 319831 digits 0 days 0.0 hours (1.40 ms/mul, 65550 iterations) 326 GFLOPS
710000^65536+1 383469 digits 0 days 0.1 hours (1.39 ms/mul, 262158 iterations) 1306 GFLOPS
2500000^65536+1 419296 digits 0 days 0.1 hours (1.40 ms/mul, 327695 iterations) 1632 GFLOPS
14^131072+1 150226 digits 0 days 0.1 hours (3.06 ms/mul, 131073 iterations) 1374 GFLOPS
75898^131072+1 639662 digits 0 days 0.1 hours (3.03 ms/mul, 131086 iterations) 1375 GFLOPS
700000^131072+1 766129 digits 0 days 0.5 hours (3.03 ms/mul, 655373 iterations) 6872 GFLOPS
1000000^131072+1 786432 digits 0 days 0.6 hours (3.03 ms/mul, 786444 iterations) 8246 GFLOPS
14^262144+1 300451 digits 0 days 0.5 hours (7.10 ms/mul, 262145 iterations) 5772 GFLOPS
75898^262144+1 1279324 digits 0 days 0.5 hours (7.13 ms/mul, 262158 iterations) 5773 GFLOPS
468750^262144+1 1486604 digits 0 days 0.5 hours (7.22 ms/mul, 262160 iterations) 5773 GFLOPS
815000^262144+1 1549575 digits 0 days 1.6 hours (7.33 ms/mul, 786447 iterations) 17318 GFLOPS
14^524288+1 600902 digits 0 days 2.7 hours (18.88 ms/mul, 524289 iterations) 24189 GFLOPS
75898^524288+1 2558647 digits 0 days 2.7 hours (18.88 ms/mul, 524302 iterations) 24190 GFLOPS
468750^524288+1 2973207 digits 0 days 2.7 hours (18.88 ms/mul, 524304 iterations) 24190 GFLOPS
710000^524288+1 3067745 digits 0 days 11.0 hours (18.94 ms/mul, 2097166 iterations) 96758 GFLOPS
>genefer_windows64.exe -x SSE3 -b
genefer 3.2.0beta-0 (Windows/CPU/64-bit)
Supported transform implementations: default 128 sse2 x87 sse3
Copyright 2001-2013, Yves Gallot
Copyright 2009, Mark Rodenkirch, David Underbakke
Copyright 2010-2012, Shoichiro Yamada, Ken Brazier
Copyright 2011-2013, Iain Bethune, Michael Goetz, Ronald Schneider
Command line: genefer_windows64.exe -x SSE3 -b
Priority change succeeded.
Generalized Fermat Number Bench
Running benchmarks for transform implementation "SSE3"
6008024^256+1 Time: 0 us/mul. Err: 0.1484 1736 digits
4913974^512+1 Time: 0 us/mul. Err: 0.1562 3427 digits
4019150^1024+1 Time: 15.6 us/mul. Err: 0.1562 6763 digits
3287270^2048+1 Time: 22.9 us/mul. Err: 0.1562 13347 digits
2688666^4096+1 Time: 61 us/mul. Err: 0.1875 26336 digits
2199064^8192+1 Time: 130 us/mul. Err: 0.1875 51956 digits
1798620^16384+1 Time: 313 us/mul. Err: 0.2188 102481 digits
1471094^32768+1 Time: 663 us/mul. Err: 0.2109 202102 digits
1203210^65536+1 Time: 1.58 ms/mul. Err: 0.2031 398482 digits
984108^131072+1 Time: 3.31 ms/mul. Err: 0.1875 785521 digits
804904^262144+1 Time: 8.09 ms/mul. Err: 0.1875 1548156 digits
658332^524288+1 Time: 19.4 ms/mul. Err: 0.1953 3050541 digits
538452^1048576+1 Time: 46.1 ms/mul. Err: 0.1875 6009544 digits
440400^2097152+1 Time: 100 ms/mul. Err: 0.1797 11836006 digits
360204^4194304+1 Time: 241 ms/mul. Err: 0.1758 23305854 digits
>genefer_windows64.exe -x SSE3 -b3
genefer 3.2.0beta-0 (Windows/CPU/64-bit)
Supported transform implementations: default 128 sse2 x87 sse3
Copyright 2001-2013, Yves Gallot
Copyright 2009, Mark Rodenkirch, David Underbakke
Copyright 2010-2012, Shoichiro Yamada, Ken Brazier
Copyright 2011-2013, Iain Bethune, Michael Goetz, Ronald Schneider
Command line: genefer_windows64.exe -x SSE3 -b3
Priority change succeeded.
Running benchmarks for transform implementation "SSE3"
14^32768+1 37557 digits 0 days 0.0 hours (0.67 ms/mul, 32769 iterations) 77 GFLOPS
75898^32768+1 159916 digits 0 days 0.0 hours (0.66 ms/mul, 32782 iterations) 77 GFLOPS
700000^32768+1 191533 digits 0 days 0.0 hours (0.67 ms/mul, 163853 iterations) 387 GFLOPS
5000000^32768+1 219512 digits 0 days 0.0 hours (0.66 ms/mul, 196623 iterations) 464 GFLOPS
14^65536+1 75113 digits 0 days 0.0 hours (1.59 ms/mul, 65537 iterations) 326 GFLOPS
75898^65536+1 319831 digits 0 days 0.0 hours (1.58 ms/mul, 65550 iterations) 326 GFLOPS
710000^65536+1 383469 digits 0 days 0.1 hours (1.59 ms/mul, 262158 iterations) 1306 GFLOPS
2500000^65536+1 419296 digits 0 days 0.1 hours (1.59 ms/mul, 327695 iterations) 1632 GFLOPS
14^131072+1 150226 digits 0 days 0.1 hours (3.34 ms/mul, 131073 iterations) 1374 GFLOPS
75898^131072+1 639662 digits 0 days 0.1 hours (3.32 ms/mul, 131086 iterations) 1375 GFLOPS
700000^131072+1 766129 digits 0 days 0.6 hours (3.31 ms/mul, 655373 iterations) 6872 GFLOPS
1000000^131072+1 786432 digits 0 days 0.7 hours (3.31 ms/mul, 786444 iterations) 8246 GFLOPS
14^262144+1 300451 digits 0 days 0.5 hours (8.16 ms/mul, 262145 iterations) 5772 GFLOPS
75898^262144+1 1279324 digits 0 days 0.5 hours (8.22 ms/mul, 262158 iterations) 5773 GFLOPS
468750^262144+1 1486604 digits 0 days 0.6 hours (8.50 ms/mul, 262160 iterations) 5773 GFLOPS
815000^262144+1 1549575 digits 0 days 1.8 hours (8.56 ms/mul, 786447 iterations) 17318 GFLOPS
14^524288+1 600902 digits 0 days 3.1 hours (21.53 ms/mul, 524289 iterations) 24189 GFLOPS
75898^524288+1 2558647 digits 0 days 3.2 hours (22.54 ms/mul, 524302 iterations) 24190 GFLOPS
468750^524288+1 2973207 digits 0 days 3.2 hours (22.48 ms/mul, 524304 iterations) 24190 GFLOPS
710000^524288+1 3067745 digits 0 days 13.0 hours (22.39 ms/mul, 2097166 iterations) 96758 GFLOPS
So at N=19 and high b, the new SSE2 implementation appears to be quicker than SSE3 on my CPU, as the first test indicated.
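To put a number on that comparison, the "-b" benchmark lines can be parsed and compared per N. The sketch below assumes only the "Time: ... us/mul" or "ms/mul" line format shown in the output above; `parse_bench` is a hypothetical helper name, not part of genefer.

```python
import re
from typing import Dict

# Matches lines such as:
# "658332^524288+1 Time: 17.5 ms/mul. Err: 0.1328 3050541 digits"
BENCH_RE = re.compile(
    r"(\d+)\^(\d+)\+1\s+Time:\s+([0-9.]+)\s+(us|ms)/mul\.\s+Err:\s+([0-9.]+)\s+(\d+) digits"
)


def parse_bench(text: str) -> Dict[int, float]:
    """Map N to the time per multiplication, converted to milliseconds."""
    result: Dict[int, float] = {}
    for _b, n, value, unit, _err, _digits in BENCH_RE.findall(text):
        ms = float(value) / 1000.0 if unit == "us" else float(value)
        result[int(n)] = ms
    return result


if __name__ == "__main__":
    sse2 = parse_bench("658332^524288+1 Time: 17.5 ms/mul. Err: 0.1328 3050541 digits")
    sse3 = parse_bench("658332^524288+1 Time: 19.4 ms/mul. Err: 0.1953 3050541 digits")
    ratio = sse3[524288] / sse2[524288]
    print(f"SSE3 takes {ratio:.2f}x the SSE2 time at N=524288")  # 1.11x
```

The "0 us/mul" entries at small N are probably a timer-resolution effect, so ratios at those sizes should not be trusted.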
|
Roger Volunteer developer Volunteer tester
Joined: 27 Nov 11 Posts: 1138 ID: 120786 Credit: 268,668,824 RAC: 0
|
Can you check the contents of the genefer.dat file in the same directory as the executable and let me know what it says.
genefer.dat:
Windows/CPU/64-bit : 771564 524288 sse2 32853 sse3 44152 default 53909 x87 106023 128 4294967294
That's what I expected to see. The first two entries are the B & N. Then a list of the transforms followed by the estimated times. Looks like SSE2 beats SSE3 by quite a bit on your machine. I don't know enough about the 1100T but generally SSE3 does better on systems with smaller caches and/or less memory bandwidth, but if there is plenty then SSE2 can be faster.
x87 is the old genefer80 implementation (except that it has been extended so that it can be used in a 64-bit app as well as in the old 32-bit version).
default is what used to be called 'genefer32' or 'genefer64', i.e. the FFT does not use SIMD intrinsics but lets the compiler do the best it can.
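The genefer.dat layout described above (platform, then B and N, then transform/time pairs) is easy to read programmatically. Here is a minimal sketch that assumes exactly the single-line format quoted in this post; the function name is hypothetical and the real file may contain more than this.

```python
from typing import Dict, Tuple


def parse_genefer_dat_line(line: str) -> Tuple[str, int, int, Dict[str, int]]:
    """Split one genefer.dat line into (platform, b, n, {transform: estimated time})."""
    platform, _, rest = line.partition(" : ")
    tokens = rest.split()
    b, n = int(tokens[0]), int(tokens[1])
    pairs = tokens[2:]  # alternating transform name and estimated time
    times = {pairs[i]: int(pairs[i + 1]) for i in range(0, len(pairs), 2)}
    return platform.strip(), b, n, times


if __name__ == "__main__":
    line = ("Windows/CPU/64-bit : 771564 524288 sse2 32853 sse3 44152 "
            "default 53909 x87 106023 128 4294967294")
    platform, b, n, times = parse_genefer_dat_line(line)
    fastest = min(times, key=lambda name: times[name])
    print(platform, b, n, fastest)  # Windows/CPU/64-bit 771564 524288 sse2
```

The very large value recorded for `128` looks like a "not usable" sentinel, but that is a guess from this one line.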
I've successfully tested 12x GFN32768 and 2x GFN65536 with PRPNet:
...
[2013-12-08 05:32:55 WAST] GFN32768: Getting work from server prpnet.primegrid.com at port 12005
[2013-12-08 05:32:57 WAST] GFN32768: PRPNet server is version 5.2.8
Generalized Fermat Number Prime Search N=32768
genefer 3.2.0beta-0 (Windows/CPU/64-bit)
Supported transform implementations: default 128 sse2 x87 sse3
Copyright 2001-2013, Yves Gallot
Copyright 2009, Mark Rodenkirch, David Underbakke
Copyright 2010-2012, Shoichiro Yamada, Ken Brazier
Copyright 2011-2013, Iain Bethune, Michael Goetz, Ronald Schneider
Command line: genefer_windows64.exe work_GFN32768.in
Priority change succeeded.
Start test of file 'work_GFN32768.in' - 05:32:57
Testing 7028430^32768+1...
Using SSE3 transform
Starting initialization...
Initialization complete (0.046 seconds).
Estimated total run time for 7028430^32768+1 is 0:08:08
Testing 7028430^32768+1... 741376 steps to go (0:08:06 remaining)
maxErr exceeded for 7028430^32768+1, 1.0000 > 0.4500
maxErr exceeded while using SSE3; switching to SSE2.
Testing 7028430^32768+1...
Using SSE2 transform
Starting initialization...
Initialization complete (0.031 seconds).
Estimated total run time for 7028430^32768+1 is 0:07:56
Testing 7028430^32768+1... 741376 steps to go (0:07:53 remaining)
maxErr exceeded for 7028430^32768+1, 1.0000 > 0.4500
maxErr exceeded while using SSE2; switching to Default.
Testing 7028430^32768+1...
Using Default transform
Starting initialization...
Initialization complete (0.047 seconds).
Estimated total run time for 7028430^32768+1 is 0:13:21
Testing 7028430^32768+1... 741376 steps to go (0:13:17 remaining)
maxErr exceeded for 7028430^32768+1, 1.0000 > 0.4500
maxErr exceeded while using Default; switching to x87 (80-bit).
Testing 7028430^32768+1...
Using x87 (80-bit) transform
Starting initialization...
Initialization complete (0.031 seconds).
Estimated total run time for 7028430^32768+1 is 0:31:35
7028430^32768+1 is a probable composite. (RES=a2da3b3d9c10be0b) (224358 digits) (err = 0.0068) (time = 0:31:46) 06:04:53
[2013-12-08 06:04:53 WAST] GFN32768: 7028430^32768+1 is not prime. Residue a2da3b3d9c10be0b
[2013-12-08 06:04:53 WAST] Total Time: 6:26:27 Total Work Units: 12 Special Results Found: 0
[2013-12-08 06:04:54 WAST] GFN32768: Returning work to server prpnet.primegrid.com at port 12005
[2013-12-08 06:04:55 WAST] GFN32768: INFO: Test for 7028430^32768+1 was accepted
[2013-12-08 06:04:56 WAST] GFN32768: INFO: All 1 test results were accepted
...
...
[2013-12-08 01:57:47 WAST] GFN65536: Getting work from server prpnet.primegrid.com at port 12003
[2013-12-08 01:57:49 WAST] GFN65536: PRPNet server is version 5.2.8
Generalized Fermat Number Prime Search N=65536
genefer 3.2.0beta-0 (Windows/CPU/64-bit)
Supported transform implementations: default 128 sse2 x87 sse3
Copyright 2001-2013, Yves Gallot
Copyright 2009, Mark Rodenkirch, David Underbakke
Copyright 2010-2012, Shoichiro Yamada, Ken Brazier
Copyright 2011-2013, Iain Bethune, Michael Goetz, Ronald Schneider
Command line: genefer_windows64.exe work_GFN65536.in
Priority change succeeded.
Start test of file 'work_GFN65536.in' - 01:57:49
Testing 3265086^65536+1...
Using SSE2 transform
Starting initialization...
Initialization complete (0.109 seconds).
Testing 3265086^65536+1... 1417216 steps to go
maxErr exceeded for 3265086^65536+1, 0.5000 > 0.4500
maxErr exceeded while using SSE2; switching to SSE3.
Testing 3265086^65536+1...
Using SSE3 transform
Starting initialization...
Initialization complete (0.093 seconds).
Testing 3265086^65536+1... 1417216 steps to go
maxErr exceeded for 3265086^65536+1, 0.5000 > 0.4500
maxErr exceeded while using SSE3; switching to Default.
Testing 3265086^65536+1...
Using Default transform
Starting initialization...
Initialization complete (0.094 seconds).
Testing 3265086^65536+1... 1417216 steps to go
maxErr exceeded for 3265086^65536+1, 0.5000 > 0.4500
maxErr exceeded while using Default; switching to x87 (80-bit).
Testing 3265086^65536+1...
Using x87 (80-bit) transform
Starting initialization...
Initialization complete (0.093 seconds).
Estimated total run time for 3265086^65536+1 is 2:12:44
3265086^65536+1 is a probable composite. (RES=b46b64ac0a6feb49) (426895 digits) (err = 0.0024) (time = 2:13:02) 04:10:56
[2013-12-08 04:10:56 WAST] GFN65536: 3265086^65536+1 is not prime. Residue b46b64ac0a6feb49
[2013-12-08 04:10:56 WAST] Total Time: 4:28:40 Total Work Units: 2 Special Results Found: 0
[2013-12-08 04:10:57 WAST] GFN65536: Returning work to server prpnet.primegrid.com at port 12003
[2013-12-08 04:10:58 WAST] GFN65536: INFO: Test for 3265086^65536+1 was accepted
[2013-12-08 04:10:59 WAST] GFN65536: INFO: All 1 test results were accepted
... | |
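The digit counts and run-time estimates in these logs can be sanity-checked with a few lines of arithmetic. The sketch below assumes roughly one modular squaring per bit of b^N, which is an assumption on my part rather than something read from genefer's source, although it is consistent with the "steps to go" figures above. The function names and the 0.655 ms/mul figure are illustrative.

```python
import math


def decimal_digits(b: int, n: int) -> int:
    """Number of decimal digits of b^n + 1 (b >= 2), computed via logarithms."""
    return math.floor(n * math.log10(b)) + 1


def iterations(b: int, n: int) -> int:
    """Approximate squaring count, assuming one squaring per bit of b^n."""
    return math.ceil(n * math.log2(b))


def estimated_seconds(b: int, n: int, ms_per_mul: float) -> float:
    """Rough run time: squaring count times the time per multiplication."""
    return iterations(b, n) * ms_per_mul / 1000.0


if __name__ == "__main__":
    print(decimal_digits(7028430, 32768))  # 224358, as reported in the log
    print(iterations(7028430, 32768))      # a little above the 741376 "steps to go"
    # About 0.655 ms/mul gives an estimate of roughly eight minutes.
    print(round(estimated_seconds(7028430, 32768, 0.655)))
```

With about 0.655 ms per multiplication this lands close to the 0:08:08 estimate printed for the SSE3 transform on 7028430^32768+1.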
|
|
I ran the Mac 64-bit CPU version.
On the limit tests for both SSE2 and SSE3, I got the same results as rroonnaalldd posted in message 71275.
I ran the benchmarks for SSE2 and SSE3. This was on a Sept. 2009 vintage iMac, C2D @2.93 GHz, no other significant system load.
Generalized Fermat Number Bench
Running benchmarks for transform implementation "SSE2"
6008024^256+1 Time: 5.47 us/mul. Err: 0.2500 1736 digits
4913974^512+1 Time: 9.37 us/mul. Err: 0.2500 3427 digits
4019150^1024+1 Time: 15.6 us/mul. Err: 0.2500 6763 digits
3287270^2048+1 Time: 30.5 us/mul. Err: 0.2500 13347 digits
2688666^4096+1 Time: 64.7 us/mul. Err: 0.2500 26336 digits
2199064^8192+1 Time: 164 us/mul. Err: 0.3125 51956 digits
1798620^16384+1 Time: 345 us/mul. Err: 0.1562 102481 digits
1471094^32768+1 Time: 730 us/mul. Err: 0.1562 202102 digits
1203210^65536+1 Time: 1.52 ms/mul. Err: 0.1562 398482 digits
984108^131072+1 Time: 3.15 ms/mul. Err: 0.1562 785521 digits
804904^262144+1 Time: 6.63 ms/mul. Err: 0.1406 1548156 digits
658332^524288+1 Time: 13.9 ms/mul. Err: 0.1328 3050541 digits
538452^1048576+1 Time: 37 ms/mul. Err: 0.1406 6009544 digits
440400^2097152+1 Time: 83.2 ms/mul. Err: 0.1250 11836006 digits
360204^4194304+1 Time: 172 ms/mul. Err: 0.1250 23305854 digits
... and ...
Generalized Fermat Number Bench
Running benchmarks for transform implementation "SSE3"
6008024^256+1 Time: 4.84 us/mul. Err: 0.1406 1736 digits
4913974^512+1 Time: 7.78 us/mul. Err: 0.1562 3427 digits
4019150^1024+1 Time: 13.7 us/mul. Err: 0.1562 6763 digits
3287270^2048+1 Time: 25.1 us/mul. Err: 0.1719 13347 digits
2688666^4096+1 Time: 54.8 us/mul. Err: 0.1875 26336 digits
2199064^8192+1 Time: 116 us/mul. Err: 0.1953 51956 digits
1798620^16384+1 Time: 254 us/mul. Err: 0.2344 102481 digits
1471094^32768+1 Time: 527 us/mul. Err: 0.2188 202102 digits
1203210^65536+1 Time: 1.15 ms/mul. Err: 0.2031 398482 digits
984108^131072+1 Time: 2.39 ms/mul. Err: 0.2031 785521 digits
804904^262144+1 Time: 5.13 ms/mul. Err: 0.1953 1548156 digits
658332^524288+1 Time: 11.1 ms/mul. Err: 0.2031 3050541 digits
538452^1048576+1 Time: 28.2 ms/mul. Err: 0.1875 6009544 digits
440400^2097152+1 Time: 63.2 ms/mul. Err: 0.2031 11836006 digits
360204^4194304+1 Time: 149 ms/mul. Err: 0.1914 23305854 digits
Looks like SSE3 is faster in all cases for me. Compared with the previous version, 3.2.0beta looks maybe a little faster, but the difference is very small, down in the noise.
--Gary
| |
|
Roger Volunteer developer Volunteer tester
Joined: 27 Nov 11 Posts: 1138 ID: 120786 Credit: 268,668,824 RAC: 0
|
In PRPNet's prpclient.ini, can you pass command-line flags to the application? For example:
geneferexe=genefer_windows64.exe -x SSE3
Below is an example where it would be useful. genefer_windows64.exe first tries SSE2 because it is quicker, but then it exceeds maxErr and switches to SSE3.
The -x flag forces use of the specified code path.
>genefer_windows64.exe -q "771564^524288+1"
genefer 3.2.0beta-0 (Windows/CPU/64-bit)
Supported transform implementations: default 128 sse2 x87 sse3
Copyright 2001-2013, Yves Gallot
Copyright 2009, Mark Rodenkirch, David Underbakke
Copyright 2010-2012, Shoichiro Yamada, Ken Brazier
Copyright 2011-2013, Iain Bethune, Michael Goetz, Ronald Schneider
Command line: genefer_windows64.exe -q 771564^524288+1
Priority change succeeded.
No relevant benchmark data exists, testing available transform implementations...
This checkpoint file was saved by an older version of genefer. Current test will be restarted
Testing 771564^524288+1...
Using SSE2 transform
This checkpoint file was saved by an older version of genefer. Current test will be restarted
Starting initialization...
Initialization complete (2.286 seconds).
Estimated total run time for 771564^524288+1 is 89:09:11
Testing 771564^524288+1... 9867264 steps to go (85:47:35 remaining)
maxErr exceeded for 771564^524288+1, 0.5000 > 0.4500
maxErr exceeded while using SSE2; switching to SSE3.
Testing 771564^524288+1...
Using SSE3 transform
Checkpoint is for SSE2 transform, expected SSE3. Current test will be restarted
Starting initialization...
Initialization complete (2.200 seconds).
Estimated total run time for 771564^524288+1 is 83:13:23
Testing 771564^524288+1... 7991296 steps to go (64:51:37 remaining)
Another way to fix this would be for genefer_windows64.exe to know its B-limits, but they are hardware dependent.
Maybe it could run its limit tests once and refer to a save file on future runs.
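A rough sketch of that save-file idea: store the estimated limits per transform and N, then pick the fastest transform whose stored limit covers b. Everything here is hypothetical (the JSON layout, the function names, the file itself); genefer does not necessarily do any of this. The example numbers are the N=524288 bounds from the limits output posted earlier in this thread.

```python
import json
from typing import Dict, List, Optional

Limits = Dict[str, Dict[int, int]]  # transform name -> {N: estimated b limit}


def save_limits(path: str, limits: Limits) -> None:
    """Write the limits table to a JSON file (integer keys become strings)."""
    with open(path, "w") as f:
        json.dump(limits, f)


def load_limits(path: str) -> Limits:
    """Read the table back, converting the N keys to integers again."""
    with open(path) as f:
        raw = json.load(f)
    return {name: {int(n): b for n, b in per_n.items()} for name, per_n in raw.items()}


def choose_transform(b: int, n: int, limits: Limits, speed_order: List[str]) -> Optional[str]:
    """Return the fastest transform whose estimated limit at N covers b."""
    for name in speed_order:  # speed_order lists transforms fastest first
        limit = limits.get(name, {}).get(n)
        if limit is not None and b <= limit:
            return name
    return None  # no stored limit covers this b


if __name__ == "__main__":
    limits = {"sse2": {524288: 735000}, "sse3": {524288: 890000}}
    print(choose_transform(771564, 524288, limits, ["sse2", "sse3"]))  # prints sse3
```

Because the limits are only estimates, a lookup like this could skip a doomed first attempt, but the existing fall-back on maxErr would still be needed.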
|
Michael Goetz Volunteer moderator Project administrator
Joined: 21 Jan 10 Posts: 14043 ID: 53948 Credit: 481,060,436 RAC: 494,925
|
Roger wrote:
In the PRPNet prpclient.ini can you pass command line flags to the application? i.e.
geneferexe=genefer_windows64.exe -x SSE3
No, you cannot. Genefer will need to read settings like that from a configuration file. (I haven't looked at this iteration of the code, so I don't know whether that's currently implemented. If not, we may need to add it. The framework code already has that sort of capability for similar functionality in the CUDA version.)
____________
My lucky number is 75898^524288+1
|
Honza Volunteer moderator Volunteer tester Project scientist
Joined: 15 Aug 05 Posts: 1963 ID: 352 Credit: 6,418,065,418 RAC: 2,742,360
|
During a test, I hit Ctrl+C so that I could try continuing from the checkpoint.
c:\_PG\PRPclient.1>prpclient.exe
Total percent of servers does not equal 100. Normalizing
[2013-12-08 10:01:52 CEST] PRPNet Client application v5.3.0 started
[2013-12-08 10:01:52 CEST] User name Honza at email address is x@xxx.xx
It appears that the PRPNet client was aborted without completing
the workunits asigned by server GFN32768. What do you want to do with them?
1 = Report completed and abort the rest, then get more work
2 = Report completed and abort the rest, then shut down
3 = Return completed, then continue
4 = Complete in-progress, abort the rest, report them, then get more work
5 = Complete in-progress, abort the rest, report them, then shut down
6 = Complete all work units, report them, then shut down
7 = Complete all work units, then shut down
9 = Continue from client left off when it was shut down
Choose option: 6
genefer 3.2.0beta-0 (Windows/CPU/64-bit)
Supported transform implementations: default 128 sse2 x87 avx sse3
Copyright 2001-2013, Yves Gallot
Copyright 2009, Mark Rodenkirch, David Underbakke
Copyright 2010-2012, Shoichiro Yamada, Ken Brazier
Copyright 2011-2013, Iain Bethune, Michael Goetz, Ronald Schneider
Command line: genefer_windows64.exe work_GFN32768.in
Priority change succeeded.
Start test of file 'work_GFN32768.in' - 10:01:57
Testing 7028526^32768+1...
Using x87 (80-bit) transform
Resuming 7028526^32768+1 from a checkpoint (511582 iterations left)
Estimated total run time for 7028526^32768+1 is 0:20:23
Testing 7028526^32768+1... 507904 steps to go (0:13:53 remaining)
Problem renaming genefer.ckpt to genefer.ckpt.old at i=507904. Continuing processing anyway...
Problem renaming genefer.ckpt.new to genefer.ckpt at i=507904. Continuing processing anyway...
Testing 7028526^32768+1... 503808 steps to go (0:13:47 remaining)
Problem renaming genefer.ckpt to genefer.ckpt.old at i=503808. Continuing processing anyway...
It has had problems renaming the checkpoint ever since.
(I tried turning off the antivirus.)
Eventually the task finished and was returned, and the client exited as per option 6.
Problem renaming genefer.ckpt.new to genefer.ckpt at i=4096. Continuing processing anyway...
7028526^32768+1 is a probable composite. (RES=a3187b9a9831f147) (224359 digits) (err = 0.0068) (time = 0:20:17) 10:15:45
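For context on those rename messages, here is a minimal sketch of a three-file checkpoint rotation. The scheme is inferred from the file names in the messages above (genefer.ckpt, genefer.ckpt.new, genefer.ckpt.old), not taken from genefer's source, and `rotate_checkpoint` is a hypothetical name.

```python
import os
import tempfile


def rotate_checkpoint(data: bytes, path: str = "genefer.ckpt") -> None:
    """Write a new checkpoint, keeping the previous one as a backup."""
    new_path = path + ".new"
    old_path = path + ".old"
    # 1. Write the new state to a temporary name and force it to disk.
    with open(new_path, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())
    # 2. Keep the previous checkpoint as a backup, replacing any older backup.
    if os.path.exists(path):
        os.replace(path, old_path)
    # 3. Promote the new file to the live checkpoint name.
    os.replace(new_path, path)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        ckpt = os.path.join(d, "genefer.ckpt")
        rotate_checkpoint(b"state at i=507904", ckpt)
        rotate_checkpoint(b"state at i=503808", ckpt)
        print(sorted(os.listdir(d)))  # ['genefer.ckpt', 'genefer.ckpt.old']
```

In a scheme like this, a failed rename on Windows is often caused by another process still holding one of the files open, which may be worth checking here. That is a general observation, not a diagnosis of this particular run.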
____________
My stats | |
|
Tyler Project administrator Volunteer tester
Joined: 4 Dec 12 Posts: 1081 ID: 183129 Credit: 1,384,625,026 RAC: 3,918
|
Running PRPNet Client 5.2.4, genefer windows 64 bit downloaded from above.
Intel i5 2500k @ 4.5GHz
Port 12005 GFN32768
[2013-12-08 18:51:52 MST] PRPNet Client application v5.2.4 started
[2013-12-08 18:51:52 MST] User name 1998golfer at email address is fake@email.invald
[2013-12-08 18:51:53 MST] GFN32768: Getting work from server prpnet.primegrid.com
at port 12005
[2013-12-08 18:51:54 MST] GFN32768: PRPNet server is version 5.2.8
Generalized Fermat Number Prime Search N=32768
genefer 3.2.0beta-0 (Windows/CPU/64-bit)
Supported transform implementations: default 128 sse2 x87 avx sse3
Copyright 2001-2013, Yves Gallot
Copyright 2009, Mark Rodenkirch, David Underbakke
Copyright 2010-2012, Shoichiro Yamada, Ken Brazier
Copyright 2011-2013, Iain Bethune, Michael Goetz, Ronald Schneider
Command line: genefer_windows64.exe work_GFN32768.in
Priority change succeeded.
Start test of file 'work_GFN32768.in' - 18:51:54
No relevant benchmark data exists, testing available transform implementations...
Testing 7028596^32768+1...
Using AVX transform
Starting initialization...
Initialization complete (0.022 seconds).
Estimated total run time for 7028596^32768+1 is 0:02:11
Testing 7028596^32768+1... 741376 steps to go (0:02:11 remaining)
maxErr exceeded for 7028596^32768+1, 0.5000 > 0.4500
maxErr exceeded while using AVX; switching to SSE3.
Testing 7028596^32768+1...
Using SSE3 transform
Starting initialization...
Initialization complete (0.022 seconds).
Estimated total run time for 7028596^32768+1 is 0:04:13
Testing 7028596^32768+1... 741376 steps to go (0:04:12 remaining)
maxErr exceeded for 7028596^32768+1, 1.0000 > 0.4500
maxErr exceeded while using SSE3; switching to SSE2.
Testing 7028596^32768+1...
Using SSE2 transform
Starting initialization...
Initialization complete (0.021 seconds).
Estimated total run time for 7028596^32768+1 is 0:05:13
Testing 7028596^32768+1... 741376 steps to go (0:05:11 remaining)
maxErr exceeded for 7028596^32768+1, 1.0000 > 0.4500
maxErr exceeded while using SSE2; switching to Default.
Testing 7028596^32768+1...
Using Default transform
Starting initialization...
Initialization complete (0.021 seconds).
Estimated total run time for 7028596^32768+1 is 0:06:43
Testing 7028596^32768+1... 741376 steps to go (0:06:41 remaining)
maxErr exceeded for 7028596^32768+1, 1.0000 > 0.4500
maxErr exceeded while using Default; switching to x87 (80-bit).
Testing 7028596^32768+1...
Using x87 (80-bit) transform
Starting initialization...
Initialization complete (0.021 seconds).
Estimated total run time for 7028596^32768+1 is 0:14:42
Testing 7028596^32768+1... 729088 steps to go (0:14:23 remaining)
EDIT:
Restarting from checkpoint is being... funky...
[2013-12-08 18:56:13 MST] PRPNet Client application v5.2.4 started
[2013-12-08 18:56:13 MST] User name 1998golfer at email address is fake@email.invald
genefer 3.2.0beta-0 (Windows/CPU/64-bit)
Supported transform implementations: default 128 sse2 x87 avx sse3
Copyright 2001-2013, Yves Gallot
Copyright 2009, Mark Rodenkirch, David Underbakke
Copyright 2010-2012, Shoichiro Yamada, Ken Brazier
Copyright 2011-2013, Iain Bethune, Michael Goetz, Ronald Schneider
Command line: genefer_windows64.exe work_GFN32768.in
Priority change succeeded.
Start test of file 'work_GFN32768.in' - 18:56:13
Testing 7028596^32768+1...
Using x87 (80-bit) transform
Resuming 7028596^32768+1 from a checkpoint (539700 iterations left)
Estimated total run time for 7028596^32768+1 is 0:15:23
Testing 7028596^32768+1... 536576 steps to go (0:11:04 remaining)
Problem renaming genefer.ckpt to genefer.ckpt.old at i=536576. Continuing processing anyway...
Problem renaming genefer.ckpt.new to genefer.ckpt at i=536576. Continuing processing anyway...
Testing 7028596^32768+1... 532480 steps to go (0:10:59 remaining)
Problem renaming genefer.ckpt to genefer.ckpt.old at i=532480. Continuing processing anyway...
Problem renaming genefer.ckpt.new to genefer.ckpt at i=532480. Continuing processing anyway...
Testing 7028596^32768+1... 528384 steps to go (0:10:54 remaining)
Problem renaming genefer.ckpt to genefer.ckpt.old at i=528384. Continuing processing anyway...
Problem renaming genefer.ckpt.new to genefer.ckpt at i=528384. Continuing processing anyway...
Testing 7028596^32768+1... 524288 steps to go (0:10:49 remaining)
Problem renaming genefer.ckpt to genefer.ckpt.old at i=524288. Continuing processing anyway...
Problem renaming genefer.ckpt.new to genefer.ckpt at i=524288. Continuing processing anyway...
Testing 7028596^32768+1... 520192 steps to go (0:10:44 remaining)
Problem renaming genefer.ckpt to genefer.ckpt.old at i=520192. Continuing processing anyway...
Problem renaming genefer.ckpt.new to genefer.ckpt at i=520192. Continuing processing anyway...
Testing 7028596^32768+1... 516096 steps to go (0:10:39 remaining)
Problem renaming genefer.ckpt to genefer.ckpt.old at i=516096. Continuing processing anyway...
Problem renaming genefer.ckpt.new to genefer.ckpt at i=516096. Continuing processing anyway...
Testing 7028596^32768+1... 512000 steps to go (0:10:34 remaining)
Problem renaming genefer.ckpt to genefer.ckpt.old at i=512000. Continuing processing anyway...
Problem renaming genefer.ckpt.new to genefer.ckpt at i=512000. Continuing processing anyway...
Testing 7028596^32768+1... 507904 steps to go (0:10:29 remaining)
Problem renaming genefer.ckpt to genefer.ckpt.old at i=507904. Continuing processing anyway...
Problem renaming genefer.ckpt.new to genefer.ckpt at i=507904. Continuing processing anyway...
Testing 7028596^32768+1... 503808 steps to go (0:10:24 remaining)
Problem renaming genefer.ckpt to genefer.ckpt.old at i=503808. Continuing processing anyway...
Problem renaming genefer.ckpt.new to genefer.ckpt at i=503808. Continuing processing anyway...
Testing 7028596^32768+1... 499712 steps to go (0:10:19 remaining)
Problem renaming genefer.ckpt to genefer.ckpt.old at i=499712. Continuing processing anyway...
Problem renaming genefer.ckpt.new to genefer.ckpt at i=499712. Continuing processing anyway...
Testing 7028596^32768+1... 495616 steps to go (0:10:14 remaining)
Problem renaming genefer.ckpt to genefer.ckpt.old at i=495616. Continuing processing anyway...
Problem renaming genefer.ckpt.new to genefer.ckpt at i=495616. Continuing processing anyway...
Testing 7028596^32768+1... 491520 steps to go (0:10:08 remaining)
Problem renaming genefer.ckpt to genefer.ckpt.old at i=491520. Continuing processing anyway...
Problem renaming genefer.ckpt.new to genefer.ckpt at i=491520. Continuing processing anyway...
and it just keeps doing that..
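The two file names in the messages above suggest a three-file rotation: write `genefer.ckpt.new`, move `genefer.ckpt` to `genefer.ckpt.old`, then promote the new file. The following is only a sketch of such a rotation, not genefer's actual code; the function name `rotate_checkpoint` is made up for illustration. It uses `os.replace`, which overwrites an existing destination, whereas a plain `os.rename` raises an error on Windows when the destination already exists. Whether that is what went wrong here is not known from the log.

```python
import os
import tempfile


def rotate_checkpoint(directory, data, base="genefer.ckpt"):
    """Hypothetical rotation: base.new -> base, with the previous base kept as base.old."""
    ckpt = os.path.join(directory, base)
    new = ckpt + ".new"
    old = ckpt + ".old"

    # Write the fresh state to the .new file first, so the current
    # checkpoint stays intact if we are interrupted while writing.
    with open(new, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())

    # Keep the previous checkpoint as a backup. os.replace overwrites
    # an existing .old file instead of failing.
    if os.path.exists(ckpt):
        os.replace(ckpt, old)

    # Promote the new file to be the current checkpoint.
    os.replace(new, ckpt)


if __name__ == "__main__":
    workdir = tempfile.mkdtemp()
    for state in (b"one", b"two", b"three"):
        rotate_checkpoint(workdir, state)
    print(sorted(os.listdir(workdir)))
```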
____________
275*2^3585539+1 is prime!!! (1079358 digits)
Proud member of Aggie the Pew
| |
|
|
Looks like there is an issue with checkpointing - I'll look into that!
- Iain
____________
Twitter: IainBethune
Proud member of team "Aggie The Pew". Go Aggie!
3073428256125*2^1290000-1 is Prime! | |
|
Honza Volunteer moderator Volunteer tester Project scientist
Joined: 15 Aug 05 Posts: 1963 ID: 352 Credit: 6,418,065,418 RAC: 2,742,360
|
Apart from the cosmetic checkpointing issue:
It seems Genefer tries the fastest transform (AVX) first and only hits maxErr just before finishing.
It then retries with SSE3 and, again just before finishing, falls back to SSE2.
After a while on SSE2, it restarted from scratch using Default.
Perhaps a stupid question, but would it be possible to restart from a checkpoint instead?
The task was almost finished twice, only to be rerun from scratch using the slowest method :-(
Hi! Welcome to PrimeGrid's GFN 262144 Prime Search.
genefer 3.2.0beta-0 (Windows/CPU/64-bit)
Supported transform implementations: default 128 sse2 x87 avx sse3
Copyright 2001-2013, Yves Gallot
Copyright 2009, Mark Rodenkirch, David Underbakke
Copyright 2010-2012, Shoichiro Yamada, Ken Brazier
Copyright 2011-2013, Iain Bethune, Michael Goetz, Ronald Schneider
Command line: genefer_windows64.exe work_GFN262144.in
Priority change succeeded.
Start test of file 'work_GFN262144.in' - 21:17:07
Testing 1026120^262144+1...
Using AVX transform
Starting initialization...
Initialization complete (0.732 seconds).
Estimated total run time for 1026120^262144+1 is 6:00:29
Testing 1026120^262144+1... 466944 steps to go (0:32:09 remaining)
maxErr exceeded for 1026120^262144+1, 0.4531 > 0.4500
maxErr exceeded while using AVX; switching to SSE3.
Testing 1026120^262144+1...
Using SSE3 transform
Checkpoint is for AVX transform, expected SSE3. Current test will be restarted
Starting initialization...
Initialization complete (0.733 seconds).
Estimated total run time for 1026120^262144+1 is 9:07:06
Testing 1026120^262144+1... 466944 steps to go (0:48:48 remaining)
maxErr exceeded for 1026120^262144+1, 0.4531 > 0.4500
maxErr exceeded while using SSE3; switching to SSE2.
Testing 1026120^262144+1...
Using SSE2 transform
Checkpoint is for SSE3 transform, expected SSE2. Current test will be restarted
Starting initialization...
Initialization complete (1.544 seconds).
Estimated total run time for 1026120^262144+1 is 11:08:17
Testing 1026120^262144+1... 5230592 steps to go (11:07:46 remaining)
maxErr exceeded for 1026120^262144+1, 0.5000 > 0.4500
maxErr exceeded while using SSE2; switching to Default.
Testing 1026120^262144+1...
Using Default transform
Checkpoint is for SSE2 transform, expected Default. Current test will be restarted
Starting initialization...
Initialization complete (0.748 seconds).
Estimated total run time for 1026120^262144+1 is 17:14:22
Testing 1026120^262144+1... 4554752 steps to go (15:00:01 remaining)
____________
My stats | |
|
|
Hi Honza, that is interesting! I knew that in principle it's possible for a test to run a long time without a round-off error, and you have found a good example of this. It's most likely when B is 'close' to the B limit, so unfortunately there are quite a lot of tests like this coming up on that port... clearly what is happening now is not efficient.
It was also suggested that the starting transform could be chosen based on a limits test, but for N=262144 the B limit is 1065000. It's a conservative estimate, so it would still result in the same behaviour you saw.
At present it's not possible to restart from a checkpoint created by a different transform implementation; whether it can be made possible is something I'd have to think hard about.
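The "maxErr exceeded ... 0.5000 > 0.4500" lines in the logs come from a round-off check. Below is a minimal sketch of that kind of check. It assumes that after each floating-point multiplication every coefficient should be close to an integer, and that the error is the largest distance to the nearest integer. The function name and the exception type are hypothetical; only the 0.45 threshold is taken from the log output.

```python
MAX_ERR = 0.45  # threshold as printed in the genefer logs above


def round_with_check(values, max_err=MAX_ERR):
    """Round each coefficient to the nearest integer and report the worst error.

    Raises ArithmeticError if any coefficient is too far from an integer,
    which is the point where a slower, more precise transform would be tried.
    """
    rounded = [round(v) for v in values]
    err = max(abs(v - r) for v, r in zip(values, rounded))
    if err > max_err:
        raise ArithmeticError(f"maxErr exceeded, {err:.4f} > {max_err:.4f}")
    return rounded, err


if __name__ == "__main__":
    # Coefficients that are nearly integers pass the check.
    print(round_with_check([2.01, -3.02, 7.0]))
    # A coefficient almost halfway between two integers does not.
    try:
        round_with_check([4.47])
    except ArithmeticError as exc:
        print(exc)
```

An error of exactly 0.5 means the coefficient is halfway between two integers, so the rounded value cannot be trusted at all; the 0.45 limit leaves a small safety margin below that.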
Cheers
- Iain
____________
Twitter: IainBethune
Proud member of team "Aggie The Pew". Go Aggie!
3073428256125*2^1290000-1 is Prime! | |
|
Honza Volunteer moderator Volunteer tester Project scientist
Joined: 15 Aug 05 Posts: 1963 ID: 352 Credit: 6,418,065,418 RAC: 2,742,360
|
...clearly what is happening now is not efficient
Yeah.
Roger was actually lucky with his test.
The behaviour we now see on the live GFN262144 range is a good reason why "Genefer will need to read settings like that from a configuration file".
For now, reverting to the old 3.1.2 scheme or running basicgenefer on GFN262144 seems more efficient.
____________
My stats | |
|
Michael Goetz Volunteer moderator Project administrator
Joined: 21 Jan 10 Posts: 14043 ID: 53948 Credit: 481,060,436 RAC: 494,925
|
At present it's not possible to restart from a checkpoint created by a different transform implementation; whether it can be made possible is something I'd have to think hard about.
Cheers
- Iain
There's no reason why the checkpoint file from one transform can't be used if restarted with another transform. Switching between a 32 bit and 64 bit executable might not work (depending on how the arrays are written to the file), but the transform in use should have no effect on the contents of the checkpoint.
Genefer (any version) is an iterative algorithm that does N steps. We only checkpoint after each complete step.
The transforms take you from step k to step k+1, and they do that in different ways, but the result after each step will always be identical. It's as if you need to add 1+1, and you can do it on a calculator, an abacus, your fingers, in your head, or with a computer. No matter which way you do it, as long as there's no error the answer will be 2. (Or 10, the explanation for which I'll leave to the reader.)
(In Genefer, the transforms are merely different methods for performing a multiplication, rather than an addition as in my example.)
As long as the arrays that are being written to the file have the same C definition, there's no reason why you can't switch transforms in the middle of the computation.
If the transforms DO use different C array definitions, it should still be possible to restart with a different transform, but you'll need to convert the array to the correct format as it's read.
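A toy illustration of the argument above, under stated assumptions: it uses exact Python integers rather than floating-point FFTs, and the two multiplication routines (`mul_builtin`, `mul_schoolbook`) and the `run` helper are invented for this sketch, not taken from Genefer. It shows that when the state after each step is the same number, the multiplication method can be swapped in the middle of the run without changing the final result.

```python
def mul_builtin(x, y):
    """Multiply using Python's built-in big integers."""
    return x * y


def mul_schoolbook(x, y, base=10**4):
    """Multiply by splitting both numbers into limbs, a deliberately different method."""
    def limbs(v):
        out = []
        while v:
            v, r = divmod(v, base)
            out.append(r)
        return out or [0]

    xs, ys = limbs(x), limbs(y)
    acc = [0] * (len(xs) + len(ys))
    for i, a in enumerate(xs):
        for j, b in enumerate(ys):
            acc[i + j] += a * b
    # Recombine the limb products (Horner's rule, most significant first).
    result = 0
    for k in reversed(range(len(acc))):
        result = result * base + acc[k]
    return result


def run(modulus, steps, choose_mul, state=2):
    """Square `state` modulo `modulus` for `steps` steps.

    choose_mul(i) returns the multiplication routine to use at step i.
    """
    for i in range(steps):
        mul = choose_mul(i)
        state = mul(state, state) % modulus
    return state


if __name__ == "__main__":
    modulus = 6**16 + 1  # a small number of the form b^(2^n)+1
    all_builtin = run(modulus, 40, lambda i: mul_builtin)
    all_school = run(modulus, 40, lambda i: mul_schoolbook)
    switched = run(modulus, 40, lambda i: mul_builtin if i < 17 else mul_schoolbook)
    print(all_builtin == all_school == switched)
```

The real transforms work in floating point, so this only holds for them when the state that is saved has been reduced to the same exact representation, which is the complication raised in the reply that follows.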
____________
My lucky number is 75898524288+1 | |
|
|
Mike, the checkpoint arrays are premultiplied by the sin/cos twiddle factors, which makes restarting with a different transform without loss of accuracy tricky. I propose to modify the code so that we checkpoint the arrays after they have been rounded to integers - this way a restart will be deterministic and can work with any transform implementation. It will require a bit of reorganisation of the code though...
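A rough sketch of the idea proposed above, with loud assumptions: the unit-modulus complex weights below stand in for the sin/cos twiddle factors, the two weight sets stand in for two different transform implementations, and none of the names come from Genefer. The point is only that if the checkpoint holds plain rounded integers, each implementation can apply its own weights when loading, so the file no longer depends on which one wrote it.

```python
import cmath


def weights(n, scale):
    """Hypothetical per-transform weights: n points on the complex unit circle."""
    return [cmath.exp(1j * cmath.pi * scale * k / n) for k in range(n)]


def apply_weights(digits, w):
    """Internal (transform-specific) form: integer digits times the weights."""
    return [d * wk for d, wk in zip(digits, w)]


def to_integers(internal, w):
    """Checkpoint form: divide the weights back out and round to integers.

    Because |w| = 1, dividing by w is the same as multiplying by its conjugate.
    """
    return [round((x * wk.conjugate()).real) for x, wk in zip(internal, w)]


if __name__ == "__main__":
    digits = [3, 141, 59, 26, 535, 89, 79, 323]
    w_a = weights(len(digits), 1.0)   # stand-in for transform A
    w_b = weights(len(digits), 0.5)   # stand-in for transform B

    # Transform A writes a checkpoint of plain integers.
    checkpoint = to_integers(apply_weights(digits, w_a), w_a)

    # Transform B loads it, applies its own weights, and can write it back unchanged.
    reloaded = to_integers(apply_weights(checkpoint, w_b), w_b)
    print(checkpoint == digits, reloaded == digits)
```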
- Iain
____________
Twitter: IainBethune
Proud member of team "Aggie The Pew". Go Aggie!
3073428256125*2^1290000-1 is Prime! | |
|
Honza Volunteer moderator Volunteer tester Project scientist
Joined: 15 Aug 05 Posts: 1963 ID: 352 Credit: 6,418,065,418 RAC: 2,742,360
|
Regarding checkpoints and the .ini parameter file, I would like to see an option to set the checkpointing interval, if possible.
GFN262144 writes about 3.5GB per test in 5 hours; that would be about 100GB a day on a 6-core machine.
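The arithmetic behind the 100GB figure, as a small worked example. It assumes one test per core running back to back around the clock; the function name is made up for illustration.

```python
def daily_write_gb(gb_per_test, hours_per_test, cores):
    """Estimate checkpoint data written per day across all cores."""
    per_core_per_day = gb_per_test / hours_per_test * 24
    return per_core_per_day * cores


if __name__ == "__main__":
    # 3.5 GB per 5-hour test is 16.8 GB per core per day,
    # so six cores write roughly 100.8 GB per day.
    print(round(daily_write_gb(3.5, 5, 6), 1))
```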
____________
My stats | |
|
Michael Goetz Volunteer moderator Project administrator
Joined: 21 Jan 10 Posts: 14043 ID: 53948 Credit: 481,060,436 RAC: 494,925
|
Regarding checkpoints and the .ini parameter file, I would like to see an option to set the checkpointing interval, if possible.
GFN262144 writes about 3.5GB per test in 5 hours; that would be about 100GB a day on a 6-core machine.
Assuming you're using it under BOINC, you can set that option in the BOINC preferences.
____________
My lucky number is 75898524288+1 | |
|
Honza Volunteer moderator Volunteer tester Project scientist
Joined: 15 Aug 05 Posts: 1963 ID: 352 Credit: 6,418,065,418 RAC: 2,742,360
|
Assuming you're using it under BOINC, you can set that option in the BOINC preferences.
PRPNet that is.
____________
My stats | |
|
|
Hi all,
New binaries are posted at https://www.assembla.com/code/genefer/subversion/nodes in the trunk/bin/<os> directories. New features include:
- Working CUDA and OpenCL applications featuring the new error recovery from genefer 3.1.2-9
- If a CPU test falls back to a slower (higher B limit) transform, it will carry on from the previous checkpoint rather than restarting from the beginning. This addresses the issues Honza reported on the GFN262144 port.
- As a side effect, the checkpoint file is independent of the transform, so it is possible to start a calculation with AVX, stop, then restart with CUDA!
- Note, as this is a beta release, I have not incremented the checkpoint version number. Please ensure you delete any existing checkpoint files before restarting with the new app.
If you're not too busy running the current GFN challenge, please test and continue to post results here!
Cheers
- Iain
____________
Twitter: IainBethune
Proud member of team "Aggie The Pew". Go Aggie!
3073428256125*2^1290000-1 is Prime!
| |
|
Honza Volunteer moderator Volunteer tester Project scientist
Joined: 15 Aug 05 Posts: 1963 ID: 352 Credit: 6,418,065,418 RAC: 2,742,360
|
Thanks for the new builds.
Updated to the latest version, which started the test from scratch. It's OK, checkpoint from old version. Aborted the test so restarting from checkpoint can be tested.
What I found interesting: It started on SSE2 then switched to SSE3, not the other way around.
Other instances started using AVX and holding so far.
c:\_Honza\PRPNet.1>prpclient.exe
[2014-01-08 15:35:43 SE(Å”] PRPNet Client application v5.3.0 started
[2014-01-08 15:35:43 SE(Å”] User name Honza at email address is x@xxx.xx
genefer 3.2.0beta-0 (Windows/CPU/64-bit)
Supported transform implementations: default sse2 x87 sse3
Command line: genefer_windows64.exe work_GFN262144.in
Priority change succeeded.
Start test of file 'work_GFN262144.in' - 15:35:43
Testing 1039322^262144+1...
Using SSE2 transform
Resuming 1039322^262144+1 from a checkpoint (5233521 iterations left)
Estimated total run time for 1039322^262144+1 is 11:57:43
Testing 1039322^262144+1... 5230592 steps to go (11:56:30 remaining)
maxErr exceeded for 1039322^262144+1, 0.5000 > 0.4500
maxErr exceeded while using SSE2; switching to SSE3.
Testing 1039322^262144+1...
Using SSE3 transform
Resuming 1039322^262144+1 from a checkpoint (5233521 iterations left)
Estimated total run time for 1039322^262144+1 is 12:42:42
Testing 1039322^262144+1... 5210112 steps to go (12:38:25 remaining)
____________
My stats | |
|
|
Hi Honza, thanks for the testing!
Updated to the latest version, which started the test from scratch. It's OK, checkpoint from old version.
I'm not quite clear what you mean here, but you cannot/should not restart a calculation with the latest 3.2.0-0beta binaries from a checkpoint from a previous version. It probably won't work, and if it does it will give you wrong results!
Aborted the test so restarting from checkpoint can be tested.
This should work OK, and you will find when it restarts that it starts again with the fastest transform, and only drops back to slower ones if it receives a MaxErr.
What I found interesting: It started on SSE2 then switched to SSE3, not the other way around.
This is working as designed: you have one of the particular host/WU combinations where SSE2 is faster than SSE3. For SSE2 the runtime estimate is 11:57:43, and for SSE3 it is 12:42:42.
Other instances started using AVX and holding so far.
The output you showed was from a host that doesn't support AVX, right?
____________
Twitter: IainBethune
Proud member of team "Aggie The Pew". Go Aggie!
3073428256125*2^1290000-1 is Prime! | |
|
|
Hi Iain,
thanks for the new builds. I'm testing on Windows using PRPNet.
As far as I can see, everything works as expected. The checkpoint file issue is gone!
[2014-01-08 19:27:00 MZ] GFN65536: Getting work from server prpnet.primegrid.com at port 12003
[2014-01-08 19:27:02 MZ] GFN65536: PRPNet server is version 5.2.8
Generalized Fermat Number Prime Search N=65536
genefer 3.2.0beta-0 (Windows/CPU/64-bit)
Supported transform implementations: default sse2 x87 avx sse3
Copyright 2001-2013, Yves Gallot
Copyright 2009, Mark Rodenkirch, David Underbakke
Copyright 2010-2012, Shoichiro Yamada, Ken Brazier
Copyright 2011-2013, Iain Bethune, Michael Goetz, Ronald Schneider
Command line: genefer_windows64.exe work_GFN65536.in
Priority change succeeded.
Start test of file 'work_GFN65536.in' - 19:27:02
Testing 3394104^65536+1...
Using AVX transform
Starting initialization...
Initialization complete (0.070 seconds).
Testing 3394104^65536+1... 1421312 steps to go
maxErr exceeded for 3394104^65536+1, 0.5000 > 0.4500
maxErr exceeded while using AVX; switching to SSE3.
Testing 3394104^65536+1...
Using SSE3 transform
Starting initialization...
Initialization complete (0.075 seconds).
Testing 3394104^65536+1... 1421312 steps to go
maxErr exceeded for 3394104^65536+1, 0.5000 > 0.4500
maxErr exceeded while using SSE3; switching to SSE2.
Testing 3394104^65536+1...
Using SSE2 transform
Starting initialization...
Initialization complete (0.090 seconds).
Testing 3394104^65536+1... 1421312 steps to go
maxErr exceeded for 3394104^65536+1, 0.5000 > 0.4500
maxErr exceeded while using SSE2; switching to Default.
Testing 3394104^65536+1...
Using Default transform
Starting initialization...
Initialization complete (0.085 seconds).
Testing 3394104^65536+1... 1421312 steps to go
maxErr exceeded for 3394104^65536+1, 0.5000 > 0.4500
maxErr exceeded while using Default; switching to x87 (80-bit).
Testing 3394104^65536+1...
Using x87 (80-bit) transform
Starting initialization...
Initialization complete (0.060 seconds).
Estimated total run time for 3394104^65536+1 is 1:20:55
3394104^65536+1 is a probable composite. (RES=624922acdbe8f402) (427998 digits) (err = 0.0024) (time = 1:18:55) 20:45:59
[2014-01-08 20:45:59 MZ] GFN65536: 3394104^65536+1 is not prime. Residue 624922acdbe8f402
[2014-01-08 20:45:59 MZ] Total Time: 7:35:00 Total Work Units: 5 Special Results Found: 0
[2014-01-08 20:46:00 MZ] GFN65536: Returning work to server prpnet.primegrid.com at port 12003
[2014-01-08 20:46:00 MZ] GFN65536: INFO: Test for 3394104^65536+1 was accepted
[2014-01-08 20:46:00 MZ] GFN65536: INFO: All 1 test results were accepted
[2014-01-08 20:46:01 MZ] GFN262144: Getting work from server prpnet.primegrid.com at port 11002
[2014-01-08 20:46:02 MZ] GFN262144: PRPNet server is version 5.2.8
Hi! Welcome to PrimeGrid's GFN 262144 Prime Search.
genefer 3.2.0beta-0 (Windows/CPU/64-bit)
Supported transform implementations: default sse2 x87 avx sse3
Copyright 2001-2013, Yves Gallot
Copyright 2009, Mark Rodenkirch, David Underbakke
Copyright 2010-2012, Shoichiro Yamada, Ken Brazier
Copyright 2011-2013, Iain Bethune, Michael Goetz, Ronald Schneider
Command line: genefer_windows64.exe work_GFN262144.in
Priority change succeeded.
Start test of file 'work_GFN262144.in' - 20:46:02
Testing 1039390^262144+1...
Using AVX transform
Starting initialization...
Initialization complete (0.690 seconds).
Estimated total run time for 1039390^262144+1 is 3:19:58
1039390^262144+1 is a probable composite. (RES=297c6a06b3957a8d) (1577263 digits) (err = 0.4219) (time = 3:34:22) 00:20:25
[2014-01-09 00:20:25 MZ] GFN262144: 1039390^262144+1 is not prime. Residue 297c6a06b3957a8d
[2014-01-09 00:20:25 MZ] Total Time: 11:09:26 Total Work Units: 6 Special Results Found: 0
[2014-01-09 00:20:25 MZ] GFN262144: Returning work to server prpnet.primegrid.com at port 11002
[2014-01-09 00:20:26 MZ] GFN262144: INFO: Test for 1039390^262144+1 was accepted
[2014-01-09 00:20:26 MZ] GFN262144: INFO: All 1 test results were accepted
Excellent work!
____________
| |
|
Honza Volunteer moderator Volunteer tester Project scientist
Joined: 15 Aug 05 Posts: 1963 ID: 352 Credit: 6,418,065,418 RAC: 2,742,360
|
This is working as designed: you have one of the particular host/WU combinations where SSE2 is faster than SSE3. For SSE2 the runtime estimate is 11:57:43, and for SSE3 it is 12:42:42.
I wasn't paying enough attention - yes, SSE2 is faster on this CPU so everything is OK.
Yes, this was a host without AVX.
Some tests on the other CPU using AVX look fine so far.
____________
My stats | |
|
Honza Volunteer moderator Volunteer tester Project scientist
Joined: 15 Aug 05 Posts: 1963 ID: 352 Credit: 6,418,065,418 RAC: 2,742,360
|
This one is interesting...
It started using AVX, then fell back to SSE3, SSE2, Default and x87.
I aborted it to see whether it could continue from the checkpoint, and it did - using AVX. It then completed using AVX, saving a lot of time.
Start test of file 'work_GFN262144.in' - 02:59:13
Testing 1040142^262144+1...
Using AVX transform
Starting initialization...
Initialization complete (0.874 seconds).
Estimated total run time for 1040142^262144+1 is 5:01:06
Testing 1040142^262144+1... 4624384 steps to go (4:25:44 remaining)
maxErr exceeded for 1040142^262144+1, 0.4531 > 0.4500
maxErr exceeded while using AVX; switching to SSE3.
Testing 1040142^262144+1...
Using SSE3 transform
Resuming 1040142^262144+1 from a checkpoint (4628479 iterations left)
Estimated total run time for 1040142^262144+1 is 8:56:49
Testing 1040142^262144+1... 4624384 steps to go (7:53:46 remaining)
maxErr exceeded for 1040142^262144+1, 0.4531 > 0.4500
maxErr exceeded while using SSE3; switching to SSE2.
Testing 1040142^262144+1...
Using SSE2 transform
Resuming 1040142^262144+1 from a checkpoint (4628479 iterations left)
Estimated total run time for 1040142^262144+1 is 11:04:45
Testing 1040142^262144+1... 4624384 steps to go (9:46:40 remaining)
maxErr exceeded for 1040142^262144+1, 0.5000 > 0.4500
maxErr exceeded while using SSE2; switching to Default.
Testing 1040142^262144+1...
Using Default transform
Resuming 1040142^262144+1 from a checkpoint (4628479 iterations left)
Estimated total run time for 1040142^262144+1 is 14:19:40
Testing 1040142^262144+1... 4624384 steps to go (12:38:42 remaining)
maxErr exceeded for 1040142^262144+1, 0.4531 > 0.4500
maxErr exceeded while using Default; switching to x87 (80-bit).
Testing 1040142^262144+1...
Using x87 (80-bit) transform
Resuming 1040142^262144+1 from a checkpoint (4628479 iterations left)
Estimated total run time for 1040142^262144+1 is 32:06:25
Testing 1040142^262144+1... 3682304 steps to go (22:33:47 remaining)
^C caught. Writing checkpoint.
[2014-01-10 09:24:45 CEST] GFN262144: No data in file [genefer.log]. Is genefer broken?
c:\_Honza\PRPNet.1>prpclient.exe
[2014-01-10 09:24:46 CEST] PRPNet Client application v5.2.7 started
[2014-01-10 09:24:46 CEST] User name Honza at email address is x@xxx.xx
It appears that the PRPNet client was aborted without completing
the workunits asigned by server GFN262144. What do you want to do with them?
...
Choose option: 3
...
Command line: genefer_windows64.exe work_GFN262144.in
Priority change succeeded.
Start test of file 'work_GFN262144.in' - 09:24:49
Testing 1040142^262144+1...
Using AVX transform
Resuming 1040142^262144+1 from a checkpoint (3678528 iterations left)
Estimated total run time for 1040142^262144+1 is 5:03:44
1040142^262144+1 is a probable composite. (RES=76e7632c96ac2f4e) (1577345 digits) (err = 0.4062) (time = 9:57:52) 12:59:01
[2014-01-10 12:59:01 CEST] GFN262144: 1040142^262144+1 is not prime. Residue 76e7632c96ac2f4e
[2014-01-10 12:59:01 CEST] Total Time: 3:34:15 Total Work Units: 1 Special Results Found: 0
____________
My stats | |
|
Michael Goetz Volunteer moderator Project administrator
Joined: 21 Jan 10 Posts: 14043 ID: 53948 Credit: 481,060,436 RAC: 494,925
|
Assuming that the residual is correct (somebody ought to check that), that is interesting. I'd like to get Yves' thoughts on this one.
To me, this seems like a test that's close to the boundary of where AVX can work. Perhaps in this particular test, the part of the test that is "worst" from a limit perspective is the very beginning, and once it's past that part, it can continue using the much faster AVX.
If that's true, perhaps some thought should be given to whether the "one strike and you're out" strategy for downshifting the transform is really the best one when you're near the boundary. Trying to shift back to faster transforms might be advantageous. The logic would have to be a lot smarter if we do that. We need to identify when it's hopeless and not repeatedly waste time trying to run AVX when b is way beyond what AVX can handle.
____________
My lucky number is 75898524288+1 | |
|
|
If that's true, perhaps some thought should be given to whether the "one strike and you're out" strategy for downshifting the transform is really the best one when you're near the boundary. Trying to shift back to faster transforms might be advantageous. The logic would have to be a lot smarter if we do that. We need to identify when it's hopeless and not repeatedly waste time trying to run AVX when b is way beyond what AVX can handle.
This had occurred to me too, but I wasn't sure if it would occur in practice or not. The round-off error does not necessarily grow as the calculation goes on; it may 'peak' several times, so it might well make sense to try the faster, less precise transforms again after some number of 'successful' iterations on the slower transform. To some extent, it depends on the relative performance of each and the cost of making the switch (which should not be too high, since we skip the initialisation phase when switching transforms).
I need to think about this some...
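One way the retry idea discussed above could look, as a simulation only. Everything here is hypothetical: the function name, the `fails` predicate that stands in for a maxErr at a given iteration, and the `retry_after` parameter. For simplicity it redoes the failing iteration directly, whereas the real program goes back to the last checkpoint.

```python
def run_with_fallback(total_iters, transforms, fails, retry_after):
    """Simulate downshifting on error and retrying the fastest transform later.

    transforms:  names ordered fastest first.
    fails(t, i): hypothetical predicate, True if transform t hits maxErr at iteration i.
    retry_after: successful iterations on a slower transform before retrying the fastest.
    Returns a list of (iteration, transform) switch events.
    """
    level = 0   # index into transforms, 0 = fastest
    good = 0    # consecutive successful iterations at the current level
    events = [(0, transforms[0])]
    i = 0
    while i < total_iters:
        if fails(transforms[level], i):
            if level + 1 == len(transforms):
                raise RuntimeError(f"all transforms failed at iteration {i}")
            level += 1              # one step down to a slower transform
            good = 0
            events.append((i, transforms[level]))
            continue                # redo iteration i with the slower transform
        i += 1
        good += 1
        if level > 0 and good >= retry_after:
            level = 0               # the peak may have passed, try the fastest again
            good = 0
            events.append((i, transforms[0]))
    return events


if __name__ == "__main__":
    # Pretend AVX only fails during a short 'peak' of round-off error.
    def peak(t, i):
        return t == "avx" and 10 <= i < 13

    for event in run_with_fallback(30, ["avx", "sse3", "x87"], peak, retry_after=5):
        print(event)
```

If the fast transform fails at every iteration, this loop still terminates, but it wastes one failed attempt per `retry_after` iterations. That is the "hopeless" case mentioned above, and a real implementation would presumably want to back off or give up on the fast transform after repeated failures.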
____________
Twitter: IainBethune
Proud member of team "Aggie The Pew". Go Aggie!
3073428256125*2^1290000-1 is Prime! | |
|
|
Assuming that the residual is correct (somebody ought to check that)
Just ran the test with geneferocl (which hit a maxErr just before the end), and completed the test with genefersse3. Got the same residual as Honza, so I'm happy the code is correct!
Testing 1040142^262144+1...
Using SSE3 transform
Resuming 1040142^262144+1 from a checkpoint (255 iterations left)
1040142^262144+1 is a probable composite. (RES=76e7632c96ac2f4e) (1577345 digits) (err = 0.3125) (time = 1:08:26) 21:09:51
____________
Twitter: IainBethune
Proud member of team "Aggie The Pew". Go Aggie!
3073428256125*2^1290000-1 is Prime! | |
|
Roger Volunteer developer Volunteer tester
Joined: 27 Nov 11 Posts: 1138 ID: 120786 Credit: 268,668,824 RAC: 0
|
If you're not too busy running the current GFN challenge, please test and continue to post results here!
Cheers
- Iain
Successfully tested 1x GFN524288. I like how the program continues from the last checkpoint. % complete at time of transform change would be nice.
>genefer_windows64.exe -q "771564^524288+1"
genefer 3.2.0beta-0 (Windows/CPU/64-bit)
Supported transform implementations: default sse2 x87 sse3
Copyright 2001-2014, Yves Gallot
Copyright 2009, Mark Rodenkirch, David Underbakke
Copyright 2010-2012, Shoichiro Yamada, Ken Brazier
Copyright 2011-2014, Iain Bethune, Michael Goetz, Ronald Schneider
Command line: genefer_windows64.exe -q 771564^524288+1
Priority change succeeded.
No relevant benchmark data exists, testing available transform implementations...
Benchmarks completed (148.572 seconds).
Testing 771564^524288+1...
Using SSE2 transform
Starting initialization...
Initialization complete (2.119 seconds).
Estimated total run time for 771564^524288+1 is 58:08:39
Testing 771564^524288+1... 9867264 steps to go (55:57:10 remaining)
maxErr exceeded for 771564^524288+1, 0.5000 > 0.4500
maxErr exceeded while using SSE2; switching to SSE3.
Testing 771564^524288+1...
Using SSE3 transform
Resuming 771564^524288+1 from a checkpoint (9871359 iterations left)
Estimated total run time for 771564^524288+1 is 75:45:59
771564^524288+1 is a probable composite. (RES=715153c25a88143d) (3086679 digits)
(err = 0.3594) (time = 64:17:30) 14:01:58
Going to try a GFN262144 next. | |
|
Roger Volunteer developer Volunteer tester
Joined: 27 Nov 11 Posts: 1138 ID: 120786 Credit: 268,668,824 RAC: 0
|
Successfully tested a GFN262144:
C:\Users\Roger\Downloads\GeneferLatest>genefer_windows64.exe -q "1040142^262144+1"
genefer 3.2.0beta-0 (Windows/CPU/64-bit)
Supported transform implementations: default sse2 x87 sse3
Copyright 2001-2014, Yves Gallot
Copyright 2009, Mark Rodenkirch, David Underbakke
Copyright 2010-2012, Shoichiro Yamada, Ken Brazier
Copyright 2011-2014, Iain Bethune, Michael Goetz, Ronald Schneider
Command line: genefer_windows64.exe -q 1040142^262144+1
Priority change succeeded.
No relevant benchmark data exists, testing available transform implementations...
Benchmarks completed (63.730 seconds).
Testing 1040142^262144+1...
Using SSE2 transform
Starting initialization...
Initialization complete (0.808 seconds).
Estimated total run time for 1040142^262144+1 is 11:43:52
Testing 1040142^262144+1... 5234688 steps to go (11:43:11 remaining)
maxErr exceeded for 1040142^262144+1, 0.5000 > 0.4500
maxErr exceeded while using SSE2; switching to SSE3.
Testing 1040142^262144+1...
Using SSE3 transform
Resuming 1040142^262144+1 from a checkpoint (5238783 iterations left)
Estimated total run time for 1040142^262144+1 is 14:07:6
Testing 1040142^262144+1... 4624384 steps to go (12:27:36 remaining)
maxErr exceeded for 1040142^262144+1, 0.4531 > 0.4500
maxErr exceeded while using SSE3; switching to Default.
Testing 1040142^262144+1...
Using Default transform
Resuming 1040142^262144+1 from a checkpoint (4628479 iterations left)
Estimated total run time for 1040142^262144+1 is 20:18:52
Testing 1040142^262144+1... 4624384 steps to go (17:55:42 remaining)
maxErr exceeded for 1040142^262144+1, 0.4531 > 0.4500
maxErr exceeded while using Default; switching to x87 (80-bit).
Testing 1040142^262144+1...
Using x87 (80-bit) transform
Resuming 1040142^262144+1 from a checkpoint (4628479 iterations left)
Estimated total run time for 1040142^262144+1 is 47:07:8
1040142^262144+1 is a probable composite. (RES=76e7632c96ac2f4e) (1577345 digits) (err = 0.0005) (time = 42:48:29) 14:02:58
I note it re-does the benchmarks for a GFN262144 even though it just did a GFN524288 in the same directory. Let me know if any other tests would be useful. | |
|
Michael Goetz Volunteer moderator Project administrator
Joined: 21 Jan 10 Posts: 14043 ID: 53948 Credit: 481,060,436 RAC: 494,925
|
Benchmarks are repeated at the start of every test. Not only might the system perform differently with a different n (or b), but environmental conditions could have changed which might affect performance.
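As a rough illustration of that per-test benchmarking, here is a minimal Python sketch. It is not genefer's code: the function name `pick_fastest`, the transform callables, and the injectable `clock` are all assumptions made for the example. It simply times each candidate once and returns the quickest.

```python
import time


def pick_fastest(transforms, clock=time.perf_counter):
    """Time each candidate once and return (fastest_name, timings).

    `transforms` maps a transform name to a zero-argument callable that
    runs a short, fixed workload (hypothetical stand-ins for the real
    transform implementations).
    """
    timings = {}
    for name, run_once in transforms.items():
        start = clock()
        run_once()
        timings[name] = clock() - start
    # The name with the smallest elapsed time wins.
    fastest = min(timings, key=timings.get)
    return fastest, timings
```

Because the timings are measured on the spot, the choice follows whatever the machine is doing at that moment, which is the reason given above for not reusing results from an earlier test.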
____________
My lucky number is 75898^524288+1 | |
|
|
Hi all, new binaries are posted at https://www.assembla.com/code/genefer/subversion/nodes in the trunk/bin/<os> directories. At the moment only the Windows binaries are available, as my main dev box is down for a HDD restore :(
This version introduces a scheme to automatically restart the calculation with a faster transform after a maxErr is encountered and successfully recovered from using a slower transform. I have done some testing, but it would be good to see how it behaves with the 'close-to-B-limit' tests that are currently available via PRPNet.
The checkpoint file format is the same, so you can just drop this app in as a replacement for the current binary without any worries!
Please have a go and let me know the results.
Cheers
- Iain
|
Roger Volunteer developer Volunteer tester
Joined: 27 Nov 11 Posts: 1138 ID: 120786 Credit: 268,668,824 RAC: 0
|
Iain's commit message in the source control describes the implemented algorithm:
Author: ibethune
2014-02-01 15:41 (about 2 hours ago)
Allow transform 'up-shift' after successful progress with a slower transform. Algorithm is as follows:
* When down-shifting, only attempt to compute as far as the next checkpoint.
* When the checkpoint is reached, restart with the fastest available transform (subject to the following rule):
** If a maxErr exceeded is encountered, this increments a counter (per transform).
** A penalty of 5 is applied if the maxErr is encountered before reaching a checkpoint as this indicates the transform is likely past the B limit.
** Once the count reaches 10 that transform is blacklisted and will no longer be used.
This should reduce the impact of 'transient' maxErrs - we can see 10 within a calculation and still continue with the fast transform. 'hard' maxErrs i.e. beyond the B limit, we will stop using the transform after seeing 2 of these (just to be sure). The total of 10 and penalty of 5 might need tuning.
Here 'up-shift' means moving to a faster transform and 'down-shift' means moving to a slower one. Faster transforms are less precise much of the time, but not always.
Let's say we have 5 transforms and the 4 fastest hard-fail before the first checkpoint every time. It will try each transform from fastest to slowest twice, then stay on the slowest transform until completion.
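The five-transform scenario can be replayed with a small Python sketch of the counter scheme from the commit message. This is only an illustration, not the genefer source: the class and function names are made up, and it assumes that a maxErr after a checkpoint adds 1 to the counter while a maxErr before a checkpoint adds the penalty of 5.

```python
BLACKLIST_THRESHOLD = 10   # "Once the count reaches 10 that transform is blacklisted"
EARLY_FAILURE_PENALTY = 5  # maxErr seen before reaching a checkpoint


class TransformShifter:
    def __init__(self, transforms):
        self.transforms = list(transforms)  # ordered fastest first
        self.errors = {name: 0 for name in self.transforms}

    def record_max_err(self, name, reached_checkpoint):
        # Assumption: a transient maxErr costs 1, an early one costs 5.
        self.errors[name] += 1 if reached_checkpoint else EARLY_FAILURE_PENALTY

    def blacklisted(self, name):
        return self.errors[name] >= BLACKLIST_THRESHOLD

    def fastest(self):
        # Up-shift target: the fastest transform that is not blacklisted.
        for name in self.transforms:
            if not self.blacklisted(name):
                return name
        raise RuntimeError("maxErr exceeded by all available transforms")

    def next_slower(self, name):
        # Down-shift target: the next usable transform after `name`.
        start = self.transforms.index(name) + 1
        for candidate in self.transforms[start:]:
            if not self.blacklisted(candidate):
                return candidate
        raise RuntimeError("maxErr exceeded by all available transforms")


def simulate(transforms, hard_failing, checkpoints):
    """Return the order in which transforms are attempted."""
    shifter = TransformShifter(transforms)
    attempts = []
    current = shifter.fastest()
    done = 0
    while done < checkpoints:
        attempts.append(current)
        if current in hard_failing:
            # Fails before the next checkpoint: penalise and down-shift.
            shifter.record_max_err(current, reached_checkpoint=False)
            current = shifter.next_slower(current)
        else:
            # Reached a checkpoint: up-shift to the fastest usable transform.
            done += 1
            current = shifter.fastest()
    return attempts


order = simulate(["AVX", "SSE3", "SSE2", "Default", "x87"],
                 hard_failing={"AVX", "SSE3", "SSE2", "Default"},
                 checkpoints=3)
print(order)
```

Under those assumptions each of the four failing transforms is attempted exactly twice before being blacklisted, after which the run stays on x87.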
The -x flag still forces use of a particular transform and no up or down shifting will occur.
My only critique is that maybe it should complete more than one checkpoint on a successful transform; otherwise it will bounce through the faster transforms too quickly.
Now let's test it! | |
|
Honza Volunteer moderator Volunteer tester Project scientist
Joined: 15 Aug 05 Posts: 1963 ID: 352 Credit: 6,418,065,418 RAC: 2,742,360
|
So far so good.
Roger, I guess the up-shift to a faster transform after a checkpoint should work fine; there are enough iterations in between.
What I wish for is a way to set the checkpoint interval via an .ini file. I've mentioned elsewhere that checkpointing is a bit I/O intensive, especially for machines running 24/7 with a bunch of cores.
Testing 1045004^262144+1...
Using AVX transform
Starting initialization...
Initialization complete (0.811 seconds).
Estimated total run time for 1045004^262144+1 is 5:05:19
Testing 1045004^262144+1... 2502656 steps to go (2:25:46 remaining)
maxErr exceeded for 1045004^262144+1, 0.4531 > 0.4500
maxErr exceeded while using AVX; switching to SSE3.
Testing 1045004^262144+1...
Using SSE3 transform
Resuming 1045004^262144+1 from a checkpoint (2502655 iterations left)
Estimated total run time for 1045004^262144+1 is 9:03:48
maxErr exceeded for 1045004^262144+1, 0.4531 > 0.4500
maxErr exceeded while using SSE3; switching to SSE2.
Testing 1045004^262144+1...
Using SSE2 transform
Resuming 1045004^262144+1 from a checkpoint (2502655 iterations left)
Estimated total run time for 1045004^262144+1 is 11:03:45
maxErr exceeded for 1045004^262144+1, 0.4688 > 0.4500
maxErr exceeded while using SSE2; switching to Default.
Testing 1045004^262144+1...
Using Default transform
Resuming 1045004^262144+1 from a checkpoint (2502655 iterations left)
Estimated total run time for 1045004^262144+1 is 14:29:34
Successful computation progress with Default; switching back to AVX.)
Testing 1045004^262144+1...
Using AVX transform
Resuming 1045004^262144+1 from a checkpoint (2498559 iterations left)
Estimated total run time for 1045004^262144+1 is 5:12:08
1045004^262144+1 is a probable composite. (RES=0de8012d27aa62c2) (1577876 digits) (err = 0.4062) (time = 5:05:08) 10:15:13
[2014-02-02 10:15:13 CEST] GFN262144: 1045004^262144+1 is not prime. Residue 0de8012d27aa62c2
____________
My stats | |
|
|
Thanks Honza, that appears to be working as designed. For the record, I retested 1040142^262144+1 with the new code and got a matching residue, so all seems to be working well.
It should be fairly straightforward to limit the checkpoint rate according to a value in a file. Do you want this to be a time-based measure i.e. to checkpoint every X seconds? This is similar to BOINC, where we currently only checkpoint after a certain amount of time (set by the client) has elapsed.
My only critique is that maybe should do more than one checkpoint on a successful transform, otherwise will bounce through the faster transforms too quickly.
Roger - on this issue, doing a single checkpoint is enough. If the original max-error is a transient one (i.e. the fast transform is well below its B limit), then we are unlikely to see another soon, and we want to get back on the faster transform as soon as possible. If the fast transform is over its B limit, we also want to try it again as soon as possible, so we can be sure this is the case and then stop using it.
Cheers
- Iain
|
Honza Volunteer moderator Volunteer tester Project scientist
Joined: 15 Aug 05 Posts: 1963 ID: 352 Credit: 6,418,065,418 RAC: 2,742,360
|
Yes, I prefer a time-based measure. We are used to it, and it makes things easier for participants.
An iteration-based interval would require a bit of math to account for different N's and different speeds among transforms, CUDA etc.
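A time-based limit of the kind discussed here could look like the following Python sketch. It is hypothetical and not taken from genefer or BOINC: the class name, the `interval_seconds` parameter (which would come from the .ini file), and the injectable `clock` are all assumptions for illustration.

```python
import time


class CheckpointTimer:
    """Decide when to checkpoint based on elapsed wall-clock time."""

    def __init__(self, interval_seconds, clock=time.monotonic):
        self.interval = interval_seconds
        self.clock = clock
        self.last_checkpoint = clock()

    def due(self):
        """Return True (and restart the timer) once the interval has elapsed."""
        now = self.clock()
        if now - self.last_checkpoint >= self.interval:
            self.last_checkpoint = now
            return True
        return False
```

The main loop would call `due()` at each point where a checkpoint is possible and only write the file when it returns True, so the I/O rate stays the same regardless of N or of which transform is running.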
Thanks Honza, that appears to be working as designed. For the record, I retested 1040142^262144+1 with the new code and got a matching residue, so all seems to be working well.
It should be fairly straightforward to limit the checkpoint rate according to a value in a file. Do you want this to be a time-based measure i.e. to checkpoint every X seconds? This is similar to BOINC, where we currently only checkpoint after a certain amount of time (set by the client) has elapsed.
I expect the total time per task to become much more consistent, i.e. smaller variability.
Running GFN262144 for a while using 3 instances, I've got 200+ WUs from each instance to compare with.
GeneferAVX B limit is ~1,065,000, we are doing ~8K (1K tests) a month and with recent 1,045,072 we should be there in ~2 months. (I could put 10x the CPU power on this and find out in a week, but I guess there is no rush.)
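The estimate can be checked with a few lines of Python, using only the approximate figures quoted in this post (the variable names are just for the example):

```python
b_limit = 1_065_000    # approximate GeneferAVX B limit quoted above
current_b = 1_045_072  # most recent b quoted above
b_per_month = 8_000    # approximate progress in b per month

months_left = (b_limit - current_b) / b_per_month  # 19,928 / 8,000
print(f"{months_left:.1f} months")  # prints "2.5 months"
```

So at the current rate the port reaches the limit in roughly two to two and a half months.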
|
Roger Volunteer developer Volunteer tester
Joined: 27 Nov 11 Posts: 1138 ID: 120786 Credit: 268,668,824 RAC: 0
|
Resuming application after Ctrl-C works fine:
genefer 3.2.0beta-0 (Windows/CPU/64-bit)
Supported transform implementations: default sse2 x87 sse3
Copyright 2001-2014, Yves Gallot
Copyright 2009, Mark Rodenkirch, David Underbakke
Copyright 2010-2012, Shoichiro Yamada, Ken Brazier
Copyright 2011-2014, Iain Bethune, Michael Goetz, Ronald Schneider
Command line: genefer_windows64.exe work_GFN262144.in
Priority change succeeded.
Start test of file 'work_GFN262144.in' - 21:17:03
Testing 1044944^262144+1...
Using SSE2 transform
Resuming 1044944^262144+1 from a checkpoint (1097811 iterations left)
Testing 1044944^262144+1... 1097728 steps to go
maxErr exceeded for 1044944^262144+1, 0.5000 > 0.4500
maxErr exceeded while using SSE2; switching to SSE3.
Testing 1044944^262144+1...
Using SSE3 transform
Resuming 1044944^262144+1 from a checkpoint (1097727 iterations left)
Estimated total run time for 1044944^262144+1 is 17:17:07
Successful computation progress with SSE3; switching back to SSE2.ng)
Testing 1044944^262144+1...
Using SSE2 transform
Resuming 1044944^262144+1 from a checkpoint (1093631 iterations left)
maxErr exceeded for 1044944^262144+1, 0.5000 > 0.4500
maxErr exceeded while using SSE2; switching to SSE3.
Testing 1044944^262144+1...
Using SSE3 transform
Resuming 1044944^262144+1 from a checkpoint (1093631 iterations left)
Estimated total run time for 1044944^262144+1 is 17:15:44
Successful computation progress with SSE3; switching back to SSE2.ng)
Testing 1044944^262144+1...
Using SSE2 transform
Resuming 1044944^262144+1 from a checkpoint (1089535 iterations left)
maxErr exceeded for 1044944^262144+1, 0.5000 > 0.4500
maxErr exceeded while using SSE2; switching to SSE3.
Too many errors with SSE2; Calculation will proceed using only more accurate transforms.
Testing 1044944^262144+1...
Using SSE3 transform
Resuming 1044944^262144+1 from a checkpoint (1089535 iterations left)
Estimated total run time for 1044944^262144+1 is 17:15:44
1044944^262144+1 is a probable composite. (RES=aed6fd019a4cdee5) (1577870 digits) (err = 0.4062) (time = 16:43:41) 00:54:03
SSE2 transform is well over the B-limit here so hard fails, but SSE3 is almost as fast and completes it no worries. Before this I stopped the application with Ctrl-C, then restarted. Transform failure counters get reset on restart.
Note the extra "ng)" at the end of the "Successful computation progress..." lines. | |
|
Roger Volunteer developer Volunteer tester
Joined: 27 Nov 11 Posts: 1138 ID: 120786 Credit: 268,668,824 RAC: 0
|
Is there a plan for an OpenCL version with auto-transform shifting?
GFN524288 on PRPnet still has ~50,000 worth of b left for OpenCL but I am getting about 1 in 5 WUs fail due to maxErr. | |
|
|
Is there a plan for an OpenCL version with auto-transform shifting?
GFN524288 on PRPnet still has ~50,000 worth of b left for OpenCL but I am getting about 1 in 5 WUs fail due to maxErr.
I wonder if the existing OpenCL app is more sensitive to the approaching 'B' limit on 524288 than genefercuda is... I've been running the latter for about 2.5 days now (at 3 hours per WU) without a maxErr failure. Note that this is with the existing stock app, not the new/beta 3.2.0 version (sorry for the digression).
Back on topic, Iain, if/when a Linux version of the latest source is available, let us know if you need testing assistance.
--Gary | |
|
|
Is there a plan for an OpenCL version with auto-transform shifting?
Not at the moment - we only have one implementation of the transform using OpenCL, and to an extent it auto-tunes itself for better performance. There is no 'high B limit' OpenCL implementation, so nothing to shift down to once the current transform hits maxErr exceeded. It is *possible* (with some implementation effort), to combine OpenCL (or CUDA) with the CPU transforms, but I don't have an immediate plan to do so.
- Iain
|
|
Back on topic, Iain, if/when a Linux version of the latest source is available, let us know if you need testing assistance.
--Gary
There are reasonably up-to-date Linux apps in the SVN (missing only the cosmetic fix to the output overwriting that Roger reported). Please give them a go and let me know. There are also Linux 3.2.0 binaries for CUDA and OpenCL. The CUDA code should be marginally faster than the 3.1.2, and the OpenCL is essentially unchanged.
- Iain
|
Scott Brown Volunteer moderator Project administrator Volunteer tester Project scientist
Joined: 17 Oct 05 Posts: 2420 ID: 1178 Credit: 20,146,962,503 RAC: 22,766,775
|
Is there a plan for an OpenCL version with auto-transform shifting?
Not at the moment - we only have one implementation of the transform using OpenCL, and to an extent it auto-tunes itself for better performance. There is no 'high B limit' OpenCL implementation, so nothing to shift down to once the current transform hits maxErr exceeded. It is *possible* (with some implementation effort), to combine OpenCL (or CUDA) with the CPU transforms, but I don't have an immediate plan to do so.
- Iain
Actually, what would be most useful at the moment (I think) would be an OpenCL version that auto-transforms to CUDA with maxerr. The failure rate of the OpenCL app is much higher on Kepler cards than CUDA, but it is typically 40% faster. It would be really nice to have an OpenCL app that shifts to CUDA when encountering maxerr (even if it doesn't shift back) to allow these to complete a bit slower rather than failing outright.
| |
|
Michael Goetz Volunteer moderator Project administrator
Joined: 21 Jan 10 Posts: 14043 ID: 53948 Credit: 481,060,436 RAC: 494,925
|
Is there a plan for an OpenCL version with auto-transform shifting?
Not at the moment - we only have one implementation of the transform using OpenCL, and to an extent it auto-tunes itself for better performance. There is no 'high B limit' OpenCL implementation, so nothing to shift down to once the current transform hits maxErr exceeded. It is *possible* (with some implementation effort), to combine OpenCL (or CUDA) with the CPU transforms, but I don't have an immediate plan to do so.
- Iain
Actually, what would be most useful at the moment (I think) would be an OpenCL version that auto-transforms to CUDA with maxerr. The failure rate of the OpenCL app is much higher on Kepler cards than CUDA, but it is typically 40% faster. It would be really nice to have an OpenCL app that shifts to CUDA when encountering maxerr (even if it doesn't shift back) to allow these to complete a bit slower rather than failing outright.
From a linking perspective, I'm not sure if that's possible with a single executable. I think it would have to be a multi-executable app, with a master app that calls different OpenCL and CUDA apps when needed. More complicated, but not impossible.
I'm only guessing about not being able to link CUDA and OpenCL together. It's conceivable it can be done.
|
Honza Volunteer moderator Volunteer tester Project scientist
Joined: 15 Aug 05 Posts: 1963 ID: 352 Credit: 6,418,065,418 RAC: 2,742,360
|
Running the latest Genefer for ~3 days, run times are much more consistent.
The remaining variability is attributable to differing CPU loads from other apps rather than to up-shift/down-shift transform changes.
For sure, the average run-time is now lower :-)
Well done.
I expect total time per task to be much more similar to each other, i.e. smaller variability.
Running GFN262144 for a while using 3 instances, I've got 200+ WUs from each instance to compare with.
GeneferAVX B limit is ~1,065,000, we are doing ~8K (1K tests) a month and with recent 1,045,072 we should be there in ~2 months. (I can put x10 CPU power to this and find out in a week but guess there is no rush).
|
|
Running the latest Genefer for ~3 days, run times are much more consistent.
The remaining variability is attributable to differing CPU loads from other apps rather than to up-shift/down-shift transform changes.
For sure, the average run-time is now lower :-)
Well done.
OK, that's great news!
I also ran a double-check on Roger's 1044944^262144+1 with a single transform (in this case SSE3 - no transform shifting) and got the same residue.
I'm pretty happy everything is working OK now. It would be good if someone could test on Linux (Gary?) and also under BOINC using app_info.xml.
I also need to produce new Windows GPU builds, and I'm working on that!
Cheers
- Iain
|
Roger Volunteer developer Volunteer tester
Joined: 27 Nov 11 Posts: 1138 ID: 120786 Credit: 268,668,824 RAC: 0
|
17 tests so far with GFN262144, all good. Will be interesting over the next 2.5 months as this port heads to the SSE3/AVX hard limit at 1,065,000. No downshifts so far for me to 80-bit unlike this test: http://www.primegrid.com/forum_thread.php?id=5389&nowrap=true#72555 | |
|
Scott Brown Volunteer moderator Project administrator Volunteer tester Project scientist
Joined: 17 Oct 05 Posts: 2420 ID: 1178 Credit: 20,146,962,503 RAC: 22,766,775
|
Is there a plan for an OpenCL version with auto-transform shifting?
Not at the moment - we only have one implementation of the transform using OpenCL, and to an extent it auto-tunes itself for better performance. There is no 'high B limit' OpenCL implementation, so nothing to shift down to once the current transform hits maxErr exceeded. It is *possible* (with some implementation effort), to combine OpenCL (or CUDA) with the CPU transforms, but I don't have an immediate plan to do so.
- Iain
Actually, what would be most useful at the moment (I think) would be an OpenCL version that auto-transforms to CUDA with maxerr. The failure rate of the OpenCL app is much higher on Kepler cards than CUDA, but it is typically 40% faster. It would be really nice to have an OpenCL app that shifts to CUDA when encountering maxerr (even if it doesn't shift back) to allow these to complete a bit slower rather than failing outright.
From a linking perspective, I'm not sure if that's possible with a single executable. I think it would have to be a multi-executable app, with a master app that calls different OpenCL and CUDA apps when needed. More complicated, but not impossible.
I'm only guessing about not being able to link CUDA and OpenCL together. It's conceivable it can be done.
I was just thinking that an easy fix for this on PRPnet would be to set the application priority to use the OpenCL application first and then go to CUDA if both apps are available (i.e., when the comment-out slashes are removed from the CUDA and OCL genefer apps, the CUDA version currently runs first, but it would work best for all NVidia cards if the opposite were true).
| |
|
|
Here's a linux-64-bit test run, using the number used in post 71280:
$ ./genefer_linux64 -q "7028430^32768+1"
genefer 3.2.0beta-0 (Linux/CPU/64-bit)
Supported transform implementations: default sse2 x87 avx sse3
Copyright 2001-2014, Yves Gallot
Copyright 2009, Mark Rodenkirch, David Underbakke
Copyright 2010-2012, Shoichiro Yamada, Ken Brazier
Copyright 2011-2014, Iain Bethune, Michael Goetz, Ronald Schneider
Command line: ./genefer_linux64 -q 7028430^32768+1
Priority change succeeded.
No relevant benchmark data exists, testing available transform implementations...
Benchmarks completed (2.660 seconds).
Testing 7028430^32768+1...
Using AVX transform
Starting initialization...
Initialization complete (0.020 seconds).
Testing 7028430^32768+1... 745299 steps to go
maxErr exceeded for 7028430^32768+1, 0.5000 > 0.4500
maxErr exceeded while using AVX; switching to SSE3.
Testing 7028430^32768+1...
Using SSE3 transform
Resuming 7028430^32768+1 from a checkpoint (745299 iterations left)
maxErr exceeded for 7028430^32768+1, 0.5000 > 0.4500
maxErr exceeded while using SSE3; switching to SSE2.
Testing 7028430^32768+1...
Using SSE2 transform
Resuming 7028430^32768+1 from a checkpoint (745299 iterations left)
maxErr exceeded for 7028430^32768+1, 0.5000 > 0.4500
maxErr exceeded while using SSE2; switching to Default.
Testing 7028430^32768+1...
Using Default transform
Resuming 7028430^32768+1 from a checkpoint (745299 iterations left)
maxErr exceeded for 7028430^32768+1, 0.5000 > 0.4500
maxErr exceeded while using Default; switching to x87 (80-bit).
Testing 7028430^32768+1...
Using x87 (80-bit) transform
Resuming 7028430^32768+1 from a checkpoint (745299 iterations left)
Estimated total run time for 7028430^32768+1 is 0:14:31
Testing 7028430^32768+1... 741376 steps to go (0:14:27 remaining)
Successful computation progress with x87 (80-bit); switching back to AVX.
Testing 7028430^32768+1...
Using AVX transform
Resuming 7028430^32768+1 from a checkpoint (741375 iterations left)
maxErr exceeded for 7028430^32768+1, 0.5000 > 0.4500
maxErr exceeded while using AVX; switching to SSE3.
Too many errors with AVX; Calculation will proceed using only more accurate transforms.
Testing 7028430^32768+1...
Using SSE3 transform
Resuming 7028430^32768+1 from a checkpoint (741375 iterations left)
maxErr exceeded for 7028430^32768+1, 1.0000 > 0.4500
maxErr exceeded while using SSE3; switching to SSE2.
Too many errors with SSE3; Calculation will proceed using only more accurate transforms.
Testing 7028430^32768+1...
Using SSE2 transform
Resuming 7028430^32768+1 from a checkpoint (741375 iterations left)
maxErr exceeded for 7028430^32768+1, 1.0000 > 0.4500
maxErr exceeded while using SSE2; switching to Default.
Too many errors with SSE2; Calculation will proceed using only more accurate transforms.
Testing 7028430^32768+1...
Using Default transform
Resuming 7028430^32768+1 from a checkpoint (741375 iterations left)
maxErr exceeded for 7028430^32768+1, 1.0000 > 0.4500
maxErr exceeded while using Default; switching to x87 (80-bit).
Too many errors with Default; Calculation will proceed using only more accurate transforms.
Testing 7028430^32768+1...
Using x87 (80-bit) transform
Resuming 7028430^32768+1 from a checkpoint (741375 iterations left)
Estimated total run time for 7028430^32768+1 is 0:14:31
7028430^32768+1 is a probable composite. (RES=a2da3b3d9c10be0b) (224358 digits) (err = 0.0073) (time = 0:14:25) 23:37:28
Residue matches!
--Gary
| |
|
axn Volunteer developer
Joined: 29 Dec 07 Posts: 285 ID: 16874 Credit: 28,027,106 RAC: 0
|
Apart from being slower, is there any evidence that SSE3 (and SSE2) transform is more accurate than AVX? Since they all use IEEE 64-bit, shouldn't they all work out to roughly the same capability? If so, isn't it better to directly failover from AVX (or SSE3/2) to x87 and back?
PS:- It is possible that a (new, would be) transform using Haswell FMA instructions can be more accurate than AVX/SSE3/SSE2, in which case the fastest would also become the best. | |
|
|
Apart from being slower, is there any evidence that SSE3 (and SSE2) transform is more accurate than AVX? Since they all use IEEE 64-bit, shouldn't they all work out to roughly the same capability? If so, isn't it better to directly failover from AVX (or SSE3/2) to x87 and back?
I think SSE3 and AVX should be the same in terms of accuracy, however there is probably a little room for the compiler to cause some differences. SSE2 uses a completely different FFT implementation which is (usually) less accurate. I see what you're getting at, but for the sake of saving a little (much less than 1%) time avoiding trying the intermediate transforms, I'd prefer to avoid the extra complexity in the transform shift logic.
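For readers unfamiliar with the maxErr lines in the logs, here is a minimal Python sketch of the kind of round-off check involved. It assumes the usual test for floating-point FFT multiplication, where every output coefficient should land close to an integer; the thread does not spell out how genefer computes `err`, so the function names and details are illustrative only. The 0.45 limit is the one shown in the logs.

```python
MAX_ERR_LIMIT = 0.45  # threshold shown in the logs ("0.5000 > 0.4500")


def round_off_error(coefficients):
    """Largest distance between any coefficient and its nearest integer."""
    return max(abs(c - round(c)) for c in coefficients)


def round_checked(coefficients):
    """Round to integers, refusing results whose round-off error is too large."""
    err = round_off_error(coefficients)
    if err > MAX_ERR_LIMIT:
        raise ArithmeticError(f"maxErr exceeded, {err:.4f} > {MAX_ERR_LIMIT:.4f}")
    return [round(c) for c in coefficients], err
```

An error near 0.5 means a coefficient sat almost exactly between two integers, so rounding it could go either way; that is the situation in which a more accurate transform is needed.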
PS:- It is possible that a (new, would be) transform using Haswell FMA instructions can be more accurate than AVX/SSE3/SSE2, in which case the fastest would also become the best.
No idea… I haven't looked in detail at FMA3.
- Iain
|
|
Just posted a Windows CUDA 3.2.0-0beta binary in https://www.assembla.com/code/genefer/subversion/nodes in the trunk/bin/windows directory. Please go ahead and test - there are a couple of things particularly to look for:
- My build environment is different to Mike's previous one, so please report if for any reason the binary doesn't work on your system.
- I have built with CUDA 5.5 but I believe it should still run on systems with earlier CUDA SDKs / drivers installed
- There are some small improvements to the CUDA kernels (these have been present in recent linux & mac builds), which might give up to 5% speed improvement, especially for large N.
- The windows build now includes code generated for CC 1.3, 2.0, 3.0 and 3.5, so might run a little faster on modern cards.
Let me know if you have success or not!
Cheers
- Iain
|
|
Also, a Windows OpenCL binary is now available in SVN. Please download and test! It should be exactly the same performance-wise as the current 3.1.2 binary. It's built using the Nvidia OpenCL stack, so if someone has an ATI/AMD card to test on that would be great. I think it should work, but I don't have the hardware to test.
Cheers
- Iain
|
Scott Brown Volunteer moderator Project administrator Volunteer tester Project scientist
Joined: 17 Oct 05 Posts: 2420 ID: 1178 Credit: 20,146,962,503 RAC: 22,766,775
|
GTX 645 on Win Vista Home Premium 64-bit driver 331.82
GFN 524288 PRPnet port
Old CUDA Est. run time: 9 hours 15 minutes
New CUDA Est. run time: Not Applicable...will not run without CUDA 5.5 dll files
Old OCL Est. run time: 7 hours 34 minutes
New OCL Est. run time: 7 hours 36 minutes
EDIT:
Added the CUDA 5.5 files manually and it is faster by a good margin.
New CUDA Est. run time: 8 hours 15 minutes (That's about 12% faster!). | |
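The "about 12%" figure corresponds to the gain in throughput; measured as run time saved it is closer to 11%. A quick Python check using the two estimates above:

```python
old_minutes = 9 * 60 + 15  # old CUDA estimate: 9 h 15 min
new_minutes = 8 * 60 + 15  # new CUDA estimate: 8 h 15 min

time_saved = 1 - new_minutes / old_minutes       # fraction of run time saved
throughput_gain = old_minutes / new_minutes - 1  # extra work per unit time

print(f"run time reduced by {time_saved:.1%}")           # 10.8%
print(f"throughput increased by {throughput_gain:.1%}")  # 12.1%
```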
|
Honza Volunteer moderator Volunteer tester Project scientist
Joined: 15 Aug 05 Posts: 1963 ID: 352 Credit: 6,418,065,418 RAC: 2,742,360
|
Win 7, i5-3570, HD7950.
Running on platform 'AMD Accelerated Parallel Processing', device 'Tahiti', version 'OpenCL 1.2 AMD-APP (1124.2)' and driver '1124.2 (VM)'.
...
Running benchmarks for transform implementation "OCL"
14^262144+1 300451 digits 0 days 0.1 hours (0.39 ms/mul, 998074 iterations) 21978 GFLOPS
75898^262144+1 1279324 digits 0 days 0.4 hours (0.42 ms/mul, 4249818 iterations) 93581 GFLOPS
468750^262144+1 1486604 digits 0 days 0.5 hours (0.39 ms/mul, 4938388 iterations) 108744 GFLOPS
815000^262144+1 1549575 digits 0 days 0.6 hours (0.42 ms/mul, 5147574 iterations) 113350 GFLOPS
14^524288+1 600902 digits 0 days 0.3 hours (0.72 ms/mul, 1996149 iterations) 92097 GFLOPS
75898^524288+1 2558647 digits 0 days 1.6 hours (0.69 ms/mul, 8499637 iterations) 392151 GFLOPS
468750^524288+1 2973207 digits 0 days 1.8 hours (0.67 ms/mul, 9876777 iterations) 455688 GFLOPS
710000^524288+1 3067745 digits 0 days 1.9 hours (0.71 ms/mul, 10190825 iterations) 470178 GFLOPS
14^1048576+1 1201803 digits 0 days 1.3 hours (1.23 ms/mul, 3992299 iterations) 385133 GFLOPS
75898^1048576+1 5117293 digits 0 days 5.7 hours (1.22 ms/mul, 16999276 iterations) 1639903 GFLOPS
468750^1048576+1 5946413 digits 0 days 6.6 hours (1.21 ms/mul, 19753555 iterations) 1905606 GFLOPS
700000^1048576+1 6129030 digits 0 days 6.7 hours (1.20 ms/mul, 20360194 iterations) 1964127 GFLOPS
14^2097152+1 2403605 digits 0 days 5.4 hours (2.45 ms/mul, 7984600 iterations) 1607512 GFLOPS
75898^2097152+1 10234585 digits 0 days 23.1 hours (2.45 ms/mul, 33998553 iterations) 6844813 GFLOPS
380742^2097152+1 11703432 digits 1 days 2.1 hours (2.42 ms/mul, 38877955 iterations) 7827166 GFLOPS
570000^2097152+1 12070945 digits 1 days 2.6 hours (2.40 ms/mul, 40098808 iterations) 8072956 GFLOPS
14^4194304+1 4807210 digits 0 days 21.6 hours (4.88 ms/mul, 15969202 iterations) 6697969 GFLOPS
1248^4194304+1 12986466 digits 2 days 9.5 hours (4.80 ms/mul, 43140102 iterations) 18094270 GFLOPS
10000^4194304+1 16777217 digits 3 days 3.2 hours (4.86 ms/mul, 55732704 iterations) 23375990 GFLOPS
50000^4194304+1 19708909 digits 3 days 15.9 hours (4.84 ms/mul, 65471576 iterations) 27460769 GFLOPS
150000^4194304+1 21710101 digits 3 days 22.6 hours (4.72 ms/mul, 72119391 iterations) 30249065 GFLOPS
309258^4194304+1 23028076 digits 4 days 5.1 hours (4.76 ms/mul, 76497608 iterations) 32085422 GFLOPS
480000^4194304+1 23828853 digits 4 days 7.5 hours (4.71 ms/mul, 79157734 iterations) 33201160 GFLOPS
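To help read the table: the digit count and the run-time estimate in each row can be reproduced approximately from the other columns. The Python sketch below is inferred from the figures themselves, not from genefer's source, and the helper names are made up for the example.

```python
import math


def digits(b, n):
    """Number of decimal digits of b^n + 1, via floating-point log10."""
    return math.floor(n * math.log10(b)) + 1


def estimated_hours(iterations, ms_per_mul):
    """Run-time estimate: iterations times the cost of one multiplication."""
    return iterations * ms_per_mul / 1000 / 3600


# First row of the table: 14^262144+1, 0.39 ms/mul, 998074 iterations.
print(digits(14, 262144))
print(round(estimated_hours(998074, 0.39), 3))
```

For that first row this gives 300451 digits and about 0.108 hours, in line with the "0.1 hours" shown.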
|
Honza Volunteer moderator Volunteer tester Project scientist
Joined: 15 Aug 05 Posts: 1963 ID: 352 Credit: 6,418,065,418 RAC: 2,742,360
|
Win 7, Intel E7400, R280X.
Running on platform 'AMD Accelerated Parallel Processing', device 'Tahiti', version 'OpenCL 1.2 AMD-APP (1348.5)' and driver '1348.5 (VM)'.
Running benchmarks for transform implementation "OCL"
...
14^262144+1 300451 digits 0 days 0.0 hours (0.34 ms/mul, 998074 iterations) 21978 GFLOPS
75898^262144+1 1279324 digits 0 days 0.3 hours (0.33 ms/mul, 4249818 iterations) 93581 GFLOPS
468750^262144+1 1486604 digits 0 days 0.4 hours (0.34 ms/mul, 4938388 iterations) 108744 GFLOPS
815000^262144+1 1549575 digits 0 days 0.5 hours (0.36 ms/mul, 5147574 iterations) 113350 GFLOPS
14^524288+1 600902 digits 0 days 0.3 hours (0.66 ms/mul, 1996149 iterations) 92097 GFLOPS
75898^524288+1 2558647 digits 0 days 1.5 hours (0.66 ms/mul, 8499637 iterations) 392151 GFLOPS
468750^524288+1 2973207 digits 0 days 1.8 hours (0.67 ms/mul, 9876777 iterations) 455688 GFLOPS
710000^524288+1 3067745 digits 0 days 1.8 hours (0.66 ms/mul, 10190825 iterations) 470178 GFLOPS
14^1048576+1 1201803 digits 0 days 1.2 hours (1.09 ms/mul, 3992299 iterations) 385133 GFLOPS
75898^1048576+1 5117293 digits 0 days 5.1 hours (1.09 ms/mul, 16999276 iterations) 1639903 GFLOPS
468750^1048576+1 5946413 digits 0 days 6.0 hours (1.09 ms/mul, 19753555 iterations) 1905606 GFLOPS
700000^1048576+1 6129030 digits 0 days 6.2 hours (1.11 ms/mul, 20360194 iterations) 1964127 GFLOPS
14^2097152+1 2403605 digits 0 days 4.7 hours (2.14 ms/mul, 7984600 iterations) 1607512 GFLOPS
75898^2097152+1 10234585 digits 0 days 20.3 hours (2.16 ms/mul, 33998553 iterations) 6844813 GFLOPS
380742^2097152+1 11703432 digits 0 days 23.1 hours (2.14 ms/mul, 38877955 iterations) 7827166 GFLOPS
570000^2097152+1 12070945 digits 1 days 0.0 hours (2.16 ms/mul, 40098808 iterations) 8072956 GFLOPS
14^4194304+1 4807210 digits 0 days 19.0 hours (4.30 ms/mul, 15969202 iterations) 6697969 GFLOPS
1248^4194304+1 12986466 digits 2 days 3.4 hours (4.30 ms/mul, 43140102 iterations) 18094270 GFLOPS
10000^4194304+1 16777217 digits 2 days 17.7 hours (4.25 ms/mul, 55732704 iterations) 23375990 GFLOPS
50000^4194304+1 19708909 digits 3 days 5.2 hours (4.25 ms/mul, 65471576 iterations) 27460769 GFLOPS
150000^4194304+1 21710101 digits 3 days 13.1 hours (4.25 ms/mul, 72119391 iterations) 30249065 GFLOPS
309258^4194304+1 23028076 digits 3 days 17.9 hours (4.23 ms/mul, 76497608 iterations) 32085422 GFLOPS
480000^4194304+1 23828853 digits 3 days 21.1 hours (4.24 ms/mul, 79157734 iterations) 33201160 GFLOPS
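For anyone reading the table above: the run-time column appears to be the time per multiplication times the iteration count, and the digit count follows from N*log10(b). Here is a minimal Python sketch of that arithmetic. The function names are my own and are not part of genefer, and the one-multiplication-per-iteration model is an assumption inferred from the output.

```python
import math


def estimated_hours(ms_per_mul: float, iterations: int) -> float:
    """Run time in hours, assuming one multiplication per iteration."""
    return ms_per_mul * 1e-3 * iterations / 3600.0


def decimal_digits(b: int, n: int) -> int:
    """Digit count of b^n + 1, estimated with floating-point logarithms."""
    return math.floor(n * math.log10(b)) + 1


if __name__ == "__main__":
    # Values copied from the 570000^2097152+1 and 14^262144+1 rows above.
    print(f"{estimated_hours(2.16, 40098808):.2f} hours")
    print(decimal_digits(14, 262144), "digits")
```

With the numbers from the 570000^2097152+1 row this gives just over 24 hours, which agrees with the "1 days 0.0 hours" shown, and the digit count for 14^262144+1 comes out as the 300451 listed.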
____________
My stats | |
|
Honza Volunteer moderator Volunteer tester Project scientist Send message
Joined: 15 Aug 05 Posts: 1963 ID: 352 Credit: 6,418,065,418 RAC: 2,742,360
                                      
|
Gave it a try, even though the b limit is exceeded.
It was all OK until the WU actually finished - with maxErr.
c:\_PG>geneferocl-windows.exe -q "1052182^262144+1"
geneferocl 3.2.0beta-0 (Windows/OpenCL/32-bit)
Copyright 2001-2014, Yves Gallot
Copyright 2009, Mark Rodenkirch, David Underbakke
Copyright 2010-2012, Shoichiro Yamada, Ken Brazier
Copyright 2011-2014, Iain Bethune, Michael Goetz, Ronald Schneider
Command line: geneferocl-windows.exe -q 1052182^262144+1
Running on platform 'AMD Accelerated Parallel Processing', device 'Tahiti', version 'OpenCL 1.2 AMD-APP (1124.2)' and driver '1124.2 (VM)'.
Testing 1052182^262144+1...
Using OCL transform
Starting initialization...
Initialization complete (1.395 seconds).
Estimated total run time for 1052182^262144+1 is 0:35:34
Testing 1052182^262144+1... 65536 steps to go (0:00:26 remaining)
maxErr exceeded for 1052182^262144+1, 0.4844 > 0.4500
maxErr exceeded by all available transform implementations
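Some background on the message above: FFT-based multiplication runs in floating point, so after each step every coefficient should land very close to an integer. The usual safeguard is to measure the largest distance from the nearest integer and stop when it gets too big. The log suggests genefer uses 0.45 as its limit. The following is a rough Python sketch of that idea only, not genefer's actual code; the function names are hypothetical.

```python
MAX_ERR_LIMIT = 0.45  # threshold as printed in the log line "0.4844 > 0.4500"


def max_roundoff_error(coeffs):
    """Largest distance between any coefficient and its nearest integer."""
    return max(abs(x - round(x)) for x in coeffs)


def check_step(coeffs):
    """Round coefficients to integers, refusing if the round-off is too large."""
    err = max_roundoff_error(coeffs)
    if err > MAX_ERR_LIMIT:
        raise ArithmeticError(
            f"maxErr exceeded, {err:.4f} > {MAX_ERR_LIMIT:.4f}"
        )
    return [int(round(x)) for x in coeffs]
```

Measured this way the error can never be larger than 0.5, which fits the 0.5000 values that show up in some of the logs in this thread.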
____________
My stats | |
|
Honza Volunteer moderator Volunteer tester Project scientist Send message
Joined: 15 Aug 05 Posts: 1963 ID: 352 Credit: 6,418,065,418 RAC: 2,742,360
                                      
|
This one looks OK.
c:\_PG>geneferocl-windows.exe -q "24518^262144+1"
Testing 24518^262144+1...
Using OCL transform
Starting initialization...
Initialization complete (1.095 seconds).
Estimated total run time for 24518^262144+1 is 0:27:00
24518^262144+1 is a probable prime. (1150678 digits) (err = 0.0003) (time = 0:26:23) 18:57:38
____________
My stats | |
|
rroonnaalldd Volunteer developer Volunteer tester
 Send message
Joined: 3 Jul 09 Posts: 1213 ID: 42893 Credit: 34,634,263 RAC: 0
                 
|
GTX 645 on Win Vista Home Premium 64-bit driver 331.82
GFN 524288 PRPnet port
Old CUDA Est. run time: 9 hours 15 minutes
New CUDA Est. run time: Not Applicable...will not run without CUDA 5.5 dll files
Old OCL Est. run time: 7 hours 34 minutes
New OCL Est. run time: 7 hours 36 minutes
EDIT:
Added the CUDA 5.5 files manually and it is faster by a good margin.
New CUDA Est. run time: 8 hours 15 minutes (That's about 12% faster!).
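A quick check of the 12% figure, as a small Python sketch (the helper name is made up for illustration). The number depends on whether you measure throughput gained or run time saved.

```python
def to_minutes(hours: int, minutes: int) -> int:
    """Convert an 'H hours M minutes' estimate to minutes."""
    return hours * 60 + minutes


old_cuda = to_minutes(9, 15)   # old CUDA estimate
new_cuda = to_minutes(8, 15)   # new CUDA estimate with the CUDA 5.5 files

# Throughput gain: how much more work per unit time the new build does.
speedup_percent = (old_cuda / new_cuda - 1.0) * 100.0

# Run-time reduction: how much shorter a single test becomes.
time_saved_percent = (1.0 - new_cuda / old_cuda) * 100.0

print(f"speedup: {speedup_percent:.1f}%, time saved: {time_saved_percent:.1f}%")
```

So "about 12% faster" holds as a throughput figure, while the run time itself is roughly 11% shorter.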
The speed improvement from CUDA 5.x depends on the GPU used. For the 400 and 500 series, CUDA 3.2 seems to be the fastest. CUDA 4.x brought some architectural improvements, but these make no difference for the current PG app. If you want an improvement, you have to find a faster calculation method and rewrite the complete source code, or simply use a faster DP-capable GPU. Linking the current source code against a newer CUDA version changes nothing unless the GPU architecture itself has changed, which is why you see an improvement with CUDA on the 600 and 700 series. In any case, OpenCL produces shorter run times if you have the right NVIDIA GPU. Compared to CUDA, OpenCL seems to be:
- faster for the 600
- even faster for the 700
- equal with minimal improvements for the 500
- slower for the 400 series
The Linux applications have been linked against both CUDA 3.2 and the current CUDA version for a while now. I have not read anything from the PG community about a performance improvement in the interim. My hope with CUDA 5.0 was that this version would unleash the maximum performance of the freshly released GTX Titan. But a GTX Titan, or any card of the 600 and 700 series, in combination with Linux has not been a common setup here on PrimeGrid in the past. I think the main problem was the low DP performance of the entire 600 and 700 series. The CUDA 5.0 app has not been updated since August 2013. I upgraded to CUDA 5.5 and plan to upgrade to CUDA 6.0 after its release, but you can recompile the sources against any CUDA version since 3.2.
____________
Best wishes. Knowledge is power. by jjwhalen
| |
|
288larsson Volunteer tester
 Send message
Joined: 17 Apr 10 Posts: 136 ID: 58815 Credit: 5,988,457,789 RAC: 3,294,646
                                   
|
Hi, I ran a test on PRPNet and it works: CUDA hands over to AVX.
C:\Users\Public\folding\prp\prpclient-5.2.8-windows\prpclient-5.2.8-windows\prpclient-2>prpclient
[2014-02-09 18:13:08 Vn] PRPNet Client application v5.3.0 started
[2014-02-09 18:13:08 Vn] User name 288larsson at email address is
[2014-02-09 18:13:09 Vn] GFN262144: Getting work from server prpnet.primegrid.com at port 11002
[2014-02-09 18:13:10 Vn] GFN262144: PRPNet server is version 5.2.8
Hi! Welcome to PrimeGrid's GFN 262144 Prime Search.
genefercuda 3.2.0beta-0 (Windows/CUDA/32-bit)
Copyright 2001-2014, Yves Gallot
Copyright 2009, Mark Rodenkirch, David Underbakke
Copyright 2010-2012, Shoichiro Yamada, Ken Brazier
Copyright 2011-2014, Iain Bethune, Michael Goetz, Ronald Schneider
Command line: genefercuda.exe -d 1 work_GFN262144.in
Start test of file 'work_GFN262144.in' - 18:13:10
Generalized Fermat Number Bench 2
SHIFT=5 1047046^262144+1 Time: 980 us/mul. Err: 4.06e-001 1578098 digits
SHIFT=6 1047046^262144+1 Time: 460 us/mul. Err: 4.06e-001 1578098 digits
SHIFT=7 1047046^262144+1 Time: 430 us/mul. Err: 4.06e-001 1578098 digits
SHIFT=8 1047046^262144+1 Time: 560 us/mul. Err: 4.06e-001 1578098 digits
SHIFT=9 1047046^262144+1 Time: 931 us/mul. Err: 4.06e-001 1578098 digits
SHIFT=10 1047046^262144+1 Time: 1.92 ms/mul. Err: 4.06e-001 1578098 digits
Best SHIFT determined experimentally to be 7.
Testing 1047046^262144+1...
Using CUDA transform
Using AUTO-SHIFT=7
Starting initialization...
Initialization complete (1.531 seconds).
Estimated total run time for 1047046^262144+1 is 0:38:26
Testing 1047046^262144+1... 1048576 steps to go (0:07:41 remaining)
maxErr exceeded for 1047046^262144+1, 0.4688 > 0.4500
MaxErr exceeded may be caused by overclocking, overheated GPUs and other transient errors.
maxErr exceeded by all available transform implementations
genefer 3.2.0beta-0 (Windows/CPU/64-bit)
Supported transform implementations: default sse2 x87 avx sse3
Copyright 2001-2014, Yves Gallot
Copyright 2009, Mark Rodenkirch, David Underbakke
Copyright 2010-2012, Shoichiro Yamada, Ken Brazier
Copyright 2011-2014, Iain Bethune, Michael Goetz, Ronald Schneider
Command line: genefx64.exe work_GFN262144.in
Priority change succeeded.
Start test of file 'work_GFN262144.in' - 18:44:10
Testing 1047046^262144+1...
Using AVX transform
Starting initialization...
Initialization complete (0.580 seconds).
Estimated total run time for 1047046^262144+1 is 3:26:11
Testing 1047046^262144+1... 5144576 steps to go (3:22:21 remaining) | |
|
Honza Volunteer moderator Volunteer tester Project scientist Send message
Joined: 15 Aug 05 Posts: 1963 ID: 352 Credit: 6,418,065,418 RAC: 2,742,360
                                      
|
Started with b=1046946 (over b limit) on OpenCL.
>geneferocl-windows.exe -q "1046946^262144+1"
Got it to the end with maxErr.
Trying to resume using Genefer64 made it stop working.
Perhaps OpenCL wrote nonsense to the checkpoint file.
____________
My stats | |
|
Michael Goetz Volunteer moderator Project administrator
 Send message
Joined: 21 Jan 10 Posts: 14043 ID: 53948 Credit: 481,060,436 RAC: 494,925
                               
|
Speed comparisons:
All tests on C2Q Q6600 with GTX 460.
Old 3.1.2 (CUDA 3.2):
genefercuda 3.1.2-0 (Windows 32-bit CUDA)
Copyright 2001-2013, Yves Gallot
Copyright 2009, Mark Rodenkirch, David Underbakke
Copyright 2010-2012, Shoichiro Yamada, Ken Brazier
Copyright 2011-2013, Iain Bethune, Michael Goetz, Ronald Schneider
Command line: genefercuda.exe -b
Generalized Fermat Number Bench
2199064^8192+1 Time: 166 us/mul. Err: 0.2188 51956 digits
1798620^16384+1 Time: 165 us/mul. Err: 0.2188 102481 digits
1471094^32768+1 Time: 213 us/mul. Err: 0.2031 202102 digits
1203210^65536+1 Time: 335 us/mul. Err: 0.2031 398482 digits
984108^131072+1 Time: 594 us/mul. Err: 0.2070 785521 digits
804904^262144+1 Time: 1.05 ms/mul. Err: 0.2031 1548156 digits
658332^524288+1 Time: 1.98 ms/mul. Err: 0.2266 3050541 digits
538452^1048576+1 Time: 3.96 ms/mul. Err: 0.2031 6009544 digits
440400^2097152+1 Time: 8.28 ms/mul. Err: 0.1953 11836006 digits
360204^4194304+1 Time: 16.6 ms/mul. Err: 0.2031 23305854 digits
294612^8388608+1 Time: 34.6 ms/mul. Err: 0.1797 45879398 digits
New 3.2.0 (CUDA 5.5):
genefercuda 3.2.0beta-0 (Windows/CUDA/32-bit)
Copyright 2001-2014, Yves Gallot
Copyright 2009, Mark Rodenkirch, David Underbakke
Copyright 2010-2012, Shoichiro Yamada, Ken Brazier
Copyright 2011-2014, Iain Bethune, Michael Goetz, Ronald Schneider
Command line: genefercuda-windows.exe -b
Generalized Fermat Number Bench
Running benchmarks for transform implementation "CUDA"
2199064^8192+1 Time: 116 us/mul. Err: 0.2188 51956 digits
1798620^16384+1 Time: 148 us/mul. Err: 0.2344 102481 digits
1471094^32768+1 Time: 208 us/mul. Err: 0.2109 202102 digits
1203210^65536+1 Time: 331 us/mul. Err: 0.2188 398482 digits
984108^131072+1 Time: 594 us/mul. Err: 0.2031 785521 digits
804904^262144+1 Time: 1.1 ms/mul. Err: 0.2227 1548156 digits
658332^524288+1 Time: 2.04 ms/mul. Err: 0.2031 3050541 digits
538452^1048576+1 Time: 4.14 ms/mul. Err: 0.1953 6009544 digits
440400^2097152+1 Time: 8.29 ms/mul. Err: 0.2109 11836006 digits
360204^4194304+1 Time: 17.3 ms/mul. Err: 0.1816 23305854 digits
294612^8388608+1 Time: 36.6 ms/mul. Err: 0.1797 45879398 digits
Genefer Mark = 24.
New OpenCL:
geneferocl 3.2.0beta-0 (Windows/OpenCL/32-bit)
Copyright 2001-2014, Yves Gallot
Copyright 2009, Mark Rodenkirch, David Underbakke
Copyright 2010-2012, Shoichiro Yamada, Ken Brazier
Copyright 2011-2014, Iain Bethune, Michael Goetz, Ronald Schneider
Command line: geneferocl-windows.exe -b
Running on platform 'NVIDIA CUDA', device 'GeForce GTX 460', version 'OpenCL 1.1 CUDA' and driver '332.21'.
Generalized Fermat Number Bench
Running benchmarks for transform implementation "OCL"
2199064^8192+1 Time: 87.5 us/mul. Err: 0.2188 51956 digits
1798620^16384+1 Time: 128 us/mul. Err: 0.2344 102481 digits
1471094^32768+1 Time: 168 us/mul. Err: 0.2344 202102 digits
1203210^65536+1 Time: 305 us/mul. Err: 0.2813 398482 digits
984108^131072+1 Time: 556 us/mul. Err: 0.2295 785521 digits
804904^262144+1 Time: 1.13 ms/mul. Err: 0.2188 1548156 digits
658332^524288+1 Time: 2.04 ms/mul. Err: 0.2266 3050541 digits
538452^1048576+1 Time: 4.27 ms/mul. Err: 0.2188 6009544 digits
440400^2097152+1 Time: 8.77 ms/mul. Err: 0.2188 11836006 digits
360204^4194304+1 Time: 18.5 ms/mul. Err: 0.1953 23305854 digits
294612^8388608+1 Time: 39.5 ms/mul. Err: 0.2070 45879398 digits
Genefer Mark = 23.
As you can see, on this video card, the older CUDA 3.2 build is faster than the CUDA 5.5 build. Unfortunately, it's not clear if we'll be able to continue to build more apps with CUDA 3.2. Even if we could, it's unclear what the best strategy would be for providing the fastest app for every GPU. There are a lot of possible configuration permutations.
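To compare two benchmark listings like the ones above without eyeballing every row, something along these lines could be used. It is a sketch in Python that assumes the "Time: ... us/mul." / "Time: ... ms/mul." line format shown in this thread; the function names are hypothetical and not part of any genefer tooling.

```python
import re

# Matches lines such as "804904^262144+1 Time: 1.05 ms/mul. Err: 0.2031 1548156 digits"
BENCH_RE = re.compile(
    r"^\d+\^(?P<n>\d+)\+1\s+Time:\s+(?P<t>[\d.]+)\s+(?P<unit>us|ms)/mul\."
)


def parse_bench(lines):
    """Map exponent N to time per multiplication in microseconds."""
    times = {}
    for line in lines:
        m = BENCH_RE.match(line.strip())
        if m:
            scale = 1000.0 if m.group("unit") == "ms" else 1.0
            times[int(m.group("n"))] = float(m.group("t")) * scale
    return times


def relative_change(old, new):
    """Percent change in time per multiplication for each N in both runs.

    Negative values mean the new build is faster for that N.
    """
    return {n: 100.0 * (new[n] - old[n]) / old[n] for n in old if n in new}
```

Fed the 3.1.2 and 3.2.0 CUDA listings above, this would show the new build ahead at the smallest N and slightly behind from N=262144 upward, which is the pattern described in the post.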
____________
My lucky number is 75898^524288+1 | |
|
|
Thanks Honza,
I'm looking at this - it should work, so there is evidently a bug here…
- Iain
Started with b=1046946 (over b limit) on OpenCL.
>geneferocl-windows.exe -q "1046946^262144+1"
Got it to the end with maxErr.
Trying to resume using Genefer64 made it stop working.
Perhaps OpenCL wrote nonsense to the checkpoint file.
____________
Twitter: IainBethune
Proud member of team "Aggie The Pew". Go Aggie!
3073428256125*2^1290000-1 is Prime! | |
|
|
Hi Honza,
Can you send me the output you got from the restart with genefer? Also, if you have a copy of the checkpoint file, that would be great. I have tried on Linux but have not been able to recreate the problem so far.
For reference, the residues I got for your two tests were:
1052182^262144+1 - 56bc7bb02840dabe
1046946^262144+1 - f4696dd0281b538e
- Iain
Thanks Honza,
I'm looking at this - it should work, so there is evidently a bug here…
- Iain
Started with b=1046946 (over b limit) on OpenCL.
>geneferocl-windows.exe -q "1046946^262144+1"
Got it to the end with maxErr.
Trying to resume using Genefer64 made it stop working.
Perhaps OpenCL wrote nonsense to the checkpoint file.
____________
Twitter: IainBethune
Proud member of team "Aggie The Pew". Go Aggie!
3073428256125*2^1290000-1 is Prime! | |
|
288larsson Volunteer tester
 Send message
Joined: 17 Apr 10 Posts: 136 ID: 58815 Credit: 5,988,457,789 RAC: 3,294,646
                                   
|
Hi, I ran a test on PRPNet and it works: CUDA hands over to AVX.
Hey, I was wrong. In PRPNet, when CUDA or OCL ends with maxErr and hands over to the CPU, the CPU starts from the beginning.
But if I run a stand-alone test and CUDA or OCL ends with maxErr, I can complete it with the CPU. | |
|
Honza Volunteer moderator Volunteer tester Project scientist Send message
Joined: 15 Aug 05 Posts: 1963 ID: 352 Credit: 6,418,065,418 RAC: 2,742,360
                                      
|
Started using OpenCL on R280X, got maxerr
Testing 1046946^262144+1...
Using OCL transform
Starting initialization...
Initialization complete (2.157 seconds).
Estimated total run time for 1046946^262144+1 is 0:31:27
Testing 1046946^262144+1... 65536 steps to go (0:00:23 remaining)
maxErr exceeded for 1046946^262144+1, 0.5000 > 0.4500
maxErr exceeded by all available transform implementations
Trying to resume using Genefer worked this time.
c:\_Honza\xxl>genefer_windows64.exe -q "1046946^262144+1"
genefer 3.2.0beta-0 (Windows/CPU/64-bit)
Supported transform implementations: default sse2 x87 sse3
Command line: genefer_windows64.exe -q 1046946^262144+1
Priority change succeeded.
No relevant benchmark data exists, testing available transform implementations...
Benchmarks completed (53.094 seconds).
Testing 1046946^262144+1...
Using SSE3 transform
Resuming 1046946^262144+1 from a checkpoint (65535 iterations left)
Estimated total run time for 1046946^262144+1 is 12:29:28
1046946^262144+1 is a probable composite. (RES=f4696dd0281b538e) (1578088 digits) (err = 0.3750) (time = 0:38:27)
You can grab OpenCL checkpoint here
____________
My stats | |
|
Honza Volunteer moderator Volunteer tester Project scientist Send message
Joined: 15 Aug 05 Posts: 1963 ID: 352 Credit: 6,418,065,418 RAC: 2,742,360
                                      
|
Out of curiosity, OpenCL resumed on different CPU.
Testing 1046946^262144+1...
Using AVX transform
Resuming 1046946^262144+1 from a checkpoint (65535 iterations left)
Estimated total run time for 1046946^262144+1 is 6:32:33
1046946^262144+1 is a probable composite. (RES=f4696dd0281b538e) (1578088 digits) (err = 0.3750) (time = 0:33:48)
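Since the residue is what decides whether two runs agree, here is a small Python sketch for pulling it out of result lines like the one above and comparing runs. The regular expression is modelled only on the output shown in this thread, and the function names are hypothetical.

```python
import re

RESULT_RE = re.compile(
    r"^(?P<number>\d+\^\d+\+1) is a probable (?P<verdict>prime|composite)\."
    r"(?: \(RES=(?P<res>[0-9a-f]{16})\))?"
)


def parse_result(line):
    """Return (number, verdict, residue or None), or None if no result found."""
    m = RESULT_RE.match(line.strip())
    if m is None:
        return None
    return m.group("number"), m.group("verdict"), m.group("res")


def runs_agree(lines):
    """True if every parsed result line reports the same number, verdict and residue."""
    results = [r for r in map(parse_result, lines) if r is not None]
    return len(results) > 0 and len(set(results)) == 1
```

Matching residues only show that the runs agree with each other. If they all resumed from the same bad checkpoint they could still all be wrong, so a comparison against an independent run is what really counts.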
May try in the PRPNet environment later tonight or the next day if time allows.
____________
My stats | |
|
Honza Volunteer moderator Volunteer tester Project scientist Send message
Joined: 15 Aug 05 Posts: 1963 ID: 352 Credit: 6,418,065,418 RAC: 2,742,360
                                      
|
Not that it was intentional, but while GFN262144 was unavailable, PRPNet took the opportunity to get some work from GFN65536, which is above the b limits for SSE3/SSE2.
This is how it plays out so far...
[2014-02-11 14:46:41 SE(Å”] GFN262144: Returning work to server prpnet.primegrid.com at port 11002
[2014-02-11 14:46:41 SE(Å”] GFN262144: INFO: Test for 1047240^262144+1 was accepted
[2014-02-11 14:46:41 SE(Å”] GFN262144: INFO: All 1 test results were accepted
[2014-02-11 14:47:02 SE(Å”] prpnet.primegrid.com:11002 connect to socket failed with error 10060
[2014-02-11 14:47:03 SE(Å”] GFN65536: Getting work from server prpnet.primegrid.com at port 12003
[2014-02-11 14:47:04 SE(Å”] GFN65536: PRPNet server is version 5.2.8
Generalized Fermat Number Prime Search N=65536
genefer 3.2.0beta-0 (Windows/CPU/64-bit)
Supported transform implementations: default sse2 x87 sse3
Copyright 2001-2014, Yves Gallot
Copyright 2009, Mark Rodenkirch, David Underbakke
Copyright 2010-2012, Shoichiro Yamada, Ken Brazier
Copyright 2011-2014, Iain Bethune, Michael Goetz, Ronald Schneider
Command line: genefer_windows64.exe work_GFN65536.in
Priority change succeeded.
Start test of file 'work_GFN65536.in' - 14:47:04
No relevant benchmark data exists, testing available transform implementations...
Benchmarks completed (9.714 seconds).
Testing 3493178^65536+1...
Using SSE3 transform
Starting initialization...
Initialization complete (0.108 seconds).
Testing 3493178^65536+1... 1424496 steps to go
maxErr exceeded for 3493178^65536+1, 0.4688 > 0.4500
maxErr exceeded while using SSE3; switching to SSE2.
Testing 3493178^65536+1...
Using SSE2 transform
Resuming 3493178^65536+1 from a checkpoint (1424496 iterations left)
maxErr exceeded for 3493178^65536+1, 0.5000 > 0.4500
maxErr exceeded while using SSE2; switching to Default.
Testing 3493178^65536+1...
Using Default transform
Resuming 3493178^65536+1 from a checkpoint (1424496 iterations left)
maxErr exceeded for 3493178^65536+1, 0.5000 > 0.4500
maxErr exceeded while using Default; switching to x87 (80-bit).
Testing 3493178^65536+1...
Using x87 (80-bit) transform
Resuming 3493178^65536+1 from a checkpoint (1424496 iterations left)
Estimated total run time for 3493178^65536+1 is 1:40:49
Successful computation progress with x87 (80-bit); switching back to SSE3.
Testing 3493178^65536+1...
Using SSE3 transform
Resuming 3493178^65536+1 from a checkpoint (1421311 iterations left)
maxErr exceeded for 3493178^65536+1, 0.5000 > 0.4500
maxErr exceeded while using SSE3; switching to SSE2.
Too many errors with SSE3; Calculation will proceed using only more accurate transforms.
Testing 3493178^65536+1...
Using SSE2 transform
Resuming 3493178^65536+1 from a checkpoint (1421311 iterations left)
maxErr exceeded for 3493178^65536+1, 0.5000 > 0.4500
maxErr exceeded while using SSE2; switching to Default.
Too many errors with SSE2; Calculation will proceed using only more accurate transforms.
Testing 3493178^65536+1...
Using Default transform
Resuming 3493178^65536+1 from a checkpoint (1421311 iterations left)
maxErr exceeded for 3493178^65536+1, 0.5000 > 0.4500
maxErr exceeded while using Default; switching to x87 (80-bit).
Too many errors with Default; Calculation will proceed using only more accurate transforms.
Testing 3493178^65536+1...
Using x87 (80-bit) transform
Resuming 3493178^65536+1 from a checkpoint (1421311 iterations left)
Estimated total run time for 3493178^65536+1 is 1:40:41
Testing 3493178^65536+1... 905216 steps to go (1:03:59 remaining)
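The behaviour in the log above can be summarised as: on a maxErr, move to the next more accurate transform; after a stretch of successful progress, go back to the fastest one; and once a transform has failed too often, stop using it. Below is a rough Python sketch of that policy. It is inferred from the log only and is not taken from the genefer source. The class name, method names and the limit of two errors per transform are all assumptions.

```python
class TransformSelector:
    """Fallback policy sketch: transforms are listed fastest first."""

    def __init__(self, transforms, max_errors=2):
        self.transforms = list(transforms)
        self.errors = {t: 0 for t in self.transforms}
        self.max_errors = max_errors  # assumed limit, inferred from the log
        self.current = self.transforms[0]

    def _allowed(self):
        return [t for t in self.transforms if self.errors[t] < self.max_errors]

    def on_max_err(self):
        """Record a round-off failure and switch to a more accurate transform."""
        self.errors[self.current] += 1
        idx = self.transforms.index(self.current)
        for t in self.transforms[idx + 1:]:
            if self.errors[t] < self.max_errors:
                self.current = t
                return t
        raise RuntimeError(
            "maxErr exceeded by all available transform implementations"
        )

    def on_progress(self):
        """After successful progress, return to the fastest transform still allowed."""
        self.current = self._allowed()[0]
        return self.current
```

Replaying the log with transforms ["SSE3", "SSE2", "Default", "x87"] gives the same sequence: three failures down to x87, a switch back to SSE3, three more failures, and then x87 for the rest of the test.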
____________
My stats | |
|
Scott Brown Volunteer moderator Project administrator Volunteer tester Project scientist
 Send message
Joined: 17 Oct 05 Posts: 2420 ID: 1178 Credit: 20,146,962,503 RAC: 22,766,775
                                                
|
...Therefore you see an improvement for the series 600 and 700 with Cuda. Anyway OpenCL produces shorter runtimes if you have the right nVidia gpu. In comparison to Cuda seems to be OpenCL:
- faster for the 600
- even faster for the 700
- equal with minimal improvements for the 500
- slower for the 400 series
...
Interestingly, I just picked up a GTX 660 OEM card (1152 shaders), and the new CUDA is faster than OpenCL. On the GFN524288 port, it runs as follows:
OCL (new or old version is the same): about 4 hours 10 minutes
CUDA (old version): about 4 hours 40 minutes
CUDA (new version): about 4 hours 0 minutes
(all tested with the same work unit)
| |
|
rroonnaalldd Volunteer developer Volunteer tester
 Send message
Joined: 3 Jul 09 Posts: 1213 ID: 42893 Credit: 34,634,263 RAC: 0
                 
|
GTX660 = 960 cuda cores in 5 shader clusters based on GK106
GTX660 TI = 1344 cc in 7 sc based on GK104
GTX660 OEM = 1152 cc in 6 sc based on GK104
Hmm. But your last unit was faster with OCL.
Please can you test some other GFN ports or units?
____________
Best wishes. Knowledge is power. by jjwhalen
| |
|
Scott Brown Volunteer moderator Project administrator Volunteer tester Project scientist
 Send message
Joined: 17 Oct 05 Posts: 2420 ID: 1178 Credit: 20,146,962,503 RAC: 22,766,775
                                                
|
Hmm. But your last unit was faster with OCL.
True, but that was a GTX 645 card in another machine.
Please can you test some other GFN ports or units?
Crazy day at work today, so I can't really do much testing. Anyway, here are the benchmarks for this card:
Old CUDA version
Command line: genefercuda.exe -b
Generalized Fermat Number Bench
2199064^8192+1 Time: 214 us/mul. Err: 0.2344 51956 digits
1798620^16384+1 Time: 250 us/mul. Err: 0.2188 102481 digits
1471094^32768+1 Time: 282 us/mul. Err: 0.2500 202102 digits
1203210^65536+1 Time: 322 us/mul. Err: 0.2352 398482 digits
984108^131072+1 Time: 547 us/mul. Err: 0.2188 785521 digits
804904^262144+1 Time: 889 us/mul. Err: 0.2227 1548156 digits
658332^524288+1 Time: 1.56 ms/mul. Err: 0.2500 3050541 digits
538452^1048576+1 Time: 3.24 ms/mul. Err: 0.2031 6009544 digits
440400^2097152+1 Time: 6.33 ms/mul. Err: 0.2051 11836006 digits
360204^4194304+1 Time: 12.5 ms/mul. Err: 0.2167 23305854 digits
294612^8388608+1 Time: 28.1 ms/mul. Err: 0.1797 45879398 digits
Old OpenCL version
Command line: geneferocl.exe -b
Running on platform 'NVIDIA CUDA', device 'GeForce GTX 660', version 'OpenCL 1.
CUDA' and driver '332.21'.
Generalized Fermat Number Bench
2199064^8192+1 Time: 201 us/mul. Err: 0.2188 51956 digits
1798620^16384+1 Time: 219 us/mul. Err: 0.2344 102481 digits
1471094^32768+1 Time: 220 us/mul. Err: 0.2344 202102 digits
1203210^65536+1 Time: 244 us/mul. Err: 0.2813 398482 digits
984108^131072+1 Time: 400 us/mul. Err: 0.2295 785521 digits
804904^262144+1 Time: 742 us/mul. Err: 0.2188 1548156 digits
658332^524288+1 Time: 1.43 ms/mul. Err: 0.2266 3050541 digits
538452^1048576+1 Time: 2.66 ms/mul. Err: 0.2188 6009544 digits
440400^2097152+1 Time: 5.39 ms/mul. Err: 0.2188 11836006 digits
360204^4194304+1 Time: 11.1 ms/mul. Err: 0.1953 23305854 digits
294612^8388608+1 Time: 25.9 ms/mul. Err: 0.2070 45879398 digits
Genefer Mark = 36.
New OpenCL version
Command line: geneferocl-windows.exe -b
Running on platform 'NVIDIA CUDA', device 'GeForce GTX 660', version 'OpenCL 1.1 CUDA' and driver '332.21'.
Generalized Fermat Number Bench
Running benchmarks for transform implementation "OCL"
2199064^8192+1 Time: 293 us/mul. Err: 0.2188 51956 digits
1798620^16384+1 Time: 291 us/mul. Err: 0.2344 102481 digits
1471094^32768+1 Time: 297 us/mul. Err: 0.2344 202102 digits
1203210^65536+1 Time: 315 us/mul. Err: 0.2813 398482 digits
984108^131072+1 Time: 474 us/mul. Err: 0.2295 785521 digits
804904^262144+1 Time: 869 us/mul. Err: 0.2188 1548156 digits
658332^524288+1 Time: 1.39 ms/mul. Err: 0.2266 3050541 digits
538452^1048576+1 Time: 2.66 ms/mul. Err: 0.2188 6009544 digits
440400^2097152+1 Time: 5.39 ms/mul. Err: 0.2188 11836006 digits
360204^4194304+1 Time: 12.3 ms/mul. Err: 0.1953 23305854 digits
294612^8388608+1 Time: 22.2 ms/mul. Err: 0.2070 45879398 digits
Genefer Mark = 37.
New CUDA version
Command line: genefercuda-windows.exe -b
Generalized Fermat Number Bench
Running benchmarks for transform implementation "CUDA"
2199064^8192+1 Time: 149 us/mul. Err: 0.2344 51956 digits
1798620^16384+1 Time: 145 us/mul. Err: 0.2344 102481 digits
1471094^32768+1 Time: 182 us/mul. Err: 0.2109 202102 digits
1203210^65536+1 Time: 283 us/mul. Err: 0.2188 398482 digits
984108^131072+1 Time: 488 us/mul. Err: 0.2109 785521 digits
804904^262144+1 Time: 781 us/mul. Err: 0.2070 1548156 digits
658332^524288+1 Time: 1.43 ms/mul. Err: 0.2188 3050541 digits
538452^1048576+1 Time: 2.73 ms/mul. Err: 0.2188 6009544 digits
440400^2097152+1 Time: 5.47 ms/mul. Err: 0.2109 11836006 digits
360204^4194304+1 Time: 11.1 ms/mul. Err: 0.2109 23305854 digits
294612^8388608+1 Time: 23.1 ms/mul. Err: 0.1953 45879398 digits
Genefer Mark = 37.
These show the new CUDA a bit faster for some GFN and a bit slower on others. I am at a loss as to why the benchmarks show the OCL faster for GFN524288, but actual work runs just a bit faster on the new CUDA.
| |
|
288larsson Volunteer tester
 Send message
Joined: 17 Apr 10 Posts: 136 ID: 58815 Credit: 5,988,457,789 RAC: 3,294,646
                                   
|
Hi, I tested geneferocl 3.2.0beta-0.
GPU hit maxErr, completed with CPU: 3391432^32768+1 is a probable composite. (RES=e0ec2048f1eee357)
Run on CPU only: 3391432^32768+1 is a probable prime. | |
|
|
Hi everyone,
Thanks for all the testing and info posted so far. I've been a bit swamped with other work at the minute, but I will go back and look over the issues that Honza and 288larsson have raised, hopefully in a few days' time.
Cheers
- Iain
____________
Twitter: IainBethune
Proud member of team "Aggie The Pew". Go Aggie!
3073428256125*2^1290000-1 is Prime! | |
|
|
Unfortunately there is a bug in geneferocl - it passes all the inbuilt tests, but seems to return incorrect residues when close to the error limit. I am currently investigating, but please stop using geneferocl 3.2.0beta-0 until further notice!
The CPU and CUDA versions are unaffected.
Cheers
- Iain
____________
Twitter: IainBethune
Proud member of team "Aggie The Pew". Go Aggie!
3073428256125*2^1290000-1 is Prime! | |
|
|
Hi all,
This is a summary of the bug that was found, and the current status:
* A bug existed since 2013-12-14 which would cause GeneferOCL 3.2.0beta-0 to fail to detect round-off errors (maxErr exceeded) until the end of the calculation. Any checkpoints produced before this point would contain bad data. If the calculation was then restarted using genefer (CPU), the calculation would complete and give an incorrect residue.
* BOINC was not exposed to this bug (if anyone used the new OpenCL code under app_info.xml), since BOINC does not support restarting with the CPU code. Therefore any test which geneferOCL completed successfully is valid.
* On PRPNet, if a test is carried out with GeneferOCL and a round-off error occurs, the test will be restarted with the CPU code (depending on what is in the prpclient.ini), so any tests started with GeneferOCL but completed with the CPU code must be considered invalid.
The code is now fixed, and the new version is called 3.2.0beta2. The checkpoint file has been made incompatible between this and previous versions. To my knowledge and testing so far, this version fixes the issues recently reported by Honza and 288larsson.
Please download the new binaries from SVN (Mac and Windows are ready, Linux is still the old version), and re-run the tests for which you noticed problems. Unfortunately the PRPnet ports need to remain shut for a while until we can determine reliably which tests were affected (and so must be double-checked/re-run), and put a check in place to reject any results from the old beta code.
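As a way of stating the rule from the summary above precisely, here is a short Python sketch of which results would need re-running. The record layout and names are invented for illustration and have nothing to do with how the PRPNet server actually stores results.

```python
from dataclasses import dataclass

AFFECTED_VERSION = "3.2.0beta-0"  # the beta described as affected above


@dataclass
class RunRecord:
    started_with: str   # "OCL", "CUDA" or "CPU" (hypothetical labels)
    finished_with: str  # same labels as started_with
    app_version: str


def needs_recheck(run: RunRecord) -> bool:
    """True for tests started with the affected OpenCL beta and finished on the CPU."""
    return (
        run.app_version == AFFECTED_VERSION
        and run.started_with == "OCL"
        and run.finished_with == "CPU"
    )
```

Tests that ran entirely on one device, or that started with CUDA, come out as not needing a re-check under this rule, in line with the statement that the CPU and CUDA versions are unaffected.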
Cheers
- Iain
____________
Twitter: IainBethune
Proud member of team "Aggie The Pew". Go Aggie!
3073428256125*2^1290000-1 is Prime! | |
|
rroonnaalldd Volunteer developer Volunteer tester
 Send message
Joined: 3 Jul 09 Posts: 1213 ID: 42893 Credit: 34,634,263 RAC: 0
                 
|
All Linux binaries are updated now.
____________
Best wishes. Knowledge is power. by jjwhalen
| |
|
Honza Volunteer moderator Volunteer tester Project scientist Send message
Joined: 15 Aug 05 Posts: 1963 ID: 352 Credit: 6,418,065,418 RAC: 2,742,360
                                      
|
The code is now fixed, and the new version is called 3.2.0beta2. The checkpoint file has been made incompatible between this and previous versions. To my knowledge and testing so far, this version fixes the issues recently reported by Honza and 288larsson.
R280X, Win7 x64.
No MaxErr, actually far from maxErr...and RES=0000000000000000
>geneferocl-windows.exe -q "1046946^262144+1"
geneferocl 3.2.0beta2-0 (Windows/OpenCL/32-bit)
Copyright 2001-2014, Yves Gallot
Copyright 2009, Mark Rodenkirch, David Underbakke
Copyright 2010-2012, Shoichiro Yamada, Ken Brazier
Copyright 2011-2014, Iain Bethune, Michael Goetz, Ronald Schneider
Command line: geneferocl-windows.exe -q 1046946^262144+1
Running on platform 'AMD Accelerated Parallel Processing', device 'Tahiti', version 'OpenCL 1.2 AMD-APP (1348.5)' and driver '1348.5 (VM)'.
Testing 1046946^262144+1...
Using OCL transform
Starting initialization...
Initialization complete (2.079 seconds).
Estimated total run time for 1046946^262144+1 is 0:31:27
1046946^262144+1 is a probable composite. (RES=0000000000000000) (1578088 digits) (err = 0.0001) (time = 0:28:30) 11:26:49
____________
My stats | |
|
|
OK, late night optimism crushed! Will take a look at that one ;)
____________
Twitter: IainBethune
Proud member of team "Aggie The Pew". Go Aggie!
3073428256125*2^1290000-1 is Prime! | |
|
Scott Brown Volunteer moderator Project administrator Volunteer tester Project scientist
 Send message
Joined: 17 Oct 05 Posts: 2420 ID: 1178 Credit: 20,146,962,503 RAC: 22,766,775
                                                
|
Unfortunately the PRPnet ports need to remain shut for a while until we can determine reliably which tests were affected (and so must be double-checked/re-run), and put a check in place to reject any results from the old beta code.
Cheers
- Iain
If it helps, all of my PRPnet genefer work was completed on either CPU or GPU without any switching between the two (and thus, should all be okay).
| |
|
Honza Volunteer moderator Volunteer tester Project scientist Send message
Joined: 15 Aug 05 Posts: 1963 ID: 352 Credit: 6,418,065,418 RAC: 2,742,360
                                      
|
If it helps, all of my PRPnet genefer work was completed on either CPU or GPU without any switching between the two (and thus, should all be okay).
All my PRPNet work was done using CPUs.
(Using GPU only for GFN WR clean-up)
____________
My stats | |
|
Michael Goetz Volunteer moderator Project administrator
 Send message
Joined: 21 Jan 10 Posts: 14043 ID: 53948 Credit: 481,060,436 RAC: 494,925
                               
|
Yes, it helps. I suspect we'll have to re-run a bunch of tests, and information like that means we can assume that some are valid which might have otherwise needed to be run again.
____________
My lucky number is 75898^524288+1 | |
|
|
Everything I ran on prpnet on GFN-524288 in February used the older app (3.1.2.?), with GPU/CUDA. | |
|
|
>geneferocl-windows.exe -q "1046946^262144+1"
geneferocl 3.2.0beta2-0 (Windows/OpenCL/32-bit)
Copyright 2001-2014, Yves Gallot
Copyright 2009, Mark Rodenkirch, David Underbakke
Copyright 2010-2012, Shoichiro Yamada, Ken Brazier
Copyright 2011-2014, Iain Bethune, Michael Goetz, Ronald Schneider
Command line: geneferocl-windows.exe -q 1046946^262144+1
Running on platform 'AMD Accelerated Parallel Processing', device 'Tahiti', version 'OpenCL 1.2 AMD-APP (1348.5)' and driver '1348.5 (VM)'.
Testing 1046946^262144+1...
Using OCL transform
Starting initialization...
Initialization complete (2.079 seconds).
Estimated total run time for 1046946^262144+1 is 0:31:27
1046946^262144+1 is a probable composite. (RES=0000000000000000) (1578088 digits) (err = 0.0001) (time = 0:28:30) 11:26:49
This is a weird one. We have tested that number on Linux/CUDA, Linux/AVX and Windows/CUDA and got the correct residue f4696dd0281b538e. With Linux/OpenCL and Windows/OpenCL it hits a maxErr exceeded about 25% of the way through the test and stops. Both of the OpenCL runs were done on Nvidia cards. I'd be interested if either you or anyone else can recreate this with an AMD card, using the OpenCL app. I will continue to attempt debugging in the meantime.
Cheers
- Iain
____________
Twitter: IainBethune
Proud member of team "Aggie The Pew". Go Aggie!
3073428256125*2^1290000-1 is Prime! | |
|
Honza Volunteer moderator Volunteer tester Project scientist Send message
Joined: 15 Aug 05 Posts: 1963 ID: 352 Credit: 6,418,065,418 RAC: 2,742,360
                                      
|
This is a weird one. We have tested that number on Linux/CUDA, Linux/AVX and Windows/CUDA and got the correct residue f4696dd0281b538e. With Linux/OpenCL and Windows/OpenCL it hits a maxErr exceeded about 25% of the way through the test and stops. Both of the OpenCL runs were done on Nvidia cards. I'd be interested if either you or anyone else can recreate this with an AMD card, using the OpenCL app. I will continue to attempt debugging in the meantime.
Tested numbers was chosen delibeartely. The test on R280X is weird one.
This one is from my other GPU, 7950.
And it behaves as expected.
Perhaps I should re-run on R280X tommorow.
c:\_PG>geneferocl-windows.exe -q "1046946^262144+1"
geneferocl 3.2.0beta2-0 (Windows/OpenCL/32-bit)
Copyright 2001-2014, Yves Gallot
Copyright 2009, Mark Rodenkirch, David Underbakke
Copyright 2010-2012, Shoichiro Yamada, Ken Brazier
Copyright 2011-2014, Iain Bethune, Michael Goetz, Ronald Schneider
Command line: geneferocl-windows.exe -q 1046946^262144+1
Running on platform 'AMD Accelerated Parallel Processing', device 'Tahiti', version 'OpenCL 1.2 AMD-APP (1124.2)' and driver '1124.2 (VM)'.
Testing 1046946^262144+1...
Using OCL transform
Starting initialization...
Initialization complete (1.503 seconds).
Estimated total run time for 1046946^262144+1 is 0:29:52
Testing 1046946^262144+1... 4521984 steps to go (0:25:46 remaining)
maxErr exceeded for 1046946^262144+1, 0.4609 > 0.4500
maxErr exceeded by all available transform implementations
____________
My stats | |
|
Scott Brown Volunteer moderator Project administrator Volunteer tester Project scientist
Joined: 17 Oct 05 Posts: 2420 ID: 1178 Credit: 20,146,962,503 RAC: 22,766,775
|
Yes, it helps. I suspect we'll have to re-run a bunch of tests, and information like that means we can assume that some are valid which might have otherwise needed to be run again.
Is there any reason to keep the two smaller GFN ports closed? Both are well beyond the b limits for anything but Genefer80 (i.e., they error out immediately on all other apps and could not have run on a GPU at all). Or am I missing something in how switching would have occurred on these ports?
| |
|
Michael Goetz Volunteer moderator Project administrator
Joined: 21 Jan 10 Posts: 14043 ID: 53948 Credit: 481,060,436 RAC: 494,925
|
Yes, it helps. I suspect we'll have to re-run a bunch of tests, and information like that means we can assume that some are valid which might have otherwise needed to be run again.
Is there any reason to keep the two smaller GFN ports closed? Both are well beyond the b limits for anything but Genefer80 (i.e., they error out immediately on all other apps and could not have run on a GPU at all). Or am I missing something in how switching would have occurred on these ports?
If you have OCL enabled in prpnet, the client will try to run all GFN with OCL first. That could be a problem on the small ports too.
____________
My lucky number is 75898^524288+1 | |
|
Honza Volunteer moderator Volunteer tester Project scientist
Joined: 15 Aug 05 Posts: 1963 ID: 352 Credit: 6,418,065,418 RAC: 2,742,360
|
Re-ran on the R280X; this time it went as expected.
Not sure what went wrong the other time.
>geneferocl-windows.exe -q "1046946^262144+1"
geneferocl 3.2.0beta2-0 (Windows/OpenCL/32-bit)
Copyright 2001-2014, Yves Gallot
Copyright 2009, Mark Rodenkirch, David Underbakke
Copyright 2010-2012, Shoichiro Yamada, Ken Brazier
Copyright 2011-2014, Iain Bethune, Michael Goetz, Ronald Schneider
Command line: geneferocl-windows.exe -q 1046946^262144+1
Running on platform 'AMD Accelerated Parallel Processing', device 'Tahiti', version 'OpenCL 1.2 AMD-APP (1348.5)' and driver '1348.5 (VM)'.
Testing 1046946^262144+1...
Using OCL transform
Starting initialization...
Initialization complete (2.031 seconds).
Estimated total run time for 1046946^262144+1 is 0:30:03
Testing 1046946^262144+1... 4521984 steps to go (0:25:55 remaining)
maxErr exceeded for 1046946^262144+1, 0.4609 > 0.4500
maxErr exceeded by all available transform implementations
____________
My stats | |
|
Honza Volunteer moderator Volunteer tester Project scientist
Joined: 15 Aug 05 Posts: 1963 ID: 352 Credit: 6,418,065,418 RAC: 2,742,360
|
R280X, good error handling.
Testing 1052182^262144+1...
Using OCL transform
Starting initialization...
Initialization complete (2.047 seconds).
Estimated total run time for 1052182^262144+1 is 0:30:03
Testing 1052182^262144+1... 5177344 steps to go (0:29:41 remaining)
maxErr exceeded for 1052182^262144+1, 0.4531 > 0.4500
maxErr exceeded by all available transform implementations
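For readers unfamiliar with the "maxErr exceeded" message in the logs above: the transforms work in floating point, and every value they produce should land very close to an integer. The Python sketch below only illustrates that kind of check and is not genefer's code. The 0.45 limit is taken from the logs; the function names and sample values are made up for illustration.

```python
MAX_ERR_LIMIT = 0.45  # threshold shown in the genefer logs above


def max_roundoff_error(values):
    """Largest distance between any value and its nearest integer."""
    return max(abs(v - round(v)) for v in values)


def check_step(values, limit=MAX_ERR_LIMIT):
    """Return (ok, err); ok is False when the round-off error exceeds the limit."""
    err = max_roundoff_error(values)
    return err <= limit, err


# Hypothetical transform outputs: the second set has drifted too far from integers.
good = [12.0001, 7.9998, 3.0]
bad = [12.0001, 7.5391, 3.0]
for name, vals in (("good", good), ("bad", bad)):
    ok, err = check_step(vals)
    print(f"{name}: err = {err:.4f} -> {'ok' if ok else 'maxErr exceeded'}")
```

An error close to 0.5 means the rounded result can no longer be trusted, which is why the test stops or switches to a more accurate transform rather than carrying on.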
____________
My stats | |
|
|
Hi Honza, Thanks for those - it is somewhat reassuring that we get the same result with your two later R280X tests. Can you run the number with the stock apps (3.1.2) and check that it reliably errors out...
Cheers
- Iain
____________
Twitter: IainBethune
Proud member of team "Aggie The Pew". Go Aggie!
3073428256125*2^1290000-1 is Prime! | |
|
Honza Volunteer moderator Volunteer tester Project scientist
Joined: 15 Aug 05 Posts: 1963 ID: 352 Credit: 6,418,065,418 RAC: 2,742,360
|
R280X, stock app.
Note that it did error out, but at a different place (steps to go).
>geneferocl -q "1046946^262144+1"
geneferocl 3.1.2-7 (Windows 32-bit OpenCL)
Command line: geneferocl -q 1046946^262144+1
Running on platform 'AMD Accelerated Parallel Processing', device 'Tahiti', version 'OpenCL 1.2 AMD-APP (1348.5)' and driver '1348.5 (VM)'.
Testing 1046946^262144+1...
Starting initialization...
Initialization complete (2.172 seconds).
Estimated total run time for 1046946^262144+1 is 0:31:21
Testing 1046946^262144+1... 4390912 steps to go (0:26:16 remaining)
maxErr exceeded for 1046946^262144+1, 0.4609 > 0.4500
MaxErr exceeded may be caused by overclocking, overheated GPUs and other transient errors.
____________
My stats | |
|
Yves Gallot Volunteer developer Project scientist
Joined: 19 Aug 12 Posts: 843 ID: 164101 Credit: 306,548,482 RAC: 5,409
|
genefer_windows64.exe -x
-x requires a transform string, valid transforms are: default sse2 x87 avx sse3
I don't understand how the "x87" implementation can be available in the 64-bit version, because the x86-64 instruction set has no x87 instructions.
I tried to force x87 mode but it fails:
genefer_windows64.exe -x x87
Command line: genefer_windows64.exe -x x87
Priority change succeeded.
Argument decoding error (this should not happen).
Fatal error (1). Genefer is terminating.
Yves | |
|
|
I don't understand how the "x87" implementation can be available in the 64-bit version, because the x86-64 instruction set has no x87 instructions.
All modern x86-64 processors still carry an x87 unit... e.g. see 'fpu' in flags below. Unless I somehow misunderstand your question? The 64-bit genefer code contains routines which use the x87 instruction set, as well as sse2, sse3, avx (and they are only executed if that instruction set is supported by the CPU).
[ibethune@hydra ~]$ cat /proc/cpuinfo
processor : 0
vendor_id : AuthenticAMD
cpu family : 16
model : 8
model name : AMD Opteron(tm) Processor 4162 EE
stepping : 1
cpu MHz : 1700.018
cache size : 512 KB
physical id : 0
siblings : 6
core id : 0
cpu cores : 6
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 5
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc rep_good nonstop_tsc extd_apicid pni monitor cx16 popcnt lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt nodeid_msr npt lbrv svm_lock nrip_save pausefilter
bogomips : 3400.03
TLB size : 1024 4K pages
clflush size : 64
cache_alignment : 64
address sizes : 48 bits physical, 48 bits virtual
power management: ts ttp tm stc 100mhzsteps hwpstate
The error message you found appears because, if you force a specific instruction set, you also need to tell the code to do something, e.g.
./genefer64 -x x87 -b
./genefer64 -x x87 -q 1234^8192+1
It is probably possible to allow the -x option with the menu-driven interactive mode. I'll add this when time allows.
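The 'fpu' flag check mentioned above can also be scripted. This is a rough Python sketch, assuming a Linux-style /proc/cpuinfo. The mapping from genefer transform names to cpuinfo flag names is an assumption made for this sketch (note that Linux reports SSE3 as 'pni'), and genefer itself may well detect CPU features differently.

```python
# Assumed mapping from genefer transform names to Linux cpuinfo flag names.
# Linux lists SSE3 as "pni"; "default" is taken to need no special flag.
TRANSFORM_FLAGS = {
    "default": None,
    "sse2": "sse2",
    "x87": "fpu",
    "avx": "avx",
    "sse3": "pni",
}


def parse_flags(cpuinfo_text):
    """Return the set of flags from the first 'flags' line of cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            return set(value.split())
    return set()


def supported_transforms(cpuinfo_text):
    """List the transform names whose assumed flag is present."""
    flags = parse_flags(cpuinfo_text)
    return [name for name, flag in TRANSFORM_FLAGS.items()
            if flag is None or flag in flags]


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as f:
            print(supported_transforms(f.read()))
    except OSError:
        print("no /proc/cpuinfo on this system")
```

Fed the Opteron listing above, this would report default, sse2, x87 and sse3 but not avx, since 'avx' is absent from its flags.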
Cheers
- Iain
____________
Twitter: IainBethune
Proud member of team "Aggie The Pew". Go Aggie!
3073428256125*2^1290000-1 is Prime! | |
|
axn Volunteer developer
Joined: 29 Dec 07 Posts: 285 ID: 16874 Credit: 28,027,106 RAC: 0
|
Perhaps this link can explain the confusion?
http://www.virtualdub.org/blog/pivot/entry.php?id=107 | |
|
Yves Gallot Volunteer developer Project scientist
Joined: 19 Aug 12 Posts: 843 ID: 164101 Credit: 306,548,482 RAC: 5,409
|
I don't understand how the "x87" implementation can be available in the 64-bit version, because the x86-64 instruction set has no x87 instructions.
All modern x86-64 processors still carry an x87 unit... e.g. see 'fpu' in flags below. Unless I somehow misunderstand your question?
I thought that the x87 unit was only available to 32-bit programs, not to 64-bit applications. But it works with genefer_windows64, so I was wrong.
The compiler never generates any x87 instructions for x64 targets; only the SSE2 instruction set is used.
Thanks, Yves | |
|
|
From axn's link
Adapting legacy code is also not trivial since you have to rewrite it for 64-bit pointers and to have the correct function prologues and epilogues.
That's exactly what I had to do! I don't know about MSVC, but at least with GCC, you can force the compiler to generate x87 instructions for math code using -mfpmath=387, even when generating 64-bit code.
Cheers
- Iain
____________
Twitter: IainBethune
Proud member of team "Aggie The Pew". Go Aggie!
3073428256125*2^1290000-1 is Prime! | |
|
|
I'm really at a loss to explain Honza's RES=0000000000000000 result. Unless it seems to be repeatable and I can debug it, I assume it's caused by some unexpected glitch in the GPU stack somewhere. We have occasionally seen similar results in BOINC with the 3.1.2 code, so I'm not convinced this is a new problem, just a very rare one we don't understand. And it seems to result in an obviously bad residue which we can detect easily.
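A check for this kind of obviously bad residue could look something like the following Python sketch. It parses the result line format seen in the logs in this thread; the function names and the all-zero rule are illustrative assumptions, not the actual PrimeGrid server logic.

```python
import re

# Matches the "(RES=...)" field of a genefer result line as seen in this thread.
RES_PATTERN = re.compile(r"\(RES=([0-9a-fA-F]{16})\)")


def parse_residue(line):
    """Return the 16-hex-digit residue from a result line, or None if absent."""
    m = RES_PATTERN.search(line)
    return m.group(1).lower() if m else None


def is_suspicious(residue):
    """Flag an all-zero residue, like the one from the R280X run."""
    return residue is not None and set(residue) == {"0"}


line = ("1046946^262144+1 is a probable composite. (RES=0000000000000000) "
        "(1578088 digits) (err = 0.0001) (time = 0:28:30) 11:26:49")
res = parse_residue(line)
print(res, "suspicious" if is_suspicious(res) else "plausible")
# Compare against the residue reported by the other platforms.
print("matches reference:", res == "f4696dd0281b538e")
```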
We hope to upgrade the prpserver on the GFN ports and re-open soon. When that's done we'll let you know and testing can resume.
Cheers
- Iain
____________
Twitter: IainBethune
Proud member of team "Aggie The Pew". Go Aggie!
3073428256125*2^1290000-1 is Prime! | |
|
|
FYI, the ports are now open again, and new prpclient packages are released (http://www.primegrid.com/forum_thread.php?id=1215&nowrap=true#75606) containing genefer 3.2.0-0. Use of the buggy beta code is detected by both the client and server, and will be rejected.
Cheers
- Iain
____________
Twitter: IainBethune
Proud member of team "Aggie The Pew". Go Aggie!
3073428256125*2^1290000-1 is Prime! | |
|
|
I am trying the new genefer64 file:
[2014-04-26 08:13:26 MS] GFN32768: Getting work from server prpnet.primegrid.com at port 12005
[2014-04-26 08:13:28 MS] GFN32768: PRPNet server is version 5.3.0
Generalized Fermat Number Prime Search N=32768
genefer 3.2.0-0 (Windows/CPU/64-bit)
Supported transform implementations: default sse2 x87 avx sse3
Copyright 2001-2014, Yves Gallot
Copyright 2009, Mark Rodenkirch, David Underbakke
Copyright 2010-2012, Shoichiro Yamada, Ken Brazier
Copyright 2011-2014, Iain Bethune, Michael Goetz, Ronald Schneider
Command line: genefer64.exe work_GFN32768.in
Priority change succeeded.
Start test of file 'work_GFN32768.in' - 08:13:28
No relevant benchmark data exists, testing available transform implementations...
Benchmarks completed (2.839 seconds).
Testing 7033526^32768+1...
Using AVX transform
Starting initialization...
Initialization complete (0.025 seconds).
Testing 7033526^32768+1... 745333 steps to go
maxErr exceeded for 7033526^32768+1, 0.4688 > 0.4500
maxErr exceeded while using AVX; switching to SSE3.
Testing 7033526^32768+1...
Using SSE3 transform
Resuming 7033526^32768+1 from a checkpoint (745333 iterations left)
maxErr exceeded for 7033526^32768+1, 0.4688 > 0.4500
maxErr exceeded while using SSE3; switching to SSE2.
Testing 7033526^32768+1...
Using SSE2 transform
Resuming 7033526^32768+1 from a checkpoint (745333 iterations left)
maxErr exceeded for 7033526^32768+1, 0.5000 > 0.4500
maxErr exceeded while using SSE2; switching to Default.
Testing 7033526^32768+1...
Using Default transform
Resuming 7033526^32768+1 from a checkpoint (745333 iterations left)
maxErr exceeded for 7033526^32768+1, 0.4688 > 0.4500
maxErr exceeded while using Default; switching to x87 (80-bit).
Testing 7033526^32768+1...
Using x87 (80-bit) transform
Resuming 7033526^32768+1 from a checkpoint (745333 iterations left)
Estimated time remaining for 7033526^32768+1 is 0:15:37
Testing 7033526^32768+1... 741376 steps to go (0:15:43 remaining)
Successful computation progress with x87 (80-bit); switching back to AVX.
Testing 7033526^32768+1...
Using AVX transform
Resuming 7033526^32768+1 from a checkpoint (741375 iterations left)
maxErr exceeded for 7033526^32768+1, 0.5000 > 0.4500
maxErr exceeded while using AVX; switching to SSE3.
Too many errors with AVX; Calculation will proceed using only more accurate transforms.
Testing 7033526^32768+1...
Using SSE3 transform
Resuming 7033526^32768+1 from a checkpoint (741375 iterations left)
maxErr exceeded for 7033526^32768+1, 1.0000 > 0.4500
maxErr exceeded while using SSE3; switching to SSE2.
Too many errors with SSE3; Calculation will proceed using only more accurate transforms.
Testing 7033526^32768+1...
Using SSE2 transform
Resuming 7033526^32768+1 from a checkpoint (741375 iterations left)
maxErr exceeded for 7033526^32768+1, 0.5000 > 0.4500
maxErr exceeded while using SSE2; switching to Default.
Too many errors with SSE2; Calculation will proceed using only more accurate transforms.
Testing 7033526^32768+1...
Using Default transform
Resuming 7033526^32768+1 from a checkpoint (741375 iterations left)
maxErr exceeded for 7033526^32768+1, 1.0000 > 0.4500
maxErr exceeded while using Default; switching to x87 (80-bit).
Too many errors with Default; Calculation will proceed using only more accurate transforms.
Testing 7033526^32768+1...
Using x87 (80-bit) transform
Resuming 7033526^32768+1 from a checkpoint (741375 iterations left)
Estimated time remaining for 7033526^32768+1 is 0:15:40
Testing 7033526^32768+1... 679936 steps to go (0:19:37 remaining)
Why can't the program run AVX? Are the older genefer.exe, genfx64 and genefer80 no longer needed? | |
|
|
It looks like my returned result was marked invalid. | |
|
Honza Volunteer moderator Volunteer tester Project scientist
Joined: 15 Aug 05 Posts: 1963 ID: 352 Credit: 6,418,065,418 RAC: 2,742,360
|
Why can't the program run AVX? Are the older genefer.exe, genfx64 and genefer80 no longer needed?
It can't run AVX because of the b limit. See the B limit thread.
Since this is an issue with Genefer 3.2.0, I would recommend the Genefer 3.2.0 testing thread.
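The log in question shows the fallback described earlier in this thread: when a transform exceeds maxErr, genefer moves on to another transform with a higher b limit. Below is a simplified Python sketch of that idea only. The order of transforms is copied from the log; the `run_chunk` callback and the b limits are hypothetical, and the real program also checkpoints and later retries the faster transforms, which is omitted here.

```python
# Order in which the log tried the transforms.
TRANSFORM_ORDER = ["AVX", "SSE3", "SSE2", "Default", "x87 (80-bit)"]


class MaxErrExceeded(Exception):
    """Raised by a transform when its round-off error is too large."""


def run_with_fallback(run_chunk, transforms=TRANSFORM_ORDER):
    """Try each transform in turn; return (name, result) for the first that succeeds."""
    for name in transforms:
        try:
            return name, run_chunk(name)
        except MaxErrExceeded:
            print(f"maxErr exceeded while using {name}")
    raise RuntimeError("maxErr exceeded by all available transform implementations")


# Hypothetical b limits, for demonstration only (not genefer's real limits).
FAKE_B_LIMITS = {"AVX": 5_000_000, "SSE3": 5_000_000, "SSE2": 5_000_000,
                 "Default": 5_000_000, "x87 (80-bit)": 50_000_000}


def make_chunk_runner(b):
    """Build a fake run_chunk that fails whenever b is above the fake limit."""
    def run_chunk(name):
        if b > FAKE_B_LIMITS[name]:
            raise MaxErrExceeded(name)
        return "chunk done"
    return run_chunk


print(run_with_fallback(make_chunk_runner(7_033_526)))
```

With these made-up limits, b = 7033526 fails on the first four transforms and completes with the 80-bit x87 one, which mirrors what the log shows.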
____________
My stats | |
|
|
The genefer 3.2.0 CPU app (genefer64) from the latest package, running GFN32768, was not counted at PRPNet. It ran successfully but may have been marked invalid on the server side. | |
|
|
Hi Reb, can you let me know which platform you are using, and post the output from your prpclient run (including the genefer output) so I know exactly which versions you are using. I'll look into this ASAP.
OK, I see now you posted these in the PRPNet Help thread - I'll take a look.
Cheers
- Iain
____________
Twitter: IainBethune
Proud member of team "Aggie The Pew". Go Aggie!
3073428256125*2^1290000-1 is Prime! | |
|
|
Hi Reb, the output from genefer looks to be normal. Can you post the first part which reports the prpclient version. You need to be using prpclient 5.3.0 - did you update the client binary, or it's possible I posted the wrong binary...
Also seeing the end of genefer log and the return of the WU would be useful to fully understand what is going wrong.
Cheers
- Iain
____________
Twitter: IainBethune
Proud member of team "Aggie The Pew". Go Aggie!
3073428256125*2^1290000-1 is Prime! | |
|
|
Hi Reb, the output from genefer looks to be normal. Can you post the first part which reports the prpclient version. You need to be using prpclient 5.3.0 - did you update the client binary, or it's possible I posted the wrong binary...
Also seeing the end of genefer log and the return of the WU would be useful to fully understand what is going wrong.
Cheers
- Iain
I have been using PRPclient 5.3.0 for quite a while, compiled by myself. Here is the output:
[2014-04-26 08:13:25 MS] PRPNet Client application v5.3.0 started
[2014-04-26 08:13:25 MS] User name rebirther at email address is rebirther@web.de
[2014-04-26 08:13:26 MS] GFN32768: Getting work from server prpnet.primegrid.com at port 12005
[2014-04-26 08:13:28 MS] GFN32768: PRPNet server is version 5.3.0
[2014-04-26 08:30:04 MS] GFN32768: 7033526^32768+1 is not prime. Residue 9d0932147ce64c85
[2014-04-26 08:30:04 MS] Total Time: 0:16:39 Total Work Units: 1 Special Results Found: 0
[2014-04-26 08:30:04 MS] GFN32768: Returning work to server prpnet.primegrid.com at port 12005
[2014-04-26 08:30:05 MS] GFN32768: INFO: Test for 7033526^32768+1 was ignored. Your application and/or prpclient is obsolete and MUST be upgraded
[2014-04-26 08:30:05 MS] GFN32768: INFO: All 1 test results were accepted
Ah, I see the last lines now: the test was ignored even though I have the latest version. I'm confused. | |
|
|
Hi Reb,
Do you know what version of the 5.3.0 code you compiled? Specifically, for this to work, you need at least the revision 9 code (from 2014-02-19), and preferably the latest. If you're not sure, please rebuild with the latest and test again.
Cheers
- Iain
____________
Twitter: IainBethune
Proud member of team "Aggie The Pew". Go Aggie!
3073428256125*2^1290000-1 is Prime! | |
|
Michael Goetz Volunteer moderator Project administrator
Joined: 21 Jan 10 Posts: 14043 ID: 53948 Credit: 481,060,436 RAC: 494,925
|
If you compiled any parts yourself, or are using any builds that aren't from the release itself, it might not be the "latest" 5.3.0 software. Perhaps try using both the client and the genefer programs from the release package?
____________
My lucky number is 75898^524288+1 | |
|
|
Hi Reb,
Do you know what version of the 5.3.0 code you compiled? Specifically, for this to work, you need at least the revision 9 code (from 2014-02-19), and preferably the latest. If you're not sure, please rebuild with the latest and test again.
Cheers
- Iain
OK, mine seems to be older. I have now downloaded the latest r11. Will report later.
Edit:
It's working now. I thought the fix that blocks the client was only on the server side, but who cares. Thx for the hints! | |
|
Roger Volunteer developer Volunteer tester
Joined: 27 Nov 11 Posts: 1138 ID: 120786 Credit: 268,668,824 RAC: 0
|
I tried to use the v5.3.0 prpclient from Mersenne Forums in an attempt to get handed oldest WUs first (worked last time with v5.2.8).
I want older WUs as I am on the edge of maxErr for N=19 with OpenCL and older WUs error out less.
The client completed 4 WUs from the leading edge but each time I got:
"INFO: Test for 793258^524288+1 was ignored. Your application and/or prpclient is obsolete and MUST be upgraded"
This is not surprising given the previous posts. (rebirther seems to have linked the latest version now in Mersenne Forums.)
Then after reading these posts I downloaded the prpclient from Primegrid and fired it up.
I am quite surprised to have downloaded a WU from the trailing edge 762516^524288+1.
(Unfortunately I previously crunched this WU so I am expecting maxErr.)
Before the recent batch of WU results was deleted from the servers I recorded the WU status.
GFN32768 completed 7,033,516 previous 7,057,638 untested 76624 previous 74764 => results deleted 1860
GFN65536 completed 3,286,248 previous 3,496,210 untested 43592 previous 15695 => results deleted 27897
GFN262144 completed 1,038,274 previous 1,047,078 untested 27612 previous 26079 => results deleted 1533
GFN524288 completed 762,502 previous 786,402 untested 1193 previous 1181 => results deleted 12
I tried to adjust the "untested" figures for WUs done in the last few days, as I am a bit late to the party, so they should be close.
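The "results deleted" figures in the list above are simply the current untested count minus the previous one. A quick Python check of that arithmetic, using only the numbers listed (the variable names are arbitrary):

```python
# (port, untested now, untested previously), copied from the list above
counts = [
    ("GFN32768", 76624, 74764),
    ("GFN65536", 43592, 15695),
    ("GFN262144", 27612, 26079),
    ("GFN524288", 1193, 1181),
]

# Results deleted per port = untested now - untested previously
deleted = {port: now - before for port, now, before in counts}
for port, n in deleted.items():
    print(f"{port}: {n} results deleted")
print("total:", sum(deleted.values()))
```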
Not many WUs for me to go through in GFN524288, I am thinking most of them I'd previously got maxErr for.
So I am going to have to abandon using OpenCL again for N=19 and head back to N=20. | |
|
Michael Goetz Volunteer moderator Project administrator
Joined: 21 Jan 10 Posts: 14043 ID: 53948 Credit: 481,060,436 RAC: 494,925
|
I am quite surprised to have downloaded a WU from the trailing edge 762516^524288+1.
There were fewer than 20 n=524288 tasks that had to be redone, so it's not surprising they're already sent out (if not completed.)
There's about 2000 each of 32768 and 262144 to be redone, and about 27000 of the 65536 tasks that have to be retested.
____________
My lucky number is 75898^524288+1 | |
|
|
It's working now. I thought the fix that blocks the client was only on the server side, but who cares. Thx for the hints!
The server will block a 5.2.8 client from even downloading any tests (and wasting time). However, it also requires a valid application version number in order to accept the test result, and this was not in the first SVN version of the 5.3.0 code. In any case, at least it's working now.
Cheers
- Iain
____________
Twitter: IainBethune
Proud member of team "Aggie The Pew". Go Aggie!
3073428256125*2^1290000-1 is Prime! | |
|
Michael Goetz Volunteer moderator Project administrator
Joined: 21 Jan 10 Posts: 14043 ID: 53948 Credit: 481,060,436 RAC: 494,925
|
Later today, following |
|