John Honorary cruncher
Joined: 21 Feb 06 Posts: 2875 ID: 2449 Credit: 2,681,934 RAC: 0
|
If anyone would like to get started early building llr v3.8.7dev:
http://www.mersenneforum.org/showthread.php?p=288014#post288014
Support now for 64 bit platforms. It will be good to see time comparisons between 32 and 64 bit AND to see how this affects the AVX builds. NOTE: v3.8.7dev still comes with gwnum v26.5. To build for AVX, you'll need gwnum v27.2 (ftp://mersenne.org/gimps/source272.zip).
UPDATE: gwnum v27.3 (ftp://mersenne.org/gimps/source273.zip)
Please post your builds and any timing results in this thread.
Good Luck!
UPDATED: 18 FEB 2012
llr Non-AVX v3.8.7 w/gwnum v26.5
Windows 32 bit: llr387dev.zip by Jean
Windows 64 bit: cllr387dev-win-x64.7z (VS2005) by Michael or llr64bit-win (VS2010) by Rebirther
Linux 32 bit static: llr387devslinux.zip by Jean or sllr32_dev387.tar.bz2 by rroonnaalldd
Linux 64 bit static: sllr64_dev387.tar.bz2 by rroonnaalldd
MacIntel 32&64 bit: llr387devsrc.7z by Iain
llrAVX v3.8.7 w/gwnum 27.2
Windows 32 bit: llr32bit-AVX-win by Rebirther UPDATED to gwnum 27.3
Windows 64 bit: llr64bit-AVX-win by Rebirther UPDATED to gwnum 27.3
Linux 32 bit static: sllr32_dev387_gwnum273.tar.bz2 by rroonnaalldd UPDATED to gwnum 27.3
Linux 64 bit static: sllr64_dev387_gwnum273.tar.bz2 by rroonnaalldd UPDATED to gwnum 27.3
MacIntel 32&64 bit: llr387devsrc_avx.7z by Iain
Also available...from the AVX build of pfgw thread
pfgwAVX v3.6.0 w/gwnum 27.3
Windows 32 bit: pfgw3.6.0-AVX-32bit by Rebirther
Windows 64 bit: pfgw3.6.0-AVX-64bit by Rebirther
____________
|
|
|
|
I built a 32-bit Linux AVX executable of LLR 3.8.7dev (using gwnum 27.2, of course). Runtimes are indistinguishable from 3.8.6 (any diff is less than 0.1%). I ran various triple and quadruple checks of known primes and non-primes from the command line using prpclient (short tests, like the low port and SGS). All residuals matched vs. the prior version.
I'm attempting to build a 64 bit version but am encountering difficulty. The linker complains about the 32-bit gwnum library, so I'm trying to build a 64-bit version of that, which fails, seemingly wanting a custom-built version of the curl library, at least according to comments in the makefile. Is this effort worth it? Comments on mersenneforum seemed to indicate a speed-up vs. 32-bit, but past history says 32-bit == 64-bit, speed-wise, for LLR.
--Gary
____________
"I am he as you are he as you are me and we are all together"
87*2^3496188+1 is prime! (1052460 digits)
4 is not prime! (1 digit) |
|
|
rogue Volunteer developer
Joined: 8 Sep 07 Posts: 1256 ID: 12001 Credit: 18,565,548 RAC: 0
|
I'm attempting to build a 64 bit version but am encountering difficulty. The linker complains about the 32-bit gwnum library, so I'm trying to build a 64-bit version of that, which fails, seemingly wanting a custom-built version of the curl library, at least according to comments in the makefile. Is this effort worth it? Comments on mersenneforum seemed to indicate a speed-up vs. 32-bit, but past history says 32-bit == 64-bit, speed-wise, for LLR.
Until this new release of llr, llr could only be built as a 32-bit app. Now that a 64-bit llr can be built, it means that a 64-bit build of llr should be faster than a 32-bit build of llr. |
|
|
Crun-chi Volunteer tester
Joined: 25 Nov 09 Posts: 3233 ID: 50683 Credit: 151,443,349 RAC: 73,965
|
Until this new release of llr, llr could only be built as a 32-bit app. Now that a 64-bit llr can be built, it means that a 64-bit build of llr should be faster than a 32-bit build of llr.
Waiting for that LLR :)
Linux or Windows version doesn't matter :)
____________
92*10^1585996-1 NEAR-REPDIGIT PRIME :) :) :)
4 * 650^498101-1 CRUS PRIME
2022202116^131072+1 GENERALIZED FERMAT
Proud member of team Aggie The Pew. Go Aggie! |
|
|
rroonnaalldd Volunteer developer Volunteer tester
Joined: 3 Jul 09 Posts: 1213 ID: 42893 Credit: 34,634,263 RAC: 0
|
Until this new release of llr, llr could only be built as a 32-bit app. Now that a 64-bit llr can be built, it means that a 64-bit build of llr should be faster than a 32-bit build of llr.
Waiting for that LLR :)
Linux or Windows version doesn't matter :)
This will not solve your problem on BD...
Detecting and using a CPU feature (AVX, SSE, MMX, x87) is decided by the gwnum library.
If you want an AVX app, you need the gwnum 27.2 library.
boinc@Lubuntu32:~/Cuda/llrCPU$ ./sllr_dev387_gwnum272 -d -q"30448908048555*2^666666-1"
Starting Lucas Lehmer Riesel prime test of 30448908048555*2^666666-1
Using zero-padded Core2 type-1 FFT length 72K, Pass1=96, Pass2=768
Iter: 6/45, ERROR: ROUND OFF (1) > 0.4
Continuing from last save file.
Disregard last error. Result is reproducible and thus not a hardware problem.
For added safety, redoing iteration using a slower, more reliable method.
Continuing from last save file.
Iter: 6/45, ERROR: ROUND OFF (1) > 0.4
Continuing from last save file.
Unrecoverable error, Restarting with next larger FFT length...
Starting Lucas Lehmer Riesel prime test of 30448908048555*2^666666-1
Using zero-padded Core2 type-3 FFT length 80K, Pass1=320, Pass2=256
Iter: 6/45, ERROR: ROUND OFF (1) > 0.4
Continuing from last save file.
Unrecoverable error, Restarting with next larger FFT length...
Starting Lucas Lehmer Riesel prime test of 30448908048555*2^666666-1
Using zero-padded Core2 type-1 FFT length 84K, Pass1=112, Pass2=768
Iter: 6/45, ERROR: ROUND OFF (1) > 0.4
Continuing from last save file.
Unrecoverable error, Restarting with next larger FFT length...
Starting Lucas Lehmer Riesel prime test of 30448908048555*2^666666-1
Using zero-padded Core2 type-3 FFT length 96K, Pass1=128, Pass2=768
Iter: 6/45, ERROR: ROUND OFF (1) > 0.4
Continuing from last save file.
Unrecoverable error, Restarting with next larger FFT length...
Starting Lucas Lehmer Riesel prime test of 30448908048555*2^666666-1
Using zero-padded Core2 type-3 FFT length 112K, Pass1=448, Pass2=256
Iter: 5/45, ERROR: ROUND OFF (1) > 0.4
Continuing from last save file.
Iter: 5/45, ERROR: ROUND OFF (1) > 0.4
Continuing from last save file.
Unrecoverable error, Restarting with next larger FFT length...
Fatal error at setup : Number sent to gwsetup is too large for the FFTs to handle.
[add]
boinc@Lubuntu32:~/Cuda/llrCPU$ ./sllr_dev387 -d -q"30448908048555*2^666666-1"
Starting Lucas Lehmer Riesel prime test of 30448908048555*2^666666-1
Using zero-padded Core2 type-1 FFT length 72K, Pass1=96, Pass2=768
V1 = 5 ; Computing U0...done.Starting Lucas-Lehmer loop...
30448908048555*2^666666-1, iteration : 10000 / 666666 [1.50%]. Time per iteration : 1.281 ms.
30448908048555*2^666666-1, iteration : 20000 / 666666 [3.00%]. Time per iteration : 1.321 ms.
30448908048555*2^666666-1, iteration : 30000 / 666666 [4.50%]. Time per iteration : 1.256 ms.
^C
Caught signal. Terminating.
____________
Best wishes. Knowledge is power. by jjwhalen
|
|
|
John Honorary cruncher
Joined: 21 Feb 06 Posts: 2875 ID: 2449 Credit: 2,681,934 RAC: 0
|
boinc@Lubuntu32:~/Cuda/llrCPU$ ./sllr_dev387_gwnum272 -d -q"30448908048555*2^666666-1"
...
Unrecoverable error, Restarting with next larger FFT length...
Fatal error at setup : Number sent to gwsetup is too large for the FFTs to handle.
Many posts ago (somewhere) it was suggested that gwnum v27.2 might have problems with the large k in SGS. Therefore, that won't change until v27.2 is updated, regardless of 32 bit vs. 64 bit llr. :)
____________
|
|
|
rogue Volunteer developer
Joined: 8 Sep 07 Posts: 1256 ID: 12001 Credit: 18,565,548 RAC: 0
|
Many posts ago (somewhere) it has been suggested that gwnum v27.2 might have problems with the large k in SGS. Therefore, that won't change until v27.2 is updated...regardless of 32 bit vs. 64 bit llr. :)
I asked George last week and he indicated that it will be a few weeks before gwnum v27.3 is ready. |
|
|
rroonnaalldd Volunteer developer Volunteer tester
Joined: 3 Jul 09 Posts: 1213 ID: 42893 Credit: 34,634,263 RAC: 0
|
Made the comparison 'sllr64_dev387' vs 'sllr_dev387':
boinc@vmware2k-3:~/Cuda/test/llr-apps CPU$ ./sllr64_dev387 -d -q"30*2^400000+1"
Starting Proth prime test of 30*2^400000+1
Using all-complex Core2 type-1 FFT length 24K, Pass1=32, Pass2=768, a = 7
30*2^400000+1, bit: 10000 / 400004 [2.49%]. Time per bit: 0.334 ms.
30*2^400000+1, bit: 20000 / 400004 [4.99%]. Time per bit: 0.328 ms.
30*2^400000+1 is not prime. Proth RES64: 703F8ACAB9634020 Time : 131.398 sec.
boinc@vmware2k-3:~/Cuda/test/llr-apps CPU$ ./sllr_dev387 -d -q"30*2^400000+1"
Starting Proth prime test of 30*2^400000+1
Using all-complex Core2 type-1 FFT length 24K, Pass1=32, Pass2=768, a = 7
30*2^400000+1, bit: 10000 / 400004 [2.49%]. Time per bit: 0.364 ms.
30*2^400000+1, bit: 20000 / 400004 [4.99%]. Time per bit: 0.358 ms.
30*2^400000+1 is not prime. Proth RES64: 703F8ACAB9634020 Time : 143.432 sec.
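For anyone comparing their own runs, the two totals above can be turned into a relative figure with a few lines of Python. This is only a sketch: the function name `percent_faster` is made up for illustration, and the two times are simply copied from the output above.

```python
def percent_faster(baseline_sec: float, candidate_sec: float) -> float:
    """Return how much less wall time the candidate needed, as a percentage
    of the baseline time. A positive value means the candidate was faster."""
    if baseline_sec <= 0:
        raise ValueError("baseline time must be positive")
    return (baseline_sec - candidate_sec) / baseline_sec * 100.0


if __name__ == "__main__":
    # Totals reported above for 30*2^400000+1.
    t32 = 143.432  # sllr_dev387 (32-bit)
    t64 = 131.398  # sllr64_dev387 (64-bit)
    print(f"64-bit needed {percent_faster(t32, t64):.1f}% less time than 32-bit")
```

A single pair of runs says little about run-to-run variation, so the figure is best read as a rough indication.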
____________
Best wishes. Knowledge is power. by jjwhalen
|
|
|
|
3.8.7dev-AVX now available for win32, tested with PPSElow, 38sec on i5-2500k@4Ghz
Download |
|
|
rroonnaalldd Volunteer developer Volunteer tester
Joined: 3 Jul 09 Posts: 1213 ID: 42893 Credit: 34,634,263 RAC: 0
|
3.8.7dev-AVX for Intel and 3.8.7dev available for Linux32 and Linux64:
sllr32_dev387.7z
sllr32_dev387_avx.7z
sllr64_dev387.7z
sllr64_dev387_avx.7z
[add]
non-AVX versions
____________
Best wishes. Knowledge is power. by jjwhalen
|
|
|
|
I ran rroonnaalldd's Linux 64-bit AVX 3.8.7 build vs. the barely-a-month-old 3.8.6 AVX (32-bit) build. Any runtime difference I see is "in the noise"... far less than 1%, and sometimes 3.8.6 was faster. Basically, a tie, so far as I've seen. This is with PPS Low and SGS; all residues matched.
--Gary
____________
"I am he as you are he as you are me and we are all together"
87*2^3496188+1 is prime! (1052460 digits)
4 is not prime! (1 digit) |
|
|
Michael Goetz Volunteer moderator Project administrator
Joined: 21 Jan 10 Posts: 14011 ID: 53948 Credit: 433,209,141 RAC: 977,735
|
It looks like I've been successful in building a 64-bit Windows version of 3.8.7.
So far I've done exactly one test with it, and it seems to be about 10% faster than the 32 bit version.
I'll make it available for testing once I've done a bit more testing with it.
This is NOT an AVX build.
____________
My lucky number is 75898524288+1 |
|
|
Michael Goetz Volunteer moderator Project administrator
Joined: 21 Jan 10 Posts: 14011 ID: 53948 Credit: 433,209,141 RAC: 977,735
|
A Windows 64 bit build of LLR 3.8.7-dev can now be downloaded here.
The source code is unchanged from what Jean Penne distributed in the 3.8.7-dev release.
The distribution file includes the executable (cllr.exe) as well as an app_info.xml file which contains entries for PPS-LLR and TRP-LLR. It should be usable, with the correct entries in app_info, with the other LLR projects. (There's also an entry in that app_info for GeneferCUDA.)
Note that this build of LLR can use the 32-bit Boinc wrapper without penalty. There's no need for a separate 64 bit wrapper.
This should also be usable with PRPNet, although I have not tested it.
This build should be usable on any 64 bit Intel or AMD CPU, although it will NOT make use of AVX instructions. It has only been tested so far on a single Core2Quad CPU, but the code is unchanged and I wouldn't expect there to be problems.
Note that there are some factoring features of LLR that do NOT build in the 64 bit windows version due to the source apparently being lost and only 32 bit object files being available. Therefore, a 32 bit build of LLR has slightly more functionality than the 64 bit build, but I don't believe that affects current projects on PRPNet or Boinc.
____________
My lucky number is 75898524288+1 |
|
|
Honza Volunteer moderator Volunteer tester Project scientist
Joined: 15 Aug 05 Posts: 1957 ID: 352 Credit: 6,139,771,685 RAC: 2,272,291
|
Well, I've tested on an i7-920 with Win 2008 R2 x64 and it's about 10% slower compared to the stock app.
I'll try on i5-2500 running VMware (where AVX version has no use).
On another note - PG folder and slots are growing and growing - over 400MB.
I've manually cleaned and adjusted client_state or old apps can be deleted. I guess there is no need to have several LLR wrappers and several CLLR versions.
Too bad this x64 app can't be packed using UPX. Default app shrinks from 26 to 1MB.
____________
My stats |
|
|
Michael Goetz Volunteer moderator Project administrator
Joined: 21 Jan 10 Posts: 14011 ID: 53948 Credit: 433,209,141 RAC: 977,735
|
Well, I've tested on i7-920 Win 2008 R2 x64 and it's about 10% slower comparing to stock app.
That's definitely unexpected. It's about 12% faster on my computer (C2Q Q6600).
EDIT: regarding sizes, my boinc/www.primegrid.com directory is only about 75 MB, and only 1 file (the gcw sieve file) exists in more than one version.
Boinc IS supposed to delete old versions under certain circumstances. I don't know enough about the mechanics involved to guess why you might have so many obsolete files.
____________
My lucky number is 75898524288+1 |
|
|
rroonnaalldd Volunteer developer Volunteer tester
Joined: 3 Jul 09 Posts: 1213 ID: 42893 Credit: 34,634,263 RAC: 0
|
I'll try on i5-2500 running VMware (where AVX version has no use).
VMware blocks, or rather masks, enhanced CPU features only after manual intervention (EVC).
This is sometimes needed for VMotion between hosts with different CPU features.
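One way to see what a Linux guest (or host) actually exposes is to read the feature flags the kernel reports. The sketch below is an illustration only: it assumes a Linux system with `/proc/cpuinfo`, the helper name `cpu_flags` is invented here, and it shows what the operating system reports rather than reproducing gwnum's own detection logic.

```python
from pathlib import Path


def cpu_flags(cpuinfo_text: str) -> set[str]:
    """Collect the feature flags from the first 'flags' line of cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


if __name__ == "__main__":
    path = Path("/proc/cpuinfo")  # present on Linux only
    if path.exists():
        flags = cpu_flags(path.read_text())
        print("avx reported by the OS:", "avx" in flags)
    else:
        print("/proc/cpuinfo not found; this check only works on Linux")
```

If `avx` is missing inside the VM but present on the host, the hypervisor is masking it; if it is present in both, an AVX build should at least be able to see the feature.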
____________
Best wishes. Knowledge is power. by jjwhalen
|
|
|
Scott Brown Volunteer moderator Project administrator Volunteer tester Project scientist
Joined: 17 Oct 05 Posts: 2392 ID: 1178 Credit: 18,655,586,930 RAC: 6,970,494
|
Well, I've tested on i7-920 Win 2008 R2 x64 and it's about 10% slower comparing to stock app.
That's definitely unexpected. It's about 12% faster on my computer (C2Q Q6600).
Maybe the difference is due to Hyperthreading on the i7, if it is turned on, given that the Q6600 doesn't have HT? LLR does not behave very well with HT in some cases.
____________
141941*2^4299438-1 is prime!
|
|
|
Honza Volunteer moderator Volunteer tester Project scientist
Joined: 15 Aug 05 Posts: 1957 ID: 352 Credit: 6,139,771,685 RAC: 2,272,291
|
Boinc IS supposed to delete old versions under certain circumstances. I don't know enough about the mechanics involved to guess why you might have so many obsolete files.
They still rot in the client_state.xml file. I had to get rid of sieve files a couple of years ago; now it was time to eliminate old app versions.
Slowing down on i7-920 may have something to do with Hyperthreading and x64.
I actually remember some highly-optimized apps actually being slower on server dual-Xeons with HT.
I have left stock app there packed with UPX and using app_info. Sounds quite complicated but it suits well: 8x26MBs vs 8x1.06MB makes a difference.
i5-2500 is faster. ~435 vs 400 secs. Good job.
Will try some server Xeons...
____________
My stats |
|
|
Honza Volunteer moderator Volunteer tester Project scientist
Joined: 15 Aug 05 Posts: 1957 ID: 352 Credit: 6,139,771,685 RAC: 2,272,291
|
VMware blocks or better masks enhanced CPU features only after a manual intervention (EVC).
This is sometimes needed for VMotion between hosts with different CPU features.
Thanks, but I'm running "only" ESX/ESXi without vCenter etc.
____________
My stats |
|
|
Michael Goetz Volunteer moderator Project administrator
Joined: 21 Jan 10 Posts: 14011 ID: 53948 Credit: 433,209,141 RAC: 977,735
|
I suppose the slowdown could be due to more cache misses caused by moving more data around, but 10% seems like an awfully big drop for that -- especially considering that it would be offset by the speed gains from faster math.
If this is what's happening, then people may see wildly different results depending upon which specific CPU they're using AND how many cores are in use.
Which brings me to the next question: Can you repeat that test with all the other cores idle? If it is indeed a cache problem, running just one core may show a dramatic increase in speed relative to the 32 bit stock version.
____________
My lucky number is 75898524288+1 |
|
|
Honza Volunteer moderator Volunteer tester Project scientist
Joined: 15 Aug 05 Posts: 1957 ID: 352 Credit: 6,139,771,685 RAC: 2,272,291
|
Maybe the difference is due to Hyperthreading on the i7 if it is turned on given that the Q6600 doesn't have HT? LLR behaves not very well with HT in some cases.
From experience - most likely.
I run only 3 cores on some quad-core servers, as I found (even) the stock app slowing down when all 4 cores were running.
____________
My stats |
|
|
Honza Volunteer moderator Volunteer tester Project scientist
Joined: 15 Aug 05 Posts: 1957 ID: 352 Credit: 6,139,771,685 RAC: 2,272,291
|
Which brings me to the next question: Can you repeat that test with all the other cores idle? If it is indeed a cache problem, running just one core may show a dramatic increase in speed relative to the 32 bit stock version.
Sure.
cllrx64.exe -q"8661*2^678147+1" -d
8661*2^678147+1 is prime! Time : 556.670 sec.
cllrx86.exe -q"8661*2^678147+1" -d
8661*2^678147+1 is prime! Time : 558.832 sec.
____________
My stats |
|
|
|
Boinc IS supposed to delete old versions under certain circumstances. I don't know enough about the mechanics involved to guess why you might have so many obsolete files.
They still rot in the client_state.xml file. I had to get rid of sieve files a couple of years ago; now it was time to eliminate old app versions.
Slowing down on i7-920 may have something to do with Hyperthreading and x64.
I actually remember some highly-optimized apps actually being slower on server dual-Xeons with HT.
I have left stock app there packed with UPX and using app_info. Sounds quite complicated but it suits well: 8x26MBs vs 8x1.06MB makes a difference.
i5-2500 is faster. ~435 vs 400 secs. Good job.
Will try some server Xeons...
I also have not seen any improvement under Linux on 2 different hosts. Here I used rroonnaalldd's versions on a dual Xeon with HT off but turbo boost on. Here 2 out of 8 available cores were crunching, so the effect of turbo boost should have been minor. I used one of my first primes:
$ time ./sllr32_dev387 -q"7791*2^482045+1" -d
Starting Proth prime test of 7791*2^482045+1
7791*2^482045+1, bit: 10000 / 482057 [2.07%]. Time per bit: 0.446 ms.
7791*2^482045+1, bit: 250000 / 482057 [51.86%]. Time per bit: 0.438 ms.
7791*2^482045+1, bit: 370000 / 482057 [76.75%]. Time per bit: 0.438 ms.
7791*2^482045+1 is prime! Time : 213.479 sec.
real 3m33.629s
user 3m33.015s
sys 0m0.022s
$ time ./sllr64_dev387 -q"7791*2^482045+1" -d
Starting Proth prime test of 7791*2^482045+1
7791*2^482045+1, bit: 10000 / 482057 [2.07%]. Time per bit: 0.443 ms.
7791*2^482045+1, bit: 130000 / 482057 [26.96%]. Time per bit: 0.447 ms.
7791*2^482045+1, bit: 250000 / 482057 [51.86%]. Time per bit: 0.443 ms.
7791*2^482045+1, bit: 370000 / 482057 [76.75%]. Time per bit: 0.449 ms.
7791*2^482045+1 is prime! Time : 214.500 sec.
real 3m34.652s
user 3m34.017s
sys 0m0.015s
Yet same host, different example
$ time ./sllr64_dev387 -d -q"30*2^400000+1"
Starting Proth prime test of 30*2^400000+1
30*2^400000+1, bit: 50000 / 400004 [12.49%]. Time per bit: 0.236 ms.
30*2^400000+1, bit: 250000 / 400004 [62.49%]. Time per bit: 0.233 ms.
30*2^400000+1 is not prime. Proth RES64: 703F8ACAB9634020 Time : 94.046 sec.
real 1m34.182s
user 1m33.909s
sys 0m0.004s
$ time ./sllr32_dev387 -d -q"30*2^400000+1"
Starting Proth prime test of 30*2^400000+1
30*2^400000+1, bit: 40000 / 400004 [9.99%]. Time per bit: 0.237 ms.
30*2^400000+1, bit: 200000 / 400004 [49.99%]. Time per bit: 0.240 ms.
30*2^400000+1 is not prime. Proth RES64: 703F8ACAB9634020 Time : 95.984 sec.
real 1m36.127s
user 1m35.829s
sys 0m0.023s
But another host with HT on but 50% of cores used elsewhere:
$ time ./sllr32_dev387 -d -q"30*2^400000+1"
Starting Proth prime test of 30*2^400000+1
Using all-complex Pentium4 type-1 FFT length 24K, Pass1=96, Pass2=256, a = 7
30*2^400000+1 is not prime. Proth RES64: 703F8ACAB9634020 Time : 107.311 sec.
real 1m47.500s
user 1m46.547s
sys 0m0.055s
$ time ./sllr64_dev387 -d -q"30*2^400000+1"
Starting Proth prime test of 30*2^400000+1
30*2^400000+1 is not prime. Proth RES64: 703F8ACAB9634020 Time : 110.374 sec.
real 1m50.562s
user 1m49.464s
sys 0m0.041s
|
|
|
Michael Goetz Volunteer moderator Project administrator
Joined: 21 Jan 10 Posts: 14011 ID: 53948 Credit: 433,209,141 RAC: 977,735
|
Which brings me to the next question: Can you repeat that test with all the other cores idle? If it is indeed a cache problem, running just one core may show a dramatic increase in speed relative to the 32 bit stock version.
Sure.
cllrx64.exe -q"8661*2^678147+1" -d
8661*2^678147+1 is prime! Time : 556.670 sec.
cllrx86.exe -q"8661*2^678147+1" -d
8661*2^678147+1 is prime! Time : 558.832 sec.
Essentially the same speed with the other cores idle, but 10% slower when other programs are running. Cache usage is one possible cause of that behavior. Except that you're seeing no benefit at all from the 64 bit math. I can't explain that.
Here's my results, with the computer mostly idle (the GPU is running, but the CPU is not)
64bit:
cllr.exe -q"8661*2^678147+1" -d
Starting Proth prime test of 8661*2^678147+1
Using all-complex Core2 type-3 FFT length 64K, Pass1=256, Pass2=256, a = 7
8661*2^678147+1 is prime! Time : 648.074 sec.
32 bit stock app:
cllrx86.exe -q"8661*2^678147+1" -d
Starting Proth prime test of 8661*2^678147+1
Using all-complex Core2 type-3 FFT length 64K, Pass1=256, Pass2=256, a = 7
8661*2^678147+1 is prime! Time : 731.731 sec.
One thing to check that isn't visible in what you posted: make sure both programs are choosing the same FFT type.
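Checking the FFT choice by eye gets tedious across several builds, so here is a small sketch that pulls the "Using ... FFT length ..." line and the final time out of captured output and compares two runs. The function names `summarize` and `same_fft` are invented for this example, and the regular expressions are an assumption based only on the output format pasted in this thread.

```python
import re

# Patterns inferred from the output pasted in this thread (an assumption,
# not an official description of LLR's output format).
FFT_RE = re.compile(r"^Using (?P<fft>.*FFT length.*)$", re.MULTILINE)
TIME_RE = re.compile(r"Time : (?P<sec>\d+(?:\.\d+)?) sec\.")


def summarize(output: str) -> dict:
    """Extract the FFT description and total time from one run's output."""
    fft = FFT_RE.search(output)
    total = TIME_RE.search(output)
    return {
        "fft": fft.group("fft").strip() if fft else None,
        "seconds": float(total.group("sec")) if total else None,
    }


def same_fft(output_a: str, output_b: str) -> bool:
    """True only if both runs printed an FFT line and the lines are identical."""
    a = summarize(output_a)["fft"]
    b = summarize(output_b)["fft"]
    return a is not None and a == b


if __name__ == "__main__":
    run64 = (
        "Starting Proth prime test of 8661*2^678147+1\n"
        "Using all-complex Core2 type-3 FFT length 64K, Pass1=256, Pass2=256, a = 7\n"
        "8661*2^678147+1 is prime! Time : 648.074 sec.\n"
    )
    run32 = run64.replace("648.074", "731.731")
    print(summarize(run64))
    print("same FFT:", same_fft(run64, run32))
```

Comparing the whole line, including Pass1 and Pass2, is deliberate: two runs with the same FFT length but a different CPU type or pass layout are not a like-for-like comparison.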
____________
My lucky number is 75898524288+1 |
|
|
Michael Goetz Volunteer moderator Project administrator
Joined: 21 Jan 10 Posts: 14011 ID: 53948 Credit: 433,209,141 RAC: 977,735
|
Jean Penne must own a DeLorean. I suspect the flux capacitor came from eBay.
Primality Testing of k*b^n+/-1 Program - PC/MacIntel Version 3.8.7
Using new Gwnum library (V26.6) and IBDWT for k's up to 22 bits
(Copyright 1996-2011 Just For Fun Software, Inc.
Author: George Woltman
Email: xxxxxxxxxxxxx)
Written : May 20010 by Jean Penne
Email : xxxxxxxxxxxxx
Use -m on the command line and choose option 7. :)
____________
My lucky number is 75898524288+1 |
|
|
|
One thing to check that isn't visible in what you posted: make sure both programs are choosing the same FFT type.
Same FFT but I am rather suspicious of those Linux binaries.
Under Windows, I do see a slight difference with your binary vs the PRPNet 5.05 Windows binary with one other BOINC Riesel WU running (no GPU WUs). So I am wondering if that is mainly compiler options and OS task scheduling, since it is only about 97% of the time. Really all cores under Windows should be running at 100% to avoid OS issues.
>cllr_64.exe -q"8661*2^678147+1" -d
Starting Proth prime test of 8661*2^678147+1
Using all-complex Pentium4 type-3 FFT length 64K, Pass1=256, Pass2=256, a = 7
8661*2^678147+1 is prime! Time : 475.647 sec.
>llr_prp5.exe -q"8661*2^678147+1" -d
Starting Proth prime test of 8661*2^678147+1
Using all-complex Pentium4 type-3 FFT length 64K, Pass1=256, Pass2=256, a = 7
8661*2^678147+1 is prime! Time : 487.332 sec.
|
|
|
Honza Volunteer moderator Volunteer tester Project scientist
Joined: 15 Aug 05 Posts: 1957 ID: 352 Credit: 6,139,771,685 RAC: 2,272,291
|
One thing to check that isn't visible in what you posted: make sure both programs are choosing the same FFT type.
Yes, I checked that earlier 'cause I had the same idea.
Now I remember I still have two Q9550 and that one is running Win x64, will try.
It has 6MB cache (unlike older Q with 4MB...or i7-920 with 8MB) which proved significant in some cases in the past.
cllrx64.exe -q"8661*2^678147+1" -d
8661*2^678147+1 is prime! Time : 493.501 sec.
cllrx86.exe -q"8661*2^678147+1" -d
8661*2^678147+1 is prime! Time : 558.633 sec.
____________
My stats |
|
|
|
Tested the 64 bit version with Win7 on an E350 processor.
llr Test>cllr -q"3117*2^314958+1" -d
Fatal error at setup : Number sent to gwsetup is too large for the FFTs to handle.
side note: the llr version supplied with the prpclient 5.0.5 produces the same error.
Andy |
|
|
Michael Goetz Volunteer moderator Project administrator
Joined: 21 Jan 10 Posts: 14011 ID: 53948 Credit: 433,209,141 RAC: 977,735
|
Tested the 64 bit version with Win7 on an E350 processor.
llr Test>cllr -q"3117*2^314958+1" -d
Fatal error at setup : Number sent to gwsetup is too large for the FFTs to handle.
side note: the llr version supplied with the prpclient 5.0.5 produces the same error.
Andy
So you're getting the same error with the 32 bit version, right?
____________
My lucky number is 75898524288+1 |
|
|
|
So you're getting the same error with the 32 bit version, right?
Yes, that is right.
|
|
|
Michael Goetz Volunteer moderator Project administrator
Joined: 21 Jan 10 Posts: 14011 ID: 53948 Credit: 433,209,141 RAC: 977,735
|
So you're getting the same error with the 32 bit version, right?
Yes, that is right.
I'm not running 5.0.5, so I'm not sure which version of llr that is. What does "llr -v" show on the stock app? Is it 3.8.7 or 3.8.6? If it's 3.8.7, try running 3.8.6 from PRPNet 5.0.4. If you're having the same problem with 3.8.6 then... Oh, let's step back a bit...
What's an E350?
FWIW, here's that WU on Core2:
C:\GeneferCUDA test\From others\llr 3.8.7 dev\64\llr387devsrc>llr\cllr___Win64_Release\cllr -q"3117*2^314958+1" -d
Starting Proth prime test of 3117*2^314958+1
Using all-complex Core2 type-1 FFT length 24K, Pass1=32, Pass2=768, a = 7
3117*2^314958+1 is not prime. Proth RES64: 4E6DB6FACE0861F5 Time : 120.099 sec.
____________
My lucky number is 75898524288+1 |
|
|
|
A Zacate E350 is a dual-core CPU from AMD with a 1.6 GHz clock, an integrated GPU on chip and a TDP of 18 watts. It's not fast, but it's enough for a home computer. I didn't mention my original post in
http://www.primegrid.com/forum_thread.php?id=3350&nowrap=true#48082
Sorry for that. The stock llr test doesn't run in version 3.8.6 or 3.8.7. BUT the avxllr version DOES run, and the E350 has no AVX built in.
Andy |
|
|
Michael Goetz Volunteer moderator Project administrator
Joined: 21 Jan 10 Posts: 14011 ID: 53948 Credit: 433,209,141 RAC: 977,735
|
A Zacate E350 is a dual-core CPU from AMD with a 1.6 GHz clock, an integrated GPU on chip and a TDP of 18 watts. It's not fast, but it's enough for a home computer. I didn't mention my original post in
http://www.primegrid.com/forum_thread.php?id=3350&nowrap=true#48082
Sorry for that. The stock llr test doesn't run in version 3.8.6 or 3.8.7. BUT the avxllr version DOES run, and the E350 has no AVX built in.
Andy
I'm guessing you can't do BOINC LLR tasks on that either, correct? (At least without using app_info to run the avx build).
Question for Mark: Do you know if George is aware of this problem? If not, what's the best place for reporting this? It would seem to be a gwnum issue.
____________
My lucky number is 75898524288+1 |
|
|
John Honorary cruncher
Joined: 21 Feb 06 Posts: 2875 ID: 2449 Credit: 2,681,934 RAC: 0
|
I'm not running 5.0.5, so I'm not sure which version of llr that is.
The PRPNet Update 5.0.5 thread lists what's in the package. Additionally, the readme files in the programs folder identify what version each build is. :)
____________
|
|
|
rroonnaalldd Volunteer developer Volunteer tester
Joined: 3 Jul 09 Posts: 1213 ID: 42893 Credit: 34,634,263 RAC: 0
|
Rogue posted your answer in message #48539.
"I asked George last week and he indicated that it will be a few weeks before gwnum v27.3 is ready."
____________
Best wishes. Knowledge is power. by jjwhalen
|
|
|
Michael Goetz Volunteer moderator Project administrator
Joined: 21 Jan 10 Posts: 14011 ID: 53948 Credit: 433,209,141 RAC: 977,735
|
Rogue posted your answer in message #48539.
"I asked George last week and he indicated that it will be a few weeks before gwnum v27.3 is ready."
That was referring to a problem with gwnum 27.2 (presumably an AVX build).
LLR 3.8.7 uses gwnum 26.6.
____________
My lucky number is 75898524288+1 |
|
|
|
I have successfully built a 64-bit AVX Windows version now (gwnum 27.2). I cannot see any speed increase with PPSElow or TRP.
@Michael: PM |
|
|
|
I tested my most recently found prime with this new build and compared it with the other builds on an i5-2500K (64-bit):
C:\llr>llravx64.exe -q"7515*2^726237+1"
7515*2^726237+1 is prime! Time : 291.827 sec.
C:\llr>llravx.exe -q"7515*2^726237+1"
7515*2^726237+1 is prime! Time : 289.136 sec.
C:\llr>llr64.exe -q"7515*2^726237+1"
7515*2^726237+1 is prime! Time : 401.899 sec.
C:\llr>llr.exe -q"7515*2^726237+1"
7515*2^726237+1 is prime! Time : 437.396 sec.
No significant difference between the 32/64-bit AVX builds; it may look different at higher numbers. The 64-bit build without AVX seems faster than the 32-bit one.
Regards Odi
____________
|
|
|
Michael Goetz Volunteer moderator Project administrator
Joined: 21 Jan 10 Posts: 14011 ID: 53948 Credit: 433,209,141 RAC: 977,735
|
Jean Penne has just released an updated 3.8.7 that includes a fix for a memory leak.
You can download the Windows 64 bit executable file here.
Please update your version of the cllr.exe if you downloaded the earlier version from yesterday.
____________
My lucky number is 75898524288+1 |
|
|
|
Update 32/64bit with the latest code:
llr32bit-AVX-win
llr64bit-AVX-win
Special thx to Michael for hints! |
|
|
John Honorary cruncher
Joined: 21 Feb 06 Posts: 2875 ID: 2449 Credit: 2,681,934 RAC: 0
|
The first post has been updated with the latest builds available.
____________
|
|
|
rroonnaalldd Volunteer developer Volunteer tester
Joined: 3 Jul 09 Posts: 1213 ID: 42893 Credit: 34,634,263 RAC: 0
|
Update for linux32/64 from the latest code:
sllr32_dev387.tar.bz2
sllr32_dev387_avx.tar.bz2
sllr64_dev387.tar.bz2
sllr64_dev387_avx.tar.bz2
____________
Best wishes. Knowledge is power. by jjwhalen
|
|
|
rroonnaalldd Volunteer developer Volunteer tester
Joined: 3 Jul 09 Posts: 1213 ID: 42893 Credit: 34,634,263 RAC: 0
|
Thomas11 found another memory leak in non-base-2 tests, and Jean fixed it today in his Sunday source...
Updates for linux32/64 from the latest code:
sllr32_dev387.tar.bz2
sllr32_dev387_avx.tar.bz2
sllr64_dev387.tar.bz2
sllr64_dev387_avx.tar.bz2
[add]
all executable files are packed with UPX
____________
Best wishes. Knowledge is power. by jjwhalen
|
|
|
|
Updated 32/64bit with the latest code:
llr32bit-AVX-win
llr64bit-AVX-win
llr64bit-win |
|
|
Michael Goetz Volunteer moderator Project administrator
Joined: 21 Jan 10 Posts: 14011 ID: 53948 Credit: 433,209,141 RAC: 977,735
|
The latest version of the Windows x64 non-AVX build is available, featuring Jean's Sunday code.
This should operate identically to the one Rebirther posted. Mine's built with VS 2005 and his is built with VS 2010, so it might be worth doing a timing test on them. Chances are there's no measurable difference since the important parts are in gwnum, which is precompiled assembler.
____________
My lucky number is 75898^524288+1 |
|
|
Crun-chi Volunteer tester
Joined: 25 Nov 09 Posts: 3233 ID: 50683 Credit: 151,443,349 RAC: 73,965
|
64-bit Windows non-AVX version
In BOINC:
the average went from 535 ----> 460 seconds.
The AVX version also works fine, but both builds (32 and 64 bit) have some kind of startup delay.
With the non-AVX version, when BOINC starts computing, all 4 tasks (on a quad core) start crunching at the same time.
With the AVX version, only one task starts, then after ten seconds the second WU starts, and then after another ten seconds the 3rd and 4th WUs start...
But the crunching time is approximately the same....
____________
92*10^1585996-1 NEAR-REPDIGIT PRIME :) :) :)
4 * 650^498101-1 CRUS PRIME
2022202116^131072+1 GENERALIZED FERMAT
Proud member of team Aggie The Pew. Go Aggie! |
|
|
Crun-chi Volunteer tester
|
The latest version of the Windows x64 non-AVX build is available, featuring Jean's Sunday code.
This should operate identically to the one Rebirther posted. Mine's built with VS 2005 and his is built with VS 2010, so it might be worth doing a timing test on them. Chances are there's no measurable difference since the important parts are in gwnum, which is precompiled assembler.
Crunch time is the same, no speed difference...
____________
92*10^1585996-1 NEAR-REPDIGIT PRIME :) :) :)
4 * 650^498101-1 CRUS PRIME
2022202116^131072+1 GENERALIZED FERMAT
Proud member of team Aggie The Pew. Go Aggie! |
|
|
|
Updated 32/64bit with the latest code:
llr32bit-AVX-win
llr64bit-AVX-win
llr64bit-win
I wonder what is happening as I do not see any difference here. Perhaps just Windows not being consistent.
>llr_reb.exe -q"7515*2^726237+1" -d
Resuming Proth prime test of 7515*2^726237+1 at bit 33761 [4.64%]
Using all-complex Pentium4 type-3 FFT length 64K, Pass1=256, Pass2=256, a = 7
7515*2^726237+1 is prime! Time : 535.005 sec.
>llr_prp5.exe -q"7515*2^726237+1" -d
Starting Proth prime test of 7515*2^726237+1
Using all-complex Pentium4 type-3 FFT length 64K, Pass1=256, Pass2=256, a = 7
7515*2^726237+1 is prime! Time : 535.180 sec.
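To put a number on "no difference", the two console logs above can be compared mechanically. This is only a minimal sketch: it assumes nothing beyond the `Time : N sec.` format visible in these logs, and the function names are illustrative, not part of LLR.

```python
import re

# Matches the timing that LLR printed at the end of each run above,
# e.g. "7515*2^726237+1 is prime! Time : 535.005 sec."
TIME_RE = re.compile(r"Time : ([0-9]+(?:\.[0-9]+)?) sec\.")


def final_time(log_text):
    """Return the last reported time in seconds, or None if the log has none."""
    matches = TIME_RE.findall(log_text)
    return float(matches[-1]) if matches else None


def percent_slower(baseline, other):
    """How much longer `other` took than `baseline`, in percent."""
    return (other - baseline) / baseline * 100.0


log_reb = """Resuming Proth prime test of 7515*2^726237+1 at bit 33761 [4.64%]
7515*2^726237+1 is prime! Time : 535.005 sec."""
log_prp5 = """Starting Proth prime test of 7515*2^726237+1
7515*2^726237+1 is prime! Time : 535.180 sec."""

t_reb = final_time(log_reb)
t_prp5 = final_time(log_prp5)
print(f"llr_reb: {t_reb} s, llr_prp5: {t_prp5} s, "
      f"difference: {percent_slower(t_reb, t_prp5):+.3f}%")
```

Keep in mind that the first run was resumed from bit 33761 while the second started from scratch, so the two totals may not cover exactly the same amount of work.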
|
|
|
rroonnaalldd Volunteer developer Volunteer tester
|
I wonder what is happening as I do not see any difference here. Perhaps just Windows not being consistent.
There is no difference in the computation times.
Both patches only fix memory leaks for base-2 and non-base-2 tests.
____________
Best wishes. Knowledge is power. by jjwhalen
|
|
|
|
I wonder what is happening as I do not see any difference here. Perhaps just Windows not being consistent.
There is no difference in the computation times.
Both patches only fix memory leaks for base-2 and non-base-2 tests.
No! One is an old 32-bit version distributed with prpnet and the other is this "fast" 64-bit version, which is just as slow as the 32-bit version running under exactly the same conditions on the same host! In fact I tend to see the opposite of rogue's statement that a 64-bit build of llr should be faster than a 32-bit build of llr.
|
|
|
rroonnaalldd Volunteer developer Volunteer tester
|
I wonder what is happening as I do not see any difference here. Perhaps just Windows not being consistent.
There is no difference in the computation times.
Both patches only fix memory leaks for base-2 and non-base-2 tests.
No! One is an old 32-bit version distributed with prpnet and the other is this "fast" 64-bit version, which is just as slow as the 32-bit version running under exactly the same conditions on the same host! In fact I tend to see the opposite of rogue's statement that a 64-bit build of llr should be faster than a 32-bit build of llr.
Sorry, my fault.
Are you comparing the AVX or the non-AVX app?
Gary wrote somewhere that he saw no run time differences between 32-bit and 64-bit for the AVX app.
Only the 64-bit non-AVX app seems to be a little faster than its 32-bit counterpart.
____________
Best wishes. Knowledge is power. by jjwhalen
|
|
|
|
Updated 32/64bit with the latest code:
llr32bit-AVX-win
llr64bit-AVX-win
llr64bit-win
I wonder what is happening as I do not see any difference here. Perhaps just Windows not being consistent.
>llr_reb.exe -q"7515*2^726237+1" -d
Resuming Proth prime test of 7515*2^726237+1 at bit 33761 [4.64%]
Using all-complex Pentium4 type-3 FFT length 64K, Pass1=256, Pass2=256, a = 7
7515*2^726237+1 is prime! Time : 535.005 sec.
>llr_prp5.exe -q"7515*2^726237+1" -d
Starting Proth prime test of 7515*2^726237+1
Using all-complex Pentium4 type-3 FFT length 64K, Pass1=256, Pass2=256, a = 7
7515*2^726237+1 is prime! Time : 535.180 sec.
I can confirm there is no difference between AVX 32bit and AVX 64bit
(Files from the first Post UPDATED: 13 FEB 2012)
Windows 64 bit Non-AVX
cllr387dev-win-x64.7z (VS2005) by Michael
P:\()_test\Non-AVX\Windows 64\Michael>cllr -q"7515*2^726237+1" -d
Starting Proth prime test of 7515*2^726237+1
Using all-complex Pentium4 type-3 FFT length 64K, Pass1=256, Pass2=256, a = 7
7515*2^726237+1 is prime! Time : 319.480 sec.
llr64bit-win (VS2010) by Rebirther
P:\()_test\Non-AVX\Windows 64\Rebirther>llr -q"7515*2^726237+1" -d
Starting Proth prime test of 7515*2^726237+1
Using all-complex Pentium4 type-3 FFT length 64K, Pass1=256, Pass2=256, a = 7
7515*2^726237+1 is prime! Time : 319.638 sec.
Windows with AVX
Windows 32 bit llr32bit-AVX-win by Rebirther
P:\()_test\AVX\Windows 32>llr -q"7515*2^726237+1" -d
Starting Proth prime test of 7515*2^726237+1
Using all-complex AVX Core2 type-3 FFT length 64K, Pass1=256, Pass2=256, a = 7
7515*2^726237+1 is prime! Time : 232.644 sec.
Windows 64 bit llr64bit-AVX-win by Rebirther
P:\()_test\AVX\Windows 64>llr -q"7515*2^726237+1" -d
Starting Proth prime test of 7515*2^726237+1
Using all-complex AVX Core2 type-3 FFT length 64K, Pass1=256, Pass2=256, a = 7
7515*2^726237+1 is prime! Time : 232.633 sec.
____________
|
|
|
Michael Goetz Volunteer moderator Project administrator
|
Tested the 64 bit version with Win7 on an E350 processor.
llr Test>cllr -q"3117*2^314958+1" -d
Fatal error at setup : Number sent to gwsetup is too large for the FFTs to handle.
side note: the llr version supplied with the prpclient 5.0.5 produces the same error.
Andy
dl5sfk,
I asked Jean Penne about the problem. He responded here that he's unable to reproduce the error.
If you want to pursue this further, I suggest talking to Jean directly. Unfortunately, there's nothing more I'm able to do to help resolve your problem. If you decide to do that, I would suggest first trying Jean Penne's own 32-bit build of LLR 3.8.7, which can be downloaded HERE. That way you'll both know that the problem isn't caused by something I did building the software.
Good luck!
____________
My lucky number is 75898^524288+1 |
|
|
|
I wonder what is happening as I do not see any difference here. Perhaps just Windows not being consistent.
There is no difference in the computation times.
Both patches only fix memory leaks for base-2 and non-base-2 tests.
No! One is an old 32-bit version distributed with prpnet and the other is this "fast" 64-bit version, which is just as slow as the 32-bit version running under exactly the same conditions on the same host! In fact I tend to see the opposite of rogue's statement that a 64-bit build of llr should be faster than a 32-bit build of llr.
Sorry, my fault.
Are you comparing the AVX or the non-AVX app?
With tongue in cheek, sure I can....
Now where is that tracking number that you sent me for that i7-3960X system?
(Really I am waiting for Ivy Bridge etc.) |
|
|
John Honorary cruncher
|
MacIntel 32&64 bit non-AVX and AVX builds added to first post.
____________
|
|
|
|
dl5sfk,
I asked Jean Penne about the problem. He responded here that he's unable to reproduce the error.
If you want to pursue this further, I suggest talking to Jean directly. Unfortunately, there's nothing more I'm able to do to help resolve your problem. If you decide to do that, I would suggest first trying Jean Penne's own 32-bit build of LLR 3.8.7, which can be downloaded HERE. That way you'll both know that the problem isn't caused by something I did building the software.
Good luck!
Michael Goetz, thank you for your investigation and help. I will test the original version from Jean Penne and eventually try to track the problem on my system.
|
|
|
|
John, the links for MacIntel files on first post are wrong. Please correct them.
____________
|
|
|
John Honorary cruncher
|
gwnum v27.3 has been released. Please see George's comments here. It can be retrieved here: ftp://mersenne.org/gimps/source273.zip
It is unknown at this time whether this resolves the AMD Bulldozer issue.
____________
|
|
|
|
gwnum v27.3 has been released. Please see George's comments here. It can be retrieved here: ftp://mersenne.org/gimps/source273.zip
I came here to post the same thing. :)
But according to George, it's potentially slower on AVX computers... |
|
|
John Honorary cruncher
|
gwnum v27.3 has been released. Please see George's comments here. It can be retrieved here: ftp://mersenne.org/gimps/source273.zip
But according to George, it's potentially slower on AVX computers...
I'm not quite sure where it says that???
An early beta prime95 version 27.3 is available. This version supports 64-bit optimized AVX FFTs. 32-bit AVX FFTs are also a little bit faster. I haven't done full benchmarks so I'm not sure how much faster it is than versions 27.2 or 26.6.
The good/bad news is these FFTs are so fast that they are limited by memory bandwidth -- standard Sandy Bridge CPUs will experience a slow down when running all 4 cores. I'd like to hear from Sandy Bridge-E users to see if they also suffer slow downs when all 4 cores are running.
I don't think he means it will be slower...just that there will be a slowdown when running on 4 cores.
____________
|
|
|
|
I don't think he means it will be slower...just that there will be a slowdown when running on 4 cores.
Hmm, but this is like before, I think. Rebirther tested it some weeks ago with the gwnum 27.2 pfgw-avx build on SR5@prpnet. If he is running all 4 cores, it slows down a lot in comparison to running it on 2 cores simultaneously with 2 cores on another project.
Regards Odi
____________
|
|
|
John Honorary cruncher
|
I don't think he means it will be slower...just that there will be a slowdown when running on 4 cores.
Hmm, but this is like before, I think. Rebirther tested it some weeks ago with the gwnum 27.2 pfgw-avx build on SR5@prpnet. If he is running all 4 cores, it slows down a lot in comparison to running it on 2 cores simultaneously with 2 cores on another project.
Note: this is not a new phenomenon introduced in 27.2 or 27.3. There was a memory bandwidth problem in the past as well. Some of the new hardware overcame it, but now it looks like the envelope is being pushed again. :)
____________
|
|
|
|
An early beta prime95 version 27.3 is available. This version supports 64-bit optimized AVX FFTs. 32-bit AVX FFTs are also a little bit faster. I haven't done full benchmarks so I'm not sure how much faster it is than versions 27.2 or 26.6.
The good/bad news is these FFTs are so fast that they are limited by memory bandwidth -- standard Sandy Bridge CPUs will experience a slow down when running all 4 cores. I'd like to hear from Sandy Bridge-E users to see if they also suffer slow downs when all 4 cores are running.
I don't think he means it will be slower...just that there will be a slowdown when running on 4 cores.
Ah, I probably misunderstood. Well, if someone compiles LLR with the newer gwnum, I'd be happy to test on both i7-2600 and i5-2500. :) |
|
|
Crun-chi Volunteer tester
|
And AMD BD is ready for testing :)
____________
92*10^1585996-1 NEAR-REPDIGIT PRIME :) :) :)
4 * 650^498101-1 CRUS PRIME
2022202116^131072+1 GENERALIZED FERMAT
Proud member of team Aggie The Pew. Go Aggie! |
|
|
John Honorary cruncher
|
Well, if someone compiles LLR with the newer gwnum, I'd be happy to test on both i7-2600 and i5-2500. :)
Here's an "early bird" special. Static Linux v3.8.7 built using gwnum v27.3.
____________
|
|
|
|
llr32avx-win
llr64avx-win
No time to test, pls report any issues!
I will build a new pfgw64avx soon; I need to test larger SR5 candidates to see whether the old problem still exists. Yes, there is a bandwidth problem, so you only see a big difference in runtime between running 2 and 4 cores (40% vs. 10% speed increase --> pfgw).
Edit:
pfgw also built, see other thread! |
|
|
|
Still no joy on Bulldozer. |
|
|
Honza Volunteer moderator Volunteer tester Project scientist
Joined: 15 Aug 05 Posts: 1957 ID: 352 Credit: 6,139,771,685 RAC: 2,272,291
|
Intel i5-2500 stock speed, Win 2008R2 x64, running only one at a time.
Results from Rebirther's recent app, using GWNUM 27.3
llr -q"8661*2^678147+1" -d
8661*2^678147+1 is prime! Time : 234.014 sec.
llr64 -q"8661*2^678147+1" -d
8661*2^678147+1 is prime! Time : 197.413 sec.
pfgw32 -q"8661*2^678147+1" -d
8661*2^678147+1 is 3-PRP! (260.8109s+0.0007s)
pfgw64 -q"8661*2^678147+1" -d
8661*2^678147+1 is 3-PRP! (222.5442s+0.0006s)
LLRAVX is the previous 3.8.6 version
llravx -q"8661*2^678147+1" -d
8661*2^678147+1 is prime! Time : 241.185 sec.
Note: these tests use a 50K FFT; recent live tests from BOINC (n~752k) use FFT length 64K.
____________
My stats |
|
|
Honza Volunteer moderator Volunteer tester Project scientist
|
Same host, several instances of LLRAVX64:
1 instance: 197 secs
2 instances: 197 secs
3 instances: 199 secs
4 instances: ~205 secs
On a side note: I noticed that during the initial phase, the app uses more than a single core. In the end, the BOINC time is a bit lower compared to Task Manager.
____________
My stats |
|
|
rroonnaalldd Volunteer developer Volunteer tester
|
Update for Linux32/64 with gwnum27.3:
sllr32_dev387_gwnum273.tar.bz2
sllr64_dev387_gwnum273.tar.bz2
____________
Best wishes. Knowledge is power. by jjwhalen
|
|
|
John Honorary cruncher
|
Is anyone able to provide LLR builds for the following:
ftp://mersenne.org/gimps/source266.zip
____________
|
|
|
rroonnaalldd Volunteer developer Volunteer tester
|
All apps for Linux are listed at http://primegrid.pytalhost.net/Mirror.htm or you use the following links:
sllr32_386src_gwnum266, no 64bit possible!
sllr32_386devsrc_gwnum266, no 64bit possible!
sllr32_dev387_gwnum266
sllr64_dev387_gwnum266
____________
Best wishes. Knowledge is power. by jjwhalen
|
|
|
John Honorary cruncher
|
All apps for Linux are listed at http://primegrid.pytalhost.net/Mirror.htm or you use the following links:
sllr32_386src_gwnum266, no 64bit possible!
sllr32_386devsrc_gwnum266, no 64bit possible!
sllr32_dev387_gwnum266
sllr64_dev387_gwnum266
Many thanks!!! No further builds are necessary at this time.
____________
|
|
|
|
Hi there,
thanks for the app. It is much quicker on my i7 2500, Win7, than the standard app, but I think there is a bug. It seems that sometimes the two times are interchanged.
Task       Computer  Run time  CPU time  Credit  Application
354000151  221117    536.22s   558.75s   11.82   PPS (LLR)
353992881  222147    324.48s   318.88s   11.82   PPS (LLR)
On my computer 221117 the run time is less than the CPU time; I think that is not OK.
~20% of this computer's results have the same 'problem'.
michael
edit: link to the wu: http://www.primegrid.com/result.php?resultid=354000151
I use this application: llr64bit-win (VS2010) by Rebirther
The client runs 8 PPS tasks and one Collatz or Milkyway task on an ATI GPU (HD5870, CCC 12.1) |
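A quick way to spot results like the first row is to flag every task whose CPU time exceeds its run time. A minimal sketch using the two rows from this post; the tuple layout and function name are assumptions for illustration, and it presumes the tasks are meant to be single-threaded:

```python
# (task id, host id, run time in s, CPU time in s) from the table above.
results = [
    (354000151, 221117, 536.22, 558.75),
    (353992881, 222147, 324.48, 318.88),
]


def cpu_exceeds_runtime(rows, tolerance=0.0):
    """Return task ids whose CPU time is larger than their wall-clock run time."""
    return [task for task, _host, run_s, cpu_s in rows if cpu_s > run_s + tolerance]


suspicious = cpu_exceeds_runtime(results)
print(suspicious)
```

A small positive `tolerance` can be passed to ignore rounding noise in the reported times.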
|
|
|
I found the unofficial 27.4 gwnum and built a new 32-bit AVX version. I don't know what was changed, so please test it with Bulldozer as well.
32bitavx-win |
|
|
|
Whatever was changed in 27.4, it didn't fix the Bulldozer problem. |
|
|
|
Whatever was changed in 27.4, it didn't fix the Bulldozer problem.
ok, thx, so time to wait until official code :/ |
|
|
|
27.4 does fix the bug mentioned in http://www.mersenneforum.org/showpost.php?p=289888&postcount=63 |
|
|
rogue Volunteer developer
Joined: 8 Sep 07 Posts: 1256 ID: 12001 Credit: 18,565,548 RAC: 0
|
Whatever was changed in 27.4, it didn't fix the Bulldozer problem.
George told me that it should be fixed. What problem are you seeing with v27.4? |
|
|
|
Whatever was changed in 27.4, it didn't fix the Bulldozer problem.
George told me that it should be fixed. What problem are you seeing with v27.4?
Every test fails with "Fatal error at setup : Number sent to gwsetup is too large for the FFTs to handle."
|
|
|
rogue Volunteer developer
|
Whatever was changed in 27.4, it didn't fix the Bulldozer problem.
George told me that it should be fixed. What problem are you seeing with v27.4?
Every test fails with "Fatal error at setup : Number sent to gwsetup is too large for the FFTs to handle."
Are you certain that it is a build that is linked with gwnum v27.4? |
|
|
|
Are you certain that it is a build that is linked with gwnum v27.4?
I'm certain. It solves the non-SSE4 bug, and contains the version number 27.4 near the end of the file. |
|
|
rroonnaalldd Volunteer developer Volunteer tester
|
Update for Linux32/64 with gwnum27.4:
sllr32_dev387_gwnum274.tar.bz2
sllr64_dev387_gwnum274.tar.bz2
boinc@vmware2k-3:~/Cuda/llrCPU$ ./sllr64_dev387_gwnum274 -d -q"30448908048555*2^666666-1"
Starting Lucas Lehmer Riesel prime test of 30448908048555*2^666666-1
Using zero-padded Core2 type-1 FFT length 72K, Pass1=96, Pass2=768
30448908048555*2^666666-1, iteration : 10000 / 666666 [1.50%]. Time per iteration : 1.527 ms.
^C
Caught signal. Terminating.
boinc@vmware2k-3:~/Cuda/llrCPU$ ./sllr32_dev387_gwnum274 -d -q"30448908048555*2^666666-1"
Error reading z7896541 intermediate file.
Starting Lucas Lehmer Riesel prime test of 30448908048555*2^666666-1
Using zero-padded Core2 type-1 FFT length 72K, Pass1=96, Pass2=768
30448908048555*2^666666-1, iteration : 10000 / 666666 [1.50%]. Time per iteration : 1.756 ms.
^C
Caught signal. Terminating.
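The two per-iteration figures above can be extrapolated to a full test. This is a back-of-the-envelope sketch only: it assumes the time per iteration stays constant over all 666666 iterations, which a single sample at the 1.50% mark cannot confirm, and the dictionary keys are just labels for the two binaries.

```python
ITERATIONS = 666666  # total shown in the log as "10000 / 666666"

# Time per iteration in milliseconds, as printed at the 1.50% mark.
ms_per_iteration = {"sllr64": 1.527, "sllr32": 1.756}


def projected_seconds(ms, iterations=ITERATIONS):
    """Linear extrapolation of total run time from one per-iteration sample."""
    return ms * iterations / 1000.0


projected = {name: projected_seconds(ms) for name, ms in ms_per_iteration.items()}
ratio = projected["sllr32"] / projected["sllr64"]
for name, seconds in projected.items():
    print(f"{name}: about {seconds:.0f} s")
print(f"32-bit / 64-bit: {ratio:.3f}")
```

Note that both logs report a Core2 FFT rather than an AVX one, so this compares the non-AVX code path of the two builds on that machine.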
____________
Best wishes. Knowledge is power. by jjwhalen
|
|
|
|
I updated the code with official one:
Download |
|
|
rogue Volunteer developer
|
I updated the code with official one:
Download
Wrong thread. |
|
|
|
I updated the code with official one:
Download
Wrong thread.
No, linked to avx versions too. |
|
|
rogue Volunteer developer
|
I updated the code with official one:
Download
Wrong thread.
No, linked to avx versions too.
I had posted to this thread instead of the AVX pfgw thread by mistake. I couldn't delete my post so I edited it. :-) |
|
|
rroonnaalldd Volunteer developer Volunteer tester
|
Update for Linux32/64 with gwnum27.4 source from 09-Mar-2012:
sllr32_dev387_gwnum274.tar.bz2
sllr64_dev387_gwnum274.tar.bz2
____________
Best wishes. Knowledge is power. by jjwhalen
|
|
|
|
32bit vs. 64bit latest win version:
~35s vs. 31s -->i5-2500k@4Ghz |
|
|
|
Jean has released now the official version 3.8.8 with gwnum 27.5 but only in 32bit. |
|
|
|
I've now tried both SoB and PSP with v3.8.7. SoB showed maybe a 10% performance gain. PSP doesn't seem to be any different to stock.
Am I doing something wrong here? i7-2600K with HT on. |
|
|
|
I've now tried both SoB and PSP with v3.8.7. SoB showed maybe a 10% performance gain. PSP doesn't seem to be any different to stock.
Am I doing something wrong here? i7-2600K with HT on.
With HT on, that's OK. The calculations need more CPU cache, so it's much slower than if you run only 3 of 4 cores with SoB or PSP. |
|
|
|
With HT on, that's OK. The calculations need more CPU cache, so it's much slower than if you run only 3 of 4 cores with SoB or PSP.
thanks, so is it best to run 6 or 7 out of 8? |
|
|
|
With HT on, that's OK. The calculations need more CPU cache, so it's much slower than if you run only 3 of 4 cores with SoB or PSP.
thanks, so is it best to run 6 or 7 out of 8?
3-4 without HT on. |
|
|
|
With HT on, that's OK. The calculations need more CPU cache, so it's much slower than if you run only 3 of 4 cores with SoB or PSP.
thanks, so is it best to run 6 or 7 out of 8?
3-4 without HT on.
ok. I'm going to try 7, then 6 then switch ht-off to see which is the best. |
|
|
|
4 x llr and 4 x wu of other project. |
|
|
Michael Goetz Volunteer moderator Project administrator
|
It's going to vary depending on exactly which CPU you have, since CPU cache sizes and configurations vary from CPU to CPU.
Unless you're getting advice from someone with the same CPU, or someone with knowledge of how your CPU performs, take what anyone says with a fair dose of skepticism. What works best for their computer may not work best for yours.
____________
My lucky number is 75898^524288+1 |
|
|
|
Jean has released now the official version 3.8.8 with gwnum 27.5 but only in 32bit.
[noob mode] It also works for BOINC, right? (of course first finish the wu's that are running now!)
And which one would I need? Would that be cllr388 or cllrd388? [/noob mode]
____________
PrimeGrid Challenge Overall standings --- Last update: From Pi to Paddy (2016)
|
|
|
|
Jean has released now the official version 3.8.8 with gwnum 27.5 but only in 32bit.
[noob mode] It also works for BOINC, right? (of course first finish the wu's that are running now!)
And which one would I need? Would that be cllr388 or cllrd388? [/noob mode]
cllr388, the "d" is standing for debug mode |
|
|
|
Unless you're getting advice from someone with the same CPU, or someone with knowledge of how your CPU performs, take what anyone says with a fair dose of skepticism. What works best for their computer may not work best for yours.
That's why I'm now testing different configs. Changing from 8 to 7 processors has seen PPS go from around 600s to 500s per task, so that's already around 5% improvement in overall throughput. |
|
|
|
Yeah, but you're wasting CPU cycles! ;) |
|
|
|
here's what I've found as an average time per task for PPS:
8 cores = 75s
5,6,7 cores = 70-72s
With HT off, 4 cores = 65-70s (not sure why the large variation here compared to the others).
HT off without avx = 90s
My current PSPs will finish later tonight so we'll see how much quicker they run without HT in a couple of days.
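Per-task times alone do not say which setting gets the most work done, so it can help to convert the figures above into tasks per hour. A rough sketch: where the post gives a range the midpoint is used, and the "HT off without avx" figure is assumed to be for 4 cores; both are assumptions, not measurements.

```python
# (label, tasks running at once, seconds per task). Where the post gives a
# range, the midpoint is used; that is an assumption, not a measurement.
configs = [
    ("HT on, 8 threads", 8, 75.0),
    ("HT on, 7 threads", 7, 71.0),
    ("HT on, 6 threads", 6, 71.0),
    ("HT on, 5 threads", 5, 71.0),
    ("HT off, 4 cores, AVX", 4, 67.5),
    ("HT off, 4 cores, no AVX", 4, 90.0),  # core count assumed
]


def tasks_per_hour(concurrent, seconds_per_task):
    """Combined throughput for `concurrent` tasks of equal duration."""
    return concurrent * 3600.0 / seconds_per_task


ranked = sorted(configs, key=lambda c: tasks_per_hour(c[1], c[2]), reverse=True)
for label, n, secs in ranked:
    print(f"{label:24s} {tasks_per_hour(n, secs):6.1f} tasks/hour")
```

On these numbers the slowest per-task setting (all 8 threads) still completes the most tasks per hour, because the small gain per task does not make up for running fewer tasks at once.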
|
|
|
|
When you switch HT off, do it in the BIOS. |
|
|
|
Some PSP tasks have now finished with HT off. With HT on with/without avx they were taking around 305-310Ksec.
With HT off, without avx they're taking around 153Ksec. So perhaps slightly quicker with HT off but I'm not seeing the expected (20-50%) improvements with AVX.
Any thoughts as to why this is?
|
|
|
|
anyone??? |
|
|
|
Some PSP tasks have now finished with HT off. With HT on with/without avx they were taking around 305-310Ksec.
With HT off, without avx they're taking around 153Ksec. So perhaps slightly quicker with HT off but I'm not seeing the expected (20-50%) improvements with AVX.
Any thoughts as to why this is?
The numbers being tested are larger than in PPS, so there is a CPU cache limit (waiting time). As far as I know, with 4/4 cores running SR5 you can expect only a 10% speed increase. With 3 cores + 1 other project it's nearly 50%. |
|
|
|
The numbers being tested are larger than in PPS, so there is a CPU cache limit (waiting time). As far as I know, with 4/4 cores running SR5 you can expect only a 10% speed increase. With 3 cores + 1 other project it's nearly 50%.
but I'm not seeing any speed increase unless the PSP tasks have suddenly increased by 10% since I switched on avx. |
|
|
|
The numbers being tested are larger than in PPS, so there is a CPU cache limit (waiting time). As far as I know, with 4/4 cores running SR5 you can expect only a 10% speed increase. With 3 cores + 1 other project it's nearly 50%.
but I'm not seeing any speed increase unless the PSP tasks have suddenly increased by 10% since I switched on avx.
Try to do a full run on 2 PSP tasks + perhaps 2 WCG. We will see whats happening. I tried the bigger TRP and got around 40% less in time but with 64bit AVX. |
|
|
|
Try to do a full run on 2 PSP tasks + perhaps 2 WCG. We will see whats happening. I tried the bigger TRP and got around 40% less in time but with 64bit AVX.
not sure what WCG is. I only run primegrid on this PC so the only option will be to switch it to 50% cores. I'll do that for a test but it sounds like in my situation it's not going to give me an overall improvement.
On another note, I've now discovered that my other (non-avx) PCs are slightly faster running with HT on.
|
|
|
|
Try to do a full run on 2 PSP tasks + perhaps 2 WCG. We will see whats happening. I tried the bigger TRP and got around 40% less in time but with 64bit AVX.
not sure what WCG is. I only run primegrid on this PC so the only option will be to switch it to 50% cores. I'll do that for a test but it sounds like in my situation it's not going to give me an overall improvement.
On another note, I've now discovered that my other (non-avx) PCs are slightly faster running with HT on.
This is not a good option. WCG=World Community Grid. Better to have another project too. You can also try PPSElow on PRPnet with 4-8 cores or PPS with PG on 4 cores (test later with 8) |
|
|
|
This is not a good option.
for who? I like primegrid so that's where I'm putting my resources.
You can also try PPSElow on PRPnet with 4-8 cores or PPS with PG on 4 cores
will do that again later but as long as there's badges to be had I'd like to complete the set so for now I'm running PSP and SoB on my fastest pc.
|
|
|
|
This is not a good option.
for who? I like primegrid so that's where I'm putting my resources.
I meant switched it to 50% only. I like PG too but only special subprojects with an end goal and solution. |
|
|
|
with 3 cores PSP is around 20% faster but obviously that means overall throughput is down. |
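The trade-off can be checked with a little arithmetic. A minimal sketch, assuming "20% faster" means each task's run time drops by 20%; the function names are illustrative only:

```python
def relative_throughput(active_cores, total_cores, time_reduction):
    """Throughput of `active_cores` faster tasks relative to `total_cores`
    tasks at the original speed. `time_reduction` is the fraction by which
    each task's run time shrinks (0.20 for "20% faster")."""
    return (active_cores / (1.0 - time_reduction)) / total_cores


def break_even_time_reduction(active_cores, total_cores):
    """Run-time reduction per task needed for equal overall throughput."""
    return 1.0 - active_cores / total_cores


print(f"{relative_throughput(3, 4, 0.20):.4f}")
print(f"{break_even_time_reduction(3, 4):.2f}")
```

Under that reading, 3 cores at 20% less time per task deliver a little under 94% of the 4-core throughput, and each task would need to finish in 25% less time before idling the fourth core pays off for PrimeGrid work alone.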
|
|
rroonnaalldd Volunteer developer Volunteer tester
|
This slowdown is caused by a bandwidth limitation of the memory controller and can only be solved either by a triple/quad channel memory interface or by memory with higher bandwidth (XDR, DDR4 etc).
____________
Best wishes. Knowledge is power. by jjwhalen
|
|
|
|
This slowdown is caused by a bandwidth limitation of the memory controller and can only be solved either by a triple/quad channel memory interface or by memory with higher bandwidth (XDR, DDR4 etc).
so what you're saying is that I really need an i7 3930 or 3960? Tempting... but not right now ;-) |
|
|