John Honorary cruncher
Joined: 21 Feb 06 Posts: 2875 ID: 2449 Credit: 2,681,934 RAC: 0
Exciting News!!!
Over in the PST forum, Ken and the crew have been working on a GPU application for sieving. The program is called ppsieve and is currently being used on the PPSE sieve. If all goes well, we may be able to merge the PPS and PPSE sieves, bringing PPSE into BOINC.
Currently available for 32- and 64-bit Linux AND 32-bit Windows (the 32-bit build will run on 64-bit Windows). It should work on cards of any compute capability. You can download it here:
ppsieve-cuda.zip (source)
To test, please use the following command line:
./ppsieve-cuda-boinc-(version) -p42070e9 -P42070030e6 -k 1201 -K 9999 -N 2000000 -c 60
It should output the following factors:
Range: 42070e9 to 42070030e6
42070000070587 | 9475*2^197534+1
42070000198537 | 3373*2^1046686+1
42070003101727 | 4207*2^1054290+1
42070003511309 | 6057*2^1043547+1
42070006307657 | 1513*2^1771812+1
42070006388603 | 2059*2^1816098+1
42070007177519 | 5437*2^1121592+1
42070007396759 | 7339*2^1803518+1
42070008823897 | 4639*2^952018+1
42070008858187 | 2893*2^317690+1
42070010190569 | 5625*2^1903125+1
42070011430123 | 3821*2^1406279+1
42070012301263 | 1957*2^1185814+1
42070013521999 | 1965*2^404493+1
42070013970587 | 7143*2^1462422+1
42070013989247 | 5037*2^838603+1
42070017332953 | 6237*2^1916994+1
42070018235321 | 1941*2^363948+1
42070019542387 | 8587*2^1703626+1
42070023987581 | 9811*2^318944+1
42070024339237 | 9257*2^1170495+1
42070024532551 | 4311*2^1690093+1
42070024936837 | 5679*2^1726142+1
42070024995961 | 9111*2^1707153+1
42070026021997 | 4039*2^1819590+1
42070027452199 | 1323*2^854008+1
42070029006583 | 5943*2^663870+1
Found 27 factors
Please provide as many details about your system as possible.
Thank you for testing!
p.s. If you wish to test the CUDA time vs. CPU time, you can download the CPU build here: ppsieve-bin.zip (source)
Just run the same test range and then compare the results.
Other sample test cases:
Range: 20070e9 to 20070010e6
20070000475957 | 4995*2^1822738+1
20070001146497 | 4977*2^626298+1
20070001163929 | 3765*2^461308+1
20070001302811 | 7669*2^725426+1
20070001425977 | 5821*2^1775248+1
20070002245151 | 1221*2^646983+1
20070002606341 | 4809*2^497683+1
20070004816819 | 6699*2^1215561+1
20070005914001 | 9847*2^1634140+1
20070006187837 | 9923*2^287853+1
20070006875981 | 1645*2^965954+1
20070007170259 | 3889*2^49730+1
20070008329039 | 9065*2^832569+1
Found 13 factors
Range: 249871e9 to 2498711e8
249871003789289 | 6295*2^266404+1
249871009510013 | 2771*2^1272671+1
249871010360639 | 1743*2^1337710+1
249871027030549 | 8865*2^1534637+1
249871030776329 | 7815*2^1679937+1
249871032591751 | 2335*2^23512+1
249871038523049 | 7527*2^204096+1
249871049497963 | 6497*2^505399+1
249871066947839 | 8497*2^1221770+1
249871068167599 | 7311*2^450531+1
249871089712009 | 9281*2^1650023+1
249871091913587 | 2139*2^1290902+1
249871099624639 | 8381*2^350375+1
Found 13 factors
Range: 42070e9 to 42070100e6
Found 68 factors
____________
tocx Volunteer tester
Joined: 23 Nov 09 Posts: 15 ID: 50535 Credit: 203,523,000 RAC: 0
System: Debian GNU Linux (Squeeze), Kernel 2.6.32 AMD64
Intel i5-750
GeForce 9500 GT (silent)
Nvidia Driver Ver.: 190.53
Cuda-Toolkit: 2.3 Ubuntu 9.04
Running AP26 on all 4 Cores, no other boinc-based GPU-apps running
GPU-Temperature changes from 38°C to 40°C during the sieve runs
./ppsieve-cuda-x86_64-linux -p42070e9 -P42070003e6 -k 1201 -K 9999 -N 2000000 -z normal
ppsieve version cuda-0.1.1-beta (testing)
Compiled Mar 7 2010 with GCC 4.3.3
nstart=76, nstep=32, gpu_nstep=32
ppsieve initialized: 1201 <= k <= 9999, 76 <= n <= 2000000
Sieve started: 42070000000000 <= p < 42070003000000
Starting 1 threads.
Detected GPU 0: GeForce 9500 GT
Detected compute capability: 1.1
Detected 4 multiprocessors.
42070000070587 | 9475*2^197534+1
42070000198537 | 3373*2^1046686+1
42070000300049 | 9139*2^461846+1
42070000345343 | 1715*2^635711+1
42070000464001 | 4179*2^1577462+1
42070000949861 | 4707*2^571847+1
42070001011573 | 7113*2^215532+1
42070001040127 | 6471*2^37907+1
42070002482267 | 9951*2^1920408+1
42070002690167 | 2553*2^1888870+1
42070002698543 | 4239*2^368773+1
42070002875941 | 4081*2^1494668+1
Thread 0 completed
Waiting for threads to exit
Sieve complete: 42070000000000 <= p < 42070003000000
Found 12 factors
count=95668,sum=0x37dacb7121ccffe4
Elapsed time: 24.00 sec. (0.01 init + 23.99 sieve) at 131122 p/sec.
Processor time: 3.42 sec. (0.02 init + 3.40 sieve) at 925156 p/sec.
Average processor utilization: 1.14 (init), 0.14 (sieve)
____________
System: OpenSuSE 11.2 Kernel 2.6.31.12 amd64
CPU: Core 2 Quad Q9550 (E0 stepping)
GPU: GeForce GTX 260-192 - NVIDIA driver version: 190.53 (cuda 2.3 support)
Test run with the CPU idling and the GPU at stock clock:
./ppsieve-cuda-x86_64-linux -p42070e9 -P42070003e6 -k 1201 -K 9999 -N 2000000 -z normal
ppsieve version cuda-0.1.1-beta (testing)
Compiled Mar 7 2010 with GCC 4.3.3
nstart=76, nstep=32, gpu_nstep=32
ppsieve initialized: 1201 <= k <= 9999, 76 <= n <= 2000000
Sieve started: 42070000000000 <= p < 42070003000000
Starting 1 threads.
Detected GPU 0: GeForce GTX 260
Detected compute capability: 1.3
Detected 24 multiprocessors.
42070000070587 | 9475*2^197534+1
42070000198537 | 3373*2^1046686+1
42070000300049 | 9139*2^461846+1
42070000345343 | 1715*2^635711+1
42070000464001 | 4179*2^1577462+1
42070000949861 | 4707*2^571847+1
42070001011573 | 7113*2^215532+1
42070001040127 | 6471*2^37907+1
42070002482267 | 9951*2^1920408+1
42070002690167 | 2553*2^1888870+1
42070002698543 | 4239*2^368773+1
42070002875941 | 4081*2^1494668+1
Thread 0 completed
Waiting for threads to exit
Sieve complete: 42070000000000 <= p < 42070003000000
Found 12 factors
count=95668,sum=0x37dacb7121ccffe4
Elapsed time: 5.81 sec. (0.02 init + 5.80 sieve) at 542504 p/sec.
Processor time: 1.35 sec. (0.02 init + 1.33 sieve) at 2363791 p/sec.
Average processor utilization: 1.10 (init), 0.23 (sieve)
Test run with the CPU idling and the GPU at 667 MHz (shaders linked):
./ppsieve-cuda-x86_64-linux -p42070e9 -P42070003e6 -k 1201 -K 9999 -N 2000000 -z normal
ppsieve version cuda-0.1.1-beta (testing)
Compiled Mar 7 2010 with GCC 4.3.3
nstart=76, nstep=32, gpu_nstep=32
ppsieve initialized: 1201 <= k <= 9999, 76 <= n <= 2000000
Sieve started: 42070000000000 <= p < 42070003000000
Starting 1 threads.
Detected GPU 0: GeForce GTX 260
Detected compute capability: 1.3
Detected 24 multiprocessors.
42070000070587 | 9475*2^197534+1
42070000198537 | 3373*2^1046686+1
42070000300049 | 9139*2^461846+1
42070000345343 | 1715*2^635711+1
42070000464001 | 4179*2^1577462+1
42070000949861 | 4707*2^571847+1
42070001011573 | 7113*2^215532+1
42070001040127 | 6471*2^37907+1
42070002482267 | 9951*2^1920408+1
42070002690167 | 2553*2^1888870+1
42070002698543 | 4239*2^368773+1
42070002875941 | 4081*2^1494668+1
Thread 0 completed
Waiting for threads to exit
Sieve complete: 42070000000000 <= p < 42070003000000
Found 12 factors
count=95668,sum=0x37dacb7121ccffe4
Elapsed time: 4.98 sec. (0.02 init + 4.96 sieve) at 634269 p/sec.
Processor time: 1.22 sec. (0.02 init + 1.20 sieve) at 2617475 p/sec.
Average processor utilization: 0.95 (init), 0.24 (sieve)
To give an impression of the current speeds, here are the runtimes for the test range on a Q9550 @ 3.4 GHz (FSB 400 × 8.5), using the linux-x86_64 version of the ppsieve CPU application.
Running only 1 thread:
ppsieve version 0.3.4a (testing)
Compiled Feb 19 2010 with GCC 4.3.3
Algorithm not specified, starting benchmark...
bsf takes 350000; mul takes 490000; using standard algorithm.
nstart=1999980, nstep=32
ppsieve initialized: 1201 <= k <= 9999, 76 <= n <= 2000000
Sieve started: 42070000000000 <= p < 42070003000000
Thread 0 starting
Thread 0 completed
Waiting for threads to exit
Sieve complete: 42070000000000 <= p < 42070003000000
Found 8 factors
count=95668,sum=0x37dacb7121ccffe4
Elapsed time: 16.44 sec. (0.90 init + 15.54 sieve) at 193956 p/sec.
Processor time: 16.41 sec. (0.86 init + 15.55 sieve) at 193898 p/sec.
Average processor utilization: 0.95 (init), 1.00 (sieve)
Running 4 threads:
ppsieve version 0.3.4a (testing)
Compiled Feb 19 2010 with GCC 4.3.3
Algorithm not specified, starting benchmark...
bsf takes 350000; mul takes 540000; using standard algorithm.
nstart=1999980, nstep=32
ppsieve initialized: 1201 <= k <= 9999, 76 <= n <= 2000000
Sieve started: 42070000000000 <= p < 42070003000000
Thread 0 starting
Thread 1 starting
Thread 2 starting
Thread 3 starting
Thread 3 completed
Waiting for threads to exit
Thread 0 completed
Thread 1 completed
Thread 2 completed
Sieve complete: 42070000000000 <= p < 42070003000000
Found 8 factors
count=95668,sum=0x37dacb7121ccffe4
Elapsed time: 4.94 sec. (0.91 init + 4.03 sieve) at 748919 p/sec.
Processor time: 16.11 sec. (0.91 init + 15.21 sieve) at 198232 p/sec.
Average processor utilization: 0.99 (init), 3.78 (sieve)
---
The "Elapsed time" on a GTX 260-192 @ 667 MHz is nearly the same as on the Q9550 @ 3.4 GHz. So the throughput of the GTX 260-192 at stock clock is roughly the same as that of all 4 cores of my Q9550 at stock clock. This ratio also applies to the current AP26 apps (1.01 (cuda23) vs 1.04).
Ken_g6 Volunteer developer
Joined: 4 Jul 06 Posts: 921 ID: 3110 Credit: 218,950,902 RAC: 3,798
Sieve complete: 42070000000000 <= p < 42070003000000
Found 8 factors
Whoops! It looks like I left some flags in the ppconfig.txt file for the CPU version of PPSieve that was on my site, including one that forces sieving for Riesel numbers instead of Proth.
The timing information here is fine, but before doing anything in the PPSE sieve, anyone getting "-1"s in their results file should either download what I just uploaded, or edit ppconfig.txt to remove the "riesel" line.
OK, now that's over with, about the GPU testing...
3M is actually a small test range for the GPU code. I use it because it has good known factors, and because I don't have a compute-capable GPU and the emulator's really slow! But a 30M range or something would probably be better for speed comparison, if you have a minute or four:
./ppsieve-cuda-x86_64-linux -p42070e9 -P42070030e6 -k 1201 -K 9999 -N 2000000 -z normal -q
Also, I haven't tried the 32-bit app at all! I'd be interested to know if it works!
____________
64 bit GPU app - Test range 30M
./ppsieve-cuda-x86_64-linux -p42070e9 -P42070030e6 -k 1201 -K 9999 -N 2000000 -z normal -q
My GTX 260-192 at stock clock with no load on the CPU:
ppsieve version cuda-0.1.1-beta (testing)
Compiled Mar 7 2010 with GCC 4.3.3
nstart=76, nstep=32, gpu_nstep=32
ppsieve initialized: 1201 <= k <= 9999, 76 <= n <= 2000000
Sieve started: 42070000000000 <= p < 42070030000000
Starting 1 threads.
Detected GPU 0: GeForce GTX 260
Detected compute capability: 1.3
Detected 24 multiprocessors.
Thread 0 completed
Waiting for threads to exit
Sieve complete: 42070000000000 <= p < 42070030000000
Found 97 factors
count=955289,sum=0x2dbc17167afb6a8d
Elapsed time: 49.30 sec. (0.02 init + 49.29 sieve) at 611635 p/sec.
Processor time: 3.85 sec. (0.02 init + 3.83 sieve) at 7866200 p/sec.
Average processor utilization: 1.04 (init), 0.08 (sieve)
and again at 667 MHz with no load on the CPU (shaders linked):
ppsieve version cuda-0.1.1-beta (testing)
Compiled Mar 7 2010 with GCC 4.3.3
nstart=76, nstep=32, gpu_nstep=32
ppsieve initialized: 1201 <= k <= 9999, 76 <= n <= 2000000
Sieve started: 42070000000000 <= p < 42070030000000
Starting 1 threads.
Detected GPU 0: GeForce GTX 260
Detected compute capability: 1.3
Detected 24 multiprocessors.
Thread 0 completed
Waiting for threads to exit
Sieve complete: 42070000000000 <= p < 42070030000000
Found 97 factors
count=955289,sum=0x2dbc17167afb6a8d
Elapsed time: 42.05 sec. (0.02 init + 42.02 sieve) at 717374 p/sec.
Processor time: 3.72 sec. (0.02 init + 3.70 sieve) at 8142356 p/sec.
Average processor utilization: 0.82 (init), 0.09 (sieve)
32 bit GPU app - Test range 30M
./ppsieve-cuda-x86-linux -p42070e9 -P42070030e6 -k 1201 -K 9999 -N 2000000 -z normal -q
My GTX 260-192 at stock clock with no load on the CPU:
ppsieve version cuda-0.1.1-beta (testing)
Compiled Mar 7 2010 with GCC 4.3.3
nstart=76, nstep=32, gpu_nstep=32
ppsieve initialized: 1201 <= k <= 9999, 76 <= n <= 2000000
Sieve started: 42070000000000 <= p < 42070030000000
Starting 1 threads.
Detected GPU 0: GeForce GTX 260
Detected compute capability: 1.3
Detected 24 multiprocessors.
Thread 0 completed
Waiting for threads to exit
Sieve complete: 42070000000000 <= p < 42070030000000
Found 97 factors
count=955289,sum=0x2dbc17167afb6a8d
Elapsed time: 49.59 sec. (0.02 init + 49.56 sieve) at 608232 p/sec.
Processor time: 4.10 sec. (0.02 init + 4.07 sieve) at 7399053 p/sec.
Average processor utilization: 1.11 (init), 0.08 (sieve)
and again at 667 MHz with no load on the CPU (shaders linked):
ppsieve version cuda-0.1.1-beta (testing)
Compiled Mar 7 2010 with GCC 4.3.3
nstart=76, nstep=32, gpu_nstep=32
ppsieve initialized: 1201 <= k <= 9999, 76 <= n <= 2000000
Sieve started: 42070000000000 <= p < 42070030000000
Starting 1 threads.
Detected GPU 0: GeForce GTX 260
Detected compute capability: 1.3
Detected 24 multiprocessors.
Thread 0 completed
Waiting for threads to exit
Sieve complete: 42070000000000 <= p < 42070030000000
Found 97 factors
count=955289,sum=0x2dbc17167afb6a8d
Elapsed time: 42.28 sec. (0.02 init + 42.26 sieve) at 713325 p/sec.
Processor time: 3.96 sec. (0.02 init + 3.94 sieve) at 7648693 p/sec.
Average processor utilization: 1.07 (init), 0.09 (sieve)
64 bit CPU app - Test range 30M
1 thread - Same CPU and clock rate as stated above:
./ppsieve-x86_64-linux -p42070e9 -P42070030e6 -k 1201 -K 9999 -N 2000000 -z normal -q
ppsieve version 0.3.4a (testing)
Compiled Feb 19 2010 with GCC 4.3.3
Algorithm not specified, starting benchmark...
bsf takes 350000; mul takes 500000; using standard algorithm.
nstart=1999980, nstep=32
ppsieve initialized: 1201 <= k <= 9999, 76 <= n <= 2000000
Sieve started: 42070000000000 <= p < 42070030000000
Thread 0 starting
p=42070023724033, 196.6K p/sec, 1.00 CPU cores, 79.1% done. ETA 09 Mar 18:11
Thread 0 completed
Waiting for threads to exit
Sieve complete: 42070000000000 <= p < 42070030000000
Found 97 factors
count=955289,sum=0x2dbc17167afb6a8d
Elapsed time: 153.56 sec. (0.87 init + 152.69 sieve) at 196576 p/sec.
Processor time: 153.24 sec. (0.87 init + 152.37 sieve) at 196995 p/sec.
Average processor utilization: 1.00 (init), 1.00 (sieve)
4 threads - Same CPU and clock rate as stated above:
ppsieve version 0.3.4a (testing)
Compiled Feb 19 2010 with GCC 4.3.3
Algorithm not specified, starting benchmark...
bsf takes 350000; mul takes 520000; using standard algorithm.
nstart=1999980, nstep=32
ppsieve initialized: 1201 <= k <= 9999, 76 <= n <= 2000000
Sieve started: 42070000000000 <= p < 42070030000000
Thread 1 starting
Thread 0 starting
Thread 3 starting
Thread 2 starting
Thread 3 completed
Waiting for threads to exit
Thread 1 completed
Thread 0 completed
Thread 2 completed
Sieve complete: 42070000000000 <= p < 42070030000000
Found 97 factors
count=955289,sum=0x2dbc17167afb6a8d
Elapsed time: 39.77 sec. (0.90 init + 38.88 sieve) at 772082 p/sec.
Processor time: 152.13 sec. (0.89 init + 151.24 sieve) at 198467 p/sec.
Average processor utilization: 0.99 (init), 3.89 (sieve)
64 bit
System: Archlinux, Kernel 2.6.32-ARCH
Intel Xeon CPU X3360 @ 3.4GHz (C1 stepping)
GeForce GTX 285
Nvidia Driver Version: 190.53
Cuda-Toolkit: 2.3
Running TRP-Sieve on all 4 Cores, no other boinc-based GPU-apps running
./ppsieve-cuda-x86_64-linux -p42070e9 -P42070003e6 -k 1201 -K 9999 -N 2000000 -z normal
ppsieve version cuda-0.1.1-beta (testing)
Compiled Mar 7 2010 with GCC 4.3.3
nstart=76, nstep=32, gpu_nstep=32
ppsieve initialized: 1201 <= k <= 9999, 76 <= n <= 2000000
Sieve started: 42070000000000 <= p < 42070003000000
Starting 1 threads.
Detected GPU 0: GeForce GTX 285
Detected compute capability: 1.3
Detected 30 multiprocessors.
42070000070587 | 9475*2^197534+1
42070000198537 | 3373*2^1046686+1
42070000300049 | 9139*2^461846+1
42070000345343 | 1715*2^635711+1
42070000464001 | 4179*2^1577462+1
42070000949861 | 4707*2^571847+1
42070001011573 | 7113*2^215532+1
42070001040127 | 6471*2^37907+1
42070002482267 | 9951*2^1920408+1
42070002690167 | 2553*2^1888870+1
42070002698543 | 4239*2^368773+1
42070002875941 | 4081*2^1494668+1
Thread 0 completed
Waiting for threads to exit
Sieve complete: 42070000000000 <= p < 42070003000000
Found 12 factors
count=95668,sum=0x37dacb7121ccffe4
Elapsed time: 3.60 sec. (0.02 init + 3.58 sieve) at 878148 p/sec.
Processor time: 0.99 sec. (0.01 init + 0.97 sieve) at 3232123 p/sec.
Average processor utilization: 0.68 (init), 0.27 (sieve)
____________
HAmsty Volunteer tester
Joined: 26 Dec 08 Posts: 132 ID: 33421 Credit: 12,510,712 RAC: 0
does this version work for 1.0 cards?
____________
does this version work for 1.0 cards?
I find no checks for a specific compute capability (1.0, 1.1 or 1.3) in the source code. You can download the zip file with the binaries in it (< 100 K) and simply start the 32 or the 64 bit binary with the commands John gave in his post. The test range is very short.
Or, of course, we could simply read John's initial post. I overlooked it too:
Currently available for 32 & 64 bit Linux. It should work on cards with any compute capability. You can download it here:
____________
HAmsty Volunteer tester
Joined: 26 Dec 08 Posts: 132 ID: 33421 Credit: 12,510,712 RAC: 0
Oh, sorry, I missed that too. :-(
____________
samuel7 Volunteer tester
Joined: 1 May 09 Posts: 89 ID: 39425 Credit: 257,425,010 RAC: 0
Ubuntu 9.10, kernel 2.6.31-20-generic
Core2 Quad Q9550
GeForce 9800 GT, NVIDIA driver 190.42
This 30M range was run with the CPU cores busy on TRP sieve. Another run with the cores idle was just as fast (as expected).
./ppsieve-cuda-x86_64-linux -p42070e9 -P42070030e6 -k 1201 -K 9999 -N 2000000 -z normal -q
ppsieve version cuda-0.1.1-beta (testing)
Compiled Mar 7 2010 with GCC 4.3.3
nstart=76, nstep=32, gpu_nstep=32
ppsieve initialized: 1201 <= k <= 9999, 76 <= n <= 2000000
Sieve started: 42070000000000 <= p < 42070030000000
Starting 1 threads.
Detected GPU 0: GeForce 9800 GT
Detected compute capability: 1.1
Detected 14 multiprocessors.
p=42070029884417, 498.0K p/sec, 0.14 CPU cores, 99.6% done. ETA 13 Mar 17:49
Thread 0 completed
Waiting for threads to exit
Sieve complete: 42070000000000 <= p < 42070030000000
Found 97 factors
count=955289,sum=0x2dbc17167afb6a8d
Elapsed time: 60.52 sec. (0.01 init + 60.51 sieve) at 498248 p/sec.
Processor time: 9.42 sec. (0.01 init + 9.41 sieve) at 3203673 p/sec.
Average processor utilization: 0.71 (init), 0.16 (sieve)
____________
Michael Goetz Volunteer moderator Project administrator
Joined: 21 Jan 10 Posts: 13633 ID: 53948 Credit: 280,904,358 RAC: 40,710
Has anyone given any thought to what's going to happen when this goes live via BOINC?
In particular, consider how I've got PrimeGrid set up, which is probably a fairly common arrangement for people with CUDA capable GPUs:
1) I've got a quad-core CPU and a GTX280 GPU.
2) I run AP26 on the GPU
3) I run other PrimeGrid stuff on the CPU
4) I do not want to run AP26 on the CPU because I can run it much, much faster on the GPU (about 5 min/WU).
5) There's no explicit BOINC mechanism to say "run X on the CPU and Y on the GPU", unless PrimeGrid makes two separate sub-projects, "AP26-CPU" and "AP26-GPU".
6) So, I have ONLY the CPU tasks selected on the project preferences page to feed the right tasks to the CPU,
7) ... and "Send work from any subproject..." to send AP26 to the GPU. This works because nothing else exists for the GPU, so it has to send AP26 tasks.
As soon as there's more than one GPU project, there will be no way of selecting what you want to run on the GPU, unless you also allow those projects to run on the CPU (which I think most people would prefer not to do).
One possible solution is to make separate sub-projects for the GPU versions, but I realize that's far from ideal.
____________
My lucky number is 75898^524288+1
Vato Volunteer tester
Joined: 2 Feb 08 Posts: 796 ID: 18447 Credit: 382,504,347 RAC: 225,569
I think it would be a mistake to build too much complexity into local bespoke code.
We need the BOINC client to do the right thing with GPU scheduling, and the BOINC server to allow per-subproject CPU/GPU preferences. We should try to feed those requirements into the mainstream BOINC development process and then make use of them when released.
All IMHO of course, and somewhat idealistic, since BOINC development seems to follow whatever direction the Berkeley folks are meandering in at a particular moment in time...
____________
Michael Goetz Volunteer moderator Project administrator
Joined: 21 Jan 10 Posts: 13633 ID: 53948 Credit: 280,904,358 RAC: 40,710
I intentionally avoided the whole concept of "well, the BOINC software really *should* do this..." for the obvious reasons, primarily that we're going to have this problem in the near future, whereas the BOINC client might get around to solving it anywhere from tomorrow to never. They've got much bigger GPU scheduling problems to solve before they could get to this one.
____________
My lucky number is 75898^524288+1
samuel7 Volunteer tester
Joined: 1 May 09 Posts: 89 ID: 39425 Credit: 257,425,010 RAC: 0
As soon as there's more than one GPU project, there will be no way of selecting what you want to run on the GPU ...
The anonymous platform mechanism lets you control exactly what you want to run. I deployed it today on the Linux side of my quad and it even picked up existing tasks correctly.
Obviously, this isn't a solution for the masses, but it works for me.
____________
Michael Goetz Volunteer moderator Project administrator
Joined: 21 Jan 10 Posts: 13633 ID: 53948 Credit: 280,904,358 RAC: 40,710
As soon as there's more than one GPU project, there will be no way of selecting what you want to run on the GPU ...
The anonymous platform mechanism lets you control exactly what you want to run. I deployed it today on the Linux side of my quad and it even picked up existing tasks correctly.
Obviously, this isn't a solution for the masses, but it works for me.
It's been a loooooong time since I've set up app_info by hand, and I don't really remember how to do it. Anyone know of a good reference for what exactly needs to be done?
____________
My lucky number is 75898^524288+1
samuel7 Volunteer tester
Joined: 1 May 09 Posts: 89 ID: 39425 Credit: 257,425,010 RAC: 0
As soon as there's more than one GPU project, there will be no way of selecting what you want to run on the GPU ...
The anonymous platform mechanism lets you control exactly what you want to run. I deployed it today on the Linux side of my quad and it even picked up existing tasks correctly.
Obviously, this isn't a solution for the masses, but it works for me.
It's been a loooooong time since I've set up app_info by hand, and I don't really remember how to do it. Anyone know of a good reference for what exactly needs to be done?
The BOINC wiki has this article. You can use the example in the format section as a template.
Open the client_state file in the BOINC data directory and find the PrimeGrid project data. Then:
1) Copy the app_version section of the first subproject you want to run and paste it over the app_version in the template, removing only the platform tag.
2) Optionally edit the flops value if you know it's off for your current duration correction factor.
3) Correct the <app> <name>, and declare the app_version's files with <file_info> tags as in the example.
4) Repeat for any other subprojects you want to run.
5) Save as app_info.xml in the PG project folder and restart BOINC.
Make sure all the files you declared actually exist in the PG folder.
It is probably a good idea to run down your cache and/or make a backup of the data directory (suspend network activity, too!) before deploying.
Below is a portion of my Win app_info for Primegrid.
<app_info>
    <app>
        <name>ap26</name>
    </app>
    <file_info>
        <name>primegrid_ap26_1.01_windows_intelx86__cuda23.exe</name>
        <executable/>
    </file_info>
    <file_info>
        <name>cudart.dll</name>
        <executable/>
    </file_info>
    <app_version>
        <app_name>ap26</app_name>
        <version_num>101</version_num>
        <plan_class>cuda23</plan_class>
        <avg_ncpus>0.050000</avg_ncpus>
        <max_ncpus>0.050000</max_ncpus>
        <flops>5604000000.000000</flops>
        <coproc>
            <type>CUDA</type>
            <count>1.000000</count>
        </coproc>
        <file_ref>
            <file_name>primegrid_ap26_1.01_windows_intelx86__cuda23.exe</file_name>
            <main_program/>
        </file_ref>
        <file_ref>
            <file_name>cudart.dll</file_name>
        </file_ref>
    </app_version>
    <app>
        <name>psp_sr2sieve</name>
    </app>
    <file_info>
        <name>primegrid_sr2sieve_wrapper_1.12_windows_x86_64.exe</name>
        <executable/>
    </file_info>
    <file_info>
        <name>primegrid_sr2sieve_1.8.10_windows_x86_64.exe.orig</name>
        <executable/>
    </file_info>
    <app_version>
        <app_name>psp_sr2sieve</app_name>
        <version_num>112</version_num>
        <flops>2876776723.690640</flops>
        <file_ref>
            <file_name>primegrid_sr2sieve_wrapper_1.12_windows_x86_64.exe</file_name>
            <main_program/>
        </file_ref>
        <file_ref>
            <file_name>primegrid_sr2sieve_1.8.10_windows_x86_64.exe.orig</file_name>
            <open_name>primegrid_sr2sieve_1.8.10_windows_x86_64.exe.orig</open_name>
        </file_ref>
    </app_version>
</app_info>
____________
Michael Goetz Volunteer moderator Project administrator
Joined: 21 Jan 10 Posts: 13633 ID: 53948 Credit: 280,904,358 RAC: 40,710
|
Thanks, that makes sense and helps a lot.
____________
My lucky number is 75898^524288+1
Ken_g6 Volunteer developer
Joined: 4 Jul 06 Posts: 921 ID: 3110 Credit: 218,950,902 RAC: 3,798
Alright, I've posted a release candidate version to the locations given at the beginning of the thread. It may be slightly faster; it may also lock up your GPU until it's done. Let me know how it goes.
In any case, I've recently been looking at some other projects' GPU speeds, and I'm finding myself disappointed with my speeds. When Milkyway@Home is 17 times faster on high-end NVidia (PDF), and even a simple Collatz app (not the Collatz, but the only source code I could find) is more than twice as fast as a CPU on a mid-range card, but my code is only as fast as a CPU on a high-end card, I wonder if I'm doing something wrong. Would any of the experienced CUDA developers around here care to give my code the once-over, to see if I'm doing something obviously stupid like not giving the card enough threads?
I suppose the other side of the coin could be that my CUDA code isn't bad, but that my and Geoff's CPU code is extraordinarily good.
Edit: P.S. There's a BOINC-capable (I think) executable in the zipfile as well. :)
____________
Michael Goetz Volunteer moderator Project administrator
Joined: 21 Jan 10 Posts: 13633 ID: 53948 Credit: 280,904,358 RAC: 40,710
In any case, I've recently been looking at some other projects' GPU speeds, and I'm finding myself disappointed with my speeds. When Milkyway@Home is 17 times faster on high-end NVidia (PDF), and even a simple Collatz app (not the Collatz, but the only source code I could find) is more than twice as fast as a CPU on a mid-range card, but my code is only as fast as a CPU on a high-end card, I wonder if I'm doing something wrong.
One thing to consider is that some problems just don't lend themselves very well to parallel processing. Even with the best code in the world it still might not work very well on a GPU. Remember, the GPU isn't all that fast compared to a CPU. It's its ability to run several hundred calculations simultaneously that makes it fast. If the problem doesn't fit the hardware well, the GPU won't be able to crunch it very quickly.
____________
My lucky number is 75898^524288+1
Ken_g6 Volunteer developer
Joined: 4 Jul 06 Posts: 921 ID: 3110 Credit: 218,950,902 RAC: 3,798
Yeah... the application here is computationally bound, and doesn't require much memory. The slowest part is probably the 64-bit multiplies. When Fermi comes out, I expect that each stream processor will run my app (once recompiled) twice as fast.
Another part of it could be that others are comparing GPU speed to CPU speed on one core. In that case my app is 4 times as fast as the CPU version. :)
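To put some flesh on the 64-bit-multiply point: compute capability 1.x hardware has only 32-bit integer multipliers, so every 64-bit multiply in the sieve's modular arithmetic gets expanded into several 32-bit multiplies plus shifts and adds. A plain-C illustration of that expansion, just to show the extra work per multiply; the code nvcc actually emits differs:

```c
#include <stdint.h>

/* Low 64 bits of a 64x64-bit product, built only from 32x32->64-bit
 * multiplies -- roughly the expansion a CC 1.x GPU performs for each
 * 64-bit multiply. Three hardware multiplies replace the single one
 * (the ah*bh term only affects bits >= 64, so it can be dropped). */
static uint64_t mul64_via_32(uint64_t a, uint64_t b) {
    uint32_t al = (uint32_t)a, ah = (uint32_t)(a >> 32);
    uint32_t bl = (uint32_t)b, bh = (uint32_t)(b >> 32);
    uint64_t lo  = (uint64_t)al * bl;                      /* bits 0..63  */
    uint64_t mid = (uint64_t)al * bh + (uint64_t)ah * bl;  /* bits 32..95 */
    return lo + (mid << 32);  /* wraparound mod 2^64 matches native a*b  */
}
```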
____________
mfl0p Project administrator Volunteer developer
Joined: 5 Apr 09 Posts: 227 ID: 38042 Credit: 949,118,274 RAC: 172,923
Nice work so far, Ken. I only see one issue: this appears to be a compute-mode-only CUDA application, meaning that in its current form it will not run on the primary adapter under Windows (driver watchdog timer). Correct?
____________
Ken_g6 Volunteer developer
Joined: 4 Jul 06 Posts: 921 ID: 3110 Credit: 218,950,902 RAC: 3,798
I have no idea what you just said! (I'm new at this CUDA stuff.)
If you mean it's not using the driver API, that's correct. I was hoping to avoid it.
Edit: Did you see cuda_sleep_memcpy.cu?
____________
mfl0p Project administrator Volunteer developer
Joined: 5 Apr 09 Posts: 227 ID: 38042 Credit: 949,118,274 RAC: 172,923
I have no idea what you just said! (I'm new at this CUDA stuff.)
If you mean it's not using the driver API, that's correct. I was hoping to avoid it.
In Windows, if a CUDA kernel runs longer than 5 seconds the program will be terminated by the driver. Briefly looking at your posted source, it appears you're running one huge kernel.
RE: app speeds, currently in AP26 a 1.3 CUDA card is about 5.5 times as fast as one core of an Intel Q6600 CPU. So your app isn't exactly slow, it's just doing things the GPU isn't good at.
____________
Scott Brown Volunteer moderator Project administrator Volunteer tester Project scientist
Joined: 17 Oct 05 Posts: 2258 ID: 1178 Credit: 10,867,108,087 RAC: 11,866,263
Have you investigated whether some of the later compute capabilities add features that increase speed? It is nice to see an application that works on all CUDA cards, but given that only a handful of models are compute capability 1.0 (G80 chips), added features such as atomic functions on compute capability 1.1 cards might help with speed, depending on what the application computes.
____________
141941*2^4299438-1 is prime!
Ken_g6 Volunteer developer
Joined: 4 Jul 06 Posts: 921 ID: 3110 Credit: 218,950,902 RAC: 3,798
In Windows, if a CUDA kernel runs longer than 5 seconds the program will be terminated by the driver. Briefly looking at your posted source, it appears you're running one huge kernel.
Not exactly. I load up the GPU with either 384 or 768 P's per multiprocessor, run just those, further check any that found a factor on the CPU, then repeat. There's no specific time checking, but I estimate the kernel won't run more than 1 or 2 seconds at a time.
Scott: I looked into it. I'm not using much global memory, or any shared memory, so atomic functions don't matter. I'm not sure about double precision; it might have enough precision to be useful in one case, but it would be tricky. Otherwise there's nothing until compute capability 2.0, which as I mentioned makes multiplication faster.
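For anyone curious what that batch-and-repeat structure looks like, here is a stripped-down sketch of the control loop in plain C, with the kernel launch replaced by a stub. All names are illustrative, not ppsieve's actual ones, and the stub's divisibility test is just a placeholder for the real modular check:

```c
#include <stdint.h>

#define BATCH 768  /* candidates per launch; ppsieve loads 384 or 768 per multiprocessor */

/* Stub standing in for "launch CUDA kernel + synchronize". Each real
 * launch is bounded, so it finishes well before the Windows driver's
 * ~5-second watchdog would kill the process. The %7 test here is a
 * placeholder for the real "does p divide some k*2^n+1" check. */
static void sieve_batch_on_gpu(const uint64_t *p, int n, int *flags) {
    for (int i = 0; i < n; i++)
        flags[i] = (p[i] % 7 == 1);
}

/* Host loop: fill a batch, run one bounded launch, harvest the hits,
 * repeat until the whole range is covered. */
static uint64_t sieve_range(uint64_t p0, uint64_t p1) {
    uint64_t batch[BATCH];
    int flags[BATCH];
    uint64_t found = 0;
    for (uint64_t p = p0; p < p1; ) {
        int n = 0;
        while (n < BATCH && p < p1)
            batch[n++] = p++;
        sieve_batch_on_gpu(batch, n, flags);   /* one bounded launch */
        for (int i = 0; i < n; i++)
            if (flags[i])
                found++;  /* the real code re-verifies each hit on the CPU here */
    }
    return found;
}
```

Bounding the work per launch this way is what keeps each kernel to the 1–2 seconds Ken estimates, regardless of the total range size.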
____________
|
|
|
mfl0p Project administrator Volunteer developer Send message
Joined: 5 Apr 09 Posts: 227 ID: 38042 Credit: 949,118,274 RAC: 172,923
                         
|
|
OK, I'll have to pay more attention to the code when reading. That should work fine in Windows, too.
____________
|
|
|
|
|
|
64 bit GPU app - Test range 30M
./ppsieve-cuda-x86_64-linux -p42070e9 -P42070030e6 -k 1201 -K 9999 -N 2000000 -z normal -q
My GTX 260-192 at stock clock with no load on the CPU:
ppsieve version cuda-0.1.1-rc1 (testing)
Compiled Mar 17 2010 with GCC 4.3.3
nstart=76, nstep=32, gpu_nstep=32
ppsieve initialized: 1201 <= k <= 9999, 76 <= n <= 2000000
Sieve started: 42070000000000 <= p < 42070030000000
Thread 0 starting
Detected GPU 0: GeForce GTX 260
Detected compute capability: 1.3
Detected 24 multiprocessors.
Thread 0 completed
Waiting for threads to exit
Sieve complete: 42070000000000 <= p < 42070030000000
Found 97 factors
count=955289,sum=0x2dbc17167afb6a8d
Elapsed time: 49.12 sec. (0.02 init + 49.10 sieve) at 613957 p/sec.
Processor time: 3.81 sec. (0.02 init + 3.79 sieve) at 7951250 p/sec.
Average processor utilization: 1.04 (init), 0.08 (sieve)
and again at 667 MHz with no load on the CPU (shaders linked):
./ppsieve-cuda-x86_64-linux -p42070e9 -P42070030e6 -k 1201 -K 9999 -N 2000000 -z normal -q
ppsieve version cuda-0.1.1-rc1 (testing)
Compiled Mar 17 2010 with GCC 4.3.3
nstart=76, nstep=32, gpu_nstep=32
ppsieve initialized: 1201 <= k <= 9999, 76 <= n <= 2000000
Sieve started: 42070000000000 <= p < 42070030000000
Thread 0 starting
Detected GPU 0: GeForce GTX 260
Detected compute capability: 1.3
Detected 24 multiprocessors.
Thread 0 completed
Waiting for threads to exit
Sieve complete: 42070000000000 <= p < 42070030000000
Found 97 factors
count=955289,sum=0x2dbc17167afb6a8d
Elapsed time: 41.87 sec. (0.02 init + 41.86 sieve) at 720229 p/sec.
Processor time: 3.69 sec. (0.02 init + 3.67 sieve) at 8215573 p/sec.
Average processor utilization: 1.04 (init), 0.09 (sieve)
____________
|
|
|
|
|
|
64 bit GPU app - Test range 30M - speed comparison
./ppsieve-cuda-x86_64-linux -p42070e9 -P42070030e6 -k 1201 -K 9999 -N 2000000 -z normal -q
My GTX 260-192 at stock clock with no load on the CPU:
ppsieve version cuda-0.1.1-beta (testing) : Elapsed time: 49.30 sec. (0.02 init + 49.29 sieve) at 611635 p/sec.
ppsieve version cuda-0.1.1-rc1 (testing) : Elapsed time: 49.12 sec. (0.02 init + 49.10 sieve) at 613957 p/sec.
and again at 667 MHz with no load on the CPU (shaders linked):
ppsieve version cuda-0.1.1-beta (testing) : Elapsed time: 42.05 sec. (0.02 init + 42.02 sieve) at 717374 p/sec.
ppsieve version cuda-0.1.1-rc1 (testing) : Elapsed time: 41.87 sec. (0.02 init + 41.86 sieve) at 720229 p/sec.
____________
|
|
|
Ken_g6 Volunteer developer
 Send message
Joined: 4 Jul 06 Posts: 921 ID: 3110 Credit: 218,950,902 RAC: 3,798
                          
|
|
Thanks for all your testing help, guys! I've got one more thing to test, and I hope you won't mind because I expect it to be slower. (But I've been wrong before!)
Linux 64-bit users *only*, with pre-Fermi cards (Fermi isn't in stores yet if you didn't know), please try the two binaries in this zipfile. This is an experiment in 24-bit multiplies instead of 64-bit ones. Both binaries do 24-bit multiplies, despite their names, but they do other stuff differently. Even if it doesn't work here, this is a plausible algorithm for ATI if I can ever figure out how to develop for OpenCL without their GPU.
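For anyone curious what the 24-bit experiment is about: pre-Fermi GPUs have a fast 24-bit integer multiply, so a wide product can be assembled from 24-bit limbs instead of relying on the slower full-width path. A rough Python sketch of the limb arithmetic (my own illustration; the limb split and function names are hypothetical, not Ken's kernel code):

```python
MASK24 = (1 << 24) - 1

def split24(x):
    """Split a value below 2^48 into two 24-bit limbs, low limb first."""
    return x & MASK24, (x >> 24) & MASK24

def mul48(a, b):
    """Multiply two sub-2^48 values from 24-bit partial products,
    the way a GPU's fast 24-bit multiplier would be combined."""
    a0, a1 = split24(a)
    b0, b1 = split24(b)
    # low*low, the two cross terms shifted by 24, and high*high shifted by 48
    return (a0 * b0) + ((a0 * b1 + a1 * b0) << 24) + ((a1 * b1) << 48)
```

Whether this beats the 64-bit path depends on how many such partial products the kernel needs per mulmod, which is presumably what the two test binaries are exploring.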
If anyone reading this *does* have a Fermi (GTX4xx), I'd love to see a benchmark from the original code (linked in the first post by John). If Fermi doesn't run 50-100% faster per shader, I may have to recompile or something for maximum speed.
____________
|
|
|
Benva Volunteer tester
 Send message
Joined: 5 May 08 Posts: 73 ID: 22332 Credit: 2,715,050 RAC: 0
     
|
|
SYSTEM Ubuntu 9.10
Intel Core2Duo T9550 @ 2.66GHZ
G105M
195.36.15 drivers
./ppsieve-cuda-x86_64-linux -p42070e9 -P42070003e6 -k 1201 -K 9999 -N 2000000 -z normal
ppsieve version cuda-0.1.1-rc1 (testing)
Compiled Mar 17 2010 with GCC 4.3.3
nstart=76, nstep=32, gpu_nstep=32
ppsieve initialized: 1201 <= k <= 9999, 76 <= n <= 2000000
Sieve started: 42070000000000 <= p < 42070003000000
Thread 0 starting
Detected GPU 0: GeForce G 105M
Detected compute capability: 1.1
Detected 1 multiprocessors.
42070000070587 | 9475*2^197534+1
42070000198537 | 3373*2^1046686+1
42070000300049 | 9139*2^461846+1
42070000345343 | 1715*2^635711+1
42070000464001 | 4179*2^1577462+1
42070000949861 | 4707*2^571847+1
42070001011573 | 7113*2^215532+1
42070001040127 | 6471*2^37907+1
42070002482267 | 9951*2^1920408+1
42070002690167 | 2553*2^1888870+1
42070002698543 | 4239*2^368773+1
42070002875941 | 4081*2^1494668+1
Thread 0 completed
Waiting for threads to exit
Sieve complete: 42070000000000 <= p < 42070003000000
Found 12 factors
count=95668,sum=0x37dacb7121ccffe4
Elapsed time: 80.06 sec. (0.05 init + 80.02 sieve) at 39313 p/sec.
Processor time: 12.80 sec. (0.03 init + 12.77 sieve) at 246337 p/sec.
Average processor utilization: 0.67 (init), 0.16 (sieve)
pps-cuda-a1
./ppsieve-cuda-64bit-x86_64-linux -p42070e9 -P42070003e6 -k 1201 -K 9999 -N 2000000 -z normal
ppsieve version cuda-0.1.0-beta (testing)
Compiled Mar 29 2010 with GCC 4.3.3
nstart=76, nstep=32
ppsieve initialized: 1201 <= k <= 9999, 76 <= n <= 2000000
Sieve started: 42070000000000 <= p < 42070003000000
Starting 1 threads.
Detected GPU 0: GeForce G 105M
Detected compute capability: 1.1
Detected 1 multiprocessors.
42070000070587 | 9475*2^197534+1
42070000198537 | 3373*2^1046686+1
42070000300049 | 9139*2^461846+1
42070000345343 | 1715*2^635711+1
42070000464001 | 4179*2^1577462+1
42070000949861 | 4707*2^571847+1
42070001011573 | 7113*2^215532+1
42070001040127 | 6471*2^37907+1
42070002482267 | 9951*2^1920408+1
42070002690167 | 2553*2^1888870+1
42070002698543 | 4239*2^368773+1
42070002875941 | 4081*2^1494668+1
Thread 0 completed
Waiting for threads to exit
Sieve complete: 42070000000000 <= p < 42070003000000
Found 12 factors
count=95668,sum=0x37dacb7121ccffe4
Elapsed time: 166.18 sec. (0.04 init + 166.14 sieve) at 18934 p/sec.
Processor time: 13.11 sec. (0.04 init + 13.07 sieve) at 240683 p/sec.
Average processor utilization: 1.01 (init), 0.08 (sieve)
./ppsieve-cuda-24bit-x86_64-linux -p42070e9 -P42070003e6 -k 1201 -K 9999 -N 2000000 -z normal
ppsieve version cuda-0.1.0-beta (testing)
Compiled Mar 29 2010 with GCC 4.3.3
nstart=76, nstep=32
ppsieve initialized: 1201 <= k <= 9999, 76 <= n <= 2000000
Sieve started: 42070000000000 <= p < 42070003000000
Starting 1 threads.
Detected GPU 0: GeForce G 105M
Detected compute capability: 1.1
Detected 1 multiprocessors.
42070000070587 | 9475*2^197534+1
42070000198537 | 3373*2^1046686+1
42070000300049 | 9139*2^461846+1
42070000345343 | 1715*2^635711+1
42070000464001 | 4179*2^1577462+1
42070000949861 | 4707*2^571847+1
42070001011573 | 7113*2^215532+1
42070001040127 | 6471*2^37907+1
42070002482267 | 9951*2^1920408+1
42070002690167 | 2553*2^1888870+1
42070002698543 | 4239*2^368773+1
42070002875941 | 4081*2^1494668+1
Thread 0 completed
Waiting for threads to exit
Sieve complete: 42070000000000 <= p < 42070003000000
Found 12 factors
count=95668,sum=0x37dacb7121ccffe4
Elapsed time: 163.43 sec. (0.04 init + 163.39 sieve) at 19252 p/sec.
Processor time: 12.32 sec. (0.04 init + 12.28 sieve) at 256167 p/sec.
Average processor utilization: 1.01 (init), 0.08 (sieve)
____________
|
|
|
|
|
|
64 bit GPU app - Test range 30M
./ppsieve-cuda-x86_64-linux -p42070e9 -P42070030e6 -k 1201 -K 9999 -N 2000000 -z normal -q
My GTX 260-192 at stock clock with no load on the CPU:
ppsieve version cuda-0.1.0-beta (testing)
Compiled Mar 29 2010 with GCC 4.3.3
nstart=76, nstep=32
ppsieve initialized: 1201 <= k <= 9999, 76 <= n <= 2000000
Sieve started: 42070000000000 <= p < 42070030000000
Starting 1 threads.
Detected GPU 0: GeForce GTX 260
Detected compute capability: 1.3
Detected 24 multiprocessors.
p=42070020971521, 349.5K p/sec, 0.06 CPU cores, 69.9% done. ETA 01 Apr 05:42
Thread 0 completed
Waiting for threads to exit
Sieve complete: 42070000000000 <= p < 42070030000000
Found 97 factors
count=955289,sum=0x2dbc17167afb6a8d
Elapsed time: 87.11 sec. (0.02 init + 87.09 sieve) at 346140 p/sec.
Processor time: 4.61 sec. (0.02 init + 4.59 sieve) at 6566015 p/sec.
Average processor utilization: 1.11 (init), 0.05 (sieve)
./ppsieve-cuda-24bit-x86_64-linux -p42070e9 -P42070030e6 -k 1201 -K 9999 -N 2000000 -z normal -q
My GTX 260-192 at stock clock with no load on the CPU:
ppsieve version cuda-0.1.0-beta (testing)
Compiled Mar 29 2010 with GCC 4.3.3
nstart=76, nstep=32
ppsieve initialized: 1201 <= k <= 9999, 76 <= n <= 2000000
Sieve started: 42070000000000 <= p < 42070030000000
Starting 1 threads.
Detected GPU 0: GeForce GTX 260
Detected compute capability: 1.3
Detected 24 multiprocessors.
p=42070019922945, 332.0K p/sec, 0.06 CPU cores, 66.4% done. ETA 01 Apr 05:47
Thread 0 completed
Waiting for threads to exit
Sieve complete: 42070000000000 <= p < 42070030000000
Found 97 factors
count=955289,sum=0x2dbc17167afb6a8d
Elapsed time: 94.13 sec. (0.02 init + 94.12 sieve) at 320306 p/sec.
Processor time: 4.69 sec. (0.02 init + 4.67 sieve) at 6449443 p/sec.
Average processor utilization: 1.10 (init), 0.05 (sieve)
____________
|
|
|
Ken_g6 Volunteer developer
 Send message
Joined: 4 Jul 06 Posts: 921 ID: 3110 Credit: 218,950,902 RAC: 3,798
                          
|
|
Yeah, apparently my test version is slower, as expected. There's probably no need to test it any more. But thanks for the testing you did!
____________
|
|
|
mfl0p Project administrator Volunteer developer Send message
Joined: 5 Apr 09 Posts: 227 ID: 38042 Credit: 949,118,274 RAC: 172,923
                         
|
|
Ken, whenever you get the code finalized, I can build a Win32 version if you need.
____________
|
|
|
|
|
|
Okay, took another range: 174550G to 174551G
Intel Xeon W3520 => 43 factors found; 855.903 k p/s
Nvidia GTX 260 => 43 factors found; 894.724 k p/s
Nvidia FX 580 => manually aborted due to long runtime; ~115 k p/s
Like in AP26 with mfl0p's newest app, the GTX 260 is a bit faster than the Xeon W3520. |
|
|
Ken_g6 Volunteer developer
 Send message
Joined: 4 Jul 06 Posts: 921 ID: 3110 Credit: 218,950,902 RAC: 3,798
                          
|
|
OK, new version uploaded, probably the finalized version, at the links in the first post. I found a somewhat major bug in previous versions: around 30 of the highest N's in the ranges were being skipped! But that's fixed now.
Bryan, see if you can get this code to build in VS, perhaps without BOINC first. If you need to make changes, perhaps I should set up a GitHub account?
____________
|
|
|
mfl0p Project administrator Volunteer developer Send message
Joined: 5 Apr 09 Posts: 227 ID: 38042 Credit: 949,118,274 RAC: 172,923
                         
|
|
OK, I'll try building the Win32 version soon. Thanks, Ken
____________
|
|
|
|
|
|
Ken, if you're ready for me to do a Mac port, I'd be very happy to start on that if I can get the source code.
Cheers
- Iain |
|
|
HAmsty Volunteer tester
 Send message
Joined: 26 Dec 08 Posts: 132 ID: 33421 Credit: 12,510,712 RAC: 0
                
|
|
@Iain: the source code is linked in the first post of this thread.
____________
|
|
|
Ken_g6 Volunteer developer
 Send message
Joined: 4 Jul 06 Posts: 921 ID: 3110 Credit: 218,950,902 RAC: 3,798
                          
|
|
Well, I thought I was done, but I've made a few more changes for the release version of PPSieve-CUDA. The biggest change is compiling with CUDA 3.0. I hope it works for everyone!
The biggest code change is that I gave up using boinc_init_parallel() in favor of boinc_init(), because it's more compatible. The rest of the code changes are to header files and paths to BOINC header files. So nothing major there.
By the way, apparently CUDA 3.0 introduces an easier way to lower CPU usage. It might go from 5% down to 1 or 2%. But I'm going to leave lowering CPU usage for V0.1.2, if it's needed.
____________
|
|
|
|
|
|
Thanks Ken - I just got the 0.1.1-rc2 version ported to Mac OS X (only minor tweaks required as __thread attribute is not supported by GCC on Mac OS X), I'll pick up the new code and rebuild ASAP and post here when it's done (hopefully in a couple of days)... |
|
|
Scott Brown Volunteer moderator Project administrator Volunteer tester Project scientist
 Send message
Joined: 17 Oct 05 Posts: 2258 ID: 1178 Credit: 10,867,108,087 RAC: 11,866,263
                                        
|
|
Compiling with CUDA 3.0 will probably mean that many will be forced to upgrade drivers to at least the 195.xx series. Just FYI: the 196.xx and 197.xx drivers have been noted to slow down many cards' computational speeds compared to the 190.xx and 191.xx drivers (especially 8xxx and 9xxx series cards under Win7 and Vista), so the gain in freed-up CPU may actually be lost (and maybe exceeded) by the loss in speed on some cards.
____________
141941*2^4299438-1 is prime!
|
|
|
Ken_g6 Volunteer developer
 Send message
Joined: 4 Jul 06 Posts: 921 ID: 3110 Credit: 218,950,902 RAC: 3,798
                          
|
|
OK, it doesn't have to be compiled with 3.0 (yet). I wasn't sure if 2.3 would support Fermi. Since it looks like it does (PDF), I'll see about going back to 2.3.
Edit: To be clear, only the binaries will change, not the source code.
____________
|
|
|
Michael Goetz Volunteer moderator Project administrator
 Send message
Joined: 21 Jan 10 Posts: 13633 ID: 53948 Credit: 280,904,358 RAC: 40,710
                           
|
Compiling with CUDA 3.0 will probably mean that many will be forced to upgrade drivers to at least the 195.xx series. Just FYI: the 196.xx and 197.xx drivers have been noted to slow down many cards' computational speeds compared to the 190.xx and 191.xx drivers (especially 8xxx and 9xxx series cards under Win7 and Vista), so the gain in freed-up CPU may actually be lost (and maybe exceeded) by the loss in speed on some cards.
The slow downs some people have reported with some versions of the drivers have been significant. Around 25%, IIRC.
____________
My lucky number is 75898524288+1 |
|
|
Scott Brown Volunteer moderator Project administrator Volunteer tester Project scientist
 Send message
Joined: 17 Oct 05 Posts: 2258 ID: 1178 Credit: 10,867,108,087 RAC: 11,866,263
                                        
|
OK, it doesn't have to be compiled with 3.0 (yet). I wasn't sure if 2.3 would support Fermi. Since it looks like it does (PDF), I'll see about going back to 2.3.
Edit: To be clear, only the binaries will change, not the source code.
That's good... am I reading it correctly that CUDA 2.3 devices will use the native CUBIN, which can work with the older drivers, but Fermi devices will need the 195.xx driver or higher to utilize the PTX code?
____________
141941*2^4299438-1 is prime!
|
|
|
Ken_g6 Volunteer developer
 Send message
Joined: 4 Jul 06 Posts: 921 ID: 3110 Credit: 218,950,902 RAC: 3,798
                          
|
|
...And we're back to binaries compiled with CUDA 2.3. :)
Edit: Scott, I think that's right. I doubt there are any drivers older than that for Fermi.
____________
|
|
|
|
|
|
The Mac OS X / CUDA version of ppsieve is now available for testing:
Mac 32 bit (OS 10.5+ required) - http://www.pyramid-productions.net/downloads/ppsieve-cuda-boinc-i686-apple-darwin.tar.gz
Note that only 32 bit CUDA executables are supported on the Mac, but as most of the runtime is spent on the GPU, this is not a problem. Since upgrading to Mac OS 10.6.3, Apple now only supports CUDA 3.0, so this app is built and linked against the CUDA 3.0 libraries. However, it should work fine on machines where CUDA 2.3 is installed. If you have a Mac running OS 10.5 and/or CUDA 2.3, I'd be very grateful for your testing.
To test the app, please use the same inputs as in the original post, and obviously the output should be the same!
On my machine (MacBookPro, 2.66 GHz Core 2 Duo, GeForce 9400M / 9600M GT) with the CPU idling, the 9400M takes
Elapsed time: 67.96 sec. (0.02 init + 67.94 sieve) at 46303 p/sec.
Processor time: 6.93 sec. (0.03 init + 6.90 sieve) at 455624 p/sec.
Average processor utilization: 1.54 (init), 0.10 (sieve)
and the 9600M GT takes:
Elapsed time: 41.52 sec. (0.02 init + 41.50 sieve) at 75805 p/sec.
Processor time: 3.90 sec. (0.03 init + 3.87 sieve) at 812352 p/sec.
Average processor utilization: 1.35 (init), 0.09 (sieve)
Any problems or performance results please post to this thread.
Thanks
- Iain |
|
|
|
|
|
Does anyone have plans for a Proth Prime Sieve ATI GPU app in the near future?
____________
May the Force be with you always.
|
|
|
Ken_g6 Volunteer developer
 Send message
Joined: 4 Jul 06 Posts: 921 ID: 3110 Credit: 218,950,902 RAC: 3,798
                          
|
|
An ATI app should be possible. This is the kind of highly-parallel, low-memory work that should work very well on ATI.
However, I haven't even been able to get their OpenCL compiler to run. Right now I'm focusing on getting the CPU and CUDA apps into BOINC, so ATI is off my radar for now.
____________
|
|
|
KPX Send message
Joined: 8 Jan 07 Posts: 20 ID: 4756 Credit: 92,931,253 RAC: 22,439
                   
|
OK, I'll try building the Win32 version soon. Thanks, Ken
Any progress on the Windows version? Please? :-) |
|
|
|
|
|
I ran the Mac version. Here's how the BOINC client sees my computer (24" iMac, OS X 10.6.3):
Processor: 2 GenuineIntel Intel(R) Core(TM)2 Duo CPU E8335 @ 2.93GHz [x86 Family 6 Model 23 Stepping 10]
Processor features: FPU VME DE PSE TSC MSR PAE MCE CX8 APIC SEP MTRR PGE MCA CMOV PAT PSE36 CLFSH DS ACPI MMX FXSR SSE SSE2 SS HTT TM SSE3 MON DSCPL VMX EST TM2 SSSE3 CX16 TPR PDCM SSE4.1
OS: Darwin: 10.3.0
Memory: 4.00 GB physical, 559.57 GB virtual
Disk: 595.85 GB total, 559.32 GB free
Local time is UTC -7 hours
NVIDIA GPU 0: GeForce GT 120 (driver version unknown, CUDA version 3000, compute capability 1.1, 256MB, 80 GFLOPS peak)
I first suspended all BOINC tasks. Here's the output:
% /usr/bin/time ./ppsieve-cuda-boinc-i686-apple-darwin -p42070e9 -P42070003e6 -k 1201 -K 9999 -N 2000000 -z normal
ppsieve version cuda-0.1.1 (testing)
Compiled Apr 16 2010 with GCC 4.2.1 (Apple Inc. build 5646) (dot 1)
nstart=76, nstep=32, gpu_nstep=32
ppsieve initialized: 1201 <= k <= 9999, 76 <= n <= 2000000
42070000070587 | 9475*2^197534+1
42070000198537 | 3373*2^1046686+1
42070000300049 | 9139*2^461846+1
42070000345343 | 1715*2^635711+1
42070000464001 | 4179*2^1577462+1
42070000949861 | 4707*2^571847+1
42070001011573 | 7113*2^215532+1
42070001040127 | 6471*2^37907+1
42070002482267 | 9951*2^1920408+1
42070002690167 | 2553*2^1888870+1
42070002698543 | 4239*2^368773+1
42070002875941 | 4081*2^1494668+1
Found 12 factors
28.37 real 3.61 user 0.13 sys
Screen repainting was pretty herky-jerky when the test was running, not that that's unexpected... it would have been pretty annoying if I was trying to do anything else.
-- Gary |
|
|
|
|
|
That's great - thanks for testing Gary! |
|
|
|
|
|
I think we have a problem here...:
I downloaded ppsieve-cuda from the link; the ppsieve I use is version 0.3.4 from http://primesearchteam.com/showthread.php?t=25
Here are my results:
ppsieve-0.3.4 on Intel Core 2 Quad Q9550
Running ppsieve-x86_64-linux with 4 threads.
ppsieve version 0.3.4 (testing)
Compiled Feb 21 2010 with GCC 4.1.2 20080704 (Red Hat 4.1.2-46)
Scanning ABCD file...
Found K's from 1201 to 9999.
Found N's from 0 to 2000000.
Algorithm not specified, starting benchmark...
bsf takes 420000; mul takes 580000; using standard algorithm.
nstart=1999980, nstep=35
Reading ABCD file.
Read 324490054 terms from ABCD format input file `ppse_137TE0.txt'
ppsieve initialized: 1201 <= k <= 9999, 80 <= n <= 2000000
Sieve started: 174550000000000 <= p < 174551000000000
Thread 0 starting
Thread 3 starting
Thread 2 starting
Thread 1 starting
p=174550993787905, 668.5K p/sec, 3.62 CPU cores, 99.4% done. ETA 14 May 20:14
Thread 3 completed
Waiting for threads to exit
Thread 1 completed
Thread 0 completed
Thread 2 completed
Sieve complete: 174550000000000 <= p < 174551000000000
count=30492087,sum=0x87435f1f71650555
Elapsed time: 1595.65 sec. (82.99 init + 1512.67 sieve) at 661137 p/sec.
Processor time: 5550.66 sec. (78.34 init + 5472.32 sieve) at 182752 p/sec.
Average processor utilization: 0.94 (init), 3.62 (sieve)
Found 16 factors
Run completed successfully!
ppsieve-cuda 0.1.1-rc1 (testing) on GeForce GTX260
Running ppsieve-cuda-x86_64-linux.
ppsieve version cuda-0.1.1-rc1 (testing)
Compiled Mar 17 2010 with GCC 4.3.3
Scanning ABCD file...
Found K's from 1201 to 9999.
Found N's from 0 to 2000000.
nstart=80, nstep=32, gpu_nstep=35
Reading ABCD file.
Read 324490054 terms from ABCD format input file `ppse_137TE0.txt'
ppsieve initialized: 1201 <= k <= 9999, 80 <= n <= 2000000
Sieve started: 174550000000000 <= p < 174551000000000
Thread 0 starting
Detected GPU 0: GeForce GTX 260
Detected compute capability: 1.3
Detected 27 multiprocessors.
p=174550966262785, 895.5K p/sec, 0.07 CPU cores, 96.6% done. ETA 14 May 21:05
Thread 0 completed
Waiting for threads to exit
Sieve complete: 174550000000000 <= p < 174551000000000
count=30492087,sum=0x87435f1f71650555
Elapsed time: 1203.92 sec. (81.91 init + 1122.01 sieve) at 891327 p/sec.
Processor time: 161.52 sec. (80.80 init + 80.73 sieve) at 12388301 p/sec.
Average processor utilization: 0.99 (init), 0.07 (sieve)
Found 43 factors
Run completed successfully!
[roadrunner@rr022 ppsieve-cuda]# diff -u fppse_174550G-174551G.txt ../ppsieve/fppse_174550G-174551G.txt
--- fppse_174550G-174551G.txt 2010-05-14 21:05:53.000000000 +0200
+++ ../ppsieve/fppse_174550G-174551G.txt 2010-05-14 21:11:44.000000000 +0200
@@ -1,43 +1,16 @@
174550025415817 | 7911*2^73648+1
174550045592773 | 2793*2^586237+1
-174550069177949 | 8745*2^1984556+1
-174550072026563 | 5457*2^226986+1
-174550072429729 | 9075*2^1747880+1
174550087108373 | 3009*2^653483+1
-174550160034671 | 1329*2^1681186+1
174550160534359 | 3255*2^959816+1
174550164384991 | 8355*2^47924+1
174550169778407 | 8553*2^689552+1
174550180112447 | 2569*2^714210+1
-174550180935937 | 1933*2^370384+1
-174550234719989 | 9149*2^1030559+1
-174550274164087 | 6729*2^1373601+1
-174550276818167 | 6207*2^1373038+1
-174550316241167 | 7731*2^1931925+1
-174550374684949 | 8743*2^638110+1
-174550399908163 | 1383*2^1894880+1
174550460586391 | 6543*2^1032642+1
-174550469318573 | 9217*2^1762344+1
-174550494079007 | 4001*2^157237+1
-174550503180689 | 5391*2^1644311+1
-174550579748341 | 3225*2^291262+1
-174550596690163 | 6799*2^1459850+1
-174550612854811 | 2377*2^1165082+1
174550639882459 | 5079*2^1786863+1
-174550644475447 | 4901*2^820583+1
174550668538153 | 7905*2^1360676+1
-174550683576527 | 1715*2^1236227+1
174550695157613 | 2919*2^951421+1
-174550731814651 | 2127*2^1357850+1
174550734921757 | 8485*2^1891676+1
-174550743947477 | 4499*2^832027+1
-174550772799551 | 6351*2^874484+1
174550799235079 | 7803*2^1302508+1
174550886243347 | 8783*2^1311925+1
-174550889752607 | 3299*2^1161717+1
-174550900074251 | 2993*2^1487561+1
-174550900755097 | 3675*2^1656171+1
-174550902785663 | 6123*2^493737+1
174550916859199 | 3197*2^1643463+1
-174550954763071 | 4191*2^1993452+1
174550971619799 | 5577*2^1063059+1
That is not good, I think... |
|
|
Ken_g6 Volunteer developer
 Send message
Joined: 4 Jul 06 Posts: 921 ID: 3110 Credit: 218,950,902 RAC: 3,798
                          
|
|
Hm, not good is right. The CUDA app performed fine - all those factors are valid. But the CPU app didn't! :Q
I'm in another race, but in about 3 hours I'll have enough free memory to look into this.
____________
|
|
|
|
|
|
Okay. Meanwhile, I am cross-checking some other ranges I have on file and will keep the results posted. |
|
|
Ken_g6 Volunteer developer
 Send message
Joined: 4 Jul 06 Posts: 921 ID: 3110 Credit: 218,950,902 RAC: 3,798
                          
|
|
Well, I can't reproduce your CPU results. I get exactly the same results as you got with CUDA. But there could be several reasons for that.
For one thing, the version I downloaded from the PPSE Sieve Reservations page is 0.3.3. Not that 0.3.4 should produce bad results like that, but I can't compare directly.
PPSieve runs my CPU hot, at least as hot as fast LLR tests. Is it possible your machine isn't entirely stable?
Otherwise, PM me and I'll see about getting your version of files to test with.
____________
|
|
|
|
|
|
Okay, I took 0.3.3 and all is fine.
Intel C2Q Q9550 vs Nvidia GeForce GTX260
diff -u fppse_174550G-174551G.txt ../ppsieve/fppse_174550G-174551G.txt
Doing 198000G to 198001G now on four platforms:
Intel C2Q Q9550
Intel Xeon W3520
Nvidia GeForce GTX260
Nvidia Quadro FX580
I think the host can be ruled out, since it has not produced faulty results in a year and its BOINC WUs validate without problems. |
|
|
|
|
|
All okay, the only thing that was different is:
# head -38 ../ppsieve/fppse_198000G-198500G.txt | diff -u fppse_198000G-198001G.gtx260.txt -
--- fppse_198000G-198001G.gtx260.txt 2010-05-15 08:00:15.000000000 +0200
+++ - 2010-05-15 13:15:06.343799000 +0200
@@ -13,9 +13,9 @@
198000446092087 | 8441*2^657907+1
198000480975821 | 5379*2^1828509+1
198000523921751 | 6541*2^909876+1
-198000544962289 | 8067*2^925640+1
198000545674577 | 5467*2^1099466+1
198000546654689 | 5593*2^925632+1
+198000544962289 | 8067*2^925640+1
198000583273579 | 2783*2^1822821+1
198000609451933 | 2667*2^1881395+1
198000664307197 | 1435*2^1989456+1
But this is okay; the numbers are just in a different order.
All done with 0.3.3 for cpu and 0.1.1-rc1 for cuda.
My copy of 0.3.4 for the CPU must be defective somehow. |
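Incidentally, order-only differences like this can be filtered out before diffing by sorting both factor files first. A small hedged helper (my own, not part of ppsieve):

```python
def same_factors(text_a, text_b):
    """Compare two factor listings while ignoring line order and blank lines."""
    def norm(t):
        return sorted(line.strip() for line in t.splitlines() if line.strip())
    return norm(text_a) == norm(text_b)
```

This is equivalent to sorting both files with `sort` and then running `diff` on the sorted copies.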
|
|
|
|
|
What is this error I got while computing the 198900G to 199000G range with cuda-0.1.1-rc1?
Computation Error: no candidates found for p=198908075406077 |
|
|
Ken_g6 Volunteer developer
 Send message
Joined: 4 Jul 06 Posts: 921 ID: 3110 Credit: 218,950,902 RAC: 3,798
                          
|
|
Two possibilities for computation errors. Most likely this one is because you're using 0.1.1-rc1. I believe I fixed a bug between rc2 and the final release that could rarely cause this error. It could definitely cause factors to be missed near NMax.
So please download the latest version from the link in the top post.
A computation error means that the GPU says it found some factor (it doesn't return what factor), but the CPU failed to find a factor in that range. So it could also be caused by an unstable GPU or rarely an unstable CPU.
____________
|
|
|
Ken_g6 Volunteer developer
 Send message
Joined: 4 Jul 06 Posts: 921 ID: 3110 Credit: 218,950,902 RAC: 3,798
                          
|
|
FYI, the source for PPSieve CUDA is now on GitHub!
http://github.com/Ken-g6/PSieve-CUDA
So now you can see the various directions I've considered. The redc branch is the current one, but I have an idea for the other branch that might pull it ahead, if I can find a large enough, fast-enough area of memory; maybe texture memory.
But first, since I've heard nothing from mfl0p, I think I'd better try to set up a WinXP VM and build a version for Windows.
____________
|
|
|
|
|
|
Thank you very much for setting up the repository. This makes it easier to follow the developments. I think it is time for me to reinstall the NVIDIA drivers and their CUDA toolkit under Lucid Lynx and get my GTX 260-192 out of hibernation mode again (in the last few weeks I've crunched with a HD 4770 under Windows and Linux).
By the way: The repo contains a file named pps/ppse_37TE1.txt that is a link to a file in a /downloads/... directory that is not in the repo. Is this file too large to include in the repository or are there other reasons not to include the file?
____________
|
|
|
Ken_g6 Volunteer developer
 Send message
Joined: 4 Jul 06 Posts: 921 ID: 3110 Credit: 218,950,902 RAC: 3,798
                          
|
|
Heh, didn't know that file was in there. It's a 1.2 GB file, so there's no way to include it. Plus it's not going to be used with BOINC, so its only purpose here would be for testing with many_n_test.sh and maybe some of the other testing scripts. It's not useful for the testing we're doing in this thread.
Edit: By the way, the code hasn't changed in about a month. I just made the code and its previous changes easier to access.
____________
|
|
|
Ken_g6 Volunteer developer
 Send message
Joined: 4 Jul 06 Posts: 921 ID: 3110 Credit: 218,950,902 RAC: 3,798
                          
|
|
Alright, people, I need HELP with MSVC++.
I tried compiling the source with VC++ 2008 Express. The files compile fine, but when linking, it's like no file sees any other file's header. I included the header files - even some new ones to replace missing Linux versions, so I'm not sure what's going on.
If you know anything about MSVC++ (since I don't), could you please take a look at my source code?
Thanks!
P.S. What all has to be included in the source code to save the proper build instructions? Does the .sln file need to be there? I really want to avoid including the gigantic .ncb file.
____________
|
|
|
|
|
|
Ken, I took a look at your source code. I am pretty new to C++, but I think I may have spotted your problem.
In your code you used
#include <assert.h>
To include the header, I believe it needs to be enclosed in quotation marks:
#include "assert.h"
I tried to compile the code after changing it. It got further before failing, but I think the reason I couldn't compile is that I am running the 2010 Visual C++.
I also noticed that a couple of the headers you are trying to load don't appear to exist: "util.h" and "gfn_app.h". There is a "putil.h", so maybe the name was just mistyped.
Hope this helps,
____________
|
|
|
Jay Volunteer tester
 Send message
Joined: 28 Apr 10 Posts: 82 ID: 59636 Credit: 10,419,429 RAC: 0
                  
|
|
Tanya, angle brackets (<>) are used for including non-user-written libraries that are (or should be) in your compiler's include path. You use quotes ("") when what you're including is in the same directory, or in the directory of another file that includes it. If it's still not found there, I think the compiler then falls back to the include path.
____________
|
|
|
Ken_g6 Volunteer developer
 Send message
Joined: 4 Jul 06 Posts: 921 ID: 3110 Credit: 218,950,902 RAC: 3,798
                          
|
I also noticed that a couple of the headers you are trying to load don't appear to exist, "util.h" and "gfn_app.h" There is a "putil.h" so maybe the name was just mistyped.
No, the name was not mistyped. Look at the #ifdef's. If USE_BOINC is #define'd, it will use util.h; but I'm not trying to compile the BOINC version yet.
By the way, I wasn't sure if project-level preprocessor directives got included in the file I zipped up. You should make sure NDEBUG is #defined in the project, or you may get more errors than I did.
I couldn't find any reference to gfn_app.h or gfn_main.h. Where did you see that?
But I still don't think that will fix the 131 errors with 74 unresolved externals.
____________
|
|
|
|
|
|
Jay, what you are saying sounds mostly right, so I'm not sure if you're saying I got something wrong in my earlier message. I do know that for including headers, at least with the 2010 version of Visual C++, I need to use quotes or the header won't work, and I have used a non-user-written library before: #include <iostream>. I didn't think the non-user-written library had to be in the compiler's path, although I don't know where that would be.
Perhaps I have misunderstood something, as I have done very little with C++.
____________
|
|
|
|
|
I couldn't find any reference to gfn_app.h or gfn_main.h. Where did you see that?
Instead of bringing up the full project in visual C++, I went looking through folder at the individual files. One file was named gfn_main.c which I opened and looked at the code. That was where I saw the line
#include <gfn_app.h>
That is also where the line "#include <gfn_main.c>" is.
By the way, I wasn't sure if project-level preprocessor directives got included in the file I zipped up. You should make sure NDEBUG is #defined in the project, or you may get more errors than I did.
I'm afraid you've lost me here.
____________
|
|
|
Ken_g6 Volunteer developer
 Send message
Joined: 4 Jul 06 Posts: 921 ID: 3110 Credit: 218,950,902 RAC: 3,798
                          
|
|
Getting better. It seems most of my problem was self-inflicted. I read something about how to get rid of the warning LNK4098 (defaultlib "LIBCMT" conflicts with use of other libs; use /NODEFAULTLIB:library). It involved not loading a bunch of default libraries - hence the linker errors.
Another large part was solved by linking the CUDA libraries. I'm down to two unresolved references, which are probably just because those functions aren't in MSVC.
Thanks! I'll let you know if I hit any more roadblocks.
____________
|
|
|
Ken_g6 Volunteer developer
 Send message
Joined: 4 Jul 06 Posts: 921 ID: 3110 Credit: 218,950,902 RAC: 3,798
                          
|
|
Alright, I think I have a working binary! So if you have Win32 or Win64 and want to test it, please download my source and binary zipfile, and run the usual test on the binary in the Release folder.
Next steps include making it work for BOINC and fixing a checkpointing bug in *all other versions*. Don't let me forget to do that!
____________
|
|
|
|
|
|
I downloaded the zipfile and tried to run the exe in the Release folder. I got a message that the program can't start because cudart.dll is missing from my computer. I think I may have found a place to get cudart.dll. Do I need to get it and put it in the directory with the exe, or is there something else I need to do?
____________
|
|
|
|
|
|
Just for fun I ran your Windows CUDA ppsieve on my Win32 XP machine.. with NO CUDA card.
(yes, I know)
After grabbing a cudart.dll (from the distributed.net beta client),
it ran like this..
ppsieve version cuda-0.1.1 (testing)
nstart=76, nstep=32, gpu_nstep=32
ppsieve initialized: 1201 <= k <= 9999, 76 <= n <= 2000000
Sieve started: 42070000000000 <= p < 42070003000000
Thread 0 starting
Detected GPU 0: Device Emulation (CPU)
Detected compute capability: 9999.9999
Detected 16 multiprocessors.
Insufficient available memory on GPU 0.
Waiting for threads to exit
Sieve incomplete: 42070000000000 <= p < 42070000000001
Found 0 factors
count=0,sum=0x0000000000000000
Elapsed time: 0.03 sec. (0.03 init + 0.00 sieve) at -1 p/sec.
Processor time: 0.05 sec. (0.05 init + 0.00 sieve) at -1 p/sec.
Average processor utilization: 1.50 (init), -1.#J (sieve)
so.. it didn't fail, which is good. :)
As a comparison.. the dnetc client goes like
distributed.net client for CUDA 2.2 on Win32 Copyright 1997-2009, distributed.net
Please visit http://www.distributed.net/ for up-to-date contest information.
Start the client with '-help' for a list of valid command line options.
dnetc v2.9107-516-CTR-09122712 for CUDA 2.2 on Win32 (WindowsNT 5.1).
Please provide the *entire* version descriptor when submitting bug reports.
The distributed.net bug report pages are at http://bugs.distributed.net/
[Jun 04 01:04:36 UTC] Unable to locate CUDA module handle
[Jun 04 01:04:36 UTC] No CUDA-supported GPU found.
|
|
|
Scott Brown Volunteer moderator Project administrator Volunteer tester Project scientist
 Send message
Joined: 17 Oct 05 Posts: 2258 ID: 1178 Credit: 10,867,108,087 RAC: 11,866,263
                                        
|
I downloaded the zipfile and tried to run the exe in the Release folder. I got a message that the program can't start because cudart.dll is missing from my computer. I think I may have found a place to get cudart.dll. Do I need to get it and put it in the directory with the exe, or is there something else I need to do?
You should be able to copy it from the BOINC directory.
____________
141941*2^4299438-1 is prime!
|
|
|
Scott Brown Volunteer moderator Project administrator Volunteer tester Project scientist
 Send message
Joined: 17 Oct 05 Posts: 2258 ID: 1178 Credit: 10,867,108,087 RAC: 11,866,263
                                        
|
|
i7-920 (stock clocks)
6GB RAM
Vista Home Premium 64-bit [Version 6.0.6002]
BOINC suspended for tests
9500GT (512mb, factory OC card, 191.07 driver)
C:\Users\Scott\Downloads\ppsieve-cuda-vc\Release>ppsieve-cuda.exe -p42070e9 -P42070030e6 -k 1201 -K 9999 -N 2000000 -z normal -q
ppsieve version cuda-0.1.1 (testing)
nstart=76, nstep=32, gpu_nstep=32
ppsieve initialized: 1201 <= k <= 9999, 76 <= n <= 2000000
Sieve started: 42070000000000 <= p < 42070030000000
Thread 0 starting
Detected GPU 0: GeForce 9500 GT
Detected compute capability: 1.1
Detected 4 multiprocessors.
p=42070007340033, 1.490K p/sec, 0.01 CPU cores, 24.5% done. ETA 04 Jun 00:09
all factors match.
____________
141941*2^4299438-1 is prime!
|
|
|
Ken_g6 Volunteer developer
 Send message
Joined: 4 Jul 06 Posts: 921 ID: 3110 Credit: 218,950,902 RAC: 3,798
                          
|
p=42070007340033, 1.490K p/sec, 0.01 CPU cores, 24.5% done. ETA 04 Jun 00:09
all factors match.
I did a spit take! But then I realized that's the wrong line. What did the line that starts with "Elapsed time" say?
____________
|
|
|
Scott Brown Volunteer moderator Project administrator Volunteer tester Project scientist
 Send message
Joined: 17 Oct 05 Posts: 2258 ID: 1178 Credit: 10,867,108,087 RAC: 11,866,263
                                        
|
p=42070007340033, 1.490K p/sec, 0.01 CPU cores, 24.5% done. ETA 04 Jun 00:09
all factors match.
I did a spit take! But then I realized that's the wrong line. What did the line that starts with "Elapsed time" say?
Sorry, I stopped it at about 25% (I am switching the card out this evening for an ATI 4670 that I just picked up). A wall-clock estimate for the total run time, based on the 25% complete, would be in the neighborhood of 3-3.5 hours. Also, interestingly, I had very little delayed screen response.
I am at home, but tomorrow I can test it on 32-bit systems with various CUDA cards (9600 GSO, 9600GS, 8600 GT, 8400 GS, 8300 GS). Might try my laptop's 8400M GS tonight...is there a memory minimum limit?
EDIT:
Okay, before pulling the 9500GT, I have gone back and run the shorter 3M test with the following output/results:
ppsieve-cuda.exe -p42070e9 -P42070003e6 -k 1201 -K 9999 -N 2000000 -z normal
ppsieve version cuda-0.1.1 (testing)
nstart=76, nstep=32, gpu_nstep=32
ppsieve initialized: 1201 <= k <= 9999, 76 <= n <= 2000000
Sieve started: 42070000000000 <= p < 42070003000000
Ignoring invalid checkpoint in ppcheck42070e9.txt
Thread 0 starting
Detected GPU 0: GeForce 9500 GT
Detected compute capability: 1.1
Detected 4 multiprocessors.
42070000070587 | 9475*2^197534+1
42070000198537 | 3373*2^1046686+1
42070000300049 | 9139*2^461846+1
42070000345343 | 1715*2^635711+1
42070000464001 | 4179*2^1577462+1
42070000949861 | 4707*2^571847+1
42070001011573 | 7113*2^215532+1
p=42070001310721, 21.85K p/sec, 0.11 CPU cores, 43.7% done. ETA 03 Jun 22:38
42070001040127 | 6471*2^37907+1
p=42070001572865, 4.369K p/sec, 0.04 CPU cores, 52.4% done. ETA 03 Jun 22:40
p=42070002097153, 8.738K p/sec, 0.03 CPU cores, 69.9% done. ETA 03 Jun 22:40
p=42070002359297, 4.369K p/sec, 0.03 CPU cores, 78.6% done. ETA 03 Jun 22:41
p=42070002621441, 4.369K p/sec, 0.02 CPU cores, 87.4% done. ETA 03 Jun 22:42
42070002482267 | 9951*2^1920408+1
42070002690167 | 2553*2^1888870+1
42070002698543 | 4239*2^368773+1
p=42070003145729, 3.757K p/sec, 0.02 CPU cores, 104.9% done. ETA 03 Jun 22:43
42070002875941 | 4081*2^1494668+1
Thread 0 completed
Waiting for threads to exit
Sieve complete: 42070000000000 <= p < 42070003000000
Found 12 factors
count=95668,sum=0x37dacb7121ccffe4
Elapsed time: 496.42 sec. (0.03 init + 496.38 sieve) at 6337 p/sec.
Processor time: 18.74 sec. (0.05 init + 18.69 sieve) at 168320 p/sec.
Average processor utilization: 1.38 (init), 0.04 (sieve)
____________
141941*2^4299438-1 is prime!
|
|
|
|
|
|
i7-920 @ 2.8 GHz
6GB RAM
Win7-64
GTX 260 Core 216 (Factory OC)
BOINC suspended for all tests
D:\Patrick\ppsieve-cuda-vc\Release>ppsieve-cuda.exe -p42070e9 -P42070030e6 -k 1201 -K 9999 -N 2000000 -z normal -q
ppsieve version cuda-0.1.1 (testing)
nstart=76, nstep=32, gpu_nstep=32
ppsieve initialized: 1201 <= k <= 9999, 76 <= n <= 2000000
Sieve started: 42070000000000 <= p < 42070030000000
Thread 0 starting
Detected GPU 0: GeForce GTX 260
Detected compute capability: 1.3
Detected 27 multiprocessors.
p=42070030146561, 17.48K p/sec, 0.03 CPU cores, 100.5% done. ETA 03 Jun 21:40
Thread 0 completed
Waiting for threads to exit
Sieve complete: 42070000000000 <= p < 42070030000000
Found 97 factors
count=955289,sum=0x2dbc17167afb6a8d
Elapsed time: 842.70 sec. (0.03 init + 842.66 sieve) at 35775 p/sec.
Processor time: 37.74 sec. (0.03 init + 37.71 sieve) at 799528 p/sec.
Average processor utilization: 0.97 (init), 0.04 (sieve)
I swiped the cudart.dll from Collatz...
I am at home, but tomorrow I can test it on 32-bit systems with various CUDA cards (9600 GSO, 9600GS, 8600 GT, 8400 GS, 8300 GS). Might try my laptop's 8400M GS tonight...is there a memory minimum limit?
I am also curious as to what this may be... it depends on what version of the CUDA SDK this was compiled with. Newer versions will run considerably faster on newer cards and include increased capabilities (double precision, anyone?)
I'll try getting the cudart.dll from a project like GPUGrid or Milkyway, which both use at least CUDA 2.2 (due to double precision support) and see what, if any, difference that makes...
EDIT: Whoa, put my foot in my mouth a bit there...Collatz would use 2.2...my bad... :p
And also, when I switched the cudart.dll with the one from GPUGrid, it made NO difference whatsoever...
____________
|
|
|
|
|
|
Here's my result from the shorter test Scott posted above:
D:\Patrick\ppsieve-cuda-vc\Release>ppsieve-cuda.exe -p42070e9 -P42070003e6 -k 1201 -K 9999 -N 2000000 -z normal
ppsieve version cuda-0.1.1 (testing)
nstart=76, nstep=32, gpu_nstep=32
ppsieve initialized: 1201 <= k <= 9999, 76 <= n <= 2000000
Sieve started: 42070000000000 <= p < 42070003000000
Thread 0 starting
Detected GPU 0: GeForce GTX 260
Detected compute capability: 1.3
Detected 27 multiprocessors.
42070000070587 | 9475*2^197534+1
42070000198537 | 3373*2^1046686+1
42070000300049 | 9139*2^461846+1
42070000345343 | 1715*2^635711+1
42070000464001 | 4179*2^1577462+1
42070000949861 | 4707*2^571847+1
42070001011573 | 7113*2^215532+1
42070001040127 | 6471*2^37907+1
42070002482267 | 9951*2^1920408+1
42070002690167 | 2553*2^1888870+1
42070002698543 | 4239*2^368773+1
42070002875941 | 4081*2^1494668+1
Thread 0 completed
Waiting for threads to exit
Sieve complete: 42070000000000 <= p < 42070003000000
Found 12 factors
count=95668,sum=0x37dacb7121ccffe4
Elapsed time: 11.19 sec. (0.05 init + 11.14 sieve) at 282411 p/sec.
Processor time: 4.18 sec. (0.06 init + 4.12 sieve) at 763818 p/sec.
Average processor utilization: 1.30 (init), 0.37 (sieve)
____________
|
|
|
Ken_g6 Volunteer developer
 Send message
Joined: 4 Jul 06 Posts: 921 ID: 3110 Credit: 218,950,902 RAC: 3,798
                          
|
|
OK, I had thought it was running over 1MP/s; it was just 1KP/s. I think something may be wrong with my sleep timing. I'll look into it and get back to you.
____________
|
|
|
|
|
OK, I had thought it was running over 1MP/s; it was just 1KP/s. I think something may be wrong with my sleep timing. I'll look into it and get back to you.
I did notice (through watching GPU usage on EVGA Precision) that the GPU usage never stayed constant...it would spike for a second or two to around 75% and then fall to zero for about 10-20 seconds....
Hope that helps!
And BTW, thanks, Ken, for building a Windows version! It seems like it has some more ground to cover to catch up with the Linux builds, but great job nonetheless!
____________
|
|
|
Ken_g6 Volunteer developer
 Send message
Joined: 4 Jul 06 Posts: 921 ID: 3110 Credit: 218,950,902 RAC: 3,798
                          
|
|
Well whaddaya know - I found a second bug, also in code I had thought was stable. That makes two bugs that - while they didn't affect results - could severely impact usability.
Alright, give the newly-updated zipfile a try and I'll see what's developed tomorrow. Thanks for testing!
____________
|
|
|
Scott Brown Volunteer moderator Project administrator Volunteer tester Project scientist
 Send message
Joined: 17 Oct 05 Posts: 2258 ID: 1178 Credit: 10,867,108,087 RAC: 11,866,263
                                        
|
|
Updated code on 9500GT (short test - 3M):
ppsieve-cuda.exe -p42070e9 -P42070003e6 -k 1201 -K 9999 -N 2000000 -z normal
ppsieve version cuda-0.1.1 (testing)
nstart=76, nstep=32, gpu_nstep=32
ppsieve initialized: 1201 <= k <= 9999, 76 <= n <= 2000000
Sieve started: 42070000000000 <= p < 42070003000000
Thread 0 starting
Detected GPU 0: GeForce 9500 GT
Detected compute capability: 1.1
Detected 4 multiprocessors.
42070000070587 | 9475*2^197534+1
42070000198537 | 3373*2^1046686+1
42070000300049 | 9139*2^461846+1
42070000345343 | 1715*2^635711+1
42070000464001 | 4179*2^1577462+1
42070000949861 | 4707*2^571847+1
42070001011573 | 7113*2^215532+1
p=42070001310721, 21.85K p/sec, 0.11 CPU cores, 43.7% done. ETA 04 Jun 01:14
42070001040127 | 6471*2^37907+1
p=42070001572865, 4.369K p/sec, 0.04 CPU cores, 52.4% done. ETA 04 Jun 01:16
p=42070002097153, 8.738K p/sec, 0.03 CPU cores, 69.9% done. ETA 04 Jun 01:16
p=42070002359297, 4.369K p/sec, 0.03 CPU cores, 78.6% done. ETA 04 Jun 01:17
p=42070002621441, 4.369K p/sec, 0.02 CPU cores, 87.4% done. ETA 04 Jun 01:17
42070002482267 | 9951*2^1920408+1
42070002690167 | 2553*2^1888870+1
42070002698543 | 4239*2^368773+1
p=42070003145729, 3.763K p/sec, 0.02 CPU cores, 104.9% done. ETA 04 Jun 01:19
42070002875941 | 4081*2^1494668+1
Thread 0 completed
Waiting for threads to exit
Sieve complete: 42070000000000 <= p < 42070003000000
Found 12 factors
count=95668,sum=0x37dacb7121ccffe4
Elapsed time: 496.11 sec. (0.04 init + 496.07 sieve) at 6341 p/sec.
Processor time: 18.77 sec. (0.05 init + 18.72 sieve) at 168040 p/sec.
Average processor utilization: 1.06 (init), 0.04 (sieve)
____________
141941*2^4299438-1 is prime!
|
|
|
valterc Volunteer tester Send message
Joined: 30 May 07 Posts: 121 ID: 8810 Credit: 7,411,462,353 RAC: 9,377,891
                     
|
|
My own test follows: Q9450@3400, W7U (cudart v2.3)
ppsieve-cuda.exe -p42070e9 -P42070003e6 -k 1201 -K 9999 -N 2000000 -z normal
ppsieve version cuda-0.1.1 (testing)
nstart=76, nstep=32, gpu_nstep=32
ppsieve initialized: 1201 <= k <= 9999, 76 <= n <= 2000000
Sieve started: 42070000000000 <= p < 42070003000000
Thread 0 starting
Detected GPU 0: GeForce GTX 275
Detected compute capability: 1.3
Detected 30 multiprocessors.
42070000070587 | 9475*2^197534+1
42070000198537 | 3373*2^1046686+1
42070000300049 | 9139*2^461846+1
42070000345343 | 1715*2^635711+1
42070000464001 | 4179*2^1577462+1
42070000949861 | 4707*2^571847+1
42070001011573 | 7113*2^215532+1
42070001040127 | 6471*2^37907+1
42070002482267 | 9951*2^1920408+1
42070002690167 | 2553*2^1888870+1
42070002698543 | 4239*2^368773+1
42070002875941 | 4081*2^1494668+1
Thread 0 completed
Waiting for threads to exit
Sieve complete: 42070000000000 <= p < 42070003000000
Found 12 factors
count=95668,sum=0x37dacb7121ccffe4
Elapsed time: 9.75 sec. (0.02 init + 9.73 sieve) at 323155 p/sec.
Processor time: 3.56 sec. (0.03 init + 3.53 sieve) at 892248 p/sec.
Average processor utilization: 2.00 (init), 0.36 (sieve)
|
|
|
Scott Brown Volunteer moderator Project administrator Volunteer tester Project scientist
 Send message
Joined: 17 Oct 05 Posts: 2258 ID: 1178 Credit: 10,867,108,087 RAC: 11,866,263
                                        
|
|
Pentium D 965 Extreme Edition (HT turned on)
ASUS 9600 GSO (factory OC "TOP" version, 384mb)
Microsoft Windows XP Pro (32-bit) [Version 5.1.2600]
ppsieve-cuda.exe -p42070e9 -P42070003e6 -k 1201 -K 9999 -N 2000000 -z normal
ppsieve version cuda-0.1.1 (testing)
nstart=76, nstep=32, gpu_nstep=32
ppsieve initialized: 1201 <= k <= 9999, 76 <= n <= 2000000
Sieve started: 42070000000000 <= p < 42070003000000
Thread 0 starting
Detected GPU 0: GeForce 9600 GSO
Detected compute capability: 1.1
Detected 12 multiprocessors.
42070000070587 | 9475*2^197534+1
42070000198537 | 3373*2^1046686+1
42070000300049 | 9139*2^461846+1
42070000345343 | 1715*2^635711+1
42070000464001 | 4179*2^1577462+1
42070000949861 | 4707*2^571847+1
42070001011573 | 7113*2^215532+1
42070001040127 | 6471*2^37907+1
42070002482267 | 9951*2^1920408+1
42070002690167 | 2553*2^1888870+1
42070002698543 | 4239*2^368773+1
42070002875941 | 4081*2^1494668+1
Thread 0 completed
Waiting for threads to exit
Sieve complete: 42070000000000 <= p < 42070003000000
Found 12 factors
count=95668,sum=0x37dacb7121ccffe4
Elapsed time: 6.56 sec. (0.05 init + 6.52 sieve) at 482798 p/sec.
Processor time: 1.39 sec. (0.06 init + 1.33 sieve) at 2368548 p/sec.
Average processor utilization: 1.33 (init), 0.20 (sieve)
____________
141941*2^4299438-1 is prime!
|
|
|
Scott Brown Volunteer moderator Project administrator Volunteer tester Project scientist
 Send message
Joined: 17 Oct 05 Posts: 2258 ID: 1178 Credit: 10,867,108,087 RAC: 11,866,263
                                        
|
|
Same 9600 GSO on 30M test:
ppsieve-cuda.exe -p42070e9 -P42070030e6 -k 1201 -K 9999 -N 2000000 -z normal -q
ppsieve version cuda-0.1.1 (testing)
nstart=76, nstep=32, gpu_nstep=32
ppsieve initialized: 1201 <= k <= 9999, 76 <= n <= 2000000
Sieve started: 42070000000000 <= p < 42070030000000
Thread 0 starting
Detected GPU 0: GeForce 9600 GSO
Detected compute capability: 1.1
Detected 12 multiprocessors.
p=42070029360129, 489.3K p/sec, 0.19 CPU cores, 97.9% done. ETA 04 Jun 08:16
Thread 0 completed
Waiting for threads to exit
Sieve complete: 42070000000000 <= p < 42070030000000
Found 97 factors
count=955289,sum=0x2dbc17167afb6a8d
Elapsed time: 62.28 sec. (0.05 init + 62.23 sieve) at 484404 p/sec.
Processor time: 12.30 sec. (0.25 init + 12.05 sieve) at 2502438 p/sec.
Average processor utilization: 5.33 (init), 0.19 (sieve)
____________
141941*2^4299438-1 is prime!
|
|
|
Michael Goetz Volunteer moderator Project administrator
 Send message
Joined: 21 Jan 10 Posts: 13633 ID: 53948 Credit: 280,904,358 RAC: 40,710
                           
|
|
I ran the test on my Vista-32/Q6600/GTX280.
I won't bother posting the output from the program, because that's not the interesting part.
My GPU temperature *barely* nudged from its idle temperature. That's a really, really, bad sign and indicates that the GPU isn't being utilized efficiently.
Taking a look at the GPU utilization graph on GPU-Z, it showed that the vast majority of the time the utilization was 0%. About every 15 seconds or so, the utilization briefly spiked way up, then returned back to 0. Even stranger was that it wasn't using the CPU during the time the GPU was idle. CPU utilization was at about 10% to 20% of a single core according to task manager. (The output from the program said it was using 0.03 CPU cores, which was significantly lower than what task manager was showing.)
So, for most of the run time, it's not using the GPU or the CPU. I would guess that it's either waiting on a resource or sleeping.
____________
My lucky number is 75898^524288+1
|
|
|
Scott Brown Volunteer moderator Project administrator Volunteer tester Project scientist
 Send message
Joined: 17 Oct 05 Posts: 2258 ID: 1178 Credit: 10,867,108,087 RAC: 11,866,263
                                        
|
|
Pentium D 830 (using the --device # option to test on both GPUs)
9600 GSO (ASUS factory OC "TOP" version, 384mb)
9600 GS (768 mb)
Microsoft Windows XP Home (32-bit) [Version 5.1.2600]
DEVICE 0:
ppsieve-cuda.exe -p42070e9 -P42070003e6 -k 1201 -K 9999 -N 2000000 -z normal
ppsieve version cuda-0.1.1 (testing)
nstart=76, nstep=32, gpu_nstep=32
ppsieve initialized: 1201 <= k <= 9999, 76 <= n <= 2000000
Sieve started: 42070000000000 <= p < 42070003000000
Thread 0 starting
Detected GPU 0: GeForce 9600 GSO
Detected compute capability: 1.1
Detected 12 multiprocessors.
42070000070587 | 9475*2^197534+1
42070000198537 | 3373*2^1046686+1
42070000300049 | 9139*2^461846+1
42070000345343 | 1715*2^635711+1
42070000464001 | 4179*2^1577462+1
42070000949861 | 4707*2^571847+1
42070001011573 | 7113*2^215532+1
42070001040127 | 6471*2^37907+1
42070002482267 | 9951*2^1920408+1
42070002690167 | 2553*2^1888870+1
42070002698543 | 4239*2^368773+1
42070002875941 | 4081*2^1494668+1
Thread 0 completed
Waiting for threads to exit
Sieve complete: 42070000000000 <= p < 42070003000000
Found 12 factors
count=95668,sum=0x37dacb7121ccffe4
Elapsed time: 6.72 sec. (0.08 init + 6.64 sieve) at 473710 p/sec.
Processor time: 1.66 sec. (0.11 init + 1.55 sieve) at 2033602 p/sec.
Average processor utilization: 1.40 (init), 0.23 (sieve)
DEVICE 1:
ppsieve-cuda.exe -p42070e9 -P42070003e6 -k 1201 -K 9999 -N 2000000 -z normal --device 1
ppsieve version cuda-0.1.1 (testing)
nstart=76, nstep=32, gpu_nstep=32
ppsieve initialized: 1201 <= k <= 9999, 76 <= n <= 2000000
Sieve started: 42070000000000 <= p < 42070003000000
Thread 0 starting
Detected GPU 1: GeForce 9600 GS
Detected compute capability: 1.1
Detected 6 multiprocessors.
42070000070587 | 9475*2^197534+1
42070000198537 | 3373*2^1046686+1
42070000300049 | 9139*2^461846+1
42070000345343 | 1715*2^635711+1
42070000464001 | 4179*2^1577462+1
42070000949861 | 4707*2^571847+1
42070001011573 | 7113*2^215532+1
42070001040127 | 6471*2^37907+1
42070002482267 | 9951*2^1920408+1
42070002690167 | 2553*2^1888870+1
42070002698543 | 4239*2^368773+1
42070002875941 | 4081*2^1494668+1
Thread 0 completed
Waiting for threads to exit
Sieve complete: 42070000000000 <= p < 42070003000000
Found 12 factors
count=95668,sum=0x37dacb7121ccffe4
Elapsed time: 15.97 sec. (0.05 init + 15.92 sieve) at 197573 p/sec.
Processor time: 2.78 sec. (0.09 init + 2.69 sieve) at 1170503 p/sec.
Average processor utilization: 2.00 (init), 0.17 (sieve)
____________
141941*2^4299438-1 is prime!
|
|
|
Scott Brown Volunteer moderator Project administrator Volunteer tester Project scientist
 Send message
Joined: 17 Oct 05 Posts: 2258 ID: 1178 Credit: 10,867,108,087 RAC: 11,866,263
                                        
|
I ran the test on my Vista-32/Q6600/GTX280.
I won't bother posting the output from the program, because that's not the interesting part.
My GPU temperature *barely* nudged from its idle temperature. That's a really, really, bad sign and indicates that the GPU isn't being utilized efficiently.
Taking a look at the GPU utilization graph on GPU-Z, it showed that the vast majority of the time the utilization was 0%. About every 15 seconds or so, the utilization briefly spiked way up, then returned back to 0. Even stranger was that it wasn't using the CPU during the time the GPU was idle. CPU utilization was at about 10% to 20% of a single core according to task manager. (The output from the program said it was using 0.03 CPU cores, which was significantly lower than what task manager was showing.)
So, for most of the run time, it's not using the GPU or the CPU. I would guess that it's either waiting on a resource or sleeping.
Hmmm...this might show something about Vista specifically. On my 9600 GSO under 32-bit XP Pro, GPU-Z shows the GPU utilization at 99% for the whole test. My unusually long 9500GT results (which, on that OC'ed 32-shader card, should be similar to the stock-clocked 48-shader 9600 GS) were also obtained on Vista (albeit 64-bit). It looks like something in the code is not activating the GPU properly under Vista (and, I'd suspect, under Win 7 also).
____________
141941*2^4299438-1 is prime!
|
|
|
Scott Brown Volunteer moderator Project administrator Volunteer tester Project scientist
 Send message
Joined: 17 Oct 05 Posts: 2258 ID: 1178 Credit: 10,867,108,087 RAC: 11,866,263
                                        
|
|
9600 GS on 30M test:
ppsieve-cuda.exe -p42070e9 -P42070030e6 -k 1201 -K 9999 -N 2000000 -z normal -q --device 1
ppsieve version cuda-0.1.1 (testing)
nstart=76, nstep=32, gpu_nstep=32
ppsieve initialized: 1201 <= k <= 9999, 76 <= n <= 2000000
Sieve started: 42070000000000 <= p < 42070030000000
Thread 0 starting
Detected GPU 1: GeForce 9600 GS
Detected compute capability: 1.1
Detected 6 multiprocessors.
p=42070024641537, 204.4K p/sec, 0.15 CPU cores, 82.1% done. ETA 04 Jun 08:42
Thread 0 completed
Waiting for threads to exit
Sieve complete: 42070000000000 <= p < 42070030000000
Found 97 factors
count=955289,sum=0x2dbc17167afb6a8d
Elapsed time: 147.89 sec. (0.06 init + 147.83 sieve) at 203930 p/sec.
Processor time: 22.25 sec. (0.09 init + 22.16 sieve) at 1360635 p/sec.
Average processor utilization: 1.50 (init), 0.15 (sieve)
____________
141941*2^4299438-1 is prime!
|
|
|
|
|
|
I redownloaded the zipfile, got the cudart.dll. Now I get this:
The application was unable to start correctly (0x000007b).
Any idea what's causing this?
____________
|
|
|
Scott Brown Volunteer moderator Project administrator Volunteer tester Project scientist
 Send message
Joined: 17 Oct 05 Posts: 2258 ID: 1178 Credit: 10,867,108,087 RAC: 11,866,263
                                        
|
I redownloaded the zipfile, got the cudart.dll. Now I get this:
The application was unable to start correctly (0x000007b).
Any idea what's causing this?
A stop error with that code is usually associated with a problematic boot device (usually a hard drive)...kinda weird to see it with this CUDA application. You aren't by chance trying to run it off of a USB stick?
____________
141941*2^4299438-1 is prime!
|
|
|
|
|
|
No USB stick. I've actually tried running it on several different computers, just in case it was a problem with the NVIDIA card. All three computers were running Windows 7.
____________
|
|
|
Scott Brown Volunteer moderator Project administrator Volunteer tester Project scientist
 Send message
Joined: 17 Oct 05 Posts: 2258 ID: 1178 Credit: 10,867,108,087 RAC: 11,866,263
                                        
|
|
Pentium 4 (HT) 3.6Ghz
8600 GT (256mb)
Microsoft Windows XP Pro (32-bit) [Version 5.1.2600]
3M Test:
ppsieve-cuda.exe -p42070e9 -P42070003e6 -k 1201 -K 9999 -N 2000000 -z normal
ppsieve version cuda-0.1.1 (testing)
nstart=76, nstep=32, gpu_nstep=32
ppsieve initialized: 1201 <= k <= 9999, 76 <= n <= 2000000
Sieve started: 42070000000000 <= p < 42070003000000
Thread 0 starting
Detected GPU 0: GeForce 8600 GT
Detected compute capability: 1.1
Detected 4 multiprocessors.
42070000070587 | 9475*2^197534+1
42070000198537 | 3373*2^1046686+1
42070000300049 | 9139*2^461846+1
42070000345343 | 1715*2^635711+1
42070000464001 | 4179*2^1577462+1
42070000949861 | 4707*2^571847+1
42070001011573 | 7113*2^215532+1
42070001040127 | 6471*2^37907+1
42070002482267 | 9951*2^1920408+1
42070002690167 | 2553*2^1888870+1
42070002698543 | 4239*2^368773+1
42070002875941 | 4081*2^1494668+1
Thread 0 completed
Waiting for threads to exit
Sieve complete: 42070000000000 <= p < 42070003000000
Found 12 factors
count=95668,sum=0x37dacb7121ccffe4
Elapsed time: 18.92 sec. (0.05 init + 18.88 sieve) at 166660 p/sec.
Processor time: 3.31 sec. (0.08 init + 3.23 sieve) at 972592 p/sec.
Average processor utilization: 1.67 (init), 0.17 (sieve)
30M Test:
ppsieve-cuda.exe -p42070e9 -P42070030e6 -k 1201 -K 9999 -N 2000000 -z normal -q
ppsieve version cuda-0.1.1 (testing)
nstart=76, nstep=32, gpu_nstep=32
ppsieve initialized: 1201 <= k <= 9999, 76 <= n <= 2000000
Sieve started: 42070000000000 <= p < 42070030000000
Thread 0 starting
Detected GPU 0: GeForce 8600 GT
Detected compute capability: 1.1
Detected 4 multiprocessors.
p=42070020185089, 165.3K p/sec, 0.17 CPU cores, 67.3% done. ETA 04 Jun 11:30
Thread 0 completed
Waiting for threads to exit
Sieve complete: 42070000000000 <= p < 42070030000000
Found 97 factors
count=955289,sum=0x2dbc17167afb6a8d
Elapsed time: 180.02 sec. (0.05 init + 179.97 sieve) at 167509 p/sec.
Processor time: 30.97 sec. (0.28 init + 30.69 sieve) at 982373 p/sec.
Average processor utilization: 6.00 (init), 0.17 (sieve)
____________
141941*2^4299438-1 is prime!
|
|
|
Scott Brown Volunteer moderator Project administrator Volunteer tester Project scientist
 Send message
Joined: 17 Oct 05 Posts: 2258 ID: 1178 Credit: 10,867,108,087 RAC: 11,866,263
                                        
|
|
Pentium 4 (HT) 3.8 Ghz
8400 GS (256mb)
Microsoft Windows XP Pro (32-bit) [Version 5.1.2600]
3M Test:
ppsieve-cuda.exe -p42070e9 -P42070003e6 -k 1201 -K 9999 -N 2000000 -z normal
ppsieve version cuda-0.1.1 (testing)
nstart=76, nstep=32, gpu_nstep=32
ppsieve initialized: 1201 <= k <= 9999, 76 <= n <= 2000000
Sieve started: 42070000000000 <= p < 42070003000000
Thread 0 starting
Detected GPU 0: GeForce 8400 GS
Detected compute capability: 1.1
Detected 2 multiprocessors.
42070000070587 | 9475*2^197534+1
42070000198537 | 3373*2^1046686+1
42070000300049 | 9139*2^461846+1
42070000345343 | 1715*2^635711+1
42070000464001 | 4179*2^1577462+1
42070000949861 | 4707*2^571847+1
42070001011573 | 7113*2^215532+1
42070001040127 | 6471*2^37907+1
42070002482267 | 9951*2^1920408+1
42070002690167 | 2553*2^1888870+1
42070002698543 | 4239*2^368773+1
p=42070002883585, 48.06K p/sec, 0.10 CPU cores, 96.1% done. ETA 04 Jun 11:43
42070002875941 | 4081*2^1494668+1
Thread 0 completed
Waiting for threads to exit
Sieve complete: 42070000000000 <= p < 42070003000000
Found 12 factors
count=95668,sum=0x37dacb7121ccffe4
Elapsed time: 63.83 sec. (0.05 init + 63.78 sieve) at 49319 p/sec.
Processor time: 6.34 sec. (0.20 init + 6.14 sieve) at 512281 p/sec.
Average processor utilization: 4.33 (init), 0.10 (sieve)
30M Test:
ppsieve-cuda.exe -p42070e9 -P42070030e6 -k 1201 -K 9999 -N 2000000 -z normal -q
ppsieve version cuda-0.1.1 (testing)
nstart=76, nstep=32, gpu_nstep=32
ppsieve initialized: 1201 <= k <= 9999, 76 <= n <= 2000000
Sieve started: 42070000000000 <= p < 42070030000000
Thread 0 starting
Detected GPU 0: GeForce 8400 GS
Detected compute capability: 1.1
Detected 2 multiprocessors.
p=42070028835841, 47.17K p/sec, 0.09 CPU cores, 96.1% done. ETA 04 Jun 11:54
Thread 0 completed
Waiting for threads to exit
Sieve complete: 42070000000000 <= p < 42070030000000
Found 97 factors
count=955289,sum=0x2dbc17167afb6a8d
Elapsed time: 632.41 sec. (0.05 init + 632.36 sieve) at 47673 p/sec.
Processor time: 59.66 sec. (0.23 init + 59.42 sieve) at 507331 p/sec.
Average processor utilization: 5.00 (init), 0.09 (sieve)
____________
141941*2^4299438-1 is prime!
|
|
|
Scott Brown Volunteer moderator Project administrator Volunteer tester Project scientist
 Send message
Joined: 17 Oct 05 Posts: 2258 ID: 1178 Credit: 10,867,108,087 RAC: 11,866,263
                                        
|
|
Pentium 4 (HT) 3.6Ghz
8300 GS (128mb) ...This is about as slow as CUDA devices get!
Microsoft Windows XP Pro (32-bit) [Version 5.1.2600]
3M Test:
ppsieve-cuda.exe -p42070e9 -P42070003e6 -k 1201 -K 9999 -N 2000000 -z normal
ppsieve version cuda-0.1.1 (testing)
nstart=76, nstep=32, gpu_nstep=32
ppsieve initialized: 1201 <= k <= 9999, 76 <= n <= 2000000
Sieve started: 42070000000000 <= p < 42070003000000
Thread 0 starting
Detected GPU 0: GeForce 8300 GS
Detected compute capability: 1.1
Detected 1 multiprocessors.
42070000070587 | 9475*2^197534+1
42070000198537 | 3373*2^1046686+1
42070000300049 | 9139*2^461846+1
42070000345343 | 1715*2^635711+1
42070000464001 | 4179*2^1577462+1
42070000949861 | 4707*2^571847+1
42070001011573 | 7113*2^215532+1
42070001040127 | 6471*2^37907+1
p=42070001572865, 26.21K p/sec, 0.10 CPU cores, 52.4% done. ETA 04 Jun 11:33
42070002482267 | 9951*2^1920408+1
42070002690167 | 2553*2^1888870+1
42070002698543 | 4239*2^368773+1
42070002875941 | 4081*2^1494668+1
Thread 0 completed
Waiting for threads to exit
Sieve complete: 42070000000000 <= p < 42070003000000
Found 12 factors
count=95668,sum=0x37dacb7121ccffe4
Elapsed time: 119.53 sec. (0.06 init + 119.47 sieve) at 26331 p/sec.
Processor time: 11.86 sec. (0.08 init + 11.78 sieve) at 267011 p/sec.
Average processor utilization: 1.25 (init), 0.10 (sieve)
30M Test:
ppsieve-cuda.exe -p42070e9 -P42070030e6 -k 1201 -K 9999 -N 2000000 -z normal -q
ppsieve version cuda-0.1.1 (testing)
nstart=76, nstep=32, gpu_nstep=32
ppsieve initialized: 1201 <= k <= 9999, 76 <= n <= 2000000
Sieve started: 42070000000000 <= p < 42070030000000
Thread 0 starting
Detected GPU 0: GeForce 8300 GS
Detected compute capability: 1.1
Detected 1 multiprocessors.
p=42070029622273, 26.21K p/sec, 0.10 CPU cores, 98.7% done. ETA 04 Jun 11:56
Thread 0 completed
Waiting for threads to exit
Sieve complete: 42070000000000 <= p < 42070030000000
Found 97 factors
count=955289,sum=0x2dbc17167afb6a8d
Elapsed time: 1186.17 sec. (0.06 init + 1186.11 sieve) at 25416 p/sec.
Processor time: 119.03 sec. (0.30 init + 118.73 sieve) at 253899 p/sec.
Average processor utilization: 4.75 (init), 0.10 (sieve)
____________
141941*2^4299438-1 is prime!
|
|
|
Ken_g6 Volunteer developer
 Send message
Joined: 4 Jul 06 Posts: 921 ID: 3110 Credit: 218,950,902 RAC: 3,798
                          
|
|
Alright, I'm getting the impression that sleep isn't fixed - at least not in all cases.
So, I'd like those of you who had problems with it in particular, and anyone else, to test the version I just uploaded, and please report the sleep diagnostics it outputs. By the way, "OVERslept by WAY TOO LONG!" is the line that indicates trouble; but I can only learn the magnitude of the trouble from the other lines.
The other option is to use one CPU core 100%, but I'd like to avoid that if I can.
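For anyone wondering what the diagnostics are measuring: the "Will sleep ... next time" lines are consistent with a simple additive feedback rule. The sketch below is a reconstruction inferred from the log output only (the function name and exact rule are assumptions, not the actual ppsieve source):

```python
# Hypothetical reconstruction of the sleep-adjustment rule implied by the
# "OVERslept by ..." / "Underslept by ..." / "Will sleep ... next time"
# diagnostics. All names here are illustrative assumptions.

def next_sleep_usec(planned, overslept, underslept):
    """Shorten the next sleep by the amount we overslept (the GPU sat idle),
    and lengthen it by the amount we underslept (the kernel was still busy)."""
    return max(planned - overslept + underslept, 0)
```

Feeding it values from the logs reproduces the reported amounts: a 590625 µs sleep with a 46875 µs oversleep yields 543750 µs for the next round, and a 403125 µs sleep with a 390625 µs undersleep yields 793750 µs.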
____________
|
|
|
HAmsty Volunteer tester
 Send message
Joined: 26 Dec 08 Posts: 132 ID: 33421 Credit: 12,510,712 RAC: 0
                
|
|
I have this oversleeping, too.
Just some lines; there are really a lot more of these:
Will sleep 590625 usec next time.
Sleeping 590625 usec.
Actually sleeping 590625 usec.
OVERslept by 46875 usec.
Will sleep 543750 usec next time.
Sleeping 543750 usec.
Actually sleeping 543750 usec.
OVERslept by 46875 usec.
Will sleep 496875 usec next time.
Sleeping 496875 usec.
Actually sleeping 496875 usec.
OVERslept by 46875 usec.
Will sleep 450000 usec next time.
Sleeping 450000 usec.
Actually sleeping 450000 usec.
OVERslept by 46875 usec.
Will sleep 403125 usec next time.
Sleeping 403125 usec.
Actually sleeping 403125 usec.
OVERslept by 15625 usec.
Will sleep 387500 usec next time.
Sleeping 387500 usec.
Actually sleeping 371875 usec.
Underslept by 0 usec.
Will sleep 387500 usec next time.
Sleeping 387500 usec.
Actually sleeping -65625 usec.
Underslept by 15625 usec.
Will sleep 403125 usec next time.
Sleeping 403125 usec.
Actually sleeping -34375 usec.
OVERslept by 15625 usec.
Will sleep 387500 usec next time.
Sleeping 387500 usec.
Actually sleeping 387500 usec.
Underslept by 0 usec.
Will sleep 387500 usec next time.
Sleeping 387500 usec.
Actually sleeping -65625 usec.
Underslept by 15625 usec.
Will sleep 403125 usec next time.
Sleeping 403125 usec.
Actually sleeping 403125 usec.
Underslept by 390625 usec.
Will sleep 793750 usec next time.
Sleeping 793750 usec.
Actually sleeping 793750 usec.
OVERslept by 46875 usec.
Will sleep 746875 usec next time.
Sleeping 746875 usec.
Actually sleeping 746875 usec.
OVERslept by 46875 usec.
Will sleep 700000 usec next time.
Sleeping 700000 usec.
Actually sleeping 700000 usec.
OVERslept by 46875 usec.
Will sleep 653125 usec next time.
Sleeping 653125 usec.
Actually sleeping 653125 usec.
OVERslept by 46875 usec.
Will sleep 606250 usec next time.
Sleeping 606250 usec.
Actually sleeping 590625 usec.
OVERslept by WAY TOO LONG!
Will sleep 543750 usec next time.
Sleeping 543750 usec.
Actually sleeping 528125 usec.
OVERslept by 46875 usec.
Will sleep 496875 usec next time.
Sleeping 496875 usec.
Actually sleeping 496875 usec.
OVERslept by WAY TOO LONG!
Will sleep 434375 usec next time.
Sleeping 434375 usec.
Actually sleeping 418750 usec.
OVERslept by 31250 usec.
Will sleep 403125 usec next time.
Sleeping 403125 usec.
Actually sleeping -34375 usec.
OVERslept by 15625 usec.
Will sleep 387500 usec next time.
Sleeping 387500 usec.
Actually sleeping -50000 usec.
Underslept by 0 usec.
Will sleep 387500 usec next time.
Sleeping 387500 usec.
Actually sleeping 371875 usec.
Underslept by 0 usec.
Will sleep 387500 usec next time.
Sleeping 387500 usec.
Actually sleeping -50000 usec.
Underslept by 0 usec.
Will sleep 387500 usec next time.
Sleeping 387500 usec.
Actually sleeping -65625 usec.
Underslept by 15625 usec.
Will sleep 403125 usec next time.
Sleeping 403125 usec.
Actually sleeping 403125 usec.
OVERslept by 15625 usec.
Will sleep 387500 usec next time.
Sleeping 387500 usec.
Actually sleeping -50000 usec.
Underslept by 0 usec.
Will sleep 387500 usec next time.
Sleeping 387500 usec.
Actually sleeping 387500 usec.
Underslept by 0 usec.
Will sleep 387500 usec next time.
Sleeping 387500 usec.
Actually sleeping -65625 usec.
Underslept by 15625 usec.
Will sleep 403125 usec next time.
Nvidia 8800 GTS 320MB G80
Thread 0 completed
Waiting for threads to exit
Sieve complete: 42070000000000 <= p < 42070030000000
Found 97 factors
count=955289,sum=0x2dbc17167afb6a8d
Elapsed time: 104.80 sec. (0.03 init + 104.77 sieve) at 287752 p/sec.
Processor time: 4.61 sec. (0.05 init + 4.56 sieve) at 6607465 p/sec.
Average processor utilization: 1.50 (init), 0.04 (sieve)
____________
|
|
|
Ken_g6 Volunteer developer
 Send message
Joined: 4 Jul 06 Posts: 921 ID: 3110 Credit: 218,950,902 RAC: 3,798
                          
|
Sleeping 403125 usec.
Actually sleeping 403125 usec.
Underslept by 390625 usec.
Will sleep 793750 usec next time.
There's the "money shot". Combined with other parts, this tells me that the timing is far too random for the current method to work.
I'll look at other options. I didn't see any timing less than 300,000 usec, so I might try finding the minimum.
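One way to "find the minimum" would be to track the shortest kernel time observed so far and sleep only slightly less than that, polling away the remainder. This is an illustrative sketch of that idea under stated assumptions (the class name and the 15625 µs margin, one Windows timer tick, are mine, not ppsieve code):

```python
# Sketch of the "find the minimum" approach: remember the shortest GPU-kernel
# duration seen so far and sleep a bit less than that, so the process never
# oversleeps past the kernel. The margin of 15625 usec corresponds to one
# default Windows timer tick; all names here are illustrative assumptions.

class MinSleepEstimator:
    def __init__(self):
        self.min_usec = None  # shortest kernel duration seen so far

    def observe(self, kernel_usec):
        # Record a completed kernel's duration, keeping the minimum.
        if self.min_usec is None or kernel_usec < self.min_usec:
            self.min_usec = kernel_usec

    def safe_sleep_usec(self, margin_usec=15625):
        # Leave one timer tick of slack; poll or busy-wait the remainder.
        if self.min_usec is None:
            return 0
        return max(self.min_usec - margin_usec, 0)
```

With the noisy per-iteration timings seen in the logs (e.g. 387500, 403125, 371875 µs), this would settle on the 371875 µs floor and sleep 356250 µs.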
____________
|
|
|
Michael Goetz Volunteer moderator Project administrator
 Send message
Joined: 21 Jan 10 Posts: 13633 ID: 53948 Credit: 280,904,358 RAC: 40,710
                           
|
|
OK, here are the results:
Vista-32/Q6600/GTX 280
BOINC shut down
C:\Temp\Release>ppsieve-cuda.exe -p42070e9 -P42070030e6 -k 1201 -K 9999 -N 2000000 -z normal -q
ppsieve version cuda-0.1.1 (testing)
nstart=76, nstep=32, gpu_nstep=32
ppsieve initialized: 1201 <= k <= 9999, 76 <= n <= 2000000
Sieve started: 42070000000000 <= p < 42070030000000
Thread 0 starting
Detected GPU 0: GeForce GTX 280
Detected compute capability: 1.3
Detected 30 multiprocessors.
Sleeping 0 usec.
Underslept by 826000 usec.
Will sleep 826000 usec next time.
Sleeping 826000 usec.
Actually sleeping 785000 usec.
Underslept by 825000 usec.
Will sleep 1651000 usec next time.
Sleeping 1651000 usec.
Actually sleeping 1628000 usec.
Underslept by 825000 usec.
Will sleep 2476000 usec next time.
Sleeping 2476000 usec.
Actually sleeping 2464000 usec.
Underslept by 825000 usec.
Will sleep 3301000 usec next time.
Sleeping 3301000 usec.
Actually sleeping 3261000 usec.
Underslept by 825000 usec.
Will sleep 4126000 usec next time.
Sleeping 4126000 usec.
Actually sleeping 4085000 usec.
Underslept by 825000 usec.
Will sleep 4951000 usec next time.
Sleeping 4951000 usec.
Actually sleeping 4933000 usec.
Underslept by 825000 usec.
Will sleep 5776000 usec next time.
Sleeping 5776000 usec.
Actually sleeping 5753000 usec.
Underslept by 825000 usec.
Will sleep 6601000 usec next time.
Sleeping 6601000 usec.
Actually sleeping 6578000 usec.
Underslept by 825000 usec.
Will sleep 7426000 usec next time.
Sleeping 7426000 usec.
Actually sleeping 7408000 usec.
Underslept by 825000 usec.
Will sleep 8251000 usec next time.
Sleeping 8251000 usec.
Actually sleeping 8208000 usec.
Underslept by 825000 usec.
Will sleep 9076000 usec next time.
Sleeping 9076000 usec.
Actually sleeping 9047000 usec.
Underslept by 825000 usec./sec, 0.17 CPU cores, 31.5% done. ETA 04 Jun 13:44
Will sleep 9901000 usec next time.
Sleeping 9901000 usec.
Actually sleeping 9872000 usec.
Underslept by 825000 usec.
Will sleep 10726000 usec next time.
Sleeping 10726000 usec.
Actually sleeping 10685000 usec.
Underslept by 826000 usec.
Will sleep 11552000 usec next time.
Sleeping 11552000 usec.
Actually sleeping 11544000 usec.
Underslept by 825000 usec.
Will sleep 12377000 usec next time.
Sleeping 12377000 usec.
Actually sleeping 12339000 usec.
Underslept by 828000 usec.
Will sleep 13205000 usec next time.
Sleeping 13205000 usec.
Actually sleeping 13146000 usec.
Underslept by 826000 usec./sec, 0.08 CPU cores, 43.7% done. ETA 04 Jun 13:45
Will sleep 14031000 usec next time.
Sleeping 14031000 usec.
Actually sleeping 14001000 usec.
Underslept by 825000 usec.
Will sleep 14856000 usec next time.
Sleeping 14856000 usec.
Actually sleeping 14831000 usec.
Underslept by 827000 usec.
Will sleep 15683000 usec next time.
Sleeping 15683000 usec.
Actually sleeping 15670000 usec.
Underslept by 825000 usec.
Will sleep 16508000 usec next time.
Sleeping 16508000 usec.
Actually sleeping 16472000 usec.
Underslept by 827000 usec./sec, 0.06 CPU cores, 53.3% done. ETA 04 Jun 13:46
Will sleep 17335000 usec next time.
Sleeping 17335000 usec.
Actually sleeping 17310000 usec.
Underslept by 826000 usec.
Will sleep 18161000 usec next time.
Sleeping 18161000 usec.
Actually sleeping 18128000 usec.
Underslept by 825000 usec.
Will sleep 18986000 usec next time.
Sleeping 18986000 usec.
Actually sleeping 18950000 usec.
Underslept by 825000 usec./sec, 0.04 CPU cores, 61.2% done. ETA 04 Jun 13:47
Will sleep 19811000 usec next time.
Sleeping 19811000 usec.
Actually sleeping 19758000 usec.
Underslept by 825000 usec.
Will sleep 20636000 usec next time.
Sleeping 20636000 usec.
Actually sleeping 20606000 usec.
Underslept by 827000 usec.
Will sleep 21463000 usec next time.
Sleeping 21463000 usec.
Actually sleeping 21432000 usec.
Underslept by 825000 usec./sec, 0.05 CPU cores, 68.2% done. ETA 04 Jun 13:48
Will sleep 22288000 usec next time.
Sleeping 22288000 usec.
Actually sleeping 22276000 usec.
Underslept by 826000 usec.
Will sleep 23114000 usec next time.
Sleeping 23114000 usec.
Actually sleeping 23096000 usec.
Underslept by 826000 usec.
Will sleep 23940000 usec next time.
Sleeping 23940000 usec.
Actually sleeping 23938000 usec.
Underslept by 825000 usec.
Will sleep 24765000 usec next time.
Sleeping 24765000 usec.
Actually sleeping 24746000 usec.
Underslept by 825000 usec.
Will sleep 25590000 usec next time.
Sleeping 25590000 usec.
Actually sleeping 25577000 usec.
Underslept by 826000 usec./sec, 0.04 CPU cores, 78.6% done. ETA 04 Jun 13:50
Will sleep 26416000 usec next time.
Sleeping 26416000 usec.
Actually sleeping 26391000 usec.
Underslept by 826000 usec.
Will sleep 27242000 usec next time.
Sleeping 27242000 usec.
Actually sleeping 27217000 usec.
Underslept by 825000 usec./sec, 0.03 CPU cores, 83.0% done. ETA 04 Jun 13:50
Will sleep 28067000 usec next time.
Sleeping 28067000 usec.
Actually sleeping 28041000 usec.
Underslept by 825000 usec.
Will sleep 28892000 usec next time.
Sleeping 28892000 usec.
Actually sleeping 28856000 usec.
Underslept by 825000 usec./sec, 0.03 CPU cores, 88.3% done. ETA 04 Jun 13:51
Will sleep 29717000 usec next time.
Sleeping 29717000 usec.
Actually sleeping 29693000 usec.
Underslept by 826000 usec.
Will sleep 30543000 usec next time.
Sleeping 30543000 usec.
Actually sleeping 30518000 usec.
Underslept by 825000 usec.
Will sleep 31368000 usec next time.
Sleeping 31368000 usec.
Actually sleeping 31345000 usec.
Underslept by 825000 usec.
Will sleep 32193000 usec next time.
Sleeping 32193000 usec.
Actually sleeping 32158000 usec.
Underslept by 824000 usec.
Will sleep 33017000 usec next time.
Sleeping 33017000 usec.
Actually sleeping 33015000 usec.
Underslept by 824000 usec./sec, 0.03 CPU cores, 97.9% done. ETA 04 Jun 13:52
Will sleep 33841000 usec next time.
Sleeping 33841000 usec.
Actually sleeping 33817000 usec.
Underslept by 826000 usec./sec, 0.02 CPU cores, 100.5% done. ETA 04 Jun 13:53
Will sleep 34667000 usec next time.
Sleeping 34667000 usec.
Actually sleeping 34621000 usec.
Underslept by 826000 usec.
Will sleep 35493000 usec next time.
Sleeping 35493000 usec.
Actually sleeping 35475000 usec.
Underslept by 824000 usec.sec, 0.03 CPU cores, 100.5% done. ETA 04 Jun 13:54
Will sleep 36317000 usec next time.
Thread 0 completed
Waiting for threads to exit
Sieve complete: 42070000000000 <= p < 42070030000000
Found 97 factors
count=955289,sum=0x2dbc17167afb6a8d
Elapsed time: 819.74 sec. (0.05 init + 819.70 sieve) at 36778 p/sec.
Processor time: 39.81 sec. (0.05 init + 39.76 sieve) at 758125 p/sec.
Average processor utilization: 1.02 (init), 0.05 (sieve)
Same GPU spiking as before; average utilization was 5%.
____________
My lucky number is 75898524288+1 |
|
|
Scott Brown Volunteer moderator Project administrator Volunteer tester Project scientist
 Send message
Joined: 17 Oct 05 Posts: 2258 ID: 1178 Credit: 10,867,108,087 RAC: 11,866,263
                                        
|
There's the "money shot". Combined with other parts, this tells me that the timing is far too random for the current method to work.
I'll look at other options. I didn't see any timing less than 300,000 usec, so I might try finding the minimum.
Here are the results from one of the 9600GSO on XP Pro which look a fair bit different from the others:
Sleeping 246875 usec.
Actually sleeping -50000 usec.
Underslept by 0 usec.
Will sleep 246875 usec next time.
Sleeping 246875 usec.
Actually sleeping -50000 usec.
Underslept by 0 usec.
Will sleep 246875 usec next time.
Sleeping 246875 usec.
Actually sleeping 246875 usec.
Underslept by 0 usec.
Will sleep 246875 usec next time.
Sleeping 246875 usec.
Actually sleeping -50000 usec.
Underslept by 0 usec.
Will sleep 246875 usec next time.
Sleeping 246875 usec.
Actually sleeping -50000 usec.
Underslept by 0 usec.
Will sleep 246875 usec next time.
Sleeping 246875 usec.
Actually sleeping -50000 usec.
Underslept by 0 usec.
Will sleep 246875 usec next time.
Sleeping 246875 usec.
Actually sleeping 246875 usec.
Underslept by 0 usec.
Will sleep 246875 usec next time.
Sleeping 246875 usec.
Actually sleeping -50000 usec.
Underslept by 0 usec.
Will sleep 246875 usec next time.
Sleeping 246875 usec.
Actually sleeping -50000 usec.
Underslept by 0 usec.
Will sleep 246875 usec next time.
Sleeping 246875 usec.
Actually sleeping 246875 usec.
Underslept by 0 usec.
Will sleep 246875 usec next time.
Sleeping 246875 usec.
Actually sleeping -50000 usec.
Underslept by 0 usec.
Will sleep 246875 usec next time.
Sleeping 246875 usec.
Actually sleeping -50000 usec.
Underslept by 15625 usec.
Will sleep 262500 usec next time.
Sleeping 262500 usec.
Actually sleeping 262500 usec.
OVERslept by 15625 usec.
Will sleep 246875 usec next time.
Sleeping 246875 usec.
Actually sleeping 246875 usec.
Underslept by 0 usec.
Will sleep 246875 usec next time.
Sleeping 246875 usec.
Actually sleeping -50000 usec.
Underslept by 0 usec.
Will sleep 246875 usec next time.
Sleeping 246875 usec.
Actually sleeping -50000 usec.
Underslept by 0 usec.
Will sleep 246875 usec next time.
Sleeping 246875 usec.
Actually sleeping 246875 usec.
Underslept by 0 usec.
Will sleep 246875 usec next time.
Sleeping 246875 usec.
Actually sleeping -65625 usec.
Underslept by 15625 usec.
Will sleep 262500 usec next time.
Sleeping 262500 usec.
Actually sleeping 262500 usec.
OVERslept by 15625 usec.
Will sleep 246875 usec next time.
Sleeping 246875 usec.
Actually sleeping -50000 usec.
Underslept by 0 usec.
Will sleep 246875 usec next time.
Sleeping 246875 usec.
Actually sleeping -50000 usec.
Underslept by 0 usec.
Will sleep 246875 usec next time.
Sleeping 246875 usec.
Actually sleeping -50000 usec.
Underslept by 0 usec.
Will sleep 246875 usec next time.
Sleeping 246875 usec.
Actually sleeping -50000 usec.
Underslept by 0 usec.
Will sleep 246875 usec next time.
Sleeping 246875 usec.
Actually sleeping -65625 usec.
Underslept by 15625 usec.
Will sleep 262500 usec next time.
Sleeping 262500 usec.
Actually sleeping -34375 usec.
OVERslept by 15625 usec.
Will sleep 246875 usec next time.
Sleeping 246875 usec.
Actually sleeping -50000 usec.
Underslept by 0 usec.
Will sleep 246875 usec next time.
Sleeping 246875 usec.
Actually sleeping -50000 usec.
Underslept by 0 usec.
Will sleep 246875 usec next time.
Sleeping 246875 usec.
Actually sleeping 231250 usec.
Underslept by 15625 usec.
Will sleep 262500 usec next time.
Sleeping 262500 usec.
Actually sleeping -34375 usec.
OVERslept by 15625 usec.
Will sleep 246875 usec next time.
Sleeping 246875 usec.
Actually sleeping -50000 usec.
Underslept by 0 usec.
Will sleep 246875 usec next time.
Sleeping 246875 usec.
Actually sleeping -50000 usec.
Underslept by 0 usec.
Will sleep 246875 usec next time.
Sleeping 246875 usec.
Actually sleeping -50000 usec.
Underslept by 0 usec.
Will sleep 246875 usec next time.
Sleeping 246875 usec.
Actually sleeping -50000 usec.
Underslept by 0 usec.
Will sleep 246875 usec next time.
Sleeping 246875 usec.
Actually sleeping -65625 usec.
Underslept by 15625 usec.
Will sleep 262500 usec next time.
Sleeping 262500 usec.
Actually sleeping -34375 usec.
OVERslept by 15625 usec.
Will sleep 246875 usec next time.
Sleeping 246875 usec.
Actually sleeping 231250 usec.
Underslept by 0 usec.
Will sleep 246875 usec next time.
Sleeping 246875 usec.
Actually sleeping 246875 usec.
Underslept by 0 usec.
Will sleep 246875 usec next time.
Sleeping 246875 usec.
Actually sleeping -50000 usec.
Underslept by 0 usec.
Will sleep 246875 usec next time.
Sleeping 246875 usec.
Actually sleeping -65625 usec.
Underslept by 15625 usec.
Will sleep 262500 usec next time.
Sleeping 262500 usec.
Actually sleeping -34375 usec.
OVERslept by 15625 usec.
Will sleep 246875 usec next time.
Sleeping 246875 usec.
Actually sleeping -50000 usec.
Underslept by 0 usec.
Will sleep 246875 usec next time.
Sleeping 246875 usec.
Actually sleeping 246875 usec.
Underslept by 0 usec.
Will sleep 246875 usec next time.
Sleeping 246875 usec.
Actually sleeping 246875 usec.
Underslept by 0 usec.
Will sleep 246875 usec next time.
Sleeping 246875 usec.
Actually sleeping -65625 usec.
Underslept by 15625 usec.
Will sleep 262500 usec next time.
Sleeping 262500 usec.
Actually sleeping -34375 usec.
OVERslept by 15625 usec.
Will sleep 246875 usec next time.
Sleeping 246875 usec.
Actually sleeping -50000 usec.
Underslept by 0 usec.
Will sleep 246875 usec next time.
Sleeping 246875 usec.
Actually sleeping -50000 usec.
Underslept by 0 usec.
Will sleep 246875 usec next time.
Sleeping 246875 usec.
Actually sleeping -50000 usec.
Underslept by 0 usec.
Will sleep 246875 usec next time.
Sleeping 246875 usec.
Actually sleeping -65625 usec.
Underslept by 15625 usec.
Will sleep 262500 usec next time.
Sleeping 262500 usec.
Actually sleeping -34375 usec.
OVERslept by 15625 usec.
Will sleep 246875 usec next time.
Sleeping 246875 usec.
Actually sleeping -50000 usec.
Underslept by 0 usec.
Will sleep 246875 usec next time.
Sleeping 246875 usec.
Actually sleeping -50000 usec.
Underslept by 0 usec.
Will sleep 246875 usec next time.
Sleeping 246875 usec.
Actually sleeping -50000 usec.
Underslept by 0 usec.
Will sleep 246875 usec next time.
Sleeping 246875 usec.
Actually sleeping -50000 usec.
Underslept by 0 usec.
Will sleep 246875 usec next time.
Sleeping 246875 usec.
Actually sleeping -50000 usec.
Underslept by 0 usec.
Will sleep 246875 usec next time.
Sleeping 246875 usec.
Actually sleeping -50000 usec.
Underslept by 0 usec.
Will sleep 246875 usec next time.
Sleeping 246875 usec.
Actually sleeping -50000 usec.
Underslept by 0 usec.
Will sleep 246875 usec next time.
Sleeping 246875 usec.
Actually sleeping -50000 usec.
Underslept by 0 usec.
Will sleep 246875 usec next time.
Sleeping 246875 usec.
Actually sleeping -65625 usec.
Underslept by 15625 usec.
Will sleep 262500 usec next time.
Sleeping 262500 usec.
p=42070028573697, 476.2K p/sec, 0.10 CPU cores, 95.2% done. ETA 04 Jun 14:53
Actually sleeping -34375 usec.
OVERslept by 15625 usec.
Will sleep 246875 usec next time.
Sleeping 246875 usec.
Actually sleeping -50000 usec.
Underslept by 0 usec.
Will sleep 246875 usec next time.
Sleeping 246875 usec.
Actually sleeping 246875 usec.
Underslept by 0 usec.
Will sleep 246875 usec next time.
Sleeping 246875 usec.
Actually sleeping -65625 usec.
Underslept by 15625 usec.
Will sleep 262500 usec next time.
Sleeping 262500 usec.
Actually sleeping -34375 usec.
OVERslept by 15625 usec.
Will sleep 246875 usec next time.
Sleeping 246875 usec.
Actually sleeping -50000 usec.
Underslept by 0 usec.
Will sleep 246875 usec next time.
Sleeping 246875 usec.
Actually sleeping -50000 usec.
Underslept by 0 usec.
Will sleep 246875 usec next time.
Sleeping 246875 usec.
Actually sleeping -50000 usec.
Underslept by 0 usec.
Will sleep 246875 usec next time.
Sleeping 246875 usec.
Actually sleeping -50000 usec.
Underslept by 0 usec.
Will sleep 246875 usec next time.
Sleeping 246875 usec.
Actually sleeping 246875 usec.
Underslept by 0 usec.
Will sleep 246875 usec next time.
Sleeping 246875 usec.
Actually sleeping -50000 usec.
Underslept by 0 usec.
Will sleep 246875 usec next time.
Sleeping 246875 usec.
Actually sleeping 246875 usec.
Underslept by 0 usec.
Will sleep 246875 usec next time.
Sleeping 246875 usec.
Actually sleeping -50000 usec.
Underslept by 0 usec.
Will sleep 246875 usec next time.
Thread 0 completed
Waiting for threads to exit
Sieve complete: 42070000000000 <= p < 42070030000000
Found 97 factors
count=955289,sum=0x2dbc17167afb6a8d
Elapsed time: 63.91 sec. (0.05 init + 63.86 sieve) at 472077 p/sec.
Processor time: 6.56 sec. (0.27 init + 6.30 sieve) at 4787543 p/sec.
Average processor utilization: 5.67 (init), 0.10 (sieve)
____________
141941*2^4299438-1 is prime!
|
|
|
Ken_g6 Volunteer developer
 Send message
Joined: 4 Jul 06 Posts: 921 ID: 3110 Credit: 218,950,902 RAC: 3,798
                          
|
|
That is just weird!
The only way I can see that happening is if computation doesn't start when the kernel is called. Since it's not happening to everyone, maybe it's a CUDA runtime dll problem?
So let's all standardize on this DLL and try again.
____________
|
|
|
Scott Brown Volunteer moderator Project administrator Volunteer tester Project scientist
 Send message
Joined: 17 Oct 05 Posts: 2258 ID: 1178 Credit: 10,867,108,087 RAC: 11,866,263
                                        
|
That is just weird!
The only way I can see that happening is if computation doesn't start when the kernel is called. Since it's not happening to everyone, maybe it's a CUDA runtime dll problem?
So let's all standardize on this DLL and try again.
Downloaded and ran, but my results look essentially the same:
Will sleep 246875 usec next time.
Sleeping 246875 usec.
Actually sleeping -50000 usec.
Underslept by 0 usec.
Will sleep 246875 usec next time.
Sleeping 246875 usec.
Actually sleeping 246875 usec.
Underslept by 0 usec.
Will sleep 246875 usec next time.
Sleeping 246875 usec.
Actually sleeping 246875 usec.
Underslept by 0 usec.
Will sleep 246875 usec next time.
Sleeping 246875 usec.
Actually sleeping -65625 usec.
.
.
.
Sleeping 246875 usec.
Actually sleeping -50000 usec.
Underslept by 0 usec.
Will sleep 246875 usec next time.
Sleeping 246875 usec.
Actually sleeping 246875 usec.
Underslept by 15625 usec.
Will sleep 262500 usec next time.
Sleeping 262500 usec.
Actually sleeping 262500 usec.
OVERslept by 15625 usec.
Will sleep 246875 usec next time.
Thread 0 completed
Waiting for threads to exit
Sieve complete: 42070000000000 <= p < 42070030000000
Found 97 factors
count=955289,sum=0x2dbc17167afb6a8d
Elapsed time: 62.47 sec. (0.03 init + 62.44 sieve) at 482828 p/sec.
Processor time: 8.92 sec. (0.27 init + 8.66 sieve) at 3482635 p/sec.
Average processor utilization: 8.50 (init), 0.14 (sieve)
I'll give my Vista boxes (32-bit and 64-bit) a try when I get home.
____________
141941*2^4299438-1 is prime!
|
|
|
HAmsty Volunteer tester
 Send message
Joined: 26 Dec 08 Posts: 132 ID: 33421 Credit: 12,510,712 RAC: 0
                
|
|
Same on my side. I've PM'ed Ken my stderr output.
____________
|
|
|
Ken_g6 Volunteer developer
 Send message
Joined: 4 Jul 06 Posts: 921 ID: 3110 Credit: 218,950,902 RAC: 3,798
                          
|
Same on my side. I've PM'ed Ken my stderr output.
Same as Scott; not the same as you had before. Note the lack of "OVERslept by WAY TOO LONG!" messages. Combined with the steady rate, that tells me the new DLL made the run efficient. :)
Now I'm just hoping it fixes Michael Goetz's problem too.
____________
|
|
|
Michael Goetz Volunteer moderator Project administrator
 Send message
Joined: 21 Jan 10 Posts: 13633 ID: 53948 Credit: 280,904,358 RAC: 40,710
                           
|
That is just weird!
The only way I can see that happening is if computation doesn't start when the kernel is called. Since it's not happening to everyone, maybe it's a CUDA runtime dll problem?
So let's all standardize on this DLL and try again.
OK, here's the result using that DLL, which I believe is the same as the one I was already using. Note that the sleep time starts at just under 1 second and steadily increases to around 30 seconds. During the sleep time, the GPU is idle.
EDIT: Correction, the new DLL is not the same one I used before. WinRAR seems to have a bug and was telling me they were the same when they were not. Nevertheless, this test was done with the correct DLL.
C:\Temp\Release>ppsieve-cuda.exe -p42070e9 -P42070030e6 -k 1201 -K 9999 -N 2000000 -z normal -q
ppsieve version cuda-0.1.1 (testing)
nstart=76, nstep=32, gpu_nstep=32
ppsieve initialized: 1201 <= k <= 9999, 76 <= n <= 2000000
Sieve started: 42070000000000 <= p < 42070030000000
Thread 0 starting
Detected GPU 0: GeForce GTX 280
Detected compute capability: 1.3
Detected 30 multiprocessors.
Sleeping 0 usec.
Underslept by 831000 usec.
Will sleep 831000 usec next time.
Sleeping 831000 usec.
Actually sleeping 778000 usec.
Underslept by 828000 usec.
Will sleep 1659000 usec next time.
Sleeping 1659000 usec.
Actually sleeping 1631000 usec.
Underslept by 833000 usec.
Will sleep 2492000 usec next time.
Sleeping 2492000 usec.
Actually sleeping 2470000 usec.
Underslept by 830000 usec.
Will sleep 3322000 usec next time.
Sleeping 3322000 usec.
Actually sleeping 3263000 usec.
Underslept by 828000 usec.
Will sleep 4150000 usec next time.
Sleeping 4150000 usec.
Actually sleeping 4095000 usec.
Underslept by 834000 usec.
Will sleep 4984000 usec next time.
Sleeping 4984000 usec.
Actually sleeping 4954000 usec.
Underslept by 829000 usec.
Will sleep 5813000 usec next time.
Sleeping 5813000 usec.
Actually sleeping 5780000 usec.
Underslept by 844000 usec.
Will sleep 6657000 usec next time.
Sleeping 6657000 usec.
Actually sleeping 6623000 usec.
Underslept by 829000 usec.
Will sleep 7486000 usec next time.
Sleeping 7486000 usec.
Actually sleeping 7458000 usec.
Underslept by 829000 usec.
Will sleep 8315000 usec next time.
Sleeping 8315000 usec.
Actually sleeping 8255000 usec.
Underslept by 830000 usec.
Will sleep 9145000 usec next time.
Sleeping 9145000 usec.
Actually sleeping 9098000 usec.
Underslept by 829000 usec./sec, 0.16 CPU cores, 31.5% done. ETA 04 Jun 15:52
Will sleep 9974000 usec next time.
Sleeping 9974000 usec.
Actually sleeping 9939000 usec.
Underslept by 904000 usec.
Will sleep 10878000 usec next time.
Sleeping 10878000 usec.
Actually sleeping 10812000 usec.
Underslept by 827000 usec.
Will sleep 11705000 usec next time.
Sleeping 11705000 usec.
Actually sleeping 11694000 usec.
Underslept by 829000 usec.
Will sleep 12534000 usec next time.
Sleeping 12534000 usec.
Actually sleeping 12479000 usec.
Underslept by 835000 usec.
Will sleep 13369000 usec next time.
Sleeping 13369000 usec.
Actually sleeping 13297000 usec.
Underslept by 862000 usec./sec, 0.07 CPU cores, 43.7% done. ETA 04 Jun 15:53
Will sleep 14231000 usec next time.
Sleeping 14231000 usec.
Actually sleeping 14122000 usec.
Underslept by 829000 usec.
Will sleep 15060000 usec next time.
Sleeping 15060000 usec.
Actually sleeping 15031000 usec.
Underslept by 828000 usec.
Will sleep 15888000 usec next time.
Sleeping 15888000 usec.
Actually sleeping 15867000 usec.
Underslept by 833000 usec.
Will sleep 16721000 usec next time.
Sleeping 16721000 usec.
Actually sleeping 16674000 usec.
Underslept by 839000 usec./sec, 0.06 CPU cores, 53.3% done. ETA 04 Jun 15:54
Will sleep 17560000 usec next time.
Sleeping 17560000 usec.
Actually sleeping 17525000 usec.
Underslept by 826000 usec.
Will sleep 18386000 usec next time.
Sleeping 18386000 usec.
Actually sleeping 18317000 usec.
Underslept by 826000 usec.
Will sleep 19212000 usec next time.
Sleeping 19212000 usec.
Actually sleeping 19164000 usec.
Underslept by 825000 usec./sec, 0.04 CPU cores, 61.2% done. ETA 04 Jun 15:55
Will sleep 20037000 usec next time.
Sleeping 20037000 usec.
Actually sleeping 19965000 usec.
Underslept by 827000 usec.
Will sleep 20864000 usec next time.
Sleeping 20864000 usec.
Actually sleeping 20821000 usec.
Underslept by 826000 usec.
Will sleep 21690000 usec next time.
Sleeping 21690000 usec.
Actually sleeping 21620000 usec.
Underslept by 832000 usec./sec, 0.04 CPU cores, 68.2% done. ETA 04 Jun 15:56
Will sleep 22522000 usec next time.
Sleeping 22522000 usec.
Actually sleeping 22498000 usec.
Underslept by 827000 usec.
Will sleep 23349000 usec next time.
Sleeping 23349000 usec.
Actually sleeping 23328000 usec.
Underslept by 827000 usec.
Will sleep 24176000 usec next time.
Sleeping 24176000 usec.
Actually sleeping 24172000 usec.
Underslept by 826000 usec.
Will sleep 25002000 usec next time.
Sleeping 25002000 usec.
Actually sleeping 24975000 usec.
Underslept by 825000 usec.
Will sleep 25827000 usec next time.
Sleeping 25827000 usec.
Actually sleeping 25810000 usec.
Underslept by 826000 usec./sec, 0.03 CPU cores, 78.6% done. ETA 04 Jun 15:57
Will sleep 26653000 usec next time.
Sleeping 26653000 usec.
Actually sleeping 26623000 usec.
Underslept by 827000 usec.
Will sleep 27480000 usec next time.
Sleeping 27480000 usec.
Actually sleeping 27452000 usec.
Underslept by 838000 usec./sec, 0.03 CPU cores, 83.0% done. ETA 04 Jun 15:58
Will sleep 28318000 usec next time.
Sleeping 28318000 usec.
Actually sleeping 28285000 usec.
Underslept by 827000 usec.
Will sleep 29145000 usec next time.
Sleeping 29145000 usec.
Actually sleeping 29082000 usec.
Underslept by 862000 usec./sec, 0.03 CPU cores, 88.3% done. ETA 04 Jun 15:59
Will sleep 30007000 usec next time.
Sleeping 30007000 usec.
Actually sleeping 29962000 usec.
Underslept by 829000 usec.
Will sleep 30836000 usec next time.
Sleeping 30836000 usec.
Actually sleeping 30793000 usec.
Underslept by 825000 usec.
Will sleep 31661000 usec next time.
Sleeping 31661000 usec.
Actually sleeping 31632000 usec.
Underslept by 826000 usec.
Will sleep 32487000 usec next time.
Sleeping 32487000 usec.
Actually sleeping 32446000 usec.
Underslept by 831000 usec.
Will sleep 33318000 usec next time.
Sleeping 33318000 usec.
Actually sleeping 33315000 usec.
Underslept by 826000 usec./sec, 0.03 CPU cores, 97.9% done. ETA 04 Jun 16:00
Will sleep 34144000 usec next time.
Sleeping 34144000 usec.
Actually sleeping 34107000 usec.
Underslept by 827000 usec./sec, 0.01 CPU cores, 100.5% done. ETA 04 Jun 16:01
Will sleep 34971000 usec next time.
Sleeping 34971000 usec.
Actually sleeping 34914000 usec.
Underslept by 829000 usec.
Will sleep 35800000 usec next time.
Sleeping 35800000 usec.
Actually sleeping 35778000 usec.
Underslept by 824000 usec.sec, 0.03 CPU cores, 100.5% done. ETA 04 Jun 16:02
Will sleep 36624000 usec next time.
Thread 0 completed
Waiting for threads to exit
Sieve complete: 42070000000000 <= p < 42070030000000
Found 97 factors
count=955289,sum=0x2dbc17167afb6a8d
Elapsed time: 828.16 sec. (0.04 init + 828.12 sieve) at 36404 p/sec.
Processor time: 37.16 sec. (0.03 init + 37.13 sieve) at 811958 p/sec.
Average processor utilization: 0.78 (init), 0.04 (sieve)
I'm running driver version 197.45
Vista-32 SP3
Q6600
GTX 280
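The linear growth in the log above is exactly what an additive rule produces when every iteration reports roughly the same ~825000 µs undersleep. This toy model reproduces the symptom only (the rule is an assumed simplification, not the ppsieve source; the real cause presumably lies in the driver or runtime DLL):

```python
# Toy model of the runaway sleep time seen in the log: if every iteration
# reports a fixed "Underslept by ~825000 usec" and the feedback rule only
# adds that amount back, the planned sleep grows linearly without bound.
# This is an assumed simplification, not the actual ppsieve code.

def planned_sleep_after(iterations, underslept=825000, start=0):
    planned = start
    for _ in range(iterations):
        planned += underslept  # additive increase, never corrected downward
    return planned
```

Twelve iterations give 9,900,000 µs, roughly matching the "Will sleep 9974000 usec next time" line about a dozen sleeps into the run above.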
____________
My lucky number is 75898524288+1 |
|
|
Scott Brown Volunteer moderator Project administrator Volunteer tester Project scientist
 Send message
Joined: 17 Oct 05 Posts: 2258 ID: 1178 Credit: 10,867,108,087 RAC: 11,866,263
                                        
|
|
With the same DLL, I am getting the exact same results (just much longer) on my 9500GT in a Vista 64-bit system (driver 191.07). GPU-Z shows off-and-on spikes to about 30%, but never higher, for GPU load.
So far, all the issues are on Vista even with the same DLL and across multiple driver versions. We really need a Win7 machine to test this, too (unfortunately my one Win7 box has an ATI card).
____________
141941*2^4299438-1 is prime!
Here are my results:
Windows XP 32-bit SP3
Intel E6850
8500GT
latest program and standard cuda.dll (of this thread)
Driver 196.21
BOINC suspended
OVERslept by 46875 usec. p/sec, 0.00 CPU cores, 97.0% done. ETA 04 Jun 23:17
Will sleep 903125 usec next time.
Sleeping 903125 usec.
Actually sleeping 903125 usec.
OVERslept by 46875 usec.
Will sleep 856250 usec next time.
Sleeping 856250 usec.
Actually sleeping 856250 usec.
OVERslept by 46875 usec.
Will sleep 809375 usec next time.
Sleeping 809375 usec.
Actually sleeping 809375 usec.
OVERslept by 46875 usec.
Will sleep 762500 usec next time.
Sleeping 762500 usec.
Actually sleeping 762500 usec.
OVERslept by 46875 usec.
Will sleep 715625 usec next time.
Sleeping 715625 usec.
Actually sleeping 715625 usec.
OVERslept by WAY TOO LONG!
Will sleep 653125 usec next time.
Sleeping 653125 usec.
Actually sleeping 637500 usec.
OVERslept by WAY TOO LONG!
Will sleep 590625 usec next time.
Sleeping 590625 usec.
Actually sleeping 590625 usec.
OVERslept by 46875 usec.
Will sleep 543750 usec next time.
Sleeping 543750 usec.
Actually sleeping 543750 usec.
OVERslept by 46875 usec.
Will sleep 496875 usec next time.
Sleeping 496875 usec.
Actually sleeping 496875 usec.
OVERslept by 46875 usec.
Will sleep 450000 usec next time.
Sleeping 450000 usec.
Actually sleeping 450000 usec.
OVERslept by 46875 usec.
Will sleep 403125 usec next time.
Sleeping 403125 usec.
Actually sleeping 403125 usec.
OVERslept by 46875 usec.
Will sleep 356250 usec next time.
Sleeping 356250 usec.
Actually sleeping 356250 usec.
OVERslept by 15625 usec.
Will sleep 340625 usec next time.
Sleeping 340625 usec.
Actually sleeping 340625 usec.
Underslept by 15625 usec.
Will sleep 356250 usec next time.
Sleeping 356250 usec.
Actually sleeping 356250 usec.
OVERslept by 15625 usec.
Will sleep 340625 usec next time.
Sleeping 340625 usec.
Actually sleeping 340625 usec.
Underslept by 15625 usec.
Will sleep 356250 usec next time.
Sleeping 356250 usec.
Actually sleeping 356250 usec.
OVERslept by 15625 usec.
Will sleep 340625 usec next time.
Sleeping 340625 usec.
Actually sleeping 325000 usec.
Underslept by 15625 usec.
Will sleep 356250 usec next time.
Sleeping 356250 usec.
Actually sleeping 356250 usec.
OVERslept by 15625 usec.
Will sleep 340625 usec next time.
Sleeping 340625 usec.
Actually sleeping 340625 usec.
Underslept by 0 usec.
Will sleep 340625 usec next time.
Sleeping 340625 usec.
Actually sleeping 340625 usec.
Underslept by 0 usec.
Will sleep 340625 usec next time.
Sleeping 340625 usec.
Actually sleeping 340625 usec.
Underslept by 0 usec.
Will sleep 340625 usec next time.
Sleeping 340625 usec.
Actually sleeping 340625 usec.
Underslept by 15625 usec.
Will sleep 356250 usec next time.
Sleeping 356250 usec.
Actually sleeping 356250 usec.
OVERslept by 15625 usec.
Will sleep 340625 usec next time.
Sleeping 340625 usec.
Actually sleeping 340625 usec.
Underslept by 15625 usec.
Will sleep 356250 usec next time.
Sleeping 356250 usec.
Actually sleeping 356250 usec.
OVERslept by 15625 usec.
Will sleep 340625 usec next time.
Sleeping 340625 usec.
Actually sleeping 340625 usec.
Underslept by 0 usec.
Will sleep 340625 usec next time.
Sleeping 340625 usec.
Actually sleeping 325000 usec.
Underslept by 15625 usec.
Will sleep 356250 usec next time.
Sleeping 356250 usec.
Actually sleeping 356250 usec.
Underslept by 0 usec.
Will sleep 356250 usec next time.
Sleeping 356250 usec.
Actually sleeping 356250 usec.
OVERslept by 15625 usec.
Will sleep 340625 usec next time.
Sleeping 340625 usec.
Actually sleeping 340625 usec.
Underslept by 0 usec.
Will sleep 340625 usec next time.
Sleeping 340625 usec.
Actually sleeping 340625 usec.
Underslept by 15625 usec.
Will sleep 356250 usec next time.
Sleeping 356250 usec.
Actually sleeping 356250 usec.
OVERslept by 15625 usec.
Will sleep 340625 usec next time.
Sleeping 340625 usec.
Actually sleeping 340625 usec.
Underslept by 15625 usec.
Will sleep 356250 usec next time.
Sleeping 356250 usec.
Actually sleeping 356250 usec.
OVERslept by 15625 usec.
Will sleep 340625 usec next time.
Sleeping 340625 usec.
Actually sleeping 340625 usec.
Underslept by 0 usec.
Will sleep 340625 usec next time.
Sleeping 340625 usec.
Actually sleeping 340625 usec.
Underslept by 0 usec.
Will sleep 340625 usec next time.
Sleeping 340625 usec.
Actually sleeping 340625 usec.
Underslept by 0 usec.
Will sleep 340625 usec next time.
Sleeping 340625 usec.
Actually sleeping 340625 usec.
Underslept by 0 usec.
Will sleep 340625 usec next time.
Sleeping 340625 usec.
Actually sleeping 340625 usec.
Underslept by 15625 usec.
Will sleep 356250 usec next time.
Sleeping 356250 usec.
Actually sleeping 356250 usec.
OVERslept by 15625 usec.
Will sleep 340625 usec next time.
Sleeping 340625 usec.
Actually sleeping 340625 usec.
Underslept by 15625 usec.
Will sleep 356250 usec next time.
Sleeping 356250 usec.
Actually sleeping 356250 usec.
OVERslept by 15625 usec.
Will sleep 340625 usec next time.
Sleeping 340625 usec.
Actually sleeping 340625 usec.
Underslept by 15625 usec.
Will sleep 356250 usec next time.
Thread 0 completed
Waiting for threads to exit
Sieve complete: 42070000000000 <= p < 42070030000000
Found 97 factors
count=955289,sum=0x2dbc17167afb6a8d
Elapsed time: 1139.53 sec. (0.06 init + 1139.47 sieve) at 26457 p/sec.
Processor time: 19.95 sec. (0.05 init + 19.91 sieve) at 1514427 p/sec.
Average processor utilization: 0.75 (init), 0.02 (sieve)
____________
Ken_g6 Volunteer developer
Joined: 4 Jul 06 Posts: 921 ID: 3110 Credit: 218,950,902 RAC: 3,798
OVERslept by WAY TOO LONG!
Nuts, that didn't work.
OK, one more try. I think I've figured out the CUDA initialization here. I can't test it because I don't have a real card, but if it works the app should be perfectly efficient.
____________
Scott Brown Volunteer moderator Project administrator Volunteer tester Project scientist
Joined: 17 Oct 05 Posts: 2258 ID: 1178 Credit: 10,867,108,087 RAC: 11,866,263
Nuts, that didn't work.
OK, one more try. I think I've figured out the CUDA initialization here. I can't test it because I don't have a real card, but if it works the app should be perfectly efficient.
T8100 Vostro 1510
8400M GS (256mb)
Microsoft Windows Vista (32-bit) [Version 6.0.6002]
ppsieve-cuda.exe -p42070e9 -P42070030e6 -k 1201 -K 9999 -N 2000000 -z normal -q
ppsieve version cuda-0.1.1 (testing)
nstart=76, nstep=32, gpu_nstep=32
ppsieve initialized: 1201 <= k <= 9999, 76 <= n <= 2000000
Sieve started: 42070000000000 <= p < 42070030000000
Resuming from checkpoint p=42070004194305 in ppcheck42070e9.txt
Thread 0 starting
Detected GPU 0: GeForce 8400M GS
Detected compute capability: 1.1
Detected 2 multiprocessors.
p=42070029097985, 34.95K p/sec, 0.01 CPU cores, 97.0% done. ETA 04 Jun 20:15
Thread 0 completed
Waiting for threads to exit
Sieve complete: 42070000000000 <= p < 42070030000000
Found 97 factors
count=955289,sum=0x2dbc17167afb6a8d
Elapsed time: 698.68 sec. (0.37 init + 698.31 sieve) at 37164 p/sec.
Processor time: 4.38 sec. (0.14 init + 4.24 sieve) at 6116159 p/sec.
Average processor utilization: 0.38 (init), 0.01 (sieve)
GPU-Z reported 96%-99% GPU utilization (mostly 99%) with 38MB of VRAM used, and GPU temps were excellent.
I will test in a little while on the 9500GT in 64-bit Vista, but I think this got it. Nicely done, Ken!
____________
141941*2^4299438-1 is prime!
Scott Brown Volunteer moderator Project administrator Volunteer tester Project scientist
Joined: 17 Oct 05 Posts: 2258 ID: 1178 Credit: 10,867,108,087 RAC: 11,866,263
Looks good on the 9500GT also:
ppsieve-cuda.exe -p42070e9 -P42070030e6 -k 1201 -K 9999 -N 2000000 -z normal -q
ppsieve version cuda-0.1.1 (testing)
nstart=76, nstep=32, gpu_nstep=32
ppsieve initialized: 1201 <= k <= 9999, 76 <= n <= 2000000
Sieve started: 42070000000000 <= p < 42070030000000
Thread 0 starting
Detected GPU 0: GeForce 9500 GT
Detected compute capability: 1.1
Detected 4 multiprocessors.
p=42070029360129, 161.7K p/sec, 0.00 CPU cores, 97.9% done. ETA 04 Jun 20:45
Thread 0 completed
Waiting for threads to exit
Sieve complete: 42070000000000 <= p < 42070030000000
Found 97 factors
count=955289,sum=0x2dbc17167afb6a8d
Elapsed time: 185.13 sec. (0.03 init + 185.09 sieve) at 162873 p/sec.
Processor time: 1.50 sec. (0.05 init + 1.45 sieve) at 20779138 p/sec.
Average processor utilization: 1.42 (init), 0.01 (sieve)
____________
141941*2^4299438-1 is prime!
Michael Goetz Volunteer moderator Project administrator
Joined: 21 Jan 10 Posts: 13633 ID: 53948 Credit: 280,904,358 RAC: 40,710
Ah, I think I may have made a mistake extracting files from the zip file earlier, so I wasn't running the latest executable.
Here's what I'm getting now, which looks completely reasonable:
C:\Temp\Release>ppsieve-cuda.exe -p42070e9 -P42070030e6 -k 1201 -K 9999 -N 2000000 -z normal -q
ppsieve version cuda-0.1.1 (testing)
nstart=76, nstep=32, gpu_nstep=32
ppsieve initialized: 1201 <= k <= 9999, 76 <= n <= 2000000
Sieve started: 42070000000000 <= p < 42070030000000
Thread 0 starting
Detected GPU 0: GeForce GTX 280
Detected compute capability: 1.3
Detected 30 multiprocessors.
Thread 0 completed
Waiting for threads to exit
Sieve complete: 42070000000000 <= p < 42070030000000
Found 97 factors
count=955289,sum=0x2dbc17167afb6a8d
Elapsed time: 39.88 sec. (0.05 init + 39.83 sieve) at 756976 p/sec.
Processor time: 1.61 sec. (0.05 init + 1.56 sieve) at 19324594 p/sec.
Average processor utilization: 0.90 (init), 0.04 (sieve)
GPU temps were going up nicely, although this ran quickly enough that it didn't have enough time to go all the way up. GPU-Z was showing GPU utilization in the 90-100% range, which is excellent. I had other BOINC stuff running, but that shouldn't affect this test because it's running at a higher priority than normal BOINC tasks. CPU utilization was shown as 1%, which is about 4% on a single core.
That's the good news.
The bad news is that while this was running, the video display was rather unresponsive. The problem was bad enough that I would never run this on a computer that had a live user. I think that is a side effect of the individual kernels running too long, but I may be wrong. If I'm correct, shortening the amount of work done by each kernel will make the screen more responsive, but will increase the amount of work the CPU has to do because it has to launch more kernels.
____________
My lucky number is 75898^524288+1
Ken_g6 Volunteer developer
Joined: 4 Jul 06 Posts: 921 ID: 3110 Credit: 218,950,902 RAC: 3,798
The bad news is that while this was running, the video display was rather unresponsive. The problem was bad enough that I would never run this on a computer that had a live user. I think that is a side effect of the individual kernels running too long, but I may be wrong. If I'm correct, shortening the amount of work done by each kernel will make the screen more responsive, but will increase the amount of work the CPU has to do because it has to launch more kernels.
Aha! I'd gotten reports of slow displays before, but I never knew what to do about it! So thanks!
Now that I know what to Google for, I'm seeing suggestions of 10-20ms kernel runtimes to avoid slow screens. Based on the runtimes (sleep times) above, I think I can do that by breaking the one kernel up into one setup kernel and 20 or so calls to one iteration kernel. But not tonight.
____________
Lumiukko Volunteer tester
Joined: 7 Jul 08 Posts: 165 ID: 25183 Credit: 747,914,486 RAC: 29,735
My results with:
Win7 x64
Intel E5345
GTS250
latest program and standard cuda.dll (of this thread)
Driver 197.45
BOINC running only CPU-work
c:\Temp\Release>ppsieve-cuda.exe -p42070e9 -P42070030e6 -k 1201 -K 9999 -N 2000000 -z normal -q
ppsieve version cuda-0.1.1 (testing)
nstart=76, nstep=32, gpu_nstep=32
ppsieve initialized: 1201 <= k <= 9999, 76 <= n <= 2000000
Sieve started: 42070000000000 <= p < 42070030000000
Thread 0 starting
Detected GPU 0: GeForce GTS 250
Detected compute capability: 1.1
Detected 16 multiprocessors.
Thread 0 completed
Waiting for threads to exit
Sieve complete: 42070000000000 <= p < 42070030000000
Found 97 factors
count=955289,sum=0x2dbc17167afb6a8d
Elapsed time: 45.55 sec. (0.03 init + 45.52 sieve) at 662334 p/sec.
Processor time: 1.50 sec. (0.05 init + 1.45 sieve) at 20746020 p/sec.
Average processor utilization: 1.50 (init), 0.03 (sieve)
The display was reasonably responsive during the test. It could be used for normal office work, but not for watching videos.
The previous version was acting just like those Vista machines:
c:\Temp\Release>ppsieve-cuda.exe -p42070e9 -P42070030e6 -k 1201 -K 9999 -N 2000000 -z normal -q
ppsieve version cuda-0.1.1 (testing)
nstart=76, nstep=32, gpu_nstep=32
ppsieve initialized: 1201 <= k <= 9999, 76 <= n <= 2000000
Sieve started: 42070000000000 <= p < 42070030000000
Thread 0 starting
Detected GPU 0: GeForce GTS 250
Detected compute capability: 1.1
Detected 16 multiprocessors.
Sleeping 0 usec.
Underslept by 262500 usec.
Will sleep 262500 usec next time.
Sleeping 262500 usec.
Actually sleeping 262500 usec.
Underslept by 234375 usec.
Will sleep 496875 usec next time.
Sleeping 496875 usec.
Actually sleeping 481250 usec.
Underslept by 234375 usec.
Will sleep 731250 usec next time.
Sleeping 731250 usec.
...
...
Underslept by 234375 usec./sec, 0.01 CPU cores, 99.6% done. ETA 04 Jun 23:13
Will sleep 37215625 usec next time.
Sleeping 37215625 usec.
Actually sleeping 37215625 usec.
Underslept by 234375 usec.
Will sleep 37450000 usec next time.
Sleeping 37450000 usec.
Actually sleeping 37450000 usec.
Underslept by 234375 usec.
Will sleep 37684375 usec next time.
Sleeping 37684375 usec.
Actually sleeping 37684375 usec.
Underslept by 218750 usec.
Will sleep 37903125 usec next time.
Thread 0 completed
Waiting for threads to exit
Sieve complete: 42070000000000 <= p < 42070030000000
Found 97 factors
count=955289,sum=0x2dbc17167afb6a8d
Elapsed time: 3105.11 sec. (0.05 init + 3105.06 sieve) at 9709 p/sec.
Processor time: 46.30 sec. (0.05 init + 46.25 sieve) at 651818 p/sec.
Average processor utilization: 1.00 (init), 0.01 (sieve)
--
Lumiukko
Ken_g6 Volunteer developer
Joined: 4 Jul 06 Posts: 921 ID: 3110 Credit: 218,950,902 RAC: 3,798
Looking good! :)
Now I'd like everyone who tried the last Windows version to try the version I just uploaded. It should make it easier to use your computer while it runs, but it may slow the computation; by how much, I'm not sure yet.
For those who missed it, the previous version is here, so you can compare the old and new versions too.
By the way, if this works, I will have a Linux version of it; I'm just in the middle of Windows coding right now.
____________
Win7 64
i7-920 w/6GB RAM
GTX 260 Core 216 (Factory OC)
Driver 197.45
BOINC running CPU tasks only
Here are two tests I ran using Ken's latest build...with the standard cudart.dll, as agreed earlier....
D:\Patrick\ppsieve-cuda-vc\Release>ppsieve-cuda.exe -p42070e9 -P42070030e6 -k 1201 -K 9999 -N 20000
ppsieve version cuda-0.1.1 (testing)
nstart=76, nstep=32, gpu_nstep=32
ppsieve initialized: 1201 <= k <= 9999, 76 <= n <= 20000
Sieve started: 42070000000000 <= p < 42070030000000
Thread 0 starting
Detected GPU 0: GeForce GTX 260
Detected compute capability: 1.3
Detected 27 multiprocessors.
Thread 0 completed
Waiting for threads to exit
Sieve complete: 42070000000000 <= p < 42070030000000
Found 0 factors
count=955289,sum=0x2dbc17167afb6a8d
Elapsed time: 0.71 sec. (0.05 init + 0.67 sieve) at 45328606 p/sec.
Processor time: 0.31 sec. (0.06 init + 0.25 sieve) at 120778519 p/sec.
Average processor utilization: 1.39 (init), 0.38 (sieve)
And the second one...
D:\Patrick\ppsieve-cuda-vc\Release>ppsieve-cuda.exe -p42070e9 -P42070003e6 -k 1201 -K 9999 -N 2000000 -z normal
ppsieve version cuda-0.1.1 (testing)
nstart=76, nstep=32, gpu_nstep=32
ppsieve initialized: 1201 <= k <= 9999, 76 <= n <= 2000000
Sieve started: 42070000000000 <= p < 42070003000000
Thread 0 starting
Detected GPU 0: GeForce GTX 260
Detected compute capability: 1.3
Detected 27 multiprocessors.
42070000070587 | 9475*2^197534+1
42070000198537 | 3373*2^1046686+1
42070000300049 | 9139*2^461846+1
42070000345343 | 1715*2^635711+1
42070000464001 | 4179*2^1577462+1
42070000949861 | 4707*2^571847+1
42070001011573 | 7113*2^215532+1
42070001040127 | 6471*2^37907+1
42070002482267 | 9951*2^1920408+1
42070002690167 | 2553*2^1888870+1
42070002698543 | 4239*2^368773+1
42070002875941 | 4081*2^1494668+1
Thread 0 completed
Waiting for threads to exit
Sieve complete: 42070000000000 <= p < 42070003000000
Found 12 factors
count=95668,sum=0x37dacb7121ccffe4
Elapsed time: 4.31 sec. (0.05 init + 4.27 sieve) at 737011 p/sec.
Processor time: 0.37 sec. (0.08 init + 0.30 sieve) at 10613046 p/sec.
Average processor utilization: 1.73 (init), 0.07 (sieve)
____________
Michael Goetz Volunteer moderator Project administrator
Joined: 21 Jan 10 Posts: 13633 ID: 53948 Credit: 280,904,358 RAC: 40,710
Here are my results with the new executable:
C:\Temp\Release>ppsieve-cuda.exe -p42070e9 -P42070030e6 -k 1201 -K 9999 -N 2000000 -z normal -q
ppsieve version cuda-0.1.1 (testing)
nstart=76, nstep=32, gpu_nstep=32
ppsieve initialized: 1201 <= k <= 9999, 76 <= n <= 2000000
Sieve started: 42070000000000 <= p < 42070030000000
Thread 0 starting
Detected GPU 0: GeForce GTX 280
Detected compute capability: 1.3
Detected 30 multiprocessors.
Thread 0 completed
Waiting for threads to exit
Sieve complete: 42070000000000 <= p < 42070030000000
Found 97 factors
count=955289,sum=0x2dbc17167afb6a8d
Elapsed time: 37.93 sec. (0.06 init + 37.87 sieve) at 796096 p/sec.
Processor time: 1.59 sec. (0.08 init + 1.51 sieve) at 19922258 p/sec.
Average processor utilization: 1.26 (init), 0.04 (sieve)
It's actually a second or two *faster* than before, but that could be due to outside factors. However, the screen lag seemed about as bad as before, too.
I'm not sure I was running the right executable. The date of the .exe in the zip file was June 4th, at around 3 PM. Is that the most recent file? Could you post a link to the correct download? I had to scroll back pretty far to find it, and maybe I followed the wrong link.
____________
My lucky number is 75898^524288+1
Ken_g6 Volunteer developer
Joined: 4 Jul 06 Posts: 921 ID: 3110 Credit: 218,950,902 RAC: 3,798
D:\Patrick\ppsieve-cuda-vc\Release>ppsieve-cuda.exe -p42070e9 -P42070030e6 -k 1201 -K 9999 -N 20000
ppsieve version cuda-0.1.1 (testing)
nstart=76, nstep=32, gpu_nstep=32
ppsieve initialized: 1201 <= k <= 9999, 76 <= n <= 20000
Sieve started: 42070000000000 <= p < 42070030000000
Thread 0 starting
Detected GPU 0: GeForce GTX 260
Detected compute capability: 1.3
Detected 27 multiprocessors.
Thread 0 completed
Waiting for threads to exit
Sieve complete: 42070000000000 <= p < 42070030000000
Found 0 factors
count=955289,sum=0x2dbc17167afb6a8d
Elapsed time: 0.71 sec. (0.05 init + 0.67 sieve) at 45328606 p/sec.
Processor time: 0.31 sec. (0.06 init + 0.25 sieve) at 120778519 p/sec.
Average processor utilization: 1.39 (init), 0.38 (sieve)
Uhhh...that doesn't look good. Which one was that?
I'm not sure I was running the right executable. The date of the .exe in the zip file was June 4th, at around 3PM. Is that the most recent file?
Yes, that's the most recent file. I've had my Windows VM on pause, and its clock has gotten out of sync with real time. It still thinks it's June 4, at 3:06 PM.
____________
Scott Brown Volunteer moderator Project administrator Volunteer tester Project scientist
Joined: 17 Oct 05 Posts: 2258 ID: 1178 Credit: 10,867,108,087 RAC: 11,866,263
T8100
8400M GS (256mb)
Microsoft Windows Vista (32-bit) [Version 6.0.6002]
ppsieve-cuda.exe -p42070e9 -P42070030e6 -k 1201 -K 9999 -N 2000000 -z normal -q
ppsieve version cuda-0.1.1 (testing)
nstart=76, nstep=32, gpu_nstep=32
ppsieve initialized: 1201 <= k <= 9999, 76 <= n <= 2000000
Sieve started: 42070000000000 <= p < 42070030000000
Thread 0 starting
Detected GPU 0: GeForce 8400M GS
Detected compute capability: 1.1
Detected 2 multiprocessors.
p=42070029622273, 39.32K p/sec, 0.01 CPU cores, 98.7% done. ETA 05 Jun 23:40
Thread 0 completed
Waiting for threads to exit
Sieve complete: 42070000000000 <= p < 42070030000000
Found 97 factors
count=955289,sum=0x2dbc17167afb6a8d
Elapsed time: 804.80 sec. (0.04 init + 804.76 sieve) at 37460 p/sec.
Processor time: 5.05 sec. (0.06 init + 4.99 sieve) at 6038936 p/sec.
Average processor utilization: 1.49 (init), 0.01 (sieve)
A bit slower than before. The screen still had a good bit of lag, but not as bad as before.
____________
141941*2^4299438-1 is prime!
Uhhh...that doesn't look good. Which one was that?
That was the most recent zip from your site....
Here, I re-downloaded it and ran again...I think I used incorrect command line parameters on the first run...
D:\Patrick\ppsieve-cuda-vc\Release>ppsieve-cuda.exe -p42070e9 -P42070030e6 -k 1201 -K 9999 -N 2000000 -z normal -q
ppsieve version cuda-0.1.1 (testing)
nstart=76, nstep=32, gpu_nstep=32
ppsieve initialized: 1201 <= k <= 9999, 76 <= n <= 2000000
Sieve started: 42070000000000 <= p < 42070030000000
Thread 0 starting
Detected GPU 0: GeForce GTX 260
Detected compute capability: 1.3
Detected 27 multiprocessors.
Thread 0 completed
Waiting for threads to exit
Sieve complete: 42070000000000 <= p < 42070030000000
Found 97 factors
count=955289,sum=0x2dbc17167afb6a8d
Elapsed time: 39.22 sec. (0.05 init + 39.17 sieve) at 769574 p/sec.
Processor time: 2.00 sec. (0.06 init + 1.93 sieve) at 15584353 p/sec.
Average processor utilization: 1.30 (init), 0.05 (sieve)
Screen lag was pretty bad...though not enough to make it completely unusable...
____________
Sorry, I didn't have the previous version, so the below may not be of any use to you now.
All the best and thanks for your enthusiasm!
Pete
Computer was idle.
i7 920 2.6GHz 6GB
GeForce GTX 275 896MB:
Driver Ver 197.13
Graphics clock 633MHz
Processor clock 1404 MHz
Mem clock 1134MHz
cudart.dll ver 3.0.9 file ver 6.14.11.3000
ppsieve-cuda.exe -p42070e9 -P42070030e6 -k 1201 -K 9999 -N 2000000 -z normal -q
ppsieve version cuda-0.1.1 (testing)
nstart=76, nstep=32, gpu_nstep=32
ppsieve initialized: 1201 <= k <= 9999, 76 <= n <= 2000000
Sieve started: 42070000000000 <= p < 42070030000000
Thread 0 starting
Detected GPU 0: GeForce GTX 275
Detected compute capability: 1.3
Detected 30 multiprocessors.
Thread 0 completed
Waiting for threads to exit
Sieve complete: 42070000000000 <= p < 42070030000000
Found 97 factors
count=955289,sum=0x2dbc17167afb6a8d
Elapsed time: 33.51 sec. (0.03 init + 33.48 sieve) at 900464 p/sec.
Processor time: 1.64 sec. (0.06 init + 1.58 sieve) at 19133263 p/sec.
Average processor utilization: 1.89 (init), 0.05 (sieve)
____________
35 x 2^3587843+1 is prime!
CPU Stats: GenuineIntel
Intel(R) Core(TM)2 Duo CPU T8300 @ 2.40GHz [x86 Family 6 Model 23 Stepping 6]
(2 processors)
GPU Stats: NVIDIA GeForce 8600M GT (255MB)
CUDA:
ppsieve version cuda-0.1.1 (testing)
Compiled Apr 14 2010 with GCC 4.3.3
pmax not specified, using default pmax = pmin + 1e9
Please specify an input file or all of kmin, kmax, and nmax
mmillerick@mmillerick-laptop:~$ '/home/mmillerick/Desktop/ppsieve-cuda/ppsieve-cuda-x86_64-linux' -p42070e9 -P42070003e6 -k 1201 -K 9999 -N 2000000 -z normal
ppsieve version cuda-0.1.1 (testing)
Compiled Apr 14 2010 with GCC 4.3.3
nstart=76, nstep=32, gpu_nstep=32
ppsieve initialized: 1201 <= k <= 9999, 76 <= n <= 2000000
Sieve started: 42070000000000 <= p < 42070003000000
Thread 0 starting
Detected GPU 0: GeForce 8600M GT
Detected compute capability: 1.1
Detected 4 multiprocessors.
42070000070587 | 9475*2^197534+1
42070000198537 | 3373*2^1046686+1
42070000300049 | 9139*2^461846+1
42070000345343 | 1715*2^635711+1
42070000464001 | 4179*2^1577462+1
42070000949861 | 4707*2^571847+1
42070001011573 | 7113*2^215532+1
42070001040127 | 6471*2^37907+1
42070002482267 | 9951*2^1920408+1
42070002690167 | 2553*2^1888870+1
42070002698543 | 4239*2^368773+1
42070002875941 | 4081*2^1494668+1
Thread 0 completed
Waiting for threads to exit
Sieve complete: 42070000000000 <= p < 42070003000000
Found 12 factors
count=95668,sum=0x37dacb7121ccffe4
Elapsed time: 35.24 sec. (0.02 init + 35.22 sieve) at 89317 p/sec.
Processor time: 3.79 sec. (0.02 init + 3.78 sieve) at 833033 p/sec.
Average processor utilization: 0.95 (init), 0.11 (sieve)
CPU:
mmillerick@mmillerick-laptop:~$ '/home/mmillerick/Desktop/ppsieve/ppsieve-x86_64-linux' -p42070e9 -P42070003e6 -k 1201 -K 9999 -N 2000000 -z normal -t2
ppsieve version 0.3.6 (testing)
Compiled May 10 2010 with GCC 4.3.3
Algorithm not specified, starting benchmark...
bsf takes 500000; mul takes 690000; using standard algorithm.
nstart=1999980, nstep=32
ppsieve initialized: 1201 <= k <= 9999, 76 <= n <= 2000000
Sieve started: 42070000000000 <= p < 42070003000000
Thread 0 starting
Thread 1 starting
42070000198537 | 3373*2^1046686+1
42070000070587 | 9475*2^197534+1
42070000300049 | 9139*2^461846+1
42070000345343 | 1715*2^635711+1
42070000464001 | 4179*2^1577462+1
42070000949861 | 4707*2^571847+1
42070001011573 | 7113*2^215532+1
42070001040127 | 6471*2^37907+1
42070002482267 | 9951*2^1920408+1
42070002690167 | 2553*2^1888870+1
42070002698543 | 4239*2^368773+1
42070002875941 | 4081*2^1494668+1
Thread 1 completed
Waiting for threads to exit
Thread 0 completed
Sieve complete: 42070000000000 <= p < 42070003000000
Found 12 factors
count=95668,sum=0x37dacb7121ccffe4
Elapsed time: 13.33 sec. (1.20 init + 12.13 sieve) at 248562 p/sec.
Processor time: 24.47 sec. (1.20 init + 23.27 sieve) at 129554 p/sec.
Average processor utilization: 1.00 (init), 1.92 (sieve)
BiBi Volunteer tester
Joined: 6 Mar 10 Posts: 151 ID: 56425 Credit: 34,290,031 RAC: 0
My run stopped working because the graphics driver was restarted, apparently due to unresponsiveness?
Here is the output:
$ppsieve-cuda-vc\Release>ppsieve-cuda.exe -p42070e9 -P42070030e6 -k 1201 -K 9999 -N 2000000 -z normal -q
ppsieve version cuda-0.1.1 (testing)
nstart=76, nstep=32, gpu_nstep=32
ppsieve initialized: 1201 <= k <= 9999, 76 <= n <= 2000000
Sieve started: 42070000000000 <= p < 42070030000000
Thread 0 starting
Detected GPU 0: GeForce 8400 GS
Detected compute capability: 1.1
Detected 1 multiprocessors.
p=42070027787265, 30.33K p/sec, 0.00 CPU cores, 92.6% done. ETA 27 Jun 16:37
$ppsieve-cuda-vc\Release>ppsieve-cuda.exe -p42070e9 -P42070030e6 -k 1201 -K 9999 -N 2000000 -z normal -q
ppsieve version cuda-0.1.1 (testing)
nstart=76, nstep=32, gpu_nstep=32
ppsieve initialized: 1201 <= k <= 9999, 76 <= n <= 2000000
Sieve started: 42070000000000 <= p < 42070030000000
Resuming from checkpoint p=42070019660801 in ppcheck42070e9.txt
Thread 0 starting
Detected GPU 0: GeForce 8400 GS
Detected compute capability: 1.1
Detected 1 multiprocessors.
p=42070029884417, 34.41K p/sec, 0.01 CPU cores, 99.6% done. ETA 27 Jun 16:48
Thread 0 completed
Waiting for threads to exit
Sieve complete: 42070000000000 <= p < 42070030000000
Found 97 factors
count=955289,sum=0x2dbc17167afb6a8d
Elapsed time: 320.45 sec. (0.09 init + 320.36 sieve) at 32731 p/sec.
Processor time: 2.20 sec. (0.11 init + 2.09 sieve) at 5008124 p/sec.
Average processor utilization: 1.23 (init), 0.01 (sieve)
____________
BiBi Volunteer tester
Joined: 6 Mar 10 Posts: 151 ID: 56425 Credit: 34,290,031 RAC: 0
I also tested the VC build in debug mode. Maybe you can use the output.
Thread 0 completed
Waiting for threads to exit
Sieve complete: 42070000000000 <= p < 42070030000000
Found 97 factors
count=955289,sum=0x2dbc17167afb6a8d
Elapsed time: 936.68 sec. (0.54 init + 936.14 sieve) at 32203 p/sec.
Processor time: 12.19 sec. (0.23 init + 11.95 sieve) at 2522065 p/sec.
Average processor utilization: 0.43 (init), 0.01 (sieve)
I added the release build as well, because I noticed it was slower than the binaries that you distributed:
ppsieve version cuda-0.1.1 (testing)
nstart=76, nstep=32, gpu_nstep=32
ppsieve initialized: 1201 <= k <= 9999, 76 <= n <= 2000000
Sieve started: 42070000000000 <= p < 42070030000000
Thread 0 starting
Detected GPU 0: GeForce 8400 GS
Detected compute capability: 1.1
Detected 1 multiprocessors.
p=42070029884417, 29.94K p/sec, 0.01 CPU cores, 99.6% done. ETA 28 Jun 22:42
Thread 0 completed
Waiting for threads to exit
Sieve complete: 42070000000000 <= p < 42070030000000
Found 97 factors
count=955289,sum=0x2dbc17167afb6a8d
Elapsed time: 921.02 sec. (0.21 init + 920.82 sieve) at 32739 p/sec.
Processor time: 5.70 sec. (0.08 init + 5.63 sieve) at 5359388 p/sec.
Average processor utilization: 0.38 (init), 0.01 (sieve)
Will this code become available within BOINC or PRPNet?
____________
Ken_g6 Volunteer developer
Joined: 4 Jul 06 Posts: 921 ID: 3110 Credit: 218,950,902 RAC: 3,798
My run stopped working due to the fact the graphic drivers were restarted because of unresponsiveness?
What do you mean by that? There shouldn't be any watchdog timer issue; the latest version breaks it up into ~15-20ms kernel runs.
Is it possible you're having driver or BIOS problems, as here?
Compiling for BOINC with VS didn't work on my virtual WinXP machine, and I've been putting off doing it on a real XP machine. If anyone with experience compiling BOINC apps on VS wants to do it themselves, they're more than welcome to do so. :)
____________
BiBi Volunteer tester
Joined: 6 Mar 10 Posts: 151 ID: 56425 Credit: 34,290,031 RAC: 0
What do you mean by that?
I was too quick clicking that OS message away :( ;) and it did not happen again.
____________
Ken_g6 Volunteer developer
Joined: 4 Jul 06 Posts: 921 ID: 3110 Credit: 218,950,902 RAC: 3,798
New version for testing! V0.1.2. Download at the usual location; source here.
The major change here is the update Windows users got, which I promised to Linux users...almost a month ago! I've been slacking, haven't I? :embarrassed:
Well, it's here now, and it needs testing, because I linked to the CUDA driver to make it work. If it works, you should get the same ultra-low CPU usage.
Good luck! :)
____________
BiBi Volunteer tester
Joined: 6 Mar 10 Posts: 151 ID: 56425 Credit: 34,290,031 RAC: 0
I am trying to compile with VS2008, but it might take some work to get it to build.
Maybe I am missing the latest project and solution files?
____________
Ken_g6 Volunteer developer
Joined: 4 Jul 06 Posts: 921 ID: 3110 Credit: 218,950,902 RAC: 3,798
Thanks!
The most recent zip is the most recent zip, pretty much. I was building with VS2008 Express Edition.
There's also the GitHub source, but I was working on converting it to work with VS and haven't tested that conversion yet. So it's probably best to stick with the zipfile.
Edit: This might be helpful.
____________
BiBi Volunteer tester
Joined: 6 Mar 10 Posts: 151 ID: 56425 Credit: 34,290,031 RAC: 0
Ok, thought so ;) (I got it compiled before)
Will the PPS CUDA client be available for use with BOINC within the next two months?
Regards,
____________
Ken_g6 Volunteer developer
Joined: 4 Jul 06 Posts: 921 ID: 3110 Credit: 218,950,902 RAC: 3,798
That all depends on whether someone like you (or me) can get it to compile for BOINC in Visual Studio. But I think it's likely.
Oh, I forgot to mention, be sure to define the USE_BOINC macro when compiling for BOINC. That should improve your chances significantly. :)
____________
|
|
|
BiBi Volunteer tester Send message
Joined: 6 Mar 10 Posts: 151 ID: 56425 Credit: 34,290,031 RAC: 0
                   
|
|
Got the BOINC trunk release to compile and linked the static libraries into the ppsieve-cuda project. I also added the macro. It compiles and seems to be running. But what did I do? :S
____________
|
|
|
Ken_g6 Volunteer developer
 Send message
Joined: 4 Jul 06 Posts: 921 ID: 3110 Credit: 218,950,902 RAC: 3,798
                          
|
|
If it produces the correct results when tested, and outputs timing info among other things to stderr.txt, it's probably working. Sounds like you've built the PPSieve CUDA app we're going to use in Windows! :)
...Except that it seems I didn't get all the details straightened out about input formats yet. I'm waiting to hear from Rytis to get it figured out. Would you be willing to build another version if necessary? It seems to be much easier with the real VS2008 than with the VS2008 express I've been using.
Thanks!
____________
|
|
|
BiBi Volunteer tester Send message
Joined: 6 Mar 10 Posts: 151 ID: 56425 Credit: 34,290,031 RAC: 0
                   
|
|
I tested with the debug version.
(...)
Thread 0: all_done set
Thread 0: leaving get_chunk()
Getting factors from iteration at 1
Checking factors for iteration starting at 1 with P=825323923
Found 97 factors
(...)
stderr.txt
=======
21:55:44 (3920): Can't set up shared mem: -1. Will run in standalone mode.
Sieve started: 42070000000000 <= p < 42070030000000
Thread 0 starting
Detected GPU 0: GeForce 8400 GS
Detected compute capability: 1.1
Detected 1 multiprocessors.
Thread 0 completed
Waiting for threads to exit
Sieve complete: 42070000000000 <= p < 42070030000000
count=955289,sum=0x2dbc17167afb6a8d
Elapsed time: 1074.03 sec. (0.70 init + 1073.33 sieve) at 28087 p/sec.
Processor time: 13.56 sec. (0.25 init + 13.31 sieve) at 2264530 p/sec.
Average processor utilization: 0.36 (init), 0.01 (sieve)
22:13:39 (3920): called boinc_finish
**********
**********
Memory Leaks Detected!!!
Memory Statistics:
0 bytes in 0 Free Blocks.
35 bytes in 4 Normal Blocks.
4180 bytes in 4 CRT Blocks.
0 bytes in 0 Ignore Blocks.
0 bytes in 0 Client Blocks.
Largest number used: 4620923 bytes.
Total allocations: 4644439 bytes.
Dumping objects ->
...\ppsieve-cuda-vc\ppsieve-cuda\putil.c(27) : {75}
normal block at 0x00231700, 4 bytes long.
Data: < > 00 00 00 00
...\ppsieve-cuda-vc\ppsieve-cuda\putil.c(27) : {68}
normal block at 0x00231690, 19 bytes long.
Data: <ppcheck42070e9.t> 70 70 63 68 65 63 6B 34 32 30 37 30 65 39 2E 74
...\ppsieve-cuda-vc\ppsieve-cuda\putil.c(27) : {67}
normal block at 0x00231658, 8 bytes long.
Data: <42070e9 > 34 32 30 37 30 65 39 00
{60} normal block at 0x00231590, 4 bytes long.
Data: < 4# > 00 34 23 00
Object dump complete.
Do you think we need to get rid of the leaks? By the way, I am using VS2008 Express as well. Both BOINC and the CUDA toolset work with the Express edition.
____________
|
|
|
Ken_g6 Volunteer developer
 Send message
Joined: 4 Jul 06 Posts: 921 ID: 3110 Credit: 218,950,902 RAC: 3,798
                          
|
|
I doubt the memory leaks are a problem. There isn't that much allocation and deallocation going on during the main body of the program. But if you want to leave memory-leak detection enabled, I could look into it in more detail.
____________
|
|
|
Michael Goetz Volunteer moderator Project administrator
 Send message
Joined: 21 Jan 10 Posts: 13633 ID: 53948 Credit: 280,904,358 RAC: 40,710
                           
|
Would you be willing to build another version if necessary? It seems to be much easier with the real VS2008 than with the VS2008 express I've been using.
Thanks!
I don't think the Express/full version should have any effect. The stuff missing in the Express version is usually only important to enterprise-level customers, such as integration with source-management tools. The compiler itself should produce the same code. Even in enterprise environments, there isn't much reason not to use the Express editions anymore.
____________
My lucky number is 7589*2^524288+1 |
|
|
Ken_g6 Volunteer developer
 Send message
Joined: 4 Jul 06 Posts: 921 ID: 3110 Credit: 218,950,902 RAC: 3,798
                          
|
|
Alright, BiBi, the input issues seem settled. Please go ahead and post your build. If you don't have a place to post it, PM me.
I don't think the Express/full version should have any effect. The stuff missing in the Express version is usually only important to enterprise-level customers, such as integration with source-management tools. The compiler itself should produce the same code. Even in enterprise environments, there isn't much reason not to use the Express editions anymore.
According to http://boinc.berkeley.edu/trac/wiki/CompileClient, the build requires a mess of extra stuff, including the Microsoft Visual C++ 2008 Redistributable Package. That package wouldn't install on my VM of XP.
Perhaps I was following the wrong instructions. Anyone know of better instructions?
____________
|
|
|
BiBi Volunteer tester Send message
Joined: 6 Mar 10 Posts: 151 ID: 56425 Credit: 34,290,031 RAC: 0
                   
|
|
I think I misunderstood. I used the previous link you provided, http://boinc.berkeley.edu/trac/wiki/CompileApp, to compile BOINC. It included some libraries that I linked with the ppsieve-cuda project.
At the moment there are two things I noticed:
- a lot of warnings;
- memory leaks.
The number of warnings can be reduced by defining _CRT_SECURE_NO_WARNINGS. There are other warnings that need to be addressed; I think they might cause problems in overflow situations.
I also wanted to look into the memory leaks.
Having said that, there is also a third issue, TIME:
- Holland is still in the football World Cup :D;
- The weather is extremely nice here in Northern Europe;
- In July and August I like to take some vacation.
Edit: got the win32 release build compiled as well. I need a place to upload it.
____________
|
|
|
Ken_g6 Volunteer developer
 Send message
Joined: 4 Jul 06 Posts: 921 ID: 3110 Credit: 218,950,902 RAC: 3,798
                          
|
|
http://reprog.wordpress.com/2010/06/13/how-correct-is-correct-enough/
I think the current code is correct enough, at least for now.
____________
|
|
|
|
|
|
Sorry for the cross post, but message 24735
I installed and linked to libcudart.so.2 (cuda23), but now I'm getting this:
<core_client_version>6.10.36</core_client_version>
<![CDATA[
<message>
process exited with code 1 (0x1, -255)
</message>
<stderr_txt>
Unrecognized XML in parse_init_data_file: hostid
Skipping: 123223
Skipping: /hostid
Unrecognized XML in parse_init_data_file: starting_elapsed_time
Skipping: 0.000000
Skipping: /starting_elapsed_time
../../projects/www.primegrid.com/primegrid_ppsieve_1.20_i686-pc-linux-gnu__cuda23: out of range argument --device 0
called boinc_finish
</stderr_txt>
]]>
Is it normal to have "Unrecognized XML"?
The test app works for me stand alone, so maybe I just need to reset. |
|
|
Ken_g6 Volunteer developer
 Send message
Joined: 4 Jul 06 Posts: 921 ID: 3110 Credit: 218,950,902 RAC: 3,798
                          
|
../../projects/www.primegrid.com/primegrid_ppsieve_1.20_i686-pc-linux-gnu__cuda23: out of range argument --device 0
That's a bug. I didn't realize there could be a device 0 (that wasn't the CPU emulator). Guess that's what happens when your app is developed by someone without the actual hardware.
I'll fix it, sooner rather than later.
____________
|
|
|
|
|
../../projects/www.primegrid.com/primegrid_ppsieve_1.20_i686-pc-linux-gnu__cuda23: out of range argument --device 0
That's a bug. I didn't realize there could be a device 0 (that wasn't the CPU emulator). Guess that's what happens when your app is developed by someone without the actual hardware.
I'll fix it, sooner rather than later.
I'd rather you develop it without the hardware than not at all. I'm perfectly happy with being a tester. |
|
|
Ken_g6 Volunteer developer
 Send message
Joined: 4 Jul 06 Posts: 921 ID: 3110 Credit: 218,950,902 RAC: 3,798
                          
|
|
OK, happy tester ;), try the Linux version I've just uploaded, with "--device 0"; and also try hitting ctrl-c in the middle of a run. (Edit: Nothing special should happen when you hit ctrl-c; it should just stop normally.)
____________
|
|
|
|
|
OK, happy tester ;), try the Linux version I've just uploaded, with "--device 0"; and also try hitting ctrl-c in the middle of a run. (Edit: Nothing special should happen when you hit ctrl-c; it should just stop normally.)
Assuming, since I can't download any work, you mean as standalone.
./ppsieve-cuda-x86-linux --device 0 -p42070e9 -P42070003e6 -k 1201 -K 9999 -N 2000000 -z normal
ppsieve version cuda-0.1.2a (testing)
Compiled Jul 7 2010 with GCC 4.3.3
nstart=76, nstep=32, gpu_nstep=32
ppsieve initialized: 1201 <= k <= 9999, 76 <= n <= 2000000
Sieve started: 42070000000000 <= p < 42070003000000
Thread 0 starting
Detected GPU 0: GeForce 9800 GT
Detected compute capability: 1.1
Detected 14 multiprocessors.
42070000070587 | 9475*2^197534+1
42070000198537 | 3373*2^1046686+1
42070000300049 | 9139*2^461846+1
42070000345343 | 1715*2^635711+1
42070000464001 | 4179*2^1577462+1
^C42070000949861 | 4707*2^571847+1
42070001011573 | 7113*2^215532+1
42070001040127 | 6471*2^37907+1
Thread 0 interrupted
Waiting for threads to exit
Sieve incomplete: 42070000000000 <= p < 42070001048577
Found 8 factors
count=33265,sum=0x136be0270769b89f
Elapsed time: 2.58 sec. (0.03 init + 2.55 sieve) at 411352 p/sec.
Processor time: 0.19 sec. (0.04 init + 0.15 sieve) at 6839045 p/sec.
Average processor utilization: 1.23 (init), 0.06 (sieve) |
|
|
Ken_g6 Volunteer developer
 Send message
Joined: 4 Jul 06 Posts: 921 ID: 3110 Credit: 218,950,902 RAC: 3,798
                          
|
|
Looks good! Alright, the Linux apps are good to go! :)
Now to Windows. BiBi, if you want to build a quick fix: in app.c, under "case 'd':", change the 1 to a 0 inside the parse_uint() call. Otherwise, I'd like to learn how to build these apps myself. Would you mind giving me an overview of how you did it and what files you downloaded and put where? (Maybe by PM?)
Thanks, everyone! We're getting there...
____________
|
|
|
|
|
Looks good! Alright, the Linux apps are good to go! :)
Now to Windows. BiBi, if you want to build a quick fix: in app.c, under "case 'd':", change the 1 to a 0 inside the parse_uint() call. Otherwise, I'd like to learn how to build these apps myself. Would you mind giving me an overview of how you did it and what files you downloaded and put where? (Maybe by PM?)
Thanks, everyone! We're getting there...
Now that I have a working app, when do you think we'll have work again? |
|
|
|
|
|
Well, I got work...
<core_client_version>6.10.36</core_client_version>
<![CDATA[
<message>
process exited with code 127 (0x7f, -129)
</message>
<stderr_txt>
../../projects/www.primegrid.com/primegrid_ppsieve_1.21_i686-pc-linux-gnu__cuda23: error while loading shared libraries: ./libcudart.so.2: invalid ELF header
</stderr_txt>
]]>
I tried to run the app standalone and got this.
./primegrid_ppsieve_1.21_i686-pc-linux-gnu__cuda23: error while loading shared libraries: libcudart.so.2: wrong ELF class: ELFCLASS64
A 64 bit libcudart.so.2 is being shipped with the 32 bit app. |
|
|
Ken_g6 Volunteer developer
 Send message
Joined: 4 Jul 06 Posts: 921 ID: 3110 Credit: 218,950,902 RAC: 3,798
                          
|
|
The 64-bitness should be fixed shortly, if it isn't already.
In the meantime, I realized what I did wrong trying to make the screen more responsive. There are 1000ms in one second, not 100, so I was counting in centiseconds! I've now decreased each kernel runtime from 15cs to 15ms or so.
So, please, someone with Linux who had display usability problems, try this new version I just uploaded (manually, from the download link, not with BOINC). If this doesn't work, I have one more idea as well.
____________
|
|
|
|
|
The 64-bitness should be fixed shortly, if it isn't already.
In the meantime, I realized what I did wrong trying to make the screen more responsive. There are 1000ms in one second, not 100, so I was counting in centiseconds! I've now decreased each kernel runtime from 15cs to 15ms or so.
So, please, someone with Linux who had display usability problems, try this new version I just uploaded (manually, from the download link, not with BOINC). If this doesn't work, I have one more idea as well.
The responsiveness of the GUI (when running the new version) is much better, though the display still lags noticeably.
____________
|
|
|
Ken_g6 Volunteer developer
 Send message
Joined: 4 Jul 06 Posts: 921 ID: 3110 Credit: 218,950,902 RAC: 3,798
                          
|
|
Good! How does the speed compare? There may be a tradeoff between speed of GUI and speed of client.
If you don't have the previous version to compare against, either get it from BOINC or get it here.
____________
|
|
|
|
|
Good! How does the speed compare? There may be a tradeoff between speed of GUI and speed of client.
If you don't have the previous version to compare against, either get it from BOINC or get it here.
The new version:
Thread 0 completed
Waiting for threads to exit
Sieve complete: 42070000000000 <= p < 42070003000000
Found 12 factors
count=95668,sum=0x37dacb7121ccffe4
Elapsed time: 5.79 sec. (0.02 init + 5.77 sieve) at 545007 p/sec.
Processor time: 0.14 sec. (0.02 init + 0.12 sieve) at 26214400 p/sec.
Average processor utilization: 1.17 (init), 0.02 (sieve)
The old version (revision 12):
Thread 0 completed
Waiting for threads to exit
Sieve complete: 42070000000000 <= p < 42070003000000
Found 12 factors
count=95668,sum=0x37dacb7121ccffe4
Elapsed time: 5.75 sec. (0.02 init + 5.74 sieve) at 548259 p/sec.
Processor time: 0.14 sec. (0.02 init + 0.12 sieve) at 26214400 p/sec.
Average processor utilization: 1.22 (init), 0.02 (sieve)
---
Nearly the same timings for the 3M test... the 30M test follows
____________
|
|
|
|
|
|
The 30M test (new version):
Thread 0 completed
Waiting for threads to exit
Sieve complete: 42070000000000 <= p < 42070030000000
Found 97 factors
count=955289,sum=0x2dbc17167afb6a8d
Elapsed time: 49.44 sec. (0.02 init + 49.43 sieve) at 609944 p/sec.
Processor time: 0.52 sec. (0.01 init + 0.51 sieve) at 59110902 p/sec.
Average processor utilization: 0.63 (init), 0.01 (sieve)
The 30M test (old version - revision 12):
Thread 0 completed
Waiting for threads to exit
Sieve complete: 42070000000000 <= p < 42070030000000
Found 97 factors
count=955289,sum=0x2dbc17167afb6a8d
Elapsed time: 49.01 sec. (0.02 init + 48.99 sieve) at 615315 p/sec.
Processor time: 0.42 sec. (0.02 init + 0.40 sieve) at 75366400 p/sec.
Average processor utilization: 1.22 (init), 0.01 (sieve)
____________
|
|
|
Ken_g6 Volunteer developer
 Send message
Joined: 4 Jul 06 Posts: 921 ID: 3110 Credit: 218,950,902 RAC: 3,798
                          
|
|
Sounds worthwhile - less than a 1% slowdown for, I guess, much improved usability.
Now I just need to figure out Windows.
____________
|
|
|
|
|
|
So, who's in charge of releasing the work for cuda 32-bit? |
|
|
|
|
|
I'm certainly looking forward to running this on 64-bit Windows. :) |
|
|
|
|
The 64-bitness should be fixed shortly, if it isn't already.
Is this maybe a problem with the Mac GPU version as well? I had about 20 tasks lined up in the queue, and they all errored out immediately. Here is stderr:
<core_client_version>6.10.56</core_client_version>
<![CDATA[
<message>
process got signal 5
</message>
<stderr_txt>
dyld: Library not loaded: @rpath/libcudart.dylib
Referenced from: /Library/Application Support/BOINC Data/slots/3/../../projects/www.primegrid.com/primegrid_ppsieve_1.20_x86_64-apple-darwin__cuda23
Reason: no suitable image found. Did find:
/usr/local/cuda/lib//libcudart.dylib: mach-o, but wrong architecture
/usr/local/cuda/lib//libcudart.dylib: mach-o, but wrong architecture
</stderr_txt>
]]>
I'm on OS 10.6.3, cuda driver 3.0.14 for nvidia GT120 255MB. The Cuda control panel app reports that Cuda 3.1.10 is available. Is that the issue? Or I think OS 10.6.4 is downloadable now too.
FYI on the CPU version, the tasks run, complete, and report successfully, but if you watch their progress using the boinc client (I'm on 6.10.56), the progress percentage stays at 0.000% until they finish.
I did reset the primegrid project earlier today before I re-verified all of this.
--Gary |
|
|
|
|
|
Hi Gary, I think in CUDA 3.0 only the 32-bit library was shipped. Can you upgrade to 3.1 and confirm whether this resolves the problem? We may have to come up with a better workaround, but if you can let me know whether this helps, that would be great.
Thanks
- Iain |
|
|
gruenyVolunteer tester Send message
Joined: 12 Mar 08 Posts: 22 ID: 20129 Credit: 596,251 RAC: 0
              
|
|
I have the same problem as Gary with CUDA driver 3.1.10.
sample task: http://www.primegrid.com/result.php?resultid=179330409
grueny |
|
|
RytisVolunteer moderator Project administrator
 Send message
Joined: 22 Jun 05 Posts: 2651 ID: 1 Credit: 58,387,426 RAC: 116,228
                     
|
|
We are now bundling the library with Mac WUs. Let's see if it helps.
____________
|
|
|
|
|
|
This is very frustrating.
<core_client_version>6.10.36</core_client_version>
<![CDATA[
<message>
process exited with code 127 (0x7f, -129)
</message>
<stderr_txt>
../../projects/www.primegrid.com/primegrid_ppsieve_1.22_i686-pc-linux-gnu__cuda23: error while loading shared libraries: ./libcudart.so.2: invalid ELF header
</stderr_txt>
]]>
The app is looking for libcudart.so.2, but libcudart.so.2.32bit is downloaded. Also, it seems like a million tasks get downloaded before I get a chance to test it.
I did the following and now it's working.
ldconfig /var/lib/boinc/projects/www.primegrid.com
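An equivalent workaround (a sketch, demonstrated here in a scratch directory; on the real system you'd do this in the BOINC project directory and run ldconfig as root) is to symlink the downloaded library to the soname the app expects:

```shell
# Simulate the project dir; for real use it would be
# /var/lib/boinc/projects/www.primegrid.com
dir=$(mktemp -d)
touch "$dir/libcudart.so.2.32bit"                   # the file BOINC downloaded
ln -sf libcudart.so.2.32bit "$dir/libcudart.so.2"   # soname the app looks for
ls -l "$dir/libcudart.so.2"
```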
____________
6r39 7ri99
Beware the dual headed Gentoo with Wine! |
|
|
RytisVolunteer moderator Project administrator
 Send message
Joined: 22 Jun 05 Posts: 2651 ID: 1 Credit: 58,387,426 RAC: 116,228
                     
|
|
Actually, version 1.23 has already been released, which should fix the library naming problems.
____________
|
|
|
|
|
|
Looks like each task is taking about 2-3 seconds on my mac with 8800GT.
Edit: Nuts. They're all being marked as invalid.
Here is a sample:
http://www.primegrid.com/result.php?resultid=179479248
____________
Reno, NV
|
|
|
|
|
|
Ugh. Same problem as with the OSX AP26 CUDA app. One of my Macs has a GT120 with 256 MB RAM, but because a small amount of that is used by the OS, the tasks refuse to run. Can the required amount please be dropped by about 25 MB?
____________
Reno, NV
|
|
|
|
|
|
Is it normal for the app to take up one cpu core and slow my video way down? It reminds me of the Einstein cuda app.
____________
6r39 7ri99
Beware the dual headed Gentoo with Wine! |
|
|
Ken_g6 Volunteer developer
 Send message
Joined: 4 Jul 06 Posts: 921 ID: 3110 Credit: 218,950,902 RAC: 3,798
                          
|
Is it normal for the app to take up one cpu core
No! One entire core at or near 100%?! What OS are you using?
and slow my video way down?
That's normal, for now. I have a fix in the pipeline - and in the current binaries above too. But I also need to work on suspending, which I think I've solved for CPU apps now.
____________
|
|
|
Ken_g6 Volunteer developer
 Send message
Joined: 4 Jul 06 Posts: 921 ID: 3110 Credit: 218,950,902 RAC: 3,798
                          
|
|
I see you're using Linux 32-bit. The code to avoid using 100% CPU is mostly now in the driver. Perhaps you need a different driver version?
I do have a different mechanism to use very little CPU, if only I could detect this condition.
____________
|
|
|
|
|
|
Well I'm running Gentoo Linux and both Collatz and DNETC work well on it. Even though it used one core, the Einstein app also worked well. None of these slow my desktop down. I understand it's hard to optimize something when you don't have the hardware. What would you need to be able to detect it?
____________
6r39 7ri99
Beware the dual headed Gentoo with Wine! |
|
|
Ken_g6 Volunteer developer
 Send message
Joined: 4 Jul 06 Posts: 921 ID: 3110 Credit: 218,950,902 RAC: 3,798
                          
|
What would you need to be able to detect it?
Step 1: Read the code. It returns true if it worked or false if it didn't - doh!
Step 2: Write code to handle that case with my sleep timer.
Step 3: Handle cases like Michael Goetz's here. Need to think a little on that.
____________
|
|
|
|
|
|
Uh, panic - I think you guys need to pull the OSX application. It's throwing invalid work units all over the place; for example, one of the work units my Q6600 completed was valid while the other machine's wasn't, and it's got a task list of over 4,000.
EDIT: Oh, and a question - will this application be available for manual sieving? |
|
|
|
|
Hi Gary, I think in CUDA 3.0 only the 32-bit library was shipped. Can you upgrade to 3.1 and confirm whether this resolves the problem? We may have to come up with a better workaround, but if you can let me know whether this helps, that would be great.
Thanks
- Iain
Sorry about the slow response; I've been away.
No joy, but in a different way. This morning, before I upgraded anything (OSX 10.6.3, cuda 3.0, boinc 6.10.56), I was having the same behavior as reported by Seventh and Zombie: my WUs were "completing" in about 2 seconds, but all were marked invalid. Note that this was different than my original post, which was on an older rev of the app (I think) where the WUs error'd out immediately with the stderr output that I posted.
Later today I updated the CUDA driver to 3.1.10 and reset the project. Now the behavior is that the tasks run, but they never seem to finish. I let one go for almost two hours before I aborted it; its %-progress never budged from 0.0, and I let a couple of other tasks run nearly as long. Or should I have been more patient; is it just that the sieving range is so large that it would be expected to run that long? I'm used to the CPU tasks, which seem to run for 90 minutes or so. Also, the screen lag when a task is running is so bad that the machine isn't really usable.
EDIT: I see that I did manage to get one task validated overnight (http://www.primegrid.com/workunit.php?wuid=123759161) in a little over 2 hours wall clock time! So I'm going to let some more run and see what happens. Stay tuned.
--Gary |
|
|
|
|
|
Updated the driver again, and it works! Took about 30 minutes on my 8800GT.
However, the system became VERY laggy. Unusable even for simple things like reading email.
____________
Reno, NV
|
|
|
|
|
|
These PPS Sieve GPU tasks seem to be completing and validating now on the Mac, hooray! It was the update of the CUDA driver to 3.1.10 that seemed to do the trick.
I'm having the same poor experience as Zombie is with regard to screen update lag... too slow to be usable for much of anything. Simple typing of an email exhibits a delay of a few seconds between the time I press the key to the time the character appears on the screen. An improvement here would be *very* appreciated! Collatz has barely noticeable lag except under extreme graphic stress.
I guess I was just impatient initially and over-expectant on the WU duration. I had this mental picture that even with my fairly lame gpu (GT120) I'd be cranking them out every few minutes. Instead, it seems the wall-clock duration is running a bit over 2 hours for me. Is that to be expected? The ETC for the initial set of WUs was about 3 minutes each, which caused me to download a whole boatload of them (some of which I have aborted, since really it was several *days* worth of work, and I wasn't getting any CPU tasks in the meantime).
Anyway, thanks for getting my gpu working on finding primes!
--Gary |
|
|
Jay Volunteer tester
 Send message
Joined: 28 Apr 10 Posts: 82 ID: 59636 Credit: 10,419,429 RAC: 0
                  
|
|
I have the latest Mac CUDA driver, and I'm still unable to do anything on the Mac. The BOINC tasks sit at "Ready to start" and never run, and they just keep downloading every couple of seconds forever, until I manually stop them.
____________
|
|
|
|
|
|
Jay, it sounds almost like you have your BOINC manager set to "Use GPU Never," in which case it would download jobs, but not run them.
At least, this is one possible scenario.
____________
May the Force be with you always.
|
|
|
|
|
|
I have this problem on my iMac too. See here:
http://www.primegrid.com/orig/forum_thread.php?id=1737&nowrap=true#24823
It is because there is not enough RAM for the GPU. Mine has 256 MB, but because some of it gets used in overhead, BOINC thinks there is not enough memory available.
The only way to fix this is to reduce the memory required by the app by a few MB.
Edit: More info here:
http://www.primegrid.com/forum_thread.php?id=1669&nowrap=true#21475
____________
Reno, NV
|
|
|
Jay Volunteer tester
 Send message
Joined: 28 Apr 10 Posts: 82 ID: 59636 Credit: 10,419,429 RAC: 0
                  
|
|
Zombie, you're right, I'm in the same situation. I don't know how to construct an app_info.xml file, so I guess I'm out of luck unless they change the requirement?
____________
|
|
|
Ken_g6 Volunteer developer
 Send message
Joined: 4 Jul 06 Posts: 921 ID: 3110 Credit: 218,950,902 RAC: 3,798
                          
|
|
Alright, Linux users, I have some testing I need done. (How novel for a testing thread! :P)
I have a crippled version of PPSieve CUDA BOINC that will only use my sleep-wait system instead of the driver's better system. The question is whether the changes I made to improve GUI performance will make it use 95% of a CPU, or whether it will still use only 5%.
So can somebody please test it out and let me know?
Thanks!
____________
|
|
|
|
|
|
The CUDA version that is handed out via BOINC gives me an error. However, the WU was stopped and resumed once.
This is a 9400 GT with 2 SP and 1 GB of RAM by the way. |
|
|
Ken_g6 Volunteer developer
 Send message
Joined: 4 Jul 06 Posts: 921 ID: 3110 Credit: 218,950,902 RAC: 3,798
                          
|
|
"Maximum elapsed time exceeded" seems to be a BOINC configuration problem in general. SETI and AQUA have posts about it happening there too.
You say you stopped and resumed the WU. Did you do that shortly after it started? I don't see any evidence of checkpointing.
P.S. Everyone: Don't forget about my testing request two posts up!
____________
|
|
|
|
|
|
The problem was that X was eating up one core (2 threads with hyperthreading, btw); switching to console mode was only possible after setting "use GPU while user active" to no via a remote BOINC Manager, and resuming after switching to the console.
|
|
|
|
|
Hi,
I have a laptop with an NVIDIA Quadro NVS 140M card (Ubuntu 10.04 + CUDA 3.1 installed).
Crunching a single WU on this GPU takes almost 10 hours.
This task, for example:
Client : Proth Prime Search (Sieve) v1.23 (cuda23)
Time : 34,176.15
But only 134.87 granted credit!!!
Is there a problem with my card? Is 10 hours too long, or a normal duration?
And what about the granted credit: 134 for 10 hours? Is that normal too?
Thanks
____________
Badge Score: 1*2 + 5*5 + 9*6 + 3*7 + 2*8 + 1*9 = 127 |
|
|
|
|
I have a laptop with an NVIDIA Quadro NVS 140M card (Ubuntu 10.04 + CUDA 3.1 installed)
I think your GPU may be too slow - it performs like a 9400 GT, having only 16 cores.
|
|
|
|
|
ok, too bad...
PS: How do I disable GPU work for PrimeGrid? I can only find "suspend while in use".
[Edit] Found it. Sorry, I was looking in the wrong place.
Thanks
____________
Badge Score: 1*2 + 5*5 + 9*6 + 3*7 + 2*8 + 1*9 = 127 |
|
|
Ken_g6 Volunteer developer
 Send message
Joined: 4 Jul 06 Posts: 921 ID: 3110 Credit: 218,950,902 RAC: 3,798
                          
|
|
Alright, nobody's tested my crippled version that I know of, but I'm releasing PPSieve CUDA V0.1.3 anyway. The major changes are:
All versions could use some testing, so if there are testers hiding in the woodwork, now would be a good time to come out. ;)
____________
|
|
|
|
|
|
I would like to help, but I have no CUDA-capable cards, just ATIs. |
|
|
Scott Brown Volunteer moderator Project administrator Volunteer tester Project scientist
 Send message
Joined: 17 Oct 05 Posts: 2258 ID: 1178 Credit: 10,867,108,087 RAC: 11,866,263
                                        
|
Alright, nobody's tested my crippled version that I know of, but I'm releasing PPSieve CUDA V0.1.3 anyway. The major changes are:
All versions could use some testing, so if there are testers hiding in the woodwork, now would be a good time to come out. ;)
Sorry, I don't have good access to my couple of Linux boxes to test. I can test on Windows, but the link above only has the Linux version.
____________
141941*2^4299438-1 is prime!
|
|
|
|
|
|
How would I go about getting this running on BOINC with Windows? I've never made my own app_info.xml file before. :) |
|
|
Ken_g6 Volunteer developer
 Send message
Joined: 4 Jul 06 Posts: 921 ID: 3110 Credit: 218,950,902 RAC: 3,798
                          
|
|
Scott, when I download the link at the top, I see two .exe files. Don't you?
____________
|
|
|
Scott Brown Volunteer moderator Project administrator Volunteer tester Project scientist
 Send message
Joined: 17 Oct 05 Posts: 2258 ID: 1178 Credit: 10,867,108,087 RAC: 11,866,263
                                        
|
Scott, when I download the link at the top, I see two .exe files. Don't you?
Doh! ...I clicked the Linux link by mistake and got the ppssieve-cuda-a1 zip file.
Downloaded from the correct link and here are the test results:
13:59:49 (2688): Can't open init data file - running in standalone mode
Sieve started: 42070000000000 <= p < 42070030000000
Thread 0 starting
Detected GPU 0: GeForce 9600 GSO
Detected compute capability: 1.1
Detected 12 multiprocessors.
Thread 0 completed
Sieve complete: 42070000000000 <= p < 42070030000000
count=955289,sum=0x2dbc17167afb6a8d
Elapsed time: 62.99 sec. (18433820296920.45 init + 12923776852.09 sieve) at 0 p/sec.
Processor time: 2.25 sec. (0.00 init + 2.25 sieve) at 13398471 p/sec.
Average processor utilization: 0.00 (init), 0.00 (sieve)
14:00:52 (2688): called boinc_finish
Modest screen lag, but useable with this card.
Pentium D extreme edition 965 (2 cores with HT = 4 threads)
4GB RAM (3.5 recognized by OS)
Win XP Pro (SP3)
____________
141941*2^4299438-1 is prime!
|
|
|
|
|
|
I got the following on a 9800GT Eco/Green Edition under XP x64.
0003e6 -k 1201 -K 9999 -N 2000000 -z normal
ppsieve version cuda-0.1.3 (testing)
nstart=76, nstep=32, gpu_nstep=32
ppsieve initialized: 1201 <= k <= 9999, 76 <= n <= 2000000
Sieve started: 42070000000000 <= p < 42070003000000
Thread 0 starting
Detected GPU 0: GeForce 9800 GT
Detected compute capability: 1.1
Detected 14 multiprocessors.
42070000070587 | 9475*2^197534+1
42070000198537 | 3373*2^1046686+1
42070000300049 | 9139*2^461846+1
42070000464001 | 4179*2^1577462+1
42070001011573 | 7113*2^215532+1
42070002690167 | 2553*2^1888870+1
Thread 0 completed
Waiting for threads to exit
Sieve complete: 42070000000000 <= p < 42070003000000
Found 6 factors
count=95668,sum=0x37dacb7121ccffe4
Elapsed time: 6.98 sec. (0.03 init + 6.95 sieve) at 452419 p/sec.
Processor time: 0.17 sec. (0.02 init + 0.16 sieve) at 20132659 p/sec.
Average processor utilization: 0.50 (init), 0.02 (sieve)
More screen lag compared to Collatz, but not unusable. |
|
|
vasm Volunteer tester
 Send message
Joined: 6 Dec 08 Posts: 47 ID: 32604 Credit: 990,892 RAC: 0
                
|
|
Windows XP 32-bit, 8800GT.
ppsieve version cuda-0.1.3 (testing)
nstart=76, nstep=32, gpu_nstep=32
ppsieve initialized: 1201 <= k <= 9999, 76 <= n <= 2000000
Sieve started: 42070000000000 <= p < 42070003000000
Thread 0 starting
Detected GPU 0: GeForce 8800 GT
Detected compute capability: 1.1
Detected 14 multiprocessors.
42070000070587 | 9475*2^197534+1
42070000198537 | 3373*2^1046686+1
42070000300049 | 9139*2^461846+1
42070000464001 | 4179*2^1577462+1
42070001011573 | 7113*2^215532+1
42070002690167 | 2553*2^1888870+1
Thread 0 completed
Waiting for threads to exit
Sieve complete: 42070000000000 <= p < 42070003000000
Found 6 factors
count=95668,sum=0x37dacb7121ccffe4
Elapsed time: 5.73 sec. (0.03 init + 5.70 sieve) at 551580 p/sec.
Processor time: 0.33 sec. (0.06 init + 0.27 sieve) at 11842741 p/sec.
Average processor utilization: 2.00 (init), 0.05 (sieve)
The longer test gave 67 factors as expected.
Thread 0 completed
Waiting for threads to exit
Sieve complete: 42070000000000 <= p < 42070030000000
Found 67 factors
count=955289,sum=0x2dbc17167afb6a8d
Elapsed time: 56.09 sec. (0.02 init + 56.08 sieve) at 537581 p/sec.
Processor time: 1.31 sec. (0.05 init + 1.27 sieve) at 23819504 p/sec.
Average processor utilization: 3.00 (init), 0.02 (sieve)
There was screen lag (indeed more than with the Collatz app). You wouldn't call it unusable, but over a long running time it's certainly annoying. |
|
|
|
|
|
So now on this host without X-Window top gives me this:
5363 root 39 19 20544 9644 488 S 100.6 0.2 67:45.48 primegrid_ppsie
5367 root 39 19 20536 9636 488 S 100.6 0.2 67:33.83 primegrid_ppsie
5359 root 39 19 20544 9644 488 S 99.6 0.2 67:44.06 primegrid_ppsie
5370 root 39 19 20328 9424 488 S 99.6 0.2 67:08.97 primegrid_ppsie
5371 root 39 19 20328 9428 488 S 74.7 0.2 68:07.46 primegrid_ppsie
5360 root 39 19 20544 9640 488 S 73.7 0.2 68:06.28 primegrid_ppsie
5365 root 39 19 20544 9636 488 S 73.7 0.2 67:46.84 primegrid_ppsie
5368 root 39 19 20480 9580 488 S 73.7 0.2 68:29.55 primegrid_ppsie
5466 root 24 9 59592 23m 14m S 50.8 0.6 19:28.17 primegrid_ppsie
One GPU WU seems to be completed in about 45 minutes, one CPU WU in about 180 minutes - 32 WU/day on the GPU versus 64 WU/day across the eight CPU processes.
This does not seem to be the version that was used for the off-campus work via the primesearchteam forum: there my GTX 260 achieved performance slightly better than the CPU of the computer it is installed in - a Xeon W3520...
Once my downloaded work is done, I will test the test version if that is still needed. |
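As a sanity check on those rates, the arithmetic is simple (a sketch; the 45- and 180-minute per-WU times and the eight CPU processes come from the top listing above):

```cpp
// Work units completed per day, given minutes per WU and the number of
// workers (one GPU, or eight CPU sieve processes) running in parallel.
int wus_per_day(int minutes_per_wu, int workers) {
    return workers * (24 * 60 / minutes_per_wu);
}
```

One GPU at 45 min/WU gives 32 WU/day; eight CPU processes at 180 min/WU give 64 WU/day, matching the figures quoted.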
|
|
Ken_g6 Volunteer developer
 Send message
Joined: 4 Jul 06 Posts: 921 ID: 3110 Credit: 218,950,902 RAC: 3,798
                          
|
|
So I assume that last process is supposed to be the GPU process? It's using 50% of one core? That's not right! It's likely you need to update your drivers. I wish I could give users messages to that effect; right now they appear in the stderr log, but nobody sees them.
____________
|
|
|
|
|
|
Which driver version should I use?
Right now I have 190.53, which worked like a charm with the off-campus version and with the AP26 app (and my further accelerated AP26 app).
I have now tested X-Window - it is not usable at all. |
|
|
Ken_g6 Volunteer developer
 Send message
Joined: 4 Jul 06 Posts: 921 ID: 3110 Credit: 218,950,902 RAC: 3,798
                          
|
|
I don't actually know what drivers to use; I don't have a compute-capable video card. But upgrading drivers has helped others.
You might try putting the file being distributed through BOINC through the manual test. If it doesn't report as v0.1.3, it's not the latest version.
Also, feel free to look through the source (this link; the top post's link is dead) if you want.
____________
|
|
|
|
|
|
It reported 0.1.2a (testing) via console - however the app was downloaded today:
(...)
So 18 Jul 2010 15:41:45 CEST PrimeGrid Sending scheduler request: Requested by user.
So 18 Jul 2010 15:41:45 CEST PrimeGrid Requesting new tasks for CPU and GPU
So 18 Jul 2010 15:41:50 CEST PrimeGrid Scheduler request completed: got 22 new tasks
So 18 Jul 2010 15:41:52 CEST PrimeGrid Started download of primegrid_ppsieve_1.21_x86_64-pc-linux-gnu__cuda23
So 18 Jul 2010 15:41:52 CEST PrimeGrid Started download of primegrid_ppsieve_1.20_x86_64-pc-linux-gnu
So 18 Jul 2010 15:42:07 CEST PrimeGrid Finished download of primegrid_ppsieve_1.21_x86_64-pc-linux-gnu__cuda23
So 18 Jul 2010 15:42:07 CEST PrimeGrid [coproc_debug] Assigning CUDA instance 0 to pps_sr2sieve_1413304_3
(...)
According to the profiler, it is not eating up CPU time:
(...)
method=[ _Z15d_check_more_nsPKmS0_PmjPh ] gputime=[ 20718.145 ] cputime=[ 1.000 ] occupancy=[ 0.750 ]
method=[ _Z15d_check_more_nsPKmS0_PmjPh ] gputime=[ 20719.168 ] cputime=[ 1.000 ] occupancy=[ 0.750 ]
method=[ _Z15d_check_more_nsPKmS0_PmjPh ] gputime=[ 20718.336 ] cputime=[ 1.000 ] occupancy=[ 0.750 ]
method=[ _Z15d_check_more_nsPKmS0_PmjPh ] gputime=[ 20718.305 ] cputime=[ 0.000 ] occupancy=[ 0.750 ]
method=[ _Z15d_check_more_nsPKmS0_PmjPh ] gputime=[ 20719.039 ] cputime=[ 0.000 ] occupancy=[ 0.750 ]
method=[ _Z15d_check_more_nsPKmS0_PmjPh ] gputime=[ 20718.912 ] cputime=[ 1.000 ] occupancy=[ 0.750 ]
method=[ _Z15d_check_more_nsPKmS0_PmjPh ] gputime=[ 20721.473 ] cputime=[ 1.000 ] occupancy=[ 0.750 ]
(...)
I took the one from the off-campus work and ran it against the one compiled from the source at your link:
compiled-from-provided-source-version cuda-0.1.3 (testing)
p=249871015204353, 253.4K p/sec, 0.90 CPU cores, 1.5% done. ETA 18 Jul 19:48
off-campus-version cuda-0.1.1 (testing)
p=249871082837505, 917.4K p/sec, 0.08 CPU cores, 8.3% done. ETA 18 Jul 19:03 |
|
|
Ken_g6 Volunteer developer
 Send message
Joined: 4 Jul 06 Posts: 921 ID: 3110 Credit: 218,950,902 RAC: 3,798
                          
|
|
Ah. I've been decreasing the kernel size (breaking up the tasks) in order to leave more time for the OS and prevent the screen from updating too slowly. That size is controlled by the ITERATIONS_PER_KERNEL macro in appcu.h. Bigger numbers mean bigger kernel runs.
In testing, no one saw much of a slowdown from the change (introduced in v0.1.2a, coming from 0.1.2), but I guess you are having problems. I'm also guessing you found a way to make the AP26 app fast without interfering with the screen? I'd appreciate any suggestions.
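The trade-off behind ITERATIONS_PER_KERNEL can be sketched abstractly (only the macro name comes from appcu.h; the function below is illustrative): a smaller kernel size means the same range is covered by more, shorter launches, giving the display driver more chances to run in between, at the cost of extra launch overhead.

```cpp
// Kernel launches needed to cover `total` iterations when each launch runs
// at most `iters_per_kernel` of them (ceiling division). Halving the kernel
// size doubles the launch count and halves each GPU-busy period.
long launches_needed(long total, long iters_per_kernel) {
    return (total + iters_per_kernel - 1) / iters_per_kernel;
}
```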
____________
|
|
|
|
|
|
In fact, the 0.1.1 version looked like this in cuda_profile.log:
(...)
method=[ memcpyDtoH ] gputime=[ 7.776 ] cputime=[ 21.000 ]
method=[ memcpyHtoD ] gputime=[ 31.680 ] cputime=[ 60.000 ]
method=[ _Z10d_check_nsPKmPh ] gputime=[ 764273.500 ] cputime=[ 2.000 ] occupancy=[ 0.750 ]
method=[ memcpyDtoH ] gputime=[ 7.840 ] cputime=[ 21.000 ]
method=[ memcpyHtoD ] gputime=[ 31.616 ] cputime=[ 62.000 ]
method=[ _Z10d_check_nsPKmPh ] gputime=[ 764225.000 ] cputime=[ 2.000 ] occupancy=[ 0.750 ]
method=[ memcpyDtoH ] gputime=[ 7.840 ] cputime=[ 21.000 ]
method=[ memcpyHtoD ] gputime=[ 31.552 ] cputime=[ 59.000 ]
method=[ _Z10d_check_nsPKmPh ] gputime=[ 764022.250 ] cputime=[ 2.000 ] occupancy=[ 0.750 ]
(...)
compared with the 0.1.2-version
(...)
method=[ _Z15d_check_more_nsPKmS0_PmjPh ] gputime=[ 20716.512 ] cputime=[ 0.000 ] occupancy=[ 0.750 ]
method=[ _Z15d_check_more_nsPKmS0_PmjPh ] gputime=[ 20717.633 ] cputime=[ 1.000 ] occupancy=[ 0.750 ]
method=[ _Z15d_check_more_nsPKmS0_PmjPh ] gputime=[ 20716.385 ] cputime=[ 1.000 ] occupancy=[ 0.750 ]
method=[ _Z15d_check_more_nsPKmS0_PmjPh ] gputime=[ 20717.729 ] cputime=[ 1.000 ] occupancy=[ 0.750 ]
method=[ _Z15d_check_more_nsPKmS0_PmjPh ] gputime=[ 20719.232 ] cputime=[ 1.000 ] occupancy=[ 0.750 ]
method=[ _Z15d_check_more_nsPKmS0_PmjPh ] gputime=[ 20716.545 ] cputime=[ 1.000 ] occupancy=[ 0.750 ]
method=[ _Z15d_check_more_nsPKmS0_PmjPh ] gputime=[ 20719.359 ] cputime=[ 0.000 ] occupancy=[ 0.750 ]
method=[ _Z15d_check_more_nsPKmS0_PmjPh ] gputime=[ 20718.656 ] cputime=[ 1.000 ] occupancy=[ 0.750 ]
method=[ _Z15d_check_more_nsPKmS0_PmjPh ] gputime=[ 20721.119 ] cputime=[ 1.000 ] occupancy=[ 0.750 ]
method=[ _Z15d_check_more_nsPKmS0_PmjPh ] gputime=[ 20719.359 ] cputime=[ 1.000 ] occupancy=[ 0.750 ]
(...)
The 0.1.1 version had almost no CPU usage while being three times faster on the console. I have now tested this old version with X-Window, where it also used 100% CPU; I hadn't tested that before, since I use my clients more as crunchers than as surf stations. ;)
It seems that setting blocking sync by calling the SetCUDABlockingSync function is not really working.
I will look further into it, but that needs a bit of time. |
|
|
|
|
|
Okay, as a quick-and-dirty test I swapped this
bool SetCUDABlockingSync(int device) {
    CUdevice hcuDevice;
    CUcontext hcuContext;
    CUresult status = cuInit(0);
    if(status != CUDA_SUCCESS)
        return false;
    status = cuDeviceGet(&hcuDevice, device);
    if(status != CUDA_SUCCESS)
        return false;
    status = cuCtxCreate(&hcuContext, 0x4, hcuDevice);
    if(status != CUDA_SUCCESS)
        return false;
    return true;
}
with this
bool SetCUDABlockingSync(int device) {
    cudaError_t status = cudaGetLastError();
    if(status != cudaSuccess)
        return false;
    status = cudaSetDevice(device);
    if(status != cudaSuccess)
        return false;
    status = cudaSetDeviceFlags(cudaDeviceScheduleSpin);
    if(status != cudaSuccess)
        return false;
    return true;
}
The first version on the console gave this:
# time ./primegrid_ppsieve_1.21_x86_64-pc-linux-gnu__cuda23 -p249871e9 -P249872e9 -k 1201 -K 9999 -N 2000000 -c 60
ppsieve version cuda-0.1.3 (testing)
Compiled Jul 18 2010 with GCC 4.1.2 20080704 (Red Hat 4.1.2-48)
nstart=80, nstep=35, gpu_nstep=35
ppsieve initialized: 1201 <= k <= 9999, 80 <= n <= 2000000
Sieve started: 249871000000000 <= p < 249872000000000
Thread 0 starting
Detected GPU 0: GeForce GTX 260
Detected compute capability: 1.3
Detected 27 multiprocessors.
249871003789289 | 6295*2^266404+1
249871003804313 | 1897*2^1790254+1
249871004642153 | 4393*2^720262+1
249871008061891 | 3105*2^1189485+1
249871008485251 | 4787*2^131683+1
249871009106447 | 8785*2^1246050+1
249871009510013 | 2771*2^1272671+1
249871010360639 | 1743*2^1337710+1
p=249871015204353, 253.4K p/sec, 0.70 CPU cores, 1.5% done. ETA 18 Jul 20:46
249871017008411 | 7771*2^828544+1
249871018975427 | 5057*2^799271+1
Thread 0 interrupted
Waiting for threads to exit
Sieve incomplete: 249871000000000 <= p < 249871019398657
Found 10 factors
count=585037,sum=0xecb522c3b23b7573
Elapsed time: 76.37 sec. (0.05 init + 76.32 sieve) at 254167 p/sec.
Processor time: 51.30 sec. (0.05 init + 51.25 sieve) at 378509 p/sec.
Average processor utilization: 1.05 (init), 0.67 (sieve)
The second version, this:
# time ./primegrid_ppsieve_1.21_x86_64-pc-linux-gnu__cuda23 -p249871e9 -P249872e9 -k 1201 -K 9999 -N 2000000 -c 60
ppsieve version cuda-0.1.3 (testing)
Compiled Jul 18 2010 with GCC 4.1.2 20080704 (Red Hat 4.1.2-48)
nstart=80, nstep=35, gpu_nstep=35
ppsieve initialized: 1201 <= k <= 9999, 80 <= n <= 2000000
Sieve started: 249871000000000 <= p < 249872000000000
Thread 0 starting
Detected GPU 0: GeForce GTX 260
Detected compute capability: 1.3
Detected 27 multiprocessors.
249871003789289 | 6295*2^266404+1
249871003804313 | 1897*2^1790254+1
249871004642153 | 4393*2^720262+1
249871008061891 | 3105*2^1189485+1
249871008485251 | 4787*2^131683+1
249871009106447 | 8785*2^1246050+1
249871009510013 | 2771*2^1272671+1
249871010360639 | 1743*2^1337710+1
249871017008411 | 7771*2^828544+1
249871018975427 | 5057*2^799271+1
249871020273263 | 5591*2^103221+1
Thread 0 interrupted
Waiting for threads to exit
Sieve incomplete: 249871000000000 <= p < 249871021495809
Found 11 factors
count=648218,sum=0xc7cc2f6da2bbb5a0
Elapsed time: 25.62 sec. (0.04 init + 25.58 sieve) at 840229 p/sec.
Processor time: 25.59 sec. (0.04 init + 25.55 sieve) at 841484 p/sec.
Average processor utilization: 1.06 (init), 1.00 (sieve)
real 0m25.626s
user 0m25.455s
sys 0m0.133s |
|
|
|
|
|
Using cudaSleepMemcpyFromTime as opposed to cudaMemcpy yields the following results:
# time ./primegrid_ppsieve_1.21_x86_64-pc-linux-gnu__cuda23 -p249871e9 -P249872e9 -k 1201 -K 9999 -N 2000000 -c 60
ppsieve version cuda-0.1.3 (testing)
Compiled Jul 18 2010 with GCC 4.1.2 20080704 (Red Hat 4.1.2-48)
nstart=80, nstep=35, gpu_nstep=35
ppsieve initialized: 1201 <= k <= 9999, 80 <= n <= 2000000
Sieve started: 249871000000000 <= p < 249872000000000
Thread 0 starting
Detected GPU 0: GeForce GTX 260
Detected compute capability: 1.3
Detected 27 multiprocessors.
249871003789289 | 6295*2^266404+1
249871003804313 | 1897*2^1790254+1
249871004642153 | 4393*2^720262+1
249871008061891 | 3105*2^1189485+1
249871008485251 | 4787*2^131683+1
249871009106447 | 8785*2^1246050+1
249871009510013 | 2771*2^1272671+1
249871010360639 | 1743*2^1337710+1
249871017008411 | 7771*2^828544+1
249871018975427 | 5057*2^799271+1
249871020273263 | 5591*2^103221+1
249871021502419 | 1387*2^1058576+1
249871022451563 | 1583*2^903713+1
249871024708423 | 5167*2^1078420+1
249871026238859 | 3819*2^582413+1
249871027030549 | 8865*2^1534637+1
249871027591477 | 1465*2^563238+1
249871028300963 | 3985*2^1927304+1
249871030282733 | 3155*2^1844083+1
249871030776329 | 7815*2^1679937+1
249871032591751 | 2335*2^23512+1
Thread 0 interrupted
Waiting for threads to exit
Sieve incomplete: 249871000000000 <= p < 249871034603009
Found 21 factors
count=1043691,sum=0x232990447681918b
Elapsed time: 41.39 sec. (0.04 init + 41.35 sieve) at 836819 p/sec.
Processor time: 3.74 sec. (0.04 init + 3.69 sieve) at 9368777 p/sec.
Average processor utilization: 1.06 (init), 0.09 (sieve)
X-Window is usable, with 30% load on one CPU.
This holds no matter which device flag is set. |
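The sleep-polling idea behind cudaSleepMemcpyFromTime can be sketched without any CUDA at all (an assumption on my part: that it starts the copy asynchronously and then sleeps between completion polls; here a worker thread stands in for the GPU):

```cpp
#include <atomic>
#include <chrono>
#include <thread>

// Wait for `done` by sleeping between polls. Unlike a spin-wait, which
// pins a core at 100%, this burns almost no CPU while the "GPU" works.
bool sleep_poll_wait(std::atomic<bool>& done, int poll_us) {
    while (!done.load()) {
        std::this_thread::sleep_for(std::chrono::microseconds(poll_us));
    }
    return true;
}
```

In the real app the poll would presumably be a completion query on the asynchronous copy rather than a flag check; the CPU-usage difference in the logs above comes from this sleep-versus-spin choice.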
|
|
Ken_g6 Volunteer developer
 Send message
Joined: 4 Jul 06 Posts: 921 ID: 3110 Credit: 218,950,902 RAC: 3,798
                          
|
|
Well that's a surprise! The one piece of official BOINC CUDA code I used, and it's wrong?
Does that new code depend on CUDA 3.0 or higher? I shall have to investigate more after lunch.
____________
|
|
|
|
|
|
cudaSleepMemcpyFromTime with usleep is okay (in case you missed my newest post).
I will test the newest driver, 256.40, right now.
EDIT: forget it, I miscopied.
driver 190.53
official blockingsync-code with cudaMemcpy: 250 k p/s, unusable X-window (100 % cpu)
official blockingsync-code with cudaSleepMemcpyFromTime: 250 k p/s, usable X-window (~30 % CPU)
my (in fact mfl0p's) ScheduleSpin-code with cudaMemcpy: 840 k p/s, unusable X-window (100 % cpu)
my (in fact mfl0p's) ScheduleSpin-code with cudaSleepMemcpyFromTime: 830 k p/s, usable X-window (~30 % cpu)
If you do something in X-Window, CPU usage goes up to 100%; if you only watch it, it stays down at 30%.
I will now try the new driver. |
|
|
Ken_g6 Volunteer developer
 Send message
Joined: 4 Jul 06 Posts: 921 ID: 3110 Credit: 218,950,902 RAC: 3,798
                          
|
|
Drivers! Right!
First, thanks for testing cudaSleepMemcpyFromTime. It worked fine in Linux (this was a rewrite of it that needed testing), but in Windows I got things like this. So I included that code from BOINC, and Windows both worked and used less CPU than Linux.
So then I decided to incorporate that BOINC code into Linux. And I think that's when the problem arose, because compiling that code in Linux requires compiling with the drivers. Why they did it that way I'll never know, but I compiled 0.1.2-0.1.3 with 256.25 drivers. That has to be the problem!
I wonder how old a driver I could use? Or does that code you posted do the same thing without requiring the drivers?
____________
|
|
|
|
|
|
At least it uses the "CUDA Runtime API", as opposed to the official code snippet, which uses the "CUDA Driver API".
The programming guide states that whenever you use the "CUDA Driver API" you must load your kernels as PTX modules. Your code uses the "CUDA Runtime API"; mixing the two is not described in the programming guide, so I am puzzled. The "CUDA Driver API" seems to be a lot more complicated for no good reason other than using third-party modules... |
|
|
|
|
|
Now with 256.40 the 1.21 app I got downloaded via BOINC from the project seems to no longer use half a CPU in console mode. X-Window is not usable though, hogging one CPU at 100%. If I extrapolate, the runtime now seems to be down to 20 minutes - that would be the expected performance. As soon as the BOINC WU is done, I will test the four source-code versions as before with the 190.53 driver and post the results. |
|
|
|
|
|
There now seems to be no difference between the BOINC-proposed code and the usleep code with cudaDeviceScheduleSpin. With the code proposed by BOINC one needs the newest driver, so I would use the usleep code for Linux, since then one does not need the newest driver. But my code additionally needs some context creation and thread handling, which the driver API apparently takes care of.
In fact, the problem seems to be the cudaDeviceScheduleSpin call in check_ns prior to the first cudaMemcpy.
Without it (i.e. deleted) I got around 850 k p/s and "Average processor utilization: 1.07 (init), 0.01 (sieve)" - it was 0.08 (sieve) with cudaDeviceScheduleSpin.
(...)
    unsigned int n;
    // timing variables:
    cudaError_t res;
    // Pass P.
    cudaSetDeviceFlags(cudaDeviceScheduleSpin);
    res = cudaMemcpy(d_P, P, cthread_count*sizeof(uint64_t), cudaMemcpyHostToDevice);
(...) |
|
|
Ken_g6 Volunteer developer
 Send message
Joined: 4 Jul 06 Posts: 921 ID: 3110 Credit: 218,950,902 RAC: 3,798
                          
|
|
I'm not sure what your conclusion was in that last post. Would you be willing to just upload it to GitHub for me to look at? If so, send me your GitHub username and I'll add you to the project.
____________
|
|
|
|
|
|
I only wanted to say that with the new drivers it is now okay the way it is proposed by BOINC; I had merely built a small error into my version of the source, a second setDeviceFlags call, which caused problems after updating the driver.
I think the driver version used while building the app should be stated, since older ones do cause trouble.
Plus, I have no idea why I got the old app downloaded today/yesterday (depending on one's timezone). |
|
|
Ken_g6 Volunteer developer
 Send message
Joined: 4 Jul 06 Posts: 921 ID: 3110 Credit: 218,950,902 RAC: 3,798
                          
|
|
OK, I've pushed PPSieve-CUDA 0.1.3a, with Roadrunner_gs' changes to eliminate the driver dependency. So on Linux this version should be more likely to work for any given user.
Let me know if there's any problem with it. I'd like Iain to use this source for the new Mac builds whenever he gets back.
____________
|
|
|
|
|
|
I still have the same 0.1.2a (testing) that I obtained on the 18th of July; no new application was downloaded... :/ |
|
|
BiBi Volunteer tester Send message
Joined: 6 Mar 10 Posts: 151 ID: 56425 Credit: 34,290,031 RAC: 0
                   
|
|
How are things going with the windows version?
____________
|
|
|
Ken_g6 Volunteer developer
 Send message
Joined: 4 Jul 06 Posts: 921 ID: 3110 Credit: 218,950,902 RAC: 3,798
                          
|
|
BiBi, thanks for confirming I had the right libraries. I forgot to get back to you; but the Windows apps seem to be working fine!
Except, Gerrit, I may have included the wrong Windows binary in that zipfile! The Windows BOINC binary and all the other binaries look correct, so try one of those.
____________
|
|
|
|
|
|
Heck?
I am speaking of the application that was handed out by the project via BOINC.
Mine seems to be 1.21 (it says so in the BOINC Manager WU properties), as opposed to the 1.23 stated on the apps page.
I haven't downloaded new source code from git.
I am just scratching my head over BOINC/the project...
whoa! I just got some 27 hours' worth of GPU work downloaded... |
|
|
|
|
|
The 9800GT in my system is looking forward to sieving. Just waiting on that Windows app. :D |
|
|
Ken_g6 Volunteer developer
 Send message
Joined: 4 Jul 06 Posts: 921 ID: 3110 Credit: 218,950,902 RAC: 3,798
                          
|
|
Unfortunately, the version numbers on the BOINC apps page don't match my version numbers at all! The versions there run from 1.20 to 1.23, but my CPU apps there are 0.3.6 and my GPU apps there are probably 0.1.2.
The most current CPU apps are 0.3.8 or 0.3.4c, and the most current GPU apps are 0.1.3a. The apps in BOINC haven't been updated for two reasons: One, the people with access to that stuff (just Rytis?) have been busy, and two, some of the GPU apps haven't been tested enough yet.
So the sooner we get some testing done, here, with the procedure given at the top of the thread, the sooner we can hopefully get updated BOINC apps.
Edit2: Don't forget: the short test should now print only these 6 factors:
42070000070587 | 9475*2^197534+1
42070000198537 | 3373*2^1046686+1
42070000300049 | 9139*2^461846+1
42070000464001 | 4179*2^1577462+1
42070001011573 | 7113*2^215532+1
42070002690167 | 2553*2^1888870+1
____________
|
|
|
|
|
|
Using ppsieve-cuda-x86-windows.exe with all 4 cores under load from PRPNet/Geneferx64:
ppsieve version cuda-0.1.3 (testing)
nstart=76, nstep=32, gpu_nstep=32
ppsieve initialized: 1201 <= k <= 9999, 76 <= n <= 2000000
Sieve started: 42070000000000 <= p < 42070003000000
Thread 0 starting
Detected GPU 0: GeForce 9800 GT
Detected compute capability: 1.1
Detected 14 multiprocessors.
42070000070587 | 9475*2^197534+1
42070000198537 | 3373*2^1046686+1
42070000300049 | 9139*2^461846+1
42070000464001 | 4179*2^1577462+1
42070001011573 | 7113*2^215532+1
42070002690167 | 2553*2^1888870+1
Thread 0 completed
Waiting for threads to exit
Sieve complete: 42070000000000 <= p < 42070003000000
Found 6 factors
count=95668,sum=0x37dacb7121ccffe4
Elapsed time: 7.11 sec. (0.03 init + 7.08 sieve) at 444430 p/sec.
Processor time: 0.38 sec. (0.05 init + 0.33 sieve) at 9586981 p/sec.
Average processor utilization: 1.50 (init), 0.05 (sieve)
And using ppsieve-cuda-boinc-x86-windows.exe:
21:26:00 (160): Can't open init data file - running in standalone mode
Sieve started: 42070000000000 <= p < 42070003000000
Thread 0 starting
Detected GPU 0: GeForce 9800 GT
Detected compute capability: 1.1
Detected 14 multiprocessors.
Thread 0 completed
Sieve complete: 42070000000000 <= p < 42070003000000
count=95668,sum=0x37dacb7121ccffe4
Elapsed time: 33.56 sec. (18433819337745.99 init + 12924735997.13 sieve) at 0 p/sec.
Processor time: 0.31 sec. (0.00 init + 0.31 sieve) at 10066330 p/sec.
Average processor utilization: 0.00 (init), 0.00 (sieve)
21:26:37 (160): called boinc_finish
It seems the BOINC version doesn't like it when all 4 cores are under load, because when I free one of them, I get the following:
21:27:18 (2088): Can't open init data file - running in standalone mode
Sieve started: 42070000000000 <= p < 42070003000000
Thread 0 starting
Detected GPU 0: GeForce 9800 GT
Detected compute capability: 1.1
Detected 14 multiprocessors.
Thread 0 completed
Sieve complete: 42070000000000 <= p < 42070003000000
count=95668,sum=0x37dacb7121ccffe4
Elapsed time: 7.16 sec. (18433819337671.40 init + 12924736045.31 sieve) at 0 p/sec.
Processor time: 0.30 sec. (0.00 init + 0.30 sieve) at 10596136 p/sec.
Average processor utilization: 0.00 (init), 0.00 (sieve)
21:27:25 (2088): called boinc_finish
33 seconds down to 7 - all 6 factors are written to ppfactors.txt though. |
|
|
pschoefer Volunteer developer Volunteer tester
 Send message
Joined: 20 Sep 05 Posts: 667 ID: 845 Credit: 2,374,701,989 RAC: 15,281
                          
|
|
Win7 Prof. x64
GTX460
ppsieve-cuda-x86-windows.exe:
ppsieve version cuda-0.1.3 (testing)
nstart=76, nstep=32, gpu_nstep=32
ppsieve initialized: 1201 <= k <= 9999, 76 <= n <= 2000000
Sieve started: 42070000000000 <= p < 42070030000000
Thread 0 starting
Detected GPU 0: GeForce GTX 460
Detected compute capability: 2.1
Detected 7 multiprocessors.
Thread 0 completed
Waiting for threads to exit
Sieve complete: 42070000000000 <= p < 42070030000000
Found 67 factors
count=955289,sum=0x2dbc17167afb6a8d
Elapsed time: 14.87 sec. (0.03 init + 14.84 sieve) at 2032024 p/sec.
Processor time: 2.15 sec. (0.06 init + 2.09 sieve) at 14421342 p/sec.
Average processor utilization: 2.01 (init), 0.14 (sieve)
~99% GPU load over the full time, even though a prime95 torture test is running on all CPU threads.
ppsieve-cuda-boinc-x86-windows.exe, prime95 on all CPU threads:
23:34:14 (3484): Can't open init data file - running in standalone mode
Sieve started: 42070000000000 <= p < 42070030000000
Thread 0 starting
Detected GPU 0: GeForce GTX 460
Detected compute capability: 2.1
Detected 7 multiprocessors.
Thread 0 completed
Sieve complete: 42070000000000 <= p < 42070030000000
count=955289,sum=0x2dbc17167afb6a8d
Elapsed time: 23.19 sec. (18433819333655.50 init + 12924740077.24 sieve) at 0 p/sec.
Processor time: 2.29 sec. (0.00 init + 2.29 sieve) at 13145986 p/sec.
Average processor utilization: 0.00 (init), 0.00 (sieve)
23:34:37 (3484): called boinc_finish
GPU load varying between 50% and 80%, average ~60%.
ppsieve-cuda-boinc-x86-windows.exe, one CPU thread idle:
23:39:06 (1712): Can't open init data file - running in standalone mode
Sieve started: 42070000000000 <= p < 42070030000000
Thread 0 starting
Detected GPU 0: GeForce GTX 460
Detected compute capability: 2.1
Detected 7 multiprocessors.
Thread 0 completed
Sieve complete: 42070000000000 <= p < 42070030000000
count=955289,sum=0x2dbc17167afb6a8d
Elapsed time: 14.97 sec. (18433819333362.65 init + 12924740361.88 sieve) at 0 p/sec.
Processor time: 1.87 sec. (0.00 init + 1.87 sieve) at 16103828 p/sec.
Average processor utilization: 0.00 (init), 0.00 (sieve)
23:39:21 (1712): called boinc_finish
~99% GPU load.
So I can confirm Seventh Serenity's observation that the BOINC version slows down significantly if all CPU cores are under load.
But it's still running better than GPUGRID's app (50% GPU load at 100% CPU load, 63% with one core idle, and only 83% at 0% CPU load).
____________
|
|
|
Ken_g6 Volunteer developer
 Send message
Joined: 4 Jul 06 Posts: 921 ID: 3110 Credit: 218,950,902 RAC: 3,798
                          
|
|
My first Fermi test! 2MP/s, nice!
I think I have a fix for the slowness as well; try v0.1.3b that I just uploaded. If that doesn't work, can you tell me the program's process and thread priorities in Task Manager?
____________
|
|
|
|
|
|
v0.1.3b seems to be even slower, by more than double:
23:42:10 (3096): Can't open init data file - running in standalone mode
Sieve started: 42070000000000 <= p < 42070003000000
Thread 0 starting
Detected GPU 0: GeForce 9800 GT
Detected compute capability: 1.1
Detected 14 multiprocessors.
Thread 0 completed
Sieve complete: 42070000000000 <= p < 42070003000000
count=95668,sum=0x37dacb7121ccffe4
Elapsed time: 75.78 sec. (0.44 init + 75.34 sieve) at 41752 p/sec.
Processor time: 0.34 sec. (0.06 init + 0.28 sieve) at 11184811 p/sec.
Average processor utilization: 0.14 (init), 0.00 (sieve)
23:43:26 (3096): called boinc_finish
Task Manager reports the process running as 'Normal' priority. |
|
|
pschoefer Volunteer developer Volunteer tester
 Send message
Joined: 20 Sep 05 Posts: 667 ID: 845 Credit: 2,374,701,989 RAC: 15,281
                          
|
|
v0.1.3b
non-BOINC:
ppsieve version cuda-0.1.3b (testing)
nstart=76, nstep=32, gpu_nstep=32
ppsieve initialized: 1201 <= k <= 9999, 76 <= n <= 2000000
Thread 0 starting
Detected GPU 0: GeForce GTX 460
Detected compute capability: 2.1
Detected 7 multiprocessors.
Thread 0 completed
Sieve complete: 42070000000000 <= p < 42070030000000
count=955289,sum=0x2dbc17167afb6a8d
Elapsed time: 14.80 sec. (0.03 init + 14.77 sieve) at 2041227 p/sec.
Processor time: 1.98 sec. (0.05 init + 1.93 sieve) at 15584353 p/sec.
Average processor utilization: 1.51 (init), 0.13 (sieve)
Process priority normal, thread priority normal.
BOINC:
00:35:45 (3740): Can't open init data file - running in standalone mode
Sieve started: 42070000000000 <= p < 42070030000000
Thread 0 starting
Detected GPU 0: GeForce GTX 460
Detected compute capability: 2.1
Detected 7 multiprocessors.
Thread 0 completed
Sieve complete: 42070000000000 <= p < 42070030000000
count=955289,sum=0x2dbc17167afb6a8d
Elapsed time: 19.44 sec. (0.03 init + 19.41 sieve) at 1553057 p/sec.
Processor time: 2.42 sec. (0.05 init + 2.37 sieve) at 12713550 p/sec.
Average processor utilization: 1.51 (init), 0.12 (sieve)
00:36:04 (3740): called boinc_finish
Process priority normal, thread priority idle!
____________
|
|
|
Ken_g6 Volunteer developer
 Send message
Joined: 4 Jul 06 Posts: 921 ID: 3110 Credit: 218,950,902 RAC: 3,798
                          
|
|
Alright, I believe I've solved the problem. I was setting thread and process priority, but apparently the BOINC API overrides those settings! Furthermore, the only way to change BOINC's settings is through the C++ API, so I had to create two new files to do it! [/rant]
So, give 0.1.3c a try. I'm pretty sure it will work this time.
____________
|
|
|
|
|
|
You've certainly fixed it. :D
19:07:11 (2684): Can't open init data file - running in standalone mode
Sieve started: 42070000000000 <= p < 42070003000000
Thread 0 starting
Detected GPU 0: GeForce 9800 GT
Detected compute capability: 1.1
Detected 14 multiprocessors.
Thread 0 completed
Sieve complete: 42070000000000 <= p < 42070003000000
count=95668,sum=0x37dacb7121ccffe4
Elapsed time: 7.20 sec. (0.03 init + 7.17 sieve) at 438620 p/sec.
Processor time: 0.28 sec. (0.03 init + 0.25 sieve) at 12582912 p/sec.
Average processor utilization: 1.00 (init), 0.03 (sieve)
19:07:18 (2684): called boinc_finish |
|
|
pschoefer Volunteer developer Volunteer tester
 Send message
Joined: 20 Sep 05 Posts: 667 ID: 845 Credit: 2,374,701,989 RAC: 15,281
                          
|
|
0.1.3c is working fine, looking forward to crunching real WUs with it. :)
22:36:37 (3412): Can't open init data file - running in standalone mode
Sieve started: 42070000000000 <= p < 42070030000000
Thread 0 starting
Detected GPU 0: GeForce GTX 460
Detected compute capability: 2.1
Detected 7 multiprocessors.
Thread 0 completed
Sieve complete: 42070000000000 <= p < 42070030000000
count=955289,sum=0x2dbc17167afb6a8d
Elapsed time: 14.80 sec. (0.03 init + 14.77 sieve) at 2041641 p/sec.
Processor time: 2.03 sec. (0.05 init + 1.98 sieve) at 15216214 p/sec.
Average processor utilization: 1.42 (init), 0.13 (sieve)
22:36:52 (3412): called boinc_finish
____________
|
|
|
|
|
|
With 0.1.3c under linux-x86-64 I got several of these:
Computation Error: no candidates found for p=249871261264783.
Computation Error: no candidates found for p=249871202042827.
Is this a hint of a hardware error?
I hope not, I bought the card today... ;)
In total, the test took:
Thread 0 completed
Waiting for threads to exit
Sieve complete: 249871000000000 <= p < 249872000000000
Found 321 factors
count=30166916,sum=0xa0b7dde9a581c7d4
Elapsed time: 503.48 sec. (0.05 init + 503.43 sieve) at 1986536 p/sec.
Processor time: 19.30 sec. (0.05 init + 19.25 sieve) at 51949276 p/sec.
Average processor utilization: 1.04 (init), 0.04 (sieve)
Furthermore, I remembered the following in the code:
(...)
cthread_count = (gpuprop.major == 1 && gpuprop.minor < 2)?384:768;
if(gpuprop.major == 2) cthread_count = 1024;
cthread_count *= gpuprop.multiProcessorCount;
Shouldn't the thread count for compute capability >= 2 be increased?
With compute capability 1.x there were 8 SPs per multiprocessor; now there are 32 (2.0) or 48 (2.1), but fewer multiprocessors.
EDIT: hmm, it makes the CPU load go up... |
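For reference, the heuristic in the snippet quoted above can be written out as a plain function (a sketch of the quoted logic only, not the actual appcu code; the function name is made up):

```cpp
// Threads per launch as chosen in the quoted snippet: 384 per multiprocessor
// below compute capability 1.2, 768 for 1.2/1.3, and 1024 for 2.x, scaled by
// the multiprocessor count reported by the device.
int choose_thread_count(int major, int minor, int mp_count) {
    int per_mp = (major == 1 && minor < 2) ? 384 : 768;
    if (major == 2) per_mp = 1024;
    return per_mp * mp_count;
}
```

Whether 2.x should get more than 1024 threads per MP is exactly the open question here, since a 2.1 part has 48 CUDA cores per multiprocessor versus 8 on 1.x, but far fewer multiprocessors.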
|
|
Ken_g6 Volunteer developer
 Send message
Joined: 4 Jul 06 Posts: 921 ID: 3110 Credit: 218,950,902 RAC: 3,798
                          
|
With 0.1.3c under linux-x86-64 I got several of these:
Computation Error: no candidates found for p=249871261264783.
Computation Error: no candidates found for p=249871202042827.
Is this a hint of a hardware error?
I hope not, I bought the card today... ;)
'Fraid so. If you get one of those in BOINC, it will error out. :(
In total, the test took:
Thread 0 completed
Waiting for threads to exit
Sieve complete: 249871000000000 <= p < 249872000000000
Found 321 factors
count=30166916,sum=0xa0b7dde9a581c7d4
Elapsed time: 503.48 sec. (0.05 init + 503.43 sieve) at 1986536 p/sec.
Processor time: 19.30 sec. (0.05 init + 19.25 sieve) at 51949276 p/sec.
Average processor utilization: 1.04 (init), 0.04 (sieve)
Another GTX 260? They tend to get about 2M p/sec. I'm strongly considering getting one.
Furthermore i remembered the following in the code:
(...)
cthread_count = (gpuprop.major == 1 && gpuprop.minor < 2)?384:768;
if(gpuprop.major == 2) cthread_count = 1024;
cthread_count *= gpuprop.multiProcessorCount;
Shouldn't the thread_count for compute capability >= 2 be increased?
With compute capability 1.x there were 8 SPs per multiprocessor; now there are 32 (2.0) or 48 (2.1), but fewer multiprocessors.
EDIT: hum, makes the CPU-load go up...
I'm not sure. The newer CUDA Occupancy Calculator you pointed me to doesn't seem to think so. Did you try it? Did the computation rate go up, even if CPU load did too? I have a few ideas for decreasing the CPU load if it gets onerous.
____________
|
|
|
|
|
|
Well, the GTX 460 may be getting 2M p/sec, but it makes me wonder what ATI cards would get. They simply blow NVIDIA cards out of the water on MilkyWay@Home, distributed.net and even Collatz (the HD4850 in my current machine is 3-4x faster than my 9800GT in two of those projects). |
|
|
|
|
with 0.1.3c under linux-x86-64 i got several:
Computation Error: no candidates found for p=249871261264783.
Computation Error: no candidates found for p=249871202042827.
Is this a hint of a hardware error?
I hope not, I bought the card today... ;)
'Fraid so. If you get one of those in BOINC, it will error out. :(
Can't say yet; nothing I have crunched with ppsSieve today has been validated.
In total the test took:
Thread 0 completed
Waiting for threads to exit
Sieve complete: 249871000000000 <= p < 249872000000000
Found 321 factors
count=30166916,sum=0xa0b7dde9a581c7d4
Elapsed time: 503.48 sec. (0.05 init + 503.43 sieve) at 1986536 p/sec.
Processor time: 19.30 sec. (0.05 init + 19.25 sieve) at 51949276 p/sec.
Average processor utilization: 1.04 (init), 0.04 (sieve)
Another GTX 260? They tend to get about 2M p/sec. I'm strongly considering getting one.
Furthermore I noticed the following in the code:
(...)
cthread_count = (gpuprop.major == 1 && gpuprop.minor < 2)?384:768;
if(gpuprop.major == 2) cthread_count = 1024;
cthread_count *= gpuprop.multiProcessorCount;
Shouldn't the thread_count for compute capability >= 2 be increased?
With compute capability 1.x there were 8 SPs per multiprocessor; now there are 32 (2.0) or 48 (2.1), but fewer multiprocessors.
EDIT: hum, makes the CPU-load go up...
I'm not sure. The newer CUDA Occupancy Calculator you pointed me to doesn't seem to think so. Did you try it? Did the computation rate go up, even if CPU load did too? I have a few ideas for decreasing the CPU load if it gets onerous.
No, it was not faster; only the load went up.
The problem with the determination of how many threads should be run is this (I think):
GTS 250: 16 MPs; cthread_count= 6144
GTX 260: 27 MPs; cthread_count= 20736
GTX 460: 7 MPs; cthread_count= 7168
GTX 470: 14 MPs; cthread_count= 14336
That treats the newer chips worse than the older ones.
I can't find anything about this in the docs or in the Occupancy Calculator; the only hint was in the 2.2.1 docs, which said the number of threads executed by a kernel should be in the 100k range rather than the 10k range.
But maybe I am wrong here.
I can't find anything about this in the new docs either.
EDITH says: all 8 WUs crunched with the new GTX 460 validated, runtime was 480 to 482 seconds.
The new card is quieter than the GTX 260, albeit drawing more power under load.
In my system:
GTX 260: 100/200/265 W (all idle/cpu load/cpu+gpu load)
GTX 460: 77/185/290 W (all idle/cpu load/cpu+gpu load) |
|
|
Ken_g6 Volunteer developer
 Send message
Joined: 4 Jul 06 Posts: 921 ID: 3110 Credit: 218,950,902 RAC: 3,798
                          
|
The problem with the determination of how many threads should be run is this (i think):
GTS 250: 16 MPs; cthread_count= 6144
GTX 260: 27 MPs; cthread_count= 20736
GTX 460: 7 MPs; cthread_count= 7168
GTX 470: 14 MPs; cthread_count= 14336
That treats the newer chips worse than the older ones.
I can't find anything about this in the docs or in the Occupancy Calculator; the only hint was in the 2.2.1 docs, which said the number of threads executed by a kernel should be in the 100k range rather than the 10k range.
But maybe I am wrong here.
I do remember something about that, but I'm just going by the Occupancy Calculator. Try this: raise BLOCKSIZE in appcu.h to 192, then raise the 2.0 or higher threads per multiprocessor to 1536 (from 1024). It shouldn't be worse, in any case.
____________
|
|
|
Ken_g6 Volunteer developer
 Send message
Joined: 4 Jul 06 Posts: 921 ID: 3110 Credit: 218,950,902 RAC: 3,798
                          
|
|
With the CUDA app up and running, I've been considering some other possibilities. Like ATI, for one. And an old branch I abandoned for another.
I think this branch could lead to faster processing for all GPUs except probably Fermi (because Fermi is so good at 32-bit multiplication). But I think I had a bug in my last test on that branch, so I need a new baseline test to work from.
So, 64-bit Linux users with nVIDIA GPUs older than Fermi (so no GTX 460's), please test this app against the current app. (So post both their speeds.) Don't get too excited if the test app is a little faster: it would need to be at least 20% faster on this test before I'd consider using it in production. And it's not nearly ready yet; it's not built for BOINC and it's missing some bugfixes I've done on the other branch.
____________
|
|
|
|
|
|
The Windows CUDA app isn't working.
I just had all my work units error out, and I have the latest NVIDIA driver. |
|
|
HAmsty Volunteer tester
 Send message
Joined: 26 Dec 08 Posts: 132 ID: 33421 Credit: 12,510,712 RAC: 0
                
|
The Windows CUDA app isn't working.
I just had all my work units error out, and I have the latest NVIDIA driver.
I've exactly the same problem.
____________
|
|
|
BiBi Volunteer tester Send message
Joined: 6 Mar 10 Posts: 151 ID: 56425 Credit: 34,290,031 RAC: 0
                   
|
|
The newest driver reports the GS 8400 having 243MB memory :(
Thanks for the help? The GS8400s worked fine during testing.... Please lower the memory requirement to 240MB
01/08/2010 19:05:27 NVIDIA GPU 0: GeForce 8400 GS (driver version 25721, CUDA version 3010, compute capability 1.1, 243MB, 22 GFLOPS peak)
02/08/2010 21:42:40 PrimeGrid Message from server: Your NVIDIA GPU has insufficient memory (need 250MB)
____________
|
|
|
Ken_g6 Volunteer developer
 Send message
Joined: 4 Jul 06 Posts: 921 ID: 3110 Credit: 218,950,902 RAC: 3,798
                          
|
The Windows CUDA app isn't working.
I just had all my work units error out, and I have the latest NVIDIA driver.
Very strange. The app worked for you when manually testing, didn't it? Can you copy the app out of the BOINC directory and manually test it again?
No STDERR output = no chance for debugging. :(
____________
|
|
|
pschoefer Volunteer developer Volunteer tester
 Send message
Joined: 20 Sep 05 Posts: 667 ID: 845 Credit: 2,374,701,989 RAC: 15,281
                          
|
|
After copying cudart.dll into the project directory, the windows app is not crashing immediately. But after some time, all WUs error out with something like Computation Error: no candidates found for p=332325478168793
Tested only with the GTX460 so far, I'll check if it's also failing on the 9800GT. Maybe our manual test range was too small to find this error?
____________
|
|
|
Ken_g6 Volunteer developer
 Send message
Joined: 4 Jul 06 Posts: 921 ID: 3110 Credit: 218,950,902 RAC: 3,798
                          
|
|
On a second look, I see the error message:
- exit code -1073741515 (0xc0000135)
That seems to be related to initializing a DLL. Perhaps a cudart.dll needs to be distributed with this app?
____________
|
|
|
Ken_g6 Volunteer developer
 Send message
Joined: 4 Jul 06 Posts: 921 ID: 3110 Credit: 218,950,902 RAC: 3,798
                          
|
After copying cudart.dll into the project directory, the windows app is not crashing immediately. But after some time, all WUs error out with something like Computation Error: no candidates found for p=332325478168793
Tested only with the GTX460 so far, I'll check if it's also failing on the 9800GT. Maybe our manual test range was too small to find this error?
I'm afraid a Computation Error usually means your card is flaky. :( Can you let the card cool down and then run a manual test with that number in the range, like:
./ppsieve-cuda-x86_64-linux -p332325478000000 -P332325479000000 -k 1201 -K 9999 -N 2000000 -z normal
I'll try doing that with the emulator too, to make sure.
Edit: Emulator run finished. (It's very slow.) No errors. Sorry, looks like your card is flaky. :(
____________
|
|
|
|
|
|
I just copied cudart.dll from my Collatz Conjecture project folder and so far, it's crunching fine. Work unit is progressing pretty well. Looks like 30-40 minutes per work unit on my 9800GT, but I guess we'll see when it finishes. |
|
|
BiBi Volunteer tester Send message
Joined: 6 Mar 10 Posts: 151 ID: 56425 Credit: 34,290,031 RAC: 0
                   
|
|
Any chance that I get my card crunching? What is the memory requirement of the PG cuda app and can it be lowered to 240MB so that GS8400 cards can get some work?
____________
|
|
|
|
|
|
According to GPU-Z, the memory usage on my 9800GT is about 50MB (on a 1080p monitor). |
|
|
|
|
After copying cudart.dll into the project directory, the windows app is not crashing immediately. But after some time, all WUs error out with something like Computation Error: no candidates found for p=332325478168793
Tested only with the GTX460 so far, I'll check if it's also failing on the 9800GT. Maybe our manual test range was too small to find this error?
Could you give me the invocation parameters so that I can test with my GTX 460?
I had the same error twice in another range. |
|
|
|
|
|
Just crunched a windows CUDA WU on my 9800GT:
http://www.primegrid.com/workunit.php?wuid=126941822
Took 27 min. No problems here. Didn't do anything special, just changed my preferences to allow GPU tasks. Was running 7 TRP tasks concurrently (this is on a Core i7, but a full core was needed for my two GPUs to run). Hope this is helpful...
Cheers!
Alan
Edit: screen response is a bit slow while crunching, but not too shabby:)
____________
|
|
|
|
|
|
Since we no longer use the "Driver API" but the "Runtime API", we need to have cudart.dll/cudart.so at hand to crunch.
Forgot those. |
|
|
pschoefer Volunteer developer Volunteer tester
 Send message
Joined: 20 Sep 05 Posts: 667 ID: 845 Credit: 2,374,701,989 RAC: 15,281
                          
|
I'm afraid a Computation Error usually means your card is flaky. :( Can you let the card cool down and then run a manual test with that number in the range, like:
./ppsieve-cuda-x86_64-linux -p332325478000000 -P332325479000000 -k 1201 -K 9999 -N 2000000 -z normal
While it's working on the 9800GT, it crashes on the GTX460 at p=332325478168793 every time (using the windows app on both). I'm also able to reproduce the errors at p=249871202042827 and p=249871261264783, which roadrunner_gs posted before. So maybe there's a general problem with the GTX460?
____________
|
|
|
BiBi Volunteer tester Send message
Joined: 6 Mar 10 Posts: 151 ID: 56425 Credit: 34,290,031 RAC: 0
                   
|
|
According to GPU-Z my card has 256MB of RAM; it must be the boinc manager wrongly reporting 243 MB of RAM.
According to this thread (http://setiathome.ssl.berkeley.edu/forum_thread.php?id=59273&sort=5) it was introduced with the 197.x drivers...
Once again: please lower the memory requirement!
____________
|
|
|
Ken_g6 Volunteer developer
 Send message
Joined: 4 Jul 06 Posts: 921 ID: 3110 Credit: 218,950,902 RAC: 3,798
                          
|
|
I can't reproduce this GTX 460 error. I'm sure it really is computing incorrectly - the GPU found a factor, but running the same test on the CPU didn't find a factor. So I don't really want to do any workarounds until the problem is found.
But since I don't have a GTX 460, I can't debug it. Have they created any firmware updates yet?
Once again: please lower the memory requirement!
FYI, 10MB should be enough to run this app, even in the worst case I could think of.
____________
|
|
|
|
|
|
Just out of curiosity, is it normal for each GPU to use 0.51 CPU?
I've been monitoring my CPU usage and it stays fairly consistent at 88% when running CPU and GPU tasks, at 100% when I just run CPU tasks, and when I suspend all my CPU tasks (running only the two GPU WUs) my CPU usage is at 1-2%. The GPUs do not seem to be using as much CPU power as BOINC implies, but it is preventing my Core i7 from running 8 simultaneous CPU tasks even though I have enough remaining processing power to do so. Any ideas?
Cheers!
Alan
____________
|
|
|
|
|
I'm afraid a Computation Error usually means your card is flaky. :( Can you let the card cool down and then run a manual test with that number in the range, like:
./ppsieve-cuda-x86_64-linux -p332325478000000 -P332325479000000 -k 1201 -K 9999 -N 2000000 -z normal
While it's working on the 9800GT, it crashes on the GTX460 at p=332325478168793 every time (using the windows app on both). I'm also able to reproduce the errors at p=249871202042827 and p=249871261264783, which roadrunner_gs posted before. So maybe there's a general problem with the GTX460?
The 0.1.3 (testing) with my built in patches
[roadrunner@rr022 pps]$ time ./ppsieve-cuda-x86_64-linux -p332325478000000 -P332325479000000 -k 1201 -K 9999 -N 2000000 -z normal
ppsieve version cuda-0.1.3 (testing)
Compiled Jul 30 2010 with GCC 4.1.2 20080704 (Red Hat 4.1.2-48)
nstart=82, nstep=35, gpu_nstep=35
ppsieve initialized: 1201 <= k <= 9999, 82 <= n <= 2000000
Sieve started: 332325478000000 <= p < 332325479000000
Thread 0 starting
Detected GPU 0: GeForce GTX 460
Detected compute capability: 2.1
Detected 7 multiprocessors.
memory needed: 93184 bytes.
Computation Error: no candidates found for p=332325478168793.
332325478382197 | 1579*2^916630+1
332325478899881 | 6055*2^1462552+1
Thread 0 completed
Waiting for threads to exit
Sieve complete: 332325478000000 <= p < 332325479000000
Found 2 factors
count=29910,sum=0x89f17583e61918fc
Elapsed time: 2.41 sec. (0.05 init + 2.36 sieve) at 443812 p/sec.
Processor time: 1.70 sec. (0.05 init + 1.65 sieve) at 635597 p/sec.
Average processor utilization: 1.06 (init), 0.70 (sieve)
real 0m2.486s
user 0m0.106s
sys 0m1.594s
So it breaks at the same number as yours. I am using Linux, though.
We should round up some more GTX 460 users to verify this and file a report with Nvidia, I think. |
|
|
Ken_g6 Volunteer developer
 Send message
Joined: 4 Jul 06 Posts: 921 ID: 3110 Credit: 218,950,902 RAC: 3,798
                          
|
|
I'm wondering what happens if the code is compiled with V3.1 of the toolkit rather than V2.3. Do you want to make a build or shall I?
____________
|
|
|
BiBi Volunteer tester Send message
Joined: 6 Mar 10 Posts: 151 ID: 56425 Credit: 34,290,031 RAC: 0
                   
|
Once again: please lower the memory requirement!
FYI, 10MB should be enough to run this app, even in the worst case I could think of.
Who at PG is able to lower this requirement? It is set on the server side of the project. All 256MB cards will suffer from this driver issue :(
____________
|
|
|
|
|
I'm wondering what happens if the code is compiled with V3.1 of the toolkit rather than V2.3. Do you want to make a build or shall I?
I will give it a try, but I need a day or two.
I will set up another buildroot in a VM in order not to spoil my real one. |
|
|
Ken_g6 Volunteer developer
 Send message
Joined: 4 Jul 06 Posts: 921 ID: 3110 Credit: 218,950,902 RAC: 3,798
                          
|
|
I've got a Linux 32-bit VM that I can take a snapshot of, then build with the 3.1 SDK. So try this Linux 32-bit binary.
____________
|
|
|
BiBi Volunteer tester Send message
Joined: 6 Mar 10 Posts: 151 ID: 56425 Credit: 34,290,031 RAC: 0
                   
|
|
Mine is crunching :D:D:D
Downgraded the drivers to 196.21, which I think is silly because the extra memory now reported must be unusable anyway. Any ideas?
03/08/2010 00:59:49 NVIDIA GPU 0: GeForce 8400 GS (driver version 19621, CUDA version 3000, compute capability 1.1, 256MB, 22 GFLOPS peak)
____________
|
|
|
|
|
Once again: please lower the memory requirement!
FYI, 10MB should be enough to run this app, even in the worst case I could think of.
Who at PG is able to lower this requirement? It is set on the server side of the project. All 256MB cards will suffer from this driver issue :(
+1 We need to get this false minimum requirement lowered somehow.
____________
Reno, NV
|
|
|
John Honorary cruncher
 Send message
Joined: 21 Feb 06 Posts: 2875 ID: 2449 Credit: 2,681,934 RAC: 0
                 
|
Once again: please lower the memory requirement!
FYI, 10MB should be enough to run this app, even in the worst case I could think of.
Who at PG is able to lower this requirement? It is set on the server side of the project. All 256MB cards will suffer from this driver issue :(
+1 We need to get this false minimum requirement lowered somehow.
It will be addressed but it's the middle of the night in Lithuania.
____________
|
|
|
Scott Brown Volunteer moderator Project administrator Volunteer tester Project scientist
 Send message
Joined: 17 Oct 05 Posts: 2258 ID: 1178 Credit: 10,867,108,087 RAC: 11,866,263
                                        
|
Once again: please lower the memory requirement!
FYI, 10MB should be enough to run this app, even in the worst case I could think of.
Who at PG is able to lower this requirement? It is set on the server side of the project. All 256MB cards will suffer from this driver issue :(
+1 We need to get this false minimum requirement lowered somehow.
It will be addressed but it's the middle of the night in Lithuania.
Just a quick note or two for tomorrow's fix:
1) The issue varies widely depending on driver and OS. For example, almost all 197.xx and later versions produce the issue in Win XP, but for some the under-report is only 1MB (i.e., they report 255MB). In other cases, I have seen as much as a 30MB or so under-report, so adjusting down to 240MB may not be enough.
2) Although it is a popular notion that the lowest memory in CUDA-capable devices is 256MB, numerous lower-memory boards exist. For example, there have been several OEM 8300GS/8400GS cards with only 128MB in both desktops and laptops, as well as small (and sometimes variable) memory levels with CUDA-capable on-motherboard video. Hopefully, given the seemingly low memory usage of this app, tomorrow's adjustment will not exclude these devices.
____________
141941*2^4299438-1 is prime!
|
|
|
Michael Goetz Volunteer moderator Project administrator
 Send message
Joined: 21 Jan 10 Posts: 13633 ID: 53948 Credit: 280,904,358 RAC: 40,710
                           
|
|
Tried running the Windows version today.
Pertinent system info:
Windows Vista, 32 bit
GTX 280 @1350 (slight factory OC), driver 197.45, 1 gig ram
C2Q Q6600 @2.4
3 gig ram
First attempts failed due to missing cudart.dll.
Once that file (the CUDA 3.2 version) was manually placed in the primegrid directory, it started running.
The first WU is still in progress and the turn time looks like it will be under 20 minutes. That compares to about 3.5 hours on the CPU.
GPU temperature is very hot (that's good), and utilization is a whopping 98%, which is excellent. CPU usage is almost unmeasurably small, which is also excellent.
However, the GUI is noticeably choppy -- enough so that I really wouldn't want to use this computer while the GPU is crunching. Attempting to play a game would probably not be pleasant. This is probably a big enough problem to stop me from running this application.
Here's the stderr output:
<core_client_version>6.10.56</core_client_version>
<![CDATA[
<stderr_txt>
Sieve started: 336214000000000 <= p < 336215000000000
Thread 0 starting
Detected GPU 0: GeForce GTX 280
Detected compute capability: 1.3
Detected 30 multiprocessors.
Thread 0 completed
Sieve complete: 336214000000000 <= p < 336215000000000
count=29899365,sum=0xf3d9b7f78b3d87bf
Elapsed time: 1051.37 sec. (0.09 init + 1051.28 sieve) at 951300 p/sec.
Processor time: 16.93 sec. (0.11 init + 16.82 sieve) at 59468682 p/sec.
Average processor utilization: 1.21 (init), 0.02 (sieve)
22:11:07 (7760): called boinc_finish
</stderr_txt>
]]>
____________
My lucky number is 75898524288+1 |
|
|
|
|
|
I saw the new Windows 64-bit CUDA application available for download and wanted to test it.
(However, when it's available it should be a production version and not a test version ;))
It runs fine. I want to let you know the result:
http://www.primegrid.com/result.php?resultid=184041093
<core_client_version>6.10.56</core_client_version>
<![CDATA[
<stderr_txt>
Sieve started: 269614000000000 <= p < 269615000000000
Thread 0 starting
Detected GPU 0: GeForce GT 240
Detected compute capability: 1.2
Detected 12 multiprocessors.
Thread 0 completed
Sieve complete: 269614000000000 <= p < 269615000000000
count=30092825,sum=0xd4e74dcd5144c1cf
Elapsed time: 1990.83 sec. (0.11 init + 1990.72 sieve) at 502370 p/sec.
Processor time: 16.55 sec. (0.12 init + 16.43 sieve) at 60880568 p/sec.
Average processor utilization: 1.14 (init), 0.01 (sieve)
13:42:51 (320): called boinc_finish
</stderr_txt>
]]>
A bit over 33 minutes wall clock time. If I remember right it was > 4 hours for the CPU-version on this Phenom I with a FSB800 and 2.1GHz.
____________
|
|
|
|
|
I'm wondering what happens if the code is compiled with V3.1 of the toolkit rather than V2.3. Do you want to make a build or shall I?
Whoa look at this post and this host.
He has a lot of:
Computation Error: no candidates found for p=${number}
Will test his numbers when at home with my GTX 460. |
|
|
Ken_g6 Volunteer developer
 Send message
Joined: 4 Jul 06 Posts: 921 ID: 3110 Credit: 218,950,902 RAC: 3,798
                          
|
|
So the problem's affecting all Fermi cards, not just 460s. At this point, I'm not too surprised. :\
Have you been able to test my 32-bit Linux CUDA binary built with the 3.1 SDK yet? Actually, if it works for Fermi users and doesn't slow down non-Fermi users, I'd just go ahead with that. Otherwise, once I know the status of things, I'll make a post in nVIDIA's forums to report it.
____________
|
|
|
|
|
|
No, I am at work at the moment; I could test when at home, in around seven hours or so. |
|
|
|
|
Detected GPU 0: GeForce GT 240
Detected compute capability: 1.2
Detected 12 multiprocessors.
Strange: does it have a GT215 chip? Then it should have 8 ROPs.
AFAIK the only GPU with 12 is sold as the GT 330... |
|
|
|
|
Detected GPU 0: GeForce GT 240
Detected compute capability: 1.2
Detected 12 multiprocessors.
Strange: does it have a GT215 chip? Then it should have 8 ROPs.
AFAIK the only GPU with 12 is sold as the GT 330...
Yes, a GT 240 has a GT215 chip with 12 MPs and therefore 96 SPs.
Not to be confused with a GTS 240, which has a G92a with 14 MPs and 112 SPs.
Furthermore I found out:
A GT 330 has a G92b with 12/14 MPs (96/112 SPs).
A GT 340 has a GT215 with 12 MPs (96 SPs). |
|
|
Scott Brown Volunteer moderator Project administrator Volunteer tester Project scientist
 Send message
Joined: 17 Oct 05 Posts: 2258 ID: 1178 Credit: 10,867,108,087 RAC: 11,866,263
                                        
|
Detected GPU 0: GeForce GT 240
Detected compute capability: 1.2
Detected 12 multiprocessors.
Strange: does it have a GT215 chip? Then it should have 8 ROPs.
AFAIK the only GPU with 12 is sold as the GT 330...
Yes, a GT 240 has a GT215 chip with 12 MPs and therefore 96 SPs.
Not to be confused with a GTS 240, which has a G92a with 14 MPs and 112 SPs.
Furthermore I found out:
A GT 330 has a G92b with 12/14 MPs (96/112 SPs).
A GT 340 has a GT215 with 12 MPs (96 SPs).
The 8800GS and 9600GSO (G92-based) also both have 12 MPs (96 SPs).
*Note: some 9600GSO are G94-based and have fewer MPs (and only 48 SPs).
____________
141941*2^4299438-1 is prime!
|
|
|
|
|
|
just started one on my GTS 250 (highly clocked @ 800/2000);
it took a little more than 20 minutes.
Sieve started: 273280000000000 <= p < 273281000000000
Thread 0 starting
Detected GPU 0: GeForce GTS 250
Detected compute capability: 1.1
Detected 16 multiprocessors.
Thread 0 completed
Sieve complete: 273280000000000 <= p < 273281000000000
count=30088793,sum=0xc09d1de7fe000ae7
Elapsed time: 1224.37 sec. (0.08 init + 1224.29 sieve) at 816864 p/sec.
Processor time: 2.61 sec. (0.08 init + 2.53 sieve) at 395723737 p/sec.
Average processor utilization: 1.00 (init), 0.00 (sieve)
16:48:03 (3636): called boinc_finish
fan went up to 80%, temp @64C
RTP was 47 seconds - the PII-940 @ 3.3GHz takes about 100 minutes.
DCF goes mad...
.. and the CPU-tasks are estimated to run 16 hours. :( |
|
|
|
|
Detected GPU 0: GeForce GT 240
Detected compute capability: 1.2
Detected 12 multiprocessors.
Strange: does it have a GT215 chip? Then it should have 8 ROPs.
AFAIK the only GPU with 12 is sold as the GT 330...
8 or 12: It works fine for me and that's the most important thing.
It's a Palit GeForce GT 240 (512MB GDDR5)
But, you're right Frank. It's a GT215 chip and GPU-Z says ROPs are 8.
In the picture below you can see it and also how it's overclocked.
Temperatures at 73 C. That's about 5 degrees higher than with GPUGRID. Fan @ 63%
When running PG PPS Sieve the screen output becomes a bit sluggish.
|
|
|
|
|
|
ROPs might be the pixel-shader count; the GT 240 indeed has 8 pixel shaders. |
|
|
Ken_g6 Volunteer developer
 Send message
Joined: 4 Jul 06 Posts: 921 ID: 3110 Credit: 218,950,902 RAC: 3,798
                          
|
|
OK, new version 0.1.3d released for you to try.
I had another idea about what might be causing the Fermi problems. I set the kernel calls to use the stream Gerrit (roadrunner_gs) set up, as he told me to do. However, I hadn't set the memcopies to do the same. I've now done that and added a cudaThreadSynchronize() after getting the results. If it doesn't fix the problem, maybe it will at least print a useful error message.
I also increased the block size to 192 from 128. That should fill out the Fermi occupancy, but I'm not sure if it will have any real effect, positive or negative. If you find the new code is surprisingly slower, that's probably the reason.
____________
|
|
|
pschoefer Volunteer developer Volunteer tester
 Send message
Joined: 20 Sep 05 Posts: 667 ID: 845 Credit: 2,374,701,989 RAC: 15,281
                          
|
|
0.1.3d is not working at all:
Cuda error: getting factors found: invalid argument
____________
|
|
|
Ken_g6 Volunteer developer
 Send message
Joined: 4 Jul 06 Posts: 921 ID: 3110 Credit: 218,950,902 RAC: 3,798
                          
|
Are you using host-pinned memory (allocated via cudaMallocHost)? The Async calls require that CPU-side memory is allocated this way.
Uggh. See, this is why I wasn't using streams in the first place.
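For anyone following along, the pattern under discussion looks roughly like this (a sketch only, not the actual ppsieve code; the stream setup, buffer names, and sizes are invented for illustration):

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

#define N_CANDIDATES 4096  /* made-up size */

int main(void) {
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    /* cudaMemcpyAsync only behaves asynchronously when the host buffer is
       page-locked, i.e. allocated with cudaMallocHost. A plain malloc'd
       buffer silently degrades to a synchronous copy, or races with the
       kernel when queued on a stream. */
    unsigned long long *host_results;
    cudaMallocHost((void **)&host_results, N_CANDIDATES * sizeof *host_results);

    unsigned long long *dev_results;
    cudaMalloc((void **)&dev_results, N_CANDIDATES * sizeof *dev_results);

    /* kernel<<<blocks, threads, 0, stream>>>(dev_results, ...); */

    cudaMemcpyAsync(host_results, dev_results,
                    N_CANDIDATES * sizeof *host_results,
                    cudaMemcpyDeviceToHost, stream);

    /* Wait until both the kernel and the copy on this stream are done
       before reading host_results on the CPU. */
    cudaStreamSynchronize(stream);

    cudaFree(dev_results);
    cudaFreeHost(host_results);
    cudaStreamDestroy(stream);
    return 0;
}
```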
____________
|
|
|
|
|
So the problem's affecting all Fermi cards, not just 460s. At this point, I'm not too surprised. :\
Have you been able to test my 32-bit Linux CUDA binary built with the 3.1 SDK yet? Actually, if it works for Fermi users and doesn't slow down non-Fermi users, I'd just go ahead with that. Otherwise, once I know the status of things, I'll make a post in nVIDIA's forums to report it.
No luck:
# ./ppsieve-cuda31-boinc-x86-linux -p332325478000000 -P332325479000000 -k 1201 -K 9999 -N 2000000 -z normal
ppsieve version cuda-0.1.3c (testing)
Compiled Aug 2 2010 with GCC 4.3.3
nstart=82, nstep=35, gpu_nstep=35
ppsieve initialized: 1201 <= k <= 9999, 82 <= n <= 2000000
# cat stderr.txt
Can't open init data file - running in standalone mode
Sieve started: 332325478000000 <= p < 332325479000000
Thread 0 starting
Detected GPU 0: GeForce GTX 460
Detected compute capability: 2.1
Detected 7 multiprocessors.
Computation Error: no candidates found for p=332325478168793.
called boinc_finish
The stream is only there to check for the stopping event.
Since we are not using several streams on one device in parallel, we do not need to allocate the memory for specific streams. The kernels are executed in order since they belong to the same stream, if I'm not reading the docs wrong. |
|
|
BiBi Volunteer tester Send message
Joined: 6 Mar 10 Posts: 151 ID: 56425 Credit: 34,290,031 RAC: 0
                   
|
|
My WU failed: it exceeded the maximum time :(
<core_client_version>6.10.58</core_client_version>
<![CDATA[
<message>
Maximum elapsed time exceeded
</message>
<stderr_txt>
Sieve started: 336356000000000 <= p < 336357000000000
Thread 0 starting
Detected GPU 0: GeForce 8400 GS
Detected compute capability: 1.1
Detected 1 multiprocessors.
</stderr_txt>
]]>
____________
|
|
|
|
|
|
Wait, is the libcudart.so of CUDA 3.1 really named libcudart.so.2? |
|
|
|
|
My WU failed: it exceeded the maximum time :(
<core_client_version>6.10.58</core_client_version>
<![CDATA[
<message>
Maximum elapsed time exceeded
</message>
<stderr_txt>
Sieve started: 336356000000000 <= p < 336357000000000
Thread 0 starting
Detected GPU 0: GeForce 8400 GS
Detected compute capability: 1.1
Detected 1 multiprocessors.
</stderr_txt>
]]>
Your GPU may be too slow to run the kernel in the allowed time; it only has 1 MP!
The maximum time a kernel is allowed to run is 10 seconds, if I remember right.
This only applies to kernels started while a GUI is running; on the command line there is no maximum runtime. |
|
|
Ken_g6 Volunteer developer
 Send message
Joined: 4 Jul 06 Posts: 921 ID: 3110 Credit: 218,950,902 RAC: 3,798
                          
|
|
"Maximum elapsed time exceeded" is a BOINC configuration issue, not a CUDA watchdog timer problem. My app never runs into the watchdog timer.
Another thing for Rytis to fix.
____________
|
|
|
|
|
Wait, is the libcudart.so of CUDA 3.1 really named libcudart.so.2?
No, it is not; built with the CUDA 3.1 toolkit I see the following dependencies:
ldd ppsieve-cuda-x86_64-linux
linux-vdso.so.1 => (0x00007fff8eecb000)
libcuda.so.1 => /usr/lib64/libcuda.so.1 (0x00007f6080c9c000)
libcudart.so.3 => /usr/lib64/libcudart.so.3 (0x00007f6080a61000)
libpthread.so.0 => /lib64/libpthread.so.0 (0x0000003f6ca00000)
libstdc++.so.6 => /usr/lib64/libstdc++.so.6 (0x0000003f72600000)
libm.so.6 => /lib64/libm.so.6 (0x0000003f6c200000)
libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x0000003f70e00000)
libc.so.6 => /lib64/libc.so.6 (0x0000003f6be00000)
libz.so.1 => /lib64/libz.so.1 (0x0000003f6ce00000)
libdl.so.2 => /lib64/libdl.so.2 (0x0000003f6c600000)
librt.so.1 => /lib64/librt.so.1 (0x0000003f6d200000)
/lib64/ld-linux-x86-64.so.2 (0x0000003f6b600000)
Alas, since it was built on the RHEL6 beta, I can't use the app on my other hosts. I am soooo dumb, lol |
|
|
|
|
|
WOW!
How about this?
./ppsieve-cuda-x86_64-linux -p332325478000000 -P332325479000000 -k 1201 -K 9999 -N 2000000 -z normal
ppsieve version cuda-0.1.3 (testing)
Compiled Aug 3 2010 with GCC 4.1.2 20080704 (Red Hat 4.1.2-48)
nstart=82, nstep=35, gpu_nstep=35
ppsieve initialized: 1201 <= k <= 9999, 82 <= n <= 2000000
Sieve started: 332325478000000 <= p < 332325479000000
Thread 0 starting
Detected GPU 0: GeForce GTX 460
Detected compute capability: 2.1
Detected 7 multiprocessors.
memory needed: 93184 bytes.
332325478382197 | 1579*2^916630+1
332325478899881 | 6055*2^1462552+1
Thread 0 completed
Waiting for threads to exit
Sieve complete: 332325478000000 <= p < 332325479000000
Found 2 factors
count=29910,sum=0x89f17583e61918fc
Elapsed time: 2.28 sec. (0.05 init + 2.24 sieve) at 468929 p/sec.
Processor time: 1.86 sec. (0.05 init + 1.81 sieve) at 579732 p/sec.
Average processor utilization: 1.07 (init), 0.81 (sieve)
Only change:
increased ITERATIONS_PER_KERNEL from 300 to 1000 in appcu.h
EDITH says: 200 is good too... |
|
|
Ken_g6 Volunteer developer
 Send message
Joined: 4 Jul 06 Posts: 921 ID: 3110 Credit: 218,950,902 RAC: 3,798
                          
|
|
That doesn't look like a permanent fix. More like you moved the errors somewhere else. It is interesting, though.
____________
|
|
|
|
|
|
Yes, I don't get either why this fixes the problem...
With this fix I tested some WUs that generated errors with the official app, but could not reproduce the errors...
A cudaThreadSynchronize(); before the stream creation and just before leaving check_ns, with 300 ITERATIONS_PER_THREAD, yields the same error as before.
I reapplied the patch I provided to you with -R to reverse it; I now have the initial SetCUDABlockingSync method using the "Driver API" in it, and this also fails:
./ppsieve-cuda-x86_64-linux.old -p332325478000000 -P332325479000000 -k 1201 -K 9999 -N 2000000 -z normal
ppsieve version cuda-0.1.3 (testing)
Compiled Aug 3 2010 with GCC 4.1.2 20080704 (Red Hat 4.1.2-48)
nstart=82, nstep=35, gpu_nstep=35
ppsieve initialized: 1201 <= k <= 9999, 82 <= n <= 2000000
Sieve started: 332325478000000 <= p < 332325479000000
Thread 0 starting
Detected GPU 0: GeForce GTX 460
Detected compute capability: 2.1
Detected 7 multiprocessors.
Computation Error: no candidates found for p=332325478168793.
332325478382197 | 1579*2^916630+1
332325478899881 | 6055*2^1462552+1
Thread 0 completed
Waiting for threads to exit
Sieve complete: 332325478000000 <= p < 332325479000000
Found 2 factors
count=29910,sum=0x89f17583e61918fc
Elapsed time: 2.46 sec. (0.05 init + 2.40 sieve) at 436129 p/sec.
Processor time: 2.42 sec. (0.05 init + 2.36 sieve) at 443627 p/sec.
Average processor utilization: 1.04 (init), 0.98 (sieve)
So I think we can rule out my changes as the root cause. |
|
|
|
|
|
302, 299, and 297 yield the problem; 298, 296, 295, 301, and 303 do not.
It is always
Computation Error: no candidates found for p=332325478168793. |
|
|
Ken_g6 Volunteer developer
 Send message
Joined: 4 Jul 06 Posts: 921 ID: 3110 Credit: 218,950,902 RAC: 3,798
                          
|
|
I believe I've found the problem! And there's nothing wrong with your Fermi-based cards. Full details to come when I have a fix up.
____________
|
|
|
Ken_g6 Volunteer developer
 Send message
Joined: 4 Jul 06 Posts: 921 ID: 3110 Credit: 218,950,902 RAC: 3,798
                          
|
|
OK, new version pushed, and here's what went wrong:
First of all, I have two different counters when looping through the N's. I have one inside the CUDA kernel. And because the CUDA kernel's work is split up into bite-sized (not byte-sized) chunks to keep the screen usable, and the kernel forgets where it was between chunks, I have a counter outside the kernel that passes in where it should be. Since the internal counter goes in steps of ~35 N's, the external counter needs to jump a multiple of that.
Now, there is another consideration for the size of that external counter. Cards with some compute capabilities can run more threads at the same time than others. So I divided the size of the external counter (times 384) by that number. Well, it so happens that low compute-capability cards run 384 threads at a time per multiprocessor, and somewhat newer ones run 768 (384*2). But Fermis were running 1024. Since 384 does not divide 1024 evenly, the external counter was not counting by a multiple of the step size! Oops!
Special thanks to Gerrit (roadrunner_gs) for putting up with my obsessive metaphorical search for my keys in his house when I should have been looking under the lamppost.
____________
|
|
|
Ken_g6 Volunteer developer
 Send message
Joined: 4 Jul 06 Posts: 921 ID: 3110 Credit: 218,950,902 RAC: 3,798
                          
|
|
No test results on 0.1.3e yet? Are all the testers asleep?
I need lower-end cards' test results as well, since I changed the block size to 192 from 128.
____________
|
|
|
Scott Brown Volunteer moderator Project administrator Volunteer tester Project scientist
 Send message
Joined: 17 Oct 05 Posts: 2258 ID: 1178 Credit: 10,867,108,087 RAC: 11,866,263
                                        
|
No test results on 0.1.3e yet? Are all the testers asleep?
I need lower-end cards' test results as well, since I changed the block size to 192 from 128.
Still just Linux, or is it ready for Windows boxes to test also?
____________
141941*2^4299438-1 is prime!
|
|
|
Ken_g6 Volunteer developer
 Send message
Joined: 4 Jul 06 Posts: 921 ID: 3110 Credit: 218,950,902 RAC: 3,798
                          
|
|
Linux, Windows, just haven't gotten Iain to make a Mac version yet.
____________
|
|
|
|
|
No test results on 0.1.3e yet? Are all the testers asleep?
I need lower-end cards' test results as well, since I changed the block size to 192 from 128.
Is this card old enough and slow enough for your tests??
http://www.primegrid.com/show_host_detail.php?hostid=104626 |
|
|
|
|
|
By "pushed", you mean to Google, I suppose?
(...)
Now, there is another consideration for the size of that external counter. Cards with some compute capabilities can run more threads at the same time than others. So I divided the size of the external counter (times 384) by that number. Well, it so happens that low compute-capability cards run 384 threads at a time per multiprocessor, and somewhat newer ones run 768 (384*2). But Fermis were running 1024. Since 384 does not divide 1024 evenly, the external counter was not counting by a multiple of the step size! Oops!
What is it at now for Fermi? 1152 or 1536?
Special thanks to Gerrit (roadrunner_gs) for putting up with my obsessive metaphorical search for my keys in his house when I should have been looking under the lamppost.
No matter. |
|
|
Ken_g6 Volunteer developer
 Send message
Joined: 4 Jul 06 Posts: 921 ID: 3110 Credit: 218,950,902 RAC: 3,798
                          
|
|
1536. And now it's midnight *here* and *I'm* going to sleep.
____________
|
|
|
|
|
|
With 1536 it is considerably slower now.
1406 M p/sec vs 2000 M p/sec.
Maybe the THREADS should be lowered to 1152 or even 768?
I will give it a try with my (old?) sources if you haven't changed too much. |
|
|
|
|
|
1920 yields around 1910 M p/s on my GTX 460.
I put the block-size down to 128 again as this is faster. |
|
|
Ken_g6 Volunteer developer
 Send message
Joined: 4 Jul 06 Posts: 921 ID: 3110 Credit: 218,950,902 RAC: 3,798
                          
|
|
Can you try it back at 128 blocksize and 1024 threads/multiprocessor? That's where I'd like to put it if 1536 didn't work.
____________
|
|
|
|
|
|
You mean BLOCKSIZE in appcu.h set to 128 and cthread_count in cuda_app_init set to 1024?
But that was the initial value that brought up the error... |
|
|
Ken_g6 Volunteer developer
 Send message
Joined: 4 Jul 06 Posts: 921 ID: 3110 Credit: 218,950,902 RAC: 3,798
                          
|
|
I know. I believe I found the bug. I suppose it should be tested with the original variables to make sure anyway.
appcu.cu, around lines 255-260, should look like:
// N's to search each time a kernel is run:
ld_kernel_nstep = ITERATIONS_PER_KERNEL;
// Adjust for differing block sizes.
ld_kernel_nstep *= 384;
ld_kernel_nstep /= (cthread_count/gpuprop.multiProcessorCount);
// Finally, make sure it's a multiple of ld_nstep!!!
ld_kernel_nstep *= ld_nstep;
See, I moved the multiplication by ld_nstep from the beginning to the end.
____________
|
|
|
|
|
|
Okay, I pulled from git and put in the old values; it seems to work with at least two ranges that errored before, one short and one WU-long range. |
|
|
Ken_g6 Volunteer developer
 Send message
Joined: 4 Jul 06 Posts: 921 ID: 3110 Credit: 218,950,902 RAC: 3,798
                          
|
|
How fast?
I'll start preparing a build with those values.
____________
|
|
|
|
|
|
Speed is now
(...)
Thread 0 completed
Waiting for threads to exit
Sieve complete: 249871000000000 <= p < 249872000000000
Found 321 factors
count=30166916,sum=0xa0b7dde9a581c7d4
Elapsed time: 498.77 sec. (0.05 init + 498.72 sieve) at 2005291 p/sec.
Processor time: 18.81 sec. (0.05 init + 18.76 sieve) at 53300190 p/sec.
Average processor utilization: 1.05 (init), 0.04 (sieve)
real 8m18.770s
user 0m17.195s
sys 0m1.619s
was before (0.1.3c-app with the errors in it)
Thread 0 completed
Waiting for threads to exit
Sieve complete: 249871000000000 <= p < 249872000000000
Found 321 factors
count=30166916,sum=0xa0b7dde9a581c7d4
Elapsed time: 503.48 sec. (0.05 init + 503.43 sieve) at 1986536 p/sec.
Processor time: 19.30 sec. (0.05 init + 19.25 sieve) at 51949276 p/sec.
Average processor utilization: 1.04 (init), 0.04 (sieve)
As fast as before, even a tad faster, but I attribute this to statistical variance.
BLOCKSIZE=128
cthread_count=1024 |
|
|
Ken_g6 Volunteer developer
 Send message
Joined: 4 Jul 06 Posts: 921 ID: 3110 Credit: 218,950,902 RAC: 3,798
                          
|
|
Alright, PPSieve-Cuda version 0.1.3f (as in Finally!!!) is out. I don't expect much if any testing is necessary. And this should have fixed the Fermi bug!
____________
|
|
|
|
|
|
I think it should be pushed into BOINC quickly if no more problems arise; for example, this host is trashing WUs in large numbers... |
|
|
|
|
|
roadrunner_gs:
As you requested, I ran the following command:
C:\ProgramData\BOINC\projects\www.primegrid.com>primegrid_ppsieve_1.25_windows_intelx86__cuda23 -p332325478000000 -332325479000000 -k 1201 -K 9999 -N 2000000 -z normal
The result in the stderr.txt file is:
16:51:50 (5652): Can't open init data file - running in standalone mode
pmax not specified, using default pmax = pmin + 1e9
Please specify an input file or all of kmin, kmax, and nmax
16:51:50 (5652): called boinc_finish
16:52:34 (712): Can't set up shared mem: -1. Will run in standalone mode.
Sieve started: 332325478000000 <= p < 332325479000000
Thread 0 starting
Detected GPU 0: GeForce GTX 460
Detected compute capability: 2.1
Detected 7 multiprocessors.
Computation Error: no candidates found for p=332325478168793.
16:52:34 (712): called boinc_finish
____________
|
|
|
|
|
|
Could you please try the app from the download in the first post? It should be the newest app and therefore error-free. Unfortunately the boinc-apps are faulty for Fermi-cards. |
|
|
|
|
Could you please try the app from the download in the first post? It should be the newest app and therefore error-free. Unfortunately the boinc-apps are faulty for Fermi-cards.
Looks better now that I DL'ed your latest version. Here's the output:
Command Line:
C:\ProgramData\BOINC\projects\www.primegrid.com>ppsieve-cuda-boinc-x86-windows -p332325478000000 -P332325479000000 -k 1201 -K 9999 -N 2000000 -z normal
ppsieve version cuda-0.1.3f (testing)
nstart=82, nstep=35, gpu_nstep=35
ppsieve initialized: 1201 <= k <= 9999, 82 <= n <= 2000000
332325478382197 | 1579*2^916630+1
332325478899881 | 6055*2^1462552+1
stderr.txt:
18:18:59 (5024): Can't set up shared mem: -1. Will run in standalone mode.
Sieve started: 332325478000000 <= p < 332325479000000
Thread 0 starting
Detected GPU 0: GeForce GTX 460
Detected compute capability: 2.1
Detected 7 multiprocessors.
Thread 0 completed
Sieve complete: 332325478000000 <= p < 332325479000000
count=29910,sum=0x89f17583e61918fc
Elapsed time: 1.28 sec. (0.14 init + 1.14 sieve) at 919802 p/sec.
Processor time: 0.76 sec. (0.16 init + 0.61 sieve) at 1723489 p/sec.
Average processor utilization: 1.11 (init), 0.53 (sieve)
18:19:01 (5024): called boinc_finish
____________
|
|
|
|
|
|
May I ask when development on an ATI app will begin? I have really been looking forward to an ATI app for quite some time now.
____________
May the Force be with you always.
|
|
|
Ken_g6 Volunteer developer
 Send message
Joined: 4 Jul 06 Posts: 921 ID: 3110 Credit: 218,950,902 RAC: 3,798
                          
|
|
No ATI work until I get this next algorithm tested. ATI is even worse at 32-bit multiplies than pre-Fermi nVIDIA, so I'm thinking a texture table lookup may be the way to go.
But first, can I get some baseline tests done?
Edit: Of course, anyone else is quite welcome to port the app to ATI. I'm planning to use OpenCL when I get around to it, but you're welcome to use the old Brook+ if you think it will be faster.
____________
|
|
|
John Honorary cruncher
 Send message
Joined: 21 Feb 06 Posts: 2875 ID: 2449 Credit: 2,681,934 RAC: 0
                 
|
Alright, PPSieve-Cuda version 0.1.3f (as in Finally!!!) is out. I don't expect much if any testing is necessary. And this should have fixed the Fermi bug!
The sooner we can get some more feedback on this version, the faster we can get it uploaded to BOINC...presuming that it's positive feedback. :)
ppsieve-cuda.zip
And the faster we can nail down the CUDA app, the sooner Ken can start work on the ATI app. ;)
____________
|
|
|
|
|
|
I would test, but the latest builds don't support OS X. And the older CUDA apps for OS X leave the GUI frozen or unresponsive. |
|
|
pschoefer Volunteer developer Volunteer tester
 Send message
Joined: 20 Sep 05 Posts: 667 ID: 845 Credit: 2,374,701,989 RAC: 15,281
                          
|
|
Test results for ppsieve-cuda-boinc-x86-windows.exe 0.1.3f on Windows 7 Prof. x64:
9800GT @ 720/1875/1050
12:16:11 (3836): Can't open init data file - running in standalone mode
Sieve started: 249871200000000 <= p < 249871230000000
Thread 0 starting
Detected GPU 0: GeForce 9800 GT
Detected compute capability: 1.1
Detected 14 multiprocessors.
Thread 0 completed
Sieve complete: 249871200000000 <= p < 249871230000000
count=905431,sum=0x43ba01dc413353cd
Elapsed time: 43.35 sec. (0.06 init + 43.29 sieve) at 696433 p/sec.
Processor time: 0.64 sec. (0.06 init + 0.58 sieve) at 52145401 p/sec.
Average processor utilization: 0.98 (init), 0.01 (sieve)
12:16:54 (3836): called boinc_finish
No surprise, same speed as before and no errors.
GTX460 @ 800/1600/2000:
12:18:25 (1044): Can't open init data file - running in standalone mode
Sieve started: 249871200000000 <= p < 249871230000000
Thread 0 starting
Detected GPU 0: GeForce GTX 460
Detected compute capability: 2.1
Detected 7 multiprocessors.
Thread 0 completed
Sieve complete: 249871200000000 <= p < 249871230000000
count=904913,sum=0x41ed865ceb5e220d
Elapsed time: 12.86 sec. (0.09 init + 12.76 sieve) at 2361893 p/sec.
Processor time: 1.09 sec. (0.11 init + 0.98 sieve) at 30673937 p/sec.
Average processor utilization: 1.19 (init), 0.08 (sieve)
12:18:38 (1044): called boinc_finish
No errors, same factors found, very nice. :)
____________
|
|
|
Sysadm@Nbg Volunteer moderator Volunteer tester Project scientist
 Send message
Joined: 5 Feb 08 Posts: 1199 ID: 18646 Credit: 634,177,371 RAC: 345,971
                      
|
But first, can I get some baseline tests done?
GeForce 9800 GTX+ on Ubuntu-Linux 64-bit (server):
official app
Can't open init data file - running in standalone mode
Sieve started: 42070000000000 <= p < 42070030000000
Thread 0 starting
Detected GPU 0: GeForce 9800 GTX+
Detected compute capability: 1.1
Detected 16 multiprocessors.
Thread 0 completed
Sieve complete: 42070000000000 <= p < 42070030000000
count=955289,sum=0x2dbc17167afb6a8d
Elapsed time: 44.86 sec. (0.02 init + 44.84 sieve) at 672311 p/sec.
Processor time: 1.17 sec. (0.03 init + 1.14 sieve) at 26444351 p/sec.
Average processor utilization: 1.69 (init), 0.03 (sieve)
called boinc_finish
Found 67 factors
alpha app
$ ./ppsieve-cuda-64bit-x86_64-linux -p42070e9 -P42070030e6 -k 1201 -K 9999 -N 2000000 -z normal
ppsieve version cuda-0.1.0-beta (testing)
Compiled Aug 2 2010 with GCC 4.3.3
nstart=76, nstep=32
ppsieve initialized: 1201 <= k <= 9999, 76 <= n <= 2000000
Sieve started: 42070000000000 <= p < 42070030000000
Starting 1 threads.
Detected GPU 0: GeForce 9800 GTX+
Detected compute capability: 1.1
Detected 16 multiprocessors.
Thread 0 completed
Waiting for threads to exit
Sieve complete: 42070000000000 <= p < 42070030000000
Found 97 factors
count=955289,sum=0x2dbc17167afb6a8d
Elapsed time: 71.67 sec. (0.02 init + 71.65 sieve) at 418730 p/sec.
Processor time: 9.14 sec. (0.02 init + 9.12 sieve) at 3289600 p/sec.
Average processor utilization: 1.13 (init), 0.13 (sieve)
There is a difference in the number of factors found!
____________
Sysadm@Nbg
my current lucky number: 113856050^65536 + 1
PSA-PRPNet-Stats-URL: http://u-g-f.de/PRPNet/
|
|
|
|
|
|
XP-32 - NVS-3100M:
ppsieve-cuda-x86-windows.exe -p42070e9 -P42070003e6 -k 1201 -K 9999 -N 2000000 -z normal
ppsieve version cuda-0.1.3f (testing)
nstart=76, nstep=32, gpu_nstep=32
ppsieve initialized: 1201 <= k <= 9999, 76 <= n <= 2000000
Sieve started: 42070000000000 <= p < 42070003000000
Thread 0 starting
Detected GPU 0: NVS 3100M
Detected compute capability: 1.2
Detected 2 multiprocessors.
42070000070587 | 9475*2^197534+1
42070000198537 | 3373*2^1046686+1
42070000300049 | 9139*2^461846+1
42070000464001 | 4179*2^1577462+1
42070001011573 | 7113*2^215532+1
42070002690167 | 2553*2^1888870+1
Thread 0 completed
Waiting for threads to exit
Sieve complete: 42070000000000 <= p < 42070003000000
Found 6 factors
count=95668,sum=0x37dacb7121ccffe4
Elapsed time: 56.31 sec. (0.05 init + 56.27 sieve) at 55909 p/sec.
Processor time: 0.22 sec. (0.06 init + 0.16 sieve) at 20132659 p/sec.
Average processor utilization: 1.33 (init), 0.00 (sieve) |
|
|
|
|
|
No errors here with the downloaded boinc-bin, however i got 67 factors
ppsieve version cuda-0.1.3f (testing)
Compiled Aug 4 2010 with GCC 4.3.3
nstart=76, nstep=32, gpu_nstep=32
ppsieve initialized: 1201 <= k <= 9999, 76 <= n <= 2000000
42070000070587 | 9475*2^197534+1
42070000198537 | 3373*2^1046686+1
42070000300049 | 9139*2^461846+1
42070000464001 | 4179*2^1577462+1
42070001011573 | 7113*2^215532+1
(...)
42070029355117 | 6241*2^1814160+1
42070029605521 | 9537*2^384248+1
Found 67 factors
real 29m28.677s
user 21m7.112s
sys 0m0.212s
cat stderr.txt
Can't open init data file - running in standalone mode
Sieve started: 42070000000000 <= p < 42070030000000
Thread 0 starting
Detected GPU 0: GeForce 9400 GT
Detected compute capability: 1.1
Detected 2 multiprocessors.
Thread 0 completed
Sieve complete: 42070000000000 <= p < 42070030000000
count=955289,sum=0x2dbc17167afb6a8d
Elapsed time: 1766.60 sec. (0.02 init + 1766.58 sieve) at 17065 p/sec.
Processor time: 1267.25 sec. (0.03 init + 1267.22 sieve) at 23790 p/sec.
Average processor utilization: 1.10 (init), 0.72 (sieve)
called boinc_finish
P.S.: Compiled with 1024 as cthread_count it was twice as fast, and the CPU load was lower too; no idea how X Windows would have performed.
(...)
Thread 0 completed
Waiting for threads to exit
Sieve complete: 42070000000000 <= p < 42070030000000
Found 67 factors
count=955289,sum=0x2dbc17167afb6a8d
Elapsed time: 958.57 sec. (0.02 init + 958.54 sieve) at 31450 p/sec.
Processor time: 489.08 sec. (0.02 init + 489.05 sieve) at 61643 p/sec.
Average processor utilization: 1.14 (init), 0.51 (sieve)
real 15m58.570s
user 8m8.893s
sys 0m0.185s |
|
|
Ken_g6 Volunteer developer
 Send message
Joined: 4 Jul 06 Posts: 921 ID: 3110 Credit: 218,950,902 RAC: 3,798
                          
|
|
That's an awfully high CPU load. I wonder why? Are your drivers updated?
That's also the nice thing about having the real hardware - you can fiddle with things and see how fast they run. 9400GT, that's, what, compute capability 1.1? I infer from Michael Goetz that compute capabilities 1.2-1.3 are fine with 768 threads/MP. Perhaps that would work better for 1.0-1.1 as well?
____________
|
|
|
|
|
|
I have the "old" drivers, 190.53.
With the AP26 app I do not see any load, but that app was heavily memory-bound and had larger kernels (for one number: 9400 GT: 365 s; GTX 260: 33 s; GTX 460: 39 s). Maybe the sleep while checking for kernel completion should be tweaked. But overall I think this card is way too slow to even consider using it.
# /opt/nvidia_cuda_sdk/bin/linux/release/deviceQuery
CUDA Device Query (Runtime API) version (CUDART static linking)
There is 1 device supporting CUDA
Device 0: "GeForce 9400 GT"
CUDA Capability Major revision number: 1
CUDA Capability Minor revision number: 1
Total amount of global memory: 1073020928 bytes
Number of multiprocessors: 2
Number of cores: 16
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 16384 bytes
Total number of registers available per block: 8192
Warp size: 32
Maximum number of threads per block: 512
Maximum sizes of each dimension of a block: 512 x 512 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
Maximum memory pitch: 262144 bytes
Texture alignment: 256 bytes
Clock rate: 1.40 GHz
Concurrent copy and execution: Yes
Run time limit on kernels: No
Integrated: No
Support host page-locked memory mapping: No
Compute mode: Default (multiple host threads can use this device simultaneously)
Test PASSED
Press ENTER to exit...
I also have my GTX 260, but no capable PSU at the moment. |
|
|
|
|
Another time, lower the memory requirement, please!
FYI, 10MB should be enough to run this app, even in the worst case I could think of.
Who at PG is able to lower this requirement, since it is set on the server side of the project? All 256MB cards will suffer from this driver issue :(
+1. We need to get this false minimum requirement lowered somehow.
It will be addressed but it's the middle of the night in Lithuania.
256MB card
07-Aug-2010 20:27:15 [PrimeGrid] Message from server: No work sent
07-Aug-2010 20:27:15 [PrimeGrid] Message from server: Your NVIDIA GPU has insufficient memory (need 250MB)
07-Aug-2010 20:27:15 [PrimeGrid] Message from server: No work available for the applications you have selected. Please check your preferences on the w
Any News? :(
____________
Member of Crunching Family
http://crunching-family.at/ |
|
|
Ken_g6 Volunteer developer
 Send message
Joined: 4 Jul 06 Posts: 921 ID: 3110 Credit: 218,950,902 RAC: 3,798
                          
|
|
I've just posted v0.1.4 of PPSieve-CUDA. There are two major changes:
1. I combined Gerrit's sleep-loop code with my sleep-wait code. I'm hoping that lowers CPU use on machines with older drivers like his.
2. More importantly, I've added a switch, "-m", to vary the number of blocks or threads assigned to each GPU multiprocessor. It's come to my attention that the CUDA Occupancy Calculator may not be right in all cases. So I'd like testers to try a few different values. It's probably best to specify the number in blocks: threads are rounded down mod 128. The default values for various compute capabilities are:
1.0: 3
1.1: 3
1.2: 6
1.3: 6
2.0: 8
2.1: 8
I suspect the first two should be 6, but I'm not sure. So please experiment and let me know what's fastest. When the numbers settle down, if they're different from current defaults, I'll post v0.1.4a with changed defaults.
____________
|
|
|
pschoefer Volunteer developer Volunteer tester
 Send message
Joined: 20 Sep 05 Posts: 667 ID: 845 Credit: 2,374,701,989 RAC: 15,281
                          
|
|
Test results for 0.1.4 (ppsieve-cuda-boinc-x86-windows), range 249871200000000 <= p < 249871230000000:
9800 GT (compute capability 1.1, driver 258.96)
default (=-m3)
Elapsed time: 43.36 sec. (0.06 init + 43.30 sieve) at 696150 p/sec.
Processor time: 0.64 sec. (0.06 init + 0.58 sieve) at 52145401 p/sec.
Average processor utilization: 1.05 (init), 0.01 (sieve)
-m6
Elapsed time: 43.63 sec. (0.06 init + 43.57 sieve) at 691906 p/sec.
Processor time: 0.52 sec. (0.05 init + 0.47 sieve) at 64312661 p/sec.
Average processor utilization: 0.81 (init), 0.01 (sieve)
GTX 460 (compute capability 2.1, driver 258.96)
default (=-m8)
Elapsed time: 12.91 sec. (0.05 init + 12.87 sieve) at 2343166 p/sec.
Processor time: 0.70 sec. (0.05 init + 0.66 sieve) at 46010952 p/sec.
Average processor utilization: 0.96 (init), 0.05 (sieve)
-m6
Elapsed time: 15.20 sec. (0.05 init + 15.15 sieve) at 1989890 p/sec.
Processor time: 0.58 sec. (0.05 init + 0.53 sieve) at 56837084 p/sec.
Average processor utilization: 0.96 (init), 0.04 (sieve)
____________
|
|
|
|
|
|
ppsieve-cuda-x86-windows.exe -p42070e9 -P42070003e6 -k 1201 -K 9999 -N 2000000 -z normal -m3
ppsieve version cuda-0.1.4 (testing)
nstart=76, nstep=32, gpu_nstep=32
ppsieve initialized: 1201 <= k <= 9999, 76 <= n <= 2000000
Sieve started: 42070000000000 <= p < 42070003000000
Thread 0 starting
Detected GPU 0: NVS 3100M
Detected compute capability: 1.2
Detected 2 multiprocessors.
42070000070587 | 9475*2^197534+1
42070000198537 | 3373*2^1046686+1
42070000300049 | 9139*2^461846+1
42070000464001 | 4179*2^1577462+1
42070001011573 | 7113*2^215532+1
42070002690167 | 2553*2^1888870+1
Thread 0 completed
Waiting for threads to exit
Sieve complete: 42070000000000 <= p < 42070003000000
Found 6 factors
count=95668,sum=0x37dacb7121ccffe4
Elapsed time: 47.17 sec. (0.09 init + 47.08 sieve) at 66819 p/sec.
Processor time: 0.23 sec. (0.09 init + 0.14 sieve) at 22369621 p/sec.
Average processor utilization: 1.00 (init), 0.00 (sieve)
ppsieve-cuda-x86-windows.exe -p42070e9 -P42070003e6 -k 1201 -K 9999 -N 2000000 -z normal -m6
ppsieve version cuda-0.1.4 (testing)
nstart=76, nstep=32, gpu_nstep=32
ppsieve initialized: 1201 <= k <= 9999, 76 <= n <= 2000000
Sieve started: 42070000000000 <= p < 42070003000000
Thread 0 starting
Detected GPU 0: NVS 3100M
Detected compute capability: 1.2
Detected 2 multiprocessors.
42070000070587 | 9475*2^197534+1
42070000198537 | 3373*2^1046686+1
42070000300049 | 9139*2^461846+1
42070000464001 | 4179*2^1577462+1
42070001011573 | 7113*2^215532+1
42070002690167 | 2553*2^1888870+1
Thread 0 completed
Waiting for threads to exit
Sieve complete: 42070000000000 <= p < 42070003000000
Found 6 factors
count=95668,sum=0x37dacb7121ccffe4
Elapsed time: 47.55 sec. (0.03 init + 47.52 sieve) at 66204 p/sec.
Processor time: 0.31 sec. (0.05 init + 0.27 sieve) at 11842741 p/sec.
Average processor utilization: 1.50 (init), 0.01 (sieve)
|
|
|
Ken_g6 Volunteer developer
 Send message
Joined: 4 Jul 06 Posts: 921 ID: 3110 Credit: 218,950,902 RAC: 3,798
                          
|
|
-m3 did better than -m6 on a compute-capability 1.2 card? Interesting. Can you reproduce that repeatably? How about on the longer test?
Also, just because I listed a few defaults doesn't mean anyone should be limited to them. Try everything from, say, 1 to 10; more if you're getting good results at 10.
____________
|
|
|
|
|
-m3 did better than -m6 on a compute-capability 1.2 card? Interesting. Can you reproduce that repeatably? How about on the longer test?
Also, just because I listed a few defaults doesn't mean anyone should be limited to them. Try everything from, say, 1 to 10; more if you're getting good results at 10.
Maybe because it's a really weak card. OK, I did another run with different -m values:
M1: 58.67
M2: 46.94
M3: 47.14
M4: 47.16
M5: 46.91
M6: 46.61
M7: 51.94
M8: 53.75
M9: 55.42
I'll run some longer tests... |
|
|
pschoefer Volunteer developer Volunteer tester
 Send message
Joined: 20 Sep 05 Posts: 667 ID: 845 Credit: 2,374,701,989 RAC: 15,281
                          
|
|
OK, some more numbers for the 9800GT (30M range):
-m1: 54.47
-m2: 43.77
-m3: 43.36
-m4: 46.13
-m5: 43.78
-m6: 43.60
-m7: 45.25
-m8: 43.95
-m9: 43.94
-m10: 44.80
And the GTX460:
-m1: 64.48
-m2: 34.30
-m3: 24.70
-m4: 20.40
-m5: 18.66
-m6: 17.42
-m7: 15.03
-m8: 14.72
-m9: 21.17
-m10: 19.95
-m16: 14.69
-m20: 16.18
-m24: 14.81
-m32: 15.00
-m40: 14.96
-m48: 15.30
____________
|
|
|
|
|
|
ppsieve-cuda-x86-windows.exe -p42070e9 -P42070025e6 -k 1201 -K 9999 -N 2000000 -z normal -m2 1>m
Sieve started: 42070000000000 <= p < 42070025000000
Resuming from checkpoint p=42070003670017 in ppcheck42070e9.txt
Thread 0 starting
Detected GPU 0: GeForce GTS 250
Detected compute capability: 1.1
Detected 16 multiprocessors.
times for different -m levels:
m1: 37.74 sec. (0.03 init + 37.71 sieve) at 667435 p/sec.
m2: 38.06 sec. (0.03 init + 38.03 sieve) at 661686 p/sec.
m3: 36.78 sec. (0.03 init + 36.75 sieve) at 684716 p/sec.
m4: 39.90 sec. (0.02 init + 39.89 sieve) at 630892 p/sec.
m5: 36.69 sec. (0.03 init + 36.66 sieve) at 686464 p/sec.
m6: 36.71 sec. (0.03 init + 36.68 sieve) at 686172 p/sec.
m7: 38.61 sec. (0.03 init + 38.58 sieve) at 652321 p/sec.
Looks like -m5 and -m6 work best.
The same tests on my GTX 260 came out between 41 s for -m1 and 39.5 s for -m6; even higher values did not improve anything. |
|
|
Ken_g6 Volunteer developer
 Send message
Joined: 4 Jul 06 Posts: 921 ID: 3110 Credit: 218,950,902 RAC: 3,798
                          
|
|
Alright, I'm convinced that any -m deviations are within the margin of error, so I'm sticking with the original defaults.
But I have a new release, 0.1.5, with two changes:
First, I decreased the number of factors returned again! This helps PrimeGrid's validator keep up with all our results. Now the short test should only return two factors:
42070000070587 | 9475*2^197534+1
42070000198537 | 3373*2^1046686+1
Expanding the test to -P42070030e6 returns 25 more:
42070003101727 | 4207*2^1054290+1
42070003511309 | 6057*2^1043547+1
42070006307657 | 1513*2^1771812+1
42070006388603 | 2059*2^1816098+1
42070007177519 | 5437*2^1121592+1
42070007396759 | 7339*2^1803518+1
42070008823897 | 4639*2^952018+1
42070008858187 | 2893*2^317690+1
42070010190569 | 5625*2^1903125+1
42070011430123 | 3821*2^1406279+1
42070012301263 | 1957*2^1185814+1
42070013521999 | 1965*2^404493+1
42070013970587 | 7143*2^1462422+1
42070013989247 | 5037*2^838603+1
42070017332953 | 6237*2^1916994+1
42070018235321 | 1941*2^363948+1
42070019542387 | 8587*2^1703626+1
42070023987581 | 9811*2^318944+1
42070024339237 | 9257*2^1170495+1
42070024532551 | 4311*2^1690093+1
42070024936837 | 5679*2^1726142+1
42070024995961 | 9111*2^1707153+1
42070026021997 | 4039*2^1819590+1
42070027452199 | 1323*2^854008+1
42070029006583 | 5943*2^663870+1
So I'd suggest running at least to -P42070010e6, which would return 10 factors in total.
The second change is in how I built the app. Due to some confusion over a minor bug (since fixed), I made a real Makefile. It seems to have produced smaller code too. But it should definitely be tested.
So let's get this tested and into production so the Fermi bug can be put to rest once and for all!
____________
|
|
|
|
|
|
I'd be more than up to testing this, but every time I try to, I get the following error:
./ppsieve-cuda-x86-linux: error while loading shared libraries: libcudart.so.2: cannot open shared object file: No such file or directory
Error after building from source. Sorry. :(
Do you have any idea how to fix this problem? Note that the current CUDA app in BOINC works perfectly fine on my machine.
____________
|
|
|
|
|
I'd be more than up to testing this, but every time I try to, I get the following error:
./ppsieve-cuda-x86-linux: error while loading shared libraries: libcudart.so.2: cannot open shared object file: No such file or directory
Error after building from source. Sorry. :(
Do you have any idea how to fix this problem? Note that the current CUDA app in BOINC works perfectly fine on my machine.
You're trying to run the x86 binary; do you have a 64-bit Linux? If so, you need the 32-bit version of libcudart.so.2 in addition to your 64-bit version. Or try the ppsieve-cuda-x86_64-linux binary; that should work. |
|
|
pschoefer Volunteer developer Volunteer tester
 Send message
Joined: 20 Sep 05 Posts: 667 ID: 845 Credit: 2,374,701,989 RAC: 15,281
                          
|
|
Test results for 0.1.5 on Win7:
9800GT @ 600/1500/900 (note the 375 MHz decrease in shader clock, if you compare it to my earlier results):
11:52:30 (3804): Can't open init data file - running in standalone mode
Sieve started: 42070000000000 <= p < 42070030000000
Thread 0 starting
Detected GPU 0: GeForce 9800 GT
Detected compute capability: 1.1
Detected 14 multiprocessors.
Thread 0 completed
Sieve complete: 42070000000000 <= p < 42070030000000
count=955289,sum=0x2dbc17167afb6a8d
Elapsed time: 61.52 sec. (0.03 init + 61.49 sieve) at 490242 p/sec.
Processor time: 1.17 sec. (0.06 init + 1.11 sieve) at 27174364 p/sec.
Average processor utilization: 2.13 (init), 0.02 (sieve)
11:53:31 (3804): called boinc_finish
27 factors found, same speed as 0.1.4.
GTX460 @ 800/1600/2000:
12:27:37 (3044): Can't open init data file - running in standalone mode
Sieve started: 42070000000000 <= p < 42070030000000
Thread 0 starting
Detected GPU 0: GeForce GTX 460
Detected compute capability: 2.1
Detected 7 multiprocessors.
Thread 0 completed
Sieve complete: 42070000000000 <= p < 42070030000000
count=955289,sum=0x2dbc17167afb6a8d
Elapsed time: 14.89 sec. (0.02 init + 14.87 sieve) at 2027361 p/sec.
Processor time: 0.95 sec. (0.03 init + 0.92 sieve) at 32753546 p/sec.
Average processor utilization: 1.42 (init), 0.06 (sieve)
12:27:52 (3044): called boinc_finish
27 factors found.
____________
|
|
|
Ken_g6 Volunteer developer
 Send message
Joined: 4 Jul 06 Posts: 921 ID: 3110 Credit: 218,950,902 RAC: 3,798
                          
|
I'd be more than up to testing this, but every time I try to, I get the following error:
./ppsieve-cuda-x86-linux: error while loading shared libraries: libcudart.so.2: cannot open shared object file: No such file or directory
Error after building from source. Sorry. :(
Do you have any idea how to fix this problem? Note that the current CUDA app in BOINC works perfectly fine on my machine.
First, try copying the libcudart-type file you find in the BOINC projects/www.primegrid.com directory to the directory where you're testing ppsieve-cuda.
If that doesn't work, this file comes with the SDK, here.
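If you want to check whether the dynamic loader can actually find the library before rerunning, here is a quick diagnostic script (my own hypothetical helper, not part of ppsieve or BOINC):

```python
# Hypothetical diagnostic helper (not part of ppsieve): checks whether the
# dynamic loader can resolve a shared library from the current environment.
# On 64-bit Linux running the 32-bit binary, this typically fails until the
# 32-bit libcudart.so.2 is placed somewhere on LD_LIBRARY_PATH.
import ctypes
import os

def can_load(libname):
    """Return True if ctypes can dlopen() the given shared library."""
    try:
        ctypes.CDLL(libname)
        return True
    except OSError:
        return False

if __name__ == "__main__":
    name = "libcudart.so.2"
    if can_load(name):
        print(name, "found")
    else:
        print(name, "NOT found; LD_LIBRARY_PATH =",
              os.environ.get("LD_LIBRARY_PATH", "(unset)"))
```

If it reports the library as missing, copying the 32-bit libcudart.so.2 into the test directory and running with `LD_LIBRARY_PATH=.` should make the loader pick it up.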
____________
|
|
|
|
|
|
I got it to work by using the actual application rather than the .sh file that was included.
8600M GT on Ubuntu 9.10 64 bit:
Short Test:
mmillerick@mmillerick-laptop:~$ '/home/mmillerick/Desktop/ppsieve-cuda/ppsieve-cuda-x86_64-linux' -p42070e9 -P42070003e6 -k 1201 -K 9999 -N 2000000 -z normal
ppsieve version cuda-0.1.5 (testing)
Compiled Aug 10 2010 with GCC 4.3.3
nstart=76, nstep=32, gpu_nstep=32
ppsieve initialized: 1201 <= k <= 9999, 76 <= n <= 2000000
Sieve started: 42070000000000 <= p < 42070003000000
Thread 0 starting
Detected GPU 0: GeForce 8600M GT
Detected compute capability: 1.1
Detected 4 multiprocessors.
42070000070587 | 9475*2^197534+1
42070000198537 | 3373*2^1046686+1
Thread 0 completed
Waiting for threads to exit
Sieve complete: 42070000000000 <= p < 42070003000000
Found 2 factors
count=95668,sum=0x37dacb7121ccffe4
Elapsed time: 35.12 sec. (0.02 init + 35.10 sieve) at 89622 p/sec.
Processor time: 0.24 sec. (0.01 init + 0.23 sieve) at 13677078 p/sec.
Average processor utilization: 0.60 (init), 0.01 (sieve)
Long Test:
mmillerick@mmillerick-laptop:~$ '/home/mmillerick/Desktop/ppsieve-cuda/ppsieve-cuda-x86_64-linux' -p42070e9 -P42070030e6 -k 1201 -K 9999 -N 2000000 -z normal
ppsieve version cuda-0.1.5 (testing)
Compiled Aug 10 2010 with GCC 4.3.3
nstart=76, nstep=32, gpu_nstep=32
ppsieve initialized: 1201 <= k <= 9999, 76 <= n <= 2000000
Sieve started: 42070000000000 <= p < 42070030000000
Thread 0 starting
Detected GPU 0: GeForce 8600M GT
Detected compute capability: 1.1
Detected 4 multiprocessors.
42070000070587 | 9475*2^197534+1
42070000198537 | 3373*2^1046686+1
42070003101727 | 4207*2^1054290+1
42070003511309 | 6057*2^1043547+1
p=42070005505025, 91.75K p/sec, 0.01 CPU cores, 18.4% done. ETA 11 Aug 09:30
42070006307657 | 1513*2^1771812+1
42070006388603 | 2059*2^1816098+1
42070007177519 | 5437*2^1121592+1
42070007396759 | 7339*2^1803518+1
42070008823897 | 4639*2^952018+1
42070008858187 | 2893*2^317690+1
42070010190569 | 5625*2^1903125+1
p=42070010747905, 87.38K p/sec, 0.00 CPU cores, 35.8% done. ETA 11 Aug 09:30
42070011430123 | 3821*2^1406279+1
42070012301263 | 1957*2^1185814+1
42070013521999 | 1965*2^404493+1
42070013970587 | 7143*2^1462422+1
42070013989247 | 5037*2^838603+1
p=42070015990785, 87.38K p/sec, 0.00 CPU cores, 53.3% done. ETA 11 Aug 09:30
42070017332953 | 6237*2^1916994+1
42070018235321 | 1941*2^363948+1
42070019542387 | 8587*2^1703626+1
p=42070020971521, 83.01K p/sec, 0.00 CPU cores, 69.9% done. ETA 11 Aug 09:30
42070023987581 | 9811*2^318944+1
42070024339237 | 9257*2^1170495+1
42070024532551 | 4311*2^1690093+1
42070024936837 | 5679*2^1726142+1
42070024995961 | 9111*2^1707153+1
42070026021997 | 4039*2^1819590+1
p=42070026214401, 84.74K p/sec, 0.00 CPU cores, 87.4% done. ETA 11 Aug 09:30
42070027452199 | 1323*2^854008+1
42070029006583 | 5943*2^663870+1
Thread 0 completed
Waiting for threads to exit
Sieve complete: 42070000000000 <= p < 42070030000000
Found 27 factors
count=955289,sum=0x2dbc17167afb6a8d
Elapsed time: 345.74 sec. (0.02 init + 345.73 sieve) at 87198 p/sec.
Processor time: 1.30 sec. (0.02 init + 1.28 sieve) at 23552000 p/sec.
Average processor utilization: 1.21 (init), 0.00 (sieve)
____________
|
|
|
|
|
|
Works for the GTX 460, will test the 9400 GT later in the day/night.
time ./ppsieve-cuda-boinc-x86_64-linux.new -p42070e9 -P42070030e6 -k 1201 -K 9999 -N 2000000 -z normal
ppsieve version cuda-0.1.5 (testing)
Compiled Aug 10 2010 with GCC 4.3.3
nstart=76, nstep=32, gpu_nstep=32
ppsieve initialized: 1201 <= k <= 9999, 76 <= n <= 2000000
Sieve started: 42070000000000 <= p < 42070030000000
Thread 0 starting
Detected GPU 0: GeForce GTX 460
Detected compute capability: 2.1
Detected 7 multiprocessors.
42070000070587 | 9475*2^197534+1
42070000198537 | 3373*2^1046686+1
42070003101727 | 4207*2^1054290+1
42070003511309 | 6057*2^1043547+1
42070006307657 | 1513*2^1771812+1
42070006388603 | 2059*2^1816098+1
42070007177519 | 5437*2^1121592+1
42070007396759 | 7339*2^1803518+1
42070008823897 | 4639*2^952018+1
42070008858187 | 2893*2^317690+1
42070010190569 | 5625*2^1903125+1
42070011430123 | 3821*2^1406279+1
42070012301263 | 1957*2^1185814+1
42070013521999 | 1965*2^404493+1
42070013970587 | 7143*2^1462422+1
42070013989247 | 5037*2^838603+1
42070017332953 | 6237*2^1916994+1
42070018235321 | 1941*2^363948+1
42070019542387 | 8587*2^1703626+1
42070023987581 | 9811*2^318944+1
42070024339237 | 9257*2^1170495+1
42070024532551 | 4311*2^1690093+1
42070024936837 | 5679*2^1726142+1
42070024995961 | 9111*2^1707153+1
42070026021997 | 4039*2^1819590+1
42070027452199 | 1323*2^854008+1
42070029006583 | 5943*2^663870+1
Thread 0 completed
Waiting for threads to exit
Sieve complete: 42070000000000 <= p < 42070030000000
Found 27 factors
count=955289,sum=0x2dbc17167afb6a8d
Elapsed time: 19.08 sec. (0.02 init + 19.06 sieve) at 1582055 p/sec.
Processor time: 2.58 sec. (0.02 init + 2.55 sieve) at 11800837 p/sec.
Average processor utilization: 1.09 (init), 0.13 (sieve)
real 0m19.080s
user 0m0.972s
sys 0m1.608s |
|
|
Michael Goetz Volunteer moderator Project administrator
 Send message
Joined: 21 Jan 10 Posts: 13633 ID: 53948 Credit: 280,904,358 RAC: 40,710
                           
|
|
Windows Vista 32 bit
Core 2 Quad Q6600 @ 2.4 GHz
EVGA GTX 280 factory OC @ 621/1350/1134
All tests were run with BOINC suspended and no appreciable load on the CPU or GPU.
Short test:
C:\Temp\ppsieve-cuda-0-0-15>ppsieve-cuda-x86-windows -p42070e9 -P42070003e6 -k 1
201 -K 9999 -N 2000000
ppsieve version cuda-0.1.5 (testing)
nstart=76, nstep=32, gpu_nstep=32
ppsieve initialized: 1201 <= k <= 9999, 76 <= n <= 2000000
Sieve started: 42070000000000 <= p < 42070003000000
Thread 0 starting
Detected GPU 0: GeForce GTX 280
Detected compute capability: 1.3
Detected 30 multiprocessors.
42070000070587 | 9475*2^197534+1
42070000198537 | 3373*2^1046686+1
Thread 0 completed
Waiting for threads to exit
Sieve complete: 42070000000000 <= p < 42070003000000
Found 2 factors
count=95668,sum=0x37dacb7121ccffe4
Elapsed time: 4.71 sec. (0.05 init + 4.65 sieve) at 676064 p/sec.
Processor time: 0.30 sec. (0.03 init + 0.27 sieve) at 11861675 p/sec.
Average processor utilization: 0.60 (init), 0.06 (sieve)
Long test:
C:\Temp\ppsieve-cuda-0-0-15>ppsieve-cuda-x86-windows -p42070e9 -P42070030e6 -k 1
201 -K 9999 -N 2000000
ppsieve version cuda-0.1.5 (testing)
nstart=76, nstep=32, gpu_nstep=32
ppsieve initialized: 1201 <= k <= 9999, 76 <= n <= 2000000
Sieve started: 42070000000000 <= p < 42070030000000
Thread 0 starting
Detected GPU 0: GeForce GTX 280
Detected compute capability: 1.3
Detected 30 multiprocessors.
42070000070587 | 9475*2^197534+1
42070000198537 | 3373*2^1046686+1
42070003101727 | 4207*2^1054290+1
42070003511309 | 6057*2^1043547+1
42070006307657 | 1513*2^1771812+1
42070006388603 | 2059*2^1816098+1
42070007177519 | 5437*2^1121592+1
42070007396759 | 7339*2^1803518+1
42070008823897 | 4639*2^952018+1
42070008858187 | 2893*2^317690+1
42070010190569 | 5625*2^1903125+1
42070011430123 | 3821*2^1406279+1
42070012301263 | 1957*2^1185814+1
42070013521999 | 1965*2^404493+1
42070013970587 | 7143*2^1462422+1
42070013989247 | 5037*2^838603+1
42070017332953 | 6237*2^1916994+1
42070018235321 | 1941*2^363948+1
42070019542387 | 8587*2^1703626+1
42070023987581 | 9811*2^318944+1
42070024339237 | 9257*2^1170495+1
42070024532551 | 4311*2^1690093+1
42070024936837 | 5679*2^1726142+1
42070024995961 | 9111*2^1707153+1
42070026021997 | 4039*2^1819590+1
42070027452199 | 1323*2^854008+1
42070029006583 | 5943*2^663870+1
Thread 0 completed
Waiting for threads to exit
Sieve complete: 42070000000000 <= p < 42070030000000
Found 27 factors
count=955289,sum=0x2dbc17167afb6a8d
Elapsed time: 37.18 sec. (0.05 init + 37.13 sieve) at 811810 p/sec.
Processor time: 1.70 sec. (0.05 init + 1.65 sieve) at 18230756 p/sec.
Average processor utilization: 0.96 (init), 0.04 (sieve)
Same behavior as before: stellar GPU utilization at 98%, but the GUI was too choppy to really use. If this were production, I'd only let it run when the computer's not in use, whereas normally, with all other GPU or CPU applications, I let them run in the background while I use the computer.
I also tried running the long test with various -m values. Here are the elapsed times:
-m1: 46.15 (GUI *very* choppy)
-m2: 37.20
-m3: 36.99
-m4: 37.34
-m5: 37.06
-m6: 37.27
-m7: 38.41
-m8: 37.81
-m9: 37.28 (slightly less awful GUI response??)
-m10: 36.99
-m11: 37.37
-m12: 37.28
-m13: 39.06 (double-checked; this m value takes 2 seconds longer than the others)
-m14: 37.26
-m15: 37.67
-m16: 37.83
In no case was the GUI ever usable enough that I would consider allowing this to run while I was using the computer.
EDIT: For giggles, I pushed the overclocking up to 668/1452/1134. That chopped about 2 seconds off the run-time.
As far as CPU utilization is concerned, the very low utilization I reported previously is deceptive and incorrect. The ppsieve executable itself is using a negligible amount of CPU time, but some other process called "dwm.exe" is using about 80% of one core. I suspect that's part of the driver system. So I guess the BOINC estimate of 0.75 CPUs is pretty close to accurate.
____________
My lucky number is 75898524288+1 |
|
|
|
|
(...)
As far as CPU utilization is concerned, the very low utilization I reported previously is deceptive and incorrect. The ppsieve executable itself is using a negligible amount of CPU time, but some other process called "dwm.exe" is using about 80% of one core. I suspect that's part of the driver system. So I guess the BOINC estimate of 0.75 CPUs is pretty close to accurate.
Sorry, I can't confirm this on Linux; I only found that the slower the GPU is, the more load it causes on the CPU. DWM seems to be the "Desktop Window Manager". But I have to say that I don't use X on the crunchers, except on my laptops.
Intel Xeon W3520 w/ Nvidia GTX 460
$ top -d 1
top - 21:36:41 up 1:49, 1 user, load average: 0.02, 0.01, 0.03
Tasks: 194 total, 1 running, 193 sleeping, 0 stopped, 0 zombie
Cpu(s): 0.6%us, 0.1%sy, 0.0%ni, 99.3%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 4028708k total, 1561252k used, 2467456k free, 101668k buffers
Swap: 6340600k total, 0k used, 6340600k free, 1195136k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
3457 roadrunn 15 0 125m 22m 15m S 4.0 0.6 0:05.90 ppsieve-cuda-bo
3464 roadrunn 15 0 12740 1152 812 R 1.0 0.0 0:00.02 top
1 root 15 0 10348 628 532 S 0.0 0.0 0:00.87 init
2 root RT -5 0 0 0 S 0.0 0.0 0:00.02 migration/0
3 root 34 19 0 0 0 S 0.0 0.0 0:00.00 ksoftirqd/0
4 root RT -5 0 0 0 S 0.0 0.0 0:00.00 watchdog/0
5 root RT -5 0 0 0 S 0.0 0.0 0:00.00 migration/1
6 root 34 19 0 0 0 S 0.0 0.0 0:01.48 ksoftirqd/1
7 root RT -5 0 0 0 S 0.0 0.0 0:00.00 watchdog/1
8 root RT -5 0 0 0 S 0.0 0.0 0:00.02 migration/2
9 root 34 19 0 0 0 S 0.0 0.0 0:00.00 ksoftirqd/2
10 root RT -5 0 0 0 S 0.0 0.0 0:00.00 watchdog/2
11 root RT -5 0 0 0 S 0.0 0.0 0:00.00 migration/3
12 root 34 19 0 0 0 S 0.0 0.0 0:00.00 ksoftirqd/3
13 root RT -5 0 0 0 S 0.0 0.0 0:00.00 watchdog/3
14 root RT -5 0 0 0 S 0.0 0.0 0:00.00 migration/4
15 root 34 19 0 0 0 S 0.0 0.0 0:00.00 ksoftirqd/4 |
|
|
Ken_g6 Volunteer developer
 Send message
Joined: 4 Jul 06 Posts: 921 ID: 3110 Credit: 218,950,902 RAC: 3,798
                          
|
|
dwm.exe, huh? Have you tried stopping it? It looks like it's what makes Vista (and probably Windows 7 as well) "pretty". If I were you, I'd turn it off. :P
Gerrit, have you tried the latest build on your old GPU with the old drivers? I think it should be faster and/or use less CPU. (Edit: I see you'll try it later.)
____________
|
|
|
pschoefer Volunteer developer Volunteer tester
 Send message
Joined: 20 Sep 05 Posts: 667 ID: 845 Credit: 2,374,701,989 RAC: 15,281
                          
|
As far as CPU utilization is concerned, the very low utilization I reported previously is deceptive and incorrect. The ppsieve executable itself is using a negligible amount of CPU time, but some other process called "dwm.exe" is using about 80% of one core. I suspect that's part of the driver system. So I guess the BOINC estimate of 0.75 CPUs is pretty close to accurate.
dwm.exe is a part of Windows Aero, maybe the high CPU usage occurs only if you're doing anything on the screen while running ppsieve-CUDA.
____________
|
|
|
|
|
dwm.exe, huh? Have you tried stopping it? It looks like it's what makes Vista (and probably Windows 7 as well) "pretty". If I were you, I'd turn it off. :P
Gerrit, have you tried the latest build on your old GPU with the old drivers? I think it should be faster and/or use less CPU.
oops - DWM on crunching systems - bad idea!
Start
Run
services.msc
Find it in the list.
Open its properties.
Change the startup type to Disabled (as opposed to Manual or Automatic).
Manually stop the process.
|
|
|
|
|
dwm.exe, huh? Have you tried stopping it? It looks like it's what makes Vista (and probably Windows 7 as well) "pretty". If I were you, I'd turn it off. :P
Gerrit, have you tried the latest build on your old GPU with the old drivers? I think it should be faster and/or use less CPU. (Edit: I see you'll try it later.)
short version is done
$ ./ppsieve-cuda-boinc-x86_64-linux.new -p42070000000000 -P42070010000000 -k 1201 -K 9999 -N 2000000 -z normal
ppsieve version cuda-0.1.5 (testing)
Compiled Aug 10 2010 with GCC 4.3.3
nstart=76, nstep=32, gpu_nstep=32
ppsieve initialized: 1201 <= k <= 9999, 76 <= n <= 2000000
Sieve started: 42070000000000 <= p < 42070010000000
Thread 0 starting
Detected GPU 0: GeForce 9400 GT
Detected compute capability: 1.1
Detected 2 multiprocessors.
42070000070587 | 9475*2^197534+1
42070000198537 | 3373*2^1046686+1
42070003101727 | 4207*2^1054290+1
42070003511309 | 6057*2^1043547+1
p=42070004194305, 69.89K p/sec, 0.01 CPU cores, 41.9% done. ETA 11 Aug 22:17
42070006307657 | 1513*2^1771812+1
42070006388603 | 2059*2^1816098+1
42070007177519 | 5437*2^1121592+1
42070007396759 | 7339*2^1803518+1
p=42070008126465, 65.53K p/sec, 0.00 CPU cores, 81.3% done. ETA 11 Aug 22:17
42070008823897 | 4639*2^952018+1
42070008858187 | 2893*2^317690+1
Thread 0 completed
Waiting for threads to exit
Sieve complete: 42070000000000 <= p < 42070010000000
Found 10 factors
count=318533,sum=0xb9f8cbeb13d00db3
Elapsed time: 152.43 sec. (0.03 init + 152.40 sieve) at 67084 p/sec.
Processor time: 0.78 sec. (0.03 init + 0.75 sieve) at 13615404 p/sec.
Average processor utilization: 1.11 (init), 0.00 (sieve)
No significant CPU load; the gdm welcome screen is shown. I checked that with my old Dell P1110 21" CRT, but either the cable or the monitor is defective - it is very, very reddish... ^^
Will test the longer range up to 42070030000000 now.
EDITH says: Oh, why has he found 10 factors? |
|
|
Michael Goetz Volunteer moderator Project administrator
 Send message
Joined: 21 Jan 10 Posts: 13633 ID: 53948 Credit: 280,904,358 RAC: 40,710
                           
|
As far as CPU utilization is concerned, the very low utilization I reported previously is deceptive and incorrect. The ppsieve executable itself is using a negligible amount of CPU time, but some other process called "dwm.exe" is using about 80% of one core. I suspect that's part of the driver system. So I guess the BOINC estimate of 0.75 CPUs is pretty close to accurate.
dwm.exe is a part of Windows Aero, maybe the high CPU usage occurs only if you're doing anything on the screen while running ppsieve-CUDA.
Well, this is certainly interesting. I did a bit of research (thanks to everyone here!!!) and dwm is indeed the part that makes Aero do what it does. It's essentially a screen virtualization layer, taking the output of every window and combining them into one display image.
And when ppsieve is running, dwm consumes a lot of CPU time. If Aero is turned off (no need to stop the dwm service), dwm does nothing when ppsieve runs. This does not affect the run time of ppsieve, however. Also, the GUI is just as choppy with Aero turned off as it is with Aero turned on.
Which brings us to a VERY interesting question: dwm should only be doing something when a window makes a change to the screen -- which isn't happening here. Something ppsieve is doing is causing the dwm service to think the screen was updated when, in actuality, it wasn't. The result is almost continuous operation of the dwm service -- and, likely, also the cause of the choppy GUI response.
For comparison, I started up a GPUGRID task. That task isn't as efficient as PrimeGrid's application, because it uses more CPU (about 25% of one core) and the GPU isn't kept as busy.
But GPUGRID doesn't cause DWM to go crazy the way PrimeGrid does.
I suspect that the ppsieve application is doing something wrong, somewhere, which causes the CUDA drivers to trigger a screen update, possibly when each kernel loads or completes. This, in turn, causes both the excessive DWM CPU usage and the poor GUI responsiveness.
____________
My lucky number is 75898524288+1 |
|
|
|
|
|
longer test completed, 27 factors found
$ ./ppsieve-cuda-boinc-x86_64-linux.new -p42070000000000 -P42070030000000 -k 1201 -K 9999 -N 2000000 -z normal
ppsieve version cuda-0.1.5 (testing)
Compiled Aug 10 2010 with GCC 4.3.3
nstart=76, nstep=32, gpu_nstep=32
ppsieve initialized: 1201 <= k <= 9999, 76 <= n <= 2000000
Sieve started: 42070000000000 <= p < 42070030000000
Thread 0 starting
Detected GPU 0: GeForce 9400 GT
Detected compute capability: 1.1
Detected 2 multiprocessors.
42070000070587 | 9475*2^197534+1
42070000198537 | 3373*2^1046686+1
42070003101727 | 4207*2^1054290+1
42070003511309 | 6057*2^1043547+1
p=42070004194305, 69.89K p/sec, 0.01 CPU cores, 14.0% done. ETA 11 Aug 22:28
42070006307657 | 1513*2^1771812+1
42070006388603 | 2059*2^1816098+1
42070007177519 | 5437*2^1121592+1
42070007396759 | 7339*2^1803518+1
p=42070008126465, 65.53K p/sec, 0.00 CPU cores, 27.1% done. ETA 11 Aug 22:28
42070008823897 | 4639*2^952018+1
42070008858187 | 2893*2^317690+1
42070010190569 | 5625*2^1903125+1
42070011430123 | 3821*2^1406279+1
p=42070012058625, 65.53K p/sec, 0.00 CPU cores, 40.2% done. ETA 11 Aug 22:28
42070012301263 | 1957*2^1185814+1
42070013521999 | 1965*2^404493+1
42070013970587 | 7143*2^1462422+1
42070013989247 | 5037*2^838603+1
p=42070015990785, 65.53K p/sec, 0.00 CPU cores, 53.3% done. ETA 11 Aug 22:28
42070017332953 | 6237*2^1916994+1
42070018235321 | 1941*2^363948+1
42070019542387 | 8587*2^1703626+1
p=42070020185089, 66.61K p/sec, 0.00 CPU cores, 67.3% done. ETA 11 Aug 22:28
p=42070024117249, 65.53K p/sec, 0.00 CPU cores, 80.4% done. ETA 11 Aug 22:28
42070023987581 | 9811*2^318944+1
42070024339237 | 9257*2^1170495+1
42070024532551 | 4311*2^1690093+1
42070024936837 | 5679*2^1726142+1
42070024995961 | 9111*2^1707153+1
42070026021997 | 4039*2^1819590+1
42070027452199 | 1323*2^854008+1
p=42070028049409, 65.53K p/sec, 0.00 CPU cores, 93.5% done. ETA 11 Aug 22:28
42070029006583 | 5943*2^663870+1
Thread 0 completed
Waiting for threads to exit
Sieve complete: 42070000000000 <= p < 42070030000000
Found 27 factors
count=955289,sum=0x2dbc17167afb6a8d
Elapsed time: 456.38 sec. (0.03 init + 456.35 sieve) at 66060 p/sec.
Processor time: 1.88 sec. (0.03 init + 1.86 sieve) at 16236469 p/sec.
Average processor utilization: 1.07 (init), 0.00 (sieve)
I think I'll spare myself a complete 1G run... |
|
|
Ken_g6 Volunteer developer
 Send message
Joined: 4 Jul 06 Posts: 921 ID: 3110 Credit: 218,950,902 RAC: 3,798
                          
|
short version is done
$ ./ppsieve-cuda-boinc-x86_64-linux.new -p42070000000000 -P42070010000000 -k 1201 -K 9999 -N 2000000 -z normal
[snip]
EDITH says: Oh, why has he found 10 factors?
Because that was the medium-length version I recommended as a newer "short" version. 10 factors is correct for that range.
Glad it's not driving the CPU crazy anymore. :)
Michael: I wish somebody could point out what my app is doing wrong. I dramatically shortened the runtime of each kernel at one point, to decrease screen choppiness. (Yes, I think it's decreased from where it once was.) The frequent kernel runs might make dwm think the screen is updating a lot; I'm not sure. I'd welcome other suggestions.
____________
|
|
|
Michael Goetz Volunteer moderator Project administrator
 Send message
Joined: 21 Jan 10 Posts: 13633 ID: 53948 Credit: 280,904,358 RAC: 40,710
                           
|
Michael: I wish somebody could point out what my app is doing wrong. I dramatically shortened the runtime of each kernel at one point, to decrease screen choppiness. (Yes, I think it's decreased from where it once was.) The frequent kernel runs might make dwm think the screen is updating a lot; I'm not sure. I'd welcome other suggestions.
I wish I knew what was wrong. My guess is that it's not the frequency of what you're doing, but rather the method by which you're doing something. GPUGRID doesn't show less DWM usage, or less GUI sluggishness; it exhibits none at all. And I don't recall either problem with PrimeGrid's AP26 CUDA application, either. So I don't think it's a matter of tuning the number of kernels, or anything like that.
If I had to make a total wild-assed guess, I'd say maybe it's the memory area you're using, and perhaps that's causing the driver to think updates are occurring. Maybe it's a parameter in an SDK call. Although I've read up on the documentation for CUDA and looked at some sample code, I've never actually written anything for it. So my insight into what specifically is causing the problem is somewhat limited.
____________
My lucky number is 75898524288+1 |
|
|
|
|
|
I logged into my Xeon E5504 ES with the 9400 GT while running the new longer test.
No high Xorg usage (around 20-30% on one core/thread) and the GUI feels very smooth and usable.
Will try the GTX 460 now.
EDITH says: Xorg usage there is about 6%, no high spikes, very smooth and usable. Seems to be the Windows driver...
Furthermore, on my laptop with the NVS 140M: Xorg usage is around 90%, the GUI is slow but usable. It has 2 MPs @ 800 MHz and very old drivers (185.18.14); they seem to be prehistoric...
okay, it is done:
# ./ppsieve-cuda-boinc-x86_64-linux -p42070000000000 -P42070030000000 -k 1201 -K 9999 -N 2000000 -z normal
ppsieve version cuda-0.1.5 (testing)
Compiled Aug 10 2010 with GCC 4.3.3
nstart=76, nstep=32, gpu_nstep=32
ppsieve initialized: 1201 <= k <= 9999, 76 <= n <= 2000000
Sieve started: 42070000000000 <= p < 42070030000000
Thread 0 starting
Detected GPU 0: Quadro NVS 140M
Detected compute capability: 1.1
Detected 2 multiprocessors.
(...)
Thread 0 completed
Waiting for threads to exit
Sieve complete: 42070000000000 <= p < 42070030000000
Found 27 factors
count=955289,sum=0x2dbc17167afb6a8d
Elapsed time: 820.11 sec. (0.02 init + 820.10 sieve) at 36760 p/sec.
Processor time: 3.12 sec. (0.02 init + 3.10 sieve) at 9710513 p/sec.
Average processor utilization: 1.13 (init), 0.00 (sieve) |
|
|
|
|
But I have a new release, 0.1.5, with two changes:
First, I decreased the number of factors returned again! This helps PrimeGrid's validator keep up with all our results.
Ken,
Can I ask HOW you have reduced the factors reported?
Are you re-factoring the factors or something?
Cheers
ShoeLace |
|
|
Ken_g6 Volunteer developer
 Send message
Joined: 4 Jul 06 Posts: 921 ID: 3110 Credit: 218,950,902 RAC: 3,798
                          
|
|
OK, Shoelace. First of all, recall that each factor is a prime that divides a Proth number. But there may have been smaller factors that already factored that number. For instance, let's take these three:
42070002875941 | 4081*2^1494668+1
42070003003673 | 6119*2^113963+1
42070003101727 | 4207*2^1054290+1
Even before getting to those, I removed some. It turns out that there's an easy way to reduce 2^N mod 2^x-1, which makes testing for divisibility by 2^x-1 cheap. So I used that to test for divisibility by 3, 5, and 7. (How'd I do 5? By testing 15.)
Now my earlier change was just to add some larger numbers. ((4081*2^1494668+1) mod 255) mod 17 == 0, for instance, so that number had already been factored and was removed. This leaves:
42070003003673 | 6119*2^113963+1
42070003101727 | 4207*2^1054290+1
But this approach can only be taken so far. There are only so many primes divisible by small powers of 2 minus one. Plus, the larger the power of 2, the fewer K's I can easily use it on. So I decided to take the next logical step and make a full trial division system for each factored number. This system is very similar to the full sieve, except that I set it up so it only works on primes < 32768 (2^15). And so I now test all those primes. Most of them don't divide either of these two numbers, but it turns out that:
1493 | 6119*2^113963+1
So that number also gets removed. That leaves:
42070003101727 | 4207*2^1054290+1
Now I happen to know that 4207*2^1054290+1 has a smaller factor than this one. I know because it was eliminated from the sieve file (the 1.2GB one we mention occasionally) long ago. I don't know what its factor is, because I haven't tested above 32768. Theoretically, I could eliminate all numbers that have been factored before, but doing that with this method would require more work than the original sieve! So I leave
42070003101727 | 4207*2^1054290+1
and factors like it to be sorted out at the server.
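A rough sketch of this small-factor filter (my own reconstruction in Python, not Ken's actual CUDA code; the helper names and the standalone trial-division loop are mine) might look like:

```python
# Sketch of the pre-filter described above (my reconstruction, not the real
# code). A candidate factor p | k*2^n+1 is suppressed when some prime
# q < 2^15 also divides k*2^n+1, because that Proth number was already
# eliminated from the sieve file and need not be reported again.

def divides(q, k, n):
    """True if the small prime q divides the Proth number k*2^n + 1."""
    return (k * pow(2, n, q) + 1) % q == 0

def pow2_mod_mersenne(n, x):
    """2^n mod (2^x - 1) == 2^(n mod x): the cheap reduction mentioned
    above. Testing the modulus 15 = 2^4 - 1 covers both 3 and 5 at once."""
    return 2 ** (n % x)

def primes_below(limit):
    """Simple sieve of Eratosthenes for the trial-division primes."""
    sieve = bytearray([1]) * limit
    sieve[0:2] = b"\x00\x00"
    for i in range(2, int(limit ** 0.5) + 1):
        if sieve[i]:
            sieve[i * i::i] = bytearray(len(range(i * i, limit, i)))
    return [i for i in range(limit) if sieve[i]]

def has_small_factor(k, n, limit=1 << 15):
    """True if some prime below `limit` already divides k*2^n + 1,
    in which case the newly found large factor is not reported."""
    return any(divides(q, k, n) for q in primes_below(limit))
```

For instance, `divides(1493, 6119, 113963)` reproduces the 1493 factor above, and `has_small_factor(6119, 113963)` shows why that candidate gets dropped before reporting.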
____________
|
|
|
|
|
|
Sorry to double-post, but if you are a Mac user, and especially if you have the CUDA 2.3 driver installed, testing of the new app posted here http://www.primegrid.com/forum_thread.php?id=2639&nowrap=true#25559 would be really helpful!
Thanks! |
|
|
Michael Goetz Volunteer moderator Project administrator
 Send message
Joined: 21 Jan 10 Posts: 13633 ID: 53948 Credit: 280,904,358 RAC: 40,710
                           
|
|
I have some followup information on the DWM.EXE and sluggish GUI situation.
This is going to sound weird, and I cannot fully explain it.
The weather got a little cooler, and the air conditioning got turned off. So I turned on ppsieve-cuda (the real BOINC version) and set BOINC to only run it when the computer's not in use, because of the effect the CUDA program has on the GUI.
So ppsieve is happily crunching away when I'm not pounding on the keys, and everything is good. Until I look at the Task Manager. It's showing all the cores crunching BOINC tasks at nearly 100%. No (or negligible) DWM. What happened to all that CPU usage by DWM that I saw in the tests?
Now, one difference between running for real and running the tests on ppsieve was that for the tests, BOINC was suspended and the computer was idle. Sure enough, if I suspend one of the BOINC tasks, thus freeing up a core, DWM starts running full bore on that available core.
So DWM behaves differently with the computer running at 100%. With BOINC running, DWM's processing is suspended -- the BOINC tasks are running instead of DWM. That's very strange, because DWM has a "high" process priority. Even more interesting, ppsieve interferes with the GUI less when BOINC is running than when the computer is idle, most likely because DWM is doing less of whatever it's doing. The GUI is still fairly unresponsive, but not as bad as when the computer is idle.
How's that for bizarre?
____________
My lucky number is 75898524288+1 |
|
|
|
|
I have some followup information on the DWM.EXE and sluggish GUI situation.
which desktop-theme are you running?
try to switch to windows-classic
anyway you simply can turn it off and on command prompt:
Stop Service : net stop uxsms
Start Service : net start uxsms
Disable Service: sc config uxsms start= disabled
Enable Service : sc config uxsms start= auto |
|
|
Michael Goetz Volunteer moderator Project administrator
 Send message
Joined: 21 Jan 10 Posts: 13633 ID: 53948 Credit: 280,904,358 RAC: 40,710
                           
|
I have some followup information on the DWM.EXE and sluggish GUI situation.
which desktop-theme are you running?
try to switch to windows-classic
anyway you simply can turn it off and on command prompt:
Stop Service : net stop uxsms
Start Service : net start uxsms
Disable Service: sc config uxsms start= disabled
Enable Service : sc config uxsms start= auto
I know how to shut down Aero. :)
If you read my earlier post, you'll see that shutting off aero, while eliminating the CPU usage of dwm.exe, does nothing to improve the responsiveness of the GUI. That's far more of a concern than dwm.
____________
My lucky number is 75898524288+1 |
|
|
|
|
If you read my earlier post, you'll see that shutting off aero, while eliminating the CPU usage of dwm.exe, does nothing to improve the responsiveness of the GUI. That's far more of a concern than dwm.
what do you see in Nvidia control panel and under "PhysX Configuration" ?
|
|
|
Michael Goetz Volunteer moderator Project administrator
 Send message
Joined: 21 Jan 10 Posts: 13633 ID: 53948 Credit: 280,904,358 RAC: 40,710
                           
|
If you read my earlier post, you'll see that shutting off aero, while eliminating the CPU usage of dwm.exe, does nothing to improve the responsiveness of the GUI. That's far more of a concern than dwm.
what do you see in Nvidia control panel and under "PhysX Configuration" ?
It's enabled.
I'll see if disabling it changes anything and will update this post shortly...
UPDATE:
Enabling or disabling PhysX has no appreciable effect on CPU utilization or GUI responsiveness.
____________
My lucky number is 75898524288+1 |
|
|
|
|
|
Found the nasty GUI-slowing "bug":
For EVERY other project, the project's executables run at LOWER or LOWEST priority, so that everything else (Windows Aero, Winamp, Media Player, etc.) gets the CPU time it wants. (Yes, even the other GPU apps run at low priority.)
This is the only application running at NORMAL priority, so it shares the same priority with all other apps. And because it is very, very GPU-hungry, everything else gets choppy.
I tested it by manually setting the task's priority to lowest in Task Manager - after that, everything works fine ^.^
Edit: OK, Windows Aero is a little bit "unresponsive" after the priority change, but that only means the app is working well. With pps-sieve at normal priority, it slowed even the other CPU apps down so much that my computer was 10°C cooler -.-
Edit2: OK, OK, after rereading the thread I realized you were discussing another kind of slowdown... But what I found is responsible for slowdowns for anyone using only BOINC.
____________
|
|
|
Ken_g6 Volunteer developer
 Send message
Joined: 4 Jul 06 Posts: 921 ID: 3110 Credit: 218,950,902 RAC: 3,798
                          
|
|
According to the BOINC website, CUDA apps should have normal priority... although now that I read it again, it says normal thread priority. Can you check in Task Manager for both the process and thread priorities? (One is in a submenu when right-clicking, I think.)
Edit: P.S. This wouldn't be the first time that site was wrong; but as I recall, at the usual idle priority one has to let one CPU core sit idle to get full GPU speed.
____________
|
|
|
|
|
According to the BOINC website, CUDA apps should have normal priority... although now that I read it again, it says normal thread priority. Can you check in Task Manager for both the process and thread priorities? (One is in a submenu when right-clicking, I think.)
Edit: P.S. This wouldn't be the first time that site was wrong; but as I recall, at the usual idle priority one has to let one CPU core sit idle to get full GPU speed.
That was the problem with the AP26 Linux app: At the normal "low" priority the runtime was around 60 minutes for a WU (GTX260) when the CPU (quad core) was under full load. After changing the priority to zero the runtime changed to 10 minutes per WU.
____________
|
|
|
Michael Goetz Volunteer moderator Project administrator
 Send message
Joined: 21 Jan 10 Posts: 13633 ID: 53948 Credit: 280,904,358 RAC: 40,710
                           
|
|
My recollection is that, in Windows, normally BOINC CPU tasks run at "Low" priority while GPU tasks run at "Below Normal" priority -- one step above the CPU tasks.
But, lately I've been seeing the CPU tasks running at "Normal", with no effect on performance. It might have something to do with newer versions of the BOINC client.
____________
My lucky number is 75898524288+1 |
|
|
Ken_g6 Volunteer developer
 Send message
Joined: 4 Jul 06 Posts: 921 ID: 3110 Credit: 218,950,902 RAC: 3,798
                          
|
|
Alright, I definitely should have done this earlier, but Windows testers, try this. It may or may not fix the jerkiness issue; but it's also likely to be slower than the current app.
Try it; even if it is slower, it might be the better app to distribute to Windows users by default.
P.S. If this works, Windows has the worst graphics multitasking I can imagine (while still multitasking).
____________
|
|
|
Michael Goetz Volunteer moderator Project administrator
 Send message
Joined: 21 Jan 10 Posts: 13633 ID: 53948 Credit: 280,904,358 RAC: 40,710
                           
|
Alright, I definitely should have done this earlier, but Windows testers, try this. It may or may not fix the jerkiness issue; but it's also likely to be slower than the current app.
Try it; even if it is slower, it might be the better app to distribute to Windows users by default.
P.S. If this works, Windows has the worst graphics multitasking I can imagine (while still multitasking).
No joy.
I ran both the new executable and the 0.0.15 version, and the behaviors were identical. Same GUI choppiness and the runtime was different by only 0.01 seconds.
____________
My lucky number is 75898524288+1 |
|
|
|
|
|
Can't find the thread priority in Task Manager... But it's the base priority that is set to "normal".
It should run at "lower"... Maybe this is also called "below normal"; I am re-translating from German to English. In German, "Niedriger als Normal".
I don't mind the WU taking longer - I only use 3 of 4 cores of my CPU because of overheating, so the GPU task can have part of my 4th core... as it does in Milkyway or GPUGrid, or did on AP26 ^.^
____________
|
|
|
|
|
Can't find the thread priority in Task Manager... But it's the base priority that is set to "normal".
You need to add that column in Task Manager.
|
|
|
Ken_g6 Volunteer developer
 Send message
Joined: 4 Jul 06 Posts: 921 ID: 3110 Credit: 218,950,902 RAC: 3,798
                          
|
|
I just re-checked and I've #ifdef-ed out the section of code that changes priorities. So priorities are as BOINC says they should be.
I notice in Gerrit's testing on Linux that he had a little jerkiness too, on machines with old drivers. Have people with Windows jerkiness tried updating their drivers?
Otherwise, I'll also point you to the BOINC CUDA/CAL FAQ. See the section entitled "My system slows down when I run CUDA or CAL. I can't work like this!" And that doesn't even include the "Use GPU while computer is in use" checkbox, in BOINC Manager -> Advanced -> Preferences..., Processor Usage tab.
____________
|
|
|
|
|
I just re-checked and I've #ifdef-ed out the section of code that changes priorities. So priorities are as BOINC says they should be.
Did you consider making priority a command-line switch?
Much easier to test the effects and run it as one wants... |
|
|
BiBi Volunteer tester Send message
Joined: 6 Mar 10 Posts: 151 ID: 56425 Credit: 34,290,031 RAC: 0
                   
|
|
I use my HTPC to crunch during idle time. I noticed that when BOINC is running the LLR tasks, I can't use my media center because of its low responsiveness. I need to disable the BOINC tasks to get the responsiveness back. It is specifically noticeable when playing music or videos (the reason to have an HTPC). I did not notice this behaviour when running the Sieve applications.
Maybe this adds to the discussion, maybe it doesn't ;)
____________
|
|
|
|
|
I notice in Gerrit's testing on Linux that he had a little jerkiness too, on machines with old drivers. Have people with Windows jerkiness tried updating their drivers?
I have the latest drivers and latest CUDA, and there is also choppiness in Ubuntu 9.10.
____________
|
|
|
Ken_g6 Volunteer developer
 Send message
Joined: 4 Jul 06 Posts: 921 ID: 3110 Credit: 218,950,902 RAC: 3,798
                          
|
I just re-checked and I've #ifdef-ed out the section of code that changes priorities. So priorities are as BOINC says they should be.
Did you consider making priority a command-line switch?
Much easier to test the effects and run it as one wants...
That's a good idea. I have a command-line switch that's unused in BOINC, so I could make it act on the BOINC priority setting instead.
But did you know we've tested idle priority before? All Windows versions before 0.1.3c used idle priority. Here's an example of the result: 33 seconds instead of 7 to run a test. :(
____________
|
|
|
|
|
But did you know we've tested idle priority before? All Windows versions before 0.1.3c used idle priority. Here's an example of the result: 33 seconds instead of 7 to run a test. :(
Yes, I know, but there are six levels of priority one can set via Task Manager:
background, low, below normal, normal, high, and realtime.
In fact, you can set them anywhere from 0 to 31: http://msdn.microsoft.com/en-us/library/ms685100%28VS.85%29.aspx
You might want to check this with Process Explorer: http://technet.microsoft.com/en-us/sysinternals/bb896653.aspx
|
|
|
Ken_g6 Volunteer developer
 Send message
Joined: 4 Jul 06 Posts: 921 ID: 3110 Credit: 218,950,902 RAC: 3,798
                          
|
|
Yes, but there are only two levels of priority I can set with BOINC. Default is normal, or I can set options.normal_thread_priority = 1. BOINC will force the process priority to these values unless I reintroduce threads; but that brings back the suspend problem.
BOINC is very limited on what it can currently do to solve this issue.
____________
|
|
|
|
|
Yes, but there are only two levels of priority I can set with BOINC. Default is normal, or I can set options.normal_thread_priority = 1. BOINC will force the process priority to these values unless I reintroduce threads; but that brings back the suspend problem.
BOINC is very limited on what it can currently do to solve this issue.
I know, rocket science - so you need to do this inside your app on your own.
The functions are GetPriorityClass and SetPriorityClass. Even if BOINC fouled it up, your app knows its own process ID, and there you go... ;)
P.S.: Do not try anything above 15... |
|
|
|
|
|
And BTW: SetProcessAffinityMask is the next option for squeezing out a little bit more. Every time the task scheduler decides to switch to another core, the cache becomes invalid and needs to be reloaded. Probably not much of a problem with this app, but in general.
Since D.A. refuses to implement this in BOINC, it's left to those writing the apps... :( |
|
|
Ken_g6 Volunteer developer
 Send message
Joined: 4 Jul 06 Posts: 921 ID: 3110 Credit: 218,950,902 RAC: 3,798
                          
|
|
I used to do that. When I set the thread priority, at least, BOINC un-did it! I might be able to fiddle with process priority but I'm not sure, and I'm not sure how much good it would do.
On the other hand, I've found a different idea that might work much better: a program called TThrottle.
First of all, I haven't tried TThrottle at all. So download and use at your own risk, etc. But it looks like TThrottle could modify both GPU process priority (with the special-edition) and GPU usage (which I suspect is more likely to improve performance). If you find a priority that neither slows down your system nor slows down WUs then I could try setting that in the app itself.
____________
|
|
|
|
|
I used to do that. When I set the thread priority, at least, BOINC un-did it! I might be able to fiddle with process priority but I'm not sure, and I'm not sure how much good it would do.
You mean they really implemented something to monitor and control the apps?
If they did that without giving crunchers control over it - yikes! :(
On the other hand, I've found a different idea that might work much better: a program called TThrottle.
I'll try and fiddle around with that.
I had written a watchdog to monitor AP26-CUDA tasks and push them to normal priority, but I can't find that thing anymore. |
|
|
Michael Goetz Volunteer moderator Project administrator
 Send message
Joined: 21 Jan 10 Posts: 13633 ID: 53948 Credit: 280,904,358 RAC: 40,710
                           
|
|
FYI -- I'm not going to be able to help with the testing for a little while. I had some bizarre kind of failure today, and somehow managed to corrupt/damage not only the RAID 1 mirrors in the running system, but the old RAID disks I pulled out last month. So I now have 4 disks that won't boot.
Until I figure out what the heck happened, I'm dead in the water as far as crunching goes. If it's not a hardware problem on the MB, I'll probably be back online in a day or two, but if the MB or CPU is bad I need to think pretty hard about what I want to do to replace it.
So, TTFN.
Anyone know if there's a BOINC client for a BlackBerry? (j/k)
Mike
____________
My lucky number is 75898524288+1 |
|
|
|
|
I notice in Gerrit's testing on Linux that he had a little jerkiness too, on machines with old drivers. Have people with Windows jerkiness tried updating their drivers?
I have the latest drivers and latest CUDA, and there is also choppiness in Ubuntu 9.10.
Your host's 8600M GT, with only 4 multiprocessors, qualifies as a slow GPU; it is also a notebook.
On my notebook the GUI was slow too, but not unusable. |
|
|
Ken_g6 Volunteer developer
 Send message
Joined: 4 Jul 06 Posts: 921 ID: 3110 Credit: 218,950,902 RAC: 3,798
                          
|
|
Version 0.1.5a is now at the link from the OP. This fixes the mess I made of the Linux BOINC versions with my Makefile. It also includes better error reporting.
Probably the best way to test this would be in BOINC, with an app_info.xml file. But I don't know how to make one for a CUDA app.
____________
|
|
|
pschoefer Volunteer developer Volunteer tester
 Send message
Joined: 20 Sep 05 Posts: 667 ID: 845 Credit: 2,374,701,989 RAC: 15,281
                          
|
Probably the best way to test this would be in BOINC, with an app_info.xml file. But I don't know how to make one for a CUDA app.
Should be something like this:
<app_info>
    <app>
        <name>pps_sr2sieve</name>
        <user_friendly_name>Proth Prime Search (Sieve)</user_friendly_name>
    </app>
    <file_info>
        <name>xxxxxx</name>
        <executable/>
    </file_info>
    <file_info>
        <name>libcudart.so</name>
    </file_info>
    <app_version>
        <app_name>pps_sr2sieve</app_name>
        <version_num>127</version_num>
        <file_ref>
            <file_name>xxxxxx</file_name>
            <main_program/>
        </file_ref>
        <file_ref>
            <file_name>libcudart.so</file_name>
            <open_name>libcudart.so</open_name>
        </file_ref>
        <plan_class>cuda23</plan_class>
        <coproc>
            <type>CUDA</type>
            <count>1.000000</count>
        </coproc>
    </app_version>
</app_info>
Of course, you have to replace xxxxxx with the name of the binary. ;)
____________
|
|
|
|
|
|
It worked perfectly fine on my 8600M GT.
mmillerick@mmillerick-laptop:~/Desktop/ppsieve-cuda$ ./ppsieve-cuda-boinc-x86_64-linux -p42070e9 -P42070010e6 -k 1201 -K 9999 -N 2000000 -c 60
ppsieve version cuda-0.1.5a (testing)
Compiled Aug 22 2010 with GCC 4.3.3
nstart=76, nstep=32, gpu_nstep=32
ppsieve initialized: 1201 <= k <= 9999, 76 <= n <= 2000000
42070000070587 | 9475*2^197534+1
42070000198537 | 3373*2^1046686+1
42070003101727 | 4207*2^1054290+1
42070003511309 | 6057*2^1043547+1
42070006307657 | 1513*2^1771812+1
42070006388603 | 2059*2^1816098+1
42070007177519 | 5437*2^1121592+1
42070007396759 | 7339*2^1803518+1
42070008823897 | 4639*2^952018+1
42070008858187 | 2893*2^317690+1
Found 10 factors
____________
|
|
|
RytisVolunteer moderator Project administrator
 Send message
Joined: 22 Jun 05 Posts: 2651 ID: 1 Credit: 58,387,426 RAC: 116,228
                     
|
|
Apps have been published to BOINC.
____________
|
|
|
|
|
|
Ooops, wrong thread. Moved. :)
____________
Those who would give up essential Liberty, to purchase a little temporary Safety, deserve neither Liberty nor Safety. (Benjamin Franklin) |
|
|
Michael Goetz Volunteer moderator Project administrator
 Send message
Joined: 21 Jan 10 Posts: 13633 ID: 53948 Credit: 280,904,358 RAC: 40,710
                           
|
FYI -- I'm not going to be able to help with the testing for a little while. I had some bizarre kind of failure today, and somehow managed to corrupt/damage not only the RAID 1 mirrors in the running system, but the old RAID disks I pulled out last month. So I now have 4 disks that won't boot.
Until I figure out what the heck happened, I'm dead in the water as far as crunching goes. If it's not a hardware problem on the MB, I'll probably be back online in a day or two, but if the MB or CPU is bad I need to think pretty hard about what I want to do to replace it.
So, TTFN.
Anyone know if there's a BOINC client for a BlackBerry? (j/k)
Mike
I'm back online again, running Windows 7 in 64 bit mode instead of the 32 bit Vista I was running before.
Is there anything I need to be testing? I did run a production CUDA WU (v1.27), and although the slow GUI problem still existed, at least I wasn't seeing a full CPU core being used by DWM. The CUDA WU had no noticeable effect on CPU utilization, which is excellent. I don't know if this is due to a change in the application, switching from Vista to Windows 7, switching from 32 bits to 64 bits, or (presumably) using a different video driver. Too many things changed when I rebuilt for me to isolate what caused the difference.
Mike
____________
My lucky number is 75898524288+1 |
|
|
Ken_g6 Volunteer developer
 Send message
Joined: 4 Jul 06 Posts: 921 ID: 3110 Credit: 218,950,902 RAC: 3,798
                          
|
Is there anything I need to be testing?
Not right now. I think we're good. :)
I don't see the jerkiness going away any time soon. It just seems to be what happens when one fully utilizes a video card. If BOINC could come up with a standard way to less-than-fully utilize a video card, then it might be possible to make it go away; but they don't think they can do that.
I don't know if this is due to a change in the application, Nope.
switching from Vista to Windows 7, Maybe.
switching from 32 bits to 64 bits, Nope.
or (presumably) using a different video driver. Sounds likely.
Glad it's working for you. :)
____________
|
|
|
Michael Goetz Volunteer moderator Project administrator
 Send message
Joined: 21 Jan 10 Posts: 13633 ID: 53948 Credit: 280,904,358 RAC: 40,710
                           
|
Is there anything I need to be testing?
Not right now. I think we're good. :)
I don't see the jerkiness going away any time soon. It just seems to be what happens when one fully utilizes a video card. If BOINC could come up with a standard way to less-than-fully utilize a video card, then it might be possible to make it go away; but they don't think they can do that.
I guess that's the price you pay for pegging the meters. :)
____________
My lucky number is 75898524288+1 |
|
|
|
|
|
I am trying to get ppsieve-CUDA working for my own Riesel files. However, when I give it the -R option it says invalid option. Has this functionality been removed from the CUDA build? If so, it needs removing from the help (or fixing). |
|
|
Ken_g6 Volunteer developer
 Send message
Joined: 4 Jul 06 Posts: 921 ID: 3110 Credit: 218,950,902 RAC: 3,798
                          
|
|
I believe I accidentally left out the short switch. Try --riesel.
____________
|
|
|
|
|
I believe I accidentally left out the short switch. Try --riesel.
Pardon my confusion, but does this mean the CUDA app works, or will work, for Riesel Sieve?
Cheers!
Alan
____________
|
|
|
Ken_g6 Volunteer developer
 Send message
Joined: 4 Jul 06 Posts: 921 ID: 3110 Credit: 218,950,902 RAC: 3,798
                          
|
|
Technically, yes; practically, no.
The problem is that PPSieve works well on relatively many K's and relatively few N's. I estimate that the K/N ratio must be greater than about 1/3000 for it to be more effective than srsieve.
Now, Riesel Sieve is sieving 64 K's to 50,000,000. That's a ratio of about 1/800,000. So PPSieve would be about 1/250th as fast as srsieve. Very impractical.
____________
|
|
|
|
|
|
Ah, I see. Thanks Ken.
I wonder...would the cullen/woodall sieve benefit then since, if I'm not mistaken, the k/n ratio is 1?
Cheers!
Alan
____________
|
|
|
Ken_g6 Volunteer developer
 Send message
Joined: 4 Jul 06 Posts: 921 ID: 3110 Credit: 218,950,902 RAC: 3,798
                          
|
|
Interesting. It looks like the GCW Sieve app only looks at one N at a time, similar to what PPSieve does. Theoretically, PPSieve might actually be faster than the current GCW sieve, if it were set up to work with these kinds of numbers. The varying kmax (which equals kmin) could lead to a further optimization. Furthermore, in theory, the two sieves' algorithms could be combined, at least on the CPU, to produce an even faster search!
Of course, this is all theoretical, based on an incomplete download of the sieve file and the comments at the top of sieve.c in the GCW Sieve source. Those comments also state that, "The actual implementation below has become a bit more complicated."
But it looks like it's worth investigating.
____________
|
|
|
|
|
I believe I accidentally left out the short switch. Try --riesel.
Looking at the source, there doesn't seem to be either the short or the long switch (I tested the long one as well). I tried adding what was needed based on app.c. It compiled, but gave lots of:
Computation Error: no candidates found for p=9000044821.
Computation Error: no candidates found for p=9000080209.
Computation Error: no candidates found for p=9000127391.
Computation Error: no candidates found for p=9000151297.
Computation Error: no candidates found for p=9000245209.
errors when using the switch I had added.
Any ideas?
A CUDA riesel sieving program would be very helpful for projects like NPLB. |
|
|
|
|
|
Hear, hear! :-) FYI, Gary (the main NPLB admin) is going to be getting a GPU of his own (GTX 460) within a few days...I'll have remote access to it and would be glad to help with any testing/debugging that's needed.
____________
|
|
|
Ken_g6 Volunteer developer
 Send message
Joined: 4 Jul 06 Posts: 921 ID: 3110 Credit: 218,950,902 RAC: 3,798
                          
|
|
OK, I fixed the problems with Riesel sieving. (There were more than I expected!)
I also added code that should produce a small speedup for both Proth and Riesel sieving. It would be good if someone could test this version against an earlier version to make sure this one is faster.
By the way, what kind of sieve file are you planning to use for NPLB, if any? I don't think the NewPGen-file-reading code has ever been tested, so that should get some extra testing if you use it.
____________
|
|
|
|
|
|
We usually use ABCD format for large sieves, then convert to NewPGen format for primality testing once we're done sieving. However, if the NewPGen file reading code needs testing, I'm sure that can be arranged. :-) Besides NewPGen format, what formats does ppsieve accept?
Your mention of "if any" regarding a sieve file reminded me of one other thing I was wondering: how much, if at all, does ppsieve slow down when used without a sieve file as opposed to with one? Normally I would expect the slowdown to be significant since quite a number of pairs are removed at very low p-values, but from what I gather ppsieve uses a very different algorithm than sr*sieve so it may be moot.
Edit: oh, one more thing. Is there a CUDA version of tpsieve as well? I know ppsieve and tpsieve are very closely related, and seeing that the link in the first post of this thread to the ppsieve CUDA source goes to a website entitled "PSieve-CUDA" (not ppsieve specifically) got me wondering whether the CUDA stuff is sufficiently far up the line that both programs can benefit from it. |
|
|
Ken_g6 Volunteer developer
 Send message
Joined: 4 Jul 06 Posts: 921 ID: 3110 Credit: 218,950,902 RAC: 3,798
                          
|
We usually use ABCD format for large sieves, then convert to NewPGen format for primality testing once we're done sieving. However, if the NewPGen file reading code needs testing, I'm sure that can be arranged. :-) Besides NewPGen format, what formats does ppsieve accept?
ABCD is the primary input format - but a specific ABCD:
ABCD 1201*2^$a+1 [116]
I haven't generalized this yet. Do you need me to?
Your mention of "if any" regarding a sieve file reminded me of one other thing I was wondering: how much, if at all, does ppsieve slow down when used without a sieve file as opposed to with one? Normally I would expect the slowdown to be significant since quite a number of pairs are removed at very low p-values, but from what I gather ppsieve uses a very different algorithm than sr*sieve so it may be moot.
Processing each factor takes a measurable amount of CPU time. Off the top of my head I'm thinking 100 factors/s, but that's probably low. You should use a sieve file if you have the memory.
Also note that PPSieve only outputs factors, not sieve files. You need another program to remove the factors from the sieve file.
Edit: oh, one more thing. Is there a CUDA version of tpsieve as well?
It's on my todo list - or maybe that should be todo web now:
PPSieve + TPSieve + more work -> CWPsieve (Cullen/Woodall, not generalized)
PPSieve-CUDA + TPSieve -> TPSieve-CUDA
TPSieve-CUDA + CWPsieve -> CWPsieve-CUDA
____________
|
|
|
|
|
ABCD is the primary input format - but a specific ABCD:
ABCD 1201*2^$a+1 [116]
I haven't generalized this yet. Do you need me to?
That should be fine--the ABCD format we use is whatever the sr*sieve programs produce, which IIRC is the same as what you posted.
Processing each factor takes a measurable amount of CPU time. Off the top of my head I'm thinking 100 factors/s, but that's probably low. You should use a sieve file if you have the memory.
Also note that PPSieve only outputs factors, not sieve files. You need another program to remove the factors from the sieve file.
In that case, then, we'd definitely use a sieve file, since even our largest sieve in the past (k=2000-3400 for n=50K-1M) was still small enough that one could run 4 copies of sr2sieve without impacting available RAM overly much. I'm assuming ppsieve and sr2sieve have similar memory usage on the same sieve file?
Regarding ppsieve only outputting factors, that would be fine as we're already used to that with sr2sieve. |
|
|
|
|
|
Working for me now. It's slightly slower on the GPU than off it, because my GPU is an old model (8600 GTS). The test I did was a bit of a bad one, though: I didn't play to ppsieve's strengths. It took 50 seconds while sr2sieve took 3. I will try to find an NPLB sieve file and test that.
Edit: The NPLB sieve file for k=301-399, n=1M-2M took only ~5/4 of the time of sieving k=3-7, n=1-1M for the same p range. I was very impressed. |
|
|
BiBi Volunteer tester Send message
Joined: 6 Mar 10 Posts: 151 ID: 56425 Credit: 34,290,031 RAC: 0
                   
|
|
I noticed that all my WUs started to fail. Is there any reason for this?
<message>
Incorrect function. (0x1) - exit code 1 (0x1)
</message>
<stderr_txt>
Sieve started: 543542000000000 <= p < 543543000000000
Thread 0 starting
Detected GPU 0: GeForce 8400 GS
Detected compute capability: 1.1
Detected 1 multiprocessors.
Cuda error: cudaStreamCreate: out of memory
12:45:18 (580): called boinc_finish
</stderr_txt>
I put in a new card this afternoon and now it works fine, but I think that is because it has more memory: 1010MB instead of 243MB...
____________
|
|
|
Ken_g6 Volunteer developer
 Send message
Joined: 4 Jul 06 Posts: 921 ID: 3110 Credit: 218,950,902 RAC: 3,798
                          
|
|
I would guess that some other process, like a game or Windows Aero, took enough of your GPU memory (and/or forgot to release enough of it) that PPSieve couldn't work. Did it keep happening after a reboot?
P.S. Henry, glad you like it. :)
P.P.S. Is anyone going to check if my speedup worked? That is, if the current binary runs the command-line test faster than the binary being distributed with BOINC?
____________
|
|
|
BiBi Volunteer tester Send message
Joined: 6 Mar 10 Posts: 151 ID: 56425 Credit: 34,290,031 RAC: 0
                   
|
|
I figured it out: last weekend I updated my drivers to be able to work with the latest CUDA toolkit. I think that caused the problems; before that it was working fine. I do use Aero, but I am not running other apps that use the GPU.
The new card works fine because it has more memory to use :D (and it is faster)
____________
|
|
|
Ken_g6 Volunteer developer
 Send message
Joined: 4 Jul 06 Posts: 921 ID: 3110 Credit: 218,950,902 RAC: 3,798
                          
|
|
v0.1.6, of both PPSieve and TPSieve, is released. Many changes and fixes are included.
- Faster on the GPU than 0.1.5b (though about the same as 0.1.5c)
- Uses less CPU
- A huge memory leak on the GPU should be fixed.
- Input files are more often read correctly.
- Many other bugfixes and tweaks.
Get it at the usual URL, in the first post.
____________
|
|
|
Michael Goetz Volunteer moderator Project administrator
 Send message
Joined: 21 Jan 10 Posts: 13633 ID: 53948 Credit: 280,904,358 RAC: 40,710
                           
|
v0.1.6, of both PPSieve and TPSieve, is released. Many changes and fixes are included.
- Faster on the GPU than 0.1.5b (though about the same as 0.1.5c)
- Uses less CPU
- A huge memory leak on the GPU should be fixed.
- Input files are more often read correctly.
- Many other bugfixes and tweaks.
Get it at the usual URL, in the first post.
0.1.6 runs slightly faster (long test, 35 seconds vs 37 seconds) as compared to 0.1.5. That's on a GTX280, C2Q6600, with the system idle.
CPU usage, which was already very low, dropped by about 50% to 75%.
C:\Temp\ppsieve-cuda-0-1-16>ppsieve-cuda-x86-windows.exe -p42070e9 -P42070003e6 -k 1201 -K 9999 -N 2000000
ppsieve version cuda-0.1.6 (testing)
nstart=76, nstep=32, gpu_nstep=32
ppsieve initialized: 1201 <= k <= 9999, 76 <= n <= 2000000
Sieve started: 42070000000000 <= p < 42070003000000
Thread 0 starting
Detected GPU 0: GeForce GTX 280
Detected compute capability: 1.3
Detected 30 multiprocessors.
42070000070587 | 9475*2^197534+1
42070000198537 | 3373*2^1046686+1
Thread 0 completed
Waiting for threads to exit
Sieve complete: 42070000000000 <= p < 42070003000000
Found 2 factors
count=95668,sum=0x37dacb7121ccffe4
Elapsed time: 4.39 sec. (0.03 init + 4.36 sieve) at 721290 p/sec.
Processor time: 0.20 sec. (0.05 init + 0.16 sieve) at 20164794 p/sec.
Average processor utilization: 1.46 (init), 0.04 (sieve)
C:\Temp\ppsieve-cuda-0-1-16>ppsieve-cuda-x86-windows.exe -p42070e9 -P42070030e6 -k 1201 -K 9999 -N 2000000
ppsieve version cuda-0.1.6 (testing)
nstart=76, nstep=32, gpu_nstep=32
ppsieve initialized: 1201 <= k <= 9999, 76 <= n <= 2000000
Sieve started: 42070000000000 <= p < 42070030000000
Thread 0 starting
Detected GPU 0: GeForce GTX 280
Detected compute capability: 1.3
Detected 30 multiprocessors.
42070000070587 | 9475*2^197534+1
42070000198537 | 3373*2^1046686+1
42070003101727 | 4207*2^1054290+1
42070003511309 | 6057*2^1043547+1
42070006307657 | 1513*2^1771812+1
42070006388603 | 2059*2^1816098+1
42070007177519 | 5437*2^1121592+1
42070007396759 | 7339*2^1803518+1
42070008823897 | 4639*2^952018+1
42070008858187 | 2893*2^317690+1
42070010190569 | 5625*2^1903125+1
42070011430123 | 3821*2^1406279+1
42070012301263 | 1957*2^1185814+1
42070013521999 | 1965*2^404493+1
42070013970587 | 7143*2^1462422+1
42070013989247 | 5037*2^838603+1
42070017332953 | 6237*2^1916994+1
42070018235321 | 1941*2^363948+1
42070019542387 | 8587*2^1703626+1
42070023987581 | 9811*2^318944+1
42070024339237 | 9257*2^1170495+1
42070024532551 | 4311*2^1690093+1
42070024936837 | 5679*2^1726142+1
42070024995961 | 9111*2^1707153+1
42070026021997 | 4039*2^1819590+1
42070027452199 | 1323*2^854008+1
42070029006583 | 5943*2^663870+1
Thread 0 completed
Waiting for threads to exit
Sieve complete: 42070000000000 <= p < 42070030000000
Found 27 factors
count=955289,sum=0x2dbc17167afb6a8d
Elapsed time: 35.48 sec. (0.04 init + 35.44 sieve) at 850540 p/sec.
Processor time: 0.70 sec. (0.05 init + 0.66 sieve) at 46010952 p/sec.
Average processor utilization: 1.26 (init), 0.02 (sieve)
____________
My lucky number is 75898524288+1 |
|
|
John Honorary cruncher
 Send message
Joined: 21 Feb 06 Posts: 2875 ID: 2449 Credit: 2,681,934 RAC: 0
                 
|
0.1.6 runs slightly faster (long test, 35 seconds vs 37 seconds) as compared to 0.1.5. That's on a GTX280, C2Q6600, with the system idle.
CPU usage, which was already very low, dropped by about 50% to 75%.
Thanks Michael!
It would be nice to get a few more users to give this a test. Lennart has already returned positive results as well so we have 2 good reviews. :)
____________
|
|
|
|
|
|
8600M GT:
Short Test:
mmillerick@mmillerick-laptop:~/Desktop/ppsieve-cuda$ ./ppsieve-cuda-x86_64-linux -p42070e9 -P42070003e6 -k 1201 -K 9999 -N 2000000
ppsieve version cuda-0.1.6 (testing)
Compiled Sep 6 2010 with GCC 4.3.3
nstart=76, nstep=32, gpu_nstep=32
ppsieve initialized: 1201 <= k <= 9999, 76 <= n <= 2000000
Sieve started: 42070000000000 <= p < 42070003000000
Thread 0 starting
Detected GPU 0: GeForce 8600M GT
Detected compute capability: 1.1
Detected 4 multiprocessors.
42070000070587 | 9475*2^197534+1
42070000198537 | 3373*2^1046686+1
Thread 0 completed
Waiting for threads to exit
Sieve complete: 42070000000000 <= p < 42070003000000
Found 2 factors
count=95668,sum=0x37dacb7121ccffe4
Elapsed time: 33.79 sec. (0.05 init + 33.75 sieve) at 93219 p/sec.
Processor time: 0.40 sec. (0.05 init + 0.35 sieve) at 8987794 p/sec.
Average processor utilization: 1.05 (init), 0.01 (sieve)
Long Test:
mmillerick@mmillerick-laptop:~/Desktop/ppsieve-cuda$ ./ppsieve-cuda-x86_64-linux -p42070e9 -P42070010e6 -k 1201 -K 9999 -N 2000000
ppsieve version cuda-0.1.6 (testing)
Compiled Sep 6 2010 with GCC 4.3.3
nstart=76, nstep=32, gpu_nstep=32
ppsieve initialized: 1201 <= k <= 9999, 76 <= n <= 2000000
Sieve started: 42070000000000 <= p < 42070010000000
Thread 0 starting
Detected GPU 0: GeForce 8600M GT
Detected compute capability: 1.1
Detected 4 multiprocessors.
42070000070587 | 9475*2^197534+1
42070000198537 | 3373*2^1046686+1
42070003101727 | 4207*2^1054290+1
42070003511309 | 6057*2^1043547+1
p=42070005505025, 91.75K p/sec, 0.01 CPU cores, 55.1% done. ETA 07 Sep 23:14
42070006307657 | 1513*2^1771812+1
42070006388603 | 2059*2^1816098+1
42070007177519 | 5437*2^1121592+1
42070007396759 | 7339*2^1803518+1
42070008823897 | 4639*2^952018+1
42070008858187 | 2893*2^317690+1
Thread 0 completed
Waiting for threads to exit
Sieve complete: 42070000000000 <= p < 42070010000000
Found 10 factors
count=318533,sum=0xb9f8cbeb13d00db3
Elapsed time: 111.18 sec. (0.03 init + 111.15 sieve) at 91981 p/sec.
Processor time: 0.85 sec. (0.03 init + 0.82 sieve) at 12467824 p/sec.
Average processor utilization: 1.13 (init), 0.01 (sieve)
Overall, I really like this version. It was just a few seconds faster than previous versions, although I cannot tell the difference in the amount of CPU used.
____________
pschoefer Volunteer developer Volunteer tester
Joined: 20 Sep 05 Posts: 667 ID: 845 Credit: 2,374,701,989 RAC: 15,281
9800GT (shader clock now at 1674 MHz):
v0.1.5
Elapsed time: 55.48 sec. (0.03 init + 55.45 sieve) at 543660 p/sec.
Processor time: 1.22 sec. (0.03 init + 1.19 sieve) at 25386577 p/sec.
v0.1.6
Elapsed time: 53.01 sec. (0.03 init + 52.98 sieve) at 569065 p/sec.
Processor time: 0.52 sec. (0.03 init + 0.48 sieve) at 62238059 p/sec.
Nice improvement. :)
GTX460:
v0.1.5
Elapsed time: 14.91 sec. (0.03 init + 14.88 sieve) at 2025976 p/sec.
Processor time: 1.37 sec. (0.05 init + 1.33 sieve) at 22734825 p/sec.
v0.1.6
Elapsed time: 14.90 sec. (0.05 init + 14.85 sieve) at 2030069 p/sec.
Processor time: 0.36 sec. (0.06 init + 0.30 sieve) at 101708356 p/sec.
Same speed, although only 93% GPU load, much less processor time. With -m16, I get ~98% GPU load and
Elapsed time: 14.32 sec. (0.04 init + 14.28 sieve) at 2111101 p/sec.
Processor time: 0.81 sec. (0.06 init + 0.75 sieve) at 40259560 p/sec.
____________
Michael Goetz Volunteer moderator Project administrator
Joined: 21 Jan 10 Posts: 13633 ID: 53948 Credit: 280,904,358 RAC: 40,710
Mike M:
mmillerick@mmillerick-laptop:~/Desktop/ppsieve-cuda$ ./ppsieve-cuda-x86_64-linux -p42070e9 -P42070010e6 -k 1201 -K 9999 -N 2000000
Me:
C:\Temp\ppsieve-cuda-0-1-16>ppsieve-cuda-x86-windows.exe -p42070e9 -P42070030e6 -k 1201 -K 9999 -N 2000000
I noticed that Mike M's long test only returned 10 factors as opposed to the 27 my long test returns, and discovered that we're not testing the same range.
What range are we supposed to be testing? Or does it really matter that much at this point?
____________
My lucky number is 75898524288+1
John Honorary cruncher
Joined: 21 Feb 06 Posts: 2875 ID: 2449 Credit: 2,681,934 RAC: 0
What range are we supposed to be testing? Or does it really matter that much at this point?
The following are what we're using:
Short: -p42070e9 -P42070010e6 10 factors
Long: -p42070e9 -P42070030e6 27 factors
It looks like mmillerick used -p42070e9 -P42070003e6 for his short test and returned 2 factors, which is correct for that range.
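For anyone comparing two runs over the same range (e.g. the CPU build vs. the CUDA build, as suggested in the opening post), here's a rough Python sketch for diffing the factor lines from two result logs. The helper names and sample log strings are illustrative, not part of ppsieve:

```python
# Sketch: compare the factor lines from two ppsieve runs (e.g. CPU vs. CUDA).
# Parsing assumes the "p | k*2^n+1" line format shown in this thread.

def parse_factors(text):
    """Return the set of 'p | k*2^n+1' factor lines from ppsieve output."""
    factors = set()
    for line in text.splitlines():
        line = line.strip()
        # A factor line contains '|' and starts with the numeric prime p;
        # progress/summary lines ("Found N factors", "p=...,") have no '|'.
        if "|" in line and line.split("|")[0].strip().isdigit():
            factors.add(line.replace(" ", ""))
    return factors

def compare_runs(out_a, out_b):
    """Return (factors only in run A, factors only in run B)."""
    a, b = parse_factors(out_a), parse_factors(out_b)
    return sorted(a - b), sorted(b - a)

cpu_out = """42070000070587 | 9475*2^197534+1
42070000198537 | 3373*2^1046686+1
Found 2 factors"""
cuda_out = """42070000070587 | 9475*2^197534+1
42070000198537 | 3373*2^1046686+1
Found 2 factors"""
only_cpu, only_cuda = compare_runs(cpu_out, cuda_out)
print(only_cpu, only_cuda)  # → [] [] when the runs agree
```

In practice you'd read the two logs from files instead of the inline strings above.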
____________
I usually don't use the longest test range because it takes forever to do on my card.
For completion's sake:
mmillerick@mmillerick-laptop:~/Desktop/ppsieve-cuda$ ./ppsieve-cuda-x86_64-linux -p42070e9 -P42070030e6 -k 1201 -K 9999 -N 2000000
ppsieve version cuda-0.1.6 (testing)
Compiled Sep 6 2010 with GCC 4.3.3
nstart=76, nstep=32, gpu_nstep=32
ppsieve initialized: 1201 <= k <= 9999, 76 <= n <= 2000000
Sieve started: 42070000000000 <= p < 42070030000000
Thread 0 starting
Detected GPU 0: GeForce 8600M GT
Detected compute capability: 1.1
Detected 4 multiprocessors.
42070000070587 | 9475*2^197534+1
42070000198537 | 3373*2^1046686+1
42070003101727 | 4207*2^1054290+1
42070003511309 | 6057*2^1043547+1
p=42070005505025, 91.75K p/sec, 0.01 CPU cores, 18.4% done. ETA 08 Sep 08:31
42070006307657 | 1513*2^1771812+1
42070006388603 | 2059*2^1816098+1
42070007177519 | 5437*2^1121592+1
42070007396759 | 7339*2^1803518+1
42070008823897 | 4639*2^952018+1
42070008858187 | 2893*2^317690+1
42070010190569 | 5625*2^1903125+1
p=42070011010049, 91.75K p/sec, 0.00 CPU cores, 36.7% done. ETA 08 Sep 08:31
42070011430123 | 3821*2^1406279+1
42070012301263 | 1957*2^1185814+1
42070013521999 | 1965*2^404493+1
42070013970587 | 7143*2^1462422+1
42070013989247 | 5037*2^838603+1
p=42070016252929, 87.38K p/sec, 0.00 CPU cores, 54.2% done. ETA 08 Sep 08:32
42070017332953 | 6237*2^1916994+1
42070018235321 | 1941*2^363948+1
42070019542387 | 8587*2^1703626+1
p=42070021757953, 91.75K p/sec, 0.00 CPU cores, 72.5% done. ETA 08 Sep 08:32
42070023987581 | 9811*2^318944+1
42070024339237 | 9257*2^1170495+1
42070024532551 | 4311*2^1690093+1
42070024936837 | 5679*2^1726142+1
42070024995961 | 9111*2^1707153+1
42070026021997 | 4039*2^1819590+1
p=42070027000833, 84.47K p/sec, 0.00 CPU cores, 90.0% done. ETA 08 Sep 08:32
42070027452199 | 1323*2^854008+1
42070029006583 | 5943*2^663870+1
Thread 0 completed
Waiting for threads to exit
Sieve complete: 42070000000000 <= p < 42070030000000
Found 27 factors
count=955289,sum=0x2dbc17167afb6a8d
Elapsed time: 336.08 sec. (0.02 init + 336.06 sieve) at 89706 p/sec.
Processor time: 1.11 sec. (0.01 init + 1.10 sieve) at 27405964 p/sec.
Average processor utilization: 0.61 (init), 0.00 (sieve)
____________
Ken_g6 Volunteer developer
Joined: 4 Jul 06 Posts: 921 ID: 3110 Credit: 218,950,902 RAC: 3,798
What range are we supposed to be testing? Or does it really matter that much at this point?
It doesn't matter an awful lot. I just wanted a range that I know returns at least 10 factors.
GTX460:
v0.1.5
Elapsed time: 14.91 sec. (0.03 init + 14.88 sieve) at 2025976 p/sec.
Processor time: 1.37 sec. (0.05 init + 1.33 sieve) at 22734825 p/sec.
v0.1.6
Elapsed time: 14.90 sec. (0.05 init + 14.85 sieve) at 2030069 p/sec.
Processor time: 0.36 sec. (0.06 init + 0.30 sieve) at 101708356 p/sec.
Same speed, although only 93% GPU load, much less processor time. With -m16, I get ~98% GPU load and
Elapsed time: 14.32 sec. (0.04 init + 14.28 sieve) at 2111101 p/sec.
Processor time: 0.81 sec. (0.06 init + 0.75 sieve) at 40259560 p/sec.
Aha! -m16 can be arranged. And I'm sure the processor time would average out closer to the middle test on a longer range, even with -m16.
Thanks for testing!
____________
Michael Goetz Volunteer moderator Project administrator
Joined: 21 Jan 10 Posts: 13633 ID: 53948 Credit: 280,904,358 RAC: 40,710
GTX460:
v0.1.6
Elapsed time: 14.90 sec. (0.05 init + 14.85 sieve) at 2030069 p/sec.
Damn, that is a fast GPU. I don't suppose anyone's got a GTX 480 to test with? (Yeah, I know, if you go by GFlop/$, it's a lot more effective to run a pair of 460s than a single 480. Still...)
____________
My lucky number is 75898524288+1
Ken_g6 Volunteer developer
Joined: 4 Jul 06 Posts: 921 ID: 3110 Credit: 218,950,902 RAC: 3,798
...And I've posted 0.1.6a, with the minor change that it now defaults to -m 16 on Fermi-based cards. :)
Edit: Undid last edit; see next post.
____________
Michael Goetz Volunteer moderator Project administrator
Joined: 21 Jan 10 Posts: 13633 ID: 53948 Credit: 280,904,358 RAC: 40,710
...And I've posted 0.1.6a, with the minor change that it now defaults to -m 16 on Fermi-based cards. :)
On my non-Fermi GTX 280, 0.1.6a runs nearly identically to 0.1.6, which is what you would expect.
____________
My lucky number is 75898524288+1
Ken_g6 Volunteer developer
Joined: 4 Jul 06 Posts: 921 ID: 3110 Credit: 218,950,902 RAC: 3,798
Actually, it runs exactly like 0.1.6 on non-Fermi. :)
Also, can I get some testers with Fermis to try some much larger -m values? Like 32, 64, or 127? (Why not 128? It's a cap I placed on the values possible. I'm not sure if it was arbitrary or not.)
____________
Ken_g6 Volunteer developer
Joined: 4 Jul 06 Posts: 921 ID: 3110 Credit: 218,950,902 RAC: 3,798
Based on new results, also try 12, 24, 48, and 96.
____________
pschoefer Volunteer developer Volunteer tester
Joined: 20 Sep 05 Posts: 667 ID: 845 Credit: 2,374,701,989 RAC: 15,281
Also, can I get some testers with Fermis to try some much larger -m values? Like 32, 64, or 127? (Why not 128? It's a cap I placed on the values possible. I'm not sure if it was arbitrary or not.)
default -m16:
Elapsed time: 14.02 sec. (0.03 init + 13.99 sieve) at 2155358 p/sec.
Processor time: 0.80 sec. (0.03 init + 0.76 sieve) at 39437942 p/sec.
-m12:
Elapsed time: 16.41 sec. (0.03 init + 16.38 sieve) at 1840232 p/sec.
Processor time: 0.58 sec. (0.03 init + 0.55 sieve) at 55213176 p/sec.
-m24:
Elapsed time: 14.15 sec. (0.03 init + 14.13 sieve) at 2133997 p/sec.
Processor time: 1.12 sec. (0.03 init + 1.09 sieve) at 27606563 p/sec.
-m32:
Elapsed time: 14.42 sec. (0.03 init + 14.39 sieve) at 2095429 p/sec.
Processor time: 0.78 sec. (0.03 init + 0.75 sieve) at 40259560 p/sec.
-m48:
Elapsed time: 14.81 sec. (0.03 init + 14.78 sieve) at 2039431 p/sec.
Processor time: 0.83 sec. (0.02 init + 0.81 sieve) at 37162690 p/sec.
-m64:
Elapsed time: 14.69 sec. (0.03 init + 14.66 sieve) at 2056545 p/sec.
Processor time: 0.72 sec. (0.03 init + 0.69 sieve) at 43919558 p/sec.
-m96:
Elapsed time: 15.77 sec. (0.03 init + 15.74 sieve) at 1914931 p/sec.
Processor time: 0.75 sec. (0.03 init + 0.72 sieve) at 42010022 p/sec.
-m127:
Elapsed time: 15.70 sec. (0.03 init + 15.67 sieve) at 1923361 p/sec.
Processor time: 1.05 sec. (0.03 init + 1.01 sieve) at 29730159 p/sec.
____________
Ken_g6 Volunteer developer
Joined: 4 Jul 06 Posts: 921 ID: 3110 Credit: 218,950,902 RAC: 3,798
Alright, -m 16 it is. Thanks!
____________
LookAS Volunteer tester
Joined: 19 Apr 08 Posts: 38 ID: 21649 Credit: 349,920,761 RAC: 35,293
GTX460:
v0.1.6
Elapsed time: 14.90 sec. (0.05 init + 14.85 sieve) at 2030069 p/sec.
Damn, that is a fast GPU. I don't suppose anyone's got a GTX 480 to test with? (Yeah, I know, if you go by GFlop/$, it's a lot more effective to run a pair of 460s than a single 480. Still...)
I don't have a GTX 480, but I do have a GTX 470 fairly OC'd to a 760 MHz core clock.
D:\ppsieve-cuda>ppsieve-cuda-x86-windows.exe -p42070e9 -P42070030e6 -k 1201 -K 9999 -N 2000000 -m 16
ppsieve version cuda-0.1.6a (testing)
nstart=76, nstep=32, gpu_nstep=32
ppsieve initialized: 1201 <= k <= 9999, 76 <= n < 2000000
Sieve started: 42070000000000 <= p < 42070030000000
Thread 0 starting
Detected GPU 0: GeForce GTX 470
Detected compute capability: 2.0
Detected 14 multiprocessors.
42070000070587 | 9475*2^197534+1
42070000198537 | 3373*2^1046686+1
42070003101727 | 4207*2^1054290+1
42070003511309 | 6057*2^1043547+1
42070006307657 | 1513*2^1771812+1
42070006388603 | 2059*2^1816098+1
42070007177519 | 5437*2^1121592+1
42070007396759 | 7339*2^1803518+1
42070008823897 | 4639*2^952018+1
42070008858187 | 2893*2^317690+1
42070010190569 | 5625*2^1903125+1
42070011430123 | 3821*2^1406279+1
42070012301263 | 1957*2^1185814+1
42070013521999 | 1965*2^404493+1
42070013970587 | 7143*2^1462422+1
42070013989247 | 5037*2^838603+1
42070017332953 | 6237*2^1916994+1
42070018235321 | 1941*2^363948+1
42070019542387 | 8587*2^1703626+1
42070023987581 | 9811*2^318944+1
42070024339237 | 9257*2^1170495+1
42070024532551 | 4311*2^1690093+1
42070024936837 | 5679*2^1726142+1
42070024995961 | 9111*2^1707153+1
42070026021997 | 4039*2^1819590+1
42070027452199 | 1323*2^854008+1
42070029006583 | 5943*2^663870+1
Thread 0 completed
Waiting for threads to exit
Sieve complete: 42070000000000 <= p < 42070030000000
Found 27 factors
count=955289,sum=0x2dbc17167afb6a8d
Elapsed time: 8.62 sec. (0.02 init + 8.59 sieve) at 3508069 p/sec.
Processor time: 0.47 sec. (0.03 init + 0.44 sieve) at 69016376 p/sec.
Average processor utilization: 1.36 (init), 0.05 (sieve)
____________
Michael Goetz Volunteer moderator Project administrator
Joined: 21 Jan 10 Posts: 13633 ID: 53948 Credit: 280,904,358 RAC: 40,710
Impressive. Four times the speed of my OC'd 280, which is less than 2 years old. That is just crazy fast.
Thanks for the speed report!
____________
My lucky number is 75898524288+1
Michael Goetz Volunteer moderator Project administrator
Joined: 21 Jan 10 Posts: 13633 ID: 53948 Credit: 280,904,358 RAC: 40,710
ok, just got this:
9/14/2010 2:05:17 PM PrimeGrid Requesting new tasks for GPU
9/14/2010 2:05:18 PM PrimeGrid Finished upload of 321_sr2sieve_1950777_2_0
9/14/2010 2:05:18 PM PrimeGrid Scheduler request completed: got 1 new tasks
9/14/2010 2:05:18 PM PrimeGrid Message from server: No work can be sent for the applications you have selected
9/14/2010 2:05:18 PM PrimeGrid Message from server: Your preferences allow work from applications other than those selected
9/14/2010 2:05:18 PM PrimeGrid Message from server: Sending work from other applications
9/14/2010 2:05:20 PM PrimeGrid Started download of primegrid_ppsieve_1.25_windows_x86_64__cuda23.exe
9/14/2010 2:05:23 PM PrimeGrid Finished download of primegrid_ppsieve_1.25_windows_x86_64__cuda23.exe
Two things are cool there: The new version of the CUDA ppsieve app has been pushed into production, and there's now a 64 bit version for Windows. Of course, I doubt the 64 bit version will make a measurable difference compared to the 32 bit version, but it shouldn't hurt, so it's cool regardless of how useful it is.
What's weird, though, is that it's v1.25. All the other CUDA applications (Win-32, Linux-32, Linux-64, Mac-32, Mac-64) are v1.29.
Naming error? Wrong version? Is Windows 64bit so uber and nifty that it doesn't need 1.29?
____________
My lucky number is 75898524288+1
Rytis Volunteer moderator Project administrator
Joined: 22 Jun 05 Posts: 2651 ID: 1 Credit: 58,387,426 RAC: 116,228
It's a 32-bit version renamed so it can be sent to 64-bit hosts, to work around a GPU bug (because apparently, David Anderson sees no bug there, as his machines can receive work just fine :S ). Hence the lower version number; I need a separate version to see it clearly in the stats.
____________
btw.: PPS-Sieve for linux-x64 is still @ 1.20 - any reason for that?
Ken_g6 Volunteer developer
Joined: 4 Jul 06 Posts: 921 ID: 3110 Credit: 218,950,902 RAC: 3,798
That's off-topic for this thread, but yes there is. For some reason, when I applied the thread-removing fix for Windows, it caused the Linux apps to slow to a crawl after just a few percent of work. So that's something I'm working on now, reinstating threads for Linux.
____________
That's off-topic for this thread, but yes there is.
yup - i know, just came to mind...
Michael Goetz Volunteer moderator Project administrator
Joined: 21 Jan 10 Posts: 13633 ID: 53948 Credit: 280,904,358 RAC: 40,710
Civilization 5 came out this week, and it's a pretty good refresh to that franchise.
Civ 5 and CUDA -- at least ppsieve CUDA -- do not play well together.
Civ 5 uses ALL the video memory on the card. As a result, any ppsieve CUDA app that attempts to run while CIV 5 is running will crash and burn. Not only that, but every CUDA task queued up tried to run and crashed immediately.
I thought CUDA apps checked for free memory before trying to run? I'm running the latest stuff. If you want the details, my machines aren't hidden. There are plenty of tasks with compute errors, but I'm not sure if the logs have anything useful in them.
The CUDA app can be in the middle of executing as long as it's suspended. But if it runs, it dies. After Civ 5 exits, the CUDA app can run again.
____________
My lucky number is 75898524288+1
I have a GTX 470, drivers 258.96.
I don't receive any work. Message:
("No work available for the applications you have selected. Please check your project preferences on the web site.")
____________
Polish National Team
Ken_g6 Volunteer developer
Joined: 4 Jul 06 Posts: 921 ID: 3110 Credit: 218,950,902 RAC: 3,798
That's a server problem, not an application problem.
You might be able to get around it with an app_info.xml file similar to the one for the ATI app. But, then, you appear to have had problems with the ATI app, too.
____________
Is this server problem fixable? :)
And this app_info.xml, how should it look?
____________
Polish National Team
This app_info.xml, how should it look?
I use this one for OpenCL sieving on my ATI card.
I'm doing the CPU work for PG in a Linux VM* so I had no need to modify the file.
*Setting up my box this way (and using two profiles for GPU and CPU work) helped me to avoid some of the BOINC oddities like the runtime-prediction mess.
____________
Thanks for the file. I tried it on ATI some time ago, but all WUs crashed (device not found). I'll give it a second chance. :)
I added an app_info.xml file to the primegrid folder...
<app_info>
  <app>
    <name>pps_sr2sieve</name>
    <user_friendly_name>Proth Prime Search (Sieve)</user_friendly_name>
  </app>
  <file_info>
    <name>ppsieve-cuda-x86-windows.exe</name>
    <executable/>
  </file_info>
  <app_version>
    <app_name>pps_sr2sieve</app_name>
    <version_num>127</version_num>
    <plan_class>cuda23</plan_class>
    <avg_ncpus>0.05</avg_ncpus>
    <max_ncpus>1</max_ncpus>
    <flops>1.0e11</flops>
    <coproc>
      <type>CUDA</type>
      <count>1</count>
    </coproc>
    <cmdline>-m 16</cmdline>
    <file_ref>
      <file_name>ppsieve-cuda-x86-windows.exe</file_name>
      <main_program/>
    </file_ref>
  </app_version>
</app_info>
... and I receive work, but all the tasks crashed.
____________
Polish National Team
Ken_g6 Volunteer developer
Joined: 4 Jul 06 Posts: 921 ID: 3110 Credit: 218,950,902 RAC: 3,798
I have no knowledge of or connection to the BOINC server. I just make fast client apps.
Speaking of which, I think I've made a significantly faster one. (I'm thinking 30-80%!) So please try V0.2.0 Alpha.
I need this version tested on three ranges:
- the default range,
- -p20070e9 -P20070010e6, which should produce the following factors:
20070000475957 | 4995*2^1822738+1
20070001146497 | 4977*2^626298+1
20070001163929 | 3765*2^461308+1
20070001302811 | 7669*2^725426+1
20070001425977 | 5821*2^1775248+1
20070002245151 | 1221*2^646983+1
20070002606341 | 4809*2^497683+1
20070004816819 | 6699*2^1215561+1
20070005914001 | 9847*2^1634140+1
20070006187837 | 9923*2^287853+1
20070006875981 | 1645*2^965954+1
20070007170259 | 3889*2^49730+1
20070008329039 | 9065*2^832569+1
Found 13 factors
- and -p249871e9 -P249872e9, which should produce the following factors:
249871003789289 | 6295*2^266404+1
249871003804313 | 1897*2^1790254+1
249871004642153 | 4393*2^720262+1
249871008061891 | 3105*2^1189485+1
249871008485251 | 4787*2^131683+1
249871009106447 | 8785*2^1246050+1
249871009510013 | 2771*2^1272671+1
249871010360639 | 1743*2^1337710+1
249871017008411 | 7771*2^828544+1
249871018975427 | 5057*2^799271+1
249871020273263 | 5591*2^103221+1
Found 11 factors
The reason for this is that there are two major algorithms used here. One is very similar to the one you've been using, except that I finally realized the numbers don't have to stay in Montgomery form for the Montgomery math to work on them! The other is a special algorithm for nstep==32 that does two steps for the price of 1 and a half. I need the second range, nstep==31, to compare to, so I know when to go back to the regular algorithm. And the third range should be reduced to nstep==32; I want to make sure that works.
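The Montgomery observation can be illustrated in a few lines: if only the constant multiplier is converted to Montgomery form, a Montgomery multiply returns an ordinary residue, so the running value never needs converting in or out. A rough Python sketch of the general identity (the parameter choices and function names are illustrative, not ppsieve's actual kernel code):

```python
# Sketch: Montgomery multiplication (REDC) where only the *constant* operand
# b is kept in Montgomery form (b_m = b*R mod p). Then
#   montmul(a, b_m) = a * b_m * R^-1 = a*b (mod p),
# i.e. the running value a stays in ordinary form throughout.

def montgomery_params(p, bits=64):
    """R = 2^bits and -p^-1 mod R for an odd modulus p (needs Python 3.8+)."""
    R = 1 << bits
    p_inv = pow(-p, -1, R)
    return R, p_inv

def montmul(a, b, p, R, p_inv):
    """Compute a*b*R^-1 mod p via Montgomery reduction."""
    T = a * b
    m = (T * p_inv) % R        # make T + m*p divisible by R
    t = (T + m * p) >> 64      # exact division by R = 2^64
    return t - p if t >= p else t

p = 42070000070587             # an odd prime from the test output in this thread
R, p_inv = montgomery_params(p)
a, b = 123456789, 987654321
b_m = (b * R) % p              # only the constant goes into Montgomery form
assert montmul(a, b_m, p, R, p_inv) == (a * b) % p
print("ordinary-form operand survives Montgomery math")
```

The single conditional subtraction at the end is enough because both inputs are below p, so the pre-reduction result is below 2p.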
By the way, ATI people, this will be coming to your code soon. It might even be able to speed up the CPU code, with some work.
____________
Michael Goetz Volunteer moderator Project administrator
Joined: 21 Jan 10 Posts: 13633 ID: 53948 Credit: 280,904,358 RAC: 40,710
Ok, I tested this first on the ranges we were testing before.
In a word, wow.
Old version: 37 seconds.
New version: 20 seconds.
Yowsa. I compared the output of the two and they produced the same results. That's a lot of improvement.
Now, for the specific test ranges you asked for:
1) Default. Sorry, not sure what you meant.
2) -p20070e9 -P20070010e6
21:09:48.30>ppsieve-cuda-boinc-x86-windows.exe -p20070e9 -P20070010e6 -k 1201 -K 9999 -N 2000000
ppsieve version cuda-0.2.0-alpha (testing)
nstart=74, nstep=31
ppsieve initialized: 1201 <= k <= 9999, 74 <= n < 2000000
Didn't change nstep from 31
20070000475957 | 4995*2^1822738+1
20070001146497 | 4977*2^626298+1
20070001163929 | 3765*2^461308+1
20070001302811 | 7669*2^725426+1
20070001425977 | 5821*2^1775248+1
20070002245151 | 1221*2^646983+1
20070002606341 | 4809*2^497683+1
20070004816819 | 6699*2^1215561+1
20070005914001 | 9847*2^1634140+1
20070006187837 | 9923*2^287853+1
20070006875981 | 1645*2^965954+1
20070007170259 | 3889*2^49730+1
20070008329039 | 9065*2^832569+1
Found 13 factors
21:09:59.64>
3) -p249871e9 -P249872e9
This one produced results nothing at all like yours:
21:09:59.64>ppsieve-cuda-boinc-x86-windows.exe -p249871e9 -P249872e9 -k 1201 -K 9999 -N 2000000
ppsieve version cuda-0.2.0-alpha (testing)
nstart=80, nstep=35
ppsieve initialized: 1201 <= k <= 9999, 80 <= n < 2000000
nstep changed to 32
249871003789289 | 6295*2^266404+1
249871009510013 | 2771*2^1272671+1
249871010360639 | 1743*2^1337710+1
249871027030549 | 8865*2^1534637+1
249871030776329 | 7815*2^1679937+1
249871032591751 | 2335*2^23512+1
249871038523049 | 7527*2^204096+1
249871049497963 | 6497*2^505399+1
249871066947839 | 8497*2^1221770+1
249871068167599 | 7311*2^450531+1
249871089712009 | 9281*2^1650023+1
249871091913587 | 2139*2^1290902+1
249871099624639 | 8381*2^350375+1
249871100827559 | 3885*2^890478+1
249871112392799 | 5433*2^86569+1
249871116444139 | 8167*2^537138+1
249871142109437 | 4571*2^235835+1
249871142846929 | 9865*2^1386722+1
249871145253061 | 7723*2^1616488+1
249871147007767 | 5427*2^1762047+1
249871191146399 | 1653*2^864278+1
249871203485779 | 7205*2^1483273+1
249871221249239 | 5273*2^796261+1
249871231197649 | 7115*2^1004763+1
249871234562441 | 3735*2^903719+1
249871240610737 | 9001*2^145588+1
249871255618841 | 8713*2^997864+1
249871286485397 | 1531*2^532792+1
249871296376463 | 1757*2^618675+1
249871298874097 | 6555*2^1071185+1
249871350749513 | 2067*2^1231878+1
249871351530373 | 7585*2^457912+1
249871356234763 | 8735*2^366023+1
249871357743289 | 9585*2^171002+1
249871390343693 | 5475*2^245223+1
249871393292999 | 3175*2^1217896+1
249871395741229 | 7827*2^670627+1
249871411673603 | 8675*2^1024157+1
249871422351119 | 2801*2^121123+1
249871426127417 | 1755*2^1248472+1
249871428879211 | 4007*2^1544859+1
249871432461143 | 6699*2^1416817+1
249871460981543 | 9837*2^1594991+1
249871467976957 | 3215*2^631839+1
249871468978793 | 6615*2^792218+1
249871473171109 | 6459*2^349090+1
249871479774359 | 8841*2^1188943+1
249871482340243 | 5919*2^1288106+1
249871485732901 | 2527*2^1986876+1
249871489377617 | 8787*2^1867082+1
249871500270103 | 5319*2^1686314+1
249871520126999 | 4461*2^1110529+1
249871523924669 | 5439*2^861802+1
249871524519619 | 4389*2^1674521+1
249871539157733 | 2991*2^155913+1
249871545891953 | 7993*2^97334+1
249871551853519 | 7461*2^636515+1
249871557290543 | 5415*2^775154+1
249871580676449 | 1945*2^178870+1
249871585564739 | 8163*2^1587554+1
249871598866061 | 2211*2^1478807+1
249871616165299 | 5007*2^137778+1
249871620144283 | 5301*2^297187+1
249871620594919 | 2995*2^1498336+1
249871622479667 | 7185*2^1792115+1
249871630939459 | 1775*2^87997+1
249871632993599 | 4487*2^800319+1
249871634644991 | 2227*2^586308+1
249871638628597 | 6415*2^430920+1
249871642894591 | 4545*2^1189338+1
249871649868299 | 6099*2^808945+1
249871653154573 | 5265*2^1571050+1
249871670943277 | 8393*2^866693+1
249871673398199 | 2095*2^673020+1
249871693098233 | 1827*2^960880+1
249871708905271 | 2193*2^1597924+1
249871712374597 | 7837*2^1377548+1
249871725055831 | 4979*2^933113+1
249871725321403 | 4585*2^24264+1
249871732548391 | 9867*2^406843+1
249871734477727 | 9757*2^1424650+1
249871735521757 | 9333*2^1259374+1
249871736890429 | 6811*2^1949992+1
249871745586703 | 5015*2^438593+1
249871751242439 | 8955*2^1496577+1
249871798529369 | 8421*2^1804145+1
249871803462733 | 5691*2^1134880+1
249871811763227 | 4965*2^872947+1
249871813158707 | 7879*2^384706+1
249871832201279 | 7635*2^318117+1
249871834337353 | 9435*2^1558391+1
249871837075541 | 8031*2^1527481+1
249871868971631 | 3219*2^314393+1
249871873118431 | 5473*2^1249052+1
249871878762767 | 7941*2^1256085+1
249871879880633 | 7271*2^582745+1
249871884477217 | 9853*2^651700+1
249871893408937 | 4917*2^356987+1
249871897589609 | 3317*2^1069159+1
249871900319813 | 8835*2^1995246+1
249871902518837 | 7281*2^1735167+1
249871907749279 | 2931*2^91627+1
249871912754773 | 9987*2^364542+1
249871929613001 | 2631*2^1634125+1
249871933840099 | 1963*2^896934+1
249871947714349 | 1305*2^1601273+1
249871963983179 | 9993*2^1764376+1
249871970029541 | 3153*2^236909+1
249871977317737 | 4791*2^1426980+1
249871980442069 | 7789*2^417126+1
249871993906949 | 4867*2^1128468+1
249871994269553 | 7293*2^801434+1
Found 112 factors
21:18:56.56>
____________
My lucky number is 75898524288+1
John Honorary cruncher
Joined: 21 Feb 06 Posts: 2875 ID: 2449 Credit: 2,681,934 RAC: 0
Now, for the specific test ranges you asked for:
1) Default. Sorry, not sure what you meant.
3) -p249871e9 -P249872e9 This one produced results nothing at all like yours:
Default is what we've been using for testing: -p42070e9 -P42070030e6 27 factors
As for 3, the 112 factors have been confirmed by Lennart.
____________
Michael Goetz Volunteer moderator Project administrator
Joined: 21 Jan 10 Posts: 13633 ID: 53948 Credit: 280,904,358 RAC: 40,710
Default is what we've been using for testing: -p42070e9 -P42070030e6 27 factors
Alright, that's the one I tested first against the old version, where the time went from 37 seconds to 20 seconds. Here's the output:
21:09:27.64>ppsieve-cuda-boinc-x86-windows.exe -p42070e9 -P42070030e6 -k 1201 -K 9999 -N 2000000
ppsieve version cuda-0.2.0-alpha (testing)
nstart=76, nstep=32
ppsieve initialized: 1201 <= k <= 9999, 76 <= n < 2000000
42070003101727 | 4207*2^1054290+1
42070003511309 | 6057*2^1043547+1
42070006307657 | 1513*2^1771812+1
42070006388603 | 2059*2^1816098+1
42070007177519 | 5437*2^1121592+1
42070007396759 | 7339*2^1803518+1
42070008823897 | 4639*2^952018+1
42070008858187 | 2893*2^317690+1
42070010190569 | 5625*2^1903125+1
42070011430123 | 3821*2^1406279+1
42070012301263 | 1957*2^1185814+1
42070013521999 | 1965*2^404493+1
42070013970587 | 7143*2^1462422+1
42070013989247 | 5037*2^838603+1
42070017332953 | 6237*2^1916994+1
42070018235321 | 1941*2^363948+1
42070019542387 | 8587*2^1703626+1
42070023987581 | 9811*2^318944+1
42070024339237 | 9257*2^1170495+1
42070024532551 | 4311*2^1690093+1
42070024936837 | 5679*2^1726142+1
42070024995961 | 9111*2^1707153+1
42070026021997 | 4039*2^1819590+1
42070027452199 | 1323*2^854008+1
42070029006583 | 5943*2^663870+1
Found 27 factors
21:09:46.27>
This is on a GTX 280.
____________
My lucky number is 75898524288+1
Ken_g6 Volunteer developer
Joined: 4 Jul 06 Posts: 921 ID: 3110 Credit: 218,950,902 RAC: 3,798
3) -p249871e9 -P249872e9 This one produced results nothing at all like yours:
Whoops! I plucked that range out of an earlier post in this thread. I should have checked it before using it.
It's probably simpler to limit the range to -p249871e9 -P2498711e8, which returns the first 13 factors. The factors you found appear to match what I found with my CPU.
If anyone's now calculating how much this will speed up their WUs, be aware that the relative speedup falls off as nstep increases. The current nstep is about 37, so that's about a 60% speedup.
Edit: Now, can I get some Fermi benchmarks to compare to? And please report the speed that is written to stderr.txt.
____________
pschoefer Volunteer developer Volunteer tester
Joined: 20 Sep 05 Posts: 667 ID: 845 Credit: 2,374,701,989 RAC: 15,281
Edit: Now, can I get some Fermi benchmarks to compare to? And please report the speed that is written to stderr.txt.
GTX 460 again:
Sieve complete: 42070000000000 <= p < 42070030000000
count=955289,sum=0x2dbc17167afb6a8d
Elapsed time: 8.22 sec. (0.02 init + 8.19 sieve) at 3679789 p/sec.
Processor time: 0.56 sec. (0.03 init + 0.53 sieve) at 56837084 p/sec.
Average processor utilization: 1.36 (init), 0.06 (sieve)
27 factors found.
Sieve complete: 20070000000000 <= p < 20070010000000
count=326136,sum=0x5ad678173464405c
Elapsed time: 4.29 sec. (0.02 init + 4.27 sieve) at 2392471 p/sec.
Processor time: 0.27 sec. (0.02 init + 0.25 sieve) at 40959836 p/sec.
Average processor utilization: 0.74 (init), 0.06 (sieve)
13 factors found.
Sieve complete: 249871000000000 <= p < 249871100000000
count=3016866,sum=0xdd752eb120eb924a
Elapsed time: 24.69 sec. (0.06 init + 24.64 sieve) at 4064511 p/sec.
Processor time: 1.58 sec. (0.06 init + 1.51 sieve) at 66176544 p/sec.
Average processor utilization: 1.11 (init), 0.06 (sieve)
13 factors found.
Michael Goetz Volunteer moderator Project administrator
Joined: 21 Jan 10 Posts: 13633 ID: 53948 Credit: 280,904,358 RAC: 40,710
And please report the speed that is written to stderr.txt.
Ah, that's where that's hiding. OK, here are the timing numbers for the results I posted before.
on the GTX 280:
Sieve complete: 42070000000000 <= p < 42070030000000
count=955289,sum=0x2dbc17167afb6a8d
Elapsed time: 16.58 sec. (0.04 init + 16.54 sieve) at 1727252 p/sec.
Processor time: 0.72 sec. (0.06 init + 0.66 sieve) at 43610381 p/sec.
Average processor utilization: 1.56 (init), 0.04 (sieve)
Sieve complete: 20070000000000 <= p < 20070010000000
count=326136,sum=0x5ad678173464405c
Elapsed time: 9.29 sec. (0.04 init + 9.25 sieve) at 1105076 p/sec.
Processor time: 0.41 sec. (0.05 init + 0.36 sieve) at 28493754 p/sec.
Average processor utilization: 1.14 (init), 0.04 (sieve)
Sieve complete: 249871000000000 <= p < 249872000000000
count=30166916,sum=0xa0b7dde9a581c7d4
Elapsed time: 534.88 sec. (0.09 init + 534.79 sieve) at 1870042 p/sec.
Processor time: 13.67 sec. (0.11 init + 13.56 sieve) at 73771277 p/sec.
Average processor utilization: 1.21 (init), 0.03 (sieve)
3) -p249871e9 -P249872e9 This one produced results nothing at all like yours:
Whoops! I plucked that range out of an earlier post in this thread. I should have checked it before using it.
It's not just that my test ran longer; the factors didn't match yours. Although the first number was the same, the second number was different.
____________
My lucky number is 75898524288+1
samuel7 Volunteer tester
Joined: 1 May 09 Posts: 89 ID: 39425 Credit: 257,425,010 RAC: 0
Timings for my 9800 GT on 64-bit Ubuntu:
Sieve complete: 42070000000000 <= p < 42070030000000
count=955289,sum=0x2dbc17167afb6a8d
Elapsed time: 28.82 sec. (0.02 init + 28.81 sieve) at 1046462 p/sec.
Processor time: 0.41 sec. (0.02 init + 0.39 sieve) at 77298872 p/sec.
Average processor utilization: 1.29 (init), 0.01 (sieve)
Found 27 factors
That's more than double the speed of 0.1.5a, which is the last version I had.
Sieve complete: 20070000000000 <= p < 20070010000000
count=326136,sum=0x5ad678173464405c
Elapsed time: 14.78 sec. (0.01 init + 14.77 sieve) at 692284 p/sec.
Processor time: 0.24 sec. (0.01 init + 0.23 sieve) at 44450504 p/sec.
Average processor utilization: 0.95 (init), 0.02 (sieve)
Found 13 factors
Sieve complete: 249871000000000 <= p < 249871100000000
count=3016866,sum=0xdd752eb120eb924a
Elapsed time: 89.33 sec. (0.04 init + 89.29 sieve) at 1121483 p/sec.
Processor time: 1.26 sec. (0.04 init + 1.22 sieve) at 82081154 p/sec.
Average processor utilization: 1.06 (init), 0.01 (sieve)
Found 13 factors
____________
Ken_g6 Volunteer developer
Joined: 4 Jul 06 Posts: 921 ID: 3110 Credit: 218,950,902 RAC: 3,798
Whoops! I plucked that range out of an earlier post in this thread. I should have checked it before using it.
It's not just that my test ran longer; the factors didn't match yours. Although the first number was the same, the second number was different.
I really should have checked what was in that post, instead of copy/pasting it. Those results were probably from a malfunctioning version or something.
Here are the correct results for the shorter range:
249871003789289 | 6295*2^266404+1
249871009510013 | 2771*2^1272671+1
249871010360639 | 1743*2^1337710+1
249871027030549 | 8865*2^1534637+1
249871030776329 | 7815*2^1679937+1
249871032591751 | 2335*2^23512+1
249871038523049 | 7527*2^204096+1
249871049497963 | 6497*2^505399+1
249871066947839 | 8497*2^1221770+1
249871068167599 | 7311*2^450531+1
249871089712009 | 9281*2^1650023+1
249871091913587 | 2139*2^1290902+1
249871099624639 | 8381*2^350375+1
They appear to match your results, and everyone else's.
Now, I'm just wondering why this higher range is consistently faster than the regular one. The cause of the speed difference is probably either the number of factors found or that the short range breaks at an awkward spot.
Would someone mind running -p42070e9 -P420701e8, as compared to the other ranges? Don't post the factors found in this one; just let me know how many there are.
P.S. Pschoefer, nice Fermi result! Now, to make it even faster, if I can...
Thanks, all!
____________
|
|
|
LookAS Volunteer tester Send message
Joined: 19 Apr 08 Posts: 38 ID: 21649 Credit: 349,920,761 RAC: 35,293
                      
|
|
-p42070e9 -P42070030e6
27 factors
17:35:14 (3040): Can't set up shared mem: -1. Will run in standalone mode.
Sieve started: 42070000000000 <= p < 42070030000000
Thread 0 starting
Detected GPU 0: GeForce GTX 470
Detected compute capability: 2.0
Detected 14 multiprocessors.
Thread 0 completed
Sieve complete: 42070000000000 <= p < 42070030000000
count=955289,sum=0x2dbc17167afb6a8d
Elapsed time: 4.52 sec. (0.03 init + 4.49 sieve) at 6716765 p/sec.
Processor time: 0.47 sec. (0.03 init + 0.44 sieve) at 69016376 p/sec.
Average processor utilization: 1.16 (init), 0.10 (sieve)
17:35:19 (3040): called boinc_finish
-p42070e9 -P420701e8
68 factors
17:35:29 (5048): Can't set up shared mem: -1. Will run in standalone mode.
Sieve started: 42070000000000 <= p < 42070100000000
Thread 0 starting
Detected GPU 0: GeForce GTX 470
Detected compute capability: 2.0
Detected 14 multiprocessors.
Thread 0 completed
Sieve complete: 42070000000000 <= p < 42070100000000
count=3185940,sum=0x4413a5b6a515d4c0
Elapsed time: 14.55 sec. (0.03 init + 14.52 sieve) at 6895756 p/sec.
Processor time: 1.34 sec. (0.03 init + 1.31 sieve) at 76418190 p/sec.
Average processor utilization: 1.20 (init), 0.09 (sieve)
17:35:44 (5048): called boinc_finish
____________
|
|
|
Michael Goetz Volunteer moderator Project administrator
 Send message
Joined: 21 Jan 10 Posts: 13633 ID: 53948 Credit: 280,904,358 RAC: 40,710
                           
|
Would someone mind running -p42070e9 -P420701e8, as compared to the other ranges? Don't post the factors found in this one; just let me know how many there are.
42070e9 to 42070030e6:
13:22:48 (4640): Can't set up shared mem: -1. Will run in standalone mode.
Sieve started: 42070000000000 <= p < 42070030000000
Thread 0 starting
Detected GPU 0: GeForce GTX 280
Detected compute capability: 1.3
Detected 30 multiprocessors.
Thread 0 completed
Sieve complete: 42070000000000 <= p < 42070030000000
count=955289,sum=0x2dbc17167afb6a8d
Elapsed time: 17.48 sec. (0.04 init + 17.44 sieve) at 1728522 p/sec.
Processor time: 0.66 sec. (0.03 init + 0.62 sieve) at 48311485 p/sec.
Average processor utilization: 0.89 (init), 0.04 (sieve)
13:23:05 (4640): called boinc_finish 27 factors
42070e9 to 42070100e6
13:23:07 (656): Can't set up shared mem: -1. Will run in standalone mode.
Sieve started: 42070000000000 <= p < 42070100000000
Thread 0 starting
Detected GPU 0: GeForce GTX 280
Detected compute capability: 1.3
Detected 30 multiprocessors.
Thread 0 completed
Sieve complete: 42070000000000 <= p < 42070100000000
count=3185940,sum=0x4413a5b6a515d4c0
Elapsed time: 57.17 sec. (0.06 init + 57.11 sieve) at 1753472 p/sec.
Processor time: 2.01 sec. (0.05 init + 1.97 sieve) at 50945460 p/sec.
Average processor utilization: 0.82 (init), 0.03 (sieve)
13:24:04 (656): called boinc_finish 68 factors
20070e9 to 20070010e6
13:24:06 (7792): Can't set up shared mem: -1. Will run in standalone mode.
Sieve started: 20070000000000 <= p < 20070010000000
Thread 0 starting
Detected GPU 0: GeForce GTX 280
Detected compute capability: 1.3
Detected 30 multiprocessors.
Thread 0 completed
Sieve complete: 20070000000000 <= p < 20070010000000
count=326136,sum=0x5ad678173464405c
Elapsed time: 9.30 sec. (0.04 init + 9.27 sieve) at 1103166 p/sec.
Processor time: 0.44 sec. (0.05 init + 0.39 sieve) at 26214266 p/sec.
Average processor utilization: 1.34 (init), 0.04 (sieve)
13:24:16 (7792): called boinc_finish 13 factors
249871e9 to 2498711e8
13:24:18 (8144): Can't set up shared mem: -1. Will run in standalone mode.
Sieve started: 249871000000000 <= p < 249871100000000
Thread 0 starting
Detected GPU 0: GeForce GTX 280
Detected compute capability: 1.3
Detected 30 multiprocessors.
Thread 0 completed
Sieve complete: 249871000000000 <= p < 249871100000000
count=3016866,sum=0xdd752eb120eb924a
Elapsed time: 53.97 sec. (0.09 init + 53.88 sieve) at 1858598 p/sec.
Processor time: 1.56 sec. (0.11 init + 1.45 sieve) at 69022827 p/sec.
Average processor utilization: 1.24 (init), 0.03 (sieve)
13:25:12 (8144): called boinc_finish 13 factors
____________
My lucky number is 75898^524288+1 |
|
|
|
|
|
Ken, could you please post non-BOINC-version binaries of 0.2.0 alpha? We'd like to use this at the NPLB project for a team sieve we just started, and while the BOINC version does work, it is nice to have the ETA figures printed.
Thanks,
Max :-) |
|
|
Ken_g6 Volunteer developer
 Send message
Joined: 4 Jul 06 Posts: 921 ID: 3110 Credit: 218,950,902 RAC: 3,798
                          
|
|
How about v0.2.0 non-alpha binaries? They've performed so well that I decided to leave off the "alpha" or "beta" and to put them in the main download location. But they've only had beta-quality testing, so more testing is needed.
This version should be faster for Fermi-based cards, and should also run the -p20070e9 -P20070010e6 test faster on all cards.
Let me know how these work for you.
____________
|
|
|
|
|
|
OS: Windows Vista SP-2 - Driver Version 258.96
CPU: Core 2 Quad 9550 @ 3.4 GHz
GPU: NVIDIA GTX 260-192 @ 667 MHz
.\ppsieve-cuda-x86-windows.exe -p42070e9 -P42070030e6 -k 1201 -K9999 -N2000000 -c 60
ppsieve version cuda-0.2.0 (testing)
nstart=76, nstep=32
ppsieve initialized: 1201 <= k <= 9999, 76 <= n < 2000000
Sieve started: 42070000000000 <= p < 42070030000000
Thread 0 starting
Detected GPU 0: GeForce GTX 260
Detected compute capability: 1.3
Detected 24 multiprocessors.
42070000070587 | 9475*2^197534+1
42070000198537 | 3373*2^1046686+1
42070003101727 | 4207*2^1054290+1
42070003511309 | 6057*2^1043547+1
42070006307657 | 1513*2^1771812+1
42070006388603 | 2059*2^1816098+1
42070007177519 | 5437*2^1121592+1
42070007396759 | 7339*2^1803518+1
42070008823897 | 4639*2^952018+1
42070008858187 | 2893*2^317690+1
42070010190569 | 5625*2^1903125+1
42070011430123 | 3821*2^1406279+1
42070012301263 | 1957*2^1185814+1
42070013521999 | 1965*2^404493+1
42070013970587 | 7143*2^1462422+1
42070013989247 | 5037*2^838603+1
42070017332953 | 6237*2^1916994+1
42070018235321 | 1941*2^363948+1
42070019542387 | 8587*2^1703626+1
42070023987581 | 9811*2^318944+1
42070024339237 | 9257*2^1170495+1
42070024532551 | 4311*2^1690093+1
42070024936837 | 5679*2^1726142+1
42070024995961 | 9111*2^1707153+1
42070026021997 | 4039*2^1819590+1
42070027452199 | 1323*2^854008+1
42070029006583 | 5943*2^663870+1
Thread 0 completed
Waiting for threads to exit
Sieve complete: 42070000000000 <= p < 42070030000000
Found 27 factors
count=955289,sum=0x2dbc17167afb6a8d
Elapsed time: 19.61 sec. (0.03 init + 19.58 sieve) at 1539582 p/sec.
Processor time: 0.39 sec. (0.02 init + 0.37 sieve) at 80519228 p/sec.
Average processor utilization: 0.49 (init), 0.02 (sieve)
____________
|
|
|
|
|
|
Range: 42070e9 to 42070100e6
Thread 0 completed
Waiting for threads to exit
Sieve complete: 42070000000000 <= p < 42070100000000
Found 68 factors
count=3185940,sum=0x4413a5b6a515d4c0
Elapsed time: 65.33 sec. (0.02 init + 65.31 sieve) at 1533405 p/sec.
Processor time: 1.11 sec. (0.02 init + 1.09 sieve) at 91701800 p/sec.
Average processor utilization: 0.68 (init), 0.02 (sieve)
____________
|
|
|
|
|
|
Range: 20070e9 to 20070010e6
Thread 0 completed
Waiting for threads to exit
Sieve complete: 20070000000000 <= p < 20070010000000
Found 13 factors
count=326136,sum=0x5ad678173464405c
Elapsed time: 10.49 sec. (0.02 init + 10.47 sieve) at 976281 p/sec.
Processor time: 0.30 sec. (0.03 init + 0.27 sieve) at 38550443 p/sec.
Average processor utilization: 1.64 (init), 0.03 (sieve)
____________
|
|
|
|
|
|
Range: 249871e9 to 2498711e8
Thread 0 completed
Waiting for threads to exit
Sieve complete: 249871000000000 <= p < 249871100000000
Found 13 factors
count=3016866,sum=0xdd752eb120eb924a
Elapsed time: 61.72 sec. (0.06 init + 61.66 sieve) at 1623920 p/sec.
Processor time: 0.84 sec. (0.05 init + 0.80 sieve) at 125865232 p/sec.
Average processor utilization: 0.81 (init), 0.01 (sieve)
____________
|
|
|
Ken_g6 Volunteer developer
 Send message
Joined: 4 Jul 06 Posts: 921 ID: 3110 Credit: 218,950,902 RAC: 3,798
                          
|
|
Thanks, Ralf! Very informative. :)
If I could get you to do one more thing, could you run this test:
Range: 20070e9 to 20070010e6
with the alpha client? I need to see how they compare speed-wise. Thanks!
Now, if someone can do the same tests with a Fermi-based card, that should be about all I need for now.
____________
|
|
|
|
|
|
Here it is:
Screen dump:
.\ppsieve-cuda-boinc-x86-windows.exe -p 20070e9 -P 20070010e6 -k 1201 -K 9999 -N 2000000 -c 60
ppsieve version cuda-0.2.0-alpha (testing)
nstart=74, nstep=31
ppsieve initialized: 1201 <= k <= 9999, 74 <= n < 2000000
Didn't change nstep from 31
20070000475957 | 4995*2^1822738+1
20070001146497 | 4977*2^626298+1
20070001163929 | 3765*2^461308+1
20070001302811 | 7669*2^725426+1
20070001425977 | 5821*2^1775248+1
20070002245151 | 1221*2^646983+1
20070002606341 | 4809*2^497683+1
20070004816819 | 6699*2^1215561+1
20070005914001 | 9847*2^1634140+1
20070006187837 | 9923*2^287853+1
20070006875981 | 1645*2^965954+1
20070007170259 | 3889*2^49730+1
20070008329039 | 9065*2^832569+1
Found 13 factors
stderr.txt:
17:02:19 (2972): Can't open init data file - running in standalone mode
Sieve started: 20070000000000 <= p < 20070010000000
Thread 0 starting
Detected GPU 0: GeForce GTX 260
Detected compute capability: 1.3
Detected 24 multiprocessors.
Thread 0 completed
Sieve complete: 20070000000000 <= p < 20070010000000
count=326136,sum=0x5ad678173464405c
Elapsed time: 10.45 sec. (0.02 init + 10.44 sieve) at 979611 p/sec.
Processor time: 0.17 sec. (0.03 init + 0.14 sieve) at 72817259 p/sec.
Average processor utilization: 2.00 (init), 0.01 (sieve)
17:02:29 (2972): called boinc_finish
____________
|
|
|
LookAS Volunteer tester Send message
Joined: 19 Apr 08 Posts: 38 ID: 21649 Credit: 349,920,761 RAC: 35,293
                      
|
|
OS: Windows7U x64, Driver Version 260.63beta
CPU: Xeon X5650 @ 3.5 GHz
GPU: NVIDIA GTX 470 @ 760 MHz
pps-cuda-0.2.0-alpha
.\ppsieve-cuda-x86-windows.exe -p42070e9 -P42070030e6 -k 1201 -K9999 -N2000000 -c 60
Sieve started: 42070000000000 <= p < 42070030000000
Thread 0 starting
Detected GPU 0: GeForce GTX 470
Detected compute capability: 2.0
Detected 14 multiprocessors.
Thread 0 completed
Sieve complete: 42070000000000 <= p < 42070030000000
count=955289,sum=0x2dbc17167afb6a8d
Elapsed time: 4.54 sec. (0.05 init + 4.48 sieve) at 6725754 p/sec.
Processor time: 0.45 sec. (0.02 init + 0.44 sieve) at 69016534 p/sec.
Average processor utilization: 0.29 (init), 0.10 (sieve)
17:25:42 (1568): called boinc_finish
Range: 42070e9 to 42070100e6
Thread 0 completed
Found 68 factors
Sieve complete: 42070000000000 <= p < 42070100000000
count=3185940,sum=0x4413a5b6a515d4c0
Elapsed time: 14.59 sec. (0.03 init + 14.56 sieve) at 6876813 p/sec.
Processor time: 1.37 sec. (0.03 init + 1.34 sieve) at 74641034 p/sec.
Average processor utilization: 1.20 (init), 0.09 (sieve)
Range: 20070e9 to 20070010e6
Thread 0 completed
Found 13 factors
Sieve complete: 20070000000000 <= p < 20070010000000
count=326136,sum=0x5ad678173464405c
Elapsed time: 2.68 sec. (0.02 init + 2.67 sieve) at 3836033 p/sec.
Processor time: 0.33 sec. (0.02 init + 0.31 sieve) at 32767790 p/sec.
Average processor utilization: 0.82 (init), 0.12 (sieve)
Range: 249871e9 to 2498711e8
Thread 0 completed
Found 13 factors
Sieve complete: 249871000000000 <= p < 249871100000000
count=3016866,sum=0xdd752eb120eb924a
Elapsed time: 13.76 sec. (0.06 init + 13.70 sieve) at 7307399 p/sec.
Processor time: 1.11 sec. (0.06 init + 1.05 sieve) at 95807824 p/sec.
Average processor utilization: 1.08 (init), 0.08 (sieve)
____________
|
|
|
Ken_g6 Volunteer developer
 Send message
Joined: 4 Jul 06 Posts: 921 ID: 3110 Credit: 218,950,902 RAC: 3,798
                          
|
|
OK, that's the alpha, and all that information is useful. But how does it compare to the current non-alpha?
____________
|
|
|
LookAS Volunteer tester Send message
Joined: 19 Apr 08 Posts: 38 ID: 21649 Credit: 349,920,761 RAC: 35,293
                      
|
|
my bad.
OS: Windows7U x64, Driver Version 260.63beta
CPU: Xeon X5650 @ 3.5 GHz
GPU: NVIDIA GTX 470 @ 760 MHz
pps-cuda-0.2.0
.\ppsieve-cuda-x86-windows.exe -p42070e9 -P42070030e6 -k 1201 -K9999 -N2000000 -c 60
ppsieve version cuda-0.2.0 (testing)
nstart=76, nstep=32
ppsieve initialized: 1201 <= k <= 9999, 76 <= n < 2000000
Sieve started: 42070000000000 <= p < 42070030000000
Thread 0 starting
Detected GPU 0: GeForce GTX 470
Detected compute capability: 2.0
Detected 14 multiprocessors.
Thread 0 completed
Waiting for threads to exit
Sieve complete: 42070000000000 <= p < 42070030000000
Found 27 factors
count=955289,sum=0x2dbc17167afb6a8d
Elapsed time: 3.82 sec. (0.02 init + 3.80 sieve) at 7937029 p/sec.
Processor time: 0.44 sec. (0.03 init + 0.41 sieve) at 74325472 p/sec.
Average processor utilization: 1.36 (init), 0.11 (sieve)
.\ppsieve-cuda-x86-windows.exe -p42070e9 -P42070100e6 -k 1201 -K9999 -N2000000 -c 60
ppsieve version cuda-0.2.0 (testing)
nstart=76, nstep=32
ppsieve initialized: 1201 <= k <= 9999, 76 <= n < 2000000
Sieve started: 42070000000000 <= p < 42070100000000
Thread 0 starting
Detected GPU 0: GeForce GTX 470
Detected compute capability: 2.0
Detected 14 multiprocessors.
Thread 0 completed
Waiting for threads to exit
Sieve complete: 42070000000000 <= p < 42070100000000
Found 68 factors
count=3185940,sum=0x4413a5b6a515d4c0
Elapsed time: 12.21 sec. (0.02 init + 12.18 sieve) at 8219773 p/sec.
Processor time: 0.98 sec. (0.02 init + 0.97 sieve) at 103534312 p/sec.
Average processor utilization: 0.68 (init), 0.08 (sieve)
.\ppsieve-cuda-x86-windows.exe -p20070e9 -P20070010e6 -k 1201 -K9999 -N2000000 -c 60
ppsieve version cuda-0.2.0 (testing)
nstart=74, nstep=31
ppsieve initialized: 1201 <= k <= 9999, 74 <= n < 2000000
Sieve started: 20070000000000 <= p < 20070010000000
Thread 0 starting
Detected GPU 0: GeForce GTX 470
Detected compute capability: 2.0
Detected 14 multiprocessors.
nstep changed to 22
Thread 0 completed
Waiting for threads to exit
Sieve complete: 20070000000000 <= p < 20070010000000
Found 13 factors
count=326136,sum=0x5ad678173464405c
Elapsed time: 2.10 sec. (0.02 init + 2.08 sieve) at 4905486 p/sec.
Processor time: 0.30 sec. (0.02 init + 0.28 sieve) at 36408759 p/sec.
Average processor utilization: 0.97 (init), 0.13 (sieve)
.\ppsieve-cuda-x86-windows.exe -p249871e9 -P2498711e8 -k 1201 -K9999 -N2000000 -c 60
ppsieve version cuda-0.2.0 (testing)
nstart=80, nstep=35
ppsieve initialized: 1201 <= k <= 9999, 80 <= n < 2000000
Sieve started: 249871000000000 <= p < 249871100000000
Thread 0 starting
Detected GPU 0: GeForce GTX 470
Detected compute capability: 2.0
Detected 14 multiprocessors.
nstep changed to 32
Thread 0 completed
Waiting for threads to exit
Sieve complete: 249871000000000 <= p < 249871100000000
Found 13 factors
count=3016866,sum=0xdd752eb120eb924a
Elapsed time: 11.57 sec. (0.06 init + 11.52 sieve) at 8692880 p/sec.
Processor time: 0.81 sec. (0.05 init + 0.76 sieve) at 131002555 p/sec.
Average processor utilization: 0.85 (init), 0.07 (sieve)
____________
|
|
|
|
|
|
A few alpha/non-alpha tests with a Fermi based card...
OS: Windows Vista SP-2 - Driver Version 258.96
CPU: Core 2 Quad 9550 @ 3.4 GHz
GPU: NVIDIA GTX 460 @ 725 MHz
\ppsieve-cuda-x86-windows.exe -p42070e9 -P42070030e6 -k 1201 -K9999 -N2000000 -c 60
cuda-0.2.0-alpha (testing):
Elapsed time: 8.72 sec. (0.03 init + 8.69 sieve) at 3467513 p/sec.
Processor time: 0.98 sec. (0.03 init + 0.95 sieve) at 31679666 p/sec.
Average processor utilization: 1.16 (init), 0.11 (sieve)
cuda-0.2.0 (testing):
Elapsed time: 7.11 sec. (0.02 init + 7.09 sieve) at 4253783 p/sec.
Processor time: 0.50 sec. (0.02 init + 0.48 sieve) at 62337413 p/sec.
Average processor utilization: 0.68 (init), 0.07 (sieve)
Range: 42070e9 to 42070100e6
cuda-0.2.0-alpha (testing):
Elapsed time: 28.70 sec. (0.03 init + 28.67 sieve) at 3492937 p/sec.
Processor time: 2.50 sec. (0.02 init + 2.48 sieve) at 40371860 p/sec.
Average processor utilization: 0.60 (init), 0.09 (sieve)
cuda-0.2.0 (testing):
Elapsed time: 23.40 sec. (0.02 init + 23.38 sieve) at 4283655 p/sec.
Processor time: 1.29 sec. (0.03 init + 1.26 sieve) at 79248476 p/sec.
Average processor utilization: 1.42 (init), 0.05 (sieve)
____________
|
|
|
Ken_g6 Volunteer developer
 Send message
Joined: 4 Jul 06 Posts: 921 ID: 3110 Credit: 218,950,902 RAC: 3,798
                          
|
|
I find it strange that, even compensating for clock speed and core count, the GTX 470 is about 35% faster than the GTX 460. I wonder why?
____________
|
|
|
|
|
I find it strange that, even compensating for clock speed and core count, the GTX 470 is about 35% faster than the GTX 460. I wonder why?
Memory bandwidth?
There are 2 versions of the GTX 460 around: one with 768 MB and a 192-bit interface, and the other with 1 or 2 GB and a 256-bit interface! |
|
|
LookAS Volunteer tester Send message
Joined: 19 Apr 08 Posts: 38 ID: 21649 Credit: 349,920,761 RAC: 35,293
                      
|
|
AFAIK the GTX 470 (and 465, 480) are "true" Fermi chips, made for GPGPU in the first place (L2 cache and other features inside the GPU). The GTX 460 is a redesign for getting the price lower (you don't need that much L2 cache and so on for gaming).
____________
|
|
|
Ken_g6 Volunteer developer
 Send message
Joined: 4 Jul 06 Posts: 921 ID: 3110 Credit: 218,950,902 RAC: 3,798
                          
|
|
Anyone having graphics slowdowns with GTX 460/470 and my app? If not, I could try doubling the work-per-kernel and see what happens.
I keep thinking it might have to do with 48 vs. 32 threads per multiprocessor; but the latest CUDA Occupancy Calculator doesn't show anything like that.
____________
|
|
|
|
|
I find it strange that, even compensating for clock speed and core count, the GTX 470 is about 35% faster than the GTX 460. I wonder why?
Memory bandwidth?
There are 2 versions of the GTX 460 around: one with 768 MB and a 192-bit interface, and the other with 1 or 2 GB and a 256-bit interface!
The card I used for testing is a 1GB GTX 460 (256 bit) running at the default factory settings (Manufacturer OC'ed if you want to call 50 MHz OC'ed...).
____________
|
|
|
|
|
Anyone having graphics slowdowns with GTX 460/470 and my app? If not, I could try doubling the work-per-kernel and see what happens.
I keep thinking it might have to do with 48 vs. 32 threads per multiprocessor; but the latest CUDA Occupancy Calculator doesn't show anything like that.
No. The only problem I have with the card is that I can get no regular WUs, but that's slightly off-topic here...
____________
|
|
|
LookAS Volunteer tester Send message
Joined: 19 Apr 08 Posts: 38 ID: 21649 Credit: 349,920,761 RAC: 35,293
                      
|
Anyone having graphics slowdowns with GTX 460/470 and my app? If not, I could try doubling the work-per-kernel and see what happens.
I keep thinking it might have to do with 48 vs. 32 threads per multiprocessor; but the latest CUDA Occupancy Calculator doesn't show anything like that.
No. The only problem I have with the card is that I can get no regular WUs, but that's slightly off-topic here...
same here.. no graphics slowdowns (...and no regular wu :( )
____________
|
|
|
|
|
afaik GTX 470 (and 465, 480) are "true" fermi made for gpgpu in the first place (L2 cache and other stuff inside gpu). GTX 460 is remake for getting the price lower (you dont need that L2 cache and so on for gaming that much)
The GTX 460 uses a redesigned chip (GF104 instead of the GF100), and if you think the DP capabilities of the GF100 in consumer cards are suboptimal, you should think twice about buying a GF104-based Fermi...
____________
|
|
|
|
|
Anyone having graphics slowdowns with GTX 460/470 and my app? If not, I could try doubling the work-per-kernel and see what happens.
I keep thinking it might have to do with 48 vs. 32 threads per multiprocessor; but the latest CUDA Occupancy Calculator doesn't show anything like that.
On a GTX 460 (running Linux) I get no GUI slowdown at all. I was under the impression that this was actually more normal than I'd heard, since other CUDA programs (such as msft's CUDA-MacLucasFFTW LL testing app over at GIMPS) don't produce any GUI slowdown either. But if this means there's still room for improvement, then by all means, go for it! :-) |
|
|
Ken_g6 Volunteer developer
 Send message
Joined: 4 Jul 06 Posts: 921 ID: 3110 Credit: 218,950,902 RAC: 3,798
                          
|
|
I think I've found the issue. From this article:
The ability to extract ILP from a warp will result in GF104’s compute abilities performing like a 384 CUDA core part some of the time, and like a 256 CUDA core part at other times.
The only ways I see to improve on this are to either (1) compile with the latest CUDA API, and/or (2) to use vectors like I did with OpenCL. Neither is particularly appealing; but someone might try (1) with an app_info.xml file. (If they can get work, that is.)
____________
|
|
|
Ken_g6 Volunteer developer
 Send message
Joined: 4 Jul 06 Posts: 921 ID: 3110 Credit: 218,950,902 RAC: 3,798
                          
|
|
PPSieve-CUDA v0.2.1 is out. It has lots of little tweaks, though most are to the related TPSieve. It might be a little faster, particularly on GTX 460s; but you probably won't notice it much. It might use a little less CPU on Windows, and on Linux 32-bit specifically, though you probably won't notice that either.
Give it a once-over on the 20T and 42T ranges, on Fermi and non-Fermi; then I think it will be good enough for BOINC.
Edit: A test with 32-bit Linux would be a good idea.
____________
|
|
|
|
|
PPSieve-CUDA v0.2.1 is out.
it really would be helpful to post the link...
|
|
|
|
|
PPSieve-CUDA v0.2.1 is out.
it really would be helpful to post the link...
Link in the opening post ;)
____________
|
|
|
Ken_g6 Volunteer developer
 Send message
Joined: 4 Jul 06 Posts: 921 ID: 3110 Credit: 218,950,902 RAC: 3,798
                          
|
PPSieve-CUDA v0.2.1 is out.
it really would be helpful to post the link...
Go to the first post in the thread. It's there.
____________
|
|
|
|
|
PPSieve-CUDA v0.2.1 is out.
it really would be helpful to post the link...
Go to the first post in the thread. It's there.
Yup, I know. But I know someone who managed to search for a long time... |
|
|
|
|
|
Comparing cuda-0.2.0 (testing) with cuda-0.2.1 (testing) on a GTX 460
OS: Windows Vista SP-2 - Driver Version 258.96
CPU: Core 2 Quad 9550 @ 3.4 GHz
GPU: NVIDIA GTX 460 @ 725 MHz
\ppsieve-cuda-boinc-x86-windows.exe -p42070e9 -P42070030e6 -k 1201 -K9999 -N2000000 -c 60
cuda-0.2.0 (testing):
Elapsed time: 7.11 sec. (0.02 init + 7.09 sieve) at 4253783 p/sec.
Processor time: 0.50 sec. (0.02 init + 0.48 sieve) at 62337413 p/sec.
Average processor utilization: 0.68 (init), 0.07 (sieve)
cuda-0.2.1 (testing):
Elapsed time: 6.93 sec. (0.03 init + 6.90 sieve) at 4367801 p/sec.
Processor time: 0.31 sec. (0.03 init + 0.28 sieve) at 107358779 p/sec.
Average processor utilization: 1.16 (init), 0.04 (sieve)
Range: 42070e9 to 42070100e6
cuda-0.2.0 (testing):
Elapsed time: 23.40 sec. (0.02 init + 23.38 sieve) at 4283655 p/sec.
Processor time: 1.29 sec. (0.03 init + 1.26 sieve) at 79248476 p/sec.
Average processor utilization: 1.42 (init), 0.05 (sieve)
cuda-0.2.1 (testing):
Elapsed time: 22.65 sec. (0.03 init + 22.62 sieve) at 4427208 p/sec.
Processor time: 0.87 sec. (0.05 init + 0.83 sieve) at 121115629 p/sec.
Average processor utilization: 1.61 (init), 0.04 (sieve)
---
Factors, count, sums, etc. are identical.
____________
|
|
|
KPX Send message
Joined: 8 Jan 07 Posts: 20 ID: 4756 Credit: 92,931,253 RAC: 22,439
                   
|
PPSieve-CUDA v0.2.1 is out.
it really would be helpful to post the link...
Go to the first post in the thread. It's there.
Yes and no... The first post clearly says the attached app is for Linux only. People searching for the Windows app will be looking elsewhere.
Please, let's change the first post then, so it is clear the zip file also contains the Windows app. |
|
|
Ken_g6 Volunteer developer
 Send message
Joined: 4 Jul 06 Posts: 921 ID: 3110 Credit: 218,950,902 RAC: 3,798
                          
|
|
I've just re-thought the CPU multiplication, and I think I'm going to have to pull this build. I'll try again tomorrow. :(
____________
|
|
|
BiBi Volunteer tester Send message
Joined: 6 Mar 10 Posts: 151 ID: 56425 Credit: 34,290,031 RAC: 0
                   
|
|
The results for my 9500 GT PCI
ppsieve-cuda-x86-windows.exe -p42070e9 -P42070030e6 -k 1201
-K9999 -N2000000 -c 60
ppsieve version cuda-0.2.1 (testing)
nstart=76, nstep=32
ppsieve initialized: 1201 <= k <= 9999, 76 <= n < 2000000
Sieve started: 42070000000000 <= p < 42070030000000
Thread 0 starting
Detected GPU 0: GeForce 9500 GT
Detected compute capability: 1.1
Detected 4 multiprocessors.
Thread 0 completed
Waiting for threads to exit
Sieve complete: 42070000000000 <= p < 42070030000000
Found 27 factors
count=955289,sum=0x2dbc17167afb6a8d
Elapsed time: 87.69 sec. (0.09 init + 87.60 sieve) at 344136 p/sec.
Processor time: 2.36 sec. (0.08 init + 2.28 sieve) at 13214930 p/sec.
Average processor utilization: 0.83 (init), 0.03 (sieve)
ppsieve-cuda-x86-windows.exe -p42070e9 -P42070100e6 -k 1201
-K9999 -N2000000 -c 60
ppsieve version cuda-0.2.1 (testing)
nstart=76, nstep=32
ppsieve initialized: 1201 <= k <= 9999, 76 <= n < 2000000
Sieve started: 42070000000000 <= p < 42070100000000
Thread 0 starting
Detected GPU 0: GeForce 9500 GT
Detected compute capability: 1.1
Detected 4 multiprocessors.
Thread 0 completed
Waiting for threads to exit
Sieve complete: 42070000000000 <= p < 42070100000000
Found 68 factors
count=3185940,sum=0x4413a5b6a515d4c0
Elapsed time: 296.34 sec. (0.12 init + 296.21 sieve) at 338062 p/sec.
Processor time: 7.77 sec. (0.05 init + 7.72 sieve) at 12973475 p/sec.
Average processor utilization: 0.38 (init), 0.03 (sieve)
____________
|
|
|
|
|
I've just re-thought the CPU multiplication, and I think I'm going to have to pull this build. I'll try again tomorrow. :(
what's wrong with it?
results look fine...
|
|
|
Ken_g6 Volunteer developer
 Send message
Joined: 4 Jul 06 Posts: 921 ID: 3110 Credit: 218,950,902 RAC: 3,798
                          
|
|
What's wrong with it is highly technical. It involves multiplying two 64-bit numbers to get a 128-bit number when you can only use individual 32-bit numbers, and avoiding overflow in those 32-bit numbers. I made an assumption that both 64-bit numbers were actually smaller, less-than-63-bit numbers, but that's not true in all cases. And one being smaller isn't enough to make it work all the time. (Just most of the time, as you've seen.)
The bad optimization wasn't aimed at PrimeGrid anyway; it was mostly aimed at Twin Prime Search on Mersenneforum. I've uploaded v0.2.1a, and you shouldn't notice much difference.
____________
|
|
|
|
|
What's wrong with it is highly technical. It involves multiplying two 64-bit numbers to get a 128-bit number when you can only use individual 32-bit numbers, and avoiding overflow in those 32-bit numbers. I made an assumption that both 64-bit numbers were actually smaller, less-than-63-bit numbers, but that's not true in all cases. And one being smaller isn't enough to make it work all the time. (Just most of the time, as you've seen.)
Oops - not using SSE 128-bit instructions?
We are talking about the CPU part of the code - right?
|
|
|
|
|
I got GTX 470, drivers 258.96.
I don't receive any work. Message:
("No work available for the applications you have selected. Please check your project preferences on the web site.")
I get the same error message out of the blue, but I see in "Top Computers" that other people do get work for CUDA |
|
|
Ken_g6 Volunteer developer
 Send message
Joined: 4 Jul 06 Posts: 921 ID: 3110 Credit: 218,950,902 RAC: 3,798
                          
|
oops - not using SSE 128bit instructions?
we are talking about the cpu-part of the code - right?
I am actually looking into that; but remember that SSE2 can't do 128-bit multiplies or adds. It can only do a 32*32->64-bit multiply and 64-bit add.
What I'm emulating is a "64*64->high 64 bits of 128" instruction. I wish SSE had that.
____________
|
|
|
|
|
oops - not using SSE 128bit instructions?
we are talking about the cpu-part of the code - right?
I am actually looking into that; but remember that SSE2 can't do 128-bit multiplies or adds. It can only do a 32*32->64-bit multiply and 64-bit add.
What I'm emulating is a "64*64->high 64 bits of 128" instruction. I wish SSE had that.
Ouch - I'm rusty with that stuff, but you are right!
|
|
|
|
|
I got GTX 470, drivers 258.96.
I don't receive any work. Message:
("No work available for the applications you have selected. Please check your project preferences on the web site.")
I get the same error message out of the blue, but I see in "Top Computers" that other people do get work for CUDA
Yes, this is very annoying. I tried with app_info.xml, but nothing worked.
Even SETI runs on Fermi, but PG does not.
____________
Polish National Team |
|
|
|
|
Yes, this is very annoying. I tried with app_info.xml but nothing worked.
Even seti runs on fermi, but pg not.
I use this one:
<app_info>
<app>
<name>pps_sr2sieve</name>
<user_friendly_name>Proth Prime Search (Sieve)</user_friendly_name>
</app>
<file_info>
<name>ppsieve-cuda-boinc-x86-windows.exe</name>
<executable/>
</file_info>
<app_version>
<app_name>pps_sr2sieve</app_name>
<version_num>129</version_num>
<plan_class>cuda23</plan_class>
<avg_ncpus>0.05</avg_ncpus>
<max_ncpus>1</max_ncpus>
<flops>1.0e11</flops>
<coproc>
<type>CUDA</type>
<count>1</count>
</coproc>
<cmdline>-m 16</cmdline>
<file_ref>
<file_name>ppsieve-cuda-boinc-x86-windows.exe</file_name>
<main_program/>
</file_ref>
</app_version>
</app_info>
to crunch WUs with a Fermi-based GPU under Vista. PPS sieving on the CPU is done in a Linux VM running at the lowest priority to avoid the runtime-prediction chaos, although I do not recommend crunching the long-running LLR tasks in a VM (at least not 4 in parallel, as I have already tried; your mileage may vary depending on the VM software and the CPU you use for such tests).
Currently the GPU crunches a WU in a few seconds less than 3 minutes while a core of the Q9550 needs about 64 minutes to complete a WU @ 3.4 GHz under Windows (64 bit).
____________
|
|
|
|
|
|
I will try this app tomorrow. Thanks.
____________
Polish National Team |
|
|
|
|
|
But why is it working for this host without app_info.xml? I would rather prefer to have it running as it used to be one month ago. :/ |
|
|
|
|
But why is it working for this host without app_info.xml? I would rather prefer to have it running as it used to be one month ago. :/
go ask DA about his recent server-code... |
|
|
rroonnaalldd Volunteer developer Volunteer tester
 Send message
Joined: 3 Jul 09 Posts: 1213 ID: 42893 Credit: 34,634,263 RAC: 0
                 
|
But why is it working for this host without app_info.xml? I would rather prefer to have it running as it used to be one month ago. :/
This host has a "NVIDIA GeForce GTX 260 (895MB)" and no Fermi inside. ;)
____________
Best wishes. Knowledge is power. by jjwhalen
|
|
|
|
|
|
Fermi is a troublemaker. :)
____________
Polish National Team |
|
|
|
|
Fermi is a troublemaker. :)
Yes. My 460 still needs 32 seconds per K with the old CUDA AP26 binaries... :)
...but its speed at PPS sieve is fermidable.
____________
|
|
|
|
|
Fermi is a troublemaker. :)
Yes. My 460 still needs 32 seconds per K with the old CUDA AP26 binaries... :)
...but its speed at PPS sieve is fermidable.
The AP26 app is heavily memory bound; my GTX 460 is at 38-40 seconds, my GTX 260 was at 32-33 seconds. Your GTX 460 has faster RAM than mine.
If I get itchy, I will give the AP26 app a try recompiled for cuda31 and test with more cache/shared mem for the kernel-local arrays. |
|
|
|
|
Fermi is a troublemaker. :)
Yes. My 460 still needs 32 seconds per K with the old CUDA AP26 binaries... :)
...but its speed at PPS sieve is fermidable.
The AP26 app is heavily memory bound; my GTX 460 is at 38-40 seconds, my GTX 260 was at 32-33 seconds. Your GTX 460 has faster RAM than mine.
If I get itchy, I will give the AP26 app a try recompiled for cuda31 and test with more cache/shared mem for the kernel-local arrays.
going for AP27?? <LOL>
|
|
|
|
|
Fermi is a troublemaker. :)
Yes. My 460 still needs 32 seconds per K with the old CUDA AP26 binaries... :)
...but its speed at PPS sieve is fermidable.
The AP26 app is heavily memory bound; my GTX 460 is at 38-40 seconds, my GTX 260 was at 32-33 seconds. Your GTX 460 has faster RAM than mine.
If I get itchy, I will give the AP26 app a try recompiled for cuda31 and test with more cache/shared mem for the kernel-local arrays.
It is probably the memory bandwidth difference between the 768 MB version and the 1 GB version that is important in this case (85.4 GB/s vs. 115.2 GB/s or 192 bit vs 256 bit wide RAM interface). It would be interesting to see the values for the GTX 470 (320 bit) and GTX 480 (384 bit)...
____________
|
|
|
|
|
(...)
It is probably the memory bandwidth difference between the 768 MB version and the 1 GB version that is important in this case (85.4 GB/s vs. 115.2 GB/s or 192 bit vs 256 bit wide RAM interface).
(...)
That is what I meant to imply.
Since the project is down and I do not own such cards, we will never know for sure.
Nevertheless, I have crunched some 13k further numbers so far, with only one AP21 and two AP20. |
|
|
|
|
|
I've just read mdettweiler's post over in the Mersenne forums about a hefty speed increase with CUDA 3.1 on a GTX 460 (from 1.7M p/s up to 2.8M p/s). Compiling with CUDA 3.1 seems to be the way to go for the non-high-end Fermi-based GPUs.
____________
|
|
|
|
|
|
built with
[roadrunner@rr022 pps]$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2010 NVIDIA Corporation
Built on Mon_Jun__7_18:56:31_PDT_2010
Cuda compilation tools, release 3.1, V0.2.1221
[roadrunner@rr022 pps]$ time ./ppsieve-cuda-x86_64-linux -p42070e9 -P42070030e6 -k 1201 -K 9999 -N 2000000 -z normal
ppsieve version cuda-0.2.1a (testing)
Compiled Oct 10 2010 with GCC 4.1.2 20080704 (Red Hat 4.1.2-48)
nstart=76, nstep=32
ppsieve initialized: 1201 <= k <= 9999, 76 <= n < 2000000
Sieve started: 42070000000000 <= p < 42070030000000
Thread 0 starting
Detected GPU 0: GeForce GTX 460
Detected compute capability: 2.1
Detected 7 multiprocessors.
42070000070587 | 9475*2^197534+1
42070000198537 | 3373*2^1046686+1
42070003101727 | 4207*2^1054290+1
42070003511309 | 6057*2^1043547+1
42070006307657 | 1513*2^1771812+1
42070006388603 | 2059*2^1816098+1
42070007177519 | 5437*2^1121592+1
42070007396759 | 7339*2^1803518+1
42070008823897 | 4639*2^952018+1
42070008858187 | 2893*2^317690+1
42070010190569 | 5625*2^1903125+1
42070011430123 | 3821*2^1406279+1
42070012301263 | 1957*2^1185814+1
42070013521999 | 1965*2^404493+1
42070013970587 | 7143*2^1462422+1
42070013989247 | 5037*2^838603+1
42070017332953 | 6237*2^1916994+1
42070018235321 | 1941*2^363948+1
42070019542387 | 8587*2^1703626+1
42070023987581 | 9811*2^318944+1
42070024339237 | 9257*2^1170495+1
42070024532551 | 4311*2^1690093+1
42070024936837 | 5679*2^1726142+1
42070024995961 | 9111*2^1707153+1
42070026021997 | 4039*2^1819590+1
42070027452199 | 1323*2^854008+1
42070029006583 | 5943*2^663870+1
Thread 0 completed
Waiting for threads to exit
Sieve complete: 42070000000000 <= p < 42070030000000
Found 27 factors
count=955289,sum=0x2dbc17167afb6a8d
Elapsed time: 9.14 sec. (0.02 init + 9.12 sieve) at 3304812 p/sec.
Processor time: 1.96 sec. (0.02 init + 1.95 sieve) at 15485950 p/sec.
Average processor utilization: 1.10 (init), 0.21 (sieve)
real 0m9.140s
user 0m0.483s
sys 0m1.482s |
|
|
Ken_g6 Volunteer developer
 Send message
Joined: 4 Jul 06 Posts: 921 ID: 3110 Credit: 218,950,902 RAC: 3,798
                          
|
|
Strange. That doesn't even compare favorably with Ralf Recker's.
____________
|
|
|
|
|
|
His GPU is running at 725 MHz; mine uses the stock clock of 675.
But even then, there is still one second unaccounted for.
A build with arch "sm_20" is even 2 seconds slower due to more registers being used. |
|
|
|
|
His GPU is running at 725 MHz; mine uses the stock clock of 675.
But even then, there is still one second unaccounted for.
A build with arch "sm_20" is even 2 seconds slower due to more registers being used.
Which Linux driver version are you using?
____________
|
|
|
|
|
|
NVIDIA-Linux-x86_64-256.53.run |
|
|
|
|
|
OK. I've just compiled a non-boinc Win32 test version with the CUDA 3.2RC toolkit (compute_20, sm_21) and so far I have seen almost no speed gain. The best run on the short test range was:
Sieve complete: 42070000000000 <= p < 42070030000000
Found 27 factors
count=955289,sum=0x2dbc17167afb6a8d
Elapsed time: 6.88 sec. (0.02 init + 6.86 sieve) at 4395183 p/sec.
Processor time: 0.50 sec. (0.02 init + 0.48 sieve) at 62337413 p/sec.
Average processor utilization: 0.71 (init), 0.07 (sieve)
but most of the time I get values around 6.92-6.93 seconds for the test range.
Update:
The best run so far (with -m 64 on the command line):
Sieve complete: 42070000000000 <= p < 42070030000000
Found 27 factors
count=955289,sum=0x2dbc17167afb6a8d
Elapsed time: 6.76 sec. (0.02 init + 6.74 sieve) at 4470793 p/sec.
Processor time: 0.30 sec. (0.02 init + 0.28 sieve) at 107359162 p/sec.
Average processor utilization: 0.71 (init), 0.04 (sieve)
Update 2:
The best run so far (with -m 72 on the command line):
Sieve complete: 42070000000000 <= p < 42070030000000
Found 27 factors
count=955289,sum=0x2dbc17167afb6a8d
Elapsed time: 6.71 sec. (0.02 init + 6.69 sieve) at 4506886 p/sec.
Processor time: 0.36 sec. (0.02 init + 0.34 sieve) at 87839115 p/sec.
Average processor utilization: 0.71 (init), 0.05 (sieve)
____________
|
|
|
|
|
|
Running a Win32/CUDA32RC (compute_20,sm_21) non-boinc cuda-0.2.1a (testing) on a GTX 460
OS: Windows Vista SP-2 - Driver Version 258.96
CPU: Core 2 Quad 9550 @ 3.4 GHz
GPU: NVIDIA GTX 460 @ 725 MHz
ppsieve-cuda32RC-x86-windows.exe -p 42070e9 -P42070030e6 -k 1201 -K 9999 -N2000000 -c 60
ppsieve version cuda-0.2.1a (testing)
nstart=76, nstep=32
ppsieve initialized: 1201 <= k <= 9999, 76 <= n < 2000000
Sieve started: 42070000000000 <= p < 42070030000000
Thread 0 starting
Detected GPU 0: GeForce GTX 460
Detected compute capability: 2.1
Detected 7 multiprocessors.
42070000070587 | 9475*2^197534+1
42070000198537 | 3373*2^1046686+1
42070003101727 | 4207*2^1054290+1
42070003511309 | 6057*2^1043547+1
42070006307657 | 1513*2^1771812+1
42070006388603 | 2059*2^1816098+1
42070007177519 | 5437*2^1121592+1
42070007396759 | 7339*2^1803518+1
42070008823897 | 4639*2^952018+1
42070008858187 | 2893*2^317690+1
42070010190569 | 5625*2^1903125+1
42070011430123 | 3821*2^1406279+1
42070012301263 | 1957*2^1185814+1
42070013521999 | 1965*2^404493+1
42070013970587 | 7143*2^1462422+1
42070013989247 | 5037*2^838603+1
42070017332953 | 6237*2^1916994+1
42070018235321 | 1941*2^363948+1
42070019542387 | 8587*2^1703626+1
42070023987581 | 9811*2^318944+1
42070024339237 | 9257*2^1170495+1
42070024532551 | 4311*2^1690093+1
42070024936837 | 5679*2^1726142+1
42070024995961 | 9111*2^1707153+1
42070026021997 | 4039*2^1819590+1
42070027452199 | 1323*2^854008+1
42070029006583 | 5943*2^663870+1
Thread 0 completed
Waiting for threads to exit
Sieve complete: 42070000000000 <= p < 42070030000000
Found 27 factors
count=955289,sum=0x2dbc17167afb6a8d
Elapsed time: 6.84 sec. (0.02 init + 6.82 sieve) at 4419669 p/sec.
Processor time: 0.31 sec. (0.02 init + 0.30 sieve) at 101708356 p/sec.
Average processor utilization: 0.71 (init), 0.04 (sieve)
The runtime with -m 64:
Sieve complete: 42070000000000 <= p < 42070030000000
Found 27 factors
count=955289,sum=0x2dbc17167afb6a8d
Elapsed time: 6.75 sec. (0.02 init + 6.73 sieve) at 4478764 p/sec.
Processor time: 0.41 sec. (0.03 init + 0.37 sieve) at 80519228 p/sec.
Average processor utilization: 1.36 (init), 0.06 (sieve)
The runtime with -m 72:
Sieve complete: 42070000000000 <= p < 42070030000000
Found 27 factors
count=955289,sum=0x2dbc17167afb6a8d
Elapsed time: 6.70 sec. (0.02 init + 6.68 sieve) at 4513634 p/sec.
Processor time: 0.42 sec. (0.05 init + 0.37 sieve) at 80519228 p/sec.
Average processor utilization: 2.03 (init), 0.06 (sieve)
____________
|
|
|
|
|
|
Running a Win32/CUDA32RC (compute_20,sm_21) non-boinc cuda-0.2.1a (testing) on a GTX 460
OS: Windows Vista SP-2 - Driver Version 258.96
CPU: Core 2 Quad 9550 @ 3.4 GHz
GPU: NVIDIA GTX 460 @ 725 MHz
ppsieve-cuda32RC-x86-windows.exe -p 42070e9 -P42070100e6 -k 1201 -K 9999 -N2000000 -c 60
The old result with cuda-0.2.1 (testing):
Elapsed time: 22.65 sec. (0.03 init + 22.62 sieve) at 4427208 p/sec.
Processor time: 0.87 sec. (0.05 init + 0.83 sieve) at 121115629 p/sec.
Average processor utilization: 1.61 (init), 0.04 (sieve)
The CUDA32RC result with 0.2.1a (testing):
Sieve complete: 42070000000000 <= p < 42070100000000
Found 68 factors
count=3185940,sum=0x4413a5b6a515d4c0
Elapsed time: 22.46 sec. (0.02 init + 22.43 sieve) at 4463517 p/sec.
Processor time: 1.11 sec. (0.03 init + 1.08 sieve) at 93030803 p/sec.
Average processor utilization: 1.42 (init), 0.05 (sieve)
The CUDA32RC result with 0.2.1a (testing) and -m 64:
Sieve complete: 42070000000000 <= p < 42070100000000
Found 68 factors
count=3185940,sum=0x4413a5b6a515d4c0
Elapsed time: 21.92 sec. (0.02 init + 21.90 sieve) at 4572349 p/sec.
Processor time: 1.00 sec. (0.02 init + 0.98 sieve) at 101890920 p/sec.
Average processor utilization: 0.71 (init), 0.04 (sieve)
The CUDA32RC result with 0.2.1a (testing) and -m 72:
Sieve complete: 42070000000000 <= p < 42070100000000
Found 68 factors
count=3185940,sum=0x4413a5b6a515d4c0
Elapsed time: 22.02 sec. (0.02 init + 22.00 sieve) at 4551980 p/sec.
Processor time: 1.05 sec. (0.02 init + 1.03 sieve) at 97259542 p/sec.
Average processor utilization: 0.68 (init), 0.05 (sieve)
____________
|
|
|
|
|
|
One correction to the two posts above: The driver version is 260.63.
____________
|
|
|
|
|
|
I'm surprised that you didn't get any speed boost with CUDA 3.2. For me, compiling with CUDA 3.1 (vs. the provided 2.3 binaries) more than doubled the speed on a GTX 460.
Note that this was on the RSP manual sieve, not PPSE; that may be the root of the difference, though I would still expect at least some speed boost on PPSE. |
|
|
Ken_g6 Volunteer developer
 Send message
Joined: 4 Jul 06 Posts: 921 ID: 3110 Credit: 218,950,902 RAC: 3,798
                          
|
|
Well, I'm not sure about this CUDA 3.1 or 3.2 stuff, but I've found a way to about double the speed on all GPUs. Again! V0.2.2-alpha. :)
Actually, I think the benchmark I got from Lennart was testing the 20T range. The speedup on current ranges is likely to be less. So test it on some current ranges, as well as the 20T range, on Fermi and non-Fermi. Then, once I add the same speedup to the CPU portion of this code, I'll post a final version, which can hopefully go into BOINC in a week or so.
____________
|
|
|
|
|
|
Running a Win32/CUDA32RC (compute_20,sm_21) non-boinc cuda-0.2.2-alpha (testing) on a GTX 460
OS: Windows Vista SP-2 - Driver Version 260.63
CPU: Core 2 Quad 9550 @ 3.4 GHz
GPU: NVIDIA GTX 460 @ 725 MHz
ppsieve-cuda32RC-x86-windows.exe -p 42070e9 -P42070030e6 -k 1201 -K 9999 -N2000000 -c 60
ppsieve version cuda-0.2.2-alpha (testing)
nstart=76, nstep=32
ppsieve initialized: 1201 <= k <= 9999, 76 <= n < 2000000
Sieve started: 42070000000000 <= p < 42070030000000
Thread 0 starting
Detected GPU 0: GeForce GTX 460
Detected compute capability: 2.1
Detected 7 multiprocessors.
42070000070587 | 9475*2^197534+1
42070000198537 | 3373*2^1046686+1
42070003101727 | 4207*2^1054290+1
42070003511309 | 6057*2^1043547+1
42070006307657 | 1513*2^1771812+1
42070006388603 | 2059*2^1816098+1
42070007177519 | 5437*2^1121592+1
42070007396759 | 7339*2^1803518+1
42070008823897 | 4639*2^952018+1
42070008858187 | 2893*2^317690+1
42070010190569 | 5625*2^1903125+1
42070011430123 | 3821*2^1406279+1
42070012301263 | 1957*2^1185814+1
42070013521999 | 1965*2^404493+1
42070013970587 | 7143*2^1462422+1
42070013989247 | 5037*2^838603+1
42070017332953 | 6237*2^1916994+1
42070018235321 | 1941*2^363948+1
42070019542387 | 8587*2^1703626+1
42070023987581 | 9811*2^318944+1
42070024339237 | 9257*2^1170495+1
42070024532551 | 4311*2^1690093+1
42070024936837 | 5679*2^1726142+1
42070024995961 | 9111*2^1707153+1
42070026021997 | 4039*2^1819590+1
42070027452199 | 1323*2^854008+1
42070029006583 | 5943*2^663870+1
Thread 0 completed
Waiting for threads to exit
Sieve complete: 42070000000000 <= p < 42070030000000
Found 27 factors
count=955289,sum=0x2dbc17167afb6a8d
Elapsed time: 5.50 sec. (0.02 init + 5.48 sieve) at 5505215 p/sec.
Processor time: 0.31 sec. (0.02 init + 0.30 sieve) at 101708356 p/sec.
Average processor utilization: 0.68 (init), 0.05 (sieve)
____________
|
|
|
|
|
|
Running a Win32/CUDA32RC (compute_20,sm_21) non-boinc cuda-0.2.2-alpha (testing) on a GTX 460
OS: Windows Vista SP-2 - Driver Version 260.63
CPU: Core 2 Quad 9550 @ 3.4 GHz
GPU: NVIDIA GTX 460 @ 725 MHz
ppsieve-cuda32RC-x86-windows.exe -p 42070e9 -P42070100e6 -k 1201 -K 9999 -N2000000 -c 60
Thread 0 completed
Waiting for threads to exit
Sieve complete: 42070000000000 <= p < 42070100000000
Found 68 factors
count=3185940,sum=0x4413a5b6a515d4c0
Elapsed time: 18.09 sec. (0.02 init + 18.07 sieve) at 5542034 p/sec.
Processor time: 0.78 sec. (0.03 init + 0.75 sieve) at 133731757 p/sec.
Average processor utilization: 1.42 (init), 0.04 (sieve)
____________
|
|
|
|
|
|
The downloadable binaries are faster than my build:
The short range from 42070e9 to 42070030e6:
Thread 0 completed
Waiting for threads to exit
Sieve complete: 42070000000000 <= p < 42070030000000
Found 27 factors
count=955289,sum=0x2dbc17167afb6a8d
Elapsed time: 5.23 sec. (0.02 init + 5.21 sieve) at 5785178 p/sec.
Processor time: 0.28 sec. (0.03 init + 0.25 sieve) at 120779003 p/sec.
Average processor utilization: 1.36 (init), 0.05 (sieve)
The long range from 42070e9 to 42070100e6:
Thread 0 completed
Waiting for threads to exit
Sieve complete: 42070000000000 <= p < 42070100000000
Found 68 factors
count=3185940,sum=0x4413a5b6a515d4c0
Elapsed time: 17.11 sec. (0.02 init + 17.09 sieve) at 5860881 p/sec.
Processor time: 0.80 sec. (0.03 init + 0.76 sieve) at 131002555 p/sec.
Average processor utilization: 1.30 (init), 0.04 (sieve)
The range from 20070e9 to 20070010e6
Thread 0 completed
Waiting for threads to exit
Sieve complete: 20070000000000 <= p < 20070010000000
Found 13 factors
count=326136,sum=0x5ad678173464405c
Elapsed time: 2.26 sec. (0.02 init + 2.24 sieve) at 4560043 p/sec.
Processor time: 0.06 sec. (0.00 init + 0.06 sieve) at 163840000 p/sec.
Average processor utilization: 0.00 (init), 0.03 (sieve)
____________
|
|
|
pschoefer Volunteer developer Volunteer tester
 Send message
Joined: 20 Sep 05 Posts: 667 ID: 845 Credit: 2,374,701,989 RAC: 15,281
                          
|
|
GTX 460 @ 800/1600/2000
Win7
Driver 258.96
ppsieve version cuda-0.2.2-alpha (testing)
1) -p42070e9 -P42070030e6 -k 1201 -K 9999 -N 2000000 -c 60
Elapsed time: 5.00 sec. (0.02 init + 4.97 sieve) at 6061701 p/sec.
Processor time: 0.53 sec. (0.02 init + 0.51 sieve) at 58559410 p/sec.
Average processor utilization: 0.68 (init), 0.10 (sieve)
2) -p20070e9 -P20070010e6 -k 1201 -K 9999 -N 2000000 -c 60
Elapsed time: 2.27 sec. (0.02 init + 2.25 sieve) at 4551662 p/sec.
Processor time: 0.11 sec. (0.02 init + 0.09 sieve) at 109226667 p/sec.
Average processor utilization: 0.68 (init), 0.04 (sieve)
3) -p20070e9 -P20070010e6 -k 1201 -K 9999 -N 2000000 -c 60 -R
Elapsed time: 2.25 sec. (0.02 init + 2.23 sieve) at 4578161 p/sec.
Processor time: 0.17 sec. (0.02 init + 0.16 sieve) at 65535580 p/sec.
Average processor utilization: 0.74 (init), 0.07 (sieve)
All expected factors found. That's about 25% faster than 0.2.1a on the first range, 50% faster on the others. |
|
|
LookAS Volunteer tester Send message
Joined: 19 Apr 08 Posts: 38 ID: 21649 Credit: 349,920,761 RAC: 35,293
                      
|
|
OS: Windows7U x64, Driver Version 260.89beta
CPU: Xeon X5650 @ 3.5 GHz
GPU: NVIDIA GTX 470 @ 760 MHz
pps-cuda-0.2.2a
The short range from 42070e9 to 42070030e6:
Thread 0 completed
Sieve complete: 42070000000000 <= p < 42070030000000
count=955289,sum=0x2dbc17167afb6a8d
Elapsed time: 2.59 sec. (0.03 init + 2.57 sieve) at 11738643 p/sec.
Processor time: 0.22 sec. (0.03 init + 0.19 sieve) at 161038456 p/sec.
Average processor utilization: 1.20 (init), 0.07 (sieve)
The long range from 42070e9 to 42070100e6:
Thread 0 completed
Sieve complete: 42070000000000 <= p < 42070100000000
count=3185940,sum=0x4413a5b6a515d4c0
Elapsed time: 8.25 sec. (0.03 init + 8.22 sieve) at 12180181 p/sec.
Processor time: 0.69 sec. (0.02 init + 0.67 sieve) at 149282067 p/sec.
Average processor utilization: 0.60 (init), 0.08 (sieve)
The range from 20070e9 to 20070010e6
Thread 0 completed
Sieve complete: 20070000000000 <= p < 20070010000000
count=326136,sum=0x5ad678173464405c
Elapsed time: 1.41 sec. (0.02 init + 1.39 sieve) at 7333599 p/sec.
Processor time: 0.09 sec. (0.02 init + 0.08 sieve) at 131072000 p/sec.
Average processor utilization: 0.82 (init), 0.06 (sieve)
(I can re-test it with GPU @ default 607MHz if interested) |
|
|
pschoefer Volunteer developer Volunteer tester
 Send message
Joined: 20 Sep 05 Posts: 667 ID: 845 Credit: 2,374,701,989 RAC: 15,281
                          
|
|
9800GT @ 600/1675/900
Win7
Driver 258.96
ppsieve version cuda-0.2.2-alpha (testing)
1) -p42070e9 -P42070030e6 -k 1201 -K 9999 -N 2000000 -c 60
Elapsed time: 15.50 sec. (0.03 init + 15.47 sieve) at 1948500 p/sec.
Processor time: 0.34 sec. (0.05 init + 0.30 sieve) at 101546307 p/sec.
Average processor utilization: 1.45 (init), 0.02 (sieve)
2) -p20070e9 -P20070010e6 -k 1201 -K 9999 -N 2000000 -c 60
Elapsed time: 6.63 sec. (0.03 init + 6.60 sieve) at 1549812 p/sec.
Processor time: 0.19 sec. (0.03 init + 0.16 sieve) at 65431142 p/sec.
Average processor utilization: 1.10 (init), 0.02 (sieve)
3) -p20070e9 -P20070010e6 -k 1201 -K 9999 -N 2000000 -c 60 -R
Elapsed time: 6.63 sec. (0.03 init + 6.61 sieve) at 1547064 p/sec.
Processor time: 0.17 sec. (0.03 init + 0.14 sieve) at 72701269 p/sec.
Average processor utilization: 1.19 (init), 0.02 (sieve)
All expected factors found. |
|
|
|
|
|
GTX 275 @ 692/1548/1134
Ubuntu 10.04 (64b)
Driver 256.44
ppsieve version cuda-0.2.2-alpha (testing)
-p42070e9 -P42070030e6 -k 1201 -K 9999 -N 2000000 -c 60
Elapsed time: 7.97 sec. (0.01 init + 7.96 sieve) at 3768573 p/sec.
Processor time: 0.24 sec. (0.02 init + 0.22 sieve) at 136364218 p/sec.
Average processor utilization: 1.95 (init), 0.03 (sieve)
-p42070e9 -P42070100e6 -k 1201 -K 9999 -N 2000000 -c 60
Elapsed time: 26.07 sec. (0.01 init + 26.06 sieve) at 3837507 p/sec.
Processor time: 0.59 sec. (0.02 init + 0.57 sieve) at 175438596 p/sec.
Average processor utilization: 1.98 (init), 0.02 (sieve)
-p20070e9 -P20070010e6 -k 1201 -K 9999 -N 2000000 -c 60 -R
Elapsed time: 3.64 sec. (0.01 init + 3.63 sieve) at 2754282 p/sec.
Processor time: 0.12 sec. (0.00 init + 0.12 sieve) at 83334400 p/sec.
Average processor utilization: 0.00 (init), 0.03 (sieve)
ppsieve version cuda-0.2.1a (testing)
-p42070e9 -P42070030e6 -k 1201 -K 9999 -N 2000000 -c 60
Elapsed time: 13.16 sec. (0.01 init + 13.15 sieve) at 2281459 p/sec.
Processor time: 0.27 sec. (0.02 init + 0.25 sieve) at 120000512 p/sec.
Average processor utilization: 1.94 (init), 0.02 (sieve)
-p42070e9 -P42070100e6 -k 1201 -K 9999 -N 2000000 -c 60
Elapsed time: 43.15 sec. (0.01 init + 43.14 sieve) at 2317807 p/sec.
Processor time: 0.67 sec. (0.02 init + 0.65 sieve) at 153846154 p/sec.
Average processor utilization: 1.98 (init), 0.02 (sieve)
-p20070e9 -P20070010e6 -k 1201 -K 9999 -N 2000000 -c 60 -R
Elapsed time: 7.38 sec. (0.01 init + 7.37 sieve) at 1357136 p/sec.
Processor time: 0.14 sec. (0.01 init + 0.13 sieve) at 76924062 p/sec.
Average processor utilization: 1.45 (init), 0.02 (sieve)
____________
There's someone in our head but it's not us. |
|
|
|
|
|
A WU would take around 2 hours on my NVS 140M
# time ./ppsieve-cuda-x86_64-linux -p1186491e9 -P1186492e9 -k 1201 -K 9999 -N 2000000 -c 60 -m 16 --device 0
ppsieve version cuda-0.2.2-alpha (testing)
Compiled Oct 11 2010 with GCC 4.3.3
nstart=86, nstep=37
ppsieve initialized: 1201 <= k <= 9999, 86 <= n < 2000000
Sieve started: 1186491000000000 <= p < 1186492000000000
Thread 0 starting
Detected GPU 0: Quadro NVS 140M
Detected compute capability: 1.1
Detected 2 multiprocessors.
nstep changed to 32
p=1186491008912897, 148.5K p/sec, 0.01 CPU cores, 0.9% done. ETA 12 Oct 20:12
p=1186491017563649, 144.2K p/sec, 0.00 CPU cores, 1.8% done. ETA 12 Oct 20:13
1186491022263817 | 6941*2^233525+1
Thread 0 interrupted
Waiting for threads to exit
Sieve incomplete: 1186491000000000 <= p < 1186491024379393
Found 1 factor
count=702366,sum=0x2d114e5a1f2c7062
Elapsed time: 172.05 sec. (0.10 init + 171.96 sieve) at 141776 p/sec.
Processor time: 0.95 sec. (0.10 init + 0.85 sieve) at 28719784 p/sec.
Average processor utilization: 1.02 (init), 0.00 (sieve)
real 2m52.055s
user 0m0.680s
sys 0m0.266s
around 41.5 times slower than my GTX 460
# /opt/nvidia_cuda_sdk/bin/linux/release/deviceQuery
CUDA Device Query (Runtime API) version (CUDART static linking)
There is 1 device supporting CUDA
Device 0: "Quadro NVS 140M"
CUDA Capability Major revision number: 1
CUDA Capability Minor revision number: 1
Total amount of global memory: 133890048 bytes
Number of multiprocessors: 2
Number of cores: 16
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 16384 bytes
Total number of registers available per block: 8192
Warp size: 32
Maximum number of threads per block: 512
Maximum sizes of each dimension of a block: 512 x 512 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
Maximum memory pitch: 262144 bytes
Texture alignment: 256 bytes
Clock rate: 0.80 GHz
Concurrent copy and execution: Yes
Run time limit on kernels: Yes
Integrated: No
Support host page-locked memory mapping: No
Compute mode: Default (multiple host threads can use this device simultaneously)
That is, my nifty mobile GPU is faster than one core of my Xeon W3520 at 2.66 GHz - it achieves around 110 K p/s, or around 8550 s per WU... |
|
|
|
|
A WU would take around 2 hours on my NVS 140M
around 41.5 times slower than my GTX 460
probably the slowest GPU that's able to run CUDA jobs... ;)
That is, my nifty mobile GPU is faster than one core of my Xeon W3520 at 2.66 GHz - it achieves around 110 K p/s, or around 8550 s per WU...
turn off HT! <LOL>
|
|
|
Benva Volunteer tester
 Send message
Joined: 5 May 08 Posts: 73 ID: 22332 Credit: 2,715,050 RAC: 0
     
|
|
8800 GTS
Win XP
Driver 258.96
1) -p42070e9 -P42070030e6 -k 1201 -K 9999 -N 2000000 -c 60
Sieve complete: 42070000000000 <= p < 42070030000000
Found 27 factors
count=955289,sum=0x2dbc17167afb6a8d
Elapsed time: 24.97 sec. (0.03 init + 24.94 sieve) at 1208885 p/sec.
Processor time: 0.25 sec. (0.05 init + 0.20 sieve) at 148413834 p/sec.
Average processor utilization: 1.50 (init), 0.01 (sieve)
2)-p20070e9 -P20070010e6 -k 1201 -K 9999 -N 2000000 -c 60
Sieve complete: 20070000000000 <= p < 20070010000000
Found 13 factors
count=326136,sum=0x5ad678173464405c
Elapsed time: 10.52 sec. (0.02 init + 10.50 sieve) at 973678 p/sec.
Processor time: 0.16 sec. (0.03 init + 0.13 sieve) at 81788928 p/sec.
Average processor utilization: 2.00 (init), 0.01 (sieve)
3)-p20070e9 -P20070010e6 -k 1201 -K 9999 -N 2000000 -c 60 -R
Sieve complete: 20070000000000 <= p < 20070010000000
Found 14 factors
count=326136,sum=0x5ad678173464405c
Elapsed time: 10.50 sec. (0.02 init + 10.48 sieve) at 975129 p/sec.
Processor time: 0.20 sec. (0.05 init + 0.16 sieve) at 65431142 p/sec.
Average processor utilization: 3.00 (init), 0.01 (sieve)
____________
|
|
|
|
|
|
# time ./ppsieve-cuda-x86_64-linux -p1186491e9 -P1186492e9 -k 1201 -K 9999 -N 2000000 -c 60 -m 16 --device 0
ppsieve version cuda-0.2.2-alpha (testing)
Compiled Oct 11 2010 with GCC 4.3.3
nstart=86, nstep=37
ppsieve initialized: 1201 <= k <= 9999, 86 <= n < 2000000
Sieve started: 1186491000000000 <= p < 1186492000000000
Thread 0 starting
Detected GPU 0: GeForce 9400 GT
Detected compute capability: 1.1
Detected 2 multiprocessors.
nstep changed to 32
(...)
Thread 0 completed
Waiting for threads to exit
Sieve complete: 1186491000000000 <= p < 1186492000000000
Found 27 factors
count=28805195,sum=0xbece431c19d67201
Elapsed time: 3905.06 sec. (0.13 init + 3904.93 sieve) at 256107 p/sec.
Processor time: 20.29 sec. (0.13 init + 20.16 sieve) at 49609729 p/sec.
Average processor utilization: 1.01 (init), 0.01 (sieve)
real 65m5.062s
user 0m19.914s
sys 0m0.375s
9400 GT @ 1400 MHz. |
|
|
samuel7 Volunteer tester
 Send message
Joined: 1 May 09 Posts: 89 ID: 39425 Credit: 257,425,010 RAC: 0
                    
|
|
Testing with a new toy...
GTX 480
Core i7 875K @ 3.2 GHz
Win7Ult x64
Driver 258.96
ppsieve version cuda-0.2.2-alpha (testing)
1) -p42070e9 -P42070030e6 -k 1201 -K 9999 -N 2000000 -c 60
Elapsed time: 2.72 sec. (0.05 init + 2.67 sieve) at 11277529 p/sec.
Processor time: 0.48 sec. (0.06 init + 0.42 sieve) at 71572520 p/sec.
Average processor utilization: 1.22 (init), 0.16 (sieve)
2) -p42070e9 -P42070100e6 -k 1201 -K 9999 -N 2000000 -c 60
Elapsed time: 8.46 sec. (0.05 init + 8.41 sieve) at 11905037 p/sec.
Processor time: 1.36 sec. (0.06 init + 1.29 sieve) at 77338886 p/sec.
Average processor utilization: 1.25 (init), 0.15 (sieve)
3) -p20070e9 -P2007001e7 -k 1201 -K 9999 -N 2000000 -c 60
Elapsed time: 1.50 sec. (0.02 init + 1.48 sieve) at 6930870 p/sec.
Processor time: 0.34 sec. (0.11 init + 0.23 sieve) at 43690293 p/sec.
Average processor utilization: 5.20 (init), 0.16 (sieve)
Expected factors found.
____________
|
|
|
Ken_g6 Volunteer developer
 Send message
Joined: 4 Jul 06 Posts: 921 ID: 3110 Credit: 218,950,902 RAC: 3,798
                          
|
|
And now the hopefully-final version, PPSieve-CUDA 0.2.2, is out. This should be lighter on your CPU on 32-bit platforms, and just a little bit faster on non-Fermi cards.
I hope this will be the last upgrade for a while, and that this will be the next version in BOINC. But it needs testing to get there. So, one more time?
Thanks!
____________
|
|
|
LookAS Volunteer tester Send message
Joined: 19 Apr 08 Posts: 38 ID: 21649 Credit: 349,920,761 RAC: 35,293
                      
|
|
OS: Windows7U x64, Driver Version 260.89beta
CPU: Xeon X5650 @ 3.5 GHz
GPU: NVIDIA GTX 470 @ 760 MHz
pps-cuda-0.2.2
The short range from 42070e9 to 42070030e6:
Sieve complete: 42070000000000 <= p < 42070030000000
count=955289,sum=0x2dbc17167afb6a8d
Elapsed time: 2.61 sec. (0.03 init + 2.59 sieve) at 11661449 p/sec.
Processor time: 0.27 sec. (0.02 init + 0.25 sieve) at 120779003 p/sec.
Average processor utilization: 0.60 (init), 0.10 (sieve)
The long range from 42070e9 to 42070100e6:
Sieve complete: 42070000000000 <= p < 42070100000000
count=3185940,sum=0x4413a5b6a515d4c0
Elapsed time: 8.27 sec. (0.03 init + 8.24 sieve) at 12147674 p/sec.
Processor time: 0.56 sec. (0.03 init + 0.53 sieve) at 188797967 p/sec.
Average processor utilization: 1.20 (init), 0.06 (sieve)
The range from 20070e9 to 20070010e6
Sieve complete: 20070000000000 <= p < 20070010000000
count=326136,sum=0x5ad678173464405c
Elapsed time: 1.42 sec. (0.02 init + 1.41 sieve) at 7276181 p/sec.
Processor time: 0.20 sec. (0.03 init + 0.17 sieve) at 59577835 p/sec.
Average processor utilization: 1.64 (init), 0.12 (sieve) |
|
|
samuel7 Volunteer tester
 Send message
Joined: 1 May 09 Posts: 89 ID: 39425 Credit: 257,425,010 RAC: 0
                    
|
|
9800 GT
Ubuntu 10.04 x64
C2Q Q9550
Driver 195.36.24
ppsieve version cuda-0.2.2 (testing)
1) -p42070e9 -P42070030e6 -k 1201 -K 9999 -N 2000000
Elapsed time: 15.63 sec. (0.01 init + 15.61 sieve) at 1931042 p/sec.
Processor time: 0.27 sec. (0.02 init + 0.25 sieve) at 120586240 p/sec.
Average processor utilization: 1.43 (init), 0.02 (sieve)
Found 27 factors
Speedup from 0.2.2a: 9 %
3) -p20070e9 -P2007001e7 -k 1201 -K 9999 -N 2000000
Elapsed time: 6.77 sec. (0.01 init + 6.76 sieve) at 1512809 p/sec.
Processor time: 0.15 sec. (0.01 init + 0.14 sieve) at 73025829 p/sec.
Average processor utilization: 1.05 (init), 0.02 (sieve)
Found 13 factors
Speedup from 0.2.2a: 6 % |
|
|
|
|
|
Testing 0.2.2 (testing):
OS: Windows Vista SP-2 - Driver Version 260.63
CPU: Core 2 Quad 9550 @ 3.4 GHz
GPU: NVIDIA GTX 460 @ 725 MHz
The short range from 42070e9 to 42070030e6:
Thread 0 completed
Waiting for threads to exit
Sieve complete: 42070000000000 <= p < 42070030000000
Found 27 factors
count=955289,sum=0x2dbc17167afb6a8d
Elapsed time: 5.24 sec. (0.02 init + 5.21 sieve) at 5781849 p/sec.
Processor time: 0.27 sec. (0.02 init + 0.25 sieve) at 120779003 p/sec.
Average processor utilization: 0.68 (init), 0.05 (sieve)
The long range from 42070e9 to 42070100e6:
Thread 0 completed
Waiting for threads to exit
Sieve complete: 42070000000000 <= p < 42070100000000
Found 68 factors
count=3185940,sum=0x4413a5b6a515d4c0
Elapsed time: 17.15 sec. (0.02 init + 17.13 sieve) at 5846850 p/sec.
Processor time: 0.59 sec. (0.02 init + 0.58 sieve) at 173490103 p/sec.
Average processor utilization: 0.68 (init), 0.03 (sieve)
The range from 20070e9 to 20070010e6:
Thread 0 completed
Waiting for threads to exit
Sieve complete: 20070000000000 <= p < 20070010000000
Found 13 factors
count=326136,sum=0x5ad678173464405c
Elapsed time: 2.25 sec. (0.02 init + 2.24 sieve) at 4566153 p/sec.
Processor time: 0.19 sec. (0.03 init + 0.16 sieve) at 65535580 p/sec.
Average processor utilization: 1.95 (init), 0.07 (sieve)
-----
Factors, count, sums, etc. match.
____________
|
|
|
|
|
|
One extra note:
Since version 0.2.2-alpha I see a clear decrease* in speed when I compile my test binary with CUDA32RC for compute_20 / sm_20 (or sm_21). I get the fastest binaries when I compile for compute_10 / sm_10.
*For example, on the short test range from 42070e9 to 42070030e6 the average runtime went up from 5.25 seconds to 5.50 seconds.
____________
|
|
|
|
|
|
Have you compiled with the nvcc option "--ptxas-options -v" and compared the register usage for each compilation? I got substantially more registers used when compiling for sm_20, a bit less when compiling for sm_12, and the lowest kernel register usage when compiling for sm_11/sm_10.
The compiler knows the number of registers available in each generation and tries to exploit them as much as it can. One has to use "--maxrregcount N" or launch bounds with cuda3x to keep the usage down to a desired level. |
|
|
pschoefer Volunteer developer Volunteer tester
 Send message
Joined: 20 Sep 05 Posts: 667 ID: 845 Credit: 2,374,701,989 RAC: 15,281
                          
|
|
9800GT @ 600/1675/900
Win7 x64
Driver 258.96
ppsieve version cuda-0.2.2 (testing)
1) -p42070e9 -P42070030e6 -k 1201 -K 9999 -N 2000000 -c 60
Elapsed time: 14.41 sec. (0.03 init + 14.38 sieve) at 2095871 p/sec.
Processor time: 0.19 sec. (0.03 init + 0.16 sieve) at 192937984 p/sec.
Average processor utilization: 1.14 (init), 0.01 (sieve)
2) -p20070e9 -P20070010e6 -k 1201 -K 9999 -N 2000000 -c 60
Elapsed time: 6.23 sec. (0.03 init + 6.20 sieve) at 1648918 p/sec.
Processor time: 0.16 sec. (0.03 init + 0.13 sieve) at 81788928 p/sec.
Average processor utilization: 1.10 (init), 0.02 (sieve)
3) -p20070e9 -P20070010e6 -k 1201 -K 9999 -N 2000000 -c 60 -R
Elapsed time: 6.27 sec. (0.03 init + 6.24 sieve) at 1637826 p/sec.
Processor time: 0.16 sec. (0.03 init + 0.13 sieve) at 81788928 p/sec.
Average processor utilization: 1.14 (init), 0.02 (sieve)
All expected factors found. About 5% faster than 0.2.2-alpha on all three ranges, "much" less CPU time on the first range (0.34s->0.19s). |
|
|
pschoefer Volunteer developer Volunteer tester
 Send message
Joined: 20 Sep 05 Posts: 667 ID: 845 Credit: 2,374,701,989 RAC: 15,281
                          
|
|
GTX 460 @ 800/1600/2000
Win7 x64
Driver 258.96
ppsieve version cuda-0.2.2 (testing)
1) -p42070e9 -P42070030e6 -k 1201 -K 9999 -N 2000000 -c 60
Elapsed time: 5.05 sec. (0.02 init + 5.02 sieve) at 6000167 p/sec.
Processor time: 0.47 sec. (0.03 init + 0.44 sieve) at 69016376 p/sec.
Average processor utilization: 1.36 (init), 0.09 (sieve)
2) -p20070e9 -P20070010e6 -k 1201 -K 9999 -N 2000000 -c 60
Elapsed time: 2.25 sec. (0.02 init + 2.23 sieve) at 4590493 p/sec.
Processor time: 0.17 sec. (0.03 init + 0.14 sieve) at 72817259 p/sec.
Average processor utilization: 1.42 (init), 0.06 (sieve)
3) -p20070e9 -P20070010e6 -k 1201 -K 9999 -N 2000000 -c 60 -R
Elapsed time: 2.26 sec. (0.02 init + 2.24 sieve) at 4574065 p/sec.
Processor time: 0.17 sec. (0.02 init + 0.16 sieve) at 65535580 p/sec.
Average processor utilization: 0.71 (init), 0.07 (sieve)
All expected factors found. Same speed and roughly the same CPU usage. |
|
|
|
|
Have you compiled with the nvcc option "--ptxas-options -v" and compared the register usage for each compilation? I got substantially more registers used when compiling for sm_20, a bit fewer for sm_12, and the lowest kernel register usage when compiling for sm_11/sm_10.
The compiler knows how many registers are available in each generation and tries to exploit them thoroughly. One has to use "--maxrregcount N" or launch bounds with CUDA 3.x to keep the usage down to a desired level.
Currently (0.2.2-alpha and 0.2.2) I'm getting 16 registers used per thread no matter for which compute/sm version I compile (I'm using the default maxrregcount value of 32). There were some variations with 0.2.1 though. The maximum speed gain was about 3.4% when I ran the test ranges with -m 64 or -m 72.
Here the output from the occupancy analyzer for a run on the short 42070e9 test range:
Occupancy analysis for kernel 'd_check_more_ns_32_fermi' for context 'Session3 : Device_0 : Context_0' :
Kernel details : Grid size: 112 x 1, Block size: 128 x 1 x 1
Register Ratio = 0.5 ( 16384 / 32768 ) [16 registers per thread]
Shared Memory Ratio = 0 ( 0 / 49152 ) [0 bytes per Block]
Active Blocks per SM = 8 : 8
Active threads per SM = 1024 : 1536
Occupancy = 0.666667 ( 32 / 48 )
Achieved occupancy = 0.666667 (on 7 SMs)
Occupancy limiting factor = Block-Size
Occupancy analysis for kernel 'd_start_ns' for context 'Session3 : Device_0 : Context_0' :
Kernel details : Grid size: 112 x 1, Block size: 128 x 1 x 1
Register Ratio = 0.5 ( 16384 / 32768 ) [16 registers per thread]
Shared Memory Ratio = 0 ( 0 / 49152 ) [0 bytes per Block]
Active Blocks per SM = 8 : 8
Active threads per SM = 1024 : 1536
Occupancy = 0.666667 ( 32 / 48 )
Achieved occupancy = 0.666667 (on 7 SMs)
Occupancy limiting factor = Block-Size
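The profiler's occupancy figures follow directly from the block size and per-thread register count. Here is a minimal sketch of the calculation; the per-SM limits used as defaults (32768 registers, 1536 threads, 8 blocks) are assumptions matching the Fermi-class numbers visible in the profiler output above:

```python
def occupancy(block_size, regs_per_thread,
              regs_per_sm=32768, max_threads_per_sm=1536, max_blocks_per_sm=8):
    """Estimate active blocks per SM and occupancy (Fermi-era limits assumed)."""
    limit_by_regs = regs_per_sm // (regs_per_thread * block_size)
    limit_by_threads = max_threads_per_sm // block_size
    active_blocks = min(max_blocks_per_sm, limit_by_regs, limit_by_threads)
    active_threads = active_blocks * block_size
    return active_blocks, active_threads / max_threads_per_sm

# The kernels above: 128-thread blocks, 16 registers per thread
blocks, occ = occupancy(block_size=128, regs_per_thread=16)
# With 16 registers the register limit (16 blocks) is not the bottleneck;
# the 8-blocks-per-SM cap is, giving 1024/1536 ~ 0.667 occupancy as reported.
```

This also explains the "Occupancy limiting factor = Block-Size" line: raising the block size (or block count per SM) would be needed to get past 1024 active threads, not lowering register usage further.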
____________
|
|
|
|
|
|
By the way: Overclocking the memory for this CUDA app is pretty much useless:
Global memory read throughput: 0.54 GB/s
Global memory write throughput: 0.49 GB/s
Global memory overall throughput: 1.03 GB/s
____________
|
|
|
|
|
Currently (0.2.2-alpha and 0.2.2) I'm getting 16 registers used per thread no matter for which compute/sm version I compile (I'm using the default maxrregcount value of 32). There were some variations with 0.2.1 though. The maximum speed gain was about 3.4% when I ran the test ranges with -m 64 or -m 72.
OK. There are some differences between the register-usage output of ptxas (at least for sm_10 compiles) and the information that the Compute Visual Profiler gives:
Compiling for sm_10:
1>ptxas info : Compiling entry function '_Z21d_check_more_ns_32_24PKyS0_PyjPhj' for 'sm_10'
1>ptxas info : Used 19 registers, 24+16 bytes smem, 56 bytes cmem[0], 24 bytes cmem[1]
1>ptxas info : Compiling entry function '_Z18d_check_more_ns_24PKyS0_PyjPhj' for 'sm_10'
1>ptxas info : Used 18 registers, 24+16 bytes smem, 56 bytes cmem[0], 24 bytes cmem[1]
1>ptxas info : Compiling entry function '_Z24d_check_more_ns_32_fermiPKyS0_PyjPhj' for 'sm_10'
1>ptxas info : Used 20 registers, 24+16 bytes smem, 56 bytes cmem[0], 20 bytes cmem[1]
1>ptxas info : Compiling entry function '_Z32d_check_more_ns_small_kmax_fermiPKyS0_PyjPhj' for 'sm_10'
1>ptxas info : Used 18 registers, 24+16 bytes smem, 56 bytes cmem[0], 20 bytes cmem[1]
1>ptxas info : Compiling entry function '_Z18d_check_more_ns_32PKyS0_PyjPhj' for 'sm_10'
1>ptxas info : Used 18 registers, 24+16 bytes smem, 56 bytes cmem[0], 20 bytes cmem[1]
1>ptxas info : Compiling entry function '_Z26d_check_more_ns_small_kmaxPKyS0_PyjPhj' for 'sm_10'
1>ptxas info : Used 17 registers, 24+16 bytes smem, 56 bytes cmem[0], 20 bytes cmem[1]
1>ptxas info : Compiling entry function '_Z15d_check_more_nsPKyS0_PyjPhj' for 'sm_10'
1>ptxas info : Used 19 registers, 24+16 bytes smem, 56 bytes cmem[0], 20 bytes cmem[1]
1>ptxas info : Compiling entry function '_Z10d_start_nsPKyPyS1_Ph' for 'sm_10'
1>ptxas info : Used 20 registers, 16+16 bytes smem, 56 bytes cmem[0], 20 bytes cmem[1]
Compiling for sm_20:
1>ptxas info : Compiling entry function '_Z21d_check_more_ns_32_24PKyS0_PyjPhj' for 'sm_20'
1>ptxas info : Used 16 registers, 56 bytes cmem[0], 60 bytes cmem[2], 4 bytes cmem[16]
1>ptxas info : Compiling entry function '_Z18d_check_more_ns_24PKyS0_PyjPhj' for 'sm_20'
1>ptxas info : Used 16 registers, 56 bytes cmem[0], 60 bytes cmem[2]
1>ptxas info : Compiling entry function '_Z24d_check_more_ns_32_fermiPKyS0_PyjPhj' for 'sm_20'
1>ptxas info : Used 16 registers, 56 bytes cmem[0], 60 bytes cmem[2]
1>ptxas info : Compiling entry function '_Z32d_check_more_ns_small_kmax_fermiPKyS0_PyjPhj' for 'sm_20'
1>ptxas info : Used 16 registers, 56 bytes cmem[0], 60 bytes cmem[2]
1>ptxas info : Compiling entry function '_Z18d_check_more_ns_32PKyS0_PyjPhj' for 'sm_20'
1>ptxas info : Used 15 registers, 56 bytes cmem[0], 60 bytes cmem[2]
1>ptxas info : Compiling entry function '_Z26d_check_more_ns_small_kmaxPKyS0_PyjPhj' for 'sm_20'
1>ptxas info : Used 16 registers, 56 bytes cmem[0], 60 bytes cmem[2]
1>ptxas info : Compiling entry function '_Z15d_check_more_nsPKyS0_PyjPhj' for 'sm_20'
1>ptxas info : Used 19 registers, 56 bytes cmem[0], 60 bytes cmem[2]
1>ptxas info : Compiling entry function '_Z10d_start_nsPKyPyS1_Ph' for 'sm_20'
1>ptxas info : Used 16 registers, 48 bytes cmem[0], 60 bytes cmem[2]
Nonetheless, the profiler reports 16 registers per thread when running the sm_10 code. I guess it's time to dig a little deeper into the documentation for an explanation of this discrepancy...
____________
|
|
|
Michael Goetz Volunteer moderator Project administrator
 Send message
Joined: 21 Jan 10 Posts: 13633 ID: 53948 Credit: 280,904,358 RAC: 40,710
                           
|
And now the hopefully-final version, PPSieve-CUDA 0.2.2, is out. This should be lighter on your CPU on 32-bit platforms, and just a little bit faster on non-Fermi cards.
Comparing the new 0.2.2 against the last one I tested, 0.2.0-alpha:
Core2Quad Q6600 @2.4, system is idle, W7ProX64
Detected GPU 0: GeForce GTX 280
Detected compute capability: 1.3
Detected 30 multiprocessors.
42070e9 to 42070030e6:
Sieve complete: 42070000000000 <= p < 42070030000000
count=955289,sum=0x2dbc17167afb6a8d
Elapsed time: 9.52 sec. (0.04 init + 9.49 sieve) at 3177154 p/sec.
Processor time: 0.47 sec. (0.05 init + 0.42 sieve) at 71572520 p/sec.
Average processor utilization: 1.34 (init), 0.04 (sieve)
27 factors
0.2.0:
Elapsed time: 17.33 sec. (0.06 init + 17.26 sieve) at 1746211 p/sec.
Processor time: 0.70 sec. (0.03 init + 0.67 sieve) at 44940937 p/sec.
42070e9 to 42070100e6:
Sieve complete: 42070000000000 <= p < 42070100000000
count=3185940,sum=0x4413a5b6a515d4c0
Elapsed time: 31.07 sec. (0.04 init + 31.03 sieve) at 3227191 p/sec.
Processor time: 1.11 sec. (0.03 init + 1.08 sieve) at 93030803 p/sec.
Average processor utilization: 0.74 (init), 0.03 (sieve)
68 factors
0.2.0:
Elapsed time: 56.84 sec. (0.04 init + 56.80 sieve) at 1763065 p/sec.
Processor time: 2.04 sec. (0.06 init + 1.98 sieve) at 50544292 p/sec.
20070e9 to 20070010e6
Sieve complete: 20070000000000 <= p < 20070010000000
count=326136,sum=0x5ad678173464405c
Elapsed time: 4.31 sec. (0.03 init + 4.27 sieve) at 2393592 p/sec.
Processor time: 0.30 sec. (0.03 init + 0.27 sieve) at 38550443 p/sec.
Average processor utilization: 0.92 (init), 0.06 (sieve)
13 factors
0.2.0:
Elapsed time: 9.27 sec. (0.03 init + 9.24 sieve) at 1106868 p/sec.
Processor time: 0.41 sec. (0.03 init + 0.37 sieve) at 27306521 p/sec.
I skipped testing one of the releases, so I don't know if the big speedup came from 0.2.1 or 0.2.2, but this version is almost twice as fast as 0.2.0.
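The "almost twice as fast" claim can be sanity-checked against the elapsed sieve times quoted above (0.2.0 vs. 0.2.2 on the three test ranges):

```python
# Elapsed times in seconds, (0.2.0, 0.2.2), taken from the results above
times = {
    "42070e9-42070030e6": (17.33, 9.52),
    "42070e9-42070100e6": (56.84, 31.07),
    "20070e9-20070010e6": (9.27, 4.31),
}
speedups = {r: round(old / new, 2) for r, (old, new) in times.items()}
# Roughly 1.82x, 1.83x and 2.15x respectively
```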
CPU utilization seems slightly higher in percentage terms, but since it's running for half the time, total CPU usage is lower. Either way, CPU usage is really low.
GPU utilization percentage, as measured by EVGA Precision, seems to have dropped slightly, from about 99% to 97-98%. (That's to be expected if the kernels got faster.) In terms of temperature, I actually couldn't get a good reading on the new version because it doesn't run long enough for the GPU to get up to operating temperature!
Bottom line: fantastic speedup on my GTX280. It's running about twice as fast, still with minimal CPU usage. It's like getting a free upgrade to a 460. :)
The GUI is still pretty much unusable (well, technically it's usable, but you just don't want to use it), so BOINC needs to be set not to use the GPU while I'm using the computer. That's the tradeoff for the high GPU efficiency, and it isn't any different from any of the earlier versions.
____________
My lucky number is 75898524288+1 |
|
|
HAmsty Volunteer tester
 Send message
Joined: 26 Dec 08 Posts: 132 ID: 33421 Credit: 12,510,712 RAC: 0
                
|
|
Intel Core 2 Quad Q9550 @3,3 GHz
Geforce 8800 GTS G80
1) -p42070e9 -P42070030e6 -k 1201 -K 9999 -N 2000000 -c 60
Elapsed time: 23.38 sec. (0.02 init + 23.36 sieve) at 1290555 p/sec.
Processor time: 0.30 sec. (0.03 init + 0.27 sieve) at 113492932 p/sec.
Average processor utilization: 2.00 (init), 0.01 (sieve)
2) -p20070e9 -P20070010e6 -k 1201 -K 9999 -N 2000000 -c 60
Elapsed time: 10.31 sec. (0.02 init + 10.30 sieve) at 992885 p/sec.
Processor time: 0.27 sec. (0.05 init + 0.22 sieve) at 46736530 p/sec.
Average processor utilization: 3.00 (init), 0.02 (sieve)
3) -p20070e9 -P20070010e6 -k 1201 -K 9999 -N 2000000 -c 60 -R
Elapsed time: 10.42 sec. (0.02 init + 10.41 sieve) at 982450 p/sec.
Processor time: 0.23 sec. (0.03 init + 0.20 sieve) at 50331648 p/sec.
Average processor utilization: 2.00 (init), 0.02 (sieve)
Factors match.
____________
|
|
|
|
|
Currently (0.2.2-alpha and 0.2.2) I'm getting 16 registers used per thread no matter for which compute/sm version I compile (I'm using the default maxrregcount value of 32). There were some variations with 0.2.1 though. The maximum speed gain was about 3.4% when I ran the test ranges with -m 64 or -m 72.
OK. There are some differences between the register-usage output of ptxas (at least for sm_10 compiles) and the information that the Compute Visual Profiler gives:
Compiling for sm_10:
1>ptxas info : Compiling entry function '_Z21d_check_more_ns_32_24PKyS0_PyjPhj' for 'sm_10'
1>ptxas info : Used 19 registers, 24+16 bytes smem, 56 bytes cmem[0], 24 bytes cmem[1]
1>ptxas info : Compiling entry function '_Z18d_check_more_ns_24PKyS0_PyjPhj' for 'sm_10'
1>ptxas info : Used 18 registers, 24+16 bytes smem, 56 bytes cmem[0], 24 bytes cmem[1]
1>ptxas info : Compiling entry function '_Z24d_check_more_ns_32_fermiPKyS0_PyjPhj' for 'sm_10'
1>ptxas info : Used 20 registers, 24+16 bytes smem, 56 bytes cmem[0], 20 bytes cmem[1]
1>ptxas info : Compiling entry function '_Z32d_check_more_ns_small_kmax_fermiPKyS0_PyjPhj' for 'sm_10'
1>ptxas info : Used 18 registers, 24+16 bytes smem, 56 bytes cmem[0], 20 bytes cmem[1]
1>ptxas info : Compiling entry function '_Z18d_check_more_ns_32PKyS0_PyjPhj' for 'sm_10'
1>ptxas info : Used 18 registers, 24+16 bytes smem, 56 bytes cmem[0], 20 bytes cmem[1]
1>ptxas info : Compiling entry function '_Z26d_check_more_ns_small_kmaxPKyS0_PyjPhj' for 'sm_10'
1>ptxas info : Used 17 registers, 24+16 bytes smem, 56 bytes cmem[0], 20 bytes cmem[1]
1>ptxas info : Compiling entry function '_Z15d_check_more_nsPKyS0_PyjPhj' for 'sm_10'
1>ptxas info : Used 19 registers, 24+16 bytes smem, 56 bytes cmem[0], 20 bytes cmem[1]
1>ptxas info : Compiling entry function '_Z10d_start_nsPKyPyS1_Ph' for 'sm_10'
1>ptxas info : Used 20 registers, 16+16 bytes smem, 56 bytes cmem[0], 20 bytes cmem[1]
Compiling for sm_20:
1>ptxas info : Compiling entry function '_Z21d_check_more_ns_32_24PKyS0_PyjPhj' for 'sm_20'
1>ptxas info : Used 16 registers, 56 bytes cmem[0], 60 bytes cmem[2], 4 bytes cmem[16]
1>ptxas info : Compiling entry function '_Z18d_check_more_ns_24PKyS0_PyjPhj' for 'sm_20'
1>ptxas info : Used 16 registers, 56 bytes cmem[0], 60 bytes cmem[2]
1>ptxas info : Compiling entry function '_Z24d_check_more_ns_32_fermiPKyS0_PyjPhj' for 'sm_20'
1>ptxas info : Used 16 registers, 56 bytes cmem[0], 60 bytes cmem[2]
1>ptxas info : Compiling entry function '_Z32d_check_more_ns_small_kmax_fermiPKyS0_PyjPhj' for 'sm_20'
1>ptxas info : Used 16 registers, 56 bytes cmem[0], 60 bytes cmem[2]
1>ptxas info : Compiling entry function '_Z18d_check_more_ns_32PKyS0_PyjPhj' for 'sm_20'
1>ptxas info : Used 15 registers, 56 bytes cmem[0], 60 bytes cmem[2]
1>ptxas info : Compiling entry function '_Z26d_check_more_ns_small_kmaxPKyS0_PyjPhj' for 'sm_20'
1>ptxas info : Used 16 registers, 56 bytes cmem[0], 60 bytes cmem[2]
1>ptxas info : Compiling entry function '_Z15d_check_more_nsPKyS0_PyjPhj' for 'sm_20'
1>ptxas info : Used 19 registers, 56 bytes cmem[0], 60 bytes cmem[2]
1>ptxas info : Compiling entry function '_Z10d_start_nsPKyPyS1_Ph' for 'sm_20'
1>ptxas info : Used 16 registers, 48 bytes cmem[0], 60 bytes cmem[2]
Nonetheless, the profiler reports 16 registers per thread when running the sm_10 code. I guess it's time to dig a little deeper into the documentation for an explanation of this discrepancy...
I have 0.2.2-alpha and get the following:
[roadrunner@rr022 pps]$ make cuda
nvcc -m64 -arch="sm_10" -lcuda -lcudart --ptxas-options=-v -O3 -DNDEBUG -D_REENTRANT -I/usr/local/cuda/include -I. -I.. -include "/usr/local/cuda/include/cuda_runtime.h" -o ppsieve-cuda-x86_64-linux ../main.c ../sieve.c ../clock.c ../putil.c cuda_sleep_memcpy.cu appcu.cu app.c factor_proth.c -lm -lpthread -lstdc++
ptxas info : Compiling entry function '__cuda_dummy_entry__' for 'sm_10'
ptxas info : Used 0 registers
ptxas info : Compiling entry function '_Z24d_check_more_ns_22_fermiPKmS0_PmjPhj' for 'sm_10'
ptxas info : Used 20 registers, 44+16 bytes smem, 56 bytes cmem[0], 20 bytes cmem[1]
ptxas info : Compiling entry function '_Z24d_check_more_ns_32_fermiPKmS0_PmjPhj' for 'sm_10'
ptxas info : Used 18 registers, 44+16 bytes smem, 56 bytes cmem[0], 20 bytes cmem[1]
ptxas info : Compiling entry function '_Z32d_check_more_ns_small_kmax_fermiPKmS0_PmjPhj' for 'sm_10'
ptxas info : Used 17 registers, 44+16 bytes smem, 56 bytes cmem[0], 20 bytes cmem[1]
ptxas info : Compiling entry function '_Z18d_check_more_ns_22PKmS0_PmjPhj' for 'sm_10'
ptxas info : Used 20 registers, 44+16 bytes smem, 56 bytes cmem[0], 20 bytes cmem[1]
ptxas info : Compiling entry function '_Z18d_check_more_ns_32PKmS0_PmjPhj' for 'sm_10'
ptxas info : Used 18 registers, 44+16 bytes smem, 56 bytes cmem[0], 20 bytes cmem[1]
ptxas info : Compiling entry function '_Z26d_check_more_ns_small_kmaxPKmS0_PmjPhj' for 'sm_10'
ptxas info : Used 17 registers, 44+16 bytes smem, 56 bytes cmem[0], 20 bytes cmem[1]
ptxas info : Compiling entry function '_Z15d_check_more_nsPKmS0_PmjPhj' for 'sm_10'
ptxas info : Used 19 registers, 44+16 bytes smem, 56 bytes cmem[0], 20 bytes cmem[1]
ptxas info : Compiling entry function '_Z10d_start_nsPKmPmS1_Ph' for 'sm_10'
ptxas info : Used 20 registers, 32+16 bytes smem, 56 bytes cmem[0], 20 bytes cmem[1]
[roadrunner@rr022 pps]$ make cuda
nvcc -m64 -arch="sm_12" -lcuda -lcudart --ptxas-options=-v -O3 -DNDEBUG -D_REENTRANT -I/usr/local/cuda/include -I. -I.. -include "/usr/local/cuda/include/cuda_runtime.h" -o ppsieve-cuda-x86_64-linux ../main.c ../sieve.c ../clock.c ../putil.c cuda_sleep_memcpy.cu appcu.cu app.c factor_proth.c -lm -lpthread -lstdc++
ptxas info : Compiling entry function '__cuda_dummy_entry__' for 'sm_12'
ptxas info : Used 0 registers
ptxas info : Compiling entry function '_Z24d_check_more_ns_22_fermiPKmS0_PmjPhj' for 'sm_12'
ptxas info : Used 20 registers, 44+16 bytes smem, 56 bytes cmem[0], 20 bytes cmem[1]
ptxas info : Compiling entry function '_Z24d_check_more_ns_32_fermiPKmS0_PmjPhj' for 'sm_12'
ptxas info : Used 18 registers, 44+16 bytes smem, 56 bytes cmem[0], 20 bytes cmem[1]
ptxas info : Compiling entry function '_Z32d_check_more_ns_small_kmax_fermiPKmS0_PmjPhj' for 'sm_12'
ptxas info : Used 17 registers, 44+16 bytes smem, 56 bytes cmem[0], 20 bytes cmem[1]
ptxas info : Compiling entry function '_Z18d_check_more_ns_22PKmS0_PmjPhj' for 'sm_12'
ptxas info : Used 20 registers, 44+16 bytes smem, 56 bytes cmem[0], 20 bytes cmem[1]
ptxas info : Compiling entry function '_Z18d_check_more_ns_32PKmS0_PmjPhj' for 'sm_12'
ptxas info : Used 18 registers, 44+16 bytes smem, 56 bytes cmem[0], 20 bytes cmem[1]
ptxas info : Compiling entry function '_Z26d_check_more_ns_small_kmaxPKmS0_PmjPhj' for 'sm_12'
ptxas info : Used 17 registers, 44+16 bytes smem, 56 bytes cmem[0], 20 bytes cmem[1]
ptxas info : Compiling entry function '_Z15d_check_more_nsPKmS0_PmjPhj' for 'sm_12'
ptxas info : Used 19 registers, 44+16 bytes smem, 56 bytes cmem[0], 20 bytes cmem[1]
ptxas info : Compiling entry function '_Z10d_start_nsPKmPmS1_Ph' for 'sm_12'
ptxas info : Used 20 registers, 32+16 bytes smem, 56 bytes cmem[0], 20 bytes cmem[1]
[roadrunner@rr022 pps]$ make cuda
nvcc -m64 -arch="sm_20" -lcuda -lcudart --ptxas-options=-v -O3 -DNDEBUG -D_REENTRANT -I/usr/local/cuda/include -I. -I.. -include "/usr/local/cuda/include/cuda_runtime.h" -o ppsieve-cuda-x86_64-linux ../main.c ../sieve.c ../clock.c ../putil.c cuda_sleep_memcpy.cu appcu.cu app.c factor_proth.c -lm -lpthread -lstdc++
ptxas info : Compiling entry function '__cuda_dummy_entry__' for 'sm_20'
ptxas info : Used 2 registers, 32 bytes cmem[0]
ptxas info : Compiling entry function '_Z24d_check_more_ns_22_fermiPKmS0_PmjPhj' for 'sm_20'
ptxas info : Used 22 registers, 76 bytes cmem[0], 56 bytes cmem[2], 4 bytes cmem[16]
ptxas info : Compiling entry function '_Z24d_check_more_ns_32_fermiPKmS0_PmjPhj' for 'sm_20'
ptxas info : Used 21 registers, 76 bytes cmem[0], 56 bytes cmem[2]
ptxas info : Compiling entry function '_Z32d_check_more_ns_small_kmax_fermiPKmS0_PmjPhj' for 'sm_20'
ptxas info : Used 22 registers, 76 bytes cmem[0], 56 bytes cmem[2]
ptxas info : Compiling entry function '_Z18d_check_more_ns_22PKmS0_PmjPhj' for 'sm_20'
ptxas info : Used 22 registers, 76 bytes cmem[0], 56 bytes cmem[2]
ptxas info : Compiling entry function '_Z18d_check_more_ns_32PKmS0_PmjPhj' for 'sm_20'
ptxas info : Used 20 registers, 76 bytes cmem[0], 56 bytes cmem[2]
ptxas info : Compiling entry function '_Z26d_check_more_ns_small_kmaxPKmS0_PmjPhj' for 'sm_20'
ptxas info : Used 22 registers, 76 bytes cmem[0], 56 bytes cmem[2]
ptxas info : Compiling entry function '_Z15d_check_more_nsPKmS0_PmjPhj' for 'sm_20'
ptxas info : Used 22 registers, 76 bytes cmem[0], 56 bytes cmem[2]
ptxas info : Compiling entry function '_Z10d_start_nsPKmPmS1_Ph' for 'sm_20'
ptxas info : Used 17 registers, 64 bytes cmem[0], 56 bytes cmem[2] |
|
|
|
|
|
If you have a Fermi based GPU you might peel off a few extra seconds of the runtime per WU by simply running two or more WUs concurrently.
Here are the results of some quick tests I did yesterday (with 0.2.2 on my GTX 460 - not at stock clock, but the settings were the same for all tests). All times are given in mm:ss notation.
Concurrent workunits: 1 - Runtimes from 02:16 up to 02:19 - 02:16.0 to 02:19.0 per WU
Concurrent workunits: 2 - Runtimes from 04:24 up to 04:26 - 02:12.0 to 02:13.0 per WU
Concurrent workunits: 3 - Runtimes from 06:33 up to 06:37 - 02:11.0 to 02:12.3 per WU
Concurrent workunits: 4 - Runtimes from 08:44 up to 08:52 - 02:11.0 to 02:13.0 per WU
The WUs started and stopped nearly synchronously. A staggered start of the WUs should reduce the runtime even more, because at the end of every WU there is a phase of a few seconds where the GPU load is 0%.
Here is a quick result of my first test with two WUs running concurrently, one of them started with a few seconds' delay. I did not time the first set because one WU ran alone for the first few seconds:
Concurrent workunits: 2 - Runtimes from 04:21 up to 04:22 - 02:10.5 to 02:11.0 per WU
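The per-WU figures above are simply the batch runtime divided by the number of concurrent WUs. A small helper for reproducing them (the example times are taken from the measurements above):

```python
def per_wu_seconds(mm_ss, n_wus):
    """Convert a batch runtime in mm:ss notation to seconds per concurrent WU."""
    minutes, seconds = mm_ss.split(":")
    return (int(minutes) * 60 + int(seconds)) / n_wus

# 2 concurrent WUs finishing in 4:24 average 132 s (2:12) per WU,
# versus ~136-139 s when running one WU at a time.
print(per_wu_seconds("04:24", 2))
```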
____________
|
|
|
mackerel Volunteer tester
 Send message
Joined: 2 Oct 08 Posts: 2533 ID: 29980 Credit: 492,244,883 RAC: 16,046
                            
|
|
My first time attempting to run this as I only got the GTS450 recently. Factors match those in top post.
GTS 450 OC @ 888/1000/1776
Q6600 stock @ 2.4 GHz, running boinc LLR in background
Win7-64
Driver 260.89
ppsieve version cuda-0.2.2 (testing)
ppsieve-cuda-x86-windows.exe -p42070e9 -P42070030e6 -k 1201 -K 9999 -N 2000000 -c 60
Elapsed time: 7.57 sec. (0.03 init + 7.54 sieve) at 4000642 p/sec.
Processor time: 0.53 sec. (0.05 init + 0.48 sieve) at 62337413 p/sec.
Average processor utilization: 1.51 (init), 0.06 (sieve)
ppsieve-cuda-x86-windows.exe -p20070e9 -P20070010e6 -k 1201 -K 9999 -N 2000000 -c 60
Elapsed time: 3.23 sec. (0.02 init + 3.21 sieve) at 3186730 p/sec.
Processor time: 0.28 sec. (0.03 init + 0.25 sieve) at 40959836 p/sec.
Average processor utilization: 1.49 (init), 0.08 (sieve)
ppsieve-cuda-x86-windows.exe -p249871e9 -P2498711e8 -k 1201 -K 9999 -N 2000000 -c 60
Elapsed time: 23.54 sec. (0.07 init + 23.47 sieve) at 4266619 p/sec.
Processor time: 1.03 sec. (0.08 init + 0.95 sieve) at 105231585 p/sec.
Average processor utilization: 1.05 (init), 0.04 (sieve)
|
|
|
Ken_g6 Volunteer developer
 Send message
Joined: 4 Jul 06 Posts: 921 ID: 3110 Credit: 218,950,902 RAC: 3,798
                          
|
|
OK, people, kiia found a bug in v0.2.2. So we need to do some extra testing for v0.2.2a, which I've just released. In particular, can someone with two or more nVIDIA GPUs please try the standard test with "--device 1"?
Thanks! Hopefully, this will be the next version in BOINC.
____________
|
|
|
Scott Brown Volunteer moderator Project administrator Volunteer tester Project scientist
 Send message
Joined: 17 Oct 05 Posts: 2258 ID: 1178 Credit: 10,867,108,087 RAC: 11,866,263
                                        
|
OK, people, kiia found a bug in v0.2.2. So we need to do some extra testing for v0.2.2a, which I've just released. In particular, can someone with two or more nVIDIA GPUs please try the standard test with "--device 1"?
Thanks! Hopefully, this will be the next version in BOINC.
Ah shoot...my dual rig is at the office! If no one gets to it over the weekend, I'll check it on Monday morning.
____________
141941*2^4299438-1 is prime!
|
|
|
Ken_g6 Volunteer developer
 Send message
Joined: 4 Jul 06 Posts: 921 ID: 3110 Credit: 218,950,902 RAC: 3,798
                          
|
|
This only showed up in v0.2.2, not even in v0.2.2-alpha! So I imagine your results should be fine.
____________
|
|
|
Scott Brown Volunteer moderator Project administrator Volunteer tester Project scientist
 Send message
Joined: 17 Oct 05 Posts: 2258 ID: 1178 Credit: 10,867,108,087 RAC: 11,866,263
                                        
|
This only showed up in v0.2.2, not even in v0.2.2-alpha! So I imagine your results should be fine.
Oops...wasn't clear...meant that I can't test the fix for you until Monday. :)
____________
141941*2^4299438-1 is prime!
|
|
|
Ken_g6 Volunteer developer
 Send message
Joined: 4 Jul 06 Posts: 921 ID: 3110 Credit: 218,950,902 RAC: 3,798
                          
|
|
OK.
Kiia's tested it, and it seems to be OK. :)
____________
|
|
|
|
|
|
>ppsieve-cuda-boinc-x86-windows.exe
-p42070e9 -P42070030e6 -k 1201 -K9999 -N2000000 -c 60 --device 1
ppsieve version cuda-0.2.2a (testing)
Sieve started: 42070000000000 <= p < 42070030000000
Thread 0 starting
Detected GPU 1: GeForce GTX 295
Detected compute capability: 1.3
Detected 30 multiprocessors.
Thread 0 completed
Sieve complete: 42070000000000 <= p < 42070030000000
count=955289,sum=0x2dbc17167afb6a8d
Elapsed time: 8.47 sec. (0.03 init + 8.45 sieve) at 3568485 p/sec.
Processor time: 0.33 sec. (0.05 init + 0.28 sieve) at 107358779 p/sec.
Average processor utilization: 1.73 (init), 0.03 (sieve)
09:21:31 (4516): called boinc_finish
09:24:55 (7120): Can't set up shared mem: -1. Will run in standalone mode.
Sieve started: 42070000000000 <= p < 42070030000000
Thread 0 starting
Detected GPU 0: GeForce GTX 480
Detected compute capability: 2.0
Detected 15 multiprocessors.
Thread 0 completed
Sieve complete: 42070000000000 <= p < 42070030000000
count=955289,sum=0x2dbc17167afb6a8d
Elapsed time: 2.87 sec. (0.03 init + 2.84 sieve) at 10607516 p/sec.
Processor time: 0.45 sec. (0.03 init + 0.42 sieve) at 71572690 p/sec.
Average processor utilization: 1.11 (init), 0.15 (sieve)
09:24:58 (7120): called boinc_finish
09:25:45 (6620): Can't set up shared mem: -1. Will run in standalone mode.
Sieve started: 42070000000000 <= p < 42070030000000
Thread 0 starting
Detected GPU 2: GeForce GTX 295
Detected compute capability: 1.3
Detected 30 multiprocessors.
Thread 0 completed
Sieve complete: 42070000000000 <= p < 42070030000000
count=955289,sum=0x2dbc17167afb6a8d
Elapsed time: 8.34 sec. (0.03 init + 8.31 sieve) at 3625563 p/sec.
Processor time: 0.33 sec. (0.03 init + 0.30 sieve) at 101708356 p/sec.
Average processor utilization: 1.11 (init), 0.04 (sieve)
09:25:54 (6620): called boinc_finish
It is a go in BOINC. |
|
|
rroonnaalldd Volunteer developer Volunteer tester
 Send message
Joined: 3 Jul 09 Posts: 1213 ID: 42893 Credit: 34,634,263 RAC: 0
                 
|
|
C:\cuda\primegrid>ppsieve-cuda-x86-windows.exe -p1186491e9 -P1186492e9 -k 1201 -K 9999 -N 2000000 -c 60 -m 16
ppsieve version cuda-0.2.2a (testing)
nstart=86, nstep=37
nstep changed to 32
ppsieve initialized: 1201 <= k <= 9999, 86 <= n < 2000000
Sieve started: 1186491000000000 <= p < 1186492000000000
Thread 0 starting
Detected GPU 0: GeForce GT 240
Detected compute capability: 1.2
Detected 12 multiprocessors.
1186491022263817 | 6941*2^233525+1
1186491080146633 | 4395*2^1291390+1
p=1186491084672513, 1.411M p/sec, 0.01 CPU cores, 8.5% done. ETA 18 Oct 12:16
1186491095011007 | 6337*2^7988+1
p=1186491170131457, 1.391M p/sec, 0.01 CPU cores, 17.0% done. ETA 18 Oct 12:16
1186491191752583 | 2241*2^901315+1
1186491214124857 | 3423*2^1464822+1
p=1186491255590401, 1.389M p/sec, 0.01 CPU cores, 25.6% done. ETA 18 Oct 12:16
1186491289770991 | 1857*2^1984791+1
p=1186491341049345, 1.389M p/sec, 0.01 CPU cores, 34.1% done. ETA 18 Oct 12:16
1186491356920087 | 9471*2^1760528+1
1186491384738463 | 2525*2^1518847+1
p=1186491426508289, 1.389M p/sec, 0.01 CPU cores, 42.7% done. ETA 18 Oct 12:16
1186491451160011 | 1887*2^794562+1
1186491453475903 | 3635*2^1262489+1
1186491483260371 | 5675*2^560553+1
1186491508531129 | 9239*2^82853+1
p=1186491511967233, 1.389M p/sec, 0.01 CPU cores, 51.2% done. ETA 18 Oct 12:16
1186491511971901 | 9795*2^1854582+1
1186491512556677 | 3021*2^1745789+1
1186491522406637 | 4739*2^561617+1
1186491583230223 | 1481*2^23921+1
1186491594148577 | 7941*2^821073+1
p=1186491597426177, 1.389M p/sec, 0.01 CPU cores, 59.7% done. ETA 18 Oct 12:16
p=1186491682885121, 1.389M p/sec, 0.01 CPU cores, 68.3% done. ETA 18 Oct 12:16
1186491695054123 | 7941*2^863421+1
1186491696289171 | 6759*2^1711227+1
1186491701397437 | 6485*2^911463+1
p=1186491768344065, 1.389M p/sec, 0.01 CPU cores, 76.8% done. ETA 18 Oct 12:16
1186491773175427 | 9039*2^1690489+1
1186491781207897 | 8925*2^339084+1
1186491784054111 | 3859*2^679438+1
p=1186491853803009, 1.389M p/sec, 0.01 CPU cores, 85.4% done. ETA 18 Oct 12:16
1186491879046813 | 6889*2^1550574+1
1186491888090709 | 3273*2^1456432+1
p=1186491939261953, 1.389M p/sec, 0.01 CPU cores, 93.9% done. ETA 18 Oct 12:16
1186491941241479 | 3339*2^326445+1
1186491961620941 | 9493*2^1403080+1
Thread 0 completed
Waiting for threads to exit
Sieve complete: 1186491000000000 <= p < 1186492000000000
Found 27 factors
count=28805195,sum=0xbece431c19d67201
Elapsed time: 721.03 sec. (0.22 init + 720.81 sieve) at 1387433 p/sec.
Processor time: 5.95 sec. (0.23 init + 5.72 sieve) at 174877265 p/sec.
Average processor utilization: 1.07 (init), 0.01 (sieve)
C:\cuda\primegrid>ppsieve-cuda-boinc-x86-windows.exe -p1186491e9 -P1186492e9 -k 1201 -K 9999 -N 2000000 -c 60 -m 16
ppsieve version cuda-0.2.2a (testing)
nstart=86, nstep=37
nstep changed to 32
ppsieve initialized: 1201 <= k <= 9999, 86 <= n < 2000000
1186491022263817 | 6941*2^233525+1
1186491080146633 | 4395*2^1291390+1
1186491095011007 | 6337*2^7988+1
1186491191752583 | 2241*2^901315+1
1186491214124857 | 3423*2^1464822+1
1186491289770991 | 1857*2^1984791+1
1186491356920087 | 9471*2^1760528+1
1186491384738463 | 2525*2^1518847+1
1186491451160011 | 1887*2^794562+1
1186491453475903 | 3635*2^1262489+1
1186491483260371 | 5675*2^560553+1
1186491508531129 | 9239*2^82853+1
1186491511971901 | 9795*2^1854582+1
1186491512556677 | 3021*2^1745789+1
1186491522406637 | 4739*2^561617+1
1186491583230223 | 1481*2^23921+1
1186491594148577 | 7941*2^821073+1
1186491695054123 | 7941*2^863421+1
1186491696289171 | 6759*2^1711227+1
1186491701397437 | 6485*2^911463+1
1186491773175427 | 9039*2^1690489+1
1186491781207897 | 8925*2^339084+1
1186491784054111 | 3859*2^679438+1
1186491879046813 | 6889*2^1550574+1
1186491888090709 | 3273*2^1456432+1
1186491941241479 | 3339*2^326445+1
1186491961620941 | 9493*2^1403080+1
Found 27 factors
12:21:55 (700): Can't open init data file - running in standalone mode
Sieve started: 1186491000000000 <= p < 1186492000000000
Thread 0 starting
Detected GPU 0: GeForce GT 240
Detected compute capability: 1.2
Detected 12 multiprocessors.
Thread 0 completed
Sieve complete: 1186491000000000 <= p < 1186492000000000
count=28805195,sum=0xbece431c19d67201
Elapsed time: 715.84 sec. (0.22 init + 715.63 sieve) at 1397491 p/sec.
Processor time: 5.92 sec. (0.23 init + 5.69 sieve) at 175838129 p/sec.
Average processor utilization: 1.07 (init), 0.01 (sieve)
12:33:51 (700): called boinc_finish
____________
Best wishes. Knowledge is power. by jjwhalen
|
|
|
|
|
|
OK so how do we get the GTX460 to get PrimeGrid CUDA work? I get this message in red:
10/29/2010 6:24:58 PM PrimeGrid Message from server: _("No work available for the applications you have selected. Please check your project preferences on the web site.")
I've searched through this enormous thread and found discussions of various app_info.xml files (some running under Linux VMs), talk of a server-side problem caused by changes to the BOINC server software, lots of posts about improved speeds, and many arcane command-line applications with folks posting their various speed improvements.
However, there is no clear and succinct solution.
I need a post with clear, precise, and unambiguous instructions so I can get my Nvidia GTX460 card to work on CUDA applications for PrimeGrid.
Thanks.
____________
|
|
|
LookAS Volunteer tester Send message
Joined: 19 Apr 08 Posts: 38 ID: 21649 Credit: 349,920,761 RAC: 35,293
In short, what you need:
this cudart.dll
this ppsieve-cuda application
the app_info.xml from this post.
Then you should be able to do CUDA work, but nothing else, unless you edit that .xml.
OK so how do we get the GTX460 to get PrimeGrid CUDA work? I get this message in red:
10/29/2010 6:24:58 PM PrimeGrid Message from server: _("No work available for the applications you have selected. Please check your project preferences on the web site.")
I've searched through this enormous thread and found discussions of various app_info.xml files (some running under Linux VMs), talk of a server-side problem caused by changes to the BOINC server software, lots of posts about improved speeds, and many arcane command-line applications with folks posting their various speed improvements.
However, there is no clear and succinct solution.
I need a post with clear, precise, and unambiguous instructions so I can get my Nvidia GTX460 card to work on CUDA applications for PrimeGrid.
Thanks.
The same problem with a GTX 460.
Unset it; no success.
BM 6.10.58
Nvidia 258.96
Win7 Pro 64bit
rroonnaalldd Volunteer developer Volunteer tester
 Send message
Joined: 3 Jul 09 Posts: 1213 ID: 42893 Credit: 34,634,263 RAC: 0
You need this app, PPSieve CUDA 0.2.2, and "Proth Prime Search (Sieve)" enabled in your project settings.
____________
Best wishes. Knowledge is power. by jjwhalen
You need this app, PPSieve CUDA 0.2.2, and "Proth Prime Search (Sieve)" enabled in your project settings.
Forget it. Computation error -197.
I don't feel like dealing with it any more. Thanks anyway.
rroonnaalldd Volunteer developer Volunteer tester
 Send message
Joined: 3 Jul 09 Posts: 1213 ID: 42893 Credit: 34,634,263 RAC: 0
You need this app, PPSieve CUDA 0.2.2, and "Proth Prime Search (Sieve)" enabled in your project settings.
Forget it. Computation error -197.
I don't feel like dealing with it any more. Thanks anyway.
"Menipe" posted a complete app_info.xml for a GTX470 in the thread App_info file.
...and a GTX470 is really fast. Take a look at hostid=1241166...
____________
Best wishes. Knowledge is power. by jjwhalen
Menipe Volunteer tester Send message
Joined: 2 Jan 08 Posts: 235 ID: 17041 Credit: 103,536,180 RAC: 20,871
The app_info.xml I posted is set up for a Windows x64 machine; it contains most, but not all, applications. Windows x86, Linux, and OS X machines would be different.
How to set things up with app_info.xml is not well documented or made clear anywhere that I have been able to find. I have put it together by trial and error.
You also need to have the .exe files that you reference in the app_info.xml file in the local PrimeGrid directory. Usually this is ...BOINC\Projects\www.primegrid.com
____________
I successfully run tpsieve-cuda-x86_64-linux but when I run boinc I still get computation errors.
I am successfully running work units from GPUGrid (although they run very, very slowly and I have not let them run to completion)
Fedora Linux 13 x86 64 bit
nVidia driver 260.19.12
CUDA toolkit 3.2.12 Linux 64 bit
[ted@linux-1023 tpsieve-cuda-0.2.2c]$ ./tpsieve-cuda-x86_64-linux -p42070e9 -P42070030e6 -k 1201 -K 9999 -N 2000000 -c 60 -t 2
tpsieve version cuda-0.2.2c (testing)
Compiled Oct 19 2010 with GCC 4.3.3
nstart=76, nstep=32
Didn't change nstep from 31
tpsieve initialized: 1203 <= k <= 9999, 76 <= n < 2000000
Sieve started: 42070000000000 <= p < 42070030000000
Thread 0 starting
Thread 1 starting
Detected GPU 1: GeForce 8400 GS
Detected compute capability: 1.1
Detected 2 multiprocessors.
Detected GPU 0: GeForce 8400 GS
Detected compute capability: 1.1
Detected 2 multiprocessors.
42070003511309 | 6057*2^1043547+1
42070005645821 | 3633*2^119620-1
42070008458437 | 7095*2^1422761-1
42070010190569 | 5625*2^1903125+1
42070012209011 | 9405*2^360411-1
42070013521999 | 1965*2^404493+1
42070013970587 | 7143*2^1462422+1
42070013989247 | 5037*2^838603+1
p=42070014942209, 249.0K p/sec, 0.01 CPU cores, 49.8% done. ETA 31 Oct 14:29
42070017332953 | 6237*2^1916994+1
42070018235321 | 1941*2^363948+1
42070019117111 | 2523*2^999263-1
42070024242289 | 8319*2^1792800-1
42070024532551 | 4311*2^1690093+1
42070024936837 | 5679*2^1726142+1
42070024995961 | 9111*2^1707153+1
42070026719239 | 9981*2^629165-1
42070027452199 | 1323*2^854008+1
42070028029061 | 8205*2^1394191-1
42070029006583 | 5943*2^663870+1
p=42070029360129, 240.3K p/sec, 0.01 CPU cores, 97.9% done. ETA 31 Oct 14:29
Thread 0 completed
Waiting for threads to exit
Thread 1 completed
Sieve complete: 42070000000000 <= p < 42070030000000
Found 19 factors
count=955289,sum=0x2dbc17167afb6a8d
Elapsed time: 124.28 sec. (0.02 init + 124.27 sieve) at 242594 p/sec.
Processor time: 0.91 sec. (0.01 init + 0.89 sieve) at 33839612 p/sec.
Average processor utilization: 0.97 (init), 0.01 (sieve)
I ran the test from a terminal window. I then started boinc from the same terminal window.
[ted@linux-1023 BOINC]$ ./boinc
31-Oct-2010 14:35:41 [---] Starting BOINC client version 6.10.58 for x86_64-pc-linux-gnu
31-Oct-2010 14:35:41 [---] log flags: file_xfer, sched_ops, task
31-Oct-2010 14:35:41 [---] Libraries: libcurl/7.18.0 OpenSSL/0.9.8g zlib/1.2.3 c-ares/1.5.1
31-Oct-2010 14:35:41 [---] Data directory: /home/ted/BOINC
31-Oct-2010 14:35:41 [---] Processor: 2 GenuineIntel Intel(R) Core(TM)2 CPU 6700 @ 2.66GHz [Family 6 Model 15 Stepping 6]
31-Oct-2010 14:35:41 [---] Processor: 4.00 MB cache
31-Oct-2010 14:35:41 [---] Processor features: fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good aperfmperf pni dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm la
31-Oct-2010 14:35:41 [---] OS: Linux: 2.6.34.7-61.fc13.x86_64
31-Oct-2010 14:35:41 [---] Memory: 1.96 GB physical, 3.94 GB virtual
31-Oct-2010 14:35:41 [---] Disk: 239.83 GB total, 227.15 GB free
31-Oct-2010 14:35:41 [---] Local time is UTC -7 hours
31-Oct-2010 14:35:41 [---] NVIDIA GPU 0: GeForce 8400 GS (driver version unknown, CUDA version 3020, compute capability 1.1, 255MB, 29 GFLOPS peak)
31-Oct-2010 14:35:41 [---] NVIDIA GPU 1: GeForce 8400 GS (driver version unknown, CUDA version 3020, compute capability 1.1, 256MB, 29 GFLOPS peak)
31-Oct-2010 14:35:41 [PrimeGrid] URL http://www.primegrid.com/; Computer ID 166866; resource share 100
31-Oct-2010 14:35:41 [Collatz Conjecture] URL http://boinc.thesonntags.com/collatz/; Computer ID 44419; resource share 100
31-Oct-2010 14:35:41 [rosetta@home] URL http://boinc.bakerlab.org/rosetta/; Computer ID 1360852; resource share 100
31-Oct-2010 14:35:41 [---] General prefs: from http://bam.boincstats.com/ (last modified 31-Oct-2010 08:49:55)
31-Oct-2010 14:35:41 [---] Host location: none
31-Oct-2010 14:35:41 [---] General prefs: using your defaults
31-Oct-2010 14:35:41 [---] Reading preferences override file
31-Oct-2010 14:35:41 [---] Preferences:
31-Oct-2010 14:35:41 [---] max memory usage when active: 1004.04MB
31-Oct-2010 14:35:41 [---] max memory usage when idle: 1807.26MB
31-Oct-2010 14:35:41 [---] max disk usage: 10.00GB
31-Oct-2010 14:35:41 [---] max download rate: 49999995 bytes/sec
31-Oct-2010 14:35:41 [---] max upload rate: 49999995 bytes/sec
31-Oct-2010 14:35:41 [---] (to change preferences, visit the web site of an attached project, or select Preferences in the Manager)
31-Oct-2010 14:35:41 [---] Not using a proxy
Initialization completed
31-Oct-2010 14:36:42 [PrimeGrid] work fetch resumed by user
31-Oct-2010 14:36:43 [PrimeGrid] update requested by user
31-Oct-2010 14:36:46 [PrimeGrid] Sending scheduler request: Requested by user.
31-Oct-2010 14:36:46 [PrimeGrid] Requesting new tasks for CPU and GPU
31-Oct-2010 14:36:48 [PrimeGrid] Scheduler request completed: got 21 new tasks
31-Oct-2010 14:36:50 [PrimeGrid] Starting pps_sr2sieve_3253734_0
31-Oct-2010 14:36:50 [PrimeGrid] Starting task pps_sr2sieve_3253734_0 using pps_sr2sieve version 130
31-Oct-2010 14:36:50 [PrimeGrid] Starting pps_sr2sieve_3253733_0
31-Oct-2010 14:36:50 [PrimeGrid] Starting task pps_sr2sieve_3253733_0 using pps_sr2sieve version 130
31-Oct-2010 14:36:57 [PrimeGrid] Computation for task pps_sr2sieve_3253734_0 finished
etc.
Every single work unit fails with "Computation error". While they are running, boincmgr shows a message about using 0.11 CPUs + 1.00 NVIDIA GPUs (device 0 or 1) for the work units.
Thoughts?
Thanks,
Ted
Ken_g6 Volunteer developer
 Send message
Joined: 4 Jul 06 Posts: 921 ID: 3110 Credit: 218,950,902 RAC: 3,798
Cuda error: getting factors found: unknown error
Oh, that error. I have no idea what causes it; the information returned is too sparse.
Um, what version are your drivers?
____________
Cuda error: getting factors found: unknown error
Oh, that error. I have no idea what causes it; the information returned is too sparse.
Um, what version are your drivers?
Fedora Linux 13 x86 64 bit
nVidia driver 260.19.12
CUDA toolkit 3.2.12 Linux 64 bit
Ken_g6 Volunteer developer
 Send message
Joined: 4 Jul 06 Posts: 921 ID: 3110 Credit: 218,950,902 RAC: 3,798
OK, your driver looks recent. Try re-running the test, but replace -t 2 with --device 0. Then try it again with --device 1.
____________
OK, your driver looks recent. Try re-running the test, but replace -t 2 with --device 0. Then try it again with --device 1.
Done.
[Ted@linux-1023 ppsieve]$ ./ppsieve-cuda-x86_64-linux -p42070e9 -P42070030e6 -k 1201 -K 9999 -N 2000000 -c 60 --device 0
ppsieve version cuda-0.2.2-alpha (testing)
Compiled Oct 11 2010 with GCC 4.3.3
nstart=76, nstep=32
ppsieve initialized: 1201 <= k <= 9999, 76 <= n < 2000000
Sieve started: 42070000000000 <= p < 42070030000000
Thread 0 starting
Detected GPU 0: GeForce 8400 GS
Detected compute capability: 1.1
Detected 2 multiprocessors.
42070000070587 | 9475*2^197534+1
42070000198537 | 3373*2^1046686+1
42070003101727 | 4207*2^1054290+1
42070003511309 | 6057*2^1043547+1
42070006307657 | 1513*2^1771812+1
42070006388603 | 2059*2^1816098+1
42070007177519 | 5437*2^1121592+1
42070007396759 | 7339*2^1803518+1
42070008823897 | 4639*2^952018+1
42070008858187 | 2893*2^317690+1
p=42070009175041, 152.9K p/sec, 0.01 CPU cores, 30.6% done. ETA 01 Nov 19:59
42070010190569 | 5625*2^1903125+1
42070011430123 | 3821*2^1406279+1
42070012301263 | 1957*2^1185814+1
42070013521999 | 1965*2^404493+1
42070013970587 | 7143*2^1462422+1
42070013989247 | 5037*2^838603+1
42070017332953 | 6237*2^1916994+1
p=42070018350081, 152.9K p/sec, 0.00 CPU cores, 61.2% done. ETA 01 Nov 19:59
42070018235321 | 1941*2^363948+1
42070019542387 | 8587*2^1703626+1
42070023987581 | 9811*2^318944+1
42070024339237 | 9257*2^1170495+1
42070024532551 | 4311*2^1690093+1
42070024936837 | 5679*2^1726142+1
42070024995961 | 9111*2^1707153+1
42070026021997 | 4039*2^1819590+1
p=42070027262977, 148.5K p/sec, 0.00 CPU cores, 90.9% done. ETA 01 Nov 19:59
42070027452199 | 1323*2^854008+1
42070029006583 | 5943*2^663870+1
Thread 0 completed
Waiting for threads to exit
Sieve complete: 42070000000000 <= p < 42070030000000
Found 27 factors
count=955289,sum=0x2dbc17167afb6a8d
Elapsed time: 199.56 sec. (0.01 init + 199.55 sieve) at 151073 p/sec.
Processor time: 0.91 sec. (0.01 init + 0.89 sieve) at 33726115 p/sec.
Average processor utilization: 1.02 (init), 0.00 (sieve)
[Ted@linux-1023 ppsieve]$ ./ppsieve-cuda-x86_64-linux -p42070e9 -P42070030e6 -k 1201 -K 9999 -N 2000000 -c 60 --device 1
ppsieve version cuda-0.2.2-alpha (testing)
Compiled Oct 11 2010 with GCC 4.3.3
nstart=76, nstep=32
ppsieve initialized: 1201 <= k <= 9999, 76 <= n < 2000000
Sieve started: 42070000000000 <= p < 42070030000000
Thread 0 starting
Detected GPU 1: GeForce 8400 GS
Detected compute capability: 1.1
Detected 2 multiprocessors.
42070000070587 | 9475*2^197534+1
42070000198537 | 3373*2^1046686+1
42070003101727 | 4207*2^1054290+1
42070003511309 | 6057*2^1043547+1
42070006307657 | 1513*2^1771812+1
42070006388603 | 2059*2^1816098+1
42070007177519 | 5437*2^1121592+1
42070007396759 | 7339*2^1803518+1
42070008823897 | 4639*2^952018+1
42070008858187 | 2893*2^317690+1
p=42070009437185, 157.3K p/sec, 0.01 CPU cores, 31.5% done. ETA 01 Nov 20:13
42070010190569 | 5625*2^1903125+1
42070011430123 | 3821*2^1406279+1
42070012301263 | 1957*2^1185814+1
42070013521999 | 1965*2^404493+1
42070013970587 | 7143*2^1462422+1
42070013989247 | 5037*2^838603+1
42070017332953 | 6237*2^1916994+1
42070018235321 | 1941*2^363948+1
p=42070018874369, 157.3K p/sec, 0.00 CPU cores, 62.9% done. ETA 01 Nov 20:13
42070019542387 | 8587*2^1703626+1
42070023987581 | 9811*2^318944+1
42070024339237 | 9257*2^1170495+1
42070024532551 | 4311*2^1690093+1
42070024936837 | 5679*2^1726142+1
42070024995961 | 9111*2^1707153+1
42070026021997 | 4039*2^1819590+1
42070027452199 | 1323*2^854008+1
p=42070028311553, 157.3K p/sec, 0.00 CPU cores, 94.4% done. ETA 01 Nov 20:13
42070029006583 | 5943*2^663870+1
Thread 0 completed
Waiting for threads to exit
Sieve complete: 42070000000000 <= p < 42070030000000
Found 27 factors
count=955289,sum=0x2dbc17167afb6a8d
Elapsed time: 191.44 sec. (0.01 init + 191.42 sieve) at 157487 p/sec.
Processor time: 0.91 sec. (0.01 init + 0.90 sieve) at 33650822 p/sec.
Average processor utilization: 1.02 (init), 0.00 (sieve)
[Ted@linux-1023 ppsieve]$
I note that ppsieve-cuda-boinc-x86_64-linux is 384K while primegrid_ppsieve_1.30_x86_64-pc-linux-gnu__cuda23 is 412K.
Is there a log where I can find more information about the computation error? Other suggestions?
Ted
I swapped out two NVIDIA 256MB cards for one 512MB card and the problem went away.
I swapped out two NVIDIA 256MB cards for one 512MB card and the problem went away.
Correction: the swap was for one 1GB card. I still get occasional computation-error failures (probably less than 10%).
____________
This -m parameter, how big can it be?
Is 128 OK?
____________
Polish National Team
This -m parameter, how big can it be?
Is 128 OK?
At least, it should work. You may experience performance hits if you increase
it too much. The default value (16) should be OK with drivers up to 260.99.
I installed devdriver_3.2 (261.00) yesterday and am seeing increased
runtimes and a drop in GPU load from around 98% or 99% to slightly over
80% while crunching ppsieve-cuda WUs. I had to change the -m parameter to
64 to mitigate the speed loss.
With tpsieve (used for PSA's PPR3M work) I've seen an increase in the
p/sec. value of around 40% with the -m 64 setting and devdriver_3.2
261.00. I expected increasing p/sec. values in higher ranges, but this
surprised me a little.
tpsieve speeds (Driver: 260.63 for the 5.4T range, 260.99 WHQL otherwise):
5400G-5500G -> 8.9M p/sec.
19400G-20000G -> 9.5M p/sec.
30000G-32000G -> 9.8M p/sec.
I expected speeds slightly above 10M p/sec. for the 74000G-75000G range,
but certainly not this:
74000G-75000G -> around 13.6M p/sec.
This looks more like the results of a GTX 470...
____________
LookAS Volunteer tester Send message
Joined: 19 Apr 08 Posts: 38 ID: 21649 Credit: 349,920,761 RAC: 35,293
I don't know what that is, but your result is this high because of nstep=33. From 43000G+, nstep has risen to 33 (from 32) and processing speed has risen as well.
My processing speed below 43000G is around 14.4M p/sec and above 43000G is around 18.5M p/sec.
Edit: I'm also seeing a GPU usage drop at 43000G-44000G, from a constant 98% to ~86-90%, along with a temperature drop. Don't know why.
My processing speed below 43000G is around 14.4M p/sec and above 43000G is around 18.5M p/sec.
Interesting. The absolute distance between the 460s and the 470s does not change significantly in the higher ranges.
____________
LookAS Volunteer tester Send message
Joined: 19 Apr 08 Posts: 38 ID: 21649 Credit: 349,920,761 RAC: 35,293
My processing speed below 43000G is around 14.4M p/sec and above 43000G is around 18.5M p/sec.
Interesting. The absolute distance between the 460s and the 470s does not change significantly in the higher ranges.
It gets faster with higher Gs. When I start the 74000G-75000G range I'm getting this:
p=74005793644545, 19.94M p/sec, 0.16 CPU cores, 0.6% done. ETA 05 Nov 06:47
Edit: 20.28M p/sec when I leave the PC alone.
My processing speed below 43000G is around 14.4M p/sec and above 43000G is around 18.5M p/sec.
Interesting. The absolute distance between the 460s and the 470s does not change significantly in the higher ranges.
It gets faster with higher Gs. When I start the 74000G-75000G range I'm getting this:
p=74005793644545, 19.94M p/sec, 0.16 CPU cores, 0.6% done. ETA 05 Nov 06:47
Edit: 20.28M p/sec when I leave the PC alone.
The 45% disadvantage (50% if I factor in the lower clock speed of your 470) of
my 460 against an overclocked GTX 470 with 33% more shaders is not as bad
as it initially looked when I saw the first test results in this thread. That is more
in line with the expectations based on the raw numbers.
____________
Ken_g6 Volunteer developer
 Send message
Joined: 4 Jul 06 Posts: 921 ID: 3110 Credit: 218,950,902 RAC: 3,798
FYI, when PPSieve hits nstep=32 or higher, it uses a very slightly different routine. Because 32 is a "magic" number for computers, and this application in particular, it's a little faster. It won't get much faster than that for higher P's; any increase beyond there is due only to the fact that there are fewer high P's to test.
TPSieve subtracts one from PPSieve's nstep calculation in order to check both signs at once. Is it actually printing 33 (and not later saying that it's reduced to 32)? If so, that's most likely a minor printing bug. If it were actually testing with 33, you'd see a huge slowdown.
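As a rough sketch of what that per-p test looks like (my own illustration with hypothetical names, not ppsieve's actual kernel code): p divides k*2^n+1 exactly when k ≡ -2^(-n) (mod p), so the sieve can walk n upward one modular multiplication at a time, and jumping a whole window of nstep values costs a single multiplication by the precomputed constant 2^(-nstep) mod p.

```python
# Sketch of the factor test the sieve performs (illustrative only; the
# function and parameter names are mine, not ppsieve's).
# p | k*2^n + 1  <=>  k*2^n = -1 (mod p)  <=>  k = -2^(-n) (mod p)
def find_factors(p, kmin, kmax, nmin, nmax, nstep=32):
    """Return (k, n) pairs with p | k*2^n + 1, kmin <= k <= kmax, nmin <= n < nmax."""
    giant = pow(2, -nstep, p)       # 2^(-nstep) mod p: advances n by a whole window
    baby = (p + 1) // 2             # 2^(-1) mod p for odd p: advances n by one
    found = []
    x = pow(2, -nmin, p)            # 2^(-nmin) mod p
    for n0 in range(nmin, nmax, nstep):
        y = x
        for j in range(min(nstep, nmax - n0)):
            k = (-y) % p            # candidate k = -2^(-(n0+j)) mod p
            if kmin <= k <= kmax:
                found.append((k, n0 + j))
            y = y * baby % p        # baby step: n -> n + 1
        x = x * giant % p           # giant step: n -> n + nstep
    return found
```

The real application hashes the baby steps and runs the giant steps on the GPU; with nstep = 32 the window lines up with 32-bit word arithmetic, which is the "magic" referred to above. Run over the thread's test range, this sketch reproduces factors such as 42070000070587 | 9475*2^197534+1.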
____________
LookAS Volunteer tester Send message
Joined: 19 Apr 08 Posts: 38 ID: 21649 Credit: 349,920,761 RAC: 35,293
It is really printing nstep=33:
tpsieve-cuda-x86-windows.exe -p48000G -P49000G -k5 -K9999 -n2M -N3M -ffppr3M_48000G-49000G.txt -M2 -q
tpsieve version cuda-0.2.2c (testing)
nstart=2000000, nstep=33
tpsieve initialized: 5 <= k <= 9999, 2000000 <= n < 3000000
Sieve started: 48000000000000 <= p < 49000000000000
Thread 0 starting
Detected GPU 0: GeForce GTX 470
Detected compute capability: 2.0
Detected 14 multiprocessors.
p=48589610090497, 20.53M p/sec, 0.09 CPU cores, 59.0% done. ETA 05 Nov 13:54
The same thing here:
tpsieve-cuda-x86-windows -p74000G -P75000G -k 5 -K 9999 -n2M -N3M -ffppr3M_74000G-75000G.txt -M2 -q -m 64
tpsieve version cuda-0.2.2c (testing)
nstart=2000000, nstep=33
tpsieve initialized: 5 <= k <= 9999, 2000000 <= n < 3000000
Sieve started: 74000000000000 <= p < 75000000000000
Resuming from checkpoint p=74817731731457 in tpcheck74000G.txt
Thread 0 starting
Detected GPU 0: GeForce GTX 460
Detected compute capability: 2.1
Detected 7 multiprocessors.
p=74823216570369, 13.03M p/sec, 0.03 CPU cores, 82.3% done. ETA 05 Nov 17:26
____________
Some tests with Linux:
On the positive side (BOINC 6.10.58 amd64):
Sa 06 Nov 2010 16:43:36 CET NVIDIA GPU 0: GeForce GTX 460 (driver version unknown, CUDA version 3020, compute capability 2.1, 1023MB, 650 GFLOPS peak)
At least the displayed number of GFLOPS is closer to reality ;)
On the negative side:
- CoolBits currently doesn't work. I hope NVIDIA will fix this soon.
- The performance drops even further if the CPU is under full load (NFS@Home 15e tasks).
I modified the app_info.xml so that the GPU task blocks one CPU core, but even
then it is more than ten seconds slower. In addition, I forgot to change -m 64
back from my Windows version of the app_info.xml and the application crashed. I
have yet to dig deeper to find the cause of the crash (I need to install the toolkits).
____________
Ken_g6 Volunteer developer
 Send message
Joined: 4 Jul 06 Posts: 921 ID: 3110 Credit: 218,950,902 RAC: 3,798
In addition, I forgot to change -m 64
back from my Windows version of the app_info.xml and the application crashed. I
have yet to dig deeper to find the cause of the crash (I need to install the toolkits).
That is interesting. I just confirmed that this happens, but only with the BOINC app, not the non-BOINC app. Although when I run the non-BOINC app with -m 64, I can't run BOINC apps of any kind. I may have to get the 3.2 toolkits myself.
____________
Scott Brown Volunteer moderator Project administrator Volunteer tester Project scientist
 Send message
Joined: 17 Oct 05 Posts: 2258 ID: 1178 Credit: 10,867,108,087 RAC: 11,866,263
Just to add to the list of small remaining issues with the application... :)
On systems with multiple GPUs, the "<avg_ncpus>" and "<max_ncpus>" settings default to some value over 0.5. This results in one full core being absorbed managing the GPU applications on machines with two or more GPUs. Setting this value lower allows the GPUs and all CPU cores to run with no apparent problems, but downloading new tasks resets it to the original higher value. Could this be adjusted in the next application release, or is there some reason why it cannot be fixed (other than with an app_info file, which is not an optimal solution)?
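For anyone going the app_info.xml route in the meantime, the fields in question look roughly like this. This is a sketch only: the app name and version number mirror the "pps_sr2sieve version 130" seen in the logs above, but the file name and the 0.05 value are illustrative and must match the files actually present in your BOINC\Projects\www.primegrid.com directory.

```xml
<!-- Fragment of an <app_version> entry in app_info.xml; values are illustrative. -->
<app_version>
    <app_name>pps_sr2sieve</app_name>
    <version_num>130</version_num>
    <avg_ncpus>0.05</avg_ncpus>   <!-- tell BOINC the GPU app needs ~5% of a core -->
    <max_ncpus>0.05</max_ncpus>
    <coproc>
        <type>CUDA</type>
        <count>1</count>
    </coproc>
    <file_ref>
        <file_name>ppsieve-cuda-boinc-x86_64-linux</file_name>
        <main_program/>
    </file_ref>
</app_version>
```

The downside, as noted above, is that an app_info.xml locks the client to exactly the applications it lists.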
____________
141941*2^4299438-1 is prime!
Just to add to the list of small remaining issues with the application... :)
On systems with multiple GPUs, the "<avg_ncpus>" and "<max_ncpus>" settings default to some value over 0.5. This results in one full core being absorbed managing the GPU applications on machines with two or more GPUs.
You could try to work around that by adding
<ncpus>3</ncpus>
to your cc_config.xml.
Scott Brown Volunteer moderator Project administrator Volunteer tester Project scientist
 Send message
Joined: 17 Oct 05 Posts: 2258 ID: 1178 Credit: 10,867,108,087 RAC: 11,866,263
You could try to work around that by adding
<ncpus>3</ncpus>
to your cc_config.xml.
That seems to have no effect.
____________
141941*2^4299438-1 is prime!
Michael Goetz Volunteer moderator Project administrator
 Send message
Joined: 21 Jan 10 Posts: 13633 ID: 53948 Credit: 280,904,358 RAC: 40,710
You could try to work around that by adding
<ncpus>3</ncpus>
to your cc_config.xml.
That seems to have no effect.
Just to be clear, that's for a dual-core machine. Make the value one greater than the actual number of CPU cores.
Even if this works, it has the side effect that BOINC will run one extra CPU app if any of your GPUs are idle for whatever reason.
BTW, for reference, ppsieve-CUDA is showing as needing 0.75 CPUs on my system, while the actual value would be closer to 0.0005 CPUs.
Since this app never uses a lot of CPU, and it's probably impossible to put enough GPUs in one computer to ever use a whole CPU core to service the GPUs, is it possible on the server side to hard-wire this number to something like 1%?
____________
My lucky number is 75898^524288+1
rroonnaalldd Volunteer developer Volunteer tester
 Send message
Joined: 3 Jul 09 Posts: 1213 ID: 42893 Credit: 34,634,263 RAC: 0
Since this app never uses a lot of CPU, and it's probably impossible to put enough GPUs in one computer to ever use a whole CPU core to service the GPUs, is it possible on the server side to hard-wire this number to something like 1%?
Should be possible. When your client asks for work, it also receives an XML file which includes all the needed apps and command switches.
____________
Best wishes. Knowledge is power. by jjwhalen
Ken_g6 Volunteer developer
 Send message
Joined: 4 Jul 06 Posts: 921 ID: 3110 Credit: 218,950,902 RAC: 3,798
As part of the transition to TPSieve for PPS Sieve, I've upgraded both PPSieve-CUDA and TPSieve-CUDA to v0.2.2d. There are two changes:
- TPSieve now has a dummy -T option, which does nothing but prevent PPSieve from running work meant for TPSieve.
- Both apps now have a mechanism to try to avoid that mystery "unknown error". I really have no idea whether it will work; but I don't think it should hurt anything.
Now that I have an nVIDIA GPU, I can test these things. :) But more testers are always better. In particular, anyone who regularly gets those "unknown error"s should test this and let me know what happens; particularly if you start getting Computation Errors instead.
____________
I would suggest building a new tpsieve version which accepts a new command-line parameter, for example "-T", and
sending it with tpsieve WUs. On computers still using ppsieve via the app_info.xml route, ppsieve would not recognize this
option and would exit with an error message. This should result in a calculation error and a missing result file; even if not, the
CPU runtime will be around 0 seconds ;)
- TPSieve now has a dummy -T option, which does nothing but prevent PPSieve from running work meant for TPSieve.
Thanks for implementing the workaround that fast. For the future, I would suggest making the output of the various BOINC
versions of the app distinguishable, for example by adding a header line at the top of the result file, so that the validator can
immediately recognize result files from an old, invalid, buggy, or simply wrong sieving application.
____________
rroonnaalldd Volunteer developer Volunteer tester
 Send message
Joined: 3 Jul 09 Posts: 1213 ID: 42893 Credit: 34,634,263 RAC: 0
boinc@vmware2k-3:~/Cuda/ppsieve$ ./ppsieve-cuda-x86_64-linux -p42070e9 -P42070030e6 -k 1201 -K 9999 -N 2000000 -c 60 --device 0
ppsieve version cuda-0.2.2d (testing)
Compiled Nov 8 2010 with GCC 4.3.3
nstart=76, nstep=32
ppsieve initialized: 1201 <= k <= 9999, 76 <= n < 2000000
Sieve started: 42070000000000 <= p < 42070030000000
Thread 0 starting
Detected GPU 0: GeForce GT 240
Detected compute capability: 1.2
Detected 12 multiprocessors.
42070000070587 | 9475*2^197534+1
42070000198537 | 3373*2^1046686+1
42070003101727 | 4207*2^1054290+1
42070003511309 | 6057*2^1043547+1
42070006307657 | 1513*2^1771812+1
42070006388603 | 2059*2^1816098+1
42070007177519 | 5437*2^1121592+1
42070007396759 | 7339*2^1803518+1
42070008823897 | 4639*2^952018+1
42070008858187 | 2893*2^317690+1
42070010190569 | 5625*2^1903125+1
42070011430123 | 3821*2^1406279+1
42070012301263 | 1957*2^1185814+1
42070013521999 | 1965*2^404493+1
42070013970587 | 7143*2^1462422+1
42070013989247 | 5037*2^838603+1
42070017332953 | 6237*2^1916994+1
42070018235321 | 1941*2^363948+1
42070019542387 | 8587*2^1703626+1
42070023987581 | 9811*2^318944+1
42070024339237 | 9257*2^1170495+1
42070024532551 | 4311*2^1690093+1
42070024936837 | 5679*2^1726142+1
42070024995961 | 9111*2^1707153+1
42070026021997 | 4039*2^1819590+1
42070027452199 | 1323*2^854008+1
42070029006583 | 5943*2^663870+1
Thread 0 completed
Waiting for threads to exit
Sieve complete: 42070000000000 <= p < 42070030000000
Found 27 factors
count=955289,sum=0x2dbc17167afb6a8d
Elapsed time: 21.56 sec. (0.02 init + 21.54 sieve) at 1399598 p/sec.
Processor time: 0.41 sec. (0.02 init + 0.39 sieve) at 77692714 p/sec.
Average processor utilization: 1.22 (init), 0.02 (sieve)
boinc@vmware2k-3:~/Cuda/tpsieve$ ./tpsieve-cuda-x86_64-linux -p42070e9 -P42070030e6 -k 1201 -K 9999 -N 2000000 -c 60 --device 0
tpsieve version cuda-0.2.2d (testing)
Compiled Nov 8 2010 with GCC 4.3.3
nstart=76, nstep=32
Didn't change nstep from 31
tpsieve initialized: 1203 <= k <= 9999, 76 <= n < 2000000
Sieve started: 42070000000000 <= p < 42070030000000
Thread 0 starting
Detected GPU 0: GeForce GT 240
Detected compute capability: 1.2
Detected 12 multiprocessors.
42070003511309 | 6057*2^1043547+1
42070005645821 | 3633*2^119620-1
42070008458437 | 7095*2^1422761-1
42070010190569 | 5625*2^1903125+1
42070012209011 | 9405*2^360411-1
42070013521999 | 1965*2^404493+1
42070013970587 | 7143*2^1462422+1
42070013989247 | 5037*2^838603+1
42070017332953 | 6237*2^1916994+1
42070018235321 | 1941*2^363948+1
42070019117111 | 2523*2^999263-1
42070024242289 | 8319*2^1792800-1
42070024532551 | 4311*2^1690093+1
42070024936837 | 5679*2^1726142+1
42070024995961 | 9111*2^1707153+1
42070026719239 | 9981*2^629165-1
42070027452199 | 1323*2^854008+1
42070028029061 | 8205*2^1394191-1
42070029006583 | 5943*2^663870+1
Thread 0 completed
Waiting for threads to exit
Sieve complete: 42070000000000 <= p < 42070030000000
Found 19 factors
count=955289,sum=0x2dbc17167afb6a8d
Elapsed time: 29.33 sec. (0.02 init + 29.32 sieve) at 1028306 p/sec.
Processor time: 0.37 sec. (0.02 init + 0.35 sieve) at 85638284 p/sec.
Average processor utilization: 1.21 (init), 0.01 (sieve)
____________
Best wishes. Knowledge is power. by jjwhalen
Ken_g6 Volunteer developer
 Send message
Joined: 4 Jul 06 Posts: 921 ID: 3110 Credit: 218,950,902 RAC: 3,798
FYI, if you want TPSieve to find all the factors PPSieve does, you need to add the flag "-M2". This is because TPSieve was designed originally to find only twin primes, and they only occur where k=3 (mod 6).
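A quick way to see why (my own illustration, not tpsieve source): among the three consecutive integers k*2^n-1, k*2^n, k*2^n+1, exactly one is divisible by 3, so for both outer values to be prime (and greater than 3) the middle one must carry the factor of 3, forcing 3 | k; combined with k being odd, that means k ≡ 3 (mod 6).

```python
# Illustrative check (not tpsieve source): which odd k can possibly yield
# twin primes k*2^n - 1 and k*2^n + 1?  If 3 divides either neighbour,
# that neighbour is composite for values > 3, so such k are hopeless.
def can_be_twin(k: int, n: int) -> bool:
    m = k * 2 ** n
    return (m - 1) % 3 != 0 and (m + 1) % 3 != 0

# 2^n mod 3 alternates between 2 and 1, so n = 1 and n = 2 cover all n >= 1.
survivors = {k for k in range(1, 100, 2) if can_be_twin(k, 1) or can_be_twin(k, 2)}
# survivors is exactly the set of odd multiples of 3, i.e. k = 3 (mod 6).
```

That is consistent with the tpsieve run earlier in the thread without -M2: every factor it reported has k ≡ 3 (mod 6), while factors like 9475*2^197534+1 (9475 ≡ 1 mod 6) only appear once -M2 is given.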
____________
rroonnaalldd Volunteer developer Volunteer tester
 Send message
Joined: 3 Jul 09 Posts: 1213 ID: 42893 Credit: 34,634,263 RAC: 0
boinc@vmware2k-3:~/Cuda/tpsieve$ ./tpsieve-cuda-x86_64-linux -p42070e9 -P42070030e6 -k 1201 -K 9999 -N 2000000 -c 60 -M2 --device 0
tpsieve version cuda-0.2.2d (testing)
Compiled Nov 8 2010 with GCC 4.3.3
nstart=76, nstep=32
Didn't change nstep from 31
tpsieve initialized: 1201 <= k <= 9999, 76 <= n < 2000000
Sieve started: 42070000000000 <= p < 42070030000000
Thread 0 starting
Detected GPU 0: GeForce GT 240
Detected compute capability: 1.2
Detected 12 multiprocessors.
42070000070587 | 9475*2^197534+1
42070000154219 | 6023*2^934790-1
42070000198537 | 3373*2^1046686+1
42070001803331 | 5237*2^486598-1
42070003062431 | 7465*2^1994555-1
42070003101727 | 4207*2^1054290+1
42070003511309 | 6057*2^1043547+1
42070005645821 | 3633*2^119620-1
42070006307657 | 1513*2^1771812+1
42070006388603 | 2059*2^1816098+1
42070007177519 | 5437*2^1121592+1
42070007396759 | 7339*2^1803518+1
42070007733361 | 7007*2^1691614-1
42070008458437 | 7095*2^1422761-1
42070008823897 | 4639*2^952018+1
42070008858187 | 2893*2^317690+1
42070010190569 | 5625*2^1903125+1
42070011430123 | 3821*2^1406279+1
42070012209011 | 9405*2^360411-1
42070012301263 | 1957*2^1185814+1
42070013521999 | 1965*2^404493+1
42070013970587 | 7143*2^1462422+1
42070013989247 | 5037*2^838603+1
42070016416499 | 4571*2^466510-1
42070017332953 | 6237*2^1916994+1
42070018235321 | 1941*2^363948+1
42070019117111 | 2523*2^999263-1
42070019542387 | 8587*2^1703626+1
42070021901227 | 6589*2^1149693-1
42070023987581 | 9811*2^318944+1
42070024242289 | 8319*2^1792800-1
42070024339237 | 9257*2^1170495+1
42070024532551 | 4311*2^1690093+1
42070024936837 | 5679*2^1726142+1
42070024995961 | 9111*2^1707153+1
42070026021997 | 4039*2^1819590+1
42070026719239 | 9981*2^629165-1
42070027452199 | 1323*2^854008+1
42070028029061 | 8205*2^1394191-1
42070029006583 | 5943*2^663870+1
Thread 0 completed
Waiting for threads to exit
Sieve complete: 42070000000000 <= p < 42070030000000
Found 40 factors
count=955289,sum=0x2dbc17167afb6a8d
Elapsed time: 29.33 sec. (0.02 init + 29.31 sieve) at 1028544 p/sec.
Processor time: 0.40 sec. (0.02 init + 0.38 sieve) at 79328251 p/sec.
Average processor utilization: 0.98 (init), 0.01 (sieve)
____________
Best wishes. Knowledge is power. by jjwhalen
Ken_g6 Volunteer developer
 Send message
Joined: 4 Jul 06 Posts: 921 ID: 3110 Credit: 218,950,902 RAC: 3,798
nstart=76, nstep=32
Didn't change nstep from 31
This is a known bug in printing nstep. Basically, nstep gets lowered by 1 for TPSieve, but there's nothing printed that tells you this. I decided not to bother digging around in several files to change this, since it doesn't affect the results, and at higher P's it says nstep is changed.
Edit: This is what The Pragmatic Programmer calls a "broken window". So I've fixed it in the source, and if there's another version necessary, the fix will be included.
____________
Edit: This is what The Pragmatic Programmer calls a "broken window". So I've fixed it in the source, and if there's another version necessary, the fix will be included.
will we ever be able to fix base 0 or 1? ;)
Since this app never uses a lot of CPU, and it's probably impossible to put enough GPUs in one computer to ever use a whole CPU core to service the GPUs, is it possible on the server side to hard-wire this number to something like 1%?
Should be possible. When your client asks for work, it also gets an XML/INI file which includes all the needed apps and command switches.
Is this a fix we can look forward to. I'm also having the problem of a CPU core running idle while keeping two GPUs busy here.
Thanks!
Is this a fix we can look forward to. I'm also having the problem of a CPU core running idle while keeping two GPUs busy here.
Thanks!
My system with i7-980x and nVidia GTX295 has the same Problem (reporting 0.68 CPUs + 1 NVIDIA GPUs for each of the two GPU cores)
My workaround is a "lie" in my cc_config.xml
<cc_config>
<log_flags>
<cpu_sched>1</cpu_sched>
<task>1</task>
<file_xfer>1</file_xfer>
</log_flags>
<options>
<report_results_immediately>1</report_results_immediately>
<ncpus>13</ncpus>
</options>
</cc_config>
The "Gulftown" CPU has 6 physical cores + 6 HT cores = 12 logical cores.
I tell the BOINC-Manager that it has 13 cores - 12 for the CPU jobs and one to feed the GPUs.
Works for me
____________
Member of Crunching Family
http://crunching-family.at/
Is this a fix we can look forward to. I'm also having the problem of a CPU core running idle while keeping two GPUs busy here.
Thanks!
My system with i7-980x and nVidia GTX295 has the same Problem (reporting 0.68 CPUs + 1 NVIDIA GPUs for each of the two GPU cores)
My workaround is a "lie" in my cc_config.xml
<cc_config>
<log_flags>
<cpu_sched>1</cpu_sched>
<task>1</task>
<file_xfer>1</file_xfer>
</log_flags>
<options>
<report_results_immediately>1</report_results_immediately>
<ncpus>13</ncpus>
</options>
</cc_config>
The "Gulftown" CPU has 6 physical cores + 6 HT cores = 12 logical cores.
I tell the BOINC-Manager that it has 13 cores - 12 for the CPU jobs and one to feed the GPUs.
Works for me
I tried doing this on my machine, but what happened as a result was that it would also run an extra instance of a CPU app. Is there any way to stop this from happening?
____________
STE\/E Volunteer tester
 Send message
Joined: 10 Aug 05 Posts: 573 ID: 103 Credit: 3,630,330,192 RAC: 0
                     
If you put 13 in the cc_config file then that's how many tasks are going to run: 13.
I think you would be better off putting 11 or 10, depending on whether it's a single or dual card box ...
____________
Is this a fix we can look forward to. I'm also having the problem of a CPU core running idle while keeping two GPUs busy here.
Thanks!
My system with i7-980x and nVidia GTX295 has the same Problem (reporting 0.68 CPUs + 1 NVIDIA GPUs for each of the two GPU cores)
My workaround is a "lie" in my cc_config.xml
<cc_config>
<log_flags>
<cpu_sched>1</cpu_sched>
<task>1</task>
<file_xfer>1</file_xfer>
</log_flags>
<options>
<report_results_immediately>1</report_results_immediately>
<ncpus>13</ncpus>
</options>
</cc_config>
The "Gulftown" CPU has 6 physical cores + 6 HT cores = 12 logical cores.
I tell the BOINC-Manager that it has 13 cores - 12 for the CPU jobs and one to feed the GPUs.
Works for me
I tried doing this on my machine, but what happened as a result was that it would also run an extra instance of a CPU app. Is there any way to stop this from happening?
This trick works only if the sum of the CPU usage that is reported for the GPU tasks by the BOINC Manager is greater than (or equal to) one and less than two.
0.68 CPUs + 0.68 CPUs = 1.36, i.e. greater than 1 CPU core.
In this case the BOINC Manager assigns one CPU core completely to the GPU(s), and
increasing the number of available cores (<ncpus>n+1</ncpus>) works for me as long as two GPU threads are running.
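The rule described above can be sketched in a few lines; this is a hypothetical helper illustrating the post's reasoning, not BOINC source code:

```python
def ncpus_to_report(logical_cores, gpu_task_cpu_fractions):
    """Sketch of the rule described above (not BOINC source code):
    when the CPU fractions of the running GPU tasks add up to at least
    one full core, the BOINC Manager sets a whole core aside for them,
    so report that many extra cores in <ncpus> to keep all CPU slots busy."""
    reserved = int(sum(gpu_task_cpu_fractions))  # 1 <= sum < 2 -> one core reserved
    return logical_cores + reserved

# Gulftown i7-980X (12 logical cores) feeding a GTX 295 (two GPU tasks):
print(ncpus_to_report(12, [0.68, 0.68]))  # -> 13
```

With a single GPU task reporting less than one core (e.g. 0.4 CPUs), no core is reserved and no "lie" is needed.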
You may also read this thread: http://www.primegrid.com/forum_thread.php?id=2671&sort=6
____________
Member of Crunching Family
http://crunching-family.at/
STE\/E Volunteer tester
 Send message
Joined: 10 Aug 05 Posts: 573 ID: 103 Credit: 3,630,330,192 RAC: 0
                     
You Guys are in Big Trouble now, I got my 8700M GT laptop to run the CUDA WUs ... lol
http://www.primegrid.com/show_host_detail.php?hostid=171692
____________
Vato Volunteer tester
 Send message
Joined: 2 Feb 08 Posts: 796 ID: 18447 Credit: 382,504,347 RAC: 225,569
                       
about 3 times as fast as my laptop with a G103M!
http://www.primegrid.com/show_host_detail.php?hostid=156321
____________
FWIW, I cannot get any work on my OSX/CUDA machine. I get:
Wed Dec 1 20:02:30 2010 PrimeGrid Message from server: _("No work available for the applications you have selected. Please check your project preferences on the web site.")
Perhaps this is a similar situation to the ATI cards, where an app_info.xml is required?
Could someone please post an app_info.xml file, as well as the link to d/l the app?
____________
Reno, NV
Zombie67, you need to update your driver to CUDA 3.2 (due to a bug in the Mac driver) otherwise you will not receive work.
I'm currently getting work for my OSX/CUDA system.
Cheers
- Iain
Zombie67, you need to update your driver to CUDA 3.2 (due to a bug in the Mac driver) otherwise you will not receive work.
I'm currently getting work for my OSX/CUDA system.
Cheers
- Iain
Dang. Then it won't work with collatz.
____________
Reno, NV
Ken_g6 Volunteer developer
 Send message
Joined: 4 Jul 06 Posts: 921 ID: 3110 Credit: 218,950,902 RAC: 3,798
                          
I've just uploaded v0.2.3 of PPSieve-CUDA and TPSieve-CUDA at the usual locations. These should work around the problems with the 260 drivers. I won't say "fix" because every time there's an error they re-run that section of work.
I've also greatly simplified the section of the code that was causing those errors, so my code shouldn't be the problem. Since I still get errors that are worked around, the 260 drivers may run my code slower than some other versions. But it would be good to get tests on 260 and non-260 drivers, Fermi and non-Fermi, and Windows and Linux, to make sure I didn't make any other mistakes.
(Say, did anyone have problems with frequent Computation Errors on Windows in the first place?)
____________
Could someone please post an app_info.xml file, as well as the link to d/l the app?
http://www.primegrid.com/download/
Win (ati)
<app_info>
<app>
<name>pps_sr2sieve</name>
<user_friendly_name>Proth Prime Search (Sieve)</user_friendly_name>
</app>
<file_info>
<name>primegrid_tpsieve_1.35_windows_intelx86__ati13ati.exe</name>
<executable/>
</file_info>
<app_version>
<app_name>pps_sr2sieve</app_name>
<version_num>135</version_num>
<plan_class>ati13ati</plan_class>
<avg_ncpus>0.05</avg_ncpus>
<max_ncpus>1</max_ncpus>
<flops>1.0e11</flops>
<coproc>
<type>ATI</type>
<count>1</count>
</coproc>
<cmdline></cmdline>
<file_ref>
<file_name>primegrid_tpsieve_1.35_windows_intelx86__ati13ati.exe</file_name>
<main_program/>
</file_ref>
</app_version>
</app_info>
Linux (cuda)
<app_info>
<app>
<name>pps_sr2sieve</name>
<user_friendly_name>Proth Prime Search (Sieve)</user_friendly_name>
</app>
<file_info>
<name>primegrid_tpsieve_1.35_x86_64-pc-linux-gnu__cuda23</name>
<nbytes>425568.000000</nbytes>
<max_nbytes>0.000000</max_nbytes>
<status>1</status>
<executable/>
</file_info>
<file_info>
<name>libcudart.so.2</name>
<nbytes>260840.000000</nbytes>
<max_nbytes>0.000000</max_nbytes>
<status>1</status>
</file_info>
<app_version>
<app_name>pps_sr2sieve</app_name>
<version_num>135</version_num>
<platform>x86_64-pc-linux-gnu</platform>
<avg_ncpus>0.050000</avg_ncpus>
<max_ncpus>0.050000</max_ncpus>
<flops>100000000000.000000</flops>
<plan_class>cuda23</plan_class>
<api_version>6.2.18</api_version>
<file_ref>
<file_name>primegrid_tpsieve_1.35_x86_64-pc-linux-gnu__cuda23</file_name>
<main_program/>
</file_ref>
<file_ref>
<file_name>libcudart.so.2</file_name>
<open_name>libcudart.so.2</open_name>
</file_ref>
<coproc>
<type>CUDA</type>
<count>1.000000</count>
</coproc>
</app_version>
</app_info>
Zombie, Hope this helps
Steve
____________
From the High Desert in New Mexico
I've just uploaded v0.2.3 of PPSieve-CUDA and TPSieve-CUDA at the usual locations. These should work around the problems with the 260 drivers. I won't say "fix" because every time there's an error they re-run that section of work.
I've also greatly simplified the section of the code that was causing those errors, so my code shouldn't be the problem. Since I still get errors that are worked around, the 260 drivers may run my code slower than some other versions. But it would be good to get tests on 260 and non-260 drivers, Fermi and non-Fermi, and Windows and Linux, to make sure I didn't make any other mistakes.
(Say, did anyone have problems with frequent Computation Errors on Windows in the first place?)
Thank you very much for your efforts. I'll conduct some tests as soon as possible.
____________
I've just uploaded v0.2.3 of PPSieve-CUDA and TPSieve-CUDA at the usual locations. These should work around the problems with the 260 drivers. I won't say "fix" because every time there's an error they re-run that section of work.
I've also greatly simplified the section of the code that was causing those errors, so my code shouldn't be the problem. Since I still get errors that are worked around, the 260 drivers may run my code slower than some other versions. But it would be good to get tests on 260 and non-260 drivers, Fermi and non-Fermi, and Windows and Linux, to make sure I didn't make any other mistakes.
(Say, did anyone have problems with frequent Computation Errors on Windows in the first place?)
Thank you very much for your efforts. I'll conduct some tests as soon as possible.
Works fine on my GTX 460 (Windows Vista SP 2 x64 - Driver Version 258.96 WHQL) with -m 64.
____________
Artist Volunteer tester Send message
Joined: 29 Sep 08 Posts: 86 ID: 29825 Credit: 326,820,376 RAC: 88,127
                         
Thank you! It works with my GTS450, Linux-x86_64, Driver 260.19.26.
Hi Ken_g6
I downloaded Ken-g6-PSieve-CUDA-322c1a6.zip.
Is this the right latest version for pps_sr2sieve?
It would be nice if you stored the Cuda.Rules file in the project directory.
heinz
Ken_g6 Volunteer developer
 Send message
Joined: 4 Jul 06 Posts: 921 ID: 3110 Credit: 218,950,902 RAC: 3,798
                          
What file did you download? Were you trying to get the source code?
Get - make that Git - the source code from this GitHub page. If you're trying to build with Visual Studio, follow the instructions in README-WIN.txt.
I have no Cuda.Rules file.
____________
Thanks Ken_g6,
I will try the Visual Studio build with VS2008Prof and CUDA3.2
Maybe we can get some speedup with the Fermi cards.
heinz
Ken_g6,
Latest sources seem to have eliminated the problem on my GTS 250 as well.
Do you still think there is a bug in the newer NVidia drivers, or was it the tpsieve code?
Ken_g6 Volunteer developer
 Send message
Joined: 4 Jul 06 Posts: 921 ID: 3110 Credit: 218,950,902 RAC: 3,798
                          
I still think there's a bug in the drivers. Though it only seems to be the Linux 260.19.* drivers, not the Windows 260.99.* drivers. I've just done a workaround for now, recomputing a result each time an error is returned.
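The workaround Ken describes (redo a chunk of the range whenever the driver reports an error, rather than aborting) looks roughly like this; `run_chunk` and `DriverError` are hypothetical stand-ins, not the actual tpsieve code:

```python
class DriverError(Exception):
    """Hypothetical stand-in for an error reported by the CUDA driver."""

def sieve_with_retry(run_chunk, chunks, max_retries=5):
    """Sketch of the workaround described above: instead of failing the
    whole sieve, re-run a chunk whenever the (buggy) driver errors out."""
    results = []
    for chunk in chunks:
        for _ in range(max_retries):
            try:
                results.append(run_chunk(chunk))
                break
            except DriverError:
                continue  # transient 260.19.* failure: redo this chunk
        else:
            raise RuntimeError("chunk failed %d times" % max_retries)
    return results
```

The cost of this approach is exactly what the post notes: every spurious error means repeating a slice of work, so buggy drivers run slower but still produce correct results.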
____________
CUDA3.2
I could compile with XE2011 with some minor changes.
test shows:
E:\I\SC\pps\Ken-g6-PSieve-CUDA-322c1a6\Release>echo off
delete files
done
-------------------------------------------------
test ppsieve-cuda-x86-windows_XE2011_MKLP_Qx_SSE3_ATOM_Rv220
compiled with:XE2011 VS2008
ppsieve-cuda-x86-windows_XE2011_MKLP_Qx_SSE3_ATOM_Rv220.exe -p42070e9 -P42070030e6 -k 1201 -K 9999 -N 2000000 -c 60 -M2 --device 0
ppsieve version cuda-0.2.3 (testing)
nstart=76, nstep=32
ppsieve initialized: 1201 <= k <= 9999, 76 <= n < 2000000
Sieve started: 42070000000000 <= p < 42070030000000
Thread 0 starting
Detected GPU 0: ION
Detected compute capability: 1.1
Detected 2 multiprocessors.
42070000070587 | 9475*2^197534+1
42070000198537 | 3373*2^1046686+1
42070003101727 | 4207*2^1054290+1
42070003511309 | 6057*2^1043547+1
42070006307657 | 1513*2^1771812+1
42070006388603 | 2059*2^1816098+1
42070007177519 | 5437*2^1121592+1
42070007396759 | 7339*2^1803518+1
42070008823897 | 4639*2^952018+1
42070008858187 | 2893*2^317690+1
42070010190569 | 5625*2^1903125+1
p=42070010485761, 174.8K p/sec, 0.04 CPU cores, 35.0% done. ETA 16 Dec 00:21
42070011430123 | 3821*2^1406279+1
42070012301263 | 1957*2^1185814+1
42070013521999 | 1965*2^404493+1
42070013970587 | 7143*2^1462422+1
42070013989247 | 5037*2^838603+1
42070017332953 | 6237*2^1916994+1
42070018235321 | 1941*2^363948+1
42070019542387 | 8587*2^1703626+1
p=42070020971521, 171.9K p/sec, 0.03 CPU cores, 69.9% done. ETA 16 Dec 00:21
42070023987581 | 9811*2^318944+1
42070024339237 | 9257*2^1170495+1
42070024532551 | 4311*2^1690093+1
42070024936837 | 5679*2^1726142+1
42070024995961 | 9111*2^1707153+1
42070026021997 | 4039*2^1819590+1
42070027452199 | 1323*2^854008+1
42070029006583 | 5943*2^663870+1
Thread 0 completed
Waiting for threads to exit
Sieve complete: 42070000000000 <= p < 42070030000000
Found 27 factors
count=955289,sum=0x2dbc17167afb6a8d
Elapsed time: 173.54 sec. (0.12 init + 173.43 sieve) at 173831 p/sec.
Processor time: 5.48 sec. (0.19 init + 5.29 sieve) at 5700470 p/sec.
Average processor utilization: 1.63 (init), 0.03 (sieve)
-------------------------------------------------
It did not find 40 factors like our actual production app does, although I used -M2 as you can see above.
What does this mean: p=42070020971521, 171.9K p/sec, 0.03 CPU cores, 69.9% done. ETA 16 Dec 00:21
Is there something I'm missing in the test?
____________
rroonnaalldd Volunteer developer Volunteer tester
 Send message
Joined: 3 Jul 09 Posts: 1213 ID: 42893 Credit: 34,634,263 RAC: 0
                 
ppsieve-cuda-x86-windows_XE2011_MKLP_Qx_SSE3_ATOM_Rv220.exe -p42070e9 -P42070030e6 -k 1201 -K 9999 -N 2000000 -c 60 -M2 --device 0
PPSieve is the old app.
The PPSE sieve used to search for factors of Proth numbers, like:
20070000475957 | 4995*2^1822738+1
20070001146497 | 4977*2^626298+1
PG switched to factors of Riesel numbers, like:
20070000541441 | 3243*2^1584966-1
20070000674041 | 8143*2^1397047-1
TPSieve, being designed to search for Twin Primes, can search both forms at once:
20070000475957 | 4995*2^1822738+1
20070000541441 | 3243*2^1584966-1
20070000674041 | 8143*2^1397047-1
20070001146497 | 4977*2^626298+1
What does this mean: p=42070020971521, 171.9K p/sec, 0.03 CPU cores, 69.9% done. ETA 16 Dec 00:21
42070000000000 <= p < 42070030000000 => sieve-range
171.9K p/sec => sieve-rate
ETA 16 Dec 00:21 => estimated time of completion
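Each output line `p | k*2^n±1` can be checked independently with one modular exponentiation; a quick sanity check in Python, using one Proth (+1) and one Riesel (-1) factor from the test output above:

```python
def divides(p, k, n, sign):
    """True if p divides k*2^n + sign (sign is +1 for Proth, -1 for Riesel)."""
    return (k * pow(2, n, p) + sign) % p == 0

# Factors reported by the sieve runs quoted in this thread:
assert divides(42070000070587, 9475, 197534, +1)
assert divides(42070005645821, 3633, 119620, -1)
print("factors verified")
```

This is how factors can be independently double-checked without re-running the sieve.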
____________
Best wishes. Knowledge is power. by jjwhalen
Thanks Ronald,
do I have the old source? I have Ken-g6-PSieve-CUDA-322c1a6
from GitHub.
Nothing newer is available... or I can't find it.
Any hints?
____________
Ken_g6 Volunteer developer
 Send message
Joined: 4 Jul 06 Posts: 921 ID: 3110 Credit: 218,950,902 RAC: 3,798
                          
ppsieve and tpsieve are made from the same codebase. "make" (in the pps subdirectory) should make all of them.
Edit: Make sure you got the "redc" branch, not the "redcl" branch; that one's for OpenCL, but may compile an old version for CUDA too.
____________
Thanks Ken_g6 for pointing me in the right direction.
The issue was that in the VS2008 Configuration Manager I chose Release, but I must use TPS-Release; this I did not know. I have seen that SEARCH_TWIN is now set in the preprocessor settings. In the linker section, Input -> Ignore Specific Library must be set to LIBCMT.lib.
With some minor changes the app compiles with the Intel compiler.
A first test run shows good results. I did not do any optimization or profiling; first we will make sure it produces correct results.
have a look:
E:\I\SC\pps\Ken-g6-PSieve-CUDA-322c1a6\TPS-Release>echo off
--------------------------------------------------------
primegrid_tpsieve_1.37_windows_intelx86__cuda23.exe -p42070e9 -P42070030e6 -k 1201 -K 9999 -N 2000000 -c 60 -M2 --device 0
tpsieve version cuda-0.2.3 (testing)
nstart=76, nstep=32
Changed nstep to 31
tpsieve initialized: 1201 <= k <= 9999, 76 <= n < 2000000
42070005645821 | 3633*2^119620-1
42070006307657 | 1513*2^1771812+1
42070006388603 | 2059*2^1816098+1
42070007177519 | 5437*2^1121592+1
42070007396759 | 7339*2^1803518+1
42070007733361 | 7007*2^1691614-1
42070008458437 | 7095*2^1422761-1
42070008823897 | 4639*2^952018+1
42070008858187 | 2893*2^317690+1
42070010190569 | 5625*2^1903125+1
42070011430123 | 3821*2^1406279+1
42070012209011 | 9405*2^360411-1
42070012301263 | 1957*2^1185814+1
42070013521999 | 1965*2^404493+1
42070013970587 | 7143*2^1462422+1
42070013989247 | 5037*2^838603+1
42070016416499 | 4571*2^466510-1
42070017332953 | 6237*2^1916994+1
42070018235321 | 1941*2^363948+1
42070019117111 | 2523*2^999263-1
42070019542387 | 8587*2^1703626+1
42070021901227 | 6589*2^1149693-1
42070023987581 | 9811*2^318944+1
42070024242289 | 8319*2^1792800-1
42070024339237 | 9257*2^1170495+1
42070024532551 | 4311*2^1690093+1
42070024936837 | 5679*2^1726142+1
42070024995961 | 9111*2^1707153+1
42070026021997 | 4039*2^1819590+1
42070026719239 | 9981*2^629165-1
42070027452199 | 1323*2^854008+1
42070028029061 | 8205*2^1394191-1
42070029006583 | 5943*2^663870+1
Found 40 factors
--------------------------------------------------------
tpsieve-cuda-x86-windows_XE2011.exe -p42070e9 -P42070030e6 -k 1201 -K 9999 -N 2000000 -c 60 -M2 --device 0
tpsieve version cuda-0.2.3 (testing)
nstart=76, nstep=32
Changed nstep to 31
tpsieve initialized: 1201 <= k <= 9999, 76 <= n < 2000000
Sieve started: 42070000000000 <= p < 42070030000000
Thread 0 starting
Detected GPU 0: ION
Detected compute capability: 1.1
Detected 2 multiprocessors.
42070000070587 | 9475*2^197534+1
42070000154219 | 6023*2^934790-1
42070000198537 | 3373*2^1046686+1
42070001803331 | 5237*2^486598-1
42070003062431 | 7465*2^1994555-1
42070003101727 | 4207*2^1054290+1
42070003511309 | 6057*2^1043547+1
42070005645821 | 3633*2^119620-1
42070006307657 | 1513*2^1771812+1
42070006388603 | 2059*2^1816098+1
42070007177519 | 5437*2^1121592+1
42070007396759 | 7339*2^1803518+1
42070007733361 | 7007*2^1691614-1
p=42070007864321, 131.0K p/sec, 0.04 CPU cores, 26.2% done. ETA 17 Dec 00:26
42070008458437 | 7095*2^1422761-1
42070008823897 | 4639*2^952018+1
42070008858187 | 2893*2^317690+1
42070010190569 | 5625*2^1903125+1
42070011430123 | 3821*2^1406279+1
42070012209011 | 9405*2^360411-1
42070012301263 | 1957*2^1185814+1
42070013521999 | 1965*2^404493+1
42070013970587 | 7143*2^1462422+1
42070013989247 | 5037*2^838603+1
p=42070015990785, 133.5K p/sec, 0.02 CPU cores, 53.3% done. ETA 17 Dec 00:26
42070016416499 | 4571*2^466510-1
42070017332953 | 6237*2^1916994+1
42070018235321 | 1941*2^363948+1
42070019117111 | 2523*2^999263-1
42070019542387 | 8587*2^1703626+1
42070021901227 | 6589*2^1149693-1
p=42070024117249, 131.3K p/sec, 0.02 CPU cores, 80.4% done. ETA 17 Dec 00:26
42070023987581 | 9811*2^318944+1
42070024242289 | 8319*2^1792800-1
42070024339237 | 9257*2^1170495+1
42070024532551 | 4311*2^1690093+1
42070024936837 | 5679*2^1726142+1
42070024995961 | 9111*2^1707153+1
42070026021997 | 4039*2^1819590+1
42070026719239 | 9981*2^629165-1
42070027452199 | 1323*2^854008+1
42070028029061 | 8205*2^1394191-1
42070029006583 | 5943*2^663870+1
Thread 0 completed
Waiting for threads to exit
Sieve complete: 42070000000000 <= p < 42070030000000
Found 40 factors
count=955289,sum=0x2dbc17167afb6a8d
Elapsed time: 229.27 sec. (0.08 init + 229.20 sieve) at 131532 p/sec.
Processor time: 6.13 sec. (0.12 init + 6.01 sieve) at 5019375 p/sec.
Average processor utilization: 1.60 (init), 0.03 (sieve)
Drücken Sie eine beliebige Taste . . .
~~~~~~~~~~~~~~~~~~~~~~~~
stderr from 1.37 production app shows:
-------------------------------------------
00:19:21 (3480): Can't open init data file - running in standalone mode
Sieve started: 42070000000000 <= p < 42070030000000
Resuming from checkpoint p=42070004718593 in tpcheck42070e9.txt
Thread 0 starting
Detected GPU 0: ION
Detected compute capability: 1.1
Detected 2 multiprocessors.
Thread 0 completed
Sieve complete: 42070000000000 <= p < 42070030000000
count=955289,sum=0x2dbc17167afb6a8d
Elapsed time: 192.22 sec. (0.20 init + 192.02 sieve) at 132423 p/sec.
Processor time: 4.31 sec. (0.12 init + 4.18 sieve) at 6082043 p/sec.
Average processor utilization: 0.62 (init), 0.02 (sieve)
00:22:33 (3480): called boinc_finish
~~~~~~~~~~~~~~~~~~~~~~~~~~~
heinz
rroonnaalldd Volunteer developer Volunteer tester
 Send message
Joined: 3 Jul 09 Posts: 1213 ID: 42893 Credit: 34,634,263 RAC: 0
                 
tpsieve-cuda-x86-windows_XE2011.exe -p42070e9 -P42070030e6 -k 1201 -K 9999 -N 2000000 -c 60 -M2 --device 0
tpsieve version cuda-0.2.3 (testing)
nstart=76, nstep=32
Changed nstep to 31
tpsieve initialized: 1201 <= k <= 9999, 76 <= n < 2000000
Sieve started: 42070000000000 <= p < 42070030000000
Thread 0 starting
Detected GPU 0: ION
Detected compute capability: 1.1
Detected 2 multiprocessors.
...
Thread 0 completed
Waiting for threads to exit
Sieve complete: 42070000000000 <= p < 42070030000000
Found 40 factors
count=955289,sum=0x2dbc17167afb6a8d
Elapsed time: 229.27 sec. (0.08 init + 229.20 sieve) at 131532 p/sec.
Processor time: 6.13 sec. (0.12 init + 6.01 sieve) at 5019375 p/sec.
Average processor utilization: 1.60 (init), 0.03 (sieve)
Drücken Sie eine beliebige Taste . . .
Hmmm. You have only 2 multiprocessors and need ~230sec for work that will be done in ~30sec on my slow GT240 (GT215) and ~6sec on a GTX460...
I think there is not much room for optimizing.
____________
Best wishes. Knowledge is power. by jjwhalen
Ken_g6 Volunteer developer
 Send message
Joined: 4 Jul 06 Posts: 921 ID: 3110 Credit: 218,950,902 RAC: 3,798
                          
Thanks Ken_g6 for pointing me in the right direction.
The issue was that in the VS2008 Configuration Manager I chose Release, but I must use TPS-Release; this I did not know. I have seen that SEARCH_TWIN is now set in the preprocessor settings. In the linker section, Input -> Ignore Specific Library must be set to LIBCMT.lib.
With some minor changes the app compiles with Intel compiler.
Oops. My world is Linux-centric, so I keep forgetting most of the world is Windows-centric.
The Intel compiler isn't going to change much on this app. It's nVIDIA's CUDA compiler where you would need to look for optimizations.
____________
While under full CPU load, on a GTX 470, no issues. Are any of our sieves currently running this, or planning to in the future? I'd like to get my GPU doing nice things, but can't run BOINC on this PC yet.
C:\Users\st47\Desktop\ppsieve-cuda>ppsieve-cuda-x86-windows.exe -p42070e9 -P42070030e6 -k 1201 -K 9999 -N 2000000 -c 60
ppsieve version cuda-0.2.3 (testing)
nstart=76, nstep=32
ppsieve initialized: 1201 <= k <= 9999, 76 <= n < 2000000
Sieve started: 42070000000000 <= p < 42070030000000
Thread 0 starting
Detected GPU 0: GeForce GTX 470
Detected compute capability: 2.0
Detected 14 multiprocessors.
42070000070587 | 9475*2^197534+1
42070000198537 | 3373*2^1046686+1
42070003101727 | 4207*2^1054290+1
42070003511309 | 6057*2^1043547+1
42070006307657 | 1513*2^1771812+1
42070006388603 | 2059*2^1816098+1
42070007177519 | 5437*2^1121592+1
42070007396759 | 7339*2^1803518+1
42070008823897 | 4639*2^952018+1
42070008858187 | 2893*2^317690+1
42070010190569 | 5625*2^1903125+1
42070011430123 | 3821*2^1406279+1
42070012301263 | 1957*2^1185814+1
42070013521999 | 1965*2^404493+1
42070013970587 | 7143*2^1462422+1
42070013989247 | 5037*2^838603+1
42070017332953 | 6237*2^1916994+1
42070018235321 | 1941*2^363948+1
42070019542387 | 8587*2^1703626+1
42070023987581 | 9811*2^318944+1
42070024339237 | 9257*2^1170495+1
42070024532551 | 4311*2^1690093+1
42070024936837 | 5679*2^1726142+1
42070024995961 | 9111*2^1707153+1
42070026021997 | 4039*2^1819590+1
42070027452199 | 1323*2^854008+1
42070029006583 | 5943*2^663870+1
Thread 0 completed
Waiting for threads to exit
Sieve complete: 42070000000000 <= p < 42070030000000
Found 27 factors
count=955289,sum=0x2dbc17167afb6a8d
Elapsed time: 3.34 sec. (0.04 init + 3.30 sieve) at 9137567 p/sec.
Processor time: 0.39 sec. (0.05 init + 0.34 sieve) at 87839115 p/sec.
Average processor utilization: 1.26 (init), 0.10 (sieve)
C:\Users\st47\Desktop\ppsieve-cuda>ppsieve-cuda-x86-windows.exe -p249871e9 -P2498711e8 -k 1201 -K 9999 -N 2000000 -c 60
ppsieve version cuda-0.2.3 (testing)
nstart=80, nstep=35
nstep changed to 32
ppsieve initialized: 1201 <= k <= 9999, 80 <= n < 2000000
Sieve started: 249871000000000 <= p < 249871100000000
Thread 0 starting
Detected GPU 0: GeForce GTX 470
Detected compute capability: 2.0
Detected 14 multiprocessors.
249871003789289 | 6295*2^266404+1
249871009510013 | 2771*2^1272671+1
249871010360639 | 1743*2^1337710+1
249871027030549 | 8865*2^1534637+1
249871030776329 | 7815*2^1679937+1
249871032591751 | 2335*2^23512+1
249871038523049 | 7527*2^204096+1
249871049497963 | 6497*2^505399+1
249871066947839 | 8497*2^1221770+1
249871068167599 | 7311*2^450531+1
249871089712009 | 9281*2^1650023+1
249871091913587 | 2139*2^1290902+1
249871099624639 | 8381*2^350375+1
Thread 0 completed
Waiting for threads to exit
Sieve complete: 249871000000000 <= p < 249871100000000
Found 13 factors
count=3016866,sum=0xdd752eb120eb924a
Elapsed time: 9.98 sec. (0.09 init + 9.89 sieve) at 10127771 p/sec.
Processor time: 0.90 sec. (0.09 init + 0.81 sieve) at 123444762 p/sec.
Average processor utilization: 1.02 (init), 0.08 (sieve)
C:\Users\st47\Desktop\ppsieve-cuda>
____________
rroonnaalldd Volunteer developer Volunteer tester
 Send message
Joined: 3 Jul 09 Posts: 1213 ID: 42893 Credit: 34,634,263 RAC: 0
                 
Activate "Proth Prime Search (Sieve)" and CUDA in your PG preferences...
____________
Best wishes. Knowledge is power. by jjwhalen
Ken_g6 Volunteer developer
 Send message
Joined: 4 Jul 06 Posts: 921 ID: 3110 Credit: 218,950,902 RAC: 3,798
                          
I'd like to get my GPU doing nice things, but can't run boinc on this PC yet.
You can inquire about doing some manual sieving work. Also, on Mersenneforum, Twin Prime Search and No Prime Left Behind use this code. See also the GPU computing section and GFN prime search, with different code.
____________
Genn Volunteer tester
 Send message
Joined: 16 Jul 09 Posts: 50 ID: 43504 Credit: 91,204,089 RAC: 0
                     
Is there any build of tpsieve for CUDA 3 under 64-bit Linux? I would like to run genefercuda and tpsieve on the same machine, and I want to have only one version of the CUDA toolkit installed.
Ken_g6 Volunteer developer
 Send message
Joined: 4 Jul 06 Posts: 921 ID: 3110 Credit: 218,950,902 RAC: 3,798
                          
I would imagine you just have to download libcudart.so.2 from PrimeGrid and put it in the same directory with TPSieve. But I haven't tried that yet.
____________
rroonnaalldd Volunteer developer Volunteer tester
 Send message
Joined: 3 Jul 09 Posts: 1213 ID: 42893 Credit: 34,634,263 RAC: 0
                 
CUDA Occupancy Calculator
Compute Capability Threads/Multiprocessor
1.0 768 / 6
1.1 768 / 6
1.2 1024 / 8
1.3 1024 / 8
2.0 1536 / 12
Multiprocessor value = Threads / 128
Depending on the Compute Capability of your nVidia GPU, shouldn't the parameter "-m" then be:
6 for all pre-GT200
8 for the GT200
12 for Fermi?
[add]
question
____________
Best wishes. Knowledge is power. by jjwhalen
Ken_g6 Volunteer developer
 Send message
Joined: 4 Jul 06 Posts: 921 ID: 3110 Credit: 218,950,902 RAC: 3,798
                          
Ah, but you assumed practically no register use here. My code uses between 17 and 20 registers, generally. So that leads to:
CUDA Occupancy Calculator
Compute Capability Threads/Multiprocessor
1.0 384 / 3
1.1 384 / 3
1.2 768 / 6
1.3 768 / 6
2.0 1024 / 8
And those are the defaults, except for 2.0 which I bumped to 16, and for which -m64 seems to work nicely in most cases.
Edit: By the way, since my code isn't memory-bound, these low occupancies (down to 50%) shouldn't be a problem performance-wise.
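Ken's table can be reproduced with simple occupancy arithmetic. A sketch, where the per-multiprocessor limits are the published maxima for each compute capability, the 20-registers-per-thread and 128-thread-block figures follow his post, and real register allocation granularity is ignored:

```python
def threads_per_mp(cc, regs_per_thread=20, block_size=128):
    """Resident threads/blocks per multiprocessor, limited by the register
    file, the max thread count, and the max resident block count."""
    limits = {            # (registers/MP, max threads/MP, max blocks/MP)
        (1, 0): (8192, 768, 8),
        (1, 1): (8192, 768, 8),
        (1, 2): (16384, 1024, 8),
        (1, 3): (16384, 1024, 8),
        (2, 0): (32768, 1536, 8),
    }
    regs, max_threads, max_blocks = limits[cc]
    blocks = min(regs // (regs_per_thread * block_size),  # register limit
                 max_threads // block_size,               # thread limit
                 max_blocks)                              # block limit
    return blocks * block_size, blocks

print(threads_per_mp((1, 1)))  # -> (384, 3), matching the table above
print(threads_per_mp((2, 0)))  # -> (1024, 8): Fermi hits the 8-block cap
```

Note how on compute capability 2.0 it is the resident-block cap, not the registers, that limits occupancy to 1024 threads.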
____________
rroonnaalldd Volunteer developer Volunteer tester
 Send message
Joined: 3 Jul 09 Posts: 1213 ID: 42893 Credit: 34,634,263 RAC: 0
                 
Ken released a new version of tpsieve, 0.2.3a.
Downloadable at https://sites.google.com/site/kenscode/prime-programs/tpsieve-cuda.zip?attredirects=0&d=1
____________
Best wishes. Knowledge is power. by jjwhalen
What's the syntax of the input file when sieving for twins with tpsieve? I am familiar with the ABC format in general, but I keep getting "invalid header" messages.
____________
There are only 10 kinds of people - those who understand binary and those who don't
rroonnaalldd Volunteer developer Volunteer tester
 Send message
Joined: 3 Jul 09 Posts: 1213 ID: 42893 Credit: 34,634,263 RAC: 0
                 
TPSieve needs no input file.
You set the sieve range via "-p" and "-P", the k range with "-k" and "-K", and the n range for k*2^n±1 with "-N".
____________
Best wishes. Knowledge is power. by jjwhalen
TPSieve needs no input file.
You set the sieve range via "-p" and "-P", the k range with "-k" and "-K", and the n range for k*2^n±1 with "-N".
True, but I already have a file created with NewPGen and I'd like to continue sieving with tpsieve. You can specify an input file, but I don't get the format in the first line right.
____________
There are only 10 kinds of people - those who understand binary and those who don't
rroonnaalldd Volunteer developer Volunteer tester
 Send message
Joined: 3 Jul 09 Posts: 1213 ID: 42893 Credit: 34,634,263 RAC: 0
                 
Changes for ppsieve version 0.3.2a: A very tiny bugfix that only affects you if you try to run with fewer
than the maximum number of N's.
Also doubled the speed...of reading the input ABCD file.
The file "psp_sr2sieve_20100629.sieveinput" (was deleted or moved from the PG download-folder) begins with:
14
991
49999997
k=10223
1181
+336
+1272
+660
+492
+180
+48
+420
+876
+996
...
____________
Best wishes. Knowledge is power. by jjwhalen
OK, so I played around for a while with tpsieve-cuda-x86-windows.exe without an input file. But I ran into another error when trying to do a single-n sieve.
tpsieve -p1e9 -P2e9 -n54321 -k2 -K1e6 gives an error message that nmax is too close to nmin. If I add -N66666 it works, but it also finds factors for all those unwanted n values. How can I do a single-n sieve?
Thanks
Peter
____________
There are only 10 kinds of people - those who understand binary and those who don't
Ken_g6 Volunteer developer
 Send message
Joined: 4 Jul 06 Posts: 921 ID: 3110 Credit: 218,950,902 RAC: 3,798
                          
How can I do a single-n sieve?
Thanks
Peter
Good question. I believe the proper arguments to make tpsieve do a single N would be tpsieve -p1e9 -P2e9 -n54321 -N54322 -k2 -K1e6. You must specify a value for -N.
However, because there are so few N's, most of the work will be done on the CPU finding P's to sieve with. If done with the OpenCL app, all the work would be done on the CPU!
I have given some thought recently to how I would do a single-N sieve on the GPU. If you look at The Prime Pages' Wheel Factorization page, you'll see a wheel for 2, 3, 5, and 7. It turns out that it has exactly 48 spokes, which is very nice, because 48*4 = 192, and 192 is a very good block size for CUDA. So then each block could be initialized with a number that is 0 (mod 840 ==210*4). Each consecutive block could be initialized with the next number that is 0 (mod 840). And the P in each thread could be set to that number plus the wheel number and incremented by 210*4*block_count at each iteration.
The downside to this is that a wheel sieve doesn't get rid of all that many composites, so this would test roughly six times as many numbers as if it sieved with primes only. On the plus side, this would mean no significant global memory access, so that six would be just about the only limiting factor.
But I'm not planning to write such a single-N sieve at this time.
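To make the layout concrete, here is a rough sketch of the wheel and the per-thread starting P described above (helper names are mine and purely illustrative; this is not code from ppsieve):

```c
/* Build the spokes of the mod-210 wheel: residues coprime to 2*3*5*7.
 * There are phi(210) = 48 of them, matching the 48 spokes mentioned
 * above. */
static int build_wheel(int spokes[48])
{
    int count = 0;
    for (int r = 1; r < 210; r++)
        if (r % 2 && r % 3 && r % 5 && r % 7)
            spokes[count++] = r;
    return count;   /* 48 */
}

/* Illustrative only: the starting P for one thread under the scheme
 * sketched above, where each block covers 840 = 210*4 numbers and has
 * 192 = 48*4 threads (4 copies of the wheel, one per 210-wide slice). */
static unsigned long long start_p(unsigned long long base,
                                  int block, int thread_index,
                                  const int spokes[48])
{
    int wheel_pos = spokes[thread_index % 48];
    int slice     = thread_index / 48;     /* which 210-wide slice */
    return base + 840ULL * block + 210ULL * slice + wheel_pos;
}
```

Each thread would then step its P by 840*block_count per iteration, as described in the post.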
____________
|
|
|
|
|
|
For the benefit of those using app_info, I've updated the Mac builds of ppsieve and tpsieve to the latest (0.2.3a) code:
http://www.pyramid-productions.net/downloads/ppsieve-cuda.tar.gz
http://www.pyramid-productions.net/downloads/tpsieve-cuda.tar.gz
As is hopefully well known, I still can't get the OpenCL code to work correctly with Apple's OpenCL compiler...
Cheers
- Iain |
|
|
|
|
Good question. I believe the proper arguments to make tpsieve do a single N would be tpsieve -p1e9 -P2e9 -n54321 -N54322 -k2 -K1e6. You must specify a value for -N.
This gives me "Error: pmin is not large enough (or nmax is close to nmin)"
But you say it would be inefficient anyway. So it would be a much smarter idea to sieve many N's with a small number of K values, right?
Thanks
Peter
____________
There are only 10 kinds of people - those who understand binary and those who don't
|
|
|
|
|
|
We are currently running with M6 as default.
How many threads does this mean for a GTS 250 with a 16-core config?
|
|
|
Ken_g6 Volunteer developer
 Send message
Joined: 4 Jul 06 Posts: 921 ID: 3110 Credit: 218,950,902 RAC: 3,798
                          
|
But you say it would be inefficient anyway. So it would be a much smarter idea to sieve many N's with a small number of K values, right?
Right, Peter. With the exception that you may find you get more, more-deeply-sieved factors sieving a single N on your CPU than you get with PPSieve-CUDA on your GPU. Especially if you use my NSieve64 app, which could be up to 20-30% faster than NewPGen on 64-bit Linux. But you'll have to compile it.
____________
|
|
|
rroonnaalldd Volunteer developer Volunteer tester
 Send message
Joined: 3 Jul 09 Posts: 1213 ID: 42893 Credit: 34,634,263 RAC: 0
                 
|
|
Ken, any news on TPSieve/PPSieve coding? I see you have released new apps but not updated the readme files...
____________
Best wishes. Knowledge is power. by jjwhalen
|
|
|
Ken_g6 Volunteer developer
 Send message
Joined: 4 Jul 06 Posts: 921 ID: 3110 Credit: 218,950,902 RAC: 3,798
                          
|
|
Just a minor update. John said the PPSE sieve was starting to have cases where no factors are returned. This allows that case to occur while still validating. See the Git log for (a few more) details.
____________
|
|
|
|
|
|
Finally, sm_21 compilation with the CUDA SDK 3.2 and the 270.26 beta drivers works as expected (in terms of speed). The binary is even faster (not by much, though...) than the stock app with the 256.53 drivers (a few modifications were necessary)...
Stock app (270.26 drivers):
ralf@quadriga:~$ LD_LIBRARY_PATH=. /home/ralf/primegrid_tpsieve_1.38_x86_64-pc-linux-gnu__cuda23 -p 100000e9 -P 100001e9 -k 3 -K 9999 -n 2M -N 3M -c60 -q
tpsieve version cuda-0.2.3a (testing)
Compiled Jan 11 2011 with GCC 4.3.3
nstart=2000000, nstep=34
nstep changed to 32
tpsieve initialized: 3 <= k <= 9999, 2000000 <= n < 3000000
Found 173 factors
ralf@quadriga:~$ cat stderr.txt
Can't open init data file - running in standalone mode
Sieve started: 100000000000000 <= p < 100001000000000
Thread 0 starting
Detected GPU 0: GeForce GTX 460
Detected compute capability: 2.1
Detected 7 multiprocessors.
Thread 0 completed
Sieve complete: 100000000000000 <= p < 100001000000000
count=31019409,sum=0x284af85735fd771f
Elapsed time: 92.83 sec. (0.03 init + 92.80 sieve) at 10776335 p/sec.
Processor time: 4.65 sec. (0.03 init + 4.62 sieve) at 216266622 p/sec.
Average processor utilization: 1.11 (init), 0.05 (sieve)
called boinc_finish
"Do it yourself" app (270.26 drivers):
ralf@quadriga:~/source/PSieve-CUDA/pps$ ./tpsieve-cuda-x86_64-linux -p 100000e9 -P 100001e9 -k 3 -K 9999 -n 2M -N 3M -c60 -q -m 72
tpsieve version cuda-0.2.3b (testing)
Compiled Feb 27 2011 with GCC 4.5.2
nstart=2000000, nstep=34
nstep changed to 32
tpsieve initialized: 3 <= k <= 9999, 2000000 <= n < 3000000
Sieve started: 100000000000000 <= p < 100001000000000
Thread 0 starting
Detected GPU 0: GeForce GTX 460
Detected compute capability: 2.1
Detected 7 multiprocessors.
p=100000707264513, 11.79M p/sec, 0.04 CPU cores, 70.7% done. ETA 27 Feb 14:53
Thread 0 completed
Waiting for threads to exit
Sieve complete: 100000000000000 <= p < 100001000000000
Found 173 factors
count=31019409,sum=0x284af85735fd771f
Elapsed time: 85.33 sec. (0.02 init + 85.31 sieve) at 11723348 p/sec.
Processor time: 3.34 sec. (0.02 init + 3.32 sieve) at 301209852 p/sec.
Average processor utilization: 1.23 (init), 0.04 (sieve)
ralf@quadriga:~/source/PSieve-CUDA/pps$
____________
|
|
|
rroonnaalldd Volunteer developer Volunteer tester
 Send message
Joined: 3 Jul 09 Posts: 1213 ID: 42893 Credit: 34,634,263 RAC: 0
                 
|
|
I had some weird problems with GDM shutdown, unloading the nvidia module of my DotschUX 1.2 install and installing 270.26. A restart solved this, and 270.26 with cudatoolkit_3.2.16 installs without further flaws.
Boinc is running now and wrote the following in the log:
NVIDIA GPU 0: GeForce GT 240 (driver version unknown, CUDA version 4000, compute capability 1.2, 1023MB, 302 GFLOPS peak)
Which modifications were necessary???
____________
Best wishes. Knowledge is power. by jjwhalen
|
|
|
|
|
Which modifications were necessary???
I'm currently trying to figure out which modifications were not necessary :)
____________
|
|
|
Ken_g6 Volunteer developer
 Send message
Joined: 4 Jul 06 Posts: 921 ID: 3110 Credit: 218,950,902 RAC: 3,798
                          
|
|
Don't forget: The PrimeGrid version is the BOINC version, which doesn't use threads. Yours is the non-BOINC version which does. Threads seem to avoid the CUDA blocking sync bug which makes the BOINC version slower.
____________
|
|
|
|
|
Don't forget: The PrimeGrid version is the BOINC version, which doesn't use threads. Yours is the non-BOINC version which does. Threads seem to avoid the CUDA blocking sync bug which makes the BOINC version slower.
The BOINC version is about as fast as the non-BOINC version...
ralf@quadriga:~/source/PSieve-CUDA/pps$ ./tpsieve-cuda-boinc-x86_64-linux -p 100000e9 -P 100001e9 -k 3 -K 9999 -n 2M -N 3M -c60 -q
tpsieve version cuda-0.2.3b (testing)
Compiled Feb 27 2011 with GCC 4.5.2
nstart=2000000, nstep=34
nstep changed to 32
tpsieve initialized: 3 <= k <= 9999, 2000000 <= n < 3000000
Found 173 factors
ralf@quadriga:~/source/PSieve-CUDA/pps$ cat stderr.txt
21:53:22 (1897): Can't open init data file - running in standalone mode
Sieve started: 100000000000000 <= p < 100001000000000
Thread 0 starting
Detected GPU 0: GeForce GTX 460
Detected compute capability: 2.1
Detected 7 multiprocessors.
Thread 0 completed
Sieve complete: 100000000000000 <= p < 100001000000000
count=31019409,sum=0x284af85735fd771f
Elapsed time: 85.01 sec. (0.02 init + 84.99 sieve) at 11767616 p/sec.
Processor time: 3.83 sec. (0.02 init + 3.80 sieve) at 262885661 p/sec.
Average processor utilization: 1.21 (init), 0.04 (sieve)
21:54:47 (1897): called boinc_finish
____________
|
|
|
|
|
|
Driver version 256.53 - CUDA SDK 2.3 - sm_10
Official binary:
Thread 0 completed
Sieve complete: 100000000000000 <= p < 100001000000000
count=31019409,sum=0x284af85735fd771f
Elapsed time: 86.66 sec. (0.03 init + 86.64 sieve) at 11543273 p/sec.
Processor time: 4.36 sec. (0.03 init + 4.33 sieve) at 230844137 p/sec.
Average processor utilization: 1.11 (init), 0.05 (sieve)
Trying out a new idea:
Thread 0 completed
Waiting for threads to exit
Sieve complete: 100000000000000 <= p < 100001000000000
Found 173 factors
count=31019409,sum=0x284af85735fd771f
Elapsed time: 78.22 sec. (0.02 init + 78.20 sieve) at 12788964 p/sec.
Processor time: 5.04 sec. (0.03 init + 5.01 sieve) at 199683878 p/sec.
Average processor utilization: 1.13 (init), 0.06 (sieve)
____________
|
|
|
Ken_g6 Volunteer developer
 Send message
Joined: 4 Jul 06 Posts: 921 ID: 3110 Credit: 218,950,902 RAC: 3,798
                          
|
|
A new idea? I'm intrigued. What did you do?
____________
|
|
|
|
|
A new idea? I'm intrigued. What did you do?
No algorithmic changes... I only try to understand the weird ways of the compiler...
By the way:
Driver version 270.26 - CUDA SDK 3.2 - sm_21:
Thread 0 completed
Waiting for threads to exit
Sieve complete: 100000000000000 <= p < 100001000000000
Found 173 factors
count=31019409,sum=0x284af85735fd771f
Elapsed time: 78.00 sec. (0.02 init + 77.98 sieve) at 12908276 p/sec.
Processor time: 4.02 sec. (0.02 init + 4.00 sieve) at 251894391 p/sec.
Average processor utilization: 1.23 (init), 0.05 (sieve)
down from 93.56 seconds with the stock binary...
____________
|
|
|
Ken_g6 Volunteer developer
 Send message
Joined: 4 Jul 06 Posts: 921 ID: 3110 Credit: 218,950,902 RAC: 3,798
                          
|
A new idea? I'm intrigued. What did you do?
No algorithmic changes... I only try to understand the weird ways of the compiler...
I still want to know what you did, so I can do it too! In particular, Driver version 256.53 - CUDA SDK 2.3 - sm_10 is my compiler setup, so if you found a way to make that faster, it's directly applicable to the stock binary.
____________
|
|
|
|
|
A new idea? I'm intrigued. What did you do?
No algorithmic changes... I only try to understand the weird ways of the compiler...
I still want to know what you did, so I can do it too! In particular, Driver version 256.53 - CUDA SDK 2.3 - sm_10 is my compiler setup, so if you found a way to make that faster, it's directly applicable to the stock binary.
PM for you ;)
____________
|
|
|
|
|
A new idea? I'm intrigued. What did you do?
No algorithmic changes... I only try to understand the weird ways of the compiler...
By the way:
Driver version 270.26 - CUDA SDK 3.2 - sm_21:
Thread 0 completed
Waiting for threads to exit
Sieve complete: 100000000000000 <= p < 100001000000000
Found 173 factors
count=31019409,sum=0x284af85735fd771f
Elapsed time: 78.00 sec. (0.02 init + 77.98 sieve) at 12908276 p/sec.
Processor time: 4.02 sec. (0.02 init + 4.00 sieve) at 251894391 p/sec.
Average processor utilization: 1.23 (init), 0.05 (sieve)
down from 93.56 seconds with the stock binary...
Current (retired) IPC* rates (all measured with the SDK 3.2 profiler and the 270.26 drivers) for the tpsieve/d_check_more_ns_32_fermi kernel:
Stock app: 1.96 (sm_10 compiled - SDK 2.3)
My modified sm_10 app: 2.02 (sm_10 compiled - SDK 3.2)
My modified sm_21 app: 2.21 (sm_21 compiled - SDK 3.2 - Limited register usage)
The IPC increase helps to compensate for the performance drop between the good old 256.53 drivers and the 270.26 beta drivers. I'm still waiting for drivers at the 256.53 performance level with overclocking enabled for Linux...
*Instructions per cycle
____________
|
|
|
Ken_g6 Volunteer developer
 Send message
Joined: 4 Jul 06 Posts: 921 ID: 3110 Credit: 218,950,902 RAC: 3,798
                          
|
|
Sadly, I'm too busy lately to work much on this anymore. But you're making good enough progress that you should probably fork my source and go from there!
Just remember, any time you change the range the GPU searches, you should also change the range the CPU verifies in app.c; otherwise you'll get occasional Computation Errors.
____________
|
|
|
|
|
otherwise you'll get occasional Computation Errors.
If they were only occasional... one or two typos that I've made resulted in tons of them ;) I regard any computation error as an indicator that I've messed something up.
But you're making good enough progress that you should probably fork my source and go from there!
I have cloned the repository on my box and created a few branches for the experiments. Especially my last one made a mess of the code in appcu.cu. I need to clean up the file thoroughly before I publish any of the changes ;)
---
Driver version 270.26beta - sm_21 compiled - GTX 460 - My usual test range:
Experimental interleaving - Initial version: Fewer threads, more work per thread (inspired by the papers and presentations of Volkov et al.). I have to check whether some of my earlier modifications are now counterproductive; I haven't measured the IPC rate yet.
Update: The IPC rate has gone down from 2.21 to 2.14, the global memory throughput dropped from 0.99 to 0.60 GiB/second, and the L1 gld hit rate went up from 10% to nearly 50%.
Thread 0 completed
Waiting for threads to exit
Sieve complete: 100000000000000 <= p < 100001000000000
Found 173 factors
count=31019409,sum=0x284af85735fd771f
Elapsed time: 74.50 sec. (0.03 init + 74.47 sieve) at 13429324 p/sec.
Processor time: 4.50 sec. (0.03 init + 4.47 sieve) at 223817573 p/sec.
Average processor utilization: 1.11 (init), 0.06 (sieve)
ralf@quadriga:~/source/PSieve-CUDA/pps$
____________
|
|
|
Ken_g6 Volunteer developer
 Send message
Joined: 4 Jul 06 Posts: 921 ID: 3110 Credit: 218,950,902 RAC: 3,798
                          
|
otherwise you'll get occasional Computation Errors.
If they were only occasional... one or two typos that I've made resulted in tons of them ;) I regard any computation error as an indicator that I've messed something up.
OK, just throw, say, 1024 more N's on the highest CPU computation range and you should be good.
Experimental interleaving - Initial version: Fewer threads, more work per thread (inspired by the papers and presentations of Volkov et al.). I have to check whether some of my earlier modifications are now counterproductive; I haven't measured the IPC rate yet.
Update: The IPC rate has gone down from 2.21 to 2.14, the global memory throughput dropped from 0.99 to 0.60 GiB/second, and the L1 gld hit rate went up from 10% to nearly 50%.
I was just looking at that, and it does look promising. You're probably throwing too many threads at it and registers are being spilled to the L1 cache. Fiddle with the -m option and it might improve. Also, be sure to use the same loop counter variables for both P's.
Edit: BLOCKSIZE! That's what you need to change. It could be risky; as long as it's a power of 2 it ought to be workable. So try 64 or maybe 32.
____________
|
|
|
|
|
OK, just throw, say, 1024 more N's on the highest CPU computation range and you should be good.
As I said: I've discarded all changes that introduced computation errors in the first place. The last thing I want to do is to introduce bugs into the code ;)
Edit: BLOCKSIZE! That's what you need to change. It could be risky; as long as it's a power of 2 it ought to be workable. So try 64 or maybe 32.
I guess there are not enough memory accesses to hide for any really significant gain.
Two additional questions: Can we somehow get rid of the rare (Fermi) case where the low 32 bits of kpos == 0, and what is the purpose of the if(__umulhi...) in the non-rare case?
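For anyone following along: __umulhi() returns the high 32 bits of a 32x32-bit product. A portable equivalent, plus the kind of overflow check such a branch typically guards (my sketch, not the kernel's actual logic):

```c
#include <stdint.h>

/* Portable equivalent of CUDA's __umulhi(): the high 32 bits of the
 * 64-bit product of two 32-bit operands. */
static uint32_t umulhi32(uint32_t a, uint32_t b)
{
    return (uint32_t)(((uint64_t)a * b) >> 32);
}

/* Illustrative use: branching on the high word detects a product that
 * spills past 2^32 without doing 64-bit math on the hot path. This
 * mirrors the kind of check discussed above, not the kernel's exact
 * code. */
static int product_overflows_32(uint32_t a, uint32_t b)
{
    return umulhi32(a, b) != 0;
}
```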
____________
|
|
|
|
|
You're probably throwing too many threads at it and registers are being spilled to the L1 cache. Fiddle with the -m option and it might improve
Occupancy analysis for kernel 'd_check_more_ns_32_fermi' for context 'Experimental Interleaving 2x : Device_0 : Context_0' :
Register Ratio = 0.875 ( 28672 / 32768 ) [27 registers per thread]
____________
|
|
|
|
|
|
Reverted some of my initial changes (do while again...):
Thread 0 completed
Waiting for threads to exit
Sieve complete: 100000000000000 <= p < 100001000000000
Found 173 factors
count=31019409,sum=0x284af85735fd771f
Elapsed time: 72.97 sec. (0.02 init + 72.95 sieve) at 13708667 p/sec.
Processor time: 4.50 sec. (0.02 init + 4.48 sieve) at 223417566 p/sec.
Average processor utilization: 1.17 (init), 0.06 (sieve)
I have a working 4x version, but currently the speed gain == 0 compared with the stock binary running with the 270.26 drivers (with reduced BLOCKSIZE; otherwise matters would be far worse...). Now I'm thinking about a way to make sure that (unsigned int)kpos != 0, so I can remove a branch, because this would result in an interesting speed gain:
Thread 0 completed
Waiting for threads to exit
Sieve complete: 100000000000000 <= p < 100001000000000
Found 173 factors
count=31019409,sum=0x284af85735fd771f
Elapsed time: 63.73 sec. (0.02 init + 63.71 sieve) at 15696477 p/sec.
Processor time: 4.53 sec. (0.02 init + 4.51 sieve) at 221831638 p/sec.
Average processor utilization: 1.17 (init), 0.07 (sieve)
____________
|
|
|
|
|
You're probably throwing too many threads at it and registers are being spilled to the L1 cache. Fiddle with the -m option and it might improve
Occupancy analysis for kernel 'd_check_more_ns_32_fermi' for context 'Experimental Interleaving 2x : Device_0 : Context_0' :
Register Ratio = 0.875 ( 28672 / 32768 ) [27 registers per thread]
I reduced the register pressure for the X2 version a little bit and fused the my_factor_found variables into a 16-bit integer... I had to left-shift one of the two original my_factor_found variables (two bits of an unsigned char used) by 9 bits instead of the more natural 8 bits to outsmart the compiler, which would generate 10% slower code otherwise...
Occupancy analysis for kernel 'd_check_more_ns_X2_32_fermi' for context 'X2 speed record - safe : Device_0 : Context_0' :
Kernel details : Grid size: 14 x 1, Block size: 512 x 1 x 1
Register Ratio = 0.9375 ( 30720 / 32768 ) [20 registers per thread]
Shared Memory Ratio = 0 ( 0 / 49152 ) [0 bytes per Block]
Active Blocks per SM = 3 : 8
Active threads per SM = 1536 : 1536
Occupancy = 1 ( 48 / 48 )
Achieved occupancy = 0.666667 (on 7 SMs)
Occupancy limiting factor = None
(Retired IPC for the kernel: 2.26)
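The flag-fusion trick can be sketched in plain C: two small "factor found" fields packed into one 16-bit integer, with the second shifted by 9 bits rather than the natural 8 (the helper names are mine; the real kernel code may differ):

```c
#include <stdint.h>

/* Pack two small "factor found" bitfields (each using only a couple of
 * low bits of an unsigned char) into one 16-bit integer. The second
 * field is shifted by 9 rather than the natural 8, per the compiler
 * workaround described above. */
static uint16_t pack_flags(uint8_t found_a, uint8_t found_b)
{
    return (uint16_t)(found_a | ((uint16_t)found_b << 9));
}

static uint8_t unpack_a(uint16_t packed) { return (uint8_t)(packed & 0xFF); }
static uint8_t unpack_b(uint16_t packed) { return (uint8_t)(packed >> 9); }
```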
____________
|
|
|
|
|
|
GF100 comparison of my tuned version (sm_20) versus the stock binary (sm_10) on a GTX 470 running at stock clocks (270.26beta drivers):
Only an 11% gain here (compared with the 20% gain on the GTX 460).
Retired IPC on the GTX 460: 2.26 (sm_21)
Retired IPC on the GTX 470: 1.70 (sm_20)
The GTX 470 is 26% faster with 33% more shaders running at lower clock rates (607 MHz vs 725 MHz). The performance per shader per MHz is around 50% higher on the GTX 470.
---
shmget in attach_shmem: Invalid argument
Can't set up shared mem: -1
Will run in standalone mode.
Sieve started: 100000000000000 <= p < 100001000000000
Thread 0 starting
Detected GPU 0: GeForce GTX 470
Detected compute capability: 2.0
Detected 14 multiprocessors.
Thread 0 completed
Sieve complete: 100000000000000 <= p < 100001000000000
count=31019409,sum=0x284af85735fd771f
Elapsed time: 66.47 sec. (0.08 init + 66.39 sieve) at 15064639 p/sec.
Processor time: 5.09 sec. (0.03 init + 5.06 sieve) at 197631800 p/sec.
Average processor utilization: 0.38 (init), 0.08 (sieve)
called boinc_finish
---
Thread 0 completed
Waiting for threads to exit
Sieve complete: 100000000000000 <= p < 100001000000000
Found 173 factors
count=31019409,sum=0x284af85735fd771f
Elapsed time: 59.88 sec. (0.03 init + 59.85 sieve) at 16710806 p/sec.
Processor time: 4.33 sec. (0.04 init + 4.30 sieve) at 232778571 p/sec.
Average processor utilization: 1.24 (init), 0.07 (sieve)
ralf@quadriga:~/source/PSieve-CUDA/pps$
---
Occupancy analysis for kernel 'd_check_more_ns_X2_32_fermi' for context 'GTX470-redc-ralf-sm_20-X2 : Device_0 : Context_0' :
Kernel details : Grid size: 112 x 1, Block size: 128 x 1 x 1
Register Ratio = 0.75 ( 24576 / 32768 ) [20 registers per thread]
Shared Memory Ratio = 0 ( 0 / 49152 ) [0 bytes per Block]
Active Blocks per SM = 8 : 8
Active threads per SM = 1024 : 1536
Occupancy = 0.666667 ( 32 / 48 )
Achieved occupancy = 0.666667 (on 14 SMs)
Occupancy limiting factor = Block-Size
____________
|
|
|
|
|
The GTX 470 is 26% faster with 33% more shaders running at lower clock rates (607 MHz vs 725 MHz). The performance per shader per MHz is around 50% higher on the GTX 470.
Scratch that: I've accidentally inverted the number of shaders/cores. The GTX 470 is of course only 12% (and not 50%) more efficient per shader/core per MHz running my tuned tpsieve version.
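For reference, the corrected figure follows from the numbers in the post: 26% faster overall, 33% more shaders, 607 vs 725 MHz. A quick sanity check (the function is mine, purely illustrative):

```c
/* Relative performance per shader per MHz, from the ratios given above:
 * overall speedup divided by (shader-count ratio * clock ratio). */
static double per_shader_per_mhz(double speedup,
                                 double shader_ratio,
                                 double clock_a_mhz, double clock_b_mhz)
{
    return speedup / (shader_ratio * (clock_a_mhz / clock_b_mhz));
}
/* per_shader_per_mhz(1.26, 4.0/3.0, 607.0, 725.0) comes out near 1.13,
 * i.e. roughly the ~12% efficiency edge quoted in the correction. */
```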
____________
|
|
|
|
|
|
GTX 470 / GF 100 results (270.26beta drivers):
Thread 0 completed
Waiting for threads to exit
Sieve complete: 100000000000000 <= p < 100001000000000
Found 173 factors
count=31019409,sum=0x284af85735fd771f
Elapsed time: 59.88 sec. (0.03 init + 59.85 sieve) at 16710806 p/sec.
Processor time: 4.33 sec. (0.04 init + 4.30 sieve) at 232778571 p/sec.
Average processor utilization: 1.24 (init), 0.07 (sieve)
ralf@quadriga:~/source/PSieve-CUDA/pps$
GTX 470 / GF 100 results with the new drivers (270.41.06):
Thread 0 completed
Waiting for threads to exit
Sieve complete: 100000000000000 <= p < 100001000000000
Found 173 factors
count=31019409,sum=0x284af85735fd771f
Elapsed time: 53.89 sec. (0.03 init + 53.86 sieve) at 18569398 p/sec.
Processor time: 4.91 sec. (0.04 init + 4.87 sieve) at 205257955 p/sec.
Average processor utilization: 1.10 (init), 0.09 (sieve)
More than 10% faster
____________
|
|
|
|
|
GTX 470 / GF 100 results (270.26beta drivers):
Thread 0 completed
Waiting for threads to exit
Sieve complete: 100000000000000 <= p < 100001000000000
Found 173 factors
count=31019409,sum=0x284af85735fd771f
Elapsed time: 59.88 sec. (0.03 init + 59.85 sieve) at 16710806 p/sec.
Processor time: 4.33 sec. (0.04 init + 4.30 sieve) at 232778571 p/sec.
Average processor utilization: 1.24 (init), 0.07 (sieve)
ralf@quadriga:~/source/PSieve-CUDA/pps$
GTX 470 / GF 100 results with the new drivers (270.41.06):
Thread 0 completed
Waiting for threads to exit
Sieve complete: 100000000000000 <= p < 100001000000000
Found 173 factors
count=31019409,sum=0x284af85735fd771f
Elapsed time: 53.89 sec. (0.03 init + 53.86 sieve) at 18569398 p/sec.
Processor time: 4.91 sec. (0.04 init + 4.87 sieve) at 205257955 p/sec.
Average processor utilization: 1.10 (init), 0.09 (sieve)
More than 10% faster
The stock app is a little bit faster too:
Thread 0 completed
Sieve complete: 100000000000000 <= p < 100001000000000
count=31019409,sum=0x284af85735fd771f
Elapsed time: 64.17 sec. (0.05 init + 64.12 sieve) at 15597932 p/sec.
Processor time: 4.39 sec. (0.03 init + 4.36 sieve) at 229151421 p/sec.
Average processor utilization: 0.52 (init), 0.07 (sieve)
called boinc_finish
Interestingly, the 10% speed gain for my modified app is reproducible on my standard test range, but for current WUs the runtime is nearly the same.
____________
|
|
|
|
|
Interestingly the 10% speed gain for my modified app is reproducible on my standard test range but for current WUs the runtime is nearly the same.
Update: I've accidentally run the test with the new drivers with a 32 bit executable. The 64 bit executable is as slow/fast as before.
Time/WU - Modified application - 270.41.06 drivers - GTX 470:
32 bit BOINC: 13:26
64 bit BOINC: 14:58
____________
|
|
|
|
|
Yes, I compared apples with oranges...
OT:
Are you a member of the nVidia registered developer program?
I ask because I saw your posting about "New LLVM-based compiler delivers up to 10% faster performance for many applications in Cuda4.1".
I posted the quote in the challenge thread. I've just compiled my 32 bit Linux tpsieve version with the 4.1rc2 SDK. The resulting 15-20% drop in performance made it slower than the stock 64 bit PG app. One caveat: I've not yet replaced the 285.05.09 drivers with the 285.05.23 developer drivers.
____________
|
|
|
|
|
I posted the quote in the challenge thread. I've just compiled my 32 bit Linux tpsieve version with the 4.1rc2 SDK. The resulting 15-20% drop in performance made it slower than the stock 64 bit PG app. One caveat: I've not yet replaced the 285.05.09 drivers with the 285.05.23 developer drivers.
I installed the new developer drivers yesterday. The new compiler still produces 15-20% slower code for my 32 bit Linux tpsieve version. I still have no benchmarks (4.0 vs. 4.1rc2) for the stock application.
____________
|
|
|
|
|
I installed the new developer drivers yesterday. The new compiler still produces 15-20% slower code for my 32 bit Linux tpsieve version. I still have no benchmarks (4.0 vs. 4.1rc2) for the stock application.
Looks like the new compiler has some serious problems when it comes to generating code for Fermi-based cards.
____________
|
|
|
rroonnaalldd Volunteer developer Volunteer tester
 Send message
Joined: 3 Jul 09 Posts: 1213 ID: 42893 Credit: 34,634,263 RAC: 0
                 
|
Looks like the new compiler has some serious problems when it comes to generating code for fermi based cards.
Do you have some benchmarks for Cuda32 and Cuda40?
____________
Best wishes. Knowledge is power. by jjwhalen
|
|
|
|
|
Do you have some benchmarks for Cuda32 and Cuda40?
Not for the stock app.
For my 32/64 bit Linux versions the performance of the binaries compiled with the various SDK versions (3.2, 4.0 and 4.1rc2) doesn't differ much (< 1%). The 64 bit versions are 10-15% slower than their 32 bit counterparts.
The above-mentioned performance drop occurs when the 4.1rc2 SDK compiler generates code for CC 2.0 or 2.1 cards. In the process of isolating the source of the problem I reexamined my modifications and managed to squeeze out another 35 seconds per WU on the GTX 460 (32 bit binary; no gain or loss for the 470).
____________
|
|
|
|
|
|
What about a CUDA 5 build for a speedup and/or reduced CPU usage? |
|
|
rroonnaalldd Volunteer developer Volunteer tester
 Send message
Joined: 3 Jul 09 Posts: 1213 ID: 42893 Credit: 34,634,263 RAC: 0
                 
|
|
I tried this with Cuda32 and saw no speedup. I believe there will be no change with even newer CUDA versions.
On lubuntu64-12.10 I get 1% CPU usage via top, and the same low values were reported for XP. The GPU usage was at 99% all the time.
____________
Best wishes. Knowledge is power. by jjwhalen
|
|
|
|
|
|
My result on a GTX 470 and Win7:
PPS (Sieve) v1.39 (cudaPPSsieve)
Average processor utilization: 1.00 (init), 0.10 (sieve)
At most 10%. |
|
|
Ken_g6 Volunteer developer
 Send message
Joined: 4 Jul 06 Posts: 921 ID: 3110 Credit: 218,950,902 RAC: 3,798
                          
|
|
Hey, long time no post!
I believe I have an app_config.xml that somewhat improves tpsieve performance on Nvidia Kepler and Maxwell. But I'm in the middle of a different race so I can't test it out right now.
<app_config>
<app_version>
<app_name>pps_sr2sieve</app_name>
<plan_class>cuda23</plan_class>
<cmdline>-m64</cmdline>
</app_version>
</app_config>
This code requires BOINC 7.4.39 or higher. You can do the same thing with app_info.xml, but that's much harder to write.
The other problem is this cmdline switch doesn't leave any mention in the log about whether it works or not! (My bad!) So if it works you should see a small speedup running one WU at a time (7-8% on Maxwell, maybe more on Kepler); if not you probably just won't see any change.
Also, are people getting more than a 7-8% speedup on Kepler/Maxwell by running two WUs at once? If so, I may need to rewrite the CUDA app for next year's race.
Thanks, all!
____________
|
|
|
Honza Volunteer moderator Volunteer tester Project scientist Send message
Joined: 15 Aug 05 Posts: 1905 ID: 352 Credit: 4,056,750,098 RAC: 4,353,900
                                 
|
This code requires BOINC 7.4.39 or higher. You can do the same thing with app_info.xml, but that's much harder to write.
Probably 7.2.39 since latest one is 7.4.35 (released yesterday).
____________
My stats
Badge score: 1*1 + 5*1 + 8*3 + 9*11 + 10*1 + 11*1 + 12*3 = 186 |
|
|
Ken_g6 Volunteer developer
 Send message
Joined: 4 Jul 06 Posts: 921 ID: 3110 Credit: 218,950,902 RAC: 3,798
                          
|
|
That's right. I had to upgrade to 7.2.42 because my Linux repository was a little behind.
Hey, I almost typed "2.4.39" :P
____________
|
|
|
Ken_g6 Volunteer developer
 Send message
Joined: 4 Jul 06 Posts: 921 ID: 3110 Credit: 218,950,902 RAC: 3,798
                          
|
|
Nobody testing this here? Someone else was getting some kind of error like wrong app_name.
Maybe this works better?
<app_config>
<app>
<name>pps_sr2sieve</name>
<max_concurrent>9</max_concurrent>
</app>
<app_version>
<app_name>pps_sr2sieve</app_name>
<plan_class>cuda23</plan_class>
<avg_ncpus>0.1</avg_ncpus>
<ngpus>1</ngpus>
<cmdline>-m64</cmdline>
</app_version>
</app_config>
____________
|
|
|
|
|
Nobody testing this here? Someone else was getting some kind of error like wrong app_name.
Maybe this works better?
<app_config>
<app>
<name>pps_sr2sieve</name>
<max_concurrent>9</max_concurrent>
</app>
<app_version>
<app_name>pps_sr2sieve</app_name>
<plan_class>cuda23</plan_class>
<avg_ncpus>0.1</avg_ncpus>
<ngpus>1</ngpus>
<cmdline>-m64</cmdline>
</app_version>
</app_config>
I briefly skimmed over this thread, Ken. How does one test your app_config? Do I save it as app_config.xml in the same slot folder as the PPS (Sieve) 1.39 executable? |
|
|
Ken_g6 Volunteer developer
 Send message
Joined: 4 Jul 06 Posts: 921 ID: 3110 Credit: 218,950,902 RAC: 3,798
                          
|
|
You put it in the project directory, which, yes, contains the executable.
One thing I notice in the list of applications from the challenge thread: For Mac you'd want plan class "cuda32" instead of "cuda23".
____________
|
|
|
|
|
|
I've tried the app_config.xml file by placing it in the project directory, which contains all of the project executables, and the improvement is very, very slight: on the order of maybe 2 or 3 seconds on a big Kepler. |
|
|
Ken_g6 Volunteer developer
 Send message
Joined: 4 Jul 06 Posts: 921 ID: 3110 Credit: 218,950,902 RAC: 3,798
                          
|
|
Are you sure there is an improvement? Are there any error messages in the log saying things like "Entry in app_config.xml for app 'pps_sr2sieve', plan class 'cuda23' doesn't match any app versions"?
____________
Are you sure there is an improvement? Are there any error messages in the log saying things like "Entry in app_config.xml for app 'pps_sr2sieve', plan class 'cuda23' doesn't match any app versions"?
The stderr.txt contains only the following:
Sieve started: 54431751000000000 <= p < 54431760000000000
Thread 0 starting
Detected GPU 0: GeForce GTX TITAN Black
Detected compute capability: 3.5
Detected 15 multiprocessors.
Ken_g6 Volunteer developer
 Send message
Joined: 4 Jul 06 Posts: 921 ID: 3110 Credit: 218,950,902 RAC: 3,798
                          
I'm asking for testing of this app_config.xml file in BOINC.
Edit: If you want to test this manually, just add -m64 to the command line. But that doesn't help with testing the file in BOINC.
____________
You put it in the project directory, which, yes, contains the executable.
I'm asking for testing of this app_config.xml file in BOINC.
Edit: If you want to test this manually, just add -m64 to the command line. But that doesn't help with testing the file in BOINC.
I don't want to test this manually. This is where I had placed the app_config.xml file:
J:\ProgramData\BOINC\projects\www.primegrid.com
I have just now moved it here for further testing, in case the previous location was incorrect:
J:\ProgramData\BOINC
Ken_g6 Volunteer developer
 Send message
Joined: 4 Jul 06 Posts: 921 ID: 3110 Credit: 218,950,902 RAC: 3,798
                          
I don't want to test this manually. This is where I had placed the app_config.xml file:
J:\ProgramData\BOINC\projects\www.primegrid.com
That's correct.
OK, let's look at the log in a different place. Use shift-ctrl-E to bring up the event log.
____________
BOINC Manager Event Log:
12/17/2014 11:31:55 AM | | Starting BOINC client version 7.4.27 for windows_x86_64
12/17/2014 11:31:55 AM | | log flags: file_xfer, sched_ops, task
12/17/2014 11:31:55 AM | | Libraries: libcurl/7.33.0 OpenSSL/1.0.1h zlib/1.2.8
12/17/2014 11:31:55 AM | | Data directory: J:\ProgramData\BOINC
12/17/2014 11:31:55 AM | | Running under account Alan
12/17/2014 11:31:55 AM | | CUDA: NVIDIA GPU 0: GeForce GTX TITAN Black (driver version 344.75, CUDA version 6.5, compute capability 3.5, 4096MB, 4096MB available, 6172 GFLOPS peak)
12/17/2014 11:31:55 AM | | OpenCL: NVIDIA GPU 0: GeForce GTX TITAN Black (driver version 344.75, device version OpenCL 1.1 CUDA, 6144MB, 4096MB available, 6172 GFLOPS peak)
12/17/2014 11:31:55 AM | | OpenCL CPU: Intel(R) Core(TM) i7-4790K CPU @ 4.00GHz (OpenCL driver vendor: Intel(R) Corporation, driver version 3.0.1.10878, device version OpenCL 1.2 (Build 76413))
12/17/2014 11:31:55 AM | | Host name: HAF932-Haswell
12/17/2014 11:31:55 AM | | Processor: 4 GenuineIntel Intel(R) Core(TM) i7-4790K CPU @ 4.00GHz [Family 6 Model 60 Stepping 3]
12/17/2014 11:31:55 AM | | Processor features: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss htt tm pni ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes f16c rdrand syscall nx lm avx avx2 vmx tm2 pbe fsgsbase bmi1 hle smep bmi2
12/17/2014 11:31:55 AM | | OS: Microsoft Windows 8: Professional x64 Edition, (06.02.9200.00)
12/17/2014 11:31:55 AM | | Memory: 31.89 GB physical, 63.89 GB virtual
12/17/2014 11:31:55 AM | | Disk: 1.82 TB total, 1.42 TB free
12/17/2014 11:31:55 AM | | Local time is UTC -8 hours
12/17/2014 11:31:55 AM | PrimeGrid | Found app_config.xml
12/17/2014 11:31:55 AM | | Config: report completed tasks immediately
12/17/2014 11:31:55 AM | | Config: use all coprocessors
12/17/2014 11:31:55 AM | | Config: don't compute while Crysis2.exe is running
12/17/2014 11:31:55 AM | | Config: don't compute while Crysis3.exe is running
12/17/2014 11:31:55 AM | Einstein@Home | URL http://einstein.phys.uwm.edu/; Computer ID 4234366; resource share 100
12/17/2014 11:31:55 AM | SETI@home | URL http://setiathome.berkeley.edu/; Computer ID 5153529; resource share 1000
12/17/2014 11:31:55 AM | GPUGRID | URL http://www.gpugrid.net/; Computer ID 138693; resource share 10
12/17/2014 11:31:55 AM | PrimeGrid | URL http://www.primegrid.com/; Computer ID 325757; resource share 1000
12/17/2014 11:31:55 AM | World Community Grid | URL http://www.worldcommunitygrid.org/; Computer ID 2333557; resource share 100
12/17/2014 11:31:55 AM | PrimeGrid | General prefs: from PrimeGrid (last modified 29-May-2014 01:28:26)
12/17/2014 11:31:55 AM | PrimeGrid | Host location: none
12/17/2014 11:31:55 AM | PrimeGrid | General prefs: using your defaults
12/17/2014 11:31:55 AM | | Reading preferences override file
12/17/2014 11:31:55 AM | | Preferences:
12/17/2014 11:31:55 AM | | max memory usage when active: 31020.94MB
12/17/2014 11:31:55 AM | | max memory usage when idle: 32653.62MB
12/17/2014 11:31:55 AM | | max disk usage: 10.00GB
12/17/2014 11:31:55 AM | | max CPUs used: 2
12/17/2014 11:31:55 AM | | (to change preferences, visit a project web site or select Preferences in the Manager)
12/17/2014 11:31:55 AM | | Not using a proxy
EDITED: Forgot to move the app_config.xml file back to the PrimeGrid project directory.
On a perhaps related note, I've tried running a few PPS Sieve tasks today two-at-a-time on a single card using the <max_concurrent> tags in app_config.xml. I see an improvement in throughput, but a very small one. I'm on Linux (Ubuntu 14.04), with a GTX 770 (EVGA with ACX cooling), a 2600K, and BOINC 7.2.42.
One at a time, these tasks were very consistent at about 502 seconds. Running two in parallel, they were very consistent at about 988 seconds each. I was not using the newly-announced "-m64" flag in either case, and I told BOINC to leave two cores free (50%).
GPU temperatures and fan speeds were not noticeably different either.
FWIW :-)
--Gary
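For what it's worth, those numbers can be turned into a concrete throughput comparison; a minimal sketch using only the task times quoted above (502 s per task alone, 988 s each when two run concurrently):

```python
# Throughput comparison for one-at-a-time vs. two-at-a-time GPU tasks,
# using the times reported in the post above.
single_time = 502.0   # seconds per task, running one at a time
pair_time = 988.0     # seconds per task, running two concurrently

single_rate = 1.0 / single_time   # tasks completed per second, solo
pair_rate = 2.0 / pair_time       # two tasks finish every 988 seconds

gain_pct = (pair_rate / single_rate - 1.0) * 100.0
print(f"throughput gain: {gain_pct:.1f}%")  # → throughput gain: 1.6%
```

So "very small" is right: about a 1.6% gain, which is consistent with the card already being close to fully occupied by a single sieve task.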