mfl0p Project administrator Volunteer developer
Joined: 5 Apr 09 Posts: 249 ID: 38042 Credit: 2,462,723,269 RAC: 3,314,976
Thanks to Gerrit for kindly sharing his work so far, I have been able to start working on CUDA application development.
I have my first version available for test here:
http://mfl0p.angelfire.com/testing/
To keep things simple for now, this is a Linux 64-bit binary, built with CUDA SDK 2.3. Once I get the base code completed I will move to compiling on Windows machines.
The file contains the binary gpuapp100, a known good results file, and libcudart.so.2 (if you need it).
To test the file, run with command line:
time ./gpuapp100 366384 366384 0 --device 0
And compare the results with the included TEST file. If you have two GPUs in the system, you can specify GPU #2 with "--device 1" instead. It would be great for someone with two GPUs to run the app on both at the same time and check the results.
Thanks
____________
Hello
I ran two at the same time on a GTX 295 and the results are:
[sash@pc-sascha-linux CUDA1]$ time ./gpuapp100 366384 366384 0 --device 0
Compiled Nov 26 2009
2 CUDA device(s) found
[0] GeForce GTX 295 (30 MPs; 1242 MHz) compute-capability: 1.3
[1] GeForce GTX 295 (30 MPs; 1242 MHz) compute-capability: 1.3
Using GPU: 0
real 1m30.402s
user 0m0.360s
sys 0m0.643s
and
[sash@pc-sascha-linux CUDA2]$ time ./gpuapp100 366384 366384 0 --device 1
Compiled Nov 26 2009
2 CUDA device(s) found
[0] GeForce GTX 295 (30 MPs; 1242 MHz) compute-capability: 1.3
[1] GeForce GTX 295 (30 MPs; 1242 MHz) compute-capability: 1.3
Using GPU: 1
real 1m34.211s
user 0m0.370s
sys 0m1.313s
mfl0p Project administrator Volunteer developer
Joined: 5 Apr 09 Posts: 249 ID: 38042 Credit: 2,462,723,269 RAC: 3,314,976
Thanks for testing. Did the result files match with cmp?
Also, my results on the test range with two test machines:
GeForce GTS 240 (14 MPs; 1620 MHz) compute-capability: 1.1
1m 53s
GeForce 8500 GT (2 MPs; 918 MHz) compute-capability: 1.1
13m 27s
____________
mfl0p Project administrator Volunteer developer
Joined: 5 Apr 09 Posts: 249 ID: 38042 Credit: 2,462,723,269 RAC: 3,314,976
I've decided AP26 on GPU is too slow to continue further development. The search method requires frequent division and lookups in global VRAM, both of which are very slow (roughly 16 and 600 cycles, respectively).
When comparing power consumption per workunit, it is just not feasible to run AP26 on the GPU.
For example, my Core i7 @ 2.8 GHz will complete 16 workunits an hour.
The GTS 240, while using roughly the same amount of power, will only do 5 workunits an hour.
This is not even taking into account the fact that GPUs can, and do, make calculation errors that are currently not double-checked by the AP26 project.
____________
While YOU might think it's a waste, why don't you ask people if they want a GPU app for AP26? I sure do, even if it's a bit slower.
I do have a suggestion, though. GPUs have a lot of cores, right? Try to have one "number" at a time being tested on each of the cores (100 cores, 100 "numbers" being tested), even if it takes a bit longer per "number" than trying to split one "number" up across 100+ cores. You might have to develop a separate AP26 GPU WU for this approach, which should also take more advantage of each core's limited shared memory (faster than global VRAM). I also have the CUDA SDK installed on my Windows machine, and I'd be willing to help even though I have limited CUDA coding experience; I'm experienced in Windows development with C++/asm.
____________
I too think it should be done.
My app for Linux is working, and a GTX 260 is a bit faster than one core of my Xeon W3520.
The Xeon has 8 threads at 2.66 GHz each; the GTX 260 runs at only 1.242 GHz.
The graphics card is hampered by the fact that the device RAM is so slow.
Taking into account that the graphics card works on top of the CPU, that is not bad at all, and older PCs with a Pentium D or worse would benefit a lot from the app.
Scott Brown Volunteer moderator Project administrator Volunteer tester Project scientist
Joined: 17 Oct 05 Posts: 2392 ID: 1178 Credit: 18,638,012,045 RAC: 7,046,910
For example, my Corei7 @ 2.8ghz will complete 16 workunits an hour.
The GTS240 is using roughly the same amount of power, will only do 5 workunits an hour.
Are you checking this at the outlet or basing it on power consumption ratings?
If the latter, then your numbers are going to be far off because, unlike a CPU under full load, a GPU almost never reaches its peak power usage. For example, my i7 920 (with the monitor powered off and measured at the plug) under full 8-thread CPU load with a 9500 GT also under load (Collatz apps) never goes above 173 W (it bounces between 166 and 173). The 9500 GT is rated at 50 W by Nvidia, but this is obviously not the functional power draw, since the CPU and other components (a couple of USB devices, a hard drive spinning, etc.) would push this well above the 173 mark if the GPU were really pulling that much power.
Also, the GTS 240 is rated for a maximum power draw of about 120 W, which should be a bit less than the i7?
Last, this is a bit of an apples-to-oranges comparison. The GTS 240 is a fine card, but it is essentially a revamped 9800 GT (both with 112 shaders), which is at most at the bottom of the mid-range in GPUs. The i7, on the other hand, is a top-end CPU. A better comparison would be against the GTX 275 or comparable (240 shaders).
____________
141941*2^4299438-1 is prime!
A Core i7 has a TDP of 130 W.
My Nehalem 5504 with 8 threads and a 9400 GT draws 135 W under full load without graphics load and 155 W with graphics load; the card is rated at 50 W (max).
A GTX 260 should draw about 80 W more under full load than in idle mode (measured with an Intel Core i7 965 Extreme + 300 GB WD VelociRaptor).
____________
I also agree that it should be done. Even if using the GPU is less efficient than using the CPU, the fact remains that by adding the GPU on top of the CPU one will be able to get more work done.
mfl0p Project administrator Volunteer developer
Joined: 5 Apr 09 Posts: 249 ID: 38042 Credit: 2,462,723,269 RAC: 3,314,976
I think some of you think I'm making the decision on GPU-AP26.
I'm not. I'm not a member of the PrimeGrid group.
I consider myself a decent programmer who has done some work on the AP26 application in the past. I'm familiar with the application and where the bottlenecks are. I had someone with a GeForce 285 test my CUDA app version, and it will complete a workunit in 6 minutes.
And regarding my slow, old video card's specs: I know this already, thank you. It still completes workunits in 10 minutes.
After all I've done to help make the project progress faster, I do not like seeing some of your comments directed at me.
Bryan
____________
Scott Brown Volunteer moderator Project administrator Volunteer tester Project scientist
Joined: 17 Oct 05 Posts: 2392 ID: 1178 Credit: 18,638,012,045 RAC: 7,046,910
I think some of you think I'm making the decision on GPU-AP26.
I'm not. I'm not a member of the PrimeGrid group.
I consider myself a decent programmer who has done some work on the AP26 application in the past. I'm familiar with the application and where the bottlenecks are. I had someone with a GeForce 285 test my CUDA app version, and it will complete a workunit in 6 minutes.
And regarding my slow, old video card's specs: I know this already, thank you. It still completes workunits in 10 minutes.
After all I've done to help make the project progress faster, I do not like seeing some of your comments directed at me.
Bryan
Bryan,
I apologize if you picked up an insulting tone from my message... it was certainly not intended. The GTS 240 is the fastest card that I have personal access to as well, and my point had nothing to do with calling your card slow or old. I was simply pointing out that comparing it to one of the fastest desktop CPUs available was not the best comparison to make. My question about the wattage issue was a legitimate one. I have seen many people reference the Nvidia ratings in this way (even several very experienced programmers and techs who had not worked much on GPUs), including myself at one time. I am still curious whether you were using those figures for the comparison or some other method (I have to deal with a lot of systems with low-end PSUs, so power efficiency issues are always of interest to me).
Last, as I have said elsewhere on these boards, thank you for all your programming efforts here at PG. I wish my own programming skills were applicable here, and I am very grateful that you (and others) have both the skill and the time to commit to the project. I think some of our community see the "developer" tag first and forget the "volunteer" in front of it, and that probably explains why some might feel you have some decision-making authority.
Scott
____________
141941*2^4299438-1 is prime!
mfl0p Project administrator Volunteer developer
Joined: 5 Apr 09 Posts: 249 ID: 38042 Credit: 2,462,723,269 RAC: 3,314,976
OK.... Here is my latest version:
http://mfl0p.angelfire.com/testing/
Required:
0. CUDA compute capability 1.1 or higher
1. Linux x86_64
2. Latest Nvidia driver installed (currently 190.42)
3. libstdc++ 64-bit (the CUDA binary cannot be statically linked for various reasons; the app was built with GCC 4.1, so any 64-bit Linux distro should be OK)
4. If you don't have the CUDA 2.3 toolkit installed, you will have to make the OS aware of libcudart.so.2 (included in the .tar, renamed from libcudart.so.2.3), e.g.:
say you extracted the files to /home/ap26test, then:
export LD_LIBRARY_PATH=/home/ap26test:$LD_LIBRARY_PATH
We need to get quite a few people testing/timing this program vs the known good results file on the website (has 3 different test ranges, and 3 different shifts).
You must use the command lines:
time ./ap26_CUDA2.3_x86_64_Linux 366384 366384 0 --device 0
time ./ap26_CUDA2.3_x86_64_Linux 3744537 3744541 64 --device 0
time ./ap26_CUDA2.3_x86_64_Linux 76000000 76000003 640 --device 0
then use "cmp" to check the results, e.g.:
cmp SOL-AP26.txt intel366384_366384_0.txt
cmp SOL-AP26.txt intel3744537_3744541_64.txt
cmp SOL-AP26.txt intel76000000_76000003_640.txt
if you have multiple GPUs, you can specify device 0, 1, etc. on the command line
Then, it is up to the project admins if they want to add this application as official.
Bryan
____________
Lumiukko Volunteer tester
Joined: 7 Jul 08 Posts: 165 ID: 25183 Credit: 875,031,530 RAC: 112,652
OK.... Here is my latest version:
http://mfl0p.angelfire.com/testing/
...
We need to get quite a few people testing/timing this program vs the known good results file on the website (has 3 different test ranges, and 3 different shifts).
...
Then, it is up to the project admins if they want to add this application as official.
Bryan
Here are the timings from my system:
Intel i7 940 @2.93GHz 6GB RAM
Ubuntu 9.04 Kernel Linux 2.6.28-16-generic
GeForce GTX 285 (30 MPs; 1476 MHz) compute-capability: 1.3
NVidia driver 190.42
time ./ap26_CUDA2.3_x86_64_Linux 366384 366384 0 --device 0
real 0m59.093s
user 0m0.664s
sys 0m0.724s
time ./ap26_CUDA2.3_x86_64_Linux 3744537 3744541 64 --device 0
real 3m0.660s
user 0m2.000s
sys 0m2.168s
time ./ap26_CUDA2.3_x86_64_Linux 76000000 76000003 640 --device 0
real 2m54.699s
user 0m2.020s
sys 0m2.360s
All result files match.
--
Lumiukko
Sysadm@Nbg Volunteer moderator Volunteer tester Project scientist
Joined: 5 Feb 08 Posts: 1224 ID: 18646 Credit: 877,041,954 RAC: 316,786
We need to get quite a few people testing/timing this program vs the known good results file on the website (has 3 different test ranges, and 3 different shifts).
Hi Bryan
I have done the tests for you:
time ./ap26_CUDA2.3_x86_64_Linux 366384 366384 0 --device 0
real 4m43.410s
user 0m0.600s
sys 0m0.810s
time ./ap26_CUDA2.3_x86_64_Linux 3744537 3744541 64 --device 0
real 11m13.122s
user 0m2.520s
sys 0m2.760s
time ./ap26_CUDA2.3_x86_64_Linux 76000000 76000003 640 --device 0
real 12m41.019s
user 0m2.390s
sys 0m2.870s
cmp: no differences
and this was the machine:
- AMD Phenom(tm) 9550 Quad
- GeForce 9800 GTX+ (16 MPs; 1836 MHz) compute-capability: 1.1
- Linux 2.6.31-14-server (ubuntu 9.10 64-bit server)
____________
Sysadm@Nbg
my current lucky number: 113856050^65536 + 1
PSA-PRPNet-Stats-URL: http://u-g-f.de/PRPNet/
Computer specs:
Intel Core 2 Duo T7700 @2.4GHz 2GB DDR2 SDRAM - 667.0 MHz
GeForce 8600M GT (4 MPs; 933 MHz) compute-capability: 1.1
Ubuntu 9.04 Kernel Linux 2.6.28-16-generic
NVIDIA Driver Version 190.18
time ./ap26_CUDA2.3_x86_64_Linux 366384 366384 0 --device 0
real 7m13.941s
user 0m1.940s
sys 0m3.072s
time ./ap26_CUDA2.3_x86_64_Linux 3744537 3744541 64 --device 0
real 22m13.210s
user 0m6.420s
sys 0m9.273s
time ./ap26_CUDA2.3_x86_64_Linux 76000000 76000003 640 --device 0
real 21m36.473s
user 0m6.448s
sys 0m9.261s
All of the results files matched.
____________
GTX 260 (27 cores @ 1242 MHz)
366384 366384 0
76.2 s (1m16.2s)
3744537 3744541 64
235.8 s (3m55.8s)
76000000 76000003 640
229.2 s (3m49.2s)
FX 580 (4 cores @ 1100 MHz)
366384 366384 0
352.9 s (5m52.9s)
3744537 3744541 64
1082.5 s (18m2.5s)
76000000 76000003 640
1053.4 s (17m33.4s)
Is it normal that the app prints this?
Checkpoint: 0 / 7
Checkpoint: 1 / 7
Checkpoint: 2 / 7
Checkpoint: 3 / 7
Checkpoint: 4 / 7
Checkpoint: 5 / 7
Checkpoint: 6 / 7
And then the solution, but no 7th checkpoint?
Seems like the 7th checkpoint is the solution?
Solutions are matching.
I would propose to use this app; the GPU stays very cool with this one (65/69 °C).
The GTX 260 would be good for 4462 C/day, the FX 580 for 963 C/day.
I can't see where this is bad; a Core i7 920 alone would be around 10,500 with around 150-160 watts.
mfl0p Project administrator Volunteer developer
Joined: 5 Apr 09 Posts: 249 ID: 38042 Credit: 2,462,723,269 RAC: 3,314,976
Is it normal that the app prints this?
Checkpoint: 0 / 7
Checkpoint: 1 / 7
Checkpoint: 2 / 7
Checkpoint: 3 / 7
Checkpoint: 4 / 7
Checkpoint: 5 / 7
Checkpoint: 6 / 7
And then the solution, but no 7th checkpoint?
Seems like the 7th checkpoint is the solution?
Yes, I left that "checkpoint" printf in the code so that, when running in standalone mode, you know the program is actually progressing, and about how fast. 0-6 is normal; it's not actually writing a checkpoint to disk when it prints that. Call it a "beta" app feature. It's only cosmetic.
All BOINC checkpointing is handled in the background, and this code is not related at all.
____________
samuel7 Volunteer tester
Joined: 1 May 09 Posts: 89 ID: 39425 Credit: 257,425,010 RAC: 0
[0] GeForce 9800 GT (14 MPs; 1500 MHz) compute-capability: 1.1
Core2 Q9550, 4 GB, Ubuntu 9.10, kernel 2.6.31-15
nVidia driver 190.42
I guess the app requires a CPU core; with all cores on BOINC it was very slow (11 min for 1 K).
366384 366384 0
with <=2 CPU cores on BOINC, 1m 47s
with 3 cores on BOINC, 1m 59s
3744537 3744541 64
with <=2 cores on BOINC, 5m 28s
with 3 cores on BOINC, 5m 38s
76000000 76000003 640
with <=2 cores on BOINC, 5m 18s
with 3 cores on BOINC, 5m 40s
Result files matched.
GPU temp max 75 C, idle 59 C
____________
11 minutes for one BOINC WU or 11 minutes for 1 K?
I did not see any CPU load on my client.
11 minutes for one BOINC WU would match your posted single-K times.
mfl0p Project administrator Volunteer developer
Joined: 5 Apr 09 Posts: 249 ID: 38042 Credit: 2,462,723,269 RAC: 3,314,976
re: CPU load
Each GPU workunit on both of my quads with CUDA cards requires about 9 to 20 seconds of CPU time. The slower the GPU, the more CPU time will be used. I've experienced 0.5-2.0% CPU load for the CUDA app while the CPU is running 4 threads of AP26 at the same time. On my systems the GPU runs at the same speed whether the CPU is at full load or not. Anything heavily using the video card will slow the app down.
Also, hopefully everyone has noticed minimal slowdown of the X video display while the app is running; the kernel runtimes are very small to keep the display responsive. I only notice a little slowdown with my old 2-MP 8500 GT card.
Also, with the app here are my speeds:
GeForce GTS 240 (14 MPs; 1620 MHz) compute-capability: 1.1
single K: 366384 1min 39sec
GeForce 8500 GT (2 MPs; 918 MHz) compute-capability: 1.1
single K: 366384 12min 50sec
____________
samuel7 Volunteer tester
Joined: 1 May 09 Posts: 89 ID: 39425 Credit: 257,425,010 RAC: 0
11 minutes for one BOINC WU or 11 minutes for 1 K?
I did not see any CPU load on my client.
11 minutes for one BOINC WU would match your posted single-K times.
With PPS Sieve tasks running on all four cores, the standalone test for the first range took 11 min. I didn't even want to try the 3-K ranges. With BOINC limited to three cores or less, the run time dropped to less than 2 min.
So I shouldn't see this slowdown? Is there something I can do? I'm a total newbie when it comes to Linux and running things on it, so forgive me if I'm missing something obvious. I'm amazed I got this running in a few hours of work.
____________
mfl0p Project administrator Volunteer developer
Joined: 5 Apr 09 Posts: 249 ID: 38042 Credit: 2,462,723,269 RAC: 3,314,976
I wouldn't worry about it. It's not what was expected, but then again it's not running under BOINC yet. Things may be ok there.
____________
samuel7 Volunteer tester
Joined: 1 May 09 Posts: 89 ID: 39425 Credit: 257,425,010 RAC: 0
OK, thanks.
And BIG thanks for your and everyone else's work in development of the CUDA application so far AND in the future.
____________
76.2 seconds with the "366384 366384 0" on my Xeon W3520 while running 8 AP26-CPU WUs and 1 of my AP26-GPU WUs on the second card.
mfl0p Project administrator Volunteer developer
Joined: 5 Apr 09 Posts: 249 ID: 38042 Credit: 2,462,723,269 RAC: 3,314,976
I have updated the Linux-64bit app on the website with a few changes:
added missing cudaThreadExit(); call per Nvidia docs
changed the "Checkpoint: 6 / 7" display while running standalone to a more readable version, e.g.: "Computation of K: 366384 is 28% complete"
and a few other minor things, nothing that will change computation results or speed.
Also, I have compiled a CUDA AP26 for Windows. However, the GPU slowdown that samuel7 is experiencing while running a fully loaded CPU also causes the Windows CUDA app to completely stall. This appears to be a problem with the way the current CUDA/OS interaction between threads takes place. It may also explain why some Linux versions like RHEL do not have this problem. Also related may be the "faster" kernel timer in Linux vs Windows. Either way, the only other option is to change to a busy-wait in the CPU thread of the GPU app, which would consume an entire CPU core, which I do not think is very efficient. Currently I am using block; I tried yield but it was of no help.
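For reference, the spin/yield/block choices being discussed map onto CUDA runtime device flags. A minimal sketch of the three policies (CUDA 2.x flag names; this is not the AP26 source, and the flag must be set before the CUDA context is created):

    #include <cuda_runtime.h>

    /* Select how the host thread waits for the GPU. Must be called
       before any CUDA call that creates the device context. */
    void set_sync_policy(int policy)
    {
        unsigned int flags;
        if (policy == 0)
            flags = cudaDeviceScheduleSpin;   /* busy-wait: lowest latency, pegs a CPU core */
        else if (policy == 1)
            flags = cudaDeviceScheduleYield;  /* yield the host thread between polls */
        else
            flags = cudaDeviceBlockingSync;   /* sleep until the GPU signals completion */
        cudaSetDeviceFlags(flags);
    }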
____________
Also, hopefully everyone has noticed minimal slowdown of the X video display while the app is running; the kernel runtimes are very small to keep the display responsive.
I've never had any GPU crunching app, or any of the code I fiddled around with in CUDA, cause my display to lag in any way, except when I'm running a game at the same time. Small kernel runtimes may contribute to wasted CPU cycles on OS-to-video delays. An application runtime profiler may be needed to determine where the wasted cycles are located.
I offer suggestions because I'm interested in getting this GPU app off the runway. I'd be willing to take a look at your source code, if you wish, to see if I can spot anything that may help; two heads are better than one, agreed? I'm grateful for all the work you have put into it, and I hope you are willing to continue improving the app.
____________
Running a GPU app blocks the GPU in the same way as a game does, or even more.
The profiler does not help, because it shows the GPU time of a kernel and not "wasted cycles".
And yes: the shorter the kernel runtime, the more time is left to display the window manager; or you can use only the second card for crunching.
mfl0p Project administrator Volunteer developer
Joined: 5 Apr 09 Posts: 249 ID: 38042 Credit: 2,462,723,269 RAC: 3,314,976
UPDATE
A Windows 32bit/64bit CUDA 2.3 application is available for testing at:
http://mfl0p.angelfire.com/testing
Updating your video drivers to the current version will make sure you have CUDA 2.3 drivers installed.
Since the application runs on the GPU, there is no reason to make a separate 64-bit binary. The 32-bit binary will be just as fast, and it runs on 64-bit Windows as well.
Thanks!
____________
Test results for the Windows app.
Nvidia GeForce GTX 275, stock speeds.
AMD Phenom II 955BE x4 3.4 GHz, 8 GB RAM
Windows 7 x64
3 cores running AP26: no slowdown experienced in the GPU app, only a few extra seconds added to the AP26 CPU tasks.
1 core running a game: experienced a 25% reduction in FPS.
Run times are all over the place, but that is to be expected while running during normal computer-usage hours.
GPU time to compute K: 366384 was 78 seconds
average CPU 1.06%
kernel execution time 48ms
GPU time to compute K: 3744539 was 76 seconds
GPU time to compute K: 3744540 was 82 seconds
GPU time to compute K: 3744541 was 68 seconds
average CPU 1.21%
kernel execution time ~50ms
GPU time to compute K: 76000000 was 81 seconds
GPU time to compute K: 76000001 was 78 seconds
GPU time to compute K: 76000002 was 77 seconds
average CPU 1.15%
kernel execution time ~49ms
Results seem to match at a glance; sorry, no fancy Linux commands here.
pschoefer Volunteer developer Volunteer tester
Joined: 20 Sep 05 Posts: 686 ID: 845 Credit: 2,910,184,413 RAC: 361,400
GeForce 9800 GT (14 MPs; 1875 MHz shader clock) compute-capability: 1.1
Q9450@3.2GHz
Windows XP x64
KMIN: 366384 KMAX: 366384 SHIFT: 0
Calculated kernel execution time: 67 ms
GPU time to compute K: 366384 was 119 seconds
Result file matches.
KMIN: 3744537 KMAX: 3744541 SHIFT: 64
Calculated kernel execution time: 67 ms
GPU time to compute K: 3744539 was 119 seconds
Calculated kernel execution time: 77 ms
GPU time to compute K: 3744540 was 122 seconds
Calculated kernel execution time: 67 ms
GPU time to compute K: 3744541 was 119 seconds
Result file matches.
KMIN: 76000000 KMAX: 76000003 SHIFT: 640
Calculated kernel execution time: 67 ms
GPU time to compute K: 76000000 was 119 seconds
Calculated kernel execution time: 67 ms
GPU time to compute K: 76000001 was 119 seconds
Calculated kernel execution time: 67 ms
GPU time to compute K: 76000002 was 119 seconds
Result file matches.
____________
Scott Brown Volunteer moderator Project administrator Volunteer tester Project scientist
Joined: 17 Oct 05 Posts: 2392 ID: 1178 Credit: 18,638,012,045 RAC: 7,046,910
GeForce 9600 GSO (12 MPs; 1700 MHz shader clock) compute-capability: 1.1
Pentium D Extreme Edition@3.73GHz
Windows XP Pro (32-bit)
(all tests with 4 321-LLR tasks running)
KMIN: 366384 KMAX: 366384 SHIFT: 0
Calculated kernel execution time: 85 ms
GPU time to compute K: 366384 was 119 seconds
Result file matches.
KMIN: 3744537 KMAX: 3744541 SHIFT: 64
Calculated kernel execution time: 85 ms
GPU time to compute K: 3744539 was 121 seconds
Calculated kernel execution time: 95 ms
GPU time to compute K: 3744540 was 140 seconds
Calculated kernel execution time: 85 ms
GPU time to compute K: 3744541 was 118 seconds
Result file matches.
KMIN: 76000000 KMAX: 76000003 SHIFT: 640
Calculated kernel execution time: 85 ms
GPU time to compute K: 76000000 was 119 seconds
Calculated kernel execution time: 86 ms
GPU time to compute K: 76000001 was 121 seconds
Calculated kernel execution time: 85 ms
GPU time to compute K: 76000002 was 118 seconds
Result file matches.
CPU usage remained below 1%. Screen responsiveness very good, only slight slowdown for Windows GUI w/ browser and a few apps open.
____________
141941*2^4299438-1 is prime!
Scott Brown Volunteer moderator Project administrator Volunteer tester Project scientist
Joined: 17 Oct 05 Posts: 2392 ID: 1178 Credit: 18,638,012,045 RAC: 7,046,910
GeForce 8600 GT (4 MPs; 1700 MHz shader clock) compute-capability: 1.1
Athlon x2 5600+@2.8GHz
Windows 7 Enterprise (64-bit)
(all tests with GFN sieve running on a single core)
KMIN: 366384 KMAX: 366384 SHIFT: 0
Calculated kernel execution time: 200 ms
GPU time to compute K: 366384 was 292 seconds
Result file matches.
KMIN: 3744537 KMAX: 3744541 SHIFT: 64
Calculated kernel execution time: 199 ms
GPU time to compute K: 3744539 was 289 seconds
Calculated kernel execution time: 215 ms
GPU time to compute K: 3744540 was 297 seconds
Calculated kernel execution time: 200 ms
GPU time to compute K: 3744541 was 276 seconds
Result file matches.
KMIN: 76000000 KMAX: 76000003 SHIFT: 640
Calculated kernel execution time: 191 ms
GPU time to compute K: 76000000 was 279 seconds
Calculated kernel execution time: 198 ms
GPU time to compute K: 76000001 was 276 seconds
Calculated kernel execution time: 200 ms
GPU time to compute K: 76000002 was 277 seconds
Result file matches.
CPU usage remained below 1%. Screen responsiveness noticeably sluggish but usable with Windows GUI w/ browser and a few apps open.
____________
141941*2^4299438-1 is prime!
Lumiukko Volunteer tester
Joined: 7 Jul 08 Posts: 165 ID: 25183 Credit: 875,031,530 RAC: 112,652
GeForce GTS 250 (16 MPs; 1836 MHz shader clock) compute-capability: 1.1
Dual Xeon E5345 @2.33GHz, 16GB RAM
Windows 7 x64
(all tests with 8 PPS-LLR tasks running)
c:\Temp>ap26cuda.exe 366384 366384 0 --device 0
Calculated kernel execution time: 59 ms
GPU time to compute K: 366384 was 115 seconds
c:\Temp>fc /L SOL-AP26.txt intel366384_366384_0.txt
Comparing files SOL-AP26.txt and INTEL366384_366384_0.TXT
FC: no differences encountered
c:\Temp>ap26cuda.exe 3744537 3744541 64 --device 0
Calculated kernel execution time: 60 ms
GPU time to compute K: 3744539 was 119 seconds
Calculated kernel execution time: 65 ms
GPU time to compute K: 3744540 was 122 seconds
Calculated kernel execution time: 63 ms
GPU time to compute K: 3744541 was 120 seconds
c:\Temp>fc /L SOL-AP26.txt intel3744537_3744541_64.txt
Comparing files SOL-AP26.txt and INTEL3744537_3744541_64.TXT
FC: no differences encountered
c:\Temp>ap26cuda.exe 76000000 76000003 640 --device 0
Calculated kernel execution time: 57 ms
GPU time to compute K: 76000000 was 118 seconds
Calculated kernel execution time: 60 ms
GPU time to compute K: 76000001 was 120 seconds
Calculated kernel execution time: 61 ms
GPU time to compute K: 76000002 was 120 seconds
c:\Temp>fc /L SOL-AP26.txt intel76000000_76000003_640.txt
Comparing files SOL-AP26.txt and INTEL76000000_76000003_640.TXT
FC: no differences encountered
--
Lumiukko
Scott Brown Volunteer moderator Project administrator Volunteer tester Project scientist
Joined: 17 Oct 05 Posts: 2392 ID: 1178 Credit: 18,638,012,045 RAC: 7,046,910
GeForce 8400 GS (2 MPs; 918 MHz shader clock) compute-capability: 1.1
Pentium 4 Prescott@3.2GHz (Hyperthreaded)
Windows 2000 Pro (32-bit)
(all tests with no BOINC processes running)
KMIN: 366384 KMAX: 366384 SHIFT: 0
Calculated kernel execution time: 663 ms
GPU time to compute K: 366384 was 1046 seconds
Result file matches.
KMIN: 3744537 KMAX: 3744541 SHIFT: 64
Calculated kernel execution time: 661 ms
GPU time to compute K: 3744539 was 911 seconds
Calculated kernel execution time: 770 ms
GPU time to compute K: 3744540 was 1060 seconds
Calculated kernel execution time: 661 ms
GPU time to compute K: 3744541 was 911 seconds
Result file matches.
KMIN: 76000000 KMAX: 76000003 SHIFT: 640
Calculated kernel execution time: 661 ms
GPU time to compute K: 76000000 was 911 seconds
Calculated kernel execution time: 659 ms
GPU time to compute K: 76000001 was 910 seconds
Calculated kernel execution time: 660 ms
GPU time to compute K: 76000002 was 911 seconds
Result file matches.
CPU usage bounced up to 50% (one full thread) during kernel execution, but remained below 1% otherwise. Screen responsiveness noticeably sluggish but still usable with Windows GUI w/ browser and a couple of apps open.
____________
141941*2^4299438-1 is prime!
Menipe Volunteer tester
Joined: 2 Jan 08 Posts: 235 ID: 17041 Credit: 112,895,959 RAC: 826
Computer specs:
Intel Core 2 Quad Q6600 @ 2.4GHz 8GB RAM
GeForce GTX 275 (30 MPs; 1460 MHz) compute-capability: 1.3
Ubuntu 9.10 Kernel Linux 2.6.31-15-generic
NVIDIA Driver Version 190.18
Result file matches.
time ./ap26_CUDA2.3_x86_64_Linux 366384 366384 0 --device 0
real 1m2.977s
user 0m0.820s
sys 0m0.660s
time ./ap26_CUDA2.3_x86_64_Linux 3744537 3744541 64 --device 0
real 3m14.083s
user 0m2.350s
sys 0m1.960s
time ./ap26_CUDA2.3_x86_64_Linux 76000000 76000003 640 --device 0
real 3m7.862s
user 0m2.190s
sys 0m2.190s
____________
mfl0p Project administrator Volunteer developer
Joined: 5 Apr 09 Posts: 249 ID: 38042 Credit: 2,462,723,269 RAC: 3,314,976
CPU usage bounced up to 50% (one full thread) during kernel execution, but remained below 1% otherwise. Screen responsiveness noticeably sluggish but still usable with Windows GUI w/ browser and a couple of apps open.
This is normal for the Windows app; CUDA blocking sync does not work properly in CUDA 2.3 with a fully loaded CPU under Windows yet. The Linux version does not have this problem, at least on the RHEL test machines.
So my work-around is to "benchmark" the kernel speed at the start of each K, then manually block and spin the CPU thread based on the predicted kernel runtime in the loop. So for each K, gpuapp.exe will peg the CPU while the kernel benchmark is completing.
This is not noticeable on faster cards, because the benchmark completes in under 2 seconds or so.
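Roughly, that work-around looks like the following (a sketch only: the kernel below is a placeholder, and the 90% sleep fraction is an assumed safety margin, not the app's actual value):

    #include <cuda_runtime.h>
    #include <windows.h>                /* Sleep() */

    __global__ void ap26_kernel(int *out) { out[threadIdx.x] = threadIdx.x; }

    /* Benchmark one launch with CUDA events to predict the kernel runtime. */
    float benchmark_ms(int *d_out)
    {
        cudaEvent_t start, stop;
        float ms = 0.0f;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);
        cudaEventRecord(start, 0);
        ap26_kernel<<<64, 256>>>(d_out);
        cudaEventRecord(stop, 0);
        cudaEventSynchronize(stop);     /* this wait is what pegs the CPU */
        cudaEventElapsedTime(&ms, start, stop);
        cudaEventDestroy(start);
        cudaEventDestroy(stop);
        return ms;
    }

    /* Later launches: sleep through most of the predicted time, then spin. */
    void launch_and_wait(int *d_out, float predicted_ms)
    {
        cudaEvent_t done;
        cudaEventCreate(&done);
        ap26_kernel<<<64, 256>>>(d_out);
        cudaEventRecord(done, 0);
        Sleep((DWORD)(predicted_ms * 0.9f));            /* near-zero CPU use */
        while (cudaEventQuery(done) == cudaErrorNotReady)
            ;                                           /* brief final spin */
        cudaEventDestroy(done);
    }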
____________
Scott Brown Volunteer moderator Project administrator Volunteer tester Project scientist
Joined: 17 Oct 05 Posts: 2392 ID: 1178 Credit: 18,638,012,045 RAC: 7,046,910
Thanks for the explanation.
I have tried to test across the range of Windows OSes... I have a couple more I can do with another 32-shader card and an even slower card than the 16-shader 8400 GS. Anything in particular that you are looking for in the testing beyond what had already been provided above?
And again, thanks for working on the app! :)
____________
141941*2^4299438-1 is prime!
mfl0p Project administrator Volunteer developer
Joined: 5 Apr 09 Posts: 249 ID: 38042 Credit: 2,462,723,269 RAC: 3,314,976
UPDATE
The Linux64 version has been updated on the website; it now includes some error-checking code from the Windows version.
If the GPU apps are released to the public, they will checkpoint after every K, 6 times per workunit. In other words, the times you are seeing while testing will be the time between checkpoints/progress updates in the BOINC manager. For example, on a GTS 240 that is every 2 minutes or so in Windows.
Obviously, using too slow a card and exiting/pausing BOINC many times will cause the workunit to take forever to process, because it never reaches a checkpoint.
With the current code the Windows GUI will be laggy; on Linux I have not noticed this.
Anything in particular that you are looking for in the testing beyond what had already been provided above?
Well, really any and all testing is good. I just ran the 76000000 test range while playing an FPS game at 1920x1080, and the results matched. The game had a horrible framerate, of course, but it ran.
Some may choose to use the BOINC feature "only run GPU apps when computer is idle", or exit BOINC before gaming, or only run the app on your second (non-primary) GPU, if you have one, etc.
If the video card has limited VRAM (256 MB or less), the app may error out depending on what else the display is processing at the time.
____________
Scott Brown Volunteer moderator Project administrator Volunteer tester Project scientist
Joined: 17 Oct 05 Posts: 2392 ID: 1178 Credit: 18,638,012,045 RAC: 7,046,910
GeForce 9500 GT (4 MPs; 1750 MHz shader clock) compute-capability: 1.1
i7 920@2.66GHz
Windows Vista Home Premium (64-bit)
(all tests with GFN sieve running on 7 of 8 cores)
KMIN: 366384 KMAX: 366384 SHIFT: 0
Calculated kernel execution time: 171 ms
GPU time to compute K: 366384 was 254 seconds
Result file matches.
KMIN: 3744537 KMAX: 3744541 SHIFT: 64
Calculated kernel execution time: 171 ms
GPU time to compute K: 3744539 was 251 seconds
Calculated kernel execution time: 185 ms
GPU time to compute K: 3744540 was 259 seconds
Calculated kernel execution time: 172 ms
GPU time to compute K: 3744541 was 250 seconds
Result file matches.
KMIN: 76000000 KMAX: 76000003 SHIFT: 640
Calculated kernel execution time: 171 ms
GPU time to compute K: 76000000 was 247 seconds
Calculated kernel execution time: 170 ms
GPU time to compute K: 76000001 was 239 seconds
Calculated kernel execution time: 171 ms
GPU time to compute K: 76000002 was 248 seconds
Result file matches.
Screen responsiveness noticeably sluggish but usable with Windows GUI w/ browser and a few apps open.
____________
141941*2^4299438-1 is prime!
samuel7 Volunteer tester
Joined: 1 May 09 Posts: 89 ID: 39425 Credit: 257,425,010 RAC: 0
Same host as above, now in Windows
GeForce 9800 GT (14 MPs; 1500 MHz) compute-capability: 1.1
Core2 Q9550, 4 GB RAM
Vista (64-bit), nvidia driver 191.07
BOINC tasks running on all four CPU cores during tests.
KMIN=366384 KMAX=366384 SHIFT=0
Calculated kernel execution time: 73 ms
GPU time to compute K: 366384 was 103 seconds
KMIN=3744537 KMAX=3744541 SHIFT=64
Calculated kernel execution time: 73 ms
GPU time to compute K: 3744539 was 103 seconds
Calculated kernel execution time: 80 ms
GPU time to compute K: 3744540 was 113 seconds
Calculated kernel execution time: 73 ms
GPU time to compute K: 3744541 was 102 seconds
KMIN=76000000 KMAX=76000003 SHIFT=640
Calculated kernel execution time: 73 ms
GPU time to compute K: 76000000 was 106 seconds
Calculated kernel execution time: 72 ms
GPU time to compute K: 76000001 was 102 seconds
Calculated kernel execution time: 73 ms
GPU time to compute K: 76000002 was 104 seconds
Result files matched. Screen was a little slow to respond, quite like when running SETI CUDA tasks.
On a Windows PC, do I have to put some type of app_info.xml file in the projects directory to get your CUDA app to work? Is there some type of readme.txt file that I missed? Any help would be appreciated.
Jeff
____________
Yes, please, where is the app_info file?
mfl0p Project administrator Volunteer developer
Joined: 5 Apr 09 Posts: 249 ID: 38042 Credit: 2,462,723,269 RAC: 3,314,976
The app may be made official very soon. Maybe Rytis or John can give some input.
____________
We want to test this app too. Please help us to crunch with our Nvidia card on Windows...
Scott Brown Volunteer moderator Project administrator Volunteer tester Project scientist
Joined: 17 Oct 05 Posts: 2392 ID: 1178 Credit: 18,638,012,045 RAC: 7,046,910
The app needs to be run from the command line. To do this, open up a Command Prompt either through the link under "Accessories" or by just using the Run option in your start menu and typing "cmd".
Move to the directory where you extracted the application file. Then run the app on the command line like this:
application_name range_numbers shift --device #
For example, to run the first range on a single GPU install, you would use:
ap26cuda.exe 366384 366384 0 --device 0
____________
141941*2^4299438-1 is prime!
It's not crunched with BOINC? Like Collatz, Milkyway, SETI, etc.?
Sysadm@Nbg Volunteer moderator Volunteer tester Project scientist
Joined: 5 Feb 08 Posts: 1224 ID: 18646 Credit: 877,041,954 RAC: 316,786
You can see it in the thread title: CUDA testing << it is test only.
The official app and credits will follow if the app performs well...
Then, in the future, you can use it with BOINC like the others...
____________
Sysadm@Nbg
my current lucky number: 113856050^65536 + 1
PSA-PRPNet-Stats-URL: http://u-g-f.de/PRPNet/
John Honorary cruncher
Joined: 21 Feb 06 Posts: 2875 ID: 2449 Credit: 2,681,934 RAC: 0
it's not crunched with boinc ? like collatz, milky, seti, etc...
Manual testing has gone well as evidenced by all the previous posts. BOINC roll-out, I hope, should be soon. :)
Thank you everyone for helping out.
____________
OK, thanks everybody. I'm waiting for the official BOINC app (sorry for my English, but I'm French).
You could adapt my app_info.xml to suit the app mfl0p built until the app is released officially.
For Windows you have to edit the CPU app entry too, not only the CUDA app.
Lumiukko Volunteer tester
Joined: 7 Jul 08 Posts: 165 ID: 25183 Credit: 875,031,530 RAC: 112,652
GeForce GTX 285 (30 MPs; 1476 MHz) compute-capability: 1.3
Dual Xeon E5345 @2.33GHz, 12GB RAM
Windows 7 x64
(all tests with BOINC running 8 Cullen-LLR tasks)
KMIN: 366384 KMAX: 366384 SHIFT: 0
Calculated kernel execution time: 41 ms
GPU time to compute K: 366384 was 69 seconds
Result file matches.
KMIN: 3744537 KMAX: 3744541 SHIFT: 64
Calculated kernel execution time: 43 ms
GPU time to compute K: 3744539 was 69 seconds
Calculated kernel execution time: 56 ms
GPU time to compute K: 3744540 was 124 seconds
Calculated kernel execution time: 42 ms
GPU time to compute K: 3744541 was 69 seconds
Result file matches.
KMIN: 76000000 KMAX: 76000003 SHIFT: 640
Calculated kernel execution time: 42 ms
GPU time to compute K: 76000000 was 67 seconds
Calculated kernel execution time: 42 ms
GPU time to compute K: 76000001 was 67 seconds
Calculated kernel execution time: 42 ms
GPU time to compute K: 76000002 was 67 seconds
Result file matches.
--
Lumiukko
____________
GeForce 8800 GTS 512 (16 MPs; 1674 MHz shader clock) compute-capability: 1.1
Q6600 @ 3.0 GHz 6 GB Ram
Win7 x64
KMIN: 366384 KMAX: 366384 SHIFT: 0
Calculated kernel execution time: 67 ms
GPU time to compute K: 366384 was 106 seconds
KMIN: 3744537 KMAX: 3744541 SHIFT: 64
Calculated kernel execution time: 68 ms
GPU time to compute K: 3744539 was 106 seconds
Calculated kernel execution time: 77 ms
GPU time to compute K: 3744540 was 109 seconds
Calculated kernel execution time: 67 ms
GPU time to compute K: 3744541 was 105 seconds
KMIN: 76000000 KMAX: 76000003 SHIFT: 640
Calculated kernel execution time: 67 ms
GPU time to compute K: 76000000 was 107 seconds
Calculated kernel execution time: 67 ms
GPU time to compute K: 76000001 was 106 seconds
Calculated kernel execution time: 67 ms
GPU time to compute K: 76000002 was 106 seconds
____________
valterc Volunteer tester
Joined: 30 May 07 Posts: 121 ID: 8810 Credit: 20,287,840,052 RAC: 5,303,529
Some results:
W7U x64:
Q9450 @ 3520 (~14 minutes)
GPU 0: GTX275 (~7 minutes)
GPU 1: 8600GT (~40 minutes)
Screen lag is slightly higher than when running SETI on the GPUs.
mfl0p Project administrator Volunteer developer
Joined: 5 Apr 09 Posts: 249 ID: 38042 Credit: 2,462,723,269 RAC: 3,314,976
UPDATE: new version available for testing, Linux 64-bit only
I have been working on the CUDA application a bit more; the Linux version has been updated here:
http://mfl0p.angelfire.com/testing/
Using test range 366384:
GTS 240:
With current 1.00 app: 97 sec
With new app: 75 sec
GTS 250:
With current 1.00 app: 84 sec
With new app: 71 sec
If anyone with much faster cards (1.3 hardware) can test this app to see if it is any faster, that would be great.
Also, if anyone with slow cards like a 9500 GT can test, that would be nice as well.
Memory usage is the same. The CUDA kernels have been tweaked to gain speed. Of course, this is on my old 1.1 hardware; faster cards may actually be slower with this version, as I am calling fewer thread blocks per kernel... so testing is needed.
Eventually, as Nvidia keeps adding more MPs to its cards, I'll have to increase the number of blocks per kernel to keep the fast cards busy. We may be at that point right now, I don't know, as I do not have access to 1.3 hardware currently.
Re: a new Windows version, because I know someone will ask... let's test this version to see how it scales to different cards, and work from there.
Thanks again,
Bryan
____________
Thank you, mfl0p!
Nice to see this update!
So something like a 15-20% improvement so far?
I might try this week with my GTX 275 and compare both versions.
____________
Hi mfl0p,
Looks great! Just to say I have managed to build a Mac version based on roadrunner's code (see http://www.primegrid.com/forum_thread.php?id=1616), but since yours is apparently faster, whenever you feel ready to release the code I'd be happy to port it.
If not, I expect we can release a BOINC/Mac/32bit/CUDA app based on his code soon.
Cheers
- Iain
Sysadm@Nbg Volunteer moderator Volunteer tester Project scientist
Joined: 5 Feb 08 Posts: 1224 ID: 18646 Credit: 877,041,954 RAC: 316,786
My test on an NVIDIA GeForce 9800 GTX+:
GPU time to compute K: 366384 was 67 seconds
Checkpoint: KMIN=366384 KMAX=366384 SHIFT=0 K=366385 ITER=0/7 (100.00%)
real 1m9.821s
user 0m0.640s
sys 0m1.000s
____________
Sysadm@Nbg
my current lucky number: 113856050^65536 + 1
PSA-PRPNet-Stats-URL: http://u-g-f.de/PRPNet/
samuel7 Volunteer tester
Joined: 1 May 09 Posts: 89 ID: 39425 Credit: 257,425,010 RAC: 0
On my GeForce 9800 GT (14 MPs; 1500 MHz) compute-capability: 1.1
366384 366384 0
old app, 107 sec
new app, 81 sec
CPU cores idle while testing. Waiting for Win64 version... :-)
____________
mfl0p Project administrator Volunteer developer
Joined: 5 Apr 09 Posts: 249 ID: 38042 Credit: 2,462,723,269 RAC: 3,314,976
UPDATE for Windows users
I have updated the code for the Windows app; I need some fast 1.3 cards to test this program. I know most 1.1 cards run faster, but newer hardware may not.
Program is here:
http://mfl0p.angelfire.com/testing/
I have removed the benchmark and timed-sleep code in favor of CUDA streams and sleeping on event waits. The GUI seems a bit smoother and the app is a bit faster on my test box. Also, better CUDA error checking has been added.
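That pattern looks roughly like this (a sketch, not the app's source; the placeholder kernel and the 1 ms poll interval are assumptions):

    #include <cuda_runtime.h>
    #include <windows.h>                /* Sleep() */

    __global__ void ap26_kernel(int *out) { out[threadIdx.x] = threadIdx.x; }

    /* Launch asynchronously in a stream, then poll the completion event,
       sleeping between polls so the host thread stays near 0% CPU. */
    void run_async(int *d_out)
    {
        cudaStream_t stream;
        cudaEvent_t  done;
        cudaStreamCreate(&stream);
        cudaEventCreate(&done);

        ap26_kernel<<<64, 256, 0, stream>>>(d_out);
        cudaEventRecord(done, stream);

        while (cudaEventQuery(done) == cudaErrorNotReady)
            Sleep(1);                   /* yield the CPU instead of spinning */

        cudaEventDestroy(done);
        cudaStreamDestroy(stream);
    }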
command line:
ap26cuda.exe 366384 366384 0 --device 0
or device 1, 2, 3, etc. if you have multiple cards
Thanks!
Also, the Linux app is getting some changes; I'm trying to get it to full speed in Ubuntu. It seems CUDA apps just don't run as fast under Windows or newer Linux kernels.
____________
On my GTX 285:
366384 366384 0 --> 65 sec
3744537 3744541 64 --> 191 sec
76000000 76000003 640 --> 191 sec
Result file matches.
Scott Brown Volunteer moderator Project administrator Volunteer tester Project scientist
Joined: 17 Oct 05 Posts: 2392 ID: 1178 Credit: 18,638,012,045 RAC: 7,046,910
GeForce 9500 GT (4 MPs; 1750 MHz shader clock) compute-capability: 1.1
i7 920@2.66GHz
Windows Vista Home Premium (64-bit)
(all tests with GFN sieve running on 7 of 8 cores)
KMIN: 366384 KMAX: 366384 SHIFT: 0
With current app: 254 sec
With new app: 188 sec
...and importantly, no screen sluggishness, etc. with the new app! I will try a really slow 16-shader card in a bit... sorry, I don't have any 1.3 compute-capable cards. :(
____________
141941*2^4299438-1 is prime!
mfl0p Project administrator Volunteer developer
Joined: 5 Apr 09 Posts: 249 ID: 38042 Credit: 2,462,723,269 RAC: 3,314,976
Looking good so far. As expected, the slower cards are seeing the most benefit from the new code. I'm now using fewer blocks per kernel to reduce global memory access. However, as a side effect, faster cards don't see much of a speed increase, because they can hide the global memory access by running many more thread blocks concurrently. Major changes would be required to launch more blocks per kernel for the faster cards, and until I can get a faster GPU, I am unable to do this.
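The trade-off can be made concrete by scaling the launch configuration with the hardware. A sketch (illustrative only; blocks_per_mp is a made-up tunable, not the app's value):

    #include <cuda_runtime.h>

    /* Scale the grid with the device's multiprocessor count, so a 30-MP
       GTX 285 gets many more concurrent blocks than a 2-MP 8500 GT.
       Fewer blocks per MP means less global-memory traffic (helps small
       cards); more blocks per MP means better latency hiding (helps big
       cards). */
    int grid_blocks_for_device(int device, int blocks_per_mp)
    {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, device);
        return prop.multiProcessorCount * blocks_per_mp;
    }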
Another new feature of the program: progress will be updated in BOINC regularly, so you will be able to see how fast the program is working. The current app only updates BOINC progress every K, or 11% with a 9-K WU. Checkpoints are still at every K. When stopping and restarting the program in BOINC, you may see the progress drop back to the last checkpoint percentage.
I'll use this code in the Linux application and see how well it works. It's not as good as CUDA's built-in application blocking-sync feature, but I have only seen that feature work properly on older kernels like Red Hat Enterprise 5.4, which the majority of computers are NOT running :) So I'll optimize for the most common computers.
____________
Scott Brown Volunteer moderator Project administrator Volunteer tester Project scientist
Joined: 17 Oct 05 Posts: 2392 ID: 1178 Credit: 18,638,012,045 RAC: 7,046,910
GeForce 8400M GS (2 MPs; 800 MHz shader clock) compute-capability: 1.1
T8100@2.1GHz
Windows Vista Home Premium (32-bit)
(all tests with both cores idle)
KMIN: 366384 KMAX: 366384 SHIFT: 0
With current app: 1023 sec
With new app: 727 sec
Very nice improvement!
...screen sluggishness is still very marked on this one, but without going down to some of the 8-shader OEM cards, one would be hard-pressed to test on a slower card.
____________
141941*2^4299438-1 is prime!
mfl0p Project administrator Volunteer developer
Joined: 5 Apr 09 Posts: 249 ID: 38042 Credit: 2,462,723,269 RAC: 3,314,976
UPDATE - Linux 64bit CUDA app available
I've used the same code to compile the app for Linux64, and it actually runs under Ubuntu 9.10 with a fully loaded CPU...
The app is available at the same link as above.
We'll see what Rytis wants to do :)
Thanks for testing!
Bryan
____________
Lumiukko Volunteer tester
Joined: 7 Jul 08 Posts: 165 ID: 25183 Credit: 875,031,530 RAC: 112,652
GeForce GTX 285 (30 MPs; 1476 MHz) compute-capability: 1.3
Dual Xeon E5345 @2.33GHz, 12GB RAM
Windows 7 x64
with BOINC running 4 PPS-LLR and 4 SGS-LLR tasks:
KMIN: 366384 KMAX: 366384 SHIFT: 0
Old app:
GPU time to compute K: 366384 was 85 seconds
New app:
GPU time to compute K: 366384 was 87 seconds
when BOINC not running:
KMIN: 366384 KMAX: 366384 SHIFT: 0
Old app:
GPU time to compute K: 366384 was 52 seconds
New app:
GPU time to compute K: 366384 was 59 seconds
All result files match.
--
Lumiukko
____________
GeForce 8400M GS (2 MPs; 800 MHz shader clock) compute-capability: 1.1
T8100@2.1GHz
Windows Vista Home Premium (32-bit)
(all tests with both cores idle)
KMIN: 366384 KMAX: 366384 SHIFT: 0
With current app: 1023 sec
With new app: 727 sec
I know I'm in the minority, but could I please get the Linux CUDA app (32-bit) to allow my 8600 with 256 MB of memory? It works in Windows, but I don't.
Please...
____________
6r39 7ri99
Beware the dual headed Gentoo with Wine!
mfl0p Project administrator Volunteer developer
Joined: 5 Apr 09 Posts: 249 ID: 38042 Credit: 2,462,723,269 RAC: 3,314,976
I have obtained a CUDA 1.3 card (GTX 260 Core 216) and have modified the source code again to take advantage of the larger number of MPs available on this hardware.
Initial testing shows:
Search parameters are KMIN: 366384 KMAX: 366384 SHIFT: 0
Found 1 GPUS, using GPU 0, GeForce GTX 260 (27 MPs; 1242 MHz shader clock) compute-capability: 1.3
GPU Memory: 821006336 bytes free, 117796864 bytes used of 938803200 bytes
GPU time to compute K: 366384 was 38 seconds
Compiled Feb 19 2010 with CUDA toolkit 2.3 and GCC 4.1.2 20080704 (Red Hat 4.1.2-46) Linux x86_64
This is a new record speed for the test range! And I'm just getting started! Video RAM usage has gone up to about 113 MB.
Bryan
____________
HAmsty Volunteer tester
Joined: 26 Dec 08 Posts: 132 ID: 33421 Credit: 12,510,712 RAC: 0
GPU time to compute K: 366384 was 38 seconds
Was this the overall processing time? If so: wow, very impressive!
____________
mfl0p Project administrator Volunteer developer
Joined: 5 Apr 09 Posts: 249 ID: 38042 Credit: 2,462,723,269 RAC: 3,314,976
GPU time to compute K: 366384 was 38 seconds
was this the overall processing time? if so - wow, very impressiv!
Yes, that was the total processing time. CPU usage is now almost zero.
The new apps require about 155 MB of free VRAM.
My test range results compared to current 1.00 official app:
GTX 260: new 34sec, old 74sec
GTS 250: new 47sec, old 84sec
GTS 240: new 54sec, old 97sec
8500GT : new 600sec, old 770sec
As you can see, the more MPs the card has, the better the new code works. I've changed the code to add enough blocks to scale with future GPUs that may be released.
New apps are here:
Windows 32bit/64bit
http://www.megaupload.com/?d=28DGC7Y5
Linux 64bit
http://www.megaupload.com/?d=JQU69LDC
____________
|
I've just downloaded it and started testing.
35 seconds on my "old" GTX 260-192 with the standard test range. The solution file seems to match. I'm tempted to continue the hunt for the AP26 gold badge! You did some excellent work. Thank you very much.
If this goes on, it will be necessary to increase the WU range again for the next AP26 challenge in June(?)...
____________
By the way:
It seems that the renicing trick is not necessary any more under Linux to reach full speed when the CPU is under full load.
Update:
Ignore the statement above: I just see that the GPU app is blocking one CPU core again. I must have messed up my app_info.xml or something else went wrong.
mfl0p Project administrator Volunteer developer
Joined: 5 Apr 09 Posts: 249 ID: 38042 Credit: 2,462,723,269 RAC: 3,314,976
The apps have a new sleep feature for the CPU thread, so they should be using near 0% CPU load. This was done mainly to get rid of the Windows app benchmark and to get the app to run under Ubuntu's newer kernel. If the CPU is loaded with other programs, the GPU may be starved and slow down a little; 10-20 sec per K is what I've seen. I use 0.05 CPUs + 1 Nvidia GPU in my app_info file.
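For everyone asking about app_info.xml, a minimal anonymous-platform sketch along those lines (the app name, file name, and version number below are illustrative guesses only; check the project's apps page for the real ones before using anything like this):

    <app_info>
        <app>
            <name>ap26</name>
        </app>
        <file_info>
            <name>ap26cuda.exe</name>
            <executable/>
        </file_info>
        <app_version>
            <app_name>ap26</app_name>
            <version_num>101</version_num>
            <avg_ncpus>0.05</avg_ncpus>
            <max_ncpus>0.05</max_ncpus>
            <coproc>
                <type>CUDA</type>
                <count>1</count>
            </coproc>
            <file_ref>
                <file_name>ap26cuda.exe</file_name>
                <main_program/>
            </file_ref>
        </app_version>
    </app_info>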
____________
GeForce 8800 GTS 512 (16 MPs; 1674 MHz shader clock) compute-capability: 1.1
Q6600 @ 3.0 GHz 6 GB Ram
Win7 x64
KMIN: 366384 KMAX: 366384 SHIFT: 0
Calculated kernel execution time: 67 ms
GPU time to compute K: 366384 was 106 seconds
KMIN: 3744537 KMAX: 3744541 SHIFT: 64
Calculated kernel execution time: 68 ms
GPU time to compute K: 3744539 was 106 seconds
Calculated kernel execution time: 77 ms
GPU time to compute K: 3744540 was 109 seconds
Calculated kernel execution time: 67 ms
GPU time to compute K: 3744541 was 105 seconds
KMIN: 76000000 KMAX: 76000003 SHIFT: 640
Calculated kernel execution time: 67 ms
GPU time to compute K: 76000000 was 107 seconds
Calculated kernel execution time: 67 ms
GPU time to compute K: 76000001 was 106 seconds
Calculated kernel execution time: 67 ms
GPU time to compute K: 76000002 was 106 seconds
With the new app:
Can't open init data file - running in standalone mode
Search parameters are KMIN: 366384 KMAX: 366384 SHIFT: 0
Found 1 GPUS, using GPU 0, GeForce 8800 GTS 512 (16 MPs; 1674 MHz shader clock) compute-capability: 1.1
GPU time to compute K: 366384 was 64 seconds
Compiled Feb 20 2010 with CUDA toolkit 2.3 and MSVS2008
called boinc_finish
Nice work!
____________
By the way:
It seems that the renicing trick is not necessary any more under Linux to reach full speed when the CPU is under full load.
Update:
Ignore the statement above: I just see that the GPU app is blocking one CPU core again. I must have messed up my app_info.xml or something else went wrong.
With the database running again, I am continuing my tests. 4 CPU WUs and 1 GPU WU are running in parallel with no problems. The reason for the problem mentioned above (a blocked CPU core) was a typo I made in my app_info.xml. The runtimes of the GPU WUs are about 5:30 minutes on my GTX 260. Thank you again for the great work.
____________
This is probably a newb question, but what does the app_info.xml need to look like to run this on a Windows computer with 2 GPUs? I keep trashing WUs trying to write one.
Rytis Volunteer moderator Project administrator
Joined: 22 Jun 05 Posts: 2653 ID: 1 Credit: 102,211,564 RAC: 96,431
Feb20 builds have been made public.
____________
pschoefer Volunteer developer Volunteer tester
Joined: 20 Sep 05 Posts: 686 ID: 845 Credit: 2,910,184,413 RAC: 361,400
Still the old app version on Windows x64.
____________
Michael Goetz Volunteer moderator Project administrator
Joined: 21 Jan 10 Posts: 14011 ID: 53948 Credit: 431,895,044 RAC: 1,094,007
Feb20 builds have been made public.
Impressive. I'm seeing about a 100% speed increase on a GTX280.
____________
My lucky number is 75898^524288+1
pschoefer Volunteer developer Volunteer tester
Joined: 20 Sep 05 Posts: 686 ID: 845 Credit: 2,910,184,413 RAC: 361,400
Still the old app version on Windows x64.
Now the old app disappeared from apps.php, so I would expect to get the Win32 app on Win64. But I still get the old app, even after detaching and reattaching.
____________
~9560 C/day for my GTX 260; I think this is a fair amount for this GPU.
About 10% of one core of my W3520 is taken up by X11; if I [Ctrl]+[Alt]+[F1] to console mode, this normalizes to 0.
Very good work!
mfl0p Project administrator Volunteer developer
Joined: 5 Apr 09 Posts: 249 ID: 38042 Credit: 2,462,723,269 RAC: 3,314,976
Very good work!
Thanks!
The new code now lets PrimeGrid's GPU applications show the real power of the GPU, even on huge 64bit integers with lots of division, which is one of the slowest things a GPU can compute.
____________
samuel7 Volunteer tester
Joined: 1 May 09 Posts: 89 ID: 39425 Credit: 257,425,010 RAC: 0
Still the old app version on Windows x64.
Now the old app disappeared from apps.php, so I would expect to get the Win32 app on Win64. But I still get the old app, even after detaching and reattaching.
Bump.
pschoefer Volunteer developer Volunteer tester
Joined: 20 Sep 05 Posts: 686 ID: 845 Credit: 2,910,184,413 RAC: 361,400
Still the old app version on Windows x64.
Now the old app disappeared from apps.php, so I would expect to get the Win32 app on Win64. But I still get the old app, even after detaching and reattaching.
Bump.
It has been working for me since a few hours after my post.
____________
samuel7 Volunteer tester
Joined: 1 May 09 Posts: 89 ID: 39425 Credit: 257,425,010 RAC: 0
It has been working for me since a few hours after my post.
Okay, thanks! I just looked at apps.php and forgot that the CUDA app can be 32-bit without penalty.
____________
(...)
even on huge 64bit integers with lots of division, which is one of the slowest things a GPU can compute.
The slowest thing is mod, but that was easily circumvented with division.
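To illustrate (a one-liner sketch, not the app's code): once the quotient has been paid for, the remainder comes from a multiply and a subtract, with no separate 64-bit % operation:

    /* Remainder derived from the quotient: one slow 64-bit division,
       then a cheap multiply-subtract instead of a second % op. */
    __device__ unsigned long long mod64(unsigned long long n, unsigned long long d)
    {
        unsigned long long q = n / d;
        return n - q * d;
    }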
____________
(...)
even on huge 64bit integers with lots of division, which is one of the slowest things a GPU can compute.
The slowest thing is mod, but that was easily circumvented with division.
You may be surprised to know that I didn't know about it before...
____________
At least you had 27 days to figure this out.