Message boards :
Generalized Fermat Prime Search :
Minimum CUDA requirement for GFN
Author |
Message |
Michael Goetz Volunteer moderator Project administrator
Joined: 21 Jan 10 Posts: 14045 ID: 53948 Credit: 482,807,968 RAC: 586,050
|
The 3.2.0 version of Genefer (Version 3.00 on BOINC) requires a minimum CUDA version of 5.5. I *think*, but am not certain, that this corresponds to driver version 320.
The app won't run with an earlier version of CUDA and will give you an error about having an insufficient driver level.
The server is configured not to send you GFN CUDA tasks unless your system has a driver that supports CUDA 5.5.
If you're not getting GFN tasks, try upgrading your driver.
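As a rough illustration of the requirement above, here is a minimal Python sketch of a minimum-version check. It is not how the PrimeGrid server or Genefer actually performs the test; the function names and the sample version strings are hypothetical, and only the 5.5 minimum comes from the post.

```python
MIN_CUDA = "5.5"  # minimum CUDA version quoted in the post above


def parse_version(text):
    """Turn a dotted version string such as '5.5' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


def meets_minimum(reported, minimum=MIN_CUDA):
    """Return True when the reported version is at least the minimum.

    Comparing tuples of ints avoids the string-comparison trap where
    '5.10' would sort before '5.5'.
    """
    return parse_version(reported) >= parse_version(minimum)


# Hypothetical sample values, chosen only to show both outcomes.
for reported in ("5.0", "5.5", "6.0"):
    status = "ok" if meets_minimum(reported) else "too old, upgrade the driver"
    print(f"CUDA {reported}: {status}")
```

Comparing parsed tuples rather than raw strings is the only design choice of note here; the driver-to-CUDA mapping is deliberately left out because the post itself is unsure of it.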
____________
My lucky number is 75898^524288+1 | |
|
Michael Goetz Volunteer moderator Project administrator
Joined: 21 Jan 10 Posts: 14045 ID: 53948 Credit: 482,807,968 RAC: 586,050
|
This may be a short-lived problem. I'm testing the next version of the Genefer apps, which are significantly faster. The new OpenCL app runs about 20% faster than the CUDA app, so we may just eliminate the CUDA app completely.
That speed comparison is on a GTX 580, which was faster on CUDA than OpenCL in the past.
| |
|
Scott Brown Volunteer moderator Project administrator Volunteer tester Project scientist
Joined: 17 Oct 05 Posts: 2420 ID: 1178 Credit: 20,187,004,401 RAC: 23,279,189
|
The CUDA vs. OpenCL tests need to be done on a broader array of GPUs than a GTX 580. The OCL vs. CUDA differences on different GTX 4xx and 5xx cards were quite variable in the past.
Are these the same new apps that were just released in PRPNet? If so, I'll test on a couple of card versions over the next couple of days to add to the testing.
| |
|
Michael Goetz Volunteer moderator Project administrator
Joined: 21 Jan 10 Posts: 14045 ID: 53948 Credit: 482,807,968 RAC: 586,050
|
Yes, these are the apps released in the new PRPNet, but the OCL app needs a bug fix before it goes into production on BOINC. When running standalone, as it does in PRPNet, it's ok, but under BOINC it's using a whole CPU core. I think this is an easy fix.
The new Z-transform CPU and OCL apps are a LOT faster than the earlier apps. I don't think CUDA is going to be faster on any GPU at this point. To put things in perspective, I may be able to run the CPU app on world record tasks and make the deadline. Maybe.
C:\Temp>genefer_windows64.exe -q "26586^4194304+1" -x fma
genefer 3.2.1-0 (Windows/CPU/64-bit)
Supported transform implementations: default x87 sse2 sse4 avx fma
Copyright 2001-2014, Yves Gallot
Copyright 2009, Mark Rodenkirch, David Underbakke
Copyright 2010-2012, Shoichiro Yamada, Ken Brazier
Copyright 2011-2014, Iain Bethune, Michael Goetz, Ronald Schneider
Command line: genefer_windows64.exe -q 26586^4194304+1 -x fma
Priority change succeeded.
Testing 26586^4194304+1...
Using FMA transform
The checkpoint doesn't match current test: 26586^4194304+1 != 432100^1048576+1.
Current test will be restarted
Starting initialization...
Initialization complete (22.293 seconds).
Testing 26586^4194304+1... 61648896 steps to go (417:46:36 remaining)
^C caught. Writing checkpoint.
21 days (the deadline) is 504 hours.
Note that this is running just 1 core, so it's faster than it would be with all 4 cores going, and I've got a computer that's fairly well optimized for LLR and GFN, with a Haswell CPU supporting FMA, no hyperthreading, and 2400 MHz memory. We're not yet where we need to be to run GFN-WR on a CPU, but it's no longer a completely ridiculous concept.
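The deadline arithmetic above can be checked with a short Python sketch. The three input numbers (417:46:36 remaining, 61648896 steps, 21 days) are taken from the console output and the post; the helper name is made up for illustration.

```python
def hms_to_seconds(text):
    """Convert an 'H:MM:SS' string (hours may exceed 24) to seconds."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return hours * 3600 + minutes * 60 + seconds


remaining_s = hms_to_seconds("417:46:36")  # estimate printed by genefer
deadline_s = 21 * 24 * 3600                # 21-day deadline
steps = 61648896                           # "steps to go" from the output

print(f"estimated run time: {remaining_s / 3600:.1f} h")
print(f"deadline:           {deadline_s / 3600:.0f} h")
print(f"slack:              {(deadline_s - remaining_s) / 3600:.1f} h")
print(f"implied rate:       {steps / remaining_s:.1f} steps/s")
```

On these figures the estimate fits inside the 504-hour deadline with roughly 86 hours to spare, which only holds for the single-core, otherwise idle conditions described above.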
| |
|
|
I have a GTX 770 with a 2600K running Ubuntu Linux 14.04. Just yesterday I updated the OS (from 13.10). After that, with the old video driver (304.xx) and the new PG app, GFN OCL tasks downloaded but then failed immediately, and GFN CUDA tasks didn't download at all. Not surprising given the info above... A lot of PPS Sieve tasks crashed too; sorry about that.
Today I updated to the latest NVIDIA driver from the NVIDIA website. The Linux repositories that I subscribe to don't seem to have anything more recent than a 304.xx driver. I grabbed NVIDIA-Linux-x86_64-331.79.run from their site and installed it as follows:
Switch from the GUI to a text console with Ctrl-Alt-F1.
Stop X with "sudo service lightdm stop" (or run it as root without sudo).
cd to wherever you saved the above .run file, and run it with "sudo sh ./NVIDIA..."
There are then several windows requiring responses; just keep responding "yes", "go ahead", "i agree", or whatever. Use the keyboard arrow keys to select the proper response.
After the installer was complete, I rebooted. Then held my breath. Or one could just get back into the window manager with "sudo service lightdm start".
It booted into the GUI. Yay.
I first tried a few pps sieve units. Remember, I'd just done an o/s upgrade, a video driver upgrade, and have new PG apps downloaded, one right after the other. Those WUs seemed to work; speed looks about the same as before. Now I'm running GFN CUDA (short). The WU I have appears to be working, but it will be several hours before it finishes, and I hesitate to declare victory before it validates. Next, I will try GFN-OpenCL to see how that works with the new setup.
CUDA version reported (in the boinc log) with the above driver version is 6.0.
Why can't computers be as easy to use as a toaster? Sorry, rhetorical question.
--Gary | |
|
Michael Goetz Volunteer moderator Project administrator
Joined: 21 Jan 10 Posts: 14045 ID: 53948 Credit: 482,807,968 RAC: 586,050
|
Gary,
Glad to hear CUDA is working for you.
You might want to try the OpenCL version, for three reasons:
1) It's faster on your GPU, and the next OpenCL version will be even faster.
2) There's a good chance when the next OpenCL app goes live, we'll be turning off the CUDA app because it's looking like the OpenCL app will be faster no matter what GPU you have. You'll want to know if the OpenCL app works -- and so do we.
3) Earlier today I discovered a problem with the Windows OpenCL app. I want to know if it also affects the Linux OpenCL app so we can also fix that app, if needed. The Windows OpenCL app, when running under BOINC, hogs an entire CPU core. It's supposed to use a full core if the CPU is idle, but if all the cores are busy it should only use a fraction of a core. That worked under PRPNet, but not under BOINC. We've fixed it for the next Windows OpenCL version, but we need to know if this also affects Linux and Mac.
| |
|
|
GFN 3.00 Ubuntu CUDA unit finished and validated (ID 395052353). Dang I was about a half hour late, though the wingman had a big head start :-)
OpenCL unit with Ubuntu GFN 3.00 (id 395195414) is running now. Yes, it started up OK and appears healthy. Same environment as the above, except for OpenCL vs. CUDA. Faster? Yes, somewhat. The CUDA WU I just finished took 10.4+ hours. If I extrapolate the progress on the current OpenCL unit (still early on), it looks like right at 10 hours.
As for processor usage, if I set boinc at 50 or 75 % of available processors (2600K, 4 physical cores, HT off) the OpenCL app does indeed use ~100% of a core, over and above the 2 or 3 cores of what LLR is using. If I load the CPU with a full 4 LLR tasks, the GFN OpenCL app *still* runs at 100% (-ish) of a core, forcing two of the LLR's to split a core 50/50. Running LLR on either 3 or 4 cores, plus GFN OpenCL, makes the CPU run *really* hot, so much that I need to run just 2 LLR's (yes, I could use better cooling). The GFN CUDA app that I ran first used maybe 11 or 12% of a core on average (which is in line with the behavior of the old app, and prior to the o/s update). Temps were OK with 3 LLR's also running in that case.
--Gary | |
|
Michael Goetz Volunteer moderator Project administrator
Joined: 21 Jan 10 Posts: 14045 ID: 53948 Credit: 482,807,968 RAC: 586,050
|
Look at the estimate in stderr.txt in the slot directory. It should be very accurate. On your GPU I would be very surprised if the OpenCL app isn't faster, but the next OpenCL app is a lot faster than the current app.
Ok, so the core-hogging problem exists in Linux, too. We'll need to fix it there as well. Thanks. The change to the code should fix the problem on Linux and Mac as well as Windows.
| |
|
Scott Brown Volunteer moderator Project administrator Volunteer tester Project scientist
Joined: 17 Oct 05 Posts: 2420 ID: 1178 Credit: 20,187,004,401 RAC: 23,279,189
|
Doing testing on PRPnet GFN524288 port with a pair of GTX 460s. Results look good.
GeneferCUDA results range from 5 hours to 5 hours 15 minutes each.
New OCL version results range from 4 hours 11 minutes to 4 hours 48 minutes.
That is between 5% and 20% faster. The longer OCL time was recorded while the CUDA version was still running on the other card (these are in the same box), so that might explain the widely different performance. I'll keep running with the new OCL version to see if these times stabilize nearer to the shorter time.
Will test on a GTS 450 and GTX 550 Ti when I get home today.
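For readers who want to reproduce the percentage range from the GTX 460 timings above, here is a small Python sketch. It assumes "faster" means run time saved relative to the CUDA run, and it pairs the extremes of each range; both are assumptions, since the post does not say how the figures were derived.

```python
def to_minutes(hours, minutes=0):
    """Express an hours/minutes duration in minutes."""
    return hours * 60 + minutes


# Ranges quoted in the post above.
cuda_runs = (to_minutes(5), to_minutes(5, 15))
ocl_runs = (to_minutes(4, 11), to_minutes(4, 48))


def percent_time_saved(old, new):
    """Percentage of run time saved when going from old to new."""
    return (old - new) / old * 100.0


# Least favourable pairing: fastest CUDA run against slowest OCL run.
smallest_gain = percent_time_saved(min(cuda_runs), max(ocl_runs))
# Most favourable pairing: slowest CUDA run against fastest OCL run.
largest_gain = percent_time_saved(max(cuda_runs), min(ocl_runs))

print(f"time saved: {smallest_gain:.1f}% to {largest_gain:.1f}%")
```

Under these assumptions the result lands close to the range quoted in the post. Measuring throughput (old time divided by new time) instead of time saved would give somewhat larger percentages.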
| |
|
|
Well here I was thinking the high cpu usage was because of my old old systems (still might be part of the problem).
GTX 770, OCL, short GFN task - gpu: 36,567.16 cpu: 36,334.22
GTX 770, CUDA with block size set to default, short GFN task - gpu: 43,486.78 cpu: 16,531.39
These numbers are in seconds, so OCL is quite a bit faster. Based on the WUs I've run, the times are pretty consistent for each app type. | |
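The CPU-hog issue discussed in this thread shows up clearly in the timings above. The following Python sketch divides CPU seconds by the "gpu" seconds to estimate how much of one core each app kept busy; treating the "gpu" figure as elapsed run time is an assumption, and the variable names are illustrative.

```python
# (gpu seconds, cpu seconds) as posted above for a GTX 770 short GFN task.
runs = {
    "OCL": (36567.16, 36334.22),
    "CUDA": (43486.78, 16531.39),
}


def cpu_share(gpu_seconds, cpu_seconds):
    """Fraction of one CPU core used, assuming gpu_seconds is elapsed time."""
    return cpu_seconds / gpu_seconds


for app, (gpu_s, cpu_s) in runs.items():
    print(f"{app}: {gpu_s / 3600:.1f} h elapsed, "
          f"{cpu_share(gpu_s, cpu_s):.0%} of a core")

# Run time saved by OCL relative to CUDA on this card.
saved = (runs["CUDA"][0] - runs["OCL"][0]) / runs["CUDA"][0] * 100.0
print(f"OCL saves {saved:.1f}% of the CUDA run time")
```

On these numbers the OCL app keeps essentially a whole core busy while finishing in noticeably less time than the CUDA app.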
|
|
OpenCL GFN (short) 3.0 unit finished, hopefully successfully (and hopefully prime!... okay, that's a stretch...) a couple of hours ago. It *was* faster than the CUDA app, although not by much, maybe 5-10%. Compare the WU ID links I posted prior for specifics.
Yes, the "CPU hog" issue with OpenCL appears in linux too, it seems.
As for the original title of this thread, CUDA 6.0 works with the current app under linux, for sure. CUDA 5.0 does *not* work, everything else unchanged (no surprise there). I never had an install of 5.5 so I can't speak to that.
--Gary
p.s. PPS Sieve GPU is also working happily with the new configuration.
p.p.s. Sorry I can't help with Mac testing. I have Macs, but they are old, and the GPUs are not double precision.
p.p.p.s. @Rick: your gtx770 OpenCL gfn/short timings are right in line with mine. | |
|
Scott Brown Volunteer moderator Project administrator Volunteer tester Project scientist
Joined: 17 Oct 05 Posts: 2420 ID: 1178 Credit: 20,187,004,401 RAC: 23,279,189
|
Both the GTS 450 and the GTX 550 Ti are faster with OCL. CUDA tests estimate about 7.5 hours for GFN524288 work. The same tests under OCL come in at about 6.5 hours, or about 15% faster.
I assume that whatever changes were made to the OCL app cannot be made to the CUDA app, given that the CUDA app was generally better on Fermi cards before?
| |
|
|
Not really, the entire FFT algorithm used by geneferCUDA is buried inside the CuFFT library, so we can't easily touch it without re-implementing the whole thing by hand. Yves already did this in the OpenCL app, so we were able to modify it to use the (faster) z-Transform, rather than the DWT that is still used by the CUDA app.
____________
Twitter: IainBethune
Proud member of team "Aggie The Pew". Go Aggie!
3073428256125*2^1290000-1 is Prime! | |
|
Scott Brown Volunteer moderator Project administrator Volunteer tester Project scientist
Joined: 17 Oct 05 Posts: 2420 ID: 1178 Credit: 20,187,004,401 RAC: 23,279,189
|
I was also curious if any of the CUDA 6 improvements would be useful in a similar fashion? | |
|
|
I don't immediately see anything that will help here, unfortunately...
| |
|
Dave
Joined: 13 Feb 12 Posts: 3257 ID: 130544 Credit: 2,460,866,079 RAC: 4,328,950
|
My own tests on n=20 (GTX 580, i7-2600K @ 3.4 GHz):
App                          GPU time    CPU time
Genefer v3.00 (cudaGFN)      34,360.81   453.23
Genefer v3.00 (OCLcudaGFN)   32,726.89   20,139.29
| |
|
Michael Goetz Volunteer moderator Project administrator
 Send message
Joined: 21 Jan 10 Posts: 14045 ID: 53948 Credit: 482,807,968 RAC: 586,050
                               
|
With the latest release of Genefer 3.2.2 (released as BOINC apps 3.01), CUDA 6.0 is required for the Mac apps. Windows and Linux still use CUDA 5.5.
| |
|