Message boards :
Problems and Help :
NVIDIA driver update computation errors
Ever since installing the most recent NVIDIA driver (released 2/21/2012) I've been getting computation errors. Even on processor jobs I think. I've reinstalled BOINC.
Where should I begin troubleshooting?
AMD Athlon II x3 450
PNY Geforce GTX 550Ti | |
|
Michael Goetz Volunteer moderator Project administrator
Joined: 21 Jan 10 Posts: 14037 ID: 53948 Credit: 479,202,127 RAC: 399,131
|
Ever since installing the most recent NVIDIA driver (released 2/21/2012) I've been getting computation errors. Even on processor jobs I think. I've reinstalled BOINC.
Where should I begin troubleshooting?
AMD Athlon II x3 450
PNY Geforce GTX 550Ti
No troubleshooting necessary. It's a known problem with the driver. When it powers down the monitor, CUDA dies.
Either roll back to a previous version of the driver, or set your system to never put the monitors to sleep.
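For the second workaround on a Windows machine, the monitor timeout can be disabled from a script. This is only a rough sketch: it assumes Windows with the stock `powercfg` tool on the PATH, and the helper names `monitor_never_sleep_commands` and `apply_commands` are made up for illustration. A timeout of 0 means "never".

```python
import subprocess

def monitor_never_sleep_commands():
    """Build the powercfg calls that set the display timeout to 0 (never).

    Assumption: a Windows system; 'ac' is plugged in, 'dc' is on battery.
    """
    return [
        ["powercfg", "/change", "monitor-timeout-ac", "0"],
        ["powercfg", "/change", "monitor-timeout-dc", "0"],
    ]

def apply_commands(commands, dry_run=True):
    """Print the commands, or actually run them when dry_run is False."""
    for cmd in commands:
        if dry_run:
            print(" ".join(cmd))
        else:
            # check=True raises CalledProcessError if powercfg reports failure
            subprocess.run(cmd, check=True)

if __name__ == "__main__":
    apply_commands(monitor_never_sleep_commands(), dry_run=True)
```

By default this only prints what it would run; call `apply_commands(..., dry_run=False)` to change the setting. Turning the monitor off with its own power button still works with the timeout disabled.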
____________
My lucky number is 75898524288+1 | |
|
|
Thanks! | |
|
|
Has this problem been brought to NVIDIA's attention? I really don't want to leave my monitor powered on while I'm not using the PC but BOINC is grinding away; it's a waste of electricity. I'd like to stay at the latest driver release because I'm a PC gamer as well as a BOINC participant.
Intel i7-950
EVGA GeForce GTX 460 | |
|
Honza Volunteer moderator Volunteer tester Project scientist
Joined: 15 Aug 05 Posts: 1963 ID: 352 Credit: 6,413,232,408 RAC: 2,792,323
|
Has this problem been brought to NVIDIA's attention? I really don't want to leave my monitor powered on while I'm not using the PC but BOINC is grinding away; it's a waste of electricity. I'd like to stay at the latest driver release because I'm a PC gamer as well as a BOINC participant.
The issue should be sent to NVIDIA.
On the other hand, if you manually turn off your monitor, it will consume even less power (possibly zero) compared to sleep mode.
____________
My stats | |
|
Michael Goetz Volunteer moderator Project administrator
|
Has this problem been brought to NVIDIA's attention? I really don't want to leave my monitor powered on while I'm not using the PC but BOINC is grinding away; it's a waste of electricity. I'd like to stay at the latest driver release because I'm a PC gamer as well as a BOINC participant.
Intel i7-950
EVGA GeForce GTX 460
As far as I know, it's been brought to their attention multiple times.
____________
My lucky number is 75898524288+1 | |
|
|
How sure are you it's an nVidia problem? I work in the software world and this is a constant problem: assuming it's someone else's bug.
I've run many other projects on my nVidia card and never had a problem.
BTW, I get BSODs on ATI and computation errors on nVidia. I really think the problem is with the code here. | |
|
Michael Goetz Volunteer moderator Project administrator
|
How sure are you it's an nVidia problem? I work in the software world and this is a constant problem: assuming it's someone else's bug.
Very sure.
This specific problem has been widely reported.
You could go over to Nvidia's forums where this has been reported for a while. It's also being discussed at nearly every BOINC project that has a CUDA app.
While there's all sorts of problems that can cause a program to fail, and even more that can cause a GPU program to fail, this specific problem is definitely a driver problem. The stderr output of the GFN and PPS Sieve programs indicates what caused it to fail, and it's crystal clear that the problem is in the CUDA system, not the program. (I'm not familiar enough with the GCW sieve to interpret the output of that app.)
That's not the same as saying there's no bugs in the programs, but this problem is in the driver, and there's no programmatic way to correct it. The only known fixes are to roll back to a pre-295 driver or to prevent the computer from turning off the monitors.
____________
My lucky number is 75898524288+1 | |
|
Michael Goetz Volunteer moderator Project administrator
|
I've run many other projects on my nVidia card and never had a problem.
BTW, I get BSODs on ATI and computation errors on nVidia. I really think the problem is with the code here.
I thought I'd answer this part separately.
First of all, I can only speak about the GeneferCUDA app; I'm not involved in the development of the Sieve apps. However, my personal experience with those is that both sieve apps have been very stable for me.
As far as GeneferCUDA is concerned, I took a look at your computers with CUDA 1.3-capable GPUs.
On your Linux box with the GT430, I see two Genefer WUs, and both have this error:
GeneferCUDA-boinc 1.06 (CUDA3.2) based on GeneferCUDA 1.049 and Genefer 2.2.1
Copyright (C) 2001-2003, Yves Gallot (v1.3)
Copyright (C) 2009-2011, Mark Rodenkirch, David Underbakke (v2.2.1)
Copyright (C) 2010-2012, Shoichiro Yamada (CUDA)
Portions of this software written by Michael Goetz 2011-2012 (BOINC)
A program for finding large probable generalized Fermat primes.
../../projects/www.primegrid.com/primegrid_genefer_1.06_i686-pc-linux-gnu__cuda32_13 -boinc -q hidden --device 0
GeneferCUDA-boinc.cu(107) : cudaSafeCall() Runtime API error : CUDA driver version is insufficient for CUDA runtime version.
That's the error returned by the API call to the driver to initialize the CUDA system, and the error is self-explanatory.
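If you want to check your own failed WUs for this message, the stderr text from the task page can be scanned for it. A minimal Python sketch, assuming the stderr has been copied into a string; the pattern is written against the one line format quoted above, and the names `find_cuda_api_errors` and `needs_driver_update` are hypothetical helpers, not part of GeneferCUDA or BOINC.

```python
import re

# Matches lines shaped like the cudaSafeCall() error quoted above:
#   <source file>(<line>) : cudaSafeCall() Runtime API error : <message>.
CUDA_ERROR_RE = re.compile(
    r"^(?P<file>\S+)\((?P<line>\d+)\) : cudaSafeCall\(\) "
    r"Runtime API error : (?P<message>.+?)\.?$",
    re.MULTILINE,
)

def find_cuda_api_errors(stderr_text):
    """Return a list of (file, line, message) for each matching error line."""
    return [
        (m["file"], int(m["line"]), m["message"])
        for m in CUDA_ERROR_RE.finditer(stderr_text)
    ]

def needs_driver_update(stderr_text):
    """True if any error says the driver is too old for the CUDA runtime."""
    return any(
        "driver version is insufficient" in message
        for _, _, message in find_cuda_api_errors(stderr_text)
    )

if __name__ == "__main__":
    sample = (
        "GeneferCUDA-boinc.cu(107) : cudaSafeCall() Runtime API error : "
        "CUDA driver version is insufficient for CUDA runtime version.\n"
    )
    print(find_cuda_api_errors(sample))
    print(needs_driver_update(sample))
```

A hit from `needs_driver_update` points at an outdated driver on that host, which is a different situation from the 295 driver problem discussed earlier in this thread.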
On your Windows box with the GTX 580, you are mostly completing the Genefer WUs successfully, but there are a few errors.
Here's one of the errors:
GeneferCUDA-boinc 1.06 (CUDA3.2) based on GeneferCUDA 1.049 and Genefer 2.2.1
Copyright (C) 2001-2003, Yves Gallot (v1.3)
Copyright (C) 2009-2011, Mark Rodenkirch, David Underbakke (v2.2.1)
Copyright (C) 2010-2012, Shoichiro Yamada (CUDA)
Portions of this software written by Michael Goetz 2011-2012 (BOINC)
A program for finding large probable generalized Fermat primes.
Command line: projects/www.primegrid.com/primegrid_genefer_1.06_windows_intelx86__cuda32_13.exe -boinc -q hidden --device 0
Priority change succeeded.
GPU=GeForce GTX 580
Global memory=1610612736 Shared memory/block=49152 Registers/block=32768 Warp size=32
Max threads/block=1024
Max thread dim=1024 1024 64
Max grid=65535 65535 65535
CC=2.0
Clock=1720 MHz
# of MP=16
No project preference specified; using SHIFT=7
maxErr during b^N initialization = 0.0000 (0.145 seconds).
Testing b^262144+1...
maxErr exceeded for 727978^262144+1, 0.5000 > 0.4500
18:06:11 (8380): called boinc_finish
What we have found is that GeneferCUDA, and likely other programs that extensively use double precision hardware, is far more sensitive to overclocking than either gaming or most other CUDA apps, which do not use the double precision hardware. The error seen here is typical of a card that's running too hot or too fast.
Your card is overclocked (possibly factory overclocked), and slowing the card down to stock speeds (1544 MHz) and/or running the card cooler (increasing fan speed, etc.) usually corrects the problem. It's the experience of many that lowering the memory clock speed is an effective way to avoid the problem.
The fact that the problem responds to temperature changes indicates that this is a hardware problem, not a software problem.
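To make the comparison concrete, the clock the app reported and the maxErr failure can both be pulled out of the stderr text. A rough Python sketch, assuming the line formats shown in the output above; the 1544 MHz stock figure is the one quoted in this post, and the function names are made up for illustration.

```python
import re

STOCK_SHADER_CLOCK_MHZ = 1544  # GTX 580 stock clock as quoted in this post

def reported_clock_mhz(stderr_text):
    """Return the value from the 'Clock=NNNN MHz' line, or None if absent."""
    m = re.search(r"^Clock=(\d+) MHz", stderr_text, re.MULTILINE)
    return int(m.group(1)) if m else None

def max_err_failure(stderr_text):
    """Return (candidate, observed, limit) from a 'maxErr exceeded' line, or None."""
    m = re.search(r"maxErr exceeded for (\S+), ([0-9.]+) > ([0-9.]+)", stderr_text)
    if not m:
        return None
    return m.group(1), float(m.group(2)), float(m.group(3))

def overclock_percent(clock_mhz, stock_mhz=STOCK_SHADER_CLOCK_MHZ):
    """Percentage by which the reported clock exceeds the stock clock."""
    return 100.0 * (clock_mhz - stock_mhz) / stock_mhz

if __name__ == "__main__":
    sample = (
        "Clock=1720 MHz\n"
        "Testing b^262144+1...\n"
        "maxErr exceeded for 727978^262144+1, 0.5000 > 0.4500\n"
    )
    clock = reported_clock_mhz(sample)
    print(clock, round(overclock_percent(clock), 1))
    print(max_err_failure(sample))
```

With the numbers from the WU above, 1720 MHz against a 1544 MHz stock clock works out to roughly 11% over stock.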
____________
My lucky number is 75898524288+1 | |
|
|
Interesting. On the 580 box the GPU is not over clocked, but the CPU is.
I'm trying to get to my 100 mil point and these errors just started popping up today. Tried updating the driver and turned off the display sleep. Just saw a bunch more. | |
|
|
Looking at my failed packets, it appears they are failing on multiple machines. You may want to reconsider this and look to see if there is a bug. As a matter of fact, I haven't found one that failed in the last day and then succeeded on another machine. It looks like all my failed packets are failing elsewhere.
And not all my packets are failing. | |
|
Michael Goetz Volunteer moderator Project administrator
|
Interesting. On the 580 box the GPU is not over clocked, but the CPU is.
According to Nvidia's website, it IS overclocked. You're running at 1720 MHz and stock is 1544 MHz.
Your card may have come overclocked from the factory, but it's still overclocked.
The problems I saw in your Genefer WUs are most likely due to overclocking. If you slow down the memory clock and the shader clock, those problems should go away.
Tried updating the driver and turned off the display sleep. Just saw a bunch more.
I'm having trouble following you. You're talking about your Linux box now, right?
Please remember there's multiple applications running at PrimeGrid. With regards to the 295 driver error, that would affect ALL CUDA programs, but I don't see any evidence of that happening in your PPS Sieve WUs. I don't know what the problem is there, but that's not my application.
When I pointed out the error that said your driver needed to be updated, that was for a GENEFER WU. That has nothing to do with the PPS Sieve.
BTW, although there's definitely a problem with the 295 driver, nobody has stated that this is YOUR problem. You might want to ask about the problems you're experiencing in the appropriate topic for those sub-projects, so the right people might be able to help. Those would be the Proth Prime Search and Cullen/Woodall Search topics.
____________
My lucky number is 75898524288+1 | |
|
|
According to my nVidia console it's running at 1544.
No I'm talking about the Windows box running the GTX580.
The errors are for PPS(Sieve) (cuda23). I have about 140 since yesterday in the PPS Sieve cuda23 subproject. Is this the wrong place to bring this up? I'll start a new thread. | |
|
Michael Goetz Volunteer moderator Project administrator
|
According to my nVidia console it's running at 1544.
That's interesting because the WU I looked at, and quoted above, shows the driver reporting a higher clock rate back to the application. I don't know what to say about that.
No I'm talking about the Windows box running the GTX580.
In that case, why did you update the driver? It was the Linux box that needs a driver update. There was no need to update the Windows machine with the 580. If you updated that GTX 580 computer with the 295 driver, you replaced a working driver with one that fails under certain circumstances.
The errors are for PPS(Sieve) (cuda23). I have about 140 since yesterday in the PPS Sieve cuda23 subproject. Is this the wrong place to bring this up? I'll start a new thread.
Yes and no. This particular thread was asking about driver problems, and that doesn't seem to be your problem. While I can help you with the GeneferCUDA problems, there's not much I can do about your PPS Sieve problems.
So I would take a look at the topics I mentioned previously for a solution with the PPS Sieve.
Reading around, I see there are problems with some GCW Sieve WUs (not the software). I did take a peek at some of your PPS errors, and that looks like a faulty WU as well. Not so much the software, in both cases, as it is a configuration problem on the server. In any event, you're likely to find better answers where this is already being discussed.
I recommend taking a look in the Proth Prime Search topic. There's some discussion going on there about some bad WUs which might be applicable to your situation.
____________
My lucky number is 75898524288+1 | |
|
|
I'm facing the same errors using the 285 driver. Cullen/Woodall are working fine; only the Sieve WUs are producing about an 80% error rate. The computer runs 24/7 and all energy saving measures are disabled. | |
|
Michael Goetz Volunteer moderator Project administrator
|
I'm facing the same errors using the 285 driver. Cullen/Woodall are working fine; only the Sieve WUs are producing about an 80% error rate. The computer runs 24/7 and all energy saving measures are disabled.
You'll need to unhide your computers if you want help.
It doesn't sound like you have the 295 driver problem.
Just so you know, there's about 4 different problems that all popped up around the same time, so although some people have the 295 problem, you could have one of the other problems.
____________
My lucky number is 75898524288+1 | |
|
|
I downloaded CUDA WUs to my 285 driver system last night and the system started giving "Computation Errors". The system was working fine a week ago and works fine on other projects. It appears the problem lies with CUDA generation. | |
|
Michael Goetz Volunteer moderator Project administrator
|
I downloaded CUDA WUs to my 285 driver system last night and the system started giving "Computation Errors". The system was working fine a week ago and works fine on other projects. It appears the problem lies with CUDA generation.
You are correct.
There's four different bugs running around right now. One is a bad driver from Nvidia, another was a configuration error when creating WUs, a third isn't so much a bug as we ran into a limitation of one of the applications, and the fourth... I forget what the fourth is. All happened at once, more or less, which as you can imagine caused much confusion!
The good news is that since you're not running the 295 driver, you obviously aren't affected by that bug. The bad news is you're affected by the PPS Sieve WU bug. That's a problem with the WU generation, not the software, per se. Head over to the "Faulty batch PPS_Sr2 W.U.s" thread in the Proth Prime Search topic for more information.
____________
My lucky number is 75898524288+1 | |
|
|
I had installed the 295 driver and had the problems as discussed. I re-installed the 290 driver that I had on my dual SLI 570 rig and Prime is running, but I am not downloading any CUDA units. My preferences never changed and yet no CUDA work?
Any suggestions?
____________
Bark LOUD, Bite HARD! | |
|
|
I had installed the 295 driver and had the problems as discussed, I re-installed the 290 driver that I had on my dual SLI 570 rig and Prime is running but I am not downloading any CUDA units? My preferences never changed and yet no CUDA work?
Any suggestions?
Check your preferences page and check if "use Nvidia GPU" is enabled. | |
|
|
Yes it is. As I noted in my original post, none of the preferences on my account have been changed. Was getting CUDA before the install of the 295 driver and when I started having the errors, I went back to the 290 driver that had been working fine. No other changes.
Just that now I am not receiving any CUDA units only CPU units...
____________
Bark LOUD, Bite HARD! | |
|