Message boards : Generalized Fermat Prime Search : How to prevent Computation Errors
Michael Goetz (Volunteer moderator, Project administrator)
I'm not going to sugar-coat this: this is not an easy project to run. There are lots of ways to break a GPU app, and GeneferCUDA is more sensitive than most. This is compounded by the sheer length of time you must keep GeneferCUDA running in order to complete a WU.
GeneferCUDA is somewhat different from any other app I'm aware of. It uses double precision floating point, so it's using physically different circuitry than most other apps. It is also very efficient, so it drives the GPU harder than most apps do. The two together make GeneferCUDA very demanding on the hardware, and GPUs that run everything else just fine will fail on GeneferCUDA unless you slow down and/or cool off the GPU. (Yes, you can overclock the heck out of your GPU and it will run GeneferCUDA just fine -- provided you duct your air conditioning right into the computer case. It's easier to turn the clocks down, however.)
I've prepared a list of things you can do to increase the odds of a happy outcome with Genefer.
1) Never switch from stock apps to app_info, or from app_info to stock apps when you have any WUs on your computer.
2) Never try to upgrade an app in the middle of a WU. (You don't have to worry about this with stock apps. The server won't do that to you. It's only possible if you're foolish enough to do this manually using app_info.)
3) Don't run Nvidia's 295 or 296 drivers for Windows, or, if you do, configure your computer to never power off the screen(s).
4) Don't run Nvidia's 364 or 365 drivers for any platform, under any circumstances. There is a serious bug that produces incorrect results. Please use either 362.xx (or lower) or 368.xx (or higher). Note that Windows Vista is no longer supported as of 368.xx, so Vista *MUST* use 362.xx or lower.
5) Never connect to a computer doing any kind of CUDA work by using Windows Remote Desktop. That will kill the CUDA program. Use something like VNC instead.
6) Under some circumstances, using the "switch user" feature under Windows and logging in as a second user, while the first user is still logged on, will kill all CUDA programs. So don't do that, either.
7) Run at stock clock speeds, especially the memory clock. This is recommended even if your GPU is factory overclocked. Some specific cards, e.g., the GTX 550 Ti, will need to go even slower and may not work reliably unless you underclock the memory. Also try to keep the GPU as cool as possible: clean the dust out, run the fan faster, convince your wife that all that ducting leading into the computer will actually save money on energy costs, etc. Just don't attribute credit for that last one to me. ;-)
8) Make sure that no other software is reading files in the BOINC directories. This includes virus scanners or backup programs. I use a web based backup, and it takes a while to backup a 32 MB GeneferCUDA checkpoint file. If Genefer does its periodic checkpoint while the old file is locked by the backup process, Genefer will die. (Update: v1.07 is smarter about handling this scenario, but it's still a good idea to prevent programs like backups and anti-virus scanners from looking in the BOINC directories.)
9) If you have a mixture of double precision and single precision GPUs, you MUST explicitly tell BOINC to run GeneferCUDA only on the double precision GPU(s). Otherwise, BOINC may try to run GeneferCUDA on a single precision GPU, and the WU will fail. Please read this message for details about configuring cc_config properly for this scenario; a minimal sketch follows below. This is only applicable to Nvidia GPUs, and only applicable to certain transforms.
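For illustration, here is a minimal cc_config.xml sketch using BOINC's standard <exclude_gpu> option. The device number and app name below are placeholders -- use the device number of your single precision GPU, and see the message referenced in item 9 for the exact app name to use:

<cc_config>
  <options>
    <!-- Keep this app off the single precision GPU (device 1 in this example). -->
    <exclude_gpu>
      <url>http://www.primegrid.com/</url>
      <type>NVIDIA</type>
      <device_num>1</device_num>
      <app>genefer</app>
    </exclude_gpu>
  </options>
</cc_config>

The file goes in the BOINC data directory; restart BOINC (or use "Read config files" in the manager) for the change to take effect.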
____________
My lucky number is 75898^524288+1
I'm using a normal speed GTX 590, is there a reason why I get the maxErr exceeded only when N=32768 (on PRPNet) but OK for the other N values?
What would I need to do to get it to run for this N value?
Any help would be appreciated.
____________
147*2^1392930+1 was my first prime number found, others have followed :)
I'm using a normal speed GTX 590, is there a reason why I get the maxErr exceeded only when N=32768 (on PRPNet) but OK for the other N values?
What would I need to do to get it to run for this N value?
Any help would be appreciated.
Yes, there is a reason. I'm not sure about the proper technical stuff behind it, but the GPUs can't handle numbers beyond a certain b value for a given N. Currently only N=262144 and upwards can be crunched on a GPU; for all the others, b has simply become too big. That also means there is nothing you can do to enable crunching at that level.
____________
PrimeGrid Challenge Overall standings --- Last update: From Pi to Paddy (2016)
Michael Goetz (Volunteer moderator, Project administrator)
I'm using a normal speed GTX 590, is there a reason why I get the maxErr exceeded only when N=32768 (on PRPNet) but OK for the other N values?
What would I need to do to get it to run for this N value?
Any help would be appreciated.
There's a maximum limit for B at each N. If you look at the Genefer "B" limits, you will see that GeneferCUDA's limit at N=32768 is about 1,840,000.
Currently, PRPNet is searching N=32768 at approximately B=4,700,000, which is way beyond what GeneferCUDA is capable of processing. You'll need to use the much slower Genefer80 program on your CPU to process those WUs. Genefer80 can go up to about B=64,510,000 at N=32768.
N=262144 is the smallest N you can currently crunch with GeneferCUDA. Below that another program (running on the CPU) is needed.
The maxErr check was actually designed to detect this too-high-B condition. The fact that it also happens to usually catch the hardware errors caused by overclocking is a (useful) coincidence.
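(Roughly speaking, as a sketch of the idea rather than the exact analysis: every value coming out of the transform should be an exact integer, and Genefer tracks how far each computed value x strays from one, err = |x - round(x)|; the largest such value is maxErr. Once maxErr approaches 0.5, rounding becomes ambiguous and the result can't be trusted, which is why the threshold sits at 0.4500 and why an "(err=0.5000)" result is a failure. Larger b means larger intermediate values and more round-off at a given N, hence the per-transform B limits.)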
____________
My lucky number is 75898^524288+1
Thanks both, the upper "B" limits help to explain a few things. I had been trying to slow it down by building a slower version with a few tweaks, but I also wasn't sure about my final results with an "(err=0.5000)".
Out of interest what would be the max "err" value allowed?
____________
147*2^1392930+1 was my first prime number found, others have followed :)
Michael Goetz (Volunteer moderator, Project administrator)
Thanks both, the upper "B" limits help to explain a few things. I had been trying to slow it down by building a slower version with a few tweaks, but I also wasn't sure about my final results with an "(err=0.5000)".
Out of interest what would be the max "err" value allowed?
0.4500
____________
My lucky number is 75898^524288+1
Result 353012046 failed due to operator error. The imbecile (who should have known better and shall remain nameless!) went and changed the executable (using app_info) in the middle of the run.
Mike, you should hide your computer before making such accusations :)
____________
Twitter: IainBethune
Proud member of team "Aggie The Pew". Go Aggie!
3073428256125*2^1290000-1 is Prime!
Crun-chi (Volunteer tester)
Can you give an explanation for this error?
Starting initialization...
maxErr during b^N initialization = 0.0000 (0.213 seconds).
Estimated total run time for 480258^262144+1 is 0:58:32
cuda_subs.cu(281) : cufftSafeCall() CUFFT error: 6.:32:34 remaining)
[2012-03-05 08:44:19 CEST] GFN262144: No data in file [genefer.log]. Is genefer broken?
____________
92*10^1585996-1 NEAR-REPDIGIT PRIME :) :) :)
4 * 650^498101-1 CRUS PRIME
2022202116^131072+1 GENERALIZED FERMAT
Proud member of team Aggie The Pew. Go Aggie!
Michael Goetz (Volunteer moderator, Project administrator)
Can you give an explanation for this error?
Starting initialization...
maxErr during b^N initialization = 0.0000 (0.213 seconds).
Estimated total run time for 480258^262144+1 is 0:58:32
cuda_subs.cu(281) : cufftSafeCall() CUFFT error: 6.:32:34 remaining)
[2012-03-05 08:44:19 CEST] GFN262144: No data in file [genefer.log]. Is genefer broken?
Maybe.
What OS is that computer using, what type of GPU, and what version of the video driver? Do you use RDP (Windows Remote Desktop), and was the computer being used for anything else at the time of the error?
The cuFFT error 6 (it shouldn't have overwritten the status message like that) means, literally, "it didn't work." That's not the most useful error message. :)
The part of the status message that wasn't overwritten shows that the program had 32 minutes to go, so it had run successfully for 26 minutes before failing.
You might want to try running that test again manually, to see if it works:
GeneferCUDA -q "480258^262144+1"
I suspect it was a transient error where something external caused the CUDA program to die. It could have run out of video memory, for example. I know that's not a very satisfying answer, but what happened to you is not very common and is usually caused by something in the environment.
You asked "Can you give an explanation for this error?" You might actually be the best one to answer that question, as there's a good chance you might know what caused the failure.
____________
My lucky number is 75898^524288+1
Crun-chi (Volunteer tester)
It repaired itself when new drivers were installed. All the data you asked for (OS, computer, GPU) you have in a PM from the last time, when I sent you the link to my hidden host.
So for now all is OK...
____________
92*10^1585996-1 NEAR-REPDIGIT PRIME :) :) :)
4 * 650^498101-1 CRUS PRIME
2022202116^131072+1 GENERALIZED FERMAT
Proud member of team Aggie The Pew. Go Aggie!
STE\/E (Volunteer tester)
I had to stop running the GFNs; too many errors, and the boxes would be basically locked up, totally unresponsive. I would have to push the reset button to even get into the box once it rebooted...
____________
Michael Goetz (Volunteer moderator, Project administrator)
Another problem to watch out for -- this one just killed a WU of mine. This would also affect all LLR programs, by the way, and is more likely to happen with larger numbers, so it is especially of concern for SoB.
I use a web backup program, CrashPlan. It's pretty good and I'm generally thrilled with it.
It just did something it shouldn't have done, however, and as a result a WU just died over a day into the calculations.
It is important that virus scanners and backup programs not be allowed to back up the directories where the BOINC programs are working. Although they are designed to co-exist with the programs that are using those files, a problem still occurs if the program attempts to delete a file while it's being backed up. This will cause Genefer to fail. It will probably also cause any program that writes a checkpoint file to fail -- and that's pretty much any BOINC program. Or PRPNet program, for that matter.
The bigger the checkpoint file, the more likely it is that the program is going to try to delete the old checkpoint while it's being backed up. Genefer's checkpoint files are pretty huge.
If you ever see a Genefer WU fail with error code 11, that's usually what happened.
(In my particular instance, CrashPlan was already configured not to back up BOINC. I'm not sure why it was. It's now configured not only with BOINC not selected, but explicitly de-selected as well.)
____________
My lucky number is 75898^524288+1
Michael Goetz (Volunteer moderator, Project administrator)
I had to stop running the GFNs; too many errors, and the boxes would be basically locked up, totally unresponsive. I would have to push the reset button to even get into the box once it rebooted...
Looking at your computers, you were running Genefer on 5 computers. 3 of those had a 100% success rate, and a fourth had a single WU that didn't validate against the other results, but otherwise was successful (I don't know what went wrong with that WU).
There was a single computer that was returning errors consistently. And on that machine, it looks like it's just one of your two GPUs that is creating all the errors. It's GPU 1 (the second GPU) on host 95370 that's problematic. Normally, the error that's occurring on that GPU is symptomatic of overclocking problems, but that GPU isn't overclocked. I suspect it's either a marginally faulty GPU, or, for some reason, it's operating in a warmer environment than usual. Or any number of other potential causes. Regardless of the exact cause, it might be possible to eliminate the errors by lowering the clocks below normal speed -- particularly the memory clock.
As for the lag (which I know is horrible when it occurs!), it's usually possible to completely eliminate it if you take the time to find what other program Genefer is teaming up with to cause the problem. Once you know what else is causing the problem, you can either avoid using that program, manually suspend BOINC GPU computations when using that program, or use cc_config.xml to automatically suspend BOINC GPU computations when using that program (see the sketch below).
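For example, a minimal sketch using BOINC's standard cc_config.xml options ("game.exe" below is a placeholder for whatever program turns out to clash with Genefer):

<cc_config>
  <options>
    <!-- Suspend all GPU computing whenever this program is running. -->
    <exclusive_gpu_app>game.exe</exclusive_gpu_app>
  </options>
</cc_config>

Restart BOINC (or tell it to re-read its config files) after editing.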
Even if you can't find the other program, it's always possible to configure BOINC to not run GPU programs when the computer is in use, so the lag would never happen when you're using it.
It's up to you, of course. I just want to point out there may be solutions.
____________
My lucky number is 75898^524288+1
I returned from work this morning (and had to leave again for another job 10 minutes later) to find all my recent Genefer WUs had errored out.
I expected a few of them to do so, as some of them had already errored out or been aborted up to 11 times before. But I didn't expect every one of them to fail.
I have already completed quite a few WUs at the same clock speed on the same card (460), and recently validated a WU which had errored out or been aborted 11 times, so I wasn't expecting this problem when I got home.
I have since reduced clock speeds, but as I said earlier, I have validated quite a few at the clock speeds I was running.
The only major changes were running SoBs and turning off hyper-threading to hopefully get a bit of turbo boost on the i7.
Confused, john3760
edit: 1 WU has just validated at the lower clock speed. I will keep an eye on the next few.
edit2: Make that two WUs. I still don't understand why I have had to drop the clocks, when they were working alright beforehand.
Michael Goetz (Volunteer moderator, Project administrator)
edit: 1 WU has just validated at the lower clock speed. I will keep an eye on the next few.
edit2: Make that two WUs. I still don't understand why I have had to drop the clocks, when they were working alright beforehand.
I also have a 460, and you're running about 300 MHz above stock clocks.
I did some tests with my 460, and I had no trouble running in the 1600 MHz range like you're doing.
But I still run at stock speeds anyway.
We know that the problem is temperature sensitive, and that points to a hardware fault. I don't know where you live, but today, in my area, it's unusually warm. If your card is a little bit warmer today than yesterday, for example, that could cause the card to have trouble today even though it's worked fine in the past.
My card seems perfectly stable at the speed you're running at. The 460's seem to have a decent amount of headroom for overclocking.
But we're running WUs that take over a week to crunch, and they're going to get longer. I run at stock speeds to increase the probability of success. At stock speeds, you shouldn't have any trouble with these errors.
____________
My lucky number is 75898^524288+1
Hello,
Can you please have a look at my failed WUs?
I use a Gigabyte GTX 460, never overclocked, but the large WUs break after some time:
http://www.primegrid.com/result.php?resultid=356749167
http://www.primegrid.com/result.php?resultid=356362009
http://www.primegrid.com/result.php?resultid=355521013
Any idea what's wrong?
rgds
Tabaluga
Michael Goetz (Volunteer moderator, Project administrator)
Can you please have a look at my failed WUs?
I use a Gigabyte GTX 460, never overclocked, but the large WUs break after some time:
http://www.primegrid.com/result.php?resultid=356749167
http://www.primegrid.com/result.php?resultid=356362009
http://www.primegrid.com/result.php?resultid=355521013
Any idea what's wrong?
That's the same error you see with overclocked GPUs.
This is from your GTX 460:
GPU=GeForce GTX 460
Global memory=1073283072 Shared memory/block=49152 Registers/block=32768 Warp size=32
Max threads/block=1024
Max thread dim=1024 1024 64
Max grid=65535 65535 65535
CC=2.1
Clock=1430 MHz
This is a GTX 460 that is stock:
GPU=GeForce GTX 460
Global memory=1073741824 Shared memory/block=49152 Registers/block=32768 Warp size=32
Max threads/block=1024
Max thread dim=1024 1024 64
Max grid=65535 65535 65535
CC=2.1
Clock=1350 MHz
You may have never changed the clock settings on your GPU because it is "factory overclocked". But it's still overclocked. Try setting the clocks to the stock settings and see if that helps:
Core clock: 675 MHz
Shader clock: 1350 MHz
Memory clock: 1800 MHz
____________
My lucky number is 75898^524288+1
I also have some strange behaviour which I could not yet figure out. I have two hosts with a GTX 560 running: W7, same driver, same distributor, same stock clock, default Shift.
One host runs the 1.06 stock app, the other app_info (because of AVX) and the 2.3 beta. The host with the stock app is running fine and all results get validated. The other has a couple of results returned as valid, but inconclusive. Also, one in 20 WUs gets a maxErr during crunching. But also a lot of valid and correct results.
Both are BOINC-only machines, no productive work on them. The good host also has the higher temps, CPU and GPU.
I have no idea how I can track down the problem. Michael, any ideas?
Regards Odi
____________
Hi Michael,
I set the speeds down to the recommended values.
I will come back with the result of the next WU, but it looks like I won't get one.
I found funny messages in the BOINC log:
08.03.2012 20:54:46 | PrimeGrid | Resent lost task genefer_4194304_72892_12
08.03.2012 20:54:46 | PrimeGrid | [error] No app version found for app genefer platform windows_intelx86 ver 106 class cuda32_13; discarding genefer_4194304_72892_12
Any idea?
rgds
Tabaluga
Not having any idea what it took or takes to get the program going for the Genefers, I do wonder if you could add an error-trap routine that would allow the program to restart at some point when it encounters these kinds of errors, with a counter to keep it from looping. Of course, you may already be doing this.
Thanks for the reply.
Since downclocking, everything seems to be OK.
I wish I had caught it sooner.
Thanks, john3760
Try setting the clocks to the stock settings and see if that helps:
Core clock: 675 MHz
Shader clock: 1350 MHz
Memory clock: 1800 MHz
I have a GTX 460 too, and I had problems with the factory overclocked settings:
Core Clock: 810 MHz
Shader clock: 1620 MHz
Memory clock: 2000 MHz
All problems went away after downclocking the MEMORY clock without changing the CORE clock. I have finished enough tasks to claim that the GTX 460 has an adequate margin of safety for CORE overclocking, but the MEMORY clock needs to be reduced to stock speed or so.
So, I have no problems at all with these settings:
Core Clock: 810 MHz
Shader clock: 1620 MHz
Memory clock: 1600 MHz
Michael Goetz, I recommend adding the memory clock to the log. If you can, of course.
____________
Michael Goetz (Volunteer moderator, Project administrator)
Not having any idea what it took or takes to get the program going for the Genefers, I do wonder if you could add an error-trap routine that would allow the program to restart at some point when it encounters these kinds of errors, with a counter to keep it from looping. Of course, you may already be doing this.
I thought about it.
Assume for the time being that I'm correct in my analysis of the problem and that it is a hardware fault that occurs with the right combination of clock speed and temperature.
If this problem occurs, it will probably continue to reoccur. If a maxErr exceeded error is encountered, you can't just go back one step and try again. You would have to restart from the beginning, and there's no guarantee that the problem won't happen again.
With the length of these WUs, restarting isn't really an option either because there's a good chance that the second time around, even without an error, you can't return the WU by the deadline. Missing deadlines is worse for the project than computation errors are.
Also, returning a computation error hopefully gets the attention of the person who owns the computer, so they can figure out what's wrong and correct it.
(As you can see from the ton of 295 driver errors, however, that's clearly not the case a good percentage of the time!)
____________
My lucky number is 75898^524288+1
If this problem occurs, it will probably continue to reoccur. If a maxErr exceeded error is encountered, you can't just go back one step and try again. You would have to restart from the beginning, and there's no guarantee that the problem won't happen again.
That's the part I figured might be the issue, and I agree it wouldn't necessarily be good to just restart. Too bad you can't add checkpoints to revisit in the event of errors. I know we've had lots of errors, but we don't get to see multiple occurrences reported. Who knows, it might turn out that you could narrow down the cause. Just a wish list :)
I will say, in my opinion, a lot of errors are temperature related, however that's caused.
Michael Goetz (Volunteer moderator, Project administrator)
I also have some strange behaviour which I could not yet figure out. I have two hosts with a GTX 560 running: W7, same driver, same distributor, same stock clock, default Shift.
One host runs the 1.06 stock app, the other app_info (because of AVX) and the 2.3 beta. The host with the stock app is running fine and all results get validated. The other has a couple of results returned as valid, but inconclusive. Also, one in 20 WUs gets a maxErr during crunching. But also a lot of valid and correct results.
Both are BOINC-only machines, no productive work on them. The good host also has the higher temps, CPU and GPU.
I have no idea how I can track down the problem. Michael, any ideas?
Regards Odi
Yeah, that one's a little puzzling. The inconclusive results (and those that were inconclusive and later set to invalid) have the same underlying problem that causes the maxErr exceeded error. That maxErr test doesn't always trap all the errors; some make it all the way through to the end of the calculation but then get detected because the residuals don't match what the other computers return.
I'd try the same technique we use with the 550 Ti -- try lowering the memory clock and see if that helps. I think you're already running at stock clocks (although Nvidia's website is uncharacteristically vague about the 560 for some reason.)
This is one of those situations where everything needs to work right -- and there are a lot of variables. It can be pretty hard to figure out what the cause is. One person -- Honza(?) -- fixed his problems by swapping parts (GPUs and power supplies) between two computers, and now both computers are working fine.
So that's another thing you can try since you have a bunch of computers. Try swapping that GPU with a similar GPU in another computer, and see if the problem moves with the GPU or stays in the same computer. If it stays in the original computer, the problem might be in the motherboard or power supply.
____________
My lucky number is 75898^524288+1
I will say, in my opinion, a lot of errors are temperature related, however that's caused.
My GPU works at 88C in both cases:
Memory clock = 2000 MHz
Memory clock = 1600 MHz
But most of the WUs at 2000 MHz finished with maxErr, and all (!) of the WUs (approx. 30 units) finished successfully at 1600 MHz.
Do you still think 88C is not high enough for a GPU?
____________
Michael Goetz (Volunteer moderator, Project administrator)
08.03.2012 20:54:46 | PrimeGrid | Resent lost task genefer_4194304_72892_12
08.03.2012 20:54:46 | PrimeGrid | [error] No app version found for app genefer platform windows_intelx86 ver 106 class cuda32_13; discarding genefer_4194304_72892_12
That's an error that happens when the BOINC client on your computer and the BOINC server at PrimeGrid get out of sync for some reason. There are lots of things that can cause that, but fortunately it's not very common.
Hopefully the problem will resolve itself -- if not, try restarting BOINC. If that doesn't work, you may need to delete and reinstall BOINC.
Just to be certain, you're not now, and have never before, used app_info, correct?
____________
My lucky number is 75898^524288+1
Michael Goetz (Volunteer moderator, Project administrator)
Thanks for the reply.
Since downclocking, everything seems to be OK.
I wish I had caught it sooner.
Thanks, john3760
Awesome! Good luck.
____________
My lucky number is 75898^524288+1
Michael Goetz (Volunteer moderator, Project administrator)
All problems went away after downclocking the MEMORY clock without changing the CORE clock. I have finished enough tasks to claim that the GTX 460 has an adequate margin of safety for CORE overclocking, but the MEMORY clock needs to be reduced to stock speed or so.
That's not surprising at all. It's underclocking the memory clock that resolved the GTX 550 Ti problem, so it makes sense that it would work on other cards too.
Michael Goetz, I recommend adding the memory clock to the log. If you can, of course.
I wish I could, but the API that I use to get the information you see in the stderr output only shows the shader clock. I'm sure there's a way to get the other clocks since overclocking utilities obviously can do that, but I don't know how to do it and it's possible you need UAC access to get to that information.
____________
My lucky number is 75898^524288+1
Michael Goetz (Volunteer moderator, Project administrator)
That's the part I figured might be the issue, and I agree it wouldn't necessarily be good to just restart. Too bad you can't add checkpoints to revisit in the event of errors.
That's possible to do, but unwise in my opinion. If an error occurred, you have no way of knowing how long ago the actual error happened. Some WUs make it all the way to the end without the error ever being caught. The *only* point where you can be sure the numbers are correct is the beginning of the calculation.
Who knows, it might turn out that you could narrow down the cause. Just a wish list :)
I will say, in my opinion, a lot of errors are temperature related, however that's caused.
It's a hardware fault of some type (it's hard to explain any way for a software fault to be affected by temperature alone). In the vast majority of cases, it occurs when running the hardware beyond what it's designed to do, i.e., overclocking.
Overclocking works fine most of the time, but it doesn't with Genefer.
If there was such a thing as a "Society For Prevention of Cruelty to GPUs", it would probably prohibit people from running Genefer. This program pushes the GPU really, really hard.
____________
My lucky number is 75898^524288+1
Yeah, that one's a little puzzling.
Yeah, it seems so.
I think you're already running at stock clocks
Yep, these are identical cards. Same manufacturer, same speeds, same other specs.
fixed his problems by swapping parts (GPUs and power supplies) between two computers, and now both computers are working fine.
Identical Power Supplies with enough power (750W) for one card.
Try swapping that GPU with a similar GPU in another computer, and see if the problem moves with the GPU or stays in the same computer.
Good idea, I will try it next week. For now I have rebooted the system. I have the subjective feeling that the errors have accumulated over the last few days. The system seemed reliable a week ago.
Regards Odi
____________
All problems went away after downclocking the MEMORY clock without changing the CORE clock. I have finished enough tasks to claim that the GTX 460 has an adequate margin of safety for CORE overclocking, but the MEMORY clock needs to be reduced to stock speed or so.
That's not surprising at all. It's underclocking the memory clock that resolved the GTX 550 Ti problem, so it makes sense that it would work on other cards too.
I know, but you would still recommend CORE downclocking, because all the info you possess is the shader clock.
and it's possible you need UAC access to get to that information.
Nothing special.
NVIDIA System Tools adds a Performance tab to the standard NVIDIA Control Panel.
____________
Michael Goetz (Volunteer moderator, Project administrator)
I know, but you would still recommend CORE downclocking, because all the info you possess is the shader clock.
I changed the post at the top to emphasize the memory clock, and I'll try to remember to mention it in the future.
____________
My lucky number is 75898^524288+1
I know, but you would still recommend CORE downclocking, because all the info you possess is the shader clock.
I changed the post at the top to emphasize the memory clock, and I'll try to remember to mention it in the future.
That's one reason why I think a lot of it is temp related. Way back when we just had sieves, one main way to help reduce temps was downclocking the memory, since it was not really part of the equation at the time; it was done just to keep the temps lower. As x3 stated, he lowered his memory clock and was somewhat successful. I wish I still had access to a freezer/staging area where I could put a test box in and see what happened. The area was kept between the high 30s and low 40s. That would've been a fun test.
Yes, correct, I have never used app_info.
The out-of-sync problem is solved: restarting BOINC did not help, but simply detaching from PrimeGrid and reattaching solved it.
rgds
Tabaluga
Dave
88C seems far too high. I avoid going over 80 on my 580s. Manually set the fan to a speed that brings the temp well below 80C, and monitor it with SpeedFan.
88C seems far too high. I avoid going over 80 on my 580s. Manually set the fan to a speed that brings the temp well below 80C, and monitor it with SpeedFan.
Why do I need to do this if both my GPUs have worked perfectly in this mode for the last 1.5 years?
Yes, I know you'll never buy my GPUs after what I said, but that's nobody's business. I can't do anything better. Most GTX 460s are equipped with a propeller fan instead of a turbine (blower) fan, so raising the fan speed brings nothing but noise when your 2-slot GPUs are packed like a pie and each higher one blows into the back of the one below...
FYI: There is an HD 4890 between my GTX 460s.
Three video cards in one box: it's very difficult to stay cool when 2 of 3 are working...
The 3rd, the main and highest GTX 460, is idle; this way I avoid lagging.
When I tried to use all 3 cards, the temperature of the highest one increased up to 100C, so I decided not to crunch on that one.
____________
Michael Goetz (Volunteer moderator, Project administrator)
My 460, depending on ambient temperature, runs between 80 and 84.
That's with everything at stock, including the fan control.
____________
My lucky number is 75898^524288+1
The next WU failed with the same error code:
core 675
shader 1350
memory 1800
So now I have taken the core/shader back to the normal 715/1430, and the memory to 1750.
I will come back with the next result.
rgds
Tabaluga
Tabaluga,
try reducing the memory clock down to 1600.
Actually, the memory clock has almost no impact on the duration of a GeneferCUDA run.
If it is successful at 1600, you can try 1700 and 1750 afterwards.
____________
For now I have rebooted the system. I have the subjective feeling that the errors have accumulated over the last few days. The system seemed reliable a week ago.
@Michael: Things are getting stranger and stranger. I rebooted the system. After this, all WUs errored out with maxErr after a couple of minutes. Because nothing had changed since the last boot except the beta app, I rolled back the 2.3 beta to the 1.06 stock app.
After this, the tasks are running well: 3 valid up to now, 2 waiting for validation. I have no explanation for this. After the next WU, I will try another reboot and see whether the maxErr appears again.
Regards Odi
____________
Update: I figured out it is not the app. I can reproduce this behaviour with 1.06 as well. I also tested downclocking, but this did not solve the problem. Meanwhile, I suspect it could be related to CPU reliability. The errors accumulate when the CPU is running 4 SR5 AVX PRPNet tasks simultaneously, which stress the CPU a lot.
I switched the machine to Milkyway to see whether the double precision tasks from there also produce errors or not.
Regards Odi
____________
Michael Goetz (Volunteer moderator, Project administrator)
Updated the first post to reflect that the 296 drivers have the same problem as the 295 drivers, and that this problem appears to be limited to Windows.
____________
My lucky number is 75898^524288+1
Michael Goetz (Volunteer moderator, Project administrator)
Updated the first post to add information about running with a mixture of single precision and double precision GPUs.
I also removed most of the narrative to reduce the tl;dr factor. ;)
____________
My lucky number is 75898^524288+1
Is there a GeneferCUDA test or benchmark we can use to test the ability to successfully run a work unit, without having to wait for one to be assigned... just to fail? I'm running the Windows BOINC software.
http://www.primegrid.com/result.php?resultid=366798079
Ok, I can answer my own question. It might be useful for others too:
start->run->cmd
cd\programdata\boinc\projects\www.primegrid.com
primegrid_genefer_2_3_0_0_1.07_windows_intelx86__cuda32_13.exe -boinc -q "215902^524288+1" --device 0
If it fails, run "type stderr.txt" to see the info. I guess that's better than logging computational errors.
Honza (Volunteer moderator, Volunteer tester, Project scientist)
Yes, there is a built-in benchmark in Genefer.
See the GeneferCUDA Block Size Setting thread for more, and for some numbers to compare your speed against.
But beware: GFN is computation intensive, and overclocking tends to cause computation errors.
____________
My stats
Thanks for the link. I had edited my post before I saw yours with what I had come up with.
Michael Goetz (Volunteer moderator, Project administrator)
Andy,
You're a bit of a pioneer here with your GTX 680, so we don't really know what to expect in terms of reliability.
What we have found, in general, is that temperature and clock speed can make GeneferCUDA fail on other GPUs, but yours is the first GTX 680 to exhibit the maxErr exceeded error that is typical of overclocking-induced failure.
If you're overclocking the GPU, definitely try reverting to stock clock speeds. You could also try opening the computer case and having a fan blow cool air onto the GPU, as well as lowering the memory clock below stock speed.
I don't have any reliability data with your GPU, so it would be interesting to see what factors affect reliability. Stock shader/core clocks, stock or underclocked memory clock, and good cooling have in general been what helps other cards.
EDIT: If the numbers that Genefer is reporting are correct, you're already significantly underclocking your card at 705 MHz, whereas the stock clock is 1006 MHz. Of course, it's entirely possible that the reported clock speed is not accurate.
____________
My lucky number is 75898^524288+1
The reported speed is "wrong" - that is, nVidia's reference design comes out of the box clocked at 1006 MHz. This card is just a reference card without any manufacturer overclock (other than the auto overclocking built into these new cards).
I'm sitting at around 73C running the command I had in my previous post, so my temps are good.
I had been overclocking both the GPU and VRAM when getting those errors, and reducing the overclock has helped in my testing. I'm trying to find a setting that is reliable and fast, so I'm still overclocking... But I want the next block I retrieve to actually be completed, so I'm simulating it by running one of the blocks I failed manually.
I was actually surprised that an overclock that showed no artifacts or other instability would actually fail in CUDA, so I had come here looking to see if there was a bug related to the new architecture. But it was just my overclock.
Michael Goetz (Volunteer moderator, Project administrator)
The reported speed is "wrong" - that is, nVidia's reference design comes out of the box clocked at 1006 MHz. This card is just a reference card without any manufacturer overclock (other than the auto overclocking built into these new cards).
I'm sitting at around 73C running the command I had in my previous post, so my temps are good.
I had been overclocking both the GPU and VRAM when getting those errors, and reducing the overclock has helped in my testing. I'm trying to find a setting that is reliable and fast, so I'm still overclocking... But I want the next block I retrieve to actually be completed, so I'm simulating it by running one of the blocks I failed manually.
I was actually surprised that an overclock that showed no artifacts or other instability would actually fail in CUDA, so I had come here looking to see if there was a bug related to the new architecture. But it was just my overclock.
I'm very glad to hear that the errors occurred while you were experimenting with overclocking. It's bad enough that the 680 is slower than the 580; if it also couldn't run stably at reference clock speeds, that would just be rubbing salt in the wound.
With the 200, 400, and 500 series GPUs, we found that GeneferCUDA -- not CUDA programs in general -- is extremely sensitive to overclocking and/or heat. Your GTX 680 will likely run other CUDA programs just fine at clock speeds where GeneferCUDA will fail.
Genefer has two characteristics that few other CUDA programs have. It uses double precision math, so it's using physically different circuitry than most other programs (and games) use, and it's almost entirely GPU based, with the CPU hardly being used at all. The GPU is therefore running almost continuously, which isn't the situation with most CUDA programs.
My best guess is that this continuous running of the double precision circuitry stresses something in the GPU in a way that rarely occurs with any other program. Overclocking, often even factory overclocking, results in the "maxErr exceeded" error that you saw. On some GPUs, in particular the 550 Ti, even the reference speeds are too fast, and slowing down the memory clock (which significantly reduces operating temperature) is often needed to get GeneferCUDA to run stable.
What's interesting is that it appears that if you keep the GPU really cold, you can overclock it like crazy and still have Genefer run correctly. One person literally had an air conditioning unit's airflow ducted straight into the computer case. Yes, that worked, and his GPU is running at a very high clock speed, without error.
____________
My lucky number is 75898^524288+1
Your general advice in point #7 still holds true for the GTX 680. Memory speeds are key to processing the Genefer work units successfully, but there's no reason to underclock the memory on the 680. As a note, according to GPU-Z, memory controller load is about 57% for Genefer versus 0% for PPS.
As can be seen at http://www.primegrid.com/results.php?hostid=250209&offset=0&show_names=0&state=0&appid=16 from April 5 and beyond, all the work units processed by the 680 have been successful (using the settings described below). The ~9500-10000s runs are the 680, and the ~12000s runs are the 560Ti I also have in there.
My particular card seems to handle Genefer running at 1652 MHz memory (1502 MHz stock, +300 DDR MHz offset) and +135 MHz core. There are no separate shader clocks.
A +325 memory offset seemed to work but eventually failed, as can be seen in the work history, but +300 has worked without failures so far. Failures seem to happen within the first 15 minutes if the memory is clocked too high, which is different from the other computation errors I see on the Fermi cards, which tend to fail a lot around an hour into processing. This memory clock is significantly lower than +475, which will work for a game without noticing any corruption.
I did slightly less extensive testing with core speeds, as I wanted to use a speed that would work for Genefer and for games. I've used +140 without problems in games, but I've kept it at +135.
Heat definitely isn't a problem. 74C/54%/2100 RPM is what I see on Genefer work units with a slightly modified fan profile, and this is pretty quiet. There are probably two reasons for such cool and quiet operation... the performance is less, and I'm unable to use any voltage higher than 1.1750 to clock the core higher and make heat more of an issue. Other work unit types definitely generate more heat.
I wanted to be reasonably sure that I had found stable settings before I posted describing how I got there. It looks reasonably stable. I hope these are the details you were looking for, and I don't see any reason to expect issues with the stock GTX 680.
EDIT: About the "+135" on the core speed - the stock clock is 1006 MHz, but it will automatically dynamically overclock to at least 1059 and usually more, depending on TDP and temperature. With a +135 offset to the base clock, there are two frequencies the core will run at depending on temperature. If the GPU is below 70C, the core will run at 1215 MHz. If it's 71C or above, it will run at 1202 MHz. It runs at 1.175 V at 1215 MHz and 1.163 V at 1202 MHz. This is the highest the voltage will go on the reference board.
I am seeing lots of computation errors. It *used* to work just fine, no driver changes, but now all the WUs I get fail:
http://www.primegrid.com/results.php?hostid=221152
It seems most/all of the work units I am getting that failed also failed multiple times for others, so I'm not sure if this is a more widespread issue...
Michael Goetz (Volunteer moderator, Project administrator)
I am seeing lots of computation errors. It *used* to work just fine, no driver changes, but now all the WUs I get fail:
http://www.primegrid.com/results.php?hostid=221152
It seems most/all of the work units I am getting that failed also failed multiple times for others, so I'm not sure if this is a more widespread issue...
ALL CUDA WUs everywhere are currently seeing massive numbers of errors due to the Nvidia 295/296 driver bug. That's got nothing to do with your problem.
This is a typical error from your recent WUs:
maxErr exceeded for 1988^4194304+1, 0.4989 > 0.4500
That problem is caused by a hardware fault, which is, in turn, usually caused by overclocking or overheating.
Since you're not overclocking the GPU, I suspect the problem is with heat. The weather is getting warmer (I've actually stopped crunching for the summer) so perhaps the ambient temperature around your computer is higher than before. Or perhaps the culprit is dust. It's probably worthwhile to take some compressed air and blow out the computer (and the GPU) to clean those heatsinks off. Depending on how long it's been since you've done that, this can lower the operating temperatures by a significant amount, and that can mean the difference between a calculation succeeding or failing.
It's also possible the GPU has failed, but I'd bet on heat being the culprit. If blowing the dust out doesn't solve the problem, try running a WU with the GPU fan set to run continuously at 100% and see if it makes a difference. If that fixes the problem, then the card is just too hot.
____________
My lucky number is 75898^524288+1
So after posting last night, I aborted my WU and went back to PRPNet.
First I tried a GFN262144, which was estimated to complete in 1:14:00, but it exceeded maxErr and fell back to the CPU.
I then tried GFN524288; each WU took 4 hours, and it completed two WUs last night on CUDA without issue.
This morning I've installed the nVidia monitoring tools; the GPU runs at 65C (17C in the room). I'm now trying GFN262144 with the GPU fans set to max, which has the temp at 54C. If it completes these, I'm going to switch back and try it on PrimeGrid again, perhaps on a smaller WU to start, and work my way up.
If this keeps failing, I'm going to have to tell my wife the video card has failed and I need to upgrade to a dual 580 ;)
Michael Goetz (Volunteer moderator, Project administrator)
Running PrimeGrid or PRPNet makes no difference with regard to the operation of GeneferCUDA. It's the same program, running in the same way (at least in the parts that affect the maxErr error).
Unless you're using GeneferCUDA outside of its B range (which won't happen currently on BOINC or on PRPNet at 262144 or 524288), if you're seeing maxErr exceeded errors that's an error in your hardware. If that's happening a lot, it doesn't matter which WUs you're running; you need to address the problem. If it's only happening once in a while, although it's still a good idea to fix the root cause, you could manage to complete a lot of WUs if you run shorter WUs since the odds of your completing a short WU before encountering an error are higher than the odds of completing a long WU before encountering an error.
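(As a rough sketch of the arithmetic: if each hour of crunching independently has some small probability p of tripping the fault, a WU that takes h hours finishes cleanly with probability (1-p)^h, so shorter WUs complete noticeably more often -- but no WU length makes an unreliable card reliable.)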
____________
My lucky number is 75898^524288+1
I just got this GFN error on a new platform I'm working with:
cuda_subs.cu(156) : cudaSafeCall() Runtime API error : CUDA driver version is insufficient for CUDA runtime version.
WUs error out after just 2 or 3 seconds.
Ubuntu 12.04 64-bit, with two nVidia Tesla M2050 cards.
Driver: NVIDIA-Linux-x86_64-295.59
CUDA toolkit: cudatoolkit_4.2.9_linux_64
SDK: gpucomputingsdk_4.2.9_linux
I think those are all the latest versions? What could be the mismatch?
I am pretty sure that this question was asked many times, but...
I am testing a completely new, out-of-the-box 570 card (Gigabyte GV N570OC-13I REV2.0), and it seems to pass every test except Genefer/CUDA (in the framework of this BOINC rally) and, separately, CUDALucas. They fail at similar rates (unsurprisingly, because these are very similar algorithms): totally wrong data as the card is shipped (memory at 1900 MHz), and a 10-20% fail rate with memory at, say, 1820 MHz. GPU frequency doesn't seem to matter; I underclocked it to some degree, too.
I am not sure if the RMA Gigabyte people will listen kindly to an argument like the above ("works fine for anything but some obscure (for them) applications"). What could be my options? RMA would probably get the same result. Reseat the cooler (which in itself, and by external reviews, looks pretty decent) on some quality paste? This particular cooler design, I believe, covers the memory modules too (even though the memory is not factory OC'd).
Thanks in advance for any advice.
Michael Goetz (Volunteer moderator, Project administrator)
Serge, if setting the shader clock to 1464 (factory settings) doesn't work, I think I'd RMA it. I don't remember anyone having to underclock a 570.
One other thought...
The following assumes that I've correctly identified your GPU. It has three fans and a rather small vent on the back plate, right?
One thing I would try is to open the computer case and set a desk fan to blow air in there. That cooler design, frankly, is not my favorite. I much prefer a single fan at one end of the card that blows air through the whole card and then out the back. That open design on your card blows a lot of hot air back into the case, which then gets drawn back into the GPU. Unless the case has tremendous airflow, the air in there -- meaning the air that's supposed to be cooling off the GPU AND the CPU -- is going to get awfully hot.
If running with the case open helps, a more permanent solution would be to add some case fans so that a LOT of cool air is constantly being drawn into the case below the GPU.
____________
My lucky number is 75898^524288+1
Yep, that's the one; 3 fans. The airflow is good (the case is an Antec 900), and I ran it in the open as well.
Either way the temps are very low, hovering at 61C under full load.
I am using EVGA Precision (and OC Scanner) left over from the previous 560 Ti 448 that I moved to another comp. It doesn't fail the OC Scanner (it would have been easier if it did).
There may be a defective memory chip that passed the light QC and is somewhat working. What I like least is that even with more underclocking, Genefer still fails once every few tasks. That's not good.
When the tasks finish, they finish well - and quite a few are validated already. But with the 1/3 drop-off, this is a fairly rotten deal.
I'll start the RMA, or maybe a return if Amazon will take it back. The kids won't be happy :-) fewer games for them for a while.
Could it be from somewhat silly wiring? (I had the 8-pin PCI-E, which I used; for the other 6-pin I used the 2-molex joiner included with the card. This contraption leaves one, middle 12V rail not powered. Are they connected in the circuitry on the card immediately anyway?)
Thanks.
P.S. The PSU is 700W, though.
Michael Goetz (Volunteer moderator, Project administrator)
Could it be from somewhat silly wiring? (I had the 8-pin PCI-E, which I used; for the other 6-pin I used the 2-molex joiner included with the card. This contraption leaves one, middle 12V rail not powered. Are they connected in the circuitry on the card immediately anyway?)
Thanks.
It's probably not the wiring, but it COULD be the power supply. A power supply that's having trouble keeping up will result in unstable (and difficult to diagnose) problems.
If you had to use a Y-cable to drive the auxiliary power plugs, it's *possible* that the power supply wasn't designed to supply that much power on those connectors. You might see an improvement by trying a different power supply, or plugging the Y-cable into different molex connectors.
I have, in the past, had a problem that went away when I pulled out a perfectly adequate 600W PS and replaced it with a 700W PS. 600W should have been more than enough to drive a Core i3 with a GTX 470, but the problems went away when I put in the bigger power supply. This was after going through about 3 different 470s.
So, it might not be an overclocking/overheating problem, per se. It might be that lowering the clocks lowers the power consumption, and the power supply can keep up when the power draw is lower.
____________
My lucky number is 75898^524288+1
I'll try to crimp a proper straight 6-to-6 PCI-E YY5700-H connector on Monday and will decide on the RMA after that. (The PSU is the modular, venerable TT 700W; the proper cables were misplaced after the move to the new house.)
Thanks, it could be the reason, too. (They throw in those Y-cables with every card; sure, that's better than nothing - but for this job, these cables could be the weak link.)
I was able to crimp it tonight. (The TT 700W PSU is modular and has two "red" PCI-E feeders + one hardwired 8-pin PCI-E, which I was already using.)
With the first PCI-E port, nothing changed: still shaky even when underclocked.
With diminishing hope I replugged the PCI-E into the other, never-before-used port... et voila, no errors in CUDALucas -r * 10 times with underclock. No errors at stock either (which is 780 OC / 1900 memory). The card's temp crawled to 65C, which signals that it is now properly fed (and still nicely cool; there's huge room for OC, which I am not interested in. I am interested in relative silence).
So it was the PSU, but there's still life in it. It seems that one of the two ports has been burnt.
I will go for a Genefer test for a couple of days now. And will probably stay with it, because GPU-to-72 is starting to tread water, not much fun.
Thanks, Mike, you've been very helpful! You hit the nail on the head.
P.S. ...Nope, Genefer still failed at stock. Will get another PSU, then.
Serge Batalov
I am not sure if the RMA Gigabyte people will listen kindly to an argument as above ("works fine for anything but some obscure (for them) applications"). What could be my options?
I completely understand what you're saying, but there's no reason to start writing lists of what did and didn't work for you when returning the card; if it doesn't work even with you setting it to a 570's "stock" settings, then just return it and say that things kept crashing on it. You're not lying; we all know that Genefer will work on a 570, and it's not as if it's "unfair" of you to run Genefer on it. I'm sure others who have problems with a certain game don't have problems returning their cards, or indeed bother saying "Well, it crashes with Portal 2 but not on Crysis" or whatever. I say send it back, say "It crashes with certain things", and see what they say; the worst thing that can happen is that you get your current card back (well, I suppose you'll have to pay postage...). It also won't be the Gigabyte RMA people, it'll just be Amazon, and then it's theirs to deal with; unless you've got a history of sending things back to them with no good reason, they'll probably at least give you another card.
You definitely shouldn't have to buy another PSU to get a 570 to work (unless you know that your PSU is faulty, of course); 700W is definitely more than enough if that's your only graphics card in the system.
Incidentally, I'd uninstall all software from previous cards and just use what came with this one; I don't think it's likely that anything is clashing (and presumably more than one set of software shouldn't be affecting anything at the same time, anyway), but it'll eliminate another possible source of problems.
Michael Goetz Volunteer moderator Project administrator
 Send message
Joined: 21 Jan 10 Posts: 14011 ID: 53948 Credit: 433,230,186 RAC: 922,880
                               
|
P.S. ...Nope, genefer still failed at stock. Will get another PSU, then.
Serge Batalov
Any chance of putting the 570 into the computer that holds the 560Ti, to see if it works in there?
Something's not right -- either the GPU, the motherboard, or the power supply -- and it would be nice to know which one. If the other computer is stable, the 570 is the easiest piece to swap in there for a test. If the 570 works in that machine, at least you know the GPU doesn't need to be RMA'd.
____________
My lucky number is 75898524288+1 | |
|
|
It occurred to me not to be lazy, and indeed I swapped the cards in the morning.
It was the card -- the bugs go with the card, not with the PSU or the environment.
(The other computer runs 64-bit Linux, so I can't even underclock there. The card stubbornly produces exactly the same err=0.5 > 0.45 in this different environment, which is, if anything, only better. Back in the original computer, the 560Ti 448 happily picked up the BOINC queue and, last I checked, never fouled up.)
I am returning the 570 and getting another from TigerDirect. It will be a different batch than Amazon's, and I can wait for the new one to arrive before mailing the old one back (unlike with an RMA or replacement).
Thanks, guys! It was educational. As Tolstoy said, "Happy families are all alike; every unhappy family is unhappy in its own way." That is, when everything works there's no way to be prepared for all the ways things can go wrong in a new build. :-)
P.S. For Amazon, just as you suggested, I didn't embellish or rationalize; I just wrote: "defective; runs for a while, then produces artifacts under load." | |
|
|
Followup: the new card came in the mail. No errors.
I don't have a good BOINC binary (it's OpenSUSE 12.1), and even when I built it myself, the uploaded binary (1.07; CUDA 3.2) didn't run because of some dependencies... so I'll post the residue here, if that's alright? I'm pretty sure this WU is already done and validated.
Could you help me out and confirm that this residue is correct? (No credit requested.)
372886^524288+1 is a probable composite. (RES=bbe7b88c2c01efe3) (2921111 digits) (err = 0.0859) (time = 4:33:39)
This is with my own build of genefercuda 1.061, sm20, CUDA 4.2.
| |
|
Michael Goetz Volunteer moderator Project administrator
 Send message
Joined: 21 Jan 10 Posts: 14011 ID: 53948 Credit: 433,230,186 RAC: 922,880
                               
|
Serge,
I can't see the residual until the wingman verifies it. Keep an eye on that WU and give me a shout when it completes and I'll check it for you at that time.
Although it's not a guarantee, completing the WU usually means the GPU worked correctly.
Of course, the easiest thing to do is just let it run a few more Genefer WUs and see what happens, but certainly it appears as if this card is functioning much better than the previous one.
You could also run the same WU through a second time and see if you get the same residual. A bad computation will be erratic, so you'd likely get a different residual, or more likely an error.
____________
My lucky number is 75898524288+1 | |
|
axnVolunteer developer Send message
Joined: 29 Dec 07 Posts: 285 ID: 16874 Credit: 28,027,106 RAC: 0
            
|
You could run the latest prime thru it :) | |
|
Michael Goetz Volunteer moderator Project administrator
 Send message
Joined: 21 Jan 10 Posts: 14011 ID: 53948 Credit: 433,230,186 RAC: 922,880
                               
|
You could run the latest prime thru it :)
Yeah, but running new WUs through it does productive work, assuming the card is working correctly. At this point, I'd assume it is working correctly.
Also, you could run it manually with the -t parameter, and it will run a bunch of self-tests on successively larger numbers. That test runs calculations on specific numbers, at increasing n, for which the correct residuals are known.
But since it's more likely than not that the card is operational, I'd use it on something productive. At worst, the short WUs are, well, short, so not much is lost.
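For reference, a manual self-test invocation would look something like this. I'm using the Windows app's file name as it appears later in this thread; adjust the name and path for your version and platform, and note that the exact option spelling may vary between builds:
cd \programdata\boinc\projects\www.primegrid.com
primegrid_genefer_2_3_0_0_1.07_windows_intelx86__cuda32_13.exe -t --device 0
Each self-test computes a number whose correct residual is already known, so any mismatch points at the hardware rather than at a particular WU.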
____________
My lucky number is 75898524288+1 | |
|
Michael Goetz Volunteer moderator Project administrator
 Send message
Joined: 21 Jan 10 Posts: 14011 ID: 53948 Credit: 433,230,186 RAC: 922,880
                               
|
Could you help me out and confirm that this residue is correct? (No credit requested.)
372886^524288+1 is a probable composite. (RES=bbe7b88c2c01efe3) (2921111 digits) (err = 0.0859) (time = 4:33:39)
That IS the correct residual.
____________
My lucky number is 75898524288+1 | |
|
|
I have a machine with a fresh install of Lubuntu 12.04 (Precise) and BOINC 6.10.58. Tasks were running okay until I tried GeneferCUDA; I trashed a handful of WUs before I noticed a problem (sorry 'bout that). All would error out immediately. I tracked it down to the fact that GeneferCUDA for Linux is a 32-bit executable, and by default now (with 12.04) only 64-bit libraries are installed on 64-bit machines. I had to do this:
sudo apt-get install ia32-libs-multiarch
Now, two WUs (2x570) on this new machine are running okay :-)
If this had been an upgrade from an older version of the OS, this probably wouldn't have happened (the 32-bit libs would already have been there), but of course I don't know for sure.
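In case it helps anyone else diagnose the same thing, you can confirm that the binary really is 32-bit before installing anything (path illustrative; run from the BOINC data directory):
file projects/www.primegrid.com/primegrid_genefer*
Output containing "ELF 32-bit LSB executable" on a 64-bit system means the 32-bit compatibility libraries are required.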
--Gary | |
|
Michael Goetz Volunteer moderator Project administrator
 Send message
Joined: 21 Jan 10 Posts: 14011 ID: 53948 Credit: 433,230,186 RAC: 922,880
                               
|
And until very recently you would have had exactly the same problem with any LLR project, as those were 32-bit apps too.
For what it's worth, at least on Windows on my computer, the 32-bit build of GeneferCUDA was faster than the 64-bit build.
____________
My lucky number is 75898524288+1 | |
|
|
Is there any way to block computers like this from receiving any new GFN work?
http://www.primegrid.com/show_host_detail.php?hostid=236684
Users like this are very frustrating for every serious user.
____________
| |
|
Michael Goetz Volunteer moderator Project administrator
 Send message
Joined: 21 Jan 10 Posts: 14011 ID: 53948 Credit: 433,230,186 RAC: 922,880
                               
|
Is there any way to block computers like this from receiving any new GFN work?
http://www.primegrid.com/show_host_detail.php?hostid=236684
Users like this are very frustrating for every serious user.
The short answer is yes, things can be done, but the cure is likely to cause worse problems than it might solve.
That particular host isn't really a problem as far as other users are concerned. It's generating errors and returning them promptly, so the effect on others is minimal. Hosts like that are a problem for the server and the admins, but not so much for users.
It's the computers that go MIA or abort their WUs or otherwise error after 2 or 3 weeks that are the biggest problem to users. That's what causes the big delays.
____________
My lucky number is 75898524288+1 | |
|
|
I think it is causing problems for other users.
Since the units are held up for a few hours or days, the system generates a lot more units than needed, so the overall time for all these units to validate increases.
If fewer units were generated, validation would be faster.
____________
| |
|
|
Is there a testing program available that will detect double precision or memory issues, like the ones GeneferCUDA hits, without wrecking someone else's scores?
I am becoming a bit frustrated with days and days of wasted GPU time.
No doubt my wingmen/women will be frustrated with me for having errored out on them.
MaxErr exceeded... This sounds to me like a rounding issue somewhere in the code.
This is on a GTX 295, which ran MilkyWay and other projects fine for a long time.
Something faster and more modern is on the near horizon...
I found an answer in this thread:
"Ok, I can answer my own question. It might be useful for others too:
start->run->cmd
cd\programdata\boinc\projects\www.primegrid.com
primegrid_genefer_2_3_0_0_1.07_windows_intelx86__cuda32_13.exe -boinc -q "215902^524288+1" --device 0
If it fails, run "type stderr.txt" to see the info. I guess that's better than logging computational errors."
Thank you for sharing that! | |
|
Michael Goetz Volunteer moderator Project administrator
 Send message
Joined: 21 Jan 10 Posts: 14011 ID: 53948 Credit: 433,230,186 RAC: 922,880
                               
|
MaxErr exceeded... This sounds to me like a rounding issue somewhere in the code.
In fact, that's designed just for that purpose. Not to catch a coding error, but to detect when the number being tested is too large. That's what determines the "b limit" you might read about in other threads.
In practice, it's also good at detecting computational errors. These tend to occur because the hardware is running too fast and/or too hot. Generally, lowering the GPU clocks, especially the memory clock, will solve the problem. Improving cooling can also solve it; sometimes it's as simple as raising the GPU fan speed. I'm told that ducting cold air-conditioning airflow straight into the computer chassis does wonders for allowing higher GPU speeds. :)
This problem is very common with GPUs that are overclocked. This includes GPUs that came overclocked from the factory. With some models, in particular the GTX 550Ti, even stock clocks are too high and you typically need to lower the memory clock below the stock settings to get reliable operation.
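To make that concrete, here's a minimal sketch in the spirit of (but not copied from) Genefer's actual check: after every multiplication, each element of the result should sit very close to an integer, and its distance from the nearest integer is the round-off error. The 0.45 threshold below matches the err=0.5 > 0.45 messages quoted earlier in this thread.
#include <cmath>
#include <cstddef>
#include <cstdio>
#include <cstdlib>
/* Sketch only -- not Genefer's source. x[] holds the output of one
   multiplication step; every entry should be very close to an integer. */
static void check_round_off(const double *x, std::size_t n)
{
    double maxErr = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        double err = std::fabs(x[i] - std::nearbyint(x[i]));
        if (err > maxErr) maxErr = err;
    }
    /* Either precision is exhausted (b too large) or the hardware miscomputed. */
    if (maxErr > 0.45) {
        std::fprintf(stderr, "maxErr exceeded: %.4f > 0.45\n", maxErr);
        std::exit(EXIT_FAILURE); /* BOINC then records a computation error */
    }
}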
This is on a GTX 295, which ran MilkyWay and other projects fine for a long time.
Something faster and more modern is on the near horizon...
GeneferCUDA is rather unique in that it uses the double-precision math hardware (very few apps, and probably no games, use DP) and also in that it runs at nearly 100% efficiency. The combination of the two is likely to play a role in causing this problem.
Also note that in some applications some errors may be acceptable and go undetected. In sieves, for example, if a factor is missed, it will not be detected. This is considered to be acceptable (although obviously not desirable) because if a factor is missed it will be checked eventually by LLR. That's why sieves don't require double checking by a wingman.
GeneferCUDA, on the other hand, performs sanity checks at every step of the calculation and usually (but not always) will detect computational errors and abort the processing immediately, rather than letting it run to the end only to eventually be invalidated on the server when it doesn't match the wingman's results.
As for more modern hardware, if you're upgrading specifically to run GeneferCUDA, please be aware that the 600 series GPUs are slower with this app than are the older 500 series GPUs. A GTX 680 is about 20% slower than a GTX 580.
For games, and probably most other applications, the GTX 680 is faster. Probably a lot faster in some situations. (I did hear that the 600 series cards are also slower at sieving, but I'll let someone else address the details there. It might be only one sieve at which it's slow. I don't know all the details.)
____________
My lucky number is 75898524288+1 | |
|
|
Thank you Michael,
In fact, that's designed just for that purpose. Not to catch a coding error, but to detect when the number being tested is too large. That's what determines the "b limit" you might read about in other threads.
I'll have to do that.
Right now I'm running only one task on the GTX 295; temperatures are down 13 degrees and the memory clock is lowered by 10%. Four hours to finish; it has been running for 25 hours.
GeneferCUDA, on the other hand, performs sanity checks at every step of the calculation and usually (but not always) will detect computational errors and abort the processing immediately, rather than letting it run to the end only to eventually be invalidated on the server when it doesn't match the wingman's results.
Might I suggest adding an extra safety net that would redo the calculation that led to the erroneous result in more steps, with smaller operands in each, so that the error can be 'worked around'? Or repeating it as-is to see whether the error recurs, which shouldn't happen if it's a thermal or memory-induced issue, which seems random in nature?
That would avoid throwing away dozens of hours of CPU/GPU time.
As for more modern hardware, if you're upgrading specifically to run GeneferCUDA, please be aware that the 600 series GPUs are slower with this app than are the older 500 series GPUs. A GTX 680 is about 20% slower than a GTX 580.
Thank you for that suggestion!
Is anyone using Tesla cards? | |
|
Michael Goetz Volunteer moderator Project administrator
 Send message
Joined: 21 Jan 10 Posts: 14011 ID: 53948 Credit: 433,230,186 RAC: 922,880
                               
|
Might I suggest adding an extra safety net that would redo the calculation that led to the erroneous result in more steps, with smaller operands in each, so that the error can be 'worked around'?
Although the program's maxErr checks are designed to catch loss of precision errors, what you see happening is a hardware failure. Using smaller elements (and larger FFTs) won't fix that.
Or repeating it as-is to see whether the error recurs, which shouldn't happen if it's a thermal or memory-induced issue, which seems random in nature?
I considered this, and decided it was probably a bad idea:
Assume the task takes 100 hours, and at 95 hours it detects an error. Right now, it aborts the processing and those 95 hours are wasted. That sucks.
What if you restart from the latest checkpoint file? We don't know exactly when the error occurred, so the checkpoint file might not contain valid data. You could continue running to the end and then fail the validation. That might seem like a good idea at the 95 hour mark, but it would not be smart earlier in the calculation. At 5 hours, you would be doing another 95 hours of crunching that might be based on a bad checkpoint file. Better to give up and start again.
What if you restart from the beginning? That's effectively the same thing -- from your perspective -- as declaring an error and starting at the beginning on a brand new WU. In a general sense, that's better for your computer in case the WU itself is at fault. (That doesn't generally happen here, but does at some other projects such as CPDN).
I couldn't come up with a scenario where it made sense to try to continue. Well, that's not entirely true. There are some failures where the problem is likely transient and may clear itself up; the program is designed to wait up to an hour, in the hope that the problem goes away, before deciding to give up and error out. One example is the "no GPU" error that can be caused by the 295 Windows driver. That problem may go away when the screen comes out of sleep mode, or, if it was caused by RDP being used to remotely control the computer, when the RDP session ends and the program can use the GPU again.
But computation errors are hardware errors, and it's better to not try to continue processing. The more reliable course of action is to try to resolve the source of the problem, which usually involves lowering clock speeds.
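As a rough illustration of those two policies (a sketch, not the real Genefer source; gpu_available() is a hypothetical stand-in for whatever probe the program actually uses):
#include <chrono>
#include <thread>
enum class Failure { ComputationError, GpuUnavailable };
/* Hypothetical probe; in reality this would query the CUDA runtime. */
static bool gpu_available() { return false; }
/* Returns true if processing may resume from the last checkpoint. */
static bool try_to_recover(Failure f)
{
    using namespace std::chrono;
    if (f == Failure::GpuUnavailable) {
        /* Transient case: sleeping screen, RDP session, etc.
           Poll for up to an hour before giving up. */
        const auto deadline = steady_clock::now() + hours(1);
        while (steady_clock::now() < deadline) {
            std::this_thread::sleep_for(seconds(30));
            if (gpu_available()) return true;
        }
    }
    /* A computation error never resumes: the last checkpoint
       may already contain corrupted data. */
    return false;
}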
As for more modern hardware, if you're upgrading specifically to run GeneferCUDA, please be aware that the 600 series GPUs are slower with this app than are the older 500 series GPUs. A GTX 680 is about 20% slower than a GTX 580.
Thank you for that suggestion!
Is anyone using Tesla cards?
Tesla cards are built for high reliability in compute-critical applications.
They're also VERY expensive. Nevertheless, some people have tried running on them -- I suspect, in most cases, by renting them on AWS. This led us to discover that...
They're not very fast. Their clock speeds aren't all that high, and they are out-performed by the much less expensive consumer GeForce cards.
____________
My lucky number is 75898524288+1 | |
|
|
Thanks for the elaborate answer, much appreciated.
Although the program's maxErr checks are designed to catch loss of precision errors, what you see happening is a hardware failure. Using smaller elements (and larger FFTs) won't fix that.
I understand that there can be two kinds of failure: one that occurs during the calculation (a shader or other subsystem problem) and one caused by corrupted arguments to the calculation (memory issues). The memory corruption is not linked to a specific moment, so you cannot determine when the detected error actually occurred. OK, point is clear. Thanks.
Tesla cards are built for high reliability in compute-critical applications.
They're also VERY expensive. Nevertheless, some people have tried running on them -- I suspect, in most cases, by renting them on AWS. This led us to discover that...
They're not very fast. Their clock speeds aren't all that high, and they are out-performed by the much less expensive consumer GeForce cards.
If they're built for high reliability in compute-critical applications, they would seem ideally suited to GeneferCUDA. I would think that if I have to downclock a consumer card anyway, and it may still crash out at 99% of the processing, an ECC Tesla, despite the cost, might be the more sensible solution (if money were no issue). I'll read up on the specs for the 580s, thanks.
| |
|
Michael Goetz Volunteer moderator Project administrator
 Send message
Joined: 21 Jan 10 Posts: 14011 ID: 53948 Credit: 433,230,186 RAC: 922,880
                               
|
If they're built for high reliability in compute-critical applications, they would seem ideally suited to GeneferCUDA. I would think that if I have to downclock a consumer card anyway, and it may still crash out at 99% of the processing, an ECC Tesla, despite the cost, might be the more sensible solution (if money were no issue). I'll read up on the specs for the 580s, thanks.
The Tesla cards are about $2000. You can get a non-overclocked GTX 580 for less than a quarter of the cost; it will run faster, and if you don't overclock it you should have no problems.
We also do not know whether the problem we see would be corrected by ECC memory. For example, if the problem is in the memory controller and not the memory itself (or in the power regulators for the memory), having ECC RAM won't help. We're not even sure the problem is in the memory, except for the fact that lowering the memory clock seems to be the most effective way of fixing the problem. On the other hand, one would think the problem is related to the double precision arithmetic hardware since that's what's different about the Genefer program.
____________
My lucky number is 75898524288+1 | |
|
|
Just found this at:
http://blog.accelereyes.com/blog/2012/04/26/benchmarking-kepler-gtx-680/
For double precision, as expected, the C2070 is well ahead of the pack. The most interesting snippet here is that the GTX 680 finishes dead last compared to its predecessors. At about 1/10th of its single-precision performance, the 680 is about twice as slow as the 580, which settles in at ~1/5th of its single-precision performance.
Due to a different FP64 architecture apparently.
We're not even sure the problem is in the memory, except for the fact that lowering the memory clock seems to be the most effective way of fixing the problem. On the other hand, one would think the problem is related to the double precision arithmetic hardware since that's what's different about the Genefer program.
I have been looking for other reports of rounding issues related to Nvidia hardware but have not been able to find any yet. It still seems odd to me that Teslas would be used in the DOE's TITAN if they were that unreliable. | |
|
Michael Goetz Volunteer moderator Project administrator
 Send message
Joined: 21 Jan 10 Posts: 14011 ID: 53948 Credit: 433,230,186 RAC: 922,880
                               
|
I have been looking for other reports of rounding issues related to Nvidia hardware but have not been able to find any yet. It still seems odd to me that Teslas would be used in the DOE's TITAN if they were that unreliable.
I don't believe I said that Teslas were unreliable. I was just commenting that ECC memory isn't necessarily the solution to the problem. If anything, I would expect Teslas to be more reliable because their clock rates are lower.
It might not be a coincidence that we're finding that GeForce GPUs running the double precision GeneferCUDA or llrCUDA programs have reliability problems at faster clock rates and the Tesla GPUs, which are designed for higher reliability, run at lower clock rates.
____________
My lucky number is 75898524288+1 | |
|
|
By FP64 performance, GTX 580 ~= Quadro 4000, link. | |
|
Michael Goetz Volunteer moderator Project administrator
 Send message
Joined: 21 Jan 10 Posts: 14011 ID: 53948 Credit: 433,230,186 RAC: 922,880
                               
|
By FP64 performance, GTX 580 ~= Quadro 4000, link.
If you can, it's best to actually run the cards with GeneferCUDA, either to see the actual run-time estimates or to run the benchmarks. That gives you an extremely accurate gauge of real performance with the software you're running, as opposed to a benchmark that is unlikely to test exactly the same instruction mix.
EDIT: a 580 can do the current short WUs in about 9 hours, more or less.
On paper, for example, the Tesla GPUs should be better than the GeForce GPUs because they have more DP ALUs and are therefore faster at double-precision math. However, their lower clock rates slow down everything else, so they end up slower than the GeForce cards even though they're supposed to be better at DP.
____________
My lucky number is 75898524288+1 | |
|
|
GPU=Quadro 4000, CC=2.0, Clock=950 MHz, # of MP=8
Estimated total run time for 4626^1048576+1 is 12:00:16
4626^1048576+1 is complete. (3843247 digits) (err = 0.0000) (time = 11:59:12)
Estimated total run time for 17174^1048576+1 is 13:52:13
17174^1048576+1 is complete. (4440585 digits) (err = 0.0002) (time = 13:51:12)
Estimated total run time for 30730^1048576+1 is 14:46:02
30730^1048576+1 is complete. (4705551 digits) (err = 0.0009) (time = 14:40:36)
=> Quadro 4000 ~= GTX 560 | |
|
|
Michael has said it: for this purpose, the affordable Tesla cards are incredibly bad value for money; at best, with the right tweaking, they might be as good at GFN as a GTX 5xx. There's no need to get a Tesla card unless you're very rich. If you just want to spend some savings on a GFN machine for the new year, get an adequate i5 system and shove two 580s in there, in SLI.
If you can afford it, then fair enough, forget the GTX cards and go for these:
http://www.anandtech.com/show/6446/nvidia-launches-tesla-k20-k20x-gk110-arrives-at-last
http://www.fudzilla.com/home/item/29452-nvidia-officially-launches-tesla-k20-and-k20x-gpus
Of course, the Tesla family was never cheap, so although there is no official price, estimates put the K20 at around US $3,199 and the more powerful K20X somewhere above that.
They'd probably be all right at GFN. | |
|
|
@Michael:
I don't believe I said that Teslas were unreliable
You are correct. In my line of thinking, I was reasoning that a Tesla should be reliable if the DOE uses it, not implying that you said they were unreliable.
...forget the GTX cards and go for these:
http://www.anandtech.com/show/6446/nvidia-launches-tesla-k20-k20x-gk110-arrives-at-last
I'll sleep on it :-) | |
|
|
If these errors occur in general with software running double precision on CUDA cards at factory speeds, it could be a concern for other projects and users as well.
I have raised the issue with Nvidia. I'll keep you posted. | |
|
Dave  Send message
Joined: 13 Feb 12 Posts: 3208 ID: 130544 Credit: 2,286,061,901 RAC: 773,689
                           
|
It's the type of work it's doing, not just the fact that it's DP... | |
|
|
"Error while computing" after 33 hours of crunching on a GT 430 :(
I have found another cause of errors. It is... VLC media player.
Could anyone tell me why Genefer is not able to restart from a checkpoint after an error?
Is it possible to create an app_info specially for Genefer, to leave some GPU power for normal computer use? | |
|
Michael Goetz Volunteer moderator Project administrator
 Send message
Joined: 21 Jan 10 Posts: 14011 ID: 53948 Credit: 433,230,186 RAC: 922,880
                               
|
"Error while computing" after 33 hours of crunching on a GT 430 :(
I have found another cause of errors. It is... VLC media player.
Could anyone tell me why Genefer is not able to restart from a checkpoint after an error?
Genefer can. BOINC cannot.
Genefer checkpoints and can restart from a checkpoint after an error. However, once BOINC detects an error, it aborts the task; that's under the control of the BOINC client.
The types of errors that Genefer can detect generally mean something has gone wrong with the calculations, and in that scenario you can't be sure whether the data in the last checkpoint is good or whether it was corrupted by the math error, which might have happened a while ago. If it restarted from the checkpoint, you might end up wasting 55 hours (the full run) rather than just 33.
The other type of error is usually something that crashes Genefer completely and that's beyond Genefer's ability to catch and restart. BOINC will detect that it stopped running and abort the task. Ironically, those are usually the types of errors where the calculation could be restarted.
Is it possible to create app_info specially for genefer to leave some GPU power for normal computer use?
No, but if what you're trying to accomplish is reducing the screen lag Genefer causes, try setting the block size to 6 (it defaults to 7). That only reduces screen lag; it won't do anything to prevent the type of error you experienced.
If another program is run that uses enough video memory (or other GPU resources) such that Genefer can't run, Genefer is going to crash.
____________
My lucky number is 75898524288+1 | |
|
|
"Error while computing" after 33 hours of crunching on a GT 430 :(
I have found another cause of errors. It is... VLC media player.
Could anyone tell me why Genefer is not able to restart from a checkpoint after an error?
Is it possible to create an app_info specially for Genefer, to leave some GPU power for normal computer use?
Since I watched a lot of videos while my GPU crunched Genefer tasks, and none of my Genefer tasks ever got an error, I cannot confirm your theory ;-).
But maybe it's related to Michael's statement, "If another program is run that uses enough video memory (or other GPU resources) such that Genefer can't run, Genefer is going to crash," since the GT 430 isn't that powerful.
____________
| |
|
|
Thanks for the advice.
I found http://boinc.berkeley.edu/wiki/Client_configuration and there are some options there that achieve my goal.
Now I'm trying cc_config.xml with these lines:
<exclusive_gpu_app>vlc.exe</exclusive_gpu_app>
plus
<exclusive_app>vlc.exe</exclusive_app>
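For completeness, here's roughly what the whole file looks like; this is my sketch of the minimal structure based on the wiki page above, and the file goes in the BOINC data directory (with exclusive_app set, the exclusive_gpu_app line is redundant, but harmless):
<cc_config>
  <options>
    <!-- suspend GPU computing while vlc.exe is running -->
    <exclusive_gpu_app>vlc.exe</exclusive_gpu_app>
    <!-- suspend all BOINC computing while vlc.exe is running -->
    <exclusive_app>vlc.exe</exclusive_app>
  </options>
</cc_config>
After saving it, tell the client to re-read the config files (or restart BOINC).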
Looks like it works. | |
|