My dual gtx570 rig is erroring all these out with the same message.
http://www.primegrid.com/result.php?resultid=344918424
stock clocks, dedicated cpus.
driver 285.62
http://www.primegrid.com/show_host_detail.php?hostid=209729
--------
My GTX 260 has completed one successfully with the 260.99 drivers.
any ideas?
____________
|
|
|
rroonnaalldd Volunteer developer Volunteer tester
Joined: 3 Jul 09 Posts: 1213 ID: 42893 Credit: 34,634,263 RAC: 0
|
SLI enabled?
____________
Best wishes. Knowledge is power. by jjwhalen
|
|
|
Michael Goetz Volunteer moderator Project administrator
Joined: 21 Jan 10 Posts: 14045 ID: 53948 Credit: 482,681,090 RAC: 578,366
|
My dual gtx570 rig is erroring all these out with the same message.
http://www.primegrid.com/result.php?resultid=344918424
stock clocks, dedicated cpus.
driver 285.62
http://www.primegrid.com/show_host_detail.php?hostid=209729
--------
My GTX 260 has completed one successfully with the 260.99 drivers.
any ideas?
First of all, I don't think the problem is with the drivers. The symptom the cards are exhibiting, quasi-random maxErr exceeded errors, is typical of what you see when a GPU is overclocked too far. In this case it's bad enough that the error usually (but not always) occurs during initialization, which takes under a second, yet it doesn't happen in the exact same spot each time. Furthermore, you had one WU that made it past initialization but errored within the first 10 seconds of the main processing, and one that completed and might be valid (the wingman hasn't returned it yet). So most of the failures happen in the first second -- but one WU ran for nearly an hour. Strange.
There are errors on both of your GPUs, so it's not merely a single bad card.
I do have a few ideas. To start with, errors like that are not typical on 570s running at stock speeds. So either you have two bad cards, or something is adversely affecting their operation. A combination of temperature and clock speed has been implicated in this kind of error, so I would recommend trying the following steps to see if it helps:
1) Raise the fan speed to 100% to lower the temperature.
2) Lower the memory clock speed to the minimum it can be set to.
3) Lower the shader clocks.
Beyond that, my only other guess would be that something on the computer is interfering with the CUDA processing. That seems unlikely because you're successfully computing GCW sieves: while clock speed and temperature do affect Genefer differently than other CUDA programs (Genefer exercises different circuitry), anything that interferes with CUDA itself should affect all CUDA programs. So the three ideas above are all I have. Your problem is rather perplexing.
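In case it helps to see what "maxErr exceeded" actually measures: after each FFT-based multiplication the digits come back as doubles that should be almost exactly integers, and the program aborts if they have drifted too far. Here's a rough host-side sketch of the idea -- the names and the 0.45 threshold are illustrative only, not the actual Genefer code:

// Illustrative round-off check after an FFT-based multiply.
// Function names and the 0.45 threshold are made up for this sketch.
#include <cmath>
#include <cstddef>
#include <cstdio>
#include <cstdlib>

// Largest distance of any limb from the nearest integer.
double max_roundoff_error(const double* limbs, std::size_t n)
{
    double maxErr = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        double nearest = std::floor(limbs[i] + 0.5);
        double err = std::fabs(limbs[i] - nearest);
        if (err > maxErr) maxErr = err;
    }
    return maxErr;
}

void check_iteration(const double* limbs, std::size_t n, long iter)
{
    double maxErr = max_roundoff_error(limbs, n);
    if (maxErr > 0.45) {
        // A marginal GPU (heat, overclock, flaky memory) pushes the error
        // over the limit, so the task aborts instead of silently
        // producing a wrong residue.
        std::fprintf(stderr, "Iter %ld: maxErr exceeded (%f)\n", iter, maxErr);
        std::exit(1);
    }
}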
____________
My lucky number is 75898524288+1 |
|
|
|
Not running sli.
I will check those suggestions out when I get home.
This box is a headless dedicated cruncher, so I don't use the GPUs for anything but crunching, unless I need to use VNC to manage it remotely.
I haven't really checked the temperature ranges on these cards; what should the normal operating range be? The ambient air temperature is 50-60 F right now, so I would be surprised if they run hot, but hey, you never know.
Looks like that one WU validated too...
____________
|
|
|
Scott Brown Volunteer moderator Project administrator Volunteer tester Project scientist
Joined: 17 Oct 05 Posts: 2420 ID: 1178 Credit: 20,183,747,281 RAC: 23,151,563
|
I haven't really checked the temperature ranges on these cards; what should the normal operating range be? The ambient air temperature is 50-60 F right now, so I would be surprised if they run hot, but hey, you never know.
I'd try to keep them below 80C (under 70C is ideal if you can). All the problems I have had with GFN errors have been tied to heat issues (none of which occurred below 80C on any card), to memory clocks being too high (only on the GTX 550s), or to the shader clock being at or over the edge of stability (literally, dropping down to the next shader step, 54 units lower, fixed the issue).
____________
141941*2^4299438-1 is prime!
|
|
|
Michael Goetz Volunteer moderator Project administrator
Joined: 21 Jan 10 Posts: 14045 ID: 53948 Credit: 482,681,090 RAC: 578,366
|
I'm not sure what a 570 should run at. My 460 (in a hot environment -- it's next to the radiator!) runs at 80-85C; the 470 runs hotter, at 90-95C. Those temps are on the high side, but my card is stable at stock clocks and with moderate overclocking.
One other thing you can try: pull 1 card out (the "bottom" card) and see if the errors go away. If they do, then it's either heat from the bottom card heating up the top card, or the power supply is having trouble running them both at full blast.
Actually, you don't need to pull the card to test that: just tell boinc to only use one card.
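For reference, one way to do that without touching the hardware is a cc_config.xml in the BOINC data directory. This is only a sketch -- the exact option name depends on the BOINC client version, so check your client's documentation:

<!-- cc_config.xml, placed in the BOINC data directory.
     Illustrative only: the option name varies by client version. -->
<cc_config>
  <options>
    <!-- Hide CUDA device 1 from the client so only device 0 gets work. -->
    <ignore_cuda_dev>1</ignore_cuda_dev>
  </options>
</cc_config>

Restart the client (or use the manager's read-config-file command) so it picks the file up.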
____________
My lucky number is 75898524288+1 |
|
|
|
I've been having problems running GFN workunits on two of my GPUs. I've had a few work and get validated, but most of them fail in one of these two ways:
Here's the most common way it fails (after just a second or two):
Command line: projects/www.primegrid.com/primegrid_genefer_1.06_windows_intelx86__cuda32_13.exe -boinc -q hidden --device 0
Priority change succeeded.
GeneferCUDA-boinc.cu(107) : cudaSafeCall() Runtime API error : no CUDA-capable device is detected.
23:28:20 (776): called boinc_finish
================================
Here's another way it fails (after 10 minutes):
Sieve started: 125869764000000000 <= p < 125869773000000000
Thread 0 starting
Detected GPU 0: Device Emulation (CPU)
Detected compute capability: 9999.9999
Detected emulator! We can't use that!
Sleeping 10 minutes to give the server a break.
13:27:06 (3296): called boinc_finish
One machine is a GTS-450 with no customization at all (no overclocking of CPU or GPU).
Other machine has a GTX-465 at stock 706 MHz speed (it is stable at 825, but I turned off overclocking to try to run these WUs).
I went back to PPS Sieve (which never fails on these machines).
Any ideas?
...Doug
____________
|
|
|
Michael Goetz Volunteer moderator Project administrator
Joined: 21 Jan 10 Posts: 14045 ID: 53948 Credit: 482,681,090 RAC: 578,366
|
Priority change succeeded.
GeneferCUDA-boinc.cu(107) : cudaSafeCall() Runtime API error : no CUDA-capable device is detected.
23:28:20 (776): called boinc_finish
This problem I understand. (It's nice to have a problem I can understand!!!) And it's fixable, at least in theory. I know, roughly, what's causing the problem. The only question is whether you're willing and able to fix it.
BOINC says "I have a GPU; gimme GPU work to do!". But when the WU gets there, the driver's saying "You have no available GPU".
The reason it's not available is that something else in the computer is using the GPU in an exclusive mode, which prevents CUDA apps from using it. The CUDA APIs report that there's no usable GPU in this situation.
What could be doing this? The two most likely suspects are full-screen games (or, perhaps, video) and Remote Desktop software. If you use the Windows Remote Desktop software, that WILL prevent CUDA apps from running (and will crash any CUDA apps that are in progress). Please note that this ALSO includes fast user switching, where you are logged into two Windows user sessions at once. Doing that also precludes using CUDA. Note that most (or all) VNC remote desktop programs do NOT interfere with CUDA.
There's an unlimited number of other programs that could also use the GPU in exclusive mode. If none of the above applies to your computer, then you need to look for something else that's precluding CUDA from running.
================================
Here's another way it fails (after 10 minutes):
Sieve started: 125869764000000000 <= p < 125869773000000000
Thread 0 starting
Detected GPU 0: Device Emulation (CPU)
Detected compute capability: 9999.9999
Detected emulator! We can't use that!
Sleeping 10 minutes to give the server a break.
13:27:06 (3296): called boinc_finish
That's basically the same problem -- something else is grabbing the GPU in exclusive mode, so the CUDA app sees no available GPU. In this case, the drivers tried to use the built-in GPU emulator instead. I'm not sure what causes the emulator to run versus reporting no GPU, but they both have the same root cause: something else has exclusive control of the GPU.
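For what it's worth, here is roughly what the detection logic in a CUDA app sees in each of those two cases. This is a minimal runtime-API sketch, not the actual application code:

// Minimal CUDA device-detection sketch (not the actual PrimeGrid app code).
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    int count = 0;
    cudaError_t rc = cudaGetDeviceCount(&count);
    if (rc != cudaSuccess || count == 0) {
        // Case 1: "no CUDA-capable device is detected" -- something
        // (RDP, another exclusive-mode user) has taken the GPU away.
        std::fprintf(stderr, "Runtime API error : %s\n", cudaGetErrorString(rc));
        return 1;
    }
    for (int i = 0; i < count; ++i) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        std::printf("Detected GPU %d: %s (compute %d.%d)\n",
                    i, prop.name, prop.major, prop.minor);
        // Case 2: older toolkits expose a fake "Device Emulation (CPU)"
        // entry with compute capability 9999.9999 when no real device
        // is usable, so the app has to reject it explicitly.
        if (prop.major == 9999) {
            std::fprintf(stderr, "Detected emulator! We can't use that!\n");
            return 1;
        }
    }
    return 0;
}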
I went back to PPS Sieve (which never fails on these machines).
That's the part I don't understand. That shouldn't work either, unless you were RDP'd into these boxes only while trying to run GeneferCUDA. RDP will kill all CUDA apps.
____________
My lucky number is 75898524288+1 |
|
|
Michael Goetz Volunteer moderator Project administrator
Joined: 21 Jan 10 Posts: 14045 ID: 53948 Credit: 482,681,090 RAC: 578,366
|
I just looked at your 465 machine.
Most of the GeneferCUDA WUs are getting that error -- but some are working.
If you look at your PPS Sieve WUs, you'll see that the same errors are happening with the sieve software.
Looking at your 450 machine, I see the same pattern -- both successes and errors with both GeneferCUDA and PPS Sieve.
If I had to guess, you're using Windows RDP to log into the computer -- perhaps to check to see if this is working. That will prevent all CUDA apps from operating -- and it would be intermittent, since the moment you stop looking at the computer, CUDA will start working again.
____________
My lucky number is 75898524288+1 |
|
|
|
I haven't really checked the temperature ranges on these cards; what should the normal operating range be? The ambient air temperature is 50-60 F right now, so I would be surprised if they run hot, but hey, you never know.
I'd try to keep them below 80C (under 70C is ideal if you can). All the problems I have had with GFN errors have been tied to heat issues (none of which occurred below 80C on any card), to memory clocks being too high (only on the GTX 550s), or to the shader clock being at or over the edge of stability (literally, dropping down to the next shader step, 54 units lower, fixed the issue).
I assume that I can lower the shader clocks in the NVIDIA control panel? Like I said, these cards are stock as they shipped; I haven't mucked with them at all, so I am not familiar with the tools available. :)
____________
|
|
|
Scott Brown Volunteer moderator Project administrator Volunteer tester Project scientist
Joined: 17 Oct 05 Posts: 2420 ID: 1178 Credit: 20,183,747,281 RAC: 23,151,563
|
I haven't really checked the temperature ranges on these cards; what should the normal operating range be? The ambient air temperature is 50-60 F right now, so I would be surprised if they run hot, but hey, you never know.
I'd try to keep them below 80C (under 70C is ideal if you can). All the problems I have had with GFN errors have been tied to heat issues (none of which occurred below 80C on any card), to memory clocks being too high (only on the GTX 550s), or to the shader clock being at or over the edge of stability (literally, dropping down to the next shader step, 54 units lower, fixed the issue).
I assume that I can lower the shader clocks in the NVIDIA control panel? Like I said, these cards are stock as they shipped; I haven't mucked with them at all, so I am not familiar with the tools available. :)
MSI Afterburner works well for changing clocks and fan speeds on NVidia cards.
____________
141941*2^4299438-1 is prime!
|
|
|
|
I just looked at your 465 machine.
Most of the GeneferCUDA WUs are getting that error -- but some are working.
If you look at your PPS Sieve WUs, you'll see that the same errors are happening with the sieve software.
Looking at your 450 machine, I see the same pattern -- both successes and errors with both GeneferCUDA and PPS Sieve.
If I had to guess, you're using Windows RDP to log into the computer -- perhaps to check to see if this is working. That will prevent all CUDA apps from operating -- and it would be intermittent, since the moment you stop looking at the computer, CUDA will start working again.
Nope, no incoming RDP on these boxes. I do use one of them for outgoing RDP to a few other machines. The GPU-based WUs on machines that have CUDA-capable cards do stop running while I'm connected, but with the fix to BOINC a while back, they start running again as soon as I disconnect. Although it used to be an issue, I haven't had trouble in this regard for a while now.
It looks like my problems started when I upgraded to the latest NVIDIA beta drivers (295.51); since I went back to 290.53 on both machines, things are much better. I have returned the overclock on the video cards to its previously stable settings. The 465 has finished one Genefer and successfully started another. The other box was happily crunching away before my son started his current StarCraft 2 session. :)
We'll see how they both do overnight.
   ...Doug
____________
|
|
|
|
http://www.primegrid.com/result.php?resultid=345269049
Resuming b^262144+1 from a checkpoint (2257085 iterations left)
Terminating because BOINC client requested that we should quit.
____________
|
|
|
|
Both machines have completed several Genefer WUs since reinstalling version 290.53 and I'm satisfied that the incompatibility was in the 295.51 driver.
[Edit] It is interesting that because I have not restarted BOINC after reverting to the previous video drivers, it is still reporting that I am using 295.51 on the completed WUs.
   ...Doug
____________
|
|
|
Michael Goetz Volunteer moderator Project administrator
Joined: 21 Jan 10 Posts: 14045 ID: 53948 Credit: 482,681,090 RAC: 578,366
|
Resuming b^262144+1 from a checkpoint (2257085 iterations left)
Terminating because BOINC client requested that we should quit.
Since the result completed successfully, this appears to be just an instance of missing stderr output. Everything after that (i.e., the resumption of the WU) is probably just missing from the log.
The WU itself is probably fine. You'll know once the wingman returns a result.
____________
My lucky number is 75898524288+1 |
|
|
Michael Goetz Volunteer moderator Project administrator
Joined: 21 Jan 10 Posts: 14045 ID: 53948 Credit: 482,681,090 RAC: 578,366
|
Both machines have completed several Genefer WUs since reinstalling version 290.53 and I'm satisfied that the incompatibility was in the 295.51 driver.
[Edit] It is interesting that because I have not restarted BOINC after reverting to the previous video drivers, it is still reporting that I am using 295.51 on the completed WUs.
   ...Doug
First part: That's awesome, both because you're back in business, and also because now we know there might be a problem with 295.51. It might be interesting if you reinstall 295.51 to see if the problem returns. (It might have just been something that went bad during installation.)
Regarding the Edit: yeah, the drivers would take effect immediately, but Boinc probably only does that check when it starts up. If you could hot-swap CPUs, it probably wouldn't detect that either. ;-)
____________
My lucky number is 75898524288+1 |
|
|
|
If you could hot-swap CPUs, it probably wouldn't detect that either. ;-)
I just read up on the $25 PC called the Raspberry Pi. That's as close to a hot-swap CPU as I've ever seen.
PS - let us know how your next hot swap goes :) |
|
|
samuel7 Volunteer tester
Joined: 1 May 09 Posts: 89 ID: 39425 Credit: 257,425,010 RAC: 0
|
I had the same problems with the 295.51 driver on my GTX 480 host. I tried installing the beta driver twice, the latter a clean install. Two PPS Sieve tasks completed successfully but the following ones failed with the Device Emulation message. Downgraded the driver and it's working fine again. I'm doing GFN524288 tasks on PRPNet at the moment.
____________
|
|
|
rroonnaalldd Volunteer developer Volunteer tester
Joined: 3 Jul 09 Posts: 1213 ID: 42893 Credit: 34,634,263 RAC: 0
|
Regarding the Edit: yeah, the drivers would take effect immediately, but Boinc probably only does that check when it starts up. If you could hot-swap CPUs, it probably wouldn't detect that either. ;-)
Yep, Boinc needs a restart to detect such things.
I saw the same after migrating suspended virtual machines between different hosts with different CPUs.
____________
Best wishes. Knowledge is power. by jjwhalen
|
|
|
|
OK, this is what I learned last night.
Temps look pretty good: when running DiRT WUs both cards run at 97% load, and with the fans at 95% and 85% for the inside and outside GPUs, the temps stay at 50-60C.
When running Genefer, the outside card runs the WU at 99% load, with temps the same as for the DiRT WUs. Most if not all of the WUs that get assigned to the inside card fail in the first 2 seconds. The ones that do run for a longer period also run at 99% load and similar temps before failing.
Turning down the shader clock had no effect, and I forgot to try lowering the memory clock speed.
When the Genefer WU was running on the outside card, the inside card was able to run DiRT WUs, so I don't think I have a PSU power issue with the cards under load.
Things left to try? I can try the memory clock tonight; do you think it's just a bad card?
Just noticed I am seeing a couple of different errors from last night's test units:
- exit code 1030 (0x406)
The environment is incorrect. (0xa) - exit code 10 (0xa)
- exit code 2006 (0x7d6)
____________
|
|
|
Michael Goetz Volunteer moderator Project administrator
Joined: 21 Jan 10 Posts: 14045 ID: 53948 Credit: 482,681,090 RAC: 578,366
|
Things left to try? I can try the memory clock tonight; do you think it's just a bad card?
I don't think it's a bad card, and especially not 2 bad cards, but you never know. Can you run llrCUDA?
Did you try running just one GPU, with the other GPU idle?
- exit code 1030 (0x406)
The environment is incorrect. (0xa) - exit code 10 (0xa)
- exit code 2006 (0x7d6)
That's just Boinc interpreting my error code as if it was one of its own error codes. That's the same error as all the others. It might be that I'm not supposed to use that range of numbers for error codes. This is the first time I put together a program for Boinc, so I probably did a few things wrong.
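To make the mapping concrete: the app hands its own number to boinc_finish(), and whatever displays the result later just looks that number up as if it were an OS error code. A rough sketch -- the constant below is illustrative, not the value the real program uses:

// Illustrative only: the exit-code value is made up, not GeneferCUDA's real one.
#include "boinc_api.h"   // BOINC application API (boinc_finish)

#define MAXERR_EXIT_CODE 1030   // 1030 decimal == 0x406

int abort_task()
{
    // boinc_finish() reports the status to the client and exits the app.
    // The client/server only see the raw number and render it with the
    // matching OS error text, which is why 10 (0xa) shows up as
    // "The environment is incorrect" even though the app meant something else.
    return boinc_finish(MAXERR_EXIT_CODE);
}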
____________
My lucky number is 75898524288+1 |
|
|
|
You mean the PPS LLR CUDA? I haven't tried that. I thought it was listed as beta; are there WUs available?
I don't think it's two bad cards either; I think it's just the one. The other card completed a 2nd WU for validation. I didn't want to trash a bunch of WUs, so I allow new work, it gets about 10 tasks, and then I set NNT; the good card starts crunching and the bad one errors out the other 9.
I will try disabling one card then the other via the cc_config file tonight to verify the single cards.
____________
|
|
|
Michael Goetz Volunteer moderator Project administrator
Joined: 21 Jan 10 Posts: 14045 ID: 53948 Credit: 482,681,090 RAC: 578,366
|
You mean the PPS LLR CUDA? I haven't tried that. I thought it was listed as beta; are there WUs available?
I don't think it's two bad cards either; I think it's just the one. The other card completed a 2nd WU for validation. I didn't want to trash a bunch of WUs, so I allow new work, it gets about 10 tasks, and then I set NNT; the good card starts crunching and the bad one errors out the other 9.
I will try disabling one card then the other via the cc_config file tonight to verify the single cards.
Yes, I meant the PPS LLR CUDA. You can run it manually from the command line. That's much better for testing since you can control all the variables. You never know what a Boinc server is going to send you. Or not send you.
____________
My lucky number is 75898524288+1 |
|
|
|
ok, lowering the memory clocks didn't make a difference, and I tried running one card at a time by ignoring one card or the other in the cc_config file. Didn't make a difference either. Not sure what the conditions are that have allowed a few tasks to complete.
Where can I get the PPS_LLR cuda app for testing as you suggested?
____________
|
|
|
Honza Volunteer moderator Volunteer tester Project scientist
Joined: 15 Aug 05 Posts: 1963 ID: 352 Credit: 6,420,431,564 RAC: 2,590,392
|
Where can I get the PPS_LLR cuda app for testing as you suggested?
The llrCUDA testing thread is a good place to start.
You can also read the llrCUDA thread at mersenneforum for recent news and development.
____________
My stats |
|
|
Michael Goetz Volunteer moderator Project administrator
Joined: 21 Jan 10 Posts: 14045 ID: 53948 Credit: 482,681,090 RAC: 578,366
|
ok, lowering the memory clocks didn't make a difference, and I tried running one card at a time by ignoring one card or the other in the cc_config file. Didn't make a difference either. Not sure what the conditions are that have allowed a few tasks to complete.
Where can I get the PPS_LLR cuda app for testing as you suggested?
I'm running out of ideas. Running LLR CUDA is the only other test I can think of, and that's merely to determine whether a similar program also fails.
Other CUDA WUs do work, right? You can run PPS Sieve and CW Sieve on those GPUs?
One other thing that you can try is to un-install and re-install the video driver. This doesn't sound like a driver problem, however.
____________
My lucky number is 75898524288+1 |
|
|
|
ok, lowering the memory clocks didn't make a difference, and I tried running one card at a time by ignoring one card or the other in the cc_config file. Didn't make a difference either. Not sure what the conditions are that have allowed a few tasks to complete.
Where can I get the PPS_LLR cuda app for testing as you suggested?
I'm running out of ideas. Running LLR CUDA is the only other test I can think of, and that's merely to determine whether a similar program also fails.
Other CUDA WUs do work, right? You can run PPS Sieve and CW Sieve on those GPUs?
One other thing that you can try is to un-install and re-install the video driver. This doesn't sound like a driver problem, however.
Yes, all of the sieve WUs run like a champ
____________
|
|
|
|
Where can I get the PPS_LLR cuda app for testing as you suggested?
The llrCUDA testing thread is a good place to start.
You can also read the llrCUDA thread at mersenneforum for recent news and development.
Where else can I get the Windows 64-bit llrCUDA app? The link http://pgllr.mine.nu/software/LLR/
from the llrCUDA testing thread only results in a connection timeout.
____________
|
|
|
Michael Goetz Volunteer moderator Project administrator
Joined: 21 Jan 10 Posts: 14045 ID: 53948 Credit: 482,681,090 RAC: 578,366
|
Yes, all of the sieve WUs run like a champ
Looking at this computer, I don't see any valid results at all except for one Genefer. You've run sieves on it in the past? Perhaps something has changed since then. It would be helpful if you ran a few sieve CUDA WUs on there to help diagnose the problem.
____________
My lucky number is 75898524288+1 |
|
|
|
Yes, all of the sieve WUs run like a champ
Looking at this computer, I don't see any valid results at all except for one Genefer. You've run sieves on it in the past? Perhaps something has changed since then. It would be helpful if you ran a few sieve CUDA WUs on there to help diagnose the problem.
OK, it's got a batch of sieves running through it now.
____________
|
|
|
|
Where else can I get the Windows 64-bit llrCUDA app? The link http://pgllr.mine.nu/software/LLR/ from the llrCUDA testing thread only results in a connection timeout.
llrCUDA itself is included in the latest PRPNet package: http://uwin.mine.nu/PRPNet/
The libraries (different from GeneferCUDA's) can be downloaded here: http://www.bc-team.org/downloads.php?view=detail&df_id=39
Regards Odi
____________
|
|
|
|
Where else can I get the Windows 64-bit llrCUDA app? The link http://pgllr.mine.nu/software/LLR/ from the llrCUDA testing thread only results in a connection timeout.
llrCUDA itself is included in the latest PRPNet package: http://uwin.mine.nu/PRPNet/
The libraries (different from GeneferCUDA's) can be downloaded here: http://www.bc-team.org/downloads.php?view=detail&df_id=39
Regards Odi
It looks like only the CPU version of LLR is in the latest PRPNet package? Or, because I downloaded prpclient-5.0.5-windows-gpu.7z, is only the GPU stuff in there?
Thanks for the libraries link, I have those now.
____________
|
|
|
|
It looks like only the CPU version of LLR is in the latest PRPNet package? Or, because I downloaded prpclient-5.0.5-windows-gpu.7z, is only the GPU stuff in there?
Yes, take a look in the normal package; that's where llrcuda is located. I also don't understand why llrcuda is only in that package.
Regards Odi
____________
|
|
|
|
I was able to fix errors on one of my computers by setting the power saving mode from "adaptive" to performance in the NVIDIA 3D settings in the NVIDIA control panel. Not sure this will fix your issue, but it's something to try. |
|
|
|
I was able to fix errors on one of my computers by setting the power saving mode from "adaptive" to performance in the NVIDIA 3D settings in the NVIDIA control panel. Not sure this will fix your issue, but it's something to try.
Yes, I have that already set, but it never hurts to double check.
____________
|
|
|
|
OK, some new(?) observations.
I got PRPNet installed and working to try to test llrCUDA.
I don't know if this also applies to Genefer, or if it's specific to PRPNet rather than BOINC, but here is what I noticed:
1. When I left the gpuaffinity setting at the default of -1 and launched two clients, one of my two GPUs (GPU 1) never seemed to get any activity. When I specified either 0 or 1 in the affinity settings, I could drive llrCUDA to either GPU. Without the affinity settings I couldn't get the WUs to run. Does the app correctly try to select GPUs?
2. At startup there was always what appeared to be zero GPU load but max CPU load for a period of time. During this time I would see messages on the console like this:
Iter: 304380/4526670, ERROR: ROUND OFF (0.4999994338) > 0.4
Continuing from last save file.
Resuming Proth prime test of 121*2^4526664+1 at bit 304258 [6.72%]
Using complex irrational base DWT, FFT length = 524288, a = 3
It would do that for a while, and then the WU would begin crunching on the GPU at 97%-99% load, with a regularly updating status line like this:
121*2^4526664+1, bit: 1040000 / 4526670 [22.97%]. Time per bit: 1.371 ms.
3. GPU 0 seems to be crunching now, GPU 1 seems to hit this error :
[2012-02-12 20:02:39 PST] 121: GetNextIncompleteWorkUnit found 121*2^4526664+1
Resuming Proth prime test of 121*2^4526664+1 at bit 304258 [6.72%]
Using complex irrational base DWT, FFT length = 524288, a = 3
c:/Users/XCyber/Desktop/PrimeTest/llrcuda_win64/gwpnumi.cu(1832) : cudaSafeCall(
) Runtime API error : unknown error.
[2012-02-12 20:04:41 PST] 121: Could not open file [lresults.txt] for reading.
Assuming user stopped with ^C
[2012-02-12 20:04:41 PST] 121: ProcessWorkUnit did not complete 121*2^4526664+1
[2012-02-12 20:04:41 PST] 121: AddWorkUnitToList for 121*2^4526664+1
[2012-02-12 20:04:41 PST] 121: IsWorkUnitCompleted for 121*2^4526664+1. Returni
ng false (not completed, main)
[2012-02-12 20:04:41 PST] Total Time: 0:02:02 Total Work Units: 0 Special Res
ults Found: 0
[2012-02-12 20:04:41 PST] PPSEhigh: Returning work. currentworkunits=0, complet
edworkunits=0, quitOption=2
[2012-02-12 20:04:41 PST] 121: Returning work. currentworkunits=1, completedwor
kunits=0, quitOption=2
[2012-02-12 20:04:41 PST] 27: Returning work. currentworkunits=0, completedwork
units=0, quitOption=2
[2012-02-12 20:04:41 PST] Client shutdown complete
[2012-02-12 20:04:41 PST] 121: AddWorkUnitToList for 121*2^4526664+1
[2012-02-12 20:04:41 PST] 121: IsWorkUnitCompleted for 121*2^4526664+1. Returni
ng false (not completed, main)
[2012-02-12 20:04:41 PST] 121: DeleteWorkUnit for 121*2^4526664+1
Is there some further debugging info somewhere?
____________
|
|
|
Michael Goetz Volunteer moderator Project administrator
Joined: 21 Jan 10 Posts: 14045 ID: 53948 Credit: 482,681,090 RAC: 578,366
|
1. when I left the gpuaffinity setting to the default -1 and launched two clients one of my two GPUs (gpu 1) never seemed to get any activity. When I specified either 0 or 1 in the affinity settings I could drive the llrCUDA to either GPU. Without using the affinity settings I couldn't get the WUs to run. Does the app correctly try and select GPUs?
I actually never used llrCUDA before today, so I'm not that familiar with it. The affinity setting of 0 or 1 will select specific GPUs. Otherwise, I think you should comment that line out. (Not 100% sure of that.)
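Under the hood, a gpuaffinity-style setting usually just gets passed to cudaSetDevice() before any work is launched. A minimal sketch of that pattern -- the function below is illustrative, not llrCUDA's actual source:

// Illustrative device selection; not llrCUDA's actual source.
#include <cstdio>
#include <cuda_runtime.h>

bool select_gpu(int affinity)   // affinity: -1 = default, 0/1/... = specific GPU
{
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess || count == 0) {
        std::fprintf(stderr, "No usable CUDA device\n");
        return false;
    }
    // With -1 the runtime default (device 0) is used, which is why two
    // clients without explicit settings can pile onto the same GPU.
    int device = (affinity < 0) ? 0 : affinity;
    if (device >= count) {
        std::fprintf(stderr, "Requested GPU %d, but only %d present\n", device, count);
        return false;
    }
    return cudaSetDevice(device) == cudaSuccess;
}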
2. At startup there was always what appeared to be zero GPU load but max CPU load for a period of time. During this time I would see messages on the console like this:
That appears to be normal operation for many of our primality programs. The initialization is noticeable on GPU programs because you can see that the GPU isn't doing anything while the initialization is occurring.
Regular LLR does the same thing; it's just not so obvious to the casual observer. Genefer did the same thing also, until I rewrote that part to use the GPU & FFTs to do the initialization.
Iter: 304380/4526670, ERROR: ROUND OFF (0.4999994338) > 0.4
Continuing from last save file.
Yeah, that's bad. That's similar to the problems it's having with GeneferCUDA, which is what I was expecting since the programs do similar things.
3. GPU 0 seems to be crunching now, GPU 1 seems to hit this error :
c:/Users/XCyber/Desktop/PrimeTest/llrcuda_win64/gwpnumi.cu(1832) : cudaSafeCall(
) Runtime API error : unknown error.
Not much more to say, I'm afraid. Neither llrCUDA nor GeneferCUDA runs well on your GPUs, for reasons unknown.
I am out of ideas. The cause is likely to be either bad hardware (not necessarily the GPUs) or some software that's interfering with the GPUs. However, both of those seem unlikely given that GCW Sieve works fine.
Perhaps re-installing the drivers might help, but I'm not very optimistic.
____________
My lucky number is 75898524288+1 |
|
|
|
Hmm, frustrating indeed.
Well, I was thinking about pulling the GPUs and the power supply and putting them in a different box to see if that makes a difference. I guess I will try a driver sweep and a clean install of the latest drivers first; if that doesn't work, I'll go for the hardware move.
Do you think that two 570s and an i7 860 at max load will work on a 700W power supply? Right now there are a GTX 260 and an ATI 5850 in the i7 box running just fine. If that works, I won't have to move the power supplies; I can just swap the graphics cards.
____________
|
|
|
Michael Goetz Volunteer moderator Project administrator
Joined: 21 Jan 10 Posts: 14045 ID: 53948 Credit: 482,681,090 RAC: 578,366
|
Hmm, frustrating indeed.
Well, I was thinking about pulling the GPUs and the power supply and putting them in a different box to see if that makes a difference. I guess I will try a driver sweep and a clean install of the latest drivers first; if that doesn't work, I'll go for the hardware move.
Do you think that two 570s and an i7 860 at max load will work on a 700W power supply? Right now there are a GTX 260 and an ATI 5850 in the i7 box running just fine. If that works, I won't have to move the power supplies; I can just swap the graphics cards.
It depends on the power supply. Not all 700W power supplies are created equal. A high-quality one might be OK, but it will be running close to capacity. A cheap PS will probably have trouble.
I like to use a 600 or 700 watt PS for SINGLE GPU configs to ensure plenty of headroom and stable operation. With two big GPUs I would probably use 800 to 1000 watts.
Moving the GPUs to another box is a good way to see if the problem is with the GPU or with the box.
BTW, does geneferCUDA run on the GTX 260? Now that the server config has been corrected, you should be able to get GFN WUs on the 260.
____________
My lucky number is 75898524288+1 |
|
|
|
Yep, genefer does run on the gtx260 that host is here:
http://www.primegrid.com/show_host_detail.php?hostid=162940
____________
|
|
|
|
1. when I left the gpuaffinity setting to the default -1 and launched two clients one of my two GPUs (gpu 1) never seemed to get any activity. When I specified either 0 or 1 in the affinity settings I could drive the llrCUDA to either GPU. Without using the affinity settings I couldn't get the WUs to run. Does the app correctly try and select GPUs?
This is correct. If you're running more than one gpu simultaneously you must specify it in the .ini files.
2. At startup there was always what appeared to be zero GPU load but max CPU load for a period of time. It would do that for a while, and then the WU would begin crunching on the GPU at 97%-99% load, with a regularly updating status line.
This is also correct. Depending on the port it's running, this version of llrcuda needs up to half a CPU core in addition to the GPU.
3. GPU 0 seems to be crunching now, GPU 1 seems to hit this error : gwpnumi.cu(1832) : cudaSafeCall() Runtime API error : unknown error. Is there some further debugging info somewhere?
Strange; I read about a similar error in the llrcuda thread at mersenneforum some weeks ago: http://www.mersenneforum.org/showthread.php?t=14608&page=5
But it looks like an old error.
Maybe it depends on the old llrcuda build that is shipped with PRPNet; if I remember correctly, it's version 0.60. There is newer source available, but I don't know if anyone has compiled the new source as a Windows executable.
Regards Odi
____________
|
|
|
|
Strange; I read about a similar error in the llrcuda thread at mersenneforum some weeks ago: http://www.mersenneforum.org/showthread.php?t=14608&page=5
But it looks like an old error.
Maybe it depends on the old llrcuda build that is shipped with PRPNet; if I remember correctly, it's version 0.60. There is newer source available, but I don't know if anyone has compiled the new source as a Windows executable.
Regards Odi
Yes, it is version 0.60. I also had to install an old version of the CUDA dev kit to get the older lib files; the .40.17 files didn't work, it needed the .32.16 versions.
____________
|
|
|
rroonnaalldd Volunteer developer Volunteer tester
Joined: 3 Jul 09 Posts: 1213 ID: 42893 Credit: 34,634,263 RAC: 0
|
Yes, it is version 0.60. I also had to install an old version of the CUDA dev kit to get the older lib files; the .40.17 files didn't work, it needed the .32.16 versions.
Too much work to get the needed libs. Both are available via PG...
Win32
http://www.primegrid.com/download/cudart32_32_16.dll
http://www.primegrid.com/download/cufft32_32_16.dll
Linux32
http://www.primegrid.com/download/libcudart.so.3.32bit
http://www.primegrid.com/download/libcufft.so.3.32bit
CUDA 3.2 was the fastest SDK in all tests, therefore all apps are linked against this older version. Another reason for using this version is the smaller file size of the needed libraries.
While the file size of 'libcudart.so.3/4' (~300KB) is not critical, there is a much bigger difference between 'libcufft.so.3' (~28MB) and 'libcufft.so.4' (~85MB).
____________
Best wishes. Knowledge is power. by jjwhalen
|
|
|
|
Do you think that 2 570s and an i7 860 at max load will work on a 700w power supply?
No.
At the very minimum 800 but I would get a high end 1000.
____________
|
|
|
|
Yes, it is version 0.60. I also had to install an old version of the CUDA dev kit to get the older lib files; the .40.17 files didn't work, it needed the .32.16 versions.
Too much work to get the needed libs. Both are available via PG...
Win32
http://www.primegrid.com/download/cudart32_32_16.dll
http://www.primegrid.com/download/cufft32_32_16.dll
The 0.60 Windows version of llrCUDA needs different libs (cudart64_32_16.dll and cufft64_32_16.dll). I've never seen them in the PG download folder. Until a few months ago they were only on Lennart's server, which has been gone for a while now.
Regards Odi
____________
|
|
|
Michael Goetz Volunteer moderator Project administrator
Joined: 21 Jan 10 Posts: 14045 ID: 53948 Credit: 482,681,090 RAC: 578,366
|
Yes, it is version 0.60. I also had to install an old version of the CUDA dev kit to get the older lib files; the .40.17 files didn't work, it needed the .32.16 versions.
Too much work to get the needed libs. Both are available via PG...
Win32
http://www.primegrid.com/download/cudart32_32_16.dll
http://www.primegrid.com/download/cufft32_32_16.dll
The 0.60 Windows version of llrCUDA needs different libs (cudart64_32_16.dll and cufft64_32_16.dll). I've never seen them in the PG download folder. Until a few months ago they were only on Lennart's server, which has been gone for a while now.
Regards Odi
I put the cuda and cufft v 3.2 libraries on my website. This includes both the 32 and 64 bit versions.
You may download it here.
The files in there are:
These are for GeneferCUDA:
cudart32_32_16.dll
cufft32_32_16.dll
It seems these are for llrCUDA:
cudart64_32_16.dll
cufft64_32_16.dll
____________
My lucky number is 75898524288+1 |
|
|
Honza Volunteer moderator Volunteer tester Project scientist
Joined: 15 Aug 05 Posts: 1963 ID: 352 Credit: 6,420,431,564 RAC: 2,590,392
|
Not sure if it was mentioned before...Tesla.
I browsed some of my recent WUs run on a GTX 580. For example, http://www.primegrid.com/workunit.php?wuid=247614280
This WU was originally finished by a GTX 580; a lot of errors followed (some from Teslas), and it was completed by my GTX 580 as wingman.
http://www.primegrid.com/show_host_detail.php?hostid=231342
NVIDIA Tesla C2050 (2687MB) under Linux.
Unfortunately, the Application details page for the host doesn't show correct numbers.
Anyway, this host with a Tesla is trashing Genefer WUs.
____________
My stats |
|
|
Crun-chi Volunteer tester
Joined: 25 Nov 09 Posts: 3250 ID: 50683 Credit: 152,646,050 RAC: 10,054
|
Report that to John!
The host can be stopped from receiving new tasks...
____________
92*10^1585996-1 NEAR-REPDIGIT PRIME :) :) :)
4 * 650^498101-1 CRUS PRIME
2022202116^131072+1 GENERALIZED FERMAT
Proud member of team Aggie The Pew. Go Aggie! |
|
|
Michael Goetz Volunteer moderator Project administrator
Joined: 21 Jan 10 Posts: 14045 ID: 53948 Credit: 482,681,090 RAC: 578,366
|
Not sure if it was mentioned before...Tesla.
I browsed some of my recent WUs run on a GTX 580. For example, http://www.primegrid.com/workunit.php?wuid=247614280
This WU was originally finished by a GTX 580; a lot of errors followed (some from Teslas), and it was completed by my GTX 580 as wingman.
http://www.primegrid.com/show_host_detail.php?hostid=231342
NVIDIA Tesla C2050 (2687MB) under Linux.
Unfortunately, the Application details page for the host doesn't show correct numbers.
Anyway, this host with a Tesla is trashing Genefer WUs.
Thanks for pointing that out.
I've seen another Tesla machine with similar results. That one purported to have 7 (!!!) GPUs.
Now that I've seen two such machines, I think there's a problem. The application isn't just failing; it's not even beginning to run. You don't see that usually.
I'm wondering if perhaps the CUDA 2.3 toolkit is somehow incompatible with those Tesla cards. But even then, you should see the startup banner. It's as if the cuda dll's aren't present -- but they're downloaded automatically by Boinc. Very strange.
Can't really figure this one out unless one of the people having a problem with a Tesla speaks up.
____________
My lucky number is 75898524288+1 |
|
|
Scott Brown Volunteer moderator Project administrator Volunteer tester Project scientist
Joined: 17 Oct 05 Posts: 2420 ID: 1178 Credit: 20,183,747,281 RAC: 23,151,563
|
Not sure if it was mentioned before...Tesla.
I browsed some of my recent WUs run on a GTX 580. For example, http://www.primegrid.com/workunit.php?wuid=247614280
This WU was originally finished by a GTX 580; a lot of errors followed (some from Teslas), and it was completed by my GTX 580 as wingman.
http://www.primegrid.com/show_host_detail.php?hostid=231342
NVIDIA Tesla C2050 (2687MB) under Linux.
Unfortunately, the Application details page for the host doesn't show correct numbers.
Anyway, this host with a Tesla is trashing Genefer WUs.
Thanks for pointing that out.
I've seen another Tesla machine with similar results. That one purported to have 7 (!!!) GPUs.
Now that I've seen two such machines, I think there's a problem. The application isn't just failing; it's not even beginning to run. You don't see that usually.
I'm wondering if perhaps the CUDA 2.3 toolkit is somehow incompatible with those Tesla cards. But even then, you should see the startup banner. It's as if the cuda dll's aren't present -- but they're downloaded automatically by Boinc. Very strange.
Can't really figure this one out unless one of the people having a problem with a Tesla speaks up.
Probably not a Tesla-specific problem. For example,
this host doesn't seem to have any problems with Genefer tasks.
BTW, the boxes that list between 4 and 8 Tesla GPUs are likely those with external GPGPU units attached. I wonder if that interface is more the issue than the cards themselves? Anyone with a QuadroPlex (the Quadro external box) able to see whether those work or crash like the Teslas?
____________
141941*2^4299438-1 is prime!
|
|
|
Michael Goetz Volunteer moderator Project administrator
Joined: 21 Jan 10 Posts: 14045 ID: 53948 Credit: 482,681,090 RAC: 578,366
|
Probably not a Tesla specific problem. For example,
this host doesn't seem to have any problems with Genefer tasks.
BTW, the boxes that list between 4 to 8 Tesla GPU are likely those with external GPGPU units attached to them. I wonder if that interface is more the issue than with the cards themselves? Anyone with a QuadroPlex (the Quadro external box) able to see if those work or crash like the Tesla's
That box has 4 GPUs, so presumably it's working with the external GPU enclosure.
Thanks again -- now I've seen a Tesla that IS working. I've been wondering all along how fast those $2000 GPUs would be with their faster floating point hardware.
The answer: they're slow. At least that one is. It's taking nearly two hours per WU. A GTX 460 does them in 90 minutes and a GTX 580 in 60 minutes.
I don't understand why they're so slow. The shader clock is pretty slow, so I guess that's more important than the speed of the DP hardware.
So much for my dream of running a bunch of Amazon dual-Tesla servers at spot prices. Looks like they're not very impressive.
____________
My lucky number is 75898524288+1 |
|
|
Scott Brown Volunteer moderator Project administrator Volunteer tester Project scientist
Joined: 17 Oct 05 Posts: 2420 ID: 1178 Credit: 20,183,747,281 RAC: 23,151,563
|
That box has 4 GPUs, so presumably it's working with the external GPU enclosure.
Thanks again -- now I've seen a Tesla that IS working. I've been wondering all along how fast those $2000 GPUs would be with their faster floating point hardware.
The answer: they're slow. At least that one is. It's taking nearly two hours per WU. A GTX 460 does them in 90 minutes and a GTX 580 in 60 minutes.
I don't understand why they're so slow. The shader clock is pretty slow, so I guess that's more important than the speed of the DP hardware.
So much for my dream of running a bunch of Amazon dual-Tesla servers at spot prices. Looks like they're not very impressive.
I think something is not quite right with those numbers on the Tesla. My brief access to some Quadro 4000s (a research box that we were stress testing for a week or so) showed them to be a bit faster than a GTX 460 despite having fewer shaders (256 vs. 336) and a slower shader clock (1110 vs. 1350 or more -- indeed, even faster than my overclocked 460 at 1730 MHz). Maybe the Teslas in that machine are running more than one thing at once (either multiple GFN per card or some sieves as well)?
____________
141941*2^4299438-1 is prime!
|
|
|
Michael Goetz Volunteer moderator Project administrator
Joined: 21 Jan 10 Posts: 14045 ID: 53948 Credit: 482,681,090 RAC: 578,366
|
Maybe, but they do in fact have a pretty slow clock. Not slow enough to explain the results, however.
____________
My lucky number is 75898524288+1 |
|
|
|
Michael Goetz, you always talk about a combination of factors that cause maxErr.
IMHO, the main cause of maxErr is overclocked memory.
As you know, my Zotac GTX 460 is factory overclocked to 810/2000.
And I had maxErr problems at those clocks.
Two weeks ago I decided to downclock to 750/1600, and I have checked 26 WUs since then without any errors.
Yesterday I decided to return the shader clock to its factory state (810 MHz) but leave the memory clock at 1600 MHz.
I know that 1 WU doesn't make a rule, but today I checked 113858^524288+1 in 4:12:31. The temperature was quite extreme, ~88 degrees, for all 4 hours.
The WU has finished, and I guess it finished successfully:
http://www.primegrid.com/result.php?resultid=352285118
I'm not sure I'll be able to check more WUs quickly (at most 1 WU per day), but if maxErr returns, I'll report it.
For now, I am staying at the 810/1600 clocks.
____________
|
|
|
Michael Goetz Volunteer moderator Project administrator
Joined: 21 Jan 10 Posts: 14045 ID: 53948 Credit: 482,681,090 RAC: 578,366
|
IMHO, the main cause of maxErr is overclocked memory.
That wouldn't surprise me at all, considering that underclocking the memory seems to be the best strategy for getting the GTX 550 Ti to work.
____________
My lucky number is 75898524288+1 |
|
|
|
My GTX 460 (MSI factory OCed to 725/1800/1500) is running at 2000 shaders now, up from the 1900 I'd mostly been running until 2 days ago. I went to 1960, then 1980, then 2000 after a few valid WUs at each step. I had to raise the voltage a little more (now at 1.050), but if I go to 2025 the voltage needs a considerably larger jump for that increment of shader increase, AND the temps go up at a higher rate. Memory is at 1900, which is only slightly above factory. I had 1 error, and that was when I raised memory to 1960; it went right back to 1900 and there hasn't been 1 error or invalid since. I'm going to stick with these settings for now. You also need to realize/remember that I have A/C pumping directly into the case and my temps are in the low to mid 40s C even at full utilization, which allows me higher clocks than most here. My AMD X6 1100T, OCed to 4.14 GHz, runs in the 20s C at 100% load as well.
So memory does seem to be the more sensitive setting. OCing the shaders while lowering the memory clock or leaving it at stock may work for most. But of course every GPU/system is different.
NM*
____________
Largest Primes to Date:
As Double Checker: SR5 109208*5^1816285+1 Dgts-1,269,534
As Initial Finder: SR5 243944*5^1258576-1 Dgts-879,713
|
|
|
|
I crunched half a dozen GFN 262144 tasks on a Core 2 Duo Mac without errors.
I was going to see how long the 500k tasks would take on that CPU, but they errored immediately. Here's the log:
Ter 28 Fev 23:29:22 2012 | PrimeGrid | Output file genefer_524288_70340_5_0 for task genefer_524288_70340_5 absent
I have no idea what that means. |
|
|
Michael Goetz Volunteer moderator Project administrator
Joined: 21 Jan 10 Posts: 14045 ID: 53948 Credit: 482,681,090 RAC: 578,366
|
I crunched half a dozen GFN 262144 tasks on a Core 2 Duo Mac without errors.
I was going to see how long the 500k tasks would take on that CPU, but they errored immediately. Here's the log:
Ter 28 Fev 23:29:22 2012 | PrimeGrid | Output file genefer_524288_70340_5_0 for task genefer_524288_70340_5 absent
I have no idea what that means.
It means it didn't work. That's BOINC telling you that Genefer didn't finish doing what it's supposed to do.
The (much) more interesting information is found in the stderr text you can see by following the link for the result:
<core_client_version>6.12.35</core_client_version>
<![CDATA[
<message>
process exited with code 2 (0x2, -254)
</message>
<stderr_txt>
This build of GeneferCUDA was not compiled with BOINC support.
</stderr_txt>
]]>
That's one serious problem there. The two WUs you have with that error are using version 1.01 of the Mac CPU client. All your prior WUs used the 1.00 client.
The 1.01 client was built incorrectly and can't possibly work. Looks like the wrong file got used by accident.
And to top it all off, the error message says GeneferCUDA when the program is actually GenefX64. Or at least it SHOULD be GenefX64; if the build options were wrong it actually could be a non-BOINC Mac version of GeneferCUDA. The lack of BOINC support would be detected before it got around to noticing there's no GPU it can use.
There is nothing you can do to fix this; it needs to be corrected on the server side.
I'll make sure the right people know about it.
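For context on how a wrong build produces exactly that message: standalone and BOINC builds of these apps typically differ only in a compile-time flag around the BOINC glue. The sketch below is illustrative (the flag name and structure are assumptions, not the actual Genefer source); a binary compiled without the flag can do little but print the complaint and exit, which matches the "exited with code 2" in the stderr above.

// Illustrative compile-time gating of BOINC support; not the actual
// Genefer/GenefX64 source (ENABLE_BOINC is an assumed flag name).
#ifdef ENABLE_BOINC
#include "boinc_api.h"
#endif
#include <cstdio>
#include <cstdlib>
#include <cstring>

int main(int argc, char** argv)
{
    bool under_boinc = (argc > 1 && std::strcmp(argv[1], "-boinc") == 0);
#ifdef ENABLE_BOINC
    if (under_boinc) boinc_init();   // register with the BOINC client
#else
    if (under_boinc) {
        // A build made without BOINC support can only refuse to run under it.
        std::fprintf(stderr, "This build was not compiled with BOINC support.\n");
        std::exit(2);
    }
#endif
    // ... the actual primality test would run here ...
    return 0;
}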
____________
My lucky number is 75898524288+1 |
|
|
|
Thanks Mike. Please post or PM me if you know of any update so that I can give it a new try. |
|
|
John Honorary cruncher
Joined: 21 Feb 06 Posts: 2875 ID: 2449 Credit: 2,681,934 RAC: 0
|
That's one serious problem there. The two WUs you have with that error are using version 1.01 of the Mac CPU client. All your prior WUs used the 1.00 client.
I can confirm, v1.00 worked and v1.01 does not.
btw, a sample run time using v1.00 at the new N for GenefX64 follows:
124490^524288+1 is a probable composite. (RES=9bf84708ecb86965) (2671318 digits) (err = 0.0117) (time = 40:16:08)
This is on Intel Core i5 @ 2.8 GHz with 4G RAM.
____________
|
|
|
|
Just to follow up here...
I finally got around to moving the 850W power supply and the two GTX 570s into the i7 box, and I pulled the 700W supply, the GTX 260, and the AMD 5850 over to the AM3 box... the 570s now appear to be crunching away at two Genefer world record tasks.
http://www.primegrid.com/show_host_detail.php?hostid=162940
The AMD box is happy with its new power supply and GPUs too. Go figure?!
____________
|
|
|
Michael Goetz Volunteer moderator Project administrator
Joined: 21 Jan 10 Posts: 14045 ID: 53948 Credit: 482,681,090 RAC: 578,366
|
The amd box is happy with its new power supply and gpus too. Go figure?!
Interesting. I could think of a lot of possible explanations, but figuring out what the actual cause was would take a lot of experimentation. Probably not worth it if everything is behaving now.
Good luck!
____________
My lucky number is 75898524288+1 |
|
|