Author |
Message |
Michael Goetz Volunteer moderator Project administrator
Joined: 21 Jan 10 Posts: 14037 ID: 53948 Credit: 477,161,398 RAC: 289,514
                               
|
The advanced Gerbicz error checking in LLR2 is both a blessing and a curse. The fantastic news is that most errors that occur in LLR -- which would normally cause a task to fail validation and completely waste the entire calculation -- are now corrected, and the task is able to complete.
The flip side of this is that your computer is completing valid tasks even though it's malfunctioning, so there's no easy way to know that you have a problem. The computer is making hardware mistakes, but we're correcting them and the problem is hidden from the computer's owner. If you never know the computer isn't working correctly, you can't fix it.
This was driven home this morning when Honza told me one of his teammates had seen a few of these Gerbicz errors, and I decided to look in the database and see how common they were.
I found that not only was Honza's teammate getting these errors, but so was Honza, and he didn't even know it.
Another person getting errors without knowing it? ME. Yes, my brand new Ryzen was having problems and I had absolutely no idea anything was wrong!
The results page now displays an unmistakable WARNING if a task had Gerbicz errors.
____________
My lucky number is 75898^524288+1 |
|
|
Bur Volunteer tester
Joined: 25 Feb 20 Posts: 515 ID: 1241833 Credit: 415,402,227 RAC: 20,308
                
|
Is it also noticeable by longer completion times on the tasks with errors?
____________
1281979 * 2^485014 + 1 is prime ... no further hits up to: n = 5,700,000 |
|
|
mackerel Volunteer tester
Joined: 2 Oct 08 Posts: 2652 ID: 29980 Credit: 570,442,335 RAC: 10,182
                              
|
Nice, I take it the "warning" will be in the status column? Fortunately I just skimmed through the results I have remaining on the system and didn't see any. Maybe add a "warning" category for the state filter so it is easier to see if there are any such tasks? |
|
|
stream Volunteer moderator Project administrator Volunteer developer Volunteer tester
Joined: 1 Mar 14 Posts: 1051 ID: 301928 Credit: 563,881,725 RAC: 1,288
                         
|
Is it also noticeable by longer completion times on the tasks with errors?
Yes, these tasks take longer than usual, but in many cases it's almost impossible to spot. Even under normal circumstances, runtimes will vary depending on system load. They may also depend on the 'k' being tested (different k may have different FFT sizes).
|
|
|
|
At what point should we start to be concerned? I have 8 valid tasks currently listed and a single warning.
At some point could we add timestamps to each line in the stderr file? It would make it much easier to tie errors to what else might have been happening on the pc in question. |
|
|
Michael Goetz Volunteer moderator Project administrator
Joined: 21 Jan 10 Posts: 14037 ID: 53948 Credit: 477,161,398 RAC: 289,514
                               
|
Nice, I take it the "warning" will be in the status column? Fortunately I just skimmed through the results I have remaining on the system and didn't see any. Maybe add a "warning" category for the state filter so it is easier to see if there are any such tasks?
Probably not. It’s too database intensive. But I’ll think about it. It’s something I would like to see too.
____________
My lucky number is 75898^524288+1 |
|
|
Michael Goetz Volunteer moderator Project administrator
Joined: 21 Jan 10 Posts: 14037 ID: 53948 Credit: 477,161,398 RAC: 289,514
                               
|
At what point should we start to be concerned? I have 8 valid tasks currently listed and a single warning.
At some point could we add timestamps to each line in the stderr file? It would make it much easier to tie errors to what else might have been happening on the pc in question.
Replace “warning” with “invalid task” and what answer do you get? If you suspect an outside event may have caused it, then just wait and see if more happen.
____________
My lucky number is 75898^524288+1 |
|
|
|
At what point should we start to be concerned? I have 8 valid tasks currently listed and a single warning.
At some point could we add timestamps to each line in the stderr file? It would make it much easier to tie errors to what else might have been happening on the pc in question.
Replace “warning” with “invalid task” and what answer do you get? If you suspect an outside event may have caused it, then just wait and see if more happen.
Well, I ask because I don't think I've ever had an actual invalid task with this processor, and I wasn't using the PC during the hours the task in question ran, so I'm not sure what might have caused the problem. Are the new checks more sensitive than before? |
|
|
Michael Goetz Volunteer moderator Project administrator
Joined: 21 Jan 10 Posts: 14037 ID: 53948 Credit: 477,161,398 RAC: 289,514
                               
|
The same warning message now appears if certain errors occur in PPS-Sieve tasks on AMD GPUs.
____________
My lucky number is 75898^524288+1 |
|
|
|
Hi, I got 2 error messages in my task results, which read "Error while computing WARNING!"
When I hover the mouse over the warning, it shows "Errors occurred and were corrected during this calculation. Your computer is not operating correctly. This is a hardware problem which you should fix." What does that mean? I have not changed any options or hardware, and I completed TRP tasks just 1 day ago. Any advice is appreciated.
http://www.primegrid.com/result.php?resultid=1169128644
http://www.primegrid.com/result.php?resultid=1167613948
____________
My Lucky Number is 9037*2^1301022+1
|
|
|
Rafael Volunteer tester
Joined: 22 Oct 14 Posts: 918 ID: 370496 Credit: 606,174,833 RAC: 593,414
                         
|
Hi, I got 2 error messages in my task results, which read "Error while computing WARNING!"
When I hover the mouse over the warning, it shows "Errors occurred and were corrected during this calculation. Your computer is not operating correctly. This is a hardware problem which you should fix." What does that mean? I have not changed any options or hardware, and I completed TRP tasks just 1 day ago. Any advice is appreciated.
http://www.primegrid.com/result.php?resultid=1169128644
http://www.primegrid.com/result.php?resultid=1167613948
Simply put, your computer is unstable and making mistakes while performing calculations.
As to why, we need a bit more info on your system. If you've done any overclocking, back it off a bit, your OC is unstable. If not, but you do have boost enabled, Ryzen CPUs are known to push the silicon a little bit too far out of the box and require manual intervention, so either disable it or control the clocks manually to prevent it from going ham. |
|
|
|
Hi, I got 2 error messages in my task results, which read "Error while computing WARNING!"
When I hover the mouse over the warning, it shows "Errors occurred and were corrected during this calculation. Your computer is not operating correctly. This is a hardware problem which you should fix." What does that mean? I have not changed any options or hardware, and I completed TRP tasks just 1 day ago. Any advice is appreciated.
http://www.primegrid.com/result.php?resultid=1169128644
http://www.primegrid.com/result.php?resultid=1167613948
Simply put, your computer is unstable and making mistakes while performing calculations.
As to why, we need a bit more info on your system. If you've done any overclocking, back it off a bit, your OC is unstable. If not, but you do have boost enabled, Ryzen CPUs are known to push the silicon a little bit too far out of the box and require manual intervention, so either disable it or control the clocks manually to prevent it from going ham.
Firstly, no OC and no boost; the BIOS is at all default values, and I haven't changed any values in the last few weeks. MB: MSI mag 570x with the latest BIOS installed. CPU: Ryzen 2700x with water cooling, avg temp 60-65 Celsius. The rig has 3 intake and 1 exhaust additional fans. No additional OC app like MSI Dragon is installed; only CPUID HW Monitor for temps. I will try to see what's going on.
____________
My Lucky Number is 9037*2^1301022+1
|
|
|
Rafael Volunteer tester
Joined: 22 Oct 14 Posts: 918 ID: 370496 Credit: 606,174,833 RAC: 593,414
                         
|
Firstly, no OC and no boost; the BIOS is at all default values, and I haven't changed any values in the last few weeks. MB: MSI mag 570x with the latest BIOS installed. CPU: Ryzen 2700x with water cooling, avg temp 60-65 Celsius. The rig has 3 intake and 1 exhaust additional fans. No additional OC app like MSI Dragon is installed; only CPUID HW Monitor for temps. I will try to see what's going on.
If all is default, then you do have boost enabled, as it's on by default. Go into your BIOS and try disabling it.
Also, what is your RAM and what settings is it actually running at? |
|
|
|
It can happen that a computer runs correctly without any hardware errors for a long time, and then after that starts producing errors "by itself". As an example, if dust is gradually accumulating inside it, the cooling becomes less and less efficient, and at a certain point the computer will become too hot and start producing errors.
LLR2 is smart enough to see that the numbers do not make sense anymore, and will go back to the latest saved partial result and re-start from there. Because of that, LLR2 will still find the correct result eventually.
When the computer has hardware errors once in a while with LLR2, it can also have errors once in a while with other software on it. So it may be unstable (applications or the entire OS may crash from time to time, for example). Therefore you should try to fix it.
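The "go back to the latest saved partial result" behaviour can be illustrated with a small sketch. This is not LLR2's code; it is a simplified Gerbicz-style check on plain repeated squaring, and the function name, the parameters `block` and `blocks_per_check`, and the `fault_at` hook (which fakes one hardware error) are all made up for illustration. The idea: keep a running product `d` of the value `u` at every block boundary. If every block was squared correctly, the new product must equal `a * d_prev^(2^block)`. If it does not, something went wrong since the last verified state, so roll back and redo that stretch.

```python
def gerbicz_checked_power(a, n, modulus, block=4, blocks_per_check=4, fault_at=None):
    """Return (a^(2^n) mod modulus, number_of_rollbacks).

    Illustrative only. n must be a multiple of block, and modulus is assumed
    to be prime with a not divisible by it. fault_at is a hypothetical test
    hook: the squaring with that 1-based index is corrupted once.
    """
    if n % block != 0:
        raise ValueError("n must be a multiple of block")
    u = a % modulus            # u holds a^(2^i); we start at i = 0
    d = u                      # product of u at every block boundary so far
    i = 0
    blocks_since_check = 0
    saved = (i, u, d)          # last state that passed the check
    rollbacks = 0
    while i < n:
        d_prev = d
        for _ in range(block):
            u = u * u % modulus
            i += 1
            if i == fault_at:
                u = (u + 1) % modulus   # simulated transient hardware error
                fault_at = None         # it only happens once
        d = d_prev * u % modulus
        blocks_since_check += 1
        if blocks_since_check == blocks_per_check or i == n:
            # Costs `block` extra squarings, so it is not done after every block.
            expected = a * pow(d_prev, 2 ** block, modulus) % modulus
            if d == expected:
                saved = (i, u, d)       # verified: this is the new restart point
            else:
                i, u, d = saved         # mismatch: redo from last verified state
                rollbacks += 1
            blocks_since_check = 0
    return u, rollbacks


if __name__ == "__main__":
    M = 2**61 - 1  # a Mersenne prime, chosen only to keep the demo small
    result, rollbacks = gerbicz_checked_power(3, 64, M, fault_at=21)
    print(result == pow(3, 2**64, M), rollbacks)
```

With the fake error injected, the final answer still matches the directly computed power, but the rollback counter is non-zero. That counter is the kind of thing the new WARNING on the results page is telling you about: the task finished correctly, yet the machine made a mistake along the way.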
/JeppeSN |
|
|
|
There is an opportunity suggested here to create a tool that is very sensitive to computer stability. During the run-up to the current challenge, I noticed these "Gerbicz errors" in one of the output logs on a Ryzen 3700X, while testing an offline WU, but only with certain task/thread combinations, notably 8 tasks of 2 threads. The problem is yet unresolved, but for now running 8 tasks/1 thread works, and also provides slightly higher throughput. So even though this PC is not overclocked, and can run Memtest all night with no errors, it has a problem! |
|
|
Michael Goetz Volunteer moderator Project administrator
Joined: 21 Jan 10 Posts: 14037 ID: 53948 Credit: 477,161,398 RAC: 289,514
                               
|
There is an opportunity suggested here to create a tool that is very sensitive to computer stability. During the run-up to the current challenge, I noticed these "Gerbicz errors" in one of the output logs on a Ryzen 3700X, while testing an offline WU, but only with certain task/thread combinations, notably 8 tasks of 2 threads. The problem is yet unresolved, but for now running 8 tasks/1 thread works, and also provides slightly higher throughput. So even though this PC is not overclocked, and can run Memtest all night with no errors, it has a problem!
Many of us have noticed that at least some Ryzens are not the most stable of CPUs. I've not only turned off Boost on my 3700X, but also turned off the XMP setting on the memory, which lowers it to stock speeds.
____________
My lucky number is 75898^524288+1 |
|
|
|
There is an opportunity suggested here to create a tool that is very sensitive to computer stability. During the run-up to the current challenge, I noticed these "Gerbicz errors" in one of the output logs on a Ryzen 3700X, while testing an offline WU, but only with certain task/thread combinations, notably 8 tasks of 2 threads. The problem is yet unresolved, but for now running 8 tasks/1 thread works, and also provides slightly higher throughput. So even though this PC is not overclocked, and can run Memtest all night with no errors, it has a problem!
I've noticed running tasks hyperthreaded (8x2 on a 3700x) isn't just slower but makes Ryzen very unhappy. |
|
|
|
Thanks for the responses. Obviously this is disappointing! But good to know. |
|
|
|
Not really disappointing, hyperthreading is slower for LLR on all processors.
I've had 4 different Ryzen processors and not noticed any instabilities when not trying to push things - although the 3700X will run super hot if you leave PBO set to AUTO rather than switching it off.
No idea what's up with Michael's, but I'd suspect the motherboard and/or BIOS rather than the CPU if the RAM won't run at advertised speed - I did have to return a B450 board that refused to POST with more than 1 stick of RAM. |
|
|