Errors occurred and were corrected - warning on PSP tasks
I've got a PC which did not generate any errors previously, but is having problems during the challenge with PSP tasks.
I get a warning message: "Errors occurred and were corrected during this calculation. Your computer is not operating correctly. This is a hardware problem you should fix."
https://www.primegrid.com/workunit.php?wuid=908465889
https://www.primegrid.com/workunit.php?wuid=908465193
So far I have disabled PBO, set RAM speed to factory standard 2133, checked to make sure nothing is overheating - no luck, I still received a warning today.
The cruncher is a Win 11 PC with 5950X, 32GB memory (currently on 2133).
Any ideas what could be wrong? | |
|
Honza (Volunteer moderator, Volunteer tester, Project scientist)
Joined: 15 Aug 05 Posts: 1963 ID: 352 Credit: 6,420,056,055 RAC: 2,653,228
|
Those steps you have done are good ones.
I had a similar problem, and disabling power boost helped.
How about the power supply?
I see you have a powerful GPU in that host as well.
I would also disable SMT: running 4 tasks on 32 logical cores is not faster, and you would be better off with the 16 real cores.
This may also reduce stress on the system.
|
|
The power supply should be good, it's a Be Quiet 750W 80+ Platinum unit.
I would prefer not to disable HT. I work on this PC while PrimeGrid is running, and the HT threads let me use it without stuttering.
I am trying affinitywatcher now, BOINC is set to use 50% CPU. | |
|
|
I have the same processor and the same issue. Sometimes a task makes it through.
I have no overclocking, and the same PSU.
HT is not disabled in the BIOS, but I have BOINC using only 50% of the processors (that means 16 threads), and I have set a maximum of 2 tasks with 8 CPUs per task (so the L3 cache is not overloaded).
So I am joining this topic too. | |
|
|
My guess is a possible CPU problem. Errors may come from CPU voltage that is too low, frequency that is too high, or temperature higher than, I guess, 95 °C.
A web search shows that some Ryzen 5950X and some other Ryzen 5000 series CPUs can be defective or degraded. Some people have stopped the crashes or errors by turning Core Performance Boost off, at the cost of a speed penalty.
A quick and easy way to limit CPU frequency:
On Windows: edit the power plan under Power Options, Processor power management.
Limit Maximum processor frequency to 3900 or something. If you don't see the setting, search the internet for: how to add Maximum processor frequency. Once it is enabled, close and reopen Power Options to see it.
Other ways are to use Ryzen Master or the motherboard settings. Be careful: too much voltage, or temperature higher than, I guess, 95 °C, may cause permanent CPU damage or degradation.
I have had 4 of my many tasks show warnings on a Ryzen 5950X, Asus B550-E. I used Ryzen Master to lock the frequency to CCD0 = 4150 MHz, CCD1 = 4000 MHz, CPU 1.1 volts. I then dropped it to 4125 and 3975 MHz at 1.1 volts, and so far these settings are fine. It runs at a CPU temperature of 86 °C. | |
|
|
Mine is running at around 60 °C, which is very far from the maximum CPU temperature, so that is not the reason. Everything is at default settings, so a defective CPU would already have led to occasional system crashes. But this computer is stable 24/7. | |
|
|
My other guess, then, is possibly faulty RAM, bad RAM timings, or a RAM speed configured too fast. Try a bootable memtest86 or something similar. Do you get memory errors? If there are RAM errors while using 2 or more RAM sticks, then test 1 RAM stick at a time, or slow down the RAM speed.
If the RAM errors in memtest86 fall in a few small ranges such as 123450000-12345FFFF, then the Linux memmap kernel option, or the Windows badmemorylist, may sometimes be able to work around it.
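To make the memmap idea concrete, here is a minimal sketch that turns a bad range reported by memtest86 into a kernel argument. It assumes the `memmap=nn[KMG]$ss[KMG]` form (reserve `nn` bytes starting at address `ss`) as I understand it from the kernel parameter documentation. The helper name `memmap_arg` is made up for illustration, and the addresses are the example range from above, not real test results.

```shell
#!/bin/sh
# memmap_arg START END
# Prints a memmap= kernel argument that reserves the inclusive range
# START..END (hex addresses with a 0x prefix).
# Assumption: the range is a whole number of KiB; any remainder is truncated.
memmap_arg() (
  start=$(( $1 ))
  end=$(( $2 ))
  size_kib=$(( (end - start + 1) / 1024 ))
  # Single quotes keep the literal "$" that the memmap syntax requires.
  printf 'memmap=%dK$0x%X\n' "$size_kib" "$start"
)

memmap_arg 0x123450000 0x12345FFFF   # prints memmap=64K$0x123450000
```

The `$` usually needs escaping when the argument is written into a boot loader config, so check how your boot loader handles it before rebooting.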
A CPU or a RAM stick may be partially faulty, degraded, or unstable in such a way that it does not crash or show errors until something starts using heavy AVX2 computation, or something starts using a faulty location in RAM. | |
|
|
Or we have misunderstood the L3 cache optimization (avoiding errors) as explained here.
So for such a Ryzen 5950x CPU (64MB total L3 cache available) and with PSP tasks (23MB L3 cache per work unit), that means (still with hyperthreading disabled or 50% CPU usage in BOINC Manager):
1- 2 work units maximum and 8 threads per work unit (I set this one).
or
2- 1 work unit maximum and 16 threads per work unit.
Which one is the best in both performance and avoiding errors? | |
|
|
Possible errors may come from RAM.
My Ryzen 2700x, 3900x, and 5950x have ECC (error-correcting code) unbuffered DIMM (UDIMM) DDR4, which may have helped avoid most warnings. Recently on my Ryzen 2700x Linux machine I had to take out a RAM stick to fix a random "signal 7" app termination problem and random computer restarts. Linux dmesg showed uncorrectable ECC errors at random addresses. Without ECC, this could have caused silent data corruption or random errors.
Slow speed on Ryzen multi-CCX with a possible workaround.
Several Ryzen CPUs have a split L3 cache problem: each CCX (core complex) has its own L3 cache. Each PrimeGrid PSP task takes about 24 MiB of cache. The Ryzen 5900x and 5950x have 2x 32 MiB of L3 cache. To get faster speed, limit each PSP task to only one CCX with core/thread affinity.
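As a rough illustration of the cache arithmetic, the sketch below checks whether a given number of tasks pinned to one CCX still fits in that CCX's L3. The function name `fits_in_l3` is made up, and the 32 MiB and 24 MiB figures are the approximate numbers quoted above, so treat the result as a rule of thumb rather than a measurement.

```shell
#!/bin/sh
# fits_in_l3 L3_MIB TASK_MIB TASKS_PER_CCX
# Prints "fits" if the tasks sharing one CCX stay within its L3 cache,
# otherwise "overflows". All sizes are in MiB.
fits_in_l3() (
  l3_mib=$1
  task_mib=$2
  tasks=$3
  if [ $(( task_mib * tasks )) -le "$l3_mib" ]; then
    echo fits
  else
    echo overflows
  fi
)

fits_in_l3 32 24 1   # one PSP task per CCX: fits
fits_in_l3 32 24 2   # two PSP tasks per CCX: overflows
```

Under these assumptions a 5950x has room for one PSP task per CCX, so two tasks in total.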
I tested speed with different CPU affinity settings: cd (change directory) into empty folders and run the tasks manually with BOINC suspended. I ran each task for a few minutes, or until it showed the time per bit in milliseconds (lower is faster), then terminated it with Ctrl+C. Watts were measured on Windows with LibreHardwareMonitor or HWinfo64. On Linux you can try Zenpower3 and sensors to see watts.
CCX: affinity limited to 1 CCX per task.
t1: affinity limited to first thread of each core.
t2: affinity limited to second thread of each core. Ryzen 5950X second thread may be slower.
Windows 10, 5950x, 2 CCX, 2x 32 MiB L3 cache, CCX0=4125 MHz, CCX1=3950 MHz, 1.1 volts.
Columns: tasks x threads per task, affinity (CCX, thread), watts, time per bit in ms (one value per task), note
2x16, CCX --, 169 watts, 1.286 1.330, Fastest
2x16, --- --, 138 watts, 1.993 1.974
2x8, CCX --, 156 watts, 1.307 1.356
2x8, CCX t1, 157 watts, 1.288 1.331, Efficient
2x8, CCX t2, 158 watts, 1.312 1.349
2x8, --- --, 142 watts, 1.649 1.627
4x8, CCX --, 116 watts, 5.834 5.852 5.645 5.664
1x32, --- --, 141 watts, 0.929
1x16, CCX --, 113 watts, 1.276, limit 8 cores
1x16, --- --, 132 watts, 0.924
1x16, --- t1, 133 watts, 0.916
1x16, --- t2, 134 watts, 0.922
1x8, CCX --, 107 watts, 1.286
1x8, CCX t1, 105 watts, 1.279
1x8, CCX t2, 106 watts, 1.298
0, idle idle, 37 watts
Note: the fastest setting, 2x 16 threads, may be slowed down by the power limit or temperature limit.
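The per-task times are hard to compare across rows with different task counts, so here is a small sketch that converts a row into combined throughput. It assumes throughput is simply the sum of 1/time over the tasks running at once, in bits per millisecond. The function name `throughput` is made up, and the inputs are copied from the rows above.

```shell
#!/bin/sh
# throughput TIME_MS...
# Sums 1/time over all concurrently running tasks and prints the
# combined rate in bits per millisecond, to three decimals.
throughput() {
  printf '%s\n' "$@" | awk '{ s += 1 / $1 } END { printf "%.3f\n", s }'
}

throughput 1.286 1.330   # 2x16 with CCX affinity: prints 1.529
throughput 0.929         # 1x32 without affinity:  prints 1.076
```

By this measure, two tasks pinned one per CCX deliver roughly 1.4 times the throughput of a single 32-thread task, even though each individual task takes longer.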
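A Linux script for the 5950x that sets affinity and can be run as root or as the boinc user. You can change it depending on your CPU and use case. Save it and chmod a+x this-script.
#!/bin/bash
# Pin running LLR tasks to one CCX each (Linux CPU numbering on a 5950x).
# The loop repeats every 50 seconds so newly started tasks get pinned too.
while : ; do
  # Matching processes alternate between the second and the first CCX.
  pgrep '_llr|sllr2' | while read -r pid; do
    taskset -a -pc 8-15,24-31 "$pid"
    if read -r pid; then
      taskset -a -pc 0-7,16-23 "$pid"
    fi
  done
  sleep 50
done
More advanced details: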
With SMT (simultaneous multithreading), Windows and Linux number the logical CPUs in different thread-to-core orders.
Windows: 0,0,1,1,2,2,3,3, ... 14,14,15,15
Linux: 0,1,2, ... 15,0,1,2, ... 15. To confirm this: grep /proc/cpuinfo -e "core id"
There are 2 ways to get a task's command line: from the BOINC client_state.xml file, or, while the task is running on Linux (case sensitive): ps -A -f | grep llr
To set affinity in Linux: taskset -a -pc 0-15 $$ (use $$ for the current shell, or a process number)
To set affinity in Windows: use the Task Manager details view, or the command: start /affinity 0x0000FFFF cmd
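The Windows mask above can be worked out from a range of logical CPU numbers. This is a minimal sketch with a made-up helper name, `affinity_mask`. It assumes the Windows ordering listed above, where logical CPUs 0-15 are both threads of cores 0-7, and it assumes cores 0-7 sit on the first CCX.

```shell
#!/bin/sh
# affinity_mask LO HI
# Prints a hex bitmask with one bit set for every logical CPU in the
# inclusive range LO..HI (bit 0 = logical CPU 0).
affinity_mask() (
  lo=$1
  hi=$2
  count=$(( hi - lo + 1 ))
  printf '0x%08X\n' $(( ((1 << count) - 1) << lo ))
)

affinity_mask 0 15    # first CCX on Windows:  prints 0x0000FFFF
affinity_mask 16 31   # second CCX on Windows: prints 0xFFFF0000
```

On Linux the same CCX is not one contiguous range (for example 0-7,16-23), which is why taskset is given a CPU list there instead of a single range.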
To run in a Linux shell (the -q argument is quoted so the shell does not try to expand the *):
/var/lib/boinc-client/projects/www.primegrid.com/sllr2_1.3.0_linux64_220821 -oGerbicz=1 -oProofName=proof -oProofCount=128 -oProductName=prod -oPietrzak=1 -oCachePoints=0 -pSavePoints "-q79309*2^28919774+1" -d -t8 -oDiskWriteTime=10
To run in Windows command window:
C:\ProgramData\BOINC\projects\www.primegrid.com\llr2_1.3.0_win64_220821.exe -oGerbicz=1 -oProofName=proof -oProofCount=128 -oProductName=prod -oPietrzak=1 -oCachePoints=0 -pSavePoints "-q79309*2^28919774+1" -d -t8 -oDiskWriteTime=10 | |
|
|
Thanks for those data!
But 2x16 or 1x32 means hyperthreading is enabled (I thought we should not enable HT for LLR). Was that only to get data with all threads in use? | |
|
|
I guess large PSP tasks possibly make hyperthreading slightly faster. It is a tiny 0.1% speed gain, at a cost of 7% more watts. Good for stress testing.
Watch the speed and temperature in Linux with: watch -n 0.2 "grep /proc/cpuinfo -e MHz; sensors". Windows can use LibreHardwareMonitor or HWinfo64.
On Linux you can install the cpufrequtils package and try cpufreq-set -f 3200MHz (without a unit suffix the value is read as kHz). It may be slower, but it lets you check whether there are any more errors. Other options can be looked at with cpufreq-info. Another way to set the frequency in Linux, without the cpufreq utility, is shown below.
On my Ryzen 2700x the value is in kHz (kilohertz). This possibly applies to newer Ryzen CPUs as well.
echo userspace | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
echo 3200000 | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_setspeed
Find the speed that stops the errors or warnings. Motherboard BIOS settings include voltage offset and load-line calibration options, as well as RAM speed options. Some of these motherboard settings may help reduce the chance of errors. | |
|
|
I am trying your affinity script with the current Blaise Challenge (321 LLR), limited to 6 threads per work unit (MT) and only one work unit at a time. | |
|