Message boards :
Generalized Cullen/Woodall prime search :
GCW units not surviving client restart
I have a crop of GCW workunits that die at a client restart, usually after a reboot.
The problem is *nearly* always reproducible with systemctl restart boinc-client.service: only about one task in ten survives this test.
The symptom is that immediately after the client restarts, the workunit finishes, uploads, and goes into the wait for validation, which it inevitably fails in due course.
This costs me the credit for the crunching done before the restart. I do not expect credit where work fails, but I also do not expect a restart to trigger the "end of task" sequence prematurely. It also means (as with any invalid work) that the original wingman gets mixed feelings: they are first finisher, but the new wingman is not appointed until after the original wingman's task completes, so the new wingman has no chance of being first.
This seems specific to llrGCW:
I also tried this with AP, a variety of other LLR tasks, and two different GFN n values. All of these survive the restart without problem.
At present I have only seen this on my Qubes / Xen machine, but will not have time to check out this behaviour on my "real" machines for a few days, and will update this thread by the weekend after I have done those tests. My hunch, without testing, is that this will turn out to be another Qubes-specific issue.
I would be glad of any immediate comments if this kind of issue has arisen before?
I did not see it before the server move, but was not trying to run llrGCW so I have no reason to suspect the migration.
R~~
GCW does not seem to have any huge memory requirements that might trigger this, but if there are significant differences from other LLR tasks please give me a heads up.
____________
My computers found:
9831*2^1441403+1 is a quadhectokilo prime, ie >400,000 digits ;)
252031090528237591 + 65521*149*23*19*17*13*11*7*5*3*2*n is prime for every n in { 0..20 } (an arithmetic progression of 21 primes)
Examples:
I restarted the client at about 1603 by the computer's clock while two llrGCW tasks were running, but without rebooting the virtual machine -- or the host ;).
This unit survived and carried on with about 43 mins elapsed. This WU has survived several restarts.
Whereas this one exited showing "success" as the outcome on the task's web page, but cannot conceivably validate with a run time of only about 35 mins.
Here is the event log for this client start, showing the WU ending and being reported and showing the next one starting.
Tue 27 Nov 2018 16:03:15 GMT | | Starting BOINC client version 7.6.33 for x86_64-pc-linux-gnu
Tue 27 Nov 2018 16:03:15 GMT | | log flags: file_xfer, task, cpu_sched
Tue 27 Nov 2018 16:03:15 GMT | | Libraries: libcurl/7.52.1 OpenSSL/1.0.2l zlib/1.2.8 libidn2/0.16 libpsl/0.17.0 (+libidn2/0.16) libssh2/1.7.0 nghttp2/1.18.1 librtmp/2.3
Tue 27 Nov 2018 16:03:15 GMT | | Data directory: /var/lib/boinc-client
Tue 27 Nov 2018 16:03:15 GMT | | No usable GPUs found
Tue 27 Nov 2018 16:03:15 GMT | | Host name: G
Tue 27 Nov 2018 16:03:15 GMT | | Processor: 2 GenuineIntel Intel(R) Core(TM) m5-6Y54 CPU @ 1.10GHz [Family 6 Model 78 Stepping 3]
Tue 27 Nov 2018 16:03:15 GMT | | Processor features: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush acpi mmx fxsr sse sse2 ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl cpuid pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single pti fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx rdseed adx smap clflushopt xsaveopt xsavec xgetbv1 xsaves
Tue 27 Nov 2018 16:03:15 GMT | | OS: Linux: 4.14.18-1.pvops.qubes.x86_64
Tue 27 Nov 2018 16:03:15 GMT | | Memory: 2.09 GB physical, 1024.00 MB virtual
Tue 27 Nov 2018 16:03:15 GMT | | Disk: 4.86 GB total, 1.65 GB free
Tue 27 Nov 2018 16:03:15 GMT | | Local time is UTC +0 hours
Tue 27 Nov 2018 16:03:15 GMT | | Config: GUI RPCs allowed from:
Tue 27 Nov 2018 16:03:15 GMT | PrimeGrid | URL http://www.primegrid.com/; Computer ID 941669; resource share 169
Tue 27 Nov 2018 16:03:15 GMT | PrimeGrid | General prefs: from PrimeGrid (last modified 10-Jul-2017 15:20:22)
Tue 27 Nov 2018 16:03:15 GMT | PrimeGrid | Computer location: Pluto
Tue 27 Nov 2018 16:03:15 GMT | PrimeGrid | General prefs: no separate prefs for Pluto; using your defaults
Tue 27 Nov 2018 16:03:15 GMT | | Reading preferences override file
Tue 27 Nov 2018 16:03:15 GMT | | Preferences:
Tue 27 Nov 2018 16:03:15 GMT | | max memory usage when active: 2119.44MB
Tue 27 Nov 2018 16:03:15 GMT | | max memory usage when idle: 2119.44MB
Tue 27 Nov 2018 16:03:15 GMT | | max disk usage: 1.78GB
Tue 27 Nov 2018 16:03:15 GMT | | suspend work if non-BOINC CPU load exceeds 65%
Tue 27 Nov 2018 16:03:15 GMT | | (to change preferences, visit a project web site or select Preferences in the Manager)
Tue 27 Nov 2018 16:03:15 GMT | | gui_rpc_auth.cfg is empty - no GUI RPC password protection
Tue 27 Nov 2018 16:03:16 GMT | PrimeGrid | [cpu_sched] Restarting task llrGCW_307836160_0 using llrGCW version 801 in slot 1
Tue 27 Nov 2018 16:03:18 GMT | PrimeGrid | [cpu_sched] Restarting task llrGCW_307835931_3 using llrGCW version 801 in slot 0
Tue 27 Nov 2018 16:03:21 GMT | PrimeGrid | Computation for task llrGCW_307835931_3 finished
Tue 27 Nov 2018 16:03:23 GMT | PrimeGrid | Started upload of llrGCW_307835931_3_r303567640_0
Tue 27 Nov 2018 16:03:24 GMT | PrimeGrid | Finished upload of llrGCW_307835931_3_r303567640_0
Tue 27 Nov 2018 16:03:24 GMT | PrimeGrid | Started download of llrGCW_307836266
Tue 27 Nov 2018 16:03:25 GMT | PrimeGrid | Finished download of llrGCW_307836266
Tue 27 Nov 2018 16:03:26 GMT | PrimeGrid | Starting task llrGCW_307836266_1
Tue 27 Nov 2018 16:03:26 GMT | PrimeGrid | [cpu_sched] Starting task llrGCW_307836266_1 using llrGCW version 801 in slot 0
Tue 27 Nov 2018 16:33:31 GMT | PrimeGrid | work fetch suspended by user
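For anyone wanting to check their own event log for the same pattern, here is a minimal Python sketch. It assumes the log lines have the same "timestamp | project | message" layout as above and that the timestamps are in English. The function name find_premature_finishes and the 60-second window are my own illustrative choices, not anything provided by BOINC or PrimeGrid.

```python
import re
from datetime import datetime

# Assumed layout, taken from the log above: "timestamp | project | message"
LINE_RE = re.compile(r"^(?P<ts>.+? GMT) \| (?P<project>[^|]*)\| (?P<msg>.*)$")
RESTART_RE = re.compile(r"Restarting task (\S+)")
FINISH_RE = re.compile(r"Computation for task (\S+) finished")
TS_FORMAT = "%a %d %b %Y %H:%M:%S GMT"  # assumes English day/month names


def find_premature_finishes(log_lines, window_seconds=60):
    """Return (task_name, seconds) for tasks that finish soon after a restart.

    window_seconds is an arbitrary heuristic threshold (hypothetical).
    """
    restarted = {}  # task name -> time the client restarted it
    suspects = []
    for line in log_lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # skip lines that do not match the assumed layout
        ts = datetime.strptime(m.group("ts"), TS_FORMAT)
        msg = m.group("msg")
        r = RESTART_RE.search(msg)
        if r:
            restarted[r.group(1)] = ts
            continue
        f = FINISH_RE.search(msg)
        if f and f.group(1) in restarted:
            delta = (ts - restarted[f.group(1)]).total_seconds()
            if 0 <= delta <= window_seconds:
                suspects.append((f.group(1), int(delta)))
    return suspects
```

Fed the log above as a list of lines, this should flag llrGCW_307835931_3, which finished three seconds after being restarted, and leave llrGCW_307836160_0 alone.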
Original Example:
When I initially spotted this behaviour, two WUs were running, each with many hours of crunching done, and both failed on reboot of the host and virtual machine. They are this one and that one
This is NOT just Xen, NOT just Qubes
I have just replicated this behaviour on a "real" computer running Linux as a real OS.
That means the behaviour has now been seen on two computers running two different Debian derivatives, one on "real metal" and one in a VM.
The Linux Mint machine has been used for some time on PG without displaying this behaviour on any other type of task, and the software is the same as always, other than regular updating from the OS repositories.
This is the first time the machine has ever downloaded a GCW task, and each of the first eight ended early (but with a "success" code) when the client was restarted. The OS was NOT rebooted either time. The first four failed when the client was restarted after some 18 hours running, the second four when the client was restarted after only a very short run.
These tasks are listed here - if there are more than eight look for the oldest ones, reported between 1100 and 1200 UT on Nov 28.