PrimeGrid
Message boards : Number crunching : tasks mislabelled as abandoned

River~~
Joined: 17 Mar 07
Posts: 342
ID: 6533
Credit: 15,792,075
RAC: 0
Message 103875 - Posted: 20 Jan 2017 | 1:45:13 UTC
Last modified: 20 Jan 2017 | 1:55:38 UTC

hi

Several times now I have found tasks suddenly labelled as abandoned when they are not.

To be sure, when a task _really_ is abandoned, then the abandoned task spotter in the server is useful. It allows a task to be re-sent before it would otherwise time out.

After careful tests, here is what I think happens.

Two of my computers are numbered 527527 (RivLapGoldMint cpu m5-6Y54) and 533322 (RivLen, cpu i7-6500U).

As is usual for Linux boxes, for some reason the server picks up their IP address as 127.0.1.1 and I suspect that this may be the root of the problem.

However, they have different CPUs and are running a different version of the Linux kernel. It should be simple for the server to disambiguate them.

RivLapGoldMint has been crunching for PG for several months.

RivLen was new tonight -- it has had a bad screen and has not been turned on for months. If it has ever crunched for PG it was many months ago and running Win 10 not Linux.

When RivLen connected for the first time tonight, as a new host, the server misidentified it as being 527527 returning, so marked all 4 of the running tasks on 527527 as abandoned. For a few minutes, when I found RivLen in the list of my computers, it had all the recent past tasks listed as if it was 527527, though it showed the correct name.


When I updated from RivLapGoldMint and did a new list of my computers, RivLen had disappeared from the list, and RivLapGoldMint had come back, still showing its own tasks as abandoned, though they were still running. The website also claimed it was running the four tasks that were actually running on the other machine.

(Incidentally, I do not understand how this is possible. As part of an update, if I understand correctly, the client reports a list of currently running tasks, those waiting to run, and those suspended. That is how "abandoned" tasks get discovered. So how the server could claim that a computer it had just been talking to had tasks in progress that were on another machine is a mystery to me.)

When I did an update from the new machine, it was then given a new number, and was shown as having zero tasks. I therefore aborted the tasks that had been running only a few minutes. The server issued another four tasks, and then showed the computer as having 8 tasks -- when the tasks were reported as aborted, apparently they migrated from 527527 to 533322.

A final check on the older machine, in real life it is still running the four tasks that were in progress, but on the server those tasks are showing as abandoned and the WU are listing replacement tasks to be sent out.



This computer has the relevant tasks.

The "abandoned" tasks are a=SOB b=SOB c=ESP and d=ESP

The other machine is here


I plan to run the ESP tasks to completion to see what happens, they are both over half way anyway with less than a day each to run.

I am in two minds about the SOBs. They are each about 10% complete after just over 2 days. This gives me a dilemma.

If I abort these two tasks I lose credit for 4 core-days of work.

On the other hand, I do not really want to do another 18 or so days crunching that is work that will be unnecessarily replicated by both the original wingman and the new one.

Also it is a loss of crunching power to the project if I triple check work.

And also if I do abort them and that WU comes up prime I am going to regret that choice....

In the event that those two unsent SOB tasks have not yet been despatched, is there a server-side tweak that can hold them back? That would be a good short-term workaround.

In the longer term, my suggested workaround is that (if this is easy-ish) you hold back resends on abandoned work for 26 hours. If during that time the server gets a trickle, it would cancel the "abandoned" status. After 26 hours with no contact, it would then schedule the new task for someone to download.
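
To make the suggestion concrete, here is a rough Python sketch of the hold-back rule. It only illustrates the idea: the function name, the three status strings and the way timestamps are passed in are assumptions made for this example, and nothing here is taken from the actual BOINC server code.

```python
from datetime import datetime, timedelta
from typing import Optional

# Hold-back period proposed above; a suggestion, not an existing BOINC setting.
GRACE_PERIOD = timedelta(hours=26)


def resend_decision(marked_abandoned_at: datetime,
                    last_trickle_at: Optional[datetime],
                    now: datetime) -> str:
    """Decide what to do with a task the server has marked as abandoned.

    Returns one of three hypothetical actions:
      "cancel_abandoned": a trickle arrived after the task was marked,
                          so the host is evidently still working on it
      "hold":             still inside the grace period, do not resend yet
      "resend":           grace period over with no contact, send a replacement
    """
    if last_trickle_at is not None and last_trickle_at > marked_abandoned_at:
        return "cancel_abandoned"
    if now - marked_abandoned_at < GRACE_PERIOD:
        return "hold"
    return "resend"
```

A trickle that arrived before the task was marked as abandoned is deliberately ignored, since it says nothing about whether the host still has the task.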

It may well be that it is not practicable to do that -- I do not know anything about the insides of the BOINC server, but if it is at all straightforward then it would save the project sending out work for triple-checking unnecessarily.

All of this is a knock on from the efforts the BOINC programmers made to avoid an older problem, where a computer would suddenly be regarded as a new machine. That problem was easy to fix, as you could merge machines with the same name.

The fix to that problem means that the server currently merges computers that have different names and different CPUs. They have, in my opinion, solved a minor problem at the expense of generating a bigger one.

Until tonight, I was thinking that part of the problem is that I have seven boxes that are hardware-identical apart from MAC addresses. I was shocked to see this problem arise with distinct hardware -- OK, I suppose I could have upgraded (or downgraded) the CPU, but in my opinion an empty machine signing on with the attach process should be treated as a new machine, not merged with anything else. I do realise that those choices are outside the control of anyone at PG.

Could Jim or Michael or some other project person please advise regarding those SOB jobs -- will you be able to hold back the second wingman? If not, do you think I should carry on wasting 18 days credit to save 2 days? Or should I abort them soon and accept a 10% loss?

River~~

edit: typos

Michael Goetz
Project donor
Volunteer moderator
Project administrator
Joined: 21 Jan 10
Posts: 13682
ID: 53948
Credit: 304,549,179
RAC: 210,282
Message 103877 - Posted: 20 Jan 2017 | 2:06:26 UTC

(A lot of guesswork follows, so take it with a grain of salt.)

How did you install BOINC on the "new" computer, RivLen? Was it installed fresh from downloading BOINC, or was it copied from another installation?

It sounds to me like this was a copied installation rather than a fresh install. If so, that's probably what caused the problem. In that case you will want to uninstall BOINC, erase the BOINC data directory entirely, and reinstall BOINC directly from the Berkeley download. (In this case I wouldn't even trust the boinc-client package from your distro's package handler just in case there's something wrong with it. I'd use the official BOINC client straight from Berkeley.)
____________
My lucky number is 75898^524288+1

River~~
Message 103880 - Posted: 20 Jan 2017 | 3:09:26 UTC - in response to Message 103877.

>(A lot of guesswork follows, so take it with a grain of salt.)

>How did you install BOINC on the "new" computer, RivLen? Was it installed fresh from downloading BOINC, or was it copied from another installation?


hi Michael,

good question, and a plausible guess on the facts I described.

However this was one of the things I suspected when I first noticed I was getting "abandoned" tasks again. So I have stopped the cloning some time ago, and work from fresh installs.

Not only that, but I eyeballed the directories before letting it run. On Linux the files turn up in two different places.

There are four files in /etc/boinc-client, and two of these I edited to allow access by boincview. There are only ever these four files in /etc/boinc-client

cc_config.xml
global_prefs_override.xml
gui_rpc_auth.cfg
remote_hosts.cfg

the first file now contains three log flags (though I cannot say what it contained when it downloaded)

the second file now contains my prefs which must have been downloaded from the server as I did not enter them manually,

the third and fourth files I customised to my own values before attaching (set the password in one, and set a list of allowed hosts for RPC). I stopped the client before the edit, and restarted after, and only attached after that.

The main boinc directory is /var/lib/boinc-client, which is also known as ~boinc, the home directory for the user boinc. Before the client first runs this seems to contain just four softlinks to the above-mentioned files in /etc/boinc-client. Loads of files get added as soon as the client starts, and account_www.primegrid.com.xml gets added when you attach to the project. Up to the attach there is nowhere for a rogue host id to be lurking, as far as I could see: no project directories, no account file, and so on.
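
The eyeball check for leftover account files could be scripted roughly as follows. This is a sketch only: the function name is invented for the example, and the account_*.xml pattern is an assumption generalised from the single file name mentioned above.

```python
import glob
import os
from typing import List


def leftover_account_files(data_dir: str) -> List[str]:
    """List project account files already present in a BOINC data directory.

    An empty result before the first attach suggests that no account file
    has been carried over from another installation. Other state files
    are not inspected by this sketch.
    """
    pattern = os.path.join(data_dir, "account_*.xml")
    return sorted(os.path.basename(path) for path in glob.glob(pattern))
```

On the machines described here it would be called as leftover_account_files("/var/lib/boinc-client"), with enough privileges to read that directory.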

Both computers are running Linux Mint; the older is running Mint 18 and the newer 18.1 (hence the different kernels).

In both cases, when BOINC was installed it was done with the package manager, by download from a mirror of the Mint distro -- one download was many months ago, the other tonight, but I believe the package may not have been updated, so yes it is potentially possible that something in that package is triggering the problem.

Your advice to use the Berkeley downloads conflicts with the advice given on that very download page, which advises Linux64 users as follows:

Linux x64
Tested on the current Ubuntu distribution; may work on others.
If available, we recommend that you install a distribution-specific package instead.


(This is common advice in Linux, as you probably know. In general there can be issues with running pre-compiled software in Unix if it was compiled against a different version of various libraries, roughly analogous to having the wrong set of DLLs in Windows, and by using the package native to the distro one avoids that class of errors.)

I think though that your advice sounds worth a try, next time I add a fresh install.

Finally, the two computers do seem to be recognised as distinct now. I have a slight reluctance to do another install, in case the server misidentifies the fresh install for one of my other boxes. Do you think it is really necessary? I have only ever seen this issue (so far!) on the first couple of connects. Unless you feel I am making a big mistake, my inclination is to leave it be now it has settled in.

R~~

Michael Goetz
Message 103889 - Posted: 20 Jan 2017 | 7:10:59 UTC - in response to Message 103880.

If it's working, don't change it. :)
____________
My lucky number is 75898^524288+1

stream
Volunteer moderator
Project administrator
Volunteer developer
Volunteer tester
Joined: 1 Mar 14
Posts: 938
ID: 301928
Credit: 513,433,159
RAC: 2,554
Message 103893 - Posted: 20 Jan 2017 | 8:26:54 UTC

I've already posted some info about this topic on the forum, so I'll be short here.

Both server and client keep an RPC sequence number. When the client contacts the server and sends an incorrect sequence number, the server assumes that something went wrong on the client and forces its reinitialization (generation of a new cpid - see below). All of the client's tasks are marked as abandoned BY THE SERVER, and you CANNOT recover them. (Unless you have set the "allow_multiple_clients" option on the client; in that case the server will assign you a new computer ID and MAY, but is not guaranteed to, keep your tasks.)

The sequence number is stored in client_state.xml, so the problem may appear when:
- this file (in Boinc data directory) was copied from another machine, restored from backup, or corrupted after crash;
- on Windows, bad antivirus software may somehow affect update of this file;
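
A toy model of the sequence-number check described above, written in Python. It sketches the behaviour as described in this post and is not the real scheduler code: the `HostRecord` class, the `handle_rpc` function, and the rule that a sequence number lower than the last one seen counts as "incorrect" are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Set


@dataclass
class HostRecord:
    """Hypothetical server-side record for one host."""
    host_id: int
    last_seqno: int = 0
    in_progress: Set[str] = field(default_factory=set)
    abandoned: Set[str] = field(default_factory=set)


def handle_rpc(host: HostRecord, client_seqno: int) -> str:
    """Process one scheduler request in this toy model.

    Assumption: a sequence number lower than the last one the server saw
    (for example after client_state.xml was restored from a backup or
    copied from another machine) is treated as incorrect.
    """
    if client_seqno < host.last_seqno:
        # The server gives up on everything it believed this host was running.
        host.abandoned |= host.in_progress
        host.in_progress.clear()
        host.last_seqno = client_seqno
        return "tasks_abandoned"
    host.last_seqno = client_seqno
    return "ok"
```

In this model a restored backup is enough to trigger the abandonment, because the restored file carries an older sequence number than the one the server last recorded.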

The second problem is the identification of computers. There is an entry in client_state.xml:

<host_cpid>112233445566778843e754e1727072f8</host_cpid>

This is the unique ID of your computer. THIS is the value the server uses to distinguish between computers, not the short host ID. The UID is created by Boinc, and the rules for creating it differ from version to version. Originally it was based on the MAC address; recent versions of the client added the path to the data directory to the hash.

When deploying clients on similar hardware or cloned virtual machines, make sure the client has generated a unique CPID on each host! For example, Boinc may for some reason fail to determine the MAC address of the LAN adapter on Linux, and it will not say a word about it. Since the path to the data directory is usually the same on all Linux installations, you'll be in trouble.
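
One way to check this is to collect client_state.xml from each machine and compare the host_cpid values. A minimal Python sketch follows; the function names are made up for the example, and it assumes you have already gathered the file contents yourself, keyed by whatever host names you use.

```python
import re
from collections import defaultdict
from typing import Dict, List, Optional

# Matches the <host_cpid> element shown above; a CPID is assumed to be hex digits.
_CPID_RE = re.compile(r"<host_cpid>\s*([0-9a-fA-F]+)\s*</host_cpid>")


def extract_host_cpid(client_state_text: str) -> Optional[str]:
    """Return the host_cpid found in the text of client_state.xml, if any."""
    match = _CPID_RE.search(client_state_text)
    return match.group(1).lower() if match else None


def find_duplicate_cpids(states: Dict[str, str]) -> Dict[str, List[str]]:
    """Map each CPID shared by more than one host to the hosts that share it.

    `states` maps a host name to the contents of that host's client_state.xml.
    """
    by_cpid = defaultdict(list)
    for host, text in states.items():
        cpid = extract_host_cpid(text)
        if cpid is not None:
            by_cpid[cpid].append(host)
    return {cpid: sorted(hosts) for cpid, hosts in by_cpid.items() if len(hosts) > 1}
```

An empty result means every host that reported a CPID has a distinct one; any entry in the result names machines the server is likely to confuse.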

puh32
Joined: 2 Feb 09
Posts: 55
ID: 34980
Credit: 260,708,443
RAC: 267,227
Message 103897 - Posted: 20 Jan 2017 | 8:42:42 UTC - in response to Message 103893.

I'm probably misunderstanding something here.

My two main PrimeGrid computers have the same "Computer ID" and cannot be told apart in, for example, the "In progress tasks for...".

Is this not the way things are supposed to be?

These computers are miles apart and were never physically co-located. Both were installed with the Berkeley BOINC release (on different occasions). I have never moved files between them.

JimB
Project donor
Honorary cruncher
Joined: 4 Aug 11
Posts: 916
ID: 107307
Credit: 974,532,191
RAC: 0
Message 103898 - Posted: 20 Jan 2017 | 8:46:29 UTC - in response to Message 103897.

>I'm probably misunderstanding something here.

>My two main PrimeGrid computers have the same "Computer ID" and cannot be told apart in, for example, the "In progress tasks for...".

>Is this not the way things are supposed to be?

>These computers are miles apart and were never physically co-located. Both were installed with the Berkeley BOINC release (on different occasions). I have never moved files between them.

Looks to me like they're separate. They just have very similar id's: 514770 and 514470.

puh32
Message 103899 - Posted: 20 Jan 2017 | 8:50:10 UTC - in response to Message 103898.

>Looks to me like they're separate. They just have very similar id's: 514770 and 514470.

Ouch... stupid of me! (I did, in fact, misunderstand something :)

Thanks!

River~~
Message 103906 - Posted: 20 Jan 2017 | 11:46:38 UTC - in response to Message 103893.
Last modified: 20 Jan 2017 | 12:06:24 UTC

>I've already posted some info about this topic on the forum, so I'll be short here.

>Both server and client keep an RPC sequence number. When the client contacts the server and sends an incorrect sequence number, the server assumes that something went wrong on the client and forces its reinitialization (generation of a new cpid - see below). All of the client's tasks are marked as abandoned BY THE SERVER, and you CANNOT recover them. (Unless you have set the "allow_multiple_clients" option on the client; in that case the server will assign you a new computer ID and MAY, but is not guaranteed to, keep your tasks.)


Two thoughts arise from this.

Years ago the advice from CPDN was to back up the main boinc directory every week. With their very long-running tasks this protected you, and that project, against failures where the files got trashed (including HDD failure). The advice, if I remember rightly, was to take a tarball or zip file of the entire directory and store it on another machine, USB drive, etc.

Sounds to me as if that strategy is no longer applicable? If you did restore the entire folder, next time the client contacted the server there would be a sequence number error?

Secondly, running a VM. This means that rollback of the VM does not just mean you re-crunch the same numbers, it also means that if the client has talked to the server then you lose all the unreported work.




>The sequence number is stored in client_state.xml, so the problem may appear when:
>- this file (in Boinc data directory) was copied from another machine, restored from backup, or corrupted after crash;
>- on Windows, bad antivirus software may somehow affect update of this file;

>The second problem is the identification of computers. There is an entry in client_state.xml:

><host_cpid>112233445566778843e754e1727072f8</host_cpid>

>This is the unique ID of your computer. THIS is the value the server uses to distinguish between computers, not the short host ID. The UID is created by Boinc, and the rules for creating it differ from version to version. Originally it was based on the MAC address; recent versions of the client added the path to the data directory to the hash.

Advice given in the past (maybe ten years ago) on Einstein@home was that you could force a new cpid by setting this value to zero.

In the case of a new install, presumably this value is zero or unset (and unset may or may not be the same as zero).

When in addition to the cpid being zero/unset, most of the files do not exist, it seems to me better to generate a universally unique id (by including creation time in the hash) rather than risk landing on top of another machine.
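
As an illustration of that suggestion, here is a Python sketch that mixes a creation timestamp into the hash. It is emphatically not BOINC's own CPID function, which I have not seen: the inputs, their order, the separator and the choice of MD5 (picked only because it yields 32 hex digits, the same length as the example value quoted above) are all assumptions.

```python
import hashlib


def make_candidate_cpid(mac_address: str, data_dir: str, created_at: float) -> str:
    """Return a 32-hex-digit identifier for a freshly installed client.

    Hypothetical scheme: hash the MAC address (possibly empty if it could
    not be determined), the data directory path, and the time at which the
    identifier is created. Including the time keeps two otherwise identical
    installs apart even when the MAC address lookup fails on both.
    """
    material = f"{mac_address}|{data_dir}|{created_at!r}".encode("utf-8")
    return hashlib.md5(material).hexdigest()
```

A caller would typically pass time.time() as created_at, once, at first start, and store the result; recomputing it later would of course give a different value.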

There clearly IS a way to disambiguate, as by experiment I have found that once this problem occurs, multiple updates, at about one minute intervals, from each box alternately, eventually makes the server notice the difference. Once the difference is noticed, the server correctly puts each task under the correct computer. Internal records on the server? More likely, it eventually decides to trust what the clients are saying about their tasks in progress. If the latter, then I would prefer it to do that from the first.

By the way, this ability to split the wrongly merged records on the server makes me wonder if there isn't a third, as yet unidentified, factor in what is happening. If your account were the whole story, and if I understand you correctly, this simply would not work as it does.

EDIT: to insert next paragraph

One proof that the server does eventually figure out that the tasks were never on that machine comes from my first ever experience of this problem, back in the 2016 Pi Paddy challenge. This machine clearly has some real values for the DCF and the fraction of time a machine is on, but apparently has never contacted the server (number of times shown as zero).




>When deploying clients on similar hardware or cloned virtual machines, make sure the client has generated a unique CPID on each host!


how?

On Linux the boinc-client service starts automatically on install. By the time you stop it, boinc has already done its thing and klutzed one of your other machines if it is going to.

Of course, you could be unconnected from the internet at this time. This means the usual Linux pattern of download and install in one command has to be split into two steps, and unplug the LAN cable between.

And what value do I put in? Can I make up any longish hex hash of the right number of digits? Or can I use any number of hex digits?


Once I do set up my own hash in cpid, will the client at some future time decide it is "wrong" for the environment it finds itself in? (Clearly it will be, as I haven't used the client's own hash function)


>For example, Boinc may for some reason fail to determine the MAC address of the LAN adapter on Linux, and it will not say a word about it. Since the path to the data directory is usually the same on all Linux installations, you'll be in trouble.

That is a bug, in my opinion. It does fit with all the cases I remember seeing -- I have run Boinc on Windows, but on reflection I am moderately sure all the problems have been with Linux machines.

Is the fact that it can tell a Linux installation from a Windows one on the same hardware down to the MAC address being different on one of them?

thanks for your attentive reply
River~~

River~~
Message 103908 - Posted: 20 Jan 2017 | 11:55:02 UTC

@Michael, @Jim, @Stream

As Stream has made it clear that there is no prospect of the SOB tasks being "dis-abandoned", what do you think I should do with them?

Abort them, losing credit worth about 20% of a single SOB for the work that is already crunched?

Run them to completion, when PG will credit me with the full value of the tasks (assuming they validate!) but wasting the possibility that I could have been crunching something else more useful to the project?



Get 100% credit for wasting time, or get 80% of that for doing something useful?

What would each of you do in a similar situation?

R~~

River~~
Message 103928 - Posted: 20 Jan 2017 | 16:09:30 UTC - in response to Message 103906.

I realised since posting that one paragraph I wrote is untrue, and it is too late for me to edit. Where I said


On Linux the boinc-client service starts automatically on install. By the time you stop it, boinc has already done its thing and klutzed one of your other machines if it is going to.

this is nonsense, of course. I don't know why I said it.

At that time the client has not yet connected to PrimeGrid -- that cannot happen till after the attach phase.

So the workaround looks like this:

1. After the install, stop the client. The command will be one of the following:

/etc/init.d/boinc-client stop
service boinc-client stop
systemctl stop boinc-client

for distros using sysv, upstart, and systemd respectively.

2. Then edit in an appropriate cpid (I am still asking for advice as to what is appropriate there)

3. At the same time edit any desired changes to the *.cfg files

4. Restart the client using the same syntax as in 1, replacing "stop" with "start"

5. Wait some twenty or thirty seconds in case the client exits

6. Same command again but with "status" to check it is still running

7. boinccmd --get_messages|less to read the client messages (or do so from manager, if installed)
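The stop, start and status commands in steps 1, 4 and 6 differ only in the verb, so they can be sketched as one small helper. This is only an illustration: the function name service_command and the labels "sysv", "upstart" and "systemd" are made up for the example, and the three command shapes are simply the ones listed in step 1. It builds the command but does not run it.

```python
# Hypothetical helper: build the boinc-client service command for a given
# init system. The command shapes are copied from step 1 of the workaround.
def service_command(init_system, action):
    if action not in ("stop", "start", "status"):
        raise ValueError("unsupported action: " + action)
    commands = {
        "sysv": ["/etc/init.d/boinc-client", action],
        "upstart": ["service", "boinc-client", action],
        "systemd": ["systemctl", action, "boinc-client"],
    }
    if init_system not in commands:
        raise ValueError("unknown init system: " + init_system)
    return commands[init_system]


if __name__ == "__main__":
    # Print the three commands a systemd distro would use in steps 1, 4 and 6.
    for verb in ("stop", "start", "status"):
        print(" ".join(service_command("systemd", verb)))
```

The returned list could be handed to subprocess.run by anyone who wants to automate the sequence, with the twenty to thirty second wait of step 5 between "start" and "status".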

R~~

River~~
Message 103931 - Posted: 20 Jan 2017 | 17:41:16 UTC

update:

both the abandoned ESP tasks completed. Five minutes after upload and report, the second of these still showed abandoned. Forty mins after reporting, it was showing as completed and awaiting validation.

This is because after my experiences in March, a script was created by the project to pick up such tasks after they are reported, and dis-abandon them. This script runs periodically, rather than on upload.

That is good, and I was glad of the re-instated credit the first time that script was run. Thanks again everyone who was part of that.

I may appear to be like Oliver Twist, asking for more.

What I am asking for is for this "dis-abandonment" process to be triggered when a trickle comes in for an apparently abandoned task as well as for final reporting. Naively, I am assuming this is a smallish tweak to the SQL in the existing script (and I may be absolutely wrong about that assumption).

R~~

Michael Goetz
Project donor
Volunteer moderator
Project administrator
Joined: 21 Jan 10
Posts: 13682
ID: 53948
Credit: 304,549,179
RAC: 210,282
Message 103933 - Posted: 20 Jan 2017 | 18:09:40 UTC - in response to Message 103931.

Regarding fixing abandoned jobs...

The cron job that fixes them runs once an hour. It will only fix them after your computer returns a result.

It's not practical for the task to be unabandoned when your computer reports the task as completed, sorry.

No comment yet on trickles. We're looking at it.
____________
My lucky number is 75898524288+1

River~~
Message 103941 - Posted: 20 Jan 2017 | 23:23:30 UTC - in response to Message 103933.

Regarding fixing abandoned jobs...

The cron job that fixes them runs once an hour. It will only fix them after your computer returns a result.

It's not practical for the task to be unabandoned when your computer reports the task as completed, sorry.


Once an hour is fine by me; a daily sweep would be OK too in fact. Waiting a fraction of an hour, or even a few more hours, is neither here nor there, IMO.


No comment yet on trickles. We're looking at it.


Thanks for considering it. Again I am asking for a daily or hourly cron job, not asking for instant revival on trickle (which clearly would be a LOT more work for you guys).

R~~

JimB
Project donor
Honorary cruncher
Joined: 4 Aug 11
Posts: 916
ID: 107307
Credit: 974,532,191
RAC: 0
Message 103943 - Posted: 21 Jan 2017 | 0:41:06 UTC - in response to Message 103941.

Thanks for considering it. Again I am asking for a daily or hourly cron job, not asking for instant revival on trickle (which clearly would be a LOT more work for you guys).

R~~

Actually, handling it instantly as it trickles is by far the easiest way to do it. We don't keep trickle messages - once they're handled, they're deleted. Every trickle is logged, so we can always go back and see what happened.

Anyway, the trickle handler has now been modified to unabandon jobs. At the same time, it'll push back the expiration date if needed (if the expiration is prior to the maximum deadline, the job has made progress, and it's due to expire in less than a week). On the other hand, if an abandoned job has already passed the maximum deadline, it will be changed from abandoned to "no reply". That way if the client finishes before the job is purged, the server will accept it without the need for the "fix abandoned jobs with uploads" cron.

A potential complication is that the trickle handler has always worked only if the trickle comes from the same hostid as the server assigned the job to. This behavior is unchanged. So if your hostid is changed and the server doesn't have that particular job on that host, the trickle will be logged but ignored.
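The rules described in this post can be written down as a small decision function. This is a sketch of the description only, not the project's actual trickle handler: the function name, the dictionary keys and the action labels are all invented for illustration, and it assumes that the expiration check is skipped once an abandoned job has been switched to "no reply", which the post does not spell out.

```python
from datetime import datetime, timedelta


def trickle_actions(job, trickle_hostid, now, made_progress):
    """Return the actions taken for one trickle, following the post above.

    job is a hypothetical dict with keys "hostid", "status",
    "expiration" and "max_deadline" (the last two are datetimes).
    """
    # Every trickle is logged, whatever happens next.
    actions = ["log"]
    # A trickle from a host other than the assigned one is logged but ignored.
    if trickle_hostid != job["hostid"]:
        return actions
    abandoned = job["status"] == "abandoned"
    # Abandoned and already past the maximum deadline: becomes "no reply".
    if abandoned and now > job["max_deadline"]:
        actions.append("set_no_reply")
        return actions  # assumption: no expiration change in this case
    if abandoned:
        actions.append("unabandon")
    # Push back the expiration only if all three stated conditions hold.
    expires_soon = job["expiration"] - now < timedelta(days=7)
    if job["expiration"] < job["max_deadline"] and made_progress and expires_soon:
        actions.append("extend_expiration")
    return actions
```

How far the expiration is pushed back is not stated in the post, so the sketch only reports that an extension would happen.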

Any new jobs that were created when yours was abandoned are unaffected. In the case of long jobs like SoB, many of them time out or are cancelled before work begins. There are very few instances where jobs are cancelled here and this doesn't seem like it should be one of them.

stream
Volunteer moderator
Project administrator
Volunteer developer
Volunteer tester
Joined: 1 Mar 14
Posts: 938
ID: 301928
Credit: 513,433,159
RAC: 2,554
Message 103945 - Posted: 21 Jan 2017 | 1:13:16 UTC - in response to Message 103906.


Two thoughts arise from this.

years ago the advice from CPDN was to back up the main boinc directory every week. With their very long running tasks this protected you, and that project, against failure where the files got trashed (including HDD failure). Advice, iirr, was to take a tarball or zip file of the entire directory and store on another machine, USB drive, etc.

Sounds to me as if that strategy is no longer applicable? If you did restore the entire folder, next time the client contacted the server there would be a sequence number error?

Correct. This strategy worked fine many years ago, when everybody had a single-core computer crunching only one task without trickles, i.e. the client didn't contact the server between backups. Now it doesn't work. I'm running a farm of diskless network-booted crunchers. To decrease network load, they run on a RAM disk which is backed up to the server once per hour. I get abandoned tasks whenever a machine happens to reboot after it has contacted the server (e.g. because one of the cores completed a task) since the last backup.


The second problem is the identification of computers. There is an entry in client_state.xml:

<host_cpid>112233445566778843e754e1727072f8</host_cpid>

This is the unique ID of your computer. THIS is the value the server uses to distinguish between different computers, not the short host ID. The UID is created by BOINC, and the rules for creating it differ from version to version. Originally it was based on the MAC address; recent versions of the client added the path to the data directory to this hash.
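A minimal sketch of pulling that value out of the file, assuming the element appears on one line as in the example above. The function name extract_host_cpid is made up for illustration, and the function takes the file contents as a string so that no particular data directory path is assumed.

```python
import re

# Matches the <host_cpid> element as it appears in the example above.
HOST_CPID_RE = re.compile(r"<host_cpid>\s*([0-9a-fA-F]+)\s*</host_cpid>")


def extract_host_cpid(state_text):
    """Return the host_cpid found in client_state.xml text, or None."""
    match = HOST_CPID_RE.search(state_text)
    if match is None:
        return None
    # Lower-case so values from different machines compare consistently.
    return match.group(1).lower()
```

Reading the file into a string first (for example with open(...).read() on a copy of client_state.xml) keeps the sketch independent of where a given distro puts the BOINC data directory.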


advice given in the past (maybe ten years ago) on Einstein@home was that you could force a new cpid by setting this value to zero.


You cannot make a NEW cpid, you can only REGENERATE it (by forcing it to zero, deleting the line from client_state.xml, whatever). The cpid is CONSTANT for a given hardware/software combination. It's a hash of the MAC address, the path to the data directory, and maybe something else (I haven't looked at the source here). Problems appear when BOINC generates equal CPIDs on different systems, which could be caused by VM cloning, or when BOINC fails to determine the MAC address.


There clearly IS a way to disambiguate, as by experiment I have found that once this problem occurs, multiple updates, at about one minute intervals, from each box alternately, eventually makes the server notice the difference. Once the difference is noticed, the server correctly puts each task under the correct computer. Internal records on the server? More likely, it eventually decides to trust what the clients are saying about their tasks in progress. If the latter, then I would prefer it to do that from the first.

The server trusts data sent by the client even if the CPID does not match the real hardware. But if an RPC sequence error is detected, the server will force re-generation of the CPID on the client. That is probably what happened in your case: initially they somehow had the same identity, but after the first RPC error a re-generation was requested, and since their hardware was different they were then assigned unique CPIDs and became two different systems for the server.


When deploying clients on similar hardware or cloned virtual machines, make sure the client has generated a unique CPID on each host!


how?


When a new CPID is (re)generated, there will be a line in the log (something like "Generated new computer ID: ..."). In all other cases, check client_state.xml.

And what value do I put in? Can I make up any longish hex hash of the right number of digits? Or can I use any number of hex digits?

Once I do set up my own hash in cpid, will the client at some future time decide it is "wrong" for the environment it finds itself in? (Clearly it will be, as I haven't used the client's own hash function)

Technically, it's 32 hex digits. But you should not change it manually. Although the client will trust you while everything is working without errors, it will rewrite the CPID at the server's command if an RPC error ever happens. Just be sure that the CPID is not the same as on another suspiciously similar system, such as a cloned VM.
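For anyone inspecting the value by eye, the "32 hex digits" format is easy to check mechanically. The helper below is only a format check under that one stated rule; the name looks_like_cpid is invented for the example, and passing the check says nothing about whether the client or server would accept the value.

```python
import re


def looks_like_cpid(value):
    """True if value is exactly 32 hexadecimal digits."""
    return re.fullmatch(r"[0-9a-fA-F]{32}", value) is not None
```

The sample value quoted earlier in this post passes the check, while a zeroed-out single "0" does not.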

Ken_g6
Project donor
Volunteer developer
Joined: 4 Jul 06
Posts: 923
ID: 3110
Credit: 223,458,475
RAC: 27,288
Message 103947 - Posted: 21 Jan 2017 | 1:19:31 UTC

I have an "abandoned" task on one computer that's been running continuously the whole time the task was issued. I can only assume it never received the task:

https://www.primegrid.com/result.php?resultid=768716326

River~~
Message 103961 - Posted: 21 Jan 2017 | 16:56:35 UTC - in response to Message 103947.
Last modified: 21 Jan 2017 | 17:25:08 UTC

I have an "abandoned" task on one computer that's been running continuously the whole time the task was issued. I can only assume it never received the task:

https://www.primegrid.com/result.php?resultid=768716326


hi Ken,

this looks like a ghost task to me, not an abandoned one.

Ghost task: does not exist on user machine, logged as existing on server, often (as in yours) status = new

Abandoned task: shown on server with status = abandoned, AND one of the following
does exist on user machine
completed on user machine since it was marked as abandoned and has been reported and accepted since it was first marked as abandoned

It is important to tell these apart, as the advice to the user is different in the two cases.
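The two definitions above can be restated as a tiny classifier. It is a sketch of those definitions only: the function name and the returned labels are invented for the example, and it treats "new" as the only ghost status because that is the one mentioned above.

```python
def classify_task(server_status, on_client, reported_since_abandoned=False):
    """Classify a task using the ghost / abandoned definitions above."""
    if server_status == "abandoned":
        # Still on the user's machine, or finished and reported after being
        # marked abandoned: the label on the server was a mistake.
        if on_client or reported_since_abandoned:
            return "mislabelled abandoned"
        return "really abandoned"
    # Logged on the server but never arrived on the user's machine.
    if server_status == "new" and not on_client:
        return "ghost"
    return "other"
```

For example, classify_task("new", on_client=False) gives "ghost", which matches the task Ken linked, while classify_task("abandoned", on_client=True) gives "mislabelled abandoned".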

Ghost tasks do not hurt the user: they eventually time out and get sent to someone else. Meanwhile the user's box has been crunching other stuff.

Small caveat: on other projects which do double checking for some people, having ghost units will reduce your "trust factor" slightly, and mean more of your work is double checked for a while. This has not applied on PG since September 2016.

Abandoned tasks give the user a dilemma. The task has been (or will be) re-sent. If the computer has nearly finished the task it is worth letting it complete and report, as then the user gets full credit (especially important during a challenge). On the other hand, the task is now one of three, not one of two. If it has not started to crunch at all, please abort it, so that you get credit for some other task that is useful to the project.

In between those extremes, there is a judgment call to make: if you abort it you lose any credit for what has already been done, if you allow it to complete you are saving your credit but by doing work that is likely to be of little benefit to the project. Where you cross over from one to the other is up to you....

Both abandoned tasks and ghost tasks arise from glitches in the communication between server and client, but the glitches are different. Abandoned task errors are particularly likely to arise when a user adds a new machine to their crunching farm, and it is mistaken for an already-existing one. Stream has noted (in this thread, above) another way they can arise.

Hope that clarifies things.

R~~

River~~
Message 103962 - Posted: 21 Jan 2017 | 17:11:27 UTC - in response to Message 103945.

...
The server trusts data sent by the client even if the CPID does not match the real hardware. But if an RPC sequence error is detected, the server will force re-generation of the CPID on the client. That is probably what happened in your case: initially they somehow had the same identity, but after the first RPC error a re-generation was requested, and since their hardware was different they were then assigned unique CPIDs and became two different systems for the server.


yes, but what I find confusing about your explanation is that the two machines ALWAYS had very different hardware, AND a different CPU.

So why does the cpid come out identical the first time a machine connects, and then differ a few connections later? It seems to me (and this is my guess, not based on reading the code) that the server must have sent some message saying "please choose a different cpid" to one or other of the machines, probably the new one as the old machine keeps all the history.

Maybe my guess is wrong, but if so I do not understand how a cpid that is deterministic based on the hardware & software can suddenly change after a few connections, and why (in my past experience) that change can be encouraged by the technique of updating from one then the other machine alternately.

Coming back to your advice to make sure the new machine has a different cpid.

With almost 20 machines here, do you mean you are suggesting I manually compare a new cpid against all the others?

I can do that of course, I can script the test. But then, if it clashes, what do I replace it with? Zero and try again?
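A rough sketch of what such a scripted test could look like, assuming the cpid of each machine has already been collected into a mapping of host name to cpid. The function name find_cpid_clashes and the host names in the example are hypothetical, and how the values are gathered from each box is left open.

```python
from collections import defaultdict


def find_cpid_clashes(cpid_by_host):
    """Return {cpid: [hosts]} for every cpid shared by more than one host."""
    hosts_by_cpid = defaultdict(list)
    for host, cpid in cpid_by_host.items():
        # Compare case-insensitively, since the value is a hex string.
        hosts_by_cpid[cpid.lower()].append(host)
    return {
        cpid: sorted(hosts)
        for cpid, hosts in hosts_by_cpid.items()
        if len(hosts) > 1
    }
```

An empty result means every machine in the mapping reported a distinct cpid; any entry names the boxes that would need a closer look.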

If it is deterministic, why would it come up with something different a second time?

Sorry, but I am not understanding how to implement your advice in practice.

If I DID create my own cpid, would I have to worry about the possibility of it clashing with somebody else's machine? Or does it just have to be unique amongst my own machines?

Regards,

River~~

River~~
Message 103965 - Posted: 21 Jan 2017 | 17:23:56 UTC - in response to Message 103945.

... I'm running a farm of diskless network-booted crunchers.


me too - at present six of mine are diskless and eventually nine or ten will be


To decrease network load, they run on a RAM disk which is backed up to the server once per hour. I get abandoned tasks whenever a machine happens to reboot after it has contacted the server (e.g. because one of the cores completed a task) since the last backup.

or indeed one of the tasks trickled

mine are not backed up at all - if a machine crashes then the work is _correctly_ shown as abandoned when that box reboots. Trouble is, sometimes it spots it as the wrong box, and some work is abandoned which should not have been. Then eventually one crash leads to two boxes with abandoned work. (Hopefully Jim's fix will help ease that pain)

I have started to prototype a backup strategy, based on manual backups to USB, which has worked for me so far. It is good to know what to expect before I go too far down the road of scripting an automated version.

Have you experienced the problem where the tasks that are marked as abandoned are on a different box to the one that actually crashed out?

River~~

River~~
Message 103966 - Posted: 21 Jan 2017 | 17:28:26 UTC - in response to Message 103943.

hi JimB

...Anyway, the trickle handler has now been modified to unabandon jobs. At the same time, it'll push back the expiration date if needed (if the expiration is prior to the maximum deadline, the job has made progress, and it's due to expire in less than a week). On the other hand, if an abandoned job has already passed the maximum deadline, it will be changed from abandoned to "no reply". That way if the client finishes before the job is purged, the server will accept it without the need for the "fix abandoned jobs with uploads" cron.


BIG THANKS :)

River~~

stream
Message 103978 - Posted: 22 Jan 2017 | 8:10:11 UTC - in response to Message 103962.

So why does the cpid come out identical the first time a machine connects, and then differ a few connections later? It seems to me (and this is my guess, not based on reading the code) that the server must have sent some message saying "please choose a different cpid" to one or other of the machines, probably the new one as the old machine keeps all the history.

Yes, I mentioned this above. The server really does send this kind of message. An old machine will generate the same CPID (because it has the same hardware), so the server can match this host with its old records. A new machine should generate a new CPID, and the host will get a new record on the server.

Maybe my guess is wrong, but if so I do not understand how a cpid that is deterministic based on the hardware & software can suddenly change after a few connections, and why (in my past experience) that change can be encouraged by the technique of updating from one then the other machine alternately.

I don't remember the details of the source code, but there could be some gap allowed between the RPC sequence numbers on client and server. At the very least an off-by-one error must be allowed: a client may not receive a reply from the server, so its sequence number will be less than the server's. This is a normal situation in all data transfer protocols. It could explain why a few connections were necessary.

With almost 20 machines here, do you mean you are suggesting I manually compare a new cpid against all the others?

Under normal circumstances, you should not do this. But if you find more "abandoned" tasks with no known reason, this is one of the possible places to look.

But then, if it clashes, what do I replace it with? Zero and try again?

Stop the client and delete the host_cpid line from client_state.xml; the client will regenerate it after restart.

Please note that what I'm suggesting is a troubleshooting path, not a thing that you must do every time. If it's working normally, don't touch it.
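As a sketch of that troubleshooting step, the function below drops the host_cpid line from the text of client_state.xml. It assumes the element sits on a single line, as in the example earlier in the thread, and the name strip_host_cpid is made up for illustration. It works on a string, so the client must already be stopped and the result written back by whoever uses it, ideally after keeping a copy of the original file.

```python
def strip_host_cpid(state_text):
    """Return client_state.xml text with any <host_cpid> line removed."""
    kept = [
        line
        for line in state_text.splitlines(keepends=True)
        if "<host_cpid>" not in line
    ]
    return "".join(kept)
```

All other lines, including their line endings, are passed through unchanged.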

If it is deterministic, why would it come up with something different a second time?

We're talking about two different scenarios:

1. The BOINC data directory was accidentally copied from another machine. In this case, the client will generate a new CPID based on the new hardware and everything will work smoothly from then on.

2. Boinc generated same CPID on different machines, on real physical hardware. This is bad, it could mean a bug in client (failed to collect enough data to generate unique hash). It should never happen and cannot be solved normally here on your side. If you'll encounter this situation, try to update client to most recent version or ask on Berkeley forums.

If I DID create my own cpid, would I have to worry about the possibility of it clashing with somebody else's machine? Or does it just have to be unique amongst my own machines?

As far as I remember from the source, only your own machines count.

Have you experienced the problem where the tasks that are marked as abandoned are on a different box to the one that actually crashed out?

Never, even on boxes with exactly the same hardware (differing by MAC address only).

River~~
Joined: 17 Mar 07
Posts: 342
ID: 6533
Credit: 15,792,075
RAC: 0
Message 103983 - Posted: 22 Jan 2017 | 11:50:38 UTC - in response to Message 103978.
Last modified: 22 Jan 2017 | 11:59:53 UTC


Please note that what I'm suggesting is a troubleshooting path, not a thing that you must do every time. If it's working normally, don't touch it.


I no longer trust it to "work normally".

When this first happened in March 2016, believing that it was the MAC address on the USB ethernet dongle that had caused the ambiguity, I went out and bought separate ethernet dongles. Either the Linux Boinc client is unable to see those MAC addresses or it is ignoring them as shown by the recent incident in my OP.

Running ifconfig from root, the hwaddr is listed ok, so it could be either because the client asks after it drops into the boinc uid, or maybe is deliberately ignoring it because it knows it is a USB connection and therefore potentially mutable. The second explanation does not explain why my desktop diskless machines suffer from this.

I am not so much troubleshooting as trying to create a viable workaround.

It is not only the abandoned tasks, it is also the fact that the machine comes up with the wrong host location (ie the host location for the machine it has stolen the identity of). So what it downloads as soon as it connects to PG depends on who it randomly decides to be. Clearly both issues are symptoms of a mistaken attempt to identify the hardware.

My inclination, as a workaround, is to ask the uuid daemon to make me a unique number, which after a quick sed filter is a 32 digit hex number, and to do this EVERY time I re-install on a blank machine.

That means I will need to merge together instances of the machines using the website, but I am in control of that and know which ones belong together and which don't.
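
A minimal sketch of generating such a number. It uses Python's standard `uuid` module instead of the uuid daemon plus sed filter mentioned above, so it is a stand-in for that pipeline, not a copy of it. The function name `new_cpid` is hypothetical.

```python
import uuid

def new_cpid() -> str:
    """Return a random 32-digit lowercase hex string (a UUID4 without the dashes)."""
    return uuid.uuid4().hex

cpid = new_cpid()
print(cpid)
```

Because the value is random and not derived from hardware, it will differ on every re-install, which is why the merge step on the website is still needed afterwards.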


We're talking about two different scenarios:

1. The Boinc data directory was accidentally copied from another machine. In this case, the client will generate a new CPID based on the new hardware and everything will work smoothly from then on.

2. Boinc generated the same CPID on different machines, on real physical hardware. This is bad; it could mean a bug in the client (it failed to collect enough data to generate a unique hash). It should never happen and cannot normally be solved on your side....


In this thread I have only been talking about your scenario 2. Really different physical hardware with really different network cards and other physical differences still get lumped together.

I have spent time and money (buying extra USB dongles and labelling each one with the machine it belongs to) hoping to give Boinc the clues it needs, but it does not reliably pick up those clues.

Different cpus can be lumped together.

Different kernels can be lumped together (actually that is sensible as Linux people do update our kernels quite often)

Different machine names can be lumped together

Different LAN addresses can be lumped together (actually this is sensible as I might be running DHCP without reserving IP addresses)

Different external addresses can be lumped together (actually sensible because the laptops occasionally update from an internet cafe or via the neighbours (with their permission!) when my own connection is down)



... If you encounter this situation, try updating the client to the most recent version or ask on the Berkeley forums.


I have just subscribed to the boinc_dev mailing list to ask about this problem and am waiting for my sub to be accepted (they have a manual anti-spam filter for first posts). I have not done so before because there was always the thought that I was doing something to provoke it, but this time I am sure this is not so. (You may remember I have been seeing this issue since the Pi Paddy challenge in 2016)

Do you think the forum at https://boinc.berkeley.edu/dev/forum_index.php is more appropriate before going to the mailing list?

River~~

River~~
Joined: 17 Mar 07
Posts: 342
ID: 6533
Credit: 15,792,075
RAC: 0
Message 103998 - Posted: 22 Jan 2017 | 15:53:31 UTC
Last modified: 22 Jan 2017 | 15:58:13 UTC

@stream

do you happen to know the purpose of <external_cpid> which is set empty on my systems?

I mean there is a line in every config.xml saying

<external_cpid></external_cpid>


and I do not remember seeing it set to any non-empty value

R~~

stream
Volunteer moderator
Project administrator
Volunteer developer
Volunteer tester
Joined: 1 Mar 14
Posts: 938
ID: 301928
Credit: 513,433,159
RAC: 2,554
Message 104018 - Posted: 22 Jan 2017 | 22:19:25 UTC - in response to Message 103998.

@stream

do you happen to know the purpose of <external_cpid> which is set empty on my systems?

I mean there is a line in every config.xml saying

<external_cpid></external_cpid>


and I do not remember seeing it set to any non-empty value

char external_cpid[MD5_LEN]; // the "external" user CPID (as exported to stats sites)

It's generated on the server; the PG server probably does not support it. It's a hash of some other cross-project user ID and the user's email. It's not used anywhere in the source and is not related to your situation (it's a user ID, not a computer ID).

stream
Volunteer moderator
Project administrator
Volunteer developer
Volunteer tester
Joined: 1 Mar 14
Posts: 938
ID: 301928
Credit: 513,433,159
RAC: 2,554
Message 104046 - Posted: 23 Jan 2017 | 13:49:55 UTC - in response to Message 103983.


In this thread I have only been talking about your scenario 2. Really different physical hardware with really different network cards and other physical differences still get lumped together.


TL;DR: see conclusions at the end.

I looked in the source to find out how CPID is generated. I wish I didn't. Now I know that I am in trouble :)

The CPID is a simple MD5 hash of the MAC address and, in recent versions, the path to the data directory. An interesting point is how the MAC address is obtained. Of course it's platform-specific, so we'll discuss the Linux implementation.

The client queries the list of adapters using an IOCTL call. It stops when the end of the list is reached, or when a network interface named like "eth..." is encountered.

Why I'm in trouble: at least one of my Ubuntu installations has no eth0 interface. It's named p2p1. So Boinc will never stop early; it will scan to the end of the list and return the MAC address of the last adapter in the list.

If the order of interfaces is "lo", "p2p1", it will work: the function will return the MAC address of the last adapter. In my case the list of adapters is "lo", "p2p1", "ppp0", so it'll return either some garbage or an error, because it checks for "0:0....:0" and considers this an error. In case of error, the client generates a random CPID.
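
The behaviour described above can be modelled in a few lines. This is a simplified sketch of the description in this post, not the actual Boinc client code: the function names, the MAC values, the data directory path and the exact way the MAC and path are combined before hashing are all assumptions made for illustration.

```python
import hashlib
import secrets

ZERO_MAC = "00:00:00:00:00:00"

def pick_mac(interfaces):
    """Model of the scan: stop at the first 'eth...' name, else end on the last entry."""
    chosen = None
    for name, mac in interfaces:
        chosen = mac
        if name.startswith("eth"):
            break
    # An all-zero (or missing) MAC is treated as an error.
    if chosen is None or chosen == ZERO_MAC:
        return None
    return chosen

def make_cpid(interfaces, data_dir):
    """MD5 of MAC plus data directory, or a random value if no usable MAC was found."""
    mac = pick_mac(interfaces)
    if mac is None:
        return secrets.token_hex(16)  # random fallback, differs on every call
    # ASSUMPTION: simple string concatenation; the real client may combine them differently.
    return hashlib.md5((mac + data_dir).encode()).hexdigest()

# Hypothetical interface lists in scan order.
case_eth = [("lo", ZERO_MAC), ("eth0", "aa:bb:cc:dd:ee:01"), ("wlan0", "aa:bb:cc:dd:ee:02")]
case_p2p = [("lo", ZERO_MAC), ("p2p1", "aa:bb:cc:dd:ee:03")]
case_ppp = [("lo", ZERO_MAC), ("p2p1", "aa:bb:cc:dd:ee:03"), ("ppp0", ZERO_MAC)]

for label, case in [("eth", case_eth), ("p2p", case_p2p), ("ppp", case_ppp)]:
    print(label, pick_mac(case), make_cpid(case, "/var/lib/boinc-client"))
```

In this model the first two lists give a stable CPID, while the third falls through to the random branch because the last adapter has no real MAC address.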

Now let's look at the server side. And... the CPID is used only when the client was detached or reinstalled, i.e. on its first connection to the server. In case of an RPC sequence error, the CPID is not used at all! The code immediately jumps to the following:
// One final attempt to locate an existing host record:
// scan backwards through this user's hosts,
// looking for one with the same host name,
// IP address, processor and amount of RAM.

On my GFN server I see a lot of Linux systems which report an IP address of 127.0.0.1. If they also happen to have similar hardware and the same host name (e.g. booted from the same network image, with a host name that was not generated automatically), the Boinc server will incorrectly match the wrong host with an existing one.
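
A rough model of that fallback match, to show how two physically different boxes can collapse into one record. This is a sketch built only from the source comment quoted above; the `Host` fields, the function name and the sample values are hypothetical, and the real server code may compare the fields differently.

```python
from dataclasses import dataclass

@dataclass
class Host:
    host_id: int
    name: str
    ip: str
    cpu: str
    ram_mb: int

def find_existing_host(known_hosts, incoming):
    """Scan backwards through a user's hosts for one with the same name, IP, CPU and RAM."""
    for host in reversed(known_hosts):
        if (host.name, host.ip, host.cpu, host.ram_mb) == (
            incoming.name, incoming.ip, incoming.cpu, incoming.ram_mb
        ):
            return host
    return None

# One diskless box is already known to the server.
known = [Host(101, "node", "127.0.0.1", "i5-4460", 8192)]

# A second, physically different box booted from the same image reports the same values.
clone = Host(0, "node", "127.0.0.1", "i5-4460", 8192)
# The same box after its host name has been made unique.
renamed = Host(0, "node-aabbcc", "127.0.0.1", "i5-4460", 8192)

print(find_existing_host(known, clone))    # matched to host 101: wrong identity
print(find_existing_host(known, renamed))  # None: a new record would be created
```

Note that the MAC address and the CPID play no part in this comparison, which is why making the host name or the reported IP address unique is enough to break the false match.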

Conclusions:

- Be sure that your crunchers have unique CPIDs;
- Be sure that Boinc has correctly determined the host IP address (NOT 127.0.0.1): check it in the log / client_state.xml / host status page on the server.
- Be sure that the host name is unique (in the case of network-booted systems, set it from the MAC address on startup, before the Boinc client is run).

River~~
Joined: 17 Mar 07
Posts: 342
ID: 6533
Credit: 15,792,075
RAC: 0
Message 104055 - Posted: 23 Jan 2017 | 23:08:03 UTC - in response to Message 104046.

hi Stream,

I have been tracing the same path as you!

Yes, on Ubuntu and Mint we now have network card names that are long and complicated. USB dongles are called enxzzzzzzzzzzzz where the z are the hex digits of the MAC address. PCI cards are enpzzzz where the z are which slot, etc, the card is plugged into. This is for good reasons: we used to have to do all kinds of trickery to make sure that the "right" card came up as eth0 and not as eth1 on a two card system. Now that side of things works better.

But, as you say, there are no network devices that match the traditional names. I do not know why they want to exclude other devices (like, why not take the MAC address from a wifi card, for example?).
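
To make the naming scheme concrete, here is a small sketch that recognises the enx style and recovers the MAC address from it. It is illustrative only and not a proposed patch to the Boinc client; both function names are hypothetical, and the prefix test in `looks_like_ethernet` is just one possible way a scanner could accept the newer names alongside "eth...".

```python
import re

def mac_from_ifname(ifname):
    """Return 'aa:bb:cc:dd:ee:ff' if ifname looks like enx<12 hex digits>, else None."""
    m = re.fullmatch(r"enx([0-9a-fA-F]{12})", ifname)
    if m is None:
        return None
    digits = m.group(1).lower()
    return ":".join(digits[i:i + 2] for i in range(0, 12, 2))

def looks_like_ethernet(ifname):
    """True for traditional 'eth...' names and for the newer 'en...' names."""
    return ifname.startswith(("eth", "en"))

for name in ["eth0", "enx0123456789ab", "enp3s0", "wlan0", "lo"]:
    print(name, looks_like_ethernet(name), mac_from_ifname(name))
```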


About the host IP addresses.

When you install Debian, Ubuntu, or Mint, if you do not connect to a network at install time, or if you use DHCP at install time, localhost goes in the hosts file as 127.0.0.1 and the domain name as 127.0.1.1 (the installers want to put the domain name into the hosts file, and you did not give them an address so this is their best attempt)


Later, Boinc seems to find the IP from domain name so I always get an IP listed as 127.0.1.1 -- I think I can avoid this by putting a proper address next to the domain name in the hosts file, but this is not yet tested....
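
A quick way to check whether a machine is in this state is to look up its own name in the hosts file and see whether the answer is a loopback address. The sketch below parses hosts-file text directly; the host name `mybox` and the sample content are hypothetical, and whether Boinc really resolves its address this way is the untested guess from the paragraph above, not something the code proves.

```python
import ipaddress

def hosts_entries(hosts_text):
    """Parse /etc/hosts-style text into a {name: address} dict (first match wins)."""
    table = {}
    for line in hosts_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and padding
        if not line:
            continue
        address, *names = line.split()
        for name in names:
            table.setdefault(name, address)
    return table

def resolves_to_loopback(hosts_text, hostname):
    """True if the hosts file maps hostname to any 127.x.x.x (or ::1) address."""
    addr = hosts_entries(hosts_text).get(hostname)
    return addr is not None and ipaddress.ip_address(addr).is_loopback

# Typical content after an install without a fixed address (host name is made up).
sample = """\
127.0.0.1   localhost
127.0.1.1   mybox.example.lan mybox
"""
fixed = sample.replace("127.0.1.1", "192.168.1.20")

print(resolves_to_loopback(sample, "mybox"))
print(resolves_to_loopback(fixed, "mybox"))
```

On a real machine the text would come from reading /etc/hosts, and the name from the machine's own host name.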

R~~


River~~
Joined: 17 Mar 07
Posts: 342
ID: 6533
Credit: 15,792,075
RAC: 0
Message 104058 - Posted: 23 Jan 2017 | 23:47:02 UTC

hi Jim, Michael

I have raised this issue on the Boinc dev mailing list.

I am now reasonably sure that although the symptoms on my laptops and my diskless desktops are similar, the causes are different.

In the case of the laptops, the devs have referred me to a server commit that might have avoided my problems last March and again now, assuming that the client has correctly picked up the vendor names of my two laptops (Asus and Lenovo). Basically it does a sanity check before treating two hosts as the same on the basis of cpid, checking that things like the vendor match.

I am in two minds about this patch - it would mean that changing the number of cpus would break continuity, and this is something that is possible for a virtual machine. So introducing this patch would mean that a user who changed the number of cpus would find any existing tasks listed as abandoned. (I guess they would not do so more than once!). Do we want to remove that option?

More seriously, I have noticed that the OS version is used for the kernel version in Linux. If this is how the info is held on the database, then this patch would mean that updating the kernel (which most Linux users do a few times a year) would break all running tasks.

Also the same might happen adding a service pack to Win 7 or 8, or updating Win 10 when Microsoft do a major update as for example Win 10 v1607.
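
The trade-off can be shown with a toy version of such a sanity check. This is a sketch of the idea as described above, not the actual server commit: the record layout, field names and version strings are invented for illustration.

```python
def same_host(old, new, check_fields):
    """Treat two records as the same host only if the CPID and every checked field match."""
    if old["cpid"] != new["cpid"]:
        return False
    return all(old.get(f) == new.get(f) for f in check_fields)

# Hypothetical host records sharing one (clashing) CPID.
before = {"cpid": "abc", "vendor": "Lenovo", "ncpus": 4, "os_version": "4.4.0-59"}
other_laptop = dict(before, vendor="Asus")
after_kernel_update = dict(before, os_version="4.4.0-62")

# Checking the vendor separates the two laptops despite the shared CPID.
print(same_host(before, other_laptop, ["vendor"]))
# Checking only the vendor keeps continuity across a kernel update.
print(same_host(before, after_kernel_update, ["vendor"]))
# Adding os_version to the check breaks continuity after the same update.
print(same_host(before, after_kernel_update, ["vendor", "os_version"]))
```

So the value of the check depends entirely on which fields it compares: stable ones such as the vendor help, while fields that legitimately change over a machine's life would turn routine updates into "new" hosts.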

And I do not see things like vendor name listed in the details for my hosts -- so I have no way of telling if these are actually picked up OK. Certainly there is no point doing this patch if you cannot see Lenovo and Asus listed in your own records for hosts 533322 and 527527.

And it would not solve the issue with my identical diskless desktops, because there really is nothing but IP and MAC addresses to disambiguate them.

So, I am passing on this suggestion from the Boinc dev mailing list, but am leaving it up to you whether you believe, on balance, it is worth implementing.

Meanwhile I am doing a tweak to the install scripts that I use here, to pre-set cpid before boinc gets to it. The devs seem to think this will work and will not break anything else.

R~~

River~~
Joined: 17 Mar 07
Posts: 342
ID: 6533
Credit: 15,792,075
RAC: 0
Message 104059 - Posted: 23 Jan 2017 | 23:56:58 UTC
Last modified: 23 Jan 2017 | 23:58:16 UTC

hi again Stream

at first I was confused by how your post relates to my diskless desktops.

They run Debian and still have the traditionally named eth0 network interface.

But something else in your post gave me a clue. You said that when it is not a new attach, it ignores cpid.

What I am doing at present on those workstations is presenting boinc with a directory that looks like the directory after the post install trigger has run BUT with three differences:

1. edit gui_rpc_auth.cfg to set password
2. edit remote_hosts.cfg to set list of allowed boinc-view controllers
3. add a four line account_www.primegrid.com.xml file with my weak auth in it

On the basis of your recent reading of the code, do you think that when the client gets a pre-existing but incomplete account_...xml file, but no cpid, it might skip to the same code you were describing? That would fit the symptoms I have seen on my diskless boxes.

R~~

River~~
Joined: 17 Mar 07
Posts: 342
ID: 6533
Credit: 15,792,075
RAC: 0
Message 104125 - Posted: 25 Jan 2017 | 18:28:25 UTC - in response to Message 104046.

hi again Stream

Could I ask you to report this as a bug to the Ubuntu devs, if you have not done so already?



...
The client queries the list of adapters using an IOCTL call. It stops when the end of the list is reached, or when a network interface named like "eth..." is encountered.

Why I'm in trouble: at least one of my Ubuntu installations has no eth0 interface. It's named p2p1. So Boinc will never stop early; it will scan to the end of the list and return the MAC address of the last adapter in the list.


This is a bug on Ubuntu and Mint, but not on Debian.

The reason I say that is that Debian still uses eth0-style interface names, but Ubuntu has gone over to the more modern interface names that avoid the problem that different interfaces will sometimes come up as eth0. The way Ubuntu does this is to use interface names that usually start with en and have six to thirteen more characters.

A USB dongle, for example, would be enx0123456789ab, where 0123456789ab is the MAC address.

PCI cards are enpxxxx, where xxxx describes the slot number and so on.

As far as I can see the binaries in the Ubuntu repository have not been changed from the relevant Debian versions. The bug is that they should have been changed to take into account the new possibilities for the interface names.

I could report this, but I do not directly use Ubuntu and you do. Also, you spotted the exact place in the code where the reference to eth0 is made, and can direct them to the right place more easily than I could.

Hence my request that you make the bug report. Please let me know, on this forum or by PM, whether you are (or are not) willing to do this.

regards
River~~


River~~
Joined: 17 Mar 07
Posts: 342
ID: 6533
Credit: 15,792,075
RAC: 0
Message 104416 - Posted: 31 Jan 2017 | 21:04:12 UTC

and a counter example, or rather two

The two abandoned tasks here really have got lost somehow, as I went round the houses getting my GPU sorted out. http://www.primegrid.com/results.php?hostid=534074&offset=0&show_names=0&state=0&appid=8

Crunching had begun but somehow the BOINC directory got re-initialised, yet BOINC managed to re-connect the two correctly in this instance. No complaints here :)
