Message boards : Generalized Fermat Prime Search : checkpoint file was saved by an older version of genefer.
Got a blackout -- two GFN-Shorts survived, the third one didn't. It restarted from zero with $Subject, which is silly -- it's clear there was no version difference (not to mention, the two other genefers didn't find any version difference). I checked the logs and found that this one was the only task the client sacrificed to give SZTAKI a chance to run. Coincidence? Bad luck? Bad wording? Nobody cares? Anything else?
____________
I'm counting for science,
Points just make me sick.
rroonnaalldd (Volunteer developer, Volunteer tester)
Joined: 3 Jul 09 Posts: 1213 ID: 42893 Credit: 34,634,263 RAC: 0
616143567 | 190301 | 26 Mar 2015, 22:34:37 UTC | 8 Apr 2015, 1:33:24 UTC | Aborted by user | 1,038,331.54 | 1,015,724.00 | --- | Genefer v3.07
Hmm. "Aborted by user" seems clear, but you are using the older BOINC client 6.10.58 and kernel 3.2.x. Neither should cause any trouble.
What do you mean by "got a blackout"?
____________
Best wishes. Knowledge is power. by jjwhalen
Blackout: sense number 3, a power outage. It wasn't a big one -- only about a minute.
____________
I'm counting for science,
Points just make me sick.
I've received this error message in stderr.txt while crunching GFN-WR tasks, immediately following a crash of either Windows 8.0 itself or possibly the GeForce driver. The task or tasks then start over. I did not investigate further and did not save a copy of the stderr.txt to paste here. I believe this has only occurred since the latest client version.
I've saved the whole slot. As far as I can see, it was totally unnecessary.
____________
I'm counting for science,
Points just make me sick.
Just checked: the two other GFN-Shorts validated just fine. In a couple of days the backed-up slot will be gone. In summary -- nobody sees, nobody cares.
____________
I'm counting for science,
Points just make me sick.
It may be that the blackout occurred while the third task was writing a checkpoint. When BOINC came back up, the checkpoint was deemed invalid, so BOINC started the task from the beginning. So I would say it was just bad luck. We all have bad tasks every so often.
As for that SZTAKI task, it was just an opportunist waiting in the shadows.
____________
Werinbert is not prime... or PRPnet keeps telling me so.
Badge score: 2x3 + 5x4 + 5x5 + 4x7 + 1x8 + 1x9 + 3x10 = 126
It may be that the blackout occurred while the third task was writing a checkpoint. When BOINC came back up, the checkpoint was deemed invalid, so BOINC started the task from the beginning. So I would say it was just bad luck. We all have bad tasks every so often.
That's my understanding too. Except I still don't get it -- why does the science code blame a version difference? Not to mention, it's a very disappointing way to do checkpoints :)
As for that SZTAKI task, it was just an opportunist waiting in the shadows.
My guess is a yet-to-be-discovered conspiracy by the client. The client is the real villain here.
____________
I'm counting for science,
Points just make me sick.
Michael Goetz (Volunteer moderator, Project administrator)
Joined: 21 Jan 10 Posts: 14045 ID: 53948 Credit: 482,737,361 RAC: 581,467
That's my understanding too. Except I still don't get it -- why does the science code blame a version difference? Not to mention, it's a very disappointing way to do checkpoints :)
The "interrupted save file" seems to fit the scenario. The "old version" error message likely comes because the first thing in the save file is the version identifier. If the file was corrupted (it's likely it was zero length), the program is probably going to interpret it as a wrong version number. Most likely it will see zero, which is less than whatever version number it's looking for.
The part that I don't like is that I wrote (or at least modified) the checkpoint routine specifically to avoid this problem. It renames the old checkpoint file before writing the new checkpoint file, so if the process is interrupted, the previous checkpoint still exists. Unfortunately it looks like the restart process didn't correctly detect that the checkpoint file was missing or corrupt and reload the previous checkpoint instead. (It's also possible both checkpoint files were bad for some reason.)
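A rough sketch of that save/fall-back scheme (the file names, helper names, and trivial payload handling are assumptions, not genefer's code):

```cpp
#include <cstdio>
#include <cstdint>
#include <vector>

static const char *CKPT      = "checkpoint.dat";       // hypothetical names
static const char *CKPT_PREV = "checkpoint.dat.old";
static const uint32_t CKPT_VERSION = 3;

// Save: keep the previous checkpoint as a backup, then write the new one.
// If power drops mid-write, CKPT may be empty or corrupt, but CKPT_PREV survives.
bool saveCheckpoint(const std::vector<char> &state)
{
    std::rename(CKPT, CKPT_PREV);                      // preserve last good file
    std::FILE *f = std::fopen(CKPT, "wb");
    if (f == nullptr) return false;
    bool ok = std::fwrite(&CKPT_VERSION, sizeof CKPT_VERSION, 1, f) == 1
           && std::fwrite(state.data(), 1, state.size(), f) == state.size();
    return (std::fclose(f) == 0) && ok;                // flush errors surface here
}

// Load one file, rejecting it if the version header is missing or wrong.
static bool loadOne(const char *path, std::vector<char> &state)
{
    std::FILE *f = std::fopen(path, "rb");
    if (f == nullptr) return false;
    uint32_t version = 0;
    bool ok = std::fread(&version, sizeof version, 1, f) == 1
           && version == CKPT_VERSION;
    if (ok) {
        char buf[4096];
        size_t n;
        state.clear();
        while ((n = std::fread(buf, 1, sizeof buf, f)) > 0)
            state.insert(state.end(), buf, buf + n);
    }
    std::fclose(f);
    return ok;
}

// Restore: try the current checkpoint first, then the backup; only if both
// fail does the task have to restart from zero.
bool restoreCheckpoint(std::vector<char> &state)
{
    return loadOne(CKPT, state) || loadOne(CKPT_PREV, state);
}
```

The fallback in restoreCheckpoint is the point of the scheme described above: a torn write of the new file should cost at most one checkpoint interval, not the whole run.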
____________
My lucky number is 75898^524288+1
It renames the old checkpoint file before writing the new checkpoint file, so if the process is interrupted, the previous checkpoint still exists.
If the brownout happened at the exact point of the rename, that could cause the corruption and the missing checkpoint file. As rare as it seems, disk reads and writes still take milliseconds, and a brownout could happen at exactly that moment. Very, very rare, but possible.
Another possibility: if there is a bad spot on the disk (have you run a full check disk lately, or ever?), the rename (which actually moves the file name to a new directory entry) may have landed on a bad spot, and the brownout then left the new file corrupt.
With new disks, most people do not think about bad sectors. We just carry on, and most OSes or drive controllers will automatically mark them as bad, but occasionally a full check disk (or fsck) can find and mark sectors that have not already been detected.
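For what it's worth, on POSIX systems the usual way to make the replacement itself survive a power cut is to flush both the new file and its directory entry before trusting it. A sketch of that pattern (not necessarily what genefer does; the names are made up):

```cpp
#include <cstdio>
#include <fcntl.h>
#include <unistd.h>

// Power-loss-safe replace on POSIX: write the new data to a temp file,
// flush it to disk, atomically rename it over the old file, then flush
// the directory so the rename itself is recorded before we rely on it.
bool replaceDurably(const char *tmpPath, const char *finalPath, const char *dirPath)
{
    int fd = open(tmpPath, O_WRONLY);
    if (fd < 0) return false;
    bool ok = (fsync(fd) == 0);                  // data reaches stable storage
    close(fd);
    if (!ok) return false;

    if (std::rename(tmpPath, finalPath) != 0)    // atomic within one filesystem
        return false;

    int dfd = open(dirPath, O_RDONLY);           // flush the directory entry too
    if (dfd < 0) return false;
    ok = (fsync(dfd) == 0);
    close(dfd);
    return ok;
}
```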
The part that I don't like is that I wrote (or at least modified) the checkpoint routine specifically to avoid this problem. It renames the old checkpoint file before writing the new checkpoint file, so if the process is interrupted, the previous checkpoint still exists. Unfortunately it looks like the restart process didn't correctly detect that the checkpoint file was missing or corrupt and reload the previous checkpoint instead. (It's also possible both checkpoint files were bad for some reason.)
May I remind you that this WU was the only one (of three) that was sacrificed by the *client* to give SZTAKI its share? Meanwhile, I've got eight GFN-Shorts that all happily survived another blackout (and I no longer understand how the wiring is done in my area :[ ). But! All of them had run uninterrupted before this (they were intended for the past challenge; that was a mistake). I've just allowed work for concurrent projects again -- let's see, they have 13..35 hours to do that thing again.
____________
I'm counting for science,
Points just make me sick.
If the brownout happened at the exact point of the rename, that could cause the corruption and the missing checkpoint file. As rare as it seems, disk reads and writes still take milliseconds, and a brownout could happen at exactly that moment. Very, very rare, but possible.
I would say -- that's what the filesystem journal is for.
Another possibility: if there is a bad spot on the disk (have you run a full check disk lately, or ever?), the rename (which actually moves the file name to a new directory entry) may have landed on a bad spot, and the brownout then left the new file corrupt.
With S.M.A.R.T. in action? That's ridiculous. (A true story: a friend of mine was "awarded" :P a disk that had bad sectors all over. He was very happy after managing to reset its internals. Needless to say, within a couple of days the bad sectors were back (where they belong). So, if a disk shows any signs of grief it has to be replaced ASAP and taken out of circulation (in other words -- disassembled; the platters make really stylish coasters for a cup or glass). "Grief" means approaching the S.M.A.R.T. thresholds, before there are any bad sectors.)
____________
I'm counting for science,
Points just make me sick.