Message boards : Sophie Germain Prime Search : Over 50 invalid results on one host
Crun-chi (Volunteer tester)
Joined: 25 Nov 09 Posts: 3208 ID: 50683 Credit: 135,132,479 RAC: 57,320
http://www.primegrid.com/show_host_detail.php?hostid=247504
Can someone send some kind of alert for this host?
Thanks!
____________
92*10^1439761-1 NEAR-REPDIGIT PRIME :) :) :)
4 * 650^498101-1 CRUS PRIME
2022202116^131072+1 GENERALIZED FERMAT
Proud member of team Aggie The Pew. Go Aggie!
Michael Goetz (Volunteer moderator, Project administrator)
Joined: 21 Jan 10 Posts: 13956 ID: 53948 Credit: 393,160,197 RAC: 187,115
> Can someone send some kind of alert for this host?
Short answer: No.
Slightly longer answer: Why do you care?
Curious question: Do you know how many valid WUs that same host has completed? Hint: It's a lot more than the invalid tasks. For all we know, he or she is playing around with their overclocking and is perfectly aware of the bad results. Sending them an email could very well be seen as criticism and/or spying on them.
Narrative: We don't really care that much if a computer is returning invalid results. The BOINC system weeds those out with little impact upon either the computers or the people. Of far greater impact are people who have their computers download WUs (especially big WUs like SoB, PSP, or GFN-WR) and then turn off BOINC or don't run their computers enough to finish the WUs before the deadline.
Sending email or PMs to the owners of those computers has several risks. There may be a language problem, and the recipient may totally misunderstand the message. Even without a language barrier, the user may feel insulted that we're "complaining" about how he's managing his computers.
In short, it's a big enough job keeping the server side of things running properly. Although we try to be as helpful as possible, trying to micro-manage the users who have computers returning errors is probably a bad idea. The servers are capable of handling this on their own, without the added risk of offending someone who is volunteering their hardware and electricity and money.
____________
My lucky number is 75898^524288+1
Lumiukko (Volunteer tester)
Joined: 7 Jul 08 Posts: 165 ID: 25183 Credit: 870,503,958 RAC: 37,406
> http://www.primegrid.com/show_host_detail.php?hostid=247504
> Can someone send some kind of alert for this host?
> Thanks!
With Genefer (CUDA), errors are even more common; here is a good example:
http://www.primegrid.com/show_host_detail.php?hostid=248498
2 valid, 7 in progress, 4584 errors.
--
Lumiukko
Michael Goetz (Volunteer moderator, Project administrator)
Joined: 21 Jan 10 Posts: 13956 ID: 53948 Credit: 393,160,197 RAC: 187,115
>> http://www.primegrid.com/show_host_detail.php?hostid=247504
>> Can someone send some kind of alert for this host?
>> Thanks!
> With Genefer (CUDA), errors are even more common; here is a good example:
> http://www.primegrid.com/show_host_detail.php?hostid=248498
> 2 valid, 7 in progress, 4584 errors.
> --
> Lumiukko
LOL, yes, but with GPU programs there are ways you can cause your computer to fail 100% of the time, which is what's happening with that computer. He's got something configured such that the GPU isn't available for CUDA processing at all, which causes all CUDA WUs (not just Genefer, and not just PrimeGrid) to fail as soon as the program attempts to access the GPU.
I've always wondered why a person running a computer that was getting 100% errors wouldn't even attempt to figure out why...
____________
My lucky number is 75898^524288+1
Crun-chi (Volunteer tester)
Joined: 25 Nov 09 Posts: 3208 ID: 50683 Credit: 135,132,479 RAC: 57,320
Michael, if it is OK for you, then it is OK for me :)
I just asked, and got the answer.
Thanks!
____________
92*10^1439761-1 NEAR-REPDIGIT PRIME :) :) :)
4 * 650^498101-1 CRUS PRIME
2022202116^131072+1 GENERALIZED FERMAT
Proud member of team Aggie The Pew. Go Aggie!
> Narrative: We don't really care that much if a computer is returning invalid results. The BOINC system weeds those out with little impact upon either the computers or the people. Of far greater impact are people who have their computers download WUs (especially big WUs like SoB, PSP, or GFN-WR) and then turn off BOINC or don't run their computers enough to finish the WUs before the deadline.
> Sending email or PMs to the owners of those computers has several risks. There may be a language problem, and the recipient may totally misunderstand the message. Even without a language barrier, the user may feel insulted that we're "complaining" about how he's managing his computers.
> In short, it's a big enough job keeping the server side of things running properly. Although we try to be as helpful as possible, trying to micro-manage the users who have computers returning errors is probably a bad idea. The servers are capable of handling this on their own, without the added risk of offending someone who is volunteering their hardware and electricity and money.
Looking at my buffer (after seeing a new SGS prime, to check ASAP whether I was the prime finder or the double-checker...), I realized that if a WU is returned with an error or aborted, it is automatically sent to two new users, even if the initial quorum was 1.
On the single-user side this has no impact (except if he has a reliable host and a long buffer, as the chances of getting a lucky number and being the second to report it do increase), but I think it might have some impact on the project's progress, because many WUs that otherwise would not be double-checked end up being double-checked.
That said, if it is feasible in the server settings, I would suggest that subprojects with adaptive replication active should not increase the minimum quorum unless the first user returned an error. If the WU was simply aborted, there's no need to immediately increase the quorum.
I've seen a lot of aborted SGS tasks today (maybe it's just people preparing for the challenge) and all of them got resent twice.
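To make the suggestion concrete, here is a minimal sketch in Python of the proposed error-versus-abort distinction (the function and outcome names are hypothetical; the real logic lives in BOINC's server-side transitioner):

```python
# Hypothetical sketch of the suggested policy, not PrimeGrid's actual
# server code: only raise the quorum when a task fails with a genuine
# computation error; a user abort just gets a single replacement task.

ERROR, ABORTED = "error", "aborted"

def handle_failed_task(outcome: str, quorum: int) -> tuple[int, int]:
    """Return (new_quorum, replacement_tasks) for a failed SGS task."""
    if outcome == ERROR:
        # An error is evidence of an unreliable host, so the WU gets an
        # extra check: bump the quorum and send out two new tasks.
        return quorum + 1, 2
    # An abort says nothing about result quality: just refill the quorum.
    return quorum, 1

print(handle_failed_task(ERROR, 1))    # (2, 2) -> WU gets double-checked
print(handle_failed_task(ABORTED, 1))  # (1, 1) -> single resend, quorum stays 1
```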
____________
676754^262144+1 is prime
Michael Goetz (Volunteer moderator, Project administrator)
Joined: 21 Jan 10 Posts: 13956 ID: 53948 Credit: 393,160,197 RAC: 187,115
I don't know the exact logic that's affecting those resends, but it's possible that the AR logic may only assign a quorum of 1 when the first task is sent out. I don't really know.
Just be glad I'm not the guy in charge... I wouldn't use AR at all and I'd set it up to require a quorum of 2 for all primality tests, on both the BOINC and PRPNet side. I understand why it's done the way it's done, but I personally would opt for slower progress with a higher assurance that we didn't miss any primes. But that's just my opinion, and reasonable people can disagree.
____________
My lucky number is 75898^524288+1
> ...I'd set it up to require a quorum of 2 for all primality tests
If a unit does come in marked as a prime, it is automatically double-checked.
I know what you are saying: we may miss a few primes when a host marks a number as not prime when it may in fact be prime. This risk is minimal, and since these numbers are so small, they do not have as much significance as the larger ones. The chances of this happening are very low anyway, probably less than one in a million.
My bet is they would be found at another time, with another set of tests. Since what we search is not 1 to infinity but special sets of numbers, those numbers may come up in other tests anyway.
So I think AR on the shorter units is fine.
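As a rough illustration of the policy described above, here is a minimal sketch (hypothetical names; not BOINC's actual validator code) of how adaptive replication can coexist with automatic double-checking of primes:

```python
# Sketch of the validation policy described above (hypothetical, not the
# real BOINC validator): under adaptive replication a trusted host's
# "composite" result can stand alone, but every reported prime triggers
# a second, independent test.

def needs_double_check(result: str, host_is_trusted: bool) -> bool:
    if result == "prime":
        return True             # every reported prime is re-verified
    return not host_is_trusted  # composites from trusted hosts stand alone

print(needs_double_check("composite", host_is_trusted=True))  # False
print(needs_double_check("prime", host_is_trusted=True))      # True
```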
____________
My lucky numbers are 121*2^4553899-1 and 3756801695685*2^666669±1
My movie https://vimeo.com/manage/videos/502242
Michael Goetz (Volunteer moderator, Project administrator)
Joined: 21 Jan 10 Posts: 13956 ID: 53948 Credit: 393,160,197 RAC: 187,115
Someone correct me if I'm wrong, but I don't think any automatic mechanism is going to help here.
For argument's sake, assume that a heavily overclocked and liquid cooled AVX capable CPU can crank out one SGS WU every 5 minutes.
Assume a maximum cache size of 10 days.
Assume that the WU limit is per-core and not per-host. (The problem is far worse if it's per-host.)
In 10 days, this hypothetical CPU core can process 12*24*10 or 2880 WUs.
We do not want to inhibit properly operating hosts because of a desire to reduce the number of WUs sent to faulty hosts, so we need to be able to send at least 2880 WUs to a host before we even begin to think about limiting work flow to it.
So, at best, any automatic method we use won't kick in until at least 3000 error WUs have been returned.
Keeping work flowing to healthy hosts is far more important than suppressing work going to unhealthy hosts.
I don't see how an automated mechanism is going to have a significant effect given how high the limits need to be.
That leaves us with someone manually identifying unhealthy hosts and banning them. I'd be against that for several reasons, not the least of which is that it's going to eventually annoy some people who have temporary problems with their computers as well as creating more work for the admins. Generally speaking, not pissing off users and creating less work for admins is what we try to accomplish, rather than the other way around. :)
With the longer WUs that take days, rather than minutes, an automatic method might work, and it would be more valuable there as well. Nobody waits six months for credit on an SGS WU, but that certainly does happen with SoB and other long tasks. However, I don't know if you can set different limits for the different sub-projects.
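As a back-of-the-envelope check of the numbers above (just the arithmetic from the post, not actual server logic):

```python
# Back-of-the-envelope check of the threshold argument above.
wus_per_hour = 60 / 5              # one SGS WU every 5 minutes -> 12/hour
cache_days = 10                    # assumed maximum cache size
wus_per_core = wus_per_hour * 24 * cache_days
print(wus_per_core)                # 2880.0 -> an error cutoff must be higher
```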
____________
My lucky number is 75898^524288+1
> Assume a maximum cache size of 10 days.
Just a question, but why is the cache limit set to 10 days? It seems like that might be one of the few controllable variables in the whole process.
____________
@AggieThePew
Michael Goetz (Volunteer moderator, Project administrator)
Joined: 21 Jan 10 Posts: 13956 ID: 53948 Credit: 393,160,197 RAC: 187,115
> Assume a maximum cache size of 10 days.
> Just a question, but why is the cache limit set to 10 days? It seems like that might be one of the few controllable variables in the whole process.
I always use that in my examples because that was the maximum amount you could configure in your BOINC client. (At least up to version 6.x.x. Version 7 is different.)
Don't get too hung up about the exact numbers in the example. The point was that you need to be able to send many hundreds or even thousands of WUs, per core, to a host, so considering 50 errors to be "a lot" and needing remedial action just isn't going to work.
I'm also getting confused about what we're discussing here.
Are we talking about handling error-prone hosts (which is what this thread is about), or did we suffer a thread wandering/hijacking and are now talking about adaptive replication? I fear I may have just posted an exhaustive answer to a question that wasn't being asked. :)
____________
My lucky number is 75898^524288+1
I think it's a combination of both... To me your answer was relevant because we do seem to have a few systems that spit out a continuous stream of bad results (for whatever reason). In my opinion, most dedicated PG folks would really like to see some form of policing (well, maybe that's the wrong word to use). I know you've answered several times that a system returning bad results really doesn't have much effect on other users or the server, but for me, as just a simple-minded person, it's hard to understand that reasoning.
So, to further muddy the thread, I pose this question: in the upcoming challenge, if a group of systems begins to get WUs but continuously returns invalid work units, won't this hinder the server for the challenge, especially if it's one of the smaller ones like SGS? Maybe my logic is flawed because I don't understand how the server works, but based on past experience with the PPS LLR challenges, any additional strain on the server is not a good thing.
lol - sorry for the long-winded post, but now comes the real question: if you answer yes to the above, then by default it does affect the server, regardless of the situation.
Michael Goetz (Volunteer moderator, Project administrator)
Joined: 21 Jan 10 Posts: 13956 ID: 53948 Credit: 393,160,197 RAC: 187,115
> So, to further muddy the thread, I pose this question: in the upcoming challenge, if a group of systems begins to get WUs but continuously returns invalid work units, won't this hinder the server for the challenge, especially if it's one of the smaller ones like SGS? Maybe my logic is flawed because I don't understand how the server works, but based on past experience with the PPS LLR challenges, any additional strain on the server is not a good thing.
For a challenge like this with short WUs, the number of tasks sent to each host is lowered significantly, especially at the start, so a bad host won't have much of an impact.
____________
My lucky number is 75898^524288+1
Sorry if I hijacked the thread by bringing up the AR subject. My point was the difference between trashed and aborted WUs, and the fact that they are treated the same way by the server, not AR itself. If the admins think AR is worthwhile (and I also have doubts about that), then setting different parameters for errors and aborts could be useful, and probably not too hard to do. I don't know how many WUs are aborted every day; maybe not enough to justify any changes. I just posted earlier because I saw a lot of them today.
____________
676754^262144+1 is prime
Crun-chi (Volunteer tester)
Joined: 25 Nov 09 Posts: 3208 ID: 50683 Credit: 135,132,479 RAC: 57,320
I found one host that completes a PPS WU in 5.45 SECONDS. So if that host has 4 cores, calculate how many errors it makes per day...
But I assume you will answer again that it's a small percentage of the total number of WUs, and that it can be neglected.
____________
92*10^1439761-1 NEAR-REPDIGIT PRIME :) :) :)
4 * 650^498101-1 CRUS PRIME
2022202116^131072+1 GENERALIZED FERMAT
Proud member of team Aggie The Pew. Go Aggie!
Just to add some more to think about :) There is also a "min cputime" setting in the validator.
That means a WU with a CPU time below that minimum will not be accepted as a good result :)
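To illustrate the idea (a hypothetical sketch; the threshold value is made up, and this is not the actual BOINC validator source):

```python
# Hypothetical minimum-CPU-time sanity check: a result that finished
# impossibly fast is rejected before normal validation even starts.

MIN_CPU_SECONDS = 60.0  # assumed per-subproject floor set by the admins

def passes_min_cputime(cpu_seconds: float) -> bool:
    return cpu_seconds >= MIN_CPU_SECONDS

print(passes_min_cputime(5.45))   # False -> marked invalid
print(passes_min_cputime(300.0))  # True  -> proceeds to normal validation
```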
Lennart
Crun-chi (Volunteer tester)
Joined: 25 Nov 09 Posts: 3208 ID: 50683 Credit: 135,132,479 RAC: 57,320
> Just to add some more to think about :) There is also a "min cputime" setting in the validator.
> That means a WU with a CPU time below that minimum will not be accepted as a good result :)
> Lennart
So if you can set that rule, can you also set this one:
if the CPU time is less than a set number of seconds and the host has returned more than 100 such WUs, then stop sending new WUs to that host ID?
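Something along these lines, perhaps (a hypothetical sketch of the proposed rule; the names, counters, and thresholds are illustrative, not existing PrimeGrid server settings):

```python
# Hypothetical sketch of the proposed scheduler rule: count suspiciously
# fast results per host and stop issuing work once a cutoff is exceeded.

MIN_CPU_SECONDS = 60.0      # same floor the validator uses (assumed value)
MAX_TOO_FAST_RESULTS = 100  # proposed cutoff before the host is throttled

too_fast: dict[int, int] = {}  # host_id -> count of too-fast results

def record_result(host_id: int, cpu_seconds: float) -> None:
    if cpu_seconds < MIN_CPU_SECONDS:
        too_fast[host_id] = too_fast.get(host_id, 0) + 1

def may_send_work(host_id: int) -> bool:
    return too_fast.get(host_id, 0) <= MAX_TOO_FAST_RESULTS

record_result(247504, 5.45)   # the 5.45-second PPS result from above
print(may_send_work(247504))  # True until more than 100 such results
```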
____________
92*10^1439761-1 NEAR-REPDIGIT PRIME :) :) :)
4 * 650^498101-1 CRUS PRIME
2022202116^131072+1 GENERALIZED FERMAT
Proud member of team Aggie The Pew. Go Aggie!