Message boards : Number crunching : Bad Hosts
Can anything be done about bad hosts / users / PCs?
'Anonymous' has a PC that is constantly returning invalids for PPS - over 2150 this past week.
Not one valid or pending...
Very frustrating when you get these work units as you have a much lower chance of returning it first.
Thanks
https://www.primegrid.com/results.php?hostid=153120
If the work units are invalid, does it matter whether or not they are returned before yours?
____________
Michael Goetz Volunteer moderator Project administrator
Joined: 21 Jan 10 Posts: 14043 ID: 53948 Credit: 481,383,442 RAC: 516,327
Can anything be done about bad hosts / users / PCs?
Not a lot.
Very frustrating when you get these work units as you have a much lower chance of returning it first.
Not really. That would imply someone else has a higher chance than you. An error task, by definition, can't be first, so someone else is going to be first. That can just as easily be you as it can be not you.
____________
My lucky number is 75898^524288+1
I'll re-word this line to say: Very frustrating when you get these work units AFTER HOST 153120 as you have a much lower chance of returning it first.
If the work units are invalid, does it matter whether or not they are returned before yours?
Yes, because they have delayed the task.
Meanwhile your wingman has returned it first...
Michael Goetz Volunteer moderator Project administrator
Joined: 21 Jan 10 Posts: 14043 ID: 53948 Credit: 481,383,442 RAC: 516,327
I'll re-word this line to say: Very frustrating when you get these work units AFTER HOST 153120 as you have a much lower chance of returning it first.
If the work units are invalid, does it matter whether or not they are returned before yours?
Yes, because they have delayed the task.
Meanwhile your wingman has returned it first...
You're just as likely to be the guy who gets it first. Of course, when you're first and your wingman subsequently has an error later on, you're a lot less likely to notice it. Your perception might be that you're at a disadvantage, but if you think about the big picture, you should come to the conclusion that in the end it neither helps nor hurts your chances. The only person who gets hurt is the guy whose tasks are all failing.
Yes, it's frustrating, but that frustration is based upon a distorted perception due to the sequence in which your tasks finish. Everyone feels as if they're at a disadvantage. Literally, everyone. If you think about that, it obviously can't be true that everyone is at a disadvantage. It just feels that "some other guy" has a better chance than you. The truth is you're also that "other guy".
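Michael's symmetry argument is easy to sanity-check with a quick simulation (a rough sketch with made-up rates and delays, not PrimeGrid data): two equally fast wingmen race on many workunits, and a bad host's failure randomly delays one side or the other. Each host still ends up first about half the time.

```python
import random

def race(n_units=100_000, bad_host_rate=0.3, delay=6.0, runtime=5.0):
    """Fraction of workunits host A returns first, when a bad host's
    failed task randomly delays either A's or B's copy of the unit."""
    a_first = 0
    for _ in range(n_units):
        a_start = b_start = 0.0
        if random.random() < bad_host_rate:
            # A failed task is resent; it lands ahead of A or ahead of B
            # with equal chance, delaying that side's start.
            if random.random() < 0.5:
                a_start = delay
            else:
                b_start = delay
        # Equal runtimes plus a little jitter in request/return times.
        a_done = a_start + runtime + random.uniform(0, 1)
        b_done = b_start + runtime + random.uniform(0, 1)
        a_first += a_done < b_done
    return a_first / n_units

print(round(race(), 2))  # ≈ 0.5: the delays hurt and help equally often
```

The frustration is real but symmetric: the half of the time the bad host's delay lands on your wingman, you quietly benefit and never notice.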
____________
My lucky number is 75898^524288+1
composite Volunteer tester
Joined: 16 Feb 10 Posts: 1172 ID: 55391 Credit: 1,219,398,374 RAC: 1,416,880
Even if you are running THE *slowest* host on PrimeGrid, there is a nonzero chance of being the wingman of an incorrectly configured BOINC client which has downloaded a gazillion tasks, almost all of which will time out, so your host can still be first to return a valid result.
Even if you are running THE *slowest* host on PrimeGrid, there is a nonzero chance of being the wingman of an incorrectly configured BOINC client which has downloaded a gazillion tasks, almost all of which will time out, so your host can still be first to return a valid result.
^This
My love/hate relationship with bad hosts or error returns in general.
- Love when the laptop can score a 1st
- Hate waiting weeks on tasks you know will time out
Overall though I do wish something could be implemented that would limit these hosts. Like if your return ratio of error/good is 80%+ (over the course of a week maybe?) then the server only allows 1 task to be in process at a time.
But it is what it is - I'll keep crunching regardless.
JimB Honorary cruncher
Joined: 4 Aug 11 Posts: 920 ID: 107307 Credit: 990,017,653 RAC: 50,599
In the past I've set hosts like this to be limited to less than 10 jobs per day. The scheduler immediately sets that number back into the hundreds the first time the host returns any "successful" work. The scheduler doesn't know anything about the content of a job, just that the host considered the job to be completed successfully.
All those validate errors started out with a blank upload and the host reporting success. "Validate error" is reserved for results that are so flawed that they shouldn't ever be looked at again. Here it's triggered by an empty upload file. Missing uploads or those that don't look like what we're expecting also get a validate error.
I used to get a kick out of seeing how old some of my tasks were, as in "subproject status" -> Oldest unfinished work unit. Then seeing if I had the oldest waiting on a wingman. There were a couple of times I was, for SOB, GFN 22 and 21, and PSP. Not because I got sent the first unit and took forever but because I was sometimes the fifth person sent the unit and the first to give the correct response. However with those I knew it'd not be prime.
I think the frustration is when you have computing horsepower but get dealt the task second. E.g. on GFN 16 I have a 2080 Ti, and the work unit got sent to someone running it as a CPU task and someone giving constant invalids. I would have been the first to return the WU on the GFN in sheer compute time, but the CPU got a head start or finished before I was sent it.
While the above can be annoying when you're the DC'er on a mega prime (twice, but who's counting ;) ... ) I would be very saddened if that opportunity didn't exist. Yes, the chances of winning a lottery increase with more tickets. The chances of being a prime finder increase with better hardware. But always having a chance is what keeps a lot of people here.
|
|
You're just as likely to be the guy who gets it first. Of course, when you're first and your wingman subsequently has an error later on, you're a lot less likely to notice it. Your perception might be that you're at a disadvantage, but if you think about the big picture, you should come to the conclusion that in the end it neither helps nor hurts your chances. The only person who gets hurt is the guy whose tasks are all failing.
Yes, it's frustrating, but that frustration is based upon a distorted perception due to the sequence in which your tasks finish. Everyone feels as if they're at a disadvantage. Literally, everyone. If you think about that, it obviously can't be true that everyone is at a disadvantage. It just feels that "some other guy" has a better chance than you. The truth is you're also that "other guy".
If some user John Doe has the fastest computer in the universe, those malfunctioning hosts do him more harm than good. Because without them, he would "win" virtually all battles, and be 1st every time. But with bad hosts being around, it will sometimes occur that a workunit is first sent to users A and B; and A finishes in a moderate tempo, while B is a bad host and screws up. Then our John Doe will get the "third" task of the workunit, and even though he has the universe's fastest machine, he still loses to A because A has this head start.
This is why, to some users, with faster hardware than most, the presence of bad hosts may be perceived as a negative thing.
But as many people have already explained, if we see this from the perspective of A, the guy with a somewhat slow machine, this gives him some chances to sometimes beat both B (the person who does not care or understand his computer does not actually work) and John Doe (the smart guy with the fast and well configured hardware). As many said, it is fine that even the slow computers have some chance of coming first with an extraordinary prime find.
Bad hosts bring an additional element of randomness with which David will sometimes beat Goliath.
/JeppeSN
mikey
Joined: 17 Mar 09 Posts: 1905 ID: 37043 Credit: 830,859,386 RAC: 799,198
I used to get a kick out of seeing how old some of my tasks were, as in "subproject status" -> Oldest unfinished work unit. Then seeing if I had the oldest waiting on a wingman. There were a couple of times I was, for SOB, GFN 22 and 21, and PSP. Not because I got sent the first unit and took forever but because I was sometimes the fifth person sent the unit and the first to give the correct response. However with those I knew it'd not be prime.
I think the frustration is when you have computing horsepower but get dealt the task second. E.g. on GFN 16 I have a 2080 Ti, and the work unit got sent to someone running it as a CPU task and someone giving constant invalids. I would have been the first to return the WU on the GFN in sheer compute time, but the CPU got a head start or finished before I was sent it.
While the above can be annoying when you're the DC'er on a mega prime (twice, but who's counting ;) ... ) I would be very saddened if that opportunity didn't exist. Yes, the chances of winning a lottery increase with more tickets. The chances of being a prime finder increase with better hardware. But always having a chance is what keeps a lot of people here.
And some of us don't care about being 1st, we just like crunching for crunching's sake, so to me a way to only be 2nd would work too. I don't have the fastest hardware and most likely never will, but I do have quantity on my side: I have over 100 CPU cores I could bring here, with most being in the 2.x GHz range as opposed to the 3.5 GHz and up range for the fastest CPUs today. My GPUs aren't the fastest either but again I have more than a few I could bring here. I do realize that may take the 'fun' out of being #1 for some people, and I have been a few times, but that's not why I personally crunch here.
|
|
If some user John Doe has the fastest computer in the universe, those malfunctioning hosts do him more harm than good. Because without them, he would "win" virtually all battles, and be 1st every time. But with bad hosts being around, it will sometimes occur that a workunit is first sent to users A and B; and A finishes in a moderate tempo, while B is a bad host and screws up. Then our John Doe will get the "third" task of the workunit, and even though he has the universe's fastest machine, he still loses to A because A has this head start.
/JeppeSN
This is simply an untrue statement. Why would the fastest hardware possible care if there is a bad host consuming tasks or a good host consuming tasks... or another system just like yours out there? Either way it doesn't make any difference to you, if you will be 1st on a specific task or not. You either get the task first or you don't. 50% chance. Other machines out there make no difference in YOUR outcome. Your *fastest machine in the universe* also does not affect the outcome of other machines out there (unless they are a bad host of course, because they are not outputting valid results so they can't possibly be first). Their bad results get put right back into the queue where you can get them.
Without them (bad hosts) you would be "winning" the same number of battles if they were there or not, and would be 1st based on a 50% chance. <--- That is a true statement.
Bad hosts cannot change your 50% chance of being first to a lower percentage... that's mathematically impossible. It's hard for me to understand how you don't understand that. What Michael said earlier is absolutely true in that there is a distorted perception due to the sequence in which your tasks finish. 50 percent is 50 percent.
Michael Goetz Volunteer moderator Project administrator
Joined: 21 Jan 10 Posts: 14043 ID: 53948 Credit: 481,383,442 RAC: 516,327
If some user John Doe has the fastest computer in the universe, those malfunctioning hosts do him more harm than good. Because without them, he would "win" virtually all battles, and be 1st every time. But with bad hosts being around, it will sometimes occur that a workunit is first sent to users A and B; and A finishes in a moderate tempo, while B is a bad host and screws up. Then our John Doe will get the "third" task of the workunit, and even though he has the universe's fastest machine, he still loses to A because A has this head start.
/JeppeSN
This is simply an untrue statement. Why would the fastest hardware possible care if there is a bad host consuming tasks or a good host consuming tasks... or another system just like yours out there? Either way it doesn't make any difference to you, if you will be 1st on a specific task or not. You either get the task first or you don't. 50% chance. Other machines out there make no difference in YOUR outcome. Your *fastest machine in the universe* also does not affect the outcome of other machines out there (unless they are a bad host of course, because they are not outputting valid results so they can't possibly be first). Their bad results get put right back into the queue where you can get them.
Without them (bad hosts) you would be "winning" the same number of battles if they were there or not, and would be 1st based on a 50% chance. <--- That is a true statement.
Bad hosts cannot change your 50% chance of being first to a lower percentage... that's mathematically impossible. It's hard for me to understand how you don't understand that. What Michael said earlier is absolutely true in that there is a distorted perception due to the sequence in which your tasks finish. 50 percent is 50 percent.
Actually, everyone is at least partially right. We're all just looking at different scenarios.
Consider 3 hosts, A, B, and C. A is super fast. B is super slow. C errors every task after 2 seconds. In the following scenarios, I'll show the sequence of tasks, with times relative to the time the first task is sent out. For these tasks, assume A can run the task in 1 minute, and B can run the task in 5 minutes.
Scenario 1:
0:00 A gets a task
0:00 B gets a task
A is always going to win, because it's faster than B. B will always lose. A wins 100% of the time.
Scenario 2:
0:00 B gets a task
6:00 A gets a task
B wins because it got the task sooner. In order for B to win, it needs to be lucky enough to have A not request the other task until B has enough time to finish most of the computation.
Under normal circumstances, A is going to beat B unless the luck of the draw gives B enough of a headstart on the task.
Now let's look at scenarios including the bad host C. For argument's sake, let's say that each bad task includes a 6 minute delay, either because of caching in host C, or because it takes a few minutes for the recycled task to get sent out again.
Scenario 3:
0:00 A gets a task
0:00 C gets a task, which fails
6:00 B gets a task
A always wins here.
Scenario 4:
0:00 B gets a task
0:00 C gets a task, which fails
6:00 A gets a task
B wins, because the failed task slows down the replacement task enough to create the equivalent of Scenario 2. The failed task gives B a head start, the same as if A's task had randomly been sent out later as it was in Scenario 2.
If hosts A and B are relatively close in speed, then host C really doesn't change anything. You have a roughly 50/50 chance of being the task before the bad host as you do the task after the bad host. The task before will usually win. This is where my perception thing comes in: If you're the first task (which wins), you may not see the bad host. If you're the third task (which loses), you do see the failed task. You see the failed task more often when you lose than when you win, so your perception is that you're at a disadvantage because of host C's failed tasks.
That's if A and B are the same speed. If, as in my scenarios, A is faster, C does have an effect because instead of winning 100% in a straight-up race vs. host B, you're now going to be the third task 50% of the time and possibly lose races against a slower computer.
Of course, sooner or later Host A will run into scenario 5:
0:00 Host A gets a task
0:00 Host C gets a task, which fails
6:00 Host D gets a task
Host A may be fast, but host D is even faster, and would easily beat host A in a fair race. But the tables are flipped here and C's failed task lets A finish before D. There's always a bigger fish.
So even host A, who normally wins but may lose some because of the effect of the bad host, still sometimes benefits from the bad host. And if you happen to own host D -- the fastest, baddest, $2000 CPU that exists and never, ever loses a fair race -- you're winning almost every competition anyway, and running a huge number of tasks, and aren't worried about the handful that you lose for a variety of inconsequential reasons.
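The scenarios above can be turned into a tiny simulation, using the same illustrative timings (A runs a task in 1 minute, B in 5, and a failed task from C adds a 6-minute delay before the replacement goes out):

```python
import random

A_RUN, B_RUN, FAIL_DELAY = 1.0, 5.0, 6.0   # minutes, as in the scenarios

def finish_times(c_got_a_slot: bool):
    """Finish times (A, B) for one workunit. If bad host C took one of
    the two initial tasks, the replacement task starts FAIL_DELAY late."""
    if not c_got_a_slot:
        return A_RUN, B_RUN                  # Scenario 1: a fair race
    if random.random() < 0.5:                # Scenario 3: B is delayed
        return A_RUN, FAIL_DELAY + B_RUN
    return FAIL_DELAY + A_RUN, B_RUN         # Scenario 4: A is delayed

def a_win_rate(bad_host_rate, n=100_000):
    wins = 0
    for _ in range(n):
        a, b = finish_times(random.random() < bad_host_rate)
        wins += a < b
    return wins / n

print(a_win_rate(0.0))  # 1.0: in a fair race the faster host always wins
print(a_win_rate(1.0))  # ≈ 0.5: half the time A is the delayed third task
```

Which matches the point above: C costs the faster host some wins it would otherwise have had, while handing the slower host wins it could never get in a fair race.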
____________
My lucky number is 75898^524288+1
If some user John Doe has the fastest computer in the universe, those malfunctioning hosts do him more harm than good. Because without them, he would "win" virtually all battles, and be 1st every time. But with bad hosts being around, it will sometimes occur that a workunit is first sent to users A and B; and A finishes in a moderate tempo, while B is a bad host and screws up. Then our John Doe will get the "third" task of the workunit, and even though he has the universe's fastest machine, he still loses to A because A has this head start.
/JeppeSN
This is simply an untrue statement. Why would the fastest hardware possible care if there is a bad host consuming tasks or a good host consuming tasks... or another system just like yours out there? Either way it doesn't make any difference to you, if you will be 1st on a specific task or not. You either get the task first or you don't. 50% chance. Other machines out there make no difference in YOUR outcome. Your *fastest machine in the universe* also does not affect the outcome of other machines out there (unless they are a bad host of course, because they are not outputting valid results so they can't possibly be first). Their bad results get put right back into the queue where you can get them.
Without them (bad hosts) you would be "winning" the same number of battles if they were there or not, and would be 1st based on a 50% chance. <--- That is a true statement.
Bad hosts cannot change your 50% chance of being first to a lower percentage... that's mathematically impossible. It's hard for me to understand how you don't understand that. What Michael said earlier is absolutely true in that there is a distorted perception due to the sequence in which your tasks finish. 50 percent is 50 percent.
Everyone has 50% chance of being sent the task first (out of the two initial tasks, that is). But be aware that it is not determined who is the "first-checker" and who is the "double-checker" as soon as the tasks are sent out to the users. That is only decided when the results come back to the PrimeGrid server. The guy who got his task latest, might nonetheless return it earlier. If you have faster-than-average hardware, your chance of submitting the result first is over 50%.
My post was not meant to deny what Michael said. It was just another perspective on things.
I totally agree with Michael's most recent post above.
And I also agree there is a "distorted perception" where, psychologically, we pay more attention to the cases where we got an "unfair" disadvantage, than to the cases where we got an "unfair" advantage. But that is not the whole story.
/JeppeSN
robish Volunteer moderator Volunteer tester
Joined: 7 Jan 12 Posts: 2223 ID: 126266 Credit: 7,968,032,238 RAC: 5,388,098
There's always a bigger fish
So even host A, who normally wins but may lose some because of the effect of the bad host, still sometimes benefits from the bad host. And if you happen to own host D -- the fastest, baddest, $2000 CPU that exists and never, ever loses a fair race -- you're winning almost every competition anyway, and running a huge number of tasks, and aren't worried about the handful that you lose for a variety of inconsequential reasons.
Evidence to back up Mike's scenarios
http://www.primegrid.com/download/GFN-2312092_524288.pdf
Much weaker GPU won because of error machines delaying the WU.
____________
My lucky number 10590941^1048576+1
JimB Honorary cruncher
Joined: 4 Aug 11 Posts: 920 ID: 107307 Credit: 990,017,653 RAC: 50,599
The other thing about these scenarios is that it depends on how the bad host fails. If it produces a client error (or any error immediately apparent) then a new result is created by the transitioner within 15 seconds. If the client reports that the job is successful but doesn't upload, it will not be noticed until the wingman returns his/her result. Then the validator looks at both, sees that one result is faulty and only then is a new result created. This particular bad host was sometimes reporting success. That's rare though, most bad results are reported as bad.
Now I haven't checked all of the jobs from this host, but some of them (possibly all of the ones with the empty upload) did not report themselves as finished. In that case, one of our cron jobs (runs once an hour) picked up the upload and marked the task as completed. In that case, regardless of whether there were two results pending, the job is immediately validated just in case the upload is faulty. If the upload is OK, no action is taken. If it's empty, blank or doesn't actually contain a result, a new result is again created within 15 seconds.
We have another cron that takes jobs that reported as errors but that did have uploads and tries to fix them as well. Again, the job is immediately validated no matter what and so we don't suffer delays from bad uploads. And by the way, the received time is set to the creation time of the upload file, so you're not going to see a lot of jobs ending at the time the cron was run.
So, this is a rather complex environment. From our side, we're trying to make sure that everyone who returns a valid result gets credit for it even if the BOINC client screws up somehow at the end. The crons above produce log entries for every job they "fix" and only when looking at those is it clear what the exact sequence of events was. Anything can and does happen.
There's always a bigger fish
So even host A, who normally wins but may lose some because of the effect of the bad host, still sometimes benefits from the bad host. And if you happen to own host D -- the fastest, baddest, $2000 CPU that exists and never, ever loses a fair race -- you're winning almost every competition anyway, and running a huge number of tasks, and aren't worried about the handful that you lose for a variety of inconsequential reasons.
Evidence to back up Mike's scenarios
http://www.primegrid.com/download/GFN-2312092_524288.pdf
Much weaker gpu won because of error machines delaying the wu.
I've had tasks that didn't have an error. The unit was only sent to two computers and neither task was delayed. The time difference between when the tasks were requested was just that large of a gap.
Computer A ran task in 44 minutes
Computer B requested task 43 minutes later
Computer B ran task in 6 minutes and reported second
The fastest computer doesn't need a bad host to lose either.
mikey
Joined: 17 Mar 09 Posts: 1905 ID: 37043 Credit: 830,859,386 RAC: 799,198
There's always a bigger fish
So even host A, who normally wins but may lose some because of the effect of the bad host, still sometimes benefits from the bad host. And if you happen to own host D -- the fastest, baddest, $2000 CPU that exists and never, ever loses a fair race -- you're winning almost every competition anyway, and running a huge number of tasks, and aren't worried about the handful that you lose for a variety of inconsequential reasons.
Evidence to back up Mike's scenarios
http://www.primegrid.com/download/GFN-2312092_524288.pdf
Much weaker gpu won because of error machines delaying the wu.
I've had tasks that didn't have an error. The unit was only sent to two computers and neither task was delayed. The time difference between when the tasks were requested was just that large of a gap.
Computer A ran task in 44 minutes
Computer B requested task 43 minutes later
Computer B ran task in 6 minutes and reported second
The fastest computer doesn't need a bad host to lose either.
Michael or some other person who works here might have the answer, but it may also depend on the cache size: if someone has a 3-day cache of WUs, does that mean each WU doesn't get returned as fast as by someone with a zero cache?
I think if someone has a zero resource share and/or a config file saying to return the results immediately they may stand a better chance of being first than someone who just runs BOINC plain vanilla as it comes from Berkeley. It also makes them more susceptible to outages, but if being first as much as possible is your goal I think you've got to be tweaking.
mackerel Volunteer tester
Joined: 2 Oct 08 Posts: 2652 ID: 29980 Credit: 570,442,335 RAC: 5,621
I think if someone has a zero resource share and/or a config file saying to return the results immediately they may stand a better chance of being first than someone who just runs BOINC plain vanilla as it comes from Berkeley. It also makes them more susceptible to outages, but if being first as much as possible is your goal I think you've got to be tweaking.
This is known and used by many who do aim to be 1st. One additional consideration is that multi-threading can be used to speed up processing, even if it comes at a cost of reduced overall throughput. For example, running 1 PPS task per core would get you more throughput than 1 task per 2 cores, which would finish each task in a bit over half the time but lose some efficiency.
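To put that tradeoff in concrete terms (the timings here are invented for illustration, not measured PPS numbers): say one PPS task takes 10 minutes on 1 core, but 2 cores finish it in 6 minutes because multithreading scales imperfectly.

```python
# Hypothetical timings: multithreading helps latency, not throughput.
cores = 8
one_core_time = 10.0   # minutes per task, 1 task per core
two_core_time = 6.0    # minutes per task, 1 task per 2 cores (not 5.0!)

throughput_1c = cores * (60 / one_core_time)        # tasks finished per hour
throughput_2c = (cores // 2) * (60 / two_core_time)

print(throughput_1c, throughput_2c)  # 48.0 40.0: single-core wins throughput
print(one_core_time, two_core_time)  # 10.0 6.0: but each task returns sooner
```

So if the goal is finishing first, the 2-core configuration returns each individual task 4 minutes earlier, at the cost of 8 fewer completed tasks per hour overall.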
Back on the original question, I do occasionally check my task status. If I see an inconclusive I usually check out the wingman's host to see if it is more likely for my system to be wrong or theirs. If theirs has a bad track record, I may (but not always) send a message via forum to inform them of that. Of course, I can't do this if they're anonymous.
JimB Honorary cruncher
Joined: 4 Aug 11 Posts: 920 ID: 107307 Credit: 990,017,653 RAC: 50,599
There's always another bad host. Here's a guy helping all you people running SoB to come out as the 1st to return the result: http://www.primegrid.com/results.php?hostid=944350
Sorry, off topic: Jim, is he running a server or HEDT processor that can actually finish those, or is this a case of BOINC being over-optimistic for a user running a large cache?
composite Volunteer tester
Joined: 16 Feb 10 Posts: 1172 ID: 55391 Credit: 1,219,398,374 RAC: 1,416,880
There's always another bad host. Here's a guy helping all you people running SoB to come out as the 1st to return the result: http://www.primegrid.com/results.php?hostid=944350
Well, almost. That host has 48 logical cores. Still, it doesn't help to be running more tasks (71) than processors.
It was nice of him/her to abort one so that we could see that task's host details.
EDIT: He/she has 162 cores across 8 systems, all badly configured.
Wow, really trying to be first... this task ran with 24 cores, very inefficiently.
Scott Brown Volunteer moderator Project administrator Volunteer tester Project scientist
Joined: 17 Oct 05 Posts: 2420 ID: 1178 Credit: 20,151,826,606 RAC: 22,774,079
Wow, really trying to be first... this task ran with 24 cores, very inefficiently.
Unless you have one of those specific CPUs to show the inefficiency, you may want to use caution in criticizing a crunching choice. I would point out that 24 core (48 thread) box has 30MB of L3 cache. That means that an SoB unit will fit entirely in cache if running one at a time.
Efficiency depends on one's goals. With that configuration, I'd estimate that CPU can complete 29 or 30 SoB units within the 15 day challenge.
JimB Honorary cruncher
Joined: 4 Aug 11 Posts: 920 ID: 107307 Credit: 990,017,653 RAC: 50,599
Sorry off topic, Jim is he running a server or HEDT processor that can actually finish those or is this a case of boinc being over optimistic of a cache running user.
He's never contacted the server again since two hours after he got all those jobs. They're all abandoned as far as I can tell.
Plenty of examples of bad hosts but how about when several of them gang up on you?
https://www.primegrid.com/workunit.php?wuid=587855899
If you don't want to dive into the rabbit hole here's the timeline:
23 Nov. PSP task sent to myself and wingman
24 Nov. I return task
5 Dec. wingman returns task - validation inconclusive
wingman has 11 tasks; 5 inconclusive, 5 invalid, 1 valid. Pretty sure his is invalid.
5 Dec. wingman2 gets task
27 Dec. wingman2 times out
wingman2 has 18 tasks, all error (timed out)
29 Dec. wingman3 gets task (this is my current wingman)
wingman3 has 100 tasks; 53 in process, 7 valid, 40 aborted.
Of the 53 in process there are 2 CUL, 18 PSP, 26 SOB and 7 WOO
This is on a Ryzen 7 1800X (8/16) CPU
No tasks received after 29 Dec. so maybe I get lucky and the task gets aborted before too much longer.
Looking forward to meeting wingman4!
mikey
Joined: 17 Mar 09 Posts: 1905 ID: 37043 Credit: 830,859,386 RAC: 799,198
Plenty of examples of bad hosts but how about when several of them gang up on you?
https://www.primegrid.com/workunit.php?wuid=587855899
If you don't want to dive into the rabbit hole here's the timeline:
No tasks received after 29 Dec. so maybe I get lucky and the task gets aborted before too much longer.
Looking forward to meeting wingman4!
I wish there was a way to send the task to a reliable host after the 2nd wingman doesn't finish a WU. That would make the wait time A LOT less, but would also mean that that 3rd wingman CAN'T be #1.
|
|
I wish there was a way to send the task to a reliable host after the 2nd wingman doesn't finish a WU. That would make the wait time A LOT less, but would also mean that that 3rd wingman CAN'T be #1.
Yeah, I don't think people would want an increased chance of getting double checks simply because they run reliable hosts. It's really not that big of a deal, if I was waiting on this for a badge I would have already run more tasks.
The scary part is the max # of tasks is 15, which means it could actually go all the way to wingman14.
I wonder if that has ever happened and what the protocol is for a maxed out WU.
Michael Goetz Volunteer moderator Project administrator
Joined: 21 Jan 10 Posts: 14043 ID: 53948 Credit: 481,383,442 RAC: 516,327
I wish there was a way to send the task to a reliable host after the 2nd wingman doesn't finish a WU. That would make the wait time A LOT less, but would also mean that that 3rd wingman CAN'T be #1.
Yeah, I don't think people would want an increased chance of getting double checks simply because they run reliable hosts.
That is exactly the reason we don't do this. It is an option we could turn on if we wanted to, but given that finishing first makes a difference at PrimeGrid, we would in effect be punishing people for running their computers reliably. Or, put another way, we would be encouraging people to intentionally be unreliable. This is definitely not what we want to do.
____________
My lucky number is 75898^524288+1
Monkeydee Volunteer tester
Joined: 8 Dec 13 Posts: 548 ID: 284516 Credit: 1,723,831,943 RAC: 3,275,944
I wonder if that has ever happened and what the protocol is for a maxed out WU.
Keep an eye on this one. It's on person number 14 now.
http://www.primegrid.com/workunit.php?wuid=584794302
____________
My Primes
Badge Score: 4*2 + 6*2 + 7*1 + 8*11 + 9*1 + 11*3 + 12*1 = 169
| |
|
Michael Goetz Volunteer moderator Project administrator
 Send message
Joined: 21 Jan 10 Posts: 14043 ID: 53948 Credit: 481,383,442 RAC: 516,327
                               
|
The scary part is that the max # of total tasks is 15, which means it could actually go all the way to wingman14.
I wonder if that has ever happened and what the protocol is for a maxed out WU.
If a task hits 15, it will automatically get bumped up to 60.
Before it hits 60, somebody is going to notice and ask us, "Is there something wrong here?"
For your amusement, here's the distribution of task numbers for the entire result database:
+----+----------+
| x | count(*) |
+----+----------+
| _0 | 569693 |
| _1 | 569693 |
| _2 | 112685 |
| _3 | 34418 |
| _4 | 12263 |
| _5 | 5655 |
| _6 | 3053 |
| _7 | 1707 |
| _8 | 962 |
| _9 | 533 |
| 10 | 269 |
| 11 | 142 |
| 12 | 83 |
| 13 | 40 |
| 14 | 25 |
| 15 | 13 |
| 16 | 9 |
| 17 | 5 |
| 18 | 3 |
| 19 | 1 |
| 20 | 1 |
+----+----------+
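The counts in the table fall off roughly geometrically after the initial pair. A minimal sketch, using only the numbers from the table above, shows how often a workunit that reached task _n also needed a task _n+1:

```python
# Counts per task slot (_0 through 20), copied from the table above.
counts = [569693, 569693, 112685, 34418, 12263, 5655, 3053, 1707,
          962, 533, 269, 142, 83, 40, 25, 13, 9, 5, 3, 1, 1]

# Every workunit starts with two tasks (_0 and _1), hence the equal
# first two counts. From _1 on, the ratio count[n+1]/count[n] is the
# fraction of workunits that needed yet another task.
for n in range(1, len(counts) - 1):
    ratio = counts[n + 1] / counts[n]
    print(f"_{n} -> _{n + 1}: {ratio:.2f}")
```

Only about 20% of workunits ever need a third task; conditional on reaching a given slot, the chance of needing yet another resend climbs toward 50-60%, presumably because stubborn workunits keep landing on bad hosts. That is why double-digit task numbers are so rare.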
EDIT: It just so happens that the workunit with the _19 and _20 tasks is complete, so you are able to look at it:
http://www.primegrid.com/workunit.php?wuid=589990008
____________
My lucky number is 75898524288+1 | |
|
dukebgVolunteer tester
 Send message
Joined: 21 Nov 17 Posts: 242 ID: 950482 Credit: 23,670,125 RAC: 0
                  
|
I wish there was a way to send the task to a reliable host after the 2nd wingman doesn't finish a wu, that would make the wait time ALOT less, but would also mean that that 3rd wingman CAN'T be #1.
Actually, if you think about the (sub-)project globally, it won't change the time at all. If a bad host doesn't have a task in unit X, it would just have a task in unit Y. It still requests a unit all the same. This proposal doesn't make the overall total sum of wait time any less; you're moving that wait time to a different workunit. Sure, it makes units be completed more "reliably", but then you'll just have wait times in more of the other units that would otherwise only ever see 2 tasks.
This is of course in addition to what's said above.
As for the longest-running straggler units, [from his words somewhere on the forum or Discord, though I think it was just about LLR] Jim sometimes runs some of them manually and adds the result to the db to keep things going. So the "completing reliably" part is covered that way to a sufficient extent, I'd say. | |
|
|
http://www.primegrid.com/workunit.php?wuid=589990008
Summary (skipping eighteen unsuccessful users): User "Denis*" gets the task more than six days after it was first sent out. He gets task 13 of that workunit (WU)! He spends more than 3 days to complete the task, but is still first by a huge margin. User "jun" gets the task almost 27 days after it was first sent out. This is task no. 20 of the same WU! He finishes in less than one day and gets credit as double checker. /JeppeSN | |
|
|
If a task hits 15, it will automatically get bumped up to 60.
???
So "max # of error/total/success tasks 15, 15, 5" is a total fabrication? Fiction? Falsehood?
I... I don't know who to trust anymore.
Seriously though, why not just say 60 if that's what it really is? | |
|
Michael Goetz Volunteer moderator Project administrator
 Send message
Joined: 21 Jan 10 Posts: 14043 ID: 53948 Credit: 481,383,442 RAC: 516,327
                               
|
If a task hits 15, it will automatically get bumped up to 60.
???
So "max # of error/total/success tasks 15, 15, 5" is a total fabrication? Fiction? Falsehood?
I... I don't know who to trust anymore.
Seriously though, why not just say 60 if that's what it really is?
I think there might only be one person who knows the answer to that question, and it's not me.
The only thing I can think of is that it makes it much easier to spot workunits that have a lot of failed tasks, but that's just a guess.
____________
My lucky number is 75898524288+1 | |
|
compositeVolunteer tester Send message
Joined: 16 Feb 10 Posts: 1172 ID: 55391 Credit: 1,219,398,374 RAC: 1,416,880
                        
|
Wow, really trying to be first... this task ran with 24 cores very inefficient.
Unless you have one of those specific CPUs to show the inefficiency, you may want to use caution in criticizing a crunching choice. I would point out that the 24-core (48-thread) box has 30 MB of L3 cache. That means that an SoB unit will fit entirely in cache if running one at a time.
Efficiency depends on one's goals. With that configuration, I'd estimate that CPU can complete 29 or 30 SoB units within the 15 day challenge.
It was a sub-optimal decision to use 24 cores for one task. That CPU has 12 physical cores, so it's a 2-socket system which actually has 60 MB of L3 cache. He should be running 2 tasks simultaneously with 12 threads each if he wants to maximize throughput. | |
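As a rough sanity check on the cache claim (the exponent size and word packing below are illustrative assumptions, not figures from the thread), the working set of a single SoB LLR task can be estimated like this:

```python
# Back-of-envelope estimate of one SoB LLR task's FFT working set.
# ASSUMPTIONS (not from the thread): a candidate k*2^n+1 with n around
# 31M, and roughly 17 bits packed per double in the weighted FFT.
n_bits = 31_000_000
bits_per_double = 17
fft_length = n_bits / bits_per_double      # ~1.8M doubles
working_set_mb = fft_length * 8 / 2**20    # 8 bytes per double
print(f"~{working_set_mb:.0f} MB per FFT array")
```

Even allowing for LLR keeping more than one such array live, a single task is in the ballpark of a 30 MB L3, which is the point being made.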
|
Scott Brown Volunteer moderator Project administrator Volunteer tester Project scientist
 Send message
Joined: 17 Oct 05 Posts: 2420 ID: 1178 Credit: 20,151,826,606 RAC: 22,774,079
                                                
|
Wow, really trying to be first... this task ran with 24 cores very inefficient.
Unless you have one of those specific CPUs to show the inefficiency, you may want to use caution in criticizing a crunching choice. I would point out that the 24-core (48-thread) box has 30 MB of L3 cache. That means that an SoB unit will fit entirely in cache if running one at a time.
Efficiency depends on one's goals. With that configuration, I'd estimate that CPU can complete 29 or 30 SoB units within the 15 day challenge.
It was a sub-optimal decision to use 24 cores for one task. That CPU has 12 physical cores, so it's a 2-socket system which actually has 60 MB of L3 cache. He should be running 2 tasks simultaneously with 12 threads each if he wants to maximize throughput.
Maybe, but you are making far too many assumptions about how he is running. The task output simply says the work unit completed using 24 "cores". Since BOINC sees threads as equal to cores, it is entirely possible that the user is indeed running 2 units at a time. Running on threads is usually not a good idea, but some have reported better times using cores and threads combined on larger LLR tasks such as the SoB work. I have not ever tested that particular CPU, so I cannot say one way or the other.
I can say that, in my experience, running the combined cores on dual-Xeon machines does sometimes increase the performance of LLR work (and sometimes not), so that could be possible in this case.
| |
|
mikey Send message
Joined: 17 Mar 09 Posts: 1905 ID: 37043 Credit: 830,859,386 RAC: 799,198
                     
|
I wish there was a way to send the task to a reliable host after the 2nd wingman doesn't finish a wu, that would make the wait time a LOT less, but would also mean that the 3rd wingman CAN'T be #1.
Yeah, I don't think people would want an increased chance of getting double checks simply because they run reliable hosts. It's really not that big of a deal, if I was waiting on this for a badge I would have already run more tasks.
See, I thought of it another way: not being #1, but getting credits far faster than waiting for 15 or even 60, or whatever, other people to have it fail while I, the #1 person, am still waiting for someone to finish it so I can get my credits. It takes two people for the #1 finisher to get anything, and the double-check person gets their name mentioned as well when something is found.
I do understand what others have said about why it is the way it is though. | |
|
compositeVolunteer tester Send message
Joined: 16 Feb 10 Posts: 1172 ID: 55391 Credit: 1,219,398,374 RAC: 1,416,880
                        
|
I wish there was a way to send the task to a reliable host after the 2nd wingman doesn't finish a wu, that would make the wait time a LOT less, but would also mean that the 3rd wingman CAN'T be #1.
Yeah, I don't think people would want an increased chance of getting double checks simply because they run reliable hosts. It's really not that big of a deal, if I was waiting on this for a badge I would have already run more tasks.
See, I thought of it another way: not being #1, but getting credits far faster than waiting for 15 or even 60, or whatever, other people to have it fail while I, the #1 person, am still waiting for someone to finish it so I can get my credits. It takes two people for the #1 finisher to get anything, and the double-check person gets their name mentioned as well when something is found.
I do understand what others have said about why it is the way it is though.
Mikey, I think your view isn't correct except in certain situations. You get pending credits while your WU is waiting for a valid wingman's task. You don't have to wait for someone else unless you want to be certain of having valid credits.
In fact, if you always want to have credits "right away", then technically the better thing to do is not be #1, so run with a large cache of work. It was said earlier in this thread why we don't have a preference for issuing new WUs to reliable hosts, but you've just presented a use case for having a user setting in PrimeGrid to prefer being the double-checker, regardless of whether your host is reliable or not.
If you are just impatient to see the next badge, then you are free to compute more tasks than needed. Or if you want to see your ranking increase on external stats, I believe pending credit is not counted, because your result might yet be invalid.
For PrimeGrid challenge placement, you get instant credit, which is revoked later if your task turns out to be invalid.
And for SoB double check work that already had a valid checksum loaded, you would be the first prime reporter, so trying to be first was pointless until we started digging into the range having no prior results. | |
|
compositeVolunteer tester Send message
Joined: 16 Feb 10 Posts: 1172 ID: 55391 Credit: 1,219,398,374 RAC: 1,416,880
                        
|
Wow, really trying to be first... this task ran with 24 cores very inefficient.
Unless you have one of those specific CPUs to show the inefficiency, you may want to use caution in criticizing a crunching choice. I would point out that the 24-core (48-thread) box has 30 MB of L3 cache. That means that an SoB unit will fit entirely in cache if running one at a time.
Efficiency depends on one's goals. With that configuration, I'd estimate that CPU can complete 29 or 30 SoB units within the 15 day challenge.
It was a sub-optimal decision to use 24 cores for one task. That CPU has 12 physical cores, so it's a 2-socket system which actually has 60 MB of L3 cache. He should be running 2 tasks simultaneously with 12 threads each if he wants to maximize throughput.
Maybe, but you are making far too many assumptions about how he is running. The task output simply says the work unit completed using 24 "cores". Since BOINC sees threads as equal to cores, it is entirely possible that the user is indeed running 2 units at a time. Running on threads is usually not a good idea, but some have reported better times using cores and threads combined on larger LLR tasks such as the SoB work. I have not ever tested that particular CPU, so I cannot say one way or the other.
I can say that, in my experience, running the combined cores on dual-Xeon machines does sometimes increase the performance of LLR work (and sometimes not), so that could be possible in this case.
Yes, you are right, assumptions are unwarranted since we don't know how his workload is set up. How many times have I already said that I get better performance using HT for some workloads in Linux? One thing we do know is that the host is running Windows, and we suspect that Linux has better processor affinity for workloads than Windows.
EDIT: How does dual-socket Xeon maintain cache coherency between sockets? | |
|
Michael Goetz Volunteer moderator Project administrator
 Send message
Joined: 21 Jan 10 Posts: 14043 ID: 53948 Credit: 481,383,442 RAC: 516,327
                               
|
mikey wrote: See I thought of it another way, ... It takes two people for the #1 finisher to get anything and the double check person gets their name mentioned as well when something is found.
With regards to names being mentioned, if there's a pending prime large enough to warrant an announcement, remember that we can see both the pending prime and the trickles from the wingman. We usually run a double check on the prime ourselves, and assuming the prime is confirmed we validate the prime without a wingman. You won't wait an extended period of time when you find a large prime.
We don't do that for smaller primes -- but these are smaller tasks with shorter deadlines and lower failure rates.
____________
My lucky number is 75898524288+1 | |
|
|
And for SoB double check work that already had a valid checksum loaded, you would be the first prime reporter, so trying to be first was pointless until we started digging into the range having no prior results.
For an SoB double check workunit that already had a valid checksum loaded, what you do is confirm the candidate is composite, so no-one is a prime reporter.
For SoB with an imported result that later turns out to be invalid, I guess there is never a "race". First one person gets the task alone (no race), and that user disagrees with the imported residue. Then another person gets a triple-check task (alone), to find out who was right. I suppose two people never work simultaneously on a workunit that has an imported result. If a prime is found there, the finder has to thank only the goddess Fortuna, not his superior hardware.
/JeppeSN | |
|
mikey Send message
Joined: 17 Mar 09 Posts: 1905 ID: 37043 Credit: 830,859,386 RAC: 799,198
                     
|
In fact, if you always want to have credits "right away", then technically the better thing to do is not be #1, so run with a large cache of work. It was said earlier in this thread why we don't have a preference for issuing new WUs to reliable hosts, but you've just presented a use case for having a user setting in PrimeGrid to prefer being the double-checker, regardless of whether your host is reliable or not.
If you are just impatient to see the next badge, then you are free to compute more tasks than needed. Or if you want to see your ranking increase on external stats, I believe pending credit is not counted, because your result might yet be invalid.
That's essentially what I do now; I have a 1.5-day wu cache and am mostly interested in getting bigger badges at a slow and steady pace.
As for the setting you mentioned, yes, I would use it, but as others have said I could be in the VERY small minority of people who might, so it may not be worth the trouble to implement. If you get 100 people to sign up for it, is that really worth all the troubleshooting needed to put it in place? I would have to say no, it is not; the project works as is and MOST people by far are happy with it, including me. | |
|
mikey Send message
Joined: 17 Mar 09 Posts: 1905 ID: 37043 Credit: 830,859,386 RAC: 799,198
                     
|
mikey wrote: See I thought of it another way, ... It takes two people for the #1 finisher to get anything and the double check person gets their name mentioned as well when something is found.
With regards to names being mentioned, if there's a pending prime large enough to warrant an announcement, remember that we can see both the pending prime and the trickles from the wingman. We usually run a double check on the prime ourselves, and assuming the prime is confirmed we validate the prime without a wingman. You won't wait an extended period of time when you find a large prime.
We don't do that for smaller primes -- but these are smaller tasks with shorter deadlines and lower failure rates.
I had no clue you did it that way...thanks for the peek behind the curtain. | |
|
Scott Brown Volunteer moderator Project administrator Volunteer tester Project scientist
 Send message
Joined: 17 Oct 05 Posts: 2420 ID: 1178 Credit: 20,151,826,606 RAC: 22,774,079
                                                
|
EDIT: How does dual-socket Xeon maintain cache coherency between sockets?
That baffles me a bit as well.
| |
|
compositeVolunteer tester Send message
Joined: 16 Feb 10 Posts: 1172 ID: 55391 Credit: 1,219,398,374 RAC: 1,416,880
                        
|
EDIT: How does dual-socket Xeon maintain cache coherency between sockets?
That baffles me a bit as well.
Cache coherency protocol. It's complicated.
Not only that, what we see as the logical layout of the cores on a single chip doesn't even correspond to the physical layout. A 12-core Haswell-EP chip has 8 cores on one ring bus and 4 on the other. According to the article,
However, mapping the asymmetrical chip layout to a balanced NUMA topology creates performance variations between the cores with latency reductions between 5 and 15% depending on the location of the core on the chip. | |
|
|
https://www.primegrid.com/results.php?hostid=514984&offset=0&show_names=0&state=0&appid=
This guy is killing me... | |
|
Michael Goetz Volunteer moderator Project administrator
 Send message
Joined: 21 Jan 10 Posts: 14043 ID: 53948 Credit: 481,383,442 RAC: 516,327
                               
|
https://www.primegrid.com/results.php?hostid=514984&offset=0&show_names=0&state=0&appid=
This guy is killing me...
Totally different topic. His computer works fine -- he just has too large a cache. Unlike errors, which take about 3 seconds, it takes several days for those tasks to be aborted.
There are a lot of hosts like this. Hosts with errors, however, are fairly uncommon.
____________
My lucky number is 75898524288+1 | |
|
|
https://www.primegrid.com/results.php?hostid=514984&offset=0&show_names=0&state=0&appid=
This guy is killing me...
On a plus side, you know you will probably be first on tasks you get done with him as your dc.
Cheers
____________
@AggieThePew
| |
|
|
https://www.primegrid.com/results.php?hostid=514984&offset=0&show_names=0&state=0&appid=
This guy is killing me...
On a plus side, you know you will probably be first on tasks you get done with him as your dc.
Cheers
Oddly enough, that hasn't happened, but I have multiple hosts on my end receiving his crud.
You know what else is upsetting? I can tune the crap out of my system, but I'm still not first because the server is issuing WUs 30 minutes apart from each other. How is this fair for me?
https://www.primegrid.com/workunit.php?wuid=595062192
I swear this system has it out for me to not find primes. I have 1 prime. It's very discouraging. | |
|
mackerel Volunteer tester
 Send message
Joined: 2 Oct 08 Posts: 2652 ID: 29980 Credit: 570,442,335 RAC: 5,621
                              
|
Can't look at single examples. Look at the long term average. You win some, you lose some. Making the system faster does swing the stats in your favour, perhaps more so for longer units where the send time difference makes less impact. | |
|
mikey Send message
Joined: 17 Mar 09 Posts: 1905 ID: 37043 Credit: 830,859,386 RAC: 799,198
                     
|
https://www.primegrid.com/results.php?hostid=514984&offset=0&show_names=0&state=0&appid=
This guy is killing me...
Totally different topic. His computer works fine -- he just has too large a cache. Unlike errors, which take about 3 seconds, it takes several days for those tasks to be aborted.
There are a lot of hosts like this. Hosts with errors, however, are fairly uncommon.
That's a case for limiting the max number of wu's a pc gets until it proves it can do them successfully, then it gets raised. Similar to the way the PSA wu's are setup but on a more automatic scale. ie a new pc gets a max of 100 wu's at a time then as the formula sees they are successfully completing and returning those wu's within 5 days, for example, they get 200 at a time. Obviously the number would change based on the sub-project as some have very long crunching times while others are much shorter. As people return more units they get more units, and they can get more at one time as well as their pc proves it can handle the work successfully.
I turned on some Genefer wu's the other day on one of my pc's and it promptly errored out every single one of them before I could turn them off. The pc would also get work from other projects but not be able to return it, so lots of errors there too...turns out the battery had died and the date was not current, which caused the problems, and everything is working correctly now. An algorithm would also stop that if it was smart enough; it would stop me from getting new units, or at least drastically reduce them, so I wasn't chewing through wu's just to have them error out. | |
|
Michael Goetz Volunteer moderator Project administrator
 Send message
Joined: 21 Jan 10 Posts: 14043 ID: 53948 Credit: 481,383,442 RAC: 516,327
                               
|
That's a case for limiting the max number of wu's a pc gets until it proves it can do them successfully, then it gets raised.
All BOINC servers already do that.
The system is very lenient, however, otherwise you would have massive problems with people not being able to get enough tasks the first time they try to do something.
The server handles recycling error tasks much better than the admins handle irate users who can't get enough tasks. :)
____________
My lucky number is 75898524288+1 | |
|
dthonon Volunteer tester
 Send message
Joined: 6 Dec 17 Posts: 435 ID: 957147 Credit: 1,764,432,883 RAC: 56,410
                                 
|
That's a case for limiting the max number of wu's a pc gets until it proves it can do them successfully, then it gets raised. Similar to the way the PSA wu's are setup but on a more automatic scale. ie a new pc gets a max of 100 wu's at a time then as the formula sees they are successfully completing and returning those wu's within 5 days, for example, they get 200 at a time.
That is not as easy as it seems.
Just for the sake of argument (but still very credible), let's imagine someone starts a 16-core server that crunches PPSE tasks in 15 min each, with constant internet access. 100 WUs will be correctly processed in about 1.5 hours, and then he will have to wait 5 days to get more. That would be very frustrating.
Or he starts a very old PC that is only connected to the Internet on some weekends and processes, still without errors, only a few tasks per week. 100 WUs will be correctly processed in months and he will not care about limitations.
Or he has a standard PC that has been working for years and his "credit" is in thousands of WUs. And a fan fails, or Windows updates a driver..., and tasks fail. And then should limits be lowered?
And so on.
BOINC is meant to manage tasks sent to computers ranging from very fast and reliable to very slow and unreliable. And that is what happens all the time.
So, looking at what happens with your wingperson is not very productive and a bit stressful:
- he is faster, and you get envious of his faster hardware
- he is a bit slower, and you forget about it
- he is much slower, or fails, and you get frustrated as your task does not get validated
- he failed and you get his task second-hand, and you get frustrated because it has already been processed first
It is better, in my opinion, to just forget about DC and try to process tasks as best as possible. On average, that will raise your chances of finding a prime. | |
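The arithmetic in the first scenario checks out; here is a quick sketch using only the numbers from that scenario (the 100-WU limit is the hypothetical from the discussion, not an actual server setting):

```python
# A 16-core host, PPSE tasks at 15 minutes each, a hypothetical 100-WU
# limit: how long until the limit is exhausted?
cores = 16
minutes_per_task = 15
wu_limit = 100

waves = wu_limit / cores                   # 6.25 batches of parallel tasks
hours = waves * minutes_per_task / 60
print(f"{hours:.2f} hours")                # about 1.5 hours, as stated
```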
|
robish Volunteer moderator Volunteer tester
 Send message
Joined: 7 Jan 12 Posts: 2223 ID: 126266 Credit: 7,968,032,238 RAC: 5,388,098
                               
|
That's a case for limiting the max number of wu's a pc gets until it proves it can do them successfully, then it gets raised.
All BOINC servers already do that.
The system is very lenient, however, otherwise you would have massive problems with people not being able to get enough tasks the first time they try to do something.
The server handles recycling error tasks much better than the admins handle irate users who can't get enough tasks. :)
🤣
____________
My lucky number 10590941048576+1 | |
|
robish Volunteer moderator Volunteer tester
 Send message
Joined: 7 Jan 12 Posts: 2223 ID: 126266 Credit: 7,968,032,238 RAC: 5,388,098
                               
|
That's a case for limiting the max number of wu's a pc gets until it proves it can do them successfully, then it gets raised. Similar to the way the PSA wu's are setup but on a more automatic scale. ie a new pc gets a max of 100 wu's at a time then as the formula sees they are successfully completing and returning those wu's within 5 days, for example, they get 200 at a time.
That is not as easy as it seems.
Just for the sake of argument (but still very credible), let's imagine someone starts a 16-core server that crunches PPSE tasks in 15 min each, with constant internet access. 100 WUs will be correctly processed in about 1.5 hours, and then he will have to wait 5 days to get more. That would be very frustrating.
Or he starts a very old PC that is only connected to the Internet on some weekends and processes, still without errors, only a few tasks per week. 100 WUs will be correctly processed in months and he will not care about limitations.
Or he has a standard PC that has been working for years and his "credit" is in thousands of WUs. And a fan fails, or Windows updates a driver..., and tasks fail. And then should limits be lowered?
And so on.
BOINC is meant to manage tasks sent to computers ranging from very fast and reliable to very slow and unreliable. And that is what happens all the time.
So, looking at what happens with your wingperson is not very productive and a bit stressful:
- he is faster, and you get envious of his faster hardware
- he is a bit slower, and you forget about it
- he is much slower, or fails, and you get frustrated as your task does not get validated
- he failed and you get his task second-hand, and you get frustrated because it has already been processed first
It is better, in my opinion, to just forget about DC and try to process tasks as best as possible. On average, that will raise your chances of finding a prime.
Well said.
____________
My lucky number 10590941048576+1 | |
|
|
Lol Robish are you working to get your post count up?
Cheers | |
|
|
Thanks for the conversation. BTW, I'm not irate, just frustrated. Anyway, it is what it is, and I'll just do my best, forget about the stressful stuff, and just have fun.
Cheers! | |
|
robish Volunteer moderator Volunteer tester
 Send message
Joined: 7 Jan 12 Posts: 2223 ID: 126266 Credit: 7,968,032,238 RAC: 5,388,098
                               
|
Lol Robish are you working to get your post count up?
Cheers
Absolutely! I need my shut up badge!! :)
____________
My lucky number 10590941048576+1 | |
|
mikey Send message
Joined: 17 Mar 09 Posts: 1905 ID: 37043 Credit: 830,859,386 RAC: 799,198
                     
|
That's a case for limiting the max number of wu's a pc gets until it proves it can do them successfully, then it gets raised.
All BOINC servers already do that.
The system is very lenient, however, otherwise you would have massive problems with people not being able to get enough tasks the first time they try to do something.
The server handles recycling error tasks much better than the admins handle irate users who can't get enough tasks. :)
I did not know that...thank you!! And I can only imagine the pm's from irate people who "can't get enough tasks", I do not envy you at all for having to deal with that!! | |
|
mikey Send message
Joined: 17 Mar 09 Posts: 1905 ID: 37043 Credit: 830,859,386 RAC: 799,198
                     
|
That's a case for limiting the max number of wu's a pc gets until it proves it can do them successfully, then it gets raised. Similar to the way the PSA wu's are setup but on a more automatic scale. ie a new pc gets a max of 100 wu's at a time then as the formula sees they are successfully completing and returning those wu's within 5 days, for example, they get 200 at a time.
That is not as easy as it seems.
Just for the sake of argument (but still very credible), let's imagine someone starts a 16-core server that crunches PPSE tasks in 15 min each, with constant internet access. 100 WUs will be correctly processed in about 1.5 hours, and then he will have to wait 5 days to get more. That would be very frustrating.
Or he starts a very old PC that is only connected to the Internet on some weekends and processes, still without errors, only a few tasks per week. 100 WUs will be correctly processed in months and he will not care about limitations.
Or he has a standard PC that has been working for years and his "credit" is in thousands of WUs. And a fan fails, or Windows updates a driver..., and tasks fail. And then should limits be lowered?
And so on.
BOINC is meant to manage tasks sent to computers ranging from very fast and reliable to very slow and unreliable. And that is what happens all the time.
So, looking at what happens with your wingperson is not very productive and a bit stressful:
- he is faster, and you get envious of his faster hardware
- he is a bit slower, and you forget about it
- he is much slower, or fails, and you get frustrated as your task does not get validated
- he failed and you get his task second-hand, and you get frustrated because it has already been processed first
It is better, in my opinion, to just forget about DC and try to process tasks as best as possible. On average, that will raise your chances of finding a prime.
That makes sense...thank you!!
I guess from a project's perspective, as long as people get wu's it's a good thing, because we users CAN be complainers if we don't. From a cruncher's perspective, most of the time the system just works and that's all we need too, but sometimes we get frustrated by the less-than-reliable hosts we encounter, and then the Project hears about that too. I don't envy Project people as they hear about it for various reasons, some valid and some perceived. I appreciate your patience and explanations...THANK YOU!! | |
|
|
There is also the other side of the problem, where an owner stores many days' worth of WUs and then aborts or times out hundreds of WUs all at once. The "bad host" is forcing others to clean up the WUs that piled up.
Computer
942077 created 27 Nov 2018 | 8:52:17 UTC
940375 created 31 Jan 2019 | 21:26:31 UTC
941358 created 31 Jan 2019 | 21:24:31 UTC
939466 created 1 Feb 2019 | 19:20:32 UTC
That's a case for limiting the max number of wu's a pc gets until it proves it can do them successfully, then it gets raised. Similar to the way the PSA wu's are setup but on a more automatic scale. ie a new pc gets a max of 100 wu's at a time then as the formula sees they are successfully completing and returning those wu's within 5 days, for example, they get 200 at a time.
That is not as easy as it seems.
Just for the sake of argument (but still very credible), let's imagine someone starts a 16-core server that crunches PPSE tasks in 15 min each, with constant internet access. 100 WUs will be correctly processed in about 1.5 hours, and then he will have to wait 5 days to get more. That would be very frustrating.
Or he starts a very old PC that is only connected to the Internet on some weekends and processes, still without errors, only a few tasks per week. 100 WUs will be correctly processed in months and he will not care about limitations.
Or he has a standard PC that has been working for years and his "credit" is in thousands of WUs. And a fan fails, or Windows updates a driver..., and tasks fail. And then should limits be lowered?
And so on.
BOINC is meant to manage tasks sent to computers ranging from very fast and reliable to very slow and unreliable. And that is what happens all the time.
So, looking at what happens with your wingperson is not very productive and a bit stressful:
- he is faster, and you get envious of his faster hardware
- he is a bit slower, and you forget about it
- he is much slower, or fails, and you get frustrated as your task does not get validated
- he failed and you get his task second-hand, and you get frustrated because it has already been processed first
It is better, in my opinion, to just forget about DC and try to process tasks as best as possible. On average, that will raise your chances of finding a prime.
That makes sense...thank you!!
I guess from a projects perspective as long as people get wu's it's a good thing because we users CAN be complainers if we don't. From a crunchers perspective most of the time the system just works and that's all we need too, but sometimes we get frustrated by the less than reliable hosts we encounter and then the Project hears about that too. I don't envy Project people as they hear about it for various reasons, some valid and some perceived. I appreciate your patience and explanations...THANK YOU!!
| |
|
|
Not if all we are running are "Aborted by user" tasks that were already completed days ago. Too funny.
Yes, we have to run them, but users need to know how this hurts finding primes for everyone.
Once all the aborted tasks are run, we should then get some fresh ones, and then maybe we will see some primes pop up.
Genefer 16 is now re-running the tasks from 10 Feb, caused by a lot of "Timed out - no response".
Looks like some may have found a cheat: "Error while computing".
Over 132 errors, all from second-run tasks from the 15th.
4,687 more errors, all on the 14th/15th.
554 of the same errors on Genefer 17 Mega, so somehow users have found a way, or the NVIDIA Quadro M2000M (4034MB) is a really bad graphics card.
Sad day for primes: only 2 found on the 14th. | |
|
|
Once all the Aborted Tasks are ran we should then get some fresh ones and then maybe we well see some Primes Pop Up.
Aborted tasks have the same chance of popping up as primes as other tasks. /JeppeSN | |
|
|
Once all the Aborted Tasks are ran we should then get some fresh ones and then maybe we well see some Primes Pop Up.
Aborted tasks have the same chance of popping up as primes as other tasks. /JeppeSN
Only on a first run; all the tasks I looked at were completed 5 days ago.
But yes, if an aborted task has never been run before, then yes. | |
|
|
Can a Volunteer moderator or Project administrator review this user's host:
https://www.primegrid.com/results.php?hostid=153120
I want to run PPS but am not sure this will work well for this event, with all the tasks being returned as invalid (1,896 validate errors).
Thanks
| |
|
composite Volunteer tester Send message
Joined: 16 Feb 10 Posts: 1172 ID: 55391 Credit: 1,219,398,374 RAC: 1,416,880
                        
|
Can a Volunteer moderator or Project administrator review this user's host:
https://www.primegrid.com/results.php?hostid=153120
I want to run PPS but am not sure this will work well for this event, with all the tasks being returned as invalid (1,896 validate errors).
Thanks
Windows error code 0xc0000005 is an access violation. There could be numerous causes.
See this blog post about possible causes and their solutions.
To rule out a bad DLL download from PrimeGrid or BOINC:
- detach that host from PrimeGrid
- uninstall BOINC
- erase the BOINC data directory
- reinstall BOINC
- reattach to PrimeGrid
- try to run PPS units again. If the error still happens, it's something else. | |
|
|
Can a Volunteer moderator or Project administrator review this user's host:
https://www.primegrid.com/results.php?hostid=153120
I want to run PPS but am not sure this will work well for this event, with all the tasks being returned as invalid (1,896 validate errors).
Thanks
Windows error code 0xc0000005 is an access violation. There could be numerous causes.
See this blog post about possible causes and their solutions.
To rule out a bad DLL download from PrimeGrid or BOINC:
- detach that host from PrimeGrid
- uninstall BOINC
- erase the BOINC data directory
- reinstall BOINC
- reattach to PrimeGrid
- try to run PPS units again. If the error still happens, it's something else.
Thank you. If I could contact the owner of this host I would let him know, but I cannot, since the account is anonymous.
But whether this could impact the challenge is what I really want to know before I start running PPS tasks. | |
|
JimB Honorary cruncher Send message
Joined: 4 Aug 11 Posts: 920 ID: 107307 Credit: 990,017,653 RAC: 50,599
                     
|
How does anything his computer is doing or not doing affect you in any way? | |
|
|
How does anything his computer is doing or not doing affect you in any way?
None, so running tasks only to fail 100% of the time is OK with you then.
OK I am moving on now. | |
|
Scott Brown Volunteer moderator Project administrator Volunteer tester Project scientist
 Send message
Joined: 17 Oct 05 Posts: 2420 ID: 1178 Credit: 20,151,826,606 RAC: 22,774,079
                                                
|
How does anything his computer is doing or not doing affect you in any way?
None, so running tasks only to fail 100% of the time is OK with you then.
OK I am moving on now.
Sorry, but I am not sure where the negativity is coming from. Errors on others' computers don't really affect other users or challenges in any significant manner.
If your question is about whether this user's error identifies some bug or other issue with the PPS project/program, the answer is no. For example, I ran PPS on my machines for about 3 days and received no errors at all. You should feel safe running any of the PPS projects as there are no current problems with any of them.
| |
|
|
How does anything his computer is doing or not doing affect you in any way?
None, so running tasks only to fail 100% of the time is OK with you then.
OK I am moving on now.
Sorry, but I am not sure where the negativity is coming from. Errors on others' computers don't really affect other users or challenges in any significant manner.
If your question is about whether this user's error identifies some bug or other issue with the PPS project/program, the answer is no. For example, I ran PPS on my machines for about 3 days and received no errors at all. You should feel safe running any of the PPS projects as there are no current problems with any of them.
It is not negativity on my part.
If PG is OK with a computer downloading tasks only to fail over and over, then OK.
I would think that you would care that a host is using your network bandwidth only to fail, with the tasks having to be reissued for someone else to run.
State: All 1,882; Invalid 1,874; Error 6.
100% of what this host is downloading is failing, that's all.
Sorry you took me as a negative member. | |
|
Scott Brown Volunteer moderator Project administrator Volunteer tester Project scientist
 Send message
Joined: 17 Oct 05 Posts: 2420 ID: 1178 Credit: 20,151,826,606 RAC: 22,774,079
                                                
|
It is not negativity on my part.
If PG is OK with a computer downloading tasks only to fail over and over, then OK.
I would think that you would care that a host is using your network bandwidth only to fail, with the tasks having to be reissued for someone else to run.
State: All 1,882; Invalid 1,874; Error 6.
100% of what this host is downloading is failing, that's all.
Mike or Jim know the details better than I do, but there are some limits in place to prevent such a host from causing problems for things like server bandwidth. Of course, we'd prefer things were working properly for everyone connected to PG, but sometimes that cannot be helped. Even the most diligent of us can have a machine go into error mode (e.g., while off on vacation, etc.) where we cannot correct it for some time.
Sorry you took me as a negative member.
Not as a negative member. Was just worried that you were getting frustrated over something that is less of a problem than you might perceive. It can be difficult to read the feelings behind typed text, sorry if I misinterpreted. | |
|
|
Not that I care for the machines that error out all their tasks, but I consider them a better option than the ones that "time out", or the ones that are kept for 3-4 days and then "mass aborted". In a way, it evens up the playing field, giving the slower GPUs/CPUs a chance against the faster machines and those with many, many machines. From my location (IL) it takes 12 seconds from completion, through "uploading" and "ready to report", to the result being sent to the server. I prefer my machine getting beat by hours or minutes, not mere seconds.
| |
|
|
https://www.primegrid.com/results.php?hostid=153120
Microsoft Windows Server 2003 "R2" Enterprise Server x86 Edition, Service Pack 2, (05.02.3790.00)
All Tasks are failing.
Validate error on all Tasks since 1 Feb 2019 | 12:06:42 UTC the Start of Tour de Primes
This host only seems to be trolling, for lack of a better word.
I would think that after 17 days the owner would have stopped downloading new tasks only to fail.
| |
|
|
All this talk about bad hosts could be done in the thread Bad hosts where it seems more on-topic than here in the 2019 TdP thread. (EDIT: That thread even starts from the discussion of the very same machine, hostid=153120.) /JeppeSN
https://www.primegrid.com/results.php?hostid=153120
Microsoft Windows Server 2003 "R2" Enterprise Server x86 Edition, Service Pack 2, (05.02.3790.00) Intel(R) Xeon(TM) CPU 2.40GHz [Family 15 Model 2 Stepping 9] 2 Cores
All Tasks are failing.
Validate error on all Tasks since 1 Feb 2019 | 12:06:42 UTC the Start of Tour de Primes
This host only seems to be trolling, for lack of a better word.
I would think that after 17 days the owner would have stopped downloading new tasks only to fail.
This is why I posted here: I think it is being done on purpose to disrupt the Tour de Primes. I am not being negative, only posting what I am seeing. | |
|
|
All this talk about bad hosts could be done in the thread Bad hosts where it seems more on-topic than here in the 2019 TdP thread. (EDIT: That thread even starts from the discussion of the very same machine, hostid=153120.) /JeppeSN
Please split off these posts as JeppeSN has suggested. This topic has been discussed before. There is more than one such host on PrimeGrid - this is not some lone rogue machine. It's nothing nefarious, nor a conspiracy. Some folks just never look at their machines or read the forums.
Or just Delete them and move on with the 2019 Tour de Primes.
I am fine with them being deleted. | |
|
|
https://www.primegrid.com/results.php?hostid=153120
Microsoft Windows Server 2003 "R2" Enterprise Server x86 Edition, Service Pack 2, (05.02.3790.00)
All Tasks are failing.
Validate error on all Tasks since 1 Feb 2019 | 12:06:42 UTC the Start of Tour de Primes
Thanks for the tip.
Switched my laptop to PPS. Poor thing is so slow I'm lucky to get 30% firsts on PPSE using -t 4.
Best chance it has is to grab initial tasks with hosts like this. | |
|
|
Switched my laptop to PPS... Best chance it has is to grab initial tasks with hosts like this.
Breakdown of 11 tasks done, "*" denotes host 153120 was involved (note "returned 1st" means wingman was still processing):
Not 1st
1. Wingman done before I received
2. Wingman was faster
*3. got 3rd task (received 50min after initial wingman who was faster anyway)
1st
1. .5hr early, faster
2. .5hr late, returned 1st
3. .5hr late, faster
4. 10min early, returned 1st
5. 4min early, returned 1st
*6. got 3rd task (received 12min before initial wingman, I would still have been faster head to head)
*7. got 3rd task, returned 1st (received .5hr after initial wingman)
*8. got 3rd task, returned 1st (received .5hr after initial wingman)
So far host 153120 has had no effect.
Still waiting to be its initial wingman. | |
|
|
Having been the owner of a bad host once in a while (I hate Windows automatic updates deciding to break a video driver), I can say that I understand the frustration. I also know that because both the CPU and GPU were doing work, and the CPU was not failing, the GPU was being fed regularly as I was sending back successful work on the CPU. I know the backend of this software is designed not to care about bad hosts, because a failed task is just a quick couple of bits of record-keeping in the scheme of things. It is meant to work around these issues rather than deal with them; the trouble is that many would call foul on most of the ways of dealing with them, and the current way seems the least problematic to most. You are going to be either the first or second wingman on most tasks anyway. Time on these tasks has not a lot of relevance; a prime will be found when it is found. A lot of it is luck, mainly. Even the fastest devices are not always first, because sometimes they get too many tasks and do not return them all first. We are all here to do a small part in a large project, and that is what I feel I am doing. Even when I find nothing, I am helping others find things, because I helped remove the stuff that was not a find. | |
|
|
This host was created back on 14 Jun 2010 | 5:58:56 UTC
I think it is time to stop this host from being able to get new tasks:
https://www.primegrid.com/show_host_detail.php?hostid=153120
It is causing issues: we have to run his failed tasks today, and we are not getting NEW ones to find primes.
There are no Windows updates to cause this issue on this host:
Microsoft Windows Server 2003 "R2" Enterprise Server x86 Edition, Service Pack 2, (05.02.3790.00)
Because the owner is anonymous, no one can send a PM to let them know of the issue. Only a Volunteer moderator or Project administrator can fix this, and they seem not to care. | |
|
dukebg Volunteer tester
 Send message
Joined: 21 Nov 17 Posts: 242 ID: 950482 Credit: 23,670,125 RAC: 0
                  
|
It is causing issues: we have to run his failed tasks today, and we are not getting NEW ones to find primes.
From the perspective of finding primes, there's no difference if the workunit had or had not a failed task from a bad host. Running these tasks is no different from running new ones from the perspective of finding primes.
From the perspective of being 1st it also doesn't make that much difference: you have equal chances of being the initial wingman or the delayed wingman.
All of this is a reiteration of what was already said. | |
|
|
It is causing issues: we have to run his failed tasks today, and we are not getting NEW ones to find primes.
From the perspective of finding primes, there's no difference if the workunit had or had not a failed task from a bad host. Running these tasks is no different from running new ones from the perspective of finding primes.
From the perspective of being 1st it also doesn't make that much difference: you have equal chances of being the initial wingman or the delayed wingman.
All of this is a reiteration of what was already said.
Being 1st gets the prime; how can you say that it makes no difference?
Then why have a Tour de Primes, if it is spent completing all the tasks that management can't handle, or if they just don't care that a member's computer is failing 100% of the time?
It makes no sense that you say being 1st doesn't make that much difference.
And I thought that Tour de Primes was a time to find the most primes.
And then run all the failed/aborted tasks the rest of the year. Pun intended. | |
|
composite Volunteer tester Send message
Joined: 16 Feb 10 Posts: 1172 ID: 55391 Credit: 1,219,398,374 RAC: 1,416,880
                        
|
Being 1st gets the prime; how can you say that it makes no difference?
Then why have a Tour de Primes, if it is spent completing all the tasks that management can't handle, or if they just don't care that a member's computer is failing 100% of the time?
It makes no sense that you say being 1st doesn't make that much difference.
I think you misunderstood. The prime finder is the one that returns the first VALID result. A failing host doesn't return valid results.
If your machine is paired on a workunit with a failing host, you then have a 50% chance of being selected as the first wingman, and 50% as the second wingman.
So for a workunit containing a failed result, the relative speeds of the first and second machines returning the valid results don't much matter.
This contrasts with a workunit with no failed results, where the significantly faster machine has a distinct advantage to be first to return a valid result.
This advantage evaporates for workunits containing a failed result, and the downside for a fast machine is an upside for a slow machine.
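This argument can be checked with a tiny Monte Carlo sketch. All numbers are made up for illustration (runtimes in minutes for a "fast" and a "slow" host, plus a fixed reissue delay after the failed result), and `simulate` is a hypothetical helper, not PrimeGrid code:

```python
import random

def simulate(n_trials=20_000, fast=10.0, slow=30.0, reissue_delay=60.0):
    """Fraction of workunits 'won' by the fast host when a failing host
    errors out immediately and one of the two valid wingmen only gets
    the task after a reissue delay."""
    fast_wins = 0
    for _ in range(n_trials):
        hosts = [("fast", fast), ("slow", slow)]
        random.shuffle(hosts)  # which valid host drew the initial task is luck
        (initial_name, initial_rt), (delayed_name, delayed_rt) = hosts
        # Initial wingman starts at t=0; the other starts after the reissue delay.
        if initial_rt < reissue_delay + delayed_rt:
            winner = initial_name
        else:
            winner = delayed_name
        if winner == "fast":
            fast_wins += 1
    return fast_wins / n_trials
```

Under these toy numbers, with no failed result (`reissue_delay=0`) the fast host returns first every time; with a reissue delay that dominates the speed difference, each valid wingman wins about half the time, which is exactly the "evaporated advantage" described above.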
| |
|
|
I think you misunderstood. The prime finder is the one that returns the first VALID result. A failing host doesn't return valid results.
Bingo. | |
|
|
Being 1st gets the Prime how can you say that it makes no difference, how can you say this.
I think you misunderstood. The prime finder is the one that returns the first VALID result. A failing host doesn't return valid results.
Aborted or Failed tasks don't somehow taint the workunit and make it impossible for anyone other than the initial wingmen to find a prime.
Here is an example:
https://www.primegrid.com/workunit.php?wuid=600033850
The 3rd wingman got credit for finding the prime:
1st one received at 8:22
2nd one received at 8:24 and then aborted at 8:31
3rd one received at 8:41 <- 1st to return a valid task, is the prime finder
And guess what, that is one of YOUR primes.
If you can be the prime finder as 3rd wingman on a workunit whose initial task was aborted after 7 minutes, can you understand how tasks failing after a few seconds don't really have any effect on your ability to find a prime? | |
|
Message boards :
Number crunching :
Bad Hosts |