I have a question regarding my hostid=274101.
It is an i7-2600 host running Ubuntu 12.04.2 LTS (GNU/Linux 3.2.0-38-generic x86_64).
I am currently running LLR tasks, recently PPS LLR.
Only this subproject is selected in the current preferences for this host.
Since the LLR upgrade to 3.8.9, BOINC Manager (version 7.0.28) sometimes stops getting new tasks, with this error message:
PrimeGrid 2013-02-27 23:40:59 Message from server: This project doesn't support computers of type x86_64-pc-linux-gnu
PrimeGrid 2013-02-28 11:55:59 Message from server: This project doesn't support computers of type x86_64-pc-linux-gnu
I have observed it for the last three days, once per day at various times.
Am I the only one who gets this message from the server?
____________
|
|
|
Michael Goetz Volunteer moderator Project administrator
Joined: 21 Jan 10 Posts: 13804 ID: 53948 Credit: 345,369,032 RAC: 2,648
|
If you manually tell the computer to update a minute or two later, does the problem go away?
____________
My lucky number is 75898524288+1 |
|
|
Crun-chi Volunteer tester
Joined: 25 Nov 09 Posts: 3114 ID: 50683 Credit: 76,797,694 RAC: 4,051
|
Add this to your cc_config.xml:
<cc_config>
<options>
<alt_platform>i686-pc-linux-gnu</alt_platform>
</options>
</cc_config>
____________
92*10^1439761-1 NEAR-REPDIGIT PRIME :) :) :)
4 * 650^498101-1 CRUS PRIME
314187728^131072+1 GENERALIZED FERMAT
Proud member of team Aggie The Pew. Go Aggie! |
|
|
|
Yes, when I notice the problem and update manually, it downloads new WUs without any problem.
Another fact: this host has HT off, so it crunches 4 tasks at once. The additional work buffer is set to 0, so it repeats updates quite often...
I have observed this message only 3 or 4 times, so it is not happening on a large scale, but still...
____________
|
|
|
|
I've seen that happen (under Windows 7) right after the server update every morning (around 8 PM UTC, I think). You're not alone....
____________
676754^262144+1 is prime |
|
|
Michael Goetz Volunteer moderator Project administrator
|
I *think* this is related to tuning the server's shmem memory buffer, which is used to stage results ready to be sent out.
As designed by BOINC, this, by default, is set to hold 200 tasks. Until a few days ago, PrimeGrid had it set to 1000. It's currently 5000.
However, since the default of 200 was established, CPUs have grown from 1 core to (for an i7) 8 cores. So when a computer gets hungry, it wants a lot more tasks than it used to.
Also, instead of running 1 app, we're running 12 apps, and each app gets a fixed fraction of the total buffer. Until a few minutes ago, PPS-LLR had about 700 of the tasks in the buffer. However, if 200 was a good default for 1 app when single-core CPUs were common, then 1600 (200 * 8) is probably a better number when the most common number of cores amongst all the hosts is 8.
It may be that the buffer is simply too small, and if a bunch of computers request work at once it's just getting depleted. It gets refilled within a few seconds, but if you hit the server at just the right time you may get some kind of error about not having any work. I've increased the number of PPS-LLR tasks in the buffer. PPS-LLR and SGS now have about 1450 tasks in the buffer. Let me know if you continue to see this error, and, if so, in which projects. I may need to increase the total buffer size beyond 5000. My gut tells me it should be at least 20,000 (200 * 8 * 12 = 19,200, rounded up) and I may increase it again.
Ironically, part of the problem may be that the server is faster than it used to be. It can service client requests much faster, and can, therefore, empty the buffer much faster than before.
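The sizing estimate in this post can be checked with a few lines of Python. This is only a sketch of the arithmetic: the constant names and the `per_app_share` helper are made up for illustration, and the equal split is an assumption (the post says each app gets a fixed fraction, and the figures quoted suggest the fractions are not equal).

```python
# Figures quoted in the post above.
DEFAULT_SLOTS_PER_APP = 200   # BOINC's default buffer size, per the post
CORES_PER_HOST = 8            # "for an i7", counting hyper-threaded cores
APP_COUNT = 12                # number of apps the project runs

# Scale the old single-core default up to 8-core hosts.
per_app_target = DEFAULT_SLOTS_PER_APP * CORES_PER_HOST

# Then scale again for 12 apps sharing one buffer.
total_target = per_app_target * APP_COUNT


def per_app_share(total_slots, app_count):
    """Slots per app under an equal split (an assumption, see lead-in)."""
    return total_slots // app_count


print(per_app_target)            # 1600
print(total_target)              # 19200
print(per_app_share(5000, 12))   # 416
```

Under an equal split, a 5000-slot buffer would give each app only about 416 slots, well short of the 1600 estimated above.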
____________
My lucky number is 75898524288+1 |
|
|
|
Right now, I had 2 machines with the same issue. I have been seeing this since you've introduced the x64 version of LLR for other platforms.
It's quite annoying, as the machine stops crunching and also doesn't send out the crunched results.
I have this issue on 4 Win7Pro 64bit machines, running PPS (LLR) and SGS
____________
|
|
|
|
This is the message I get: PrimeGrid | Bericht van de server: Dit project ondersteund geen computers van het type windows_x86_64
Which means: Message from server: This project doesn't support computers of the type windows_x86_64
I have never seen this issue before.
Last time this happened was 28-03-2013 @ 21:15 CET.
____________
|
|
|
Michael Goetz Volunteer moderator Project administrator
|
It's possible this problem isn't related to the memory buffer, but is a bug in the scheduler code.
We'll address that at some point (this is quite important to us), but it will take some time. Not only is this going to be somewhat difficult, but there are more urgent needs right now (which I'm not going to talk about just yet.)
____________
My lucky number is 75898524288+1 |
|
|
|
Just to say that I'm affected by this issue too. Every couple of days one of my PCs tends to get the message described by [DPC]Division_Brabant~TFH|Fony (also with Windows 7 x64), with the same frustrating symptoms - i.e. no new work requested and no results reported (even though I have report immediately in my cc_config files). This is a significant problem when people are away from one or more of their hosts for any period of time, so I'm glad that fixing it is a priority.
More urgent needs? Blimey, I'm starting to fear the worst! Hope everything is worked out, and also that everything doesn't get too stressful.
I italicise the word "too" there, because of course we've got to keep you mods on your toes now, haven't we ;) |
|
|
|
Just as another data point, I have a computer using app_info.xml, and it got a log message "this project doesn't support computers of type anonymous" yesterday. I don't believe I've ever seen this before. Through random chance I happened to check on the box just a few minutes after it occurred, and a boinc manager "update" cured it.
--Gary |
|
|
|
In my case, this is the first occurrence since the server's memory buffer was extended. So the frequency has dropped, but the problem still exists.
PrimeGrid 2013-03-05 01:21:02 Message from server: This project doesn't support computers of type x86_64-pc-linux-gnu
____________
|
|
|
|
I had similar problems yesterday, with all tasks not reporting even with 'report tasks immediately' in my cc_config.xml and ran out of tasks. Also had 'Reporting xx completed tasks, not requesting new tasks' even when I was out of tasks. This started shortly after I completed a couple of GeneferCUDA tasks on my new GTX 470. This happened on all 3 of my systems with Win 7 x64 OS at the same time. A manual update on each cleared up everything. They did upload the finished files but didn't report the tasks themselves, so no network problems.
____________
Largest Primes to Date:
As Double Checker: SR5 109208*5^1816285+1 Dgts-1,269,534
As Initial Finder: SR5 243944*5^1258576-1 Dgts-879,713
|
|
|
|
. . . received today, timestamped 10:36 PST. To wit:
"This project doesn't support computers of type windows_x86_64"
That would be disheartening news to a great many users ;)
Bw
_
____________
|
|
|
|
And again one machine stopped crunching... And also with the nice message: project backed up for 24hrs :-(
C'mon guys... A problem like this should be on the highest place on your priority list.
____________
|
|
|
Michael Goetz Volunteer moderator Project administrator
|
Just to be clear, this is a relatively new phenomenon, correct? You never saw this problem before a few weeks ago, right?
I'm going to undo one of the changes we did -- tell me if the problem keeps occurring.
Mike
____________
My lucky number is 75898524288+1 |
|
|
|
This was the first occurrence for me, and contrary to my assumptions made in that post, it did happen several more times. |
|
|
|
Just to be clear, this is a relatively new phenomenon, correct? You never saw this problem before a few weeks ago, right?
I'm going to undo one of the changes we did -- tell me if the problem keeps occurring.
Mike
I can't give you an exact date, but my guess is that it started when you introduced the other 64bit version. I will keep an eye open on my machines and let you know if they give this error again :)
____________
|
|
|
rroonnaalldd Volunteer developer Volunteer tester
Joined: 3 Jul 09 Posts: 1213 ID: 42893 Credit: 34,634,263 RAC: 0
|
I can't give you an exact date, but my guess is that it started when you introduced the other 64bit version.
I think I have an exact date. The problem started after the thread "Server moves complete!..." was posted.
____________
Best wishes. Knowledge is power. by jjwhalen
|
|
|
Michael Goetz Volunteer moderator Project administrator
|
I'm not that concerned about the exact date it started. There were a lot of changes made, so it's not easy to pin this problem down to a single change just based upon when it started.
The only things that are useful to me are A) did this problem exist several months ago (meaning it's not due to a recent change), and B) (most important) has the problem gone away now?
____________
My lucky number is 75898524288+1 |
|
|
|
The problem didn't exist for me until my post linked to up there (16th Feb), and it does seem to have gone away now - I think it's been 5 days since it last happened to me, maybe slightly longer. |
|
|
|
I'm not that concerned about the exact date it started. There were a lot of changes made, so it's not easy to pin this problem down to a single change just based upon when it started.
The only things that are useful to me are A) did this problem exist several months ago (meaning it's not due to a recent change), and B) (most important) has the problem gone away now?
A) No, I didn't see this issue several months ago.
B) Last time it happened:
7-3-2013 7:39:26 | PrimeGrid | Message from server: This project doesn't support computers of type windows_x86_64
This is CET.
____________
|
|
|
Michael Goetz Volunteer moderator Project administrator
|
It's been about 3 days since I reduced the buffer size. Has anyone seen this error happen since then?
____________
My lucky number is 75898524288+1 |
|
|
|
I had it today at 12:30 on my i7. That was the first time in about 10 days though.
Clicking "update" reported my completed WUs but as usual, for the CPU it's switched to getting me PPS Sieve tasks, and not SGS like I've got selected. I know how to fix it, as I think I mentioned above - just check GFN WR and SGS (and uncheck PPS Sieve for GPU), then it'll behave, and I can set my preferences back to SGS and PPS Sieve for the GPU. Doesn't take that much effort but it's far from ideal, especially the bit which gives another "Aborted" result for a GFN WR. |
|
|
Michael Goetz Volunteer moderator Project administrator
|
I had it today at 12:30 on my i7. That was the first time in about 10 days though.
Ok, if that didn't fix the problem then I'm going to increase the buffer size again. It will improve performance during the challenge.
Clicking "update" reported my completed WUs but as usual, for the CPU it's switched to getting me PPS Sieve tasks, and not SGS like I've got selected. I know how to fix it, as I think I mentioned above - just check GFN WR and SGS (and uncheck PPS Sieve for GPU), then it'll behave, and I can set my preferences back to SGS and PPS Sieve for the GPU. Doesn't take that much effort but it's far from ideal, especially the bit which gives another "Aborted" result for a GFN WR.
It's easier than that.
The permanent fix is to remove the execute permission from the CPU sieve executable. Do that once, and it's fixed forever, and you don't need to take any more action.
If you don't feel comfortable doing that -- it's very easy if you want directions -- and want to micro-manage the preferences, it's better to switch the "use CPU" and "use Nvidia GPU" check boxes at the top rather than changing which projects you're selecting. It's less clicking, and it also means you're not downloading tasks and then cancelling them.
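For anyone who wants to script the "remove the execute permission" fix on a POSIX system (Linux or macOS), a minimal sketch follows. It assumes standard Unix permission bits; on Windows the equivalent change goes through file security settings, which `os.chmod` does not control. The file name is a placeholder, since the post does not name the actual CPU sieve executable, and the demonstration runs on a throwaway temporary file.

```python
import os
import stat
import tempfile


def remove_execute_permission(path):
    """Clear the owner/group/other execute bits; leave all other bits alone."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    new_mode = mode & ~(stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH)
    os.chmod(path, new_mode)
    return new_mode


# Demonstration on a temporary file, NOT on a real BOINC executable.
with tempfile.TemporaryDirectory() as tmp:
    demo = os.path.join(tmp, "cpu_sieve_app_placeholder")  # hypothetical name
    with open(demo, "w") as fh:
        fh.write("placeholder")
    os.chmod(demo, 0o755)                  # start out executable
    demo_mode = remove_execute_permission(demo)
    print(oct(demo_mode))                  # 0o644 on POSIX systems
```

To apply it for real, you would call `remove_execute_permission` on the CPU sieve executable inside the PrimeGrid project directory of your BOINC data folder.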
____________
My lucky number is 75898524288+1 |
|
|
|
Ah, I remember someone talking about that fix - I thought it was a bit extreme at the time, but maybe I should do it. Doesn't matter for now anyway, I'm doing the ESP challenge. Cheers for the quick reply. |
|
|
|
It's been about 3 days since I reduced the buffer size. Has anyone seen this error happen since then?
Had it today again at 11h41 UTC. I hadn't seen it for about a week or so.
____________
676754^262144+1 is prime |
|
|
Michael Goetz Volunteer moderator Project administrator
|
Found it.
I think this one is fixed now. Or at least mostly fixed.
There's a fundamental flaw in the BOINC software, or at least the older version that we're using here. I'm not sure if it's fixed in a later version or not.
There are two pieces of software that need to run, and work together, called the feeder and the scheduler.
The scheduler runs as a cgi when your computer contacts the server, and gets its information from a large memory buffer. The memory buffer contains lots of results to be sent out, as well as all the information about platforms, applications, and application versions.
The feeder is a continuously running process that fills up that memory buffer. It loads the platform and application and application version information when it starts up, and keeps the ready-to-send result table in that buffer filled continuously.
The platform not found problem was apparently caused by the scheduler reading the memory buffer before it was fully initialized.
As it turns out, we re-initialize that buffer every 5 minutes in order to force high-priority challenge cleanup tasks to the front of the queue. The times at which these errors have been occurring correspond to when we do this initialization. It's supposed to be safe to do this; clearly it's not.
I've changed the challenge priority system so that it doesn't re-initialize the buffer anymore, and that should stop 99% of the occurrences of this problem. It may still occur right when the system starts up, which happens once a day automatically after the backups are done, or whenever we manually stop the system for whatever reason. But the initialization will now be happening automatically only once per day rather than 289 times per day, so we should be seeing a lot less of this error.
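The race described above can be illustrated with a toy model. This is not BOINC's code or its shared-memory layout: `SharedBuffer`, the `ready` flag, and the guarded scheduler are hypothetical, invented only to show how a lookup that lands between "buffer cleared" and "buffer reloaded" produces the platform error. The last two lines show one way the 289 figure could be arrived at (288 five-minute re-initializations plus the daily restart), which is an inference rather than something the post states.

```python
from dataclasses import dataclass, field


@dataclass
class SharedBuffer:
    """Toy stand-in for the shared-memory segment (hypothetical layout)."""
    platforms: set = field(default_factory=set)
    ready: bool = False


def feeder_begin_reinit(buf):
    # Step 1 of a re-initialization: wipe the old contents.
    buf.platforms.clear()
    buf.ready = False


def feeder_finish_reinit(buf, platforms):
    # Step 2: reload the platform table, then mark the buffer usable.
    buf.platforms.update(platforms)
    buf.ready = True


def scheduler_unsafe(buf, platform):
    # Trusts the buffer blindly, so a half-initialized buffer looks like
    # "this platform is not supported".
    if platform not in buf.platforms:
        return f"This project doesn't support computers of type {platform}"
    return "ok"


def scheduler_guarded(buf, platform):
    # Hypothetical guard: an unready buffer is a temporary condition.
    if not buf.ready:
        return "retry later"
    return scheduler_unsafe(buf, platform)


KNOWN = {"x86_64-pc-linux-gnu", "windows_x86_64"}
buf = SharedBuffer()

feeder_finish_reinit(buf, KNOWN)
before = scheduler_unsafe(buf, "windows_x86_64")          # normal operation

feeder_begin_reinit(buf)                                   # re-init starts...
during = scheduler_unsafe(buf, "windows_x86_64")          # ...request arrives now
during_guarded = scheduler_guarded(buf, "windows_x86_64")

feeder_finish_reinit(buf, KNOWN)
after = scheduler_unsafe(buf, "windows_x86_64")           # back to normal

reinits_every_5_min = 24 * 60 // 5            # 288 per day
per_day_before_fix = reinits_every_5_min + 1  # plus the daily restart: 289
```

Here `before` and `after` are `"ok"`, `during` is the platform error seen in this thread, and `during_guarded` is `"retry later"`.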
____________
My lucky number is 75898524288+1 |
|
|
|
Good to hear! :) Thanks!
____________
|
|
|
|
Just got this:
PrimeGrid 3/11/2013 4:36:09 AM (11:36 UTC) Message from server: This project doesn't support computers of type windows_x86_64
Seems to still happen long after the daily server restarts.
____________
Largest Primes to Date:
As Double Checker: SR5 109208*5^1816285+1 Dgts-1,269,534
As Initial Finder: SR5 243944*5^1258576-1 Dgts-879,713
|
|
|
Michael Goetz Volunteer moderator Project administrator
|
Look at the UTC timestamp in your error message.
Look at the UTC timestamp of my message post.
Your error message was before I put the fix in.
____________
My lucky number is 75898524288+1 |
|
|
|
Found it.
I think this one is fixed now. Or at least mostly fixed.
The fetching of CPU tasks instead of CUDA seems to be better behaved as well - at least on one of my PCs it's done 2 full fetches without any incorrect tasks arriving. I need to wait for Genefer WR to finish on the PC that was getting the worst of it to see if it's improved there as well. |
|
|
Michael Goetz Volunteer moderator Project administrator
|
Found it.
I think this one is fixed now. Or at least mostly fixed.
The fetching of CPU tasks instead of CUDA seems to be better behaved as well - at least on one of my PCs it's done 2 full fetches without any incorrect tasks arriving. I need to wait for Genefer WR to finish on the PC that was getting the worst of it to see if it's improved there as well.
It is possible that this has lessened the CPU/GPU problem, but it certainly will not eliminate it. The CPU/GPU problem predates the root cause of the platform problem by more than a year.
____________
My lucky number is 75898524288+1 |
|
|
|
That's when it started. It continued for 3+ hours and wouldn't report or download (it also ran out of tasks) until I manually hit update just before my last post.
Could BOINC have something to do with it not returning to normal without the manual update?
____________
Largest Primes to Date:
As Double Checker: SR5 109208*5^1816285+1 Dgts-1,269,534
As Initial Finder: SR5 243944*5^1258576-1 Dgts-879,713
|
|
|
|
New problem?
PrimeGrid 3/11/2013 11:33:40 PM (06:33 UTC) Server can't open log file (../log_www/scheduler.log)
And it continues:
PrimeGrid 3/12/2013 12:33:48 AM (07:33 UTC) Server can't open log file (../log_www/scheduler.log)
Manual update just before this post gets same message.
____________
Largest Primes to Date:
As Double Checker: SR5 109208*5^1816285+1 Dgts-1,269,534
As Initial Finder: SR5 243944*5^1258576-1 Dgts-879,713
|
|
|
|
me too:
12.03.2013 09:41:46 | PrimeGrid | Server can't open log file (../log_www/scheduler.log)
can't send work |
|
|
|
I noticed I had a few WUs completed so I did a manual update.
12/03/2013 10:31:21 | PrimeGrid | update requested by user
12/03/2013 10:31:24 | PrimeGrid | Sending scheduler request: Requested by user.
12/03/2013 10:31:24 | PrimeGrid | Reporting 63 completed tasks, not requesting new tasks
12/03/2013 10:31:28 | PrimeGrid | Scheduler request completed
12/03/2013 10:31:28 | PrimeGrid | Server can't open log file (../log_www/scheduler.log)
While the Scheduler request completed, it didn't report the WU's. Does anyone know a workaround/fix for this issue?
____________
|
|
|
Michael Goetz Volunteer moderator Project administrator
|
The log problem's unrelated, and fixed, and should never happen again.
____________
My lucky number is 75898524288+1 |
|
|
Michael Goetz Volunteer moderator Project administrator
|
That's when it started. It continued for 3+ hours and wouldn't report or download (ran out of tasks also) until I manually hit update just before my last post
Could BOINC have something to do with it not returning to normal without the manual update?
The problem was fixed, but your BOINC client was waiting before attempting to contact the server again. If you had looked at the Projects tab, it should have said "communications deferred 12:34:56" or something like that next to PrimeGrid. Hitting the update button forces it to ignore the wait time.
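The deferral behaviour described here can be sketched in a few lines. This is a simplified model, not the BOINC client's actual logic: `ProjectState`, `on_server_error`, and `should_contact` are hypothetical names, and the 24-hour figure is taken from the "backed up for 24hrs" report earlier in the thread.

```python
from dataclasses import dataclass


@dataclass
class ProjectState:
    # Earliest time (seconds) at which automatic contact is allowed again.
    next_contact_at: float = 0.0


def on_server_error(state, now, backoff_seconds):
    """After an error reply, defer automatic communication for a while."""
    state.next_contact_at = now + backoff_seconds


def should_contact(state, now, user_requested_update=False):
    """A manual update bypasses the deferral; automatic contact must wait."""
    return user_requested_update or now >= state.next_contact_at


state = ProjectState()
on_server_error(state, now=0.0, backoff_seconds=24 * 3600)

# Three hours later the server is fine, but the client is still deferring.
three_hours = 3 * 3600
auto = should_contact(state, now=three_hours)
manual = should_contact(state, now=three_hours, user_requested_update=True)
print(auto, manual)   # False True
```

This is why the hosts in this thread stayed idle long after the server had recovered, and why a manual update cleared things up immediately.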
____________
My lucky number is 75898524288+1 |
|
|
|
I have 'report tasks immediately' in my cc_config. All tasks did upload their files, but they wouldn't report, even though BOINC kept trying as each task finished, well past the time of your fix.
PS, What was the server log file problem that happened hours earlier that's now fixed?
____________
Largest Primes to Date:
As Double Checker: SR5 109208*5^1816285+1 Dgts-1,269,534
As Initial Finder: SR5 243944*5^1258576-1 Dgts-879,713
|
|
|
Michael Goetz Volunteer moderator Project administrator
|
PS, What was the server log file problem that happened hours earlier that's now fixed?
A configuration issue that was, naturally, designed to preemptively prevent future problems.
____________
My lucky number is 75898524288+1 |
|
|