Message boards :
Cullen/Woodall prime search :
Why are my LLR WUs taking so long?
Example: http://www.primegrid.com/result.php?resultid=555370092
I have only recently returned to crunching for PrimeGrid after a lot of focus on other projects, so it will take a while to reeducate myself on all the subprojects.
The WU above expired on 17 JUL with 220 hrs crunched and 44 hrs remaining (80.89% complete). This is not right. I have a dual Xeon E5520 (Nehalem) with 24GB ECC RDIMMs. Yes, this machine is no longer a spring chicken, but it should be more capable than this. It runs BOINC 24/7 with the occasional snooze when I want to watch a video without choppiness. All 16 (8 + HT) cores are made available to BOINC projects. No GPU is available.
I just read a thread that mentioned the loss of performance when running LLR units on multiple cores at the same time. I hadn't been watching the system and I found that it was running 4 or more LLRs at once. I have since paused all but one at a time. Could this be a sufficient explanation for the long work time? If not, what else should I look at, or should I just select different subprojects? | |
|
|
It is generally recommended to run at most as many LLR tasks as you have physical cores available, so your 4 tasks are OK with the 8 physical cores.
But your CPU performs poorly on double-precision floating-point operations compared to newer CPUs, which makes it a lot slower: it has no AVX or FMA3 instruction set and a lower clock speed per core than newer CPUs.
The second point is that LLR tasks benefit from fast RAM, and yours probably runs at 1066 MHz at best. That also slows the computation.
So overall, it's not unusual for those tasks to take that long on your CPU.
From what you have described, I can't see any way to speed up those workunits. It may be best to switch to another subproject with shorter tasks, where the ratio of execution time to deadline is better. | |
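To make the "one LLR task per physical core" rule of thumb concrete, here is a minimal sketch. The function name and the default of two threads per core are my own assumptions for illustration; set `threads_per_core=1` if Hyper-Threading is off.

```python
import os


def max_llr_tasks(logical_cpus: int, threads_per_core: int = 2) -> int:
    """Return the physical core count, used here as the cap on concurrent LLR tasks."""
    if logical_cpus < 1 or threads_per_core < 1:
        raise ValueError("counts must be positive")
    # With Hyper-Threading, each physical core appears as `threads_per_core` logical CPUs.
    return max(1, logical_cpus // threads_per_core)


if __name__ == "__main__":
    # Dual Xeon E5520 from the original post: 16 logical CPUs, 2 threads per core.
    print(max_llr_tasks(16))
    # Local machine; threads_per_core=2 is an assumption, not detected.
    print(max_llr_tasks(os.cpu_count() or 1))
```

By this rule the machine in question could run up to 8 LLR tasks, so 4 at once is within the limit.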
|
Michael Goetz Volunteer moderator Project administrator
Joined: 21 Jan 10 Posts: 14036 ID: 53948 Credit: 475,889,971 RAC: 246,026
|
There's no way you should be missing that deadline. The deadline is 14 days, and the average run time on those tasks is about 3.5 days. Even with HT on and all cores running, and considering your computer isn't the latest and greatest, you still shouldn't be missing the deadline. Computers much older than yours don't have trouble with the deadline.
Remember, that 3.5 days is the average and the latest and greatest are two or three times faster than that.
Looking only at Xeon E5520 CPUs, the average run time is under 6 days. I don't know if that's with HT on or off, but assuming the worst case that it's 6 days with HT off (so only half the virtual cores running), and you're running with HT on and LLR running on all the virtual cores, you still would be finishing in about 12 days, two days short of the deadline. I don't think, however, that you're looking at a worst case scenario.
Looking at the individual computers with E5520's, the slowest of them took under 9 days to do the workunit. (The fastest time for an E5520 was just under 4 days, so I'd say 4 days vs. 9 days is the difference you see when you turn hyperthreading on and off.)
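The worst-case arithmetic above can be written out as a rough sketch. It assumes, as the estimate does, that running LLR on every virtual core doubles the per-task run time; the function names are hypothetical.

```python
DEADLINE_DAYS = 14.0  # deadline quoted in this thread


def estimate_ht_on_days(ht_off_days: float, slowdown: float = 2.0) -> float:
    """Per-task run time with LLR on all virtual cores, assuming each task slows by `slowdown`."""
    return ht_off_days * slowdown


def margin_days(run_days: float, deadline_days: float = DEADLINE_DAYS) -> float:
    """Days to spare before the deadline (negative means the task would be late)."""
    return deadline_days - run_days


if __name__ == "__main__":
    worst_case = estimate_ht_on_days(6.0)  # 6 days is the E5520 average, taken as HT off
    print(worst_case, margin_days(worst_case))
    # Ratio between the slowest and fastest E5520 times quoted (9 days vs. 4 days).
    print(9.0 / 4.0)
```

The factor of 2 is only a simplifying assumption; the 9-day vs. 4-day spread suggests the real penalty can be a little higher.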
Based on that, there's no way you should be exceeding the deadline. I'd be looking for a problem with the computer. Two things I can think of: something else is running and stealing CPU cycles (that's the more likely cause), or you have a severe cooling problem and the CPU is slowing down to avoid damaging temperatures. LLR runs much hotter than almost anything else, so if your cooling is compromised this is a problem you might see with LLR but not with other apps.
I hope this information is helpful.
____________
My lucky number is 75898524288+1 | |
|
axn Volunteer developer
Joined: 29 Dec 07 Posts: 285 ID: 16874 Credit: 28,027,106 RAC: 0
|
Based on that, there's no way you should be exceeding the deadline.
But the WU didn't crunch for the whole 14 days. As per OP, the deadline was exceeded, and after accumulating 220hrs (~ 9days) of crunching (with another 44 to go), it was aborted. So BOINC estimation was off. It'd have still taken 11 days, so something's off there as well. | |
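For reference, the numbers from the original post work out as follows. This is only a sketch of the arithmetic; the second function extrapolates linearly from the reported progress, which is my assumption and not necessarily how BOINC computes its estimate.

```python
def total_days(elapsed_hours: float, remaining_hours: float) -> float:
    """Total run time in days from elapsed time plus BOINC's remaining-time estimate."""
    return (elapsed_hours + remaining_hours) / 24.0


def projected_total_hours(elapsed_hours: float, fraction_done: float) -> float:
    """Total run time extrapolated linearly from the reported progress fraction."""
    if not 0.0 < fraction_done <= 1.0:
        raise ValueError("fraction_done must be in (0, 1]")
    return elapsed_hours / fraction_done


if __name__ == "__main__":
    print(total_days(220, 44))                  # elapsed + estimated remaining, in days
    print(220 / 24.0)                           # days actually crunched
    print(projected_total_hours(220, 0.8089))   # based on the 80.89% progress figure
```

Either way the task needs roughly 11 days of CPU time, well under the 14-day deadline, so the task was evidently not running for a good part of those 14 days.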
|
Michael Goetz Volunteer moderator Project administrator
Joined: 21 Jan 10 Posts: 14036 ID: 53948 Credit: 475,889,971 RAC: 246,026
|
Based on that, there's no way you should be exceeding the deadline.
But the WU didn't crunch for the whole 14 days. As per OP, the deadline was exceeded, and after accumulating 220hrs (~ 9days) of crunching (with another 44 to go), it was aborted. So BOINC estimation was off. It'd have still taken 11 days, so something's off there as well.
It's possible I misunderstood the OP's question. I thought it was "Why is my computer running so slowly that it missed the deadline?" If the question is actually "Why did a task that takes 11 days (220+44 hours) miss the 14 day deadline?", the answer is that you usually can't trust BOINC to estimate times correctly. Sometimes you'll need to ensure that BOINC runs tasks immediately by setting your queue to 0 days. That's always a good idea with PrimeGrid regardless, because of the prime-finder/double-checker competition.
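The queue size is normally set through the computing preferences (on the website or in BOINC Manager). As an alternative, here is a hedged sketch that generates a local override file. It assumes the client reads `global_prefs_override.xml` from its data directory and that the element names `work_buf_min_days` and `work_buf_additional_days` are as I recall them from the BOINC client documentation; please check them against the current docs before relying on this.

```python
import xml.etree.ElementTree as ET


def build_zero_queue_override() -> str:
    """Build override XML that asks the client to keep no work buffer (assumed element names)."""
    root = ET.Element("global_preferences")
    # "Store at least N days of work" -> 0
    ET.SubElement(root, "work_buf_min_days").text = "0"
    # "Store up to an additional N days of work" -> 0
    ET.SubElement(root, "work_buf_additional_days").text = "0"
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Print only; copy the output into the override file yourself once verified.
    print(build_zero_queue_override())
```

The script only prints the XML and deliberately does not write into the BOINC data directory, since that location varies by platform. The client has to re-read its local preferences (or be restarted) before a change like this takes effect.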
One other interesting point to be made. The task is listed as "aborted by user". It wasn't necessary to abort the task just because it was late. It could have been completed and returned and credit would have been granted if the result was correct.
____________
My lucky number is 75898524288+1 | |
|
|
I aborted the task only after checking its status on this website, where it said something like "not completed in time", which I took to mean that I would not get credit for it. If I look at the status of the task now, it only shows that it was "aborted by user", which doesn't tell the whole story.
My original question was really about why my machine takes so long (in actual FLOPs/cycles/whatever) to complete the task. I didn't really think about the scheduling problem at the time. But that is also a concern. Shouldn't BOINC be smart enough by now (v7.2.42) to realize that the task is in such danger of being overdue that it won't pause it in favor of other projects? I don't have time to become an expert in the nuances of the scheduler algorithm, which over the years seems to have gotten too clever for its own good. I just let BOINC run unmolested, but I manually suspend projects when I want to focus 100% effort on my preferred projects.
I guess I will need to figure out which subprojects here are better suited to the instruction set on my CPU, and only subscribe to those.
I will also monitor the temp on my CPUs/memory to see if there is a noticeable problem there. | |
|
Michael Goetz Volunteer moderator Project administrator
Joined: 21 Jan 10 Posts: 14036 ID: 53948 Credit: 475,889,971 RAC: 246,026
|
I aborted the task only after checking its status on this website, where it said something like "not completed in time", which I took to mean that I would not get credit for it. If I look at the status of the task now, it only shows that it was "aborted by user", which doesn't tell the whole story.
As far as the server was concerned, the task was past due and was shown as such. It has no way to know that the task is still being worked on. That's a major shortcoming of BOINC in my opinion, and I've asked for changes that would permit this to be handled in a better manner, but at the present time it can't be done.
Despite the server showing the task as late, as long as you return the correct result before the work unit is purged from the database you will get credit.
My original question was really about why my machine takes so long (in actual FLOPs/cycles/whatever) to complete the task. I didn't really think about the scheduling problem at the time. But that is also a concern. Shouldn't BOINC be smart enough by now (v7.2.42) to realize that the task is in such danger of being overdue that it won't pause it in favor of other projects? I don't have time to become an expert in the nuances of the scheduler algorithm, which over the years seems to have gotten too clever for its own good. I just let BOINC run unmolested, but I manually suspend projects when I want to focus 100% effort on my preferred projects.
"Too clever for its own good" is a pretty good way of describing it. The only way to be 100% certain that it's going to do things correctly is to force it to do so by leaving it no opportunity to do anything else. Running only one project with no queue at all is a pretty good way to force it to run what you want it to run.
Under ideal (read as "hopelessly naive") conditions, it would balance everything correctly, but it often doesn't work that way in practice.
I guess I will need to figure out which subprojects here are better suited to the instruction set on my CPU, and only subscribe to those.
The prime test projects (all the LLR and GFN-SHORT projects) will work fine on your computer -- but they'll run a heck of a lot faster on a CPU with AVX or FMA. Even with HT on, you can easily complete the tasks within the deadline. You're at a disadvantage vs. the Sandy Bridge and later computers, but this isn't really a competition unless you want it to be. I was running a Core2 CPU until earlier this year and was perfectly happy with it. The TRP and PSP/SOB/ESP sieves can also be run, and those will run as well on your CPU as on a newer machine since AVX doesn't help with that app. I'd stay away from the PPS-Sieve app since it's primarily designed for GPUs. Running it on a CPU is, in my opinion, a waste of CPU resources, which are better applied to the tasks that GPUs are unable to do.
I will also monitor the temp on my CPUs/memory to see if there is a noticeable problem there.
That's always a good idea. :)
____________
My lucky number is 75898524288+1 | |
|
|
I aborted the task only after checking its status on this website, where it said something like "not completed in time", which I took to mean that I would not get credit for it. If I look at the status of the task now, it only shows that it was "aborted by user", which doesn't tell the whole story.
It might be better to check your ACCOUNT ...
VIEW tasks link under the BADGES ...
... and see how many computers are working on the TASK. You will get credit if you are one of the first two completions. Many times my crunching runs over, but my computer is the only one assigned, so I will get PENDING CREDIT when I complete but not get full credit until the second machine completes processing and is validated.
I guess I will need to figure out which subprojects here are better suited to the instruction set on my CPU, and only subscribe to those.
You can do any of them, but it seems there are problems with the BOINC scheduler when multiple projects are selected. PrimeGrid seems to start out with its short jobs, but then the longer jobs begin to fill the backlog and .... then you see your overrun problems.
I will also monitor the temp on my CPUs/memory to see if there is a noticeable problem there.
I also had temperature problems. I usually set up preferences to run on all cores 100% of the time. I also have a GTX 650i running GPU tasks. I monitored the temperature and was trying to keep the CPU below 80 degrees. I had to lower the CPU percentage to 60% to keep the temperature below 80 and temperature bounced around quite a bit.
I found that if I lowered the preferences to only use 90% of the cores (idle one for GPU use) and 100% of the time, the temperature dropped about 10 degrees and became very stable.
The TASK MANAGER shows that the page fault delta numbers drop from 2000 per second to near zero. I am not sure what the exact cause is, but freeing 1 virtual CPU from compute so the GPU task can use it, seems to make the temperature problems drop. A work in progress ....
If you use the GPU, then try freeing one CPU from CPU processing.
| |
|
Michael Goetz Volunteer moderator Project administrator
Joined: 21 Jan 10 Posts: 14036 ID: 53948 Credit: 475,889,971 RAC: 246,026
|
You will get credit if you are one of the first two completions.
That is not correct, at least not at PrimeGrid. I can't speak as to how other projects handle late tasks.
You do not need to be one of the first two completions. Any task that returns the correct result before the workunit is purged from the database gets full credit. Workunits don't get purged from the database until several days/weeks/months (depending on the size of the task) after the last in-progress task has completed. In the case of Woodall tasks, the workunit is usually purged 14 days after the last task completes.
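As a rough illustration of that rule, the check below treats a late result as creditable if it arrives before the purge time. The 14-day delay is the "usual" Woodall figure mentioned above, not a guarantee, and the dates and function name are hypothetical.

```python
from datetime import datetime, timedelta

# "Usually" 14 days for Woodall per the post above; other subprojects differ.
WOODALL_PURGE_DELAY = timedelta(days=14)


def still_earns_credit(returned_at: datetime,
                       last_task_completed_at: datetime,
                       purge_delay: timedelta = WOODALL_PURGE_DELAY) -> bool:
    """True if a correct result comes back before the workunit is purged."""
    return returned_at < last_task_completed_at + purge_delay


if __name__ == "__main__":
    # Hypothetical dates for illustration only.
    last_done = datetime(2014, 7, 17)
    print(still_earns_credit(datetime(2014, 7, 25), last_done))  # inside the window
    print(still_earns_credit(datetime(2014, 8, 5), last_done))   # after the purge
```

This ignores validation: the result still has to be correct to receive credit.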
____________
My lucky number is 75898524288+1 | |
|
|
You will get credit if you are one of the first two completions.
That is not correct, at least not at PrimeGrid. I can't speak as to how other projects handle late tasks.
You do not need to be one of the first two completions. Any task that returns the correct result before the workunit is purged from the database gets full credit. Workunits don't get purged from the database until several days/weeks/months (depending on the size of the task) after the last in-progress task has completed. In the case of Woodall tasks, the workunit is usually purged 14 days after the last task completes.
I am not sure where I got that idea but thanks for clearing that up. It seems rather dumb for a project to not give points for a completed result. There is no real reason to abort any long running job then. thanks again. | |
|
|
You will get credit if you are one of the first two completions.
That is not correct, at least not at PrimeGrid. I can't speak as to how other projects handle late tasks.
You do not need to be one of the first two completions. Any task that returns the correct result before the workunit is purged from the database gets full credit. Workunits don't get purged from the database until several days/weeks/months (depending on the size of the task) after the last in-progress task has completed. In the case of Woodall tasks, the workunit is usually purged 14 days after the last task completes.
I am not sure where I got that idea but thanks for clearing that up. It seems rather dumb for a project to not give points for a completed result. There is no real reason to abort any long running job then. thanks again.
Actually, I think Mike meant that if the first 2 WUs have been returned and validated and a 3rd (or more) WU is assigned, you will still get credit if it is returned before the deadline or within a period of time afterwards, before the database gets purged. The timer for that purge begins once the last in-progress WU passes its deadline; how long it runs varies with the length of the subproject.
____________
Largest Primes to Date:
As Double Checker: SR5 109208*5^1816285+1 Dgts-1,269,534
As Initial Finder: SR5 243944*5^1258576-1 Dgts-879,713
| |
|
|
I just experienced something while running 8 copies of CUL on my i7. BOINC showed 3 of the workloads accumulating seconds in the BOINC ELAPSED column, but 5 of the workunits were not accumulating seconds and their ELAPSED column was blank ... implying none had yet accumulated any seconds. I checked the TASK MANAGER and it was running at 35% busy, as if only 3 of 8 tasks were actually running.
5 of the CUL workloads seemed stuck, which the Windows 7 task manager confirmed. I don't know how long it would have taken the 5 CUL workloads to percolate to the RUN state. Nothing I did in BOINC seemed to get all 8 CUL running. I finally shut down BOINC and the workloads ..... and restarted. When I restarted BOINC and the workunits, all 8 CUL workloads restarted.
I watched them stuck for several minutes as I played with boinc knobs so I know it was happening.
I don't know how long that condition would persist OR if it would clear itself up.
This happened on my Haswell i7-4770 machine named sandybridge1 (ID: 238540 ). 6 of the 8 CUL workloads are still running.
It appeared to me like there was some resource deadlock in the BOINC/PrimeGrid environment that stalled the startup and could not clear itself. There were no Windows events logged.
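As a quick sanity check on the Task Manager reading, this sketch assumes each running single-threaded task keeps exactly one logical CPU fully busy and nothing else is using the CPU (the function name is mine):

```python
def expected_cpu_busy_pct(running_tasks: int, logical_cpus: int) -> float:
    """Overall CPU utilisation if each running task saturates one logical CPU."""
    if logical_cpus < 1 or not 0 <= running_tasks <= logical_cpus:
        raise ValueError("need 0 <= running_tasks <= logical_cpus and logical_cpus >= 1")
    return 100.0 * running_tasks / logical_cpus


if __name__ == "__main__":
    print(expected_cpu_busy_pct(3, 8))  # 3 of 8 tasks running
    print(expected_cpu_busy_pct(8, 8))  # all 8 running
```

Three running tasks on 8 logical CPUs comes to 37.5%, which is close to the roughly 35% observed and supports the idea that only 3 tasks were really executing.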
| |
|
Michael Goetz Volunteer moderator Project administrator
Joined: 21 Jan 10 Posts: 14036 ID: 53948 Credit: 475,889,971 RAC: 246,026
|
It appeared to me like there was some resource deadlock in the BOINC/PrimeGrid environment that stalled the startup and could not clear itself. There were no Windows events logged.
I'm not all that familiar with the inner workings of the BOINC client that runs on your computer. I believe it uses a shared memory segment to communicate between the BOINC API that runs in the app and the BOINC client. In theory, it's conceivable that the tasks could have had a problem connecting to that shared memory segment.
It also could have been due to an almost infinite number of other causes. If it happens more than once, maybe we can try to figure out why. I don't think I've ever heard of this problem before.
Another possibility is that this is a feature and not a bug. As someone else put it, BOINC might be "too smart for its own good." Perhaps there was some environmental condition that caused BOINC to decide it shouldn't be running the other 5 tasks.
____________
My lucky number is 75898524288+1 | |
|
Dave
Joined: 13 Feb 12 Posts: 3253 ID: 130544 Credit: 2,422,216,586 RAC: 3,911,326
|
8 tasks means you have HT on. I'd turn that off to start with.
Then, if there's still trouble, try 75% CPU utilisation (3 cores) in the BOINC options. | |
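To see how a "use at most N% of the CPUs" preference maps to a core count, here is a small sketch. It assumes the client rounds down and always keeps at least one CPU, which matches the figures quoted in this thread but is not confirmed against the BOINC source.

```python
import math


def usable_cpus(total_cpus: int, max_pct: float) -> int:
    """CPUs available to BOINC for a given percentage (assumes round-down, minimum 1)."""
    if total_cpus < 1 or not 0 < max_pct <= 100:
        raise ValueError("total_cpus must be >= 1 and 0 < max_pct <= 100")
    return max(1, math.floor(total_cpus * max_pct / 100.0))


if __name__ == "__main__":
    print(usable_cpus(4, 75))   # i7-4770 with HT off: 4 cores at 75%
    print(usable_cpus(8, 90))   # same CPU with HT on: 8 logical CPUs at 90%
```

With HT off, 75% of 4 cores gives the 3 cores mentioned above; with HT on, 90% of 8 logical CPUs leaves one free, as in the earlier GPU example.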
|