PrimeGrid
drummers-lowrise
1) Message boards : Seventeen or Bust : SoB WU in queue for weeks without any work done (Message 79664)
Posted 1701 days ago by Jazzop
So, should I kill this WU, or keep crunching?

It has already been reissued to someone else, but if I continue it will certainly finish before they do. Will I get any credit for it?

I'm getting really sick of PrimeGrid tasks timing out because BOINC doesn't seem to place any priority on giving them CPU time.
2) Message boards : Seventeen or Bust : SoB WU in queue for weeks without any work done (Message 79575)
Posted 1705 days ago by Jazzop
I have a single SoB workunit in my queue, which otherwise consists of hundreds of SIMAP workunits. Before downloading the SIMAP workunits about 5 weeks ago, the SoB WU received occasional CPU time along with some CPDN tasks. Ever since I received the SIMAP tasks, the SoB task has never seen a single CPU cycle. It finally started running yesterday, presumably because it became the task with the shortest deadline. The deadline is 20 SEP, with 22.8% complete and 902 hours remaining!

I have not been tinkering with projects or WUs during this time. All projects are set to "no new work", except for SIMAP.
The relative priority for SIMAP is 100 and PrimeGrid is 10000.
The computer runs 24/7 and very rarely is BOINC paused.
2x quad-core Xeon E5520 with HT (16 WUs working at a time)
24GB DDR3 RDIMMs
BOINC v7.2.42
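
A rough way to sanity-check the figures in this post is to compare the remaining runtime estimate against the wall-clock hours left before the deadline. The sketch below is illustrative only: the function name `can_finish` and the `cpu_share` parameter are hypothetical and not part of BOINC, and the 24 hours to the deadline used in the example is an assumed value, not a figure from the post.

```python
def can_finish(hours_remaining: float,
               hours_until_deadline: float,
               cpu_share: float = 1.0) -> bool:
    """Return True if a task could finish before its deadline.

    cpu_share is the fraction of wall-clock time the task actually runs
    (1.0 means it runs uninterrupted). Hypothetical helper, not a BOINC API.
    """
    if not 0.0 < cpu_share <= 1.0:
        raise ValueError("cpu_share must be in (0, 1]")
    # Wall-clock hours needed grows as the task gets a smaller CPU share.
    wall_hours_needed = hours_remaining / cpu_share
    return wall_hours_needed <= hours_until_deadline


# 902 hours remaining is the figure reported in the post; 24 hours until
# the deadline is an assumed value for illustration.
print(can_finish(902, 24))
```

With hundreds of hours of work left, only a deadline several weeks away would make the task feasible, whatever the scheduler does.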
3) Message boards : Cullen/Woodall prime search : Why are my LLR WUs taking so long? (Message 78232)
Posted 1764 days ago by Jazzop
I aborted the task only after checking its status on this website, where it said something like "not completed in time", which I took to mean that I would not get credit for it. If I look at the status of the task now, it only shows that it was "aborted by user", which doesn't tell the whole story.

My original question was really about why my machine takes so long (in actual FLOPs/cycles/whatever) to complete the task. I didn't really think about the scheduling problem at the time. But that is also a concern. Shouldn't BOINC be smart enough by now (v7.2.42) to realize that the task is in such danger of being overdue that it won't pause it in favor of other projects? I don't have time to become an expert in the nuances of the scheduler algorithm, which over the years seems to have gotten too clever for its own good. I just let BOINC run unmolested, but I manually suspend projects when I want to focus 100% effort on my preferred projects.

I guess I will need to figure out which subprojects here are better suited to the instruction set on my CPU, and only subscribe to those.

I will also monitor the temp on my CPUs/memory to see if there is a noticeable problem there.
4) Message boards : Cullen/Woodall prime search : Why are my LLR WUs taking so long? (Message 78207)
Posted 1765 days ago by Jazzop
Example: http://www.primegrid.com/result.php?resultid=555370092

I have only recently returned to crunching for PrimeGrid after a lot of focus on other projects, so it will take a while to reeducate myself on all the subprojects.

The WU above expired on 17 JUL with 220 hrs crunched and 44 hrs remaining (80.89% complete). This is not right. I have a dual Xeon E5520 (Nehalem) with 24GB ECC RDIMMs. Yes, this machine is no longer a spring chicken, but it should be more capable than this. It runs BOINC 24/7 with the occasional snooze when I want to watch a video without choppiness. All 16 (8 + HT) cores are made available to BOINC projects. No GPU is available.

I just read a thread that mentioned the loss of performance when running LLR units on multiple cores at the same time. I hadn't been watching the system and I found that it was running 4 or more LLRs at once. I have since paused all but one at a time. Could this be a sufficient explanation for the long work time? If not, what else should I look at, or should I just select different subprojects?
5) Message boards : Seventeen or Bust : WUs stalled out? (Message 68116)
Posted 2115 days ago by Jazzop
There are definitely no output files in my project directory.

Since I last posted the progress of this WU, the progress bar is now at 40%, with 203 hrs elapsed and 49 hrs remaining. For a dual Xeon E5520 w/ 24GB RDIMMs things should be (and have been) speedier than this.

I am going to abort all 12 WUs and revert to crunching SIMAP again.

Thanks for your attempts to troubleshoot this problem with me. It is unfortunate that so much time was wasted.
6) Message boards : Seventeen or Bust : WUs stalled out? (Message 68113)
Posted 2116 days ago by Jazzop

The second file can be found by looking in a file called llr.out in the same slot directory. Inside the llr.out file will be something like this:

<soft_link>../../projects/www.primegrid.com/llr_sr5_189626089_2_0</soft_link>


The highlighted portion is the name of the output file this task will create, and can be found in the boinc/projects/www.primegrid.com directory. The actual file name will be different than what you see in my example. If this output file exists in the www.primegrid.com directory, please post the contents of that file. It might not exist, however.


I just noticed something. When I open the llr.out, it gives me the following:
<soft_link>../../projects/www.primegrid.com/llr_sob_71270569_2_0</soft_link>


Note the bolded/red portion above.

When I view the contents of the /projects/www.primegrid.com directory, I see what I assume are the output files for the 12 tasks I currently have on board. However, each file does not have the "_x_y" suffix (as bolded above) in its filename. Is this significant?
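
The lookup described in this post (read llr.out in the slot directory, then take the file name from the `<soft_link>` entry) could be sketched as follows. This is a minimal illustration, assuming the file contains a single `<soft_link>` element as in the examples quoted here; the function name `output_file_name` is hypothetical.

```python
import posixpath
import re
from typing import Optional

# Matches the <soft_link>...</soft_link> entry shown in the post.
SOFT_LINK_RE = re.compile(r"<soft_link>(.*?)</soft_link>")


def output_file_name(llr_out_text: str) -> Optional[str]:
    """Return the last path component of the soft link, or None if absent."""
    match = SOFT_LINK_RE.search(llr_out_text)
    if match is None:
        return None
    # The link uses forward slashes, so posixpath works on any OS.
    return posixpath.basename(match.group(1).strip())


sample = "<soft_link>../../projects/www.primegrid.com/llr_sob_71270569_2_0</soft_link>"
print(output_file_name(sample))  # llr_sob_71270569_2_0
```

The returned name can then be compared against the files actually present in the projects/www.primegrid.com directory.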
7) Message boards : Seventeen or Bust : WUs stalled out? (Message 68108)
Posted 2116 days ago by Jazzop
Here are the complete contents of the stderr.txt file for task llr_sob_71270569_2:

BOINC LLR 6.03 wrapper: starting

Major OS version: 6; Minor OS version: 1
FFT length: 1920K
BOINC LLR 6.03 wrapper: starting

Major OS version: 6; Minor OS version: 1
FFT length: 1920K
BOINC LLR 6.03 wrapper: starting

Major OS version: 6; Minor OS version: 1
FFT length: 1920K
BOINC LLR 6.03 wrapper: starting

Major OS version: 6; Minor OS version: 1
FFT length: 1920K
02:30:47 (4332): No heartbeat from core client for 30 sec - exiting
BOINC LLR 6.03 wrapper: starting

Major OS version: 6; Minor OS version: 1
FFT length: 1920K
BOINC LLR 6.03 wrapper: starting

Major OS version: 6; Minor OS version: 1
FFT length: 1920K
BOINC LLR 6.03 wrapper: starting

Major OS version: 6; Minor OS version: 1
FFT length: 1920K
BOINC LLR 6.03 wrapper: starting

Major OS version: 6; Minor OS version: 1
FFT length: 1920K
BOINC LLR 6.03 wrapper: starting

Major OS version: 6; Minor OS version: 1
FFT length: 1920K
BOINC LLR 6.03 wrapper: starting

Major OS version: 6; Minor OS version: 1
FFT length: 1920K
BOINC LLR 6.03 wrapper: starting

Major OS version: 6; Minor OS version: 1
FFT length: 1920K
BOINC LLR 6.03 wrapper: starting

Major OS version: 6; Minor OS version: 1
FFT length: 1920K
BOINC LLR 6.03 wrapper: starting

Major OS version: 6; Minor OS version: 1
FFT length: 1920K
BOINC LLR 6.03 wrapper: starting

Major OS version: 6; Minor OS version: 1
FFT length: 1920K
BOINC LLR 6.03 wrapper: starting

Major OS version: 6; Minor OS version: 1
FFT length: 1920K
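
The log above consists almost entirely of repeated "wrapper: starting" lines (15 by a manual count) plus one "No heartbeat" exit. A small script along these lines could tally such restarts in a longer stderr.txt. It is a sketch only: the function name `summarize_stderr` is hypothetical, and the sample text is a shortened excerpt of the log, not the full file.

```python
def summarize_stderr(text: str) -> dict:
    """Count wrapper starts and heartbeat-timeout exits in a stderr log."""
    starts = 0
    heartbeat_exits = 0
    for line in text.splitlines():
        if "wrapper: starting" in line:
            starts += 1
        elif "No heartbeat from core client" in line:
            heartbeat_exits += 1
    return {"starts": starts, "heartbeat_exits": heartbeat_exits}


# Shortened excerpt of the log quoted in the post.
sample_log = """BOINC LLR 6.03 wrapper: starting

Major OS version: 6; Minor OS version: 1
FFT length: 1920K
02:30:47 (4332): No heartbeat from core client for 30 sec - exiting
BOINC LLR 6.03 wrapper: starting

Major OS version: 6; Minor OS version: 1
FFT length: 1920K
"""

print(summarize_stderr(sample_log))  # {'starts': 2, 'heartbeat_exits': 1}
```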


Here are the complete contents of the output file for the same task:
1000000000:P:0:2:25755459 21295606


For the record, this task currently shows 39.181% complete, 196:29:xx elapsed, 46:49:xx remaining, and both times are increasing at about 50% real-time speed (i.e., 1 sec every 2 actual secs).
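
As a cross-check on these numbers, a simple linear extrapolation from percent complete and elapsed time gives a very different remaining time than the 46:49 shown. The sketch below assumes progress is roughly proportional to elapsed run time, which is an assumption rather than something confirmed in this thread; the function name is hypothetical, and the unreported seconds (the "xx" above) are ignored.

```python
def extrapolate_remaining_hours(fraction_done: float, elapsed_hours: float) -> float:
    """Estimate remaining hours assuming a constant rate of progress."""
    if not 0.0 < fraction_done <= 1.0:
        raise ValueError("fraction_done must be in (0, 1]")
    # Projected total run time at the observed average rate.
    total_hours = elapsed_hours / fraction_done
    return total_hours - elapsed_hours


elapsed = 196 + 29 / 60  # 196:29 elapsed, seconds ignored
remaining = extrapolate_remaining_hours(0.39181, elapsed)
print(f"about {remaining:.0f} hours remaining")
```

At the average rate so far, the task would need roughly 305 more hours, not 46:49, which suggests the displayed estimate is not tracking actual progress.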
8) Message boards : Seventeen or Bust : WUs stalled out? (Message 68098)
Posted 2116 days ago by Jazzop
Well, it appears that all of my SoB WUs want to stall out around the 40% completion mark. Once a WU hits the magic combination of 35-40% complete/~190 hrs elapsed/~40 hrs remaining, it exhibits the following behavior: both the time elapsed and time remaining begin to increment higher, and the % completion bar effectively halts.

I have tried suspending all other tasks except for one SoB task so that it could have as many resources as it wanted, but I woke up the next morning to see that all it did was increase the time elapsed/remaining.

Unless I get some guidance, I will abort these tasks and take a hiatus from this project. I am beginning to feel like I have wasted the >1500 CPU hours accumulated so far on these tasks.
9) Message boards : Seventeen or Bust : WUs stalled out? (Message 68016)
Posted 2119 days ago by Jazzop
1. I have tried exiting BOINC, manually killing the boinctray process, then restarting the program. --> no success

2. I have restarted the computer (shut down, then power back on). --> no success

3. I have no suitable GPUs.

4. In order to see if the problem had something to do with running 6-8 SoB tasks simultaneously, I have suspended most of them and will run a couple at a time to completion. Since I am 7 days away from the deadline for all my SoB tasks, I will leave the problem tasks suspended until all others are complete. I don't want to risk losing "good" tasks by wasting time troubleshooting problematic ones.

I'll report back in a few days. In the meantime, any other suggestions are still welcome.
10) Message boards : Seventeen or Bust : WUs stalled out? (Message 67989)
Posted 2120 days ago by Jazzop
I have 2 WUs that appear to be stalled out: llr_sob_71249911_3 & llr_sob_71263408_2. The time elapsed continues to increment, but the progress bar and time remaining values have not changed in a couple of days. I have several other WUs that are progressing normally.

Suggestions?


Copyright © 2005 - 2019 Rytis Slatkevičius (contact) and PrimeGrid community. Server load 2.28, 1.94, 1.71
Generated 21 May 2019 | 5:25:29 UTC