PrimeGrid
1) Message boards : Number crunching : Proth Prime Search (Sieve) v1.35 (cuda23) Errors on Linux64 (Message 29895)
Posted 3275 days ago by miw
Bingo!

Downgraded Nvidia driver to 256.53 and now they are going through nicely.

Thanks a lot for the pointer...
2) Message boards : Number crunching : Proth Prime Search (Sieve) v1.35 (cuda23) Errors on Linux64 (Message 29878)
Posted 3275 days ago by miw
Also note that 1.35_cuda23 is the most up-to-date CUDA app for Linux64....
3) Message boards : Number crunching : Proth Prime Search (Sieve) v1.35 (cuda23) Errors on Linux64 (Message 29875)
Posted 3275 days ago by miw
Nope. No app_info.xml file, and in fact the WUs sometimes run for up to a minute before erroring out, using primegrid_tpsieve_1.35_x86_64-pc-linux-gnu_cuda23.

But on reflection the stderr does look strange. I've set the project to no new tasks and will reset it once the other subproject tasks finish.

--mark
4) Message boards : Number crunching : Proth Prime Search (Sieve) v1.35 (cuda23) Errors on Linux64 (Message 29871)
Posted 3275 days ago by miw
Hi All,

On my Linux64 system, every unit for the above application errors out, and when I look at the units, it seems they error out for almost everyone else as well. I have aborted a whole bunch on the basis that I'm getting 100% errors, but PrimeGrid just gives me more. :(

Any clue as to what the problem and/or solution might be? The stderr is always the same - see below.

--mark

<core_client_version>6.10.58</core_client_version>
<![CDATA[
<message>
process exited with code 139 (0x8b, -117)
</message>
<stderr_txt>
Unrecognized XML in parse_init_data_file: hostid
Skipping: 175788
Skipping: /hostid
Unrecognized XML in parse_init_data_file: starting_elapsed_time
Skipping: 0.000000
Skipping: /starting_elapsed_time
Sieve started: 7581952000000000 <= p < 7581955000000000
Thread 0 starting
Detected GPU 0: GeForce 9600 GT
Detected compute capability: 1.1
Detected 8 multiprocessors.
Computation Error: no candidates found for p=7581952059666497.
called boinc_finish

</stderr_txt>
]]>
5) Message boards : Number crunching : The Dreaded 5.5.0 (Message 4165)
Posted 4780 days ago by miw
First, miw's post; B52's reply follows.

The fact that you can endlessly replicate up/download servers but you can't do that so easily with the database server is obviously a very important one.

The point that fixing the problem with upload and download servers would in many cases just shift the bottleneck inexorably towards the one bit of the system that is hard to scale (the database) is also telling. I've experienced the same kind of issue in other systems.

I'm still not totally convinced that allowing "return_results_immediately" would have the terrible impact some people seem to think it would.



It does not.
It seems we use the feature you are discussing on BOINCSIMAP -
if the <upload_when_present/> tag is actually what you are talking about.

It causes every client (at least for mine I can be sure, it's 5.4.11) to upload a result as soon as that WU/result is finished by the client; reporting will happen later on with the next scheduler request.

"return_results_immediately" goes one further and reports the result as soon as it is finished and uploaded.


1. Unless it was the default (which I am *not* advocating) a small minority of people would turn it on.


I don't get it - is this a client-side AND a server-side setting?
Client-side only.


2. The goodness of result batching for reporting still helps you out when the system is in trouble, because the "ready-to-report" queue will pile up on individual hosts.


Hm, why should it? I don't exactly get your point: what's the trouble if the "ready-to-report" queue piles up? Handling THAT is just a bunch of SQL queries.

"ready-to-report" queue piling up is neither good nor bad - it will happen when things go bad due to server outage/unreachability. The fact that the system alows many results to be reported in one connection is a good thing. If the queue banks up because of server unavailability, then the effort per WU drops off, and the total effort to catch up grows significantly less than linearly as the length of the outage. (as opposed to the upload queue which grows linearly as the length of the outage until people run out of work.)

My point was that *this* is probably the main benefit of multiple-WU reporting, rather than the benefit of having hosts batch reporting when things are in steady state.


Cheers
-Jonathan
Admin Boincsimap

6) Message boards : Number crunching : The Dreaded 5.5.0 (Message 4134)
Posted 4788 days ago by miw
For miw: Rom's answer. :-)


A fine and thoughtful response.

The fact that you can endlessly replicate up/download servers but you can't do that so easily with the database server is obviously a very important one.

The point that fixing the problem with upload and download servers would in many cases just shift the bottleneck inexorably towards the one bit of the system that is hard to scale (the database) is also telling. I've experienced the same kind of issue in other systems.

I'm still not totally convinced that allowing "return_results_immediately" would have the terrible impact some people seem to think it would.

1. Unless it was the default (which I am *not* advocating) a small minority of people would turn it on.

2. The goodness of result batching for reporting still helps you out when the system is in trouble, because the "ready-to-report" queue will pile up on individual hosts.

3. The real trouble comes from people endlessly clicking on "update" (or retry) when the database (or upload server or whatever) is chugging.

4. I think the only way to solve these problems (other than finding another way for the reporting process to be somehow "atomic" between the client and the database or some database proxy) would be to install a front end in front of all the servers that (a) puts a hard limit on the number of TCP connections onto each part of the system, by proxying TCP connections and holding the excess in a wait state, and (b) preferably performs TCP multiplexing (opening, say, 30 permanent connections to a server and feeding all requests through them). There are appliances on the market today that do this, but it could be done by any Linux box with the appropriate code written for it.
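A minimal sketch of idea (a), the connection-capping front end (my own illustration; the addresses, port numbers, and cap are placeholders, and it does plain proxying rather than multiplexing):

    import asyncio

    BACKEND = ("127.0.0.1", 8080)   # placeholder upload-server address
    MAX_ACTIVE = 30                 # hard cap on concurrent backend connections
    gate = asyncio.Semaphore(MAX_ACTIVE)

    async def pipe(reader, writer):
        try:
            while data := await reader.read(65536):
                writer.write(data)
                await writer.drain()
        finally:
            writer.close()

    async def handle(client_r, client_w):
        async with gate:            # excess clients are parked here, not refused
            back_r, back_w = await asyncio.open_connection(*BACKEND)
            await asyncio.gather(pipe(client_r, back_w), pipe(back_r, client_w))

    async def main():
        server = await asyncio.start_server(handle, "0.0.0.0", 8000)
        async with server:
            await server.serve_forever()

    asyncio.run(main())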

7) Message boards : Number crunching : The Dreaded 5.5.0 (Message 4128)
Posted 4789 days ago by miw
OK, guys, CALM DOWN!

Sorry miw, but your statements are mainly speculation ("maybe", "might", etc.). The file upload handler often runs on a separate server to help split reporting from uploads. Uploads CAN'T be batched with reports, mainly because that disables the ability to split servers and distribute the load.


But this does not stop multiple file uploads from being batched together, which is where the greatest benefit could be obtained. This is not speculation. You just have to compare the rate of connection requests coming from a host as it is today, with 50 results waiting to be uploaded, each on its own backoff cycle peaking at about 3 hours, against the same host thinking it has just one upload to do with 50 files in it. (Hint: one of them is 50 times the other.)

The best way to do this would be to write a custom file transfer routine for BOINC, but you could get much of the benefit simply by chaining multiple HTTP requests in a single TCP session.
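For illustration (the host name and URL path are made up, and this glosses over BOINC's actual upload protocol), HTTP/1.1 keep-alive already allows exactly this kind of chaining:

    import http.client

    # One TCP connection, many uploads: HTTP/1.1 keeps the socket open
    # between requests, so the 3-way handshake happens only once.
    conn = http.client.HTTPConnection("upload.example.org")  # placeholder host
    for name in ("result_001.out", "result_002.out", "result_003.out"):
        with open(name, "rb") as f:
            conn.request("POST", "/file_upload_handler", body=f.read())
        resp = conn.getresponse()
        resp.read()   # drain the response so the connection can be reused
    conn.close()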


As for "return results immediately", what about you do the test: set up a test project with a sample app running, and make clients report in batches, and another one where clients would report immediately. Of course this needs quite a big number of participants to see the actual load level and be able to compare. But the results would really be interesting.


Actually, the real world probably already has these results, because there was a time when nearly everybody who wanted "return_results_immediately" was running a client where it was available. The world did not fall apart, as I recall.

8) Message boards : Number crunching : The Dreaded 5.5.0 (Message 4126)
Posted 4789 days ago by miw
3. I don't agree with Rom's analysis, because he only looks at the SQL query load when comparing upload with reporting. When things are busy, uploading of results kills the servers dead too, because a very significant percentage of the work is in the TCP connection management involved with uploading. In fact, from what I can see from my end, TCP load seems to be worse than database load on servers. BTW this could well be an argument for batching uploads rather than unbatching reporting. Or it could be an argument for allowing an attempt to report happen in the same HTTP/TCP connection as the upload. That way, you'd save a lot of HTTP connections. If the upload succeeds but the report fails, then the WU goes into "ready to report" mode and by default waits for the next upload or the next work request.


To repeat: when uploading your result, you send data back to the project server. It's written into a folder on the hard drive. Nothing more, nothing less.
No CPU dependency on this, no RAM being used, no database being written to. It's purely a write-only action. Just the same as you writing a 2 KB text file to your hard drive. Please check how much CPU that takes.


You can repeat it all you like, but that doesn't make it correct. Each and every file upload (the way things are currently written) requires a 3-way TCP handshake and then the activation of an HTTP connection. That takes CPU for the handshake and the process spawn, and it takes RAM for the connection blocks. In fact, if you look at a server that is overloaded by people trying to upload files, I think you'll find that in excess of 75% of the resources are being taken just by network overhead and very little by the actual task of writing a file to disk. This is basic stuff.

It may not be obvious to you, because with the typical tools available on a web server/database engine you can quite easily see where the resources are going for database actions and the web server application. You need to know what you are doing to find all the resources going to protocol overhead.

There have been several recent examples where servers have been killed simply by people trying to upload results. The classic one was Rieselsieve early in September. The reporting engine was even turned off so people could only upload results, but the server got so badly into congestion mode that uploads were not going through. No database load in there, as you yourself pointed out.

The stupid thing is that it would be just as easy to allow multiple file writes in a single HTTP connection as it is to allow multiple reports in a single connection. But the BOINC software requires a whole new connection setup for each file, in many cases multiple files per WU. So the offered load out there grows with the number of completed WUs (which gets bigger the longer the server is congested) rather than with the number of hosts with completed work (which stops growing fairly quickly).



Now tell me, are you (speaking in general here) so important that you are allowed to take up all this overhead? Why don't you have the patience that others have and report multiple results at the same time? One write for one result in the tables takes as much overhead as 2, 5, 10, 20, 100 results written to those tables at the same time.

And if everyone did this immediate reporting, there would be only complaints about how you cannot download new work, how slow everything is, and that the project of your interest should get better hardware/connections/whatever. You (still speaking in general here) aren't the only person running this project or any of the other 39+ projects out there.

Once you manage to reach that level of thinking, you may understand why the BOINC developers programmed it this way.


I consider myself to be as important as the next guy. Since you haven't the faintest idea how I run BOINC (in fact, 95% of the time I just let it do its thing, and when things are congested I often suspend activity to let the system get back in kilter), your remarks are 100% off base. Personal attacks about things like "level of thinking" do you no credit.



4. In the end, the folks out there running the clients are the ones donating their computing time. The most active crunchers overwhelmingly want the "report results immediately" feature. Rather than making learned, but in the end pointless arguments why it is a bad thing, the developers should be working on a way to make it happen. If they continue to not get it, then we'll just see a proliferation of things like 5.5.0 being used by people who would otherwise use the stock client.

Excuse me, but this is bullshit. It's those "most active crunchers" who come up with the pointless arguments (although I bet you didn't ask any of those "most active crunchers" what their opinion about it is; you just see a small number of them crunching with this version). Those people's reasoning is "Me, me, me, me, I don't care about the rest".


It's patently not bullshit, because there are a lot of people out there using 5.5.0 (and I might add that I am not one of them). If you read the forums, you'll see many people wanting the "return_results_immediately" feature.

But for all we care, they can take a hike. The science will be done by others.
It's not about speed, it's not about credits. If you're solely here for the credits, be a man or woman and say you are; then run with the stock client and the stock or optimized applications. But don't come up with crap like "most active cruncher" as if that gives you the right to be above the law and as if every project you return results to would crumble to dust if it weren't for you.


Pure poetry.


Umm.... BOINC 5.4.11 gets into EDF mode at the drop of a hat if you have a number of projects running, even when the cache is a small fraction (like 1/7) of the shortest deadline.

And if you leave it well enough alone, your Duration Correction Factor will equalize and in due time you'll not run into EDF any more.

Or use BOINC 5.6.5

DCF corrects things in steady state. 5.4.11 (which, I might remind you, is the current recommended release) has a broken scheduler that gets into trouble because of transient conditions. The issues are (largely) orthogonal.
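(For anyone following along: DCF is a per-project multiplier the client applies to its runtime estimates. A rough sketch of the idea, my own simplification rather than BOINC's exact update rule:

    def update_dcf(dcf, estimated_runtime, actual_runtime):
        # The client multiplies future runtime estimates by dcf.
        ratio = actual_runtime / estimated_runtime
        if ratio > dcf:
            return ratio                  # underestimates are corrected at once
        return dcf + 0.1 * (ratio - dcf)  # overestimates bleed off slowly

That slow convergence is precisely why it only helps in steady state.)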

In fact it is useful to look at the failure modes in a BOINC project in terms of steady state events and transient events.

Not having "return_results_immediately" as a client option helps steady state load by batching reporting, thus sharing connection overhead and database connection overhead between multiple WUs. That's fine, but you have to ask yourself what effect there would be if everybody out there had "return_results_immediately" enabled. I suspect not much. Steady state server load would be somewhat higher, but the fact is a robust BOINC project should not be sized merely to handle steady state conditions. It should be sized so that it can recover in a reasonable time from transient conditions like a server outage. When I look at the logs on my crunchers in steady state, I note that most of them, on most projects, report a single unit at a time anyway. I think it is because the particular relationship between elapsed time per WU and cache size means they typically get hungry before they feel the need to take a *blush*.

PrimeGrid is a bit of an exception here because its WU length is such a tiny fraction of my cache size.

But most of the project outages stem from something transient like the sudden availability/non-availability of work, a maintenance downtime, a hardware failure, etc. It is these transient outages that BOINC really struggles to recover from. People are not getting work, not able to upload results, and not able to report tasks. Mass behaviour (you can cuss it all you like but you can't change it; treat it like an inconvenient law of physics) means lots of people are clicking on the update button, just making things worse. Note that while the ability to batch reports helps a *lot* here (as would the ability to batch transfers), the absence or presence of "return_results_immediately" makes NOT ONE SKERRICK OF DIFFERENCE. This is because the BOINC clients are at best in a retry/back-off cycle or at worst being clicked every 5 minutes by frantic users.
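For reference, the retry behaviour in question is a randomised exponential backoff, roughly like this (a generic sketch using the ~3-hour cap quoted earlier in this thread, not BOINC's exact constants):

    import random

    def next_retry_delay(attempt, cap_seconds=3 * 3600):
        # Generic exponential backoff with jitter: the delay doubles per
        # failed attempt until it saturates at the cap (~3 h, as above).
        base = min(60 * 2 ** attempt, cap_seconds)
        return random.uniform(base / 2, base)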

Paradoxically, "return_results_immediately" might actually *help* many congestion events arising from transient failures, as it would mean that at any given time there are fewer machines out there with unreported results. (and hence a smaller backlog when the system comes back online.)

So, for projects that are just scraping by in steady state and those projects with very short WUs, I agree that *not having* "return_results_immediately" is a helpful thing. For all the others, I doubt it makes much difference. Balance that against the fact that *having* "return_results_immediately" carries a lot of benefits, at least in the minds of some users, not to mention that it would take away one more reason for people to not use stock clients, and I'd say the evidence is in favour of its inclusion.

Of course, if the operators of this project still think "return_results_immediately" is so evil, then that is just one more reason to ban version 5.5.0, which I was advocating when I started this thread. No skin off my nose. I use 5.4.11 on half my hosts and 5.6.5 on the other half.

9) Message boards : Number crunching : The Dreaded 5.5.0 (Message 4122)
Posted 4790 days ago by miw
I must say I am between the camps here. You'll note that in my original post that I said that there are reasons besides cheating for using version 5.5.0.

1. 5.4.11 does have a tendency to download too much work and get itself into trouble. That's a fact.

2. The test that says automatically upload if the WU is less than 24 hours before its deadline does not leave enough time. Outages of more than 24 hours are rife, and if you happen to be a long way from the servers you'll see more outage than those who are close.

3. I don't agree with Rom's analysis, because he only looks at the SQL query load when comparing upload with reporting. When things are busy, uploading of results kills the servers dead too, because a very significant percentage of the work is in the TCP connection management involved with uploading. In fact, from what I can see from my end, TCP load seems to be worse than database load on servers. BTW this could well be an argument for batching uploads rather than unbatching reporting. Or it could be an argument for allowing an attempt to report happen in the same HTTP/TCP connection as the upload. That way, you'd save a lot of HTTP connections. If the upload succeeds but the report fails, then the WU goes into "ready to report" mode and by default waits for the next upload or the next work request.

4. In the end, the folks out there running the clients are the ones donating their computing time. The most active crunchers overwhelmingly want the "report results immediately" feature. Rather than making learned, but in the end pointless arguments why it is a bad thing, the developers should be working on a way to make it happen. If they continue to not get it, then we'll just see a proliferation of things like 5.5.0 being used by people who would otherwise use the stock client.

5. Notwithstanding all the above, and despite the fact that I have a bunch of crunchers that often get into EDF mode, I'm not aware of a case where I lost credit because a WU was sitting there in "ready to report" mode over the deadline. I have had some that were reported late because of it, but not enough to lose credit. Whether that is because of good luck or because the system, despite appearances, is working OK is something different people will have different opinions on. :-)

The only thing I have to say is thanks for your explanation, but it does not help ME one bit with my problem.

You only see it as a problem, where there is none.

I'm still sure that I eventually will lose credit for already crunched WUs because of the "intelligent BOINC scheduler".

And how exactly do you lose credit? BOINC's auto-reporting will always report before the deadline of the result. You only lose credit if you don't report or if there's something wrong with the validator.

The first thing can be your problem (corrupt hard drive, you accidentally reformat the wrong drive, a lightning strike, a meteorite strike), while the second one will hit everyone else as well. So I still don't see your "problem".

Maybe it's that you don't like the "ready to report" entries in Tasks, but that's also easily solved: don't look at BOINC every 10 minutes; just let it do its job. Or run a cache lower than the lowest deadline of all your projects. That way you won't run into EDF trouble as easily, nor will you see so many ready-to-report results, as they'll get purged earlier. See the list in my other post.


Umm.... BOINC 5.4.11 gets into EDF mode at the drop of a hat if you have a number of projects running, even when the cache is a small fraction (like 1/7) of the shortest deadline.


10) Message boards : Number crunching : The Dreaded 5.5.0 (Message 4108)
Posted 4794 days ago by miw
The way PrimeGrid hands out credits, it is particularly prone to benchmark cheating, and it looks like 5.5.0 would be a very fine way to bump up one's results....

There is one thing you don't know about PrimeGrid credit granting: when calculating the claimed credit, an average is used; however, if a user has claimed more than 6 credits, only 6 is used in the calculation. This way high values do not affect granted credit very much.


Fair enough. I had worked out that it was an average, but didn't notice that high claims were truncated at 6. Mind you, it's hard to spot the difference without a calculator, since 6 is still more than double what most (normal) hosts are claiming and only slightly less than what most 5.5.0 hosts are claiming.
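Translating the granting rule described above into a few lines (my own illustration; the claim values are invented):

    def granted_credit(claims, cap=6.0):
        # Average of the claimed credits, with each claim truncated at the
        # cap, per the rule described above.
        return sum(min(c, cap) for c in claims) / len(claims)

    print(granted_credit([2.8, 3.1, 25.0]))  # inflated claim counts as 6 -> ~3.97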

But then, I guess if you were a real credit hound, even 1.5 times the average credit/hr on PrimeGrid is somewhat shy of normal stock credits per hour in some other projects.....


