Status of app performance / bugs and deadlines

Brian Silvers
Joined: 26 Aug 05
Posts: 772
Credit: 282700
RAC: 0
Topic 192726

It has been almost 4 days since I downloaded 4 work units. I've returned two, but none of my "wingmen" have returned any. Looking at their performance, they are taking between 4 and 7 times as long to report the units as the CPU time needed to crunch them. One person took 11 days to report their first S5R2 unit, although the unit only ran for about 35 hours total CPU time...

This condition implies one or more of the following:

* The hosts are not on all the time. If this is true, then summer time will likely exacerbate this issue as some people may not run their computers as much so as to save energy.

* The hosts have resource shares to other projects. This will end up forcing the work scheduler into EDF mode and likely irritate some people who don't understand why they aren't getting another unit from EAH (due to the debt issues caused by EDF).

* The owner does not run BOINC all the time. If they are unaware of the drastic slowdown of the application, they may easily let deadlines pass not even knowing it.

While one could argue that this is merely an "I want my credit faster" thread, looking at it solely on that basis is flawed. If users miss deadlines, the science is delayed. Results have to stay in the various databases for longer, thus decreasing cross-table / cross-database performance. IMO, it would greatly serve the project to get these situations under control relatively quickly.

I implore someone to give an update on the status of the applications.

Thanks,

Brian
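The turnaround figures quoted in the opening post can be checked with a quick calculation. This is only a sketch using the numbers given there (11 days to report, about 35 hours of CPU time); the function name is made up for illustration.

```python
def turnaround_ratio(wall_clock_days: float, cpu_hours: float) -> float:
    """Wall-clock time from download to report, divided by CPU time used."""
    return (wall_clock_days * 24.0) / cpu_hours


# Figures from the post: 11 days to report, roughly 35 hours of CPU time.
ratio = turnaround_ratio(11, 35)
print(f"{ratio:.1f}x CPU time")  # prints "7.5x CPU time"
```

So that one host sits slightly above the "4 to 7 times" range seen on the other wingmen.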

Brian Silvers
Joined: 26 Aug 05
Posts: 772
Credit: 282700
RAC: 0

Status of app performance / bugs and deadlines

I guess I struck fear into the hearts of two of the wingmen (LOL), because I have two results in now; the one that I'll turn in sometime in the next 4 hours and the one for Saturday... The other two are still out...

Bikeman (Heinz-Bernd Eggenstein)
Moderator
Joined: 28 Aug 06
Posts: 3522
Credit: 688955354
RAC: 209487

RE: I guess I struck fear

Message 64188 in response to message 64187

Quote:
I guess I struck fear into the hearts of two of the wingmen (LOL), because I have two results in now; the one that I'll turn in sometime in the next 4 hours and the one for Saturday... The other two are still out...

The ratio between processing time and deadline length is an actively discussed issue; see for example the "ping" thread.

Sometimes I check the state of my "wingmen", and my personal impression is that "missed deadline" happens very, very rarely. I don't think changing the deadline will have any effect at the moment.

Eventually we will see the application being optimized, and it will run maybe twice as fast or faster on modern CPUs. The scientists had to take this into account when splitting the work into workunits, because if WUs are processed too quickly, this might overload the servers, especially during the "end-game" when users will frequently receive WUs from different "datapacks".

CU

BRM

Brian Silvers
Joined: 26 Aug 05
Posts: 772
Credit: 282700
RAC: 0

RE: Sometimes I check the

Message 64189 in response to message 64188

Quote:

Sometimes I check the state of my "wingmen", and my personal impression is that "missed deadline" happens very, very rarely. I don't think changing the deadline will have any effect at the moment.

"Rarely" pre-S5R2? I don't know how you could have enough info yet to know about S5R2. No offense intended, I just don't think there's been enough time for an appropriately large statistical sampling... The one person that's with me took 11 days to report the one just before this one, which is getting awful close to the 14-day deadline. Fortunately it's about the same size, so they should report it within the deadline, but if they go offline for a day here, don't have BOINC running there, or decide that they are going to concentrate on one project (like BOINC Synergy does), that could cause an issue. Of course, one could argue that I don't know that they may already be doing that and all is well with the world... ;-)

Annika
Joined: 8 Aug 06
Posts: 720
Credit: 494410
RAC: 0

Well, I noticed that people

Well, I noticed that people turned WUs in late, too. During the beginning of the run, my pendings grew simply enormous due to that. But I have the impression it's getting a bit better now, with more people returning their work a little faster. And besides, I'm not sure how much of the problem was due to computation errors, because I got an awful lot of WUs which crashed on my wingman, or even multiple wingmen (record is, I think, needing 6 people for 2 valid results) and had to be re-issued because of that.

Henk Haneveld
Joined: 5 Feb 07
Posts: 18
Credit: 14120040
RAC: 283

There are 2 subjects here,

There are 2 subjects here, application performance and fast return of completed WUs.

If the application can be made faster that is always nice.
But it would mean nothing to my return times.

I run Einstein as a backup project with a low resource share, and in a normal situation it will use the maximum time allowed for processing, currently 2 weeks.

Only when my primary project SETI goes offline will I return WUs faster.

I run 24/7 so I won't return WUs late.

Brian Silvers
Joined: 26 Aug 05
Posts: 772
Credit: 282700
RAC: 0

RE: Only when my primary

Message 64192 in response to message 64191

Quote:

Only when my primary project SETI goes offline will I return WUs faster.

Not always true... If the RDCF (result duration correction factor) were correct, then probably always true. Right now, however, the RDCF values for Einstein S5R2 units are in flux. This means that hosts that attach and/or old hosts that grab new units may grab too many. Because the estimated completion time is inaccurate, BOINC will go on for a while thinking it has enough time...switching between applications at the "normal" rate that you have set up... I'm a little fuzzy on this next part, but I think that the CPU scheduler will eventually figure out that it's not going to make it at the current resource allocation. It would then go into EDF (earliest deadline first) mode. The big lingering question is, does the CPU scheduler figure out that there's too much work in time to enter EDF appropriately and complete all results that are in danger of missing a deadline...

At any rate, once you enter into EDF it will likely want to complete all of these larger Einstein units, which means that you will crunch and return the units faster than you normally would, unless you do something dastardly :gasp:, like get frustrated that you're not doing any SETI work, and suspend Einstein (not accusing you of wanting to do something like that, but there are people who certainly could and would)... The larger your cache, the worse the problem can get...assuming that the CPU Scheduler doesn't figure out what's going on in time...

Edit: Oh, and yeah, I know that if someone suspends a project and then misses a deadline, it is their fault, but if they have their computer(s) hidden and come here on the message boards and rant about it, nobody except for project admins would even have a chance to be able to tell them it's their fault...as they're surely not going to admit it...
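The concern above can be sketched as a toy deadline check. This is not the real BOINC CPU scheduler, just a simplified model: `Task`, `misses_deadline` and `choose_mode` are hypothetical names, the project is assumed to get a fixed fraction of one CPU, and the 10-hour underestimate is an invented figure (the 35 hours and the 14-day deadline come from the thread).

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    remaining_hours: float  # estimated CPU time still needed
    deadline_hours: float   # wall-clock hours left until the deadline


def misses_deadline(tasks, cpu_fraction):
    """True if any task finishes late when the project gets cpu_fraction of the CPU."""
    elapsed = 0.0
    for task in sorted(tasks, key=lambda t: t.deadline_hours):
        elapsed += task.remaining_hours / cpu_fraction
        if elapsed > task.deadline_hours:
            return True
    return False


def choose_mode(tasks, resource_share):
    """Stay in normal switching unless the share is too small to meet a deadline."""
    return "EDF" if misses_deadline(tasks, resource_share) else "round-robin"


# Four units, 14-day (336-hour) deadline, 25% resource share.
underestimated = [Task(f"wu{i}", 10, 336) for i in range(4)]  # assumed bad estimate
realistic = [Task(f"wu{i}", 35, 336) for i in range(4)]       # ~35 h each

print(choose_mode(underestimated, 0.25))  # prints "round-robin"
print(choose_mode(realistic, 0.25))       # prints "EDF"
```

With the too-short estimate the check sees no problem and keeps switching normally; only once the estimate is realistic does it flag the cache as too large for a 25% share. Whether that happens early enough is exactly the open question in the post.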

Richard Haselgrove
Joined: 10 Dec 05
Posts: 2140
Credit: 2769783797
RAC: 938053

I don't think this will be

I don't think this will be too much of a problem, because of the way that BOINC handles RDCF. In the case that you're worried about (previous RDCF too low, project estimate too short), all it takes is one WU running to completion to re-calculate RDCF and re-estimate the size of the work buffer. As soon as the first S5R2 result is ready, BOINC will 'know' what to expect and react accordingly, including EDF and suspend work fetch if needed.

It's in the opposite direction that BOINC feels its way slowly and carefully. If a WU finishes earlier than predicted by the project and RDCF, then RDCF is only reduced by a little bit after each WU: it's usually reckoned it takes about 30 completed WUs before the estimates are so nearly right that you wouldn't notice the difference.

Of course this isn't foolproof, and someone with a really low resource share will still take a long time to crunch that first WU and get the estimate corrected. But in that case, there won't be a large cache to worry about (in terms of numbers of WUs).
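The asymmetric behaviour described in this post can be illustrated with a small sketch. The function name and the 10% step are assumptions made for illustration, not values taken from the thread or guaranteed to match any particular BOINC client version.

```python
def update_rdcf(rdcf: float, estimated_hours: float, actual_hours: float,
                step: float = 0.1) -> float:
    """Raise RDCF at once when a result runs long; lower it only gradually.

    estimated_hours is the project's raw estimate, before RDCF is applied.
    step is an assumed 10% move toward the observed ratio per completed WU.
    """
    ratio = actual_hours / estimated_hours
    if ratio > rdcf:
        return ratio                      # underestimate: correct immediately
    return rdcf + step * (ratio - rdcf)   # overestimate: creep downward


# One long S5R2-style result is enough to correct a too-low factor.
print(update_rdcf(1.0, estimated_hours=10, actual_hours=35))  # prints 3.5

# Going the other way takes many results.
rdcf = 3.5
for _ in range(30):
    rdcf = update_rdcf(rdcf, estimated_hours=10, actual_hours=10)
print(round(rdcf, 2))
```

Under this assumed step size the factor is still about 10% above the true value of 1.0 after 30 completed WUs, which is in the same ballpark as the "about 30 WUs" rule of thumb.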

Bikeman (Heinz-Bernd Eggenstein)
Moderator
Joined: 28 Aug 06
Posts: 3522
Credit: 688955354
RAC: 209487

BTW, what is the reason

BTW, what is the reason anyway that BOINC CC splits "uploading a result" and "reporting a result" into two server interactions (by default, at least), with some delay between the two events?

In other words: Why isn't BOINC always reporting a result immediately after uploading it? This would shorten the delay we are discussing here a bit.

CU

BRM

Jord
Joined: 26 Jan 05
Posts: 2952
Credit: 5779100
RAC: 0

RE: In other words: Why

Message 64195 in response to message 64194

Quote:
In other words: Why isn't BOINC reporting a result always immediately after uploading the results? This would shorten the delay we are discussing here a bit.


Uploading a result just uploads the data to a directory on the server, with little or no CPU overhead. Reporting a result uses database hook-ins. Reporting 1 result takes almost as much overhead as reporting multiple results. So to not overload the database server, it's best to report multiple results.

If only you report immediately, it won't be a problem. If everyone does it, it will be a problem.
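A rough cost model shows why batching helps. The function name and the cost figures are invented for illustration (arbitrary units, with a fixed per-request overhead assumed to dominate); they are not measurements from any BOINC server.

```python
def database_cost(num_results: int, batch_size: int,
                  per_request_cost: float = 1.0,
                  per_result_cost: float = 0.1) -> float:
    """Total server cost of reporting num_results in batches of batch_size.

    Each scheduler request pays a fixed overhead plus a small cost per result.
    Both cost values are assumptions in arbitrary units.
    """
    requests = -(-num_results // batch_size)  # ceiling division
    return requests * per_request_cost + num_results * per_result_cost


one_by_one = database_cost(10, batch_size=1)   # ten separate reports
batched = database_cost(10, batch_size=10)     # one report with ten results
print(one_by_one, batched)
```

Under these made-up costs, ten separate reports cost about 11 units against about 2 for a single batched report. One host reporting immediately makes little difference, but the gap multiplies across every host on the project.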

Brian Silvers
Joined: 26 Aug 05
Posts: 772
Credit: 282700
RAC: 0

RE: BTW, what is the reason

Message 64196 in response to message 64194

Quote:

BTW, what is the reason anyway that BOINC CC splits "uploading a result" and "reporting a result" into two server interactions (by default, at least), with some delay between the two events?

In other words: Why isn't BOINC reporting a result always immediately after uploading the results? This would shorten the delay we are discussing here a bit.

CU

BRM

Like Jord said, it's database intensive... The version 5.5.0 that I'm using reports immediately. I only have it on this one host. When I find something past 5.4.11 that is relatively bug-free, I'm going to be updating to a CC that doesn't report immediately... Well, for that reason and that people at LHC tend to get way too itchy about overclaims, calling anyone that overclaims a "cheater" and "disgusting", "don't know how they sleep at night", etc... This version overclaims there, and so it's way too much drama...
