Recently, in this thread, a participant reported a completed task that was denied credit. You can read the thread for the full story but, in a nutshell, the task in question was a resend caused by another host's deadline miss. That other host went on to complete the quorum after the resend was issued but before the resend was returned. The resend had therefore become superfluous, but a resend can't be canceled once crunching has started, and it also can't be canceled if the project hasn't enabled server-initiated aborts of such redundant tasks. The two unfortunate consequences are a waste of resources in crunching that task and a loss of credit if that task misses its own personal deadline.
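To spell out the rule that bites here, this is the decision as I understand it from the behaviour described above (a plain sketch only, not the actual BOINC server code):

    # A sketch of the behaviour described above -- NOT the actual BOINC
    # server code, just the rule as I understand it.

    def can_cancel_redundant_resend(crunching_started, server_aborts_enabled):
        # A resend that has become redundant (the quorum completed without
        # it) can only be pulled back if the host hasn't started crunching
        # it AND the project has enabled server-initiated aborts.
        return (not crunching_started) and server_aborts_enabled

In other words, unless both conditions are met, the wasted crunching is unavoidable.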
There will always be tasks (for quite legitimate reasons) that get returned after deadline, so what turns out to be an unnecessary extra task is sometimes just a hazard of "doing business". It's when these unnecessary extra tasks are created as a direct consequence of thoughtless action that it becomes quite annoying.
Consider this scenario. A host is set up with a very large cache and downloads all the tasks needed to fill it. The initial tasks will be returned quickly but, as the remaining ones become "stale", there is an increasing risk that a relatively small hiccup of some sort will see tasks starting to miss their deadline. No big deal, you might think, because the project will accept back and happily use deadline-miss tasks that are only just over the deadline.
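To put some rough numbers on that (they're made up purely for illustration), the task at the very back of a full FIFO cache has essentially no slack left by the time it is started:

    # Rough illustration with made-up numbers: deadline slack for the task
    # at the very back of a large work cache.

    cache_days = 14.0        # days of work the host keeps on hand
    deadline_days = 14.0     # deadline, counted from when a task is sent
    crunch_days = 0.5        # time to crunch one task once it is started

    # With a FIFO queue that is kept topped up, the last task downloaded
    # waits roughly cache_days before it is even started.
    slack_days = deadline_days - cache_days - crunch_days
    print(f"Slack before the deadline: {slack_days:+.1f} days")   # -0.5 days

With a 14-day cache against a 14-day deadline the slack is already negative, so it takes only the smallest hiccup (or none at all) to push tasks over.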
Many people seem to count on the fact that a few hours, or even up to a day, over deadline is unlikely to cost the task its credit, so why worry? They are relying on several delays which all add up: the delay for the scheduler to send the replacement, the delay for that replacement to rise to the top of the queue on the new host, the delay to actually crunch it, and the delay for the client to get around to reporting it. To be honest, I'm sure I used to have that sort of attitude myself, so I'm certainly not pointing the finger at anyone in particular. I do flirt with large caches for specific reasons, but I try to make very sure there won't be deadline misses as a result.
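For what it's worth, the "grace window" being gambled on is just the sum of those delays. A tiny sketch, with the hours invented purely as an example:

    # Invented numbers, purely for illustration: the grace window a late
    # task is gambling on is the sum of the delays before the resend could
    # possibly be returned.

    delays_hours = {
        "scheduler issues the resend": 2,
        "resend waits in the new host's queue": 24,
        "new host crunches the resend": 12,
        "new host reports the result": 6,
    }
    print(f"Implied grace window: about {sum(delays_hours.values())} hours")

A task a few hours late will usually sneak in under a window like that; a task a day or more late is betting on every single one of those delays being on its side.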
The problem I want to discuss is the one where multiple tasks are likely to miss the deadline, where the owner of the host seems to have deliberately arranged for that to happen, and where that owner seems quite unaware of the full import of it all. As an example, take a look at the tasks list for hostID 3387511. Please remember that I'm describing things as they are at the time of writing and that there will be changes over time.
If you click through to the 6th page you will see the current action. The last completed task at the time of writing had a TaskID of 226039421 and it was returned just a couple of hours before deadline. If you look back in time from that point you can easily see that the time gap between when sent and when returned is very close to the 14 day deadline. All the tasks are just making it.
If you count back a total of 8 tasks you will come to the one belonging to WUID 95584371, and you will see that this one was returned approximately 30 mins after deadline. If you click the WUID link, you can see that the server created an extra task but didn't actually get around to sending it out. If you look at the next older WUID, you can see a deadline miss of around an hour, and this time a resend task was issued. You can see that this resend was aborted. That's because it was issued to a host of mine that was deliberately collecting resends, and I had noticed that this type of resend was going to a number of different wingmen who would be quite unlikely to notice how their resources were being wasted in this way. To try to prevent unsuspecting participants from wasting their resources, I collected and aborted as many of these as I could (about 5 all up). If you take the trouble to look back through all the quorums, you will see a number of completed quorums where other participants have (or will have) wasted resources on resends that will be of no use to the project. In some cases the redundant resend has required the download of a full batch of LIGO data because it doesn't belong to the frequency set that the host was using at the time.
If you take the trouble to use the database to find all the computers that belong to the owner of the one I've linked to in this post, you will see that there are 11 in total. They all seem to have identical specs so probably a computing lab of some sort. Each one has what appears to be a 14 day cache, judging by the value of average turnaround time. Some are a bit greater than 14 days and some are a bit less. I don't know that I want to look at the history of the other 10 to see how many more unnecessary resends they might be creating.
I'm sure the project values the contribution of this group of hosts and I'm sure it would be more valuable to have that contribution without the potential waste of other people's resources whenever a task just misses the deadline. So, having gone to the trouble of documenting all this, I'll now send a PM to the owner to ask if it might be possible to reduce the cache setting on all 11 machines to something a bit more friendly to other participants.
I hope anybody else who thinks an excessively large cache is a good idea will perhaps understand the issues and decide to reduce it to something more reasonable. If you got this far, thanks very much for your attention and persistence in following this rather convoluted story.
Cheers,
Gary.
Excessive Work Cache Size - How to screw your new Wingman!!
Some good sleuthing there and good deductions.
What you've uncovered is more of a design fault with the Boinc system and/or the limits that are set on the e@h server.
I thought that, long ago, Boinc was updated so that you couldn't have your settings exceed the project-set deadline limits.
Also, there will need to be a good bit of margin to allow for the inaccuracies in estimating the proportion of time a particular host will be available, and also for the WU length estimates...
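Something along these lines is the kind of sanity check I have in mind; purely illustrative, not what the Boinc client actually does (the function name and margin are made up):

    # Purely illustrative -- not what the BOINC client actually does.
    # The requested cache, corrected for how often the host actually runs,
    # should stay well inside the shortest deadline the project hands out.

    def cache_is_sane(cache_days, shortest_deadline_days, on_fraction,
                      safety_margin=0.5):
        effective_days_to_clear = cache_days / max(on_fraction, 0.01)
        return effective_days_to_clear <= shortest_deadline_days * safety_margin

    print(cache_is_sane(10.0, 14.0, on_fraction=0.8))    # False
    print(cache_is_sane(0.25, 14.0, on_fraction=0.8))    # True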
Can that fail scenario be fixed by e@h adjusting their server-side limits?
Keep searchin',
Martin
See new freedom: Mageia Linux
Take a look for yourself: Linux Format
The Future is what We all make IT (GPLv3)
RE: ... I hope anybody else
A large host-side cache is bad in all respects: it encumbers the smooth flow of WUs through Boinc.
However... the settings are there and available to be abused. If large caches really are a problem for a project, then that is where limits should be imposed, server-side, for that one project.
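By a server-side limit I mean something of this flavour; a made-up illustration of the idea, not an existing e@h setting:

    # Made-up illustration of a server-side cap -- not an existing E@H
    # setting. The idea: stop handing out work once a host already holds
    # more in-progress tasks than it can plausibly return before deadline.

    def tasks_to_send(requested, in_progress, tasks_per_day, deadline_days):
        max_in_progress = int(tasks_per_day * deadline_days * 0.75)  # 25% margin
        return max(0, min(requested, max_in_progress - in_progress))

    print(tasks_to_send(requested=50, in_progress=110, tasks_per_day=10, deadline_days=14))  # 0
    print(tasks_to_send(requested=50, in_progress=20, tasks_per_day=10, deadline_days=14))   # 50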
Otherwise, we should keep the flexibility in Boinc so that all the settings can be used to good advantage by those who know what they're doing. There will also be some abuse by those who don't, but hopefully they can become educated, or at least not be too big a problem.
For my systems, I run a mere 0.25-day cache. That's more than enough, when you're involved in a number of projects, to keep any otherwise idle machines usefully employed.
Happy crunchin',
Martin
See new freedom: Mageia Linux
Take a look for yourself: Linux Format
The Future is what We all make IT (GPLv3)
RE: RE: ... I hope
I agree on all points!! Apparently there ARE still people who do not have 100% internet access for their machines, and a cache makes it possible for them to crunch; I think that is a good thing. If they also lose some credits for a few units along the way, I think that is okay too, as long as they understand the pitfalls... keep on crunching!
There's actually another
There's actually another aspect to this, which I'm experiencing now.
For my higher-performance machines, I generally like to keep a ten-day cache. That way, if the project servers go down, I can make sure that I have enough workunits to get me through the down period. Until just this week, I've never had any problems getting all of my workunits returned on time.
What's happening now on the two machines that I've upgraded to be CUDA-capable is that, for whatever strange reason, BOINC is giving precedence to workunits whose due dates start with a "0". So, on one machine, I have a whole bunch of workunits that are due in April, but BOINC started crunching both GWs and BRPs whose due dates are "03 May". Because I didn't catch that in time, I'm now scrambling to get the April GW workunits done. I've seen the same problem on the other machine too, but there I was able to catch it early enough to prevent workunits from reaching their deadline.
So now, on both machines, toward the middle of the month I have to start watching to see whether any due dates with a leading "0" come up. Then I just have to place a hold on them to ensure that the shorter-deadline workunits get crunched first.
It's quite annoying.
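Just to illustrate what I suspect is going on (a toy example, not the actual client code): if the deadlines were compared as plain text rather than as dates, "03 May" would indeed sort ahead of every April date.

    # Toy example of the suspected behaviour -- not the actual BOINC code.
    from datetime import datetime

    deadlines = ["28 Apr", "29 Apr", "03 May", "10 May"]

    # Compared as plain text, the May dates jump the queue:
    print(sorted(deadlines))
    # ['03 May', '10 May', '28 Apr', '29 Apr']

    # Compared as real dates, the order is what you'd expect:
    print(sorted(deadlines, key=lambda d: datetime.strptime(d, "%d %b")))
    # ['28 Apr', '29 Apr', '03 May', '10 May']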
RE: ...BOINC is giving
That's a very interesting observation. Did you tell the BOINC developers about it? You should at least open a thread at the BOINC Message boards.
[off topic]
Your BOINCstats signature obviously changed its ID (the BOINCstats database has had problems). You should investigate the new URL and change your Community preferences.
[/off topic]
Regards,
Gundolf
Computers aren't everything in life. (Just a little joke.)
RE: RE: ...BOINC is
No, I haven't had time to mention it to the BOINC developers just yet. I was hoping to see if anyone else had the same experience.
As far as the BOINCstats signature goes, I've noticed that this has happened to a few other people, also.
RE: BOINC is giving
Are you sure it's the leading "0" that's significant? Looking only at the day, but not the month, part of the timestamp would be a more plausible bug. If any 10 May WUs jump the queue before the last April task is finished, the two scenarios would be easy to differentiate.
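A toy comparison of the two hypotheses (neither is the actual client code, just the ordering each bug would produce):

    # Toy comparison of the two hypotheses -- neither is the actual client
    # code, just the order each bug would produce.
    from datetime import datetime

    deadlines = ["28 Apr", "29 Apr", "03 May", "10 May"]

    # Hypothesis A: only deadlines starting with "0" jump the queue.
    hyp_a = sorted(deadlines, key=lambda d: (not d.startswith("0"),
                                             datetime.strptime(d, "%d %b")))
    # Hypothesis B: only the day number is compared; the month is ignored.
    hyp_b = sorted(deadlines, key=lambda d: int(d.split()[0]))

    print(hyp_a)  # ['03 May', '28 Apr', '29 Apr', '10 May']
    print(hyp_b)  # ['03 May', '10 May', '28 Apr', '29 Apr']

Under B (but not A) a "10 May" task gets crunched ahead of the April ones, which is the easy way to tell the two apart.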
RE: As far as the
Yes, it has happened to all BOINCstats users that use the default signature from the details page. After the database issues, Willy had to renumber the IDs.
Regards,
Gundolf
Computers aren't everything in life. (Just a little joke.)
RE: RE: As far as the
Ah, that explains it.
RE: RE: BOINC is giving
You could be right. I haven't run any extensive tests on that, but your theory sounds plausible.