The "cleanup" for the S5GC1HF run

Henk Haneveld
Joined: 5 Feb 07
Posts: 18
Credit: 14120227
RAC: 266

It looks to me that this may

It looks to me that this may introduce a new problem.

Quote:
The basic idea right now is to do it the other way round: when a client reports "his" list of files, the scheduler checks whether a) the file is associated with any workunit (this includes all tasks, until the workunit itself is deleted by the db purger) or b) the WUG could still produce work for the file. Only if neither condition is met will the scheduler issue a delete request for this file on the client.
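
If I read that right, the delete decision boils down to something like this (a rough Python sketch of my understanding only; all the names here are made up, this isn't the actual scheduler code):
[pre]
def should_delete(data_file, workunits, generator):
    # a) is the file still referenced by any workunit (i.e. any of its tasks
    #    still exist, up until the db purger removes the workunit)?
    still_referenced = any(data_file in wu.input_files for wu in workunits)
    # b) could the workunit generator still produce new work for this file?
    could_make_work = generator.can_generate_for(data_file)
    # Only when both are false would the scheduler ask the client to delete it.
    return not still_referenced and not could_make_work
[/pre]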

After some time, all clients will have a much larger list of data files than they do now - I'm thinking 3 or 4 times the current number. Every time a client asks for work, this large list will be sent to the server.

Possible problems:
1. A large jump in incoming traffic because of the increased size of the file containing the data file list.

2. A large jump in server load because the server has to work through the large list of data files, checking whether there is work available for any of these files and whether any of them can be deleted.

Side note: If/when this change is made, you may want to send out a general warning to all users that this project will need more disk space on their systems.

Oliver Behnke
Moderator
Administrator
Joined: 4 Sep 07
Posts: 946
Credit: 25167626
RAC: 19

RE: All clients will have

Quote:


After some time, all clients will have a much larger list of data files than they do now - I'm thinking 3 or 4 times the current number. Every time a client asks for work, this large list will be sent to the server.

Possible problems:
1. A large jump in incoming traffic because of the increased size of the file containing the data file list.

2. A large jump in server load because the server has to work through the large list of data files, checking whether there is work available for any of these files and whether any of them can be deleted.

Hm, the current list of files sent to the server is about 8 kB. The size of the list itself shouldn't matter, though. The number of files contained in that list will of course affect the result selection process, but we've already taken care of that. So the main issue with an increased file retention time will be the disk space required on the client, as I already mentioned above.

Please keep in mind that all of this is still under investigation and we're definitely going to discuss the pros and cons...

Best,
Oliver

 

Einstein@Home Project

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5845
Credit: 109971846362
RAC: 29971678

I know it's a little

I know it's a little premature since around 10% of the tasks for the old run are still available, but I thought I'd experiment with setting up a "resend vacuum cleaner" :-).

I have a large number of hosts that have been cleaning out the 143x.xxHz frequency set. They started at 1430.60Hz quite some time ago and have progressed to around 1437.50Hz. I have saved LIGO data and blocks through to 1438.20Hz and have deployed the LIGO files to all hosts. It has reached the stage where the remaining tasks are disappearing very rapidly, so I'll probably transition many of these hosts to a higher frequency like 149x.xxHz, where I also have saved data and there are a lot more tasks still available.

Because I have a huge range of data files for 143x.xxHz, and because there should be plenty of compute error/task abort type resends (since this frequency set has been so active very recently), I set up a quad over the weekend and activated the entire frequency range on it - from memory, around 2GB of LIGO data all up. The other advantage of setting up now is that after the compute error resends fade away, there should be a continuing feast of work once the "deadline miss" resends kick in at a later stage.

I was quite interested to see whether this reasonably fast host could be fed entirely from resends. It's been going for around 2 days now. Here is a link to its task list with task names activated. You can spot the resends by looking at the suffix portion of the task name. The _0 and _1 suffixes denote a primary task, and you can see the very latest task is one of those, since I kept drawing tasks until all currently available resends were gone. You can also see that the 27 tasks prior to the very latest one were all resends (_2 or higher suffixes). Seems like there are plenty of resends available at the moment.
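
If you want to do the count without clicking through pages, the suffix rule is easy to script (a minimal Python sketch; the task names below are invented, just to show the idea):
[pre]
# A task name ending in _0 or _1 is a primary task; _2 or higher means a resend.
task_names = [
    "h1_1437.55_S5R4__123_S5GC1HFa_1",   # made-up example names
    "h1_1437.55_S5R4__124_S5GC1HFa_3",
]

def is_resend(name):
    return int(name.rsplit("_", 1)[1]) >= 2

resends = [n for n in task_names if is_resend(n)]
print(len(resends), "resends out of", len(task_names), "tasks")
[/pre]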

Cheers,
Gary.

leks
Joined: 21 Nov 07
Posts: 28
Credit: 483169031
RAC: 0

can write a small proxy?

Could someone write a small proxy? It would store only your data files (h*, l*) and their descriptions, intercept the work request packet going to the server, add its own file list, download any new files, and distribute them to its hosts.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5845
Credit: 109971846362
RAC: 29971678

Sure, this could be done by

Sure, this could be done by someone with sufficient skills. It's fairly safe to assume that it's quite beyond my level of expertise and I don't really have the time to test the assumption on the rather small chance I'm wrong :-).

Cheers,
Gary.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5845
Credit: 109971846362
RAC: 29971678

RE: ... I thought I'd

Quote:
... I thought I'd experiment with setting up a "resend vacuum cleaner" :-).


Today's update.

After waiting a full 24 hours for more resends to accumulate, I ran my script that removes tags in the state file and then started acquiring as many resends as possible. Once again there were plenty, and today I was unable to exhaust the supply before reaching my max cache limit.

I got 26 new resends, and if you use the link I posted yesterday, you'll need to go to the second page to see them all. So I've set NNT (no new tasks) for the next 24 hours to allow even more to accumulate, and then I'll repeat the process.

At this point, it's not taking much time to manage a single host. The tags are removed automatically, but I'm manually copying back any LIGO files that do get deleted and manually adding the blocks back into the state file. The next stage is to get the script to do these jobs as well (if I get time to write and test the code).
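
The file copying part at least should be easy to automate - something along these lines (an untested Python sketch; the archive and project directory paths are just placeholders for my own layout):
[pre]
import os, shutil

ARCHIVE = "/data/ligo_archive"                                # where the saved LIGO files live
PROJECT = "/var/lib/boinc/projects/einstein.phys.uwm.edu"     # BOINC project data directory

# Copy back any archived data file that the client has deleted.
for fname in os.listdir(ARCHIVE):
    dest = os.path.join(PROJECT, fname)
    if not os.path.exists(dest):
        shutil.copy2(os.path.join(ARCHIVE, fname), dest)
        print("restored", fname)
[/pre]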

Cheers,
Gary.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5845
Credit: 109971846362
RAC: 29971678

I've been too busy on other

I've been too busy on other things to write any extra features into my script yet, but I've streamlined the manual process of maintaining the supply of LIGO data files and maintaining the blocks in the state file. It takes less than 10 minutes to put a host back into the state where it can draw resends (if they exist) from the full frequency range - now from 1430.xxHz to 1439.xxHz.

Because there seem to be plenty of resends, I've created a second quad core "resend vacuum cleaner". My intention was to update each host once per day, but the resends seem to be regenerating quite quickly, so yesterday I actually updated twice on each of two hosts. In each case I stopped drawing new tasks when the resends ran out or the cache was completely full, whichever came first. Over the 24 hour period, one host got 27 resends and one primary task, whilst the new addition got 23 resends and one primary task.

The script I use for getting rid of the tags takes around 5-10 seconds to run. Most of that time (and why it's a bit variable) is spent looping while waiting for BOINC to be confirmed stopped - i.e. BOINC's PID has disappeared from the process table. It's important to wait for the state file to be written and closed before trying to edit it :-). The actual removal of the tags is extremely fast, even if there are hundreds of them.
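
The wait loop is nothing fancy - roughly this idea (a simplified Python sketch; the process name and the way BOINC gets told to shut down are assumptions and will depend on how it's installed):
[pre]
import subprocess, time

def boinc_pid():
    # Return the PID of the running client, or None if it isn't running.
    # Assumes the process is called "boinc"; adjust for your install.
    try:
        return int(subprocess.check_output(["pidof", "boinc"]).split()[0])
    except subprocess.CalledProcessError:
        return None

subprocess.call(["boinccmd", "--quit"])   # ask the client to shut down cleanly

# Don't touch the state file until the PID has gone from the process table,
# so we know the state file has been written and closed.
while boinc_pid() is not None:
    time.sleep(1)

# ... safe to edit the state file from here ...
[/pre]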

Occasionally I run the script on other hosts without adding to their cache of LIGO data. I just remove the tags on the current blocks if (by chance) it appears that too many frequency bands have been removed from play prematurely. I did that on a third host after the first two had filled up, and this third host was also able to fill up on resends, just from recent frequency bands that had only just been depleted of primary tasks. Maybe it was just the luck of the draw, but I'm getting the impression that there really is a higher than expected "failure rate" (for whatever reason) in primary tasks.

To look at this a bit more closely, I decided to examine the 23 resends sent to the second host mentioned above. I looked at each quorum in turn and determined the main reason for the resend. In three cases there was more than one reason, but I've recorded the single reason why the extra task was sent to me. For example, perhaps the initial task had a comp error and then the resend for that timed out; in that case I would count the reason as a deadline miss, because that's exactly what caused a further task to be sent to me.
[pre]
#Resends Reason for resend

4 Deadline miss
1 Some form of comp error
18 Task aborted
[/pre]
As it takes a bit of time to go through the quorum for each resend task (the WU ID link) I'm not intending to keep doing this. I found it interesting to do it once :-).

Cheers,
Gary.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5845
Credit: 109971846362
RAC: 29971678

RE: [pre] #Resends

Quote:

[pre]
#Resends Reason for resend

4 Deadline miss
1 Some form of comp error
18 Task aborted
[/pre]
As it takes a bit of time to go through the quorum for each resend task (the WU ID link) I'm not intending to keep doing this. I found it interesting to do it once :-).


I decided to do it again a day later since I wanted to see if the high proportion of aborts would continue. Both of these hosts are continuing to live entirely on resends and not even two hosts can keep up with the flow currently.

So, for the same host for which the data was listed yesterday, here is today's breakdown. 23 further resends were added during the day with no primary tasks.
[pre]
#Resends Reason for resend

19 Deadline miss
1 Some form of comp error
3 Task aborted
[/pre]
Interestingly, it's virtually the complete reverse of the previous breakdown - the vast bulk this time were deadline misses. As I went through them, quite a few were double deadline misses. The wingmen have been waiting over a full month. As I'm trying to stock up on these, I have a full 10 day cache so unfortunately they'll have to wait a bit more.

I also found an example of the same situation that caught the author of this message. If you take a look at the full quorum for one of my resends, you can see my task, issued on 14 Apr 2011 at 21:31:05 UTC. This resend was in response to a deadline-miss task that is no longer showing as a deadline miss, since it was returned late and then validated just half an hour after my task was issued. So it was a deadline miss for a short time - long enough for it to be reissued to me. Of course it is now validated and the quorum is complete, so when I get home tonight I'll need to remember to find it and abort it rather than let it waste time when it gets to the top of the queue.

Cheers,
Gary.

Richard Haselgrove
Joined: 10 Dec 05
Posts: 2140
Credit: 2770481732
RAC: 913018

RE: [pre] #Resends

Quote:

[pre]
#Resends Reason for resend

4 Deadline miss
1 Some form of comp error
18 Task aborted
[/pre]
[pre]
#Resends Reason for resend

19 Deadline miss
1 Some form of comp error
3 Task aborted
[/pre]
Interestingly, it's virtually the complete reverse of the previous breakdown - the vast bulk this time were deadline misses. As I went through them, quite a few were double deadline misses. The wingmen have been waiting over a full month. As I'm trying to stock up on these, I have a full 10 day cache so unfortunately they'll have to wait a bit more.


Gary, you were too polite to say that the full text of the message is "aborted by user". But in fact, I think you were right to avoid pointing the finger of blame. The BOINC client itself - maybe only recent versions, I'd need to look back in the version change history - will abort a task if it has not even started by the time of the deadline. Those mechanical 'host aborts' end up in the database with the same error code as human 'user aborts', and I don't think you could distinguish the two cases. And of course, the aborts don't appear at all until the host next reports, whereas the unforgiving server shows 'deadline miss' the second the 14 days are up. I suspect the same task could appear one day as a deadline miss and 24 hours later as an aborted task - it might just be a question of random timing which message you see.

Either way, it's interesting, and to my mind worrying, to see what a high proportion of resends are triggered this way. There's not a lot any project can do about a new user who attaches, downloads a few WUs, then loses interest and wanders away without returning them. That's life. But my next research question, if you're up for it, would be: what proportion of those deadline/abort cases come from new/low credit hosts (or hosts with low task counts), and what proportion from established high-throughput hosts, like in that bug-report thread you linked.

I would be disappointed, but not surprised, if there were a high number of deadline misses from high-yield hosts running on the ragged edge of a 10+ day cache. Such behaviour is costly to the project, and impolite to other users - unless coupled with the download management skills you're demonstrating in this thread. As we've seen many times, without that management a high proportion of resends in this cleanup phase require substantial download bandwidth - irritating and potentially costly for users, and a big bottleneck for the project.

I've argued the point many times over the years at SETI: for users to continue participating in the project they profess to love, there has to be a functioning server at the other end of the line. They can fiddle and tweak their own hardware and software as much as they like, but without the server, it comes to nothing. So I urge people to turn down their local cache sizes - to keep the BOINC database small and efficient, and to reduce the need for workunit storage space. And likewise, here at Einstein, to reduce the need for unnecessary resends. Unfortunately, I usually end up on the wrong side of the argument. C'est la vie.

[And when talking about deadline misses here, we mustn't forget the effect of the big, but part-time, computing clusters. IIRC, when they are pulled off the project because they are needed for their 'day job', any work in progress is simply wiped clean without being reported back to the server. Not a lot we can do about that - the project is grown-up enough to make its own decisions about the balance of computation and bandwidth the clusters provide/require.]

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5845
Credit: 109971846362
RAC: 29971678

Richard, many thanks for your

Richard, many thanks for your comments and insights. They are really appreciated.

Quote:
... The BOINC client itself - maybe only recent versions, I'd need to look back in the version change history - will abort a task if it has not even started by the time of the deadline.


This must only happen with relatively recent versions of the client. Most of my hosts run Linux and virtually all of those run 6.2.15, and I've never seen a client-instigated abort, although I have observed a few cases where machines have locked up for long enough to cause deadline misses. I've always had to abort the real deadline misses manually. There are ways to give yourself a deadline extension for tasks that haven't actually missed the deadline yet but are likely to do so if allowed to continue without intervention.

Quote:
... But my next research question, if you're up for it, would be: what proportion of those deadline/abort cases come from new/low credit hosts (or hosts with low task counts), and what proportion from established high-throughput hosts, like in that bug-report thread you linked.


The flow of resends for the particular frequency range for which I have all the LIGO files continues to be greater than I expected, and I now have 3 quads that are still being fully fed on them. Between the three hosts there would now be probably more than 200 resends accumulated. I'll see if I can look at at least some of those in detail. There are quite a lot coming from well-established hosts, and I think I understand the reasons for some of these. I'll comment briefly now and possibly say more later when I've looked into it a bit more.

Quote:
I would be disappointed, but not surprised, if there were a high number of deadline misses from high-yield hosts running on the ragged edge of a 10+ day cache. Such behaviour is costly to the project, and impolite to other users ...


I think there was always an impact of the "Seti cache" mentality on E@H but I also think there are a couple of other factors that are adding to this. Those factors didn't really exist before the efficiency of the GPU app was so dramatically improved recently.

Firstly, there are those who set large caches, seemingly to get enough GPU tasks for fast GPUs. I don't know from experience, but I'm assuming that doing GW tasks on the CPU and BRP tasks on the GPU is being hampered by a large overestimate of the GPU task crunch time, leading to a low number of GPU tasks on board. I think people are increasing the cache size and also aborting GW tasks in order to encourage the BRP supply. This is probably particularly so for people running under AP (the anonymous platform mechanism), with the 4 hour project backoff being seen as a real problem.

Secondly, there are people who have decided just to run GPU tasks and to abort any GW tasks since they don't understand how to set preferences to do this automatically.

Thirdly, some of those people attempting to use AP to run multiple simultaneous tasks per GPU are having difficulties "getting it right", so there are probably some resends coming from user mistakes/inexperience.

Quote:
[And when talking about deadline misses here, we mustn't forget the effect of the big, but part-time, computing clusters. IIRC, when they are pulled off the project because they are needed for their 'day job', any work in progress is simply wiped clean without being reported back to the server. Not a lot we can do about that - the project is grown-up enough to make its own decisions about the balance of computation and bandwidth the clusters provide/require.]


I must say that I have exactly the same thoughts about this as you. There have been a number of occasions in the past where this seems to have happened exactly as you describe. I've often wondered why quite a simple script using boinccmd couldn't be used to properly "clear the decks" rather than just letting the work time out.
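
I'm thinking of something along these lines (an untested Python sketch of the idea; it parses the boinccmd output very naively, and the field labels may differ between client versions):
[pre]
import subprocess

# List the tasks currently on the host, pick out the name and project URL of
# each one, then abort them all and report back to the project.
out = subprocess.check_output(["boinccmd", "--get_tasks"]).decode()

tasks, name = [], None
for line in out.splitlines():
    line = line.strip()
    if line.startswith("name:"):
        name = line.split(":", 1)[1].strip()
    elif line.startswith("project URL:") and name:
        tasks.append((line.split(":", 1)[1].strip(), name))
        name = None

for url, task in tasks:
    subprocess.call(["boinccmd", "--task", url, task, "abort"])

for url in {u for u, _ in tasks}:
    subprocess.call(["boinccmd", "--project", url, "update"])   # report the aborts
[/pre]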

Cheers,
Gary.
