The "cleanup" for the S5GC1HF run

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5,870
Credit: 116,037,552,756
RAC: 35,691,054
Topic 195735

As I'm sure regular readers would have noticed some time ago, the S5GC1HF run is rapidly drawing to a close. This will have some predictable consequences for scheduling of the remaining "primary" tasks and also for the "resend" tasks that will be created for primary tasks that fail in some way or are not returned by their deadline.

You should also have noticed this thread where Bernd has kept us informed of what to expect with the new S6 run that will soon be launched. The most recent update, which mentioned the distribution of the S6 data to the mirrors, probably means that the S6 run will be launched fairly soon now.

My recollection is that there is usually (when practicable) a changeover period when the remnant tasks from the old run and the initial tasks from the new one are both available. During this period, as hosts run out of work for the specific "old" frequency bands they are working on, the scheduler may immediately switch them to a new frequency band for the "new" run. It's usually quite an orderly transition and it tends to happen fairly quickly, so the number of hosts still working on the remnants of the old run can decline quite rapidly.

The cleanup of the old tasks is going to take quite a while (it always does) - assuming that nothing "special" is done to shorten it. It's not really to do with any significant lag in rolling out the remaining primary tasks (those with an _0 or _1 extension in their name) although this will slow down as the new run kicks in. It's much more to do with what percentage of these last primary tasks either error out or miss the deadline.

My gut feeling is that under normal circumstances, something like 20-30% of primary tasks will be lost in some way and will need to be replaced by a "resend". These extra tasks are easily recognisable as they have extensions like _2, _3, _4, ..., (however many are needed) until the quorum is eventually completed or the upper limit of 20 failed tasks is reached. The end of a run is not really "normal circumstances". Tasks can fail while downloading or computing, or because they are aborted, or because the host fails, is shut down, or is removed from the project. My observation is that the "new run" attraction tends to give a spike in the aborted tasks. Also I think there is an element of "it's a good time to quit the project because it's the end of a run" type of thinking. A lot of people who leave probably don't bother to abort the remaining tasks in their cache, so the reissuing of those is further delayed until the deadline is reached.

As a very much simplified example of what tends to happen during the finishing off of the old run, let's assume that for the very last 200K tasks to be issued, 10% will ultimately miss the deadline and another 10% will fail in some way but be reported to the server before the deadline. So during the two week period after all primary tasks were sent out there would be 20K resends created (and issued) to replace the failed tasks and a further 20K tasks at the end to cover the deadline misses. If you repeat this logic after 4 weeks, 6 weeks, 8 weeks, etc, you can see that there will still be resend tasks being sent out more than a couple of months after the old run has nominally finished.
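
For those who like to see the arithmetic, here is a minimal sketch (plain Python, written just for this post) of how that resend tail decays over successive two-week deadline periods. The 200K starting figure and the two 10% loss rates are the same illustrative guesses as above, not measured numbers.

[pre]
# Toy model of the resend "tail" after the last primary tasks go out.
# Illustrative assumptions only: 200,000 final primary tasks, 10% fail and
# are reported before the deadline, 10% silently miss the 2-week deadline,
# and every generation of resends suffers the same losses.

initial_tasks = 200000
fail_rate = 0.10           # reported failures (resent almost immediately)
deadline_miss_rate = 0.10  # only replaced after the 14-day deadline passes

outstanding = initial_tasks
week = 0
while outstanding >= 1:
    lost = outstanding * (fail_rate + deadline_miss_rate)
    print("weeks %2d-%2d: ~%6d resends still to be created" % (week, week + 2, lost))
    outstanding = lost
    week += 2
# The tail only drops below a handful of tasks after a couple of months.
[/pre]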

Improving the efficiency of "cleaning up the dregs" of an old run has always interested me. As well, I've been concerned about the blowout in bandwidth usage that is directly attributable to the size and number of large LIGO data files that need to be sent out to a host each time it is allocated one of these resend tasks. So, if hosts could be prevented from switching prematurely to the new run and if hosts could be prevented from ditching their current LIGO data files way too early, we might be able to shorten the cleanup time as well as make a reduction in the "bandwidth spike" that occurs at this time.

To fully understand this, you need to understand the "locality scheduling" that Einstein uses. As an example, let's assume that your new host has just joined up and has been allocated its first task whose name is h1_1461.15_S5R4__1014_S5GC1HFa_2. I'm simply using the name of a task allocated to one of my hosts recently. You won't find a file anywhere on your computer with this name since the task exists as a small "result template" that is stored in your state file (client_state.xml). At the time you receive the task (which itself is very small) you will also need to download 49 large data files (1 skygrid file plus 48 LIGO data files) which totals something like 180-200 MB of data. The LIGO data are grouped into frequency bands with 4 files per band. The data would start at the frequency of 1461.15Hz and go through to 1461.70Hz in 0.05Hz increments. The data files for the 1461.15Hz band are named h1_1461.15_S5R4, h1_1461.15_S5R7, l1_1461.15_S5R4 and l1_1461.15_S5R7 respectively. There are 11 further groups of similarly named files to get right through to the 1461.70Hz frequency. As I said, all this data is required for just one task.
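
To make the naming scheme concrete, here is a small sketch (Python, written for this post rather than taken from any project tool) that lists the 48 LIGO data file names a task like the one above depends on. The 12 bands / 4 files per band layout is exactly as described in the paragraph above.

[pre]
# Illustrative only: derive the 48 LIGO data file names needed by a task
# such as h1_1461.15_S5R4__1014_S5GC1HFa_2 (12 frequency bands of 0.05 Hz,
# 4 files per band: h1/l1 detectors x S5R4/S5R7 data sets).

def ligo_files_for_task(task_name, bands=12, step=0.05):
    base_freq = float(task_name.split("_")[1])   # e.g. 1461.15
    files = []
    for i in range(bands):
        freq = base_freq + i * step
        for detector in ("h1", "l1"):
            for dataset in ("S5R4", "S5R7"):
                files.append("%s_%.2f_%s" % (detector, freq, dataset))
    return files

names = ligo_files_for_task("h1_1461.15_S5R4__1014_S5GC1HFa_2")
print(len(names))            # 48
print(names[0], names[-1])   # h1_1461.15_S5R4 l1_1461.70_S5R7
[/pre]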

The concept of locality scheduling is that the scheduler will always be made aware of exactly what data you already have and will always try to send you further tasks that are based on the same LIGO files if possible. At this frequency, task sequence numbers (eg __1014 in the above example) start at something like 1500 and go all the way down to zero. So, potentially, a fast host could get hundreds of tasks based on the same data before the data is exhausted.

When nearing the end of a run (as now) the high seq#s will be gone and you will be heading down to the zero end for those frequency bands still 'in play'. Also, when requesting work, the scheduler will send you any available resend tasks for the same frequency. The example above was actually a resend task - note the _2 suffix. When my host received it, there was a companion task, h1_1461.15_S5R4__731_S5GC1HFa_0, a "primary" task (the _0 suffix) with a much lower seq#. Those tasks were received about 10 days ago and that frequency band is steadily being drained. The very latest task received by that host for that frequency has a seq# of __550 so you can see the steady progression towards zero.

Under normal circumstances without applying a "trick" or two, that host would not still be crunching 1461.15Hz tasks, even though (as you can clearly see) there are 549 still available (without resends - of which there will be quite a few more). To understand why I can make this statement, you need to understand how the work unit generator (WUG) works.

Obviously, with thousands of tasks for thousands of frequency bands to be generated, they can't all exist in the online database ready for downloading right from the start. The database would be horrendously large and slow to manipulate. For responsiveness, there needs to be a quite small number of tasks for each active frequency band at any given time and, as they are sent out and exhausted, the scheduler sets a flag that the WUG notices and some replacements are generated. When tasks are used up, it takes less than a minute for replacements to appear - it's a continuous process. However, if a host happens to ask for a task during this small window (last available task being used and replacements appearing), the scheduler won't be able to supply that particular frequency and will move to the next higher frequency band. That could also be temporarily empty and so the jump could well be more than just one band. Along with the new task, the scheduler sends a request for the client to delete the 4 LIGO data files for the frequency band that is temporarily without available tasks, even though (in less than a minute) that band will have more tasks. I imagine the scheduler could be made a bit more intelligent about this - say by not issuing a delete request if the __0 seq# has not yet been issued (for example). I'm sure there would be other possible strategies.
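
To make the sort of rule I have in mind concrete, here it is in rough pseudocode (Python-flavoured, and emphatically NOT the real scheduler logic - the band methods are invented names for the checks I'm describing):

[pre]
# Pseudocode sketch of the suggested check, not the real scheduler code.
# Idea: don't ask the client to delete a band's data files merely because
# the band is momentarily empty while the work unit generator refills it.
def should_request_delete(band):
    if band.has_unsent_tasks():
        return False                  # clearly still needed
    if not band.seq_zero_issued():    # __0 not yet sent, so the WUG will make more
        return False
    if band.resends_outstanding():    # quorums still incomplete
        return False
    return True
[/pre]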

I see this as a real problem for people like me because of the bandwidth consumed by a large number of hosts potentially being forced to download large numbers of large data files unnecessarily. Last December, my hosts used around 250GB of downloads. This March, after implementing strategies described below, the same hosts used less than 25GB. In April, even with the possible flood of data downloads caused by the "end of run" scenario, I doubt that I will need much more, provided I keep exploiting the workaround I've discovered.

If a host always asks for tasks exactly when needed, it will be downloading them one at a time, and if a band is empty, the delete request will be issued. Alternatively, if No New Tasks (NNT) is set for a period (eg a day) and then removed, the host will ask for a day's work (around 18-20 tasks on my quads) in a single hit. In the process of filling the request, the scheduler will temporarily exhaust quite a few bands (in supplying 20 tasks) without issuing a single delete request - unless by chance, the very first band happens to be empty at the time. The scheduler will set the appropriate flags as it empties each band (so the WUG will fill them again) but it won't actually issue any delete requests for them. So within a minute, all the bands will be refilled and further requests for the same frequencies could easily be made. This is a marvellous "feature" of the current scheduler that I've been able to exploit to drastically reduce my bandwidth consumption.

So a brief summary of my strategy for the last couple of months (from early January) has been as follows:-

  * Choose a frequency band to work in with plenty of available tasks (seq# > 1300 say) as close to the top frequency (1500Hz) as possible. By chance, I noticed one of my hosts had just acquired tasks for 1430.xxHz with a seq# around 1300 - seemed perfect to me.
  * Copy all the LIGO files, and all the <file_info> blocks describing those files (extracted from the state file), to a share that all hosts can access.
  * For each other host you have, set NNT and allow all tasks for the current set of frequency bands to be completed, returned and acknowledged.
  * Stop BOINC and remove all the <file_info> blocks for those frequency bands.
  * Seed each host's project directory with copies of the saved LIGO files (and remove the old LIGO files) and seed each host's state file with the saved <file_info> blocks for the frequency bands you wish to transition to (a rough sketch of this edit is given after the list). It takes maybe a minute or two per host to edit the state file to remove the old and add in the new. Once you have done it a few times it becomes quite easy and quite routine.
  * Restart BOINC, allow new tasks, and the tasks you receive will be for the frequency band you desire.
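
To give a feel for the state file edit in those last few steps, here is a rough sketch (Python, with entirely made-up file names and paths, and certainly not a supported tool) that removes the <file_info> blocks for the old frequency set and splices in the saved ones. It assumes BOINC is fully stopped, that no remaining task still references the old files, and that you keep a backup.

[pre]
# Hypothetical helper: splice saved <file_info> blocks for a new frequency
# set into client_state.xml, removing the blocks for the old set first.
# Run only with BOINC stopped, after all old-frequency tasks are reported.
import re, shutil

STATE = "client_state.xml"
SAVED_BLOCKS = "file_info_1430.xml"   # <file_info> blocks harvested earlier
OLD_FREQ = "_1429."                   # frequency set being retired (example)

shutil.copy(STATE, STATE + ".bak")    # always keep a backup
text = open(STATE).read()
saved = open(SAVED_BLOCKS).read()

# Drop every <file_info> block that names a LIGO file from the old set.
def keep(match):
    return "" if OLD_FREQ in match.group(0) else match.group(0)

text = re.sub(r"<file_info>.*?</file_info>\n", keep, text, flags=re.S)

# Insert the saved blocks ahead of the first <workunit>.  This assumes the
# Einstein workunits are the first in the file; adjust the anchor if not.
text = text.replace("<workunit>", saved + "<workunit>", 1)

open(STATE, "w").write(text)
[/pre]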

I started with 1430.xxHz on all my hosts and they are now up to 1435.xxHz. I have a couple of fast hosts that can chew through a lot of tasks so I periodically allow one of them to download a lot of tasks (and so jump through many frequency bands) so as to harvest new, higher frequency data bands. I currently have through to 1436.90Hz saved and this is deployed as required to all the other hosts that need it. By using the boinccmd utility in a script, running on a single host on the LAN, it is very easy to automate the whole process of setting and unsetting NNT on all participating hosts so that they get to request a bunch of new tasks at a time and so largely avoid the delete request from the scheduler. I have another simple script that can stop BOINC and remove any delete tags that happen to get inserted into any state file. I'm finding I don't need to use it very often and I'm seeing seq#s dropping to zero before data files are being deleted.
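
The automation I mean is nothing fancier than the following rough sketch (Python driving the standard boinccmd utility; the host names, RPC password and timing are placeholders, and this is not my production script):

[pre]
# Rough sketch only: cycle "no new tasks" on a set of LAN hosts so that each
# one requests a decent chunk of work in a single hit.  Host names and the
# GUI RPC password are placeholders; remote RPC must be allowed on each host.
import subprocess, time

HOSTS = ["quad1", "quad2", "quad3"]          # placeholder host names
PASSWD = "rpc_password"                      # placeholder GUI RPC password
PROJECT = "http://einstein.phys.uwm.edu/"    # Einstein@Home master URL

def boinccmd(host, *args):
    subprocess.run(["boinccmd", "--host", host, "--passwd", PASSWD] + list(args),
                   check=True)

for host in HOSTS:
    # allow new tasks, poke the scheduler, then go back to NNT
    boinccmd(host, "--project", PROJECT, "allowmorework")
    boinccmd(host, "--project", PROJECT, "update")
    time.sleep(120)   # give the scheduler RPC and the downloads time to start
    boinccmd(host, "--project", PROJECT, "nomorework")
[/pre]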

My hosts have chewed through many thousands of tasks from this limited subset of frequencies and the download bandwidth used is minuscule by comparison with what it was.

I have some additional thoughts for the coming "end of run" scenario that I started talking about earlier on but I'll save them for a subsequent post as this one is already way too long and I need to be elsewhere :-).

EDIT: I should make it abundantly clear that the techniques and procedures I have described above are not "standard practice" and I'm certainly not attempting to encourage people to take risks by editing their state files unless they are already quite skilled at doing so. My intended audience is that relatively small group of participants who are not averse to experimenting with micromanaging of their setups. When I started this post, my intention was to suggest ways to help with the cleanup of the old run. That will come in the next installment. If the above is gobbledegook, you'll probably want to give the next installment a complete miss. Nobody should do anything they are not fully comfortable with and which they don't fully understand.

Cheers,
Gary.

Bernd Machenschalk
Moderator
Administrator
Joined: 15 Oct 04
Posts: 4,305
Credit: 249,006,160
RAC: 33,803

The "cleanup" for the S5GC1HF run

Hi Gary!

Could you please post an abstract or a summary to this article? This must have taken you a lot of time to write...

BM


Oliver Behnke
Moderator
Administrator
Joined: 4 Sep 07
Posts: 982
Credit: 25,170,813
RAC: 2

RE: However, if a host

Quote:
However, if a host happens to ask for a task during this small window (last available task being used and replacements appearing), the scheduler wont be able to supply that particular frequency and will move to the next higher frequency band. That could also be temporarily empty and so the jump could well be more than just one band. Along with the new task, the scheduler sends a request for the client to delete the 4 LIGO data files for the frequency band that is temporarily without available tasks, even though (in less than a minute) that band will have more tasks.

Are you 100% sure about this? I don't question your observations but I'll definitely have to verify your assessment by dissecting the actual implementation.

FYI, I'm analyzing the scheduler code for related reasons right now (optimizing locality scheduling, also WRT client file deletion). Your post comes just at the right time...

Cheers,
Oliver

Einstein@Home Project

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5,870
Credit: 116,037,552,756
RAC: 35,691,054

RE: Could you please post

Quote:
Could you please post an abstract or a summary to this article?


I presume you are talking about the promised "installment to come" :-).

I tend to launch into these sorts of discussions without pre-planning exactly what I'm going to write so the following summary may not necessarily agree with what will be said when I actually get around to writing it down :-). However this is what I think I'll be saying. I'll probably end up digressing as new thoughts occur to me during the journey :-).

The cleanup of the "dregs" of a previous run is hampered by the way the scheduler issues "delete" requests to the client. As a consequence, an unacceptably large proportion of the "old run" LIGO data files that will still be needed for quite a number of weeks or months after an old run has nominally finished will need to be sent repeatedly to various new hosts as "resend" tasks are progressively required to complete the outstanding quorums. This will cause unnecessary extra load and bandwidth consumption on the servers as well as bandwidth demands on any client selected to receive the occasional resend tasks. I believe this could be addressed fairly easily in the client by making the client report to the scheduler all LIGO files that haven't actually been physically deleted yet. I believe there probably won't be time to design, test and implement the necessary changes before the new run kicks in, so I have already implemented strategies so that a subset of my hosts will exist for weeks after the transition on a diet consisting entirely of resend tasks. I've done it before and I know it works. I want to publish the strategies so that other participants prepared to spend some time editing state files can extend the coverage to frequency bands that I won't be covering.

Quote:
This must have taken you a lot of time to write...


A few hours so far this time and probably quite a few hours more. I have tried to document this stuff previously and I've just now paused this reply while I found what I wrote last time. I've just re-read that message and all the following posts to the end of the thread. I must confess that I'd forgotten exactly how much detail I provided last time and I'd completely forgotten that I'd done enough digging to show it was the client rather than the server that worked out exactly which files needed the delete tags. So technically I'm wrong to state that the scheduler "issues delete requests" since it appears that the client infers this from what is missing in a scheduler reply. Minor detail since the upshot is still the same ;-).

I'm very happy to spend the time documenting my experiences. I find that the very best way to really understand a problem (that I think I already understand), is to set out to completely explain it in terms that an average reader can comprehend. I'm not claiming I'm going to achieve that outcome perfectly, but just making the attempt really helps me to see things that I hadn't really understood properly beforehand. So it's not really a chore, it's just a bit of self enlightenment :-).

Since I'm getting old and forgetful, it's very good to document stuff and then be able to refresh my memory by going back over all the details any time I need to :-).

Cheers,
Gary.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5,870
Credit: 116,037,552,756
RAC: 35,691,054

RE: RE: However, if a

Quote:
Quote:
However, if a host happens to ask for a task during this small window (last available task being used and replacements appearing), the scheduler wont be able to supply that particular frequency and will move to the next higher frequency band. That could also be temporarily empty and so the jump could well be more than just one band. Along with the new task, the scheduler sends a request for the client to delete the 4 LIGO data files for the frequency band that is temporarily without available tasks, even though (in less than a minute) that band will have more tasks.

Are you 100% sure about this? I don't question your observations but I'll definitely have to verify your assessment by dissecting the actual implementation.


Let me state my assumptions and then give details of experiments I've done many times which throw further light on the whole process.

From what I've read (quite a while ago), I assume that the scheduler is supposed to set a flag when it exhausts the available tasks in a particular frequency band. The WUG notices this flag, generates a few more tasks for the band in question and resets the flag, or something along these lines. I also read that the scheduler is supposed to give the WUG a small amount of time to come up with the extra new tasks. If that doesn't happen within a short period, the scheduler moves on, assuming no more tasks will be forthcoming. Of course, I do realise that what I've read at some time in the past may not be current any more as the software undergoes continuing development. I can't know about that if it's not documented or if the documentation isn't up to date. I'm not a programmer and I don't read code.

So I tend to rely more on actual experiments and deductions. If I progressively increase the "extra days" cache size of a modern quad core host (configured to receive only GW tasks) by say 0.05 days at a time, the host will download a single new task and then go into a 1 minute backoff. During that time if I keep making the 0.05 day cache increases, the client will continue to request and receive one task at a time (most times at least). On average there will be something like 0 to 4 of these "single step" new work requests that are filled from the same frequency band before a delete request gets announced and the new task gets supplied from the next highest frequency band. If I keep going for as long as I please, I'll get on average around 2 or 3 tasks per frequency before the dreaded delete request appears. The only way I know of reconciling this very repeatable behaviour is to assume that once a frequency band is exhausted, the WUG can't refill it (or perhaps isn't told to refill it) in sufficient time for the scheduler to see the newly created tasks (and therefore be able to continue to access them) and so there has to be the frequency shift. As I intimated, sometimes it's a double frequency band shift and I interpret this as meaning that the next band also had no tasks immediately available. I could do 20 of these 0.05 day increases and get a whole day's worth of extra work at the cost of quite a number of delete tags being applied - and of course a whole bunch of new LIGO data downloads.

So I contrasted that with what happens if I make a full 1.0 day increase in a single step. The result is perhaps 20 new tasks in a single hit and quite commonly, not a single delete request. However these tasks still span about the same number of frequency bands (2 or 3 tasks per band) and there will be much the same number of new LIGO data downloads so frequency bands are still being emptied. The big advantage is no (or very few) delete tags are created. If (during the 1 minute backoff) I add an extra day to the cache, I can get another 20 tasks from very much the same set of frequency bands that the first set of 20 tasks came from. So I interpret this as meaning that bands are being exhausted as usual but are then refilled by the WUG during the backoff interval so that the same bands are available after the 1 minute delay. I don't know for certain what causes the relative absence of delete tags with a bulk download as compared with the "single task at a time" behaviour, but I've been happily exploiting it for months now. In fact in thinking more about it as I try to describe it, I'm wondering if the real problem is that the scheduler is not quite smart enough to notice when it is taking the very last task for a particular band and has to discover it again later (trying to access a non-existent next task) before it sets the WUG flag.

There is another observation that I think supports this. Consider the situation I described where a bulk fetch of 20 new tasks doesn't result in any delete tags being applied. It is possible that the very last frequency band used in supplying the full 20 tasks may be only partly depleted, or it could be fully depleted. If it is only partly depleted, there will be no problem and a subsequent bulk work request (from the same host or even a different host feeding on the same set of frequencies) will fully deplete the band and then move on to the next band with no delete tag being set. However if that last band was fully depleted, the scheduler isn't smart enough to notice and the flag is not set. The next work request (from any host feeding on the same set of frequency bands) will discover the empty band and the WUG flag will only then be set, along with the delete tag.

So, in summary, the scheduler doesn't actually see that it has assigned the very last available task in the last band accessed in filling a work request. Therefore it doesn't get to set the flag for the WUG to refill this band. A subsequent work request (which could be quite a while later) has to find the empty band once again, and so the delete tag gets set when it wouldn't have been if the scheduler had noticed the empty band and flagged it for the WUG at the time of the initial work request.

Quote:
FYI, I'm analyzing the scheduler code for related reasons right now (optimizing locality scheduling, also WRT client file deletion). Your post comes just at the right time...


I hope you find the comments useful. This is not particularly new as I've tried to document the behaviour previously. Take a look at the link to previous info that I gave in my recent reply to Bernd. If you follow all the conversation through to the end of that previous thread from last year, you will find a typo in my very last message (the last bullet point in the second bullet point list) where I say "files from 1139.35 to 1141.25 ...". This should obviously read "files from 1140.35 to 1141.25 ..." to make sense in the context of the example being used.

In the light of my observation that the client seems to be deducing when to apply delete tags to files rather than being directly issued with the request by the scheduler, perhaps this all needs to be addressed in the client code rather than the scheduler code, so it may be beyond the scope of what you are working on currently. I'll be very interested to hear about what you discover and what improvements you can make from the scheduler perspective.

Cheers,
Gary.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5,870
Credit: 116,037,552,756
RAC: 35,691,054

I mentioned in my opening

I mentioned in my opening post in this thread that I would be making a followup post about strategies to assist with the cleanup of the "dregs" of the current GW search once the new run has started and once many hosts start being moved to the new run. If you have actually understood what I was trying to get across in the initial post and if you've followed the replies and checked out the links to previous posts about the issues, I should really congratulate you for your persistence. Perhaps you are just the sort of person to whom I can "sell" the following suggestions :-).

Obviously, what follows is absolutely NOT for the person that wants a simple "set and forget" BOINC experience. There really are lots of opportunities to screw things up right royally if you are not very careful. Please read on entirely at your own risk!!

The Problem

As the current S5GC1HF run winds down, (particularly after the new run fires up for a couple of days) you will notice that work requests will increasingly result in replies from the scheduler that not only give you the requested work but will also say

Got server request to delete file h1_1435.35_S5R4
Got server request to delete file h1_1435.35_S5R7
Got server request to delete file l1_1435.35_S5R4
Got server request to delete file l1_1435.35_S5R7
....
....


along with a bunch of new LIGO data file downloads. This is quite normal during the "end game" of a run but it does put extra load on the servers and it puts quite a spike in the bandwidth used by each participating host. It is a symptom of the fact that many hosts transition quickly to the new run and that LIGO data files from the old run are thrown away too quickly when apparently finished with. So when the scheduler needs to send out some tasks for the old run (particularly resend tasks that progressively become available in the weeks following the nominal end of the old run), there will be very few hosts that still have the correct LIGO data files and almost invariably the scheduler will give up waiting to find one and will send a single resend task along with the huge LIGO data payload (180-200MB) to the next poor sucker that comes along looking for a task.

I have a lot of hosts so I really notice this. I know from past experience (I've been through this a few times already) that there are strategies that can be adopted by individual participants that can prevent LIGO data from being lost prematurely and so can allow the request of lots of resend tasks that suit the data you have saved and so avoid all the unnecessary downloads.

The Strategy

The key thing to do is to pick a set of related frequency bands that your host is currently working on and to save copies of all the LIGO data files that are part of the chosen frequency set. Put them on a network share or even a pen drive, etc. You will also need the <file_info> descriptions for these selected files, which means you need to copy them out of your state file (client_state.xml) in a timely fashion, before they get deleted and lost.

As an example, consider this snippet from my opening post

Quote:
... let's assume that your new host has just joined up and has been allocated its first task whose name is h1_1461.15_S5R4__1014_S5GC1HFa_2. I'm simply using the name of a task allocated to one of my hosts recently. You won't find a file anywhere on your computer with this name since the task exists as a small "result template" that is stored in your state file (client_state.xml). At the time you receive the task (which itself is very small) you will also need to download 49 large data files (1 skygrid file plus 48 LIGO data files) which totals something like 180-200 MB of data. The LIGO data are grouped into frequency bands with 4 files per band. The data would start at the frequency of 1461.15Hz and go through to 1461.70Hz in 0.05Hz increments. The data files for the 1461.15Hz band are named h1_1461.15_S5R4, h1_1461.15_S5R7, l1_1461.15_S5R4 and l1_1461.15_S5R7 respectively. There are 11 further groups of similarly named files that must be downloaded, right through to the 1461.70Hz frequency. As I said, all this data is required for just one task.

Let's imagine your host just received this task and you decided that 146x.xx would be a good frequency set to get involved with. As your host requests work over the coming days, you would expect quite a few delete requests for frequency bands within this frequency set and quite a few new bands to be added to the set. In a week's time, for a host with something like the crunching speed of a quad, I wouldn't be surprised to see that you had acquired extra frequency bands up to 1461.xx or 1462.xx, or even higher if many other hosts were feeding on the same frequencies (as they probably will be) and the remaining tasks were being depleted rapidly. This is all fine since you want to acquire as wide a spread of frequencies as possible. So you simply save copies of all files as you receive them.

In addition, perhaps once every day or two as you see new LIGO data being downloaded, you make a copy of your state file and you edit the copy so as to extract the <file_info> descriptions for every single LIGO file you are saving. The easy way is to open the state file copy in one editor window, search for the first new frequency band you have received (but not saved) and copy and paste the <file_info> blocks from the state file copy into a new window where you have opened the blocks you have already saved up to now. The first time you do it you will obviously be saving a completely new file. Each subsequent time you will be adding to the end of the existing file. I would use the name file_info_1460.xml so I knew what frequency set it belonged to and I'd save this file on the network share along with all the actual LIGO data files.
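
If you'd rather not do the copy and paste by hand, something like the following rough sketch (Python; the file names, the "146" prefix and the exact delete marker tag are all assumptions you should check against your own state file) pulls every <file_info> block for a chosen frequency set out of a copy of the state file and appends the new ones to your saved collection:

[pre]
# Rough sketch: harvest <file_info> blocks for a chosen frequency set from a
# COPY of client_state.xml and append any new ones to the saved collection.
import re, os

STATE_COPY = "client_state_copy.xml"
SAVED = "file_info_1460.xml"
FREQ_PREFIX = "_146"      # matches h1/l1 files for 1460.xx through 1469.xx

text = open(STATE_COPY).read()
have = open(SAVED).read() if os.path.exists(SAVED) else ""

blocks = re.findall(r"<file_info>.*?</file_info>\n", text, flags=re.S)
with open(SAVED, "a") as out:
    for block in blocks:
        name = re.search(r"<name>(.*?)</name>", block).group(1)
        if FREQ_PREFIX in name and name not in have:
            # strip any delete marker line before saving; the tag name here
            # is an assumption - check what your own client actually writes
            block = re.sub(r"\s*<marked_for_delete/>\n", "\n", block)
            out.write(block)
[/pre]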

CAUTION: When you edit your state file make sure you use a simple text editor that doesn't insert any formatting or other special characters of its own. You want a simple, text only, text editor. I use Notepad2 for Windows, Kwrite for Linux and TextWrangler for Mac OS X. It can be quite a benefit if your editor of choice can handle Unix-style regular expressions. The above three certainly can. This can save heaps of time later when you want to do a global removal of all delete tags in a state file.

Depending on how frequently you harvest the <file_info> blocks, you may find that some already have a delete marker line inside the block. If so, simply delete that line in its entirety from your saved copy. When finished with the harvest, just throw away the state file copy as you'll get a fresh copy next time you want to repeat the exercise. At the moment I'm doing this for 3 separate frequency sets as I intend to set up 3 quads to participate in the cleanup of resends. The frequencies I'm saving (or have saved) are 143X.xxHz, 146X.xx and 149X.xx. In each case I have hundreds of LIGO data files saved already. For a variety of reasons I believe that the highest frequencies will be the most productive so if you intend to join in and have the opportunity to select a high frequency, choose something around 1450.xx or higher. It's good to start near the bottom of a 10Hz band (eg 1470.xx) because then you would need only a single skygrid file (skygrid_1480Hz_S5GC1.dat) for all LIGO data up to essentially 1479.95Hz. I've been collecting for a while and I haven't (yet) got a full 10Hz range for any of the three ranges I'm saving.

Here is an example of a file_info block for one of the 4 files that belong to the 1460.75Hz frequency band, exactly as extracted from the state file.
[pre]
<file_info>
    <name>h1_1460.75_S5R4</name>
    <nbytes>4262400.000000</nbytes>
    <max_nbytes>0.000000</max_nbytes>
    <md5_cksum>908b28b4fe97e636aae88a4e9bbe69bf</md5_cksum>
    <status>1</status>
    <sticky/>
    <url>http://einstein.ligo.caltech.edu/download/fb/h1_1460.75_S5R4</url>
    <url>http://einstein-dl4.phys.uwm.edu/download/fb/h1_1460.75_S5R4</url>
    <url>http://einstein-dl3.phys.uwm.edu/download/fb/h1_1460.75_S5R4</url>
    <url>http://einstein-dl2.phys.uwm.edu/download/fb/h1_1460.75_S5R4</url>
    <url>http://einstein-mirror.aei.uni-hannover.de/EatH/download/fb/h1_1460.75_S5R4</url>
</file_info>
[/pre]

So once you have selected the frequency set you intend to save, just keep saving appropriate data files and <file_info> blocks for as long as possible. You shouldn't be particularly worried if files are getting marked for deletion with increasing regularity as long as the new files that replace them are just extending the coverage of your selected frequency set. The new run will commence shortly and at some stage after that the remaining primary tasks for your particular frequency set will be depleted to the point where the scheduler may decide to shift you to a different frequency or even to the new run. If your intention is to help with the cleanup, you don't really want this to happen. You can lessen the likelihood of being shifted to a different frequency by taking steps to offer the scheduler the full choice of the frequency set you have been saving. Here's how you would do that if your cache of work consists entirely of tasks for your desired frequency set.

  * Make sure your einstein project directory contains a copy of every single LIGO file you have been saving. This could be quite a large number.
  * Set NNT for einstein and then stop BOINC completely (and confirm it has stopped).
  * Open your state file for editing with your chosen text editor. Find the start of the einstein project and scan down a bit until you find the <file_info> blocks for the sun, earth and skygrid files. Immediately after these you will find the <file_info> blocks for the most recent LIGO data files your host is processing. It's quite likely that many of the earlier ones that you previously saved will have been removed over the intervening period. Spend a bit of time sussing things out as you need to get this right and you want to be able to do it once and not have to repeat it. The aim is to make sure there is a block for as many as possible of the LIGO files you have saved.
  * You want to make sure there are no delete marker lines in any of the <file_info> blocks. There are bound to be some already in the file when you start the exercise, so just remove each complete line. Depending on your editing skills, this can be quite trivial if you understand regular expressions.
  * It doesn't matter much if you are missing a few <file_info> blocks as long as you have a good selection which covers the lowest frequency LIGO files in your selected frequency set and you have all the LIGO files themselves. When you send a request for work and receive a task that depends on a missing block, BOINC will think it needs to download a fresh copy of that single LIGO file. It will then discover that you already have the file so you will see a "file exists - skipping download" message and the <file_info> block will have been added to the state file.
  * It also doesn't matter if you are missing the odd LIGO file. They will be downloaded when needed. You don't want to be missing too many as the scheduler may fail to send you a resend task for which you are otherwise eligible. So the best policy is to take care to save all blocks and all data files if possible.
  * Once you complete the editing you can save the state file and restart BOINC (with your fingers crossed hoping that you haven't made any typos). It's entirely up to you as to how you protect yourself against editing mistakes. A small checking sketch is given after this list.
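
As a sanity check after an edit like that, a rough sketch along the following lines (Python again, purely illustrative; the project directory path and the delete marker tag name are assumptions, and it expects to be run from the BOINC data directory) compares the LIGO files sitting in the project directory against the <file_info> blocks in the state file:

[pre]
# Illustrative sanity check: every LIGO data file in the project directory
# should have a <file_info> block, and none of those blocks should still
# carry a delete marker.  Run from the BOINC data directory, BOINC stopped.
import os, re

project_dir = "projects/einstein.phys.uwm.edu"   # adjust to your setup
state = open("client_state.xml").read()

ligo_files = [f for f in os.listdir(project_dir)
              if re.match(r"[hl]1_\d{4}\.\d{2}_S5R[47]$", f)]

for f in ligo_files:
    m = re.search(r"<file_info>\s*<name>" + re.escape(f) + r"</name>.*?</file_info>",
                  state, flags=re.S)
    if not m:
        print("no <file_info> block for", f)
    elif "marked_for_delete" in m.group(0):     # tag name assumed
        print("delete marker still present for", f)
[/pre]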

If BOINC restarts cleanly and the tasks that were running just before you shut down for the edit all restart cleanly then, congratulations, you didn't make any mistakes :-). The first time you do this it's quite stressful. You become quite blasé about it after a while. Just don't become complacent :-).

You now have a host which can announce to the scheduler that it has a very wide range of available LIGO data. The scheduler will love you because it's bound to be able to find plenty of resends that it can dump on you without having to switch you to different frequencies and arrange for huge new data downloads. If your earliest saved LIGO data is from a couple of weeks before the time that you actually start this up, there are almost certainly going to be "deadline miss" tasks that the scheduler has been struggling to get rid of. You should expect to get plenty of those. There will be some frequency bands (remember my definition of a band is the 4 LIGO files of the same precise frequency - like 1460.75 for example) where the scheduler doesn't currently have a resend task available. For each of these (and there may be quite a few as you gather resend tasks over the range) a delete marker will be added to the state file. There probably won't be too many actual deletes because a LIGO file will not actually be deleted until there are absolutely no tasks that depend on it. Since a single resend task depends on 48 LIGO files, it's quite likely that a file will actually be required to be kept for some other task(s) anyway, particularly in the early stages of the cleanup when resend tasks are relatively abundant.

The ongoing operation and maintenance of your "resend vacuum cleaner" is pretty straightforward. From previous experience, this is what I would recommend.

  * Plan to maintain a cache of at least 6 days or more so that you can allow the host to run with NNT for a couple of days without the risk of running out of work.
  * Always run with NNT so you have control over when you ask for work. The best strategy is to leave a day or two gap between "harvests" so that the supply of resends will build up in between requests.
  * When you ask for work, do it in small blocks rather than single tasks but be a little circumspect about how big a block to ask for. You don't want to ask for so many tasks that you completely exhaust the scheduler's ability to supply. If you exceed that limit, the scheduler will switch you to a different frequency or even to the new run. Once you know what you are doing you can quite easily recover from this but it's best not to risk the problem in the first place. When I last did this I was able to get plenty of resends to keep a quad fully fed for much more than a month after the old run had nominally finished. I found it very helpful not to try to grab too many tasks in one hit. It's pretty easy to see when the scheduler is getting a bit short of tasks, so you don't push and just wait another day before trying again. During the wait interval the supply of resends always seems to improve, sometimes dramatically so.
  * Every time you go through a work fetch cycle, you will probably get a number of "delete file" messages. You need to do two things: replace any LIGO files that actually get deleted, and remove the delete markers that get added to the state file so that those files continue to be reported to the scheduler. A rough sketch of both chores is given after this list.
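
Both chores together amount to something like the following rough sketch (Python; the share path, project directory and delete marker tag name are placeholder assumptions, and BOINC must be stopped while the state file is touched):

[pre]
# Rough sketch of the housekeeping after each work fetch:
# 1) copy back any saved LIGO files that the client actually deleted, and
# 2) strip any delete markers so the files keep being reported as present.
# Run from the BOINC data directory, with BOINC stopped.
import os, re, shutil

SHARE = "/mnt/share/ligo_1460"                   # saved LIGO data files
PROJECT_DIR = "projects/einstein.phys.uwm.edu"   # adjust to your setup
STATE = "client_state.xml"

for f in os.listdir(SHARE):
    if not os.path.exists(os.path.join(PROJECT_DIR, f)):
        shutil.copy(os.path.join(SHARE, f), PROJECT_DIR)
        print("restored", f)

text = open(STATE).read()
cleaned = re.sub(r"^[ \t]*<marked_for_delete/>\n", "", text, flags=re.M)   # tag name assumed
if cleaned != text:
    open(STATE, "w").write(cleaned)
    print("removed delete markers from", STATE)
[/pre]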

Once again I've run short of time and there's probably a lot of polish that needs to be applied to what has already been written, without even thinking of more points that I should add. It's actually quite hard to properly set out understandable instructions for something that seems quite simple when I just do it. It's probably best to see what comments or queries arise and then fix things as needed to handle the queries as (or perhaps if) they come in :-). So please feel free to fire away.

Cheers,
Gary.

Oliver Behnke
Moderator
Administrator
Joined: 4 Sep 07
Posts: 982
Credit: 25,170,813
RAC: 2

RE: RE: FYI, I'm

Quote:

Quote:
FYI, I'm analyzing the scheduler code for related reasons right now (optimizing locality scheduling, also WRT client file deletion). Your post comes just at the right time...

I hope you find the comments useful.

They are indeed! Although I haven't yet fully read/digested all your (related) postings, let me quickly give you some details about what I'm working on:

Over a year ago we introduced some major optimizations to the BOINC database schema that make available detailed information about the files used by a given workunit/result in a way that allows for real-time analysis of that data in the scheduler process. In addition to this we also moved the state information of the locality scheduler into the project database. These two major changes now allow us to do the following: a) notify the client of files that are definitely not needed anymore and b) take the list of (sticky) files reported by the client during a work request and select those unsent results that lead to the least required download volume (based on that very list).

Both things should address what I figured from your elaborations so far. I'm happy to discuss the details as soon as they've matured enough. In the meantime I'll continue sifting through your posts... :-)

Best,
Oliver

PS: first efficient data selection algorithms for a) and b) are almost done but the logic still needs to be integrated into the current scheduler implementation.

Einstein@Home Project

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5,870
Credit: 116,037,552,756
RAC: 35,691,054

RE: ... let me quickly give

Quote:
... let me quickly give you some details about what I'm working on:


Thanks very much for taking the time to do this - It's very much appreciated.

I've gone through what I wrote last year (particularly this message and this message) and if the problem described there gets addressed, I'll be very pleased. I was keen to ensure that what I had written was accurate and understandable and I think it was. Please let me know if that's not the case.

I'm very interested in what

Quote:
a) notify the client of files that are definitely not needed anymore


actually means. A file not being needed anymore could possibly be that point in time when ALL (eg right down to __0) primary tasks for the single frequency band represented by the LIGO data filename have been issued. Alternatively, it could be that point in time when all primary tasks for all the different frequency bands that need that file have been issued. I'm hoping it's not one of those since LIGO files are still needed until all secondary tasks (what I call resends) have been issued and satisfactorily returned. I would be very happy if LIGO files do really stick around until all the resends are done although I realise that this might have unacceptable consequences for the disk space needed on the average volunteer host. Quite possibly it could be something entirely different so it would be really good to have the meaning clarified.

Quote:
... and b) take the list of (sticky) files reported by the client during a work request and select those unsent results that lead to the least required download volume (based on that very list).


I'm sure I understand this one :-).

Quote:
Both things should address what I figured from your elaborations so far. I'm happy to discuss the details as soon as they've matured enough. In the meantime I'll continue sifting through your posts... :-)


Thanks for your gentle choice of words concerning my verbosity. I'm sure others may not be so kind :-). I find it hard to be succinct when I'm passionate about something.

There is another point I need to make about scheduler behaviour so sorry in advance for the length this will probably grow to.

Imagine a client has LIGO data for just a small number of frequency bands - let's say 1430.20Hz to 1430.40Hz - that's just 5 bands. It has <file_info> blocks for these bands only. A request for a single task will most likely result in a 1430.20Hz task being issued (assume availability is OK) and all the extra LIGO files from 1430.45Hz and above (it would actually be 28 files covering 1430.45Hz to 1430.75Hz) will be downloaded and all the <file_info> blocks for these will be added to the state file. No problem with any of this.

For behaviour comparison purposes, assume the work request was much larger (say 20 tasks - a full day was added to the cache) with everything else the same. I do this quite a lot. From my experience, around 2 tasks per frequency band will be issued for the 5 available frequency bands - for simplicity let's assume it is exactly 2, so that 10 of the 20 requested will be for frequencies from 1430.20Hz to 1430.40Hz. It's what happens for the further 10 tasks needed to fill the request that's the interesting bit. Before explaining that I should also mention that there will be quite a few more <file_info> blocks added to the state file this time - 44 blocks covering the range from 1430.45Hz to 1430.95Hz.

The remaining 10 tasks will be supplied after jumping to a completely different frequency. Even though 48 LIGO files will need to be downloaded for a single frequency band, the scheduler will not take advantage of that and will not choose more tasks from immediately adjacent frequency bands. Quite often, the new frequency band contains a single task (the scheduler loves to use the opportunity to get rid of a single resend). I can remember one example where the extra 10 tasks were going to be supplied from 7 completely new frequency sets - around 350 new LIGO files in total. It's easy enough to recover from this - suspend network activity, set NNT, abort the 10 downloading tasks, abort around 350 skygrid and LIGO data downloads, resume network activity, and finally report the 10 aborted tasks. The only problem is recovering from the strain of clicking OK to "abort the file transfer" 350 times :-).
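
Incidentally, the 350 clicks can be avoided: a rough sketch along the following lines (Python around boinccmd; the RPC password is a placeholder and the parsing of the --get_file_transfers output is approximate, so check what boinccmd prints on your own host) will abort every pending download in one go:

[pre]
# Rough sketch: abort every pending Einstein download instead of clicking
# "abort" hundreds of times.  Output parsing is approximate; the RPC
# password is a placeholder.
import subprocess

PASSWD = "rpc_password"
PROJECT = "http://einstein.phys.uwm.edu/"

out = subprocess.check_output(
    ["boinccmd", "--passwd", PASSWD, "--get_file_transfers"]).decode()

for line in out.splitlines():
    line = line.strip()
    if line.startswith("name:"):
        fname = line.split("name:", 1)[1].strip()
        subprocess.run(["boinccmd", "--passwd", PASSWD,
                        "--file_transfer", PROJECT, fname, "abort"],
                       check=True)
        print("aborted", fname)
[/pre]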

So I have a particular feature request for the new scheduler - which, from what you've already mentioned, may already be part of the deal. When the request for 20 tasks was being handled and the scheduler had allocated the 10 matching tasks for the original frequency set, could it also be made smart enough to realise that the most "efficient" way to supply the extra 10 tasks would be to simply continue on with higher frequency bands in the existing series rather than jumping to multiple new frequency sets?

Thanks for your patience in dealing with all the detail.

Cheers,
Gary.

Oliver Behnke
Moderator
Administrator
Joined: 4 Sep 07
Posts: 982
Credit: 25,170,813
RAC: 2

RE: Thanks very much for

Quote:

Thanks very much for taking the time to do this - It's very much appreciated.


You're welcome!

Quote:

I was keen to ensure that what I had written was accurate and understandable and I think it was. Please let me know if that's not the case.


I've read everything by now and I don't think there's anything unclear.

Quote:

I'm very interested in what
Quote:
a) notify the client of files that are definitely not needed anymore

actually means.


The main issue seems to be that the scheduler notifies the client to delete a file too early. The basic idea right now is to do it the other way round: when a client reports "his" list of files, the scheduler checks whether a) the file is associated with any workunit (this includes all tasks, until the workunit itself is deleted by the db purger) or b) the WUG could still produce work for this file. Only if neither condition is met will the scheduler issue a delete request for this file on the client.

Doing it like that doesn't require the client to be changed. There's no need to re-report files or to "unflag" them if that flag gets set correctly in the first place. The proposed algorithm is as conservative as possible.

The only thing we need to make sure of is that the client can effectively notify the scheduler when its disk space is running low, but that's already implemented. However, we're also going to improve that algorithm, because the choice of which file is selected for deletion in that case is purely random - a more sophisticated way to handle this could be to choose the file (to be deleted) based on a "least value" metric, i.e. the number of remaining unsent tasks depending on it.
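
In pseudocode, the shape of that rule (just the shape, not the final scheduler implementation - the helper functions here are invented names for the database checks involved) would be something like:

[pre]
# Pseudocode sketch of the conservative deletion rule described above.
def client_may_delete(reported_file):
    if referenced_by_any_workunit(reported_file):    # until the db purger removes the WU
        return False
    if wug_can_still_generate_work_for(reported_file):
        return False
    return True                                      # only now send the delete request

# When a client reports low disk space, prefer to drop the "least valuable"
# file, e.g. the one with the fewest remaining unsent tasks depending on it.
def pick_file_to_delete(reported_files):
    return min(reported_files, key=remaining_unsent_tasks)
[/pre]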

Quote:

Thanks for your gentle choice of words concerning my verbosity. I'm sure others may not be so kind :-). I find it hard to be succint when I'm passionate about something.


Trust me, I know ;-)
FWIW: "matured enough" referred to our ideas/work, not your elaborations.

Quote:

So I have a particular feature request for the new scheduler - which, from what you've already mentioned, may already be part of the deal. When the request for 20 tasks was being handled and the scheduler had allocated the 10 matching tasks for the original frequency set, could it also be made smart enough to realise that the most "efficient" way to supply the extra 10 tasks would be to simply continue on with higher frequency bands in the existing series rather then jumping to multiple new frequency sets?


As you already pointed out, this should partially be solved by the revised scheme to select tasks for a given client request. As it uses a "least download volume required" strategy it should already mitigate/fix most of the problem. Since the client will be reporting all relevant files due to the corrected file deletion scheme (no premature deletions), this algorithm should be nearly optimal. However, what I described so far only processes those unsent tasks already in the database, meaning already produced by the WUG. We should definitely have a look at the way a) the scheduler asks the WUG for new work and b) how the WUG processes those requests with regard to current app/run setups - there could be room for improvements.

Cheers,
Oliver

Einstein@Home Project

Christoph
Joined: 25 Aug 05
Posts: 41
Credit: 5,954,206
RAC: 0

So, what is a rough schedule

So, what is a rough schedule to get the scheduler updated with the remaining necessary changes?

The questions are all there.
Most of the answers, as I read it, are there too.
Some of the answers have already been coded and need 'only' to be activated.
The remaining answers are work in progress.

Is this summary about correct?

Christoph

Greetings, Christoph

Oliver Behnke
Moderator
Administrator
Joined: 4 Sep 07
Posts: 982
Credit: 25,170,813
RAC: 2

Yes it is. The actual

Yes it is. The actual integration into the scheduler shouldn't take more than a few days. However, we first need to analyze the current code in more detail to assess potential side-effects. After all that's done there'll be a testing phase, so final deployment could be in a few weeks. Hopefully in time, before the next run starts...

Stay tuned,
Oliver

Einstein@Home Project
