The "cleanup" for the S5GC1HF run

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5,853
Credit: 111,197,257,512
RAC: 34,871,959

I now have 7 quad VCs

I now have 7 quad VCs running. The supply of resends for the frequencies I'm targeting is a bit variable at times but there has been quite a spurt very recently which is why I added the 3 extra quads.

The original 4 hosts are still working in the 1430 - 1440Hz range. Because tasks right at the top of that range require LIGO data files of somewhat higher frequency (up to 1440.55Hz), these hosts have been progressively getting resends further above 1440Hz and now well into the 1440 - 1450Hz range. So over the last few days, I set up 3 new VCs specifically for the 1440 - 1450Hz range. I had no great supply of saved data but, with plenty of resends, these 3 hosts have worked their way up to 1445Hz and now will have no trouble extending the range even further.

The main reason they got established so quickly (they started with around 2 days of work each but now have close to full 10 day caches) is that there has been a very significant increase in the number of resends available. As an example, take a look at the latest page of the work cache for one of the new VCs. This will change over time as more new resends are added, but right now you can go back from here to the 12th page to find the tasks sent on April 20 that were being returned on April 29. These were primary tasks from the 1430 - 1440Hz range. No new tasks were added between April 20 and April 28. NNT was set and the 10 day work cache was being allowed to drain prior to a planned frequency shift to the 1480 - 1490Hz range. Instead of doing that shift, on April 28 I added some LIGO data for just above 1440Hz and allowed the host to look for resends. You can clearly see that the tasks acquired from April 28 onwards were resends (along with some low sequence number primary tasks) between 1439.75 and 1441.95Hz.

This host has been filling up its cache towards 10 days worth of tasks between April 28 and May 2. 233 tasks were obtained over this ~4 day period, of which 200 were in the target range of 1439.70 - 1444.00Hz. Whilst most of these were resends, there were some primaries from particular bands that hadn't been fully depleted at the time. There aren't any further primary tasks now - the most recent tasks acquired were all resends.

As this is just one of the 3 hosts of mine that are now working this range, it's rather staggering to realise just how large the total number of resends must be.

Cheers,
Gary.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5,853
Credit: 111,197,257,512
RAC: 34,871,959

It's now a full month since I

It's now a full month since I started my first quad VC in the 1430 - 1440Hz range. That one was soon followed by three more. It's only in the last day or so that the resends for this range have really dried up. I refreshed the full data range on one host a few hours ago and it wasn't able to get a single resend. The previous attempt some 12 hours before that got very few resends, as did the attempt prior to that as well. I intend to move these four hosts to more productive ranges for the time being.

Over the period, those four VCs have cleaned up around 2000 resends. I wasn't expecting such a sudden "turning off of the resend tap" but now that I've seen it, I'm sure I understand why. It all seems rather obvious.

In normal operations, there is always a significant fraction of the tasks issued that either errors out or is never returned. This is what perpetuates the flow of resends for months after a run has terminated. This time, the run is not yet finished but the resends for a particular frequency range essentially are finished (or so it would seem). A key difference is that, for my four hosts, there has been a zero failure rate in returning tasks for almost a month now.
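The way a failure rate perpetuates resends can be put as a simple geometric series: each failed task spawns a resend, which can itself fail and spawn another. A back-of-the-envelope sketch (the failure rates and task count below are illustrative guesses, not measured project figures):

```python
# Toy model: if a fraction f of all issued tasks errors out or is never
# returned, each failure spawns a resend that can itself fail, giving
# f + f^2 + f^3 + ... = f/(1-f) resends per primary task on average.

def expected_resends_per_primary(f: float) -> float:
    """Expected resends per primary task, where f is the failure fraction."""
    return f / (1.0 - f)

primaries = 100_000  # hypothetical number of primary tasks in a range
for f in (0.05, 0.10, 0.20):
    total = primaries * expected_resends_per_primary(f)
    print(f"failure rate {f:.0%}: ~{total:,.0f} resends over the run's lifetime")
```

Even a modest failure rate keeps a steady trickle of resends flowing long after the primaries are gone, which matches the months-long cleanup tails seen after previous runs.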

It's quite likely that my VCs only secured a fraction of the available resends over the last month. You would assume that there will still be failures amongst other hosts that shared in this range. So my intention is to keep all the 1430 - 1440Hz data until after the transition to the new run and then to try that range again with just one host, perhaps about a week after the new run kicks in. I would expect there to be more resends by then with little competition for them.

It would be really interesting to see the numbers for incomplete quorums for the 1430 - 1440Hz frequency range as compared to, say, the 1420 - 1430Hz range. This latter range was crunched a little before the former one, so, all other things being equal, it should have fewer incomplete quorums - but I would expect it to actually have more. If there are any Devs reading this who are proficient in constructing SQL queries, perhaps it might be possible to do the comparison? I'd really like to know if what I've been doing for the last month is having any measurable impact.
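A sketch of the comparison being asked for, using an in-memory SQLite stand-in. The real BOINC server schema is MySQL (its `workunit` table does have a `canonical_resultid` column, zero while a quorum is incomplete), but the workunit name format and sample rows here are invented purely for illustration:

```python
# Hypothetical incomplete-quorum comparison between two frequency bands.
# Assumption: the frequency can be parsed out of the workunit name
# (here a made-up "h1_<freq>_S5R7__<n>" format).
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE workunit (name TEXT, canonical_resultid INTEGER)")
# canonical_resultid = 0 means the quorum is still incomplete
sample = [
    ("h1_1422.35_S5R7__123", 1), ("h1_1425.10_S5R7__124", 0),
    ("h1_1431.60_S5R7__125", 2), ("h1_1436.90_S5R7__126", 3),
]
con.executemany("INSERT INTO workunit VALUES (?, ?)", sample)

# The frequency occupies 7 characters after the "h1_" prefix.
query = """
SELECT CASE WHEN CAST(substr(name, 4, 7) AS REAL) < 1430.0
            THEN '1420-1430Hz' ELSE '1430-1440Hz' END AS band,
       SUM(canonical_resultid = 0) AS incomplete,
       COUNT(*) AS total
FROM workunit GROUP BY band
"""
for band, incomplete, total in con.execute(query):
    print(f"{band}: {incomplete}/{total} quorums incomplete")
```

On the real database the same GROUP BY over the two bands would show whether the cleanup has measurably reduced the 1430 - 1440Hz incomplete count.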

The other range I'm working with 3 hosts (1440 - 1447.50Hz) continues to supply a good number of resends. As I mentioned in an earlier post, I wasn't planning on working this range (I started with very little saved data) but the range keeps getting extended every day as the new, higher frequency resends cause extra LIGO downloads. I reckon I'll soon be into the 1450+ range at the current rate of progress.

Cheers,
Gary.

samuel7
Joined: 16 Feb 05
Posts: 34
Credit: 1,579,363
RAC: 0

I got so intrigued by this

I got so intrigued by this thread that I decided to do some "cleanup" myself. I only have nine cores (+4 HT, not currently used) available, but that's plenty for me to manage.

I've gathered the 1455.00 - 1458.95 Hz range of LIGO files and corresponding segments. The 1459.xx range is missing a few files. I've also saved some files from higher frequencies for later use.

From what I've seen, there are still quite a number of primary tasks available for the 1455-1460 range. For some frequencies primary tasks have already been exhausted but for others the sequence number is at around 200.

I'm planning to establish the full 1455-1460 range on the hosts on Monday. Then try to build up the caches to several days and wait for the primaries to run out if they haven't already done so.

Gary, many thanks for bringing these methods to everyone's attention and for explaining things so clearly. This sort of DC that requires manual work is the most fun for me.

Sami.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5,853
Credit: 111,197,257,512
RAC: 34,871,959

RE: I got so intrigued by

Quote:
I got so intrigued by this thread that I decided to do some "cleanup" myself....


Hi Sami, thanks for the feedback. It's good to know that you found it understandable. Now that you've joined, I hope you enjoy the ride.

One of the things I don't recall commenting on (I wrote a lot of words so maybe it's in there somewhere) is what appears to be different this time compared to previous "cleanup" behaviour. Previously, the scheduler used to wait quite a while for a host with the "correct" data to come along before giving up and issuing the resend task at random. I recall often seeing resend tasks that had remained unsent for several days while the scheduler waited patiently. There appeared to be a multi-day timeout period before the scheduler decided that no host was likely to come along with the correct data. This time, the scheduler seems hell-bent on getting rid of them very quickly, correct data or not.

Based on the assumption that the scheduler would behave pretty much as it had previously, I've been waiting a minimum of 6 - 8 hours, and often much longer, before trying to harvest more resends. Over time I've noticed that this doesn't seem to work as expected. So today I tried an experiment to more closely define the scheduler's behaviour with regard to whether there is any sort of "hold" period before disposal.

In my previous message, I mentioned I was going to shift the 4 quads working in the 1430 - 1440Hz range to something different. I wasn't intending to let them simply drift into the 1440 - 1450Hz range because of the other 3 quads that were already there - my assumption was that there wouldn't be enough work to keep 7 quads properly fed. Despite my misgivings, I decided to encourage that to happen and to make each host do a harvest about every 2 - 3 hours. I started this over 12 hours ago by simply adding the LIGO data and blocks for the 1440 - 1450Hz range to the 4 hosts in question, allowing a host to harvest what it could and then moving on to the next host to repeat the process. It takes about 30 mins per host, so after roughly 3 hours I was back at the first host, ready to rinse and repeat.

I was quite amazed at the result. Each host was able to get a good supply of resends each time, even though the previous host had just cleaned up what was available. Two things are pretty clear. Firstly, lots of resends are being generated fairly continuously. Secondly, the scheduler gets rid of them fairly quickly. Trying to let them accumulate over a period before harvesting turns out to be a dead loss.
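Why frequent polling wins can be seen in a toy model: if the scheduler hands unclaimed resends to random hosts after a short hold, a harvester only ever captures what is still sitting in that window when it polls. All the numbers below (generation rate, hold time) are illustrative guesses, not measured scheduler parameters:

```python
# Toy model of harvesting under fast scheduler disposal. Resends appear
# at a steady rate and are given away to random hosts once they've sat
# unclaimed for `hold_hours`; a poll captures only those still waiting.

def harvested(rate_per_hour: float, hold_hours: float,
              poll_every_hours: float, total_hours: float) -> float:
    """Resends captured over total_hours when polling at a fixed interval."""
    per_poll = rate_per_hour * min(hold_hours, poll_every_hours)
    polls = total_hours / poll_every_hours
    return per_poll * polls

day = 24.0
fast_disposal = 1.0  # assumed: scheduler holds unclaimed resends ~1 hour
print("poll every 3h :", harvested(10, fast_disposal, 3, day))   # 80.0
print("poll every 12h:", harvested(10, fast_disposal, 12, day))  # 20.0
```

With a short hold, total capture scales with how often you poll, whereas with the old multi-day hold the polling interval barely mattered - which matches both the old "wait and harvest at leisure" experience and the new behaviour.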

Over the period of this experiment, I harvested 219 resends in the 1440 - 1450Hz range and managed to extend the range from just over 1448Hz to around 1450Hz by scoring resends at the top of the range, which required extra LIGO data downloads and thereby extended the range a bit each time.

Quote:
I've gathered the 1455.00 - 1458.95 Hz range of LIGO files and corresponding segments. The 1459.xx range is missing a few files. I've also saved some files from higher frequencies for later use.


In the next day or so, I imagine I'll get well into the lower end of that range, so if you would like the files for 1450 - 1455Hz I could easily send them to you. My experience has been that the best time to start harvesting resends is a day or two after all primary tasks in that range have been distributed. I would expect that there won't be any primary tasks left now around 1450 - 1452Hz.

Quote:
From what I've seen, there are still quite a number of primary tasks available for the 1455-1460 range. For some frequencies primary tasks have already been exhausted but for others the sequence number is at around 200.


They'll pretty much be gone in a day or two.

Quote:
I'm planning to establish the full 1455-1460 range on the hosts on Monday. Then try to build up the caches to several days and wait for the primaries to run out if they haven't already done so.


Sounds about right. Should work quite nicely.

Quote:
Gary, many thanks for bringing these methods to everyone's attention and for explaining things so clearly. This sort of DC that requires manual work is the most fun for me.


You're most welcome. I quite enjoy figuring out what's happening and then working out strategies to optimise performance. The best way to sharpen your understanding of something is to try to explain it to someone else :-).

Cheers,
Gary.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5,853
Credit: 111,197,257,512
RAC: 34,871,959

A few things have happened

A few things have happened since I last posted.

Firstly, all the remaining GC1HF tasks have been generated and inserted into the database. One consequence of this is that a host topping up its cache has complete access to all the remaining tasks for a particular frequency bin - there will be no shift in frequency until all the sequence numbers, right down to _0, have been allocated. At the time all remaining tasks were inserted (about 2.5 days ago), I remember taking a look at the server status page and noticing that about 370K tasks were left to send. Now that number is around 200K, which means around 70K tasks per day are being sent out. At the current rate, these primary tasks will all be gone sometime on Sunday.
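The depletion arithmetic above can be checked directly, using the rough server status readings quoted in the post:

```python
# Projecting how long the remaining old-run primaries will last, from
# two server status readings ~2.5 days apart (figures from the post).

observed_drop = 370_000 - 200_000   # tasks sent over the observation window
window_days = 2.5
rate = observed_drop / window_days  # 68,000/day, i.e. "around 70K"
days_left = 200_000 / rate          # ~2.9 days, hence "sometime on Sunday"

print(f"send-out rate : {rate:,.0f} tasks/day")
print(f"days remaining: {days_left:.1f}")
```

This assumes the send-out rate stays constant; in practice it will taper off as fewer frequency bins still have primaries, so "Sunday" is the optimistic end of the estimate.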

Secondly, a very small number of test tasks for the new "Bucket" run were generated and inserted into the database. It took a little while before any were sent out. When they were released, they disappeared quite quickly (as you would expect) and now the first of them have been returned and validated. As I write this there are 11 valid quorums (22 valid tasks) and zero invalid tasks. Looks good so far. There are no further tasks for the new run yet but it wouldn't surprise me to see some more (perhaps a somewhat larger number) generated soon. If everything continues to be OK, the floodgates should be opened soon after that.

This is really the event I've been preparing for. At the moment, essentially all of the active hosts are fighting over a rapidly diminishing pool of frequencies which still have old run tasks available. There is a constant stream of hosts exhausting the remaining tasks at a particular frequency and so needing a shift to a new frequency that still has some left. Quite often, the scheduler will use this opportunity to distribute resend tasks as part of the shift. The severe competition from these hosts is limiting the opportunity for any host deliberately trying to get resends to actually do so.

As soon as the floodgates are opened, these competing hosts should be rapidly transitioned to the new run, and once there they will cease to compete for the remaining resends. My impression is that this transition will be quite quick and that within a few days there will be a much reduced number of hosts left on the old run. Most of those will be confined to specific frequencies that still have old run primary tasks. There should be little competition for resends in the growing number of frequencies where no primary tasks remain.

What happens next depends on whether or not the scheduler behaviour has been modified to distribute resends more urgently this time, compared with previous transitions. Assuming it hasn't changed in this regard, I would expect that resends will steadily accumulate and can be harvested at a much more leisurely pace. However, given the recent behaviour where resends were pushed out the door very quickly, I suspect that this may continue, so it may be difficult to keep the work cache filled with resends only. It will be interesting to see exactly what happens.

At the moment, it has become more difficult to get resends for the frequencies I'm trying to harvest. The 1430 - 1440Hz range continues to supply none. The 1440 - 1452Hz range has diminished considerably and the few that are available make it hardly worth the effort of replenishing the LIGO data files and the blocks in the state file prior to each harvesting run. All my machines, whether trying to harvest resends or not, are set up with data and state file blocks for frequencies between 1481Hz and 1500Hz, and none as yet have had any difficulty getting primary tasks. The seven hosts harvesting resends are supplementing any shortfall with primary tasks. I'm simply going to wait until a day or two after the floodgates for the new run are opened and then set them up again for all the different ranges for which I have data and blocks available and for which the primary tasks are exhausted.

Cheers,
Gary.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5,853
Credit: 111,197,257,512
RAC: 34,871,959

In the last 24 hours, the

In the last 24 hours, the supply of new run tasks has increased dramatically. 8K tasks have already been distributed and there are a further 9K ready to send. In the same period about 50K old run tasks have been sent out along with the 8K new run tasks. So, as expected, a significant number of hosts have already made the transition. I'm surprised that there haven't been any posts from people who have seen the first of the new tasks.

I've been busy so haven't spent much time looking at my hosts. I had a quick look at a few this morning and didn't see any that were having problems still getting old run tasks. Now that the new run is well under way, I'll be looking to return to harvesting resends probably sometime on Saturday or Sunday. By that stage a lot more hosts should be over on the new run.

Cheers,
Gary.

samuel7
Joined: 16 Feb 05
Posts: 34
Credit: 1,579,363
RAC: 0

An update on my efforts on

An update on my efforts on the cleanup:

I tried to harvest resends for the 1455-1460 range this week, but was only able to build a decent cache on the laptop (1 core). I was probably there a little early and should have waited a few days after the primary tasks ran out.

Right now the i7 is crunching primaries in the 1490's and the other quad was used to gather data from the early part of the 1470's. It'll be fully on Einstein, and downloading more tasks, once its PrimeGrid challenge tasks are completed (sometime today).

I'm away from the hosts for the weekend and will assess the situation on Sunday or Monday.

tolafoph
Joined: 14 Sep 07
Posts: 122
Credit: 74,659,937
RAC: 0

RE: In the last 24 hours,

Quote:
In the last 24 hours, the supply of new run tasks has increased dramatically. 8K tasks have already been distributed and there are a further 9K ready to send. In the same period about 50K old run tasks have been sent out along with the 8K new run tasks. So, as expected, a significant number of hosts have already made the transition. I'm surprised that there haven't been any posts from people who have seen the first of the new tasks.

Hi Gary,
one of my hosts (3688408) got S6 tasks yesterday and finished them overnight. The WUs are pending right now.
WU 1, WU 2, WU 3, WU 4

They took 27500s vs. 30000s for the S5 tasks.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5,853
Credit: 111,197,257,512
RAC: 34,871,959

@Sami - you should be able to

@Sami - you should be able to get more resends once more hosts move to the new run and the competition lessens. Certainly by Monday things should settle down. Even then there will probably be some old run primary tasks left but these will be at higher frequencies than the range you mention.

@tolafoph - thanks for the links. Interesting to see the first new run tasks.

For the latest 24 hours, the ratio of new run tasks to old run tasks distributed seems to be about 12K : 40K. For the 24 hours before that it was about 8K : 50K. There are still 110K old run primary tasks left to send. The number distributed each day will continue to fall so it could easily be another week or so before they are all gone. I was a bit surprised not to see a bigger number of new run tasks sent out. Too many hosts still able to access old run tasks for the frequency they already have on board, I guess.

Cheers,
Gary.

Henk Haneveld
Joined: 5 Feb 07
Posts: 18
Credit: 14,129,165
RAC: 240

Gary, there is a new option

Gary, there is a new option in Einstein account preferences for the new run.

On my account this was set to off by default. Users may need to turn it on to receive tasks for the new run.
