The "cleanup" for the S5GC1HF run

paul milton

Joined: 16 Sep 05

Posts: 329

Credit: 35825044

RAC: 0

RE: Richard, many thanks

17 Apr 2011 8:54:52 UTC

Message 104876 in response to message 104875

Quote:

Richard, many thanks for your comments and insights. They are really appreciated.

Quote:
... The BOINC client itself - maybe only recent versions, I'd need to look back in the version change history - will abort a task if it has not even started by the time of the deadline.

This must only happen with relatively recent versions of the client. Most of my hosts run Linux and virtually all of those run 6.2.15, and I've never seen a client instigated abort, although I have observed a few cases where machines have locked up for a sufficiently long period to have deadline misses. I've always had to abort the real deadline misses manually. There are ways to give yourself a deadline extension for tasks that haven't actually missed the deadline yet but will likely do so if allowed to continue without intervention.

Not to intrude, but I can tell you that 6.10.58 apparently also does not abort tasks past deadline. My other host has apparently not been returning valid results for a while and has been locked up for the last month! After a hard boot, BOINC started to chug along processing a WU that had expired on March 31st (it was now April 16th) and I had to manually abort the task. (No idea what caused the lockup. I'm now looking into a way to monitor that system remotely without using much overhead.)

seeing without seeing is something the blind learn to do, and seeing beyond vision can be a gift.

Richard Haselgrove

Joined: 10 Dec 05

Posts: 2143

Credit: 3041621252

RAC: 1960158

RE: RE: Richard, many

17 Apr 2011 9:36:34 UTC

Message 104877 in response to message 104876

Quote:

Quote:
Richard, many thanks for your comments and insights. They are really appreciated.

Quote:
... The BOINC client itself - maybe only recent versions, I'd need to look back in the version change history - will abort a task if it has not even started by the time of the deadline.

This must only happen with relatively recent versions of the client. Most of my hosts run Linux and virtually all of those run 6.2.15, and I've never seen a client instigated abort, although I have observed a few cases where machines have locked up for a sufficiently long period to have deadline misses. I've always had to abort the real deadline misses manually. There are ways to give yourself a deadline extension for tasks that haven't actually missed the deadline yet but will likely do so if allowed to continue without intervention.

Not to intrude, but I can tell you that 6.10.58 apparently also does not abort tasks past deadline. My other host has apparently not been returning valid results for a while and has been locked up for the last month! After a hard boot, BOINC started to chug along processing a WU that had expired on March 31st (it was now April 16th) and I had to manually abort the task. (No idea what caused the lockup. I'm now looking into a way to monitor that system remotely without using much overhead.)

That's a slightly different case. BOINC will eventually abort a task which has got stuck, but it uses a different mechanism, not one based on deadlines.

The one I was thinking of was

Quote:

- client: abort jobs that are unstarted and past deadline

which was introduced with BOINC v6.6.12, in March 2009 (trac changeset 17399).
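As a rough illustration of the rule quoted above (this is not the actual client code, which is C++), the check might look something like the Python sketch below. The `Task` fields and the function name are made up for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class Task:
    # Hypothetical fields, for illustration only.
    name: str
    deadline: datetime
    started: bool  # True once the task has begun running


def tasks_to_abort(tasks, now):
    """Return the tasks that are both unstarted and past their deadline."""
    return [t for t in tasks if not t.started and now > t.deadline]
```

Under this rule, a task that had already started before a lockup would not be picked up, however far past its deadline it was.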

@ Gary - how much of a problem are these 'locked up tasks' at Einstein, these days? I see it occasionally on my rigs, too - usually a quick suspend/resume, or a BOINC restart, gets them on their way. But on an unmonitored remote host like Paul's, they can be annoying. Josef W. Segur, a third-party developer for SETI, submitted a BOINC API library patch to overcome this just a couple of days ago [boinc_dev, "check_progress option"], but got "I don't think this is worth doing." from David. Could we persuade him otherwise, or just get Einstein to consider the patch for independent use in the next batch of apps?

Bernd Machenschalk

Moderator

Administrator

Joined: 15 Oct 04

Posts: 4349

Credit: 253629703

RAC: 35427

RE: how much of a problem

17 Apr 2011 19:30:17 UTC

Message 104878 in response to message 104877

Quote:

how much of a problem are these 'locked up tasks' at Einstein, these days?

I can see 0.05% of S5GC1HF tasks are terminated because of "Resource limit exceeded", for BRP3 (CPU) the number is even lower.

It doesn't look like tasks actually getting stuck in computation is a major problem. It might be, though, that something fails in updating the progress while the task is actually running.

I'd rather like to learn more about what such a "stuck" task is actually doing (and what not), e.g. by attaching a debugger or "strace"ing it. There is certainly no infinite loop (or wait) in the E@H application code. If there is in the BOINC API, it should be found and eliminated.
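Before reaching for a debugger or strace, a lighter first look on a Linux host is the process state the kernel reports under /proc. The Python sketch below is only that, a first look, and does not replace either tool. The function names are my own.

```python
from pathlib import Path


def parse_state(status_text):
    """Extract the 'State:' field from the text of /proc/<pid>/status."""
    for line in status_text.splitlines():
        if line.startswith("State:"):
            return line.split(":", 1)[1].strip()
    return None


def process_state(pid):
    """Read the kernel-reported state of a running process (Linux only)."""
    return parse_state(Path(f"/proc/{pid}/status").read_text())
```

A task that reports a running state while its progress does not move might point at the progress update, while one that stays asleep or in uninterruptible wait might point at a wait somewhere. Either way it only narrows down where to attach the debugger.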

Gary Roberts

Moderator

Joined: 9 Feb 05

Posts: 5888

Credit: 119642980905

RAC: 25075696

RE: @ Gary - how much of a

18 Apr 2011 2:55:56 UTC

Message 104879 in response to message 104877

Quote:

@ Gary - how much of a problem are these 'locked up tasks' ...

Firstly, I'm not sure that Paul actually meant that a task was stuck at some particular %done. I assumed (perhaps wrongly) that he was referring to a machine that had locked up rather than a specific task.

A couple of years ago I used to observe task lockup on some machines occasionally. I could always restart the task just by restarting BOINC. Of late I haven't seen any of these so it doesn't appear to be a problem.

I now see machines that (fairly infrequently - just enough to be a nuisance) lock up or crash in a variety of ways. Sometimes the machine seems partly alive because the mouse pointer can be moved or the numlock can be toggled but nothing will respond. Sometimes the machine will shut itself down and sometimes the power is still on but it's otherwise dead. Occasionally, the machine will appear to be still running because it will respond to pings and processes can be launched but then they immediately crash. In all cases, a reboot fixes whatever the ailment is.

It mainly happens in summer and I'm reasonably confident it's related to heat. As the weather has cooled down considerably, this is not happening as much now as it was in the middle of summer. It was the same experience last year. Because of last year's experience, I had resolved to shut most things down for the three hottest months this year but in the end I couldn't bring myself to actually do it :-). So I just tried to keep an eye on things and reboot when required.

Cheers,
Gary.

Gary Roberts

Moderator

Joined: 9 Feb 05

Posts: 5888

Credit: 119642980905

RAC: 25075696

I had some fun over the

18 Apr 2011 3:51:48 UTC

Message 104880

I had some fun over the weekend with my resend vacuum cleaners. I've had to deploy 4 quads now to keep up with the flow. When I deployed the third, I thought that would be enough but when I did the early morning pass first thing today (Monday), I couldn't soak them all up even after going 0.5 days over my hard cache limit on each of the three. So much for hard limits :-).

So I selected another host and deployed the LIGO data files and added the blocks to the state file and after sucking up about another 16 or so resends, eventually received a primary task, pretty much exactly when I'd reached the cache hard limit on the 4th machine.

My data set covers the frequency range from 1430.xxHz to 1439.95Hz and all frequency bands within this range up to 1439.00Hz are devoid of primary tasks. There are now very few primary tasks left up to 1439.95Hz so I'll leave the 4 vacuum cleaners going and shift all other hosts to 1490.xxHz and above frequencies. I've actually already got data up to about 1494.xxHz. I've been saving LIGO data from a couple of hosts already working in this range and there are many more tasks left here than in the 143x.xxHz frequency bands.

Over the weekend, I came across something a bit disturbing that I'd not recognised previously. It relates to a particularly nasty side effect caused by hosts that have excessively large caches. I'm going to start a new thread about it since it's rather important to explain to participants and it's not particularly to do with cleaning up resends.

Cheers,
Gary.

Christoph

Joined: 25 Aug 05

Posts: 41

Credit: 5954206

RAC: 0

Hello Gary, I decided to

18 Apr 2011 8:15:02 UTC

Message 104881 in response to message 104880

Hello Gary,

I decided to join you in your endeavour. I will keep my current files. I only re-started at the beginning of this month after a couple of weeks' break, so I only have the 1468 band currently.
Starting at 1467.95, continuously to 1468.80. Got my first delete requests, which are not yet executed.
I have an old backup of my BOINC folder; there I have some 1323 frequencies, but that is probably already too old, I guess?
If you have some lower frequency bands which you are willing to upload to supplement mine, I can give you access to some online space from my ISP.

All other projects on this host are on NNT, so after a day or two it will be dedicated to Einstein and some SETI AP beta testing for Raistmer.

Regards,

Christoph

EDIT: I still get fresh files on my frequency.

Greetings, Christoph

paul milton

Joined: 16 Sep 05

Posts: 329

Credit: 35825044

RAC: 0

RE: RE: @ Gary - how much

18 Apr 2011 9:43:20 UTC

Message 104882 in response to message 104879

Quote:

Quote:
@ Gary - how much of a problem are these 'locked up tasks' ...

Firstly, I'm not sure that Paul actually meant that a task was stuck at some particular %done. I assumed (perhaps wrongly) that he was referring to a machine that had locked up rather than a specific task.

(..big snip...)

I'm sorry, I should have checked this sooner. Yes, that's what I meant: not that the task had stalled, but that the system did exactly as you described. You could see the mouse but clicking anything did nothing. Attempting to log in via VNC was my first clue there was a problem (it didn't work, lol). I've no idea what caused it. That system is a P4 running at about 120°F at full load, so I had ruled out heat, but perhaps that warrants further investigation.

Sorry for the hijack. I had thought it pertinent with regard to BOINC not aborting the task after reboot, as it was well beyond the deadline.

seeing without seeing is something the blind learn to do, and seeing beyond vision can be a gift.

Gary Roberts

Moderator

Joined: 9 Feb 05

Posts: 5888

Credit: 119642980905

RAC: 25075696

RE: I decided to join you

18 Apr 2011 11:50:19 UTC

Message 104883 in response to message 104881

Quote:

I decided to join you in your endeavour...

Hi Christoph, thanks for your message.

Quote:

Starting at 1467.95, continuously to 1468.80. Got my first delete requests, which are not yet executed.

That should be a reasonably good starting point. I have saved files from about 1460.xxHz through to about 1463 or so and the small number of hosts I have in that range is not yet having any trouble getting primary tasks.

My impression is that the full 1200 - 1500 frequency range of this run is being depleted (right from the word go) essentially from the bottom up. I would think that whilst there will be the occasional resend for frequencies below about 1430, they are likely to be difficult to grab as they are few and far between and the scheduler seems to want to get rid of them rather quickly to the first host that comes along. Your best bet is to stay with what you already have and try to increase the data range you have saved. I'm in the process of switching a lot of hosts into the 1490.xx and above range, so tasks in that range will soon start disappearing very quickly. My guess is that in a few days' time your own current range will be a better place to be. There should already be resends from what has been sent out previously and the supply should increase as the run winds down. It's good to still have primary tasks available so that you can get them if occasionally there is a shortage of resends for the particular frequencies you have.

Having said that, if you wanted a much wider range of data, I could easily email you a file containing a wide range of blocks rather than uploading the LIGO data itself. All you would need to do is crunch and return your current tasks and then remove the old blocks and insert the new ones into your state file. As soon as you restarted BOINC it would see the new blocks and use the URLs inside them to get all the LIGO data quite automatically. Have BOINC on NNT while this is going on. Also be aware of how big a download it would be. Once you have some data you could unset NNT and BOINC would grab some tasks for that data.

Unfortunately, whilst I've saved about 5 different frequency sets, the only two that have a data range significantly larger than what you have been saving are the 1430 - 1440 range, where there are already 4 quads actively working, and the 1490 - 1494+ range, where there will be vigorous activity shortly. If I were you, I'd take steps to increase your own range. It's actually quite easy to do. Let's say you have all the LIGO data and all the blocks saved from 1467.95 to 1468.80. Just stop BOINC and edit your state file to actually insert tags into all blocks except for the four at exactly 1468.80Hz. When you restart BOINC, your current tasks will prevent the data they depend on from actually being deleted so nothing in your cache will have a problem. If you then increase your cache size to get 1 extra task, the scheduler request will only advertise one particular frequency (1468.80). You should get a task for exactly that frequency. Of course, you need 11 frequency bands above that so the scheduler should oblige and send you all data and blocks up to about 1469.35. You can then stop BOINC and remove all the tags and enjoy your suddenly expanded range. You should also check that there are no missing blocks or data files. Replace any from backup if necessary. You can repeat the process as many times as you like if you want a wider range (which you do want if you're serious about getting lots of resends) :-).
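The arithmetic in that example can be checked with a few lines of Python. This is only a sketch: the 0.05Hz spacing between bands is an assumption read off the figures quoted in this thread, and the function names are invented for the illustration.

```python
from decimal import Decimal

BAND_STEP = Decimal("0.05")  # assumed spacing between frequency bands, in Hz
EXTRA_BANDS = 11             # bands needed above the task frequency, per the post


def top_of_range(task_freq):
    """Highest band expected after requesting one task at task_freq."""
    return Decimal(task_freq) + EXTRA_BANDS * BAND_STEP


def bands(low, high):
    """List every band frequency from low to high inclusive."""
    low, high = Decimal(low), Decimal(high)
    out = []
    freq = low
    while freq <= high:
        out.append(freq)
        freq += BAND_STEP
    return out
```

With these assumptions, `top_of_range("1468.80")` comes to 1469.35, which matches the figure above, and `bands("1467.95", "1468.80")` lists the 18 bands you would expect to find when checking the saved range for gaps.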

Cheers,
Gary.

Gary Roberts

Moderator

Joined: 9 Feb 05

Posts: 5888

Credit: 119642980905

RAC: 25075696

RE: sorry for the

18 Apr 2011 11:53:21 UTC

Message 104884 in response to message 104882

Quote:

sorry for the hijack....

No problem at all. Please don't concern yourself.

Cheers,
Gary.

Christoph

Joined: 25 Aug 05

Posts: 41

Credit: 5954206

RAC: 0

Thanks for that explanation.

18 Apr 2011 13:14:16 UTC

Message 104885 in response to message 104883

Thanks for that explanation. I will try it later today when SETI and MW have completed some work. Of my 23 tasks, two are currently re-sends.

Christoph

Greetings, Christoph
