Error condition I am not understanding

Matt White
Joined: 9 Jul 19
Posts: 120
Credit: 280798376
RAC: 0

Gary Roberts wrote:

But they are a very worthy 2nd prize.  The more you think about them - and about the next stage (a black hole) if the star that went 'supernova' happened to be a bit bigger to start with - the more you must surely be impressed by these objects.

Knowing where they are, how massive they are, how fast they're spinning, what forms of radiation they are emitting, etc., must make attaining the 1st prize a much more achievable outcome.

What does it feel like to be part of a brand new scientific discipline - the observation of the universe in gravitational waves - something that was in fantasy land just a few short years ago?

Think about the many recent scientific advances, like exoplanet discoveries kicked off by the Kepler mission, the confirmation of GW through both BH-BH and NS-NS mergers, the first photograph of a supermassive BH, etc., etc.  These sorts of events (and many others) have made this a most exciting time to be alive, to be following, and to be helping with scientific progress.  Does it really matter which particular part you're helping with?  You're a winner if you're helping at all!

Well said.

Clear skies,
Matt
Bernd Machenschalk
Moderator
Administrator
Joined: 15 Oct 04
Posts: 4312
Credit: 250500662
RAC: 34625

Gary Roberts wrote:
Bernd Machenschalk wrote:
... I've yet to see a case where this change actually takes effect, though.
Is it possible that the change in the handling of the O2MDF resends also affects the FGRPB1G search?  As a result of looking into the problem mentioned in this thread, I've noticed a very large number of what were probably lost FGRPB1G tasks (inadvertently created by the OP of the thread) that are now listed as 'timed out - no response' errors.  That might be a rather dramatic case where the change has taken effect :-).

The change was in the handling of "lost tasks"; there is no separate handling for tasks of different apps.  "Yet to see a case" meant that, at the time of writing, I hadn't seen a scheduler contact that actually had "lost results".

The more I think about it, the more it seems to me that expiring (i.e. dropping) "lost" tasks that can't be processed by the client is the right thing to do, and that this was simply neglected when the "resend lost tasks" feature was first implemented.
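
For anyone curious what such logic might look like, here is a minimal sketch of a resend-or-expire decision.  It is purely illustrative: none of the names (LostTask, host_can_process, etc.) come from the real BOINC scheduler source.

    // Illustrative sketch only -- not the actual BOINC scheduler code.
    // All names (LostTask, host_can_process, ...) are invented.
    #include <cstdio>
    #include <vector>

    struct LostTask {
        int id;
        int app_version_id;  // app version the task was originally sent with
    };

    // Assumed stub: can this host still run the given app version?
    // (A real scheduler would consult the host's platform/plan-class info.)
    static bool host_can_process(int /*host_id*/, int app_version_id) {
        return app_version_id > 0;  // placeholder logic
    }

    static void resend_task(int host_id, const LostTask& t) {
        std::printf("resending task %d to host %d\n", t.id, host_id);
    }

    static void expire_task(const LostTask& t) {
        std::printf("expiring task %d ('timed out - no response')\n", t.id);
    }

    // Core idea: resend a lost task if the host can still process it;
    // otherwise expire it rather than let it block the host indefinitely.
    static void handle_lost_tasks(int host_id, const std::vector<LostTask>& lost) {
        for (const LostTask& t : lost) {
            if (host_can_process(host_id, t.app_version_id)) {
                resend_task(host_id, t);
            } else {
                expire_task(t);
            }
        }
    }

    int main() {
        handle_lost_tasks(42, {{1001, 7}, {1002, 0}});  // resends 1001, expires 1002
    }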

BM

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5872
Credit: 117608666366
RAC: 35203374

Bernd Machenschalk wrote:
The change was in the handling of "lost tasks", there is no separate handling for tasks of different apps.

Yes, I imagined that would be the case.  I lose FGRPB1G tasks reasonably regularly and it never causes the type of blockage we see with O2MDF.  My guess was that the test app status was the trigger, since that seems to be the relevant difference.

I like having the 'resend' feature for two main reasons.  I live in a hot climate and the machines run hot; happily only rather infrequently, this can lead to the entire cache of work being trashed and the machine landing in a multi-hour back-off.  I would much rather spend 15 minutes turning the comp errors into 'lost tasks' and have the server immediately resend them as fresh copies than wait in the penalty box for a new day and then start again with a limit of 1 task per core.  Apart from that, there's the little side benefit of not being continually reminded about all those comp errors just because the MD5 check on some data file or app didn't quite compute and gave a spurious result at that particular moment :-).

The more important reason is internet reliability.  Australia is still going through the building of, and conversion to, the all-singing, all-dancing NBN (National Broadband Network).  There are lots of teething issues and, although the speed I now have is great when everything is working, the reliability at the moment is questionable at best.  I'll blame the NBN, but I'm sure there are probably hiccups at your end (and other places) as well.  The upshot is that there are now regular examples in different machines' event logs of non-received scheduler responses (and consequent client back-offs ranging from minutes to hours) until a subsequent scheduler request eventually picks up the otherwise lost tasks.

Bernd Machenschalk wrote:
... expiring (i.e. dropping) "lost" tasks that can't be processed by the client is the right thing to do ...

I actually quite agree -- the key words being "can't be processed by the client".  With FGRPB1G there is also no CPU app, and yet there is no similar lost-task blockage.  It seems likely that the only way a similar problem could arise is through operator error, i.e. the volunteer failing to provide an acceptable environment to which the scheduler could resend the lost GPU tasks.  With O2MDF the problem arises (possibly frequently) without operator error.  The problem seems to be that the scheduler refuses to replace a lost test task even when the quorum doesn't contain another test task to act as a block.
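
Purely to illustrate that hypothesis (nothing below is taken from the real scheduler source; all names are invented), the observed behaviour would amount to something like:

    // Invented illustration of the hypothesised blockage -- not real code.
    struct Task {
        bool is_test_app;  // e.g. O2MDF (test app) vs FGRPB1G (production app)
    };

    // Observation from this thread: lost production tasks get resent, while
    // lost test-app tasks are refused even when nothing in the quorum
    // should block them.
    static bool scheduler_will_resend(const Task& t) {
        return !t.is_test_app;
    }

    int main() {
        Task fgrpb1g{false};  // resent as expected
        Task o2mdf{true};     // stuck: never resent, never expired
        return scheduler_will_resend(fgrpb1g) && !scheduler_will_resend(o2mdf) ? 0 : 1;
    }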

Whatever you decide to do, here is a compromise that could hopefully work for all concerned.  Put a time limit (my suggestion would be 24 hours) on how long the scheduler will attempt to resend lost tasks.  After that limit expires, drop the lost tasks.  I would be very happy with that because, even if I'm not paying too much attention, I do get alerted to network issues, hosts that have crashed or aren't making regular server connections, hosts not consuming proper clock ticks, etc., so the worst case scenario for me is hopefully well within a 24-hour limit :-).
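
As a sketch of that compromise (the 24-hour constant and all names are my own assumptions, not anything from the actual code base):

    // Hypothetical sketch of the proposed compromise: keep resending a lost
    // task for up to 24 hours after it was first detected as lost, then drop it.
    #include <cstdio>
    #include <ctime>

    const std::time_t RESEND_WINDOW_SECS = 24 * 3600;  // suggested 24-hour limit

    // 'first_seen_lost' = when the scheduler first noticed the task was lost.
    static bool keep_trying_to_resend(std::time_t first_seen_lost, std::time_t now) {
        return (now - first_seen_lost) <= RESEND_WINDOW_SECS;
    }

    int main() {
        std::time_t now = std::time(nullptr);
        std::printf("%d\n", keep_trying_to_resend(now - 3600, now));   // 1: lost 1 h ago, keep trying
        std::printf("%d\n", keep_trying_to_resend(now - 90000, now));  // 0: past 24 h, drop it
    }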

Cheers,
Gary.
