Error condition I am not understanding

Betreger
Joined: 25 Feb 05
Posts: 992
Credit: 1589825716
RAC: 769788

I consider pulsars to be second prize.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5872
Credit: 117596823107
RAC: 35220440

But they are a very worthy 2nd prize.  The more you think about them, and about the next stage (black hole) if the star that went 'supernova' happened to be a bit bigger to start with, the more you must surely be impressed by these objects.

By knowing where they are, how massive they are, how fast they're spinning, what forms of radiation they are emitting, etc, it must make attaining the 1st prize a much more achievable outcome.

What does it feel like to be part of a brand new scientific discipline - the observation of the universe in gravitational waves - something that was in fantasy land just a few short years ago?

Think about the many recent scientific advances, like exoplanet discoveries kicked off by the Kepler mission, the confirmation of GW through both BH-BH and NS-NS mergers, the first photograph of a supermassive BH, etc, etc.  These sorts of events (and many others) have made this a most exciting time to be alive, to be following and to be helping with the scientific progress.  Does it really matter what particular part you're helping with?  You're a winner if you're helping at all!

Cheers,
Gary.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5872
Credit: 117596823107
RAC: 35220440

robl wrote:
Rather than sit idle I have selected "pulsar binary search #1".

That's exactly what I did with mine.  I was in the process of gathering performance data for some different hardware types and different task multiplicities to work out what is the best type of system to process these tasks.  I'm comparing very modern stuff with 3 particular CPU types.  These are Intel 2C/4T, Intel 4C/4T and AMD 6C/12T, all with an RX 570 GPU.  I'm comparing from 1x right through to 4x and even at least 5x on the 6C/12T - a Ryzen 2600.

The machine that was affected was the 4C/4T one (Intel i3-9100F) and I had built 3 of these so I'll just convert a second one and start gathering data again.

The biggest problem is something that Bernd mentioned - the fact that the WU generator isn't doing a good job of 'slicing and dicing' tasks of equal work content.  The changes in crunch time are quite large when they happen and I don't see any way of selecting ones of 'equal size' over all three host types.  I'm basically at the point of gathering many hundreds (possibly thousands) of tasks to try to ensure a believable average crunch time.  I suspect the run might end pretty soon so maybe if the WU generator can be 'fixed' for the next iteration, I might be better waiting for that.
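As a rough illustration of the averaging described above, here is a minimal Python sketch. It assumes you have collected the elapsed time of each task run at a given multiplicity (1x, 2x, and so on). The function names and the sample timings in the test are hypothetical placeholders, not measurements from any real host.

```python
from math import sqrt
from statistics import mean, stdev


def effective_time_per_task(elapsed_seconds, multiplicity):
    """Average elapsed time divided by the number of tasks run concurrently.

    Running N tasks at once means each elapsed time covers N tasks' worth
    of work, so the effective cost per task is mean(elapsed) / N.
    """
    if multiplicity < 1:
        raise ValueError("multiplicity must be at least 1")
    if not elapsed_seconds:
        raise ValueError("need at least one elapsed time")
    return mean(elapsed_seconds) / multiplicity


def standard_error(elapsed_seconds):
    """Standard error of the mean elapsed time.

    A large value relative to the mean suggests the tasks vary a lot in
    work content, so more samples are needed for a believable average.
    """
    if len(elapsed_seconds) < 2:
        raise ValueError("need at least two samples")
    return stdev(elapsed_seconds) / sqrt(len(elapsed_seconds))


def tasks_per_day(elapsed_seconds, multiplicity):
    """Estimated daily throughput of one host at the given multiplicity."""
    return 86400 / effective_time_per_task(elapsed_seconds, multiplicity)
```

Comparing `tasks_per_day` for each host type and multiplicity gives a single figure of merit, while `standard_error` gives a rough idea of how much the uneven task sizes are still affecting the average.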

Cheers,
Gary.

Holmis
Joined: 4 Jan 05
Posts: 1118
Credit: 1055935564
RAC: 0

robl wrote:
Rather than sit idle I have selected "pulsar binary search #1".

Nice to see you found another option for getting the host working again!
Didn't think of that obvious one. Doh!

Bernd Machenschalk
Moderator
Administrator
Joined: 15 Oct 04
Posts: 4312
Credit: 250488199
RAC: 34718

It's terribly difficult to get rid of tasks once they have been assigned to a computer. If the application versions or the preferences change, there may be no app version to complete the tasks that the scheduler tries to "resend"; however, it usually tries over and over again with every work request.

I have now changed the scheduler so that it will "expire" a task when there is no app version to process it. I've yet to see a case where this change actually takes effect, though.
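The behaviour described above can be sketched roughly as follows. This is only an illustration in Python, not the actual BOINC scheduler code, and every name in it (`Task`, `handle_lost_tasks`, the `"expired"` status string) is a hypothetical stand-in.

```python
from dataclasses import dataclass


@dataclass
class Task:
    """Hypothetical stand-in for a task the server believes a host still has."""
    name: str
    app: str
    status: str = "in_progress"


def handle_lost_tasks(lost_tasks, available_apps):
    """Split lost tasks into those to resend and those to expire.

    A task is resent only if the host can still run its application
    (an app version exists and the preferences allow it). Otherwise it
    is marked as expired instead of being retried on every work request.
    """
    resend, expired = [], []
    for task in lost_tasks:
        if task.app in available_apps:
            resend.append(task)
        else:
            task.status = "expired"  # assumed label, not the real server state
            expired.append(task)
    return resend, expired
```

For example, a host whose preferences now allow only FGRPB1G would have its lost O2MDF tasks expired rather than offered again with each request.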

BM

Anonymous

Bernd Machenschalk wrote:

It's terribly difficult to get rid of tasks once they have been assigned to a computer. If the application versions or the preferences change, there may be no app version to complete the tasks that the scheduler tries to "resend"; however, it usually tries over and over again with every work request.

I have now changed the scheduler so that it will "expire" a task when there is no app version to process it. I've yet to see a case where this change actually takes effect, though.

Bernd,

I just noticed that on the PC in question I now have "Gravitational Wave search O2 Multi-Directional GPU" work units. I would think that your change to the scheduler might have resolved the problem.

EDIT: I have also noticed that the original 54 jobs are now showing "Timed out - no response". FYI - not complaining.

Anonymous

Holmis wrote:
robl wrote:
Rather than sit idle I have selected "pulsar binary search #1".

Nice to see you found another option for getting the host working again!
Didn't think of that obvious one. Doh!

Sometimes I amaze myself!!! :-P

Holmis
Joined: 4 Jan 05
Posts: 1118
Credit: 1055935564
RAC: 0

Bernd Machenschalk wrote:
I have now changed the scheduler so that it will "expire" a task when there is no app version to process it.

If this works without any adverse effects I think it's an excellent solution!
Thank you for trying to fix this without turning the "resend lost tasks" feature off.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5872
Credit: 117596823107
RAC: 35220440

Bernd Machenschalk wrote:
I have now changed the scheduler so that it will "expire" a task when there is no app version to process it. I've yet to see a case where this change actually takes effect, though.

Thanks for that.  In the tasks list for my host that had the problem, I now see the 9 lost tasks showing up with a status of "Timed out - no response" and listed as a group under the 'Errors' column.  I don't care about that, but up to that point, the host had an unblemished record - 499 total with 9 in progress, 471 valid, 19 pending, 0 invalid.  So my unblemished record is now 'polluted' with 9 errors :-).

As a result of your quick fix, I've now changed the 'location' (aka venue) of the host so it would stop requesting FGRPB1G and start requesting O2MDF once again. It now has a fresh bunch of tasks in progress. The very first batch were resends - at first glance I thought I was getting my 'lost' ones back again :-). They turned out to have different 'sequence' numbers for slightly different frequency bins, e.g. 654.35 instead of 654.70, but close enough that I already had the needed large data files. Obviously, locality scheduling is working nicely as intended. Once again, thanks very much for finding a solution.

Cheers,
Gary.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5872
Credit: 117596823107
RAC: 35220440

Bernd Machenschalk wrote:
... I've yet to see a case where this change actually takes effect, though.

Is it possible that the change in the handling of the O2MDF resends also affects the FGRPB1G search? As a result of looking into the problem mentioned in this thread, I've noticed a very large number of what were probably lost FGRPB1G tasks (inadvertently created by the OP of the thread) that are now listed as 'timed out - no response' errors. That might be a rather dramatic case where the change has taken effect :-).

Cheers,
Gary.
