All-Sky Gravitational Wave Search on O3 data (O3ASHF1)

Bernd Machenschalk
Moderator
Administrator
Joined: 15 Oct 04
Posts: 4312
Credit: 250477690
RAC: 35289

Over the long weekend the server has recovered a bit, and work is being sent again (since yesterday, actually).

One lesson that we learned is that we have to issue the "work" in smaller chunks (frequency ranges) at a time. Unfortunately this means that the "search progress" indicator on the server status page isn't very useful anymore. It shows only the remaining work for the current chunk, and in addition takes two days to adjust to each new chunk.
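To illustrate, with purely made-up numbers, why the per-chunk figure says little about the whole search (this is only a sketch of the idea, not how the status page is actually computed):

# Illustration with assumed numbers only - not the status page's internal logic.
total_chunks = 20              # assumed number of frequency chunks in the whole search
tasks_per_chunk = 50_000       # assumed tasks per chunk
chunks_done = 12               # assumed chunks already finished
tasks_left_in_current = 5_000  # the kind of number the indicator effectively reports

# What the indicator reflects: progress within the current chunk only.
chunk_progress = 1 - tasks_left_in_current / tasks_per_chunk
print(f"current chunk: {chunk_progress:.0%} done")

# What one would actually like to know: progress across the whole search.
overall_done = chunks_done * tasks_per_chunk + (tasks_per_chunk - tasks_left_in_current)
overall_progress = overall_done / (total_chunks * tasks_per_chunk)
print(f"whole search: {overall_progress:.0%} done")   # much lower than the per-chunk figure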

Just to say, there will be work for at least four more weeks, and we'll probably even extend beyond that.

BM

MAGIC Quantum Mechanic
Joined: 18 Jan 05
Posts: 1886
Credit: 1407844571
RAC: 1151401

Thanks for the update Bernd 

 

Ian&Steve C.
Joined: 19 Jan 20
Posts: 3953
Credit: 46814512642
RAC: 64243410

Glad to see them back. Crunching through at full speed again :)


Ben Scott
Joined: 30 Mar 20
Posts: 53
Credit: 1598953090
RAC: 4997070

Bernd Machenschalk wrote:

Just to say, there will be work for at least four more weeks, and we'll probably even extend beyond that.

 

What is on the horizon after that? Any idea how long the break in GW work will be?

Thank you.

 

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5872
Credit: 117579143209
RAC: 35187635

Bernd Machenschalk wrote:
   ... we have to issue the "work" in smaller chunks (frequency ranges) at a time.

This had started happening sometime before you turned off work supply just before Easter.

When h1_14xx tasks first appeared, a number of my hosts got frequency bins around 146x.00 and above as well as lower values.  Very shortly after the initial downloads, the supply for those higher frequency bins stopped and the hosts were given new frequencies in the 1400 to 1410 range or thereabouts.  Very soon after that it appeared that everybody was in the same boat since the number of tasks per bin initially being supplied (>50,000) started disappearing at an alarming rate, meaning that there must be an alarming number of hosts 'drinking from the same pot'.

Obviously we now know that was a deliberate action on your part rather than some bug in Locality Scheduling.

Using a back-of-the-envelope estimate, the number of volunteer hosts assigned to each bin has increased by more than an order of magnitude compared to a few months ago, when frequencies between 1000 and 1200 were being processed.  In those earlier times I had seen a single set of large data files and the associated skygrid last for a couple of weeks or more.  Currently, I see all my hosts having to fetch new full sets almost on a daily basis.
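To put rough numbers on that (every figure below is an assumption for illustration only, not project data), a quick sketch in Python:

# Back-of-the-envelope sketch with assumed figures; only the scaling matters.
data_set_size_gb = 4.0        # assumed size of one full set of large data files plus skygrid
old_set_lifetime_days = 14    # a set used to last a couple of weeks per host
new_set_lifetime_days = 1     # currently a new full set is needed almost daily

old_daily_gb = data_set_size_gb / old_set_lifetime_days
new_daily_gb = data_set_size_gb / new_set_lifetime_days
print(f"before: ~{old_daily_gb:.2f} GB/day per host")
print(f"now:    ~{new_daily_gb:.2f} GB/day per host")
print(f"increase: ~{new_daily_gb / old_daily_gb:.0f}x")   # roughly an order of magnitude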

The very desirable LS benefit of limiting the frequency of excessively large downloads has effectively been canceled and I'm not particularly happy about that.  As I've advised previously, excessive download volumes are a problem for me - and perhaps for others as well.

Is there no other way to solve your problem?  What about extra servers handling extra frequency chunks so that a much smaller number of hosts can be supplied from each frequency bin?  Effectively what you seem to be doing is offloading the project's data handling and storage problem on to your volunteers by having them download and store more than an order of magnitude more than what they had to handle previously.

Cheers,
Gary.

Richard Haselgrove
Joined: 10 Dec 05
Posts: 2143
Credit: 2956986383
RAC: 718831

Bernd Machenschalk wrote:

Just to say, there will be work for at least four more weeks, and we'll probably even extend beyond that.

Not sure if it's related or something else - I haven't been following this conversation closely - but all my gravity wave crunchers have no work this morning.

Sample event log entry from host 1001562:

2024-04-07 07:44:03.5878 [PID=362357]    [mixed] sending locality work first (0.2385)
2024-04-07 07:44:03.5941 [PID=362357]    [send] send_old_work() no feasible result older than 336.0 hours
2024-04-07 07:44:04.6178 [PID=362357]    [send] send_old_work() no feasible result younger than 175.7 hours and older than 168.0 hours
2024-04-07 07:44:04.8716 [PID=362357] [CRITICAL]   get_working_set_filename(fasthost): pattern not found (file list empty)
2024-04-07 07:44:04.8717 [PID=362357] [CRITICAL]   get_working_set_filename(fasthost): pattern not found (file list empty)
2024-04-07 07:44:04.8717 [PID=362357] [CRITICAL]   get_working_set_filename(fasthost): pattern not found (file list empty)
2024-04-07 07:44:04.8717 [PID=362357] [CRITICAL]   get_working_set_filename(fasthost): pattern not found (file list empty)

The server status page says there are just 7 tasks to send.
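Reading that log top to bottom, the scheduler appears to fall through its options in order: first it tries to resend old results, then it tries to pick a data file from the "working set" to generate fresh work for. Below is a purely illustrative sketch of that cascade; the names mirror the log messages, but this is not the real BOINC scheduler code and the logic is an assumption:

# Illustrative sketch only; not the actual BOINC locality scheduler.
OLD_RESULTS = []          # (age_in_hours, result) pairs of unsent old results - none here
WORKING_SET_FILES = []    # data files with work still ready to send - "file list empty"

def send_old_work(min_age_h, max_age_h=None):
    # resend a previously generated result in the given age window, if one exists
    for age_h, result in OLD_RESULTS:
        if age_h >= min_age_h and (max_age_h is None or age_h < max_age_h):
            return result
    return None

def get_working_set_filename():
    # pick a data file from the working set to generate fresh work for
    if not WORKING_SET_FILES:
        print("[CRITICAL] get_working_set_filename: pattern not found (file list empty)")
        return None
    return WORKING_SET_FILES[0]

def handle_request():
    if send_old_work(min_age_h=336.0):                    # "older than 336.0 hours"
        return "sent old work"
    if send_old_work(min_age_h=168.0, max_age_h=175.7):   # the 168.0 to 175.7 hour window
        return "sent old work"
    for _ in range(4):                                    # retried several times in the log
        if get_working_set_filename():
            return "sent fresh work"
    return "no work available"                            # the empty reply the hosts get

print(handle_request())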

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5872
Credit: 117579143209
RAC: 35187635

Richard Haselgrove wrote:
... all my gravity wave crunchers have no work this morning.

The supply of work finished around 9:30PM UTC on Saturday - 7:30AM Sunday here in Oz.  I happened to see a machine get one of the remaining tasks (very low sequence number) for the h1_1433.60 frequency series.  The next request a minute later got the "No work" reply.

I would guess the outage might be due to the new policy of "issuing work in smaller chunks".  Perhaps not enough 'chunks' were available to meet the demand over a full weekend.  The 1433.60 'chunk' had only become available quite recently.  A frequency set like this only seems to last a day or two before the next higher one (eg 1435.60 in this case) should kick in.  I guess there was no 1435.60 series ready to go.

The other possibility is that storing results was becoming a problem again so any new chunks were quickly withdrawn.  I hope this isn't the case.  I would think Bernd would have issued a warning if he needed to do that.  Hopefully, Bernd will notice the outage and will add a 'chunk' or two to tide everyone over for the remainder of the weekend.

Cheers,
Gary.

Ian&Steve C.
Joined: 19 Jan 20
Posts: 3953
Credit: 46814512642
RAC: 64243410

Seems like something is stuck rather than the project truly running out of work. The RTS (ready-to-send) count has been stuck at 7-9 tasks for 10+ hours. If it were out of work, I would expect that to be 0, with the occasional outlier as tasks come back and need to be resent. But it's flatlined at 7-9 tasks with almost no variance.


Ian&Steve C.
Joined: 19 Jan 20
Posts: 3953
Credit: 46814512642
RAC: 64243410

Bernd, is there a problem with the O3AS scheduler? Every time I have tried to get work since you added more tasks (around 0900 UTC), the request either times out:
Mon 08 Apr 2024 07:23:20 AM EDT | Einstein@Home | Scheduler request failed: Timeout was reached

or I get a message that a scheduler instance is already running:
Mon 08 Apr 2024 07:26:10 AM EDT | Einstein@Home | Another scheduler instance is running for this host

A project reset hasn't helped.

I'm guessing a lot of other folks are in this situation, since the rate at which work is being sent out has been very slow for both O3AS and BRP7.


Bernd Machenschalk
Moderator
Administrator
Joined: 15 Oct 04
Posts: 4312
Credit: 250477690
RAC: 35289

This "chunky" release seem to

This "chunky" release seem to confuse not only the scheduler, but us as well. We need t think about and work on this a little more. For now I will disable "locality scheduling" and the O3ASHF search.

BM
