All-Sky Gravitational Wave Search on O3 data (O3ASHF1)

Bernd Machenschalk
Moderator
Administrator
Joined: 15 Oct 04
Posts: 4312
Credit: 250477690
RAC: 35289

Over the long weekend the server has recovered a bit, and work is being sent again (since yesterday, actually).

One lesson that we learned is that we have to issue the "work" in smaller chunks (frequency ranges) at a time. Unfortunately this means that the "search progress" indicator on the server status page isn't very useful anymore. It shows only the remaining work for the current chunk, and in addition takes two days to adjust to each new chunk.
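To illustrate, with purely made-up numbers, why the per-chunk figure says little about the whole search (this is only a sketch of the idea, not how the status page is actually computed):

# Illustration with assumed numbers only - not the status page's internal logic.
total_chunks = 20              # assumed number of frequency chunks in the whole search
tasks_per_chunk = 50_000       # assumed tasks per chunk
chunks_done = 12               # assumed chunks already finished
tasks_left_in_current = 5_000  # the kind of number the indicator effectively reports

# What the indicator reflects: progress within the current chunk only.
chunk_progress = 1 - tasks_left_in_current / tasks_per_chunk
print(f"current chunk: {chunk_progress:.0%} done")

# What one would actually like to know: progress across the whole search.
overall_done = chunks_done * tasks_per_chunk + (tasks_per_chunk - tasks_left_in_current)
overall_progress = overall_done / (total_chunks * tasks_per_chunk)
print(f"whole search: {overall_progress:.0%} done")   # much lower than the per-chunk figure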

Just to say, there will be work for at least four more weeks, and we'll probably even extend beyond that.

BM

MAGIC Quantum Mechanic
Joined: 18 Jan 05
Posts: 1886
Credit: 1407844571
RAC: 1151401

Thanks for the update Bernd 

 

Ian&Steve C.
Joined: 19 Jan 20
Posts: 3953
Credit: 46814512642
RAC: 64243410

Glad to see them back. Crunching through at full speed again :)


Ben Scott
Joined: 30 Mar 20
Posts: 53
Credit: 1598953090
RAC: 4997070

Bernd Machenschalk wrote:

Just to say, there will be work for at least four more weeks, and we'll probably even extend beyond that.

 

What is on the horizon after that? Any idea how long the break in GW work will be?

Thank you.

 

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5872
Credit: 117579143209
RAC: 35187635

Bernd Machenschalk wrote:
   ... we have to issue the "work" in smaller chunks (frequency ranges) at a time.

This had started happening sometime before you turned off work supply just before Easter.

When h1_14xx tasks first appeared, a number of my hosts got frequency bins around 146x.00 and above as well as lower values.  Very shortly after the initial downloads, the supply for those higher frequency bins stopped and the hosts were given new frequencies in the 1400 to 1410 range or thereabouts.  Very soon after that it appeared that everybody was in the same boat since the number of tasks per bin initially being supplied (>50,000) started disappearing at an alarming rate, meaning that there must be an alarming number of hosts 'drinking from the same pot'.

Obviously we now know that was a deliberate action on your part rather than some bug in Locality Scheduling.

Using a back-of-the-envelope estimate, the number of volunteer hosts assigned to each bin has increased by more than an order of magnitude compared to a few months ago, when frequencies between 1000 and 1200 were being processed.  In those earlier times I had seen a single set of large data files and the associated skygrid last for a couple of weeks or more.  Currently, I see all my hosts having to fetch new full sets almost on a daily basis.
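To put rough numbers on that (every figure below is an assumption for illustration only, not project data), a quick sketch in Python:

# Back-of-the-envelope sketch with assumed figures; only the scaling matters.
data_set_size_gb = 4.0        # assumed size of one full set of large data files plus skygrid
old_set_lifetime_days = 14    # a set used to last a couple of weeks per host
new_set_lifetime_days = 1     # currently a new full set is needed almost daily

old_daily_gb = data_set_size_gb / old_set_lifetime_days
new_daily_gb = data_set_size_gb / new_set_lifetime_days
print(f"before: ~{old_daily_gb:.2f} GB/day per host")
print(f"now:    ~{new_daily_gb:.2f} GB/day per host")
print(f"increase: ~{new_daily_gb / old_daily_gb:.0f}x")   # roughly an order of magnitude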

The very desirable LS benefit of limiting the frequency of excessively large downloads has effectively been canceled and I'm not particularly happy about that.  As I've advised previously, excessive download volumes are a problem for me - and perhaps for others as well.

Is there no other way to solve your problem?  What about extra servers handling extra frequency chunks so that a much smaller number of hosts can be supplied from each frequency bin?  Effectively what you seem to be doing is offloading the project's data handling and storage problem on to your volunteers by having them download and store more than an order of magnitude more than what they had to handle previously.

Cheers,
Gary.

Richard Haselgrove
Joined: 10 Dec 05
Posts: 2143
Credit: 2956986383
RAC: 718831

Bernd Machenschalk wrote:

Just to say, there will be work for at least four more weeks, and we'll probably even extend beyond that.

Not sure if it's related or something else - I haven't been following this conversation closely - but all my gravity wave crunchers have no work this morning.

Sample event log entry from host 1001562:

2024-04-07 07:44:03.5878 [PID=362357]    [mixed] sending locality work first (0.2385)
2024-04-07 07:44:03.5941 [PID=362357]    [send] send_old_work() no feasible result older than 336.0 hours
2024-04-07 07:44:04.6178 [PID=362357]    [send] send_old_work() no feasible result younger than 175.7 hours and older than 168.0 hours
2024-04-07 07:44:04.8716 [PID=362357] [CRITICAL]   get_working_set_filename(fasthost): pattern not found (file list empty)
2024-04-07 07:44:04.8717 [PID=362357] [CRITICAL]   get_working_set_filename(fasthost): pattern not found (file list empty)
2024-04-07 07:44:04.8717 [PID=362357] [CRITICAL]   get_working_set_filename(fasthost): pattern not found (file list empty)
2024-04-07 07:44:04.8717 [PID=362357] [CRITICAL]   get_working_set_filename(fasthost): pattern not found (file list empty)

The server status page says there are just 7 tasks to send.
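Reading that log top to bottom, the scheduler appears to fall through its options in order: first it tries to resend old results, then it tries to pick a data file from the "working set" to generate fresh work for. Below is a purely illustrative sketch of that cascade; the names mirror the log messages, but this is not the real BOINC scheduler code and the logic is an assumption:

# Illustrative sketch only; not the actual BOINC locality scheduler.
OLD_RESULTS = []          # (age_in_hours, result) pairs of unsent old results - none here
WORKING_SET_FILES = []    # data files with work still ready to send - "file list empty"

def send_old_work(min_age_h, max_age_h=None):
    # resend a previously generated result in the given age window, if one exists
    for age_h, result in OLD_RESULTS:
        if age_h >= min_age_h and (max_age_h is None or age_h < max_age_h):
            return result
    return None

def get_working_set_filename():
    # pick a data file from the working set to generate fresh work for
    if not WORKING_SET_FILES:
        print("[CRITICAL] get_working_set_filename: pattern not found (file list empty)")
        return None
    return WORKING_SET_FILES[0]

def handle_request():
    if send_old_work(min_age_h=336.0):                    # "older than 336.0 hours"
        return "sent old work"
    if send_old_work(min_age_h=168.0, max_age_h=175.7):   # the 168.0 to 175.7 hour window
        return "sent old work"
    for _ in range(4):                                    # retried several times in the log
        if get_working_set_filename():
            return "sent fresh work"
    return "no work available"                            # the empty reply the hosts get

print(handle_request())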

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5872
Credit: 117579143209
RAC: 35187635

Richard Haselgrove wrote:
... all my gravity wave crunchers have no work this morning.

The supply of work finished around 9:30PM UTC on Saturday - 7:30AM Sunday here in Oz.  I happened to see a machine get one of the remaining tasks (very low sequence number) for the h1_1433.60 frequency series.  The next request a minute later got the "No work" reply.

I would guess the outage might be due to the new policy of "issuing work in smaller chunks".  Perhaps not enough 'chunks' were available to meet the demand over a full weekend.  The 1433.60 'chunk' had only become available quite recently.  A frequency set like this only seems to last a day or two before the next higher one (eg 1435.60 in this case) should kick in.  I guess there was no 1435.60 series ready to go.

The other possibility is that storing results was becoming a problem again so any new chunks were quickly withdrawn.  I hope this isn't the case.  I would think Bernd would have issued a warning if he needed to do that.  Hopefully, Bernd will notice the outage and will add a 'chunk' or two to tide everyone over for the remainder of the weekend.

Cheers,
Gary.

Ian&Steve C.
Joined: 19 Jan 20
Posts: 3953
Credit: 46814512642
RAC: 64243410

Seems like something is stuck rather than the project truly running out of work. The RTS (ready-to-send) count has been stuck at 7-9 tasks for 10+ hours. If it were out of work, I would expect that to be 0, with the occasional outlier as tasks come back and need to be resent. But it's flatlined at 7-9 tasks with almost no variance.


Ian&Steve C.
Joined: 19 Jan 20
Posts: 3953
Credit: 46814512642
RAC: 64243410

Bernd, is there a problem with the O3AS scheduler? Every time I have tried to get work since you added more tasks (around 0900 UTC), the request either times out:
Mon 08 Apr 2024 07:23:20 AM EDT | Einstein@Home | Scheduler request failed: Timeout was reached

or I get a message that a scheduler instance is already running:
Mon 08 Apr 2024 07:26:10 AM EDT | Einstein@Home | Another scheduler instance is running for this host

A project reset hasn't helped.

I'm guessing a lot of other folks are in this situation, since the rate at which work is being sent out has been very slow for both O3AS and BRP7.


Bernd Machenschalk
Moderator
Administrator
Joined: 15 Oct 04
Posts: 4312
Credit: 250477690
RAC: 35289

This "chunky" release seem to

This "chunky" release seem to confuse not only the scheduler, but us as well. We need t think about and work on this a little more. For now I will disable "locality scheduling" and the O3ASHF search.

BM
