Over the long weekend the server has recovered a bit, and work is being sent again (already since yesterday).
One lesson that we learned is that we have to issue the "work" in smaller chunks (frequency ranges) at a time. Unfortunately this means that the "search progress" indicator on the server status page isn't very useful anymore. It shows only the remaining work for the current chunk, and in addition takes two days to adjust to each new chunk.
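As a simplified illustration (not the actual status-page code) of why a per-chunk number is misleading for the search as a whole:

# Simplified illustration, NOT the actual Einstein@Home status-page code.
# A progress figure computed only from the chunk currently being issued
# resets with every new chunk and says little about the whole search.

def displayed_progress(done_in_chunk, total_in_chunk):
    """What a per-chunk indicator shows."""
    return done_in_chunk / total_in_chunk

def overall_progress(chunks_done, done_in_chunk, total_in_chunk, total_chunks):
    """What one would actually want to know (all chunks assumed equal size)."""
    return (chunks_done + done_in_chunk / total_in_chunk) / total_chunks

# Example: 40 of 50 chunks finished, the current chunk has just started.
print(displayed_progress(1_000, 50_000))            # -> 0.02, the page suggests 2%
print(overall_progress(40, 1_000, 50_000, 50))      # -> 0.8004, really about 80% of the search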
Just to say, there will be work for at least four more weeks, and we'll probably even extend beyond that.
BM
Thanks for the update Bernd
Glad to see them back. Crunching through at full speed again :)
Bernd Machenschalk wrote:
Just to say, there will be work for at least four more weeks, and we'll probably even extend beyond that.
What is on the horizon after that? Any idea how long the break in GW work will be?
Thank you.
Bernd Machenschalk wrote:
... we have to issue the "work" in smaller chunks (frequency ranges) at a time.
This had started happening sometime before you turned off work supply just before Easter.
When h1_14xx tasks first appeared, a number of my hosts got frequency bins around 146x.00 and above as well as lower values. Very shortly after the initial downloads, the supply for those higher frequency bins stopped and the hosts were given new frequencies in the 1400 to 1410 range or thereabouts. Very soon after that it appeared that everybody was in the same boat since the number of tasks per bin initially being supplied (>50,000) started disappearing at an alarming rate, meaning that there must be an alarming number of hosts 'drinking from the same pot'.
Obviously we now know that was a deliberate action on your part rather than some bug in Locality Scheduling.
Using a 'back of the envelope' type estimate, the number of volunteer hosts assigned to each bin has increased by more than an order of magnitude compared to what existed a few months ago when frequencies between 1000 and 1200 were being processed. I had seen a single set of large data files and the associated skygrid lasting for a couple of weeks or more in those earlier times. Currently, I see all my hosts having to get new full sets, almost on a daily basis.
The very desirable LS benefit of limiting the frequency of excessively large downloads has effectively been canceled and I'm not particularly happy about that. As I've advised previously, excessive download volumes are a problem for me - and perhaps for others as well.
Is there no other way to solve your problem? What about extra servers handling extra frequency chunks so that a much smaller number of hosts can be supplied from each frequency bin? Effectively what you seem to be doing is offloading the project's data handling and storage problem on to your volunteers by having them download and store more than an order of magnitude more than what they had to handle previously.
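To make that 'back of the envelope' a bit more concrete, here is a minimal sketch (Python) of the scaling involved. All of the input numbers are my own illustrative guesses, not figures from the project:

# Rough scaling of hosts-per-bin and per-host download volume when work
# is issued from fewer frequency bins at a time.
# ALL numbers are illustrative assumptions, not project figures.

active_hosts = 20_000        # assumed number of hosts asking for O3AS work
dataset_mb   = 1_000         # assumed size of one full data set plus skygrid (MB)

def per_host_load(open_bins, days_per_set):
    """Hosts sharing each bin, and MB of full-set downloads per host per month."""
    hosts_per_bin = active_hosts / open_bins
    downloads_per_month = 30 / days_per_set   # a new full set every days_per_set days
    return hosts_per_bin, downloads_per_month * dataset_mb

# Earlier: many bins open at once, a data set lasted a couple of weeks.
print(per_host_load(open_bins=100, days_per_set=14))  # -> (200.0, ~2143 MB per host per month)

# Now: only a handful of bins open, a full set lasts about a day.
print(per_host_load(open_bins=5, days_per_set=1))     # -> (4000.0, 30000 MB per host per month)

Whatever the real inputs are, it's the ratio between the two cases that matters, and that ratio is what has blown out.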
Cheers,
Gary.
Bernd Machenschalk wrote:
Just to say, there will be work for at least four more weeks, and we'll probably even extend beyond that.
Not sure if it's related or something else - I haven't been following this conversation closely - but all my gravity wave crunchers have no work this morning.
Sample event log entry from host 1001562:
2024-04-07 07:44:03.5878 [PID=362357] [mixed] sending locality work first (0.2385)
2024-04-07 07:44:03.5941 [PID=362357] [send] send_old_work() no feasible result older than 336.0 hours
2024-04-07 07:44:04.6178 [PID=362357] [send] send_old_work() no feasible result younger than 175.7 hours and older than 168.0 hours
2024-04-07 07:44:04.8716 [PID=362357] [CRITICAL] get_working_set_filename(fasthost): pattern not found (file list empty)
2024-04-07 07:44:04.8717 [PID=362357] [CRITICAL] get_working_set_filename(fasthost): pattern not found (file list empty)
2024-04-07 07:44:04.8717 [PID=362357] [CRITICAL] get_working_set_filename(fasthost): pattern not found (file list empty)
2024-04-07 07:44:04.8717 [PID=362357] [CRITICAL] get_working_set_filename(fasthost): pattern not found (file list empty)
The server status page says there are just 7 tasks to send.
Richard Haselgrove wrote:
... all my gravity wave crunchers have no work this morning.
The supply of work finished around 9:30PM UTC on Saturday - 7:30AM Sunday here in Oz. I happened to see a machine get one of the remaining tasks (very low sequence number) for the h1_1433.60 frequency series. The next request a minute later got the "No work" reply.
I would guess the outage might be due to the new policy of "issuing work in smaller chunks". Perhaps not enough 'chunks' were available to meet the demand over a full weekend. The 1433.60 'chunk' had only become available quite recently. A frequency set like this only seems to last a day or two before the next higher one (eg 1435.60 in this case) should kick in. I guess there was no 1435.60 series ready to go.
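Putting rough numbers on that guess (all of them assumptions on my part):

# Rough check of the 'not enough chunks queued for the weekend' theory.
# ALL numbers are assumptions for illustration, not project figures.

tasks_per_chunk = 50_000    # roughly what a new frequency bin started with
set_lifetime_d  = 1.5       # a set seems to last "a day or two"
weekend_d       = 2.5       # Saturday evening UTC through Monday morning

tasks_per_day = tasks_per_chunk / set_lifetime_d
chunks_needed = weekend_d / set_lifetime_d

print(f"~{tasks_per_day:.0f} tasks/day, so ~{chunks_needed:.1f} chunks "
      f"would need to be queued to ride out the weekend unattended")
# -> ~33333 tasks/day, so ~1.7 chunks would need to be queued ...

So even one extra chunk queued up before the weekend would probably have avoided the dry spell.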
The other possibility is that storing results was becoming a problem again so any new chunks were quickly withdrawn. I hope this isn't the case. I would think Bernd would have issued a warning if he needed to do that. Hopefully, Bernd will notice the outage and will add a 'chunk' or two to tide everyone over for the remainder of the weekend.
Cheers,
Gary.
seems like something is stuck rather than truly running out of work. the RTS has been stuck at 7-9 tasks for 10+ hours. if it was out of work, i would expect that to be 0 with some outliers as some tasks occasionally come back and need to be resent. but it's flatlined at 7-9 tasks with almost no variance.
Bernd, is there a problem with the O3AS scheduler? Every time I try to get work since you added more tasks (around 0900 UTC), the request either times out:
Mon 08 Apr 2024 07:23:20 AM EDT | Einstein@Home | Scheduler request failed: Timeout was reached
or I get a message that a scheduler instance is already running:
Mon 08 Apr 2024 07:26:10 AM EDT | Einstein@Home | Another scheduler instance is running for this host
A project reset hasn't helped.
I'm guessing a lot of other folks are in this situation since the rate at which work is being sent out has been very slow for both O3AS and BRP7.
This "chunky" release seem to
)
This "chunky" release seem to confuse not only the scheduler, but us as well. We need t think about and work on this a little more. For now I will disable "locality scheduling" and the O3ASHF search.
BM