Then I'm guessing they haven't implemented the specific change we are talking about.
Changing __constant to __global is most beneficial in the GR app. Using __global also isn't supported in OpenCL <2.0, so if you still have older drivers, then I'm guessing they made other improvements to the GW app processing, which would be welcomed.
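For anyone not familiar with the OpenCL side of that, this is a minimal sketch of the kind of change being discussed - the kernel and argument names here are made up for illustration, not taken from the actual Einstein apps:

    /* Before: the lookup table is bound to __constant memory, whose size is
       capped by the device's constant-buffer limit. */
    __kernel void resample_old(__constant float *coeffs,
                               __global const float *in,
                               __global float *out)
    {
        size_t i = get_global_id(0);
        out[i] = in[i] * coeffs[i % 16];
    }

    /* After: the same table is passed as an ordinary __global buffer, so it no
       longer competes for, or is limited by, constant memory. */
    __kernel void resample_new(__global const float *coeffs,
                               __global const float *in,
                               __global float *out)
    {
        size_t i = get_global_id(0);
        out[i] = in[i] * coeffs[i % 16];
    }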
Do you see a heavy reliance on CPU still, or a noticeable increase in GPU utilization?
Haven't got that far yet. Windows machines are showing heavy CPU use as usual - assumed to be the normal OpenCL spin-wait bug - but possibly less than usual.
But the two Linux machines are both showing a serious problem:
https://einsteinathome.org/host/12788501/tasks/6/56?sort=asc&order=Sent
https://einsteinathome.org/host/12808716/tasks/6/56?sort=asc&order=Sent
All GW tasks (only) are being errored out with a server timeout after 3 minutes, and I woke to an empty cache and a 24-hour punishment lockout. Local logs have cycled, so no detail there - I'll have to dig into the systemd journal for any answers.
It's not a machine breakdown - some gamma-ray pulsar tasks got through and reported normally, as did tasks for other projects.
I'll go into recovery mode with gamma-ray work only, and no beta apps, while I finish waking up.
Well, that didn't work.

06/08/2021 08:21:36 | Einstein@Home | No work sent
06/08/2021 08:21:36 | Einstein@Home | (reached daily quota of 21 tasks)
06/08/2021 08:21:36 | Einstein@Home | [sched_op] Deferring communication for 16:53:02
That problem seems to be limited to Nvidia, as AMD tasks, both 1.00 and 1.01, are executed flawlessly in Ubuntu.

I think it's a server problem, not an app problem - they seemed to run normally here, too. Have you checked your AMD report lists on this website? (Your computers are hidden, so I can't check them from here.)

Yes - no obvious problem there. The scheduler seems to assign 1.00 tasks to validate 1.01 tasks. For me, most 1.01 are pending, some valid already, no invalids thus far.
Did the tasks look like they were running normally during those 3 minutes?
It certainly looks like they are set to an incorrect deadline or something.
Yes - and longer. I extracted the client log from the journal. These are the first four tasks run with v1.01: they overlapped, so I've switched the order round for clarity. Run times of 24 down to 15 minutes are within normal tolerance, but note how they decrease as the older work (not shown) is flushed through the system. The machine has 2x GTX 1660 Ti, running at 0.5 GPU usage - so both cards were fully occupied by the new app by the end.
Aug 05 22:14:43 Michelle boinc[1730]: 05-Aug-2021 22:14:43 [Einstein@Home] Starting task h1_0135.40_O3aC01Cl1In0__O3AS1_135.50Hz_1307_0
Aug 05 22:14:43 Michelle boinc[1730]: 05-Aug-2021 22:14:43 [Einstein@Home] [cpu_sched] Starting task h1_0135.40_O3aC01Cl1In0__O3AS1_135.50Hz_1307_0 using einstein_O3AS version 101 (GW-opencl-nvidia) in slot 5
Aug 05 22:37:56 Michelle boinc[1730]: 05-Aug-2021 22:37:56 [Einstein@Home] Computation for task h1_0135.40_O3aC01Cl1In0__O3AS1_135.50Hz_1307_0 finished
Aug 05 22:37:59 Michelle boinc[1730]: 05-Aug-2021 22:37:59 [Einstein@Home] Started upload of h1_0135.40_O3aC01Cl1In0__O3AS1_135.50Hz_1307_0_0
Aug 05 22:37:59 Michelle boinc[1730]: 05-Aug-2021 22:37:59 [Einstein@Home] Started upload of h1_0135.40_O3aC01Cl1In0__O3AS1_135.50Hz_1307_0_1
Aug 05 22:38:00 Michelle boinc[1730]: 05-Aug-2021 22:38:00 [Einstein@Home] Finished upload of h1_0135.40_O3aC01Cl1In0__O3AS1_135.50Hz_1307_0_1
Aug 05 22:38:00 Michelle boinc[1730]: 05-Aug-2021 22:38:00 [Einstein@Home] Started upload of h1_0135.40_O3aC01Cl1In0__O3AS1_135.50Hz_1307_0_2
Aug 05 22:38:01 Michelle boinc[1730]: 05-Aug-2021 22:38:01 [Einstein@Home] Finished upload of h1_0135.40_O3aC01Cl1In0__O3AS1_135.50Hz_1307_0_0
Aug 05 22:38:01 Michelle boinc[1730]: 05-Aug-2021 22:38:01 [Einstein@Home] Finished upload of h1_0135.40_O3aC01Cl1In0__O3AS1_135.50Hz_1307_0_2
Aug 05 22:25:46 Michelle boinc[1730]: 05-Aug-2021 22:25:46 [Einstein@Home] Starting task h1_0135.40_O3aC01Cl1In0__O3AS1_135.50Hz_1301_0
Aug 05 22:25:46 Michelle boinc[1730]: 05-Aug-2021 22:25:46 [Einstein@Home] [cpu_sched] Starting task h1_0135.40_O3aC01Cl1In0__O3AS1_135.50Hz_1301_0 using einstein_O3AS version 101 (GW-opencl-nvidia) in slot 4
Aug 05 22:39:40 Michelle boinc[1730]: 05-Aug-2021 22:39:40 [Einstein@Home] Computation for task h1_0135.40_O3aC01Cl1In0__O3AS1_135.50Hz_1301_0 finished
Aug 05 22:39:42 Michelle boinc[1730]: 05-Aug-2021 22:39:42 [Einstein@Home] Started upload of h1_0135.40_O3aC01Cl1In0__O3AS1_135.50Hz_1301_0_0
Aug 05 22:39:42 Michelle boinc[1730]: 05-Aug-2021 22:39:42 [Einstein@Home] Started upload of h1_0135.40_O3aC01Cl1In0__O3AS1_135.50Hz_1301_0_1
Aug 05 22:39:43 Michelle boinc[1730]: 05-Aug-2021 22:39:43 [Einstein@Home] Finished upload of h1_0135.40_O3aC01Cl1In0__O3AS1_135.50Hz_1301_0_1
Aug 05 22:39:43 Michelle boinc[1730]: 05-Aug-2021 22:39:43 [Einstein@Home] Started upload of h1_0135.40_O3aC01Cl1In0__O3AS1_135.50Hz_1301_0_2
Aug 05 22:39:44 Michelle boinc[1730]: 05-Aug-2021 22:39:44 [Einstein@Home] Finished upload of h1_0135.40_O3aC01Cl1In0__O3AS1_135.50Hz_1301_0_0
Aug 05 22:39:44 Michelle boinc[1730]: 05-Aug-2021 22:39:44 [Einstein@Home] Finished upload of h1_0135.40_O3aC01Cl1In0__O3AS1_135.50Hz_1301_0_2
Aug 05 22:31:15 Michelle boinc[1730]: 05-Aug-2021 22:31:15 [Einstein@Home] Starting task h1_0135.40_O3aC01Cl1In0__O3AS1_135.50Hz_1309_0
Aug 05 22:31:15 Michelle boinc[1730]: 05-Aug-2021 22:31:15 [Einstein@Home] [cpu_sched] Starting task h1_0135.40_O3aC01Cl1In0__O3AS1_135.50Hz_1309_0 using einstein_O3AS version 101 (GW-opencl-nvidia) in slot 0
Aug 05 22:50:24 Michelle boinc[1730]: 05-Aug-2021 22:50:24 [Einstein@Home] Computation for task h1_0135.40_O3aC01Cl1In0__O3AS1_135.50Hz_1309_0 finished
Aug 05 22:50:26 Michelle boinc[1730]: 05-Aug-2021 22:50:26 [Einstein@Home] Started upload of h1_0135.40_O3aC01Cl1In0__O3AS1_135.50Hz_1309_0_0
Aug 05 22:50:26 Michelle boinc[1730]: 05-Aug-2021 22:50:26 [Einstein@Home] Started upload of h1_0135.40_O3aC01Cl1In0__O3AS1_135.50Hz_1309_0_1
Aug 05 22:50:27 Michelle boinc[1730]: 05-Aug-2021 22:50:27 [Einstein@Home] Finished upload of h1_0135.40_O3aC01Cl1In0__O3AS1_135.50Hz_1309_0_1
Aug 05 22:50:27 Michelle boinc[1730]: 05-Aug-2021 22:50:27 [Einstein@Home] Started upload of h1_0135.40_O3aC01Cl1In0__O3AS1_135.50Hz_1309_0_2
Aug 05 22:50:28 Michelle boinc[1730]: 05-Aug-2021 22:50:28 [Einstein@Home] Finished upload of h1_0135.40_O3aC01Cl1In0__O3AS1_135.50Hz_1309_0_0
Aug 05 22:50:28 Michelle boinc[1730]: 05-Aug-2021 22:50:28 [Einstein@Home] Finished upload of h1_0135.40_O3aC01Cl1In0__O3AS1_135.50Hz_1309_0_2
Aug 05 22:37:58 Michelle boinc[1730]: 05-Aug-2021 22:37:58 [Einstein@Home] Starting task h1_0135.40_O3aC01Cl1In0__O3AS1_135.50Hz_1308_0
Aug 05 22:37:58 Michelle boinc[1730]: 05-Aug-2021 22:37:58 [Einstein@Home] [cpu_sched] Starting task h1_0135.40_O3aC01Cl1In0__O3AS1_135.50Hz_1308_0 using einstein_O3AS version 101 (GW-opencl-nvidia) in slot 5
Aug 05 22:52:17 Michelle boinc[1730]: 05-Aug-2021 22:52:17 [Einstein@Home] Computation for task h1_0135.40_O3aC01Cl1In0__O3AS1_135.50Hz_1308_0 finished
Aug 05 22:52:19 Michelle boinc[1730]: 05-Aug-2021 22:52:19 [Einstein@Home] Started upload of h1_0135.40_O3aC01Cl1In0__O3AS1_135.50Hz_1308_0_0
Aug 05 22:52:19 Michelle boinc[1730]: 05-Aug-2021 22:52:19 [Einstein@Home] Started upload of h1_0135.40_O3aC01Cl1In0__O3AS1_135.50Hz_1308_0_1
Aug 05 22:52:21 Michelle boinc[1730]: 05-Aug-2021 22:52:21 [Einstein@Home] Finished upload of h1_0135.40_O3aC01Cl1In0__O3AS1_135.50Hz_1308_0_1
Aug 05 22:52:21 Michelle boinc[1730]: 05-Aug-2021 22:52:21 [Einstein@Home] Started upload of h1_0135.40_O3aC01Cl1In0__O3AS1_135.50Hz_1308_0_2
Aug 05 22:52:23 Michelle boinc[1730]: 05-Aug-2021 22:52:23 [Einstein@Home] Finished upload of h1_0135.40_O3aC01Cl1In0__O3AS1_135.50Hz_1308_0_0
Aug 05 22:52:23 Michelle boinc[1730]: 05-Aug-2021 22:52:23 [Einstein@Home] Finished upload of h1_0135.40_O3aC01Cl1In0__O3AS1_135.50Hz_1308_0_2
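For reference, the lines above were pulled back out of the journal with something along these lines - assuming the client runs under the systemd unit boinc-client, as on the stock Debian/Ubuntu packages; adjust the unit name and date to suit:

    journalctl -u boinc-client --since "2021-08-05 22:00" | grep "Einstein@Home"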
I saw this instant 'server timeout' behaviour a couple of years ago when a new app was getting tested. I didn't bother digging too hard because it's not something caused by the client.
It seems to be an immediate bail-out mechanism for when the staff realise there's a serious problem with a new app and the plug needs to be pulled straight away. My guess is that the deadline is truncated server-side, so that as soon as a client makes contact, all 'waiting to run' tasks exceed their time limit and are returned as errors. Whilst this does immediately save a lot of wasted crunching, the poor user gets to wear the penalty. The 'in-progress' tasks probably finish normally, but that's the end of it. I don't know if this is exactly what happened this time, but it seems very similar to what I saw previously.
As you later found out, there may have been sufficient 'server-created errors' for you to exceed the daily quota, so you can't even use some new tasks for a different search to get your quota restored.
Whenever anything like this happens to a host of mine, I use the following procedure to get the host crunching again in a very short time (a rough sketch of the first couple of steps follows the list). It works for me since I have no systemd, no complications, a single Einstein project, and everything BOINC-related owned by me personally under a single directory tree. I deliberately cache apps and data files so a fully populated template structure is always ready to use.
Rename the existing BOINC tree to BOINC.save
Copy a template tree to replace it. The template has everything needed including a template state file
From the 'All Computers' list on the website, pick a long retired old host to be a surrogate.
Change its location (venue) to agree with what is now needed if necessary.
Retrieve the 3 critical values (hostid, venue, last RPC+1) and use them in the template state file.
The state file also has <dont_request_more_work/> set to prevent immediate work requests.
Fire up BOINC and when all is looking good, allow new work (cache size 0.05 days to start with).
When crunching is working correctly, adjust work cache, concurrency, etc to last for say 2 days.
When that work is completed, shut down, restore the original BOINC tree and get back to work on the original ID, since it's now well out of the penalty box.
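To make the first couple of steps concrete, it boils down to something like this - the paths are just my own layout, and the state-file details are from memory, so treat it as a sketch rather than a recipe:

    # Park the penalised tree and drop in the pre-populated template.
    mv ~/BOINC ~/BOINC.save
    cp -a ~/BOINC.template ~/BOINC
    # Then, in the <project> block of ~/BOINC/client_state.xml, set the hostid,
    # the venue and the RPC sequence number (last RPC + 1) to the surrogate
    # host's values, and check that <dont_request_more_work/> is present
    # before starting the client.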
Of course, this simple procedure works because the state file only needs the single <project>....</project> block. I guess if you run multiple projects, it's best to forget Einstein while you do something else for a while :-).

Cheers,
Gary.
Thanks Gary.

Just one question first: is this happening to other people as well, or are my two machines outliers with this problem?
I do run other projects as well, so changing the HostID to bypass the quota block isn't so urgent. I've served out my 24 hours, and my house heating is back up to full power thanks to the Gamma-ray pulsar binary search. Now I can think for a bit.
I've cut down my cache to 0.05 + 0.05 days, and turned off Beta apps. That allows FGRPB1G tasks in, but the server won't send me Gravity Wave work. (It's seen that the v1.01 beta is best for my machines, and won't consider the still-active v1.00 production app.)
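(For anyone wanting to copy that, those two numbers correspond to BOINC's 'store at least / store additional' work-buffer preferences; a local override along these lines in global_prefs_override.xml should have the same effect - tag names quoted from memory, so check them against your own file:)

    <global_preferences>
       <work_buf_min_days>0.05</work_buf_min_days>
       <work_buf_additional_days>0.05</work_buf_additional_days>
    </global_preferences>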
I'm going to have a thorough look through the server task logs for one machine. It looks as if various other batches have been issued overnight, and some of them have been accepted, but others have suffered the '(near) instant timeout' problem. BTW, those short deadlines aren't shown in the local manager: the tasks that reach me have the normal 7-day lifespan.
I do have one suspicion about a possible trigger scenario: I'll test that one out later, when I can watch both the local Manager and the server task logs on parallel screens.