Improvements in the code of the clients

Ian&Steve C.
Joined: 19 Jan 20
Posts: 3953
Credit: 46819842642
RAC: 64265824

Then I'm guessing they haven't implemented the specific change we are talking about.

 

Changing __constant to __global is most beneficial in the GR app. Using __global like that also isn't supported in OpenCL < 2.0, so if you still have older drivers, then I'm guessing they made other improvements to the GW app processing, which would be welcomed.
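For illustration only - this is a hypothetical sketch, not the project's actual kernel source (the demod_* names and the lut argument are my own inventions) - the kind of change being discussed looks like this:

// Hypothetical sketch, not the actual Einstein@Home code.
// Old style: the lookup table sits in the __constant address space,
// which is capped by CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE (often 64 KB).
__kernel void demod_old(__constant float *lut,
                        __global const float *in,
                        __global float *out)
{
    size_t i = get_global_id(0);
    out[i] = in[i] * lut[i % 1024u];
}

// New style: the same table passed as plain __global memory - no 64 KB
// cap, and on recent GPUs the reads go through the normal cache path.
__kernel void demod_new(__global const float *lut,
                        __global const float *in,
                        __global float *out)
{
    size_t i = get_global_id(0);
    out[i] = in[i] * lut[i % 1024u];
}

(Kernel arguments like these work on OpenCL 1.x too; what actually requires OpenCL 2.0 is declaring program-scope variables in the __global address space, which I assume is the construct meant by the < 2.0 remark above.)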

 

do you see a heavy reliance on CPU still? or a noticeable increase in GPU utilization?

_________________________________________________________________________

Richard Haselgrove
Joined: 10 Dec 05
Posts: 2143
Credit: 2957023046
RAC: 716581

Ian&Steve C. wrote:

do you see a heavy reliance on CPU still? or a noticeable increase in GPU utilization?

Haven't got that far yet. Windows machines are showing heavy CPU use as usual - assumed to be the normal OpenCL spin-wait bug - but possibly less than usual.

But the two Linux machines are both showing a serious problem:

https://einsteinathome.org/host/12788501/tasks/6/56?sort=asc&order=Sent

https://einsteinathome.org/host/12808716/tasks/6/56?sort=asc&order=Sent

All GW tasks (and only GW tasks) are being errored with a server timeout after 3 minutes, and I woke to an empty cache and a 24-hour punishment lockout. Local logs have cycled, so no detail there - I'll have to dig into the systemd journal for any answers.
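(For anyone wanting to do the same, something along these lines should pull the client messages back out of the journal - assuming the distro-standard boinc-client unit name and that your client logs via systemd:)

journalctl -u boinc-client --since "2021-08-05" | grep 'Einstein@Home'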

It's not a machine breakdown - some gamma-ray pulsar tasks got through and reported normally, as did tasks for other projects.

I'll go into recovery mode with gamma-ray work only, and no beta apps, while I finish waking up.

Richard Haselgrove
Joined: 10 Dec 05
Posts: 2143
Credit: 2957023046
RAC: 716581

Well, that didn't work.

06/08/2021 08:21:36 | Einstein@Home | No work sent
06/08/2021 08:21:36 | Einstein@Home | (reached daily quota of 21 tasks)
06/08/2021 08:21:36 | Einstein@Home | [sched_op] Deferring communication for 16:53:02
 

solling2
Joined: 20 Nov 14
Posts: 219
Credit: 1577557978
RAC: 21778

Richard Haselgrove wrote:

But the two Linux machines are both showing a serious problem:

That problem seems to be limited to Nvidia, as AMD tasks, both 1.00 and 1.01, are executed flawlessly in Ubuntu.

Richard Haselgrove
Joined: 10 Dec 05
Posts: 2143
Credit: 2957023046
RAC: 716581

I think it's a server problem, not an app problem - they seemed to run normally here, too.

Have you checked your AMD report lists on this website? (your computers are hidden, so I can't check them from here)

solling2
Joined: 20 Nov 14
Posts: 219
Credit: 1577557978
RAC: 21778

Yes - no obvious problem there. The scheduler seems to assign 1.00 tasks to validate 1.01 tasks. For me, most 1.01 are pending, some valid already, no invalids thus far.

Ian&Steve C.
Joined: 19 Jan 20
Posts: 3953
Credit: 46819842642
RAC: 64265824

Richard Haselgrove wrote:

Ian&Steve C. wrote:

do you see a heavy reliance on CPU still? or a noticeable increase in GPU utilization?

Haven't got that far yet. Windows machines are showing heavy CPU use as usual - assumed to be the normal OpenCL spin-wait bug - but possibly less than usual.

But the two Linux machines are both showing a serious problem:

https://einsteinathome.org/host/12788501/tasks/6/56?sort=asc&order=Sent

https://einsteinathome.org/host/12808716/tasks/6/56?sort=asc&order=Sent

All GW tasks (and only GW tasks) are being errored with a server timeout after 3 minutes, and I woke to an empty cache and a 24-hour punishment lockout. Local logs have cycled, so no detail there - I'll have to dig into the systemd journal for any answers.

It's not a machine breakdown - some gamma-ray pulsar tasks got through and reported normally, as did tasks for other projects.

I'll go into recovery mode with gamma-ray work only, and no beta apps, while I finish waking up.

did the tasks look like they were running normally during those 3 minutes?

certainly looks like they are set to an incorrect deadline or something.

_________________________________________________________________________

Richard Haselgrove
Joined: 10 Dec 05
Posts: 2143
Credit: 2957023046
RAC: 716581

Ian&Steve C. wrote:

did the tasks look like they were running normally during those 3 minutes?

Yes - and longer. I extracted the client log from the journal. These are the first four tasks run with v1.01: they overlapped, so I've switched the order round for clarity. Run times of 24 down to 15 minutes are within normal tolerance, but note how they decrease as the older work (not shown) is flushed through the system. The machine has 2x GTX 1660 Ti, running at 0.5 GPU usage - so both cards were fully occupied by the new app by the end.

Aug 05 22:14:43 Michelle boinc[1730]: 05-Aug-2021 22:14:43 [Einstein@Home] Starting task h1_0135.40_O3aC01Cl1In0__O3AS1_135.50Hz_1307_0
Aug 05 22:14:43 Michelle boinc[1730]: 05-Aug-2021 22:14:43 [Einstein@Home] [cpu_sched] Starting task h1_0135.40_O3aC01Cl1In0__O3AS1_135.50Hz_1307_0 using einstein_O3AS version 101 (GW-opencl-nvidia) in slot 5
Aug 05 22:37:56 Michelle boinc[1730]: 05-Aug-2021 22:37:56 [Einstein@Home] Computation for task h1_0135.40_O3aC01Cl1In0__O3AS1_135.50Hz_1307_0 finished
Aug 05 22:37:59 Michelle boinc[1730]: 05-Aug-2021 22:37:59 [Einstein@Home] Started upload of h1_0135.40_O3aC01Cl1In0__O3AS1_135.50Hz_1307_0_0
Aug 05 22:37:59 Michelle boinc[1730]: 05-Aug-2021 22:37:59 [Einstein@Home] Started upload of h1_0135.40_O3aC01Cl1In0__O3AS1_135.50Hz_1307_0_1
Aug 05 22:38:00 Michelle boinc[1730]: 05-Aug-2021 22:38:00 [Einstein@Home] Finished upload of h1_0135.40_O3aC01Cl1In0__O3AS1_135.50Hz_1307_0_1
Aug 05 22:38:00 Michelle boinc[1730]: 05-Aug-2021 22:38:00 [Einstein@Home] Started upload of h1_0135.40_O3aC01Cl1In0__O3AS1_135.50Hz_1307_0_2
Aug 05 22:38:01 Michelle boinc[1730]: 05-Aug-2021 22:38:01 [Einstein@Home] Finished upload of h1_0135.40_O3aC01Cl1In0__O3AS1_135.50Hz_1307_0_0
Aug 05 22:38:01 Michelle boinc[1730]: 05-Aug-2021 22:38:01 [Einstein@Home] Finished upload of h1_0135.40_O3aC01Cl1In0__O3AS1_135.50Hz_1307_0_2


Aug 05 22:25:46 Michelle boinc[1730]: 05-Aug-2021 22:25:46 [Einstein@Home] Starting task h1_0135.40_O3aC01Cl1In0__O3AS1_135.50Hz_1301_0
Aug 05 22:25:46 Michelle boinc[1730]: 05-Aug-2021 22:25:46 [Einstein@Home] [cpu_sched] Starting task h1_0135.40_O3aC01Cl1In0__O3AS1_135.50Hz_1301_0 using einstein_O3AS version 101 (GW-opencl-nvidia) in slot 4
Aug 05 22:39:40 Michelle boinc[1730]: 05-Aug-2021 22:39:40 [Einstein@Home] Computation for task h1_0135.40_O3aC01Cl1In0__O3AS1_135.50Hz_1301_0 finished
Aug 05 22:39:42 Michelle boinc[1730]: 05-Aug-2021 22:39:42 [Einstein@Home] Started upload of h1_0135.40_O3aC01Cl1In0__O3AS1_135.50Hz_1301_0_0
Aug 05 22:39:42 Michelle boinc[1730]: 05-Aug-2021 22:39:42 [Einstein@Home] Started upload of h1_0135.40_O3aC01Cl1In0__O3AS1_135.50Hz_1301_0_1
Aug 05 22:39:43 Michelle boinc[1730]: 05-Aug-2021 22:39:43 [Einstein@Home] Finished upload of h1_0135.40_O3aC01Cl1In0__O3AS1_135.50Hz_1301_0_1
Aug 05 22:39:43 Michelle boinc[1730]: 05-Aug-2021 22:39:43 [Einstein@Home] Started upload of h1_0135.40_O3aC01Cl1In0__O3AS1_135.50Hz_1301_0_2
Aug 05 22:39:44 Michelle boinc[1730]: 05-Aug-2021 22:39:44 [Einstein@Home] Finished upload of h1_0135.40_O3aC01Cl1In0__O3AS1_135.50Hz_1301_0_0
Aug 05 22:39:44 Michelle boinc[1730]: 05-Aug-2021 22:39:44 [Einstein@Home] Finished upload of h1_0135.40_O3aC01Cl1In0__O3AS1_135.50Hz_1301_0_2


Aug 05 22:31:15 Michelle boinc[1730]: 05-Aug-2021 22:31:15 [Einstein@Home] Starting task h1_0135.40_O3aC01Cl1In0__O3AS1_135.50Hz_1309_0
Aug 05 22:31:15 Michelle boinc[1730]: 05-Aug-2021 22:31:15 [Einstein@Home] [cpu_sched] Starting task h1_0135.40_O3aC01Cl1In0__O3AS1_135.50Hz_1309_0 using einstein_O3AS version 101 (GW-opencl-nvidia) in slot 0
Aug 05 22:50:24 Michelle boinc[1730]: 05-Aug-2021 22:50:24 [Einstein@Home] Computation for task h1_0135.40_O3aC01Cl1In0__O3AS1_135.50Hz_1309_0 finished
Aug 05 22:50:26 Michelle boinc[1730]: 05-Aug-2021 22:50:26 [Einstein@Home] Started upload of h1_0135.40_O3aC01Cl1In0__O3AS1_135.50Hz_1309_0_0
Aug 05 22:50:26 Michelle boinc[1730]: 05-Aug-2021 22:50:26 [Einstein@Home] Started upload of h1_0135.40_O3aC01Cl1In0__O3AS1_135.50Hz_1309_0_1
Aug 05 22:50:27 Michelle boinc[1730]: 05-Aug-2021 22:50:27 [Einstein@Home] Finished upload of h1_0135.40_O3aC01Cl1In0__O3AS1_135.50Hz_1309_0_1
Aug 05 22:50:27 Michelle boinc[1730]: 05-Aug-2021 22:50:27 [Einstein@Home] Started upload of h1_0135.40_O3aC01Cl1In0__O3AS1_135.50Hz_1309_0_2
Aug 05 22:50:28 Michelle boinc[1730]: 05-Aug-2021 22:50:28 [Einstein@Home] Finished upload of h1_0135.40_O3aC01Cl1In0__O3AS1_135.50Hz_1309_0_0
Aug 05 22:50:28 Michelle boinc[1730]: 05-Aug-2021 22:50:28 [Einstein@Home] Finished upload of h1_0135.40_O3aC01Cl1In0__O3AS1_135.50Hz_1309_0_2


Aug 05 22:37:58 Michelle boinc[1730]: 05-Aug-2021 22:37:58 [Einstein@Home] Starting task h1_0135.40_O3aC01Cl1In0__O3AS1_135.50Hz_1308_0
Aug 05 22:37:58 Michelle boinc[1730]: 05-Aug-2021 22:37:58 [Einstein@Home] [cpu_sched] Starting task h1_0135.40_O3aC01Cl1In0__O3AS1_135.50Hz_1308_0 using einstein_O3AS version 101 (GW-opencl-nvidia) in slot 5
Aug 05 22:52:17 Michelle boinc[1730]: 05-Aug-2021 22:52:17 [Einstein@Home] Computation for task h1_0135.40_O3aC01Cl1In0__O3AS1_135.50Hz_1308_0 finished
Aug 05 22:52:19 Michelle boinc[1730]: 05-Aug-2021 22:52:19 [Einstein@Home] Started upload of h1_0135.40_O3aC01Cl1In0__O3AS1_135.50Hz_1308_0_0
Aug 05 22:52:19 Michelle boinc[1730]: 05-Aug-2021 22:52:19 [Einstein@Home] Started upload of h1_0135.40_O3aC01Cl1In0__O3AS1_135.50Hz_1308_0_1
Aug 05 22:52:21 Michelle boinc[1730]: 05-Aug-2021 22:52:21 [Einstein@Home] Finished upload of h1_0135.40_O3aC01Cl1In0__O3AS1_135.50Hz_1308_0_1
Aug 05 22:52:21 Michelle boinc[1730]: 05-Aug-2021 22:52:21 [Einstein@Home] Started upload of h1_0135.40_O3aC01Cl1In0__O3AS1_135.50Hz_1308_0_2
Aug 05 22:52:23 Michelle boinc[1730]: 05-Aug-2021 22:52:23 [Einstein@Home] Finished upload of h1_0135.40_O3aC01Cl1In0__O3AS1_135.50Hz_1308_0_0
Aug 05 22:52:23 Michelle boinc[1730]: 05-Aug-2021 22:52:23 [Einstein@Home] Finished upload of h1_0135.40_O3aC01Cl1In0__O3AS1_135.50Hz_1308_0_2


Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5872
Credit: 117581966525
RAC: 35188908

Richard Haselgrove wrote:
All GW tasks (and only GW tasks) are being errored with a server timeout after 3 minutes, and I woke to an empty cache and a 24-hour punishment lockout. Local logs have cycled, so no detail there - I'll have to dig into the systemd journal for any answers.

I saw this a couple of years ago when a new app was getting tested.  I didn't bother digging too hard because it's not something caused by the client.

It seems like it's an immediate bail-out mechanism when the staff realise there's a serious problem with a new app and the plug needs to be pulled immediately.  My guess is that the deadline is truncated server-side so that as soon as a client makes contact, all 'waiting to run' tasks will have a time limit exceeded and be returned as errors.  Whilst this does immediately save a lot of wasted crunching, the poor user gets to wear the penalty.  The 'in-progress' tasks probably finish normally but that's the end of it.  I don't know if this is exactly what happened this time but it seems very similar to what I saw previously.

As you later found out, there may have been sufficient 'server created errors' for you to exceed the daily quota, so you can't even use new tasks for a different search to get your quota restored.

Whenever anything like this happens to a host of mine, I use the following procedure to get the host crunching again in a very short time.  It works for me since I have no systemd, no complications, a single Einstein project, and everything BOINC-related owned by me personally under a single directory tree.  I deliberately cache apps and data files so a fully populated template structure is always ready to use.

  1. Rename the existing BOINC tree to BOINC.save
  2. Copy a template tree to replace it.  The template has everything needed including a template state file
  3. From the 'All Computers' list on the website, pick a long retired old host to be a surrogate.
  4. If necessary, change its location (venue) to agree with what is now needed.
  5. Retrieve the 3 critical values (hostid, venue, last RPC+1) and use them in the template state file.
  6. The state file also has <dont_request_more_work/> set to prevent immediate work requests.
  7. Fire up BOINC and when all is looking good, allow new work (cache size 0.05 days to start with).
  8. When crunching is working correctly, adjust work cache, concurrency, etc to last for say 2 days.
  9. When that work is completed, shut down, restore the original BOINC tree and get back to work on the original ID, since it's now well out of the penalty box.

Of course, this simple procedure works because the state file only needs the single <project>....</project> block.  I guess if you run multiple projects, it's best to forget Einstein while you do something else for a while :-).
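As a rough illustration only (the real client_state.xml carries many more fields, and these values are placeholders, not real ones), the edited part of the template's project block would look something like:

<project>
    <master_url>https://einsteinathome.org/</master_url>
    <project_name>Einstein@Home</project_name>
    <hostid>1234567</hostid>                 <!-- surrogate host's ID -->
    <host_venue>home</host_venue>            <!-- venue, if one is set -->
    <rpc_seqno>101</rpc_seqno>               <!-- surrogate's last RPC + 1 -->
    <dont_request_more_work/>                <!-- no work fetch until ready -->
</project>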

Cheers,
Gary.

Richard Haselgrove
Joined: 10 Dec 05
Posts: 2143
Credit: 2957023046
RAC: 716581

Thanks Gary.

Just one question first: is this happening to other people as well, or are my two machines outliers with this problem?

I do run other projects as well, so changing the HostID to bypass the quota block isn't so urgent. I've served out my 24 hours, and my house heating is back up to full power thanks to the Gamma-ray pulsar binary search. Now I can think for a bit.

I've cut down my cache to 0.05 + 0.05 days, and turned off Beta apps. That allows FGRPB1G tasks in, but the server won't send me Gravity Wave work. (It's seen that the v1.01 beta is best for my machines, and won't consider the still-active v1.00 production app.)

I'm going to have a thorough look through the server task logs for one machine. It looks as if various other batches have been issued overnight, and some of them have been accepted, but others have suffered the '(near) instant timeout' problem. BTW, those short deadlines aren't shown in the local manager: those that reach me have the normal 7-day lifespan.

I do have one suspicion about a possible trigger scenario: I'll test that one out later, when I can watch both the local Manager and the server task logs on parallel screens.
