Then I'm guessing they haven't implemented the specific change we are talking about.
Changing __constant to __global is most beneficial in the GR app. Using __global also isn't supported in OpenCL <2.0, so if you still have older drivers, then I'm guessing they made other improvements to the GW app processing, which would be welcomed.
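For anyone not familiar with the OpenCL side of that, this is a minimal sketch of the kind of change being discussed - the kernel and argument names here are made up for illustration, not taken from the actual Einstein apps:

    /* Before: the lookup table is bound to __constant memory, whose size is
       capped by the device's constant-buffer limit. */
    __kernel void resample_old(__constant float *coeffs,
                               __global const float *in,
                               __global float *out)
    {
        size_t i = get_global_id(0);
        out[i] = in[i] * coeffs[i % 16];
    }

    /* After: the same table is passed as an ordinary __global buffer, so it no
       longer competes for, or is limited by, constant memory. */
    __kernel void resample_new(__global const float *coeffs,
                               __global const float *in,
                               __global float *out)
    {
        size_t i = get_global_id(0);
        out[i] = in[i] * coeffs[i % 16];
    }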
Do you see a heavy reliance on CPU still, or a noticeable increase in GPU utilization?
Haven't got that far yet. Windows machines are showing heavy CPU use as usual - assumed to be the normal OpenCL spin-wait bug - but possibly less than usual.
But the two Linux machines are both showing a serious problem:
https://einsteinathome.org/host/12788501/tasks/6/56?sort=asc&order=Sent
https://einsteinathome.org/host/12808716/tasks/6/56?sort=asc&order=Sent
All GW tasks (only) are being errored out with a server timeout after 3 minutes, and I woke to an empty cache and a 24-hour punishment lockout. Local logs have cycled, so no detail there - I'll have to dig into the systemd journal for any answers.
It's not a machine breakdown - some gamma-ray pulsar tasks got through and reported normally, as did tasks for other projects.
I'll go into recovery mode with gamma-ray work only, and no beta apps, while I finish waking up.
Well, that didn't work.

06/08/2021 08:21:36 | Einstein@Home | No work sent
06/08/2021 08:21:36 | Einstein@Home | (reached daily quota of 21 tasks)
06/08/2021 08:21:36 | Einstein@Home | [sched_op] Deferring communication for 16:53:02
That problem seems to be limited to Nvidia, as AMD tasks, both 1.00 and 1.01, are executed flawlessly in Ubuntu.

I think it's a server problem, not an app problem - they seemed to run normally here, too. Have you checked your AMD report lists on this website? (Your computers are hidden, so I can't check them from here.)

Yes - no obvious problem there. The scheduler seems to assign 1.00 tasks to validate 1.01 tasks. For me, most 1.01 are pending, some valid already, no invalids thus far.
Did the tasks look like they were running normally during those 3 minutes?
It certainly looks like they are set to an incorrect deadline or something.
Yes - and longer. I extracted the client log from the journal. These are the first four tasks run with v1.01: they overlapped, so I've switched the order round for clarity. Run times of 24 down to 15 minutes are within normal tolerance, but note how they decrease as the older work (not shown) is flushed through the system. The machine has 2x GTX 1660 Ti, running at 0.5 GPU usage - so both cards were fully occupied by the new app by the end.
Aug 05 22:14:43 Michelle boinc[1730]: 05-Aug-2021 22:14:43 [Einstein@Home] Starting task h1_0135.40_O3aC01Cl1In0__O3AS1_135.50Hz_1307_0
Aug 05 22:14:43 Michelle boinc[1730]: 05-Aug-2021 22:14:43 [Einstein@Home] [cpu_sched] Starting task h1_0135.40_O3aC01Cl1In0__O3AS1_135.50Hz_1307_0 using einstein_O3AS version 101 (GW-opencl-nvidia) in slot 5
Aug 05 22:37:56 Michelle boinc[1730]: 05-Aug-2021 22:37:56 [Einstein@Home] Computation for task h1_0135.40_O3aC01Cl1In0__O3AS1_135.50Hz_1307_0 finished
Aug 05 22:37:59 Michelle boinc[1730]: 05-Aug-2021 22:37:59 [Einstein@Home] Started upload of h1_0135.40_O3aC01Cl1In0__O3AS1_135.50Hz_1307_0_0
Aug 05 22:37:59 Michelle boinc[1730]: 05-Aug-2021 22:37:59 [Einstein@Home] Started upload of h1_0135.40_O3aC01Cl1In0__O3AS1_135.50Hz_1307_0_1
Aug 05 22:38:00 Michelle boinc[1730]: 05-Aug-2021 22:38:00 [Einstein@Home] Finished upload of h1_0135.40_O3aC01Cl1In0__O3AS1_135.50Hz_1307_0_1
Aug 05 22:38:00 Michelle boinc[1730]: 05-Aug-2021 22:38:00 [Einstein@Home] Started upload of h1_0135.40_O3aC01Cl1In0__O3AS1_135.50Hz_1307_0_2
Aug 05 22:38:01 Michelle boinc[1730]: 05-Aug-2021 22:38:01 [Einstein@Home] Finished upload of h1_0135.40_O3aC01Cl1In0__O3AS1_135.50Hz_1307_0_0
Aug 05 22:38:01 Michelle boinc[1730]: 05-Aug-2021 22:38:01 [Einstein@Home] Finished upload of h1_0135.40_O3aC01Cl1In0__O3AS1_135.50Hz_1307_0_2
Aug 05 22:25:46 Michelle boinc[1730]: 05-Aug-2021 22:25:46 [Einstein@Home] Starting task h1_0135.40_O3aC01Cl1In0__O3AS1_135.50Hz_1301_0
Aug 05 22:25:46 Michelle boinc[1730]: 05-Aug-2021 22:25:46 [Einstein@Home] [cpu_sched] Starting task h1_0135.40_O3aC01Cl1In0__O3AS1_135.50Hz_1301_0 using einstein_O3AS version 101 (GW-opencl-nvidia) in slot 4
Aug 05 22:39:40 Michelle boinc[1730]: 05-Aug-2021 22:39:40 [Einstein@Home] Computation for task h1_0135.40_O3aC01Cl1In0__O3AS1_135.50Hz_1301_0 finished
Aug 05 22:39:42 Michelle boinc[1730]: 05-Aug-2021 22:39:42 [Einstein@Home] Started upload of h1_0135.40_O3aC01Cl1In0__O3AS1_135.50Hz_1301_0_0
Aug 05 22:39:42 Michelle boinc[1730]: 05-Aug-2021 22:39:42 [Einstein@Home] Started upload of h1_0135.40_O3aC01Cl1In0__O3AS1_135.50Hz_1301_0_1
Aug 05 22:39:43 Michelle boinc[1730]: 05-Aug-2021 22:39:43 [Einstein@Home] Finished upload of h1_0135.40_O3aC01Cl1In0__O3AS1_135.50Hz_1301_0_1
Aug 05 22:39:43 Michelle boinc[1730]: 05-Aug-2021 22:39:43 [Einstein@Home] Started upload of h1_0135.40_O3aC01Cl1In0__O3AS1_135.50Hz_1301_0_2
Aug 05 22:39:44 Michelle boinc[1730]: 05-Aug-2021 22:39:44 [Einstein@Home] Finished upload of h1_0135.40_O3aC01Cl1In0__O3AS1_135.50Hz_1301_0_0
Aug 05 22:39:44 Michelle boinc[1730]: 05-Aug-2021 22:39:44 [Einstein@Home] Finished upload of h1_0135.40_O3aC01Cl1In0__O3AS1_135.50Hz_1301_0_2
Aug 05 22:31:15 Michelle boinc[1730]: 05-Aug-2021 22:31:15 [Einstein@Home] Starting task h1_0135.40_O3aC01Cl1In0__O3AS1_135.50Hz_1309_0
Aug 05 22:31:15 Michelle boinc[1730]: 05-Aug-2021 22:31:15 [Einstein@Home] [cpu_sched] Starting task h1_0135.40_O3aC01Cl1In0__O3AS1_135.50Hz_1309_0 using einstein_O3AS version 101 (GW-opencl-nvidia) in slot 0
Aug 05 22:50:24 Michelle boinc[1730]: 05-Aug-2021 22:50:24 [Einstein@Home] Computation for task h1_0135.40_O3aC01Cl1In0__O3AS1_135.50Hz_1309_0 finished
Aug 05 22:50:26 Michelle boinc[1730]: 05-Aug-2021 22:50:26 [Einstein@Home] Started upload of h1_0135.40_O3aC01Cl1In0__O3AS1_135.50Hz_1309_0_0
Aug 05 22:50:26 Michelle boinc[1730]: 05-Aug-2021 22:50:26 [Einstein@Home] Started upload of h1_0135.40_O3aC01Cl1In0__O3AS1_135.50Hz_1309_0_1
Aug 05 22:50:27 Michelle boinc[1730]: 05-Aug-2021 22:50:27 [Einstein@Home] Finished upload of h1_0135.40_O3aC01Cl1In0__O3AS1_135.50Hz_1309_0_1
Aug 05 22:50:27 Michelle boinc[1730]: 05-Aug-2021 22:50:27 [Einstein@Home] Started upload of h1_0135.40_O3aC01Cl1In0__O3AS1_135.50Hz_1309_0_2
Aug 05 22:50:28 Michelle boinc[1730]: 05-Aug-2021 22:50:28 [Einstein@Home] Finished upload of h1_0135.40_O3aC01Cl1In0__O3AS1_135.50Hz_1309_0_0
Aug 05 22:50:28 Michelle boinc[1730]: 05-Aug-2021 22:50:28 [Einstein@Home] Finished upload of h1_0135.40_O3aC01Cl1In0__O3AS1_135.50Hz_1309_0_2
Aug 05 22:37:58 Michelle boinc[1730]: 05-Aug-2021 22:37:58 [Einstein@Home] Starting task h1_0135.40_O3aC01Cl1In0__O3AS1_135.50Hz_1308_0
Aug 05 22:37:58 Michelle boinc[1730]: 05-Aug-2021 22:37:58 [Einstein@Home] [cpu_sched] Starting task h1_0135.40_O3aC01Cl1In0__O3AS1_135.50Hz_1308_0 using einstein_O3AS version 101 (GW-opencl-nvidia) in slot 5
Aug 05 22:52:17 Michelle boinc[1730]: 05-Aug-2021 22:52:17 [Einstein@Home] Computation for task h1_0135.40_O3aC01Cl1In0__O3AS1_135.50Hz_1308_0 finished
Aug 05 22:52:19 Michelle boinc[1730]: 05-Aug-2021 22:52:19 [Einstein@Home] Started upload of h1_0135.40_O3aC01Cl1In0__O3AS1_135.50Hz_1308_0_0
Aug 05 22:52:19 Michelle boinc[1730]: 05-Aug-2021 22:52:19 [Einstein@Home] Started upload of h1_0135.40_O3aC01Cl1In0__O3AS1_135.50Hz_1308_0_1
Aug 05 22:52:21 Michelle boinc[1730]: 05-Aug-2021 22:52:21 [Einstein@Home] Finished upload of h1_0135.40_O3aC01Cl1In0__O3AS1_135.50Hz_1308_0_1
Aug 05 22:52:21 Michelle boinc[1730]: 05-Aug-2021 22:52:21 [Einstein@Home] Started upload of h1_0135.40_O3aC01Cl1In0__O3AS1_135.50Hz_1308_0_2
Aug 05 22:52:23 Michelle boinc[1730]: 05-Aug-2021 22:52:23 [Einstein@Home] Finished upload of h1_0135.40_O3aC01Cl1In0__O3AS1_135.50Hz_1308_0_0
Aug 05 22:52:23 Michelle boinc[1730]: 05-Aug-2021 22:52:23 [Einstein@Home] Finished upload of h1_0135.40_O3aC01Cl1In0__O3AS1_135.50Hz_1308_0_2
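For reference, the lines above were pulled back out of the journal with something along these lines - assuming the client runs under the systemd unit boinc-client, as on the stock Debian/Ubuntu packages; adjust the unit name and date to suit:

    journalctl -u boinc-client --since "2021-08-05 22:00" | grep "Einstein@Home"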
I saw this instant 'server timeout' behaviour a couple of years ago when a new app was getting tested. I didn't bother digging too hard because it's not something caused by the client.
It seems to be an immediate bail-out mechanism for when the staff realise there's a serious problem with a new app and the plug needs to be pulled straight away. My guess is that the deadline is truncated server-side, so that as soon as a client makes contact, all 'waiting to run' tasks exceed their time limit and are returned as errors. Whilst this does immediately save a lot of wasted crunching, the poor user gets to wear the penalty. The 'in-progress' tasks probably finish normally, but that's the end of it. I don't know if this is exactly what happened this time, but it seems very similar to what I saw previously.
As you later found out, there may have been sufficient 'server-created errors' for you to exceed the daily quota, so you can't even use some new tasks for a different search to get your quota restored.
Whenever anything like this happens to a host of mine, I use the following procedure to get the host crunching again in a very short time (a rough sketch of the first couple of steps follows the list). It works for me since I have no systemd, no complications, a single Einstein project, and everything BOINC-related owned by me personally under a single directory tree. I deliberately cache apps and data files so a fully populated template structure is always ready to use.
Rename the existing BOINC tree to BOINC.save
Copy a template tree to replace it. The template has everything needed including a template state file
From the 'All Computers' list on the website, pick a long retired old host to be a surrogate.
Change its location (venue) to agree with what is now needed if necessary.
Retrieve the 3 critical values (hostid, venue, last RPC+1) and use them in the template state file.
The state file also has <dont_request_more_work/> set to prevent immediate work requests.
Fire up BOINC and when all is looking good, allow new work (cache size 0.05 days to start with).
When crunching is working correctly, adjust work cache, concurrency, etc to last for say 2 days.
When that work is completed, shut down, restore the original BOINC tree and get back to work on the original ID, since it's now well out of the penalty box.
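To make the first couple of steps concrete, it boils down to something like this - the paths are just my own layout, and the state-file details are from memory, so treat it as a sketch rather than a recipe:

    # Park the penalised tree and drop in the pre-populated template.
    mv ~/BOINC ~/BOINC.save
    cp -a ~/BOINC.template ~/BOINC
    # Then, in the <project> block of ~/BOINC/client_state.xml, set the hostid,
    # the venue and the RPC sequence number (last RPC + 1) to the surrogate
    # host's values, and check that <dont_request_more_work/> is present
    # before starting the client.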
Of course, this simple procedure works because the state file only needs the single <project>....</project> block. I guess if you run multiple projects, it's best to forget Einstein while you do something else for a while :-).

Cheers,
Gary.
Thanks Gary.

Just one question first: is this happening to other people as well, or are my two machines outliers with this problem?
I do run other projects as well, so changing the HostID to bypass the quota block isn't so urgent. I've served out my 24 hours, and my house heating is back up to full power thanks to the Gamma-ray pulsar binary search. Now I can think for a bit.
I've cut down my cache to 0.05 + 0.05 days, and turned off Beta apps. That allows FGRPB1G tasks in, but the server won't send me Gravity Wave work. (It's seen that the v1.01 beta is best for my machines, and won't consider the still-active v1.00 production app.)
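(For anyone wanting to copy that, those two numbers correspond to BOINC's 'store at least / store additional' work-buffer preferences; a local override along these lines in global_prefs_override.xml should have the same effect - tag names quoted from memory, so check them against your own file:)

    <global_preferences>
       <work_buf_min_days>0.05</work_buf_min_days>
       <work_buf_additional_days>0.05</work_buf_additional_days>
    </global_preferences>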
I'm going to have a thorough look through the server task logs for one machine. It looks as if various other batches have been issued overnight, and some of them have been accepted, but others have suffered the '(near) instant timeout' problem. BTW, those short deadlines aren't shown in the local manager: the tasks that reach me have the normal 7-day lifespan.
I do have one suspicion about a possible trigger scenario: I'll test that one out later, when I can watch both the local Manager and the server task logs on parallel screens.