I would think the simplest solution would be to undo whatever parameter changed between the LATeah3001L00 WUs and the previously working LATeah2049Lag WU set.
So far, we do not know there is anything at all wrong with the tasks. We do know that the current application is reacting to those tasks with what, as someone who actually supervised code writing, I called "an ungraceful error message". That was an inside joke, as, of course, instead of an error message we got categorically unacceptable behavior.
Even if the tasks contain an error, the application needs to be fixed so as not to respond to such an error by either of the two failure syndromes we have seen. My personal bet is that the tasks do not contain an error, but that is a guess.
I wish the project people good fortune in their debugging and fixing.
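Purely as an illustration of what I mean by graceful (this is my own sketch, not the project's code, and check_input() is a made-up validator): validate the data file up front and exit through the standard BOINC boinc_finish() call with an error status, rather than letting bad input drive the run into a crash or a hang.

    #include "boinc_api.h"   // boinc_init() / boinc_finish() from the BOINC API
    #include <cstdio>

    // Hypothetical sanity check: here just "file exists and is non-empty";
    // a real test would inspect the LATeah header contents.
    static bool check_input(const char* path) {
        std::FILE* f = std::fopen(path, "rb");
        if (!f) return false;
        std::fseek(f, 0, SEEK_END);
        long size = std::ftell(f);
        std::fclose(f);
        return size > 0;
    }

    int main() {
        boinc_init();
        if (!check_input("LATeah3001L00.dat")) {
            // Graceful failure: say what went wrong and report an error exit
            // status to the client, instead of crashing or locking up the GPU.
            std::fprintf(stderr, "data file failed sanity check, aborting\n");
            boinc_finish(1);
        }
        // ... normal search and follow-up stage here ...
        boinc_finish(0);
        return 0;
    }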
I tend to agree that there isn't likely an "error", if one wants to be pedantic about verbiage. However, when you look at the scope of the issue from a higher level, you can see that the only thing that changed was the task itself. Same app, same GPU, same software, just new tasks. That leads us to conclude that the root of the issue at hand is *something* in these new tasks. Whether you classify that *something* as an "error" or "tweak" or "adjustment" or whatever else is pretty trivial.
Another, somewhat orthogonal, data point. Make of it what you will.
Like others, I disabled requests for 'Gamma-ray pulsar binary search #1 on GPUs (FGRPB1G)' on machines with NVidia Ampere cards, when the LATeah3001L00 tasks started showing problems. I re-enabled FGRPB1G when Bernd excluded cc7.0+ from distribution.
But now I get LATeah3001L00 tasks for my intel_gpu as well, with the v1.22 (FGRPopencl-intel_gpu) app. Just now, the machine I'm typing this on became completely unresponsive. I struggled to make an orderly shut-down and restart. All seems normal, but I found this in the log:
26/01/2021 17:50:26 | Einstein@Home | [cpu_sched] Restarting task LATeah3001L00_532.0_0_0.0_7392816_1 using hsgamma_FGRPB1G version 122 (FGRPopencl-intel_gpu) in slot 4
26/01/2021 17:50:39 | Einstein@Home | Computation for task LATeah3001L00_532.0_0_0.0_7392816_1 finished
26/01/2021 18:01:13 | Einstein@Home | update requested by user
26/01/2021 18:01:19 | Einstein@Home | [sched_op] handle_scheduler_reply(): got ack for task LATeah3001L00_532.0_0_0.0_7392816_1
But the std_err for task 1063477707 shows a different story:
17:39:10 (1536): called boinc_finish
That call was never completed or logged. May be a red herring: may be a clue. I'll leave you to decide.
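To show why I think that line matters, here is my guess at the pattern - a sketch only, not the actual FGRP source. The stderr message appears as boinc_finish is invoked, before the finish work actually completes, so a hang in that final stretch would leave the message in std_err without the client ever seeing the task complete:

    #include "boinc_api.h"
    #include <cstdio>

    int main() {
        boinc_init();
        // ... the whole search and follow-up stage run here ...

        // The stderr line is written as boinc_finish is invoked, *before* the
        // finish work actually completes:
        std::fprintf(stderr, "called boinc_finish\n");
        std::fflush(stderr);
        // boinc_finish() finalizes the result, tells the client the task is
        // done and exits the process. If anything stalls in that final stretch
        // (or in a GPU/driver teardown just before it), the line above is the
        // last thing ever written.
        boinc_finish(0);
    }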
Afterthought - I remembered I have Process Lasso active to hold the intel_gpu apps at real time (sic!) priority. That may explain the unresponsive host. Has anyone noticed this task group taking much longer than usual in the end-of-run cleanup phase?
The other notable difference is that the 'followup stage' now lasts for considerably longer. You should have observed that the previous tasks used to pause at ~89.997% for a second or so and then immediately jump to 100% and finish. The new tasks will now pause for around 20-50 secs at ~89.997% before jumping to 100%.
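For what it's worth, my guess at the progress-reporting pattern behind that odd figure - a sketch only, where the 90/10 split, process_template() and run_followup_stage() are assumptions and boinc_fraction_done() is the standard BOINC call. The main search accounts for roughly 90% of the reported fraction and the follow-up stage is only credited when it finishes, so the bar parks just below 90% for as long as the follow-up takes, then jumps to 100%:

    #include "boinc_api.h"
    #include <thread>
    #include <chrono>

    // Stand-ins for the real work, just to make the timing visible.
    static void process_template(long) { std::this_thread::sleep_for(std::chrono::milliseconds(1)); }
    static void run_followup_stage()   { std::this_thread::sleep_for(std::chrono::seconds(30)); }

    int main() {
        boinc_init();
        const long n_templates = 100000;
        const double MAIN_WEIGHT = 0.90;       // assumed split between the two stages
        for (long i = 0; i < n_templates; i++) {
            process_template(i);
            // Reported progress tops out just under 90% as the loop finishes...
            boinc_fraction_done(MAIN_WEIGHT * double(i + 1) / double(n_templates));
        }
        run_followup_stage();                  // ~1 s on the old tasks, 20-50 s on the new ones
        boinc_fraction_done(1.0);              // ...then jumps straight to 100%
        boinc_finish(0);
    }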
Quote: Has anyone noticed this task group taking much longer than usual in the end-of-run cleanup phase?
Yes, already commented on by Gary Roberts in his post on the new task set.
https://einsteinathome.org/content/some-points-interest-grp-gpu-tasks-using-new-lateah3001l00dat-data-file
Thanks, Keith. The 20-50 secs mentioned was for relatively capable discrete GPUs. I have no idea how long the followup stage might take for an Intel GPU.
Cheers,
Gary.
Yes, I agree. In Richard's case, using the iGPU in his host might in fact make the host "appear" to hang for a very long time in the finish-up phase, with the shared resources between the iGPU and the CPU being overcommitted.
If I catch one in the act (when I'm not trying to compose a complicated email about BOINC testing...), I'll try to time it.
Thanks for the replies. I'll stop and have another cup of coffee if it happens again.
We think we found the problem and fixed it. There's a new app version 1.23 available for Beta test. It's been a while since we built new FGRP app binaries, and our build system is a bit rusty. There are only new versions for OSX and Linux so far; we are still working on the Windows build.
BM
Thanks for the heads up. I allowed one of my systems to try the new tasks/app. Will report back.
Well, the app seems to work, and the tasks do process; I've even had a few validate already.
However, the app doesn't seem well optimized. I only see about 85-90% GPU utilization on my RTX 20-series cards, and fairly large utilization swings of 77-99% (cyclical, repeating) on my RTX 3070. There is performance being left on the table. The old app pegged all my cards at 97-98% for the whole run; can you work on some better optimization before the app comes out of beta?
What did you need to change in the app? Just out of curiosity.
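If it helps the debugging, this is the sort of thing I'd run alongside a task to put hard numbers on those swings - a little NVML poller of my own (nothing from the project code, GPU index 0 assumed), one sample per second to CSV, built with -lnvidia-ml:

    #include <nvml.h>
    #include <cstdio>
    #include <thread>
    #include <chrono>

    int main() {
        if (nvmlInit() != NVML_SUCCESS) return 1;
        nvmlDevice_t dev;
        if (nvmlDeviceGetHandleByIndex(0, &dev) != NVML_SUCCESS) return 1;   // GPU 0
        std::printf("sec,gpu_util,mem_util\n");
        for (int s = 0; s < 600; s++) {                      // ~10 minutes of samples
            nvmlUtilization_t u;
            if (nvmlDeviceGetUtilizationRates(dev, &u) == NVML_SUCCESS)
                std::printf("%d,%u,%u\n", s, u.gpu, u.memory);
            std::this_thread::sleep_for(std::chrono::seconds(1));
        }
        nvmlShutdown();
        return 0;
    }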