Gamma ray GPU tasks hanging?

archae86
Joined: 6 Dec 05
Posts: 3145
Credit: 7024314931
RAC: 1806639

Keith Myers wrote:

I would think the simplest solution would be to undo whatever parameter changed between the LATeah3001L00 WU's and the previously working LATeah2049Lag WU set.

So far, we do not know there is anything at all wrong with the tasks.  We do know that the current application is reacting to those tasks with what, as someone who actually supervised code writing, I called "an ungraceful error message".  That was an inside joke, as, of course, instead of an error message we got categorically unacceptable behavior.

Even if the tasks contain an error, the application needs to be fixed so as not to respond to such an error by either of the two failure syndromes we have seen.  My personal bet is that the tasks do not contain an error, but that is a guess.

I wish the project people good fortune in their debugging and fixing.

Ian&Steve C.
Joined: 19 Jan 20
Posts: 3681
Credit: 33840122817
RAC: 37128014

archae86 wrote:

So far, we do not know there is anything at all wrong with the tasks.  We do know that the current application is reacting to those tasks with what, as someone who actually supervised code writing, I called "an ungraceful error message".  That was an inside joke, as, of course, instead of an error message we got categorically unacceptable behavior.

Even if the tasks contain an error, the application needs to be fixed so as not to respond to such an error by either of the two failure syndromes we have seen.  My personal bet is that the tasks do not contain an error, but that is a guess.

I wish the project people good fortune in their debugging and fixing.

I tend to agree that there isn't likely an "error" if one wants to be pedantic about verbiage. However, when you look at the scope of the issue from a higher level, you can see that the only thing that changed was the task itself. Same app, same GPU, same software, just new tasks. That leads us to conclude that the root of the issue at hand is *something* in these new tasks. Whether you classify that *something* as an "error" or "tweak" or "adjustment" or whatever else is pretty trivial.

Richard Haselgrove
Joined: 10 Dec 05
Posts: 2139
Credit: 2752864592
RAC: 1395921

Another, somewhat orthogonal, data point. Make of it what you will.

Like others, I disabled requests for 'Gamma-ray pulsar binary search #1 on GPUs (FGRPB1G)' on machines with NVIDIA Ampere cards when the LATeah3001L00 tasks started showing problems. I re-enabled FGRPB1G when Bernd excluded cc7.0+ from distribution.

But now I get LATeah3001L00 tasks for my intel_gpu as well, with the v1.22 (FGRPopencl-intel_gpu) app. Just now, the machine I'm typing this on became completely unresponsive. I struggled to make an orderly shut-down and restart. All seems normal, but I found this in the log:

26/01/2021 17:50:26 | Einstein@Home | [cpu_sched] Restarting task LATeah3001L00_532.0_0_0.0_7392816_1 using hsgamma_FGRPB1G version 122 (FGRPopencl-intel_gpu) in slot 4
26/01/2021 17:50:39 | Einstein@Home | Computation for task LATeah3001L00_532.0_0_0.0_7392816_1 finished
26/01/2021 18:01:13 | Einstein@Home | update requested by user
26/01/2021 18:01:19 | Einstein@Home | [sched_op] handle_scheduler_reply(): got ack for task LATeah3001L00_532.0_0_0.0_7392816_1
 

But the std_err for task 1063477707 shows a different story: 

17:39:10 (1536): called boinc_finish

That call was never completed or logged. May be a red herring: may be a clue. I'll leave you to decide.

Afterthought - remembered I have Process Lasso active to hold the intel_gpu apps at real time (sic!) priority. That may explain the unresponsive host. Has anyone noticed this task group taking much longer than usual in the end-of-run cleanup phase?
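
As a rough cross-check of the timestamps above, here is a minimal Python sketch that parses event-log lines in the format shown and reports the gap between the restart and the finish message. The two sample lines are copied from the excerpt; the helper names (`parse_line`, `restart_to_finish_seconds`) are made up for illustration, and the day/month/year timestamp format is an assumption based on this particular log (other locales may print dates differently).

```python
from datetime import datetime

# Two lines copied from the event log excerpt in this post.
LOG = """\
26/01/2021 17:50:26 | Einstein@Home | [cpu_sched] Restarting task LATeah3001L00_532.0_0_0.0_7392816_1 using hsgamma_FGRPB1G version 122 (FGRPopencl-intel_gpu) in slot 4
26/01/2021 17:50:39 | Einstein@Home | Computation for task LATeah3001L00_532.0_0_0.0_7392816_1 finished
"""


def parse_line(line):
    """Split one event-log line into (timestamp, project, message).

    Assumes the 'DD/MM/YYYY HH:MM:SS | project | message' layout seen above.
    """
    stamp, project, message = (part.strip() for part in line.split("|", 2))
    return datetime.strptime(stamp, "%d/%m/%Y %H:%M:%S"), project, message


def restart_to_finish_seconds(log_text, task):
    """Seconds between the last 'Restarting task' line and the 'finished' line.

    Returns None if either line is missing for the given task name.
    """
    restarted = finished = None
    for line in log_text.splitlines():
        # Skip lines about other tasks or lines not in the expected layout.
        if task not in line or line.count("|") < 2:
            continue
        when, _, message = parse_line(line)
        if "Restarting task" in message:
            restarted = when
        elif message.startswith("Computation for task") and message.endswith("finished"):
            finished = when
    if restarted is None or finished is None:
        return None
    return (finished - restarted).total_seconds()


print(restart_to_finish_seconds(LOG, "LATeah3001L00_532.0_0_0.0_7392816_1"))
```

For the two lines quoted above this reports 13 seconds between restart and finish, which sits oddly next to a boinc_finish call stamped 17:39:10 in the std_err, i.e. before the logged restart. Whether that means anything is still open.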

Keith Myers
Joined: 11 Feb 11
Posts: 4704
Credit: 17547920399
RAC: 6425602

Quote:
 Has anyone noticed this task group taking much longer than usual in the end-of-run cleanup phase?

Yes, already commented on by Gary Roberts in his post on the new task set.

https://einsteinathome.org/content/some-points-interest-grp-gpu-tasks-using-new-lateah3001l00dat-data-file

Quote:
The other notable difference is that the 'followup stage' now lasts for considerably longer.  You should have observed that the previous tasks used to pause at ~89.997% for a second or so and then immediately jump to 100% and finish.  The new tasks will now pause for around 20-50 secs at ~89.997% before jumping to 100%.

 

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5842
Credit: 109404667966
RAC: 35475759

Keith Myers wrote:

Quote:
 Has anyone noticed this task group taking much longer than usual in the end-of-run cleanup phase?

Yes, already commented on by Gary Roberts in his post on the new task set.

https://einsteinathome.org/content/some-points-interest-grp-gpu-tasks-using-new-lateah3001l00dat-data-file

Quote:
The other notable difference is that the 'followup stage' now lasts for considerably longer.  You should have observed that the previous tasks used to pause at ~89.997% for a second or so and then immediately jump to 100% and finish.  The new tasks will now pause for around 20-50 secs at ~89.997% before jumping to 100%.

Thanks, Keith.  The 20-50 secs mentioned was for relatively capable discrete GPUs.  I have no idea how long the followup stage might take for an Intel GPU.

Cheers,
Gary.
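
To put a number on the followup stage for a given host, one could record (elapsed seconds, fraction done) pairs while a task runs and measure how long progress sits at the plateau. The sketch below assumes such samples already exist; how they are collected is left open (polling `boinccmd --get_tasks` and reading its fraction-done field is one possibility, but that is an assumption and not something tested here). The function name, the tolerance, and the sample values are all hypothetical.

```python
def plateau_seconds(samples, level=0.89997, tol=1e-4):
    """Return how long progress stayed within `tol` of `level`.

    samples: list of (seconds_since_start, fraction_done), in time order.
    The plateau is timed from the first sample at the level to the first
    sample after progress leaves it, so the result is only as precise as
    the sampling interval.
    """
    start = None
    for t, frac in samples:
        at_level = abs(frac - level) <= tol
        if at_level and start is None:
            start = t                # first sample on the plateau
        elif not at_level and start is not None:
            return t - start         # first sample after leaving it
    if start is None:
        return 0.0                   # plateau never observed
    return samples[-1][0] - start    # still on the plateau at the last sample


# Hypothetical samples, taken every few seconds near the end of a task.
SAMPLES = [(0, 0.85), (5, 0.89997), (10, 0.89997), (35, 0.89997), (40, 1.0)]
print(plateau_seconds(SAMPLES))
```

With a coarse polling interval, the result can be off by up to one interval at each end, so short pauses like the old "second or so" would need much finer sampling.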

Keith Myers
Joined: 11 Feb 11
Posts: 4704
Credit: 17547920399
RAC: 6425602

Yes, I agree. In Richard's case, using the iGPU in his host might in fact make the host "appear" to hang for a very long time in the finish-up phase, with the resources shared between the iGPU and the CPU being overcommitted.

 

Richard Haselgrove
Joined: 10 Dec 05
Posts: 2139
Credit: 2752864592
RAC: 1395921

If I catch one in the act (when I'm not trying to compose a complicated email about BOINC testing...), I'll try to time it.

Thanks for the replies. I'll stop and have another cup of coffee if it happens again.

Bernd Machenschalk
Moderator
Administrator
Joined: 15 Oct 04
Posts: 4267
Credit: 244930893
RAC: 16503

We think we found the problem and fixed it. There's a new app version 1.23 available for Beta test. It's been a while since we built new FGRP app binaries, and our build system is a bit rusty. There are only new versions for OSX and Linux so far; we are still working on the Windows build.

BM

Ian&Steve C.
Joined: 19 Jan 20
Posts: 3681
Credit: 33840122817
RAC: 37128014

Thanks for the heads up. I allowed one of my systems to try the new tasks and app. Will report back.

 


Ian&Steve C.
Joined: 19 Jan 20
Posts: 3681
Credit: 33840122817
RAC: 37128014

Bernd Machenschalk wrote:

We think we found the problem and fixed it. There's a new app version 1.23 available for Beta test. It's been a while since we built new FGRP app binaries, and our build system is a bit rusty. There are only new versions for OSX and Linux so far; we are still working on the Windows build.

Well, the app seems to work: the tasks process, and I've even had a few validate already.

However, the app doesn't seem well optimized. I only see about 85-90% GPU utilization on my RTX 20-series cards, and fairly large, cyclical utilization swings (77-99%) on my RTX 3070. There is performance being left on the table. The old app pegged all my cards at 97-98% for the whole run. Can you work on some better optimization before the app comes out of beta?

Just out of curiosity, what did you need to change in the app?
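
For anyone wanting to compare utilization between app versions, a minimal sketch of how the swings mentioned above could be summarized from a list of sampled percentages. The samples here are hypothetical, not measurements from this thread; on NVIDIA cards they could come from something like `nvidia-smi --query-gpu=utilization.gpu --format=csv,noheader,nounits` run at a fixed interval, though that collection step is assumed and not shown. The function name `summarize_utilization` is made up for illustration.

```python
from statistics import mean


def summarize_utilization(samples):
    """Summarize GPU utilization percentages sampled at a fixed interval."""
    if not samples:
        raise ValueError("need at least one utilization sample")
    low, high = min(samples), max(samples)
    return {
        "mean": mean(samples),   # average utilization over the run
        "min": low,
        "max": high,
        "swing": high - low,     # peak-to-trough range of the cycle
    }


# Hypothetical samples resembling a cyclical 77-99% pattern.
SAMPLES = [77, 85, 99, 92, 78, 99]
print(summarize_utilization(SAMPLES))
```

Running the same summary against both the old and the new app on the same card would make the "performance left on the table" claim easier to compare across hosts.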
