I would think the simplest solution would be to undo whatever parameter changed between the LATeah3001L00 WUs and the previously working LATeah2049Lag WU set.
So far, we do not know there is anything at all wrong with the tasks. We do know that the current application is reacting to those tasks with what, as someone who actually supervised code writing, I called "an ungraceful error message". That was an inside joke, as, of course, instead of an error message we got categorically unacceptable behavior.
Even if the tasks contain an error, the application needs to be fixed so as not to respond to such an error by either of the two failure syndromes we have seen. My personal bet is that the tasks do not contain an error, but that is a guess.
I wish the project people good fortune in their debugging and fixing.
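Purely as an illustration of what I mean by graceful (this is my own sketch, not the project's code, and check_input() is a made-up validator): validate the data file up front and exit through the standard BOINC boinc_finish() call with an error status, rather than letting bad input drive the run into a crash or a hang.

    #include "boinc_api.h"   // boinc_init() / boinc_finish() from the BOINC API
    #include <cstdio>

    // Hypothetical sanity check: here just "file exists and is non-empty";
    // a real test would inspect the LATeah header contents.
    static bool check_input(const char* path) {
        std::FILE* f = std::fopen(path, "rb");
        if (!f) return false;
        std::fseek(f, 0, SEEK_END);
        long size = std::ftell(f);
        std::fclose(f);
        return size > 0;
    }

    int main() {
        boinc_init();
        if (!check_input("LATeah3001L00.dat")) {
            // Graceful failure: say what went wrong and report an error exit
            // status to the client, instead of crashing or locking up the GPU.
            std::fprintf(stderr, "data file failed sanity check, aborting\n");
            boinc_finish(1);
        }
        // ... normal search and follow-up stage here ...
        boinc_finish(0);
        return 0;
    }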
I tend to agree that there isn't likely an "error", if one wants to be pedantic about verbiage. However, when you look at the scope of the issue from a higher level, you can see that the only thing that changed was the task itself. Same app, same GPU, same software, just new tasks. That leads us to conclude that the root of the issue at hand is *something* in these new tasks. Whether you classify that *something* as an "error" or "tweak" or "adjustment" or whatever else is pretty trivial.
Another, somewhat orthogonal, data point. Make of it what you will.
Like others, I disabled requests for 'Gamma-ray pulsar binary search #1 on GPUs (FGRPB1G)' on machines with NVidia Ampere cards, when the LATeah3001L00 tasks started showing problems. I re-enabled FGRPB1G when Bernd excluded cc7.0+ from distribution.
But now I get LATeah3001L00 tasks for my intel_gpu as well, with the v1.22 (FGRPopencl-intel_gpu) app. Just now, the machine I'm typing this on became completely unresponsive. I struggled to make an orderly shut-down and restart. All seems normal, but I found this in the log:
26/01/2021 17:50:26 | Einstein@Home | [cpu_sched] Restarting task LATeah3001L00_532.0_0_0.0_7392816_1 using hsgamma_FGRPB1G version 122 (FGRPopencl-intel_gpu) in slot 4
26/01/2021 17:50:39 | Einstein@Home | Computation for task LATeah3001L00_532.0_0_0.0_7392816_1 finished
26/01/2021 18:01:13 | Einstein@Home | update requested by user
26/01/2021 18:01:19 | Einstein@Home | [sched_op] handle_scheduler_reply(): got ack for task LATeah3001L00_532.0_0_0.0_7392816_1
But the std_err for task 1063477707 shows a different story:
17:39:10 (1536): called boinc_finish
That call was never completed or logged. May be a red herring: may be a clue. I'll leave you to decide.
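To show why I think that line matters, here is my guess at the pattern - a sketch only, not the actual FGRP source. The stderr message appears as boinc_finish is invoked, before the finish work actually completes, so a hang in that final stretch would leave the message in std_err without the client ever seeing the task complete:

    #include "boinc_api.h"
    #include <cstdio>

    int main() {
        boinc_init();
        // ... the whole search and follow-up stage run here ...

        // The stderr line is written as boinc_finish is invoked, *before* the
        // finish work actually completes:
        std::fprintf(stderr, "called boinc_finish\n");
        std::fflush(stderr);
        // boinc_finish() finalizes the result, tells the client the task is
        // done and exits the process. If anything stalls in that final stretch
        // (or in a GPU/driver teardown just before it), the line above is the
        // last thing ever written.
        boinc_finish(0);
    }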
Afterthought - I remembered I have Process Lasso active to hold the intel_gpu apps at real time (sic!) priority. That may explain the unresponsive host. Has anyone noticed this task group taking much longer than usual in the end-of-run cleanup phase?
The other notable difference is that the 'followup stage' now lasts for considerably longer. You should have observed that the previous tasks used to pause at ~89.997% for a second or so and then immediately jump to 100% and finish. The new tasks will now pause for around 20-50 secs at ~89.997% before jumping to 100%.
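For what it's worth, my guess at the progress-reporting pattern behind that odd figure - a sketch only, where the 90/10 split, process_template() and run_followup_stage() are assumptions and boinc_fraction_done() is the standard BOINC call. The main search accounts for roughly 90% of the reported fraction and the follow-up stage is only credited when it finishes, so the bar parks just below 90% for as long as the follow-up takes, then jumps to 100%:

    #include "boinc_api.h"
    #include <thread>
    #include <chrono>

    // Stand-ins for the real work, just to make the timing visible.
    static void process_template(long) { std::this_thread::sleep_for(std::chrono::milliseconds(1)); }
    static void run_followup_stage()   { std::this_thread::sleep_for(std::chrono::seconds(30)); }

    int main() {
        boinc_init();
        const long n_templates = 100000;
        const double MAIN_WEIGHT = 0.90;       // assumed split between the two stages
        for (long i = 0; i < n_templates; i++) {
            process_template(i);
            // Reported progress tops out just under 90% as the loop finishes...
            boinc_fraction_done(MAIN_WEIGHT * double(i + 1) / double(n_templates));
        }
        run_followup_stage();                  // ~1 s on the old tasks, 20-50 s on the new ones
        boinc_fraction_done(1.0);              // ...then jumps straight to 100%
        boinc_finish(0);
    }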
Quote: Has anyone noticed this task group taking much longer than usual in the end-of-run cleanup phase?
Yes, already commented on by Gary Roberts in his post on the new task set.
https://einsteinathome.org/content/some-points-interest-grp-gpu-tasks-using-new-lateah3001l00dat-data-file
Thanks, Keith. The 20-50 secs mentioned was for relatively capable discrete GPUs. I have no idea how long the followup stage might take for an Intel GPU.
Cheers,
Gary.
Yes, I agree. In Richard's case, using the iGPU in his host might in fact make the host "appear" to hang for a very long time in the finish-up phase, with the shared resources between the iGPU and the CPU being overcommitted.
If I catch one in the act (when I'm not trying to compose a complicated email about BOINC testing...), I'll try to time it.
Thanks for the replies. I'll stop and have another cup of coffee if it happens again.
We think we found the problem and fixed it. There's a new app version 1.23 available for Beta test. It's been a while since we built new FGRP app binaries, and our build system is a bit rusty. There are only new versions for OSX and Linux so far; we are still working on the Windows build.
BM
Thanks for the heads up. I allowed one of my systems to try the new tasks/app. Will report back.
Well, the app seems to work, and the tasks do process; I've even had a few validate already.
However, the app doesn't seem well optimized. I only see about 85-90% GPU utilization on my RTX 20-series cards, and fairly large utilization swings of 77-99% (cyclical, repeating) on my RTX 3070. There is performance being left on the table. The old app pegged all my cards at 97-98% for the whole run; can you work on some better optimization before the app comes out of beta?
What did you need to change in the app? Just out of curiosity.
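If it helps the debugging, this is the sort of thing I'd run alongside a task to put hard numbers on those swings - a little NVML poller of my own (nothing from the project code, GPU index 0 assumed), one sample per second to CSV, built with -lnvidia-ml:

    #include <nvml.h>
    #include <cstdio>
    #include <thread>
    #include <chrono>

    int main() {
        if (nvmlInit() != NVML_SUCCESS) return 1;
        nvmlDevice_t dev;
        if (nvmlDeviceGetHandleByIndex(0, &dev) != NVML_SUCCESS) return 1;   // GPU 0
        std::printf("sec,gpu_util,mem_util\n");
        for (int s = 0; s < 600; s++) {                      // ~10 minutes of samples
            nvmlUtilization_t u;
            if (nvmlDeviceGetUtilizationRates(dev, &u) == NVML_SUCCESS)
                std::printf("%d,%u,%u\n", s, u.gpu, u.memory);
            std::this_thread::sleep_for(std::chrono::seconds(1));
        }
        nvmlShutdown();
        return 0;
    }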