Gamma ray GPU tasks hanging?

Ian&Steve C.
Joined: 19 Jan 20
Posts: 3945
Credit: 46770032642
RAC: 64076291
Topic 224562

did the project make some change?

starting about 2-3 hrs ago, across all of my systems and all GPUs, the Gamma Ray GPU tasks are hanging. GPU utilization is at 100% (normal is about 97-98%), but GPU power consumption is less than half of normal. seems like some data is missing from the new tasks or some parameter is set incorrectly, and the system is just spinning its wheels without getting anything done.

  • stopping/starting BOINC hasn't helped
  • updating/rebooting the system hasn't helped
  • wiping out, reinstalling, and updating the GPU drivers hasn't helped
  • resetting the project hasn't helped either

if these are malformed tasks, you might not notice until you get to the newer ones in your cache. (my cache is set very small, 2 tasks per GPU, so I'm always crunching very new tasks)

GW GPU tasks seem unaffected and process normally.

_________________________________________________________________________

archae86
Joined: 6 Dec 05
Posts: 3157
Credit: 7221224931
RAC: 970224

Glancing at task lists for your machines, it appears that the behavior you don't like began at about the same time as the arrival of tasks with names starting LATeah3001L00.

I've suspended intermediate work not yet started, so all three of my systems are starting to process tasks recently received with those leading characters in the task name.

I'll report observations.  So far I don't see anything unusual, but the most progress so far is one task reported at 81% completion.

archae86
Joined: 6 Dec 05
Posts: 3157
Credit: 7221224931
RAC: 970224

In comparing the behavior of the GPU Gamma-Ray Pulsar tasks with names starting with LATeah3001L00 to those of previous flavors, I do see a strong difference.  But the difference I see is not any of those that you enumerate.

On all three of my machines, with a total of four GPUs, tasks with names starting LATeah3001L00 are running to completion in about 2/3 the established usual elapsed time.  Unlike GW, for GPU GRP, elapsed times on my machines have shown essentially zero task-dependent variation in recent times, so this is an unexpected huge difference.
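A comparison like the one above can be made by grouping finished tasks by the leading characters of their names and averaging elapsed time. The following is only a rough Python sketch: the task names and elapsed times are made-up placeholders (not results from this thread), and `mean_elapsed_by_prefix` is a hypothetical helper, not part of BOINC or the project.

```python
from collections import defaultdict
from statistics import mean


def mean_elapsed_by_prefix(tasks, prefix_len=10):
    """Average elapsed seconds, grouped by the first prefix_len characters of the task name."""
    groups = defaultdict(list)
    for name, elapsed_s in tasks:
        groups[name[:prefix_len]].append(elapsed_s)
    return {prefix: mean(times) for prefix, times in groups.items()}


# Placeholder (name, elapsed seconds) pairs, invented for illustration only.
tasks = [
    ("LATeah2049_placeholder_a", 600.0),
    ("LATeah2049_placeholder_b", 606.0),
    ("LATeah3001L00_placeholder_a", 400.0),
    ("LATeah3001L00_placeholder_b", 404.0),
]

averages = mean_elapsed_by_prefix(tasks)
# Ratio of new-flavor to old-flavor mean elapsed time.
ratio = averages["LATeah3001"] / averages["LATeah2049"]
print(averages, round(ratio, 2))
```

With these placeholder numbers the ratio comes out near 2/3, which is the kind of gap described above. Real names and times would have to come from the host's task list.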

All my machines are Windows 10 hosts with two flavors of AMD RX 5700 GPUs. 

That these tasks are different seems clear.  The resulting behavior difference may vary with hardware or software.  I notice that all four of your machines run Linux, with various flavors of Nvidia cards.  So at the simplistic level, Windows vs. Linux and AMD vs. Nvidia both seem to be live possibilities.

Ian&Steve C.
Joined: 19 Jan 20
Posts: 3945
Credit: 46770032642
RAC: 64076291

very possibly a Linux- and/or Nvidia-specific issue.

in my case the tasks were progressing at about 0.001% per second, which I think is just default BOINC behavior when it doesn't know what the app is doing. so the tasks seemed to be "running": GPU utilization is 100% and it looks like it's working. but you can tell something is wrong because GPU power draw is about 50% of normal, temps are barely warmer than idle, and tasks were running 2+ hrs (and not finishing) for what would normally take 5-10 mins.

switched everything back to GW until the project fixes whatever the issue is.
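
The symptom pattern described above (utilization pinned near 100% while power draw sits around half of normal) can be checked mechanically. Here is a minimal Python sketch, assuming an Nvidia card and text captured from `nvidia-smi --query-gpu=utilization.gpu,power.draw --format=csv,noheader,nounits`. The function names, the thresholds, and the `normal_power_w` baseline are hypothetical choices made for illustration, not anything provided by BOINC or the project.

```python
def parse_gpu_samples(csv_text):
    """Parse lines like '100, 70.0' into (utilization_pct, power_w) tuples.

    Assumes both fields are numeric; cards that report '[N/A]' for power
    would need extra handling.
    """
    samples = []
    for line in csv_text.strip().splitlines():
        util_str, power_str = [field.strip() for field in line.split(",")]
        samples.append((float(util_str), float(power_str)))
    return samples


def looks_stalled(util_pct, power_w, normal_power_w,
                  util_threshold=99.0, power_fraction=0.6):
    """Flag a GPU that reports full utilization but draws far less power than usual."""
    return util_pct >= util_threshold and power_w <= power_fraction * normal_power_w


# Illustrative values only (not measurements from this thread).
sample_output = "100, 70.0\n98, 138.0\n"
normal_power_w = 140.0  # assumed typical draw for a healthy task on this card

flags = [looks_stalled(util, power, normal_power_w)
         for util, power in parse_gpu_samples(sample_output)]
print(flags)  # [True, False]
```

The first sample (100% utilization at half the usual power) is flagged, while the second (98% at near-normal power) is not. The thresholds would need tuning per card.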

_________________________________________________________________________

archae86
Joined: 6 Dec 05
Posts: 3157
Credit: 7221224931
RAC: 970224

I've left my hosts running the LATeah3001L tasks for a bit, hoping to see if they can validate.

Already two have.  Both of my successful quorum partners completed the tasks on Windows machines running two different Nvidia cards.  Both show a substantial reduction in reported run time for the LATeah3001L tasks as compared to the immediately preceding LATeah2049 tasks.

So at the simplistic possibilities level, Linux being a problem currently looks much more likely than Nvidia being a problem.  We have lots and lots of Linux participants, so pretty soon some evidence should come in whether this is widespread, just one distribution, or just you (and thus maybe not Linux at all).

Ian&Steve C.
Joined: 19 Jan 20
Posts: 3945
Credit: 46770032642
RAC: 64076291

Another variable that *might* be a factor is the GPU type used. all of my cards are Turing-based. the one validation you got from an Nvidia card was on a Pascal-based GPU. it will be interesting to see if any Turing/Ampere cards get any successful completions.

Since these tasks are OpenCL, stuff like that *shouldn't* matter, but I'm just spitballing.

_________________________________________________________________________

archae86
Joined: 6 Dec 05
Posts: 3157
Credit: 7221224931
RAC: 970224

I'm now up to four validated tasks of this type.

Quorum partner details:

Windows 7 AMD Radeon HD 7700/R7 250X/R9 255 series (Capeverde) (1024MB)

Windows 10 GTX 1070 (4095MB)

Windows 10 GTX 1070 Ti (4095MB) 

Windows 10 GTX 1080 (4095MB) 

Still all Windows, but four is not a very large sample.

I'll unsuspend my older tasks for long enough to download some more of this stuff, then run the new stuff for a while longer.  If we get up to twenty, and they are all still Windows on the success side, I'll go looking for hung up quorum partners, but that may be hard to see for some time yet.
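
The tally described above can be sketched in a few lines of Python. The partner list is copied from the four entries in this post, while `os_family` and `TARGET_SAMPLE` are hypothetical names introduced only for illustration.

```python
from collections import Counter

# Quorum partners listed in this post, as (operating system, GPU) pairs.
partners = [
    ("Windows 7", "AMD Radeon HD 7700/R7 250X/R9 255 series (Capeverde)"),
    ("Windows 10", "GTX 1070"),
    ("Windows 10", "GTX 1070 Ti"),
    ("Windows 10", "GTX 1080"),
]


def os_family(os_name):
    """Reduce 'Windows 10' to 'Windows' by keeping the first word of the OS string."""
    return os_name.split()[0]


by_family = Counter(os_family(os_name) for os_name, _gpu in partners)

TARGET_SAMPLE = 20  # the sample size mentioned in this post
enough_data = len(partners) >= TARGET_SAMPLE

print(dict(by_family), enough_data)  # {'Windows': 4} False
```

Four out of four on Windows is suggestive, but the count is still well short of the twenty validations mentioned above.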

 

Ian&Steve C.
Joined: 19 Jan 20
Posts: 3945
Credit: 46770032642
RAC: 64076291

archae86 wrote:

Already two have.  Both of my successful quorum partners completed the tasks on Windows machine running two different NVidia cards.

one of those was actually run on the AMD card in that system. you can see that the app used was the ati app, and that system has both a 1070ti and an AMD RX590. inspecting the stderr.txt confirms as much. doesn't look like they are processing at all on the 1070ti, at least not on Einstein.

_________________________________________________________________________

Dotious
Joined: 5 Dec 20
Posts: 4
Credit: 94096447
RAC: 137648

I have a LATeah3001 task that’s approaching 90% after an hour and 45 minutes on a 2070 Super.  It’s incrementing at 0.004% per second.  Usually these tasks take 10-15 minutes for me.

EDIT: GPU power consumption is 34% of rated, as reported by EVGA X1.  Usually it’s loaded up close to my max allowed (70% power) for these tasks.

Ian&Steve C.
Joined: 19 Jan 20
Posts: 3945
Credit: 46770032642
RAC: 64076291

looks like San-Fernando-Valley's Titan V (Volta architecture, precursor to Turing) on Windows is having issues with these new tasks too. so it's not just me.

https://einsteinathome.org/host/12284149/tasks/6/0

looking more likely to be an issue with these tasks on certain architectures.

_________________________________________________________________________

Dotious
Joined: 5 Dec 20
Posts: 4
Credit: 94096447
RAC: 137648

Ian&Steve C. wrote:

looking more likely to be an issue with these tasks on certain architectures.

Yeah, I think you’re on to something.  My 970 in the same computer as my 2070S was able to finish two LATeah3001L tasks in the normal time (20ish minutes).  Those are awaiting validation.
