Gamma ray GPU tasks hanging?

Ian&Steve C.
Joined: 19 Jan 20
Posts: 3682
Credit: 33846645685
RAC: 36436588

yeah, I've yet to see any Turing/Volta/Ampere card that didn't have a problem, so the issue should have been obvious on any of these cards. The failure mode does seem to vary though, maybe due to OS differences, but the vast majority of people seem to be seeing the "running but not running" symptom, where tasks run forever and never finish.

I can understand that it can take some time to debug and isolate the root cause though.

_________________________________________________________________________

San-Fernando-Valley
Joined: 16 Mar 16
Posts: 260
Credit: 6916511637
RAC: 19999668

Just want to repeat:  My PCs/rigs oddly get BSODs (blue screen of death).

I managed to catch/see the error message on one occasion:  It said "Driver Error 116".

The PC tries to recover the GPUs, but after three times (?) of massive screen blinking the PC does a "restart".

I don't have long endlessly running tasks.

They BSOD after a couple of seconds or minutes.

The PCs just run E @ H and just GR or GW.  No mix.

Bernd Machenschalk
Moderator
Administrator
Joined: 15 Oct 04
Posts: 4267
Credit: 244934206
RAC: 16223

We just prevented sending FGRP tasks to GPUs with compcap >= 7.0, i.e. to new cards. These should still be able to run GW (O2MDF) tasks, so you may set these hosts to receive new work. We'll continue to work on this, but it may take a while until that problem is fixed.

BM
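
As a rough illustration of the rule described above, the check boils down to comparing a GPU's compute capability against a cutoff. This is a minimal Python sketch, not the actual Einstein@Home scheduler code: the function name `should_send_fgrp` and the tuple representation are made up for this example. The only things taken from the thread are the cutoff (compcap >= 7.0) and the fact that GW tasks are not restricted.

```python
# Hypothetical sketch of a "don't send FGRP to compcap >= 7.0" rule.
# Names and structure are assumptions for illustration only.

FGRP_COMPCAP_CUTOFF = (7, 0)  # Volta and newer start at compute capability 7.0


def should_send_fgrp(compcap):
    """Return True if an FGRP GPU task may be sent to a card with this
    (major, minor) compute capability."""
    # Tuples compare element by element, so (6, 1) < (7, 0) < (7, 5).
    return tuple(compcap) < FGRP_COMPCAP_CUTOFF


# Publicly documented compute capabilities for a few example cards.
cards = {
    "GTX 1080 (Pascal)": (6, 1),
    "V100 (Volta)": (7, 0),
    "RTX 2080 (Turing)": (7, 5),
    "RTX 3080 (Ampere)": (8, 6),
}

for name, cc in cards.items():
    verdict = "FGRP allowed" if should_send_fgrp(cc) else "FGRP blocked, GW only"
    print(f"{name}: {verdict}")
```

With this rule, only the Pascal card in the example would still receive FGRP work; the other three would be limited to GW tasks.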

Ian&Steve C.
Joined: 19 Jan 20
Posts: 3682
Credit: 33846649018
RAC: 36433833

Bernd Machenschalk wrote:

We just prevented sending FGRP tasks to GPUs with compcap >= 7.0, i.e. to new cards. These should still be able to run GW (O2MDF) tasks, so you may set these hosts to receive new work. We'll continue to work on this, but it may take a while until that problem is fixed.

what's different about these new LATeah3001L00 vs the last run of LATeah2049Lag? the previous tasks worked OK. it all started with the new tasks.

if there's some different parameter or property about this dataset, that might lead you to whatever could be causing the issue.

_________________________________________________________________________

Bernd Machenschalk
Moderator
Administrator
Joined: 15 Oct 04
Posts: 4267
Credit: 244934206
RAC: 16223

The problem is probably the same as (or at least related to) the one we had almost exactly two years ago: Latest data file for FGRPB1G GPU tasks.

That part of the code behaves in a rather data-dependent way; it's probably a barrier/synchronization problem. I vaguely remember that NVidia changed something in the Volta architecture, compared to previous ones, that had to do with synchronization among or within work groups / thread blocks. Does anyone remember better than me? I currently can't find it.

BM

Keith Myers
Joined: 11 Feb 11
Posts: 4704
Credit: 17550386957
RAC: 6435685

But what was the solution for the bad tasks two years ago?  Was it simply to stop distributing the offending tasks . . . .  or was something in the task parameter set changed to exclude the offending element?

From what I can remember, the offending task set was just stopped and new task sets were sent out without the problem.

That leads me to believe the problem was never solved or investigated, just kicked down the road.  It looks like it has reappeared.

The Anandtech Deep Dive into the changes in Volta/Turing vs Pascal can probably shed some light on the warp scheduling changes.

https://www.anandtech.com/show/13282/nvidia-turing-architecture-deep-dive/4

Or a perusal of the Nvidia CUDA Tuning Application Notes.

https://docs.nvidia.com/cuda/archive/10.1/pdf/Turing_Tuning_Guide.pdf

Read through section 1.4.1.2. Independent Thread Scheduling for issues with barrier synchronization.

The racecheck and synccheck tools should be used on the troublesome dataset to look for violations
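
To make the synchronization concern concrete, here is a toy Python simulation (not GPU code, and not the Einstein@Home kernel, which is not shown in this thread). It models a tree reduction over a small shared array. Before Volta, code like this was often written without explicit synchronization inside a warp, on the assumption that all lanes advance in lockstep. With independent thread scheduling that assumption no longer holds, so a lane can run ahead and read partial sums that other lanes have not written yet. The function names and the hand-picked "lane 0 runs to completion first" ordering are assumptions chosen purely for illustration; real hardware scheduling is not this simple.

```python
# Toy model of a shared-memory tree reduction with and without
# step-by-step synchronization. All names here are illustrative.


def reduce_lockstep(values):
    """Every lane finishes a step before any lane starts the next one.
    This models lockstep execution, or explicit barriers between steps."""
    shared = list(values)
    stride = len(shared) // 2
    while stride > 0:
        for lane in range(stride):
            shared[lane] += shared[lane + stride]
        stride //= 2  # implicit barrier: the next step starts only now
    return shared[0]


def reduce_unsynchronized(values):
    """Each lane runs all of its steps before the next lane runs at all.
    This models one possible interleaving when nothing forces the lanes
    to wait for each other between steps."""
    shared = list(values)
    n = len(shared)
    for lane in range(n):
        stride = n // 2
        while stride > 0:
            if lane < stride:
                # May read a partial sum that its owner has not updated yet.
                shared[lane] += shared[lane + stride]
            stride //= 2
    return shared[0]


data = [1, 2, 3, 4, 5, 6, 7, 8]
print("expected sum:      ", sum(data))
print("lockstep result:   ", reduce_lockstep(data))
print("unsynchronized run:", reduce_unsynchronized(data))
```

The lockstep version returns the correct sum, while the unsynchronized ordering returns a wrong value because lane 0 consumes stale entries. In real CUDA code the usual remedy is an explicit `__syncwarp()` (or `__syncthreads()` at block level) between steps, and in OpenCL a `barrier(CLK_LOCAL_MEM_FENCE)`. Whether that is what the FGRP kernel needs is exactly what the racecheck/synccheck suggestion above would help determine.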

 

DanNeely
Joined: 4 Sep 05
Posts: 1364
Credit: 3562358667
RAC: 1296

Keith Myers wrote:

But what was the solution for the bad tasks two years ago?  Was it simply to stop distributing the offending tasks . . . .  or was something in the task parameter set changed to exclude the offending element?

From what I can remember, the offending task set was just stopped and new task sets were sent out without the problem.

That leads me to believe the problem was never solved or investigated, just kicked down the road.  It looks like it has reappeared.

The Anandtech Deep Dive into the changes in Volta/Turing vs Pascal can probably shed some light on the warp scheduling changes.

https://www.anandtech.com/show/13282/nvidia-turing-architecture-deep-dive/4

Or a perusal of the Nvidia CUDA Tuning Application Notes.

https://docs.nvidia.com/cuda/archive/10.1/pdf/Turing_Tuning_Guide.pdf

Read through section 1.4.1.2. Independent Thread Scheduling for issues with barrier synchronization.

The racecheck and synccheck tools should be used on the troublesome dataset to look for violations

The solution from 2 years ago was to adjust the scheduler to stop sending the problematic set of WUs to newer NVidia cards.

https://einsteinathome.org/content/latest-data-file-fgrpb1g-gpu-tasks?page=4#comment-169615

 

The actual investigation into what went wrong got as far as Oliver finding out where the failure was coming from, and Bernd discovering that disabling performance optimizations in the compiler would prevent it from occurring, at the cost of slowing the tasks down (by an unspecified amount).

https://einsteinathome.org/content/latest-data-file-fgrpb1g-gpu-tasks?page=2#comment-169243

https://einsteinathome.org/content/latest-data-file-fgrpb1g-gpu-tasks?page=3#comment-169461

 

Ian&Steve C.
Joined: 19 Jan 20
Posts: 3682
Credit: 33846652351
RAC: 36434154

Keith Myers wrote:

But what was the solution for the bad tasks two years ago?  Was it simply to stop distributing the offending tasks . . . .  or was something in the task parameter set changed to exclude the offending element?

From what I can remember, the offending task set was just stopped and new task sets were sent out without the problem.

That leads me to believe the problem was never solved or investigated, just kicked down the road.  It looks like it has reappeared.

The Anandtech Deep Dive into the changes in Volta/Turing vs Pascal can probably shed some light on the warp scheduling changes.

https://www.anandtech.com/show/13282/nvidia-turing-architecture-deep-dive/4

Or a perusal of the Nvidia CUDA Tuning Application Notes.

https://docs.nvidia.com/cuda/archive/10.1/pdf/Turing_Tuning_Guide.pdf

Read through section 1.4.1.2. Independent Thread Scheduling for issues with barrier synchronization.

The racecheck and synccheck tools should be used on the troublesome dataset to look for violations

I assume the FGRPopenclTV-nvidia app is what came out of it before, since it has a build date of Feb 2019, no?

 

and DanNeely's comment about them removing some optimizations would explain why the Turing/Ampere cards seem to perform rather poorly compared to the older Pascal cards, even though the Turing/Ampere cards should be much faster.

 

I hope the devs can find a fix that allows them to add the optimizations back in for faster and more efficient crunching :)

_________________________________________________________________________

Ian&Steve C.
Joined: 19 Jan 20
Posts: 3682
Credit: 33846652351
RAC: 36434154

and in the linked thread, Bernd said this:

Bernd wrote:

We might again intensify our efforts to develop a CUDA version that will likely give us more performance on NVidia cards and solve this problem as well.

 

That would be great too! Einstein definitely needs a CUDA app to increase performance on the nvidia cards.

_________________________________________________________________________

Keith Myers
Joined: 11 Feb 11
Posts: 4704
Credit: 17550386957
RAC: 6435685

OK, so the flub in tasks two years ago resulted in the nvidia-TV application.

Which has been working fine for two years on Volta/Turing/Ampere.

Apart from forcing lower optimizations on these cards, it was not a problem UNTIL these latest LATeah3001L00 work units.

Has to be something in the makeup of the LATeah3001L00 WU's that trips up the nvidia-TV application now.

I would think the simplest solution would be to undo whatever parameter changed between the LATeah3001L00 WU's and the previously working LATeah2049Lag WU set.

 
