Gamma ray GPU tasks hanging?

Ian&Steve C.
Joined: 19 Jan 20
Posts: 3958
Credit: 47019532642
RAC: 65034675

Yeah, I've yet to see any Turing/Volta/Ampere card that didn't have a problem, so the issue should have been obvious on any of these cards. The failure mode does seem to vary, though, maybe due to OS differences, but the vast majority of people seem to have the "running-but-not-running, tasks-run-forever-and-never-finish" type of symptom.

I can understand that it can take some time to debug and isolate the root cause, though.

_________________________________________________________________________

San-Fernando-Valley
Joined: 16 Mar 16
Posts: 409
Credit: 10213993455
RAC: 22389218

Just want to repeat: my PCs/rigs oddly get BSODs (blue screens of death).

I managed to catch the error message on one occasion: it said "Driver Error 116".

The PC tries to recover the GPUs, but after three rounds (?) of massive screen blinking the PC does a restart.

I don't have long, endlessly running tasks; they BSOD after a couple of seconds or minutes.

The PCs run only E@H, and each runs just GR or just GW. No mix.

Bernd Machenschalk
Moderator
Administrator
Joined: 15 Oct 04
Posts: 4312
Credit: 250577837
RAC: 34794

We just prevented sending FGRP tasks to GPUs with compcap >= 7.0, i.e. to new cards. These should still be able to run GW (O2MDF) tasks, so you may set these hosts to receive new work. We'll continue to work on this, but it may take a while until that problem is fixed.
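
(For reference, compute capability ("compcap") 7.0 is Volta; Turing reports 7.5 and Ampere 8.x. As a minimal illustration of what that threshold means, here is a host-side sketch using the CUDA runtime API; it is not our scheduler code, which makes this decision server-side from the GPU info the host reports.)

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess || count == 0) {
        fprintf(stderr, "no CUDA devices found\n");
        return 1;
    }
    for (int i = 0; i < count; ++i) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        // compcap >= 7.0 covers Volta (7.0), Turing (7.5) and Ampere (8.x),
        // i.e. the cards the scheduler now excludes from FGRP work.
        printf("GPU %d: %s, compute capability %d.%d%s\n",
               i, prop.name, prop.major, prop.minor,
               prop.major >= 7 ? " (no FGRP tasks)" : "");
    }
    return 0;
}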

BM

Ian&Steve C.
Joined: 19 Jan 20
Posts: 3958
Credit: 47019532642
RAC: 65034675

Bernd Machenschalk wrote:

We just prevented sending FGRP tasks to GPUs with compcap >= 7.0, i.e. to new cards. These should still be able to run GW (O2MDF) tasks, so you may set these hosts to receive new work. We'll continue to work on this, but it may take a while until that problem is fixed.

What's different about these new LATeah3001L00 files vs. the last run of LATeah2049Lag? The previous tasks worked OK; it all started with the new tasks.

If there's some different parameter or property in this dataset, that might lead you to whatever is causing the issue.

_________________________________________________________________________

Bernd Machenschalk
Moderator
Administrator
Joined: 15 Oct 04
Posts: 4312
Credit: 250577837
RAC: 34794

The problem is probably the same as (or at least related to) the one we had almost exactly two years ago: Latest data file for FGRPB1G GPU tasks.

That part of the code is pretty data-dependent; it's probably a barrier/synchronization problem. I vaguely remember that NVidia changed something in the Volta architecture, compared to previous ones, that had to do with synchronization among or within work groups / thread blocks. Does anyone remember better than me? I currently can't find it.

BM

Keith Myers
Joined: 11 Feb 11
Posts: 4964
Credit: 18749397359
RAC: 7074684

But what was the solution for the bad tasks two years ago? Was it simply to stop distributing the offending tasks, or was the task parameter set changed to exclude the offending element?

From what I can remember, the offending task set was just stopped and new task sets were sent out without the problem.

That leads me to believe the problem was never solved or investigated, just kicked down the road.  It looks like it has reappeared.

The Anandtech Deep Dive into the changes in Volta/Turing vs. Pascal can probably shed some light on the warp scheduling changes.

https://www.anandtech.com/show/13282/nvidia-turing-architecture-deep-dive/4

Or a perusal of the Nvidia CUDA Tuning Application Notes.

https://docs.nvidia.com/cuda/archive/10.1/pdf/Turing_Tuning_Guide.pdf

Read through section 1.4.1.2, "Independent Thread Scheduling", for issues with barrier synchronization.
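
That section describes how, starting with Volta, the threads of a warp no longer execute in lock-step, so old implicit warp-synchronous code can race. A contrived CUDA sketch of the pattern (my illustration, not Einstein's actual code, which is OpenCL):

// Classic implicit warp-synchronous reduction over 32 floats in shared
// memory. Safe on Pascal and earlier, where a warp's 32 threads ran in
// lock-step; a data race on Volta/Turing/Ampere, where threads of a
// warp may diverge and re-converge independently.
__device__ float warp_reduce_pre_volta(volatile float *s, int tid) {
    if (tid < 16) {
        s[tid] += s[tid + 16];  // nothing guarantees the writing thread
        s[tid] += s[tid + 8];   // has finished before the reading thread
        s[tid] += s[tid + 4];   // consumes s[...] once lock-step is gone
        s[tid] += s[tid + 2];
        s[tid] += s[tid + 1];
    }
    return s[0];
}

// Volta-safe version: __syncwarp() (CUDA 9+) re-converges the warp and
// orders the shared-memory accesses between reduction steps. Assumes
// all 32 threads of the warp call this function.
__device__ float warp_reduce_volta_safe(float *s, int tid) {
    for (int offset = 16; offset > 0; offset >>= 1) {
        if (tid < offset) s[tid] += s[tid + offset];
        __syncwarp();  // explicit warp-level barrier
    }
    return s[0];
}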

The racecheck and synccheck tools should be used on the troublesome dataset to look for violations.
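
For example, the divergent-barrier bug that synccheck reports looks roughly like this (a contrived sketch, not the app's code; checking the real binary would mean running it under cuda-memcheck --tool racecheck and cuda-memcheck --tool synccheck, or compute-sanitizer on newer toolkits):

// __syncthreads() must be reached by every thread of the block; here
// threads with i >= n skip it, which is undefined behaviour and can
// hang the kernel (i.e. a task that runs forever).
__global__ void buggy_kernel(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[i] = 2.0f * in[i];
        __syncthreads();  // divergent: not every thread arrives here
    }
}

// Fixed: the barrier is unconditional and uniform across the block.
__global__ void fixed_kernel(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 2.0f * in[i];
    __syncthreads();  // every thread reaches the same barrier
}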

 

DanNeely
Joined: 4 Sep 05
Posts: 1364
Credit: 3562358667
RAC: 0

Keith Myers wrote:

But what was the solution for the bad tasks two years ago? Was it simply to stop distributing the offending tasks, or was the task parameter set changed to exclude the offending element?

From what I can remember, the offending task set was just stopped and new task sets were sent out without the problem.

That leads me to believe the problem was never solved or investigated, just kicked down the road. It looks like it has reappeared.

The Anandtech Deep Dive into the changes in Volta/Turing vs. Pascal can probably shed some light on the warp scheduling changes.

https://www.anandtech.com/show/13282/nvidia-turing-architecture-deep-dive/4

Or a perusal of the Nvidia CUDA Tuning Application Notes.

https://docs.nvidia.com/cuda/archive/10.1/pdf/Turing_Tuning_Guide.pdf

Read through section 1.4.1.2, "Independent Thread Scheduling", for issues with barrier synchronization.

The racecheck and synccheck tools should be used on the troublesome dataset to look for violations.

The solution from 2 years ago was to adjust the scheduler to stop sending the problematic set of WUs to newer NVidia cards.

https://einsteinathome.org/content/latest-data-file-fgrpb1g-gpu-tasks?page=4#comment-169615

 

The actual what-went-wrong investigation got as far as Oliver finding out where the failure was coming from, and Bernd discovering that disabling performance optimizations in the compiler would prevent it from occurring, at the cost of slowing the tasks down (by an unspecified amount). (A sketch of what that kind of build-option change looks like follows after the links.)

https://einsteinathome.org/content/latest-data-file-fgrpb1g-gpu-tasks?page=2#comment-169243

https://einsteinathome.org/content/latest-data-file-fgrpb1g-gpu-tasks?page=3#comment-169461
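
For the curious, the sketch mentioned above: in an OpenCL app, disabling compiler optimizations is typically a one-line change to the program build options. A hypothetical snippet (build_without_optimizations is my name, not the project's; "-cl-opt-disable" is a standard OpenCL build option):

#include <CL/cl.h>

// Hypothetical illustration of the optimization-disabling experiment:
// "-cl-opt-disable" turns off all compiler optimizations, which can
// avoid codegen-level race or barrier bugs, at the cost of slower
// kernels.
cl_int build_without_optimizations(cl_program program, cl_device_id device) {
    return clBuildProgram(program, 1, &device,
                          "-cl-opt-disable",  // vs. "" for default opts
                          nullptr, nullptr);
}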

 

Ian&Steve C.
Joined: 19 Jan 20
Posts: 3958
Credit: 47019532642
RAC: 65034675

Keith Myers wrote:

But what was the solution for the bad tasks two years ago? Was it simply to stop distributing the offending tasks, or was the task parameter set changed to exclude the offending element?

From what I can remember, the offending task set was just stopped and new task sets were sent out without the problem.

That leads me to believe the problem was never solved or investigated, just kicked down the road. It looks like it has reappeared.

The Anandtech Deep Dive into the changes in Volta/Turing vs. Pascal can probably shed some light on the warp scheduling changes.

https://www.anandtech.com/show/13282/nvidia-turing-architecture-deep-dive/4

Or a perusal of the Nvidia CUDA Tuning Application Notes.

https://docs.nvidia.com/cuda/archive/10.1/pdf/Turing_Tuning_Guide.pdf

Read through section 1.4.1.2, "Independent Thread Scheduling", for issues with barrier synchronization.

The racecheck and synccheck tools should be used on the troublesome dataset to look for violations.

I assume the FGRPopenclTV-nvidia app is what came out of it before, since it has a build date of Feb 2019, no?

 

And DanNeely's comment about removing some optimizations explains why the Turing/Ampere cards seem to perform rather poorly compared to the older Pascal cards, even though the Turing/Ampere cards should be much faster.

 

I hope the devs can find a fix that allows them to add the optimizations back in for faster and more efficient crunching :)

_________________________________________________________________________

Ian&Steve C.
Joined: 19 Jan 20
Posts: 3958
Credit: 47019532642
RAC: 65034675

And in the thread linked above, Bernd said this:

Bernd wrote:

We might again intensify our efforts to develop a CUDA version that will likely give us more performance on NVidia cards and solve this problem as well.

 

That would be great too! Einstein definitely needs a CUDA app to increase performance on the nvidia cards.

_________________________________________________________________________

Keith Myers
Joined: 11 Feb 11
Posts: 4964
Credit: 18749397359
RAC: 7074684

OK, so the flub in the tasks two years ago resulted in the nvidia-TV application, which has been working fine for two years on Volta/Turing/Ampere.

It was not a problem, other than forcing lower optimizations on these cards, up UNTIL these latest LATeah3001L00 work units.

It has to be something in the makeup of the LATeah3001L00 WUs that trips up the nvidia-TV application now.

I would think the simplest solution would be to undo whatever parameter changed between the LATeah3001L00 WUs and the previously working LATeah2049Lag set.

 
