Yeah, I've yet to see any Turing/Volta/Ampere card that didn't have a problem, so the issue should have been obvious on any of these cards. The failure mode does seem to vary, maybe due to OS differences, but the vast majority of people seem to be having the "running-but-not-running, tasks-run-forever-and-never-finish" type of symptoms.
I can understand that it can take some time to debug and isolate the root cause, though.
_________________________________________________________________________
Just want to repeat: my PCs/rigs oddly get BSODs (blue screens of death).
I managed to catch the error message on one occasion: it said "Driver Error 116".
The PC tries to recover the GPUs, but after about three rounds of massive screen blinking it restarts.
I don't have the long, endlessly running tasks; they BSOD after a couple of seconds or minutes.
The PCs run only E@H, and only GR or GW, never a mix.
We just prevented sending FGRP tasks to GPUs with compcap >= 7.0, i.e. to new cards. These should still be able to run GW (O2MDF) tasks, so you may set these hosts to receive new work. We'll continue to work on this, but it may take a while until that problem is fixed.
BM
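For reference, "compcap" here is the NVidia CUDA compute capability: Volta is 7.0, Turing 7.5 and Ampere 8.x. A minimal sketch of how a host program can read it with the CUDA runtime, purely for illustration (this is not the project's actual scheduler check):

    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        int count = 0;
        cudaGetDeviceCount(&count);
        for (int dev = 0; dev < count; ++dev) {
            cudaDeviceProp prop;
            cudaGetDeviceProperties(&prop, dev);
            // compcap >= 7.0 covers Volta (7.0), Turing (7.5) and Ampere (8.x)
            bool excluded = (prop.major >= 7);
            std::printf("GPU %d: %s, compute capability %d.%d%s\n",
                        dev, prop.name, prop.major, prop.minor,
                        excluded ? " (currently gets no FGRP work)" : "");
        }
        return 0;
    }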
What's different about these new LATeah3001L00 tasks vs the last run of LATeah2049Lag? The previous tasks worked OK; it all started with the new ones.
If there's some parameter or property that's different about this dataset, that might lead you to whatever is causing the issue.
_________________________________________________________________________
The problem is probably the same as (or at least related to) the one we had almost exactly two years ago: Latest data file for FGRPB1G GPU tasks.
That part of the code behaves in a pretty data-dependent way; it's probably a barrier/synchronization problem. I vaguely remember that NVidia changed something in the Volta architecture, compared to previous ones, that had to do with synchronization among or within work groups / thread blocks. Does anyone remember this better than I do? I currently can't find it.
BM
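The Volta change being referred to is most likely "independent thread scheduling": from compute capability 7.0 on, the threads of a warp no longer execute in strict lockstep, so old code that implicitly relied on warp-synchronous execution can start to race. A minimal CUDA-flavoured sketch of that pattern, purely illustrative and not taken from the Einstein@Home app (which is OpenCL):

    // Pre-Volta, the tail of a reduction was often written "warp-synchronously",
    // with no explicit sync, relying on all 32 threads of a warp running in lockstep.
    // 's' points to 32 shared ints for this warp, 'lane' is the lane id (0..31),
    // and all 32 lanes call the function.
    __device__ void warp_reduce_unsafe(volatile int* s, int lane) {
        if (lane < 16) s[lane] += s[lane + 16];
        if (lane <  8) s[lane] += s[lane +  8];
        if (lane <  4) s[lane] += s[lane +  4];
        if (lane <  2) s[lane] += s[lane +  2];
        if (lane <  1) s[lane] += s[lane +  1];
    }

    // With independent thread scheduling the threads of a warp may diverge and
    // interleave, so the reads and writes above can race between steps. The
    // synchronization has to be made explicit, e.g. with __syncwarp():
    __device__ void warp_reduce_safe(int* s, int lane) {
        for (int offset = 16; offset > 0; offset >>= 1) {
            __syncwarp();                    // make the previous step visible to the whole warp
            if (lane < offset) s[lane] += s[lane + offset];
        }
        __syncwarp();                        // s[0] now holds the warp's sum
    }

Whether this is exactly what trips up the FGRP kernels is speculation, but it would match a failure that is both data-dependent and specific to compute capability >= 7.0.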
But what was the solution for the bad tasks two years ago? Was it simply to stop distributing the offending tasks, or was some change made in the task parameter set to exclude the offending element?
From what I can remember, the offending task set was simply stopped and new task sets were sent out without the problem.
That leads me to believe the problem was never solved or investigated, just kicked down the road. It looks like it has reappeared.
The Anandtech Deep Dive into the changes in Volta/Turing vs Pascal can probably shed some light on the warp scheduling changes.
https://www.anandtech.com/show/13282/nvidia-turing-architecture-deep-dive/4
Or a perusal of the Nvidia CUDA Tuning Application Notes.
https://docs.nvidia.com/cuda/archive/10.1/pdf/Turing_Tuning_Guide.pdf
Read through section 1.4.1.2, Independent Thread Scheduling, for issues with barrier synchronization.
The racecheck and synccheck tools should be used on the troublesome dataset to look for violations.
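For anyone who wants to try that: racecheck and synccheck ship as sub-tools of cuda-memcheck (or of compute-sanitizer in CUDA 11 and later). A tiny, purely hypothetical CUDA kernel showing the kind of hazard racecheck reports; the real app is OpenCL, and "./app" below is just a placeholder binary name:

    // Shared-memory read-after-write across threads with the __syncthreads() missing.
    // (Assumes the kernel is launched with up to 256 threads per block.)
    __global__ void block_sum_broken(const int* in, int* out) {
        __shared__ int s[256];
        int tid = threadIdx.x;
        s[tid] = in[blockIdx.x * blockDim.x + tid];
        // missing: __syncthreads();
        if (tid == 0) {
            int sum = 0;
            for (int i = 0; i < blockDim.x; ++i)
                sum += s[i];                 // races with the writes done by the other threads
            out[blockIdx.x] = sum;
        }
    }

    //   cuda-memcheck --tool racecheck ./app       (CUDA 10.x and earlier)
    //   cuda-memcheck --tool synccheck ./app
    //   compute-sanitizer --tool racecheck ./app   (CUDA 11 and later)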
The solution from 2 years ago was to adjust the scheduler to stop sending the problematic set of WUs to newer NVidia cards.
https://einsteinathome.org/content/latest-data-file-fgrpb1g-gpu-tasks?page=4#comment-169615
The actual "what went wrong" investigation got as far as Oliver finding out where the failure was coming from, and Bernd discovering that disabling performance optimizations in the compiler would prevent it from occurring, at the cost of slowing the tasks down (by an unspecified amount).
https://einsteinathome.org/content/latest-data-file-fgrpb1g-gpu-tasks?page=2#comment-169243
https://einsteinathome.org/content/latest-data-file-fgrpb1g-gpu-tasks?page=3#comment-169461
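For what it's worth, in an OpenCL host program the usual knob for "disable optimizations" is the -cl-opt-disable build option. A hedged sketch of what that looks like (not the actual E@H build code, just the standard API call):

    #include <CL/cl.h>

    // Build an already-created OpenCL program with all compiler optimizations off.
    cl_int build_without_optimizations(cl_program program, cl_device_id device) {
        const char* build_opts = "-cl-opt-disable";
        cl_int err = clBuildProgram(program, 1, &device, build_opts, NULL, NULL);
        if (err != CL_SUCCESS) {
            // clGetProgramBuildInfo(program, device, CL_PROGRAM_BUILD_LOG, ...)
            // returns the compiler log with the details.
        }
        return err;
    }

Whether the 2019 fix used this option or something more targeted isn't stated in the linked posts.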
I assume the FGRPopenclTV-nvidia app is what came out of it before, since it has a build date of Feb 2019, no?
And Danneely's comments about them removing some optimizations would explain why the Turing/Ampere cards seem to perform rather poorly compared to the older Pascal cards, even though the Turing/Ampere cards should be much faster.
I hope the devs can find a fix that allows them to add the optimizations back in for faster and more efficient crunching :)
_________________________________________________________________________
And from the linked thread, Bernd said this:
"We might again intensify our efforts to develop a CUDA version that will likely give us more performance on NVidia cards and solve this problem as well."
That would be great too! Einstein definitely needs a CUDA app to increase performance on the NVidia cards.
_________________________________________________________________________
OK, so the flub in tasks two years ago resulted in the nvidia-TV application, which has been working fine for two years on Volta/Turing/Ampere.
It was not a problem, other than forcing lower optimizations on these cards, up UNTIL these latest LATeah3001L00 work units.
It has to be something in the makeup of the LATeah3001L00 WUs that trips up the nvidia-TV application now.
I would think the simplest solution would be to undo whatever parameter changed between the LATeah3001L00 WUs and the previously working LATeah2049Lag WU set.