High Ratio of Invalid O3ASHF1 Tasks

tictoc
Joined: 1 Jan 13
Posts: 44
Credit: 7245565408
RAC: 7810396
Topic 229824

I'm seeing a very high ratio of invalid GW tasks.  Roughly 30% of tasks that complete successfully are invalid.

https://einsteinathome.org/host/12883788/tasks/5/56

No obvious pattern on the invalids.  Some are straight "Validate error"s and others are "Completed, marked as invalid". 

  • Validation will fail against all the various hosts (windows/nvidia, windows/amd, linux/nvidia, linux/amd). There is a higher number of invalids against windows/nvidia hosts, but that is very likely just noise since the majority of Einstein hosts are windows/nvidia. 
  • Every frequency range I have run has both valid and invalid results: 872Hz, 900Hz, 986Hz, 1022Hz, 1104Hz, and 1230Hz.
  • The number of concurrent tasks doesn't seem to have any effect on the number of invalids; I've successfully run and validated tasks at 1x, 2x, 3x, and 4x.

I have plenty of temperature, clock, and power headroom.  About the only thing I could do on my end is dial back the core and memory clocks. I am running the GPUs at conservative clocks that have been stable for the last few years on all other OpenCL and HIP workloads (including applications that stress the core, memory, and bandwidth much more than Einstein GW tasks).

 

Host Info:

CPU: AMD 3960x

GPU: 2x AMD Radeon VII

OS: Arch Linux

Kernel: 6.4.3

ROCm Version: 5.6.0
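
One way to check whether the higher invalid count against windows/nvidia wingmen is just noise is to compare invalid rates per group rather than raw counts. The following is a minimal Python sketch, assuming the task list has been copied by hand into a list of dicts. The field names `wingman` and `outcome` are hypothetical, and the sample records are made up for illustration, not taken from the host above.

```python
from collections import defaultdict


def invalid_rate_by(tasks, key):
    """Group completed tasks by `key` and return {group: (invalid, total, rate)}.

    Any outcome other than "valid" is counted as invalid, which covers both
    "Validate error" and "Completed, marked as invalid".
    """
    counts = defaultdict(lambda: [0, 0])  # group -> [invalid, total]
    for task in tasks:
        group = task[key]
        counts[group][1] += 1
        if task["outcome"] != "valid":
            counts[group][0] += 1
    return {g: (bad, total, bad / total) for g, (bad, total) in counts.items()}


# Made-up sample: windows/nvidia wingmen are simply more common.
sample = (
    [{"wingman": "windows/nvidia", "outcome": "valid"}] * 4
    + [{"wingman": "windows/nvidia", "outcome": "invalid"}] * 2
    + [{"wingman": "linux/amd", "outcome": "valid"}] * 2
    + [{"wingman": "linux/amd", "outcome": "invalid"}] * 1
)

rates = invalid_rate_by(sample, "wingman")
for group, (bad, total, rate) in sorted(rates.items()):
    print(f"{group}: {bad}/{total} invalid ({rate:.0%})")
```

In this made-up sample the windows/nvidia group has twice as many invalids as linux/amd, but the rate is the same in both groups, which is what "just noise from the host mix" would look like. The same function can group by any other field, such as a frequency or a GPU index, if those are recorded.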

 

mikey
Joined: 22 Jan 05
Posts: 12702
Credit: 1839106724
RAC: 3613

If your GPU only has 4 GB of RAM, the new All-Sky GPU tasks will often fail.

Ian&Steve C.
Joined: 19 Jan 20
Posts: 3965
Credit: 47194502642
RAC: 65397221

He mentioned his GPU: it’s two Radeon VIIs, which have 16 GB each.

But I thought ROCm drivers stopped officially supporting Vega at ROCm 4.5.


tictoc
Joined: 1 Jan 13
Posts: 44
Credit: 7245565408
RAC: 7810396


The Radeon VII is Vega20, aka gfx906, and is officially supported in ROCm 5.6.  It will also be supported and receive new features and performance optimizations through ROCm 5.7.  Bug fixes and security patches will continue for Vega20 through Q2 2024, which is the slated EOM timeframe. Even after that it will probably still work with ROCm; it just won't have any testing or support from AMD.

Ian&Steve C.
Joined: 19 Jan 20
Posts: 3965
Credit: 47194502642
RAC: 65397221

The first thing I would try is run at all stock clocks with no added offsets or overclocks. See if the invalids situation improves. If so then it’s probably clocks related, and you can adjust core and mem separately to see if it’s more core or mem clock related. 
 

Stability and accuracy can definitely vary between workloads. Even if some clock settings work fine for other workloads, that doesn’t mean they’re guaranteed to work for all of them. For example, I can push clocks a bit harder on Einstein than I can for GPUGRID ACEMD3: settings that work fine for Einstein Gamma ray give me more errors and invalids on ACEMD3, so I just dial it back for everything because I don’t want to be changing them for each WU type.
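
To judge whether the invalids situation really improves at stock clocks, rather than just fluctuating, the two runs can be compared with a standard two-proportion z-test. This is a minimal Python sketch under the assumption that each task validates independently. The counts in the example are hypothetical, not results from this thread.

```python
import math


def two_proportion_z(bad_a, n_a, bad_b, n_b):
    """Two-sided two-proportion z-test on invalid counts from two runs.

    Returns (z, p_value). A small p_value suggests the invalid rates of the
    two runs genuinely differ. Uses the pooled-variance normal approximation,
    so it is only a rough guide for small task counts.
    """
    p_a, p_b = bad_a / n_a, bad_b / n_b
    pooled = (bad_a + bad_b) / (n_a + n_b)
    variance = pooled * (1 - pooled) * (1 / n_a + 1 / n_b)
    if variance == 0:
        raise ValueError("test undefined when all tasks are valid or all invalid")
    z = (p_a - p_b) / math.sqrt(variance)
    # Two-sided tail probability of the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts: 30 of 100 invalid with the old clocks, 5 of 100 at stock.
z, p = two_proportion_z(30, 100, 5, 100)
print(f"z = {z:.2f}, p = {p:.2g}")
```

Changing only one setting between runs (core or memory, not both) keeps the comparison interpretable.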


tictoc
Joined: 1 Jan 13
Posts: 44
Credit: 7245565408
RAC: 7810396


That's the plan.  Until now I was running a slight memory OC and fixed core clocks on the Radeon VIIs, due to the somewhat erratic nature of the boost clocks on Vega20.  The up-and-down boost clocks frequently cause issues with workloads like the GW tasks, which are not perfectly optimized, resulting in fairly large clock swings as the tasks progress.  Most workloads benefit from static core clocks, which keep the GPU from overboosting during portions of the work that are not maxing out the GPU.

 

I'll just reset everything to stock and see how it goes on a fresh host. 
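
After a reset, one way to confirm which clock level a card is actually running is to read the amdgpu driver's `pp_dpm_sclk` and `pp_dpm_mclk` sysfs files, where the active level is marked with `*`. The following is a minimal Python sketch of a parser for that text. The device path in the usage note and the sample contents are assumptions for illustration, and the exact levels listed will differ per card and kernel.

```python
import re

# Matches lines such as "1: 801Mhz" or "2: 1001Mhz *" (active level has "*").
LEVEL_RE = re.compile(r"^\s*(\d+):\s*(\d+)\s*Mhz\s*(\*)?\s*$", re.IGNORECASE)


def parse_dpm_levels(text):
    """Return a list of (level_index, mhz, is_active) tuples from pp_dpm_* text."""
    levels = []
    for line in text.splitlines():
        match = LEVEL_RE.match(line)
        if match:
            index, mhz, star = match.groups()
            levels.append((int(index), int(mhz), star is not None))
    return levels


def active_mhz(text):
    """Return the clock (MHz) of the level marked active, or None if none is."""
    for _, mhz, is_active in parse_dpm_levels(text):
        if is_active:
            return mhz
    return None


# Illustrative sample only, not captured from the host in this thread.
sample_mclk = "0: 351Mhz\n1: 801Mhz\n2: 1001Mhz *\n"
print(active_mhz(sample_mclk))
```

On a real system the text would come from something like `open("/sys/class/drm/card0/device/pp_dpm_mclk").read()`, where `card0` is an assumption and each GPU has its own `cardN` directory.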

tictoc
Joined: 1 Jan 13
Posts: 44
Credit: 7245565408
RAC: 7810396

So far no errors running at stock memory clocks. https://einsteinathome.org/host/12883441/tasks/4/0

 

I also did some other testing, and it looks like one of the VIIs has slightly weaker memory.  Unless something pops up with the rest of the tasks, it looks like this was 100% an issue on my end.  Sorry for the noise, and apologies to any wingmen who were caught up in my invalid tasks.
