High Ratio of Invalid O3ASHF1 Tasks

tictoc

Joined: 1 Jan 13

Posts: 44

Credit: 7245565408

RAC: 7810396

20 Jul 2023 1:37:10 UTC

Topic 229824


I'm seeing a very high ratio of invalid GW tasks. Roughly 30% of tasks that complete successfully are invalid.

https://einsteinathome.org/host/12883788/tasks/5/56

No obvious pattern on the invalids. Some are straight "Validate error"s and others are "Completed, marked as invalid".

Validation fails against all host types (windows/nvidia, windows/amd, linux/nvidia, linux/amd). There are more invalids against windows/nvidia hosts, but that is very likely just noise, since the majority of Einstein hosts are windows/nvidia.
Every frequency range I have run has both valid and invalid results: 872Hz, 900Hz, 986Hz, 1022Hz, 1104Hz, and 1230Hz.
The number of concurrent tasks doesn't seem to have any effect on the number of invalids; I've successfully run and validated tasks at 1x, 2x, 3x, and 4x.
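One way to check whether a grouping such as wingman platform really matters is to compare the invalid rate per group instead of the raw invalid count, since the largest group will always collect the most invalids. The following is only a minimal sketch: the `(group, outcome)` record layout and the sample rows are made up for illustration and are not data from this host.

```python
from collections import Counter


def invalid_rate_by_group(results):
    """Return {group: (invalid_count, total_count, invalid_rate)}.

    `results` is an iterable of (group, outcome) pairs. Any outcome other
    than "valid" is counted as invalid.
    """
    totals = Counter()
    invalids = Counter()
    for group, outcome in results:
        totals[group] += 1
        if outcome != "valid":
            invalids[group] += 1
    return {g: (invalids[g], totals[g], invalids[g] / totals[g]) for g in totals}


# Hypothetical sample rows, NOT taken from the real task list.
sample = [
    ("windows/nvidia", "valid"), ("windows/nvidia", "invalid"),
    ("windows/nvidia", "valid"), ("windows/nvidia", "valid"),
    ("linux/amd", "valid"), ("linux/amd", "invalid"),
]

for group, (bad, total, rate) in sorted(invalid_rate_by_group(sample).items()):
    print(f"{group}: {bad}/{total} invalid ({rate:.0%})")
```

The same function works for any other grouping (frequency band, number of concurrent tasks, or which GPU ran the task) as long as that label is recorded for each result.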

I have plenty of temperature, clock, and power headroom. About the only thing I could do on my end is dial back the core and memory clocks. I am running the GPUs at conservative clocks that have been stable for the last few years on all other OpenCL and HIP workloads (including applications that stress the core, memory, and bandwidth much more than Einstein GW tasks).

Host Info:

CPU: AMD 3960x

GPU: 2x AMD Radeon VII

OS: Arch Linux

Kernel: 6.4.3

ROCm Version: 5.6.0

mikey

Joined: 22 Jan 05

Posts: 12702

Credit: 1839106724

RAC: 3613


20 Jul 2023 12:25:30 UTC

Message 214983


tictoc wrote:

I'm seeing a very high ratio of invalid GW tasks. Roughly 30% of tasks that complete successfully are invalid.

https://einsteinathome.org/host/12883788/tasks/5/56

No obvious pattern on the invalids. Some are straight "Validate error"s and others are "Completed, marked as invalid".

Validation fails against all host types (windows/nvidia, windows/amd, linux/nvidia, linux/amd). There are more invalids against windows/nvidia hosts, but that is very likely just noise, since the majority of Einstein hosts are windows/nvidia.

Every frequency range I have run has both valid and invalid results: 872Hz, 900Hz, 986Hz, 1022Hz, 1104Hz, and 1230Hz.

The number of concurrent tasks doesn't seem to have any effect on the number of invalids; I've successfully run and validated tasks at 1x, 2x, 3x, and 4x.

I have plenty of temperature, clock, and power headroom. About the only thing I could do on my end is dial back the core and memory clocks. I am running the GPUs at conservative clocks that have been stable for the last few years on all other OpenCL and HIP workloads (including applications that stress the core, memory, and bandwidth much more than Einstein GW tasks).

Host Info:

CPU: AMD 3960x

GPU: 2x AMD Radeon VII

OS: Arch Linux

Kernel: 6.4.3

ROCm Version: 5.6.0

If your GPU only has 4GB of RAM on it, the new All-Sky GPU tasks will often fail.

Ian&Steve C.

Joined: 19 Jan 20

Posts: 3965

Credit: 47194502642

RAC: 65397221


20 Jul 2023 16:21:06 UTC

Message 214992


He mentioned his GPU, it’s two Radeon VIIs which have 16GB each.

But I thought ROCm drivers stopped officially supporting Vega at ROCm 4.5.

_________________________________________________________________________

tictoc

Joined: 1 Jan 13

Posts: 44

Credit: 7245565408

RAC: 7810396


20 Jul 2023 19:39:03 UTC

Message 214998 in response to message 214992


Ian&Steve C. wrote:

He mentioned his GPU, it’s two Radeon VIIs which have 16GB each.

But I thought ROCm drivers stopped officially supporting Vega at ROCm 4.5.

The Radeon VII is Vega20, aka gfx906, and is officially supported in ROCm 5.6. It will also be supported and receive new features and performance optimizations through ROCm 5.7. Bug fixes and security patches will continue for Vega20 through Q2 2024, which is the slated EOM timeframe. Even after that it will probably still work with ROCm; it just won't have any testing or support from AMD.

Ian&Steve C.

Joined: 19 Jan 20

Posts: 3965

Credit: 47194502642

RAC: 65397221


20 Jul 2023 20:21:13 UTC

Message 215001


The first thing I would try is run at all stock clocks with no added offsets or overclocks. See if the invalids situation improves. If so then it’s probably clocks related, and you can adjust core and mem separately to see if it’s more core or mem clock related.

Stability/accuracy can definitely vary between workloads. Even if some clock settings work fine for other workloads, that doesn’t mean they’re guaranteed to work for all workloads. For example, I can push clocks a bit harder on Einstein than I can for GPUGRID ACEMD3: the same settings that work fine for Einstein Gamma ray give higher errors and invalids there, so I just dial it back for everything because I don’t want to be changing them for each WU type.

_________________________________________________________________________

tictoc

Joined: 1 Jan 13

Posts: 44

Credit: 7245565408

RAC: 7810396


20 Jul 2023 20:35:42 UTC

Message 215002 in response to message 215001


Ian&Steve C. wrote:

The first thing I would try is run at all stock clocks with no added offsets or overclocks. See if the invalids situation improves. If so then it’s probably clocks related, and you can adjust core and mem separately to see if it’s more core or mem clock related.

Stability/accuracy can definitely vary between workloads. Even if some clock settings work fine for other workloads, that doesn’t mean they’re guaranteed to work for all workloads. For example, I can push clocks a bit harder on Einstein than I can for GPUGRID ACEMD3: the same settings that work fine for Einstein Gamma ray give higher errors and invalids there, so I just dial it back for everything because I don’t want to be changing them for each WU type.

That's the plan. Up to now I have been running a slight memory OC and fixed core clocks on the Radeon VIIs, due to the somewhat erratic nature of the boost clocks on Vega20. The up and down boost clocks frequently cause issues with workloads like the GW tasks, which are not perfectly optimized, resulting in fairly large clock swings as the tasks progress. Most workloads benefit from static core clocks, which keep the GPU from overboosting during portions of the work that are not maxing out the GPU.

I'll just reset everything to stock and see how it goes on a fresh host.
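For anyone wanting to do the same reset on Linux, here is a rough sketch of one way it could be done through the amdgpu sysfs files. It assumes the overclock was applied via `pp_od_clk_voltage` (which needs overdrive enabled in the driver) and that the script runs as root; the thread does not say which tool was actually used to set the clocks, and the card path in the usage comment is only an example.

```python
from pathlib import Path


def reset_amdgpu_clocks(device_dir):
    """Reset overdrive clocks and return to automatic power management.

    `device_dir` is an amdgpu sysfs device directory, for example
    /sys/class/drm/card0/device (the card index is an assumption and
    differs per system).
    """
    device = Path(device_dir)
    od_table = device / "pp_od_clk_voltage"
    perf_level = device / "power_dpm_force_performance_level"

    od_table.write_text("r\n")       # "r" restores the default clock/voltage table
    od_table.write_text("c\n")       # "c" commits the restored table
    perf_level.write_text("auto\n")  # let the driver pick clock levels again
    return perf_level.read_text().strip()


# Usage (as root, once per GPU; the path is illustrative):
#   reset_amdgpu_clocks("/sys/class/drm/card0/device")
```

To tell a core problem from a memory problem afterwards, reapply only one of the two offsets at a time and watch the invalid rate over a reasonable batch of tasks before changing the other.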

tictoc

Joined: 1 Jan 13

Posts: 44

Credit: 7245565408

RAC: 7810396


22 Jul 2023 2:25:11 UTC

Message 215048


So far no errors running at stock memory clocks. https://einsteinathome.org/host/12883441/tasks/4/0

I also did some other testing, and it looks like one of the VIIs has slightly weaker memory. Unless something pops up with the rest of the tasks, it looks like this was 100% an issue on my end. Sorry for the noise, and apologies to any wingmen who were caught up in my invalid tasks.
