I'm seeing a very high ratio of invalid GW tasks. Roughly 30% of tasks that complete successfully are invalid.
https://einsteinathome.org/host/12883788/tasks/5/56
No obvious pattern on the invalids. Some are straight "Validate error"s and others are "Completed, marked as invalid".
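For anyone wanting to put a number on their own host, here is a minimal sketch of how an invalid ratio like the roughly 30% above could be tallied from a list of task statuses. The two invalid status strings are the ones quoted in this post; the "Completed and validated" label and the sample counts are assumptions for illustration, not data from the host in question.

```python
from collections import Counter

# The two invalid statuses are quoted in the post above.
# "Completed and validated" is an assumed label for a good result.
INVALID_STATUSES = {"Validate error", "Completed, marked as invalid"}


def invalid_ratio(statuses):
    """Return the fraction of completed tasks whose result was rejected."""
    counts = Counter(statuses)
    invalid = sum(counts[s] for s in INVALID_STATUSES)
    total = sum(counts.values())
    return invalid / total if total else 0.0


# Hypothetical sample, not real data from this host.
sample = (
    ["Completed and validated"] * 7
    + ["Validate error"] * 2
    + ["Completed, marked as invalid"] * 1
)
print(f"invalid ratio: {invalid_ratio(sample):.0%}")
```

Counting both invalid statuses together matches how the ratio is described here; splitting them out would only need the `Counter` values printed individually.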
I have plenty of temperature, clock, and power headroom. About the only thing I could do on my end is dial back the core and memory clocks. I am running the GPUs at conservative clocks that have been stable for the last few years on all other OpenCL and HIP workloads (including applications that stress the core, memory, and bandwidth much more than Einstein GW tasks).
Host Info:
CPU: AMD 3960x
GPU: 2x AMD Radeon VII
OS: Arch Linux
Kernel: 6.4.3
ROCm Version: 5.6.0
Copyright © 2024 Einstein@Home. All rights reserved.
If your GPU only has 4 GB of RAM on it, the new All-Sky GPU tasks will often fail.
He mentioned his GPUs: it’s two Radeon VIIs, which have 16 GB each.
But I thought ROCm drivers stopped officially supporting Vega at ROCm 4.5.
_________________________________________________________________________
The Radeon VII is Vega20, aka gfx906, and is officially supported in ROCm 5.6. It will also be supported and receive new features and performance optimizations through ROCm 5.7. Bug fixes and security patches will continue for Vega20 through Q2 2024, which is the slated EOM timeframe. Even after that it will probably still work with ROCm; it just won't have any testing or support from AMD.
The first thing I would try is to run at all stock clocks, with no added offsets or overclocks, and see if the invalids situation improves. If it does, then it’s probably clock related, and you can adjust core and mem separately to see whether it’s more the core clock or the mem clock.
Stability and accuracy can definitely vary between workloads. Clock settings that work fine for some workloads aren’t guaranteed to work for all of them. For example, I can push clocks a bit harder on Einstein than I can for GPUGRID ACEMD3: there I get more errors and invalids with the same settings that work fine for Einstein Gamma ray, so I just dial it back for everything because I don’t want to be changing clocks for each WU type.
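As a rough sketch of that isolation step: run a batch at stock, then a batch with only the core setting restored, then a batch with only the memory setting restored, and compare the invalid ratios. The run names, the 5% cutoff, and the numbers in the example are all assumptions for illustration, not measurements.

```python
def diagnose(rates, threshold=0.05):
    """Attribute invalids to core or memory clocks from three test runs.

    rates maps a run name to its observed invalid ratio (0.0 to 1.0):
      "stock"     - everything at stock clocks
      "core_only" - only the custom core setting restored
      "mem_only"  - only the custom memory setting restored
    threshold is an assumed cutoff for an acceptable invalid ratio.
    """
    if rates["stock"] > threshold:
        return "not clock related (invalids persist at stock)"
    core_bad = rates["core_only"] > threshold
    mem_bad = rates["mem_only"] > threshold
    if core_bad and mem_bad:
        return "both core and memory clocks"
    if core_bad:
        return "core clock"
    if mem_bad:
        return "memory clock"
    return "inconclusive (each setting is clean on its own)"


# Hypothetical ratios from three batches of tasks.
print(diagnose({"stock": 0.0, "core_only": 0.01, "mem_only": 0.30}))
```

Each batch needs enough tasks for the ratio to mean something, since validation depends on wingmen and can lag behind completion.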
_________________________________________________________________________
That's the plan. Until now, I was running a slight memory OC and fixed core clocks on the Radeon VIIs, due to the somewhat erratic nature of the boost clocks on Vega20. The up-and-down boost clocks frequently cause issues with workloads like the GW tasks, which are not perfectly optimized, resulting in fairly large clock swings as the tasks progress. Most workloads benefit from static core clocks, which keep the GPU from overboosting during portions of the work that are not maxing out the GPU.
I'll just reset everything to stock and see how it goes on a fresh host.
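For watching those clock swings while a task runs, here is a small sketch that parses the text the amdgpu driver exposes in its `pp_dpm_sclk` and `pp_dpm_mclk` sysfs files, where the active level is marked with an asterisk. The sample text and its clock values are made up, and the path in the comment assumes the GPU is `card0`, which will differ between hosts.

```python
import re


def parse_dpm_levels(text):
    """Parse pp_dpm_sclk / pp_dpm_mclk text into (levels, active_index).

    levels maps the level index to its clock in MHz; active_index is the
    level marked with '*', or None if no level is marked.
    """
    levels = {}
    active = None
    for line in text.splitlines():
        m = re.match(r"\s*(\d+):\s*(\d+)\s*mhz\s*(\*)?", line, re.IGNORECASE)
        if not m:
            continue
        idx = int(m.group(1))
        levels[idx] = int(m.group(2))
        if m.group(3):
            active = idx
    return levels, active


# Made-up sample; on a real host the text would be read from something like
# /sys/class/drm/card0/device/pp_dpm_sclk (card index is an assumption).
sample = "0: 808Mhz\n1: 1350Mhz *\n2: 1801Mhz\n"
levels, active = parse_dpm_levels(sample)
print(f"active core clock: {levels[active]} MHz")
```

Polling this once a second during a GW task would show whether the core clock is holding steady or bouncing between levels.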
So far no errors running at stock memory clocks. https://einsteinathome.org/host/12883441/tasks/4/0
I also did some other testing, and it looks like one of the VIIs has slightly weaker memory. Unless something pops up with the rest of the tasks, it looks like this was 100% an issue on my end. Sorry for the noise, and apologies to any wingmen who were caught up in my invalid tasks.