MeerKAT 0.05 (BRP7-opencl-ati) Results Today on three Windows/AMD machines
OVERVIEW
The nearly 300 tasks processed on my three Windows hosts on August 29, 2022 had very poor success.
Many, many "Error while computing" and "Validate error" failures: almost 50% of tasks
Successful validation only with Windows AMD quorum partners, never with Nvidia
DETAILS
These results were observed from three machines which ran MeerKAT 100% for about nine hours today
The machines have a long history of running Einstein Gamma-Ray Pulsar GPU tasks with about 1% invalid rate and negligible other failures.
Three download errors--not many, but not normally seen
About 30 instant "Error while computing" failures (about 5 seconds elapsed time)
About 10 "Error while computing" failures after normal run time
Over 100 Validate errors after normal run time--generally these same WUs have generated Validate errors on other Windows machines, both AMD and Nvidia
Five "Completed, marked as invalid" results after normal run time, generally losing out to a pair of Nvidia partners
About 37 currently shown as validation inconclusive, commonly with Windows Nvidia partners
Just three have validated, all with Windows AMD partners (two had Nvidia partners which failed to match)
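As a rough cross-check of the "almost 50%" figure, the counts above can be tallied in a few lines. This is only a sketch: the counts are the approximate ones listed in this post ("about 30", "over 100" and so on, taken at face value), and the total of 300 is an assumption standing in for "nearly 300".

```python
# Approximate counts from the list above; "about"/"over" values taken at face value.
counts = {
    "download_error": 3,
    "compute_error_instant": 30,   # failed after about 5 seconds
    "compute_error_late": 10,      # failed after normal run time
    "validate_error": 100,
    "invalid": 5,
    "inconclusive": 37,
    "valid": 3,
}
TOTAL_TASKS = 300  # assumption: "nearly 300" rounded up


def failure_fraction(counts, total):
    """Fraction of tasks ending in a compute error or a validate error."""
    failed = (
        counts["compute_error_instant"]
        + counts["compute_error_late"]
        + counts["validate_error"]
    )
    return failed / total


frac = failure_fraction(counts, TOTAL_TASKS)
print(f"compute + validate errors: {frac:.1%} of tasks")
```

With these inputs the share comes out a little under one half, consistent with the overview.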
Copyright © 2024 Einstein@Home. All rights reserved.
A very high percentage of Errors on my NVIDIA GPUs (3070ti and 2x 1660ti)
Nice waste of resources on it for me, and I have since unchecked the app in my Project Prefs, as I didn't know I had it checked to begin with lol.
I do run concurrent WUs on my GPUs so maybe MeerKAT doesn't like that? Which would be another bad mark against it as well.
EDIT: Looks like running anything but 1 WU of MeerKAT per GPU will throw a Computation Error pretty quick. Which sorta sucks as it only loads the GPU at 80% on a 3070ti so there is plenty of GPU left to run concurrent WUs.
As far as I can tell at this point, on my Windows AMD GPU setup the MeerKAT tasks run at 2X, 3X, and 4X with improved productivity and without constant errors.
It is possible that there might be an error rate difference with multiplicity for my systems, but I suspect there may be an Nvidia to AMD difference in this respect.
I ran multiples (2x) just fine on my Linux system. But a single task still loaded the GPU (3080Ti) like 90-95%. And v0.03 maxed out the memory controller load even.
There was only a slight benefit from running multiples with certain versions of the app. But the project is constantly changing which app version is being sent out as they test which versions work best, so you need to take that into account when comparing results or behaviors. With respect to Nvidia, the versions (v0.03, 0.05, 0.07) all differ slightly in GPU utilization, and the OpenCL apps will differ from the CUDA apps. There may even be some Windows<->Linux differences.
The app is still in beta. If you don't want to deal with errors or issues or otherwise aid in the testing and troubleshooting, then disable beta task processing.
_________________________________________________________________________
archae86 wrote: So far as I
Similar results on Linux AMD host. Running at 3X works well, even with FGRPBG1 tasks mixed in. But only with RX 570 cards. On my RX 5600XT GPU, MeerKAT tasks either fail to validate or stall out and are subsequently aborted.
Ideas are not fixed, nor should they be; we live in model-dependent reality.
cecht wrote:Running at 3X
I've avoided running mixed, but in the mixing that has happened at my boundaries I've seen neither quick failure nor wildly disproportionate resource allocation.
On a slightly different point, the task duration correction factor result of running these is not hugely different on my systems from FGRPBG1, which gives hope that the crazy feast/famine behavior I've seen before when allowing mixing on my Einstein systems can be much less of a problem.
Come to think of it, I'll turn FGRPBG1 permission back on. Since it is not deadly, it is probably a broader test to run with mixing allowed.
One of my hosts has run a bit over 40 MeerKAT tasks on the v0.12 Windows application for AMD cards.
https://einsteinathome.org/host/12260865/tasks/0/57?page=3&sort=desc&order=Sent
In summary, this is doing vastly better than my report on the v0.05 situation.
I have not yet seen a single "Error while Computing". By now I would have expected to get several each of the fast (about 5 seconds elapsed time) and slow (abend at end of normal run time) type. Possibly these may have been fixed by the same memory adjustment Bernd applied hoping to quell the hang-ups seen on other platforms.
I have seen several validations against Nvidia card machines also running v0.12. Previously there were zero of these.
It is too early to give a good guess at overall success rate, as the waters are muddied by a preponderance of WUs for which the quorum partner ran an earlier version.
But it is not too early to say that the current behavior is a huge improvement.
Is it "normal" that there's a big difference in runtime? The 0.12 tasks I've completed run from 163 to 922 secs.
JohnDK wrote: Is it "normal"
I've seen a high preponderance of just one length, but a sub-population of ones which run in much less time--say around a quarter the dominant time.
An additional variable which contributes to run-time variation is mixed running. When both GRP and BRP7 tasks are active at the same moment, the GRP task runs a little faster than normal, and the BRP7 a little slower, but it is not a huge effect, unlike some past combinations of applications.
JohnDK wrote: Is it "normal"
Segment 4s run a little faster than segment 3s.
Also check the number of templates used in the stderr.txt output. 50,000 seems to be the standard, but earlier units had far fewer templates and ran a lot faster. So it's possible you got some early WUs that got pushed back out for reprocessing.
_________________________________________________________________________
Ian&Steve C. wrote:also check
Spot checking a few of mine that ran much like the others, I see exactly 50000 templates mentioned near the top of stderr. I got two "shorties" on a machine for which most tasks were consuming about 1380 elapsed seconds; these used 413 and 395 elapsed seconds. Both listed 14359 templates.
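Those numbers are close to what simple proportional scaling would give. The sketch below assumes run time grows linearly with template count, which is my assumption and not something the project has stated; the function name is only illustrative.

```python
def predicted_runtime(full_runtime_s, full_templates, templates):
    """Scale run time linearly with template count (assumed model)."""
    return full_runtime_s * templates / full_templates


# Figures from the post above: 50000 templates took about 1380 s,
# and the two short tasks each listed 14359 templates.
pred = predicted_runtime(1380, 50000, 14359)
print(f"predicted: {pred:.0f} s, observed: 413 s and 395 s")
```

Under that assumption the short tasks land within a few percent of the prediction, which fits the idea that template count is the main driver of the run-time difference.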