Outcomes on MeerKAT 0.05

archae86

Joined: 6 Dec 05

Posts: 3165

Credit: 7394701687

RAC: 1977916

30 Aug 2022 3:07:12 UTC

Topic 228047


MeerKAT 0.05 (BRP7-opencl-ati) Results Today on three Windows/AMD machines

OVERVIEW

The nearly 300 tasks processed on my three Windows hosts on August 29, 2022 had a very poor success rate.
Many, many "Error while computing" and "Validate error" failures: almost 50% of tasks.
Successful validation only with Windows AMD quorum partners, never with Nvidia.

DETAILS

These results were observed on three machines which ran nothing but MeerKAT for about nine hours today.
The machines have a long history of running Einstein Gamma-Ray Pulsar GPU tasks with about a 1% invalid rate and negligible other failures.

Three download errors: not many, but not normally seen.
About 30 instant "Error while computing" failures (about 5 seconds elapsed time).
About 10 "Error while computing" failures after normal run time.
Over 100 "Validate error" results after normal run time. Generally these same WUs have generated Validate errors on other Windows machines, both AMD and Nvidia.
Five "Completed, marked as invalid" results after normal run time, generally losing out to a pair of Nvidia partners.
About 37 currently shown as "validation inconclusive", commonly with Windows Nvidia partners.

Just three have validated, all with Windows AMD partners (two of them also had Nvidia partners which failed to match).
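As a rough cross-check of the "almost 50%" figure, the counts listed above can be tallied. This is only a sketch: the "about" and "over" numbers are treated as exact, 300 stands in for "nearly 300", and the dictionary keys are labels chosen for this example rather than anything the project reports.

```python
# Approximate counts from the list above. "Over 100" is used as 100,
# so the failure total is a lower bound.
TOTAL_TASKS = 300  # stands in for "nearly 300"

outcomes = {
    "download error": 3,
    "error while computing (instant)": 30,
    "error while computing (after normal run time)": 10,
    "validate error": 100,
    "completed, marked as invalid": 5,
    "validation inconclusive": 37,
    "validated": 3,
}

# Outcomes counted as outright failures; inconclusive results are still open.
FAILURE_KINDS = {
    "download error",
    "error while computing (instant)",
    "error while computing (after normal run time)",
    "validate error",
    "completed, marked as invalid",
}


def failure_count(counts):
    """Sum the counts of every outcome that is an outright failure."""
    return sum(n for kind, n in counts.items() if kind in FAILURE_KINDS)


def failure_fraction(counts, total):
    """Fraction of all processed tasks that ended in an outright failure."""
    return failure_count(counts) / total


if __name__ == "__main__":
    failed = failure_count(outcomes)
    print(f"{failed} failures of about {TOTAL_TASKS} tasks "
          f"= {failure_fraction(outcomes, TOTAL_TASKS):.0%}")
```

With these numbers the tally gives 148 failures, roughly 49% of about 300 tasks. The listed outcomes add up to 188, so the remaining tasks are not accounted for in the list.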

bluestang

Joined: 13 Apr 15

Posts: 34

Credit: 2492970228

RAC: 0

30 Aug 2022 17:00:03 UTC

Message 200377


A very high percentage of Errors on my NVIDIA GPUs (3070ti and 2x 1660ti)

Nice waste of resources for me. I have since unchecked the app in my Project Prefs, as I didn't know I had it checked to begin with lol.

I do run concurrent WUs on my GPUs so maybe MeerKAT doesn't like that? Which would be another bad mark against it as well.

EDIT: Looks like running anything but 1 WU of MeerKAT per GPU will throw a Computation Error pretty quick. Which sorta sucks as it only loads the GPU at 80% on a 3070ti so there is plenty of GPU left to run concurrent WUs.

archae86

Joined: 6 Dec 05

Posts: 3165

Credit: 7394701687

RAC: 1977916

30 Aug 2022 17:07:15 UTC

Message 200379 in response to message 200377


As far as I can tell so far, on my Windows AMD GPU setup the MeerKAT tasks run at 2X, 3X, and 4X with improved productivity and without constant error.

The error rate on my systems may well vary with multiplicity, but I suspect there is an Nvidia versus AMD difference in this respect.
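For anyone wanting to try 2X, 3X, or 4X: the posts don't say which mechanism was used, but one common way on a BOINC client is an app_config.xml file in the project directory, where gpu_usage is the fraction of a GPU each task claims. The sketch below just generates that file's text. The app name "einsteinbinary_BRP7" is an assumption (check the names your own client reports), and the helper function is made up for this example.

```python
import math


def app_config_xml(app_name: str, tasks_per_gpu: int, cpu_per_task: float = 1.0) -> str:
    """Build app_config.xml text asking BOINC to run N tasks per GPU.

    gpu_usage is rounded DOWN to three decimals so that N tasks never
    add up to more than one whole GPU (e.g. 3 x 0.333 = 0.999).
    """
    if tasks_per_gpu < 1:
        raise ValueError("tasks_per_gpu must be at least 1")
    gpu_usage = math.floor(1000 / tasks_per_gpu) / 1000
    return (
        "<app_config>\n"
        "  <app>\n"
        f"    <name>{app_name}</name>\n"
        "    <gpu_versions>\n"
        f"      <gpu_usage>{gpu_usage:.3f}</gpu_usage>\n"
        f"      <cpu_usage>{cpu_per_task:.3f}</cpu_usage>\n"
        "    </gpu_versions>\n"
        "  </app>\n"
        "</app_config>\n"
    )


if __name__ == "__main__":
    # ASSUMPTION: "einsteinbinary_BRP7" is a guess at the MeerKAT app name.
    print(app_config_xml("einsteinbinary_BRP7", tasks_per_gpu=3))
```

After saving the output as app_config.xml, the client has to re-read its config files (or be restarted) before the new multiplicity takes effect.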

Ian&Steve C.

Joined: 19 Jan 20

Posts: 4155

Credit: 50078138221

RAC: 42268183

30 Aug 2022 17:12:48 UTC

Message 200380 in response to message 200377


I ran multiples (2x) just fine on my Linux system. But a single task still loaded the GPU (3080Ti) to about 90-95%, and v0.03 even maxed out the memory controller load.

There is only a slight benefit from running multiples with certain versions of the app. But the project is constantly changing which app version is being sent out as they test which versions work best, so you need to take that into account when comparing results or behaviors. With respect to Nvidia, the versions (v0.03, 0.05, 0.07) all differ slightly in GPU utilization, and the OpenCL apps will differ from the CUDA apps. There may even be some Windows versus Linux differences.

The app is still in beta. If you don't want to deal with errors or issues, or otherwise aid in the testing and troubleshooting, then disable beta task processing.


cecht

Joined: 7 Mar 18

Posts: 1618

Credit: 3030520240

RAC: 1440685

30 Aug 2022 17:29:33 UTC

Message 200381 in response to message 200379


archae86 wrote:

As far as I can tell so far, on my Windows AMD GPU setup the MeerKAT tasks run at 2X, 3X, and 4X with improved productivity and without constant error.

Similar results on Linux AMD host. Running at 3X works well, even with FGRPBG1 tasks mixed in. But only with RX 570 cards. On my RX 5600XT GPU, MeerKAT tasks either fail to validate or stall out and are subsequently aborted.

Ideas are not fixed, nor should they be; we live in model-dependent reality.

archae86

Joined: 6 Dec 05

Posts: 3165

Credit: 7394701687

RAC: 1977916

30 Aug 2022 18:01:47 UTC

Message 200383 in response to message 200381


cecht wrote:

Running at 3X works well, even with FGRPBG1 tasks mixed in.

I've avoided running mixed on purpose, but in the mixing that has happened at my boundaries I've seen neither quick failures nor wildly disproportionate resource allocation.

On a slightly different point, the task duration correction factor that results from running these is not hugely different on my systems from that for FGRPBG1. That gives hope that the crazy feast/famine behavior I've seen before when allowing mixing on my Einstein systems will be much less of a problem.

Come to think of it, I'll turn FGRPBG1 permission back on. Since it is not deadly, it is probably a broader test to run with mixing allowed.

archae86

Joined: 6 Dec 05

Posts: 3165

Credit: 7394701687

RAC: 1977916

1 Sep 2022 19:41:33 UTC

Message 200480 in response to message 200383


One of my hosts has run a bit over 40 MeerKAT tasks on the v0.12 Windows application for AMD cards.

https://einsteinathome.org/host/12260865/tasks/0/57?page=3&sort=desc&order=Sent

In summary, this is doing vastly better than my report on the v0.05 situation.

I have not yet seen a single "Error while Computing". By now I would have expected to get several each of the fast (about 5 seconds elapsed time) and slow (abend at end of normal run time) type. Possibly these may have been fixed by the same memory adjustment Bernd applied hoping to quell the hang-ups seen on other platforms.

I have seen several validations against Nvidia card machines also running v0.12. Previously there were zero of these.

It is too early to give a good guess at overall success rate, as the waters are muddied by a preponderance of WUs for which the quorum partner ran an earlier version.

But it is not too early to say that the current behavior is a huge improvement.

JohnDK

Joined: 25 Jun 10

Posts: 122

Credit: 2671867322

RAC: 1507526

1 Sep 2022 20:39:57 UTC

Message 200482


Is it "normal" that there's a big difference in runtime? The 0.12 tasks I've completed ran anywhere from 163 to 922 secs.

archae86

Joined: 6 Dec 05

Posts: 3165

Credit: 7394701687

RAC: 1977916

1 Sep 2022 20:57:16 UTC

Message 200484 in response to message 200482


JohnDK wrote:

Is it "normal" that there's a big difference in runtime? The 0.12 tasks I've completed ran anywhere from 163 to 922 secs.

I've seen a high preponderance of just one length, but also a sub-population of tasks which run in much less time, say around a quarter of the dominant time.

An additional variable which contributes to run-time variation is mixed running. When both GRP and BRP7 tasks are active at the same moment, the GRP task runs a little faster than normal and the BRP7 task a little slower. It is not a huge effect, though, unlike some past combinations of applications.

Ian&Steve C.

Joined: 19 Jan 20

Posts: 4155

Credit: 50078138221

RAC: 42268183

1 Sep 2022 22:56:03 UTC

Message 200488 in response to message 200482


JohnDK wrote:

Is it "normal" that there's a big difference in runtime? The 0.12 tasks I've completed ran anywhere from 163 to 922 secs.

Segment 4s run a little faster than segment 3s.

Also check the number of templates used in the stderr.txt output. 50,000 seems to be the standard, but earlier units had far fewer templates and ran a lot faster. So it's possible you got some early WUs that got pushed back out for reprocessing.


archae86

Joined: 6 Dec 05

Posts: 3165

Credit: 7394701687

RAC: 1977916

1 Sep 2022 23:16:31 UTC

Message 200490 in response to message 200488


Ian&Steve C. wrote:

Also check the number of templates used in the stderr.txt output. 50,000 seems to be the standard, but earlier units had far fewer templates and ran a lot faster. So it's possible you got some early WUs that got pushed back out for reprocessing.

Spot checking a few of mine that ran much like the others, I see exactly 50000 templates mentioned near the top of stderr. I got two "shorties" on a machine where most tasks were taking about 1380 elapsed seconds; they took 413 and 395 elapsed seconds. Both listed 14359 templates.
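Those numbers fit the idea that run time scales with template count. The sketch below assumes run time is simply proportional to the number of templates (a simplification, not something the project has stated) and uses only the figures quoted above; the function name is made up for this example.

```python
# Figures quoted in the post above.
STANDARD_TEMPLATES = 50_000   # templates in a typical task
STANDARD_SECONDS = 1380.0     # typical elapsed time on that machine
SHORT_TEMPLATES = 14_359      # templates listed by both "shorties"
OBSERVED_SECONDS = [413.0, 395.0]


def predicted_seconds(templates: int,
                      ref_templates: int = STANDARD_TEMPLATES,
                      ref_seconds: float = STANDARD_SECONDS) -> float:
    """Scale the reference run time linearly by template count.

    ASSUMPTION: elapsed time is proportional to the number of templates,
    with no fixed start-up cost.
    """
    return ref_seconds * templates / ref_templates


if __name__ == "__main__":
    predicted = predicted_seconds(SHORT_TEMPLATES)
    print(f"predicted: {predicted:.0f} s")
    for seconds in OBSERVED_SECONDS:
        deviation = (seconds - predicted) / predicted
        print(f"observed: {seconds:.0f} s ({deviation:+.1%} vs. prediction)")
```

The prediction comes out at about 396 seconds, so both observed times are within about 5% of it. 14359 templates is also about 29% of 50000, in line with the "around a quarter" short tasks mentioned earlier in the thread.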
