State of MeerKAT result consistency across OS and GPU vendor (September 2023)

Wedge009
Joined: 5 Mar 05
Posts: 138
Credit: 17824931776
RAC: 6228155
Topic 230080

Just wondering what is the current state of MeerKAT development - if any. I don't have much experience with MeerKAT tasks, though I recall during the initial testing phase there was a lot of inconsistency between results from GPU types (AMD vs Nvidia) and OS platforms in terms of validation rates. And the most recent related topic I've found on this is this thread from March 2023.

I ask because I seem to be consistently getting results for this application type marked as invalid. Upon investigation, I noticed a pattern where the other two results returned for a given work unit are from Nvidia GPUs, or from Linux OS, so my little AMD GPU running on Windows loses out in the validation process.

For those interested in looking at the details, the host concerned is 12911781 (hmm, looks like there's an issue with ampersands in links being unhelpfully escaped). To those who might sneer at such a low-power device, from my perspective it's not about how powerful a device is, but making the most of the resources on hand - otherwise what would be the point of employing ARM-based devices in BOINC projects?

Because it is such a memory-limited device (and soldered memory too, so no option to upgrade as I had on previous generations of similar devices) I've switched between running FGRPB1G on Linux and MeerKAT on Windows - and it does consistently return valid results for FGRPB1G. Those are the only options for this device because ROCm OpenCL is only supported by discrete GPUs at this time, PAL OpenCL is only available on Windows, Mesa OpenCL reports a sufficient memory size such that the Einstein scheduler is willing to send FGRPB1G tasks to it (but is also three times slower than PAL OpenCL), and the Einstein scheduler refuses to send MeerKAT to Mesa-based OpenCL systems.

But all that is a bit of a digression from my main question: is there still a discrepancy expected in MeerKAT results across GPU types and OS platforms?

Soli Deo Gloria

Tom M
Joined: 2 Feb 06
Posts: 6763
Credit: 9724079134
RAC: 1873367

Wedge009 wrote:

But all that is a bit of a digression from my main question: is there still a discrepancy expected in MeerKAT results across GPU types and OS platforms?

Yes.

I frequently "lose out" against two Windows boxes on "invalid" tasks with GRP#1. I run a Petri-optimized version of the GRP#1 application under Linux. As long as your invalids are running in the 3-4% range it is "normal" :) If you are getting more than 3-4%, take a look at any overclocking you are doing on your GPU.

I do not remember the cross-GPU (Nvidia vs. AMD) invalid rates, but the OS-related invalids do seem to be there.

I know we are going to get a whole lot more experience with the MeerKAT (BRP7) tasks in the near future. Apparently GRP#1 tasks are finally going to stop being produced for real. That was announced once before, and then they found a bunch they wanted to re-process.

So there will be a LOT more invalid data to mull over.

Tom M

 

A Proud member of the O.F.A.  (Old Farts Association).

Wedge009
Joined: 5 Mar 05
Posts: 138
Credit: 17824931776
RAC: 6228155

I appreciate the response, but I was asking specifically about MeerKAT. I'm accustomed to the invalid rates for FGRPB1 and am also well aware of the (eventual?) plan to drop FGRPB1, as well as the CUDA-specific optimisations (ugh). I don't have a large sample size yet, but I am getting the majority of MeerKAT tasks marked as invalid, seemingly because of the large (relative to FGRPB1 results) discrepancy between OS and GPU types.

Nothing is overclocked (it's a low-power unit).

Soli Deo Gloria

Tom M
Joined: 2 Feb 06
Posts: 6763
Credit: 9724079134
RAC: 1873367

Wedge009 wrote:

I appreciate the response, but I was asking specifically about MeerKAT. I'm accustomed to the invalid rates for FGRPB1 and am also well aware of the (eventual?) plan to drop FGRPB1, as well as the CUDA-specific optimisations (ugh). I don't have a large sample size yet, but I am getting the majority of MeerKAT tasks marked as invalid, seemingly because of the large (relative to FGRPB1 results) discrepancy between OS and GPU types.

Nothing is overclocked (it's a low-power unit).

 

Which system? Looked at several that had no MeerKAT invalids at all.

Tom M

A Proud member of the O.F.A.  (Old Farts Association).

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5885
Credit: 119092621080
RAC: 23408198

Wedge009 wrote:
...I ask because I seem to be consistently getting results for this application type marked as invalid. Upon investigation, I noticed a pattern where the other two results returned for a given work unit are from Nvidia GPUs, or from Linux OS, so my little AMD GPU running on Windows loses out in the validation process.

I run Linux exclusively and I don't have any knowledge or previous experience of MeerKAT validation since I haven't been running it. I did try it out when it was first launched. It seemed a lot slower so I reverted to FGRPB1 quite quickly and didn't notice if there were issues with validation.

Fast forward to now, and I've been testing it out for when FGRPB1 finishes. When the announcement was made, I started testing a host with an RX 460 GPU but am now running an RX 570 on it, which is twice as fast as the RX 460 but 4 times slower than when it ran FGRPB1. That's not an issue - it is what it is - but today I decided to spend some time looking to see what validation was like. I'm used to invalids being less than 1% for FGRPB1 but I now see something like 8% of results being declared invalid.

I went through most quorums and found (almost invariably) that two Windows machines trumped my Linux machine. A lot of these were Windows/CUDA combinations so at first I was unsure if the issue was Windows vs Linux or CUDA vs OpenCL. However, as I looked further, I saw cases of Windows/OpenCL + Windows/CUDA trumping Linux/OpenCL so I don't think the problem is CUDA vs OpenCL. To me it seemed more likely to be a Windows vs Linux thing.

Wedge009 wrote:
... I've switched between running FGRPB1G on Linux and MeerKAT on Windows - and it does consistently return valid results for FGRPB1G. Those are the only options for this device because ROCm OpenCL is only supported by discrete GPUs at this time, PAL OpenCL is only available on Windows ...

I would have thought that since you ran FGRPB1 on Linux, you could also just switch to MeerKAT on the same Linux platform??  I'm using 1st gen GCN GPUs (Southern Islands) right through to 4th gen.  Your APU shows as having 2nd gen GCN (Sea Islands) so it should just work??  You don't need to use ROCm.  I'm not using ROCm at all.  I just added BRP7 and removed FGRPB1 on that machine and it finished the remaining old tasks and then carried on with the new.

Wedge009 wrote:
But all that is a bit of a digression from my main question: is there still a discrepancy expected in MeerKAT results across GPU types and OS platforms?

On very limited testing so far, my guess is yes to OS platforms.  If this turns out to be the case and can't be cured by tweaking the validator, and since BRP7 seems to be a long term thing, maybe the best action is to create two separate searches, one for Windows and one for Linux.

They probably won't be too keen to do that so maybe they'll spend time identifying exactly what the problem is and then fix it, particularly if enough people experience the same thing and start complaining. There should be a big switch shortly so a lot more people will be using BRP7 for the first time. I'll be keen to draw the attention of the Devs to any problems with validation being reported by the users.

 

Cheers,
Gary.

Wedge009
Joined: 5 Mar 05
Posts: 138
Credit: 17824931776
RAC: 6228155

Tom M wrote:
Which system? Looked at several that had no MeerKAT invalids at all.

The one cited in my original post.

Gary Roberts wrote:
I did try it out when it was first launched. It seemed a lot slower so I reverted to FGRPB1 quite quickly and didn't notice if there were issues with validation.

I think I took a similar approach, but I remember reading some discussion in the beta testing about how there was trouble with results consistency. I could be wrong (recollections are often so vague after all), but I thought this was partly the reason why the Linux application builds needed a few more revisions: to try to reduce the inconsistencies in the results relative to the Windows applications.

Gary Roberts wrote:
I'm used to invalids being less than 1% for FGRPB1 but I now see something like 8% of results being declared invalid.

That's the kind of anecdote I was looking for. If it's just the results themselves that are inconsistent - and not necessarily issues with hardware - then I think that helps with my acceptance of MeerKAT.

Gary Roberts wrote:
I would have thought that since you ran FGRPB1 on Linux, you could also just switch to MeerKAT on the same Linux platform??  I'm using 1st gen GCN GPUs (Southern Islands) right through to 4th gen.  Your APU shows as having 2nd gen GCN (Sea Islands) so it should just work??  You don't need to use ROCm.  I'm not using ROCm at all.  I just added BRP7 and removed FGRPB1 on that machine and it finished the remaining old tasks and then carried on with the new.

Use of ROCm isn't necessarily my choice - it's just what's available with AMD's Linux drivers (in terms of OpenCL support). The only other option with current drivers is legacy OpenCL (which I'm only using for my Fiji hardware). AMD's Linux support seems limited to discrete GPUs in the desktop space - if I had previous success with OpenCL and Linux on AMD's mobile GPUs, I would have ditched Windows on that hardware too.

But to answer your question about MeerKAT on Linux, since I haven't had success with AMD's OpenCL support on mobile hardware, I've experimented with Mesa - but as I stated in my original post the Einstein scheduler explicitly rejects Mesa OpenCL when considering MeerKAT task distribution. I don't really know why.

Soli Deo Gloria

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5885
Credit: 119092621080
RAC: 23408198

Tom M wrote:
Which system? Looked at several that had no MeerKAT invalids at all.

He posted a link to it in the opening message.

Sorry, hadn't noticed that OP had already pointed this out.

Cheers,
Gary.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5885
Credit: 119092621080
RAC: 23408198

Wedge009 wrote:
The only other option with current drivers is legacy OpenCL (which I'm only using for my Fiji hardware).

By 'legacy OpenCL' are you referring to the time when fglrx was the graphics driver and there was a proprietary OpenCL that came with it or are you talking about the AMDGPU-PRO proprietary driver package that started around 2016 and is still available with a legacy OpenCL (Orca) that works fine with all GCN generations of AMD GPUs?

The Linux I use (PCLinuxOS) doesn't provide a workable OpenCL.  AMD only support 3 main Linux variants (Red Hat, Ubuntu, OpenSUSE).  I'm using the Red Hat 20.40 version of AMDGPU-PRO and I extract just the bits that provide OpenCL (Orca).  I'm running a whole range of kernels (some very recent - 6.x) with the amdgpu kernel module as the graphics driver and Orca OpenCL continues to work without issue.

I started using this from the 16.60 version and it works just fine.  The latest I tried was 21.30 (I think) but that's when a lot of ROCm stuff was being introduced and things got complicated.  I've just continued with 20.40.

It took a bit of fiddling to find the quite small number of files needed to make Einstein OpenCL apps workable.  I'm fully aware of the fact that Mesa OpenCL isn't yet viable.  That's probably why E@H rejects it.  Maybe some day the Mesa Devs will fix it.  That would be nice.

 

Cheers,
Gary.

Wedge009
Joined: 5 Mar 05
Posts: 138
Credit: 17824931776
RAC: 6228155

Gary Roberts wrote:
By 'legacy OpenCL' are you referring to the time when fglrx was the graphics driver and there was a proprietary OpenCL that came with it or are you talking about the AMDGPU-PRO proprietary driver package that started around 2016 and is still available with a legacy OpenCL (Orca) that works fine with all GCN generations of AMD GPUs?

The latter - I remember the fglrx days, amdgpu-pro is a big improvement over that.

I couldn't see your machines - are you using discrete GPUs or other hardware? I've tried amdgpu-pro on laptops before, at least with Ubuntu variants, and no success (blank screen, boot failure, etc).

I believe I do much the same - relying on amdgpu for graphics and just using amdgpu-pro for OpenCL.

I also ran into complications during that time when ROCm was introduced (because BOINC wouldn't recognise GPUs for some reason), but ROCm-based OpenCL has been working well for me for a while now.

As I already described, Mesa OpenCL is the only option when AMD OpenCL doesn't work / isn't supported. It works for FGRPB1 (albeit three times slower than AMD OpenCL, again as described above), maybe it just doesn't work for MeerKAT tasks, I don't know. The scheduler log just says it's looking for '!Mesa'.

Soli Deo Gloria

Eugene Stemple
Joined: 9 Feb 11
Posts: 67
Credit: 401939822
RAC: 363478

MeerKAT 10% invalid rate. - Linux opencl-nvidia

Not a rigorous statistical sample, but looking at MeerKAT results for the past week I have 20 invalid and 186 valid. Among the 20 invalid: 11 had a pair of Windows cuda55 validating; 7 were a combo of Windows cuda55 and Windows opencl-ati; 2 were a pair of Windows opencl-ati.

Examining about 17 (most recently reported) of the validated work units, I have wingmen of three types: Windows cuda55, Windows opencl-ati and Linux opencl-nvidia, in a roughly (very!) equal distribution of 5, 7 and 5 respectively. This small sample is sufficient to show that my opencl-nvidia (Linux) DOES validate, at least sometimes, with either type of Windows wingman. The 10% invalid rate is higher than I would wish, but I don't see any particular combination of wingman apps that always fails. Evidently it is work unit data dependent, which makes it devilishly difficult to diagnose.
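
For anyone wanting to put rough error bars on a tally like the one above, here is a minimal Python sketch. It assumes each result is an independent pass/fail trial, which is a simplification given that invalids appear to depend on the work unit data. The function names are illustrative only and are not part of any BOINC or Einstein@Home tool; the Wilson score interval is simply one common choice for small samples.

```python
import math


def invalid_rate(invalid: int, valid: int) -> float:
    """Fraction of completed results that were marked invalid."""
    total = invalid + valid
    if total == 0:
        raise ValueError("no results to summarise")
    return invalid / total


def wilson_interval(invalid: int, valid: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for the invalid rate (z = 1.96 is roughly 95%)."""
    n = invalid + valid
    if n == 0:
        raise ValueError("no results to summarise")
    p = invalid / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half


# Counts taken from the post above: one week of MeerKAT results.
invalid, valid = 20, 186

# Wingman pairings that validated against the Linux opencl-nvidia result.
pairings = {
    "cuda55 + cuda55": 11,
    "cuda55 + opencl-ati": 7,
    "opencl-ati + opencl-ati": 2,
}
assert sum(pairings.values()) == invalid  # the breakdown accounts for every invalid

rate = invalid_rate(invalid, valid)
low, high = wilson_interval(invalid, valid)
print(f"invalid rate: {rate:.1%} (95% interval roughly {low:.1%} to {high:.1%})")
for pairing, count in pairings.items():
    print(f"  {pairing}: {count} ({count / invalid:.0%} of invalids)")
```

With only about 200 results the interval spans several percentage points, so an 8% rate on one host and a 10% rate on another are not clearly different on a sample of this size.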

 

Wedge009
Joined: 5 Mar 05
Posts: 138
Credit: 17824931776
RAC: 6228155

Thanks for that anecdote as well. All this suggests to me I shouldn't be so concerned about the comparatively high invalid rates for MeerKAT. When I had half-a-dozen or so marked as invalid/inconclusive in a row, I was getting concerned my system had something wrong with it.

Soli Deo Gloria
