Outcomes on MeerKAT 0.05

archae86

Joined: 6 Dec 05

Posts: 3157

Credit: 7219624931

RAC: 975722

2 Sep 2022 14:06:41 UTC

Message 200496 in response to message 200480

I've now got 24 hours of operation running 0.12 MeerKAT for Windows/AMD on three hosts.

In summary: the improvement is fabulous. Linux may not match Windows.

Most striking of all, none of the hundreds of 0.12 MeerKAT tasks run on the three systems has thrown a computation error. So both types of error, the prompt one (typically at about 5 seconds elapsed time) and the long one (typically at the full normal elapsed time), have gone away, or at least dropped in frequency by well over an order of magnitude.

The prompt validation rate is improving by the hour. The short 24-hour deadline on these tasks means that in recent hours my first quorum partner is more often than not also a v0.12 task, and those just validate. I spot-checked recent inconclusives, and so far their quorum partners have tended to be older versions.

There may, however, be a glum shadow on the horizon--Linux.

My sole recent invalid result lost out to a pair of Linux v0.12 tasks. Also, one or two inconclusives appear to have failed to match Linux v0.12 tasks closely enough.

Spot-checking over a dozen validating tasks in my recent results, I find that every single one had a Windows quorum partner. Most were CUDA55/Nvidia, some AMD. None were Linux.

As Linux users are a dominant tribe on this enterprise, if there really is a Windows/Linux result mismatch problem on v0.12 it will be serious.
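The spot check described above can be made repeatable with a small tally. This is only a sketch: it assumes the outcome and the quorum partner's OS have been copied by hand from the task pages into (outcome, partner_os) pairs, and the function name and the sample rows are hypothetical placeholders, not real results.

```python
from collections import Counter


def tally_by_partner_os(results):
    """Count task outcomes per quorum-partner OS.

    `results` is an iterable of (outcome, partner_os) pairs collected by
    hand from the task pages, e.g. ("valid", "Windows").
    """
    table = {}
    for outcome, partner_os in results:
        # One Counter of outcomes for each partner OS seen so far.
        table.setdefault(partner_os, Counter())[outcome] += 1
    return table


if __name__ == "__main__":
    # Placeholder rows for illustration only, NOT real results.
    sample = [
        ("valid", "Windows"),
        ("valid", "Windows"),
        ("invalid", "Linux"),
        ("inconclusive", "Linux"),
    ]
    for partner_os, counts in sorted(tally_by_partner_os(sample).items()):
        print(partner_os, dict(counts))
```

If valid results cluster under one partner OS and invalids or inconclusives under the other, that points toward a cross-platform mismatch rather than a problem with a single host.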

Ian&Steve C.

Joined: 19 Jan 20

Posts: 3945

Credit: 46669112642

RAC: 64164874

2 Sep 2022 14:21:44 UTC

Message 200498

certainly seems like windows/linux cross validation issues.

all of my invalids from yesterday lost out to a pair of Windows hosts.

all of my valids from yesterday validated against another Linux host.

will cross-post to the technical thread where Bernd is monitoring.

Eugene Stemple

Joined: 9 Feb 11

Posts: 67

Credit: 373541117

RAC: 544152

2 Sep 2022 16:52:34 UTC

Message 200507

BRP7 app version 0.12 auto-downloaded at 1130 GMT 9/2; six work units have completed so far but no wing persons yet so no data on validation matches. It's an Nvidia GTX1060 for the cuda55 app. Wall-clock run times are in the 22 minute range (1340 seconds) and CPU times are in the 60 second range. "nvidia-smi" says 497 MB of GPU memory and 97% utilization. Since I'm not a Beta-test host, and the 0.12 app is listed in the "Applications" page, I presume this is the first production version release. Updated versions to follow in the coming weeks I'm sure.

archae86

Joined: 6 Dec 05

Posts: 3157

Credit: 7219624931

RAC: 975722

2 Sep 2022 17:21:11 UTC

Message 200508 in response to message 200507

Eugene Stemple wrote:

so far but no wing persons yet so no data on validation matches. It's an Nvidia GTX1060 for the cuda55 app.

Since you posted, I see you have picked up three validations. I find it interesting that your Linux/Cuda55 machine has validated repeatedly against Windows quorum partners, both Nvidia and AMD.

bluestang

Joined: 13 Apr 15

Posts: 34

Credit: 2492970228

RAC: 0

2 Sep 2022 19:56:25 UTC

Message 200512

Well then there must be a difference in running 3x concurrent between Windows and Linux hosts, as running 3x concurrent on my Windows 10 hosts (3080ti, 3070ti, 3060ti) throws errors on almost all tasks.

I have since changed my app_config.xml file to run 2x concurrent and we'll see what that brings when I get some more Beta tasks.
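For anyone following along, running N tasks per GPU in app_config.xml comes down to setting gpu_usage to 1/N (0.5 for 2x). Below is a minimal Python sketch that generates such a file. The app name einsteinbinary_BRP7 and the cpu_usage of 1.0 are assumptions, not taken from this post, so check the real app name in your client_state.xml before using it. `ET.indent` needs Python 3.9 or later.

```python
import xml.etree.ElementTree as ET


def build_app_config(app_name: str, tasks_per_gpu: int, cpu_per_task: float = 1.0) -> str:
    """Return app_config.xml text that lets `tasks_per_gpu` tasks share one GPU."""
    if tasks_per_gpu < 1:
        raise ValueError("tasks_per_gpu must be at least 1")
    root = ET.Element("app_config")
    app = ET.SubElement(root, "app")
    ET.SubElement(app, "name").text = app_name
    gpu = ET.SubElement(app, "gpu_versions")
    # Each task claims 1/N of a GPU, so the client runs N tasks per card.
    ET.SubElement(gpu, "gpu_usage").text = f"{1.0 / tasks_per_gpu:.2f}"
    # CPU share reserved per GPU task (1.0 is an assumed value).
    ET.SubElement(gpu, "cpu_usage").text = f"{cpu_per_task:.2f}"
    ET.indent(root, space="  ")  # pretty-print; Python 3.9+
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Assumed app name; replace with the name your client actually reports.
    print(build_app_config("einsteinbinary_BRP7", tasks_per_gpu=2))
```

The resulting file goes in the project's folder inside the BOINC data directory, and the client picks it up on restart or when told to re-read its config files.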

archae86

Joined: 6 Dec 05

Posts: 3157

Credit: 7219624931

RAC: 975722

2 Sep 2022 20:57:08 UTC

Message 200513 in response to message 200512

bluestang wrote:

Well then there must be a difference in running 3x concurrent between Windows and Linux hosts, as running 3x concurrent on my Windows 10 hosts (3080ti, 3070ti, 3060ti) throws errors on almost all tasks.

I have three Windows 10 machines, running three different AMD GPU cards--5700, 6800, and 6800 XT. On the 0.12 BRP7-opencl-ati application they have been running 2X, 3X, and 4X without throwing any runtime errors at all (I don't count the few download errors for this discussion).

So whatever the problem is that you suffer, it must be a bit narrower than all of Windows.

The older applications threw frequent errors on all my systems at all multiplicities I tried, including 1X. It is possible that it got even worse at higher multiplicity, but I did not collect the data to observe that.

bluestang

Joined: 13 Apr 15

Posts: 34

Credit: 2492970228

RAC: 0

2 Sep 2022 21:05:47 UTC

Message 200515

AMD is using OpenCL and NVIDIA is using cuda55...my bet is that is the issue. I say that because my lowly AMD 7870XT is doing 2x happily.

Not sure why they are not using a higher version of CUDA.

Ian&Steve C.

Joined: 19 Jan 20

Posts: 3945

Credit: 46669112642

RAC: 64164874

2 Sep 2022 21:34:57 UTC

Message 200516 in response to message 200515

bluestang wrote:

AMD is using OpenCL and NVIDIA is using cuda55...my bet is that is the issue. I say that because my lowly AMD 7870XT is doing 2x happily.

Not sure why they are not using a higher version of CUDA.

user Boca Raton High school reported no issues running multiples. He has Windows Nvidia hosts also.

something else is wrong with your configuration somehow.

your error suggests that it’s looking for devices that don’t exist (CUDA device #2 on a single GPU system?). That might be a clue to which part of the configuration is not correct. Maybe something in BOINC or your drivers has become corrupted.

Gary Roberts

Moderator

Joined: 9 Feb 05

Posts: 5872

Credit: 117502117015

RAC: 35436972

2 Sep 2022 23:58:32 UTC

Message 200520 in response to message 200515

bluestang wrote:

... my lowly AMD 7870XT is doing 2x happily.

Is this the machine you're talking about? It shows as having a 2GB AMD 7800 series GPU.

I looked at the tasks list and the only MeerKAT tasks still showing were run with the v0.01 app over a month ago. Have you actually run concurrent tasks more recently with something closer to a production version??

Cheers,
Gary.

bluestang

Joined: 13 Apr 15

Posts: 34

Credit: 2492970228

RAC: 0

2 Sep 2022 23:58:43 UTC

Message 200521

I have no issues running any other projects and doing concurrent WUs. Was doing 3x FGRPB1G on my 3080ti, 3070ti and 3060ti just fine prior to the Meerkat tasks. Doesn't look like all are "Error While Computing", just a vast majority.

Switching to 2x helped, but it still has some issues. I'm sure they'll get it all worked out with the info we are supplying. I'm just trying to give as much info as possible to help with that.

Does Meerkat need/use more GPU power/resources? The only thing I can think of is that I have my GPUs running at 80 and 85% power, but like I said I don't have any issues running 2x or 3x concurrent on other GPU projects: 3x on Einstein FGRPB1G, 3x on PrimeGrid, 2x on Moo.
