Outcomes on MeerKAT 0.05

archae86
Joined: 6 Dec 05
Posts: 3146
Credit: 7060884931
RAC: 1165531

I've now got 24 hours of operation running 0.12 MeerKAT for Windows/AMD on three hosts.

In summary: the improvement is fabulous.  Linux results may not validate against Windows results.

Most striking of all, not one of the hundreds of v0.12 MeerKAT tasks run by any of the three systems has thrown a computation error.  So both types of error, the prompt ones (typically about 5 seconds of elapsed time) and the long ones (typically at the full normal elapsed time), have gone away, or at least dropped in frequency by well over an order of magnitude.

The prompt validation rate is improving by the hour.  The short 24-hour deadline on these tasks means that in recent hours my first quorum partner is more often than not also a v0.12 task, and those just validate.  I spot-checked recent inconclusives, and so far they have tended to be paired with older versions.

There may, however, be a glum shadow on the horizon--Linux.

My sole recent invalid result lost out to a pair of Linux v0.12 tasks.  Also, one or two inconclusives appear to have failed to match Linux v0.12 tasks closely enough.

Spot-checking over a dozen validating tasks in my recent results, I find that every single one had a Windows quorum partner.  Most were CUDA55/Nvidia, some AMD.  None were Linux.

As Linux users are a dominant tribe on this enterprise, if there really is a Windows/Linux result mismatch problem on v0.12 it will be serious.
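
A spot check like the one above can be tabulated mechanically. Here is a minimal sketch in Python, assuming the results have already been copied by hand into (own OS, partner OS, outcome) tuples. The function name and the sample rows are placeholders for illustration, not real task data, and nothing here queries the project website.

```python
from collections import Counter


def tally_by_pairing(results):
    """Count outcomes, split by whether the quorum partner ran the same OS."""
    counts = Counter()
    for own_os, partner_os, outcome in results:
        pairing = "same-OS" if own_os == partner_os else "cross-OS"
        counts[(pairing, outcome)] += 1
    return counts


# Placeholder rows for illustration only (not real task data).
sample = [
    ("Windows", "Windows", "valid"),
    ("Windows", "Windows", "valid"),
    ("Windows", "Linux", "inconclusive"),
    ("Windows", "Linux", "invalid"),
]

for (pairing, outcome), n in sorted(tally_by_pairing(sample).items()):
    print(f"{pairing:9s} {outcome:13s} {n}")
```

If the cross-OS rows pile up under invalid or inconclusive while the same-OS rows validate, that would point toward a platform mismatch rather than a problem with any one host.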

 

 

Ian&Steve C.
Joined: 19 Jan 20
Posts: 3716
Credit: 34692343075
RAC: 26708433

certainly seems like Windows/Linux cross-validation issues.

all of my invalids from yesterday lost out to a pair of Windows hosts.

all of my valids from yesterday validated against another Linux host.

 

will cross-post to the technical thread where Bernd is monitoring.

_________________________________________________________________________

Eugene Stemple
Joined: 9 Feb 11
Posts: 58
Credit: 271932225
RAC: 303678

BRP7 app version 0.12 auto-downloaded at 1130 GMT 9/2; six work units have completed so far but no wing persons yet so no data on validation matches.  It's an Nvidia GTX1060 for the cuda55 app.  Wall-clock run times are in the 22 minute range (1340 seconds) and CPU times are in the 60 second range.  "nvidia-smi" says 497 MB of GPU memory and 97% utilization.  Since I'm not a Beta-test host, and the 0.12 app is listed in the "Applications" page, I presume this is the first production version release.  Updated versions to follow in the coming weeks I'm sure.

 

archae86
Joined: 6 Dec 05
Posts: 3146
Credit: 7060884931
RAC: 1165531

Eugene Stemple wrote:
so far but no wing persons yet so no data on validation matches.  It's an Nvidia GTX1060 for the cuda55 app.

Since you posted, I see you have picked up three validations.  I find it interesting that your Linux/cuda55 machine has validated repeatedly against Windows quorum partners, both Nvidia and AMD.

bluestang
Joined: 13 Apr 15
Posts: 34
Credit: 2492970228
RAC: 1314

Well then, there must be a difference between Windows and Linux hosts when running 3x concurrent, as running 3x concurrent on my Windows 10 hosts (3080ti, 3070ti, 3060ti) throws errors on almost all tasks.

I have since changed my app_config.xml file to run 2x concurrent, and we'll see what that brings when I get some more Beta tasks.
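
For reference, the 2x setting comes down to one number in app_config.xml: gpu_usage is the fraction of a GPU each task claims, so N concurrent tasks means roughly 1/N. A minimal Python sketch that prints such a file follows. The app name einsteinbinary_BRP7 and the cpu_usage of 1.0 are assumptions, so check the app names in your own client_state.xml before using it.

```python
import math

# Layout follows BOINC's app_config.xml format.
TEMPLATE = """<app_config>
  <app>
    <name>{name}</name>
    <gpu_versions>
      <gpu_usage>{gpu:.3f}</gpu_usage>
      <cpu_usage>{cpu:.2f}</cpu_usage>
    </gpu_versions>
  </app>
</app_config>
"""


def build_app_config(app_name, tasks_per_gpu, cpu_per_task=1.0):
    """Return app_config.xml text that lets tasks_per_gpu tasks share one GPU."""
    if tasks_per_gpu < 1:
        raise ValueError("tasks_per_gpu must be at least 1")
    # Round down to three decimals so N tasks never add up to more than 1.0 GPU.
    gpu_usage = math.floor(1000 / tasks_per_gpu) / 1000
    return TEMPLATE.format(name=app_name, gpu=gpu_usage, cpu=cpu_per_task)


# Assumed app name; 2 tasks per GPU corresponds to "2x concurrent".
print(build_app_config("einsteinbinary_BRP7", 2))
```

The file goes in the project's folder under the BOINC data directory, and the client picks it up after re-reading its config files or restarting.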

archae86
Joined: 6 Dec 05
Posts: 3146
Credit: 7060884931
RAC: 1165531

bluestang wrote:
Well then, there must be a difference between Windows and Linux hosts when running 3x concurrent, as running 3x concurrent on my Windows 10 hosts (3080ti, 3070ti, 3060ti) throws errors on almost all tasks.

I have three Windows 10 machines, running three different AMD GPU cards--5700, 6800, and 6800 XT.  On the 0.12 BRP7-opencl-ati application they have been running 2X, 3X, and 4X without throwing any runtime errors at all (I don't count the few download errors for this discussion).

So whatever the problem is that you suffer, it must be a bit narrower than all of Windows.

The older applications threw frequent errors on all my systems at all multiplicities I tried, including 1X.  It is possible that it got even worse at higher multiplicity, but I did not collect the data to observe that.

bluestang
Joined: 13 Apr 15
Posts: 34
Credit: 2492970228
RAC: 1314

AMD is using OpenCL and NVIDIA is using cuda55...my bet is that is the issue.  I say that because my lowly AMD 7870XT is doing 2x happily.

Not sure why they are not using a higher version of CUDA.

Ian&Steve C.
Joined: 19 Jan 20
Posts: 3716
Credit: 34692343075
RAC: 26708433

bluestang wrote:

AMD is using OpenCL and NVIDIA is using cuda55...my bet is that is the issue.  I say that because my lowly AMD 7870XT is doing 2x happily.

Not sure why they are not using a higher version of CUDA.

user Boca Raton High school reported no issues running multiples. He has Windows Nvidia hosts also. 
 

something else is wrong with your configuration somehow. 
 

your error suggests that it’s looking for devices that don’t exist (CUDA device #2 on a single GPU system?). That might be a clue to which configuration is not correct. Maybe something in BOINC or your drivers has become corrupted.

_________________________________________________________________________

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5850
Credit: 110039253980
RAC: 22387097

bluestang wrote:
... my lowly AMD 7870XT is doing 2x happily.

Is this the machine you're talking about?  It shows as having a 2GB AMD 7800 series GPU.

I looked at the tasks list and the only MeerKAT tasks still showing were run with the v0.01 app over a month ago.  Have you actually run concurrent tasks more recently with something closer to a production version??

Cheers,
Gary.

bluestang
Joined: 13 Apr 15
Posts: 34
Credit: 2492970228
RAC: 1314

I have no issues running any other projects and doing concurrent WUs.  Was doing 3x FGRPB1G on my 3080ti, 3070ti and 3060ti just fine prior to the MeerKAT tasks.  It doesn't look like all are "Error While Computing", just a vast majority.

Switching to 2x helped, but still has some issues.  I'm sure they'll get it all worked out with info we are supplying.  I'm just trying to give as much info as possible to help in that.

Does MeerKAT need/use more GPU power/resources?  The only thing I can think of is that I have my GPUs running at 80 and 85% power, but like I said, I don't have any issues running 2x or 3x concurrent on other GPU projects: 3x on Einstein FGRPB1G, 3x on PrimeGrid, 2x on Moo.
