I've now got 24 hours of operation running 0.12 MeerKAT for Windows/AMD on three hosts.
In summary: the improvement is fabulous. Linux may not match Windows.
Most striking of all: across the hundreds of v0.12 MeerKAT tasks run on the three systems, not one has thrown a computation error. So both the prompt failures (typically at about 5 seconds elapsed time) and the long ones (typically at the full normal elapsed time) have gone away, or at least dropped in frequency by well over an order of magnitude.
The prompt validation rate is improving by the hour. The short 24-hour deadline on these tasks means that in recent hours my first quorum partner is more often than not also a v0.12 task, and those just validate. I spot-checked recent inconclusives, and so far they have tended to be paired with older versions.
There may, however, be a glum shadow on the horizon--Linux.
My sole recent invalid result lost out to a pair of Linux v0.12 tasks. Also, one or two inconclusives appear to have failed to match Linux v0.12 tasks closely enough.
Spot-checking over a dozen validating tasks in my recent results, I find that every single one had a Windows quorum partner. Most were CUDA55/Nvidia, some AMD. None were Linux.
As Linux users are a dominant tribe on this enterprise, if there really is a Windows/Linux result mismatch problem on v0.12 it will be serious.
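A spot check like this is easier to repeat if the outcomes are tallied per partner platform. Here is a minimal Python sketch, assuming the records are copied by hand from the task pages as (partner OS, outcome) pairs. The function name and the outcome labels are made up for illustration; nothing here comes from the project itself.

```python
from collections import Counter, defaultdict


def valid_rate_by_partner(records):
    """Return the fraction of 'valid' outcomes for each quorum partner OS.

    `records` is an iterable of (partner_os, outcome) pairs, where outcome
    is a label such as 'valid', 'invalid' or 'inconclusive' (assumed labels).
    """
    counts = defaultdict(Counter)
    for partner_os, outcome in records:
        counts[partner_os][outcome] += 1
    # Share of results per platform that ended up valid.
    return {
        os_name: by_outcome["valid"] / sum(by_outcome.values())
        for os_name, by_outcome in counts.items()
    }
```

A clearly lower valid rate against one platform than the other would support the cross-platform mismatch suspicion, though a dozen or so records is too few to settle it.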
BRP7 app version 0.12 auto-downloaded at 1130 GMT 9/2; six work units have completed so far but no wing persons yet so no data on validation matches. It's an Nvidia GTX1060 for the cuda55 app. Wall-clock run times are in the 22 minute range (1340 seconds) and CPU times are in the 60 second range. "nvidia-smi" says 497 MB of GPU memory and 97% utilization. Since I'm not a Beta-test host, and the 0.12 app is listed in the "Applications" page, I presume this is the first production version release. Updated versions to follow in the coming weeks I'm sure.
Eugene Stemple wrote: so far but no wing persons yet so no data on validation matches. It's an Nvidia GTX1060 for the cuda55 app.
Since you posted, I see you have picked up three validations. I find it interesting that your Linux/Cuda55 machine has validated repeatedly against Windows quorum partners, both Nvidia and AMD.
Well then, there must be a difference in running 3x concurrent between Windows and Linux hosts, as running 3x concurrent on my Windows 10 hosts (3080ti, 3070ti, 3060ti) ends in Error for almost all tasks.
I have since changed my app_config.xml file to run 2x concurrent and we'll see what that brings when I get some more Beta tasks.
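For anyone following along: the multiplicity is set through the `gpu_usage` value in app_config.xml. Each task claims that fraction of a GPU, so 0.5 gives 2x and 0.33 gives 3x. Below is a minimal Python sketch that produces the file contents. The helper `build_app_config` is hypothetical, the app name `einsteinbinary_BRP7` used in the note afterwards is an assumption to check against what your own client reports, and the `cpu_usage` default of 1.0 is only a placeholder.

```python
from xml.etree import ElementTree as ET


def build_app_config(app_name: str, tasks_per_gpu: int, cpu_usage: float = 1.0) -> str:
    """Return app_config.xml text that runs `tasks_per_gpu` tasks on each GPU."""
    if not 1 <= tasks_per_gpu <= 100:
        raise ValueError("tasks_per_gpu must be between 1 and 100")
    # Each task claims 1/N of a GPU. Round down to two decimals so that
    # N shares never add up to more than one whole GPU.
    gpu_share = (100 // tasks_per_gpu) / 100

    root = ET.Element("app_config")
    app = ET.SubElement(root, "app")
    ET.SubElement(app, "name").text = app_name
    gpu_versions = ET.SubElement(app, "gpu_versions")
    ET.SubElement(gpu_versions, "gpu_usage").text = f"{gpu_share:.2f}"
    ET.SubElement(gpu_versions, "cpu_usage").text = f"{cpu_usage:.2f}"
    if hasattr(ET, "indent"):  # pretty-printing is available from Python 3.9
        ET.indent(root)
    return ET.tostring(root, encoding="unicode")
```

Calling `build_app_config("einsteinbinary_BRP7", 2)` yields a `gpu_usage` of 0.50. The result goes into app_config.xml in the project's folder under the BOINC data directory, and the client picks it up after a restart or after telling it to re-read its config files.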
bluestang wrote: Well then there must be a difference in running 3x concurrent between Windows and Linux hosts. As running 3x concurrent on my Windows 10 hosts (3080ti, 3070ti, 3060ti) throws almost all Error.
I have three Windows 10 machines, running three different AMD GPU cards--5700, 6800, and 6800 XT. On the 0.12 BRP7-opencl-ati application they have been running 2X, 3X, and 4X without throwing any runtime errors at all (I don't count the few download errors for this discussion).
So whatever the problem is that you suffer, it must be a bit narrower than all of Windows.
The older applications threw frequent errors on all my systems at all multiplicities I tried, including 1X. It is possible that it got even worse at higher multiplicity, but I did not collect the data to observe that.
AMD is using OpenCL and NVIDIA is using cuda55...my bet is that is the issue. I say that because my lowly AMD 7870XT is doing 2x happily.
Not sure why they are not using a higher version of CUDA.
User Boca Raton High school reported no issues running multiples. He has Windows Nvidia hosts also.
Something else is wrong with your configuration somehow.
Your error suggests that it’s looking for devices that don’t exist (CUDA device #2 on a single GPU system?). That might be a clue to which configuration is not correct. Maybe something with BOINC or your drivers has become corrupt.
Is this the machine you're talking about? It shows as having a 2GB AMD 7800 series GPU.
I looked at the tasks list and the only MeerKAT tasks still showing were run with the v0.01 app over a month ago. Have you actually run concurrent tasks more recently with something closer to a production version?
Cheers,
Gary.
I have no issues running any other projects and doing concurrent WUs. Was doing 3x FGRPB1G on my 3080ti, 3070ti and 3060ti just fine prior to the Meerkat tasks. Doesn't look like all are "Error While Computing", just a vast majority.
Switching to 2x helped, but still has some issues. I'm sure they'll get it all worked out with info we are supplying. I'm just trying to give as much info as possible to help in that.
Does Meerkat need/use more GPU power/resources? The only thing I can think of is that I have my GPUs running at 80 and 85% power, but like I said I don't have any issues running 2x or 3x concurrent on other GPU projects...3x on Einstein FGRPB1G, 3x on PrimeGrid, 2x on Moo.
Certainly seems like Windows/Linux cross-validation issues.
All of my invalids from yesterday lost out to a pair of Windows hosts.
All of my valids from yesterday validated against another Linux host.
Will cross-post to the technical thread where Bernd is monitoring.