I have 2 machines here running Einstein through BOINC: an iMac with an NVIDIA 780M 4GB on the latest Catalina (10.15.1), and an older MacPro with an ATI RX590 on Mojave (10.14.6).
At some point over the last few weeks GPU tasks have been failing on both machines. Anyone have any insight into what's up? Two GPU architectures, two generations of OS, both failing.
I have changed nothing on either machine. I just took some time off from processing, and when I came back it no longer seemed to work. I'm running the stock OS and the stock drivers (the ones built into OS X) for these cards. They used to do GPU processing fine, so I'm not sure what's changed.
Thanks in advance.
Example ATI: https://einsteinathome.org/task/903101231
Example NVIDIA: https://einsteinathome.org/task/901897165
CIA wrote: I have changed ...
While the machines were having a break, did you happen to upgrade the OS on either of them to its current version? I know nothing about this but I've seen some comments about more recent versions of MacOS not supporting OpenCL any more.
The ATI task you linked to shows "TIME_LIMIT_EXCEEDED". In other words, the task didn't crash but wasn't progressing at the expected rate. Maybe whatever Apple uses in place of OpenCL can still run OpenCL apps but rather slowly. On the same tasks list there was a completed task showing as 'pending' which didn't take quite as long, so avoided the time limit. If that task actually validates, it would imply that MacOS can still handle the running of tasks that use OpenCL, but perhaps very inefficiently.
I also looked at the nvidia task link. All the compute errors I saw on that machine were for the new GW tasks and each gave the same "TIME_LIMIT_EXCEEDED" so it seems like the same story.
That machine had a bunch of earlier FGRPB1G tasks that were not started by the deadline. There was one that must have been in progress at the time of the deadline that subsequently completed and validated after the deadline. Its crunch time was a lot longer than you would expect for FGRPB1G tasks so even that different OpenCL based search is not being run efficiently by whatever Apple now uses to run OpenCL.
If you haven't upgraded/updated the OS recently, then I really don't know why tasks are now taking so long that they exceed the time limit. Particularly for FGRPB1G, if tasks used to run fine, and absolutely nothing has changed, then they still should. The one FGRPB1G task I saw that validated took 12,186 secs. Is that the same time you saw previously? I would hope you used to see times a lot less than that.
Cheers,
Gary.
The initial issue (based on a single result) on my older 10.14.6 MacPro with the ATI card seemed to have been a one off fluke. All subsequent GPU work from that machine has been processing fine.
As for the NVIDIA-based machine, I'm unsure what's going on there. It's on the latest MacOS, Catalina (10.15.1), whereas the MacPro is limited to the previous OS, Mojave (10.14.6). Catalina has been out long enough that I suppose if others were having an issue they would have reported it here by now.
You mention time might be an issue. This is an old (by computing standards) machine from 2013 running a mobile part, the 780M, which isn't speedy anymore.
Could it just be a matter of it being so out of date that running GPU tasks on it takes too long? I feel like even if it takes forever, the results should still be OK as long as you submit them before the WU deadline...
It takes about 3 hours to process a single unit on the iMac using Gravitational Wave search O2 Multi-Directional GPU v2.03 () x86_64-apple-darwin, which from what I've read seems to be OpenCL based. The iMac with the NVIDIA GPU and the newest OS seems to run version 2.03 of the app, while the MacPro with the ATI GPUs and the one-generation-older OS runs version 2.02 without issue.
I don't know if the app versions are related to the OS, or to the fact that one GPU is AMD and one is NVIDIA, but I do notice that 2.03 seems to be the most recent version on Windows and Linux as well.
Looking in more detail at the unit for my most recent failure, which happened just now, shows I'm not the only one (or the only OS, for that matter) having issues.
https://einsteinathome.org/workunit/429078559
CIA wrote: The initial issue ...
That TIME_LIMIT_EXCEEDED (TLE) error wasn't a fluke and it could easily happen again. The time that tripped the limit was 11,114 secs. You have a validated task that got to 10,784 secs which is pretty close to the time limit. Fortunately, more recent times seem to be a bit lower than that.
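To put a number on "pretty close", here is a minimal sketch using only the two run times quoted above. It assumes that the 11,114 secs at which the failed task was stopped is effectively the limit itself; the function and variable names are made up for illustration.

```python
# Figures quoted in the post above, in seconds.
TRIPPED_AT = 11_114    # elapsed time of the task that hit TIME_LIMIT_EXCEEDED
VALIDATED_AT = 10_784  # elapsed time of a task that finished and validated


def headroom(elapsed: float, limit: float) -> float:
    """Fraction of the limit left unused by a task that ran for `elapsed` seconds."""
    return (limit - elapsed) / limit


# Assumption: the abort time is treated as the limit.
margin = headroom(VALIDATED_AT, TRIPPED_AT)
print(f"{TRIPPED_AT - VALIDATED_AT} s to spare ({margin:.1%} of the limit)")
```

Under that assumption the validated task finished with 330 secs to spare, about 3% of the limit, so a small slowdown on any one task would be enough to trip it.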
Your machine has a Xeon CPU with plenty of cores running at 3.47 GHz - so no shortage of CPU support for the RX 590 Graphics Compute Engine. Looking at your results, and guessing that you aren't running multiple concurrent tasks, the run times seem much longer than you would expect. I formed that view because I have an RX 570 that runs 3 concurrent 'Vela' tasks and takes around 50 mins for all 3 - i.e. ~17 mins per task. I don't own an RX 590 but I imagine one of those should be faster than an RX 570, and certainly not slower.
I haven't tested mine with different multiplicities on the current tasks but if I did run tasks singly, they would probably take less than 30 mins each. Such a large time difference on very similar hardware most likely points to some sort of software issue, probably the current driver that handles OpenCL.
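The comparison can be made explicit with a rough sketch. The figures are the ones quoted in this thread (50 mins for 3 concurrent tasks on the RX 570, a guessed 30 mins for a single task, and the 10,784 secs validated task on the RX 590); the helper name is invented for illustration, and the tasks compared are assumed to be of similar size.

```python
def effective_seconds_per_task(batch_minutes: float, concurrent: int) -> float:
    """Average wall-clock seconds per task when `concurrent` tasks finish together."""
    return batch_minutes * 60 / concurrent


rx570_effective = effective_seconds_per_task(50, 3)  # 3 tasks in ~50 mins
rx570_single_guess = 30 * 60                         # upper guess for one task run alone
rx590_observed = 10_784                              # validated task on the MacPro

print(f"RX 570 effective time per task: {rx570_effective / 60:.1f} min")
print(f"RX 590 vs RX 570 (3x concurrent): {rx590_observed / rx570_effective:.1f}x slower")
print(f"RX 590 vs RX 570 (single, guessed): {rx590_observed / rx570_single_guess:.1f}x slower")
```

Even against the pessimistic 30 min single-task guess, the RX 590 result is roughly 6 times slower, which is hard to explain by hardware alone.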
Yes, and there have been comments about Apple using something called "Metal" instead of OpenCL - I don't really know anything about that.
However, I recently tried out an old retired machine with a GTX 750Ti GPU. The CPU was a Celeron dual core G540 running at 2.5GHz. I would expect your iMac with the 780M to do much better than mine could. Mine ran the same type of tasks directed at the Vela pulsar. It was able to complete results in well under an hour, despite the poor CPU support. Here is a page of results it achieved before I shut it down again. I tried it for a short period just to see how an nvidia GPU would go since all my working hosts have AMD GPUs.
My machine is even older and more out of date, yet it produces results much faster. You can draw your own conclusions :-).
Of course the results should be fine. It's just a shame they take so long. It would be nice if that could be improved. Maybe questioning Apple might get you some answers.
As far as crunching goes, there is no difference between v2.02 and v2.03. The reason for the 2.03 version was announced here.
Yes, but two of the three errors were TLEs. We are probably going to see quite a few of these because Bernd has confirmed that the crunch times for GPU tasks directed at the Vela pulsar are much longer than expected from CPU results - perhaps more than double. So any underperforming setups may well test a limit that is much smaller than it really should be. Hopefully, this will change when the directed search moves on to a different source - e.g. CasA. The time limit is not supposed to catch slow crunching. It's a safety mechanism to terminate a situation where progress is glacial to non-existent, rather than just on the slow side.
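For anyone wondering where such a limit comes from, here is a hedged sketch. As far as I understand it, the BOINC client derives the maximum elapsed time by dividing the workunit's rsc_fpops_bound by the speed it projects for the app version on that host. The numbers below are purely hypothetical and are not taken from any Einstein@Home workunit; the function names are invented for illustration.

```python
def max_elapsed_seconds(rsc_fpops_bound: float, projected_flops: float) -> float:
    """Elapsed-time limit, assuming limit = rsc_fpops_bound / projected speed."""
    return rsc_fpops_bound / projected_flops


def would_trip(runtime_seconds: float, rsc_fpops_bound: float, projected_flops: float) -> bool:
    """True if a task needing `runtime_seconds` would exceed the limit."""
    return runtime_seconds > max_elapsed_seconds(rsc_fpops_bound, projected_flops)


# Hypothetical values, chosen only to give a round limit.
BOUND = 2.0e17            # hypothetical rsc_fpops_bound for one task
PROJECTED_FLOPS = 2.0e13  # hypothetical speed the client assumes for the GPU app

limit = max_elapsed_seconds(BOUND, PROJECTED_FLOPS)
print(f"limit: {limit:.0f} s")
print("9,000 s task trips the limit:", would_trip(9_000, BOUND, PROJECTED_FLOPS))
print("11,000 s task trips the limit:", would_trip(11_000, BOUND, PROJECTED_FLOPS))
```

If the bound is sized from CPU results and the real GPU run time is more than double what was expected, a slow but healthy host can land on the wrong side of that division.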
Cheers,
Gary.