Improvements in the code of the clients

Ian&Steve C.
Joined: 19 Jan 20
Posts: 3813
Credit: 37951415160
RAC: 58909344

Wedge009 wrote:

I'd prefer not to until I have confidence that I'll have any different results from the past - it's my primary machine, the one I use for day-to-day work and play. My past experience with it is the same as what I just described for Vega10 - BOINC simply doesn't like to co-operate with ROCr/ROCm-based drivers, I've only had (BOINC) success with PAL-based OpenCL for Vega GPUs.

I had a lot of help on this in the past from mountkidd, in fact they were the one to let me know of the incompatibility between certain amdgpu-pro versions and Ubuntu kernels in the first place.

Depending on how long ago it was that you had issues and settled on PAL-based drivers, I believe that ROCm has matured quite a bit since then.

 

Wedge009 wrote:

I'm aware of the PCIe atomics requirements of ROCm but I understood that's only relevant for older GPU generation - they dropped that requirement for Vega-generation GPUs, as you mentioned. I have several similar 990FX-based motherboards and the same results in all of them. No risers, I only use motherboard slots. And while the block diagrams indicate PCIe lanes through the 990FX chipset vs the CPU for Threadripper, I've observed no difference in OpenCL behaviour between the two, at least in the past.

Yeah, that's what I thought too, but what *should* be true isn't always true with AMD drivers, since it seems they don't have every possible hardware combination to test. Based on their documentation, it sounds like they've only tested on the hardware on the approved list, and we see other examples of inconsistent feature support, like what we're seeing now with Windows drivers (which should have support, but obviously don't) and older PAL drivers (which should have support, but obviously don't).

I only had this thought because your failure mode is quite different from that of others who are having failures. Others are seeing straight oclfft errors after a few seconds. You see just the card running indefinitely with no progress and no load. So something is different here: either some idiosyncrasy in that system's software (a conflicting package or driver) or possibly the platform. It doesn't make sense to me that my Polaris card (older than your Vega) works and your Vega card doesn't with a very similar OS and drivers, unless the issues are specific to Vega, which I find unlikely with ROCm since they explicitly claim full Vega support.

 

Wedge009 wrote:

Encouraging to know it's possible (haven't seen many other Vega successes till now) but I'm generally not at the level of self-compiling drivers (I self-compile some of my favourite applications though). The performance appears to be substantially better than the amdgpu-pro drivers too, if it is indeed ROCm that accounts for the substantially shorter run times.

I've reported my issues on both AMD Community and ROCm GitHub, but if self-compiling the driver produces good results I wonder if there's an issue with the driver packaging for Ubuntu somewhere.

The user you mentioned appears to be anonymous. Would he/she be willing to provide some input on how they got it working with BOINC so well?

User is tictoc (they've posted publicly in the past that this system belonged to them, that's how I knew).

As for their performance difference, it might be due to drivers, but I'm going to guess that it's mostly due to however they are operating that system. Possibly with watercooling and overclocking, but that's just a guess. Temps, overclocks, and particularly undervolting or power limiting can have big effects on performance and power efficiency. I could run my cards harder, but I prefer to dial them back a bit for better power efficiency. Everyone's situation and goals are different; some might not care about power use and just want the most performance possible.

Based on our conversations, this user seems well versed in Linux (likely much more than me), as they are running Arch and custom compiling most things. I'm good at observing behavior, trending, research, googling, and putting the pieces of the puzzle together to find and fix issues, but less versed in how everything works in the background.

 

I recommended trying your Threadripper/Radeon VII system because of its platform similarity to a known-good system, and to narrow down whether your previous issues were somehow related to the old platform. If you experience exactly the same issues (no GPU load, 0% progress), it could come down to some software conflict on your system (a conflicting package or driver).

FYI, you can load up ROCm on your 5.4 kernel. You do not need to update the kernel just for ROCm. Do the dkms install, and if you decide to update the kernel later, the ROCm install should be preserved. I did this on my RX 570 system: initially I had it locked to the 5.4 kernel, installed ROCm, verified it was working, then installed the 5.11 kernel and re-verified that it still worked.
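
If you want to double-check that the dkms module really did carry over after a kernel update, something like the rough Python sketch below could help. It only filters the text that `dkms status` prints for lines that mention a given kernel release and report "installed". The output format differs between dkms versions, and the module name and version strings in the sample are made up for illustration, so treat the matching as an assumption rather than a reliable parser.

```python
import platform
import subprocess


def installed_for_kernel(status_text, kernel):
    """Return the `dkms status` lines that mention `kernel` and report 'installed'."""
    hits = []
    for line in status_text.splitlines():
        line = line.strip()
        # The state is assumed to follow the last colon, e.g. "...: installed".
        if kernel in line and "installed" in line.split(":")[-1]:
            hits.append(line)
    return hits


def check_running_kernel():
    """Run `dkms status` and filter for the running kernel (requires dkms)."""
    out = subprocess.run(
        ["dkms", "status"], capture_output=True, text=True, check=True
    ).stdout
    return installed_for_kernel(out, platform.release())


# Illustrative sample only; the version "1.0" is a placeholder, not real output.
SAMPLE = (
    "amdgpu, 1.0, 5.4.0-81-generic, x86_64: installed\n"
    "amdgpu, 1.0, 5.11.0-27-generic, x86_64: built\n"
)
```

Calling `check_running_kernel()` after booting the new kernel should return at least one line if a module was rebuilt and installed for it; an empty list would suggest the dkms build did not carry over.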

PS: you'll probably have to do the same hack/edit of the ICD file to get the Radeon VII detected in BOINC with ROCm, like you did before. You're right that this could be an issue with the Ubuntu package. But for some reason I never needed to do this with my RX 570; it only seems to affect Vega and newer. Tom needed to do the same on his RX 5700. I also haven't seen others posting about this issue, so maybe it's not as much of an issue with other popular distributions like Linux Mint.

 

_________________________________________________________________________

Wedge009
Joined: 5 Mar 05
Posts: 117
Credit: 16321964425
RAC: 6599315

Ian&Steve C. wrote:

Depending on how long ago it was that you had issues and settled on PAL-based drivers, I believe that ROCm has matured quite a bit since then.

I retest every new amdgpu-pro release, so the last time would have been when 21.30 was released. The last test was 6th August. Unless there's something additional that ROCm offers over ROCr, which I haven't seen any evidence of yet.

 

Ian&Steve C. wrote:

Others are seeing straight oclfft errors after a few seconds. You see just the card running indefinitely with no progress and no load.

As I mentioned previously, I observed precisely the immediate oclfft errors you described with ROCr-based OpenCL in amdgpu-pro between 20.45 and 21.10 inclusive. It's only more recently, with amdgpu-pro 21.20, 21.30, and ROCm 4.3.0 that I have the stalled progress.

 

Ian&Steve C. wrote:

As for their performance difference, it might be due to drivers, but I'm going to guess that it's mostly due to however they are operating that system. Possibly with watercooling and overclocking, but that's just a guess.

That's the kind of information I'm curious about. I observe that most power users run multiple tasks at a time; I stopped doing that a long time ago because I do other things on the GPU besides BOINC too. So I found it interesting that they had such low run-times, suggesting a single concurrent task like mine.

 

Ian&Steve C. wrote:

FYI, you can load up ROCm on your 5.4 kernel. You do not need to update the kernel just for ROCm.

That's good to know - I usually go to the latest kernel when retrying ROCr-based OpenCL because if/when it does work that's what I'll be using anyway, plus the docs say only later kernels are officially supported. The sole reason for my remaining on kernel 5.4 is because the older PAL-based amdgpu-pro doesn't install on more recent kernels.

When the next amdgpu-pro is released I'll give both it and ROCm a try.

Soli Deo Gloria

Ian&Steve C.
Joined: 19 Jan 20
Posts: 3813
Credit: 37951415160
RAC: 58909344

FYI, ROCm and ROCr are not the same thing. ROCr is only the runtime and not the full ROCm package.

For example, ROCr does not support OpenCL 2.0 on GPUs older than Vega, whereas ROCm does.

And for the normal tasks (which only need OpenCL 1.2) you would be fine with ROCm or ROCr in recent versions (and ROCm and ROCr even work on 5.8 kernels too). I would highly suggest you test on the Threadripper system first, as it's a known working platform, just to rule out variables. I have a strong suspicion that if it's still not working on your FX system, it's platform related and not necessarily related to Vega or Linux kernels. My Polaris card (older than Vega) works on ROCm with 5.11 kernels, both for normal tasks and for these new tasks. I'm not doing anything special.

As for the system I previously linked running 1x: it's clear to me that this was a test instance of BOINC (new host ID) spun up only for this test, which is why it was only running 1x. Look on the leaderboard for an Anonymous host with Arch Linux, a 5.14 kernel, and 4x Radeon VII and you'll see his normal production hosts, which look to be running 4x at the moment based on the recent 550-600 s run times.

_________________________________________________________________________

Wedge009
Joined: 5 Mar 05
Posts: 117
Credit: 16321964425
RAC: 6599315

Yes, I'm aware they're not the same, but they seem to be related somehow. Certainly the fact that they exhibit the same behaviour for me so far doesn't fill me with much confidence.

I already sent a message but have yet to receive a reply.

Soli Deo Gloria

Ian&Steve C.
Joined: 19 Jan 20
Posts: 3813
Credit: 37951415160
RAC: 58909344

FYI

tictoc wrote:

Just a follow up on the beta app.  I set one of my Radeon VIIs to crunch only the beta app.  Running 4x tasks concurrently.  Roughly 20 seconds slower per task, when running 4x tasks, than the 1.18 app. https://einsteinathome.org/host/12882622/tasks/0/0

 

Arch Linux | Kernel 5.14.0-1 mainline | ROCm 4.3.1 | Radeon VII @ 1950core/1000mem

Pending - 196

Valid - 168

Invalid - 1

Error - 0

 

I also went ahead and started testing a somewhat silly driver and GPU combination. This OpenCL driver is from the AMD closed-source driver, from before they started using ROCm for OpenCL in the closed driver. Only the OpenCL bits are installed from amdgpu-pro. I didn't start a new client on this machine, so it is mostly the 1.18 tasks, with the most recently downloaded tasks being the beta tasks. https://einsteinathome.org/host/12874264/tasks/2/0?sort=desc&order=Sent

Arch Linux | Kernel 5.13.13 | OpenCL from amdgpu-pro 20.30_1109583 | Vega 64 power capped at 115W

Pending - 14

Valid - 8

Invalid - 0 (there are 4 invalids, but they were from a few days ago and not in this round of testing)

Error - 0
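
For anyone wanting to compare the two sets of counts above, here is a small Python sketch of one way to turn them into an invalid rate. It assumes pending tasks should be left out because they have not been validated yet, and it uses only the numbers posted above; the function name is just for illustration.

```python
def invalid_rate(valid, invalid, error=0):
    """Share of completed (non-pending) tasks that were marked invalid."""
    completed = valid + invalid + error
    if completed == 0:
        return None  # nothing validated yet, so no rate to report
    return invalid / completed


# Counts copied from the two hosts quoted above (pending tasks excluded).
radeon_vii = invalid_rate(valid=168, invalid=1, error=0)
vega_64 = invalid_rate(valid=8, invalid=0, error=0)

print(f"Radeon VII, ROCm 4.3.1:      {radeon_vii:.2%}")  # 0.59%
print(f"Vega 64, amdgpu-pro 20.30:   {vega_64:.2%}")     # 0.00%
```

With only 8 validated tasks on the Vega 64, that second figure says very little yet; it would need many more results before the two drivers could be compared meaningfully.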

 

_________________________________________________________________________

Wedge009
Joined: 5 Mar 05
Posts: 117
Credit: 16321964425
RAC: 6599315

I had a brief power outage forcing a reboot of everything so I took that as an opportunity to give the ROCm packages (still at 4.3.0 for Ubuntu) a try on the Threadripper system. The results were different... but not in a good way. In fact it was quite a disaster.

I followed the same steps as before: removed amdgpu-pro, installed rocm-dkms, and hacked /etc/OpenCL/vendors/amdocl64_40200.icd to get BOINC to find libamdocl64.so properly. However, with ROCm installed, KDE Plasma slows to a crawl (nearly unusable), and any attempt to run GPU processing in BOINC results in an apparent complete system freeze (it won't even respond to pings from other PCs on my network). This happened with both Ubuntu kernels 5.4 and 5.11, across several reboots and retries.
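
For reference, a quick way to see what each ICD file points at is sketched below in Python. It assumes the usual layout where each `.icd` file in `/etc/OpenCL/vendors` contains a library name or path on its first non-empty line, and it assumes the detection problem is that the loader cannot resolve that name; the exact edit needed is not spelled out here, so this only reports what resolves and what does not. The helper names are hypothetical.

```python
import ctypes
from pathlib import Path


def read_icd_entries(vendors_dir="/etc/OpenCL/vendors"):
    """Return {icd_file_name: library_string} for every .icd file in the directory."""
    entries = {}
    root = Path(vendors_dir)
    if not root.is_dir():
        return entries
    for icd in sorted(root.glob("*.icd")):
        lines = icd.read_text().splitlines()
        # Use the first non-empty line as the library name or path.
        entries[icd.name] = next((ln.strip() for ln in lines if ln.strip()), "")
    return entries


def can_load(library):
    """True if the dynamic loader can open the named library.

    Note: this really loads the library into the current process.
    """
    if not library:
        return False
    try:
        ctypes.CDLL(library)
    except OSError:
        return False
    return True


def report(vendors_dir="/etc/OpenCL/vendors"):
    """Map each ICD file to (library_string, loadable?)."""
    return {
        name: (lib, can_load(lib))
        for name, lib in read_icd_entries(vendors_dir).items()
    }
```

Running `report()` as the same user that runs BOINC would show whether `libamdocl64.so` resolves by bare name or only by full path, which is the kind of difference an ICD edit would work around.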

It's worth noting that I install amdgpu-pro with the --headless option so the open-source amdgpu drivers are still used for regular graphics acceleration, amdgpu-pro is only used for the OpenCL implementation. I assumed ROCm was similarly only taking care of the compute side of things as per its name - Radeon Open Compute - but maybe it's doing something weird on the display side too? Even when BOINC wasn't running, Plasma was not responding well, though strangely the log-in screen and TTYs seemed to be just fine.

At any rate, removing ROCm restored my system's functionality and yet again I'm back on the old PAL-based OpenCL from amdgpu-pro 20.40. I just can't seem to have any luck with ROCm/ROCr to date... and the down-time on my primary workstation was why I wasn't keen on doing any testing urgently.

Soli Deo Gloria

Ian&Steve C.
Joined: 19 Jan 20
Posts: 3813
Credit: 37951415160
RAC: 58909344

Wedge009 wrote:

KDE Plasma slows to a crawl

 

Thinking about what's different between your system and others that aren't having these problems: perhaps KDE doesn't play nicely with the ROCm drivers? Just guessing. I really have no explanation for why your system in particular seems to have issues with newer ROCm-based drivers, but given that several others aren't experiencing the same issues, I think it's possible that there's some conflict in your software setup that's causing the issues, rather than some issue with the driver package itself. A test on a fresh install of vanilla Ubuntu with no special configuration might be useful (got a spare HDD/SSD, so you don't have to disturb your production environment?), but I wouldn't blame you if you're not interested in doing that.

 

Are you running Kubuntu to get KDE, or did you install KDE on top of normal Ubuntu?

_________________________________________________________________________

Wedge009
Joined: 5 Mar 05
Posts: 117
Credit: 16321964425
RAC: 6599315

To clarify, the machine I first attempted - the FX-8350 + Vega 64 - also runs KDE Plasma. In all cases I use Kubuntu rather than KDE over Ubuntu, although theoretically there shouldn't be much difference between the two.

Yeah, it's always difficult trying to debug issues on primary workstations. But I sometimes do a fresh installation to try to work out issues like this - still haven't had any success. I'd like to know who the 'several others' are who 'aren't experienc[ing] the same issues'; perhaps they could help. I still have no response from tictoc...

Soli Deo Gloria

Ian&Steve C.
Joined: 19 Jan 20
Posts: 3813
Credit: 37951425160
RAC: 58910227

Wedge009 wrote:

I'd like to know who the 'several others' are who 'aren't experienc[ing] the same issues',

From this thread:

I've run the ROCm drivers on Ubuntu with my RX 570 (older than Vega)

tictoc, as I've mentioned, ran ROCm drivers on Arch Linux with his Radeon VIIs. He also mentioned to me that he ran the new tasks fine on his RX Vega cards with the older PAL-based amdgpu-pro package, installing only the OpenCL components. https://einsteinathome.org/host/12874264/tasks/2/0?sort=desc&order=Sent

DF1DX ran ROCm drivers on Linux Mint 19.3 on his Radeon VII

solling2 ran ROCm drivers on Linux Mint 20.2 on his Ellesmere (guessing RX 570/580, older than Vega)

 

Likely more that haven't reported in this thread. It's hard to find Linux/Vega combinations among people who have tried the beta tasks if they don't post their systems.

 

Edit:

Looks like gBaker has successfully run the new tasks on his Threadripper system on Ubuntu also. Not sure which drivers he's using, but based on the format of the listed coprocs, maybe he's even using the older drivers like you are.

https://einsteinathome.org/host/12796547/tasks/4/0?sort=desc&order=Sent

So yeah, several people with various configurations are not experiencing the same issues.

_________________________________________________________________________

Wedge009
Joined: 5 Mar 05
Posts: 117
Credit: 16321964425
RAC: 6599315

I was aware of the general Polaris successes, just not so much with Vega besides tictoc that you mentioned earlier. I'll see if I can get any help from DF1DX and/or gBaker since I had no response from tictoc.

Soli Deo Gloria
