Here we are again. I must have been through this a few times before myself; I even see my own comments about it in previous threads, but I still don't quite remember how to fix it. I'll keep looking, but I wanted to post this here right away:
My RX 6800XT was working fine, then I pulled it out and replaced it with a 7900XTX, and now BOINC only detects the Clover/Mesa OpenCL implementation. clinfo shows three platforms at the top, including the HSA/ROCm one I think it should use, but *doesn't* show the more detailed info below, which is what I expected to see. And clpeak crashes. BOINC tries to use the Mesa OpenCL 1.1, but all jobs finish immediately with an error.
I saw the thread about Ubuntu, but that turned out to be a problem that I'm sure is not my case, anyway.
It doesn't make sense that the driver was working for the 6800XT but doesn't work for the 7900XTX, so I'm guessing it's something else. I don't like what clinfo and clpeak are doing.
Does anyone know of problems with AMDGPU + OpenCL for the new 7900 series?
Copyright © 2024 Einstein@Home. All rights reserved.
For Ubuntu I'd reinstall the amdgpu driver. For Fedora?
/usr/bin$ sudo ./amdgpu-uninstall
followed by a clean install of the correct version for your flavour of Fedora.
This is for the drivers from the amdgpu firmware support site.
And that's assuming your Fedora installation scheme isn't so far away from Ubuntu / Debian
that the /usr/bin path is no longer correct.
Maybe you've been there and done that.
Clover/Mesa OpenCL seems to be getting in the way and most likely won't work with the 7900XTX. Check the contents of /etc/OpenCL/vendors/. Files ending in .icd are candidates for OpenCL, and you may be able to guess the source from the filename. You can cat each file to see which lib it points to, and it should be clear which file you probably should be using. The amdgpu file is usually named amdocl64.icd; I don't know/remember whether the rocr OpenCL one is named the same. You can prevent a .icd file from being used by renaming it so it no longer ends in .icd. After renaming, reboot to reset the driver.
If you need to reinstall the amdgpu OpenCL driver, download the installer package, install that first, then run sudo amdgpu-install -y --usecase=opencl --opencl=rocr [--accept-eula]. The EULA shouldn't be needed for rocr, but you never know...
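The /etc/OpenCL/vendors check described above can be sketched in a few lines of Python. This is only an illustration, not an official tool: the function name list_icds is made up here, and it assumes each .icd file is a small text file whose first non-empty line names the library to load. A bare library name is left for the dynamic loader to resolve, so only absolute paths are checked on disk.

```python
from pathlib import Path


def list_icds(vendor_dir="/etc/OpenCL/vendors"):
    """Return (icd_file_name, library, found) for every *.icd file.

    found is True/False for absolute library paths and None for bare
    names, which the dynamic loader resolves on its own search path.
    """
    results = []
    for icd in sorted(Path(vendor_dir).glob("*.icd")):
        # Assumption: the first non-empty line of the file names the library.
        lines = [ln.strip() for ln in icd.read_text(errors="replace").splitlines()]
        lines = [ln for ln in lines if ln]
        library = lines[0] if lines else ""
        found = Path(library).exists() if library.startswith("/") else None
        results.append((icd.name, library, found))
    return results


if __name__ == "__main__":
    labels = {True: "found", False: "MISSING", None: "bare name, not checked"}
    for name, library, found in list_icds():
        print(f"{name}: {library} [{labels[found]}]")
```

A file that has been renamed so it no longer ends in .icd is skipped by the glob, which mirrors the rename trick for disabling an implementation.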
Thanks all, but by AMDGPU I mean NOT AMDGPU-PRO, so that means no (extra) drivers and I didn't use amdgpu-install. I can't. But, it's not necessary and I'm sure of that. I've been around this block many times.
But, good thinking; something like that is the right approach. I think I should try to remove and reinstall AMD packages. I'll try that and get back to you.
It's true that the MESA OpenCL path doesn't work, and that is what BOINC is trying to use. BUT! That's not why it doesn't work. It doesn't work because the ROCm driver is NOT working, even though it's installed and worked with my old card.
I like what you said about /etc/OpenCL/vendors. But, it's not that I need to remove MESA OCL 1.1 from the list of usable ones so that it doesn't try that one. It's that I need to figure out why the OpenCL provider I want to use isn't on the list, isn't detected, or isn't working. clinfo shows this:
That is the correct driver and it's detected. But, after that, in the details that follow, it is NOT shown. And, as I say, clpeak fails. So, I think there is a problem with the driver, I just don't know what.
Okay, so, I was able to get it working by installing some ROCm packages from AMD. This is troublesome because I don't really know which ones I need to have or should have installed.
It also means I might be wrong about not needing other packages. I would need to do more testing, and I plan to, but this is very disturbing. I really thought I had it figured out and had lots of evidence to back that up.
However, the system was horribly unstable, crashing frequently. So, I'm not sure I have everything correct. I went back to my 6800XT and it seems better, more stable, but it has only been an hour, so, I'm not confident in that assessment either.
AMD supports RHEL 7/8/9 - where does Fedora 37 fit in? One of the RHEL bundles would provide the most direct install for OpenCL, but it doesn't seem that this is an option for you. A search for Fedora/amdgpu/opencl brings up a number of possible solutions that I expect you are aware of. You only need OpenCL from amdgpu; all other graphics support can come via Mesa.
What file(s) does your /etc/OpenCL/vendors/ directory contain, and what lib files are listed? Since you are able to easily flip back to your 6800XT and you do get results that pass validation for both the 6800 & 7900 cards, it seems that AMD rocr OpenCL is in fact installed. Is the firmware file for the 7900 current? You could check /var/log/kern.log when the 7900 is plugged in to see if there is some startup issue...
Thanks for the help!
Fedora is not supported. But, it doesn't much matter because AMD has promised that the support I need will be incorporated into the OSS stack. It's just a question of how much has been put out there now, and how much has been integrated into the distribution. ROCm, or at least some parts of it, is in Fedora proper. For the 5700XT & 6800XT, I had good evidence that these extra pkgs from AMD were not required. But, I confess, my validation could have been better. I know for sure that it worked once without any extra packages, but I have since added some back trying to get HSA working for things beyond OpenCL.
I uninstalled MESA OpenCL, so now the only thing in OpenCL/vendors is the amdocl64 file, but it's not owned by anyone, which is upsetting. It points to a library that is part of an AMD RPM, but that was an RPM I replaced, and the official Fedora RPM of the same name provides the same lib. So, I had that installed before.
I completely agree with your troubleshooting approach, but it's not something that is missing. ROCm opencl stack was present, it just wasn't working correctly for that card, but did work correctly for the previous gen card. If it is, in fact, a software issue, then it's the wrong software or conflicting software, not missing software.
Now, I don't know much about the firmware, and it sounds a bit more suspicious. This is a part I haven't looked into before. ...Okay, I think I found some firmware information about the card in the journal--no more SYSLOG ;-)--so if we know what to look for, I think we can find it. Nearby, I see "SMU driver if version not matched" then "SMU is initialized successfully!"
What's the name of that pkg? ... Yeah, okay amd-gpu-firmware. Looks current. last updated 30 days ago.
Hmm, what am I looking for with the firmware, just errors? No errors regarding firmware, and no errors, generally, during boot that I can see. Think I need to check FW versions or something?
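One way to narrow down "what to look for" is to filter the kernel log for lines that mention the driver together with a few suspicious words. The sketch below is only an illustration: the function name interesting_lines and the keyword list are assumptions made up for this example, not a list of messages amdgpu is known to print. It reads a saved log file, for example one written with journalctl -k > kernel.log.

```python
import sys

# Assumption: a hand-picked keyword list for illustration, not an official one.
KEYWORDS = ("firmware", "smu", "error", "fail", "timeout")


def interesting_lines(lines, driver="amdgpu", keywords=KEYWORDS):
    """Keep lines that mention the driver and at least one keyword.

    Matching is case-insensitive; trailing newlines are stripped.
    """
    hits = []
    for line in lines:
        lower = line.lower()
        if driver in lower and any(word in lower for word in keywords):
            hits.append(line.rstrip("\n"))
    return hits


if __name__ == "__main__":
    # Usage: python3 this_script.py kernel.log
    if len(sys.argv) > 1:
        with open(sys.argv[1], errors="replace") as log:
            for hit in interesting_lines(log):
                print(hit)
```

Both of the SMU messages quoted above would pass this filter if they appear on lines tagged with amdgpu, so the output still needs reading by eye; it only cuts down the noise.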
I still have some messing around I can think of to do with the pkgs, so I'll try to make some time to play with that.
In the meantime, this is helping! Any other ideas?
Paul wrote: I still have
Have you tried loading up a copy of Ubuntu in a VM box and see if it works there? It might give you some ideas for libraries you aren't using right now.
Clover/Mesa OpenCL doesn't work for anything except card name detection :) Really useless.
If you have trouble with ROCm, try an alternative: https://github.com/pocl/pocl . It works well with AMD cards (including Einstein@Home apps).
And there's also the brand new Rusticl/Mesa https://docs.mesa3d.org/rusticl , but I haven't had a chance to test it yet.
ahorek's team
Yes, correct. I've long been aware of this. I even did some research and posted on the Fedora community about it. AMD documentation explains it pretty clearly, it's just buried really deep. But, they state explicitly that they are no longer supporting Clover/Mesa OpenCL -- gosh, that must have been 6 years ago or more -- and that, instead, they are supporting OSS OCL via ROCm. That is their official position. Which is how I know this is *supposed* to work, it's just a matter of how much and how well and how that code gets out to end users.
Oh, now that's very interesting. POCL was a problem in the past. It was just like Clover in that BOINC would detect it and not be able to use it. I guess I could try it again.
Rusticl sounds very interesting too. Not sure how that could help BOINC, but, good to know. I keep meaning to teach myself Rust. It's on the TODO list.
mikey wrote: Have you tried
I've been thinking about this over the day. I thought of a couple of problems with this idea, but I also thought of solutions for all of them. The question is, how do I get what you are saying I could get out of it? So, trying Ubuntu means I can use amdgpu-install. But, it installs a bunch of packages, silently, I think, so then I need to figure out how to get the apt log, which I assume I can do, but I forget how. Then I need to look up the 'provides' list, and filter that for libs. I mean, I think that would help, a little. I guess I could compare that to the same list on my system. Seems like a lot of work, but I don't see anything wrong with that approach, in theory.
It could also be useful just to stress test it that way. At this point, I assume it's not a bad card on delivery, but I also cannot be sure it's not. I suppose this would be one way to test. Certainly a better test bed than, say, Windows, for me.
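The comparison step could look roughly like the Python sketch below. It is only an illustration under stated assumptions: the function names are made up, and it expects two plain-text lists with one entry per line, for example file lists saved with dpkg -L on the Ubuntu side and rpm -ql on the Fedora side. Entries are reduced to a bare libNAME.so so that version suffixes and directory layouts don't cause false differences. On Ubuntu, the record of what apt installed is normally kept in /var/log/apt/history.log.

```python
import re
import sys

# Matches a library file name up to ".so", ignoring directories and versions.
LIB_RE = re.compile(r"lib[^/\s]*?\.so")


def lib_names(entries):
    """Reduce file or 'provides' entries to a set of bare libNAME.so names."""
    names = set()
    for entry in entries:
        match = LIB_RE.search(entry)
        if match:
            names.add(match.group(0))
    return names


def missing_locally(reference_entries, local_entries):
    """Libraries present in the reference list but absent from the local one."""
    return sorted(lib_names(reference_entries) - lib_names(local_entries))


if __name__ == "__main__":
    # Usage: python3 this_script.py ubuntu_list.txt fedora_list.txt
    if len(sys.argv) == 3:
        with open(sys.argv[1]) as ref, open(sys.argv[2]) as loc:
            for name in missing_locally(ref, loc):
                print(name)
```

Matching on bare names is deliberately loose: it can hide a version mismatch, so anything it reports as present is worth a second look if the driver still misbehaves.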
I think what I'm going to do is run down a couple other ideas I have, first. But, I might return to this one. Thanks.
I also have the "opportunity" to talk to the manufacturer's support. I do feel like they owe me a little help, considering the price I paid.
So, is anyone using the 7900XT/XTX on Linux, now, for OpenCL crunching? I would love to connect with someone who actually is doing that in this project. I didn't see any on the top 50 machines list, which is where I expected to find them.