ROCm compute error

agates
Joined: 15 Jan 13
Posts: 4
Credit: 11787020
RAC: 0
Topic 228156

Hi all, been a while.

 

I've got ROCm 5.2.3 running on Fedora 36* and have successfully used it for various OpenCL tasks.

However, with E@H (and the other BOINC projects I've tried so far) I'm getting a compute error.  After some minor investigation I found logs that indicate OpenCL compilation errors.

I'm also seeing a lot of "out of memory" errors, but I'm not sure how accurate they are, considering I'm only using 16 of 64 GB of system memory.  Maybe it's misreporting available GPU memory?  It's a 6800 XT with 16 GB total VRAM.

Happy to help test and provide any logs, just let me know what you need.


* COPR repo for reference https://copr.fedorainfracloud.org/coprs/mystro256/rocm-opencl/

Keith Myers
Joined: 11 Feb 11
Posts: 5002
Credit: 18872415752
RAC: 6156161

What does the BOINC Event Log startup show for your card discovery?  Does it show any OpenCL detection?

ROCm does not work on most consumer cards for OpenCL.  It is meant only for workstation or compute cards such as the Instinct cards.

ROCm needs PCIe 3.0 and PCIe atomics at the motherboard level.

What does clinfo in the Terminal show?

What does rocminfo show in the Terminal?

You should have defined opencl in the installation script to get OpenCL components installed.

 

agates
Joined: 15 Jan 13
Posts: 4
Credit: 11787020
RAC: 0

Yes, BOINC shows OpenCL detection:

OpenCL: AMD/ATI GPU 0: AMD Radeon RX 6800 XT (driver version 3452.0 (HSA1.1,LC), device version OpenCL 2.0, 16368MB, 16368MB available, 23731 GFLOPS peak)

Here is the clinfo/rocminfo output:
https://gist.github.com/agates/c1571975192af8c690a6be29dd013755

Like I said in my first post, I know OpenCL works for my card.  I have used it with other programs.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5874
Credit: 118134043839
RAC: 23188940

agates wrote:

I've got ROCm 5.2.3 running on Fedora 36* and have successfully used it for various OpenCL tasks.

However, with E@H (and other boinc projects I've tried so far) I'm getting a compute error.

Unfortunately, most people contributing here with 'high end' GPUs tend to be using NVIDIA.  The ones using AMD on Linux tend to use a 'supported' distro like Ubuntu.  I'm using Linux (PCLinuxOS) exclusively on AMD GPUs, but I acquired all of mine years ago, the most recent being Polaris series cards (RX 460 - RX 580) bought between 2017 and 2020.  I'm not immediately aware of any regulars here who are using GPUs like yours on an unsupported distro.  With Red Hat/CentOS being supported, I would have thought that this would have quickly flowed through to Fedora??

I see you've tried both FGRPB1G and MeerKAT tasks and the error message from FGRPB1G is the most informative:-

OpenCL compiling FAILED! : -11 . Error message: fatal error: malformed or corrupted AST file: 'could not find file '/usr/lib64/clang/14.0.0/include/opencl-c-base.h' referenced by AST file '/tmp/comgr-a18e5e/include/opencl2.0-c.pch''

In case you're not aware of it, if you click on the TaskID link for any returned failed task that shows on the website, you get to see all the output that was returned for that task.  I imagine the above referenced missing file is something that should have been installed when you installed the ROCm stuff for your distro.  I really don't know anything about Fedora but from some recent research on OpenCL support for Navi GPUs, I've seen lots of comments around the difficulties being experienced by Fedora users.  I'm pretty sure your problem is nothing to do with lack of memory.
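If you want to check this from a terminal, something like the following might help. It's only a rough sketch: it pulls the missing path out of a message shaped like the one above and then extracts the clang version from it. The function names are made up for illustration and the patterns assume the message format stays exactly as quoted.

```shell
#!/bin/sh
# Hypothetical helper: read an error message on stdin and print the path
# named after "could not find file". Assumes the quoting seen in the
# FGRPB1G task output above.
missing_header_path() {
    sed -n "s/.*could not find file '\([^']*\)'.*/\1/p"
}

# Hypothetical helper: print the clang version directory from a path
# of the form .../clang/<version>/include/...
clang_version_from_path() {
    printf '%s\n' "$1" | sed -n 's|.*/clang/\([^/]*\)/include/.*|\1|p'
}
```

Paste the error line into `missing_header_path`, then test the result with `[ -e "$path" ]` to see whether the header really is absent on your system.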

For OpenCL support on all my AMD GPUs, I've always been able to extract the necessary sub-set of packages from the Red Hat version of what used to be called AMDGPU-PRO.  I started in early 2017 with the 16.60 version and have continued right up to the 20.40 version in 2020.  After that, AMDGPU-PRO became Radeon Software for Linux and a whole lot of ROCm stuff was added.  None of that was relevant to me so I have not tried to use anything more recent than the 20.40 version - until now :-).

As luck would have it, an RX 5700 XT was very recently donated to me so I've just started researching what might be needed to provide a working OpenCL install for PCLOS.  Some time ago when AMD stopped supplying the AMDGPU-PRO packages, I downloaded the 21.30 final version.  In my searches for what bits of that full package might be needed, I've come across this thread which lists 11 RPMs as the ones needed.  The thread refers to 21.20 but I'm guessing that not too much will have changed for the 21.30 package that I have.  When I get some time, I'll start experimenting :-).

Sorry I can't offer any real clues at the moment but if anything turns up that might be of use I'll pass it on in this thread.  Hopefully you'll work something out and report back here :-).

Cheers,
Gary.

Keith Myers
Joined: 11 Feb 11
Posts: 5002
Credit: 18872415752
RAC: 6156161

Good information from Gary and I will hand off to him for your further assistance. I only really know Nvidia but have been exposed to many posts from AMD card users and their difficulties with AMD Linux installations.

Fedora is an unknown land to me.  I'm only really knowledgeable about Ubuntu and its closest derivative, Mint.

 

agates
Joined: 15 Jan 13
Posts: 4
Credit: 11787020
RAC: 0

Gary,

 

That error message helps a lot actually.

Looks like it's referencing the wrong version of clang -- I have the following available:

/usr/lib64/clang/14.0.5/include/opencl-c-base.h

But the 14.0.0 directory doesn't exist at all.

Downgrading to clang-libs-14.0.0 resolves the issue, GPU tasks are running now :).

Question is, where is the incorrect version being referenced?  I'd prefer not to have to keep the package downgraded.  But happy to have gotten it working.

 

Alecks

 

EDIT:

The following is also available (directory "14" is a symlink to the specific version, 14.0.0, 14.0.5, etc).  Maybe it could be referenced instead?


/usr/lib64/clang/14/include/opencl-c-base.h
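For anyone wanting to see at a glance which of these directories actually provide the header, here is a rough sketch. The function name is made up, and the default root is just the path from the error message; other distros may keep the clang resource directory elsewhere.

```shell
#!/bin/sh
# Sketch: for each entry under a clang resource root, report whether it
# provides include/opencl-c-base.h. Hypothetical helper; the default
# root is an assumption taken from the error message in this thread.
list_opencl_headers() {
    root="${1:-/usr/lib64/clang}"
    for dir in "$root"/*; do
        # Skip the unexpanded glob when the root is empty or missing,
        # but still report dangling symlinks.
        [ -e "$dir" ] || [ -L "$dir" ] || continue
        if [ -f "$dir/include/opencl-c-base.h" ]; then
            echo "ok       ${dir##*/}"
        else
            echo "missing  ${dir##*/}"
        fi
    done
}
```

Running `list_opencl_headers` with no argument checks /usr/lib64/clang. A version that the error message asks for but that does not appear in the list at all is the one being referenced incorrectly.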

Keith Myers
Joined: 11 Feb 11
Posts: 5002
Credit: 18872415752
RAC: 6156161

I'd just symlink the upgraded library back to the one that is being shown as needed.

The incorrect version is being referenced from the initial ROCm installation.

It's not unheard of for a distro's ROCm packaging to be slow in keeping its scripts updated.

 

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5874
Credit: 118134043839
RAC: 23188940

agates wrote:

Looks like it's referencing the wrong version of clang -- I have the following available:

/usr/lib64/clang/14.0.5/include/opencl-c-base.h

But the 14.0.0 directory doesn't exist at all.

The problem (as Keith points out - thanks Keith) is probably due to a symlink not being properly upgraded when you went from version 14.0.0 to 14.0.5.  This needs to be reported to whoever maintains the Fedora ROCm stuff so that the bug in the install procedure can be fixed.

So I'm clear on exactly what is going on, can you please run the following command in a root terminal session and report the full output (command and output - in code tags) in your next reply, thanks. I'm just trying to be sure about exactly what is symlinked to what :-).

ls -ld /usr/lib64/clang/14*

If you wish, it should be easy to go back to having 14.0.5 installed but to manually create a symlink that would allow OpenCL to work until the bug is properly identified and fixed.  Of course, it's entirely possible that going back to 14.0.5 and fixing this one include file might expose other things that have been overlooked in the 14.0.0 -> 14.0.5 transition :-).
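As a rough idea of what that manual symlink might look like, here is a cautious sketch. The function name and argument order are made up for illustration, and it assumes the only thing missing is the versioned directory name. It refuses to touch anything that already exists.

```shell
#!/bin/sh
# Sketch of the workaround: make the version the toolchain asks for
# (e.g. 14.0.0) point at the version that is installed (e.g. 14.0.5).
# Hypothetical helper; the default root is the path from the error message.
link_clang_version() {
    wanted="$1"      # directory name the error message asks for
    installed="$2"   # directory name that actually exists
    root="${3:-/usr/lib64/clang}"

    if [ ! -d "$root/$installed" ]; then
        echo "no such directory: $root/$installed" >&2
        return 1
    fi
    if [ -e "$root/$wanted" ] || [ -L "$root/$wanted" ]; then
        echo "refusing to overwrite existing $root/$wanted" >&2
        return 1
    fi
    # Relative target, so the link does not hard-code the root path.
    ln -s "$installed" "$root/$wanted"
}
```

Run as root, `link_clang_version 14.0.0 14.0.5` would create /usr/lib64/clang/14.0.0 pointing at 14.0.5. The link is not owned by any package, so it should be removed by hand once the packaging bug is fixed.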

Cheers,
Gary.

Wedge009
Joined: 5 Mar 05
Posts: 128
Credit: 17540005797
RAC: 6861611

I notice you're running BOINC 7.20.2. From BOINC 7.18.x onwards I have not been able to get GPU tasks processing with ROCm-based OpenCL (legacy OpenCL is fine). It seems to be an issue with BOINC somehow - or maybe with how newer BOINCs expect something that ROCm isn't providing - as I have it running fine under BOINC 7.16.17.

I've known about this issue for months but only recently posted a report on the BOINC forums. No answer as yet - I suspect the issue is too obscure for anyone to make a comment on it.

https://boinc.berkeley.edu/forum_thread.php?id=14786

Then again, since you found a solution to your issue with clang, maybe your issue is different - we are on separate distributions after all.

Soli Deo Gloria

SkyFall
Joined: 7 Apr 15
Posts: 1
Credit: 4214161
RAC: 0

Thank you all for your contribution and your helpful information!


Just a quick heads-up on this issue with Fedora 37 and BOINC 7.20.2:

After installing rocm-opencl (and uninstalling mesa-libOpenCL) one of my tasks was failing with:

fatal error: malformed or corrupted AST file: 'could not find file '/usr/lib64/clang/15.0.0/include/opencl-c-base.h' referenced by AST file '/tmp/comgr-f806b2/include/opencl1.2-c.pch''

 

So, I executed the following in a terminal:

$ cd /usr/lib64/clang
$ sudo ln -s 15 15.0.0

After that, ls -all shows:

$ ls -all

drwxr-xr-x. 1 root root 28 26. Dec 10:38 .
dr-xr-xr-x. 1 root root 114786 26. Dec 10:21 ..
lrwxrwxrwx. 1 root root 6 6. Dec 14:22 15 -> 15.0.6
lrwxrwxrwx. 1 root root 2 26. Dec 10:38 15.0.0 -> 15

 

 

And GPU tasks are now crunching happily. :-)
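To double-check that a chain like 15.0.0 -> 15 -> 15.0.6 really ends at a directory containing the header, a small sketch along these lines could be used. The function name is made up, and it relies on `readlink -f` as provided by GNU coreutils.

```shell
#!/bin/sh
# Sketch: resolve a (possibly chained) symlink and confirm the final
# directory contains include/opencl-c-base.h. Hypothetical helper;
# relies on readlink -f from GNU coreutils.
check_header() {
    dir="$1"
    real=$(readlink -f "$dir") || return 1
    echo "$dir -> $real"
    # Exit status reports whether the header exists at the resolved path.
    [ -f "$real/include/opencl-c-base.h" ]
}
```

For the layout above, `check_header /usr/lib64/clang/15.0.0` should print the resolved directory and return success. Because the new link points at "15" rather than at a specific release, it should keep working if "15" is later repointed to a newer 15.x directory, assuming the package keeps maintaining that symlink.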
