computation errors on Ubuntu 17 + RX460 + OpenCL Mesa

Juha
Joined: 27 Nov 14
Posts: 49
Credit: 4964434
RAC: 0

libOpenCL.so is the Installable Client Driver (ICD) loader. Its job is to discover and load the real OpenCL drivers at runtime, so ldd is not going to work with it.

On Linux systems the real drivers are referenced in files in /etc/OpenCL/vendors. Once you have found the real driver you can run ldd on it.

Or, if you have a running GPU task, "lsof -p <pid>" or "pmap <pid>" tells you what libraries it has loaded.
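The vendors-directory lookup described above can be sketched with a throw-away directory; the directory and driver name below are stand-ins, not a real installation:

```shell
# Emulate what the ICD loader does at startup: read every .icd file in
# the vendors directory; each file names one real driver library.
# All paths and names here are invented for illustration.
vendors=$(mktemp -d)
echo "libamdocl64.so" > "$vendors/amdocl64.icd"

for icd in "$vendors"/*.icd; do
    driver=$(cat "$icd")
    echo "$(basename "$icd") -> $driver"
done
```

On a real system you would point the loop at /etc/OpenCL/vendors instead and then run ldd on each library it names.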

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5872
Credit: 117724748958
RAC: 34983431

Thank you very much for responding.  As you are no doubt fully aware, I'm way out of my depth with this stuff :-).

Juha_6 wrote:
On Linux systems the real drivers are referenced in files in /etc/OpenCL/vendors. Once you have found the real driver you can run ldd on it.

The vendors directory contains amdocl64.icd, the contents of which point to libamdocl64.so (with no path prefix), which I presume is the file of that name that resides in /opt/amdgpu-pro/lib64.  Running ldd on that file produces a list of libs that look pretty standard, but none with a name suggesting OpenCL libs.

Juha_6 wrote:
Or if you have a running GPU task "lsof -p pid" ....

This looks more promising.  It gives (along with a lot of other stuff seemingly unrelated to the 'real' libs) the following:-

        /opt/amdgpu-pro/lib64/libamdocl12cl64.so
        /opt/amdgpu-pro/lib64/libdrm_amdgpu.so.1.0.0
        /opt/amdgpu-pro/lib64/libdrm.so.2.4.0
        /opt/amdgpu-pro/lib64/libamdocl64.so

So here is what confuses me.  Running ldd on the hsgamma* science app suggests that /usr/lib64/libOpenCL.so is the ICD loader it will use.  The vendors file points to a different file, libamdocl64.so.  So how does that all work?  Are you really saying that because I set LD_LIBRARY_PATH to point at /opt/amdgpu-pro/lib64 before launching boinc, the science app will ignore the other ICD loader (/usr/lib64/libOpenCL.so) and use what the vendor file points to (libamdocl64.so) with the LD_LIBRARY_PATH prefix?  I had originally considered editing the .icd file to include the full path to libamdocl64.so but everything just worked with the above path that allowed boinc to detect the GPU, so I didn't bother investigating any further.

I guess the alternative question is this.  If I had installed the above 4 files in /usr/lib64/ (rather than in /opt/amdgpu-pro/lib64/) and edited the .icd file in vendors/ to point to /usr/lib64/libOpenCL.so, would it have just worked that way?  It's just a theoretical question as I would be quite reluctant to do anything like that - I wouldn't want to pollute the standard library paths.

All the stuff I'm using came from the 16.60 version.  I did look at trying to work out how to use the latest version but it seemed a lot more complicated with a lot of stuff changed.  Do you know of anybody publishing howto type information for distros outside the RH/Ubu group?  I guess I should revisit that once I get a few more pieces of the puzzle sorted out :-).

 

Cheers,
Gary.

Paul
Joined: 3 May 07
Posts: 123
Credit: 1785730885
RAC: 250924

Juha_6 wrote:

libOpenCL.so is the Installable Client Driver (ICD) loader. Its job is to discover and load the real OpenCL drivers at runtime, so ldd is not going to work with it.

On Linux systems the real drivers are referenced in files in /etc/OpenCL/vendors. Once you have found the real driver you can run ldd on it.

Or if you have a running GPU task "lsof -p pid" or "pmap pid" tells what libraries it has loaded.

Thanks, Juha.  That is what I thought ICD was and how it worked, which is why I don't understand why Gary's method didn't work for me.  Maybe I just didn't get all the .so files I needed, or ICD wouldn't map the right ones.

Do you know how to predict or help ICD choose?  I mean, I already have two OpenCL vendors installed now.  Can I be sure installing AMD-PRO OpenCL will help?  When it crashed, the output looked identical.  Would that suggest ICD is picking the same wrong driver?  This is the part of ICD I don't understand.

Juha
Joined: 27 Nov 14
Posts: 49
Credit: 4964434
RAC: 0

Gary Roberts wrote:
So here is what confuses me.  Running ldd on the hsgamma* science app suggests that /usr/lib64/libOpenCL.so is the ICD loader it will use.  The vendors file points to a different file, libamdocl64.so.  So how does that all work?  Are you really saying that because I set LD_LIBRARY_PATH to point at /opt/amdgpu-pro/lib64 before launching boinc, the science app will ignore the other ICD loader (/usr/lib64/libOpenCL.so) and use what the vendor file points to (libamdocl64.so) with the LD_LIBRARY_PATH prefix?

An OpenCL app is linked so that it depends on libOpenCL.so.1.  The system sees that and loads the library before starting the app.  The library file that is used is the first one found in LD_LIBRARY_PATH or the standard system library directories (the ld.so man page doesn't actually say what the order is, but I think that order makes more sense).

The app then queries libOpenCL.so for supported platforms and devices. To answer the query libOpenCL.so checks the files in /etc/OpenCL/vendors, loads every library mentioned in the files and queries those libraries for supported devices. Once libOpenCL.so has compiled the list it gives it back to the app which then chooses which device to use using whatever criteria it wants. In our case the science app picks the device BOINC client told it to use.

In your case, since the .icd file doesn't include a path and libamdocl64.so is not in a standard library directory, you need to use LD_LIBRARY_PATH so that the system finds it (and the other libraries it depends on).
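A minimal sketch of that search order, using a throw-away directory in place of /opt/amdgpu-pro/lib64 (the real dynamic linker also consults the ld.so cache, which is omitted here):

```shell
# Emulate the linker's search for a bare soname: directories listed in
# LD_LIBRARY_PATH are tried before the standard system directories.
# The temp directory is a stand-in for /opt/amdgpu-pro/lib64.
libdir=$(mktemp -d)
touch "$libdir/libamdocl64.so"
LD_LIBRARY_PATH="$libdir"

found=""
for dir in $(echo "$LD_LIBRARY_PATH" | tr ':' ' ') /usr/lib64 /usr/lib; do
    if [ -e "$dir/libamdocl64.so" ]; then
        found="$dir/libamdocl64.so"
        break
    fi
done
echo "would load: $found"
```

With no LD_LIBRARY_PATH entry, the loop would fall through to the system directories and fail to find the driver, which matches the behaviour described above.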

Gary wrote:
I guess the alternative question is this.  If I had installed the above 4 files in /usr/lib64/ (rather than in /opt/amdgpu-pro/lib64/) and edited the .icd file in vendors/ to point to /usr/lib64/libOpenCL.so, would it have just worked that way?

In that case libOpenCL.so (1st copy) would check the .icd file, load libOpenCL.so and ask it for supported devices, libOpenCL.so (2nd copy) would check the .icd file, load libOpenCL.so and ask it for supported devices, libOpenCL.so (3rd copy) ... etc. If it actually ran at all, it would run for a while until it consumed all memory on the system and then crashed.

Gary wrote:
Do you know of anybody publishing howto type information for distros outside the RH/Ubu group?

Nope, sorry.

 

Paul wrote:

Thanks, Juha.  That is what I thought ICD was and how it worked, which is why I don't understand why Gary's method didn't work for me.  Maybe I just didn't get all the .so files I needed, or ICD wouldn't map the right ones.

Do you know how to predict or help ICD choose?  I mean, I already have two OpenCL vendors installed now.  Can I be sure installing AMD-PRO OpenCL will help?  When it crashed, the output looked identical.  Would that suggest ICD is picking the same wrong driver?  This is the part of ICD I don't understand.

I think the first thing you need to do is check the .icd files you have.  You probably still have the Mesa .icd file there.  I don't really know what happens if you have both Mesa OpenCL and AMDGPU-PRO installed.  Perhaps "nothing good" would be a good guess.

You could remove the Mesa .icd file or even uninstall the Mesa OpenCL packages.  Just make sure you keep the ICD loader and the kernel driver.  Otherwise you'll need to figure out how to install them from the AMD package.

Paul
Joined: 3 May 07
Posts: 123
Credit: 1785730885
RAC: 250924

Ouch, that was a painful experience.  With all of the AMDGPU OpenGL packages installed, X11 will not start.

To reiterate, though, I don't want AMDGPU-PRO.  I want AMDGPU and any OpenCL that will work with it, exactly like it worked for me for three whole years.

Perhaps some other day I can try just the ICD stuff.  Or I can bite the bullet and try the full -PRO suite.  But I'm pretty sure that is broken with my X11 now.  This is why I switched to OSS in the first place.  Linus and Fedora started pushing kernels more quickly and X11 started changing fast.  By the time AMD updates their drivers, I have the next X11 and kernel, which are not compatible.

Juha
Joined: 27 Nov 14
Posts: 49
Credit: 4964434
RAC: 0

Well, the idea, if I understood it right, was to keep everything else Mesa except OpenCL.  As far as I can tell, in Fedora the Mesa OpenCL driver is in the mesa-libOpenCL package.

I can't find the AMDGPU-PRO 16.60 that Gary used on AMD's site, only the latest 17.50.  Looking at the Ubuntu package, the OpenCL driver for "legacy" (pre-Vega 10) cards is in opencl-amdgpu-pro-icd_17.50-511655_amd64.deb.  You don't really need to install it, just extract it and copy the .so files to some lib directory and the .icd files to /etc/OpenCL/vendors.  If you use some non-standard lib directory you need to use LD_LIBRARY_PATH or update the lib cache (see the ldconfig man page).
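The extract-without-installing step can be sketched as follows. A .deb is an ar archive whose payload sits in data.tar.*; with the real package you would simply run `dpkg-deb -x <pkg>.deb extracted/`. So the commands below actually run end-to-end, a miniature stand-in payload tarball is built first (gzip instead of xz, invented file names):

```shell
# Build a tiny stand-in for a .deb payload so the extraction half of
# the procedure can be demonstrated. Names and paths are invented.
mkdir -p pkg/etc/OpenCL/vendors pkg/opt/amdgpu-pro/lib64
echo "libamdocl64.so" > pkg/etc/OpenCL/vendors/amdocl64.icd
touch pkg/opt/amdgpu-pro/lib64/libamdocl64.so
tar -C pkg -czf data.tar.gz .

# Unpack the payload into a scratch directory, as dpkg-deb -x would.
mkdir -p extracted
tar -C extracted -xzf data.tar.gz

# From here, the .icd file would be copied to /etc/OpenCL/vendors and
# the .so files to a lib directory covered by LD_LIBRARY_PATH.
cat extracted/etc/OpenCL/vendors/amdocl64.icd
```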

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5872
Credit: 117724748958
RAC: 34983431

I decided (around the time I wrote my previous message) to go find the actual machine on which I first got a Polaris GPU to crunch under PCLinuxOS, over a year ago, with the 16.60 version of OpenCL libs from AMDGPU-PRO.  I had tried several things on several different machines and whilst I thought I'd remember the details at the time, it has got a bit foggy since.  I did find the correct machine and it was the RH version (actually CentOS) that I got to work - not Ubuntu.  So I made the decision to work with a bit more commitment on the RH version of 17.50 to see if I could get it to work as well.

Sorry it has taken a while to make progress.  Sometimes life gets in the way of the more important stuff!

To cut a long story short, I now have a fully working 17.50 setup.  So far, I have two machines, one with an RX 460 and one with an RX 560, both using the 17.50 OpenCL libs.  Both are running fine now but it ended up being quite a struggle to find a missing link.

When I started, the plan was to use brute force and extract as many RPMs as necessary to find current versions of all files that I had used from 16.60.  That turned out to be reasonably simple.  Without thinking too much about possible implications, I made a new path /opt/amdgpu-new/ alongside /opt/amdgpu-pro/ so I could easily go back if it didn't work.  I edited my BOINC startup script to set LD_LIBRARY_PATH=/opt/amdgpu-new/. 

On the first machine I tried, I just stopped and re-launched BOINC with the new LD_LIBRARY_PATH.  I found that crunching resumed without complaint.  Tasks were completed and validated and the crunch times seemed to be about the same or, perhaps, just a smidgen better.  Certainly not enough to get even vaguely excited about.  When everything had gone OK for quite a number of tasks, I picked the PID for a running task and ran lsof -p <PID>.  Here's exactly what I got:-


[gary@localhost ~]$ lsof -p 19018
COMMAND     PID USER  FD TYPE  DEVICE SIZE/OFF       NODE  NAME
hsgamma_F 19018 gary cwd  DIR     8,3     4096     131004  /home/gary/BOINC/slots/0
hsgamma_F 19018 gary rtd  DIR     8,1     4096          2  /
hsgamma_F 19018 gary txt  REG     8,3 10216922     129857  /home/gary/BOINC/projects/einstein.phys.uwm.edu/hsgamma...
hsgamma_F 19018 gary mem  CHR 226,128                1919  /dev/dri/renderD128
hsgamma_F 19018 gary mem  REG     8,1    92384     296254  /lib64/libgcc_s-7.2.1.so.1
hsgamma_F 19018 gary mem  REG     8,1 39490520     298177  /opt/amdgpu-new/lib64/libamdocl12cl64.so
hsgamma_F 19018 gary mem  REG     8,1   190248     296226  /opt/amdgpu-pro/lib64/libdrm_amdgpu.so.1
hsgamma_F 19018 gary mem  REG     8,1    90904     298173  /opt/amdgpu-new/lib64/libdrm.so.2.4.0
hsgamma_F 19018 gary mem  REG     8,1    31648     296365  /lib64/librt-2.26.so
hsgamma_F 19018 gary mem  REG     8,1  1849304     296333  /lib64/libc-2.26.so
hsgamma_F 19018 gary mem  REG     8,1  1354592     296341  /lib64/libm-2.26.so
hsgamma_F 19018 gary mem  REG     8,1    14560     296339  /lib64/libdl-2.26.so
hsgamma_F 19018 gary mem  REG     8,1   106112     296361  /lib64/libpthread-2.26.so
hsgamma_F 19018 gary mem  REG     8,1 67947096     298167  /opt/amdgpu-new/lib64/libamdocl64.so
hsgamma_F 19018 gary mem  REG     8,1   153408     296326  /lib64/ld-2.26.so
hsgamma_F 19018 gary mem  REG     8,3     1024     131020  /home/gary/BOINC/slots/0/boinc_EinsteinHS_0
hsgamma_F 19018 gary mem  REG     8,3     8192     131013  /home/gary/BOINC/slots/0/boinc_mmap_file
hsgamma_F 19018 gary  0u  CHR     1,3      0t0       1035  /dev/null
hsgamma_F 19018 gary  1u  CHR     1,3      0t0       1035  /dev/null
hsgamma_F 19018 gary  2w  REG     8,3    29010     131016  /home/gary/BOINC/slots/0/stderr.txt
hsgamma_F 19018 gary 3wW  REG     8,3        0     131018  /home/gary/BOINC/slots/0/boinc_lockfile
hsgamma_F 19018 gary  4w  REG     8,3        0     130199  /home/gary/BOINC/lockfile
hsgamma_F 19018 gary  5w  REG     8,3    24381     130196  /home/gary/BOINC/time_stats_log
hsgamma_F 19018 gary  6u  CHR 226,128      0t0       1919  /dev/dri/renderD128
hsgamma_F 19018 gary  7r  REG     0,4        0 4026531991  /proc/interrupts
hsgamma_F 19018 gary  8u  CHR 226,128      0t0       1919  /dev/dri/renderD128
[gary@localhost ~]$

I didn't notice it initially, but have a look at the libdrm_amdgpu line in that listing.  Of the four amdgpu-related libs referenced, three have the path given by LD_LIBRARY_PATH (/opt/amdgpu-new/) but this 4th one, libdrm_amdgpu.so.1, is using the previous -pro path where the 16.60 version was still installed.  The only thing I could think of is that this lib gets loaded because of a hard-coded path in some other file from the 17.50 mix, so perhaps it wasn't very smart to use /opt/amdgpu-new/ as the install path.

To resolve this, I decided to convert a second machine to 17.50.  Instead of changing LD_LIBRARY_PATH, I moved the old libs to /opt/amdgpu-old/ to take them right out of the loop, created a new /opt/amdgpu-pro/ and installed all the 17.50 files there.  When I restarted BOINC on this 2nd machine with LD_LIBRARY_PATH still pointing to /opt/amdgpu-pro, I got the infamous "GPU missing ..." response.  So I stopped BOINC, deleted the libdrm_amdgpu.so.1.0.0 (which the libdrm_amdgpu.so.1 symbolic link points to) and replaced it with the version from 16.60.  I just wanted to confirm that the old version of libdrm_amdgpu was what was keeping BOINC satisfied.  BOINC was indeed able to restart with no complaint.  So the conclusion seems to be that there is a routine in the old version of the lib, used by BOINC, that must be missing from the new version.  It was just a fortuitous set of circumstances - the old version of the lib still being available in a place where it could be found - that alerted me to this.

I decided to get a list of all the symbols in both versions of that lib to see if there was anything in the old one not present in the new one.  The command I used was 'nm -D --defined-only <path/to/shared_lib>'.  The new lib has quite a few more symbols than the old, but there is just one that was in the old and not in the new: 'amdgpu_asic_id_table'.  That looked pretty promising, so I decided to do some googling.
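That comparison can be sketched with two tiny stand-in symbol lists; real lists would come from `nm -D --defined-only <lib> | awk '{print $3}' | sort` on each version. `comm -23` prints lines unique to the first file, i.e. symbols the old library exports but the new one doesn't:

```shell
# Stand-in symbol dumps, one name per sorted line. Apart from
# amdgpu_asic_id_table (from the post), the names are invented.
printf 'amdgpu_asic_id_table\namdgpu_bo_alloc\n' | sort > old.syms
printf 'amdgpu_bo_alloc\namdgpu_get_marketing_name\n' | sort > new.syms

# comm needs sorted input; -23 suppresses column 2 (unique to new)
# and column 3 (common to both), leaving symbols only in the old lib.
comm -23 old.syms new.syms
```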

I came across a link which talked about moving the amdgpu ASIC id table to a separate file.  I figured an ASIC id table would be a big list of IDs for all supported GPU models and their common names - something that BOINC might be relying on to identify particular GPUs suitable for crunching.  So I went looking through the other RPMs to see where it now lives.  I found it in ids-amdgpu-1.0.0-511655.el7.noarch.rpm.  The file path in the RPM is /opt/amdgpu/share/libdrm/amdgpu.ids and it's a large text file with data in 3 columns: device-id, revision-id, product-name.
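The three-column layout described above can be sketched like this; the entries are illustrative stand-ins, not copied from the real file:

```shell
# Miniature stand-in for /opt/amdgpu/share/libdrm/amdgpu.ids:
# device-id, revision-id, product-name (values invented).
cat > amdgpu.ids <<'EOF'
67DF, C7, Radeon RX 480 Graphics
67EF, CF, Radeon RX 460 Graphics
EOF

# Look up the product name for one device-id/revision-id pair, which is
# presumably the kind of query libdrm_amdgpu performs against the file:
awk -F', ' '$1 == "67EF" && $2 == "CF" { print $3 }' amdgpu.ids
```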

Being a noob at this sort of thing, I stuck this file under /opt/amdgpu-pro/ without paying attention to the proper path in the RPM.  I got rid of the 16.60 version of libdrm_amdgpu.so.1.0.0 that I'd been using and installed the 17.50 version in its place.  Of course, it didn't work so after a bit more thinking I realised I should be following the precise path from the RPM.  So I created /opt/amdgpu/ and installed the amdgpu.ids file under there rather than under /opt/amdgpu-pro/.  Now BOINC doesn't complain and lsof -p <PID> shows the same set of 4 libs as the above but all with the correct /opt/amdgpu-pro/ path.

Hopefully this is the end of any problems with using the 17.50 OpenCL stuff - not that I'm motivated to change a whole bunch of machines, since there doesn't seem to be much improvement.  I guess I should try an RX 580 and see if that goes any faster.

 

Cheers,
Gary.
