FGRPopencl1K-ati POLARIS10 AMDGPU LLVM3.9.1 Mesa 17 = Crash

Paul
Joined: 3 May 07
Posts: 123
Credit: 1785298453
RAC: 310565

Thanks again, Gary.

Yes, I'm certain that is exactly the problem.  You have AMD-APP platform, and I do not.  That's exactly what I need to fix.

This is very frustrating.  I don't understand what is so different between our systems, nor do I really understand what is happening on either system.

On the other hand (and I'm sorry I didn't notice this before), "AMD-APP" makes me suspicious. There used to be AMD compute software that installed in /opt/AMD-APP.  It didn't come with the drivers, but the driver docs said you needed it and pointed users to a separate download site on AMD.com.  It was a separate API... oh, I can't remember... wait, it was Stream or something; like CUDA, but for AMD.  You don't have anything like that installed, do you?

Here is my /opt directory tree:

/opt/amdgpu-pro/bin:
total 780
-rwxr-xr-x. 1 root root 798464 Aug 10 11:00 clinfo*

/opt/amdgpu-pro/lib64:
total 99980
-rwxr-xr-x. 1 root root 39131760 Aug 10 11:00 libamdocl12cl64.so*
-rwxr-xr-x. 1 root root 62971352 Aug 10 11:00 libamdocl64.so*
lrwxrwxrwx. 1 root root       22 Aug 10 09:24 libdrm_amdgpu.so.1 -> libdrm_amdgpu.so.1.0.0*
-rwxr-xr-x. 1 root root    66936 Aug 10 09:24 libdrm_amdgpu.so.1.0.0*
lrwxrwxrwx. 1 root root       22 Aug 10 09:24 libdrm_radeon.so.1 -> libdrm_radeon.so.1.0.1*
-rwxr-xr-x. 1 root root    67536 Aug 10 09:24 libdrm_radeon.so.1.0.1*
lrwxrwxrwx. 1 root root       15 Aug 10 09:24 libdrm.so.2 -> libdrm.so.2.4.0*
-rwxr-xr-x. 1 root root    81040 Aug 10 09:24 libdrm.so.2.4.0*
lrwxrwxrwx. 1 root root       15 Aug 10 09:24 libkms.so.1 -> libkms.so.1.0.0*
-rwxr-xr-x. 1 root root    18736 Aug 10 09:24 libkms.so.1.0.0*
lrwxrwxrwx. 1 root root       14 Aug 10 11:00 libOpenCL.so -> libOpenCL.so.1*
-rwxr-xr-x. 1 root root    27336 Aug 10 11:00 libOpenCL.so.1*

/opt/amdgpu-pro/share:
total 4
drwxr-xr-x. 3 root root 4096 Sep 23 15:57 doc/

/opt/amdgpu-pro/share/doc:
total 4
drwxr-xr-x. 2 root root 4096 Sep 23 15:57 libdrm-amdgpu-pro-2.4.70/

/opt/amdgpu-pro/share/doc/libdrm-amdgpu-pro-2.4.70:
total 4
-rw-r--r--. 1 root root 1627 Jul 24 01:42 README

One difference I see is that you do not have the libOpenCL provided by AMD, but I do.  Since this file came with libamdocl64 and libamdocl12cl64, I'm not sure what to make of it.  Perhaps this is the key, but I would have expected more problems with your setup than with mine.

Paul
Joined: 3 May 07
Posts: 123
Credit: 1785298453
RAC: 310565

I moved libOpenCL out of the way, but it didn't help.

$ LD_LIBRARY_PATH=/opt/amdgpu-pro/lib64 /opt/amdgpu-pro/bin/clinfo
/opt/amdgpu-pro/share/libdrm/amdgpu.ids: No such file or directory
Segmentation fault (core dumped)

I don't know what /opt/amdgpu-pro/share/libdrm/amdgpu.ids is, but I don't think either of us has it.  However, that error doesn't occur with the AMD version of libOpenCL.
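
For anyone repeating the test above, here is a minimal Python sketch of the same run. It launches a command with LD_LIBRARY_PATH set and reports whether the process exited normally or was killed by a signal such as SIGSEGV. The helper names (describe_exit, run_with_lib_path) are made up for illustration; the paths are the ones from the command line above.

```python
import os
import signal
import subprocess


def describe_exit(returncode):
    """Turn a subprocess return code into a short human-readable string."""
    if returncode < 0:
        # On POSIX, subprocess reports death-by-signal as a negative number.
        return "killed by " + signal.Signals(-returncode).name
    return "exited with status %d" % returncode


def run_with_lib_path(command, lib_dir):
    """Run `command` with LD_LIBRARY_PATH=lib_dir; return (description, stderr)."""
    env = dict(os.environ, LD_LIBRARY_PATH=lib_dir)
    result = subprocess.run(command, env=env, capture_output=True, text=True)
    return describe_exit(result.returncode), result.stderr


if __name__ == "__main__":
    clinfo = "/opt/amdgpu-pro/bin/clinfo"  # path taken from the post above
    if os.path.exists(clinfo):
        status, err = run_with_lib_path([clinfo], "/opt/amdgpu-pro/lib64")
        print(status)
        print(err.strip())
```

A "killed by SIGSEGV" result corresponds to the "Segmentation fault (core dumped)" message the shell prints.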

Paul
Joined: 3 May 07
Posts: 123
Credit: 1785298453
RAC: 310565

Yikes.  I rebooted and GDM wouldn't start.  Dang, this is worse than expected.  I removed everything I installed and then GDM would start.

That doesn't even make sense.  GDM wouldn't have been looking at any of the libraries, at least not directly.  The only component installed globally would be the ICD stuff.

So, I'm still missing a piece.
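
For reference, the "ICD stuff" mentioned above usually means this: the OpenCL ICD loader (libOpenCL.so) conventionally looks in /etc/OpenCL/vendors for small *.icd text files, each naming the vendor library to load. A minimal sketch for listing what is registered follows. The function name is hypothetical, and the directory is an assumption about a standard Linux setup; a given distribution or driver package may differ.

```python
from pathlib import Path


def list_icd_entries(vendors_dir="/etc/OpenCL/vendors"):
    """Map each *.icd file name to the library name or path it contains."""
    entries = {}
    directory = Path(vendors_dir)
    if not directory.is_dir():
        return entries
    for icd_file in sorted(directory.glob("*.icd")):
        # An .icd file normally holds a single line: the vendor library to load.
        text = icd_file.read_text().strip()
        entries[icd_file.name] = text.splitlines()[0].strip() if text else ""
    return entries


if __name__ == "__main__":
    for name, library in list_icd_entries().items():
        print(name, "->", library)
```

An empty result would suggest that no platform is registered system-wide, which is one way a platform like AMD-APP could be missing from clinfo.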

Gordon Haverland
Joined: 28 Oct 16
Posts: 20
Credit: 428489605
RAC: 0

I don't know if this is relevant, but somewhere around the 4.8/4.9 kernel, the vsyscall default changed from emulate to none for most Linux distributions.  If you look in /var/log/messages (or similar) and see a bunch of errors about vsyscall, this could be the source.  You can add vsyscall=emulate to the options on the Linux kernel boot line, and this seems to help.  I gather the option was changed to none because of security concerns.  So hopefully, at some point, BOINC will stop requiring this to be set to emulate.
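
The two checks described above can be sketched in Python. This is only an illustration: the function names are hypothetical, /proc/cmdline is where Linux exposes the boot line of the running kernel, and the log path is the one mentioned above (it varies by distribution).

```python
import os


def vsyscall_mode(cmdline):
    """Return the value of the last vsyscall= option on a kernel command line, or None."""
    mode = None
    for token in cmdline.split():
        if token.startswith("vsyscall="):
            mode = token.split("=", 1)[1]
    return mode


def count_vsyscall_mentions(log_lines):
    """Count log lines that mention vsyscall (case-insensitive)."""
    return sum(1 for line in log_lines if "vsyscall" in line.lower())


if __name__ == "__main__":
    if os.path.exists("/proc/cmdline"):
        with open("/proc/cmdline") as handle:
            print("vsyscall option:", vsyscall_mode(handle.read()))
    log_path = "/var/log/messages"  # assumption: some systems log elsewhere
    if os.path.exists(log_path) and os.access(log_path, os.R_OK):
        with open(log_path, errors="replace") as handle:
            print("log lines mentioning vsyscall:", count_vsyscall_mentions(handle))
```

A result of None only means the option is not on the boot line, in which case the default built into the kernel applies.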

Paul
Joined: 3 May 07
Posts: 123
Credit: 1785298453
RAC: 310565

Wow, thanks Gordon.  I noticed your posts elsewhere, but didn't think that was related.  After your post here, I checked, and I do see vsyscall at the very bottom of my backtrace.  Hmm.  I'll give it a try on my next reboot.

So, I'm trying to understand this issue.  Does this mean that the E@H GPU app uses vsyscall directly?  There isn't any problem with CPU apps, only GPU apps.  So, it must not be BOINC, right?  All GPU apps on my system fail, though I don't have logs from any app other than E@H to check for this particular indicator.  It may not be the same problem.  Or this may be only one problem, and a new one will pop up after I fix it.

Paul
Joined: 3 May 07
Posts: 123
Credit: 1785298453
RAC: 310565

The E@H account service was down yesterday, so I haven't been able to enable GPU computing to test.  However, I tried MilkyWay@H and got only failures.  Also, I'm still getting SETI@H errors, too.  So, I don't think vsyscall has anything to do with my root problem.

Paul
Joined: 3 May 07
Posts: 123
Credit: 1785298453
RAC: 310565

I have confirmed that vsyscall=emulate made no improvement to E@H.  I'm running kernel 4.13, now.  It's a very interesting suggestion, however, since my crashes end with:

7ffe849b9000-7ffe849bc000 r--p 00000000 00:00 0                          [vvar]
7ffe849bc000-7ffe849be000 r-xp 00000000 00:00 0                          [vdso]
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0                  [vsyscall]

but changing to 'emulate' had no effect.
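
The lines above are in the /proc/<pid>/maps format (address range, permissions, offset, device, inode, name). A small Python sketch for pulling out the bracketed regions and their permissions follows; the function names are hypothetical. Note that the [vsyscall] page sits at the very top of the address space (ffffffffff600000 here), so it is listed last in any address-ordered map, and seeing it at the bottom of a dump may simply reflect that ordering.

```python
import os


def parse_map_line(line):
    """Split one /proc/<pid>/maps line into a small dict."""
    fields = line.split(None, 5)
    start, end = (int(part, 16) for part in fields[0].split("-"))
    return {
        "start": start,
        "end": end,
        "perms": fields[1],
        # The name column is absent for anonymous mappings.
        "name": fields[5].strip() if len(fields) > 5 else "",
    }


def special_regions(lines):
    """Return {name: perms} for bracketed regions such as [heap], [vdso], [vsyscall]."""
    regions = {}
    for line in lines:
        if not line.strip():
            continue
        entry = parse_map_line(line)
        if entry["name"].startswith("["):
            regions[entry["name"]] = entry["perms"]
    return regions


if __name__ == "__main__":
    if os.path.exists("/proc/self/maps"):
        with open("/proc/self/maps") as handle:
            for name, perms in special_regions(handle).items():
                print(name, perms)
```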

Paul
Joined: 3 May 07
Posts: 123
Credit: 1785298453
RAC: 310565

OMG, it's working.  I noticed some updates to Mesa, and just decided to try Einstein@Home.  Yay!

I now have DRM 3.27.0 and LLVM 7.0.0, driver 18.2.6 and OpenCL 1.1, Mesa 18.2.6.

Fingers crossed, this keeps working for a long time.  It's been down for 2.5 years (well, less than, but I'm not sure how much less.)

Anybody else see this?
