Thanks again, Gary.
Yes, I'm certain that is exactly the problem. You have AMD-APP platform, and I do not. That's exactly what I need to fix.
This is very frustrating. I don't understand what is so different between our systems, nor do I really understand what is happening on either system.
On the other hand, and I'm sorry I didn't notice this before, "AMD-APP" makes me suspicious. There used to be AMD software for computing in /opt/AMD-APP. It didn't come with the drivers, but the driver docs said you needed to get it and pointed users to a separate download site on AMD.com. It was a separate API...oh, I can't remember...wait, it was Stream or something; like CUDA, but for AMD. You don't have anything like that installed, do you?
Here is my /opt directory tree:
/opt/amdgpu-pro/bin:
total 780
-rwxr-xr-x. 1 root root 798464 Aug 10 11:00 clinfo*
/opt/amdgpu-pro/lib64:
total 99980
-rwxr-xr-x. 1 root root 39131760 Aug 10 11:00 libamdocl12cl64.so*
-rwxr-xr-x. 1 root root 62971352 Aug 10 11:00 libamdocl64.so*
lrwxrwxrwx. 1 root root 22 Aug 10 09:24 libdrm_amdgpu.so.1 -> libdrm_amdgpu.so.1.0.0*
-rwxr-xr-x. 1 root root 66936 Aug 10 09:24 libdrm_amdgpu.so.1.0.0*
lrwxrwxrwx. 1 root root 22 Aug 10 09:24 libdrm_radeon.so.1 -> libdrm_radeon.so.1.0.1*
-rwxr-xr-x. 1 root root 67536 Aug 10 09:24 libdrm_radeon.so.1.0.1*
lrwxrwxrwx. 1 root root 15 Aug 10 09:24 libdrm.so.2 -> libdrm.so.2.4.0*
-rwxr-xr-x. 1 root root 81040 Aug 10 09:24 libdrm.so.2.4.0*
lrwxrwxrwx. 1 root root 15 Aug 10 09:24 libkms.so.1 -> libkms.so.1.0.0*
-rwxr-xr-x. 1 root root 18736 Aug 10 09:24 libkms.so.1.0.0*
lrwxrwxrwx. 1 root root 14 Aug 10 11:00 libOpenCL.so -> libOpenCL.so.1*
-rwxr-xr-x. 1 root root 27336 Aug 10 11:00 libOpenCL.so.1*
/opt/amdgpu-pro/share:
total 4
drwxr-xr-x. 3 root root 4096 Sep 23 15:57 doc/
/opt/amdgpu-pro/share/doc:
total 4
drwxr-xr-x. 2 root root 4096 Sep 23 15:57 libdrm-amdgpu-pro-2.4.70/
/opt/amdgpu-pro/share/doc/libdrm-amdgpu-pro-2.4.70:
total 4
-rw-r--r--. 1 root root 1627 Jul 24 01:42 README
One difference I see is that you do not have a libOpenCL provided by AMD, but I do. Since this file came with libamdocl64 and libamdocl12cl64, I'm not sure what to make of it. Perhaps this is the key, but I would have expected more problems with your situation than mine.
I tried moving libOpenCL out of the way, but it didn't help.
$ LD_LIBRARY_PATH=/opt/amdgpu-pro/lib64 /opt/amdgpu-pro/bin/clinfo
/opt/amdgpu-pro/share/libdrm/amdgpu.ids: No such file or directory
Segmentation fault (core dumped)
I don't know what /opt/amdgpu-pro/share/libdrm/amdgpu.ids is, but I don't think either of us has it. That error doesn't occur with the AMD version of libOpenCL, though.
Yikes. I rebooted and GDM wouldn't start. Dang, this is worse than expected. I removed everything I installed and then GDM would start.
That doesn't even make sense. GDM wouldn't have been looking at any of the libraries, at least not directly. The only component installed globally would be the ICD stuff.
So, I'm still missing a piece.
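For anyone comparing systems, here is a minimal Python sketch for looking at "the ICD stuff". It assumes the OpenCL ICD loader reads its vendor list from /etc/OpenCL/vendors, which is the usual default on Linux but may differ on a given install. The function names are made up for illustration.

```python
from pathlib import Path


def read_icd_entries(vendors_dir="/etc/OpenCL/vendors"):
    """Map each *.icd file name to the library name or path it registers."""
    entries = {}
    directory = Path(vendors_dir)
    if not directory.is_dir():
        return entries  # no ICD registry at this location
    for icd_file in sorted(directory.glob("*.icd")):
        lines = icd_file.read_text(errors="replace").splitlines()
        # An .icd file normally holds a single line naming the vendor library.
        libs = [line.strip() for line in lines if line.strip()]
        entries[icd_file.name] = libs[0] if libs else ""
    return entries


def missing_libraries(entries):
    """Return the ICD files whose absolute library path does not exist.

    Bare names such as 'libamdocl64.so' are resolved by the dynamic loader
    (LD_LIBRARY_PATH, ldconfig cache), so they are not checked here.
    """
    return [name for name, lib in entries.items()
            if lib.startswith("/") and not Path(lib).exists()]


if __name__ == "__main__":
    found = read_icd_entries()
    for name, lib in found.items():
        print(f"{name}: {lib}")
    for name in missing_libraries(found):
        print(f"warning: {name} points at a library that is not on disk")
```

Running this on both machines and comparing the output would at least show whether the same vendor libraries are registered on each.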
I don't know if this is relevant, but somewhere around the 4.8/4.9 kernel, the vsyscall setting changed from emulate to none for most Linux distributions. If you look in /var/log/messages (or similar) and see a bunch of errors about vsyscall, this could be the source. You can add vsyscall=emulate to the options on the Linux kernel boot line, and this seems to help. I gather the option was changed to none because of security concerns. So hopefully, at some point, BOINC will stop requiring it to be set to emulate.
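As a rough way to check both things mentioned above, here is a small Python sketch. It assumes the running kernel's boot options are readable from /proc/cmdline and that you pass in log lines from wherever your distribution keeps them. The function names are hypothetical, and the log check is just a substring match, not a parser for any particular message format.

```python
def vsyscall_mode(cmdline):
    """Return the vsyscall=<mode> value on a kernel command line, or None.

    If the option is repeated, the last occurrence is reported.
    """
    mode = None
    for token in cmdline.split():
        if token.startswith("vsyscall="):
            mode = token.split("=", 1)[1]
    return mode


def count_vsyscall_messages(log_lines):
    """Count log lines that mention vsyscall (case-insensitive)."""
    return sum(1 for line in log_lines if "vsyscall" in line.lower())


if __name__ == "__main__":
    with open("/proc/cmdline") as f:
        mode = vsyscall_mode(f.read())
    print("vsyscall option:", mode or "not set (kernel build default applies)")
```

If the option is not on the command line at all, the mode in effect is whatever the kernel was built with, so checking after a reboot confirms whether the boot-line change actually took.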
Wow, thanks Gordon. I noticed your posts elsewhere, but didn't think that was related. After your post here, I checked, and I do see vsyscall at the very bottom of my backtrace. Hmm. I'll give it a try on my next reboot.
So, I'm trying to understand this issue. Does this mean that the E@H GPU app uses vsyscall directly? There isn't any problem with CPU apps, only GPU apps. So, it must not be BOINC, right? All GPU apps on my system fail, though I don't have logs from any app other than E@H to check for this particular indicator. It may not be the same problem. Or this may be only one problem, and a new one will pop up after I fix it.
The E@H account service was down yesterday, so I haven't been able to enable GPU computing to test. However, I tried MilkyWay@H and got only failures. Also, I'm still getting SETI@H errors, too. So, I don't think vsyscall has anything to do with my root problem.
I have confirmed that vsyscall=emulate made no improvement to E@H. I'm running kernel 4.13 now. It's a very interesting suggestion, though, since my crashes end with:
7ffe849b9000-7ffe849bc000 r--p 00000000 00:00 0 [vvar]
7ffe849bc000-7ffe849be000 r-xp 00000000 00:00 0 [vdso]
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0 [vsyscall]
but changing to 'emulate' had no effect.
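For what it's worth, those three lines look like the tail of a /proc/<pid>/maps style memory map, which is listed in address order. Here is a small Python sketch that parses them, assuming the standard maps field layout. The names Mapping and parse_maps_line are made up for illustration.

```python
from typing import NamedTuple


class Mapping(NamedTuple):
    start: int
    end: int
    perms: str
    name: str


def parse_maps_line(line):
    """Parse one /proc/<pid>/maps-style line into a Mapping."""
    fields = line.split(None, 5)
    start_hex, end_hex = fields[0].split("-")
    name = fields[5].strip() if len(fields) > 5 else ""
    return Mapping(int(start_hex, 16), int(end_hex, 16), fields[1], name)


# The three lines quoted in the post above.
TAIL = """\
7ffe849b9000-7ffe849bc000 r--p 00000000 00:00 0 [vvar]
7ffe849bc000-7ffe849be000 r-xp 00000000 00:00 0 [vdso]
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0 [vsyscall]
"""

mappings = [parse_maps_line(line) for line in TAIL.splitlines()]
highest = max(mappings, key=lambda m: m.start)
print(highest.name, hex(highest.start), highest.end - highest.start)
```

The [vsyscall] page has the highest start address of the three, so it would sort last in any address-ordered listing. Its position at the bottom may therefore say nothing about whether the crash itself involved vsyscall, which would be consistent with 'emulate' making no difference.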
Problems persist 6 months later after more upgrades.
Same problem reported here https://einsteinathome.org/content/computation-errors-ubuntu17-rx460-opencl-mesa
See these bugs.
https://bugs.freedesktop.org/show_bug.cgi?id=104182
https://bugs.freedesktop.org/show_bug.cgi?id=104681
OMG, it's working. I noticed some updates to Mesa, and just decided to try Einstein@Home. Yay!
I now have DRM 3.27.0 and LLVM 7.0.0, driver 18.2.6 and OpenCL 1.1, Mesa 18.2.6.
Fingers crossed, this keeps working for a long time. It's been down for 2.5 years (well, less than, but I'm not sure how much less.)
Anybody else see this?