All Einstein@Home jobs fail immediately on starting

UnionJack
Joined: 9 Feb 05
Posts: 15
Credit: 69899100
RAC: 17559
Topic 198616

E@H reports a computation error on every one of its tasks as soon as the task starts.

I've tried resetting the project. I've tried removing the project altogether, then reinstating it after checking that no einst* files exist. I've tried setting no GPUs on the project's preferences page. What else can I try?

This is a new Armari box with i7-5820K running at 4GHz, 32GB RAM, 256GB NVMe SSD and AMD/ATI Radeon R9 380X graphics (Tonga chipset) with sys-firmware/amdgpu-ucode-20160616. I'm running Gentoo Linux with sci-misc/boinc-7.6.31-r3 and app-emulation/virtualbox-4.3.32.

--
Rgds
Peter.

UnionJack
Joined: 9 Feb 05
Posts: 15
Credit: 69899100
RAC: 17559

Of course, I thought of something to try just after I hit Post: I set 1 in cc_config.xml and restarted BOINC.

The status of E@H jobs is now shown as "GPU missing". I'm waiting for these jobs to be started, to see what they do.

--
Rgds
Peter.

Christian Beer
Joined: 9 Feb 05
Posts: 595
Credit: 118624819
RAC: 109682

It seems your OpenCL driver is not installed correctly or is not working with our app.

from stderr.txt of one of the failed tasks:

Quote:
../../projects/einstein.phys.uwm.edu/einsteinbinary_BRP6_1.53_x86_64-pc-linux-gnu__BRP6-opencl-ati: /usr/lib64/libOpenCL.so.1: no version information available (required by ../../projects/einstein.phys.uwm.edu/einsteinbinary_BRP6_1.53_x86_64-pc-linux-gnu__BRP6-opencl-ati)

UnionJack
Joined: 9 Feb 05
Posts: 15
Credit: 69899100
RAC: 17559

Quote:
It seems your openCL driver is not installed correctly or is not working with our app.

Yes, that's how it seems to me too. As to OpenCL not being installed correctly, all I did was set USE="opencl" in /etc/portage/make.conf, and the required libraries were pulled in and compiled. No errors were reported, as far as I can remember.

Quote:
from stderr.txt of one of the failed tasks:
Quote:
../../projects/einstein.phys.uwm.edu/einsteinbinary_BRP6_1.53_x86_64-pc-linux-gnu__BRP6-opencl-ati: /usr/lib64/libOpenCL.so.1: no version information available (required by ../../projects/einstein.phys.uwm.edu/einsteinbinary_BRP6_1.53_x86_64-pc-linux-gnu__BRP6-opencl-ati)

I don't know where you found that, but it doesn't exist here:

$ find ~/boinc -type f -exec grep 'libOpenCL.so.1' {} +
Binary file ./projects/einstein.phys.uwm.edu/einsteinbinary_BRP6_1.53_x86_64-pc-linux-gnu__BRP6-opencl-ati matches
$

That's the only similar reference I can find.

Is there a way for me to debug what's going on here? Can I run an strace or something? I couldn't find anything helpful via Google.

--
Rgds
Peter.

Christian Beer
Joined: 9 Feb 05
Posts: 595
Credit: 118624819
RAC: 109682

I'm not that fluent with Gentoo. On Debian I installed the "ocl-icd-libopencl1" package which provides me with a functional library. I don't know which version information is missing from your libOpenCL.so.1. Maybe you can compare yours with mine. Here is some output:

$ ldd /usr/lib/x86_64-linux-gnu/libOpenCL.so.1
        linux-vdso.so.1 (0x00007ffefb6ec000)
        libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007f7f7683c000)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f7f76498000)
        /lib64/ld-linux-x86-64.so.2 (0x000055ebaff6c000)

$ readelf -a -W /usr/lib/x86_64-linux-gnu/libOpenCL.so.1 | head -20
ELF Header:
  Magic:   7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00
  Class:                             ELF64
  Data:                              2's complement, little endian
  Version:                           1 (current)
  OS/ABI:                            UNIX - System V
  ABI Version:                       0
  Type:                              DYN (Shared object file)
  Machine:                           Advanced Micro Devices X86-64
  Version:                           0x1
  Entry point address:               0x42c0
  Start of program headers:          64 (bytes into file)
  Start of section headers:          41544 (bytes into file)
  Flags:                             0x0
  Size of this header:               64 (bytes)
  Size of program headers:           56 (bytes)
  Number of program headers:         8
  Size of section headers:           64 (bytes)
  Number of section headers:         30
  Section header string table index: 29

$ readelf -d /usr/lib/x86_64-linux-gnu/libOpenCL.so.1.0.0

Dynamic section at offset 0x9db0 contains 29 entries:
Tag Type Name/Value
0x0000000000000001 (NEEDED) Shared library: [libdl.so.2]
0x0000000000000001 (NEEDED) Shared library: [libc.so.6]
0x0000000000000001 (NEEDED) Shared library: [ld-linux-x86-64.so.2]
0x000000000000000e (SONAME) Library soname: [libOpenCL.so.1]
0x000000000000000c (INIT) 0x40c0
0x000000000000000d (FINI) 0x6a18
0x0000000000000019 (INIT_ARRAY) 0x209588
0x000000000000001b (INIT_ARRAYSZ) 8 (bytes)
0x000000000000001a (FINI_ARRAY) 0x209590
0x000000000000001c (FINI_ARRAYSZ) 8 (bytes)
0x000000006ffffef5 (GNU_HASH) 0x228
0x0000000000000005 (STRTAB) 0x1698
0x0000000000000006 (SYMTAB) 0x6d8
0x000000000000000a (STRSZ) 3094 (bytes)
0x000000000000000b (SYMENT) 24 (bytes)
0x0000000000000003 (PLTGOT) 0x20a000
0x0000000000000002 (PLTRELSZ) 672 (bytes)
0x0000000000000014 (PLTREL) RELA
0x0000000000000017 (JMPREL) 0x3e20
0x0000000000000007 (RELA) 0x2548
0x0000000000000008 (RELASZ) 6360 (bytes)
0x0000000000000009 (RELAENT) 24 (bytes)
0x000000006ffffffc (VERDEF) 0x2400
0x000000006ffffffd (VERDEFNUM) 6
0x000000006ffffffe (VERNEED) 0x24c8
0x000000006fffffff (VERNEEDNUM) 3
0x000000006ffffff0 (VERSYM) 0x22ae
0x000000006ffffff9 (RELACOUNT) 259
0x0000000000000000 (NULL) 0x0

Christian Beer
Joined: 9 Feb 05
Posts: 595
Credit: 118624819
RAC: 109682

I found something that seems related: http://stackoverflow.com/questions/137773/what-does-the-no-version-information-available-error-from-linux-dynamic-linker. The message "no version information available" means the libOpenCL.so you have installed is older than the one we built the application against. This is a warning rather than an error, so it may or may not be related.

For completeness: the message "[16:32:00][10115][ERROR] Couldn't build OpenCL program (error: -11)!" reports CL_BUILD_PROGRAM_FAILURE, which is the return value from clBuildProgram(). The OpenCL documentation describes it as: "if there is a failure to build the program executable. This error will be returned if clBuildProgram does not return until the build has completed." I don't know why the build fails.
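For anyone decoding such lines by hand, here is a minimal Python sketch that pulls the numeric code out of an "(error: N)" message and maps it to its OpenCL name. The table is only a small subset of the status codes in the Khronos cl.h header, and the helper name decode_cl_error is made up for this illustration; it is not part of the Einstein@Home app or of BOINC.

```python
import re

# Small subset of OpenCL status codes, as defined in the Khronos cl.h header.
CL_ERRORS = {
    0: "CL_SUCCESS",
    -1: "CL_DEVICE_NOT_FOUND",
    -2: "CL_DEVICE_NOT_AVAILABLE",
    -3: "CL_COMPILER_NOT_AVAILABLE",
    -4: "CL_MEM_OBJECT_ALLOCATION_FAILURE",
    -5: "CL_OUT_OF_RESOURCES",
    -6: "CL_OUT_OF_HOST_MEMORY",
    -11: "CL_BUILD_PROGRAM_FAILURE",
}

# Matches the "(error: N)" suffix seen in the stderr message quoted in this thread.
ERROR_RE = re.compile(r"\(error:\s*(-?\d+)\)")


def decode_cl_error(line):
    """Return (code, name) for a log line, or None if it carries no error code."""
    match = ERROR_RE.search(line)
    if match is None:
        return None
    code = int(match.group(1))
    return code, CL_ERRORS.get(code, "unknown OpenCL error")


if __name__ == "__main__":
    sample = "[16:32:00][10115][ERROR] Couldn't build OpenCL program (error: -11)!"
    print(decode_cl_error(sample))
```

This only names the error. To learn why the build failed, the compiler's build log would be needed as well.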

UnionJack
Joined: 9 Feb 05
Posts: 15
Credit: 69899100
RAC: 17559

Many thanks for your help, Christian.

Quote:
Here is some output:
$ ldd /usr/lib/x86_64-linux-gnu/libOpenCL.so.1
        linux-vdso.so.1 (0x00007ffefb6ec000)
        libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007f7f7683c000)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f7f76498000)
        /lib64/ld-linux-x86-64.so.2 (0x000055ebaff6c000)


Mine is so different already that I can't see how to reconcile it with yours (it's many years since I could claim to be a coder). Here is what I get:

# ldd /usr/lib64/OpenCL/vendors/mesa/libOpenCL.so.1
        linux-vdso.so.1 (0x00007fff0d168000)
        libexpat.so.1 => /usr/lib64/libexpat.so.1 (0x00007fc01361c000)
        libdrm.so.2 => /usr/lib64/libdrm.so.2 (0x00007fc01340d000)
        libelf.so.1 => /usr/lib64/libelf.so.1 (0x00007fc0131f5000)
        libdl.so.2 => /lib64/libdl.so.2 (0x00007fc012ff1000)
        libLLVM-3.6.so => /usr/lib64/libLLVM-3.6.so (0x00007fc0117c1000)
        libpthread.so.0 => /lib64/libpthread.so.0 (0x00007fc0115a4000)
        libstdc++.so.6 => /usr/lib/gcc/x86_64-pc-linux-gnu/4.9.3/libstdc++.so.6 (0x00007fc01124c000)
        libm.so.6 => /lib64/libm.so.6 (0x00007fc010f56000)
        libc.so.6 => /lib64/libc.so.6 (0x00007fc010bb9000)
        libgcc_s.so.1 => /usr/lib/gcc/x86_64-pc-linux-gnu/4.9.3/libgcc_s.so.1 (0x00007fc0109a2000)
        /lib64/ld-linux-x86-64.so.2 (0x0000562fa918d000)
        libz.so.1 => /lib64/libz.so.1 (0x00007fc01078c000)
        libffi.so.6 => /usr/lib64/libffi.so.6 (0x00007fc010582000)
        libncurses.so.5 => /lib64/libncurses.so.5 (0x00007fc01032b000)


Quote:
$ readelf -a -W /usr/lib/x86_64-linux-gnu/libOpenCL.so.1 | head -20
--->8


This is mine:

# readelf -a -W /usr/lib64/OpenCL/vendors/mesa/libOpenCL.so.1 | head -20
ELF Header:
  Magic:   7f 45 4c 46 02 01 01 03 00 00 00 00 00 00 00 00 
  Class:                             ELF64
  Data:                              2's complement, little endian
  Version:                           1 (current)
  OS/ABI:                            UNIX - GNU
  ABI Version:                       0
  Type:                              DYN (Shared object file)
  Machine:                           Advanced Micro Devices X86-64
  Version:                           0x1
  Entry point address:               0x126000
  Start of program headers:          64 (bytes into file)
  Start of section headers:          15731144 (bytes into file)
  Flags:                             0x0
  Size of this header:               64 (bytes)
  Size of program headers:           56 (bytes)
  Number of program headers:         7
  Size of section headers:           64 (bytes)
  Number of section headers:         26
  Section header string table index: 25


Finally:

# readelf -d /usr/lib64/OpenCL/vendors/mesa/libOpenCL.so.1.0.0
Dynamic section at offset 0xef9bb8 contains 34 entries:
  Tag        Type                         Name/Value
 0x0000000000000001 (NEEDED)             Shared library: [libexpat.so.1]
 0x0000000000000001 (NEEDED)             Shared library: [libdrm.so.2]
 0x0000000000000001 (NEEDED)             Shared library: [libelf.so.1]
 0x0000000000000001 (NEEDED)             Shared library: [libdl.so.2]
 0x0000000000000001 (NEEDED)             Shared library: [libLLVM-3.6.so]
 0x0000000000000001 (NEEDED)             Shared library: [libpthread.so.0]
 0x0000000000000001 (NEEDED)             Shared library: [libstdc++.so.6]
 0x0000000000000001 (NEEDED)             Shared library: [libm.so.6]
 0x0000000000000001 (NEEDED)             Shared library: [libc.so.6]
 0x0000000000000001 (NEEDED)             Shared library: [libgcc_s.so.1]
 0x000000000000000e (SONAME)             Library soname: [libOpenCL.so.1]
 0x000000000000000c (INIT)               0x121f00
 0x000000000000000d (FINI)               0xc3d4dc
 0x0000000000000019 (INIT_ARRAY)         0x105e470
 0x000000000000001b (INIT_ARRAYSZ)       128 (bytes)
 0x000000000000001a (FINI_ARRAY)         0x105e4f0
 0x000000000000001c (FINI_ARRAYSZ)       8 (bytes)
 0x000000006ffffef5 (GNU_HASH)           0x1c8
 0x0000000000000005 (STRTAB)             0x7788
 0x0000000000000006 (SYMTAB)             0x660
 0x000000000000000a (STRSZ)              51441 (bytes)
 0x000000000000000b (SYMENT)             24 (bytes)
 0x0000000000000003 (PLTGOT)             0x10fa000
 0x0000000000000002 (PLTRELSZ)           24888 (bytes)
 0x0000000000000014 (PLTREL)             RELA
 0x0000000000000017 (JMPREL)             0x11bdc8
 0x0000000000000007 (RELA)               0x14b78
 0x0000000000000008 (RELASZ)             1077840 (bytes)
 0x0000000000000009 (RELAENT)            24 (bytes)
 0x000000006ffffffe (VERNEED)            0x149e8
 0x000000006fffffff (VERNEEDNUM)         7
 0x000000006ffffff0 (VERSYM)             0x1407a
 0x000000006ffffff9 (RELACOUNT)          43519
 0x0000000000000000 (NULL)               0x0


Can you see anything useful there?

As to the stackoverflow example, it seems to be saying the same as you did earlier: that the library version I have is older than yours. I don't know what I can do about that, other than waiting until Gentoo catches up.
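One difference does stand out between the two readelf -d listings: the Debian library has VERDEF and VERDEFNUM entries, while the mesa one has only VERNEED. As far as I understand glibc's dynamic loader, "no version information available" is printed when a program asks for versioned symbols from a library that defines no symbol versions at all, which would fit. A minimal Python sketch of that check, run on saved readelf -d output (the function names are invented for this example, and the sample text is a few rows copied from the listings above):

```python
import re

# One row of `readelf -d` output looks like: "0x... (TAG) value".
TAG_RE = re.compile(r"^\s*0x[0-9a-fA-F]+\s+\((\w+)\)")


def dynamic_tags(readelf_output):
    """Collect the set of dynamic-section tag names from `readelf -d` text."""
    tags = set()
    for line in readelf_output.splitlines():
        match = TAG_RE.match(line)
        if match:
            tags.add(match.group(1))
    return tags


def defines_symbol_versions(readelf_output):
    """True if the library carries version definitions (a VERDEF entry)."""
    return "VERDEF" in dynamic_tags(readelf_output)


if __name__ == "__main__":
    # Rows taken from the Debian ocl-icd listing earlier in the thread.
    debian = """\
0x000000006ffffffc (VERDEF) 0x2400
0x000000006ffffffd (VERDEFNUM) 6
0x000000006ffffffe (VERNEED) 0x24c8
"""
    # Rows taken from the Gentoo mesa listing above.
    mesa = """\
 0x0000000000000009 (RELAENT)            24 (bytes)
 0x000000006ffffffe (VERNEED)            0x149e8
 0x000000006fffffff (VERNEEDNUM)         7
"""
    print("Debian ocl-icd defines versions:", defines_symbol_versions(debian))
    print("Gentoo mesa defines versions:   ", defines_symbol_versions(mesa))
```

Whether the missing version definitions have anything to do with the failing tasks is a separate question, since the loader message is only a warning.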

--
Rgds
Peter.

Christian Beer
Joined: 9 Feb 05
Posts: 595
Credit: 118624819
RAC: 109682

I also found this Gentoo OpenCL wiki page, which states that the mesa implementation you are using works with almost all AMD cards (it seems you have one of those it does not work with). The page lists other implementations, including the AMD one, which I think you should try next. If you don't need the mesa implementation you should uninstall it again; otherwise you need the ICD package too.

Juha
Joined: 27 Nov 14
Posts: 49
Credit: 4962246
RAC: 23

Quote:
For completeness: the message: "[16:32:00][10115][ERROR] Couldn't build OpenCL program (error: -11)!" means CL_BUILD_PROGRAM_FAILURE if there is a failure to build the program executable. This error will be returned if clBuildProgram does not return until the build has completed. and is the return value from clBuildProgram(). I don't know why that fails.

He likely needs a newer libclc, the same as Fedora and Ubuntu 16.04 users. Gentoo has a 0.2.0_pre20160209 version of libclc in testing, which might be new enough. The modf function that was missing was checked in during January 2016.

UnionJack
Joined: 9 Feb 05
Posts: 15
Credit: 69899100
RAC: 17559

Thanks for the pointer to the new libclc. I've upgraded to it, which involved upgrading a few other packages too (llvm, clang, ...), and now I see this in the event log:

Wed 25 May 2016 23:27:55 BST | | OpenCL: AMD/ATI GPU 0: AMD TONGA (DRM 3.1.0, LLVM 3.7.1) (driver version 11.0.6, device version OpenCL 1.1 MESA 11.0.6, 1024MB, 1024MB available, 50 GFLOPS peak)
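For checking lines like this across restarts, a rough Python sketch that extracts the device fields from an "OpenCL:" detection line. The pattern is fitted to the single line quoted above, so treat it as an assumption that other BOINC versions or vendors print the same layout; parse_opencl_line is a name invented for this example.

```python
import re

# Fitted to the one event-log line quoted in this post. Assumption: other
# BOINC versions or GPU vendors may format the detection line differently.
OPENCL_RE = re.compile(
    r"OpenCL: (?P<vendor>.+?) GPU (?P<index>\d+): (?P<name>.+?) "
    r"\(driver version (?P<driver>[^,]+), "
    r"device version (?P<device>[^,]+),"
)


def parse_opencl_line(line):
    """Return a dict of device fields, or None if the line does not match."""
    match = OPENCL_RE.search(line)
    if match is None:
        return None
    info = match.groupdict()
    info["index"] = int(info["index"])
    return info


if __name__ == "__main__":
    line = (
        "Wed 25 May 2016 23:27:55 BST | | OpenCL: AMD/ATI GPU 0: AMD TONGA "
        "(DRM 3.1.0, LLVM 3.7.1) (driver version 11.0.6, device version "
        "OpenCL 1.1 MESA 11.0.6, 1024MB, 1024MB available, 50 GFLOPS peak)"
    )
    print(parse_opencl_line(line))
```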

I haven't seen this before; it seems to augur well.

I have 12 E@H jobs running at the moment and communication with the project is backed off, presumably until they've finished, about 7 - 9 hours estimated.

See you in the morning...

--
Rgds
Peter.

Christian Beer
Joined: 9 Feb 05
Posts: 595
Credit: 118624819
RAC: 109682

Quote:
Quote:
For completeness: the message: "[16:32:00][10115][ERROR] Couldn't build OpenCL program (error: -11)!" means CL_BUILD_PROGRAM_FAILURE if there is a failure to build the program executable. This error will be returned if clBuildProgram does not return until the build has completed. and is the return value from clBuildProgram(). I don't know why that fails.

He likely needs a newer libclc the same as Fedora and Ubuntu 16.04 users. Gentoo has a 0.2.0_pre20160209 version of libclc in testing which might be new enough. The modf function that was missing was checked in in January 2016.

So this was an LLVM compiler issue? I can't find the libclc library in the dependencies, but the Gentoo libOpenCL.so has libLLVM as a dependency. Was this the clue to solving this?
