GPU crunching on Linux - Fun with AMD fglrx drivers

robl
robl
Joined: 2 Jan 13
Posts: 1709
Credit: 1454482408
RAC: 8689

RE: RE: I rebooted the

Quote:
Quote:
I rebooted the node. Started boincmgr from the command line (not using the Ubuntu boinc package) All CPU jobs that were in progress started to run BUT NOT the GPU jobs - they were indicating no usable GPU.

There have been many reports over the years of a timing problem during BOINC startup under Linux. If the BOINC client is started "too soon" after a reboot (a deliberately vague term), and the X server isn't fully initialised, then the GPU drivers won't have finished initialising either, and BOINC - which relies on functioning drivers for GPU detection - won't be able to see the GPUs.


I had read about this possibility over at GPU Grid. Their solution was to modify the /etc/init.d/boinc-client script to include a delay during startup. In my case that solution is not applicable because I am using the Berkeley installer, which does not produce an init.d script. Since BOINC in my situation is started from the command line after login, I would think that all the necessary X server initialisation would already be complete.
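For reference, the GPU Grid style delay could be sketched as a small wait loop rather than a fixed sleep. This is a hypothetical sketch, not their actual script: it assumes the X lock file (traditionally /tmp/.X0-lock for display :0) is a reasonable signal that the X server is up, and it takes the path as a parameter so it can be adapted.

```shell
# Hypothetical delay loop for a BOINC startup script: wait up to 60 seconds
# for the X server before launching the client. The lock-file path is an
# assumption (X traditionally creates /tmp/.X0-lock for display :0).
wait_for_x() {
    lock="${1:-/tmp/.X0-lock}"
    tries=0
    while [ "$tries" -lt 60 ]; do
        [ -e "$lock" ] && return 0   # X appears to be up
        sleep 1
        tries=$((tries + 1))
    done
    return 1                          # gave up after 60 seconds
}
```

In an init.d script you would call wait_for_x before starting the client; a plain "sleep 30" is the cruder variant of the same idea.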

Quote:

Could you try repeating that sequence, with either a longer pause before starting BOINC, or alternatively trying a *BOINC* closedown/restart if you see 'no usable GPU'?


I have repeated the same sequence but without a new kernel upgrade.

Here are the steps:
1. ps -ae | grep boinc
both boinc client and manager running

2. gracefully exited the boinc manager.
only boinc client continued to run

3. reboot

4. login
ps -ae | grep boinc
nothing

5. cd to $HOME/BOINC
./run_manager&

6. ps -ae | grep boinc
both the client and the manager are running
both CPU and GPU tasks are crunching.

What follows is a "guess" on my part. The difference between this restart and yesterday's is that yesterday's effort downloaded and installed a new Linux kernel. I believe that when installing the AMD driver the kernel environment is somehow modified, and that is why everything works after reinstalling the driver and performing a reboot. In my use of a PPA to manage NVIDIA drivers on Linux, there is a hook in that process that uses DKMS to rebuild the kernel device drivers. It is transparent in that it occurs after the install of the driver. You then reboot the machine and you are good to go. No issues with missing GPUs. By reinstalling the driver yesterday on the AMD node, I believe the new kernel had been modified by the driver installation process and everything was happy.
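As a rough illustration of what such a DKMS hook does behind the scenes, these are the commands it would effectively run for a new kernel. The module/version pairing here is illustrative only - substitute whatever `dkms status` reports on your machine - and the commands are echoed rather than executed, since for real they need root and an installed dkms package.

```shell
# Illustrative only: what a DKMS kernel-install hook effectively runs.
# "nvidia/331.38" is an example pairing; check `dkms status` for the
# real module/version names registered on your system.
KVER="$(uname -r)"
echo "dkms build   nvidia/331.38 -k $KVER"
echo "dkms install nvidia/331.38 -k $KVER"
```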

Quote:

Apropos, on your system, does launching BOINC Manager trigger the startup of the client, or does the client start by itself from a daemon launch script, much earlier in the boot process?

Yes, starting the BOINC Manager does start the boinc-client.

I came across this yesterday so I just tried this start up procedure.

cd your_directory_path/BOINC
./run_client --daemon    <-- start boinc-client as a daemon
./run_manager

ps -ae | grep boinc
shows one client and one manager running.

Does not improve anything.

The problem with the Berkeley install is that you cannot gracefully restart/bounce the client with a "sudo service boinc-client restart". That sometimes fixes "missing GPU" issues on NVIDIA. Yes, I have seen "missing GPUs" on NVIDIA after a reboot without a new kernel, and this can usually be fixed by bouncing the client - which would seem to indicate a timing issue during boot. I could install Ubuntu's distro version of BOINC after these current jobs finish and see how that works out. Earlier in this thread I could not get the distro's BOINC to crunch GPU WUs, but maybe I was confused with all of the other clutter. Might be worth a try - after the current work is completed.
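For what it's worth, a manual "bounce" for a Berkeley-style install could be sketched like this. It assumes the layout from the steps above (the client lives in $HOME/BOINC, with run_client alongside) and uses boinccmd, the command-line control tool that ships with the client; treat it as a sketch, not a tested service script.

```shell
# Hedged sketch: restart a Berkeley-installed BOINC client by hand,
# since there is no service script for "service boinc-client restart".
bounce_boinc() {
    dir="${1:-$HOME/BOINC}"
    if [ ! -x "$dir/boinccmd" ]; then
        echo "no client install found in $dir"
        return 1
    fi
    "$dir/boinccmd" --quit        # ask the running client to exit cleanly
    sleep 5                       # give it time to checkpoint and stop
    ( cd "$dir" && ./run_client --daemon )
}
```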

I keep saying it can't be this hard, but ....

Richard Haselgrove
Richard Haselgrove
Joined: 10 Dec 05
Posts: 2139
Credit: 2752801655
RAC: 1436958

I've seen users report an

I've seen users report an error message after installing new NVidia drivers - probably not using the same PPA route that you are using - along the lines of "kernel version number doesn't match NVidia driver component version number - re-compile your kernel" (or words to that effect - not an exact quote). Maybe AMD requires the same, but doesn't tell you about it?

robl
robl
Joined: 2 Jan 13
Posts: 1709
Credit: 1454482408
RAC: 8689

Did some poking around on the

Did some poking around on the AMD node in the area of the driver logs and came across this file: fglrx-install.log. It gets generated during the driver installation routine and contains the following:

  • Supported adapter detected. Check if system has the tools required for installation.
    Uninstalling any previously installed drivers.
    [Message] Kernel Module : Trying to install a precompiled kernel module.
    [Message] Kernel Module :

Precompiled kernel module version mismatched.
[Message] Kernel Module : Found kernel module build environment, generating kernel module now.
AMD kernel module generator version 2.1
doing Makefile based build for kernel 2.6.x and higher
rm -rf *.c *.h *.o *.ko *.a .??* *.symvers
make -C /lib/modules/3.11.0-18-generic/build SUBDIRS=/lib/modules/fglrx/build_mod/2.6.x modules
make[1]: Entering directory `/usr/src/linux-headers-3.11.0-18-generic'
CC [M] /lib/modules/fglrx/build_mod/2.6.x/firegl_public.o
CC [M] /lib/modules/fglrx/build_mod/2.6.x/kcl_acpi.o
CC [M] /lib/modules/fglrx/build_mod/2.6.x/kcl_agp.o
CC [M] /lib/modules/fglrx/build_mod/2.6.x/kcl_debug.o
CC [M] /lib/modules/fglrx/build_mod/2.6.x/kcl_ioctl.o
CC [M] /lib/modules/fglrx/build_mod/2.6.x/kcl_io.o
CC [M] /lib/modules/fglrx/build_mod/2.6.x/kcl_pci.o
CC [M] /lib/modules/fglrx/build_mod/2.6.x/kcl_str.o
CC [M] /lib/modules/fglrx/build_mod/2.6.x/kcl_iommu.o
CC [M] /lib/modules/fglrx/build_mod/2.6.x/kcl.o
CC [M] /lib/modules/fglrx/build_mod/2.6.x/kcl_wait.o
LD [M] /lib/modules/fglrx/build_mod/2.6.x/fglrx.o
Building modules, stage 2.
MODPOST 1 modules
CC /lib/modules/fglrx/build_mod/2.6.x/fglrx.mod.o
LD [M] /lib/modules/fglrx/build_mod/2.6.x/fglrx.ko
make[1]: Leaving directory `/usr/src/linux-headers-3.11.0-18-generic'
build succeeded with return value 0
duplicating results into driver repository...
done.
You must change your working directory to /lib/modules/fglrx
and then call ./make_install.sh in order to install the built module.
- recreating module dependency list
- trying a sample load of the kernel modules
done.
[Reboot] Kernel Module : update-initramfs

The comment about running "make_install.sh" is handled for you. The messages "recreating module" and "trying a sample load" come from that make_install.sh script.

This seems to imply that when installing this driver (and probably others) there is some magic being done to the kernel environment that is required for a successful driver install, and that if you download/install a new kernel using your Linux package manager, you will have to reinstall the AMD driver to rebuild the module for the new kernel and then reboot. If you don't, you can probably expect issues with GPUs not being recognized.
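One way to check for that situation before it bites could look like the sketch below. The "updates/dkms" path is the conventional DKMS install location on Debian/Ubuntu systems and is an assumption here; adjust it for wherever your distro actually installs the fglrx module.

```shell
# Sketch: check whether an fglrx module exists for the running kernel.
# The "updates/dkms" subdirectory is an assumption (the usual DKMS
# destination on Debian/Ubuntu); adjust for your distro if needed.
KVER="$(uname -r)"
MOD="/lib/modules/$KVER/updates/dkms/fglrx.ko"
if [ -e "$MOD" ]; then
    echo "fglrx module present for kernel $KVER"
else
    echo "no fglrx module at $MOD - reinstall the driver, then reboot"
fi
```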

Gary Roberts
Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5842
Credit: 109395960037
RAC: 35787516

RE: This seems to imply

Quote:
This seems to imply that when installing this driver (and probably others) there is some magic being done to the kernel environment that is required for a successful driver install, and that if you download/install a new kernel using your Linux package manager, you will have to reinstall the AMD driver to rebuild the module for the new kernel and then reboot. If you don't, you can probably expect issues with GPUs not being recognized.


Sorry for no recent responses but I've had some long overdue home duties to attend to :-).

I'm not a programmer so once you start to get down to the code level, I just tend to sit back and watch. My expectation is that a distro that has had time to mature ought to be able to seamlessly handle installation of different kernels and the automatic rebuilding and installation of kernel modules as required, including AMD GPU driver modules. If that is not happening automatically, I would regard that as a problem to be attended to by the distro maintainers. Changing your kernel should not require any manual intervention.

As already indicated, the only distro I have significant exposure to is PCLinuxOS. The spin I use is KDE-MiniMe 32 bit. It's quite lightweight (for KDE) and it comes standard with a non-PAE kernel. That's fine for the older 2GB machines but I now have an increasing number of more modern hosts with (mainly) 8GB RAM and as part of the post-installation setup, I run synaptic to install all upgrades, a new PAE kernel, the WxWidgets libs for BOINC Manager and the AMD or NVIDIA drivers (including CUDA or OpenCL) as appropriate. All of that is done in one hit and I've never had an issue with the proper building and installation of kernel modules for the new kernel.

The Grub menu defaults to the new kernel, and the old kernel is also listed so that I could boot it if I ever wanted to. During first boot, there is output on the screen indicating that several kernel modules are being built and installed for the new kernel. The process takes several minutes and I've never seen it deliver an outcome where the new kernel fails to boot with the necessary proprietary video drivers available. Sometimes xorg.conf must not be set up correctly, because X fails to start, but you get a text login which can be used to get a root shell, and a simple utility can be run to correct that. The utility is run with --auto as the switch and after that there is no problem with X starting correctly. This doesn't normally happen, but it has lately, and I'm putting that down to the fact that the system on the live USB is quite old and there are close to 300 packages that get upgraded. There will be a new .iso image available any day now, and once I upgrade the live USB I don't expect this oddity to recur.

My original reason for starting this thread can be put down to lack of knowledge - the knowledge that BOINC seems to need certain libs to be in /usr/lib/ rather than /usr/lib/fglrx-current/. All I have to do is find out who is maintaining the fglrx stuff on PCLinuxOS and ask them if they can include the creation of the links in the package or if there is some different solution. We shouldn't need to struggle with any of this. I don't really mind having done so - it has taught me a lot.
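The link workaround described here could look something like the following sketch. The library filename pattern and the /usr/lib/fglrx-current source path are assumptions taken from the description above - check what actually lives in that directory on PCLinuxOS before linking anything.

```shell
# Hypothetical helper: link GPU libs from a driver-private directory
# into a directory on the default loader path so BOINC can find them.
link_gpu_libs() {
    src="$1"   # e.g. /usr/lib/fglrx-current (assumed location)
    dst="$2"   # e.g. /usr/lib
    for lib in "$src"/libOpenCL.so*; do
        [ -e "$lib" ] || continue            # glob matched nothing
        ln -sf "$lib" "$dst/$(basename "$lib")"
    done
}
```

On a real system this would be run as root, e.g. link_gpu_libs /usr/lib/fglrx-current /usr/lib.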

Cheers,
Gary.

robl
robl
Joined: 2 Jan 13
Posts: 1709
Credit: 1454482408
RAC: 8689

RE: RE: This seems to

Quote:
Quote:
This seems to imply that when installing this driver (and probably others) there is some magic being done to the kernel environment that is required for a successful driver install, and that if you download/install a new kernel using your Linux package manager, you will have to reinstall the AMD driver to rebuild the module for the new kernel and then reboot. If you don't, you can probably expect issues with GPUs not being recognized.

Sorry for no recent responses but I've had some long overdue home duties to attend to :-).

I'm not a programmer so once you start to get down to the code level, I just tend to sit back and watch. My expectation is that a distro that has had time to mature ought to be able to seamlessly handle installation of different kernels and the automatic rebuilding and installation of kernel modules as required, including AMD GPU driver modules. If that is not happening automatically, I would regard that as a problem to be attended to by the distro maintainers. Changing your kernel should not require any manual intervention.

I am not sure where the responsibility for handling this should reside. I tend to lean more towards the hardware vendor and not the distro. I found this thread that addresses this very problem over on SUSE. When you look at his solution (specifically for AMD), I really have no idea why AMD/NVIDIA can't/don't produce a similar script. I have not tried his solution, so I am assuming that it works, but it looks reasonable. His script executes on boot-up and runs the requisite "makes". If you have many modules, though, this could make for a long boot.

I also came across this for NVIDIA:

1. uname -r

gives 3.8.0-37-generic

2. sudo apt-get install linux-headers-3.8.0-37-generic

3. sudo dpkg-reconfigure nvidia-current    <-- if nvidia-current is your current driver package

4. reboot

This too could be automated in the above script. It's important in this example to note that "nvidia-current" is an actual driver package name. This could change based upon the driver you have loaded, e.g., "sudo dpkg-reconfigure nvidia-304".
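Strung together, the four steps above could be scripted like so. This is a sketch: the commands are echoed rather than executed because they need root, and "nvidia-current" must be replaced with whichever driver package you actually have installed.

```shell
# Sketch of the header-reinstall procedure above. Echoed, not executed:
# drop the echos and run as root on a real machine.
KVER="$(uname -r)"
echo "apt-get install linux-headers-$KVER"
echo "dpkg-reconfigure nvidia-current"   # substitute your driver package
echo "reboot"
```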

robl
robl
Joined: 2 Jan 13
Posts: 1709
Credit: 1454482408
RAC: 8689

I believe that DKMS is the

I believe that DKMS is the solution to rebuilding the kernel modules when a new kernel is installed.
This is a fairly informative pdf about DKMS, especially the section on the DKMS autoinstaller. That section discusses a *.conf file. I believe the key here is that the hardware vendor must supply such a file with the driver package.

Here is such a conf file for VirtualBox, located in
/var/lib/dkms/virtualbox/4.1.12/build/dkms.conf:

PACKAGE_NAME="virtualbox"
PACKAGE_VERSION="4.1.12"
CLEAN="rm -f *.*o"
BUILT_MODULE_NAME[0]="vboxdrv"
BUILT_MODULE_LOCATION[0]="vboxdrv"
DEST_MODULE_LOCATION[0]="/updates"
BUILT_MODULE_NAME[1]="vboxnetadp"
BUILT_MODULE_LOCATION[1]="vboxnetadp"
DEST_MODULE_LOCATION[1]="/updates"
BUILT_MODULE_NAME[2]="vboxnetflt"
BUILT_MODULE_LOCATION[2]="vboxnetflt"
DEST_MODULE_LOCATION[2]="/updates"
BUILT_MODULE_NAME[3]="vboxpci"
BUILT_MODULE_LOCATION[3]="vboxpci"
DEST_MODULE_LOCATION[3]="/updates"
AUTOINSTALL="yes"

and here is one for NVIDIA:
/usr/src/nvidia-331-331.38/dkms.conf

PACKAGE_NAME="nvidia-331"
PACKAGE_VERSION="331.38"
CLEAN="make clean"
BUILT_MODULE_NAME[0]="nvidia"
DEST_MODULE_NAME[0]="nvidia_331"
MAKE[0]="make module KERNDIR=/lib/modules/$kernelver IGNORE_XEN_PRESENCE=1 IGNORE_CC_MISMATCH=1 SYSSRC=$kernel_source_dir LD=/usr/bin/ld.bfd"
DEST_MODULE_LOCATION[0]="/kernel/drivers/char/drm"
AUTOINSTALL="yes"
#PATCH[0]="allow_sublevel_greater_than_5.patch"
#PATCH_MATCH[0]="^3.[8-9]"
#PATCH[1]="buildfix_kernel_3.10.patch"
#PATCH_MATCH[1]="^3.[10-11]"
#PATCH[0]="buildfix_kernel_3.11.patch"

NOTE: the "AUTOINSTALL" parameter must be set to "yes".
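A quick way to verify that a given driver has opted in could be sketched as below; the grep pattern assumes the AUTOINSTALL="yes" quoting style shown in the files above.

```shell
# Sketch: check whether a dkms.conf opts in to automatic rebuilds on
# kernel upgrades, i.e. contains AUTOINSTALL="yes".
autoinstall_enabled() {
    grep -q '^AUTOINSTALL="yes"' "$1" 2>/dev/null
}
```

Example use: autoinstall_enabled /var/lib/dkms/virtualbox/4.1.12/build/dkms.conf && echo "will rebuild automatically".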

Gary Roberts
Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5842
Credit: 109395960037
RAC: 35787516

RE: I am not sure where the

Quote:
I am not sure where the responsibility for handling this should reside. I tend to lean more towards the hardware vendor and not the distro.


Hardware vendors may produce Linux drivers but not the framework that will allow these to be turned into modules that can be dynamically linked to different kernels in different distros. That's very much the responsibility of different distro maintainers. I don't really know anything about it but I think that many distros (including Ubuntu) use DKMS (dynamic kernel module support) to achieve this. PCLinuxOS certainly does and it just works - every time. Just do a search for "DKMS Ubuntu" and see what you get. My (extremely basic) understanding is that you need the DKMS framework installed and then you install DKMS packages of drivers of interest into that framework. From then on things get sorted automatically.

Quote:
I found this thread that address this very problem over on SUSE.


That thread dates back to 2010. The thread starter (please_try_again) is quite a brilliant writer of shell scripts. A couple of years later he wrote one called 'atiupgrade' that allowed you to upgrade the catalyst driver immediately as soon as AMD released a new one. It was even possible to upgrade to beta versions. I used that script when I first installed SUSE to get a working AMD crunching setup using a HD7770 card. The script worked a treat. Soon after that I found a community repo that was maintaining up-to-date drivers and so I ceased using the script.

If you look through the old thread you linked to, there is a reply by ken_yap which says, "Not trying to put down your contribution (thanks) but isn't this what DKMS is supposed to address?" please_try_again's response to this is quite informative. He didn't know about DKMS which was obviously available in SUSE even then.

Quote:
When you look at his solution (specifically for AMD), I really have no idea why AMD/NVIDIA can't/don't produce a similar script.


It would be almost impossible for the hardware vendors to maintain something that handled the quirks of every different distro out there. Far easier for distro maintainers to customise DKMS to suit their particular distro. That's my take on it anyway :-).

Cheers,
Gary.

Gary Roberts
Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5842
Credit: 109395960037
RAC: 35787516

RE: I believe that DKMS is

Quote:
I believe that DKMS is the solution to rebuilding the kernel modules when a new kernel is installed.
This is a fairly informative pdf about DKMS, especially the section on DKMS autoinstaller. This section will discuss a *.conf file. I believe the key here is that the hardware vendor must supply such a file with the driver package.


I hadn't seen this message from you whilst I was composing mine. I only saw it after I posted.

Essentially we are in complete agreement about the role of DKMS. However I don't think the hardware vendor provides the .conf file. Each distro would tweak various variables to suit their own purposes. I imagine there would be templates that come with DKMS itself that get filled in with the required values by each distro maintainer.

Cheers,
Gary.

robl
robl
Joined: 2 Jan 13
Posts: 1709
Credit: 1454482408
RAC: 8689

And here we are. You know

And here we are. You know, after all this I am still not 100% sure of the steps necessary to install an AMD GPU and get it working. :>) But I know a lot more than when I started out.

robl
robl
Joined: 2 Jan 13
Posts: 1709
Credit: 1454482408
RAC: 8689

RE: When you look at his

Quote:
When you look at his solution (specifically for AMD), I really have no idea why AMD/NVIDIA can't/don't produce a similar script.


Quote:
It would be almost impossible for the hardware vendors to maintain something that handled the quirks of every different distro out there. Far easier for distro maintainers to customise DKMS to suit their particular distro. That's my take on it anyway :-).

Gary,

Just for kicks I downloaded the latest drivers from both NVIDIA and AMD. I extracted the contents of both of these files by doing the following:

For NVIDIA:

chmod +x NVIDIA-Linux-x86_64-331.49.run
./NVIDIA-Linux-x86_64-331.49.run --extract-only

This generated a directory called: NVIDIA-Linux-x86_64-331.49

In NVIDIA-Linux-x86_64-331.49/kernel is a file called dkms.conf. Its contents look like:

PACKAGE_NAME="nvidia"
PACKAGE_VERSION="331.49"
BUILT_MODULE_NAME[0]="$PACKAGE_NAME"
DEST_MODULE_LOCATION[0]="/kernel/drivers/video"
MAKE[0]="make module KERNEL_UNAME=${kernelver}"
CLEAN="make clean"
AUTOINSTALL="yes"

For AMD:

./amd-catalyst-13.12-linux-x86.x86_64.run --extract
This created a directory called fglrx-install.s5PMB2.
Within this directory there is a set of dkms package directories for various Linux distros, for example:

Debian Fedora Mageia Mandriva RedFlag RedHat Slackware SuSE Ubuntu

Within each of these subdirectories are dkms.conf files for the various flavors of each distribution, based on each distribution's type of release - for example, for Ubuntu: precise, hardy, quantal, etc.

The dkms file for Ubuntu Precise, which is what I am using, is called:
fglrx-install.s5PMB2/packages/Ubuntu/dists/precise/dkms.conf.in

Its contents look like:
PACKAGE_NAME="fglrx"
PACKAGE_VERSION="#CVERSION#"
CLEAN="rm -f *.*o"
BUILT_MODULE_NAME[0]="fglrx"
MAKE[0]="cd ${dkms_tree}/fglrx/#CVERSION#/build; sh make.sh --nohints --uname_r=$kernelver --norootcheck"
DEST_MODULE_LOCATION[0]="/kernel/drivers/char/drm"
AUTOINSTALL="yes"
PATCH[0]="rt_preempt_28.patch"
PATCH_MATCH[0]="^2.6.28\-[0-9]*\-rt$"
PATCH[1]="rt_preempt_31.patch"
PATCH_MATCH[1]="^2.6.31\-[0-9]*\-rt$"

It seems to me, based upon what I am seeing here, that it is the hardware vendor's responsibility to provide the dkms hooks for driver installation. Also, in the world of Linux you must have the complete dkms package installed. This is why my install of the 13.12 driver package failed in the 2nd post of this thread: I did not have dkms installed because I did not know I needed it. I do now.
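That lesson can be captured as a small pre-flight check before running a vendor .run installer - a sketch, using only the standard "command -v" test:

```shell
# Sketch: make sure dkms is installed before running a vendor installer,
# otherwise the installer cannot register the module with dkms (the
# failure described above).
if command -v dkms >/dev/null 2>&1; then
    DKMS_OK=yes
else
    DKMS_OK=no
fi
echo "dkms installed: $DKMS_OK   (if no: e.g. sudo apt-get install dkms)"
```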
