Improvements in the code of the clients

Ian&Steve C.
Joined: 19 Jan 20
Posts: 3812
Credit: 37820221827
RAC: 56905018

Also keep in mind the hardware limitations when using ROCm drivers with older hardware; the CPU you have can play a role as well. Pay particular attention to the PCIe generation, PCIe atomics support, and whether the GPU is attached through CPU lanes or through chipset/PLX lanes.

 

Quote:

Hardware and Software Support

ROCm is focused on using AMD GPUs to accelerate computational tasks such as machine learning, engineering workloads, and scientific computing. In order to focus our development efforts on these domains of interest, ROCm supports a targeted set of hardware configurations which are detailed further in this section.

Note: The AMD ROCm™ open software platform is a compute stack for headless system deployments. GUI-based software applications are currently not supported.

 

Supported GPUs

Because the ROCm Platform has a focus on particular computational domains, we offer official support for a selection of AMD GPUs that are designed to offer good performance and price in these domains.

Note: The integrated GPUs of Ryzen are not officially supported targets for ROCm.

ROCm officially supports AMD GPUs that use following chips:

  • GFX9 GPUs
    • "Vega 10" chips, such as on the AMD Radeon RX Vega 64 and Radeon Instinct MI25
    • "Vega 7nm" chips, such as on the Radeon Instinct MI50, Radeon Instinct MI60 or AMD Radeon VII, Radeon Pro VII
  • CDNA GPUs
    • MI100 chips such as on the AMD Instinct™ MI100

ROCm is a collection of software ranging from drivers and runtimes to libraries and developer tools. Some of this software may work with more GPUs than the "officially supported" list above, though AMD does not make any official claims of support for these devices on the ROCm software platform.

The following list of GPUs are enabled in the ROCm software, though full support is not guaranteed:

  • GFX8 GPUs
    • "Polaris 11" chips, such as on the AMD Radeon RX 570 and Radeon Pro WX 4100
    • "Polaris 12" chips, such as on the AMD Radeon RX 550 and Radeon RX 540
  • GFX7 GPUs
    • "Hawaii" chips, such as the AMD Radeon R9 390X and FirePro W9100

As described in the next section, GFX8 GPUs require PCI Express 3.0 (PCIe 3.0) with support for PCIe atomics. This requires both CPU and motherboard support. GFX9 GPUs require PCIe 3.0 with support for PCIe atomics by default, but they can operate in most cases without this capability.

The integrated GPUs in AMD APUs are not officially supported targets for ROCm. As described below, "Carrizo", "Bristol Ridge", and "Raven Ridge" APUs are enabled in our upstream drivers and the ROCm OpenCL runtime. However, they are not enabled in the HIP runtime, and may not work due to motherboard or OEM hardware limitations. As such, they are not yet officially supported targets for ROCm.

For a more detailed list of hardware support, please see the following documentation.

 

Supported CPUs

As described above, GFX8 GPUs require PCIe 3.0 with PCIe atomics in order to run ROCm. In particular, the CPU and every active PCIe point between the CPU and GPU require support for PCIe 3.0 and PCIe atomics. The CPU root must indicate PCIe AtomicOp Completion capabilities and any intermediate switch must indicate PCIe AtomicOp Routing capabilities.

Current CPUs which support PCIe Gen3 + PCIe Atomics are:

  • AMD Ryzen CPUs
  • The CPUs in AMD Ryzen APUs
  • AMD Ryzen Threadripper CPUs
  • AMD EPYC CPUs
  • Intel Xeon E7 v3 or newer CPUs
  • Intel Xeon E5 v3 or newer CPUs
  • Intel Xeon E3 v3 or newer CPUs
  • Intel Core i7 v4, Core i5 v4, Core i3 v4 or newer CPUs (i.e. Haswell family or newer)
  • Some Ivy Bridge-E systems

Beginning with ROCm 1.8, GFX9 GPUs (such as Vega 10) no longer require PCIe atomics. We have similarly opened up more options for number of PCIe lanes. GFX9 GPUs can now be run on CPUs without PCIe atomics and on older PCIe generations, such as PCIe 2.0. This is not supported on GPUs below GFX9, e.g. GFX8 cards in the Fiji and Polaris families.

If you are using any PCIe switches in your system, please note that PCIe Atomics are only supported on some switches, such as Broadcom PLX. When you install your GPUs, make sure you install them in a PCIe 3.1.0 x16, x8, x4, or x1 slot attached either directly to the CPU's Root I/O controller or via a PCIe switch directly attached to the CPU's Root I/O controller.

In our experience, many issues stem from trying to use consumer motherboards which provide physical x16 connectors that are electrically connected as e.g. PCIe 2.0 x4, PCIe slots connected via the Southbridge PCIe I/O controller, or PCIe slots connected through a PCIe switch that does not support PCIe atomics.

If you attempt to run ROCm on a system without proper PCIe atomic support, you may see an error in the kernel log (dmesg):

kfd: skipped device 1002:7300, PCI rejects atomics

Experimental support for our Hawaii (GFX7) GPUs (Radeon R9 290, R9 390, FirePro W9100, S9150, S9170) does not require or take advantage of PCIe Atomics. However, we still recommend that you use a CPU from the list provided above for compatibility purposes.

 

Not supported or limited support under ROCm

 

Limited support

  • ROCm 4.x should support PCIe 2.0 enabled CPUs such as the AMD Opteron, Phenom, Phenom II, Athlon, Athlon X2, Athlon II and older Intel Xeon and Intel Core Architecture and Pentium CPUs. However, we have done very limited testing on these configurations, since our test farm has been catering to CPUs listed above. This is where we need community support. If you find problems on such setups, please report these issues.
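
The kernel-log check described in the quoted docs can be scripted. The following is a minimal sketch, not part of ROCm: the function name kfd_atomics_status is made up for illustration, and it only looks for the exact "PCI rejects atomics" message quoted above, so a clean result does not prove that atomics actually work.

```shell
#!/bin/sh
# Hypothetical helper (not a ROCm tool): read kernel-log text on stdin and
# report whether the "PCI rejects atomics" message from the docs appears.
kfd_atomics_status() {
    if grep -q 'PCI rejects atomics'; then
        echo "atomics-rejected"
    else
        echo "no-rejection-logged"
    fi
}

# Typical use on a live system (reading the kernel log may need root):
#   sudo dmesg | kfd_atomics_status
```

On recent pciutils versions, sudo lspci -vvv should also print AtomicOpsCap and AtomicOpsCtl fields for ports that advertise the capability; treat those field names as an assumption and compare them with your own output.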

_________________________________________________________________________

Wedge009
Joined: 5 Mar 05
Posts: 117
Credit: 16309697000
RAC: 6635314

It's been a while since I gave ROCm a try, but I don't recall having much success with it (at least with BOINC). I imagine it's come a long way since then. The problem I have with amdgpu-pro now is that anything beyond version 20.40 (current release is 21.30) only has ROCr OpenCL implementation for Vega and newer GPUs and I can't get BOINC GPU tasks to work with that (https://community.amd.com/t5/opencl/amdgpu-pro-20-45-rocr-vs-pal-opencl-breaks-boinc-gpu-processing/m-p/484012)

amdgpu-pro is fine for anything older than Vega (--opencl=legacy installation option). How straightforward is it to get ROCm set-up these days? Compared with the old fglrx driver I really like amdgpu-pro for its (generally) just-install-and-it-works results.

If ROCm is the way to go for Vega and newer GPUs, at least I can get out of being stuck on an old kernel.

Soli Deo Gloria

Ian&Steve C.
Joined: 19 Jan 20
Posts: 3812
Credit: 37820221827
RAC: 56905018

Wedge009 wrote:

It's been a while since I gave ROCm a try, but I don't recall having much success with it (at least with BOINC). I imagine it's come a long way since then. The problem I have with amdgpu-pro now is that anything beyond version 20.40 (current release is 21.30) only has ROCr OpenCL implementation for Vega and newer GPUs and I can't get BOINC GPU tasks to work with that (https://community.amd.com/t5/opencl/amdgpu-pro-20-45-rocr-vs-pal-opencl-breaks-boinc-gpu-processing/m-p/484012)

amdgpu-pro is fine for anything older than Vega (--opencl=legacy installation option). How straightforward is it to get ROCm set-up these days? Compared with the old fglrx driver I really like amdgpu-pro for its (generally) just-install-and-it-works results.

If ROCm is the way to go for Vega and newer GPUs, at least I can get out of being stuck on an old kernel.

Here are the install instructions: https://rocmdocs.amd.com/en/latest/Installation_Guide/Installation-Guide.html

Here's an abridged Ubuntu version that I compiled for another user:

Condensed instructions:

Setting Permissions for Groups

sudo usermod -a -G video $LOGNAME

sudo usermod -a -G render $LOGNAME
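
To confirm that the group changes took effect, the membership list can be checked from a script. This is a minimal sketch: list_has_group is a made-up helper name, and the "boinc" account in the usage comment is an assumption about how the BOINC client is run on your system.

```shell
#!/bin/sh
# list_has_group GROUP: read a space-separated group list on stdin and
# succeed (exit status 0) only if GROUP is one of the entries.
list_has_group() {
    tr ' ' '\n' | grep -qxF -- "$1"
}

# Typical use ("boinc" is an assumed service account name):
#   id -nG boinc      | list_has_group render && echo "boinc: render ok"
#   id -nG "$LOGNAME" | list_has_group video  && echo "$LOGNAME: video ok"
```

Note that id -nG with a user name reads the group database, while id -nG without arguments reports the groups of the current session, which only picks up usermod changes after logging out and back in.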

 

Ubuntu

Note: AMD ROCm only supports Long Term Support (LTS) versions of Ubuntu. Versions other than LTS may work with ROCm, however, they are not officially supported.

 

1. Run the following code to ensure that your system is up to date:

sudo apt update

sudo apt dist-upgrade

sudo apt install libnuma-dev

sudo reboot

 

2. Add the ROCm apt repository.

sudo apt install wget gnupg2

wget -q -O - https://repo.radeon.com/rocm/rocm.gpg.key | sudo apt-key add -

echo 'deb [arch=amd64] https://repo.radeon.com/rocm/apt/debian/ ubuntu main' | sudo tee /etc/apt/sources.list.d/rocm.list

 

3. Install the ROCm meta-package. Update the appropriate repository list and install the rocm-dkms meta-package:

sudo apt update

sudo apt install rocm-dkms && sudo reboot

 

4. After restarting the system, run the following commands to verify that the ROCm installation is successful. If you see your GPUs listed by both commands, the installation is considered successful.

/opt/rocm/bin/rocminfo

/opt/rocm/opencl/bin/clinfo
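
The "listed by both commands" check can be turned into a quick count. A minimal sketch, assuming both tools print a "Device Type:" line per device (with "GPU" for rocminfo and "CL_DEVICE_TYPE_GPU" for clinfo); the output format is an assumption and may differ between ROCm versions, and count_gpu_devices is a made-up name.

```shell
#!/bin/sh
# count_gpu_devices: read rocminfo or clinfo output on stdin and print how
# many "Device Type:" lines mention a GPU (prints 0 if there are none).
count_gpu_devices() {
    grep -c 'Device Type:.*GPU' || true
}

# Typical use; both counts should be at least 1 after a good install:
#   /opt/rocm/bin/rocminfo      | count_gpu_devices
#   /opt/rocm/opencl/bin/clinfo | count_gpu_devices
```

If rocminfo reports a GPU but clinfo reports none, the driver itself is probably working and the problem is more likely in how the OpenCL library is being located.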

 

Uninstalling ROCm Packages from Ubuntu

sudo apt autoremove rocm-opencl rocm-dkms rocm-dev rocm-utils && sudo reboot

 

pretty straightforward.

_________________________________________________________________________

Wedge009
Joined: 5 Mar 05
Posts: 117
Credit: 16309697000
RAC: 6635314

Doesn't look like the process has changed since I last tried. I tried again just now with ROCm 4.3 and Vega10 on Ubuntu kernel 5.11.0-27 but it looks like I still can't get BOINC to recognise ROCm.

Both boinc and user accounts are members of the video and render groups.

clinfo doesn't show the GPU but rocminfo does. The only relevant coproc_debug message I see from BOINC is 'ATI: libcalrt.so: cannot open shared object file', which I think is from the detection of the old CAL libraries rather than OpenCL.

The docs say both clinfo and rocminfo should show the GPU in its output for installation to be considered successful, so I suppose I'm not successful? Feels a bit finicky like how things were back with fglrx...

Soli Deo Gloria

Ian&Steve C.
Joined: 19 Jan 20
Posts: 3812
Credit: 37820221827
RAC: 56905018

Tom had this same issue, I think, so I understand the problem. Are you saying that BOINC does not recognize the GPU? And did you completely remove AMDGPU-Pro before installing ROCm?

 

Try this:

Check the /etc/OpenCL/vendors/ directory.

You should find a similar .icd file there as before, but this time named "amdocl64_40200.icd"(*). Is it the only file in this directory? If you're on a fresh install and have only tried the ROCm install, I imagine it's the only file there right now. If there are any other files in this directory, please post their names and contents (if applicable).

Next, check your /opt/rocm/opencl/lib/ directory and verify that the libamdocl64.so file is in there. If not, please let me know.

Open the amdocl64_40200.icd(*) file with nano to edit it:

sudo nano /etc/OpenCL/vendors/amdocl64_40200.icd

 

The contents are likely just "libamdocl64.so".

Change the contents to "/opt/rocm/opencl/lib/libamdocl64.so" (without the quotes).

Press [Ctrl]+[X] to exit; you will be prompted to save. Enter [y], then hit [Enter] to confirm the filename (don't change it), and nano will save and close.

 

Then reboot and retry.

 

 

 

(*) Note: the suffix in the amdocl64_*.icd file name (in the above instructions it's "40200", referring to ROCm v4.2, which is what I have) might be different if ROCm has been updated in the repository. Be sure to check which file names actually exist and adjust the instructions accordingly.
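
The same edit can be made without nano. A minimal sketch, assuming the .icd file holds a single library name on one line as described above; fix_icd is a made-up helper, not something shipped with ROCm, and it leaves the file alone if the path is already absolute or the library cannot be found.

```shell
#!/bin/sh
# fix_icd ICD_FILE LIB_DIR: if ICD_FILE names a library without a path,
# rewrite it to the absolute path under LIB_DIR (only if that file exists).
fix_icd() {
    icd_file=$1
    lib_dir=$2
    lib_name=$(tr -d '[:space:]' < "$icd_file")
    if [ -z "$lib_name" ]; then
        echo "empty: $icd_file" >&2
        return 1
    fi
    case $lib_name in
        /*) echo "already absolute: $lib_name"; return 0 ;;
    esac
    if [ ! -e "$lib_dir/$lib_name" ]; then
        echo "missing: $lib_dir/$lib_name" >&2
        return 1
    fi
    printf '%s\n' "$lib_dir/$lib_name" > "$icd_file"
    echo "updated: $lib_dir/$lib_name"
}

# Typical use, from a root shell (sudo cannot call a shell function directly).
# The glob avoids hard-coding the version suffix:
#   for f in /etc/OpenCL/vendors/amdocl64_*.icd; do
#       cp "$f" "/tmp/$(basename "$f").bak" && fix_icd "$f" /opt/rocm/opencl/lib
#   done
```

A reboot, or at least a restart of the BOINC client, is still needed afterwards so that detection runs again.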

 

_________________________________________________________________________

Wedge009
Joined: 5 Mar 05
Posts: 117
Credit: 16309697000
RAC: 6635314

Odd. Following https://github.com/RadeonOpenCompute/ROCm/issues/1430#issuecomment-809956690, I tried setting the LD_LIBRARY_PATH variable and clinfo now also reports the GPU. Unfortunately, BOINC still can't see it.
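
For reference, the LD_LIBRARY_PATH workaround amounts to putting the ROCm OpenCL library directory on the loader's search path. The sketch below assumes that directory is /opt/rocm/opencl/lib; prepend_lib_dir is a made-up helper name that avoids adding the same directory twice.

```shell
#!/bin/sh
# prepend_lib_dir DIR CURRENT: print the colon-separated list CURRENT with
# DIR added at the front, unless DIR is already in the list.
prepend_lib_dir() {
    case ":$2:" in
        *":$1:"*) printf '%s\n' "$2" ;;
        "::")     printf '%s\n' "$1" ;;
        *)        printf '%s\n' "$1:$2" ;;
    esac
}

# Typical use in an interactive shell (directory is an assumption):
#   LD_LIBRARY_PATH=$(prepend_lib_dir /opt/rocm/opencl/lib "$LD_LIBRARY_PATH")
#   export LD_LIBRARY_PATH
#   /opt/rocm/opencl/bin/clinfo
```

An exported variable only affects processes started from that shell. That is one possible, unconfirmed reason why a BOINC client started as a system service would not benefit from it.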

Edit: Looks like we replied at about the same time. Confirming yes, of course I completely removed amdgpu-pro before installing rocm-dkms. Reading the rest of your suggestions now.

Soli Deo Gloria

Wedge009
Joined: 5 Mar 05
Posts: 117
Credit: 16309697000
RAC: 6635314

Ian&Steve C. wrote:

check the /etc/OpenCL/vendors/ directory.

amdocl64_40300.icd

Ian&Steve C. wrote:

next check your /opt/rocm/opencl/lib/ directory and verify that the libamdocl64.so file is in there. if not please let me know.

Present and globally readable.

Ian&Steve C. wrote:

open the amdocl64_40200.icd(*) file with nano to edit:

sudo nano /etc/OpenCL/vendors/amdocl64_40200.icd

Interesting hack. Makes me wonder why the default installation doesn't have this already, unless it's something that works as-is on other distributions. At any rate clinfo reads the GPU and BOINC can see it as well.

Unfortunately my experience seems to be the same as for the ROCr-based OpenCL implementation within amdgpu-pro: BOINC thinks the task is processing but it's actually stalled. GPU usage is practically zero (seen through radeontop) and GPU temperature is consistent with idle operation. At this point it looks like I'll have to revert to PAL-based OpenCL from amdgpu-pro 20.40.
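
The stall can also be confirmed without radeontop. A minimal sketch, assuming the amdgpu kernel driver exposes a gpu_busy_percent file under /sys/class/drm/card0/device (the card index is an assumption and varies between systems); avg_busy is a made-up helper name.

```shell
#!/bin/sh
# avg_busy: read one utilisation percentage per line on stdin and print the
# integer average (nothing is printed for empty input).
avg_busy() {
    awk '{ sum += $1; n++ } END { if (n) printf "%d\n", sum / n }'
}

# Typical use: sample the counter once a second for ten seconds.
#   f=/sys/class/drm/card0/device/gpu_busy_percent   # assumed path
#   for i in 1 2 3 4 5 6 7 8 9 10; do cat "$f"; sleep 1; done | avg_busy
```

An average near zero while BOINC reports a running GPU task would match the stalled behaviour described above.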

Soli Deo Gloria

Ian&Steve C.
Joined: 19 Jan 20
Posts: 3812
Credit: 37820221827
RAC: 56905018

Wedge009 wrote:

Interesting hack. Makes me wonder why the default installation doesn't have this already, unless it's something that works as-is on other distributions. At any rate clinfo reads the GPU and BOINC can see it as well.

I think it may be an issue with BOINC's detection. You can achieve the same result with a symbolic link, but this way is more direct. Tom had this same problem on his 5700 Navi card with Ubuntu and our all-in-one BOINC instance (I didn't with my RX570, and found the solution while googling for him). Interestingly, he doesn't have the problem if he uses the Ubuntu repository version of BOINC; only the standalone version seemed to be affected. So I think it's something in BOINC's detection that might have changed in later versions, but that's just a hunch right now. It could certainly vary between OSes too.
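
The symbolic-link alternative is not spelled out in this thread. One way it might look, as a sketch under stated assumptions: the link goes into a directory the dynamic loader already searches (the /usr/lib/x86_64-linux-gnu path in the usage comment is a guess for 64-bit Ubuntu, not something confirmed here), and link_lib is a made-up helper name.

```shell
#!/bin/sh
# link_lib SRC DEST_DIR: create DEST_DIR/<basename of SRC> as a symlink to
# SRC, refusing to overwrite anything that already exists there.
link_lib() {
    src=$1
    dest=$2/$(basename "$1")
    if [ ! -e "$src" ]; then
        echo "missing: $src" >&2
        return 1
    fi
    if [ -e "$dest" ] || [ -L "$dest" ]; then
        echo "exists: $dest" >&2
        return 1
    fi
    ln -s "$src" "$dest" && echo "linked: $dest -> $src"
}

# Typical use, from a root shell (destination directory is an assumption):
#   link_lib /opt/rocm/opencl/lib/libamdocl64.so /usr/lib/x86_64-linux-gnu
```

Running ldconfig afterwards may be necessary before the loader picks up the new link.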

 

Wedge009 wrote:
Unfortunately my experience seems to be the same as for the ROCr-based OpenCL implementation within amdgpu-pro: BOINC thinks the task is processing but it's actually stalled. GPU usage is practically zero (seen through radeontop) and GPU temperature is consistent with idle operation. At this point it looks like I'll have to revert to PAL-based OpenCL from amdgpu-pro 20.40.

Driver support seems not great for the Radeon VII and Vega cards, unfortunately. At least we tried. It doesn't sound like you're missing out on anything, though: I've only seen speedups on Navi and newer, and I have no Vegas available to test. My Polaris card runs marginally slower with the new app, but that was the case with my pre-release code as well. Tom saw ~20% improvement on his 5700 with pre-release code, though. I'm not sure how his 5700 is doing with the new app yet.

edit: Tom's 5700s seem to be handling the new app OK. <600s @ 2x tasks

https://einsteinathome.org/host/12896211/tasks/2/0?sort=desc&order=Sent

_________________________________________________________________________

Wedge009
Joined: 5 Mar 05
Posts: 117
Credit: 16309697000
RAC: 6635314

Ian&Steve C. wrote:

...but interestingly enough he doesn't have a problem if he uses the Ubuntu repository version of BOINC.

A symbolic link is probably more elegant, but whatever works. It may be the case that something changed in more recent versions rather than specifically being an issue on repository vs stand-alone execution - I'm using costa's PPA that's maintained fairly frequently.

 

Ian&Steve C. wrote:

driver support seems not great for the Radeon VII and Vega cards unfortunately.

It does look that way for Linux (even with PAL-based OpenCL, tasks consistently run around 20 seconds slower on Linux than on Windows for Vega20).

I was concerned about your observed slow-down on Polaris. I don't have anything RDNA-based other than the discrete GPU in a laptop.

My main hope was that ROCm would allow me to move off kernel 5.4. Until and unless BOINC can run on ROCm/ROCr with Vega, I'll be stuck on that kernel...

Anyway, appreciate your assistance.

Soli Deo Gloria

Ian&Steve C.
Joined: 19 Jan 20
Posts: 3812
Credit: 37820221827
RAC: 56905018

Hey Wedge, can you try again, but on your dual Radeon VII Threadripper system?

 

Also, can you tell me more about the Vega 10 system you tried: what motherboard is it, and which slot on the board is the Vega plugged into (are any risers being used)? I see the CPU is an FX-8350, which only supports PCIe 2.0 and, from what I can tell, routes all lanes through the chipset rather than directly to the CPU as newer architectures do.

The ROCm drivers have some fairly complicated interactions with the PCIe bus and require particular CPU support. According to the docs, PCIe 2.0 is supported for GFX9 (your Vega card falls here) and PCIe atomics are no longer needed, but I wonder if the specific combination of hardware you're using ends up being unsupported for some other fringe reason on that old platform, possibly related to how the PCIe lanes connect to the CPU.

I wonder if you'll have better success with the Threadripper system.
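
To see what link the GPU actually negotiated, recent Linux kernels expose current_link_speed and current_link_width under each PCI device in sysfs. The sketch below maps the speed string to a PCIe generation; pcie_gen is a made-up helper and the device address in the usage comment is a placeholder, not a real value from this thread.

```shell
#!/bin/sh
# pcie_gen SPEED: map a sysfs link-speed string such as "8.0 GT/s PCIe" or
# "5 GT/s" to a PCIe generation number, or "unknown" if it is not recognised.
pcie_gen() {
    case $1 in
        2.5*) echo 1 ;;
        5*)   echo 2 ;;
        8*)   echo 3 ;;
        16*)  echo 4 ;;
        32*)  echo 5 ;;
        *)    echo unknown ;;
    esac
}

# Typical use (replace the placeholder address with the GPU's, from lspci):
#   dev=/sys/bus/pci/devices/0000:0a:00.0
#   pcie_gen "$(cat "$dev/current_link_speed")"
#   cat "$dev/current_link_width"
```

Some GPUs lower their link speed when idle, so it is worth comparing against max_link_speed in the same directory or reading the value while a task is running.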

_________________________________________________________________________
