Troubleshooting multiple-GPU setups that use riser cards

Tom M
Joined: 2 Feb 06
Posts: 6,278
Credit: 8,991,127,612
RAC: 11,834,052

Ian&Steve C. wrote:

Tom M wrote:

I had an EPYC running 4 rx 5700's reliably.  But I also had a "spare" rx 5700.

I am now down to 0 gpus that process through flat ribbon cables.  I had 3 rx 5700's that were working.  But they now are all stalling/not using cpu cycles after X time.

Will start by adding one gpu next.

Tom M

 

did you make any software changes between 4x GPUs working and now nothing working?

 

try 1 GPU on 1 riser.

if it still gives issues, change the riser (only) to another one.

if it gives issues with all risers, try plugging the single GPU directly into the motherboard.

if it still gives issues, try a different slot.

if it still gives issues, try a different GPU, and repeat.

if it still gives issues, try a different PSU or different power connections.

changing one thing at a time can help you isolate where the problem is.

I have replicated the 3 gpu's on risers processing VERY slowly (about 1.25 hours per task) on an rx 5700 plugged directly into the motherboard.

It looks like my GPU driver setup has gotten scrambled.  Some of the messages that are being displayed when I tried to uninstall the drivers complain about the ROCR id codes/stuff being incorrect.

I am going to review the original install how-to and see if I can uninstall everything.

Otherwise it looks like I will need to back up BOINC and install Ubuntu 20 again.

Tom M

A Proud member of the O.F.A.  (Old Farts Association).  Be well, do good work, and keep in touch.® (Garrison Keillor)

Ian&Steve C.
Joined: 19 Jan 20
Posts: 3,914
Credit: 44,106,769,309
RAC: 63,644,677

Tom M wrote:

I have replicated the 3 gpu's on risers processing VERY slowly (about 1.25 hours per task) on an rx 5700 plugged directly into the motherboard.

 

does the card plugged directly to the motherboard also process very slowly? 

_________________________________________________________________________

Tom M

Yes

Tom M

Ian&Steve C. wrote:

Tom M wrote:

I have replicated the 3 gpu's on risers processing VERY slowly (about 1.25 hours per task) on an rx 5700 plugged directly into the motherboard.

does the card plugged directly to the motherboard also process very slowly? 

Yes.

I went through the re-install of Ubuntu 20 and the AIO (unpacked over the top of the BOINC directory to save the BOINC ID).

I have run through my notes:

To get the AIO to recognize RX 5700s you need this:

Must install the AMD GPU drivers without legacy

amdgpu-pro-uninstall
./amdgpu-pro-install -y --opencl=rocr --headless

sudo ln -s /opt/amdgpu-pro/lib/x86_64-linux-gnu/libamdocl64.so /etc/OpenCL/vendors/libamdocl64.so

sudo apt-get install ocl-icd-libopencl1

reboot.


Confirmed from the command line (using cd and dir) that both "files" appear to be present where the file paths point.
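
A minimal sketch of that check, assuming a POSIX shell with GNU `readlink`; the helper name `check_link` is made up for illustration. It only confirms that a symlink resolves to an existing file, not that the OpenCL loader will actually pick it up.

```shell
# check_link PATH: succeed only if PATH is a symlink whose target exists.
check_link() {
    if [ ! -L "$1" ]; then
        echo "not a symlink: $1"
        return 1
    fi
    # -e follows the link, so it fails for a dangling symlink.
    if [ ! -e "$1" ]; then
        echo "dangling symlink: $1 -> $(readlink "$1")"
        return 1
    fi
    echo "ok: $1 -> $(readlink -f "$1")"
}
```

For the link from the notes above, the call would be `check_link /etc/OpenCL/vendors/libamdocl64.so`.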

But it still isn't recognizing the GPUs. What mistake did I make?

Fri 27 Aug 2021 07:54:21 PM CDT |  | Starting BOINC client version 7.16.5 for x86_64-pc-linux-gnu
Fri 27 Aug 2021 07:54:21 PM CDT |  | log flags: file_xfer, sched_ops, task, sched_op_debug
Fri 27 Aug 2021 07:54:21 PM CDT |  | Libraries: libcurl/7.68.0 GnuTLS/3.6.13 zlib/1.2.11 brotli/1.0.7 libidn2/2.2.0 libpsl/0.21.0 (+libidn2/2.2.0) libssh/0.9.3/openssl/zlib nghttp2/1.40.0 librtmp/2.3
Fri 27 Aug 2021 07:54:21 PM CDT |  | Data directory: /home/tom/Desktop/BOINC
Fri 27 Aug 2021 07:54:25 PM CDT |  | No usable GPUs found
Fri 27 Aug 2021 07:54:25 PM CDT |  | libc: Ubuntu GLIBC 2.31-0ubuntu9.2 version 2.31
Fri 27 Aug 2021 07:54:25 PM CDT |  | Host name: EPYC-Moonshot
Fri 27 Aug 2021 07:54:25 PM CDT |  | Processor: 48 AuthenticAMD AMD EPYC 7401P 24-Core Processor [Family 23 Model 1 Stepping 2]
Fri 27 Aug 2021 07:54:25 PM CDT |  | Processor features: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid amd_dcm aperfmperf pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb hw_pstate sme ssbd sev ibpb vmmcall sev_es fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt sha_ni xsaveopt xsavec xgetbv1 xsaves clzero irperf xsaveerptr arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif overflow_recov succor smca
Fri 27 Aug 2021 07:54:25 PM CDT |  | OS: Linux Ubuntu: Ubuntu 20.04.3 LTS [5.11.0-27-generic|libc 2.31 (Ubuntu GLIBC 2.31-0ubuntu9.2)]
Fri 27 Aug 2021 07:54:25 PM CDT |  | Memory: 62.79 GB physical, 2.00 GB virtual
Fri 27 Aug 2021 07:54:25 PM CDT |  | Disk: 915.40 GB total, 856.50 GB free
Fri 27 Aug 2021 07:54:25 PM CDT |  | Local time is UTC -5 hours
Fri 27 Aug 2021 07:54:25 PM CDT | Einstein@Home | Found app_config.xml
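
The line that matters in a startup log like the one above is "No usable GPUs found". A small sketch for checking a saved copy of the messages; `no_gpus_reported` is a hypothetical helper name, and the log file name in the usage note is an assumption about a default BOINC install that may not match the AIO layout.

```shell
# no_gpus_reported: read BOINC startup messages on stdin and succeed
# if the client logged that it found no usable GPUs.
no_gpus_reported() {
    grep -q 'No usable GPUs found'
}
```

Possible usage, if the client writes its messages to `stdoutdae.txt` in the data directory: `no_gpus_reported < stdoutdae.txt && echo "BOINC sees no GPUs"`.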

 

Ian&Steve C.

Not sure. What does clinfo report? 
 

try removing the AMDGPU-pro drivers and do the ROCm install instead. 

_________________________________________________________________________

Tom M

Ian&Steve C. wrote:

Not sure. What does clinfo report? 
 

try removing the AMDGPU-pro drivers and do the ROCm install instead. 

On the off chance that the brand new version of the GPU drivers was the issue, I uninstalled them and installed the version that I probably had installed previously. I also removed the symbolic link.

rebooting here and there.

And still didn't work.

Clinfo:

tom@EPYC-Moonshot:~$ clinfo
Number of platforms                               1
  Platform Name                                   AMD Accelerated Parallel Processing
  Platform Vendor                                 Advanced Micro Devices, Inc.
  Platform Version                                OpenCL 2.0 AMD-APP (3261.0)
  Platform Profile                                FULL_PROFILE
  Platform Extensions                             cl_khr_icd cl_amd_event_callback
  Platform Extensions function suffix             AMD

  Platform Name                                   AMD Accelerated Parallel Processing
Number of devices                                 0

NULL platform behavior
  clGetPlatformInfo(NULL, CL_PLATFORM_NAME, ...)  No platform
  clGetDeviceIDs(NULL, CL_DEVICE_TYPE_ALL, ...)   No platform
  clCreateContext(NULL, ...) [default]            No platform
  clCreateContext(NULL, ...) [other]              No platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_DEFAULT)  No devices found in platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_CPU)  No devices found in platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_GPU)  No devices found in platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_ACCELERATOR)  No devices found in platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_CUSTOM)  No devices found in platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_ALL)  No devices found in platform
tom@EPYC-Moonshot:~$
tom@EPYC-Moonshot:~$ sudo lshw -C display
[sudo] password for tom:
  *-display                 
       description: VGA compatible controller
       product: ASPEED Graphics Family
       vendor: ASPEED Technology, Inc.
       physical id: 0
       bus info: pci@0000:02:00.0
       version: 41
       width: 32 bits
       clock: 33MHz
       capabilities: pm msi vga_controller bus_master cap_list rom
       configuration: driver=ast latency=0
       resources: irq:28 memory:ee000000-eeffffff memory:ef000000-ef01ffff ioport:1000(size=128) memory:c0000-dffff
  *-display
       description: VGA compatible controller
       product: Navi 10 [Radeon RX 5600 OEM/5600 XT / 5700/5700 XT]
       vendor: Advanced Micro Devices, Inc. [AMD/ATI]
       physical id: 0
       bus info: pci@0000:23:00.0
       logical name: /dev/fb0
       version: c4
       width: 64 bits
       clock: 33MHz
       capabilities: pm pciexpress msi vga_controller bus_master cap_list rom fb
       configuration: depth=32 driver=amdgpu latency=0 mode=1360x768 visual=truecolor xres=1360 yres=768
       resources: iomemory:1140-113f iomemory:1160-115f irq:69 memory:11400000000-115ffffffff memory:11600000000-116001fffff ioport:2000(size=256) memory:edd00000-edd7ffff memory:edd80000-edd9ffff
tom@EPYC-Moonshot:~$
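
The clinfo output above shows one platform but a device count of 0, while lshw shows the Navi 10 card bound to the amdgpu kernel driver, so the kernel sees the card but the OpenCL runtime does not expose it. A rough sketch for pulling that device count out of clinfo output, assuming the "Number of devices" line format shown above; `count_cl_devices` is a hypothetical helper name.

```shell
# count_cl_devices: read clinfo output on stdin and print the total
# number of OpenCL devices, summed over all platforms.
count_cl_devices() {
    awk '/^Number of devices/ { total += $NF } END { print total + 0 }'
}
```

Possible usage: `clinfo | count_cl_devices`. A result of 0 with the card still visible in lshw points at the OpenCL driver install rather than at the riser or the slot.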

==========

I have lost my notes for the ROCm install, and I didn't see the command lines for it when I looked around in the Redux forum.  Could you either point me to the right message or post it again?

Thank you.

Tom M

Ian&Steve C.

Tom M wrote:

Ian&Steve C. wrote:

Not sure. What does clinfo report? 
 

try removing the AMDGPU-pro drivers and do the ROCm install instead. 

On the off chance that the brand new version of the GPU drivers was the issue, I uninstalled them and installed the version that I probably had installed previously. I also removed the symbolic link.

rebooting here and there.

And still didn't work.

Did you re-apply the symlink after re-installing the different drivers? 
 

stand by for instructions 

_________________________________________________________________________

Tom M

Ian&Steve C. wrote:

Tom M wrote:

Ian&Steve C. wrote:

Not sure. What does clinfo report? 
 

try removing the AMDGPU-pro drivers and do the ROCm install instead. 

On the off chance that the brand new version of the GPU drivers was the issue, I uninstalled them and installed the version that I probably had installed previously. I also removed the symbolic link.

rebooting here and there.

And still didn't work.

Did you re-apply the symlink after re-installing the different drivers? 
 

stand by for instructions 

Yes.

Think I just found instructions.

 

Tom M

Tom M wrote:

Think I just found instructions.

Setting Permissions for Groups

This section provides steps to add any current user to a video group to access GPU resources.

  1. Issue the following command to check the groups in your system:

groups

  2. Add yourself to the video group using the following instruction:

sudo usermod -a -G video $LOGNAME

For all ROCm supported operating systems, continue to use video group. By default, you can add any future users to the video and render groups.

Note: render group is required only for Ubuntu v20.04.

  3. To add future users to the video and render groups, run the following command:

echo 'ADD_EXTRA_GROUPS=1' | sudo tee -a /etc/adduser.conf

echo 'EXTRA_GROUPS=video' | sudo tee -a /etc/adduser.conf

echo 'EXTRA_GROUPS=render' | sudo tee -a /etc/adduser.conf
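
A short sketch for confirming the group membership described above, assuming a POSIX shell and the standard `id` utility; `has_group` is a hypothetical helper name. Group changes made with `usermod` only apply to new login sessions, so the check is most useful after logging out and back in.

```shell
# has_group GROUP LIST: succeed if GROUP appears as a whole word in the
# space-separated LIST (for example the output of `id -nG`).
has_group() {
    case " $2 " in
        *" $1 "*) return 0 ;;
        *) return 1 ;;
    esac
}

# Report membership of the current user in the two groups named above.
for g in video render; do
    if has_group "$g" "$(id -nG)"; then
        echo "$g: member"
    else
        echo "$g: NOT a member (log out and back in after usermod)"
    fi
done
```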




Supported Operating Systems

Ubuntu

Note: AMD ROCm only supports Long Term Support (LTS) versions of Ubuntu. Versions other than LTS may work with ROCm, however, they are not officially supported.

Installing a ROCm Package from a Debian Repository

To install from a Debian Repository:

  1. Run the following code to ensure that your system is up to date:

sudo apt update

sudo apt dist-upgrade

sudo apt install libnuma-dev

sudo reboot



  2. Add the ROCm apt repository.

For Debian-based systems like Ubuntu, configure the Debian ROCm repository as follows:

sudo apt install wget gnupg2

wget -q -O - https://repo.radeon.com/rocm/rocm.gpg.key | sudo apt-key add -

echo 'deb [arch=amd64] https://repo.radeon.com/rocm/apt/debian/ ubuntu main' | sudo tee /etc/apt/sources.list.d/rocm.list



Note: For ROCm v4.1 and lower, use ‘xenial main’ as shown below

wget -q -O - https://repo.radeon.com/rocm/rocm.gpg.key | sudo apt-key add -

echo 'deb [arch=amd64] https://repo.radeon.com/rocm/apt/4.1/ xenial main' | sudo tee /etc/apt/sources.list.d/rocm.list



Note: For developer systems or Docker containers (where it could be beneficial to use a fixed ROCm version), select a versioned repository from:

https://repo.radeon.com/rocm/apt/

The gpg key may change; ensure it is updated when installing a new release. If the key signature verification fails while updating, re-add the key from the ROCm apt repository.

wget -q -O - https://repo.radeon.com/rocm/rocm.gpg.key | sudo apt-key add -

The current rocm.gpg.key is not available in a standard key ring distribution, but has the following sha1sum hash:

777947b2579611bf4d377687b5013c69642c5762 rocm.gpg.key

  3. Install the ROCm meta-package. Update the appropriate repository list and install the rocm-dkms meta-package:

sudo apt update

sudo apt install rocm-dkms && sudo reboot



  4. Restart the system.

  5. After restarting the system, run the following commands to verify that the ROCm installation is successful. If you see your GPUs listed by both commands, the installation is considered successful.

/opt/rocm/bin/rocminfo
/opt/rocm/opencl/bin/clinfo
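
The key note in the steps above quotes a sha1sum for rocm.gpg.key. A sketch of checking a downloaded copy against it before adding the key, assuming GNU `sha1sum` is available; `verify_sha1` is a hypothetical helper name, and since the instructions say the key may change, the quoted hash may no longer match the current file.

```shell
# verify_sha1 FILE EXPECTED: succeed if FILE's SHA-1 digest equals EXPECTED.
verify_sha1() {
    actual=$(sha1sum "$1" | awk '{ print $1 }')
    [ "$actual" = "$2" ]
}

# Possible usage (not run here): save the key to a file first, then add it
# only if the digest matches the value quoted in the instructions.
#   wget -q -O rocm.gpg.key https://repo.radeon.com/rocm/rocm.gpg.key
#   verify_sha1 rocm.gpg.key 777947b2579611bf4d377687b5013c69642c5762 \
#       && sudo apt-key add rocm.gpg.key
```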

Ian&Steve C.

copy/paste from page 6 of the redux thread: 

the install is fairly easy and you do not even have to pre-download any package, it pulls it from the rocm repository.

follow the instructions here: https://rocmdocs.amd.com/en/latest/Installation_Guide/Installation-Guide.html

Condensed instructions:

Setting Permissions for Groups

sudo usermod -a -G video $LOGNAME

sudo usermod -a -G render $LOGNAME

 

Ubuntu

Note: AMD ROCm only supports Long Term Support (LTS) versions of Ubuntu. Versions other than LTS may work with ROCm, however, they are not officially supported.

 

1. Run the following code to ensure that your system is up to date:

sudo apt update

sudo apt dist-upgrade

sudo apt install libnuma-dev

sudo reboot

 

2. Add the ROCm apt repository.

sudo apt install wget gnupg2

wget -q -O - https://repo.radeon.com/rocm/rocm.gpg.key | sudo apt-key add -

echo 'deb [arch=amd64] https://repo.radeon.com/rocm/apt/debian/ ubuntu main' | sudo tee /etc/apt/sources.list.d/rocm.list

 

3. Install the ROCm meta-package. Update the appropriate repository list and install the rocm-dkms meta-package:

sudo apt update

sudo apt install rocm-dkms && sudo reboot

 

4. After restarting the system, run the following commands to verify that the ROCm installation is successful. If you see your GPUs listed by both commands, the installation is considered successful.

/opt/rocm/bin/rocminfo

/opt/rocm/opencl/bin/clinfo
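
A rough sketch for the verification step, assuming rocminfo names each GPU agent with a gfx-prefixed target (an assumption about its output format, which is not shown in this thread); `list_gfx_targets` is a hypothetical helper name.

```shell
# list_gfx_targets: read rocminfo output on stdin and print each distinct
# gfx target name once, one per line. Prints nothing if none are found.
list_gfx_targets() {
    grep -oE 'gfx[0-9a-f]+' | sort -u
}
```

Possible usage: `/opt/rocm/bin/rocminfo | list_gfx_targets`. Empty output would suggest that ROCm does not see any GPU, in which case clinfo is unlikely to list one either.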

 

Uninstalling ROCm Packages from Ubuntu

sudo apt autoremove rocm-opencl rocm-dkms rocm-dev rocm-utils && sudo reboot


 

_________________________________________________________________________
