I had an EPYC running 4 RX 5700s reliably, and I also had a "spare" RX 5700.
I am now down to 0 GPUs that process through flat ribbon cables. I had 3 RX 5700s that were working, but they are now all stalling/not using CPU cycles after X time.
I will start by adding one GPU next.
Tom M
did you make any software changes between 4x GPUs working and now nothing working?
try 1 GPU on 1 riser.
if it still gives issues, change the riser (only) to another one.
if it gives issues with all risers, try plugging the single GPU directly into the motherboard.
if it still gives issues, try a different slot.
if it still gives issues, try a different GPU, and repeat.
if it still gives issues, try a different PSU or different power connections.
changing one thing at a time can help you isolate where the problem is.
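One way to keep the "one change at a time" procedure honest is to write down each trial as it happens. The sketch below is only an illustration: the helper names log_trial and working_trials, the tab-separated log layout, and the "ok"/"stall" result labels are all assumptions, not anything from the thread.

```shell
#!/usr/bin/env bash
# Hypothetical helper: append one tab-separated line per hardware trial.
# Arguments: LOGFILE, a description of the single thing changed, the result.
log_trial() {
    local logfile=$1 change=$2 result=$3
    printf '%s\t%s\t%s\n' "$(date '+%Y-%m-%d %H:%M')" "$change" "$result" >> "$logfile"
}

# Print the description of every trial whose result was recorded as "ok".
working_trials() {
    awk -F '\t' '$3 == "ok" { print $2 }' "$1"
}

# Example use (paths and labels are made up):
#   log_trial ~/gpu-trials.tsv "GPU A on riser 1" "stall"
#   log_trial ~/gpu-trials.tsv "GPU A in slot 1, no riser" "ok"
#   working_trials ~/gpu-trials.tsv
```

Reading the log back after a few swaps makes it easier to see which single change separated a working configuration from a failing one.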
I have replicated the VERY slow processing seen with the 3 GPUs on risers (about 1.25 hours per task) on an RX 5700 plugged directly into the motherboard.
It looks like my GPU driver setup has gotten scrambled. Some of the messages displayed when I tried to uninstall the drivers complain about the ROCR id codes/stuff being incorrect.
I am going to review the original install how-to and see if I can uninstall everything.
Otherwise it looks like I back up BOINC and install Ubuntu 20 again.
Tom M
A Proud member of the O.F.A. (Old Farts Association). Be well, do good work, and keep in touch.® (Garrison Keillor)
try removing the AMDGPU-pro drivers and do the ROCm install instead.
On the off chance that the brand new version of the GPU drivers was the issue, I uninstalled them and used the version that I probably installed previously. I also removed the symbolic link,
rebooting here and there.
It still didn't work.
Clinfo:
tom@EPYC-Moonshot:~$ clinfo
Number of platforms 1
Platform Name AMD Accelerated Parallel Processing
Platform Vendor Advanced Micro Devices, Inc.
Platform Version OpenCL 2.0 AMD-APP (3261.0)
Platform Profile FULL_PROFILE
Platform Extensions cl_khr_icd cl_amd_event_callback
Platform Extensions function suffix AMD
Platform Name AMD Accelerated Parallel Processing
Number of devices 0
NULL platform behavior
clGetPlatformInfo(NULL, CL_PLATFORM_NAME, ...) No platform
clGetDeviceIDs(NULL, CL_DEVICE_TYPE_ALL, ...) No platform
clCreateContext(NULL, ...) [default] No platform
clCreateContext(NULL, ...) [other] No platform
clCreateContextFromType(NULL, CL_DEVICE_TYPE_DEFAULT) No devices found in platform
clCreateContextFromType(NULL, CL_DEVICE_TYPE_CPU) No devices found in platform
clCreateContextFromType(NULL, CL_DEVICE_TYPE_GPU) No devices found in platform
clCreateContextFromType(NULL, CL_DEVICE_TYPE_ACCELERATOR) No devices found in platform
clCreateContextFromType(NULL, CL_DEVICE_TYPE_CUSTOM) No devices found in platform
clCreateContextFromType(NULL, CL_DEVICE_TYPE_ALL) No devices found in platform
tom@EPYC-Moonshot:~$
tom@EPYC-Moonshot:~$ sudo lshw -C display
[sudo] password for tom:
*-display
description: VGA compatible controller
product: ASPEED Graphics Family
vendor: ASPEED Technology, Inc.
physical id: 0
bus info: pci@0000:02:00.0
version: 41
width: 32 bits
clock: 33MHz
capabilities: pm msi vga_controller bus_master cap_list rom
configuration: driver=ast latency=0
resources: irq:28 memory:ee000000-eeffffff memory:ef000000-ef01ffff ioport:1000(size=128) memory:c0000-dffff
*-display
description: VGA compatible controller
product: Navi 10 [Radeon RX 5600 OEM/5600 XT / 5700/5700 XT]
vendor: Advanced Micro Devices, Inc. [AMD/ATI]
physical id: 0
bus info: pci@0000:23:00.0
logical name: /dev/fb0
version: c4
width: 64 bits
clock: 33MHz
capabilities: pm pciexpress msi vga_controller bus_master cap_list rom fb
configuration: depth=32 driver=amdgpu latency=0 mode=1360x768 visual=truecolor xres=1360 yres=768
resources: iomemory:1140-113f iomemory:1160-115f irq:69 memory:11400000000-115ffffffff memory:11600000000-116001fffff ioport:2000(size=256) memory:edd00000-edd7ffff memory:edd80000-edd9ffff
tom@EPYC-Moonshot:~$
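In the output above, lshw reports driver=amdgpu for the Navi 10 card while clinfo reports 0 devices, which may mean the kernel driver is loaded but the OpenCL layer cannot see the card. A minimal sketch for pulling those two facts out of the command output follows; the function names clinfo_device_count and lshw_drivers are made up for this example, and both read the command output on standard input.

```shell
#!/usr/bin/env bash
# Sum the "Number of devices" lines from clinfo output read on stdin.
# Prints 0 when no such line is present.
clinfo_device_count() {
    awk '/^ *Number of devices/ { total += $NF } END { print total + 0 }'
}

# Print the kernel driver named on each "configuration:" line of
# `lshw -C display` output read on stdin (for example: ast, amdgpu).
lshw_drivers() {
    grep -o 'driver=[^ ]*' | cut -d= -f2
}

# Example use on a live system:
#   clinfo | clinfo_device_count
#   sudo lshw -C display | lshw_drivers
```

If the driver list includes amdgpu but the device count is 0, that points the search toward the OpenCL install (symlink, runtime packages, group permissions) rather than risers or slots.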
==========
I have lost my notes for the ROCm install, and I didn't see the command line for it when I looked around in the Redux forum. Could you either point me to the right message or post it again?
Thank you.
Tom M
Did you re-apply the symlink after re-installing the different drivers?
standby for instructions
Yes.
Think I just found instructions.
This section provides steps to add any current user to a video group to access GPU resources.
Issue the following command to check the groups in your system:
groups
Add yourself to the video group using the following instruction:
sudo usermod -a -G video $LOGNAME
For all ROCm supported operating systems, continue to use video group. By default, you can add any future users to the video and render groups.
Note: render group is required only for Ubuntu v20.04.
To add future users to the video and render groups, run the following commands:
echo 'ADD_EXTRA_GROUPS=1' | sudo tee -a /etc/adduser.conf
echo 'EXTRA_GROUPS="video render"' | sudo tee -a /etc/adduser.conf
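To confirm the group change actually applies to the current login, the group list can be checked directly. This is a small sketch; the helper name in_group is hypothetical, and it takes the space-separated group list as an argument so it can be fed from `id -nG`.

```shell
#!/usr/bin/env bash
# Succeeds if group $1 appears as a whole word in the space-separated list $2.
in_group() {
    case " $2 " in
        *" $1 "*) return 0 ;;
        *) return 1 ;;
    esac
}

# Example use: report any of the two groups the current session lacks.
#   for g in video render; do
#       in_group "$g" "$(id -nG)" || echo "missing group: $g"
#   done
```

Group membership added with usermod only shows up in new login sessions, so a check run before logging out and back in (or rebooting) can still report the group as missing.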
Supported Operating Systems
Ubuntu
Note: AMD ROCm only supports Long Term Support (LTS) versions of Ubuntu. Versions other than LTS may work with ROCm, however, they are not officially supported.
Installing a ROCm Package from a Debian Repository
To install from a Debian Repository:
Run the following code to ensure that your system is up to date:
sudo apt update
sudo apt dist-upgrade
sudo apt install libnuma-dev
sudo reboot
Add the ROCm apt repository.
For Debian-based systems like Ubuntu, configure the Debian ROCm repository as follows:
The gpg key may change; ensure it is updated when installing a new release. If the key signature verification fails while updating, re-add the key from the ROCm apt repository.
Install the ROCm meta-package. Update the appropriate repository list and install the rocm-dkms meta-package:
sudo apt update
sudo apt install rocm-dkms && sudo reboot
Restart the system.
After restarting the system, run the following commands to verify that the ROCm installation is successful. If you see your GPUs listed by both commands, the installation is considered successful.
does the card plugged directly to the motherboard also process very slowly?
_________________________________________________________________________
Yes.
I went through the re-install of Ubuntu 20 and the AIO (unpacked over the top of the BOINC directory to save the BOINC ID).
I have run through my notes:
To get the AIO to recognize RX 5700s you need this:
Must install the AMD GPU drivers without legacy
amdgpu-pro-uninstall
./amdgpu-pro-install -y --opencl=rocr --headless
sudo ln -s /opt/amdgpu-pro/lib/x86_64-linux-gnu/libamdocl64.so /etc/OpenCL/vendors/libamdocl64.so
sudo apt-get install ocl-icd-libopencl1
reboot.
I confirmed via the command line (cd and dir) that both "files" seem to be present where the file paths point.
But it still isn't recognizing the GPUs. What mistake did I make?
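Listing a directory shows that a symlink exists but not whether its target does. A short sketch of a stricter check follows; the function name link_ok is made up for this example, and the path in the usage comment is the one from the notes above.

```shell
#!/usr/bin/env bash
# Succeeds only if $1 is a symbolic link AND the file it points to exists.
# A dangling link or a regular file both fail the check.
link_ok() {
    [ -L "$1" ] && [ -e "$1" ]
}

# Example use:
#   if link_ok /etc/OpenCL/vendors/libamdocl64.so; then
#       readlink -f /etc/OpenCL/vendors/libamdocl64.so
#   else
#       echo "symlink missing or dangling"
#   fi
```

This only verifies the link itself; it says nothing about whether the OpenCL loader accepts what it finds in that directory.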
Not sure. What does clinfo report?
try removing the AMDGPU-pro drivers and do the ROCm install instead.
_________________________________________________________________________
copy/paste from page 6 of the redux thread:
the install is fairly easy and you do not even have to pre-download any package; it pulls it from the ROCm repository.
follow the instructions here: https://rocmdocs.amd.com/en/latest/Installation_Guide/Installation-Guide.html
Condensed instructions:
Setting Permissions for Groups
sudo usermod -a -G video $LOGNAME
sudo usermod -a -G render $LOGNAME
Ubuntu
Note: AMD ROCm only supports Long Term Support (LTS) versions of Ubuntu. Versions other than LTS may work with ROCm, however, they are not officially supported.
1. Run the following code to ensure that your system is up to date:
sudo apt update
sudo apt dist-upgrade
sudo apt install libnuma-dev
sudo reboot
2. Add the ROCm apt repository.
sudo apt install wget gnupg2
wget -q -O - https://repo.radeon.com/rocm/rocm.gpg.key | sudo apt-key add -
echo 'deb [arch=amd64] https://repo.radeon.com/rocm/apt/debian/ ubuntu main' | sudo tee /etc/apt/sources.list.d/rocm.list
3. Install the ROCm meta-package. Update the appropriate repository list and install the rocm-dkms meta-package:
sudo apt update
sudo apt install rocm-dkms && sudo reboot
4. After restarting the system, run the following commands to verify that the ROCm installation is successful. If you see your GPUs listed by both commands, the installation is considered successful.
/opt/rocm/bin/rocminfo
/opt/rocm/opencl/bin/clinfo
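For step 4, counting the GPU agents can be quicker than reading the full rocminfo listing. The sketch below assumes rocminfo prints a "Device Type:" line ending in GPU for each GPU agent; that format is an assumption worth checking against real output, and the function name rocm_gpu_agents is made up for this example.

```shell
#!/usr/bin/env bash
# Count "Device Type: GPU" lines in rocminfo output read on stdin.
# grep -c exits non-zero when the count is 0, so "|| true" keeps the
# function from reporting failure in that case; it still prints 0.
rocm_gpu_agents() {
    grep -c 'Device Type:[[:space:]]*GPU' || true
}

# Example use:
#   /opt/rocm/bin/rocminfo | rocm_gpu_agents
```

A count that matches the number of installed cards is a quick sanity check before moving on to the clinfo output.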
Uninstalling ROCm Packages from Ubuntu
sudo apt autoremove rocm-opencl rocm-dkms rocm-dev rocm-utils && sudo reboot
_________________________________________________________________________