AMDGPU driver issues with recent Linux upgrades

cecht
Joined: 7 Mar 18
Posts: 640
Credit: 630,340,829
RAC: 974,721
Topic 220536

I'm running Ubuntu 18.04.3 LTS and just upgraded to the latest distro, which upgraded the Linux kernel from 5.0.0-37 to 5.3.0-26. That also upgraded AMDGPU drivers from 19.3-934563 to 19.5-967956. Following the reboot, while running GW tasks on my RX 570s (4x on each of 2 GPUs), I noticed the following problems:
  - tasks are taking about twice as long to complete;
  - the 4 threads of my CPU were running consistently near 100%, when previously they would be ~60%;
  - the 'top' terminal command shows one boinc process using ~90% CPU with the other 7 at ~30% each, whereas previously all 8 tasks/processes used the same amount of CPU resources (when no task is running the 99% completion phase that is);
  - and the 'top' terminal command shows several sdma and comp_1 processes each using ~15% to 30% of CPU resources, which they never did previously.
(If anyone uses the amdgpu-utils program to control GPU run parameters, the upgrades also broke s-clock power masking.)
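For anyone who wants to capture the same kind of observation without watching 'top', here is a minimal Python sketch. It assumes the procps `ps` command is available and that the kernel threads of interest have names starting with "sdma" or "comp_", as in the list above; the exact names on another system may differ. The function names and the 5% threshold are illustrative only. Note that `ps` reports CPU use averaged over each thread's lifetime, so the numbers will not match the live readings in 'top'.

```python
import subprocess

# Name prefixes taken from the post above; an assumption, not a driver guarantee.
WATCHED_PREFIXES = ("sdma", "comp_")


def parse_ps(text):
    """Parse `ps -eo pcpu,comm` output into (name, cpu_percent) pairs."""
    rows = []
    for line in text.strip().splitlines()[1:]:  # first line is the header
        parts = line.split(None, 1)
        if len(parts) != 2:
            continue
        try:
            cpu = float(parts[0])
        except ValueError:
            continue  # skip anything that is not "<number> <name>"
        rows.append((parts[1].strip(), cpu))
    return rows


def busy_gpu_threads(rows, threshold=5.0):
    """Keep only watched threads at or above the (hypothetical) threshold."""
    return [
        (name, cpu)
        for name, cpu in rows
        if name.startswith(WATCHED_PREFIXES) and cpu >= threshold
    ]


def snapshot():
    """Run ps once and return the busy sdma/comp_ threads."""
    out = subprocess.run(
        ["ps", "-eo", "pcpu,comm"],
        capture_output=True, text=True, check=True,
    ).stdout
    return busy_gpu_threads(parse_ps(out))
```

Calling `snapshot()` before and after an upgrade, and comparing the two lists, would give a rough record of whether these threads have started consuming CPU time.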

I ran a search on 'gpu sdma' and found this tidbit posted a couple of weeks ago:
https://linuxreviews.org/Mesa_20_Will_Have_SDMA_Disabled_On_AMD_RX-Series_GPUs

While the article is about an upcoming Mesa 20 release, it ends with this note: "mesa 19.3.2, released January 9th, 2019, includes the "disable SDMA on gfx8 to fix corruption on RX 580" patch." AMDGPU uses Mesa drivers.

Although the article doesn't mention OpenCL functions, given the altered sdma and CPU utilization and extended GW crunch times I'm seeing, I suspect that the changes to the recent Linux/Mesa/AMDGPU drivers have hobbled these AMD cards. I'm not sure whether to try to roll back the upgrades or wait it out for new drivers to fix the problems.

Ideas are not fixed, nor should they be; we live in model-dependent reality.

mikey
Joined: 22 Jan 05
Posts: 5,655
Credit: 540,823,983
RAC: 128,412

cecht wrote:

I'm running Ubuntu 18.04.3 LTS and just upgraded to the latest distro, which upgraded the Linux kernel from 5.0.0-37 to 5.3.0-26. That also upgraded AMDGPU drivers from 19.3-934563 to 19.5-967956. Following the reboot, while running GW tasks on my RX 570s (4x on each of 2 GPUs), I noticed the following problems:
  - tasks are taking about twice as long to complete;
  - the 4 threads of my CPU were running consistently near 100%, when previously they would be ~60%;
  - the 'top' terminal command shows one boinc process using ~90% CPU with the other 7 at ~30% each, whereas previously all 8 tasks/processes used the same amount of CPU resources (when no task is running the 99% completion phase that is);
  - and the 'top' terminal command shows several sdma and comp_1 processes each using ~15% to 30% of CPU resources, which they never did previously.
(If anyone uses the amdgpu-utils program to control GPU run parameters, the upgrades also broke s-clock power masking.)

I ran a search on 'gpu sdma' and found this tidbit posted a couple of weeks ago:
https://linuxreviews.org/Mesa_20_Will_Have_SDMA_Disabled_On_AMD_RX-Series_GPUs

While the article is about an upcoming Mesa 20 release, it ends with this note: "mesa 19.3.2, released January 9th, 2019, includes the "disable SDMA on gfx8 to fix corruption on RX 580" patch." AMDGPU uses Mesa drivers.

Although the article doesn't mention OpenCL functions, given the altered sdma and CPU utilization and extended GW crunch times I'm seeing, I suspect that the changes to the recent Linux/Mesa/AMDGPU drivers have hobbled these AMD cards. I'm not sure whether to try to roll back the upgrades or wait it out for new drivers to fix the problems.

If this is a BOINC-only machine and you have a spare drive, I would load the older version on it and set it up just to crunch until they come out with upgrades to fix the problem, leaving the existing drive alone to swap back to once they come out.

cecht
Joined: 7 Mar 18
Posts: 640
Credit: 630,340,829
RAC: 974,721

mikey wrote:
If this is a Boinc only machine and you have a spare drive I would load the older version on it and set it up just to crunch until they come out with upgrades to fix the problem, leaving the existing drive alone to swap back to once they come out.

That's a good thought, thanks Mikey. I've realized, however, that crunch times of the good ol' gamma ray binary pulsar tasks are not throttled by the AMDGPU upgrade, so I've taken the lazy man's approach and temporarily switched to running only GRP tasks. I'm guessing that the reason my GPUs appear to be only slightly affected while running these tasks is because of the FGRP app's low CPU overhead. This makes sense if the cause of the upgrade "problem" is that RX series GPUs running the GW app are heavily reliant on SDMA (system Direct Memory Access), which was disabled in the most recent AMDGPU/Mesa drivers. My limited understanding of DMA/SDMA is based on https://en.wikipedia.org/wiki/Direct_memory_access.

In short, beware the upgrade if you are running 2.02 (GW-opencl-ati) work on Linux.  I wonder whether this affects the Windows app in the same way?

Ideas are not fixed, nor should they be; we live in model-dependent reality.

mikey
Joined: 22 Jan 05
Posts: 5,655
Credit: 540,823,983
RAC: 128,412

cecht wrote:
mikey wrote:
If this is a Boinc only machine and you have a spare drive I would load the older version on it and set it up just to crunch until they come out with upgrades to fix the problem, leaving the existing drive alone to swap back to once they come out.

That's a good thought, thanks Mikey. I've realized, however, that crunch times of the good ol' gamma ray binary pulsar tasks are not throttled by the AMDGPU upgrade, so I've taken the lazy man's approach and temporarily switched to running only GRP tasks. I'm guessing that the reason my GPUs appear to be only slightly affected while running these tasks is because of the FGRP app's low CPU overhead. This makes sense if the cause of the upgrade "problem" is that RX series GPUs running the GW app are heavily reliant on SDMA (system Direct Memory Access), which was disabled in the most recent AMDGPU/Mesa drivers. My limited understanding of DMA/SDMA is based on https://en.wikipedia.org/wiki/Direct_memory_access.

In short, beware the upgrade if you are running 2.02 (GW-opencl-ati) work on Linux.  I wonder whether this affects the Windows app in the same way?

I've been running the GRP tasks on my own 5870 in my Win10 machine for a while now, but I'm not willing to try the other kinds of tasks because I've had issues with them before. This works and lets me keep posting here while splitting time with MilkyWay. The rest of my GPUs are running Collatz right now, as people are trying to pass me and I need to build up a cushion.

cecht
Joined: 7 Mar 18
Posts: 640
Credit: 630,340,829
RAC: 974,721

To conclude, I confess that my assumptions about the upgrade causing problems were wrong. While I thought my problems were an SDMA issue embedded in a Mesa driver upgrade, I finally got around to installing glxinfo and discovered that the system is currently running Mesa 19.2.8, not 19.3.2 as I had thought, so SDMA has nothing to do with it. The changes in how GW GPU tasks affect system resources were (are) caused by something else entirely.
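The version check described above can be sketched in a few lines of Python. This is only an illustration: it assumes the glxinfo output contains a line such as "OpenGL version string: ... Mesa 19.2.8", and it assumes that every Mesa release from 19.3.2 onward carries the SDMA patch mentioned in the linked article. The function names are made up for the example.

```python
import re

# First Mesa release that the linked article says includes the gfx8 SDMA patch.
PATCHED = (19, 3, 2)


def mesa_version(glxinfo_text):
    """Return the Mesa version found in glxinfo output as a tuple, or None."""
    match = re.search(r"Mesa (\d+)\.(\d+)\.(\d+)", glxinfo_text)
    if match is None:
        return None
    return tuple(int(part) for part in match.groups())


def has_sdma_patch(glxinfo_text):
    """True if the reported Mesa version is at or past the patched release.

    Assumes later releases keep the patch; this is not confirmed by the thread.
    """
    version = mesa_version(glxinfo_text)
    return version is not None and version >= PATCHED
```

Fed the version reported here (19.2.8), `has_sdma_patch` returns False, which matches the conclusion that the Mesa SDMA change was not in play on this system.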

Ideas are not fixed, nor should they be; we live in model-dependent reality.
