Linux kernel 5.10 + AMDGPU + Radeon 20.45 = Frequent Gnome Crashes

Paul
Joined: 3 May 07
Posts: 84
Credit: 448,394,059
RAC: 1,019,779
Topic 225033

I have noticed a severe problem with Linux kernel 5.10 in combination with my AMD GPU, using the AMDGPU OSS video driver plus (only) the compute drivers from Radeon v20.45.  Does anyone else see this?

On kernel 5.9, things are fine, but if I boot kernel 5.10, my Gnome Shell crashes frequently, say 20-30 minutes after boot, when E@H GPU tasks are running.  Without E@H GPU tasks, the system seems okay; it's certainly more stable.

My GPU is 5700XT.

The last time I checked, about six months ago, the 5700XT (aka Navi 10) was still not supported by a fully OSS compute stack.  Does anyone know if that has changed?

When I get some time, I plan to remove Radeon driver v20.45, test the OSS stack alone, and then reinstall the Radeon driver.  I'm pretty confident that will not change anything, though.  I suspect v20.45 just isn't compatible with 5.10.

Blackleg
Joined: 9 Dec 17
Posts: 1
Credit: 77,914,112
RAC: 236,004

I use driver 20.40 with kernel 5.4 (Ubuntu 20.04) in my computer with the same GPU.

I tried updating to driver 20.45 and saw the Gnome crashes. After various tests with that driver, I reinstalled 20.40, and it's stable with that.

Jim1348
Joined: 19 Jan 06
Posts: 431
Credit: 205,560,364
RAC: 5,307

Paul wrote:
I suspect v20.45 just isn't compatible with 5.10.

Try the new 20.50 driver.  It fixes a compatibility problem with kernel 5.8.

Wedge009
Joined: 5 Mar 05
Posts: 56
Credit: 4,386,134,865
RAC: 10,615,276

Is that amdgpu-pro 20.45 / 20.50? Would you be using the RHEL/CentOS packages?

Since amdgpu-pro started using ROCr-based OpenCL from 20.45 onwards, I haven't had any success with BOINC GPU processing, and I've had to revert to 20.40. But that's using the Ubuntu packages. Still, I find it interesting that you seem to be successful with a Navi-based GPU.

https://einsteinathome.org/content/troubleshooting-ubuntu-20-and-fresh-install-amd-drivers (most recent posts)

Soli Deo Gloria

Paul
Joined: 3 May 07
Posts: 84
Credit: 448,394,059
RAC: 1,019,779

Just following up on this.  Yes, this is weird, but it seems pretty familiar too, and that's frustrating.

Since I reported the problem, the situation has improved, and this may not be the same issue, but the combination of kernel 5.10+ and amdgpu compute is still not solved.  What I see now is that memory usage grows steadily until all RAM is consumed and the system crashes.  This is particularly confusing because we have OOM handling and other protections now, so it's not clear why this out-of-memory condition is handled so poorly.

My best guess is unchanged, though: incompatible non-PRO AMD drivers in OSS kernel vs. AMD compute stack.

I'm using the compute components from 21.10 now, and that hasn't resolved the issue, either.  But the last time I checked, the system took 48 hours to exhaust 32 GB of memory.
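
For anyone who wants to put a number on this kind of slow leak, here is a minimal sketch in Python. It assumes a Linux system that exposes `MemAvailable` in `/proc/meminfo`; the function names are made up for illustration and are not part of BOINC or any AMD tool. It fits a straight line between the first and last sample only, so treat the result as a rough estimate.

```python
def parse_meminfo(text):
    """Parse the text of /proc/meminfo into a dict of integer values (mostly kB)."""
    fields = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            fields[name.strip()] = int(parts[0])
    return fields


def hours_until_exhausted(samples):
    """Estimate hours until MemAvailable reaches zero.

    samples is a list of (seconds, available_kb) pairs, oldest first.
    Returns None if there are too few samples or memory is not shrinking.
    """
    if len(samples) < 2:
        return None
    (t0, a0), (t1, a1) = samples[0], samples[-1]
    lost_kb = a0 - a1
    if t1 <= t0 or lost_kb <= 0:
        return None
    seconds_left = a1 * (t1 - t0) / lost_kb
    return seconds_left / 3600.0


def read_available_kb(path="/proc/meminfo"):
    """Read the current MemAvailable value in kB (Linux only)."""
    with open(path) as f:
        return parse_meminfo(f.read())["MemAvailable"]
```

To use it, call `read_available_kb()` every few minutes, append `(time.time(), value)` to a list, and pass the list to `hours_until_exhausted()`. A figure that keeps shrinking while GPU tasks run would be consistent with the steady growth described above.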

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5,462
Credit: 58,826,115,320
RAC: 55,754,328

Paul wrote:
....  What I see now is that memory usage grows steadily until all RAM is consumed and the system crashes.

If you take a look at this message you will find some comments about an issue that looks very similar.  Two messages later in that same thread, I gave more details.

For me, this happens with any 5.10.x or 5.11.x kernel that I've tried so far.  I haven't seen it with 5.4.x, 5.7.x, 5.8.x or 5.9.x kernels.  I haven't yet tried a 5.12.x kernel.  My personal preference is to use the latest LTS kernel, in this case the latest in the 5.4 series.  I'm up to 5.4.115 and will be downloading an even later one in that series shortly.  I do try to test a member or two of each kernel series in case the LTS series has a problem.  I usually wait until there have been 5 to 10 releases, e.g. for 5.12.x, I'll try from around 5.12.10 or later to see if there are any issues with the series.  Usually there aren't, and I was quite surprised when this issue turned up in 5.10.x.

My guess is that some sort of bug has been introduced in the 5.10 (and probably later) series which hopefully will get sorted fairly soon :-).

Cheers,
Gary.
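
A quick way to check a host against the kernel series listed in the post above is sketched below in Python. The two sets simply encode what that one post reports (5.10 and 5.11 affected; 5.4, 5.7, 5.8 and 5.9 not) and are not an authoritative compatibility list; the function names are hypothetical.

```python
import platform
import re

# Kernel series as reported in the post above; an assumption, not an official list.
REPORTED_AFFECTED = {(5, 10), (5, 11)}
REPORTED_OK = {(5, 4), (5, 7), (5, 8), (5, 9)}


def kernel_series(release):
    """Return (major, minor) from a release string such as '5.10.0-8-amd64'."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError("unrecognised kernel release: %r" % release)
    return int(match.group(1)), int(match.group(2))


def reported_status(release):
    """Classify a kernel release against the series reported in the thread."""
    series = kernel_series(release)
    if series in REPORTED_AFFECTED:
        return "reported affected"
    if series in REPORTED_OK:
        return "reported ok"
    return "not reported"
```

On the machine in question, `reported_status(platform.release())` gives the classification for the running kernel.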

Tom M
Joined: 2 Feb 06
Posts: 941
Credit: 1,601,407,545
RAC: 3,887,834

I have been getting "Gnome shell" error messages lately, even on kernel 5.4.

Since there is no evidence of a problem at the GUI level, I just ticked the "don't show me this error anymore" option.

Tom M

Over the hill?  What hill?  I don't REMEMBER any hill...

Paul
Joined: 3 May 07
Posts: 84
Credit: 448,394,059
RAC: 1,019,779

System is working okay again.  Kernel 5.12 + AMDGPU OSS Drivers 21.10.
