Troubleshooting Ubuntu 20 and a fresh install of AMD drivers

mountkidd
Joined: 14 Jun 12
Posts: 134
Credit: 6,051,133,843
RAC: 4,810,505

The AMD 20.45 driver for Ubuntu 20.04 LTS works with kernels 5.4.0-54 and 5.4.0-58 only.  It will not work with kernel 5.4.0-56 or any of the newer 5.8.x kernels.  Earlier versions of the 20.x drivers are a no-go with 5.4 kernels.
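
For anyone wanting to check a machine before attempting the install, here is a minimal Python sketch of that compatibility list. The set of working kernels is taken only from the report above; the function names are made up for illustration, and a kernel that is not listed is merely "not reported to work" rather than proven broken.

```python
import platform
import re

# Kernel builds reported in this thread as working with the AMD 20.45 driver
# on Ubuntu 20.04 LTS. Anything else is treated as "not reported to work".
KNOWN_GOOD_20_45 = {"5.4.0-54", "5.4.0-58"}


def kernel_base(release: str) -> str:
    """Reduce a release string such as '5.4.0-58-generic' to '5.4.0-58'."""
    match = re.match(r"(\d+\.\d+\.\d+-\d+)", release)
    return match.group(1) if match else release


def is_known_good(release: str) -> bool:
    """True only for kernel builds listed in KNOWN_GOOD_20_45."""
    return kernel_base(release) in KNOWN_GOOD_20_45


if __name__ == "__main__":
    running = platform.release()
    verdict = "reported to work" if is_known_good(running) else "not reported to work"
    print(f"Kernel {running}: {verdict} with the 20.45 driver")
```

Run on the target machine, it prints a verdict for the currently booted kernel only; other installed kernels would need to be checked by passing their release strings to is_known_good().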

Previously, AMD 20.30 (for 18.04 HWE) installed cleanly over kernels 4.15.0-115 and 4.15.0-117.  Recent kernels (4.15.0-129 and 4.15.0-130) in Ubuntu 18.04 LTS now have an install build issue with both AMD drivers 20.30 and 20.40, making a re-install of 20.30 impossible.  The solution is either to wait for a 20.45 version for 18.04 HWE or to apply a low-level code tweak, but the tweak is a bit nasty and implementing it carries risks, of course.

The AMD Community has a great thread on the issues.  A description of the low-level tweak was posted in this thread by @tim-savage.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5,575
Credit: 82,283,616,896
RAC: 67,349,462

mountkidd wrote:
The AMD Community has a great thread on the issues.  A description of the low-level tweak was posted in this thread by @tim-savage.

Thank you very much for posting this.  Finally understanding why Ubuntu users are having such issues was quite a revelation.  I was starting to believe that perhaps the AMD devs were really screwing up when they couldn't get it right for their main supported distro.

Here is the true picture as stated by @tim-savage:-

Quote:
The pci_platform_rom definition has been removed from linux/pci.h and the implementation from drivers/pci/rom.c. The method is still available in the mainline kernel.

The commit comment states "pci_platform_rom() now has no remaining callers, so remove it." this is the cause of the compile failures.

My interpretation is that an Ubuntu developer decided to remove a particular function (since, supposedly, nothing would need it) when building a specific customised kernel for Ubuntu.  This function was still in the "mainline kernel".  It would seem that the developer didn't realise the function was still needed to build the AMD GPU kernel module.  The solution from @tim-savage was to hack the module source code as well, to remove the inconsistency that had been created.  His comment was:-

Quote:

After patching, the AMD GPU driver compiled and loaded successfully with Ubuntu Kernel 4.5.0-48.

The solution is based on the changes made in the Ubuntu kernel, it requires modifying the AMD GPU source.
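
A rough way to see whether a given set of kernel headers still declares the function is to search linux/pci.h for it. The Python sketch below does only that. It assumes the headers for the running kernel are installed under the usual /lib/modules/<release>/build location, and the helper names are hypothetical. It is a plain text search, so a stray mention in a comment would also count as a match; treat the result as a hint, not proof.

```python
import platform
import re
from pathlib import Path


def declares_symbol(header_text: str, symbol: str = "pci_platform_rom") -> bool:
    """Return True if the header text contains `symbol` followed by '('."""
    return re.search(rf"\b{re.escape(symbol)}\s*\(", header_text) is not None


def default_header_path() -> Path:
    # Assumption: headers for the running kernel live in the usual
    # /lib/modules/<release>/build tree.
    return Path("/lib/modules") / platform.release() / "build/include/linux/pci.h"


if __name__ == "__main__":
    header = default_header_path()
    if not header.exists():
        print(f"{header} not found; are the kernel headers installed?")
    elif declares_symbol(header.read_text(errors="replace")):
        print("pci_platform_rom is declared; the unpatched module may build")
    else:
        print("pci_platform_rom is missing; expect the compile failure described above")
```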

Whilst I fully understand the 'blame AMD' sentiment, it seems rather obvious that this was entirely out of AMD's control and a classic example of a distro 'shooting itself in the foot.'  It's also a timely reminder of how complicated and fraught with danger downstream modifications to the mainline kernel can be.  I'm quite grateful that the distro I use doesn't hack the kernel but simply builds it with a standardised config, exactly as released by the kernel developers.  Since the full source code is also readily available, if people want non-standard options or hacked functions, they are perfectly able to apply their own mods to the source (and the config) and rebuild the kernel to their own particular requirements without affecting the whole community.

Cheers,
Gary.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5,575
Credit: 82,283,616,896
RAC: 67,349,462

Tom M wrote:

Succeeded in getting AMD drivers installed and E@H processing on two RX 580 GPUs.

For my next magic trick I will see if I can swap out all my Nvidia gpus for the Rx 570's that just showed up.

Tom, congratulations on getting your RX 580s working.

Please realise that your problems are down to actions taken by an Ubuntu developer (as pointed out by mountkidd) and not the particular "opencl=pal,legacy" option you were using.  Also ignore the suggestion to use "opencl=rocr" because Polaris GPUs like the RX 570 and 580 variants don't use pal or rocr - they use the 'orca' variant, which is 'legacy'.

It doesn't hurt to have multiple variants installed and it doesn't affect you at all since the correct implementation of opencl will be used as long as you have the 'legacy' option specified, irrespective of any other option you choose as well.
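
To illustrate the point about always including 'legacy', here is a small Python sketch that builds the option string from whatever variants are requested and appends 'legacy' if it is missing. The function name is made up for illustration, and the "--opencl=" prefix is assumed from the "opencl=..." options quoted in this thread; check the installer's own help text for the exact spelling in your driver version.

```python
from typing import Iterable


def opencl_option(requested: Iterable[str]) -> str:
    """Build an --opencl= value that always includes 'legacy'."""
    variants = []
    for name in list(requested) + ["legacy"]:
        if name not in variants:  # keep first-seen order, drop duplicates
            variants.append(name)
    return "--opencl=" + ",".join(variants)


if __name__ == "__main__":
    # A Polaris card (RX 570/580) needs 'legacy' whatever else is chosen.
    print(opencl_option(["pal"]))
    print(opencl_option(["rocr"]))
```

The resulting string would be passed to the installer script shipped in the driver package (the script name is not given in this thread).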

Cheers,
Gary.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5,575
Credit: 82,283,616,896
RAC: 67,349,462

robl wrote:
I have always believed that AMD produced good products, and still do.  But if I were to read through this thread I would avoid AMD on Linux.

It's wise not to believe everything you read on the internet ;-) :-).

robl wrote:
...  I should have spent the money on Win10 earlier in the cycle.  It had never been this hard on earlier versions of Ubuntu.

No, No not Windows!!  You just needed to abandon Ubuntu earlier ;-) :-).

Seriously though, AMD do produce good hardware that performs really well at Einstein but the development model that Ubuntu uses can really upset the ordinary users if the Ubuntu devs are a bit sloppy with releasing stuff that hasn't been properly tested - just as this case seems to suggest.  It seems like every 6 months (April and October) there is drama/trauma for some section of the ordinary users.  Having to do a fresh install or some sort of upgrade every 6 months would drive me up the wall.

My experience has been that an 'install once and update forever' mechanism (a rolling release) works best for me when I can choose what to install, what to update, and when to do it.  Of course, I don't update every day or every week or on any specific interval.  I do it when I reckon it would be beneficial - and when I have time to plan it properly.

I keep a full copy of the PCLOS repository on a local USB hard drive.  I started this in 2013 and update it from a main repo mirror on about a 2-week to 1-month cycle.  About every 3-6 months, I create a dated clone of the repo at a time when I believe there is good stability.  There is a dedicated group of testers who post their results on the PCLOS forums very regularly, so it's pretty easy to see problems and solutions when they occur.  When issues are noted, they usually get fixed in less than 24 hours.  So even if there were some longer-term drama with a current update, I could easily rewind to the previous stable clone copy of the repo.

Funnily enough, in all these years, I've never needed to do that.  I always stay at least a week or two behind the latest development and check the forums for lack of issues before creating a clone.  Every clone I've created (currently about 16) has had no issues when used to update a batch of machines.

My most recent clone copy was 31st Dec 2020 - an easy date to remember :-).  Since then, the only issues reported in the PCLOS forums have been to do with 2021 updates so I figured that clone copy should be good.  In the last week or so, I've updated 30 disparate machines using that copy.  I've installed the latest 5.9.16 kernel.  I've also installed my own build of BOINC 7.16.11 and the OpenCL libs as extracted from the Red Hat 20.40 amdgpu-pro package.  PCLOS doesn't package BOINC (they don't package alpha quality software) and they don't package the OpenCL libs.  I do both those things myself and it's become quite routine.

There was no problem with the 30 machine update and all have resumed crunching without issue.  Obviously, that repo clone is another 'keeper' :-).

Cheers,
Gary.

mountkidd
Joined: 14 Jun 12
Posts: 134
Credit: 6,051,133,843
RAC: 4,810,505

Great summary Gary! 

Giving Ubuntu credit for all this might be misplaced.  There are 3 major pieces to this puzzle: the Linux Foundation, with full responsibility for the kernel; the system packagers, like Canonical with their Ubuntu families, Red Hat, etc.; and lastly the third-party groups like AMD, Nvidia and other hardware vendors.  I'd like to think these groups communicate, but sometimes...

The pci_platform_rom definition and call is kernel-level stuff, so it's not likely the removal came from Canonical or AMD.  AMD, on the other hand, claim to have processes in place to monitor and detect such kernel changes, but this one slipped through.  In fairness to AMD, the problem was identified in early Dec 2020 and AMD created driver package 20.45 for 20.04 LTS systems.  They have yet to backport the driver changes to 18.04 LTS.  Anything earlier than 18.04 has passed its EOL date and those users must face the inevitable.

The patch to amdgpu_bios.c by @tim-savage compiles cleanly and I was able to get a successful build with it and install kernel 4.15.0-132 on one of my systems.  What remains however is getting the patched code into the amdgpu 20.30/20.40 driver so I can get the GUI back - it's not a simple process!

 

robl
Joined: 2 Jan 13
Posts: 1,706
Credit: 1,423,691,146
RAC: 2,739

Out of curiosity, does anyone have a feel for when this will be corrected?  Of all the parties concerned, I would think that AMD would have the greatest interest in arriving at a solution/resolution.

GWGeorge007
Joined: 8 Jan 18
Posts: 906
Credit: 1,784,874,152
RAC: 6,951,556

Gary Roberts wrote:

No, No not Windows!!  You just needed to abandon Ubuntu earlier ;-) :-).

Hi Gary,

There is no problem with you mentioning Windows, as in "No, No, not Windows!!", because I do know that Linux is the better operating system for BOINC.  But do you think that Ubuntu, and the offshoots like Mint, are really that bad?  I realize that your comment was a bit (slightly) tainted with the ;-) :-) emojis, but really?

George

A proud member of the O.F.A. (Old Farts Association)

John Persichilli
Joined: 23 Jan 12
Posts: 51
Credit: 43,898,736
RAC: 60,676

Okay guys and gals here is a new (for me anyway) issue.

I have installed the latest version of BOINC and selected Einstein@Home as the one and only project on three different Ubuntu 20.10 notebook computers and one Windows 10 Home edition machine. Two of the three Ubuntu computers run the Einstein@Home screensaver just fine. The third and oldest notebook, which is an ACER Aspire 7730Z, runs the screensaver but not completely. The starsphere loads but not all the toggled functions work. The biggest one of these is the Constellations display. When I press Shift and C nothing happens, that is, no constellation lines are displayed on the starsphere. I can toggle Shift G and the grid lines are displayed on the starsphere. I can also toggle some of the others, like stars and pulsars, off and on.

The only thing I can figure is that the graphics card itself or the drivers are too old to handle the X11 display. But that doesn't seem logical, as I can toggle some of the other features.

This computer is running an Intel Dual Core Pentium processor T3400, 2 GB DDR2 RAM, up to 732 MB Mobile Intel Graphics and a 160 GB HDD. I have set 80 GB for Ubuntu 20.10 OS.

Does anyone have an answer?

Regards,

John

Keith Myers
Joined: 11 Feb 11
Posts: 2,432
Credit: 6,881,575,571
RAC: 24,108,205

My first thought is not enough RAM for the troublesome computer to run the hardest and most memory intensive part of the screensaver simulation.
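
One way to test that theory is to look at how much memory is actually free while the screensaver is running. This is a minimal Python sketch that reads the standard Linux /proc/meminfo file; the function name is made up for illustration, and no particular threshold for "enough" is implied.

```python
def parse_meminfo(text: str) -> dict:
    """Map field names in /proc/meminfo-style text to their values in kB."""
    values = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        fields = rest.split()
        if fields and fields[0].isdigit():
            values[name.strip()] = int(fields[0])
    return values


if __name__ == "__main__":
    with open("/proc/meminfo") as handle:
        info = parse_meminfo(handle.read())
    for key in ("MemTotal", "MemAvailable", "SwapFree"):
        if key in info:
            print(f"{key}: {info[key] / 1024:.0f} MiB")
```

If MemAvailable drops close to zero while the starsphere is on screen, RAM is a plausible culprit; if plenty remains, the cause is more likely elsewhere.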

 

Tom M
Joined: 2 Feb 06
Posts: 2,435
Credit: 4,119,880,677
RAC: 9,294,953

John Persichilli wrote:

Okay guys and gals here is a new (for me anyway) issue.

===edit==

The third and oldest notebook, which is an ACER Aspire 7730Z runs the screensaver but not completely. But that doesn't seem logical as I can toggle other of the features.

===edit===

This computer is running an Intel Dual Core Pentium processor T3400, 2 GB DDR2 RAM, up to 732 MB Mobile Intel Graphics and a 160 GB HDD. I have set 80 GB for Ubuntu 20.10 OS.

Since the screen saver is driven by a CPU task, I would propose you reduce your processing to a single CPU task and see if the screen saver then loads/runs properly.

It is probable that you can only run 1 cpu task and (maybe) a gpu task on the laptop.  My own testing has almost always shown that processing a task on the older Intel internal graphics inside the cpu slowed the total production of the system down.  There was one exception (reported by someone else) but only testing will show which category your laptop is in.
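
One possible way to restrict BOINC to a single CPU task is the "use at most N% of the CPUs" computing preference. The Python sketch below just works out the percentage and prints the matching XML fragment. It assumes the local override file global_prefs_override.xml with its max_ncpus_pct setting, and the helper names are made up for illustration. The same limit can also be set through the BOINC Manager or the project's web preferences.

```python
import os


def cpu_percent_for(tasks: int, cores: int) -> float:
    """Percentage of cores that allows `tasks` CPU tasks to run at once."""
    if tasks < 1 or cores < 1:
        raise ValueError("tasks and cores must be positive")
    return min(100.0, 100.0 * tasks / cores)


def override_xml(tasks: int, cores: int) -> str:
    """Render a global_prefs_override.xml fragment (assumed file format)."""
    pct = cpu_percent_for(tasks, cores)
    return (
        "<global_preferences>\n"
        f"   <max_ncpus_pct>{pct:.6f}</max_ncpus_pct>\n"
        "</global_preferences>\n"
    )


if __name__ == "__main__":
    cores = os.cpu_count() or 1
    print(override_xml(1, cores))
```

On the dual-core T3400 this comes out at 50%. The output would be saved as global_prefs_override.xml in the BOINC data directory (its location depends on how BOINC was installed) and then picked up by restarting the client or asking it to re-read local preferences.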

If a RAM upgrade is available/cheap/easy for the laptop (sometimes the RAM chips are not very accessible, so check before you buy more RAM), I would bump it to 4 GB or higher.

Tom M

Live long and Prosper.

A proud member of the O.F.A. (Old Farts Association)

It ain't the heat it's the humility. - Yogi Berra

 
