A very long standing issue I work on sporadically makes some progress ...

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5840
Credit: 109037910439
RAC: 33871185
Topic 219274

Some (perhaps many) regulars are probably sick of me talking about AMD GPUs.  If you're one of those, best to quit now while you've still got your sanity :-).  If you keep on reading, that's your problem :-).

What follows is a summary of a personal journey I've been on for a while now.  That journey was to find a mechanism to allow older AMD GPUs such as HD 7850s, 7950s, R7 370s, etc, that I run, to transition from the deprecated proprietary fglrx driver to the open source amdgpu driver on my Linux distro of choice - PCLinuxOS.

No doubt, a lot of the details I'm intending to write down will be of little interest to others.  I find I always do a better job of writing notes to myself if I start with the objective of explaining the full details of the saga to someone else.  I'm not forcing anyone to read but I am hoping that the process of documenting the journey will be helpful to me at some point in the future.  There might even be someone who reads through this and then comes up with a useful suggestion about how to resolve the problem.  I just might get lucky :-).

I bought my first AMD GPU for Einstein crunching (a HD 7770) around late 2012 or early 2013 and was initially disappointed.  It didn't perform all that well at first.  At that time, I didn't understand that it seems to take a while for AMD to get their drivers right particularly for Linux.  Then, one day, along came updated drivers and performance really improved.  The card still runs today and its current RAC is around 140K.  Not too bad for a 7yr old 1GB card.

That experience caused me to research and ultimately to invest in HD 7850s.  One obvious performance 'plus' was the fact they had a 256bit memory bus width compared to the 128bit width of the 7770.  The purchase 'trigger' was the discounted price of old stock after the launch of the 'new' R5/R7/R9 2xx series.  In that series was an R9 270 which was just a re-badged (and perhaps slightly improved) HD 7850.  That event taught me the crazy things that can happen when the old and new have virtually the same performance.  People assume the new will be better and the old gets discounted, often heavily, to clear it.

The 7770, 7850, R9 270 and an even later re-badging (R7 370) all belong to the 1st generation of the then new architecture called Graphics Core Next (GCN) which had replaced the previous architecture known as TeraScale.  GPUs in that new 1st gen were collectively referred to as "Southern Islands" (SI).  Over time, the GCN architecture was improved and the 2nd gen cards were "Sea Islands" (CIK) and the 3rd gen, "Volcanic Islands" (VI), etc.  I mention this just to draw attention to how convoluted the GCN architecture had become over time.  Things like the Polaris GPUs - RX 4xx and 5xx are GCN 4th gen and Vega and Radeon VII are GCN 5th gen.
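As a quick reference, the family names above can be captured in a small lookup table. This is only an illustrative Python sketch: it lists just the chips named in this thread, and `GCN_FAMILIES` and `describe` are made-up names for the example, not part of any driver or tool.

```python
# Illustrative lookup of the GPU chips/families mentioned in this thread.
# Only chips named in the discussion are listed; it is not a complete table.
GCN_FAMILIES = {
    # chip:        (GCN generation, family name)
    "Cape Verde": (1, "Southern Islands (SI)"),
    "Pitcairn":   (1, "Southern Islands (SI)"),
    "Bonaire":    (2, "Sea Islands (CIK)"),
    "Hawaii":     (2, "Sea Islands (CIK)"),
    "Fiji":       (3, "Volcanic Islands (VI)"),
    "Polaris":    (4, "Polaris"),
    "Vega":       (5, "Vega"),
}

def describe(chip):
    """Return a one-line description of a chip, or a note if it is not listed."""
    entry = GCN_FAMILIES.get(chip)
    if entry is None:
        return f"{chip}: not in this table"
    generation, family = entry
    return f"{chip}: GCN gen {generation}, {family}"

print(describe("Pitcairn"))
```

A lookup like this makes it easy to see at a glance that the HD 7850 and R7 370 (both Pitcairn) sit in the oldest generation, which is the one the new driver stack was slowest to cover.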

While all this was going on, AMD had started the process of open sourcing their graphics drivers for Linux.  At the time I bought the 7850s, the driver was closed source and called 'fglrx'.  Somewhat later, AMD announced they were going to deprecate fglrx and start the transition to an open source driver called amdgpu which was to be part of the Linux kernel.  For those needing the OpenCL functionality which had been part of fglrx, there was a binary package called AMDGPU-PRO which would include that as well as proprietary versions of the driver.

AMD made 3 versions of this binary package - for Ubuntu, for Red Hat and for OpenSUSE.  Other Linux distros were not catered for directly.  It was up to the package managers for individual distros to handle that, if they wanted to.  Just as nVidia had a community developed open source driver called 'nouveau', there was an AMD equivalent called 'radeon'.  Neither nouveau nor radeon has the capability to support the use of OpenCL.  If you needed OpenCL on supported AMD hardware, you needed the new amdgpu kernel driver and you needed to install the OpenCL libs on top of that.

If you used one of the supported distros, it was a pretty trivial exercise.  You had the choice of installing straight from AMD by downloading the package and running the install script or you could wait for the supported distro to include it in their own available sets of packages and use the distro's own software installation/package management system.  For the many other unsupported distros, the wait could be much longer since somebody would need to do a lot more work to make something that would work for that particular distro. 

In my opinion, open sourcing AMD drivers is a win-win-win (eventually) for all parties - AMD, the Linux kernel community, and all users (both Windows and Linux).  AMD's developer team contribute knowledge and code to the Linux kernel.  Kernel developers, other Linux developers and the package management teams in different distros take this code and improve it further.  AMD gets the benefit of more critical eyes probing their code contributions with the end result of a better and more timely product than what AMD could have achieved by being entirely 'in-house'.  The end user (eventually) gets the benefit of this.  It's taking quite a while to get there but, for me as an end user, I believe it will be (ultimately) worth the wait.

Initially, the amdgpu open source driver didn't support cards belonging to the 1st & 2nd gen of GCN - SI and CIK.  The plan was to gradually incorporate these over time.  It's still a work in progress.  The fglrx driver was deprecated around mid-2016.  AMD produced a final cleaned-up version of it that did have some performance improvements.  I grabbed a copy and upgraded all my SI equipped hosts and they've sat at that upgrade point ever since.

In early 2017, I started playing with the amdgpu driver and 4th gen cards and worked out a procedure to extract just the OpenCL bits from the Red Hat AMDGPU-PRO package of the day (16.60) to get crunching to work on Polaris cards using my distro of choice.  My distro of choice doesn't package BOINC (they regard it as alpha quality software based on early bad experiences with it) and they have not provided packaged versions of OpenCL once the binary version that came with fglrx was deprecated.  In a way both of those things were ultimately to my benefit since it forced me to learn how to build BOINC for myself and it forced me to work out how to extract stuff from packages designed for other distros.

Since that time, about 4 or 5 times per year, AMD release an updated package for those three supported distros.  Quite often there are changes in components and locations so it's a bit of an adventure to work out exactly what is needed to have working OpenCL with each subsequent version.  I got through to 18.30 about August last year but then had a problem getting the December package (18.50) to work.  There was obviously a change I wasn't understanding.

With each new release of the -PRO package, I would fire up a system with an SI GPU to see if things would work. The basic graphics has worked OK for quite a while with some extra kernel parameters at boot time but the best I got with OpenCL was for BOINC to detect the GPU and download tasks which would then proceed to fail after about 15-20 seconds of trying to run.  So I'd just restore the old 2016 fglrx system and put it back to work that way.

I did some further work on this a while ago now after seeing the information in this thread, in particular this message which documents some needed environment variables.  I was still using the OpenCL libs from 18.30 and it still wouldn't work for me, even when setting those variables.  If you look near the end of that thread, you'll see that I asked the OP if his results were validating.  I suspect (as I describe later) that he was using the 18.50 libs, which do allow crunching to work, but not to validate reliably.  I've kept an eye on that thread but the OP didn't respond further and his RAC has essentially gone to zero.  I suspect he couldn't get his completed work to validate reliably.

More recently, I worked out how to use 18.50 and 19.10 for OpenCL libs and that works fine for Polaris GPUs.  These two new versions are now incorporated into my personal installation script.  That single script can automatically handle install/uninstall/re-install of any particular version from the original 16.60 I first started with up to the 19.10 version announced earlier this year.  From testing the various versions, there doesn't seem to be any significant improvement in crunching rate on Polaris GPUs with the more recent versions compared to what was available from 16.60.  The vast majority of my Polaris hosts using the amdgpu driver are running the 18.30 version of the OpenCL libs.  There are just a couple testing 18.50 and 19.10 and, with no benefit showing up, there's no impetus to upgrade the others.

So I repeated the tests on the SI series and, by incorporating the environment variables, was delighted to find that crunching was now working on both a HD 7850 and an R7 370 (both Pitcairn GPUs).  The results were completing around 5-10% slower than for fglrx but it was really nice to see no compute errors.  After a number of tasks had been returned, there were even one or two validations.  I downloaded a few more tasks and was about to set x2 (two tasks running concurrently on the GPU) when there was an inconclusive and then another one.  I had around 30 tasks on board and then a validate error popped up.  At that point I set NNT (No New Tasks) and decided to see what would happen to all the others that had been uploaded and were waiting for a quorum to form.  The end result was very few more validations, with around half of the tasks becoming validate errors and the majority of (if not all) inconclusives being deemed invalid.  I had started with a 7850 but I tried a few on an R7 370 with probably an even worse result.  That was a couple of months ago and I've tried since then with a later version kernel with no improvement in outcome.

Very recently, my distro of choice has released kernels (with amdgpu kernel modules) for the 5.2.x kernel series. There are also the latest versions of lots of other important stuff so I figured it was time to try again, but rather cautiously.  This is the tasks list of a test installation I'm using.  It's a fully updated latest version install on a second disk with its own separate hostID in a machine that normally runs the old 2016 fglrx version.  It has an R7 370 GPU.  It's a very quick and painless changeover operation.  I shut down the 2016 version cruncher, reboot from the other disk, update the system from my local copy of the full repo, install the version of the OpenCL libs I want to use (if that has changed) and run clinfo to make sure the GPU's OpenCL properties are recognised.  After that checks out, I launch BOINC.
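The "run clinfo" check in that changeover could be scripted roughly as below. This is a sketch under stated assumptions, not the actual procedure used here: it assumes `clinfo` is on the PATH and that its output carries a "Device Type" line mentioning "GPU" for each GPU device, which may differ between clinfo builds. The function names are hypothetical.

```python
import shutil
import subprocess

def count_gpu_devices(clinfo_text):
    """Count 'Device Type' lines that mention GPU in clinfo-style output."""
    count = 0
    for line in clinfo_text.splitlines():
        if "Device Type" in line and "GPU" in line:
            count += 1
    return count

def gpu_ready():
    """Run clinfo (if installed) and report whether any OpenCL GPU was listed."""
    if shutil.which("clinfo") is None:
        return False  # clinfo not installed, so nothing can be confirmed
    result = subprocess.run(["clinfo"], capture_output=True, text=True)
    return result.returncode == 0 and count_gpu_devices(result.stdout) > 0

if __name__ == "__main__":
    print("OpenCL GPU detected" if gpu_ready() else "No OpenCL GPU detected")
```

Keeping the text parsing separate from the call to clinfo means the check can be tried against saved output from a known-good host before trusting it on a fresh install.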

There are currently 10 tasks showing of which 6 (dated 08 July) were the leftovers from a previous test with an earlier version kernel.  I had hoped to use those 6 for the 5.2.x kernel test but unfortunately I wasn't able to get that kernel before the tasks expired.  I downloaded 4 new tasks yesterday and the first 3 crunched without error as before.  I've kept a single task while waiting for validation results.  As you can see, two validate errors and 1 invalid.  So the machine has rebooted to the old 2016 setup and has resumed the normal trouble free operation with its old work cache.

So, I haven't reached the end of the road yet.  I intend to keep trying from time to time.  I was starting to become hazy about some of the earlier details so I decided to write it all down.  If you've got to this point, thanks for making the effort to read it all through to the end.  If you have any suggestions, feel free to add a comment.  Thank you.

Cheers,
Gary.

Rolf
Joined: 7 Aug 17
Posts: 27
Credit: 135377187
RAC: 0

Hi Gary,

I was going to suggest you try the Rocm opencl component, but when checking further I see that they specifically say that Pitcairn chips are known not to work with Rocm, "because the basic drivers required for ROCm, such as amdkfd do not include support for them". There is an overview of all the GCN generations and which ones work and don't work, perhaps it can give some more insight. There seems to have been a fundamental change in the Sea Islands generation so only Hawaii chips work. (And I thought Hawaii was a volcanic island but in this context it seems to be a sea island.)

https://rocm.github.io/hardware.html

Rolf

Gary Roberts

Hi Rolf,

Thank you very much for your response. It is appreciated.

I did take a very brief look at ROCm quite some time ago and when I saw statements like

"GPUs all require a host CPU and platform with PCIe 3.0 with support for PCIe atomics."

I knew it wasn't for me.  The biggest fraction of my Pitcairn GPU based hosts had 2013/14 HD 7850s which I bought to put in existing computers and some new builds that certainly weren't PCIe 3.0 capable.

Periodically, I look at this chart from the X.org wiki which (right near the bottom) still to this day lists the compute (OpenCL) status as WIP.  Over time I've gradually seen more stuff turn green so I guess it will eventually get there :-).

Whilst it'd be really nice to bring everything up-to-date, I have gotten quite comfortable with leaving around half the fleet back at the late 2016 update point.  The final cleaned up release of proprietary fglrx/OpenCL has been very stable and did give a very nice performance improvement when I installed it.  Even though the hardware is older, those machines on average have longer uptimes and fewer GPU quirks than the more modern stuff.  I guess it's a bit of a personal challenge to see if that older hardware can be made to get valid results under the new driver regime.

Cheers,
Gary.

QuantumHelos
Joined: 5 Nov 17
Posts: 190
Credit: 64239858
RAC: 0

Dear Customer,

Your Service Request has been received and will be processed shortly. Depending on the nature of your inquiry, further automated messages with additional instructions might follow.

Service Request: {ticketno:[8200889346]}

We thank you for your patience.

Best regards,

AMD Global Customer Care

______________________________________________________________________________________________

ICSS Service Request
From: me
Sent: 07/26/2019 01:15:39
To: TECH.SUPPORT 
Subject: PCLinuxOS Driver & ROCm Support 270XC : A VERY LONG STANDING ISSUE I WORK ON SPORADICALLY MAKES SOME PROGRESS

QuantumHelos

Gary Roberts wrote:

Hi Rolf,

Thank you very much for your response. It is appreciated.

I did take a very brief look at ROCm quite some time ago and when I saw statements like

"GPUs all require a host CPU and platform with PCIe 3.0 with support for PCIe atomics."

I knew it wasn't for me.  The biggest fraction of my Pitcairn GPU based hosts had 2013/14 HD 7850s which I bought to put in existing computers and some new builds that certainly weren't PCIe 3.0 capable.

Periodically, I look at this chart from the X.org wiki which (right near the bottom) still to this day lists the compute (OpenCL) status as WIP.  Over time I've gradually seen more stuff turn green so I guess it will eventually get there :-).

Whilst it'd be really nice to bring everything up-to-date, I have gotten quite comfortable with leaving around half the fleet back at the late 2016 update point.  The final cleaned up release of proprietary fglrx/OpenCL has been very stable and did give a very nice performance improvement when I installed it.  Even though the hardware is older, those machines on average have longer uptimes and fewer GPU quirks than the more modern stuff.  I guess it's a bit of a personal challenge to see if that older hardware can be made to get valid results under the new driver regime.

PCIe atomics are no longer required.

****

ROCm Driver for Debian type Linux distributions like Ubuntu

With AMD drivers, you need to uninstall the previous driver completely before installing the new one. ROCm looks likely to improve, with the laboratories promising to improve it together with Cray, and it does not require uninstalling ...

https://www.phoronix.com/scan.php?page=news_item&px=Radeon-ROCm-2.3-Released

https://rocm.github.io/blog.html

https://github.com/RadeonOpenCompute/ROCm

ROCm & Vulkan Drivers

Run this after downloading the file (Google Drive): https://is.gd/Install_gpl_ROCm_amd_drv_sh

sudo chmod 774 Install-gpl-ROCm-amd-drivers.sh
sudo ./Install-gpl-ROCm-amd-drivers.sh

QE

Rolf

As Quantum says, PCIE3 is not a good reason to avoid Rocm; if you browse around the site you will find more info on PCIE2 support. But they are cryptic when describing what is and isn't supported. Sometimes they mean "not supported for Rocm-specific features (HCC, HSA, HIP et al)", sometimes they mean "does not work at all". Either way, it won't cost much to try it. The only caveat is that the Rocm team says Pitcairn doesn't work because of the AMD basic drivers, and I suspect that means amdgpu-pro could not work either. Hawaii should work though.

Gary Roberts

Rolf wrote:
As Quantum says, PCIE3 is not a good reason to avoid Rocm

I think it's an extremely good reason and I don't think QH has much of a clue about the subject.  I don't claim to have much of a clue either but at least I do study the official details that do get published about these matters.

If you look at the links he offered, the first is to a news announcement at Phoronix about the 2.3 version that I remember reading months ago.  The current release is actually 2.6.

The 2nd link is basically marketing material (and copyright 2016 at that) from AMD trying to sell people on the idea of using ROCm by selling the virtues of various built-in libraries, eg. "250+ optimized math functions ...".  In other words, if you're building new apps for high end computing clusters containing powerful and modern GPU hardware maybe the ROCm platform might just suit you.

However, I'm just a volunteer and have no interest in coding some high performance GPU app.  I wouldn't have the faintest clue of how to get started.  I just want to run the E@H app.  And if you think about that, the app E@H uses must be able to run on consumer grade GPUs of all sorts of capabilities, from very high end to quite low end.  So basically it needs to accommodate down to some 'lowest common denominator' so as not to exclude potential volunteers, wherever possible.  If E@H were to build their app on the ROCm platform, how many ordinary volunteers would have the hardware that could run it?

The 3rd link is actually to the 2.6 version on github.  I remember seeing the news announcement about this on Phoronix a couple of weeks ago.  Here are a couple of quotes that very clearly point out why ROCm isn't going to work for me.  If you scroll down to the "Hardware Support" sub-heading you will see that official support is for GFX8 and GFX9 GPUs only - ie. Fiji, Polaris, Vega.  Then they say, "The following list of GPUs are enabled in the ROCm software, though full support is not guaranteed:  GFX7 - Hawaii".

Here is the most telling quote of all.

"Beginning with ROCm 1.8, GFX9 GPUs (such as Vega 10) no longer require PCIe atomics. We have similarly opened up more options for number of PCIe lanes. GFX9 GPUs can now be run on CPUs without PCIe atomics and on older PCIe generations, such as PCIe 2.0. This is not supported on GPUs below GFX9, e.g. GFX8 cards in the Fiji and Polaris families."

So, it's very clear that without PCIe 3.0, not even Polaris GPUs are supported, let alone something ancient like Southern Islands.

A lot of my older boards are not even PCIe 2.  I have lots of PCIe 1.x, all working quite happily with proprietary fglrx/proprietary OpenCL.  I've followed the various ROCm announcements from way back in the 1.x days and each time I see that the README descriptions of supported CPUs, GPUs and PCIe versions do a pretty good job of advising me that it's not suitable for my stuff.
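Read literally, the README quotes above amount to a simple rule of thumb. The Python sketch below is only an interpretation of those quotes, not anything taken from ROCm itself; `rocm_outlook` and its arguments are hypothetical names, and treating Southern Islands as GFX level 6 is an assumption added for the example.

```python
def rocm_outlook(gfx_level, pcie_gen, pcie_atomics):
    """Summarise the quoted ROCm 2.6 README rules for one GPU/host combination.

    gfx_level    -- GFX generation as an integer (assumed 6 for Southern Islands)
    pcie_gen     -- PCIe generation of the slot as an integer (1, 2 or 3)
    pcie_atomics -- True if the CPU/platform supports PCIe atomics
    """
    if gfx_level >= 9:
        # "GFX9 GPUs ... no longer require PCIe atomics" and can run on
        # "older PCIe generations, such as PCIe 2.0". PCIe 1.x is not mentioned.
        if pcie_gen >= 2:
            return "officially supported"
        return "not stated in the quotes"
    if gfx_level in (7, 8):
        # Below GFX9 the PCIe 3.0 + atomics requirement still applies.
        if pcie_gen < 3 or not pcie_atomics:
            return "not supported on this host"
        # GFX7 (the README names Hawaii only) is "enabled", not guaranteed.
        return "officially supported" if gfx_level == 8 else "enabled, not guaranteed"
    # Anything older (e.g. Southern Islands) is not listed at all.
    return "not listed"

# A Pitcairn card in a PCIe 1.x board, as described in this post.
print(rocm_outlook(6, 1, False))
```

Under that reading, a Polaris card in a PCIe 2.0 slot falls into "not supported on this host", and a Southern Islands card is not listed regardless of the slot.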

Cheers,
Gary.

QuantumHelos

Gary Roberts wrote:
Rolf wrote:
As Quantum says, PCIE3 is not a good reason to avoid Rocm

I think it's an extremely good reason and I don't think QH has much of a clue about the subject.  I don't claim to have much of a clue either

A lot of my older boards are not even PCIe 2.  I have lots of PCIe 1.x, all working quite happily with proprietary fglrx/proprietary OpenCL.  I've followed the various ROCm announcements from way back in the 1.x days and each time I see that the README descriptions of supported CPUs, GPUs and PCIe versions do a pretty good job of advising me that it's not suitable for my stuff.

So what are you saying? That you would like to be the only one writing to AMD support? My personal posts to AMD support include frame data videos and 2-page detailed reports on Elite Dangerous and hardware input.

The situation may change and, as they say, "Our drivers are written with the best intentions." Driver development is still underway, including the 19.30 driver, which has support for the A3 and R7 200 driver sets.

However, you do have to uninstall the old drivers and install the new ones, preferably in safe mode! The situation could be better.

The Lawrence laboratory still has older hardware to support rather than bin, and so do many other labs. So support for 200X hardware is an evolving compatibility effort between ROCm and the official AMD drivers and/or the GPL driver chains.

Know nothing? Pardon us, for we are only showing support.

QE

Matt White
Joined: 9 Jul 19
Posts: 120
Credit: 280798376
RAC: 0

Great information, Gary. Thanks for sharing.

Clear skies,
Matt

Gary Roberts

You're welcome!

Yesterday, I 'discovered' that one of the bunch of hosts I thought had SI GPUs actually has a CIK GPU.  I didn't think I had any 2nd gen GCN so I was quite delighted to find that this particular GPU was a Bonaire rather than the Cape Verde that I thought it was. It's definitely CIK rather than SI :-).

I'd been spending a bit of time adding functionality to a script called 'post_install' that I use to convert a fresh install (or re-install) into something that's ready to crunch.  There were two 'manual' parts of the operation that now have code to do them automatically by answering a couple of questions.

So this morning, I decided to add a second disk to the Bonaire host, set NNT for the time being on the 2016 OS version that was already crunching, load a fresh install onto the 2nd disk and use it as a testbed for the updated post_install procedure.  It all seems to have worked rather well and I have a fully working amdgpu based system that is crunching without error (so far) with times that seem comparable to what I was getting with fglrx.

Here is the current tasks list for anyone interested.  As I write this, there is 1 task validated, 1 inconclusive, and 1 waiting for validation.  The inconclusive is not a welcome sight - it's quite like what I saw (and mentioned previously) when I first got a Pitcairn GPU using amdgpu to return completed tasks without computation error.

I've set NNT for now and intend to wait to see how the current lot goes.  I'm pleased there are no validate errors yet but if I get more inconclusives, it will be a bit of a worry.  At the moment I'm using OpenCL libs from the 18.50 version of AMDGPU-PRO.  I'm using a 5.0.9 kernel version with the amdgpu kernel module that comes with that kernel.  If the current setup leads to significant invalids, my intention is to keep testing with different versions of OpenCL libs and with different kernels/amdgpu modules.  I have a number of different clone points for my local repo copy so I can test with different kernels from the 4.17.x period right through to the current 5.2.x series.
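The planned sweep over OpenCL library versions and kernels is essentially a small test matrix. Here is a Python sketch of it, assuming only the AMDGPU-PRO releases and kernel versions named in this thread; `untried_combinations` is a hypothetical helper and nothing here is taken from the real scripts.

```python
from itertools import product

# AMDGPU-PRO releases mentioned in this thread as sources of OpenCL libs.
OPENCL_LIBS = ["16.60", "18.30", "18.50", "19.10"]
# Kernel versions/series mentioned in this thread.
KERNELS = ["4.17.x", "5.0.9", "5.2.x"]

def untried_combinations(already_tried=()):
    """Return every (libs, kernel) pair not yet tried, in a stable order."""
    tried = set(already_tried)
    return [combo for combo in product(OPENCL_LIBS, KERNELS) if combo not in tried]

# The setup currently under test: 18.50 libs on a 5.0.9 kernel.
remaining = untried_combinations(already_tried=[("18.50", "5.0.9")])
print(len(remaining), "combinations left; next:", remaining[0])
```

Writing the combinations down this way makes it harder to lose track of which pairings have already produced inconclusive or invalid results.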

Cheers,
Gary.

Gary Roberts

I took a look at the quorum for the first inconclusive I saw and maybe it's gonna be alright :-).

My task is the _2 resend.  The first two tasks (_0 and _1) were done on nvidia GPUs, one on Windows and one on Linux.  Then along came mine and none of us agreed :-).  However the next resend (_3) is being done on an AMD GPU on Linux.  Maybe (finally) we'll be able to agree after all! :-).  Maybe it's just that validation is strict and maybe mine might be basically OK and not complete rubbish :-).

UPDATE:  There are now 2 valid, 4 waiting and still just 1 inconclusive.  Looking better all the time :-).

I've downloaded a few more tasks to have more "somewhat aged" to shorten the wait for validation.

Cheers,
Gary.
