A very long standing issue I work on sporadically makes some progress ...

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5842
Credit: 109411214493
RAC: 34940403

The count is now up to 2 valid, 9 pending and 1 inconclusive.  As there are no validate errors (which plagued me with the SI test) I've downloaded some more tasks and switched from running x1 to x2.  The x1 times are averaging around 1600 secs.  The GPU is an R7 260X.

There was a single task crunched 45% x1 and the balance on x2.  Its time was 2295s.  Its partner (crunched entirely at x2) took 2984s - or 1492 per task when two are crunched simultaneously.  So that (if sustained) represents a nice little improvement.

This GPU when using proprietary fglrx/OpenCL gave a per task time (at x2) of 1535s so the new amdgpu based result is nearly a 3% improvement - as long as the results validate and the time is maintained.  It's certainly looking at least a bit promising :-).
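The per-task figures quoted above come from dividing the elapsed time by the number of tasks sharing the GPU. A minimal Python sketch of that arithmetic, using only the times from this post (the function names are illustrative and not part of any BOINC tool):

```python
def per_task_time(elapsed_s, concurrent):
    """Effective seconds per task when `concurrent` tasks share the GPU."""
    return elapsed_s / concurrent


def percent_faster(old_s, new_s):
    """Percentage reduction in per-task time going from old_s to new_s."""
    return (old_s - new_s) / old_s * 100.0


amdgpu_x2 = per_task_time(2984, 2)   # the task crunched entirely at x2
fglrx_x2 = 1535                      # per-task time quoted for fglrx at x2
print(amdgpu_x2, round(percent_faster(fglrx_x2, amdgpu_x2), 1))
```

This works out to 1492s per task and a reduction of about 2.8%, which is where "nearly 3%" comes from.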

In downloading extra tasks, I was lucky enough to snag a group of 5 resends.  They are the last 5 tasks in the current list and I've promoted all of them to crunch immediately in the hope that they will be presented for validation as soon as they finish.  They'll all be at x2 and the first will finish about now.

UPDATE:  The first three resends have completed and validated.  Unfortunately, the inconclusive result has now become invalid so the 4th result chose one of the nvidia results to agree with.

I've downloaded more tasks so the machine can run unattended through the night and I'll review the 'score' in the morning.  Currently it stands at 5 valid, 9 pending and the single invalid.

Cheers,
Gary.

QuantumHelos
Joined: 5 Nov 17
Posts: 190
Credit: 64239858
RAC: 0

Congratulations, Gary.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5842
Credit: 109411214493
RAC: 34940403

Gary Roberts wrote:
I've downloaded more tasks so the machine can run unattended through the night and I'll review the 'score' in the morning.  Currently it stands at 5 valid, 9 pending and the single invalid.

No problems through the night.  There are now 21 valid, 22 Pending and still the single invalid.  The average crunch time for the 17 crunched at x2 and validated was 2966s - a per task time of 1483s.  So, this batch of results is even slightly faster than the very first one to finish at 2x.  That first one which took 2984s is still in the pending list so not part of the average for the 17 valids.

I'll let this current install run for a bit longer just to make sure everything continues without problems and at that point I'll reboot to the other disk to finish off the remaining tasks there under the old fglrx based system.  Once that's done, I'll remove the old disk and this R7 260X based system will join the ranks of the amdgpu based fleet.

In the meantime, I'll fire up the SI (Pitcairn) testbed with the R7 370 GPU and try a few more experiments with it.  The current CIK exercise has given me a few clues as to what to try next :-).

Cheers,
Gary.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5842
Credit: 109411214493
RAC: 34940403

It's been nearly 5 months since I last did any work on trying to get GCN 1st Gen GPUs (Southern Islands series) to work using the amdgpu driver rather than the old deprecated fglrx proprietary driver.  The brief summary is that whilst I had complete success with the 2nd Gen (R7 260X - Sea Islands series) example, (as mentioned in my previous post) I couldn't get any of my 1st Gen Pitcairns to do likewise - so I gave up (temporarily) :-).

With the advent of what is likely to be a long hot summer, well over a month ago I made the decision to shut down more than 50 machines (many of which contained Pitcairn series GPUs) basically to protect my investment in newer machines with lots of RX 570 GPUs.  I've bought quite a few more of these GPUs since they are now selling locally for as low as $AU165 which is equivalent to around $US116.

The effect of the shutdown on temperature and stability has been extremely pleasing.  I use forced ventilation (not aircon) to manage temperature in the machine room.  In previous summers, I used to run a script to pause crunching on all machines if the room temperature got to ~37C.  This year, with more machines running than ever before, the room temperature was regularly heading towards 40C, even before summer had officially started, so I really had to take this drastic shutdown action :-).
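For anyone wanting something similar, a temperature-based pause could be sketched along the following lines in Python. This is not the script described above. The 37C pause point is the figure from this post, but the lower resume point, the host names and the way the room temperature is obtained are all assumptions. The only external tool used is `boinccmd` with its `--host` and `--set_run_mode` options; password handling for remote hosts is left out.

```python
import subprocess

PAUSE_AT_C = 37.0    # pause threshold mentioned in the post
RESUME_AT_C = 35.0   # assumed: resume a little lower to avoid flapping


def desired_run_mode(temp_c, currently_paused):
    """Return the boinccmd run mode for the given room temperature."""
    if temp_c >= PAUSE_AT_C:
        return "never"
    if currently_paused and temp_c > RESUME_AT_C:
        return "never"   # stay paused until the room has cooled further
    return "auto"


def boinccmd_args(host, mode):
    """Build the boinccmd command line (password handling omitted)."""
    return ["boinccmd", "--host", host, "--set_run_mode", mode]


def apply_mode(hosts, mode):
    """Send the run mode to every host in a (hypothetical) host list."""
    for host in hosts:
        subprocess.run(boinccmd_args(host, mode), check=False)
```

The gap between the two thresholds is there so that machines don't flip between paused and running while the room hovers around 37C.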

As a result, the room temperature now peaks at around 34C and I haven't needed to pause crunching on the machines that remain.  They seem to be running very reliably.  The last time there was any issue of note was on Dec 24 - according to the logs.  I remember seeing the office lights give a very brief flicker - one of these annoying power glitches that seem to happen randomly (but happily, fairly infrequently) for no apparent reason.  My monitoring script found 5 machines that were not operating normally.  I rebooted those five and all was well.  There were no other issues that had gone undetected.  Gotta luv the monitoring script! :-).

The last time a GPU actually crashed was Dec 19 when a 2008 vintage Q6600 machine had the RX 570 stop crunching overnight.  So that's just 1 actual problem in the last 2 weeks.  Normally I would expect to see 1 or 2 problems per day at this time of year.  These are obviously heat related since winter time has a much reduced rate of issues as well.

To keep myself busy, I've been refurbishing shutdown machines - and restarting the odd one or two with new RX 570 GPUs :-).  With a good clean and dust-off, and with fan re-lubes and capacitor replacements where needed, they scrub-up quite well :-).  I also decided it was time to take one of the refurbs still with a Pitcairn series GPU (HD7850) and have a further attempt at getting it to produce valid results with the very latest Linux kernel/amdgpu kernel module/OpenCL libs installed.

I had learned an important lesson back in 2011/2012 with a pair of GPUs, an AMD HD7770 and an nVidia GTX650 that I had purchased for testing what particular model I should buy in bulk.  The AMD was slightly cheaper, but on paper, should have performed a little better.  I was able to get the 650 working immediately but it took a lot longer (mainly due to my ineptitude and lack of knowledge) to get a working 7770.  The 650 outperformed the 7770 by about 10% so I bought a batch of 650s.

I kept the 7770 running and gained more understanding and experience with installing newer drivers and OpenCL libs for it.  Maybe 12 months later, the drivers then available suddenly started giving significantly better performance and the 7770 ended up outperforming the 650s by quite a margin.   The lesson was, "Don't expect AMD hardware to have mature and efficient Linux drivers with new hardware.  Be prepared to hold off until the FOSS community have time to tweak things."  I've only just recently shut down that 7770.  It still performs very well under fglrx.  It's GCN 1st Gen so it could still be used.  Its fan was completely shot but I had zip-tied a decent server fan to it and it was running quite cool :-).  However, the host it was in now has an RX 570 so I'm letting it retire gracefully :-).  After all, it has worked hard for nearly 8 years! :-).

About a month ago, AMD had quietly released the 19.30 version of Radeon Software for Linux - what used to be called the AMDGPU-PRO package.  I have a script that extracts the necessary bits out of the Red Hat version of this package since my distro of choice is not supported.  It's always a bit of an adventure to work out what might have changed with subsequent versions and then to edit my own script to make sure it handles every different version correctly.

So, yesterday, having refurbed this machine which I originally built back in 2008, I added a 128GB SATA SSD in place of the 2003 20GB IDE drive it had been running and fired it up.  I installed the very latest PCLOS ISO.  I updated all components to the latest available.  The kernel/amdgpu module is version 5.4.6.  I installed the 19.30 version OpenCL libs, extracted from the latest AMDGPU-PRO package.  I have built my own version of BOINC based on 7.16.3, so I took the whole BOINC tree from the 7.2.42 version that had been on the old 20GB hard drive and just replaced the old 7.2.42 program files with the 7.16.3 versions.

The new BOINC fired up without complaint, commented on the version change from 7.2.42 to 7.15.0 (what 7.16.3 is called when 7.16.x hasn't been officially released), detected the Pitcairn series GPU and its OpenCL capabilities, and proceeded to run benchmarks.  I had set the same series of environment variables as mentioned in the opening post in this thread so when I unset NNT (No New Tasks) and allowed a small work fetch, I expected, from previous behaviour, that crunching would start without issue, which it did.  The big unknown was whether or not the results would validate.

A couple of tasks completed and became 'pending'.  It was getting late so I downloaded enough work to last the night and went home.  This morning, the machine was still running OK and there are plenty of pendings, quite a few valids as well and not a single invalid or validate error.   The crunch time (for 1x) is a little slower than what it would have been under the old fglrx driver.  I've changed to 2x, which should show more clearly whether there is any change in performance.  It looks like my confidence that, sooner or later, the small number of AMD employees and the larger number of FOSS community volunteers would fix the problem for Linux users has finally paid off!

As I write this, it's quite a few hours later and there are 160 tasks in progress, 26 pending, 25 valid and 0 for any 'error' categories.  The crunch time singly is largely between 1300 and 1310secs and for x2 running, it is between 2410 and 2420secs which equates to ~1210s per task.  There have already been a couple of validations for tasks crunched x2.   Everything looks fine!
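To put the x1 and x2 times on the same footing, they can be converted to tasks per day. A small Python sketch, taking the midpoints of the ranges quoted above as representative values (that choice of midpoints is an assumption, as is the helper name):

```python
SECONDS_PER_DAY = 86400


def tasks_per_day(elapsed_s, concurrent):
    """Tasks completed per day when `concurrent` tasks each take elapsed_s."""
    return SECONDS_PER_DAY * concurrent / elapsed_s


x1 = tasks_per_day(1305, 1)   # midpoint of 1300 to 1310 secs, run singly
x2 = tasks_per_day(2415, 2)   # midpoint of 2410 to 2420 secs, run x2
print(round(x1, 1), round(x2, 1))
```

On those figures, x2 gives roughly 72 tasks per day against roughly 66 at x1, provided the times hold up over a longer run.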

It's early days, but it looks like the current performance is just a small amount less than what I was getting under the old fglrx.   As Murphy's Law would have it, now that I could upgrade all the GCN 1st Gen bearing hosts to current, I can't because they really need to stay shut down until the autumn coolness returns :-).

What I will be able to do is continue refurbing the remainder of the shut-down fleet.  The current success is less than a day old.  I just hope that the next Murphy's Law doesn't now kick in - you know, that one that says, "The very instant you declare a successful outcome to a long standing and seemingly intractable problem, it will all go to sh*t in the most severe fashion possible". :-).

Cheers,
Gary.

Matt White
Joined: 9 Jul 19
Posts: 120
Credit: 280798376
RAC: 0

You've been very busy! :)

I've done some hardware upgrades as well. My XW4600 and DL360 have been replaced with a Z420 and a DL380. The server is a single CPU unit, 8 cores vs two 6-core units. I'll need to upgrade that if I want to install a newer GPU. I'll need a different riser board as well. For now, I'm enjoying the lower energy bill. :)

I had some trouble getting my AMD card to play nice in the Z420. I saved the driver files, allowing me to uninstall the add-on Radeon software and install the drivers alone. That helped, but the machine would still run the GPU tasks much longer than it should. I finally set the GPU task to .5 and everything seems to be okay. It didn't like the .33 nor the .25 setting. Now my NVIDIA GTX 1660 Ti was fine with .25, but that card was relegated to my flight simulator. At some point, I'll put a better card in the Z420, after I clear it with the finance committee. :) (Wife and I want to buy a new home so we are watching our pennies.)

Clear skies,
Matt

cecht
Joined: 7 Mar 18
Posts: 1421
Credit: 2446026247
RAC: 1487798

Gary and Matt, thanks for the informative updates and congrats on your successes.  A good start to the new year!

Ideas are not fixed, nor should they be; we live in model-dependent reality.
