Former GPU mining rig now running BOINC

cecht
Joined: 7 Mar 18
Posts: 1,492
Credit: 2,750,644,484
RAC: 2,035,829

JStateson wrote:
There may be better tools other than fancontrol. ...

There are. Check out amdgpu-utils on GitHub: https://github.com/Ricks-Lab/amdgpu-utils. I know the graphical controller (the amdgpu-pac utility) can display four cards, and it should be able to fit in five or six. Read the USER_GUIDE through first. I use the utilities for fan control, P-state masking, and card monitoring. With them, you can change fan speeds on the fly, but changes to P-states need to be made with the card(s) not under load (i.e., suspend BOINC).

JStateson wrote:
When I put an RX580 in with the five RX560 the system would frequently shut down. It ran hot and I am pretty sure that caused the problem....

My two RX 570s have run in the low 80s for several weeks with no problems, so I wonder whether the temps you saw with the RX 580 are the sole cause of the shutdowns. The only time I've experienced a temperature-related system shutdown was when I screwed up a setting and a card got above 90 C.

Ideas are not fixed, nor should they be; we live in model-dependent reality.

Joseph Stateson
Joined: 7 May 07
Posts: 174
Credit: 3,039,625,389
RAC: 508,266

All of those temps and fan speeds are too low. It looks like nothing is happening. What does your manager show?

The following picture is of two systems: two rx570 + one rx580 under win10, and five rx560 under ubuntu. With full load, your rx580x should show temps in the mid 50s to mid 60s with fans at 100%. Temps for the Linux system are not available, but they match what I posted earlier using "sensors".

[EDIT] I added a screenshot from the windows system showing GPU utilization. Possibly that amd utility at GitHub can show something similar. The pic confirms full load: 71, 100, 99 with temps 55-65 and fan speed up high.

[Image: powerup]

[Image: boinctasks usage]

Joseph Stateson
Joined: 7 May 07
Posts: 174
Credit: 3,039,625,389
RAC: 508,266

cecht wrote:
My two RX 570s have run in the low 80s for several weeks with no problems, so I wonder whether the temps you saw with the RX 580 are the sole cause of the shutdowns. The only time I've experienced a temperature-related system shutdown was when I screwed up a setting and a card got above 90 C.

I don't like going above 79. Be that as it may, supposedly the GPUs throttle down if too hot, so there should never be a shutdown. Could it be that mining bios mods change the throttle threshold? Just a guess.

I am not running any CPU bound tasks on the rx560 system and the CPU is always cool. If I don't spin the rx560 fans at 100% the system dies in about 20 minutes.

I will look into that GitHub program, as fancontrol is terrible, though I did manage to get it to work.

steffen_moeller
Joined: 9 Feb 05
Posts: 78
Credit: 1,773,655,132
RAC: 0

I think I found the culprit - with the USB stick I seem to have done something wrong with the installation of upstream's (AMD's) drivers, likely not having set the --opencl option. That USB stick was slow, though; even pristine Ubuntu with the new SATA SSD was beating it by 13 minutes.

Three cards are now about as fast as they were with the M.2 NVMe SSD. Here are the temperatures for what are now five active cards:

coretemp-isa-0000
Adapter: ISA adapter
Package id 0:  +27.0°C  (high = +84.0°C, crit = +100.0°C)
Core 0:        +24.0°C  (high = +84.0°C, crit = +100.0°C)
Core 1:        +25.0°C  (high = +84.0°C, crit = +100.0°C)

amdgpu-pci-0500
Adapter: PCI adapter
vddgfx:       +1.12 V
fan1:        1007 RPM  (min =    0 RPM, max = 3200 RPM)
temp1:        +51.0°C  (crit = +94.0°C, hyst = -273.1°C)
power1:      114.00 W  (cap = 160.00 W)

amdgpu-pci-0300
Adapter: PCI adapter
vddgfx:       +1.10 V
fan1:         953 RPM  (min =    0 RPM, max = 3200 RPM)
temp1:        +45.0°C  (crit = +94.0°C, hyst = -273.1°C)
power1:       93.03 W  (cap = 160.00 W)

acpitz-virtual-0
Adapter: Virtual device
temp1:        +27.8°C  (crit = +119.0°C)
temp2:        +29.8°C  (crit = +119.0°C)

amdgpu-pci-0600
Adapter: PCI adapter
vddgfx:       +1.07 V
fan1:        1033 RPM  (min =    0 RPM, max = 3200 RPM)
temp1:        +43.0°C  (crit = +94.0°C, hyst = -273.1°C)
power1:       93.21 W  (cap = 160.00 W)

amdgpu-pci-0400
Adapter: PCI adapter
vddgfx:       +1.10 V
fan1:         988 RPM  (min =    0 RPM, max = 3200 RPM)
temp1:        +53.0°C  (crit = +94.0°C, hyst = -273.1°C)
power1:      100.04 W  (cap = 160.00 W)

amdgpu-pci-0200
Adapter: PCI adapter
vddgfx:       +1.10 V
fan1:        1005 RPM  (min =    0 RPM, max = 3200 RPM)
temp1:        +51.0°C  (crit = +94.0°C, hyst = -273.1°C)
power1:       97.00 W  (cap = 160.00 W)

CPU utilisation is lower over here at around 17%.

I'll have a look at compiling (and packaging) the AMD utils over the weekend. These look really nice. Many thanks for the pointer.

Wall-wattage is at 781W, btw.  Uh, and fans are getting slower:

coretemp-isa-0000
Adapter: ISA adapter
Package id 0:  +26.0°C  (high = +84.0°C, crit = +100.0°C)
Core 0:        +23.0°C  (high = +84.0°C, crit = +100.0°C)
Core 1:        +25.0°C  (high = +84.0°C, crit = +100.0°C)

amdgpu-pci-0500
Adapter: PCI adapter
vddgfx:       +1.13 V
fan1:         853 RPM  (min =    0 RPM, max = 3200 RPM)
temp1:        +75.0°C  (crit = +94.0°C, hyst = -273.1°C)
power1:      112.00 W  (cap = 160.00 W)

amdgpu-pci-0300
Adapter: PCI adapter
vddgfx:       +1.11 V
fan1:         786 RPM  (min =    0 RPM, max = 3200 RPM)
temp1:        +58.0°C  (crit = +94.0°C, hyst = -273.1°C)
power1:      104.02 W  (cap = 160.00 W)

acpitz-virtual-0
Adapter: Virtual device
temp1:        +27.8°C  (crit = +119.0°C)
temp2:        +29.8°C  (crit = +119.0°C)

amdgpu-pci-0600
Adapter: PCI adapter
vddgfx:       +1.09 V
fan1:         805 RPM  (min =    0 RPM, max = 3200 RPM)
temp1:        +69.0°C  (crit = +94.0°C, hyst = -273.1°C)
power1:      113.10 W  (cap = 160.00 W)

amdgpu-pci-0400
Adapter: PCI adapter
vddgfx:       +1.11 V
fan1:         794 RPM  (min =    0 RPM, max = 3200 RPM)
temp1:        +74.0°C  (crit = +94.0°C, hyst = -273.1°C)
power1:      109.03 W  (cap = 160.00 W)

amdgpu-pci-0200
Adapter: PCI adapter
vddgfx:       +1.11 V
fan1:         800 RPM  (min =    0 RPM, max = 3200 RPM)
temp1:        +69.0°C  (crit = +94.0°C, hyst = -273.1°C)
power1:      117.24 W  (cap = 160.00 W)

The slomo fans over here may be a symptom of the mining mode. I'll leave that running for a while and then check out that sixth card in the 16x slot again. And then also attempt it all in the regular mode.
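For anyone who wants to track this over time rather than eyeball pasted snapshots, here is a rough Python sketch that pulls fan speed and temperature out of `sensors` text for the amdgpu chips only. It assumes the layout shown above (chips separated by a blank line, `fan1` and `temp1` labels); the function name and the "headroom" figure are just illustrative, and other drivers or cards may label things differently.

```python
import re

# Two of the five amdgpu blocks from the second `sensors` snapshot above.
SNAPSHOT = """\
amdgpu-pci-0500
Adapter: PCI adapter
vddgfx:       +1.13 V
fan1:         853 RPM  (min =    0 RPM, max = 3200 RPM)
temp1:        +75.0°C  (crit = +94.0°C, hyst = -273.1°C)
power1:      112.00 W  (cap = 160.00 W)

amdgpu-pci-0300
Adapter: PCI adapter
vddgfx:       +1.11 V
fan1:         786 RPM  (min =    0 RPM, max = 3200 RPM)
temp1:        +58.0°C  (crit = +94.0°C, hyst = -273.1°C)
power1:      104.02 W  (cap = 160.00 W)
"""


def parse_amdgpu_blocks(text):
    """Map each amdgpu chip name to fan speed, temperature and headroom to crit."""
    cards = {}
    # `sensors` separates chips with a blank line.
    for block in re.split(r"\n\s*\n", text.strip()):
        chip = block.splitlines()[0].strip()
        if not chip.startswith("amdgpu-"):
            continue  # skip coretemp, acpitz, etc.
        fan = re.search(r"^fan1:\s+(\d+) RPM", block, re.M)
        temp = re.search(r"^temp1:\s+\+([\d.]+)°C\s+\(crit = \+([\d.]+)°C", block, re.M)
        if fan is None or temp is None:
            continue  # layout differs from the assumption; ignore this chip
        temp_c, crit_c = float(temp.group(1)), float(temp.group(2))
        cards[chip] = {
            "fan_rpm": int(fan.group(1)),
            "temp_c": temp_c,
            "headroom_c": crit_c - temp_c,
        }
    return cards


cards = parse_amdgpu_blocks(SNAPSHOT)
for chip, info in sorted(cards.items()):
    print(f"{chip}: {info['temp_c']:.0f} C at {info['fan_rpm']} RPM, "
          f"{info['headroom_c']:.0f} C below crit")
```

Feeding it the real output (for example captured with `subprocess.run(["sensors"], capture_output=True, text=True).stdout`) instead of the pasted sample should work the same way, as long as the labels match.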

I just cross-checked with a random task of JStateson (https://einsteinathome.org/de/task/866208691) and from what I see, my overall compute times are similar (https://einsteinathome.org/de/host/12784717/tasks/4/0). My cpu time seems lower, which may be because of not having virtualisation on that processor. 

Thank you all.

steffen_moeller
Joined: 9 Feb 05
Posts: 78
Credit: 1,773,655,132
RAC: 0

So, I got six cards mostly working now. The implicit hint came from the Gigabyte BIOS update site (https://www.gigabyte.com/Motherboard/GA-H110-D3A-rev-10/support#support-dl-bios), which says the F21 version (which this board already runs) optimizes resource allocation when 7 VGA are detected. So, not trusting that, I deactivated the internal GPU and the system booted just fine.

All 6 cards are detected by BOINC. And, just like before, when there were reboots when I ran 5 of 5 with E@H, there are now spontaneous system reboots at about 50% done when I run 6 of 6 with E@H. The rig is now running SETI and I activate 5 E@H WUs when I pass by :)

I'll let this run now for a while. The next thing will be to substitute the one USB 1x PCI extender with the 16x cable and get E@H to run more stably.

Also, many thanks for the pointer to the amdgpu-utils. It works over here, and I even got a Debian/Ubuntu package for it. I will wait for the next version (2.5.2), expected any moment, before sending this to Debian. And I shall find an Ubuntu PPA for it, too.

steffen_moeller
Joined: 9 Feb 05
Posts: 78
Credit: 1,773,655,132
RAC: 0

The USB-free connection to the 16x PCIe slot makes a difference:

$ amdgpu-ls | egrep '^(Card|Link|PCI)' Card Model:  Ellesmere [Radeon RX 470/480/570/570X/580/580X] (rev e7) Card Number: 5 Card Path: /sys/class/drm/card5/device/ PCIe ID: 07:00.0 Link Speed: 5 GT/s Link Width: 1 Card Model:  Ellesmere [Radeon RX 470/480/570/570X/580/580X] (rev e7) Card Number: 3 Card Path: /sys/class/drm/card3/device/ PCIe ID: 05:00.0 Link Speed: 5 GT/s Link Width: 1 Card Model:  Ellesmere [Radeon RX 470/480/570/570X/580/580X] (rev e7) Card Number: 1 Card Path: /sys/class/drm/card1/device/ PCIe ID: 03:00.0 Link Speed: 5 GT/s Link Width: 1 Card Model:  Ellesmere [Radeon RX 470/480/570/570X/580/580X] (rev e7) Card Number: 4 Card Path: /sys/class/drm/card4/device/ PCIe ID: 06:00.0 Link Speed: 5 GT/s Link Width: 1 Card Model:  Ellesmere [Radeon RX 470/480/570/570X/580/580X] (rev e7) Card Number: 2 Card Path: /sys/class/drm/card2/device/ PCIe ID: 04:00.0 Link Speed: 5 GT/s Link Width: 1 Card Model:  Ellesmere [Radeon RX 470/480/570/570X/580/580X] (rev e7) Card Number: 0 Card Path: /sys/class/drm/card0/device/ PCIe ID: 01:00.0 Link Speed: 8 GT/s Link Width: 16

Both the link speed and the link width improve (last card, #0). It does not help stability, though. I still see system resets a few minutes into all slots running Einstein.

steffen_moeller
Joined: 9 Feb 05
Posts: 78
Credit: 1,773,655,132
RAC: 0

I had two Einstein WUs compete, of which one was on the 16x card and the other on the 1x one. The 16x had a 0.2% lead after 1/3rd of the task was done, which is of no practical relevance, I think. To see if this worsens I then added 2 more E@H tasks and two SETI tasks to it. So at very close to 3/3rd this went from the expected 0.4% difference to almost a full 1% difference. Or to 12:16 instead of 12:06 runtime. This is much better than I thought. This is in part because there were only two other E@H tasks sharing the PCIe bus - with 5 on the PCIe 1x slots (which would now crash) I had 12:30 to 12:45.

So, it is good to have that 6th slot active. There is - aside from the stability - no immediate drawback performance-wise. The substitution of the USB riser apparently does not make a difference that would be of practical relevance.

Update: with 1 SETI plus 5 E@H WUs, the 16x slot finished at 12:36, the 2nd best 11s later, and the 5th (last) 14s later (12:50).
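To put the elapsed times quoted above on a common scale, a small Python sketch follows. It assumes the figures are minutes:seconds (the post does not say so explicitly), and the helper names are made up for illustration.

```python
def to_seconds(mmss):
    """Convert a 'minutes:seconds' string such as '12:36' to seconds."""
    minutes, seconds = mmss.split(":")
    return int(minutes) * 60 + int(seconds)


def percent_slower(fast, slow):
    """How much longer `slow` took than `fast`, as a percentage of `fast`."""
    f, s = to_seconds(fast), to_seconds(slow)
    return 100.0 * (s - f) / f


# The two pairs of run times quoted in the post above.
for fast, slow in [("12:06", "12:16"), ("12:36", "12:50")]:
    print(f"{fast} -> {slow}: {percent_slower(fast, slow):.1f}% slower")
```

Under that assumption the gaps are 10 s and 14 s on runs of roughly 12 to 13 minutes, i.e. somewhere between one and two percent, which fits the conclusion that the riser type makes no practical difference.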

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5,869
Credit: 115,002,793,316
RAC: 31,118,271

steffen_moeller wrote:

The USB-free connection to the 16x PCIe slot makes a difference:

        $ amdgpu-ls | egrep '^(Card|Link|PCI)'

        Card Model:  Ellesmere [Radeon RX 470/480/570/570X/580/580X] (rev e7)
        Card Number: 5
        Card Path: /sys/class/drm/card5/device/
        PCIe ID: 07:00.0
        Link Speed: 5 GT/s
        Link Width: 1

        Card Model:  Ellesmere [Radeon RX 470/480/570/570X/580/580X] (rev e7)
        Card Number: 3
        Card Path: /sys/class/drm/card3/device/
        PCIe ID: 05:00.0
        Link Speed: 5 GT/s
        Link Width: 1

        Card Model:  Ellesmere [Radeon RX 470/480/570/570X/580/580X] (rev e7)
        Card Number: 1
        Card Path: /sys/class/drm/card1/device/
        PCIe ID: 03:00.0
        Link Speed: 5 GT/s
        Link Width: 1

        Card Model:  Ellesmere [Radeon RX 470/480/570/570X/580/580X] (rev e7)
        Card Number: 4
        Card Path: /sys/class/drm/card4/device/
        PCIe ID: 06:00.0
        Link Speed: 5 GT/s
        Link Width: 1

        Card Model:  Ellesmere [Radeon RX 470/480/570/570X/580/580X] (rev e7)
        Card Number: 2
        Card Path: /sys/class/drm/card2/device/
        PCIe ID: 04:00.0
        Link Speed: 5 GT/s
        Link Width: 1

        Card Model:  Ellesmere [Radeon RX 470/480/570/570X/580/580X] (rev e7)
        Card Number: 0
        Card Path: /sys/class/drm/card0/device/
        PCIe ID: 01:00.0
        Link Speed: 8 GT/s
        Link Width: 16

Both the link speed and the link width improve (last card, #0). It does not help stability, though. I still see system resets a few minutes into all slots running Einstein.

For the benefit of those following along, I've re-formatted the posted output of the original CLI command (given in the first line) to make it rather easier to digest the information.  I haven't come across the 'amdgpu-ls' command previously, but it's obviously designed to list a swag of information about any devices supported by the amdgpu driver.

This is a beautiful example of the power of the Linux shell (bash) and traditional Unix utilities like 'grep' (egrep = extended grep).  Take that swag of output and pipe it through 'egrep' whose job it is to select only those lines which match a particular pattern.  The '^'  means that the alternative patterns of 'Card' or 'Link' or 'PCI' must occur at the start of any lines being matched.
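For readers more at home in Python than in the shell, the same start-of-line filter can be sketched as below. The sample reuses the card #0 lines from the output above; the last two lines of the sample are invented here purely to show what the `^` anchor rejects, and are not real amdgpu-ls fields.

```python
import re

# Card #0 from the listing above, plus two made-up lines that must NOT match.
SAMPLE = """\
Card Model:  Ellesmere [Radeon RX 470/480/570/570X/580/580X] (rev e7)
Card Number: 0
Card Path: /sys/class/drm/card0/device/
PCIe ID: 01:00.0
Link Speed: 8 GT/s
Link Width: 16
Some Other Field: hypothetical line, not from the original output
Vendor Card Name: hypothetical, contains 'Card' but not at the start
"""

# Same pattern as the egrep call: one of three alternatives, anchored at line start.
pattern = re.compile(r"^(Card|Link|PCI)")

kept = [line for line in SAMPLE.splitlines() if pattern.match(line)]
for line in kept:
    print(line)
```

The six genuine lines are kept and both invented ones are dropped; the second of those shows that a mere mention of 'Card' somewhere in the line is not enough.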

Cheers,
Gary.

steffen_moeller
Joined: 9 Feb 05
Posts: 78
Credit: 1,773,655,132
RAC: 0

Gary Roberts wrote:

Fantastic fix of my illegible screendump.

Ooops, thank you, Gary.

amdgpu-ls (https://github.com/Ricks-Lab/amdgpu-utils/blob/master/docs/USER_GUIDE.md#using-amdgpu-ls) is part of the amdgpu-utils that Cecht suggested earlier in this thread (https://einsteinathome.org/de/goto/comment/172164). The link points to the documentation that gives an example with all the other lines it offers.

I am not exactly sure about how it does it, but apparently this can be used to match the device number to a physical slot and this way helps to identify a graphics card - there is a BOINC issue "Fix duplicate GPU problem" at https://github.com/BOINC/boinc/issues/3200.

I don't exactly know what to try next. The most plausible thing seems to be to look at the temperature, just like JStateson suggested. I think I'll just get a big external fan as a start. With some aluminum foil I want to insulate the 16x riser cable a bit more - less for the electrostatics than for where it physically touches the hot neighboring GPU. And then also try the 1x USB one again. But that is all only to find out why this fails; I would be just as happy to perpetually run SETI on one GPU and have it all stable. Sounds weird, but it works. I filed a BOINC issue about it: https://github.com/BOINC/boinc/issues/3229 to support the specification of such a minimal number of tasks of a project to be run.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5,869
Credit: 115,002,793,316
RAC: 31,118,271

steffen_moeller wrote:
... part of the amdgpu-utils that Cecht suggested earlier in this thread ...

I'd figured that was the case and it's certainly on my list of things to investigate when I get some time :-).

That young cecht is a real go-getter in terms of mastering new stuff and finding the hidden gems!  We need to chain him down and make sure he never leaves :-).

At the moment I'm trying to complete a major re-write of a script I use to do post-installation configuration of the entire crunching fleet.  I've had certain parts of that automated previously and other parts that I do manually.  As I get older, it becomes increasingly hard to remember every bit of manual tweaking.  Not only that, I'm increasingly making mistakes in root commands so I'm a danger to myself :-).

So I have this grand plan to make this script that is smart enough to do all the major parts of the job on its own and to warn me if I'm trying to do something out of order - or otherwise insane :-).  So far it's going quite well.  A module to handle the automatic installation and recording of packages required by BOINC and to save that information on external media on a per host basis as it changes over time is being tested right now and seems to be working well.  This means that recovering from a disk failure, or deciding to transfer a BOINC installation from one host to another, or even just the mundane stuff after a new host is built, can all be done pretty much automatically by answering some simple questions when asked.  I reckon then I might be able to compensate for forgetfulness by just following instructions :-).

steffen_moeller wrote:
... I would be just as happy to perpetually run SETI on one GPU and have it all stable. Sounds weird, but it works. I filed a BOINC issue about it: https://github.com/BOINC/boinc/issues/3229 to support the specification of such a minimal number of tasks of a project to be run.

Couldn't you install an app_config.xml in the Seti project dir with a <max_concurrent> of just 1 and a different file in the EAH project dir with a <max_concurrent> of whatever maximum number you wanted there?  You might have to juggle resource shares to make sure BOINC didn't stop getting EAH work because it thought Seti was behind in its share.  If Seti was out of work, the <max_concurrent> for Einstein should leave a GPU idle.  Or am I not properly understanding what you want to achieve?

As I only run a single project, I have no real experience with <max_concurrent>.  However I believe it does what the label says :-).
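As a rough illustration of the idea, the Python sketch below generates the kind of app_config.xml being described, with a per-application `<max_concurrent>` limit. The application names are placeholders invented for this example (the real short names would have to be looked up, for instance in the client's client_state.xml), and the limit of 5 for Einstein is only an example value. Treat it as untested against a live BOINC client.

```python
import xml.etree.ElementTree as ET


def build_app_config(app_name, max_concurrent):
    """Return an app_config.xml document limiting one app to `max_concurrent` tasks."""
    root = ET.Element("app_config")
    app = ET.SubElement(root, "app")
    ET.SubElement(app, "name").text = app_name
    ET.SubElement(app, "max_concurrent").text = str(max_concurrent)
    return ET.tostring(root, encoding="unicode")


# Placeholder app names: hypothetical, NOT the real Seti/Einstein short names.
seti_xml = build_app_config("hypothetical_seti_app", 1)
eah_xml = build_app_config("hypothetical_eah_app", 5)

print(seti_xml)
print(eah_xml)
```

Each string would be saved as app_config.xml in the corresponding project directory, and the client then has to re-read its config files (or be restarted) before the limits apply.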

Cheers,
Gary.
