Former GPU mining rig now running BOINC

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5850
Credit: 110027872793
RAC: 22466669

steffen_moeller wrote:
...The only PCIe 3.0 slot ("x16") is not connected ...

That explains something that puzzled me when I looked through the crunch times you were getting.  As I commented, they were all pretty much the same around the 720 - 740 secs mark.  I had been expecting to see some a bit faster for a card in the x16 slot and the others a bit slower (maybe a lot slower) due to x1 restrictions.

As I also commented, I've no experience running GPUs on risers from x1 slots. I had always imagined there would be a significant penalty for doing that, so was never tempted to try.  Perhaps there isn't - I don't know :-).

Before you buy more risers, how difficult would it be to put a card in the x16 slot, with the other cards staying on the x1 slots, to see if the bandwidth taken by the x16 card has any effect on the remaining ones on risers?  Maybe your optimal result will come from all 6 GPUs on x1 risers - particularly if the chipset support is limited to 6 PCIe lanes?  I don't know enough about this stuff to make a useful comment.

 

Cheers,
Gary.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5850
Credit: 110027872793
RAC: 22466669

Holmis wrote:

Gary Roberts wrote:
Holmis has pointed you in the correct direction but...

Thank you for giving some more detail, I didn't have time to write a more detailed message.

No need to thank me - I just hope you don't think I was interfering :-).

You mentioned clearly that the file was 'tuned' to your requirements.  Sometimes other people can take an example like that and, without understanding what will happen, try to apply it to an inappropriate situation.

I'm pretty sure the OP didn't really need the extra detail so it was more for the benefit of other readers with less understanding of the mechanism.

Cheers,
Gary.

steffen_moeller
Joined: 9 Feb 05
Posts: 78
Credit: 1773655132
RAC: 0

The extra USB risers and the PCIe x16 extension have now all arrived, but I have not plugged anything extra in yet.

My concern is that while the machine runs very stably with SETI on the 5 cards, it reboots spontaneously after a few minutes when running 100% Einstein - no warning, no entry in the syslog about it (I have the syslog displayed in a remote terminal). Mixing Einstein with other projects, like PrimeGrid or SETI, is stable, which is presumably why I had not noticed that behaviour earlier, with a long-running PrimeGrid job still alongside. The internet cries "PSU" (power supply unit) over this kind of behaviour. I am a bit reluctant to agree, since the system is still missing a GPU (5 instead of 6), the cards run in the less power-demanding (?) mining mode, and sensors reports each card drawing <120 W with SETI, 155 W with PrimeGrid, and between 135 W and 145 W with Einstein. That totals about 600 W for SETI and <900 W for Einstein/PrimeGrid, and the PSU (https://www.enermaxeu.com/de/products/power-supplies/premium/platimax-d-f/platimax-d-f-models/750w-1200w/750w-1200w/) is nominally rated for 1200 W. That should be fine, right?
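Spelled out as a rough budget check (the per-card wattages are the sensors readings above; the 100 W allowance for CPU, mainboard and drives is only my guess):

```shell
# Worst case: 6 cards at the highest per-card reading (155 W, PrimeGrid),
# plus a guessed 100 W for CPU, mainboard and drives.
cards=6; per_card_w=155; rest_w=100
gpu_w=$((cards * per_card_w))
total_w=$((gpu_w + rest_w))
echo "GPUs: ${gpu_w} W, system: ${total_w} W of 1200 W rated"
# prints: GPUs: 930 W, system: 1030 W of 1200 W rated
```

So even the worst case stays under the 1200 W rating; when the average budget looks fine like this, the usual suspect is short transient spikes that the slow polling of `sensors` cannot see.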

Now, we know Einstein to be more demanding on the PCIe bus than other projects. Having read so much about electrical interference with those USB risers, I had some fun this morning designing 2 1/2 inch wide, letter-long aluminum foils wrapped in paper to separate the cards at the PCIe slots - no effect at all. But it was fun. Any chance it just happens because Saturn is in opposition?

What I have done now is to raise the checkpointing interval from 600 seconds to 2400, i.e. past the time any of the tasks needs to complete. So there should be less disturbance caused by the M.2 SSD that competes on the PCIe bus. This would also partly explain the increased chance of a reboot when the 5th workunit is processed in parallel: when any two of these were started at the same time, that would double the I/O. Just a gut feeling that this may make a difference - and then again, with 32 GB of RAM, there should be plenty of opportunity to schedule I/O at leisure.
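For reference, the knob I changed lives in BOINC's local preferences override - assuming the standard Debian/Ubuntu boinc-client layout, that is /etc/boinc-client/global_prefs_override.xml - as a minimal sketch:

```
<global_prefs_override>
  <disk_interval>2400</disk_interval>
</global_prefs_override>
```

disk_interval is the "write to disk at most every N seconds" preference; `boinccmd --read_global_prefs_override` makes the running client pick it up without a restart.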

For obvious reasons I cannot keep watching manually over the tasks to avoid too many Einstein tasks running. So I also set the "time between tasks" down to 10 minutes, and later to 3 minutes, hoping this would mix tasks between projects a bit more. But that does not seem to be the case. So I'll finish the SETI tasks that I have now downloaded and then fall back to the lonely long-running (1+ days) PrimeGrid task in the background.

The 6th card I'll add tomorrow with a 45 cm LINKUP x16 riser - that riser cable set me back 1/10th of what an additional rig costs on eBay, but we want to learn about the effect of x16 vs x1 in this multi-GPU setup, right? My hunch is that any difference will come not from the wider bus (nobody fully uses the PCIe lanes any more, right?) but from lower latencies, since we also eliminate that USB bridge. So, when sending many small packets and waiting for an acknowledgement that they have arrived, this should make a difference. But if so, then this should also work on the x1 sockets, right? It just costs 10 times more. I'll let this sink in overnight and possibly try that first before adding another card: substitute one of the current USB-based x1 risers with that direct x16 connection.

Comments on what to do to identify the cause of the very reproducible spontaneous reboots are most welcome.
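Meanwhile, here is what I plan to look at myself (assuming systemd; the commands are generic, nothing board-specific):

```shell
# Keep journald logs across reboots so the crash context survives:
#   sudo mkdir -p /var/log/journal && sudo systemctl restart systemd-journald
# After the next spontaneous reboot, inspect the end of the previous boot
# and count machine-check (hardware) errors in the kernel log.
# Guarded so the snippet runs cleanly on any box:
if command -v journalctl >/dev/null 2>&1; then
  journalctl -b -1 --no-pager 2>/dev/null | tail -n 30
  journalctl -k --no-pager 2>/dev/null | grep -icE 'mce|machine check' || true
else
  echo "journalctl not available"
fi
```

A hard power cut typically leaves nothing at all in the log (pointing at PSU/power), while a kernel panic or MCE leaves traces - that distinction is what this is for.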

 

steffen_moeller
Joined: 9 Feb 05
Posts: 78
Credit: 1773655132
RAC: 0

About time for an update. A not so fortunate one.

The 5 PCIe x1 slots physically don't accept any longer cards. I would need to cut away the closed end of the slot's plastic.

The x16 port apparently interferes with the storage attached via M.2. When I have the 6th card plugged in - with either the x1 or the x16 connection - the machine does not boot. I don't get into the BIOS, and only after a long time, say 4 minutes, do I eventually read on the screen that no boot medium was found.

The miners use a USB stick to boot. I'll finish the currently downloaded E@H tasks and then check if that problem goes away with a decent enough USB live distro. If so, then maybe next week brings a SATA drive and a summary of what I have learned.

steffen_moeller
Joined: 9 Feb 05
Posts: 78
Credit: 1773655132
RAC: 0

So, here is what I learned running Linux from a USB 3.0 stick on that same rig. I first tried Knoppix, but it failed to come up with a usable desktop and I could not change to a tty.

I then tried Ubuntu's "Try out" mode from a USB 3.0 stick, flashed on a Mac with Etcher (not unetbootin). I installed boinc-client-opencl and it immediately worked. Slowly, though: WUs needed 2500 s, i.e. about three times longer. The GPU fans kept restarting every 30 seconds, as if the cards were repeatedly reset. Total CPU time was about the same, i.e. the CPU was barely used. Still, I was not unhappy to have something basic working right out of the box just by installing the boinc-client-opencl package.

Presuming outdated drivers in the default Ubuntu install, I then updated to the amdgpu-pro driver offered on AMD.com for Ubuntu 18.04, as I have for the installation on the M.2 SSD. A reboot reminded me that this live system was not persistent. I then installed to a second USB stick and rebooted from that with the latest drivers, but it was just as slow (https://einsteinathome.org/de/task/869709503). I created a RAM disk over /var/lib/boinc-client/slots, but that did not help at all. Is this because the M.2 SSD is still installed? Rebooting into the M.2 SSD with the USB stick still plugged in returned the old compute times of 730 seconds.
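One thing worth checking when an install is this much slower is which OpenCL implementation the client actually picked up - Mesa/Clover vs the amdgpu-pro stack can make this kind of difference on some cards. A quick guarded look (clinfo is in the "clinfo" package):

```shell
# Print the OpenCL platform/device/driver the system exposes;
# this is what boinc-client-opencl will enumerate.
if command -v clinfo >/dev/null 2>&1; then
  clinfo 2>/dev/null | grep -E 'Platform Name|Device Name|Driver Version' \
    || echo "no OpenCL devices found"
else
  echo "clinfo not installed (apt install clinfo)"
fi
```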

SATA SSD is next. Suggestions are welcome.

 

solling2
Joined: 20 Nov 14
Posts: 219
Credit: 1564139881
RAC: 36826

Just a question about your BIOS settings: do you run it with mining mode enabled or disabled, max link speed Gen2 or auto, Above 4G decoding enabled or disabled, onboard graphics or not?

steffen_moeller
Joined: 9 Feb 05
Posts: 78
Credit: 1773655132
RAC: 0

solling2 wrote:
Just a question about your Bios settings:

This is all with Ubuntu 18.04 and the original GPU BIOS flashed back. If there are special Windows tools to fiddle with the GPUs, then I have not used them.

solling2 wrote:
do you run it with mining mode enabled or disabled, max link speed Gen2 or auto, Above 4G decoding enabled or disabled, onboard graphics or not?

The GPUs run in their mining mode. I'll check max link speed and "Above 4G" tomorrow when the SATA drive arrives.

Yes, the CPU's onboard graphics are activated.

solling2
Joined: 20 Nov 14
Posts: 219
Credit: 1564139881
RAC: 36826

I phrased my question ambiguously, sorry for that. It refers to the mainboard, because I was curious why it shows spontaneous reboots with five cards but not with one card. The original card BIOS seems just fine.

Joseph Stateson
Joined: 7 May 07
Posts: 173
Credit: 2951918256
RAC: 1157489

Your problem may be cooling.  I started using open-frame mining rigs for BOINC and ran into cooling problems.  I also had a problem with the CPU similar to yours.
 
CPU:
My rig came with a two-core Celeron.  It worked fine for 5 RX560s running Milkyway.  When I tried running SETI and Einstein I ran into problems.  When Milkyway was offline, only 2 of the RX560s ran Einstein and CPU usage was 100%.  When Milkyway came back online I might have two Einstein and two Milkyway tasks running (4 out of 5 RX560s), and 18.04 was so sluggish I had to physically pull the plug, as I could not even do a controlled shutdown and the running work units were taking way too long to complete.
I replaced the G1840 Celeron with a Xeon E3-1230 for $87 USD.  That actually cost me more than the original T85 motherboard, Celeron, and shipping combined.  A BIOS upgrade was required, and since the Xeon does not include Intel graphics I had to first update the BIOS with the Celeron in place, then swap in the Xeon.  This did allow me to crunch SETI and Einstein with a fully responsive 18.04 desktop.  Used socket 1150 chips are much more expensive than 1366 chips, and your socket 1151 may cost even more to upgrade.  I could not find any 1366 mining-rig motherboards.
 
Temps:
When I put an RX580 in with the five RX560s, the system would frequently shut down.  It ran hot and I am pretty sure that caused the problem.  Your system rebooting may be a different problem, but it could be heat related.
What does the "sensors" command show for temps?
The following is a before-and-after comparison of temperatures.  The "before" was taken after a cold boot with BOINC not running; the "after" a few minutes after starting Einstein.  Note that the temps went up by about 35 °C, and these were all two-fan RX560s, which run cool compared to RX580s.
 
 
```
jstateson@rx560:~$ sensors > after.txt
jstateson@rx560:~$ diff before.txt after.txt
3,6c3,6
< vddgfx:       +0.80 V
< fan1:        2828 RPM  (min =    0 RPM, max = 5500 RPM)
< temp1:        +41.0°C  (crit = +94.0°C, hyst = -273.1°C)
< power1:        6.21 W  (cap =  48.00 W)
---
> vddgfx:       +1.15 V
> fan1:        2775 RPM  (min =    0 RPM, max = 5500 RPM)
> temp1:        +72.0°C  (crit = +94.0°C, hyst = -273.1°C)
> power1:       36.14 W  (cap =  48.00 W)
10,13c10,13
< vddgfx:       +0.80 V
< fan1:        3037 RPM  (min =    0 RPM, max = 5500 RPM)
< temp1:        +36.0°C  (crit = +94.0°C, hyst = -273.1°C)
< power1:        7.08 W  (cap =  48.00 W)
---
> vddgfx:       +1.15 V
> fan1:        3243 RPM  (min =    0 RPM, max = 5500 RPM)
> temp1:        +70.0°C  (crit = +94.0°C, hyst = -273.1°C)
> power1:       47.22 W  (cap =  48.00 W)
18,19c18,19
< fan1:        1190 RPM  (min =    0 RPM, max = 5500 RPM)
< temp1:        +38.0°C  (crit = +94.0°C, hyst = -273.1°C)
---
> fan1:        1174 RPM  (min =    0 RPM, max = 5500 RPM)
> temp1:        +43.0°C  (crit = +94.0°C, hyst = -273.1°C)
30,33c30,33
< vddgfx:       +0.80 V
< fan1:        3091 RPM  (min =    0 RPM, max = 5500 RPM)
< temp1:        +37.0°C  (crit = +94.0°C, hyst = -273.1°C)
< power1:        8.22 W  (cap =  48.00 W)
---
> vddgfx:       +0.85 V
> fan1:        3003 RPM  (min =    0 RPM, max = 5500 RPM)
> temp1:        +73.0°C  (crit = +94.0°C, hyst = -273.1°C)
> power1:       49.04 W  (cap =  48.00 W)
38,40c38,40
< fan1:        2983 RPM  (min =    0 RPM, max = 5500 RPM)
< temp1:        +36.0°C  (crit = +94.0°C, hyst = -273.1°C)
< power1:        7.23 W  (cap =  48.00 W)
```
The temps I listed above are actually too high; even the RX560 can overheat and stop.  I used "fancontrol" and "pwmconfig" to get maximum RPM as shown below.  There may be better tools than fancontrol - maybe there is a proprietary AMD fan control like the nVidia one discussed here: http://www.gpugrid.org/forum_thread.php?id=4962
```
root@rx560:/home/jstateson# sensors
amdgpu-pci-0600
Adapter: PCI adapter
vddgfx:       +1.15 V
fan1:        6282 RPM  (min =    0 RPM, max = 5500 RPM)
temp1:        +65.0°C  (crit = +94.0°C, hyst = -273.1°C)
power1:       47.13 W  (cap =  48.00 W)

amdgpu-pci-0400
Adapter: PCI adapter
vddgfx:       +1.15 V
fan1:        6573 RPM  (min =    0 RPM, max = 5500 RPM)
temp1:        +59.0°C  (crit = +94.0°C, hyst = -273.1°C)
power1:       47.09 W  (cap =  48.00 W)

amdgpu-pci-0100
Adapter: PCI adapter
vddgfx:       +1.10 V
fan1:        3206 RPM  (min =    0 RPM, max = 5500 RPM)
temp1:        +55.0°C  (crit = +94.0°C, hyst = -273.1°C)
power1:        0.00 W  (cap =  48.00 W)

acpitz-virtual-0
Adapter: Virtual device
temp1:        +27.8°C  (crit = +97.0°C)
temp2:        +29.8°C  (crit = +97.0°C)
temp3:       -273.2°C  (crit = +110.0°C)

amdgpu-pci-0500
Adapter: PCI adapter
vddgfx:       +0.85 V
fan1:        6597 RPM  (min =    0 RPM, max = 5500 RPM)
temp1:        +47.0°C  (crit = +94.0°C, hyst = -273.1°C)
power1:        8.22 W  (cap =  48.00 W)

amdgpu-pci-0300
Adapter: PCI adapter
vddgfx:       +1.15 V
fan1:        6728 RPM  (min =    0 RPM, max = 5500 RPM)
temp1:        +59.0°C  (crit = +94.0°C, hyst = -273.1°C)
power1:       48.11 W  (cap =  48.00 W)

coretemp-isa-0000
Adapter: ISA adapter
Package id 0:  +46.0°C  (high = +86.0°C, crit = +92.0°C)
Core 0:        +46.0°C  (high = +86.0°C, crit = +92.0°C)
Core 1:        +45.0°C  (high = +86.0°C, crit = +92.0°C)
Core 2:        +38.0°C  (high = +86.0°C, crit = +92.0°C)
Core 3:        +40.0°C  (high = +86.0°C, crit = +92.0°C)
```
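For reference, on amdgpu the fans can also be driven directly through sysfs, which is what fancontrol does under the hood. A minimal sketch (the hwmon paths vary per card and kernel; writing needs root):

```shell
# amdgpu exposes per-card fan control under the hwmon tree:
#   pwm1_enable: 1 = manual control, 2 = automatic
#   pwm1:        0-255 duty cycle
for hw in /sys/class/drm/card*/device/hwmon/hwmon*; do
  [ -e "$hw/pwm1" ] || continue
  echo "$hw: mode=$(cat "$hw/pwm1_enable") pwm=$(cat "$hw/pwm1")"
  # To force full speed on that card (as root):
  #   echo 1 > "$hw/pwm1_enable" && echo 255 > "$hw/pwm1"
done
```

fancontrol is still the better choice long-term because it scales the duty cycle with temperature instead of pinning it.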
steffen_moeller
Joined: 9 Feb 05
Posts: 78
Credit: 1773655132
RAC: 0

The SATA drive has arrived and I am running a fresh Ubuntu that detects all 5 cards but shows the very same symptoms as the USB drive - slow, as in 40-45 min per WU instead of 12 - the very first ones are still computing. It is not the temperature, I am sure, since it all worked reasonably well with the M.2 NVMe card, except for the 6th card competing with that drive.

Here are the temps with (app_config.xml pending) three of the five GPUs active:

```

coretemp-isa-0000
Adapter: ISA adapter
Package id 0:  +26.0°C  (high = +84.0°C, crit = +100.0°C)
Core 0:        +24.0°C  (high = +84.0°C, crit = +100.0°C)
Core 1:        +24.0°C  (high = +84.0°C, crit = +100.0°C)

amdgpu-pci-0500
Adapter: PCI adapter
vddgfx:       +0.75 V  
fan1:         790 RPM
temp1:        +34.0°C  (crit = +94.0°C, hyst = -273.1°C)
power1:       35.09 W  (cap = 160.00 W)

amdgpu-pci-0300
Adapter: PCI adapter
vddgfx:       +1.10 V  
fan1:         803 RPM
temp1:        +48.0°C  (crit = +94.0°C, hyst = -273.1°C)
power1:       82.17 W  (cap = 160.00 W)

acpitz-virtual-0
Adapter: Virtual device
temp1:        +27.8°C  (crit = +119.0°C)
temp2:        +29.8°C  (crit = +119.0°C)

amdgpu-pci-0600
Adapter: PCI adapter
vddgfx:       +0.75 V  
fan1:         792 RPM
temp1:        +28.0°C  (crit = +94.0°C, hyst = -273.1°C)
power1:       36.22 W  (cap = 160.00 W)

amdgpu-pci-0400
Adapter: PCI adapter
vddgfx:       +1.10 V  
fan1:         808 RPM
temp1:        +57.0°C  (crit = +94.0°C, hyst = -273.1°C)
power1:       89.21 W  (cap = 160.00 W)

amdgpu-pci-0200
Adapter: PCI adapter
vddgfx:       +1.10 V  
fan1:         821 RPM
temp1:        +56.0°C  (crit = +94.0°C, hyst = -273.1°C)
power1:       99.15 W  (cap = 160.00 W)
```

Once these 3 WUs are done, I'll install the AMD drivers.
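The pending app_config.xml for limiting concurrent tasks would be a sketch like the following - it goes into the einsteinathome.org project directory, and the manager's "Read config files" applies it. Note the app name below is my assumption; the real name is whatever appears in client_state.xml:

```
<app_config>
  <app>
    <name>hsgamma_FGRPB1G</name>
    <gpu_versions>
      <gpu_usage>1.0</gpu_usage>
      <cpu_usage>0.5</cpu_usage>
    </gpu_versions>
    <max_concurrent>3</max_concurrent>
  </app>
</app_config>
```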

 
