Multi GPU rig errors

Pokey
Joined: 7 Jan 16
Posts: 14
Credit: 6998346019
RAC: 2229
Topic 228326

I have been trying to build multi-GPU rigs for use with distributed computing (DC).  I am not a miner, and it has been quite the learning experience. I have one rig (all 2080 Tis) up and running successfully.  The other one includes a mixture of 3080s and is presenting its own set of problems.  When I get to five GPUs, it may run for a while but eventually it spontaneously reboots or stops getting work altogether because of errors.

My question: aside from managing the additional power load (I have two 1200-watt Corsair PSUs), to what extent do the CPU core count and speed, or the motherboard for that matter, affect what the rig can handle?  The 2080 Ti rig runs smoothly with an i5-8400 CPU on an ASUS Z390-A motherboard, but something is holding me back on the 3080+ rig with the same CPU and motherboard.  The rig is shut down right now but is not hidden.

PS: I am running one task per GPU, including MeerKat work, and typically power limit all at 80%.

Thanks,

Ian&Steve C.
Joined: 19 Jan 20
Posts: 3945
Credit: 46754602642
RAC: 64134862

Most of the recent errors on the 3080 system look to be because you’re missing some input file(s). Maybe a problem downloading them or the file has been corrupted.

I would try resetting the project to erase all the files and re-download them.

Pokey
Joined: 7 Jan 16
Posts: 14
Credit: 6998346019
RAC: 2229

OK, thanks, Done, and no work available.  I think the errors have me task limited.

I guess I'll have to wait it out.

Pokey
Joined: 7 Jan 16
Posts: 14
Credit: 6998346019
RAC: 2229

I got impatient and removed BOINC using Mint's software manager.  Then reinstalled BOINC.  Einstein restarted right away.

But while I was drafting this note the rig spontaneously rebooted.

It has done this before, but I assumed it was coincidental.

Also, there are "computational error" notes in the task list of BOINC Manager.

If I am not looking at a hardware problem, I think I need to manually delete all BOINC files, not just the ones Software Manager removes.  There are still files in there that I installed with Petri's optimized files.

Any thoughts or suggestions??

  

GWGeorge007
Joined: 8 Jan 18
Posts: 3061
Credit: 4965227686
RAC: 1416688

Pokey wrote:

I got impatient and removed BOINC using Mint's software manager.  Then reinstalled BOINC.  Einstein restarted right away.

But while I was drafting this note the rig spontaneously rebooted.

It has done this before, but I assumed it was coincidental.

Also, there are "computational error" notes in the task list of BOINC Manager.

If I am not looking at a hardware problem, I think I need to manually delete all BOINC files, not just the ones Software Manager removes.  There are still files in there that I installed with Petri's optimized files.

Any thoughts or suggestions??

This seems suspiciously like a PSU problem.  I know that you have many hosts, so which computer is having the "spontaneous reboot" problem?  And what is your PSU's make, 80 Plus rating and wattage?  Also, which GPUs are being used?  Are they all the same?

George

Proud member of the Old Farts Association

Pokey
Joined: 7 Jan 16
Posts: 14
Credit: 6998346019
RAC: 2229

The PSUs are two Corsair HX1200 units, 80 Plus Platinum, both recently purchased.

Your question prompted me to check whether they are single rail or multi rail and they were set to multi.  I switched them to single rail and re-started because that is the same as my "good" rig.  But I am still hearing a lot of speeding up and slowing down of fans.

Computer ID: 12884324

The GPUs are a mix of 3080, 3080 Ti and one 3090 Ti.

GWGeorge007
Joined: 8 Jan 18
Posts: 3061
Credit: 4965227686
RAC: 1416688

Pokey wrote:

The PSUs are two Corsair HX1200 units, 80 Plus Platinum, both recently purchased.

Your question prompted me to check whether they are single rail or multi rail and they were set to multi.  I switched them to single rail and re-started because that is the same as my "good" rig.  But I am still hearing a lot of speeding up and slowing down of fans.

Computer ID: 12884324

The GPUs are a mix of 3080, 3080 Ti and one 3090 Ti.

Just so I am not misunderstanding you, the 2 Corsair HX1200 PSUs you described are BOTH plugged into the 5 GPUs for computer ID: 12884324?

The speeding up and slowing down of fans... is that from the GPUs or the PSUs?  Or even the fans used for cooling your PC?  If you can't tell by listening, can you tell by looking at each of them?  Not knowing your setup, whether you are running it in a case or on a mining rig, it is impossible to tell.

The reason I asked which PSU(s) and which and how many GPUs you're using is that one of the PSUs may be shutting down and rebooting without warning because a power spike from one or more of the GPUs exceeds the over-power protection (OPP) threshold in the PSU.  Or over-current protection (OCP) could be tripping because the current draw is too high for the PSU to handle.

Not knowing EXACTLY what GPUs you have (brand & model #) I can only speculate on the potential wattage.  My speculation is from EVGA's listings for 3 x RTX 3080 FTW3 @ 350W ea, 1 x RTX 3080 Ti FTW3 @ 350W, and 1 x RTX 3090 Ti FTW3 @ 450W.  And this does NOT take into account whether they are overclocked.

If you have the three 3080's plugged into one 1200W PSU, that's 350W x 3 = 1050W.

If you have the single 3080 Ti and 3090 Ti plugged into the other 1200W PSU, that's 350W + 450W = 800W.

Plus the wattage of your CPU (usually 125W-150W for your Intel, and much higher if you are also running CPU tasks), motherboard & HDD/SSD (easily 150W-200W), and whatever cooling you're using for the CPU, GPUs and case (# fans??, water cooling??, etc. + 25W??).

1050W + 800W + 150W + 200W + (speculation) 25W = 2225W total.  Your 2 x 1200W PSU's are 2400W, or 93% usage.

That means your PSUs are running at nearly 100% the whole time you're running BOINC.  For a PSU to "survive" in our environment (loaded 100% of the time), it - or they - must be CAPABLE of meeting the high current loads listed for short durations, but should run steadily at no more than 75%-80% of the PSU rating.  And this isn't even accounting for whether you might have a defective PSU.
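
The budget above can be written out as a quick check. This is only a rough sketch using the speculative figures from this post (EVGA FTW3 board power, no overclocking); the 80% sustained-load target is the upper end of the 75%-80% range mentioned above, and the variable names are just for illustration.

```python
# Rough power budget using the speculative figures from this post.
gpu_watts = {
    "psu_1": [350, 350, 350],  # three RTX 3080 (assumed 350 W each)
    "psu_2": [350, 450],       # RTX 3080 Ti + RTX 3090 Ti (assumed)
}
other_watts = [150, 200, 25]   # CPU, motherboard + drives, cooling (guesses)
psu_rating_watts = 1200
psu_count = 2
sustained_target = 0.80        # assumed ceiling for 24/7 load

gpu_total = sum(sum(cards) for cards in gpu_watts.values())
total_watts = gpu_total + sum(other_watts)
capacity_watts = psu_rating_watts * psu_count
utilization = total_watts / capacity_watts
capacity_needed = total_watts / sustained_target

# GPU-only load per PSU, before CPU, motherboard and cooling are added.
for psu, cards in gpu_watts.items():
    print(f"{psu}: {sum(cards)} W of GPUs = {sum(cards) / psu_rating_watts:.0%} of rating")

print(f"Total draw: {total_watts} W of {capacity_watts} W ({utilization:.0%})")
print(f"Capacity needed to stay at {sustained_target:.0%}: {capacity_needed:.0f} W")
```

With these guesses, staying at or below 80% would take roughly 2,780 W of combined capacity, which is more than the two 1200W units provide.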

Much of this conjecture is based on speculation because of the lack of specifics on your setup.  But I think it is fairly close anyway.  Also, you just may have a different problem, such as a bad PSU-to-PCIe cable, or a weak motherboard, or something else.  Who knows?

I'm placing my bet on an overloaded PSU.

George

Proud member of the Old Farts Association

Pokey
Joined: 7 Jan 16
Posts: 14
Credit: 6998346019
RAC: 2229

You actually are pretty close in your suppositions.  The primary PSU is close to its limit at 100% (9.5 amps).  I did load-balance using a couple of Kill-a-Watt meters and have the readings somewhere.  But clearly I'm pushing it thinking 1200W PSUs would be sufficient.  These cards range from 320W to 450W.  Plus, all the components that go into a rig like this make it hard to troubleshoot.

All this has given me a couple more ideas to try.

Thanks again for the insights and thoughts.
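
For anyone following along, a Kill-a-Watt current reading can be turned into an approximate DC load on the PSU. This is only a sketch: the 120 V mains voltage, the power factor of 1.0 and the 89% efficiency (roughly the 80 Plus Platinum full-load figure on 115 V) are all assumptions rather than measurements from this rig, and `dc_load_from_wall` is a made-up helper name.

```python
def dc_load_from_wall(amps, mains_volts=120.0, power_factor=1.0, efficiency=0.89):
    """Estimate wall power and DC-side load from a wall current reading.

    All defaults are assumptions: 120 V mains, power factor near 1
    (active PFC), and about 89% efficiency near full load.
    """
    wall_watts = amps * mains_volts * power_factor
    dc_watts = wall_watts * efficiency
    return wall_watts, dc_watts


wall, dc = dc_load_from_wall(9.5)  # the 9.5 A reading quoted above
rating = 1200                      # HX1200 rated DC output in watts
print(f"Wall draw: {wall:.0f} W, estimated DC load: {dc:.0f} W "
      f"({dc / rating:.0%} of the {rating} W rating)")
```

A steady-state number like this says nothing about short transient spikes, which a Kill-a-Watt meter is too slow to show.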

mikey
Joined: 22 Jan 05
Posts: 12681
Credit: 1839084849
RAC: 3883

Pokey wrote:

The PSUs are two Corsair HX1200 units, 80 Plus Platinum, both recently purchased.

Your question prompted me to check whether they are single rail or multi rail and they were set to multi.  I switched them to single rail and re-started because that is the same as my "good" rig.  But I am still hearing a lot of speeding up and slowing down of fans.

Computer ID: 12884324

The GPUs are a mix of 3080, 3080 Ti and one 3090 Ti.

To follow up with George's post:

What are the temps of your GPUs and CPU?  Do you use something in Linux Mint (LM) to track that, or are you just winging it?  One program I use in LM is called 'Hardware Sensors Indicator', and you can configure it to track drives, CPUs and GPUs.  I set mine to run at startup, and on the first startup I clicked 'configure' to pick the temps I wanted to monitor.  It now starts up and sits down by the clock, and I can click on it to see the temps if I wish.
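
For NVIDIA cards, the same readings can also be pulled from the command line with `nvidia-smi`. Below is a minimal sketch, assuming the NVIDIA driver's `nvidia-smi` tool is installed; `parse_gpu_csv` and `read_gpus` are hypothetical helper names, and the sample text is made-up output in the expected format, not data from this rig.

```python
import subprocess

QUERY = "index,name,temperature.gpu,power.draw,power.limit"


def parse_gpu_csv(text):
    """Parse `nvidia-smi --format=csv,noheader,nounits` output into dicts.

    Assumes every field is numeric; a card reporting "[N/A]" would need
    extra handling.
    """
    rows = []
    for line in text.strip().splitlines():
        index, name, temp, draw, limit = [f.strip() for f in line.split(",")]
        rows.append({
            "index": int(index),
            "name": name,
            "temp_c": float(temp),
            "draw_w": float(draw),
            "limit_w": float(limit),
        })
    return rows


def read_gpus():
    """Query the driver once and return one dict per GPU."""
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_gpu_csv(result.stdout)


# Made-up sample in the same format, for illustration only.
SAMPLE = """0, NVIDIA GeForce RTX 3080, 71, 255.10, 256.00
1, NVIDIA GeForce RTX 3090 Ti, 78, 358.42, 360.00"""

if __name__ == "__main__":
    for gpu in parse_gpu_csv(SAMPLE):  # swap in read_gpus() on a real rig
        print(f"GPU {gpu['index']} {gpu['name']}: {gpu['temp_c']:.0f} C, "
              f"{gpu['draw_w']:.0f} W of {gpu['limit_w']:.0f} W limit")
```

Software readings like these are averaged, so they are useful for temperatures and steady power draw but will not show very short power spikes.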

mikey
Joined: 22 Jan 05
Posts: 12681
Credit: 1839084849
RAC: 3883

Pokey wrote:

You actually are pretty close in your suppositions.  The primary PSU is close to its limit at 100% (9.5 amps).  I did load-balance using a couple of Kill-a-Watt meters and have the readings somewhere.  But clearly I'm pushing it thinking 1200W PSUs would be sufficient.  These cards range from 320W to 450W.  Plus, all the components that go into a rig like this make it hard to troubleshoot.

All this has given me a couple more ideas to try.

Thanks again for the insights and thoughts. 

A teammate used to have second PSUs sitting next to his cases to run the bigger and faster GPUs as they came out.  It would give your 1200s a break if you split a GPU off to a third PSU, and it wouldn't need to be a big fancy 1200W Platinum either to get you through until you can get another big fancy one.  Amazon has 750-watt and 850-watt Gold ones in the $75 US range, and 750-watt Bronze ones for under $50 US on sale sometimes.

GWGeorge007
Joined: 8 Jan 18
Posts: 3061
Credit: 4965227686
RAC: 1416688

Pokey wrote:

You actually are pretty close in your suppositions.  The primary PSU is close to its limit at 100% (9.5 amps).  I did load-balance using a couple of Kill-a-Watt meters and have the readings somewhere.  But clearly I'm pushing it thinking 1200W PSUs would be sufficient.  These cards range from 320W to 450W.  Plus, all the components that go into a rig like this make it hard to troubleshoot.

All this has given me a couple more ideas to try.

Thanks again for the insights and thoughts.

I don't know if you are a YouTube watcher or not, but there are some YT videos worth watching if you're so inclined.  Many of the tech channels don't necessarily give reasons for abnormal behavior, like power spikes.  And many have said that a PSU's rated capacity should be derated to 50%-60% for computers that run 24/7 @ 100%.  But with the newer stuff that's come out in recent years, that has been raised to 75%-80%.

Like Mikey said, you could add a third PSU for a few bucks, or up the ante and get two 1600W PSUs with at least a Platinum rating.  Some of the lesser-rated PSUs (Gold or lower) may have lesser-spec'ed components in their build.

FWIW, I personally won't go any lower than a 1000W, 80+ Platinum PSU from a MAJOR supplier for those reasons.

Here are a few links that I think you can enjoy reading and getting some good information from.

This list is current as of 06/11/2022, and if you mouse click on the GOLD listings you will find a pop-up-window of the pros and cons of that particular PSU.

https://cultists.network/140/psu-tier-list/

This spreadsheet has several useful pages: PSU cable compatibility for the likes of Corsair, EVGA and Seasonic; Tom's Hardware and TechPowerUp reviews of PSUs; and more.

https://docs.google.com/spreadsheets/d/1eL0893Ramlwk6E3s3uSvH1_juom7SMG5SCNzP2Uov8w/edit#gid=1719706335&range=A19:B19

This "infobits" article about GPU utilization spikes is essentially focused on Windows usage, but you may be able to carry some of the info over to Linux.

https://computerinfobits.com/why-does-my-gpu-utilization-spike/

You can, and should, read up carefully on the specific PSUs that you have and what they are capable of, or are NOT capable of...  such as providing protection against GPU power spikes.

When is a 300w card not a 300w card?  When its power draw spikes to almost 900w and trips your PSU!

Apparently, you cannot measure this phenomenon with software sensor-based tools like GPU-Z or HWiNFO64, but it has been measured by the following website:

AMD Radeon RX6900XT graphics card power consumption analysis – FCPOWERUP (极电魔方)

The website (above) is in Chinese and nearly impossible to read...  but not entirely.  Reading through the gobbledygook, they do use our base 10 numbering system, and I think you may be able to make at least some headway into it.

"So, What? I have a 1200w power-supply," you might be saying. "It's Platinum Rated!"

These power spikes can pull the supply voltage down to levels outside the safe zone of your PSU.

Lastly, if you haven't seen these particular videos before, you should watch them, or watch them again even if you have.  From Gamers Nexus:

https://www.youtube.com/watch?v=wnRyyCsuHFQ

And this one by JayzTwoCents shows a similar GPU response, but while playing video games:

https://www.youtube.com/watch?v=6A0sLVgJ7qU&t=960s

George

Proud member of the Old Farts Association
