Multi GPU rig errors
I have been trying to build multi-GPU rigs for use with DC. I am not a miner, and it has been quite the learning experience. I have one rig (all 2080 Ti's) up and running successfully. The other one includes a mixture of 3080's and is presenting its own set of problems. When I get to five GPUs, it may run for a while but eventually spontaneously reboots or stops getting work altogether because of errors.
My question: Aside from managing the additional power load (two 1200 watt Corsair PSUs), to what extent do the CPU core count and speed, or the MB for that matter, affect the rig's abilities? The 2080 Ti rig runs smoothly with an i5-8400 CPU on an ASUS Z390-A MB, but something is holding me back on the 3080+ rig with the same CPU & MB. The rig is shut down right now but is not hidden.
PS: I am running one task per GPU, including MeerKat work, and typically power limit all at 80%.
Thanks,
Most of the recent errors on the 3080 system look to be because you’re missing some input file(s). Maybe a problem downloading them or the file has been corrupted.
I would try resetting the project to erase all the files and re-download them.
_________________________________________________________________________
OK, thanks, Done, and no work available. I think the errors have me task limited.
I guess I'll have to wait it out.
I got impatient and removed Boinc using Mint's software manager. Then reinstalled Boinc. Einstein restarted right away.
But while I was drafting this note the rig spontaneously rebooted.
It has done this before, but I assumed it was coincidental.
Also, there are "computational error" notes in the task list of Boinc manager.
If I am not looking at a hardware problem, I think I need to manually delete all Boinc files, not just the ones Software Manager removes. There are still files in there that I installed with Petri's optimized files.
Any thoughts or suggestions??
This seems suspiciously like a PSU problem. I know that you have many hosts; which computer is having the "spontaneous reboot" problem? And what is your PSU make, 80+ rating and wattage? Also, what GPUs are being used? Are they all the same?
Proud member of the Old Farts Association
The PSU's are two Corsair HX1200, 80 plus Platinum, both recently purchased.
Your question prompted me to check whether they are single rail or multi rail and they were set to multi. I switched them to single rail and re-started because that is the same as my "good" rig. But I am still hearing a lot of speeding up and slowing down of fans.
Computer ID: 12884324
The GPU's are a mix of 3080, 3080 ti and one 3090 ti.
Just so I am not misunderstanding you, the 2 Corsair HX1200 PSUs you described are BOTH plugged into the 5 GPUs for computer ID: 12884324?
The speeding up and slowing down of fans... is that from the GPUs or the PSUs? Or even the fans used for cooling your PC? If you can't tell by listening, can you tell by looking at each of them? Not knowing your setup, whether you are running it in a case or on a mining rig, it is impossible to tell.
The reason I asked about which PSU(s) and what and how many GPUs you're using is because one of the PSUs may be shutting down without warning, rebooting the rig. That can happen when a power spike from one or more of the GPUs exceeds a protection threshold in the PSU, or when over-current protection trips because the sustained current draw is too high for the PSU to handle.
Not knowing EXACTLY what GPUs you have (brand & model #) I can only speculate on the potential wattage. My speculation is from EVGA's listings for 3 x RTX 3080 FTW3 @ 350W ea, 1 x RTX 3080 Ti FTW3 @ 350W, and 1 x RTX 3090 Ti FTW3 @ 450W. And this does NOT take into account whether they are overclocked.
If you have the three 3080's plugged into one 1200W PSU, that's 350W x 3 = 1050W.
If you have the single 3080 Ti and 3090 Ti plugged into the other 1200W PSU, that's 350W + 450W = 800W.
Plus the wattage of your CPU (usually 125W-150W for your Intel, and much higher if you are also running CPU tasks), motherboard & HDD/SSD (easily 150W-200W), and whatever cooling you're using for the CPU and GPUs and case (# fans??, water cooling??, etc. + 25W??).
1050W + 800W + 150W + 200W + (speculation) 25W = 2225W total. Your 2 x 1200W PSU's are 2400W, or 93% usage.
That means your PSUs are running at nearly 100% all the time you're running BOINC. For a PSU to "survive" in our environment (75%-80% usage, 100% of the time), it, or they, must be CAPABLE of meeting the high current loads listed for short durations, but should carry no more than 75%-80% of the PSU rating steadily. And this isn't even accounting for whether you might have a defective PSU.
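The arithmetic above can be redone as a small Python sketch, which makes it easy to swap in Kill-a-Watt readings later. Every wattage below is one of the speculative figures from this post (EVGA listings plus guesses for CPU, motherboard and cooling), not a measurement, and the names `PSU_RATING_W`, `SUSTAINED_TARGET` and `utilization` are only illustrative.

```python
# Rough power-budget check using the estimates quoted in this post.
# All wattages are speculative listing figures or guesses, not measurements.

PSU_RATING_W = 1200        # each Corsair HX1200
SUSTAINED_TARGET = 0.80    # upper end of the 75%-80% guideline

# Assumed split of GPUs across the two PSUs (from the post, unconfirmed).
gpu_loads_w = {
    "PSU 1 (3 x RTX 3080)": [350, 350, 350],
    "PSU 2 (3080 Ti + 3090 Ti)": [350, 450],
}
other_loads_w = [150, 200, 25]  # CPU, motherboard + drives, cooling (guesses)


def utilization(load_w, capacity_w):
    """Return load as a fraction of capacity."""
    return load_w / capacity_w


gpu_per_psu = {name: sum(watts) for name, watts in gpu_loads_w.items()}
total_w = sum(gpu_per_psu.values()) + sum(other_loads_w)
total_capacity_w = PSU_RATING_W * len(gpu_loads_w)
sustained_budget_w = SUSTAINED_TARGET * total_capacity_w

for name, watts in gpu_per_psu.items():
    share = utilization(watts, PSU_RATING_W)
    print(f"{name}: {watts} W of GPU load, {share:.0%} of {PSU_RATING_W} W")

print(f"Whole rig: {total_w} W of {total_capacity_w} W "
      f"({utilization(total_w, total_capacity_w):.0%})")
print(f"Sustained budget at {SUSTAINED_TARGET:.0%}: {sustained_budget_w:.0f} W")
print(f"Over budget by: {total_w - sustained_budget_w:.0f} W")
```

With these assumed numbers the rig lands at 2225 W against a 1920 W sustained budget, so it is roughly 305 W over the 80% guideline before any power limiting is taken into account.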
Much of this conjecture is based on speculation because of the lack of specifics on your setup. But I think it is fairly close anyway. Also, you just may have a different problem, such as a bad PSU-to-PCIe cable, or a weak motherboard, or something else. Who knows?
I'm placing my bet on an overloaded PSU.
You actually are pretty close in your suppositions. The primary PSU is close to its limit at 100% load (9.5 amps). I did load balance using a couple of Kill-a-Watt meters and have the readings somewhere. But clearly I'm pushing it thinking 1200W PSUs would be sufficient. These cards range from 320W to 450W. Plus all the components that go into a rig like this make it hard to troubleshoot.
All this has given me a couple more ideas to try.
Thanks again for the insights and thoughts.
To follow up with George's post:
What are the temps of your GPUs and CPU? Do you use something in LM to track that or are you just winging it? One program I use in LM is called 'Hardware Sensors Indicator', and you can configure it to track drives, CPUs and GPUs. I set mine to run at startup, and on the first startup clicked 'configure' to pick the temps I wanted to monitor. It now starts up and sits down by the clock, and I can click on it to see the temps if I wish.
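If a command-line check is preferred over a tray indicator, something like the Python sketch below could log GPU temperature and power draw. It assumes the NVIDIA proprietary driver is installed and `nvidia-smi` is on the PATH; the function names and the sample text are made up for illustration, and the sample values are not readings from this rig.

```python
import csv
import io
import subprocess

QUERY_FIELDS = ["index", "name", "temperature.gpu", "power.draw", "power.limit"]

# Made-up sample in the shape nvidia-smi prints with csv,noheader,nounits.
SAMPLE = (
    "0, NVIDIA GeForce RTX 3080, 67, 255.31, 256.00\n"
    "1, NVIDIA GeForce RTX 3090 Ti, 71, 358.10, 360.00\n"
)


def to_float(text):
    """Convert a field to float, or None if the driver reports e.g. [N/A]."""
    try:
        return float(text)
    except ValueError:
        return None


def parse_gpu_csv(text):
    """Parse csv,noheader,nounits output into a list of dicts."""
    rows = []
    for rec in csv.reader(io.StringIO(text), skipinitialspace=True):
        if len(rec) != len(QUERY_FIELDS):
            continue  # skip blank or malformed lines
        index, name, temp_c, draw_w, limit_w = (field.strip() for field in rec)
        rows.append({
            "index": int(index),
            "name": name,
            "temp_c": to_float(temp_c),
            "power_draw_w": to_float(draw_w),
            "power_limit_w": to_float(limit_w),
        })
    return rows


def read_gpus():
    """Query the local driver; only works where nvidia-smi is installed."""
    cmd = [
        "nvidia-smi",
        "--query-gpu=" + ",".join(QUERY_FIELDS),
        "--format=csv,noheader,nounits",
    ]
    out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    return parse_gpu_csv(out)


# Demonstrate the parser on the sample text; call read_gpus() on a real rig.
for gpu in parse_gpu_csv(SAMPLE):
    print(f"GPU {gpu['index']} {gpu['name']}: {gpu['temp_c']} C, "
          f"{gpu['power_draw_w']} W of {gpu['power_limit_w']} W limit")
```

Running `read_gpus()` in a loop and writing the rows to a file would leave a record of the last readings before a spontaneous reboot, although software sensors like this will not catch very short power spikes.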
A teammate used to keep second PSUs sitting next to his cases to run the bigger and faster GPUs as they came out. It would give your 1200s a break if you split a GPU off to a third PSU, and it wouldn't need to be a big fancy 1200W Platinum either to get you through until you can get another big fancy one. Amazon has 750W and 850W Gold ones in the $75 US range, and 750W Bronze ones for under $50 US on sale sometimes.
I don't know if you are a YouTube watcher or not, but there are some YT videos worth watching if you're so inclined. Many of the tech channels don't necessarily give reasons for abnormal behavior, like power spikes. And many have said that the nominal rating of a PSU should be cut to 50%-60% for computers that run 24/7 @ 100%. But with the newer stuff that's come out in recent years, that has been upped to 75%-80%.
Like Mikey said, you could add a 3rd PSU for a few bucks, or up the ante and get two 1600W PSUs with at least a Platinum rating. Some of the lesser-rated PSUs (Gold or lower) may have lesser-spec'ed components in their build.
FWIW, I personally won't go any lower than a 1000W, 80+ Platinum PSU from a MAJOR supplier for those reasons.
Here are a few links that I think you can enjoy reading and getting some good information from.
This list is current as of 06/11/2022, and if you mouse click on the GOLD listings you will find a pop-up-window of the pros and cons of that particular PSU.
https://cultists.network/140/psu-tier-list/
This spreadsheet has several good and useful pages about PSU cable compatibility for the likes of Corsair, EVGA, and Seasonic, Tom's Hardware and TechPowerUp reviews of PSUs, and more.
https://docs.google.com/spreadsheets/d/1eL0893Ramlwk6E3s3uSvH1_juom7SMG5SCNzP2Uov8w/edit#gid=1719706335&range=A19:B19
This "infobits" about GPU is essentially focused on Windows usage, but you may be able to transpose some of the info to Linux.
https://computerinfobits.com/why-does-my-gpu-utilization-spike/
You can, and should, read up carefully on the specific PSUs that you have and what they are capable of, or are NOT capable of... such as providing protection against GPU power spikes.
When is a 300w card not a 300w card? When its power draw spikes to almost 900w and trips your PSU!
Apparently, you cannot measure this phenomenon with software sensor-based tools like GPU-Z or HWiNFO64, but it has been measured by the following website:
AMD Radeon RX6900XT显卡功耗分析 – FCPOWERUP极电魔方
The Chinese website (above) is nearly impossible to read... but not entirely. Reading through the gobbledygook, they do use our base 10 numbering system, and I think you may be able to make at least some headway into it.
"So, What? I have a 1200w power-supply," you might be saying. "It's Platinum Rated!"
These power spikes can pull the PSU's output voltage down to levels deemed outside its safe zone.
Lastly, if you haven't seen these particular videos before, you should watch them, or watch them again even if you have seen them. From Gamers Nexus:
https://www.youtube.com/watch?v=wnRyyCsuHFQ
And this one by JayzTwoCents shows a similar GPU response, but while playing video games:
https://www.youtube.com/watch?v=6A0sLVgJ7qU&t=960s