Troubleshooting multiple-GPU setups that use riser cards

Tom M
Joined: 2 Feb 06
Posts: 1,123
Credit: 1,966,998,476
RAC: 4,844,425
Topic 225342

The reason I am starting this thread is that I don't want the content to get buried in the generic multiple-GPU thread.

If you see a system with more than 2 cards displayed, there is an excellent chance it is using "external" GPU solutions. I have another MB with 3 long slots on order; I will be testing it with 3 GPUs directly on the MB. My last attempt, with an X470, failed on a chipset hardware bug.

The cheaper solutions I am interested in are variations on "mining system" hardware that let a motherboard run all the slots it has, with a discrete GPU in each slot.

There are multiple variations on this kind of multiple-GPU setup.

  1.  Motherboards with up to 19-20 slots.
  2.  An expansion card in one short slot to drive up to 4 GPUs, or a long-slot expansion card to drive up to 8 GPUs.
    1. Note: there are specific limits to the number of GPUs a motherboard will boot with, even if you use expansion cards and enable the BIOS option called "Above 4G decoding".
  3. Riser-card combos, which come in two general classes:
    1. PCIe-to-USB adapter in the slot, USB 3.0 cable, powered riser board under the GPU.
    2. Pass-through ribbon cables that make an entire long slot's bandwidth available to the GPU.
    3. The USB style (3.1) frequently provides sufficient bandwidth for many BOINC GPU projects.
    4. Some BOINC GPU projects, however, require full-bandwidth connectivity to process at any reasonable speed on your GPU.
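For a sense of what the riser classes above mean in bandwidth terms, here is a rough per-lane throughput sketch. The transfer rates and encodings are the standard published PCIe figures; real-world throughput is somewhat lower because of protocol overhead, and the function name is mine:

```python
# Rough per-lane PCIe throughput, ignoring overhead beyond line encoding.
ENCODING = {"1.0": 8 / 10, "2.0": 8 / 10, "3.0": 128 / 130}
RATE_GT = {"1.0": 2.5, "2.0": 5.0, "3.0": 8.0}  # GT/s per lane

def lane_gbps(gen: str, lanes: int) -> float:
    """Approximate usable GB/s for a PCIe link of the given width."""
    return RATE_GT[gen] * ENCODING[gen] * lanes / 8  # Gb/s -> GB/s

# A USB-style x1 riser vs. a full x16 ribbon on the same Gen 3 slot:
print(round(lane_gbps("3.0", 1), 2))   # -> 0.98 (x1 riser)
print(round(lane_gbps("3.0", 16), 2))  # -> 15.75 (x16 ribbon)
```

The 16x gap between the two riser classes is why some projects run fine on the USB style while others need the ribbon.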

Over the hill?  What hill?  I don't REMEMBER any hill...
A Proud member of the O.F.A. (I've forgotten what that stands for.... ;)

Tom M
Joined: 2 Feb 06
Posts: 1,123
Credit: 1,966,998,476
RAC: 4,844,425

The specific system I am currently struggling with is this one.

The system has many interesting features. Probably the most important is that it inherited the GPUs from this system, but it has an entirely different MB/CPU/OS.

I have been tinkering with 3 different MB models that feature the Intel 300-series chipset, which supports the 8th/9th-generation Intel CPUs, specifically the i9-9900 series: 8c/16t at various performance levels on an LGA 1151 socket.

My B360-F Pro (18 slots) seems to have been lost by FedEx during an RMA return.

My H310-F Pro (13 slots) has a hardware fault that won't let me use RX 5700 GPUs on it. I am waiting for an RMA number for it.

I had two B360-A Pro (6 slots) motherboards. The first was used and had a damaged long slot. The second was brand new, but it turned out it had been bought in Australia and somehow made its way here. Still, it is new and works like new.

So I have a B360-A Pro MB with an i9-9900 CPU, 2 sticks of 3200 MHz 8 GB RAM running at 2677(?), and all the rest of the "presumably" working GPU/riser card setups.

I am currently running 4 of the 5 available RX 5700 GPUs under Ubuntu.

I have at least 2 bad riser card setups left over from the previous Windows 10 setup with 5 working GPUs.

It required unplugging all the GPUs and trying one at a time until I found the non-working one(s). Trying to do it with multiple cards plugged in turned out to be too complicated for me.
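When group tests are reliable and only one unit is bad, a halving search can cut the number of test boots from N down to about log2(N). This is purely a conceptual sketch, not anything from the thread: `is_group_ok` is a hypothetical stand-in for "boot with only these cards plugged in and see if they all come up". With flaky risers or multiple faults, the one-at-a-time approach is the safer bet.

```python
def find_bad(units, is_group_ok):
    """Binary-search for a single faulty riser/GPU pair.

    `units` is a list of identifiers; `is_group_ok(subset)` represents a
    boot test with only that subset installed. Assumes exactly one bad
    unit; with several, fall back to testing one at a time.
    """
    lo, hi = 0, len(units)
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_group_ok(units[lo:mid]):  # bad unit must be in the other half
            lo = mid
        else:
            hi = mid
    return units[lo]

# Simulated example: riser "C" of five is the bad one.
bad = "C"
print(find_bad(list("ABCDE"), lambda group: bad not in group))  # -> C
```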

This was made easier because I had my video output coming from the iGPU of the CPU. Some server-class MBs have onboard video hardware even if their CPUs don't have iGPUs.

I know that Ian&Steve as well as Keith Myers have been active with this kind of multiple-GPU-with-riser-cards setup on BOINC.

We have multiple 3-to-10 GPU systems in the top 50; 4-7 or 8 GPUs are the most common. I am pretty sure they are all using some kind of mining rig with externally mounted GPU cards, but that is speculation. Anyone who has managed to fit 4 or more GPUs inside a case should probably speak up. I would be interested, and I think others would be too.

Tom M

Ian&Steve C.
Joined: 19 Jan 20
Posts: 890
Credit: 5,709,134,468
RAC: 32,283,497

I prefer to use the PCIe 3.0 (shielded) ribbon cables.

Firstly, because I contribute with a heavy focus on GPUGRID, which is one of the projects that greatly benefits from PCIe bandwidth (to a point: 3.0 x1 will cause slowdowns, but 3.0 x8 will not; the limit for GPUGRID seems to be around PCIe 3.0 x4, depending on how fast the GPU is).

Secondly, because since I started using them, they've proven to be the most reliable and the easiest to manage. Less cable clutter. Fewer things to plug in, come loose, be flaky, or fail.

Those expander cards are really geared toward mining, where PCIe speed/bandwidth makes no difference; you can mine on PCIe 1.0 x1. Most of the expander cards I've seen (unless they are proper, expensive, PLX-based units) can't handle more than PCIe 2.0. And the USB-cable risers almost always need a USB cable upgrade to run reliably at PCIe 3.0 (and can sometimes still be flaky even with good cables).

It is also a good idea to look up the block diagram for whatever board you're trying to use. Figure out which PCIe lanes come directly from the CPU and which run through the chipset. Then make sure you understand the link between the chipset and the CPU, because there can be contention if you try to do too much through the chipset: it is a bottleneck (for example, Intel's common CPU<->chipset link on consumer boards is PCIe 3.0 x4, shared by ALL chipset devices). This consideration is largely why I moved away from the consumer platform to an AMD EPYC board, which provides 128 CPU-based PCIe lanes and no chipset nonsense.
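The chipset-uplink bottleneck can be put in rough numbers. The "PCIe 3.0 x4 shared uplink" figure comes from the post above; the arithmetic, constants, and function name are my own sketch:

```python
# Rough numbers for the CPU<->chipset bottleneck on consumer Intel boards.
# Assumption: the uplink behaves like PCIe 3.0 x4 (8 GT/s/lane, 128b/130b),
# shared by every chipset-attached device (GPUs, SATA, USB, NICs, ...).
UPLINK_GBPS = 8.0 * (128 / 130) * 4 / 8  # ~3.94 GB/s total

def per_gpu_share(n_chipset_gpus: int) -> float:
    """Best-case GB/s per chipset-attached GPU if all pull data at once."""
    return UPLINK_GBPS / n_chipset_gpus

print(round(per_gpu_share(1), 2))  # one GPU: same as a direct x4 slot
print(round(per_gpu_share(4), 2))  # four GPUs: no better than a x1 riser each
```

In other words, hanging four bandwidth-hungry GPUs off the chipset leaves each one with roughly x1-class throughput even if every physical slot is x4 or wider.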

Yeah, the PCIe 3.0-capable ribbons are more expensive, but you get what you pay for.

Only two of my systems use the risers, though. The single-GPU systems aren't hard to imagine inside a case, but my 7x RTX 2080 system actually has all GPUs plugged directly into the motherboard; they were all converted to watercooling with 1-slot waterblocks. The radiator is mounted externally, though.

8x RTX 2070 on mining frame

7x RTX 2080Ti on mining frame

7x RTX 2080 direct to motherboard in case

_____________________________________________

earthbilly
Joined: 4 Apr 18
Posts: 59
Credit: 1,140,229,967
RAC: 13

Nice photos, Ian&Steve C. Figuring out that platform so I can post photos is another lesson for me. Cool.

I tend to put more space between GPUs by custom-building my rig frames to have more room.

It should be noted to use VGA (PCIe) power cables to power the PCIe riser cards, not SATA plugs. The risers are designed and built to deliver the same power as a board slot, 75 watts. A SATA plug, much less two or three, plugged into our risers is not sufficient power.

I put one mining frame in the middle, with one board on the bottom bunk driving 3 risers; a second computer connected to one riser, with one card in slot 1 and one card on the side of the upper bunk; and a third computer on the other side with 2 or 3 GPUs, the extra ones also on the upper bunk. All three computers share one monitor and one keyboard and mouse. I can scroll through the monitor inputs and see what every rig is up to, since I never let the video output turn off on any computer, and I can do whatever is needed right at the monitor.
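To put rough numbers on the riser-power point: the figures below are typical published connector ratings, not measurements from this thread, so treat them as assumptions.

```python
# Back-of-envelope check on powering risers from SATA vs. PCIe plugs.
SLOT_DRAW_W = 75.0        # a PCIe x16 slot (and thus a riser) may supply up to 75 W
SATA_12V_W = 4.5 * 12.0   # SATA power: ~4.5 A allowed on the 12 V pins -> 54 W
PCIE_6PIN_W = 75.0        # 6-pin PCIe ("VGA") plug: rated for the full 75 W

print(SATA_12V_W >= SLOT_DRAW_W)   # False: one SATA plug can't cover slot power
print(PCIE_6PIN_W >= SLOT_DRAW_W)  # True: a 6-pin plug can
```

A GPU pulling its full 75 W slot budget through a SATA plug runs the connector well past its rating, which is consistent with the melted-riser stories in mining circles.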

I find the type of task is very important to how many risers to use per mobo. Some tasks don't care how many GPUs we have on one rig. A task like the gamma-ray search on GPU seems to show diminishing marginal returns, so I limit my GPUs to 3 on a 6-slot mobo. When I moved over here from MilkyWay@home, I had to spread out the GPUs for gamma-ray on GPU. I expect a board with an ample bus may be different.

My computers are right there to see in my computer area. Only one has all its GPUs inside the computer case. It is so nice to work on, and to keep GPUs cool, using a mining rig. Three monitors for 9 computers with 23 GPUs.

A bad riser can prevent all tasks from running. Risers do not handle improper power or bad VGA-power connections well; I've got a drawer full that don't work.

I keep one computer as a workbench. If I use an old FirePro for video out, I can put a nonfunctioning GPU, one that would otherwise crash the computer on boot, on a riser or in a lower mobo slot, and reflash the GPU BIOS, among other things.

What else may be interesting... other than that I sold all my GTX cards and purchased all AMD cards for a fraction of their value. Now the AMD cards are worth something too: I sold 5 AMD cards and paid off 15 others.

OH! Something I did that I never want to happen to anyone else: if you touch more than one GPU card with metal, of all places on the heatsinks, the GPUs will stop and may short, ruining chips on the GPU board forever. DO NOT TOUCH MORE THAN ONE HEAT PIPE OF ONE GPU WITH A METAL OBJECT!!! Touch across 4 GPUs on one mining rig and all four may need new parts soldered on.

Work runs fine on Bosons reacted into Fermions,

Sunny regards,

earthbilly

Tom M
Joined: 2 Feb 06
Posts: 1,123
Credit: 1,966,998,476
RAC: 4,844,425

This Windows 10 (and probably other Windows versions) issue shows up on both my Intel and AMD Ryzen systems.

It is a BSOD (Blue Screen of Death) error.

"Driver Overran Stack Buffer".

When installing GPU and motherboard drivers on my Intel box, I got it until I took almost all of my (6) GPUs offline. I can install GPU/motherboard drivers reliably with 1 GPU plugged in :)

If I get it while processing, and I have current drivers, no malware, etc., one cause mentioned in a troubleshooting article is the "CPU running too fast".

For my Intel system that means disabling the "Turbo" in the bios.

For my AMD Ryzen system(s) that means disabling at least Precision Boost Overdrive, and it may require disabling "CPU Boost" too.

Keith also reminded me that for Ryzen CPUs, PBO can make them unstable under either Windows or Linux when processing BOINC projects.

Tom M

Over the hill?  What hill?  I don't REMEMBER any hill...
A Proud member of the O.F.A. (I've forgotten what that stands for.... ;)

Keith Myers
Joined: 11 Feb 11
Posts: 1,320
Credit: 2,428,576,290
RAC: 7,590,565

I disable all auto "turbo" or "boost" functions on both Intel and AMD platforms.

I am not a gamer.  I am a cruncher.

I set a good fixed all-core frequency at a manageable voltage and temperature.

And just let the cpu run. No muss, no fuss.

earthbilly
Joined: 4 Apr 18
Posts: 59
Credit: 1,140,229,967
RAC: 13

I can get your blue screen of death with only one GPU installed if it has a questionable BIOS, usually when I buy a used 'mining BIOS' GPU from eBay. For one such MSI RX 570 4GB Armor OC I have never found a good BIOS that corrects the problem, other than not using that GPU! Nothing to do with OC, drivers, CPUs, etc.; just one bad GPU causing it. All this with Windows 10.

I'm so stubborn that I tried to solve it over and over, with the same results every time. The GPU finally ended up in a drawer, which solved both the blue-screen problem and my anger. It seems to happen when the GPU or memory clock gets stuck at full speed or at 0. Leave it long enough and the BSOD shows up, sometimes after a few minutes and sometimes before the restart fully boots. Some cards recover from the hangup on their own quickly enough that the system keeps going.

It would be so nice if the manufacturers were a little more forthcoming with stock signed BIOSes.

Sometimes this can be caused by a bad riser too, though not always. Rule out the risers, then look for which GPU was used with a beta mining BIOS; it's easy to find with GPU-Z. Also, those GPUs tend not to run with newer AMD drivers anyway. The last AMD driver the patcher app works on is PRO 20.Q2; after that, running the patcher gives 'file size too large'. All solved if we could get signed BIOSes from the company.

Might be your issue Tom.


Tom M
Joined: 2 Feb 06
Posts: 1,123
Credit: 1,966,998,476
RAC: 4,844,425

earthbilly wrote:

Might be your issue Tom.

Thank you. I have one known RX 5700 that I bought with the 5700 XT BIOS flash upgrade. When it gave me trouble (immediately) I flashed it back to the stock BIOS.

So far I have managed to extend the time between crashes by turning off Turbo Boost on my i9-9900, and I am now trying it with the slower stock RAM speeds.

It was crashing every 4 hours or so. After I turned off Turbo Boost it ran maybe 12 hours before it crashed. I am in the midst of the slower-RAM test.

I have managed to resurrect two problem-child RX 580s by flashing them back to their stock BIOS. They refused to run more than two threads and were crash-prone. They are now happily munching along at 3 threads per GPU on another system.

I have also sometimes been able to find an older Windows driver that was stable.

Tom M

earthbilly
Joined: 4 Apr 18
Posts: 59
Credit: 1,140,229,967
RAC: 13

Tom, you probably already do this: I let MSI Afterburner run its monitoring screen with just a very few parameters. GPU usage and temperature, sometimes power; CPU main temperature and usage. I run that screen all the time, stretched across the entire monitor. Using only the usage graph, I can correlate the GPU hanging on its GPU clock or memory clock directly with the blue screen of death.

I also tried trimming the clocks to stop it. That seems to slow the problem down but never fully gets rid of it until I can find a proper GPU BIOS.

When an AMD RX 570 4GB was under a hundred bucks, we could buy a couple while looking for a good BIOS. For now, those days are over. I now have a rule: when buying a used card, the card must have dual BIOS or the sale is off. As for the large GPU firmware archive that shall go unnamed, the BIOSes on file there are not signed stock BIOSes and work only sometimes, and only so well.

I may lose 10th place for a while. I selected the engineering tasks and they downloaded over 4,000 tasks to my 13 computers. By the time I clean up all of them, I may be out of sight!


Tom M
Joined: 2 Feb 06
Posts: 1,123
Credit: 1,966,998,476
RAC: 4,844,425

a BSOD (Win10) error message: "Driver Overran Stack Buffer"

This error message has shown up on both my Ryzen systems (MSI X570-A Pro) and my Intel system (MSI B360-A Pro).

It also will show up during a driver install if "too many" GPUs are plugged in.

The only fix that seems to work consistently is to reduce the number of GPUs below some threshold: probably 5, or certainly 4, which is where my top performer (Intel) is now sitting.

Tom M

earthbilly
Joined: 4 Apr 18
Posts: 59
Credit: 1,140,229,967
RAC: 13

Tom M wrote:

It also will show up during a driver install if "too many" GPUs are plugged in.

The only fix that seems to work consistently is to reduce the # of GPUs below some threshold.  Probably 5 or certainly 4 which is where my top performer (Intel) is now sitting.

For our BOINC crunching I have found the same issues. I thought it might be due to my use of H81 and Z97 system boards, but it seems to be more widespread. I read with interest about PCIe lanes and how they pair with the chipset and the CPU. Interesting.

I've also been thinking of other good-practice habits that seem to help when running multiple-GPU configurations. I strongly recommend checking that your power is properly grounded; I use 4-wire 120 VAC on all computers. Also often overlooked: when running power supplies at such high amperage on continuous duty, the AC power cable should never be smaller than 14 AWG. Running 700-1400 W constant, the voltage drop with an 18 AWG cord can cost 5-10% of output between the wall and the PSU. A while back, when I had several 5-card rigs, I saw a several-percent output increase just from changing every power cord to 14 AWG. (The smaller the gauge number, the larger the wire.)
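As a sanity check on the wire-gauge point, here is a back-of-envelope voltage-drop sketch. The resistances (about 21.0 milliohms/m for 18 AWG and 8.3 milliohms/m for 14 AWG copper) and the 2 m cord length are my assumptions, not figures from the post; actual losses also depend on plugs, crimps, and the building wiring upstream, which may account for larger gains than the cord alone predicts.

```python
# Voltage drop in a PSU power cord, out and back, at a given load.
def drop_pct(load_w: float, volts: float, mohm_per_m: float, cord_m: float) -> float:
    """Percent of supply voltage lost in the cord's two conductors."""
    amps = load_w / volts
    r = mohm_per_m / 1000.0 * cord_m * 2  # round trip through both conductors
    return amps * r / volts * 100

# A 1400 W rig on 120 V with a 2 m cord (assumed resistances per meter):
print(round(drop_pct(1400, 120, 21.0, 2.0), 2))  # 18 AWG
print(round(drop_pct(1400, 120, 8.3, 2.0), 2))   # 14 AWG
```

The heavier cord clearly loses less, though by this estimate the cord itself accounts for under 1% at these lengths.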

4-wire 120 VAC means: 1) the hot wire, 2) the neutral wire, 3) the electronics ground (the third pin on the power cable), and 4) a separate green wire running directly to a ground rod, to the screw that holds the PSU into the computer frame, and to the mining frame where the GPU cards are bolted down. If the green wire ran to a ground-fault circuit breaker, it would be called a PE.

PLEASE be careful running your multi-GPU rigs, so as to reduce the chance of an electrical fire in your home from overloaded circuits. Spread the rigs out across different circuit breakers, or run dedicated breakers for only 2-3 rigs each, with 10 AWG wire supplying the room where the crunching happens.
