Troubleshooting multi-GPU setups that use riser cards

Ian&Steve C.
Joined: 19 Jan 20
Posts: 916
Credit: 6,750,307,859
RAC: 21,799,652

Is that with the system running and crunching on all 4 GPUs, or at idle?

If under load, it all looks good. No issues there.

_____________________________________________

Tom M
Joined: 2 Feb 06
Posts: 1,150
Credit: 2,147,356,996
RAC: 4,552,349

Ian&Steve C. wrote:

Is that with the system running and crunching on all 4 GPUs, or at idle?

If under load, it all looks good. No issues there.

Under load.

As a self-interested person, I aspire to be a Humane.
In detail, I am a BIG Picture person.

Tom M

My latest attempt to run 5 GPUs blew up in my face badly enough that I am back down to 3 GPUs.

I did manage to get Libsleep started and running on the EPYC system.

It is beginning to look like the only way to get back to 5 GPUs on EPYC would be another new motherboard.

:(

If I can't even get back to 4 GPUs, I will start considering alternative motherboards/CPUs.

Tom M

Ian&Steve C.

It's possible your board is just defective. Bought used on eBay, right?

How did the CPU pins look? Any bent, do you recall? The PCIe slots connect directly to the CPU, so each slot's lanes run through pin connections in the CPU socket.

Consumer-level AMD AM4 platforms will not perform better for this. You end up getting big eyes for more cores and trying to stuff high-core-count CPUs into mining boards that don't have the VRMs to handle them, and it spirals out of control again.

The EPYC platform is your best bet. I think your best course of action is to remove the EPYC board from the chassis/rack entirely, give it a thorough cleaning, and look for signs of damage or bent pins in the CPU socket. If it's still acting up, RMA it to ASRock.

The only other platform I would recommend for >4 GPUs would be an Intel HEDT or Intel server platform. Avoid boards with PCIe "expanders" or with lanes that run through the chipset.

Another possibility is AMD driver issues. The only AMD GPU systems with more than 4 GPUs are Gaurav's, and maybe he's done something special to support that. How many NVIDIA GPUs do you have? Maybe you could see whether 5 NVIDIA GPUs will run on that EPYC board without issues, especially now that there's no AMD advantage since the new 1.28 NVIDIA app arrived.

_____________________________________________

Keith Myers
Joined: 11 Feb 11
Posts: 1,353
Credit: 2,664,818,915
RAC: 6,055,528

It's OK to use an Intel HEDT mobo that uses PLX chips though, right? None of the PCIe lanes go through the chipset. The PLX chips are fed directly from the CPU.

Like this block diagram for my X99-E-10G_WS mobo.

 

Ian&Steve C.

Keith Myers wrote:

It's OK to use an Intel HEDT mobo that uses PLX chips though, right? None of the PCIe lanes go through the chipset. The PLX chips are fed directly from the CPU.

Like this block diagram for my X99-E-10G_WS mobo.

 

I had a decent experience with my P9X79-E WS board, which uses similar PLX chips. It ran 7x RTX 2080s fairly reliably, with occasional hang-ups. Whether the slight instability was caused by the board design or by the fact that I spilled some water on that board (and cleaned it up) remains unknown. Maybe a little of both?

But again, you and I have this experience with NVIDIA GPUs and drivers. I'm not sure how robust the AMD drivers are at supporting >4 GPUs. And it's hard to pin down whether his issues are AMD driver related, physical hardware related, environment related, or some other unknown issue that can't be diagnosed without being hands-on with his system.

_____________________________________________

Tom M

Thank you for reminding me about a possible motherboard RMA.

I remember having a lot of trouble with Windows driver failures. And one of my symptoms is a GPU going completely missing.

I will confirm whether an RMA is possible.

Then I will start on a complete motherboard disassembly, dismount, and cleaning.

I already own enough hardware to experiment with getting all 5 GPUs online while the EPYC is completely down.

And I just got a new-to-me Ryzen 3700X in the door yesterday :)

A replacement EPYC motherboard is cheaper than another GTX 1080.

If I were to sell all my RX 5700s and RX 580s, I might be able to fund two higher-end RTX 3000 series GPUs :)

I have been collecting numbers from the top 50 list, and your 3080 Ti really smokes...

First the cleanup...

Tom M

Keith Myers

Yes, I would like to see his EPYC board populated with nothing but NVIDIA cards, even if only temporarily.

I suspect that the consumer-level AMD drivers just don't work well with more than 1 or 2 cards on any system.

That would prove one way or the other whether it was hardware failure or simply bad drivers.
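
A rough way to separate the two cases before swapping any cards, assuming the host runs Linux and exposes the usual sysfs PCI tree (the thread does not confirm the OS): list every display-class PCI device along with the kernel driver bound to it. A card that is absent from the list never enumerated on the bus, which points toward slot, riser, or board trouble. A card that is listed with no driver bound points more toward the driver. The names `list_gpus`, `read_attr`, and `VENDORS` are made up for this sketch.

```python
import os

# Well-known PCI vendor IDs; anything else is shown as the raw ID.
VENDORS = {"0x10de": "NVIDIA", "0x1002": "AMD"}


def read_attr(dev_dir, name):
    """Return the stripped contents of a sysfs attribute, or None if unreadable."""
    try:
        with open(os.path.join(dev_dir, name)) as fh:
            return fh.read().strip()
    except OSError:
        return None


def list_gpus(root="/sys/bus/pci/devices"):
    """List display-class PCI devices with their vendor and bound driver."""
    gpus = []
    if not os.path.isdir(root):  # not Linux, or sysfs not mounted
        return gpus
    for addr in sorted(os.listdir(root)):
        dev_dir = os.path.join(root, addr)
        pci_class = read_attr(dev_dir, "class") or ""
        if not pci_class.startswith("0x03"):  # 0x03xxxx = display controller
            continue
        vendor = read_attr(dev_dir, "vendor")
        driver_link = os.path.join(dev_dir, "driver")
        # The "driver" symlink only exists while a kernel driver is bound.
        driver = (os.path.basename(os.readlink(driver_link))
                  if os.path.islink(driver_link) else None)
        gpus.append({"addr": addr,
                     "vendor": VENDORS.get(vendor, vendor),
                     "driver": driver})
    return gpus


if __name__ == "__main__":
    for gpu in list_gpus():
        state = gpu["driver"] or "NO DRIVER BOUND"
        print(f'{gpu["addr"]}  {gpu["vendor"]}  {state}')
```

On a server board the onboard BMC video device will probably show up in the list too, under its own vendor ID, so the count may be one higher than the number of cards installed.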

 

Tom M

Keith Myers wrote:

Yes, I would like to see his EPYC board populated with nothing but NVIDIA cards, even if only temporarily.

I suspect that the consumer-level AMD drivers just don't work well with more than 1 or 2 cards on any system.

That would prove one way or the other whether it was hardware failure or simply bad drivers.

Since my problems seem to be linked to specific PCIe slots, I can see how that might be a very useful test.

Does it boot with an RX 5700 in that slot (no) or a GTX 1660 Super in that slot? :)

Tom M

Tom M

Tom M wrote:

Keith Myers wrote:

Yes, I would like to see his EPYC board populated with nothing but NVIDIA cards, even if only temporarily.

I suspect that the consumer-level AMD drivers just don't work well with more than 1 or 2 cards on any system.

That would prove one way or the other whether it was hardware failure or simply bad drivers.

Since my problems seem to be linked to specific PCIe slots, I can see how that might be a very useful test.

Does it boot with an RX 5700 in that slot (no) or a GTX 1660 Super in that slot? :)

Tom M

In preparation for putting an NVIDIA GPU in the PCIe slot nearest the CPU, I unplugged all the other GPUs.

I put a known-good RX 5700 in the 3rd slot (counting from the one furthest from the CPU) and confirmed it worked there.

Then I moved that same GPU to the PCIe slot closest to the CPU (#7?).

There it just came up and started processing. I am going to let it run for a while here. I have had RX 5700s and similar cards start processing and then stall after a while.

I guess this likely means that testing with an NVIDIA card isn't going to do much good, because I don't have enough of them to replicate my RX 5700 setup.

Tom M

===edit===

It's got me wondering whether the "atomic" stuff in the ROCm driver doesn't work very well with x8 slots. I am going to keep adding GPUs in x16 slots and see if I can get back up to 4. Previously I was using the 3 furthest slots, including an x8 slot.

===edit===
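
For keeping track of which cards end up on x8 versus x16 links while adding GPUs one at a time, here is a minimal sketch, again assuming a Linux host with sysfs. It reads the `current_link_width` and `max_link_width` attributes for each display-class device. It only reports negotiated link width; it does not check whether a slot supports PCIe atomics. The function names are invented for illustration.

```python
import os


def read_int(path):
    """Read an integer sysfs attribute; return None if missing or unparsable."""
    try:
        with open(path) as fh:
            return int(fh.read().strip())
    except (OSError, ValueError):
        return None


def link_report(root="/sys/bus/pci/devices"):
    """Return (address, current_width, max_width) for display-class devices."""
    rows = []
    if not os.path.isdir(root):  # not Linux, or sysfs not mounted
        return rows
    for addr in sorted(os.listdir(root)):
        dev = os.path.join(root, addr)
        try:
            with open(os.path.join(dev, "class")) as fh:
                pci_class = fh.read().strip()
        except OSError:
            continue
        if not pci_class.startswith("0x03"):  # 0x03xxxx = display controller
            continue
        rows.append((addr,
                     read_int(os.path.join(dev, "current_link_width")),
                     read_int(os.path.join(dev, "max_link_width"))))
    return rows


def is_downtrained(current, maximum):
    """True when the negotiated width is known and below the device maximum."""
    return current is not None and maximum is not None and current < maximum


if __name__ == "__main__":
    for addr, cur, mx in link_report():
        flag = "  <-- narrower than card maximum" if is_downtrained(cur, mx) else ""
        print(f"{addr}  x{cur} of x{mx}{flag}")
```

One caveat: on some cards the GPU function sits behind a bridge on the card itself, in which case the width shown here may describe the on-card link and the slot link has to be read from the parent bridge device instead. A card on a x1 riser should report a width of 1 somewhere along that chain.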
