Troubleshooting multiple GPU setups that use riser cards

Tom M
Joined: 2 Feb 06
Posts: 1,153
Credit: 2,167,266,886
RAC: 4,623,542

Ian&Steve C. wrote:

can you clarify the run configuration now?

how many tasks at a time? 1x? 2x?

Running one task at a time with 3 GPUs.

I will be setting up another test bench system to confirm that the two GPUs and the flat ribbon cables are all good.

I am currently convinced I have dirty PCIe slots that I have not been able to get clean. But maybe not.

I may have to replace the Epyc motherboard. Or terminate the 7 GPU experiment.

I have a spare X570 motherboard and a 3700X coming in, so it might be possible to get a 5 GPU rig up without spending more money.

Reducing the GPU count and installing systems in an enclosed case may turn out to be my only choice for reliable systems.

Tom M

As a self-interested person, I aspire to be a Humane.
In detail, I am a BIG Picture person.

Ian&Steve C.
Joined: 19 Jan 20
Posts: 922
Credit: 6,785,876,970
RAC: 17,322,527

for the slots, do you mean physically dirty (dust/dirt)? or electrically dirty (noise/ripple)?

what are you using for a PSU on the EPYC/5700 system?

_____________________________________________

Tom M
Joined: 2 Feb 06
Posts: 1,153
Credit: 2,167,266,886
RAC: 4,623,542

Ian&Steve C. wrote:

for the slots, do you mean physically dirty (dust/dirt)? or electrically dirty (noise/ripple)?

what are you using for a PSU on the EPYC/5700 system?

Presumably physically dirty.  EVGA 1600 PSU.

I had an exterminator in for bed bugs. He sprayed a lot of diatomaceous earth around and heated my house to over 120F.

I failed to cover or block the slots on the EPYC motherboard, except for the ones with ribbon cables installed. And I probably screwed things up further, so I am down to 3 slots that are definitely working. :)

Tom M

Ian&Steve C.
Joined: 19 Jan 20
Posts: 922
Credit: 6,785,876,970
RAC: 17,322,527

get a few cans of compressed air and really clear out any dirt/dust. from the whole board, not just in the slots.

 

that PSU should be more than enough theoretically. can you log in to the IPMI interface and check the 12V voltage under load? just to verify that the voltage isn't sagging. even a good PSU can go bad over time.

when the system boots, you should see an IP address listed on the BIOS splash screen. using another computer on the same network, put that IP address into a browser window. you should hit a login page (might have to add some security exceptions to get it to show in like firefox/chrome). default user:password credentials are admin:admin (you should change these at a later date just for good opsec). the main dashboard has a sensor monitoring section that will display many different voltages and other system telemetry. check the 12V/5V/3.3V values.  12V and 3.3V are particularly important for a GPU since it directly uses both of those voltages.
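
For anyone who would rather script this than click through the web dashboard, here is a minimal Python sketch that pulls the voltage rows out of a pipe-separated sensor listing. It assumes the "name | reading | unit | status" column layout that a command-line IPMI client such as ipmitool typically prints; the function name and the sample rows are hypothetical, made up for illustration, and are not output from this board.

```python
def parse_sensor_table(text):
    """Return {sensor_name: volts} for rows whose unit column is 'Volts'."""
    readings = {}
    for line in text.splitlines():
        cols = [c.strip() for c in line.split("|")]
        # Skip rows that are too short or that report something other than volts.
        if len(cols) < 3 or cols[2] != "Volts":
            continue
        try:
            readings[cols[0]] = float(cols[1])
        except ValueError:
            continue  # reading shown as 'na' or otherwise non-numeric
    return readings


# Hypothetical sample rows, made up for illustration (not real output from this board).
SAMPLE = """\
CPU Temp         | 45.000     | degrees C  | ok
12V              | 12.000     | Volts      | ok
5VCC             | 5.000      | Volts      | ok
3.3VCC           | na         | Volts      | na
"""

print(parse_sensor_table(SAMPLE))
```

Sensor names differ between boards, so the keys in the returned dictionary should be checked against what your own BMC reports.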

_____________________________________________

Tom M
Joined: 2 Feb 06
Posts: 1,153
Credit: 2,167,266,886
RAC: 4,623,542

Ian&Steve C. wrote:

get a few cans of compressed air and really clear out any dirt/dust. from the whole board, not just in the slots.

 

that PSU should be more than enough theoretically. can you log in to the IPMI interface and check the 12V voltage under load? just to verify that the voltage isn't sagging. even a good PSU can go bad over time.

when the system boots, you should see an IP address listed on the BIOS splash screen. using another computer on the same network, put that IP address into a browser window. you should hit a login page (might have to add some security exceptions to get it to show in like firefox/chrome). default user:password credentials are admin:admin (you should change these at a later date just for good opsec). the main dashboard has a sensor monitoring section that will display many different voltages and other system telemetry. check the 12V/5V/3.3V values.  12V and 3.3V are particularly important for a GPU since it directly uses both of those voltages.

I have a high-powered plug-in air blower. It hasn't worked. Yet.

Just tried changing my CPU BIOS speed setting to "power", which drives the CPU up to around 2.7-2.8 GHz. And the GPUs all promptly went "missing" :(  Set it back to "auto" and they came up and are running. They used to work at 2.7 GHz. :(

GPUs are now set to 2X.

Going to go get my workbench setup and test some gpus/cables.

Then try cleaning MB and slots again ((and again) and again)).....

Tom M

Ian&Steve C.
Joined: 19 Jan 20
Posts: 922
Credit: 6,785,876,970
RAC: 17,322,527

might be a good idea to check the PSU voltages too as I've outlined above.

_____________________________________________

Tom M
Joined: 2 Feb 06
Posts: 1,153
Credit: 2,167,266,886
RAC: 4,623,542

Tom M wrote:

Going to go get my workbench setup and test some gpus/cables.

Then try cleaning MB and slots again ((and again) and again)).....

Some good news.

1) Both RX 5700 GPUs run and process gamma ray #1 tasks on the test bench.

2) So do two flat ribbon cables.

Some mixed to good news.

PCIe slot #7, the one closest to the CPU, still causes the GPUs to go "missing".

BUT slot #6 (x8) just came up and the 4th GPU started processing.

And it's too hot inside my house, so I am going to set a timed suspend until 8pm this evening.

Tom M

Tom M
Joined: 2 Feb 06
Posts: 1,153
Credit: 2,167,266,886
RAC: 4,623,542

Ian&Steve C. wrote:

might be a good idea to check the PSU voltages too as I've outlined above.

I will chase down the eBay message from the previous owner so I can copy his password. It's long and hairy... and I can't even sorta remember it.

I have been meaning to get that PW changed.

Sounds like this is the best time.

Tom M

Ian&Steve C.
Joined: 19 Jan 20
Posts: 922
Credit: 6,785,876,970
RAC: 17,322,527

FYI, now that they've fixed the plan class name issue with the beta tasks, your cache is now filled with 1.28 beta tasks. since processing is FIFO, it looks like you've got 6 tasks left to process before hitting those new 1.28 tasks.

some users have reported that they get errors with these new tasks (i suspect driver issues), so keep an eye out and watch the system closely when it gets to the 1.28 tasks so you can stop it if you see issues with the 1.28 app and avoid getting the 24hr blacklist from too many errors.

 

I was able to process the 1.28 app on my RX570 (no speed difference) just fine though with my ROCm drivers. if you're on ROCm drivers too you might be OK. just be sure to watch and not enable processing and walk away for a long time, just in case.

_____________________________________________

Tom M
Joined: 2 Feb 06
Posts: 1,153
Credit: 2,167,266,886
RAC: 4,623,542

Ian&Steve C. wrote:

get a few cans of compressed air and really clear out any dirt/dust. from the whole board, not just in the slots.

 

that PSU should be more than enough theoretically. can you log in to the IPMI interface and check the 12V voltage under load? just to verify that the voltage isn't sagging. even a good PSU can go bad over time.

when the system boots, you should see an IP address listed on the BIOS splash screen. using another computer on the same network, put that IP address into a browser window. you should hit a login page (might have to add some security exceptions to get it to show in like firefox/chrome). default user:password credentials are admin:admin (you should change these at a later date just for good opsec). the main dashboard has a sensor monitoring section that will display many different voltages and other system telemetry. check the 12V/5V/3.3V values.  12V and 3.3V are particularly important for a GPU since it directly uses both of those voltages.

I am in the IPMI interface: 12V 11.90 V, 3V 3.18 V, 3VSB 3.32V, 5V 4.98 V
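
A quick way to sanity-check those numbers is to compare each rail with its nominal value. The sketch below is only an illustration: it assumes the +/-5% tolerance commonly quoted for the ATX 12V, 5V and 3.3V rails, it treats the "3V" reading as the 3.3V rail, and it leaves out 3VSB because that is a standby rail. The names NOMINAL, TOLERANCE and check_rail are made up for this example.

```python
# Nominal rail voltages. The +/-5% window is an assumption based on the
# tolerance commonly quoted for ATX 12V / 5V / 3.3V rails.
NOMINAL = {"12V": 12.0, "5V": 5.0, "3.3V": 3.3}
TOLERANCE = 0.05


def check_rail(name, measured, tolerance=TOLERANCE):
    """Return (within_tolerance, deviation_in_percent) for one rail."""
    nominal = NOMINAL[name]
    low = nominal * (1 - tolerance)
    high = nominal * (1 + tolerance)
    deviation_pct = (measured - nominal) / nominal * 100
    return low <= measured <= high, deviation_pct


# Readings reported in the post; "3V" is assumed to be the 3.3V rail.
readings = {"12V": 11.90, "5V": 4.98, "3.3V": 3.18}

for rail, volts in readings.items():
    ok, dev = check_rail(rail, volts)
    status = "OK" if ok else "OUT OF RANGE"
    print(f"{rail}: {volts:.2f} V ({dev:+.1f}%) {status}")
```

Under that assumption all three readings fall inside the window, with the 3.3V rail the closest to its lower limit at roughly 3.6% below nominal. The check only says something about sag if the readings were taken while the GPUs were crunching.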

Tom M
