Generic Multiple GPU discussion

Tom M
Joined: 2 Feb 06
Posts: 5586
Credit: 7673936154
RAC: 1784122

My i9-9900 system is hiccuping.  Looks like I scrambled my copy of the OS.  

Will be backing up my BOINC folder and re-installing the OS once I run out of tasks to process (NNT).

Tom M

A Proud member of the O.F.A.  (Old Farts Association).  Be well, do good work, and keep in touch.® (Garrison Keillor)

Tom M
Joined: 2 Feb 06
Posts: 5586
Credit: 7673936154
RAC: 1784122

My AMD 2700x/4 Rx 5700 also had a hiccup a couple of days ago.  One of the video cards reported an error to Windows so it was disabled.  Rebooting didn't fix it.

So I moved that GPU to a test rig where it happily calculated for many hours.

Since my single Rx 5700 experiment results have been confounded by the change in the data, I put that GPU back in the AMD 3950x box.

So all the outstanding questions -> answers require "more patience".

And as I have noted before "I want more PATIENCE.  And I want it RIGHT NOW!".

Tom M ;)

de_fou
Joined: 11 Jun 20
Posts: 2
Credit: 441483103
RAC: 0

Hi, I'm a new neighbor. I got a rig with a Threadripper 1920X and 3 × AMD Vega 56, running GW only, 2 tasks per GPU. Each card is limited to 150W. No OC nor UV. I was wondering however if it's usual to do so here with E@H tasks?

Cheers,

Dan

Keith Myers
Joined: 11 Feb 11
Posts: 4700
Credit: 17544740462
RAC: 6408335

GW tasks normally max out at 2X per card.  They demand a lot more CPU support than gamma-ray tasks, for example, so don't be stingy: allocate at least one CPU core per task.

They can also use a fair amount of card RAM: up to 4GB per task in the past, though that has dropped quite a bit lately for current tasks.

Your Vega 56 cards should do fine with GW work if given enough CPU support.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5842
Credit: 109392010087
RAC: 35879695

Hi Dan, welcome to Einstein@Home!

de_fou wrote:
... I got a rig with a Threadripper 1920X and 3 × AMD Vega 56, running GW only, 2 tasks per GPU.

I had a look at your tasks list on the website and you're not running GW tasks at all - which is probably just as well since we are transitioning to a new GW search and there seem to be some teething issues.  All the tasks for that machine are gamma-ray pulsar (GRP) tasks and not gravitational wave (GW).  You'd be best advised to stick with GRP until things settle down.

de_fou wrote:
Each card is limited to 150W. No OC nor UV. I was wondering however if it's usual to do so here with E@H tasks?

Some people tweak their systems in this way and it can result in lower power bills for not much reduction in output.  However the risk is that you may cut it too fine and start getting task failures or invalid results.  To avoid this, it seems like a fair bit of careful testing is required.

In your case I note that you currently have ~5300 valid results, 15 invalid and 8 error.  A small number of invalid results is quite normal and often it's due to small differences between different systems which can cause enough discrepancy to exceed the tight validation criteria.  When the first two results for a task don't agree, a third 'decider' is sent out and it's the luck of the draw as to which two 'agree' the closest.  It's usual to see about 0.5% - 1.5% of results get declared as invalid because of this.  So your current 0.28% invalid rate is very good.
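For anyone who wants to check their own numbers, the arithmetic above can be sketched in a few lines of Python. This is only an illustration: the function name is made up, and it assumes the rate is taken over valid plus invalid results (dividing by the valid count alone gives almost the same figure for counts like these).

```python
def invalid_rate_percent(valid: int, invalid: int) -> float:
    """Invalid results as a percentage of all validated results.

    Assumption: the denominator is valid + invalid; the post does not
    say exactly which denominator was used.
    """
    total = valid + invalid
    if total == 0:
        raise ValueError("no validated results to compute a rate from")
    return 100.0 * invalid / total


# Counts quoted in the post: ~5300 valid, 15 invalid.
rate = invalid_rate_percent(valid=5300, invalid=15)
print(f"invalid rate: {rate:.2f}%")

# The 0.5% - 1.5% band is the 'usual' range mentioned in the post.
print("below the usual range" if rate < 0.5 else "within or above the usual range")
```

With the counts from the post this comes out at roughly 0.28%, which matches the figure quoted above.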

What is even more remarkable (but something of a red flag) is that 5 of those 15 failed by a different mechanism, resulting in a 'validate error' status (and not just the normal 'invalid').  This status means that the result returned was complete rubbish - so much so that it was immediately recognised as such and wasn't even presented for validation.

Validate errors can be a sign that the card in question is operating outside of its comfort zone - so perhaps a sign that one card may have been tweaked ever so slightly beyond its limits.  If you want to spend the time, you could study the std_error output returned to the project for each validate error to see if you could identify which GPU was involved.  I don't have time for this sort of stuff and my systems (fortunately) don't seem to get many validate errors.  I don't know offhand if you can detect which GPU of a multi-GPU system was involved in each case, anyway.

The 8 errors mean that computation failed in some way.  Many of yours were very early but there might be clues as to what went wrong in the std_error output - click a taskID link on the website and scroll down to see what that output says.

One of the errors was caused by you aborting the task after a very long time.  These 'stuck tasks' do happen (I see maybe a couple of cases per week over a very large number of GPUs) and they are caused by a GPU crash of some sort.  Stopping and restarting BOINC doesn't fix it - the GPU needs a full reset.  I have a script that detects this when it happens.  The only remedy I've found is a full cold restart and the task will then pick up from the last saved checkpoint and complete normally.  There's no need to abort as whatever caused the GPU to lock up doesn't seem to recur on that same task in my experience.
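The script mentioned above isn't shown in this thread, so the following is not that script, just a minimal sketch of one way the idea could work: take two snapshots of each running task's fraction done some minutes apart, and flag any task that made no progress. The function name, the snapshot format and the threshold are all assumptions for illustration; how the snapshots get collected from the BOINC client is deliberately left out.

```python
from typing import Dict, List


def find_stuck_tasks(
    previous: Dict[str, float],
    current: Dict[str, float],
    min_progress: float = 1e-6,
) -> List[str]:
    """Return names of tasks whose fraction done did not advance.

    `previous` and `current` map a task name to its fraction done
    (0.0 to 1.0) at two points in time. Tasks that only appear in
    `current` are new and are not judged yet.
    """
    stuck = []
    for name, now in current.items():
        before = previous.get(name)
        if before is not None and (now - before) < min_progress:
            stuck.append(name)
    return sorted(stuck)


# Hypothetical snapshots taken some minutes apart.
earlier = {"task_a": 0.10, "task_b": 0.50}
later = {"task_a": 0.10, "task_b": 0.62, "task_c": 0.00}
print(find_stuck_tasks(earlier, later))
```

A suspended or waiting task would also show no progress, so a real check would need to look only at tasks that are actually supposed to be running.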

Cheers,
Gary.

Tom M
Joined: 2 Feb 06
Posts: 5586
Credit: 7673936154
RAC: 1784122

My "Moonshot" is now running three Rx 5700s and seems to have peaked at around 3,400,000 RAC.  I expect it would have been a little lower using the previous data sets.

My "gpu-server" has been running 2 GPUs without any errors.  So this morning I bumped it back up to 3 GPUs while I continue the external riser card hardware testing.  My goal is to get back to all 6 Rx 570/580 GPUs running without errors.

Then I will look at adding 3 more XFX Rx 570 cards.  I can manage 9 without having to re-think how to package them on my mining rack.

I am running GW on the gpu-server and the RAC is still climbing.  I am presuming that the latest fixes on the GW processing have been successful (discussed in another thread).

Tom M

de_fou
Joined: 11 Jun 20
Posts: 2
Credit: 441483103
RAC: 0

Thanks for your welcome and for the messages! In fact, some of the few invalid/error results were caused by a faulty PCIe extension cable. I'm using the ASRock X399M Taichi, which is a small form factor (mATX) motherboard with 3 full PCIe x16 slots (and the Threadripper can handle up to 64 PCIe lanes). The Vega cards are double-slot sized, so I need an extension cable to fit all 3. The first cable was a cheaply made one, and it gave me lots of headaches. I put it in the trash bin, got a new one from a recognized brand, and now it's working well. It's really rare, but occasionally I get a stuck task that I have to abort manually. Otherwise, unless somebody has proven OC/UV parameters they could share (perhaps in another thread), I prefer to stick with the current values.

And yes, at the app_config.xml file, for the GRP tasks I assigned 1 CPU, and for the GW tasks, 1.1 CPUs.
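For readers who haven't set one up, here is a rough sketch of what such an app_config.xml could look like, generated with a few lines of Python. The element names (`app_config`, `app`, `name`, `gpu_versions`, `gpu_usage`, `cpu_usage`) are the standard BOINC ones, and a `gpu_usage` of 0.5 means 2 tasks per GPU. The application names `GRP_APP_NAME` and `GW_APP_NAME` are placeholders, not real names: the actual short names have to be looked up for the project. Using 0.5 for the GRP app as well is an assumption.

```python
import xml.etree.ElementTree as ET


def build_app_config(apps):
    """Build app_config.xml text.

    `apps` maps an application short name to a (gpu_usage, cpu_usage) pair.
    """
    root = ET.Element("app_config")
    for name, (gpu_usage, cpu_usage) in apps.items():
        app = ET.SubElement(root, "app")
        ET.SubElement(app, "name").text = name
        versions = ET.SubElement(app, "gpu_versions")
        ET.SubElement(versions, "gpu_usage").text = str(gpu_usage)
        ET.SubElement(versions, "cpu_usage").text = str(cpu_usage)
    return ET.tostring(root, encoding="unicode")


# Placeholder app names; substitute the project's real short names.
config_text = build_app_config(
    {
        "GRP_APP_NAME": (0.5, 1.0),  # 2 tasks per GPU, 1 CPU each
        "GW_APP_NAME": (0.5, 1.1),   # 2 tasks per GPU, 1.1 CPUs each
    }
)
print(config_text)
```

The resulting text would go into app_config.xml in the project's folder under the BOINC data directory, and the client has to re-read its config files (or be restarted) before the new values take effect.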

 

Tom M
Joined: 2 Feb 06
Posts: 5586
Credit: 7673936154
RAC: 1784122

de_fou wrote:

And yes, at the app_config.xml file, for the GRP tasks I assigned 1 CPU, and for the GW tasks, 1.1 CPUs.

The default 0.7 CPU seems to be driving my Rx 570/580s fine at 1 task per GPU.

Tom M

tictoc
Joined: 1 Jan 13
Posts: 33
Credit: 6021590140
RAC: 5031942

Tom M wrote:

Andrew Petkin wrote:

I set up two 4xGpu systems in this way. I need undervolting to meet the power supply limit. For my multi GPU systems, this gives an economy of 300W

So are there Linux tools with that level of granularity?

Tom M

 

If you are running AMD GPUs then yes, all the tools are in the kernel.  If you want to go beyond basic over/underclocking, over/undervolting, fan tuning, and power limit adjustments, then you can load and run custom Power Play Tables.
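As a small illustration of the power limit part, the amdgpu driver exposes the cap through the hwmon sysfs interface as a file named `power1_cap`, with the value in microwatts. The sketch below only shows the unit handling. The helper names are made up, the directory is passed in rather than assumed (it is typically something like `/sys/class/drm/card0/device/hwmon/hwmonN/`, but the card and hwmon numbers vary per system), and writing a new cap needs root.

```python
from pathlib import Path

MICROWATTS_PER_WATT = 1_000_000


def read_power_cap_watts(hwmon_dir: Path) -> float:
    """Read the current power cap from `power1_cap` and return it in watts."""
    raw = (hwmon_dir / "power1_cap").read_text().strip()
    return int(raw) / MICROWATTS_PER_WATT


def power_cap_value(watts: float) -> str:
    """Return the microwatt string that would be written to `power1_cap`."""
    return str(int(round(watts * MICROWATTS_PER_WATT)))


# Example: the string for the 150W limit mentioned earlier in the thread.
print(power_cap_value(150))
```

Keeping the read and the unit conversion separate makes it possible to check the value that would be written before touching the card.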

 

4x Radeon VIIs (only running all 4 about 16-18 hours per day): https://einsteinathome.org/host/12597202

2x Vega 64: https://einsteinathome.org/host/12871694

 

Tom M
Joined: 2 Feb 06
Posts: 5586
Credit: 7673936154
RAC: 1784122

Thank you TicToc.
