Dual AMD computes on first card only

koschi
Joined: 17 Mar 05
Posts: 86
Credit: 1713736941
RAC: 407643
Topic 218834

Hi everyone,

I have quite a strange problem on my main system. Einstein computes all WUs on only one of the AMD cards.

First its specs:

Ubuntu 18.04.2 (kernel 4.15, AMDGPU-PRO 19.10, BOINC 7.14.2) & Ubuntu 19.04 (kernel 5.0.0, OpenCL from AMDGPU-PRO 18.50, BOINC 7.14.2)

AMD R7 (undervolted, 120W power draw with CPU units)

BeQuiet Straight Power 11 650W (Gold/93%)

1st 16x PCIe Radeon RX580 (monitor attached, 82W doing 2x FGRPB1G)

2nd 16x PCIe Radeon Vega 56 (180W PowerLimit)

Seems like enough power, wouldn't expect any issues on that end.

 

The RX580 has been my main card over the last few months; the 2nd PCIe slot hosted a GTX1060 until this week. Both cards were crunching Einstein in parallel (2 WUs per card) until I took the GTX out.

I put the Vega 56 in, removed the Nvidia drivers and hoped I would now run 2 x 2 (0.5 ngpus) FGRPB1G ATI tasks on these two Radeon cards.

However, WUs are only processed on the VEGA.

Both cards are recognised by BOINC: 

Wed 08 May 2019 20:20:27 CEST | | OpenCL: AMD/ATI GPU 0: Radeon RX Vega (driver version 2841.4 (PAL,HSAIL), device version OpenCL 2.0 AMD-APP (2841.4), 8176MB, 8176MB available, 11397 GFLOPS peak)
Wed 08 May 2019 20:20:27 CEST | | OpenCL: AMD/ATI GPU 1: Radeon RX 580 Series (driver version 2841.4, device version OpenCL 1.2 AMD-APP (2841.4), 7295MB, 7295MB available, 5161 GFLOPS peak)
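For anyone who wants to compare what BOINC detected across installs, here is a minimal Python sketch that pulls the device index, name, driver string and OpenCL version out of startup messages like the two above. The regular expression and the function name `parse_opencl_gpus` are my own and only assume the message layout quoted in this post; other BOINC versions may format these lines differently.

```python
import re

# Assumption: the message layout matches the two log lines quoted in this post.
GPU_LINE = re.compile(
    r"OpenCL: AMD/ATI GPU (?P<index>\d+): (?P<name>.+?) "
    r"\(driver version (?P<driver>.+?), "
    r"device version (?P<device_version>OpenCL [\d.]+)"
)

# The two startup messages from this post, copied verbatim.
SAMPLE_LOG = (
    "Wed 08 May 2019 20:20:27 CEST | | OpenCL: AMD/ATI GPU 0: Radeon RX Vega "
    "(driver version 2841.4 (PAL,HSAIL), device version OpenCL 2.0 AMD-APP (2841.4), "
    "8176MB, 8176MB available, 11397 GFLOPS peak)\n"
    "Wed 08 May 2019 20:20:27 CEST | | OpenCL: AMD/ATI GPU 1: Radeon RX 580 Series "
    "(driver version 2841.4, device version OpenCL 1.2 AMD-APP (2841.4), "
    "7295MB, 7295MB available, 5161 GFLOPS peak)\n"
)


def parse_opencl_gpus(log_text):
    """Return one dict per AMD/ATI OpenCL GPU announced in the log text."""
    gpus = []
    for match in GPU_LINE.finditer(log_text):
        gpus.append(
            {
                "index": int(match.group("index")),
                "name": match.group("name"),
                "driver": match.group("driver"),
                "device_version": match.group("device_version"),
            }
        )
    return gpus


for gpu in parse_opencl_gpus(SAMPLE_LOG):
    print(gpu["index"], gpu["name"], "|", gpu["driver"], "|", gpu["device_version"])
```

On the sample this makes the difference between the two cards easy to spot: the Vega reports the PAL driver string and OpenCL 2.0, the RX 580 reports OpenCL 1.2.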

<use_all_gpus>1</use_all_gpus> is set and acknowledged by BOINC:
Wed 08 May 2019 20:20:28 CEST | | Config: use all coprocessors

Regardless how many WUs I run in parallel (tested 1 and 2), they all end up on the Vega. The RX580 shows no load / increased temperature.

With ngpus 1.0 the BOINC client sends one WU to each GPU; the manager shows this in the status column as (device 0) and (device 1). BOINC calls the FGRPB1G app correctly, once with --device 0 and once with --device 1:

root 28013 11934 14 23:13 pts/2 00:01:03 ../../projects/einstein.phys.uwm.edu/hsgamma_FGRPB1G_1.18_x86_64-pc-linux-gnu__FGRPopencl1K-ati --inputfile LATeah1049X.dat --alpha 1.41058464281 --delta -0.444366280137 --skyRadius 5.090540e-07 --ldiBins 30 --f0start 180.0 --f0Band 8.0 --firstSkyPoint 0 --numSkyPoints 1 --f1dot -1e-13 --f1dotBand 1e-13 --df1dot 2.512676418e-15 --ephemdir JPLEPH.405 --Tcoh 2097152.0 --toplist 10 --cohFollow 10 --numCells 1 --useWeights 1 --Srefinement 1 --CohSkyRef 1 --cohfullskybox 1 --mmfu 0.1 --reftime 56100 --model 0 --f0orbit 0.005 --mismatch 0.1 --demodbinary 1 --BinaryPointFile templates_LATeah1049X_0188_2669947.dat --debug 1 --debugCommandLineMangling --device 1

root 28592 11934 57 23:20 pts/2 00:00:05 ../../projects/einstein.phys.uwm.edu/hsgamma_FGRPB1G_1.18_x86_64-pc-linux-gnu__FGRPopencl1K-ati --inputfile LATeah1049X.dat --alpha 1.41058464281 --delta -0.444366280137 --skyRadius 5.090540e-07 --ldiBins 30 --f0start 180.0 --f0Band 8.0 --firstSkyPoint 0 --numSkyPoints 1 --f1dot -1e-13 --f1dotBand 1e-13 --df1dot 2.512676418e-15 --ephemdir JPLEPH.405 --Tcoh 2097152.0 --toplist 10 --cohFollow 10 --numCells 1 --useWeights 1 --Srefinement 1 --CohSkyRef 1 --cohfullskybox 1 --mmfu 0.1 --reftime 56100 --model 0 --f0orbit 0.005 --mismatch 0.1 --demodbinary 1 --BinaryPointFile templates_LATeah1049X_0188_2793903.dat --debug 1 --debugCommandLineMangling --device 0

However, lm-sensors, amdgpu-utils and the WU runtimes indicate that both WUs are being run on the Vega, while the RX580 remains idle.

Quite a strange problem. I'm not sure at what level this is screwed up. Most likely not BOINC, since it was sending WUs to devices 0 and 1, as shown by the manager and the FGRPB1G processes themselves. Is it the Einstein executable that ignores the device parameter (and runs everything on device 0), or is it something in OpenCL that schedules these tasks onto the more powerful card?
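To check the --device values without reading the long process listings by eye, something like the following Python sketch could be used. The function name `device_from_cmdline` is hypothetical, and the two sample command lines are abbreviated from the listings above (only a few of the original arguments are kept).

```python
import shlex

# Abbreviated from the two process listings in this post; most arguments omitted.
SAMPLE_A = (
    "../../projects/einstein.phys.uwm.edu/"
    "hsgamma_FGRPB1G_1.18_x86_64-pc-linux-gnu__FGRPopencl1K-ati "
    "--inputfile LATeah1049X.dat --debug 1 --debugCommandLineMangling --device 1"
)
SAMPLE_B = (
    "../../projects/einstein.phys.uwm.edu/"
    "hsgamma_FGRPB1G_1.18_x86_64-pc-linux-gnu__FGRPopencl1K-ati "
    "--inputfile LATeah1049X.dat --debug 1 --debugCommandLineMangling --device 0"
)


def device_from_cmdline(cmdline):
    """Return the integer that follows --device, or None if it is absent."""
    args = shlex.split(cmdline)
    # Stop one short of the end so args[i + 1] always exists.
    for i, arg in enumerate(args[:-1]):
        if arg == "--device":
            try:
                return int(args[i + 1])
            except ValueError:
                return None
    return None


print(device_from_cmdline(SAMPLE_A))
print(device_from_cmdline(SAMPLE_B))
```

This only confirms what BOINC passed on the command line. It says nothing about which card the OpenCL runtime actually picked, which is the open question here.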

I have reproduced the problem on two independent installations of Ubuntu at different release levels.
I'm completely out of ideas...

Does anyone have an idea? Any insight into the FGRPB1G executables themselves: can we somehow trace why/how they decide where to compute?

archae86
Joined: 6 Dec 05
Posts: 3161
Credit: 7299528357
RAC: 2201674

koschi wrote:
Any insight into the FGRPB1G executables themselves: can we somehow trace why/how they decide where to compute?

They don't.  The BOINC installation on your system does.

koschi
Joined: 17 Mar 05
Posts: 86
Credit: 1713736941
RAC: 407643

Well, BOINC already assigns each WU to a different device and starts both FGRPB1G processes with different --device parameters. So that information is given to the Einstein executable. I don't see what BOINC itself could do any better here.

 

I tried excluding the Vega with <ignore_ati_dev>1</ignore_ati_dev> (also trying 0).

I also swapped the two cards between their slots, so the Vega is now the main card. Nothing helps. BOINC starts the executables correctly, but they don't respect the --device flag.

archae86
Joined: 6 Dec 05
Posts: 3161
Credit: 7299528357
RAC: 2201674

koschi wrote:

Well, BOINC already assigns each WU to a different device and starts both FGRPB1G processes with different --device parameters. So that information is given to the Einstein executable. I don't see what BOINC itself could do any better here.

 

I tried excluding the Vega with <ignore_ati_dev>1</ignore_ati_dev> (also trying 0).

I also swapped the two cards between their slots, so the Vega is now the main card. Nothing helps. BOINC starts the executables correctly, but they don't respect the --device flag.

You may wish to review information at:

https://boinc.berkeley.edu/trac/wiki/AppCoprocessor

In particular, the specific device flag you are harping on about is long deprecated. I think the current standard method relies on <gpu_device_num> in the init_data.xml file for a particular task run. But I'm no expert on this matter, and currently only run single-GPU machines.

Gavin
Joined: 21 Sep 10
Posts: 191
Credit: 40644337738
RAC: 0

I assume you do not have a monitor connected to each card. If you connect your display to the card that is currently not working and reboot, will tasks then run on it and ignore the other?

 

 

mmonnin
Joined: 29 May 16
Posts: 292
Credit: 3444726540
RAC: 163899

Are you still on the same driver as when it was the NV/AMD setup? Does the same driver support both the Vega and the RX cards?

koschi
Joined: 17 Mar 05
Posts: 86
Credit: 1713736941
RAC: 407643

@Gavin, the Vega didn't have a monitor connected and was crunching along fine. The not-computing RX580 was my primary card with the monitor attached.

 

I ran 18.50 in the mixed AMD/Nvidia setup. 18.50 also nicely powered the VEGA, as does 19.10.

https://www.amd.com/en/support/kb/release-notes/rn-rad-lin-18-50-unified

These drivers support cards from GCN 2 (Radeon 200 series) up to latest Radeon VII.

When it comes to OpenCL though, Polaris/RX580 requires the legacy implementation to be installed, while the Vega requires the PAL implementation. I had both installed, and both cards were recognized by clinfo and BOINC.

 

Not knowing what the exact problem is, I gave AMD ROCm (Radeon Open Compute) another shot on the Ubuntu 19.04 installation (kernel 5.0.0, ROCm 2.45, Mesa 19.2). Although it is maybe a few percent slower than the AMDGPU-PRO OpenCL implementations, it is able to run Einstein FGRPB1G on both cards. Having the Polaris card compute at all completely makes up for that small slowdown.

Awesome!

 

========================ROCm System Management Interface========================
================================================================================
GPU  Temp   AvgPwr   SCLK     MCLK     Fan     Perf  PwrCap  SCLK OD  MCLK OD  GPU%
0    71.0c  132.0W   1474Mhz  920Mhz   42.75%  auto  130.0W  N/A      15%      97%
1    52.0c  81.239W  1120Mhz  2000Mhz  18.82%  auto  122.0W  0%       -2%      98%
================================================================================
==============================End of ROCm SMI Log ==============================

 

So it seems the official drivers can't reliably run OpenCL code on two cards that require different OpenCL implementations (RX580 => legacy, Vega => PAL). Somewhere in there it gets messy, and all started tasks end up scheduled onto the Vega.

 

@archae86

thanks, will check the init_data.xml on my main install

 

 

koschi
Joined: 17 Mar 05
Posts: 86
Credit: 1713736941
RAC: 407643

This is on my main install, which I haven't fixed yet:

root@frickelbude:/var/lib/bunker2/slots# grep gpu_device_num */init_data.xml
0/init_data.xml:<gpu_device_num>1</gpu_device_num>
1/init_data.xml:<gpu_device_num>0</gpu_device_num>
root@frickelbude:/var/lib/bunker2/slots#

 

Seems about right, each init_data.xml specifies a different target GPU.
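The same check as the grep above, as a small Python sketch for anyone who wants to script it. The function name `slot_gpu_assignments` is made up for illustration, and the slots path has to be passed in, since it depends on where the BOINC data directory lives (mine is /var/lib/bunker2/slots). It uses a regular expression instead of an XML parser because I am not certain every init_data.xml is strictly well-formed XML.

```python
import re
from pathlib import Path

# Matches the <gpu_device_num> element as it appears in the grep output above.
GPU_NUM = re.compile(r"<gpu_device_num>\s*(-?\d+)\s*</gpu_device_num>")


def slot_gpu_assignments(slots_dir):
    """Map each slot directory name to the gpu_device_num in its init_data.xml."""
    assignments = {}
    for init_file in sorted(Path(slots_dir).glob("*/init_data.xml")):
        match = GPU_NUM.search(init_file.read_text(errors="replace"))
        if match:
            assignments[init_file.parent.name] = int(match.group(1))
    return assignments


# Example call (path is specific to my machine, adjust to your BOINC data dir):
# print(slot_gpu_assignments("/var/lib/bunker2/slots"))
```

If two running slots report the same number while `<use_all_gpus>` is set, the problem would be on the BOINC side; if they differ, as they do here, BOINC has done its part.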
