Radeon Vega

Holmis

Joined: 4 Jan 05

Posts: 1118

Credit: 1055935564

RAC: 0

7 May 2019 22:19:58 UTC

Message 171202 in response to message 171200

koschi wrote:

So Einstein became the GPU project of this year's Pentathlon. Vega 56 ordered and arriving tomorrow, exciting!

I hope it's going to serve you as well as mine's served me!

koschi

Joined: 17 Mar 05

Posts: 87

Credit: 1719058593

RAC: 211941

8 May 2019 21:35:57 UTC

Message 171213

It does!

2 WUs in 9:36min, 1 in 5:35min.

However, it runs next to an RX580 but steals all the work from it (not the Vega's fault, though).

I run Ubuntu 18.04 with AMDGPU-PRO 19.10 (legacy and pal OpenCL installed), BOINC client 7.14.2 & 7.15.0.

Both cards are recognised by BOINC:

Wed 08 May 2019 20:20:27 CEST | | OpenCL: AMD/ATI GPU 0: Radeon RX Vega (driver version 2841.4 (PAL,HSAIL), device version OpenCL 2.0 AMD-APP (2841.4), 8176MB, 8176MB available, 11397 GFLOPS peak)

Wed 08 May 2019 20:20:27 CEST | | OpenCL: AMD/ATI GPU 1: Radeon RX 580 Series (driver version 2841.4, device version OpenCL 1.2 AMD-APP (2841.4), 7295MB, 7295MB available, 5161 GFLOPS peak)
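For anyone comparing several hosts, the two detection lines above can be pulled apart mechanically. The following is only a minimal sketch: the regular expression is written against the exact wording of the lines quoted in this post, and other BOINC versions or GPU vendors may format them differently. The function and field names are made up for illustration.

```python
import re

# Pattern for the "OpenCL: AMD/ATI GPU n: ..." startup lines quoted above.
# Assumption: the wording matches this post; other clients may differ.
GPU_LINE = re.compile(
    r"OpenCL: AMD/ATI GPU (?P<index>\d+): (?P<name>.+?) \(driver version "
    r".*?(?P<mem>\d+)MB, \d+MB available, (?P<gflops>\d+) GFLOPS peak\)"
)


def parse_gpus(log_text):
    """Return one dict per detected GPU, in the order they appear in the log."""
    gpus = []
    for m in GPU_LINE.finditer(log_text):
        gpus.append({
            "index": int(m.group("index")),
            "name": m.group("name"),
            "memory_mb": int(m.group("mem")),
            "gflops_peak": int(m.group("gflops")),
        })
    return gpus


log = (
    "Wed 08 May 2019 20:20:27 CEST | | OpenCL: AMD/ATI GPU 0: Radeon RX Vega "
    "(driver version 2841.4 (PAL,HSAIL), device version OpenCL 2.0 AMD-APP (2841.4), "
    "8176MB, 8176MB available, 11397 GFLOPS peak)\n"
    "Wed 08 May 2019 20:20:27 CEST | | OpenCL: AMD/ATI GPU 1: Radeon RX 580 Series "
    "(driver version 2841.4, device version OpenCL 1.2 AMD-APP (2841.4), "
    "7295MB, 7295MB available, 5161 GFLOPS peak)\n"
)

for gpu in parse_gpus(log):
    print(gpu)
```

This only confirms what BOINC detected at startup. It says nothing about which card a task actually runs on.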

<use_all_gpus>1</use_all_gpus> is set and acknowledged by BOINC:

Wed 08 May 2019 20:20:28 CEST | | Config: use all coprocessors
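For readers who have not set this option before: use_all_gpus lives in the options section of cc_config.xml in the BOINC data directory. The sketch below only builds the minimal XML text and prints it. The helper name is made up for illustration, and the full file on a real host may contain other options as well.

```python
import xml.etree.ElementTree as ET


def build_cc_config(use_all_gpus=True):
    """Build a minimal cc_config.xml body with only the use_all_gpus option."""
    root = ET.Element("cc_config")
    options = ET.SubElement(root, "options")
    ET.SubElement(options, "use_all_gpus").text = "1" if use_all_gpus else "0"
    return ET.tostring(root, encoding="unicode")


print(build_cc_config())
```

After editing the file, the client has to re-read its config files (or be restarted) before the "Config: use all coprocessors" line shows up in the log.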

Regardless of how many WUs I run in parallel (tested 1 and 2), they all end up on the Vega. The RX580 shows neither load nor increased temperature.

With ngpus 1.0 the BOINC client sends one WU to each GPU; in the manager this is shown in the status column as (device 0) and (device 1). The FGRPB1G app is called correctly, once with --device 0 and once with --device 1:

root 28013 11934 14 23:13 pts/2 00:01:03 ../../projects/einstein.phys.uwm.edu/hsgamma_FGRPB1G_1.18_x86_64-pc-linux-gnu__FGRPopencl1K-ati --inputfile LATeah1049X.dat --alpha 1.41058464281 --delta -0.444366280137 --skyRadius 5.090540e-07 --ldiBins 30 --f0start 180.0 --f0Band 8.0 --firstSkyPoint 0 --numSkyPoints 1 --f1dot -1e-13 --f1dotBand 1e-13 --df1dot 2.512676418e-15 --ephemdir JPLEPH.405 --Tcoh 2097152.0 --toplist 10 --cohFollow 10 --numCells 1 --useWeights 1 --Srefinement 1 --CohSkyRef 1 --cohfullskybox 1 --mmfu 0.1 --reftime 56100 --model 0 --f0orbit 0.005 --mismatch 0.1 --demodbinary 1 --BinaryPointFile templates_LATeah1049X_0188_2669947.dat --debug 1 --debugCommandLineMangling --device 1

root 28592 11934 57 23:20 pts/2 00:00:05 ../../projects/einstein.phys.uwm.edu/hsgamma_FGRPB1G_1.18_x86_64-pc-linux-gnu__FGRPopencl1K-ati --inputfile LATeah1049X.dat --alpha 1.41058464281 --delta -0.444366280137 --skyRadius 5.090540e-07 --ldiBins 30 --f0start 180.0 --f0Band 8.0 --firstSkyPoint 0 --numSkyPoints 1 --f1dot -1e-13 --f1dotBand 1e-13 --df1dot 2.512676418e-15 --ephemdir JPLEPH.405 --Tcoh 2097152.0 --toplist 10 --cohFollow 10 --numCells 1 --useWeights 1 --Srefinement 1 --CohSkyRef 1 --cohfullskybox 1 --mmfu 0.1 --reftime 56100 --model 0 --f0orbit 0.005 --mismatch 0.1 --demodbinary 1 --BinaryPointFile templates_LATeah1049X_0188_2793903.dat --debug 1 --debugCommandLineMangling --device 0

However, lm-sensors, amdgpu-utils and the WU runtimes indicate that both WUs are being run on the Vega, while the RX580 remains idle.
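To separate "what BOINC asked for" from "what the hardware did", the requested device can be read straight off each process command line. This is a rough sketch, assuming the command lines look like the ps output above. The sample strings are shortened versions of those two lines, and the function name is invented here.

```python
import shlex


def requested_device(cmdline):
    """Return the integer that follows --device, or None if the flag is absent."""
    args = shlex.split(cmdline)
    # Stop one short of the end so args[i + 1] always exists.
    for i, arg in enumerate(args[:-1]):
        if arg == "--device":
            return int(args[i + 1])
    return None


# Shortened copies of the two command lines shown above (most options dropped).
samples = [
    "hsgamma_FGRPB1G_1.18_x86_64-pc-linux-gnu__FGRPopencl1K-ati "
    "--inputfile LATeah1049X.dat --debug 1 --debugCommandLineMangling --device 1",
    "hsgamma_FGRPB1G_1.18_x86_64-pc-linux-gnu__FGRPopencl1K-ati "
    "--inputfile LATeah1049X.dat --debug 1 --debugCommandLineMangling --device 0",
]

for line in samples:
    print("requested device:", requested_device(line))
```

If this reports two different devices while the sensors show only one card under load, the mismatch sits below BOINC, in the application or the OpenCL stack.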

Quite a strange problem. I'm not sure at what level this goes wrong. Most likely not in BOINC, since it was sending WUs to devices 0 and 1, as shown by the manager and the FGRPB1G processes. Is it the Einstein executable that ignores the device parameter (and runs everything on device 0), or is it something in OpenCL that schedules these tasks onto the more powerful card?

I'm a bit out of ideas...

cecht

Joined: 7 Mar 18

Posts: 1619

Credit: 3031610233

RAC: 1446504

9 May 2019 1:21:19 UTC

Message 171215

Koschi wrote:

However, it runs next to an RX580 but steals all the work from it (not the Vega's fault, though).

Is the RX 580 getting enough power? Perhaps try using amdgpu-utils (amdgpu-pac --execute) to state mask the Vega, so it draws less system power while crunching, and see if the 580 then goes to work. Maybe even state mask both cards to really drop their combined power needs for this test. I'd recommend a state mask of 0,4 for both just to see if it's a power issue (don't forget to suspend BOINC Mgr before applying the masks). For this test, I think masking might work better than power capping.

Ideas are not fixed, nor should they be; we live in model-dependent reality.

koschi

Joined: 17 Mar 05

Posts: 87

Credit: 1719058593

RAC: 211941

9 May 2019 8:17:34 UTC

Message 171220

My system is powered by a BeQuiet Straight Power 11 650W (93% / Gold). The base system (undervolted R7 1700) draws 120W under load and the RX580 with mining BIOS draws 82W doing FGRP, so there should be plenty of room for the V56 (Sapphire Pulse), which has a default PL of 180W.
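The budget above is easy to check by adding up the quoted figures. A small sketch using only the numbers from this post follows. It is a steady-state sum and cannot show short load spikes.

```python
PSU_WATTS = 650  # BeQuiet Straight Power 11 rating from the post

# Steady-state draws as quoted in the post (watts).
loads = {
    "base system (undervolted R7 1700)": 120,
    "RX580 (mining BIOS, FGRP)": 82,
    "Vega 56 (default power limit)": 180,
}

total = sum(loads.values())
headroom = PSU_WATTS - total

print(f"{total} W of {PSU_WATTS} W, {headroom} W headroom ({total / PSU_WATTS:.0%} load)")
# prints: 382 W of 650 W, 268 W headroom (59% load)
```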

I did set the SCLK mask to 0,4 (and 950mV on the Vega) for both cards. It drops the power consumption from 180W to 140W on the V56, but doesn't get the RX580 crunching. The RX580 is the primary card in this setup and renders my desktop etc. very well, so it's not entirely disabled, just not working on Einstein tasks now.

koschi

Joined: 17 Mar 05

Posts: 87

Credit: 1719058593

RAC: 211941

9 May 2019 15:30:14 UTC

Message 171225

I fired up an Ubuntu 19.04 installation with the AMDGPU-PRO 18.50 OpenCL, kernel 5.0.0 and BOINC 7.14.2.

Same issue: both tasks are computed on the Vega and the Polaris card doesn't get any work.

koschi

Joined: 17 Mar 05

Posts: 87

Credit: 1719058593

RAC: 211941

21 May 2019 22:07:59 UTC

Message 171400

Long story short, the problem described above seems to result from the AMDGPU-PRO OpenCL not being able to handle the legacy (Polaris) and PAL (Vega) implementations at once; instead it schedules everything onto the Vega.

Replacing official OpenCL with AMD ROCm did the trick, though resulting in slightly slower computing. The RX580 made up for it though.

Now that the RX580 is removed again, I went back to OpenCL PAL from 19.10 on my Ubuntu 19.04. WU completion times decreased again. At 940MHz HBM2, 160W and 3 dedicated threads for Einstein, I am able to complete 2 WUs in around 9:45min, which results in a theoretical RAC of over 1 million credits per day. Tuning isn't final yet; I am aiming at <10min with 140W, let's see whether that works out. The Vega needs to be supported by free CPU capacity, so right now I keep 2 x 1.5 cores free, as even with 2 x 1 core the runtime increases by 15 sec.
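The "over 1 million" figure can be reproduced with a few lines of arithmetic. The sketch below takes the batch time and power from this post. The credit per task (3465) is an assumption that is not stated anywhere in this thread, so treat the credit result as illustrative only.

```python
SECONDS_PER_DAY = 86400
CREDIT_PER_TASK = 3465  # ASSUMPTION: not stated in this thread


def tasks_per_day(batch_seconds, tasks_per_batch):
    """Tasks finished per day when a batch of tasks completes every batch_seconds."""
    return SECONDS_PER_DAY / batch_seconds * tasks_per_batch


def tasks_per_kwh(tasks_day, watts):
    """Tasks per kWh of GPU power, ignoring CPU and the rest of the system."""
    return tasks_day / (watts * 24 / 1000)


current = tasks_per_day(9 * 60 + 45, 2)   # 2 WUs in 9:45 at 160 W
target = tasks_per_day(10 * 60, 2)        # tuning goal: 2 WUs in 10:00 at 140 W

print(f"current: {current:.1f} tasks/day, {current * CREDIT_PER_TASK:,.0f} credits/day")
print(f"current: {tasks_per_kwh(current, 160):.1f} tasks/kWh")
print(f"target:  {target:.1f} tasks/day, {tasks_per_kwh(target, 140):.1f} tasks/kWh")
```

With these inputs the current setting works out to roughly 295 tasks per day. The 140W goal would finish slightly fewer tasks per day but more tasks per kWh.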

My Sapphire Pulse Vega 56 now costs just 275€ in Germany, which puts it in a really nice sweet spot. IMHO it offers more throughput and better efficiency (at 160W) than Polaris cards at around the same price (1 Vega <-> 2 Polaris). Looking in the other direction, two Vega 56s have more throughput than a Radeon VII for a lower price. However, they will consume more energy and PCIe slots.

QuantumHelos

Joined: 5 Nov 17

Posts: 190

Credit: 65875180

RAC: 2676

22 May 2019 11:27:11 UTC

Message 171407 in response to message 171400

koschi wrote:

Long story short, the problem described above seems to result from the AMDGPU-PRO OpenCL not being able to handle the legacy (Polaris) and PAL (Vega) implementations at once; instead it schedules everything onto the Vega.

Replacing official OpenCL with AMD ROCm did the trick, though resulting in slightly slower computing. The RX580 made up for it though.

Now that the RX580 is removed again, I went back to OpenCL PAL from 19.10 on my Ubuntu 19.04. WU completion times decreased again. At 940MHz HBM2, 160W and 3 dedicated threads for Einstein, I am able to complete 2 WUs in around 9:45min, which results in a theoretical RAC of over 1 million credits per day. Tuning isn't final yet; I am aiming at <10min with 140W, let's see whether that works out. The Vega needs to be supported by free CPU capacity, so right now I keep 2 x 1.5 cores free, as even with 2 x 1 core the runtime increases by 15 sec.

My Sapphire Pulse Vega 56 now costs just 275€ in Germany, which puts it in a really nice sweet spot. IMHO it offers more throughput and better efficiency (at 160W) than Polaris cards at around the same price (1 Vega <-> 2 Polaris). Looking in the other direction, two Vega 56s have more throughput than a Radeon VII for a lower price. However, they will consume more energy and PCIe slots.

So ROCm does the trick! ROCm is being upgraded at AMD thanks to projects like:

https://www.amd.com/system/files/documents/lawrence-livermore-national-laboratory-case-study.pdf

https://www.amd.com/en/case-studies/lawrence-livermore-national-laboratory

and Frontier ... (note the ROCm update for Cray systems)

VinodK

Joined: 31 Jan 17

Posts: 15

Credit: 246751087

RAC: 0

1 Feb 2020 0:01:36 UTC

Message 175501

I am getting a Vega 64 card tomorrow and am thinking about undervolting as well. The current way seems to be "reduce voltage -> run some WUs -> reduce more if there are no invalids and repeat". I am wondering whether there is a better way than this. All the online instructions are gaming focused.
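The step-down procedure quoted above can be written as a simple loop. This is only a sketch of that procedure, not a recommended tuning method. The batch_is_clean callback is hypothetical: on a real host it would have to apply the voltage, run a batch of WUs and wait for the validation results. All voltages in the example are made-up numbers.

```python
def find_lowest_stable_voltage(start_mv, step_mv, floor_mv, batch_is_clean):
    """Lower the voltage step by step; return the last value whose batch was clean.

    batch_is_clean(mv) is a hypothetical callback that returns True when a
    batch of WUs run at mv millivolts produced no invalid results.
    Returns None if even the starting voltage fails.
    """
    last_good = None
    mv = start_mv
    while mv >= floor_mv:
        if not batch_is_clean(mv):
            break  # first failure: stop and keep the previous voltage
        last_good = mv
        mv -= step_mv
    return last_good


# Stand-in for real testing: pretend the card is stable down to 950 mV.
def fake_batch(mv):
    return mv >= 950


print(find_lowest_stable_voltage(1050, 25, 850, fake_batch))
```

In practice it is probably wise to settle one or two steps above the value the loop returns, since a single clean batch does not prove long-term stability.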

solling2

Joined: 20 Nov 14

Posts: 219

Credit: 1579977860

RAC: 65566

4 Feb 2020 9:14:29 UTC

Message 175520 in response to message 175501

VinodK wrote:

I am getting a Vega 64 card tomorrow and am thinking about undervolting as well. The current way seems to be "reduce voltage -> run some WUs -> reduce more if there are no invalids and repeat". I am wondering whether there is a better way than this. All the online instructions are gaming focused.

When going through previous posts in this forum you'll notice different approaches. They all have their justification. It just depends on the cruncher's goal: highest throughput, lowest power draw, or best bang for the buck? Some are happy with just lowering the power limit. I prefer to reduce the voltage and raise the memory clock. Also check how many tasks you can best run at the same time. Please report how your efforts go. :-)

VinodK

Joined: 31 Jan 17

Posts: 15

Credit: 246751087

RAC: 0

5 Feb 2020 4:10:10 UTC

Message 175549 in response to message 175520

I got the card and was having a terrible time with system crashes: it was crashing the system every few hours. I have a decent power supply, an EVGA 850W Gold, but it looks like that is not enough. I am testing with the power limit set to -50%, and so far there have been no crashes. The performance loss doesn't seem too large.
