Discussion Thread for the Continuous GW Search known as O2MD1 (now O2MDF - GPUs only)

Mr P Hucker
Mr P Hucker
Joined: 12 Aug 06
Posts: 838
Credit: 519315126
RAC: 13909

Peter Hucker wrote:
In the last day, my O2MDF WUs have been taking 4-5 times longer.  Is this normal or is there something up with my system?  Gamma on the same GPU takes the usual time.

To answer my own question, I was running two on one GPU (as it got them done quicker - not waiting on the CPU so much).  I noticed the GPU RAM was full when running two at once (a WU needs 2.4GB, the card has 4GB), so it was using main system RAM, which presumably is a lot slower.  Changing it to run one at a time on the GPU has speeded it back up again.  I guess the WUs recently got large in RAM size.
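
For anyone who wants to flip between 1x and 2x the same way: the usual BOINC mechanism is either the GPU utilization factor setting in the Einstein@Home project preferences, or an app_config.xml in the project directory. A minimal sketch is below; the app name is my assumption, so check client_state.xml (or the task properties in BOINC Manager) for the exact name on your host.

<app_config>
   <app>
      <name>einstein_O2MDF</name>
      <gpu_versions>
         <gpu_usage>1.0</gpu_usage>
         <cpu_usage>1.0</cpu_usage>
      </gpu_versions>
   </app>
</app_config>

gpu_usage 1.0 runs one task per GPU; 0.5 runs two at once. The client picks up changes after Options -> Read config files or a restart.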

If this page takes an hour to load, reduce posts per page to 20 in your settings, then the tinpot 486 Einstein uses can handle it.

cecht
cecht
Joined: 7 Mar 18
Posts: 1535
Credit: 2909648736
RAC: 2107777

Ian&Steve C. wrote:

Does Wattman run on Linux though?

does Rick’s tool show PCIe utilization for AMD cards?

Rick's tool, amdgpu-utils, has many of Wattman's features, but it does not currently show PCIe bus utilization. I may put that to him as a feature request. It seems that there is a utility, Processor Counter Monitor, that can do that, but I haven't delved into it and am not sure whether it can handle 8th and 9th gen Intel CPUs. (From what I gather, measuring PCIe bus utilization is done on the CPU side, not on the GPU device side.)

EDIT/UPDATE: So I got the PCM program running and the command to report PCIe activity:

~/opcm/pcm$ sudo ./pcm-pcie.x 1

results in this:

Detected Intel(R) Pentium(R) Gold G5600 CPU @ 3.90GHz "Intel(r) microarchitecture codename Kabylake" stepping 11 microcode level 0xca
Jaketown, Ivytown, Haswell, Broadwell-DE Server CPU is required for this tool! Program aborted


*sigh*

Other modules of the PCM package do work, however, and provide all sorts of deep information on CPU metrics.
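
For anyone else poking at PCM: the general-purpose monitor in that same build directory starts the same way (binary names carry the .x suffix in this source tree; other PCM versions may drop it):

~/opcm/pcm$ sudo ./pcm.x 1

That prints per-core utilization, IPC, cache misses and per-socket memory traffic once per second, and pcm-memory.x does the same for memory-controller bandwidth.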

BTW, PCM also runs in Windows and Mac OSX.

Ideas are not fixed, nor should they be; we live in model-dependent reality.

cecht
cecht
Joined: 7 Mar 18
Posts: 1535
Credit: 2909648736
RAC: 2107777

Peter Hucker wrote:
To answer my own question, I was running two on one GPU (as it got them done quicker - not waiting on the CPU so much).  I noticed the GPU RAM was full when running two at once (a WU needs 2.4GB, the card has 4GB), so it was using main system RAM, which presumably is a lot slower.  Changing it to run one at a time on the GPU has speeded it back up again.  I guess the WUs recently got large in RAM size.

I see from your RX 560 host that those increased times were for a few tasks in the VelaJr data series, which have been known to bog down run times.  It's good to know though that changing task runs to 1x can resolve the issue.

For the O2MDFG3a_G34731 data series that my host (two RX 570s) is currently running, GPU memory of 4 GB doesn't appear to be a limiting factor for 2x runs on a single card.  Across different run configurations, here are my average GPU memory loads, sampled for 10 min with readings every 2 sec:

Run config.   GPU mem. load
1 GPU@1x = 25%
1 GPU@2x = 64%
2 GPU@1x = 20% across both cards
2 GPU@2x = 18% across both cards

So for 2x tasks on one card, there seems to be plenty of memory headroom.
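
If anyone wants to repeat that kind of sampling on Linux, the amdgpu driver exposes VRAM numbers through sysfs, so a crude sampler is just a shell loop. A sketch, assuming the cards show up as card0 and card1 and your kernel provides the mem_info_vram_* files (otherwise rocm-smi or Rick's amdgpu-utils report the same values):

#!/bin/bash
# Log VRAM load on two amdgpu cards every 2 s for 10 min (300 samples).
for i in $(seq 1 300); do
    for card in card0 card1; do
        used=$(cat /sys/class/drm/$card/device/mem_info_vram_used)
        total=$(cat /sys/class/drm/$card/device/mem_info_vram_total)
        echo "$(date +%T)  $card  $(( 100 * used / total ))% VRAM used"
    done
    sleep 2
done

Redirect the output to a file and average it afterwards to get per-configuration numbers like the ones above.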

Ideas are not fixed, nor should they be; we live in model-dependent reality.

Keith Myers
Keith Myers
Joined: 11 Feb 11
Posts: 4964
Credit: 18752320316
RAC: 7121602

Ian&Steve C. wrote:

Does Wattman run on Linux though?

does Rick’s tool show PCIe utilization for AMD cards?

I offered what I know about.  Wattman is for Windows hosts.  Rick wanted something similar for Linux, so he developed his own tool.

I am not sure whether it shows PCIe utilization; I have never owned an AMD card, I just followed his commentary as he developed the utility.  I think it does, if I remember correctly: if the parameter gets exposed by the OS, he has figured out how to dig the information out.

[Edit] Thanks @cecht for the info that Rick's tool does not show PCIe utilization, yet.

Mr P Hucker
Mr P Hucker
Joined: 12 Aug 06
Posts: 838
Credit: 519315126
RAC: 13909

cecht wrote:
Peter Hucker wrote:
To answer my own question, I was running two on one GPU (as it got them done quicker - not waiting on the CPU so much).  I noticed the GPU RAM was full when running two at once (a WU needs 2.4GB, the card has 4GB), so it was using main system RAM, which presumably is a lot slower.  Changing it to run one at a time on the GPU has speeded it back up again.  I guess the WUs recently got large in RAM size.

I see from your RX 560 host that those increased times were for a few tasks in the VelaJr data series, which have been known to bog down run times.  It's good to know though that changing task runs to 1x can resolve the issue.

For the O2MDFG3a_G34731 data series that my host (two RX 570s) is currently running, GPU memory of 4 GB doesn't appear to be a limiting factor for 2x runs on a single card.  Across different run configurations, here are my average GPU memory loads, sampled for 10 min with readings every 2 sec:

Run config.   GPU mem. load
1 GPU@1x = 25%
1 GPU@2x = 64%
2 GPU@1x = 20% across both cards
2 GPU@2x = 18% across both cards

So for 2x tasks on one card, there seems to be plenty of memory headroom.

I was seeing 2.4 GB usage for a 1x task, and 4 GB (plus a fair bit of dynamic memory, i.e. system RAM) for 2x tasks.

Now at 3.4 GB running a 1x task: h1_1396.55_O2C02Cl4In0__O2MDFV2g_VelaJr1_1397.35Hz_307

Your numbers don't make sense: why were you getting no increase at all for 2 GPU@2x over 2 GPU@1x? (I assume that means 2 GPUs with two WUs each.)

If this page takes an hour to load, reduce posts per page to 20 in your settings, then the tinpot 486 Einstein uses can handle it.

cecht
cecht
Joined: 7 Mar 18
Posts: 1535
Credit: 2909648736
RAC: 2107777

Peter Hucker wrote:
Your numbers don't make sense: why were you getting no increase at all for 2 GPU@2x over 2 GPU@1x? (I assume that means 2 GPUs with two WUs each.)

Good question. That's the topic of my earlier post here, where I was guessing that GPU utilization for 2 GPU@2x (yes, 4 concurrent tasks) is throttled by limited CPU resources. If it's a matter of the number of cores and their speed, then a major CPU upgrade should much improve daily task yields. If it's a matter of limited PCIe bandwidth, as Zalster suggested, then I (we) may be stuck, because all the consumer Intel CPUs that I've looked at list "Max # of PCI Express Lanes" as 16 and "PCI Express Configurations: Up to 1x16, 2x8, 1x8+2x4", meaning that each of two GPUs will get no more than 8 PCIe lanes. That hasn't mattered for FGRPB1G work, which can get by fine with 2 lanes, but it MAY be a limiting factor for boosting GW productivity.
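
One way to see what link each card actually negotiated, as opposed to what the CPU could supply, is lspci. The bus address below is a placeholder; take the real one from the first command:

lspci | grep -i vga
sudo lspci -s 01:00.0 -vv | grep -E 'LnkCap|LnkSta'

LnkCap is what the card supports, LnkSta is the speed and width it is currently running at; many cards drop to a lower speed at idle, so check while a task is crunching.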

EDIT: I just saw over in Problems and Bug Reports that Gary Roberts has been running tasks at 3x on an RX570 with 12 min task times for non-VelaJr tasks. I'll give that a shot because, based on those times, it looks like a single card at 3x would increase daily task productivity over that of two cards at 1x.

Ideas are not fixed, nor should they be; we live in model-dependent reality.

Mr P Hucker
Mr P Hucker
Joined: 12 Aug 06
Posts: 838
Credit: 519315126
RAC: 13909

cecht wrote:
Peter Hucker wrote:
Your numbers don't make sense: why were you getting no increase at all for 2 GPU@2x over 2 GPU@1x? (I assume that means 2 GPUs with two WUs each.)

Good question. That's the topic of my earlier post here, where I was guessing that GPU utilization for 2 GPU@2x (yes, 4 concurrent tasks) is throttled by limited CPU resources. If it's a matter of the number of cores and their speed, then a major CPU upgrade should much improve daily task yields. If it's a matter of limited PCIe bandwidth, as Zalster suggested, then I (we) may be stuck, because all the consumer Intel CPUs that I've looked at list "Max # of PCI Express Lanes" as 16 and "PCI Express Configurations: Up to 1x16, 2x8, 1x8+2x4", meaning that each of two GPUs will get no more than 8 PCIe lanes. That hasn't mattered for FGRPB1G work, which can get by fine with 2 lanes, but it MAY be a limiting factor for boosting GW productivity.

EDIT: I just saw over in Problems and Bug Reports that Gary Roberts has been running tasks at 3x on an RX570 with 12 min task times for non-VelaJr tasks. I'll give that a shot because, based on those times, it looks like a single card at 3x would increase daily task productivity over that of two cards at 1x.

You should be able to check CPU and GPU usage fairly easily.  If neither of those is reaching 100%, then it could be the bus.  Most of my machines have rubbish CPUs compared to the GPUs, so either I can't run Gravity at all or the GPU sits at something like 25% usage; I put those on Gamma only.  I have only one machine (the main one I'm using now) which can handle Gravity tasks (good CPU, rubbish GPU).  The others are all BOINC-only machines built out of whatever cheap parts I find; I tend to go for powerful GPUs, and the CPUs are old crap.  There is a dual Xeon coming shortly which I was hoping would power its GPUs on Gravity, running several WUs per GPU so that many of the 24 CPU cores could help, but with the higher GPU memory usage for Gravity WUs now, that won't be possible.  The GPUs only have 3 GB each.
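
On a Linux box with the amdgpu driver both checks are quick, something like the two commands below (Windows hosts would use Task Manager and Wattman instead):

top -bn1 | head -n 5
cat /sys/class/drm/card0/device/gpu_busy_percent

The first shows overall CPU load, the second the GPU core utilisation of the first AMD card; gpu_busy_percent needs a reasonably recent kernel.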

If this page takes an hour to load, reduce posts per page to 20 in your settings, then the tinpot 486 Einstein uses can handle it.

Gary Roberts
Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5872
Credit: 117697625801
RAC: 35086750

cecht wrote:
EDIT: I just saw over in Problems and Bug Reports that Gary Roberts has been running tasks at 3x on an RX570 with 12 min task times for non-VelaJr tasks.

Some time ago, when the frequency component of the task name was much, much lower, I was very happily running the then tasks at 4x. There was a reasonably good further improvement over 3x at that time.  There was a lift in frequency, and also a change in the target pulsar that caused tasks to fail.  I noticed it immediately at the time and reduced to 3x and have been running at that multiplicity ever since, with no further problems.

As discussed in the thread you just linked to, the latest Vela tasks (at high frequency) seem likely to cause failures again and I'll very soon be forced to reduce to 2x.  I tend to think there are two factors at work that determine the memory requirements: the target pulsar and the frequency term.  I say that in particular because lower-frequency VelaJr tasks were crunched at 3x.  The obvious change now is frequency, but maybe there's something else as well.

I think the frequency term is not the actual spin frequency of the pulsar (there are values approaching 2000 Hz) but rather some multiple of the spin frequency (perhaps 4 or larger).  Millisecond pulsars are known to exist (I don't know the spin frequencies of the target pulsars off-hand), but it's a bit hard to imagine anything rotating with a period of only 0.5 milliseconds :-).
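
Spelling out the arithmetic: a term near 2000 Hz implies a 0.5 ms period only if it is the spin frequency itself; if it is some multiple k of the spin frequency, the implied spin period is k times longer. (The k = 4 value below is just my guess above, not something from the search documentation.)

\[ P_\mathrm{spin} = \frac{k}{f_\mathrm{term}}, \qquad f_\mathrm{term} \approx 2000\ \mathrm{Hz}: \quad k = 1 \Rightarrow P_\mathrm{spin} \approx 0.5\ \mathrm{ms}, \qquad k = 4 \Rightarrow P_\mathrm{spin} \approx 2\ \mathrm{ms}. \]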

I've been trying to find time to congratulate you on your excellent results presentation earlier.  It's a very valuable addition to the knowledge-base available to volunteers.  It seems that nobody on the Staff side has the time to test this app on a range of typical equipment and give out advice on what to expect so it's up to us to measure and publish whenever possible.  Huge thanks for doing just that!

Cheers,
Gary.

cecht
cecht
Joined: 7 Mar 18
Posts: 1535
Credit: 2909648736
RAC: 2107777

cecht wrote:
... If it's a matter of limited PCIe bandwidth, as Zalster suggested, then I (we) may be stuck because all the consumer Intel CPUs that I've looked at list  "Max # of PCI Express Lanes" as 16 and "PCI Express Configurations: Up to 1x16, 2x8, 1x8+2x4".

Intel's i9 X-series CPUs have 36+ maximum PCIe lanes, but require a motherboard with an LGA2066 socket.

Ideas are not fixed, nor should they be; we live in model-dependent reality.

cecht
cecht
Joined: 7 Mar 18
Posts: 1535
Credit: 2909648736
RAC: 2107777

Gary Roberts wrote:
I've been trying to find time to congratulate you on your excellent results presentation earlier.  It's a very valuable addition to the knowledge-base available to volunteers.  It seems that nobody on the Staff side has the time to test this app on a range of typical equipment and give out advice on what to expect so it's up to us to measure and publish whenever possible.  Huge thanks for doing just that!

LOL, my pleasure. It keeps me occupied during the pandemic "lock-down".

Ideas are not fixed, nor should they be; we live in model-dependent reality.
