The O2-All Sky Gravitational Wave Search on GPUs - discussion thread.

cecht
Joined: 7 Mar 18
Posts: 1533
Credit: 2898948900
RAC: 2133963

While running only V1.07 tasks, with app_config.xml set like this:

<app>
  <name>einstein_O2AS20-500</name>
  <gpu_versions>
    <gpu_usage>0.25</gpu_usage>
    <cpu_usage>0.4</cpu_usage>
  </gpu_versions>
</app>
<project_max_concurrent>8</project_max_concurrent>

...I have 8 concurrent tasks running with 3 CPU threads @ ~90% usage, 1 @ ~98%.
But with this, at full CPU usage:

<app>
  <name>einstein_O2AS20-500</name>
  <gpu_versions>
    <gpu_usage>0.25</gpu_usage>
    <cpu_usage>1</cpu_usage>
  </gpu_versions>
</app>
<project_max_concurrent>8</project_max_concurrent>

BOINC can only run 5 concurrent tasks, with 3 threads @ ~76% and 1 @ ~90%; clearly not the way to go.

So limiting cpu_usage is needed to squeeze in the full complement of GPU tasks. I'm waiting to see how those task completion times at 4x compare with previous times running 3x. First, though, I need to stagger the eight (or four) task progress points. Ugh. Does anybody know a less labor-intensive way to do that?

Ideas are not fixed, nor should they be; we live in model-dependent reality.

archae86
Joined: 6 Dec 05
Posts: 3157
Credit: 7220224931
RAC: 966899

cecht wrote:
First, though, I need to stagger the eight (or 4) task progress points. Ugh. Does anybody know a less labor intensive way to do that?

You may get a modest reduction in labor by not trying too hard to get them evenly spaced.  I think for this application the great majority of the task run time has equivalent compute load characteristics, so the goal of avoiding efficiency loss by having two tasks simultaneously using too little of one resource or too much of another can probably be met by spacing them as little as ten minutes elapsed time apart (which is a smallish fraction, given you are planning to run this batch at 4X).

If you don't use BoincTasks, you may wish to consider it. Monitoring and adjusting a small to medium-sized multi-host fleet seems to be what many of us like about it. (I doubt it is suitable for a Gary-size fleet.)

Matt White
Joined: 9 Jul 19
Posts: 120
Credit: 280798376
RAC: 0

cecht wrote:
During most of a run, CPU usage was steady at ~80% for two threads and ~90% for the other two threads, but when tasks on two GPUs hit the 99% mark, CPU usage held at ~95% and 100% for about three minutes until the tasks completed.

I noticed similar behavior on my first two Linux tasks, although my CPU usage was about 38~40% until the last few minutes, where it went into CPU overdrive (99%). The tasks completed and validated without error about ten minutes ago. The crunch time was 3 hours and 36 minutes for one, and 3 hours and 42 minutes for the other, both running on the RX560 GPU.

The two NVIDIA tasks on the other box will be done shortly. It is a slower card so these are taking a bit longer.

Clear skies,
Matt
cecht
Joined: 7 Mar 18
Posts: 1533
Credit: 2898948900
RAC: 2133963

Running V1.07 at 4x tasks per GPU (and cpu_usage = 0.25), my average task time for 16 pending tasks is 35.3 minutes, compared to 35.7 minutes for 11 pending tasks run at 3x (and cpu_usage = 0.5). So, no real performance advantage at higher multiplicities on my system. I'm going back to 3x.
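
To put a number on "no real performance advantage", here is a minimal Python sketch of the comparison. It assumes the two averages quoted above are directly comparable per-task times (the post does not say exactly how they were derived), and the function name is made up for illustration.

```python
def relative_improvement(old_minutes: float, new_minutes: float) -> float:
    """Fractional reduction in average task time going from old to new."""
    return (old_minutes - new_minutes) / old_minutes


# Averages quoted in the post: 35.7 min at 3x (11 tasks), 35.3 min at 4x (16 tasks).
gain = relative_improvement(old_minutes=35.7, new_minutes=35.3)
print(f"4x vs 3x: {gain:.1%} shorter average task time")
```

With samples of only 11 and 16 tasks, a difference of roughly one percent could easily be noise, which fits the decision to go back to 3x.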

Dang it though, I have only 5 tasks validated so far, with only 1 task validated in the past 6 hr (since 16:21 UTC).

Ideas are not fixed, nor should they be; we live in model-dependent reality.

cecht
Joined: 7 Mar 18
Posts: 1533
Credit: 2898948900
RAC: 2133963

archae86 wrote:
You may get a modest reduction in labor by not trying too hard to get them evenly spaced. ...  Monitoring and adjusting a multi-host fleet in the small to medium size seems to be what many of us like about it.  (I doubt it suitable for a Gary-size fleet).

Ha! Right you are. It wasn't much of a bother after all. I wasn't so concerned about even spacing as about keeping task completions at least about 3 minutes apart, which is the time it takes to finish up the last 1% and during which a CPU thread is fully occupied. As I've always done with the binary pulsar GPU tasks, I just check in a few times during the day to give it a poke and make sure everything is running smoothly.
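
For anyone who wants to check the spacing without eyeballing it, here is a rough Python sketch of the ~3 minute rule. It assumes progress is roughly linear in elapsed time and that every task has the same total run time; the function name and the example numbers are hypothetical, not taken from any BOINC tool.

```python
def completions_well_spaced(progress, task_minutes, min_gap_minutes=3.0):
    """Return True if no two tasks are expected to finish within
    min_gap_minutes of each other.

    progress     -- fraction done (0.0 to 1.0) for each running task
    task_minutes -- assumed total elapsed time per task (hypothetical value)
    """
    # Estimated minutes remaining for each task, soonest first.
    remaining = sorted((1.0 - p) * task_minutes for p in progress)
    # Gap between each pair of consecutive expected completions.
    gaps = [later - sooner for sooner, later in zip(remaining, remaining[1:])]
    return all(gap >= min_gap_minutes for gap in gaps)


# Four tasks spread 10% apart on a 100-minute task: about 10 minutes between finishes.
print(completions_well_spaced([0.10, 0.20, 0.30, 0.40], task_minutes=100))
# Two tasks only 1% apart: they would finish about a minute apart.
print(completions_well_spaced([0.50, 0.51], task_minutes=100))
```

The progress fractions would have to be read off the BOINC Manager (or BoincTasks) by hand; the sketch only does the arithmetic.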

Ideas are not fixed, nor should they be; we live in model-dependent reality.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5872
Credit: 117522230234
RAC: 35400308

cecht wrote:

... But with this, at full CPU usage:

<app>
  <name>einstein_O2AS20-500</name>
  <gpu_versions>
    <gpu_usage>0.25</gpu_usage>
    <cpu_usage>1</cpu_usage>
  </gpu_versions>
</app>
<project_max_concurrent>8</project_max_concurrent>

BOINC can only run 5 concurrent tasks, with 3 threads @ ~76% and 1 @ ~90%; clearly not the way to go.

The only thing surprising about that is that you got 5 tasks rather than 4.  That must somehow mean that cpu_usage of 1 is being treated internally as something like 0.9999999.....  Seems to point at some sort of rounding bug.  4 times that internal representation must somehow be being seen as less than 4 so that BOINC still thinks it has a (perhaps tiny) fraction of a CPU thread available to support the fifth concurrent task.  BOINC will always run that extra task even if the fraction of a thread remaining is way less than what the 'budget' specifies.

Think of things this way (perhaps) :-).  The problem with over-specifying the 'budget' is that it could limit the number of tasks you are trying to run.  So, in order to run 8 tasks, don't exceed cpu_usage of 0.5 when you have 4 threads to play with.  The problem with under-specifying is that BOINC may think that it has threads available with which it could run totally unrelated tasks.  If there is nothing else that BOINC could run (no ready to run CPU tasks from this or any other project) there is no problem with under-specifying as the running GPU tasks will grab whatever they need.  The GPU tasks are not limited by the budget specification.
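
The budget arithmetic can be sketched in a few lines of Python. This is only a toy model of the behaviour described in this post, not BOINC's actual scheduler code: the function name, the greedy loop, and the rule "start another task while any CPU budget remains" are assumptions made for illustration. The host figures (4 threads, 2 GPUs) are the ones described for cecht's machine in this thread.

```python
def max_concurrent_gpu_tasks(ncpus, cpu_usage, ngpus, gpu_usage, project_max_concurrent):
    """Toy estimate of how many GPU tasks would start under a simple budget model."""
    gpu_slots = round(ngpus / gpu_usage)  # e.g. 2 GPUs at 0.25 each -> 8 slots
    running = 0
    cpu_committed = 0.0
    # Assumed rule: keep starting tasks while a GPU slot is free, the project
    # cap is not reached, and *any* fraction of the CPU budget is uncommitted.
    while (running < gpu_slots
           and running < project_max_concurrent
           and cpu_committed < ncpus):
        running += 1
        cpu_committed += cpu_usage
    return running


for usage in (0.4, 0.5, 1.0, 0.9999999):
    n = max_concurrent_gpu_tasks(ncpus=4, cpu_usage=usage, ngpus=2,
                                 gpu_usage=0.25, project_max_concurrent=8)
    print(f"cpu_usage={usage}: {n} tasks")
```

Under these assumptions, 0.4 and 0.5 both allow all 8 tasks, an exact 1.0 allows only 4, and a value a hair under 1 allows a fifth task, which is the kind of rounding effect speculated about above.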

cecht wrote:
First, though, I need to stagger the eight (or 4) task progress points. Ugh. Does anybody know a less labor intensive way to do that?

There are probably going to be 'congestion points' around both task startup and task completion.  From a few comments made so far (and not from relevant experience) my guess is that what Peter has suggested makes a lot of sense.  You don't need to get things absolutely equally spaced so don't stress too much about it, certainly not until we get some experience with how these tasks really behave.

My intention is to wait until things settle as far as validation is concerned.  There's no point in disturbing the bulk of my fleet until it's clear that validation is no longer an issue.  Today, if there are no problems with validation showing up for V1.07 tasks, I'll switch the machine I've been using to test Southern Islands (Pitcairn) GPUs with the amdgpu driver (where the FGRP tasks have been computing without error but then failing to validate) over to GW.  You never know, I might just get lucky :-).  I won't know if I don't try :-).

My expectation is that GW tasks will probably fail validation as well - in which case it will go straight back to FGRP tasks and the fglrx driver and I'll start experimenting with GW on an RX 570 or 580 machine.  Hopefully in a few days time I might be able to comment sensibly about the advisability or otherwise of staggering the start times for concurrent tasks with the latest app.

Cheers,
Gary.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5872
Credit: 117522230234
RAC: 35400308

cecht wrote:
Running V1.07 at 4x tasks per GPU (and cpu_usage = 0.25), my average task time for 16 pending tasks is 35.3 minutes, compared to 35.7 minutes for 11 pending tasks run at 3x (and cpu_usage = 0.5). So, no real performance advantage at higher multiplicities on my system. I'm going back to 3x.

Thanks very much for that.  Very useful to know that result!

I think my first experiment might be (after confirming or otherwise about validation with SI GPUs) a side by side comparison of the 3x performance (using RX 570s) of a 2009 dual core against the fastest quad I can find so as to get a real 'feel' for the importance of both numbers of cores and CPU speed.

Cheers,
Gary.

Matt White
Joined: 9 Jul 19
Posts: 120
Credit: 280798376
RAC: 0

The NVIDIA tasks came in at 5 hours and 13 minutes. This is running two tasks concurrently on the GT1030 in my DL360, using one CPU per task. The NVIDIA application utilizes a much higher percentage of the CPU resources than the AMD one. The NVIDIA tasks ran without error, but are still pending validation.

Concerning the AMD and comparing to some of the other numbers posted here, the RX570 appears to be significantly faster than the RX560, especially on this application. Unfortunately, my RX560, although capable of PCIe 3, is being forced to run at PCIe 2, due to the motherboard limitations of the XW4600 it lives in. I suspect the card would run quicker were it not for that bottleneck.

Clear skies,
Matt
Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5872
Credit: 117522230234
RAC: 35400308

Matt White wrote:
... the RX570 appears to be significantly faster than the RX560, especially on this application.

This is not surprising. I have both types and the 570 is twice as fast as the 560 for the FGRPB1G app. It wouldn't surprise me to see a similar ratio for the new app. Also, I just looked at your host and the GPU is identified as an RX 460, which is a bit slower again than the 560. Are you running 2 tasks concurrently? If so, and when comparing to cecht's times, perhaps there is a further penalty for the RX 460. Your times seem to be about twice as long as cecht's 3x times.

Matt White wrote:
Unfortunately, my RX560, although capable of PCIe 3, is being forced to run at PCIe 2, due to the motherboard limitations of the XW4600 it lives in. I suspect the card would run quicker were it not for that bottleneck.

It's too early to be sure about that.  It really depends on the volume of data that has to flow back and forth and on how much processing is done at each end when it gets there.  If the volume is not too large and the processing time at each end is long enough, you may not see too much bottlenecking going on.  These are things that should become clearer over the coming days and weeks.

Similar situations have arisen and been overcome in the past.  There was a time when PCIe bus bandwidth was quite a problem for binary radio pulsar tasks.  Some smart programming changes were able to alleviate that.

Cheers,
Gary.

cecht
Joined: 7 Mar 18
Posts: 1533
Credit: 2898948900
RAC: 2133963

Things are rolling now! As of now, 22 of my V1.07 tasks have been validated (81 pending, 0 invalid), BUT everything that validated overnight (starting 15 Aug. CMT) did so with a minimum quorum of 1, instead of the usual 2.

Gary Roberts wrote:

The only thing surprising about that is that you got 5 tasks rather than 4.  That must somehow mean that cpu_usage of 1 is being treated internally as something like 0.9999999.....  Seems to point at some sort of rounding bug.  4 times that internal representation must somehow be being seen as less than 4 so that BOINC still thinks it has a (perhaps tiny) fraction of a CPU thread available to support the fifth concurrent task.  BOINC will always run that extra task even if the fraction of a thread remaining is way less than what the 'budget' specifies.

Think of things this way (perhaps) :-).  The problem with over-specifying the 'budget' is that it could limit the number of tasks you are trying to run.  So, in order to run 8 tasks, don't exceed cpu_usage of 0.5 when you have 4 threads to play with.  The problem with under-specifying is that BOINC may think that it has threads available with which it could run totally unrelated tasks.  If there is nothing else that BOINC could run (no ready to run CPU tasks from this or any other project) there is no problem with under-specifying as the running GPU tasks will grab whatever they need.  The GPU tasks are not limited by the budget specification.

Thanks for going over that. I hadn't thought of it in those terms. You did touch on that in a previous post, but it takes a while for things to sink in this ol' cortex. :)

Ideas are not fixed, nor should they be; we live in model-dependent reality.
