All-Sky Gravitational Wave Search on O3 data (O3ASHF1)

Bernd Machenschalk
Moderator
Administrator
Joined: 15 Oct 04
Posts: 4307
Credit: 249625492
RAC: 34149

This is not a matter of the application; the application binaries are identical. The distinction is in the workunits, notably in their command lines. The reason this distinction is implemented via different plan classes is that in BOINC you can specify the (CPU) memory usage per workunit, while GPU VRAM usage can only be specified in the plan class. Thus, different VRAM requirements require different plan classes.
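For illustration, this is roughly how per-plan-class VRAM requirements are expressed in a BOINC project's plan_class_spec.xml. The element names come from BOINC's plan-class mechanism, but the class names and memory values below are made-up examples, not the actual Einstein@Home settings:

```xml
<plan_classes>
    <!-- hypothetical plan class for the "old" workunits -->
    <plan_class>
        <name>GW-opencl-ati</name>
        <gpu_type>ati</gpu_type>
        <min_gpu_ram_mb>2048</min_gpu_ram_mb>     <!-- example value only -->
        <gpu_ram_used_mb>2048</gpu_ram_used_mb>
    </plan_class>
    <!-- hypothetical plan class for the "new", lower-VRAM workunits -->
    <plan_class>
        <name>GW-opencl-ati-2</name>
        <gpu_type>ati</gpu_type>
        <min_gpu_ram_mb>1024</min_gpu_ram_mb>
        <gpu_ram_used_mb>1024</gpu_ram_used_mb>
    </plan_class>
</plan_classes>
```

Per-workunit CPU memory (rsc_memory_bound) is set when the workunit is created, but the GPU RAM figures live only in the plan class, which is why different VRAM footprints force separate plan classes.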

And no, there are no plans to issue "old" and "new" workunits in parallel. Bookkeeping (e.g. about completed frequency ranges) would be a nightmare.

BM

Ian&Steve C.
Joined: 19 Jan 20
Posts: 3923
Credit: 45245722642
RAC: 63247247

ah, ok. unfortunate, but I understand the constraints better. I thought it was just a difference in the app itself.

archae86
Joined: 6 Dec 05
Posts: 3157
Credit: 7198024931
RAC: 872473

Bernd Machenschalk wrote:
We are planning a change to the current run. We will trade in a bit of runtime for memory. The workunits that we plan to produce in the future will run a bit longer (~10%).

On the three machines in my flotilla, the productivity degradation appears to be considerably more than 10%.

Today I conducted what I believe to be a well-controlled comparison on one machine.

During the trial the machine ran more than ten consecutive WUs of a given flavor sequentially, with pretty small variation in reported elapsed time. In each case I took the trouble to arrange the start and stop times of the three WUs in progress to be fairly evenly spaced (this actually matters a lot for recent GW GPU work, as the CPU vs. GPU usage varies greatly during the run).

Running at 3X multiplicity, the ...ati WU (previous style) elapsed time averaged about 25 minutes, while the ...ati-2 WU (new style) elapsed time averaged about 34 minutes.

In other words, on this system as currently operated, the old style was about 36% more productive than the new style.
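As a cross-check on that figure: at a fixed multiplicity, throughput is the number of concurrent WUs divided by the average elapsed time, so the productivity ratio reduces to the ratio of elapsed times. A quick sketch with the numbers above:

```python
# Throughput comparison at 3X multiplicity (times from the post above).
old_elapsed_min = 25.0   # "...ati" (old style) average elapsed time
new_elapsed_min = 34.0   # "...ati-2" (new style) average elapsed time

# At the same multiplicity, throughput is inversely proportional to elapsed time.
old_throughput = 3 / old_elapsed_min   # WUs per minute
new_throughput = 3 / new_elapsed_min

advantage = old_throughput / new_throughput - 1   # fractional advantage of old style
print(f"old style is {advantage:.0%} more productive")   # -> old style is 36% more productive
```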

As I understand it, the major purpose of the change was to increase the total task output of the user base on this particular work. The concern to be checked is whether the hoped-for improvement from adding machines that previously could not run the work might be more than wiped out by the productivity loss on the machines that were already running it (which may be more than 36%, as some users may withdraw their machines in discouragement).

I'm not claiming that my machine is typical (in fact, I could enumerate more than one way in which it is not); I'm just suggesting that the proposition needs to be checked.

Bernd Machenschalk
Moderator
Administrator
Joined: 15 Oct 04
Posts: 4307
Credit: 249625492
RAC: 34149

I can currently see that, on average, the overall runtime of a "new" task is not more than 10% longer than that of an "old" task on the same host. This is averaged over 1102 Windows and 116 Linux hosts, both with NVidia GPUs, based on >100k tasks they reported.

An issue with running multiple instances of this app that are started at the same time (in general, not only on the same GPU) is that they take a significant time to read the input data, depending on the I/O system, as there is only one input set. The way the current tasks are processed, this now happens twice: once at the beginning and once in the middle of a task. Given the relatively short overall runtime (<1h), a few minutes spent during initialization could make a big difference here. Looking at stderr, does the app spend significant time between "Loading SFTs" and "Search FstatMethod used"?

BM

Ben Scott
Joined: 30 Mar 20
Posts: 53
Credit: 1481438691
RAC: 3841760

I have to concur: on both my machines, running 2x and 3x for best throughput, the times are much more than 10% longer. My results are similar to what archae86 is reporting.

Bernd Machenschalk
Moderator
Administrator
Joined: 15 Oct 04
Posts: 4307
Credit: 249625492
RAC: 34149

Actually, the issue reported here might be responsible for a significant slowdown on multi-GPU systems. A fix (app version 1.07) is being released.

BM

archae86
Joined: 6 Dec 05
Posts: 3157
Credit: 7198024931
RAC: 872473

Bernd Machenschalk wrote:
I can currently see that on average the overall runtime of a "new" task compared to an "old" task on the same host is not longer by more than 10%.

Great. This means the trouble I am experiencing will not dominate the overall productivity result.

Quote:
Looking at stderr, does the App spend significant time between "Loading SFTs" and "Search FstatMethod used"?

Looking at the stderr from the machine on which I did my careful comparison, I see the following milestones for a WU that took 2043 elapsed seconds to run (running at 3X with evenly spaced start timing).

17:25:41.2398 (12280) [normal]: Start of BOINC application 
17:25:41.9587 (12280) [normal]: Loading SFTs
17:26:30.2127 (12280) [normal]: Search FstatMethod used
17:26:55.9027 (12280) [normal]: CG:18271827 FG:250000 
   (That is the last time-stamped entry before the first-pass "dot progress" begins and continues for many lines)

17:39:17.1455 (12280) [normal]: Finished main analysis.
17:43:15.7492 (12280) [normal]: Finished recalculating toplist
 (that appears to end the first pass, after which I show the second-pass milestones)

17:43:16.0148 (12280) [normal]: Parsed user input successfully
17:43:16.5930 (12280) [normal]: Loading SFTs matching
17:43:44.7350 (12280) [normal]: Search FstatMethod used
17:44:04.0953 (12280) [normal]: Finished reading input data
17:44:10.2517 (12280) [normal]: CG:18271827 FG:250000  f1dotmin_fg:-2.773529411765e-009 
  (then the "dot progress" for the second pass begins)

17:56:31.7073 (12280) [normal]: Finished main analysis
18:00:41.0919 (12280) [normal]: Finished recalculating toplist
18:00:41 (12280): called boinc_finish
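To answer the question about time spent between "Loading SFTs" and "Search FstatMethod used", those intervals can be computed directly from the timestamps in the log above. A small sketch (the timestamp format is as it appears in this stderr):

```python
from datetime import datetime

def delta_seconds(start: str, end: str) -> float:
    """Seconds between two HH:MM:SS.ffff timestamps from the stderr log."""
    fmt = "%H:%M:%S.%f"
    return (datetime.strptime(end, fmt) - datetime.strptime(start, fmt)).total_seconds()

# Timestamps copied from the log above.
pass1 = delta_seconds("17:25:41.9587", "17:26:30.2127")  # first "Loading SFTs" phase
pass2 = delta_seconds("17:43:16.5930", "17:43:44.7350")  # second "Loading SFTs" phase

total_elapsed = 2043.0  # reported elapsed seconds for this WU
share = (pass1 + pass2) / total_elapsed
print(f"SFT loading: {pass1:.1f}s + {pass2:.1f}s = {share:.1%} of the run")
```

For this WU the two loading phases total roughly 76 seconds out of 2043, i.e. under 4% of the run, so SFT loading does not appear to explain the slowdown on this particular machine.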
Boca Raton Community HS
Joined: 4 Nov 15
Posts: 235
Credit: 9926705586
RAC: 20323857

Bernd Machenschalk wrote:

Actually, the issue reported here might be responsible for a significant slowdown on multi-GPU systems. A fix (app version 1.07) is being released.

Feeling this one! It is interesting to watch an RTX A6000 try to work through 8 of these work units at the same time while the other GPU is idle. Trying to drain these tasks, but it looks like it is going to take some time.

Ben Scott
Joined: 30 Mar 20
Posts: 53
Credit: 1481438691
RAC: 3841760

On a related note, the run times differ, but not nearly as much as the estimated compute sizes the work units come with. The old ones are declared at 144,000 GFLOPS while the new ones are 720,000 GFLOPS. This causes some confusion in the estimated run times when both kinds are loaded at the same time.
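This matters because BOINC's initial runtime estimate is roughly the workunit's declared FLOP count (rsc_fpops_est) divided by the host's effective speed for the app. A sketch with the figures above; the 40 GFLOPS device speed is a made-up illustration:

```python
# BOINC's runtime estimate is roughly rsc_fpops_est / effective_flops.
old_fpops_est = 144_000e9   # "old" workunit: 144,000 GFLOPS declared
new_fpops_est = 720_000e9   # "new" workunit: 720,000 GFLOPS declared
device_flops = 40e9         # hypothetical effective speed: 40 GFLOPS

old_estimate_s = old_fpops_est / device_flops   # 3600 s
new_estimate_s = new_fpops_est / device_flops   # 18000 s

# Estimates differ by 5x even though actual runtimes differ far less,
# so mixing both kinds in the cache skews the duration correction.
print(new_estimate_s / old_estimate_s)   # -> 5.0
```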

Ian&Steve C.
Joined: 19 Jan 20
Posts: 3923
Credit: 45245722642
RAC: 63247247

The GFLOPS change happened a while ago, with the change from a 1000 cr reward to 5000 cr; scaling the reward is achieved by increasing the declared computation size.
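A quick consistency check on that explanation, using the figures quoted in this thread: the declared computation size scaled by the same factor as the credit.

```python
# Credit per task is proportional to the declared computation size,
# so a 5x credit bump (1000 cr -> 5000 cr) implies a 5x fpops bump.
credit_scale = 5000 / 1000
fpops_scale = 720_000 / 144_000   # declared GFLOPS, new vs. old workunits
print(credit_scale == fpops_scale)   # -> True
```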
