Hello!
At my new PC I have an unusually high failure rate of 27%, but only for O3 (GPU) tasks; for other GPU tasks it is close to zero. From the server status page for O3MDF I learned that (tasks failed) / (tasks valid + tasks failed) = 4.9%, which is also high, but not as high as on my side. O3MD1 even shows a failure rate of 43% on the server status page, and the average failure rate over all applications I calculated to be 8.7%. What a waste of crunching power!
Can I reduce this by adjustments at my side?
Kind regards and happy crunching
Martin
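For reference, the failure rate quoted above is simply failed / (valid + failed). Below is a minimal sketch of that calculation; the counts are made-up placeholders, not the actual numbers from the server status page.

```c
#include <stdio.h>

int main(void)
{
    /* Placeholder counts -- NOT the actual server status numbers. */
    long tasks_valid  = 95100;
    long tasks_failed =  4900;

    double failure_rate =
        (double)tasks_failed / (double)(tasks_valid + tasks_failed);

    printf("failure rate: %.1f%%\n", failure_rate * 100.0); /* 4.9% */
    return 0;
}
```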
I think that currently O3MD1 CPU workunits are generated with an erroneously small value for "memory bound". The workunits state ONLY ~1.9 GiB as the upper memory limit but actually allocate about 3.2 GiB.
This hinders the correct management of these tasks by BOINC and is possibly a reason for the currently high rate of 44% failed tasks for O3MD1 CPU, while this error rate for O3MDF GPU tasks is only about 4%.
I discussed this finding in more detail with example tasks in this thread in the 'problems' section.
See also server status page: https://einsteinathome.org/server_status.php
[updated 17 Apr 2023, 10:25:01 UTC]
"O3MD1" (CPU):
Tasks...
"O3MDF" (GPU):
This is true. While investigating, I stopped workunit generation for O3MD1.
BM
The latest O3MD1 CPU workunits now have NEGATIVE memory bound values.
Uuuh... an overflowing (signed) INT32 set to ~3.5*10^9 (~3.5 G bytes), which wraps around to ~ -796 M?
...
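For anyone curious, a minimal sketch of the suspected wrap-around (an illustration, not the project's actual code): converting a value near 3.5*10^9 to a signed 32-bit integer is implementation-defined in C, but on common two's-complement platforms it wraps modulo 2^32 and comes out negative.

```c
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    /* A memory bound of roughly 3.5e9 bytes does not fit into a signed
     * 32-bit integer (max 2147483647). */
    int64_t bound_bytes = 3500000000LL;

    /* Narrowing to int32_t is implementation-defined in C; on typical
     * two's-complement platforms it wraps modulo 2^32. */
    int32_t truncated = (int32_t)bound_bytes;

    printf("original value : %lld bytes\n", (long long)bound_bytes);
    printf("as signed int32: %d bytes\n", truncated);
    /* Prints -794967296, i.e. about -795 M -- the same ballpark as the
     * ~ -796 M seen in the workunits. */
    return 0;
}
```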
Problem solved. Reissues of the previously failed tasks now have their memory bound set to ~3.2 GB. The problem will be out of the pipeline in a few days.
task name: h1_1413.60_O3aC01Cl0In0__O3MD1V2a_VelaJr1_1414.00Hz_537_1
WOO HOO!!! Thank you for your hard work in identifying the problem and for getting the right people at Einstein to fix it!!
Hi, a quick question:
The status page shows negative values for the O3MDF tasks; does this mean that those WUs are 100% complete?
[screenshot]
This happens on the first day when we add a new "sub-search". The one we just added ("C2") is the last of the current "O3MD1" search (series).
This one seems trickier to get started than the ones before. We are still struggling with memory requirement predictions that seem to be way off for this search in particular, as it was originally designed to run on CPUs and has not been put on the GPU app.
This will be the most demanding sub-search in terms of memory. I manually raised the requirement to 4.5GB, but I'm still not sure that this will be sufficient.
When this is done, we will certainly revise the model our memory predictions are based on.
BM
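As a purely illustrative back-of-the-envelope check (hypothetical host numbers, not BOINC's scheduling logic), this sketch shows how many tasks with a 4.5 GB memory bound fit into a given amount of RAM:

```c
#include <stdio.h>

int main(void)
{
    /* Hypothetical host values -- adjust to your own machine. */
    double host_ram_gb       = 16.0; /* RAM available to BOINC          */
    double bound_per_task_gb = 4.5;  /* raised memory bound per C2 task */

    int max_concurrent = (int)(host_ram_gb / bound_per_task_gb);

    printf("At %.1f GB per task, %.1f GB of RAM fits at most %d concurrent "
           "O3MD1 C2 tasks.\n",
           bound_per_task_gb, host_ram_gb, max_concurrent);
    return 0;
}
```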
Thanks for the explanation, you're the best, Bernd!
Yay, time to install that spare stick of RAM!! Thanks for the explanation sir.