Multi-Directional Gravitational Wave Search on O3 data (O3MD1/F)

astro-marwil
Joined: 28 May 05
Posts: 527
Credit: 619096543
RAC: 1076018

Hello!

On my new PC I see an unusually high failure rate of 27%, but only for O3 (GPU) tasks. For other GPU tasks it is close to zero. From the server status page I learned that for O3MDF (tasks failed)/(tasks valid + tasks failed) = 4.9%, which is also high, but not as high as on my side. O3MD1, however, shows a failure rate of 43% on the server status page, and I calculated the average failure rate over all applications to be 8.7%. What a waste of crunching power!

Can I reduce this with adjustments on my side?

Kind regards and happy crunching

Martin

Scrooge McDuck
Joined: 2 May 07
Posts: 1032
Credit: 17605891
RAC: 11681

I think that currently O3MD1 CPU workunits are generated with an erroneously small value for "memory bound". The workunits state ONLY ~1.9 GiB as upper memory limit but actually allocate 3.2 GiB:

======== Workunits ========
1) -----------
  name: h1_1331.40_O3aC01Cl0In0__O3MD1V2a_VelaJr1_1332.00Hz_49
  FP estimate: 1.440000e+014
  FP bound: 2.880000e+015
  memory bound: 1931.00 MB
  disk bound: 100.00 MB

This hinders the correct management of these tasks by BOINC and is possibly a reason for the currently high rate of 44% failed tasks for O3MD1 CPU, while this error rate for O3MDF GPU tasks is only about 4%.

I discussed this finding in more detail with example tasks in this thread in the 'problems' section.

See also server status page: https://einsteinathome.org/server_status.php

[updated 17 Apr 2023, 10:25:01 UTC]

"O3MD1" (CPU):

Tasks...

  • valid: 98,555
  • invalid: 64
  • inconclusive: 0
  • pending: 0
  • failed: 77,421      (44% failed)
  • too late: 1,201

"O3MDF" (GPU):

  • valid: 855,950
  • invalid: 1,408
  • inconclusive: 667
  • pending: 166,601
  • failed: 44,856       (only 4% failed)
  • too late: 704
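
The percentages quoted in this thread seem to come from two slightly different ratios: failed tasks as a share of all listed tasks, and failed tasks as a share of valid plus failed tasks only. A minimal Python sketch of both, using the counts from the status snapshot above; the function names and the dictionary layout are made up for this illustration and are not part of the status page or of BOINC:

```python
# Task counts copied from the server status snapshot of 17 Apr 2023.
STATUS = {
    "O3MD1": {"valid": 98_555, "invalid": 64, "inconclusive": 0,
              "pending": 0, "failed": 77_421, "too_late": 1_201},
    "O3MDF": {"valid": 855_950, "invalid": 1_408, "inconclusive": 667,
              "pending": 166_601, "failed": 44_856, "too_late": 704},
}

def failed_share_of_all(counts):
    """Failed tasks divided by every task listed for the application."""
    return counts["failed"] / sum(counts.values())

def failed_share_of_finished(counts):
    """Failed tasks divided by (valid + failed), ignoring all other states."""
    return counts["failed"] / (counts["valid"] + counts["failed"])

for app, counts in STATUS.items():
    print(f"{app}: {failed_share_of_all(counts):.1%} of all tasks, "
          f"{failed_share_of_finished(counts):.1%} of valid + failed")
```

For O3MD1 both ratios come out at roughly 44%. For O3MDF the first gives about 4% and the second about 5%, which would explain why the two figures quoted in this thread differ slightly.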
Bernd Machenschalk
Moderator
Administrator
Joined: 15 Oct 04
Posts: 4307
Credit: 249644124
RAC: 34386

Scrooge McDuck wrote:

I think that currently O3MD1 CPU workunits are generated with an erroneously small value for "memory bound".

This is true. While investigating, I stopped workunit generation for O3MD1.

BM

Scrooge McDuck
Joined: 2 May 07
Posts: 1032
Credit: 17605891
RAC: 11681

The latest O3MD1 CPU workunits now have NEGATIVE memory bound values.

Uuuh... an overflowing (signed) INT32 set to ~3.5*10^9 (~3.5 GB), which wraps around to about -796 MB?

11) -----------
   name: h1_1413.60_O3aC01Cl0In0__O3MD1V2a_VelaJr1_1414.00Hz_473
   FP estimate: 1.440000e+14
   FP bound: 2.880000e+15
   memory bound: -796.00 MB
   disk bound: 100.00 MB
12) -----------
   name: h1_1413.60_O3aC01Cl0In0__O3MD1V2a_VelaJr1_1414.00Hz_472
   FP estimate: 1.440000e+14
   FP bound: 2.880000e+15
   memory bound: -796.00 MB
   disk bound: 100.00 MB
13) -----------
   name: h1_1413.60_O3aC01Cl0In0__O3MD1V2a_VelaJr1_1414.00Hz_471
   FP estimate: 1.440000e+14
   FP bound: 2.880000e+15
   memory bound: -796.00 MB
   disk bound: 100.00 MB

...
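
The arithmetic behind that guess can be checked with a small Python sketch. It rests on assumptions the project has not confirmed: that the intended bound was 3300 MiB (about 3.46*10^9 bytes), that the value passed through a signed 32-bit integer somewhere on the way, and that "MB" in the listing means 2^20 bytes. `wrap_int32` is a helper written for this illustration, not a BOINC function.

```python
MIB = 1024 * 1024  # assumption: "MB" in the listing means 2**20 bytes

def wrap_int32(value: int) -> int:
    """Interpret an integer as a two's-complement signed 32-bit value."""
    return (value + 2**31) % 2**32 - 2**31

intended = 3300 * MIB           # hypothetical intended bound, in bytes
wrapped = wrap_int32(intended)  # what a signed 32-bit field would hold

print(intended)       # 3460300800
print(wrapped)        # -834666496
print(wrapped / MIB)  # -796.0
```

Under these assumptions the result is exactly -796 MB, matching the listing. Since 4096 MiB is 2^32 bytes, any bound between 2048 and 4096 MiB would show up as (bound - 4096) MB after such a wrap, while smaller values such as the earlier 1931 MB would pass through unchanged.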

Scrooge McDuck
Joined: 2 May 07
Posts: 1032
Credit: 17605891
RAC: 11681

Problem solved. There are reissued, previously failed tasks, with memory bound now set to ~3.2 GB. The problem will be out of the pipeline in a few days.

task name: h1_1413.60_O3aC01Cl0In0__O3MD1V2a_VelaJr1_1414.00Hz_537_1

21) -----------
   name: h1_1413.60_O3aC01Cl0In0__O3MD1V2a_VelaJr1_1414.00Hz_537
   FP estimate: 1.440000e+14
   FP bound: 2.880000e+15
   memory bound: 3242.49 MB
   disk bound: 100.00 MB
mikey
Joined: 22 Jan 05
Posts: 12618
Credit: 1839003036
RAC: 7100

Scrooge McDuck wrote:

Problem solved. There are reissued, previously failed tasks, with memory bound now set to ~3.2 GB. The problem will be out of the pipeline in a few days.

task name: h1_1413.60_O3aC01Cl0In0__O3MD1V2a_VelaJr1_1414.00Hz_537_1

21) -----------
   name: h1_1413.60_O3aC01Cl0In0__O3MD1V2a_VelaJr1_1414.00Hz_537
   FP estimate: 1.440000e+14
   FP bound: 2.880000e+15
   memory bound: 3242.49 MB
   disk bound: 100.00 MB

WOO HOO!!! Thank you for your hard work in identifying the problem and for getting the right people at Einstein to fix it!!

Marcin
Joined: 19 Jun 09
Posts: 20
Credit: 6716800
RAC: 1639

Hi, a quick question:

The status page shows negative values for the O3MDF tasks. Does this mean that those WUs are 100% complete?

screenshot

Bernd Machenschalk
Moderator
Administrator
Joined: 15 Oct 04
Posts: 4307
Credit: 249644124
RAC: 34386

This happens on the first day when we add a new "sub-search". The one we just added ("C2") is the last of the current "O3MD1" search (series).

This one seems trickier to get started than the ones before. We are still struggling with memory requirement predictions that seem to be way off, for this search in particular, as it was originally designed to run on CPUs and is now put on the GPU app.

This will be the most demanding sub-search in terms of memory. I manually raised the requirement to 4.5GB, but I'm still not sure that this will be sufficient.

When this is done, we will certainly revise the model our memory predictions are based on.

BM

Marcin
Joined: 19 Jun 09
Posts: 20
Credit: 6716800
RAC: 1639

Thanks for the explanation, you're the best, Bernd!

Rodrigo
Joined: 5 Aug 17
Posts: 22
Credit: 249736624
RAC: 38048

Bernd Machenschalk wrote:

This happens on the first day when we add a new "sub-search". The one we just added ("C2") is the last of the current "O3MD1" search (series).

This one seems trickier to get started than the ones before. We are still struggling with memory requirement predictions that seem to be way off, for this search in particular, as it was originally designed to run on CPUs and is now put on the GPU app.

This will be the most demanding sub-search in terms of memory. I manually raised the requirement to 4.5GB, but I'm still not sure that this will be sufficient.

When this is done, we will certainly revise the model our memory predictions are based on.

 

Yay, time to install that spare stick of RAM!! Thanks for the explanation, sir.
