Multi-Directional Gravitational Wave Search on O3 data (O3MD1/F)

Vato
Joined: 19 Jun 10
Posts: 2
Credit: 77896217
RAC: 133091

At around the same time that O3MD1 tasks started flowing again, I stopped receiving O3MDF tasks for my NVIDIA GPU under Linux. Is anyone else seeing this issue? Any ideas? Host is https://einsteinathome.org/host/12844421

archae86
Joined: 6 Dec 05
Posts: 3157
Credit: 7229801530
RAC: 1156305

Vato wrote:

At around the same time that O3MD1 tasks started flowing again, I stopped receiving O3MDF tasks for my NVIDIA GPU under Linux. Is anyone else seeing this issue? Any ideas? Host is https://einsteinathome.org/host/12844421

Yes, though my case is a bit odd.

I have three hosts running Einstein, and with the "fixed" application, the one that had errored out on every O3 GPU unit in early December was now able to run them to completion and validation.  Initially all three hosts got a very large fraction of GW tasks relative to BRP tasks, so as a means of throttling I turned off O3 task download for all but about an hour a day.

But a couple of days or so ago, this resulted in zero O3 tasks during the hour I permitted both.  The next day, as a test, I turned off BRP permission, and still got zero O3 tasks in rather more than an hour.  As gazillions of O3 tasks show as ready to send, it seems something decided my system was in some way unsuitable.

While composing this comment, I've switched preferences again, temporarily requesting only O3 GPU tasks.  I'll see whether any come now.

[edit to add observations:

After more than an hour with all three hosts repeatedly requesting only O3 GPU work, zero O3 tasks were sent.

Here are what I imagine are the relevant lines from the work request log from one of those hosts late in this hour:

Quote:
[send] Not using matchmaker scheduling; Not using EDF sim
[send] CPU: req 0.00 sec, 0.00 instances; est delay 0.00
[send] ATI: req 8725.21 sec, 0.00 instances; est delay 0.00
[send] work_req_seconds: 0.00 secs
[send] available disk 58.75 GB, work_buf_min 172800
[send] active_frac 0.869587 on_frac 0.999825 DCF 0.196890
[mixed] sending locality work first (0.0542)
[send] send_old_work() no feasible result older than 336.0 hours
[send] send_old_work() no feasible result younger than 208.7 hours and older than 168.0 hours
[mixed] sending non-locality work second
[send] [HOST#12260865] will accept beta work.  Scanning for beta work.
[debug]   [HOST#12260865] MSG(high) No work sent
 Sending reply to [HOST#12260865]: 0 results, delay req 60.00
 Scheduler ran 12.210 seconds

I've decided to request FGRP only again for a while, pending resolution of this situation]


Ereignishorizont
Joined: 17 May 21
Posts: 19
Credit: 3025792861
RAC: 1451667

Vato wrote:

At around the same time that O3MD1 tasks started flowing again, I stopped receiving O3MDF tasks for my NVIDIA GPU under Linux. Is anyone else seeing this issue? Any ideas? Host is https://einsteinathome.org/host/12844421


The same here. No O3MDF tasks for my NVIDIA GPUs for a few days now.

DF1DX
Joined: 14 Aug 10
Posts: 105
Credit: 3885456854
RAC: 4972458

I can confirm this. I still haven't received any O3MDF tasks today.

On my CPU (AMD 3700X, at 45 W, running 8 tasks) the O3MD1 tasks take about 21 hours each.

Bernd Machenschalk
Moderator
Administrator
Joined: 15 Oct 04
Posts: 4312
Credit: 250672854
RAC: 34831

There was a problem with the project configuration that was fixed just minutes ago. It should work again now.

BM

Aurum
Joined: 12 Jul 17
Posts: 77
Credit: 3412397040
RAC: 436

What is Error 1152, and can I do anything to alleviate it?

MAIN: XLALComputeFstat() failed with errno=1152
2023-01-13 23:35:17.1679 (432423) [CRITICAL]: ERROR: MAIN() returned with error '1152'

https://einsteinathome.org/task/1409447039

Ian&Steve C.
Joined: 19 Jan 20
Posts: 3965
Credit: 47202372642
RAC: 65441229

Aurum wrote:

What is Error 1152, and can I do anything to alleviate it?

MAIN: XLALComputeFstat() failed with errno=1152
2023-01-13 23:35:17.1679 (432423) [CRITICAL]: ERROR: MAIN() returned with error '1152'

https://einsteinathome.org/task/1409447039

You need to look at the first error in the chain; everything after that is just cascading errors as fallout.

Your real issue is this:

failed with OpenCL error: CL_MEM_OBJECT_ALLOCATION_FAILURE

You ran out of VRAM. If you're trying to run 4x tasks, it won't work; there is only enough VRAM on the 3080 Ti for 3x tasks.
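If you want to cap the GW GPU tasks at 3x rather than give up concurrency entirely, one way is a per-project app_config.xml in the Einstein@Home project directory. The following is only a minimal sketch: the app name einstein_O3MDF is my assumption and should be checked against the names your client actually reports (e.g. in client_state.xml or the event log), and 0.33 simply tells BOINC that each task occupies a third of a GPU, so at most three run at once per GPU.

<app_config>
  <app>
    <!-- App name is an assumption; verify it against client_state.xml -->
    <name>einstein_O3MDF</name>
    <gpu_versions>
      <!-- 0.33 of a GPU per task => at most 3 concurrent tasks per GPU -->
      <gpu_usage>0.33</gpu_usage>
      <!-- reserve one full CPU core per GPU task -->
      <cpu_usage>1.0</cpu_usage>
    </gpu_versions>
  </app>
</app_config>

After saving the file, use Options -> Read config files in BOINC Manager (or simply restart the client) for it to take effect.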


Aurum
Joined: 12 Jul 17
Posts: 77
Credit: 3412397040
RAC: 436

Thanks. I was running 3 tasks at a time, but now I'm running just one on all GPU models. So far so good.

This project DLs far too many WUs, so they quickly trigger Running High Priority. That sometimes switches a running WU to Waiting, and with 3 WUs running and one or two Waiting it may have wanted too much VRAM.

If the supply is going to be continuous, it might be a good idea to run in RZM.

archae86
Joined: 6 Dec 05
Posts: 3157
Credit: 7229801530
RAC: 1156305

Bernd Machenschalk wrote:
There was a problem with the project configuration that was fixed just minutes ago. It should work again now.

I confirm that all three of my hosts received new GW O3 GPU work after the change today.  They had not received any for the previous seven days; the last had arrived at 14:26 UTC on January 9.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5872
Credit: 117777741970
RAC: 34749922

Aurum wrote:
This project DLs ...

No, it doesn't.  The project tries to supply exactly what the client asks for.  Your client needs to stop asking :-).

You have to figure out why the client is asking for so much work that high priority mode is being triggered.  Precisely because of things like you describe, you really, really don't want to allow the client to go into high priority mode (panic mode).  Things can become very complicated if you run multiple projects, multiple searches per project, and asymmetric resource shares.  Perhaps as a first step you might review the work cache size settings to see whether a reduction there lowers the amount of work on hand for Einstein to the point where panic mode is never triggered.

Cheers,
Gary.
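To illustrate Gary's suggestion with a local setting rather than the website preferences, here is a sketch of a global_prefs_override.xml placed in the BOINC data directory; the 0.5 and 0.1 day values are example assumptions, not a recommendation for every setup. For comparison, the scheduler log quoted earlier shows work_buf_min 172800, which is this setting expressed in seconds, i.e. a two-day cache.

<global_preferences>
  <!-- "Store at least" this many days of work -->
  <work_buf_min_days>0.5</work_buf_min_days>
  <!-- "Store up to an additional" this many days -->
  <work_buf_additional_days>0.1</work_buf_additional_days>
</global_preferences>

A local override takes precedence over the web preferences and can be applied with Options -> Read local prefs file in BOINC Manager, or by restarting the client.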
