Errors with GravWave Search - GPU

Werinbert

Joined: 31 Dec 12

Posts: 20

Credit: 100156387

RAC: 0

29 Mar 2020 21:35:45 UTC

Topic 221469

As of today I have been getting the following error:

<message>
exceeded elapsed time limit 7912.18 (2880000.00G/364.00G)</message>

The computer with the issue is https://einsteinathome.org/host/12815268

The error doesn't occur all the time and there has been a task longer than the 7912 second limit that did not error out. I am also unsure why such a short time limit is in place to begin with.

Edit: I am not sure if it makes a difference, but I am running a Ryzen 3700X/GTX 750Ti on Linux Mint. Also, after reading the thread from earlier in the month about GW - GPU errors (mainly asking about memory), I noticed that my rig takes a lot longer to process the tasks than others with the same card.

Gary Roberts

Moderator

Joined: 9 Feb 05

Posts: 5888

Credit: 119892632385

RAC: 25971955

29 Mar 2020 23:47:27 UTC

Message 176258

Werinbert wrote:

The error doesn't occur all the time and there has been a task longer than the 7912 second limit that did not error out.

There are several different known pulsars being targeted as potential sources of continuous GW. Each one gives different crunching behaviour, such as crunch time estimates and time limits, so there wouldn't be a single fixed time limit for all tasks. Initial task estimates are a lot shorter than the true crunch time, which means that the time limits are probably far shorter than they should be as well.

There have been volunteer comments in the past about the underestimates for crunch time so I'm sure the Devs are aware of this. I don't know why this hasn't been addressed. No explanation has been offered that I've noticed.

Werinbert wrote:

I am also unsure why such a short time limit is in place to begin with.

The time limit is probably some fixed multiple of the estimated 'work content' of a task - hence the problem for you if your GPU is running slower than it should be.
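As a rough check on that explanation, the two figures in parentheses in the original error message can be divided to see what time limit they imply. The sketch below assumes (this is not confirmed anywhere in the thread) that the first figure is an upper bound on floating-point operations and the second is the estimated speed of the app on the host, both in units of 10^9. The function names are illustrative only.

```python
import re

# Pattern for the client message quoted in the first post.
# Assumption: the two "G" figures are an operation bound (giga-operations)
# and an estimated speed (giga-operations per second).
LIMIT_RE = re.compile(
    r"exceeded elapsed time limit\s+([\d.]+)\s+\(([\d.]+)G/([\d.]+)G\)"
)


def parse_limit_message(text):
    """Return (reported_limit_s, bound_gflop, speed_gflops) from the message."""
    match = LIMIT_RE.search(text)
    if match is None:
        raise ValueError("no elapsed-time-limit message found")
    reported, bound, speed = (float(g) for g in match.groups())
    return reported, bound, speed


def implied_limit_seconds(bound_gflop, speed_gflops):
    """Time limit implied by dividing the operation bound by the speed."""
    return bound_gflop / speed_gflops


message = (
    "<message>\n"
    "exceeded elapsed time limit 7912.18 (2880000.00G/364.00G)</message>"
)
reported, bound, speed = parse_limit_message(message)
implied = implied_limit_seconds(bound, speed)
print(f"reported {reported:.2f} s, implied {implied:.2f} s")
```

The implied value (about 7912.09 s) is within a fraction of a second of the reported 7912.18 s. The small gap would be consistent with the speed figure having been rounded to two decimals for display. On this reading, a slower-than-expected GPU hits the limit simply because the bound is sized from an estimate that is too low.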

Werinbert wrote:

... I noticed that my rig takes a lot longer to process the tasks than others with the same card.

If you are using your CPU cores to run CPU tasks for other projects, it could be that your GPU doesn't have enough CPU support. If so, you could try reducing the number of cores BOINC is allowed to use by one to see if that allows your GPU tasks to run faster.

Cheers,
Gary.

Werinbert

Joined: 31 Dec 12

Posts: 20

Credit: 100156387

RAC: 0

30 Mar 2020 2:05:41 UTC

Message 176260

@Gary, as you mentioned that there are multiple GW sources being targeted, I looked again at the WUs, and the problem WUs seem to be only the G34731 tasks. So this supports your theory as to the underlying problem.

I did check on the issue with CPU load. The default is 0.9 cores per task, and giving it one dedicated core showed no improvement. However, giving it two dedicated cores did show improvement. Nonetheless, I do feel that the run time limit is too low, and I will not run the app on my computer, as I can get better use out of my CPU cores than babysitting a mis-tuned app.

Arnaldy Medina

Joined: 22 Mar 20

Posts: 1

Credit: 120296

RAC: 0

30 Mar 2020 5:27:24 UTC

Message 176261

So I have the error too with the G34731 tasks. I dug into the debugger log and noticed some deprecation warnings that I didn't find in the other tasks that I have done. Also, the error points to an unhandled exception in KERNELBASE.dll as the main cause of the error. Maybe that's what Gary was mentioning about the Devs being aware of this.

Mike Hewson

Moderator

Joined: 1 Dec 05

Posts: 6596

Credit: 340223230

RAC: 133654

30 Mar 2020 6:05:46 UTC

Message 176262

Here is a possibly related error on my machine, and many others in the quorum had difficulty too. The Vela Junior pulsar is being analysed. The error report also mentions an unhandled exception, that condition being an 'access violation', and there is a deprecation warning as well.

Cheers, Mike.

I have made this letter longer than usual because I lack the time to make it shorter ...

... and my other CPU is a Ryzen 5950X :-) Blaise Pascal

Gary Roberts

Moderator

Joined: 9 Feb 05

Posts: 5888

Credit: 119892632385

RAC: 25971955

30 Mar 2020 8:34:50 UTC

Message 176263

In response to the two previous messages, the "deprecation warning" messages are quite normal. I've been seeing those for months and they don't lead to any problems. They just seem to be harmless and have been reported previously. There's been no comment from the Devs about them that I've noticed.

The "TIME LIMIT EXCEEDED" errors have been seen before, usually for less capable GPUs that aren't really up to the job. They also are more likely to occur if the tasks are taking a lot longer than was initially thought. There was an example of this quite some time ago when VelaJr tasks were taking about double the time that was anticipated. As a result, the estimates were doubled for those as was the credit award - if I remember correctly, as it was a while ago.

We have been doing more VelaJr tasks recently - one of my hosts is still doing them. I run 3 at a time on an RX 570, which is a mid-range discrete GPU at best. Three tasks get finished in about 36 mins, i.e. ~12 mins per task. There are a number of new G34731 tasks coming through, so I've promoted a couple of those to crunch 'out of order' to see what the crunch time is like. At the moment, the first of these is 50% complete after 30 mins, so it should take around a full hour to complete.

So it looks like this new batch may take close to twice as long as the previous VelaJr tasks. They were actually estimated at half the time so it looks like the Devs may have to make some more adjustments to the estimates and correspondingly, to the time limit before the task is terminated. I'll send a PM to Bernd and ask him to have a look at this.
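The arithmetic behind that comparison can be sketched as follows. It assumes progress is linear in elapsed time and that the G34731 tasks are also being run three at a time, which the post does not state explicitly. The variable and function names are illustrative.

```python
def estimate_total_minutes(elapsed_minutes, fraction_done):
    """Linear extrapolation: assume progress stays proportional to elapsed time."""
    if not 0 < fraction_done <= 1:
        raise ValueError("fraction_done must be in (0, 1]")
    return elapsed_minutes / fraction_done


# Figures quoted in the post above.
vela_batch_minutes = 36.0   # elapsed time for three concurrent VelaJr tasks
vela_concurrency = 3
g34731_elapsed = 30.0       # minutes run so far
g34731_fraction = 0.5       # 50% complete

# Effective throughput for VelaJr: elapsed time shared across concurrent tasks.
vela_per_task = vela_batch_minutes / vela_concurrency

# Projected elapsed time for a G34731 task, and its ratio to the VelaJr batch.
g34731_total = estimate_total_minutes(g34731_elapsed, g34731_fraction)
ratio = g34731_total / vela_batch_minutes

print(f"VelaJr: {vela_per_task:.0f} min per task (effective)")
print(f"G34731: {g34731_total:.0f} min projected, {ratio:.2f}x the VelaJr elapsed time")
```

With these numbers the projected elapsed time is 60 minutes against 36, a ratio of roughly 1.7, which is in line with "close to twice as long" given how rough a 50% progress reading is.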

Cheers,
Gary.

Werinbert

Joined: 31 Dec 12

Posts: 20

Credit: 100156387

RAC: 0

30 Mar 2020 9:21:37 UTC

Message 176264

I do hope the Devs extend the time limit. If so, I may go back to running these tasks.

Bernd Machenschalk

Moderator

Administrator

Joined: 15 Oct 04

Posts: 4352

Credit: 253949193

RAC: 34489

30 Mar 2020 9:58:40 UTC

Message 176265

Thanks for the note. I'm still waiting for feedback from the scientists on that new setup. For the time being I doubled the "flops estimation" (and credit), which should also double the runtime limit (for newly generated workunits only, sorry).

Gary Roberts

Moderator

Joined: 9 Feb 05

Posts: 5888

Credit: 119892632385

RAC: 25971955

30 Mar 2020 19:34:34 UTC

Message 176269

Thanks Bernd.

Cheers,
Gary.

Richie

Joined: 7 Mar 14

Posts: 656

Credit: 1702989778

RAC: 0

30 Mar 2020 21:54:56 UTC

Message 176276

I wonder if some of the tasks require more memory than an Nvidia card with 2 GB is able to offer in practice. My 2 GB GTX 960 cards are running only one task at a time per card. In the last couple of days all three of these hosts have started to see clearly more computation errors. Tasks crash in about 100 seconds. Here's an example: https://einsteinathome.org/task/935886632

In the stderr output, the first sign of the problem is always this line:

XLAL Error - XLALComputeECLFFT_OpenCL (/home/jenkins/workspace/workspace/EaH-GW-OpenCL-Testing/SLAVE/MinGW6.3/TARGET/windows-x64/EinsteinAtHome/source/lalsuite/lalpulsar/src/ComputeFstat_Resamp_OpenCL.c:1248): Processing FFT failed: CL_MEM_OBJECT_ALLOCATION_FAILURE

I've seen exactly the same thing happening for others running other 2 GB Nvidia cards, for example GTX 1050, 950, 760 and 660 models. On the other hand, I don't think I've yet seen it happen with Nvidia cards with 4 GB or more.

I know some of these GW GPU tasks fill the GPU memory so that almost all of the 2 GB is in use with nothing other than BOINC open and one task running. But I'm starting to think that some tasks require more memory, and the problem might be that the project server isn't able to exclude hosts without enough memory from getting those large tasks. So basically this is the same situation that already existed earlier with 1 GB cards.

Another thing: a number of validate errors have started accumulating in the last few days. But I see that this might involve many users and many cards: upper 1000- and 900-series Nvidia cards running many different driver versions on both Windows and Linux. Naturally my AMD cards have got many validate errors already. But they never seem to keep themselves out of trouble.

Mike Hewson

Moderator

Joined: 1 Dec 05

Posts: 6596

Credit: 340223230

RAC: 133654

31 Mar 2020 0:12:00 UTC

Message 176277 in response to message 176276

I think you're right. Some research on CL_MEM_OBJECT_ALLOCATION_FAILURE indicates that it signals a generic failure to find enough available memory for some given request at a certain time. That can mean the card memory is too small, and/or that previously allocated buffers that are no longer needed haven't been released/deallocated. But FFTs are memory hungry beasts, so the former is likely. Interestingly, the error may not be emitted when the memory is requested but when the memory is first used (so-called lazy allocation by the OpenCL implementation).

{ This computer has a 2GB Nvidia card and shows this failure mode in three recent Vela Junior tasks. }

Cheers, Mike.

I have made this letter longer than usual because I lack the time to make it shorter ...

... and my other CPU is a Ryzen 5950X :-) Blaise Pascal
