Errors with GravWave Search - GPU

Zalster

Joined: 26 Nov 13

Posts: 3117

Credit: 4050672230

RAC: 0

You all remember that when

31 Mar 2020 3:52:33 UTC

Message 176281

You all remember that when using OpenCL, only 27% of the GPU memory is available by design for scientific purposes. This was specifically designed to prevent people from doing what we are doing and to force them to use the professional cards, i.e. Tesla. So don't expect to use all 2 GB of that card.

Betreger

Joined: 25 Feb 05

Posts: 992

Credit: 1590475706

RAC: 777231

A new twist on the Error

2 Apr 2020 17:22:57 UTC

Message 176327

A new twist on the "Error while computing": https://einsteinathome.org/workunit/447314688, 15 in a row on a 3GB card, each usually failing after ~1 min. The host had been running S@H quite well, and runs pulsars quite well. Methinks it's not the host but bad data.

I'll let it go on for a while longer so the devs can take a look and then deselect the app. Damn I really want to find a GW!

Richie

Joined: 7 Mar 14

Posts: 656

Credit: 1702989778

RAC: 0

Betreger wrote:A new twist on

2 Apr 2020 18:54:52 UTC

Message 176329 in response to message 176327

Betreger wrote:

A new twist on the Error while computing https://einsteinathome.org/workunit/447314688,

Was that card running tasks in parallel, or just one at a time?

Gary Roberts

Moderator

Joined: 9 Feb 05

Posts: 5872

Credit: 117622776267

RAC: 35234917

Betreger wrote:The host had

2 Apr 2020 22:13:00 UTC

Message 176332 in response to message 176327

Betreger wrote:

The host had been running S@H quite well, and runs pulsars quite well. Methinks it's not the host but bad data.

It looks like you've just recently moved the host from Seti back to here and been allocated a rather high 'frequency' to start working on. The task names have a frequency component as part of the name - in your case it's a range between 1587.25 - 1587.80 Hz. The frequency you get allocated is a 'luck of the draw'. I looked at one of my machines and it's currently doing multiple frequency bins around 1100 to 1300 Hz. The latest tasks being received right now are in the mid 1400s. I guess that sooner or later it will go higher but the GPU has 4GB so maybe it will be OK. Worth keeping an eye on though :-).

As Mike has mentioned, a big part of the processing involves FFTs (Fast Fourier Transforms) which are memory hungry beasts. I'm no mathematician but my guess is that the higher the frequency, the more memory hungry the process might be. So I would guess it's not "bad data" but an inability to allocate sufficient memory to hold the data.
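To put rough numbers on the memory argument, here is a minimal sketch of how FFT working memory scales with transform length. It is not the project's actual code: the function names, the single-precision complex samples (8 bytes each), and the two-buffer (input plus output) layout are all assumptions for illustration, and the link between task frequency and FFT length is only the guess made above.

```python
GIB = 2**30  # bytes in one gibibyte


def fft_buffer_bytes(n_points: int, bytes_per_complex: int = 8, n_buffers: int = 2) -> int:
    """Bytes needed for n_buffers complex arrays of n_points samples each.

    Assumes single-precision complex samples (8 bytes) and an out-of-place
    transform that holds an input and an output buffer at the same time.
    """
    return n_points * bytes_per_complex * n_buffers


def fits(n_points: int, free_bytes: int) -> bool:
    """True if the assumed FFT buffers fit into free_bytes of GPU memory."""
    return fft_buffer_bytes(n_points) <= free_bytes


if __name__ == "__main__":
    for exponent in (26, 27, 28):
        need = fft_buffer_bytes(2**exponent)
        print(f"2^{exponent} points: {need / GIB:.1f} GiB, fits in 3 GiB: {fits(2**exponent, 3 * GIB)}")
```

Under these assumptions, doubling the transform length doubles the memory needed, so a card that copes comfortably at one size can fail outright at the next. Running two tasks at once on the same card halves the memory each one can count on.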

The Devs would be tracking the error rate pretty closely so I'm sure they will notice if this becomes a widespread problem for GPUs with 3GB or less. You have really 'done your duty' in providing examples so I suggest you put it back to GRP tasks until we see how widespread this might become. I'll be leaving mine alone for the moment but will be paying rather more attention as the frequency gets higher.

Thanks for your report and sorry for your pain :-).

Cheers,
Gary.

Betreger

Joined: 25 Feb 05

Posts: 992

Credit: 1590475706

RAC: 777231

It looks like you've just

3 Apr 2020 1:25:08 UTC

Message 176338 in response to message 176332

"It looks like you've just recently moved the host from Seti back to here and been allocated a rather high 'frequency' to start working on."

No good deed will go unpunished.

"Thanks for your report and sorry for your pain"

The pain is very small on my part since they bomb out in ~ 1min but that is not good for the project.

Betreger

Joined: 25 Feb 05

Posts: 992

Credit: 1590475706

RAC: 777231

Gary I haven't given up on GW

4 Apr 2020 15:49:51 UTC

Message 176368 in response to message 176332

Gary I haven't given up on GW tasks on that GPU yet. I have gone back to one at a time from 2. The thing that puzzles me is that I have an almost identical host with the same GPU that has not exhibited the issue. The main difference is that the one without problems is a W8.1 machine and the problem child is a W10 machine.

Gary Roberts

Moderator

Joined: 9 Feb 05

Posts: 5872

Credit: 117622776267

RAC: 35234917

Betreger wrote:Gary I haven't

4 Apr 2020 22:08:44 UTC

Message 176377 in response to message 176368

Betreger wrote:

Gary I haven't given up on GW tasks on that GPU yet.

So I guess you're now saying that a reboot was a temporary fix only? This doesn't seem like a data problem - more like something flakey with the hardware. Only you can prove or disprove that.

The basic technique (assuming desktop or tower type machines) is to swap particular hardware items (one at a time - that's important) and run the machines until you are certain whether or not the problem has stayed with the host or moved with the swapped hardware item. There is also a third possibility which is the best outcome of all. Neither machine might be misbehaving after the swap. That can happen if the problem was actually due to a dicey connection to the attached hardware which has now been 'fixed' due to the remaking of that connection or it could just be the change in electrical conditions (voltage levels and stability, etc.) with the change of host.

Betreger wrote:

... I have an almost identical host ...

That's not actually true :-). Apart from the different OS versions, there are also different driver versions. On the hardware front, one host has an i5 3350P which is an Ivy Bridge processor launched in 2012 whilst the other has an i5 6400 which is a Skylake processor first launched in 2015. Undoubtedly the motherboards would be quite different, as could be the RAM.

The items of hardware most amenable to swapping are the GPUs, the RAM (check carefully for compatibility) and finally, the PSUs. If they were my hosts, the first check I would make would be to swap the GPUs (check all contacts carefully) and see what happens. The best outcome would be that the problem suddenly disappears from both. If so, make sure you give both an extended test before assuming success. If the problem stays with the host, it's unlikely to be the GPU so you could move on to test something else. If the problem transfers with the GPU, it's that hardware that is likely at fault. The most common cause of GPU problems is insufficient cooling - check all fans for smooth free running as you transfer them. Also give each GPU a good clean whilst you have them out.

Cheers,
Gary.
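The swap test described above boils down to a small decision table. The sketch below is only an illustration of that reasoning: the function name, the host labels and the wording of the verdicts are made up for the example, and it assumes that exactly one host was misbehaving before the two GPUs were exchanged.

```python
def diagnose_gpu_swap(original_bad_host: str, other_host: str, failing_after_swap: set) -> str:
    """Interpret which hosts fail after swapping the GPUs between two hosts.

    original_bad_host: the host that showed errors before the swap.
    other_host: the host that was healthy before the swap.
    failing_after_swap: names of hosts that show errors after the swap.
    """
    if not failing_after_swap:
        # Neither host misbehaves: reseating may have fixed a dicey connection,
        # or the changed electrical conditions suit the card better.
        return "problem gone: suspect connection or electrical conditions, keep testing"
    if failing_after_swap == {original_bad_host}:
        # The fault stayed put even though the GPU left.
        return "problem stayed with host: GPU unlikely, test RAM or PSU next"
    if failing_after_swap == {other_host}:
        # The fault followed the card to its new host.
        return "problem moved with GPU: GPU likely at fault, check cooling first"
    return "inconclusive: both hosts failing, swap one item at a time and retest"


if __name__ == "__main__":
    print(diagnose_gpu_swap("W10 host", "W8.1 host", {"W8.1 host"}))
```

The last branch is there because swapping more than one item at a time, or a problem that affects both hosts, leaves nothing to compare against.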

Betreger

Joined: 25 Feb 05

Posts: 992

Credit: 1590475706

RAC: 777231

Gary It's been a day since I

5 Apr 2020 16:03:13 UTC

Message 176393 in response to message 176377

Gary It's been a day since I went to 1 at a time and the problem hasn't resurfaced yet. I'm keeping a close eye on it.

Betreger

Joined: 25 Feb 05

Posts: 992

Credit: 1590475706

RAC: 777231

Gary just an update. Upon

9 Apr 2020 21:09:08 UTC

Message 176492

Gary just an update. Upon further examination it turned out both GTX1060s were having problems with some GW tasks. Since the 4th, when I went to running 1 at a time on them, I have not had any bomb out. I really don't know if that did it or if Bernd stopped sending out those poison pills, but everything is looking good as of now. As an aside, pulsars run fine 2 at a time.

Richie

Joined: 7 Mar 14

Posts: 656

Credit: 1702989778

RAC: 0

My host got 60 of those Vela

15 Apr 2020 9:47:31 UTC

Message 176632

My host got 60 of those Vela tasks in a row. All errored out with CL_MEM_OBJECT_ALLOCATION_FAILURE. Of course, that host now isn't getting anything until tomorrow. I hope it will be G3a that gets sent out then, not Vela.
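For reference, CL_MEM_OBJECT_ALLOCATION_FAILURE is the standard OpenCL status code -4, returned when memory for a buffer or image object cannot be allocated. A minimal sketch for tallying memory-related OpenCL errors in task output follows. The function name and the sample lines are hypothetical, and it assumes only that the error name appears as plain text somewhere in each failing line.

```python
# Memory-related subset of the standard OpenCL status codes from CL/cl.h.
OPENCL_MEMORY_ERRORS = {
    -4: "CL_MEM_OBJECT_ALLOCATION_FAILURE",
    -5: "CL_OUT_OF_RESOURCES",
    -6: "CL_OUT_OF_HOST_MEMORY",
}


def count_memory_errors(stderr_lines):
    """Count how many lines mention each memory-related OpenCL error name."""
    counts = {}
    for line in stderr_lines:
        for name in OPENCL_MEMORY_ERRORS.values():
            if name in line:
                counts[name] = counts.get(name, 0) + 1
    return counts


if __name__ == "__main__":
    # Hypothetical sample lines, not copied from a real task log.
    sample = [
        "error: CL_MEM_OBJECT_ALLOCATION_FAILURE",
        "task finished normally",
        "error: CL_MEM_OBJECT_ALLOCATION_FAILURE",
    ]
    print(count_memory_errors(sample))
```

A run of failures that all carry the same allocation error, as with the 60 tasks here, points to the tasks needing more memory than the card can provide rather than to corrupt data.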
