Errors with GravWave Search - GPU

Zalster
Joined: 26 Nov 13
Posts: 3117
Credit: 4050672230
RAC: 0

You all remember that when using OpenCL, only 27% of the GPU memory is available by design for scientific purposes. This was specifically designed to prevent people from doing what we are doing and to force them to use the professional cards, i.e. Tesla. So don't expect to use all 2 GB of that card.
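
As a rough illustration of what such a cap would mean for a 2 GB card, here is a minimal sketch. The 27% figure is simply the one quoted above and is not verified here, and the function name `usable_bytes` is hypothetical.

```python
GIB = 1024 ** 3


def usable_bytes(total_bytes: int, fraction: float = 0.27) -> int:
    """Return the bytes usable if only `fraction` of card memory is available.

    The default fraction is the figure quoted in the post, taken as an assumption.
    """
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    return int(total_bytes * fraction)


card = 2 * GIB
limit = usable_bytes(card)
print(f"{limit / GIB:.2f} GiB usable of {card / GIB:.0f} GiB")
```

Under that assumption a 2 GB card would offer only a little over half a gigabyte to a task.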

Betreger
Joined: 25 Feb 05
Posts: 987
Credit: 1433742280
RAC: 583961

A new twist on the "Error while computing": https://einsteinathome.org/workunit/447314688. That's 15 in a row on a 3 GB card, usually failing after ~1 min. The host had been running S@H quite well, and runs pulsars quite well. Methinks it's not the host but bad data.

I'll let it go on for a while longer so the devs can take a look and then deselect the app. Damn I really want to find a GW!

Richie
Joined: 7 Mar 14
Posts: 656
Credit: 1702989778
RAC: 0

Betreger wrote:
A new twist on the Error while computing https://einsteinathome.org/workunit/447314688,

Was that card running tasks in parallel... or just one?

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5846
Credit: 109975353623
RAC: 29587580

Betreger wrote:
The host had been running S@H quite well, and runs pulsars quite well. Methinks it's not the host but bad data.

It looks like you've just recently moved the host from Seti back to here and been allocated a rather high 'frequency' to start working on.  The task names have a frequency component as part of the name - in your case it's a range between 1587.25 - 1587.80 Hz.  The frequency you get allocated is a 'luck of the draw'.  I looked at one of my machines and it's currently doing multiple frequency bins around 1100 to 1300 Hz.  The latest tasks being received right now are in the mid 1400s.  I guess that sooner or later it will go higher but the GPU has 4GB so maybe it will be OK.  Worth keeping an eye on though :-).
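
For anyone wanting to check which frequencies their own tasks cover, here is a minimal sketch of pulling the frequency values out of a task name. The example name below is made up for illustration and is not a real task name; the exact naming scheme is an assumption, and the function `frequency_range` is hypothetical. It simply treats every decimal number in the name as a frequency in Hz.

```python
import re


def frequency_range(task_name: str) -> tuple[float, float]:
    """Return (lowest, highest) decimal number found in a task name.

    Assumes the decimal numbers in the name are frequencies in Hz.
    """
    values = [float(m) for m in re.findall(r"\d+\.\d+", task_name)]
    if not values:
        raise ValueError(f"no frequency found in {task_name!r}")
    return min(values), max(values)


# Hypothetical task name, built only to match the range quoted above.
example = "h1_1587.25_EXAMPLE__EXAMPLE_1587.80Hz_123_0"
low, high = frequency_range(example)
print(f"{low:.2f} - {high:.2f} Hz")
```

Running this over the names in a task list would show at a glance whether a host has been handed the higher frequencies.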

As Mike has mentioned, a big part of the processing involves FFTs (Fast Fourier Transforms) which are memory hungry beasts.  I'm no mathematician but my guess is that the higher the frequency, the more memory hungry the process might be.  So I would guess it's not "bad data" but an inability to allocate sufficient memory to hold the data.
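
To give a feel for why FFTs are memory hungry, here is a minimal sketch of how buffer size grows with transform length. It assumes single-precision complex samples (8 bytes each) and two buffers (input and output); both are illustrative assumptions, not details of the actual GW application, and it does not model how the transform length depends on frequency.

```python
MIB = 1024 ** 2


def fft_buffer_bytes(n_points: int, bytes_per_sample: int = 8, n_buffers: int = 2) -> int:
    """Memory needed to hold `n_buffers` arrays of `n_points` complex samples."""
    return n_points * bytes_per_sample * n_buffers


for exponent in (24, 26, 28):
    n = 2 ** exponent
    print(f"2^{exponent} points: {fft_buffer_bytes(n) / MIB:.0f} MiB")
# 2^24 points: 256 MiB
# 2^26 points: 1024 MiB
# 2^28 points: 4096 MiB
```

The growth is linear in the number of points, so a fourfold longer transform needs four times the memory, which is enough to push a 3 GB card over the edge.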

The Devs would be tracking the error rate pretty closely so I'm sure they will notice if this becomes a widespread problem for GPUs with 3GB or less.  You have really 'done your duty' in providing examples so I suggest you put it back to GRP tasks until we see how widespread this might become.  I'll be leaving mine alone for the moment but will be paying rather more attention as the frequency gets higher.

Thanks for your report and sorry for your pain :-).

Cheers,
Gary.

Betreger
Joined: 25 Feb 05
Posts: 987
Credit: 1433742280
RAC: 583961

"It looks like you've just recently moved the host from Seti back to here and been allocated a rather high 'frequency' to start working on."

No good deed will go unpunished.

"Thanks for your report and sorry for your pain"

The pain is very small on my part since they bomb out in ~ 1min but that is not good for the project.  

Betreger
Joined: 25 Feb 05
Posts: 987
Credit: 1433742280
RAC: 583961

Gary I haven't given up on GW tasks on that GPU yet. I have gone back to one at a time from 2. The thing that puzzles me is that I have an almost identical host with the same GPU that has not exhibited the issue. The main difference is that the one without problems is a W8.1 machine, while the problem child is a W10 machine.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5846
Credit: 109975353623
RAC: 29587580

Betreger wrote:
Gary I haven't given up on GW tasks on that GPU yet.

So I guess you're now saying that a reboot was a temporary fix only?  This doesn't seem like a data problem - more like something flakey with the hardware.  Only you can prove or disprove that.

The basic technique (assuming desktop or tower type machines) is to swap particular hardware items (one at a time - that's important) and run the machines until you are certain whether or not the problem has stayed with the host or moved with the swapped hardware item.  There is also a third possibility which is the best outcome of all.  Neither machine might be misbehaving after the swap.  That can happen if the problem was actually due to a dicey connection to the attached hardware which has now been 'fixed' due to the remaking of that connection or it could just be the change in electrical conditions (voltage levels and stability, etc.) with the change of host.

Betreger wrote:
... I have an almost identical host ...

That's not actually true :-).  Apart from the different OS versions, there are also different driver versions.  On the hardware front, one host has an i5 3350P which is an Ivy Bridge processor launched in 2012 whilst the other has an i5 6400 which is a Skylake processor first launched in 2015.  Undoubtedly the motherboards would be quite different, as could be the RAM.

The items of hardware most amenable to swapping are the GPUs, the RAM (check carefully for compatibility) and finally, the PSUs.  If they were my hosts, the first check I would make would be to swap the GPUs (check all contacts carefully) and see what happens. The best outcome would be that the problem suddenly disappears from both.  If so, make sure you give both an extended test before assuming success.  If the problem stays with the host, it's unlikely to be the GPU so you could move on to test something else.  If the problem transfers with the GPU, it's that hardware that is likely at fault.  The most common cause of GPU problems is insufficient cooling - check all fans for smooth free running as you transfer them.  Also give each GPU a good clean whilst you have them out.
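
The reasoning behind the GPU swap can be written down as a small decision table. This is only a sketch of the logic described above; the function name `interpret_gpu_swap` is hypothetical, and the "fails on both" case is not covered in the post, so it is marked as inconclusive here.

```python
def interpret_gpu_swap(fails_on_original_host: bool, fails_with_moved_gpu: bool) -> str:
    """Interpret the result of swapping GPUs between two hosts.

    fails_on_original_host: the originally failing host still fails with the other GPU.
    fails_with_moved_gpu:   the other host now fails with the suspect GPU installed.
    """
    if fails_on_original_host and fails_with_moved_gpu:
        # Not discussed in the post; treated as inconclusive in this sketch.
        return "inconclusive: repeat the test or swap a different item"
    if fails_on_original_host:
        return "problem stayed with the host: GPU unlikely, test RAM or PSU next"
    if fails_with_moved_gpu:
        return "problem moved with the GPU: that GPU is likely at fault"
    return "problem gone: possibly a poor connection, run an extended test"


print(interpret_gpu_swap(False, True))
```

Swapping only one item at a time is what keeps this table unambiguous, since each outcome then points at a single suspect.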

Cheers,
Gary.

Betreger
Joined: 25 Feb 05
Posts: 987
Credit: 1433742280
RAC: 583961

Gary, it's been a day since I went to 1 at a time and the problem hasn't resurfaced yet. I'm keeping a close eye on it.

Betreger
Joined: 25 Feb 05
Posts: 987
Credit: 1433742280
RAC: 583961

Gary, just an update. Upon further examination it turned out both GTX1060s were having problems with some GW tasks. Since the 4th, when I went to running 1 at a time on them, I have not had any bomb out. I really don't know if that did it or if Bernd stopped sending out those poison pills, but everything is looking good as of now. As an aside, pulsars run fine 2 at a time.

Richie
Joined: 7 Mar 14
Posts: 656
Credit: 1702989778
RAC: 0

My host got 60 of those Vela tasks in a row. All errored out with CL_MEM_OBJECT_ALLOCATION_FAILURE. Now that host isn't getting anything, of course, until tomorrow. I hope it will be G3a sent out then, not Vela.
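
For reference, here is a minimal sketch that maps a few OpenCL status codes to their names, so a numeric code in a task log can be recognised. The numeric values are the ones defined in the standard OpenCL header; the `MEMORY_RELATED` grouping and the function name `describe` are my own assumptions for illustration.

```python
# A few status codes as defined in the standard OpenCL header (cl.h).
OPENCL_ERRORS = {
    0: "CL_SUCCESS",
    -1: "CL_DEVICE_NOT_FOUND",
    -2: "CL_DEVICE_NOT_AVAILABLE",
    -3: "CL_COMPILER_NOT_AVAILABLE",
    -4: "CL_MEM_OBJECT_ALLOCATION_FAILURE",
    -5: "CL_OUT_OF_RESOURCES",
    -6: "CL_OUT_OF_HOST_MEMORY",
}

# Illustrative grouping (an assumption): codes that point at a memory shortage.
MEMORY_RELATED = {-4, -6}


def describe(code: int) -> str:
    """Return a readable description of an OpenCL status code."""
    name = OPENCL_ERRORS.get(code, "unknown code")
    hint = " (memory related)" if code in MEMORY_RELATED else ""
    return f"{code}: {name}{hint}"


print(describe(-4))
```

A run of tasks all failing with the same allocation error fits the picture of a card that cannot provide the memory these tasks ask for.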
