You all remember that when using OpenCL, only 27% of the GPU memory is available by design for scientific purposes. It was specifically designed to prevent people from doing what we are doing and to force them to use the professional cards, i.e. Tesla. So don't expect to use all 2 GB of that card.
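For anyone who wants to check what their own card reports, the limit being discussed can be queried directly through the standard OpenCL API. The sketch below is not part of any Einstein@Home code; it simply prints CL_DEVICE_GLOBAL_MEM_SIZE (total device memory) and CL_DEVICE_MAX_MEM_ALLOC_SIZE (the largest single buffer the driver will accept). On many consumer cards the latter is reported as roughly a quarter of the former, which may be where figures like 25-27% come from.

/* Minimal sketch: query an OpenCL GPU's total memory and its maximum
 * single-buffer allocation.  Illustrative only - not from the E@H app. */
#include <stdio.h>
#include <CL/cl.h>

int main(void) {
    cl_platform_id platform;
    cl_device_id device;
    cl_ulong global_mem = 0, max_alloc = 0;

    if (clGetPlatformIDs(1, &platform, NULL) != CL_SUCCESS ||
        clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL) != CL_SUCCESS) {
        fprintf(stderr, "No OpenCL GPU found\n");
        return 1;
    }
    clGetDeviceInfo(device, CL_DEVICE_GLOBAL_MEM_SIZE, sizeof global_mem, &global_mem, NULL);
    clGetDeviceInfo(device, CL_DEVICE_MAX_MEM_ALLOC_SIZE, sizeof max_alloc, &max_alloc, NULL);

    printf("Global memory    : %llu MiB\n", (unsigned long long)(global_mem >> 20));
    printf("Max single alloc : %llu MiB (%.0f%% of total)\n",
           (unsigned long long)(max_alloc >> 20),
           100.0 * (double)max_alloc / (double)global_mem);
    return 0;
}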
A new twist on the Error while computing (https://einsteinathome.org/workunit/447314688): 15 in a row, and the 3 GB card usually fails at ~1 min. The host had been running S@H quite well, and it runs pulsars quite well. Methinks it's not the host but bad data.

I'll let it go on for a while longer so the devs can take a look, and then deselect the app. Damn, I really want to find a GW!
Betreger wrote:
A new twist on the Error while computing ...

Was that card running tasks in parallel... or just one?
Betreger wrote:
The host had been running S@H quite well, and runs pulsars quite well. Methinks it's not the host but bad data.
It looks like you've just recently moved the host from Seti back to here and been allocated a rather high 'frequency' to start working on. The task names have a frequency component as part of the name - in your case it's a range between 1587.25 - 1587.80 Hz. The frequency you get allocated is a 'luck of the draw'. I looked at one of my machines and it's currently doing multiple frequency bins around 1100 to 1300 Hz. The latest tasks being received right now are in the mid 1400s. I guess that sooner or later it will go higher but the GPU has 4GB so maybe it will be OK. Worth keeping an eye on though :-).
As Mike has mentioned, a big part of the processing involves FFTs (Fast Fourier Transforms), which are memory-hungry beasts. I'm no mathematician, but my guess is that the higher the frequency, the more memory-hungry the process might be. So I would guess it's not "bad data" but an inability to allocate sufficient memory to hold the data.
The Devs would be tracking the error rate pretty closely so I'm sure they will notice if this becomes a widespread problem for GPUs with 3GB or less. You have really 'done your duty' in providing examples so I suggest you put it back to GRP tasks until we see how widespread this might become. I'll be leaving mine alone for the moment but will be paying rather more attention as the frequency gets higher.
Thanks for your report and sorry for your pain :-).
Cheers,
Gary.
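To put some toy numbers on the FFT point above: for a fixed frequency resolution df, an FFT covering a band up to f_max needs roughly f_max/df complex bins, so buffer sizes grow about linearly with the highest frequency searched. The sketch below is purely illustrative - the one-day segment length and the single-buffer layout are assumptions, not how the actual GW application organises its memory.

/* Toy estimate of FFT buffer size versus maximum search frequency.
 * Assumptions (not from the real application): a one-day coherent segment
 * sets the frequency resolution, and one single-precision complex value
 * (8 bytes) is stored per frequency bin. */
#include <stdio.h>

int main(void) {
    const double df = 1.0 / 86400.0;                  /* assumed resolution: 1-day segment */
    const double freqs[] = { 500.0, 1100.0, 1587.5 }; /* Hz - roughly the bands mentioned above */

    for (int i = 0; i < 3; i++) {
        double bins = freqs[i] / df;                  /* complex bins needed up to f_max */
        double mib  = bins * 8.0 / (1024.0 * 1024.0); /* 8 bytes per complex float bin */
        printf("f_max = %7.1f Hz -> %4.0f Mbins -> ~%5.0f MiB per FFT buffer\n",
               freqs[i], bins / 1e6, mib);
    }
    return 0;
}

Even with these made-up numbers, the trend is the point: a band in the high 1500s needs roughly three times the scratch space of one in the low 500s, which squeezes a 3 GB card much harder than a 4 GB one.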
Gary, I haven't given up on GW tasks on that GPU yet. I have gone back to one at a time from 2. The thing that puzzles me is that I have an almost identical host with the same GPU that has not exhibited the issue. The main difference is that the one without problems is a W8.1 machine; the problem child is a W10 machine.
Betreger wrote:
Gary, I haven't given up on GW tasks on that GPU yet.
So I guess you're now saying that a reboot was a temporary fix only? This doesn't seem like a data problem - more like something flakey with the hardware. Only you can prove or disprove that.
The basic technique (assuming desktop or tower type machines) is to swap particular hardware items (one at a time - that's important) and run the machines until you are certain whether the problem has stayed with the host or moved with the swapped item. There is also a third possibility, which is the best outcome of all: neither machine misbehaves after the swap. That can happen if the problem was actually a dicey connection to the attached hardware which has now been 'fixed' by remaking that connection, or it could just be the change in electrical conditions (voltage levels and stability, etc.) that comes with the change of host.
Betreger wrote:
... I have an almost identical host ...

That's not actually true :-). Apart from the different OS versions, there are also different driver versions. On the hardware front, one host has an i5 3350P, an Ivy Bridge processor launched in 2012, whilst the other has an i5 6400, a Skylake processor first launched in 2015. Undoubtedly the motherboards would be quite different, as could be the RAM.
The items of hardware most amenable to swapping are the GPUs, the RAM (check carefully for compatibility) and finally, the PSUs. If they were my hosts, the first check I would make would be to swap the GPUs (check all contacts carefully) and see what happens. The best outcome would be that the problem suddenly disappears from both. If so, make sure you give both an extended test before assuming success. If the problem stays with the host, it's unlikely to be the GPU, so you could move on to test something else. If the problem transfers with the GPU, it's that hardware that is likely at fault. The most common cause of GPU problems is insufficient cooling - check all fans for smooth, free running as you transfer them. Also give each GPU a good clean whilst you have them out.
Cheers,
Gary.
Gary, it's been a day since I went to 1 at a time and the problem hasn't resurfaced yet. I'm keeping a close eye on it.
Gary, just an update. Upon further examination it turned out both GTX 1060s were having problems with some GW tasks. Since the 4th, when I went to running 1 at a time on them, I have not had any bomb out. I really don't know if that did it or if Bernd stopped sending out those poison pills, but everything is looking good as of now. As an aside, pulsars run fine 2 at a time.
My host got 60 of those Vela tasks in a row. All errored out with CL_MEM_OBJECT_ALLOCATION_FAILURE. Now that host isn't getting anything, of course, until tomorrow; I hope it will be G3a sent out then, not Vela.
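For reference, CL_MEM_OBJECT_ALLOCATION_FAILURE is the standard OpenCL error the driver returns when it cannot find memory for a buffer. The sketch below is not taken from the Einstein@Home application; it just shows one way the error typically surfaces: buffer creation can succeed lazily, and the failure only appears once buffers are actually used, which may be why a task can start normally and only bomb out a minute or so in.

/* Illustrative sketch (not from the E@H app): keep creating and touching
 * 256 MiB buffers until the device runs out of memory.  The buffers are
 * deliberately never released, so device memory keeps filling up; the first
 * failure is typically CL_MEM_OBJECT_ALLOCATION_FAILURE (some drivers may
 * report CL_OUT_OF_RESOURCES instead). */
#include <stdio.h>
#include <stdlib.h>
#include <CL/cl.h>

int main(void) {
    cl_platform_id platform;
    cl_device_id device;
    cl_int err;

    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);
    cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, &err);
    cl_command_queue q = clCreateCommandQueue(ctx, device, 0, &err);

    const size_t chunk = (size_t)256 << 20;   /* 256 MiB per buffer */
    char *host = calloc(1, chunk);            /* host data to force a real device allocation */

    for (int i = 0; i < 64; i++) {            /* up to 16 GiB worth of buffers */
        cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE, chunk, NULL, &err);
        if (err == CL_SUCCESS)                /* allocation is often deferred until first use */
            err = clEnqueueWriteBuffer(q, buf, CL_TRUE, 0, chunk, host, 0, NULL, NULL);
        printf("buffer %2d: err = %d%s\n", i, err,
               err == CL_MEM_OBJECT_ALLOCATION_FAILURE ? "  <- CL_MEM_OBJECT_ALLOCATION_FAILURE" : "");
        if (err != CL_SUCCESS)
            break;
    }

    free(host);
    clReleaseCommandQueue(q);
    clReleaseContext(ctx);
    return 0;
}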