Errors with GravWave Search - GPU

Betreger
Joined: 25 Feb 05
Posts: 992
Credit: 1593205673
RAC: 779419

Today the computation errors came back, even running 1 task at a time. The only saving grace is they continue to bomb out in ~1 min. So I had an 11 day run without an error, and now I'm getting a bunch.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5872
Credit: 117742785509
RAC: 34859286

Richie wrote:
All errored out with CL_MEM_OBJECT_ALLOCATION_FAILURE.

I'm sorry for your pain, but thanks for reporting.

I notice that all the latest tasks being received are labeled VelaJr1 (i.e. targeting the Vela Jr. supernova remnant), but in my case I have a big enough cache that it will be a little while yet before those reach the top of the queue. I still have quite a few G34731 tasks to go (running 3x). I've suspended all those that are waiting and have allowed a single VelaJr1 task to start when one of the current three finishes.

That's just happened, and so far the two G34731 plus one VelaJr1 are running without a problem. When the two old tasks finish, I'll allow a 2nd VelaJr1 task to start. If that is OK, I'll allow a third.

20 mins later:

Well, the 2nd VelaJr1 task started OK, so I allowed both to run for a while and then tried a third. That failed after 7 secs whilst the other two kept going. The GPU is a 4GB RX 570, so it looks like you might need (roughly) close to 2GB per concurrent task. I remember someone (Zalster, I think, or perhaps Keith) mentioning that nvidia crippled the memory available on consumer grade cards being used for compute, so as to force the use of their much more expensive professional range for that purpose. If true, that might be why both 3GB and 2GB nvidia cards are having problems with these tasks. So betreger's "poison pills" might more appropriately be described as nvidia's "dirty tricks" :-(.

The two running VelaJr1 tasks have now finished successfully. They took 55 mins and 53 mins respectively; since the two ran concurrently, that works out to one completed task every ~27 mins, based on just those two. In other words, the run time has roughly doubled.

The previous G34731 tasks had been taking ~36 mins at 3x, so ~12 mins per task. So it looks like I'll need to change back to 2x when the G34731 tasks finally run out :-(. That machine will be going from a G34731 task every ~12 mins to a VelaJr1 task every ~27 mins. I'll need to watch closely in case 2x turns out not to be viable after all.

Thanks for the "heads up", guys!

Cheers,
Gary.

Zalster
Joined: 26 Nov 13
Posts: 3117
Credit: 4050672230
RAC: 0

Both Keith and I have commented on the RAM availability on NVIDIA cards. It's hard-locked at 27% of the total RAM of the card. He and I both reviewed a white paper on the subject several years back.
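
If you want to check what your own card's driver reports, here's a minimal OpenCL query sketch (my illustration, not project code: it assumes an OpenCL SDK with headers is installed, and reads CL_DEVICE_MAX_MEM_ALLOC_SIZE, the per-allocation cap, which the OpenCL spec only guarantees to be at least a quarter of global memory, so the exact percentage is vendor-dependent):

```c
/* Report total VRAM vs. the largest single OpenCL allocation allowed.
 * Minimal sketch: assumes the first platform has at least one GPU.
 * Build: gcc clmem.c -lOpenCL
 */
#include <stdio.h>
#include <CL/cl.h>

int main(void) {
    cl_platform_id platform;
    cl_device_id device;
    cl_ulong total = 0, max_alloc = 0;
    char name[256] = "unknown";

    if (clGetPlatformIDs(1, &platform, NULL) != CL_SUCCESS ||
        clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL) != CL_SUCCESS) {
        fprintf(stderr, "no OpenCL GPU found\n");
        return 1;
    }

    /* Total device memory vs. the biggest single buffer the driver allows. */
    clGetDeviceInfo(device, CL_DEVICE_NAME, sizeof(name), name, NULL);
    clGetDeviceInfo(device, CL_DEVICE_GLOBAL_MEM_SIZE, sizeof(total), &total, NULL);
    clGetDeviceInfo(device, CL_DEVICE_MAX_MEM_ALLOC_SIZE, sizeof(max_alloc), &max_alloc, NULL);

    printf("%s: %llu MB total, %llu MB max single allocation (%.0f%%)\n",
           name,
           (unsigned long long)(total >> 20),
           (unsigned long long)(max_alloc >> 20),
           100.0 * (double)max_alloc / (double)total);
    return 0;
}
```

In my understanding, nvidia's OpenCL driver tends to report that cap at the spec minimum of a quarter of total memory, which is in the same ballpark as the 27% figure.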

Wiggo
Joined: 29 Oct 10
Posts: 10
Credit: 243184304
RAC: 0

I'm not getting the "CL_MEM_OBJECT_ALLOCATION_FAILURE" errors, but on my 3GB GTX 1060s, 21-26 secs after a VelaJr1 task starts it jumps to 100% and I get "exited with zero status but no 'finished' file" as well as "If this happens repeatedly you may need to reset the project", and then the task just restarts.

Ha, I've reset the project twice and I'm still getting the same results. It only seems to happen with these VelaJr1 tasks, so I'm guessing my problem is related to this.

Cheers.

Zalster
Joined: 26 Nov 13
Posts: 3117
Credit: 4050672230
RAC: 0

Wiggo, try restarting your computer and see if it goes away.

halfempty
Joined: 3 Apr 20
Posts: 14
Credit: 37595576
RAC: 0

Excuse my jumping in here, but I think I have the same problem. I just got home from work and found 59 errors on my main system, which has two GPUs. All of the errors are VelaJr1 tasks, and all failed on the GTX 1060 3GB.

I am going to try to figure out how to exclude that GPU from Einstein work and see if I get any errors on the other card, a 1660 with 6GB. That will have to wait until tomorrow; right now I'm in the Einsteinian Dog House :-(

If anyone wants to take a look, this is the host:

https://einsteinathome.org/host/12820614

Keith Myers
Joined: 11 Feb 11
Posts: 4968
Credit: 18762491690
RAC: 7168397

You need to write a configuration file for BOINC that excludes the 3GB card from the project. The instructions can be followed at the reference document page:

https://boinc.berkeley.edu/wiki/Client_configuration#Application_configuration
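
One note on that page: the GPU exclusion itself is the <exclude_gpu> option in cc_config.xml (in the BOINC data directory), while app_config.xml handles per-application settings such as task concurrency. A minimal sketch, assuming the 3GB GTX 1060 shows up as device 0 (the device numbering is printed near the top of the event log at client startup):

```xml
<!-- cc_config.xml: keep device 0 away from Einstein@Home only;
     other projects can still use it.  Device 0 is an assumption;
     check your own event log for the actual number. -->
<cc_config>
  <options>
    <exclude_gpu>
      <url>https://einsteinathome.org/</url>
      <device_num>0</device_num>
    </exclude_gpu>
  </options>
</cc_config>
```

After saving it, use the manager's "Read config files" option or restart the client; the startup messages in the event log should then show that GPU as ignored for the project.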

Wiggo
Joined: 29 Oct 10
Posts: 10
Credit: 243184304
RAC: 0

Zalster wrote:
Wiggo, try restarting your computer and see if it goes away.

Already did that 3 times before the original post, Zalster. :-(

I also set NNT (no new tasks) hours ago, but I keep getting sent these "lost tasks" that are also VelaJr1 work, and they all end in the same result even with all CPU tasks suspended. :-(

Cheers.

microchip
Joined: 10 Jun 06
Posts: 50
Credit: 200915038
RAC: 743541

I've also been getting a lot of errors over the past few days. I have an older GTX 750 Ti with 2 GB of VRAM. When a WU starts, it quickly fills up all the VRAM on the card and errors out. Before that, I was crunching E@H GW tasks successfully. I've detached from the project until this can be resolved.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5872
Credit: 117742785509
RAC: 34859286

Wiggo wrote:
I'm not getting the "CL_MEM_OBJECT_ALLOCATION_FAILURE" errors, ....

For the tasks on the website that show as computation errors, you certainly were.

You don't see the error message in the event log on your machine. You need to go to the list of tasks on the website and choose any one showing as a computation error. If you click the "Task ID" link for that task, you get to see the whole of the stderr.txt that was returned to the project after the task failed. That gives a much better idea of what caused the failure. In this case, it shows a failure to allocate sufficient memory for the job to run.

Unfortunately, there seems to be enough evidence to suggest that these latest high frequency VelaJr1 branded tasks require more than 3GB of memory to run as singles on nvidia GPUs.  If you're in that boat, the safest way to avoid the angst of failed tasks is to switch over to the Gamma-ray Pulsar GPU tasks and deselect the GW stuff.

Get some new GRP tasks before aborting all the old GW tasks.  That way, you'll have something to crunch and return if you have to abort so many that you use up your daily quota and the project gives you a 24 hour backoff.  As soon as you successfully crunch a GRP task, force an update to return it (even if backed off).  That will start to restore your daily limit.  Rinse and repeat as necessary.  After that, you should hopefully have no further problems.

If anyone wants more details on what people are experiencing, take a look at the messages a little earlier in this thread. My guess is that the project wants to get this work done 'as is', even if that means excluding GPUs with lower available memory. I doubt they can change the task configuration to avoid the problem, certainly not at short notice. The devs will see the failures, so hopefully we will get some recommendations/suggestions on how to deal with this once they've had time to analyse the issue.

Cheers,
Gary.
