Errors with GravWave Search - GPU

Betreger
Joined: 25 Feb 05
Posts: 987
Credit: 1435949569
RAC: 538065

Today the computation errors came back, even running one at a time. The only saving grace is that they continue to bomb out in ~1 min. So I had an 11-day run without an error, and now I'm getting a bunch.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5850
Credit: 110026629485
RAC: 22496187

Richie wrote:
All errored out with CL_MEM_OBJECT_ALLOCATION_FAILURE.

I'm sorry for your pain, but thanks for reporting.

I notice that all the latest tasks being received are labeled VelaJr1 (i.e. looking at the Vela Pulsar), but in my case I have a big enough cache that it will be a little while yet before those reach the top of the queue.  I still have quite a few G34731 tasks to go (running 3x).  I've suspended all those that are waiting and have allowed a single VelaJr1 task to start when one of the current three finishes.

That's just happened and so far no problems.  Two G34731 plus one VelaJr1 are running without a problem.  When the two old tasks finish, I'll allow a 2nd VelaJr1 task to start.  If that is OK, I'll allow a third.

20 mins later:

Well, the 2nd VelaJr1 task started OK, so I allowed both to run for a while and then tried a third.  That failed after 7 secs whilst the other two kept going.  The GPU is a 4GB RX 570, so it looks like you might need (roughly) close to 2GB per concurrent task.  I remember someone (Zalster, I think, or perhaps Keith) mentioning that Nvidia crippled the memory use on consumer-grade cards being used for compute, so as to force the use of their much more expensive professional range for that purpose.  If true, that might be why both 3GB and 2GB Nvidia cards are having problems with these tasks.  So Betreger's "poison pills" might more appropriately be described as Nvidia's "dirty tricks" :-(.

The two running VelaJr1 tasks have now finished successfully.  They took 55 mins and 53 mins respectively, so with two running concurrently that's one task finished every ~27 mins on average, based on just those two.  In other words, the run time has doubled.

The previous G34731 tasks had been taking ~36 mins at 3x, so ~12 mins per task.  It looks like I'll need to change back to 2x when the G34731 tasks finally run out :-(.  That machine will be going from finishing a G34731 task every ~12 mins to a VelaJr1 task every ~27 mins.  I'll need to watch closely in case 2x turns out not to be viable after all.
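For anyone wondering how the 2x/3x concurrency gets set, it's done with an app_config.xml file in the Einstein@Home project directory.  Here's a minimal sketch; the app name einstein_O2MD1 is my assumption for the GW GPU search (check the real name in your client_state.xml or the event log), and gpu_usage 0.5 gives 2x:

    <app_config>
      <app>
        <name>einstein_O2MD1</name>          <!-- assumed app name; verify locally -->
        <gpu_versions>
          <gpu_usage>0.5</gpu_usage>         <!-- 0.5 GPU per task = 2 concurrent tasks -->
          <cpu_usage>1.0</cpu_usage>         <!-- one CPU core reserved per GPU task -->
        </gpu_versions>
      </app>
    </app_config>

After saving, Options -> Read config files in BOINC Manager picks it up without a client restart.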

Thanks for the "heads up", guys!

Cheers,
Gary.

Zalster
Joined: 26 Nov 13
Posts: 3117
Credit: 4050672230
RAC: 0

Both Keith and I have commented on the RAM availability on NVIDIA cards.  It's hard-locked at 27% of the card's total RAM.  He and I reviewed a white paper on the subject several years back.
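If anyone wants to check what cap their own card reports, a quick way is to query OpenCL's per-buffer allocation limit.  A minimal sketch in Python using pyopencl (assuming it's installed, e.g. pip install pyopencl, plus working OpenCL drivers):

    import pyopencl as cl

    # Print each OpenCL device's total memory and the largest
    # single buffer it is willing to allocate.
    for platform in cl.get_platforms():
        for dev in platform.get_devices():
            total = dev.global_mem_size
            cap = dev.max_mem_alloc_size
            print(f"{dev.name}: {total / 2**30:.2f} GiB total, "
                  f"max single allocation {cap / 2**30:.2f} GiB "
                  f"({100 * cap / total:.0f}% of total)")

Note this limit applies to a single buffer, not the card's total usage; an app can still spread data across several buffers, but any one allocation above the cap will be refused.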

Wiggo
Joined: 29 Oct 10
Posts: 10
Credit: 243184304
RAC: 0

I'm not getting the "CL_MEM_OBJECT_ALLOCATION_FAILURE" errors, but on my 3GB GTX 1060s, 21-26 secs after a VelaJr1 task starts, the task jumps to 100% and I get "exited with zero status but no 'finished' file" as well as "If this happens repeatedly you may need to reset the project", and then the task just restarts again.

Ha, I've reset the project twice and I'm still getting the same results, but it only seems to happen with these VelaJr1 tasks, so I'm guessing my problem is related to this.

Cheers.

Zalster
Joined: 26 Nov 13
Posts: 3117
Credit: 4050672230
RAC: 0

Wiggo, try restarting your computer and see if it goes away.

halfempty
Joined: 3 Apr 20
Posts: 14
Credit: 37595576
RAC: 0

Excuse my jumping in here, but I think I have the same problem. I just got home from work and found 59 errors on my main system, which has two GPUs. All of the errors are VelaJr1 tasks, and all failed on the GTX 1060 3GB.

I am going to try to figure out how to exclude that GPU from Einstein work and see if I get any errors on the other card, a 1660 with 6GB. That will have to wait until tomorrow; right now I'm in the Einsteinian doghouse :-(

If anyone wants to take a look, this is the host:

https://einsteinathome.org/host/12820614

Keith Myers
Joined: 11 Feb 11
Posts: 4754
Credit: 17704416068
RAC: 5329859

You need to write a cc_config.xml file for BOINC with an <exclude_gpu> entry to keep the 3GB card away from the Einstein project (it's the client configuration options, not app_config, that handle GPU exclusion).  The instructions can be followed at the reference document page.

https://boinc.berkeley.edu/wiki/Client_configuration#Application_configuration
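
A minimal sketch of the exclusion, assuming the 3GB card shows up as device 0 (the device numbers are listed near the top of the BOINC event log at startup); this goes in cc_config.xml in the BOINC data directory:

    <cc_config>
      <options>
        <exclude_gpu>
          <url>https://einsteinathome.org/</url>   <!-- project URL as shown on the Projects tab -->
          <device_num>0</device_num>               <!-- assumed: the 3GB GTX 1060; verify in the event log -->
        </exclude_gpu>
      </options>
    </cc_config>

Then restart the client (some cc_config options need a full restart to take effect); the 6GB 1660 will keep crunching Einstein work.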

Wiggo
Joined: 29 Oct 10
Posts: 10
Credit: 243184304
RAC: 0

Zalster wrote:
Wiggo, try restarting your computer and see if it goes away.

Already did that 3 times before the original post, Zalster. :-(

I also set NNT (No New Tasks) hours ago, but I keep getting sent these "lost tasks", which are also VelaJr1 work and all end in the same result, even with all CPU tasks suspended. :-(

Cheers.

microchip
Joined: 10 Jun 06
Posts: 50
Credit: 113149484
RAC: 57107

I'm also getting a lot of errors since a few days ago. I have an older GTX 750 Ti with 2GB of VRAM. When the WU starts, it quickly fills up all the VRAM on the card and errors out. Until a few days ago, I was crunching E@H GW tasks successfully. I've detached from the project until this can be resolved.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5850
Credit: 110026629485
RAC: 22496187

Wiggo wrote:
I'm not getting the "CL_MEM_OBJECT_ALLOCATION_FAILURE" errors, ....

For the tasks on the website that show as computation errors, you certainly were.

You don't see the error message in the event log on your machine.  You need to go to the list of tasks on the website and choose any one showing as a computation error.  If you click the "Task ID" link for that task, you get to see the whole of the stderr.txt that was returned to the project after the task failed.  That gives you a much better idea of what caused the failure.  In other words, there was a failure to allocate sufficient memory for the job to run.

Unfortunately, there seems to be enough evidence to suggest that these latest high-frequency VelaJr1-branded tasks require more than 3GB of memory to run as singles on Nvidia GPUs.  If you're in that boat, the safest way to avoid the angst of failed tasks is to switch over to the Gamma-ray Pulsar GPU tasks and deselect the GW stuff.

Get some new GRP tasks before aborting all the old GW tasks.  That way, you'll have something to crunch and return if you have to abort so many that you use up your daily quota and the project gives you a 24-hour backoff.  As soon as you successfully crunch a GRP task, force an update to return it (even if backed off).  That will start to restore your daily limit.  Rinse and repeat as necessary.  After that, you should hopefully have no further problems.
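For those comfortable with the command line, the forced update can also be done with boinccmd, which ships with the BOINC client (the project URL below is Einstein@Home's master URL; adjust the path to boinccmd for your install):

    # Report the finished GRP task immediately, even while backed off.
    boinccmd --project https://einsteinathome.org/ update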

If anyone wants more details on what people are experiencing, take a look at the messages a little earlier in this thread.  My guess is that the project wants to get this work done 'as is', even if that means excluding GPUs with lower available memory.  I doubt they can change the task configuration to avoid the problem, certainly not at short notice.  The devs will see the failures, so hopefully we'll get some recommendations/suggestions on how to deal with this once they've had time to analyse the issue.

Cheers,
Gary.
