Machines not receiving new tasks after 12 per day

SJC_Steve
Joined: 20 Jul 11
Posts: 28
Credit: 523385984
RAC: 485668
Topic 224502

My PC isn't doing much E@H GPU work. In the message log, it says no new tasks are being sent since it has completed the daily limit of 12 tasks.

Is this a new limit on the number of tasks that can be done per day, and if so, why?

Thanks,
Steve

archae86
Joined: 6 Dec 05
Posts: 3157
Credit: 7221554931
RAC: 967170

SJC_Steve wrote:
Is this a new limit on the number of tasks that can be done per day, and if so, why?

The limit has not changed in a long while, and for your machine it was far more than 12 until it got into trouble.

Your actual problem is that the machine produced a flurry of nearly instant failures: GW GPU tasks that reported about 2 seconds of run time and ended with a computation error.

Each such error reduces your daily quota.  Yours worked its way down to 12.

If you find the problem and fix it, then on the next new day you'll be able to get some work on the reduced quota, and each successful result returned will greatly raise the quota.

So fixing the problem causing instant failure of every task is the priority here.
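To make the quota behaviour concrete, here is a minimal sketch in Python. The exact rule is an assumption for illustration only: this thread says just that each error reduces the quota and each success "greatly" raises it. The step of one per error, the doubling per success, and the ceiling of 32 are hypothetical numbers, not the project's actual settings.

```python
def update_quota(quota, succeeded, max_quota, min_quota=1):
    """Return the daily quota after one reported result.

    Hypothetical rule: an error lowers the quota by one (never below
    min_quota); a success doubles it (never above max_quota).
    """
    if succeeded:
        return min(quota * 2, max_quota)
    return max(quota - 1, min_quota)


def simulate(start, outcomes, max_quota):
    """Apply update_quota to a sequence of outcomes and record each step."""
    quota = start
    history = [quota]
    for succeeded in outcomes:
        quota = update_quota(quota, succeeded, max_quota)
        history.append(quota)
    return history


# Hypothetical host: ceiling of 32, then 20 instant errors, then 2 successes.
history = simulate(32, [False] * 20 + [True] * 2, max_quota=32)
print(history[20], history[-1])  # quota after the errors, then after recovery
```

Under these assumed numbers a run of quick errors walks the quota down steadily, while only a couple of good results are enough to bring it back up. That is why fixing the cause matters more than the quota itself.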

Unfortunately, my brief review of your stderr did not show the symptoms of a problem that I recognize.  Possibly someone else will drop by to help.

My personal first move would be to reboot the machine.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5872
Credit: 117573353252
RAC: 35248388

SJC_Steve wrote:
My PC isn't doing much E@H GPU work.

You have 3 different machines on your account so I had to do a bit of searching to identify one with problems.  The tasks list currently shows 549 tasks in total of which the most recent 35 have compute errors.  By clicking on the TaskID link for the first error, there is an output log that contains the line:-

[ERROR] Couldn't get OpenCL device from BOINC (-1)!

The task was returned at 13 Jan 2021 23:00:09 UTC, as were a lot more of these.  Interestingly, 3 successfully completed and validated tasks were returned at exactly the same time as all the errors.  Before that, at 12:17:36 UTC (almost 11 hours earlier), more validated tasks were returned.  This suggests that the machine was working fine and was then shut down (or at least stopped from crunching) for a number of hours.  Did you do something like a driver update?  When crunching resumed, the 3 'good' completed tasks were already waiting, and a bunch of immediate errors followed and were returned along with them.

If you want to identify what caused the issue, I suggest you work out your local times corresponding to the above two UTC times recorded by the server and think about what changes you might have made during that particular interval.  It seems likely that something you did caused the rather drastic set of events that followed.  I note that your three machines have the same Ubuntu version but have 3 different GPU driver versions.  That may be significant.
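For anyone who wants to do that conversion quickly, here is a small Python sketch. The two UTC timestamps are the ones quoted above. The local offset of UTC-8 is purely an assumption for the example (Steve's actual time zone is not stated in this thread), so substitute your own.

```python
from datetime import datetime, timedelta, timezone

# Assumption: a fixed local offset of UTC-8. Replace with your own offset.
LOCAL_TZ = timezone(timedelta(hours=-8))


def to_local(utc_text):
    """Parse a server timestamp like '13 Jan 2021 23:00:09' (UTC) and
    return it converted to LOCAL_TZ. Month names are parsed with %b,
    which expects English abbreviations in the default C locale."""
    utc_time = datetime.strptime(utc_text, "%d %b %Y %H:%M:%S")
    return utc_time.replace(tzinfo=timezone.utc).astimezone(LOCAL_TZ)


last_good = to_local("13 Jan 2021 12:17:36")     # last batch of valid results
first_errors = to_local("13 Jan 2021 23:00:09")  # errors returned
gap = first_errors - last_good

print("last good results:", last_good.strftime("%d %b %Y %H:%M:%S"))
print("errors returned:  ", first_errors.strftime("%d %b %Y %H:%M:%S"))
print("window to examine:", gap)
```

With the assumed offset, the window to look at runs from early morning to mid-afternoon local time on 13 Jan. Anything changed on the machine in that window (package updates, driver installs) is a candidate cause.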

Cheers,
Gary.

SJC_Steve
Joined: 20 Jul 11
Posts: 28
Credit: 523385984
RAC: 485668

Gary and Archae86,

Thanks for the input, I'll dig into it.

Steve

SJC_Steve
Joined: 20 Jul 11
Posts: 28
Credit: 523385984
RAC: 485668

In looking into the latest failure on WORKUNIT 517274929, I see three different computers erroring out and one computer that completed and validated the task. The workunit is still incomplete at this point, waiting for another result to validate.

Is this an issue with our PCs or the task?

Thanks,
Steve

SJC_Steve
Joined: 20 Jul 11
Posts: 28
Credit: 523385984
RAC: 485668

I updated and rebooted one of my PCs that wasn't getting work and it began working. I'll try the same with the remaining ones.

Thanks again,

Steve

archae86
Joined: 6 Dec 05
Posts: 3157
Credit: 7221554931
RAC: 967170

SJC_Steve wrote:

In looking into the latest failure on WORKUNIT 517274929, I see three different computers erroring out and one computer that completed and validated the task. The workunit is still incomplete at this point, waiting for another result to validate.

The task you mention is a GW task with a very high issue number and a DF of .60.  It probably needs quite a lot of VRAM on the GPU card to run successfully.

Your two other quorum partners who failed did so after about 90 seconds of run time.  The first reports that the card had 2048 MiB of memory, and includes the familiar error indication:

CL_MEM_OBJECT_ALLOCATION_FAILURE

The second for some reason has an empty stderr, but the computer does report 2048 MB of GPU memory, and it is in my opinion nearly certain that it also failed for lacking adequate VRAM to run this particular task.

The single computer which has so far successfully completed this task reports GPU VRAM of 4095 MB, which obviously was sufficient.

Yours, on the other hand, failed after less than 3 seconds, not 90.  It does not show the standard memory-inadequacy message, it does show a different standard message (the one Gary discussed), and the computer reports 4083 MB of VRAM on the GPU.

You had a configuration problem on the machine.  The reason for your sequence of fast failures is not bad tasks.
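The distinction drawn above can be sketched as a small Python helper that sorts a task's stderr text by the two error strings quoted in this thread. The function name and the category labels are hypothetical, and the matching is a plain substring test, so treat it as a rough triage aid rather than a diagnosis.

```python
def classify_failure(stderr_text):
    """Sort a failed task's stderr into a rough category, using only the
    two error strings quoted in this thread. Labels are illustrative."""
    if "Couldn't get OpenCL device from BOINC" in stderr_text:
        return "no OpenCL device: look at the host's driver/configuration"
    if "CL_MEM_OBJECT_ALLOCATION_FAILURE" in stderr_text:
        return "GPU memory allocation failed: likely too little VRAM"
    if not stderr_text.strip():
        return "empty stderr: cause not shown"
    return "unrecognized: read the full log"


# Example inputs modelled on the three failed hosts discussed above.
print(classify_failure("[ERROR] Couldn't get OpenCL device from BOINC (-1)!"))
print(classify_failure("... CL_MEM_OBJECT_ALLOCATION_FAILURE ..."))
print(classify_failure(""))
```

The run time is a useful second clue that this sketch ignores: the failures attributed to VRAM took about 90 seconds, while the missing-device failures took under 3.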

SJC_Steve
Joined: 20 Jul 11
Posts: 28
Credit: 523385984
RAC: 485668

Thanks for the updates. I updated and rebooted all three computers. One of them was still doing work while the other two were failing after a few seconds as you stated. It's a mystery to me, as I don't remember doing anything to two of them in the past couple of days, but all seems well now.

Thanks again,

Steve

Zalster
Joined: 26 Nov 13
Posts: 3117
Credit: 4050672230
RAC: 0

You were probably the victim of a critical update for your Ubuntu systems.  I was talking to Keith about the same issue with 2 of my machines. There was a critical update that also updated the drivers on the machines, literally pulling the OpenCL component out while they were still crunching. Since the GPUs were in use, the updated OpenCL component didn't get installed until you did a manual update and reboot. That's why you got the message about the missing OpenCL device.

Z
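One quick thing to look at after such an update is whether any OpenCL vendor is still registered on the machine. The Python sketch below lists the `.icd` files in `/etc/OpenCL/vendors`, the default directory the OpenCL ICD loader reads on Linux. The function name is hypothetical, and this is only a partial check: an `.icd` file being present does not prove the driver library it names actually loads, and nothing here confirms what happened on the machines in this thread.

```python
from pathlib import Path


def list_opencl_icds(vendors_dir="/etc/OpenCL/vendors"):
    """Return {icd file name: library it names} for each .icd file in
    vendors_dir, or an empty dict if the directory does not exist."""
    directory = Path(vendors_dir)
    if not directory.is_dir():
        return {}
    return {
        icd.name: icd.read_text().strip()
        for icd in sorted(directory.glob("*.icd"))
    }


icds = list_opencl_icds()
if icds:
    for name, library in icds.items():
        print(f"{name}: {library}")
else:
    print("No OpenCL vendor files found; the OpenCL package may be missing.")
```

If the list is empty, or the GPU vendor's entry has disappeared, finishing the update and rebooting, as was done here, is a reasonable first step.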
