(New user) Most GPU tasks failing with status=-36 CL_INVALID_COMMAND_QUEUE

Mike
Joined: 17 Jan 17
Posts: 7
Credit: 256649329
RAC: 0
Topic 204743

 

 

I joined yesterday & started processing GPU units (on a Win10 x64 with a GTX 1080).

But quite a few tasks seem to have failed, with 'error 28' "out of paper" at the top of the report (no doubt the usual BOINC idiosyncrasy of showing every return code as the matching Windows error code) and a status=-36 at the bottom of the report.

Is this a common problem, & is there anything I can do to improve the success rate on my PC?  Seems a shame to be killing so many tasks.  I think I've made my PC public, but I'm not 100% sure.

 

 https://einsteinathome.org/task/603674888

Quote:

The printer is out of paper. (0x1c) - exit code 28 (0x1c)

 ... edited for brevity ...

ERROR: /home/bema/fermilat/src/bridge_fft_clfft.c:882: clFinish failed. status=-36
ERROR: opencl_ts_2_phase_diff_sorted() returned with error 0
08:48:40 (14828): [CRITICAL]: ERROR: MAIN() returned with error '-36'
FPU status flags:  PRECISION

 

https://einsteinathome.org/task/603674987

 

Quote:

The printer is out of paper. (0x1c) - exit code 28 (0x1c)

 ...

ERROR: /home/bema/fermilat/src/bridge_fft_clfft.c:882: clFinish failed. status=-36
ERROR: opencl_ts_2_phase_diff_sorted() returned with error 0
02:47:43 (10964): [CRITICAL]: ERROR: MAIN() returned with error '-36'
FPU status flags:  PRECISION
02:47:54 (10964): [normal]: done. calling boinc_finish(28).

 

https://einsteinathome.org/task/603674992

Quote:

The printer is out of paper. (0x1c) - exit code 28 (0x1c)

...

ERROR: /home/bema/fermilat/src/bridge_fft_clfft.c:882: clFinish failed. status=-36
ERROR: opencl_ts_2_phase_diff_sorted() returned with error 0
02:35:57 (14816): [CRITICAL]: ERROR: MAIN() returned with error '-36'
FPU status flags:  PRECISION
02:36:09 (14816): [normal]: done. calling boinc_finish(28).

 

-- Edit: I found a reference to this return code here :-

https://einsteinathome.org/content/task-running-out-paper#comment-153670



Christian Beer wrote:
As seen in the stderr.log the real error code is -36 which is openCL specific and translates to CL_INVALID_COMMAND_QUEUE which is something Bernd will have to look at when he is back.

So my PC can be considered a second example.  Let me know if you want log files or whatever (and where to find them).
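
For what it's worth, from the little reading I've done the -36 in the log is just the raw OpenCL status code coming back from clFinish(), which the application reports before exiting with code 28 (and BOINC then dresses that up as the "printer out of paper" text).  A minimal sketch of the kind of check that produces that line - purely illustrative, not the actual Einstein@Home source:

    /* illustrative only - not the actual Einstein@Home source */
    #include <stdio.h>
    #include <CL/cl.h>

    /* Wait for all queued GPU work to finish and report any OpenCL error code. */
    int finish_queue(cl_command_queue queue)
    {
        cl_int status = clFinish(queue);
        if (status != CL_SUCCESS) {
            /* CL_INVALID_COMMAND_QUEUE is defined as -36 in CL/cl.h */
            fprintf(stderr, "ERROR: clFinish failed. status=%d\n", (int)status);
            return (int)status;
        }
        return 0;
    }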

 

archae86
Joined: 6 Dec 05
Posts: 3157
Credit: 7226431595
RAC: 1077073


Mike_279 wrote:
 Is there anything I can do to improve the success rate on my PC?

You appear to be running these on a Windows 10 PC with a modern driver.  There are plenty of other Pascal cards with far higher success rates.

I suggest you try turning down clock speeds on your card, substantially at first.  If that makes the problem go way down, you have strong evidence that clock rates matter, and can experiment to find your preferred settings.
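
If you want a record of what the card was actually doing around the time a task fails, nvidia-smi (it ships with the driver, usually under C:\Program Files\NVIDIA Corporation\NVSMI) can log clocks and temperature at a fixed interval.  Something along these lines - treat it as a sketch, since the available field names can vary between driver versions (check nvidia-smi --help-query-gpu):

    nvidia-smi --query-gpu=timestamp,clocks.gr,clocks.mem,temperature.gpu --format=csv -l 30 > gpu_log.csv

That gives you a timestamped CSV you can line up against the error times in the task reports.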

BobMALCS
Joined: 13 Aug 10
Posts: 20
Credit: 54539336
RAC: 0


As I raised the initial query on this, I'll add a little bit more info.

I run BOINC on a GT640 and GTX760, on Windows 7 with the latest Nvidia driver.  If I remember correctly, this error occurred on both of them.  They are both slightly overclocked.  However, they run several other projects without error, and until this particular error occurred they ran Einstein without a problem.

I will not be reducing my GPU speeds to the detriment of other successfully running projects.  So, until something changes I will not be running Einstein.

 

BobM

Mike
Joined: 17 Jan 17
Posts: 7
Credit: 256649329
RAC: 0


Finally got internet back :-)

 

Quote:
 If that makes the problem go way down, you have strong evidence that clock rates matter, 

Yes, I'll try downclocking the 1080 as an experiment to see if it makes a difference (I'm comfortable with overclocking CPUs, but I've never tried GPUs before, so I'm not familiar with the tools).  It's at factory clocks.

 

 -- Edit:

Found the overclocking tool.  First I'm gathering data at the factory settings, so I'll run my current batch of workunits through without changing anything other than the fan curve: by default the fan only comes on at 50C, and I changed it so it comes on at 30C instead.  Now it seems to be maintaining 24C idle, and around 36-38C after 15 minutes of running Einstein.  Prior to fiddling with the fan curve I think it was hovering around the 50-55C mark after running for several hours.  Being winter, the ambient temperature is low, so I suspect the GPU temps would be higher in summer.

Max temperature after running the FurMark GPU stress test for 10 minutes (factory settings except the new fan curve; fan at 66% @ 55C) is 55C.

Once the current Einstein GPU units have finished executing, I'll calculate the % failed, drop the GPU clock offset from 0 to (say -300mhz? is that reasonable?  I'm new to GPU overclocking), then repeat with a new batch.  

In terms of the errors found, so far I've had no verification failures, only the code=-36 thing, and it seems to be about 25% failed or so at a rough glance.

 

Speaking from the viewpoint of a CUDA/GPU processing newbie who doesn't know what they're doing, the error sounds more like a clash with something else trying to use the GPU (such as the screensaver), but taking a thorough approach to diagnostics is always good.

 

mmonnin
Joined: 29 May 16
Posts: 291
Credit: 3413876540
RAC: 3486162


BobMALCS wrote:

As I raised the initial query on this, I'll add a little bit more info.

I run BOINC on a GT640 and GTX760, on Windows 7 with the latest Nvidia driver.  If I remember correctly, this error occurred on both of them.  They are both slightly overclocked.  However, they run several other projects without error, and until this particular error occurred they ran Einstein without a problem.

I will not be reducing my GPU speeds to the detriment of other successfully running projects.  So, until something changes I will not be running Einstein.

 

BobM

/facepalm

Some projects require different clocks.  Setting a 0 MHz overclock on the card and letting it boost to whatever clock it can will show that the demand differs from project to project, and thus the maximum OC will differ too.  The only one-size-fits-all clock is stock, and even that's not guaranteed.

archae86
Joined: 6 Dec 05
Posts: 3157
Credit: 7226431595
RAC: 1077073


Mike_279 wrote:

Once the current Einstein GPU units have finished executing, I'll calculate the % failed, drop the GPU clock offset from 0 to (say -300mhz? is that reasonable?  I'm new to GPU overclocking), then repeat with a new batch.  

 

When I say this, I always say clocks not clock.  At least on the Nvidia cards I've played with, there generally are two, often called Core clock and Memory clock.

300 is a big drop for the core clock (which I assume is the one you are speaking of).  If many units work and many don't, and if it is actually a clock related issue, going -100 should show an appreciable change.

But if the real problem is the memory clock, that drop in core clock will just lower your productivity.  If you like to make single changes, take data, and make a disciplined comparison, I suggest lowering core and memory clocks by 100 each.

The other nasty thing about memory clocks is that they get discussed on at least three different scales, each differing by a factor of two.  So it can be helpful to specify which observation tool is in use (say GPU-Z vs. HWiNFO vs. MSI Afterburner vs. nvidiainspector, to name four I like and use).
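
Since you mentioned not being familiar with the tools: nvidiainspector can also apply clock offsets from the command line, which makes it easy to script each test run from a batch file.  I'm going from memory on the exact switch names, so treat this as a sketch and check its -help output before relying on it:

    rem hypothetical batch file - verify switch names against nvidiaInspector -help
    rem arguments are GPU index, performance state, offset in MHz
    nvidiaInspector.exe -setBaseClockOffset:0,0,-100 -setMemoryClockOffset:0,0,-100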

On a personal note, I've been redoing the overclocking of my 1050, which failed catastrophically when I started running 1.18 work at the same settings I had used for months, and was completely healed when I dropped both core and memory clocks considerably.  As I have inched back up, the evolving picture is that on my card, for 1.18, productivity is stunningly insensitive to memory clock rate, but about as sensitive to core clock rate as I might have expected.  Also, of the two, it currently appears that for my card it is the memory clock which needs to be lowered appreciably compared to previously successful settings.

However, the 1050 is built in a different fab on a different production process from the other Pascal GPU chips, and quite likely has differing sensitivities to this application.

Good luck, and please share your observations.

 

 

Mike
Joined: 17 Jan 17
Posts: 7
Credit: 256649329
RAC: 0


archae86 wrote:
Mike_279 wrote:

Once the current Einstein GPU units have finished executing, I'll calculate the % failed, drop the GPU clock offset from 0 to (say -300mhz? is that reasonable?  I'm new to GPU overclocking), then repeat with a new batch.  

 

When I say this, I always say clocks not clock.  At least on the Nvidia cards I've played with, there generally are two, often called Core clock and Memory clock.

300 is a big drop for the core clock (which I assume is the one you are speaking of).  If many units work and many don't, and if it is actually a clock related issue, going -100 should show an appreciable change.

But if the real problem is the memory clock, that drop in core clock will just lower your productivity.  If you like to make single changes, take data, and make a disciplined comparison, I suggest lowering core and memory clocks by 100 each.

 ...

 

 

Yes, it was the core clock I was talking about.  I'll try dropping the core clock first, then for the next round I'll reduce the memory clock with the core clock back at factory, etc.  That way I can see which of the two affects it.  Each round will probably take a few days because I usually run with a big buffer (... and a bigger sample size is better anyway).  I've just set all my other projects to 'no more tasks' to make it quicker.

I had 3 more failures before my internet came back, but from the look of things I haven't actually had any in the last couple of days.  If I don't get any more failures in the current batch it'll be a pain, because this is supposed to be my control group.  The only difference is the fan curve, but I sort of doubt that was the issue.  Unfortunately I didn't realise that Einstein had both CPU and GPU applications, so I now have a big bucket of both and it'll take longer to work out the proportion of failures, lol.  I've now set it to download GPU work only.

I'm using the clock numbers that appear in the tool the card manufacturer supplied ('Thunder Master', which I've never heard of before), as per the images below.  These are all factory settings except for the stronger fan curve.

 

Full album

http://imgur.com/a/jJYtq

 

Factory (except fan curve) - GPU idle  

 

Factory - Einstein running

 

Factory - Einstein running, a bit later

 

Factory - Running FurMark stress test

 

archae86
Joined: 6 Dec 05
Posts: 3157
Credit: 7226431595
RAC: 1077073


Mike_279 wrote:
The only difference is the fan curve, but I sort of doubt that was the issue.  

If I understand the fan change you made, you are getting quite a lot more cooling when temperature levels are moderate, so running the card cooler.  That would likely nudge up the critical failure clock speeds (both of them), thus improving your behavior at what was a balanced-on-the-knife-edge operating point.

In my experience the transition region between almost all working to almost all failing is pretty narrow, though there can be a tail of a few failures a ways beyond.

Your tool is reporting memory speed on the middle of the three scales.  I revise my memory clock offset suggestion to -200 (as the -100 advice was specific to the next lower reporting scale, as used, for example, by GPU-Z).

Thanks for reporting, and I hope you learn something.  Sadly that means I have to hope you get some more failures.

 

Mike
Joined: 17 Jan 17
Posts: 7
Credit: 256649329
RAC: 0


archae86 wrote:
Mike_279 wrote:
The only difference is the fan curve, but I sort of doubt that was the issue.  

If I understand the fan change you made, you are getting quite a lot more cooling when temperature levels are moderate, so running the card cooler.  ...

Yes, exactly - the default fan profile was zero fan until 50C, and then quite a sharp ramp up from there, so I gave it a more gradual start earlier.

If I don't get any more failures I might actually ramp up each of the clocks a little bit in turn to see which one of the two was triggering it.  Then drop that one down to give more headroom (ambient temp is going to be higher in the summer so I want a safety margin).

Initially I was actually expecting the fan change to cause *more* errors because the card automatically started boosting the core clock higher due to the lower temp (1985 in the second screenshot), but after a while it seemed to stabilise a little lower.

 

Mike
Joined: 17 Jan 17
Posts: 7
Credit: 256649329
RAC: 0


OK, looks like the 'control' GPU tasks have all gone through now.

 

180 GPU tasks total

132 GPU tasks validated (73%, of which ~2/3 were from after the fan curve change)

30 GPU tasks pending validation (17%, about half after the fan curve change)

17 GPU tasks errored (9.4%) - (only 1 of which happened after the fan curve change)

1 GPU task invalid   https://einsteinathome.org/task/603674989

So after the fan curve change the error rate was ~1% (1 error in roughly 104 tasks), and prior to that point it was ~21% (16 errors in roughly 75 tasks).

 

 

21 CPU tasks validated

320 CPU tasks queued

 

1% isn't really enough of an error rate to be useful from an experimental viewpoint, so I think the next test will be +100MHz core clock (1700 to 1800), and -200 on the memory clock (5005 to 4805).  The idea here being to stress the core clock and destress the memory.  Then another batch, swapped around (1600 and 5205). 

 

If those don't offer clues, then I'll reset them back to 1700/5005 and look at potential software clashes instead (e.g., screensaver, a/v scans, ...?)

I've already set up my games as exclusive applications in BOINC.  Looking at the timestamps on the crashes, most of them seem to fall in two clusters, 01:00-03:00 and 14:00-15:00 on the 18th.
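
(In case it helps anyone else reading: the exclusive application list can also be set directly in cc_config.xml in the BOINC data directory.  A minimal example - the game executable name below is just a made-up placeholder:

    <cc_config>
      <options>
        <exclusive_gpu_app>somegame.exe</exclusive_gpu_app>
      </options>
    </cc_config>

BOINC suspends GPU work while any listed executable is running.)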

 

archae86
Joined: 6 Dec 05
Posts: 3157
Credit: 7226431595
RAC: 1077073


Mike_279 wrote:
1% isn't really enough of an error rate to be useful from an experimental viewpoint, so I think the next test will be +100MHz core clock (1700 to 1800), and -200 on the memory clock (5005 to 4805).  The idea here being to stress the core clock and destress the memory.  Then another batch, swapped around (1600 and 5205).

 

Agreed on all counts.  Those offsets should be plenty to expose a clock rate issue if it was ever part of the problem (or even if you are just pretty near the edge).

I should warn that the manifestations of excess clock rate vary quite a bit.  Just to list some I have personally observed in circumstances which persuaded me clock rate was the primary issue:

1. normal completion in normal time, but declared invalid by comparison to the quorum partner.
2. abnormal early termination, with an explicit error status declared to BOINC locally.
3. early termination with error as in number 2, but with the extra feature that all subsequent GPU tasks error out very quickly (say 12 seconds elapsed) until reboot, even if the clock rate is reduced.
4. downclocking of the card, in both core and memory clocks, to something like 200 to 400, but actually continuing to produce correct results, just with greatly increased elapsed time.
5. downclocking as in 4, but with the extra feature that the downclocking did not resolve even on a cold-iron reboot.  I thought the card was dead to me, and bought a new one, only to find it revived when I did a full clean driver uninstall/reinstall.
6. a simple direct crash of the PC, with no error messages left behind or other clues.

I'm not trying to scare you, just to emphasize the variety of symptoms (there may well be other syndromes I've forgotten, or just not personally experienced).

 
