nvidia platform jobs immediately error out (except two)

Anonymous
Topic 212422

For the PC involved I have a work cache setting of 0.5 days plus an additional 0.5 days.  About 13 jobs download and 3 begin processing.  The 3rd job errors out, followed by another and then another, until all remaining jobs have errored out other than the first two, which continue to process.  I have then reached my daily quota.  When I look at one of the jobs that resulted in a computation error I see the following:

--------cut here -------

Stderr output

<core_client_version>7.6.31</core_client_version>
<![CDATA[
<message>
process exited with code 65 (0x41, -191)
</message>
<stderr_txt>
18:39:53 (20803): [normal]: This Einstein@home App was built at: Feb 15 2017 10:50:14

18:39:53 (20803): [normal]: Start of BOINC application '../../projects/einstein.phys.uwm.edu/hsgamma_FGRPB1G_1.20_x86_64-pc-linux-gnu__FGRPopencl1K-nvidia'.
18:39:53 (20803): [debug]: 1e+16 fp, 2.9e+09 fp/s, 3650182 s, 1013h56m22s33
18:39:53 (20803): [normal]: % CPU usage: 1.000000, GPU usage: 0.330000
command line: ../../projects/einstein.phys.uwm.edu/hsgamma_FGRPB1G_1.20_x86_64-pc-linux-gnu__FGRPopencl1K-nvidia --inputfile ../../projects/einstein.phys.uwm.edu/LATeah0049L.dat --alpha 4.42281478648 --delta -0.0345027837249 --skyRadius 2.152570e-06 --ldiBins 15 --f0start 1196.0 --f0Band 8.0 --firstSkyPoint 0 --numSkyPoints 1 --f1dot -1e-13 --f1dotBand 1e-13 --df1dot 3.344368011e-15 --ephemdir ../../projects/einstein.phys.uwm.edu/JPLEPH --Tcoh 2097152.0 --toplist 10 --cohFollow 10 --numCells 1 --useWeights 1 --Srefinement 1 --CohSkyRef 1 --cohfullskybox 1 --mmfu 0.1 --reftime 56100 --model 0 --f0orbit 0.005 --mismatch 0.1 --demodbinary 1 --BinaryPointFile ../../projects/einstein.phys.uwm.edu/templates_LATeah0049L_1204_14997250.dat --debug 1 --device 0 -o LATeah0049L_1204.0_0_0.0_14997250_1_0.out
output files: 'LATeah0049L_1204.0_0_0.0_14997250_1_0.out' '../../projects/einstein.phys.uwm.edu/LATeah0049L_1204.0_0_0.0_14997250_1_0' 'LATeah0049L_1204.0_0_0.0_14997250_1_0.out.cohfu' '../../projects/einstein.phys.uwm.edu/LATeah0049L_1204.0_0_0.0_14997250_1_1'
18:39:53 (20803): [debug]: Flags: X64 SSE SSE2 GNUC X86 GNUX86
18:39:53 (20803): [debug]: glibc version/release: 2.23/stable
18:39:53 (20803): [debug]: Set up communication with graphics process.
boinc_get_opencl_ids returned [0x2d02ca0 , 0x2d02bd0]
Using OpenCL platform provided by: NVIDIA Corporation
Using OpenCL device "GeForce GTX 770" by: NVIDIA Corporation
Max allocation limit: 522747904
Global mem size: 2090991616
OpenCL device has FP64 support
% Opening inputfile: ../../projects/einstein.phys.uwm.edu/LATeah0049L.dat
% Total amount of photon times: 30007
% Preparing toplist of length: 10
% Read 1255 binary points
read_checkpoint(): Couldn't open file 'LATeah0049L_1204.0_0_0.0_14997250_1_0.out.cpt': No such file or directory (2)
% fft_size: 16777216 (0x1000000); alloc: 67108872
% Sky point 1/1
% Binary point 1/1255
% Creating FFT plan.
% fft length: 16777216 (0x1000000)
% Scratch buffer size: 136314880
% Starting semicoherent search over f0 and f1.
% nf1dots: 31 df1dot: 3.344368011e-15 f1dot_start: -1e-13 f1dot_band: 1e-13
% Filling array of photon pairs
ERROR: /home/bema/fermilat/src/bridge_fft_clfft.c:1237: kernel setupPhotonPairsArray failed. status=-4
An error has occured in opencl_setup_photon_pairs_array
Error in OpenCL context: CL_MEM_OBJECT_ALLOCATION_FAILURE error executing CL_COMMAND_NDRANGE_KERNEL on GeForce GTX 770 (Device 0).

18:40:04 (20803): [CRITICAL]: ERROR: MAIN() returned with error '1'
FPU status flags:
mv: cannot stat 'LATeah0049L_1204.0_0_0.0_14997250_1_0.out': No such file or directory
mv: cannot stat 'LATeah0049L_1204.0_0_0.0_14997250_1_0.out': No such file or directory
mv: cannot stat 'LATeah0049L_1204.0_0_0.0_14997250_1_0.out': No such file or directory
mv: cannot stat 'LATeah0049L_1204.0_0_0.0_14997250_1_0.out': No such file or directory
mv: cannot stat 'LATeah0049L_1204.0_0_0.0_14997250_1_0.out': No such file or directory
mv: cannot stat 'LATeah0049L_1204.0_0_0.0_14997250_1_0.out.cohfu': No such file or directory
mv: cannot stat 'LATeah0049L_1204.0_0_0.0_14997250_1_0.out.cohfu': No such file or directory
mv: cannot stat 'LATeah0049L_1204.0_0_0.0_14997250_1_0.out.cohfu': No such file or directory
mv: cannot stat 'LATeah0049L_1204.0_0_0.0_14997250_1_0.out.cohfu': No such file or directory
mv: cannot stat 'LATeah0049L_1204.0_0_0.0_14997250_1_0.out.cohfu': No such file or directory
mv: cannot stat 'LATeah0049L_1204.0_0_0.0_14997250_1_0.out.cohfu': No such file or directory
mv: cannot stat 'LATeah0049L_1204.0_0_0.0_14997250_1_0.out.cohfu': No such file or directory
18:40:16 (20803): [normal]: done. calling boinc_finish(65).
18:40:16 (20803): called boinc_finish
Warning: Program terminating, but clFFT resources not freed. Please consider explicitly calling clfftTeardown( ).

</stderr_txt>
]]>

--------cut here -------

 

Why would two WUs process correctly, i.e. not error out, while the remainder of those downloaded result in errors?

Any ideas?

mikey
mikey
Joined: 22 Jan 05
Posts: 12787
Credit: 1874602561
RAC: 1833512

Are you running 3 workunits

Are you running 3 workunits at the same time? If so, rename your app_config file so you only run one at a time and see if that helps. Also be sure to leave one CPU core free for the GPU to use... I know, I know, back to basics, but at least it's a starting point for figuring out what works and what doesn't, because your current setup isn't working. If you were running Windows it would involve a reboot too, as that's the only way to reset the GPU there; I know it can be done in Linux without a reboot, but I don't know how.
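
In case a concrete example helps, a minimal app_config.xml for running just one Einstein GPU task at a time (with a whole CPU core reserved per task) would look roughly like the sketch below. The app name is the one shown in the stderr output above, so adjust it if your tasks run a different application; the file goes in the Einstein project directory under the BOINC data directory.

<app_config>
   <app>
      <name>hsgamma_FGRPB1G</name>
      <gpu_versions>
         <gpu_usage>1.0</gpu_usage>
         <cpu_usage>1.0</cpu_usage>
      </gpu_versions>
   </app>
</app_config>

After saving it, 'Options -> Read config files' in BOINC Manager (or restarting the client) should pick it up; setting gpu_usage back to 0.33 would return you to three tasks at a time.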

Anonymous

mikey wrote:Are you running 3

mikey wrote:
Are you running 3 workunits at the same time? If so, rename your app_config file so you only run one at a time and see if that helps. Also be sure to leave one CPU core free for the GPU to use... I know, I know, back to basics, but at least it's a starting point for figuring out what works and what doesn't, because your current setup isn't working. If you were running Windows it would involve a reboot too, as that's the only way to reset the GPU there; I know it can be done in Linux without a reboot, but I don't know how.

I am running 3 GPU WUs, but this is defined through the website "project" interface.  I have no app_config file in /var/lib/boinc-client/projects/einstein.phys.uwm.edu.  I tried resetting the project and I receive the same two GPU WUs that were crunching normally prior to the reset.  No other units come in.  I know that I have done something, but what?

I did a boinc-client restart and a reboot.  No improvement.  

Bad robl, bad, bad!

 

Jonathan Jeckell
Jonathan Jeckell
Joined: 11 Nov 04
Posts: 114
Credit: 1342006519
RAC: 631

I doubt this is it, but I

I doubt this is it, but I can't tell whether either of your NVIDIA machines is a laptop.  My laptop with an NVIDIA GPU started barfing when my battery went bad and the CPU slowed to 45% clock speed.

 

Even if it's not a laptop, perhaps you are pushing the edge of your power supply?  Have you used GPU-Z before?  That should tell you about the bottleneck your GPU is running into.

Logforme
Logforme
Joined: 13 Aug 10
Posts: 332
Credit: 1714373961
RAC: 0

If I read your log correctly

If I read your log correctly the first error is: CL_MEM_OBJECT_ALLOCATION_FAILURE

To me this indicates a memory issue. How much memory is available/used on your GPU? Are you running anything other than E@H on the GPU? Running just 1 task at a time, as Mikey suggested, should work if the GPU is running out of memory.
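
For the "how much memory" question on Linux, one quick way is to watch nvidia-smi while BOINC starts the tasks. Something like the throwaway sketch below (mine, not part of BOINC or E@H; it assumes the driver's nvidia-smi utility is installed) prints the usage every ten seconds:

#!/usr/bin/env python3
# Throwaway sketch: poll nvidia-smi and print how much VRAM each GPU is using,
# so you can watch what happens as the 2nd and 3rd E@H tasks start up.
# Assumes the NVIDIA driver's nvidia-smi tool is on the PATH.
import subprocess
import time

QUERY = ["nvidia-smi",
         "--query-gpu=index,memory.used,memory.total",
         "--format=csv,noheader,nounits"]

def vram_usage():
    """Return a list of (gpu_index, used_MiB, total_MiB) tuples."""
    out = subprocess.check_output(QUERY, text=True)
    rows = []
    for line in out.strip().splitlines():
        idx, used, total = (field.strip() for field in line.split(","))
        rows.append((int(idx), int(used), int(total)))
    return rows

if __name__ == "__main__":
    while True:
        for idx, used, total in vram_usage():
            print(f"GPU {idx}: {used} / {total} MiB used")
        time.sleep(10)

Start it before the tasks resume and you should see the used figure jump as each task initialises; if it is already close to the ~2 GB total when the third one starts, that matches the CL_MEM_OBJECT_ALLOCATION_FAILURE in the log.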

Gary Roberts
Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5875
Credit: 118471748215
RAC: 25981134

I agree with Logforme. You

I agree with Logforme.

You have highlighted a lot of stuff in the stderr output, but most of it is pretty normal from what I've observed whenever I've looked at this before.

18:39:53 (20803): [normal]: % CPU usage: 1.000000, GPU usage: 0.330000

This shows you are using a GPU utilization factor of 0.33.  Did you just jump straight in at that level or did you get some performance data at x1 and then x2 before trying x3?  Pretty brave if you just jumped straight in at x3 :-).

read_checkpoint(): Couldn't open file 'LATeah0049L_1204.0_0_0.0_14997250_1_0.out.cpt': No such file or directory (2)

This is seen every time the app starts.  The app is designed to look for a checkpoint in case it is being re-started on a partially crunched task.  If it can't find one it just tells you that a checkpoint couldn't be found.  Just info and not an error.

Error in OpenCL context: CL_MEM_OBJECT_ALLOCATION_FAILURE

This is the actual error I believe.  There was a problem allocating sufficient memory to allow the third task to run.  It seems that when the first attempt to load a third task failed, the system just kept trying again and again until there were no more left in your work cache.  I've actually seen this happen before and there is a bit of a delay - maybe 15 to 20 secs between each task fail - certainly enough time to pull the plug on BOINC before it chews through the whole cache :-).  So it's not a matter of two good tasks and all the rest bad.  It seems that all tasks will process correctly if you just limit them to two at a time.
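
To put rough numbers on that, here is a quick sketch using only the figures that appear in the stderr above. It is only a lower bound: the clFFT plan work area and the photon-pair buffers are allocated later and aren't itemised in the log, so the real per-task footprint is larger.

# Back-of-the-envelope numbers taken from the stderr above (a sketch only).
MiB = 1024 * 1024

global_mem = 2090991616   # "Global mem size" reported for the GTX 770
max_alloc  = 522747904    # "Max allocation limit" (a quarter of global mem)
fft_alloc  = 67108872     # "% fft_size ... alloc:"
scratch    = 136314880    # "% Scratch buffer size:"

print(f"card VRAM        : {global_mem / MiB:7.1f} MiB")
print(f"max single alloc : {max_alloc / MiB:7.1f} MiB")

itemised_per_task = fft_alloc + scratch
for n in (1, 2, 3):
    print(f"{n} task(s), itemised buffers only: {n * itemised_per_task / MiB:6.1f} MiB")

Even the two itemised buffers come to about 194 MiB per task on a card reporting roughly 1994 MiB in total, and the CL_MEM_OBJECT_ALLOCATION_FAILURE on the NDRange kernel is consistent with the third task simply finding no room left once the un-itemised buffers and whatever else is resident on the card are added on top.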

When you complete the two remaining tasks and report them, you'll start to get your cache back.  If you had 13 to start with and 11 failed, surely that's not enough to completely wipe out your daily allowance?  Completing two will quadruple whatever allowance you had left.  If you happen to be waiting for a new 'day' to tick over, force an update to see if the quadrupled limit will allow you to get some more - you did complete those last two, didn't you :-).

The multiple lines of the form

mv: cannot stat 'LATeah0049L_1204.0_0_0.0_14997250_1_0.out': No such file or directory

are not any indication of an additional problem.  If a task fails right at the start, there will be no normal results to package up and send back to the project.  In some failure modes there may be partial results so I guess the system tries to find any output files that might have been produced before the failure.

The very final message about clFFT resources not being freed is always seen at the end irrespective of whether a task has failed or not.  I've often wondered if there is any significance to it.  I figured the programmer would have done what was recommended (called clfftTeardown() explicitly) if it was really needed.

However, now that I'm successfully rebooting machines with Polaris GPUs after 25 days of uptime, so as not to have the GPU decide around the 26-27 day mark to suddenly stop crunching and go AWOL, I cannot help continuing to wonder if this bears any relationship to what I see.  Seeing references to resources not freed always makes me a bit nervous :-).

 

Cheers,
Gary.

Anonymous

Gary Roberts wrote:I agree

Gary Roberts wrote:

I agree with Logforme.

You have highlighted a lot of stuff in the stderr output, but most of it is pretty normal from what I've observed whenever I've looked at this before.

18:39:53 (20803): [normal]: % CPU usage: 1.000000, GPU usage: 0.330000

This shows you are using a GPU utilization factor of 0.33.  Did you just jump straight in at that level or did you get some performance data at x1 and then x2 before trying x3?  Pretty brave if you just jumped straight in at x3 :-).

read_checkpoint(): Couldn't open file 'LATeah0049L_1204.0_0_0.0_14997250_1_0.out.cpt': No such file or directory (2)

This is seen every time the app starts.  The app is designed to look for a checkpoint in case it is being re-started on a partially crunched task.  If it can't find one it just tells you that a checkpoint couldn't be found.  Just info and not an error.

Error in OpenCL context: CL_MEM_OBJECT_ALLOCATION_FAILURE

This is the actual error I believe.  There was a problem allocating sufficient memory to allow the third task to run.  It seems that when the first attempt to load a third task failed, the system just kept trying again and again until there were no more left in your work cache.  I've actually seen this happen before and there is a bit of a delay - maybe 15 to 20 secs between each task fail - certainly enough time to pull the plug on BOINC before it chews through the whole cache :-).  So it's not a matter of two good tasks and all the rest bad.  It seems that all tasks will process correctly if you just limit them to two at a time.

When you complete the two remaining tasks and report them, you'll start to get your cache back.  If you had 13 to start with and 11 failed, surely that's not enough to completely wipe out your daily allowance?  Completing two will quadruple whatever allowance you had left.  If you happen to be waiting for a new 'day' to tick over, force an update to see if the quadrupled limit will allow you to get some more - you did complete those last two, didn't you :-).

The multiple lines of the form

mv: cannot stat 'LATeah0049L_1204.0_0_0.0_14997250_1_0.out': No such file or directory

are not any indication of an additional problem.  If a task fails right at the start, there will be no normal results to package up and send back to the project.  In some failure modes there may be partial results so I guess the system tries to find any output files that might have been produced before the failure.

The very final message about clFFT resources not being freed is always seen at the end irrespective of whether a task has failed or not.  I've often wondered if there is any significance to it.  I figured the programmer would have done what was recommended (called clfftTeardown() explicitly) if it was really needed.

However, now that I'm successfully rebooting machines with Polaris GPUs after 25 days of uptime, so as not to have the GPU decide around the 26-27 day mark to suddenly stop crunching and go AWOL, I cannot help continuing to wonder if this bears any relationship to what I see.  Seeing references to resources not freed always makes me a bit nervous :-).

 

Gary,

The above is quite informative and yes, I did let the last two complete.  I will allow everything to settle and give it a look tomorrow.  I did change the GPU utilization factor to one as suggested, so I'll see how it goes tomorrow.  And you are correct about the ~20 sec delay before the jobs "error out".  Hopefully it's not a card issue.  The ambient room temp is around 60F at the moment, so I don't believe it's a temperature problem.  But in all fairness I have had the card for a while, so it might be time for a replacement.

Thanks to all who replied.

Anonymous

Logforme wrote:If I read your

Logforme wrote:

If I read your log correctly the first error is: CL_MEM_OBJECT_ALLOCATION_FAILURE

To me this indicates a memory issue. How much memory is available/used on your GPU? Are you running anything other than E@H on the GPU? Running just 1 task at a time, as Mikey suggested, should work if the GPU is running out of memory.

Nothing else is running.  All had been good yesterday until I adjusted the GPU utilization factor down.  I then did a project update and it did not scale down as I expected it would.  As noted in my response to Gary, I am going to let things settle before doing anything else.  Tomorrow is a new day.  Hopefully.

Gary Roberts
Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5875
Credit: 118471748215
RAC: 25981134

robl wrote:... and it did not

robl wrote:
... and it did not scale down as I expected it should.

When you change the GPU utilization factor on the website, that information doesn't get transmitted to the client until some new GPU tasks are sent with the response.  A simple 'update' won't do it.  If the project isn't sending new tasks (until a new 'day' arrives at midnight UTC), your client won't have been told to use anything different from the previous 0.33.
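
If you want to see what the client currently thinks it has been told, the factor shows up (as far as I understand the mechanism) as the coproc count inside the Einstein app_version entries in client_state.xml. A throwaway sketch along these lines would print it; the path assumes a standard Linux package install, and you may need to be in the boinc group to read the file:

#!/usr/bin/env python3
# Throwaway sketch (not a BOINC tool): print the GPU count each Einstein
# FGRP app_version in client_state.xml is configured to use, to check whether
# a changed GPU utilization factor has actually reached the client yet.
import xml.etree.ElementTree as ET

STATE = "/var/lib/boinc-client/client_state.xml"   # standard Linux path

root = ET.parse(STATE).getroot()
for av in root.iter("app_version"):
    name = av.findtext("app_name", default="?")
    if "FGRP" not in name:
        continue
    plan = av.findtext("plan_class", default="")
    coproc = av.find("coproc")
    if coproc is not None:
        gpu_type = coproc.findtext("type", default="?")
        count = coproc.findtext("count", default="?")
        print(f"{name} ({plan}): {count} x {gpu_type} per task")

Until the scheduler actually sends new tasks, that count should still read 0.33.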

As I write this, we are more than 4 hours into a new day already.  Get going man, you're wasting good crunching time :-).

In all seriousness, unless your card has more than 2GB VRAM, there's unlikely to be any reason other than "not enough RAM" for what you saw.

 

Cheers,
Gary.

Jonathan Jeckell
Jonathan Jeckell
Joined: 11 Nov 04
Posts: 114
Credit: 1342006519
RAC: 631

You can verify your GPU's

You can also verify your GPU's bottleneck directly with GPU-Z, so you can confirm whether it's the VRAM or some other factor limiting your performance.

Anonymous

Gary Roberts wrote:robl

Gary Roberts wrote:
robl wrote:
... and it did not scale down as I expected it should.

When you change the GPU utilization factor on the website, that information doesn't get transmitted to the client until some new GPU tasks are sent with the response.  A simple 'update' won't do it.  If the project isn't sending new tasks (until a new 'day' arrives at midnight UTC), your client won't have been told to use anything different from the previous 0.33.

As I write this, we are more than 4 hours into a new day already.  Get going man, you're wasting good crunching time :-).

In all seriousness, unless your card has more than 2GB VRAM, there's unlikely to be any reason other than "not enough RAM" for what you saw.

 

Ah, again most informative.  Now if I can only retain it.  I did follow Mikey's suggestion of going to one GPU WU, and this morning when manually performing an update I received ~24 GPU WUs, with only one processing, as expected.  No WUs were erroring out, so is it now a question of hardware?  I.e., if I change the GPU utilization factor back so that 3 WUs run, will two GPU WUs process while the rest error out, as in the original post?  For the moment I think I will defer trying that for a couple of days.  I have one more issue to work on with another PC, which I managed to "fat finger" yesterday; it was a major SNAFU.
