GPU Units Erroring Out - Please Help

Cherokee150
Joined: 13 May 11
Posts: 23
Credit: 288,584,232
RAC: 497,009
Topic 224513

Please help!
My fastest computer's GPU units have all started ending in errors approximately 12 clock seconds after starting.

This computer normally earns 350,000 or more credits per day, so I really want to get this fixed right away. The computer ID is 4213062. It has one NVIDIA GeForce GTX 1070 (4095MB), driver: 390.65.

Here is one of the failed GPU task reports:

----------------------------------------------------------------
Task 1058393606
Name: LATeah2049Laf_92.0_0_0.0_1821827_0
Workunit ID: 517834397
Created: 13 Jan 2021 22:17:27 UTC
Sent: 13 Jan 2021 23:56:04 UTC
Report deadline: 27 Jan 2021 23:56:04 UTC
Received: 15 Jan 2021 23:31:34 UTC
Server state: Over
Outcome: Computation error
Client state: Compute error
Exit status: 69 (0x00000045) Unknown error code
Computer: 4213062
Run time (sec): 13.17
CPU time (sec): 0.38
Peak working set size (MB): 22.78
Peak swap size (MB): 15.72
Peak disk usage (MB): 0.01
Validation state: Invalid
Granted credit: 0
Application: Gamma-ray pulsar binary search #1 on GPUs v1.22 (FGRPopencl-nvidia) windows_x86_64
Stderr output

7.14.2

<message>
The network BIOS session limit was exceeded. (0x45) - exit code 69 (0x45)
</message>
<stderr_txt>
12:31:24 (10048): [normal]: This Einstein@home App was built at: May 8 2019 13:29:27
12:31:24 (10048): [normal]: Start of BOINC application 'projects/einstein.phys.uwm.edu/hsgamma_FGRPB1G_1.22_windows_x86_64__FGRPopencl-nvidia.exe'.
12:31:24 (10048): [debug]: 1e+016 fp, 3.1e+009 fp/s, 3397983 s, 943h53m02s75
12:31:24 (10048): [normal]: % CPU usage: 1.000000, GPU usage: 0.500000
command line: projects/einstein.phys.uwm.edu/hsgamma_FGRPB1G_1.22_windows_x86_64__FGRPopencl-nvidia.exe --inputfile ../../projects/einstein.phys.uwm.edu/LATeah2049Laf.dat --alpha 1.41058464281 --delta -0.444366280137 --skyRadius 5.090540e-07 --ldiBins 30 --f0start 84.0 --f0Band 8.0 --firstSkyPoint 0 --numSkyPoints 1 --f1dot -1e-13 --f1dotBand 1e-13 --df1dot 2.512676418e-15 --ephemdir ..\..\projects\einstein.phys.uwm.edu\JPLEPH --Tcoh 2097152.0 --toplist 10 --cohFollow 10 --numCells 1 --useWeights 1 --Srefinement 1 --CohSkyRef 1 --cohfullskybox 1 --mmfu 0.1 --reftime 56100 --model 0 --f0orbit 0.005 --mismatch 0.1 --demodbinary 1 --BinaryPointFile ../../projects/einstein.phys.uwm.edu/templates_LATeah2049Laf_0092_1821827.dat --debug 0 --device 0 -o LATeah2049Laf_92.0_0_0.0_1821827_0_0.out
output files: 'LATeah2049Laf_92.0_0_0.0_1821827_0_0.out' '../../projects/einstein.phys.uwm.edu/LATeah2049Laf_92.0_0_0.0_1821827_0_0' 'LATeah2049Laf_92.0_0_0.0_1821827_0_0.out.cohfu' '../../projects/einstein.phys.uwm.edu/LATeah2049Laf_92.0_0_0.0_1821827_0_1'
12:31:24 (10048): [debug]: Flags: X64 SSE SSE2 GNUC X86 GNUX86
12:31:24 (10048): [debug]: Set up communication with graphics process.
boinc_get_opencl_ids returned [0000000003e6a090 , 0000000003e6a4f0]
Using OpenCL platform provided by: NVIDIA Corporation
Using OpenCL device "GeForce GTX 1070" by: NVIDIA Corporation
Max allocation limit: 2147483648
Global mem size: 0
Couldn't create OpenCL context (error: 999)!
initialize_ocl returned error [2007]
OCL context null
OCL queue null
Error generating generic FFT context object [5]
12:31:25 (10048): [CRITICAL]: ERROR: MAIN() returned with error '5'
FPU status flags:
12:31:36 (10048): [normal]: done. calling boinc_finish(69).
12:31:36 (10048): called boinc_finish
</stderr_txt>

Keith Myers
Joined: 11 Feb 11
Posts: 1,320
Credit: 2,427,475,885
RAC: 7,601,004

You lost your video driver's compute component.

Couldn't create OpenCL context (error: 999)!

initialize_ocl returned error [2007]

Reload your drivers or restart the computer.
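
For anyone who wants to check a pile of failed results for the same symptom, here is a minimal sketch in Python. It only searches a pasted stderr log for the lines quoted above. The function name `find_opencl_init_errors` and the marker list are made up for illustration, and treating these strings as signs of a failed OpenCL start-up is an assumption based on this one log, not anything documented by the project.

```python
import re

# Marker strings copied from the stderr output quoted in this thread.
# Reading them as "OpenCL failed to initialise" is an assumption.
OPENCL_INIT_MARKERS = (
    "Couldn't create OpenCL context",
    "initialize_ocl returned error",
    "OCL context null",
)


def find_opencl_init_errors(stderr_text):
    """Return the stderr lines that contain one of the markers above."""
    # Pasted logs may carry <br /> tags, so split on those as well as on newlines.
    lines = re.split(r"<br\s*/?>|\n", stderr_text)
    return [ln.strip() for ln in lines if any(m in ln for m in OPENCL_INIT_MARKERS)]


sample = (
    "Global mem size: 0<br /> Couldn't create OpenCL context (error: 999)!<br /> "
    "initialize_ocl returned error [2007]<br /> OCL context null<br /> OCL queue null"
)
for hit in find_opencl_init_errors(sample):
    print(hit)
```

If the list comes back empty for a failed task, that task probably died for some other reason and is worth reading by hand.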

 

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5,493
Credit: 63,818,092,809
RAC: 53,657,564

Cherokee150 wrote:
Please help! My fastest computer's GPU units have all started ending in errors approximately 12 clock seconds after starting.

I don't think this is anything to do with missing OpenCL compute libs for the GPU.

If you look at your tasks list on the website, you will see there are compute errors going back to an earlier time.  There were 12 between 8th and 13th Jan where tasks were partially completed before failure, so the proper libs must have been in place for crunching even to start.  Then, all of a sudden at 23:31:34 UTC on Jan 15, the failed results started pouring in as each task tried to start - 337 of them (all 12s run time) from that time to 01:14:35 UTC on Jan 16 - less than 2 hrs in total.
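
The distinction between tasks that crunched for a while before failing and tasks that died as soon as they started can be sketched roughly as below. This is only an illustration: the 15 second cutoff is an assumed value chosen because the start-up failures here ran for about 12 to 13 seconds, the name `classify_failures` is hypothetical, and the run times in `example` are invented rather than taken from the real task list.

```python
from collections import Counter

# Assumed cutoff: the start-up failures in this thread ran for roughly 12-13 s.
STARTUP_LIMIT_SEC = 15.0


def classify_failures(run_times_sec, limit=STARTUP_LIMIT_SEC):
    """Count failed tasks by whether they died at start-up or part-way through."""
    counts = Counter()
    for t in run_times_sec:
        label = "failed at start-up" if t <= limit else "failed mid-run"
        counts[label] += 1
    return counts


# Invented run times for illustration only, not the real task list.
example = [13.17, 12.4, 845.0, 12.9, 1310.5]
print(classify_failures(example))
```

A handful of mid-run failures followed by a flood of start-up failures is the pattern to look for.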

To me, you would seem to have a developing hardware issue that probably started on 8th Jan (or even earlier).  Likely candidates are a deteriorating PSU that isn't delivering good clean power to the GPU, or issues with the voltage regulator circuitry on the GPU itself.

Another possibility is improper cooling of the GPU due to dust/fluff buildup on the heatsink or perhaps progressively failing cooling fans.  You should give your machine a thorough inspection and cleaning.  Heat or voltage related problems usually start like this and gradually get worse until there is a catastrophic failure of all tasks in your work cache.  You seem to have experienced just that.

If a cooling issue isn't detected, I'd try swapping to a known good PSU with a sufficient 12V power rating.  If the problem remains, it may be the GPU itself that has failed.  Good luck with finding and fixing the problem!

Cheers,
Gary.

Cherokee150
Joined: 13 May 11
Posts: 23
Credit: 288,584,232
RAC: 497,009

Thank you, Gary and Keith.

I will try everything, as I really can't afford to lose this computer's processing power for Einstein.

I suspect (and hope) that it might just need cleaning, as I haven't gotten around to that for a little longer than usual.

I hope I will be back to normal soon.  If, however, the GPU is failing, it may be quite some time before I can afford a replacement.

We can always count on both of you, along with a number of others, to provide wonderful assistance any time we have a problem.  Thank you again!!

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5,493
Credit: 63,818,092,809
RAC: 53,657,564

Cherokee150 wrote:
Thank you again!!

You're most welcome!

Hopefully it's just an excess heat issue that a good clean will fix.  Check that the fans on the GPU spin completely freely.  Make sure they 'free-wheel' for the proper time when spun up.  Good luck with sorting it out.

Cheers,
Gary.
