All WUs on RTX 2070 Error Out

CElliott
Joined: 9 Feb 05
Posts: 28
Credit: 997463152
RAC: 482514
Topic 216786

I have been successfully processing E@H work units for many years, but only for the last week this season.  I have been doing almost 200 per day with a GTX 1070 and a GTX 770 and Nvidia driver 416.34.

I just installed a new RTX 2070 video card.  Twenty-five of 25 attempted work units have experienced an error.  The screen flickers after a few seconds, goes dark, recovers, and the WU is toast.

The error in the event log is "Display driver nvlddmkm stopped responding and has successfully recovered."  The Stderr output on this website says
"The printer is out of paper.   (0x1c) - exit code 28 (0x1c)

...

ERROR: /home/bema/fermilat/src/bridge_fft_clfft.c:948: clFinish failed. status=-36
ERROR: opencl_ts_2_phase_diff_sorted() returned with error 85342400
15:13:00 (1388): [CRITICAL]: ERROR: MAIN() returned with error '-36'
FPU status flags:  PRECISION
15:13:12 (1388): [normal]: done. calling boinc_finish(28).
15:13:12 (1388): called boinc_finish

The "printer is out of paper" message does not appear on successfully completed work units.  Does anyone know of a computer on this website with an RTX 2070 that has successfully processed an Einstein@Home 1.20 Gamma-ray pulsar binary search #1 on GPUs (FGRPopencl1K-nvidia) work unit?

Does anyone have any other ideas about what could be wrong? 

Keith Myers
Joined: 11 Feb 11
Posts: 4963
Credit: 18702649643
RAC: 6280739

Please read through the https://einsteinathome.org/content/pascal-again-available-turing-may-be-coming-soon thread, especially the last two weeks of posts.  Several Turing users have documented failures when trying to process "high-pay" tasks.  "Low-pay" tasks seem to run fine.  The mix of work is turning to "low-pay" tasks, so you should be able to process those on your Turing card.

 

Jim1348
Joined: 19 Jan 06
Posts: 463
Credit: 257957147
RAC: 0

CElliott wrote:
The Stderr output on this website says
"The printer is out of paper.   (0x1c) - exit code 28 (0x1c)

I just started seeing that too, though it appears for entirely different reasons on my GTX 1060 (Win7 64-bit).

I am tempted to say that they should add more paper.

<core_client_version>7.14.2</core_client_version>
<![CDATA[
<message>
The printer is out of paper.
 (0x1c) - exit code 28 (0x1c)</message>
<stderr_txt>
14:50:10 (3004): [normal]: This Einstein@home App was built at: Feb 15 2017 09:23:49

14:50:10 (3004): [normal]: Start of BOINC application 'projects/einstein.phys.uwm.edu/hsgamma_FGRPB1G_1.20_windows_x86_64__FGRPopencl1K-nvidia.exe'.
14:50:10 (3004): [debug]: 1.1e+016 fp, 4.2e+009 fp/s, 2485530 s, 690h25m29s55
14:50:10 (3004): [normal]: % CPU usage: 1.000000, GPU usage: 1.000000
command line: projects/einstein.phys.uwm.edu/hsgamma_FGRPB1G_1.20_windows_x86_64__FGRPopencl1K-nvidia.exe --inputfile ../../projects/einstein.phys.uwm.edu/LATeah1025L.dat --alpha 1.41058464281 --delta -0.444366280137 --skyRadius 5.090540e-07 --ldiBins 30 --f0start 92.0 --f0Band 8.0 --firstSkyPoint 0 --numSkyPoints 1 --f1dot -1e-13 --f1dotBand 1e-13 --df1dot 2.512676418e-15 --ephemdir ..\..\projects\einstein.phys.uwm.edu\JPLEPH --Tcoh 2097152.0 --toplist 10 --cohFollow 10 --numCells 1 --useWeights 1 --Srefinement 1 --CohSkyRef 1 --cohfullskybox 1 --mmfu 0.1 --reftime 56100 --model 0 --f0orbit 0.005 --mismatch 0.1 --demodbinary 1 --BinaryPointFile ../../projects/einstein.phys.uwm.edu/templates_LATeah1025L_0100_7122577.dat --debug 1 --device 0 -o LATeah1025L_100.0_0_0.0_7122577_1_0.out
output files: 'LATeah1025L_100.0_0_0.0_7122577_1_0.out' '../../projects/einstein.phys.uwm.edu/LATeah1025L_100.0_0_0.0_7122577_1_0' 'LATeah1025L_100.0_0_0.0_7122577_1_0.out.cohfu' '../../projects/einstein.phys.uwm.edu/LATeah1025L_100.0_0_0.0_7122577_1_1'
14:50:10 (3004): [debug]: Flags: X64 SSE SSE2 GNUC X86 GNUX86
14:50:10 (3004): [debug]: Set up communication with graphics process.
boinc_get_opencl_ids returned [00000000012147F0 , 0000000001214430]
Using OpenCL platform provided by: NVIDIA Corporation
Using OpenCL device "GeForce GTX 1060 6GB" by: NVIDIA Corporation
Max allocation limit: 1610612736
Global mem size: 2147483648
OpenCL device has FP64 support
% Opening inputfile: ../../projects/einstein.phys.uwm.edu/LATeah1025L.dat
% Total amount of photon times: 8950
% Preparing toplist of length: 10
% Read 1631 binary points
read_checkpoint(): Couldn't open file 'LATeah1025L_100.0_0_0.0_7122577_1_0.out.cpt': No such file or directory (2)
% fft_size: 16777216 (0x1000000); alloc: 67108872
% Sky point 1/1
% Binary point 1/1631
% Creating FFT plan.
% fft length: 16777216 (0x1000000)
% Scratch buffer size: 136314880
% Starting semicoherent search over f0 and f1.
% nf1dots: 41 df1dot: 2.512676418e-015 f1dot_start: -1e-013 f1dot_band: 1e-013
% Filling array of photon pairs
ERROR: /home/bema/fermilat/src/bridge_fft_clfft.c:948: clFinish failed. status=-36
ERROR: opencl_ts_2_phase_diff_sorted() returned with error 7343555
14:50:15 (3004): [CRITICAL]: ERROR: MAIN() returned with error '-36'
FPU status flags: PRECISION
14:50:26 (3004): [normal]: done. calling boinc_finish(28).
14:50:26 (3004): called boinc_finish

 

https://einsteinathome.org/host/12599270/tasks/6/0
archae86
Joined: 6 Dec 05
Posts: 3157
Credit: 7219114931
RAC: 962489

Thank you for the report. 

CElliott, thank you for the report.  I've spent a lot of time on this issue, and yours is the first 2070 report, which we now add to the consistent reporting on two 2080 cards and two 2080 Ti cards here at Einstein.  Yours is also our first Windows 8 report, which adds to multiple Windows 10 reports and one Windows 7 report.

As it happens, the WUs being issued since a little over half a day ago are of a different file, and will probably work correctly on your 2070 with current driver and software.  As your system has already downloaded a few of these LATeah1029L WUs, you could test this possibility by suspending all of your 104V WUs (and any other 104* units) so the ones likely to work go on ahead.  I would be very interested in this result.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5872
Credit: 117484023805
RAC: 35498079

Jim1348 wrote:
CElliott wrote:
The Stderr output on this website says
"The printer is out of paper.   (0x1c) - exit code 28 (0x1c)

I just started seeing that too, though it appears for entirely different reasons on my GTX 1060 (Win7 64-bit).

I am tempted to say that they should add more paper.

<core_client_version>7.14.2</core_client_version>
<![CDATA[
<message>
The printer is out of paper.
 (0x1c) - exit code 28 (0x1c)</message>

You should just ignore this.  This is Windows taking an exit code that has nothing to do with Windows and misinterpreting it.  The exit code belongs to the app and is probably specific to some OpenCL internal problem within the app itself.  No doubt it would mean something to the Devs, which is why it's being reported back to the project as part of the <stderr_txt> output stream.
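
A rough illustration of the mix-up, as a sketch only: the table below is a hand-copied subset of Win32 system error codes, and `describe_exit_code` is a hypothetical helper, not anything BOINC or Windows actually ships. The app finishes with code 28, and 28 happens to be the number of the Win32 system error ERROR_OUT_OF_PAPER, so a generic code-to-text lookup attaches the printer message to it.

```python
# Hand-copied subset of Win32 system error codes (names as in winerror.h).
WIN32_SYSTEM_ERRORS = {
    2: "The system cannot find the file specified.",  # ERROR_FILE_NOT_FOUND
    5: "Access is denied.",                           # ERROR_ACCESS_DENIED
    28: "The printer is out of paper.",               # ERROR_OUT_OF_PAPER
}


def describe_exit_code(code: int) -> str:
    """Format an exit code roughly the way the task page shows it.

    Hypothetical helper for illustration; the lookup has no way of knowing
    that the number came from the science app and not from Windows.
    """
    text = WIN32_SYSTEM_ERRORS.get(code, "Unknown error.")
    return f"{text} ({code:#x}) - exit code {code} ({code:#x})"


# The app called boinc_finish(28), as the stderr excerpts in this thread show.
print(describe_exit_code(28))
```

The text is just a label for the number 28; no printer is involved anywhere.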

Cheers,
Gary.

Jim1348
Joined: 19 Jan 06
Posts: 463
Credit: 257957147
RAC: 0

Gary Roberts wrote:
You should just ignore this.  This is Windows taking an exit code that is nothing to do with Windows and misinterpreting it.  The exit code belongs to the app and is probably specific to some OpenCL internal problem within the app itself.

Thanks.  I am beginning to think that my card died.  I tried it on Folding, and it did not work there either.  It is a bit strange, since I had just replaced the cooler, and it was running very cool for a few days.  But something else may have gone out.  It will give Nvidia some more business (but not for an RTX).

Mike Hewson
Moderator
Joined: 1 Dec 05
Posts: 6588
Credit: 315258981
RAC: 312710

Presumably this :

" ..... % Starting semicoherent search over f0 and f1.
% nf1dots: 41 df1dot: 2.512676418e-015 f1dot_start: -1e-013 f1dot_band: 1e-013
% Filling array of photon pairs
ERROR: /home/bema/fermilat/src/bridge_fft_clfft.c:948: clFinish failed. status=-36
ERROR: opencl_ts_2_phase_diff_sorted() returned with error 7343555
14:50:15 (3004): [CRITICAL]: ERROR: MAIN() returned with error '-36'
FPU status flags: PRECISION
14:50:26 (3004): [normal]: done. calling boinc_finish(28).
14:50:26 (3004): called boinc_finish"

 ... gives the clue that the OpenCL fast Fourier transform is what is vomiting. There was an almost identical error reported last year by Archae86. Peter, do you remember what conclusion was drawn, if any, from that thread?

I'll take a stab and say that (some of) the current Nvidia drivers aren't supporting OpenCL properly (notwithstanding hardware problems with RTX).

Cheers, Mike.

I have made this letter longer than usual because I lack the time to make it shorter ...

... and my other CPU is a Ryzen 5950X :-) Blaise Pascal

Richard Haselgrove
Joined: 10 Dec 05
Posts: 2143
Credit: 2954963252
RAC: 715567

Peter wrote that report on 16 January 2017, in relation to application version 1.18

Application version 1.20 was released on 16 February 2017

I'll try to find anything which might link the two events.

Edit - no, nothing. Peter reported (in https://einsteinathome.org/content/observations-fgrbp1-118-windows?page=10#comment-154201) that downclocking the cards improved stability with the older application, and also reported a driver restart event - which has also been reported on the RTX cards.

Mike Hewson
Moderator
Joined: 1 Dec 05
Posts: 6588
Credit: 315258981
RAC: 312710

Thank you, Richard. I've found out what the "-36" error really means; see this post:

"... openCL specific and translates to CL_INVALID_COMMAND_QUEUE ...."

At Nvidia there is an explanation of this relating the driver reset to work size, e.g. probably our shorties ("low-pay") vs longies ("high-pay") WUs. It more or less means that the driver restarts if something takes 'too long'. Perhaps this is under developer control, i.e. by changing the timeouts; see TdrDelay and TdrDdiDelay for Windows? :-)))
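
To make those two pieces concrete, here is a minimal sketch, not a fix. It assumes the `status=` value in the stderr line is a plain OpenCL status code (the table is a hand-copied subset from the Khronos headers), and that the Windows timeouts live under the `GraphicsDrivers` registry key with the defaults Microsoft documents (2 s for TdrDelay, 5 s for TdrDdiDelay). The helper names `decode_status` and `read_tdr_settings` are made up for this example.

```python
import re
import sys

# Hand-copied subset of OpenCL status codes (Khronos cl.h).
OPENCL_STATUS = {
    0: "CL_SUCCESS",
    -4: "CL_MEM_OBJECT_ALLOCATION_FAILURE",
    -5: "CL_OUT_OF_RESOURCES",
    -36: "CL_INVALID_COMMAND_QUEUE",
}

# Registry location and default timeouts in seconds, per Microsoft's TDR docs.
TDR_KEY = r"SYSTEM\CurrentControlSet\Control\GraphicsDrivers"
TDR_DEFAULTS = {"TdrDelay": 2, "TdrDdiDelay": 5}


def decode_status(log_line: str) -> str:
    """Return the OpenCL name for a 'status=<n>' value in a stderr line."""
    match = re.search(r"status=(-?\d+)", log_line)
    if match is None:
        return "no status found"
    code = int(match.group(1))
    return OPENCL_STATUS.get(code, f"unlisted status {code}")


def read_tdr_settings() -> dict:
    """Read TdrDelay/TdrDdiDelay, keeping the defaults for anything unset."""
    settings = dict(TDR_DEFAULTS)
    if sys.platform != "win32":
        return settings  # TDR is a Windows mechanism; nothing to read elsewhere
    import winreg
    try:
        with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, TDR_KEY) as key:
            for name in TDR_DEFAULTS:
                try:
                    value, _kind = winreg.QueryValueEx(key, name)
                    settings[name] = int(value)
                except FileNotFoundError:
                    pass  # value not set in the registry, keep the default
    except OSError:
        pass  # key missing or unreadable, keep the defaults
    return settings


line = "ERROR: /home/bema/fermilat/src/bridge_fft_clfft.c:948: clFinish failed. status=-36"
print(decode_status(line))
print(read_tdr_settings())
```

This only reads the settings. Raising the timeouts would at best stop Windows from resetting the driver during a long-running kernel; whether that helps with the failures reported here is untested.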

Cheers, Mike.

( edit ) Begs the question as to how many, if any, are Linux machines with this error variety.

( edit ) E@H can generate large FFTs with long signal integrations, say, 2^22 data points.
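
For a sense of scale, a small worked calculation. The 8 bytes per point is an assumption (single-precision complex values); the stderr output does not say what precision the app uses. The second size, 2^24, is the `fft_size: 16777216` reported in the stderr log posted earlier in this thread.

```python
def fft_buffer_mib(n_points: int, bytes_per_point: int = 8) -> float:
    """Size in MiB of one buffer holding n_points complex values.

    bytes_per_point=8 assumes single-precision complex (two 4-byte floats);
    that is an assumption, not something the log states.
    """
    return n_points * bytes_per_point / 2**20


for exponent in (22, 24):
    n = 2**exponent
    print(f"2^{exponent} = {n} points -> {fft_buffer_mib(n):.0f} MiB per buffer")
```

Under that assumption a single 2^24-point buffer is 128 MiB, so one transform of that size keeps the GPU busy on a lot of data per call.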

( edit ) Of course if there is no paper available then it will hang while waiting for some to be delivered, thus maybe triggering a reset too .... :-)))


Richard Haselgrove
Joined: 10 Dec 05
Posts: 2143
Credit: 2954963252
RAC: 715567

Mike Hewson wrote:
( edit ) Of course if there is no paper available then it will hang while waiting for some to be delivered, thus maybe triggering a reset too .... :-)))

I always make sure to keep a spare roll handy :-)))

Mike Hewson
Moderator
Joined: 1 Dec 05
Posts: 6588
Credit: 315258981
RAC: 312710

LOL. Absorbent hand-towel to clean up after driver messes. ;=}

Cheers, Mike.

( edit ) FWIW : on NVidia cards OpenCL is implemented using CUDA calls underneath and so :

- OpenCL will never beat CUDA

- if the CUDA aspect of a driver is bad then so is the OpenCL aspect

- a single crappy CUDA core may break an OpenCL application, e.g. all the other working cores have completed, bar the bad one, and so wait futilely at the exit barrier of the parallel portion of the code (all need to complete before any proceed beyond it). So a timeout occurs. The crappiness of a given CUDA core may be clock-frequency dependent.
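
The barrier point can be mimicked on the CPU. This is only an analogy with ordinary Python threads, under the assumption that the GPU case behaves like a barrier with a watchdog timeout; it says nothing about how Nvidia's hardware or driver actually implement barriers. `run_workers` and its parameters are made up for this sketch.

```python
import threading
import time
from typing import List, Optional


def run_workers(n_workers: int, slow_worker: Optional[int], timeout: float) -> List[str]:
    """Run n_workers threads that must all meet at one barrier.

    Returns one outcome per worker: "passed" or "timed out".
    """
    barrier = threading.Barrier(n_workers)
    outcomes = ["not run"] * n_workers

    def worker(index: int) -> None:
        if index == slow_worker:
            time.sleep(timeout * 3)  # stand-in for a core that does not finish in time
        try:
            barrier.wait(timeout=timeout)
            outcomes[index] = "passed"
        except threading.BrokenBarrierError:
            # One waiter timing out breaks the barrier for every thread.
            outcomes[index] = "timed out"

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n_workers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return outcomes


print(run_workers(4, slow_worker=None, timeout=0.2))
print(run_workers(4, slow_worker=2, timeout=0.2))
```

With no slow worker every thread passes the barrier; with one straggler the three healthy threads finish their own work and still fail, which is the shape of the failure described in the last bullet.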

