All WUs on RTX 2070 Error Out

CElliott

Joined: 9 Feb 05

Posts: 28

Credit: 997529812

RAC: 476631

31 Oct 2018 19:43:59 UTC

Topic 216786

I have been successfully processing E@H work units for many years, but only for the last week this season. I have been doing almost 200 per day with a GTX 1070 and a GTX 770 and Nvidia driver 416.34.

I just installed a new RTX 2070 video card. Twenty-five of 25 attempted work units have experienced an error. The screen flickers after a few seconds, goes dark, recovers, and the WU is toast.

The error in the event log is "Display driver nvlddmkm stopped responding and has successfully recovered." The Stderr output on this website says
"The printer is out of paper. (0x1c) - exit code 28 (0x1c)"

...

ERROR: /home/bema/fermilat/src/bridge_fft_clfft.c:948: clFinish failed. status=-36
ERROR: opencl_ts_2_phase_diff_sorted() returned with error 85342400
15:13:00 (1388): [CRITICAL]: ERROR: MAIN() returned with error '-36'
FPU status flags: PRECISION
15:13:12 (1388): [normal]: done. calling boinc_finish(28).
15:13:12 (1388): called boinc_finish

The printer out of paper message does not appear on successfully completed work units. Does anyone know of a computer on this website that has an RTX 2070 that has successfully processed an Einstein@Home 1.20 Gamma-ray pulsar binary search #1 on GPUs (FGRPopencl1K-nvidia) work unit?

Does anyone have any other ideas about what could be wrong?

Keith Myers

Joined: 11 Feb 11

Posts: 4963

Credit: 18704105136

RAC: 6280575

31 Oct 2018 20:04:11 UTC

Message 167575

Please read through the https://einsteinathome.org/content/pascal-again-available-turing-may-be-coming-soon thread, especially the last two weeks of posts. Several Turing users have documented failures when trying to process "high-pay" tasks. "Low-pay" tasks seem to run fine. The mix of work is turning to "low-pay" tasks, so you should be able to process those on your Turing card.

Jim1348

Joined: 19 Jan 06

Posts: 463

Credit: 257957147

RAC: 0

31 Oct 2018 20:22:23 UTC

Message 167576

CElliott wrote:

The Stderr output on this website says
"The printer is out of paper. (0x1c) - exit code 28 (0x1c)

I just started seeing that too, though it appears for entirely different reasons on my GTX 1060 (Win7 64-bit).

I am tempted to say that they should add more paper.

<core_client_version>7.14.2</core_client_version>
<![CDATA[
<message>
The printer is out of paper.
 (0x1c) - exit code 28 (0x1c)</message>
<stderr_txt>
14:50:10 (3004): [normal]: This Einstein@home App was built at: Feb 15 2017 09:23:49
14:50:10 (3004): [normal]: Start of BOINC application 'projects/einstein.phys.uwm.edu/hsgamma_FGRPB1G_1.20_windows_x86_64__FGRPopencl1K-nvidia.exe'.
14:50:10 (3004): [debug]: 1.1e+016 fp, 4.2e+009 fp/s, 2485530 s, 690h25m29s55
14:50:10 (3004): [normal]: % CPU usage: 1.000000, GPU usage: 1.000000
command line: projects/einstein.phys.uwm.edu/hsgamma_FGRPB1G_1.20_windows_x86_64__FGRPopencl1K-nvidia.exe --inputfile ../../projects/einstein.phys.uwm.edu/LATeah1025L.dat --alpha 1.41058464281 --delta -0.444366280137 --skyRadius 5.090540e-07 --ldiBins 30 --f0start 92.0 --f0Band 8.0 --firstSkyPoint 0 --numSkyPoints 1 --f1dot -1e-13 --f1dotBand 1e-13 --df1dot 2.512676418e-15 --ephemdir ..\..\projects\einstein.phys.uwm.edu\JPLEPH --Tcoh 2097152.0 --toplist 10 --cohFollow 10 --numCells 1 --useWeights 1 --Srefinement 1 --CohSkyRef 1 --cohfullskybox 1 --mmfu 0.1 --reftime 56100 --model 0 --f0orbit 0.005 --mismatch 0.1 --demodbinary 1 --BinaryPointFile ../../projects/einstein.phys.uwm.edu/templates_LATeah1025L_0100_7122577.dat --debug 1 --device 0 -o LATeah1025L_100.0_0_0.0_7122577_1_0.out
output files: 'LATeah1025L_100.0_0_0.0_7122577_1_0.out' '../../projects/einstein.phys.uwm.edu/LATeah1025L_100.0_0_0.0_7122577_1_0' 'LATeah1025L_100.0_0_0.0_7122577_1_0.out.cohfu' '../../projects/einstein.phys.uwm.edu/LATeah1025L_100.0_0_0.0_7122577_1_1'
14:50:10 (3004): [debug]: Flags: X64 SSE SSE2 GNUC X86 GNUX86
14:50:10 (3004): [debug]: Set up communication with graphics process.
boinc_get_opencl_ids returned [00000000012147F0 , 0000000001214430]
Using OpenCL platform provided by: NVIDIA Corporation
Using OpenCL device "GeForce GTX 1060 6GB" by: NVIDIA Corporation
Max allocation limit: 1610612736
Global mem size: 2147483648
OpenCL device has FP64 support
% Opening inputfile: ../../projects/einstein.phys.uwm.edu/LATeah1025L.dat
% Total amount of photon times: 8950
% Preparing toplist of length: 10
% Read 1631 binary points
read_checkpoint(): Couldn't open file 'LATeah1025L_100.0_0_0.0_7122577_1_0.out.cpt': No such file or directory (2)
% fft_size: 16777216 (0x1000000); alloc: 67108872
% Sky point 1/1
% Binary point 1/1631
% Creating FFT plan.
% fft length: 16777216 (0x1000000)
% Scratch buffer size: 136314880
% Starting semicoherent search over f0 and f1.
% nf1dots: 41 df1dot: 2.512676418e-015 f1dot_start: -1e-013 f1dot_band: 1e-013
% Filling array of photon pairs
ERROR: /home/bema/fermilat/src/bridge_fft_clfft.c:948: clFinish failed. status=-36
ERROR: opencl_ts_2_phase_diff_sorted() returned with error 7343555
14:50:15 (3004): [CRITICAL]: ERROR: MAIN() returned with error '-36'
FPU status flags: PRECISION
14:50:26 (3004): [normal]: done. calling boinc_finish(28).
14:50:26 (3004): called boinc_finish
 

https://einsteinathome.org/host/12599270/tasks/6/0

archae86

Joined: 6 Dec 05

Posts: 3157

Credit: 7219534931

RAC: 981486

Thank you for the report.

31 Oct 2018 20:38:26 UTC

Message 167577

CElliott, thank you for the report. I've spent a lot of time on this issue, and yours is the first 2070 report, which we now add to the consistent reporting on two 2080 cards and two 2080 Ti cards here at Einstein. Yours is also our first Windows 8 report, which adds to multiple Windows 10 and one Windows 7 reports.

As it happens, the WUs being issued since a little over half a day ago are of a different file, and will probably work correctly on your 2070 with current driver and software. As your system has already downloaded a few of these LATeah1029L WUs, you could test this possibility by suspending all of your 104V WUs (and any other 104* units) so the ones likely to work go on ahead. I would be very interested in this result.

Gary Roberts

Moderator

Joined: 9 Feb 05

Posts: 5872

Credit: 117491830419

RAC: 35452565

31 Oct 2018 22:15:45 UTC

Message 167582 in response to message 167576

Jim1348 wrote:

CElliott wrote:
The Stderr output on this website says
"The printer is out of paper. (0x1c) - exit code 28 (0x1c)

I just started seeing that too, though it appears for entirely different reasons on my GTX 1060 (Win7 64-bit).

I am tempted to say that they should add more paper.

<core_client_version>7.14.2</core_client_version>
<![CDATA[
<message>
The printer is out of paper.
(0x1c) - exit code 28 (0x1c)</message>

You should just ignore this. This is Windows taking an exit code that has nothing to do with Windows and misinterpreting it. The exit code belongs to the app and is probably specific to some OpenCL internal problem within the app itself. No doubt it would mean something to the Devs, which is why it's being reported back to the project as part of the <stderr_txt> output stream.

Cheers,
Gary.
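To illustrate the point about the misread exit code, here is a minimal Python sketch. It assumes only that 28 (0x1c) is the Windows system error number whose text is "The printer is out of paper." The table is a tiny hand-picked excerpt, and the function name `describe_exit_code` is made up for this example. It imitates the shape of the message in the task log and is not the BOINC client's actual code.

```python
# Hand-picked excerpt of Windows system error numbers and their texts.
# Not exhaustive; only enough to show the coincidence with exit code 28.
WINDOWS_SYSTEM_ERRORS = {
    2: "The system cannot find the file specified.",
    5: "Access is denied.",
    28: "The printer is out of paper.",
}


def describe_exit_code(code):
    """Render an app exit code as if it were a Windows system error."""
    text = WINDOWS_SYSTEM_ERRORS.get(code, "<no message in this excerpt>")
    return f"{text} (0x{code:x}) - exit code {code} (0x{code:x})"


if __name__ == "__main__":
    # The app called boinc_finish(28); looked up as a system error it
    # picks up a message that has nothing to do with the real failure.
    print(describe_exit_code(28))
```

The number is the same in both readings. Only the lookup table it is run through differs, which is why the message can be ignored.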

Jim1348

Joined: 19 Jan 06

Posts: 463

Credit: 257957147

RAC: 0

1 Nov 2018 1:47:58 UTC

Message 167585 in response to message 167582

Gary Roberts wrote:

You should just ignore this. This is Windows taking an exit code that is nothing to do with Windows and misinterpreting it. The exit code belongs to the app and is probably specific to some OpenCL internal problem within the app itself.

Thanks. I am beginning to think that my card died. I tried it on Folding, and it did not work there either. It is a bit strange, since I had just replaced the cooler, and it was running very cool for a few days. But something else may have gone out. It will give Nvidia some more business (but not for an RTX).

Mike Hewson

Moderator

Joined: 1 Dec 05

Posts: 6588

Credit: 315391848

RAC: 319153

1 Nov 2018 7:57:00 UTC

Message 167586

Presumably this :

" ..... % Starting semicoherent search over f0 and f1.
% nf1dots: 41 df1dot: 2.512676418e-015 f1dot_start: -1e-013 f1dot_band: 1e-013
% Filling array of photon pairs
ERROR: /home/bema/fermilat/src/bridge_fft_clfft.c:948: clFinish failed. status=-36
ERROR: opencl_ts_2_phase_diff_sorted() returned with error 7343555
14:50:15 (3004): [CRITICAL]: ERROR: MAIN() returned with error '-36'
FPU status flags: PRECISION
14:50:26 (3004): [normal]: done. calling boinc_finish(28).
14:50:26 (3004): called boinc_finish"

... gives the clue that the OpenCL fast Fourier transform is what is vomiting. There was an almost identical error reported last year by Archae86. Peter, do you remember what conclusion was drawn, if any, from that thread?

I'll take a stab and say that (some of) the current Nvidia drivers aren't supporting OpenCL properly (notwithstanding hardware problems with RTX).

Cheers, Mike.

I have made this letter longer than usual because I lack the time to make it shorter ...

... and my other CPU is a Ryzen 5950X :-) Blaise Pascal

Richard Haselgrove

Joined: 10 Dec 05

Posts: 2143

Credit: 2955126569

RAC: 715256

1 Nov 2018 8:54:00 UTC

Message 167588 in response to message 167586

Peter wrote that report on 16 January 2017, in relation to application version 1.18

Application version 1.20 was released on 16 February 2017

I'll try to find anything which might link the two events.

Edit - no, nothing. Peter reported (in https://einsteinathome.org/content/observations-fgrbp1-118-windows?page=10#comment-154201) that downclocking the cards improved stability with the older application, and also reported a driver restart event - which has also been reported on the RTX cards.

Mike Hewson

Moderator

Joined: 1 Dec 05

Posts: 6588

Credit: 315391848

RAC: 319153

1 Nov 2018 9:34:00 UTC

Message 167589

Thank you Richard. I've found out what the "-36" error really means, see this post :

"... openCL specific and translates to CL_INVALID_COMMAND_QUEUE ...."

At Nvidia there is an explanation of this that relates the driver reset to work size, e.g. probably our ~~shorties~~ low-pay vs ~~longies~~ high-pay WUs. It more or less means the driver restarts if something takes 'too long'. Perhaps this is under developer control, i.e. by changing the timeouts; see TdrDelay and TdrDdiDelay here for Windows? :-)))

Cheers, Mike.
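For anyone who wants to decode these stderr lines themselves, here is a small Python sketch. It assumes the line format seen in the logs in this thread and uses a short excerpt of the status constants from the OpenCL headers. The names `decode_stderr_line` and `OPENCL_STATUS_NAMES` are made up for this example and are not part of the Einstein@Home app.

```python
import re

# Excerpt of OpenCL status codes as defined in the Khronos cl.h header.
OPENCL_STATUS_NAMES = {
    0: "CL_SUCCESS",
    -4: "CL_MEM_OBJECT_ALLOCATION_FAILURE",
    -5: "CL_OUT_OF_RESOURCES",
    -6: "CL_OUT_OF_HOST_MEMORY",
    -30: "CL_INVALID_VALUE",
    -36: "CL_INVALID_COMMAND_QUEUE",
}

# Matches lines such as:
# ERROR: /path/to/file.c:948: clFinish failed. status=-36
_ERROR_LINE = re.compile(
    r"ERROR: (?P<file>\S+):(?P<line>\d+): "
    r"(?P<call>\w+) failed\. status=(?P<status>-?\d+)"
)


def decode_stderr_line(line):
    """Return (call, status, symbolic name) or None if the line differs."""
    match = _ERROR_LINE.search(line)
    if match is None:
        return None
    status = int(match.group("status"))
    name = OPENCL_STATUS_NAMES.get(status, "UNKNOWN (not in this excerpt)")
    return match.group("call"), status, name


if __name__ == "__main__":
    sample = ("ERROR: /home/bema/fermilat/src/bridge_fft_clfft.c:948: "
              "clFinish failed. status=-36")
    print(decode_stderr_line(sample))
```

A CL_INVALID_COMMAND_QUEUE from clFinish right after a driver reset would fit the picture of the queue being torn down underneath the app, though that reading is an inference and not something the log states.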

( edit ) Begs the question as to how many, if any, are Linux machines with this error variety.

( edit ) E@H can generate large FFTs with long signal integrations, say, 2²² data points.

( edit ) Of course if there is no paper available then it will hang while waiting for some to be delivered, thus maybe triggering a reset too .... :-)))
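For Windows users who want to see which timeouts are in effect, here is a hedged Python sketch for the TdrDelay and TdrDdiDelay values mentioned above. It assumes the registry key and the defaults (2 and 5 seconds) given in Microsoft's TDR documentation as I understand it. The function name `read_tdr_settings` is made up for this example. It only reads values and changes nothing.

```python
import sys

# Registry location and value names for TDR, per Microsoft's documentation
# as understood here (assumption: verify on your own system before relying
# on it). Defaults apply when a value is not present in the registry.
TDR_KEY = r"SYSTEM\CurrentControlSet\Control\GraphicsDrivers"
TDR_DEFAULTS = {"TdrDelay": 2, "TdrDdiDelay": 5}  # seconds


def read_tdr_settings():
    """Return the effective TDR timeouts in seconds.

    On non-Windows systems, or when a value is absent, the documented
    defaults are returned unchanged.
    """
    settings = dict(TDR_DEFAULTS)
    if sys.platform != "win32":
        return settings

    import winreg  # Windows-only standard library module

    try:
        with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, TDR_KEY) as key:
            for name in TDR_DEFAULTS:
                try:
                    value, _value_type = winreg.QueryValueEx(key, name)
                    settings[name] = int(value)
                except FileNotFoundError:
                    pass  # value not set, keep the default
    except OSError:
        pass  # key missing or unreadable, keep the defaults
    return settings


if __name__ == "__main__":
    for name, seconds in read_tdr_settings().items():
        print(f"{name}: {seconds} s")
```

Whether raising these values would actually help with the failing tasks is untested here. It would only give a long-running kernel more time before Windows resets the driver.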

Richard Haselgrove

Joined: 10 Dec 05

Posts: 2143

Credit: 2955126569

RAC: 715256

1 Nov 2018 10:12:10 UTC

Message 167590 in response to message 167589

Mike Hewson wrote:

( edit ) Of course if there is no paper available then it will hang while waiting for some to be delivered, thus maybe triggering a reset too .... :-)))

I always make sure to keep a spare roll handy :-)))

Mike Hewson

Moderator

Joined: 1 Dec 05

Posts: 6588

Credit: 315391848

RAC: 319153

1 Nov 2018 10:14:00 UTC

Message 167591 in response to message 167590

LOL. Absorbent hand-towel to cleanup after driver messes. ;=}

Cheers, Mike.

( edit ) FWIW : on NVidia cards OpenCL is implemented using CUDA calls underneath and so :

- OpenCL will never beat CUDA

- if the CUDA aspect of a driver is bad then so will the OpenCL

- a single crappy CUDA core may break an OpenCL application eg. all the other cores working have completed bar the bad core and thus futilely await the exit barrier of the parallel portion of the code ( all need to complete before any proceed beyond ). So timeout occurs. The crappiness of a given CUDA core may be clock frequency dependent .
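The barrier point in the last item can be sketched on the CPU side. This is only an analogy using Python threads, not GPU code and not how the driver works internally. It assumes nothing beyond the standard library, and the function name `run_with_straggler` and its parameters are made up for this example. One worker is deliberately slow, the others wait at the barrier, and the wait times out for everyone.

```python
import threading
import time


def run_with_straggler(n_workers=4, timeout=0.1, straggler_delay=0.5):
    """Run n_workers threads that must all meet at a barrier.

    Worker 0 sleeps for straggler_delay seconds before arriving. If that
    exceeds the timeout, the barrier breaks and every worker fails.
    """
    barrier = threading.Barrier(n_workers)
    outcomes = [None] * n_workers

    def worker(index):
        if index == 0:
            time.sleep(straggler_delay)  # the one slow "core"
        try:
            barrier.wait(timeout=timeout)
            outcomes[index] = "passed"
        except threading.BrokenBarrierError:
            # Raised both for the waiter that timed out and for anyone
            # who reaches the barrier after it has been broken.
            outcomes[index] = "timed out"

    threads = [threading.Thread(target=worker, args=(i,))
               for i in range(n_workers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return outcomes


if __name__ == "__main__":
    print(run_with_straggler())                      # straggler too slow
    print(run_with_straggler(timeout=5.0,
                             straggler_delay=0.0))   # everyone on time
```

A single late arrival is enough to fail the whole group, which is the behaviour the list item describes.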
