I have been successfully processing E@H work units for many years, but only for the last week this season. I have been doing almost 200 per day with a GTX 1070 and a GTX 770 and Nvidia driver 416.34.
I just installed a new RTX 2070 video card. Twenty-five of 25 attempted work units have experienced an error. The screen flickers after a few seconds, goes dark, recovers, and the WU is toast.
The error in the event log is "Display driver nvlddmkm stopped responding and has successfully recovered." The Stderr output on this website says
"The printer is out of paper. (0x1c) - exit code 28 (0x1c)
...
ERROR: /home/bema/fermilat/src/bridge_fft_clfft.c:948: clFinish failed. status=-36
ERROR: opencl_ts_2_phase_diff_sorted() returned with error 85342400
15:13:00 (1388): [CRITICAL]: ERROR: MAIN() returned with error '-36'
FPU status flags: PRECISION
15:13:12 (1388): [normal]: done. calling boinc_finish(28).
15:13:12 (1388): called boinc_finish
The "printer is out of paper" message does not appear on successfully completed work units. Does anyone know of a computer on this website with an RTX 2070 that has successfully processed an Einstein@Home 1.20 Gamma-ray pulsar binary search #1 on GPUs (FGRPopencl1K-nvidia) work unit?
Does anyone have any other ideas about what could be wrong?
Copyright © 2024 Einstein@Home. All rights reserved.
Please read through the https://einsteinathome.org/content/pascal-again-available-turing-may-be-coming-soon
thread, especially the last two weeks of posts. Several Turing users have documented failures when trying to process "high-pay" tasks. "Low-pay" tasks seem to run fine. The mix of work is turning to "low-pay" tasks so you should be able to process those on your Turing card.
CElliott wrote: The Stderr ...
I just started seeing that too, though it appears for entirely different reasons on my GTX 1060 (Win7 64-bit).
I am tempted to say that they should add more paper.
CElliott, thank you for the report. I've spent a lot of time on this issue, and yours is the first 1070 report, which we now add to the consistent reporting on two 1080 cards and two 1080 Ti cards here at Einstein. Yours is also our first Windows 8 report, which adds to multiple Windows 10 reports and one Windows 7 report.
As it happens, the WUs issued since a little over half a day ago come from a different data file, and will probably work correctly on your 2070 with the current driver and software. As your system has already downloaded a few of these LATeah1029L WUs, you could test this possibility by suspending all of your 104V WUs (and any other 104* units) so that the ones likely to work run first. I would be very interested in this result.
Jim1348 wrote: CElliott ...
You should just ignore this. This is Windows taking an exit code that has nothing to do with Windows and misinterpreting it. The exit code belongs to the app and is probably specific to some OpenCL internal problem within the app itself. No doubt it would mean something to the Devs, which is why it's being reported back to the project as part of the <stderr_txt> output stream.
Cheers,
Gary.
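To illustrate the point about the misread exit code, here is a minimal Python sketch of the lookup. It assumes the stderr header is built by looking the application's exit code up in the Windows system error table, where 28 is ERROR_OUT_OF_PAPER. The table excerpt and the function name `describe_exit_code` are made up for this illustration and are not taken from BOINC.

```python
# Hypothetical excerpt of the Windows system error table (values as in winerror.h).
# Only a few entries are listed; the real table is much larger.
WIN32_SYSTEM_ERRORS = {
    2: "The system cannot find the file specified.",
    5: "Access is denied.",
    28: "The printer is out of paper.",
}


def describe_exit_code(code: int) -> str:
    """Format an exit code as if it were a Windows system error."""
    text = WIN32_SYSTEM_ERRORS.get(code, "Unknown error")
    return f"{text} (0x{code:x}) - exit code {code} (0x{code:x})"


# The app called boinc_finish(28); the lookup turns that into a printer message.
print(describe_exit_code(28))
```

The number 28 here is simply what the application passed to boinc_finish; the printer text is only what that number happens to mean in the Windows table.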
Gary Roberts wrote: You should ...
Thanks. I am beginning to think that my card died. I tried it on Folding, and it did not work there either. It is a bit strange, since I had just replaced the cooler, and it was running very cool for a few days. But something else may have gone out. It will give Nvidia some more business (but not for an RTX).
Presumably this :
" ..... % Starting semicoherent search over f0 and f1.
% nf1dots: 41 df1dot: 2.512676418e-015 f1dot_start: -1e-013 f1dot_band: 1e-013
% Filling array of photon pairs
ERROR: /home/bema/fermilat/src/bridge_fft_clfft.c:948: clFinish failed. status=-36
ERROR: opencl_ts_2_phase_diff_sorted() returned with error 7343555
14:50:15 (3004): [CRITICAL]: ERROR: MAIN() returned with error '-36'
FPU status flags: PRECISION
14:50:26 (3004): [normal]: done. calling boinc_finish(28).
14:50:26 (3004): called boinc_finish"
... gives the clue that the OpenCL fast Fourier transform is what is vomiting. There was an almost identical error reported last year by Archae86. Peter, do you remember what conclusion was drawn, if any, from that thread?
I'll take a stab and say that (some of) the current Nvidia drivers aren't supporting OpenCL properly (notwithstanding hardware problems with RTX).
Cheers, Mike.
I have made this letter longer than usual because I lack the time to make it shorter ...
... and my other CPU is a Ryzen 5950X :-) Blaise Pascal
Peter wrote that report on 16 January 2017, in relation to application version 1.18
Application version 1.20 was released on 16 February 2017
I'll try to find anything which might link the two events.
Edit - no, nothing. Peter reported (in https://einsteinathome.org/content/observations-fgrbp1-118-windows?page=10#comment-154201) that downclocking the cards improved stability with the older application, and also reported a driver restart event - which has also been reported on the RTX cards.
Thank you Richard. I've found out what the "-36" error really means, see this post :
"... openCL specific and translates to CL_INVALID_COMMAND_QUEUE ...."
At Nvidia there is an explanation relating the driver reset to work size, e.g. probably our shorties (low pay) vs. longies (high pay) WUs. It more or less means the driver restarts if something takes 'too long'. Perhaps this is under developer control, i.e. change the timeouts; see TdrDelay and TdrDdiDelay here for Windows? :-)))
Cheers, Mike.
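For anyone who wants to experiment with the Windows timeouts mentioned above, here is a hedged Python sketch that only builds the text of a .reg file and changes nothing by itself. It relies on Microsoft's documentation, which puts TdrDelay and TdrDdiDelay as REG_DWORD values (in seconds, defaults 2 and 5) under the GraphicsDrivers key. The function name `tdr_reg_file` and the values 10 and 20 are arbitrary picks for illustration. Whether longer timeouts help with these Einstein tasks is an open question, not a known fix.

```python
# Registry key documented by Microsoft for Timeout Detection and Recovery (TDR).
TDR_KEY = r"HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\GraphicsDrivers"


def tdr_reg_file(tdr_delay_s: int, tdr_ddi_delay_s: int) -> str:
    """Return .reg file text that sets TdrDelay and TdrDdiDelay (in seconds)."""
    for value in (tdr_delay_s, tdr_ddi_delay_s):
        if not 0 <= value <= 0xFFFFFFFF:
            raise ValueError("delay must fit in an unsigned 32-bit DWORD")
    lines = [
        "Windows Registry Editor Version 5.00",
        "",
        f"[{TDR_KEY}]",
        f'"TdrDelay"=dword:{tdr_delay_s:08x}',        # .reg DWORDs are 8 hex digits
        f'"TdrDdiDelay"=dword:{tdr_ddi_delay_s:08x}',
        "",
    ]
    return "\n".join(lines)


# Illustrative values only: 10 s and 20 s instead of the defaults of 2 s and 5 s.
print(tdr_reg_file(10, 20))
```

Importing such a file needs administrator rights and a reboot before it takes effect, and a longer timeout also means a genuinely hung GPU freezes the display for longer before the driver is reset.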
( edit ) Begs the question as to how many, if any, are Linux machines with this error variety.
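To help spot this error variety in stderr output from other hosts, here is a small Python sketch that pulls the status number out of a "clFinish failed" line and names it. The numeric values come from the standard OpenCL header (cl.h), but only a handful are listed. The helper name `decode_status` is invented for this example.

```python
import re

# A few status codes from the OpenCL header cl.h; the full list is longer.
OPENCL_STATUS_NAMES = {
    0: "CL_SUCCESS",
    -4: "CL_MEM_OBJECT_ALLOCATION_FAILURE",
    -5: "CL_OUT_OF_RESOURCES",
    -6: "CL_OUT_OF_HOST_MEMORY",
    -34: "CL_INVALID_CONTEXT",
    -36: "CL_INVALID_COMMAND_QUEUE",
}


def decode_status(stderr_line: str):
    """Return (code, name) for a 'status=N' field, or None if there is none."""
    match = re.search(r"status=(-?\d+)", stderr_line)
    if match is None:
        return None
    code = int(match.group(1))
    return code, OPENCL_STATUS_NAMES.get(code, "unknown status")


line = "ERROR: /home/bema/fermilat/src/bridge_fft_clfft.c:948: clFinish failed. status=-36"
print(decode_status(line))
```

Running this over saved stderr text from Windows and Linux hosts would be one way to count how often status -36 turns up on each platform.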
( edit ) E@H can generate large FFTs with long signal integrations, say, 2^22 data points.
( edit ) Of course if there is no paper available then it will hang while waiting for some to be delivered, thus maybe triggering a reset too .... :-)))
Mike Hewson wrote: ( edit ) Of ...
I always make sure to keep a spare roll handy :-)))
LOL. Absorbent hand-towel to cleanup after driver messes. ;=}
Cheers, Mike.
( edit ) FWIW : on NVidia cards OpenCL is implemented using CUDA calls underneath and so :
- OpenCL will never beat CUDA
- if the CUDA aspect of a driver is bad then so will the OpenCL
- a single crappy CUDA core may break an OpenCL application eg. all the other cores working have completed bar the bad core and thus futilely await the exit barrier of the parallel portion of the code ( all need to complete before any proceed beyond ). So timeout occurs. The crappiness of a given CUDA core may be clock frequency dependent .
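The barrier argument in the last point can be imitated on the CPU. The Python sketch below is only an analogy: ordinary threads stand in for GPU work-items, `threading.Barrier` stands in for the exit barrier, and the wait timeout plays the role of the driver's watchdog. The function name `run_workers` and the delays are invented for the example, and nothing here touches OpenCL or CUDA.

```python
import threading
import time


def run_workers(delays, timeout):
    """Run one thread per delay; each sleeps, then waits at a shared barrier."""
    barrier = threading.Barrier(len(delays))
    outcomes = [None] * len(delays)

    def worker(index, delay):
        time.sleep(delay)  # stand-in for this worker's share of the computation
        try:
            barrier.wait(timeout=timeout)  # nobody proceeds until all arrive
            outcomes[index] = "passed"
        except threading.BrokenBarrierError:
            # Raised for every waiter once any one of them times out.
            outcomes[index] = "timed out"

    threads = [
        threading.Thread(target=worker, args=(i, d)) for i, d in enumerate(delays)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return outcomes


# Three fast workers and one slow one: the slow one breaks the barrier for all.
print(run_workers([0.0, 0.0, 0.0, 0.6], timeout=0.1))
```

With the slow worker's delay below the timeout, every worker passes the barrier; once it exceeds the timeout, all of them fail together, which is the behaviour described above.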