All WUs on RTX 2070 Error Out

Richard Haselgrove
Joined: 10 Dec 05
Posts: 2143
Credit: 2955309884
RAC: 720564

The problem of a bad core causing an over-long execution time and hence triggering a TDR (Timeout Detection and Recovery) reset feels like "an optimisation too far" - trying to squeeze a quart into a pint pot.

Which is exactly what happened with the RTX range. The basic hardware architecture changed from 128 cores per SM down to 64 cores per SM. The same architecture change also applies to the Tesla P100, the Tesla V100, and the Titan V.

If anyone is in a position to test-run one of the failing high-pay tasks on any of those GPUs, a similar error there would be useful evidence.

(edit) Or if anyone with database access here can identify active hosts with any of those cards, they could examine their recent error history.

Mike Hewson
Moderator
Joined: 1 Dec 05
Posts: 6588
Credit: 315492506
RAC: 322947

Richard Haselgrove wrote:
Which is exactly what happened with the RTX range. The basic hardware architecture changed from 128 cores per SM down to 64 cores per SM. The same architecture change also applies to the Tesla P100, the Tesla V100, and the Titan V.

Presumably still 32 cores per warp though?

Cheers, Mike.

I have made this letter longer than usual because I lack the time to make it shorter ...

... and my other CPU is a Ryzen 5950X :-) Blaise Pascal

Richard Haselgrove
Joined: 10 Dec 05
Posts: 2143
Credit: 2955309884
RAC: 720564

Pass. I was researching that for https://github.com/BOINC/boinc/pull/2707 - the BOINC client report-back of the nominal GFLOPS Peak speed of the new cards. That depends (only) on the total core count, which BOINC calculates from the SM count (obtained via API) and a hardwired 'cores per SM' value - nothing else.
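
As a rough illustration of why that hardwired value matters, here is a minimal Python sketch of this kind of calculation. It is not BOINC's actual code: the function name is hypothetical, the factor of 2 FLOPs per core per clock (one fused multiply-add) is an assumption, and the 1.7715 GHz clock is simply back-derived from the 4535 GFLOPS figure that a GTX 1060 6GB (10 SMs, 128 cores per SM) reports later in this thread.

```python
# Hypothetical sketch, not taken from the BOINC source.
# Assumption: each core retires one fused multiply-add (2 FLOPs) per clock.
FLOPS_PER_CORE_PER_CLOCK = 2


def peak_gflops(sm_count, cores_per_sm, clock_ghz):
    """Nominal peak speed from SM count, cores per SM and clock in GHz."""
    total_cores = sm_count * cores_per_sm
    return total_cores * FLOPS_PER_CORE_PER_CLOCK * clock_ghz


# GTX 1060 6GB: 10 SMs x 128 cores; the clock is an assumed, back-derived value.
print(round(peak_gflops(10, 128, 1.7715)))

# An RTX 2070 has 36 SMs with 64 cores each. If a client still assumed
# 128 cores per SM, the reported peak would be exactly double the real one.
correct = peak_gflops(36, 64, 1.0)
overstated = peak_gflops(36, 128, 1.0)
print(overstated / correct)
```

The second calculation uses a placeholder 1.0 GHz clock, since only the ratio between the two results matters there.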

Jim1348
Joined: 19 Jan 06
Posts: 463
Credit: 257957147
RAC: 0

Jim1348 wrote:
I am beginning to think that my card died.  I tried it on Folding, and it did not work there either.

This gets curiouser and curiouser, and may not have anything to do with the RTX problem, or maybe it does. I will let the experts decide. Before pulling my card, I decided to try it on a CUDA project, and SETI v8 (cuda 42) runs fine, returning a work unit with success. There is no indication of a driver problem in Windows Device Manager, and BOINC reports the card as usual:

CUDA: NVIDIA GPU 0: GeForce GTX 1060 6GB (driver version 416.34, CUDA version 10.0, compute capability 6.1, 4096MB, 3044MB available, 4535 GFLOPS peak)    

OpenCL: NVIDIA GPU 0: GeForce GTX 1060 6GB (driver version 416.34, device version OpenCL 1.2 CUDA, 6144MB, 3044MB available, 4535 GFLOPS peak)    

But Folding (which uses OpenCL) does not recognize the card at all, and I get the same error (insofar as I can see) on FGRPopencl1K-nvidia.

EDIT: The original failure was on the 373.06 drivers, but I updated to the latest, using a DDU uninstall, just to make sure that it was not a driver problem.

<core_client_version>7.14.2</core_client_version>
<![CDATA[
<message>
The printer is out of paper.
 (0x1c) - exit code 28 (0x1c)</message>
<stderr_txt>
11:49:06 (5724): [normal]: This Einstein@home App was built at: Feb 15 2017 09:23:49

11:49:06 (5724): [normal]: Start of BOINC application 'projects/einstein.phys.uwm.edu/hsgamma_FGRPB1G_1.20_windows_x86_64__FGRPopencl1K-nvidia.exe'.
11:49:06 (5724): [debug]: 1.1e+016 fp, 4.2e+009 fp/s, 2485530 s, 690h25m29s55
11:49:06 (5724): [normal]: % CPU usage: 1.000000, GPU usage: 1.000000
command line: projects/einstein.phys.uwm.edu/hsgamma_FGRPB1G_1.20_windows_x86_64__FGRPopencl1K-nvidia.exe --inputfile ../../projects/einstein.phys.uwm.edu/LATeah1025L.dat --alpha 1.41058464281 --delta -0.444366280137 --skyRadius 5.090540e-07 --ldiBins 30 --f0start 164.0 --f0Band 8.0 --firstSkyPoint 0 --numSkyPoints 1 --f1dot -1e-13 --f1dotBand 1e-13 --df1dot 2.512676418e-15 --ephemdir ..\..\projects\einstein.phys.uwm.edu\JPLEPH --Tcoh 2097152.0 --toplist 10 --cohFollow 10 --numCells 1 --useWeights 1 --Srefinement 1 --CohSkyRef 1 --cohfullskybox 1 --mmfu 0.1 --reftime 56100 --model 0 --f0orbit 0.005 --mismatch 0.1 --demodbinary 1 --BinaryPointFile ../../projects/einstein.phys.uwm.edu/templates_LATeah1025L_0172_8158262.dat --debug 1 --device 0 -o LATeah1025L_172.0_0_0.0_8158262_0_0.out
output files: 'LATeah1025L_172.0_0_0.0_8158262_0_0.out' '../../projects/einstein.phys.uwm.edu/LATeah1025L_172.0_0_0.0_8158262_0_0' 'LATeah1025L_172.0_0_0.0_8158262_0_0.out.cohfu' '../../projects/einstein.phys.uwm.edu/LATeah1025L_172.0_0_0.0_8158262_0_1'
11:49:06 (5724): [debug]: Flags: X64 SSE SSE2 GNUC X86 GNUX86
11:49:06 (5724): [debug]: Set up communication with graphics process.
boinc_get_opencl_ids returned [0000000001181620 , 0000000001181300]
Using OpenCL platform provided by: NVIDIA Corporation
Using OpenCL device "GeForce GTX 1060 6GB" by: NVIDIA Corporation
Max allocation limit: 1610612736
Global mem size: 2147483648
OpenCL device has FP64 support
% Opening inputfile: ../../projects/einstein.phys.uwm.edu/LATeah1025L.dat
% Total amount of photon times: 8950
% Preparing toplist of length: 10
% Read 1631 binary points
read_checkpoint(): Couldn't open file 'LATeah1025L_172.0_0_0.0_8158262_0_0.out.cpt': No such file or directory (2)
% fft_size: 16777216 (0x1000000); alloc: 67108872
% Sky point 1/1
% Binary point 1/1631
% Creating FFT plan.
% fft length: 16777216 (0x1000000)
% Scratch buffer size: 136314880
% Starting semicoherent search over f0 and f1.
% nf1dots: 41 df1dot: 2.512676418e-015 f1dot_start: -1e-013 f1dot_band: 1e-013
% Filling array of photon pairs
ERROR: /home/bema/fermilat/src/bridge_fft_clfft.c:948: clFinish failed. status=-36
ERROR: opencl_ts_2_phase_diff_sorted() returned with error 7343555
11:49:10 (5724): [CRITICAL]: ERROR: MAIN() returned with error '-36'
FPU status flags: PRECISION
11:49:22 (5724): [normal]: done. calling boinc_finish(28).
11:49:22 (5724): called boinc_finish
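
For reference, the two numeric codes in the log above can be decoded with a small lookup. This is only a sketch: the dictionaries are hypothetical helpers that list a handful of values from the Khronos OpenCL header and one Windows system error code, not a complete table. The "printer is out of paper" text appears to be nothing more than the Windows message for system error 28, which happens to equal the application's exit code.

```python
# A few OpenCL status values as defined in the Khronos cl.h header.
OPENCL_STATUS = {
    0: "CL_SUCCESS",
    -4: "CL_MEM_OBJECT_ALLOCATION_FAILURE",
    -5: "CL_OUT_OF_RESOURCES",
    -6: "CL_OUT_OF_HOST_MEMORY",
    -36: "CL_INVALID_COMMAND_QUEUE",
}

# Windows system error 28 carries the message "The printer is out of paper."
WINDOWS_SYSTEM_ERRORS = {
    28: "ERROR_OUT_OF_PAPER",
}


def describe_opencl_status(status):
    """Return the symbolic name for an OpenCL status, if it is in the table."""
    return OPENCL_STATUS.get(status, f"unknown OpenCL status {status}")


def describe_exit_code(code):
    """Return the Windows symbolic name for an exit code, if it is in the table."""
    return WINDOWS_SYSTEM_ERRORS.get(code, f"unknown exit code {code}")


print(describe_opencl_status(-36))  # status returned by clFinish in the log
print(describe_exit_code(0x1C))     # exit code 28 (0x1c) from the log
```

The name for -36 only says that the command queue is no longer usable, so on its own it does not tell a hardware fault from a driver fault.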

I haven't done any MS updates recently, and the chances of the hardware causing a selective OpenCL failure seem rather small also. But at least I can use it for CUDA.

Zalster
Joined: 26 Nov 13
Posts: 3117
Credit: 4050672230
RAC: 0

Jim1348 wrote:

Before pulling my card, I decided to try it on a CUDA project, and SETI v8 (cuda 42) runs fine, returning a work unit with success.

If you are going to use SETI to compare, then I would suggest running SoG, which is also an OpenCL application. Then you will know for sure whether it's an OpenCL issue or a card issue. If the OpenCL SoG succeeds, then the problem is most likely the OpenCL application here. If the OpenCL SoG fails at SETI, then either the card has failed or the NVIDIA driver's OpenCL support is the issue.
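
The comparison suggested here can be written down as a small decision function. This is only a sketch of the reasoning in this post; the function and argument names are hypothetical, and the returned strings are summaries rather than anything a BOINC client reports.

```python
def diagnose(einstein_opencl_ok, seti_sog_opencl_ok):
    """Narrow down an OpenCL failure by cross-checking a second OpenCL app."""
    if einstein_opencl_ok:
        return "no OpenCL problem seen at Einstein@Home"
    if seti_sog_opencl_ok:
        # Another OpenCL application runs on the same card and driver.
        return "most likely the Einstein@Home OpenCL application"
    # OpenCL fails in two unrelated applications.
    return "card failure or NVIDIA driver OpenCL problem"


# The Einstein@Home task failed, so only the SoG result is still unknown.
print(diagnose(einstein_opencl_ok=False, seti_sog_opencl_ok=True))
print(diagnose(einstein_opencl_ok=False, seti_sog_opencl_ok=False))
```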

Jim1348
Joined: 19 Jan 06
Posts: 463
Credit: 257957147
RAC: 0

Zalster wrote:
If you are going to use SETI to compare, then I would suggest running SoG, which is also an OpenCL application. Then you will know for sure whether it's an OpenCL issue or a card issue. If the OpenCL SoG succeeds, then the problem is most likely the OpenCL application here. If the OpenCL SoG fails at SETI, then either the card has failed or the NVIDIA driver's OpenCL support is the issue.

Well I finally got some SoG, but whether they are working or not is not entirely clear to me.

http://setiathome.berkeley.edu/results.php?hostid=8606531&offset=0&show_names=0&state=2&appid=

At least they did not obviously fail.  But if not, that only deepens the mystery.  I use Folding all the time, and it is not recognizing the card.  So something is wrong.  I will have to pull the card in any case and hope for better luck.

This is my third card to fail in the last six weeks. I just got a GTX 1070 back from RMA today. The others are out of warranty. However, if this still works for CUDA, I can use it to replace the GTX 750 Ti on GPUGrid which failed; it is a fortuitous chain of failures.

Zalster
Joined: 26 Nov 13
Posts: 3117
Credit: 4050672230
RAC: 0

Jim1348 wrote:
Zalster wrote:
If you are going to use SETI to compare, then I would suggest running SoG, which is also an OpenCL application. Then you will know for sure whether it's an OpenCL issue or a card issue. If the OpenCL SoG succeeds, then the problem is most likely the OpenCL application here. If the OpenCL SoG fails at SETI, then either the card has failed or the NVIDIA driver's OpenCL support is the issue.

Well I finally got some SoG, but whether they are working or not is not entirely clear to me.

http://setiathome.berkeley.edu/results.php?hostid=8606531&offset=0&show_names=0&state=2&appid=

At least they did not obviously fail.  But if not, that only deepens the mystery.  I use Folding all the time, and it is not recognizing the card.  So something is wrong.  I will have to pull the card in any case and hope for better luck.

This is my third card to fail in the last six weeks. I just got a GTX 1070 back from RMA today. The others are out of warranty. However, if this still works for CUDA, I can use it to replace the GTX 750 Ti on GPUGrid which failed; it is a fortuitous chain of failures.

It looks like it is working normally with the SETI OpenCL SoG. Those 30-second work units are noise bombs. The 6-minute run time is more in line with what I would expect from a high-end Turing card. A 1080 Ti takes 8 minutes 30 seconds, so... We will have to wait and see if they validate.

Jim1348
Joined: 19 Jan 06
Posts: 463
Credit: 257957147
RAC: 0

FWIW, a couple of the SETI CUDA work units have now been marked "invalid". That corresponds to what I see here. After a couple of re-installs, I was finally able to get Folding to recognize the card, but it would run a work unit for only a few seconds before failing. So the card is dying on everything; no great mystery. A new one is on order. Thanks for the input.

Mike Hewson
Moderator
Joined: 1 Dec 05
Posts: 6588
Credit: 315492506
RAC: 322947

Not good to hear that. Best of luck with another one.

Cheers, Mike. 

Jim1348
Joined: 19 Jan 06
Posts: 463
Credit: 257957147
RAC: 0

Thanks.  I need the heat of a GTX 1070 for the winter anyway (not your problem, I know).
