Ugh, that sucks. Is this still the only Turing card to try E@H on these tasks? I don't see any in the top 50 hosts, and I don't believe anyone on my own team has mentioned having one to try. It would be good to know whether it is an isolated card or an issue with the entire lineup.
Do you see a chance to seriously lower the temps on the GPU? Open case near an open window on a cold morning? I know most cards are advertised as working at 90 degrees centigrade or more, but on a few occasions I have seen mine compute tasks successfully at 50 °C when they would fail at 60 °C. Good luck with your endeavours anyway!
Aaargh!! When you are editing files to test something, make sure YOU ARE EDITING THE FILES IN THE RIGHT DIRECTORY!!! I really could have used those two hours on something else.
Ok. Edit init_data.xml and put these lines back in it:
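(A sketch of the two entries in question — the exact tag and name of the shared-memory entry here are an assumption on my part, so compare with the init_data.xml of a real task before trusting them:)
<shmem_seg_name>boinc_200</shmem_seg_name>
<client_pid>4</client_pid>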
The 200 doesn't really matter; any number is fine as long as it doesn't accidentally clash with any real tasks. For <client_pid> pick a process that is going to run at least as long as the test. 4 is the System process, which ought to keep running long enough. It is possible the process needs to be owned by your account, in which case pick something else.
Next open PowerShell, either the console or the ISE; both work the same. Run the following command:
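A minimal sketch of the idea, assuming the object is named after the shared-memory entry in init_data.xml with an extra "shm" added, and using a generously sized mapping — both the exact name and the size here are guesses, so adjust them to whatever your setup actually uses:
$mmf = [System.IO.MemoryMappedFiles.MemoryMappedFile]::CreateNew("boinc_shm_200", 65536)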
The PowerShell command has "shm" in the name of the shared memory object but init_data.xml doesn't. Not a typo.
Now try running the test again. This time the app should think it's run by BOINC and it should use <gpu_type>, <gpu_device_num> and <gpu_opencl_dev_index> from init_data.xml.
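For example, to aim the test at the second NVIDIA GPU in a machine, those entries might look like this (the values are purely illustrative, not taken from any particular host):
<gpu_type>NVIDIA</gpu_type>
<gpu_device_num>1</gpu_device_num>
<gpu_opencl_dev_index>1</gpu_opencl_dev_index>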
Once you are done testing, clean up the shared memory object with:
$mmf.Dispose()
It's not the end of the world if you forget to run that command. The shared memory object is removed at the next reboot.
Thank you Juha. I think I now have a portable test case.
You mentioned that the app should use <gpu_type>, <gpu_device_num> and <gpu_opencl_dev_index> from init_data.xml, but I failed to register the implication.
Using this method the command line directive of the form --device 1 does not take precedence over the pair of lines:
<gpu_device_num>0</gpu_device_num>
<gpu_opencl_dev_index>0</gpu_opencl_dev_index>
in init_data.xml.
Otherwise, the new test directory I created on a second system (which runs a 1060 3GB + a 1050) appears to work (and to fail) properly on my primary system, which runs a 2080 + a 1060. (I just modified the two lines mentioned above in init_data.xml as required.)
I intend a further portability test by burning the proposed test directory to a CD, and using it to create a test trial on my third system (which runs only a single 1050).
If all seems well, I intend to look into the suggestions on submitting a driver error report that are posted on the Nvidia Reddit forum. I am pessimistic that the Nvidia team will even look at an end-user report so remote from their core gaming market in this busy time after a major generation introduction, but I feel it my duty to make the attempt.
Meanwhile, in order to allow my primary machine to continue doing Einstein work during this period of high-pay work unit distribution, I intend to swap out the 2080 for my 1070 later today.
archae86 wrote: Using this method the command line directive of the form --device 1 does not take precedence over the pair of lines <gpu_device_num>0</gpu_device_num> and <gpu_opencl_dev_index>0</gpu_opencl_dev_index> in init_data.xml.
Correct. The command line is deprecated, preserved only for compatibility with historic versions. I can dig out the references if you need them.
Juha wrote: Now that you have set up a way to test the app without BOINC, you could reach out to other 2080 owners and ask them to run the test. At least SETI and GPUGRID seem to have people talking about 2080s.
Today I sent PM messages to Bruce and -= Vyper =- at SETI asking whether they would be willing and able to try running my test case. Unzipped, it is under a dozen files and under 20 megabytes, so I think I can just copy it to a server where I buy space to post weather data and point them to it, if either is willing and able.
Hope I'm right here... I'm getting computation errors using an RTX 2080.
I don't know much about OpenCL or programming; I was reading about TDR recovery as a possible way to solve the issue.
I would like to use BOINC 7.14.2 with NVIDIA Game Ready Driver 416.34, without any additional tinkering, and run the GPU versions of E@H.
Here is my stderr output:
11:46:47 (12960): [normal]: Start of BOINC application 'projects/einstein.phys.uwm.edu/hsgamma_FGRPB1G_1.20_windows_x86_64__FGRPopencl1K-nvidia.exe'.
11:46:47 (12960): [debug]: 1.1e+016 fp, 5.3e+009 fp/s, 1997662 s, 554h54m21s90
11:46:47 (12960): [normal]: % CPU usage: 1.000000, GPU usage: 1.000000
command line: projects/einstein.phys.uwm.edu/hsgamma_FGRPB1G_1.20_windows_x86_64__FGRPopencl1K-nvidia.exe --inputfile ../../projects/einstein.phys.uwm.edu/LATeah0104R.dat --alpha 4.4228137297 --delta -0.0345036602638 --skyRadius 5.817760e-08 --ldiBins 15 --f0start 676.0 --f0Band 8.0 --firstSkyPoint 0 --numSkyPoints 1 --f1dot -1e-13 --f1dotBand 1e-13 --df1dot 2.71528666e-15 --ephemdir ..\..\projects\einstein.phys.uwm.edu\JPLEPH --Tcoh 2097152.0 --toplist 10 --cohFollow 10 --numCells 1 --useWeights 1 --Srefinement 1 --CohSkyRef 1 --cohfullskybox 1 --mmfu 0.1 --reftime 56100 --model 0 --f0orbit 0.005 --mismatch 0.1 --demodbinary 1 --BinaryPointFile ../../projects/einstein.phys.uwm.edu/templates_LATeah0104R_0684_315580.dat --debug 1 --device 0 -o LATeah0104R_684.0_0_0.0_315580_0_0.out
output files: 'LATeah0104R_684.0_0_0.0_315580_0_0.out' '../../projects/einstein.phys.uwm.edu/LATeah0104R_684.0_0_0.0_315580_0_0' 'LATeah0104R_684.0_0_0.0_315580_0_0.out.cohfu' '../../projects/einstein.phys.uwm.edu/LATeah0104R_684.0_0_0.0_315580_0_1'
11:46:47 (12960): [debug]: Flags: X64 SSE SSE2 GNUC X86 GNUX86
11:46:47 (12960): [debug]: Set up communication with graphics process.
boinc_get_opencl_ids returned [0000000000145340 , 0000000000144D50]
Using OpenCL platform provided by: NVIDIA Corporation
Using OpenCL device "GeForce RTX 2080" by: NVIDIA Corporation
Max allocation limit: 2147483648
Global mem size: 0
OpenCL device has FP64 support
% Opening inputfile: ../../projects/einstein.phys.uwm.edu/LATeah0104R.dat
% Total amount of photon times: 30000
% Preparing toplist of length: 10
% Read 1018 binary points
read_checkpoint(): Couldn't open file 'LATeah0104R_684.0_0_0.0_315580_0_0.out.cpt': No such file or directory (2)
% fft_size: 16777216 (0x1000000); alloc: 67108872
% Sky point 1/1
% Binary point 1/1018
% Creating FFT plan.
% fft length: 16777216 (0x1000000)
% Scratch buffer size: 136314880
% Starting semicoherent search over f0 and f1.
% nf1dots: 38 df1dot: 2.71528666e-015 f1dot_start: -1e-013 f1dot_band: 1e-013
% Filling array of photon pairs
ERROR: /home/bema/fermilat/src/bridge_fft_clfft.c:948: clFinish failed. status=-36
ERROR: opencl_ts_2_phase_diff_sorted() returned with error 750016800
11:46:57 (12960): [CRITICAL]: ERROR: MAIN() returned with error '-36'
FPU status flags: PRECISION
11:47:09 (12960): [normal]: done. calling boinc_finish(28).
11:47:09 (12960): called boinc_finish
</stderr_txt>
]]>
Sybie wrote:
% Filling array of photon pairs
ERROR: /home/bema/fermilat/src/bridge_fft_clfft.c:948: clFinish failed. status=-36
ERROR: opencl_ts_2_phase_diff_sorted() returned with error 750016800
11:46:57 (12960): [CRITICAL]: ERROR: MAIN() returned with error '-36'
FPU status flags: PRECISION
11:47:09 (12960): [normal]: done. calling boinc_finish(28).
11:47:09 (12960): called boinc_finish
Thanks for your report. It appears to show your 2080 failing on the same type of WU in the same way as my 2080 does. This makes it extremely unlikely that the problem is a happenstance defect in my personal sample of the 2080 (so I am glad I did not RMA it).
I have trimmed the stderr output you posted to the portion which seems to contain the clear indication of failure, quoted above, and I am reproducing below the strikingly similar corresponding portion of a stderr from one of my "high-pay WU" 2080 failures:
% Filling array of photon pairs
ERROR: /home/bema/fermilat/src/bridge_fft_clfft.c:948: clFinish failed. status=-36
ERROR: opencl_ts_2_phase_diff_sorted() returned with error -82224864
12:04:33 (9900): [CRITICAL]: ERROR: MAIN() returned with error '-36'
FPU status flags: PRECISION
12:04:45 (9900): [normal]: done. calling boinc_finish(28).
12:04:45 (9900): called boinc_finish
Aside from the similarity of stderr, a brief review of the task lists for your machine suggests that it usually successfully completes what I have here called "low-pay" WUs, for example from the LATeah1029L group, while it universally fails the dozens of "high-pay" WUs it has tried so far, from the LATeah0104R and LATeah0104S groups. Further, the "high-pay" failures show elapsed times of 19 to 25 seconds, which is quite close to my observations.
As the "high-pay" type of WU is the current type being distributed here at Einstein, you may have little hope of successful completions other than luckily picking up a few resends of low-pay WUs needing an extra host completion to complete the quorum in cases of failure to reply or failure to meet sanity checks or failure to agree.
Asteroids@home for GPU says:
<core_client_version>7.14.2</core_client_version> <![CDATA[ <message> Das System kann die angegebene Datei nicht finden. [The system cannot find the file specified.] (0x2) - exit code 2 (0x2)</message> <stderr_txt> CUDA RC12!!!!!!!!!! CUDA Device number: 0 CUDA Device: GeForce RTX 2080 Compute capability: 7.5 Multiprocessors: 46 Unsupported CC detected (CC2.0 and better supported only). </stderr_txt> ]]>
SETI@home and Collatz, however, run fine on the GPU.
Sybie wrote: CUDA Device number: 0 CUDA Device: GeForce RTX 2080 Compute capability: 7.5 Multiprocessors: 46 Unsupported CC detected (CC2.0 and better supported only).
That looks like a bug in the Asteroids software, which appears not to interpret the reported compute capability of 7.5 as being "better than CC2.0". That seems likely to be a different problem from the one seen when running high-pay Einstein WUs.