Pascal again available, Turing may be coming soon

mmonnin

Joined: 29 May 16

Posts: 291

Credit: 3232287015

RAC: 137411

Ugh that sucks. Is this still

18 Oct 2018 13:32:17 UTC

Message 167345

Ugh, that sucks. Is this still the only Turing card to try E@H on these tasks? I don't see any in the top 50 hosts, and I don't believe anyone on my own team has mentioned having one to try. It'd be good to know whether this is an isolated card or an issue with the entire lineup.

Joined: 6 Nov 11

Posts: 14

Credit: 757863367

RAC: 3577

Do you see a chance to

18 Oct 2018 18:51:51 UTC

Message 167353

Do you see a chance to seriously lower the temps on the GPU? Open case near an open window on a cold morning? I know most cards are advertised as working at 90 °C or more, but on a few occasions I have seen mine compute tasks successfully at 50 °C when it would fail at 60 °C. Good luck with your endeavours anyway!

Juha

Joined: 27 Nov 14

Posts: 49

Credit: 4962746

RAC: 17

Aaargh!! When you are editing

18 Oct 2018 20:18:53 UTC

Message 167356 in response to message 167343

Aaargh!! When you are editing files to test something, make sure YOU ARE EDITING FILES IN THE RIGHT DIRECTORY!!! Really could have used those two hours on something else.

Ok. Edit init_data.xml and put these lines back in it:

<comm_obj_name>boinc_200</comm_obj_name>
<slot>200</slot>
<client_pid>4</client_pid>

It doesn't really matter what number you use instead of 200, as long as it doesn't accidentally clash with any real tasks. For <client_pid> pick a process that is going to run for as long as the test. PID 4 is the System process, which ought to keep running long enough. It is possible the process needs to be owned by your account, in which case pick something else.

Next open PowerShell, either in terminal or ISE, both work the same. Run the following command:

$mmf = [System.IO.MemoryMappedFiles.MemoryMappedFile]::CreateOrOpen("shm_boinc_200", 0x2000)

The PowerShell command has "shm" in the name of the shared memory object but init_data.xml doesn't. Not a typo.

Now try running the test again. This time the app should think it's run by BOINC and it should use <gpu_type>, <gpu_device_num> and <gpu_opencl_dev_index> from init_data.xml.

Once you are done testing clean up the shared memory object with:

$mmf.Dispose()

It's not the end of the world if you forget to run that command. The shared memory object is removed at the next reboot.
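For anyone who prefers to script the init_data.xml edit rather than do it by hand, here is a minimal sketch in Python. It is an illustration only: the helper names are hypothetical, and it assumes the file's root element is <app_init_data> and that the position of the three elements inside it does not matter. The shared memory object itself still has to be created with the PowerShell command above.

```python
import re


def shared_memory_name(slot=200):
    """Name passed to CreateOrOpen; note the extra "shm_" prefix."""
    return f"shm_boinc_{slot}"


def add_standalone_fields(xml_text, slot=200, client_pid=4):
    """Return xml_text with the three standalone-test elements set.

    Assumption: the root element is <app_init_data>. Existing copies of
    the elements are overwritten; missing ones are added just before
    the closing root tag.
    """
    fields = {
        "comm_obj_name": f"boinc_{slot}",  # no "shm_" prefix in the XML
        "slot": str(slot),
        "client_pid": str(client_pid),
    }
    closing = "</app_init_data>"
    if closing not in xml_text:
        raise ValueError("closing </app_init_data> tag not found")
    for tag, value in fields.items():
        element = f"<{tag}>{value}</{tag}>"
        pattern = re.compile(rf"<{tag}>.*?</{tag}>", re.DOTALL)
        if pattern.search(xml_text):
            # Replace the first existing element with the new value.
            xml_text = pattern.sub(lambda m: element, xml_text, count=1)
        else:
            # Insert the element on its own line before the closing tag.
            xml_text = xml_text.replace(closing, element + "\n" + closing, 1)
    return xml_text
```

To use it, read init_data.xml from the test directory, pass the text through add_standalone_fields, and write the result back. Keep a backup copy of the original file first.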

archae86

Joined: 6 Dec 05

Posts: 3145

Credit: 7059384931

RAC: 1380056

Thank you Juha. I think I

19 Oct 2018 19:25:24 UTC

Message 167377 in response to message 167356

Thank you Juha. I think I now have a portable test case.

You mentioned

Juha wrote:

This time the app should think it's run by BOINC and it should use <gpu_type>, <gpu_device_num> and <gpu_opencl_dev_index> from init_data.xml.

but I failed to register the implication.

Using this method the command line directive of the form --device 1 does not take precedence over the pair of lines:

<gpu_device_num>0</gpu_device_num>
<gpu_opencl_dev_index>0</gpu_opencl_dev_index>

in init_data.xml.

Otherwise, the new test directory I created on a second system (which runs a 1060 3GB + a 1050) appears to work (and to fail) properly on my primary system which runs a 2080 + a 1060. (I just modified the two lines mentioned in init_data.xml as required)

I intend a further portability test by burning the proposed test directory to a CD, and using it to create a test trial on my third system (which runs only a single 1050).

If all seems well, I intend to look into the suggestions on submitting a driver error report which are shown at the Nvidia Reddit forum. I am pessimistic that the Nvidia team will even look at an end-user report that is so remote from their core gaming market in this busy time after a major generation introduction, but I feel it my duty to make the attempt.

Meanwhile, in order to allow my primary machine to continue doing Einstein work during this period of high-pay work unit distribution, I intend to swap out the 2080 for my 1070 later today.

Richard Haselgrove

Joined: 10 Dec 05

Posts: 2140

Credit: 2773133100

RAC: 881231

archae86 wrote:Using this

19 Oct 2018 21:57:33 UTC

Message 167378 in response to message 167377

archae86 wrote:

Using this method the command line directive of the form --device 1 does not take precedence over the pair of lines:

<gpu_device_num>0</gpu_device_num>
<gpu_opencl_dev_index>0</gpu_opencl_dev_index>

in init_data.xml.

Correct. Command line is deprecated, preserved for historic version compatibility only. I can dig out the references if you need them.
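The precedence described here can be pictured with a small Python sketch. It only illustrates the behaviour reported in this thread and is not the Einstein@Home application's actual code; the function and parameter names are hypothetical.

```python
def choose_device_index(init_data_index=None, command_line_index=None):
    """Pick the GPU index the way the thread describes it.

    A value found in init_data.xml wins; the --device command-line
    value is only a fallback for when init_data.xml supplies nothing.
    """
    if init_data_index is not None:
        return init_data_index
    if command_line_index is not None:
        return command_line_index
    raise ValueError("no GPU device index supplied")
```

With <gpu_device_num>0</gpu_device_num> present and --device 1 on the command line, this returns 0, which matches what archae86 observed.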

archae86

Joined: 6 Dec 05

Posts: 3145

Credit: 7059384931

RAC: 1380056

Juha wrote:Now that you have

20 Oct 2018 21:25:23 UTC

Message 167388 in response to message 167264

Juha wrote:

Now that you have set up a way to test the app without BOINC you could reach out to other 2080 owners and ask them to run the test. At least Seti and GPUGRID seem to have people talking about 2080's.

Today I sent PMs to Bruce and -= Vyper =- at SETI asking whether they would be willing and able to try running my test case. Unzipped, it is under a dozen files and under 20 megabytes, so I think I can just copy it to a server where I buy space to post weather data and point them to it, if either is willing and able.

Sybie

Joined: 28 Mar 06

Posts: 5

Credit: 6347019

RAC: 0

Hope I'm right here...

21 Oct 2018 21:54:35 UTC

Message 167400

Hope I'm right here... getting computation error using RTX 2080.

I don't know much about OpenCL and programming; I was reading about TDR recovery as a possible way to solve the issue.

I expect to be able to use BOINC 7.14.2 with NVIDIA Game Ready Driver 416.34 and run the GPU versions of E@H without any additional setup.

Here is my stderr output:

<core_client_version>7.12.1</core_client_version>
<![CDATA[
<message>
The printer is out of paper.
 (0x1c) - exit code 28 (0x1c)</message>
<stderr_txt>
11:46:47 (12960): [normal]: This Einstein@home App was built at: Feb 15 2017 09:23:49

11:46:47 (12960): [normal]: Start of BOINC application 'projects/einstein.phys.uwm.edu/hsgamma_FGRPB1G_1.20_windows_x86_64__FGRPopencl1K-nvidia.exe'.
11:46:47 (12960): [debug]: 1.1e+016 fp, 5.3e+009 fp/s, 1997662 s, 554h54m21s90
11:46:47 (12960): [normal]: % CPU usage: 1.000000, GPU usage: 1.000000
command line: projects/einstein.phys.uwm.edu/hsgamma_FGRPB1G_1.20_windows_x86_64__FGRPopencl1K-nvidia.exe --inputfile ../../projects/einstein.phys.uwm.edu/LATeah0104R.dat --alpha 4.4228137297 --delta -0.0345036602638 --skyRadius 5.817760e-08 --ldiBins 15 --f0start 676.0 --f0Band 8.0 --firstSkyPoint 0 --numSkyPoints 1 --f1dot -1e-13 --f1dotBand 1e-13 --df1dot 2.71528666e-15 --ephemdir ..\..\projects\einstein.phys.uwm.edu\JPLEPH --Tcoh 2097152.0 --toplist 10 --cohFollow 10 --numCells 1 --useWeights 1 --Srefinement 1 --CohSkyRef 1 --cohfullskybox 1 --mmfu 0.1 --reftime 56100 --model 0 --f0orbit 0.005 --mismatch 0.1 --demodbinary 1 --BinaryPointFile ../../projects/einstein.phys.uwm.edu/templates_LATeah0104R_0684_315580.dat --debug 1 --device 0 -o LATeah0104R_684.0_0_0.0_315580_0_0.out
output files: 'LATeah0104R_684.0_0_0.0_315580_0_0.out' '../../projects/einstein.phys.uwm.edu/LATeah0104R_684.0_0_0.0_315580_0_0' 'LATeah0104R_684.0_0_0.0_315580_0_0.out.cohfu' '../../projects/einstein.phys.uwm.edu/LATeah0104R_684.0_0_0.0_315580_0_1'
11:46:47 (12960): [debug]: Flags: X64 SSE SSE2 GNUC X86 GNUX86
11:46:47 (12960): [debug]: Set up communication with graphics process.
boinc_get_opencl_ids returned [0000000000145340 , 0000000000144D50]
Using OpenCL platform provided by: NVIDIA Corporation
Using OpenCL device "GeForce RTX 2080" by: NVIDIA Corporation
Max allocation limit: 2147483648
Global mem size: 0
OpenCL device has FP64 support
% Opening inputfile: ../../projects/einstein.phys.uwm.edu/LATeah0104R.dat
% Total amount of photon times: 30000
% Preparing toplist of length: 10
% Read 1018 binary points
read_checkpoint(): Couldn't open file 'LATeah0104R_684.0_0_0.0_315580_0_0.out.cpt': No such file or directory (2)
% fft_size: 16777216 (0x1000000); alloc: 67108872
% Sky point 1/1
% Binary point 1/1018
% Creating FFT plan.
% fft length: 16777216 (0x1000000)
% Scratch buffer size: 136314880
% Starting semicoherent search over f0 and f1.
% nf1dots: 38 df1dot: 2.71528666e-015 f1dot_start: -1e-013 f1dot_band: 1e-013
% Filling array of photon pairs
ERROR: /home/bema/fermilat/src/bridge_fft_clfft.c:948: clFinish failed. status=-36
ERROR: opencl_ts_2_phase_diff_sorted() returned with error 750016800
11:46:57 (12960): [CRITICAL]: ERROR: MAIN() returned with error '-36'
FPU status flags: PRECISION
11:47:09 (12960): [normal]: done. calling boinc_finish(28).
11:47:09 (12960): called boinc_finish

</stderr_txt>
]]>

archae86

Joined: 6 Dec 05

Posts: 3145

Credit: 7059384931

RAC: 1380056

Sybie wrote: % Filling array

21 Oct 2018 22:58:48 UTC

Message 167402 in response to message 167400

Sybie wrote:

% Filling array of photon pairs
ERROR: /home/bema/fermilat/src/bridge_fft_clfft.c:948: clFinish failed. status=-36
ERROR: opencl_ts_2_phase_diff_sorted() returned with error 750016800
11:46:57 (12960): [CRITICAL]: ERROR: MAIN() returned with error '-36'
FPU status flags: PRECISION
11:47:09 (12960): [normal]: done. calling boinc_finish(28).
11:47:09 (12960): called boinc_finish

Thanks for your report. It appears to show your 2080 failing on the same type of WU in the same way as does my 2080. This makes it extremely unlikely that the problem is one of a happenstance defect on my personal sample of the 2080. (so I am glad I did not RMA it)

I have trimmed the stderr output you have posted to the portion which seems to contain the clear indication of failure above, and I am reproducing below the strikingly similar corresponding portion of a stderr from one of my "high-pay WU" 2080 failures:

% Filling array of photon pairs
ERROR: /home/bema/fermilat/src/bridge_fft_clfft.c:948: clFinish failed. status=-36
ERROR: opencl_ts_2_phase_diff_sorted() returned with error -82224864
12:04:33 (9900): [CRITICAL]: ERROR: MAIN() returned with error '-36'
FPU status flags: PRECISION
12:04:45 (9900): [normal]: done. calling boinc_finish(28).
12:04:45 (9900): called boinc_finish

Aside from the similarity of the stderr output, a brief review of the task lists for your machine suggests that it usually completes what I have here called "low-pay" WUs successfully, for example those from the LATeah1029L group, while it universally fails the dozens of "high-pay" WUs it has tried so far, from the LATeah0104R and LATeah0104S groups. Further, the "high-pay" failures show elapsed times of 19 to 25 seconds, which is quite close to my observations.

As the "high-pay" type of WU is the current type being distributed here at Einstein, you may have little hope of successful completions other than luckily picking up a few resends of low-pay WUs needing an extra host completion to complete the quorum in cases of failure to reply or failure to meet sanity checks or failure to agree.
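For anyone comparing many failed tasks, the failure line can be pulled out of a stderr dump mechanically. Below is a minimal Python sketch; the function name and the small lookup table are illustrative assumptions. The names in the table are the standard OpenCL constants for those numeric values, with -36 being CL_INVALID_COMMAND_QUEUE, but the sketch says nothing about why the driver returned that status.

```python
import re

# A few standard OpenCL status codes; extend as needed.
OPENCL_ERROR_NAMES = {
    -4: "CL_MEM_OBJECT_ALLOCATION_FAILURE",
    -5: "CL_OUT_OF_RESOURCES",
    -6: "CL_OUT_OF_HOST_MEMORY",
    -36: "CL_INVALID_COMMAND_QUEUE",
}

# Matches lines such as:
# ERROR: /path/file.c:948: clFinish failed. status=-36
FAILURE_RE = re.compile(r"ERROR: (\S+):(\d+): (\w+) failed\. status=(-?\d+)")


def find_opencl_failures(stderr_text):
    """Return one dict per '<call> failed. status=<n>' line found."""
    failures = []
    for match in FAILURE_RE.finditer(stderr_text):
        status = int(match.group(4))
        failures.append({
            "source": match.group(1),
            "line": int(match.group(2)),
            "call": match.group(3),
            "status": status,
            "name": OPENCL_ERROR_NAMES.get(status, "unknown"),
        })
    return failures
```

Run against either of the two excerpts above, it reports a clFinish failure at line 948 of bridge_fft_clfft.c with status -36, which makes it easy to confirm that two hosts are failing at the same place.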

Sybie

Joined: 28 Mar 06

Posts: 5

Credit: 6347019

RAC: 0

Asteroids@home for GPU

23 Oct 2018 6:59:29 UTC

Message 167415

Asteroids@home for GPU says:

<core_client_version>7.14.2</core_client_version>
<![CDATA[
<message>
The system cannot find the file specified.
 (0x2) - exit code 2 (0x2)</message>
<stderr_txt>
CUDA RC12!!!!!!!!!!
CUDA Device number: 0
CUDA Device: GeForce RTX 2080
Compute capability: 7.5
Multiprocessors: 46
Unsupported CC detected (CC2.0 and better supported only).
</stderr_txt>
]]>

Seti@home and Collatz, however, run fine on the GPU.

archae86

Joined: 6 Dec 05

Posts: 3145

Credit: 7059384931

RAC: 1380056

Sybie wrote:CUDA Device

23 Oct 2018 13:54:36 UTC

Message 167419 in response to message 167415

Sybie wrote:

CUDA Device number: 0
CUDA Device: GeForce RTX 2080
Compute capability: 7.5
Multiprocessors: 46
Unsupported CC detected (CC2.0 and better supported only).

That looks like a bug in the Asteroids software, which appears not to interpret the reported compute capability of 7.5 as being "better than CC2.0". That seems likely to be a different problem than the one in running high-pay Einstein WUs.
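As an illustration of what "CC2.0 and better" ought to mean, here is a minimal Python sketch that compares compute capabilities as (major, minor) pairs. How the Asteroids application actually performs its check is not known from this thread, so the function names and the comparison below are assumptions, not a description of its code.

```python
def parse_compute_capability(text):
    """Turn a string such as "7.5" into the tuple (7, 5)."""
    major, minor = text.strip().split(".")
    return int(major), int(minor)


def is_supported(capability, minimum=(2, 0)):
    """True when capability is at least minimum.

    Tuples compare element by element, so (7, 5) >= (2, 0) holds
    and (1, 3) >= (2, 0) does not.
    """
    return capability >= minimum
```

Under this comparison the RTX 2080's reported 7.5 passes the "2.0 or better" test, as one would expect.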
