Pascal again available, Turing may be coming soon

mmonnin
Joined: 29 May 16
Posts: 263
Credit: 761,573,143
RAC: 1,420,705

Ugh, that sucks. Is this still the only Turing card to try E@H on these tasks? I don't see any in the top 50 hosts, and I don't believe anyone on my own team has mentioned having one to try. It'd be good to know whether it is an isolated card or an issue with the entire lineup.

22
Joined: 6 Nov 11
Posts: 14
Credit: 510,049,544
RAC: 476,087

Do you see a chance to seriously lower the temps on the GPU? An open case near an open window on a cold morning? I know most cards are advertised as working at 90 °C or more, but on a few occasions I have seen mine compute tasks at 50 °C when it would fail at 60 °C. Good luck with your endeavours anyway!

Juha
Joined: 27 Nov 14
Posts: 49
Credit: 4,914,434
RAC: 38

Aaargh!! When you are editing files to test something, make sure YOU ARE EDITING FILES IN THE RIGHT DIRECTORY!!! I really could have used those two hours on something else.

Ok. Edit init_data.xml and put these lines back in it:

<comm_obj_name>boinc_200</comm_obj_name>
<slot>200</slot>
<client_pid>4</client_pid>

It doesn't really matter what number you use in place of 200, as long as it doesn't accidentally clash with any real tasks. For <client_pid>, pick a process that is going to keep running for as long as the test does. 4 is the System process, which ought to keep running long enough. It is possible that the process needs to be owned by your account, in which case pick something else.
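If you would rather script the edit than do it by hand, here is a minimal Python sketch. It assumes init_data.xml parses as well-formed XML and that the three elements sit directly under the root element; the function name put_lines_back and the trimmed sample document are made up for illustration and are not part of BOINC.

```python
import xml.etree.ElementTree as ET

# The three values come from the post above.
OVERRIDES = {
    "comm_obj_name": "boinc_200",
    "slot": "200",
    "client_pid": "4",
}


def put_lines_back(xml_text: str, overrides: dict) -> str:
    """Set (or add) each element as a direct child of the document root."""
    root = ET.fromstring(xml_text)
    for tag, value in overrides.items():
        node = root.find(tag)
        if node is None:
            # Element is missing, so append it to the root.
            node = ET.SubElement(root, tag)
        node.text = value
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Hypothetical, heavily trimmed stand-in for a real init_data.xml.
    sample = "<app_init_data><gpu_device_num>0</gpu_device_num></app_init_data>"
    print(put_lines_back(sample, OVERRIDES))
```

Work on a copy of the file, since a real init_data.xml contains many more elements than this sample.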

Next, open PowerShell, either in the terminal or in the ISE; both work the same. Run the following command:

$mmf = [System.IO.MemoryMappedFiles.MemoryMappedFile]::CreateOrOpen("shm_boinc_200", 0x2000)

The PowerShell command has "shm" in the name of the shared memory object but init_data.xml doesn't. Not a typo.
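To keep the two names in step, the PowerShell line can be generated from whatever <comm_obj_name> is in init_data.xml. This is only a sketch: the helper names are hypothetical, the "shm_" prefix and the 0x2000 size are simply the ones used in the command above, and the file is again assumed to parse as well-formed XML.

```python
import xml.etree.ElementTree as ET


def shared_memory_name(comm_obj_name: str) -> str:
    # The "shm_" prefix appears only in the shared memory object's name,
    # never in init_data.xml itself.
    return "shm_" + comm_obj_name


def powershell_command(init_data_xml: str, size: int = 0x2000) -> str:
    """Build the CreateOrOpen line that matches <comm_obj_name>."""
    comm_obj_name = ET.fromstring(init_data_xml).findtext("comm_obj_name")
    if comm_obj_name is None:
        raise ValueError("init_data.xml has no <comm_obj_name> element")
    name = shared_memory_name(comm_obj_name)
    return (
        "$mmf = [System.IO.MemoryMappedFiles.MemoryMappedFile]"
        f'::CreateOrOpen("{name}", 0x{size:X})'
    )


if __name__ == "__main__":
    sample = "<app_init_data><comm_obj_name>boinc_200</comm_obj_name></app_init_data>"
    print(powershell_command(sample))
```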

Now try running the test again. This time the app should think it's run by BOINC and it should use <gpu_type>, <gpu_device_num> and <gpu_opencl_dev_index> from init_data.xml.

Once you are done testing clean up the shared memory object with:

$mmf.Dispose()

It's not the end of the world if you forget to run that command. The shared memory object is removed at the next reboot.

archae86
Joined: 6 Dec 05
Posts: 2,673
Credit: 2,353,860,773
RAC: 3,013,004

Thank you Juha.  I think I now have a portable test case.

You mentioned

Juha wrote:
This time the app should think it's run by BOINC and it should use <gpu_type>, <gpu_device_num> and <gpu_opencl_dev_index> from init_data.xml.

but I failed to register the implication.

Using this method the command line directive of the form --device 1 does not take precedence over the pair of lines:

<gpu_device_num>0</gpu_device_num>
<gpu_opencl_dev_index>0</gpu_opencl_dev_index>

in init_data.xml.

Otherwise, the new test directory I created on a second system (which runs a 1060 3GB + a 1050) appears to work (and to fail) properly on my primary system, which runs a 2080 + a 1060. (I just modified the two lines mentioned in init_data.xml as required.)

I intend a further portability test by burning the proposed test directory to a CD, and using it to create a test trial on my third system (which runs only a single 1050).

If all seems well, I intend to look into the suggestions on submitting a driver error report which are shown at the Nvidia Reddit forum. I am pessimistic that the Nvidia team will even look at an end-user report that is so remote from their core gaming market in this busy time after a major generation introduction, but I feel it my duty to make the attempt.

Meanwhile, in order to allow my primary machine to continue doing Einstein work during this period of high-pay work unit distribution, I intend to swap out the 2080 for my 1070 later today.

 

Richard Haselgrove
Joined: 10 Dec 05
Posts: 1,922
Credit: 142,576,924
RAC: 108,255

archae86 wrote:

Using this method the command line directive of the form --device 1 does not take precedence over the pair of lines:

<gpu_device_num>0</gpu_device_num>
<gpu_opencl_dev_index>0</gpu_opencl_dev_index>

in init_data.xml.

Correct. The command line is deprecated and is preserved only for compatibility with historic versions. I can dig out the references if you need them.
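The precedence described here can be pictured with a short Python sketch. It illustrates the behaviour reported in this thread and is not the BOINC API's actual code; select_device is a hypothetical name, and the sketch assumes that a missing init_data.xml (passed as None) is what makes the app fall back to --device.

```python
import argparse
import xml.etree.ElementTree as ET
from typing import List, Optional


def select_device(argv: List[str], init_data_xml: Optional[str]) -> int:
    """Pick a GPU index: init_data.xml wins, --device is only a fallback."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--device", type=int, default=0)
    args, _unknown = parser.parse_known_args(argv)

    if init_data_xml is not None:
        value = ET.fromstring(init_data_xml).findtext("gpu_device_num")
        if value is not None:
            # The value from init_data.xml takes precedence.
            return int(value)
    # Legacy path: no usable init_data.xml, so use the command line.
    return args.device


if __name__ == "__main__":
    xml = "<app_init_data><gpu_device_num>0</gpu_device_num></app_init_data>"
    print(select_device(["--device", "1"], xml))
    print(select_device(["--device", "1"], None))
```

So with the shared memory trick in place, changing the GPU means editing the two lines in init_data.xml rather than the command line.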

archae86
Joined: 6 Dec 05
Posts: 2,673
Credit: 2,353,860,773
RAC: 3,013,004

Juha wrote:
Now that you have set up a way to test the app without BOINC you could reach out to other 2080 owners and ask them to run the test. At least Seti and GPUGRID seem to have people talking about 2080's.

Today I sent PMs to Bruce and -= Vyper =- at SETI asking whether they would be willing and able to try running my test case. Unzipped, it is under a dozen files and under 20 megabytes, so I think I can just copy it to a server where I buy space to post weather data, and point them to it, if either is willing and able.

Sybie
Joined: 28 Mar 06
Posts: 4
Credit: 2,511,692
RAC: 0

Hope I'm in the right place here... I am getting computation errors using an RTX 2080.

I don't know much about OpenCL or programming; I was reading about TDR recovery as a possible way to solve the issue.

I would expect to be able to use BOINC 7.14.2 with NVIDIA Game Ready Driver 416.34 and run the GPU versions of E@H without any additional steps.

Here is my:

archae86
Joined: 6 Dec 05
Posts: 2,673
Credit: 2,353,860,773
RAC: 3,013,004

Sybie wrote:

% Filling array of photon pairs
ERROR: /home/bema/fermilat/src/bridge_fft_clfft.c:948: clFinish failed. status=-36
ERROR: opencl_ts_2_phase_diff_sorted() returned with error 750016800
11:46:57 (12960): [CRITICAL]: ERROR: MAIN() returned with error '-36'
FPU status flags: PRECISION
11:47:09 (12960): [normal]: done. calling boinc_finish(28).
11:47:09 (12960): called boinc_finish

Thanks for your report. It appears to show your 2080 failing on the same type of WU in the same way as my 2080 does. This makes it extremely unlikely that the problem is a happenstance defect in my personal sample of the 2080 (so I am glad I did not RMA it).

I have trimmed the stderr output you have posted to the portion which seems to contain the clear indication of failure above, and I am reproducing below the strikingly similar corresponding portion of a stderr from one of my "high-pay WU" 2080 failures:

% Filling array of photon pairs
ERROR: /home/bema/fermilat/src/bridge_fft_clfft.c:948: clFinish failed. status=-36
ERROR: opencl_ts_2_phase_diff_sorted() returned with error -82224864
12:04:33 (9900): [CRITICAL]: ERROR: MAIN() returned with error '-36'
FPU status flags: PRECISION
12:04:45 (9900): [normal]: done. calling boinc_finish(28).
12:04:45 (9900): called boinc_finish
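For anyone comparing many stderr files, the failing call and its status can be pulled out mechanically. The Python sketch below assumes every failure line has the same "<call> failed. status=<n>" shape as the two excerpts here; the function name is hypothetical, and the table holds only a few status values from the Khronos cl.h header, where -36 is CL_INVALID_COMMAND_QUEUE.

```python
import re
from typing import Optional, Tuple

# A few OpenCL status codes from the Khronos cl.h header; not exhaustive.
CL_STATUS_NAMES = {
    0: "CL_SUCCESS",
    -4: "CL_MEM_OBJECT_ALLOCATION_FAILURE",
    -5: "CL_OUT_OF_RESOURCES",
    -36: "CL_INVALID_COMMAND_QUEUE",
}

# Matches e.g. "clFinish failed. status=-36"
STATUS_RE = re.compile(r"(\w+) failed\. status=(-?\d+)")


def first_opencl_failure(stderr_text: str) -> Optional[Tuple[str, int, str]]:
    """Return (call, status, symbolic name) for the first failure line found."""
    match = STATUS_RE.search(stderr_text)
    if match is None:
        return None
    call, status = match.group(1), int(match.group(2))
    return call, status, CL_STATUS_NAMES.get(status, "unknown")


if __name__ == "__main__":
    line = ("ERROR: /home/bema/fermilat/src/bridge_fft_clfft.c:948: "
            "clFinish failed. status=-36")
    print(first_opencl_failure(line))
```

The differing numbers after opencl_ts_2_phase_diff_sorted() in the two excerpts are deliberately ignored; only the clFinish status is compared.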

Aside from the similarity of the stderr, a brief review of the task lists for your machine suggests that it usually completes what I have here called "low-pay" WUs successfully, for example from the LATeah1029L group, while it has failed every one of the dozens of "high-pay" WUs it has tried so far, from the LATeah0104R and LATeah0104S groups. Further, the "high-pay" failures show elapsed times of 19 to 25 seconds, which is quite close to my observations.

As the "high-pay" type of WU is the one currently being distributed here at Einstein, you may have little hope of successful completions, other than by luckily picking up a few resends of low-pay WUs that need an extra host to complete the quorum after a failure to reply, a failure to meet sanity checks, or a failure to agree.

Sybie
Joined: 28 Mar 06
Posts: 4
Credit: 2,511,692
RAC: 0

Asteroids@home for GPU says:

<core_client_version>7.14.2</core_client_version>
<![CDATA[
<message>
The system cannot find the file specified. (0x2) - exit code 2 (0x2)</message>
<stderr_txt>
CUDA RC12!!!!!!!!!!
CUDA Device number: 0
CUDA Device: GeForce RTX 2080
Compute capability: 7.5
Multiprocessors: 46
Unsupported CC detected (CC2.0 and better supported only).
</stderr_txt>
]]>

Seti@home and Collatz, however, run fine on the GPU.

archae86
Joined: 6 Dec 05
Posts: 2,673
Credit: 2,353,860,773
RAC: 3,013,004

Sybie wrote:
CUDA Device number: 0
CUDA Device: GeForce RTX 2080
Compute capability: 7.5
Multiprocessors: 46
Unsupported CC detected (CC2.0 and better supported only).

That looks like a bug in the Asteroids software, which appears not to interpret the reported compute capability of 7.5 as being "better than CC2.0".  That seems likely to be a different problem than the one in running high-pay Einstein WUs.
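How the Asteroids application actually makes this check is not shown anywhere in this thread, so the following Python sketch is only an illustration of a "CC2.0 and better" test that does accept 7.5. It compares the major and minor numbers as integers; the function names are hypothetical.

```python
from typing import Tuple


def parse_cc(text: str) -> Tuple[int, int]:
    """Split a compute capability such as "7.5" into (major, minor)."""
    major, minor = text.split(".")
    return int(major), int(minor)


def is_supported(cc: str, minimum: str = "2.0") -> bool:
    # Tuples compare element by element, so (7, 5) >= (2, 0) holds.
    return parse_cc(cc) >= parse_cc(minimum)


if __name__ == "__main__":
    for cc in ("1.3", "2.0", "7.5"):
        print(cc, is_supported(cc))
```

A check built on a fixed list of known compute capabilities, or on comparing the version as text, would reject values it has never seen before, which is one way a message like the one quoted could arise.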
