Pascal again available, Turing may be coming soon

Gandolph1

Joined: 20 Feb 05

Posts: 180

Credit: 389647264

RAC: 933

31 Dec 2018 23:41:37 UTC

Message 168611

As near as I can tell, if you have a Turing GPU, Einstein@Home won't run. I've had two different 2080 Ti cards, and both crashed the video driver and failed all GPU-related tasks. All SETI tasks continue to run perfectly, which has me wondering where the fault lies.

I've heard nothing from Nvidia level 2 support, by the way.

MarkJ

Joined: 28 Feb 08

Posts: 437

Credit: 139002861

RAC: 0

1 Jan 2019 11:00:56 UTC

Message 168615

There are rumours of an RTX 2060 being announced at the end of this week and being available mid-January. Given the current issues with the Einstein app and Turing it would be wise to wait.

BOINC blog

rjs5

Joined: 3 Jul 05

Posts: 32

Credit: 610575093

RAC: 729459

8 Jan 2019 22:46:18 UTC

Message 168752 in response to message 168608

I was having problems with PrimeGrid. I reported it to Nvidia Support and they responded immediately, and they replied to everything I sent. I "think" that I finally isolated it. It appears that the 2080ti caused more heat to be generated by the CPU/GPU than the 1080ti that it replaced. It may have been that the faster 2080ti caused the CPU to run hotter because of the increased work the CPU had to do. That was enough to make my liquid cooled 7920x Skylake, running default BIOS settings, run hot. I explicitly set the MAX CPU TEMP BIOS setting to 75 degrees and the PrimeGrid problems disappeared.

The only problem that I now have is with the Einstein@Home OpenCL error -36. From what I have seen from the Khronos documentation the -36 error is a Compile time (driver-independent) error. The 7920x Intel CPU doesn't have the Intel GPU on chip if that matters. It didn't matter with the 1080ti.

I have TURNED ON the Nvidia Control Panel HELP -> Debug Mode option, which sets the board to run at default (not overclocked) speeds, and the Einstein error continues.

There was some discussion about some Einstein WU completing successfully. Is anyone with an Nvidia Series 20 board successfully running Windows or Linux Einstein GPU WU now?

Stderr output

<core_client_version>7.14.2</core_client_version>
<![CDATA[
<message>
The printer is out of paper.
 (0x1c) - exit code 28 (0x1c)</message>

ERROR: /home/bema/fermilat/src/bridge_fft_clfft.c:948: clFinish failed. status=-36
ERROR: opencl_ts_2_phase_diff_sorted() returned with error -1532208864
18:27:33 (9456): [CRITICAL]: ERROR: MAIN() returned with error '-36'
FPU status flags:  PRECISION
18:27:45 (9456): [normal]: done. calling boinc_finish(28).
18:27:45 (9456): called boinc_finish
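
For reference, the status=-36 reported by clFinish in the log above is the value the Khronos OpenCL header (CL/cl.h) defines as CL_INVALID_COMMAND_QUEUE. Since it is returned by clFinish, it is being reported while the app runs rather than while a kernel is built. Below is a minimal sketch of pulling the status out of a stderr line and naming it. The lookup table is only a small excerpt of the header constants, and the helper name describe_status is made up for illustration.

```python
import re

# Small excerpt of status codes from the Khronos OpenCL header (CL/cl.h).
# Only a handful are listed here; the header defines many more.
CL_STATUS_NAMES = {
    0: "CL_SUCCESS",
    -4: "CL_MEM_OBJECT_ALLOCATION_FAILURE",
    -5: "CL_OUT_OF_RESOURCES",
    -6: "CL_OUT_OF_HOST_MEMORY",
    -30: "CL_INVALID_VALUE",
    -34: "CL_INVALID_CONTEXT",
    -36: "CL_INVALID_COMMAND_QUEUE",
    -38: "CL_INVALID_MEM_OBJECT",
}

STATUS_RE = re.compile(r"status=(-?\d+)")


def describe_status(line):
    """Return (code, name) for a log line containing 'status=<n>', else None."""
    match = STATUS_RE.search(line)
    if match is None:
        return None
    code = int(match.group(1))
    return code, CL_STATUS_NAMES.get(code, "unknown (not in this excerpt)")


# The line is copied from the stderr output above.
log_line = ("ERROR: /home/bema/fermilat/src/bridge_fft_clfft.c:948: "
            "clFinish failed. status=-36")
print(describe_status(log_line))
```

The name alone does not say why the queue became invalid; it only identifies which status the driver handed back.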

Gary Roberts

Moderator

Joined: 9 Feb 05

Posts: 5872

Credit: 117608159709

RAC: 35266044

9 Jan 2019 5:41:38 UTC

Message 168760 in response to message 168752

rjs5 wrote:

... It appears that the 2080ti caused more heat to be generated by the CPU/GPU than the 1080ti that it replaced. It may have been that the faster 2080ti caused the CPU to run hotter because of the increased work the CPU had to do. That was enough to make my liquid cooled 7920x Skylake, running default BIOS settings, run hot. I explicitly set the MAX CPU TEMP BIOS setting to 75 degrees and the PrimeGrid problems disappeared.

I'm wondering if there might be a slightly different interpretation to what you have noticed. I don't know anything about the PrimeGrid app, but if it's anything like the Einstein situation, the app itself may not use much CPU directly. Indirectly, because of the continuous polling that goes on, the Einstein app does need a full core to be 'used' at all times but my guess is that it doesn't require much extra wattage or generate much heat. If it did, wouldn't that rather seriously damage the power efficiency advantage that nVidia has over AMD?

In thinking about why your liquid cooled CPU might apparently be having a thermal problem - which you were able to work around by changing 'BIOS' settings - is it possible that the firmware version you use might have some sort of bug that, fortuitously, disappears when you change those settings? Have you checked that you are actually using the latest available firmware?

I just find it hard to imagine a modest load like this (which must have also been there with your 1080Ti as well) could be sufficient to affect a liquid cooled CPU.
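
To illustrate the difference between a core that shows as fully 'used' because of polling and a thread that simply sleeps until it is signalled, here is a generic sketch. It is not the Einstein app's code: the timer merely stands in for "the GPU has finished", and all names are invented for the example. It compares CPU time consumed while waiting, and says nothing about wattage or heat.

```python
import threading
import time


def cpu_seconds_while_waiting(wait_fn, delay=0.2):
    """Run wait_fn(event) until a timer sets the event; return CPU seconds used."""
    event = threading.Event()
    timer = threading.Timer(delay, event.set)  # stands in for "GPU finished"
    start = time.process_time()
    timer.start()
    wait_fn(event)
    timer.join()
    return time.process_time() - start


def busy_poll(event):
    # Spin, re-checking the flag continuously: occupies a core while waiting.
    while not event.is_set():
        pass


def blocking_wait(event):
    # Sleep until signalled: the waiting thread uses almost no CPU time.
    event.wait()


polled = cpu_seconds_while_waiting(busy_poll)
blocked = cpu_seconds_while_waiting(blocking_wait)
print(f"busy poll: {polled:.3f} s CPU, blocking wait: {blocked:.3f} s CPU")
```

In both cases the wait lasts about the same wall-clock time; only the polling version is charged CPU time for it.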

rjs5 wrote:

The only problem that I now have is with the Einstein@Home OpenCL error -36. From what I have seen from the Khronos documentation the -36 error is a Compile time (driver-independent) error. The 7920x Intel CPU doesn't have the Intel GPU on chip if that matters. It didn't matter with the 1080ti.

If it were just simply a 'compile time error', why does the app work on every other GPU type, both AMD and nVidia, with the only exception being the brand new Turing series? My guess (and that's all it is) is that it will eventually be found to be related both to the way the app is coded and to how the new hardware/firmware/driver combination is handling that particular coding. I rather suspect that nVidia will be able to correct this at the firmware or driver level and that we will just have to wait until that happens.

rjs5 wrote:

There was some discussion about some Einstein WU completing successfully. Is anyone with an Nvidia Series 20 board successfully running Windows or Linux Einstein GPU WU now?

The app doesn't change but the nature of the data does. Right at the moment, the current data will cause instant failure on Turing GPUs. From time to time, 'different' data is used which can be processed by Turing GPUs without any problem. Unfortunately, the current data type is available most of the time, and the 'different' type seems to happen in fairly short bursts. I started this thread which comments on changes in the type of data and if we again get a data file that can be processed by Turing GPUs, I'll add a comment there.

Cheers,
Gary.

rjs5

Joined: 3 Jul 05

Posts: 32

Credit: 610575093

RAC: 729459

9 Jan 2019 21:22:13 UTC

Message 168767 in response to message 168760

The PrimeGrid app that had problems was genefer18. The genefer app takes input parameters that it uses to test for prime numbers. Genefer 15, 16, 17 and 19 all worked fine. Genefer19 uses a different compute algorithm; all the rest use the same GPU code as 18. They all take about half a CPU, and genefer18 failed about 90 seconds into the app, where power and GPU load were at their maximum. The extra power from the CPU may have been pushing some random failure in the box.

The Series 20 Nvidia Founders Edition boards vent the hot air into the chassis. The 1080ti and earlier FE models vented the heat out the back. The increased heat inside the chassis could be affecting any transistor inside the box, not just the CPU. The first transistor to fail and generate an error might not even be detected.

One of my favorite sayings: "Just because it did not fail, does not mean that it passed."

Compile time error ... that is just the way the Khronos OpenCL documentation describes the -36 error. I am not familiar with the OpenCL system. The 2080ti has 4352 CUDA cores. That is 768 more cores than the next largest, the 1080ti at 3584. Maybe the Einstein GPU app tries to use all the cores and 4352 causes some index or array size to overflow. I don't think the source has been made public so that someone could review it.

When I run the Einstein GPU WU, it fails about 20 seconds into the app. I watch GPU load using GPU-Z, and according to the GPU load the GPU is not even being used when the failure is reached.

I am suspicious of the Einstein app too. It has been randomly generating the Error -36 for several years. That is why I was asking if anyone was currently having any success with the Series 20 GPU. I can't see my WU history deep enough to see if any of the 2080ti GPU jobs successfully passed. I made it harder to check since I MERGED computers and would have to look at the stderr.out file to see which GPU was running. Sigh.

GPU      Cores   Memory
2080ti   4352    11 GB
2080     2944    8 GB
2070     2304    8 GB
2060     1920    6 GB

1080ti   3584    11 GB
1080     2560    8 GB
1070ti   2432    8 GB
1070     1920    8 GB
1060     1280    6 GB

980ti    2816    6 GB
980      2048    4 GB
970      1664    4 GB
960      1024    2 GB

archae86

Joined: 6 Dec 05

Posts: 3157

Credit: 7222534931

RAC: 960185

21 Jan 2019 15:24:03 UTC

Message 168998

User Richie had the useful idea of combing through the abundant supply of re-issued (_3) tasks on his system, looking for Turing cards giving trouble. He gave me a list of candidates, and after reviewing them, my count of known affected Einstein participant systems currently stands at 27. Until now, all but one of these came from self-disclosure by participants who made posts on these forums, and Richie's method is only an incomplete snapshot of systems giving trouble quite recently, so the real total is surely substantially larger.

A half dozen or more each of the 2070, 2080, and 2080 Ti models are on the list of 27. So far I am unaware of a 2060 card running on Einstein, but that card has only been shipping to customers very recently.

On a more positive note, Einstein project administrator Oliver Behnke made a post in the new data file thread in which he commented in regard to the Turing on Einstein problem "FYI, we're going to look into this problem as soon as we possibly can."

rjs5

Joined: 3 Jul 05

Posts: 32

Credit: 610575093

RAC: 729459

21 Jan 2019 19:27:47 UTC

Message 168999 in response to message 168998

Good news from Oliver. Thanks.

Combing through that data seems like a tedious method to surface the information. Too bad someone with access to the Einstein database doesn't write a script that scans and catalogues failures. It seems like this could have been simple, and it could have identified the problem much earlier than accumulating Forum complaints did.
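
A minimal sketch of the kind of scan meant here, assuming task results could be exported as records carrying a GPU model and an outcome. The field names (gpu, outcome), the sample records, and the thresholds are all hypothetical; the real Einstein database schema is not known to me.

```python
from collections import Counter


def failure_rates(tasks):
    """Return {gpu_model: (failed, total)} from an iterable of task records."""
    totals, failures = Counter(), Counter()
    for task in tasks:
        totals[task["gpu"]] += 1
        if task["outcome"] != "success":
            failures[task["gpu"]] += 1
    return {gpu: (failures[gpu], totals[gpu]) for gpu in totals}


def suspicious_models(tasks, min_tasks=3, min_rate=0.9):
    """List GPU models with enough tasks and a failure rate >= min_rate."""
    flagged = []
    for gpu, (failed, total) in sorted(failure_rates(tasks).items()):
        if total >= min_tasks and failed / total >= min_rate:
            flagged.append(gpu)
    return flagged


# Made-up sample records, for illustration only.
sample = [
    {"gpu": "RTX 2080 Ti", "outcome": "error"},
    {"gpu": "RTX 2080 Ti", "outcome": "error"},
    {"gpu": "RTX 2080 Ti", "outcome": "error"},
    {"gpu": "GTX 1080 Ti", "outcome": "success"},
    {"gpu": "GTX 1080 Ti", "outcome": "success"},
    {"gpu": "GTX 1080 Ti", "outcome": "error"},
]
print(suspicious_models(sample))
```

Grouping by GPU model rather than by host is a choice made for the sketch; a model that fails nearly every task stands out even when each individual host has returned only a few results.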

archae86

Joined: 6 Dec 05

Posts: 3157

Credit: 7222534931

RAC: 960185

27 Jan 2019 14:46:01 UTC

Message 169142

For some hours now, new tasks issued at Einstein for GPU gamma-ray pulsar jobs have had task IDs showing that they use data file 1041L. Judging both by the general pattern of file names vs. computation behavior and by Gary's observation of the data file size in bytes, I expect these tasks to work correctly on Turing cards with the current application and drivers.

Looking through my list of known Einstein Turing hosts, I found that Zack's machine is indeed processing this work successfully.

In the short term, tasks issued to a specific host are likely to include some re-issued 0104Y tasks, which fail on Turing cards.

Examination of the Task list for Zack's machine with emphasis on tasks returned in the last couple of days shows the expected pattern:

2008L tasks fail promptly (about 25 elapsed seconds)

0104Y tasks also fail promptly

1041L tasks run for about ten minutes elapsed time and terminate normally, with dozens of actual validations as of the moment I am typing.
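
The pattern above can be written down as a small lookup. This is a sketch only: the task-name shape ("LATeah" followed by the data file tag) and the example names are assumptions for illustration, and the rule simply encodes the observations reported in this thread (10nnL works on Turing; 2008L and 0104Y fail), not anything known about the application itself.

```python
import re

# Assumed task-name shape: "LATeah" + 4 digits + 1 letter, then other fields.
DATA_FILE_RE = re.compile(r"LATeah(\d{4}[A-Z])")


def data_file(task_name):
    """Extract the data file tag (e.g. '1041L') from a task name, or None."""
    match = DATA_FILE_RE.search(task_name)
    return match.group(1) if match else None


def expected_on_turing(tag):
    """Encode the thread's observations: 10nnL runs, 2008L and 0104Y fail."""
    if re.fullmatch(r"10\d{2}L", tag):
        return "works"
    if tag in ("2008L", "0104Y"):
        return "fails"
    return "unknown"


# Hypothetical task names, for illustration only.
for name in ("LATeah1041L_example_1", "LATeah0104Y_example_3",
             "LATeah2008L_example_0"):
    tag = data_file(name)
    print(tag, expected_on_turing(tag))
```

Any data file tag not covered by these observations is reported as "unknown" rather than guessed.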

archae86

Joined: 6 Dec 05

Posts: 3157

Credit: 7222534931

RAC: 960185

27 Jan 2019 16:01:27 UTC

Message 169145 in response to message 169142

After posting about Zack's Turing-equipped machine showing current passing and failing tasks consistent with the "good for Turing" and "bad for Turing" data file observation, I looked through the bottom end of my list of Turing hosts known to fail on Einstein, and found about half a dozen which have returned tasks within the past couple of days, all of them showing the predicted data file dependency. These included at least one sample each of 2070, 2080, and 2080 Ti cards.

If past experience holds, we can expect 1041L task issue to continue for over a week. I have no idea what type of data file will be used for work issue after that.

Richie

Joined: 7 Mar 14

Posts: 656

Credit: 1702989778

RAC: 0

31 Jan 2019 2:55:00 UTC

Message 169211 in response to message 169145

1042L is available now. It has the same data file size as mentioned below (information provided by you in another thread).

archae86 wrote:

Filename   Bytes     Elapsed time   Turing
10nnL      819,029   longest        works
