Pascal again available, Turing may be coming soon

Gandolph1
Joined: 20 Feb 05
Posts: 180
Credit: 389399701
RAC: 14362

As near as I can tell, if you have a Turing GPU, Einstein@Home won't run. I've had two different 2080 Tis, and both crashed the video driver and failed all GPU-related tasks. All SETI tasks continue to run perfectly, which has me wondering where the fault lies.

I've heard nothing from Nvidia level 2 support, by the way.

 

MarkJ
Joined: 28 Feb 08
Posts: 437
Credit: 137621151
RAC: 16773

There are rumours of an RTX 2060 being announced at the end of this week and being available mid-January. Given the current issues with the Einstein app and Turing it would be wise to wait.

rjs5
Joined: 3 Jul 05
Posts: 32
Credit: 440342449
RAC: 1549232

I was having problems with PrimeGrid. I reported it to Nvidia Support and they responded immediately to the information I sent. I "think" that I finally isolated it. It appears that the 2080ti caused more heat to be generated by the CPU/GPU than the 1080ti that it replaced. It may have been that the faster 2080ti caused the CPU to run hotter because of the increased work the CPU had to do. That was enough to make my liquid-cooled 7920x Skylake, running default BIOS settings, run hot. I explicitly set the MAX CPU TEMP BIOS setting to 75 degrees and the PrimeGrid problems disappeared.

The only problem that I now have is with the Einstein@Home OpenCL error  -36. From what I have seen from the Khronos documentation the -36 error is a Compile time (driver-independent) error.  The 7920x Intel CPU doesn't have the Intel GPU on chip if that matters. It didn't matter with the 1080ti.

I have turned on the Nvidia Control Panel Help -> Debug Mode option, which sets the board to run at default (not overclocked) speeds, and the Einstein error continues.

 

There was some discussion about some Einstein WU completing successfully. Is anyone with an Nvidia Series 20 board successfully running Windows or Linux Einstein GPU WU now?

 

Stderr output

<core_client_version>7.14.2</core_client_version>
<![CDATA[
<message>
The printer is out of paper.
 (0x1c) - exit code 28 (0x1c)</message>

 

ERROR: /home/bema/fermilat/src/bridge_fft_clfft.c:948: clFinish failed. status=-36
ERROR: opencl_ts_2_phase_diff_sorted() returned with error -1532208864
18:27:33 (9456): [CRITICAL]: ERROR: MAIN() returned with error '-36'
FPU status flags:  PRECISION
18:27:45 (9456): [normal]: done. calling boinc_finish(28).
18:27:45 (9456): called boinc_finish
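
For anyone trying to decode the status value in output like the above, here is a minimal Python sketch. The symbolic names and numbers are a small subset taken from the Khronos `cl.h` header, where -36 is defined as `CL_INVALID_COMMAND_QUEUE`. The function name `decode_opencl_failures` and the regular expression are assumptions made for illustration; they are not part of BOINC or the Einstein app, and the pattern only matches lines shaped like the `clFinish failed. status=-36` line shown here.

```python
import re

# Subset of OpenCL status codes as defined in the Khronos cl.h header.
OPENCL_ERRORS = {
    0: "CL_SUCCESS",
    -4: "CL_MEM_OBJECT_ALLOCATION_FAILURE",
    -5: "CL_OUT_OF_RESOURCES",
    -6: "CL_OUT_OF_HOST_MEMORY",
    -11: "CL_BUILD_PROGRAM_FAILURE",
    -30: "CL_INVALID_VALUE",
    -34: "CL_INVALID_CONTEXT",
    -36: "CL_INVALID_COMMAND_QUEUE",
    -38: "CL_INVALID_MEM_OBJECT",
}

# Hypothetical pattern: matches "<call> failed. status=<number>" as seen in the log above.
STATUS_RE = re.compile(r"(\w+) failed\. status=(-?\d+)")


def decode_opencl_failures(stderr_text):
    """Return (call_name, status, symbolic_name) for each matching log line."""
    results = []
    for match in STATUS_RE.finditer(stderr_text):
        call = match.group(1)
        status = int(match.group(2))
        results.append((call, status, OPENCL_ERRORS.get(status, "UNKNOWN")))
    return results


sample = (
    "ERROR: /home/bema/fermilat/src/bridge_fft_clfft.c:948: "
    "clFinish failed. status=-36"
)
print(decode_opencl_failures(sample))
```

The symbolic name only says what the runtime reported at the `clFinish` call; it does not by itself show whether the root cause lies in the app, the driver, or the hardware.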

 

 

 
Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5845
Credit: 109953189978
RAC: 31409505

rjs5 wrote:
... It appears that the 2080ti caused more heat to be generated by the CPU/GPU than the 1080ti that it replaced. It may have been that the faster 2080ti caused the CPU to run hotter because of the increased work the CPU had to do. That was enough to make my liquid-cooled 7920x Skylake, running default BIOS settings, run hot. I explicitly set the MAX CPU TEMP BIOS setting to 75 degrees and the PrimeGrid problems disappeared.

I'm wondering if there might be a slightly different interpretation to what you have noticed.  I don't know anything about the PrimeGrid app, but if it's anything like the Einstein situation, the app itself may not use much CPU directly.  Indirectly, because of the continuous polling that goes on, the Einstein app does need a full core to be 'used' at all times but my guess is that it doesn't require much extra wattage or generate much heat.  If it did, wouldn't that rather seriously damage the power efficiency advantage that nVidia has over AMD?

In thinking about why your liquid cooled CPU might apparently be having a thermal problem - which you were able to work around by changing 'BIOS' settings - is it possible that the firmware version you use might have some sort of bug that, fortuitously, disappears when you change those settings?  Have you checked that you are actually using the latest available firmware?

I just find it hard to imagine that a modest load like this (which must also have been there with your 1080Ti) could be sufficient to affect a liquid-cooled CPU.

rjs5 wrote:
The only problem that I now have is with the Einstein@Home OpenCL error  -36. From what I have seen from the Khronos documentation the -36 error is a Compile time (driver-independent) error.  The 7920x Intel CPU doesn't have the Intel GPU on chip if that matters. It didn't matter with the 1080ti.

If it were just simply a 'compile time error', why does the app work on every other GPU type, both AMD and nVidia, with the only exception being the brand new Turing series?  My guess (and that's all it is) is that it will eventually be found to be related both to the way the app is coded and to how the new hardware/firmware/driver combination is handling that particular coding.  I rather suspect that nVidia will be able to correct this at the firmware or driver level and that we will just have to wait until that happens.

rjs5 wrote:
There was some discussion about some Einstein WU completing successfully. Is anyone with an Nvidia Series 20 board successfully running Windows or Linux Einstein GPU WU now?

The app doesn't change but the nature of the data does.  Right at the moment, the current data will cause instant failure on Turing GPUs.  From time to time, 'different' data is used which can be processed by Turing GPUs without any problem.  Unfortunately, the current data type is available most of the time, and the 'different' type seems to happen in fairly short bursts.  I started this thread which comments on changes in the type of data and if we again get a data file that can be processed by Turing GPUs, I'll add a comment there.

 

Cheers,
Gary.

rjs5
Joined: 3 Jul 05
Posts: 32
Credit: 440342449
RAC: 1549232

The PrimeGrid app that had problems was genefer18. The genefer app takes input parameters that it uses to test for prime numbers. Genefer 15, 16, 17 and 19 use the same code as 18 and all worked fine. Genefer19 uses a different compute algorithm. All the rest use the same GPU code. They all take about half a CPU, and genefer18 failed about 90 seconds into the run, where power and GPU load were at their maximum. The extra power from the CPU may have been pushing some random failure in the box.

The Series 20 Nvidia Founders Edition boards vent the hot air into the chassis. The 1080ti and earlier FE models vented the heat out the back. The increased heat inside the chassis could be affecting any transistor inside the box, not just the CPU. The first transistor to fail and generate an error might not even be detected.

One of my favorite sayings: "Just because it did not fail, does not mean that it passed."

 

Compile time error ... that is just the way the Khronos OpenCL documentation describes the -36 error. I am not familiar with the OpenCL system. The 2080ti has 4352 Cuda cores. That is 768 more cores than the next largest 1080ti at 3584. Maybe the Einstein GPU app tries to use all the cores and 4352 causes some index or array size to overflow. I don't think the source has been made public so someone could review.

When I run the Einstein GPU WU, it fails at about 20 seconds into the app. I watch GPU load using GPUZ and the GPU is not even being used when the failure is reached according to the GPU load.

I am suspicious of the Einstein app too. It has been randomly generating the Error -36 for several years.  That is why I was asking if anyone was currently having any success with the Series 20 GPU. I can't see my WU history deep enough to see if any of the 2080ti GPU jobs successfully passed. I made it harder to check since I MERGED computers and would  have to look at the stderr.out file to see which GPU was running. Sigh.

 

GPU       Cores   Memory
2080ti    4352    11 GB
2080      2944     8 GB
2070      2304     8 GB
2060      1920     6 GB

1080ti    3584    11 GB
1080      2560     8 GB
1070ti    2432     8 GB
1070      1920     8 GB
1060      1280     6 GB

980ti     2816     6 GB
980       2048     4 GB
970       1664     4 GB
960       1024     2 GB

 

 

 

archae86
Joined: 6 Dec 05
Posts: 3145
Credit: 7055904931
RAC: 1606934

User Richie had the useful idea of combing through the abundant supply of re-issued (_3) tasks on his system looking for Turing cards giving trouble.  He gave me a list of candidates, and after reviewing them, my count of known affected Einstein participant systems currently stands at 27.   As until now all but one of these has come from self-disclosure by participants who made posts on these forums, and Richie's method is only an incomplete snapshot of systems giving trouble quite recently, the real total is surely substantially larger.

A half dozen or more each of the 2070, 2080, and 2080 Ti models are on the list of 27.  So far I am unaware of a 2060 card running on Einstein, but that card has only been shipping to customers very recently.

On a more positive note, Einstein project administrator Oliver Behnke made a post in the new data file thread in which he commented in regard to the Turing on Einstein problem "FYI, we're going to look into this problem as soon as we possibly can."

rjs5
Joined: 3 Jul 05
Posts: 32
Credit: 440342449
RAC: 1549232

Good news from Oliver. Thanks.

Combing through that data seems like a tedious method to surface the information. Too bad someone with access to the Einstein database doesn't write a script that scans and catalogues failures. Seems like this could have been simple and would have identified the problem much earlier than accumulating forum complaints did.
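
A rough idea of what such a scan might look like, as a minimal Python sketch. The record layout (`host_id`, `gpu`, `exit_status`) and the sample rows are entirely made up for illustration; the real Einstein database schema is not shown anywhere in this thread.

```python
from collections import defaultdict


def tally_failing_hosts(task_records):
    """Map GPU model -> set of host IDs that returned at least one failed task."""
    failing = defaultdict(set)
    for rec in task_records:
        # Assumption: a non-zero exit status marks a failed task.
        if rec["exit_status"] != 0:
            failing[rec["gpu"]].add(rec["host_id"])
    return dict(failing)


# Hypothetical sample records, not real project data.
records = [
    {"host_id": 1, "gpu": "RTX 2080 Ti", "exit_status": 28},
    {"host_id": 1, "gpu": "RTX 2080 Ti", "exit_status": 28},
    {"host_id": 2, "gpu": "RTX 2070", "exit_status": 28},
    {"host_id": 3, "gpu": "GTX 1080 Ti", "exit_status": 0},
]

for gpu, hosts in sorted(tally_failing_hosts(records).items()):
    print(gpu, len(hosts))
```

Counting distinct hosts rather than failed tasks keeps one machine that burns through hundreds of tasks from dominating the tally.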

 

archae86
Joined: 6 Dec 05
Posts: 3145
Credit: 7055904931
RAC: 1606934

For some hours now, new tasks issued at Einstein for GPU gamma-ray pulsar jobs have task IDs showing that they use data file 1041L.  Based both on matching the general pattern of file names vs. computation behavior, and on Gary's observation of the data file size in bytes, I expect these tasks to work correctly on Turing cards with the current application and drivers.

Looking through my list of known Einstein Turing hosts, I found that Zack's machine is indeed processing this work successfully.

In the short term, tasks issued to a specific host are likely to include some re-issued 0104Y tasks, which fail on Turing cards.

Examination of the Task list for Zack's machine with emphasis on tasks returned in the last couple of days shows the expected pattern:

2008L tasks fail promptly (about 25 elapsed seconds)

0104Y tasks also fail promptly

1041L tasks run for about ten minutes elapsed time and terminate normally, with dozens of actual validations as of the moment I am typing.

 

archae86
Joined: 6 Dec 05
Posts: 3145
Credit: 7055904931
RAC: 1606934

After posting about Zack's Turing-equipped machine showing current passing and failing tasks consistent with the "good for Turing" and "bad for Turing" task file observation, I looked through the bottom end of my list of Turing hosts known to fail on Einstein, and found about half a dozen which have returned tasks within the past couple of days, all of them showing the predicted task file dependency.  These included at least one sample each of 2070, 2080, and 2080 Ti cards.

If past experience is followed, we can expect 1041L task issue to continue for over a week.  I have no idea what type of data file will be used for work issue after that.

Richie
Joined: 7 Mar 14
Posts: 656
Credit: 1702989778
RAC: 0

1042L is available now. It has the same data file size as mentioned below (information provided by you in another thread).

archae86 wrote:

Filename   Bytes     Elapsed time   Turing

10nnL      819,029   longest        works
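
The file-name pattern reported in this thread can be written down as a small check. This is only a sketch in Python: it encodes the observations posted here (10nnL files work on Turing; 2008L and 0104Y fail) and treats everything else as unknown. The function name `expected_on_turing` is hypothetical, and the rule is an empirical pattern from forum reports, not anything confirmed by the project.

```python
import re

# Observed in this thread: data files named 10nnL (e.g. 1041L, 1042L) work on Turing.
WORKS_PATTERN = re.compile(r"^10\d\dL$")

# Observed in this thread: these specific data files fail promptly on Turing.
KNOWN_FAILING = {"2008L", "0104Y"}


def expected_on_turing(data_file):
    """Return 'works', 'fails', or 'unknown' for a data file label."""
    if WORKS_PATTERN.match(data_file):
        return "works"
    if data_file in KNOWN_FAILING:
        return "fails"
    return "unknown"


for label in ["1041L", "1042L", "2008L", "0104Y", "3001L"]:
    print(label, expected_on_turing(label))
```

Note that 0104Y does not match the 10nnL pattern even though it contains the same digits in a different order, which is why the pattern is anchored at both ends.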
