As near as I can tell, if you have a Turing GPU, Einstein@Home won't run. I've had two different 2080 Tis and both crashed the video driver and failed all GPU-related tasks. All SETI tasks continue to run perfectly, which has me wondering where the fault lies.
I've heard nothing from Nvidia level 2 support, by the way.
There are rumours of an RTX 2060 being announced at the end of this week and becoming available in mid-January. Given the current issues with the Einstein app and Turing, it would be wise to wait.
I was having problems with PrimeGrid. I reported it to Nvidia Support and they responded immediately; they did not fail to reply to the information I sent. I "think" I have finally isolated it. It appears that the 2080 Ti caused more heat to be generated by the CPU/GPU than the 1080 Ti it replaced. It may have been that the faster 2080 Ti caused the CPU to run hotter because of the increased work the CPU had to do. That was enough to make my liquid-cooled 7920X (Skylake), running default BIOS settings, run hot. I explicitly set the MAX CPU TEMP BIOS setting to 75 degrees and the PrimeGrid problems disappeared.
The only problem I now have is with the Einstein@Home OpenCL error -36. From what I have seen of the Khronos documentation, the -36 error is listed as a compile-time (driver-independent) error. The 7920X doesn't have an on-chip Intel GPU, if that matters; it didn't matter with the 1080 Ti.
I have turned on the Nvidia Control Panel Help -> Debug Mode option, which sets the board to run at default rather than OC speeds, and the Einstein error continues.
There was some discussion about some Einstein WUs completing successfully. Is anyone with an Nvidia Series 20 board successfully running Windows or Linux Einstein GPU WUs now?
Stderr output
<core_client_version>7.14.2</core_client_version>
<![CDATA[
<message>
The printer is out of paper.
(0x1c) - exit code 28 (0x1c)</message>
ERROR: /home/bema/fermilat/src/bridge_fft_clfft.c:948: clFinish failed. status=-36
ERROR: opencl_ts_2_phase_diff_sorted() returned with error -1532208864
18:27:33 (9456): [CRITICAL]: ERROR: MAIN() returned with error '-36'
FPU status flags: PRECISION
18:27:45 (9456): [normal]: done. calling boinc_finish(28).
18:27:45 (9456): called boinc_finish
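For reference: in the OpenCL headers, status -36 is CL_INVALID_COMMAND_QUEUE, and in this log it is returned at run time by clFinish() rather than by kernel compilation. The "printer is out of paper" line appears to be nothing more than Windows' standard message text for code 28 (0x1c), printed alongside the exit code. Below is a minimal sketch, not the Einstein@Home source (which is not public), of the kind of check the failing call in bridge_fft_clfft.c presumably performs:

#include <stdio.h>
#include <CL/cl.h>

/* Wait for all queued work and decode the status the way the stderr above
   reports it. CL_INVALID_COMMAND_QUEUE (-36) usually means the queue or its
   context died underneath the app (for example after a driver reset), not
   that a kernel failed to compile. */
static int finish_or_report(cl_command_queue queue)
{
    cl_int status = clFinish(queue);            /* blocks until the queue drains */
    if (status != CL_SUCCESS) {
        fprintf(stderr, "ERROR: clFinish failed. status=%d\n", status);
        if (status == CL_INVALID_COMMAND_QUEUE) /* -36 */
            fprintf(stderr, "command queue no longer valid (lost context / driver reset?)\n");
        return -1;
    }
    return 0;
}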
rjs5 wrote:
... It appears that the 2080 Ti caused more heat to be generated by the CPU/GPU than the 1080 Ti it replaced. It may have been that the faster 2080 Ti caused the CPU to run hotter because of the increased work the CPU had to do. That was enough to make my liquid-cooled 7920X (Skylake), running default BIOS settings, run hot. I explicitly set the MAX CPU TEMP BIOS setting to 75 degrees and the PrimeGrid problems disappeared.
I'm wondering if there might be a slightly different interpretation to what you have noticed. I don't know anything about the PrimeGrid app, but if it's anything like the Einstein situation, the app itself may not use much CPU directly. Indirectly, because of the continuous polling that goes on, the Einstein app does need a full core to be 'used' at all times but my guess is that it doesn't require much extra wattage or generate much heat. If it did, wouldn't that rather seriously damage the power efficiency advantage that nVidia has over AMD?
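To illustrate the polling point, here is a rough sketch, illustrative only and not taken from the project's code, of the difference between spin-polling an OpenCL event (which keeps a CPU core nominally 100% busy while doing very little real work) and a blocking wait:

#include <CL/cl.h>

/* Spin-polling: the core shows as fully 'used', but it mostly just re-reads
   the event status, so the extra wattage and heat are small. */
static void wait_by_polling(cl_event ev)
{
    cl_int status;
    do {
        clGetEventInfo(ev, CL_EVENT_COMMAND_EXECUTION_STATUS,
                       sizeof(status), &status, NULL);
    } while (status != CL_COMPLETE);
}

/* Blocking wait: near-zero CPU use, but adds latency between GPU kernels. */
static void wait_by_blocking(cl_event ev)
{
    clWaitForEvents(1, &ev);
}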
In thinking about why your liquid cooled CPU might apparently be having a thermal problem - which you were able to work around by changing 'BIOS' settings - is it possible that the firmware version you use might have some sort of bug that, fortuitously, disappears when you change those settings? Have you checked that you are actually using the latest available firmware?
I just find it hard to imagine that a modest load like this (which must also have been there with your 1080 Ti) could be sufficient to affect a liquid-cooled CPU.
rjs5 wrote:
The only problem I now have is with the Einstein@Home OpenCL error -36. From what I have seen of the Khronos documentation, the -36 error is listed as a compile-time (driver-independent) error. The 7920X doesn't have an on-chip Intel GPU, if that matters; it didn't matter with the 1080 Ti.
If it were simply a 'compile-time error', why does the app work on every other GPU type, both AMD and nVidia, with the only exception being the brand-new Turing series? My guess (and that's all it is) is that it will eventually be found to be related both to the way the app is coded and to how the new hardware/firmware/driver combination is handling that particular coding. I rather suspect that nVidia will be able to correct this at the firmware or driver level and that we will just have to wait until that happens.
rjs5 wrote:
There was some discussion about some Einstein WUs completing successfully. Is anyone with an Nvidia Series 20 board successfully running Windows or Linux Einstein GPU WUs now?
The app doesn't change but the nature of the data does. Right at the moment, the current data will cause instant failure on Turing GPUs. From time to time, 'different' data is used which can be processed by Turing GPUs without any problem. Unfortunately, the current data type is available most of the time, and the 'different' type seems to happen in fairly short bursts. I started this thread which comments on changes in the type of data and if we again get a data file that can be processed by Turing GPUs, I'll add a comment there.
The PrimeGrid app that had problems was genefer18. The genefer app takes input parameters that it uses to test for prime numbers. Genefer 15, 16, 17 and 19 all worked fine; genefer19 uses a different compute algorithm, while 15, 16 and 17 use the same GPU code as genefer18. They all take about half a CPU core, and genefer18 failed about 90 seconds into the run, at the point where power draw and GPU load were at their maximum. The extra power from the CPU may have been pushing some random failure in the box.
The Series 20 Nvidia Founders Edition boards vent the hot air into the chassis; the 1080 Ti and earlier FE models vented the heat out the back. The increased heat inside the chassis could be affecting any transistor inside the box, not just the CPU, and the first transistor to fail and generate an error might not even be detected.
One of my favorite sayings: "Just because it did not fail, does not mean that it passed."
Compile-time error ... that is just how the Khronos OpenCL documentation describes the -36 error. I am not familiar with the OpenCL system. The 2080 Ti has 4352 CUDA cores. That is 768 more than the next-largest card, the 1080 Ti at 3584. Maybe the Einstein GPU app tries to use all the cores and 4352 causes some index or array size to overflow. I don't think the source has been made public so that someone could review it.
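As a purely hypothetical illustration of the kind of overflow speculated about here (the Einstein source is not public, so this is not its code, and OpenCL reports compute units rather than CUDA cores), a host buffer sized with a build-time ceiling can be overrun once a launch is scaled from a device query on a larger GPU:

#include <CL/cl.h>

#define SLOTS_ASSUMED 2048                 /* hypothetical ceiling baked in at build time */
static float host_results[SLOTS_ASSUMED];

/* Scale the number of result slots from the device, then index a fixed-size
   host array with it. A new, larger GPU pushes 'slots' past SLOTS_ASSUMED
   and the loop writes off the end of host_results. */
static void collect_results(cl_device_id dev)
{
    cl_uint units = 0;
    clGetDeviceInfo(dev, CL_DEVICE_MAX_COMPUTE_UNITS,
                    sizeof(units), &units, NULL);

    size_t slots = (size_t)units * 32;     /* e.g. 32 work-groups per compute unit */
    for (size_t i = 0; i < slots; i++)
        host_results[i] = 0.0f;            /* overflow when slots > SLOTS_ASSUMED */
}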
When I run the Einstein GPU WU, it fails about 20 seconds into the app. I watch GPU load using GPU-Z, and according to the load reading, the GPU is not even being used when the failure occurs.
I am suspicious of the Einstein app too. It has been randomly generating Error -36 for several years. That is why I was asking if anyone was currently having any success with a Series 20 GPU. I can't see far enough back in my WU history to tell whether any of the 2080 Ti GPU jobs passed, and I made it harder to check since I merged computers and would now have to look at the stderr.out file to see which GPU was running. Sigh.

GPU       Cores   Memory
2080 Ti    4352   11 GB
2080       2944    8 GB
2070       2304    8 GB
2060       1920    6 GB
1080 Ti    3584   11 GB
1080       2560    8 GB
1070 Ti    2432    8 GB
1070       1920    8 GB
1060       1280    6 GB
980 Ti     2816    6 GB
980        2048    4 GB
970        1664    4 GB
960        1024    2 GB
User Richie had the useful idea of combing through the abundant supply of re-issued (_3) tasks on his system, looking for Turing cards giving trouble. He gave me a list of candidates, and after reviewing them, my count of known affected Einstein participant systems currently stands at 27. Since, until now, all but one of these came from self-disclosure by participants posting on these forums, and Richie's method is only an incomplete snapshot of systems giving trouble quite recently, the real total is surely substantially larger.
A half dozen or more each of the 2070, 2080, and 2080 Ti models are on the list of 27. So far I am unaware of a 2060 card running on Einstein, but that card has only been shipping to customers very recently.
On a more positive note, Einstein project administrator Oliver Behnke made a post in the new data file thread in which he commented, in regard to the Turing on Einstein problem: "FYI, we're going to look into this problem as soon as we possibly can."
Good news from Oliver. Thanks.
Combing through that data seems like a tedious way to surface the information. Too bad someone with access to the Einstein database doesn't write a script that scans and catalogues failures. It seems like that could have been simple, and could have identified the problem much earlier than waiting for forum complaints to accumulate.
For some hours now, new tasks issued at Einstein for GPU gamma-ray pulsar jobs have had task IDs showing that they use data file 1041L. Both from the general pattern of file names vs. computation behavior, and from Gary's observation of the data file size in bytes, I expect these tasks to work correctly on Turing cards with the current application and drivers.
Looking through my list of known Einstein Turing hosts, I found that Zack's machine is indeed processing this work successfully.
In the short term, tasks issued to a specific host are likely to include some re-issued 0104Y tasks, which fail on Turing cards.
Examination of the task list for Zack's machine, with emphasis on tasks returned in the last couple of days, shows the expected pattern:
2008L tasks fail promptly (about 25 elapsed seconds)
0104Y tasks also fail promptly
1041L tasks run for about ten minutes elapsed time and terminate normally, with dozens of actual validations as of the moment I am typing.
After posting about Zack's Turing-equipped machine showing current passing and failing tasks consistent with the "good for Turing" and "bad for Turing" data-file observation, I looked through the bottom end of my known-bad Einstein Turing host list and found about half a dozen that have returned tasks within the past couple of days, all of them showing the predicted data-file dependency. These included at least one sample each of 2070, 2080, and 2080 Ti cards.
If past experience holds, we can expect 1041L task issue to continue for over a week. I have no idea what type of data file will be used for work issued after that.
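Since identifying affected tasks currently means eyeballing task names by hand, here is a rough sketch of how the data-file token could be picked out of a task name programmatically. The "LATeah" prefix and the exact name layout are assumptions for illustration, not taken from project documentation:

#include <stdio.h>
#include <string.h>

/* Copy the "1041L"-style token from a task name such as "LATeah1041L_..."
   into buf; returns buf, or NULL if the assumed pattern isn't found. */
static const char *datafile_token(const char *task, char *buf, size_t len)
{
    const char *p = strstr(task, "LATeah");
    if (!p) return NULL;
    p += strlen("LATeah");
    size_t n = strcspn(p, "_");            /* token runs up to the first '_' */
    if (n == 0 || n >= len) return NULL;
    memcpy(buf, p, n);
    buf[n] = '\0';
    return buf;
}

int main(void)
{
    char tok[16];
    /* hypothetical task name, for illustration only */
    const char *name = "LATeah1041L_1092.0_0_0.0_12345678_1";
    if (datafile_token(name, tok, sizeof tok))
        printf("%s -> data file %s\n", name, tok);
    return 0;
}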