I still think it's the app. At SETI we have a coder who bought a Turing card and tweaked the code while adapting it to CUDA 9; the latest version is for CUDA 10 only, under Linux. Then a second coder tweaked it further to make it backward compatible with older cards. So I believe that is why they are having success where others are failing.
That may also be true, as NVidia implements OpenCL on top of CUDA .... in any case there may be several problems coexisting. The surprise would be if NVidia coded their own CUDA wrong, LOL!
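For anyone who wants to see that layering for themselves, here is a small stand-alone query (hypothetical illustration code, nothing to do with the Einstein app). On an NVidia host the OpenCL platform reports itself as "NVIDIA CUDA", and CL_DRIVER_VERSION reports the installed driver version, which is handy when checking the driver numbers discussed later in the thread:

/* Hypothetical stand-alone OpenCL query; build with e.g. gcc query.c -lOpenCL.
 * On an NVidia host the platform name comes back as "NVIDIA CUDA" and
 * CL_DRIVER_VERSION as the installed driver version string. */
#include <CL/cl.h>
#include <stdio.h>

int main(void)
{
    cl_platform_id platform;
    cl_device_id device;
    char name[256], driver[64];

    /* The first platform is assumed to be the NVidia one on a single-GPU box. */
    if (clGetPlatformIDs(1, &platform, NULL) != CL_SUCCESS) {
        fprintf(stderr, "no OpenCL platform found\n");
        return 1;
    }
    clGetPlatformInfo(platform, CL_PLATFORM_NAME, sizeof name, name, NULL);

    if (clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL) != CL_SUCCESS) {
        fprintf(stderr, "no GPU device on this platform\n");
        return 1;
    }
    clGetDeviceInfo(device, CL_DRIVER_VERSION, sizeof driver, driver, NULL);

    printf("platform: %s, driver: %s\n", name, driver);
    return 0;
}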
Cheers, Mike
I have made this letter longer than usual because I lack the time to make it shorter ...
... and my other CPU is a Ryzen 5950X :-) Blaise Pascal
Mike, remember that this thread started with task failures - tasks from one set of data files ran successfully on Turing cards, while tasks from a different set of data files crashed (under Windows) or spun wheels without making progress (under Linux).
That doesn't feel (to me) like a gross driver compatibility problem. It's more subtle than that. The nearest equivalent I can think of is the launch of NVidia Fermi cards at SETI in 2010. Ultimate bottom line: the developers of the previous application (NVidia themselves!) had used a simplified assumption as an optimisation, but the specification was tightened up for Fermi and the assumption was no longer valid. Again, some task types succeeded, but most failed. I'll see if I can dig up a reference.
I really think somebody has to look at the application and the OpenCL components together, and see why the app triggers the crash under defined circumstances - at least we have the error messages for Windows.
Edit - found it.
http://developer.download.nvidia.com/compute/cuda/3_2_prod/toolkit/docs/Fermi_Compatibility_Guide.pdf
Search for the keyword 'volatile'.
Thanks for the comments Richard. The Fermi compiler optimisation that caused data incoherency amongst threads within a warp is a very interesting case indeed. Threads of differing indices could not read the logically correct/intended intermediate values from global memory, because the compiler kept them away from it! One has to add the 'volatile' keyword to force write-backs and re-reads. Subtle indeed, as this 'optimisation' would not apply for thread counts of more than a warp's worth, i.e. 32. That, in turn, depends upon the problem setup - the kernel environment, as it were.
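For concreteness, here is a minimal sketch of the well-known warp-synchronous reduction pattern that this optimisation broke (a hypothetical fragment in the style of NVidia's classic reduction sample, using shared memory; it is not taken from the Einstein app):

#include <cstdio>
#include <cuda_runtime.h>

// Final warp of a parallel sum reduction (threads tid < 32). There is no
// __syncthreads() here: the code relies on the pre-Volta lock-step execution
// of a warp. The 'volatile' qualifier forces every += to write through to
// shared memory and re-read it, instead of caching sdata[tid] in a register -
// remove it and the Fermi compiler could leave sibling threads reading stale
// values.
__device__ void warpReduce(volatile float *sdata, unsigned int tid)
{
    sdata[tid] += sdata[tid + 32];
    sdata[tid] += sdata[tid + 16];
    sdata[tid] += sdata[tid + 8];
    sdata[tid] += sdata[tid + 4];
    sdata[tid] += sdata[tid + 2];
    sdata[tid] += sdata[tid + 1];
}

// One-block sum reduction; assumes blockDim.x == 256 (a power of two >= 64).
__global__ void reduceSum(const float *in, float *out)
{
    __shared__ float sdata[256];
    unsigned int tid = threadIdx.x;
    sdata[tid] = in[blockIdx.x * blockDim.x + tid];
    __syncthreads();

    // Fully synchronised tree reduction down to the last warp.
    for (unsigned int s = blockDim.x / 2; s > 32; s >>= 1) {
        if (tid < s) sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
    if (tid < 32) warpReduce(sdata, tid);    // warp-synchronous tail
    if (tid == 0) out[blockIdx.x] = sdata[0];
}

int main(void)
{
    const int N = 256;
    float h_in[N], h_out = 0.0f;
    for (int i = 0; i < N; ++i) h_in[i] = 1.0f;   // expected sum: 256

    float *d_in, *d_out;
    cudaMalloc(&d_in, N * sizeof(float));
    cudaMalloc(&d_out, sizeof(float));
    cudaMemcpy(d_in, h_in, N * sizeof(float), cudaMemcpyHostToDevice);

    reduceSum<<<1, N>>>(d_in, d_out);
    cudaMemcpy(&h_out, d_out, sizeof(float), cudaMemcpyDeviceToHost);
    printf("sum = %f\n", h_out);

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}

The irony for this thread: Volta and Turing's independent thread scheduling invalidates even this volatile-based idiom, and CUDA 9+ expects an explicit __syncwarp() in the tail instead - another case of a tightened specification breaking an old, formerly safe assumption.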
Cheers, Mike.
I have made this letter longer than usual because I lack the time to make it shorter ...
... and my other CPU is a Ryzen 5950X :-) Blaise Pascal
There is another new report of a Turing user seeing the fast-fail problem. I took the opportunity to compile a list of hosts/users reporting problems running Einstein GRP on Turing cards. I spotted eleven from forum posts and may have missed some (I have added two more reported by Yeti after the initial post). One of these did not report to these forums and was discovered accidentally, and there are quite likely others.
Account  ID           Host      RTX      OS           Latest Driver
139940   Archae86     12260865  2080     Windows 10   417.17
130556   bcavnaugh    12707326  2080 Ti  Windows 10   417.01
24633    CElliott     12591228  2070     Windows 8.1  416.34
87612    Keith Myers  12291110  2080     Linux        410.78
143397   Penguin      12614077  2080     Windows 10   417.22
Unknown  Anonymous    12711628  2080 Ti  Windows 10   417.22
31398    th3tricky    12735904  2070     Windows 10   417.01
215546   Ouiche       12735193  2080     Windows 10   417.22
232598   Sybie        12578265  2080     Windows 10   417.22
30420    gandolph1    11869044  2080 Ti  Windows 10   416.94
77248    Dougga       12747881  2080     Windows 10   417.01
9428     Yeti         12662252  2080     Windows 10   416.34
2690     csbyseti     712484    2070     Windows 10   416.34
th3tricky wrote: So would ...
Worth a try. It may answer a question about the driver dependency of the error.
Cheers, Mike.
I have made this letter longer than usual because I lack the time to make it shorter ...
... and my other CPU is a Ryzen 5950X :-) Blaise Pascal
Has anyone tried a Windows 7 and RTX combo? Just a thought.
You overlooked my machine: RTX 2080 https://einsteinathome.org/de/host/12662252
csbyseti: RTX 2070: https://einsteinathome.org/de/host/712484
Supporting BOINC, a great concept!
Well, so much for any driver dependency .... :-(
Cheers, Mike.
I have made this letter longer than usual because I lack the time to make it shorter ...
... and my other CPU is a Ryzen 5950X :-) Blaise Pascal
Yes, my computer with the RTX 2070 failed also.
It worked for some weeks without problems, then started to fail at WU startup.
All the other computers (GTX 1080, GTX 970, and ATI RX 570 and Vega 56) worked without problems.
So it's a problem with the new sort of WUs and Turing cards.