I still think it's the app. At SETI we have a coder who bought a Turing card and tweaked the code while adapting it to CUDA 9; the latest version is for CUDA 10 only, under Linux. Then a second coder tweaked it further to make it backward compatible with older cards. So I believe that is why they are having success where others are failing.
That may also be true, as NVidia implements OpenCL on top of CUDA .... in any case there may be several problems coexisting. The surprise would be if NVidia coded their own CUDA wrong, LOL!
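For anyone who wants to see that layering for themselves, here is a small stand-alone query (hypothetical illustration code, nothing to do with the Einstein app). On an NVidia host the OpenCL platform reports itself as "NVIDIA CUDA", and CL_DRIVER_VERSION reports the installed driver version, which is handy when checking the driver numbers discussed later in the thread:

/* Hypothetical stand-alone OpenCL query; build with e.g. gcc query.c -lOpenCL.
 * On an NVidia host the platform name comes back as "NVIDIA CUDA" and
 * CL_DRIVER_VERSION as the installed driver version string. */
#include <CL/cl.h>
#include <stdio.h>

int main(void)
{
    cl_platform_id platform;
    cl_device_id device;
    char name[256], driver[64];

    /* The first platform is assumed to be the NVidia one on a single-GPU box. */
    if (clGetPlatformIDs(1, &platform, NULL) != CL_SUCCESS) {
        fprintf(stderr, "no OpenCL platform found\n");
        return 1;
    }
    clGetPlatformInfo(platform, CL_PLATFORM_NAME, sizeof name, name, NULL);

    if (clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL) != CL_SUCCESS) {
        fprintf(stderr, "no GPU device on this platform\n");
        return 1;
    }
    clGetDeviceInfo(device, CL_DRIVER_VERSION, sizeof driver, driver, NULL);

    printf("platform: %s, driver: %s\n", name, driver);
    return 0;
}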
Cheers, Mike
I have made this letter longer than usual because I lack the time to make it shorter ...
... and my other CPU is a Ryzen 5950X :-) Blaise Pascal
Mike, remember that this thread started with task failures - tasks from one set of data files ran successfully on Turing cards, while tasks from a different set of data files crashed (under Windows) or spun wheels without making progress (under Linux).
That doesn't feel (to me) like a gross driver compatibility problem. It's more subtle than that. The nearest equivalent I can think of is the launch of NVidia Fermi cards at SETI in 2010. Ultimate bottom line: the developers of the previous application (NVidia themselves!) had used a simplified assumption as an optimisation, but the specification was tightened up for Fermi and the assumption was no longer valid. Again, some task types succeeded, but most failed. I'll see if I can dig up a reference.
I really think somebody has to look at the application and the OpenCL components together, and see why the app triggers the crash under defined circumstances - at least we have the error messages for Windows.
Edit - found it.
http://developer.download.nvidia.com/compute/cuda/3_2_prod/toolkit/docs/Fermi_Compatibility_Guide.pdf
Search for the keyword 'volatile'.
Thanks for the comments Richard. The Fermi compiler optimisation that caused data incoherency amongst threads within a warp is a very interesting case indeed. Threads of differing indices could not read the logically correct/intended intermediate values from global memory, because the compiler kept them away from it! One has to add the 'volatile' keyword to force write-backs and re-reads. Subtle indeed, as this 'optimisation' would not apply for thread counts of more than a warp's worth, i.e. 32. That, in turn, depends upon the problem setup - the kernel environment, as it were.
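For concreteness, here is a minimal sketch of the well-known warp-synchronous reduction pattern that this optimisation broke (a hypothetical fragment in the style of NVidia's classic reduction sample, using shared memory; it is not taken from the Einstein app):

#include <cstdio>
#include <cuda_runtime.h>

// Final warp of a parallel sum reduction (threads tid < 32). There is no
// __syncthreads() here: the code relies on the pre-Volta lock-step execution
// of a warp. The 'volatile' qualifier forces every += to write through to
// shared memory and re-read it, instead of caching sdata[tid] in a register -
// remove it and the Fermi compiler could leave sibling threads reading stale
// values.
__device__ void warpReduce(volatile float *sdata, unsigned int tid)
{
    sdata[tid] += sdata[tid + 32];
    sdata[tid] += sdata[tid + 16];
    sdata[tid] += sdata[tid + 8];
    sdata[tid] += sdata[tid + 4];
    sdata[tid] += sdata[tid + 2];
    sdata[tid] += sdata[tid + 1];
}

// One-block sum reduction; assumes blockDim.x == 256 (a power of two >= 64).
__global__ void reduceSum(const float *in, float *out)
{
    __shared__ float sdata[256];
    unsigned int tid = threadIdx.x;
    sdata[tid] = in[blockIdx.x * blockDim.x + tid];
    __syncthreads();

    // Fully synchronised tree reduction down to the last warp.
    for (unsigned int s = blockDim.x / 2; s > 32; s >>= 1) {
        if (tid < s) sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
    if (tid < 32) warpReduce(sdata, tid);    // warp-synchronous tail
    if (tid == 0) out[blockIdx.x] = sdata[0];
}

int main(void)
{
    const int N = 256;
    float h_in[N], h_out = 0.0f;
    for (int i = 0; i < N; ++i) h_in[i] = 1.0f;   // expected sum: 256

    float *d_in, *d_out;
    cudaMalloc(&d_in, N * sizeof(float));
    cudaMalloc(&d_out, sizeof(float));
    cudaMemcpy(d_in, h_in, N * sizeof(float), cudaMemcpyHostToDevice);

    reduceSum<<<1, N>>>(d_in, d_out);
    cudaMemcpy(&h_out, d_out, sizeof(float), cudaMemcpyDeviceToHost);
    printf("sum = %f\n", h_out);

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}

The irony for this thread: Volta and Turing's independent thread scheduling invalidates even this volatile-based idiom, and CUDA 9+ expects an explicit __syncwarp() in the tail instead - another case of a tightened specification breaking an old, formerly safe assumption.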
Cheers, Mike.
I have made this letter longer than usual because I lack the time to make it shorter ...
... and my other CPU is a Ryzen 5950X :-) Blaise Pascal
There is another new report of a Turing user seeing the fast-fail problem. I took the opportunity to compile a list of hosts/users reporting problems running Einstein GRP on Turing cards. I spotted eleven from forum posts and may have missed some (I have added two more reported by Yeti after the initial post). One of these did not report to these forums and was discovered accidentally, and there are quite likely others.
Account  ID           Host      RTX      OS           Latest Driver
139940   Archae86     12260865  2080     Windows 10   417.17
130556   bcavnaugh    12707326  2080 Ti  Windows 10   417.01
24633    CElliott     12591228  2070     Windows 8.1  416.34
87612    Keith Myers  12291110  2080     Linux        410.78
143397   Penguin      12614077  2080     Windows 10   417.22
Unknown  Anonymous    12711628  2080 Ti  Windows 10   417.22
31398    th3tricky    12735904  2070     Windows 10   417.01
215546   Ouiche       12735193  2080     Windows 10   417.22
232598   Sybie        12578265  2080     Windows 10   417.22
30420    gandolph1    11869044  2080 Ti  Windows 10   416.94
77248    Dougga       12747881  2080     Windows 10   417.01
9428     Yeti         12662252  2080     Windows 10   416.34
2690     csbyseti     712484    2070     Windows 10   416.34
th3tricky wrote: So would ...
Worth a try. It may answer a question about the driver dependency of the error.
Cheers, Mike.
I have made this letter longer than usual because I lack the time to make it shorter ...
... and my other CPU is a Ryzen 5950X :-) Blaise Pascal
Has anyone tried a Windows 7 and RTX combo? Just a thought.
You overlooked my machine: RTX 2080 https://einsteinathome.org/de/host/12662252
csbyseti: RTX 2070: https://einsteinathome.org/de/host/712484
Supporting BOINC, a great concept!
Well, so much for any driver dependency .... :-(
Cheers, Mike.
I have made this letter longer than usual because I lack the time to make it shorter ...
... and my other CPU is a Ryzen 5950X :-) Blaise Pascal
Yes, my computer with the RTX 2070 failed also.
It worked for some weeks without problems, then started to fail at WU startup.
All the other computers (GTX 1080, GTX 970, and ATI RX 570 and Vega 56) worked without problems.
So it's a problem with the new sort of WUs and Turing cards.