Pascal again available, Turing may be coming soon

archae86
Joined: 6 Dec 05
Posts: 3,116
Credit: 6,393,151,617
RAC: 3,165,782

I have two bits of news on the Turing highpay WU failure matter.  The first is that I today installed driver 416.81, released a couple of days ago, and the failure remained intact.  The second is some initial progress on work inspired by Vyper's observation of parameter dependence and Richard Haselgrove's suggestion that pursuing this point might be useful.

Richard Haselgrove wrote:
archae86 wrote:
Vyper tried hacking off crudely more than half the parameter string on my test case, and the application then got the GPU going.  But of course it seems unlikely the result would have met requirements, so it is a stretch to say that made it "work".

But if he could file a proper bug report stating which parameters were 'hacked off', that might point a programmer to the area of code which is either incompatible with, or needs re-compiling for, the new hardware. RTX cards aren't going away - and going by previous experience, people will just throw them into a working machine and break it. Einstein will have to get the debugger and the compiler out sooner or later, or suffer the error rate.

My initial measure was to pull into a comparison environment the parameter strings for a half dozen each of highpay and lowpay work units, selected in two batches a month apart.  As I hoped, the great majority of parameter values were identical.  Aside from the input files (data and template), I saw five candidate parameters, all of which appeared to have fixed values within WU type, but different between the two populations.
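The comparison described above can be sketched in a few lines of Python. This is only an illustration under assumptions, not the procedure actually used: it assumes each work unit's parameter string has already been parsed into a dictionary, the function name is hypothetical, and "exampleShared" is a made-up stand-in for the many parameters that were identical everywhere. The alpha and delta values are the ones from the table later in this post.

```python
def discriminating_params(group_a, group_b):
    """Return {name: (value_in_a, value_in_b)} for parameters that hold one
    single value within each group but differ between the two groups."""
    result = {}
    # Only consider parameters present in every work unit of both groups.
    common = set.intersection(*(set(p) for p in group_a + group_b))
    for name in sorted(common):
        values_a = {p[name] for p in group_a}
        values_b = {p[name] for p in group_b}
        if len(values_a) == 1 and len(values_b) == 1 and values_a != values_b:
            result[name] = (values_a.pop(), values_b.pop())
    return result


# Values kept as strings so they are compared exactly as they appear.
highpay = [
    {"alpha": "4.4228137297", "delta": "0.0345036602638", "exampleShared": "1"},
    {"alpha": "4.4228137297", "delta": "0.0345036602638", "exampleShared": "1"},
]
lowpay = [
    {"alpha": "1.41058464281", "delta": "0.444366280137", "exampleShared": "1"},
    {"alpha": "1.41058464281", "delta": "0.444366280137", "exampleShared": "1"},
]
candidates = discriminating_params(highpay, lowpay)
print(sorted(candidates))
```

Parameters that vary inside a group (such as the input files) are deliberately excluded, since only values fixed within a WU type can serve as a type-level suspect.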

So far I have done half the intended work.  I built command lines for a Juha-type portable test case directory for lowpay (which untampered works on all drivers and all cards available to me) for which one single parameter was altered from the original value to that observed on highpay work.

Three of my five candidate parameters sailed through without observed effect; that is, the core and memory clocks elevated promptly, and a bit later GPU usage went high and stayed high.  I terminated each after 90 seconds or so of uneventful running.  Somewhat to my surprise, switching either alpha or delta from the lowpay to the highpay value caused the (otherwise) unaltered lowpay test case to fail quickly in the fashion observed on highpay work: clocks speed up at the usual time, GPU usage kicks up a bit later but drops back within a very few seconds, and the process terminates in a bit over twenty seconds.
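For anyone wanting to repeat the one-parameter-at-a-time approach, here is a minimal Python sketch of how the test variants could be generated. It is an assumption-laden illustration: the function name is hypothetical, parameters are held as plain dictionaries, and turning each variant into an actual command line is left out because the exact switch syntax is not given here. The values are the ones from the table later in this post.

```python
def single_swap_variants(base, donor, names):
    """For each name, copy the base parameter set and replace only that one
    parameter with the donor's value. Returns {name: variant_parameters}."""
    variants = {}
    for name in names:
        variant = dict(base)          # copy, so the base case stays untampered
        variant[name] = donor[name]   # the single altered parameter
        variants[name] = variant
    return variants


lowpay = {"alpha": "1.41058464281", "delta": "0.444366280137",
          "skyRadius": "5.090540e-07", "ldiBins": "30",
          "df1dot": "2.512676418e-15"}
highpay = {"alpha": "4.4228137297", "delta": "0.0345036602638",
           "skyRadius": "5.817760e-08", "ldiBins": "15",
           "df1dot": "2.71528666e-15"}

# Five lowpay test cases, each with exactly one highpay value swapped in.
variants = single_swap_variants(lowpay, highpay, list(lowpay))
for name, params in variants.items():
    print(name, params)
```

Swapping the two arguments gives the reverse series (a highpay base with one lowpay value each).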

I neglected to check whether the failure signature in stderr resembled what I have seen on highpay failures.  I intend to reverse the transformation, trying command lines in which one of the five suspect parameters is changed from the highpay to the lowpay value, on an otherwise untampered highpay test directory.  I currently expect that all five of these will fail, but that a further test case in which I alter both alpha and delta to the lowpay values will cause the highpay test case, thus altered, to continue GPU processing without prompt termination (I don't think you can call this "working" and don't suppose the resulting output files would match a proper result).

I'll work more on this soon.  But, of course, a key question is how I might bring this result to the attention of anyone who might find it interesting or useful.

In case it may be interesting, the five non-file command line parameters I spotted as seeming to differ systematically between highpay and lowpay work were these:

Param       high-pay       low-pay
Alpha      4.4228137297    1.41058464281
Delta      0.0345036602638 0.444366280137
skyRadius  5.817760e-08    5.090540e-07
ldiBins    15              30
df1dot     2.71528666e-15  2.512676418e-15

Obviously my conclusion to date is that the differences in skyRadius, ldiBins, and df1dot don't seem to trigger the problem of interest, while the differences in alpha and delta do.  By extension, it seems likely the problem of interest is not driven by the content of the data file or template file (as I formerly supposed) but rather by the parameters supplied to the application as being the necessary ones to process the data files.

 

[Edited the better part of a day later, to reflect additional results, which contradict my initial conclusions in the previous paragraph.

As planned, I prepared a series of test cases in which I modified the 5 suspect numeric parameters singly on a highpay test case to assume their low-pay values.  I was not surprised that none made any difference (fast fail on the previous signature was uniformly observed).  Then I ran a test case in which I changed BOTH alpha and delta.  The surprising result was that this also failed.  I then ran a test case in which I changed ALL FIVE of my suspect parameters simultaneously.  Still it failed.

Pending additional work, my revised opinion is that either alpha or delta can trigger the bad behavior on an otherwise safe lowpay test case, but that it seems likely that contrary to my conclusion yesterday, the content of the data file, the template file, or both may also be able to trigger the bad behavior.  An alternate possibility is that there is a triggering parameter value possible from a parameter not on my "suspected five" list.

I did check to see whether the stderr resulting when I converted a "good" lowpay test case to bad behavior just by changing the supplied value of alpha closely resembled my real-world failure cases.  It did.

As I've been visually monitoring MSIAfterburner graphs as part of assessing failure cases, let me be a bit more specific about the usual behavior in the failing cases.

Less than a second after I hit "enter" on the command line, both the core clock and memory clock go from low idle to high values.  GPU usage remains negligible until about 5 seconds elapsed time, when it quickly rises, and just as quickly falls, reaching negligible values before 10 seconds elapsed time.  The core and memory clocks remain far above idle levels, and finally at elapsed time 22 seconds the application terminates (which is visible in the command line window as advance to a fresh prompt).  The first observable difference between passing and failing cases occurs when the GPU usage begins to fall instead of staying high.  On my system this happens after about 8 seconds elapsed time, although the application does not actually terminate until 22 seconds elapsed time.

end of material added the next day]
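The failure signature described above (GPU usage rising at about 5 seconds and collapsing again before about 10 seconds) could be checked automatically if the usage graph were logged as samples. The following Python sketch assumes a list of (elapsed seconds, GPU usage percent) pairs; the function name, the 50% "busy" threshold, and the 10-second burst limit are all hypothetical choices for illustration, not values taken from any tool.

```python
def classify_run(samples, busy_threshold=50.0, max_burst_seconds=10.0):
    """Classify a run from (elapsed_seconds, gpu_usage_percent) samples.

    "fast-fail":  usage became high, then fell back within max_burst_seconds
    "sustained":  usage became high and did not fall back that quickly
    "never-busy": usage never reached the threshold
    """
    rise = next((t for t, u in samples if u >= busy_threshold), None)
    if rise is None:
        return "never-busy"
    drop = next((t for t, u in samples if t > rise and u < busy_threshold), None)
    if drop is not None and drop - rise <= max_burst_seconds:
        return "fast-fail"
    return "sustained"


# Made-up sample traces shaped like the behavior described in this post.
failing = [(0, 0), (1, 1), (5, 2), (6, 90), (7, 95), (8, 40), (9, 3), (22, 0)]
passing = [(0, 0), (5, 2), (6, 90), (30, 95), (90, 96)]
print(classify_run(failing), classify_run(passing))
```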

archae86
Joined: 6 Dec 05
Posts: 3,116
Credit: 6,393,151,617
RAC: 3,165,782

archae86 wrote:
it seems likely that contrary to my conclusion yesterday, the content of the data file... may also be able to trigger the bad behavior.

Confirmed.  I took my highpay portable test case, and made the single change of specifying the data file (the switch name is actually "inputfile") to be LATeah1027L.dat--my test case lowpay data file, but changing no other parameters, and not changing the templates file specified.

That single change sufficed to stifle the fast failure behavior.  A corresponding change of only the templates file (the switch name is actually BinaryPointFile) left the fast failure behavior unaltered.  Changing both also gave continued normal running rather than fast failure.

So, to summarize, I've observed that I can convert a lowpay working test case to the fast failure behavior by changing either the specified alpha or delta value to that seen on all of my sample of highpay work.  And I've observed that I can convert highpay from fast failure to continued running by specifying a lowpay input file, but not by simply changing alpha and/or delta to the values used for lowpay work.
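To make the summary easier to reason about, the test cases reported in this thread can be written down as data and checked against a candidate rule. The Python sketch below is one reading of the posts, not an authoritative record: each factor is marked "H" (highpay value) or "L" (lowpay value), the helper names are hypothetical, and the two rules are just simple guesses chosen for illustration.

```python
FACTORS = ["inputfile", "BinaryPointFile", "alpha", "delta",
           "skyRadius", "ldiBins", "df1dot"]
NUMERIC = FACTORS[2:]


def case(base, changed, failed):
    """One test case: all factors at `base` except those listed in `changed`."""
    other = "L" if base == "H" else "H"
    settings = {f: (other if f in changed else base) for f in FACTORS}
    return settings, failed


OBSERVATIONS = (
    [case("L", [], False)]                                   # untampered lowpay runs
    + [case("L", [p], p in ("alpha", "delta")) for p in NUMERIC]
    + [case("H", [], True)]                                  # untampered highpay fails
    + [case("H", [p], True) for p in NUMERIC]
    + [case("H", ["alpha", "delta"], True),
       case("H", NUMERIC, True),
       case("H", ["inputfile"], False),
       case("H", ["BinaryPointFile"], True),
       case("H", ["inputfile", "BinaryPointFile"], False)]
)


def mismatches(rule):
    """Observations whose outcome the rule predicts wrongly."""
    return [s for s, failed in OBSERVATIONS if rule(s) != failed]


def rule_file_only(s):
    return s["inputfile"] == "H"


def rule_file_or_position(s):
    return "H" in (s["inputfile"], s["alpha"], s["delta"])


print(len(OBSERVATIONS),
      len(mismatches(rule_file_only)),
      len(mismatches(rule_file_or_position)))  # prints: 17 2 2
```

Under this encoding neither simple rule accounts for all 17 cases, which fits the impression that the trigger is not a single parameter value taken on its own.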

I currently consider this line of investigation on toggling highpay and lowpay work in and out of fast failure finished.  I don't have a bright idea of how or to whom I might usefully report it.  Quite likely it is just a sideshow curiosity.

Keith Myers
Joined: 11 Feb 11
Posts: 3,403
Credit: 11,131,607,967
RAC: 28,887,158

I would think your analysis and testing results should be forwarded to the application developer.  Bernd Machenschalk is listed as the applications developer, and I know he frequents the forums.  Oliver Bock is listed for GPU programming.  He too posts often in the forums.

 

Richie
Joined: 7 Mar 14
Posts: 653
Credit: 1,702,978,594
RAC: 0

There's a new Vulkan developer driver 417.17 with support for Turing cards:

https://developer.nvidia.com/vulkan-driver

Keith Myers
Joined: 11 Feb 11
Posts: 3,403
Credit: 11,131,607,967
RAC: 28,887,158

What's the latest news on Einstein Linux app compatibility with Turing cards?  I'm getting an RTX 2080 tomorrow and assume I will have to cease crunching on that host for both Einstein and GPUGrid.  I believe that I will still be good on MW and know I will be fine for Seti, which is my main project.  Plenty of Turing cards are running well there.

 

archae86
Joined: 6 Dec 05
Posts: 3,116
Credit: 6,393,151,617
RAC: 3,165,782

Keith Myers wrote:
What's the latest news on Einstein Linux app compatibility with Turing cards?  I'm getting an RTX 2080 tomorrow and assume I will have to cease crunching on that host for both Einstein and GPUGrid.

I am unaware of any Linux reports for Einstein compatibility with Turing cards.  This could mean no one has tried, or that they tried and it just works, or it fails and they did not report and we did not notice.  Also, bear in mind that the low-pay work type in current distribution runs just fine on the Turing cards on Windows, and the current density of high-pay resends has dropped to such low levels that one could afford just to ignore the occasional error condition.  This might change a few minutes from now, weeks or months from now, or never.

Please try your 2080 on Einstein.  

Keith Myers
Joined: 11 Feb 11
Posts: 3,403
Credit: 11,131,607,967
RAC: 28,887,158

Yep, I am going to give it a try as all posts of incompatibility with both projects have been with Windows hosts.

 

archae86
Joined: 6 Dec 05
Posts: 3,116
Credit: 6,393,151,617
RAC: 3,165,782

Richie wrote:

There's a new Vulkan developer driver 417.17 with support for Turing cards:

https://developer.nvidia.com/vulkan-driver

I installed 417.17, and my highpay test case failed in the usual prompt way.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5,645
Credit: 92,796,559,145
RAC: 53,393,495

Hi Peter,
Just to bring you some probably unwelcome news :-(.

My early morning work cache top-up script which controls the requesting of new work for all my hosts has detected and saved a new data file 'LATeah0104X.dat' and deployed it across the fleet.  Just based on the name alone, the tasks for this file will be ones that fail on your hardware.

Apart from the name, the size of the file is also indicative.  The new file is the same size as the previous data files that gave rise to tasks that failed on your RTX 2080.
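For readers who want to do a similar check on their own hosts, here is a rough Python sketch of the idea. It is not the script mentioned above: the function name, the "LATeah*.dat" glob pattern, and the notion of passing in a set of suspect file sizes are assumptions made purely for illustration, and the actual sizes are not stated here, so they must be supplied by the user.

```python
from pathlib import Path


def new_data_files(directory, seen_names, suspect_sizes):
    """List data files not seen before, flagging those whose size matches a
    size previously associated with failing tasks.

    Returns a list of (file_name, size_is_suspect) tuples, sorted by name.
    """
    found = []
    for path in sorted(Path(directory).glob("LATeah*.dat")):
        if path.name in seen_names:
            continue  # already known, nothing new to report
        found.append((path.name, path.stat().st_size in suspect_sizes))
    return found


# Example call (directory and size are placeholders to be filled in):
# new_data_files("/path/to/project/dir", {"LATeah1027L.dat"}, {size_in_bytes})
```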

 

Cheers,
Gary.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5,645
Credit: 92,796,559,145
RAC: 53,393,495

Keith Myers wrote:
Yep, I am going to give it a try as all posts of incompatibility with both projects have been with Windows hosts.

Hi Keith,
I'm sure we'd all love to know whether a task based on this latest data file will fail on your new card running under Linux.  Are you still able to give it a try?  Thanks.

Cheers,
Gary.
