Latest data file for FGRPB1G GPU tasks

rjs5
Joined: 3 Jul 05
Posts: 32
Credit: 392090969
RAC: 1249783

That is what I am seeing too. My Nvidia 2080ti Founders Edition completed the first successful Einstein execution. Interestingly, it was slightly slower than the EVGA 1080ti at 510 seconds vs 495.

GPU-Z shows TDP power at 80%, GPU temperature at 70 degrees, and 65% GPU load.

Oops. Just got a 2008L task and it failed.

 

archae86 wrote:
Gary Roberts wrote:

A new data file LATeah0104Y.dat came into play more than 12 hours ago.  It has the same size as, and would appear to be a continuation of, a previous series that ended with LATeah0104X.dat (first mentioned in the opening post of this thread).  The tasks based on the new file will most likely crunch faster than the previous 2103L tasks.

Based on the fact that 0104X tasks did fail on Turing GPUs, I imagine the new ones would also fail, unfortunately.

 

As has been typical for that series of data files in the past, 0104Y was only in new issue for a few days.  Now we are getting new work issued from 1041L.  If past behavior for file-name groups holds, these tasks should have long elapsed times and should work correctly on Turing cards with the current applications and drivers.  The data file size is 819,029 bytes, which exactly matches the size Gary Roberts reported for previous groups of tasks in that series.

To re-state our observations on similar-behaving sets of Einstein Gamma-Ray Pulsar tasks:

Filename      Bytes      Elapsed time   Turing
10nnL           819,029  longest        works
0104?         2,720,502  shortest       fails
20nnL         1,935,482  intermediate   fails

These behaviors have held true since Turing cards appeared on Einstein late in September 2018.  The 2103L file run very recently is, for this purpose, classed in the 20nnL group because of the renaming mentioned earlier in this thread.

 

archae86
Joined: 6 Dec 05
Posts: 3144
Credit: 7005934931
RAC: 1851142

rjs5 wrote:
That is what I am seeing too. My Nvidia 2080ti Founders Edition completed the first successful Einstein execution. Interestingly, it was slightly slower than the EVGA 1080ti at 510 seconds vs 495.

Interesting:

User Jan's 2080 Ti host has recently run a few 1041L tasks successfully, with elapsed times around 367 seconds.  There must be an important configuration difference from your machine.

Oliver Behnke
Moderator
Administrator
Joined: 4 Sep 07
Posts: 935
Credit: 25166626
RAC: 0

DanNeely wrote:
I think having a single moderator on point to send "have you seen this" messages to project staff is probably sufficient.

To my understanding this has been the modus operandi up to now, and it has worked quite well from my point of view. I'm not sure where or why this broke down in this particular case. If this process does need some adjustment, we can and should of course talk about that. I'll open a thread on the moderators' mailing list this week.

Cheers,
Oliver

 

Einstein@Home Project

Oliver Behnke
Moderator
Administrator
Joined: 4 Sep 07
Posts: 935
Credit: 25166626
RAC: 0

OK, I presume you guys are longing for any kind of feedback from us, so I'd rather post updates as I/we get them instead of having you wait for a full solution. Please take all these coming updates as preliminary.

  • I'll regard this thread as the canonical one for the RTX 2080 problem, unless advised otherwise (by you)
  • Bernd is currently unable to look into this, so I took over for the time being
  • I'm currently running a task on our GeForce RTX 2080 Ti on 64bit Linux (Driver 410.73 / OpenCL 1.2 CUDA 10.0.185)
  • Workunit: LATeah1041L_180.0_0_0.0_17609907
  • App: hsgamma_FGRPB1G_1.20_x86_64-pc-linux-gnu__FGRPopencl-nvidia OK (runtime 6:12.18)
  • App: hsgamma_FGRPB1G_1.20_x86_64-pc-linux-gnu__FGRPopencl1K-nvidia OK (runtime 6:22.58)
  • Will try Peter's test set next
  • Shot in the dark: error -36 on Windows could indicate a driver timeout issue. You could try increasing the TdrDelay and/or TdrDdiDelay settings to rule that out (see the C sketch below). You could also disable those timeouts entirely by setting TdrDebugMode to 1, but that could lock up your system, so please proceed with caution and be warned.
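
For reference, these TDR values are REG_DWORD entries under HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\GraphicsDrivers. Below is a minimal C sketch (an editorial illustration, not project code) of setting the two delays programmatically; the same values can just as easily be entered by hand in regedit. The 60-second figure is only an illustrative choice, writing to HKLM needs administrator rights, and a reboot is required before the new delays take effect.

/* Sketch: raise the Windows GPU driver timeout (TDR) delays.
 * Link with advapi32.lib.  Windows defaults are TdrDelay = 2 s and TdrDdiDelay = 5 s. */
#include <windows.h>
#include <stdio.h>

int main(void)
{
    HKEY key;
    DWORD tdrDelay    = 60;  /* seconds before a GPU job is considered hung    */
    DWORD tdrDdiDelay = 60;  /* seconds a thread may remain inside the driver  */

    if (RegOpenKeyExA(HKEY_LOCAL_MACHINE,
                      "SYSTEM\\CurrentControlSet\\Control\\GraphicsDrivers",
                      0, KEY_SET_VALUE, &key) != ERROR_SUCCESS) {
        fprintf(stderr, "Could not open the GraphicsDrivers key (run as administrator?)\n");
        return 1;
    }
    RegSetValueExA(key, "TdrDelay",    0, REG_DWORD, (const BYTE *)&tdrDelay,    sizeof tdrDelay);
    RegSetValueExA(key, "TdrDdiDelay", 0, REG_DWORD, (const BYTE *)&tdrDdiDelay, sizeof tdrDdiDelay);
    RegCloseKey(key);
    puts("TDR delays written; reboot for the change to take effect.");
    return 0;
}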

Stay tuned...

 

 

Einstein@Home Project

Oliver Behnke
Moderator
Administrator
Joined: 4 Sep 07
Posts: 935
Credit: 25166626
RAC: 0

Update:

  • I ran Peter's test case on Linux, switching the app only, and it fails there as well - GOOD!
  • The test on Linux seems to fail at the same stage as the Windows apps - GOOD!
  • The Linux app does hang instead of returning an error -> this corroborates my idea that there might be a (protective) timeout involved on Windows which Linux doesn't have/use
  • The underlying issue is thus definitely related to the dataset being analyzed and not a general incompatibility with the RTX 2080 cards
  • I already see a major difference between the two sets so that gives us something to look into

Cheers

 

Einstein@Home Project

rjs5
Joined: 3 Jul 05
Posts: 32
Credit: 392090969
RAC: 1249783

Thanks MUCH for the update. It tells me that you were able to devote some time to this problem AND you are seeing the exact behavior that I am. Very good.

Oliver Behnke wrote:

Update:

  • I ran Peter's test case on Linux, switching the app only, and it fails there as well - GOOD!
  • The test on Linux seems to fail at the same stage as the Windows apps - GOOD!
  • The Linux app does hang instead of returning an error -> this corroborates my idea that there might be a (protective) timeout involved on Windows which Linux doesn't have/use
  • The underlying issue is thus definitely related to the dataset being analyzed and not a general incompatibility with the RTX 2080 cards
  • I already see a major difference between the two sets so that gives us something to look into

Cheers

Keith Myers
Joined: 11 Feb 11
Posts: 4681
Credit: 17489640939
RAC: 6924204

Were there previous failures with Titan V cards on these fast datasets? If so, I would suspect the same mechanism is in play.  Does the science app expect the same architecture for Turing cards as for Pascal?  There is a difference in how those card families report how many cores per SM they have.

Pascal seems to use cores_per_proc = 128

Turing and Titan V seem to use cores_per_proc = 64

So if you are setting up your array with a parameter set that expects 128 cores per SM when what you actually have is 64 cores per SM, then I would think there would be issues.

The BOINC developers already had to change the code in BOINC to properly calculate peak_flops values for the new Turing cards.
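
To illustrate the point, here is a minimal C sketch (an editorial illustration, not BOINC's or the project's actual code) of the kind of compute-capability-to-cores-per-SM table involved, and of how the wrong entry roughly doubles a computed peak-FLOPS figure for a Turing card. The RTX 2080 Ti numbers used (68 SMs, ~1.545 GHz boost clock) are the published reference figures.

/* Sketch: cores per SM by CUDA compute capability, and the peak-FLOPS
 * estimate that depends on it. */
#include <stdio.h>

static int cores_per_sm(int major, int minor)
{
    switch (major * 10 + minor) {
        case 60:          return 64;   /* Pascal GP100                          */
        case 61: case 62: return 128;  /* Pascal GP10x - the 128 quoted above   */
        case 70: case 72: return 64;   /* Volta, including the Titan V          */
        case 75:          return 64;   /* Turing, e.g. RTX 2080 / 2080 Ti       */
        default:          return 128;  /* assumption: fall back to Pascal value */
    }
}

int main(void)
{
    const int    sms      = 68;       /* RTX 2080 Ti SM count                    */
    const double clock_hz = 1.545e9;  /* boost clock                             */
    const double flops    = sms * cores_per_sm(7, 5)   /* 68 * 64 = 4352 cores   */
                          * 2.0                        /* FMA: 2 FLOPs/core/cycle */
                          * clock_hz;

    printf("estimated peak: %.1f TFLOPS\n", flops / 1e12);  /* about 13.4 TFLOPS */
    /* Assuming 128 cores per SM here instead would double the estimate to about
     * 26.9 TFLOPS, and any work sizing derived from it would be off by the same
     * factor. */
    return 0;
}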

 

archae86
Joined: 6 Dec 05
Posts: 3144
Credit: 7005934931
RAC: 1851142

Oliver Behnke wrote:

  • The underlying issue is thus definitely related to the dataset being analyzed and not a general incompatibility with the RTX 2080 cards
  • I already see a major difference between the two sets so that gives us something to look into

Some time ago it occurred to me that, in addition to differing in data and template files, the "good" and "bad" task groups for Turing might differ systematically in one or more input parameters in the project-provided command string, which might influence the bad Turing outcome.

Comparing the input parameter strings for the two groups of tasks, I identified 

Alpha
Delta
skyRadius
IdiBins
Df1dot

as potential candidates, purely on the basis that each had a fixed value within a group which differed between the two groups.

One question I tested was: could I change the behavior of a "good" task to a fast fail by changing a single parameter value from that used in the good group to that used in the bad group?  The answer was YES, and, to my surprise, it was true of two of these five parameters individually: Alpha or Delta.

So the data file plus template file of the troublesome tasks is not necessary to create the Turing fast failure.  Another test was to alter all five of these values on a failing task to their "good group" values.  This did not convert the failing task into a passing task.

Assuming I actually did what I intended to do, it appears to me that the application code responds both to the data and template file input and to at least two of the command line parameters in ways which trip the condition that leads to the Turing-associated fast fail.

I did not include this result in the summary I prepared for project staff recently, and don't think I posted it on the forums here.  Quite likely it is not useful.

Oliver Behnke
Moderator
Administrator
Joined: 4 Sep 07
Posts: 935
Credit: 25166626
RAC: 0

Update:

  • Ran Peter's test case on a Quadro GV100 (Volta): same FAILURE
  • Ran Peter's test case on a GeForce GTX 1080 Ti (Pascal): different ERROR, similar "area"
  • Interestingly, the GTX 1080 Ti and RTX 2080 Ti both sport 11 GB of memory, yet throw errors at different stages

This all paints a pretty clear picture right now. I'm curious which NVIDIA GPUs were able to process this and similar datasets (i.e. all LATeah0xxxy) at all in the past. If you guys know any for sure, please let me know. I'm going to dig through our archives in the meantime. I'm also going to review any potential code changes that might play a role here.

Cheers,
Oliver

 

Einstein@Home Project

archae86
Joined: 6 Dec 05
Posts: 3144
Credit: 7005934931
RAC: 1851142

Oliver,

My test case should have worked on the GTX 1080 Ti.  There are many of them running successfully here at Einstein, including on the group of tasks represented by the test case.

I'm afraid you may have found a flaw in my test case--or at least an imperfection in how portable it is.

In my flotilla, I have GTX 1050, GTX 1060 3GB, GTX 1060 6GB, and GTX 1070 cards--all Pascal cards which correctly run work in the LATeah0104? group.

On the other hand, if on a Windows RTX 2080 Ti machine the test case generates a black screen (and driver restart) about seven seconds after initiation and terminates with the reported error syndrome after about 25 seconds, then for that test case you probably really are seeing the behavior of interest--so perhaps the test case is not completely useless.
