Trouble with Gamma-ray pulsar binary search on GPUs 1.22

Kavanagh

Joined: 29 Oct 06

Posts: 1842

Credit: 102884510

RAC: 12577

16 Jun 2019 9:50:47 UTC

Topic 219047

Processes fine up to a small percentage (< 5 %) and then hangs, and eventually times out. Log message:

16/06/2019 02:20:48 | Einstein@Home | Aborting task LATeah1049c_180.0_0_0.0_27813443_1: exceeded elapsed time limit 59966.48 (10500000.00G/175.10G)

Drivers up to date.

Richard

Richard Haselgrove

Joined: 10 Dec 05

Posts: 2143

Credit: 2949913757

RAC: 689227

16 Jun 2019 10:20:54 UTC

Message 171719

Please at least say which of your GPUs is having the problem. The NVIDIA Quadro NVS 290 (256MB) probably has too little memory (and an old driver), so I'm guessing it's not that one - but we shouldn't have to guess.

When did you last 'up-to-date' the driver? Has it ever worked since that update?

Oooh, you're getting lovely big crash dumps. That'll put some meat on the bones.

https://einsteinathome.org/task/861772548

Kavanagh

Joined: 29 Oct 06

Posts: 1842

Credit: 102884510

RAC: 12577

16 Jun 2019 10:33:13 UTC

Message 171720

NVidia GeForce GTX 1060 6GB

Driver version 430.86

Never worked.

Richard

Gary Roberts

Moderator

Joined: 9 Feb 05

Posts: 5872

Credit: 117246575340

RAC: 36191474

16 Jun 2019 21:48:00 UTC

Message 171734

Kavanagh wrote:

Processes fine up to a small percentage (< 5 %) and then hangs, and eventually times out.

That example you gave is quite atypical and not the real problem. Occasionally, a GPU can just 'spin its wheels' - i.e. get into a state where no progress is being made at all. As a safeguard, there is a longish time limit and the system eventually just pulls the plug.
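For anyone curious where that time limit comes from, here is a minimal sketch that pulls the numbers out of the log message quoted at the top of the thread. It assumes (my reading of the message, not something confirmed in this thread) that the two figures in brackets are the task's work bound and the estimated processing speed, both in units of 10^9, and that the limit is simply one divided by the other. The function name is made up for illustration.

```python
import re

# Log line copied from the first post in this thread.
LOG_LINE = (
    "16/06/2019 02:20:48 | Einstein@Home | Aborting task "
    "LATeah1049c_180.0_0_0.0_27813443_1: exceeded elapsed time limit "
    "59966.48 (10500000.00G/175.10G)"
)

# Assumed layout: "<limit in seconds> (<work bound>G/<speed>G)".
PATTERN = re.compile(
    r"exceeded elapsed time limit (?P<limit>[\d.]+) "
    r"\((?P<bound>[\d.]+)G/(?P<speed>[\d.]+)G\)"
)


def parse_time_limit(line):
    """Return (limit_seconds, bound_g, speed_g) from a time-limit message."""
    match = PATTERN.search(line)
    if match is None:
        raise ValueError("not an elapsed-time-limit message")
    return (
        float(match.group("limit")),
        float(match.group("bound")),
        float(match.group("speed")),
    )


limit_s, bound_g, speed_g = parse_time_limit(LOG_LINE)
ratio_s = bound_g / speed_g  # should land close to the logged limit

print(f"logged limit:  {limit_s:.2f} s ({limit_s / 3600:.1f} h)")
print(f"bound / speed: {ratio_s:.2f} s")
```

The two values agree to within a second or so (the speed figure in the log is rounded), which is consistent with the limit being a generous safety net rather than a sign of what actually went wrong.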

There is a further atypical example in your tasks list - a task that actually completed successfully and is currently waiting for validation. That sort of indicates that there might be a borderline hardware issue of some sort that usually rears its ugly head but in this particular case managed not to.

How good is your PSU? Does it have enough oomph to drive your current GPU? The fact that you described your GPU as "never worked" seems to suggest that you replaced a different GPU that was working, perhaps?? After all, you got 78M cobblestones from somewhere :-). It really does help people trying to diagnose issues if you give a few details. The other thing to do to help eliminate potential issues is to perform and report on a decent memory test of your current RAM.

I'm certainly grateful that Richard is looking at the crash dumps. It's Windows and I wouldn't have the faintest clue of how to do any of that.

EDIT: On re-reading and thinking further, I guess that "never worked" seems more correctly to mean "stopped working reliably after the last driver update". So I now assume the current GPU has been there all along and is not something new. That doesn't really change anything since the sole successful task does show that the current driver is capable of working. The possible 'borderline hardware' diagnosis doesn't really change.

You haven't mentioned tweaking anything. I assume that all frequencies and voltages are stock standard?

Cheers,
Gary.

Kavanagh

Joined: 29 Oct 06

Posts: 1842

Credit: 102884510

RAC: 12577

17 Jun 2019 6:27:42 UTC

Message 171737

Gary Roberts wrote:-

You haven't mentioned tweaking anything. I assume that all frequencies and voltages are stock standard?

Yes.

Boinc seems to have aborted all my GPU tasks now.

Richard

Gary Roberts

Moderator

Joined: 9 Feb 05

Posts: 5872

Credit: 117246575340

RAC: 36191474

17 Jun 2019 8:02:21 UTC

Message 171740 in response to message 171737

Kavanagh wrote:

Boinc seems to have aborted all my GPU tasks now.

You have just two which were aborted by the system due to the time limit being reached. All the other failures are compute errors, mostly at very close to the 90 sec run time value. BOINC has no real involvement in this other than to return the 'crash debris' to the project :-). I hope that you have temporarily disabled the GPU search until there is something further to try :-). Not much point in generating more debris, just yet :-).

To me, there seem to be two particular lines of enquiry. Hopefully, Richard may find some sort of driver issue when he analyses the crash dumps. That would be one of those lines and perhaps you may have to revert to whatever driver you were using when you weren't having these problems.

The other line would be to look into possible hardware issues and this is where things get a bit curly. Can you look on the specs label for your PSU and see what amps rating it has for 12V? If it isn't mentioned on the case somewhere, you might need to remove the side panel to examine the actual PSU itself. The make, model and age of the PSU would also be useful. Ideally, the best way to eliminate the PSU as the potential issue is to swap in a known good unit and see if the problem vanishes. That's probably not very practical for the average user.
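Once you have the 12V amps figure, a rough headroom check is just volts times amps minus what the system draws. The sketch below uses made-up numbers for the rail rating and for the rest of the system; only the 120 W figure is the nominal board power of a reference GTX 1060 6GB, and your particular card may differ. The function names are hypothetical.

```python
def rail_watts(volts, amps):
    """Power available on a rail: P = V * I."""
    return volts * amps


def headroom_watts(rail_amps, gpu_watts, other_watts, volts=12.0):
    """Watts left on the 12V rail after the GPU and everything else."""
    return rail_watts(volts, rail_amps) - (gpu_watts + other_watts)


# Hypothetical example: a 30 A 12V rail, a 120 W GPU (nominal GTX 1060
# figure) and an assumed 150 W for CPU, drives and fans.
example = headroom_watts(rail_amps=30.0, gpu_watts=120.0, other_watts=150.0)
print(f"headroom: {example:.0f} W")
```

A small or negative result would point towards the PSU. A comfortable margin doesn't clear it, though, since an ageing unit may no longer deliver its label rating.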

If you end up opening the case to check the PSU specs, give the interior a good brush out/blow out and check all fans and heat sinks for freedom from clogging dust/fluff and free spinning of the blades. If everything there seems OK, you might like to remove and re-seat the RAM modules as sometimes these may not have proper contact between the gold pins and the socket contacts. The action of removing and re-seating may improve that contact. In any case you should run a program to check for any memory errors. Richard may have advice on something suitable for Windows.

The fact that the compute errors mostly occur at close to the 90 sec run time (~1 min for CPU time) seems a bit suspicious. Maybe Richard might have some thoughts about that. The very first checkpoint is supposed to be written around the 1 min mark so maybe there's something going on around that time that is somehow involved. Hopefully Richard might find something that gives a clue.

Cheers,
Gary.

Kavanagh

Joined: 29 Oct 06

Posts: 1842

Credit: 102884510

RAC: 12577

17 Jun 2019 9:19:52 UTC

Message 171742

Gary wrote:-

I hope that you have temporarily disabled the GPU search until there is something further to try :-)

Disabled.

CPU temperatures seem high. Time to open the box and give it a good clean.

Richard

Richard Haselgrove

Joined: 10 Dec 05

Posts: 2143

Credit: 2949913757

RAC: 689227

17 Jun 2019 10:48:51 UTC

Message 171744

I've had a look through one of the crash dumps, but unfortunately those things are really of most use to the original programmers - especially if they can work with a debug version of the program which lists the names of the program functions involved in the crash.

The problem is a 0xc0000005 Access Violation, and from what I can tell, it's happening at the OS/video driver level, rather than in Einstein code - so it's unlikely to be helpful to drag the Einstein staff developers into this thread. But I hope they might take a look.
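If you see the same failure reported elsewhere as a large negative number, it is the same code: 0xc0000005 read as a signed 32-bit integer is -1073741819. A small sketch for converting between the two forms follows. The helper names are made up for illustration, and the lookup table only contains the one code discussed here (its Windows name is STATUS_ACCESS_VIOLATION).

```python
def to_unsigned32(code):
    """Map a (possibly negative) 32-bit exit code to its unsigned value."""
    return code & 0xFFFFFFFF


def to_signed32(code):
    """Map an unsigned 32-bit value to its two's-complement signed form."""
    code &= 0xFFFFFFFF
    return code - 0x1_0000_0000 if code & 0x8000_0000 else code


# Only the code from this thread; extend as needed.
KNOWN_CODES = {0xC0000005: "STATUS_ACCESS_VIOLATION"}


def describe(code):
    """Return the hex form of an exit code plus its name, if known."""
    unsigned = to_unsigned32(code)
    return f"0x{unsigned:08x} {KNOWN_CODES.get(unsigned, 'unknown')}"


print(describe(-1073741819))
print(to_signed32(0xC0000005))
```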

The last time I advised on one of these, I suggested using a utility like Display Driver Uninstaller to clean out the current video driver, and then doing a clean install of a new driver - perhaps in this case, going back a few versions just in case there's a bug in the current one. It took the user a couple of attempts (deleting both the NVidia and the Intel GPU drivers), but he got it working in the end.

Perhaps give it a quick try (reducing your cache so not too many tasks are downloaded/wasted to start with) after the cleaning is complete and the temperatures are under control. If the problem hasn't gone away after that, try the driver replacement method.
