Trouble with Gamma-ray pulsar binary search on GPUs 1.22

Kavanagh
Joined: 29 Oct 06
Posts: 1566
Credit: 100416787
RAC: 13076
Topic 219047

Processes fine up to a small percentage (< 5%), then hangs and eventually times out. Log message:

16/06/2019 02:20:48 | Einstein@Home | Aborting task LATeah1049c_180.0_0_0.0_27813443_1: exceeded elapsed time limit 59966.48 (10500000.00G/175.10G)

Drivers up to date.

Richard

Richard Haselgrove
Joined: 10 Dec 05
Posts: 2139
Credit: 2752699717
RAC: 1466726

Please at least say which of your GPUs is having the problem. The NVIDIA Quadro NVS 290 (256MB) probably has too little memory (and an old driver), so I'm guessing it's not that one - but we shouldn't have to guess.

When did you last 'up-to-date' the driver? Has it ever worked since that update?

Oooh, you're getting lovely big crash dumps. That'll put some meat on the bones.

https://einsteinathome.org/task/861772548

Kavanagh
Joined: 29 Oct 06
Posts: 1566
Credit: 100416787
RAC: 13076

NVidia GeForce GTX 1060 6GB

Driver version 430.86

Never worked.

Richard

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5842
Credit: 109387373468
RAC: 35923960

Kavanagh wrote:
Processes fine up to a small percentage (< 5%), then hangs and eventually times out.

That example you gave is quite atypical and not the real problem.  Occasionally, a GPU can just 'spin its wheels' - i.e. get into a state where no progress is being made at all.  As a safeguard, there is a longish time limit and the system eventually just pulls the plug.
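
For reference, the two figures in parentheses in the quoted log line appear to be the task's floating-point operation bound and the client's estimated speed for the app, both in units of 1e9, and dividing the first by the second gives the elapsed time limit. A minimal Python sketch of that arithmetic, assuming that reading of the message (the function name is made up for illustration):

```python
import re

# Log line copied from the first post in this thread.
LOG_LINE = ("16/06/2019 02:20:48 | Einstein@Home | Aborting task "
            "LATeah1049c_180.0_0_0.0_27813443_1: exceeded elapsed time limit "
            "59966.48 (10500000.00G/175.10G)")

def parse_time_limit(line):
    """Return (limit_seconds, bound_giga_ops, speed_giga_ops_per_s) from the line."""
    m = re.search(r"elapsed time limit ([\d.]+) \(([\d.]+)G/([\d.]+)G\)", line)
    if m is None:
        raise ValueError("not an elapsed-time-limit message")
    return tuple(float(x) for x in m.groups())

limit_s, bound_g, speed_g = parse_time_limit(LOG_LINE)

# Assumption: limit = operation bound / estimated speed. The printed speed is
# rounded to two decimals, so the quotient only matches to within a second or so.
recomputed_s = bound_g / speed_g

print(f"reported limit: {limit_s:.0f} s ({limit_s / 3600:.1f} h)")
print(f"bound / speed : {recomputed_s:.0f} s")
```

On these numbers the limit works out to roughly 16.7 hours of elapsed time, which fits the description of a 'longish' safeguard rather than a sign of the underlying fault.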

There is a further atypical example in your tasks list - a task that actually completed successfully and is currently waiting for validation.  That sort of indicates that there might be a borderline hardware issue of some sort that usually rears its ugly head but in this particular case managed not to.

How good is your PSU?  Does it have enough oomph to drive your current GPU?  The fact that you described your GPU as "never worked" seems to suggest that you replaced a different GPU that was working, perhaps??  After all, you got 78M cobblestones from somewhere :-).  It really does help people trying to diagnose issues if you give a few details.  The other thing to do to help eliminate potential issues is to perform and report on a decent memory test of your current RAM.

I'm certainly grateful that Richard is looking at the crash dumps.  It's Windows and I wouldn't have the faintest clue of how to do any of that.

EDIT:  On re-reading and thinking further, I guess that "never worked" seems more correctly to mean "stopped working reliably after the last driver update".  So I now assume the current GPU has been there all along and is not something new.  That doesn't really change anything since the sole successful task does show that the current driver is capable of working.  The possible 'borderline hardware' diagnosis doesn't really change.

You haven't mentioned tweaking anything.  I assume that all frequencies and voltages are stock standard?

Cheers,
Gary.

Kavanagh
Joined: 29 Oct 06
Posts: 1566
Credit: 100416787
RAC: 13076

Gary Roberts wrote:
You haven't mentioned tweaking anything.  I assume that all frequencies and voltages are stock standard?

Yes.

Boinc seems to have aborted all my GPU tasks now.

Richard

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5842
Credit: 109387373468
RAC: 35923960

Kavanagh wrote:
Boinc seems to have aborted all my GPU tasks now.


You have just two which were aborted by the system due to the time limit being reached.  All the other failures are compute errors, mostly at very close to the 90 sec run time value.  BOINC has no real involvement in this other than to return the 'crash debris' to the project :-).  I hope that you have temporarily disabled the GPU search until there is something further to try :-).  Not much point in generating more debris, just yet :-).

To me, there seem to be two particular lines of enquiry.  Hopefully, Richard may find some sort of driver issue when he analyses the crash dumps.  That would be one of those lines and perhaps you may have to revert to whatever driver you were using when you weren't having these problems.

The other line would be to look into possible hardware issues and this is where things get a bit curly.  Can you look on the specs label for your PSU and see what amps rating it has for 12V?  If it isn't mentioned on the case somewhere, you might need to remove the side panel to examine the actual PSU itself.  The make, model and age of the PSU would also be useful.  Ideally, the best way to eliminate the PSU as the potential issue is to replace it with a known good unit and see if the problem vanishes.  Probably not very practical for the average user.
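
As a rough guide to reading the label, the 12V amps rating multiplied by 12 gives the watts available on that rail, which can be compared against the GPU plus everything else drawing from 12V. A minimal Python sketch of that sum follows. The 120 W figure is NVIDIA's reference board power for the GTX 1060 6GB (partner cards can differ), while the 30 A rating and the 150 W for the rest of the system are purely hypothetical placeholders, not values from this host:

```python
# Reference board power for a GTX 1060 6GB; an actual card may draw more.
GPU_BOARD_POWER_W = 120.0

def rail_capacity_w(amps, volts=12.0):
    """Watts available on a rail rated at `amps` amperes."""
    return amps * volts

def headroom_w(rail_amps, other_load_w, gpu_w=GPU_BOARD_POWER_W):
    """Watts left on the 12V rail after the GPU and other 12V loads."""
    return rail_capacity_w(rail_amps) - other_load_w - gpu_w

# Hypothetical inputs: substitute the amps figure from the PSU label and an
# estimate of the CPU, drive and fan load for the real machine.
example_rail_amps = 30.0
example_other_load_w = 150.0

print(f"12V capacity: {rail_capacity_w(example_rail_amps):.0f} W")
print(f"headroom    : {headroom_w(example_rail_amps, example_other_load_w):.0f} W")
```

A comfortable margin on paper does not rule the PSU out, since an ageing unit may no longer deliver its label rating, which is why swapping in a known good unit remains the more conclusive test.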

If you end up opening the case to check the PSU specs, give the interior a good brush out/blow out and check all fans and heat sinks for freedom from clogging dust/fluff and free spinning of the blades.  If everything there seems OK, you might like to remove and re-seat the RAM modules as sometimes these may not have proper contact between the gold pins and the socket contacts.  The action of removing and re-seating may improve that contact.  In any case you should run a program to check for any memory errors.  Richard may have advice on something suitable for Windows.

The compute errors mostly occurring close to the 90 sec run time (~1 min of CPU time) seem a bit suspicious. Maybe Richard might have some thoughts about that.  The very first checkpoint is supposed to be written around the 1 min mark so maybe there's something going on around that time that is somehow involved.  Hopefully Richard might find something that gives a clue.

Cheers,
Gary.

Kavanagh
Joined: 29 Oct 06
Posts: 1566
Credit: 100416787
RAC: 13076

Gary wrote:
I hope that you have temporarily disabled the GPU search until there is something further to try :-)

Disabled.

CPU temperatures seem high. Time to open the box and give it a good clean.

Richard

Richard Haselgrove
Joined: 10 Dec 05
Posts: 2139
Credit: 2752699717
RAC: 1466726

I've had a look through one of the crash dumps, but unfortunately those things are really of most use to the original programmers - especially if they can work with a debug version of the program which lists the names of the program functions involved in the crash.

The problem is a 0xc0000005 Access Violation, and from what I can tell, it's happening at the OS/video driver level, rather than in Einstein code - so it's unlikely to be helpful to drag the Einstein staff developers into this thread. But I hope they might take a look.
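
When searching task output for this error, it may help to know that 0xc0000005 is the Windows STATUS_ACCESS_VIOLATION code, and that the same value can also be printed as a negative decimal when it is treated as a signed 32-bit integer. A small Python sketch for converting between the two forms (the helper names are illustrative only):

```python
# Windows NTSTATUS value for an access violation.
STATUS_ACCESS_VIOLATION = 0xC0000005

def to_signed32(code):
    """Interpret a 32-bit value as a signed integer (two's complement)."""
    code &= 0xFFFFFFFF
    return code - 0x100000000 if code & 0x80000000 else code

def to_unsigned32(code):
    """Interpret a possibly negative exit status as an unsigned 32-bit value."""
    return code & 0xFFFFFFFF

print(to_signed32(STATUS_ACCESS_VIOLATION))   # -1073741819
print(hex(to_unsigned32(-1073741819)))        # 0xc0000005
```

Either form identifies the same failure, so a log showing the decimal value is not a separate problem.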

The last time I advised on one of these, I suggested using a utility like Display Driver Uninstaller to clean out the current video driver, and then doing a clean install of a new driver - perhaps in this case, going back a few versions just in case there's a bug in the current one. It took the user a couple of attempts (deleting both the NVidia and the Intel GPU drivers), but he got it working in the end.

Perhaps give it a quick try (reducing your cache so not too many tasks are downloaded/wasted to start with) after the cleaning is complete and the temperatures are under control. If the problem hasn't gone away after that, try the driver replacement method.
