stalled computation

obsidian
Joined: 8 May 05
Posts: 55
Credit: 2250121
RAC: 0
Topic 192926

Sometimes a task will stall, and I can't get it to run. The task's status says "running", but it makes no progress.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5842
Credit: 109410787838
RAC: 35012645

Quote:
Sometimes a task will stall, and I can't get it to run. The task's status says "running", but it makes no progress.

I've observed exactly this several times over the last few weeks and have commented on it towards the end of this message. I have also noticed it just today on another machine. The only difference in my case is that I seem to be able to "un-stall" it rather easily. I'm presuming that your stalled result was the one that you aborted recently?

When it happens, I simply stop BOINC completely. I have it running as a service (or as a daemon on Unix), so I simply stop the service. On restarting the service, the crunching restarts from the last saved checkpoint and there seems to be no further problem.

Bernd has said that he will be releasing a new suite of apps soon and hopefully glitches like this will be sorted out at that point.

The one I found today had been stalled for long enough for the result in progress to have already passed the deadline before I noticed the problem. It has now actually completed crunching and has been successfully validated. The particular result has this result ID.

Cheers,
Gary.

Bikeman (Heinz-Bernd Eggenstein)
Moderator
Joined: 28 Aug 06
Posts: 3522
Credit: 686219439
RAC: 552581

Hi Gary,

The last result you quoted started with the 4.17 application, right? To me it looks as if it had run into a null-pointer problem, but may have stalled while invoking the runtime-debugger?? It was then able to recover from the previous checkpoint with the new app version.

CU

BRM

Udo
Joined: 19 May 05
Posts: 203
Credit: 8945570
RAC: 0

I was running BOINC with E@H on some dual CPU servers (with 2 cores each, that is '4 CPUs' for Einstein).
BOINC is installed as a service of course, with the screensaver switched off.
On 2 of these servers it happened that the BOINC service itself occupied a whole CPU and the application didn't get any CPU time...

Perhaps this is the same situation you notice as 'stalled' on a single CPU?

I could 'resolve' this problem by switching back my BOINC client from 5.8.16 to 5.4.11

EDIT: Servers are Win2003 Ent.Edition SP1

Udo

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5842
Credit: 109410787838
RAC: 35012645

Message 69412 in response to message 69410

Quote:

The last result you quoted started with the 4.17 application, right?

That's correct. 4.17 was the current version on June 18 when the result was first received.

Quote:
To me it looks as if it had run into a null-pointer problem, but may have stalled while invoking the runtime-debugger??

Not being literate in C (or whatever was used) I'm not familiar with null-pointers or runtime debuggers but yes, you can see in the stderr.txt output where an unhandled exception was detected and the Windows runtime debugger was loaded. The debugger announced its version number so I presume it was engaged successfully?? Notice the dump timestamp of June 26 at 00:26:46 local time. At that point the result had been processing around 90+ hours from memory and was about 95% completed.

You will notice that it was restarted today at 13:30 local time, more than a week after it had initially stalled. Maybe the stall was to do with the debugger not being able to proceed?? You will also notice that when it was restarted, 4.17 was initially being used but then I had the bright idea to speed up the final crunching by switching to 4.24. If you scan down the output you will see that this happened at 15:15 local time, after I had set up a hacked app_info.xml to override the 4.17 version that the result was "branded" with in the state file.

Quote:
It was then able to recover from the previous checkpoint with the new app version.

Almost correct :) just exchange "new" for "old" - see above :).

One final point - I've never opened a zip archive containing the actual result itself so I have no idea what it looks like. I did read with interest your comment about this (to Brian I think) elsewhere so no doubt one of these days I'll satisfy my curiosity :). I'm just hoping that there may be information there, rather than in stderr.txt that may be of use to Bernd.

Cheers,
Gary.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5842
Credit: 109410787838
RAC: 35012645

Message 69413 in response to message 69411

Quote:

Perhaps this is the same situation you notice as 'stalled' on a single CPU?

I don't think so, because I recall on one occasion starting the Windows Task Manager to see what was running. Both BOINC and the science app were listed but not really consuming any CPU. The idle process was at 98-99%.

Quote:
I could 'resolve' this problem by switching back my BOINC client from 5.8.16 to 5.4.11

I've only ever needed to stop and restart BOINC, not change version.

In today's episode, the machine in question was headless, keyboardless and mouseless. I could see using BoincMgr on another machine that BOINC had stalled but the way I got direct access to the box was to plug in a USB mouse and a monitor. I have icons on the desktop that invoke small scripts to stop and start the BOINC service so it's quite easy to get things going and even install the app_info.xml with just the USB mouse :).
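The stop and start scripts mentioned above are not shown in the thread. As a rough sketch of the same idea on Windows, the following Python wraps the standard `net stop` / `net start` commands. The service name "BOINC" is an assumption (check the Services console for the name actually registered on the machine), the function names are made up for illustration, and the script needs administrator rights.

```python
import platform
import subprocess


def restart_commands(service_name="BOINC", system=None):
    """Return the stop and start commands for a Windows service.

    service_name is an assumption: use whatever name the BOINC
    service is registered under on the machine in question.
    """
    system = system or platform.system()
    if system != "Windows":
        # Unix daemons are started in distribution-specific ways,
        # so this sketch does not guess at a command for them.
        raise NotImplementedError(f"no restart recipe for {system!r}")
    return [
        ["net", "stop", service_name],
        ["net", "start", service_name],
    ]


def restart_service(service_name="BOINC"):
    """Stop, then start the service; raises CalledProcessError on failure."""
    for command in restart_commands(service_name):
        subprocess.run(command, check=True)
```

Calling `restart_service()` from a desktop shortcut would give roughly the one-click restart described here. Note that `net stop` reports an error if the service is not running, in which case this sketch stops before attempting the start.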

Cheers,
Gary.

Bikeman (Heinz-Bernd Eggenstein)
Moderator
Joined: 28 Aug 06
Posts: 3522
Credit: 686219439
RAC: 552581

Message 69414 in response to message 69412

Quote:
Quote:
while invoking the runtime-debugger??

Not being literate in C (or whatever was used) I'm not familiar with null-pointers or runtime debuggers but yes, you can see in the stderr.txt output where an unhandled exception was detected and the Windows runtime debugger was loaded. The debugger announced its version number so I presume it was engaged successfully?? Notice the dump timestamp of June 26 at 00:26:46 local time.

With the new app, the runtime debugger will produce tons of output, not just a few lines. Maybe the app really stalled while in the debugger.

Quote:

One final point - I've never opened a zip archive containing the actual result itself so I have no idea what it looks like. I did read with interest your comment about this (to Brian I think) elsewhere so no doubt one of these days I'll satisfy my curiosity :). I'm just hoping that there may be information there, rather than in stderr.txt that may be of use to Bernd.

The unzipped result-file is pretty "boring"...it's pure science. If I remember correctly, it's 10,000 rows in plain ASCII, each containing 5 floating point numbers separated by spaces, each row represents one "candidate" for a pulsar (ok, this is probably simplified).

1st column: has something to do with the spinning frequency of the pulsar

2nd & 3rd column : sky coordinates of the candidate, in a longitude/latitude style coordinate system (RA/DEC)

4th column : something about the change in the spinning frequency (most pulsars seem to change their rotation over time, depending on their age)

5th column: the so called "F-Statistic", a numerical value that measures how well the observed data from the detectors matches the hypothesis that there is a gravitational wave coming from a pulsar with the given spinning characteristics and given sky position.
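The layout described above (plain ASCII, five space-separated floating point numbers per row) can be read with a few lines of Python. This is only a sketch based on that description: the field names are invented for illustration, the RA-before-Dec ordering of columns 2 and 3 is an assumption, and units are not stated in the thread.

```python
from typing import Iterable, List, NamedTuple


class Candidate(NamedTuple):
    """One row of the result file; field names are illustrative only."""
    frequency: float   # column 1: related to the spin frequency
    ra: float          # column 2: sky position (assumed to be RA)
    dec: float         # column 3: sky position (assumed to be Dec)
    spindown: float    # column 4: related to the change in spin frequency
    f_stat: float      # column 5: the F-statistic


def parse_candidates(lines: Iterable[str]) -> List[Candidate]:
    """Parse whitespace-separated rows, skipping blank lines."""
    candidates = []
    for line_no, line in enumerate(lines, start=1):
        fields = line.split()
        if not fields:
            continue
        if len(fields) != 5:
            raise ValueError(
                f"line {line_no}: expected 5 columns, got {len(fields)}"
            )
        candidates.append(Candidate(*(float(field) for field in fields)))
    return candidates


def loudest(candidates: List[Candidate], n: int = 3) -> List[Candidate]:
    """Return the n candidates with the largest F-statistic."""
    return sorted(candidates, key=lambda c: c.f_stat, reverse=True)[:n]
```

With an unzipped result file at hand, `parse_candidates(open(path))` followed by `loudest(...)` would list the rows with the highest F-statistic, which is a quick way to get a feel for the contents.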

The final paper on the S3 analysis linked on the E@H home page explains some of this stuff pretty well, even for a physics novice like me :-).

CU

BRM

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5842
Credit: 109410787838
RAC: 35012645

Message 69415 in response to message 69414

Quote:

With the new app, the runtime debugger will produce tons of output, not just a few lines. Maybe the app really stalled while in the debugger.

At the time the debugger was called it wasn't the new app - it was still 4.17, so I guess it was using the symbol information from the .pdb file stored locally. There was presumably supposed to be a whole lot more output in stderr.txt, so I would tend to agree that things were stalled in the debugger.

I guess that this result won't be much use for troubleshooting in that case. Also, since the OP aborted his stalled result (I think) there is no stderr.txt output in his case to see what happened there.

BTW, thanks for the info about the result file structure.

Cheers,
Gary.

Bernd Machenschalk
Moderator
Administrator
Joined: 15 Oct 04
Posts: 4267
Credit: 244933581
RAC: 16281

If your result is actually stalled, it would be a good idea to look into the last line(s) of the stderr.txt in the slot directory of that result, and e.g. post it here.
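For anyone unsure how to grab those last lines, here is a small Python sketch. The helper name is made up, and the example path in the usage note is only an assumption about where a slot directory might sit; substitute the slot of the stalled result on your own machine.

```python
from collections import deque


def tail_lines(path, count=20):
    """Return the last `count` lines of a text file, without newlines."""
    # errors="replace" keeps the read from failing on odd bytes that a
    # crashing application may have written into the log.
    with open(path, "r", encoding="utf-8", errors="replace") as handle:
        return [line.rstrip("\n") for line in deque(handle, maxlen=count)]
```

For example, `print("\n".join(tail_lines("slots/0/stderr.txt")))` run from the BOINC data directory would print the tail of the log for the result in slot 0, ready to paste into a post.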

At the very beginning, "Reading SFTs and setting up stacks..." can take quite some time, depending on the speed of the machine. Another operation that can take pretty long, depending on the history of the run, is resuming from a checkpoint. While these are in progress, no progress counters etc. are updated.
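One way to tell these slow-but-working phases apart from a genuine stall is to watch CPU time as well as the progress counter: a task that is setting up or resuming still burns CPU, whereas the stalled tasks reported earlier in this thread used almost none. The sketch below is a heuristic only, not anything BOINC itself does; the sample structure, the 30-minute window and the one-second CPU threshold are all assumptions chosen for illustration.

```python
from typing import NamedTuple, Sequence


class Sample(NamedTuple):
    """One observation of a task (all fields are hypothetical)."""
    wall_seconds: float    # wall-clock time of the observation
    cpu_seconds: float     # CPU time the task has accumulated so far
    fraction_done: float   # progress as reported, 0.0 to 1.0


def classify(samples: Sequence[Sample],
             min_window: float = 1800.0,
             min_cpu_seconds: float = 1.0) -> str:
    """Classify a task as 'progressing', 'busy', 'stalled' or 'unknown'."""
    if len(samples) < 2:
        return "unknown"
    first, last = samples[0], samples[-1]
    if last.wall_seconds - first.wall_seconds < min_window:
        return "unknown"  # not observed for long enough to judge
    if last.fraction_done > first.fraction_done:
        return "progressing"
    if last.cpu_seconds - first.cpu_seconds >= min_cpu_seconds:
        # No visible progress, but CPU is being used: likely setup
        # or a checkpoint resume, where counters are not updated.
        return "busy"
    return "stalled"
```

A result of "busy" suggests waiting longer, while "stalled" is the case where stopping and restarting BOINC, as described earlier, seems to help.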

Another possibility is that the communication between App, BOINC Client and Manager is broken somewhere in between. In these cases it might be helpful to open the App's graphics, as the progress counter displayed there is independent of the communication with the Core Client.

BM

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5842
Credit: 109410787838
RAC: 35012645

My result was stalled for over a week before I noticed it.

By stopping and restarting the boinc service, crunching resumed immediately and completed without further incident. The result has been uploaded, reported and validated so any info in the slot directory is long gone.

Isn't the stderr.txt file that was in the slot directory now visible on the website if you follow the result ID link I posted?

This stalled result syndrome has happened a few times to me on different machines. If I get another one, I will leave it stalled and post the .txt while it is still in the stalled condition.

Cheers,
Gary.

Bernd Machenschalk
Moderator
Administrator
Joined: 15 Oct 04
Posts: 4267
Credit: 244933581
RAC: 16281

Message 69418 in response to message 69417

Quote:
Isn't the stderr.txt file that was in the slot directory now visible on the website if you follow the result ID link I posted?


Yes, it is. Unfortunately it only reveals that indeed it got stuck starting the Windows Runtime debugger. Someone should point Rom Walton (BOINC) to this.

(Actually there is more it tells us: whatever the reason was for the general access violation, it didn't persist across a restart of the App, as the result completed successfully afterwards.)

It would be interesting to know if this is the only reason for stalled results, or if there are others.

Quote:
This stalled result syndrome has happened a few times to me on different machines. If I get another one, I will leave it stalled and post the .txt while it is still in the stalled condition.


Yep, that's what I meant.

BM
