Einstein jobs frequently restart without progress

Jord
Joined: 26 Jan 05
Posts: 2952
Credit: 5893653
RAC: 6

I did ask for a more in-depth

Message 79313 in response to message 79312

I did ask for a more in-depth new log, last time around. Didn't you get my email?

mopel
Joined: 2 Sep 05
Posts: 6
Credit: 21991221
RAC: 0

RE: I did ask for a more

Message 79314 in response to message 79313

Quote:
I did ask for a more in-depth new log, last time around. Didn't you get my email?

Sure, I sent one to your email address, but I can send another one now with more verbose debugging...

With BOINC's CPU usage limited to 60%, I suspected that 'AMD Cool&Quiet' was affecting the Einstein job (it drops the CPU frequency to 1 GHz when idle and raises it to 2.2 GHz on demand under load), but disabling it and running at full speed makes no difference.

There are 'message timeout' messages before each task restart, but I don't see the reason for them. Maybe the Einstein job doesn't respond quickly enough to BOINC's suspend request?

28-Feb-2008 00:54:26 [---] [app_msg_send] poll: 1 msgs queued for h1_0859.90_S5R3__437_S5R3b_3:
28-Feb-2008 00:54:26 [---] [app_msg_send] poll: deferred:
28-Feb-2008 00:54:26 [---] Restarting h1_0859.90_S5R3__315_S5R3b_1 - message timeout
28-Feb-2008 00:54:26 [Einstein@Home] [task_debug] task_state=UNINITIALIZED for h1_0859.90_S5R3__315_S5R3b_1 from kill_task
28-Feb-2008 00:54:26 [---] [cpu_sched_debug] Request enforce CPU schedule: Task restart
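
For anyone who wants to check their own log for the same pattern, here is a minimal Python sketch that counts 'message timeout' restarts per task. It assumes the log lines follow the layout shown above (timestamp, bracketed project name, message text). The function name `count_timeout_restarts` is made up for illustration and is not part of BOINC.

```python
import re
from collections import Counter

# Matches lines of the form seen in the log excerpt above:
#   28-Feb-2008 00:54:26 [---] Restarting <task name> - message timeout
# Assumption: task names contain no whitespace.
RESTART_RE = re.compile(
    r"^(\d{2}-\w{3}-\d{4} \d{2}:\d{2}:\d{2}) \[[^\]]*\] "
    r"Restarting (\S+) - message timeout\s*$"
)

def count_timeout_restarts(lines):
    """Return a Counter mapping task name -> number of timeout restarts."""
    counts = Counter()
    for line in lines:
        match = RESTART_RE.match(line.strip())
        if match:
            counts[match.group(2)] += 1
    return counts

# The two relevant lines from the excerpt above, used as sample input.
sample = [
    "28-Feb-2008 00:54:26 [---] [app_msg_send] poll: deferred:",
    "28-Feb-2008 00:54:26 [---] Restarting h1_0859.90_S5R3__315_S5R3b_1 - message timeout",
]

for task, n in count_timeout_restarts(sample).items():
    print(task, n)
```

To run it over a whole log, pass an open file object instead of `sample`; the path to the log file depends on the installation and is not shown here.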

Jord
Joined: 26 Jan 05
Posts: 2952
Credit: 5893653
RAC: 6

Well no, the weird thing I

Message 79315 in response to message 79314

Well no, the weird thing I found in your log was that BOINC loses messages it's getting from the application about the task while the task is suspended. That's what causes the message timeout, but it doesn't explain why the messages get lost in the first place. The restarts alternate as well: first your first task restarts, then a minute later the other.

I forwarded the log to the EAH moderators helping in this case and to Rom Walton of BOINC.
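
The alternating pattern can be checked mechanically once the restart lines have been pulled out of a log. The sketch below is only an illustration: `restart_gaps` and `is_alternating` are hypothetical helper names, the task names are placeholders, and the timestamps are invented rather than taken from the actual log. It assumes the timestamp format seen in the log excerpt earlier in the thread.

```python
from datetime import datetime

# Timestamp layout as in the log excerpt, e.g. "28-Feb-2008 00:54:26".
# Note: %b parses English month abbreviations only under an English/C locale.
TIMESTAMP_FORMAT = "%d-%b-%Y %H:%M:%S"

def restart_gaps(events):
    """events: list of (timestamp_string, task_name) in log order.

    Returns a list of (previous_task, next_task, seconds_between)
    for each pair of consecutive restarts.
    """
    parsed = [(datetime.strptime(ts, TIMESTAMP_FORMAT), task) for ts, task in events]
    gaps = []
    for (t0, a), (t1, b) in zip(parsed, parsed[1:]):
        gaps.append((a, b, (t1 - t0).total_seconds()))
    return gaps

def is_alternating(events):
    """True if no task restarts twice in a row (trivially True for 0 or 1 events)."""
    tasks = [task for _, task in events]
    return all(a != b for a, b in zip(tasks, tasks[1:]))

# Invented example data, NOT from the real log.
events = [
    ("28-Feb-2008 00:54:26", "task_A"),
    ("28-Feb-2008 00:55:26", "task_B"),
    ("28-Feb-2008 01:14:26", "task_A"),
]

for prev_task, next_task, seconds in restart_gaps(events):
    print(f"{prev_task} -> {next_task}: {seconds:.0f} s")
print("alternating:", is_alternating(events))
```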

Joe
Joined: 24 Jan 08
Posts: 32
Credit: 1865461
RAC: 1882

I've been having the same

Message 79316 in response to message 79315

I've been having the same problem. One unit restarted so much that I had to abort it because it was .48 days past due, and the next one's "chugging along" with restarts about once every 20 minutes or so. Very discouraging, I can tell you. I'm glad to see I'm not the only one, only because that means there's somebody working on clearing this up. If there's anything I can do to help, please let me know.

mopel
Joined: 2 Sep 05
Posts: 6
Credit: 21991221
RAC: 0

RE: I've been having the

Message 79317 in response to message 79316

Quote:
I've been having the same problem. One unit restarted so much that I had to abort it because it was .48 days past due, and the next one's "chugging along" with restarts about once every 20 minutes or so. Very discouraging, I can tell you. I'm glad to see I'm not the only one, only because that means there's somebody working on clearing this up. If there's anything I can do to help, please let me know.


Did you limit the CPU usage to less than 100%? And if you allow 100%, does Einstein run fine then? On both my XP Home and XP-64 partitions, it seems to be the CPU limit that triggers the problem. I have no clue how to fix it yet.

Jord
Joined: 26 Jan 05
Posts: 2952
Credit: 5893653
RAC: 6

RE: I've been having the

Message 79318 in response to message 79316

Quote:
I've been having the same problem. One unit restarted so much that I had to abort it because it was .48 days past due, and the next one's "chugging along" with restarts about once every 20 minutes or so. Very discouraging, I can tell you. I'm glad to see I'm not the only one, only because that means there's somebody working on clearing this up. If there's anything I can do to help, please let me know.


I'm wondering if your problem has to do with you running Windows 98 SE.
Looking at the output of your aborted task, you never even got that far and ran into a breakpoint error. It's possible that the way the application was compiled didn't take into account the older OSes.

I've sent your case off to Bernd.

Joe
Joined: 24 Jan 08
Posts: 32
Credit: 1865461
RAC: 1882

RE: Did you limit the CPU

Message 79319 in response to message 79317

Quote:

Did you limit the CPU usage to less than 100%? And if you allow 100%, does Einstein run fine then? On both my XP Home and XP-64 partitions, it seems to be the CPU limit that triggers the problem. I have no clue how to fix it yet.

No, it's always been at 100%. I might add that the projects I'm doing for World Grid have no trouble, just Einstein.

Joe
Joined: 24 Jan 08
Posts: 32
Credit: 1865461
RAC: 1882

RE: I've sent your case

Message 79320 in response to message 79318

Quote:

I've sent your case off to Bernd.

Thanx!

Richard Haselgrove
Joined: 10 Dec 05
Posts: 2143
Credit: 2977800968
RAC: 787898

RE: RE: I've been having

Message 79321 in response to message 79318

Quote:
Quote:
I've been having the same problem. One unit restarted so much that I had to abort it because it was .48 days past due, and the next one's "chugging along" with restarts about once every 20 minutes or so. Very discouraging, I can tell you. I'm glad to see I'm not the only one, only because that means there's somebody working on clearing this up. If there's anything I can do to help, please let me know.

I'm wondering if your problem has to do with you running Windows 98 SE.
Looking at the output of your aborted task, you never even got that far and ran into a breakpoint error. It's possible that the way the application was compiled didn't take into account the older OSes.

I've sent your case off to Bernd.


My Celeron 400 MMX 476676 is also running Windows 98SE and Einstein 4.26. It shares with SETI, 50:50 split, 1 hour task switch, kept in memory while suspended.

Just occasionally, I notice a pale yellow band on my BOINCview, and it seems as if the machine has switched to Einstein, says (to BV) that it's running, but is at 0.0000% efficiency (i.e. no progress). I don't often look at the desktop on that machine, but I can if you want (I just have to press a button to choose a different monitor input) - I'll try and catch it in the act next time I see that it's stalled.

Bikeman (Heinz-Bernd Eggenstein)
Moderator
Joined: 28 Aug 06
Posts: 3522
Credit: 757629547
RAC: 1156121

RE: RE: I've sent your

Message 79322 in response to message 79320

Quote:
Quote:

I've sent your case off to Bernd.

Thanx!

The one result I see on Richard's Win98 host does not show this strange effect. I'll try to run some tests as well; I've got a Windows ME host, which should behave similarly.
CU

Bikeman
