Deferring communication, long delays and wasting CPU time

David@home

Joined: 11 Feb 05

Posts: 24

Credit: 11639

RAC: 0

11 May 2005 6:37:59 UTC

Topic 189156

I regularly get messages like this under the messages tab:

11/05/2005 06:44:26|Einstein@Home|Deferring communication with project for 20 hours, 47 minutes, and 54 seconds

This happens even when a WU is close to completion. For example, the current Einstein WU has only 14 minutes left to run, but that deferring message means that my boincmgr (version 4.25) will not connect to einstein@home for nearly 23 hours. Results are uploaded when complete, but because of these delays the second stage, reporting them to the scheduler, only happens a whole day later. This often means that my result is the fourth in the quorum, so I have effectively wasted the CPU time: my result contributes nothing to the science because the first three results returned have already been validated.

Any ideas why einstein@home is deferring communication for so long?

Thanks

Wurgl (speak^Wc...

Joined: 11 Feb 05

Posts: 321

Credit: 140550008

RAC: 0

Deferring communication, long delays and wasting CPU time

11 May 2005 8:39:29 UTC

Message 11552

Hidden Computers means no help from the community is possible :-(

Holmis

Joined: 4 Jan 05

Posts: 1118

Credit: 1055935564

RAC: 0

Hi I regularly get

11 May 2005 14:40:32 UTC

Message 11553

I regularly get messages like this under the messages tab:

11/05/2005 06:44:26|Einstein@Home|Deferring communication with project for 20 hours, 47 minutes, and 54 seconds

Any ideas why einstein@home is deferring communication for so long?

Thanks

If you're using a dial-up connection, try disabling network access under the File menu when you are not connected to the internet.
When the result is uploaded and in the "ready to report" state, try to manually update the project from the Projects tab.
If you're using BOINC 4.19, right-click on the project and select Update.
If you're using a newer version, left-click the project; the Update button is on the left in the blue section.

David@home

Joined: 11 Feb 05

Posts: 24

Credit: 11639

RAC: 0

Hi Holmis Some good

11 May 2005 20:10:48 UTC

Message 11554 in response to message 11553

Hi Holmis

Some good ideas

1) I am on an always-on broadband connection, so it can connect at any time. I could see how a dial-up connection could cause BOINC Manager to delay, but normally there should be no connection issues.

2) Manual update is how I currently do it, but this means I have to keep an eye on Boinc Manager and I was kinda hoping it would manage all this stuff on its own.

I guess this is down to how BOINC works. If that is the case, then I think it is a shame that 25% of all CPU time will be wasted in this way. I would prefer units that do not meet the 3-result quorum to wait until another unit is sent out after the expiry date, rather than waste so much CPU time processing four at a time. OK, you get the credit, but it means that so much CPU time has been lost to the science.

I have also noticed that there are large gaps in the times at which Einstein units are sent out; often 12 hours or more separate the 1st and 4th work units. This means that if you get allocated the fourth unit, there is a strong chance that you will be last to return the result.

Holmis

Joined: 4 Jan 05

Posts: 1118

Credit: 1055935564

RAC: 0

Hi Holmis Some good

11 May 2005 20:41:18 UTC

Message 11555 in response to message 11554

Hi Holmis

Some good ideas

1) I am on an always-on broadband connection, so it can connect at any time. I could see how a dial-up connection could cause BOINC Manager to delay, but normally there should be no connection issues.

2) Manual update is how I currently do it, but this means I have to keep an eye on Boinc Manager and I was kinda hoping it would manage all this stuff on its own.

Then I'm out of ideas. But something is making BOINC defer. Take a look at the message from Einstein right above the first defer message. Anything interesting?

Jord

Joined: 26 Jan 05

Posts: 2952

Credit: 5893653

RAC: 16

Which version of BOINC are

11 May 2005 20:49:23 UTC

Message 11556

Which version of BOINC are you using, Appetiser? And on what operating system?

David@home

Joined: 11 Feb 05

Posts: 24

Credit: 11639

RAC: 0

Then I'm out of ideas. But

11 May 2005 20:49:49 UTC

Message 11557 in response to message 11555

Then I'm out of ideas. But something is making BOINC defer. Take a look at the message from Einstein right above the first defer message. Anything interesting?

I will have to keep an eye out for the message to reappear. I recently rebooted and no defer messages are currently in the log. From what I remember, the defer message was the only one, and they appear at hourly intervals, counting down.

I guess there is no history of messages kept in a file on disk that I can look for?

Holmis

Joined: 4 Jan 05

Posts: 1118

Credit: 1055935564

RAC: 0

I guess there is no history

11 May 2005 21:49:11 UTC

Message 11558 in response to message 11557

I guess there is no history of messages kept in a file on disk that I can look for?

There is. In the BOINC directory, the files stdout.old and stderr.old should contain all messages from the last time BOINC was run. They're overwritten the next time BOINC is restarted.

It's normal for BOINC to count down every hour in the defer messages, but the interesting part should be something like this:

2005-05-11 08:41:57|ProteinPredictorAtHome|Sending request to scheduler: http://predictor.scripps.edu/predictor_cgi/cgi
2005-05-11 08:41:59|ProteinPredictorAtHome|Scheduler RPC to http://predictor.scripps.edu/predictor_cgi/cgi failed
2005-05-11 08:41:59|ProteinPredictorAtHome|No schedulers responded
2005-05-11 08:41:59|ProteinPredictorAtHome|Deferring communication with project for 1 minutes and 0 seconds

I've put the most interesting part in bold. But your message could be something completely different.

Jord

Joined: 26 Jan 05

Posts: 2952

Credit: 5893653

RAC: 16

CC4.25 uses the newer version

11 May 2005 23:24:53 UTC

Message 11559

CC4.25 already uses the newer versions of both files. Check for stderrdae.txt and stdoutdae.txt. The latter will have all communication in it, the former only error messages.

David@home

Joined: 11 Feb 05

Posts: 24

Credit: 11639

RAC: 0

Many thanks I found the log

11 May 2005 23:30:24 UTC

Message 11560 in response to message 11558

Many thanks, I found the log files. I think they changed the names in v4.25, as I found that the file stdoutdae.txt contained the info:

2005-05-09 18:09:35 [---] May run out of work in 6.00 days; requesting more
2005-05-09 18:09:35 [Einstein@Home] Requesting 1.69 seconds of work
2005-05-09 18:09:35 [Einstein@Home] Sending request to scheduler: http://einstein.phys.uwm.edu/EinsteinAtHome_cgi/cgi
2005-05-09 18:09:36 [Einstein@Home] Scheduler RPC to http://einstein.phys.uwm.edu/EinsteinAtHome_cgi/cgi succeeded
2005-05-09 18:09:36 [Einstein@Home] Message from server: No work sent
2005-05-09 18:09:36 [Einstein@Home] Message from server: (won't finish in time) Computer on 97.0% of time, BOINC on 98.6% of that, this project gets 33.3% of that
2005-05-09 18:09:36 [Einstein@Home] No work from project
2005-05-09 18:09:36 [Einstein@Home] Deferring communication with project for 1 days, 4 hours, 47 minutes, and 59 seconds
2005-05-09 18:33:56 [Einstein@Home] Pausing result H1_0152.0__0152.1_0.1_T01_Fin1_3 (left in memory)
2005-05-09 18:33:56 [SETI@home] Resuming result 24ja05ab.6803.26320.1009650.195_2 using setiathome version 4.09
2005-05-09 19:09:37 [Einstein@Home] Deferring communication with project for 1 days, 3 hours, 47 minutes, and 59 seconds

So it looks like no work was sent because it would not complete in time, hence BOINC Manager defers. This suggests a possible scheduling problem, as the active WU would complete within a few hours and would be left in the ready to report state for an extra 24 hours.

Many thanks for the help. At least I know where the defer messages are coming from, but it looks to me like this has the side effect of delaying the reporting of completed WUs.
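For anyone who wants to check their own log the same way, here is a minimal Python sketch. It assumes the line format shown in the excerpt above; the names `defer_seconds`, `explain_deferrals` and `SAMPLE_LOG` are made up for illustration and are not part of BOINC. It converts a "Deferring communication" line into seconds and pairs it with the few lines just before it, which is where the server's reason usually appears.

```python
import re

# Seconds per unit, for durations such as
# "1 days, 4 hours, 47 minutes, and 59 seconds".
UNIT_SECONDS = {"day": 86400, "hour": 3600, "minute": 60, "second": 1}

DEFER_RE = re.compile(r"Deferring communication with project for (.+)$")
PART_RE = re.compile(r"(\d+)\s+(day|hour|minute|second)s?")

# Lines copied from the log excerpt above.
SAMPLE_LOG = [
    "2005-05-09 18:09:36 [Einstein@Home] Message from server: No work sent",
    "2005-05-09 18:09:36 [Einstein@Home] Message from server: (won't finish in time) "
    "Computer on 97.0% of time, BOINC on 98.6% of that, this project gets 33.3% of that",
    "2005-05-09 18:09:36 [Einstein@Home] No work from project",
    "2005-05-09 18:09:36 [Einstein@Home] Deferring communication with project "
    "for 1 days, 4 hours, 47 minutes, and 59 seconds",
]


def defer_seconds(line):
    """Return the deferral length in seconds, or None if the line is not a deferral."""
    match = DEFER_RE.search(line)
    if match is None:
        return None
    parts = PART_RE.findall(match.group(1))
    return sum(int(count) * UNIT_SECONDS[unit] for count, unit in parts)


def explain_deferrals(lines, context=3):
    """Pair each deferral with the log lines just before it (hypothetical helper)."""
    results = []
    for index, line in enumerate(lines):
        seconds = defer_seconds(line)
        if seconds is not None:
            results.append((seconds, lines[max(0, index - context):index]))
    return results


if __name__ == "__main__":
    for seconds, before in explain_deferrals(SAMPLE_LOG):
        print(f"Deferred for {seconds} seconds; preceding lines:")
        for earlier in before:
            print("   ", earlier)
```

To run it against a real log, read the file into a list of lines first and pass that list in place of `SAMPLE_LOG`.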

Gary Roberts

Moderator

Joined: 9 Feb 05

Posts: 5874

Credit: 117976851609

RAC: 21859423

2005-05-09 18:09:35 [---] May

12 May 2005 5:17:47 UTC

Message 11561 in response to message 11560

2005-05-09 18:09:35 [---] May run out of work in 6.00 days; requesting more
....
2005-05-09 18:09:36 [Einstein@Home] Message from server: (won't finish in time) Computer on 97.0% of time, BOINC on 98.6% of that, this project gets 33.3% of that
2005-05-09 18:09:36 [Einstein@Home] No work from project

I've left in the lines that represent your real "problem". With E@H's 7 day deadline, it's not wise to set a 6 day queue. I'm presuming you want a big Seti cache, but the penalty you pay will be wasted E@H work because you always get more than you need from E@H. Recent scheduler changes on the server mean that the scheduler will refuse to send you work that it thinks will be wasted. It's a long story and you can read all about it if you go back through the message boards.

The two things you should do are:-

1. Reduce your "connect" interval to about 1 day and let E@H become your cache if S@H decides to have further extended outages.

2. Upgrade to a 4.3x (when they finally get the bugs sorted) because JM7 has spent a lot of time improving the scheduling code to handle what I think you are trying to do.

My perception is that you wish to crunch Seti mainly and would like to have E@H there as a backup in case of a really bad Seti outage. You don't really want a low cache because E@H might "take over" and get extra share during Seti outages. With JM7's work as I understand it, if this were to happen, Seti will accumulate a "debt" and will get extra share once it's back on line and your resource share wishes will eventually be honoured in the longer term. All of this with added protection against any work exceeding the deadline.

So what have you got to lose by setting a smaller cache? Well those annoying deferring messages for a start :).

Of course this is all IMHO and you are quite at liberty to set whatever cache best suits your needs.

Your final comment was
"This suggests a possible scheduling problem as the active WU would complete within a few hours and would be left in ready to report state for an extra 24 hours."

Not quite sure what you mean here, as the work that the server is refusing to send is not the work that will be running for many days. With your 6 day cache, don't you already have 6 days (at least) of E@H work sitting in your queue? With your computers hidden we can't see for ourselves, but that would be my expectation. The server, by deferring, is just saying that with all the work on hand, it doesn't think that new work would get a run before it will be in deadline trouble. Whether or not the server is correct is a moot point. With a 6 day cache you are courting trouble anyway.
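To put rough numbers on the "won't finish in time" message, here is a small Python sketch. It is only a simplified model, not the scheduler's actual algorithm: it multiplies the three percentages the server reported to estimate what share of wall-clock time E@H really gets. The function names and the one-day CPU figure are illustrative assumptions.

```python
def effective_fraction(computer_on, boinc_on, project_share):
    """Fraction of wall-clock time one project gets (simplified model)."""
    return computer_on * boinc_on * project_share


def wall_clock_days(cpu_days, fraction):
    """Wall-clock days needed to deliver the given CPU days at that fraction."""
    return cpu_days / fraction


# Percentages taken from the server message quoted above.
fraction = effective_fraction(0.970, 0.986, 0.333)

# CPU time E@H can expect within its 7 day deadline under this model.
cpu_days_in_deadline = 7 * fraction

# Hypothetical example: one day of pure CPU work for E@H.
days_for_one_cpu_day = wall_clock_days(1.0, fraction)

if __name__ == "__main__":
    print(f"Effective fraction:       {fraction:.3f}")
    print(f"CPU days within deadline: {cpu_days_in_deadline:.2f}")
    print(f"Wall days per CPU day:    {days_for_one_cpu_day:.2f}")
```

Under these assumptions E@H gets a bit under a third of the machine, so a 7 day deadline covers only a little over 2 days of actual E@H crunching. That is why a 6 day queue leaves so little headroom.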

Cheers,
Gary.
