Deferring communication, long delays and wasting CPU time

David@home
Joined: 11 Feb 05
Posts: 24
Credit: 11639
RAC: 0
Topic 189156

Hi

I regularly get messages like this under the messages tab:

11/05/2005 06:44:26|Einstein@Home|Deferring communication with project for 20 hours, 47 minutes, and 54 seconds

This happens even when a WU is close to completion. For example, the current Einstein WU has only 14 minutes left to run but that deferring communication message means that my boincmgr (version 4.25) will not connect to einstein@home for nearly 23 hours. The results get uploaded when complete but these delays mean that the second stage of connecting to the schedulers only occurs a whole day later. This often means that my result is the fourth in the quorum and I have effectively wasted the CPU time as my result has contributed nothing to the science since the first three WUs returned have already been validated.

Any ideas why einstein@home is deferring communication for so long?

Thanks

Wurgl (speak^Wcrunching for Special: Off-Topic)
Joined: 11 Feb 05
Posts: 321
Credit: 140550008
RAC: 0

Hidden Computers means no help from the community is possible :-(

Holmis
Joined: 4 Jan 05
Posts: 1118
Credit: 1055935564
RAC: 0

If you're using a dial-up connection, try disabling network access under the File menu when you're not connected to the internet.
When the result is uploaded and in the "ready to report" state, try to manually update the project from the Projects tab.
If you're using BOINC 4.19, right-click on the project and select Update.
If you're using a newer version, left-click the project; Update is to the left in the blue section.

David@home
Joined: 11 Feb 05
Posts: 24
Credit: 11639
RAC: 0

Message 11554 in response to message 11553

Hi Holmis

Some good ideas

1) I am on an always-on broadband connection, so it can connect at any time. I could see how a dial-up connection could cause Boinc Manager to delay, but normally there should be no connection issues.

2) Manual update is how I currently do it, but this means I have to keep an eye on Boinc Manager and I was kinda hoping it would manage all this stuff on its own.

I guess this is down to how Boinc works. If this is the case, then I think it is a shame that 25% of all CPU time will be wasted in this way. I would prefer that units which do not meet the 3-result quorum wait until another unit is sent out after the expiry date, rather than waste so much CPU time processing four at a time. OK, you get the credit, but it means that so much CPU time has been lost to the science.

I have also noticed that there are large gaps between the times that Einstein units are sent out; often 12 hours or more separate the 1st and 4th work units. This means that if you get allocated the fourth unit, there is a strong chance that you will be last to return the result.

Holmis
Joined: 4 Jan 05
Posts: 1118
Credit: 1055935564
RAC: 0

Message 11555 in response to message 11554

Then I'm out of ideas. But something is making BOINC defer. Take a look at the message from Einstein right above the first defer message. Anything interesting?

Jord
Joined: 26 Jan 05
Posts: 2952
Credit: 5893653
RAC: 16

Which version of BOINC are you using, Appetiser? And on what operating system?

David@home
Joined: 11 Feb 05
Posts: 24
Credit: 11639
RAC: 0

Message 11557 in response to message 11555


Then I'm out of ideas. But something is making BOINC defer. Take a look at the message from Einstein right above the first defer message. Anything interesting?

I will have to keep an eye out for the message to reappear. I recently rebooted and no defer messages are currently in the log. From what I remember, the defer message was the only one, and the messages appear at hourly intervals, counting down.

I guess there is no history of messages kept in a file on disk that I can look for?

Holmis
Joined: 4 Jan 05
Posts: 1118
Credit: 1055935564
RAC: 0

Message 11558 in response to message 11557


I guess there is no history of messages kept in a file on disk that I can look for?

There is. In the BOINC directory, the files stdout.old and stderr.old should contain all messages from the last time BOINC was run. They're overwritten the next time BOINC gets restarted.

It's normal for BOINC to count down every hour on the defer messages, but the interesting part should be something like this:

2005-05-11 08:41:57|ProteinPredictorAtHome|Sending request to scheduler: http://predictor.scripps.edu/predictor_cgi/cgi
2005-05-11 08:41:59|ProteinPredictorAtHome|Scheduler RPC to http://predictor.scripps.edu/predictor_cgi/cgi failed
2005-05-11 08:41:59|ProteinPredictorAtHome|No schedulers responded

2005-05-11 08:41:59|ProteinPredictorAtHome|Deferring communication with project for 1 minutes and 0 seconds

I've put the most interesting part in bold. But your message could be something completely different.
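To work out when the client will next contact a project, the duration in a defer message can be added to the line's timestamp. The following is only a minimal sketch, assuming the pipe-separated line format shown in the excerpt above; the function names `parse_deferral` and `next_contact` are made up for illustration and are not part of BOINC.

```python
import re
from datetime import datetime, timedelta

# Matches fragments such as "1 days", "4 hours", "47 minutes", "59 seconds".
_UNIT_RE = re.compile(r"(\d+)\s+(day|hour|minute|second)s?")


def parse_deferral(message: str) -> timedelta:
    """Turn the text of a 'Deferring communication' message into a timedelta."""
    parts = {"day": 0, "hour": 0, "minute": 0, "second": 0}
    for value, unit in _UNIT_RE.findall(message):
        parts[unit] += int(value)
    return timedelta(days=parts["day"], hours=parts["hour"],
                     minutes=parts["minute"], seconds=parts["second"])


def next_contact(line: str) -> datetime:
    """Estimate when the deferral in a log line expires.

    Assumes the format 'YYYY-MM-DD HH:MM:SS|Project|message'.
    """
    stamp, _project, message = line.split("|", 2)
    logged_at = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S")
    return logged_at + parse_deferral(message)


example = ("2005-05-11 08:41:59|ProteinPredictorAtHome|"
           "Deferring communication with project for 1 minutes and 0 seconds")
print(next_contact(example))
```

Because the hourly countdown lines carry their own timestamp and remaining time, any one of them should give roughly the same expiry time.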

Jord
Joined: 26 Jan 05
Posts: 2952
Credit: 5893653
RAC: 16

CC4.25 uses the newer version of both files already. Check for stderrdae.txt and stdoutdae.txt .. the latter will have all communication in it, the former only error messages.

David@home
Joined: 11 Feb 05
Posts: 24
Credit: 11639
RAC: 0

Message 11560 in response to message 11558

Many thanks, I found the log files. I think they changed the names in v4.25, as I found that the file stdoutdae.txt contained the info:

2005-05-09 18:09:35 [---] May run out of work in 6.00 days; requesting more
2005-05-09 18:09:35 [Einstein@Home] Requesting 1.69 seconds of work
2005-05-09 18:09:35 [Einstein@Home] Sending request to scheduler: http://einstein.phys.uwm.edu/EinsteinAtHome_cgi/cgi
2005-05-09 18:09:36 [Einstein@Home] Scheduler RPC to http://einstein.phys.uwm.edu/EinsteinAtHome_cgi/cgi succeeded
2005-05-09 18:09:36 [Einstein@Home] Message from server: No work sent
2005-05-09 18:09:36 [Einstein@Home] Message from server: (won't finish in time) Computer on 97.0% of time, BOINC on 98.6% of that, this project gets 33.3% of that
2005-05-09 18:09:36 [Einstein@Home] No work from project
2005-05-09 18:09:36 [Einstein@Home] Deferring communication with project for 1 days, 4 hours, 47 minutes, and 59 seconds
2005-05-09 18:33:56 [Einstein@Home] Pausing result H1_0152.0__0152.1_0.1_T01_Fin1_3 (left in memory)
2005-05-09 18:33:56 [SETI@home] Resuming result 24ja05ab.6803.26320.1009650.195_2 using setiathome version 4.09
2005-05-09 19:09:37 [Einstein@Home] Deferring communication with project for 1 days, 3 hours, 47 minutes, and 59 seconds

So it looks like no work was sent as it would not complete in time. Hence boinc manager defers. This suggests a possible scheduling problem as the active WU would complete within a few hours and would be left in ready to report state for an extra 24 hours.

Many thanks for the help, at least I know where the defer messages are coming from but it looks to me like this causes a side effect in delaying the reporting of completed WUs.
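For anyone who wants to find the reason for a deferral without reading the whole log, here is a rough sketch in Python. It assumes the line format seen in the excerpt above (timestamp, project in square brackets, message); the function name `server_reasons` and the grouping logic are illustrative assumptions, not anything BOINC provides. It collects the "Message from server" lines between a scheduler request and the defer message that follows, and skips the hourly countdown repeats.

```python
import re
from typing import Iterable, List, Tuple

# Assumed format: "YYYY-MM-DD HH:MM:SS [Project] message"
LINE_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \[(.+?)\] (.*)$")
SERVER_PREFIX = "Message from server:"


def server_reasons(lines: Iterable[str], project: str) -> List[Tuple[str, str, List[str]]]:
    """Return (timestamp, defer message, server messages) for each fresh deferral."""
    results = []
    current = None  # server messages seen since the last scheduler request
    for raw in lines:
        match = LINE_RE.match(raw.strip())
        if not match or match.group(2) != project:
            continue
        stamp, _project, message = match.groups()
        if message.startswith("Sending request to scheduler"):
            current = []
        elif message.startswith(SERVER_PREFIX) and current is not None:
            current.append(message[len(SERVER_PREFIX):].strip())
        elif message.startswith("Deferring communication") and current is not None:
            results.append((stamp, message, current))
            current = None  # later countdown lines have no open request
    return results


sample = [
    "2005-05-09 18:09:35 [Einstein@Home] Sending request to scheduler: http://einstein.phys.uwm.edu/EinsteinAtHome_cgi/cgi",
    "2005-05-09 18:09:36 [Einstein@Home] Message from server: No work sent",
    "2005-05-09 18:09:36 [Einstein@Home] No work from project",
    "2005-05-09 18:09:36 [Einstein@Home] Deferring communication with project for 1 days, 4 hours, 47 minutes, and 59 seconds",
    "2005-05-09 19:09:37 [Einstein@Home] Deferring communication with project for 1 days, 3 hours, 47 minutes, and 59 seconds",
]
for stamp, defer, reasons in server_reasons(sample, "Einstein@Home"):
    print(stamp, defer)
    for reason in reasons:
        print("  server said:", reason)
```

To run it against a real log, pass an open file instead of the sample list, for example `server_reasons(open("stdoutdae.txt"), "Einstein@Home")`, adjusting the path to wherever the BOINC directory is on your machine.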

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5874
Credit: 117976851609
RAC: 21859423

Message 11561 in response to message 11560


2005-05-09 18:09:35 [---] May run out of work in 6.00 days; requesting more
....
2005-05-09 18:09:36 [Einstein@Home] Message from server: (won't finish in time) Computer on 97.0% of time, BOINC on 98.6% of that, this project gets 33.3% of that
2005-05-09 18:09:36 [Einstein@Home] No work from project

I've left in the lines that represent your real "problem". With E@H's 7-day deadline, it's not wise to set a 6-day queue. I'm presuming you want a big Seti cache, but the penalty you pay will be wasted E@H work because you always get more than you need from E@H. Recent scheduler changes on the server mean that the scheduler will refuse to send you work that it thinks will be wasted. It's a long story and you can read all about it if you go back through the message boards.

The two things you should do are:-

1. Reduce your "connect" interval to about 1 day and let E@H become your cache if S@H decides to have further extended outages.

2. Upgrade to a 4.3x version (when they finally get the bugs sorted) because JM7 has spent a lot of time improving the scheduling code to handle what I think you are trying to do.

My perception is that you wish to crunch Seti mainly and would like to have E@H there as a backup in case of a really bad Seti outage. You don't really want a low cache because E@H might "take over" and get extra share during Seti outages. With JM7's work as I understand it, if this were to happen, Seti will accumulate a "debt" and will get extra share once it's back on line and your resource share wishes will eventually be honoured in the longer term. All of this with added protection against any work exceeding the deadline.

So what have you got to lose by setting a smaller cache? Well those annoying deferring messages for a start :).

Of course this is all IMHO and you are quite at liberty to set whatever cache best suits your needs.

Your final comment was
"This suggests a possible scheduling problem as the active WU would complete within a few hours and would be left in ready to report state for an extra 24 hours."

Not quite sure what you mean here, as the work that the server is refusing to send is not the work that will be running for many days. With your 6-day cache, don't you already have 6 days (at least) of E@H work sitting in your queue? With your computers hidden we can't see for ourselves, but that would be my expectation. The server, by deferring, is just saying that with all the work on hand, it doesn't think that new work would get a run before it is in deadline trouble. Whether or not the server is correct is a moot point. With a 6-day cache you are courting trouble anyway.

Cheers,
Gary.
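The "won't finish in time" line can be illustrated with some rough arithmetic. This is only a sketch under a stated assumption: that the three percentages in the server message multiply together to give the share of wall-clock time the project actually receives. The real scheduler code is not shown in this thread, and the function names and the queued amounts below are hypothetical.

```python
def effective_fraction(on_frac: float, active_frac: float, share_frac: float) -> float:
    """Share of wall-clock time a project gets, assuming the three factors multiply."""
    return on_frac * active_frac * share_frac


def would_finish(queued_cpu_days: float, new_cpu_days: float,
                 deadline_days: float, fraction: float) -> bool:
    """True if queued plus new work fits in the CPU time available before the deadline."""
    available = deadline_days * fraction
    return queued_cpu_days + new_cpu_days <= available


# Figures from the server message: computer on 97.0%, BOINC on 98.6%, project gets 33.3%.
fraction = effective_fraction(0.970, 0.986, 0.333)
available_days = 7 * fraction  # 7-day E@H deadline mentioned above

print(f"{fraction:.3f} of wall-clock time, about {available_days:.2f} CPU days in 7 days")

# Hypothetical queue sizes, purely for illustration.
print(would_finish(queued_cpu_days=2.0, new_cpu_days=0.5, deadline_days=7, fraction=fraction))
print(would_finish(queued_cpu_days=1.0, new_cpu_days=0.5, deadline_days=7, fraction=fraction))
```

Under that assumption the project gets roughly 2.2 days of CPU time within the 7-day deadline, so a host that already holds a couple of days of E@H work has little room for more.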
