Hi
I regularly get messages like this under the messages tab:
11/05/2005 06:44:26|Einstein@Home|Deferring communication with project for 20 hours, 47 minutes, and 54 seconds
This happens even when a WU is close to completion. For example, the current Einstein WU has only 14 minutes left to run, but that deferring communication message means that my boincmgr (version 4.25) will not connect to einstein@home for nearly 23 hours. The results get uploaded when complete, but these delays mean that the second stage, connecting to the schedulers to report them, only occurs a whole day later. This often means that my result is the fourth in the quorum and I have effectively wasted the CPU time: my result has contributed nothing to the science, since the first three results returned have already been validated.
Any ideas why einstein@home is deferring communication for so long?
Thanks
Copyright © 2024 Einstein@Home. All rights reserved.
Deferring communication, long delays and wasting CPU time
Hidden Computers means no help from the community is possible :-(
If you're using a dial-up connection, try disabling network access under the File menu when you're not connected to the internet.
When the result is uploaded and in the "ready to report" state, try manually updating the project from the Projects tab.
If you're using BOINC 4.19, right-click on the project and select Update.
If you're using a newer version, left-click the project; Update is to the left, in the blue section.
Hi Holmis
Some good ideas
1) I am on an always-on broadband connection, so it can connect at any time. I could see how a dial-up connection could cause BOINC Manager to delay, but normally there should be no connection issues.
2) Manual update is how I currently do it, but this means I have to keep an eye on BOINC Manager, and I was kinda hoping it would manage all this stuff on its own.
I guess this is down to how BOINC works. If that is the case, then I think it is a shame that 25% of all CPU time will be wasted in this way. I would prefer for units that do not meet the 3-result quorum to wait until another unit is sent out after the expiry date, rather than waste so much CPU time processing four at a time. OK, you get the credit, but it means that so much CPU time has been lost to the science.
I have also noticed that there are large gaps in the times at which Einstein units are sent out: often 12 hours or more separate the 1st and 4th work units. This means that if you get allocated the fourth unit, there is a strong chance that you will be the last to return the result.
Then I'm out of ideas. But something is making BOINC defer. Take a look at the message from Einstein right above the first defer message. Anything interesting?
Which version of BOINC are you using, Appetiser? And on what operating system?
I will have to keep an eye out for the message to reappear. I recently rebooted and no defer messages are currently in the log. From what I remember, the defer message was the only one, and they appear at hourly intervals, counting down.
I guess there is no history of messages kept in a file on disk that I can look at?
There is. In the BOINC directory, the files stdout.old and stderr.old should contain all messages from the last time BOINC was run. They're overwritten the next time BOINC gets restarted.
It's normal for BOINC to count down every hour in the defer messages, but the interesting part should be something like this:
2005-05-11 08:41:57|ProteinPredictorAtHome|Sending request to scheduler: http://predictor.scripps.edu/predictor_cgi/cgi
2005-05-11 08:41:59|ProteinPredictorAtHome|Scheduler RPC to http://predictor.scripps.edu/predictor_cgi/cgi failed
2005-05-11 08:41:59|ProteinPredictorAtHome|No schedulers responded
2005-05-11 08:41:59|ProteinPredictorAtHome|Deferring communication with project for 1 minutes and 0 seconds
I've put the most interesting part in bold. But your message could be something completely different.
CC4.25 already uses the newer version of both files. Check for stderrdae.txt and stdoutdae.txt: the latter will have all communication in it, the former only error messages.
Many thanks, I found the log files. I think they changed the name in v4.25, as I found that the file stdoutdae.txt contained the info:
2005-05-09 18:09:35 [---] May run out of work in 6.00 days; requesting more
2005-05-09 18:09:35 [Einstein@Home] Requesting 1.69 seconds of work
2005-05-09 18:09:35 [Einstein@Home] Sending request to scheduler: http://einstein.phys.uwm.edu/EinsteinAtHome_cgi/cgi
2005-05-09 18:09:36 [Einstein@Home] Scheduler RPC to http://einstein.phys.uwm.edu/EinsteinAtHome_cgi/cgi succeeded
2005-05-09 18:09:36 [Einstein@Home] Message from server: No work sent
2005-05-09 18:09:36 [Einstein@Home] Message from server: (won't finish in time) Computer on 97.0% of time, BOINC on 98.6% of that, this project gets 33.3% of that
2005-05-09 18:09:36 [Einstein@Home] No work from project
2005-05-09 18:09:36 [Einstein@Home] Deferring communication with project for 1 days, 4 hours, 47 minutes, and 59 seconds
2005-05-09 18:33:56 [Einstein@Home] Pausing result H1_0152.0__0152.1_0.1_T01_Fin1_3 (left in memory)
2005-05-09 18:33:56 [SETI@home] Resuming result 24ja05ab.6803.26320.1009650.195_2 using setiathome version 4.09
2005-05-09 19:09:37 [Einstein@Home] Deferring communication with project for 1 days, 3 hours, 47 minutes, and 59 seconds
So it looks like no work was sent because it would not complete in time, hence BOINC Manager defers. This suggests a possible scheduling problem, as the active WU would complete within a few hours and would then be left in the ready to report state for an extra 24 hours.
Many thanks for the help. At least I know where the defer messages are coming from, but it looks to me like this has the side effect of delaying the reporting of completed WUs.
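For anyone who would rather not scroll through the log by hand, here is a rough Python sketch of the same lookup: it finds each new defer message and shows the last few messages from that project just before it. It assumes the line layout seen in the excerpt above (timestamp, project in square brackets, then the message); the function names and the one-minute slack used to recognise the hourly countdown repeats are my own choices, not anything BOINC defines.

```python
import re
from datetime import datetime, timedelta

# Line shape assumed from the excerpt above, e.g.
#   2005-05-09 18:09:36 [Einstein@Home] No work from project
LINE_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \[(.*?)\] (.*)$")
DEFER_RE = re.compile(r"^Deferring communication with project for (.+)$")
UNIT_RE = re.compile(r"(\d+) (day|hour|minute|second)s?")


def parse_delay(text):
    """Turn '1 days, 4 hours, 47 minutes, and 59 seconds' into a timedelta."""
    parts = {unit + "s": int(count) for count, unit in UNIT_RE.findall(text)}
    return timedelta(**parts)


def explain_deferrals(lines, context=3, slack=timedelta(minutes=1)):
    """Yield (time, project, delay, preceding messages) for each new deferral.

    The hourly countdown repeats point at the same expiry time, so a defer
    message whose expiry is within `slack` of the previous one is skipped.
    """
    recent = {}  # project -> last few non-defer messages
    expiry = {}  # project -> when the latest deferral runs out
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue
        stamp_text, project, message = match.groups()
        stamp = datetime.strptime(stamp_text, "%Y-%m-%d %H:%M:%S")
        defer = DEFER_RE.match(message)
        if defer is None:
            recent[project] = (recent.get(project, []) + [message])[-context:]
            continue
        delay = parse_delay(defer.group(1))
        until = stamp + delay
        previous = expiry.get(project)
        expiry[project] = until
        if previous is None or abs(until - previous) > slack:
            yield stamp, project, delay, list(recent.get(project, []))


# Lines copied from the log excerpt above.
SAMPLE = """
2005-05-09 18:09:36 [Einstein@Home] Scheduler RPC to http://einstein.phys.uwm.edu/EinsteinAtHome_cgi/cgi succeeded
2005-05-09 18:09:36 [Einstein@Home] Message from server: No work sent
2005-05-09 18:09:36 [Einstein@Home] Message from server: (won't finish in time) Computer on 97.0% of time, BOINC on 98.6% of that, this project gets 33.3% of that
2005-05-09 18:09:36 [Einstein@Home] No work from project
2005-05-09 18:09:36 [Einstein@Home] Deferring communication with project for 1 days, 4 hours, 47 minutes, and 59 seconds
2005-05-09 18:33:56 [Einstein@Home] Pausing result H1_0152.0__0152.1_0.1_T01_Fin1_3 (left in memory)
2005-05-09 19:09:37 [Einstein@Home] Deferring communication with project for 1 days, 3 hours, 47 minutes, and 59 seconds
"""

if __name__ == "__main__":
    for stamp, project, delay, before in explain_deferrals(SAMPLE.splitlines()):
        print(f"{stamp} {project}: deferred for {delay}")
        for message in before:
            print("    preceded by:", message)
```

To run it against a real log, pass `open("stdoutdae.txt")` instead of `SAMPLE.splitlines()`. The older pipe-separated format shown earlier in the thread would need a different `LINE_RE`.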
2005-05-09 18:09:35 [---] May run out of work in 6.00 days; requesting more
....
2005-05-09 18:09:36 [Einstein@Home] Message from server: (won't finish in time) Computer on 97.0% of time, BOINC on 98.6% of that, this project gets 33.3% of that
2005-05-09 18:09:36 [Einstein@Home] No work from project
I've left in the lines that represent your real "problem". With E@H's 7-day deadline, it's not wise to set a 6-day queue. I'm presuming you want a big Seti cache, but the penalty you pay will be wasted E@H work, because you always get more than you need from E@H. Recent scheduler changes on the server mean that the scheduler will refuse to send you work that it thinks will be wasted. It's a long story and you can read all about it if you go back through the message boards.
The two things you should do are:-
1. Reduce your "connect" interval to about 1 day and let E@H become your cache if S@H decides to have further extended outages.
2. Upgrade to a 4.3x (when they finally get the bugs sorted) because JM7 has spent a lot of time improving the scheduling code to handle what I think you are trying to do.
My perception is that you wish to crunch Seti mainly and would like to have E@H there as a backup in case of a really bad Seti outage. You don't really want a low cache because E@H might "take over" and get extra share during Seti outages. With JM7's work as I understand it, if this were to happen, Seti will accumulate a "debt" and will get extra share once it's back on line and your resource share wishes will eventually be honoured in the longer term. All of this with added protection against any work exceeding the deadline.
So what have you got to lose by setting a smaller cache? Well those annoying deferring messages for a start :).
Of course this is all IMHO and you are quite at liberty to set whatever cache best suits your needs.
Your final comment was
"This suggests a possible scheduling problem as the active WU would complete within a few hours and would be left in ready to report state for an extra 24 hours."
Not quite sure what you mean here, as the work that the server is refusing to send is not the work that will be running for many days. With your 6-day cache, don't you already have 6 days (at least) of E@H work sitting in your queue? With your computers hidden we can't see for ourselves, but that would be my expectation. The server, by deferring, is just saying that with all the work on hand, it doesn't think that new work would get a run before it will be in deadline trouble. Whether or not the server is correct is a moot point. With a 6-day cache you are courting trouble anyway.
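To put rough numbers on that server message, here is a back-of-the-envelope sketch in Python. It simply multiplies the three percentages from the "won't finish in time" line to get the share of wall-clock time E@H can expect on this machine, then checks a queue against the 7-day deadline. The real scheduler's formula isn't shown in this thread, so treat this as a simplified illustration; the function names and the CPU-hour figures in the usage lines are made up for the example.

```python
def effective_fraction(computer_on, boinc_on, project_share):
    """Share of wall-clock time the project actually gets on this host."""
    return computer_on * boinc_on * project_share


def cpu_hours_before_deadline(deadline_days, fraction):
    """CPU hours the project can expect to receive before the deadline."""
    return deadline_days * 24.0 * fraction


def fits_deadline(queued_cpu_hours, new_cpu_hours, fraction, deadline_days):
    """Would the new work finish in time if it runs after everything queued?

    Simplifying assumption: all queued work for the project runs first and
    the project only ever gets `fraction` of the wall clock.
    """
    wall_days = (queued_cpu_hours + new_cpu_hours) / fraction / 24.0
    return wall_days <= deadline_days


if __name__ == "__main__":
    # Percentages from the server message: 97.0%, 98.6% and 33.3%.
    fraction = effective_fraction(0.970, 0.986, 0.333)
    print(f"effective share of wall-clock time: {fraction:.3f}")
    print(f"CPU hours available in 7 days: {cpu_hours_before_deadline(7, fraction):.1f}")
    # Hypothetical queue sizes, purely for illustration.
    print("40h queued + 10h new fits:", fits_deadline(40, 10, fraction, 7))
    print("50h queued + 10h new fits:", fits_deadline(50, 10, fraction, 7))
```

With those percentages E@H gets a bit under a third of the clock, so a 7-day deadline only buys roughly 53 CPU hours. Under this simplified model, anything queued beyond that can't be finished in time, which is the kind of situation the server is refusing to make worse.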
Cheers,
Gary.