Hi guys, this may affect all of you having a lot of machines. Thinking of Bruce Allen with 300 brand new Opterons, thinking of the master of Merlin.
The reason is this message:
2006-04-13 07:51:00 [Einstein@Home] 4 consecutive failures fetching scheduler list - deferring 604800 seconds
4 failures caused by the power loss mean a penalty of 7 days (604800 seconds)? Fine! Great job.
And now all guys with a lot of fast boxes have to check every single machine to see whether it is still willing to connect.
Thanks Boinc for this grand logic.
So admins of the world, start your mouse and check every single box! Have fun!
To all those guys having a real lot of machines -- poor you!
Well, not exactly. :-)
I know with Windoze there is BoincView - a great farm implement - so you can retry any/all Boinc functions across your connected flock to invoke said behaviour, with I think at most two mouse clicks for the entire crew! It works for me anyhow.
I am unsure whether any of the other applications on this page would be similarly suitable for other platforms. Can anybody comment here on such 'add-ons'?
Cheers, Mike.
I have made this letter longer than usual because I lack the time to make it shorter ...
... and my other CPU is a Ryzen 5950X :-) Blaise Pascal
RE: Well, not exactly.
Yes I know, there are applications. But in my special situation I can directly access only 5 of my machines. For the others I have no direct access. So even if there is such a tool, it would not help.
However, I do not like an application which needs a nurse to check on it over and over again. Maybe you have read that famous UNIX haters handbook? I had access to a printed one sometime in '94 or '95. One thing they did not like about Unix was those core files. After a while, you could find such a file in almost every directory (okay, this does not happen any more, but in those days it was true). And that behaviour with core files is a similar example where a computer needs a nurse to clean up the crap over and over again.
RE: Yes I know, there are
I hear you! I take your point that it is odd that after a mere 4 failed tries you wind up at such a large delay of 604800 seconds. I wonder if the delay was already high before those occurred ..... does anybody know the algorithm on that? Pure geometric or what?
I have read the good old "Mythical Man Month" and I'll certainly have a peek at that UNIX book! I wonder if I can get a 'free Unix Barf Bag' though ...... :-)
Cheers, Mike.
RE: I hear you! I take your
Two of my four machines finished the night displaying a six-day wait for retry.
The message log for one of them shows that it tried between 9:15 and 11:45 p.m. MDT to download new work and to upload completed results. The upload requests show backoffs gradually increasing to a bit over an hour at the longest, before the (separate) request for new work and reporting 1 result at 11:43, which earns the dreaded:
couldn't connect to server
...
no schedulers responded
...
fetching scheduler list
network error: couldn't connect to server
scheduler list fetch failed: http error
4 consecutive failures fetching scheduler list: deferring 604800 seconds
Yes, issuing an update this morning got my two laggards work again. And I do recall from the extended SETI downtime last year that excess requests can be a problem on restart, but this seems extreme.
RE: Yes issuing a update
Thanks! At least another victim :-)
I will not manually update today, instead I will watch what happens when E@H runs out of work. I have a cache of ~0.7 days, so in 5 hours one of the CPUs and in 6 hours the second CPU will be idle.
Let me see, maybe boinc is clever enough to ignore the delay, maybe not. Whatever happens, I will do a manual update tomorrow morning.
RE: Well, not exactly.
With BoincView you cannot retry all your failed result uploads with two clicks. :(
You need two clicks for every result, that's true, because you can't mark more than one.
cu,
Michael
RE: Hi guys, this may
Or let boinc do its thing on its own.
RE: With BoincView you
Of course you can!
On the right side of the retry file transfer button, there is a down-arrow. Try this one and you'll see a menu entry called: retry all file transfers
AndyK
Want to know your pending credit?
[img]http://tinyurl.com/438v3[/img]
The biggest bug is sitting 10 inch in front of the screen.
Not sure but I think the
Not sure, but I think that after 10 failed attempts to get work, the client tries to fetch the master file (scheduler list) to make sure that it is using the right address. Then come ten more attempts before it tries to get the master file again. This is repeated until the 4th failure to get the master file, at which point it backs off for a week (604800 sec). There are also incremental backoffs between attempts to get work.
So it looks like the OP of this thread must have been hitting the update button to get to that point, because I don't think the outage was long enough to drive it there.
However, I do agree that a week-long backoff is excessive. After 4 failures to get the master file it should back off no more than 24 hrs, IMO.
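To make the counting concrete, here is a minimal Python sketch of the logic as I described it above. It is only my understanding from this thread, not the actual BOINC client code: the constant names and the function `simulate_outage` are hypothetical, and the incremental backoffs between work requests are left out.

```python
# Hypothetical constants taken from the description in this thread,
# not from the BOINC source code.
RPC_FAILURES_PER_MASTER_FETCH = 10  # failed work requests before re-fetching the scheduler list
MASTER_FETCH_FAILURE_LIMIT = 4      # failed scheduler-list fetches before the long deferral
LONG_DEFERRAL_SECONDS = 7 * 24 * 60 * 60  # 604800 seconds, i.e. one week


def simulate_outage(max_rpc_attempts):
    """Count events while every request fails, as during a server outage.

    Returns (rpc_failures, master_fetch_failures, deferral_seconds).
    deferral_seconds is 0 if the long deferral was never reached.
    """
    rpc_failures = 0
    master_fetch_failures = 0
    for _ in range(max_rpc_attempts):
        rpc_failures += 1  # the work request fails
        if rpc_failures % RPC_FAILURES_PER_MASTER_FETCH == 0:
            # Every 10th failure triggers a scheduler-list fetch, which also fails.
            master_fetch_failures += 1
            if master_fetch_failures == MASTER_FETCH_FAILURE_LIMIT:
                return rpc_failures, master_fetch_failures, LONG_DEFERRAL_SECONDS
    return rpc_failures, master_fetch_failures, 0


if __name__ == "__main__":
    print(simulate_outage(100))
    print(simulate_outage(39))
```

Under these assumptions it takes 40 failed work requests before the week-long deferral kicks in, which is why I suspect manual updates were involved.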
98SE XP2500+ @ 2.1 GHz Boinc v5.8.8
RE: RE: With BoincView
Thx! I never tried this. ;)
cu,
Michael