Is it possible that somehow (e.g. by initially cloning the BOINC directory) the new host assumed the identity of the other hosts and threw the client-server comms out of sync? It seems to be quite a coincidence that the problem started very shortly after the new host was attached.
CU
Bikeman
It's true that this host was added ... but it is located 250 kilometers from here(tm) and operated by a friend. No, this host's BOINC directory was not cloned. He followed the standard procedure, using username and password to join the project.
I also asked him about the changed venue settings, and he replied that he only changed the settings for his hosts.
This story remains mysterious ... and does not even give a hint about the workunits my host never saw but which show up as overdue.
Stephan
Sorry, Memorial Day activities made it so I couldn't answer back yesterday.
There look to be two separate anomalies for the original host in question. The first is the matter of why the 'lost' tasks didn't get resent to the host like they should have been before they expired.
The second is the spontaneous detach of the host, as reported by the project. Keep in mind that in this case it is the project which initiates the detach, and not the user. In general terms, the reason they happen is that the scheduler did not like something in the request it received from the host. As I said before, the only thing I know for a fact which will cause it is a 'bogus' RPC Sequence Number. I don't know if you can infer anything from the generation of the new CPID, since that is part of any attachment of a host, regardless of whether it is a new host or a re-attachment of an old one.
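To make the sequence number idea a bit more concrete, here is a minimal sketch in Python. It is not the actual BOINC scheduler code. The names (`HostRecord`, `stored_seqno`, `check_request`) are hypothetical, and the rule shown, that a request carrying a lower sequence number than the one the server last accepted gets treated as coming from a stale or copied client, is an assumption used only for illustration.

```python
from dataclasses import dataclass


@dataclass
class HostRecord:
    """Server-side view of a host (hypothetical, for illustration only)."""
    host_id: int
    stored_seqno: int  # last RPC sequence number the server accepted


def check_request(record: HostRecord, request_seqno: int) -> str:
    """Return 'ok' or 'new_host' for an incoming scheduler request.

    Assumption: the client increments its sequence number on every request,
    so a value below the stored one suggests stale or copied client state.
    """
    if request_seqno < record.stored_seqno:
        # The server stops trusting the identity; the stored value is untouched.
        return "new_host"
    record.stored_seqno = request_seqno
    return "ok"


host = HostRecord(host_id=1, stored_seqno=120)
print(check_request(host, 121))  # normal request, counter moved forward
print(check_request(host, 57))   # counter went backwards
```

In a scheme like this, anything that rolls the client's state file back to an older copy (a restore from backup, a disk problem, a cloned directory) would be enough to trip the check, which is why the local logs are worth a look.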
I'm not liking the new host theory, only because it would seem likely that the new host should have had some problems too if the cause was some confusion on the backend side about which host was which. That alone isn't enough to eliminate it as a possibility, though.
In any event, the only way to get to the bottom of it is a thorough forensic analysis of the logs back to at least around the 10th or so of this month.
For the lost tasks, you have to see if you can find entries for their filenames in the BOINC logs and work from there.
For the spontaneous detach you need to work the Windows system logs as well. Something had to have happened between 22:52:57 UTC on the 24th and 11:50:38 UTC on the 25th which led to the CC generating a 'corrupt' scheduler request at 11:50:38 UTC. The only other alternative would be that the project had a glitch at that time, but one would think other hosts would have had a malfunction as well. As of yet, there haven't been any other reports of weird failures at or around that time, but that doesn't mean there weren't any.
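If the log is too long to read by eye, a small script can narrow it down to that window. The following is a rough sketch in Python, not a tested tool. It assumes each log line starts with a timestamp like `25-May-2009 11:50:38`; the helper name `lines_in_window`, the year, and the sample messages are all made up for illustration. Note also that the client log may be written in local time rather than UTC, so the window bounds might need shifting, and that `%b` expects English month abbreviations.

```python
from datetime import datetime

# Assumed line layout: "25-May-2009 11:50:38 [Project] message"
STAMP_FORMAT = "%d-%b-%Y %H:%M:%S"
STAMP_LENGTH = 20  # length of "25-May-2009 11:50:38"


def lines_in_window(lines, start, end, needle=None):
    """Yield log lines stamped within [start, end], optionally containing needle."""
    for line in lines:
        try:
            stamp = datetime.strptime(line[:STAMP_LENGTH], STAMP_FORMAT)
        except ValueError:
            continue  # no leading timestamp (e.g. a continuation line)
        if start <= stamp <= end and (needle is None or needle in line):
            yield line.rstrip("\n")


# Made-up sample lines; a real run would iterate over the opened log file.
sample = [
    "24-May-2009 22:52:57 [Example Project] Scheduler request completed",
    "25-May-2009 03:10:00 [---] Suspending computation",
    "25-May-2009 11:50:38 [Example Project] Sending scheduler request",
    "26-May-2009 08:00:00 [---] Resuming computation",
]
start = datetime(2009, 5, 24, 22, 52, 57)
end = datetime(2009, 5, 25, 11, 50, 38)
for hit in lines_in_window(sample, start, end):
    print(hit)
```

Passing a task filename as `needle` (with a window covering the whole period) would do the same job for hunting down the lost tasks.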
It's also possible that the reason for the first anomaly is related to the second one. If that's true, the detective work is going to be even more difficult given that you are going to be looking over a time span of almost three weeks.
At this point, the host seems to be back on track so the best course of action might be to just let sleeping dogs lie and keep a closer eye on it for a while to see if any pattern emerges.
However, don't feel too bad about this. Almost all grizzled BOINC veterans have had a host anomaly which has had us stumped at one time or another. Personally, I hate mysteries when it comes to computers, but sometimes you just have to say, "OK, you got away with that one... But it ain't gonna happen again, if there's anything I can do about it!". ;-)
Alinator