Admin-Einstein just lost one cruncher

history
Joined: 22 Jan 05
Posts: 127
Credit: 7573923
RAC: 0
Topic 189146

One of my rigs ran dry of work. I upload the completed units and expect more work. What was I thinking? "No work from project, daily quota of 8 results...yadda yadda" This is not an over-tweaked machine. Cranked out WUs in 6.5 hours. OK, time for backup. Head for LHC and badda bing, a truckload of work, no questions asked, no quota 8, 3 day queue. Worse news is I have another rig with under 24 hours of work left that gets the same response after uploading. Looks like Einstein is about to lose another cruncher. My Einstein queue is set for 4 days. Losing faith here. WTF?

Daedalus
Joined: 17 Mar 05
Posts: 1
Credit: 87059
RAC: 0

Same has happened to one of my crunchers - the other is running fine with too much work to complete. The functioning machine accesses the internet directly via its modem - the failing machine uses the other machine's modem via the network. There have been no configuration changes on the network or the machines, and everything had been working fine since it was initially installed.

What do I need to do to get more work?

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5872
Credit: 117233638772
RAC: 36093167

@Tweakster,

There is no such thing as a "quota 8 bug" that I'm aware of. There is a very necessary "feature" which prevents the trashing of more than 8 work units per day by a misbehaving box. If your boxes average about 6 hours, you only need 4 work units per day to keep them happy. If the server is refusing to send more work, there are at least three main possibilities.

1. The server is stuffed and has got its brains scrambled about how much work it has sent you. If this were the case, I would expect many others to be furiously complaining as well.

2. Your box is stuffed and has already received (and trashed) 8 work units during the current day. Have you had a good close look at what is happening in the queues of boxes that are being refused work?

3. The "phantom work units" bug is biting you really bad. I regularly notice that there will be work showing up in the results list on the web page that is not showing in the work tab of the GUI. It was fairly bad a few weeks ago but has been quite a bit better (for me anyway) lately. There have been quite a few posts around describing this bug and the developers are aware of it and trying to find out what is causing it. If you were being bitten by this bug it is possible that the server thinks it has sent you 8 but you have much less in your work list. If this were repeated for a few days in a row then your box would be out of work whilst the server thought you had plenty. Have you checked for a mismatch between the web page and your work tab? I should add that I've never seen the bug bad enough to totally exhaust a queue. Just the occasional few "phantoms" gradually working their way through and eventually expiring. Actually a benign bug from the user's standpoint.
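For anyone wanting to check for that mismatch systematically, here is a minimal Python sketch. It assumes you have copied the result names from the web page and from the work tab into two lists by hand; the function name and the sample result names are hypothetical, and nothing here talks to BOINC or the project server.

```python
def find_phantoms(server_results, local_results):
    """Return names the server lists as sent that are missing from the client."""
    # Set difference: everything on the web page that is not in the work tab.
    return sorted(set(server_results) - set(local_results))


if __name__ == "__main__":
    # Hypothetical result names, typed in by hand from the two lists.
    web_page = ["wu_001", "wu_002", "wu_003", "wu_004"]
    work_tab = ["wu_002", "wu_004"]
    for name in find_phantoms(web_page, work_tab):
        print("possible phantom:", name)
```

An empty result means the two lists agree; anything printed is a candidate "phantom" worth reporting.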

Another point. Raising your cache size from 3 to 4 days is useless if the server thinks it has already sent you 8 for the day. All it will do is create a future problem when the server is really sending you the full 8 each and every day. You could easily run into future deadline problems as your cache accumulates.
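To put rough numbers on this, the sketch below works through the arithmetic for a single-CPU box. The quota of 8 and the 6-hour run time come from this thread; the function names are made up for illustration, and the calculation ignores anything else the scheduler may take into account.

```python
import math

DAILY_QUOTA = 8  # results per day, as in the scheduler message quoted in this thread


def units_per_day(hours_per_unit):
    """Work units one CPU finishes in 24 hours of continuous crunching."""
    return 24 / hours_per_unit


def units_for_cache(cache_days, hours_per_unit):
    """Work units needed to hold cache_days of work for one CPU."""
    return math.ceil(cache_days * 24 / hours_per_unit)


if __name__ == "__main__":
    hours = 6  # assumed average run time per work unit
    print("needed per day:", units_per_day(hours))
    for days in (3, 4):
        print(days, "day cache needs", units_for_cache(days, hours), "units")
```

On these assumptions a 6-hour box needs 4 results a day, while a 4-day cache works out to 16 results, twice the daily quota, so a bigger cache cannot be filled from a single day's allocation.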

The most important thing for you to do is work out if your box is trashing work or if there is a significant mismatch between web page list and your work tab. Because your computers are hidden, I can't see for myself.

Cheers,
Gary.

history
Joined: 22 Jan 05
Posts: 127
Credit: 7573923
RAC: 0

Message 11524 in response to message 11523

Gary; Thanks for the response. Remember that these are my most stable rigs. They are not showing any calculation errors. I upped my cache to 4 days because of a business trip. Apparently the server had issues with this change. When one of my most stable rigs runs dry and gets zip from the scheduler, I have to think that the problem is on Einstein's end. Please review my posts in the last 72 hours and do a come back. My read is that the hosts file interface and the ability of the scheduler to make sense of same with my preferences, is way out of synch. I have another rig running dry, 9977. Have enjoyed the stability of Einstein for 4 months, it's a shame I could not expect more.

Regards-tweakster

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5872
Credit: 117233638772
RAC: 36093167

Message 11525 in response to message 11524


Please review my posts in the last 72 hours and do a come back. My read is that the hosts file interface and the ability of the scheduler to make sense of same with my preferences, is way out of synch. I have another rig running dry, 9977. Have enjoyed the stability of Einstein for 4 months, it's a shame I could not expect more.

Regards-tweakster

Actually, what the hell do you think I did before replying at all??? I clicked on the link to your posts and read everything that you had written that was related to the "quota 8 bug". A title search failed initially because I was looking for "quota" and you had written "quata". Also, if you had taken a bit more care to describe things more fully rather than just ranting, it might have been a bit easier for us dumbos to get the proper picture. For instance, in your current message you use the phrase "hosts file interface and the ability of the scheduler to make sense of same" and I must admit to not really having a clue as to what you are talking about.

I know you're frustrated but it doesn't help to insult innocent bystanders who are just trying to help.

There is no "quota 8 bug" and you should stop confusing others by using that term. There is a long-term, documented, and well-known bug which has previously been described as "phantom" work units. You must have a severe case of that.

I've just been able to look at CPU ID 9977, thanks for publishing the ID. The web page says you have a queue of 34, accumulated at the rate of 8 per day over a period of several days. If you have nothing on your box then you probably have a severe case of the "phantom" work units bug. I guess your only option is to tell the server somehow to abort all those 34 results so it doesn't keep thinking that you have them. Bruce or Bernd might be able to give help with that. As you appear to be on 4.19, for that box anyway, I don't think you can abort easily from BOINC itself but I think I've seen Bernd post about how to do it.

The developers have commented quite a few times in these lists about the difficulty of tracking this particular bug down. I know this bug is real because I get phantom work units occasionally on all boxes in a fairly random way. There doesn't seem to be any particular condition that triggers the bug. I have noticed quite a few "scheduler didn't respond - retry in 1 min" messages that create a phantom work unit on the web page but this only creates one or two and not the whole eight that would be required to prevent you getting some work.

Here is what the FAQ says about it:-

"Your account" shows WorkUnit XXXXX being sent to me, but I don't have it on my computer - where did it go?

This problem occurs when your machine contacts the Einstein@Home scheduler to request work, and the Einstein@Home scheduler sends work to your machine, but the work never arrives. This can happen if the networking connection fails during the data transfer. It might also happen if your machine never gets its original assigned hostid and later gets given a different one. It might also happen because of (known and perhaps unknown) bugs in the BOINC core client.

We hope that this problem is largely fixed. If your machine is behind a proxy server or part of a Windows network that uses proxy-like translation features, then bugs in the BOINC 4.19 client may cause this problem. If you see this happening repeatedly on your machine(s) please file a report in the message boards, and perhaps try one of the later BOINC versions.

Provided that your machine is successfully completing work, uploading the results, and downloading work, the occasional lost Workunit is nothing to worry about. When it times out after the deadline, the work will simply be sent to another host machine.
[BA]

At the end of the day, as they continue to make changes to the server side software, I guess those of us still on older core clients might start to see instability due to the big differences that probably exist between 4.19, 4.25 and the 4.35-4.37 range. It's also possible that some recent scheduler change is accentuating the missing work units problem. Working with the developers rather than ranting at the project as a whole might get the problem tracked down more effectively. After all, even the "stable" versions are probably not really even beta quality.

Actually, I've just had a thought, as I was re-reading my whole message before posting. Because you have set your cache quite high (3, 4, 6, etc) the scheduler would be trying to send you a lot of work when a new day starts and it's time to give you a new lot of 8. Those whole 8 would come in one hit. If that particular transfer happened to result in a "phantom", you could easily get 8 "phantoms" and be screwed for the whole day immediately.

How reliable is your internet connection? In your messages, do you often get "scheduler failed to respond" type messages?

Just a thought.

Cheers,
Gary.

Digger
Joined: 24 Mar 05
Posts: 84
Credit: 27421
RAC: 0

Message 11526 in response to message 11524

Tweakster,

As Gary stated, your machine here is showing 34 work units ready to crunch. Obviously those work units aren't on your machine or you would not be having a problem right now. So the question is, where did they go? Did you by chance do a project reset or clean install on that particular machine that would have wiped out those work units? If that is the case, then the system still thinks you have them and they will be treated as 'no reply' when they miss the deadline. A project reset is always the very last thing that you should do, as it causes you to lose all of your current work.

The only other thing I can think of is the bug that Gary was speaking about.

There is no 'Quota 8' bug though. If you are getting that message, it is simply because the scheduler thinks you already have 34 work units on that machine.

Dig

Bruce Allen
Moderator
Joined: 15 Oct 04
Posts: 1119
Credit: 172127663
RAC: 0

I may know what the problem is. If you are using a PROXY SERVER or some other machine as a gateway to reach Einstein@Home over the network, please make sure that its timeout is set to at least 100*N seconds where N=# of CPUs on the host machine.

Please see this thread for further details:
http://einsteinathome.org/node/189125

If the proxy server timeout is smaller than this number, then what happens is that the connection is broken at the host end, BEFORE THE SCHEDULER REPLY EVER REACHES THE HOST. The scheduler reply is lost, and with it the WU that was sent to that host. The server still counts the lost WU against the host, so the host runs up against its daily quota limit.
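The rule of thumb above can be written down as a quick check. This is only a sketch: the 100*N figure is the one given in this post, the function names are hypothetical, and you still have to look up the actual timeout in your own proxy or gateway configuration.

```python
def min_proxy_timeout_seconds(n_cpus):
    """Smallest recommended proxy timeout: 100 seconds per CPU on the host."""
    if n_cpus < 1:
        raise ValueError("n_cpus must be at least 1")
    return 100 * n_cpus


def timeout_is_safe(configured_seconds, n_cpus):
    """True if the configured proxy timeout meets the 100*N rule of thumb."""
    return configured_seconds >= min_proxy_timeout_seconds(n_cpus)


if __name__ == "__main__":
    # Example values only: a 2-CPU host behind a proxy with a 60 second timeout.
    print(min_proxy_timeout_seconds(2))
    print(timeout_is_safe(60, 2))
```

If the check comes back false, raising the proxy timeout is the first thing to try before assuming anything is wrong on the server side.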

Cheers,
Bruce

Director, Einstein@Home
