scheduler got bird flu?

M. Schmitt

Joined: 27 Jun 05

Posts: 478

Credit: 15872262

RAC: 0

6 Oct 2005 10:22:29 UTC

Topic 189970

Hi,

How else can you explain that a single Athlon XP 2000+
can download 1608 WUs?

http://einsteinathome.org/host/421434

Does anyone have a better explanation?

cu,
Michael

Gary Roberts

Moderator

Joined: 9 Feb 05

Posts: 5872

Credit: 117345244677

RAC: 35860425

scheduler got bird flu?

6 Oct 2005 12:43:54 UTC

Message 17747

Quote:

Does anyone have a better explanation?

I would guess that the user concerned had some sort of software glitch that managed to create about 200 new CPUIDs (since 8 x 200 = 1600). I have a vague recollection of reading somewhere about a proposal to manage runaway CPUID creation by limiting the number that could be created at one go (maybe per day, I don't remember) to 200. Anyway it looks suspiciously like that.
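As a rough illustration of that arithmetic, here is a minimal sketch. The quota of 8 per day comes from this thread; the cap of 200 new host IDs is only the unverified recollection mentioned above, and the function names are made up for the example.

```python
# Figures from the discussion above; the 200-host cap is an unverified recollection.
DAILY_QUOTA_PER_HOST = 8
ASSUMED_NEW_HOST_CAP = 200

def results_per_day(host_ids: int, quota: int = DAILY_QUOTA_PER_HOST) -> int:
    """Most results the server would issue in one day across this many host IDs."""
    return host_ids * quota

def host_days_needed(results: int, quota: int = DAILY_QUOTA_PER_HOST) -> int:
    """Smallest number of host-days of quota that covers this many results."""
    return -(-results // quota)  # ceiling division

print(results_per_day(ASSUMED_NEW_HOST_CAP))  # 1600
print(host_days_needed(1608))                 # 201
```

So the 1608 results reported in the first post correspond to 201 host-days of quota, which is close to the recalled figure but does not by itself prove where the results came from.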

I'm guessing again, but the user in question now shows only one CPUID, the one you listed, so it looks like the user might be on to it and has already merged all the phantom CPUIDs. I don't know for sure whether there is any other way a batch of phantoms could be reduced to one without user intervention.

If this is the case, the user is probably scratching his head and wondering how to get rid of (i.e. abort) all the excess that won't make it before the deadline. I imagine he might have a potential disk space problem as well. 200 new large data files in rapid succession might have caused a bit of strain :).

Cheers,
Gary.

eberndl

Joined: 18 Jan 05

Posts: 43

Credit: 98691

RAC: 0

You can't merge computers

6 Oct 2005 13:12:58 UTC

Message 17748

You can't merge computers that have work out though, so I don't think it can be multiple merged computers.

But if it is, then the person would only see the WUs from the computer they were all merged into. If you detach and reattach, you often can't use the WUs you had previously downloaded, because they are assigned to a differently numbered computer.

Gary Roberts

Moderator

Joined: 9 Feb 05

Posts: 5872

Credit: 117345244677

RAC: 35860425

The WU limit of 8 per day

6 Oct 2005 13:44:38 UTC

Message 17749

The WU limit of 8 per day will stop a single runaway CPU in its tracks. I remember seeing a comment from Bruce several months ago about runaway CPUID creation and the potential to "empty the store" if something wasn't done. I'm sure I saw a figure of 200 as a limit on CPUID creation. This would still allow a fairly sizeable farm to be added without restriction :).

OK, you have a good point about merging. However, the user in question only has one CPUID, created very recently, so I don't know how else to explain it other than to say some sort of merging seems to have occurred because all 1600+ results are now shown on the one CPUID. As far as detaching/reattaching is concerned, I've no experience of what happens there. I've never had to do it.

Actually, I've just noticed that there is another problem for the user. His version of BOINC is 4.19 which, as far as I know, doesn't have a work unit abort. I think he will basically have to ditch the lot. With the 14 day deadline, hopefully he will not do anything rash but will come looking for help. Someone may know a solution to his problem. Unfortunately I don't know of one off the top of my head.

Cheers,
Gary.

Wurgl (speak^Wc...

Joined: 11 Feb 05

Posts: 321

Credit: 140550008

RAC: 0

This is a typical example of

6 Oct 2005 14:01:53 UTC

Message 17750

This is a typical example of a user who set 'Connect to network about every X days' to a value much larger than 1.0. I guess it is set to the maximum of 10.

Why do I think so? Simple: check out the WUs. You always see a set of WUs which are downloaded within a short time, then a gap, then the next group.
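One way to check for that pattern is to group the "sent" times from the results list into bursts separated by long gaps. The sketch below is only an illustration: the timestamps are invented, and the one-hour gap threshold is an arbitrary assumption.

```python
from datetime import datetime, timedelta

def group_into_bursts(sent_times, max_gap=timedelta(hours=1)):
    """Group timestamps into bursts; a new burst starts after a gap longer than max_gap."""
    bursts = []
    for t in sorted(sent_times):
        if bursts and t - bursts[-1][-1] <= max_gap:
            bursts[-1].append(t)   # close enough to the previous download
        else:
            bursts.append([t])     # long gap, start a new burst
    return bursts

# Invented sample data, not taken from the host in question.
sent = [
    datetime(2005, 10, 1, 9, 0),
    datetime(2005, 10, 1, 9, 2),
    datetime(2005, 10, 1, 9, 5),
    datetime(2005, 10, 4, 9, 1),
    datetime(2005, 10, 4, 9, 3),
]
for burst in group_into_bursts(sent):
    print(burst[0].date(), len(burst))
```

A few large bursts separated by multi-day gaps would fit the long connect interval explanation; many small, evenly spread downloads would not.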

I had a similar experience when I set the connect interval to 3. When I checked the next day, I already had 20 WUs waiting on my machine. Luckily I could set the value back, so only one WU on my slowest machine was lost.

This is a very bad bug in the BOINC client.

And I think this has nothing to do with a farm of machines, merging, or anything similar. I am sure anyone can reproduce this behavior just by setting the connect interval to any value reasonably larger than 1.0.

Gary Roberts

Moderator

Joined: 9 Feb 05

Posts: 5872

Credit: 117345244677

RAC: 35860425

The person in question is

6 Oct 2005 14:26:32 UTC

Message 17751

The person in question has actually been on the boards recently, and I've asked him to join this thread so we can ask him some questions. I noticed him completely by chance.

I've had a look again at his list of results. Notice the completed result is dated 5 Oct 2005 11:43:47 UTC. That means to me that all the ones dated prior to that date are ghosts and are actually not on his system. I think that version 4.19 always does the oldest results first so he shouldn't have any older than that. So maybe he doesn't have too much of a problem after all. Version 4.19 does not have the handshaking that would download the ghost WUs so he can just forget about them and allow them to expire eventually.

If you look through a few pages of his results you will see many examples of more than 8 per day. So even if he set his connect interval to 10, how can he get more than 8 per day? The server will simply not give him more than 8 per day unless it thinks he has multiple CPUs or multiple CPUIDs. For example, notice all the ones dated 01 October. I'll be shocked if the person concerned deliberately set the value as high as 10. I think it's best if we ask the user to tell us what happened, and then maybe we can assist.
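The per-day check described above can be sketched as a simple count of results by "sent" date. The quota of 8 is the figure quoted in this thread; the sample timestamps are invented for illustration and are not the real results list.

```python
from collections import Counter
from datetime import datetime

DAILY_QUOTA = 8  # per-host daily limit quoted in this thread

def days_over_quota(sent_times, quota=DAILY_QUOTA):
    """Return {date: count} for every day on which more than `quota` results were sent."""
    per_day = Counter(t.date() for t in sent_times)
    return {day: n for day, n in per_day.items() if n > quota}

# Invented sample: 10 results on 1 Oct 2005, 3 results on 2 Oct 2005.
sent = [datetime(2005, 10, 1, h) for h in range(10)]
sent += [datetime(2005, 10, 2, h) for h in range(3)]

for day, n in days_over_quota(sent).items():
    print(day, n)
```

Any day this flags cannot be explained by the connect interval alone, since that setting does not raise the quota.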

Cheers,
Gary.

Twosheds

Joined: 18 Jan 05

Posts: 1405

Credit: 3548147

RAC: 0

Thanks for raising this, it

6 Oct 2005 14:26:54 UTC

Message 17752 in response to message 17750

Thanks for raising this. It happened in the past but then appeared to cure itself without any intervention.

I've looked at my settings... I changed connect to network from 3 to 1 day(s).

According to my account, that gives me 2 WUs per day to crunch.

The only problem is that, whether I have the days set to 1 or 3, no WU reaches my machine. But as you can see from my record, I should have xx WUs stacked up waiting to be crunched.

Gary Roberts

Moderator

Joined: 9 Feb 05

Posts: 5872

Credit: 117345244677

RAC: 35860425

Yes you have an extremely bad

6 Oct 2005 14:31:47 UTC

Message 17753

Yes, you have an extremely bad problem with ghosts. How many results do you actually have on your machine at the moment?

Cheers,
Gary.

Gary Roberts

Moderator

Joined: 9 Feb 05

Posts: 5872

Credit: 117345244677

RAC: 35860425

Actually, I think your

6 Oct 2005 14:34:31 UTC

Message 17754

Actually, I think your problems may have sorted themselves out. You've just uploaded another successful result and the server has upped your allocation to 4/day. It was 2/day last time I looked.

Cheers,
Gary.

Twosheds

Joined: 18 Jan 05

Posts: 1405

Credit: 3548147

RAC: 0

2 actual returned results

6 Oct 2005 14:36:52 UTC

Message 17755 in response to message 17753

2 actual returned results followed by xxx ghost results

I live in hope Gary!

Gary Roberts

Moderator

Joined: 9 Feb 05

Posts: 5872

Credit: 117345244677

RAC: 35860425

Ghost results are a bit of a

6 Oct 2005 14:46:24 UTC

Message 17756

Ghost results are a bit of a mystery. They come about when the BOINC client asks for work and the server tries to send it, but the client never receives it. I think the client basically keeps asking for work and the server keeps sending more, which in bad cases like yours produces more ghosts until your daily limit of 8 is reached. The server then tells the client to get lost, because it thinks the client has 8 results, while in bad cases the poor client actually has none. It's hard to know how your list on the server got to 1600, though. I'll ask Bruce to have a look at it, but I imagine his response might be something like "We've addressed these problems in later versions so please upgrade".

I think the problem is addressed in 4.45, and surely by 4.72. By all accounts we are getting near the release of 5.2.x, so support for 4.x will probably drop away fairly quickly. If your machine keeps crunching normally now, you might just wait for the expected new stable client.
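A toy model of that loop, purely as a sketch: it is not the real BOINC scheduler logic, and the function name, the `wanted` parameter and the `reply_lost` callback are all hypothetical. It only shows how lost replies can leave the server's count at the daily limit while the client holds nothing.

```python
def simulate_requests(daily_quota=8, wanted=1, reply_lost=lambda attempt: True):
    """Toy model: the client keeps asking until it holds `wanted` results
    or the server's daily quota for the host is used up."""
    server_count = 0   # results the server believes it has sent today
    client_count = 0   # results that actually arrived on the client
    attempt = 0
    while client_count < wanted and server_count < daily_quota:
        attempt += 1
        server_count += 1              # server records the result as sent
        if not reply_lost(attempt):
            client_count += 1          # reply arrived, client has the work
    ghosts = server_count - client_count
    return server_count, client_count, ghosts

# Every reply lost: the server thinks the client has 8, the client has none.
print(simulate_requests())  # (8, 0, 8)

# First three replies lost, fourth arrives: three ghosts, one real result.
print(simulate_requests(reply_lost=lambda attempt: attempt <= 3))  # (4, 1, 3)
```

Note that this model tops out at the daily quota, so on its own it cannot account for a list of 1600.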

Cheers,
Gary.
