"Unsent" units

VQ-2 Ghost
Joined: 6 May 06
Posts: 3
Credit: 49770
RAC: 0
Topic 191569

Greetings,

I have 2 computers in this account; they are both happily and slowly crunching away.

What I don’t understand is why some of the work that the faster computer has already crunched and returned is not being sent to any other computer(s) and remains with a status of “Unsent” for days on end.

While we are all in this mostly for the science, seeing the credit rise is also a sort of reward for the work our computers do, and it serves as an incentive to push forward.

When our computers crunch away and the credit does not rise in step with the work done, it feels more like a letdown and the satisfaction is not what it could be.

Having said all that, I have not found a concise and satisfactory answer in this newsgroup; maybe this time someone can shed more light on the subject.

What am I missing here that will keep me in this project?

For five examples of what I am concerned about, please see the links below.

http://einsteinathome.org/workunit/10879982

http://einsteinathome.org/workunit/10833910

http://einsteinathome.org/workunit/10812345

http://einsteinathome.org/workunit/10749932

http://einsteinathome.org/workunit/10686168

Thanks!

Alinator
Joined: 8 May 05
Posts: 927
Credit: 9352143
RAC: 0

"Unsent" units

Essentially, EAH raw data is divided into a number of prepackaged "packs". That's the large data file you DL periodically, and it's sent out to a number of users. That data pack is further subdivided into a number of individual results, which are then crunched and returned by the hosts that have the pack onboard.

What I think you are seeing here is that the initial distribution of the main pack is limited to some number of hosts. If you plow through your results quickly, you may end up having to wait for the other hosts working on that pack to finish the results they have and request more work out of it.
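If it helps to see the scheme concretely, here is a rough sketch of the naming side of it: one big pack, many workunits, and each workunit issued as numbered result copies. The pack name, the counts, and the two-copies-per-workunit quorum are made-up example values, not the project's actual parameters.

# Rough sketch only: one big data pack carved into workunits, each issued
# as numbered result copies (_0, _1, ...). Names and counts are invented.
def results_for_pack(pack_name, n_workunits, copies_per_wu=2):
    """List the result names a scheduler could hand out for one pack."""
    names = []
    for wu in range(n_workunits):
        wu_name = f"{pack_name}__{wu}"             # e.g. h1_0123.4__2
        for copy in range(copies_per_wu):
            names.append(f"{wu_name}_{copy}")      # e.g. h1_0123.4__2_0
    return names

for name in results_for_pack("h1_0123.4", 3):
    print(name)

Your fast host can burn through its share of those names long before the rest of the list has even been handed out, and the leftover copies are what show up as "Unsent" on the web pages.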

HTH,

Alinator

VQ-2 Ghost
Joined: 6 May 06
Posts: 3
Credit: 49770
RAC: 0

If what you think may be

Message 42516 in response to message 42515

If what you think may be happening with those units is in fact the reason for the delay in sending them to be crunched by others, then the delay makes sense.

Thanks for your prompt response.

DanNeely
Joined: 4 Sep 05
Posts: 1364
Credit: 3562358667
RAC: 197

You probably just got stuck

You probably just got stuck with several P3-500 class systems that take several days to do a WU, or a few noobs who only took a single WU and left. Eventually the scheduler will realize these WUs are unattended and bring more PCs onto the data.

archae86
Joined: 6 Dec 05
Posts: 3145
Credit: 7057084931
RAC: 1614426

RE: You probably just got

Message 42518 in response to message 42517

Quote:
You probably just got stuck with several P3-500 class systems that take several days to do a WU, or a few noobs who only took a single WU and left. Eventually the scheduler will realize these WUs are unattended and bring more PCs onto the data.

Ummm, no Dan, not for the "Unsent" phenomenon reported by VQ-2 Ghost and still viewable on some of the links supplied.

"Unsent" on the only other result for the WU means he is waiting for the scheduler to send out the _1 result for that WU for the very first time.

to VQ-2 Ghost:

If your host prefetches results much faster than all the other hosts currently working on the same major datafile combined, you'll notice you are downloading sequential results, all of the _0 flavor. This is a big clue that you are "ahead of the pack" and likely to wait a while. The next major datafile will be a whole new ballgame.
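If you want a quick way to spot that from your own task list, a crude check like the sketch below does it. The result names are invented examples; only the _0/_1 suffix convention matters here.

# Crude "ahead of the pack" check: if nearly everything you download is the
# _0 (first) copy of its workunit, no wingman has been assigned yet and the
# matching _1 copies will sit at "Unsent" for a while. Example names only.
def mostly_first_copies(result_names, threshold=0.9):
    first = sum(1 for r in result_names if r.endswith("_0"))
    return bool(result_names) and first / len(result_names) >= threshold

recent = ["h1_0123.4__40_0", "h1_0123.4__41_0", "h1_0123.4__42_0"]
if mostly_first_copies(recent):
    print("Almost all _0 results: expect a wait before credit is granted.")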

The good news is that in S5 this just affects how long you wait for confirmation and credit. In S4, if you got in with a pack of low claimers, it could suppress your awarded credit for weeks' worth of work. It happened to me, perhaps in February, when one of my hosts spent a whole month processing results from one major datafile and getting low awarded credit.

I even tried suspending the project and deleting the major datafile from the project directory, hoping to get assigned a new one with middle-of-the-road partners, but the client just noticed the file was missing and downloaded it again.

This too will pass, eventually.

VQ-2 Ghost
Joined: 6 May 06
Posts: 3
Credit: 49770
RAC: 0

Thanks for some great

Message 42519 in response to message 42518

Thanks for some great answers; this kind of delay is now understood!

By sheer coincidence, a crunched unit my computer returned on July 9th, which had just been sitting there looking pretty, has now been sent to some other computer for crunching since my first post on this subject.

Cheers!

Idefix
Joined: 21 Mar 05
Posts: 11
Credit: 43293
RAC: 0

Hi archae86, RE: I even

Message 42520 in response to message 42518

Hi archae86,

Quote:
I even tried suspending the project and deleting the major datafile from the project directory, hoping to get assigned a new one with middle-of-the-road partners, but the client just noticed the file was missing and downloaded it again.

Did you delete the corresponding entry in the client_state.xml as well? If not, you are still reporting "I have data file x, please send me workunits for this data file."
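If you ever do want to try it, something along these lines is what I mean. It is only a rough sketch: it assumes the data file shows up in client_state.xml as a <file_info> element with a <name> child, which you should verify against your own file, and you should stop BOINC and keep a backup of client_state.xml before touching anything.

# Rough sketch only; element names are an assumption, so check your own
# client_state.xml first, stop BOINC, and keep a backup.
import xml.etree.ElementTree as ET

def drop_file_reference(state_path, file_name):
    tree = ET.parse(state_path)
    for parent in list(tree.getroot().iter()):
        for info in list(parent.findall("file_info")):
            if info.findtext("name") == file_name:
                parent.remove(info)   # stop advertising "I have data file x"
    tree.write(state_path)

# drop_file_reference("client_state.xml", "h1_0123.4")   # example names only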

Regards,
Carsten

archae86
Joined: 6 Dec 05
Posts: 3145
Credit: 7057084931
RAC: 1614426

RE: Did you delete the

Message 42521 in response to message 42520

Quote:
Did you delete the corresponding entry in the client_state.xml as well?
Regards,
Carsten

No, I did not. Back then I had neither the insight nor the courage to tamper with client_state.xml at all. Should I ever see the need to try this again, I'll take a look there.

These days I do go into client_state.xml to rebalance short-term vs. long-term debt differences to avoid anomalous prefetch behavior, and once or twice I have done more ambitious things. So far I've not burned myself, though I think it is risky.
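For anyone curious, the kind of rebalance I mean is roughly the sketch below, which simply sets the two debts back to zero. It assumes each <project> block in client_state.xml carries <short_term_debt> and <long_term_debt> elements; that may differ with client version, so check your own file, stop the client first, and keep a backup.

# Purely illustrative; element names are an assumption and may vary by
# client version. Run only with the client stopped and a backup on hand.
import xml.etree.ElementTree as ET

def zero_debts(state_path):
    tree = ET.parse(state_path)
    for proj in tree.getroot().findall("project"):
        for tag in ("short_term_debt", "long_term_debt"):
            el = proj.find(tag)
            if el is not None:
                el.text = "0.000000"   # no project owed or owing work
    tree.write(state_path)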

Thanks for the tip,
Peter

DanNeely
Joined: 4 Sep 05
Posts: 1364
Credit: 3562358667
RAC: 197

RE: RE: You probably just

Message 42522 in response to message 42518

Quote:
Quote:
You probably just got stuck with several P3-500 class systems that take several days to do a WU, or a few noobs who only took a single WU and left. Eventually the scheduler will realize these WUs are unattended and bring more PCs onto the data.

Ummm, no Dan, not for the "Unsent" phenomenon reported by VQ-2 Ghost and still viewable on some of the links supplied.

"Unsent" on the only other result for the WU means he is waiting for the scheduler to send out the _1 result for that WU for the very first time.

To minimize the number of times a file is sent, only a few PCs are given it, with the expectation that they'll eventually request all the remaining work units in it. If they're all extremely slow or drop the project, some of the work units in the datafile will be stuck in the "Unsent" state until the scheduler realizes there's a problem and sends the data to another box.
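In toy form, the re-issue idea looks something like the sketch below. The real scheduler's rules and thresholds aren't public as far as I know, so the one-week cutoff and everything else here is invented purely for illustration.

# Toy illustration of re-issuing stalled work; the cutoff is made up.
from datetime import datetime, timedelta

STALE_AFTER = timedelta(days=7)   # invented threshold

def needs_extra_host(unsent_since, now=None):
    """True once an "Unsent" result has waited long enough that shipping
    the big data file to an additional host is worth the bandwidth."""
    now = now or datetime.now()
    return now - unsent_since > STALE_AFTER

print(needs_extra_host(datetime.now() - timedelta(days=10)))   # True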

Alinator
Joined: 8 May 05
Posts: 927
Credit: 9352143
RAC: 0

RE: Hi archae86,RE: I

Message 42523 in response to message 42520

Quote:
Hi archae86,
Quote:
I even tried suspending the project and deleting the major datafile from the project directory, hoping to get assigned a new one with middle-of-the-road partners, but the client just noticed the file was missing and downloaded it again.

Did you delete the corresponding entry in the client_state.xml as well? If not, you are still reporting "I have data file x, please send me workunits for this data file."

Regards,
Carsten

Of course, that only exacerbates the "Unsent" stalling for the other hosts still working on the data pack, by forcing the server scheduler to intervene and distribute it to additional hosts.

I'm not sure how long it takes before the scheduler decides progress is too slow for a given data pack and sends it out again (or even how many hosts get it initially for that matter).

Alinator
