A Feast of Strange Resends??

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5,873
Credit: 117,957,714,882
RAC: 30,200,639
Topic 193840

As regular readers of these boards would know, I've set up the AP (anonymous platform) mechanism with a custom app_info.xml file to allow hosts of my choice to handle either the new S5R4 tasks or resends from the supposedly finished S5R3 run. Apart from an "own goal", which I'll put down to tiredness after converting a string of Linux hosts and then being stupid enough to do a Windows one without thinking properly, the conversion has been going quite smoothly for me. In fact, there seemed to be an initial lack of resends a couple of days ago, but now there seem to be plenty around.
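
For anyone curious about the mechanics, here's a rough sketch (a small Python snippet that just writes out a skeleton of the file) of the general shape such an app_info.xml takes. Every app name, file name and version number in it is a placeholder rather than my actual settings - the real entries have to match the executables actually sitting in the project directory, so treat it purely as an illustration:

    # Sketch only: writes a skeleton app_info.xml with one entry for the old S5R3
    # app and one for the new S5R4 app, so the host can run either. All names and
    # version numbers below are placeholders, not real project values.
    APP_INFO = """\
    <app_info>
      <app><name>einstein_S5R3</name></app>
      <file_info><name>einstein_S5R3_4.26_windows_intelx86.exe</name><executable/></file_info>
      <app_version>
        <app_name>einstein_S5R3</app_name>
        <version_num>426</version_num>
        <file_ref><file_name>einstein_S5R3_4.26_windows_intelx86.exe</file_name><main_program/></file_ref>
      </app_version>
      <app><name>einstein_S5R4</name></app>
      <file_info><name>einstein_S5R4_6.04_windows_intelx86.exe</name><executable/></file_info>
      <app_version>
        <app_name>einstein_S5R4</app_name>
        <version_num>604</version_num>
        <file_ref><file_name>einstein_S5R4_6.04_windows_intelx86.exe</file_name><main_program/></file_ref>
      </app_version>
    </app_info>
    """

    with open("app_info.xml", "w") as f:
        f.write(APP_INFO)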

Just a short time ago, I decided to bump up the cache on a machine that had several R4 tasks, to see what would happen. I was pleased to see it grab a resend. So I bumped the cache a little more, and each time I did I got more resends from the same frequency and with sequential sequence numbers. They were all _2 tasks - i.e. resends, as expected.

Being curious, I decided to look at a quorum or three, to see what had caused the resends. This is an example of a typical quorum. You should notice some things that are very strange about this quorum (unless I'm more tired than I think).

    * The _0 task was issued on 08 Aug. Huh?? I thought everything was sent out by approximately 01 Aug??
    * The _1 task was issued on 31 Jul. Huh?? I thought the _0 task was issued first, followed later by the _1 task??
    * The host that got the _0 task got a whole bunch of primary R3 tasks on the same day, 08 Aug, so it wasn't an isolated event??

At a quick glance there are about 8 pages of these R3 tasks, sent on 08 Aug and returned on 09 Aug with the same "client detached" message. The funny thing is that the host, having created all the resends at around 14:26 UTC, then managed to get new R4 work at 14:28 UTC. I assume that "client detached" is some form of bulk abort mechanism. My initial thinking was that the owner of the host decided to try for a bunch of work and somehow got about 160 tasks all dated the same. His daily quota should be 64, so how did he achieve that? Then I figured that he must have decided he was sick of R3 stuff, so he detached from the project and then reattached in order to get different work. But how did he keep the same hostID? Some sort of merge, I guess. Maybe some smart person can figure it all out.

Meanwhile I think I'll go feast on some more resends while the going is good. I really am curious about the _0 _1 strangeness though :-).

PS: While composing this message, I've had some thoughts about what might have happened. Here is my explanation.

The host in question was stocking up on R3 work on 31 Jul, when it was becoming increasingly difficult to make sensible contact with the server. Perhaps he was making repeated requests that were received by the scheduler but resulted in ghost tasks he didn't receive at the time. Out of frustration, he reduced his cache to a low value and processed and returned what he had, without requesting more, until 08 Aug. When he did eventually run low on work and requested more, the scheduler sent him all the "lost" work in one big hit. The person took fright at the volume of work (previously allocated to him on 31 Jul but only sent on 08 Aug) and detached his computer to dump the lot. He then reattached and was given new R4 work. He then merged the two hosts.
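
To make that guessed-at sequence a bit more concrete, here is a little toy model in Python. It is nothing like the real scheduler code and the numbers are invented; it just shows how replies that never reach the host leave "ghost" tasks assigned server-side, and how a single later request can then deliver the whole backlog in one hit, all stamped with the same sent date:

    # Toy model only - not the real BOINC scheduler logic.
    assigned = []    # server side: tasks the scheduler believes the host holds
    delivered = []   # client side: tasks the host actually received

    def scheduler_rpc(new_task_count, reply_lost=False):
        """One work request. A lost reply turns freshly assigned tasks into ghosts."""
        for _ in range(new_task_count):
            assigned.append(f"task_{len(assigned):03d}_1")
        if not reply_lost:
            # lost-result resend: everything still outstanding comes down with this reply
            delivered.extend(t for t in assigned if t not in delivered)

    # 31 Jul: repeated requests whose replies never make it back -> ghosts pile up
    for _ in range(8):
        scheduler_rpc(20, reply_lost=True)

    # 08 Aug: one request finally gets through and the whole backlog lands at once
    scheduler_rpc(0)
    print(len(delivered))   # 160 tasks, all "sent" on the same day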

The story can't be quite right as his hostID dates back to May.

Cheers,
Gary.

Winterknight
Joined: 4 Jun 05
Posts: 1,465
Credit: 377,805,796
RAC: 121,814

A Feast of Strange Resends??

I think it might be a problem with detached hosts and re-sends.
It has been seen, but not diagnosed AFAIK, that some hosts detach/re-attach at bootup, get some work, and are then also re-sent the work that was on the host before.
Why this happens, and why the host gets the same ID, I have no idea. But I think it was one reason Seti switched off the re-send tasks option - it was causing server and traffic overload on their network.

Richard Haselgrove
Joined: 10 Dec 05
Posts: 2,143
Credit: 2,965,598,855
RAC: 705,358

One clue from the task in

One clue from the task in your sample WU: it has a deadline of 18 August, so it was probably first sent out not long after it was originally created on 30 July.

The interval between 'Sent' and 'deadline' (under 10 days) suggests an earlier problem on the hosts, and a bulk resend on 8 August. I did something similar when I made that pig's ear of re-optimising R3 when rigging for the timing run for R4.

Alinator
Joined: 8 May 05
Posts: 927
Credit: 9,352,143
RAC: 0

Sigh..... Another

Sigh.....

Another interesting tidbit you can glean from looking over all the tasks for the host Gary highlighted.

It looks like Berkeley has made the information relayed back to the user on the host summary, in cases like redundant-task and unconditional aborts, even less informative and accurate than it was before.

Now it looks like both those cases show up as a missed deadline timeout!

Is it any wonder BOINC has a retention problem when the information the project gives the user sends them down the wrong troubleshooting path? This is a "don't worry about it, nothing is wrong with the host" type of event, but it gives the impression that your host stopped running the tasks for some reason.

I mean come on.... Sheesh!!

Alinator

Brian Silvers
Joined: 26 Aug 05
Posts: 772
Credit: 282,700
RAC: 0

RE: Is it any wonder BOINC

Message 84178 in response to message 84177

Quote:

Is it any wonder BOINC has a retention problem when the information the project gives the user sends them down the wrong troubleshooting path? This is a "don't worry about it, nothing is wrong with the host" type of event, but it gives the impression that your host stopped running the tasks for some reason.

Wot!? David doesn't know best?

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5,873
Credit: 117,957,714,882
RAC: 30,200,639

RE: One clue from the task

Message 84179 in response to message 84176

Quote:
One clue from the task in your sample WU: it has a deadline of 18 August, so it was probably first sent out not long after it was originally created on 30 July.

Yes, you are quite right and I didn't notice that clue.

However, in the PS bit of my message, I did figure it out despite missing the clue :-).

Here are a couple more musings/questions for the very smart people who frequent these boards.

1. For all the tasks in question on the host we are looking at, the server state is "over" and the outcome is shown as "client detached". If you click the "explain" link under the "outcome" heading, "client detached" is not actually shown as one of the defined outcomes. Am I correct in assuming that the owner actually detached from the project to cause this outcome? In other words, is there no other way of doing a "bulk abort" (as this seems to be) that would produce the same outcome?

2. If the assumption of a proper "detach" is true, does anybody else have any info about the possibility of retaining the hostID on subsequent re-attach that Winterknight mentions?

3. If you get resent a whole flock of lost tasks (as this seems to be) it would be very tedious to go through the "highlight, abort, confirm" sequence for each task individually if you needed to trim your cache. I've casually tried to highlight more than one task at a time but can't seem to do so. Is it possible to select multiple tasks and abort them simultaneously?

4. How do you explain the number of lost tasks that were created? If the user was making multiple requests for work, surely the scheduler would be attempting to resend the first lot of lost tasks rather than creating a further bunch of new tasks that could also become lost? To answer my own question, I guess it means that the user wasn't making multiple requests at all. The user probably made a single request and the scheduler just screwed up bigtime (somehow) :-).

Cheers,
Gary.

archae86
Joined: 6 Dec 05
Posts: 3,158
Credit: 7,239,023,973
RAC: 1,283,500

RE: 3. If you get resent a

Message 84180 in response to message 84179

Quote:

3. If you get resent a whole flock of lost tasks (as this seems to be) it would be very tedious to go through the "highlight, abort, confirm" sequence for each task individually if you needed to trim your cache. I've casually tried to highlight more than one task at a time but can't seem to do so. Is it possible to select multiple tasks and abort them simultaneously?


I don't know a way to do that, but I have had reasons to abort an appreciable number of tasks in queue (most recently when my queue started to accumulate a substantial number of S5R4 tasks for which the completion estimate was about a seventh of the truth).

My current favorite method is in BOINCView, which I run anyway for central viewing and control of my household flotilla.

In BOINCView, once one has highlighted the first of a sequential set of tasks to abort, and moved the mouse pointer over the "abort selected task" button, the motions are:

1. Left mouse click
2. Enter key (to confirm)
3. Down arrow key (to move to the next task)
(repeat at will)

As the mouse pointer stays on the abort icon, one can allocate three fingers from two hands and cycle along at better than a task per second, though it is wise to pause after about ten tasks to let the display catch up.

Not as good as the group select you suggest would be, but considerably better than moving the pointer back and forth between task selection and the abort button for each task in BOINCMgr.
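
For anyone happier scripting it than clicking, the boinccmd tool that ships with the client ("boinc_cmd" on older versions) can abort tasks by name, so something along these lines ought to work. It's only a rough, untested sketch: the project URL and the name filter are placeholders, and the --get_tasks output format can differ between client versions, so check the parsing against your own host first.

    # Rough sketch of a bulk abort via boinccmd - verify on your own setup before use.
    import subprocess

    PROJECT_URL = "http://einstein.phys.uwm.edu/"   # placeholder - use your attached URL
    NAME_FILTER = "_S5R3_"                          # only abort tasks whose names match this

    def list_task_names():
        """Pull task names out of 'boinccmd --get_tasks' output (format may vary)."""
        out = subprocess.run(["boinccmd", "--get_tasks"],
                             capture_output=True, text=True, check=True).stdout
        return [line.split(":", 1)[1].strip()
                for line in out.splitlines()
                if line.strip().startswith("name:")]

    for name in list_task_names():
        if NAME_FILTER in name:
            subprocess.run(["boinccmd", "--task", PROJECT_URL, name, "abort"], check=True)
            print("aborted", name)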

Since I've just played up BOINCView, let me confess that on my host it has a bad habit, once every few days, of stalling in a state that uses all of one core to no useful purpose, and that updates to it seem to have slowed or stopped some time ago.

Alinator
Joined: 8 May 05
Posts: 927
Credit: 9,352,143
RAC: 0

@ Gary: To address Items 1

@ Gary:

To address Items 1 and 2:

After looking at the time and sequence of events here, I'd say yes, the owner initiated the detach in this case. I don't know if it was intended to be a bulk abort though; it may just have been a 'shotgun' approach to dealing with the AP bug and/or screwy notices in the message log. ;-)

'Spontaneous' detaches do occur occasionally. There are a couple of known reasons why they can happen - for example, a 'bad' RPC sequence number or a bad authenticator being sent to the project from the host. However, there have been other documented cases where no satisfactory explanation was found, other than that they were loosely related to periods of backend trouble.

Regarding HIDs, yes, the behaviour was changed a while back to reuse the original HID as long as you don't do any further 'cleanup' of the BOINC folders and files before re-attaching to the project. AFAIK this happens in most spontaneous cases these days as well.
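
As a purely illustrative guess at why that works (this is not BOINC's actual code), the scheduler presumably matches an identifier still sitting in the state files on disk against its existing host records, so a re-attach that hasn't been "cleaned up" comes back with the old HID while a scrubbed one gets a new record:

    # Toy model of HID reuse on re-attach - a guess, not the real scheduler code.
    known_hosts = {"cpid-1234": 9876}   # server side: identifier -> existing host ID
    next_host_id = 10000

    def attach_request(identifier_on_disk):
        """Return the host ID the scheduler would hand back for this attach."""
        global next_host_id
        if identifier_on_disk in known_hosts:
            return known_hosts[identifier_on_disk]   # old record reused -> same HID
        host_id = next_host_id                       # unknown identifier -> new host record
        next_host_id += 1
        known_hosts[identifier_on_disk] = host_id
        return host_id

    print(attach_request("cpid-1234"))   # 9876  - same HID as before the detach
    print(attach_request("cpid-5678"))   # 10000 - a cleaned-up host gets a new HID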

The method Archae86 outlined is the one I use for 'bulk' aborting when needed. I imagine the reason multiple selection is disabled is to discourage CWCP (credit whore cherry picking), although there really isn't much point to that here on EAH. ;-)

For Item 4... LOL...

Well, I guess all you can say is that when the backend goes insane, just about anything can happen and all bets are off! :-D

@ Archae86:

Hmmm...

Yes, there are a couple of oddities, similar to the one you've outlined, that I've noticed about BV when you run it long term.

It seems to me they are related to cases where the reply from the CCs is very large, or gets slowed down, delayed, and/or interrupted for some reason.

Going through the Host setup dialog for the one affected usually takes care of the problem; if not, a restart of BV does the trick.

Alinator

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5,873
Credit: 117,957,714,882
RAC: 30,200,639

Thanks for the replies,

Thanks for the replies, guys.

I used BV about three years ago, before the farm started multiplying. It was fine with a few hosts but became impossible to use once the number of hosts exceeded some relatively small (by my standards anyway) threshold. Everything just bogged down very quickly and I abandoned using it. For a while I did keep a lookout for new versions that might perform better, but I got the impression that there wasn't much new development going on. I haven't looked for quite a while now.

These days most hosts largely run unattended, and I use BM if I need a quick look at what is going on. At major transitions like the present one, I'll physically visit each host and do a bit of a cleanup if required before loading any new stuff from a network share. I've still got more to do, and I'm quite surprised how well the unprepped machines are doing at maintaining a supply of resends only. There was a relative drought right at the start, but there seem to be a few around at present.

Cheers,
Gary.

MarkJ
Joined: 28 Feb 08
Posts: 437
Credit: 139,002,861
RAC: 0

RE: 3. If you get resent a

Message 84183 in response to message 84179

Quote:
3. If you get resent a whole flock of lost tasks (as this seems to be) it would be very tedious to go through the "highlight, abort, confirm" sequence for each task individually if you needed to trim your cache. I've casually tried to highlight more than one task at a time but can't seem to do so. Is it possible to select multiple tasks and abort them simultaneously?

I thought BOINC 6.2 introduced (or re-introduced) the ability to select multiple tasks and abort them in one go.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5,873
Credit: 117,957,714,882
RAC: 30,200,639

RE: I thought BOINC 6.2

Message 84184 in response to message 84183

Quote:
I thought BOINC 6.2 introduced (or re-introduced) the ability to select multiple tasks and abort them in one go.

I wouldn't have a clue :-).

I've no intention of living on the BOINC bleeding edge. I'll perhaps go there once it's been stable for a month or two :-).

I don't think it's particularly encouraging to see different version numbers for different platforms. You also wonder about stability a bit when they leave the previous recommended version there as well - sort of like suggesting you might need a backup :-).

Cheers,
Gary.
