trouble getting O1AS20-100T work

archae86

Joined: 6 Dec 05

Posts: 3157

Credit: 7225514931

RAC: 1049225

15 Feb 2016 21:57:26 UTC

Topic 198436

For the last several hours, my hosts seem to have surprising difficulty getting O1AS20-100T work, in a period when the server status page shows quite a bit available.

For example, the most recent server status snapshot, timestamped 15 Feb 2016, 21:30:01 UTC, shows 969 tasks to send. Yet since my host Stoll8 finished a unit of this type and returned it at 15 Feb 2016, 16:11:31 UTC, repeated work requests, both automatic and by manual update, have been denied. The request log generally shows a considerable CPU request (as the CPU queue on this host is empty), for example:

[send] CPU: req 444960.00 sec, 1.00 instances; est delay 0.00

but no CPU work is provided, as in:

[HOST#10659288] MSG(high) No work is available for Gravitational Wave search O1 all-sky tuning

I'm not begging for CPU work in general, and in fact have the option for Gamma-ray pulsar binary search #1 disabled (which is honored by the server now that the beta status has ended). If there is an intentional spreading of the work to additional hosts, which is avoiding mine since it has already processed three 1.02 tasks successfully, I'm happy to wait my turn.

I'm posting in case this is the symptom of something unintended. Over on the Technical News thread rbpeake posted about the same difficulty, presumably thinking it unexpected.
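
For readers unsure how large the request in the quoted log line is, here is a minimal sketch that pulls the requested seconds out of that one line and converts them to hours and days. The regular expression is a hypothetical pattern written only for the line quoted in this post; it is not an official BOINC log grammar, and other scheduler log lines may look different.

```python
import re

# The scheduler log line quoted in the post above.
LOG_LINE = "[send] CPU: req 444960.00 sec, 1.00 instances; est delay 0.00"

# Hypothetical pattern fitted to this one line (an assumption, not a BOINC spec).
PATTERN = re.compile(
    r"\[send\] (?P<resource>\w+): req (?P<req_sec>[\d.]+) sec, "
    r"(?P<instances>[\d.]+) instances; est delay (?P<delay>[\d.]+)"
)


def parse_request(line):
    """Return the request fields as a dict, or None if the line does not match."""
    match = PATTERN.search(line)
    if match is None:
        return None
    return {
        "resource": match.group("resource"),
        "req_sec": float(match.group("req_sec")),
        "instances": float(match.group("instances")),
        "est_delay": float(match.group("delay")),
    }


req = parse_request(LOG_LINE)
hours = req["req_sec"] / 3600
days = req["req_sec"] / 86400
print(f"{req['resource']} request: {hours:.1f} hours ({days:.2f} days) of work")
```

So the host was asking for roughly five days of CPU work for a single instance, which is consistent with an empty CPU queue.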

Gary Roberts

Moderator

Joined: 9 Feb 05

Posts: 5872

Credit: 117680215940

RAC: 35183913

16 Feb 2016 0:46:59 UTC

Message 137190

Yesterday, I set up 5 machines to get O1AST work. 4 had other work and got only a single task. The fifth had been in the process of running down its cache of FGRPB1 and ran out just as the new release became available. I had a low cache setting and so got just 4 tasks. All the tasks, 8 in total, were V1.02.

After reading your report, I have increased the cache setting on the machine with no other work type. It's asking for, but not receiving any new tasks. It looks to me like task distribution is turned off at the server in some way.

I was hoping to get a copy of the V1.03 app. Now that the info from Christian (and HB) talks about a checkpoint code problem resolved in V1.03 I'm quite reluctant to continue running V1.02. It seems to me like a complete waste of time if there is an issue resolved in a later version. I would happily abort all V1.02 tasks if I could get V1.03 to replace them. I might even try downloading V1.03 directly and tweaking the state file to turn the unstarted V1.02s into 1.03s. Maybe I can work out (remember) how to do that :-).

Cheers,
Gary.

Gary Roberts

16 Feb 2016 5:22:10 UTC

Message 137191

At 16 Feb 2016, 5:00:01 UTC, the server status page shows 853 tasks to send. That's perhaps a few (~100) tasks sent out in the last 7.5 hours. Maybe those were resends only and not V1.03 'new' tasks. I'd be fairly confident it's nothing to do with excluding anyone who has already received some 'quota'. It's more likely to be something to do with the rollout of V1.03.
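
As a quick check of the "~100 tasks in 7.5 hours" estimate, the two status snapshots quoted in this thread can be compared directly. This is only a sketch: it assumes the "tasks to send" count falls only when tasks are sent, and it ignores any workunits generated between the two snapshots.

```python
from datetime import datetime

# Snapshot times and "tasks to send" counts quoted in this thread (UTC).
t_first = datetime(2016, 2, 15, 21, 30, 1)
t_second = datetime(2016, 2, 16, 5, 0, 1)
to_send_first = 969
to_send_second = 853

elapsed_hours = (t_second - t_first).total_seconds() / 3600
net_sent = to_send_first - to_send_second      # net drop in the unsent count
rate_per_hour = net_sent / elapsed_hours

print(f"{net_sent} tasks over {elapsed_hours} hours, about {rate_per_hour:.1f} per hour")
```

A net drop of 116 tasks over 7.5 hours is a very slow drain for a project of this size, which fits the impression that distribution was being throttled in some way.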

I tried to look for and manually download V1.03 but I find it's no longer possible to browse the download directory - permission denied. Perhaps V1.03 is not there yet, even if I could have looked :-). I really don't like running apps with a known bug :-). Give me the 'not spotted yet' bugs any day! :-).

Cheers,
Gary.

Christian Beer

Joined: 9 Feb 05

Posts: 595

Credit: 188497319

RAC: 219958

16 Feb 2016 7:29:31 UTC

Message 137192

I saw that trouble getting work too. We will look into it today.

Gary Roberts

16 Feb 2016 9:08:22 UTC

Message 137193

Just got a V1.03 task and a download of the new 1.03 app. The task was a _7 resend. When finished, it will be checked against a V1.02 task that has already finished.

Since resends seem to trigger a 1.03 replacement task, I just aborted two 1.02 tasks that had not yet started. Hopefully somebody just got those as 1.03 resends :-). Anybody with 1.02 tasks not yet started should do the same.

Unfortunately, I got the 32bit 1.03 app even though I'm running 64bit Linux.

Cheers,
Gary.

archae86

16 Feb 2016 13:34:54 UTC

Message 137194

Well, not all of the 1.03 work sent out is resends of previous work.

My host Stoll6, after having many, many requests denied over the past half day on the grounds of no work available, got a "burp" of first-issue 1.03 tasks.

15 tasks all issued at the same moment, 16 Feb 2016, 4:38:17 UTC. They are sequentially numbered WUs from 239746936 through ...50. All were created at 15 Feb 2016, 17:15:42 UTC.

The thing I find curious is that the second task in the quorum for members of this burp was issued at a drip-bleed rate. For these 15 WUs, the first one issued to a quorum partner was sent at 16 Feb 2016, 4:37:20 UTC (a minute before mine), but the others dribbled out at irregular intervals over the next five hours, with the last finally being sent at 9:36:46. All were first sends--no user aborts, failures, or such to complicate the picture on this batch.

Sadly, this burst of work is probably more than Stoll6 can handle, as it is set to process a single CPU task, and this work takes a bit over 14 hours, and the deadlines are late on January 20.

Christian Beer

16 Feb 2016 15:03:18 UTC

Message 137195

I created a bunch of new work yesterday evening after 1.03 was deployed. I also created the rest of the workunits for the tuning run just now. That should speed up work distribution. If you still have trouble getting work, please check that you allow test applications in your Einstein@Home preferences and have enough disk space.

I also increased the estimated time and credits per task to better match what we saw in the beginning.

archae86

16 Feb 2016 15:24:09 UTC

Message 137196 in response to message 137195

Quote:

That should speed up work distribution.

Yes indeed. By the time you made your post here, both of my long-starved, long-requesting hosts had received a full load of work in response to automatic requests in the preceding half hour.

Just now I turned on my laptop, and it got a task on first try after booting.

So at least as seen from here, the trouble getting work is not a problem now. The change is quite dramatic.

Gary Roberts

16 Feb 2016 20:48:31 UTC

Message 137197 in response to message 137195

Quote:

... I also created the rest of the workunits for the tuning run just now.

I hope that doesn't have negative effects arising from all the extra records suddenly in the database.

Work             O1AS20-100T     FGRP4      BRP6      BRP4    FGRPB1     BRP4G       in DB
Tasks total          746,736   942,022   406,948   239,632   109,378    86,803   2,531,519
Tasks to send        741,997     2,021     4,761     8,758     3,963         1     761,501

I thought the idea was to have a few thousand 'ready to send' (like all the others) rather than 740 thousand, and then progressively insert more as needed?
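
To put numbers on how lopsided the table is, here is a small sketch that computes each search's share of all unsent tasks and the fraction of its own tasks that are still unsent. The figures are copied from the table in this post; the variable names are just for illustration.

```python
# Figures copied from the server status table quoted in this post.
tasks_total = {
    "O1AS20-100T": 746_736, "FGRP4": 942_022, "BRP6": 406_948,
    "BRP4": 239_632, "FGRPB1": 109_378, "BRP4G": 86_803,
}
tasks_to_send = {
    "O1AS20-100T": 741_997, "FGRP4": 2_021, "BRP6": 4_761,
    "BRP4": 8_758, "FGRPB1": 3_963, "BRP4G": 1,
}

total_unsent = sum(tasks_to_send.values())

for name, unsent in tasks_to_send.items():
    share_of_unsent = unsent / total_unsent       # share of the whole unsent pool
    own_unsent = unsent / tasks_total[name]       # fraction of this search still unsent
    print(f"{name:12s} {share_of_unsent:7.2%} of all unsent, {own_unsent:7.2%} of its own tasks")

o1_share = tasks_to_send["O1AS20-100T"] / total_unsent
```

On these figures the O1AS20-100T run accounts for about 97% of everything waiting to be sent, while every other search keeps only a few thousand tasks ready.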

I'm certainly not criticising - just a little surprised. If this isn't going to cause stress then great!! The more the merrier :-).

I'm also happy at the other apparent implication - no more bugs, so let's go!! :-).

Cheers,
Gary.

Gary Roberts

16 Feb 2016 21:18:33 UTC

Message 137198 in response to message 137194

Quote:

Sadly, this burst of work is probably more than Stoll6 can handle, as it is set to process a single CPU task, and this work takes a bit over 14 hours, and the deadlines are late on January 20.

Feb not Jan :-). That's a five day deadline. There must have been a quick change of heart about that since tasks on one of your other hosts show a 7 day deadline.

The important thing is that all people wanting to join the new GW action should think carefully about cache settings, particularly if concurrent GPU work causes cores to not be available. BOINC still seems to get work for the full number of cores.
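
For anyone wanting to sanity-check a cache setting against the deadline, here is a rough sketch using the figures mentioned in this thread (15 tasks received at once, about 14 hours per task, one task at a time, five day deadline). The function name and the uptime parameter are made up for illustration, and it assumes every task takes exactly the stated time and all tasks arrive together, so treat the result as an upper bound.

```python
import math


def tasks_finishable(deadline_days, hours_per_task, concurrent_tasks=1, uptime_fraction=1.0):
    """Upper bound on tasks a host can finish before the deadline.

    Hypothetical helper: assumes all tasks arrive at once and each takes
    exactly hours_per_task while the host is crunching.
    """
    available_hours = deadline_days * 24 * uptime_fraction
    rounds = math.floor(available_hours / hours_per_task)
    return rounds * concurrent_tasks


received = 15                                   # size of the burst reported for Stoll6
finishable = tasks_finishable(deadline_days=5, hours_per_task=14)
at_risk = max(0, received - finishable)
print(f"Can finish {finishable} of {received} tasks; {at_risk} would miss the deadline")

# The same host with the 7 day deadline seen on other tasks:
finishable_7d = tasks_finishable(deadline_days=7, hours_per_task=14)
```

Since the tasks actually take "a bit over" 14 hours, the real margin is tighter than this estimate suggests.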

If the availability of lots of work is going to be 'officially announced' there should be a prominent mention of the deadline and a warning about cache settings. I don't want to deal with the 'greedy Einstein doesn't play fair' complaints from people who get a bunch of work they can't handle :-).

Cheers,
Gary.

archae86

17 Feb 2016 2:27:57 UTC

Message 137199

Maybe I stumbled on a restriction, which, if real, may relate to the (now past) extreme difficulty in getting work, and a (new prediction from me) future appreciable delay in quorum fulfillment.

As it happens, I have four hosts which are AVX capable, and one which only gets the SSE2 application.

After the initial flood, both my own computation and other indications suggested that my hosts had more CPU work than they could complete, so I aborted a small number of tasks to bring the load down. But I did not reduce the requested work prefetch parameters (I will after writing this note), and BOINC has requested a dribble of additional work in the hours since then.

Here is the thing I noticed:
1. NONE of the 8 new work units issued to one of my AVX hosts has been issued to a quorum partner (the second task shows as "unsent").
2. ALL of the new work issued to my SSE2 host has been issued to a quorum partner, and in every case that quorum partner is AVX. (for work issued at 16 Feb 2016, 15:42:53 UTC and later)
3. ALL of the new work issued to my "luckier" AVX host recently has been issued either to an SSE2 host or a Linux host.

I'm not sure whether there is currently a hard and fast rule requiring quorums to be filled with a dissimilar application, which could slow things appreciably if one application is dominant, or whether these observations are accidental side effects of some unintentional oddity in the issue process.

I just offer it as a clue--perhaps such a restriction was highly desired in the early part of the beta to speed detection of a problem specific to one application variant, but perhaps might usefully be relaxed in the future.
