FGRP4 Observations and Problems

archae86
Joined: 6 Dec 05
Posts: 3157
Credit: 7231601375
RAC: 1160659

Quote:
I don't think there will be any substantial difference between standard (full running) tasks labelled 0016.0 or 0048.0 or 0080.0, etc, once the production run starts up.


That may well be, and I certainly respect your grasp of the previous history. However, during the beta testing of the last week, whether by accident or not, there does seem to have been a very strong association of run time, credit claim, and credit award with that number, aside from a couple of big bumps in awarded credit that may have been intentional adjustments applied between some issue batches.

Regarding the batches created on August 30, I observed a 15% run-time increase for the single 112.0 unit I handled compared to two 80.0 units running on the same host with throttling disabled. By contrast the same host processed 80.0 units issued two days earlier with closely comparable run-time.

All four August 30 issue 80.0 units for which I have received credit so far were awarded 1215.68 (vs. 607.84 for apparently similar units created on August 28), while the single 112.0 unit was awarded 1386.0.
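As a rough check on the figures above, here is a minimal sketch comparing credit to run time. It assumes the 15% run-time increase from the single 112.0 unit is representative, which one sample cannot establish; the variable names are illustrative only.

```python
# Figures quoted in the post above; the 15% run-time increase comes from a
# single 112.0 unit, so treat it as an assumption rather than a measurement.
credit_80_aug28 = 607.84
credit_80_aug30 = 1215.68
credit_112_aug30 = 1386.0
runtime_ratio_112_vs_80 = 1.15

# Credit change for apparently similar 80.0 units between the two issue dates
batch_ratio = credit_80_aug30 / credit_80_aug28

# Credit of the 112.0 unit relative to an 80.0 unit from the same issue date
credit_ratio = credit_112_aug30 / credit_80_aug30

# Credit per unit of run time for 112.0 relative to 80.0
credit_per_runtime = credit_ratio / runtime_ratio_112_vs_80

print(f"80.0 credit, Aug 30 vs Aug 28: x{batch_ratio:.2f}")
print(f"112.0 vs 80.0 credit (Aug 30): x{credit_ratio:.3f}")
print(f"112.0 vs 80.0 credit per run time: x{credit_per_runtime:.3f}")
```

On these numbers the 112.0 unit earned about 14% more credit for about 15% more run time, so credit per hour on this host would be nearly the same for the two labels.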

archae86
Joined: 6 Dec 05
Posts: 3157
Credit: 7231601375
RAC: 1160659

Quote:
Quote:
... while the Workunits Total has been stable at 2050 for several hours. That number is better to watch as a sign of new production than is the Tasks total, as Tasks total grows with resends.

Yes, indeed. I noticed that the initial work allocation several days ago was 1900 WUs - ie, 3800 primary tasks. The latest release was a further 150 WUs (300 tasks), giving the total of 2050 WUs all up (4100 primary tasks). At the time of writing, tasks total was 6541 which means that so far there have been 6541-4100=2441 resends.


Something has changed, and today the numbers which seem to indicate currently active "out in the field" work are behaving normally, for example "tasks in progress" and "workunits without canonical result" are both slowly but steadily reducing.

However, most of the numbers that one supposes represent cumulative behavior since the beginning of FGRP4 have been dropping irregularly but substantially over the last few hours, including "tasks total", "workunits total", "tasks invalid", "tasks failed" ...

Possibly this is the outcome of some sort of cleanup process being done in preparation for production release--or maybe not.

As of the 1 Sep 2014 16:10:02 UTC snapshot, FGRP4 shows these:
Tasks total: 3162
Tasks valid: 2010
Tasks invalid: 334
Tasks failed: 329
Workunits total: 1235

archae86
Joined: 6 Dec 05
Posts: 3157
Credit: 7231601375
RAC: 1160659

As Bernd announced in the FGRP4 thread on Technical News, there is new FGRP4 work available.

However, what I can see does not obviously match his announcement. The work that three of my hosts have downloaded and started running in the last few hours bears creation dates between 9:26 and 9:51 UTC on September 2 and has names starting with "fgrp4_test_112.0". Both the task lists on my user account web pages and boincmgr on my machines show it as running version 1.03, with an FGRP4-beta designation.

Bernd's post, on the other hand, speaks of an initial release of "1000 workunits (2000 tasks) that will be run by the 1.04 app version, i.e. 'non-Beta'", and is time-stamped 2 Sep 2014 9:59:27 UTC.

Possibly there were two batches, closely spaced in time, or possibly...

Obviously I have no indication of credit yet, but initial indications on execution time requirement are toward the high end of previous FGRP4 batches, possibly the same as the small batch of "112.0" labelled work released three days ago.

On all three of my hosts running this work, jobs in this batch arrive with execution time predictions far over the truth. I realize that (through DCF and perhaps one or more other parameters) this is partly host history dependent, so other users may see different behavior. The deadlines are 6-day deadlines--so like the previous late-beta work, but shorter than common at Einstein for production work. On my hosts the combination of excessive completion estimate and moderately short deadline has kicked these tasks into High Priority execution, pre-empting earlier downloaded CPU tasks.

archae86
Joined: 6 Dec 05
Posts: 3157
Credit: 7231601375
RAC: 1160659

Quote:

Bernd's post, on the other hand, speaks of an initial release of "1000 workunits (2000 tasks) that will be run by the 1.04 app version, i.e. 'non-Beta'", and is time-stamped 2 Sep 2014 9:59:27 UTC.

Possibly there were two batches, closely spaced in time, or possibly...


The nine initial jobs I received from this batch bore "sent" times from 2 Sep 2014 10:17:05 UTC to 16:54:06 UTC, and all have run as version 1.03. However, when one host finished its first batch of three, it received three more between 3 Sep 2014 2:11:57 UTC and 2:14:03 UTC. A version 1.04 program was sent as part of the download process for these, and they are running as 1.04 right now. Where the v1.03 stuff carried a designation of "FGRP4-Beta", these latest units carry "FGRP4-SSE2".

This seems to mean the send process got adjusted, as the quorum partners for two of these units received them earlier as v1.03 work.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5872
Credit: 117830221732
RAC: 34733907

Quote:
This seems to mean the send process got adjusted, as the quorum partners for two of these units received them earlier as v1.03 work.


I just had a quick look at one of your hosts and saw some v1.04 tasks - ie the new 1.04 non-beta app. I was interested to see the task names still included the 'fgrp4_test' string so I presume your host didn't need to download a new data file and that the 'fgrp4_test.dat' file is still being used rather than some new file with a name something like 'LATeah011x?.dat'. I presume the test data is 'real' (and not previously processed) data and that if the new app is returning 'good' answers, perhaps the test data will continue to be processed until exhausted. If it was 'old' or 'test' data, I presume the data file would have changed with the advent of the production app.

I have been trying to get v1.04 work for the last several hours on a particular host (which is also enabled for FGRP3). Most of the time the response is "no work available" for both but occasionally I get some further FGRP3 resends. This is despite there being some FGRP4 'ready to send' as per the server status page.

I've looked (quite a few times) at the scheduler logs. When I get FGRP3 work, it shows "checking plan class FGRPSSE" or something like that and FGRP3 work is sent. When no work is sent, there isn't any particular reason given apart from "no work available" even though there is always FGRP4. I've not seen any evidence of a different FGRP4 specific plan class. I'm guessing that even though v1.04 is the 'production' app, you won't get any work for it (at the moment) if you don't have the 'accept beta test work' preference setting turned on. Mine is off and I don't intend to turn it back on until I'm sure there are no further DCF related nasty surprises :-).

I would be very interested to know if anyone has received FGRP4 1.04 tasks without 'beta test' being turned on.

Cheers,
Gary.

archae86
Joined: 6 Dec 05
Posts: 3157
Credit: 7231601375
RAC: 1160659

Gary, I've also looked at a lot of my scheduler logs in the last day. In hopes of getting more FGRP4 work, I had disabled other CPU applications and raised my requested queue, so my CPU work request was commonly quite large. The server always showed work available, yet most of the time no work was provided.

So far as I can tell, all my work so far has been expressly provided under the beta classification. When I have not gotten work, the scheduler log has contained a (to me) undecipherable calculation purporting to show that the host was too slow to have any chance of finishing the work on time.

Here are selected lines from my most recent scheduler log for a request which got work:

2014-09-03 03:12:25.0954 [PID=17477]    [send] [HOST#6566435] will accept beta work.  Scanning for beta work.
2014-09-03 03:12:25.0956 [PID=17477]    [send] [HOST#6566435] beta work found: [RESULT#453811661]
2014-09-03 03:12:25.0958 [PID=17477]    [version] Checking plan class 'FGRP4-Beta'
2014-09-03 03:12:25.0970 [PID=17477]    [version] reading plan classes from file '/BOINC/projects/EinsteinAtHome/plan_class_spec.xml'
2014-09-03 03:12:25.0970 [PID=17477]    [version] numerical Windows version: 601760100 (Microsoft Windows 7 Home Premium x64 Edition, Service Pack 1, (06.01.7601.00))
2014-09-03 03:12:25.0971 [PID=17477]    [version] plan class ok
2014-09-03 03:12:25.0971 [PID=17477]    [version] Checking plan class 'FGRP4-SSE2'
2014-09-03 03:12:25.0971 [PID=17477]    [version] plan class ok
2014-09-03 03:12:25.0971 [PID=17477]    [version] Best version of app hsgamma_FGRP4 is ID 621 (1.44 GFLOPS)
2014-09-03 03:12:25.0971 [PID=17477]    [send] [HOST#6566435] [WU#198116184 fgrp4_test_112.0_0_-6.52e-10] using delay bound 518400 (opt: 518400 pess: 518400)

2014-09-03 03:12:25.0987 [PID=17477] [send] [HOST#6566435] Sending app_version 621 hsgamma_FGRP4 2 104 FGRP4-SSE2; 1.44 GFLOPS
2014-09-03 03:12:25.1004 [PID=17477] [send] est. duration for WU 198116184: unscaled 365282.25 scaled 945583.05
2014-09-03 03:12:25.1004 [PID=17477] [HOST#6566435] Sending [RESULT#453811661 fgrp4_test_112.0_0_-6.52e-10_0] (est. dur. 945583.05 seconds, delay 518400, deadline 1410232345)

Unfortunately, I don't have a log to share from one of the requests rejected on the grounds that the host was too slow.

The available work from this batch is getting passed out very, very slowly. It may be that it is beta only, that not many hosts are set to accept beta, and that most requests are getting denied for that curious deadline compliance calculation I've seen but don't understand.
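The numbers in the log excerpt above can be cross-checked with a short sketch. It assumes the log timestamps are UTC and that the deadline is simply the send time plus the delay bound; what the "scaled" factor represents (DCF or something else) is not stated in the log, so it is only computed here, not explained.

```python
from datetime import datetime, timezone

# Values copied from the scheduler log lines above.
# Assumption: the log timestamp is UTC (fractional seconds dropped).
sent = datetime(2014, 9, 3, 3, 12, 25, tzinfo=timezone.utc)
delay_bound_s = 518400
deadline_epoch = 1410232345
unscaled_s = 365282.25
scaled_s = 945583.05

# Delay bound expressed in days
delay_days = delay_bound_s / 86400

# Assumed rule: deadline = send time + delay bound
computed_deadline = int(sent.timestamp()) + delay_bound_s

# Factor applied to the raw duration estimate (meaning not given in the log)
scale = scaled_s / unscaled_s

# Does the scaled estimate exceed the time allowed before the deadline?
exceeds = scaled_s > delay_bound_s

print(f"delay bound: {delay_days:.1f} days")
print(f"deadline matches log: {computed_deadline == deadline_epoch}")
print(f"estimate scale factor: {scale:.2f}")
print(f"scaled estimate: {scaled_s / 86400:.1f} days, exceeds delay bound: {exceeds}")
```

The scaled estimate comes out longer than the 6-day delay bound even though this request was granted, which is at least consistent with these tasks going straight into High Priority execution on my hosts.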

On another matter, there appears to have been another credit rate adjustment (the third change I've noticed). This time it is down: my sole new 112.0 job to get credit so far got 1031.16, down from the 1386.0 awarded in the immediately previous small batch of 112.0 work. As both subsets of the work from three days ago got credit higher than any "standard candles" I have available, this seems a step in the right direction. Preliminary comparison suggests that on my particular machines it is perhaps 5% below parity with FGRP3, which is by far the closest it has been in the sequence. Averaged across the full fleet of user machines it may well be just fine.

Jasper
Joined: 14 Feb 12
Posts: 63
Credit: 4032891
RAC: 0

Quote:

I would be very interested to know if anyone has received FGRP4 1.04 tasks without 'beta test' being turned on.

The exact same here. When Bernd said he had made some FGRP4 'non-beta' available, I saw them and switched to FGRP3 and FGRP4, but no beta. Granted, I got a few resends for FGRP3 since, but no FGRP4 - right now, the queue looks empty (for both), so that fits. Not really an issue, I will allow S6CasA again (and BRP4 after that if so needed) as soon as I see FGRP not catching up; which is quite something, on my slow cruncher. 😄
I'd just like to receive at least one FGRP4 though, but that's only out of curiosity.

archae86
Joined: 6 Dec 05
Posts: 3157
Credit: 7231601375
RAC: 1160659

I got three new FGRP4 units within the last couple of hours which appear to be from a new batch. The sent times range from 12:33 to 14:30 UTC.

The Task names now look like: LATeah0001D_16.0_0_-1.95e-10

instead of like: fgrp4_test_112.0_93_-2.6e-11

All three of these got quorum partners within 37 minutes, in contrast to the previous batch, where in more than half the cases I had no quorum partner ten hours after my "sent" time.

As "to send" is back down to zero at the moment, it appears this batch went out far more quickly than the glacial pace yesterday.

Scheduler logs for two of my hosts successfully requesting this work show that I had beta enabled, and that plan class FGRP4-Beta was checked first, but when no work was found there plan class FGRP4-SSE2 was checked, with, in my case, success.

I surmise that not only is beta acceptance perhaps no longer required, but that perhaps something else was adjusted making it less likely for a host request to be rejected on grounds the host is incapable of finishing the work on time.

Deadlines on these newest units are the Einstein production traditional 14 days--another step toward normal production status.

Estimated times to completion for these three latest "more production" units are much shorter than for the immediately preceding work. It could be that the units really represent far less computation, or it could be that a further adjustment has been made on the sending side, as the new estimates are somewhat close to actual recent results. This is a first: the earliest work had severely short estimates, which in turn triggered DCF chaos; then things shifted to substantially too long estimates, which may have played a part in request denials.

While these units all have 16.0 designations in the task name, and while I continue to believe that during the true beta testing there was a non-random and strong run time variation correlated with that field, these new units do not appear to be short time to run work as were the initial beta test 16.0 units. At claimed progress near the 2% level, these so far look similar in execution time to work late in the beta test sequence. Of course I may be looking at synthetic claimed progress, as I don't know how to distinguish boincmgr guessing from progress based on actual reports from the application.

[edit: update execution time observation. OOPS. It appears that these are probably in fact much shorter execution time work than was the late beta work. All three of my tasks are now showing as 19.000% complete, with run times between 23 and 27 minutes. This suggests that the actual application finally issued a progress estimate, suggests these are pretty short units compared to the late beta, and suggests that the units are still being issued in a way which causes boincmgr on my hosts greatly to overestimate the actual completion time. However, with the longer deadlines (and the short real time), these are greatly less likely to trigger high priority processing status on a typical host. Mine would not be running yet had I not suspended other tasks to push these to the head of the line.]

archae86
Joined: 6 Dec 05
Posts: 3157
Credit: 7231601375
RAC: 1160659

All three of my first batch LATeah0001D_16.0 units finished with CPU times on two different hosts between 3600 and 4000 seconds. Comparing like to like, this is just about one tenth the time the same hosts used on "112.0" jobs distributed late in the beta test.

I don't know the division of labor between server side tagging, host history, and host computation by the BOINC software, but all jobs before running and while running on my hosts had hugely excessive run time estimates--something like an order of magnitude.

My use (previously) of throttling, and possibly other differences may make my experience in the run time estimate matter not representative of the general user population.

I do have good hope that once things settle down and I restrict jobs to one GPU type and one CPU type, and fall comes and I stop throttling much, that the run time estimates will eventually become pretty accurate.

archae86
Joined: 6 Dec 05
Posts: 3157
Credit: 7231601375
RAC: 1160659

An hour ago, over on Technical News, Bernd announced the official launch of the real FGRP4 run.

A few observations:

1. for the first time the server status page has an entry for an FGRP4 work generation process.

2. work I've personally received includes some "16.0" units created between 7:13 UTC and 9:37 UTC and some "48.0" units between 9:50 and 10:05.

3. estimated run times on these still appear to be excessive on my particular hosts

4. there are still a few fgrp4_test re-issues coming through.

I've promoted some of each of these two flavors to immediate processing. It appears that on one host the single 16.0 task hit the first real progress indication at 19 percent, while the two 48.0 tasks hit a real progress indication at 2.333 percent with a similar amount of real runtime. So there is at least a hint that in this initial work the beta observation that the 16.0 units had very short run times (and possibly non-characteristic credit rates) may still apply, though this may be a temporary artifact rather than any durable relationship.
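To put a number on that hint, here is a minimal sketch. It assumes reported progress is proportional to work done and that both readings were taken at the same elapsed run time; neither is confirmed, since the application seems to report progress in coarse steps. The function name is illustrative only.

```python
def implied_runtime_ratio(progress_short_pct: float, progress_long_pct: float) -> float:
    """Ratio of total run times implied by two progress readings taken at the
    same elapsed time, under the (unverified) assumption of linear progress."""
    return progress_short_pct / progress_long_pct


# First real progress readings from the post: 16.0 task vs 48.0 tasks
ratio = implied_runtime_ratio(19.0, 2.333)
print(f"48.0 tasks would take roughly {ratio:.1f}x as long as the 16.0 task")
```

A single first reading per task is thin evidence, so this is only a rough indication of relative size.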

Bernd's post mentions that the former FGRP3 credit for full-sized units of 660 has been left intact, so it may be that highish credit we saw in the late beta is gone for newly generated FGRP4 work.

[edit: with some more run time I can see that one of the "48.0" units I have running is clearly a shorty, while the other 48's are much longer--so farewell to the simple relationship I thought I saw in the beta]
