FGRP4 Observations and Problems

archae86

Joined: 6 Dec 05

Posts: 3163

Credit: 7358461687

RAC: 2280617

21 Aug 2014 2:24:59 UTC

Topic 197686


There is an existing thread in Technical News regarding FGRP4. Both Holmis and I have contributed observations and criticisms there.

But to keep things tidy, arguably the News forums should be for news, with observations and comments going in either the Cruncher's Corner or here in Problems and Bug Reports.

So let me try to start a thread here for both observations and criticisms regarding FGRP4.

For a start, I'll quote my own post in Technical News made a few hours ago:

archae86 wrote:

I got a batch of FGRP4 tasks on this laptop.

My first one ran to completion successfully--need the quorum partner to run to see whether things went right.

Observations:

1. the initial completion time estimate was far too low, by something like 6x.
2. the credit awarded (as seen here and elsewhere) of 2.58 seems very low in relation to the CPU work required.
3. I'm very happy to see CPU work available in small enough doses of computation required to be suitable either for lower output machines which run 24/7, or for somewhat higher output machines which run intermittently (as does my laptop).

To that I'll add an additional observation regarding progress reporting:

4. It appears that the current FGRP4 application reports progress (as observed in the Progress column of Boinc Manager) in very coarse increments. The only three progress reports I have seen are at 0.000%, 32.333%, and 65.666%.

As completion times are short compared to some recent Einstein CPU apps, this spacing may not represent an unusually long stretch of computation between updates. But from experience we may predict that some users will interpret an extended period with no progress update as a "stall", a fault of either the application or their machine, and respond unconstructively, anywhere from aborting the task, to disabling work requests for that application, up to abandoning Einstein altogether.

It would be good to report progress more frequently.

I think many of us assume that checkpointing is tied to progress reporting. Is this actually true? More specifically, for the current FGRP4 roughly how frequently is there a checkpoint--so that the intermittent user may hope not to be wasting large amounts of already invested CPU time each time they shut down their PC?
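Whether FGRP4 ties the two together is for the developers to say, but nothing in BOINC requires it: as far as I know, the BOINC API reports progress through `boinc_fraction_done()` and handles checkpoints separately via `boinc_time_to_checkpoint()` and `boinc_checkpoint_completed()`. The following is a hypothetical Python simulation, not the FGRP4 code; `run_task`, `step_seconds` and the JSON state file are illustrative assumptions. It reports progress after every step, writes state to disk only once a time interval has elapsed, and resumes from the last checkpoint on restart.

```python
import json
import os


def load_checkpoint(path):
    """Return the last completed step saved at `path`, or 0 if there is no checkpoint."""
    if not os.path.exists(path):
        return 0
    with open(path) as f:
        return json.load(f)["step"]


def save_checkpoint(path, step):
    """Write to a temporary file, then rename, so a crash never leaves a half-written checkpoint."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"step": step}, f)
    os.replace(tmp, path)


def run_task(total_steps, step_seconds, checkpoint_interval_s, path, report_progress):
    """Hypothetical main loop: fine-grained progress, time-based checkpoints.

    step_seconds stands in for the CPU time one step of real work would take.
    Returns the number of checkpoints written during this run.
    """
    step = load_checkpoint(path)            # resume where the last run stopped
    since_checkpoint = 0.0
    checkpoints = 0
    while step < total_steps:
        # ... one unit of real computation would go here ...
        step += 1
        since_checkpoint += step_seconds
        report_progress(step / total_steps)  # progress after every step
        if since_checkpoint >= checkpoint_interval_s:
            save_checkpoint(path, step)      # state saved only once per interval
            since_checkpoint = 0.0
            checkpoints += 1
    return checkpoints
```

With 300 one-second steps and a 60-second interval, this reports progress 300 times but checkpoints only 5 times. A shutdown then loses at most about one interval of work, however often progress is shown.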

archae86

Joined: 6 Dec 05

Posts: 3163

Credit: 7358461687

RAC: 2280617

FGRP4 Observations and Problems

21 Aug 2014 2:33:55 UTC

Message 122855


For those desiring FGRP4 work:

1. currently this is beta-test status, so your Einstein@home project preferences for the location (aka venue) of the host in question must have a "Yes" for "Run beta/test application versions?"

2. also in your Einstein@home preferences, in the "Run only the selected applications" section, you must have a Yes for "Gamma-ray pulsar search #4".

I am a bit unclear on the default setting of the "Run only..." item for a newly listed application, but for my locations all were initially set to "No". So once the preferences page began listing this application, I had to change the setting actively to get this type of work.

archae86

Joined: 6 Dec 05

Posts: 3163

Credit: 7358461687

RAC: 2280617

I got and processed a unit of

21 Aug 2014 4:17:48 UTC

Message 122856


I got and processed a unit of FGRP4 work on a second host, also running Windows 7 but with a very modern desktop CPU.

Once again the execution time was a large multiple of the prediction. Though I did not observe the prediction directly, I could easily see that after the FGRP4 result completed, all executing work on the host went into High Priority mode because the estimated completion times had been greatly lengthened. A few hours after the FGRP4 result completed, the estimated completion times for FGRP3 and Perseus work are between 5 and 10 times longer than recent experience.

As I have the requested work buffer size set to a little over two days, this will resolve itself pretty soon. People running larger work buffers may find this effect more disruptive.

So for two different Intel/Windows 7 hosts, the execution time estimates were low by the better part of an order of magnitude. Results on other hosts may differ.
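Why one new task type disturbs the estimates for FGRP3 and Perseus work too: as I understand it, the BOINC client keeps the duration correction factor (DCF) per project, not per application, so every Einstein@home task's estimate is multiplied by the same number. A minimal sketch; the baseline estimates are made-up numbers and the `Project` class is a hypothetical simplification:

```python
from dataclasses import dataclass, field


@dataclass
class Project:
    """Simplified model: one DCF shared by every application of a project."""
    dcf: float = 1.0
    # Hypothetical uncorrected runtime estimates in hours, keyed by application.
    base_estimates_h: dict = field(default_factory=dict)

    def estimate_h(self, app):
        # Displayed (and scheduled) estimate = uncorrected estimate * project DCF.
        return self.base_estimates_h[app] * self.dcf


einstein = Project(base_estimates_h={"FGRP3": 5.0, "BRP5": 8.0, "FGRP4": 0.5})
before = {app: einstein.estimate_h(app) for app in einstein.base_estimates_h}

# Suppose a badly underestimated FGRP4 task pushes the project DCF to 8.
einstein.dcf = 8.0
after = {app: einstein.estimate_h(app) for app in einstein.base_estimates_h}
```

FGRP3 and BRP5 tasks whose real runtimes have not changed now appear eight times longer, so a queue sized for a two-day buffer suddenly looks far too full.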

mountkidd

Joined: 14 Jun 12

Posts: 180

Credit: 12972370626

RAC: 6434135

RE: For those desiring

21 Aug 2014 4:21:43 UTC

Message 122857 in response to message 122855


Quote:

For those desiring FGRP4 work:

1. currently this is beta-test status, so your Einstein@home project preferences for the location (aka venue) of the host in question must have a "Yes" for "Run beta/test application versions?"

2. also in your Einstein@home preferences, in the "Run only the selected applications" section, you must have a Yes for "Gamma-ray pulsar search #4".

This is not quite true in my experience. One of my venues (w/ 2 hosts) was set to Yes for "run beta" but No for "GR gpu" and No for "run cpu for apps w/ gpu", and I still got a large number (~120) of FGRP4 tasks downloaded. Disabling "beta" and aborting the tasks cured the problem. I'm back to BRP5 only on both hosts...

Gord

Gary Roberts

Moderator

Joined: 9 Feb 05

Posts: 5887

Credit: 119207990391

RAC: 25011950

RE: This is not quite true

21 Aug 2014 8:24:48 UTC

Message 122858 in response to message 122857


Quote:

This is not quite true in my experience.

I think it may well be true because you may have received the unexpected tasks for a different reason. I should add that I haven't received any FGRP4 tasks because I haven't (yet) enabled the FGRP4 preference. I'm still trying to figure out how I'm going to juggle venues (yet again) to allow me to do so in a controlled way. Also, I'm in no hurry until I see the initial problems (as reported by Holmis) corrected.

Quote:

One of my venues (w/ 2 hosts) was set to Yes for "run beta" but no for "GR gpu" and no for "run cpu for apps w/ gpu" and I still got a large number (~120) of FGRP4 tasks downloaded.

I assume you must have selected the preference for the FGRP4 run in that venue? I also assume that "run cpu for apps w/ gpu" refers to the pref setting labelled "Run CPU versions of applications for which GPU versions are available"? If so, setting this pref to 'No' may not (of itself) prevent you from getting FGRP4 tasks because there is no GPU app 'available' for FGRP4 and so the pref setting may not even be looked at.
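Gary's reasoning can be written out as a filter. This is a hypothetical illustration of the logic he describes, not Einstein@home's actual scheduler code, and the dictionary keys are made-up stand-ins for the web preferences. The point is that "Run CPU versions of applications for which GPU versions are available" is only consulted when a GPU version exists, so it cannot exclude FGRP4, while the per-application selection should:

```python
def cpu_task_allowed(app, prefs, has_gpu_version):
    """Hypothetical preference filter for sending a CPU task of `app`."""
    if app["beta"] and not prefs["run_beta"]:
        return False   # "Run beta/test application versions?"
    if app["name"] not in prefs["selected_apps"]:
        return False   # "Run only the selected applications"
    if has_gpu_version and not prefs["cpu_if_gpu_available"]:
        return False   # only relevant when a GPU version of the app exists
    return True
```

Under this logic a venue with FGRP4 deselected gets no FGRP4 tasks at all, whatever the beta and GPU settings say.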

Quote:

Disabling "beta" and aborting the tasks cured the problem. I'm back to BRP5 only on both hosts...

Could you have also cured the problem by disabling FGRP4? If you did get FGRP4 tasks with that run preference already disabled, that's a separate problem that will need to be rectified.

Cheers,
Gary.

archae86

Joined: 6 Dec 05

Posts: 3163

Credit: 7358461687

RAC: 2280617

In the technical news thread

21 Aug 2014 12:33:00 UTC

Message 122859


In the technical news thread on this topic, Gary Roberts has pointed out that some execution time and credit reporting for these tasks may be atypical because of the "short ends" problem.

So I don't know how representative my results may be, but I will point out that all three hosts in my flotilla which have received this work have had their duration correction factor (DCF) greatly increased, and have thus been driven into executing work in high-priority mode immediately after completing their first task of this type.

While I did not log the reported duration correction factor before beginning this process, the values as of this morning a few hours after first processing this type of work are:
17.735275
9.560836
10.259696

So, for the work distributed initially, it seems all three of my hosts have had significantly low run time estimates.
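Why a single task can move DCF that far: the BOINC client deliberately raises DCF at once when a task runs longer than predicted, but lowers it only gradually when tasks run shorter. The function below is a simplified sketch of that asymmetry, loosely modelled on the client's duration-correction logic; the 1.1 and 0.1 thresholds and the 0.9/0.1 weights are approximate, so treat it as an illustration rather than the client's exact code.

```python
def update_dcf(dcf, elapsed_s, uncorrected_estimate_s):
    """Simplified, asymmetric DCF update (approximate model of the BOINC client rule)."""
    raw_ratio = elapsed_s / uncorrected_estimate_s          # actual vs. uncorrected estimate
    adj_ratio = elapsed_s / (uncorrected_estimate_s * dcf)  # actual vs. corrected estimate
    if adj_ratio > 1.1:
        new_dcf = raw_ratio                      # ran long: jump straight to the observed ratio
    elif adj_ratio < 0.1:
        new_dcf = 0.99 * dcf + 0.01 * raw_ratio  # ran very short: barely trust it
    else:
        new_dcf = 0.9 * dcf + 0.1 * raw_ratio    # ran short: drift down slowly
    return min(max(new_dcf, 0.01), 100.0)        # clamp to a sane range


def tasks_until_below(dcf, target, raw_ratio=1.0):
    """Count tasks with the given raw_ratio needed to bring DCF below target."""
    if target <= raw_ratio:
        raise ValueError("DCF converges towards raw_ratio, so target must exceed it")
    n = 0
    while dcf >= target:
        dcf = update_dcf(dcf, raw_ratio * 3600.0, 3600.0)
        n += 1
    return n
```

Under this model, one task that runs six times longer than estimated sets DCF to 6 immediately, and 22 accurately estimated tasks are then needed to bring it back under 1.5. That is why the jump is sudden while the recovery is slow.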

archae86

Joined: 6 Dec 05

Posts: 3163

Credit: 7358461687

RAC: 2280617

One other attribute of the

21 Aug 2014 14:13:28 UTC

Message 122860


One other attribute of the FGRP4 work I have received is tight deadlines, just 48 hours after the "sent" time.

I imagine this has to do with the beta test status of the currently distributing work. But together with the DCF transient associated with the severe underestimate of time required, this will be disruptive for some users, particularly those who choose to run long queue lengths, and also those whose machines are only actively processing BOINC work intermittently.
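A rough way to see why short deadlines and an inflated DCF combine badly: if the DCF-corrected work queued ahead of a task, spread over the available cores and the fraction of time the machine actually crunches, would finish after the deadline, the client has to switch to high-priority (earliest-deadline-first) running. The check below is a deliberately crude sketch under those assumptions, not the client's actual scheduling simulation.

```python
def misses_deadline(queued_hours, deadline_hours, cores, on_fraction=1.0):
    """Crude deadline check.

    queued_hours: DCF-corrected runtime estimates (hours) of the tasks that must
                  run before and including the task in question.
    on_fraction:  share of wall-clock time the host is actually crunching.
    """
    wall_hours = sum(queued_hours) / (cores * on_fraction)
    return wall_hours > deadline_hours
```

For example, forty 2-hour tasks on four cores fit comfortably inside 48 hours, but the same queue with DCF at 10 (20 hours each) does not, and a host crunching only half the time misses the deadline with sixty 2-hour tasks.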

mountkidd

Joined: 14 Jun 12

Posts: 180

Credit: 12972370626

RAC: 6434135

RE: I assume you must have

21 Aug 2014 16:44:18 UTC

Message 122861 in response to message 122858


Quote:

I assume you must have selected the preference for the FGRP4 run in that venue?

This was set to 'No' for FGRP4. BRP5 is the only app enabled in all my venues. I suspect it had something to do with 'Beta' being enabled, even though 'run cpu for gpu apps' was set to 'No'. It appears 'No' doesn't quite mean no.

Gord

archae86

Joined: 6 Dec 05

Posts: 3163

Credit: 7358461687

RAC: 2280617

With under two dozen units

21 Aug 2014 20:21:50 UTC

Message 122862


With under two dozen units processed, I've had two errors on two different machines, a far higher error rate than I have been seeing.

The error on my laptop came almost immediately upon start of execution, and has quite a short stderr, of which only two lines look potentially interesting:

(unknown error) - exit code -1073741680 (0xc0000090)
...
-- signal handler called: signal 8

The error on my fastest PC has a much longer stderr, of which one entry reads "Maximum elapsed time exceeded", although another entry buried deep in the stderr might be interesting and reads

- Unhandled Exception Record -
Reason: Breakpoint Encountered (0x80000003) at address 0x76343226

[edit: the first of these two errors was my only v1.02 job. That version has been deprecated.]
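The numbers in these stderr lines can be decoded. Windows exit codes are 32-bit NTSTATUS values printed as signed integers, so -1073741680 is 0xC0000090, STATUS_FLOAT_INVALID_OPERATION, consistent with "signal 8", which is SIGFPE (floating-point exception). 0x80000003 is STATUS_BREAKPOINT. A small helper for the conversion; the lookup table holds only a few codes and is not exhaustive:

```python
import signal

# A few Windows NTSTATUS values (as defined in the Windows SDK's ntstatus.h).
NTSTATUS_NAMES = {
    0x80000003: "STATUS_BREAKPOINT",
    0xC000008E: "STATUS_FLOAT_DIVIDE_BY_ZERO",
    0xC0000090: "STATUS_FLOAT_INVALID_OPERATION",
}


def decode_exit_code(code):
    """Turn a (possibly negative) exit code into its 32-bit hex form and a name."""
    unsigned = code & 0xFFFFFFFF  # reinterpret a signed 32-bit value as unsigned
    return f"0x{unsigned:08X}", NTSTATUS_NAMES.get(unsigned, "unknown")


def signal_name(number):
    """Name of a signal number on the current platform (8 is SIGFPE on Windows and POSIX)."""
    return signal.Signals(number).name


print(decode_exit_code(-1073741680))  # ('0xC0000090', 'STATUS_FLOAT_INVALID_OPERATION')
print(signal_name(8))                 # SIGFPE
```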

Holmis

Joined: 4 Jan 05

Posts: 1118

Credit: 1055935564

RAC: 0

RE: The error on my fastest

22 Aug 2014 7:30:28 UTC

Message 122863 in response to message 122862


Quote:

The error on my fastest PC has a much longer stderr, of which one entry reads "Maximum elapsed time exceeded", although another entry buried deep in the stderr might be interesting and reads
- Unhandled Exception Record -
Reason: Breakpoint Encountered (0x80000003) at address 0x76343226

That's the normal error when Boinc aborts the job because of "Maximum elapsed time exceeded".
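For context on where that limit comes from: as far as I know, the BOINC client sets a task's maximum elapsed time to the workunit's `rsc_fpops_bound` divided by its speed estimate for that app version, and aborts the task when the limit is exceeded. If the bound is some multiple of `rsc_fpops_est` (the multiple used for FGRP4 is not stated in this thread, so the values below are hypothetical), then a task whose work is badly underestimated, or a host running slower than its speed estimate, can hit the limit even though nothing is looping. A sketch under those assumptions:

```python
def exceeds_elapsed_limit(true_fpops, rsc_fpops_est, bound_multiplier,
                          estimated_flops, actual_flops):
    """Sketch of the 'Maximum elapsed time exceeded' condition.

    Assumptions: rsc_fpops_bound = bound_multiplier * rsc_fpops_est (hypothetical
    project setting); the limit is rsc_fpops_bound / estimated_flops; the task's
    real runtime is true_fpops / actual_flops.
    """
    rsc_fpops_bound = bound_multiplier * rsc_fpops_est
    max_elapsed_s = rsc_fpops_bound / estimated_flops
    runtime_s = true_fpops / actual_flops
    return runtime_s > max_elapsed_s
```

For example, a task carrying six times more work than its estimate is aborted under a 5x bound but survives a 10x bound, unless the host also runs at half its estimated speed.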

Richard Haselgrove

Joined: 10 Dec 05

Posts: 2143

Credit: 3006311450

RAC: 707153

RE: RE: The error on my

22 Aug 2014 7:48:42 UTC

Message 122864 in response to message 122863


Quote:

Quote:
The error on my fastest PC has a much longer stderr, of which one entry reads "Maximum elapsed time exceeded", although another entry buried deep in the stderr might be interesting and reads
- Unhandled Exception Record -
Reason: Breakpoint Encountered (0x80000003) at address 0x76343226
That's the normal error when Boinc aborts the job because of "Maximum elapsed time exceeded".

Rom Walton once told me it was a deliberate choice by the developers. One possible reason for a task running far longer than expected is that the execution path for that particular dataset has branched into a previously undetected infinite loop. The full program debug logs are to help the developer find that loop.

Forums › Problems and Bug Reports