FGRP4 Observations and Problems

archae86
Joined: 6 Dec 05
Posts: 3157
Credit: 7228761571
RAC: 1126895
Topic 197686

There is an existing thread in Technical News regarding FGRP4. Both Holmis and I have contributed observations and criticisms there.

But to keep things tidy, the News forums should arguably be for news, with observations and comments going in either Cruncher's Corner or here in Problems and Bug Reports.

So let me try to start a thread here for both observations and criticisms regarding FGRP4.

For a start, I'll quote my own post in Technical News made a few hours ago:

archae86 wrote:

I got a batch of FGRP4 tasks on this laptop.

My first one ran to completion successfully--need the quorum partner to run to see whether things went right.

Observations:

1. the initial completion time estimate was very far low--something like 6x.
2. the credit awarded (as seen here and elsewhere) of 2.58 seems very low in relation to the CPU work required.
3. I'm very happy to see CPU work available in small enough doses of computation required to be suitable either for lower output machines which run 24/7, or for somewhat higher output machines which run intermittently (as does my laptop).

To that I'll add an additional observation regarding progress reporting:

4. It appears that the current FGRP4 application reports progress (as observed in the Progress column of Boinc Manager) in very coarse increments. The only three progress reports I have seen are at 0.000%, 32.333%, and 65.666%.

As completion times are short compared to some recent Einstein CPU apps, this spacing may not represent an unusually large amount of actual computation. But from experience we may predict that some users will interpret an extended period with no update of progress as a "stall", the fault of either the application or their machine, and respond unconstructively: aborting the task, disabling work requests for the specific application, or even abandoning Einstein altogether.

It would be good to report progress more frequently.
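
To put the coarse increments in perspective, here is a rough sketch in Python of how long a task might sit at a single progress value. The progress figures are the ones observed above; the total runtime is a made-up number for illustration, and the calculation assumes progress advances in proportion to elapsed time, which may not hold for this application.

```python
# Progress values (percent) seen in the Progress column, per the observation above.
observed_progress = [0.000, 32.333, 65.666]

def longest_gap_percent(progress_points):
    """Largest jump between consecutive progress values, treating 100% as the end."""
    points = sorted(progress_points) + [100.0]
    return max(b - a for a, b in zip(points, points[1:]))

def longest_silent_period(progress_points, total_runtime_hours):
    """Hours without a progress update, ASSUMING progress is linear in time."""
    return longest_gap_percent(progress_points) / 100.0 * total_runtime_hours

# HYPOTHETICAL runtime, not a measured value.
assumed_runtime_hours = 3.0
gap = longest_gap_percent(observed_progress)
silent = longest_silent_period(observed_progress, assumed_runtime_hours)
print(f"largest gap: {gap:.3f}% -> about {silent:.2f} h with no update")
```

Under these assumptions the display could go roughly a third of the task's runtime without changing, which is the kind of gap a user might read as a stall.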

I think many of us assume that checkpointing is tied to progress reporting. Is this actually true? More specifically, for the current FGRP4 roughly how frequently is there a checkpoint--so that the intermittent user may hope not to be wasting large amounts of already invested CPU time each time they shut down their PC?

archae86
Joined: 6 Dec 05
Posts: 3157
Credit: 7228761571
RAC: 1126895

For those desiring FGRP4 work:

1. currently this is beta-test status, so your Einstein@home project preferences for the location (aka venue) of the host in question must have a "Yes" for "Run beta/test application versions?"

2. also in your Einstein@home preferences, in the "Run only the selected applications" section, you must have a Yes for "Gamma-ray pulsar search #4".

I am a bit unclear on the default setting of the "Run only..." item for a newly listed application, but for my locations all were initially set to "No". So getting this type of work required active intervention on my part once the preferences page began listing this application.

archae86
Joined: 6 Dec 05
Posts: 3157
Credit: 7228761571
RAC: 1126895

I got and processed a unit of FGRP4 work on a second host which was also a Windows 7 host but a very modern desktop CPU.

Once again the execution time was a large multiple of the prediction. Though I did not observe the prediction directly, I could easily observe that after completion of the FGRP4 result all executing work on the host went to High Priority mode because the estimated completion times were greatly elongated. A few hours after the FGRP4 results completed the estimated completion times for FGRP3 and Perseus work are between 5 and 10 times longer than recent experience.

As I have the requested work buffer size set to a little over two days, this will resolve itself pretty soon. People running larger work buffers may find this effect more disruptive.

So for two different Intel/Windows 7 hosts, the execution time estimates were low by the better part of an order of magnitude. Results on other hosts may differ.

mountkidd
Joined: 14 Jun 12
Posts: 176
Credit: 12615082555
RAC: 8003769

RE: For those desiring

Quote:

For those desiring FGRP4 work:

1. currently this is beta-test status, so your Einstein@home project preferences for the location (aka venue) of the host in question must have a "Yes" for "Run beta/test application versions?"

2. also in your Einstein@home preferences, in the "Run only the selected applications" section, you must have a Yes for "Gamma-ray pulsar search #4".

This is not quite true in my experience. One of my venues (w/ 2 hosts) was set to Yes for "run beta" but no for "GR gpu" and no for "run cpu for apps w/ gpu" and I still got a large number (~120) of FGRP4 tasks downloaded. Disabling "beta" and aborting the tasks cured the problem. I'm back to BRP5 only on both hosts...

Gord

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5872
Credit: 117755922092
RAC: 34822969

RE: This is not quite true

Quote:
This is not quite true in my experience.


I think it may well be true because you may have received the unexpected tasks for a different reason. I should add that I haven't received any FGRP4 tasks because I haven't (yet) enabled the FGRP4 preference. I'm still trying to figure out how I'm going to juggle venues (yet again) to allow me to do so in a controlled way. Also, I'm in no hurry until I see the initial problems (as reported by Holmis) corrected.

Quote:
One of my venues (w/ 2 hosts) was set to Yes for "run beta" but no for "GR gpu" and no for "run cpu for apps w/ gpu" and I still got a large number (~120) of FGRP4 tasks downloaded.


I assume you must have selected the preference for the FGRP4 run in that venue? I also assume that "run cpu for apps w/ gpu" refers to the pref setting labelled "Run CPU versions of applications for which GPU versions are available"? If so, setting this pref to 'No' may not (of itself) prevent you from getting FGRP4 tasks because there is no GPU app 'available' for FGRP4 and so the pref setting may not even be looked at.

Quote:
Disabling "beta" and aborting the tasks cured the problem. I'm back to BRP5 only on both hosts...


Could you have also cured the problem by disabling FGRP4? If you did get FGRP4 tasks with that run preference already disabled, that's also a problem that will need to be rectified.

Cheers,
Gary.

archae86
Joined: 6 Dec 05
Posts: 3157
Credit: 7228761571
RAC: 1126895

In the technical news thread on this topic Gary Roberts has pointed out that some execution time and credit reporting for these tasks may be atypical because of the "short ends" problem.

So I don't know how representative my result may be but I will point out that all three hosts in my flotilla which have received this work have greatly increased their duration correction factor, and have thus been driven into executing work in high-priority mode immediately after completing their first task of this type.

While I did not log the reported duration correction factor before beginning this process, the values as of this morning a few hours after first processing this type of work are:
17.735275
9.560836
10.259696

So, for the work distributed initially, it seems all three of my hosts have had significantly low run time estimates.
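
As a rough illustration of why one badly underestimated task disturbs everything else in the queue, here is a sketch in Python. It assumes the client multiplies each task's raw runtime estimate by a single project-wide duration correction factor, which is a simplification of the real client behaviour. The 4-hour raw estimate and the starting DCF of 1.0 are made up for illustration; the three DCF values are the ones listed above.

```python
# DCF values reported above for the three hosts after their first FGRP4 task.
reported_dcf = [17.735275, 9.560836, 10.259696]

def corrected_estimate(raw_estimate_hours, dcf):
    """Simplified model (an assumption): displayed estimate = raw estimate * DCF."""
    return raw_estimate_hours * dcf

# HYPOTHETICAL: a non-FGRP4 task whose raw estimate is 4 hours,
# on a host whose DCF was about 1.0 before the FGRP4 task ran.
raw_hours = 4.0
before = corrected_estimate(raw_hours, 1.0)

for dcf in reported_dcf:
    after = corrected_estimate(raw_hours, dcf)
    print(f"DCF {dcf:>9.6f}: estimate {before:.1f} h -> {after:.1f} h")
```

If the factor really is shared across the project's applications, that would be consistent with the inflated FGRP3 and Perseus estimates reported earlier in this thread.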

archae86
Joined: 6 Dec 05
Posts: 3157
Credit: 7228761571
RAC: 1126895

One other attribute of the FGRP4 work I have received is tight deadlines, just 48 hours after the "sent" time.

I imagine this has to do with the beta test status of the currently distributing work. But together with the DCF transient associated with the severe underestimate of time required, this will be disruptive for some users, particularly those who choose to run long queue lengths, and also those whose machines are only actively processing BOINC work intermittently.
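
The interaction between short deadlines and inflated estimates can be sketched as below, in Python. This is a deliberately crude check and not BOINC's actual scheduling logic: it assumes tasks run one after another on a single core, and the queue contents and the 8-hours-per-day figure are hypothetical. Only the 48-hour deadline comes from the observation above.

```python
DEADLINE_HOURS = 48.0  # from the post: deadline is 48 hours after the "sent" time

def tasks_at_risk(estimates_hours, deadline_hours=DEADLINE_HOURS,
                  hours_active_per_day=24.0):
    """Return indices of tasks whose estimated finish falls after the deadline.

    ASSUMPTION: tasks run back to back on one core. A machine that only
    crunches part of the day stretches wall-clock time accordingly.
    """
    duty = hours_active_per_day / 24.0
    at_risk, elapsed = [], 0.0
    for i, est in enumerate(estimates_hours):
        elapsed += est / duty  # wall-clock hours needed for this task
        if elapsed > deadline_hours:
            at_risk.append(i)
    return at_risk

# HYPOTHETICAL queue: ten tasks the client believes take 6 hours each.
queue = [6.0] * 10
print(tasks_at_risk(queue))                          # always-on machine
print(tasks_at_risk(queue, hours_active_per_day=8))  # intermittent machine
```

The same queue that mostly fits on an always-on host overruns quickly on a machine that runs only part of the day, which is the disruption described above for long queues and intermittent use.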

mountkidd
Joined: 14 Jun 12
Posts: 176
Credit: 12615082555
RAC: 8003769

RE: I assume you must have

Quote:
I assume you must have selected the preference for the FGRP4 run in that venue?


This was set to 'No' for FGRP4. BRP5 is the only app enabled in all my venues. I suspect it had something to do with 'Beta' being enabled, but then 'run cpu for gpu apps' was set to no. It appears no doesn't quite mean no.

Gord

archae86
Joined: 6 Dec 05
Posts: 3157
Credit: 7228761571
RAC: 1126895

With under two dozen units processed, I've had two errors on two different machines--a far higher error rate than I've been seeing.

The error on my laptop came almost immediately upon start of execution, and has quite a short stderr, of which only two lines look potentially interesting:

(unknown error) - exit code -1073741680 (0xc0000090)
...
-- signal handler called: signal 8

The error on my fastest PC has a much longer stderr, of which one entry reads "Maximum elapsed time exceeded", although another entry buried deep in the stderr might be interesting and reads

- Unhandled Exception Record -
Reason: Breakpoint Encountered (0x80000003) at address 0x76343226

[edit: the first of these two errors was my only v 1.02 job. That version has been deprecated.]
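
The negative exit code and the hex value in parentheses are the same 32-bit number shown two ways. Here is a small sketch in Python of the conversion, together with the standard Windows NTSTATUS names for the two codes quoted in this post. The lookup table is limited to those two values and is not taken from the task's stderr.

```python
# Windows reports exit codes as signed 32-bit integers; the hex form
# is the same bit pattern read as unsigned.
def exit_code_to_hex(code):
    """Convert a signed 32-bit exit code to its unsigned hex representation."""
    return f"0x{code & 0xFFFFFFFF:08x}"

# Standard NTSTATUS names for the two codes quoted in this post only.
NTSTATUS_NAMES = {
    0xC0000090: "STATUS_FLOAT_INVALID_OPERATION",
    0x80000003: "STATUS_BREAKPOINT",
}

code = -1073741680
unsigned = code & 0xFFFFFFFF
print(exit_code_to_hex(code), NTSTATUS_NAMES.get(unsigned, "unknown"))
```

A floating-point status code would be consistent with the "signal 8" line, since signal 8 is conventionally SIGFPE, though that alone does not identify the cause.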

Holmis
Joined: 4 Jan 05
Posts: 1118
Credit: 1055935564
RAC: 0

RE: The error on my fastest

Quote:
The error on my fastest PC has a much longer stderr, of which one entry reads "Maximum elapsed time exceeded", although another entry buried deep in the stderr might be interesting and reads
- Unhandled Exception Record -
Reason: Breakpoint Encountered (0x80000003) at address 0x76343226


That's the normal error when Boinc aborts the job because of "Maximum elapsed time exceeded".

Richard Haselgrove
Richard Haselgrove
Joined: 10 Dec 05
Posts: 2143
Credit: 2960522696
RAC: 707764

RE: RE: The error on my

Quote:
Quote:
The error on my fastest PC has a much longer stderr, of which one entry reads "Maximum elapsed time exceeded", although another entry buried deep in the stderr might be interesting and reads
- Unhandled Exception Record -
Reason: Breakpoint Encountered (0x80000003) at address 0x76343226

That's the normal error when Boinc aborts the job because of "Maximum elapsed time exceeded".


Rom Walton once told me it was a deliberate choice by the developers. One possible reason for a task running far longer than expected is that the execution path for that particular dataset has branched into a previously undetected infinite loop. The full program debug logs are to help the developer find that loop.
