Not Getting Credit

dmargulis
Joined: 17 Apr 07
Posts: 8
Credit: 941960531
RAC: 850322
Topic 219553

So I've been running E@H for many years, several on this machine.  As of 8/27 I seem to have stopped receiving credit for my WUs.  I've done a bit of research and seen my logs report completed WUs, although there also seems to be an inordinate number of tasks being aborted due to "not started and deadline has passed", so perhaps there is also an issue on the input side?

But two weeks with no credit seems unlikely to be a pending-validation issue, and my account doesn't show any pending credit.  Any thoughts?

dmargulis
Joined: 17 Apr 07
Posts: 8
Credit: 941960531
RAC: 850322

So looking further into my account statistics, it looks like there are a large number of Invalid results, all in the time period in question.  They are all Gravitational Wave WUs and all GPU-based.

Can anyone confirm that this is likely a problem with the GPU app development effort?

 

Thanks all.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5872
Credit: 117854091619
RAC: 34791760

dmargulis wrote:
... As of 8/27 I seem to have stopped receiving credit for my WUs.  I've done a bit of research and seen my logs report completed WUs, although there also seems to be an inordinate number of tasks being aborted due to "not started and deadline has passed", so perhaps there is also an issue on the input side?

Your host started downloading the O2AS GW test tasks around 18th Aug with more of them on 25th Aug and later.  The 08/18 tasks were V1.07 and the 08/25 and later tasks were a newer test version V1.08.

The very first tasks returned from the 08/18 batch started exceeding their deadline on 09/02, but you did have a couple that seem to have made it back and been validated around 09/03.  So the first part of your problem was simply tasks not being crunched in a timely manner.

On top of that, both of the mentioned versions are test versions of a new app, so I imagine you must have changed your settings to allow your host to run test apps.  I'm sure the project is grateful for this support, but you really need to be aware that test tasks are likely to fail or give invalid results, so they shouldn't be accepted if you want well-behaved and stable performance with a good prospect of validation.  You also need to follow the message boards, which can tell you whether or not there are problems.

For example, on 08/26 Bernd posted this message confirming problems with V1.08 and that the app version was "retracted/deprecated".  By following the messages, you could have saved a lot of pointless crunching, since everything labeled V1.08 was destined to be invalid.

I'm very sorry that you've been caught like this.  Hopefully, those who contribute to testing will eventually see some reward for their suffering in the form of an app that does give valid results.  We are not there yet, as the return to the V1.07 app still shows significant numbers of results failing validation.  There is obviously more work to do, and there is no clear answer yet as to why these failures happen or when there will be progress towards the final goal of a stable app.

Cheers,
Gary.

dmargulis
Joined: 17 Apr 07
Posts: 8
Credit: 941960531
RAC: 850322

Thanks, Gary.

I don't monitor BOINC daily and so didn't notice the lack of credit until 9/8.  My message wasn't meant so much to complain about not getting credit as to look for an explanation of why the app behavior changed.  I did eventually find that explanation, as my second message was meant to note (obviously not effectively).  As a result I purged a bunch of 1.08 tasks that had yet to be started but left the 1.07 tasks.  There is certainly no need to apologize for the project testing new algorithms.

What I don't understand is why I suddenly started to exceed deadlines on the work.  I seem to have downloaded a very large number of tasks with longer running times than usual, so it was pretty clear from visual inspection that they couldn't be crunched in time.  I don't believe that I changed my task buffer size so how would I check on what caused the excessive task downloads?

Thanks

 

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5872
Credit: 117854091619
RAC: 34791760

dmargulis wrote:
... I don't believe that I changed my task buffer size so how would I check on what caused the excessive task downloads?

I believe I can tell you what caused you to get excessive downloads.

When I originally looked at your tasks, I selected just the GW GPU tasks because you mentioned them specifically in your second message.  Now that you mention excessive downloads without changing your work cache size, that suggests a duration correction factor (DCF) problem that might have been caused by running some FGRPB1G tasks at about the same time as the O2AS GW tasks.  Sure enough, on inspection, there are a couple of FGRPB1G tasks completed around Aug 26 still showing in your tasks list.

It just so happens that FGRPB1G tasks crunch faster than their estimate, whilst O2AS tasks take way longer than the estimate.  DCF is project-wide, so fast-finishing FGRPB1G tasks will reduce DCF to such an extent that when your host went to download O2AS tasks, your BOINC client would have asked for a certain amount of work but then been given GW tasks to fill that request based on an entirely unrealistic estimate of how long the tasks would take.
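
To put some toy numbers on that, here is a minimal sketch of the effect.  It is not the actual BOINC client logic; the host speed, per-task work estimate, DCF values and cache size are all invented purely for illustration:

    # Simplified sketch (not the real BOINC client code) of how a DCF dragged
    # down by fast tasks inflates the number of slow tasks fetched.
    # All figures are invented purely for illustration.

    SECONDS_PER_DAY = 86400

    def estimated_runtime(task_flops_estimate, host_flops_per_sec, dcf):
        """BOINC-style guess: (work estimate / host speed) scaled by the DCF."""
        return dcf * task_flops_estimate / host_flops_per_sec

    def tasks_to_fill_cache(cache_days, runtime_estimate_secs):
        """How many tasks the client thinks it needs for the requested cache."""
        return int(cache_days * SECONDS_PER_DAY // runtime_estimate_secs)

    host_speed = 1e11        # hypothetical host throughput in FLOPS
    gw_flops = 3.6e15        # hypothetical per-task work estimate

    # Fast-finishing FGRPB1G tasks have pulled the project-wide DCF down to 0.2,
    # so a GW task looks like a 2-hour job ...
    est = estimated_runtime(gw_flops, host_speed, dcf=0.2)             # 7200 s
    n = tasks_to_fill_cache(cache_days=3, runtime_estimate_secs=est)   # 36 tasks

    # ... but each one actually crunches about 10x longer than that estimate.
    actual = 10 * est                                                  # 20 hours
    days_of_work = n * actual / SECONDS_PER_DAY
    print(n, days_of_work)   # 36 tasks, ~30 days of real work vs a 14-day deadline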

To counter this problem, there are several strategies you could use:

  • Choose one search only, as the DCF will then settle over time to a value that gives the correct number of tasks for the work cache size you require.

  • If you'd like both searches concurrently, just keep a very low cache size (eg. 0.2 days) so that even a wildly incorrect estimate of crunch time couldn't allow you to get anywhere near the 14 day deadline (see the rough arithmetic after this list).  It will look very messy because sometimes you will have much more actual work than you should and at other times you will have much less.  However, with the low cache size you could essentially 'set and forget'.  Your main risk would be running out of work if there was a project glitch of some sort.
  • If you're prepared to monitor things occasionally, you could alternate the two searches through preferences on a time scale of your choosing - perhaps several days or more.  Let's say you chose a week.  Start with a low cache size with just one search selected.  Once you have done a few tasks and the estimate is looking 'within the ball park' you could increase the cache size and then forget about it for the remainder of the week.  At the appropriate time, you would set the work cache to a low value again and then change the search to the other one.  Any new tasks will be for the other search but you won't get too many.  The remaining tasks for the previous search selection will be crunched.  When the new search tasks start being crunched there will be a DCF adjustment and once again, when that settles, you can increase the work cache to what you would like to have.  The disadvantage is that you have to keep remembering to do the pref changes (work cache and search type) at the appropriate times.  In other words, you have to do a bit of micro-managing on an ongoing basis :-).
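
For the second option, the rough arithmetic looks like this.  It is only a sketch; the 2-hour estimate and the 10x error factor are illustrative, not measured values:

    # Rough arithmetic behind the 'very low cache' option: even a 10x
    # underestimate of crunch time leaves you far inside the 14-day deadline.

    cache_days = 0.2
    estimated_hours_per_task = 2       # what the client thinks a GW task takes
    underestimate_factor = 10          # how wrong that estimate might be

    tasks_fetched = cache_days * 24 / estimated_hours_per_task            # ~2.4 tasks
    real_days = tasks_fetched * estimated_hours_per_task * underestimate_factor / 24
    print(round(real_days, 1))         # ~2.0 days of real work vs a 14-day deadline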

I hope this is all understandable.

Cheers,
Gary.

archae86
Joined: 6 Dec 05
Posts: 3157
Credit: 7232354634
RAC: 1158290

dmargulis wrote:
I don't believe that I changed my task buffer size so how would I check on what caused the excessive task downloads?

There is more than one important element here: some are in your control, some are matters of how BOINC runs, and some of how Einstein is configured.

The project provides estimates of the work content of each task.  BOINC on your PC guesses how long each task will take to run based on the work content estimate combined with a current estimate of how fast your machine is.

The first guess of how fast your machine is comes from a benchmark run by BOINC on your machine, but gets adjusted in view of actual results. 

So long as actual run times are consistently proportional to the project estimates across task types, the system works well.  But here at Einstein, the task time consumption relative to that estimate can vary between task types by very large amounts--well over a factor of ten.

As you have your project settings adjusted to allow more than one type of task, your system is vulnerable to a shift in task type.  In this case, work was probably fetched to your machine assuming that history on the Gamma-Ray Pulsar tasks could be applied to the Gravity Wave tasks.  In fact, the GW tasks were going to take well over ten times the elapsed time estimated that way.
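
A sketch of that estimate, under invented numbers (the work-content and benchmark figures are made up; the two correction values simply echo the DCF figures quoted a little further down):

    # Sketch of the runtime guess described above.  Figures are illustrative.

    def runtime_guess(task_flops_estimate, benchmarked_flops, correction):
        """First guess from the benchmark, scaled by a correction factor that
        BOINC learns from tasks actually completed (the DCF at Einstein)."""
        return correction * task_flops_estimate / benchmarked_flops

    flops_estimate = 3.6e15     # hypothetical work content of one GW task
    benchmark_speed = 1e11      # hypothetical effective host speed in FLOPS

    # Correction learned while crunching fast Gamma-Ray Pulsar tasks:
    print(runtime_guess(flops_estimate, benchmark_speed, correction=0.18) / 3600)  # ~1.8 h
    # What a Gravity Wave task on the same host really needs:
    print(runtime_guess(flops_estimate, benchmark_speed, correction=1.76) / 3600)  # ~17.6 h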

There are simple things you can do to avoid this type of trouble:

1. Allow only one type of task at any given time (this means turning off the "beta tasks allowed" flag, as well as unchecking all the other types save the selected one).

2. Or, if you don't like that, specify a very short task queue length (the sum of "Store at least" and "Store up to an additional").  Consider 0.1 day total (no, I'm not joking), maybe less if you run multiple projects.

Additional elements which can be a factor for people running multiple projects involve BOINC choosing to run another project for days when one falls into schedule trouble, etc.  Again, this consideration may give you a good reason to employ an extremely short queue length.

Also, if you are inconsistent in how much of the time your PC is actually running BOINC, that will cause inappropriate work fetch.  A remedy for that is, once again, a very short queue length.

If you want to watch BOINC on your machine adjust its guess of how fast your machine is in response to recently run work types, watch the value of the Duration Correction Factor (DCF) change.  I currently have two hosts with the exact same graphics card (RX 570), one of which is running GRP only and one GW only.  At the moment, the GRP-only one shows a DCF of 0.178, while the GW one shows 1.76.

The adjustment procedure moves the number slowly down each time a task runs in a shorter elapsed time than predicted.  If the elapsed time is only a little longer than predicted, a similarly small adjustment up is made.  But if the elapsed time is longer by too large a proportion, there is an immediate full adjustment to the single most recent observation.  That gives an immediate big change in your total estimated queue size and can trigger "panic mode", with attendant suspending of running tasks in favor of others seemingly in more deadline trouble, not to mention cross-project allocation trouble.
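
Described as pseudocode, that asymmetric rule looks roughly like the sketch below.  It is only illustrative -- the thresholds and gains are invented, and the real BOINC client code differs in detail:

    # Illustrative sketch of the asymmetric DCF adjustment described above.
    # NOT the actual BOINC client code; the 10% threshold and the 0.01/0.1
    # gains are invented to show the shape of the behaviour.

    def update_dcf(dcf, elapsed, predicted_uncorrected):
        """Return a new DCF after one task finishes.

        elapsed:                actual elapsed time of the task (seconds)
        predicted_uncorrected:  the project's runtime estimate before the DCF
        """
        observed_ratio = elapsed / predicted_uncorrected   # the DCF this task 'wanted'
        corrected_prediction = dcf * predicted_uncorrected

        if elapsed <= corrected_prediction:
            # Finished sooner than predicted: creep down slowly.
            return dcf + 0.01 * (observed_ratio - dcf)
        if elapsed <= 1.1 * corrected_prediction:
            # Only a little longer than predicted: small step up.
            return dcf + 0.1 * (observed_ratio - dcf)
        # Much longer than predicted: jump straight to the latest observation,
        # which is what suddenly balloons the estimated queue size.
        return observed_ratio

    # A host settled on fast FGRPB1G work, then one long GW task arrives:
    dcf = 0.178
    dcf = update_dcf(dcf, elapsed=63000, predicted_uncorrected=36000)
    print(round(dcf, 2))   # ~1.75 -- one bad surprise rewrites the whole estimate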

 
