Completed, too late to validate

KittenKaboodle
Joined: 9 Feb 11
Posts: 13
Credit: 10765731
RAC: 0
Topic 195749

Sometimes, the following scenario happens:
(see http://einsteinathome.org/workunit/94548956 for details)

I get a certain WU because some other computer didn't complete the WU in time. Long before my computer starts to compute that WU, the computer which timed out for the WU finally completes the WU and returns the result. When my computer returns the completed WU, I get no credits.
Why doesn't Boinc realize that my computer doesn't need to compute that WU any more? The server should communicate the situation to my computer and ask my computer to delete the WU. It's a waste of resources and I don't get credits anyway.

Richard Haselgrove
Joined: 10 Dec 05
Posts: 2142
Credit: 2775580752
RAC: 812016

Quote:

Sometimes, the following scenario happens:
(see http://einsteinathome.org/workunit/94548956 for details)

I get a certain WU because some other computer didn't complete the WU in time. Long before my computer starts to compute that WU, the computer which timed out for the WU finally completes the WU and returns the result. When my computer returns the completed WU, I get no credits.
Why doesn't Boinc realize that my computer doesn't need to compute that WU any more? The server should communicate the situation to my computer and ask my computer to delete the WU. It's a waste of resources and I don't get credits anyway.


BOINC does actually have the ability to cancel a task under those circumstances, but it relies on your computer contacting the server in the interval between the original late task being returned, and your computer starting to compute the new copy it has received. Once your computer has already started work on the task, it doesn't get cancelled automatically, in an attempt to preserve your credit for the work you've already done.

But that's nothing to do with the reason you got zero credits. If you look at your copy (task 225314958), you'll see that you had your own deadline to meet, 14 days from the date of issue, the same as the other tasks. It was because you missed your own personal deadline that you got zero credits: if you had reported it just three hours earlier, you would have got credit, at least, though I agree the scientific work would have been a bit of a waste since the results were already in.

Edit - an i7 with a RAC of 21,935 should have no difficulty returning work on time. Might I suggest that you reduce your cache settings a bit?

KittenKaboodle
Joined: 9 Feb 11
Posts: 13
Credit: 10765731
RAC: 0

Quote:

BOINC does actually have the ability to cancel a task under those circumstances, but it relies on your computer contacting the server in the interval between the original late task being returned, and your computer starting to compute the new copy it has received.


There was plenty of time (almost 12 days) and my computer is communicating with the server every 4 hours, so I don't get why that WU wasn't canceled.

Quote:

But that's nothing to do with the reason you got zero credits. If you look at your copy (task 225314958), you'll see that you had your own deadline to meet, 14 days from the date of issue, the same as the other tasks. It was because you missed your own personal deadline that you got zero credits: if you had reported it just three hours earlier, you would have got credit, at least, though I agree the scientific work would have been a bit of a waste since the results were already in.

The point is, if there was a reliable mechanism for cancelling a WU, then this problem would never have occurred. I have missed the deadline more than once, and it's usually no problem when my computer is just a few hours late.

Quote:

Edit - an i7 with a RAC of 21,935 should have no difficulty returning work on time. Might I suggest that you reduce your cache settings a bit?

Well, I could, but then my GPU wouldn't get enough work overnight.
See http://einsteinathome.org/node/195669&nowrap=true#110755 for details.

Edit: And why does BOINC start calculating WUs that are due on 20 April and then stop after completing 20% of them?
Right now I have 7 WUs left which are due on 13 April, but BOINC has started the WUs due on 14 and 15 April. I just don't understand that software.

Sabroe_SMC
Joined: 9 Oct 06
Posts: 27
Credit: 359911731
RAC: 117001

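@echo off
cls

:START
rem Show the current time so each pass of the loop is visible
time /t

rem Wait roughly one hour (ping sends about one echo request per second)
ping -n 3600 127.0.0.1 >nul

rem Ask the BOINC client to contact the Einstein@Home server
boinccmd.exe --project http://einstein.phys.uwm.edu/ update

goto START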

Make a batch file with this content and start it. It will contact the Einstein server every 3600 seconds (once per hour) and trigger an update to report finished work or fetch new work. Then you can use a smaller cache.
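For anyone who would rather use a script than a batch file, the same loop can be sketched in Python. This is only an illustration, not a tested tool: it assumes boinccmd is on the PATH, reuses the project URL from the batch file, and the function names are made up for this example.

```python
import subprocess
import time

# Project URL taken from the batch file in this post.
PROJECT_URL = "http://einstein.phys.uwm.edu/"


def build_update_command(project_url=PROJECT_URL, boinccmd="boinccmd"):
    """Return the boinccmd argument list that requests a project update."""
    return [boinccmd, "--project", project_url, "update"]


def update_loop(interval_seconds=3600, max_updates=None,
                run=subprocess.call, sleep=time.sleep):
    """Wait interval_seconds, then request an update, and repeat.

    max_updates=None loops forever, like the GOTO in the batch file.
    run and sleep are parameters only so that the loop can be tried
    out without a BOINC client installed (an assumption of this sketch).
    """
    done = 0
    while max_updates is None or done < max_updates:
        sleep(interval_seconds)
        run(build_update_command())
        done += 1
    return done
```

Calling update_loop() with no arguments would wait an hour, ask for an update, and keep going until the script is stopped.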

mikey
Joined: 22 Jan 05
Posts: 11977
Credit: 1834156899
RAC: 207375

Quote:

Edit: And why does BOINC start calculating WUs that are due on 20 April and then stop after completing 20% of them?
Right now I have 7 WUs left which are due on 13 April, but BOINC has started the WUs due on 14 and 15 April. I just don't understand that software.

This one could be due to memory limits, i.e. how much memory you are letting BOINC use. Open the BOINC Manager (down by the clock), click Advanced, then Preferences, go to the disk and memory usage tab and change the percentage under memory usage for "use at most [] % when computer is in use". Change that to, say, 85.00 and see whether your tasks then run all the way through. I was having the same problem, even at 75%, but they now go all the way without doing what you are describing. I also have the one below it, the "% when computer is idle", set at 90%.

Quote:
The point is, if there was a reliable mechanism for cancelling a WU, then this problem would never have occurred. I have missed the deadline more than once, and it's usually no problem when my computer is just a few hours late.

You have been lucky, or unlucky, then. BOINC is designed NOT to give credit to a PC that returns units late. However, when a unit is late it is sent out to another PC for crunching. What you have been doing is returning the unit before they did, so you were getting credit and they were not! In this case they returned it before you did and you did not get the credit. Years ago SETI looked into a system where any unit that was resent like this would be sent to a PC that consistently returned units in less than 24 hours. The scheduling was too complicated so they dropped it. You just got paired with a PC that was uber fast in its crunching and returning of units. Some projects send a cancel signal to the original PC, some don't.

KittenKaboodle
Joined: 9 Feb 11
Posts: 13
Credit: 10765731
RAC: 0

Thanks, mikey, I will adjust my config and see if it helps.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5850
Credit: 110085752897
RAC: 23824240

Quote:
Quote:

BOINC does actually have the ability to cancel a task under those circumstances, but it relies on your computer contacting the server in the interval between the original late task being returned, and your computer starting to compute the new copy it has received.

There was plenty of time (almost 12 days) and my computer is communicating with the server every 4 hours, so I don't get why that WU wasn't canceled.


I'm guessing it's because that BOINC feature may not be implemented at E@H. The Devs at each project decide which BOINC features to enable and which to disable in the server config. For example, E@H has the "Resend Lost Tasks" feature enabled whereas other projects have this disabled. I find this feature particularly useful for a number of different purposes. On the other hand I've never seen a task get canceled by server instruction but then again I try very hard to make sure that resend tasks get completed by the deadline because of the very thing that happened to you.

You got a resend task because someone else missed a deadline. Instead of that person aborting the missed deadline task, they let it run and be returned late. Because your cache of work is such that some of your tasks can miss their own deadlines, you are always at risk of the original deadline miss task being completed way late but still making it back before yours does. If that happens, you get no credit if your task misses its own deadline by the tiniest whisker. There's nothing personal - it's just bad luck.

Quote:
Quote:

But that's nothing to do with the reason you got zero credits. If you look at your copy (task 225314958), you'll see that you had your own deadline to meet, 14 days from the date of issue, the same as the other tasks. It was because you missed your own personal deadline that you got zero credits: if you had reported it just three hours earlier, you would have got credit, at least, though I agree the scientific work would have been a bit of a waste since the results were already in.

The point is, if there was a reliable mechanism for cancelling a WU, then this problem would never have occurred. I have missed the deadline more than once, and it's usually no problem when my computer is just a few hours late.


The point is, there is a reliable mechanism but the Devs would seem not to have implemented it, probably because of the extra load it would place on the servers.

Quote:
Edit: And why does BOINC start calculating WUs that are due on 20 April and then stop after completing 20% of them?
Right now I have 7 WUs left which are due on 13 April, but BOINC has started the WUs due on 14 and 15 April. I just don't understand that software.


This is nothing to do with your memory use preference settings. It will occur even if you allow BOINC to use 100% at all times. I know from personal experience.

It very much depends on the version of BOINC you are using. I can make it happen at will on the current recommended version and I'm guessing it is probably in most if not all 6.10.x versions. It's one of the major reasons I've left most of my machines on earlier versions. It may be fixed in 6.12.x and it's probably worth trying one of those to see if it fixes things. Richard Haselgrove may be able to comment further about that.

For any host running the current version (well 6.10.58 anyway) here's how to make it happen.

  * Have a big cache of work to make it happen more easily.
  * Have the machine off (or crashed, or BOINC not running for some reason) so that the time stats will drop to the extent that the current cache of work exceeds (or is at risk of exceeding) the allowed deadline (when it wasn't exceeding the deadline previously).
  * When BOINC is restarted, it will go into "High Priority" mode and start making incredibly stupid decisions like pausing the most "at risk" task that's 99% completed in order to start another task that is in no danger at all, and so forth, just as you describe.

So why is it happening to you, since you are not causing your time stats to be lowered? I believe it will be due to DCF oscillations. The Duration Correction Factor will go up and down markedly due to the speed with which your GPU can complete tasks. Let's assume the DCF starts at 1.0 (the desired value), as this means the project estimate for how long a job will take is precisely correct. If the project estimate is too short, the DCF will be adjusted upwards so that the estimate × DCF gives an answer closer to reality. So if the project estimate is 5 hours and your computer always takes 6, over a period the DCF would converge to 1.2. The reverse would happen if the estimate was too big. For example, an 8 hour estimate for a job that really takes 6 hours would eventually result in a DCF of 0.75.
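The arithmetic behind those two figures can be checked with a few lines of Python. This is just a worked example of the ratio described in this post. The function names are invented for illustration, and the 6 hour real run time for the second case is assumed to be the same as in the first.

```python
def converged_dcf(estimate_hours, actual_hours):
    """DCF value at which estimate x DCF equals the real run time."""
    return actual_hours / estimate_hours


def corrected_estimate(estimate_hours, dcf):
    """Run time the client would predict once the DCF is applied."""
    return estimate_hours * dcf


# Project estimate too short: 5 hours estimated, 6 hours actually needed.
too_short = converged_dcf(5, 6)

# Project estimate too long: 8 hours estimated, 6 hours actually needed.
too_long = converged_dcf(8, 6)

print(too_short, too_long)
print(corrected_estimate(5, too_short), corrected_estimate(8, too_long))
```

In both cases the corrected estimate comes back to about 6 hours, which is the whole purpose of the factor.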

The problem is that there is only a single (per project) DCF for all E@H tasks crunched on your machine. If you were only doing CPU tasks and the GW and BRP estimates weren't too bad (i.e. both high or both low by about the same amount) then there wouldn't be a problem. It's when you do a BRP task on the GPU that the problem arises. These tasks are completed so quickly that the DCF will be adjusted down by a relatively big fraction, particularly if you have several in a row. While the DCF is being adjusted downwards (and with a large cache size) the total estimated hours in your cache keep dropping (in stepwise fashion as each DCF adjustment is made) and so your client will be asking for a lot more work each time the DCF steps down.

Then comes the day of reckoning. A much longer running GW task finishes and in one single hit, the DCF jumps back to a much higher value. If you have an overly large cache, all of a sudden the total hours in that cache will have grown to the extent that BOINC goes into panic mode. Then the buggy behaviour of particular BOINC versions kicks in and you experience just what you describe.
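To make the oscillation concrete, here is a toy simulation in Python. Everything in it is an assumption for illustration: the update rule (step part of the way down, jump all the way up) is a simplification that follows the description in this post rather than BOINC's actual code, and the cache size, estimates and run times are invented numbers. The cache is also kept at a fixed size, whereas a real client would be fetching more work as the DCF falls, which only makes the final jump worse.

```python
def update_dcf(dcf, estimate_hours, actual_hours, step_down=0.1):
    """Simplified DCF update (an assumption, not BOINC's real rule).

    Rises to the observed ratio in one hit, but falls only part of the way.
    """
    ratio = actual_hours / estimate_hours
    if ratio > dcf:
        return ratio
    return dcf + step_down * (ratio - dcf)


def cache_hours(raw_estimates, dcf):
    """Total run time the client believes is sitting in the cache."""
    return sum(raw_estimates) * dcf


# Hypothetical cache: 40 tasks, each with a raw project estimate of 6 hours.
cache = [6.0] * 40
dcf = 1.0
history = [cache_hours(cache, dcf)]

# Ten quick GPU tasks in a row: estimated 6 hours, finished in 1 hour each.
for _ in range(10):
    dcf = update_dcf(dcf, estimate_hours=6.0, actual_hours=1.0)
    history.append(cache_hours(cache, dcf))

shrunk = history[-1]

# One long CPU task: estimated 6 hours, actually took 7.
dcf = update_dcf(dcf, estimate_hours=6.0, actual_hours=7.0)
jumped = cache_hours(cache, dcf)

print(f"cache looks like {shrunk:.0f} h after the GPU tasks")
print(f"cache looks like {jumped:.0f} h after the long CPU task")
```

The same 40 tasks appear to shrink step by step and then more than double in a single update, which is the sort of jump that can push a large cache past its deadlines.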

As I see it, you have two choices to mitigate the effects of this issue.

1. Experiment with BOINC versions until you find one that can handle your GPU properly (so no old versions) without showing the problem. I imagine your only real hope is probably the latest 6.12.x version. Richard may very well know more about this.

2. Reduce your cache size to the point that a sudden upward movement in DCF is much less likely to take you into HP mode. Guarantee your supply of BRP work by using a very simple boinccmd script to update the project, say once per hour. You could run it as a scheduled task so you wouldn't need the script itself to work out when to run. This is just the same as Sabroe_SMC already suggested. Surely you won't be running out of GPU work if BOINC is told to update the project every hour?

Cheers,
Gary.

KittenKaboodle
Joined: 9 Feb 11
Posts: 13
Credit: 10765731
RAC: 0

Thanks for your thorough answer, Gary.
I was able to reduce the cache while maintaining enough GPU WUs by inserting the tag into the app_info.xml. Now my CPU WUs won't get near the deadline any more.

websterhamster
Joined: 14 Feb 11
Posts: 1
Credit: 500
RAC: 0

I was just wondering, what do you do if you never get a workunit that you can complete on time? Does that mean your computer is just too slow for E@H?

Richard Haselgrove
Joined: 10 Dec 05
Posts: 2142
Credit: 2775580752
RAC: 812016

Quote:
I was just wondering, what do you do if you never get a workunit that you can complete on time? Does that mean your computer is just too slow for E@H?


The deadline at Einstein is always 14 days after the task is issued to you. So, if your computer can't complete a single Einstein task in less than 14 days, then yes, I'm afraid it is too slow.

Having said that, my current oldest and slowest PC, a 1.6 GHz Pentium 4 that's over 10 years old, can complete an S5GCE HF task in under 18 hours, so it beats that minimum speed requirement almost 20-fold. You would have to be using a very old, or very slow, computer not to be able to participate in Einstein at all.

Most of this discussion thread has been about people who can complete tasks much faster than that, but for one reason or another don't even start computing the work they've been assigned until too close to an impending deadline.

mikey
Joined: 22 Jan 05
Posts: 11977
Credit: 1834156899
RAC: 207375

Quote:
I was just wondering, what do you do if you never get a workunit that you can complete on time? Does that mean your computer is just too slow for E@H?

Yours is NOT too slow but it did take an awfully long time to complete a unit!!

Issued to you: 24 Apr 2011 1:15:38 UTC
Returned to Einstein: 7 May 2011 17:59:02 UTC
Status: Completed and validated
Run time (sec): 271,836.70
CPU time (sec): 194,844.50
Claimed credit: 369.94
Granted credit: 500.00
Application: Binary Radio Pulsar Search v1.05 (BRP3SSE)

It took 75 HOURS to finish one work unit!! Your pc is either not on 24/7 or was VERY busy doing other things instead of crunching. If you will start a new thread some of us can help you figure out why it is taking so long.
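The numbers above can be checked with a short Python calculation. It uses the issue and return times quoted for this task and the 14 day deadline mentioned earlier in the thread. Treating the 271,836.70 figure as the run time in seconds is an assumption, though it does match the 75 hours.

```python
from datetime import datetime, timedelta

# Einstein deadline as stated earlier in the thread: 14 days after issue.
DEADLINE = timedelta(days=14)

# Times quoted for this task (UTC).
issued = datetime(2011, 4, 24, 1, 15, 38)
returned = datetime(2011, 5, 7, 17, 59, 2)

# Assumed to be the run time in seconds.
run_time_seconds = 271836.70

deadline = issued + DEADLINE
margin = deadline - returned
run_time_hours = run_time_seconds / 3600

print("deadline:", deadline)
print("returned with", margin, "to spare")
print(f"run time: {run_time_hours:.1f} hours")
```

So the task did make its deadline, but only with a few hours to spare, even though the crunching itself was roughly three days of run time.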
