Quote:
This unit was reported after BOINCMgr had increased the forecasted completion times for work already in queue by a factor of about six, but the host web page still showed the task duration correction factor at .21. Subsequent to reporting this single S5R3 task, and a single S5R4 task, the completion forecasts have come back down some (from almost 31 hours to 28.25 hours), and the task duration correction factor now shows on the web page as 1.30 (up from .21 minutes earlier). So apparently through some means the tdcf was greatly increased, but the value displayed on the web page lagged that used by BOINCMgr to display predictions.
I was about to renounce this observation, since I'd not seen it on my other hosts, and I'd not seen any confirming observations from others.
Now Richard Haselgrove in another thread has reported that the first S5R4 result reported by one of his hosts raised the tdcf for that host from 0.216881 to 1.586961. Both values are not so far from my own.
As to my other three hosts, none of them has yet completed or reported an S5R4 task. So if that is the trigger there is no incongruity. I should have completions from two other hosts in about a day.
My new guess of the behavior is that as soon as an S5R4 task completes, the estimates shown by BOINCMgr for completion time of both S5R3 and S5R4 tasks will enormously increase--to preposterously high values for S5R3, and somewhere in the neighborhood of truth for S5R4. This won't affect the web page tdcf until the task is successfully reported.
I suspect it is unusual that my host reported a single S5R4 task, causing a tdcf bump, and then went back to working down a queue of scores of S5R3 tasks, which will steadily whittle the tdcf back toward .21. I botched my transition badly, with multiple app_info.xml edits, program downloads, and restarts, some of which involved wholesale rejection and subsequent reconstitution of my S5R3 queue. I suspect there was a moment in which I had one S5R4 task ready to go and was unable to run S5R3, so it got to run. I'll just need to lower my requested queue size to avoid an excessive flood of S5R4 when that resumes for this host.
You are correct. Most hosts got the bump up and then they are settling in... Even still, Bernd has indicated that the initial units are at runtime maxima, so the dcf will gradually decrease for a non-app_info host...
However, for anyone using the multi-purpose app_info file, wide fluctuations are certainly possible. Any underestimate is corrected upward at once, to the full run time of the last task completed, whereas overestimates only gradually bring the estimate back down.
As for avoiding the flood, I believe that you could also edit client_state.xml to have a value closer to reality for S5R4...
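For anyone who wants to look at the locally stored value before touching anything, here is a minimal sketch of reading it. It assumes the client keeps the per-project value in client_state.xml, in an element named duration_correction_factor inside each project element; the sample text, the URL and the helper name read_dcf are made up for illustration, so check them against your own file. It is generally advised to stop the BOINC client before hand-editing the state file, since the client rewrites it while running.

```python
import xml.etree.ElementTree as ET

# Made-up fragment standing in for a real client_state.xml (assumed layout).
SAMPLE = """<client_state>
  <project>
    <master_url>http://example.invalid/project/</master_url>
    <duration_correction_factor>0.216881</duration_correction_factor>
  </project>
</client_state>"""


def read_dcf(xml_text):
    """Return {master_url: dcf} for every <project> element found in the text."""
    root = ET.fromstring(xml_text)
    result = {}
    for project in root.iter("project"):
        url = project.findtext("master_url", default="?")
        dcf = project.findtext("duration_correction_factor")
        if dcf is not None:
            result[url] = float(dcf)
    return result


print(read_dcf(SAMPLE))
```

This only reads the value; changing it by hand is at your own risk.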
I received 10 S5R4 files, even though I have an app_info.xml for the optimized SSE2 Linux app.
While reading the SETI forums tonight, I noticed that SETI has the same problem. A lot of people use an optimized client via app_info.xml, yet some of them received Astropulse work units that they definitely didn't have in their config.
This problem has finally been fixed - if anyone sees Bruce Allen before I do, please tell him to apply changeset 15765 to the server.
Thank you very much for the update, I forwarded it.
Bikeman
Yep, read it on boinc_dev. I updated the scheduler, should be fixed now.
BM
Somehow, despite having a large queue of S5R3, my host Stoll4 has completed, uploaded and reported its first S5R4 result.
This took 8.53 CPU hours, while S5R3 units on this host ranged from about 4 to 5 hours.
This unit was reported after BOINCMgr had increased the forecasted completion times for work already in queue by a factor of about six, but the host web page still showed the task duration correction factor at .21.
Subsequent to reporting this single S5R3 task, and a single S5R4 task, the completion forecasts have come back down some (from almost 31 hours to 28.25 hours), and the task duration correction factor now shows on the web page as 1.30 (up from .21 minutes earlier).
So apparently through some means the tdcf was greatly increased, but the value displayed on the web page lagged that used by BOINCMgr to display predictions.
It seems likely that, for hosts with large amounts of S5R3 work remaining in queue, the hugely better-than-forecast performance will push the tdcf back down far enough to escape the more severe levels of EDF and so on within a dozen or two S5R3 task completions.
However, this may undo the (presumably intended) result of abating severe overfetch of S5R4 work once the S5R3 estimate excess bleeds off enough to allow fetch to resume, as it will mean that the tdcf has been driven back much nearer an appropriate value for S5R3, and thus far too low for S5R4 as originally deployed. The tdcf, as I recall, will adapt upward immediately if the first S5R4 completion is substantially above expectation, but for fast hosts with large queues, by then the S5R4 work in queue may be several times that desired.
I invite correction from those who understand how this stuff really works.
I would normally trim your message and try to respond to specific points. However, since you have gone to the trouble of accurately documenting a series of observations which are important to really understanding this, I'll leave everything intact. In no way is any of what follows a "correction" of your observations. It's hopefully an understandable explanation for anybody troubled by what appear to be bad estimates of task duration. Please note that the figures I quote below are very much rough estimates.
I've posted some background information elsewhere that will help in understanding the following comments. If you haven't seen it already, please have a read first.
From archae86's description, he has both S5R3 and S5R4 tasks in his cache. From the numbers presented, his S5R3 tasks would have had a crunch estimate inserted by the WUG of about 22.5 hours. His BOINC client would have learnt (over a period of time) that a DCF of about 0.2 needs to be applied so that when this information is passed to BOINC Manager, it will display a crunch estimate of 22.5x0.2 = 4.5 hours. The true crunch time will be about 4.5 hours and all is well.
When the first S5R4 task is downloaded, the new WUG will have inserted quite a different estimate into the task, one that is a bit smaller than the true value but still a lot closer than the "way out" value of 22.5 hours compared to the 4.5 hour reality. The reason for this "way out" estimate is explained in my previous message that I linked to.
From the description given by archae86, the S5R4 WUG estimate would have been something like 6 hours. (Please note that you can't see this value anywhere in BOINC Manager). When the BOINC client received the first task, it would have simply applied the only DCF it knew about - the 0.2 value it had learned from long time experience. Therefore the estimated crunch time passed to BOINC Manager for display purposes would have been 0.2x6.0 = 1.2 hours. This is the value that you will see and this is precisely why you will get a whole bunch of tasks if you have left your cache settings at too large a value.
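A rough sketch of that arithmetic, using the approximate figures from this thread, may make the overfetch easier to see. The 3-day cache and the function names are assumptions chosen for illustration, and the real work-fetch logic takes more into account than this simple division.

```python
def displayed_estimate(wug_estimate_hours, dcf):
    """Estimate shown to the user: the WUG estimate scaled by the DCF."""
    return wug_estimate_hours * dcf


def tasks_to_fill_cache(cache_days, est_hours_per_task, cores=1):
    """Roughly how many tasks seem to fit in the cache at a given estimate."""
    return cache_days * 24 * cores / est_hours_per_task


s5r3_shown = displayed_estimate(22.5, 0.2)  # about 4.5 h, close to reality
s5r4_shown = displayed_estimate(6.0, 0.2)   # about 1.2 h, far below the real ~8.5 h

cache_days = 3  # hypothetical cache setting
wanted = tasks_to_fill_cache(cache_days, s5r4_shown)  # what the low estimate suggests
fits = tasks_to_fill_cache(cache_days, 8.5)           # what would really fit

print(f"S5R3 shown: {s5r3_shown:.1f} h, S5R4 shown: {s5r4_shown:.1f} h")
print(f"tasks requested: about {wanted:.0f}, tasks that really fit: about {fits:.0f}")
```

With these numbers the client would ask for roughly seven times as much S5R4 work as it can finish in the cache period.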
Until an S5R4 task is actually fully completed, BOINC is not smart enough to do anything about the now "way too low" estimate of 1.2 hours. If the task was 20% completed and had already taken 1.5 hours of crunch time, a person could spot the looming problem, but BOINC can't. When the task is finally done and an actual time of 8.5 hours is recorded, BOINC will get its wake up call and will adjust the DCF in one big hit to the new value of 8.5/6 = 1.4 approximately. Note that the divisor of 6 is from the WUG estimate that we don't see but BOINC knows about.
The S5R4 task that has just completed will be uploaded and the new DCF will be recorded in the state file, but it will NOT be recorded on the website until some time later, when the result(s) are actually reported. This doesn't matter, since it's the value in the state file that determines future work fetch arrangements.
A by-product of all this will be that all other S5R4 tasks in the cache will suddenly jump from 1.2 hour estimates to 8.5 hour estimates and (more importantly) all S5R3 tasks present now or downloaded in the future will have estimates of around 22.5x1.4 = 31.5 hours. In fact they will still only take the same 4.5 hours as per usual but BOINC simply can't know any better. The only way for BOINC to know better would be to add functionality to separately track the different science runs with different DCF values. If there were several of these S5R3 tasks in the cache the excessive estimates of 31.5 hours each could cause the onset of "high priority" crunching mode.
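The one-step upward correction and its side effect on the S5R3 estimates can be sketched as follows. This is only an illustration of the behaviour described in this thread, not the client's actual code, and the function name is made up. Without rounding the DCF to 1.4, the inflated S5R3 estimate comes out nearer 31.9 hours than 31.5.

```python
def dcf_after_overrun(dcf, wug_estimate_hours, actual_hours):
    """If a task ran longer than estimated, the DCF jumps straight to actual / WUG estimate."""
    ratio = actual_hours / wug_estimate_hours
    return max(dcf, ratio)


old_dcf = 0.2
new_dcf = dcf_after_overrun(old_dcf, wug_estimate_hours=6.0, actual_hours=8.5)

s5r4_estimate = 6.0 * new_dcf   # remaining S5R4 tasks now show about 8.5 h
s5r3_estimate = 22.5 * new_dcf  # S5R3 tasks now show about 31.9 h, but still take ~4.5 h

print(f"DCF {old_dcf} -> {new_dcf:.3f}")
print(f"S5R4 estimate: {s5r4_estimate:.1f} h, S5R3 estimate: {s5r3_estimate:.1f} h")
```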
So when an S5R3 task is completed after the S5R4 one has upset the BOINC "equilibrium", BOINC will notice that it only takes 4.5 hours as against the estimate of 31.5 hours - say a 27 hour discrepancy. 10% of this is 2.7 hours so BOINC will now be estimating around 29 hours (31.5 - 2.7) for all the remaining S5R3 tasks. BOINC will now reduce the DCF to about 29/22.5 = 1.29. These 10% reductions will continue for every extra S5R3 result that is completed.
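The single downward step worked through above can be written out as a small sketch. The 10% figure is the one quoted in this thread, and the function name is hypothetical. Without rounding 28.8 up to 29, the new DCF comes out as 1.28 rather than 1.29.

```python
def step_down(estimate_hours, actual_hours, fraction=0.10):
    """Move the estimate a fixed fraction of the way toward the actual run time."""
    return estimate_hours - fraction * (estimate_hours - actual_hours)


wug_estimate = 22.5  # S5R3 WUG estimate in hours (rough figure from this thread)
old_estimate = 31.5  # estimate shown after the S5R4 bump (22.5 x 1.4)
actual = 4.5         # what an S5R3 task really takes

new_estimate = step_down(old_estimate, actual)  # 31.5 - 2.7
new_dcf = new_estimate / wug_estimate

print(f"new estimate: {new_estimate:.1f} h, new DCF: {new_dcf:.2f}")
```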
If several S5R3 tasks were done in succession, the DCF will continue to drop as shown in the calculations above. Soon the DCF could be below 1.0 again and still heading down. Not only would the S5R3 task estimates be reducing, but also would the estimates for S5R4 tasks in the cache. If an S5R4 task were then to be completed, the DCF would probably jump up again to about 1.4.
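The up-and-down pattern can be simulated with a short sketch that combines the two rules as described in this thread: an immediate jump when a task overruns its estimate, and a 10% step toward the observed ratio otherwise. The real client may differ in detail, and the task figures are the rough ones used above.

```python
def update_dcf(dcf, wug_estimate_hours, actual_hours, step=0.10):
    """Asymmetric update: jump up at once on an overrun, otherwise drift down by `step`."""
    ratio = actual_hours / wug_estimate_hours
    if ratio > dcf:
        return ratio
    return dcf - step * (dcf - ratio)


def simulate(dcf, tasks):
    """Return the DCF history for a sequence of (wug_estimate, actual) tasks."""
    history = [dcf]
    for wug_estimate, actual in tasks:
        dcf = update_dcf(dcf, wug_estimate, actual)
        history.append(dcf)
    return history


S5R3 = (22.5, 4.5)  # WUG estimate and real run time, in hours
S5R4 = (6.0, 8.5)

# One S5R4 result, five S5R3 results, then another S5R4 result.
history = simulate(0.2, [S5R4] + [S5R3] * 5 + [S5R4])
for n, value in enumerate(history):
    print(n, round(value, 3))
```

With these figures the DCF jumps to about 1.42, slips below 1.0 after the fourth S5R3 result, and jumps back to about 1.42 as soon as the next S5R4 result completes.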
Unfortunately this "yo-yo" effect is likely to continue for some time for those people doing their "civic duty" of agreeing to accept both S5R3 resends and new S5R4 tasks. The best way to avoid any adverse impact is to make sure your cache doesn't exceed about 1 - 2 days (preferably less, particularly if you run multiple projects) and not to get too hung up on the incorrect and oscillating estimates. If you feel your ability to support multiple projects is being compromised by these gyrations, you could use the AP mechanism to prevent further S5R3 resends, and then you could return to a new equilibrium (quickly, if you wanted to, by manually editing your DCF to, say, 1.2). As I've indicated elsewhere, the AP (anonymous platform) mechanism is not to be entered into lightly or without a full understanding of what app_info.xml files are all about. You can screw things up big time if you are not careful.
In about 2 months' time, resends should be largely finished and everything will settle down. Since Bernd has said that we are doing "slow" S5R4 at the moment and that the "average" should be somewhat faster, it's possible that the DCF may eventually settle at something like 1.1 to 1.2, which is certainly much closer to the ideal 1.0 than the current S5R3 low value of 0.2 or thereabouts.
If anyone has questions about any of the above, please fire away.
Cheers,
Gary.
Well done Gary, excellent example of a clear and accurate explanation, as always.
The only thing that's left to explain - and it can't be done yet - is why the WUG estimate of 6 hours (at archae86's host speed - used as an example only, YMMV) has turned out to be so much lower than the real time of 8.5 hours (ditto).
As I've written elsewhere, I don't (yet) buy Bernd's explanation that this is just the regular cyclical runtime variation we're used to from S5R3. It feels much bigger than that.
Very nicely explained here and in the posting you referred to, Gary. Thanks.
And if I may take the liberty of quoting you from the referred posting:
Quote:
We can assist by not leaving machines with overly large caches.
Indeed, many problems of 'panic mode' and crunching past deadlines can be attributed to large caches. Large caches are really only needed by those who only connect every few days. With an 'always on' connection I keep my caches to not more than a day.
Seems like there might be 2 yoyo@home projects... ;)
Regards
Rod
Quote:
The only thing that's left to explain - and it can't be done yet - is why the WUG estimate of 6 hours (at archae86's host speed - used as an example only, YMMV) has turned out to be so much lower than the real time of 8.5 hours (ditto).
This doesn't really surprise me at all. A couple of us (myself included) "reminded" Bernd that the speedup of the S5R3 apps over time had pushed the DCF down considerably: from somewhere in the range of 0.7 - 0.8 at the start of S5R3, if I remember correctly, to values in the 0.15 - 0.25 range. This wasn't really ideal, because a newly added host would "suffer" a five times larger estimate than needed, which could lead to unnecessary questions in the forums and to the impression that the project didn't know how to estimate properly. We suggested to Bernd that the ideal would be to end up with a DCF of around 1.0. Obviously we didn't think through the unintended consequences (on the old run) of restoring the DCF to a more ideal value for the new run.
It has always been my impression that it's not that easy to come up with an overarching estimate that will suit all of the likely platforms on which the tasks will run and that will be close to 1.0. I'm therefore not surprised to see the value somewhat bigger than 1.0 at the moment.
Quote:
As I've written elsewhere, I don't (yet) buy Bernd's explanation that this is just the regular cyclical runtime variation we're used to from S5R3. It feels much bigger than that.
I think Bernd's explanation will turn out to be correct to some extent anyway, simply from an observation of what happens when a host jumps into a frequency band for the very first time, when not much crunching has yet been done in that band. I have the luxury of being able to observe lots of hosts, and I know for a fact that when one of my hosts jumped into a new band where few other hosts appeared to be working, the very first tasks (almost invariably) had sequence numbers close to a cycle peak rather than being more randomly distributed. I saw this many times, and since we have all just jumped into new and "unpopulated" bands, it wouldn't be surprising to find the same behaviour (i.e. jumping in at the slowest crunch times) yet again. These are the "bad" tasks as Bernd described them.
Cheers,
Gary.
Thank you Gary! A terrific explanation. :-)
For me it crystallised with this bit:
Quote:
.... the WUG estimate that we don't see but BOINC knows about ...
The implied risks of trying too hard to micro-manage are noted. I guess BOINC is very much written to be hands-free for the use of those with little/no technical knowledge ( no offense meant ). I think that has to be most of the contributors, unless E@H has managed to attract the focus of ~10^5 nerds! :-)
Certainly the parameter juggling is impressive to attempt an accounting of the enormous variance of hosts.
Cheers, Mike.
I have made this letter longer than usual because I lack the time to make it shorter ...
... and my other CPU is a Ryzen 5950X :-) Blaise Pascal