On Friday 25th, I received a batch of 76 Gamma ray pulsar #4 work units all with a deadline of 10/10. The estimated run time for all of these WUs was 4:02:54.
The first few have started and I can already see the same thing happening on this machine as on my other machines - the initial estimate is low, but the actual runtime for most of the units can be up to a factor of 10 longer. On one machine at the moment I have 4 Gamma ray pulsar #4 work units that have been running for 43:40:19 with an estimated time left of 6:51:29. That particular batch of work units started with an estimated time of 4.5 or 6.5 hours, and all of them are likely to exceed the deadline.
The batch I mentioned above had an initial estimated runtime of 4:02:54, but the first two have already exceeded the estimate. One job has been running for 5:12:10 with an estimated time left of 10:38:22 and is 33% complete. The other WU has been running for 5:52:43 with an estimated time left of 10:15:48 and is 36.5% complete.
Although I do not contribute my processor time for credits alone, these particular WUs do not earn much credit regardless of how long they run (it seems to be 693 credits/WU). However, I do expect estimates to be reasonably accurate; some overrun and some underrun the estimate, but being this far out on the low side for most of the WUs suggests to me that something is wrong with the initial estimation.
I subscribe to a number of projects, and if this keeps happening with this project's estimates, I am afraid I will assign my compute resources elsewhere.
I have no problem with long-running jobs, but I do object to getting such badly estimated WUs that, because of the estimate, it is unlikely I will complete them before the deadline.
Is there an issue with estimated times?
George
Crazy estimated times vs actual run times
Obviously from your point of view there is but, unfortunately, it's not something that the project can fix.
You don't say which of your 4 machines you are talking about but I guess it must be the Q9650 which doesn't yet have any completed tasks rather than the i7 950, which does.
When tasks are issued they contain an estimate of how much work is contained in the task. FGRP4 tasks are mostly pretty uniform in work content, so most will contain the same estimate. It is BOINC's job to convert this work estimate into an estimated crunch time. It does this based on how it estimates the performance capabilities of your machine, and BOINC can be fooled when making this estimate. For this reason, it often takes many completed tasks for BOINC to get the estimate right. Therefore it's always good policy, when starting a new and unknown science run, to start with a small work cache size (by adjusting your preferences temporarily) until a few tasks are completed. That way you won't end up with so many if BOINC has overestimated the capabilities and underestimated the time.
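To illustrate, very roughly, how that conversion behaves - this is only a sketch, not the actual BOINC client code, and the speed and work-content numbers are made up:

# Rough illustration only: a declared work content (floating-point operations)
# divided by the client's estimate of host speed gives the predicted crunch time.
def estimated_runtime_hours(rsc_fpops_est, host_flops, correction=1.0):
    # rsc_fpops_est: operations the task claims to contain (similar for most FGRP4 tasks)
    # host_flops:    the client's current estimate of this host's speed
    # correction:    factor the client gradually learns from completed tasks
    return rsc_fpops_est * correction / host_flops / 3600.0

print(estimated_runtime_hours(3.5e14, 2.4e10))        # ~4 hours predicted at first
print(estimated_runtime_hours(3.5e14, 2.4e10, 10.0))  # ~40 hours once the correction is learned

If the speed estimate is optimistic and there are no completed tasks to learn from, the same task can look ten times shorter than it really is, which is the pattern you're seeing.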
The bigger question is why the estimate was out by a factor of 10. It's not usually this bad. If you look at your completed tasks for your i7 through the link above, you will notice a big difference between CPU time and elapsed (Run) time. It's not usually anywhere near this large and it's a sign that your machine is doing other CPU intensive work. It would be interesting to know what else you use the machine for. BOINC will always defer to other CPU intensive work and this is why the run time can blow out so much.
Another thing that can really slow things down is thermal throttling. Do you monitor temperatures at all? With all cores crunching, things can become very hot and it could well be that the machine is protecting itself by automatically reducing the clock speed. I suspect something like this might be happening because an 'unthrottled' i7 can crunch FGRP4 tasks much faster than the times shown for those completed tasks. As an example picked at random from the top hosts list on this project, here are some completed results for an i7 930 which would be a bit slower than your 950. Notice how close the run and CPU times are. That host is also supporting 2 GPUs but is still managing to do FGRP4 tasks a lot faster than yours.
It's impossible for outsiders to really know why your machine seems to be struggling. Hopefully you can investigate and find the cause and then be able to enjoy a much better experience.
Cheers,
Gary.
Many thanks for the response
Many thanks for the response Gary. And yes, you are right about the CPU that the figures came from. However, I see the same happening with FGRP4 WUs on all of my machines, and none of them are running hot - the Q9650 is running at 68C and it is only running BOINC. The same is true for two of the remaining 3 systems, apart from web browsing, email and some document processing; the 4th is used for much more compute-intensive work so I discount the work that is done there. That said, most WUs on that system usually finish around the estimated time.
Also, if the problem was due to temperature and/or other workload I should see similar effects on work units in other projects but only E@H shows this large discrepancy between estimated and actual runtimes.
I saw a similar effect when I got some of the follow-up gravitational wave WUs. These also ran for much longer than the estimated time, but the initial gravitational wave WUs I had finished round about the estimated time.
There are a couple of FGRP4 tasks that are 22% and 15% complete, have been running for almost 6 hours and have 21 and 31 hours respectively left to run. I think I will just abort these.
I also noticed the discrepancy between CPU and run time - around 25% - and had a look at the results you linked to. I'm not sure what the speed difference between the two systems is, but the FGRP4 tasks there ran in around 8 hours, which was twice the estimated time for my system.
I will try to get through as many of the FGRP4 tasks as I can before I go on holiday at the end of this coming week; the rest will just have to be aborted. I was just interested to see if anybody had seen similar effects. Strangely enough, the GPU WUs for E@H all tend to run quicker than the estimated time, usually significantly quicker, maybe by a factor of 0.5.
They are very strange WUs - tasks started at the same time go up in step in elapsed time (nothing strange there!) but also do the same with time remaining, and they stop at the same progress percentage points for long periods (hours) before moving to the next point. Then there is the situation where, for each elapsed second, the remaining time increases by 5 to 10 seconds at a time. I only see this increase with some WUs - for the others, the remaining time must change only at each percentage point, as it doesn't change any time I have looked; I just see the change when the progress has moved from one percentage point to the next (and that doesn't change either as elapsed time marches on). As I said, very strange behaviour.
Thanks once again,
George
Combining CPU and GPU work on
Combining CPU and GPU work on the same machine can lead to estimates being a bit off, but not as much as you are reporting; something else must be affecting your systems. My system currently estimates FGRP4 tasks at about 6 hours but they take almost 13 hours; on the other hand, it overestimates my iGPU work at about 9.5 hours when those tasks run for only 3.5 hours.
If it was my machine I would run CPU-Z (or equivalent) to check that the CPU is running at full speed. If not, I would shut the system down and thoroughly clean it, e.g. blow out the dust bunnies. Then check with CPU-Z again.
After that I would check in Task Manager to see if anything besides the Einstein tasks is using a lot of CPU time; if so, examine what it is (google the process name) and probably try to remove it.
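For anyone who prefers a script for both checks, here is a rough sketch using the third-party psutil package (an assumption - CPU-Z and Task Manager show the same things); the 80% threshold and the two-second window are illustrative:

# Check the current clock against the rated maximum (possible throttling),
# then list the busiest processes over a short window.
import time
import psutil

freq = psutil.cpu_freq()
if freq and freq.max:
    print(f"current: {freq.current:.0f} MHz, rated max: {freq.max:.0f} MHz")
    if freq.current < 0.8 * freq.max:
        print("Running well below the rated clock - possible throttling or power saving.")

procs = list(psutil.process_iter(['name']))
for p in procs:
    try:
        p.cpu_percent(None)          # prime the per-process CPU counters
    except psutil.Error:
        pass
time.sleep(2.0)                      # measure over a two-second window
usage = []
for p in procs:
    try:
        usage.append((p.cpu_percent(None), p.info['name'] or '?'))
    except psutil.Error:
        pass
for pct, name in sorted(usage, reverse=True)[:10]:
    print(f"{pct:6.1f}%  {name}")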
RE: ... As I said, very
In your computing preferences, do you have the setting
Suspend work while computer is in use?
set to 'yes' or 'no'? Also, what about the setting
Leave tasks in memory while suspended?
What value do you have for the setting
Suspend work if CPU usage is above ... (0 means no restriction)?
I don't know if it's possible to create extra run time if keyboard use causes tasks in progress to be dropped from RAM. When restarted, such tasks would restart from a previous checkpoint. I don't know if both CPU time and elapsed time are stored in the checkpoint. If only CPU time was being stored and elapsed time was tied to the system clock, that would explain the big difference between the two.
You could get a similar effect if the third preference above wasn't set to zero. Any other process needing a short burst of CPU might be throwing all the FGRP4 jobs out of memory. The time between checkpoints is rather long with FGRP4 so you could be losing a lot.
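As a back-of-envelope illustration of how much that could cost - the checkpoint interval and eviction rate below are just guesses, not FGRP4 measurements:

checkpoint_interval_min = 30        # assumed time between FGRP4 checkpoints
evictions_per_hour = 2              # assumed rate at which the task is dropped from RAM
lost_per_eviction = checkpoint_interval_min / 2   # on average, half an interval is redone
wasted_per_hour = evictions_per_hour * lost_per_eviction
print(f"~{wasted_per_hour:.0f} of every 60 minutes spent redoing work")   # ~30

With 30 of every 60 minutes redone, the task only makes half the progress it should, so it needs roughly twice the wall-clock time to finish.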
Cheers,
Gary.
RE: I don't know if both
Yes, they are - well, to be exact, the checkpointed CPU time and elapsed time are both stored in boinc_task_state.xml.
If a task has to be restarted from checkpoint for whatever reason, both those values should be recovered and used as the new starting point.
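If you want to check for yourself, a minimal sketch along these lines would pull both values out of a slot directory - the element names and the active_task root are assumptions based on my client version, so adjust as needed:

import xml.etree.ElementTree as ET

def checkpoint_times(path):
    # root element is assumed to be <active_task>, with the two times as direct children
    root = ET.parse(path).getroot()
    cpu = root.findtext("checkpoint_cpu_time")
    elapsed = root.findtext("checkpoint_elapsed_time")
    return (float(cpu) if cpu else None,
            float(elapsed) if elapsed else None)

cpu, elapsed = checkpoint_times("slots/0/boinc_task_state.xml")
print(f"checkpointed CPU time: {cpu}s, elapsed time: {elapsed}s")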
RE: ... and are both
Thanks, Richard.
I've never really delved into slot directories. From your comment, I presume the .xml file is updated by BOINC when checkpoints are written so that the app doesn't record times in the checkpoint. If so, this explains something I see very occasionally but have not previously understood. On restarting a crashed computer, on rare occasions I have seen tasks restart with a significant % completed but zero elapsed time. I guess this means the .xml file was damaged by the crash so that times had to start from zero again.
Do you have any thoughts about why some supposedly lightly loaded hosts show large differences between CPU and run times for CPU tasks? Could it be due to a virus/malware or could it be a sign of overactive virus scanning that's bogging things down? I've seen a few cases like this over the years but can't remember ever seeing an entirely satisfactory explanation.
Cheers,
Gary.
RE: RE: ... and are
The host in question here is running AVG (a dll is reported in the aborted tasks), and AVG has had reports in several places of logging BOINC activity - at PrimeGrid, for example, if I recall correctly.
Perhaps George could check if anything is being reported in the AVG logs, at least to eliminate that possibility.
I'd like to thank everybody
I'd like to thank everybody who has replied and for the various suggestions made. I really appreciate the advice but, unless I'm not thinking this through correctly or am just being plain dumb, everything that has been suggested would affect WUs from all the projects and not just this one. I don't see any great differences between run times and CPU times for WUs from other projects, nor do I see any great variance in the actual run time against the initial estimate, and E@H is the only project where the estimated times get changed whenever a work unit completes with a longer run time than the current estimate.
The estimate does come down again if WUs take less time than the new estimate, but very slowly. Say the estimate was set at 20 hours and the next WU finishes on time; the new estimate might be 19 hours. There doesn't appear to be any weighting applied against a rogue WU that takes a long time when most WUs are completing in around the originally estimated time.
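To illustrate the kind of behaviour I mean, here is a toy sketch of a correction factor that jumps straight up after an overrun but only eases back down by a fixed fraction per completed task - this is not BOINC's actual code, and the 10% relaxation step is just an assumption:

def update_correction(current, actual, estimated):
    ratio = actual / estimated
    if ratio > current:
        return ratio                            # jump up to cover the overrun
    return current - 0.1 * (current - ratio)    # ease back down slowly

factor = 1.0
factor = update_correction(factor, actual=40.0, estimated=4.0)   # one 10x overrun
print(factor)                                                    # 10.0
for _ in range(20):                     # twenty tasks that finish on estimate
    factor = update_correction(factor, actual=4.0, estimated=4.0)
print(round(factor, 2))                 # still ~2.1 - estimates stay inflated

With behaviour like that, one rogue WU dominates the estimate for a long time.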
This modification of the estimated time must have some impact on the work manager and on requesting new WUs. Since this issue appeared, I have noticed that all of my projects are getting fewer WUs because the work manager is not requesting them, usually with the "work queue full" message. I can only ascribe this to the massive increase in estimated time that I reported.
If I am wrong in my analysis, please correct me but I would expect to see the same symptoms with all my projects.
George
Yes, I do have AVG, but all I have seen is false reporting of (surprise, surprise) E@H GPU WUs as viruses. This happened about a month ago for roughly every 5th GPU WU, but I guess AVG fixed it as I haven't seen it since.
RE: ... everything that has
This assumes all projects behave the same on all hosts ... they don't.
I looked a bit deeper at your oldest host; these Q9650 CPUs are getting long in the tooth.
It has recently generated invalids and errors at E@H and also at Asteroids@Home. Fix these issues and the "rogue WUs" may disappear.
For example this task http://einsteinathome.org/task/521943638 indicates some registry issue.
I think this host is struggling to do E@H work and you are seeing strange behaviour - when a system is not running well, expect craziness!
This suggests, to me, that AVG is monitoring BOINC and E@H work, and if you haven't excluded the BOINC data directory from AVG's scope, perhaps you might consider doing so. Whether that is the cause of your invalids and errors, we wait with bated breath.
Good luck.
Many thanks for the info and
Many thanks for the info and I'll take it on board. I agree about the age of this system - like me it is a bit long in the tooth, but I am planning to upgrade another system that is not quite as old and transfer the motherboard from that system to this one. I can try setting AVG to avoid checking the WUs while they are downloading, although I'm doubtful this would make much difference. I can also look at my AVG settings for its daily system scan, which does not take too long (the last complete scan took about 50 minutes, and my scans are set to start at 2:00AM, so it wasn't running when I noticed the problems I reported).
I'll live with what I have just now and hopefully get rid of some of the issues when I replace the motherboard and CPU. But it still does not explain why I see the same symptoms on all of the 4 machines I have running various BOINC projects. One of these is less than 9 months old but I'll wait and see what happens when I upgrade my system with new motherboard and CPU.
I agree that this system is struggling - it does not really have enough memory for a start, but I wasn't using it for anything else, so I thought I'd put it to good use. Hopefully the replacement motherboard and CPU will give it a new lease of life.
Thanks again,
George