What's the deal with this?
It's been tough enough to keep crunching for EAH through this Beta test on older hosts, given the generally 'tougher' computational load of the new WUs, the increased tightness factor caused by the slow performance of the app, and the deadline staying at two weeks.
The only saving grace was that you weren't sending out a trailer by default. This meant that if you sent a result which the host would miss the deadline on by 24 to 48 hours or so (even though it was telling you it couldn't make the deadline), it still had an even money chance of getting credit for the 1.2 MSecs. plus of time it spent on the result.
As it stands now, this is no longer true and in fact means I might as well detach anything that isn't a P4 or higher from the project, since the odds are they will just be wasting their time and my money running EAH at this point.
I propose that increasing the deadline would be a more appropriate way to keep folks crunching EAH, since the most probable event which would cause them to drop out is when they start to perceive that EAH is 'hogging' the machine.
The simple truth is a tight deadline project always seems to hog the machine, especially when it can take 3 or more days to run a result.
Alinator
Initial Replication Is Now 3??
This is an instance of the "cross platform validation problem". Initial replication is still 2, so the WU was sent to two hosts. The hosts did not agree sufficiently in their results, so a third result was sent out. The third one failed to meet the deadline, so the WU was sent again to another host. Sometimes this takes days; in this case it took only a couple of minutes.
Once the cross-platform problem and other stability problems are sorted out one way or another, developers will move on to speed up the code considerably (test versions of the optimized app have already been spotted :-) ...). Then the deadline will be easier to meet.
CU
BRM
RE: This is an instance
Alinator is right, initial replication is 3...
Michael
Team Linux Users Everywhere
RE: This is an instance
Yes, I'm aware of how re-issuing works in cases of a host failure or the 'infamous' cross platform "C,BNC" effect. ;-)
However, the IR is a parameter set at 'split time' by the project side configuration and is not usually affected by re-issuing. IOW, if the IR is 2, then it will be 2 for that WU regardless of the number of re-issues needed to complete. So this WU is somewhat unusual in that regard, perhaps because there were no longer any other eligible hosts with that datapack set and so the project had to generate 'new' work in essence to be able to re-issue.
Regarding the deadline: even if the team can double the speed of the app, EAH will still be a tight deadline project for a large spectrum of hosts, depending on their particular configuration. As I said, a tight deadline project will always have an effect which most people interpret as the project 'hogging' the machine and BOINC ignoring their preferences.
I have no ironclad hard data on this, but the tone many people take when this issue comes up leads me to believe this is a major factor when people drop a project from their list.
Alinator
RE: However, the IR is a
The wording "Initial replication" would suggest so, but I can see several times a week that one of my workunits starts at IR2 and then later increases to IR 3 (or even IR 4).
CU
BRM
RE: RE: However, the IR
I know for a fact that Bikeman is correct on this and that the IR does increase during the life of a WU just as he has described. Here is my assessment of what has actually happened in this particular (and very interesting!!) case:-
*2nd result failed with client error 12 hours later & was replaced on June 12 - IR still =2
*1st & 3rd results completed on June 13 but "no consensus". 4th result issued June 14 - IR now =3
*4th result passed deadline on June 28 and 5th result issued just 4 mins later
*IR is still =3 as 4th result is considered "dead" and so there are still just 3 "live" results in circulation
*It's still possible for 4th result to resurrect itself - if so IR would change to 4
I've seen many examples of this type of behaviour. The interesting thing is that it's all mixed together in the one WU here which makes it more difficult to see exactly when the IR value changes. IR doesn't change when a replacement is issued for a blown deadline. IR does increment when a "decider" is sent out for a "no consensus" deadlock.
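To make the two rules above concrete, here is a minimal sketch in Python. It is only a toy model of the behaviour observed on the WU pages, not the BOINC server's actual logic; the class and method names (`Workunit`, `replace_failed`, `send_decider`, `late_result_returns`) are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Workunit:
    """Toy model of the observed IR behaviour (hypothetical names throughout)."""
    replication: int = 2  # the "IR" value as displayed for the WU
    issued: int = 2       # total number of results sent out so far

    def replace_failed(self) -> None:
        # A result errored out or blew its deadline: a replacement is
        # issued, but the displayed IR stays the same.
        self.issued += 1

    def send_decider(self) -> None:
        # The returned results gave "no consensus": an extra "decider"
        # is issued and the displayed IR goes up by one.
        self.issued += 1
        self.replication += 1

    def late_result_returns(self) -> None:
        # A result that was written off comes back after all, so there
        # is one more "live" result than before.
        self.replication += 1


# Replay of the WU discussed in this thread.
wu = Workunit()       # results 1 and 2 issued, IR = 2
wu.replace_failed()   # result 2: client error, result 3 issued, IR = 2
wu.send_decider()     # results 1 and 3: no consensus, result 4 issued, IR = 3
wu.replace_failed()   # result 4: deadline passed, result 5 issued, IR = 3
print(wu.replication, wu.issued)
```

Replaying the events this way ends with five results issued and an IR of 3, which matches what the WU page shows.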
Cheers,
Gary.
RE: Regarding the
Whilst there really isn't anything sinister happening with IR, I do agree with Alinator that there is a bit of a problem with unrealistic deadlines particularly for people with older machines or people wanting to support multiple projects. Just over two months ago, at the end of the previous run, there were 75K active hosts. There are now only 58K active hosts.
People are voting with their feet for a combination of reasons. High on the list are (i) long crunch times, (ii) client errors, and (iii) validation failures. All these problems contribute to a general feeling of wasted effort. It's very disheartening to see the waste with missed deadlines and validation problems.
Cheers,
Gary.
RE: People are voting with
Now that I've seen the numbers about people that have left, I don't feel as bad for sticking it out for a month after the change over to this run before going somewhere else to crunch. I hope that things get fixed and put on a more equitable footing for the next run so I can bring my systems back in the mix.
Arion
RE: People are voting with
I'm hoping they get things straightened out soon. Einstein is my "project of choice", but I've pulled 8 of 9 cores I had running it because of the problems. It isn't worth paying the electric bill for 24 hours' worth of wasted time that won't validate.
RE: I'm hoping they get
I think quite a few people share this sentiment but I'm also hoping that not too many take the final step of going elsewhere when there are signs that the situation may be resolved soon. I see that Bernd has mentioned some gains and that he plans to make the current Windows beta app into the official app shortly.
In my case, I've actually made changes that have increased my involvement in EAH and also increased my exposure to the validation problem. I've converted virtually all my AMD boxes to Linux which has given me a speedup of around 40-45% compared to Windows. I've converted a significant proportion of PIII boxes to Linux as well which has resulted in a 25-30% speedup for those. I've left all P4 boxes on Windows as they showed no improvement under Linux. I've added about 15-20 boxes that were not crunching 2 months ago. Most of those were Linux installs.
The thing that is most frustrating is that the 30%+ gain in crunching efficiency is being eroded significantly by validation failures which kill off about half the gain (in my experience anyway). I know it is important to fix bugs and problems that cause crashes in the app, but I also wish that something could be done about the validation issue.
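As a rough back-of-the-envelope check, the figures above can be turned into an implied validation failure rate. This is only a sketch under stated assumptions: a failed result earns no credit but costs the same crunch time as a good one, and "about half the gain" is read as a 30% raw speedup shrinking to a 15% net gain. The function name `implied_failure_rate` is made up for illustration.

```python
def implied_failure_rate(raw_gain: float, net_gain: float) -> float:
    """Fraction of results that must fail validation to cut raw_gain to net_gain.

    Assumes useful throughput = (1 + raw_gain) * (1 - failure_rate),
    i.e. failed results cost full crunch time and earn nothing.
    """
    return 1.0 - (1.0 + net_gain) / (1.0 + raw_gain)


# Assumed reading of the numbers in this post: 30% raw gain, 15% net gain.
rate = implied_failure_rate(raw_gain=0.30, net_gain=0.15)
print(f"{rate:.1%}")  # prints 11.5%
```

Under those assumptions, somewhere around one result in nine failing to validate would be enough to wipe out half of a 30% speedup.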
When talking about bugs in the science app, I've noticed a couple of times on different machines over the last few weeks, a behaviour where the app appears to be running but progress is completely stalled. As BOINC is supposed to be "set and forget" this behaviour is only noticed when looking at machines at the bottom of the list and wondering why they haven't reported in a week or three :). Quite often deadlines have passed even before the problem is noticed. In each case, stopping and restarting the BOINC service kicks the science app back into life and the result is eventually completed and even validated. I'm pretty sure I've seen this with both Windows and Linux boxes.
Cheers,
Gary.
4.24 is now the official
4.24 is now the official Windows app, and has been for somewhere between 5 and 8 hours now.