Initial Replication Is Now 3??

Alinator
Joined: 8 May 05
Posts: 927
Credit: 9,352,143
RAC: 0
Topic 192915

What's the deal with this?

Note the IR

It's been tough enough to keep crunching for EAH through this Beta test on older hosts, given the generally 'tougher' computational load of the new WUs and the increased tightness factor that comes from the slow performance of the app combined with keeping the deadline at two weeks.

The only saving grace was that you weren't sending out a trailer by default. This meant that if you sent a result which the host would miss the deadline on by 24 to 48 hours or so (even though it was telling you it couldn't make the deadline), it still had an even-money chance of getting credit for the 1.2 Msec plus of time it spent on the result.

As it stands now, this is no longer true and in fact means I might as well detach anything that isn't a P4 or higher from the project, since the odds are they will just be wasting their time and my money running EAH at this point.

I propose that increasing the deadline would be a more appropriate way to keep folks crunching EAH, since the most probable event which would cause them to drop out is when they start to perceive that EAH is 'hogging' the machine.

The simple truth is a tight deadline project always seems to hog the machine, especially when it can take 3 or more days to run a result.

Alinator

Bikeman (Heinz-Bernd Eggenstein)
Moderator
Joined: 28 Aug 06
Posts: 3,516
Credit: 485,684,516
RAC: 1,443

Initial Replication Is Now 3??

Quote:

What's the deal with this?

Note the IR

It's been tough enough to keep crunching for EAH through this Beta test on older hosts, given the generally 'tougher' computational load of the new WUs and the increased tightness factor that comes from the slow performance of the app combined with keeping the deadline at two weeks.

The only saving grace was that you weren't sending out a trailer by default. This meant that if you sent a result which the host would miss the deadline on by 24 to 48 hours or so (even though it was telling you it couldn't make the deadline), it still had an even-money chance of getting credit for the 1.2 Msec plus of time it spent on the result.

As it stands now, this is no longer true and in fact means I might as well detach anything that isn't a P4 or higher from the project, since the odds are they will just be wasting their time and my money running EAH at this point.

I propose that increasing the deadline would be a more appropriate way to keep folks crunching EAH, since the most probable event which would cause them to drop out is when they start to perceive that EAH is 'hogging' the machine.

The simple truth is a tight deadline project always seems to hog the machine, especially when it can take 3 or more days to run a result.

Alinator

This is an instance of the "cross platform validation problem". Initial replication is still 2, so two copies of the WU were sent. The hosts did not agree sufficiently in their results, so a third was sent out. The third one failed to meet the deadline, so the WU was sent again to another host. Sometimes this takes days; in this case it took only a couple of minutes.

Once the cross-platform problem and other stability problems are sorted out one way or another, developers will move on to speed up the code considerably (test versions of the optimized app have already been spotted :-) ...). Then the deadline will be easier to meet.

CU

BRM

Michael Karlinsky
Joined: 22 Jan 05
Posts: 888
Credit: 23,502,182
RAC: 139

RE: This is an instance

Message 69373 in response to message 69372

Quote:

This is an instance of the "cross platform validation problem". Initial replication is still 2, so two copies of the WU were sent.

Alinator is right, initial replication is 3...

Michael

Alinator
Joined: 8 May 05
Posts: 927
Credit: 9,352,143
RAC: 0

RE: This is an instance

Message 69374 in response to message 69372

Quote:

This is an instance of the "cross platform validation problem". Initial replication is still 2, so two copies of the WU were sent. The hosts did not agree sufficiently in their results, so a third was sent out. The third one failed to meet the deadline, so the WU was sent again to another host. Sometimes this takes days; in this case it took only a couple of minutes.

Once the cross-platform problem and other stability problems are sorted out one way or another, developers will move on to speed up the code considerably (test versions of the optimized app have already been spotted :-) ...). Then the deadline will be easier to meet.

CU

BRM

Yes, I'm aware of how re-issuing works in cases of a host failure or the 'infamous' cross platform "C,BNC" effect. ;-)

However, the IR is a parameter set at 'split time' by the project-side configuration and is not usually affected by re-issuing. IOW, if the IR is 2, then it will be 2 for that WU regardless of the number of re-issues needed to complete it. So this WU is somewhat unusual in that regard, perhaps because there were no longer any other eligible hosts with that datapack set, and so the project had, in essence, to generate 'new' work to be able to re-issue.

Regarding the deadline: even if the team can double the speed of the app, EAH will still be a tight deadline project for a large spectrum of hosts, depending on their particular configuration. As I said, a tight deadline project will always have an effect which most people interpret as the project 'hogging' the machine and BOINC ignoring their preferences.

I have no ironclad hard data on this, but the tone many people take when this issue comes up leads me to believe this is a major factor when people drop a project from their list.

Alinator

Bikeman (Heinz-Bernd Eggenstein)
Moderator
Joined: 28 Aug 06
Posts: 3,516
Credit: 485,684,516
RAC: 1,443

RE: However, the IR is a

Message 69375 in response to message 69374

Quote:


However, the IR is a parameter set at 'split time' by the project-side configuration and is not usually affected by re-issuing. IOW, if the IR is 2, then it will be 2 for that WU regardless of the number of re-issues needed to complete it. So this WU is somewhat unusual in that regard, perhaps because there were no longer any other eligible hosts with that datapack set, and so the project had, in essence, to generate 'new' work to be able to re-issue.

The wording "Initial replication" would suggest so, but several times a week I see one of my workunits start at IR 2 and then later increase to IR 3 (or even IR 4).

CU

BRM

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5,551
Credit: 79,000,031,574
RAC: 64,955,517

RE: RE: However, the IR

Message 69376 in response to message 69375

Quote:
Quote:

However, the IR is a parameter set at 'split time' by the project side configuration and is not usually affected by re-issuing. IOW, if the IR is 2, then it will be 2 for that WU regardless of the number of re-issues needed to complete....

The wording "Initial replication" would suggest so, but several times a week I see one of my workunits start at IR 2 and then later increase to IR 3 (or even IR 4).

I know for a fact that Bikeman is correct on this and that the IR does increase during the life of a WU just as he has described. Here is my assessment of what has actually happened in this particular (and very interesting!!) case:-

  • Initial two results issued on June 11 - IR=2
  • 2nd result failed with client error 12 hours later & was replaced on June 12 - IR still =2
  • 1st & 3rd results completed on June 13 but "no consensus". 4th result issued June 14 - IR now =3
  • 4th result passed deadline on June 28 and 5th result issued just 4 mins later
  • IR is still =3 as 4th result is considered "dead" and so there are still just 3 "live" results in circulation
  • It's still possible for 4th result to resurrect itself - if so IR would change to 4

I've seen many examples of this type of behaviour. The interesting thing is that it's all mixed together in the one WU here which makes it more difficult to see exactly when the IR value changes. IR doesn't change when a replacement is issued for a blown deadline. IR does increment when a "decider" is sent out for a "no consensus" deadlock.
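
A rough sketch of the rule described above may make it easier to follow. It only models the behaviour observed in this thread; it is not BOINC server code, and the event names and the `replay_ir` function are invented for illustration.

```python
# Illustrative model only: the event names and this function are hypothetical
# and are not taken from the BOINC server code.
BUMPS_IR = {"no_consensus", "late_return"}      # a decider is sent, or a "dead" result reports after all
KEEPS_IR = {"client_error", "missed_deadline"}  # a failed or timed-out result is simply replaced


def replay_ir(events, initial_replication=2):
    """Return (event, displayed IR) pairs for a sequence of WU events."""
    ir = initial_replication
    history = []
    for event in events:
        if event in BUMPS_IR:
            ir += 1
        elif event not in KEEPS_IR:
            raise ValueError(f"unknown event: {event}")
        history.append((event, ir))
    return history


# The timeline of the WU discussed in this thread, in order.
timeline = ["client_error", "no_consensus", "missed_deadline"]
for event, ir in replay_ir(timeline):
    print(f"{event:16s} IR = {ir}")
```

Appending a `"late_return"` event to the timeline models the case where the 4th result resurrects itself, which would take the displayed IR to 4.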

Cheers,
Gary.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5,551
Credit: 79,000,031,574
RAC: 64,955,517

RE: Regarding the

Message 69377 in response to message 69374

Quote:

Regarding the deadline: even if the team can double the speed of the app, EAH will still be a tight deadline project for a large spectrum of hosts, depending on their particular configuration. As I said, a tight deadline project will always have an effect which most people interpret as the project 'hogging' the machine and BOINC ignoring their preferences.

Whilst there really isn't anything sinister happening with IR, I do agree with Alinator that there is a bit of a problem with unrealistic deadlines particularly for people with older machines or people wanting to support multiple projects. Just over two months ago, at the end of the previous run, there were 75K active hosts. There are now only 58K active hosts.
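
To put the two host counts quoted above in proportion, here is a small arithmetic sketch. The figures are the rounded 75K and 58K from the post; the function name is made up for the example.

```python
def fractional_drop(before, after):
    """Fraction of the earlier total that has been lost."""
    return (before - after) / before


hosts_before = 75_000  # active hosts at the end of the previous run (rounded)
hosts_after = 58_000   # active hosts now (rounded)

lost = hosts_before - hosts_after
share = fractional_drop(hosts_before, hosts_after)
print(f"{lost} hosts lost, {share:.1%} of the earlier total")
```

With these rounded inputs, that is 17,000 hosts, or a little over a fifth of the earlier total, in just over two months.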

People are voting with their feet for a combination of reasons. High on the list are (i) long crunch times, (ii) client errors, and (iii) validation failures. All these problems contribute to a general feeling of wasted effort. It's very disheartening to see the waste with missed deadlines and validation problems.

Cheers,
Gary.

Arion
Joined: 20 Mar 05
Posts: 147
Credit: 1,626,747
RAC: 0

RE: People are voting with

Message 69378 in response to message 69377

Quote:
People are voting with their feet for a combination of reasons. High on the list are (i) long crunch times, (ii) client errors, and (iii) validation failures. All these problems contribute to a general feeling of wasted effort. It's very disheartening to see the waste with missed deadlines and validation problems.

Now that I've seen the numbers on people who have left, I don't feel as bad about sticking it out for a month after the changeover to this run before going somewhere else to crunch. I hope that things get fixed and put on a more equitable footing for the next run so I can bring my systems back into the mix.

Arion

ohiomike
Joined: 4 Nov 06
Posts: 80
Credit: 6,453,639
RAC: 0

RE: People are voting with

Message 69379 in response to message 69378

Quote:
People are voting with their feet for a combination of reasons. High on the list are (i) long crunch times, (ii) client errors, and (iii) validation failures. All these problems contribute to a general feeling of wasted effort. It's very disheartening to see the waste with missed deadlines and validation problems.

I'm hoping they get things straightened out soon. Einstein is my "project of choice", but I've pulled 8 of the 9 cores I had running it because of the problems. It isn't worth paying the electric bill for 24 hours' worth of wasted time that won't validate.


Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5,551
Credit: 79,000,031,574
RAC: 64,955,517

RE: I'm hoping they get

Message 69380 in response to message 69379

Quote:

I'm hoping they get things straightened out soon. Einstein is my "project of choice", but I've pulled 8 of the 9 cores I had running it because of the problems. It isn't worth paying the electric bill for 24 hours' worth of wasted time that won't validate.

I think quite a few people share this sentiment but I'm also hoping that not too many take the final step of going elsewhere when there are signs that the situation may be resolved soon. I see that Bernd has mentioned some gains and that he plans to make the current Windows beta app into the official app shortly.

In my case, I've actually made changes that have increased my involvement in EAH and also increased my exposure to the validation problem. I've converted virtually all my AMD boxes to Linux which has given me a speedup of around 40-45% compared to Windows. I've converted a significant proportion of PIII boxes to Linux as well which has resulted in a 25-30% speedup for those. I've left all P4 boxes on Windows as they showed no improvement under Linux. I've added about 15-20 boxes that were not crunching 2 months ago. Most of those were Linux installs.

The thing that is most frustrating is that the 30%+ gain in crunching efficiency is being eroded significantly by validation failures which kill off about half the gain (in my experience anyway). I know it is important to fix bugs and problems that cause crashes in the app, but I also wish that something could be done about the validation issue.
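
As a back-of-the-envelope illustration of how validation failures can eat into a speedup, the sketch below uses a deliberately simple model. Its assumptions are not from the post: useful throughput is taken to be proportional to validated results only, the baseline is taken to have no validation failures, and the function names are hypothetical.

```python
def effective_throughput(speedup, invalid_fraction):
    """Validated work per unit time, relative to a baseline of 1.0."""
    return speedup * (1.0 - invalid_fraction)


def invalid_fraction_for(speedup, retained_gain_share):
    """Invalid fraction that leaves only the given share of the raw gain."""
    target = 1.0 + (speedup - 1.0) * retained_gain_share
    return 1.0 - target / speedup


speedup = 1.30  # the "30%+" gain, taken here as exactly 30%
invalid = invalid_fraction_for(speedup, retained_gain_share=0.5)
print(f"invalid fraction: {invalid:.1%}")
print(f"effective throughput: {effective_throughput(speedup, invalid):.2f}x")
```

Under this model, losing roughly one result in nine to validation failures is enough to halve a 30% gain.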

Speaking of bugs in the science app, a couple of times over the last few weeks, on different machines, I've noticed a behaviour where the app appears to be running but progress is completely stalled. As BOINC is supposed to be "set and forget", this behaviour is only noticed when looking at machines at the bottom of the list and wondering why they haven't reported in a week or three :). Quite often deadlines have passed before the problem is even noticed. In each case, stopping and restarting the BOINC service kicks the science app back into life and the result is eventually completed and even validated. I'm pretty sure I've seen this with both Windows and Linux boxes.

Cheers,
Gary.

DanNeely
Joined: 4 Sep 05
Posts: 1,359
Credit: 2,926,549,031
RAC: 2,949,048

4.24 is now the official

4.24 is now the official Windows app, and has been for somewhere between 5 and 8 hours.
