Information about the new S5 workunits

Bikeman (Heinz-Bernd Eggenstein)
Moderator
Joined: 28 Aug 06
Posts: 3522
Credit: 745336373
RAC: 1053206

Message 37847 in response to message 37846

Quote:

You're right, it doesn't affect these cpus, and I don't know at the moment whether Athlon XPs are also affected by the performance gap.

Athlon XPs fall into the same category as Intel Pentium IIIs with respect to this problem: yes, the Win version will run significantly slower (probably roughly 30%) than the Linux app. The reason is that the non-SSE2 variant of the "modf" function used by the math lib in the Win app is very slow indeed. And no, the experimental fix mentioned above won't help, because Athlon XPs (AFAIK) don't support SSE2.
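
For readers who haven't met it: modf splits a floating-point number into its fractional and integer parts. The snippet below is only an illustration using Python's standard `math.modf`; it is not the C math lib routine the app links against, and the helper name `split_parts` and the sample values are made up for the example.

```python
import math


def split_parts(x):
    """Return (fractional part, integer part) of x; both carry the sign of x."""
    frac, whole = math.modf(x)
    return frac, whole


# Sample values chosen so the fractional parts are exact in binary.
print(split_parts(1234.75))  # (0.75, 1234.0)
print(split_parts(-2.5))     # (-0.5, -2.0)
```

If a routine like this is called very often, a slow implementation of it can plausibly account for a large share of the run time, which is the effect described above.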

CU

BRM

Annika
Joined: 8 Aug 06
Posts: 720
Credit: 494410
RAC: 0

Yeah I was talking about my Venice box. I haven't done anything about the app on my Core so far, mainly because I don't have a clue what's wrong there. But if it's an individual issue it doesn't really matter; I don't mind running Linux (was more or less planning to anyway, just hadn't gotten around to it).
I can't say anything about my AMD's performance after our "quick-tuning" yet, since the WU hasn't even been sent to anyone else yet and I have no idea what it's worth (it's a 400 MHz, does that tell you sth?). The WU is at 36.6% in about 8.5 hours, which hints at a completion time of around 24 hours or sth.
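
The completion-time guess above can be reproduced with a simple linear extrapolation. This is a rough sketch that assumes progress is proportional to elapsed time, which may not hold for every workunit; the function name `estimate_total_hours` is made up for the example.

```python
def estimate_total_hours(elapsed_hours, fraction_done):
    """Extrapolate total run time, assuming progress is linear in time."""
    if not 0 < fraction_done <= 1:
        raise ValueError("fraction_done must be in (0, 1]")
    return elapsed_hours / fraction_done


elapsed = 8.5   # hours crunched so far (from the post)
done = 0.366    # 36.6% progress (from the post)

total = estimate_total_hours(elapsed, done)
print(f"{total:.1f} h total, {total - elapsed:.1f} h to go")
# 23.2 h total, 14.7 h to go
```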

Bikeman (Heinz-Bernd Eggenstein)
Moderator
Joined: 28 Aug 06
Posts: 3522
Credit: 745336373
RAC: 1053206

Message 37849 in response to message 37848

Quote:
Yeah I was talking about my Venice box. I haven't done anything about the app on my Core so far, mainly because I don't have a clue what's wrong there. But if it's an individual issue it doesn't really matter; I don't mind running Linux (was more or less planning to anyway, just hadn't gotten around to it).
I can't say anything about my AMD's performance after our "quick-tuning" yet, since the WU hasn't even been sent to anyone else yet and I have no idea what it's worth (it's a 400 MHz, does that tell you sth?). The WU is at 36.6% in about 8.5 hours, which hints at a completion time of around 24 hours or sth.

Good morning Annika!

The WU should be in the 300-350 credits range, I guess. The "fix" doesn't seem to level the playing field between the Windows and the Linux app completely, but it should narrow the gap.

It's 400 Hz btw, not MHz (it's somehow related to the spin frequency of the pulsars we are looking for, and a pulsar spinning a few hundred million times per sec would probably mean a Nobel Prize for its discoverer).

CU

BRM

Annika
Joined: 8 Aug 06
Posts: 720
Credit: 494410
RAC: 0

Okay, thanks for explaining. When I get home from Uni (around 7 pm) the WU should be more than half crunched, so I'll be able to get some fairly good estimates. Any idea how big the Win penalty for this kind of box usually is, so I have sth to compare to? I've seen everything from people on the board writing about a 20% difference all the way to a friend's Opteron which is a good 70% (!!!) faster under Linux.

Bikeman (Heinz-Bernd Eggenstein)
Moderator
Joined: 28 Aug 06
Posts: 3522
Credit: 745336373
RAC: 1053206

Message 37851 in response to message 37850

Quote:
Okay, thanks for explaining. When I get home from Uni (around 7 pm) the WU should be more than half crunched, so I'll be able to get some fairly good estimates. Any idea how big the Win penalty for this kind of box usually is, so I have sth to compare to? I've seen everything from people on the board writing about a 20% difference all the way to a friend's Opteron which is a good 70% (!!!) faster under Linux.

No idea about the penalty for your Venice. I guess Michael and his database of statistics will be helpful there, but he's probably still taking a well-deserved nap. Can hardly wait to see your results!!!

CU

BRM

M. Schmitt
Joined: 27 Jun 05
Posts: 478
Credit: 15872262
RAC: 0

Message 37852 in response to message 37851

Quote:
Quote:
Okay, thanks for explaining. When I get home from Uni (around 7 pm) the WU should be more than half crunched, so I'll be able to get some fairly good estimates. Any idea how big the Win penalty for this kind of box usually is, so I have sth to compare to? I've seen everything from people on the board writing about a 20% difference all the way to a friend's Opteron which is a good 70% (!!!) faster under Linux.

No idea about the penalty for your Venice. I guess Michael and his database of statistics will be helpful there, but he's probably still taking a well-deserved nap. Can hardly wait to see your results!!!

CU

BRM

Hi,

my first result is finished and uploaded. Some other members of our team have also patched their app and successfully finished WUs.

My c/h rose from ~14 to ~19!
This is eliminating the AMD/Win penalty. :-)

Some stats from my data:

[pre]
A64:    8.2 - 8.8 [c/(h·GHz)] Linux
A64 X2: 8.2 - 8.8 [c/(h·GHz)] Linux
A64:    4.6 - 5.2 [c/(h·GHz)] Windows
A64 X2: 4.7 - 5.2 [c/(h·GHz)] Windows
[/pre]
Because the Einstein app scales with CPU clock, the figures are given per GHz and there is no need to look at the different clock rates; cache size is also pretty uninteresting. The former S5R1 and S5R2 apps ran in L1 cache, and even the smaller cache of Intel cpus was big enough. My data surely include some overclocked hosts, which influence the results above, but I suppose they can be found in both OS groups.

My first result equals 7.25 [c/(h·GHz)].
I should add that I was running BOINC natively with only one Einstein app, without CPU affinity, alongside one VMware Linux cruncher dedicated to one core, which ended up with about 50% resource share for each task. Task Manager showed more than 105,000,000 page faults for the Einstein Win app, whereas VMware produced only 430,000 page faults after running for a couple of days. So running BOINC without another full-load process dedicated to one core might improve the speed even further. The example also shows, IMHO, that page faults don't really bother the app and do not dramatically reduce speed.
Once we have more results, we can draw conclusions about this.
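
To put the 7.25 figure next to the ranges in the table, here is a small sketch of the normalization and the comparison. The ranges and the patched result are taken from this post; the function names and the 300-credit / 20-hour / 2.0 GHz sample are hypothetical and only illustrate the unit.

```python
def credits_per_hour_per_ghz(credits, hours, clock_ghz):
    """Normalized throughput in c/(h·GHz)."""
    return credits / (hours * clock_ghz)


def share_of_range(value, ref_range):
    """Return value as a fraction of the upper and lower end of ref_range."""
    low, high = ref_range
    return value / high, value / low


LINUX_A64 = (8.2, 8.8)    # c/(h·GHz), from the table above
WINDOWS_A64 = (4.6, 5.2)  # c/(h·GHz), from the table above
PATCHED_WIN = 7.25        # c/(h·GHz), first patched result

# Hypothetical host: 300 credits in 20 hours at 2.0 GHz.
print(credits_per_hour_per_ghz(300, 20, 2.0))  # 7.5

lo, hi = share_of_range(PATCHED_WIN, LINUX_A64)
print(f"patched Windows result reaches {lo:.0%} to {hi:.0%} of the Linux range")
# patched Windows result reaches 82% to 88% of the Linux range
```

On these numbers, a single patched result closes most of the gap to Linux but not all of it; more results are needed before reading much into the remainder.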

The Intel Core cpus show really big differences in my data, and therefore it's impossible to get good stats without knowing the exact clock rate.

When will one of the developers make a statement about this ugly lib issue? ;-)

cu,
Michael

Bikeman (Heinz-Bernd Eggenstein)
Moderator
Joined: 28 Aug 06
Posts: 3522
Credit: 745336373
RAC: 1053206

Message 37853 in response to message 37852

Quote:
Quote:
Quote:
Okay, thanks for explaining. When I get home from Uni (around 7 pm) the WU should be more than half crunched, so I'll be able to get some fairly good estimates. Any idea how big the Win penalty for this kind of box usually is, so I have sth to compare to? I've seen everything from people on the board writing about a 20% difference all the way to a friend's Opteron which is a good 70% (!!!) faster under Linux.

No idea about the penalty for your Venice. I guess Michael and his database of statistics will be helpful there, but he's probably still taking a well-deserved nap. Can hardly wait to see your results!!!

CU

BRM

Hi,

my first result is finished and uploaded. Some other members of our team have also patched their app and successfully finished WUs.

My c/h rose from ~14 to ~19!
This is eliminating the AMD/Win penalty. :-)

Some stats from my data:

[pre]
A64:    8.2 - 8.8 [c/(h·GHz)] Linux
A64 X2: 8.2 - 8.8 [c/(h·GHz)] Linux
A64:    4.6 - 5.2 [c/(h·GHz)] Windows
A64 X2: 4.7 - 5.2 [c/(h·GHz)] Windows
[/pre]
Because the Einstein app scales with CPU clock, the figures are given per GHz and there is no need to look at the different clock rates; cache size is also pretty uninteresting. The former S5R1 and S5R2 apps ran in L1 cache, and even the smaller cache of Intel cpus was big enough. My data surely include some overclocked hosts, which influence the results above, but I suppose they can be found in both OS groups.

My first result equals 7.25 [c/(h·GHz)].
I should add that I was running BOINC natively with only one Einstein app, without CPU affinity, alongside one VMware Linux cruncher dedicated to one core, which ended up with about 50% resource share for each task. Task Manager showed more than 105,000,000 page faults for the Einstein Win app, whereas VMware produced only 430,000 page faults after running for a couple of days. So running BOINC without another full-load process dedicated to one core might improve the speed even further. The example also shows, IMHO, that page faults don't really bother the app and do not dramatically reduce speed.
Once we have more results, we can draw conclusions about this.

The Intel Core cpus show really big differences in my data, and therefore it's impossible to get good stats without knowing the exact clock rate.

When will one of the developers make a statement about this ugly lib issue? ;-)

cu,
Michael

Thanks for the stats, this looks really promising, doesn't it!!! I expected a 30 % rise in performance.

Unless you've already done so, I'll drop Bernd an email just in case he has missed the whole discussion.

CU

BRM

M. Schmitt
Joined: 27 Jun 05
Posts: 478
Credit: 15872262
RAC: 0

Message 37854 in response to message 37853

Quote:

Thanks for the stats, this looks really promising, doesn't it!!! I expected a 30 % rise in performance.

Unless you've already done so, I'll drop Bernd an email just in case he has missed the whole discussion.

CU

BRM

Yes, looks very good. :-)
I haven't mailed Bernd, so go ahead.

Btw, I don't think this patch harms any cpus that are not SSE2-capable. There must be another switch in the code to filter out Intel SSE1 and non-SSE cpus, and it will probably work on AMD too. It should be something like what is described on that web page about Intel compilers. But if there is some place where SSE1 instructions are used, this might accelerate AMD Athlon XPs too.
But this is just a guess.

cu
Michael

Bikeman (Heinz-Bernd Eggenstein)
Moderator
Joined: 28 Aug 06
Posts: 3522
Credit: 745336373
RAC: 1053206

Message 37855 in response to message 37854

Quote:

Yes, looks very good. :-)
I haven't mailed Bernd, so go ahead.

Btw, I don't think this patch harms any cpus that are not SSE2-capable. There must be another switch in the code to filter out Intel SSE1 and non-SSE cpus, and it will probably work on AMD too. It should be something like what is described on that web page about Intel compilers. But if there is some place where SSE1 instructions are used, this might accelerate AMD Athlon XPs too.
But this is just a guess.

cu
Michael

Yes, the detection mechanism seems to me to be as described in the article: first, read the feature bits to check for SSE2; then check the vendor string and, if it is "AuthenticAMD", reset the results just obtained from CPUID to a bare minimum. Not a nice thing to do, IMHO.
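
The two-step check described above can be modelled in a few lines. This is a pure-Python sketch of the logic as described here, not the actual runtime code; it does not execute CPUID, and the names `CpuInfo`, `detected_sse2`, `pick_code_path` and the `vendor_check` switch are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CpuInfo:
    vendor: str     # CPUID vendor string
    has_sse2: bool  # SSE2 feature bit as reported by the CPU


def detected_sse2(cpu, vendor_check=True):
    """Step 1: take the feature bit. Step 2: optionally override it by vendor."""
    sse2 = cpu.has_sse2
    if vendor_check and cpu.vendor == "AuthenticAMD":
        sse2 = False  # the described "reset to a bare minimum"
    return sse2


def pick_code_path(cpu, vendor_check=True):
    """Choose between the fast SSE2 path and the slow generic path."""
    return "sse2" if detected_sse2(cpu, vendor_check) else "generic"


cpus = {
    "Pentium 4": CpuInfo("GenuineIntel", True),
    "Athlon 64": CpuInfo("AuthenticAMD", True),
    "Athlon XP": CpuInfo("AuthenticAMD", False),
    "Pentium III": CpuInfo("GenuineIntel", False),
}

for name, cpu in cpus.items():
    # Second column models the patched app with the vendor check disabled.
    print(name, pick_code_path(cpu), pick_code_path(cpu, vendor_check=False))
# Pentium 4 sse2 sse2
# Athlon 64 generic sse2
# Athlon XP generic generic
# Pentium III generic generic
```

In this model only the Athlon 64 row changes when the vendor check is switched off, which matches the expectation that CPUs without SSE2 gain nothing from the patch.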

I didn't see any SSE instructions, and I doubt very much that Athlon XPs or P IIIs will see any performance increase whatsoever from changing the CPU detection code. For those platforms to reach the same level of performance as under Linux/gcc, a better implementation of the modf function is needed.

In the meantime, I think it's a matter of courtesy to keep the number of modified clients to a minimum until Bernd OKs the change. Trying out the change was essential to verify our hypothesis, but let's wait for the official OK before everybody patches the app. If 1000 people patch and one of them makes a mistake, it can mess up quite a few results. As a software engineer, I'd prefer that the new version be formally tested, approved, and only then released with a new version number before it's widely used, so that any negative effects are traceable.

CU

BRM

Annika
Joined: 8 Aug 06
Posts: 720
Credit: 494410
RAC: 0

Hey guys, just a quick update. My WU has about 2 hours left; total crunching time should amount to between 20.5 and 21 hours. I still don't know the exact credit value, though. Btw, I'm getting a friend from Uni to check this with one or two of his AMD boxes, so we'll get some more results. Mailing Bernd is a great idea imo.
