Poor measured performance on Linux vs. Windows (bad compiler choices/options)

Wurgl (speak^Wcrunching for Special: Off-Topic)
Joined: 11 Feb 05
Posts: 321
Credit: 140550008
RAC: 0

Message 8762 in response to message 8761

> > According to my sources, the Intel compiler does not create faster
> > binaries on Linux, and neither does gcc 4.0.
>
> In many cases, the Intel compiler indeed does create faster running code. I've
> used it on a few occasions, and most things tend to run noticeably faster. The
> only reason I don't use it exclusively is that I prefer to support the GNU
> guys.

It would be nice if you compared times with a small loop containing almost only floating-point operations.

A loop similar to this one (all the rest, including tempFreq1, is double; OOTWOPI is a #define; k and klim are int):

for(k = 0; k < klim; k++) {
    imXinv += Xalpha_k->im * xinv;
    reXinv += Xalpha_k->re * xinv;
    Xalpha_k++;
}
It does not make sense to send you the whole program, because I cannot provide you with a test environment; it is just an example of the kind of loop.
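
If you want to try it, a small stand-alone harness along these lines could be built with both compilers and timed. Everything here (the COMPLEX8 stand-in type, the array length, the repeat count) is made up for illustration; it is not the real E@H code:

#include <stdio.h>
#include <time.h>

typedef struct { double re, im; } COMPLEX8;   /* stand-in, invented here */

#define N    1024       /* array length, made up for the test              */
#define REPS 100000     /* repetitions; adjust so a run lasts a few seconds */

int main(void)
{
    static COMPLEX8 Xalpha[N];
    double reXinv = 0.0, imXinv = 0.0, xinv = 1.0 / 7.0;
    int k, r;
    clock_t t0, t1;

    for (k = 0; k < N; k++) {            /* arbitrary test data */
        Xalpha[k].re = (double)k;
        Xalpha[k].im = (double)(N - k);
    }

    t0 = clock();
    for (r = 0; r < REPS; r++) {
        const COMPLEX8 *Xalpha_k = Xalpha;
        for (k = 0; k < N; k++) {        /* the loop under test */
            imXinv += Xalpha_k->im * xinv;
            reXinv += Xalpha_k->re * xinv;
            Xalpha_k++;
        }
    }
    t1 = clock();

    /* print the sums so the compiler cannot optimize the loop away */
    printf("re=%g im=%g  %.2f s\n", reXinv, imXinv,
           (double)(t1 - t0) / CLOCKS_PER_SEC);
    return 0;
}

Build it once per compiler, e.g. "gcc -O3 timing.c -o t_gcc" vs. "icc -O3 timing.c -o t_icc", and compare the printed times.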

> As much as some (myself included) tend to knock gcc, they've really made some
> improvements over the years - particularly where C++ is concerned. I just
> thought that needed to be said.

Yep, they really did!

> At any rate, I hope a solution can be found that doesn't cause too much
> inconvenience. Tis a great project and I really hated yanking my boxen off of
> it....But not as much as I hated the thought of fiddle-farting around with
> wine on 26 (oops..31) boxes. :)

Agreed! Wine is a quadruple no-no on my machines!

Let me see: I now know where and what to change. Hopefully it will work soon and break nothing else.

josep
Joined: 9 Mar 05
Posts: 63
Credit: 1156542
RAC: 0

Message 8763 in response to message 8762

Wurgl, take a look at:

http://pandora.aei.mpg.de/merlin/2:History/2.1:Pre-delivery/2.1.2:Benchmarks.html

There are some benchmark tests, compiled for several platforms with different compilers and optimizations. These benchmarks were used during the design of the Merlin cluster.

And the floating-point benchmark is none other than LIGO's LALDemod program, which, as far as I know, is very similar to the code used by E@H (perhaps the same code you are analyzing).

The web form for returning results no longer seems to be working, but you can still run the tests and measure the completion times yourself.

I have done this on my Athlon XP 2600+, and the test compiled with ICC (for Linux) for Pentium 3 is noticeably faster than GCC's best optimized version for Athlon: it is 1.47 times faster (53.2 s vs. 36.2 s, nearly a 50% speed increase).

So, I suppose the E@H app compiled with ICC for Linux really will be faster.

My completion times are (measured several times):

client: demod-linux-gcc-3.0-athlon-highoptim-1.0.tar.bz2
completion time: 53.2 sec.

client: demod-linux-icc-p3-highoptim-1.0.tar.bz2
completion time: 36.2 sec.

This is for an Athlon XP 2600+ (model 8 "Thoroughbred", 2078 MHz).

The drawback is that this ICC build for Pentium 3 does not work on older Athlons. I have a 1333 MHz Athlon Thunderbird that is successfully running E@H with the native Linux client, but the benchmark compiled with ICC for Pentium 3 fails on it.

Wurgl (speak^Wcrunching for Special: Off-Topic)
Joined: 11 Feb 05
Posts: 321
Credit: 140550008
RAC: 0

Message 8764 in response to message 8763


> The drawback is that this ICC build for Pentium 3 does not work on older Athlons. I have a 1333 MHz Athlon Thunderbird that is successfully running E@H with the native Linux client, but the benchmark compiled with ICC for Pentium 3 fails on it.

And this is the problem the guys have. When you compile an application to use a processor-specific instruction set, it can be expected to run faster on that processor, but it will no longer run on some or many others.

Now, you can certainly link a binary that contains several versions of the same function, differing only in how they are optimized. In theory this can be done.

But in reality this opens a whole new can of bugs. The detection of the CPU and its version, subversion, ... may be wrong, and even the compiler may introduce bugs with different options. So you end up needing a huge test environment with all those different CPUs and even different distributions. Very time consuming :-(
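
A minimal sketch of that idea (the function names are invented; __builtin_cpu_supports and the target attribute are GCC extensions, so this assumes a GCC that provides them):

#include <stdio.h>

typedef struct { double re, im; } COMPLEX8;   /* stand-in, invented here */

typedef void (*accum_fn)(const COMPLEX8 *, int, double, double *, double *);

/* plain x87 version: runs on every x86 CPU */
static void accumulate_generic(const COMPLEX8 *p, int klim, double xinv,
                               double *re, double *im)
{
    int k;
    for (k = 0; k < klim; k++, p++) {
        *im += p->im * xinv;
        *re += p->re * xinv;
    }
}

/* identical source, but this one function is compiled for SSE2 */
__attribute__((target("sse2")))
static void accumulate_sse2(const COMPLEX8 *p, int klim, double xinv,
                            double *re, double *im)
{
    int k;
    for (k = 0; k < klim; k++, p++) {
        *im += p->im * xinv;
        *re += p->re * xinv;
    }
}

/* decide once, at startup, which variant this CPU can run */
static accum_fn pick_accumulate(void)
{
    __builtin_cpu_init();
    if (__builtin_cpu_supports("sse2"))
        return accumulate_sse2;
    return accumulate_generic;
}

int main(void)
{
    COMPLEX8 data[4] = { {1, 2}, {3, 4}, {5, 6}, {7, 8} };
    double re = 0.0, im = 0.0;
    accum_fn accumulate = pick_accumulate();

    accumulate(data, 4, 0.5, &re, &im);
    printf("re=%g im=%g\n", re, im);   /* expect re=8 im=10 */
    return 0;
}

The selection happens once at startup, and only an indirect call remains in the hot path; the can-of-bugs part is that the detection and every variant still have to be tested on real hardware.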

However, look at that code snippet I posted, regardless of whether it is still used or not (yes, they changed some code). Create an assembler file and check for fxch instructions. Almost every floating-point instruction is followed by such an fxch :-( Get rid of those and the CPU's decoder will be able to decode more 'real' instructions, and Linux will reach the speed of that 'other' compiler. And most important: it will still run on all CPUs.
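
To reproduce that check, isolate the loop in its own file and look at the assembler output (file names invented, assuming gcc):

/* loop.c -- the hot loop isolated so its assembly is easy to read.
 * Generate assembly and count the fxch instructions:
 *   gcc -O3 -S loop.c          then:  grep -c fxch loop.s
 * For comparison, the SSE2 version (no x87 register stack, hence no
 * fxch, but the binary then needs an SSE2-capable CPU):
 *   gcc -O3 -msse2 -mfpmath=sse -S loop.c
 */
typedef struct { double re, im; } COMPLEX8;   /* stand-in, invented here */

void accumulate(const COMPLEX8 *Xalpha_k, int klim, double xinv,
                double *reXinv, double *imXinv)
{
    int k;
    for (k = 0; k < klim; k++, Xalpha_k++) {
        *imXinv += Xalpha_k->im * xinv;
        *reXinv += Xalpha_k->re * xinv;
    }
}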

Once x64 has reached a reasonable share of the machines, it might make sense to 1) distribute a separate client for x86 and 2) compile that client using SSE[1|2|3] instructions.

But it is not my project :-) so whatever we are discussing here may just ring some bell, surely not more. Decisions are made elsewhere.

Happy crunching!

FalconFly
Joined: 16 Feb 05
Posts: 191
Credit: 15650710
RAC: 0

Message 8765 in response to message 8764

> And this is the problem the guys have. When you compile an application to use
> a processor-specific instruction set, it can be expected to run faster on that
> processor, but it will no longer run on some or many others.

That's why modern compilers can include several code paths, among which the binary chooses at run time, based on the architecture it finds itself running on.

Going one step further, one could write and compile a client that is fully optimized for any number of specific architectures and decides at application start which path to use.
The only remaining drawback would be the increased size of the binary, due to the larger number of code paths it contains.

3D applications have worked this way for a very long time, holding up to six specific render paths for different architectures to achieve optimal performance.

The compiler guys over at SETI have successfully compiled CPU-specific clients for almost everything x86 (Linux and Win32), with truly massive performance gains.
Just putting it all into one client is beyond their capability (and not really needed).
BOINC, knowing what CPU it is running on, could easily signal which optimized client it needs, and still download a generic one if detection failed for whatever reason.

All it would take is to do the compiling, make the optimized clients available for download on the server, and have BOINC fetch the appropriate one.

Paul D. Buck
Joined: 17 Jan 05
Posts: 754
Credit: 5385205
RAC: 0

Message 8766 in response to message 8765


> All it would take is to do the compiling, make the optimized clients available
> for download on the server, and have BOINC fetch the appropriate one.

I, for one, would like that ... every time I have tried to use one of the optimized binaries I get weird reactions - I don't know why ...

And I have tried them on two different classes of machines. It is not that I could not figure it out, I suppose, but it is just easier to "go with the flow" and live with what you know is probably lower performance.
