This project is about number crunching. I wonder if the programmers optimized the calculating part of the Einstein client. A good optimization may make any calculation 10 times faster. Any information about it?
I hope they wrote it in C (or better, Assembly) and used the fastest compiler available (they say Intel's C compiler makes 25% faster code, for example, than gcc or Microsoft C++).
Kind Regards:
azazil
> This project is about number crunching. I wonder if the programmers optimized
> the calculating part of the Einstein client. A good optimization may make any
> calculation 10 times faster. Any information about it?
>
> I hope they wrote it in C (or better, Assembly) and used the fastest compiler
> available (they say Intel's C compiler makes 25% faster code, for example, than
> gcc or Microsoft C++).
>
>
> Kind Regards:
>
> azazil
>
Some of the optimizations would destroy the science being done, as they would propagate round-off errors faster. So it is much more likely that the code has been optimized for correctness rather than for speed.
BOINC WIKI
... but we want wrong answers twice as fast! :-)
Cheers,
PeterV.
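To make the round-off point concrete: aggressive optimization switches such as gcc's -ffast-math license the compiler to re-associate floating-point operations, and re-association alone can change an answer. A tiny stand-alone illustration (the values are contrived for the example):

#include <stdio.h>

int main(void)
{
    double big = 1e20, small = 1.0;
    /* Mathematically identical expressions; in double precision they are not. */
    double left  = (big + small) - big;  /* small is absorbed: prints 0 */
    double right = (big - big) + small;  /* prints 1 */
    printf("%g %g\n", left, right);
    return 0;
}

An optimizer that is free to reorder these operations can silently turn one result into the other, which is harmless in a game but not in a science run.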
It's optimized for speed as far as we could get, as long as the results stay correct within tolerances we found acceptable. Some calculations still need to be done in double precision, which means we cannot make much use of, e.g., SSE. It looks like the MSC compiler makes the most of our (C) code, so the Windows version is somewhat faster than the Linux & Mac versions (built with gcc). We didn't find a significant improvement with icc. When things have settled down a bit and become more stable, we may address this issue again. You may want to take a look at this old thread: http://einsteinathome.org/node/187125
BM
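To illustrate BM's point about double precision: the original SSE instruction set only covers single precision, four floats per 128-bit operation, while doubles need SSE2, which fits just two per operation, halving the theoretical SIMD gain. A minimal sketch (not project code; purely for illustration):

#include <xmmintrin.h>  /* SSE : four single-precision floats per register */
#include <emmintrin.h>  /* SSE2: two double-precision values per register  */

int main(void)
{
    float  f[4] = { 1.0f, 2.0f, 3.0f, 4.0f };
    double d[2] = { 1.0, 2.0 };

    /* One SSE instruction adds four floats at once ... */
    __m128  vf = _mm_add_ps(_mm_loadu_ps(f), _mm_loadu_ps(f));
    /* ... but one SSE2 instruction adds only two doubles. */
    __m128d vd = _mm_add_pd(_mm_loadu_pd(d), _mm_loadu_pd(d));

    _mm_storeu_ps(f, vf);
    _mm_storeu_pd(d, vd);
    return (int)(f[0] + d[0]);  /* keep the compiler from discarding the work */
}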
> It's optimized for speed as far as we could get, as long as the results stay
> correct within tolerances we found acceptable. Some calculations still need to
> be done in double precision, which means we cannot make much use of, e.g., SSE.
> It looks like the MSC compiler makes the most of our (C) code, so the Windows
> version is somewhat faster than the Linux & Mac versions (built with gcc).
> We didn't find a significant improvement with icc. When things have settled
> down a bit and become more stable, we may address this issue again. You may
> want to take a look at this old thread:
> http://einsteinathome.org/node/187125
>
> BM
You also failed to point out that you have to "optimize" for the specific target hardware. In the old days, when we optimized by hand, we would select the instructions that had the fewest clock cycles but gave us the same effect as the "correct" instructions. Since those instructions were part of a specific Instruction Set Architecture (ISA), you effectively crafted the program for one machine by hand.
Right now, there is a debate "raging" on the boards (if you have the time to read all of them for the various projects) about which CPU is "fastest" for processing work. The difficulty is that even "identical" processors of different "steppings" can deliver different performance due to changes in their internal arrangements.
The fundamental problem is that several degrees of "freedom" rule the actual performance. As Bernd mentioned, you have the program itself, the compiler, the compiler "switches" used, the CPU ISA, and the CPU physical architecture.
The reason I mention ALL of this is to point out that the AMD and Intel chips use an "identical" external ISA based on the 8086 ISA (with extensions that complicate things again, as Bernd stated), but they have vastly different internal ISAs and therefore different delivered performance.
One of my favorite things, which I came across while researching ISAs for a hardware class, was a program HP developed to "model" one of their CPUs: they compiled a program that emulated the CPU and ran it on that same physical CPU. The expectation is that when you emulate a hardware system in software, the delivered performance will be, at best, significantly slower. In fact, the emulator allowed them to do "on-the-fly" optimizations and delivered performance as much as 20% above the native hardware. When you consider the added software overhead, this is very interesting ...
Anyway, just more food for confusion ...
Follow the top link on my site to the lectures section and read the hardware lecture notes (well, that is all that is there at THIS time ... hard to miss).
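To make the "degrees of freedom" point concrete: with gcc alone, the same source compiled with different switches targets a different physical CPU. A hypothetical build line (these flags are illustrative, not what Einstein@Home actually uses):

gcc -O2 -march=pentium4 -mfpmath=sse -funroll-loops -o app main.c

The resulting binary is tuned for one ISA variant and may run slower, or not at all, on an Athlon or an older Pentium III, which is exactly why a single stock client cannot be optimal for every host.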
Since the WIN version is said here to be better optimized than the Linux version, what is the experience of those here who run the WIN version under WINE under Linux, as opposed to running the native Linux version under Linux?
Is WIN-under-WINE-under-Linux stable and error-free?
How much faster is WIN-under-WINE-under-Linux compared to native Linux?
(I am not asking about the screensaver, only the number-crunching.)
Thanks,
ADDMP
> How much faster is WIN-under-WINE-under-Linux compared to native Linux?
Since this delta is going to vary from host to host, why don't you try it for yourself and see?
> Since the WIN version is said here to be better optimized than the Linux
> version, what is the experience of those here who run the WIN version under
> WINE under Linux, as opposed to running the native Linux version under Linux?
>
> Is WIN-under-WINE-under-Linux stable and error-free?
>
> How much faster is WIN-under-WINE-under-Linux compared to native Linux?
>
> (I am not asking about the screensaver, only the number-crunching.)
>
> Thanks,
> ADDMP
>
I tried it with mixed results. The first WU returned fine with a very fast completion time (21,926.15 s vs. 37,047.61 s for native Linux); however, the two WUs processed after that completed in ~10k seconds and had errors. I don't know why they had errors, but restarting the client didn't help :(. So now I'm back to native Linux only.
Here are a couple of other threads that have been started on the subject; the devs definitely know about the problem.
http://einsteinathome.org/node/187846
http://einsteinathome.org/node/187471
> > How much faster is WIN-under-WINE-under-Linux compared to native Linux?
>
> Since this delta is going to vary from host to host, why don't you try it for
> yourself and see?
I can't tell you about E@H, but I did play a bit with SETI classic, running the binaries over exactly the same WU on the same hardware. The Linux native binary was slowest (by about 50%), and that's what I expected. What I didn't expect was that running the Windows binary in WINE was actually faster than the Windows binary in Windows (Win2k SP3 vs. RH7.3 with a 2.4 series kernel). Not much, but consistently on the order of 5%.
Metod ...
> I can't tell you about E@H, but I did play a bit with SETI classic, running
> the binaries over exactly the same WU on the same hardware. The Linux native
> binary was slowest (by about 50%), and that's what I expected. What I didn't
> expect was that running the Windows binary in WINE was actually faster than
> the Windows binary in Windows (Win2k SP3 vs. RH7.3 with a 2.4 series kernel).
> Not much, but consistently on the order of 5%.
>
I am going to go ahead and try running E@H with WINE on Linux; I'll report the results back once I get through testing it.
such things just should not be writ so please destroy this if you wish to live 'tis better in ignorance to dwell than to go screaming into the abyss worse than hell
If they want to borrow any of the code from the SETI optimization effort, they are more than welcome.
https://sourceforge.net/projects/setiboinc/
Haven't seen the source for Einstein, but if it is all doubles, then only SSE2 and SSE3 would help. I'm not positive, but I believe the Mac's AltiVec only does single-precision SIMD.
The real trick with most optimization these days isn't so much converting to SIMD, but arranging the code so that the multiple execution units inside the CPU can work on adjacent instructions at the same time.
Example:
for (i = 0; i < 100; i++) {
    a += buffer[i];       /* one serial dependency chain: each add waits for the previous */
}

vs

for (i = 0; i < 100; i += 4) {
    a += buffer[i + 0];   /* four independent chains: the adds can */
    b += buffer[i + 1];   /* issue to different execution units    */
    c += buffer[i + 2];
    d += buffer[i + 3];
}
a += b + c + d;
Then come dependency chains and latency scheduling. These can be handled even in C, without resorting to assembly language.
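For the all-doubles case mentioned above, here is a hedged sketch of the same multiple-accumulator idea expressed with SSE2 intrinsics (the function name and the assumptions that n is a multiple of four and buffer is 16-byte aligned are mine, for illustration only; this is not Einstein@Home code):

#include <emmintrin.h>  /* SSE2: two doubles per 128-bit register */

/* Hypothetical helper: sums n doubles, n a multiple of 4, buffer 16-byte aligned. */
double sum_sse2(const double *buffer, int n)
{
    __m128d acc0 = _mm_setzero_pd();
    __m128d acc1 = _mm_setzero_pd();
    double  out[2];
    int     i;

    for (i = 0; i < n; i += 4) {
        /* Two independent accumulators: the adds don't wait on each other. */
        acc0 = _mm_add_pd(acc0, _mm_load_pd(&buffer[i]));
        acc1 = _mm_add_pd(acc1, _mm_load_pd(&buffer[i + 2]));
    }
    acc0 = _mm_add_pd(acc0, acc1);  /* combine the partial sums */
    _mm_storeu_pd(out, acc0);
    return out[0] + out[1];
}

Note that even this changes the order of the additions, and therefore potentially the round-off behaviour, which loops right back to the correctness-versus-speed trade-off discussed above.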