Information about the new S5 workunits

Bikeman (Heinz-Bernd Eggenstein)
Moderator
Joined: 28 Aug 06
Posts: 3522
Credit: 739076781
RAC: 1263353

Message 37787 in response to message 37786

Quote:
To prove the point that it's cache-related on Intels, I'll try to negotiate with my brother... he's got one of those really nice late Pent Ms with an enormous 2 MB of L2 cache (it's called a Dothan iirc), a likely machine to get away totally unaffected if we are correct... doubt he'll let me run it off a live CD, though... anyone else got a machine like this?

My Pentium M is a Banias, with "only" 1 MB L2 cache, sorry.

Annika, if you have some spare time and want to get to the bottom of this mystery, there's another thing you might want to try:

Intel has a brilliant tool called VTune. It's (among other things) a profiler that works without modifying or recompiling the code. You just let it sample some data for some time and it will tell you the hot-spots of any app running during the sampling process.
In addition to this, VTune uses the hardware support for profiling and performance analysis in recent Intel CPUs. So the tool will actually report things like statistics on cache misses.

Best of all: it's available for Win XP and Linux, and an evaluation version is available for free. The GUI for this tool is as intuitive as it gets, really nice.

Unfortunately, Pentium III and (of course) AMD chips are not supported as far as I can see. So your "L2 challenged" Core would be an excellent test platform to compare Win & Linux, as it seems to be the only recent Intel CPU that shows a WIN penalty so far, right?

EDIT:

What VTune will show you is that the compiler used for Win actually does use SSE instructions. Some of the math library functions will execute different code paths depending on what processor is detected. However, from what I see in the profiling statistics, I don't think even a mis-detection of SSE capabilities would cause a 30% slowdown, as those functions aren't executed that often. It might contribute a bit to the PIII problem, though (those lib functions have names like "modf_pentium4", which makes you wonder whether there's an SSE code path for Pentium IIIs). But that's a preliminary analysis.
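
The run-time dispatch described above can be pictured roughly like this. It is only a minimal Python sketch: the feature sets, the `select_modf` helper and both stand-in implementations are hypothetical, and only the name "modf_pentium4" comes from the profile. It also assumes the selection is made on SSE2 support; the real math library does this in native code.

```python
import math

def modf_generic(x):
    """Stand-in for a plain x87 FPU code path (hypothetical)."""
    return math.modf(x)

def modf_pentium4(x):
    """Stand-in for the SSE2 code path; here it simply reuses math.modf."""
    return math.modf(x)

def select_modf(cpu_features):
    """Pick an implementation based on the detected CPU feature flags."""
    if "sse2" in cpu_features:
        return modf_pentium4
    # A Pentium III has SSE but not SSE2, so it would end up here.
    return modf_generic

pentium3 = {"mmx", "sse"}
pentium_m = {"mmx", "sse", "sse2"}

print(select_modf(pentium3).__name__)
print(select_modf(pentium_m).__name__)
```

If the detection went wrong, the only effect in this picture would be that the generic path runs instead of the SSE2 one, which matters only as much as those functions are actually called.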


CU

BRM

Annika
Joined: 8 Aug 06
Posts: 720
Credit: 494410
RAC: 0

Well I'm a bit of a newb with stuff like this, but it sounds really interesting. I'll try to find time for it over the weekend and see what I can do.

Bikeman (Heinz-Bernd Eggenstein)
Moderator
Joined: 28 Aug 06
Posts: 3522
Credit: 739076781
RAC: 1263353

Message 37789 in response to message 37788

Quote:
Well I'm a bit of a newb with stuff like this, but it sounds really interesting. I'll try to find time for it over the weekend and see what I can do.

That would be cool. But beware, this VTune tool is quite addictive if you like dissecting applications :-).

I dug a bit deeper, and it seems to me that the Win version does indeed use SSE2 instructions in a few places that are not really important hotspots, and falls back to standard FPU instructions if SSE2 is not supported (e.g. on a P III). I still can't believe this has a significant impact.

CU

BRM

Annika
Joined: 8 Aug 06
Posts: 720
Credit: 494410
RAC: 0

In addition, I got said Dothan added to my fleet temporarily. It's crunching one WU each under Win XP and BackTrack (a SLAX-based Linux live CD which we knew to support that notebook's hardware well). If our theory is correct, the Linux app should be no faster with that monster cache... maybe a bit slower, because a live CD and USB flash stick are not exactly the ideal crunching environment, performance-wise.

Bikeman (Heinz-Bernd Eggenstein)
Moderator
Joined: 28 Aug 06
Posts: 3522
Credit: 739076781
RAC: 1263353

Message 37791 in response to message 37790

Quote:
In addition, I got said Dothan added to my fleet temporarily. It's crunching one WU each under Win XP and BackTrack (a SLAX-based Linux live CD which we knew to support that notebook's hardware well). If our theory is correct, the Linux app should be no faster with that monster cache... maybe a bit slower, because a live CD and USB flash stick are not exactly the ideal crunching environment, performance-wise.

If the USB stick is mounted with async IO it might be quite OK.

As I said, this VTune stuff from Intel is quite addictive, so I spent some time taking measurements (under Win XP on the Pentium M Banias).

The results so far are:

- L2 cache misses are barely measurable. Not an issue with the Banias, as expected.

- no significant penalties because of misaligned data access

- the hot spot function executes about 1 instruction every 0.8 clock cycles. Not too bad.

- the few SSE2 instructions that are used by the math lib don't really matter that much. They seem to be used as drop-in replacements for scalar FPU instructions, not in any kind of vector-mode. The Linux version doesn't seem to use any, but this may depend on the version of glibc installed on your system (???).

Nothing to explain the observed effect, so far :-(
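
For scale, the instruction-rate figure can be turned into the usual CPI/IPC numbers. This is a small Python sketch, not a measurement: the 30% value is the Win penalty discussed in this thread, and treating it as a pure increase in cycles per instruction at an unchanged instruction count is an assumption.

```python
def ipc_from_cpi(cycles_per_instruction):
    """Instructions per cycle is the reciprocal of cycles per instruction."""
    return 1.0 / cycles_per_instruction

def cpi_with_slowdown(cpi, slowdown):
    """CPI needed for the same instruction stream to take `slowdown` longer."""
    return cpi * (1.0 + slowdown)

measured_cpi = 0.8  # "about 1 instruction every 0.8 clock cycles"

print(f"IPC: {ipc_from_cpi(measured_cpi):.2f}")
print(f"CPI with a 30% penalty: {cpi_with_slowdown(measured_cpi, 0.30):.2f}")
```

So on a machine that shows the penalty, the hot spot would have to drop from about 1.25 instructions per cycle to slightly below 1, unless the Win build executes more instructions in the first place.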

@Gary

Contrary to what I wrote earlier, VTune should be usable on Pentium III systems!
So you might consider profiling under XP and Linux to find out where the 30% penalty is incurred.

CU

BRM

Annika
Joined: 8 Aug 06
Posts: 720
Credit: 494410
RAC: 0

Got VTune running, but somehow I can't seem to recognize the relevant data... makes me feel like a newbie ;-) Could anyone please give me a hint what to look for?

Bikeman (Heinz-Bernd Eggenstein)
Moderator
Joined: 28 Aug 06
Posts: 3522
Credit: 739076781
RAC: 1263353

Message 37793 in response to message 37792

Quote:
Got VTune running, but somehow I can't seem to recognize the relevant data... makes me feel like a newbie ;-) Could anyone please give me a hint what to look for?

Hi Annika,

I've just returned from a cocktail bar tour with some friends; nevertheless, I hope the following still makes some sense ;-).

What you want to do is configure an "Activity" with a "Sampling" run (the data is collected while the application in question is already running).

The important thing is to set up "Event driven counters".

Try, for example, to create a new Activity:

Menu->New Activity

Category: Analyser Project, Advanced Activity Configuration
--> OK

Select Data Collectors: New... counters and New...--> Sampling

select "Sampling" -->Configure ... -->

Tab general: Event-Based Sampling
Tab Event Ratios -->

Here you can select different statistics, grouped into "Ratio Groups".
For a start, one might try out the memory-related statistics:

Ratio Group: Memory Statistics:

Select Ratio (for example): L2 Misses per Data Memory Reference

(add with ">>" to the right hand side list)

Add more ratios as you prefer, then press OK

Once you have closed the "Advanced Activity Configuration" dialog, you can start the data collection (the green triangle icon near the menu bar).

After a default of 20 seconds the data collection will end and the results will be displayed.

By clicking on the bar in the chart that represents the einstein app, you should see the "hotspot" einstein functions. A separate window will show estimates of the gathered statistics (here: the L2 cache miss ratio) for the currently selected function inside the einstein app.
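
How such a ratio comes out of event-based sampling can be sketched in a few lines of Python. The sample counts and sample-after values below are made up for illustration and the function names are hypothetical. The idea is only that each sample stands for a fixed number of events, so the event totals are estimates.

```python
def estimated_events(samples, sample_after):
    """Each sample is taken after `sample_after` occurrences of the event."""
    return samples * sample_after

def event_ratio(num_samples, num_sample_after, den_samples, den_sample_after):
    """Estimated numerator events per estimated denominator event."""
    denominator = estimated_events(den_samples, den_sample_after)
    if denominator == 0:
        return None  # nothing sampled for the denominator, ratio undefined
    return estimated_events(num_samples, num_sample_after) / denominator

# Hypothetical numbers for one function: 3 samples of an L2 miss event
# against 1500 samples of a data memory reference event.
ratio = event_ratio(3, 10_000, 1500, 1_000_000)
print(ratio)
```

An event that never reaches its sample-after value within the collection time produces zero samples, so a ratio of exactly zero reads as "too rare to sample" rather than "never happened".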

I'll make a few screenshots tomorrow (well... today, Sunday).

CU

BRM

Annika
Joined: 8 Aug 06
Posts: 720
Credit: 494410
RAC: 0

Thanks a lot :-D that was really helpful. I wish some of my professors could explain something that well when they are dead sober and not lacking sleep :-D in German, too.
The strange thing is: I've done three runs now and in each the number of cache misses was exactly zero. And yes I'm quite sure I've done it right, it wasn't so hard following your instructions. I believe less and less that cache plays a major role here...

Bikeman (Heinz-Bernd Eggenstein)
Moderator
Joined: 28 Aug 06
Posts: 3522
Credit: 739076781
RAC: 1263353

Message 37795 in response to message 37794

Quote:
Thanks a lot :-D that was really helpful. I wish some of my professors could explain something that well when they are dead sober and not lacking sleep :-D in German, too.
The strange thing is: I've done three runs now and in each the number of cache misses was exactly zero. And yes I'm quite sure I've done it right, it wasn't so hard following your instructions. I believe less and less that cache plays a major role here...

Yup, even the P III's cache should be more than enough to run this app efficiently.

I will play around with this tool a bit more today and have a look at the efficiency of branch prediction. I'll also try to install it under Linux to get a direct comparison. At least on your Yonah T2060, comparing measurements from XP and Linux should show differences that explain the 30% gap. Not necessarily in L2 cache misses, but in some other metric.

The crucial code of einstein is compiled into just a few hundred lines of assembly instructions, so something causing a 30% loss of performance should really "stick out".
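
One way to make something "stick out" is to line up the same metrics from both runs and sort them by relative difference. This is a minimal Python sketch; the metric names and values are invented placeholders, not measurements.

```python
def rank_by_relative_difference(win, linux):
    """Return (metric, relative difference) pairs, largest gap first."""
    ranked = []
    for name in sorted(win.keys() & linux.keys()):
        base = linux[name]
        if base == 0:
            continue  # no meaningful relative difference against zero
        ranked.append((name, (win[name] - base) / base))
    ranked.sort(key=lambda item: abs(item[1]), reverse=True)
    return ranked

# Placeholder values only, to show the shape of the comparison.
win_xp = {"cpi": 1.04, "branch_mispredict_rate": 0.020, "l2_miss_ratio": 0.0}
linux = {"cpi": 0.80, "branch_mispredict_rate": 0.019, "l2_miss_ratio": 0.0}

for name, diff in rank_by_relative_difference(win_xp, linux):
    print(f"{name}: {diff:+.1%}")
```

Metrics that are zero on the reference side are skipped here; they would need an absolute comparison instead.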

CU

BRM

Annika
Joined: 8 Aug 06
Posts: 720
Credit: 494410
RAC: 0

Okay, if you know what to look for, I'll be more than happy to try it out on my little box. This is getting more and more mysterious and I want to figure it out.
Btw, it looks like the Dothan (really one of the most extreme machines concerning L2 cache) is getting even more of a Win penalty than my CPU...
