AMD Processors

phud
Joined: 22 Mar 05
Posts: 5
Credit: 547609
RAC: 0

the cache on the Duron is

the cache on the Duron is "front loaded" meaning the L1 cache is 128kb
and the L2 is 64kb...

most other processors have an L1 cache of between 20 and 64 KB, with an L2 of anywhere from 128 KB up to 2 MB

for what it is... the Duron is a very efficient CPU!

gravywavy
Joined: 22 Jan 05
Posts: 392
Credit: 68962
RAC: 0

I have noticed the same thing

Message 11085 in response to message 11083

I have noticed the same thing over time. My little Duron 1.2 is not really that far behind my XP 3000+. Granted, the 3000+ gets more use, but it does have significant idle time to crunch. Anybody have any idea why the Duron chip would seem so much faster given the same clock speed? Would it be the smaller cache? That defies all sensible logic. Any ideas?
-Jason

hi Jason,

A larger cache can slow things down in principle, due to the time penalty for abandoning a speculative fetch (i.e. for data the pre-fetch algorithm hoped might be useful) when an immediate need arises for a totally unexpected value. If the pre-fetch had not been in progress (the cache already being full, perhaps), you'd save that penalty.

But that is not what I think is happening here.

If I remember correctly, the Duron not only has a smaller cache, it also has a shorter pipeline (which is why it has the lower nominal clock speed). The pipeline is where a processor gains an apparently higher clock speed than it really deserves by starting on the next calculation before it has finished the first.

Pipelining is a liability where, for example, the choice of the next calculation depends on the result of the previous one. If there are a lot of such result-dependent calculations in the code, then a chip with a short pipeline will maybe lose one nominal machine cycle where the wrong next result has been started, whereas a machine with a longer pipeline might lose several nominal machine cycles. If I am right about the pipeline lengths, then that would be my uninformed guess as to what is going on.
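To make that concrete, here is a toy C sketch (nothing to do with the actual E@H code, just an illustration): the branch depends on a value computed in the previous iteration, so the chip has to guess which way it will go before the result is ready, and every wrong guess flushes the pipe, a few cycles on a short pipe and many more on a long one.

#include <stdio.h>
#include <stdlib.h>

/* Toy example: the if() below depends on the running result, so the
   branch predictor is often wrong when the data is effectively random. */
static double crunch(const double *x, size_t n)
{
    double acc = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (acc > 0.0)          /* result-dependent branch */
            acc -= x[i];
        else
            acc += x[i];
    }
    return acc;
}

int main(void)
{
    enum { N = 1000000 };
    static double x[N];
    for (size_t i = 0; i < N; i++)
        x[i] = (double)rand() / RAND_MAX;   /* hard-to-predict data */
    printf("%f\n", crunch(x, N));
    return 0;
}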

In either case, the fact is that both cache and pipelining are only fully effective where the future data flow is predicted accurately by the chip: predictions are made a few instructions ahead for pipelining and many hundreds of instructions ahead for the cache.

In both cases, where these predictions fail more than a critical proportion of the time, the speed-enhancing hardware (pipe or cache) crosses over from benefit to liability, and the cheaper chip wins out.

In the most extreme case, if you ever want to crunch data where you have an array that fills your real memory, but you can't predict even one instruction ahead which array element is needed next, then the best chip would have no pipeline and no cache; if it did have an on-board cache, you'd disable it. In that limiting case all those 'enhancements' would be a liability.

R~~

~~gravywavy

Jason Charles
Joined: 6 May 05
Posts: 3
Credit: 930279
RAC: 0

Thanks for the insight to the

Thanks for the insight into the processor question; I never thought about the pipeline being a significant factor for this project in the Duron vs. the rest of the high-performing CPUs. It does make sense, however, that the shorter pipeline would win out over time when prediction fails repeatedly. I guess some programs such as einstein@home do not lend themselves to being "predicted" when the code is cached? Some very good ideas you presented. I just recently put a P4 2.8 GHz machine on with an almost identical software configuration. The performance is not what I would have expected from a P4 2.8. For this particular project I would definitely suggest the lower-end Duron. With that, I will leave you with another question... How do you think a Sempron would perform running this project?

Andrew M
Joined: 10 Nov 04
Posts: 7
Credit: 863109
RAC: 0

A sempron would make a decent

A Sempron would make a decent cruncher. A word of advice is to purchase the Athlon 64-based version instead of the Athlon XP-based version. The decreased cache has an even lower performance impact because of the integrated memory controller.

gravywavy
Joined: 22 Jan 05
Posts: 392
Credit: 68962
RAC: 0

I guess some programs such as

Message 11088 in response to message 11086

I guess some programs such as einstein@home do not lend themselves to being "predicted" when the code is cached?

Yes and no. I think E@H does lend itself moderately well to the cache, but not at all to the pipe.

I have heard the E@H code is full of FFTs, and I'll share with you why the FFT is less pipe-friendly than most code. Warning: I am just about to bump the 'techy' level a bit, so some readers will want to quit now...

Still reading? Well if you have a 16 value FFT, you want the data presented in this order:

0
8
4
12
2
10
6
14
1
etc (!!?)

Can you see what is happening? If you get it without looking further down the page you are much better than I am!

Imagine how it jumps around for a wider FFT. The first two data items you want are always the first and the middle array elements. The first two iterations have to wait for separate pages to load, but hopefully they do not fill the cache, and a good cache controller will keep each page handy once it has loaded it. We will keep coming back to each page (we use each element exactly once), so provided we can fit the entire 'width' of the FFT into cache at once there should be no problem with caching.

Or rather, a small problem at the start, but soon the algorithm finds it is loading ready-cached data at every iteration.

Providing the data thrown at a single FFT fits into cache all at once, then cache size should not matter so much.
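To put a rough number on that (my own illustration with assumed sizes, not anything taken from the E@H code): a 4096-point FFT on single-precision complex data needs 4096 x 8 bytes = 32 KB, which fits comfortably in even the Duron's 64 KB L2, whereas a 65536-point one needs 512 KB and would overflow the Duron's cache entirely.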

Turning to the pipeline: this hopping around the data is exactly the thing that will totally stall a pipe at every iteration. A prediction method based on ordinary arithmetic will never guess what is wanted next.

However it is stunningly easy to build hardware that will do it. Look at the numbers again, but this time in binary:

loop ~ address
0 = 0000 ~ 0000 = 0
1 = 0001 ~ 1000 = 8
2 = 0010 ~ 0100 = 4
3 = 0011 ~ 1100 = 12
4 = 0100 ~ 0010 = 2
5 = 0101 ~ 1010 = 10
6 = 0110 ~ 0110 = 6
7 = 0111 ~ 1110 = 14
8 = 1000 ~ 0001 = 1

Take the loop counter as a binary number, mirror-image it, and you get the desired address. Cool, eh?
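For anyone who fancies seeing it in code, here is a minimal C sketch (mine, not from any FFT library) that mirror-images a 4-bit loop counter and reproduces the list above:

#include <stdio.h>

/* reverse the lowest `bits` bits of x: the "mirror image" trick */
static unsigned bit_reverse(unsigned x, unsigned bits)
{
    unsigned r = 0;
    for (unsigned i = 0; i < bits; i++) {
        r = (r << 1) | (x & 1);   /* shift the next low bit of x into r */
        x >>= 1;
    }
    return r;
}

int main(void)
{
    for (unsigned loop = 0; loop < 16; loop++)
        printf("%2u -> %2u\n", loop, bit_reverse(loop, 4));
    /* prints 0 8 4 12 2 10 6 14 1 9 5 13 3 11 7 15 */
    return 0;
}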

Wouldn't it be easy to get hardware to do this?

Chips always offer the assembly programmer the chance to

load float register from a pre-defined base address indexed by the integer found in a given general register.

What we want is one more instruction:

load float register from pre-defined base address indexed by the mirror image of the integer in the given general register

If we had just that one extra instruction in the set, the code would run faster and on top of that, the predictive algorithm would know what was going on and could equally easily pipeline the correct values.
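For comparison, here is a rough sketch (again mine, not real FFT library code) of what software has to do today without that instruction: precompute the mirror-image table once, then gather the data through it with the ordinary indexed load. The wished-for instruction would make the rev[] table, and its extra memory traffic, unnecessary.

#include <stdio.h>

#define N    16   /* 16-point example, as above */
#define BITS 4

int main(void)
{
    unsigned rev[N];
    float src[N], dst[N];

    for (unsigned i = 0; i < N; i++)
        src[i] = (float)i;

    /* build the mirror-image index table once */
    for (unsigned i = 0; i < N; i++) {
        unsigned r = 0, x = i;
        for (unsigned b = 0; b < BITS; b++) { r = (r << 1) | (x & 1); x >>= 1; }
        rev[i] = r;
    }

    /* the gather the first FFT stage wants: dst[i] = src[mirror(i)] */
    for (unsigned i = 0; i < N; i++)
        dst[i] = src[rev[i]];

    for (unsigned i = 0; i < N; i++)
        printf("%g ", dst[i]);   /* 0 8 4 12 2 10 6 14 1 9 5 13 3 11 7 15 */
    printf("\n");
    return 0;
}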

The reason chips don't is not the computer science but the economics: too few purchasers of PC chips care about FFTs to create a mass market, even though all it needs is one more instruction in the instruction set.

Dedicated chips for FFTs do exist, and even though they are built to a much lower technological level than a Pentium or a Duron, they will far outrun either. They cost around £1k but can't be used for this project as they need special programming, and the project team are unlikely to write code specially for the few dozen users who'd be willing to spend £1k to take part...

~~gravywavy

AnRM
Joined: 9 Feb 05
Posts: 213
Credit: 4346941
RAC: 0

..... chips for FFTs do

Message 11089 in response to message 11088

..... chips for FFTs do exist, and even though they are built to a much lower technological level than a Pentium or a Duron, they will far outrun either. They cost around £1k but can't be used for this project as they need special programming, and the project team are unlikely to write code specially for the few dozen users who'd be willing to spend £1k to take part...
Thanks for the detailed explanation. I don't have the math (FFTs) to understand your posts completely, but I learn something about processing and CPUs etc. from them. I appreciate your time and effort and respect your knowledge of the subject. Great posts! Cheers, Rog.

Jason Charles
Joined: 6 May 05
Posts: 3
Credit: 930279
RAC: 0

gravywavy, Thanks for

gravywavy,

Thanks for shedding some light on the FFT and pipeline discussion. You brought forth very easy-to-understand examples and offered some thought-provoking ideas on chip design. You made me wonder whether the market would support a mainstream chip that offered an additional instruction just for FFT processing, say for all of us crunchers out there who are looking for the next CPU that will do it faster and more efficiently. Imagine if AMD started a marketing campaign alongside BOINC, SETI and the like to promote a mainstream chip with special DC or FFT instructions. Just a thought... Thanks again!
-Jason

SunRedRX7
Joined: 11 Feb 05
Posts: 7
Credit: 92426097
RAC: 65230

> However, 8 years ago when I

Message 11091 in response to message 11061

> However, 8 years ago when I got my Pentium II, I paid about $3,000 for it and it was the best computer in the store. To this day, I have yet to find a program that won't run on it, even though it is a little slow. I know someone who got a cheap computer years after I bought my Pentium II, then they had to upgrade it to run Windows XP!

Personally I'd rather build/upgrade a $500 PC every year than one $3,000 PC every 6 years.

BTW, check what GHz your Pentium-M runs at; that'll show you how having an efficient architecture can overcome GHz differences.

gravywavy
Joined: 22 Jan 05
Posts: 392
Credit: 68962
RAC: 0

BTW, check what ghz your

Message 11092 in response to message 11091


BTW, check what GHz your Pentium-M runs at; that'll show you how having an efficient architecture can overcome GHz differences.

That's right. And the posting that started this thread off made the point that the Centrino chip outperforms a Pentium-4 at the same GHz, which is pretty much the same point.

A Pentium-M outperforms a Pentium 4 whose quoted clock is 1.5x its own. A recent lab test in Personal Computer World (July 2005, p154, UK edn) had a 1.6GHz P-M returning the same benchmark scores as a 2.4GHz P-4 only when the P-4 was overclocked to 2.52GHz, almost 1.6x the nominal speed of the P-M.

The effect could be even more dramatic on E@H, as the P-4 has a longer pipe (see my postings earlier in the thread). My guesstimate would be that on this project a P-M could equal a P-4 running at 1.8x its clock speed. Could someone who can drive BOINCSTATS check this out, if you'd be so kind?

You can fit a Pentium-M on a desktop motherboard now (AOpen did the first one, and several other board makers have since followed). This means you can have full-sized disk drives, PCI Express, etc., like any other desktop machine, but finally run an Intel chip on a sensible amount of power.

Buying a chip and motherboard of similar performance would cost about double for the Pentium-M, but you'd make the difference back and more on the power bills if you run it 24/7. That is power for the chip, plus power for the fan to cool it, plus power supply losses in the bigger power supply you'll need with the P-4.
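As a rough back-of-envelope illustration (the figures here are assumptions, not measurements): if the P-4 box draws, say, 60W more at the wall than the P-M box, that is 60W x 8760h, about 526 kWh over a year of 24/7 crunching, or roughly £50 at around 10p per kWh, so a £100-ish price premium on the Pentium-M could pay for itself in about two years.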

~~gravywavy

SunRedRX7
Joined: 11 Feb 05
Posts: 7
Credit: 92426097
RAC: 65230

Tom's Hardware also did a

Message 11093 in response to message 11092

Tom's Hardware also did a decent review of the Pentium M lately and stacked it against the Pentium 4
http://www.tomshardware.com/cpu/20050525/index.html
Their findings:
"Conclusion: The Pentium 4 Must Go (alternatively: Kill The Pentium 4!) "
