Hyper Threading Processor
27 Apr 2005 1:15:06 UTC
Topic 189052
What exactly does a hyper threading Pentium 4 processor do that is different from one that is not hyper threading? There appear to be two CPUs, so does that mean I can process two Einstein@Home work units at the same speed that I can process one? When processing one work unit, it appears to use only half of the CPU power. Does that mean that I can use the other half for another project without slowing down Einstein@Home? Please let me know. Thank you.
I believe the consensus is that with HT, a wu is slower, but the fact that you do two at a time makes the output higher. You have to set your preferences to use more than 1 processor in order for it to do two wu's at the same time.
I love HT and can't wait until I get a dual core with HT, so I can do 4 at a time. :)
> I believe the consensus is that with HT, a wu is slower, but the fact that you
> do two at a time makes the output higher. You have to set your preferences to
> use more than 1 processor in order for it to do two wu's at the same time.
The speed-up is not quite a factor of two, and it varies depending on exactly which projects you are running. But most of the tests that I have done or watched seem to indicate a throughput increase of about 60% over running the processor in non-HT mode.
So, in the time a non-HT processor finishes 1 WU, you get about 1.6 WU done. Each individual WU takes longer than it would on its own, but the pair finishes sooner than if you ran the two serially.
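To put numbers on that, here is a minimal sketch of the arithmetic. The 1.6 factor is the figure quoted above; the 10 hour run time for a single WU is a made-up value used only for illustration.

```python
def ht_summary(throughput_gain=1.6, single_wu_hours=10.0):
    """Compare one WU at a time (HT off) with two WUs at once (HT on).

    throughput_gain: HT throughput relative to non-HT (1.6 means +60%).
    single_wu_hours: hypothetical time for one WU with HT off.
    """
    # Two WUs run side by side and together deliver `throughput_gain`
    # times the non-HT rate, so each one takes 2 / gain times as long.
    per_wu_hours_ht = single_wu_hours * 2.0 / throughput_gain
    # Two WUs run back to back with HT off.
    serial_pair_hours = 2.0 * single_wu_hours
    return per_wu_hours_ht, serial_pair_hours


per_wu, serial_pair = ht_summary()
print(f"each WU with HT on: {per_wu:.1f} h (10.0 h with HT off)")
print(f"two WUs finished after {per_wu:.1f} h with HT, {serial_pair:.1f} h without")
```

With these assumed numbers each WU takes 12.5 hours instead of 10, but two of them are done in 12.5 hours rather than 20.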
> I love HT and can't wait until I get a dual core with HT, so I can do 4 at a
> time. :)
Even better will be dual processors with dual core with HT which give you 8 in-flight at once. Now that will be something to see ... :)
Was that a technical or a practical question?
The previous replies give the practical answers. Some readers might be interested in *why* it works that way. If that does not include you, skip to the next post now...
In a traditional single core processor with a threaded operating system, but without threading built in to the chip, whenever the processor changes threads the 'context' has to be stored. This means the program counter and all the registers in the cpu have to be saved to memory. Then the context of the new thread is loaded. All this takes time, as each word of the context has to be written to cache (or even worse to RAM). This means that if you could persuade the operating system to run two WU together as separate threads you would lose out: the time taken for context switching would just make the overall throughput less, and at the end of the day you still have only one cpu to do the processing.
With a HyperThreaded chip, the context for the two threads is held on the chip so that it can swap from one thread to another in a small number of cpu cycles: maybe even in a single cycle. Running two threads together then costs little in throughput. But where does the saving come from, if there is still only ever one floating point op going on at once? Wouldn't you expect 2 WU to take 2x as long so that throughput remained constant?
If the cpu could actually work flat out this would be true. In fact, as we know, the chip runs a lot faster than the memory, so whenever the next data is missing from the on-chip cache a non-HT chip has to wait for the cache to load.
Thread switching for a non-HT chip would be counter productive, as exporting the context would slow things down further still.
In a HT chip, however, all the context is held nearby. Whenever there is a cache fault on thread 0 the context switches to thread 1 while the cache looks for the data needed by thread 0. Hopefully by the time thread 1 hits a cache fault the data needed by thread 0 will have arrived.
If the processing was cache-bound, so that almost every op needed data from outside the chip, there would be almost a 2x improvement in throughput by hyperthreading. If the loops in the code were so predictable that the look-ahead cache loader always got things right, then there would be a slight loss in performance from hyperthreading (as the context switch does still cost *something*). The factor of 1.6x throughput gain with hyperthreading shows that this code falls comfortably between those two extremes: without HT the chip is running at full speed some of the time, and waiting for cache some of the time.
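The two extremes and the 1.6x figure can be tied together with a toy model. This is only a sketch of the switch-on-cache-miss picture described above, not a measurement: `busy_fraction` and `switch_cost` are hypothetical parameters, and the model assumes the two threads' waits overlap perfectly.

```python
def ht_gain(busy_fraction, switch_cost=0.0):
    """Toy estimate of the throughput gain from two threads on one core.

    busy_fraction: share of the time a single thread keeps the core busy
                   (the rest is spent waiting for cache), 0 < value <= 1.
    switch_cost:   hypothetical fractional loss from switching threads.
    """
    # Between them, two threads can fill at most 100% of the core.
    utilisation_two_threads = min(1.0, 2.0 * busy_fraction)
    return utilisation_two_threads * (1.0 - switch_cost) / busy_fraction


for busy in (0.1, 0.5, 0.625, 0.8, 1.0):
    print(f"busy {busy:.3f} -> gain {ht_gain(busy):.2f}x")

# A core that is never waiting only pays the (assumed) switch cost.
print(f"always busy, 2% switch cost -> gain {ht_gain(1.0, 0.02):.2f}x")
```

In this model a 1.6x gain corresponds to a single thread keeping the core busy about 62.5% of the time, which sits between the cache-bound case (gain near 2x) and the never-waiting case (gain just under 1x).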
In future, if cpu speeds continue to increase, the proportion of time spent waiting for cache will increase and it might make sense to have more than two contexts held on-chip. Without HT there would be no point pushing the cpu any faster, as the chip would just get cache-bound, which is why we see a pause in the increasing GHz just now.
With HT in the long term there may be another push to increase cpu speeds further. It is not simple to predict: a 3GHz chip with two cores, each with context space for 2 threads, would be slower than a single core 6GHz chip with context space for 4 threads, but suppose the single core had 8 threads, or ran at 8GHz? Then which would be better?
Expect to see clock speeds static at around 3GHz while the engineers find the sweet point on multi-core & multi-threads within each core, and then expect clock speeds to start to rise again but with even larger amounts of multi-core and multi-threading for each step in clock speed.
Expect to see some chips with more cores or more threading than is useful for ordinary apps, simply because until the things are built it is hard to predict which are best for the mainstream users! Expect to see chip designers producing well-designed chips that sadly fail to hit the sweet points: the next few steps in chip evolution are simply not predictable in detail. They must involve some mix of cache, multi-cores, and hyperthreads, but how much of each will be a matter for experiment, not engineering. But in a few years, having 8 or 16 WU on the fly at one time may well seem puny...
By the way: We can't increase the memory speed significantly as the RAM is too far away from the cpu for the info to get there any faster even if it moves at the speed of light! All this work on cache, multi-threads, and multi-cores is a direct result of the effort to make the chip run faster than the memory.
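As a rough check on the speed-of-light point, the sketch below works out how far light travels in one clock cycle. The 3 GHz clock matches the figure used earlier; the 10 cm distance between cpu and RAM is a hypothetical value chosen for illustration.

```python
SPEED_OF_LIGHT_M_PER_S = 299_792_458  # exact, by definition of the metre


def distance_per_cycle_cm(clock_hz):
    """Distance light travels in a vacuum during one clock cycle, in cm."""
    return SPEED_OF_LIGHT_M_PER_S / clock_hz * 100.0


def round_trip_cycles(distance_cm, clock_hz):
    """Minimum clock cycles for a signal to go out and come back."""
    return 2.0 * distance_cm / distance_per_cycle_cm(clock_hz)


print(f"{distance_per_cycle_cm(3e9):.1f} cm per cycle at 3 GHz")
print(f"{round_trip_cycles(10.0, 3e9):.1f} cycles for a 10 cm round trip")
```

This is only a lower bound on the delay: electrical signals on a circuit board travel slower than light in a vacuum, and the figure ignores the memory's own access time.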
~~gravywavy
> Was that a technical or a practical question?
>
> The previous replies give the practical answers. Some readers might be
> interested in *why* it works that way. If that does not include you, skip to
> the next post now...
Permission to steal this? You will be cited ... but this is good and should be saved for posterity ...
> ... involve some mix of cache, multi-cores, and hyperthreads but how much of each ...
> ... can't increase the memory speed significantly as the RAM is too far away from the cpu for the info to get there any faster even if it moves at ...
This may seem dumb but: Why won't they put more memory on the CPU itself? I've read somewhere that AMD is moving MB bridge functionality onto their newest CPUs.
So if the game is to put more stuff on the chips, why not simply scale up the L1 cache to 256MB or more? And perhaps even dispense with the cache concept altogether and just use off chip RAM as a kind of hi-speed swap space?
Another "why not" solution: Could perhaps future separate RAM modules be on a much shorter bus, maybe even physically touching the CPU itself? Bye bye slot A/754/939 &c :-)
The 1st solution at least would probably reduce the cost of the overall system and make for a cleaner design. In my presently uninformed opinion, anyway.
Greetings, Mr Ragnar Schroder.
> This may seem dumb but: Why won't they put more memory on the CPU itself? I've read somewhere that AMD is moving MB bridge functionality onto their newest CPUs.
> So if the game is to put more stuff on the chips, why not simply scale up the L1 cache to 256MB or more? And perhaps even dispense with the cache concept altogether and just use off chip RAM as a kind of hi-speed swap space?
> Another "why not" solution: Could perhaps future separate RAM modules be on a much shorter bus, maybe even physically touching the CPU itself? Bye bye slot A/754/939 &c :-)
> The 1st solution at least would probably reduce the cost of the overall system and make for a cleaner design. In my presently uninformed opinion, anyway.
The simple answer is cost. Memory cost rises non-linearly with speed. A system whose memory cells all operate at the speed of the processor would likely cost several thousand times more, so you would be spending about $100,000.00 (or more) for the system you have now.
Cache works by a "sleight of hand": the fast memory is small, yet our performance is close to the speed of the processor at a much, much lower cost. I have a discussion about this under cache in the glossary, and a little bit on how it all works out.
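The "sleight of hand" can be sketched with the usual average access time formula: hit time plus miss rate times miss penalty. The timings below are hypothetical round numbers chosen only for illustration, not figures for any particular processor.

```python
def average_access_ns(hit_ns, miss_rate, miss_penalty_ns):
    """Average memory access time: hit time plus the expected miss cost."""
    return hit_ns + miss_rate * miss_penalty_ns


# Hypothetical numbers, chosen only for illustration.
cache_hit_ns = 1.0
ram_penalty_ns = 100.0
for miss_rate in (0.01, 0.05, 0.20):
    t = average_access_ns(cache_hit_ns, miss_rate, ram_penalty_ns)
    print(f"miss rate {miss_rate:.0%}: {t:.1f} ns on average")
```

As long as most accesses hit the small fast memory, the average stays close to the cache speed even though nearly all of the memory is the slow, cheap kind.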
As far as the rest, you have physical factors regarding component placement.
If you look at the lecture notes (also on my site) for computer history you will see that most of the features of mainframe computers have made it into the PCs you have on your desk. In fact, for some time now, you have had more computing power than the early mainframes on your desk (though the older mainframes probably still had greater I/O capacity than we have).
> ... involve some mix of cache, multi-cores, and hyperthreads but how much of each ...
> ... can't increase the memory speed significantly as the RAM is too far away from the cpu for the info to get there any faster even if it moves at ...
> This may seem dumb but: Why won't they put more memory on the CPU itself?
hi Ragnar,
no, not dumb at all. Paul is partly right about cost, but it is more about getting the best value for the next increase in cost.
In my previous post in this thread I mentioned that if a CPU was totally memory bound, it would always be running at the speed of the memory bus: putting in an extra core would not speed things up at all, and putting in more on-chip memory would make a big difference.
If the cache was 100% efficient even after adding a second core, then the second core would speed things up 100% and extra memory would be irrelevant.
In most cases it is now fairly evenly balanced between the two, and that is why we are starting to see chips with dual cores for the first time.
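The two extremes above can be sketched with a toy model in which the chip's throughput is capped either by what the cores can execute or by what memory can deliver. The function names and the rates are hypothetical and in arbitrary units; real chips are far less tidy.

```python
def chip_throughput(cores, per_core_rate, memory_rate):
    """Work per second when all cores share one memory path (toy model).

    per_core_rate: what one core could execute if it never waited.
    memory_rate:   what the cache and memory system can feed to the chip.
    """
    return min(cores * per_core_rate, memory_rate)


def second_core_speedup(per_core_rate, memory_rate):
    """Throughput with two cores divided by throughput with one."""
    one = chip_throughput(1, per_core_rate, memory_rate)
    two = chip_throughput(2, per_core_rate, memory_rate)
    return two / one


# Totally memory bound: the second core adds nothing.
print(second_core_speedup(per_core_rate=10.0, memory_rate=5.0))
# Cache keeps both cores fed: the second core doubles throughput.
print(second_core_speedup(per_core_rate=10.0, memory_rate=100.0))
# In between: some gain from the core, some left to gain from memory.
print(second_core_speedup(per_core_rate=10.0, memory_rate=15.0))
```

In the in-between case both a second core and more on-chip memory would help, which is the balance point described above.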
This would have been technically possible 10yrs ago, but at the expense of smaller or no on-chip cache: at that stage your solution of simply adding more memory (as much memory as possible for the size/price of chip) was the solution that added the best gain in performance.
Now that we have reached the balance point between adding cores and adding memory, expect to see chips with more memory and dual cores, more memory and quad cores, even more memory and quad cores, in turn as the balance tips one way or the other.
In this sense Paul is right: the choice is never a technical one (in terms of what is possible) but an economic one, in terms of what delivers best performance for the target price of the next release of chips.
~~gravywavy
> no, not dumb at all. Paul is partly right about cost, but it is more about getting the best value for the next increase in cost.
> In this sense Paul is right: the choice is never a technical one (in terms of what is possible) but an economic one, in terms of what delivers best performance for the target price of the next release of chips.
Thank you! :)
You did say it better ...
We are seeing some new and interesting technical innovation now because we have effectively hit the wall with increases in clock speed. This problem has been predicted for nearly a decade, but until recently has been held off with changes in design rules, the addition of distributed clock mechanisms, etc.
Heck, Hyper-Threading was added to take advantage of pipeline stalls and to make effective use of the "opportunity". Because programs do have the potential to use multiple CPUs, we can now see the benefits of multiple cores/logical CPUs/etc. giving us increases in speed.
Isn't it also true that cache memory inside the cpu takes up a lot of die area, which is where the main cost would come from if you were to simply increase the cache size? I also read somewhere (Intel site I think) that in 2006/7 they will start using 45 nm and then 32 nm technology, which should also provide a great boost as far as fitting more cores into a single chip.
> Isn't it also true that cache memory inside the cpu takes up a lot of die area, which is where the main cost would come from if you were to simply increase the cache size? I also read somewhere (Intel site I think) that in 2006/7 they will start using 45 nm and then 32 nm technology, which should also provide a great boost as far as fitting more cores into a single chip.
Actually, the major cost component is in the yield. If a given wafer has, say, 20 flaws, evenly distributed across the surface, and the wafer can only make 20 chips, it is likely that almost all of the chips will contain a flaw.
As the chip size increases, the probability of a flaw on any one chip increases and the yield decreases. And there is literally no way to make a perfect wafer.
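A minimal sketch of why yield falls as chips get bigger. It uses a different assumption from the "evenly distributed" case above: here each flaw lands on one chip chosen uniformly at random, independently of the others, and any flaw kills the chip. The wafer and flaw counts are illustrative only.

```python
def expected_yield(chips_per_wafer, flaws_per_wafer):
    """Expected fraction of flaw-free chips on a wafer.

    Assumes each flaw lands on one chip chosen uniformly at random,
    independently of the others, and that any flaw kills the chip.
    """
    miss_probability = 1.0 - 1.0 / chips_per_wafer
    return miss_probability ** flaws_per_wafer


# Same wafer, same 20 flaws, but progressively smaller chips.
for chips in (20, 80, 320):
    y = expected_yield(chips, flaws_per_wafer=20)
    print(f"{chips:4d} chips per wafer: {y:.0%} expected good")
```

Under this assumption, cutting the same wafer into smaller chips leaves a larger share of them untouched by any flaw, so a big cache that enlarges the die costs more than just its share of the silicon.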