RE: But are you sure the
Yes. I'm sure.
The key: ABC@Home uses lots of operations on 64-bit integers.
These operations are 2-3 times faster in 64-bit mode.
RE: RE: But are you sure
Ah, I see! Not something E@H would benefit from (mostly floating point ops), tho. Number theory projects are a different story, I admit!
CU
Bikeman
I have to correct myself once
I have to correct myself once more: there is one area where even floating-point calculations would benefit from the 64-bit instruction set.
Some compilers tend to use a pair of 32-bit integer move instructions to copy a 64-bit double-precision float. This is very bad: it mixes 32-bit writes and a 64-bit read on the same data, which causes a so-called "store forwarding stall", and that is quite expensive. I see this a lot in the code the MS compiler generated for the latest Windows beta app. With 64-bit instructions, even the dumbest compiler would copy a 64-bit float in a single instruction, preventing the stall.
Even in 32-bit mode there are ways around this effect, but in 64-bit mode it's kind of foolproof :-)
CU
Bikeman
The biggest advantages for
The biggest advantages of 64-bit mode are that the register files hold twice as many registers, that SSE2 support is guaranteed, and that function arguments can usually be passed in registers.
Doubling the number of registers reduces wasted cycles for these reasons:
* The small register file of 32-bit x86 forces the compiler or assembly programmer to emit extra move, spill, and reload instructions, spending cycles and cache throughput that could otherwise go to computation. More instructions competing for the cache also raises the chance of a cache miss, and a miss usually stalls the processor until the data arrive from a larger, slower cache level or from main memory, unless the processor has some form of hardware multithreading.
* CPU pipelines have grown very deep, forcing the use of many techniques to keep them full and the execution units busy. Unfortunately, these techniques only get you so far. With few registers, dependent instructions end up close together and create bubbles: empty pipeline slots, or groups of them, that accomplish nothing except resolving dependencies. With more registers, dependent instructions can be spread farther apart, with independent work (whose data live in the extra registers) filling the gap, which shrinks the bubbles or eliminates them entirely, resulting in greater efficiency.
The guarantee of SSE2 speeds things up because SSE2 can do nearly anything the x87 FPU can do, but more efficiently. With the old x87 FPU used in 16- and 32-bit modes, the data you want to work on must sit at the top of the FPU stack, forcing the compiler or assembly programmer to emit register-exchange instructions that waste time. With SSE2, you can operate on any two registers you like, and can apply the same operation to small arrays of data if desired, raising efficiency. You also do not have to write code that checks whether SSE2 is present, which wastes both time and program space.
Passing values to functions in registers is much faster than passing them on the stack (a structure in memory), as is done in 32-bit and 16-bit mode. When a function is called in those modes, the arguments are usually written to memory and then read back from memory by the function that needs them. In 64-bit mode, the arguments stay in registers unless there are too many of them for the calling convention being used, in which case the compiler or assembly programmer must push the remaining arguments onto the stack as in 32-bit or 16-bit mode.
Remember, registers are much faster than memory and caches. However, registers take far more chip real estate per bit than caches or RAM, so CPU designers cannot add many of them and still keep the CPU inexpensive. Architecture design is therefore a compromise between speed and cost, and it is the compiler's and the assembly programmer's responsibility to use every register effectively.
And one more stone in this
And one more stone along this way: integer computations are significantly faster than floating-point ones. With large registers, why not replace some FP operations (where limited precision is acceptable) with fixed-point integer computation? Or maybe it is better to use SSE2 instead (I don't know SSE2 well enough yet)?
RE: And one more stone in
It was true about 5-10 years ago.
Let it be so. I'm familiar
Let it be so. I'm familiar with the new processor architectures, especially with the new instruction subsets like SSE, SSE2, SSE3, SSSE3 and so on. But the point of this thread is that we should ask Bernd to compile a 64-bit binary. Or maybe you can try it yourself, Akos?
RE: But the key of this
You should try to ask Bernd to compile a 64-bit binary.
I can't do it. I don't have access to the sources.
I would be glad to see x86-64 code...
I think that unless you know
I think that unless you know the code inside and out, there isn't much of a way to be sure whether it would benefit from a 64-bit recompile except trying it out. From what I understand, porting to x64 is pretty easy. The main challenge is that I don't believe there is a way to compile Win64 binaries with GCC (in that there is no 64-bit Cygwin). Do you guys compile the Win32 binary with Visual Studio? Something else? How hard would it be to at least test it and see whether it is worth bothering with?
The windows boinc code is all
The Windows BOINC code is all written for the MS compiler. Bernd has previously tried (and failed) to get it to compile with GCC.