But didn't the Athlon (since the XP) always feature 2 independent FP execution units? The 1st is for FADD and SSE and the 2nd for FMUL and SSE, together with the FStore, of course, and the usual 3-wide decode and dispatch.
Hi,
You are right. FADD + FMUL + FSTORE is the conventional
solution, but Bulldozer uses 2 FMAC units instead.
That offers some advantages and lots of disadvantages...
One of the advantages may be important in our case:
the two FMACs can execute 2 FADDs or 2 FMULs simultaneously.
(I haven't measured its effect on performance.)
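To picture why an FMAC unit can stand in for both an adder and a multiplier, here is a minimal sketch. It emulates a fused multiply-accumulate (a*b + c with a single rounding) using exact fractions. The helper names `fmac`, `fadd_on_fmac` and `fmul_on_fmac` are made up for illustration, and this is only a model of the arithmetic, not a description of how Bulldozer routes instructions internally.

```python
from fractions import Fraction

def fmac(a, b, c):
    """Emulate a fused multiply-accumulate: a*b + c with a single rounding."""
    # Fractions hold the exact product and sum; float() rounds once at the end.
    return float(Fraction(a) * Fraction(b) + Fraction(c))

def fadd_on_fmac(a, c):
    # An add can be pictured as a*1 + c (illustrative assumption).
    return fmac(a, 1.0, c)

def fmul_on_fmac(a, b):
    # A multiply can be pictured as a*b + 0 (illustrative assumption).
    return fmac(a, b, 0.0)

a = 1.0 + 2.0**-30
b = 1.0 - 2.0**-30
separate = a * b + (-1.0)   # product is rounded first, then the sum is rounded
fused = fmac(a, b, -1.0)    # exact product and sum, rounded once

print(fadd_on_fmac(1.5, 2.25), fmul_on_fmac(1.5, 2.0))
print(separate, fused)
```

The last line shows the one numerical difference of the fused form: the intermediate product is not rounded, so the two results can differ in the low bits.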
Is a Bulldozer code optimisation worthwhile?
Or would development effort be better concentrated on making better use of GPUs?
Slight aside: How do the AMD "APU"s compare? Are the APU extras easily utilised?
Happy fast crunchin',
Martin
See new freedom: Mageia Linux
Take a look for yourself: Linux Format
The Future is what We all make IT (GPLv3)
The APU is a standard AMD(ATI) GPU welded onto the side of the CPU.
The S6GC application would need a major redesign to work on a GPU. GPUs only work well for applications that need just a tiny amount of memory per thread, or that have extremely sequential memory access patterns. S6GC uses a lot of memory and accesses it in a sufficiently random way that, back when CUDA was new and shiny and nVidia was using its engineers to help create flagship apps for promotional purposes, they couldn't get it running any faster than on a CPU. Adding a bigger GPU just resulted in more of the chip being stuck waiting for memory reads/writes to go through.
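A rough way to see why the access pattern matters: the sketch below counts how many memory segments a group of GPU threads has to fetch when each thread loads one element. The warp size of 32, the 128-byte segment and the 4-byte element are assumptions chosen as typical values, and the function name is made up; real hardware is more complicated than this toy model.

```python
import random

WARP_SIZE = 32        # assumption: threads that issue their loads together
SEGMENT_BYTES = 128   # assumption: size of one memory transaction
ELEMENT_BYTES = 4     # assumption: 32-bit values

def segments_touched(indices):
    """Count the distinct memory segments needed to serve one load per index."""
    return len({(i * ELEMENT_BYTES) // SEGMENT_BYTES for i in indices})

# Neighbouring threads read neighbouring elements.
sequential = list(range(WARP_SIZE))

# Each thread reads from an unrelated place in a large array.
rng = random.Random(0)
scattered = [rng.randrange(1_000_000) for _ in range(WARP_SIZE)]

print("sequential:", segments_touched(sequential))
print("scattered: ", segments_touched(scattered))
```

In this model the sequential pattern is served by a single transaction, while scattered reads can need up to one transaction per thread, which is the "stuck waiting for memory" situation described above.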
I don't think so. This core versus module issue is not something the actual single-threaded app "sees". That's a matter entirely for the OS task scheduler, and maybe for hand-tuned multi-threaded apps.
What the project may be able to do is to recompile with some kind of Bully optimization (if there is anything like that), or maybe add a separate code path for some hot loop. But I doubt there would be much to gain. AVX support might help (on the Intels as well), but I asked about this some time ago and apparently the potential gain in E@H doesn't look too tempting.
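For what a "separate code path" could look like in principle, here is a small sketch that picks a variant of a hot loop from the CPU feature flags. It assumes text in the format of Linux /proc/cpuinfo; the code path names and the sample text are hypothetical and are not taken from the E@H application.

```python
def cpu_flags(cpuinfo_text):
    """Return the feature flags from text in Linux /proc/cpuinfo format."""
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags"):
            return set(line.split(":", 1)[1].split())
    return set()

# Hypothetical code paths, ordered from most to least specialised.
CODE_PATHS = [
    ("fma4", "hot_loop_fma4"),
    ("avx", "hot_loop_avx"),
    ("sse2", "hot_loop_sse2"),
]

def pick_code_path(flags):
    """Choose the first code path whose required flag is present."""
    for flag, name in CODE_PATHS:
        if flag in flags:
            return name
    return "hot_loop_generic"

# Made-up excerpt standing in for the contents of /proc/cpuinfo.
sample = "processor\t: 0\nflags\t\t: fpu sse sse2 avx fma4 xop\n"
print(pick_code_path(cpu_flags(sample)))
```

On a real Linux host the text would come from reading /proc/cpuinfo; whether any such path would actually pay off is exactly the open question in the post.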
MrS
Scanning for our furry friends since Jan 2002
Hi, this is my profile: FX-6100 and GTX 550 Ti. Most of the time I am running other tasks on my desktop too.
http://einsteinathome.org/account/143295/computers
Thank you!
Here are the details of your Bulldozer/GTX560TI [Win7_64] compared to my Phenom II x6/GT240 [Ubuntu 11.04_64] (http://einsteinathome.org/host/3805542) and an I7 2600K/GTX560 [Ubuntu 11.10_64] running 8 tasks at a time with HT (http://einsteinathome.org/host/4237123).
Phenom Bench Marks:
Measured floating point speed 2,480.14 million ops/sec
Measured integer speed 14,953.99 million ops/sec
Bulldozer Bench Marks:
Measured floating point speed 2,418.34 million ops/sec
Measured integer speed 9,494.01 million ops/sec
Intel I7 Bench Marks:
Measured floating point speed 3,261.51 million ops/sec
Measured integer speed 12,522.66 million ops/sec
Task times (most recent 5 tasks):
Bulldozer:
CPU Sec: 1,060.73, 1,080.85, 1,071.34, 1,065.27, 1,064.07
Run time: 2,854.81, 2,896.77, 3,192.96, 2,849.72, 2,866.32
I7:
CPU Sec: 577.31, 577.10, 576.26, 573.68, 571.94
Run time: 3,179.28, 3,184.89, 3,177.90, 3,162.97, 3,161.87
Gravitational Wave S6 GC search v1.01 (SSE2):
Phenom:
CPU Sec: 20,100.59, 20,281.95, 20,169.62, 20,160.31, 20,118.88
Run time: 21,658.95, 21,554.15, 21,706.22, 21,504.95, 21,571.40
Bulldozer:
CPU Sec: 22,693.31, 22,834.50, 22,800.11, 22,874.71, 22,944.72
Run time: 24,903.33, 25,173.36, 25,157.99, 25,231.34, 25,608.83
I7:
CPU Sec: 18,002.43, 18,000.19, 18,125.64, 18,080.39, 18,058.62
Run time: 20,944.07, 20,967.78, 20,928.44, 20,673.16, 20,614.86
I'll leave the interpretation of these numbers as an exercise for the reader.
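As a starting point for that exercise, the sketch below averages the S6GC CPU times listed above and expresses them relative to the Phenom. The numbers are copied from the post; using the Phenom as the baseline is just a choice made for this example.

```python
from statistics import mean

# CPU seconds for the five most recent S6GC v1.01 (SSE2) tasks, copied from the post.
cpu_seconds = {
    "Phenom": [20100.59, 20281.95, 20169.62, 20160.31, 20118.88],
    "Bulldozer": [22693.31, 22834.50, 22800.11, 22874.71, 22944.72],
    "I7": [18002.43, 18000.19, 18125.64, 18080.39, 18058.62],
}

# Mean CPU time per host, and the same value relative to the Phenom.
mean_cpu = {host: mean(times) for host, times in cpu_seconds.items()}
relative = {host: m / mean_cpu["Phenom"] for host, m in mean_cpu.items()}

for host in cpu_seconds:
    print(f"{host:10s} mean {mean_cpu[host]:10.2f} s  x{relative[host]:.3f} vs Phenom")
```

On these five tasks the Bulldozer needs roughly 13% more CPU time per task than the Phenom and the I7 roughly 10% less. The hosts differ in OS and in how many tasks run at once, so this is not a like-for-like comparison of cores.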
Joe
ps. I hate trying to get columns to line up in bb-code.
On some project sites, the [pre][/pre] tags actually work. ;-)
Regards,
Gundolf
Computers aren't everything in life. (Just kidding)
Thank you.
Too late to edit this one, I'll TRY to remember that incantation for the next one.
BTW do we have a list of BB tags that work on this site?
Joe
Most of them do work on most sites. It's just the [pre][/pre] tag that doesn't on some sites.
Regards,
Gundolf
Computers aren't everything in life. (Just kidding)
Upper left of the post window, under "message". If I recall correctly, someone had to point it out to me as well :)
Use BBCode tags to format your text
seeing without seeing is something the blind learn to do, and seeing beyond vision can be a gift.