How many of them have you met? How many compilers have you written on your own?
BM
RE: How many of them have
Good questions.
None this generation. No compiler, but I once wrote an OS. In the course of reviewing the official OS I saw that many of the instructions used for time-critical operations were not the most efficient ones. The authors claimed exhaustion and deadlines. Probably true. Management could have spent the resources to tighten things up later but preferred otherwise.
With a compiler you can review the code generated for a particular statement and judge whether there is a better choice. This makes checking the quality of the compiler easy.
An assembler coder knows the state before and after the statement and can select better code accordingly, but the change in 402 vs 424 is so dramatic it makes you wonder.
I've looked into the
I've looked into the functions provided by IPP or MKL, but they're not worth it. They are mostly way too complex for what we are doing in our code.
More than 90% (99% before optimization) of the run time is spent within a single loop that once had about a dozen lines of C code, containing only simple multiplications and additions (and the very few divisions we couldn't avoid - I think there's actually only one of them left - hm, gives me another idea...). We have parallelized/vectorized as much as we could.
You can map the operations we perform to matrix operations, but because of the overhead these bring with them, even a highly optimized library will not be faster than doing just the necessary low-level operations.
The latest speedup came mostly from avoiding type conversions (which require setting the rounding mode, and that is slow), a bit of optimizing the interface to the assembler-coded parts, and some playing with the C code. It's always a tradeoff you have to make - global static variables are the fastest in many cases, but with too many of them your code becomes unreadable and unmaintainable.
I also found that, apparently due to caching effects, some compiler switches that worked well for previous versions (e.g. unrolling loops) weren't optimal for this code.
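As a rough picture of what loop unrolling does, and why it can interact with caches, here is a sketch of a multiply-add loop in plain form and unrolled by four by hand. The loop itself is an assumption for illustration (the thread only says the hot loop consists of simple multiplications and additions), and the function names are hypothetical. With gcc the same transformation can be requested via `-funroll-loops`.

```c
#include <stddef.h>

/* Plain multiply-add loop: small code, one iteration per element. */
float mac_plain(const float *a, const float *b, size_t n)
{
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++)
        sum += a[i] * b[i];
    return sum;
}

/* The same loop unrolled by four with separate accumulators.
   Fewer branches per element, but roughly four times the loop
   body, which takes more room in the instruction cache. */
float mac_unrolled4(const float *a, const float *b, size_t n)
{
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += a[i]     * b[i];
        s1 += a[i + 1] * b[i + 1];
        s2 += a[i + 2] * b[i + 2];
        s3 += a[i + 3] * b[i + 3];
    }
    for (; i < n; i++)          /* leftover elements */
        s0 += a[i] * b[i];
    return (s0 + s1) + (s2 + s3);
}
```

Whether the unrolled form is faster depends on the surrounding code and the cache sizes of the target CPU, which matches the observation that a switch that helped one version can hurt another. Because the additions happen in a different order, the two versions may also differ in the last bits of a floating-point result.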
Finally, I now use the same compiler and settings for Windows that we use for the Linux version (gcc 4.1), at least for this critical module. This saved some interfacing and quite some maintenance, and brought the Windows App up to the speed our Linux App had before.
So - the libraries don't help us, as we don't perform standard operations (such as FFT), and the latest speedup of the Windows App was due to a combination of things, roughly half of which didn't come from the assembler coding.
I'll have another try with the Intel compiler, but I think all critical parts have by now been taken out of the hands of the compiler by our assembler coding anyway, so I don't expect much of it.
BM
RE: More than 90% (99%
Sounds like everything fits into L1 cache. At least on AMD.
Too bad they only make 64k/64k. A 32k/32k part would show whether there is any room left.
Can you split a WU in the middle, work the halves with two threads, and see if you escape thrashing?
I just saw that Conroe has L1
I just saw that Conroe has a 32 KB L1 instruction cache and a 32 KB L1 data cache. That would be interesting to play around with.
Since v4.24 is now the
Since v4.24 is now the "official" Windows application, shouldn't it be removed from the Beta Test page?
RE: Since v4.24 is now the
Thanks for the hint. For the moment it may help people with problems automatically downloading the official App to have this at hand for manual installation. It will be removed from the Beta App page at the next update.
BM
If I am running the Beta Test
If I am running the Beta Test App 4.24, what should I do now to switch to the official one?
just delete the app_info.xml
Just delete the app_info.xml file. The executables should be identical, otherwise the version numbers would've been different.
Thanks will do that but will
Thanks, will do that, but I will let my work cache run dry first by setting "no new work".
RE: Thanks will do that but
That isn't necessary. Just delete the app_info.xml file from your einstein project folder. That is all you need to do.
me-[at]-rescam.org