RE: somebody know who is ...
Yup, see http://einstein.phys.uwm.edu/contributors.php . Ben Knispel is the lead physicist, and Oliver Bock, Bernd Machenschalk and I are the main contributors in terms of software engineering for BRP4.
I think there is no single, easy explanation for the relative performance of the 660Ti and the 560Ti; it is a complex matter, and most of it has already been mentioned here. Let's try to sum up the most essential points:
1) "According to WIKI the 660Ti should have 2x the GFLOPS" . Yes, but this is the theoretical peak performance. This is an unrealistic metric, if you look at real world GPGPU benchmarks (e.g. OpenCL benchmarks here http://www.clbenchmark.com/ ) you will see that a performance increase of a factor of (say) 1.3 is more realistic as an upper limit of what you can expect.
2) Remaining CPU processing time. During the evolution of the ABP/BRP apps, we have successively reduced the share of processing that remains on the CPU. On a modern PC, the CPU version of BRP4 can still take over 40k seconds, while the GPU version consumes less than 200 sec of CPU time on fast computers. What does that mean? It means that 99.5 % of the work that was done on the CPU has now been moved to the GPU, with only 0.5 % remaining on the CPU. That is not too bad, and I currently think that we will not improve on it in the near future (at least I have run out of ideas for speeding up the remaining 0.5 % significantly). But it does have an effect on the relative runtime when comparing different GPUs.
A hypothetical (simplified) example:
Configuration 1: GPU time 1000 sec + CPU time 200 sec = 1200 sec total
Configuration 2 (card is 1.33 times faster): GPU time 750 sec + CPU time 200 sec = 950 sec total
The overall saving is 250 sec, or a speedup factor of only ~1.26. So, especially for fast GPUs, the overall runtime doesn't scale with the GPU performance gain because of the remaining CPU time, even though that remaining CPU time is just a tiny fraction of the original CPU time (a small numeric sketch follows below, after point 3).
3) Optimization for Fermi/Kepler GPUs (or lack of it): The BRP4 app is a one-app-fits-all build. It is meant to be compatible with as many GPU models as possible. We already have > 20 different app versions to care about (different searches (GW, gamma-ray & binary radio pulsars), different OSs, 32/64 bit, CPU/NVIDIA/AMD(OpenCL) ...) and we are reluctant to add more versions unless we can achieve a significant performance increase or a decrease in compatibility problems. It is on our TODO list to test whether CUDA apps using newer CUDA API versions, and possibly compilation options specifically for Fermi and/or Kepler, would increase the performance enough to justify adding at least 5 (!) new app versions (e.g. OSX CUDA Kepler, Windows 32 bit CUDA Kepler, Windows 64 bit CUDA Kepler, Linux 32 bit CUDA Kepler, Linux 64 bit CUDA Kepler). Potentially we could have unified app versions that switch kernel code at run time, but that needs to be investigated (see the rough sketch at the end of this post). Anyway, just testing all those versions would require a big effort, and currently we have a lot of other stuff on our desks (the new GW run is coming closer).
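To make the effect described in point 2 a bit more concrete, here is a tiny back-of-the-envelope sketch (using only the hypothetical numbers from the example above, not real measurements) of how a fixed CPU share caps the overall speedup:

// speedup_sketch.cpp - illustrates point 2 with the hypothetical numbers above
// build e.g. with: g++ -std=c++11 speedup_sketch.cpp -o speedup_sketch
#include <cstdio>

int main() {
    const double cpu_time = 200.0;                         // fixed CPU share per task (sec)
    const double gpu_time = 1000.0;                        // GPU share on the slower card (sec)
    const double gpu_speedups[] = {1.0, 1.33, 2.0, 4.0};   // hypothetical GPU gains

    for (double s : gpu_speedups) {
        double total   = gpu_time / s + cpu_time;          // new total runtime
        double overall = (gpu_time + cpu_time) / total;    // effective task speedup
        std::printf("GPU %.2fx faster -> task only %.2fx faster (%.0f sec)\n",
                    s, overall, total);
    }
    return 0;
}

With these numbers, even an infinitely fast GPU could never do better than 1200/200 = 6x overall, which is why the last 0.5 % of CPU work still shows up in the runtime of very fast cards.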
But Einstein@Home, like some other BOINC projects, has always been "open" in the sense that we appreciate contributions by volunteer developers (I myself worked as a volunteer developer for E@H for several years before joining the AEI). The source code is available (see the link on the homepage), and while we cannot promise day-to-day support for external developers, we will try as much as we can to offer a hand now and then in setting up tests etc. Our test project "Albert@Home" would be a perfect platform for volunteer developers to test their optimizations without fear of breaking the science. What we would encourage is that volunteer developers (in accordance with the Open Source software rules) report their changes back to the project so that they can be integrated into the official apps and distributed automatically to all users.
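And for anyone who would like to experiment with the point-3 idea on Albert@Home: runtime kernel switching could, very roughly, look like the sketch below. This is only an illustration and not code from the BRP4 app - the kernel names are made up, and the real dispatch would of course live inside the app's framework. The point is that a single fat binary can carry machine code for several architectures (built with nvcc -gencode arch=compute_20,code=sm_20 -gencode arch=compute_30,code=sm_30 ...), so there would still be just one app version per platform to maintain.

// dispatch_sketch.cu - hypothetical runtime kernel selection, NOT the real BRP4 code
#include <cuda_runtime.h>

// two made-up kernel variants: a generic one and one tuned for Kepler
__global__ void fold_generic(const float *in, float *out, int n) { /* ... */ }
__global__ void fold_kepler (const float *in, float *out, int n) { /* ... */ }

void run_fold(const float *d_in, float *d_out, int n)
{
    int dev = 0;
    cudaDeviceProp prop;
    cudaGetDevice(&dev);
    cudaGetDeviceProperties(&prop, dev);

    dim3 block(256);
    dim3 grid((n + block.x - 1) / block.x);

    if (prop.major >= 3) {
        // compute capability 3.x = Kepler (GTX 6xx): use the tuned variant
        fold_kepler<<<grid, block>>>(d_in, d_out, n);
    } else {
        // Fermi and older cards fall back to the generic variant
        fold_generic<<<grid, block>>>(d_in, d_out, n);
    }
}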
Cheers
HB
RE: The BRP4 app is a ...
Wow, I didn't know that the number of app versions was getting so high!
Cheers, Mike
I have made this letter longer than usual because I lack the time to make it shorter ...
... and my other CPU is a Ryzen 5950X :-) Blaise Pascal
Hallo! In different fora I ...
Hallo!
In different forums I read - unfortunately I can't remember where - that NVIDIA limited the math crunching speed of the new generation of graphics cards to the level of the former generation. The reason is that they want a bigger performance gap to their TESLA crunching cards, which are really expensive.
I'm somewhat surprised that in all the threads concerning the new generation of graphics cards, there are no tests reported on how much electrical power these new cards save. As I see from the Grafikchip-Rangliste, the GTX 660 Ti is listed with a maximum power of 150 W, whereas the GTX 560 Ti consumes 210 W - a difference of 60 W in favour of the new card generation. If I take energy costs of 0.25 €/kWh, continuous crunching 24 h/day and a price of 260 € for a GTX 660 Ti, I get a payback time of almost exactly 2 years just from the saved electrical power. But this is a theoretical value. Some practical measurements would be of value.
Kind regards and happy crunching
Martin
RE: Hallo! In different ...
I'm quite sure this was referring to double precision floating point performance only. If you compare the double precision floating point performance of modern AMD GPUs with their consumer NVIDIA counterparts, the NVIDIA GPUs have absolutely no chance to compete, for the reason you mentioned. This is not important for us here, because the BRP4 app only uses single precision floating point math. NVIDIA has no intention of limiting single precision floating point performance in their consumer products by the same logic, because their TESLA product that focuses on single precision math applications (the Tesla K10) actually uses the same GPU as the top-notch Kepler consumer cards, the GTX 680 and GTX 690. They can still charge more for the Tesla K10 because it has extra memory and error correcting/detecting memory.
A very good point indeed!! I recently swapped a GTS 450 for a GTX 650 Ti for exactly this reason: higher performance (even though it's not too spectacular) at lower power consumption.
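Just to write Martin's estimate down in one place - a quick sanity check of the payback time using the figures from his post (60 W difference, 24 h/day of crunching, 0.25 €/kWh, 260 € for the card). Real-world crunching draw will of course differ from the TDP numbers, which is exactly why practical measurements would be nice:

// payback_sketch.cpp - back-of-the-envelope payback time for the 60 W saving
#include <cstdio>

int main() {
    const double watts_saved   = 60.0;   // 210 W (GTX 560 Ti) - 150 W (GTX 660 Ti)
    const double hours_per_day = 24.0;   // continuous crunching
    const double eur_per_kwh   = 0.25;
    const double card_price    = 260.0;  // EUR

    double eur_per_day  = watts_saved / 1000.0 * hours_per_day * eur_per_kwh;  // ~0.36 EUR/day
    double payback_days = card_price / eur_per_day;                            // ~720 days

    std::printf("saving %.2f EUR/day -> payback after %.0f days (~%.1f years)\n",
                eur_per_day, payback_days, payback_days / 365.0);
    return 0;
}

That works out to roughly 720 days, i.e. about 2 years, so the estimate checks out.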
Cheers
HB
RE: Hallo! In different ...
That's very true for double-precision processing, but doesn't apply here - Einstein only uses single-precision maths, which AFAIK wasn't limited by NVidia.
On another project, I'm seeing something like a 50% performance boost for a GTX 670 by using a cuda42 build targeted at the Kepler architecture instead of a generic/Fermi cuda32 build.
Hallo Bikeman! RE: I'm ...
Hallo Bikeman!
Thanks for clarifying this.
Kind regards and happy crunching
Martin
Regarding all the different ...
Regarding all the different versions, why not just provide 32-bit apps? That reduces your version count while still being compatible with the 64-bit hosts. About the only place this might be an issue is 64-bit Linux systems that don't have the 32-bit libs.
Another option might be to move up the minimum driver version (with appropriate warning time given to users) in order to bring the CUDA version up from 3.2. This is the situation GPUgrid is in at the moment. They are talking about keeping a CUDA 3.1 app only on one of their 3 work queues, as they need the newer features to do some of the science. They will use CUDA 4.2 on the others.
BOINC blog
RE: Regarding all the ...
There are so many Linux systems that lack 32 bit compatibility libs that this is actually a significant problem for us. E.g. there are several big clusters in academic institutions that run E@H jobs on idle nodes and they will not change their setups just for E@H :-).
At some point we will want to do something like that as well; I'm just not sure we are at that point yet. It's a compromise. In the past we have always tried to provide compatible apps even for systems that could be described as "legacy stuff" (e.g. PowerPC Macs, Pentium III boxes without SSE2 ...), but as soon as the number (and combined performance) of the hardware in question falls below a certain threshold, the extra effort is no longer justified and we drop support for it.
Even when we move to (say) CUDA 4.2, we will most likely still want to support some pre-Fermi cards like the GeForce 9800 GT or the GT 2xx series, which might not be very competitive compared to Fermi/Kepler cards, but are still much faster than crunching on CPUs.
Bottom line: we keep an eye on it.
Cheers
HB