Thanks, probably missed that. I'll take another look next week.
DF1DX wrote:
Does this change also work on AMD cards?
Thanks Bernd!
_________________________________________________________________________
Bernd Machenschalk wrote:
Don't forget about the change to "twiddles" also, not just twiddle_dee. There are 3 main changes in the code I sent you: the change to __global for both twiddles and twiddle_dee, and the change from lds[64][64] to lds[64][65] that I mentioned in the last post. twiddle_dee was already addressed in v1.25/1.26, but twiddles should be changed too according to petri. It's in the bottom section of the code I sent over.
change
__constant float2 twiddles[
to
__global float2 twiddles[
_________________________________________________________________________
Thanks!
Have another go with 1.27.
BM
Hm - clFFT has its own clBuildProgram() calls with their own options - I think I'll have to patch these, too.
Try 1.28.
BM
Good improvement. The runtime dropped from around 28 to about 17 minutes on my old 1050 Ti! (1 WU, Linux Mint, driver 490.57).
Does this change also work on AMD cards?
v1.28 works well. I see similar behavior and runtimes with v1.28 as with our manual code injection.
It should work, but from our tests the speed improvement isn't as dramatic as on Nvidia cards - maybe ~20% or less. We really only tested Polaris and Navi (not "big" Navi) cards, so the improvement on other architectures is unknown. There might need to be some other changes in the AMD app to make it work; we had to tweak the code injection to get it working with AMD. Just remember that if this is implemented you will need OpenCL 2.0 drivers. Many people have been running their cards with the legacy (OpenCL 1.2) install because it was easy and it works, but these new techniques only work with OpenCL 2.0.

Right now I think the project admins have only changed the Nvidia apps.
_________________________________________________________________________
I added app (Beta Test) versions for AMD/ATI w. OpenCL 2.0.
BM
Thank you all for the tremendous contributions! The speedup is greatly appreciated :) It is weird to think that such small, clever architectural changes in the code can help NVIDIA cards perform so much more efficiently. And most of that (to the best of my understanding) is thanks to a different placement of the arrays. Love to see NVIDIA cards getting more competitive on E@H!

But reading through this thread, I got a bit confused. Whom do we have to thank for this code review and these awesome ideas? Where did you exchange ideas?
Cheers
Bernd Machenschalk wrote:
I added app (Beta Test) versions for AMD/ATI w. OpenCL 2.0.
Thanks Bernd.

At what point do these apps come out of beta testing and get released for general use? What are the criteria you're looking for?
_________________________________________________________________________
bozz4science wrote:
User petri33 took it upon himself to examine the code used in the Einstein apps. He, I, and several others had a hunch that there were some inefficiencies in the code that were really holding Nvidia back. The performance difference between comparable Nvidia and AMD GPUs was too great to be chalked up to “AMD is just better at this”.

Since the OpenCL code in the application is in plain text, it's easy to see what the app is doing. Additionally, you can dump the Nvidia compute cache to see what OpenCL code was compiled at runtime.
petri has vast knowledge and experience writing and optimizing applications for Nvidia GPUs for signal analysis. He wrote the custom Linux application that dominated over on SETI (4-5x faster than the project-provided apps).

petri devised a way to inject code into the application in real time. This allowed fast and easy testing of code changes without the need to modify or recompile the application. Simplistically, it looks for certain sections of code and swaps them out for better sections of code on the fly. Using this method he looked for “low hanging fruit” changes that would have a big impact. That's what was done here: no changes to the Einstein code, just optimizations for better memory access on Nvidia using some different types of arrays. I and a few others tested that these changes do in fact work and provide faster run times.
I then contacted Bernd via PM and sent him the code with some short explanations of what was done and what changes needed to be made. Bernd made the changes and incorporated them into the application.
Petri is still working on it, looking for more optimizations that can be done. But this recent change will probably be the biggest jump in performance for Nvidia, with maybe small iterative improvements going forward. The biggest limiting factor for him is time; he's a busy guy and is doing this in his spare time just for fun.

So petri is responsible for the idea and for figuring out exactly how to implement it. I helped him verify the findings by testing the code on my systems with several different types of GPUs. A few other members of our team also tested on various platforms and GPUs to sort out any bugs or issues. And then Bernd rolled it all into a more official application that can be used on Linux/Windows/Mac for the appropriate GPUs.

Most of the credit goes to petri, I think, since he found the issues and the project devs weren't looking for this - and I don't blame them, since they probably don't have enough time to dedicate to things like this. But we all certainly thank Bernd for his effort in digesting the info we gave him and figuring out how to apply it in the official applications (since the implementation is vastly different, even if the outcome is similar).

I hope that clarifies everyone's role in this.
_________________________________________________________________________