I'm not convinced that memory bandwidth has a significant effect on runtimes. I have two machines with identical GTX 650 models -- one with an Athlon II X4 645 at 3.1GHz and DDR3-1600 (9-9-9-24-41-1T) RAM, and the other with a Phenom II X4 965 at 3.4GHz and DDR2-800 (5-5-5-18-22-2T) RAM. Because of its primary use, I generally do not crunch on the box with DDR3-1600, but I decided to run a few workunits on it (9, to be exact) to see how it compared with the other. The runtimes on the box with faster RAM were only about 200 seconds faster at best and 100 seconds faster at worst.
[1-3-2017 Update - 8:09 PM PST:]
Well, another day, another MULTI K RAC DROP!!! Down 2.1K RAC at this moment. The falling hasn't stopped YET... Still crunching 15 Hours a day, at 6 Units at a time; 2 per GPU.
So, anyone who wants to take this as complaining, feel free to do so; THAT'S NOT what I'm doing, though... I AM chronicling the drop for the OFFICIAL record of things that are happening on my two systems.
MAC Invalids have DROPPED to 0 for now!!! Still have PLENTY of 1.17 and a few original Beta 1.17 Units Pending Validation, though... So, things could change; we'll see. RAC on the MAC has dropped to 47.8K from a recent high of 63K; so, an ALMOST 20K RAC DROP on the MAC since the new Units came into existence. Pendings are three Beta 1.17 Units and up to 11 Pending standard 1.17 Units... For the amount that the MAC crunches, these Pendings are quite low... I would have expected a slight climb in the past couple of days, given what's been said in this Thread about the rapid descent from BRP6 and BRP4G into the FGRPB1G Units...
Win XP Pro x64 is still crunching through original regular/standard 1.17 Units; it has picked up more of the latest Beta 1.17 Units with lowered estimated times. (I'm picking up these Beta Units BECAUSE I have turned ON "Accept Beta Work" in my Preferences. I did this because, in the past three years, there's been MANY a time that I've run QUITE low on Work in Queue on my two systems. When I mentioned this, it was RECOMMENDED that I turn this switch ON to ENSURE that my systems ALWAYS had some sort of work! I haven't run out of work SINCE!!!) NO Invalids on the XP Pro system with the GTX-760 card. ONLY 2 Pending original/regular/standard 1.17 Units...
So, in closing for today, it seems that Pendings have DROPPED significantly on the Windows system; BUT, RAC has dropped from 40K to 18.8K in a VERY, VERY SHORT TIME!!! Since I've had most of my Units on Windows Validate, SHOULDN'T RAC be picking up on the Win system??? Just asking...
I'm EAGERLY awaiting the new "climb" of RAC that has been foretold here... Anyone else seeing what I'm seeing??? Has ANYONE'S RAC started to climb with the 1.17 Units YET??? Just curious...
As to crunching MORE Units at a time: all three of my GPUs have ONLY 2GB of GDDR5 RAM on them. System RAM on the Win system is 8GB DDR3; the MAC has 16GB DDR2. CPU speed on the MAC is 3GHz, and on the Win system is 3.89GHz. The Win system hardware is MUCH newer than the MAC/Hackintosh. As mentioned in prior posts, the MAC is circa 2008/2009 EXCEPT for the TWO GTX-750TI SC cards, one of which is ONLY 6 months old. The ENTIRE Win system with the GTX-760 is 3.5 years old...
So, I don't know what anyone else needs from me spec-wise... I think I've detailed EVERYTHING as thoroughly as I can. I'm simply perplexed that my RAC could go from 103K down to 66.7K in SUCH short order AND NOT show ANY signs of recovery YET... How long should the recovery take??? When do I throw in the towel and say that my equipment isn't good enough anymore??? (I REALLY don't think that's the case!!!) So, I ask again, is anyone else seeing this MASSIVE drop and SUCH SLOOOOOW recovery??? Is this REALLY going to be the new norm???
My biggest issue with crunching multiple work units at once is the loss of CPU threads for CPU projects. With Parkes and Arecibo my GPU (980 Ti) was kept nice and busy with four work units, and didn't even fully load a single thread. Now I have to give up 2+ threads just to keep it close to fully loaded.
MarkHNC wrote: The runtimes on the box with faster RAM were only about 200 seconds faster at best and 100 seconds faster at worst.
Interesting, thanks for testing. How was the GPU usage? For my info, how do you explain the need for a full CPU core per GPU WU?
I do not know enough about OpenCL to explain the need for a full CPU core for each work unit. I assume it takes the full attention of a CPU core to keep the GPU fed. However, if that were the case, you would expect us to be getting more work done at a time than we did with Arecibo. For example, Arecibo didn't tax my GPU enough to bother me during most normal PC use, unless, of course, I tried to play a graphically-intensive game.
I have monitored GPU use running these new work units, and see that the GPU is pegging 100% nearly all of the time, even on the machine with slower system RAM. Interestingly, using HWMonitor, I see that Frame Buffer maxes at 82%, Bus Interface at 21%, and Memory at 46% (on a 2GB card). All four cores of the CPU remain at 100% during GPU processing, but, interestingly, the core temp varies slightly, whereas CPU core temps remain consistent during CPU-only processing. So I do wonder if most of the CPU "work" is just monitoring and occasionally feeding the GPU, as previously suggested by others.
MarkHNC wrote: So I do wonder if most of the CPU "work" is just monitoring and occasionally feeding the GPU, as previously suggested by others.
Don't forget that OpenCL doesn't behave the same way on AMD (my card) and NVIDIA (your card): while it only uses a core from time to time on AMD, it keeps a full core busy waiting on NVIDIA.
In the end it comes to the same thing: if you use your "free" core on AMD, it will not be available when it is needed and that will slow down GPU feeding. If it is not memory bandwidth, maybe it is the L2 cache?
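For readers who haven't looked at OpenCL host code, here is a minimal sketch of the pattern being discussed (placeholder names, not taken from any Einstein@Home application). The host enqueues a kernel and then blocks until the GPU is done; whether that blocking call spins on a CPU core or yields it is decided inside the vendor's OpenCL runtime, which is why the same code can behave so differently on AMD and NVIDIA:

```c
/* Minimal sketch, error checking omitted.  This code does not choose
 * between spinning and sleeping -- the vendor's OpenCL runtime does. */
#include <CL/cl.h>

void run_kernel_blocking(cl_command_queue queue, cl_kernel kernel,
                         size_t global_size)
{
    /* Hand the kernel to the driver; this call returns almost at once. */
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                           &global_size, NULL, 0, NULL, NULL);

    /* Block until all queued work has completed.  Reportedly NVIDIA's
     * runtime busy-polls here, pinning a CPU core at 100%, while AMD's
     * runtime mostly yields the core. */
    clFinish(queue);
}
```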
The Nvidia OpenCL application does not need to use a full CPU core. Raistmer has proved that at SETI with his SoG application and the -use_sleep command-line parameter (I see CPU usage between 20-40% when running one task at a time on the GPU, depending on whether the task is Arecibo or GreenBank). Of course, the SoG application has a lot of other command-line parameters that you have to get right to keep a decent load on the GPU when -use_sleep is activated.
But whether his method is applicable to the Einstein OpenCL application, I don't know. Perhaps you can ask him?
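I don't know how Raistmer's -use_sleep option is actually implemented, but the general idea behind a "sleep instead of spin" wait can be sketched like this (hypothetical and simplified, not his code): rather than calling the runtime's blocking wait, the host polls the kernel's completion event itself and sleeps between polls, so the core is free for other work at the cost of a little extra latency per launch.

```c
/* Hypothetical "sleep wait": poll the completion event and nap between
 * polls instead of letting the runtime spin on the core. */
#include <CL/cl.h>
#include <unistd.h>   /* usleep(); POSIX -- use Sleep() on Windows */

void run_kernel_sleep_wait(cl_command_queue queue, cl_kernel kernel,
                           size_t global_size)
{
    cl_event done;
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                           &global_size, NULL, 0, NULL, &done);
    clFlush(queue);                    /* make sure the work is submitted */

    cl_int status = CL_QUEUED;
    do {
        clGetEventInfo(done, CL_EVENT_COMMAND_EXECUTION_STATUS,
                       sizeof(status), &status, NULL);
        if (status > CL_COMPLETE)
            usleep(1000);              /* ~1 ms nap; the core is free */
    } while (status > CL_COMPLETE);    /* CL_COMPLETE is 0; errors are < 0 */

    clReleaseEvent(done);
}
```

The trade-off is the one hinted at above: CPU use drops sharply, but each launch can finish up to one sleep interval "late", which is part of why getting the other command-line parameters right matters for keeping the GPU loaded.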
This reminds me of a long-running discussion on Folding several years ago, when they had both an OpenCL and a CUDA app for the Nvidia cards, as well as OpenCL for the AMD cards. (Now they are all OpenCL.) The observation was the same as here: AMD by default does not require an entire CPU core, whereas Nvidia does require an entire core when using OpenCL. However, the Nvidia CUDA app did not require an entire core.
There was some debate about why that was, but the general consensus was that Nvidia did it to ensure that their cards were properly fed, by reserving an entire core with spin waits whether they needed it or not when running OpenCL. However, it was also stated that the application developer could defeat that behavior by turning off the unnecessary spin waits, though I have no idea how that is accomplished.
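For comparison, the CUDA runtime exposes the spin-versus-block choice directly to the developer through a device flag. This is a generic CUDA sketch, not Folding's or Einstein's actual code, but it shows roughly how a CUDA app can avoid burning a whole core:

```c
/* Generic CUDA runtime sketch: ask the driver to sleep the host thread
 * during GPU waits instead of spin-waiting on a core.  The flag must be
 * set before the device is first used by this thread. */
#include <cuda_runtime.h>

int main(void)
{
    cudaSetDeviceFlags(cudaDeviceScheduleBlockingSync);
    cudaSetDevice(0);

    /* ... allocate buffers and launch kernels as usual ... */

    /* With the flag above this wait blocks (low CPU use); with the
     * default heuristic or cudaDeviceScheduleSpin it may busy-wait. */
    cudaDeviceSynchronize();
    return 0;
}
```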
What Cuda developers have told me is that NVidia have, over the years, exposed several different mechanisms for developers to use to solve the feeding and synchronisation problems involved in performing parallel operations on a serial host computer - from spin loops, to interrupts, to callbacks (I don't guarantee those exact terms, or their sequence, but it was asserted that multiple techniques exist, and that the available choices have changed over time in the development toolkits). Cuda developers also assert that the same techniques should also be relevant, and available, when programming in OpenCL.
OpenCL developers, on the other hand, assert that the full range of techniques is accessible when using ATi hardware and toolkits for their OpenCL programs, but that the NVidia OpenCL toolkits are less comprehensive, and the spin-loop is the most efficient solution available.
I'm still waiting for an OpenCL developer for intel_gpus to (re-)appear on the scene and complete the triangle.
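For what it's worth, one of the "callback" mechanisms mentioned above does exist in standard OpenCL (1.1 and later): the host can register a completion callback on an event and go do other work, with no thread waiting at all. A rough sketch with placeholder names follows; whether a given vendor's runtime makes this efficient in practice is exactly the open question in this thread:

```c
/* Rough sketch of an OpenCL event callback; error handling and event
 * release are omitted for brevity. */
#include <CL/cl.h>
#include <stdio.h>

/* Invoked by the OpenCL runtime, on a runtime-owned thread, once the
 * event reaches CL_COMPLETE -- no host thread has to sit and wait. */
static void CL_CALLBACK on_kernel_done(cl_event ev, cl_int status, void *user)
{
    (void)ev; (void)user;
    if (status == CL_COMPLETE)
        printf("kernel finished; enqueue the next chunk of work here\n");
}

void launch_with_callback(cl_command_queue queue, cl_kernel kernel,
                          size_t global_size)
{
    cl_event done;
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                           &global_size, NULL, 0, NULL, &done);
    clSetEventCallback(done, CL_COMPLETE, on_kernel_done, NULL);
    clFlush(queue);    /* submit the work; this host thread is now free */
}
```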
Richard Haselgrove wrote: What Cuda developers have told me is that NVidia have, over the years, exposed several different mechanisms for developers to use to solve the feeding and synchronisation problems involved in performing parallel operations on a serial host computer - from spin loops, .....
Cycling in a loop to check the status of a request - that's a real work of art in programming science. From the late 1970s, I guess.