I certainly don't have a good understanding of how multiple jobs share a GPU. I am putting my two cents in, hoping others can help me understand better.
Fermi supports concurrent kernel execution, where different kernels of the same application context can execute on the GPU at the same time. Concurrent kernel execution allows programs that execute a number of small kernels to utilize the whole GPU. For example, a PhysX program may invoke a fluids solver and a rigid body solver which, if executed sequentially, would use only half of the available thread processors. On the Fermi architecture, different kernels of the same CUDA context can execute concurrently, allowing maximum utilization of GPU resources. Kernels from different application contexts can still run sequentially with great efficiency thanks to the improved context switching performance.
I'm not sure of the precise definition of an application context, but I believe in our case it's a work unit. I believe what we're seeing is analogous to multi-threading on a single core: the advantage comes from overlapping I/O wait times. Two compute-bound threads still take a bit more than twice as long.
I'm hoping people have some good links to expand our understanding of how this can be optimized.
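To illustrate the analogy in a rough way, here is a toy model of the overlap arithmetic (nothing Fermi-specific; the function and numbers are made up for illustration):

```python
# Toy model of why two GPU tasks can share a card profitably -- a sketch,
# not a measurement. Assume each task needs t minutes total, a fraction f
# of which is spent waiting (I/O, host<->device transfers) rather than
# computing. Compute phases must serialize on the card; wait phases can
# hide under the other task's compute.

def elapsed_two_tasks(t: float, f: float) -> float:
    """Idealized wall time for 2 identical tasks run concurrently."""
    compute = t * (1.0 - f)   # compute per task (serializes on the GPU)
    wait = t * f              # wait per task (can overlap the other's compute)
    return max(2 * compute, compute + wait)

# Two purely compute-bound tasks: no benefit over running them back to back.
print(elapsed_two_tasks(10, 0.0))   # 20.0 -- exactly 2x one task
# Tasks that wait half the time: the waits hide completely.
print(elapsed_two_tasks(10, 0.5))   # 10.0 -- 2 tasks in the time of 1
```

In this idealized picture, the more of a task's time is spent waiting, the closer two concurrent tasks get to "free" throughput; fully compute-bound tasks gain nothing, which matches what we seem to observe.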
Joe
I don't run SETI tasks myself, but as for errors running CUDA X2 here on my superclocked 660Ti: I rarely have any error tasks, and I average 57 completed tasks per day (similarly, 54 tasks per day on my OC'd 550Ti).
GPU-Z says the temp is 64°C on a sunny day here today.
http://einsteinathome.org/host/4109993/tasks&offset=0&show_names=1&state=3&appid=0
Since I also run SETI MB and AstroPulse work using the Lunatics optimized application, all too often 1 SETI MB task and 1 Einstein BRP4 CUDA or OpenCL task are using the same HD5870 GPU. I haven't seen this on my 2 NVidia GTX 470 & 480 yet, but it works without problems, a.t.m. ;-)
You'll need a so-called app_info.xml file and edit its parameters accordingly (for instance, 2 instances per device and 0.5 (CUDA/OpenCL) GPU).
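For anyone who hasn't edited one before, the relevant part of an app_info.xml looks roughly like this. This is a sketch only: the app name, version number, and file names below are placeholders and must match the application files the project actually sent you, or the client will discard your work.

```xml
<app_info>
  <app>
    <name>einsteinbinary_BRP4</name>  <!-- placeholder: use the project's real app name -->
  </app>
  <file_info>
    <name>BRP4_app.exe</name>         <!-- placeholder: the real executable's file name -->
    <executable/>
  </file_info>
  <app_version>
    <app_name>einsteinbinary_BRP4</app_name>
    <version_num>128</version_num>
    <avg_ncpus>0.5</avg_ncpus>        <!-- half a CPU core reserved per task -->
    <coproc>
      <type>CUDA</type>
      <count>0.5</count>              <!-- 0.5 GPU per task => 2 instances per device -->
    </coproc>
    <file_ref>
      <file_name>BRP4_app.exe</file_name>
      <main_program/>
    </file_ref>
  </app_version>
</app_info>
```

The `<count>` value is what controls how many tasks share one GPU: 0.5 gives two concurrent tasks per card, 0.33 gives three.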
Today I received my MSI GTX 660 card.
And …… it's a very good card.
Working with 3 concurrent tasks, a task finishes in about 50 minutes.
With 2 cards, an average production of about 85,000 credits is possible.
Not bad for a 200-euro card.
My CPU runs at 4 GHz; with slower CPUs the result will be lower.
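That credit figure checks out as a back-of-envelope calculation, assuming BRP4 tasks pay 500 credits each (the value shown in the task listings quoted elsewhere in this thread):

```python
# Rough check of the ~85,000 credits/day claim for two cards, assuming
# 500 credits per BRP4 task (my assumption from the quoted task tables).
minutes_per_task = 50 / 3                 # 3 concurrent tasks, ~50 min wall time each
tasks_per_day_per_card = 24 * 60 / minutes_per_task
credits_per_day = 2 * tasks_per_day_per_card * 500   # 2 cards

print(round(tasks_per_day_per_card, 1))   # 86.4 tasks/day per card
print(round(credits_per_day))             # 86400 -- close to the claimed 85,000
```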
Are you doing anything to improve the latency with which your CPU services the GPU tasks?
In particular:
1. Are you restricting the CPU to running fewer than the maximum number of BOINC tasks? (Some people call this dedicating a core to the GPU, but that terminology is somewhat spurious absent a comprehensive intervention in task affinity, which is seldom if ever attempted.)
2. Are you doing something to raise the priority of the CPU servicing task above the default level at which it would otherwise run (for example, running Process Lasso with custom user control input, or running Fred's Priority application, which I think will configure itself automatically)?
Also, I notice the host now running your 660 has a very high accumulated credit score and RAC. What type and number of GPUs were you running before replacing them with the 660?
Thanks for the information; yours seems the most favorable report on BOINC use of the 660 I have seen to date.
From the NVIDIA website: GTX 660
GPU Engine Specs:
CUDA Cores: 384
Graphics Clock (MHz): 835
Texture Fill Rate (billion/sec): 30.4
Memory Specs:
Memory Clock: 2000
Memory Interface: GDDR5
Memory Interface Width: 128-bit
Memory Bandwidth (GB/sec): 64.0
OpenGL
Bus Support: PCI Express 2.0, PCI Express 3.0
Certified for Windows 7: Yes
Supported Technologies: 3D Vision, 3DTV Play, CUDA, DirectX 11, PhysX, SLI, TXAA, FXAA, Adaptive VSync
SLI Options: 2-way
I wonder how they compare to the GTX 470/480/570/580 or the AMD/ATI 5000 series and up. PCIe 3.0 is a huge advantage, and their Kepler architecture appears to be more efficient.
By the way, you don't need to free 1 or 2 cores at Einstein (BRP4): with 0.5 CPU + 0.5 (or 0.33) GPU per task and 2 WUs running on 1 GPU, it frees 1 core by itself, so 4 WUs on 2 ATI GPUs will 'free/not use' 2 CPU cores. At least that's a fact on an ATI rig.
I only see this on an i7-2600 + 2x HD5870 GPUs, although the GTX 470 & 480 need more CPU time compared to the 2 HD5870s.
[pre]
2,952.41 268.15 2.56 500.00 Binary Radio Pulsar Search (Arecibo) v1.28 (opencl-ati)
2,126.25 483.06 3.95 500.00 Binary Radio Pulsar Search (Arecibo) v1.28 (BRP4cuda32).[/pre]
All 4 GPUs are doing 2 instances per device.
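Reading the two quoted task lines as run time and CPU time in seconds (my assumption about what those columns mean), the difference in CPU-support cost is easy to quantify:

```python
# CPU time as a fraction of run time for the two app versions quoted above,
# assuming the first two columns are run time and CPU time in seconds.
runs = {
    "opencl-ati": (2952.41, 268.15),
    "BRP4cuda32": (2126.25, 483.06),
}
for app, (run_s, cpu_s) in runs.items():
    print(app, round(100 * cpu_s / run_s, 1), "% CPU")
# opencl-ati: ~9.1% CPU; BRP4cuda32: ~22.7% CPU -- the CUDA version needs
# roughly 2.5x the CPU fraction, consistent with the observation above.
```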
Just took a quick look at my GTX 660Ti (superclocked), which has now been running 30+ days straight; the RAC is now at 30,197.30 running tasks X2 (along with other tasks at the same time: one LHC and T4T X2) on a just-average quad-core host.
http://einsteinathome.org/host/4109993/tasks&offset=40&show_names=1&state=3&appid=0
That must be a cut-and-paste error; the GTX 660 has different (better) specs, see here:
http://www.geforce.com/hardware/desktop-gpus/geforce-gtx-660/specifications
Cheers
HBE
O.K., I have now lived with my 660Ti for 10 days. The first 5 days proved to be a real headache, with 90% of my BRP tasks being judged invalid or resulting in straight validate errors; my host lost over 3000 RAC, and as HenkM suggested, I suspected driver issues and/or a defective card.
It turns out it was neither of these things. When I first installed the card in this host (ID: 4125210), I also installed the MSI Afterburner utility, which appears to have been applying an overclock. I uninstalled Afterburner, all is good again, and I can now post information on my reliable run times.
The host:
Intel Q6600 quad core @ 3 GHz
Foxconn P43A motherboard with a PCIe 2.0 x16 slot (running @ x16)
4GB Kingston HyperX PC2-6400
MSI GTX 660Ti Power Edition
CPU running 3x GW tasks leaving 1 core spare for GPU
BRP Run times for 2 concurrent tasks = 44 minutes (22 minutes effective per task)
BRP Run times for 3 concurrent tasks = 66 minutes (22 minutes effective per task)
(Einstein is my only project and this host is running purely for Einstein)
Looking through the list of hosts for 660 users, my run times are comparable, and in some cases even compare favourably, with 670 cards. (Although I can't be sure how many concurrent tasks are being run on those hosts.)
Now, to put these times into a little perspective (and forgive my rough averaging and slight rounding of the figures involved, but I lack time): the same host was previously running a Palit GTX460 Sonic, and all four CPU cores were devoted to GW tasks.
BRP run times for 3 concurrent tasks = 71 minutes (24 minutes effective per task)
I have no data for 2 concurrent tasks on this host with the 460 so you will have to take my word that running 3 fold was more efficient than 2 fold!
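As a quick sanity check on the effective per-task figures quoted above (just arithmetic on the wall times):

```python
# Effective per-task minutes from the wall times quoted above.
def effective(wall_minutes: float, concurrent: int) -> float:
    return wall_minutes / concurrent

print(effective(44, 2))             # 22.0  (660Ti, 2 concurrent)
print(effective(66, 3))             # 22.0  (660Ti, 3 concurrent)
print(round(effective(71, 3), 1))   # 23.7  (GTX460, 3 concurrent)
print(round(71 / 3 / 22, 3))        # ~1.076: the 660Ti is only ~8% faster here
```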
Whilst I accept that the 660Ti is clearly quicker in this host than the 460, and that I may be expecting too much from my old-tech motherboard/CPU combination, from a crunching point of view (as an upgrade) the performance of the two cards is far too similar to justify the cost of upgrading, unless perhaps you can utilise latest-generation hardware (PCIe 3.0 etc.).
It would be interesting to hear from people using new generation hardware who have upgraded from a 400/500 series card to the 660 and have their impressions...
Another thing to bear in mind is that power savings/consumption may not be a justified reason to upgrade. For example, my old Palit GTX460's rated peak power consumption is 160W, whereas the MSI 660Ti is rated at 190W (both figures from the respective manufacturer websites).
Although during crunching they are unlikely to reach these peak figures, one has to wonder... especially as (if I understand correctly) the 660Ti uses GPU 'boost' based on current temperature and available power!
@ Joe
Looking forward to your feedback re: the 670 and host used.
I may even be persuaded to get one myself and directly compare it against the 660 I have now; after all, it's only money, and it would be better spent on science than beer and cigarettes etc.!
Gavin.
That is rather disappointing. My 560Ti needs 40 minutes for 2 concurrent tasks. Also consider the price difference, which is approx. 50€ in favor of the 560Ti; the 660Ti's max power consumption is 20W less, though. But does that matter for crunching?
Michael
Team Linux Users Everywhere