Speed of NVIDIA GTX660

joe areeda
Joined: 13 Dec 10
Posts: 285
Credit: 320,378,898
RAC: 0

RE: Alexander, I certainly

Quote:

Alexander,

I certainly don't have a good understanding of how multiple jobs share a GPU. I am putting my two cents in, hoping others can help me understand better.

There is an nVidia whitepaper http://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf that's not all that detailed but they say:

Quote:
Fermi supports concurrent kernel execution, where different kernels of the same application context can execute on the GPU at the same time. Concurrent kernel execution allows programs that execute a number of small kernels to utilize the whole GPU. For example, a PhysX program may invoke a fluids solver and a rigid body solver which, if executed sequentially, would use only half of the available thread processors. On the Fermi architecture, different kernels of the same CUDA context can execute concurrently, allowing maximum utilization of GPU resources. Kernels from different application contexts can still run sequentially with great efficiency thanks to the improved context switching performance.

I'm not sure of the precise definition of an application context, but I believe in our case it's a work unit. I believe what we're seeing is analogous to multi-threading on a single core: the advantage comes from overlapping I/O wait times. Two compute-bound threads still take a bit more than twice as long.
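The analogy can be made concrete with a toy timing model (a sketch only; the 40/10-minute split below is invented for illustration, and real GPU scheduling is far messier). If waits overlap with the other task's compute, two tasks finish in less than twice a single task's wall time:

```python
# Toy model of two tasks sharing one GPU: compute phases serialize, but a
# task's I/O / CPU-service waits can overlap the other task's compute.
# The 40/10-minute split is invented purely for illustration.

def shared_runtime(compute, wait, n_tasks):
    """Lower bound on wall time for n_tasks sharing one device."""
    # The device must do all the compute, and any single task still
    # needs at least its own compute + wait; the larger bound wins.
    return max(n_tasks * compute, compute + wait)

solo = shared_runtime(40, 10, 1)   # 50 min for a lone task
pair = shared_runtime(40, 10, 2)   # 80 min for two -> 40 min effective
print(solo, pair)
```

So two tasks cost 80 minutes instead of 100 sequential minutes, but still more than 2x50/2: exactly the "a bit more than twice as long for two compute-bound threads" behaviour described above.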

I'm hoping people have some good links to expand our understanding of how this can be optimized.

Joe

MAGIC Quantum Mechanic
Joined: 18 Jan 05
Posts: 1,721
Credit: 1,102,919,449
RAC: 1,177,408

I don't run Seti tasks myself


I don't run SETI tasks myself, but as for errors running CUDA tasks x2 here on my superclocked 660 Ti: I rarely have any error tasks and average 57 completed tasks per day (similarly, my OC'd 550 Ti does 54 tasks per day).

GPU-Z says the temp is 64 °C on a sunny day here today.

http://einsteinathome.org/host/4109993/tasks&offset=0&show_names=1&state=3&appid=0

Fred J. Verster
Joined: 27 Apr 08
Posts: 118
Credit: 22,451,438
RAC: 0

RE: RE: Alexander, I

Quote:

Quote:

Alexander,

I certainly don't have a good understanding of how multiple jobs share a GPU. I am putting my two cents in, hoping others can help me understand better.

There is an nVidia whitepaper http://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf that's not all that detailed but they say:

Quote:
Fermi supports concurrent kernel execution, where different kernels of the same application context can execute on the GPU at the same time. Concurrent kernel execution allows programs that execute a number of small kernels to utilize the whole GPU. For example, a PhysX program may invoke a fluids solver and a rigid body solver which, if executed sequentially, would use only half of the available thread processors. On the Fermi architecture, different kernels of the same CUDA context can execute concurrently, allowing maximum utilization of GPU resources. Kernels from different application contexts can still run sequentially with great efficiency thanks to the improved context switching performance.

I'm not sure of the precise definition of an application context, but I believe in our case it's a work unit. I believe what we're seeing is analogous to multi-threading on a single core: the advantage comes from overlapping I/O wait times. Two compute-bound threads still take a bit more than twice as long.

I'm hoping people have some good links to expand our understanding of how this can be optimized.

Joe

Since I also run SETI MB and AstroPulse work using the LUNATICs Optimized Application, all too often 1 SETI MB task and 1 Einstein BRP4 CUDA or OpenCL task are using the same HD5870 GPU. I haven't seen this on my 2 NVidia GTX 470 & 480 yet.

But it works without problems, a.t.m. ;-)
You'll need a so-called app_info.xml file and edit its parameters accordingly (for instance, 2 instances per device and 0.5 (CUDA/OpenCL) GPU per task).
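For reference, a minimal sketch of what the relevant app_info.xml entry might look like for running two tasks per GPU. Element names follow BOINC's anonymous-platform format; the app_name and version_num here are illustrative and must match the files actually on your host:

```xml
<app_version>
    <app_name>einsteinbinary_BRP4</app_name>
    <version_num>128</version_num>
    <avg_ncpus>0.5</avg_ncpus>
    <coproc>
        <type>CUDA</type>
        <!-- 0.5 GPU per task = 2 instances per device -->
        <count>0.5</count>
    </coproc>
</app_version>
```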

HenkM
Joined: 29 Sep 09
Posts: 32
Credit: 279,008,202
RAC: 1

Today I received my MSI GTX

Today I received my MSI GTX 660 card.
And …… it’s a very good card.
Working with 3 concurrent tasks, a task finishes in about 50 minutes.
With 2 cards, an average production of about 85,000 credits per day is possible.
Not bad for a 200 euro card.
My CPU runs at 4 GHz; with slower CPUs the result will be lower.
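The 85,000 estimate checks out with a quick back-of-the-envelope calculation (assuming 500 credits per BRP4 task, the value shown in the task lists posted elsewhere in this thread, and round-the-clock crunching):

```python
# Back-of-the-envelope check of the ~85,000-credit/day estimate:
# 2 cards, 3 concurrent tasks each, ~50 min per task, and an assumed
# 500 credits per BRP4 task.
cards, slots, minutes_per_task, credits_per_task = 2, 3, 50, 500
tasks_per_day = cards * slots * 24 * 60 / minutes_per_task
print(tasks_per_day, tasks_per_day * credits_per_task)  # ~173 tasks, ~86,000 credits
```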

archae86
Joined: 6 Dec 05
Posts: 3,146
Credit: 7,093,904,931
RAC: 1,378,560

HenkM wrote:Today I received

HenkM wrote:
Today I received my MSI GTX 660 card.
And …… it’s a very good card.
Working with 3 concurrent tasks, a task finishes in about 50 minutes.

Are you doing anything to improve the latency with regard to servicing of the GPU tasks by your CPU?

In particular.

1. Are you restricting the CPU to running fewer than the maximum number of BOINC tasks? (Some people call this dedicating a core to the GPU, but that terminology is somewhat spurious absent a comprehensive intervention in task affinity, which is seldom if ever attempted.)

2. are you doing something to raise the application priority of the CPU servicing task above the default level at which it would otherwise run (for example, running Process Lasso with custom user control input, or running Fred's Priority application, which I think will configure itself automatically)?

Also, I notice the host now running your 660 has a very high accumulated credit score and RAC. What type and number of GPUs were you running before replacing them with the 660?

Thanks for the information--yours seems the most favorable report on BOINC use of the 660 I have seen to date.

Fred J. Verster
Joined: 27 Apr 08
Posts: 118
Credit: 22,451,438
RAC: 0

RE: HenkM wrote:Today I

Quote:
HenkM wrote:
Today I received my MSI GTX 660 card.
And …… it’s a very good card.
Working with 3 concurrent tasks, a task finishes in about 50 minutes.

Are you doing anything to improve the latency with regard to servicing of the GPU tasks by your CPU?

In particular.

1. Are you restricting the CPU to running fewer than the maximum number of BOINC tasks? (Some people call this dedicating a core to the GPU, but that terminology is somewhat spurious absent a comprehensive intervention in task affinity, which is seldom if ever attempted.)

2. are you doing something to raise the application priority of the CPU servicing task above the default level at which it would otherwise run (for example, running Process Lasso with custom user control input, or running Fred's Priority application, which I think will configure itself automatically)?

Also, I notice the host now running your 660 has a very high accumulated credit score and RAC. What type and number of GPUs were you running before replacing them with the 660?

Thanks for the information--yours seems the most favorable report on BOINC use of the 660 I have seen to date.

From NVidia website: GTX 660

GPU Engine Specs:
CUDA Cores: 384
Graphics Clock (MHz): 835
Texture Fill Rate (billion/sec): 30.4

Memory Specs:
Memory Clock: 2000
Memory Interface: GDDR5
Memory Interface Width: 128-bit
Memory Bandwidth (GB/sec): 64.0

OpenGL
Bus Support: PCI Express 2.0, PCI Express 3.0
Certified for Windows 7: Yes
Supported Technologies: 3D Vision, 3DTV Play, CUDA, DirectX 11, PhysX, SLI, TXAA, FXAA, Adaptive VSync
SLI Options: 2-way

Wonder how they compare to the GTX 470/480/570/580 or AMD/ATI 5000 series and up.
PCIe 3.0 is a huge advantage, and their Kepler architecture appears to be more efficient.

By the way, you don't need to free 1 or 2 cores manually at Einstein (BRP4): each task is budgeted at 0.5 CPU + 1 (or 0.5/0.33) GPU, so doing 2 WUs on 1 GPU already frees 1 core, and 4 WUs on 2 ATI GPUs will 'free/not use' 2 CPU cores. At least that's a fact on my ATI rig.
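That core arithmetic can be checked directly (using the 0.5-CPU-per-GPU-task budget from the app settings described above):

```python
# Checking the core arithmetic: each BRP4 GPU task is budgeted at
# 0.5 CPU, so BOINC withholds that many cores from CPU-only work.
gpu_tasks = 4                 # 2 instances per device on 2 GPUs
cpu_per_gpu_task = 0.5
reserved_cores = gpu_tasks * cpu_per_gpu_task
print(reserved_cores)         # 2.0 cores kept free of CPU tasks
```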

I only see this on my i7-2600 + 2x HD5870 GPUs, although the GTX 470 & 480 need more CPU time compared to the 2 HD5870s.
[pre]
2,952.41 268.15 2.56 500.00 Binary Radio Pulsar Search (Arecibo) v1.28 (opencl-ati)
2,126.25 483.06 3.95 500.00 Binary Radio Pulsar Search (Arecibo) v1.28 (BRP4cuda32).[/pre]

All 4 GPUs are doing 2 instances per device.

MAGIC Quantum Mechanic
Joined: 18 Jan 05
Posts: 1,721
Credit: 1,102,919,449
RAC: 1,177,408

Just took a quick look at my

Just took a quick look at my GTX 660 Ti (superclocked), which has now been running 30+ days straight; the RAC is now at 30,197.30 running tasks x2 (along with other tasks at the same time: one LHC and T4T x2) on a just-average quad-core host.

http://einsteinathome.org/host/4109993/tasks&offset=40&show_names=1&state=3&appid=0

Bikeman (Heinz-Bernd Eggenstein)
Moderator
Joined: 28 Aug 06
Posts: 3,522
Credit: 692,134,435
RAC: 40,861

RE: From NVidia website :

Quote:


From NVidia website : GTX660

GPU Engine Specs:
CUDA Cores: 384
Graphics Clock (MHz): 835
Texture Fill Rate (billion/sec): 30.4
[...]

That must be a cut-and-paste error; the GTX 660 has different (better) specs, see here:

http://www.geforce.com/hardware/desktop-gpus/geforce-gtx-660/specifications

Cheers
HBE

Gavin
Joined: 21 Sep 10
Posts: 191
Credit: 40,644,307,741
RAC: 14,904

O.K I have now lived with my

O.K., I have now lived with my 660 Ti for 10 days. The first 5 days proved to be a real headache, with 90% of my BRP tasks being judged invalid or resulting in straight validate errors; my host lost over 3,000 in RAC, and as HenkM suggested, I suspected driver issues and/or a defective card.

Turns out it was neither of those things. When I first installed the card in this host (ID: 4125210) I also installed the MSI Afterburner utility, which appears to have been applying an overclock. I uninstalled Afterburner, all is good again, and I can now post information on my reliable run times.

The host:

Intel Q6600 quad core @ 3 GHz
Foxconn P43a motherboard with a PCI-e 2.0 x16 slot running @ x16
4GB Kingston Hyper-X PC2-6400
MSi GTX660Ti power edition

CPU running 3x GW tasks leaving 1 core spare for GPU

BRP Run times for 2 concurrent tasks = 44 minutes (22 minutes effective per task)

BRP Run times for 3 concurrent tasks = 66 minutes (22 minutes effective per task)

(Einstein is my only project and this host is running purely for Einstein)

Looking through the list of hosts for 660 users, my run times are comparable and in some cases even favorable compared to 670 cards (although I can't be sure how many concurrent tasks are being run on those hosts).

Now, to put these times into a little perspective (forgive my rough averaging and slight rounding of the figures involved, but I lack time): the same host was previously running a Palit GTX 460 Sonic, and all four CPU cores were devoted to GW tasks.

BRP run times for 3 concurrent tasks = 71 minutes (24 minutes effective per task)

I have no data for 2 concurrent tasks on this host with the 460 so you will have to take my word that running 3 fold was more efficient than 2 fold!
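The batch times above reduce to effective per-task times and daily throughput like this (figures as posted; the tasks-per-day column assumes the card crunches flat out 24 h a day):

```python
# Effective per-task time and daily throughput from the batch times above
# (tasks/day assumes the card crunches 24 h flat out).
def effective_minutes(batch_min, concurrent):
    """Wall-clock minutes per finished task when running N concurrently."""
    return batch_min / concurrent

for card, batch, n in [("GTX 660 Ti, 2x", 44, 2),
                       ("GTX 660 Ti, 3x", 66, 3),
                       ("GTX 460,    3x", 71, 3)]:
    eff = effective_minutes(batch, n)
    print(f"{card}: {eff:.1f} min/task, {24 * 60 / eff:.0f} tasks/day")
```

The gap works out to roughly 22 vs 24 effective minutes per task, which is why the upgrade feels underwhelming from a pure crunching standpoint.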

Whilst I accept that the 660 Ti is clearly quicker in this host than the 460, and that I may be expecting too much from my old-tech motherboard/CPU combination, from a crunching point of view the performance of the two cards is far too similar to justify the cost of upgrading, unless perhaps you can utilise latest-generation hardware, PCI-e 3.0 etc.

It would be interesting to hear from people using new-generation hardware who have upgraded from a 400/500 series card to the 660, and get their impressions...

Another thing to bear in mind is that power savings may not be a justified reason to upgrade: for example, my old Palit GTX 460's rated peak power consumption is 160 W, whereas the MSi 660 Ti is rated at 190 W (both figures from the respective manufacturer websites).
Although during crunching they are unlikely to reach these peak figures, one has to wonder... especially as (if I understand correctly) the 660 Ti uses GPU 'boost' based on current temperature and available power!

@ Joe

Looking forward to your feedback re: the 670 and host used.

I may even be persuaded to get one myself and directly compare it against the 660 I have now; after all, it's only money, and it would be better spent on science than on beer and cigarettes!

Gavin.

Michael Karlinsky
Joined: 22 Jan 05
Posts: 888
Credit: 23,502,182
RAC: 0

RE: BRP Run times for 2

Quote:

BRP Run times for 2 concurrent tasks = 44 minutes (22 minutes effective per task)

That is rather disappointing. My 560 Ti needs 40 min for 2 concurrent tasks. Also consider the price difference, which is approx. 50 € in favor of the 560 Ti; the 660 Ti's max power consumption is 20 W less, though. But does that matter for crunching?
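Side by side, the numbers look like this (a rough sketch using the figures as posted; the 44-minute 660 Ti figure is from a different host, and real cost-efficiency also depends on electricity price and duty cycle):

```python
# Rough comparison from the figures as posted: 560 Ti does 2 tasks in
# 40 min, the 660 Ti on the other host does 2 in 44 min; the 560 Ti is
# ~50 EUR cheaper but draws ~20 W more at peak.
eff_560 = 40 / 2                      # effective minutes per task
eff_660 = 44 / 2
ratio = eff_560 / eff_660             # < 1 means the 560 Ti is faster
kwh_saved_per_day = 20 * 24 / 1000    # energy the 660 Ti saves daily
print(f"560Ti/660Ti time ratio: {ratio:.2f}")
print(f"660 Ti saves about {kwh_saved_per_day:.2f} kWh/day")
```

At half a kWh a day, it would take years of crunching for the power saving alone to close a 50 € price gap.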

Michael
