Robert, thanks for the data. It will be interesting to see what impact the switch to three jobs makes. I run my GTX 750 at 2x but my GTX 660s at 3x. On the other hand, your high reported utilization may limit hopes for improvement.
While the performance seems a bit less than I hoped, the power consumption is less than I feared. Not wishing to raise power consumption much from my 660s, I had been thinking of the 970, but it seems my power concern may allow me to use a 980.
As you were using Afterburner, what clock frequencies did it report during 2x job operation?
One last tidbit: I've done some careful test series over the last couple of years, and I distinctly recall that in one of them the GPU job output was actually BETTER with one CPU Einstein task running than with zero. Not by much, and yes, it was a surprise, so I checked carefully, and for that test set it was true.
This may be a stupid question, but can the nVidia cards run the OpenCL app here? Since the CUDA app is built against CUDA 3.2, the OpenCL version might be better despite nVidia's lackluster OpenCL drivers.
Quote:
While the performance seems a bit less than I hoped
I agree, at least for the current Einstein CUDA app. It seems to run into some bottleneck other than shading power. What about memory bandwidth? The 970 achieves almost double the throughput of the 750 Ti, with about 2.6 times as much shader power and twice the bandwidth. The L2 cache size is also the same for both cards - so if the cache is helping the 750 Ti a lot, it won't be quite as useful for the larger card, since naturally more work is in flight there.
Quote:
the GPU job output was actually BETTER with one CPU Einstein task running than with zero
I could imagine this being due to the delay associated with ramping an idle core up to full clock and voltage. On the first K10 Athlons this was so bad that the dynamic scaling was removed entirely via BIOS updates. Your old Nehalem was certainly better, but it is from the same generation.
MrS
Scanning for our furry friends since Jan 2002
Quote:
This may be a stupid question, but can the nVidia cards run the OpenCL app here?
I guess you are referring to OpenCL on S6CasA and FGRP3. The short answer is yes, but in general AMD is a little better on comparable cards. I am about to build a new Haswell machine, for example, and will be putting in a pair of HD 7790s rather than a pair of GTX 660s. My past experience indicates that will be better overall, though I don't remember the numbers.
Also, the OpenCL versions of BRP4G and BRP5 run a little better on the HD 7790s than the CUDA versions do on the GTX 660s, as I recall, though it is not a large difference. I keep hoping for improved CUDA support, but until then I will go with AMD.
Repeating a request I first posted over at GPUGrid, but which didn't get an answer.
Could somebody running a Maxwell-aware version of BOINC check and report this [the 'GFLOPS peak' value shown at startup for a GTX 970/980], please, and do a sanity check of whether BOINC's figure is correct from what you know of the card's SM count, cores per SM, shader clock, flops_per_clock, etc.? We got the figures for the 'baby Maxwell' 750/Ti into BOINC on 24 February (3edb124ab4b16492d58ce5a6f6e40c2244c97ed6), but I think that was just too late to catch v7.2.42.
We're in a similar position this time, with v7.4.22 at the release-candidate stage - I'd say that one is safe to test with, if nobody here has upgraded yet. TIA.
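For anyone attempting that sanity check, the arithmetic is just the product of the quantities listed in the request: SM count x cores per SM x flops per clock (2, for FMA) x shader clock. Below is a minimal C++ sketch; the SM counts are fixed by the chip, while the clocks are the reference base clocks and are assumptions - an actual card will report whatever clock the driver sees.

// Peak single-precision FLOPS estimate:
//   SM count * cores per SM * flops per clock (2, FMA) * shader clock.
// Clocks below are reference base clocks (assumptions); a boosted card
// will report something higher.
#include <cstdio>

struct Card {
    const char* name;
    int sm_count;      // streaming multiprocessors
    int cores_per_sm;  // 128 for Maxwell (compute capability 5.x)
    double clock_ghz;  // shader clock, GHz
};

int main() {
    const Card cards[] = {
        {"GTX 750 Ti", 5, 128, 1.020},
        {"GTX 970",   13, 128, 1.050},
        {"GTX 980",   16, 128, 1.126},
    };
    for (const Card& c : cards) {
        double gflops = c.sm_count * c.cores_per_sm * 2.0 * c.clock_ghz;
        std::printf("%-10s %4d cores  %7.1f GFLOPS peak\n",
                    c.name, c.sm_count * c.cores_per_sm, gflops);
    }
    return 0;
}

At reference clocks that works out to roughly 3.5 TFLOPS for a 970 and 4.6 TFLOPS for a 980; at typical boost clocks (around 1.18-1.22 GHz) the figures rise towards about 3.9 and 5.0 TFLOPS.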
Here are the results for running 3 jobs at a time on the GTX 970.
970 = 12,926 secs; GPU Usage = 96%; temp = 62 C; watts = 125
970 Daily Credits = 3333 * 86,400 / (12,926 / 3) = 66,835
Effectively no change over running 2 jobs at a time.
970 core clock for both runs = 1342 MHz. This must be a built-in boost, because I made no adjustments to have it run at that speed.
Richard, I'll try to get to your request tonight.
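(For anyone wanting to redo Robert's daily-credit arithmetic for a different card or multiplicity, it generalises as in this small sketch; the 3333 credits per BRP5 task and the 12,926 s run time are the figures quoted above, everything else is just the formula.)

// Daily credit from elapsed time per task, tasks run in parallel,
// and credit granted per task (figures taken from the post above).
#include <cstdio>

int main() {
    double credit_per_task   = 3333.0;   // BRP5 credit per task
    double elapsed_seconds   = 12926.0;  // wall time per task at 3x
    double tasks_in_parallel = 3.0;

    double effective_seconds = elapsed_seconds / tasks_in_parallel;
    double tasks_per_day     = 86400.0 / effective_seconds;
    double credits_per_day   = credit_per_task * tasks_per_day;

    std::printf("%.1f tasks/day -> %.0f credits/day\n",
                tasks_per_day, credits_per_day);
    return 0;
}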
Quote:
I guess you are referring to OpenCL on S6CasA and FGRP3.
No, I was referring to the OpenCL BRP binaries the AMD GPUs are running. They are compiled with AMD in mind, but OpenCL should be device-agnostic; at Collatz, for example, the actual binary is the same for AMD, Intel and nVidia.
MrS
Scanning for our furry friends since Jan 2002
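To illustrate the device-agnostic point: an OpenCL host program simply enumerates whatever platforms and devices the installed drivers expose at run time, so one binary can in principle end up on AMD, Intel or nVidia hardware. A minimal enumeration sketch using only the core OpenCL 1.x C API (nothing Einstein-specific, and error checking omitted for brevity):

// List every OpenCL platform and device the installed drivers expose.
// Link with -lOpenCL.
#include <CL/cl.h>
#include <cstdio>

int main() {
    cl_platform_id platforms[8];
    cl_uint num_platforms = 0;
    clGetPlatformIDs(8, platforms, &num_platforms);

    for (cl_uint p = 0; p < num_platforms; ++p) {
        char pname[256] = {0};
        clGetPlatformInfo(platforms[p], CL_PLATFORM_NAME,
                          sizeof(pname), pname, nullptr);

        cl_device_id devices[16];
        cl_uint num_devices = 0;
        clGetDeviceIDs(platforms[p], CL_DEVICE_TYPE_ALL,
                       16, devices, &num_devices);

        for (cl_uint d = 0; d < num_devices; ++d) {
            char dname[256] = {0};
            clGetDeviceInfo(devices[d], CL_DEVICE_NAME,
                            sizeof(dname), dname, nullptr);
            std::printf("%s: %s\n", pname, dname);
        }
    }
    return 0;
}

Whether Einstein's scheduler would actually send its OpenCL BRP application to an nVidia host is a separate question of plan classes rather than of OpenCL itself.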
Quote:
No, I was referring to the OpenCL BRP binaries the AMD GPUs are running.
In that case, I don't know how you would get them to run on nVidia. My cards (GTX 650 Ti, 660, 660 Ti and 750 Ti) always pick up the CUDA 3.2 work units for BRP.
Anonymous platform?
Hi Folks,
I've just been having a look at the output report for a BRP5 WU and noticed something I find odd.
NO CUDA cores!! On a GPU with over a thousand of them? Any ideas why a GTX 970 is reported this way?
7.2.42
Activated exception handling...
[04:03:10][4920][INFO ] Starting data processing...
[04:03:10][4920][INFO ] CUDA global memory status (initial GPU state, including context):
------> Used in total: 194 MB (3904 MB free / 4098 MB total) -> Used by this application (assuming a single GPU task): 0 MB
[04:03:10][4920][INFO ] Using CUDA device #0 "GeForce GTX 970" (0 CUDA cores / 0.00 GFLOPS)
[04:03:10][4920][INFO ] Version of installed CUDA driver: 6050
[04:03:10][4920][INFO ] Version of CUDA driver API used: 3020
[04:03:11][4920][INFO ] Checkpoint file unavailable: status.cpt (No such file or directory).
------> Starting from scratch...
[04:03:11][4920][INFO ] Header contents:
------> Original WAPP file: ./PB0048_01241_DM1604.00
Regards,
Cliff,
Been there, Done that, Still no damm T Shirt.
Yes. The CUDA core enumeration (and the peak speed enumeration, for that matter) is done by a piece of software called an API (Application Programming Interface) built into the project's - any project's - science application at the time it was compiled.
The API can process the reply from any card already in manufacture when it was designed, but for newer cards it just throws up its hands and says "Huh? Wassat?"
In fact, the newest cards the BRP5 API knows about are
GeForce GT 555M (144 CUDA cores / 374.40 GFLOPS)
GeForce GTX 570 (480 CUDA cores / 1440.00 GFLOPS)
After that, we just get
GeForce GT 640 (0 CUDA cores / 0.00 GFLOPS)
GeForce GTX 650 Ti (0 CUDA cores / 0.00 GFLOPS)
(data from Albert; there may be more exotic cards here that Albert hasn't seen yet)
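For the curious, the mechanism behind those zeroes is essentially a compile-time lookup table: the CUDA runtime reports the device's compute capability and SM count, and the application maps compute capability to cores per SM (much like the _ConvertSMVer2Cores helper in NVIDIA's SDK samples). A table frozen before Kepler and Maxwell existed has no entry for compute capability 3.x or 5.x, so the core count comes out as 0 and the derived GFLOPS figure with it. A sketch of the idea - not the project's actual code:

// Sketch of how "0 CUDA cores / 0.00 GFLOPS" can happen: a cores-per-SM
// table frozen at compile time knows nothing about newer architectures.
// Illustration only; compile with nvcc or link against the CUDA runtime.
#include <cuda_runtime.h>
#include <cstdio>

// Cores per SM for the architectures this (old) table knows about.
static int cores_per_sm(int major, int minor) {
    if (major == 1) return 8;                      // Tesla
    if (major == 2) return (minor == 0) ? 32 : 48; // Fermi
    return 0;  // Kepler (3.x), Maxwell (5.x), ...: unknown -> 0
}

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    int cores = prop.multiProcessorCount * cores_per_sm(prop.major, prop.minor);
    double gflops = cores * 2.0 * (prop.clockRate * 1e-6);  // clockRate is in kHz

    std::printf("%s (%d CUDA cores / %.2f GFLOPS)\n", prop.name, cores, gflops);
    return 0;
}

The figures in the list above follow the same formula: the GT 555M's 374.40 GFLOPS corresponds to 144 cores x 2 x 1.3 GHz, and the GTX 570's 1440.00 GFLOPS to 480 x 2 x 1.5 GHz.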