GPU Upgrade Shows No Improvement in Work Unit Completion

Florida Rancher
Joined: 4 Oct 13
Posts: 31
Credit: 23,998,436
RAC: 0
Topic 198610

I'm a novice Einstein@Home member. I upgraded my second video card from a GeForce GTX 680 to a GeForce GTX 970, but the time it takes to complete one GPU work unit is about the same, averaging about 95 minutes. I then set the card to run two GPU work units at once, and each one now takes 220 minutes to complete, which is about a 13.5% loss of efficiency. GPU load averages about 90% for each card.

I need help improving efficiency.

Thank you,
Philip Smith
Florida Rancher

Windows 10 Pro
Intel I7-6700 CPU running at 3690 MHz
Dell 0XJ8C4 motherboard
32 GB DDR4-2133 Kingston RAM
Nvidia GeForce GTX 970 4 GB GDDR5 RAM
Core 1408 MHz, Memory 3004 MHz
Nvidia GeForce GT 730 4 GB DDR3 RAM
Core 1332 MHz, Memory 2432 MHz

Logforme
Joined: 13 Aug 10
Posts: 332
Credit: 1,671,340,358
RAC: 832,770

I have a faint memory of people on this forum saying that the 9xx series does not add much over the previous generations for E@H. But I could be wrong, I don't have any 9xx cards myself.

On a general efficiency note, I see you also run CPU tasks. Have you reserved any CPU cores for GPU support?

DF1DX
Joined: 14 Aug 10
Posts: 59
Credit: 1,205,878,861
RAC: 681,591

I would
- in BOINC: set CPU usage to 75% or 50%.
- don't overclock so hard! 1408 MHz is IMHO too much for a GTX 970.

You have too many errors (more than 1000!) in your tasklist. Start with stock clocks, no CPU-tasks and test.

My 750Ti takes about 215 minutes with two workunits and no cpu tasks.

Good luck!
Juergen

Florida Rancher
Joined: 4 Oct 13
Posts: 31
Credit: 23,998,436
RAC: 0

Juergen:

Thank you very much for responding to my inquiry. The errors mainly occurred because I had too many days of tasks accumulated and then I had to go into the hospital for 11 days so the tasks expired. Additionally, I canceled a large batch of tasks myself to get the number down to a manageable level. In preferences I set the number of days to 5 instead of 10 and 5.

I took your advice and set the core clock of the 970 to 1368 MHz. Should I reduce it further? The surprising thing is that this GPU card crunches at about the same speed as my GTX 680. I bought it thinking it would offer a major improvement in GPU crunching times.

In BOINC preferences there are two selections for CPU usage limits:
"Use at most % of CPUs" and
"Use at most % of CPU time".

Which setting should be set to 50 to 75% and what about the other?

Regards,
Phil

Florida Rancher
Joined: 4 Oct 13
Posts: 31
Credit: 23,998,436
RAC: 0

Thank you for your response. I'm getting some great help from Juergen getting me on the right track. The 680 and 970 times are almost identical.

I have 8 cores; 7 for crunching and one for the GPU.

I don't use the computer for much else right now so I have it set to mostly doing Einstein tasks.

Thank you,
Phil

DF1DX
Joined: 14 Aug 10
Posts: 59
Credit: 1,205,878,861
RAC: 681,591

Hi Phil,

1. Set "Use at most % of CPU time" to 100%, and

2. set "Use at most % of CPUs" to 75% or 50%, which leaves one (75%) or two (50%) cores free for GPU tasks (with 4 cores on my host).

With 8 cores you can try smaller 12.5% steps (87.5, 75.0, 62.5, ...).

3. Stock core clocks for a GTX 970 are IMHO ~1050-1100 MHz, not more.
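
A rough sketch of how the "Use at most % of CPUs" setting in point 2 maps to cores. It assumes the client simply rounds the product down to a whole number of cores and never goes below one; the function name is made up for illustration and is not part of BOINC.

```python
import math


def cores_for_boinc(total_cores: int, percent_of_cpus: float) -> tuple[int, int]:
    """Return (cores BOINC may use, cores left free).

    Assumption: percent * cores is rounded down, with a minimum of one core.
    """
    usable = max(1, math.floor(total_cores * percent_of_cpus / 100))
    return usable, total_cores - usable


if __name__ == "__main__":
    # Compare a 4-core host with an 8-core host over the 12.5% steps.
    for total in (4, 8):
        for pct in (100.0, 87.5, 75.0, 62.5, 50.0):
            usable, free = cores_for_boinc(total, pct)
            print(f"{total} cores at {pct:5.1f}%: {usable} for BOINC, {free} free")
```

On a 4-core host only the 25% steps change anything under this assumption, which is why the finer 12.5% steps are only worth trying with 8 cores.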

Happy crunching,
Juergen

archae86
Joined: 6 Dec 05
Posts: 2,780
Credit: 3,011,973,539
RAC: 2,692,407

It is worth mentioning that the 970 ranks lower within the 900 series than the 680 does within the 600 series. You may have expected more from this upgrade than was warranted.

As it happens I run a 970 on Einstein, and just stopped running a 660 on Einstein (a fan died). The 660 at base clocks was running 2 WU with elapsed times of about 165 minutes.

The 970, running two WUs of BRP6 (Parkes) at once, takes about 95 minutes elapsed time to complete each unit. However, that is with a substantial memory overclock which was a bit tricky to achieve (there is a thread devoted to that topic pinned to the top of this forum), a modest core clock overclock, and a pretty frisky support CPU (dual-core Haswell) which runs zero BOINC CPU jobs but also supports a 750Ti running Einstein.

I don't have good recall of my 970 base clock times, but they were very substantially longer (Einstein responds much more to memory clock rate on the Maxwell and Maxwell2 cards than most games seem to).

Florida Rancher
Joined: 4 Oct 13
Posts: 31
Credit: 23,998,436
RAC: 0

Juergen:

All of the help I'm receiving from everyone is outstanding and I want to thank everyone who's offered advice. I changed everything you directed me to change and I set "Use at most % of CPUs" to 75%.

I also used NVIDIA Inspector to set both my cards to their default configurations. The program already has the default settings.

Archae told me that the 970 is a lower spec card and unfortunately I didn't do my due diligence when I bought it. I assumed wrongly that it would be a great leap over the GTX 680 for GPU crunching. I'm not a gamer so the video card is dedicated to BOINC. Does anyone publish ratings for the different GeForce cards?

Now that I've changed the CPU setting to 75% how do I get the GPUs to use the extra cores? With 2 extra cores will the usage change to .5 or for each GPU?

Right now I'm crunching 4 GPU cores; 2 cores for each card and each GPU is using .2 CPUs.

Regards,
Phil

Florida Rancher
Joined: 4 Oct 13
Posts: 31
Credit: 23,998,436
RAC: 0

Thank you Archae. The 970 elapsed time running 2 WUs is 225 minutes. Like yours, my single WUs take 95 minutes to complete. It is actually taking longer to complete WUs when running 2 at once.

After dinner I'm going to research the thread you directed me to. I really appreciate the advice because I'm back crunching after a long illness and a 3-year hiatus.

Regards,
Phil

archae86
Joined: 6 Dec 05
Posts: 2,780
Credit: 3,011,973,539
RAC: 2,692,407

Quote:

how do I get the GPUs to use the extra cores? With 2 extra cores will the usage change to .5 or for each GPU?

Right now I'm crunching 4 GPU cores; 2 cores for each card and each GPU is using .2 CPUs.


You are reading into some indicators a meaning they don't have. Lots of people do, so this is not me accusing you of being sloppy.

The tasks which run on the CPU to support the GPU jobs are just more tasks from the point of view of the CPU, the OS, and the mechanisms which decide which task is actually running on a given core at a given moment.

In your case, nothing is rushing in and pausing your GPU support task because it has used more than 0.2 of a core over some period. Don't worry about that number, it has no direct bearing on your performance.

What does have a bearing is whether the work on your GPU gets serviced by the CPU as quickly as feasible each time it needs an external resource (sometimes computation, frequently data transfer).

If there is idle CPU capacity, then the GPU support task is much more likely to start up promptly than otherwise, which is the main reason that reducing the number of always-running BOINC tasks can help GPU throughput. The details vary hugely with specific application, and specific GPU, so experimentation is key.

Regarding relative GPU performance, as it happens the project does maintain a page on this subject, but it is gravely flawed. The statistics accumulated take no account of differences among the populations in how many tasks are run simultaneously, and there are probably other problems. Nevertheless, the ordering probably does have some resemblance to the truth.

For the Einstein list see here

Actually one can probably do better by identifying individuals who post here who run machines with GPUs of interest to you, and directly asking them the few relevant questions, such as:

1. What multiplicity are you running the jobs at? (Two at once is common, but values from 1 to 6 are not rare.)
2. How many BOINC CPU tasks are running?
3. Which applications are running? (For comparison purposes it is best if a single GPU application type and a single CPU application type, from just one project, is all that is running.) Since no two applications behave the same, it is even more important that the workload running is the one you wish to run efficiently.
4. Does the machine run BOINC 24x7, with not enough other work to matter?

I, personally, can give you very specific information on the GTX 750, GTX 750 Ti, GTX 970, and probably within a few weeks for the GTX 1080. Quite likely none of those is of current operational interest to you, but there are other folks posting here willing and able to supply useful information on their personal systems.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5,162
Credit: 38,661,754,096
RAC: 43,566,016

###############
EDIT: It took me a while to compose my full message below. I wasn't aware that Archae86 had also responded in the meantime. I've just perused his response and fortunately we don't seem to disagree or contradict each other very much :-).

He actually makes a number of very good complementary remarks. I would like to thank him for his effort.
################

Hi Phil,

Welcome back to crunching!! Thanks very much for rejoining. The project really values all contributions that people are prepared to make. You will find there are lots of people around here who are more than willing to offer advice and assistance to get you going again.

To help you get up to speed, I'll pick a few things you've mentioned that haven't yet been responded to and try to give you some pointers.

Quote:
... I canceled a large batch of tasks myself to get the number down to a manageable level. In preferences I set the number of days to 5 instead of 10 and 5.


The two 'work days' settings are additive. If you set 10+5, BOINC will try to download 15 days work. With a 14 day deadline, this is crazy :-). Also, initial estimates may be quite wrong until things settle down. Your work could take far longer to complete than the initial estimates suggest. When starting out or when adding new devices to the crunching mix (or even when experimenting with different mixes of science runs), you should always set a low work cache until everything settles.

Another point about those two settings is that if you spread your required 5 days across both (say 3+2), BOINC will 'fill-up' to 5 and then not ask for more work until the remainder drops to less than 3. If you are trying to protect against a multi-day work outage, you should always set 5+0 so that you always have a full cache on hand. I don't really know why anyone would want a cache that cycles between two disparate numbers all the time so just set the first value to what you want and forget the second if you want a constant value.
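
A toy model of the two cache settings, just to show the cycling described above. It assumes work is burned at a steady rate and that the client tops up to the sum of the two values whenever the cache falls below the first one; the function name and the half-day step are made up for illustration, and the real scheduler is more involved.

```python
def simulate_cache(min_days: float, extra_days: float, steps: int,
                   burn_per_step: float = 0.5) -> list[float]:
    """Return the cache level (in days of work) after each step.

    Assumption: refill to min_days + extra_days whenever the level
    drops below min_days; otherwise no new work is requested.
    """
    full = min_days + extra_days
    cache = full
    history = []
    for _ in range(steps):
        cache = max(0.0, cache - burn_per_step)  # crunch some work
        if cache < min_days:                     # low-water mark reached
            cache = full                         # fetch back up to the total
        history.append(cache)
    return history


if __name__ == "__main__":
    print("3+2:", simulate_cache(3, 2, 12))   # swings between 3 and 5 days
    print("5+0:", simulate_cache(5, 0, 12))   # stays topped up at 5 days
    print("10+5 asks for", 10 + 5, "days against a 14 day deadline")
```

In this model 3+2 and 5+0 both peak at 5 days, but only 5+0 keeps the cache full all the time.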

Einstein is a quite reliable project. Major outages are quite rare so there is little risk of not being able to get work when you need it. Probably the thing you should plan for is an outage that occurs 5 mins after close of business on a Friday. It may not be fixed until Monday. In other words 3-4 days of work will more than cover the great majority of (rare) problems that might occur. But don't set that while tweaking and testing your crunching system. You don't really want to trash hundreds of tasks if something goes wrong. Use the full cache setting once you have finished testing and everything is running to your satisfaction.

Quote:
Archae told me that the 970 is a lower spec card and unfortunately I didn't do my due diligence when I bought it. I assumed wrongly that it would be a great leap over the GTX 680 for GPU crunching. I'm not a gamer so the video card is dedicated to Boinc. Does anyone publish ratings for the different GeForce cards?


You can't always predict in advance how good a new card will be based on the performance of an older model. There isn't really any good source of information that will accurately predict how a new card will perform here. I've been quite surprised to see the reports of the relatively poor performance of the 970. On paper, you would think it would do better than it apparently does.

I tend to use comparison lists something like this one to see how new cards might perform compared to old ones. You should read the preamble to understand the numbers listed in the full table. If you then scroll down through the table, you will find these details for the two GPUs in question.

[pre]
                  GPU Clock†   Turbo Clock   Memory Clock   Memory Interface   Memory Bandwidth   CUDA cores
GeForce GTX 680   1,006 MHz    1,058 MHz     6,008 MHz      256-bit            192.2 GB/s         1,536
GeForce GTX 970   1,050 MHz    1,178 MHz     7 GHz          256-bit            224.0 GB/s         1,664
[/pre]

On paper, the above numbers suggest the 970 should be a little better than the 680. It has the same 'width' memory interface and a higher bandwidth and number of CUDA cores. There must be something missing (compared to the 680) that affects crunching performance without being revealed in the above specs. It could even be something that will be improved as the driver matures.
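
To put rough numbers on "a little better", here is a quick sketch that takes the ratios straight from the table. The dictionary keys are my own labels, and a paper ratio like this says nothing about how two different GPU architectures actually behave on a given app.

```python
# Figures copied from the comparison table; key names are illustrative only.
specs = {
    "GTX 680": {"base_clock_mhz": 1006, "bandwidth_gb_s": 192.2, "cuda_cores": 1536},
    "GTX 970": {"base_clock_mhz": 1050, "bandwidth_gb_s": 224.0, "cuda_cores": 1664},
}


def ratio_970_to_680(key: str) -> float:
    """Return the GTX 970 figure divided by the GTX 680 figure."""
    return specs["GTX 970"][key] / specs["GTX 680"][key]


if __name__ == "__main__":
    for key in ("base_clock_mhz", "bandwidth_gb_s", "cuda_cores"):
        print(f"{key}: 970 is {ratio_970_to_680(key):.2f}x the 680")
```

Every ratio comes out somewhere between 1.0 and 1.2, so even on paper this was never going to be a doubling.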

It's a bit of a slog, but the best way to assess what a particular GPU is capable of is to go through the stats for top hosts and find examples for the particular GPU of interest. Unfortunately this is not easy, particularly for multi GPU setups. You can't rely on the GPUs all being of the same type and you'll often have to guess how many concurrent tasks a given GPU may be crunching.

Take Archae86 as an example. Here is a link to the valid BRP6 tasks for the host he mentioned as having a 970 plus a 750Ti and running 2x (2 concurrent tasks per GPU). When you look at his list of hosts, this particular one only mentions the 970, as if there were two of them. When you drill down to the actual tasks as shown in the above link, you will see some taking around 5,700 secs and some taking 12,700 secs. If you click on the taskID link (the 9 digit number in the far left hand column) for one of each of those, you will be able to read the stderr text that was returned to the project from each task. The precise GPU that crunched the task is listed in line 8 of the stderr output.

Because Archae86 told us he was running 2x, we know that the times we see are for completing two tasks which means that the 970 is doing a task every ~2,900 secs and the 750Ti every ~6,400 secs - say 49 and 107 minutes respectively. I decided to have a look at your valid BRP6 tasks and the first thing I noticed was that you are running the cuda32 version of the BRP6 app. There is a newer version based on cuda55 (version 1.57) but unfortunately it's still listed as a test app (beta) so you will need to allow test apps in your project preferences in order to be able to get work for that app. You should do that because it will give you a significant speed improvement. Then you could make a better comparison with the many people who would be using the cuda55 app.
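
The 2x arithmetic above can be sketched as follows. The function name is made up, and it assumes the concurrent tasks on a GPU start and finish at about the same rate, so the elapsed time can simply be divided by the multiplicity.

```python
def per_task_minutes(elapsed_seconds: float, concurrent_tasks: int) -> float:
    """Effective minutes per task when several tasks share one GPU.

    Assumption: throughput is elapsed time divided by the number of
    tasks running at once on that GPU.
    """
    return elapsed_seconds / concurrent_tasks / 60


if __name__ == "__main__":
    # Elapsed times read off the task list, both at 2x.
    print("970 at 2x:  ", round(per_task_minutes(5700, 2), 1), "min per task")
    print("750Ti at 2x:", round(per_task_minutes(12700, 2), 1), "min per task")
    # Phil's 970: 220 minutes elapsed at 2x, versus 95 minutes at 1x.
    print("Phil at 2x: ", per_task_minutes(220 * 60, 2), "min per task")
    print("Phil at 1x: ", per_task_minutes(95 * 60, 1), "min per task")
```

The same calculation shows why Phil's 220 minutes at 2x is a step backwards: it works out to more time per task than his 95 minutes at 1x.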

I didn't 'drill down' into many of your completed tasks but I saw some done by 970s and some where the GPU was identified as a GTX 745. For some obscure reason BOINC is mis-identifying your 2nd GPU (GT 730). There's no reference to a GTX 745 in the big table I linked to earlier. The stderr output lists it as having 384 cuda cores which is exactly what a GT 730 is supposed to have so at least BOINC got that bit correct. Perhaps you'd like to upgrade to the cuda55 app and see what that does to your 970's performance.

Quote:
Now that I've changed the CPU setting to 75% how do I get the GPUs to use the extra cores? With 2 extra cores will the usage change to .5 or for each GPU?


GPU tasks don't "use the extra cores". The CPU is used to transfer stuff to the GPU and then to receive the results back from the GPU. The critical thing is to have CPU resources available for immediate response to service requests from the GPU. If all CPU cores are fully loaded with CPU tasks, the latency involved in stopping one of them in order to respond to a GPU request is what harms the GPU performance. The fractional numbers you see for CPU involvement are rather meaningless. The GPU actually needs very few CPU cycles in total but it needs them often and immediately if possible. An idle CPU core not tied up with a CPU task gives the best chance of instant response. It's very much a case of 'diminishing returns'. Two 'free' cores may do a bit better than one in giving immediate response. The only way to know for sure is to try various combinations and measure the results. NVIDIA GPUs tend to do pretty well with just one 'free' core. AMD GPUs often need more. There is no blanket rule for all situations.

At the end of the day you have to make a choice about which particular science runs interest you the most. If you're only interested in finding radio pulsars, you may choose to exclude CPU tasks entirely. If you're interested in the search for continuous gravitational waves with the new advanced LIGO data, you may very well decide to accept a slightly lower GPU output in order to keep the bulk of your CPU cores crunching GW tasks - at least for the moment when there is not yet a GPU app on the horizon for GW tasks :-).

Quote:
Right now I'm crunching 4 GPU cores; 2 cores for each card and each GPU is using .2 CPUs.


You are actually crunching 4 GPU tasks on two individual GPUs (two per GPU). Each GPU has a very large number of tiny 'cores' so that the crunching can be highly parallelized across a large number of them. This is what makes the GPU so much faster at doing the job compared to a single CPU. The GPU needs the service of transferring data in and data out to be handled by a CPU. In total, the number of CPU cycles needed to do this is relatively small - it just needs to be immediate. If you make more CPU cores available, the GPU can't use them other than when a data transfer is needed.

Previously I mentioned 'diminishing returns' simply because the chance of two separate GPU tasks needing service at exactly the same instant may not be all that high. If you look at Archae86's results for a 970 task, you see CPU times around 650 secs and run time of 6700 secs. So a single CPU core is still going to be idle for 90% of the time when servicing that GPU task and is therefore able to service other GPU tasks as well without too much potential for conflict - at least that's the way it seems to me. I'm no expert so I could easily be wrong but I often run NVIDIA GPUs 2x or even 3x with no 'free' CPU cores and accept a small penalty because I want to maximise CPU task output. Everyone should decide what is best for them.
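
Here is a back-of-envelope version of that reasoning. It assumes each GPU task needs the CPU for a fixed fraction of its run time and that the tasks' requests are independent of each other, which is a crude simplification; the function names are made up for illustration.

```python
def busy_fraction(cpu_seconds: float, run_seconds: float) -> float:
    """Fraction of the run time during which a GPU task needs the CPU."""
    return cpu_seconds / run_seconds


def chance_of_conflict(busy: float, gpu_tasks: int) -> float:
    """Chance that, when one task wants the CPU, at least one of the
    other GPU tasks wants it at the same moment.

    Assumption: the tasks' CPU demands are independent of each other.
    """
    return 1 - (1 - busy) ** (gpu_tasks - 1)


if __name__ == "__main__":
    busy = busy_fraction(650, 6700)  # CPU time and run time quoted above
    print(f"CPU busy for this task: {busy:.1%}, idle: {1 - busy:.1%}")
    for n in (2, 3, 4):
        print(f"{n} GPU tasks sharing one core: "
              f"{chance_of_conflict(busy, n):.1%} chance of a clash")
```

Under these assumptions the clash probability grows with every extra GPU task sharing the core, which fits the idea that a second 'free' core helps, but by less than the first.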

Sorry for how long this has become. I just wanted to give you as much as possible to think about. The exams will be tomorrow :-).

Cheers,
Gary.
