Bringing up a GTX 460 host

archae86
Joined: 6 Dec 05
Posts: 3157
Credit: 7223644931
RAC: 1004783
Topic 196030

Inspired by the thread GPU Price performance curve I included a Gigabyte GTX 460 of the SOC flavor in a new PC I built on October 7, 2011.

In general the experience has been good, but I did hit a number of issues, and I thought that others not yet initiated into the use of CUDA might find them interesting, so I'm starting a thread here. I'll pose some observations, issues, and questions that have come up in separate posts.

For background where it may matter system attributes include:

- Windows 7 Home Premium 64-bit
- Z68 motherboard with 8 Gbytes of RAM installed
- uses the Z68-supported Intel Smart Response function to cache the main, rather slow 750 Gbyte hard drive with a small 20 Gbyte SSD (I conservatively chose the expensive Larson Creek model Intel pushes for this function)
- i5-2500K CPU, running stock 3.3 GHz clock and voltage so far (may undervolt later for power savings)
- the specific GPU is the Gigabyte GV-N460SO-1GI
- the power supply is a Nexus Silent 430, rated at 430W total, of which 396W is supposed to be available at 12V, divided among four rails. Claimed efficiency is over 80% across the full range of interest, with a broad peak hitting 85% centered near 50% of rated output.

While I've mostly run Einstein as shown here, I also signed up this host for SETI as shown here.

archae86
Joined: 6 Dec 05
Posts: 3157
Credit: 7223644931
RAC: 1004783

Bringing up a GTX 460 host

Performance observed

Whether run with one, two, or three BRP4 tasks (BRP4 is the only currently available Einstein CUDA application) simultaneously active on the GPU, this moderately overclocked 460 retires work at rates only slightly different from two WU for each elapsed hour. At the current 500 credit/WU allocation, for a continuously running host this amounts to nearly 24,000 RAC for BRP4 work on the GPU alone, with more output available from CPU tasks.
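
As a quick sanity check on that figure, here is a minimal sketch in Python; the two-WU-per-hour and 500-credit rates are the ones just quoted, the rest is simple arithmetic:

[pre]# Rough credit-rate estimate for continuous BRP4 GPU work, from the figures above.
wu_per_hour = 2          # roughly two BRP4 WUs retired per elapsed hour
credit_per_wu = 500      # current Einstein credit award per BRP4 WU

rac_estimate = wu_per_hour * credit_per_wu * 24   # credits per day for a 24/7 host
print(rac_estimate)      # 24000, i.e. nearly 24,000 RAC from the GPU alone[/pre]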

My i5-2500K CPU is a non-hyperthreaded Sandy Bridge running at 3.3 GHz, so each core is pretty fast by current standards. Nevertheless, even running one-fold (for example by not using an app_info.xml), the support executable which runs on the CPU consumes a substantial amount of CPU time. Going to two-fold or three-fold eats into the available CPU in two ways. First, the reported CPU consumption by the CUDA support application goes up (whether measured per WU or per hour), and second, the reported CPU time required to compute CPU GW jobs goes up as well (presumably reflecting some sort of swapping or other inefficiency). I'll give details in a later post, but this degradation of the available CPU resource is sufficient to make the very small GPU performance gain I've observed on _most_ WUs of questionable net value. On restarts, transitions from running a SETI GPU task, and a few other times, I've sometimes observed substantially higher performance on a single set of 2-fold or 3-fold tasks (higher GPU load, higher power consumption, shorter elapsed time, and higher CPU consumption), but I do not know a method to make this effect persistent.

Here and at SETI some people have advocated running less than a full load of CPU work on GPU hosts, with a view to improving overall performance by reducing time lost as the GPU waits for CPU service (both data movement and at least some computation). I made a trial comparing one-fold work operating in the presence of four CPU GW jobs against operating with no CPU BOINC work at all. The difference in GPU output was negligible, so I doubt that advice is sound for a host like this one running one-fold. However, as I made initial two-fold and three-fold trials, observation with GPU-Z suggested that the GPU was starving frequently in the face of user interaction and otherwise. I guessed that raising the priority of the CPU task supporting the GPU would be constructive, and when an initial trial by hand using the task priority change function of Process Explorer looked favorable, I installed Process Lasso and set it always to raise the priority of the CPU support executable to "above normal". This seems to work well, and I've not observed ill effects, but I do not (yet) have any quantitative data to support the utility of this measure, and think it likely not helpful for one-fold (or default) installations, while possibly useful for two-fold or three-fold.

archae86
Joined: 6 Dec 05
Posts: 3157
Credit: 7223644931
RAC: 1004783

Power consumption

I was surprised that the active power added by this GPU was less than I feared, but also that the idle power added was more than I hoped. The PC as built before I installed the GPU card had an idle power consumption at the wall of about 48 watts, and active power running four Einstein GW tasks (at stock clock and voltage) of just about 100 watts, with somewhat more if SETI MB CPU tasks were active. I use the i5-2500K onboard graphics to drive my monitor, and have not installed the Virtu software to support spreading graphics work over both resources, but the idle power nevertheless increased by about 27 watts, to 75 watts. Active power when running four CPU Einstein GW tasks and one GPU BRP4 task is about 222 watts, an increase of about 122 watts in the active state (far below the manufacturer's TDP of 200 watts--especially considering that I am reporting power at the wall socket, which should be derated by the efficiency of the main PC power supply--about 85% according to Nexus--to infer the power actually delivered to the card).

I dwell on the power point because the implied lifetime cost of power consumption going this route is substantial compared to the initial cost of the graphics card; it also bears on the possible need and cost of an upgraded power supply for the PC, and lastly it affects room heating. (For example, at 11 US cents per kilowatt-hour--cheaper than Europe, more expensive than the northwestern USA, and somewhat typical for BOINC users--this card, if run nonstop at my numbers, would generate extra utility cost of about US $117/year, and I only paid US $150 for the card post rebate.)
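
For anyone who wants to redo the arithmetic with their own tariff, a minimal sketch; the 122 watt increment and 11 cents/kWh are the figures above, so substitute your own:

[pre]# Annual electricity cost of the incremental GPU load, at the rates discussed above.
extra_watts = 122        # measured wall-power increase with one BRP4 GPU task running
price_per_kwh = 0.11     # US$ per kilowatt-hour

kwh_per_year = extra_watts / 1000 * 24 * 365      # about 1069 kWh
cost_per_year = kwh_per_year * price_per_kwh      # about US $117-118 per year
print(round(kwh_per_year), round(cost_per_year))[/pre]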

Substantial though an extra 122 watts is (and more if one goes to two-fold or three-fold operation, or runs SETI), on a BOINC output per watt basis, the incremental addition of a GTX 460 card is highly superior in efficiency to any currently configurable system, and probably to nearly all incremental upgrades of plausible systems (whether over-clocking/over-volting, fast RAM, a higher-spec CPU, or ...).

That said, adding this 222 watts (plus the monitor when I've failed to turn it off) to my study, where the box temporarily resides during commissioning, has quite noticeably warmed the room. As both my study and the computer room which is the ultimate destination for this box are typically cold most hours during winter, extra heat for about twenty hours a day in winter is a plus. I plan to use the BOINC function for allowed hours of operation to reduce the heat added when it is not beneficial, and may also throttle back some of my older (and far less power-efficient, in BOINC output per watt) hosts to try to make this a near power-neutral change to my household.

While the manufacturer of this GTX 460 card recommends a system power supply of over 500 watts, I'm entirely comfortable with my supply vs. consumption situation, and suggest that a merit of the GTX 460 over truly high-end GPUs for BOINC use is the likelihood that many users would find themselves not needing an upgraded supply, not heating their rooms so much, and paying a good deal less to their utility over the life of their system.

archae86
Joined: 6 Dec 05
Posts: 3157
Credit: 7223644931
RAC: 1004783

Work fetch issues

Work fetch in the world of BOINC has flummoxed users who have spent far more time on these projects than I, involving as it does an interaction of initially estimated floating point performance, a project-specific task duration correction factor (DCF) which varies in time and is not always the same on the host as on the central server, application-specific Average Processing Rates (APR) which also vary in time, a magic ten-validation mode change, somewhat mysterious corrective behavior in response to project finished work imbalance (debt), and doubtless other things. While a forum search here and at SETI on some of these terms will get you a host of posts, and some informative entire threads, I've not spotted a succinct, up-to-date overview written to the needs of a person bringing up a new system. So here I'll recount the specific troubles I had, and attempt a very little interpretation and suggestions for action for others bringing up a new host.

I initially started up the host on both SETI and Einstein, with a very short requested work queue (four hours, I think). After I added the GPU card, I soon noticed that I was not getting enough Einstein CUDA BRP4 work to keep the new GPU busy. Generally I'd get just one or two WUs issued (or no more than an hour of work), after which a four-hour deferral would mean no more automatic downloading until long after the GPU had run out of work. User-requested updates still got work. I slowly increased the requested queue length out to several days, but still this behavior continued.

In point of fact, I had two separate problems, both of which were capable of suppressing adequate GPU BRP4 work fetch on my host under prevailing conditions.

One was that something about the way BOINC works and the specific way I brought up the machine set an initial APR value for CUDA BRP4 work an order of magnitude below reality. While, unlike SETI, Einstein does not expose APR from the host page on an "application details" link, one can readily observe the combined effect of current APR and DCF in the estimated completion time (the "remaining" column on the tasks tab of boincmgr, for example). I saw values of five to seven hours instead of the proper half hour when running one-fold, about eleven hours when running two-fold (vs. the correct 1.0 hours), and a whopping twenty-one hours when running three-fold (vs. the correct 1.5 hours). All of these after several dozen results processed in the given configuration.

The second problem involved the relationship between SETI and Einstein work completion, prescribed work share, WU start criteria, and debt management. As will be familiar to those who have studied it, but escaped my own notice in the heat of battle for quite a while, if one greatly reduces work share of one project (I took SETI down from 25% to 1% early on), a relatively small debt reaches the point of requiring action. Oddly to the uninitiated, the action did not take the form of deferring running of SETI or Einstein tasks already in the queue--which continued, though at a very low rate for SETI, even when I was running two-fold and three-fold, contrary to something I expected based on a Richard Haselgrove comment. Instead it took the form of inhibiting work fetch on the Einstein side. Only when I put all Einstein work on suspend and ran down my small SETI queue by about half, did the re-enabled Einstein project request work even at the order-of-magnitude lessened level governed by the hugely erroneous APR.

I believe that even so large an APR error as mine is supposed to be annealed away in time by automatic BOINC processes, but as the system had run for days before I intervened with the flops directive, and had by then returned and gotten validation on many dozens if not hundreds of results, I think a typical user commissioning a new system could get deeply frustrated before it fixed itself. I was finally able to get enough BRP4 work to get past the 4-hour deferral once I raised the requested queue from the (inadequate for this purpose) three days to, eventually, seven days. Before I installed flops directives, that setting gave me a huge (about seven-day) CPU GW queue, and a just barely adequate (somewhat past four hours) GPU BRP4 queue.

As my trial of two-fold and three-fold GPU processing already required the anonymous platform (app_info.xml) mechanism, it was a small but awkward further step to introduce the flops directive. After forum nugget-finding, computation from observed results, and a fair bit of trial and error, I've now (after two weeks) reached a condition in which the DCF hovers near 1.0, the predicted execution times for both GW CPU work and BRP4 GPU work are usually within 20% of the truth, and work fetch seems to be working in an orderly way. I'll give details and some suggestions in a separate post, as this one is already too long.

archae86
Joined: 6 Dec 05
Posts: 3157
Credit: 7223644931
RAC: 1004783

Using flops to tune up work fetch

To the (limited) degree that I understand this, BOINC on a host comes up with a performance estimate for a computing resource (whether a CPU core or a whole or fractional GPU) in units of floating point operations per second. This value, compared to a server-set work content for each task and multiplied by the current DCF fudge factor, then gives the estimated elapsed time on the host, and thus very directly affects the number of tasks sent in response to work requests.
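
In code form, my understanding of that relationship is roughly as sketched below; the BRP4 work-content number is one I back-figured from my own timings for illustration, not a value taken from the project:

[pre]# How (as I understand it) BOINC turns a flops estimate into a predicted task duration:
#   estimated_seconds = task_work_content / device_flops_estimate * DCF
rsc_fpops_est = 1.3e14       # assumed BRP4 work content, back-figured from my timings
flops_estimate = 71.477e9    # the BRP4 GPU flops value I eventually settled on (below)
dcf = 1.0                    # duration correction factor near its ideal value

estimated_seconds = rsc_fpops_est / flops_estimate * dcf
print(estimated_seconds / 3600)   # about 0.5 hours, the correct one-fold figure[/pre]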

The performance estimate for the CPU, I believe, comes from the BOINC benchmarks. Imperfect though they are, for this purpose it seems they are more often than not good enough. I don't know where the estimate comes from for a GPU, but for mine it was clearly massively too low. Using the flops directive in app_info.xml, to my understanding, just substitutes a user-specified numerical value for whatever the system would otherwise use as its initial estimate of hardware performance. So far as I know, it does not short-circuit subsequent experience-based adjustments (anyone know?).

One can find possible starting-point flops numbers in various places, quite likely not right for one's own specific hardware, project, application combination. But by setting a very low queue length in preferences, and inhibiting work fetch in boincmgr, it seems to me one can usually do trial and error very quickly by just looking at the resulting estimated task execution time in the boincmgr task list after restarting boinc with an app_info containing the new set of flops estimates. It is important that the flops (or APR) values for all applications currently requesting work be internally consistent. If they are, completion of a single WU will reset the DCF to an appropriate higher value if it is substantially too low (thus giving excessively short completion estimates), and not so very many must be processed for the DCF to work down to an appropriate value if it is initially too high. But a relative error (say between CPU GW and GPU BRP4--my problem) must await automatic rectification by the APR adjustment mechanism, which appears to run at a massively slower, possibly glacial, pace than the DCF adjustment process.

After several cycles of trial and error, I've decided that I like the following flops values for single-fold BRP4 CUDA processing and CPU GW processing on my specific host:

for the GW CPU task: 4233000000
for the BRP4 GPU task: 71477000000

These so far seem to give a DCF mostly moving back and forth between .95 and 1.05, with the resulting predicted BRP4 task times moving from a little too short to a little too long, while the CPU GW predictions are moderately too long, save for the occasional much slower outlier which somehow catches more than its fair share of surrendering the core to the CUDA support task. More importantly, work fetch seems orderly and closely related to the actual request, with no difficulty at all maintaining enough work in queue to avoid trouble from the regular four-hour deferrals.

For running 2-fold or 3-fold, in principle the GW flops estimate needs to be scaled down a little (reflecting longer elapsed times, both because of utilization lost to the CUDA support tasks and because of some sort of thrashing or other conflict penalty). But the big adjustment is to the GPU flops number, which to a good approximation needs to be scaled down in proportion--one half the one-fold value for two-fold, one third for three-fold.
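
A sketch of that scaling, using my one-fold values from above; the few-percent GW reduction is only a guess:

[pre]# Scaling the one-fold flops estimates for n simultaneous BRP4 tasks on the GPU.
brp4_flops_1fold = 71_477_000_000   # one-fold BRP4 GPU value from above
gw_flops_1fold = 4_233_000_000      # one-fold GW CPU value from above

for n in (2, 3):
    brp4_flops = brp4_flops_1fold / n    # each GPU task takes roughly n times as long
    gw_flops = gw_flops_1fold * 0.97     # assumed few-percent penalty for CPU contention
    print(n, round(brp4_flops), round(gw_flops))[/pre]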

Again I advocate a technique of suspend fetch, restart for trial, observe relative error of the CPU vs. the GPU completion estimates, and revise, until the estimates look close enough to re-enable fetch. For extra safety, I reduce requested queue to four or six hours, and double it after each successful batch of completions until actual fetch resumes.

archae86
Joined: 6 Dec 05
Posts: 3157
Credit: 7223644931
RAC: 1004783

n-fold BRP4 tasks on the GPU--Performance observed

Quote:
Going to two-fold or three-fold eats into the available CPU in two ways. First, the reported CPU consumption by the CUDA support application goes up (whether measured per WU or per hour), and second, the reported CPU time required to compute CPU GW jobs goes up as well (presumably reflecting some sort of swapping or other inefficiency). I'll give details in a later post, but this degradation of the available CPU resource is sufficient to make the very small GPU performance gain I've observed on _most_ WUs of questionable net value.


Quote:
I guessed that raising the priority of the CPU task supporting the GPU would be constructive, and when an initial trial by hand using the task priority change function of Process Explorer looked favorable, I installed Process Lasso and set it always to raise the priority of the CPU support executable to "above normal". This seems to work well, and I've not observed ill effects, but I do not (yet) have any quantitative data to support the utility of this measure, and think it likely not helpful for one-fold (or default) installations, while possibly useful for two-fold or three-fold.


I've now run some pretty long tests--generally over a day per condition--and yet the variability is large enough that I'm not confident of the actual relative performance values. But contrary to my hint quoted above, I now think that on my system running either two or three simultaneous BRP4 tasks on the Fermi card is beneficial. I now doubt that Process Lasso helps in the system condition of my testing (very quiet, almost no user keyboard interaction or other non-BOINC use).

[pre]Fold  Lasso  BRPsecs   GWsecs     BRPcred     GWcred     TotCred     Advantage
1     Y      1914.85   13867.55   22,560.51   6,258.53   28,819.05
2     Y      3610.95   13816.13   23,927.22   6,281.83   30,209.05   1.0482
3     Y      5335.10   14312.21   24,291.95   6,064.09   30,356.04   1.0533
2     N      3615.11   13546.16   23,899.69   6,407.02   30,306.71   1.0516
3     N      5030.79   14179.36   25,761.36   6,120.91   31,882.27   1.1063[/pre]
The credit rates in the table are for relative comparison purposes only, and are in units of indicated credit award per day, using current credit rate of 500 for a BRP4 WU and slightly over 250 for a GW WU.
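
For the curious, those credit rates follow directly from the per-task elapsed times; here is a sketch of the arithmetic, assuming four GW CPU tasks always running and roughly 251 credits per GW WU:

[pre]# Reconstructing the credit/day columns above from the observed elapsed times.
def daily_credit(n_fold, brp_secs, gw_secs, gw_credit=251.1):
    brp_cred = n_fold * 86400 / brp_secs * 500   # n BRP4 tasks in flight, 500 credits each
    gw_cred = 4 * 86400 / gw_secs * gw_credit    # four CPU GW tasks
    return round(brp_cred), round(gw_cred), round(brp_cred + gw_cred)

print(daily_credit(1, 1914.85, 13867.55))   # ~ (22560, 6258, 28819), the first row
print(daily_credit(3, 5030.79, 14179.36))   # ~ (25761, 6120, 31882), the last row[/pre]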

In these cases, I was using Process Lasso to set the cuda32 support application priority to "below normal", imagining this was raising it above the "idle" priority I thought it would otherwise have, thus giving it an earlier place in line for available CPU execution in competition with the idle-priority GW CPU tasks. But, in fact, for my environment, the cuda32 app priority without Lasso intervention is already "below normal", so this probably did nothing.

Even with a day to a day and a half per test point, these reports are materially affected by a very few cases in which the 2-fold or 3-fold work ran much faster than usual ("on step" as reported before). I view the reported advantage to _not_ using Process Lasso as probably an artifact. But I do think there is a small real advantage to running either 2-fold or 3-fold for my environment. A user must decide for themselves whether this small advantage is worth the trouble of setup, and the continued maintenance commitment, as using the anonymous platform disconnects one from automatic update of applications.

archae86
Joined: 6 Dec 05
Posts: 3157
Credit: 7223644931
RAC: 1004783

Internet Traffic

At the time of writing, so far as I know the production Einstein site only supports CUDA work on BRP4, not on GW, nor on the LATnnn (Gamma Ray Pulsar) sets of work. This particular type of work has very high internet data demands, as each individual WU requires the download of eight p*.binary files of size 4 Megabytes each. My particular ISP is Comcast, and on my account page they list 250 Gigabytes/month as my allocation. My monthly usage before I brought this CUDA system to life has recently been accounted by them as 20 to 30 Gigabytes, so I can afford as much as 200 to this cause.

Unlike GW work, for which an initially substantial set of files often enables several subsequent WUs to be processed merely by the transmission of a tiny amount of administrative handshake traffic, it appears that BRP4 work involves the transmission of eight 4 Megabyte files for each WU, independently of all others. As my new GTX 460 host is currently capable of almost 50 BRP4 WUs processing per day, the CUDA-enabled traffic here is about 50*8*4*31 Megabytes/month, or about 50 Gigabytes.

As an alternate calculation, this system started running CUDA BRP4 on October 8, and Comcast listed my October bandwidth consumption as 70 Gigabytes, as compared to a baseline around 20 Gigabytes in preceding months. While this suggests a higher monthly adder for the new system than my calculation above, it does leave me safely under my Comcast 250 limit, even if I add a GTX460 card to one other host, as I am strongly tempted to do. Your situation with your internet provider may be more concerning, possibly even very expensive.
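
The monthly-traffic estimate above, as a small sketch (all figures are the ones quoted in this post):

[pre]# Estimated monthly download volume for BRP4 work on this host.
wu_per_day = 50          # roughly this host's BRP4 throughput
files_per_wu = 8         # eight p*.binary data files per WU
mb_per_file = 4          # about 4 Megabytes each
days_per_month = 31

mb_per_month = wu_per_day * files_per_wu * mb_per_file * days_per_month
print(mb_per_month / 1000)   # about 50 Gigabytes per month[/pre]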

We can hope that the Einstein project will find the GPU contribution sufficiently compelling, the adaptability of the computation to the GPU model adequate, and the available development environment capable enough, to support GPU use in future for applications with much lower Internet-traffic-to-computation ratios (currently the Gravitational Wave or Gamma Ray Pulsar searches).

It does create, for me, a somewhat new condition of protracted download periods when either I extend the requested work queue size, or just the "breathing" of the DCF calculation causes boincmgr to request and be granted many new WUs at once. In their current condition, the Einstein servers have generally averaged between 50 and 500 kbytes/sec for extended downloads of this kind, so the transmission of the couple of dozen units which represents just a half day's work takes many minutes. While I've seen SETI use up that much time on my lesser hosts, it was in periods when the SETI servers were only reaching very low average data rates per download (through opening too many connections, or other of their many problems).
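
To put "many minutes" in perspective, a rough sketch at the server rates just mentioned; the two-dozen-unit batch is the example from the text:

[pre]# How long a burst of two dozen BRP4 downloads takes at typical Einstein server rates.
wu_in_batch = 24                     # roughly half a day's work
kb_per_wu = 8 * 4 * 1000             # eight files of ~4 Megabytes each

for rate_kbps in (50, 200, 500):     # observed range of sustained rates, kbytes/sec
    minutes = wu_in_batch * kb_per_wu / rate_kbps / 60
    print(rate_kbps, round(minutes))  # about 256, 64, and 26 minutes respectively[/pre]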

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5872
Credit: 117631742881
RAC: 35226040

Quote:
... This particular type of work has very high internet data demands ...


Welcome to the joys of cattle ranching ... :-) ;-).

I want to thank you very much for documenting your journey so thoroughly!! I've looked forward to each successive chapter and jumped straight in (to digest it avidly) whenever a new one has appeared. Much as I've wanted to buy a couple of decent GPUs myself for quite a while, I'm now reminded of exactly why I've abstained.

I have two separate internet accounts and a family member has a third. My two have a monthly limit of 30GB each, but have the attraction of 'unlimited' off-peak downloads, off-peak being 2.00am to 2.00pm. Some time ago I tested 'unlimited' to be about 120-150GB before you run the risk of being labeled as an 'excessive' user (leecher) and being booted out of the contract. That was in the days when large data files for GW tasks were being very prematurely deleted. Oliver was good enough to succumb to my 'complaints' about this and the current run seems to be making much better use of these files before they are deleted. So much so that my monthly usage of around 150GB per account is now down to maybe 10-15GB per account and I no longer even bother to 'arrange' for my downloading to occur in off-peak time.

Another reason for not buying suitable GPUs is the thought that the data for the BRP search might be hard to get in the not too distant future. We've done the original Arecibo data, we've done the 'anti-centre' data, we've done the Parkes data and now we're chewing through the latest Arecibo (Mock) data. I guess I'm being overcautious but I don't particularly want to invest in a number of new GPUs if the work runs out shortly.

Yes, there have been statements that other searches will (probably) eventually have a GPU app available so I've just waited to see what might happen. I've been very keen to do Fermi-LAT tasks so I've been content to let others chase the BRP tasks while I do my own thing. I'm particularly attracted by the prospect of new pulsars being found in the LAT data.

Quote:
... We can hope that the Einstein project will find the GPU contribution sufficiently compelling ...


I don't think you need have any concern on this score. When I attended the open day back in July and got to see the plans for the evolution of their in-house computing facilities, GPU crunching seemed to be very much at the top of the agenda. As always in University environments, it's just a matter of balancing all of the things that are on the to-do lists with the (perennially insufficient) manpower available. Bruce is a very impressive leader so I'm sure his plans will come to fruition. I see myself being involved with this project for a long time to come.

Cheers,
Gary.

archae86
Joined: 6 Dec 05
Posts: 3157
Credit: 7223644931
RAC: 1004783

Quote:
it does leave me safely under my Comcast 250 limit, even if I add a GTX460 card to one other host, as I am strongly tempted to do.


Another host, with Lasso and half-populated HT usage

As foreshadowed, I was sufficiently impressed with the GTX 460 behavior on my new machine to find, purchase, and install another card of the exact same model on my year-old daily use primary machine--replacing a very low-end ATI graphics card. It is now running rather nicely, but I had some interesting moments on the way, and have come to a possibly non-conventional configuration. I by no means have performed sufficient properly controlled experimentation to demonstrate advantage to all aspects of my configuration even for this host, let alone for the myriad other variants. Still, you may find this a source of ideas to try in your own case.

Basics of the second host:
CPU: Intel Xeon E5620, a Nehalem-class Westmere quad, running at stock clock and voltage, hyperthreaded
RAM: 3 channels, each with a single 2 Gbyte DDR3 stick running at E5620 stock conditions
OS: Windows 7 Professional 64-bit
GPU: Gigabyte super-overclock variant GTX 460 GV-N460SO-1GI running stock

BOINC version 6.12.33 (x64)
GPU is currently running Einstein BRP4 at 3-fold.

Hyperthreading, CPU affinity, and task priority nudging:

I currently run with HT enabled, but use the BOINC preferences to restrict BOINC CPU task operation to 50% of available (i.e. four of the eight virtual CPU instances). This restriction in practice appears to apply to the GW CPU science application instances, and not to the CPU support tasks associated with BRP4 GPU processing.

I use Process Lasso extensively to modify what Windows would otherwise do with this basic configuration.

1. I use the facility of setting an application-specific default CPU affinity to constrain the GW science application to CPUs 0;2;4;6. I restrict the *cuda32.exe support application for the GPU work to 1;3;5;7, and I also restrict most of the other non-system tasks which contribute appreciably to system resource consumption to 1;3;5;7 (see the affinity sketch after this list). This list is still growing, but currently includes: gpu-z, process explorer, cumulus (a weather station data posting application), chrome, firefox, speedfan, boinc, boincmgr, MS Word, MS Excel, process lasso, and several others.

2. I use the facility of setting an application-specific default priority to raise the priority of the einstein*cuda32.exe support application to Above normal.
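
For readers who think in bitmasks rather than Process Lasso's semicolon lists, a small sketch of how those core lists map onto Windows-style affinity masks (Process Lasso takes the lists directly; the mask arithmetic here is only illustrative):

[pre]# Mapping the logical-CPU lists above onto Windows-style affinity bitmasks.
def affinity_mask(cores):
    mask = 0
    for c in cores:
        mask |= 1 << c          # one bit per allowed logical CPU
    return mask

gw_cores = [0, 2, 4, 6]         # GW science app: one virtual CPU per physical core
support_cores = [1, 3, 5, 7]    # CUDA support app and other user tasks
print(hex(affinity_mask(gw_cores)), hex(affinity_mask(support_cores)))   # 0x55 0xaa[/pre]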

Having given the specific current configuration, I'll mention some evolution history and some guesstimated theory behind my choices. I wished to do my GPU upgrades while adding little to my household power consumption, which I already regard as somewhat high. So as a first step I backed down this E5620 from running 8 Einstein GW tasks hyperthreaded at an appreciable over-volt and over-clock to running non-hyperthreaded at stock clock and voltage (this saved 73 watts of power). My first effort simply used the 3-fold app_info from my first GTX 460 host on this host. The results were miserable. GPU-Z on the GPU load graph showed clear indications of a great amount of pausing of the GPU work, and at the same time the interactive user experience was unacceptably degraded, with hitches and pauses in responding to me.

I reasoned that since Nehalem-generation hyperthreading is shown to impose near zero overhead loss on four tasks running on a quad, I would lose nothing in GW output by going from four tasks nHT to four tasks HT, so long as I used CPU affinity to preclude the "unlucky assignment" case in which Windows puts two GW tasks on the two virtual instances of one core while leaving both virtual instances of another physical core idle. Hence the use of Process Lasso 0;2;4;6 affinity for the GW tasks.

I further reasoned that I especially wanted very fast access to CPU resource by the CUDA support tasks anytime they requested it--hence the elevated priority, and also the restriction to the "opposite half" CPU instances, which would assure that GW would not have to be swapped out just to give the cuda32 task a chance to do some memory movement. With that start, I further reasoned that restricting the other tasks also to 1;3;5;7 might be beneficial in making it less likely that the GW tasks would be swapped out, while giving a set of tasks which in aggregate generally use rather less than one virtual CPU a chance at four CPUs, assuring good response under most conditions. My real goal here is the original SETI model of using up nearly all feasible spare resource, while giving the other tasks on the system as nearly unimpaired performance as feasible.

It is quite likely that some of my reasoning is flawed, and some of my settings not only unrequired but somewhat harmful, but in practice I am very pleased with the results. In contrast to my first trial with no Process Lasso, nHT, a full load of 4 GW jobs, and 3-fold CUDA BRP4, which gave bad interactive results, this setup in well over a day of use remains enjoyable to me. In contrast to the hash of wildly varying (and mostly rather low) GPU load shown by GPU-Z for that first trial, on this trial the graph looks solidly unchanging, and the long-term load average computed by GPU-Z is 83%.

Unlike my first host, where the benefits of 2-fold and 3-fold operation were rather modest, on this system they are quite substantial.

[pre]n-fold  GPU load (%)  estimated system Einstein RAC (CPU GW + GPU BRP4)
1       58            22797
2       76            29196
3       83            31387[/pre]

archae86
Joined: 6 Dec 05
Posts: 3157
Credit: 7223644931
RAC: 1004783

Inspired to further effort by a recent spate of new application versions, I tried a matrix of the number of CPU GW jobs and the number of GPU BRP jobs running on my i5-2500K/GTX 460 host. The watts and electricity cost indications come from my Kill-a-watt meter, with the electricity price set to my recent (but not current) incremental rate of US $0.103/kilowatt-hour. For the more interesting cases, both the power consumption and application productivity numbers are based on as much as a day of run time. Credits/day is calculated using actual observed elapsed times, for samples I deemed adequate in view of the variability.

[pre]nGPU  nCPU  %CPU  Cr/dy  watts  GPU_ut  $/mon   $/Mcred  tsk/day  task gain/day
0     0     100       0    75     0%    $5.63       --
1     0     100   22930   173    68%   $12.75   $18.53    45.86    (45.86)
2     0     100   30534   198    91%   $14.60   $15.94    61.07    (61.07)
3     0     100   31690   205    95%   $15.27   $16.06    63.38    (63.38)
0     1     100    1692    92     0%    $6.74  $132.77     6.74     33.26
1     1     100   25079   188    69%   $13.94   $18.53    53.51    (13.51)
2     1     100   32188   213    91%   $15.79   $16.35    67.73    (27.73)
3     1     100   32817   214    93%   $15.87   $16.12    69.01    (29.01)
0     2     100    3375   104     0%    $7.71   $76.15    13.45     66.55
1     2     100   26958   203    70%   $15.05   $18.61    60.61     19.39
2     2     100   32718   219    88%   $16.24   $16.55    72.14      7.86
3     2     100   33129   225    91%   $16.61   $16.71    72.91      7.09
3     2      50   32218   214    92%   $15.79   $16.34    67.83     12.17
0     3     100    5144   117     0%    $8.75   $56.70    20.50     99.50
1     3     100   27848   216    69%   $15.94   $19.08    65.77     54.23
2     3     100   30642   225    78%   $16.61   $18.07    71.26     48.74
3     3     100   31205   228    81%   $16.90   $18.05    72.23     47.77
0     4     100    6761   126     0%    $9.41   $46.39    26.94    133.06
1     4     100   29287   226    71%   $16.68   $18.98    70.81     89.19
2     4     100   30315   226    75%   $16.68   $18.34    72.39     87.61
3     4     100   31222   230    78%   $17.05   $18.20    73.98     86.02
[/pre]

The task gain/day column refers to a problem on which I've started a new thread. There currently appears to be in force at Einstein a task distribution limit of 40 tasks per day per CPU core allowed to run Einstein work. As any of these configurations employing at least one GPU task can process more than that, all of the configurations with fewer than two active CPU jobs will shortly work their task queue down to zero.
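
In fact the task gain/day column in the table above appears to be exactly this comparison; a sketch, assuming the daily allowance is 40 times the number of CPU cores I allow to run Einstein work:

[pre]# Comparing daily task consumption against the apparent 40-tasks/day-per-core limit.
def task_gain_per_day(tasks_consumed_per_day, cpu_cores_on_einstein):
    daily_allowance = 40 * cpu_cores_on_einstein     # assumed form of the server limit
    return daily_allowance - tasks_consumed_per_day  # negative: the queue will run dry

print(task_gain_per_day(53.51, 1))   # about -13.5, the 1-GPU/1-CPU row above
print(task_gain_per_day(72.14, 2))   # about +7.9, the 2-GPU/2-CPU row stays fed[/pre]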

Aside from that point, I was surprised to see how heavy a burden multiple parallel GPU tasks place on the CPU. Despite the (untested in these configurations) presumed help of Process Lasso's increased priority, slow response from the CPU support task lowered GPU output considerably whenever more than one CPU task was active. So strong is the effect in the case of three parallel GPU tasks that the most productive number of CPU tasks is just two, with the third lowering output (and increasing power consumption).

Until I learned of the 40 task/day problem, my personal preference based on these results was to run with three GPU BRP jobs and one CPU GW job. Just now I am seriously considering a sort of binge/purge cycle: once every couple of days set the host to 3 GPU/4 CPU with 25% maximum CPU usage, wait for it to build up an adequate queue, then switch back to 3 GPU/1 CPU 100% usage for better running.

Robert
Joined: 5 Nov 05
Posts: 47
Credit: 323425152
RAC: 21943

All this good data is just begging to be plotted.

And a cost plot here.

