Support for (integrated) Intel GPUs (Ivy Bridge and later)

Richard Haselgrove
Joined: 10 Dec 05
Posts: 2,059
Credit: 965,468,767
RAC: 1,447,086

SETI are back, so here are the GPU utilisation and power figures Oliver was interested in.

First, some better info for Einstein. HWinfo gives more precise power figures: the GPU-Z capture is the same one as before.

And here is the SETI AstroPulse application running on the same co-processor:

All figures tend to fluctuate, of course, but I think those are pretty representative. Both were taken with SIMAP allowed 75% CPU occupancy.

So, SETI gains an extra 1% on GPU loading, at a cost of another 8W on 'package' and 'IA Cores' power consumption?

archae86
Joined: 6 Dec 05
Posts: 3,064
Credit: 5,783,330,587
RAC: 3,860,971

RE: So, SETI gains an extra

Quote:
So, SETI gains an extra 1% on GPU loading, at a cost of another 8W on 'package' and 'IA Cores' power consumption?


Possibly you could get a better comparison of GPU loading (and temperature, possibly others) by exiting GPU-Z just before you launch a task under test, then relaunching it and choosing the AVG display option for the parameters where an average over most of a task execution makes sense.

Oliver Behnke
Moderator
Administrator
Joined: 4 Sep 07
Posts: 893
Credit: 25,165,933
RAC: 0

RE: I think the confusion

Quote:


I think the confusion arose when I looked at the changelog for cs_account.cpp - the file which Claggy posted in full.

There are a couple of lines:

Quote:
@1174b00 8 months oliver.bock - client/manager: tweaks to Intel GPU code
@ce87ec9 8 months oliver.bock OpenCL: First pass at adding support for Intel Ivy Bridge GPUs

and things like http://boinc.berkeley.edu/trac/changeset/ce87ec9848643a094337f67f78a1d5077cf7f772/boinc-v2/client/cs_account.cpp - all arising from the clean-up in March - which perhaps still lead to the blame being wrongly put on Oliver....

Oh well, I said it before and I say it again: Trac sucks for projects using git because of its pure SVN heritage. There's been a new version around with improved git support for quite some time now, but BOINC just hasn't got round to upgrading it.

Git differentiates between author and committer. While the changelog displays the committer as author (!), the commit details do show the differences:

git-author: Charlie Fenton  (12/05/12 05:11:20)
git-committer: Oliver Bock  (03/04/13 06:23:39)

This means you want to ask Charlie in this case :-)
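
For anyone who wants to check this without Trac: git can print both fields itself. Below is a minimal sketch, assuming a local clone of the repository and git on the PATH. The format string uses the standard `%h`, `%an` and `%cn` placeholders; the helper names and the sample hash are made up for illustration.

```python
import subprocess

# Tab-separated fields: abbreviated hash, author name, committer name.
LOG_FORMAT = "%h%x09%an%x09%cn"


def parse_log_line(line):
    """Split one line of `git log --format=LOG_FORMAT` output into a dict."""
    commit, author, committer = line.rstrip("\n").split("\t")
    return {"commit": commit, "author": author, "committer": committer}


def author_vs_committer(path, repo="."):
    """Return author and committer for every commit touching `path`."""
    out = subprocess.run(
        ["git", "-C", repo, "log", f"--format={LOG_FORMAT}", "--", path],
        check=True,
        capture_output=True,
        text=True,
    ).stdout
    return [parse_log_line(line) for line in out.splitlines() if line]


# Hypothetical sample line: the hash is invented, the names are the ones
# shown in the commit details above.
sample = "abc1234\tCharlie Fenton\tOliver Bock"
print(parse_log_line(sample))
```

Calling `author_vs_committer("client/cs_account.cpp")` inside a clone would list the same author/committer split for every commit that touched the file.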

Best,
Oliver

 

Einstein@Home Project

Oliver Behnke
Moderator
Administrator
Joined: 4 Sep 07
Posts: 893
Credit: 25,165,933
RAC: 0

RE: RE: As far as I'm

Quote:
Quote:

As far as I'm aware we don't use any OpenCL 1.2 specific features. In fact I'm surprised that we're actually using 1.2 to build our apps. Are you sure about that? Where did you see this?

Best,
Oliver


It crops up periodically in threads like this morning's "does Einstein have WUs for an INTEL GPU?" :P

Heh, fair enough :-) I just had a look at our plan classes and we do indeed require OpenCL 1.2 for the Intel OpenCL apps. I have no clue why we raised that from 1.1 since I wasn't involved in the Intel GPU app release. I'll check with Bernd for the reasoning behind that decision.

Update: the reason for this decision was that we noticed some Intel driver issues and it was easiest to enforce working driver versions by raising the OpenCL level to 1.2. So, as I mentioned above, we don't use any 1.2 features. In fact, all our OpenCL apps are the same binaries, no vendor-specific builds.

Oliver

 

Einstein@Home Project

Raistmer*
Joined: 20 Feb 05
Posts: 198
Credit: 87,766,779
RAC: 156,990

Hello
I would like to join this discussion of "Einstein vs SETI" on Intel GPUs.

In SETI we have a few different OpenCL apps; let's talk exclusively about AstroPulse for now. It exists in 3 versions, for ATi, NV and Intel GPUs. There is some fine-tuning of kernels and workgroup sizes between the NV and ATi versions. The Intel one did not get any specific tuning (besides a single fix required for Intel's OpenCL implementation). In general, all three use the same approach to host vs GPU synchronization. IMO for this discussion all 3 can be viewed as the same app (just as for Einstein's app).

All versions have quite good GPU usage. Richard already referenced the typical case for the iGPU; the load on AMD and NV is the same in general.
But the CPU usage is quite different.
For AMD: CPU usage is low (a few % of a single CPU core).
For NV: with old drivers like 263.06 the CPU load is very low (a few %), but with modern drivers it's 100% of a CPU core.
For Intel: again 100% core usage. But as with NV there is a strong dependence on the driver in use. Before trying Einstein's app I used the OpenCL 1.1 Intel driver:
100% CPU usage, and a big reduction in CPU usage (and a big increase in run time) when the CPU was fully loaded. Now, with recent OpenCL 1.2 Intel drivers, the 100% CPU usage remains on a fully loaded CPU too, and the run time increase is not as big as before (the app was not rebuilt against the new SDK, just the kernels were rebuilt with the new driver's compiler).

Hence, all this "CPU usage" stuff directly depends on the drivers in use (that's the case for AMD too. Some old AMD drivers allowed running a single copy of AP w/o speed loss on a fully loaded GPU, just as CUDA allows. Unfortunately, an attempt to run 2 copies, or something like YouTube, led to a BSoD on those drivers).
This led me to the conclusion that the increased CPU consumption is a consequence of the host-GPU synching used in the particular driver. AP is an OpenCL 1.0 compatible app, so it can run on any released driver starting from the very first ones.

This question (why 100% CPU usage) arises constantly. I did a comparison with MilkyWay's OpenCL app and found that they use quite constant and big kernel launches (they can split the work into equal parts of ~20ms each). In the SETI app I can't do the same for now, unfortunately. There are big kernel launches and small ones, depending on the particular search being done. And the bigger a single kernel launch is, the less important the synching between GPU and host becomes. Quite possibly the synching policy for long kernels is yield, while for small ones it's a spin-wait polling loop.

So I'm interested in the typical kernel launch size for the Einstein app. I'm not aware whether any profiling tools exist for Intel, but as both our apps are capable of running on all 3 GPU types, profiling on NV for example (or on ATi) would be quite enough for my purpose. I have rich profiling data on ATi GPUs for comparison.

Also, are the sources of Einstein's app available, and if yes, where?

P.S. And let's not compare power usage for now. Power usage depends not only on GPU load but on CPU load too. Hence 1% better on the GPU but 100% vs a few % on the CPU leaves the question of where the 10% power increase comes from quite open. The main goal is to understand why Einstein's app can use only a few % of a CPU core with good GPU load, while SETI AP uses 100% with roughly the same GPU load.

Raistmer*
Joined: 20 Feb 05
Posts: 198
Credit: 87,766,779
RAC: 156,990

RE: RE: Raistmer is

Quote:

Quote:

Raistmer is getting his 99% CPU usage with OpenCL 1.1 - you are getting 3% with OpenCL 1.2: could that be significant?

I doubt it. As far as I'm aware we don't use any OpenCL 1.2 specific features. In fact I'm surprised that we're actually using 1.2 to build our apps. Are you sure about that? Where did you see this?

Best,
Oliver

More precisely: SETI AP is an OpenCL 1.0 app. It can run under all OpenCL drivers starting from the very first ones (it appeared quite long ago, right when AMD started to implement OpenCL on their GPUs).
If Einstein's app is a true OpenCL 1.1 one, it can use some different methods for host-GPU synching (events). So I wonder whether that is the case or not?
OpenCL 1.2 features are irrelevant for this.

Raistmer*
Joined: 20 Feb 05
Posts: 198
Credit: 87,766,779
RAC: 156,990

RE: Hm, it's not trivial

Quote:

Hm, it's not trivial to profile individual kernels in terms of absolute runtime with the tools at our disposal. Also, it's not just a matter of the individual kernels but rather the workgroup size you use and the total number of work items as both parameters define the "length" of the actual kernel call. One also needs to take into account that each GPU series can have different limits for the workgroup size so you have to probe and adjust them for optimal performance (hint!). This concerns AMD GPUs much more than NVIDIA GPUs, not sure about the Intel GPUs...


Sure, a particular kernel's invocation time directly depends on the launch dimensions (and, as one of the parameters, on the workgroup size and its dimensions too).
But current GPU profilers can give the launch time for a particular kernel call even if the grid size changes from launch to launch. AMD's can do this, and NV's; I'm not aware of what Intel's profiler can do.

Because we are talking here about differences in app behavior on a single host, with the same hardware used for both apps, a direct comparison is possible.
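
To make the "launch dimensions" point concrete: in OpenCL 1.x, when a local work size is passed explicitly to clEnqueueNDRangeKernel, the global work size has to be a multiple of it, so probing different workgroup sizes also changes how many work-items are actually launched. A small sketch in Python; the item count and the candidate sizes are arbitrary example values, not figures from either app, and the function names are just illustrative.

```python
def padded_global_size(n_items, local_size):
    """Round n_items up to the next multiple of local_size."""
    return ((n_items + local_size - 1) // local_size) * local_size


def launch_shape(n_items, local_size):
    """Describe one candidate launch configuration for n_items of real work."""
    global_size = padded_global_size(n_items, local_size)
    return {
        "local": local_size,
        "global": global_size,
        "groups": global_size // local_size,
        # Padding work-items that the kernel has to skip via a bounds check.
        "idle_items": global_size - n_items,
    }


# Arbitrary example: one million work-items, three candidate workgroup sizes.
for local in (64, 128, 256):
    print(launch_shape(1_000_000, local))
```

A real probe would time the kernel for each candidate and keep the fastest one that the device accepts.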

Raistmer*
Joined: 20 Feb 05
Posts: 198
Credit: 87,766,779
RAC: 156,990

RE: b) you might be

Quote:


b) you might be comparing apples to oranges here as we don't know how CUDA implements its CPU/GPU threading model and how that compares to OpenCL's implementation. OpenCL might simply be more demanding in terms of its feeding requirements or context/thread switching respectively. The individual drivers play a crucial role here and, frankly, let's not discuss those...

Exactly. Let's leave CUDA untouched. OpenCL doesn't expose synching control while CUDA does. Also, indeed there are huge differences between drivers. Let's not discuss them.
But let's discuss the possible reasons for the differences in behavior between our 2 apps under one particular ("current") Intel OpenCL driver (same hardware, same system software but different outcome). And use references to other GPUs and drivers just to look at the picture more widely.

Raistmer*
Joined: 20 Feb 05
Posts: 198
Credit: 87,766,779
RAC: 156,990

RE: I don't think OpenCL

Quote:


I don't think OpenCL spins but rather yields control to the OS. That's why our app gets as low as 3% CPU usage. It would be up to 100% if it would spin. CUDA offers a setting to define this, OpenCL doesn't. Fortunately the default choice is the right one for the BOINC use case.

HTH,
Oliver

Actually, the policy could be "spin when small and yield when big".
Hence big kernel launches allow small CPU usage, and an app with small kernel launches would suffer from increased CPU usage.
I want to understand whether this guess is true, or whether the reason for the differences we observe lies in something else.
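
To put rough numbers on that guess, here is a toy model, not a measurement. It assumes the driver busy-waits for any kernel shorter than some threshold and otherwise sleeps, paying a fixed CPU cost per launch. The threshold (1 ms) and the per-launch overhead (0.05 ms) are invented values; no driver documents such figures.

```python
def cpu_fraction(kernel_ms, spin_threshold_ms=1.0, launch_overhead_ms=0.05):
    """Fraction of one CPU core used by the host thread under the guessed policy.

    All parameters are hypothetical and chosen only to illustrate the idea.
    """
    if kernel_ms < spin_threshold_ms:
        # Spin-wait: the host thread stays busy for the whole kernel duration.
        return 1.0
    # Yield: the host only pays the fixed launch/wake-up cost per kernel.
    return launch_overhead_ms / (kernel_ms + launch_overhead_ms)


for ms in (0.2, 0.5, 2.0, 20.0):
    print(f"{ms:5.1f} ms kernel -> {cpu_fraction(ms):.1%} of a core")
```

With these made-up numbers a 20 ms kernel costs well under 1% of a core, while anything below the threshold pins the core at 100%. An app that mixes both kinds of launches would land somewhere in between, weighted by how much time it spends in each.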

ExtraTerrestrial Apes
Joined: 10 Nov 04
Posts: 770
Credit: 427,429,026
RAC: 134,722

Hey guys, now that we're seriously talking about this.. I'd like to first quote myself:

Quote:
We had a similar, or at least "probably related" problem at Collatz. The app was tuned for good GPU utilization but consumed an entire thread, despite being programmed not to do so. I experimented with various parameters and could reduce the CPU time to next to nothing by using shorter kernels (I think, can't remember the exact term). This reduced GPU utilization significantly, but could be made up for by running 2 or 3 WUs in parallel. I think one thread still had to be left free, but at least it wasn't working hard any more.


The data is over there. So here the situation was quite clear: the CC app did not ask for continuous polling, yet the Intel driver used it depending on parameters. By making the individual work packages smaller I could finally get the CPU usage down to where it should have been. At the cost of throughput, which I made up for by running multiple WUs in parallel.

But ever since Einstein@Intel GPU appeared I haven't looked back, despite the lower credits :)

MrS

Scanning for our furry friends since Jan 2002
