Support for (integrated) Intel GPUs (Ivy Bridge and later)

Oliver Behnke
Moderator
Administrator
Joined: 4 Sep 07
Posts: 984
Credit: 25171438
RAC: 34

Quote:

Actually, the policy could be "spin when small and yield when big".
Hence big kernel launches allow low CPU usage, while an app with small kernel launches would suffer from increased CPU usage.
I want to understand whether this guess is true, or whether the differences we observe have some other cause.

Our custom kernels (see source code) are small in my opinion, but we have a large number of work items (up to 2^25 IIRC). The work group size is determined (limited) dynamically at runtime to respect the underlying hardware.
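
For illustration only, and not taken from the Einstein@Home sources: a minimal sketch of how a host program might limit the work group size at runtime. The function names are hypothetical. `device_max` stands in for a limit the host would query from the driver (for example `CL_KERNEL_WORK_GROUP_SIZE` via `clGetKernelWorkGroupInfo`); here it is just a parameter.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: clamp a preferred work group size to the limit
 * reported for the device/kernel. */
size_t pick_local_size(size_t preferred, size_t device_max)
{
    assert(preferred > 0 && device_max > 0);
    return preferred < device_max ? preferred : device_max;
}

/* Hypothetical helper: round the number of work items up to a multiple
 * of the work group size. OpenCL 1.x requires the global size to be
 * evenly divisible by the local size, so the kernel has to ignore the
 * padding items itself. */
size_t padded_global_size(size_t work_items, size_t local_size)
{
    assert(local_size > 0);
    return (work_items + local_size - 1) / local_size * local_size;
}
```

With 2^25 work items and a work group size of 256 no padding is needed; with 1000 work items the global size would be padded to 1024.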

Update: I just noticed that our "kernelPowerSpectrum*" kernels have become more complex these days :-) I have to admit that I haven't looked at the code for quite a while. So your profiling efforts could indeed be interesting.

Oliver

Einstein@Home Project

Richard Haselgrove
Joined: 10 Dec 05
Posts: 2143
Credit: 2957453003
RAC: 714477

This gets curiouser and curiouser. I pointed out that my Einstein tasks run for about 11 minutes when the CPU is 75% loaded, and only record ~20 seconds CPU time. But if the CPU is 100% loaded, the elapsed time jumps massively to around 80 minutes (extrapolated - I didn't complete a full one).

I've just tried the reverse experiment with SETI (astropulse) running on the same hardware - task list.

The most recent one - 15081890, reported 17 Oct 2013, 14:00:10 UTC - was run with 100% CPU loading throughout, and the extra elapsed time is barely noticeable (though some of the SIMAP tasks running in parallel took longer than usual).

From the outside, it looks as if Oliver's app needs a free CPU, but doesn't actually use it much: Raistmer's app uses a lot of CPU, but doesn't actually need it.

Just to check, I had a poke through Task Manager while the Einstein app was running in 75% CPU mode, to see if extra CPU time was being accounted for in other places. Probably not a robust enough test (even with 'show processes for all users' checked), but the most I could see was occasional spikes up to 2% in 'NT Kernel & System'. A fairly steady 23% was being allocated to 'System Idle Process'.

Oliver Behnke
Moderator
Administrator
Joined: 4 Sep 07
Posts: 984
Credit: 25171438
RAC: 34

Quote:

From the outside, it looks as if Oliver's app needs a free CPU, but doesn't actually use it much: Raistmer's app uses a lot of CPU, but doesn't actually need it.

Doesn't this make sense? I mean if our app doesn't use much CPU it depends on very fast context switching for the few parts that do run on the CPU. SETI's app does seem to use the CPU extensively, which is why it doesn't depend so much on the context switching - it already is running more or less continuously on the CPU, so it doesn't compete with other CPU processes for time slices, but our app does...

Einstein@Home Project

Raistmer*
Joined: 20 Feb 05
Posts: 208
Credit: 181421324
RAC: 7968

Yes. Because AP uses ~100% of a CPU, it's not correct to say that AP "doesn't need it". It just doesn't share it ;)
The strange thing is that this reaction to a fully loaded CPU appeared only after the last driver update on my host (from the OpenCL 1.1 driver to the OpenCL 1.2 driver). Because Richard runs OpenCL Einstein more than I do, it's quite possible that he updated his driver long before I did (Einstein requires the OpenCL 1.2 driver), so he has no pre-OpenCL 1.2 data points to compare with. I have. The older driver reacted differently: under full load, elapsed time increased considerably, but CPU time decreased. I consider that very fact a confirmation of my "spend CPU on synching" theory. When the CPU is not available immediately, the app spends less time in waiting loops (by the time it gets switched in, the GPU is mostly ready already).

EDIT: maybe the last driver change altered the priority of the corresponding driver thread, or something like that - now the app continues to use the CPU under full load.

EDIT2: typical example for old driver (fully loaded CPU):

32,186.11 16,924.63 (elapsed/CPU).
And current situation:
26,569.22 26,345.06 (elapsed/CPU)
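
To make the two pairs of numbers easier to compare, here is a small sketch that computes which fraction of the elapsed time was accounted as CPU time. The function name is made up for illustration, and the values are assumed to be in seconds.

```c
#include <assert.h>

/* Hypothetical helper: fraction of wall-clock time that was accounted
 * as CPU time. Both arguments must use the same unit (seconds assumed). */
double cpu_fraction(double elapsed, double cpu_time)
{
    assert(elapsed > 0.0);
    return cpu_time / elapsed;
}
```

For the figures above this gives roughly 0.53 for the old driver and roughly 0.99 for the current one.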

Richard Haselgrove
Joined: 10 Dec 05
Posts: 2143
Credit: 2957453003
RAC: 714477

If you were following the v7.2.18 release thread on boinc_alpha, and the discussion about OpenCL detection on CPU (yes, not GPU - I got that wrong too), you'll remember that this machine was supplied from the factory with OpenCL 1.2 drivers for the HD 4600 - so no, I have no Astropulse times under the OpenCL 1.1 driver for comparison.

As part of the v7.2.18 testing, I downloaded the additional Intel SDK and runtime support for OpenCL on CPU - that wasn't pre-installed.

ExtraTerrestrial Apes
Joined: 10 Nov 04
Posts: 770
Credit: 578166875
RAC: 203462

Quote:
Unfortunately, I have absolutely no idea what that "kernels per reduction" does in his app. So it's hard to comment on how it would change the load.


I'm not that deep into CC either. But in principle they're checking huge integers (many in parallel on the GPUs) against the Collatz conjecture by running an algorithm on them ("3n+1"). The numbers thereby gradually become smaller, until the algorithm terminates with "CC still holds true" or "not" (which has never happened so far). So the "reductions per kernel" could be the number of algorithm iterations performed per kernel call.
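
A rough sketch of that reading of "reductions per kernel", purely as an assumption about what the setting means and not taken from the Collatz app: each call performs at most a fixed number of 3n+1 steps and then returns. The function name is made up, and plain 64-bit integers are used here instead of the huge integers the project works with, so 3n+1 could overflow for very large inputs.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for one kernel call: apply at most max_steps
 * Collatz steps to *n, stopping early once 1 is reached.
 * Returns the number of steps actually performed. */
unsigned collatz_batch(uint64_t *n, unsigned max_steps)
{
    assert(n != 0);
    unsigned steps = 0;
    while (steps < max_steps && *n != 1) {
        /* even: halve; odd: 3n+1 */
        *n = (*n % 2 == 0) ? *n / 2 : 3 * *n + 1;
        ++steps;
    }
    return steps;
}
```

Under this reading, a larger value means more work per call and fewer calls (and fewer host/GPU synchronisations) for the same number.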

MrS

Scanning for our furry friends since Jan 2002

Raistmer*
Joined: 20 Feb 05
Posts: 208
Credit: 181421324
RAC: 7968

Then the larger the number, the bigger the kernel, not the reverse.

Raistmer*
Joined: 20 Feb 05
Posts: 208
Credit: 181421324
RAC: 7968

Quote:
If you were following the v7.2.18 release thread on boinc_alpha, and the discussion about OpenCL detection on CPU

No, I don't follow that conversation, and right now I feel a strong temptation to unsubscribe from all BOINC lists altogether.
The fundamental quota management issue is completely ignored, yet there is a very lively discussion about where to put a button on the Android interface screen and how to properly detect Windows 8.1...

ExtraTerrestrial Apes
Joined: 10 Nov 04
Posts: 770
Credit: 578166875
RAC: 203462

Quote:
Then the larger the number, the bigger the kernel, not the reverse.


That's how I understand it as well. But still, I was able to get CPU usage under control by setting smaller values.

MrS

Scanning for our furry friends since Jan 2002

Raistmer*
Joined: 20 Feb 05
Posts: 208
Credit: 181421324
RAC: 7968

Looked at Einstein's sources - only one difference spotted.
You call clFinish directly every time synching is needed. I sync indirectly, through blocking reads, when required.

That is, Einstein:

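enqueue();          // queue the kernel
clFinish();         // host waits for the kernel
bufferRead(false);  // non-blocking read
clFinish();         // host waits for the read

SETI:

enqueue();          // queue the kernel
bufferRead(true);   // blocking read, the only host wait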

Whether this difference could have such big consequences, I have no idea right now.
The next thing will be to determine the size of the kernel calls.
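
To make the comparison concrete, here is a toy model of the two patterns. It uses no OpenCL at all: `mock_queue` and the `mock_*` functions are made up for this sketch and only stand in for `clEnqueueNDRangeKernel`, `clFinish` and `clEnqueueReadBuffer` (with its `blocking_read` argument) on an in-order queue. All it does is count how often the host thread has to wait per iteration.

```c
#include <assert.h>
#include <stdbool.h>

/* Mock command queue (hypothetical, not an OpenCL type). */
typedef struct {
    int pending;     /* commands enqueued but not yet waited for */
    int host_waits;  /* how many times the host thread had to block */
} mock_queue;

/* Stands in for clEnqueueNDRangeKernel: returns immediately. */
void mock_enqueue_kernel(mock_queue *q) { assert(q != 0); q->pending++; }

/* Stands in for clFinish: host blocks until the queue is drained. */
void mock_finish(mock_queue *q)
{
    assert(q != 0);
    q->host_waits++;
    q->pending = 0;
}

/* Stands in for clEnqueueReadBuffer. On an in-order queue a blocking
 * read also waits for everything queued before it. */
void mock_read_buffer(mock_queue *q, bool blocking)
{
    assert(q != 0);
    q->pending++;
    if (blocking) {
        q->host_waits++;
        q->pending = 0;
    }
}

/* Pattern described for Einstein: explicit clFinish after each step. */
void einstein_style_step(mock_queue *q)
{
    mock_enqueue_kernel(q);
    mock_finish(q);
    mock_read_buffer(q, false);
    mock_finish(q);
}

/* Pattern described for SETI: a single blocking read. */
void seti_style_step(mock_queue *q)
{
    mock_enqueue_kernel(q);
    mock_read_buffer(q, true);
}
```

In this model the first pattern makes the host wait twice per iteration and the second once. How much CPU time each wait costs depends on how the driver implements waiting (spinning or yielding), which the model does not capture.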
