Support for (integrated) Intel GPUs (Ivy Bridge and later)

Oliver Behnke

Moderator

Administrator

Joined: 4 Sep 07

Posts: 949

Credit: 25167626

RAC: 5

RE: Actually, the policy

17 Oct 2013 9:58:32 UTC

Message 117457 in response to message 117445

Quote:

Actually, the policy could be "spin when small and yield when big".
Hence big kernel launches allow low CPU usage, while an app with small kernel launches would suffer from increased CPU usage.
I want to understand whether this guess is true, or whether the differences we observe have some other cause.

Our custom kernels (see source code) are small in my opinion, but we have a large number of work items (up to 2^25 IIRC). The work group size is determined (limited) dynamically at runtime to respect the underlying hardware.
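As a rough illustration of limiting the work group size at runtime, here is a minimal sketch. It is not taken from the Einstein@Home sources: the function name, the parameters and the rounding rule are assumptions. In real OpenCL host code the device limit would typically come from a query such as clGetKernelWorkGroupInfo with CL_KERNEL_WORK_GROUP_SIZE, and OpenCL 1.x expects the global size to be a multiple of the local size.

```python
def launch_geometry(work_items, preferred_local, device_max_local):
    """Pick a local (work group) size and a padded global size.

    Hypothetical helper: all names are illustrative, not from the app.
    """
    # Never exceed what the device/kernel reports as its maximum.
    local = min(preferred_local, device_max_local)
    # Round the number of groups up so every work item is covered.
    groups = -(-work_items // local)  # ceiling division
    return local, groups * local


# Up to 2^25 work items, as mentioned in the post.
print(launch_geometry(2**25, 256, 512))
# A device with a smaller limit forces a smaller work group.
print(launch_geometry(1000, 256, 64))
```

With padding like this, the kernel itself has to ignore work items whose index is beyond the real problem size.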

Update: I just noticed that our "kernelPowerSpectrum*" kernels have become more complex these days :-) I have to admit that I haven't looked at the code for quite a while. So your profiling efforts could indeed be interesting.

Oliver

Einstein@Home Project

Richard Haselgrove

Joined: 10 Dec 05

Posts: 2142

Credit: 2780372725

RAC: 749451

This gets curiouser and

17 Oct 2013 14:30:39 UTC

Message 117458

This gets curiouser and curiouser. I pointed out that my Einstein tasks run for about 11 minutes when the CPU is 75% loaded, and only record ~20 seconds CPU time. But if the CPU is 100% loaded, the elapsed time jumps massively to around 80 minutes (extrapolated - I didn't complete a full one).

I've just tried the reverse experiment with SETI (astropulse) running on the same hardware - task list.

The most recent one - 15081890, reported 17 Oct 2013, 14:00:10 UTC - was run with 100% CPU loading throughout, and the extra elapsed time is barely noticeable (though some of the SIMAP tasks running in parallel took longer than usual).

From the outside, it looks as if Oliver's app needs a free CPU, but doesn't actually use it much: Raistmer's app uses a lot of CPU, but doesn't actually need it.

Just to check, I had a poke through Task Manager while the Einstein app was running in 75% CPU mode, to see if extra CPU time was being accounted for in other places. Probably not a robust enough test (even with 'show processes for all users' checked), but the most I could see was occasional spikes up to 2% in 'NT Kernel & System'. A fairly steady 23% was being allocated to 'System Idle Process'

Oliver Behnke

Moderator

Administrator

Joined: 4 Sep 07

Posts: 949

Credit: 25167626

RAC: 5

RE: From the outside, it

17 Oct 2013 14:38:25 UTC

Message 117459 in response to message 117458

Quote:

From the outside, it looks as if Oliver's app needs a free CPU, but doesn't actually use it much: Raistmer's app uses a lot of CPU, but doesn't actually need it.

Doesn't this make sense? I mean if our app doesn't use much CPU it depends on very fast context switching for the few parts that do run on the CPU. SETI's app does seem to use the CPU extensively, which is why it doesn't depend so much on the context switching - it already is running more or less continuously on the CPU, so it doesn't compete with other CPU processes for time slices, but our app does...
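The trade-off described here (a waiter that gives up the CPU depends on being scheduled back quickly, while one that spins keeps the core busy) can be illustrated with a hybrid wait. This is only a sketch of the general "spin briefly, then yield" idea; it is an assumption, not how any OpenCL driver or either app is known to behave, and the function name and time budgets are made up.

```python
import threading
import time


def hybrid_wait(done, spin_seconds=0.001, sleep_seconds=0.001):
    """Wait for `done` (a threading.Event); report which phase saw it.

    Spins (burning CPU) for up to spin_seconds, then falls back to
    sleeping between polls, which frees the CPU but relies on the
    scheduler to wake this thread again promptly.
    """
    deadline = time.perf_counter() + spin_seconds
    while True:
        if done.is_set():
            return "spin"   # short wait: completed while busy-polling
        if time.perf_counter() >= deadline:
            break
    while not done.is_set():
        time.sleep(sleep_seconds)  # long wait: give the CPU away
    return "yield"


# A "small" job finishes inside the spin budget, a "big" one does not.
quick = threading.Event()
quick.set()
slow = threading.Event()
threading.Timer(0.05, slow.set).start()
print(hybrid_wait(quick), hybrid_wait(slow))
```

On a fully loaded machine the sleeping phase is the one that suffers, because the thread has to compete for a time slice every time it wakes up.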

Einstein@Home Project

Raistmer*

Joined: 20 Feb 05

Posts: 208

Credit: 180017207

RAC: 27036

Yes, cause AP uses ~100% of

17 Oct 2013 16:11:00 UTC

Message 117460 in response to message 117459

Yes, because AP uses ~100% of the CPU, it's not correct to say that AP "doesn't need it". It just does not share it ;)
The strange thing is that this reaction to a fully loaded CPU appeared only after the last driver update on my host (from an OpenCL 1.1 to an OpenCL 1.2 driver). Because Richard runs OpenCL Einstein more than I do, it's quite possible he updated the driver (Einstein requires an OpenCL 1.2 driver) long before I did, so he has no pre-OpenCL 1.2 data points to compare with. I have. The older driver reacted differently: under full load, elapsed time increased considerably, but CPU time decreased. That very fact is what I consider confirmation of my "spend CPU on synching" theory. When the CPU is not available immediately, the app has less time for waiting loops (the GPU is mostly ready already by the time the app is switched back in).

EDIT: maybe with the last driver change the priority of the corresponding driver thread was changed, or something like that - now the app continues to use CPU under full load.

EDIT2: typical example for old driver (fully loaded CPU):

32,186.11 16,924.63 (elapsed/CPU).
And current situation:
26,569.22 26,345.06 (elapsed/CPU)
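To make the comparison easier to read, the two pairs above can be turned into a CPU-to-elapsed ratio. A minimal sketch, assuming both figures are in seconds as BOINC usually reports them; the labels and the helper name are illustrative.

```python
# (elapsed, CPU) pairs copied from the post above; units assumed to be seconds.
runs = {
    "old driver": (32186.11, 16924.63),
    "current driver": (26569.22, 26345.06),
}


def cpu_fraction(elapsed, cpu):
    """Share of wall-clock time that was charged to the task as CPU time."""
    return cpu / elapsed


for label, (elapsed, cpu) in runs.items():
    print(f"{label}: CPU/elapsed = {cpu_fraction(elapsed, cpu):.1%}")
```

That works out to roughly 53% with the old driver against about 99% with the current one, while elapsed time is shorter with the current driver.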

Richard Haselgrove

Joined: 10 Dec 05

Posts: 2142

Credit: 2780372725

RAC: 749451

If you were following the

17 Oct 2013 17:12:21 UTC

Message 117461 in response to message 117460

If you were following the v7.2.18 release thread on boinc_alpha, and the discussion about OpenCL detection on CPU (yes, not GPU - I got that wrong too), you'll remember that this machine was supplied from the factory with OpenCL 1.2 drivers for the HD 4600 - so no, I have no Astropulse times under OpenCL 1.1 driver for comparison.

As part of the v7.2.18 testing, I downloaded the additional Intel SDK and runtime support for OpenCL on CPU - that wasn't pre-installed.

ExtraTerrestria...

Joined: 10 Nov 04

Posts: 770

Credit: 542064092

RAC: 171916

RE: Unfortunately, have

17 Oct 2013 20:54:48 UTC

Message 117462 in response to message 117447

Quote:

Unfortunately, have absolutely no idea what that "kernels per reduction" does in his app. So hard to comment how it would change load.

I'm not that deep into CC either. But in principle they're checking huge integers (many in parallel for the GPUs) against the Collatz conjecture by running a simple algorithm on them ("3n+1"). In the process the numbers gradually become smaller, until the algorithm terminates with "CC still holds true" or "not" (never happened so far). So the "reductions per kernel" could be the number of algorithm iterations performed per kernel call.
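To picture what such a setting might control, here is a minimal CPU-only sketch. It is a guess at the idea, not the Collatz Conjecture app's actual code: the function names and the steps_per_launch parameter are hypothetical stand-ins for "reductions per kernel".

```python
def collatz_steps(n, max_steps):
    """Apply at most max_steps 3n+1 steps to n; return (new_n, steps_done)."""
    steps = 0
    while n != 1 and steps < max_steps:
        n = n // 2 if n % 2 == 0 else 3 * n + 1
        steps += 1
    return n, steps


def run_in_launches(n, steps_per_launch):
    """Drive n down to 1 in chunks; each chunk stands in for one kernel launch."""
    launches = 0
    total_steps = 0
    while n != 1:
        n, done = collatz_steps(n, steps_per_launch)
        total_steps += done
        launches += 1
    return launches, total_steps


# Same total work, different number of launches (and host round trips).
print(run_in_launches(27, 10))
print(run_in_launches(27, 1000))
```

Under this reading, a larger value means fewer but longer launches, and a smaller value means more frequent returns to the host.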

MrS

Scanning for our furry friends since Jan 2002

Raistmer*

Joined: 20 Feb 05

Posts: 208

Credit: 180017207

RAC: 27036

Then the more number is the

17 Oct 2013 22:06:27 UTC

Message 117463 in response to message 117462

Then the larger the number, the bigger the kernel, not the reverse.

Raistmer*

Joined: 20 Feb 05

Posts: 208

Credit: 180017207

RAC: 27036

RE: If you were following

19 Oct 2013 7:58:29 UTC

Message 117464 in response to message 117461

Quote:

If you were following the v7.2.18 release thread on boinc_alpha, and the discussion about OpenCL detection on CPU

No, I don't follow that conversation, and I feel a big temptation to unsubscribe from all BOINC lists altogether right now.
The fundamental quota management issue is ignored completely, but there is a very lively discussion about where to put a button on the Android interface screen and how to properly detect Windows 8.1...

ExtraTerrestria...

Joined: 10 Nov 04

Posts: 770

Credit: 542064092

RAC: 171916

RE: Then the more number is

19 Oct 2013 15:31:09 UTC

Message 117465 in response to message 117463

Quote:

Then the larger the number, the bigger the kernel, not the reverse.

That's how I understand it as well. But still, I was able to get CPU usage under control by setting smaller values.

MrS

Scanning for our furry friends since Jan 2002

Raistmer*

Joined: 20 Feb 05

Posts: 208

Credit: 180017207

RAC: 27036

Looked at Einstein's sources

19 Oct 2013 19:15:41 UTC

Message 117466

Looked at Einstein's sources - only one difference spotted.
You call clFinish directly every time synching is needed. I synch indirectly, through blocking reads, when required.

That is, Einstein:

enqueue();
clFinish();
bufferRead(false);
clFinish();

SETI:

enqueue();
bufferRead(true);

Could this difference lead to such big consequences? No idea right now.
The next step will be determining the size of the kernel calls.
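The two call patterns above can be compared with a toy model that only counts how often the host blocks. This is not the OpenCL API: ToyQueue and its methods are invented for illustration, and the model assumes an in-order queue where clFinish waits for everything queued so far and a blocking read returns only once the read (and therefore everything before it) has completed.

```python
class ToyQueue:
    """Toy in-order command queue that counts host-side waits."""

    def __init__(self):
        self.pending = []    # commands enqueued but not yet completed
        self.host_waits = 0  # number of times the host had to block

    def enqueue(self, name="kernel"):
        self.pending.append(name)

    def _drain(self):
        # The host blocks until all queued commands have completed.
        self.host_waits += 1
        self.pending.clear()

    def finish(self):
        # Stands in for clFinish().
        self._drain()

    def buffer_read(self, blocking):
        # Stands in for a buffer read with the blocking flag set or not.
        self.enqueue("read")
        if blocking:
            self._drain()


def einstein_pattern(q):
    q.enqueue()
    q.finish()
    q.buffer_read(blocking=False)
    q.finish()
    return q.host_waits


def seti_pattern(q):
    q.enqueue()
    q.buffer_read(blocking=True)
    return q.host_waits


print(einstein_pattern(ToyQueue()), seti_pattern(ToyQueue()))
```

In this model the first pattern blocks the host twice per iteration and the second once. Whether that alone explains the observed difference in behaviour is exactly the open question in the post.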
