Support for (integrated) Intel GPUs (Ivy Bridge and later)

Raistmer*
Joined: 20 Feb 05
Posts: 208
Credit: 181407464
RAC: 8747

Quote:

Funny, I know Gergely and could ask him whether he could solve his issue. Are the symptoms the same in your case or did you indeed just list "similar" ones...?

Well, he's on Linux and I'm on Windows.
But I get the same message too when I abort the app. In short, profiling never even starts (EDIT: I mean that no data is written to any file, although the CodeXL GUI reports "GPU profiling in progress"). It looks like the app hits some OpenCL runtime crash at the very beginning (the same behavior as after a driver crash and restart: no progress plus full usage of a CPU core). But there were no messages about a driver restart in this case.

Raistmer*

Is it possible to build the app under MSVC without rewriting your build script completely?
What external libs are required (besides the BOINC ones, FFTW and OpenCL)?

Oliver Behnke
Moderator
Administrator
Joined: 4 Sep 07
Posts: 984
Credit: 25171376
RAC: 43

Quote:
Is it possible to build the app under MSVC without rewriting your build script completely?


Hm, should be. But that will surely require some work.

Quote:

What external libs are required (besides the BOINC ones, FFTW and OpenCL)?


Have a look at build.sh; it downloads every third-party lib it needs. In addition to what you already mentioned, you'll need GSL, libxml2 and our OpenCL FFT library that I referenced earlier.

Oliver

Einstein@Home Project

Raistmer*

I haven't had time so far to rebuild Einstein's app or to profile it on NV (though I still want to do that profiling), but I tried to eliminate the known differences between Einstein's app and SETI's Astropulse. It looks like AMD SDK vs. Intel SDK doesn't matter for CPU consumption. What really matters is the synching style. When each runtime call is followed by clFinish, CPU usage drops considerably. So synching on a blocking read is not the same as synching on clFinish for Intel (and, I suspect, for NV too) GPUs. AMD GPUs aren't affected.

Total synching, of course, leads to some performance drop for Astropulse, but later I could eliminate the excessive synching, and performance should be almost (or even totally) restored.

Here are examples of bench runs that Richard made on his host:

With a free CPU core:
WU : Clean_01LC.wu
astropulse_6.01_windows_intelx86.exe -verbose :
Elapsed 422.874 secs
CPU 420.251 secs
AP6_win_x86_SSE2_OpenCL_Intel_r1922.exe -verbose :
Elapsed 85.395 secs, speedup: 79.81% ratio: 4.95x
CPU 82.259 secs, speedup: 80.43% ratio: 5.11x
AP6_win_x86_SSE2_OpenCL_Intel_r1922_OCL_SYNCHED.exe -verbose :
Elapsed 91.791 secs, speedup: 78.29% ratio: 4.61x
CPU 8.502 secs, speedup: 97.98% ratio: 49.43x

With a fully loaded CPU:

WU : Clean_01LC.wu
astropulse_6.01_windows_intelx86.exe -verbose :
Elapsed 422.874 secs
CPU 420.251 secs
AP6_win_x86_SSE2_OpenCL_Intel_r1922.exe -verbose :
Elapsed 85.005 secs, speedup: 79.90% ratio: 4.97x
CPU 81.589 secs, speedup: 80.59% ratio: 5.15x
AP6_win_x86_SSE2_OpenCL_Intel_r1922_OCL_SYNCHED.exe -verbose :
Elapsed 90.389 secs, speedup: 78.63% ratio: 4.68x
CPU 7.675 secs, speedup: 98.17% ratio: 54.76x
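
For anyone reading the bench output above: the "speedup" and "ratio" columns appear to be plain arithmetic against the reference run (astropulse_6.01), with ratio = reference time / new time and speedup = percentage of the reference time saved. That reading is an assumption inferred from the numbers, not something taken from the bench tool itself, and the function names below are made up for illustration. A minimal sketch in C:

```c
#include <assert.h>
#include <stdio.h>

/* Ratio column (assumed definition): reference time divided by new time. */
double time_ratio(double ref_secs, double new_secs) {
    assert(ref_secs > 0.0 && new_secs > 0.0);
    return ref_secs / new_secs;
}

/* Speedup column (assumed definition): percent of the reference time saved. */
double speedup_percent(double ref_secs, double new_secs) {
    assert(ref_secs > 0.0 && new_secs > 0.0);
    return 100.0 * (1.0 - new_secs / ref_secs);
}

/* Print one row in roughly the same layout as the bench output. */
void print_row(const char *label, double ref_secs, double new_secs) {
    printf("%s %.3f secs, speedup: %.2f%% ratio: %.2fx\n", label, new_secs,
           speedup_percent(ref_secs, new_secs),
           time_ratio(ref_secs, new_secs));
}
```

For example, print_row("CPU", 420.251, 8.502) reproduces the 97.98% and 49.43x figures of the synched build with a free CPU core, which is where the drop in CPU time shows up most clearly.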

So I would say this enigmatic difference is explained. Thanks to Oliver for sharing the code and to Richard for initiating this comparison.

Oliver Behnke

Quote:
What really matters is the synching style. When each runtime call is followed by clFinish, CPU usage drops considerably. So synching on a blocking read is not the same as synching on clFinish for Intel (and, I suspect, for NV too) GPUs.


That rings a bell somewhere. I think we also ran into this issue of implicit vs explicit synchronisation at some point but I can't find any reference to that anymore. However, we only sync where we actually need to in order to not spoil the performance.

Quote:

So I would say this enigmatic difference is explained. Thanks to Oliver for sharing the code and to Richard for initiating this comparison.


Great news! Glad I could help!

Cheers,
Oliver

Raistmer*

Hi again.
I'm trying to follow your other hint regarding possible FFT inaccuracy, so I'm incorporating your changes into clFFT.

Unfortunately, the code can't be built with the MSVC compiler.
This line

float pi2=0x1.921fb54442d18p+2;

gives the following error:

Quote:
1>..\..\..\..\src\OpenCL_FFT\fft_kernelstring.cpp(1115) : error C2059: syntax error : 'bad suffix on number'

Any ideas how to make it more portable?

EDIT:
Indeed, MSVC 2008 doesn't accept any suffixes on hexadecimal numbers: http://msdn.microsoft.com/en-us/library/2k2xf226(VS.90).aspx

EDIT2:
perhaps

#ifndef M_PI
#define M_PI 3.14159265358979323846
#endif

float pi2=2.0f*M_PI;


might do.
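
Whether the decimal fallback gives the same float as the hexadecimal literal can be checked directly. The sketch below is meant to be compiled once with a C99 compiler such as GCC (MSVC 2008 would reject the first function), and it assumes an IEEE 754 platform. The function names and the 17-digit decimal literal are illustrative additions, not part of clFFT.

```c
#include <assert.h>
#include <stdio.h>

#ifndef M_PI
#define M_PI 3.14159265358979323846
#endif

/* 2*pi written as the C99 hexadecimal floating-point literal. */
float two_pi_from_hex(void) {
    return (float)0x1.921fb54442d18p+2;
}

/* 2*pi via the M_PI fallback: the product is formed in double,
   then narrowed to float. */
float two_pi_from_m_pi(void) {
    return (float)(2.0f * M_PI);
}

/* 2*pi as a plain decimal literal with 17 significant digits,
   enough to identify one double on IEEE 754 platforms. */
float two_pi_from_decimal(void) {
    return (float)6.2831853071795862;
}

/* Abort if any variant narrows to a different float;
   otherwise print the shared value. */
void check_two_pi_variants(void) {
    assert(two_pi_from_hex() == two_pi_from_m_pi());
    assert(two_pi_from_hex() == two_pi_from_decimal());
    printf("all variants agree: %.9g\n", two_pi_from_hex());
}
```

If the check passes on the C99 side, either decimal form can be used in the MSVC build without the hexadecimal literal. Results may differ on compilers that evaluate constants with unusual precision, so it is worth running on the actual target toolchain as well.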

Oliver Behnke

Sigh, that's standard C99. One reason why we use GCC for all binaries. Converting 2PI from decimal to float will often incur some minor rounding errors so your code might be slightly less accurate.

Oliver

Raistmer*

Yeah, but I'm too tied to MSVS for now to make the change :)

And thanks again: your hint about the poor native_sin/cos implementation on Intel GPUs was very useful too. Your changes in clFFT, plus replacing native_sin/cos with sincos inside SETI's dechirping function, made the result much more accurate. Validation now passes on the test tasks.

Oliver Behnke

That's great news!

Raistmer*

Unfortunately, I failed again to profile your app:

Activated exception handling...
20:05:56 (2824): Can't set up shared mem: -1. Will run in standalone mode.
[20:05:56][2824][INFO ] Starting data processing...
[20:05:56][2824][INFO ] Using OpenCL platform provided by: NVIDIA Corporation
[20:05:56][2824][ERROR] Couldn't find any suitable OpenCL GPU device!
[20:05:56][2824][ERROR] Demodulation failed (error: 2004)!
20:05:56 (2824): called boinc_finish

Because the AMD profiler refused to profile, I tried to use the CUDA profiler on an NV GPU instead. But it looks like the app doesn't accept my NV host config as a valid OpenCL environment.

I use a GTX 260 GPU, and my app reports:

Number of OpenCL platforms: 1

OpenCL Platform Name: NVIDIA CUDA
Number of devices: 1
Max compute units: 27
Max work group size: 512
Max clock frequency: 1242Mhz
Max memory allocation: 234799104
Cache type: None
Cache line size: 0
Cache size: 0
Global memory size: 939196416
Constant buffer size: 65536
Max number of constant args: 9
Local memory type: Scratchpad
Local memory size: 16384
Queue properties:
Out-of-Order: Yes
Name: GeForce GTX 260
Vendor: NVIDIA Corporation
Driver version: 263.06
Version: OpenCL 1.0 CUDA
Extensions: cl_khr_byte_addressable_store cl_khr_icd cl_khr_gl_sharing cl_nv_d3d9_sharing cl_nv_compiler_options cl_nv_device_attribute_query cl_nv_pragma_unroll cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_fp64

Perhaps OpenCL 1.0 is a no-go for your app. Well, I'm afraid I have no PC with OpenCL 1.1 on NV, so such profiling will be hard to do.
