Support for (integrated) Intel GPUs (Ivy Bridge and later)

Raistmer*
Joined: 20 Feb 05
Posts: 208
Credit: 181423403
RAC: 7429

RE: I do have a newer

Quote:

I do have a newer NVidia card online and available for testing, if that helps: host 1226365.

Driver is a little long in the tooth currently, but that's easy to fix. OS is WinXP - that would be harder to change.

Try running the CUDA profiler (or Nsight) with Einstein's app and post the output. The most interesting thing is the size of the kernel calls inside the app.
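For example, with the command-line profiler (a hypothetical invocation; nvprof ships with recent CUDA toolkits, and the app name here is just a placeholder):

nvprof --print-gpu-trace einstein_app.exe

That prints one line per kernel launch, with its duration and grid/block dimensions.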

Oliver Behnke
Moderator
Administrator
Joined: 4 Sep 07
Posts: 984
Credit: 25171438
RAC: 34

RE: Most interesting is

Quote:
The most interesting thing is the size of the kernel calls inside the app.

I presume you mean in absolute runtime? You've already seen the kernel and host code (size/complexity), and we discussed the work item dimensions (quantity/scheduling). Do you have further questions regarding that?

Best,
Oliver

Einstein@Home Project

Raistmer*
Joined: 20 Feb 05
Posts: 208
Credit: 181423403
RAC: 7429

RE: RE: Most interesting

Quote:
Quote:
The most interesting thing is the size of the kernel calls inside the app.

I presume you mean in absolute runtime? You've already seen the kernel and host code (size/complexity), and we discussed the work item dimensions (quantity/scheduling). Do you have further questions regarding that?

Best,
Oliver

You said millions of work items per call, right? That could make for quite a long kernel call even with the simplest kernel code. So yes, having a test case for some single GPU with absolute runtimes (in µs/ms) would still be interesting (the absolute kernel call duration should then be compared with the OS quantum time slice). I thought you were interested in this too.
What is not quite understandable: when I go from fully synched code, which has amazingly low CPU usage just as Einstein's app does, to synching only in the places where indirect synching (via blocking buffer reads) was done, I get a big increase in CPU consumption again. How many additional clFinish calls are required throughout the code still needs to be established, but this looks weird. So having the kernel call sizes could give some more info.
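For reference, a minimal sketch (placeholder names, not the Einstein code) of how such absolute per-call timings can be taken with OpenCL profiling events:

// Requires a queue created with profiling enabled.
cl_command_queue cq = clCreateCommandQueue(ctx, dev, CL_QUEUE_PROFILING_ENABLE, &err);

cl_event ev;
err = clEnqueueNDRangeKernel(cq, kernel, 1, NULL, globalThreads, localThreads, 0, NULL, &ev);
clWaitForEvents(1, &ev);

cl_ulong t_start = 0, t_end = 0; // device timestamps, in nanoseconds
clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_START, sizeof(t_start), &t_start, NULL);
clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_END, sizeof(t_end), &t_end, NULL);
clReleaseEvent(ev);

// Compare against the OS scheduler quantum (roughly 10-15 ms on Windows).
fprintf(stderr, "kernel call took %.3f ms\n", (t_end - t_start) * 1e-6);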

Oliver Behnke
Moderator
Administrator
Joined: 4 Sep 07
Posts: 984
Credit: 25171438
RAC: 34

RE: You said millions of

Quote:
You said millions of work items per call


Yes. You may build a debug version of our app by passing "--debug" as an additional command line option to the build script. The binary will then be pretty verbose about what's going on, including the exact number of work items per kernel call.

Quote:
So, having also kernel call sizes could give some more info still.


This is what I wanted to understand: "kernel call size" can mean a number of different things, like kernel code length/complexity, work group size (depends on hardware in our case), total number of work items or the absolute runtime...

Oliver

Einstein@Home Project

Raistmer*
Joined: 20 Feb 05
Posts: 208
Credit: 181423403
RAC: 7429

Absolute time length of

The absolute time length of a particular kernel invocation, in seconds.

Jord
Joined: 26 Jan 05
Posts: 2952
Credit: 5893653
RAC: 76

RE: RE: The expectation

Quote:
Quote:

The expectation is that CUDA will go on for another year, year and a half, and then fold, with Nvidia moving full on to OpenCL.

Whose expectation is that? I know that NVIDIA doesn't plan to support anything above OpenCL 1.1.


The community's expectation.

A good example is that of Apple. They release a new operating system with OpenCL support for all AMD products out of the box, but no CUDA support because Nvidia wasn't ready. Either that's bad communication (though Nvidia does this with every Apple OS X release), bad planning, or they just don't give a damn.

Apple released a Mac Pro in June this year that had no Nvidia support at all: the only video cards in there are Intel ones and the dedicated dual AMD FirePro W9000. Models of the MacBook Pro are rumoured to have an Intel-only GPU, embedded on the Haswell chip.

Now, unless Apple is on its way to topple over and play dead, Nvidia may want to scratch themselves behind the ears and figure out whether their plan for the future is the right one. They're losing out on a lot of revenue this way.

On the other hand, the problem with Nvidia is apparently (according to the same community) that they want to keep their OpenCL compiler underdeveloped, not up to par with their CUDA compiler, because if they brought it up to par, anyone could use it to compile rock-solid, fast OpenCL code for Nvidia GPUs... and for anything else capable of using OpenCL. That would include AMD and Intel, who both have huge ranges of CPUs and GPUs that are OpenCL capable. That's why Nvidia says they won't develop a compiler for OpenCL 2.0, and will stay on track with CUDA.

The community is probably right. When application developers stop using CUDA and all embrace OpenCL, Nvidia will have no choice but to fall in line. Because really, those DirectCompute applications you build on a daily basis are superfast, right? ;-)

Raistmer*
Joined: 20 Feb 05
Posts: 208
Credit: 181423403
RAC: 7429

I much more agree with Oliver

I now agree much more with Oliver's pessimism about OpenCL on NV, and I would like to share my latest findings regarding NVIDIA's OpenCL implementation.

There is a 100% CPU usage issue with OpenCL on NV for AstroPulse too, but the way I solved this issue for Intel doesn't work for NV. So I made attempts to suspend thread execution for some time, but no effect was achieved.

Then I ran this code:

{
    cl_event evPC;
    err = clEnqueueNDRangeKernel(
        cq,
        PC_single_pulse_kernel_FFA_update_reduce1,
        3,            // R: 3D execution domain used
        NULL,
        globalThreads,
        localThreads, // R: TODO: tune workgroup dimensions
        0,
        NULL,         // R: synching between kernels not needed because the queue is in-order
#if DEBUG_SLEEP_TUNING
        &events[APEvents::evPC]
#else
        &evPC
#endif
    );
    OCL_LOG_ERR("PC_single_pulse_kernel_FFA_update_reduce1");
    clFlush(cq); // submit the kernel before polling its event
    if (use_sleep) {
        int wait_time = 0;
        cl_int ret = 0;
        do { // poll the event status until the kernel completes
            /*Sleep(0);*/
            wait_time++;
            err = clGetEventInfo(evPC, CL_EVENT_COMMAND_EXECUTION_STATUS,
                                 sizeof(ret), &ret, NULL);
        } while (ret > CL_COMPLETE);
        /*err = clWaitForEvents(1, &evPC);*/
        clReleaseEvent(evPC);
        if (verbose >= 2)
            fprintf(stderr, "Single pulse find before bufferRead: Awaited %d iterations for completion\n", wait_time);
    }
}

if (use_sleep) { // R: spin while the readback finishes
    cl_event ev;
    err = clEnqueueReadBuffer(cq, gpu_results, CL_FALSE, 0, sizeof(cl_uint),
                              &CPU_result, 0, NULL, &ev);
    OCL_LOG_ERR("clEnqueueReadBuffer->CPU_result");
    clFlush(cq);
    int wait_time = 0;
    cl_int ret;
    do { // poll the (supposedly asynchronous) read
        /*Sleep(0);*/
        wait_time++;
        err = clGetEventInfo(ev, CL_EVENT_COMMAND_EXECUTION_STATUS,
                             sizeof(ret), &ret, NULL);
    } while (ret > CL_COMPLETE);
    err = clWaitForEvents(1, &ev);
    clReleaseEvent(ev);
    if (verbose >= 2)
        fprintf(stderr, "Single pulse find: Awaited %d iterations for completion\n", wait_time);
}

With these results:

Quote:
Single pulse find before bufferRead: Awaited 1093 iterations for completion
Single pulse find: Awaited 1 iterations for completion

So the asynchronous buffer call returns only after the transfer has actually completed. And when I tried to wait after enqueuing the async buffer read (after a few enqueued kernels), I was actually waiting after an already synchronous buffer transfer.
So there is no way to issue an async buffer transfer in the NV OpenCL runtime; by all manifestations, the async call is executed as a synchronous one.
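For comparison, the low-CPU wait I was aiming for would look roughly like this (a sketch with a hypothetical helper, assuming Windows Sleep() and a clFlush() on the queue beforehand):

static cl_int wait_with_sleep(cl_event ev)
{
    // Hypothetical helper: poll the event status and yield the time slice
    // between polls instead of letting clWaitForEvents() busy-spin.
    cl_int status = CL_QUEUED;
    do {
        Sleep(1); // give up the CPU; the GPU keeps working meanwhile
        cl_int err = clGetEventInfo(ev, CL_EVENT_COMMAND_EXECUTION_STATUS,
                                    sizeof(status), &status, NULL);
        if (err != CL_SUCCESS)
            return err;
    } while (status > CL_COMPLETE); // CL_QUEUED/SUBMITTED/RUNNING are all > 0
    return status; // CL_COMPLETE (0) on success, negative on error
}

But with clEnqueueReadBuffer already blocking on NV, there is nothing left to sleep through.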

Oliver Behnke
Moderator
Administrator
Joined: 4 Sep 07
Posts: 984
Credit: 25171438
RAC: 34

RE: A good example is that

Quote:

A good example is that of Apple. They release a new operating system with OpenCL support for all AMD products out of the box


Apple's OpenCL platform isn't limited to AMD, by the way. They also support NVIDIA GPUs and Intel CPUs/GPUs.
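This is easy to verify by enumerating the devices (a minimal standalone sketch, not project code):

/* List every device each OpenCL platform exposes. On OS X this typically
   shows an Intel CPU next to Intel, AMD and/or NVIDIA GPUs under Apple's
   single platform. */
#include <stdio.h>
#ifdef __APPLE__
#include <OpenCL/cl.h>
#else
#include <CL/cl.h>
#endif

int main(void)
{
    cl_platform_id platforms[8];
    cl_uint np = 0;
    clGetPlatformIDs(8, platforms, &np);
    for (cl_uint p = 0; p < np; p++) {
        cl_device_id devices[16];
        cl_uint nd = 0;
        clGetDeviceIDs(platforms[p], CL_DEVICE_TYPE_ALL, 16, devices, &nd);
        for (cl_uint d = 0; d < nd; d++) {
            char name[256];
            clGetDeviceInfo(devices[d], CL_DEVICE_NAME, sizeof(name), name, NULL);
            printf("platform %u, device %u: %s\n", p, d, name);
        }
    }
    return 0;
}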

Quote:

but no CUDA support because Nvidia wasn't ready. Either that's bad communication (though Nvidia does this with every Apple OS X release), bad planning, or they just don't give a damn.


Yep, NVIDIA had plenty of lead time to get their stuff together but they failed. As a customer I don't care about their reasons, so from my vantage point, it's just another failure.

FYI: CUDA 5.5.28 is supported on OSX 10.9 but only on > sm_1x architecture GPUs for the time being (a fix is planned).

Quote:

Apple released a Mac Pro in June this year that had no Nvidia support at all: the only video cards in there are Intel ones and the dedicated dual AMD FirePro W9000. Models of the MacBook Pro are rumoured to have an Intel-only GPU, embedded on the Haswell chip.


Right. However, Apple has switched their GPU vendors constantly, so this doesn't really mean anything. In particular, the MacBook Pro series has switched from NVIDIA to AMD and back (for the dedicated GPUs) almost every year. Also for the MacPro, there have been GTX 285 and Quadro 4000 GPUs alongside AMD GPUs. So Apple seems to decide on a case-by-case basis, depending on the then-current market. I don't think there's any clear trend to see yet.

Quote:

That's why Nvidia says they won't develop a compiler for OpenCL 2.0, and will stay on track with CUDA.


Yep, not even OpenCL 1.2.

But hey, these are plans and intentions. Those might change if they notice the potential consequences ;-)

Quote:

The community is probably right. When application developers stop using CUDA and all embrace OpenCL, Nvidia will have no choice but to fall in line.


Sure, but I don't think it's that easy: CUDA has had quite a head start, so there's a lot of production CUDA code, and CUDA still seems to be better supported in terms of software/education/libraries, etc. But yes, OpenCL is gaining traction...

JM2C,
Oliver

PS:

Raistmer wrote:

I now agree much more with Oliver's pessimism about OpenCL on NV


It's not really pessimism, it's knowledge about NVIDIA's plans.

Einstein@Home Project

Bikeman (Heinz-Bernd Eggenstein)
Moderator
Joined: 28 Aug 06
Posts: 3522
Credit: 727950019
RAC: 1225710

In the end, both CUDA and

In the end, both CUDA and OpenCL (at least as an API/language extension that is exposed to the programmer, as we know it today) might decrease in relevance.

Intel, with its Xeon Phi accelerators, is leaning towards an OpenMP-ish paradigm. NVIDIA itself has acquired The Portland Group Inc. (PGI), a compiler specialist that is part of a group of OpenACC proponents (OpenACC is quite similar to OpenMP). Yes, PGI has also done work on OpenCL compilers, but AFAIK those activities were reduced, if not stopped, after NVIDIA took control... hmmm... you see a trend there??? ;-)
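For those who haven't seen the directive-based style, here's roughly the same loop in both (a minimal sketch; the saxpy function names are just for illustration):

// The same SAXPY loop parallelized by directives alone: no hand-written
// kernels, no explicit buffer management.

// OpenMP: spread the loop across host threads.
void saxpy_omp(int n, float a, const float *x, float *y)
{
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}

// OpenACC: offload the loop to an accelerator (e.g. with the PGI compiler);
// the copyin/copy clauses describe data movement declaratively.
void saxpy_acc(int n, float a, const float *x, float *y)
{
    #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}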

Cheers
HB

ExtraTerrestrial Apes
Joined: 10 Nov 04
Posts: 770
Credit: 578246873
RAC: 197681

NVidia's half-hearted stance

NVidia's half-hearted stance on OpenCL support actually puts me in a dilemma when choosing my next GPU: I've got a few PCs generating the majority of the credits for my team, so my GPUs have to run something that yields many credits (sorry, Einstein!). But it shouldn't be completely useless either, like Moo Wrapper. For me that narrows the choices down to, in descending priority:

1. POEM (OpenCL, better on AMD)
2. GPU-Grid (nVidia, not available on AMD)
3. Milkyway (OpenCL, much better on AMD)
4. Einstein as backup, also better on AMD

And since POEM can't supply enough work most of the time, project 2 or 3 would run. So if I want GPU-Grid I'd have to go for an nVidia card thanks to CUDA, despite AMD being better for all the other three projects :/

MrS

Scanning for our furry friends since Jan 2002
