Nvidia OpenCL capability is not cleaned by Nvidia clean install

Jim1348
Joined: 19 Jan 06
Posts: 449
Credit: 208,736,569
RAC: 26,071

Keith Myers wrote:
The OpenCL may be more general but it doesn't offer the performance in parallel computing that CUDA does.  So far.

I usually buy Nvidia for that reason.  In fact, it often does better even at OpenCL than AMD does.

But here I find that my RX 570 does better than a GTX 1060 or even a 1070.  Whether it depends on the developer or the science they are trying to do is a moot point.  Also, the new AMD Navi cards will be quite cost-effective if they work well.  So it would be nice if the developers can learn to take advantage of OpenCL.  (Or if Einstein can come up with a good CUDA app, I wouldn't mind either.)

But I expect that the real future lies with AI and machine learning everywhere that can use it.  So that is something of a new ballgame between AMD and Nvidia, and they will each have to scramble to fill that niche.

 

Mike Hewson
Moderator
Joined: 1 Dec 05
Posts: 6,219
Credit: 137,279,852
RAC: 47,856

Jim1348 wrote:
Mike Hewson wrote:
My understanding is truly pretty basic. This rabbit hole is a deep one, but try to think of OpenCL as a formalised attempt to get radically different hardware architectures to work together in parallel.

I wonder how that compares to CUDA.  I have always thought of CUDA as similar to OpenCL, except limited to Nvidia GPUs, which allows for an increase in efficiency since they match the hardware with the software.

Well, more power to NVidia for having their own optimised language/toolset tuned for their own hardware. In that regard ( appropriately well written ) CUDA is going to beat OpenCL on the same machine. I guess the decision at E@H is for generality, both to cater for a wide range of platforms and to keep the developers sane for that matter also ! :-)

Jim1348 wrote:

But it may be more than that - I really don't know.  There was an intriguing statement in the Wiki on OpenCL:

The fact that OpenCL allows workloads to be shared by CPU and GPU, executing the same programs, means that programmers can exploit both by dividing work among the devices. This leads to the problem of deciding how to partition the work, because the relative speeds of operations differ among the devices. Machine learning has been suggested to solve this problem: Grewe and O'Boyle describe a system of support vector machines trained on compile-time features of program that can decide the device partitioning problem statically, without actually running the programs to measure their performance. 

https://en.wikipedia.org/wiki/OpenCL

So can they do that with CUDA?  If not, the very generality of OpenCL might lead to an increase in overall efficiency, instead of a loss.

Further down the Rabbit Hole :

"... the problem of deciding how to partition the work ..." : ay, there's the rub indeed. What we've been ( implicitly ) discussing thus far may be called data parallelism which means using the same algorithm on many different data pieces.

{ In the case of an FFT, in effect one is multiplying the data by a matrix, but not by the 'high school long method' as that goes by O(N^2) complexity. The short story is that one can 'factorise bigger' FFTs into 'smaller' FFTs iteratively until one gets to tiny transforms that are quick and simple to evaluate, which brings the cost down to O(N log N). }
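To make the 'factorise bigger FFTs into smaller FFTs' idea concrete, here is a minimal sketch in Python. It assumes the classic radix-2 Cooley-Tukey split (even samples, odd samples) and an input length that is a power of two; the function names are mine, and this is an illustration, not what any Einstein@Home app actually runs.

```python
import cmath


def naive_dft(x):
    """The 'long method': evaluate every output as a full sum, O(N^2)."""
    n = len(x)
    return [
        sum(x[k] * cmath.exp(-2j * cmath.pi * j * k / n) for k in range(n))
        for j in range(n)
    ]


def fft(x):
    """Recursive radix-2 FFT, O(N log N). len(x) must be a power of two."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    if n == 1:
        # A one-point transform is just the sample itself.
        return [complex(x[0])]
    even = fft(x[0::2])  # half-size transform of the even-indexed samples
    odd = fft(x[1::2])   # half-size transform of the odd-indexed samples
    out = [0j] * n
    for j in range(n // 2):
        twiddle = cmath.exp(-2j * cmath.pi * j / n) * odd[j]
        out[j] = even[j] + twiddle
        out[j + n // 2] = even[j] - twiddle
    return out


if __name__ == "__main__":
    signal = [1, 2, 3, 4, 0, 0, 0, 0]
    fast, slow = fft(signal), naive_dft(signal)
    print(max(abs(a - b) for a, b in zip(fast, slow)))  # tiny rounding error only
```

Each butterfly inside the loop is independent of the others, which is why this sort of work maps so naturally onto the many lanes of a GPU.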

Alternatively one can distribute different algorithms ( & data ) amongst several devices. There may be no correspondence b/w the internals of either task e.g. you change the tyres while I fill the petrol tank, and someone else adjusts the ride height. That 'visit to the pits' type of parallelism is one of doing different things but simultaneously. Call this, say, task parallelism. Balancing the workload here is the key bit. Continuing with the car racing analogy ( my favourite hobby is watching pit crews ) there is often a guy who returns a used tyre to the stack & then strips a tear-off from the windscreen. Who or what task is rate limiting for a given pit stop ? So what Grewe and O'Boyle want to do is calculate in advance the total times of various pit stop cases, each case with tasks done in different orders and combinations by different crew members. Without doing an actual pit visit. The older/traditional approach is to do the work while profiling and decide, possibly by exhaustion*, which is the quickest way.

But really this is a matter independent of either CUDA or OpenCL, we're at a higher level here. Another way to state the problem is : find the task/device combination(s) that lead to the minimum time difference b/w the first device to finish and the last device to finish ( assume they all start at the same instant ). If you think carefully about that goal then you'll realise (a) you can't finish everything any quicker and (b) you'll have the least total idle time across all devices. The lower bound on that is the ideal situation where everybody finishes at once : the guy filling the petrol pulls the hose out just as the last nut is put on the last wheel etc. The OpenCL mantra here is that all tasks can be implemented by using it, and not just for NVidia machines like CUDA.
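A rough sketch of that balancing act, in Python. To be clear about assumptions: this is the plain 'biggest job first, give it to whoever would finish it soonest' greedy heuristic, not the machine-learning method of Grewe and O'Boyle, and it is not guaranteed optimal in general. The task names, costs and device speeds are made up for the pit-stop analogy.

```python
def partition(task_costs, device_speeds):
    """Greedy split of independent tasks across devices.

    task_costs    : dict of task name -> amount of work (arbitrary units)
    device_speeds : list of work units per second, one entry per device
    Returns (assignment, finish): the tasks given to each device and the
    time at which each device finishes, assuming all start at t = 0.
    """
    finish = [0.0] * len(device_speeds)
    assignment = [[] for _ in device_speeds]
    # Place the largest tasks first; small ones fill the gaps afterwards.
    for task, cost in sorted(task_costs.items(), key=lambda kv: -kv[1]):
        best = min(
            range(len(device_speeds)),
            key=lambda d: finish[d] + cost / device_speeds[d],
        )
        finish[best] += cost / device_speeds[best]
        assignment[best].append(task)
    return assignment, finish


def idle_time(finish):
    """Total time devices spend waiting for the slowest one to finish."""
    makespan = max(finish)
    return sum(makespan - f for f in finish)


if __name__ == "__main__":
    pit_stop = {"tyres": 8, "fuel": 6, "ride_height": 3, "tear_off": 1}
    crews = [1.0, 1.0]  # two equally quick crew members (hypothetical)
    assignment, finish = partition(pit_stop, crews)
    print(assignment)         # which jobs each crew member gets
    print(finish)             # here both finish at t = 9.0
    print(idle_time(finish))  # so nobody stands around: 0.0
```

With these particular numbers the hose comes out just as the last nut goes on. Change a cost or a speed and you will usually see some idle time appear.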

FWIW I reckon in the real world you still have to factor in the time it takes to decide how you are going to decide !! No point in the spaceship's AI completing the calculation for the perfect avoidance trajectory just as it splats into the asteroid ..... :-)

Cheers, Mike.

* With The Travelling Salesman Problem one can estimate closely the difference b/w any actual calculated solution and the perfect one, even if you can't actually solve for the perfect one. Nice. So pick a tolerance, say 5%, and then just stop looking once you have discovered any solution within 5% of the ideal. Close enough is good enough.
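The 'close enough is good enough' rule can be sketched generically in Python. This is not a Travelling Salesman solver: it assumes you already have some lower bound on the perfect answer, and it reuses a toy two-crew pit stop as the thing being searched. The names and numbers are hypothetical.

```python
import itertools


def first_good_enough(candidates, cost, lower_bound, tolerance=0.05):
    """Scan candidates, stopping at the first within `tolerance` of the bound.

    Returns (best candidate, its cost, number of candidates examined).
    If nothing gets within tolerance, the best one seen is returned.
    """
    best, best_cost, examined = None, float("inf"), 0
    for cand in candidates:
        examined += 1
        c = cost(cand)
        if c < best_cost:
            best, best_cost = cand, c
        if best_cost <= lower_bound * (1 + tolerance):
            break  # close enough, stop looking
    return best, best_cost, examined


if __name__ == "__main__":
    durations = [8, 6, 3, 1]  # four pit-stop jobs

    def makespan(split):
        # split[i] says which of two crew members (0 or 1) does job i.
        loads = [0, 0]
        for who, d in zip(split, durations):
            loads[who] += d
        return max(loads)

    # Nobody can beat a perfectly even split of the total work.
    bound = sum(durations) / 2
    every_split = itertools.product([0, 1], repeat=len(durations))
    print(first_good_enough(every_split, makespan, bound))
```

Here the search stops after looking at 7 of the 16 possible splits, because it has already hit the bound.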

( edit ) I forgot to mention task dependencies. You have to take your shoes off before you take off your socks. Well that's how I do it  ....
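For what it's worth, finding an order that respects such dependencies is a topological sort. A minimal Python sketch, assuming Python 3.9 or later for the standard-library graphlib module; the task names are just the analogy continued.

```python
from graphlib import CycleError, TopologicalSorter


def valid_order(depends_on):
    """Return one execution order that respects every dependency.

    depends_on maps a task to the set of tasks that must finish first.
    Raises ValueError if the dependencies chase their own tail.
    """
    try:
        return list(TopologicalSorter(depends_on).static_order())
    except CycleError as err:
        raise ValueError(f"circular dependency: {err.args[1]}") from err


if __name__ == "__main__":
    jobs = {
        "socks_off": {"shoes_off"},
        "wheel_on": {"wheel_off"},
        "nuts_tight": {"wheel_on"},
        "release_car": {"nuts_tight", "fuel_done"},
    }
    print(valid_order(jobs))
```

Tasks with no path between them in that graph ( the socks and the wheel nuts, say ) are the ones free to run on different devices at the same time.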

( edit ) Also there's not much point in optimising some busy-wait loop.

I have made this letter longer than usual because I lack the time to make it shorter. Blaise Pascal

Jim1348
Joined: 19 Jan 06
Posts: 449
Credit: 208,736,569
RAC: 26,071

Mike Hewson wrote:

* With The Travelling Salesman Problem one can estimate closely the difference b/w any actual calculated solution and the perfect one, even if you can't actually solve for the perfect one. Nice. So pick a tolerance, say 5%, and then just stop looking once you have discovered any solution within 5% of the ideal. Close enough is good enough.

I recall when Karmarkar's algorithm came out there was a lot of excitement.  I had never considered that it might be applied to GPU design or operation.  You never know what information will be useful where, even pit stops.

Mike Hewson
Moderator
Joined: 1 Dec 05
Posts: 6,219
Credit: 137,279,852
RAC: 47,856

Mike Hewson wrote:
ASIDE : I believe someone has created a thin C++ layer over the OpenCL API in order to bury messy detail inside OOP objects. That's clever.

In fact I've found about four such wrappers. The Khronos group offer one for OpenCL version 1.x that seems simple. I will consider creating a small standalone command-line tool here for users, Windows and Linux at least, that will interrogate a system to determine :

- the number and type of OpenCL platforms available.

- per platform, the number and type of devices available.

- per device, a list of characteristics.  

Cheers, Mike.


rjs5
Joined: 3 Jul 05
Posts: 32
Credit: 171,124,647
RAC: 98,542

Mike Hewson wrote:
Mike Hewson wrote:
ASIDE : I believe someone has created a thin C++ layer over the OpenCL API in order to bury messy detail inside OOP objects. That's clever.

In fact I've found about four such wrappers. The Khronos group offer one for OpenCL version 1.x that seems simple. I will consider creating a small standalone command-line tool here for users, Windows and Linux at least, that will interrogate a system to determine :

- the number and type of OpenCL platforms available.

- per platform, the number and type of devices available.

- per device, a list of characteristics.  

Cheers, Mike.

 

Are you talking about something like clinfo ?

 

Mike Hewson
Moderator
Joined: 1 Dec 05
Posts: 6,219
Credit: 137,279,852
RAC: 47,856

Indeed ! I was going to do it for fun ....

Cheers, Mike.

