Steven Pletsch had an interesting post over on Lattice that I thought would get more attention here, since TLP doesn't do much CUDA discussion. Here is a copy-and-paste of that post:
Another thought on this subject.
Since most of the consumer-level video cards being used most efficiently with CUDA come with 896 MB to 1.7 GB of memory, and most of the workstation-grade CUDA-enabled video cards typically have 1 to 4 GB, is it at all possible to load the application, or at least the majority of the calculation data, into the memory of the video card?
It seems that this would free up resources on the computer for handling other BOINC applications, and might well increase the efficiency of the application. I don't know how feasible something like this is, but if the option is there it may well be worth trying.
More CUDA Buddha
There is a similar discussion over at Milkyway, but about using the ATI HD38xx and 48xx cards for their double-precision maths and their extra pipelines.
The applications themselves
The applications themselves cannot be loaded into the GPU's memory and started there, as the OS doesn't know that the GPU is a processor. That's because it isn't: it's a coprocessor. It helps the CPU; it won't take over from the CPU (yet).
The actual application will always run in the computer's main memory (RAM). It can't run from the GPU's memory (VRAM), because the OS doesn't recognize the GPU as something that can execute a program on its own. Code in the video card's drivers and in the science application detects that a coprocessor is available and arranges for the data to be processed on it.
The work isn't executed as plain "data" either, but through kernels: each piece of the computation has to be translated into a kernel that the GPU's multiprocessors can run, and this takes up quite a chunk of memory.
For example, a run-of-the-mill Seti Enhanced Multibeam task that takes about 32 MB of RAM when run on the CPU uses 200 MB+ of memory when run on the GPU.
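To make that split concrete, here is a minimal sketch of how a CUDA-enabled app is structured; the kernel and buffer below are invented for illustration, not taken from any project's science app. The program itself runs on the CPU out of ordinary RAM; only the kernel launch executes on the GPU, on data that first has to be copied into VRAM:

#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

/* A kernel: the only part of the program that actually executes
   on the GPU's multiprocessors. */
__global__ void scale(float *data, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= factor;
}

int main(void)
{
    const int n = 1 << 20;              /* 1M floats, about 4 MB */
    size_t bytes = n * sizeof(float);

    /* The application itself lives in ordinary RAM, managed by the OS. */
    float *host = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i)
        host[i] = 1.0f;

    /* The driver exposes the coprocessor; the data has to be copied
       into VRAM before the GPU can touch it... */
    float *dev;
    cudaMalloc((void **)&dev, bytes);
    cudaMemcpy(dev, host, bytes, cudaMemcpyHostToDevice);

    /* ...and only this launch actually runs on the GPU. */
    scale<<<(n + 255) / 256, 256>>>(dev, 2.0f, n);
    cudaMemcpy(host, dev, bytes, cudaMemcpyDeviceToHost);

    printf("host[0] = %f\n", host[0]);  /* prints 2.000000 */
    cudaFree(dev);
    free(host);
    return 0;
}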
Then there's the trouble that all applications have to be ported from whatever language they're written in now (C++ for the majority) to C to get them to work. C isn't as sophisticated as C++, so some code may be difficult to translate, or you'll be adding lots of lines of code to do what C++ could do in fewer. Future versions of CUDA are expected to add Fortran, C++ and OpenCL support. Not that this makes it any easier to just port code over to the CUDA platform, but it's a start.
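To give a feel for the kind of rewriting involved, here is a toy before-and-after, not taken from any real application: a one-liner using the C++ STL, and the flat C kernel a CUDA port would need instead.

/* C++ original, leaning on the STL (not available in CUDA C):

       std::transform(v.begin(), v.end(), v.begin(),
                      std::negate<float>());

   The CUDA C port spells the same element-wise operation out as a
   kernel, one thread per element, plus launch plumbing on the host. */

__global__ void negate(float *v, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        v[i] = -v[i];
}

/* Host side, assuming dev_v already holds n floats in VRAM:
       negate<<<(n + 255) / 256, 256>>>(dev_v, n);               */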
A lot more information can be found in the Nvidia CUDA Programming Guide (2.0).
RE: The actual application
This might speed things up a bit though: Zero Copy in CUDA 2.2
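For anyone curious how that helps: zero copy (new in CUDA 2.2) maps page-locked host RAM straight into the GPU's address space, so a kernel reads the buffer over the bus instead of waiting for an explicit cudaMemcpy into VRAM. Here is a minimal sketch using the CUDA 2.2 runtime calls; the kernel and sizes are again just illustrative:

#include <cuda_runtime.h>
#include <stdio.h>

__global__ void scale(float *data, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= factor;
}

int main(void)
{
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);

    /* Must be set before the CUDA context is created. */
    cudaSetDeviceFlags(cudaDeviceMapHost);

    /* Page-locked host memory, mapped into the GPU's address space. */
    float *host;
    cudaHostAlloc((void **)&host, bytes, cudaHostAllocMapped);
    for (int i = 0; i < n; ++i)
        host[i] = 1.0f;

    /* Device-side alias of the same buffer; no cudaMemcpy needed. */
    float *dev;
    cudaHostGetDevicePointer((void **)&dev, host, 0);

    scale<<<(n + 255) / 256, 256>>>(dev, 2.0f, n);
    cudaThreadSynchronize();  /* the kernel wrote host RAM directly */

    printf("host[0] = %f\n", host[0]);
    cudaFreeHost(host);
    return 0;
}

Whether this is actually faster depends on the access pattern: it removes the copies, but every access crosses the bus, so it mainly pays off for data touched once, or on integrated GPUs that share the host's RAM.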