CUDA Plugin

From K-3D



The CUDA plugins are GPGPU implementations of existing plugins using the CUDA API by Nvidia[1].

In order to make use of the CUDA API, the process is broken up into a number of stages.

  1. Memory is allocated on the device (GPU) for data.
  2. The data (if any) is copied from the host (CPU) to the device.
  3. One or more kernels are called to perform the desired processing on the data in the device memory.
  4. The results (if any) are then copied from the device to the host.
  5. The memory on the device is freed.
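The five stages above can be sketched in CUDA C. This is a minimal illustrative example, not code from the plugins themselves; the kernel name `scale`, the array size, and the block size of 256 threads are assumptions made for the sketch.

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

// Hypothetical kernel: multiply each element by a factor.
__global__ void scale(float *data, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= factor;
}

int main(void)
{
    const int n = 1024;
    float host_data[1024];
    for (int i = 0; i < n; ++i)
        host_data[i] = (float)i;

    float *device_data;
    size_t bytes = n * sizeof(float);

    cudaMalloc((void **)&device_data, bytes);               // 1. allocate device memory
    cudaMemcpy(device_data, host_data, bytes,
               cudaMemcpyHostToDevice);                     // 2. copy host -> device
    scale<<<(n + 255) / 256, 256>>>(device_data, 2.0f, n);  // 3. launch the kernel
    cudaMemcpy(host_data, device_data, bytes,
               cudaMemcpyDeviceToHost);                     // 4. copy device -> host
    cudaFree(device_data);                                  // 5. free device memory

    printf("host_data[10] = %f\n", host_data[10]);
    return 0;
}
```

Note that stages 2 and 4 are exactly the host-device transfers discussed below as the dominant cost for small problems.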

In this process, the steps that take the majority of the time are those involving transfers to and from the device. This is a known limiting factor of GPGPU applications: for small problems, the overhead of transferring data back and forth outweighs the time actually spent processing, and the GPU does little or no better than the CPU in terms of execution time. However, as problems grow in size, the amount of work to be done increases, and since the GPU can perform much of this work in parallel it may show significant gains. For a general overview of GPU computing, refer to GPU Computing by Owens et al.

The CUDA Architecture

In order to get the most out of an Nvidia GPU in terms of accelerating general-purpose calculations, it is important to have at least some knowledge of the underlying architecture and, most importantly, to understand its limitations. The GPU gains its advantage by exploiting data-level parallelism through a single instruction, multiple data (SIMD) execution model. This means that the same operation or set of instructions (a kernel) is executed for a number of different data elements. If a large number of these kernel instances can be executed in parallel, a significant speedup can be achieved.

The typical CUDA-enabled GPU consists of a number of multiprocessors, each of which consists of a number of SIMD cores or stream processors. The number of multiprocessors and stream processors varies between devices. A GeForce 8600 GT has 32 stream processors, whereas a GeForce 8800 GT has 112. The latest offering from Nvidia, the GTX-280, has 240. For further information regarding the architecture, the reader is referred to the NVIDIA CUDA Programming Guide.

Thread Execution Model

A CUDA-enabled device allows for the parallel execution of a large number of threads. Typically, each thread executes the desired kernel on a different data element. In the thread execution model employed by CUDA, the threads are grouped into thread blocks, which can be 1-, 2-, or 3-dimensional. In addition, the number of threads that a block can contain is limited (512 for the GeForce 8600 GT). A number of thread blocks can also be grouped together to form a 2-dimensional grid. The position of a block in the grid, together with the position of a thread in its block, can be used as a unique identifier to specify which data element should be operated on by the given thread.
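The block-plus-thread indexing scheme can be sketched as follows. This is an illustrative example, not plugin code; the kernel name `add_one` and the 16x16 block shape are assumptions.

```cuda
// Each thread derives a unique (x, y) position in the data from its
// block's position in the grid and its own position in the block.
__global__ void add_one(float *data, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height)          // guard against partial blocks at the edges
        data[y * width + x] += 1.0f;
}

// Launch configuration: a 2-dimensional grid of 16x16 thread blocks,
// rounded up so every data element is covered.
//   dim3 block(16, 16);
//   dim3 grid((width + 15) / 16, (height + 15) / 16);
//   add_one<<<grid, block>>>(device_data, width, height);
```

The bounds check is needed because the grid is rounded up to whole blocks, so some threads in edge blocks fall outside the data.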

In terms of execution, at least one block of threads is executed on a multiprocessor at a time, with the threads in a block being split into fixed-size SIMD groups called warps. Each of these warps is executed in parallel - i.e. the same kernel is executed for each thread in the warp concurrently - and a scheduler switches execution between warps. It is important to note that communication between threads of the same block is possible to some extent by using synchronization instructions as well as a shared memory space. However, there is no means to communicate between different blocks in a grid.
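A common use of this intra-block communication is a block-level reduction, sketched below. The kernel name `block_sum` and the fixed block size of 256 threads are assumptions for illustration; note how the final combination of per-block results must be left to the host (or a second kernel), precisely because blocks cannot communicate with each other.

```cuda
// Threads in one block cooperate through shared memory; __syncthreads()
// ensures every thread reaches the barrier before any proceeds.
// Assumes the kernel is launched with exactly 256 threads per block.
__global__ void block_sum(const float *in, float *block_results)
{
    __shared__ float cache[256];
    int tid = threadIdx.x;
    cache[tid] = in[blockIdx.x * blockDim.x + tid];
    __syncthreads();                     // all loads finished before reading

    // Tree reduction: halve the number of active threads each step.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride)
            cache[tid] += cache[tid + stride];
        __syncthreads();                 // finish this step before the next
    }

    if (tid == 0)                        // one partial sum per block; combining
        block_results[blockIdx.x] = cache[0];  // them is left to the host
}
```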

Memory Model

Types of memory available on a CUDA device, their access type, their scope, and whether they are accessible by the host:

  Memory           Access Type  Scope   Accessible by Host
  registers        read-write   thread  no
  local memory     read-write   thread  no
  shared memory    read-write   block   no
  global memory    read-write   grid    yes
  constant memory  read-only    grid    yes
  texture memory   read-only    grid    yes
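The memory spaces in the table map onto CUDA C declarations roughly as sketched below. This is an illustrative fragment; the names and sizes are assumptions, and whether an automatic variable lives in registers or is spilled to local memory is decided by the compiler.

```cuda
// Constant memory: read-only from kernels, written by the host
// (e.g. via cudaMemcpyToSymbol), visible to the whole grid.
__constant__ float coeffs[16];

// Global memory: read-write, grid scope, accessible by the host.
__device__ float global_buffer[64];

__global__ void demo(float *out)   // 'out' also points into global memory
{
    __shared__ float tile[64];     // shared memory: read-write, per-block
    float local = 0.0f;            // automatic variable: registers (or
                                   // local memory if spilled), per-thread
    int tid = threadIdx.x;         // assumes a launch with 64 threads per block
    local = coeffs[tid % 16];
    tile[tid] = local;
    global_buffer[tid] = tile[tid];
    out[tid] = global_buffer[tid];
}
```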