CUDABitmapMultiply

Description

The CUDABitmapMultiply plugin is a GPGPU implementation of the BitmapMultiply plugin using the CUDA API by Nvidia[1].

See the CUDA_Plugin page for more details on the CUDA plugins in general. Examples of CUDA-based image processing plugins for Adobe Photoshop can be found on the Nvidia CUDA download page.

Details

The CUDABitmapMultiply plugin is implemented as follows (a host-side sketch is given after the list):

  1. Allocate a linear block of memory of size width*height*8 bytes on the device. Here width and height are the input image dimensions in pixels. Since each pixel consists of half-float red, green, blue, and alpha (RGBA) channels [2], the total memory required per pixel is 8 bytes.
  2. Copy the input image data to the device.
  3. Set up and execute the desired CUDA kernel.
  4. Copy the data from the device to the output image.
  5. Free the allocated device memory.
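
A minimal host-side sketch of these five steps is given below. The function name, the raw buffer types, and the kernel name (multiply_kernel) are illustrative assumptions, and error checking is omitted:

 #include <cuda_runtime.h>

 // Hypothetical host-side driver for the five steps above (error checks omitted).
 void run_bitmap_multiply(const void* host_pixels, void* host_result,
                          int width, int height)
 {
     // 1. Allocate width*height*8 bytes on the device (half-float RGBA = 8 bytes per pixel).
     const size_t bytes = static_cast<size_t>(width) * height * 8;
     unsigned short* d_pixels = 0;
     cudaMalloc(reinterpret_cast<void**>(&d_pixels), bytes);

     // 2. Copy the input image data to the device.
     cudaMemcpy(d_pixels, host_pixels, bytes, cudaMemcpyHostToDevice);

     // 3. Set up and execute the kernel (see the launch configuration in the next section).
     // multiply_kernel<<<grid, block>>>(d_pixels, width, height, value);

     // 4. Copy the result from the device into the output image.
     cudaMemcpy(host_result, d_pixels, bytes, cudaMemcpyDeviceToHost);

     // 5. Free the allocated device memory.
     cudaFree(d_pixels);
 }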

Kernel Setup and Execution

A block size of 8x8 (64) threads is used and the dimensions of the grid are calculated as ceil(width/8) and ceil(height/8) respectively. This means that the blocks in the grid that coincide with the edges of the image may contain up to 63 threads that do not perform any operation on the image data. Some applications have shown a significant speedup when padding the input data so that there is no need to check the validity of a thread's index. This is, however, not considered here.
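
As an illustration, the launch configuration could be set up as follows; the kernel name and its parameter list are assumptions carried over from the host-side sketch above:

 // 8x8 thread blocks; the grid is rounded up so that every pixel is covered
 // (edge blocks may therefore contain threads that fall outside the image).
 dim3 block(8, 8);
 dim3 grid((width + block.x - 1) / block.x,
           (height + block.y - 1) / block.y);
 multiply_kernel<<<grid, block>>>(d_pixels, width, height, value);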

In executing the kernel, any double parameters are first converted to float on the host before passing them as parameters. Although the device does support half-float operations at an instruction level, this functionality is not made available by the current API. It is therefore necessary to convert between single- and half-precision values. This conversion takes place on the device, with the input image stream interpreted as a sequence of 16-bit unsigned shorts. The kernel execution can be broken down as follows (a kernel sketch is given after the list):

  1. Calculate the index in the linear array from the thread's index.
  2. Check to ensure that the pixel being accessed is valid with respect to the image dimensions. If not, do nothing.
  3. Convert the half-precision channel values for a pixel to single-precision values.
  4. Perform the desired operation on the single-precision channel values.
  5. Convert the single-precision results to half-precision and store the result.
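
A minimal kernel sketch of these steps is shown below. It assumes the operation multiplies each channel by a single float value, and it uses the cuda_fp16.h conversion intrinsics for clarity; the actual plugin performs the equivalent conversions on the raw 16-bit unsigned short stream.

 #include <cuda_fp16.h>

 // Hypothetical kernel: multiply every half-float RGBA channel by a constant.
 __global__ void multiply_kernel(unsigned short* pixels,
                                 int width, int height, float value)
 {
     // 1. Pixel coordinates and linear index from the block and thread indices.
     const int x = blockIdx.x * blockDim.x + threadIdx.x;
     const int y = blockIdx.y * blockDim.y + threadIdx.y;

     // 2. Threads that fall outside the image do nothing.
     if(x >= width || y >= height)
         return;

     // The 16-bit channel values are reinterpreted as half-precision floats.
     __half* half_pixels = reinterpret_cast<__half*>(pixels);
     const int base = (y * width + x) * 4;  // four channels per pixel

     for(int c = 0; c < 4; ++c)
     {
         // 3. Convert the half-precision channel value to single precision.
         float channel = __half2float(half_pixels[base + c]);

         // 4. Perform the desired operation in single precision.
         channel *= value;

         // 5. Convert the result back to half precision and store it.
         half_pixels[base + c] = __float2half(channel);
     }
 }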

Improving Performance

A number of options exist to possibly improve the performance of the plugin. As already mentioned, using a padded two-dimensional array instead of a linear structure may result in improved performance as it does away with the conditional statements in the kernel. Additionally, the size of the thread blocks can be tuned for performance, although the results could be heavily device dependent.
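
As an illustration of the padding idea, the device buffer dimensions could be rounded up to the 8x8 block size on the host, so that every thread maps to an allocated (possibly padding) pixel and the bounds check in the kernel can be dropped. Whether this pays off would have to be measured:

 // Round the buffer dimensions up to a multiple of the 8x8 block size.
 const int padded_width  = ((width  + 7) / 8) * 8;
 const int padded_height = ((height + 7) / 8) * 8;
 const size_t padded_bytes = static_cast<size_t>(padded_width) * padded_height * 8;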

Results

The following image shows the speedup obtained for the CUDABitmapMultiply plugin over the CPU implementation.

Image:CUDABitmapMultiplySpeedup.png