User:Evanlezar/GSoC2008-Progress

Investigation of GPU-Based Implementations of K-3D Filters

Initial List of Plugins

The following table lists the plugins included in the project application. A * in the status column indicates that the plugin is functional (albeit in a developmental capacity), whereas a # indicates that the plugin is considered complete.

Module        Plugin                    Status   Implementation Date
bitmap        add                       *        08 May 2008
bitmap        multiply                  *        11 May 2008
bitmap        subtract                  *        11 May 2008
bitmap        color_monochrome          *        18 May 2008
bitmap        gamma                     *        09 July 2008
bitmap        invert                    *        10 July 2008
bitmap        matte_colordiff           *        10 July 2008
bitmap        matte_invert              *        10 July 2008
bitmap        threshold                 *        10 July 2008
deformation   transform_points          *        28 May 2008
deformation   center_points
deformation   rotate_points
deformation   scale_points
deformation   shear_points
deformation   translate_points
deformation   bend_points
deformation   bulge_points
deformation   cylindrical_wave_points
deformation   linear_point_noise
deformation   linear_wave_points
deformation   smooth_points
deformation   sphereize_points
deformation   taper_points
deformation   tweak_points
deformation   twist_points
qslim         quadric_deformation

Additional Plugins

The following table is provided so that other users may request additional plugins to be considered.

Module    Plugin            Requested By   Accepted
mesh      subdivide_edges   Bart           *
polygon   poly_grid                        *

Progress Summary and Notes

Start to 05 May 2008

  • Implemented a CUDA version of the BitmapAdd plugin as part of the application process. This included integrating the module into the K-3D build process. The plugin has been tested and works in device emulation mode under Linux as well as in hardware mode under Windows.
  • Some time was also spent getting a hardware development environment running under Linux.
  • Benchmarks investigating the contribution of memory transfer times to the total execution time of the plugin are being implemented.

08 May 2008

  • The hardware implementation under Linux is now working. The problem was caused by type-casting a float to an int in order to obtain the float's bit pattern. The CUDA device functions __int_as_float() and __float_as_int() solved this problem (see the sketch after this list).
  • Some benchmarking was performed for square BitmapSolids of different sizes (from 8x8 to 4096x4096) using k3d::log() and it was found (as expected) that the majority of the time is spent transferring data to and from the device. However, due to the high bandwidth (1GB/s, measured using the bandwidthTest in the Nvidia CUDA SDK) the CUDA implementation is still able to outperform the CPU implementation by a considerable margin. A summary of the benchmark results is available at File:CUDA-Benchmarks08052008.pdf.
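
As a minimal illustration of the bit-pattern issue (the kernel and variable names here are made up for the example and are not taken from the plugin code), a C-style cast converts the numeric value, whereas the intrinsics reinterpret the raw bits:

    // Illustrative only: reinterpreting a float's bit pattern on the device.
    // A value cast converts the number and loses the bit pattern:
    //   int bits = (int)f;   // 1.5f becomes 1, not 0x3FC00000
    // The device intrinsics reinterpret the bits instead:
    __global__ void bit_pattern_example(const float* in, float* out, int n)
    {
        const int i = blockIdx.x * blockDim.x + threadIdx.x;
        if(i < n)
        {
            int bits = __float_as_int(in[i]);  // raw IEEE-754 bit pattern
            bits &= 0x7FFFFFFF;                // e.g. clear the sign bit
            out[i] = __int_as_float(bits);     // reinterpret back to float
        }
    }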

11 May 2008

  • I have improved the benchmarking of the CUDABitmapAdd plugin by including script-readable properties in the cuda_add and add classes. A benchmark test case has been written that outputs a table of timing results for images of increasing size.
  • The first time in a benchmark that memory is allocated on the device and data transferred to it, the cudaMalloc call takes about 700ms longer than subsequent calls - even though the memory is freed after the processing is completed. In addition, this first kernel execution also takes longer (a possible warm-up step is sketched after this list).
  • CUDABitmapMultiply and CUDABitmapSubtract as well as their corresponding regression tests have been implemented.
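
The extra cost of the first call is commonly attributed to lazy creation of the CUDA context. A hedged sketch of how a benchmark could absorb that cost before any timed work starts (the helper name is illustrative):

    #include <cuda_runtime.h>

    // Illustrative warm-up: force CUDA context creation before any timed
    // allocation or kernel launch so that the one-off start-up cost does
    // not pollute the benchmark results.
    static void cuda_warm_up()
    {
        void* scratch = 0;
        cudaMalloc(&scratch, 16);  // first call pays the context-creation cost
        cudaFree(scratch);
        cudaThreadSynchronize();   // CUDA 2.x-era synchronization call
    }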

18 May 2008

  • An initial implementation of CUDABitmapColorMonochrome has been submitted.
  • The use of a pipeline_profiler to perform the timings of the CUDA plugins is being investigated. The python wrapper for idocument needs to be extended and a wrapper for the pipeline profiler created.
  • Additional functionality in the pipeline profiler may be required.

20 May 2008

  • The benchmarking of the plugins has been changed to make use of a PipelineProfiler document node.
  • The profiler output has been converted into a user-readable form and displayed in the CTest Dashboard.
  • Benchmarks for some existing Bitmap plugins have been added.
  • The charting of benchmark results is being discussed.

25 May 2008

  • Due to the amount of time taken up by memory transfers to and from the device, there is some discussion of how their effect can be minimized (see the discussion page).
  • The further conversion of Bitmap plugins is on hold for now in favor of some plugins that handle meshes. The aim is to obtain more data on the performance of the CUDA plugins with different types of operations.

28 May 2008

  • An initial implementation of the CUDATransformPoints plugin has been submitted, along with benchmarks comparing the CUDA implementation to the TransformPoints plugin.
  • At this stage the non-CUDA plugin performs quite well, and some optimization of the CUDA kernel is required (a sketch of the kind of kernel involved follows this list).
  • Since a conversion to single precision is required for current-generation GPUs, it may be possible to use asynchronous transfers between host and GPU memory.
  • Currently the two CUDA plugins (Bitmap and Deformation) are still in their own modules, but will be moved to a single CUDA module before further development takes place.
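
For reference, a point-transform kernel along these lines might look as follows. This is a minimal sketch, assuming points packed as float4 values and a row-major 4x4 matrix in constant memory; none of the names are taken from the actual plugin:

    // Illustrative point-transform kernel: one thread per point, applying a
    // row-major 4x4 matrix held in constant memory.
    __constant__ float c_transform[16];

    __global__ void transform_points_kernel(float4* points, int point_count)
    {
        const int i = blockIdx.x * blockDim.x + threadIdx.x;
        if(i >= point_count)
            return;

        const float4 p = points[i];
        float4 r;
        r.x = c_transform[0]*p.x  + c_transform[1]*p.y  + c_transform[2]*p.z  + c_transform[3]*p.w;
        r.y = c_transform[4]*p.x  + c_transform[5]*p.y  + c_transform[6]*p.z  + c_transform[7]*p.w;
        r.z = c_transform[8]*p.x  + c_transform[9]*p.y  + c_transform[10]*p.z + c_transform[11]*p.w;
        r.w = c_transform[12]*p.x + c_transform[13]*p.y + c_transform[14]*p.z + c_transform[15]*p.w;
        points[i] = r;
    }

The matrix would be uploaded once with cudaMemcpyToSymbol(c_transform, host_matrix, 16 * sizeof(float)) before launching one thread per point.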

29 May 2008

  • All the CUDA plugins have been moved to a single CUDA module.
  • As was the case with the bitmap plugins, the transfer to the card for the simple TransformPoints is the limiting factor, although the implementation is still in its early stages.
  • A number of test cases of varying sizes must be set up to ascertain the effect of the number of points in the mesh on the relative performance of the CUDA and CPU-based implementations.
  • In the implementation as it stands, it may be possible to make use of asynchronous host-to-device transfers to obtain better performance (a sketch of the pattern follows).
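
A hedged sketch of the asynchronous pattern, assuming page-locked (pinned) host buffers and the point-transform kernel sketched under 28 May; the function and buffer names are illustrative:

    #include <cstring>
    #include <cuda_runtime.h>

    __global__ void transform_points_kernel(float4* points, int point_count);  // as sketched above

    // Illustrative asynchronous upload/execute/download pattern.  Page-locked
    // host memory is required for cudaMemcpyAsync to be truly asynchronous.
    void transform_points_async(float4* host_points, int point_count)
    {
        const size_t bytes = point_count * sizeof(float4);

        float4* pinned = 0;
        float4* device = 0;
        cudaMallocHost((void**)&pinned, bytes);  // page-locked host buffer
        cudaMalloc((void**)&device, bytes);
        memcpy(pinned, host_points, bytes);

        cudaStream_t stream;
        cudaStreamCreate(&stream);

        cudaMemcpyAsync(device, pinned, bytes, cudaMemcpyHostToDevice, stream);
        const int threads = 256;
        const int blocks = (point_count + threads - 1) / threads;
        transform_points_kernel<<<blocks, threads, 0, stream>>>(device, point_count);
        cudaMemcpyAsync(pinned, device, bytes, cudaMemcpyDeviceToHost, stream);
        cudaStreamSynchronize(stream);

        memcpy(host_points, pinned, bytes);

        cudaStreamDestroy(stream);
        cudaFree(device);
        cudaFreeHost(pinned);
    }

Note that with a single stream the copy, kernel and copy still execute in order; actual overlap of transfers and computation would require splitting the points across several streams on a device that supports it.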

03 June 2008

  • Due to the limitation placed on the performance by memory transfers, an asynchronous implementation of the CUDATransformPoints plugin is being investigated. Unfortunately, some of the GeForce 8800 cards do not support compute capability 1.1, which is required for this functionality.
  • The module should be able to check the capabilities of the card(s) installed in the machine and select among the various implementations accordingly (a possible check is sketched below).
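
A possible capability check using the CUDA runtime API (the helper name is illustrative, not existing code):

    #include <cuda_runtime.h>

    // Illustrative check: is there at least one CUDA device, and does device 0
    // report compute capability 1.1 or higher (needed for the asynchronous
    // transfer path)?
    bool cuda_supports_async_transfers()
    {
        int device_count = 0;
        if(cudaGetDeviceCount(&device_count) != cudaSuccess || device_count == 0)
            return false;

        cudaDeviceProp properties;
        cudaGetDeviceProperties(&properties, 0);

        // Compute capability is reported as major.minor.
        return (properties.major > 1) || (properties.major == 1 && properties.minor >= 1);
    }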

08 June 2008

  • An asynchronous version of the CUDATransformPoints plugin has been submitted. Some foolish arithmetic on my part was causing some segmentation faults, but these have now been sorted.
  • With the current single benchmark, the effect of mesh size on the performance of the plugin cannot be investigated.
  • In the next day or two the mesh benchmarks will be improved to make provision for meshes of different sizes, in much the same way as the bitmap benchmarks.
  • Some functionality to check the capabilities of the installed GPU should also be implemented - even if it is only to check for Compute capability 1.1 for the asynchronous transfers and obviously for the presence of a card.

11 June 2008

  • The mesh modifier benchmarks have been updated to allow for increasing mesh sizes.
  • The results show that the majority of the time for applying a transformation is spent in updating the mesh. Subsequent studies show that this is due to the merge_selection being executed.
  • In general more than 90% of the total time is spent updating the mesh. Thus, without moving this to the GPU the performance gains will be negligible - or non-existent.
  • The next step in implementing the mesh modifiers on the GPU is investigating the feasibility of moving functions such as merge_selection.
  • In addition, the benchmarking system is still not adequate for comparing the results of the various implementations and needs some attention.

15 June 2008

  • Spent some time improving the benchmarking system. CSV files containing the benchmark data are now generated in the build/tests/benchmarks folder. Some more thought needs to be put into timestamping etc of consecutive benchmarks so that results could be processed.
  • The time taken in updating the mesh was due to the fact that merge_selection was being called even when it was not necessary and since this has been fixed (thanks Bart) the timing results are a bit more meaningful.
  • Both synchronous and asynchronous implementations of TransformPoints are present, although only one can be compiled at a time at present.
  • Some initial benchmarks (and the use of the CUDAVisualProfiler) indicate that with the current implementations the synchronous code outperforms the asynchronous version - this is probably due to stream creation overhead and is also expected to be a function of the number of points in the mesh.
  • Fine-grained timings have been included in the synchronous version once more and will be included in the asynchronous version as soon as possible (a simple event-based timing helper is sketched after this list).
    • The results from these timings seem to indicate that almost half the time is spent converting values from double to float and back again, at a rate of roughly 70M floats per second. This would not be required on NVIDIA cards with double-precision support (the GTX 200 series and later).
    • It may be possible to move the double-to-float conversion to the GPU in the same way that the half-to-float conversion is implemented, but this would double the amount of data being transferred, which may be counterproductive - although, as long as the combined rate of data transfer (about 100 million doubles per second at the measured bandwidth) and on-device conversion is higher than the current host-side conversion rate, this would be favorable.
  • An implementation of CUDATransformPoints that uses cuBLAS's SGEMM implementation could also be investigated and compared to the synchronous and asynchronous implementations. It would then be possible to choose between the various implementations at runtime, depending on the problem size, to obtain optimal performance.
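
A minimal sketch of how one stage could be timed on the device with CUDA events (the helper is illustrative; the plugins' own fine-grained timings may be structured differently):

    #include <cuda_runtime.h>

    // Illustrative fine-grained timing of a single stage (for example the
    // double-to-float conversion or a kernel launch) using CUDA events.
    // cudaEventElapsedTime reports the elapsed time in milliseconds.
    float time_stage_ms(void (*launch_stage)())
    {
        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        cudaEventRecord(start, 0);
        launch_stage();
        cudaEventRecord(stop, 0);
        cudaEventSynchronize(stop);

        float elapsed_ms = 0.0f;
        cudaEventElapsedTime(&elapsed_ms, start, stop);

        cudaEventDestroy(start);
        cudaEventDestroy(stop);
        return elapsed_ms;
    }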

22 June 2008

  • Implemented a simple graphing system to process the benchmark results. The system currently uses matplotlib to perform plotting, but could be moved to something like pycairo without too much additional effort.
  • The CUDA code has been moved into a shared library so as to make getting the code running on Windows a little easier.

23 June 2008

  • It has been decided to move away from the simple mesh deformation plugins for the time being and on to a more complex plugin, which will hopefully be more computationally intensive.
  • Bart has requested that the subdivide_edges plugin be looked at.
  • Over the next day or two I will plan a GPU implementation of the mesh and start implementing a CUDA version of the subdivide edges plugin.

30 June 2008

  • Work on the CUDA mesh implementation is progressing a little more slowly than expected, although some headway is being made.
  • The plan is to implement a simple mesh converter plugin so that writing the mesh to the device and reading it back again can be tested (a rough illustration of the idea follows this list).
  • See the files cuda_mesh.h and cuda_mesh.cpp for the work in progress.
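
Purely as an illustration of the direction (the real layout lives in cuda_mesh.h/cuda_mesh.cpp and may well differ), a device-side mirror of part of a polyhedral k3d::mesh could hold raw device pointers alongside the array sizes:

    // Illustrative only: a device-side mirror of a subset of the polyhedra
    // arrays in a k3d::mesh.  The field names echo the host-side arrays but
    // are not taken from cuda_mesh.h.
    struct cuda_polyhedra_sketch
    {
        unsigned int* face_first_loops;   // device pointer
        unsigned int* face_loop_counts;   // device pointer
        unsigned int* loop_first_edges;   // device pointer
        unsigned int* edge_points;        // device pointer
        unsigned int* clockwise_edges;    // device pointer
        unsigned int face_count;
        unsigned int loop_count;
        unsigned int edge_count;
    };

    struct cuda_mesh_sketch
    {
        float* points;                    // device pointer, e.g. 4 floats per point
        unsigned int point_count;
        cuda_polyhedra_sketch polyhedra;
    };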

3 July 2008

  • The mesh converter plugins (k3d to cuda, cuda to k3d) have been implemented for a mesh consisting of polyhedra.
  • Work still needs to be done on the named_arrays associated with the structures.

9 July 2008

  • The implementation of a CUDA version of SubdivideEdges is taking more time than expected. This is aggravated by the fact that there is a difference between results obtained in emulation mode and in device mode.
  • An issue stemming from the 64-bit width of k3d::uint_t on 64-bit systems has also come to light and requires attention (the problem is illustrated after this list).
  • To step away from the problem briefly - to allow some time for reflection and digestion - some more Bitmap plugins will be converted first.
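
To make the width issue concrete: on a 64-bit system k3d::uint_t is typically 64 bits wide, while the device code here works with 32-bit indices, so copying an index array byte-for-byte misinterprets the data. A hedged sketch of an explicit host-side narrowing step (illustrative only, and the k3dsdk include path is an assumption):

    #include <vector>
    #include <cuda_runtime.h>
    #include <k3dsdk/types.h>  // assumed header providing k3d::uint_t

    // Illustrative host-side narrowing of a k3d::uint_t index array to the
    // 32-bit unsigned ints used on the device.  A raw memcpy would silently
    // interleave the high and low words when k3d::uint_t is 64 bits wide.
    unsigned int* upload_indices(const k3d::uint_t* indices, size_t count)
    {
        if(count == 0)
            return 0;

        std::vector<unsigned int> narrowed(count);
        for(size_t i = 0; i != count; ++i)
            narrowed[i] = static_cast<unsigned int>(indices[i]);

        unsigned int* device_indices = 0;
        cudaMalloc((void**)&device_indices, count * sizeof(unsigned int));
        cudaMemcpy(device_indices, &narrowed[0], count * sizeof(unsigned int), cudaMemcpyHostToDevice);
        return device_indices;
    }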

10 July 2008

  • All the bitmap plugins have now been converted, although there is still scope for performance improvement.
  • The fact that the benchmark comparison plots rely on matplotlib is not ideal.
  • A Python wrapper for the Google Charts API is available (pygooglechart) and may be investigated.

27 July 2008

  • Work on CUDA subdivide edges continues. At this stage all the parallel parts of the execution have been converted, and only the serial edge_index_calculator and varying_data_copier remain.
  • Considering the performance of the implementation, the memory transfer is, as expected, the dominating factor in the execution time (especially for small meshes). The following image shows the CUDA Visual Profiler output for the CUDASubdivideEdges.py test.

Image:CUDASubdivideEdge_Profile.jpg

  • In this image the yellow bars represent memory transfers and account for 75% of the execution time.
  • For larger meshes the memory transfer time becomes less significant, as shown by the following output for the final mesh in the CUDASubdivideEdges benchmark suite.

Image:CUDASubdivideEdges_benchmark.jpg

  • In this case, the memcpy overhead is reduced from 75% to just over 40%.

7 August 2008

  • edge_index_calculator has been implemented.
  • Some erratic behavior in the benchmarks due to the serial code was observed. It is assumed that this is because the nested loops were taking too long to execute. The face and loop loops have been moved outside the kernel call.

18 August 2008

  • Main steps in SubdivideEdges now have device equivalents.
  • Calculation of companions still requires some reworking as it contains serial code that hampers performance.
  • Thanks to Bart for optimizing the edge_index_calculator significantly.
  • Added documentation for most of the CUDA Plugins.
  • Documentation that still needs work includes the discussion on the device implementation of the mesh structures.

Where to from here

With Summer of Code coming to an end, there is some time for reflection and a reevaluation of achievements as well as future goals. As a general learning experience - both in terms of GPGPU applications and in terms of working as part of a larger open source community - the project has been invaluable. In terms of the applicability of GPGPU acceleration to the K-3D plugins, the future is less clear.

There are areas where the GPGPU implementations were able to show their worth. The image processing plugins achieved a considerable improvement, even on a slightly outdated PC. This was expected to some extent, since the processing fits the SIMD execution model well and the amount of data that has to be transferred to the device is relatively small. For the TransformPoints plugin the speedup was less significant, because other steps in the computation that had not been ported to the GPU dominate the execution time. In addition, it was necessary to convert a large number of doubles to floats before execution could take place; this would not be a problem on the latest generation of offerings from Nvidia and ATI, since they provide double-precision support. Lastly, the complex nature of the SubdivideEdges plugin meant that some of the steps required serial execution (at least in an initial implementation), which severely hampered performance. This did, however, illustrate that it is at least possible, if not optimal, to execute serial code on a GPU.

A common theme in the benchmarking of these plugins was that data transfer is expensive while computation is relatively cheap. A GPU can easily perform 300 billion floating-point operations per second, but can only transfer floats at a rate of around 20 billion per second - roughly 15 operations for every float transferred just to keep the arithmetic units busy. It therefore makes sense to do as much work as possible on each value that is transferred before sending back the result.

In terms of future work, implementing more of the functions used by the existing plugins on the device will only improve performance - for example, if the merging of a selection happened on the device, the performance of the CUDATransformPoints plugin could be greatly improved. This would also benefit other mesh plugins such as CUDASubdivideEdges, although for those the serial execution of some parts is the bigger problem at present.

Making the code more robust in terms of device detection, for example, is also a necessity, although it is not crucial in these experimental stages. Partitioning the visualization pipeline into independent segments would also allow plugins to execute in parallel on the GPU and CPU, provided device equivalents of enough of the plugins exist. At present only Nvidia's CUDA has been considered - mainly due to the relative maturity of its API over ATI's at the start of the project - but recent developments, both in hardware and in newer releases of the ATI API, make it an interesting alternative to investigate.

ToDo

The following things still require attention:

  • CUDA error handling (a possible checking macro is sketched after this list)
  • Querying device capabilities
  • Making thread and block allocation more robust
  • Doxygen compliance
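
For the error-handling item, one common pattern would be to wrap every runtime call in a checking macro; a sketch (not yet present in the code):

    #include <cstdio>
    #include <cstdlib>
    #include <cuda_runtime.h>

    // Illustrative error-checking wrapper for CUDA runtime calls; aborts with
    // the file, line and CUDA error string on failure.
    #define CUDA_CHECK(call)                                              \
        do                                                                \
        {                                                                 \
            const cudaError_t error = (call);                             \
            if(error != cudaSuccess)                                      \
            {                                                             \
                fprintf(stderr, "CUDA error at %s:%d: %s\n",              \
                        __FILE__, __LINE__, cudaGetErrorString(error));   \
                abort();                                                  \
            }                                                             \
        } while(0)

    // Example usage:
    //   CUDA_CHECK(cudaMalloc((void**)&device_pointer, byte_count));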

The following have been implemented to some degree, but may require further attention:

  • Benchmark submission process. Currently using pylab but would like to switch to another plotting backend - Google Charts perhaps?