Updated CUDA example code: cudaHello2.zip
Using CUDA Threads
To fully exploit GPU hardware using CUDA, it is necessary to use CUDA threads.
The good news is that CUDA threads are not like CPU threads (e.g., pthreads or Java threads). Threads appear in the CUDA programming model as an additional index or indices, similar to the block index/indices, that your kernel function will use to determine its assignment in the overall computation.
The basic idea is very simple: in specifying the parallel blocks to be executed, either as an integer or as multiple dimensions, you can also specify the number of threads per block. The number of threads can be specified as an integer or as multiple dimensions (using the dim3 data type), in the same way that multiple dimensions can be specified for parallel blocks.
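For example, here is a minimal (if trivial) program sketch showing a launch with a 2D grid of blocks and multiple threads per block. The kernel name and dimensions are illustrative, not taken from the example code:

    #include <stdio.h>

    // Placeholder kernel: each thread prints its block and thread indices.
    __global__ void myKernel(void) {
        printf("block (%d,%d), thread %d\n", blockIdx.x, blockIdx.y, threadIdx.x);
    }

    int main(void) {
        dim3 numBlocks(2, 2);     // a 2D grid of parallel blocks
        int threadsPerBlock = 4;  // threads per block (could also be a dim3)
        myKernel<<<numBlocks, threadsPerBlock>>>();
        cudaDeviceSynchronize();  // wait for the kernel (and its printf output)
        return 0;
    }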
Example: Mandelbrot set
In our original CUDA example code, we used one thread per parallel block. In the revised example code (cudaHello2.zip), we define a constant (THREADS_PER_BLOCK) to specify a fixed number of threads per block. The number of threads per block should be a multiple of 32 (the warp size) in order to use hardware resources efficiently.
Because each block will now execute the kernel function using multiple threads, we reduce the number of rows of parallel blocks.
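Here is a sketch of what the revised launch configuration might look like (WIDTH, HEIGHT, mandelKernel, and deviceIters are illustrative names, not necessarily those used in cudaHello2.zip):

    #define WIDTH  1024                // image width (illustrative)
    #define HEIGHT 1024                // image height (illustrative)
    #define THREADS_PER_BLOCK 32

    // One block per column; the number of rows of blocks shrinks by a
    // factor of THREADS_PER_BLOCK, rounded up so every row is covered.
    dim3 grid(WIDTH, (HEIGHT + THREADS_PER_BLOCK - 1) / THREADS_PER_BLOCK);
    mandelKernel<<<grid, THREADS_PER_BLOCK>>>(deviceIters);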
Note that because the problem size (in one or more dimensions) might not be evenly divisible by the number of threads per block, it may be necessary to schedule additional blocks, rounding up as shown above; the extra threads then simply do nothing.
Here is a sketch of the modified kernel function (reconstructed from the description; the exact code appears in cudaHello2.zip):
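    // Each thread computes the iteration count for one pixel.
    // mandelIters is a hypothetical per-pixel helper, and WIDTH/HEIGHT
    // are the illustrative dimensions defined above.
    __global__ void mandelKernel(int *iters) {
        int col = blockIdx.x;                             // one block per column
        int row = blockIdx.y * blockDim.x + threadIdx.x;  // row within the image
        if (row >= HEIGHT)
            return;  // extra threads scheduled by rounding up do nothing
        iters[row * WIDTH + col] = mandelIters(col, row);
    }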
Note that blockDim.x gives us the total number of threads per block, and threadIdx.x gives us the current thread number (in the range 0..blockDim.x-1, inclusive). In the kernel function, we use this information to compute the row number in the computation.
Using 32 threads per block, the running time of the computation decreases from 472.6 ms to 7.2 ms. Using 64 threads per block, the running time is about 6.4 ms.