February 4, 2011

CUDA Blocks and Grids

In our previous CUDA example, we wrote a kernel that did some math based on thread number. I was deliberately vague about the details of the triple-chevron launch syntax, with a promise to cover it next time. Next time is today.

Recall that we launched our kernel with a line of this form (one block of n threads):

    kernel<<<1,n>>>(d_array);

where n was the number of threads that would be spawned on the device, along with a caution that values larger than 512 might cause problems. The good news is that we are certainly not limited to 512 threads, and the launch syntax has some features that make mapping threads to larger problems much easier. First the code:
/*
 * cuda_enum2.cu
 * CUDA kernel that enumerates by block and thread indices
 * compile with:  nvcc -o cuda_enum2 cuda_enum2.cu
 * resultsovercoffee@gmail.com
 */

#include <iostream>
#include <unistd.h>

using namespace std;

//function prototype for our CUDA kernel
__global__ void enumerate(int *);

int main(int argc, char *argv[]) {
    dim3 grid(5,5);
    dim3 block(3,3,3);
    int n = grid.x * grid.y * block.x * block.y * block.z;
    size_t size = n * sizeof(int);

    //declare, allocate, and zero an array on the device
    int *d_array;
    if (cudaMalloc(&d_array,size) != cudaSuccess){
        cerr << cudaGetErrorString(cudaGetLastError()) << endl;
        return 1;
    }
    if (cudaMemset(d_array,0,size) != cudaSuccess){
        cerr << cudaGetErrorString(cudaGetLastError()) << endl;
        return 1;
    }

    //launch kernel on device to generate results
    enumerate<<<grid,block>>>(d_array);

    //declare and allocate an array on the host
    int *h_array = (int*)malloc(size);
    if (h_array == 0){
        cerr << "malloc failed" << endl;
        return 1;
    }

    //copy the device buffer back to the host
    if (cudaMemcpy(h_array,d_array,size,cudaMemcpyDeviceToHost)
        != cudaSuccess) {
        cerr << cudaGetErrorString(cudaGetLastError()) << endl;
        return 1;
    }

    //print results
    for (int i=0;i<n;i++) {
        cout << h_array[i] << " ";
    }
    cout << endl;

    //deallocate arrays
    cudaFree(d_array);
    free(h_array);

    return 0;
}

/*
 * enumerate
 * kernel that computes a unique thread number based on the grid and
 * block indices and stores it in A[]
 */
__global__ void enumerate(int *A) {
    int value =
          blockIdx.y  * gridDim.x  * blockDim.z * blockDim.y * blockDim.x
        + blockIdx.x  * blockDim.z * blockDim.y * blockDim.x
        + threadIdx.z * blockDim.y * blockDim.x
        + threadIdx.y * blockDim.x
        + threadIdx.x;
    A[value] = value;
}

Again, this should look pretty similar to our previous CUDA example. We've renamed our kernel to something more descriptive, changed the launch syntax a little, and re-written the kernel itself.

First, the launch syntax. Rather than a couple of scalar numbers, we are now passing vectors of type dim3. As you might suspect, dim3 is a structure holding three values. When we create one using the constructor syntax, any values not specified are set to one. That means that "dim3 grid(5,5);" creates a vector with three values: (5,5,1).

Additionally, you can see that the launch syntax takes two arguments: a grid and a block. A thread block is a group of related threads that can be arranged in up to three dimensions. On Fermi, the maximum block size is 1024 threads, and the maximum dimensions are 1024 x 1024 x 64. Previous GPUs have lower limits, so you might be restricted to 512 threads and 512 x 512 x 64 on a GT200-based system. Threads in the same block are scheduled on the same streaming multiprocessor and can communicate through shared memory.

That many threads sounds like a lot, but it isn't much for a GPU that has 448 cores. That's where grids come in. Multiple thread blocks can be launched as part of a single grid. On Fermi, grids are two-dimensional and can be up to 65535 x 65535 blocks. Since each grid cell is a block of threads, the total number of threads gets large very quickly. Luckily, GPUs have hardware schedulers and can easily handle many thousands of threads. That's important, because GPUs use threading to hide memory access latency rather than caching the way a CPU does, so on GPUs you always want to launch more threads than you have cores.

In the code above, we are launching a 5 x 5 grid of 3 x 3 x 3 thread blocks. That's 675 threads being launched in parallel on the GPU.

The other thing we changed is the CUDA kernel itself. Our enumerate kernel calculates a thread number based on the block and grid dimensions and the thread's index. As a result, the number is unique per thread, and we calculated it so that threads access the array in a coalesced fashion (neighboring threads in a block store to neighboring locations in the array).

My coffee is empty, so that's enough explanation for today.
