January 30, 2011

Our First CUDA C Kernel

Our first CUDA program covered the basics of moving data to and from the GPU (aka the device). While moving data is a necessary skill, it isn't very compelling by itself. We didn't use any of the GPU's processing power, and there are better ways to create a volatile RAM disk.

For a second effort, let's do something simple: we'll demonstrate how to give each thread we launch on the GPU a unique ID and do some integer addition. First the code:

/*
 * cuda_enum.cu
 *
 * Simple CUDA Kernel that enumerates by thread index
 *
 * compile with:  nvcc -o cuda_enum cuda_enum.cu
 * 
 * resultsovercoffee@gmail.com
 */

#include <iostream>
#include <cstdlib>

using namespace std;

//function prototype for our CUDA kernel
__global__ void kernel(int *);


int main(int argc, char *argv[]) {
    int n = 16;
    size_t size = n * sizeof(int);

    //declare, allocate, and zero an array on the device
    int *d_array;
    if (cudaMalloc(&d_array,size) != cudaSuccess){
        cerr << cudaGetErrorString(cudaGetLastError()) << endl;
        exit(1);
    }
    if (cudaMemset(d_array,0,size) != cudaSuccess){
        cerr << cudaGetErrorString(cudaGetLastError()) << endl;
        exit(1);
    }

    //launch kernel on device to generate results
    kernel<<<1,n>>>(d_array);

    //declare and allocate an array on the host
    int *h_array = (int*)malloc(size);
    if (h_array == 0){
        cerr << "malloc failed" << endl;
        exit(1);
    }

    //copy the device buffer back to the host
    if (cudaMemcpy(h_array,d_array,size,cudaMemcpyDeviceToHost)
        != cudaSuccess) {
        cerr << cudaGetErrorString(cudaGetLastError()) << endl;
        exit(1);
    }

    //print results
    for (int i=0;i<n;i++) {
        cout << h_array[i] << endl;
    }

    //deallocate arrays
    cudaFree(d_array);
    free(h_array);

    //bye
    exit(0);
}


/*
 * kernel
 * 
 * CUDA kernel -- sets a value of A[] to the thread's X index + 1
 */
__global__ void kernel(int *A) {
    A[threadIdx.x] = 1 + threadIdx.x;
}

This should look somewhat similar to the previous example. We are still declaring and allocating memory on the device, but instead of character buffers, we are using arrays of integers. The variable n that is set in main() controls the length of the arrays. Additionally, we are using a new host function, cudaMemset(), to set the device array to all zeros. This isn't strictly necessary for this example, but will be in the future.
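One detail worth noting about cudaMemset() (an aside, not something this example depends on): like memset() on the host, it fills memory byte by byte, not element by element. Zeroing an int array works because every byte of zero is zero, but other fill values don't behave the way you might hope:

cudaMemset(d_array, 0, size);    //every int becomes 0 -- works as expected
cudaMemset(d_array, 1, size);    //every int becomes 0x01010101, not 1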

The really new part is the kernel function. In CUDA C, we refer to the arithmetically dense function that runs on the device as a kernel, so our example kernel is eponymously named. The key part of the prototype (and later definition) is the __global__ type qualifier (note the double underscores prefix and postfix). __global__ functions are executed on the device, but must be called from the host and must return void.
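For context (this goes slightly beyond what the example uses), __global__ is one of several CUDA function qualifiers. A quick sketch of how they differ -- helper and hostfunc are just made-up names for illustration:

__global__ void kernel(int *A);  //runs on device, called from host, must return void
__device__ int helper(int x);    //runs on device, callable only from device code
__host__   int hostfunc(int x);  //ordinary host function (the default when unqualified)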

Our kernel function is defined at the end of the code. It takes a pointer to an array A. The body of the function adds 1 to the thread's X index and stores the result in A[threadIdx.x]. Notice that we didn't declare or set threadIdx.x -- it is a built-in variable in CUDA that is only valid while code is executing on the device. In our example the first thread gets 0, the second gets 1, and so on. Each thread therefore gets a different index and operates on a different element of A. Because different threads can run on different CUDA cores, the calculations can occur in parallel on the GPU.
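If it helps to see the parallelism spelled out, here is a serial host loop that produces the same result as our kernel (a comparison sketch, not part of the program above):

//what the GPU does in parallel, written as a serial loop on the host
for (int i = 0; i < n; i++) {
    A[i] = 1 + i;    //each iteration plays the role of one thread
}

On the GPU there is no loop: each of the n threads executes the kernel body exactly once, with threadIdx.x standing in for i.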

The final new thing is the launch of the kernel that occurs near the center of main(). Note that the syntax looks like a standard function call, but includes some additional arguments between triple chevron brackets, <<<1,n>>>. Without going into too much detail, the second argument is the number of threads our example will spawn on the device. Since we also use n to size our array, we will be spawning one thread for each element of A[]. Feel free to experiment with different values of n, but note that values greater than 512 may cause the launch to fail. More on that next time.
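If you do push n past the limit, the failure is easy to miss, because a kernel launch returns immediately without reporting an error by itself. One way to catch it (a sketch reusing the error-handling pattern from the listing above) is to query the error state right after the launch:

kernel<<<1,n>>>(d_array);
cudaError_t err = cudaGetLastError();
if (err != cudaSuccess) {
    cerr << cudaGetErrorString(err) << endl;
    exit(1);
}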

Congratulations, you've just written a program that performs parallel computation on the GPU using multiple threads executing on multiple cores. We'll come back to this example next week and dive into the triple chevron syntax in some more detail.
