January 21, 2011

Our First CUDA C Program

Ok, time to come clean. One reason I started this blog was that I don't do enough recreational programming. You'd think that I'd get most of that out of my system with the amount of CUDA and Tesla related tasks that constitute my normal workday, but you'd be wrong. So, I'm going to use this blog to keep myself honest by posting regular CUDA C programs, tips, and code segments.

To make this a little more interesting, I'm going to assume that we're all starting with zero knowledge of CUDA C. So I'll begin with simple programs that illustrate the basics, and slowly work up to more complicated concepts and features. As I do so I'll try to throw in a little Socratic method and refine things based on the deficiencies of the early examples. Or you could buy a book.

For the first program, let's do something simple. The CUDA model exposes the GPU as an accelerator or co-processor that has its own memory space. So the first task a GPU programmer faces is learning how to allocate space on the device and move data back and forth between host memory and device memory. I'll show the full program first, and then break down the interesting bits.

/*
 * hello_cuda.cu
 *
 * Our first cuda program, resultsovercoffee@gmail.com
 *
 * compile with:  nvcc -o hello_cuda hello_cuda.cu
 */

#include <iostream>
#include <string.h>
#include <stdlib.h>

using namespace std;

int main (int argc, const char *argv[]) {

    //our message
    const char *message = "hello world!";
    size_t size = strlen(message)+1;
     
    //declare and allocate a buffer on the device
    char *d_buffer;
    if (cudaMalloc(&d_buffer,size) != cudaSuccess){
        cerr << cudaGetErrorString(cudaGetLastError()) << endl;
        exit(1);
    }

    //copy our message to the device buffer
    if (cudaMemcpy(d_buffer,message,size,cudaMemcpyHostToDevice)
        != cudaSuccess){
        cerr << cudaGetErrorString(cudaGetLastError()) << endl;
        exit(1);
    }

    //declare and allocate a buffer on the host
    char *h_buffer = (char*)malloc(size);
    if (h_buffer == 0){
        cerr << "malloc failed" << endl;
        exit(1);
    }
    
    //copy the device buffer back to the host
    if (cudaMemcpy(h_buffer,d_buffer,size,cudaMemcpyDeviceToHost)
        != cudaSuccess) {
        cerr << cudaGetErrorString(cudaGetLastError()) << endl;
        exit(1);
    }

    cout << h_buffer << endl;
    cudaFree(d_buffer);
    free(h_buffer);
}


The first thing many of you will notice is that I'm using C++ (or at least some C++). If nothing else, the using namespace std; probably gives it away. In this case I stuck with C allocation and char* arrays rather than strings, but I will generally be using C++ features when they are available.

The first CUDAism is the cudaMalloc() call that is used to allocate space on the device. The syntax is slightly different from regular host-side malloc(): cudaMalloc takes a pointer-to-a-pointer and writes the device address through it, while regular malloc provides the address of the allocated memory as its return value. As a result, cudaMalloc can use its return value to indicate success or failure. Notice that we check for that by comparing it to the value cudaSuccess, and if the call fails we fetch the specific error and print it to standard error. [If you'd like to see this in action, try increasing size to a number larger than the amount of memory on your NVIDIA GPU.]
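Typing that check after every single call gets old fast, so many CUDA programs wrap it in a macro. Here's a minimal sketch of that idiom; the CUDA_CHECK name is my own invention, not part of the toolkit:

#define CUDA_CHECK(call)                                       \
    do {                                                       \
        cudaError_t err = (call);                              \
        if (err != cudaSuccess) {                              \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__, \
                    cudaGetErrorString(err));                  \
            exit(1);                                           \
        }                                                      \
    } while (0)

With that in place (plus #include <stdio.h> for fprintf), the allocation above shrinks to CUDA_CHECK(cudaMalloc(&d_buffer,size)); and every call site reports the file and line where things went wrong.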

The other CUDA C function we are using is cudaMemcpy(). That function takes four arguments: a pointer to the destination, a pointer to the source, the amount of memory to copy, and a direction. In this example, we first copy our message to a buffer on the device, then we copy it back to a new buffer on the host. Also note that I'm prefixing my pointers with h_ and d_ to indicate whether they point to memory on the host or on the device. This isn't required, but it helps prevent you from shooting yourself in the foot later.
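We only need two directions here, but cudaMemcpyKind has four values. Here's a little sketch that exercises all of them; the buffer names are invented for illustration, and the error checks are omitted for brevity:

/*
 * copy_directions.cu -- sketch of the four cudaMemcpyKind values
 */
int main () {
    const size_t size = 13;
    char h_a[13] = "hello world!";    //source buffer on the host
    char h_b[13];                     //destination buffer on the host
    char *d_a, *d_b;
    cudaMalloc(&d_a,size);
    cudaMalloc(&d_b,size);

    cudaMemcpy(d_a,h_a,size,cudaMemcpyHostToDevice);   //host -> device
    cudaMemcpy(d_b,d_a,size,cudaMemcpyDeviceToDevice); //device -> device
    cudaMemcpy(h_b,d_b,size,cudaMemcpyDeviceToHost);   //device -> host
    cudaMemcpy(h_a,h_b,size,cudaMemcpyHostToHost);     //host -> host

    cudaFree(d_a);
    cudaFree(d_b);
}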

Finally, we print the result to standard output so we can see that it survived the round trip intact, and deallocate that which we allocated. That's it. If you've followed this far, you've now got a CUDA C program under your belt.
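If you want to see it run, and assuming you have an NVIDIA GPU and the CUDA toolkit installed, a session should look something like this:

$ nvcc -o hello_cuda hello_cuda.cu
$ ./hello_cuda
hello world!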
