January 31, 2011

Local Boy Does Well

A quick note that the whitepaper version of an article I contributed to garnered mentions from Vizworld and InsideHPC last week.

Thanks to Wes for being the leader and chief cat-herder -- getting all of us on a phone call is a herculean effort under normal circumstances, and it was even more challenging because I was in China working on Nebulae during the critical edits. It was an honor to be involved with a who's who of HPC visualization.

January 30, 2011

Our First CUDA C Kernel

Our first CUDA program covered the basics of moving data to and from the GPU (a.k.a. the device). While that's a necessary skill, data movement by itself is not very compelling. We didn't use any of the GPU's processing power, and there are better ways to create a volatile RAM disk.

For a second effort, let's do something simple: we'll demonstrate how to give each thread we launch on the GPU a unique ID and do some integer addition. First, the code:

/*
 * cuda_enum.cu
 *
 * Simple CUDA Kernel that enumerates by thread index
 *
 * compile with:  nvcc -o cuda_enum cuda_enum.cu
 * 
 * resultsovercoffee@gmail.com
 */

#include <iostream>
#include <unistd.h>

using namespace std;

//function prototype for our CUDA kernel
__global__ void kernel(int *);


int main(int argc, char *argv[]) {
    int n = 16;
    size_t size = n * sizeof(int);

    //declare, allocate, and zero an array on the device
    int *d_array;
    if (cudaMalloc(&d_array,size) != cudaSuccess){
        cerr << cudaGetErrorString(cudaGetLastError()) << endl;
        exit(1);
    }
    if (cudaMemset(d_array,0,size) != cudaSuccess){
        cerr << cudaGetErrorString(cudaGetLastError()) << endl;
        exit(1);
    }

    //launch kernel on device to generate results
    kernel<<<1,n>>>(d_array);

    //declare and allocate an array on the host
    int *h_array = (int*)malloc(size);
    if (h_array == 0){
        cerr << "malloc failed" << endl;
        exit(1);
    }

    //copy the device buffer back to the host
    if (cudaMemcpy(h_array,d_array,size,cudaMemcpyDeviceToHost)
        != cudaSuccess) {
        cerr << cudaGetErrorString(cudaGetLastError()) << endl;
        exit(1);
    }

    //print results
    for (int i=0;i<n;i++) {
        cout << h_array[i] << endl;
    }

    //deallocate arrays
    cudaFree(d_array);
    free(h_array);

    //bye
    exit(0);
}


/*
 * kernel
 * 
 * CUDA kernel -- sets a value of A[] to the thread's X index + 1
 */
__global__ void kernel(int *A) {
    A[threadIdx.x] = 1 + threadIdx.x;
}

This should look somewhat familiar after the previous example. We are still declaring and allocating memory on the device, but instead of character buffers, we are using arrays of integers. The variable n that is set in main controls the length of the arrays. Additionally, we are using a new host function, cudaMemset(), to set the device array to all zeros. This isn't strictly necessary for this example, but will be in the future.
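One detail worth noting: like the standard memset(), cudaMemset() writes a byte value, not an integer value. Setting the array to zero works because an int whose bytes are all zero is 0, but other fill values behave less intuitively. A small illustrative sketch (not part of the program above):

//cudaMemset() fills bytes, not ints -- this would NOT produce an array of 1s
cudaMemset(d_array, 1, size);   //each int becomes 0x01010101 (16843009)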

The really new part is the kernel function. In CUDA C, we refer to the arithmetically dense function that runs on the device as a kernel, so our example kernel is eponymously named. The key part of the prototype (and later definition) is the __global__ type qualifier (note the double-underscore prefix and suffix). __global__ functions are executed on the device, but must be called from the host and must return void.
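To make those constraints concrete, here is a quick sketch. The scale() kernel is hypothetical and not part of this example; it just shows the shape of a legal __global__ function:

//a __global__ function runs on the device, is launched from the host,
//and must return void
__global__ void scale(int *A, int s) {
    A[threadIdx.x] *= s;
}

//__global__ int bad(int *A);   //would not compile: kernels must return void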

Our kernel function is defined at the end of the code. It takes a pointer to an array A. The body of the function adds 1 to threadIdx.x and stores the result in A[threadIdx.x]. Notice that we didn't declare or set threadIdx.x -- it is a built-in variable in CUDA that is only valid while the code is executing on the device. In our example, the first thread gets 0, the second gets 1, and so on. Each thread therefore gets a different index and operates on a different element of A. Because different threads can run on different CUDA cores, the calculations can occur in parallel on the GPU. When run, the program should print the integers 1 through 16, one per line.

The final new thing is the launch of the kernel that occurs near the center of main(). Note that the syntax looks like a standard function call, but includes some additional arguments between triple chevron brackets, <<<1,n>>>. Without going into too much detail, the second argument is the number of threads our example will spawn on the device. Since we also use n to size our array, we will be spawning a thread for each value of A[]. Feel free to experiment with different values of n, but note that values greater than 512 may cause problems. More on that next time.
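One caveat: the triple-chevron launch itself doesn't hand back an error code, so a bad configuration (say, asking for too many threads per block) fails silently unless you ask the runtime. A minimal sketch of how you might check, using the same cudaGetLastError() call the program already uses elsewhere:

//after the launch, ask the runtime whether it succeeded
kernel<<<1,n>>>(d_array);
cudaError_t err = cudaGetLastError();
if (err != cudaSuccess) {
    cerr << "kernel launch failed: " << cudaGetErrorString(err) << endl;
    exit(1);
}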

Congratulations, you've just written a program that performs parallel computation on the GPU using multiple threads executing on multiple cores. We'll come back to this example next week and dive into the triple chevron syntax in some more detail.

January 26, 2011

CUDA in Runlevel 3

Most Linux clusters operate at runlevel 3 instead of runlevel 5. The major difference is that runlevel 5 starts the xdm graphical login system -- something that isn't very useful on cluster nodes, which don't have a display attached. Configuring nodes for runlevel 3 eliminates the overhead of that unused X11 server, which is good for application performance. Unfortunately, the NVIDIA kernel module is usually loaded (and the /dev entries created) when X11 starts, so a node operating at runlevel 3 can't run CUDA applications out of the box.

The fix is an init script that modprobes the NVIDIA driver and creates the necessary /dev files. My script for doing that is below. In addition to loading the driver and creating the device files, it also sets compute-exclusive mode on each card and loops nvidia-smi in the background to keep the driver persistently initialized for faster kernel launches.

#!/bin/bash
#
# /etc/init.d/cuda startup script for nvidia driver
# symlink from /etc/rc3.d/S80cuda in non xdm environments
#
# Creates devices, sets persistent and compute-exclusive mode
# Useful for compute nodes in runlevel 3 w/o X11 running
#
# chkconfig: 345 80 20

# Source function library
. /lib/lsb/init-functions

# Alias RHEL's success and failure functions
success() {
    log_success_msg $@
}
failure() {
    log_failure_msg $@
}

# Create /dev nodes
function createdevs() {
    # Count the number of NVIDIA controllers
    N=`/sbin/lspci -m | /bin/egrep -c '(3D|VGA).+controller.+nVidia'`

    # Create Devices, exit on failure
    while [ ${N} -gt 0 ] 
    do
      let N-=1
      /bin/mknod -m 666 /dev/nvidia${N} c 195 ${N} || exit $?
    done
    /bin/mknod -m 666 /dev/nvidiactl c 195 255 || exit $?
}

# Remove /dev nodes
function removedevs() {
    /bin/rm -f /dev/nvidia*
}

# Set compute-exclusive
function setcomputemode() {
    # Count the number of NVIDIA controllers
    N=`/sbin/lspci -m | /bin/egrep -c '(3D|VGA).+controller.+nVidia'`
    # Set compute-exclusive mode, continue on failures
    while [ $N -gt 0 ]
    do
      let N-=1
      /usr/bin/nvidia-smi -c 1 -g ${N} > /dev/null
    done
}

# Start daemon
function start() {
   echo -n $"Loading nvidia kernel module: "
   /sbin/modprobe nvidia && success || { failure ; exit 1 ;}
   echo -n $"Creating CUDA /dev entries: "
   createdevs && success || { failure ; exit 1 ;}
   echo $"Setting CUDA compute-exclusive mode."
   setcomputemode
   echo $"Starting nvidia-smi for persistence."
   /usr/bin/nvidia-smi -l -i 60 > /dev/null &
}

# Stop daemon
function stop() {
   echo $"Killing nvidia-smi."
   /usr/bin/killall nvidia-smi
   echo -n $"Unloading nvidia kernel module: "
   sleep 1
   /sbin/rmmod -f nvidia && success || failure
   echo -n $"Removing CUDA /dev entries: "
   removedevs && success || failure
}

# See how we were called
case "$1" in
   start)
       start
      ;;
   stop)
       stop
      ;;
   restart)
       stop
       start
      ;;
   *)
       echo $"Usage: $0 {start|stop|restart}"
       exit 1
esac
exit 0

This script is fairly well-tested on RHEL/CentOS systems, and should work on other distros with little or no modification.
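If you want to confirm that the driver and /dev entries are usable without X11 after the script runs, a tiny device query is enough. This is just a hypothetical sanity check, not part of the init script:

/*
 * check_cuda.cu -- minimal sanity check for headless nodes
 *
 * compile with:  nvcc -o check_cuda check_cuda.cu
 */

#include <iostream>

using namespace std;

int main() {
    //ask the runtime how many CUDA devices it can see
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess) {
        cerr << cudaGetErrorString(cudaGetLastError()) << endl;
        return 1;
    }
    cout << "CUDA devices found: " << count << endl;
    return 0;
}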

January 24, 2011

Disruptive Technology


disrupt:
 to interrupt the normal course or unity of


With somewhat regular frequency, I get into discussions about whether GPUs are a "disruptive technology" in HPC (I hang out with nerds). Often this is in response to articles like this. As someone who now covers HPC for NVIDIA, but who worked at LLNL through most of the last decade, I have some decidedly strong opinions on this topic.

The disruptive part is the upset in the top500 performance curve. If you plot the up-and-to-the-right performance of the top500 list, IBM BlueGene/L sat above the curve as it existed at the time -- same outlay of cash/power/time, lots more performance. Because it was the first massively parallel processing (MPP) machine based on embedded processors, it was assumed that subsequent improvements in PPC (or other) embedded processors would provide a series of increasingly faster MPP systems. This has been mostly true, though the follow-on machines have not been large enough to capture the number 1 position.

Going back further in time, I don't think I ever heard the Earth Simulator referred to as disruptive technology. Though it captured and held the "world's fastest computer" title by a larger margin and for a longer time than BG/L would, most viewed it as the last gasp of vector technology. Distributed-memory clusters like the IBM SP had already won the mind-share at most sites, and though the Earth Simulator was able to remain at #1 for over two years, it was pretty clear that its dominance would come to an end. More vector machines were built, but HPC codes continued to move from vector to distributed memory.

Getting back to the question, with BlueGene/Q poised to re-take the top500 crown, are GPUs a temporary disturbance of the embedded MPP march to dominance, or a permanent shift?

My answer is that disruption isn't a zero-sum game, and embedded MPP doesn't have to fail for GPUs to succeed. GPUs provide the same "same slope, higher intercept" disruption of the top500 curve that embedded MPP did. Like embedded MPP, GPU computing will continue to see performance improvements from advances in the underlying technology for years to come. And like embedded MPP, GPUs are leveraging a robust consumer market to achieve those advancements.

Where GPUs hold a significant advantage is the barrier to entry. There are millions of CUDA-capable GPUs in systems today, and hundreds of universities teaching CUDA. Moreover, GPU clusters can be constructed entirely of commodity off-the-shelf hardware and software, putting the cost within reach of individual researchers and small teams. It's instructive to remember that Nebulae and Tianhe-1A were designed and built in months, while most of the embedded MPP designs have taken years to go from PowerPoint to power-on.

There is room for both. In the upper range of the top500, embedded MPP will continue to leverage specialized networks and OS stacks to achieve performance through high node counts. At the same time, GPUs are already delivering systems with over 1 TF of peak double-precision performance per node on everything from desktops to petaflop clusters. Both have been disruptive, and either could grow to dominate HPC in the years to come. I've adjusted my career path accordingly for the next five years or so.

January 21, 2011

Our First CUDA C Program

Ok, time to come clean. One reason I started this blog was that I don't do enough recreational programming. You'd think that I'd get most of that out of my system with the amount of CUDA- and Tesla-related tasks that constitute my normal workday, but you'd be wrong. So I'm going to use this blog to keep myself honest by posting regular CUDA C programs, tips, and code segments.

To make this a little more interesting, I'm going to assume that we're all starting with zero knowledge of CUDA C. So I'll begin with simple programs that illustrate the basics, and slowly work up to more complicated concepts and features. As I do so, I'll try to throw in a little Socratic method and refine things based on the deficiencies of the early examples. Or you could buy a book.

For the first program, let's do something simple. The CUDA model exposes the GPU as an accelerator or co-processor with its own memory space. So the first task a GPU programmer faces is learning how to allocate space on the device and move data back and forth between host memory and device memory. I'll show the full program first, and then break down the interesting bits.

/*
 * hello_cuda.cu
 *
 * Our first cuda program, resultsovercoffee@gmail.com
 *
 * compile with:  nvcc -o hello_cuda hello_cuda.cu
 */

#include <iostream>
#include <string.h>
#include <unistd.h>

using namespace std;

int main (int argc, const char *argv[]) {

    //our message
    const char *message = "hello world!";
    size_t size = strlen(message)+1;
     
    //declare and allocate a buffer on the device
    char *d_buffer;
    if (cudaMalloc(&d_buffer,size) != cudaSuccess){
        cerr << cudaGetErrorString(cudaGetLastError()) << endl;
        exit(1);
    }

    //copy our message to the device buffer
    if (cudaMemcpy(d_buffer,message,size,cudaMemcpyHostToDevice)
        != cudaSuccess){
        cerr << cudaGetErrorString(cudaGetLastError()) << endl;
        exit(1);
    }

    //declare and allocate a buffer on the host
    char *h_buffer = (char*)malloc(size);
    if (h_buffer == 0){
        cerr << "malloc failed" << endl;
        exit(1);
    }
    
    //copy the device buffer back to the host
    if (cudaMemcpy(h_buffer,d_buffer,size,cudaMemcpyDeviceToHost)
        != cudaSuccess) {
        cerr << cudaGetErrorString(cudaGetLastError()) << endl;
        exit(1);
    }

    cout << h_buffer << endl;
    cudaFree(d_buffer);
    free(h_buffer);
}


The first thing many of you will notice is that I'm using C++ (or at least some C++). If nothing else, the using namespace std; probably gives it away. In this case I stuck with C allocation and char* arrays rather than strings, but I will generally be using C++ features when they are available.

The first CUDAism is the cudaMalloc() call that is used to allocate space on the device. The syntax is slightly different from regular host-side malloc(): cudaMalloc takes a pointer-to-a-pointer and writes the address through it, while regular malloc provides the address of the allocated memory as its return value. As a result, cudaMalloc can use its return value to indicate success or failure. Notice that we check for that by comparing the return value to cudaSuccess, and if the call fails we retrieve the specific error and print it to standard error. [If you'd like to see this in action, try increasing size to a number larger than the amount of memory on your NVIDIA GPU.]
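To make the difference in calling conventions explicit, here is a side-by-side sketch (h_ptr and d_ptr are illustrative names, not from the program above):

//host allocation: the address comes back as the return value
char *h_ptr = (char*)malloc(size);

//device allocation: the address is written through the pointer argument,
//freeing the return value to report success or failure
char *d_ptr;
cudaError_t err = cudaMalloc(&d_ptr, size);
if (err != cudaSuccess) {
    cerr << cudaGetErrorString(err) << endl;
}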

The other CUDA C function we are using is cudaMemcpy(). That function takes four arguments: a pointer to the destination, a pointer to the source, the amount of memory to copy, and a direction. In this example, we first copy our message to a buffer on the device, then copy it back to a new buffer on the host. Also note that I'm prefixing my pointers with h_ and d_ to indicate whether they point to memory on the host or the device. This isn't required, but it helps prevent you from shooting yourself in the foot later.
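Note that the argument order mirrors the standard memcpy(): destination first, then source, with the direction flag telling the runtime which side each pointer lives on. Pulling the program's two calls out side by side:

//host -> device: destination is the device buffer
cudaMemcpy(d_buffer, message, size, cudaMemcpyHostToDevice);

//device -> host: destination is the host buffer
cudaMemcpy(h_buffer, d_buffer, size, cudaMemcpyDeviceToHost);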

Finally, we print the result to standard output so that we can visually see that it survived intact and deallocate that which we allocated. That's it. If you've followed this far, you've now got a CUDA C program under your belt.

January 17, 2011

First Post: 2010 in Review

My name is Dale Southard. I'm a senior solution architect with NVIDIA, where I cover primarily high-performance computing and cloud computing. Results Over Coffee (ROC) is my blog, and I'll be posting on HPC and Cloud as well as some how-to information on programming GPUs, integrating them in Linux clusters, and whatever else I can cover over a cup of coffee. When I'm posting here, I'm speaking for me, not NVIDIA (even if I am posting about NVIDIA hardware or software).

I've often compared my work at NVIDIA to being a professional debugger -- I spend a lot of time working with very new hardware and software in new and complicated deployments. It's a lot of work, a lot of fun, and I feel lucky to be doing it for NVIDIA. In 2010, my personal high points were:
I have long viewed GPU computing as a disruptive technology for high-performance computing, and I am excited to have a ringside seat while it unfolds. ROC will be my place to share thoughts, tips, and hopefully have some fun.

So, welcome to ROC. There's more to come...