February 18, 2011

Snake Oil 2: Decoupling Threads and Array Size

Jumping off from last week's snake oil encryption example, this week we will focus on making our example CUDA C program a little more user-friendly. We will also demonstrate one method for decoupling the number of threads from the amount of work. As usual, first the code:

/*
 * snake_oil2.cu
 *
 * CUDA program to do bad encryption
 * 
 * compile with:  nvcc -o snake_oil2 snake_oil2.cu
 * 
 * resultsovercoffee@gmail.com
 */

#include <iostream>
#include <string>
#include <new>
#include <cstdlib>

using namespace std;

// function prototype for our CUDA kernel
__global__ void badcrypto(char *, char *, char *, int, int);


int main(int argc, char *argv[]) {
    string key("password");
    int keysize = key.size()+ 1;

    //read from stdin to a buffer (max of 512 bytes)
    const int max = 512;
    char *h_in = new (nothrow)char[max];
    if (h_in == 0) {
        cerr << "host allocation failed" << endl;
        exit(EXIT_FAILURE);
    }   
    cin.read(h_in,max);
    int size = cin.gcount();
    
    //declare, allocate, and copy the input to the device
    char *d_in;
    if (cudaMalloc(&d_in,size) != cudaSuccess){
        cerr << cudaGetErrorString(cudaGetLastError()) << endl;
        exit(EXIT_FAILURE);
    }
    if (cudaMemcpy(d_in,h_in,size,cudaMemcpyHostToDevice)
        != cudaSuccess) {
        cerr << cudaGetErrorString(cudaGetLastError()) << endl;
        exit(EXIT_FAILURE);
    }
    
    //declare, allocate, and copy the key to the device 
    char *d_key;
    if (cudaMalloc(&d_key,keysize) != cudaSuccess){
        cerr << cudaGetErrorString(cudaGetLastError()) << endl;
        exit(EXIT_FAILURE);
    }
    if (cudaMemcpy(d_key,key.c_str(),keysize,cudaMemcpyHostToDevice)
        != cudaSuccess) {
        cerr << cudaGetErrorString(cudaGetLastError()) << endl;
        exit(EXIT_FAILURE);
    }
    
    // declare, allocate and zero output array on device
    char *d_out;
    if (cudaMalloc(&d_out,size) != cudaSuccess){
        cerr << cudaGetErrorString(cudaGetLastError()) << endl;
        exit(EXIT_FAILURE);
    }
    if (cudaMemset(d_out,0,size) != cudaSuccess){
        cerr << cudaGetErrorString(cudaGetLastError()) << endl;
        exit(EXIT_FAILURE);
    }
    
    //launch kernel on device to generate results
    dim3 grid(1,1);
    dim3 block(512);    
    badcrypto<<<grid,block>>>(d_in,d_out,d_key,size,keysize);

    //declare and allocate an output array on the host
    char *h_out = new (nothrow)char[size];
    if (h_out == 0) {
        cerr << "host allocation failed" << endl;
        exit(EXIT_FAILURE);
    }   

    //copy the device buffer back to the host
    if (cudaMemcpy(h_out,d_out,size,cudaMemcpyDeviceToHost)
        != cudaSuccess) {
        cerr << cudaGetErrorString(cudaGetLastError()) << endl;
        exit(EXIT_FAILURE);
    }

    //print results
    for (int i=0;i<size;i++) {
        cout << (int)h_out[i] << " " ;
    }
    cout << endl;

    //deallocate arrays
    cudaFree(d_in);
    cudaFree(d_key);
    cudaFree(d_out);
    delete[] h_in;
    delete[] h_out;

    //bye
    exit(EXIT_SUCCESS);
}


/*
 * badcrypto
 *
 * kernel that "encrypts" data using xor.  Don't do this.
 */
__global__ void badcrypto(char *in, char *out, char *key,
                          int in_length, int key_length) {
    if (threadIdx.x < in_length)
        out[threadIdx.x] = in[threadIdx.x] ^ key[threadIdx.x % key_length];
}


First, let's cover the minor modifications. Rather than using fixed input, we are now reading the cleartext from stdin into a host-side buffer. For simplicity, we've limited the buffer to 512 bytes for the time being. After the read, we set size to the number of bytes actually read (via cin.gcount()). This modification allows us to use snake_oil2 on the command line like this:

$ echo -n cleartxt | ./snake_oil2

But note that the output is still printed as integers for now.

The other modifications are in the CUDA C kernel. Now that we are accepting input from stdin, we can no longer assume that the cleartext and the key are the same length. To handle this, we pass in the length of the key and use the modulus operator (%) to find the key index appropriate for the current position in the cleartext. The result is that our cleartext and key can be different lengths -- the key is simply repeated to encrypt the cleartext.

The more important change is how we have dealt with the variable length of the cleartext. Rather than launching one CUDA thread per input character, we are launching a fixed number of CUDA threads (512), but predicating the encryption operation. We pass in the length of the input cleartext, and each thread checks if it has a thread index less than that length. Threads with an index that would fall beyond the end of the cleartext fail the if statement, and thus do not perform any operation. This decouples the thread count from the array size.

The decoupling is possible because, unlike SIMD (single instruction, multiple data) vector paradigms, CUDA C is SIMT (single instruction, multiple threads). These threads have some autonomy and (within some practical limits) can support diverging code paths. In this case, we have predicated the encryption operation to make the threadblock behave like an unrolled while() loop.
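
For comparison, here is roughly what that predicated thread block computes, written as an ordinary host-side loop. This is only a sketch for illustration (the function name badcrypto_host is mine, not part of the program above); each loop iteration plays the role of one CUDA thread, and iterations past in_length do nothing, just like the predicated threads.

// host-side equivalent of the predicated 512-thread launch (sketch)
void badcrypto_host(const char *in, char *out, const char *key,
                    int in_length, int key_length) {
    for (int tid = 0; tid < 512; tid++) {   // one iteration per "thread"
        if (tid < in_length)                // same predicate as the kernel
            out[tid] = in[tid] ^ key[tid % key_length];
    }
}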

Obviously, this example is still pretty unusable. We're limited to a maximum of 512 bytes of input. We also have many threads that are not contributing to the calculation, and we're only using 32 CUDA cores. Unfortunately, my 20 minutes is up, so I'll have to address those issues next week.

February 16, 2011

CUDA_VISIBLE_DEVICES

On systems with more than one GPU, it's useful to be able to select which device(s) you want to use for running CUDA apps. The CUDA APIs will select a GPU as the default, so unless you specify differently, all your CUDA applications will run on the same GPU. Setting compute-exclusive mode doesn't change this behavior -- all the programs will still target the same default GPU, but at least the additional ones will fail quickly rather than consuming resources that might be required by the first program.

One solution is $CUDA_VISIBLE_DEVICES. The CUDA_VISIBLE_DEVICES environment variable holds a comma-separated list of the devices that should be visible to CUDA applications. For example, I've equipped my desktop with two Tesla cards and a Quadro card. I can use the deviceQuery program from the CUDA SDK and a little grep magic to list them:

$ ./deviceQuery -noprompt | egrep "^Device"
Device 0: "Tesla C2050"
Device 1: "Tesla C1060"
Device 2: "Quadro FX 3800"

By setting the envar, I can make only a subset of them visible to the runtime:

$ export CUDA_VISIBLE_DEVICES="0,2"
$ ./deviceQuery -noprompt | egrep "^Device"
Device 0: "Tesla C2050"
Device 1: "Quadro FX 3800"

Note that I didn't change the deviceQuery options at all, just the setting of $CUDA_VISIBLE_DEVICES. Also note that the GPUs are still enumerated sequentially from zero, but only the cards listed in the visible devices envar are exposed to the CUDA app.

This is useful in a couple situations. The first is the example I used above. My development workstation often has a mix of CUDA-capable devices in it. I generally need to target a specific model of card for testing, and the easiest way to do so is $CUDA_VISIBLE_DEVICES (especially with bash where it can be set per-invocation by prefixing the command).
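
For example, with the three cards listed above, prefixing a single command restricts just that run to the C1060 without disturbing the rest of the shell environment. Given the enumeration rules described earlier, I'd expect output like this:

$ CUDA_VISIBLE_DEVICES=1 ./deviceQuery -noprompt | egrep "^Device"
Device 0: "Tesla C1060"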

The other case is clusters where nodes might be time-shared (multiple jobs running on the same node at the same time) but multiple GPUs on those nodes should be space-shared (one job per GPU). By setting $CUDA_VISIBLE_DEVICES in the prologue script, the batch system can route jobs to the right GPU without requiring the user to set additional command-line flags or configuration files. Of course, that requires the batch scheduler to support treating GPUs as independent consumables, but several common batch scheduling and resource management systems have already added that capability.
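
As a sketch of the idea only: the prologue ultimately boils down to an export like the one below. The $ASSIGNED_GPU variable is purely hypothetical -- every batch system has its own way of communicating which GPU it reserved for the job, and that part is site-specific.

# fragment of a batch prologue script (sketch)
# $ASSIGNED_GPU stands in for however your scheduler reports
# the GPU index it reserved for this job
export CUDA_VISIBLE_DEVICES="${ASSIGNED_GPU}"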

CUDA_VISIBLE_DEVICES is a simple feature, but it's a good technique to have in your pocket if you work on multiple GPU systems.

February 14, 2011

More Shameless Self-Aggrandizement

Quick note that a certain journal article has been covered by both International Science Grid This Week and HPCwire.

The HPCwire article is especially interesting as it calls us out for asking the tough questions. So be it -- HPC requires a balanced environment for productive science. You can't print a hundred terabyte dataset, nor can you build a hundred trillion pixel power wall. So if you need to build a machine that generates large datasets, you also need to plan for a post-processing environment capable of extracting the science. The exact mix of hardware, software, and skinny guys will depend on the site, but it's not something that good HPC sites ignore.

And yes, I realize that those questions are substantially less tough for me at a dot com, as compared to the co-authors who are at various dot edu and dot gov sites. Of course, that mix of employers also indicates that a lot of sites are thinking about this and "doing it right".

February 11, 2011

CUDA Project: Snake Oil Encryption

Now that we have a few simple CUDA programs under our belt, I'd like to tackle an example project for a few weeks. Ideally this is something that can scale to lots of cores, though I think it will be OK if it lacks the arithmetic intensity that most HPC apps have. I also want to do something a little silly....

I've been a fan of Bruce Schneier for well over a decade. That includes following his blog (and before that the Crypto-Gram newsletter), as well as reading many of his books. One theme that he has returned to again and again is that there are a lot of bad security products. In particular, one segment of Applied Cryptography always stands out in my mind:
    The simple-XOR algorithm is really an embarrassment; it's nothing
    more than a Vigenère polyalphabetic cipher. It's here only because
    of its prevalence in commercial software packages.

    -- Applied Cryptography, p. 14

And now we have our first CUDA project! Over the next few weeks, we'll use the massively parallel single-instruction, multiple-thread streaming multiprocessors in the NVIDIA GPU to implement one of the worst encryption algorithms since ROT13. With luck, this will provide the world with a fresh supply of snake oil products for Bruce to induct into the doghouse. I'm excited.

As usual, let's start with the code:

/*
 * snake_oil1.cu
 *
 * CUDA program to do bad encryption
 * 
 * compile with:  nvcc -o snake_oil1 snake_oil1.cu
 * 
 * resultsovercoffee@gmail.com
 */

#include <iostream>
#include <string>
#include <new>
#include <cstdlib>

using namespace std;

// function prototype for our CUDA kernel
__global__ void badcrypto(char *, char *, char *);


int main(int argc, char *argv[]) {
    string in("cleartxt");
    string key("password");
    int size = key.size();
    dim3 grid(1,1);
    dim3 block(size);

    //declare, allocate, and populate input & key on the device
    char *d_in;
    if (cudaMalloc(&d_in,size) != cudaSuccess){
        cerr << cudaGetErrorString(cudaGetLastError()) << endl;
        exit(EXIT_FAILURE);
    }
    if (cudaMemcpy(d_in,in.c_str(),size,cudaMemcpyHostToDevice)
        != cudaSuccess) {
        cerr << cudaGetErrorString(cudaGetLastError()) << endl;
        exit(EXIT_FAILURE);
    }
    char *d_key;
    if (cudaMalloc(&d_key,size) != cudaSuccess){
        cerr << cudaGetErrorString(cudaGetLastError()) << endl;
        exit(EXIT_FAILURE);
    }
    if (cudaMemcpy(d_key,key.c_str(),size,cudaMemcpyHostToDevice)
        != cudaSuccess) {
        cerr << cudaGetErrorString(cudaGetLastError()) << endl;
        exit(EXIT_FAILURE);
    }
    
    // declare, allocate and zero output array on device
    char *d_out;
    if (cudaMalloc(&d_out,size) != cudaSuccess){
        cerr << cudaGetErrorString(cudaGetLastError()) << endl;
        exit(EXIT_FAILURE);
    }
    if (cudaMemset(d_out,0,size) != cudaSuccess){
        cerr << cudaGetErrorString(cudaGetLastError()) << endl;
        exit(EXIT_FAILURE);
    }
    
    //launch kernel on device to generate results
    badcrypto<<<grid,block>>>(d_in, d_out, d_key);

    //declare and allocate an output array on the host
    char *h_out = new (nothrow)char[size];
    if (h_out == 0) {
        cerr << "host allocation failed" << endl;
        exit(EXIT_FAILURE);
    }   

    //copy the device buffer back to the host
    if (cudaMemcpy(h_out,d_out,size,cudaMemcpyDeviceToHost)
        != cudaSuccess) {
        cerr << cudaGetErrorString(cudaGetLastError()) << endl;
        exit(EXIT_FAILURE);
    }

    //print results
    for (int i=0;i<size;i++) {
        cout << (int)h_out[i] << " " ;
    }
    cout << endl;

    //deallocate arrays
    cudaFree(d_in);
    cudaFree(d_key);
    cudaFree(d_out);
    delete[] h_out;

    //bye
    exit(EXIT_SUCCESS);
}


/*
 * badcrypto
 *
 * kernel that "encrypts" data using xor.  Don't do this.
 */
__global__ void badcrypto(char *in, char *out, char *key) {
    out[threadIdx.x] = in[threadIdx.x] ^ key[threadIdx.x];
}

Wow, so much fail there. This should look vaguely like last week's exercise. We're defining two strings (the cleartext input in and the key), copying them to the device, doing a bitwise XOR, then copying the results back. Currently, this isn't going to be very useful to purveyors of bad software -- the strings are statically defined, and the kernel is launched as a single thread block (limiting us to 512 threads on some GPUs). Additionally, the cleartext and key need to be the same length -- clearly not acceptable, since without the ability to encrypt large amounts of data with a short key, an ISV might accidentally implement a one-time pad.
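
If you want to sanity-check the device output, the same XOR is trivial on the CPU. Here's a quick host-only reference (a sketch using the same fixed strings; there's nothing CUDA about it, and the file name is arbitrary) that should print the same integers as snake_oil1:

/*
 * xor_ref.cpp -- host-side reference for snake_oil1 (sketch)
 *
 * compile with:  g++ -o xor_ref xor_ref.cpp
 */
#include <iostream>
#include <string>

using namespace std;

int main() {
    string in("cleartxt");
    string key("password");
    for (size_t i = 0; i < key.size(); i++) {
        // char ^ char promotes to int, so this prints integers
        cout << (in[i] ^ key[i]) << " ";
    }
    cout << endl;
    return 0;
}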

That's enough of a start. My plan for the next few weeks is to first turn this into a usable app, then do some light optimization to better map it to the GPU hardware. Given the lightweight nature of bitwise XOR, we probably won't see cosmic gains in performance, but it should be fun anyway.

February 9, 2011

Errata and What's in a Name

A little behind the scenes at ROC....

I really do write these entries over coffee. My self-imposed time limit for blogging is about 20 minutes, so I can usually queue up a couple of short articles or part of a code example before I need to go back to earning a living. The short time limit ensures that I can write throughout the week, but it also means splitting the longer posts and code examples across writing sessions. I think this works OK for the topics I'm posting on -- with programming there's no urgency to stay current the way a political blog must (poor Andrew Sullivan).

As a result of the 20-and-out policy, my (lack of) proofreading may allow errors to slip through. For minor typos and thinkos, I will just make corrections to the published articles silently. If I need to make a major change to a code example, I'll make a note of it on the post for those who have already cut-and-pasted the broken version. If I somehow manage to post a non-working code example or tip, I will self-flagellate with a length of cat5 cable.

While I'm on this topic, the name Results Over Coffee was inspired by a giveaway. The 2010 GPU Technology Conference gave away coffee mugs featuring the phrase "Results over coffee, not overnite". Unlike most conference tchotchkes, the GTC mugs are pretty nice (insulated ceramic), and I often use mine when working in my home office. I happened to be drinking from it on the same day I was figuring out what to call the blog I'd be writing 20 minutes at a time. The coincidence seemed too good to ignore, so I borrowed part of the phrase for a blog title.

A trip to register.com later, here we are. Still on my things to do list is blogging some coffee reviews. As a preview, I'm doing this entry over a cup of Barefoot Coffee's Michicoy, which is absolutely wonderful out of a press pot. Life is good.

February 7, 2011

To the Cloud or Something

I cover both high performance computing and cloud computing for NVIDIA. I have to confess that, like most HPC people, I hated "cloud" as an adjective. It seemed to be a pointless renaming of previously existing technologies. I now use it several times a day.

In the absolute sense, there are some pretty comprehensive definitions of cloud computing. The various types of cloud computing are well specified, as is the mix of technologies required. That combination of technologies differs from other models. Operating a service in Amazon's EC2 is significantly different from standing up an IMAP server at a small campus, even if both ultimately involve computers and networks.

It also means something in the relative sense. CTOs and CIOs understand what "moving a service to the cloud" refers to, even if they can't repeat the NIST or Wikipedia definitions from memory. This is similar to "high performance computing", which means something specific to most of us in HPC, even if it looks like "servers connected to a network" to non-HPC folks.

Now that that's out of the way, why not try it? Amazon's Elastic Compute Cloud is available to anyone with an Internet connection and a credit card. I was heavily involved in getting EC2 enabled with GPUs, and the resulting service works more smoothly than I could have imagined when the project began. It also provides nearly bare-metal performance.

I gave a talk on getting started with EC2 at Supercomputing 2010. The process is easy enough that anyone with 30 minutes and the previously mentioned Internet connection and credit card should be able to do it. If you're too impatient to sit through the slide deck and/or video, the summary is:

  1. Create an AWS Account at http://aws.amazon.com/
  2. Sign into the EC2 console at http://aws.amazon.com/console
  3. Create a Key Pair for ssh
  4. Add SSH to a Security Group to open the firewall
  5. Launch (instantiate) a GPU-ready image on a cg1.4xlarge instance
  6. Enjoy root access to a 1TF system

The final point is not a typo. The Amazon GPU offering is a system with two Tesla M2050 GPUs, each of which provides 515 gigaflops of double-precision computing performance. Currently, Amazon is charging $2.10/hour for a system that would have easily placed in the top 50 of the world's computer systems ten years ago. It's really an amazing amount of double-precision computing power, and I've already seen some interesting uses.

It's also a good technique to have in your back pocket if you're reading this blog to follow the CUDA programming examples. My test machines generally have one or two Tesla cards, which have many features that the GeForce consumer cards lack (higher double-precision floating point performance, dual DMA engines, ECC, more memory, etc). My examples will usually work on GeForce, but EC2 provides a way to get access to higher performance Tesla devices when it is required.

And you don't have to call it cloud computing if you don't want to.

February 4, 2011

CUDA Blocks and Grids

In our previous CUDA example, we wrote a kernel that did some math based on thread number. I was specifically vague on the details of the triple-chevron launch syntax, with a promise to cover it next time. Next time is today.

Recall that we launched our kernel with a line like this one:
kernel<<<1,n>>>(d_array);
where n was the number of threads that would be spawned on the device, along with a caution that values larger than 512 might cause problems. The good news is that we are certainly not limited to 512 threads, and the launch syntax has some features that make mapping threads to larger problems much easier. First the code:
/*
 * cuda_enum2.cu
 *
 * CUDA kernel that enumerates by block and thread indices
 * 
 * compile with:  nvcc -o cuda_enum2 cuda_enum2.cu
 * 
 * resultsovercoffee@gmail.com
 */

#include <iostream>
#include <unistd.h>

using namespace std;

//function prototype for our CUDA kernel
__global__ void enumerate(int *);


int main(int argc, char *argv[]) {
    dim3 grid(5,5);
    dim3 block(3,3,3);
    int n = grid.x * grid.y * block.x * block.y * block.z;
    size_t size = n * sizeof(int);

    //declare, allocate, and zero an array on the device
    int *d_array;
    if (cudaMalloc(&d_array,size) != cudaSuccess){
        cerr << cudaGetErrorString(cudaGetLastError()) << endl;
        exit(1);
    }
    if (cudaMemset(d_array,0,size) != cudaSuccess){
        cerr << cudaGetErrorString(cudaGetLastError()) << endl;
        exit(1);
    }

    //launch kernel on device to generate results
    enumerate<<<grid,block>>>(d_array);

    //declare and allocate an array on the host
    int *h_array = (int*)malloc(size);
    if (h_array == 0){
        cerr << "malloc failed" << endl;
        exit(1);
    }

    //copy the device buffer back to the host
    if (cudaMemcpy(h_array,d_array,size,cudaMemcpyDeviceToHost)
        != cudaSuccess) {
        cerr << cudaGetErrorString(cudaGetLastError()) << endl;
        exit(1);
    }

    //print results
    for (int i=0;i<n;i++) {
        cout << h_array[i] << " ";
    }
    cout << endl;

    //deallocate arrays
    cudaFree(d_array);
    free(h_array);

    //bye
    exit(0);
}


/*
 * enumerate
 *
 * kernel that computes a unique thread number based on the grid and
 * block indices and stores it in the A[]
 */
__global__ void enumerate(int *A) {
    int value =
          blockIdx.y  * gridDim.x  * blockDim.z * blockDim.y * blockDim.x
        + blockIdx.x  * blockDim.z * blockDim.y * blockDim.x
        + threadIdx.z * blockDim.y * blockDim.x
        + threadIdx.y * blockDim.x
        + threadIdx.x;
    A[value] = value;
}

Again, this should look pretty similar to our previous CUDA example. We've renamed the kernel to something more descriptive, changed the launch syntax a little, and rewritten the kernel body.

First, the launch syntax. Rather than a couple of scalar numbers, we are now using vectors of type dim3. As you might suspect, the dim3 structure is a vector with three values. When we construct one, any values not specified are set to one. That means that "dim3 grid(5,5);" creates a vector with three values, (5,5,1).

Additionally, you can see that the launch syntax takes two arguments: the grid dimensions and the block dimensions. A thread block is a group of related threads, and a block can have up to three dimensions. With Fermi, the maximum block size is 1024 threads, and the maximum dimensions are 1024 x 1024 x 64. Previous GPUs have lower limits, so you might be restricted to 512 threads and 512 x 512 x 64 on a GT200-based system. Threads in the same block are scheduled on the same streaming multiprocessor and can communicate through shared memory.

That many threads sounds like a lot, but it isn't much for a GPU with 448 cores. That's where grids come in. A grid launches multiple thread blocks. With Fermi, grids are two-dimensional and can be up to 65535 x 65535. Since each grid entry is a block of threads, the total number of threads can get large very quickly. Luckily, GPUs have hardware schedulers and can easily handle many thousands of threads. That's important, because GPUs use threading to hide memory access latency rather than caching the way a CPU does. So on GPUs you always want to launch more threads than cores.
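
If you're not sure what limits your GPU supports, you can query them at runtime instead of hard-coding them. Here's a minimal sketch (the file name is arbitrary, and error checking is omitted for brevity) that prints the block and grid limits for device 0 using cudaGetDeviceProperties:

/*
 * query_limits.cu -- print block and grid limits for device 0 (sketch)
 *
 * compile with:  nvcc -o query_limits query_limits.cu
 */
#include <iostream>

using namespace std;

int main(int argc, char *argv[]) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    cout << prop.name << endl;
    cout << "max threads per block: " << prop.maxThreadsPerBlock << endl;
    cout << "max block dimensions:  " << prop.maxThreadsDim[0] << " x "
         << prop.maxThreadsDim[1] << " x " << prop.maxThreadsDim[2] << endl;
    cout << "max grid dimensions:   " << prop.maxGridSize[0] << " x "
         << prop.maxGridSize[1] << " x " << prop.maxGridSize[2] << endl;
    return 0;
}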

In the code above, we are launching a 5 x 5 grid of 3 x 3 x 3 thread blocks. That's 675 threads being launched in parallel on the GPU. If you need a picture, imagine a 5 x 5 checkerboard where each square contains a 3 x 3 x 3 cube of threads.

The other thing we changed is the CUDA kernel itself. Our enumerate kernel calculates a thread number based on the block and grid dimensions and the thread's index. As a result, the number is unique per thread, and we calculated it so that threads access the array in a coalesced fashion (neighboring threads in a block store to neighboring locations in the array). For this launch, every value from 0 through 674 is produced exactly once, so the program simply prints the integers 0 through 674 in order.

My coffee is empty, so that's enough explanation for today.

February 2, 2011

I Hate RC Scripts

Having worked at Unix/Linux sites for the better part of two decades, I've found that users are amazingly proficient at catastrophically mangling their RC files (.profile, .cshrc, .login, .bashrc, and others). In fact, it's such a problem that I wrote an article on how to deploy environment modules to deal with it.

For sites that don't use modules, below are a pair of scripts that can be placed in /etc/profile.d on Linux. They will correctly configure the environment for most shells so that CUDA C just works. They've been used at several sites under RHEL/CentOS, and they will probably work on other distros with little or no modification.

# cuda.sh -- CUDA initialization script (sh/bash)
# resultsovercoffee@gmail.com

# edit this line and place script in /etc/profile.d
cudapath="/usr/local/cuda"

# set LD_LIBRARY_PATH
if  [[ ${LD_LIBRARY_PATH} != *${cudapath}/lib64* ]] 
then
  export LD_LIBRARY_PATH=${cudapath}/lib64:${cudapath}/lib:${LD_LIBRARY_PATH}
fi

# set PATH
if [[ ${PATH} != *${cudapath}/bin* ]]
then
  export PATH=${cudapath}/bin:${PATH}
fi

# cuda.csh -- CUDA initialization script (csh)
# resultsovercoffee@gmail.com

# edit this line and place script in /etc/profile.d
set cudapath="/usr/local/cuda"

# set LD_LIBRARY_PATH
if ( "${?LD_LIBRARY_PATH}" == "0" ) then
  setenv LD_LIBRARY_PATH ${cudapath}/lib64:${cudapath}/lib
else if ( "${LD_LIBRARY_PATH}"  !~ "*${cudapath}/lib64*" ) then
  setenv LD_LIBRARY_PATH ${cudapath}/lib64:${cudapath}/lib:${LD_LIBRARY_PATH}
endif

# set path
if ( "${path}" !~ "*${cudapath}/bin*" ) then
  set path = ( ${cudapath}/bin $path )
endif
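
After logging in with a fresh shell, a quick sanity check (assuming the toolkit really is installed under /usr/local/cuda, as the cudapath line above expects) is just:

$ which nvcc
/usr/local/cuda/bin/nvcc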

You can use the time not spent on helpdesk calls to read a book on CUDA.