February 11, 2011

CUDA Project: Snake Oil Encryption

Now that we have a few simple CUDA programs under our belt, I'd like to tackle an example project for a few weeks. Ideally this is something that can scale to lots of cores, though I think it will be OK if it lacks the arithmetic intensity that most HPC apps have. I also want to do something a little silly....

I've been a fan of Bruce Schneier for well over a decade. That includes following his blog (and before that the Crypto-Gram newsletter), as well as reading many of his books. One theme that he has returned to again and again is that there are a lot of bad security products. In particular, one segment of Applied Cryptography always stands out in my mind:
The simple-XOR algorithm is really an embarrassment; it's nothing more than a Vigenère polyalphabetic cipher. It's here only because of its prevalence in commercial software packages.

-- Applied Cryptography, p14
And now we have our first CUDA project! Over the next few weeks, we'll use the massively parallel single-instruction multiple-thread streaming multiprocessors in the NVIDA GPU to implement one of the worst encryption algorithms since ROT13. With luck, this will provide the world with a fresh supply of snake oil products for Bruce to induct into the doghouse. I'm excited.

As usual, let's start with the code:

/*
 * snake_oil1.cu
 *
 * CUDA program to do bad encryption
 * 
 * compile with:  nvcc -o snake_oil1 snake_oil1.cu
 * 
 * resultsovercoffee@gmail.com
 */

#include <iostream>
#include <string>
#include <new>
#include <cstdlib>

using namespace std;

// function prototype for our CUDA kernel
__global__ void badcrypto(char *, char *, char *);


int main(int argc, char *argv[]) {
    string in("cleartxt");
    string key("password");
    int size = key.size();
    dim3 grid(1,1);
    dim3 block(size);

    //declare, allocate, and populate input & key on the device
    char *d_in;
    if (cudaMalloc(&d_in,size) != cudaSuccess){
        cerr << cudaGetErrorString(cudaGetLastError()) << endl;
        exit(EXIT_FAILURE);
    }
    if (cudaMemcpy(d_in,in.c_str(),size,cudaMemcpyHostToDevice)
        != cudaSuccess) {
        cerr << cudaGetErrorString(cudaGetLastError()) << endl;
        exit(EXIT_FAILURE);
    }
    char *d_key;
    if (cudaMalloc(&d_key,size) != cudaSuccess){
        cerr << cudaGetErrorString(cudaGetLastError()) << endl;
        exit(EXIT_FAILURE);
    }
    if (cudaMemcpy(d_key,key.c_str(),size,cudaMemcpyHostToDevice)
        != cudaSuccess) {
        cerr << cudaGetErrorString(cudaGetLastError()) << endl;
        exit(EXIT_FAILURE);
    }
    
    // declare, allocate and zero output array on device
    char *d_out;
    if (cudaMalloc(&d_out,size) != cudaSuccess){
        cerr << cudaGetErrorString(cudaGetLastError()) << endl;
        exit(EXIT_FAILURE);
    }
    if (cudaMemset(d_out, 0 ,size) != cudaSuccess){
        cerr << cudaGetErrorString(cudaGetLastError()) << endl;
        exit(EXIT_FAILURE);
    }
    
    //launch kernel on device to generate results
    badcrypto<<<grid,block>>>(d_in, d_out, d_key);

    //declare and allocate an output array on the host
    char *h_out = new (nothrow)char[size];
    if (h_out == 0) {
        cerr << "host allocation failed" << endl;
        exit(EXIT_FAILURE);
    }   

    //copy the device buffer back to the host
    if (cudaMemcpy(h_out,d_out,size,cudaMemcpyDeviceToHost)
        != cudaSuccess) {
        cerr << cudaGetErrorString(cudaGetLastError()) << endl;
        exit(EXIT_FAILURE);
    }

    //print results
    for (int i=0;i<size;i++) {
        cout << (int)h_out[i] << " " ;
    }
    cout << endl;

    //deallocate arrays
    cudaFree(d_in);
    cudaFree(d_key);
    cudaFree(d_out);
    delete[] h_out;

    //bye
    exit(EXIT_SUCCESS);
}


/*
 * badcrypto
 *
 * kernel that "encrypts" data using xor.  Don't do this.
 */
__global__ void badcrypto(char *in, char *out, char *key) {
    out[threadIdx.x] = in[threadIdx.x] ^ key[threadIdx.x];
}

Wow, so much fail there. This should look vaguely like last week's exercise. We're defining two strings (the cleartex input in and the key) copying them to the device, doing a bitwise XOR, then copying the results back. Currently, this isn't going to be very useful to purveyors of bad software -- the strings are statically defined and the kernel is launched as a single thread block (limiting us to 512 threads on some GPUs). Additionally, the cleartext and key need to be the same length -- that's clearly not acceptable since without the ability to encrypt large amounts of data with a short key an ISV might accidentally implement a one-time-pad.

That's enough of a start. My plan for next few weeks is to first turn this into a usable app, then do some light optimization to better map it to GPU hardware. Given the lightweight nature of bitwise XOR, we probably won't see cosmic gains in performance, but it should be fun anyway.

No comments: