kdb+cuda

I’ve recently been discussing a GPU implementation of q-style languages with some former colleagues, so I’m dusting off some old examples..

# Compile the code into a shared library; the compiler
# produces code for the GPU to execute the kernel
# and a regular shared library for the host so that kdb+
# can load the functions.
amandla> nvcc -I. --compiler-options '-fPIC' -m64 cu.cu \
  --shared -o ../l64/cu.so 
 
# Now, in kdb+, load in the gpu_square function and
# test drive it..
q)square:`cu 2:(`gpu_square;1)
q)numbers: "e"$til 10
q)numbers
0 1 2 3 4 5 6 7 8 9e
q)square[numbers]
q)numbers
0 1 4 9 16 25 36 49 64 81e
q)

Here’s the code.. all very simple.

#include <cuda.h>
#include "k.h"
 
// Export the function we will load into kdb+.
extern "C" K gpu_square(K x);
 
// Define the "Kernel" that executes on the 
// CUDA device.
__global__ void square_array(float *a, int N) {
   int idx = blockIdx.x * blockDim.x + threadIdx.x;
   if (idx<N)
      a[idx] = a[idx] * a[idx];
}
 
// A function to use from kdb+ to square a vector of reals by
// - allocating space on the GPU
// - copying the data over from the K object
// - doing the work
// - copy back and overwrite the K object data
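K gpu_square(K x) {
   // Only real (4-byte float) vectors are handled; KE is their type code in k.h
   if (x->t != KE) return krr((S)"type");

   // Pointers to host & device arrays; kE is the k.h accessor for real data
   float *host_memory = kE(x), *device_memory;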
 
   // Allocate memory on the device for the 
   // data and copy it to the GPU
   size_t size = xn * sizeof(float);
   cudaMalloc((void **)&device_memory, size);
   cudaMemcpy(device_memory, host_memory, size, 
      cudaMemcpyHostToDevice);
 
   // Do the computation on the card
   int block_size = 4;
   int n_blocks = xn/block_size + (xn%block_size == 0 ? 0:1);
   square_array <<< n_blocks, block_size >>> (device_memory, xn);
 
   // Copy back the data, overwriting the input, free 
   // the memory we allocated on the graphics card
   cudaMemcpy(host_memory, device_memory, size, 
       cudaMemcpyDeviceToHost);
   cudaFree(device_memory);
   return 0;
}
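
The launch arithmetic in the listing above (round the block count up, then guard with idx<N) can be checked without a GPU. Here is a minimal host-only C++ sketch that walks the same (block, thread) pairs in ordinary loops. The names blocks_for and emulate_square are hypothetical helpers made up for this illustration; they are not part of CUDA or k.h.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper: number of blocks needed to cover n elements.
// Ceiling division, equal to n/block_size + (n%block_size == 0 ? 0 : 1).
int blocks_for(int n, int block_size) {
    assert(block_size > 0);
    return (n + block_size - 1) / block_size;
}

// Hypothetical host-side stand-in for the kernel launch: visit every
// (block, thread) pair the grid would contain and apply the same
// bounds-checked squaring as square_array.
void emulate_square(std::vector<float> &a, int block_size) {
    const int n = static_cast<int>(a.size());
    const int n_blocks = blocks_for(n, block_size);
    for (int block = 0; block < n_blocks; ++block) {
        for (int thread = 0; thread < block_size; ++thread) {
            // Same formula as blockIdx.x * blockDim.x + threadIdx.x
            const int idx = block * block_size + thread;
            // Surplus threads in the last block fall through and do nothing
            if (idx < n)
                a[idx] = a[idx] * a[idx];
        }
    }
}
```

With the 10 reals from the q session and block_size = 4, blocks_for gives 3 blocks, so 12 threads are visited and the last 2 are idle; that is why the kernel needs the idx<N test. Calling emulate_square on 0..9 leaves the squares 0 1 4 ... 81 in place, just as gpu_square overwrites its argument.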

Now to something more interesting.. like a BGM model of a Bermudan swaption. Monte Carlo tends to be pretty zippy on a high-end GPU..

2 Comments »

  1. shweta Said,

    August 6, 2009 @ 7:45 am

    Hi,

    We are working on performing mathematical operations on a GPU
    using CUDA. The inputs are to be obtained from a kdb database. I need some pointers on how we can connect to the database using C. Appreciate your help.

    Thanks,
    Shweta

  2. niall Said,

    August 6, 2009 @ 8:01 pm

    I’d start with the “Interfacing with C” cookbook I wrote on http://code.kx.com
