Friday 10 October 2014

Memories from CUDA - Symbol Addresses (II)

In a previous post we gave a simple example of accessing constant memory in CUDA from inside a kernel function. What if we need to access it from the host (i.e., pass it as an argument to a kernel)? In this post we focus on how to use cudaGetSymbolAddress to get the address of a device variable (either a __constant__ or a __device__ one). Nowadays, modern architectures support the keyword __managed__, but we're going to do things the old way...




Passing a __constant__ variable to a kernel



Variables that are declared as __device__ or __constant__ are accessible from the device, but cannot be accessed or referenced directly in host-side code; as a result, neither they nor their addresses can be passed to a kernel function directly. Thankfully, there is a way out whenever we need to do so: cudaGetSymbolAddress returns the address of a device variable in device memory, which can then be used in host code. Let's go through the following example (the source code of this example can be found at https://gist.github.com/alphaville/be2c6865eab86dd491a6):

 #include <stdio.h>  
 #include <cuda_runtime.h>  
 #include "helper_cuda.h"  


We define the following variables:
 int host_x[4] = {1, 2, 3, 4};  
 __constant__ int dev_x[4];  

and we define a very simple kernel:

 __global__ void kernel(int *d_var) { d_var[threadIdx.x] += 10; }  
 int main(void)  
 {  
   int data_size = 4 * sizeof(int);  
   int *address;  

There is no need to allocate memory for the __constant__ variable dev_x: __constant__ variables are statically allocated on the device. We simply copy the data stored in host_x to the device variable dev_x using cudaMemcpyToSymbol, as we explained in the previous post:

   checkCudaErrors(cudaMemcpyToSymbol(dev_x, host_x,
       data_size, 0, cudaMemcpyHostToDevice));

We now get the address of dev_x, which we will pass to the kernel function (remember: we cannot pass dev_x itself directly to the kernel!):

   checkCudaErrors(cudaGetSymbolAddress((void**)&address, dev_x));

And this is how we invoke the kernel:

   kernel<<<1,4>>>(address);
   checkCudaErrors(cudaDeviceSynchronize());
   getLastCudaError("kernel execution failed");

We now copy the data stored in dev_x back to host_x using cudaMemcpyFromSymbol (after all, __constant__ variables are not so constant, huh? Strictly speaking, though, constant memory is read-only from device code, so writing to it from inside a kernel through this pointer is not guaranteed behavior):

   checkCudaErrors(cudaMemcpyFromSymbol(host_x, dev_x,
       data_size, 0, cudaMemcpyDeviceToHost));

   for (int i = 0; i < 4; i++) {
       printf("%d\n", host_x[i]);
   }
   return 0;
 }

and... that's it! The full source code of this example is available at the gist linked above.
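Assembling the snippets above, the complete program should look roughly like this (note that no allocation is needed for dev_x, since __constant__ memory is statically allocated on the device):

```cuda
#include <stdio.h>
#include <cuda_runtime.h>
#include "helper_cuda.h"

int host_x[4] = {1, 2, 3, 4};
__constant__ int dev_x[4];

__global__ void kernel(int *d_var) {
    d_var[threadIdx.x] += 10;
}

int main(void)
{
    int data_size = 4 * sizeof(int);
    int *address;

    /* Copy host_x into the statically allocated __constant__ array dev_x */
    checkCudaErrors(cudaMemcpyToSymbol(dev_x, host_x,
        data_size, 0, cudaMemcpyHostToDevice));

    /* Retrieve the device-side address of dev_x... */
    checkCudaErrors(cudaGetSymbolAddress((void**)&address, dev_x));

    /* ...and pass it to the kernel */
    kernel<<<1,4>>>(address);
    checkCudaErrors(cudaDeviceSynchronize());
    getLastCudaError("kernel execution failed");

    /* Copy the (modified) data back to the host */
    checkCudaErrors(cudaMemcpyFromSymbol(host_x, dev_x,
        data_size, 0, cudaMemcpyDeviceToHost));

    for (int i = 0; i < 4; i++) {
        printf("%d\n", host_x[i]);
    }
    return 0;
}
```

With the kernel adding 10 to each element, the program prints 11, 12, 13 and 14, one per line.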




Passing a __device__ variable to a kernel

Variables that have been declared as __device__ (i.e., they reside on the device and are accessible from all threads in a grid) can likewise be accessed from the host using cudaGetSymbolAddress. The code is otherwise identical; we only need to replace __constant__ with __device__ (and the output of the program remains the same). So simple!
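The one-line change is the declaration of dev_x:

```cuda
/* The __constant__ qualifier is replaced by __device__; the variable
   now resides in global device memory, is still statically allocated,
   and (unlike __constant__ memory) may legitimately be written to
   from inside a kernel. */
__device__ int dev_x[4];
```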




Using the address of a __device__ or __constant__ variable to transfer data


The address of a __device__ or __constant__ variable can be used in any host-side context where a pointer to device memory is required. This is the case when we need to transfer data using cudaMemcpy: in fact, one can use cudaMemcpy instead of cudaMemcpyToSymbol to transfer data to a symbol, by passing the address of the __device__ or __constant__ variable.

This is how the whole thing looks... As before, there is no need to allocate memory for dev_x; it is statically allocated on the device.

Then, we retrieve the address of the symbol on the device using cudaGetSymbolAddress:

   checkCudaErrors(cudaGetSymbolAddress((void**)&address, dev_x));    

We now pass this address to cudaMemcpy so as to transfer the data (yes, we could have used cudaMemcpyToSymbol):

  checkCudaErrors(cudaMemcpy(address, host_x, data_size,   
     cudaMemcpyHostToDevice));  

And finally we launch the kernel passing to it the address:

   kernel<<<1,4>>>(address);  

and it produces the same result as before. The corresponding source code can be found at https://gist.github.com/alphaville/9c6021692a89fdeb39e4.
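Putting the snippets together (here with a __device__ variable, and with the copy back to the host also done through the symbol address rather than cudaMemcpyFromSymbol), the program should look roughly like this:

```cuda
#include <stdio.h>
#include <cuda_runtime.h>
#include "helper_cuda.h"

int host_x[4] = {1, 2, 3, 4};
__device__ int dev_x[4];

__global__ void kernel(int *d_var) {
    d_var[threadIdx.x] += 10;
}

int main(void)
{
    int data_size = 4 * sizeof(int);
    int *address;

    /* Retrieve the device-side address of the statically
       allocated symbol dev_x */
    checkCudaErrors(cudaGetSymbolAddress((void**)&address, dev_x));

    /* Transfer the data with plain cudaMemcpy through the symbol
       address (instead of cudaMemcpyToSymbol) */
    checkCudaErrors(cudaMemcpy(address, host_x, data_size,
        cudaMemcpyHostToDevice));

    kernel<<<1,4>>>(address);
    checkCudaErrors(cudaDeviceSynchronize());
    getLastCudaError("kernel execution failed");

    /* Copy the result back, again via the symbol address */
    checkCudaErrors(cudaMemcpy(host_x, address, data_size,
        cudaMemcpyDeviceToHost));

    for (int i = 0; i < 4; i++) {
        printf("%d\n", host_x[i]);
    }
    return 0;
}
```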
