TP 0.4 - Kernel programming
Sept. 2020
You should complete this part within 30 minutes.
In this exercise you will code a kernel function which initializes a buffer in device global memory with a given value.
It is a sort of “fill” function.
In this case, both the buffer and the grid/blocks are in one dimension.
Read and edit the file 1-vec_init.cu from the student resources archive student_resources.tar.gz.
You have to write the code for the kernel
__global__ void vecinit(int *a, int N)You are expected to insert your code in the sections demarcated with //@@. You should not have to make any change in other code sections.
Running make test-1-vec_init should build and run your program. The integrated test should succeed.
In this exercise your will code a simple vector addition: given two vectors x and y, you have to replace each yi by xi + yi.
In this case, both the buffer and the grid/blocks are in one dimension.
Read and edit the file 2-vec_add.cu from the student resources archive student_resources.tar.gz.
You have to write the code for the kernel
__global__ void add(float *x, float *y, int N)You are expected to insert your code in the sections demarcated with //@@. You should not have to make any change in other code sections.
Running make test-2-vec_add should build and run your program. The integrated test should succeed.
In this exercise your will code another vector addition kernel. In this case the buffer is still allocated in 1 dimension, but the grid and blocks are 2 dimensional.
Read and edit the file 3-mat_add_1Dalloc.cu from the student resources archive student_resources.tar.gz.
You have to write the code for the kernel
__global__ void add(int *a, int *b, int *c, int N, int M) {You are expected to insert your code in the sections demarcated with //@@. You should not have to make any change in other code sections.
Running make test-3-mat_add_1Dalloc should build and run your program. The integrated test should succeed.
In this exercise your will initialize a 2D buffer. In this case both the buffer and the grid/blocks are 2 dimensional.
You will also have to check on the device (ie using another kernel) that the memory is properly initialized.
Read and edit the file 4-mallocpitch.cu from the student resources archive student_resources.tar.gz.
You have to write the code for the kernels
template <typename T>
__global__ void simpleInit2D(T *buffer, T value, int cols, int rows, size_t pitch);
template <typename T>
__global__ void checkOnDevice(T *buffer, T expectedValue, int cols, int rows, size_t pitch);You will also have to code the function
template <typename T>
__device__ inline T* eltPtr(T *baseAddress, int col, int row, size_t pitch);which computes the pointer address of a given value in a 2D array given: - baseAddress: the base address of the buffer - col: the col coordinate of the value - row: the row coordinate of the value - pitch: the actual allocation size in bytes of a row plus its padding
You are expected to insert your code in the sections demarcated with //@@. You should not have to make any change in other code sections.
Running make test-4-mallocpitch should build and run your program. The integrated test should succeed.
Except from the CUDA C Programming guide, from section 5.3.2. Device Memory Accesses:
Two-Dimensional Arrays
A common global memory access pattern is when each thread of index (tx,ty) uses the following address to access one element of a 2D array of width width, located at address BaseAddress of type
type*(where type meets the requirement described in Maximize Utilization):
BaseAddress + width * ty + txFor these accesses to be fully coalesced, both the width of the thread block and the width of the array must be a multiple of the warp size.
In particular, this means that an array whose width is not a multiple of this size will be accessed much more efficiently if it is actually allocated with a width rounded up to the closest multiple of this size and its rows padded accordingly. The
cudaMallocPitch()andcuMemAllocPitch()functions and associated memory copy functions described in the reference manual enable programmers to write non-hardware-dependent code to allocate arrays that conform to these constraints.
__device__ execution space specifier for the eltPtr() function? eltPtr(), why do we convert the base address of the buffer to (char*)?