TP 0.4 - Kernel programming
Sept. 2020
You should complete this part within 30 minutes.
In this exercise you will code a kernel function which initializes a buffer in device global memory with a given value.
It is a sort of “fill” function.
In this case, both the buffer and the grid/blocks are in one dimension.
Read and edit the file 1-vec_init.cu from the student resources archive student_resources.tar.gz.
You have to write the code for the kernel
__global__ void vecinit(int *a, int N)
You are expected to insert your code in the sections demarcated with //@@. You should not have to make any change in other code sections.
Running make test-1-vec_init
should build and run your program. The integrated test should succeed.
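As a reminder of the usual 1D indexing pattern, a fill kernel typically maps one thread to one element. The sketch below is only an illustration: the fill value (here 1) and the launch configuration are assumptions, not taken from the exercise file.

```cuda
// Hypothetical sketch of a 1D fill kernel.
__global__ void vecinit(int *a, int N) {
    // Global index of this thread across the whole 1D grid.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // Guard: the grid may contain more threads than elements.
    if (i < N)
        a[i] = 1;  // assumed fill value; the test may expect another one
}
```

A matching launch would look like `vecinit<<<(N + 255) / 256, 256>>>(d_a, N);`, where the block size 256 is an arbitrary choice.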
In this exercise you will code a simple vector addition: given two vectors x and y, you have to replace each y[i] by x[i] + y[i].
In this case, both the buffer and the grid/blocks are in one dimension.
Read and edit the file 2-vec_add.cu from the student resources archive student_resources.tar.gz.
You have to write the code for the kernel
__global__ void add(float *x, float *y, int N)
You are expected to insert your code in the sections demarcated with //@@. You should not have to make any change in other code sections.
Running make test-2-vec_add
should build and run your program. The integrated test should succeed.
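The in-place addition follows the same 1D indexing pattern as the fill kernel; a possible sketch (not the reference solution) is:

```cuda
// Hypothetical sketch: element-wise y[i] = x[i] + y[i].
__global__ void add(float *x, float *y, int N) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N)
        y[i] = x[i] + y[i];  // accumulate x into y in place
}
```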
In this exercise you will code another vector addition kernel. In this case the buffer is still allocated in one dimension, but the grid and blocks are two-dimensional.
Read and edit the file 3-mat_add_1Dalloc.cu from the student resources archive student_resources.tar.gz.
You have to write the code for the kernel
__global__ void add(int *a, int *b, int *c, int N, int M)
You are expected to insert your code in the sections demarcated with //@@. You should not have to make any change in other code sections.
Running make test-3-mat_add_1Dalloc
should build and run your program. The integrated test should succeed.
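With a 2D grid, each thread computes a (row, col) pair from the x and y components of its indices, then flattens it into an offset in the 1D allocation. The mapping of N and M to rows and columns below is an assumption; check the host code in the exercise file.

```cuda
// Hypothetical sketch: 2D grid/blocks indexing a row-major 1D buffer,
// assuming N rows and M columns.
__global__ void add(int *a, int *b, int *c, int N, int M) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (row < N && col < M) {
        int idx = row * M + col;  // flatten (row, col) into the 1D allocation
        c[idx] = a[idx] + b[idx];
    }
}
```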
In this exercise you will initialize a 2D buffer. In this case both the buffer and the grid/blocks are two-dimensional.
You will also have to check on the device (i.e. using another kernel) that the memory is properly initialized.
Read and edit the file 4-mallocpitch.cu from the student resources archive student_resources.tar.gz.
You have to write the code for the kernels
template <typename T>
__global__ void simpleInit2D(T *buffer, T value, int cols, int rows, size_t pitch);

template <typename T>
__global__ void checkOnDevice(T *buffer, T expectedValue, int cols, int rows, size_t pitch);
You will also have to code the function
template <typename T>
__device__ inline T* eltPtr(T *baseAddress, int col, int row, size_t pitch);
which computes the pointer address of a given value in a 2D array, given:
- baseAddress: the base address of the buffer
- col: the column coordinate of the value
- row: the row coordinate of the value
- pitch: the actual allocation size in bytes of a row plus its padding
You are expected to insert your code in the sections demarcated with //@@. You should not have to make any change in other code sections.
Running make test-4-mallocpitch
should build and run your program. The integrated test should succeed.
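Since the pitch is expressed in bytes while the buffer is typed, the address computation must go through a byte pointer. The sketch below shows one possible shape of eltPtr and how a 2D init kernel could use it; it is an illustration of the pitched-addressing idea, not the reference solution.

```cuda
// Hypothetical sketch of eltPtr: pitch is a size in *bytes*, so the
// row offset is computed on a char* before casting back to T*.
template <typename T>
__device__ inline T* eltPtr(T *baseAddress, int col, int row, size_t pitch) {
    return (T*)((char*)baseAddress + row * pitch) + col;
}

// A 2D init kernel could then use it like this (2D grid/blocks assumed).
template <typename T>
__global__ void simpleInit2D(T *buffer, T value, int cols, int rows, size_t pitch) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (col < cols && row < rows)
        *eltPtr(buffer, col, row, pitch) = value;
}
```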
Excerpt from the CUDA C Programming Guide, section 5.3.2, Device Memory Accesses:
Two-Dimensional Arrays
A common global memory access pattern is when each thread of index (tx,ty) uses the following address to access one element of a 2D array of width width, located at address BaseAddress of type type* (where type meets the requirement described in Maximize Utilization):
BaseAddress + width * ty + tx
For these accesses to be fully coalesced, both the width of the thread block and the width of the array must be a multiple of the warp size.
In particular, this means that an array whose width is not a multiple of this size will be accessed much more efficiently if it is actually allocated with a width rounded up to the closest multiple of this size and its rows padded accordingly. The
cudaMallocPitch()
and cuMemAllocPitch()
functions and associated memory copy functions described in the reference manual enable programmers to write non-hardware-dependent code to allocate arrays that conform to these constraints.
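To make the padding concrete, here is a minimal host-side sketch of a pitched allocation; the names and dimensions are made up for illustration and do not come from the exercise file.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int cols = 100, rows = 64;   // 100 floats per row: not a multiple of the warp size
    float *devBuffer = nullptr;
    size_t pitch = 0;            // set by the runtime, in bytes

    // The runtime pads each row so that every row starts at an aligned
    // address, hence pitch >= cols * sizeof(float).
    cudaMallocPitch(&devBuffer, &pitch, cols * sizeof(float), rows);
    printf("requested row width: %zu bytes, pitch: %zu bytes\n",
           cols * sizeof(float), pitch);

    cudaFree(devBuffer);
    return 0;
}
```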
Why do we need the __device__ execution space specifier for the eltPtr() function?
In eltPtr(), why do we convert the base address of the buffer to (char*)?