GPU Computing

TP 0.4 - Kernel programming

E. Carlinet

J. Chazalon

Sept. 2020

Objective(s)

Duration

You should complete this part within 30 minutes.

Exercise 1: Vector initialization

Instructions

In this exercise you will code a kernel function which initializes a buffer in device global memory with a given value.

It is a sort of “fill” function.

In this case, both the buffer and the grid/blocks are in one dimension.

Code template

Read and edit the file 1-vec_init.cu from the student resources archive student_resources.tar.gz.

You have to write the code for the kernel

__global__ void vecinit(int *a, int N)

You are expected to insert your code in the sections demarcated with //@@. You should not have to make any change in other code sections.
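For reference, a minimal sketch of the expected kernel structure, assuming each thread fills exactly one element; since the signature carries no value parameter, the fill value below (42) is only a placeholder:

__global__ void vecinit(int *a, int N)
{
    // Global 1D index of this thread across the whole grid.
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // The grid is usually rounded up to a whole number of blocks,
    // so it may contain more threads than elements: guard the write.
    if (i < N)
        a[i] = 42;  // placeholder: use the fill value expected by the test
}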

Validation

Running make test-1-vec_init should build and run your program. The integrated test should succeed.

Relevant documentation

Exercise 2: Vector addition

Instructions

In this exercise you will code a simple vector addition: given two vectors x and y, you have to replace each yᵢ with xᵢ + yᵢ.

In this case, both the buffer and the grid/blocks are in one dimension.

Code template

Read and edit the file 2-vec_add.cu from the student resources archive student_resources.tar.gz.

You have to write the code for the kernel

__global__ void add(float *x, float *y, int N)

You are expected to insert your code in the sections demarcated with //@@. You should not have to make any change in other code sections.
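As in the previous exercise, each thread can handle a single element; only the arithmetic changes. A minimal sketch:

__global__ void add(float *x, float *y, int N)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // In-place update: y holds the result.
    if (i < N)
        y[i] = x[i] + y[i];
}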

Validation

Running make test-2-vec_add should build and run your program. The integrated test should succeed.

Relevant documentation

Exercise 3: Matrix addition with 1D allocation

Instructions

In this exercise you will code a matrix addition kernel. In this case the buffer is still allocated in one dimension, but the grid and blocks are two-dimensional.

Code template

Read and edit the file 3-mat_add_1Dalloc.cu from the student resources archive student_resources.tar.gz.

You have to write the code for the kernel

__global__ void add(int *a, int *b, int *c, int N, int M)

You are expected to insert your code in the sections demarcated with //@@. You should not have to make any change in other code sections.
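A minimal sketch, assuming the matrix has N rows of M columns stored in row-major order in the 1D buffer (check the template for the actual convention):

__global__ void add(int *a, int *b, int *c, int N, int M)
{
    // 2D coordinates of this thread in the 2D grid.
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;

    // Guard both dimensions, then flatten (row, col) into a 1D offset.
    if (row < N && col < M)
    {
        int idx = row * M + col;
        c[idx] = a[idx] + b[idx];
    }
}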

Validation

Running make test-3-mat_add_1Dalloc should build and run your program. The integrated test should succeed.

Relevant documentation

Exercise 4: Matrix initialization with 2D (pitched) allocation

Instructions

In this exercise you will initialize a 2D buffer. In this case both the buffer and the grid/blocks are two-dimensional.

You will also have to check on the device (i.e. using another kernel) that the memory is properly initialized.

Code template

Read and edit the file 4-mallocpitch.cu from the student resources archive student_resources.tar.gz.

You have to write the code for the kernels

template <typename T>
__global__ void simpleInit2D(T *buffer, T value, int cols, int rows, size_t pitch);

template <typename T>
__global__ void checkOnDevice(T *buffer, T expectedValue, int cols, int rows, size_t pitch);

You will also have to code the function

template <typename T>
__device__ inline T* eltPtr(T *baseAddress, int col, int row, size_t pitch);

which computes the pointer address of a given value in a 2D array given:

  - baseAddress: the base address of the buffer
  - col: the column coordinate of the value
  - row: the row coordinate of the value
  - pitch: the actual allocation size in bytes of a row, including its padding
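A minimal sketch of the address computation, and of how the initialization kernel can use it (the check kernel follows the same pattern):

template <typename T>
__device__ inline T* eltPtr(T *baseAddress, int col, int row, size_t pitch)
{
    // pitch is expressed in bytes, so the row offset must be computed
    // on a byte-sized pointer before casting back to T*.
    return (T*)((char*)baseAddress + row * pitch + col * sizeof(T));
}

template <typename T>
__global__ void simpleInit2D(T *buffer, T value, int cols, int rows, size_t pitch)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;

    if (col < cols && row < rows)
        *eltPtr(buffer, col, row, pitch) = value;
}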

You are expected to insert your code in the sections demarcated with //@@. You should not have to make any change in other code sections.

Validation

Running make test-4-mallocpitch should build and run your program. The integrated test should succeed.

Relevant documentation

Excerpt from the CUDA C Programming Guide, section 5.3.2 "Device Memory Accesses":

Two-Dimensional Arrays

A common global memory access pattern is when each thread of index (tx,ty) uses the following address to access one element of a 2D array of width width, located at address BaseAddress of type type* (where type meets the requirement described in Maximize Utilization):

BaseAddress + width * ty + tx

For these accesses to be fully coalesced, both the width of the thread block and the width of the array must be a multiple of the warp size.

In particular, this means that an array whose width is not a multiple of this size will be accessed much more efficiently if it is actually allocated with a width rounded up to the closest multiple of this size and its rows padded accordingly. The cudaMallocPitch() and cuMemAllocPitch() functions and associated memory copy functions described in the reference manual enable programmers to write non-hardware-dependent code to allocate arrays that conform to these constraints.
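For illustration, a host-side allocation with cudaMallocPitch() typically looks as follows (the dimensions are placeholders; the provided template handles the actual allocation):

int cols = 100, rows = 64;  // placeholder dimensions
float *d_buffer;
size_t pitch;

// cudaMallocPitch() rounds each row up to a properly aligned width
// and returns the resulting row size in bytes through pitch.
cudaMallocPitch((void**)&d_buffer, &pitch, cols * sizeof(float), rows);

The returned pitch is the value that must then be passed to the kernels above.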

Questions

  1. Why do we need a __device__ execution space specifier for the eltPtr() function?
  2. In the function eltPtr(), why do we convert the base address of the buffer to (char*)?