# Getting started with CUDA

## Part 2 - Host view of GPU computation

Edwin Carlinet, Joseph Chazalon {firstname.lastname@epita.fr} Fall 2023

EPITA Research Laboratory (LRE)



### Host view of GPU computation



We need to transfer inputs from the host to the device and outputs the other way around.

Figure 2: Computation on separate device

## #include <cuda.h> void vecAdd(float \*h\_A, float \*h\_B, float \*h\_C, int n) int size = n \* sizeof(float); float \*d\_A, \*d\_B, \*d\_C; // 1.1 Allocate device memory for A, B and C // 1.2 Copy A and B to device memory // 2. Launch kernel code - computation done on device // 3. Copy C (result) from device memory // Free device vectors

Calling kernels, not writing them

Almost complete code

You do not need to write kernels to run CLIDA code: you can use kernels from a library, written by someone else.

This section is about how to properly launch CUDA kernels using their API only.

We use the GPU(s) as co-processor(s)

Sequential and parallel sections

Our program is made of a series of sequential and parallel sections.

Of course, CPU code can be multi-threaded tool

Checking errors



Figure 1: Heterogeneous programming

1D 2D 3D Allocate cudaMalloc() cudaMallocPitch() cudaMalloc3D() Copy cudaMemcpy() cudaMemcpy2D() cudaMemcpy3D() On-device init cudaMomset() cudaMomset2D() cudaMomset3D() Reclaim cudaFree()

plus many others detailed in the CUDA Runtime API documentation...

### Why 2D and 3D variants?

**CUDA** memory primitives

Host vs device: reminder

- Strong alignment requirements in device memory
- Enables correct loading of memory chunks to SM caches (correct bank alignment)
- · Proper striding management in automated fashion

 $\textbf{Host} \, \leftrightarrow \, \textbf{Device memory transfer}$ 

A proper kernel invocation

Let's fix this code!

cudaError\_t cudaMalloc ( void\*\* devPtr, size\_t size\_in\_bytes )

We just need the three following ones for now: Allocates space in the device global memory.

cudaError t cudaMemcpy ( void\* dst, const void\* src, size t size in bytes, cudaMemcpvKind kind )

Asynchronous data transfer. cudaMemcpyKind ≈ copy direction:

cudaMemcpvHostToHost

- $cudaMemcpyHostToDevice \leftarrow useful$
- cudaMemcpyDeviceToHost ← useful
- cudaMemcpvDeviceToDevice
- $\hbox{\tt cudaMemcpyDefault} \leftarrow \textit{Direction inferred from pointer values}. \textit{ Requires unified virtual}$ addressing.

cudaError\_t cudaFree ( void\* devPtr )

How to set gridDim and BlockDim properly?

void vecAdd(float \*h\_A, float \*h\_B, float \*h\_C, int n) int size = n \* sizeof(float); float \*d\_A, \*d\_B, \*d\_C; // 1.1 Allocate device memory for A. B and C cudaMalloc((void \*\*) &d\_A, size); // TODO repeat for d\_B and d\_C // 1.2 Copy A and B to device memory cudaMemcpy(d\_A, h\_A, size, cudaMemcpyHostToDevice); cudaMemcpy(d\_B, h\_B, size, cudaMemcpyHostToDevice); // 2. Launch kernel code - computation done on device  $\label{eq:k_VecAdd} $$k_VecAdd<<< NB, NT>>> (d_A, d_B, d_C, n); $$// FIXME $$How to compute NB and NT?$$ // 3. Copy C (result) from device memory cudaMemcpy(h\_C, d\_C, size, cudaMemcpyDeviceToHost); // Free device vectors cudaFree(d\_A); // TODO repeat for d\_B and d\_C

Intermission: Can I use memory management functions inside kernels?

No: cudaMalloc(), cudaMemcpy() and cudaFree() shall be called from host only.

memset() memcry() and free() functions

threads are going to call malloc()?

However, kernels may allocate, use and reclaim memory dynamically using regular malloc(),

Note that if some device code allocates some memory, it must free it.\ Warning: how many

## LINE ); exit(EXIT FAILURE):

\_\_FILE\_\_,

if (err != cudaSuccess) {

In practice, we need to check for API errors

printf("%s in %s at line %d\n",

cudaGetErrorString(err),

cudaError t err = cudaMalloc((void \*\*) &d A. size);

Fix the kernel invocation line

We want to fix this line:

k\_VecAdd<<<NB, NT>>>(d\_A, d\_B, d\_C, n);

Kernel invocation syntax:

kernel<<<br/>blocks, threads\_per\_block, shmem, stream>>>(param1, param2, ...);

- blocks: number of blocks in the grid:
- threads\_per\_block: number of threads for each block;
- shmem: (opt.) amount of shared memory to allocate (in bytes);
- stream: (opt.) CUDA stream (not discussed in this course, see the documentation).

How to set gridDim and BlockDim properly?

Lvl 0: Naive trial with as many threads as possible

k\_VecAdd<<<1, n>>>(d\_A, d\_B, d\_C, n);

Lvl 1: It works with just enough blocks

dim3 DimGrid(xBlocks, 1, 1); // Launch the kernel

// Get max threads per block int devId = 0: // There may be more devices! cudaDeviceProp deviceProp; cudaGetDeviceProperties(&deviceProp, devId); printf("Maximum grid dimensions: %d x %d x %d\n",  ${\tt deviceProp.maxGridSize[0],}$ deviceProp.maxGridSize[1], deviceProp.maxGridSize[2]); printf("Maximum block dimensions: %d x %d x %d\n", deviceProp.maxThreadsDim[0], deviceProp.maxThreadsDim[1], deviceProp.maxThreadsDim[2]); // Compute the number of blocks int xThreads = deviceProp.maxThreadsDim[0]; dim3 DimBlock(xThreads, 1, 1); // 1D VecAdd int xBlocks = (int) ceil(n/xThreads);

k\_VecAdd<<<DimGrid, DimBlock>>>(d\_A, d\_B, d\_C, n);

Lvl 0: Naive trial with as many threads as possible

k\_VecAdd<<<1, n>>>(d\_A, d\_B, d\_C, n);

Will fail with large vectors.

Hardware limitation on the maximum number of threads per block (1024 for Compute Capability

Will fail with vectors of size which is not a multiple of warp size (32).

Lvl 2: Tune block size given kernel requirements and hardware constraints

It is important to understand the difference between:

- the logical decomposition of your program  $\mathsf{problem} \approx \mathsf{grid} \to \mathsf{blocks} \to \mathsf{threads}$
- . the scheduling of the computation on the hardware:
- assignment of each block to a Streaming Multiprocessor (SM)
- groups threads into warps
- · run groups of warps concurrently

The hardware constraints are different between each Compute Capability version. See the CUDA C programming manual, Appendix H for details about each hardware version In particular, the amount of memory available on each SM may limit the number of threads one would actually want to launch (because of cache and registers pressure).

But this depends on the kernel code!

The CUDA Occupancy Calculator APIs are designed to assist programmers in choosing the best number of threads per block based on register and shared memory requirements of a given

However, remember that experiments on the target hardware is the way to go.

But wait... Kernel invocation is asynchronous

```
#include <stdio.h>
__global__ void print_kernel() {
  printf("Hello!\n");
int main() {
   print_kernel<<<1, 1>>>();
```

Remember that the device runtime is a functional subset of the host runtime, ie you can perform

device management, kernel launching, device memcpy, etc., but with some restrictions (see the

This code prints nothing!

```
#include <stdio.h>
__global__ void print_kernel() {
   printf("Hello!\n");
int main() {
   print_kernel<<<1, 1>>>();
   cudaDeviceSynchronize();
```

Host code synchronization cudaDeviceSynchronize because kernel invocation

from host perspective.

On the device, kernel invo sequential (unless you sche different streams).

Intermission: Can I call kernels inside kernels?

Yes: This is the basis of dynamic parallelism.

Some restrictions over the stack size apply.

The compiler may inline some of those calls, though.

documentation for details).

Conclusion about the host-only view

A host-only view of the computation is sufficient for most of the cases:

- 1. upload input data to the device
- 2. fire a kernel
- 3. download output data from the device

Advanced CUDA requires to make sure we saturate the SMs, and may imply to determine the best:

- amount of threads per blocks
- amount of blocks per grid
- work per thread (if applicable)

This depends on:

- hardware specifications: maximum gridDim and blockDim, etc.
- kernel code: amount of register and shared memory used by each thread

| n requires<br>e()<br>n is asynchronous |    |  |  |
|----------------------------------------|----|--|--|
| cations are strictly edule them on     |    |  |  |
|                                        | 16 |  |  |
|                                        |    |  |  |
| y some kernel study                    |    |  |  |
|                                        |    |  |  |
| d                                      | 18 |  |  |
|                                        |    |  |  |
|                                        |    |  |  |
|                                        |    |  |  |