### Instruction scheduling

#### Akim Demaille, Etienne Renault, Roland Levillain

May 19, 2018

CCMP2


# Table of contents

- 1 Dependencies
- 2 Dependency graph
- 3 Instruction Pipeline
- 4 Minimizing stalls
- 5 Loops unrolling
- 6 Managing caches


## Dependencies analysis 1/2

Two instructions are **independent** if they can be permuted without altering the result of the program.

- The 3 following instructions are independent:

| Instruction      | Operation    |
|------------------|--------------|
| inst<sub>1</sub> | $a \gets 42$ |
| inst<sub>2</sub> | $b \gets 51$ |
| inst<sub>3</sub> | $c \gets 0$  |

- inst<sub>1</sub>, inst<sub>2</sub> and inst<sub>3</sub> can then be reordered in any of the 6 possible ways:

| Order | First        | Second       | Third        |
|-------|--------------|--------------|--------------|
| 1     | $a \gets 42$ | $b \gets 51$ | $c \gets 0$  |
| 2     | $a \gets 42$ | $c \gets 0$  | $b \gets 51$ |
| 3     | $c \gets 0$  | $a \gets 42$ | $b \gets 51$ |
| 4     | $c \gets 0$  | $b \gets 51$ | $a \gets 42$ |
| 5     | $b \gets 51$ | $c \gets 0$  | $a \gets 42$ |
| 6     | $b \gets 51$ | $a \gets 42$ | $c \gets 0$  |

## Dependencies analysis 2/2

Two instructions are **dependent** if the first one needs to be executed before the second one.

- The 3 following instructions are dependent, i.e. no reordering is possible!

 $\begin{array}{rll} \mathsf{inst}_1: & \mathsf{a} \leftarrow 42\\ \mathsf{inst}_2: & \mathsf{b} \leftarrow \mathsf{a} + 51\\ \mathsf{inst}_3: & \mathsf{c} \leftarrow \mathsf{b} \times 12 \end{array}$ 

- Two kinds of dependencies:
  - Data dependencies: the instruction manipulates a "variable" computed by another instruction.
  - Instruction dependencies: the instruction is a "cjump", and the next instruction depends on the "cjump".
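The difference between the two examples can be checked by brute force. The sketch below is only an illustration: it models each instruction as a hypothetical `(target, function)` pair and assumes every variable starts at 0, which the slides do not specify.

```python
from itertools import permutations

def run(program, env=None):
    """Execute (target, compute) assignments in order and return the final state."""
    # Assumption: variables that were never assigned hold 0.
    state = dict(env or {"a": 0, "b": 0, "c": 0})
    for target, compute in program:
        state[target] = compute(state)
    return state

def distinct_outcomes(program):
    """Number of distinct final states over all orderings of the program."""
    outcomes = {tuple(sorted(run(order).items())) for order in permutations(program)}
    return len(outcomes)

# a <- 42 ; b <- 51 ; c <- 0
independent = [
    ("a", lambda s: 42),
    ("b", lambda s: 51),
    ("c", lambda s: 0),
]

# a <- 42 ; b <- a + 51 ; c <- b * 12
dependent = [
    ("a", lambda s: 42),
    ("b", lambda s: s["a"] + 51),
    ("c", lambda s: s["b"] * 12),
]

print(distinct_outcomes(independent))  # one outcome: the order is irrelevant
print(distinct_outcomes(dependent))    # several outcomes: the order matters
```

A single outcome means every permutation is safe; more than one means at least one reordering changes the result.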


## Read after Write (RAW)

An instruction reads from a location after an earlier instruction has written to it.

| Instruction      | Opcode | Operands     |
|------------------|--------|--------------|
| inst<sub>1</sub> | lw     | \$2, 0(\$4)  |
| inst<sub>2</sub> | addi   | \$6, \$2, 42 |

inst<sub>1</sub> and inst<sub>2</sub> cannot be permuted, otherwise inst<sub>2</sub> would read an old value of \$2.

## Write after Read (WAR)

An instruction writes to a location after an earlier instruction has read from it.

| Instruction      | Opcode | Operands      |
|------------------|--------|---------------|
| inst<sub>1</sub> | lw     | \$2, 0(\$4)   |
| inst<sub>2</sub> | addi   | \$4, \$12, 42 |

inst<sub>1</sub> and inst<sub>2</sub> cannot be permuted, otherwise inst<sub>1</sub> would read the new value of \$4.

## Write after Write (WAW)

An instruction writes to a location after an earlier instruction has written to it.

| Instruction      | Opcode | Operands      |
|------------------|--------|---------------|
| inst<sub>1</sub> | add    | \$1, \$2, \$3 |
| inst<sub>2</sub> | add    | \$1, \$5, \$6 |

inst<sub>1</sub> and inst<sub>2</sub> cannot be permuted, otherwise \$1 would end up holding the old value written by inst<sub>1</sub>.

## Why and when to reorder?

We would like to reorder the instructions within each basic block in a way which:

- preserves the dependencies between those instructions (and hence the correctness of the program)
- achieves the minimum possible number of pipeline stalls, i.e. situations where two instructions that are simultaneously in the pipeline manipulate the same data, registers, etc.

The two problems can be addressed separately (whew!).

## Preserving and computing dependencies

We construct a directed acyclic graph (DAG) to represent the dependencies between instructions:

- For each instruction in the basic block, create a corresponding vertex in the graph.
- For each dependency between two instructions, create a corresponding edge in the graph, annotated with the type of the dependency.
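The two construction steps can be sketched as follows on the seven-instruction MIPS block used in this lecture. This is one possible implementation, not the course's reference code: `BLOCK` and `build_dag` are hypothetical names, only register dependencies are tracked, and the loads and stores are assumed independent of each other because they use distinct offsets from \$10.

```python
# (name, registers written, registers read) for each instruction of the block
BLOCK = [
    ("i1", {"$1"}, {"$10"}),        # lw  $1, 0($10)
    ("i2", {"$2"}, {"$10"}),        # lw  $2, 4($10)
    ("i3", {"$3"}, {"$1", "$2"}),   # add $3, $1, $2
    ("i4", set(), {"$3", "$10"}),   # sw  $3, 12($10)
    ("i5", {"$4"}, {"$10"}),        # lw  $4, 8($10)
    ("i6", {"$3"}, {"$1", "$4"}),   # add $3, $1, $4
    ("i7", set(), {"$3", "$10"}),   # sw  $3, 16($10)
]

def build_dag(block):
    """Return {(earlier, later): {dependency kinds}} for a basic block."""
    edges = {}
    last_writer = {}  # register -> most recent instruction writing it
    readers = {}      # register -> instructions reading it since that write
    for name, writes, reads in block:
        for reg in reads:
            if reg in last_writer:
                edges.setdefault((last_writer[reg], name), set()).add("RAW")
        for reg in writes:
            if reg in last_writer:
                edges.setdefault((last_writer[reg], name), set()).add("WAW")
            for reader in readers.get(reg, []):
                edges.setdefault((reader, name), set()).add("WAR")
        # Update the bookkeeping only after all edges of this instruction exist.
        for reg in reads:
            readers.setdefault(reg, []).append(name)
        for reg in writes:
            last_writer[reg] = name
            readers[reg] = []
    return edges

for (src, dst), kinds in build_dag(BLOCK).items():
    print(src, "->", dst, sorted(kinds))
```

Tracking only the most recent writer keeps the graph small: an edge such as i3 to i7 is not emitted, since it is implied by i3 to i6 and i6 to i7.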

Example basic block:

| Instruction    | Opcode | Operands      |
|----------------|--------|---------------|
| i<sub>1</sub>  | lw     | \$1, 0(\$10)  |
| i<sub>2</sub>  | lw     | \$2, 4(\$10)  |
| i<sub>3</sub>  | add    | \$3, \$1, \$2 |
| i<sub>4</sub>  | sw     | \$3, 12(\$10) |
| i<sub>5</sub>  | lw     | \$4, 8(\$10)  |
| i<sub>6</sub>  | add    | \$3, \$1, \$4 |
| i<sub>7</sub>  | sw     | \$3, 16(\$10) |

[Figure: dependency graph of the instructions i<sub>1</sub> to i<sub>7</sub>. Edge annotations give the type of dependency: RAW, WAW, WAR.]

## Preserving dependencies: Critical Path 1/2

The critical path is the longest path between two nodes of the dependency graph. To measure path lengths, we add **delays** (weights) to edges:

- 0 for WAW and WAR dependencies
- 2 for RAW dependencies with memory access
- 1 for other RAW dependencies


## Preserving dependencies: Critical Path 2/2

Any (reverse) topological sort of this DAG (i.e. any linear ordering of the vertices which keeps all the edges "pointing forwards") will maintain the dependencies and hence preserve the correctness of the program.

Algorithm:

- Associate a weight of 1 with every "instruction node"
- For all nodes $n_i$ in topological postorder (successors are processed before their predecessors)
  - If $n_i$ is not a leaf
    - For all nodes $n_j$ in $\mathrm{succ}(n_i)$ do $n_i.\mathit{weight} \leftarrow \max(n_i.\mathit{weight},\ n_j.\mathit{weight} + \mathit{delay}(n_i, n_j))$

Remember the "important" edges during the computation (those that set the maximum): they will form the critical path.
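The weight computation can be sketched as a depth-first traversal that finishes each node after its successors. The edge list and delays below are an assumption derived from the running example (every RAW edge involves a `lw` or a `sw`, hence delay 2; WAW and WAR edges get 0), and `critical_weights` is a hypothetical helper name.

```python
NODES = ["i1", "i2", "i3", "i4", "i5", "i6", "i7"]

# (source, destination) -> delay
DELAY = {
    ("i1", "i3"): 2, ("i2", "i3"): 2, ("i3", "i4"): 2,  # RAW with memory access
    ("i1", "i6"): 2, ("i5", "i6"): 2, ("i6", "i7"): 2,  # RAW with memory access
    ("i3", "i6"): 0,                                    # WAW
    ("i4", "i6"): 0,                                    # WAR
}

def critical_weights(nodes, delay):
    """Return (weight, best_succ): node weights and the edge that set each maximum."""
    succ = {n: [] for n in nodes}
    for src, dst in delay:
        succ[src].append(dst)
    weight, best_succ = {}, {}

    def visit(n):
        if n in weight:
            return
        weight[n] = 1          # every node starts with weight 1
        best_succ[n] = None
        for m in succ[n]:
            visit(m)           # postorder: successors are finished first
            candidate = weight[m] + delay[(n, m)]
            if candidate > weight[n]:
                weight[n] = candidate
                best_succ[n] = m   # remember the "important" edge

    for n in nodes:
        visit(n)
    return weight, best_succ

weight, best_succ = critical_weights(NODES, DELAY)

# Follow the remembered edges from a heaviest node to obtain a critical path.
path = [max(NODES, key=lambda n: weight[n])]
while best_succ[path[-1]] is not None:
    path.append(best_succ[path[-1]])

print(weight)
print(path)
```

When several nodes share the maximum weight, `max` returns the first one in `NODES`, so only one of the equally long paths is reported.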

Worked example on the dependency graph. Delays: blue arrows (RAW) 2, red and green arrows (WAW and WAR) 0. Every node starts with weight 1.

[Figure: the dependency graph, with node weights updated step by step.]

- $i_7$ doesn't have successors, skip it!
- $delay(i_6, i_7) + i_7.weight = 2 + 1 = 3 > 1$, change $i_6$ weight!
- $delay(i_5, i_6) + i_6.weight = 2 + 3 = 5 > 1$, change $i_5$ weight!
- $delay(i_4, i_6) + i_6.weight = 0 + 3 = 3 > 1$, change $i_4$ weight!
- $delay(i_3, i_4) + i_4.weight = 2 + 3 = 5 > 1$, change $i_3$ weight!
- $delay(i_1, i_3) + i_3.weight = 2 + 5 = 7 > 1$, change $i_1$ weight!
- $delay(i_2, i_3) + i_3.weight = 2 + 5 = 7 > 1$, change $i_2$ weight!

#### So many orders ... with one critical path

- $i_1, i_2, i_3, i_4, i_5, i_6, i_7$
- $i_1, i_2, i_5, i_3, i_4, i_6, i_7$
- $i_1, i_2, i_3, i_5, i_4, i_6, i_7$
- $i_2, i_1, i_5, i_3, i_4, i_6, i_7$
- $i_5, i_1, i_2, i_3, i_4, i_6, i_7$
- $i_2, i_1, i_3, i_5, i_4, i_6, i_7$
- $i_1, i_5, i_2, i_3, i_4, i_6, i_7$
- $i_5, i_2, i_1, i_3, i_4, i_6, i_7$
- $i_2, i_1, i_3, i_4, i_5, i_6, i_7$
- $i_2, i_5, i_1, i_3, i_4, i_6, i_7$
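The list of valid orders can be reproduced by enumerating every topological sort of the dependency graph. This is a brute-force sketch for small blocks only; the edge list is an assumption taken from the register dependencies of the running example, and `topological_orders` is a hypothetical helper name.

```python
NODES = ["i1", "i2", "i3", "i4", "i5", "i6", "i7"]

# (earlier, later) pairs of the dependency graph
EDGES = [
    ("i1", "i3"), ("i2", "i3"), ("i3", "i4"), ("i1", "i6"),
    ("i5", "i6"), ("i3", "i6"), ("i4", "i6"), ("i6", "i7"),
]

def topological_orders(nodes, edges):
    """Yield every ordering of `nodes` in which all edges point forwards."""
    preds = {n: {a for (a, b) in edges if b == n} for n in nodes}

    def extend(prefix, remaining):
        if not remaining:
            yield list(prefix)
            return
        placed = set(prefix)
        for n in remaining:
            # n may be emitted once all of its predecessors are placed
            if preds[n] <= placed:
                yield from extend(prefix + [n], [m for m in remaining if m != n])

    yield from extend([], list(nodes))

orders = list(topological_orders(NODES, EDGES))
print(len(orders))
for order in orders:
    print(", ".join(order))
```

The enumeration is exponential in general, which is one reason a scheduler relies on heuristics rather than trying every order.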

All these permutations respect the dependencies, but is there a best instruction schedule?


### Performances and Pipeline

Not all orders are equivalent!

- Some dependencies can cause hazards that slow down execution inside the pipeline
- A hazard occurs when:
  - an instruction requires that a previous instruction has finished
  - two instructions need the same data at the same time: one of the two is blocked

#### Instruction Pipeline

The microprocessor (MIPS) pipeline contains 5 stages:

- IF: Instruction Fetch
- ID: Instruction Decode. Read operands from registers, compute the address of the next instruction
- EX: Execute instructions requiring the ALU
- ME: Read/write into Memory
- WB: Write Back. Results are written into registers.

|                    | cycle<sub>1</sub> | cycle<sub>2</sub> | cycle<sub>3</sub> | cycle<sub>4</sub> | cycle<sub>5</sub> | cycle<sub>6</sub> | cycle<sub>7</sub> | cycle<sub>8</sub> | cycle<sub>9</sub> |
|--------------------|----|----|----|----|----|----|----|----|----|
| instr<sub>1</sub>  | IF | ID | EX | ME | WB |    |    |    |    |
| instr<sub>2</sub>  |    | IF | ID | EX | ME | WB |    |    |    |
| instr<sub>3</sub>  |    |    | IF | ID | EX | ME | WB |    |    |
| instr<sub>4</sub>  |    |    |    | IF | ID | EX | ME | WB |    |
| instr<sub>5</sub>  |    |    |    |    | IF | ID | EX | ME | WB |
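The staircase shape of this table follows from a simple rule: without stalls, instruction k (counting from 0) occupies stage s during cycle k + s + 1. A small sketch, where `ideal_pipeline` is a hypothetical helper name:

```python
STAGES = ["IF", "ID", "EX", "ME", "WB"]

def ideal_pipeline(n):
    """Rows of a stall-free pipeline diagram for n instructions.

    Row k holds, for each cycle, the stage occupied by instruction k,
    or an empty string when the instruction is not in the pipeline.
    """
    total_cycles = n + len(STAGES) - 1
    rows = []
    for k in range(n):
        row = [""] * total_cycles
        for s, stage in enumerate(STAGES):
            row[k + s] = stage
        rows.append(row)
    return rows

for k, row in enumerate(ideal_pipeline(5), start=1):
    print(f"instr{k}: " + " ".join(f"{cell:2}" for cell in row))
```

For n instructions the diagram spans n + 4 cycles, which is the best case that scheduling tries to approach.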

## Hazard: RAW dependencies 1/2

Some instructions require a result computed by a previous one!

#### Consider the following example:

[Figure: pipeline diagram of an lw that loads \$2, immediately followed by an addi that reads \$2.]

- lw produces its result into \$2 during the ME stage
- addi requires \$2 for the EX stage
- In this example, 1 stall (cycle 4)

The goal of RISC architectures is to complete one instruction per cycle!

## Hazard: RAW dependencies 2/2

#### Consider now the following example:

|                     | cycle<sub>1</sub> | cycle<sub>2</sub> | cycle<sub>3</sub> | cycle<sub>4</sub> | cycle<sub>5</sub> | cycle<sub>6</sub> | cycle<sub>7</sub> | cycle<sub>8</sub> |
|---------------------|----|----|----|----|----|----|----|----|
| lw \$2, 0(\$4)      | IF | ID | EX | ME | WB |    |    |    |
| addi \$5, \$2, 10   |    | IF | ID |    | EX | ME | WB |    |
| add \$12, \$9, \$11 |    |    | IF |    | ID | EX | ME | WB |

Let's look ... instruction 3 is independent from the others so we can change the order!

|                     | cycle<sub>1</sub> | cycle<sub>2</sub> | cycle<sub>3</sub> | cycle<sub>4</sub> | cycle<sub>5</sub> | cycle<sub>6</sub> | cycle<sub>7</sub> | cycle<sub>8</sub> |
|---------------------|----|----|----|----|----|----|----|----|
| lw \$2, 0(\$4)      | IF | ID | EX | ME | WB |    |    |    |
| add \$12, \$9, \$11 |    | IF | ID | EX | ME | WB |    |    |
| addi \$5, \$2, 10   |    |    | IF | ID | EX | ME | WB |    |
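The effect of the swap can be checked with a very small cost model. The sketch below assumes, as the diagrams suggest, that the only stalling hazard is a `lw` immediately followed by an instruction reading the loaded register, and that it costs exactly one cycle; `count_stalls` and `total_cycles` are hypothetical helper names.

```python
def count_stalls(program):
    """Count load-use stalls in a list of (opcode, destination, sources) tuples."""
    stalls = 0
    for prev, cur in zip(program, program[1:]):
        prev_opcode, prev_dest, _ = prev
        _, _, cur_sources = cur
        # Assumption: only a load feeding the very next instruction stalls.
        if prev_opcode == "lw" and prev_dest in cur_sources:
            stalls += 1
    return stalls

def total_cycles(program, depth=5):
    """Cycles needed to drain a `depth`-stage pipeline, stalls included."""
    return len(program) + (depth - 1) + count_stalls(program)

lw = ("lw", "$2", {"$4"})             # lw   $2, 0($4)
addi = ("addi", "$5", {"$2"})         # addi $5, $2, 10
add = ("add", "$12", {"$9", "$11"})   # add  $12, $9, $11

original = [lw, addi, add]
reordered = [lw, add, addi]

print(count_stalls(original), total_cycles(original))
print(count_stalls(reordered), total_cycles(reordered))
```

Under this model the original order needs 8 cycles and the reordered one 7, matching the cycle in which the last WB appears in each diagram.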

#### Hazard: WAW dependencies

Two instructions write to the same register!

#### Consider the following example:

[Figure: pipeline diagram of two instructions writing to the same register.]

WAW dependencies do not produce stalls! (even when writing to the same memory address)

#### Hazard: WAR dependencies

One instruction writes where a previous one reads!

#### Consider the following example:

[Figure: pipeline diagram of an instruction writing to a location read by a previous instruction.]

WAR dependencies do not produce stalls!

#### Back to the example – without scheduling

| Instruction    | Opcode | Operands      |
|----------------|--------|---------------|
| i<sub>1</sub>  | lw     | \$1, 0(\$10)  |
| i<sub>2</sub>  | lw     | \$2, 4(\$10)  |
| i<sub>3</sub>  | add    | \$3, \$1, \$2 |
| i<sub>4</sub>  | sw     | \$3, 12(\$10) |
| i<sub>5</sub>  | lw     | \$4, 8(\$10)  |
| i<sub>6</sub>  | add    | \$3, \$1, \$4 |
| i<sub>7</sub>  | sw     | \$3, 16(\$10) |

|               | c<sub>1</sub> | c<sub>2</sub> | c<sub>3</sub> | c<sub>4</sub> | c<sub>5</sub> | c<sub>6</sub> | c<sub>7</sub> | c<sub>8</sub> | c<sub>9</sub> | c<sub>10</sub> | c<sub>11</sub> | c<sub>12</sub> | c<sub>13</sub> |
|---------------|----|----|----|----|----|----|----|----|----|----|----|----|----|
| i<sub>1</sub> | IF | ID | EX | ME | WB |    |    |    |    |    |    |    |    |
| i<sub>2</sub> |    | IF | ID | EX | ME | WB |    |    |    |    |    |    |    |
| i<sub>3</sub> |    |    | IF | ID |    | EX | ME | WB |    |    |    |    |    |
| i<sub>4</sub> |    |    |    | IF |    | ID | EX | ME | WB |    |    |    |    |
| i<sub>5</sub> |    |    |    |    |    | IF | ID | EX | ME | WB |    |    |    |
| i<sub>6</sub> |    |    |    |    |    |    | IF | ID |    | EX | ME | WB |    |
| i<sub>7</sub> |    |    |    |    |    |    |    | IF |    | ID | EX | ME | WB |

Without scheduling: 2 dependencies, 2 stalls, 13 cycles!


#### Minimizing Stalls – First approach

Each time we emit the next instruction, we should try to choose one which:

- P<sub>1</sub>: does not conflict with the previously emitted instruction
- P<sub>2</sub>: is most likely to conflict if it is the first of a pair (e.g. prefer lw to add)
- P<sub>3</sub>: is as far away as possible (along paths in the DAG) from an instruction which can validly be scheduled last

Algorithm:

- Compute the dependency graph
- Initialize the list of candidate instructions with the minimal elements of the DAG (instructions without predecessors)
- While the list of candidate instructions is not empty
  - If one instruction satisfies P<sub>1</sub>, P<sub>2</sub> and P<sub>3</sub>: remove it from the list and emit it.
    - Remove the instruction from the DAG and insert the newly minimal elements into the candidate list.
  - Otherwise emit a nop instruction
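The loop above can be sketched as a greedy list scheduler. Several choices here are assumptions rather than part of the slides: P<sub>1</sub> is read as "no load-use conflict with the previous instruction" and acts as a filter, P<sub>2</sub> and P<sub>3</sub> are used as ranking keys (loads first, then the largest critical-path weight, then the name), and `INSTRS`, `EDGES` and `schedule` are hypothetical names describing the running example.

```python
from functools import lru_cache

# name -> (opcode, destination register or None, source registers)
INSTRS = {
    "i1": ("lw", "$1", {"$10"}),
    "i2": ("lw", "$2", {"$10"}),
    "i3": ("add", "$3", {"$1", "$2"}),
    "i4": ("sw", None, {"$3", "$10"}),
    "i5": ("lw", "$4", {"$10"}),
    "i6": ("add", "$3", {"$1", "$4"}),
    "i7": ("sw", None, {"$3", "$10"}),
}

# (earlier, later) -> delay: 2 for the RAW edges, 0 for WAW and WAR
EDGES = {
    ("i1", "i3"): 2, ("i2", "i3"): 2, ("i3", "i4"): 2, ("i1", "i6"): 2,
    ("i5", "i6"): 2, ("i6", "i7"): 2, ("i3", "i6"): 0, ("i4", "i6"): 0,
}

def schedule(instrs, edges):
    """Greedy list scheduling; returns instruction names, with 'nop' fillers."""
    succ = {n: [] for n in instrs}
    npreds = {n: 0 for n in instrs}
    for src, dst in edges:
        succ[src].append(dst)
        npreds[dst] += 1

    @lru_cache(maxsize=None)
    def weight(n):
        # P3: longest weighted path from n to an instruction schedulable last
        return max([1] + [weight(m) + edges[(n, m)] for m in succ[n]])

    def conflicts(prev, cur):
        # P1 (assumed model): a load immediately followed by a reader of its result
        if prev is None or prev == "nop":
            return False
        opcode, dest, _ = instrs[prev]
        return opcode == "lw" and dest in instrs[cur][2]

    candidates = sorted(n for n in instrs if npreds[n] == 0)
    order, prev = [], None
    while candidates:
        allowed = [n for n in candidates if not conflicts(prev, n)]
        if not allowed:
            order.append("nop")   # nothing can be emitted without a stall
            prev = "nop"
            continue
        # P2: loads first; P3: largest weight; name as a deterministic tie-break
        best = min(allowed, key=lambda n: (instrs[n][0] != "lw", -weight(n), n))
        candidates.remove(best)
        order.append(best)
        prev = best
        for m in succ[best]:
            npreds[m] -= 1
            if npreds[m] == 0:
                candidates.append(m)  # m became a minimal element of the DAG
    return order

print(schedule(INSTRS, EDGES))
```

On this block the sketch moves the third load before the first addition, so no load is immediately followed by a reader of its result.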




Applying the algorithm to the example:

[Figure: the dependency graph, from which each chosen instruction is removed in turn.]

| Step | Candidates          | Choice | Final Order                         |
|------|---------------------|--------|-------------------------------------|
| 1    | $\{i_1, i_2, i_5\}$ | $i_1$  | $i_1$                               |
| 2    | $\{i_2, i_5\}$      | $i_2$  | $i_1, i_2$                          |
| 3    | $\{i_5, i_3\}$      | $i_5$  | $i_1, i_2, i_5$                     |
| 4    | $\{i_3\}$           | $i_3$  | $i_1, i_2, i_5, i_3$                |
| 5    | $\{i_4\}$           | $i_4$  | $i_1, i_2, i_5, i_3, i_4$           |
| 6    | $\{i_6\}$           | $i_6$  | $i_1, i_2, i_5, i_3, i_4, i_6$      |
| 7    | $\{i_7\}$           | $i_7$  | $i_1, i_2, i_5, i_3, i_4, i_6, i_7$ |

At each step, the instruction is chosen since it satisfies $P_1$, $P_2$ and $P_3$.
Final Order =  $i_1$ ,  $i_2$ ,  $i_5$ ,  $i_3$ ,  $i_4$ ,  $i_6$ ,  $i_7$ 

|       | c1 | c2 | c3 | c4 | c5 | c6 | c7 | c8 | c9 | c10 | c11 |
|-------|----|----|----|----|----|----|----|----|----|-----|-----|
| $i_1$ | IF | ID | EX | ME | WB |    |    |    |    |     |     |
| $i_2$ |    | IF | ID | EX | ME | WB |    |    |    |     |     |
| $i_5$ |    |    | IF | ID | EX | ME | WB |    |    |     |     |
| $i_3$ |    |    |    | IF | ID | EX | ME | WB |    |     |     |
| $i_4$ |    |    |    |    | IF | ID | EX | ME | WB |     |     |
| $i_6$ |    |    |    |    |    | IF | ID | EX | ME | WB  |     |
| $i_7$ |    |    |    |    |    |    | IF | ID | EX | ME  | WB  |


With scheduling: still 2 dependencies but 0 stalls and 11 cycles!


# A word on scheduling strategies

- Sometimes we cannot avoid some stalls
- Computing the critical path can be smarter:
  - Rather than giving every instruction a weight of 1, we can weight each instruction by its real execution time
  - We can take advantage of the number of successors
  - ... and many yet-to-be-defined heuristics!
- The DAG of dependencies can be computed in $O(n^2)$ by scanning backwards through the basic block and adding edges as dependencies arise
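The backward scan and the weighted critical path can be sketched as follows. This is only an illustration under stated assumptions: the `struct insn` representation (one destination, two source registers, a latency), the `NO_REG` marker and the `MAX_INSNS` bound are hypothetical and not taken from the course; only register dependencies (RAW, WAR, WAW) are considered.

```c
#include <assert.h>

#define MAX_INSNS 16   /* hypothetical bound on the basic block size */
#define NO_REG (-1)    /* hypothetical marker for "no register" */

/* Hypothetical three-address instruction: one destination, two sources. */
struct insn { int dst, src1, src2, latency; };

static int reads(const struct insn *x, int reg) {
    return reg != NO_REG && (x->src1 == reg || x->src2 == reg);
}

/* Nonzero if `later` must stay after `earlier` (RAW, WAR or WAW). */
static int depends(const struct insn *earlier, const struct insn *later) {
    int raw = reads(later, earlier->dst);
    int war = reads(earlier, later->dst);
    int waw = earlier->dst != NO_REG && earlier->dst == later->dst;
    return raw || war || waw;
}

/* edge[i][j] = 1 when instruction j depends on instruction i (i < j).
   Scans the block backwards and compares every pair: O(n^2). */
void build_dag(const struct insn *block, int n,
               int edge[MAX_INSNS][MAX_INSNS]) {
    assert(n <= MAX_INSNS);
    for (int j = n - 1; j >= 0; --j)
        for (int i = j - 1; i >= 0; --i)
            edge[i][j] = depends(&block[i], &block[j]);
}

/* cp[i] = length of the longest latency-weighted path starting at i. */
void critical_path(const struct insn *block, int n,
                   int edge[MAX_INSNS][MAX_INSNS], int cp[MAX_INSNS]) {
    for (int i = n - 1; i >= 0; --i) {
        int best = 0;
        for (int j = i + 1; j < n; ++j)
            if (edge[i][j] && cp[j] > best)
                best = cp[j];
        cp[i] = block[i].latency + best;
    }
}
```

With all latencies set to 1 this gives the unweighted critical path; giving a load a latency of 2, for instance, makes the instructions that lead to it more urgent.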

# A word on performances

We can statically compute the instructions per cycle, IPC = $\frac{\text{nb instructions}}{\text{nb cycles}}$, to compare two possible schedules.

In the previous example:

- without scheduling: IPC = $\frac{7}{13} \approx 0.54$
- with scheduling: IPC = $\frac{7}{11} \approx 0.64$ (better!)

We can also statically compute the cycles per instruction: $CPI = \frac{1}{IPC}$. The CPI lower bound is $\frac{\sum \alpha \times \beta}{\text{nb instructions}}$, where $\alpha$ is the number of instructions of a given instruction type and $\beta$ the associated cost.
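These formulas translate directly into code. The sketch below is illustrative only: the function names and the `count`/`cost` arrays (one entry per instruction type, standing for $\alpha$ and $\beta$) are assumptions, not part of the course material.

```c
#include <assert.h>

/* Instructions per cycle for a given schedule. */
double ipc(int nb_instructions, int nb_cycles) {
    assert(nb_cycles > 0);
    return (double)nb_instructions / nb_cycles;
}

/* Cycles per instruction: the inverse of IPC. */
double cpi(int nb_instructions, int nb_cycles) {
    assert(nb_instructions > 0);
    return (double)nb_cycles / nb_instructions;
}

/* CPI lower bound: sum(count[k] * cost[k]) / total number of instructions,
   where count[k] is the number of instructions of type k (alpha)
   and cost[k] the cost of that type (beta). */
double cpi_lower_bound(const int *count, const int *cost, int nb_types) {
    int weighted = 0, total = 0;
    for (int k = 0; k < nb_types; ++k) {
        weighted += count[k] * cost[k];
        total += count[k];
    }
    assert(total > 0);
    return (double)weighted / total;
}
```

For the previous example, `ipc(7, 13)` and `ipc(7, 11)` compare the two schedules.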

### Can we do better?

Consider the following code (representing a basic block):

| $i_1$: | Loop: | lw   | \$t0, | 0(\$s1)    | # t0=array element  |
|--------|-------|------|-------|------------|---------------------|
| $i_2$: |       | addu | \$t0, | \$t0, \$s2 | # add scalar in s2  |
| $i_3$: |       | sw   | \$t0, | 0(\$s1)    | # store result      |
| $i_4$: |       | addi | \$s1, | \$s1,-4    | # decrement pointer |
| $i_5$: |       | bne  | \$s1, | \$0, Loop  | # branch s1!=0      |


16 cycles for 5 instructions that are all dependent! IPC = $\frac{5}{16} \approx 0.31$

# Loop Unrolling

- Replicate loop body to expose more parallelism
- Reduces loop-control overhead

At a high level, it looks like this:

| Without Loop Unrolling    | With Loop Unrolling            |
|---------------------------|--------------------------------|
| int i;                    | int i;                         |
| for (i = 0; i < 100; ++i) | for (i = 0; i < 100; i += 5) { |
| tab[i] = tab[i] + 42;     | tab[i] = tab[i] + 42;          |
|                           | tab[i+1] = tab[i+1] + 42;      |
|                           | tab[i+2] = tab[i+2] + 42;      |
|                           | tab[i+3] = tab[i+3] + 42;      |
|                           | tab[i+4] = tab[i+4] + 42;      |
|                           | }                              |

Special care must be taken with pre- and post-loop operations (as well as intra-loop dependencies).
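The post-loop care mentioned above shows up as soon as the trip count is not a multiple of the unrolling factor. A minimal sketch, assuming an array of arbitrary length `n` (the function names are hypothetical):

```c
#include <assert.h>
#include <stddef.h>

/* Reference version: one element per iteration. */
void add42(int *tab, size_t n) {
    for (size_t i = 0; i < n; ++i)
        tab[i] = tab[i] + 42;
}

/* Unrolled by 5, with an epilogue loop for the n % 5 leftover elements. */
void add42_unrolled(int *tab, size_t n) {
    assert(tab != NULL || n == 0);
    size_t i = 0;
    for (; i + 5 <= n; i += 5) {
        tab[i]     = tab[i]     + 42;
        tab[i + 1] = tab[i + 1] + 42;
        tab[i + 2] = tab[i + 2] + 42;
        tab[i + 3] = tab[i + 3] + 42;
        tab[i + 4] = tab[i + 4] + 42;
    }
    for (; i < n; ++i)   /* epilogue: at most 4 iterations */
        tab[i] = tab[i] + 42;
}
```

With `n = 100` the epilogue never runs, which is why the version in the table above can omit it.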

| $i_1$:    | Loop: | lw   | \$t0, | 0(\$s1)    | # t0=array element  |
|-----------|-------|------|-------|------------|---------------------|
| $i_2$:    |       | addu | \$t0, | \$t0, \$s2 | # add scalar in s2  |
| $i_3$:    |       | sw   | \$t0, | 0(\$s1)    | # store result      |
| $i_4$:    |       | addi | \$s1, | \$s1,-4    | # decrement pointer |
| $i_5$:    |       | bne  | \$s1, | \$0, Loop  | # branch s1!=0      |
| $i_6$:    | Loop: | lw   | \$t0, | 0(\$s1)    | # t0=array element  |
| $i_7$:    |       | addu | \$t0, | \$t0, \$s2 | # add scalar in s2  |
| $i_8$:    |       | sw   | \$t0, | 0(\$s1)    | # store result      |
| $i_9$:    |       | addi | \$s1, | \$s1,-4    | # decrement pointer |
| $i_{10}$: |       | bne  | \$s1, | \$0, Loop  | # branch s1!=0      |
| $i_{11}$: | Loop: | lw   | \$t0, | 0(\$s1)    | # t0=array element  |
| $i_{12}$: |       | addu | \$t0, | \$t0, \$s2 | # add scalar in s2  |
| $i_{13}$: |       | sw   | \$t0, | 0(\$s1)    | # store result      |
| $i_{14}$: |       | addi | \$s1, | \$s1,-4    | # decrement pointer |
| $i_{15}$: |       | bne  | \$s1, | \$0, Loop  | # branch s1!=0      |

#### First, duplicate the body of the loop N times!

| $i_1$:    | Loop: | lw   | \$t0, | 0(\$s1)    | # t0=array element  |
|-----------|-------|------|-------|------------|---------------------|
| $i_2$:    |       | addu | \$t0, | \$t0, \$s2 | # add scalar in s2  |
| $i_3$:    |       | sw   | \$t0, | 0(\$s1)    | # store result      |
| $i_4$:    |       | addi | \$s1, | \$s1,-4    | # decrement pointer |
| $i_6$:    |       | lw   | \$t0, | 0(\$s1)    | # t0=array element  |
| $i_7$:    |       | addu | \$t0, | \$t0, \$s2 | # add scalar in s2  |
| $i_8$:    |       | sw   | \$t0, | 0(\$s1)    | # store result      |
| $i_9$:    |       | addi | \$s1, | \$s1,-4    | # decrement pointer |
| $i_{11}$: |       | lw   | \$t0, | 0(\$s1)    | # t0=array element  |
| $i_{12}$: |       | addu | \$t0, | \$t0, \$s2 | # add scalar in s2  |
| $i_{13}$: |       | sw   | \$t0, | 0(\$s1)    | # store result      |
| $i_{14}$: |       | addi | \$s1, | \$s1,-4    | # decrement pointer |
| $i_{15}$: |       | bne  | \$s1, | \$0, Loop  | # branch s1!=0      |

#### Remove redundant labels and jumps (supposing that we are able to do it!)


| $i_1$:    | Loop: | lw   | \$t0, | 0(\$s1)                 | # t0=array element  |
|-----------|-------|------|-------|-------------------------|---------------------|
| $i_2$:    |       | addu | \$t0, | \$t0, \$s2              | # add scalar in s2  |
| $i_3$:    |       | sw   | \$t0, | 0(\$s1)                 | # store result      |
| $i_4$:    |       | addi | \$s1, | \$s1,-4                 | # decrement pointer |
| $i_6$:    |       | lw   | \$t1, | 0(\$s1)                 | # t0=array element  |
| $i_7$:    |       | addu | \$t1, | <mark>\$t1,</mark> \$s2 | # add scalar in s2  |
| $i_8$:    |       | sw   | \$t1, | 0(\$s1)                 | # store result      |
| $i_9$:    |       | addi | \$s1, | \$s1,-4                 | # decrement pointer |
| $i_{11}$: |       | lw   | \$t2, | 0(\$s1)                 | # t0=array element  |
| $i_{12}$: |       | addu | \$t2, | <mark>\$t2,</mark> \$s2 | # add scalar in s2  |
| $i_{13}$: |       | sw   | \$t2, | 0(\$s1)                 | # store result      |
| $i_{14}$: |       | addi | \$s1, | \$s1,-4                 | # decrement pointer |
| $i_{15}$: |       | bne  | \$s1, | \$0, Loop               | # branch s1!=0      |

#### Use other temporary names when possible!

| $i_4$:    | Loop: | addi | \$s1, | \$s1,-12              | # decrement pointer |
|-----------|-------|------|-------|-----------------------|---------------------|
| $i_1$:    |       | lw   | \$t0, | 0(\$s1)               | # t0=array element  |
| $i_2$:    |       | addu | \$t0, | \$t0, \$s2            | # add scalar in s2  |
| $i_3$:    |       | sw   | \$t0, | 0(\$s1)               | # store result      |
| $i_6$:    |       | lw   | \$t1, | <mark>4</mark>(\$s1)  | # t0=array element  |
| $i_7$:    |       | addu | \$t1, | \$t1, \$s2            | # add scalar in s2  |
| $i_8$:    |       | sw   | \$t1, | <mark>4</mark>(\$s1)  | # store result      |
| $i_{11}$: |       | lw   | \$t2, | <mark>8</mark>(\$s1)  | # t0=array element  |
| $i_{12}$: |       | addu | \$t2, | \$t2, \$s2            | # add scalar in s2  |
| $i_{13}$: |       | sw   | \$t2, | <mark>8</mark>(\$s1)  | # store result      |
| $i_{15}$: |       | bne  | \$s1, | \$0, Loop             | # branch s1!=0      |

#### Gather redundant operations and merge them carefully!


| $i_1$:    | Loop: | addi | \$s1, | \$s1, <mark>-12</mark> | # decrement pointer for N=3 |
|-----------|-------|------|-------|------------------------|-----------------------------|
| $i_2$:    |       | lw   | \$t0, | 0(\$s1)                | # t0=array element          |
| $i_3$:    |       | lw   | \$t1, | <mark>4</mark>(\$s1)   | # t1=array element          |
| $i_4$:    |       | lw   | \$t2, | <mark>8</mark>(\$s1)   | # t2=array element          |
| $i_5$:    |       | addu | \$t0, | \$t0, \$s2             | # add scalar in s2          |
| $i_6$:    |       | addu | \$t1, | \$t1, \$s2             | # add scalar in s2          |
| $i_7$:    |       | addu | \$t2, | \$t2, \$s2             | # add scalar in s2          |
| $i_8$:    |       | sw   | \$t0, | 0(\$s1)                | # store result              |
| $i_9$:    |       | sw   | \$t1, | <mark>4</mark>(\$s1)   | # store result              |
| $i_{10}$: |       | sw   | \$t2, | <mark>8</mark>(\$s1)   | # store result              |
| $i_{11}$: |       | bne  | \$s1, | \$0, Loop              | # branch s1!=0              |

#### Schedule the instructions and renumber them (and update comments)!

#### Pros & Cons

- We avoid a lot of conditional jumps (and hence many stalls)
- We require 19 cycles for 11 instructions: IPC = $\frac{11}{19} \approx 0.58$ (a lot better than the previous 0.31)
- This trick gives the scheduler more independent instructions to insert, and thus fewer stalls!
- But we now have a prologue and an epilogue, i.e., two more basic blocks
- It requires more temporaries: register allocation will be harder!
- Try it yourself with gcc -funroll-loops

# A very last word on Branch Hazards 1/2

- Conditional jumps often introduce delays since we cannot pre-fetch instructions
  - Branch Outcome and Branch Target Address are ready at the end of the EX stage (3rd stage)
  - Conditional branches are resolved when the PC is updated at the end of the ME stage (4th stage)
- Can we avoid them?

|                   | c1 | c2 | c3 | c4 | c5 | c6 | c7 | c8 | c9 |
|-------------------|----|----|----|----|----|----|----|----|----|
| bne \$1,\$2, loop | IF | ID | EX | ME | WB |    |    |    |    |
| nop               |    | IF | ID | EX | ME | WB |    |    |    |
| nop               |    |    | IF | ID | EX | ME | WB |    |    |
| nop               |    |    |    | IF | ID | EX | ME | WB |    |
| $i_{next}$        |    |    |    |    | IF | ID | EX | ME | WB |

#### We only know $i_{next}$ at cycle 5!

# A very last word on Branch Hazards 2/2

- X delay slots: the X instructions after a branch are always executed
- The original SPARC and MIPS processors each used a single branch delay slot to eliminate single-cycle stalls after branches
- We need branch prediction... but nowadays, most processors do it for us (and use slt...)!
- Some architectures have bypasses between stages to avoid stalls

Avoid floating-point operations and jumps as much as possible!
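As an illustration of trading a jump for a comparison result (what `slt` computes on MIPS), the sketch below counts the elements below a limit with and without a conditional jump in the loop body. The function names are hypothetical, and a compiler may well generate the same code for both versions, so this only shows the idea at the source level.

```c
#include <assert.h>
#include <stddef.h>

/* Branching version: one conditional jump per element. */
size_t count_below_branch(const int *v, size_t n, int limit) {
    size_t count = 0;
    for (size_t i = 0; i < n; ++i)
        if (v[i] < limit)
            ++count;
    return count;
}

/* Branch-free body: the comparison yields 0 or 1, which is added directly. */
size_t count_below_slt(const int *v, size_t n, int limit) {
    assert(v != NULL || n == 0);
    size_t count = 0;
    for (size_t i = 0; i < n; ++i)
        count += (size_t)(v[i] < limit);
    return count;
}
```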

"Do you program in mips?" she asked. "nop", he said.


#### Stalls due to caches

When the processor needs to access data:

- If the data is in the cache: the access costs 3 cycles
- Otherwise: the access costs 100 cycles


- Access to address 0x1: 4 words are fetched
- Access to address 0x5: 4 words are fetched
- Access to address 0x9: 4 words are fetched
- Access to address 0x13: 4 words are fetched
- Access to address 0x17: 4 words are fetched, and the first line of the cache is replaced!
- Access to address 0x21: 4 words are fetched, and the second line of the cache is replaced!
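The access sequence above can be replayed with a tiny direct-mapped cache model. This is a sketch under assumptions: the cache is taken to have 4 lines of 4 words (chosen so that the fifth access replaces the first line, as stated), and the addresses are read as the word addresses 1, 5, 9, 13, 17 and 21, even though the slides write them with a 0x prefix. The `struct cache` layout and `cache_access` are hypothetical names.

```c
#include <assert.h>

#define NB_LINES 4        /* assumption: 4 cache lines */
#define WORDS_PER_LINE 4  /* from the slides: 4 words are fetched per miss */

struct cache {
    int valid[NB_LINES];
    unsigned block[NB_LINES];  /* which memory block each line holds */
};

/* Returns 1 on a hit, 0 on a miss (the line is then filled).
   *evicted is set to 1 when a valid line had to be replaced. */
int cache_access(struct cache *c, unsigned word_addr, int *evicted) {
    unsigned block = word_addr / WORDS_PER_LINE;
    unsigned line = block % NB_LINES;   /* direct mapping: one possible line */
    assert(evicted != 0);
    *evicted = 0;
    if (c->valid[line] && c->block[line] == block)
        return 1;
    *evicted = c->valid[line];
    c->valid[line] = 1;
    c->block[line] = block;
    return 0;
}
```

Starting from an empty cache, the first four accesses fill lines 0 to 3, and the accesses to 17 and 21 map back to lines 0 and 1.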

Many strategies exist to place data in the cache:

- Direct Mapping:
  - The address is decomposed into 3 parts: tag (8b), line (22b), and word (2b)
  - Each block of main memory maps to only one cache line, i.e. block-size = cache-line-size
  - Simple, inexpensive, and a fixed location for a given block
- Associative Mapping:
  - A main memory block can be loaded into any line of the cache
  - The memory address is interpreted as tag and word
  - The tag uniquely identifies a block of memory
  - As before, block-size = cache-line-size
  - Complex, expensive, and no fixed location for a given block
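For direct mapping, the decomposition of an address into tag, line and word is plain bit slicing. A minimal sketch using the field widths given above (8, 22 and 2 bits) on a 32-bit address; the struct and function names are hypothetical:

```c
#include <assert.h>
#include <stdint.h>

#define WORD_BITS 2   /* low bits: word within the line */
#define LINE_BITS 22  /* middle bits: cache line index */
#define TAG_BITS  8   /* high bits: tag stored with the line */

struct cache_addr { uint32_t tag, line, word; };

struct cache_addr split_address(uint32_t addr) {
    struct cache_addr a;
    assert(TAG_BITS + LINE_BITS + WORD_BITS == 32);
    a.word = addr & ((1u << WORD_BITS) - 1);
    a.line = (addr >> WORD_BITS) & ((1u << LINE_BITS) - 1);
    a.tag  = addr >> (WORD_BITS + LINE_BITS);
    return a;
}
```

Two addresses with the same `line` field but different `tag` fields compete for the same cache line, which is the source of the conflicts discussed in the rest of this part.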


## Prefetching

Fetch the data before the program needs it (i.e. pre-fetch)

- Eliminates cache misses
- Involves predicting which addresses will be needed in the future (as for branch prediction)
- In contrast to branch prediction:
  - incorrectly prefetched data will simply not be used
  - there is no need for state recovery

#### Locality

- Locality is the principle that future memory accesses are near past accesses
- Memories take advantage of two types of locality
  - Temporal locality, i.e. near in time: we will often access the same data again very soon
  - Spatial locality, i.e. near in space/distance: our next access is often very close to our last access (or recent accesses)

Some Instruction Set Architectures (ISAs) provide instructions to pre-fetch data: humans or compilers have to insert these instructions to take advantage of them.
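At the source level, GCC and Clang expose such instructions through the `__builtin_prefetch` builtin. The sketch below is only an illustration: the look-ahead `PREFETCH_DISTANCE` is a hypothetical value that would have to be tuned for a real machine, and for a simple sequential scan like this one the hardware prefetcher may already do the job.

```c
#include <assert.h>
#include <stddef.h>

#define PREFETCH_DISTANCE 16  /* hypothetical look-ahead, in elements */

/* Sum an array while asking for the data needed a few iterations later. */
long sum_with_prefetch(const int *v, size_t n) {
    assert(v != NULL || n == 0);
    long sum = 0;
    for (size_t i = 0; i < n; ++i) {
        if (i + PREFETCH_DISTANCE < n)
            /* second argument 0 = prefetch for reading,
               third argument 1 = low temporal locality */
            __builtin_prefetch(&v[i + PREFETCH_DISTANCE], 0, 1);
        sum += v[i];
    }
    return sum;
}
```

A wrong guess only wastes a fetch: the result of the function does not depend on the prefetches.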

#### Loops optimisations

We have already seen loop unrolling to avoid stalls inside the processor. Other techniques exist to avoid stalls due to the cache:

- Loop fission
- Loop interchanging
- Array grouping
- Loop blocking
- Loop reversal
- Loop tiling
- ...

Consider the following code, and direct mapping strategy:

```
int A[1024]; int B[1024]; int C[1024];
for (int i = 1; i<1024; ++i) {
    A[i] = B[i];
    C[i] = C[i-1] + 1;
}
```

- Fetch A[i], A[i + 1], A[i + 2] and A[i + 3]
- Fetch B[i], B[i + 1], B[i + 2] and B[i + 3]
- Fetch C[i], C[i + 1], C[i + 2] and C[i + 3]
- Fetch C[i - 1]: will probably conflict

Hopefully A[i], B[i] and C[i] will not conflict in the cache, but C[i - 1] probably will!

#### Solution

Divide the loop into two:

- Less pressure on cache
- We can now insert padding to avoid conflicts

```
int A[1024]; padding[xx]; int B[1024]; int C[1024];
for (int i = 1; i < 1024; ++i)
    A[i] = B[i];
for (int i = 1; i < 1024; ++i)
    C[i] = C[i-1] + 1;
```

Try it yourself with gcc -ftree-loop-distribution
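To check that the split is legal here, the two versions can be compared directly. In this sketch the function names are hypothetical and the arrays are passed as parameters; the transformation is valid because the statement on `A` and `B` and the statement on `C` do not depend on each other (assuming the arrays do not overlap).

```c
#include <assert.h>

#define N 1024

/* Original loop: both statements share one loop body. */
void loop_fused(int *A, const int *B, int *C) {
    for (int i = 1; i < N; ++i) {
        A[i] = B[i];
        C[i] = C[i - 1] + 1;
    }
}

/* After fission: each loop touches fewer arrays at a time. */
void loop_split(int *A, const int *B, int *C) {
    assert(A != B && A != C);
    for (int i = 1; i < N; ++i)
        A[i] = B[i];
    for (int i = 1; i < N; ++i)
        C[i] = C[i - 1] + 1;
}
```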


## Loop interchanging 1/2

Consider the following code, and direct mapping cache:

```
int A[1024][1024];
for (int j = 1; j < 1024; ++j)
    for (int i = 1; i < 1024; ++i)
        A[j][i] = A[j][i] * 42;
```

- Fetch A[j][i], A[j + 1][i], A[j + 2][i], and A[j + 3][i]
- Fetch A[j + 1][i], A[j + 2][i], A[j + 3][i], and A[j + 4][i]

In Fortran, the elements of an array are stored in memory contiguously by column, and the original loop iterates over rows, potentially causing a cache miss at each access. For example, the matrix with rows A B C and D E F is stored as A D B E C F.

### Loop interchanging 2/2

#### Solution

This transformation switches the positions of one loop that is tightly nested within another loop.

$$\begin{array}{ll} \text{int } A[1024][1024]; \\ \text{for (int } i=1; \ i{<}1024; \ +{+}i) \\ \text{for (int } j=1; \ j{<}1024; \ +{+}j) \\ A[j][i]=A[j][i] * 42; \\ \end{array}$$

Legal if the outermost loop does not carry any data dependence.

Try it yourself with gcc -floop-interchange
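The example above reasons with the Fortran (column-major) layout. C stores arrays row by row, so in C the roles are reversed: the cache-friendly order keeps the last index in the inner loop. A sketch of both orders, with hypothetical function names and a smaller array size chosen for illustration:

```c
#include <assert.h>

#define N 256

/* Inner loop walks down a column: consecutive accesses are N ints apart in C. */
void scale_column_inner(int A[N][N]) {
    assert(A != 0);
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j)
            A[j][i] = A[j][i] * 42;
}

/* Interchanged: inner loop walks along a row, i.e. consecutive addresses in C. */
void scale_row_inner(int A[N][N]) {
    assert(A != 0);
    for (int j = 0; j < N; ++j)
        for (int i = 0; i < N; ++i)
            A[j][i] = A[j][i] * 42;
}
```

Both functions compute the same result because no iteration depends on another; only the order of memory accesses changes.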


Consider the following code, and direct mapping cache:

```
int A[1024]; int B[1024];
for (int j = 1; j < 1024; ++j)
    A[j] = B[j] * 42;
```

- Fetch B[j], B[j + 1], B[j + 2] and B[j + 3]
- Fetch A[j], A[j + 1], A[j + 2] and A[j + 3]

Dynamic allocation does not allow padding. In the worst case, there are two misses per iteration.

#### Solution

Group the two arrays into one:

    struct twoval { int A; int B; };
    struct twoval R[1024];
    for (int j = 1; j < 1024; ++j)
        R[j].A = R[j].B * 42;

This avoids a lot of cache misses! Such cases are very hard for a compiler to detect.


## Loop Blocking

Consider the code below.

```
int A[1024][1024]; int B[1024][1024];
for (int i = 1; i < 1024; ++i)
    for (int j = 1; j < 1024; ++j)
        A[i][j] = B[i][j];
```

- If A and B are aligned we may encounter problems.
- Similar problems occur when processing images: A[i][j] = B[i-1][j-1] + B[i-1][j] + B[i-1][j+1] + B[i][j-1] + B[i][j] + B[i][j+1] + B[i+1][j-1] + B[i+1][j] + B[i+1][j+1];
- In this latter case, padding is complicated...

## Loop Blocking

#### Solution

Try to work with data that fit in the cache!

    int A[1024][1024]; int B[1024][1024];
    for (int i = 1; i < 1024; i += B)
        for (int j = 1; j < 1024; j += B)
            for (int ii = 1; ii ...
                for (int jj = 1; jj ...
                    A[i][j] = B[i][j];
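The inner loop bounds are cut off in the code above, so the following is a sketch of one common way to write a blocked copy, not a reconstruction of the original. Assumptions: the block size is a separate constant `BLOCK` (hypothetical value), indices start at 0, the inner loops run from the start of the block to the end of the block or of the array, and the body indexes with the inner counters `ii` and `jj`.

```c
#include <assert.h>

#define N 1024
#define BLOCK 32  /* hypothetical block size, to be tuned to the cache */

static int min_int(int a, int b) { return a < b ? a : b; }

/* Copy src into dst tile by tile: each BLOCK x BLOCK tile is finished
   before moving on to the next one. */
void copy_blocked(int dst[N][N], int src[N][N]) {
    assert(BLOCK > 0);
    for (int i = 0; i < N; i += BLOCK)
        for (int j = 0; j < N; j += BLOCK)
            for (int ii = i; ii < min_int(i + BLOCK, N); ++ii)
                for (int jj = j; jj < min_int(j + BLOCK, N); ++jj)
                    dst[ii][jj] = src[ii][jj];
}
```

The `min_int` bounds keep the code correct when `N` is not a multiple of `BLOCK`.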

## Summary

- Stalls in the processor can have many causes:
  - data dependencies between instructions
  - instruction dependencies
  - cache and memory
- Modern compilers try hard to reduce them:
  - by using Instruction Level Parallelism (i.e., having a lot of independent instructions)
  - all these optimizations must occur before register allocation (which is the final step)
  - When writing a compiler, you must know the target processor by heart!
- Caches can be shared between many processors!