

# Instruction Scheduling (Part 2): Software Pipelining

Patrick Carribault patrick@ices.utexas.edu

CS 380C: Advanced Compiler Techniques

Tuesday, October 23rd 2007


# Lecture Overview

## Code Generator

- Back end part of compiler (code generator)
- Instruction scheduling
- Register allocation

## Instruction Scheduling

- Input: set of instructions
- Output: total order on that set

# Lecture Outline

## Previous Lecture

- Introduction to instruction scheduling
- Representation of data dependences and resource constraints
- Acyclic scheduling: list scheduling

## Today

- Loop scheduling: definition of software pipelining
- Parameters of software-pipelined schedules
- Heuristics: modulo scheduling
- Hardware support for code generation of software-pipelined loops

# Introduction to Software Pipelining

## Loop Scheduling

- Apply list scheduling on the loop body
- Ignore dependences with distance > 0
- No iteration overlapping ⇒ any schedule respects the dependence distances
- But: exploits only intra-iteration parallelism

#### How to benefit from inter-iteration parallelism?

- Unroll the loop before scheduling (or while scheduling)
- Overlap consecutive iterations in a continuous flow
  - ⇒ Software Pipelining

# Example 1



| A | r0 | r1 | r2 |
|---|----|----|----|
| 0 | X  |    |    |

| B | r0 | r1 | r2 |
|---|----|----|----|
| 0 |    | X  |    |

| C | r0 | r1 | r2 |
|---|----|----|----|
| 0 |    |    | X  |

- A valid simple schedule? $\Rightarrow$ Use list scheduling on the loop body

| Cycle | Schedule |
|-------|----------|
| 0     | A        |
| 1     | B        |
| 2     | C        |

| Cycle | r0 | r1 | r2 |
|-------|----|----|----|
| 0     | X  |    |    |
| 1     |    | X  |    |
| 2     |    |    | X  |

- Length of 3, but inter-iteration parallelism is available

- Schedule with overlapped iterations?

| Cycle | Iteration 0 | Iteration 1 | Iteration 2 | Iteration 3 |
|-------|-------------|-------------|-------------|-------------|
| 0     | A           |             |             |             |
| 1     | B           | A           |             |             |
| 2     | C           | B           | A           |             |
| 3     |             | C           | B           | A           |

- Kernel (1 cycle), depth of 3

| Cycle | r0 | r1 | r2 |
|-------|----|----|----|
| 0     | X  |    |    |
| 1     | X  | X  |    |
| 2     | X  | X  | X  |
| 3     | X  | X  | X  |


# Example 2



| A | r0 | r1 | r2 |
|---|----|----|----|
| 0 | X  |    |    |

| B | r0 | r1 | r2 |
|---|----|----|----|
| 0 |    | X  |    |

| C | r0 | r1 | r2 |
|---|----|----|----|
| 0 |    |    | X  |

- Schedule with overlapped iterations?

| Cycle | Iteration 0 | Iteration 1 | Iteration 2 | Iteration 3 |
|-------|-------------|-------------|-------------|-------------|
| 0     | A           |             |             |             |
| 1     | B           |             |             |             |
| 2     | C           | A           |             |             |
| 3     |             | B           |             |             |
| 4     |             | C           | A           |             |

- Kernel (2 cycles), depth of 2

| Cycle | r0 | r1 | r2 |
|-------|----|----|----|
| 0     | X  |    |    |
| 1     |    | X  |    |
| 2     | X  |    | X  |
| 3     |    | X  |    |
| 4     | X  |    | X  |



# Example 3

#### DDG and reservation tables:



| A | r0 | r1 |
|---|----|----|
| 0 | X  |    |

| B | r0 | r1 |
|---|----|----|
| 0 | X  |    |

| C | r0 | r1 |
|---|----|----|
| 0 |    | X  |

- Schedule with overlapped iterations?

| Cycle | Iteration 0 | Iteration 1 | Iteration 2 | Iteration 3 |
|-------|-------------|-------------|-------------|-------------|
| 0     | A           |             |             |             |
| 1     | B           |             |             |             |
| 2     | C           | A           |             |             |
| 3     |             | B           |             |             |
| 4     |             | C           | A           |             |

- Kernel (2 cycles), depth of 2

| Cycle | r0 | r1 |
|-------|----|----|
| 0     | X  |    |
| 1     | X  |    |
| 2     | X  | X  |
| 3     | X  |    |
| 4     | X  | X  |


# Software Pipelining – Definition

## Definition

- Do not wait for an iteration to finish before launching the next one
- Constant time between two consecutive iteration launches: the Initiation Interval (*II*)
- Need to respect constraints (including dependence distances)
- Prolog/Kernel/Epilog

## Schedule

- Several iterations alive in the kernel (depth)
- Scheduling one instruction $\Rightarrow$ choosing one cycle and one iteration
  - $\Rightarrow$ 2-dimensional schedule
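
The two dimensions can be made concrete with a small sketch. It assumes that instance `i` of an instruction scheduled at date `sigma` in the one-iteration schedule executes at cycle `sigma + i * II`. The function name `overlapped_schedule` and the dictionary layout are illustrative choices, not part of the lecture. The input is the one-iteration schedule of Example 2 (A, B, C at cycles 0, 1, 2 with II = 2).

```python
def overlapped_schedule(sigma, ii, iterations):
    """Map each cycle to the (instruction, iteration) pairs issued in it."""
    table = {}
    for it in range(iterations):
        for instr, date in sigma.items():
            # Instance `it` of an instruction is shifted by it * II cycles.
            table.setdefault(date + it * ii, []).append((instr, it))
    return dict(sorted(table.items()))


# One-iteration schedule of Example 2, repeated every II = 2 cycles.
sigma = {"A": 0, "B": 1, "C": 2}
table = overlapped_schedule(sigma, ii=2, iterations=3)

for cycle, slots in table.items():
    print(cycle, " ".join(f"{instr}{it}" for instr, it in slots))
```

From cycle 2 onward the pattern repeats every 2 cycles: one cycle issues C of one iteration with A of the next, the following cycle issues B alone. That repeating pattern is the kernel.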


# Software Pipelining – Parameters

## Performance

- Performance in cycles *P* with *n* iterations:

$$P = (n-1) \times II + M$$

- Linear in *II* $\Rightarrow$ lower is better

#### Parameters

- Initiation Interval II
- Depth D
- Makespan M
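
As a quick numeric check of the performance formula, the sketch below compares pipelined and non-overlapped execution. The trip count `n = 100` and the function names are assumptions made for illustration. The (II, M) pairs are those of Example 1 (II = 1, M = 3) and Example 2 (II = 2, M = 3).

```python
def pipelined_cycles(n, ii, makespan):
    """P = (n - 1) * II + M for a software-pipelined loop of n iterations."""
    return (n - 1) * ii + makespan


def sequential_cycles(n, makespan):
    """Cost when each iteration waits for the previous one to finish."""
    return n * makespan


n = 100  # assumed trip count
example1 = pipelined_cycles(n, ii=1, makespan=3)
example2 = pipelined_cycles(n, ii=2, makespan=3)
baseline = sequential_cycles(n, makespan=3)

print(example1, example2, baseline)  # 102 201 300
```

For large *n* the makespan term becomes negligible and the cost per iteration approaches *II*, which is why *II* is the primary parameter.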

## Initiation Interval

- Time (in cycles) between 2 consecutive iteration launches
- Corresponds to the kernel size
- Shapes the overall performance of the pipelined schedule

## Parameters influencing *II*

- Data dependences: *RecMII*

$$RecMII = \max_{\forall \text{ circuits } \theta} \left\lceil \frac{latency(\theta)}{distance(\theta)} \right\rceil$$

- Resource constraints: *ResMII*
- Minimum value *MII*:

$$MII = \max(ResMII, RecMII)$$
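
A minimal sketch of the MII computation follows. The slides do not spell out a formula for ResMII, so the usual bound is assumed here: for each resource, the number of cycles the loop body occupies it divided by the number of available copies, rounded up. The resource usage matches the reservation tables of the IMS/SMS Example 2 (A, B and D on r0, C on r1, E on r2). The single circuit of latency 3 and distance 1 is hypothetical, chosen only so that RecMII = 3 as stated for that example.

```python
from math import ceil


def rec_mii(circuits):
    """circuits: iterable of (total latency, total distance) with distance > 0."""
    # With no circuit there is no recurrence constraint, hence the default of 0.
    return max((ceil(lat / dist) for lat, dist in circuits), default=0)


def res_mii(uses, units):
    """uses: resource -> busy cycles per iteration; units: resource -> copies."""
    return max(ceil(count / units[res]) for res, count in uses.items())


uses = {"r0": 3, "r1": 1, "r2": 1}   # A, B, D on r0; C on r1; E on r2
units = {"r0": 1, "r1": 1, "r2": 1}  # assumption: one copy of each resource
circuits = [(3, 1)]                  # hypothetical circuit: latency 3, distance 1

mii = max(res_mii(uses, units), rec_mii(circuits))
print(res_mii(uses, units), rec_mii(circuits), mii)  # 3 3 3
```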

# Data Dependences – RecMII

## Dependence Constraint

- Let $\sigma$ be the schedule date of the instructions of one iteration, $l$ the latency and $d$ the distance of a dependence
- Consider two instructions $a$ and $b$:
  - Intra-iteration constraint ($d(a, b) = 0$):

$$\sigma(a) + l(a, b) \leq \sigma(b)$$

  - Inter-iteration constraint (loop-carried dependences):

$$\sigma(a) + l(a, b) \leq \sigma(b) + II \times d(a, b)$$

## MII due to dependence constraints on a circuit $\theta$

Summing the constraint over every edge of a circuit $\theta$ cancels the $\sigma$ terms, leaving only the total latency $l(\theta)$ and the total distance $d(\theta)$:

$$l(\theta) - II \times d(\theta) \le 0 \quad \Rightarrow \quad II \ge \frac{l(\theta)}{d(\theta)}$$
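
The constraint can be checked mechanically for a candidate schedule. The sketch below uses a hypothetical three-node DDG (A → B → C with latency 1, plus a loop-carried edge C → A of latency 1 and distance 1); the edge tuple layout and the function name are illustrative assumptions. The circuit has latency 3 and distance 1, so the inequality above predicts that II = 2 must fail and II = 3 can succeed.

```python
def violated_dependences(sigma, edges, ii):
    """Return the edges (a, b) with sigma(a) + l(a,b) > sigma(b) + II * d(a,b)."""
    return [
        (a, b)
        for a, b, lat, dist in edges
        if sigma[a] + lat > sigma[b] + ii * dist
    ]


# Hypothetical DDG: (source, destination, latency, distance)
edges = [("A", "B", 1, 0), ("B", "C", 1, 0), ("C", "A", 1, 1)]
sigma = {"A": 0, "B": 1, "C": 2}

print(violated_dependences(sigma, edges, ii=3))  # []
print(violated_dependences(sigma, edges, ii=2))  # [('C', 'A')]
```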

# Depth and Makespan

#### Depth

- Number of iterations alive in the kernel
- Secondary parameter
- Influences prolog/epilog size
- Complex relation with *II*

#### Makespan

- Time to complete a whole iteration in the kernel loop
- Secondary parameter
- Related to both *II* and depth of the pipeline
- Influences variable lifetimes
# Software Pipelining Approaches

#### Main approaches

- Exact algorithms: enumerate every possibility (NP-Complete)
- Heuristics: best choice in production compilers

### Heuristic family

- Modulo scheduling
- Kernel recognition

#### Today

• Approach mainly used in production compilers: *modulo scheduling* 

# Modulo Scheduling

### Principle

- Schedule a single iteration such that it remains valid when repeated every *II* cycles
- *II* must be fixed before scheduling the iteration
- If no valid schedule is found, increase the targeted *II* and retry

### Main algorithm

```
Sort the nodes by priority
Compute MII
II = MII
While (schedule not valid)
    Schedule a single iteration with II
    If (schedule not valid)
        II++
```
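
The loop above can be sketched in runnable form under simplifying assumptions: each instruction uses one resource for one cycle, nodes are placed in a given priority order without backtracking, and an attempt is abandoned as soon as a node cannot be placed. This is a simplified illustration, not IMS or SMS. The DDG is hypothetical (the figure of the examples is not available in the text): a chain A → B → C → D → E with latency 1 and a loop-carried edge C → A of distance 1. Resource usage follows the reservation tables of the IMS Example 2.

```python
def try_schedule(order, edges, uses, ii):
    """One attempt at a fixed II; returns {instruction: date} or None."""
    mrt = set()   # modulo reservation table: occupied (cycle mod II, resource)
    sigma = {}
    for p in order:
        # Earliest date allowed by the predecessors scheduled so far.
        estart = max(
            [0] + [sigma[a] + lat - ii * dist
                   for a, b, lat, dist in edges if b == p and a in sigma]
        )
        # Only II consecutive dates are worth trying: later ones wrap around.
        for date in range(estart, estart + ii):
            if (date % ii, uses[p]) not in mrt:
                break
        else:
            return None
        # Dependences towards successors that are already scheduled.
        if any(date + lat > sigma[b] + ii * dist
               for a, b, lat, dist in edges if a == p and b in sigma):
            return None
        sigma[p] = date
        mrt.add((date % ii, uses[p]))
    return sigma


def modulo_schedule(order, edges, uses, mii, max_ii=64):
    """Try II = MII, MII + 1, ... until one attempt succeeds."""
    for ii in range(mii, max_ii + 1):
        sigma = try_schedule(order, edges, uses, ii)
        if sigma is not None:
            return ii, sigma
    raise ValueError("no schedule found up to max_ii")


# Hypothetical DDG: (source, destination, latency, distance)
edges = [("A", "B", 1, 0), ("B", "C", 1, 0), ("C", "D", 1, 0),
         ("D", "E", 1, 0), ("C", "A", 1, 1)]
uses = {"A": "r0", "B": "r0", "C": "r1", "D": "r0", "E": "r2"}
order = ["A", "B", "C", "D", "E"]

ii, sigma = modulo_schedule(order, edges, uses, mii=3)
print(ii, sigma)
```

Starting from `mii=1` instead shows the retry loop at work on this graph: II = 1 fails on the r0 conflict between A and B, II = 2 fails on the C → A recurrence, and II = 3 succeeds.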


# Iterative Modulo Scheduling (IMS)

### Principle

- Seminal work on modulo scheduling by Bob Rau, MICRO-27 (1994)
- Extension of list scheduling to loops
- Notion of budget

### Compiler

• Implemented in Intel's compiler ICC

# Iterative Modulo Scheduling (IMS)

- Pick the next instruction in decreasing order of priority H

$$H(P) = \begin{cases} 0 & \text{if P is a leaf} \\ \max_{Q \in Succ(P)} (H(Q) + L(P, Q) - II \times D(P, Q)) & \text{otherwise} \end{cases}$$

- Compute the range in which to schedule it: [Estart, Estart + II - 1]

$$\textit{Estart}(P) = \max_{Q \in \textit{Pred}(P)} \begin{cases} 0 & \text{if } Q \text{ is unscheduled} \\ \max(0, \sigma(Q) + L(Q, P) - II \times D(Q, P)) & \text{otherwise} \end{cases}$$

- Try to schedule it within the range
- If this fails (due to either data dependences or resource usage), force the placement and unschedule the conflicting instructions
- Use a *budget* to avoid cyclically unscheduling the same set of instructions
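
The two formulas can be sketched as follows. Because the DDG may contain circuits, H is computed here by repeated relaxation rather than by direct recursion. Every node starts at 0, so the sketch actually computes max(0, H); this coincides with the formula above whenever the heights are non-negative. The DDG is hypothetical (a chain A → B → C → D → E of latency 1 plus a loop-carried edge C → A of distance 1), and the function names are illustrative.

```python
def heights(nodes, edges, ii):
    """Priority H by relaxation; terminates because II >= RecMII (no positive circuit)."""
    h = {p: 0 for p in nodes}
    for _ in range(len(nodes)):
        for a, b, lat, dist in edges:
            h[a] = max(h[a], h[b] + lat - ii * dist)
    return h


def estart(p, sigma, edges, ii):
    """Earliest date for p given the predecessors already in sigma."""
    return max(
        [0] + [sigma[q] + lat - ii * dist
               for q, b, lat, dist in edges if b == p and q in sigma]
    )


# Hypothetical DDG: (source, destination, latency, distance)
nodes = ["A", "B", "C", "D", "E"]
edges = [("A", "B", 1, 0), ("B", "C", 1, 0), ("C", "D", 1, 0),
         ("D", "E", 1, 0), ("C", "A", 1, 1)]
ii = 3

h = heights(nodes, edges, ii)
order = sorted(nodes, key=lambda p: -h[p])
first = estart("D", {"A": 0, "B": 1, "C": 2}, edges, ii)

print(h)                        # {'A': 4, 'B': 3, 'C': 2, 'D': 1, 'E': 0}
print(order)                    # ['A', 'B', 'C', 'D', 'E']
print(first, first + ii - 1)    # 3 5
```

With A, B and C placed at cycles 0, 1 and 2, instruction D may be tried at cycles 3 to 5, which is the window [Estart, Estart + II - 1].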

# IMS – Example 1 [LlosaPACT96]

DDG:



#### Reservation tables:

| LD/ST | r0 | r1 | r2 | r3 |
|-------|----|----|----|----|
| 0     |    |    | X  |    |
| 1     |    |    | X  |    |

| LD/ST | r0 | r1 | r2 | r3 |
|-------|----|----|----|----|
| 0     |    |    |    | X  |

| ADD | r0 | r1 | r2 | r3 |
|-----|----|----|----|----|
| 0   | X  |    |    |    |
| 1   | X  |    |    |    |

| MUL | r0 | r1 | r2 | r3 |
|-----|----|----|----|----|
| 0   |    | X  |    |    |
| 1   |    | X  |    |    |


# IMS – Example 2

DDG:



#### Reservation tables:

| A | r0 | r1 | r2 |
|---|----|----|----|
| 0 | X  |    |    |

| B | r0 | r1 | r2 |
|---|----|----|----|
| 0 | X  |    |    |

| C | r0 | r1 | r2 |
|---|----|----|----|
| 0 |    | X  |    |

| D | r0 | r1 | r2 |
|---|----|----|----|
| 0 | X  |    |    |

| E | r0 | r1 | r2 |
|---|----|----|----|
| 0 |    |    | X  |


1. Compute MII: RecMII = 3, ResMII = 3 $\Rightarrow$ MII = 3
2. Compute priority H: $H(A) = 4$, $H(B) = 3$, $H(C) = 2$, $H(D) = 1$, $H(E) = 0$
3. Start the scheduling process with II = MII = 3

| Cycle | Schedule |
|-------|----------|
| 0     | A        |
| 1     | B        |
| 2     | C        |
| 3     |          |
| 4     |          |
| 5     | D        |
| 6     | E        |

| Cycle mod II | r0 | r1 | r2 |
|--------------|----|----|----|
| 0            | X  |    | X  |
| 1            | X  |    |    |
| 2            | X  | X  |    |

- **Success:** II = 3, D = 3, M = 7

# Swing Modulo Scheduling (SMS)

#### Principle

- By Llosa et al., PACT'96
- Avoids the need to unschedule instructions
  - i.e., avoids scheduling an instruction when both its predecessors and its successors are already scheduled
- Based on the scheduling of strongly connected components (SCCs)
  - Sort SCCs by decreasing RecMII
- Go backward and forward on the nodes linking two SCCs
- When it is impossible to schedule an instruction, do not force it: just increase *II*

### Compiler

- Implemented in GCC by IBM Haifa

# SMS – Example 1 [LlosaPACT96]

DDG:



#### Reservation tables:

| LD/ST | r0 | r1 | r2 | r3 |
|-------|----|----|----|----|
| 0     |    |    | X  |    |
| 1     |    |    | X  |    |

| LD/ST | r0 | r1 | r2 | r3 |
|-------|----|----|----|----|
| 0     |    |    |    | X  |

| ADD | r0 | r1 | r2 | r3 |
|-----|----|----|----|----|
| 0   | X  |    |    |    |
| 1   | X  |    |    |    |

| MUL | r0 | r1 | r2 | r3 |
|-----|----|----|----|----|
| 0   |    | X  |    |    |
| 1   |    | X  |    |    |


# SMS – Example 2

DDG:



#### Reservation tables:

| A | r0 | r1 | r2 |
|---|----|----|----|
| 0 | X  |    |    |

| B | r0 | r1 | r2 |
|---|----|----|----|
| 0 | X  |    |    |

| C | r0 | r1 | r2 |
|---|----|----|----|
| 0 |    | X  |    |

| D | r0 | r1 | r2 |
|---|----|----|----|
| 0 | X  |    |    |

| E | r0 | r1 | r2 |
|---|----|----|----|
| 0 |    |    | X  |




1. Compute MII: RecMII = 3, ResMII = 3 $\Rightarrow$ MII = 3
2. Compute priority order O: $O = \langle C, D, E, B, A \rangle$
3. Start the scheduling process with II = MII = 3

| Cycle | Schedule |
|-------|----------|
| -3    | A        |
| -2    |          |
| -1    | B        |
| 0     | C        |
| 1     | D        |
| 2     | E        |

| Cycle mod II | r0 | r1 | r2 |
|--------------|----|----|----|
| 0            | X  | X  |    |
| 1            | X  |    |    |
| 2            | X  |    | X  |

- **Success:** II = 3, D = 2, M = 6

# Code Generation and Hardware Support

#### Register Allocation

- Classical register allocation
- Problems arise when lifetimes exceed II cycles
- Solutions:
  - Software: Modulo variable expansion
  - Hardware: Rotating register file

#### Predication

- Small irregular control  $\Rightarrow$  If-conversion
- Kernel-only loop

#### Example of architecture

• Overview of Itanium architecture

#### Example

- Software-pipelined schedule with II = 1:

| Cycle | Schedule |
|-------|----------|
| 0     | `ST[r4]=r7,8` ; `FMA r7=r5,r1,r6` ; `LD r5=[r2],8` ; `LD r6=[r3],8` |

- Lifetime of 2 cycles > II $\Rightarrow$ produced values are overwritten by the next iteration before they are consumed

#### Solutions

- Unroll the kernel 2 times (Modulo Variable Expansion)
- Use rotating registers (hardware feature)
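
The unroll factor used by modulo variable expansion can be sketched as below. A common choice is assumed here: the largest value of ⌈lifetime / II⌉ over all variables (other policies exist). Which register carries the 2-cycle lifetime is not stated on the slide, so `r5` and the `_k` naming scheme are purely illustrative.

```python
from math import ceil


def mve_unroll_factor(lifetimes, ii):
    """Kernel copies needed so that no value is overwritten while still live."""
    return max(ceil(life / ii) for life in lifetimes.values())


def register_for(reg, iteration, copies):
    """Name of the register copy used by a given iteration after expansion."""
    return f"{reg}_{iteration % copies}"


lifetimes = {"r5": 2}   # assumption: r5 holds the value that lives 2 cycles
copies = mve_unroll_factor(lifetimes, ii=1)
names = [register_for("r5", i, copies) for i in range(4)]

print(copies)  # 2
print(names)   # ['r5_0', 'r5_1', 'r5_0', 'r5_1']
```

Consecutive iterations alternate between two register copies, so the value of iteration *i* is still intact when iteration *i + 1* writes its own. A rotating register file achieves the same renaming in hardware without unrolling the kernel.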

# Predication

### Principle

- Convert control dependence into data dependence
- Add one bit (predicate) per instruction
- Instruction is committed iff the predicate is true

### Applications

- If-conversion (small control irregularity)
- Kernel-only schedule

### Example:

`if (c) A; else B;`
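
The effect of if-conversion on this example can be simulated in a few lines. The sketch assumes that `A` and `B` are simple assignments to a variable `x` (they are placeholders on the slide) and models a predicated instruction as a `(predicate, destination, value)` triple whose result is committed only when the predicate is true. The Python `if` inside `predicated` stands for that commit check, not for a branch in the generated code.

```python
def branching(c):
    """Original control flow: exactly one of A or B is executed."""
    state = {}
    if c:
        state["x"] = 1   # A (assumed to be an assignment)
    else:
        state["x"] = 2   # B (assumed to be an assignment)
    return state


def predicated(c):
    """If-converted form: both instructions are issued, one is committed."""
    state = {}
    p_true, p_false = bool(c), not c   # the compare produces two predicates
    instructions = [(p_true, "x", 1),  # (p_true)  A
                    (p_false, "x", 2)] # (p_false) B
    for pred, dst, value in instructions:
        if pred:                       # commit iff the predicate is true
            state[dst] = value
    return state


for c in (True, False):
    print(c, branching(c), predicated(c))
```

Both versions produce the same state, but the predicated one is a straight-line sequence, which is what allows the loop body to be modulo scheduled as a single block.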

# Itanium 2 Architecture

### EPIC (Explicitly Parallel Instruction Computing)

- VLIW-like architecture except for memory operations (out-of-order memory accesses)
- IPC up to 6 (Instructions Per Cycle)
- Compiler has to expose parallelism (+ constraints on instruction grouping: *bundling*)
- Provides full support for software-pipelined schedules: rotating register files and predication

#### Currently

- Itanium2 Montecito
- Dual-core with hyperthreading (2-way)
- Separate caches: L1, L2 and L3 (12MB for L3)



#### Instruction Scheduling

- Input: set of instructions
- Output: order on instructions
- Must respect both data dependences and resource constraints
  - Need a model to represent such constraints (DDG, reservation tables, automaton, ...)

### Schedule Types

- Basic block: List scheduling
- Loop:
  - List scheduling on the body $\Rightarrow$ does not benefit from inter-iteration parallelism
  - Software pipelining  $\Rightarrow$  Modulo scheduling heuristics (IMS, SMS)
