# Hardware Success Stories

Chris Rossbach and Calvin Lin

# Computer Architecture

The Instruction Set Architecture (ISA)

The ISA is an example of a successful parallel abstraction

Today we'll look at microprocessor trends over the years

What can we learn from this success story?

## Parallelism in Hardware

Microprocessors are highly parallel

Consider a block diagram of the Pentium processor



Figure 1. Pentium block diagram.

Bit-serial ALUs







#### Divide an instruction into stages

| Stage 1              |               | Stage 2                                   |               | Stage 3 |               | Stage 4          |               | Stage 5                |
|----------------------|---------------|-------------------------------------------|---------------|---------|---------------|------------------|---------------|------------------------|
| Instruction<br>Fetch | $\Rightarrow$ | Instruction<br>Decode &<br>Register Fetch | $\Rightarrow$ | Execute | $\Rightarrow$ | Memory<br>Access | $\Rightarrow$ | Register<br>Write-back |
| IF                   | $\Rightarrow$ | ID/RF                                     | $\Rightarrow$ | EX      | $\Rightarrow$ | MEM              | $\Rightarrow$ | WB                     |

| Instruction | Cycle 1 | 2  | 3  | 4   | 5   | 6   | 7   | 8   | 9   | 10 |
|-------------|---------|----|----|-----|-----|-----|-----|-----|-----|----|
| 1           | IF      | ID | EX | MEM | WB  |     |     |     |     |    |
| 2           |         | IF | ID | EX  | MEM | WB  |     |     |     |    |
| 3           |         |    | IF | ID  | EX  | MEM | WB  |     |     |    |
| 4           |         |    |    | IF  | ID  | EX  | MEM | WB  |     |    |
| 5           |         |    |    |     | IF  | ID  | EX  | MEM | WB  |    |
| 6           |         |    |    |     |     | IF  | ID  | EX  | MEM | WB |

A form of task parallelism

Increases latency of a single instruction

Increases throughput: ideally completes one instruction per cycle
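The throughput claim can be checked with a little arithmetic. The following is a minimal Python sketch, assuming an idealized five-stage pipeline in which every stage takes exactly one cycle and nothing ever stalls; the function names are illustrative and not part of the lecture.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def cycles_unpipelined(n_instructions, n_stages=len(STAGES)):
    """Each instruction runs through every stage before the next one starts."""
    return n_instructions * n_stages

def cycles_pipelined(n_instructions, n_stages=len(STAGES)):
    """The first instruction fills the pipeline; after that, one completes per cycle."""
    if n_instructions == 0:
        return 0
    return n_stages + (n_instructions - 1)

def timing_rows(n_instructions):
    """One row per instruction, one column per cycle."""
    total = cycles_pipelined(n_instructions)
    rows = []
    for i in range(n_instructions):
        row = [""] * total
        row[i:i + len(STAGES)] = STAGES  # instruction i enters IF at cycle i
        rows.append(row)
    return rows

if __name__ == "__main__":
    for row in timing_rows(6):
        print(" ".join(f"{stage:>3}" for stage in row))
    print(cycles_unpipelined(6), "cycles unpipelined vs", cycles_pipelined(6), "pipelined")
```

In this idealized model each instruction still takes five cycles from fetch to write-back. The added latency the slide mentions is not modeled here; in real designs it would typically come from the overhead of separating the stages.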

Out-of-order execution

Figure. Pipeline timing diagrams contrasting program order with issue order (time on the horizontal axis, instructions on the vertical axis).



How does out-of-order execution help?

Initiate a slow instruction early
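To make "initiate a slow instruction early" concrete, here is a minimal Python sketch of a toy issue model. It assumes a machine that issues at most one instruction per cycle, with hypothetical latencies (4 cycles for a load, 1 for an add); the `Instr` class and `run` function are invented for illustration and do not describe any real processor.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Instr:
    name: str
    latency: int       # cycles from issue until the result is available
    deps: tuple = ()   # names of instructions whose results this one reads

def run(program, out_of_order):
    """Issue at most one instruction per cycle; return (finish cycle, issue order)."""
    done_at = {}              # name -> cycle at which its result becomes available
    pending = list(program)   # not-yet-issued instructions, in program order
    order = []
    cycle = 0
    while pending:
        def ready(ins):
            return all(d in done_at and done_at[d] <= cycle for d in ins.deps)
        # In-order issue may only look at the oldest pending instruction.
        candidates = pending if out_of_order else pending[:1]
        chosen = next((ins for ins in candidates if ready(ins)), None)
        if chosen is not None:
            pending.remove(chosen)
            done_at[chosen.name] = cycle + chosen.latency
            order.append(chosen.name)
        cycle += 1
    return max(done_at.values()), order

PROGRAM = [
    Instr("load_a", 4),
    Instr("use_a", 1, ("load_a",)),
    Instr("load_b", 4),
    Instr("use_b", 1, ("load_b",)),
]

if __name__ == "__main__":
    print("in order:    ", run(PROGRAM, out_of_order=False))
    print("out of order:", run(PROGRAM, out_of_order=True))
```

Under these assumptions the in-order run stalls behind `use_a` while `load_a` is outstanding, whereas the out-of-order run starts `load_b` during that stall, so the two slow loads overlap.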



Bit-serial ALUs

Bit-parallel ALUs

Pipelined microprocessors

Out-of-order execution

Superscalar execution



Multithreading



Bubbles in the pipeline represent performance loss



Fill bubbles with instructions from other threads
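The idea can be sketched in a few lines of Python. This is a toy model, assuming a single-issue pipeline whose per-cycle activity is a list in which `None` marks a bubble; the thread contents and function names are made up for illustration, and the model ignores any cost of keeping a second thread's state on hand.

```python
def utilization(slots):
    """Fraction of cycles in which the pipeline issued an instruction."""
    return sum(slot is not None for slot in slots) / len(slots)

def fill_bubbles(primary, other_thread):
    """Replace each bubble (None) in `primary` with the next instruction from another thread."""
    spare = iter(other_thread)
    return [slot if slot is not None else next(spare, None) for slot in primary]

if __name__ == "__main__":
    thread_a = ["A1", None, None, "A2", None, "A3"]  # hypothetical stalls
    thread_b = ["B1", "B2", "B3"]
    shared = fill_bubbles(thread_a, thread_b)
    print(shared)
    print(utilization(thread_a), "->", utilization(shared))
```

If the other thread runs out of instructions, the remaining bubbles stay empty, which is one way the technique can fail to pay off.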

What is the cost of multithreading? Is it ever not profitable?



### Improving Hardware Utilization

Limitation of Multithreading

At each cycle, issue instructions from any one thread

Problem: Still have lots of unfilled issue slots

Simultaneous Multithreading (SMT) [Tullsen, Eggers, Levy, ISCA95]

At each cycle, issue instructions from multiple threads
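The difference between the two issue policies can be illustrated by counting filled issue slots. The Python sketch below assumes a hypothetical 4-wide machine and a made-up table of how many instructions each thread has ready in each cycle; it models only slot counting, not any real SMT implementation.

```python
def slots_used_one_thread_per_cycle(ready, width):
    """Multithreading: each cycle, issue only from the thread with the most ready instructions."""
    return sum(min(width, max(per_thread)) for per_thread in ready)

def slots_used_smt(ready, width):
    """SMT: each cycle, issue from all threads together, up to the issue width."""
    return sum(min(width, sum(per_thread)) for per_thread in ready)

if __name__ == "__main__":
    width = 4
    # ready[c][t]: hypothetical number of ready instructions for thread t in cycle c
    ready = [(2, 1), (0, 3), (1, 1), (4, 2)]
    total = width * len(ready)
    print("one thread per cycle:", slots_used_one_thread_per_cycle(ready, width), "of", total)
    print("SMT:                 ", slots_used_smt(ready, width), "of", total)
```

With these made-up numbers, SMT fills slots that a single thread leaves empty, but it cannot exceed the issue width in any cycle.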



What is the cost of SMT?

# VLIW

Very Long Instruction Word

Wide instructions for controlling multiple functional units

ISA explicitly encodes parallelism

Statically scheduled

Reduces power by devoting less hardware to control
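Static scheduling means the compiler, not the hardware, decides which operations share a wide instruction. The Python sketch below is a simplified greedy packer, assuming every operation takes one cycle and any functional unit can run any operation; `pack_bundles` and `pad` are hypothetical names, not part of any real VLIW toolchain.

```python
def pack_bundles(ops, deps, width):
    """Greedily place each op in the earliest bundle after all of its dependences.

    ops:   operation names in program order
    deps:  name -> collection of names it depends on (all earlier in `ops`)
    width: number of operations one wide instruction can hold
    """
    bundle_of = {}  # op name -> index of the bundle it was placed in
    bundles = []
    for op in ops:
        earliest = 1 + max((bundle_of[d] for d in deps.get(op, ())), default=-1)
        b = earliest
        while b < len(bundles) and len(bundles[b]) >= width:
            b += 1  # bundle is full, try the next one
        while len(bundles) <= b:
            bundles.append([])
        bundles[b].append(op)
        bundle_of[op] = b
    return bundles

def pad(bundles, width):
    """Fill unused slots with explicit no-ops, as a fixed-width encoding would require."""
    return [bundle + ["nop"] * (width - len(bundle)) for bundle in bundles]

if __name__ == "__main__":
    ops = ["a", "b", "c", "d", "e"]
    deps = {"c": {"a", "b"}, "d": {"c"}}
    for bundle in pad(pack_bundles(ops, deps, width=2), width=2):
        print(bundle)
```

The parallelism is visible in the output itself: operations in the same bundle are the ones the schedule has declared independent, so no run-time dependence checking is modeled.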



Transmeta TM8000 Core

# Vector Processors

Goal:

Reduce instruction fetch bandwidth

Apply instructions to all elements of vectors simultaneously

e.g., VectorAdd

Alleviates the von Neumann bottleneck
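A rough way to see the fetch-bandwidth saving is to count how many add instructions are fetched for the same work. The Python sketch below assumes a hypothetical vector length of 4 and ignores loop-control instructions; it counts fetches only and does not model actual vector hardware.

```python
def scalar_add(a, b):
    """One add instruction is fetched per element."""
    out, fetched = [], 0
    for x, y in zip(a, b):
        out.append(x + y)
        fetched += 1
    return out, fetched

def vector_add(a, b, vector_length=4):
    """One vector-add instruction is fetched per group of `vector_length` elements."""
    out, fetched = [], 0
    for i in range(0, len(a), vector_length):
        chunk_a = a[i:i + vector_length]
        chunk_b = b[i:i + vector_length]
        out.extend(x + y for x, y in zip(chunk_a, chunk_b))
        fetched += 1
    return out, fetched

if __name__ == "__main__":
    a = list(range(8))
    b = list(range(8))
    print(scalar_add(a, b))
    print(vector_add(a, b))
```

Both functions compute the same sums; only the number of instructions fetched differs, and the gap grows with the vector length.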



Have compilers convert existing programs to vector programs

Vectorizing compilers have been quite successful

But initially, many programs could not be vectorized

Did compilers get better?

No. Through compiler feedback and training, programmers learned how to write code that could be vectorized
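One common reason a loop cannot be vectorized as written is a loop-carried dependence, where each iteration reads a value the previous iteration wrote. The Python sketch below is offered as an illustration of that general idea, not as the specific obstacle the lecture has in mind; it compares running a loop sequentially with running it as if all iterations read their inputs at once, which is how an element-wise vector operation behaves.

```python
def dependent_sequential(a, b):
    """a[i] = a[i-1] + b[i]: each iteration needs the previous iteration's result."""
    a = list(a)
    for i in range(1, len(a)):
        a[i] = a[i - 1] + b[i]
    return a

def dependent_as_vector(a, b):
    """The same statement, but every iteration reads the original values of `a`."""
    if not a:
        return []
    old = list(a)
    return [old[0]] + [old[i - 1] + b[i] for i in range(1, len(old))]

def independent_sequential(a, b):
    """c[i] = a[i] + b[i]: no iteration depends on another."""
    c = []
    for x, y in zip(a, b):
        c.append(x + y)
    return c

def independent_as_vector(a, b):
    return [x + y for x, y in zip(a, b)]

if __name__ == "__main__":
    a, b = [1, 1, 1, 1], [0, 1, 2, 3]
    print(dependent_sequential(a, b), dependent_as_vector(a, b))      # results differ
    print(independent_sequential(a, b), independent_as_vector(a, b))  # results agree
```

When the two versions disagree, a compiler must keep the sequential form; rewriting the computation so that iterations are independent is the kind of change a programmer can make once the compiler reports the problem.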

Can we follow a similar path for parallelizing compilers?

Parallelization is more difficult.

The question is not just "is a loop parallelizable?" There are many more tradeoffs to consider.

#### Increasing granularity of parallelism

Bit-serial ALUs

Bit-parallel ALUs

Pipelined microprocessors

Out-of-order execution

Superscalar execution

Multithreading

SMT

VLIW

Vector processors

Multi-cores

Which of these expose parallelism through the ISA?



For years, the ISA was able to hide hardware parallelism, so programmers could just write sequential code

As the granularity of parallelism has increased, it's become difficult to hide the parallelism

Why is the granularity increasing?



Is hidden complexity always good?