

# <u>Administration</u>

#### • Instructor:

- Keshav Pingali (Professor, CS, ECE & Oden)
  - 4.126 Peter O'Donnell Building (POB)
  - Email: pingali@cs.utexas.edu

#### • TA:

- Dani Wang (Graduate student, CS)
  - Email: daniw@utexas.edu


# <u>Prerequisites</u>

- Basic computer architecture course
  - e.g., program counter (PC), ALU, cache, memory, instruction-level parallelism (ILP)
- Basic calculus and linear algebra
  - differential equations and matrix operations
- Software maturity
  - assignments will be in C/C++ on Linux computers
  - ability to write medium-sized programs (~1000 lines)
- Self-motivation
  - willingness to experiment with systems

# <u>Coursework</u>

- 6-7 programming projects
  - These will be more or less evenly spaced through the semester
  - Some projects will require the use of Intel performance analysis tools
- One mid-semester exam
  - Date: TBA
- Final exam


### Textbook for course

There is no official textbook for the course.

This book is a useful reference:

"Parallel Programming in C with MPI and OpenMP", Michael Quinn, McGraw-Hill Publishers. ISBN 0-07-282256-2

Lots of material is also available on the web.


### What this course is not about

- This is not a clever hacks course
  - We are interested in general scientific principles for performance programming, not in squeezing out every last cycle for somebody's favorite program
- This is not a tools/libraries course
  - We will use several tools (Intel VTune, Advisor) and libraries (MPI), but for us they are a means to an end and not ends in themselves















### <u>Caches: typical latency numbers</u> (today)

| Operation | Latency | Cycles / notes |
|---|---|---|
| L1 cache reference/hit | 1.5 ns | 4 cycles |
| Floating-point add/mult/FMA operation | 1.5 ns | 4 cycles |
| L2 cache reference/hit | 5 ns | 12-17 cycles |
| L3 cache hit | 16-40 ns | 40-300 cycles |
| 256MB main memory reference | 75-120 ns | TinyMemBench on "Broadwell" E5-2690v4 |
| Read 1MB sequentially from disk | 5,000,000 ns = 5,000 us = 5 ms | ~200MB/sec hard disk (seek time would be additional latency) |
| Random disk access (seek+rotation) | 10,000,000 ns = 10,000 us = 10 ms | |
| Send packet CA->Netherlands->CA | 150,000,000 ns = 150,000 us = 150 ms | |

Locality is important.

From: Latency numbers every HPC programmer should know







- The gains from performance optimization are limited
- The unoptimized portions of the program become the bottleneck
- Analogy: suppose I go from Austin to Houston at 60 mph, and return "infinitely" fast. What is my average speed?
  - Answer: 120 mph, not infinity (twice the distance is covered in the time taken by the outbound leg alone)




- In general, a program will have both optimized and unoptimized portions
- Suppose the program has N operations
  - r\*N operations in the optimized portion
  - (1-r)\*N operations in the unoptimized portion
- Assume
  - The unoptimized portion requires one time unit per operation
  - The optimized portion can be executed infinitely fast, so it takes zero time to execute
- Speed-up:

$$\frac{\text{Original execution time}}{\text{Optimized execution time}} = \frac{N}{(1-r) \cdot N} = \frac{1}{1-r}$$



Unless most of your program is performance-optimized, you won't see much benefit.





### (1) Using the additional transistors: old ideas have run out of steam

#### • More cache

- More cache buys performance only until the working set of the program fits in cache

#### • Deeper pipeline

- A deeper pipeline buys frequency at the expense of an increased branch mis-prediction penalty
- Deeper pipelines => higher clock frequency => more power

#### • More functional units/vector units

- Diminishing returns for adding more units

#### • Wider data paths

- Increases bandwidth between functional units in a core, but we now have comprehensive 64-bit designs


### **Clusters and data-centers**



TACC Stampede 2 cluster

- 4,200 Intel Knights Landing nodes, each with 68 cores
- 1,736 Intel Xeon Skylake nodes, each with 48 cores
- 100 Gb/sec Intel Omni-Path network with a fat tree topology employing six core switches


### Software challenges post-2005

- Exploiting parallelism: keep the cores busy
  - Node-level and thread-level parallelism
  - Load-balancing
- Exploiting memory hierarchy
  - Spatial and temporal locality
  - Avoid sharing data with other cores as far as possible


- New kinds of bugs:
  - race conditions, deadlocks


### Parallel programming

#### • Shared-memory programming

- Architecture: processor has some number of cores (e.g., Intel Skylake has up to 18 cores depending on the model)
- Application program is decomposed into a number of threads, which run on these cores
- Threads communicate by reading and writing memory locations
- We will study pThreads and OpenMP for shared-memory programming

#### • Distributed-memory programming

- Architecture: network of machines (e.g., Stampede 2: 4,200 KNL nodes)
- Application program and data structures are partitioned into processes, which run on machines
- Processes communicate by sending and receiving messages since they have no memory locations in common
- We will study MPI for distributed-memory programming


### Major Lecture Topics

- Applications
  - Parallelism and locality in important algorithms
- Locality
  - Memory hierarchy, code and data transformations
- Vector parallelism
  - Vectorizing compilers
- Shared-memory parallelism
  - Multicore architectures, pThreads, OpenMP, TBB
- Distributed-memory parallelism
  - Clusters, MPI
- GPUs
  - CUDA