### **ATLAS Library Generator**

### Lecture based on these papers:

- "A comparison of empirical and model-driven optimization" Yotov et al, PLDI 2003
- "Is search really necessary to generate high-performance BLAS' Yotov et al, Proceedings of IEEE, 93(2), 2005
- "Think globally, search locally" Yotov et al., ICS 2005

### Overview

- ATLAS: portable BLAS generator
  - □ Implemented by Dongarra (UTK), Demmel(UCB) et al.
  - We will focus on MMM
- Importance: popularized the idea of auto-tuning
  - □ Generate-and-test
  - Program generator to generate program variants
  - Test performance of variant by running program
- Program variants in ATLAS
  - Iterative blocked kernels for different levels of memory hierarchy
- Auto-tuning useful for library generators
  - Large upfront cost for generate-and-test

### **BLAS**

• Let us focus on MMM:

```
for (int i = 0; i < M; i++)
for (int j = 0; j < N; j++)
for (int k = 0; k < K; k++)
C[i][j] += A[i][k]*B[k][j]</pre>
```

- Properties
  - □ Very good reuse: O(N²) data, O(N³) computation
  - Many optimization opportunities
    - Few "real" dependencies
  - Will run poorly on modern machines
    - Poor use of cache and registers
    - Poor use of processor pipelines
- Key optimization
  - □ Blocking/tiling to improve temporal locality

### Why blocking?

```
for bi = 1,N,B
for bj = 1,N,B
for bk = 1,N,B
for i = bi, min(bi+B-1,N)
for j = bj, min(bj+B-1,N)
for k = bk, min(bk+B-1,N)
    y(i) = y(i) + A(i,j)*x(j)
```



- Assume blocks fit in cache during block computation
- # of cache misses for block data = 3B2/L (L: line size)
- # of block computations = (N/B)3
- Total number of misses =  $(N/B)^3 * (3B^2/L) = 3N^3/BL$
- High-level picture:
  - number of cache misses decreases with block size as long as working set of block computation fits in cache





### High-level picture of high-performance MMM code

- Block the code for each level of memory hierarchy
  - □ Registers: requires loop unrolling
  - □ L1 cache
  - **-** .....
- Choose block sizes at each level using the theory described previously
  - Useful optimization: choose block size at level
     L+1 to be multiple of the block size at level L



### **ATLAS**

- Library generator for MMM and other BLAS
- Blocks only for registers and L1 cache
- Uses search to determine block sizes, rather than the analytical formulas we used
  - Search takes more time, but we do it once when library is produced
- Let us study structure of ATLAS in little more detail

# Study in Yotov et al. paper Original ATLAS Infrastructure OFLOPS OFLOPS OFLOPS Compile, Execute, Measure MindMINUTU offects (MM/Case) MindMINUTU (MM/Case) Model-Based ATLAS Infrastructure Option (MM/Case) Model-Based ATLAS Infrastructure Option (MM/Case) Model offect (MM/Case) ATLAS MM (MN/M) (MM/Case) ATLAS MM (MN/M) (MM/Case) ATLAS MM (MN/M) (MM/Case) Option (MM/Case) Minimm (MM/Case) Option (MM/Case)

### Parameters in ATLAS code

- Cache-level blocking (tiling)
  - Atlas blocks only for L1 cache
- NB: L1 cache time size
- Register-level blocking
  - Important to hold array values in registers
  - MU,NU: register tile size
- Filling the processor pipeline
- Unroll and schedule operationsLatency, xFetch: scheduling parameters
- Versioning
  - Dynamically decide which way to compute
- Back-end compiler optimizations: nothing to do with ATLAS
  - Scalar Optimizations
  - Instruction Scheduling

### Cache-level blocking (tiling)

- Tiling in ATLAS
  - Only square tiles (NBxNBxNB)
  - Working set of tile fits in L1
  - Tiles are usually copied to continuous storage
  - Special "clean-up" code generated for boundaries
- Mini-MMM

for (int j = 0; j < NB; j++)
 for (int i = 0; i < NB; l++)
 for (int k = 0; k < NB; k++)
 C[i][j] += A[i][k] \* B[k][j]</pre>









### Search Strategy

- Multi-dimensional optimization problem:
  - Independent parameters: NB,MU,NU,KU,...
  - Dependent variable: GFlops
  - Function from parameters to variables is given implicitly; can be evaluated repeatedly
- One optimization strategy: orthogonal line search
  - Optimize along one dimension at a time, using reference values for parameters not yet optimized
  - $\ensuremath{\text{\fontfamily Not}}$  Not guaranteed to find optimal point, but might come close

### Find Best NB

- Search in following range
  - □ 16 <= NB <= 80
  - □ NB<sup>2</sup> <= L1Size
- In this search, use simple estimates for other parameters
  - □ (eg) KU: Test each candidate for
    - Full K unrolling (KU = NB)
    - No K unrolling (KU = 1)



## Modeling for Optimization Parameters ■ Optimization parameters ■ NB ■ Hierarchy of Models (later) ■ MU, NU ■ MU\*NU+MU+NU+Latency ≤ NR ■ KU ■ maximize subject to L1 Instruction Cache ■ Latency ■ [(L + 1)/2] ■ MulAdd ■ hardware parameter ■ XFetch ■ set to 2









### Eliminating performance gaps Think globally, search locally Gap between model-based optimization and empirical optimization can be eliminated by Local search: for small performance gaps in neighborhood of model-predicted values Model refinement: for large performance gaps must be done manually (future) machine learning: learn new models automatically Model-based optimization and empirical

optimization are not in conflict





### Itanium diagnosis and solution

- Memory hierarchyL1 data cache: 16 KB

  - □ L2 cache: 256 KB
- □ L3 cache: 3 MB
- Diagnosis:
  - Model tiles for L1 cache
  - On Itanium, FP values not cached in L1 cache!
  - Performance gap goes away if we model for L2 cache (NB = 105)
  - Obtain even better performance if we model for L3 cache (NB = 360, 4.6 GFlops)
- Problem:
  - □ Tiling for L2 or L3 may be better than tiling for L1
  - How do we determine which cache level to tile for??
  - Our solution: model refinement + a little search
  - Determine tile sizes for all cache levels
  - Choose between them empirically



### Opteron diagnosis and solution

- Opteron characteristics
  - Small number of logical registers
  - Out-of-order issue
  - Register renaming
- For such processors, it is better to
  - let hardware take care of scheduling dependent instructions,
  - use logical registers to implement a bigger register tile.
- x86 has 8 logical registers
  - $\neg$  register tiles must be of the form (x,1) or (1,x)







## Things to think about What is the space of program variants? Space must include the optimal point or at least points close to it in performance Question: what kinds of MMM implementations are not explored by ATLAS? What is the search strategy and is it guaranteed to find the optimal (or at least very good) point? ATLAS uses orthogonal line search One general problem: you spend much more time executing nonoptimal program variants than the optimal one! Some notion of importance sampling might be useful if search time matters What were the key approximations made in the analytical performance model? Need to look at model used for each parameter Are these approximations reasonable?