## Direct Addressed Caches for Reduced Power Consumption

# Emmett Witchel Sam Larsen C. Scott Ananian Krste Asanović MIT Lab for Computer Science

# **The Domain**

- n We are attempting to reduce power consumed by the caches and memory system.
  - Not discs or screens.
  - o 16% of processor + cache energy for StrongARM is dissipated in the data cache.
- n We focus on the data cache. The instruction cache is amenable to hardware-only techniques.
- n We are interested in power optimizations that are not just existing speed optimizations.
- n Exploit compile time knowledge to avoid runtime work.
   o Partially evaluate a program for certain hardware resources.
- n We show how software can eliminate cache tag checks which saves energy.

# **The First Problem — Cache Tags**

| Direct Mapped                                           | Set-Associative                                                                        | CAM-tag                                                                                        |
|---------------------------------------------------------|----------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------|
| Each memory<br>location has a<br>unique home.           | Each memory<br>location has a<br>small number<br>(e.g., 4) homes.                      | Each memory<br>location can be<br>anywhere in a<br>sub bank.                                   |
| High miss rates<br>which means<br>high energy<br>usage. | Moderate miss<br>rates.                                                                | Lowest miss rates.                                                                             |
| Individual<br>accesses are low<br>power.                | Individual<br>accesses are high<br>power because of<br>multiple tag and<br>data reads. | Individual<br>accesses are<br>moderate power.<br>Most of the<br>energy is in the<br>tag check. |

n Both set-associative and CAM-tag caches spend the majority of their energy in the tag check.

### The Solution — Pass Software Information To Hardware

- n The compiler often knows when the program is accessing the same piece of memory. Don't check the cache tags for the second access.
- **n** HW challenge make this path low power.
- **n** SW challenge find the opportunities for use.
  - Two compiler algorithms for two languages (C and Java).
- Interface challenge minimize ISA changes, don't disrupt HW, don't expose too much HW detail.
  - New flavors of memory ops are a common ISA change.
- n Security challenge Protect process data from other processes.
  - **o** Snoop on evicts, detect invalid state early in pipeline

### Direct Addressed CAM Tag Cache Virtually Indexed & Tagged



# **Direct Addressing**



### Spill Code Using Direct Address Registers

| n <b>Old code</b> |       | n Transformed code |               |                     |
|-------------------|-------|--------------------|---------------|---------------------|
| o <b>subu</b>     | \$sp, | 64                 | o <b>subu</b> | \$sp, 64            |
| 0 <b>SW</b>       | \$ra, | 60(\$sp)           | $\circ$ swlda | \$ra,60(\$sp),\$da0 |
| 0 <b>SW</b>       | \$fp, | 56(\$sp)           | o <b>swda</b> | \$fp,56(\$sp),\$da0 |
| 0 <b>SW</b>       | \$s0, | 52(\$sp)           | o <b>swda</b> | \$s0,52(\$sp),\$da0 |

## ${\rm n}$ One tag check per line used for spilling.

## **n** It is a simple transformation.

- Similar to load/store multiple on StrongARM
   Ld/st multiple is a limited model can't handle read-modify-write.
- Hardware only schemes capture many references, but add latency.

# **Compiler Algorithm (C)**



- S Find dominance relationship.
  - S E.g., Read of P[1] in A dominates read of P[0] in D.
- S Determine distance.
   S P[0] is offset -4 from P[1].
   S If dist == 0, done.

### **S** Determine alignment.

- Stack & static data are aligned by our backend.
- S Loop unrolling to increase alignment.
- S Eliminate tag check in the read of P[0].

## **C** Compiler Infrastructure

- **S** We use SUIF, with a C backend.
- **S** Loop unrolling to increase aligned references.
- S Distance information from memory object offset.
   S Use simple, local information for aliases.
- **§** Profile information to set pre-loop break condition.

```
for(i=0; i<N; i++) {
    A[i] = 0;
    if(&A[I] % line_size == 0)
        break;
    A[I] = 0;
    }
    for(; i<N; i += 4) {
        A[i + 0] = 0; A[i + 1] = 0;
        A[i + 2] = 0; A[i + 3] = 0;
        A[i + 2] = 0; A[i + 3] = 0;
    }
}</pre>
```

# **Results — C Implementation**



#### Mediabench

- **n** Data cache energy reduction 8.7 40%.
- n Function entry/exit code not included expect greater savings.

# **Java Compiler Infrastructure**

- § FLEX is a bytecode to native compiler developed at MIT.
- **S We wrote a MIPS back end** 
  - **§ Modified GNU as to accept new memory operations.**
  - **S** Modified ISA simulator to track DAR state.
- **S** Loops are unrolled.
- S Object type is tracked for additional opportunity.
  - S Allows low level optimization of access to e.g., hash code.

## **Results — Java Implementation**



- n One big advantage function entry/exit code was transformed.
  - Calling convention modified.
- n Data cache power savings 26-31%
- n No profile feedback.

### Results — Comparison with L0 Cache



- n DARs usually tie L0 or exceed it.
- n When L0 exceeds DARs, DARs help L0.

Mediabench

## **Related Work**

- n Fisher & Ellis used loop unrolling to reduce memory bank conflicts.
  - Barua expanded the work with Modulo Unrolling.
- n Burd and Kin have proposed hardware L0 caches.
- n Andras' FlexCache does software wayprediction to software controlled array of tag registers.

# Acknowledgements

n Mark Hampton — GNU assembler, simulator.

n Ronny Krashinsky — Energy modeling.

- n Sam Larsen SUIF compiler.
- n C. Scott Ananian Java compiler (FLEX)
- n DARPA, NSF, Infineon