# Chapter 2: Memory Hierarchy Design – Part 2

- Introduction (Section 2.1, Appendix B)
- Caches
  - Review of basics (Section 2.1, Appendix B)
  - Advanced methods (Section 2.3)
- Main Memory
- Virtual Memory

#### Fundamental Cache Parameters

Cache Size

How large should the cache be?

Block Size

What is the smallest unit represented in the cache?

Associativity

How many entries must be searched for a given address?


# Cache Size

Cache size is the total capacity of the cache.

- Bigger caches exploit temporal locality better than smaller caches
- But bigger caches are not always better. Why?

# Block Size

Block (line) size is the data size that is both (a) associated with an address tag, and (b) transferred to/from memory.

- Advanced caches allow (a) and (b) to differ
- Problem with blocks that are too small
- Problem with blocks that are too large

# Set Associativity

Partition cache block frames & memory blocks in equivalence classes (usually w/ bit selection)

Number of sets, s, is the number of classes

Associativity (set size), n, is the number of block frames per class

Number of block frames in the cache is  $\boldsymbol{s}\times\boldsymbol{n}$ 

Cache Lookup (assuming read hit)

- Select set
- Associatively compare stored tags to incoming tag
- Route data to processor
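The lookup steps can be sketched in a few lines of Python. This is a minimal sketch, assuming bit selection with a power-of-two block size and number of sets. The names `SetAssociativeCache`, `split_address`, `lookup`, and `fill` are illustrative and do not come from the slides. Only tags are modeled, not data.

```python
class SetAssociativeCache:
    """Toy model of s sets with n block frames each (tags only, no data)."""

    def __init__(self, num_sets, assoc, block_bytes):
        # Bit selection requires power-of-two sizes.
        assert num_sets & (num_sets - 1) == 0
        assert block_bytes & (block_bytes - 1) == 0
        self.num_sets = num_sets
        self.assoc = assoc
        self.offset_bits = block_bytes.bit_length() - 1
        self.index_bits = num_sets.bit_length() - 1
        # Each set holds up to `assoc` tags; None marks an empty frame.
        self.sets = [[None] * assoc for _ in range(num_sets)]

    def split_address(self, addr):
        """Return (tag, set_index, block_offset) using bit selection."""
        offset = addr & ((1 << self.offset_bits) - 1)
        index = (addr >> self.offset_bits) & (self.num_sets - 1)
        tag = addr >> (self.offset_bits + self.index_bits)
        return tag, index, offset

    def lookup(self, addr):
        """Select the set, then compare the incoming tag with every stored tag."""
        tag, index, _ = self.split_address(addr)
        return tag in self.sets[index]

    def fill(self, addr, way):
        """Place the block containing `addr` in the given way of its set."""
        tag, index, _ = self.split_address(addr)
        self.sets[index][way] = tag


cache = SetAssociativeCache(num_sets=64, assoc=4, block_bytes=32)
cache.fill(0x12345, way=0)
print(cache.split_address(0x12345))                    # (tag, set index, offset)
print(cache.lookup(0x12345), cache.lookup(0x92345))    # same set, different tag
```

With 64 sets of 4 frames, the cache holds 64 × 4 = 256 block frames, and a lookup compares at most 4 tags.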


# Associativity, cont.

Typical values for associativity

- 1: direct-mapped
- n = 2, 4, 8, 16: n-way set-associative
- All blocks: fully-associative

Larger associativity

Smaller associativity

Advanced methods (Section 2.3):

- Evaluation Methods
- Two Levels of Cache
- Getting Benefits of Associativity without Penalizing Hit Time
- Reducing Miss Cost to Processor
- Lockup-Free Caches
- Beyond Simple Blocks
- Prefetching
- Pipelining and Banking for Higher Bandwidth
- Software Restructuring
- Handling Writes














# Average Memory Access Time and Performance
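As a reminder of the standard definitions, average memory access time is hit time plus miss rate times miss penalty, and memory stalls add to the base CPI. The sketch below is illustrative only: the function names and the numbers are assumptions, not values from the slides.

```python
def amat(hit_time, miss_rate, miss_penalty):
    """Average memory access time, in the same units as hit_time and miss_penalty."""
    return hit_time + miss_rate * miss_penalty


def cpi_with_memory_stalls(base_cpi, accesses_per_instr, miss_rate, miss_penalty):
    """CPI once memory stall cycles per instruction are added to the base CPI."""
    return base_cpi + accesses_per_instr * miss_rate * miss_penalty


# Illustrative numbers: 1-cycle hit, 5% miss rate, 100-cycle miss penalty.
print(amat(hit_time=1, miss_rate=0.05, miss_penalty=100))
print(cpi_with_memory_stalls(base_cpi=1.0, accesses_per_instr=1.5,
                             miss_rate=0.05, miss_penalty=100))
```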















#### Multilevel Inclusion, cont.

Multilevel inclusion takes effort to maintain (typically L1/L2 cache line sizes are different)

- Make L2 cache have bits or pointers giving L1 contents
- Invalidate from L1 before replacing block from L2
- Number of pointers per L2 block is (L2 block size / L1 block size)
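The pointer count and the invalidation step can be illustrated with a short sketch. It assumes the L2 block size is a multiple of the L1 block size and that blocks are aligned to their own size. The function names and the example sizes (128-byte L2 blocks, 32-byte L1 blocks) are hypothetical.

```python
def pointers_per_l2_block(l2_block_bytes, l1_block_bytes):
    """Number of L1 blocks that one L2 block can cover."""
    assert l2_block_bytes % l1_block_bytes == 0
    return l2_block_bytes // l1_block_bytes


def l1_blocks_to_invalidate(l2_block_addr, l2_block_bytes, l1_block_bytes):
    """Starting byte addresses of the L1 blocks inside the L2 block being replaced."""
    base = l2_block_addr - (l2_block_addr % l2_block_bytes)  # align to the L2 block
    count = pointers_per_l2_block(l2_block_bytes, l1_block_bytes)
    return [base + k * l1_block_bytes for k in range(count)]


print(pointers_per_l2_block(128, 32))
print([hex(a) for a in l1_blocks_to_invalidate(0x1040, 128, 32)])
```

Before the L2 block is replaced, each listed L1 block that is actually present in L1 would be invalidated, which keeps L1 a subset of L2.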

# Multilevel Exclusion

What if the L2 cache is only slightly larger than L1?

Multilevel exclusion => A line in L1 is never in L2 (AMD Athlon)


#### Level Two Cache Design

L1 cache design is similar to single-level cache design when main memories were "faster".

Apply previous experience to L2 cache design?

What is "miss ratio"?

- Global: L2 misses after L1 / references
- Local: L2 misses after L1 / L1 misses

BUT: L2 caches (several MB) are bigger than the caches covered by L1 experience

BUT: L2 affects miss penalty, L1 affects clock rate
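The two miss-ratio definitions differ only in the denominator. The following sketch computes both from raw event counts; the function name and the counts are illustrative assumptions.

```python
def l2_miss_ratios(references, l1_misses, l2_misses):
    """Return (global, local) L2 miss ratios from raw event counts."""
    global_ratio = l2_misses / references   # relative to all processor references
    local_ratio = l2_misses / l1_misses     # relative to references that reach L2
    return global_ratio, local_ratio


# Illustrative counts: 1000 references, 40 miss in L1, 20 of those also miss in L2.
global_ratio, local_ratio = l2_miss_ratios(references=1000, l1_misses=40, l2_misses=20)
print(global_ratio, local_ratio)
```

The global ratio equals the L1 miss ratio times the local ratio, so a local ratio that looks poor can still correspond to a small global ratio.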

# Benefits of Associativity W/O Paying Hit Time

- Victim Caches
- Pseudo-Associative Caches
- Way Prediction







## Way Prediction

- Keep extra bits in cache to predict the "way" of the next access
- Access predicted way first
- If miss, access other ways like in set-associative caches

Fast hit when prediction is correct
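A toy model of one set makes the fast and slow paths concrete. This is a sketch under stated assumptions: the predictor simply remembers the way of the most recent hit, which is one possible policy and not necessarily the one intended by the slides. The class name `WayPredictedSet` is hypothetical.

```python
class WayPredictedSet:
    """One cache set with a single predicted way (tags only)."""

    def __init__(self, assoc):
        self.tags = [None] * assoc
        self.predicted_way = 0

    def access(self, tag):
        """Return 'fast hit', 'slow hit', or 'miss' for the incoming tag."""
        # Step 1: probe only the predicted way.
        if self.tags[self.predicted_way] == tag:
            return "fast hit"
        # Step 2: fall back to checking all ways, as in a set-associative cache.
        for way, stored in enumerate(self.tags):
            if stored == tag:
                self.predicted_way = way  # assumed policy: predict the last hit way
                return "slow hit"
        return "miss"


s = WayPredictedSet(assoc=4)
s.tags = ["A", "B", "C", "D"]
print(s.access("A"), s.access("C"), s.access("C"), s.access("E"))
```

A correct prediction behaves like a direct-mapped access, while a misprediction pays for the second probe.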




## Reducing Miss Cost, cont.

$t_{\text{memory}} = t_{\text{access}} + B \times t_{\text{transfer}} = M + B \times \tfrac{1}{2}$

$\Rightarrow$ the whole block is loaded before data is returned

If main memory returned the referenced word first (requested-word-first) and the cache returned it to the processor before loading it into the cache data array (fetch-bypass, early restart), then

$t_{\text{memory}} = t_{\text{access}} + W \times t_{\text{transfer}} = M + W \times \tfrac{1}{2}$

where $W$ is the memory bus width in words.

BUT ...
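The two expressions can be compared numerically. In this sketch the values M = 10, B = 8, and W = 2 are illustrative assumptions, with B and W taken to be in the same units and a transfer time of 1/2 cycle per unit as in the formulas.

```python
def t_whole_block(M, B, t_transfer=0.5):
    """Latency when the whole block is loaded before data is returned."""
    return M + B * t_transfer


def t_requested_word_first(M, W, t_transfer=0.5):
    """Latency when the first bus transfer is forwarded straight to the processor."""
    return M + W * t_transfer


# Illustrative values only: access time 10, block of 8, bus width of 2.
print(t_whole_block(M=10, B=8))
print(t_requested_word_first(M=10, W=2))
```

The saving grows with block size, since the processor no longer waits for the remaining B - W units of the block.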










#### Beyond Simple Blocks

Break block size into

- Address block: associated with tag
- Transfer block: transferred to/from memory

Larger address blocks

- Decrease address tag overhead
- But allow fewer blocks to be resident

Larger transfer blocks

- Exploit spatial locality
- Amortize memory latency
- But take longer to load
- But replace more data already cached
- But cause unnecessary traffic
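The tag-overhead tradeoff for address blocks can be made concrete with a small calculation. This is a sketch that assumes one valid bit per transfer block within each address block, and a fixed tag width (for a given cache size and associativity the tag width does not depend on block size). The function name and the 16 KB, 20-bit-tag numbers are hypothetical.

```python
def tag_storage_bits(cache_bytes, address_block_bytes, transfer_block_bytes, tag_bits):
    """Total tag plus valid-bit storage when each address block holds
    several transfer blocks (assumed: one valid bit per transfer block)."""
    assert address_block_bytes % transfer_block_bytes == 0
    num_address_blocks = cache_bytes // address_block_bytes
    valid_bits = address_block_bytes // transfer_block_bytes
    return num_address_blocks * (tag_bits + valid_bits)


# Same 16 KB cache and 16-byte transfer blocks, two address block sizes.
print(tag_storage_bits(16 * 1024, 16, 16, tag_bits=20))   # 1024 address blocks
print(tag_storage_bits(16 * 1024, 64, 16, tag_bits=20))   # 256 address blocks
```

The larger address block cuts tag storage substantially, but only 256 distinct address blocks can be resident instead of 1024.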










Sarita Adve

# Software Prefetching

Use compiler to

- Prefetch early (e.g., one loop iteration ahead)
- Prefetch accurately






# Software Prefetching Example

```
for (i = 0; i < N-1; i++) {
    ... = A[i];
    /* computation */
}
```

Assume each iteration takes 10 cycles with a hit, memory latency is 100 cycles, cache block is two words.

Changes?

```
for (i = 0; i < N-1; i++) {
    prefetch(A[i+10]);
    ... = A[i];
    /* computation */
}
```
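The prefetch distance of 10 in the example follows from the stated numbers: a 100-cycle latency divided by 10 cycles per iteration. The sketch below reproduces that arithmetic and also counts how many prefetches are strictly needed when each block holds two words. It assumes each `A[i]` is one word and that elements are contiguous. The function names are hypothetical.

```python
import math


def prefetch_distance(memory_latency, cycles_per_iteration):
    """Iterations ahead a prefetch must be issued to hide the full latency."""
    return math.ceil(memory_latency / cycles_per_iteration)


def prefetches_needed(iterations, words_per_block):
    """One prefetch per cache block suffices when accesses walk consecutive words."""
    return math.ceil(iterations / words_per_block)


print(prefetch_distance(memory_latency=100, cycles_per_iteration=10))
print(prefetches_needed(iterations=1000, words_per_block=2))
```

Under these assumptions, issuing a prefetch on every iteration fetches each block twice, so unrolling the loop by two and prefetching once per block is one possible refinement.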

# Software Restructuring

Restructure code so that all operations on a cache block are done before moving on to the next block

```
do i = 1 to rows
    do j = 1 to cols
        sum = sum + x[i,j]
```

What is the cache behavior?



# Software Restructuring (Cont.)

```
do i = 1 to rows
    do j = 1 to cols
        sum = sum + x[i,j]
```

Column major order in memory

Code access pattern

Better code??

Called loop interchange

Many such optimizations possible (merging, fusion, blocking)
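The effect of loop interchange can be seen by counting misses in a toy cache model. The sketch assumes a small direct-mapped cache, column-major storage of `x`, and 0-based indices instead of the 1-based indices of the pseudocode. The function name and all sizes are illustrative.

```python
def count_misses(accesses, rows, block_words, num_frames):
    """Count misses in a direct-mapped cache for x stored in column-major order."""
    frames = [None] * num_frames
    misses = 0
    for i, j in accesses:
        word_addr = j * rows + i          # column-major: a column is contiguous
        block = word_addr // block_words
        frame = block % num_frames
        if frames[frame] != block:
            frames[frame] = block
            misses += 1
    return misses


rows = cols = 64
original = [(i, j) for i in range(rows) for j in range(cols)]      # j innermost
interchanged = [(i, j) for j in range(cols) for i in range(rows)]  # i innermost
print(count_misses(original, rows, block_words=4, num_frames=16))
print(count_misses(interchanged, rows, block_words=4, num_frames=16))
```

With `j` innermost, consecutive accesses are a full column apart and every access misses in this model. With `i` innermost, accesses walk through each block before moving on, so only one access in four misses.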



# Handling Writes - Pipelining

Writing into a writeback cache

- Read tags (1 cycle)
- Write data (1 cycle)

Key observation

- Data RAMs unused during tag read
- Could complete a previous write

Add a special "Cache Write Buffer" (CWB)

- During tag check, write data and address to CWB
- If miss, handle in normal fashion
- If hit, written data stays in CWB
- When data RAMs are free (e.g., next write), store contents of CWB in data RAMs
- Cache reads must check CWB (bypass)

Used in VAX 8800
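The CWB behavior on hits can be modeled in a few lines. This is a simplified sketch, not a description of the VAX 8800 design: it assumes a one-entry CWB, models only hits, and drains the pending write at the next write. The class and method names are hypothetical.

```python
class CacheWithWriteBuffer:
    """Toy writeback-cache data array with a one-entry Cache Write Buffer (CWB)."""

    def __init__(self):
        self.data_ram = {}   # address -> value held in the data RAMs
        self.cwb = None      # pending (address, value), or None

    def write_hit(self, addr, value):
        # The data RAMs are idle during this write's tag check, so the
        # previous write drains from the CWB into the data RAMs now.
        if self.cwb is not None:
            pending_addr, pending_value = self.cwb
            self.data_ram[pending_addr] = pending_value
        # The new write waits in the CWB.
        self.cwb = (addr, value)

    def read_hit(self, addr):
        # Reads must check the CWB first (bypass), since it may hold newer data.
        if self.cwb is not None and self.cwb[0] == addr:
            return self.cwb[1]
        return self.data_ram[addr]


cache = CacheWithWriteBuffer()
cache.data_ram[0x10] = 1
cache.write_hit(0x10, 2)
print(cache.read_hit(0x10), cache.data_ram[0x10])   # CWB value vs. stale RAM value
cache.write_hit(0x20, 3)
print(cache.data_ram[0x10])                         # earlier write has drained
```

Each write occupies the data RAMs for one cycle that overlaps with the next write's tag check, so back-to-back writes can proceed at one per cycle in this model.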



# Handling Writes - Writeback Buffers

Writeback caches need buffers too

- 10-20% of all blocks are written back
- 10-20% increase in miss penalty without buffer

On a miss

1. Initiate fetch for requested block
2. Copy dirty block into writeback buffer
3. Copy requested block into cache, resume CPU
4. Now write dirty block back to memory

Usually only need 1 or 2 writeback buffers
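The 10-20% figure can be checked with a rough model. The sketch assumes that writing a dirty block back takes about as long as fetching a block, and that without a buffer the writeback is fully exposed to the CPU. The function name and cycle counts are illustrative.

```python
def miss_penalty(fetch_cycles, writeback_cycles, dirty_fraction, has_buffer):
    """Average cycles the CPU waits on a miss."""
    if has_buffer:
        # The dirty block is parked in the buffer and written back after the CPU resumes.
        return fetch_cycles
    # Without a buffer, the dirty block is written back while the CPU waits.
    return fetch_cycles + dirty_fraction * writeback_cycles


with_buffer = miss_penalty(100, 100, dirty_fraction=0.15, has_buffer=True)
without_buffer = miss_penalty(100, 100, dirty_fraction=0.15, has_buffer=False)
print(without_buffer / with_buffer)
```

With 15% of replaced blocks dirty, the average miss penalty without a buffer is about 15% higher, which falls within the 10-20% range quoted above.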