# Chapter 5: Multiprocessors (Thread-Level Parallelism) – Part 2

- Introduction
  - What is a parallel or multiprocessor system?
  - Why parallel architecture?
  - Performance potential
  - Flynn classification
- Communication models
- Architectures
  - Centralized shared-memory
  - Distributed shared-memory
- Parallel programming
- Synchronization
- Memory consistency models

# Memory Consistency Model - Motivation

Example shared-memory program

Initially all locations = 0

| Processor 1 | Processor 2           |
|-------------|-----------------------|
| Data = 23   | while (Flag != 1) {;} |
| Flag = 1    | ... = Data            |

Execution (only shared-memory operations)

| Processor 1     | Processor 2    |
|-----------------|----------------|
| Write, Data, 23 |                |
| Write, Flag, 1  |                |
|                 | Read, Flag, 1  |
|                 | Read, Data, \_ |

### Memory Consistency Model: Definition

Memory consistency model

- Order in which memory operations will appear to execute
  - $\Rightarrow$ What value can a read return?
- Affects ease-of-programming and performance

### The Uniprocessor Model

- Program text defines total order = program order
- Uniprocessor model
  - Memory operations appear to execute one-at-a-time in program order
  - $\Rightarrow$ Read returns value of last write
- BUT uniprocessor hardware
  - Overlap, reorder operations
- Model maintained as long as
  - maintain control and data dependences
- $\Rightarrow$ Easy to use + high performance
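The dependence condition can be illustrated with a toy sketch. The three-operation program and the `run` helper below are hypothetical, chosen only to show that swapping independent operations leaves the result unchanged while swapping data-dependent ones does not.

```python
def run(ops):
    """Execute (destination, function-of-state) operations one at a time, in list order."""
    state = {"a": 0, "b": 0, "c": 0}
    for dst, compute in ops:
        state[dst] = compute(state)
    return state

# Hypothetical program: a = 1; b = 2; c = a + b
program = [
    ("a", lambda s: 1),
    ("b", lambda s: 2),
    ("c", lambda s: s["a"] + s["b"]),  # data-dependent on both earlier writes
]

# Swapping the two independent writes keeps every dependence intact.
independent_swapped = [program[1], program[0], program[2]]
# Moving the read of b ahead of the write to b breaks a data dependence.
dependent_swapped = [program[0], program[2], program[1]]

print(run(program) == run(independent_swapped))  # True
print(run(program) == run(dependent_swapped))    # False
```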



### Understanding Program Order - Example 1

Initially Flag1 = Flag2 = 0

| P1                 | P2                 |
|--------------------|--------------------|
| Flag1 = 1          | Flag2 = 1          |
| if (Flag2 == 0)    | if (Flag1 == 0)    |
| *critical section* | *critical section* |

Execution:

| P1                           | P2                           |
|------------------------------|------------------------------|
| (Operation, Location, Value) | (Operation, Location, Value) |
| Write, Flag1, 1              | Write, Flag2, 1              |
| Read, Flag2, 0               | Read, Flag1, \_              |

Can happen if

- Write buffers with read bypassing
- Overlap, reorder write followed by read in h/w or compiler
- Allocate Flag1 or Flag2 ...
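Example 1 can be checked by brute force. The sketch below enumerates every interleaving of the two processors' operations on a single shared memory, which is one way to model SC, and then repeats the enumeration while letting each processor perform its read before its own write. The helper names and the way the relaxation is modeled are assumptions made for illustration, not a description of any particular hardware.

```python
from itertools import product

def interleavings(threads):
    """Yield every merge of the op lists that preserves each list's own order."""
    if not any(threads):
        yield []
        return
    for i, ops in enumerate(threads):
        if ops:
            rest = threads[:i] + [ops[1:]] + threads[i + 1:]
            for tail in interleavings(rest):
                yield [ops[0]] + tail

def outcomes(p1, p2):
    """Run every interleaving on one shared memory; collect (P1's read, P2's read)."""
    seen = set()
    for order in interleavings([p1, p2]):
        mem = {"Flag1": 0, "Flag2": 0}
        reads = {}
        for proc, kind, loc in order:
            if kind == "W":
                mem[loc] = 1
            else:
                reads[proc] = mem[loc]
        seen.add((reads["P1"], reads["P2"]))
    return seen

P1 = [("P1", "W", "Flag1"), ("P1", "R", "Flag2")]
P2 = [("P2", "W", "Flag2"), ("P2", "R", "Flag1")]

sc = outcomes(P1, P2)

# Assumed relaxation: each processor may perform its read before its own write.
relaxed = set()
for a, b in product([P1, P1[::-1]], [P2, P2[::-1]]):
    relaxed |= outcomes(a, b)

print(sorted(sc))       # [(0, 1), (1, 0), (1, 1)]
print(sorted(relaxed))  # [(0, 0), (0, 1), (1, 0), (1, 1)]
```

The outcome (0, 0), in which both processors enter the critical section, appears only once the write-to-read order is relaxed.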


# Sarita Adve

### Understanding Program Order - Example 2

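Initially A = Flag = 0

| P1        | P2                    |
|-----------|-----------------------|
| A = 23;   | while (Flag != 1) {;} |
| Flag = 1; | ... = A;              |

Execution:

| P1             | P2            |
|----------------|---------------|
| Write, A, 23   | Read, Flag, 0 |
| Write, Flag, 1 |               |
|                | Read, Flag, 1 |
|                | Read, A, \_   |

Can happen if

- Overlap or reorder writes or reads in hardware or compiler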
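A small enumeration makes the effect of each reordering concrete. In this sketch P2's spin loop is represented only by its final iteration, so executions in which that read of Flag does not return 1 are discarded. That simplification, and the helper names, are assumptions.

```python
def interleavings(threads):
    """Yield every merge of the op lists that preserves each list's own order."""
    if not any(threads):
        yield []
        return
    for i, ops in enumerate(threads):
        if ops:
            rest = threads[:i] + [ops[1:]] + threads[i + 1:]
            for tail in interleavings(rest):
                yield [ops[0]] + tail

def values_of_A(p1, p2):
    """Values P2 reads from A in executions where its read of Flag returned 1."""
    seen = set()
    for order in interleavings([p1, p2]):
        mem = {"A": 0, "Flag": 0}
        reads = {}
        for kind, loc, val in order:
            if kind == "W":
                mem[loc] = val
            else:
                reads[loc] = mem[loc]
        if reads["Flag"] == 1:
            seen.add(reads["A"])
    return seen

P1 = [("W", "A", 23), ("W", "Flag", 1)]
P2 = [("R", "Flag", None), ("R", "A", None)]

in_order = values_of_A(P1, P2)              # program order kept on both sides
writes_swapped = values_of_A(P1[::-1], P2)  # P1's two writes reordered
reads_swapped = values_of_A(P1, P2[::-1])   # P2's two reads reordered

print(sorted(in_order))        # [23]
print(sorted(writes_swapped))  # [0, 23]
print(sorted(reads_swapped))   # [0, 23]
```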

### Understanding Program Order: Summary

SC limits program order relaxation:

- Write $\rightarrow$ Read
- Write $\rightarrow$ Write
- Read $\rightarrow$ Read, Write





### Understanding Atomicity - Example 1

Initially A = B = C = 0

| P1     | P2     | P3                 | P4                 |
|--------|--------|--------------------|--------------------|
| A = 1; | A = 2; | while (B != 1) {;} | while (B != 1) {;} |
| B = 1; | C = 1; | while (C != 1) {;} | while (C != 1) {;} |
|        |        | tmp1 = A;          | tmp2 = A;          |

tmp1 and tmp2 can differ if updates of A reach P3 and P4 in different order

Coherence protocol must serialize writes to same location
(Writes to same location should be seen in same order by all)
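The role of write serialization can be sketched with a toy model in which P3 and P4 each keep their own copy of memory and receive the four updates in some order. The model assumes, for illustration only, that each writer's own updates arrive in that writer's program order; all names are hypothetical.

```python
from itertools import permutations

# (location, value, writer): P1 writes A=1 then B=1; P2 writes A=2 then C=1.
UPDATES = [("A", 1, "P1"), ("B", 1, "P1"), ("A", 2, "P2"), ("C", 1, "P2")]

def legal(order):
    """Assumption: each writer's updates arrive in that writer's program order."""
    return (order.index(UPDATES[0]) < order.index(UPDATES[1])
            and order.index(UPDATES[2]) < order.index(UPDATES[3]))

def final_A(order):
    """Value of A in an observer's copy once B == 1 and C == 1 are both visible."""
    copy = {"A": 0, "B": 0, "C": 0}
    for loc, val, _writer in order:
        copy[loc] = val
    return copy["A"]

def a_order(order):
    """The order in which the two writes to A arrive."""
    return [u for u in order if u[0] == "A"]

orders = [o for o in permutations(UPDATES) if legal(o)]

# Without serialization, P3 and P4 may receive the updates in unrelated orders.
unserialized = {(final_A(o3), final_A(o4)) for o3 in orders for o4 in orders}
# With serialization, both observers see the writes to A in the same order.
serialized = {(final_A(o3), final_A(o4)) for o3 in orders for o4 in orders
              if a_order(o3) == a_order(o4)}

print(sorted(unserialized))  # [(1, 1), (1, 2), (2, 1), (2, 2)]
print(sorted(serialized))    # [(1, 1), (2, 2)]
```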

### Understanding Atomicity - Example 2

Initially A = B = 0

| P1    | P2               | P3               |
|-------|------------------|------------------|
| A = 1 | while (A != 1) ; | while (B != 1) ; |
|       | B = 1;           | tmp = A          |

Execution:

| P1          | P2          | P3          |
|-------------|-------------|-------------|
| Write, A, 1 |             |             |
|             | Read, A, 1  |             |
|             | Write, B, 1 |             |
|             |             | Read, B, 1  |
|             |             | Read, A, \_ |
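One way to see what the final read of A may return is to model the arrival of each write at each processor as a separate event. The event names and the notion of an atomic write used here (both arrivals of A happen with nothing in between) are assumptions of this sketch.

```python
from itertools import permutations

# Arrival of a write at a processor's local copy of memory.
EVENTS = ("A->P2", "A->P3", "B->P3")

def tmp_values(atomic_writes):
    """Values P3 can read from A after it has seen B == 1."""
    seen = set()
    for order in permutations(EVENTS):
        pos = {event: i for i, event in enumerate(order)}
        if pos["A->P2"] > pos["B->P3"]:
            continue  # P2 writes B only after it has seen A == 1
        if atomic_writes and abs(pos["A->P2"] - pos["A->P3"]) != 1:
            continue  # atomic: no other event separates the two arrivals of A
        # P3 leaves its loop when B arrives, then reads A from its own copy.
        seen.add(1 if pos["A->P3"] < pos["B->P3"] else 0)
    return seen

print(sorted(tmp_values(atomic_writes=False)))  # [0, 1]
print(sorted(tmp_values(atomic_writes=True)))   # [1]
```

When P2 can read the new value of A before the write has reached P3, P3 can observe B == 1 and still read the old A.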

### SC Summary

SC limits

- Program order relaxation:
  - Write $\rightarrow$ Read
  - Write $\rightarrow$ Write
  - Read $\rightarrow$ Read, Write
- When a processor can read the value of a write
- Unserialized writes to the same location

Alternative

- (1) Aggressive hardware techniques proposed to get SC w/o penalty using speculation and prefetching
  - But compilers still limited by SC
- (2) Give up sequential consistency
  - Use relaxed models
### **Classification for Relaxed Models**

Typically described as system optimizations - system-centric

Optimizations

- Program order relaxation:
  - Write $\rightarrow$ Read
  - Write $\rightarrow$ Write
  - Read $\rightarrow$ Read, Write
- Read others' write early
- Read own write early

All models provide safety net

All models maintain uniprocessor data and control dependences, write serialization

### Some System-Centric Models

| Relaxation | W → R Order | W → W Order | R → RW Order | Read Others' Write Early | Read Own Write Early | Safety Net                   |
|------------|-------------|-------------|--------------|--------------------------|----------------------|------------------------------|
| IBM 370    | ✓           |             |              |                          |                      | serialization instructions   |
| TSO        | ✓           |             |              |                          | ✓                    | RMW                          |
| PC         | ✓           |             |              | ✓                        | ✓                    | RMW                          |
| PSO        | ✓           | ✓           |              |                          | ✓                    | RMW, STBAR                   |
| WO         | ✓           | ✓           | ✓            |                          | ✓                    | synchronization              |
| RCsc       | ✓           | ✓           | ✓            |                          | ✓                    | release, acquire, nsync, RMW |
| RCpc       | ✓           | ✓           | ✓            | ✓                        | ✓                    | release, acquire, nsync, RMW |
| Alpha      | ✓           | ✓           | ✓            |                          | ✓                    | MB, WMB                      |
| RMO        | ✓           | ✓           | ✓            |                          | ✓                    | various MEMBARs              |
| PowerPC    | ✓           | ✓           | ✓            | ✓                        | ✓                    | SYNC                         |

### System-Centric Models: Assessment

System-centric models provide higher performance than SC

BUT 3P criteria

- Programmability?
  - Lost intuitive interface of SC
- Portability?
  - Many different models
- Performance?
  - Can we do better?

Need a higher level of abstraction

### An Alternate Programmer-Centric View

One source of consensus

- Programmers need SC to reason about programs
- But SC not practical today

How about the next best thing...

## A Programmer-Centric View

Specify memory model as a contract

System gives sequential consistency IF programmer obeys certain rules

- + Programmability
- + Performance
- + Portability

### The Data-Race-Free-0 Model: Motivation

Different operations have different semantics

| P1        | P2                    |
|-----------|-----------------------|
| A = 23;   | while (Flag != 1) {;} |
| B = 37;   | ... = B;              |
| Flag = 1; | ... = A;              |

Flag = Synchronization; A, B = Data

Can reorder data operations

Distinguish data and synchronization

Need to

- Characterize data / synchronization
- Prove characterization allows optimizations w/o violating SC

### Data-Race-Free-0: Some Definitions

Two operations conflict if

- Access same location
- At least one is a write
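The conflict condition translates directly into a predicate. The `Op` record and its field names are hypothetical, introduced only for this sketch.

```python
from collections import namedtuple

# kind is "R" for a read or "W" for a write
Op = namedtuple("Op", "proc kind loc")

def conflict(a, b):
    """Two operations conflict if they access the same location and at least one is a write."""
    return a.loc == b.loc and "W" in (a.kind, b.kind)

write_a = Op("P1", "W", "A")
read_a = Op("P2", "R", "A")
read_b = Op("P2", "R", "B")

print(conflict(write_a, read_a))  # True: same location, one write
print(conflict(read_a, read_a))   # False: two reads never conflict
print(conflict(write_a, read_b))  # False: different locations
```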



### Data-Race-Free-0 (DRF0) Definition

Data-Race-Free-0 Program

- All accesses distinguished as either synchronization or data
- All races distinguished as synchronization (in any SC execution)

Data-Race-Free-0 Model

Guarantees SC to data-race-free-0 programs

It is widely accepted that data races make programs hard to debug independent of memory model (even with SC)
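A rough checker for this definition can be sketched by enumerating SC executions of the A/B/Flag program and looking for races on locations labeled as data. Several things here are assumptions rather than part of the definition above: two conflicting operations are taken to race when they come from different processors and execute consecutively in an SC execution, labels are attached per location rather than per operation, and the spin loop is represented only by its final, successful read of Flag.

```python
def interleavings(threads):
    """Yield every merge of the op lists that preserves each list's own order."""
    if not any(threads):
        yield []
        return
    for i, ops in enumerate(threads):
        if ops:
            rest = threads[:i] + [ops[1:]] + threads[i + 1:]
            for tail in interleavings(rest):
                yield [ops[0]] + tail

def data_races(threads, labels, init):
    """Locations labeled 'data' that take part in a race in some modeled SC execution."""
    racy = set()
    for order in interleavings(threads):
        mem = dict(init)
        feasible = True
        for proc, kind, loc, val in order:
            if kind == "W":
                mem[loc] = val
            elif val is not None and mem[loc] != val:
                feasible = False  # the spin-loop exit read did not see its value
                break
        if not feasible:
            continue
        for a, b in zip(order, order[1:]):  # consecutive operations
            conflicting = a[2] == b[2] and "W" in (a[1], b[1])
            if conflicting and a[0] != b[0] and labels[a[2]] == "data":
                racy.add(a[2])
    return racy

INIT = {"A": 0, "B": 0, "Flag": 0}
# Ops are (processor, kind, location, value); a read's value is the one it must see, or None.
P1 = [("P1", "W", "A", 23), ("P1", "W", "B", 37), ("P1", "W", "Flag", 1)]
P2 = [("P2", "R", "Flag", 1), ("P2", "R", "B", None), ("P2", "R", "A", None)]

good_labels = {"A": "data", "B": "data", "Flag": "sync"}
bad_labels = {"A": "data", "B": "data", "Flag": "data"}

print(data_races([P1, P2], good_labels, INIT))              # set()
print(data_races([P1, P2], bad_labels, INIT))               # {'Flag'}
print(sorted(data_races([P1, P2[1:]], good_labels, INIT)))  # ['A', 'B']
```

Under these assumptions the program is data-race-free-0 only when Flag is labeled as synchronization and P2 actually waits on it. Dropping the wait exposes races on A and B.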

### Distinguishing/Labeling Memory Operations

Need to distinguish/label operations at all levels

- High-level language
- Hardware
- Compiler must translate language label to hardware label

Java: volatiles, synchronized

C++: atomics

Hardware: fences inserted before/after synchronization
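Python is not one of the languages listed above, so the following is only a loose analogue and an assumption of this sketch: plain dictionary accesses stand in for data operations, and a `threading.Event` stands in for the labeled synchronization variable Flag.

```python
import threading

shared = {"A": 0, "B": 0}   # data operations: plain reads and writes
flag = threading.Event()    # synchronization: labeled by using a sync primitive
result = {}

def p1():
    shared["A"] = 23
    shared["B"] = 37
    flag.set()              # plays the role of Flag = 1

def p2():
    flag.wait()             # plays the role of while (Flag != 1) {;}
    result["B"] = shared["B"]
    result["A"] = shared["A"]

t2 = threading.Thread(target=p2)
t1 = threading.Thread(target=p1)
t2.start()
t1.start()
t1.join()
t2.join()

print(result)  # {'B': 37, 'A': 23}
```

Because every conflicting access to A and B is ordered through the event, the program can be reasoned about as if it ran one operation at a time.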

### Data-Race-Free Summary

The idea

Programmer writes data-race-free programs

System gives SC

For programmer

- Reason with SC
- Enhanced portability

For hardware and compiler

- More flexibility

Finally, convergence on hardware and software sides

(BUT still many problems...)