- Introduction
- What is a parallel or multiprocessor system?
- Why parallel architecture?
- Performance potential
- Flynn classification
- Communication models
- Architectures
- Centralized shared memory
- Distributed shared memory
- Parallel programming
- Synchronization
- Memory consistency models


#### Memory Consistency Model - Motivation

Example shared-memory program

Initially all locations = 0

| Processor 1 | Processor 2           |
|-------------|-----------------------|
| Data = 23   | while (Flag != 1) {;} |
| Flag = 1    | ... = Data            |

Execution (only shared-memory operations)

| Processor 1     | Processor 2   |
|-----------------|---------------|
| Write, Data, 23 |               |
| Write, Flag, 1  | Read, Flag, 1 |
|                 | Read, Data,   |


#### Memory Consistency Model: Definition

Memory consistency model

Order in which memory operations will appear to execute

 $\Rightarrow$  What value can a read return?

Affects ease-of-programming and performance



Program text defines total order = program order

Uniprocessor model

- Memory operations appear to execute one at a time in program order
- $\Rightarrow$ Read returns value of last write

BUT uniprocessor hardware

- Overlap, reorder operations
- Model maintained as long as control and data dependences are maintained
- $\Rightarrow$ Easy to use + high performance








Initially A = Flag = 0

| P1        | P2                    |
|-----------|-----------------------|
| A = 23;   | while (Flag != 1) {;} |
| Flag = 1; | ... = A;              |

Execution

| P1             | P2            |
|----------------|---------------|
| Write, A, 23   | Read, Flag, 0 |
| Write, Flag, 1 | Read, Flag, 1 |
|                | Read, A,      |

Can happen if

- Overlap or reorder writes or reads in hardware or compiler
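
The effect of reordering on the Flag example can be checked by brute force. The following is a minimal Python sketch, not part of the original slides. It enumerates every interleaving of P1's two writes with P2's last Flag read and its read of A, keeps only the interleavings in which the Flag read returns 1 (so the spin loop exits), and collects the values read from A. The names `interleavings` and `values_read_for_a` are illustrative, and modeling the spin loop by its final iteration only is a simplifying assumption.

```python
from itertools import combinations

def interleavings(p1, p2):
    """Yield every merge of p1 and p2 that keeps each list's internal order."""
    n = len(p1) + len(p2)
    for slots in combinations(range(n), len(p1)):
        it1, it2 = iter(p1), iter(p2)
        yield [next(it1) if i in slots else next(it2) for i in range(n)]

def values_read_for_a(p1_ops):
    """Return the set of values P2 can read from A after it has seen Flag == 1."""
    p2_ops = [("R", "Flag"), ("R", "A")]   # last spin iteration, then the read of A
    results = set()
    for schedule in interleavings(p1_ops, p2_ops):
        memory = {"A": 0, "Flag": 0}       # initially A = Flag = 0
        reads = {}
        for op in schedule:
            if op[0] == "W":
                memory[op[1]] = op[2]
            else:
                reads[op[1]] = memory[op[1]]
        if reads["Flag"] == 1:             # the loop exits only when Flag == 1
            results.add(reads["A"])
    return results

program_order = [("W", "A", 23), ("W", "Flag", 1)]
reordered = [("W", "Flag", 1), ("W", "A", 23)]   # P1's writes swapped by hardware or compiler

print(sorted(values_read_for_a(program_order)))  # [23]
print(sorted(values_read_for_a(reordered)))      # [0, 23]
```

When P1's writes stay in program order, P2 can only read 23. Once the two writes are swapped, an interleaving exists in which P2 sees Flag == 1 and still reads the initial value 0 from A.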


#### Understanding Atomicity - Example 1

Initially $A = B = C = 0$

| P1     | P2     | P3                 | P4                 |
|--------|--------|--------------------|--------------------|
| A = 1; | A = 2; | while (B != 1) {;} | while (B != 1) {;} |
| B = 1; | C = 1; | while (C != 1) {;} |                    |
|        |        | tmp1 = A;          | tmp2 = A;          |

tmp1 and tmp2 can end up with different values if updates of A reach P3 and P4 in different orders

Coherence protocol must serialize writes to same location

(Writes to same location should be seen in same order by all)


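#### Understanding Atomicity - Example 2

Initially A = B = 0

| P1    | P2               | P3               |
|-------|------------------|------------------|
| A = 1 | while (A != 1) ; | while (B != 1) ; |
|       | B = 1;           | tmp = A          |

Execution

| P1          | P2          | P3         |
|-------------|-------------|------------|
| Write, A, 1 | Read, A, 1  |            |
|             | Write, B, 1 | Read, B, 1 |
|             |             | Read, A,   |

Can happen if read returns new value before all copies see it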
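
Example 2 can be explored with a small propagation model. The following is a hedged Python sketch, not the slides' own formalism. It assumes P2 and P3 each hold a private copy of memory, that a write becomes visible to a processor only when an update event reaches that copy, and that P3 reads A as soon as its copy of B becomes 1. The function name `tmp_values` and the event encoding are illustrative.

```python
from itertools import permutations

def tmp_values(atomic_writes):
    """Return the set of values P3 may read from A in Example 2."""
    if atomic_writes:
        # P1's write updates every copy in a single step.
        events = [("A", ("P2", "P3")), ("B", ("P3",))]
    else:
        # P1's write reaches P2 and P3 as two independent updates.
        events = [("A", ("P2",)), ("A", ("P3",)), ("B", ("P3",))]
    results = set()
    for order in permutations(events):
        copies = {"P2": {"A": 0, "B": 0}, "P3": {"A": 0, "B": 0}}
        for loc, targets in order:
            if loc == "B" and copies["P2"]["A"] != 1:
                break  # illegal order: P2 writes B only after it has read A == 1
            for proc in targets:
                copies[proc][loc] = 1
            if loc == "B":
                results.add(copies["P3"]["A"])  # P3 leaves its loop and reads A
    return results

print(sorted(tmp_values(atomic_writes=True)))   # [1]
print(sorted(tmp_values(atomic_writes=False)))  # [0, 1]
```

With atomic writes P3 always reads 1. When the update of A can reach P2 before it reaches P3, P2 may read the new value and publish B while P3's copy of A is still 0.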

#### SC Summary

SC limits

- Program order relaxation:
  - Write $\rightarrow$ Read
  - Write $\rightarrow$ Write
  - Read $\rightarrow$ Read, Write
- When a processor can read the value of a write
- Unserialized writes to the same location

Alternative

(1) Aggressive hardware techniques proposed to get SC w/o penalty using speculation and prefetching

- But compilers still limited by SC

(2) Give up sequential consistency

- Use relaxed models


Sarita Adve

#### Classification for Relaxed Models

Typically described as system optimizations - system-centric

Optimizations

- Program order relaxation:
  - Write $\rightarrow$ Read
  - Write $\rightarrow$ Write
  - Read $\rightarrow$ Read, Write
- Read others' write early
- Read own write early

All models provide safety net

All models maintain uniprocessor data and control dependences, write serialization


#### Some System-Centric Models

| Relaxation: | W → R<br>Order | W → W<br>Order | R → RW<br>Order | Read Others'<br>Write Early | Read Own<br>Write Early | Safety Net                      |
|-------------|----------------|----------------|-----------------|-----------------------------|-------------------------|---------------------------------|
| IBM 370     | ✓              |                |                 |                             |                         | serialization<br>instructions   |
| TSO         | ✓              |                |                 |                             | ✓                       | RMW                             |
| PC          | ✓              |                |                 | ✓                           | ✓                       | RMW                             |
| PSO         | ✓              | ✓              |                 |                             | ✓                       | RMW, STBAR                      |
| WO          | ✓              | ✓              | ✓               |                             | ✓                       | synchronization                 |
| RCsc        | ✓              | ✓              | ✓               |                             | ✓                       | release, acquire,<br>nsync, RMW |
| RCpc        | ✓              | ✓              | ✓               | ✓                           | ✓                       | release, acquire,<br>nsync, RMW |
| Alpha       | ✓              | ✓              | ✓               |                             | ✓                       | MB, WMB                         |
| RMO         | ✓              | ✓              | ✓               |                             | ✓                       | various MEMBARs                 |
| PowerPC     | ✓              | ✓              | ✓               | ✓                           | ✓                       | SYNC                            |


#### System-Centric Models: Assessment

System-centric models provide higher performance than SC

BUT 3P criteria

Programmability?

Lost intuitive interface of SC

Portability?

Many different models

Performance?

Can we do better?

Need a higher level of abstraction





Programmers need SC to reason about programs

But SC not practical today

How about the next best thing...



System gives sequential consistency IF programmer obeys certain rules

- + Programmability
- + Performance
- + Portability

#### The Data-Race-Free-0 Model: Motivation

Different operations have different semantics

| P1        | P2                    |
|-----------|-----------------------|
| A = 23;   | while (Flag != 1) {;} |
| B = 37;   | ... = B;              |
| Flag = 1; | ... = A;              |

Flag = Synchronization; A, B = Data

Can reorder data operations

Distinguish data and synchronization

Need to

- Characterize data / synchronization
- Prove characterization allows optimizations w/o violating SC

#### Data-Race-Free-0: Some Definitions

Two operations conflict if

- Access same location
- At least one is a write
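
The conflict definition translates directly into a predicate. The following is a minimal Python sketch; the `Op` record and its field names are illustrative assumptions, not notation from the slides.

```python
from typing import NamedTuple

class Op(NamedTuple):
    proc: str  # issuing processor, e.g. "P1"
    kind: str  # "R" for a read, "W" for a write
    loc: str   # shared-memory location, e.g. "A"

def conflict(x: Op, y: Op) -> bool:
    """True if x and y access the same location and at least one is a write."""
    return x.loc == y.loc and (x.kind == "W" or y.kind == "W")

print(conflict(Op("P1", "W", "A"), Op("P2", "R", "A")))     # True: same location, one write
print(conflict(Op("P1", "R", "A"), Op("P2", "R", "A")))     # False: two reads
print(conflict(Op("P1", "W", "A"), Op("P2", "W", "Flag")))  # False: different locations
```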



#### Data-Race-Free-0 (DRF0) Definition

Data-Race-Free-0 Program

- All accesses distinguished as either synchronization or data
- All races distinguished as synchronization (in any SC execution)

Data-Race-Free-0 Model

- Guarantees SC to data-race-free-0 programs

It is widely accepted that data races make programs hard to debug independent of memory model (even with SC)
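
For a tiny program the DRF0 condition can be checked by enumerating SC executions. The following Python sketch makes several assumptions that are not stated in the slides: a race is taken to be two conflicting operations from different processors that appear next to each other in some SC execution, the program is the two-processor Flag example, and a Flag read that returns 0 is treated as leaving its processor in the spin loop for the rest of that execution. All function names are illustrative.

```python
from itertools import combinations

def conflict(x, y):
    """Same location and at least one write."""
    return x["loc"] == y["loc"] and "W" in (x["kind"], y["kind"])

def interleavings(p1, p2):
    """Yield every merge of p1 and p2 that preserves each program order."""
    n = len(p1) + len(p2)
    for slots in combinations(range(n), len(p1)):
        it1, it2 = iter(p1), iter(p2)
        yield [next(it1) if i in slots else next(it2) for i in range(n)]

def sc_trace(schedule):
    """Execute one interleaving; a Flag read that sees 0 leaves its processor spinning."""
    memory = {"A": 0, "Flag": 0}
    trace, spinning = [], set()
    for op in schedule:
        if op["proc"] in spinning:
            continue  # still inside the while loop, so later operations are not reached
        trace.append(op)
        if op["kind"] == "W":
            memory[op["loc"]] = op["value"]
        elif op["loc"] == "Flag" and memory["Flag"] != 1:
            spinning.add(op["proc"])
    return trace

def is_drf0(p1, p2, sync_locations):
    """True if every race found is on a location labeled as synchronization."""
    for schedule in interleavings(p1, p2):
        trace = sc_trace(schedule)
        for first, second in zip(trace, trace[1:]):
            race = first["proc"] != second["proc"] and conflict(first, second)
            if race and first["loc"] not in sync_locations:
                return False
    return True

P1 = [{"proc": "P1", "kind": "W", "loc": "A", "value": 23},
      {"proc": "P1", "kind": "W", "loc": "Flag", "value": 1}]
P2 = [{"proc": "P2", "kind": "R", "loc": "Flag"},
      {"proc": "P2", "kind": "R", "loc": "A"}]

print(is_drf0(P1, P2, sync_locations={"Flag"}))  # True
print(is_drf0(P1, P2, sync_locations=set()))     # False
```

With Flag labeled as synchronization, the only races are on Flag, so the program is data-race-free-0. If nothing is labeled, the race on Flag counts as a data race and the check fails.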


#### Distinguishing/Labeling Memory Operations

Need to distinguish/label operations at all levels

- High-level language
- Hardware
- Compiler must translate language label to hardware label

Java: volatiles, synchronized

C++: atomics

Hardware: fences inserted before/after synchronization
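
The compiler's translation step can be pictured as a pass over labeled operations. The following Python sketch is a deliberately conservative illustration: `FENCE` is a hypothetical generic instruction, not the mnemonic of any real ISA, and placing a full fence on both sides of every synchronization access is an assumption. Real targets often need a fence on one side only.

```python
def insert_fences(ops):
    """Translate labeled operations into an instruction list, fencing each synchronization access."""
    out = []
    for label, instr in ops:
        if label == "sync":
            out += ["FENCE", instr, "FENCE"]  # hypothetical full fence before and after
        else:
            out.append(instr)                 # data accesses stay free to be reordered
    return out

# P1 from the motivation example: A and B are data, Flag is synchronization.
p1 = [("data", "A = 23"), ("data", "B = 37"), ("sync", "Flag = 1")]
for line in insert_fences(p1):
    print(line)
```

The two data writes carry no fence between them, so the hardware may still overlap them, while the write to Flag cannot move above either of them.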


The idea

- Programmer writes data-race-free programs
- System gives SC

For programmer

- Reason with SC
- Enhanced portability

For hardware and compiler

- More flexibility

Finally, convergence on hardware and software sides

(BUT still many problems…)