#### Chapter 5: Thread-Level Parallelism – Part 1

Introduction

- What is a parallel or multiprocessor system?
- Why parallel architecture?
- Performance potential
- Flynn classification
- Communication models
- Architectures
  - Centralized shared-memory
  - Distributed shared-memory
- Parallel programming
- Synchronization
- Memory consistency models

# What is a parallel or multiprocessor system?

Multiple processor units working together to solve the same problem

Key architectural issue: Communication model

# Why parallel architectures?

Absolute performance

Technology and architecture trends: Dennard scaling, ILP wall, Moore's law

 $\Rightarrow$  Multicore chips




Connect multicore together for even more parallelism




### **Performance Potential**



Gustafson's Corollary

Amdahl's law holds if we run the same problem size on larger machines.

But in practice, we run larger problems and "wait" the same time.
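For reference, the two views can be written side by side. The symbols are assumptions introduced here, not taken from the slides: $P$ is the number of processors and $f$ is the parallelizable fraction of execution time, measured on the sequential run for Amdahl and on the parallel run for Gustafson.

$$
\text{Speedup}_{\text{Amdahl}}(P) = \frac{1}{(1-f) + \dfrac{f}{P}} \;\le\; \frac{1}{1-f}
$$

$$
\text{Speedup}_{\text{Gustafson}}(P) = (1-f) + f\,P
$$

As an illustration with $f = 0.9$ and $P = 100$: the fixed-size speedup is $1/(0.1 + 0.009) \approx 9.2$, while the scaled speedup is $0.1 + 0.9 \times 100 = 90.1$.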


## Flynn classification





# **Communication models**

- Shared-memory
- Message passing
- Data parallel



Communicate, Synchronize (coordinate)

## **Communication Models: Message Passing**



Each node is a computer:

- Processor – runs its own program (as in shared-memory)
- Memory – local to that node, unrelated to other memory

Add messages for internode communication: send and receive, like mail
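The send/receive style can be sketched in C. This is only an illustration under stated assumptions: two threads stand in for two nodes, a POSIX pipe stands in for the interconnect, and `send_msg`, `recv_msg`, and `message_passing_sum` are hypothetical names, not part of any message-passing library.

```c
#include <assert.h>
#include <pthread.h>
#include <unistd.h>

/* The pipe stands in for the interconnect between the two "nodes". */
static int channel[2];   /* channel[0]: receive end, channel[1]: send end */

static void send_msg(int value)
{
    ssize_t n = write(channel[1], &value, sizeof value);
    assert(n == (ssize_t)sizeof value);
    (void)n;
}

static int recv_msg(void)
{
    int value = 0;
    ssize_t n = read(channel[0], &value, sizeof value);  /* blocks until data arrives */
    assert(n == (ssize_t)sizeof value);
    (void)n;
    return value;
}

/* "Node 1": mails the values 1..count, one message each. */
static void *sender_node(void *arg)
{
    int count = *(int *)arg;
    for (int i = 1; i <= count; i++)
        send_msg(i);
    return NULL;
}

/* "Node 0": receives count messages and returns their sum. */
int message_passing_sum(int count)
{
    int rc = pipe(channel);
    assert(rc == 0);
    (void)rc;

    pthread_t sender;
    pthread_create(&sender, NULL, sender_node, &count);

    int sum = 0;
    for (int i = 0; i < count; i++)
        sum += recv_msg();

    pthread_join(sender, NULL);
    close(channel[0]);
    close(channel[1]);
    return sum;
}
```

The receiver never reads the sender's variables directly; all data arrives through explicit `recv_msg` calls, which is the property that distinguishes this model from shared memory.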



## **Communication Models: Data Parallel**

Write sequential programs with a "conceptual PC" and let parallelism be within the data (e.g., matrices)

C = A + B

Typically SIMD architecture, but MIMD can be as effective

# **Architectures**

All mechanisms can usually be synthesized by all hardware

Key: which communication model does hardware support best?

Virtually all small-scale systems and multicores are shared-memory

# Which is Best Communication Model to Support?

Shared-memory

- Used in small-scale systems
- Easier to program for dynamic data structures
- Lower overhead communication for small data
- Implicit movement of data with caching
- Hard to build?

Message-passing

- Communication explicit, harder to program?
- Larger overheads in communication, OS intervention?
- Easier to build?

## **Shared-Memory Architecture**



For now, assume interconnect is a bus – *centralized architecture* 

### **Centralized Shared-Memory Architecture**



# **Centralized Shared-Memory Architecture (Cont.)**

For higher bandwidth (throughput)

For lower latency









#### **Cache Coherence Problem**





[Handwritten annotation, mostly illegible: snooping protocols MSI, MESI, and MOESI, with bus requests such as GetLine and GetLine + Invalidate]

## **Cache Coherence Solutions**



Problem with centralized architecture

# **Distributed Shared-Memory (DSM) Architecture**

Use a higher bandwidth interconnection network



Uniform memory access architecture (UMA)

# **Distributed Shared-Memory (DSM) - Cont.**

For lower latency: Non-Uniform Memory Access architecture (NUMA)





## **Non-Bus Interconnection Networks**

Example interconnection networks





# **Distributed Shared-Memory - Coherence Problem**

#### **Directory scheme**



Level of indirection!





## **Parallel Programming Example**

Add two matrices: C = A + B

#### Sequential Program

## **Parallel Program Example (Cont.)**

```
main(argc, argv)
    int argc; char *argv[];
{
    Read(A);
    Read(B);
    /* the main thread acts as one processor; create the others */
    for (p = 1; p != number-of-processors; p++)
        create-thread(p, start-procedure);
    start-procedure();                   /* main thread does its share too */
    wait-for-all-threads-to-be-done();
    Print(C);
}

start-procedure()
{
    /* each thread adds only its own block of rows */
    for (i = my-rows-begin; i != my-rows-end; i++)
        for (j = 0; j != N; j++)
            C[i,j] = A[i,j] + B[i,j];
    indicate-done();
}
```
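A runnable counterpart of the pseudocode might look like the following sketch with POSIX threads. The matrix size, thread count, sample input values, and the names `add_rows`, `parallel_matrix_add`, and `result_at` are all assumptions made for illustration. Unlike the slide, the main thread here only creates and joins workers rather than taking a share of the rows.

```c
#include <assert.h>
#include <pthread.h>

#define N 8                     /* matrix dimension (assumed) */
#define NUM_THREADS 4           /* worker count (assumed) */

static double A[N][N], B[N][N], C[N][N];

struct row_range { int begin, end; };    /* half-open range [begin, end) */

/* Plays the role of start-procedure: add only this thread's rows. */
static void *add_rows(void *arg)
{
    const struct row_range *r = arg;
    for (int i = r->begin; i != r->end; i++)
        for (int j = 0; j != N; j++)
            C[i][j] = A[i][j] + B[i][j];
    return NULL;                /* returning is the "indicate-done" step */
}

/* Fills A and B with sample values, then computes C = A + B in parallel. */
void parallel_matrix_add(void)
{
    assert(N % NUM_THREADS == 0);        /* keep the row split even */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            A[i][j] = i;
            B[i][j] = j;
        }

    pthread_t threads[NUM_THREADS];
    struct row_range ranges[NUM_THREADS];
    int rows_per_thread = N / NUM_THREADS;

    for (int p = 0; p < NUM_THREADS; p++) {
        ranges[p].begin = p * rows_per_thread;
        ranges[p].end = ranges[p].begin + rows_per_thread;
        pthread_create(&threads[p], NULL, add_rows, &ranges[p]);
    }
    /* pthread_join is the "wait-for-all-threads-to-be-done" step. */
    for (int p = 0; p < NUM_THREADS; p++)
        pthread_join(threads[p], NULL);
}

/* Accessor so callers can inspect the result. */
double result_at(int i, int j) { return C[i][j]; }
```

No locking is needed inside `add_rows` because each thread writes a disjoint set of rows of `C`; the only synchronization is the final join.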



## **The Parallel Programming Process**

- Break up computation into tasks
- Break up data into chunks
  - Necessary for message passing machines
- Introduce synchronization for correctness



# **Synchronization**

- Communication – exchange data
- Synchronization – exchange data to order events
  - Mutual exclusion or atomicity
  - Event ordering or producer/consumer
    - Point to point: flags
    - Global: barriers

# **Mutual Exclusion**

Example

Each processor needs to occasionally update a counter




## **Mutual Exclusion Primitives**

Hardware instructions

Test&Set: atomically tests for 0 and sets to 1

Unset is simply a store of 0

while (Test&Set(L) != 0) {;}
Critical Section
Unset(L)

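Problem: traffic. Every Test&Set attempt is a write, so spinning processors keep generating interconnect traffic for as long as the lock is held.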
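The Test&Set lock can be sketched with C11 atomics, where `atomic_flag_test_and_set` plays the role of Test&Set and `atomic_flag_clear` the role of Unset. The shared counter, the thread limit of 16, and the names `tas_lock`, `tas_unlock`, and `run_tas_counter` are assumptions made for this illustration.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

static atomic_flag tas_L = ATOMIC_FLAG_INIT;   /* clear = 0 = unlocked */
static long tas_counter;                        /* shared data guarded by tas_L */

static void tas_lock(void)
{
    /* Returns the old value and sets the flag; loop while it was already set. */
    while (atomic_flag_test_and_set_explicit(&tas_L, memory_order_acquire))
        ;                                       /* spin: every attempt is a write */
}

static void tas_unlock(void)
{
    atomic_flag_clear_explicit(&tas_L, memory_order_release);   /* Unset(L) */
}

static void *tas_worker(void *arg)
{
    long iterations = *(long *)arg;
    for (long i = 0; i < iterations; i++) {
        tas_lock();
        tas_counter++;                          /* critical section */
        tas_unlock();
    }
    return NULL;
}

/* Runs nthreads workers and returns the final counter value. */
long run_tas_counter(int nthreads, long iterations)
{
    assert(nthreads > 0 && nthreads <= 16);
    pthread_t threads[16];
    tas_counter = 0;
    for (int t = 0; t < nthreads; t++)
        pthread_create(&threads[t], NULL, tas_worker, &iterations);
    for (int t = 0; t < nthreads; t++)
        pthread_join(threads[t], NULL);
    return tas_counter;
}
```

With the lock in place, the final count equals `nthreads * iterations`; without it, concurrent increments of the plain `long` could be lost.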

## **Mutual Exclusion Primitives – Alternative?**

Test&Test&Set

A: while (L != 0) {;}
if (Test&Set(L) == 0) {
    Critical Section
}
else goto A

Problems:

- Traffic on lock release
- What if a processor is swapped out while holding the lock?
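A possible C11 rendering of Test&Test&Set is sketched below. Since `atomic_flag` cannot be read without setting it in C11, this sketch assumes an `atomic_int` lock word and uses `atomic_exchange` as the Test&Set step. The names `ttas_lock`, `ttas_unlock`, and `run_ttas_counter` are hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

static atomic_int ttas_L;          /* 0 = free, 1 = held (zero-initialized) */
static long ttas_counter;          /* shared data guarded by ttas_L */

static void ttas_lock(void)
{
    for (;;) {
        /* Test: spin on plain loads, which can be served from the local cache. */
        while (atomic_load_explicit(&ttas_L, memory_order_relaxed) != 0)
            ;
        /* Test&Set: atomic_exchange returns the old value. */
        if (atomic_exchange_explicit(&ttas_L, 1, memory_order_acquire) == 0)
            return;                /* old value was 0: lock acquired */
        /* Lost the race to another processor: go back to spinning on loads. */
    }
}

static void ttas_unlock(void)
{
    atomic_store_explicit(&ttas_L, 0, memory_order_release);
}

static void *ttas_worker(void *arg)
{
    long iterations = *(long *)arg;
    for (long i = 0; i < iterations; i++) {
        ttas_lock();
        ttas_counter++;            /* critical section */
        ttas_unlock();
    }
    return NULL;
}

/* Runs nthreads workers and returns the final counter value. */
long run_ttas_counter(int nthreads, long iterations)
{
    assert(nthreads > 0 && nthreads <= 16);
    pthread_t threads[16];
    ttas_counter = 0;
    for (int t = 0; t < nthreads; t++)
        pthread_create(&threads[t], NULL, ttas_worker, &iterations);
    for (int t = 0; t < nthreads; t++)
        pthread_join(threads[t], NULL);
    return ttas_counter;
}
```

While the lock is held, waiters only read, so they generate no writes. The burst of traffic moves to the release, when every waiter sees 0 and attempts the exchange at once.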

## Mutual Exclusion Primitives – Fetch&Add

Fetch&Add(var, data) { /\* atomic action \*/
temp = var
var = temp + data
return temp
}

E.g., let X = 57

P1: a = Fetch&Add(X, 3)
P2: b = Fetch&Add(X, 5)

If P1 before P2, ?

If P2 before P1, ?

If P1, P2 concurrent, ?
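The two sequential orderings can be checked with C11's `atomic_fetch_add`, which has the same return-the-old-value semantics. This is a single-threaded sketch; `struct outcome` and `fetch_add_demo` are hypothetical names introduced for the illustration.

```c
#include <assert.h>
#include <stdatomic.h>

struct outcome { int a, b, x; };

/* Performs the two operations one after the other, in the requested order. */
struct outcome fetch_add_demo(int p1_first)
{
    atomic_int X = 57;
    struct outcome r;
    if (p1_first) {
        r.a = atomic_fetch_add(&X, 3);   /* P1 */
        r.b = atomic_fetch_add(&X, 5);   /* P2 */
    } else {
        r.b = atomic_fetch_add(&X, 5);   /* P2 */
        r.a = atomic_fetch_add(&X, 3);   /* P1 */
    }
    r.x = atomic_load(&X);
    assert(r.x == 65);                   /* neither update is lost */
    return r;
}
```

If P1 goes first, a = 57 and b = 60; if P2 goes first, b = 57 and a = 62. In both cases X ends at 65. Because each Fetch&Add is atomic, a concurrent execution behaves like one of these two orders.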



# **Point to Point Event Ordering – Flags**

Example

Producer wants to indicate to consumer that data is ready

| Processor 1 | Processor 2           |
|-------------|-----------------------|
|             | while (Flag != 1) {;} |
| A[1] =      | = A[1]                |
| A[2] =      | = A[2]                |
| ⋮           | ⋮                     |
| A[n] =      | = A[n]                |
| Flag = 1    |                       |
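The flag handoff in the table can be sketched in C11 as follows. The array size and the names `flag_producer`, `flag_consumer`, and `run_flag_handoff` are assumptions. The release store and acquire load are a choice made here so that the pattern is well-defined under the C11 memory model; the slide itself does not say which ordering guarantees it relies on.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

#define FLAG_N 1000

static int flag_A[FLAG_N];        /* the data being handed over */
static atomic_int Flag;           /* 0 = not ready, 1 = ready */

/* Processor 1: write all the data, then raise the flag. */
static void *flag_producer(void *arg)
{
    (void)arg;
    for (int i = 0; i < FLAG_N; i++)
        flag_A[i] = i + 1;
    /* release: the writes to flag_A are ordered before Flag = 1 */
    atomic_store_explicit(&Flag, 1, memory_order_release);
    return NULL;
}

/* Processor 2: wait for the flag, then read the data. */
static void *flag_consumer(void *arg)
{
    long *sum = arg;
    /* acquire: pairs with the release store in the producer */
    while (atomic_load_explicit(&Flag, memory_order_acquire) != 1)
        ;
    for (int i = 0; i < FLAG_N; i++)
        *sum += flag_A[i];
    return NULL;
}

/* Runs one producer and one consumer; returns the sum the consumer saw. */
long run_flag_handoff(void)
{
    long sum = 0;
    pthread_t producer, consumer;
    atomic_store(&Flag, 0);
    pthread_create(&consumer, NULL, flag_consumer, &sum);
    pthread_create(&producer, NULL, flag_producer, NULL);
    pthread_join(producer, NULL);
    pthread_join(consumer, NULL);
    assert(atomic_load(&Flag) == 1);
    return sum;
}
```

If the consumer could observe `Flag == 1` before all of the producer's writes to the array, it would read stale data; the release/acquire pair rules that out.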

# **Global Event Ordering – Barriers**

Example

- All processors produce some data
- Want to tell all processors that it is ready
- In next phase, all processors consume data produced previously

**Use barriers** 

## **Implementing Barriers**

Simple barrier

temp = Fetch&Inc(count)
while (count != N) {;}

Problem: cannot use it again, because count stays at N once every processor has arrived
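One well-known way to make a counting barrier reusable is sense reversal; the sketch below uses that technique, which is not necessarily the solution the slides go on to present. The thread count, phase count, and all names (`barrier_wait`, `bar_worker`, `run_barrier_test`) are assumptions, and `atomic_fetch_add` stands in for Fetch&Inc.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

#define BAR_N 4                    /* participating threads (assumed) */
#define BAR_PHASES 50              /* how many times the barrier is reused */

static atomic_int bar_count;       /* arrivals in the current phase */
static atomic_bool bar_sense;      /* flips once per completed phase */
static atomic_int phase_work[BAR_PHASES];
static atomic_int bar_errors;

static void barrier_wait(bool *local_sense)
{
    *local_sense = !*local_sense;                   /* sense for this phase */
    if (atomic_fetch_add(&bar_count, 1) == BAR_N - 1) {
        atomic_store(&bar_count, 0);                /* last arrival resets the count */
        atomic_store(&bar_sense, *local_sense);     /* then releases the waiters */
    } else {
        while (atomic_load(&bar_sense) != *local_sense)
            ;                                       /* spin until released */
    }
}

static void *bar_worker(void *arg)
{
    (void)arg;
    bool local_sense = false;
    for (int p = 0; p < BAR_PHASES; p++) {
        atomic_fetch_add(&phase_work[p], 1);        /* "produce" for phase p */
        barrier_wait(&local_sense);
        /* past the barrier, all BAR_N contributions must be visible */
        if (atomic_load(&phase_work[p]) != BAR_N)
            atomic_fetch_add(&bar_errors, 1);
    }
    return NULL;
}

/* Returns the number of barrier violations observed. */
int run_barrier_test(void)
{
    pthread_t threads[BAR_N];
    atomic_store(&bar_count, 0);
    atomic_store(&bar_sense, false);
    atomic_store(&bar_errors, 0);
    for (int p = 0; p < BAR_PHASES; p++)
        atomic_store(&phase_work[p], 0);

    for (int t = 0; t < BAR_N; t++)
        pthread_create(&threads[t], NULL, bar_worker, NULL);
    for (int t = 0; t < BAR_N; t++)
        pthread_join(threads[t], NULL);

    assert(atomic_load(&bar_count) == 0);           /* count was reset after the last phase */
    return atomic_load(&bar_errors);
}
```

The count is reset before the sense flips, so a fast thread entering the next phase always starts from zero, while slow waiters from the previous phase are spinning on the sense variable rather than on the count.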

## **Implementing Barriers (Cont.)**

