# Chapter 5: Thread-Level Parallelism – Part 1

- Introduction
- What is a parallel or multiprocessor system?
- Why parallel architecture?
- Performance potential
- Flynn classification
- Communication models
- Architectures
- Centralized shared-memory
- Distributed shared-memory
- Parallel programming
- Synchronization
- Memory consistency models

## What is a parallel or multiprocessor system?

Multiple processor units working together to solve the same problem

Key architectural issue: Communication model


## Why parallel architectures?

- Absolute performance
- Technology and architecture trends: Dennard scaling, ILP wall, Moore's law
  - $\Rightarrow$ Multicore chips
- Connect multicore chips together for even more parallelism

## Performance Potential

Amdahl's Law is pessimistic

- Let s be the serial part
- Let p be the part that can be parallelized n ways

| Configuration | Execution timeline |
|---------------|--------------------|
| Serial        | SSPPPPPP           |
| 6 processors  | SSP, with the other five P units running at the same time on the other five processors |

Speedup = 8/3 = 2.67

$T(n) = \frac{1}{s + p/n}$, where $T(n)$ is the speedup on $n$ processors and $s + p = 1$

As $n \to \infty$, $T(n) \to \frac{1}{s}$
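The arithmetic behind the slide's example can be checked with a few lines of Python. This is a minimal sketch: the function name `amdahl_speedup` is made up for illustration, and it assumes the fractions are normalized so that s + p = 1.

```python
def amdahl_speedup(s: float, n: int) -> float:
    """Speedup on n processors when a fraction s of the work is serial."""
    p = 1.0 - s                  # parallelizable fraction (assumes s + p = 1)
    return 1.0 / (s + p / n)


# The slide's example: 2 serial units and 6 parallel units, so s = 2/8.
example = amdahl_speedup(s=2 / 8, n=6)
limit = 1.0 / (2 / 8)            # the bound 1/s reached as n grows without limit

print(f"speedup on 6 processors: {example:.2f}")
print(f"upper bound 1/s:         {limit:.2f}")
```

With a quarter of the work serial, no number of processors can push the speedup past 4, which is why the law reads as pessimistic.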


## Performance Potential (Cont.)

Gustafson's Corollary

- Amdahl's law holds if we run the same problem size on larger machines
- But in practice, we run larger problems and "wait" the same time

Assume for larger problem sizes

- Serial time fixed (at s)
- Parallel time proportional to problem size (truth more complicated)

| Configuration       | Execution timeline |
|---------------------|--------------------|
| Old serial          | SSPPPPPP |
| 6 processors        | SSPPPPPP on one processor, PPPPPP on each of the other five |
| Hypothetical serial | SSPPPPPP PPPPPP PPPPPP PPPPPP PPPPPP PPPPPP |

Speedup = (8 + 5\*6)/8 = 4.75

$T'(n) = s + n \cdot p$; as $n \to \infty$, $T'(n) \to \infty$!

How does your algorithm "scale up"?
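The scaled-speedup example can be reproduced both from the formula and by counting work units. This is a sketch under the slide's assumptions (serial time fixed, parallel work growing with the number of processors); the function names are hypothetical.

```python
def scaled_speedup(s: float, n: int) -> float:
    """Scaled speedup T'(n) = s + n*p, with p = 1 - s."""
    p = 1.0 - s
    return s + n * p


def scaled_speedup_from_units(serial_units: int, parallel_units: int, n: int) -> float:
    """Same quantity, counted in work units as on the slide.

    Each of the n processors runs `parallel_units` units of parallel work,
    so the wall-clock time stays the same while total work grows with n.
    """
    parallel_time = serial_units + parallel_units
    hypothetical_serial_time = serial_units + n * parallel_units
    return hypothetical_serial_time / parallel_time


by_formula = scaled_speedup(s=2 / 8, n=6)
by_counting = scaled_speedup_from_units(serial_units=2, parallel_units=6, n=6)

print(f"formula:  {by_formula:.2f}")
print(f"counting: {by_counting:.2f}")
```

Both routes give 38/8 = 4.75, compared with 2.67 for the fixed-size problem on the same six processors.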

## Flynn classification

- Single-Instruction Single-Data (SISD)
- Single-Instruction Multiple-Data (SIMD)
- Multiple-Instruction Single-Data (MISD)
- Multiple-Instruction Multiple-Data (MIMD)










## Communication Models: Message Passing

[Figure: several nodes, each a processor (P) with its own memory (M), attached to an interconnect]

- Each node a computer
  - Processor: runs its own program (as in shared memory, SM)
  - Memory: local to that node, unrelated to other memory
- Add messages for internode communication, send and receive like mail
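The mail analogy can be sketched in Python by giving each node a private dictionary as its memory and a mailbox for incoming messages. This is only a simulation on one machine: the `Node` class and its `send`/`receive` methods are illustrative names, threads stand in for separate computers, and `queue.Queue` stands in for the interconnect.

```python
import queue
import threading


class Node:
    """A node with private memory and a mailbox for incoming messages."""

    def __init__(self, name):
        self.name = name
        self.memory = {}               # local to this node, never shared
        self.mailbox = queue.Queue()   # stand-in for the interconnect port

    def send(self, dest, payload):
        dest.mailbox.put((self.name, payload))

    def receive(self):
        return self.mailbox.get()      # blocks until a message arrives


def run_demo():
    a, b = Node("A"), Node("B")
    a.memory["x"] = 41

    def program_a():
        a.send(b, a.memory["x"])       # mail the value to B
        _sender, reply = a.receive()   # wait for B's answer
        a.memory["y"] = reply

    def program_b():
        _sender, value = b.receive()
        b.memory["x"] = value + 1      # B's own "x", unrelated to A's "x"
        b.send(a, b.memory["x"])

    threads = [threading.Thread(target=program_a),
               threading.Thread(target=program_b)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return a.memory, b.memory


a_mem, b_mem = run_demo()
print(a_mem, b_mem)
```

Both nodes use the name `x`, yet the two values differ, because the only way data crosses between nodes is an explicit send matched by a receive.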











Sarita Adve













Add two matrices: C = A + B

Sequential Program
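The body of the slide's sequential program did not survive, so the following is a stand-in rather than the original code. It sketches a sequential matrix add and, as an assumption about where the example is heading, one possible parallel decomposition in which each worker takes an interleaved set of rows. All function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def add_sequential(A, B):
    """C = A + B, one element at a time."""
    return [[A[i][j] + B[i][j] for j in range(len(A[i]))]
            for i in range(len(A))]


def add_rows(A, B, C, rows):
    """Fill in the given rows of C; rows owned by different workers never overlap."""
    for i in rows:
        for j in range(len(A[i])):
            C[i][j] = A[i][j] + B[i][j]


def add_parallel(A, B, workers=2):
    """Same result, with rows dealt out to workers in round-robin order."""
    n = len(A)
    C = [[0] * len(A[0]) for _ in range(n)]
    chunks = [range(w, n, workers) for w in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(lambda rows: add_rows(A, B, C, rows), chunks))
    return C


A = [[1, 2], [3, 4], [5, 6]]
B = [[10, 20], [30, 40], [50, 60]]
C_seq = add_sequential(A, B)
C_par = add_parallel(A, B)
print(C_seq)
print(C_par == C_seq)
```

No synchronization is needed inside the loop because every element of C is written by exactly one worker. In CPython the threads illustrate the decomposition only; the interpreter lock prevents a real speedup for this kind of loop.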








## Mutual Exclusion

Example

Each processor needs to occasionally update a counter

| Processor 1         | Processor 2         |
|---------------------|---------------------|
| Load reg1, Counter  | Load reg2, Counter  |
| reg1 = reg1 + tmp1  | reg2 = reg2 + tmp2  |
| Store Counter, reg1 | Store Counter, reg2 |
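Why this sequence needs mutual exclusion can be seen by replaying the three instructions under different interleavings. The sketch below is a deterministic simulation, not real concurrent code: `run_schedule` is a hypothetical helper, and the schedule is a list saying which processor executes its next instruction.

```python
def run_schedule(schedule, counter=0, tmp1=1, tmp2=1):
    """Replay Load / add / Store for two processors in the given order.

    `schedule` lists processor ids (1 or 2); each entry runs that
    processor's next instruction. Returns the final value of Counter.
    """
    reg = {1: 0, 2: 0}
    tmp = {1: tmp1, 2: tmp2}
    step = {1: 0, 2: 0}                 # next instruction per processor
    for proc in schedule:
        if step[proc] == 0:
            reg[proc] = counter         # Load reg, Counter
        elif step[proc] == 1:
            reg[proc] += tmp[proc]      # reg = reg + tmp
        elif step[proc] == 2:
            counter = reg[proc]         # Store Counter, reg
        step[proc] += 1
    return counter


one_after_the_other = run_schedule([1, 1, 1, 2, 2, 2])
interleaved = run_schedule([1, 2, 1, 2, 1, 2])

print("P1 then P2:  ", one_after_the_other)
print("interleaved: ", interleaved)
```

When both processors load the counter before either stores it, one of the two updates is lost, so the three-instruction sequence has to execute as a critical section.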

Hardware instructions

Test&Set: atomically tests for 0 and sets to 1

Unset is simply a store of 0

    while (Test&Set(L) != 0) {;}
    Critical Section
    Unset(L)

Problem?
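The spin loop above can be tried out in Python, with one caveat: Python has no hardware Test&Set, so the sketch below simulates the atomic instruction with an internal `threading.Lock` guard. The class `TestAndSetLock` and its method names are illustrative. Because the simulated atomicity comes only from that guard, `unset` takes the guard too, even though on real hardware it is a plain store.

```python
import threading
import time


class TestAndSetLock:
    """Spin lock built on a simulated atomic Test&Set."""

    def __init__(self):
        self._value = 0
        self._guard = threading.Lock()   # stands in for hardware atomicity

    def test_and_set(self):
        with self._guard:                # read old value and write 1 as one step
            old = self._value
            self._value = 1
            return old

    def unset(self):
        with self._guard:                # a plain store of 0 on real hardware
            self._value = 0

    def acquire(self):
        while self.test_and_set() != 0:
            time.sleep(0)                # spin; yield so the holder can run


def worker(lock, shared, increments):
    for _ in range(increments):
        lock.acquire()
        shared["counter"] += 1           # critical section
        lock.unset()


def run(num_threads=4, increments=500):
    lock = TestAndSetLock()
    shared = {"counter": 0}
    threads = [threading.Thread(target=worker, args=(lock, shared, increments))
               for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return shared["counter"]


total = run()
print(total)
```

Every increment happens while the lock is held, so no update is lost. Note that each pass through the spin loop performs a full atomic operation, which bears on the "Problem?" question.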

## Mutual Exclusion Primitives – Alternative?

Test&Test&Set
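One commonly cited motivation for Test&Test&Set is that a plain Test&Set spin loop issues an atomic write on every iteration, which on cache-coherent machines generates traffic even while the lock is held. Test&Test&Set first spins on an ordinary read and attempts the atomic operation only when the lock looks free. The sketch below is a single-threaded counting model, not a real lock: `CountingLock`, `spin`, and the `polls_until_free` parameter are hypothetical, and the holder's release is modeled as happening after a fixed number of polls.

```python
class CountingLock:
    """Lock word that counts how it is accessed while a waiter spins."""

    def __init__(self):
        self.value = 1           # starts out held by another processor
        self.atomic_ops = 0
        self.plain_reads = 0

    def test_and_set(self):
        self.atomic_ops += 1
        old, self.value = self.value, 1
        return old

    def read(self):
        self.plain_reads += 1
        return self.value


def spin(lock, test_first, polls_until_free):
    """Spin until the lock is acquired; the holder unsets it after
    `polls_until_free` failed polls by the waiter."""
    polls = 0
    while True:
        if polls == polls_until_free:
            lock.value = 0       # the holder's Unset happens now
        polls += 1
        if test_first and lock.read() != 0:
            continue             # looks busy: keep reading, no atomic op
        if lock.test_and_set() == 0:
            return               # acquired


ts = CountingLock()
spin(ts, test_first=False, polls_until_free=5)

tts = CountingLock()
spin(tts, test_first=True, polls_until_free=5)

print("Test&Set:      atomic ops =", ts.atomic_ops, " plain reads =", ts.plain_reads)
print("Test&Test&Set: atomic ops =", tts.atomic_ops, " plain reads =", tts.plain_reads)
```

In this model the Test&Set waiter issues six atomic operations, while the Test&Test&Set waiter issues six ordinary reads and a single atomic operation. The model leaves out contention among several waiters at the moment of release.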


Example

Producer wants to indicate to consumer that data is ready

| Processor 1 | Processor 2 |
|-------------|-------------|
| A[1] =      | = A[1]      |
| A[2] =      | = A[2]      |
| ...         | ...         |
| A[n] =      | = A[n]      |


## Mutual Exclusion Primitives – Fetch&Add

    Fetch&Add(var, data)
    { /* atomic action */
        temp = var
        var = temp + data
        return temp
    }

E.g., let X = 57

- P1: a = Fetch&Add(X,3)
- P2: b = Fetch&Add(X,5)

If P1 before P2, ?

If P2 before P1, ?

If P1, P2 concurrent ?
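The three questions can be worked through with a small simulation. As before, Python has no hardware Fetch&Add, so the sketch uses a `threading.Lock` to make the read-modify-write atomic; the `Cell` class and the helper names are hypothetical.

```python
import threading


class Cell:
    """A memory word with a simulated atomic Fetch&Add."""

    def __init__(self, value):
        self.value = value
        self._guard = threading.Lock()   # stands in for hardware atomicity

    def fetch_and_add(self, data):
        with self._guard:
            temp = self.value
            self.value = temp + data
            return temp


def p1_before_p2():
    X = Cell(57)
    a = X.fetch_and_add(3)
    b = X.fetch_and_add(5)
    return a, b, X.value


def p2_before_p1():
    X = Cell(57)
    b = X.fetch_and_add(5)
    a = X.fetch_and_add(3)
    return a, b, X.value


def concurrent():
    X = Cell(57)
    out = {}
    t1 = threading.Thread(target=lambda: out.update(a=X.fetch_and_add(3)))
    t2 = threading.Thread(target=lambda: out.update(b=X.fetch_and_add(5)))
    t1.start()
    t2.start()
    t1.join()
    t2.join()
    return out["a"], out["b"], X.value


print("P1 before P2:", p1_before_p2())
print("P2 before P1:", p2_before_p1())
print("concurrent:  ", concurrent())
```

If P1 goes first, a = 57, b = 60, and X ends at 65. If P2 goes first, b = 57, a = 62, and X again ends at 65. Because the operation is atomic, a concurrent execution must match one of these two orders, so X is 65 either way and only the returned values depend on who wins.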
