### Data Parallel Architectures - SIMD

- Motivation
- Vectors
- Multimedia SIMD
- GPUs

#### Motivation

Recall SIMD from Chapter 5
#### Vector Architectures

- Vector-register machines
  - Load/store architecture
  - All vector operations use registers (except load/store)
  - Multiple ports are cheaper
  - Optimized for small vectors
- Memory-memory vector machines
  - All vectors reside in memory
  - Long startup latency
  - Multiple ports are expensive
  - Optimized for long vectors
- Vectors are often short
- Early machines were memory-memory (TI ASC, CDC STAR)
- Later machines use vector registers

#### VMIPS Architecture

- Strongly based on Cray
- Extends MIPS with vector instructions
- Scalar unit
- Eight vector registers (V0-V7)
  - Each is 64 elements, 64 bits wide
- Five vector functional units
  - FP+, FP\*, FP/, integer, and logical
  - Fully pipelined
- Vector load/store units
  - Fully pipelined

#### VMIPS Architecture, cont.

- Vector-vector instructions
  - Operate on two vectors
  - Produce a third vector
  - `for (i=0; i<64; i++) V1[i] = V2[i] + V3[i]`
  - `ADDVV.D V1, V2, V3`
- Vector-scalar instructions
  - Operate on one vector and one scalar
  - Produce a third vector
  - `for (i=0; i<64; i++) V1[i] = F0 + V3[i]`
  - `ADDVS.D V1, V3, F0`
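
The semantics of the two instruction forms can be sketched in C. This is only an illustrative model, not VMIPS code: the `vreg` type and the function names `addvv_d` and `addvs_d` are hypothetical, chosen to mirror the mnemonics, and the model ignores pipelining entirely.

```c
#include <stddef.h>

#define MVL 64 /* elements per vector register, as stated for VMIPS */

/* Hypothetical model of one vector register: 64 elements, 64 bits each. */
typedef struct {
    double e[MVL];
} vreg;

/* Models ADDVV.D Vd, Va, Vb: element-wise sum of two vector registers. */
void addvv_d(vreg *vd, const vreg *va, const vreg *vb) {
    for (size_t i = 0; i < MVL; i++)
        vd->e[i] = va->e[i] + vb->e[i];
}

/* Models ADDVS.D Vd, Va, Fs: adds one scalar to every element of Va. */
void addvs_d(vreg *vd, const vreg *va, double fs) {
    for (size_t i = 0; i < MVL; i++)
        vd->e[i] = va->e[i] + fs;
}
```

Each function body is the loop that a single vector instruction replaces; the iterations are independent, which is what lets the hardware pipeline them.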

#### VMIPS Architecture, cont.

- Vector load/store instructions
  - Move a vector between memory and a vector register
  - Operate on contiguous addresses
  - `LV V1, R1 ; V1[i] = M[R1 + i]`
  - `SV R1, V1 ; M[R1 + i] = V1[i]`
- Load/store vector with stride
  - Vectors are not always contiguous in memory
  - Add a *non-unit stride* on each access
  - `LVWS V1, (R1, R2) ; V1[i] = M[R1 + i*R2]`
  - `SVWS (R1, R2), V1 ; M[R1 + i*R2] = V1[i]`
- Vector load/store indexed
  - Indirect accesses through an index vector
  - `LVI V1, (R1+V2) ; V1[i] = M[R1 + V2[i]]`
  - `SVI (R1+V2), V1 ; M[R1 + V2[i]] = V1[i]`
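
The three addressing modes differ only in how the address of element `i` is formed. The C sketch below models that, under the assumption that addresses are element indices into a `double` array (matching the `M[R1 + i]` notation above; real hardware would use byte addresses). The function names are hypothetical and simply mirror the mnemonics.

```c
#include <stddef.h>

#define MVL 64 /* elements per vector register */

/* Models LV V1, R1: unit-stride load, v[i] = M[base + i]. */
void lv(double v[MVL], const double *mem, size_t base) {
    for (size_t i = 0; i < MVL; i++)
        v[i] = mem[base + i];
}

/* Models LVWS V1, (R1, R2): strided load, v[i] = M[base + i*stride]. */
void lvws(double v[MVL], const double *mem, size_t base, size_t stride) {
    for (size_t i = 0; i < MVL; i++)
        v[i] = mem[base + i * stride];
}

/* Models LVI V1, (R1+V2): indexed load (gather), v[i] = M[base + idx[i]]. */
void lvi(double v[MVL], const double *mem, size_t base, const size_t idx[MVL]) {
    for (size_t i = 0; i < MVL; i++)
        v[i] = mem[base + idx[i]];
}

/* Models SVI (R1+V2), V1: indexed store (scatter), M[base + idx[i]] = v[i]. */
void svi(double *mem, size_t base, const size_t idx[MVL], const double v[MVL]) {
    for (size_t i = 0; i < MVL; i++)
        mem[base + idx[i]] = v[i];
}
```

For example, reading one column of a 64x64 row-major matrix corresponds to `lvws(v, mem, column, 64)`, while sparse-matrix codes typically need the indexed forms.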

#### Double-Precision A\*X Plus Y (DAXPY)

`for (i=0; i<64; i++) Y[i] = a * X[i] + Y[i]`

| Instruction | Operands   | Effect              |
|-------------|------------|---------------------|
| LD          | F0, a      | Load scalar a       |
| LV          | V1, Rx     | Load vector X       |
| MULVS.D     | V2, V1, F0 | V2 = a \* X         |
| LV          | V3, Ry     | Load vector Y       |
| ADDVV.D     | V4, V2, V3 | V4 = a \* X + Y     |
| SV          | Ry, V4     | Store result into Y |

- Six instructions instead of 600!
- Remember: MIPS means "Meaningless Indicator of Performance"

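As a rough check on the vector DAXPY code, the C sketch below emulates it one instruction at a time. It assumes the usual six-instruction sequence (LD, LV, MULVS.D, LV, ADDVV.D, SV) and element-indexed memory; the function name `daxpy_vmips` and the local arrays standing in for F0 and V1 to V4 are hypothetical.

```c
#include <stddef.h>

#define MVL 64 /* elements per vector register */

/* Emulates the vector DAXPY sequence on 64-element arrays x and y.
 * Each statement models one vector (or scalar) instruction. */
void daxpy_vmips(double a, const double *x, double *y) {
    double f0;
    double v1[MVL], v2[MVL], v3[MVL], v4[MVL];
    size_t i;

    f0 = a;                                            /* LD      F0, a      */
    for (i = 0; i < MVL; i++) v1[i] = x[i];            /* LV      V1, Rx     */
    for (i = 0; i < MVL; i++) v2[i] = v1[i] * f0;      /* MULVS.D V2, V1, F0 */
    for (i = 0; i < MVL; i++) v3[i] = y[i];            /* LV      V3, Ry     */
    for (i = 0; i < MVL; i++) v4[i] = v2[i] + v3[i];   /* ADDVV.D V4, V2, V3 */
    for (i = 0; i < MVL; i++) y[i] = v4[i];            /* SV      Ry, V4     */
}
```

The result matches the scalar loop `Y[i] = a * X[i] + Y[i]`, but the loop control and address arithmetic that dominate the scalar instruction count are folded into the vector instructions.
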
#### Not All Vectors are 64 Elements Long

- Vector length register (VLR)
  - Controls length of vector operations
  - $0 < VLR \le MVL = 64$

`for (i=0; i<100; i++) X[i] = a * X[i]`

| Instruction | Operands   | Comment                                 |
|-------------|------------|-----------------------------------------|
| LD          | F0, a      | Load scalar a                           |
| MTC1        | VLR, 36    | /* 100 - 64 */ first chunk: 36 elements |
| LV          | V1, Rx     |                                         |
| MULVS.D     | V2, V1, F0 |                                         |
| SV          | Rx, V2     |                                         |
| ADD         | Rx, Rx, 36 | Advance past the first 36 elements      |
| MTC1        | VLR, 64    | Second chunk: 64 elements               |
| LV          | V1, Rx     |                                         |
| MULVS.D     | V2, V1, F0 |                                         |
| SV          | Rx, V2     |                                         |


#### Strip Mining

General case: parameter n

<pre>
   DO 10 i = 1, n
      X(i) = a * X(i)
10 CONTINUE
</pre>

Strip-mined version (pseudocode)

<pre>
low = 1
VL = (n mod MVL)              /* Odd-sized piece */
DO 1 j = 0, (n / MVL)         /* Outer loop */
   DO 10 i = low, low+VL-1    /* Length VL */
      X(i) = a * X(i)
10 CONTINUE
   low = low + VL             /* Base of next chunk */
   VL = MVL                   /* Reset length to MAX */
1  CONTINUE
</pre>

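The strip-mining pseudocode translates fairly directly into C. In the sketch below, `scale_chunk` is a hypothetical stand-in for one LV / MULVS.D / SV sequence executed with VLR set to `vl`; the names and the returned chunk count are illustrative assumptions, not part of VMIPS.

```c
#include <assert.h>
#include <stddef.h>

#define MVL 64 /* maximum vector length */

/* Stand-in for one vector sequence (LV, MULVS.D, SV) run with VLR = vl. */
void scale_chunk(double a, double *x, size_t vl) {
    assert(vl > 0 && vl <= MVL); /* hardware constraint: 0 < VLR <= MVL */
    for (size_t i = 0; i < vl; i++)
        x[i] = a * x[i];
}

/* Strip-mined X[i] = a * X[i] for any n.
 * Returns the number of vector chunks issued. */
size_t scale_strip_mined(double a, double *x, size_t n) {
    size_t low = 0;        /* base of the current chunk (0-based in C) */
    size_t vl = n % MVL;   /* odd-sized piece goes first */
    size_t chunks = 0;

    for (size_t j = 0; j <= n / MVL; j++) {
        if (vl > 0) {      /* skip the empty first piece when MVL divides n */
            scale_chunk(a, x + low, vl);
            chunks++;
        }
        low += vl;         /* base of next chunk */
        vl = MVL;          /* all remaining chunks are full length */
    }
    return chunks;
}
```

With `n = 100` this issues two chunks, of 36 and then 64 elements, which is the same split as the hand-written VLR example for 100 elements.
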
### Old Vector Machines Did Not Have Caches

- Caches
  - Vectorizable codes often have poor locality
  - Large vectors don't fit in cache
  - Large vectors flush other data from the cache
  - Cannot exploit known access patterns
  - Unpredictability hurts
  - Degrades cycle time
- Vector registers (like all registers)
  - Very fast
  - Predictable
  - Short identifier
  - Multiple ports easier


#### Compiler Technology

- Must detect vectorizable loops
- Must detect dependences that prevent vectorization
  - Data, anti, and output dependences
  - Only data (or true) dependences are important; the others can be eliminated with renaming
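
The difference between an anti dependence and a true dependence can be illustrated with two small loops. This is a hand-written sketch, not compiler output: the function names are hypothetical, and copying the array into `old_a` stands in for renaming. "Vectorized" here means that all operands are read before any result is written.

```c
#include <stddef.h>
#include <string.h>

#define N 8

/* Anti dependence: iteration i reads a[i+1], which iteration i+1 overwrites. */
void shift_add_scalar(double a[N + 1], const double b[N]) {
    for (size_t i = 0; i < N; i++)
        a[i] = a[i + 1] + b[i];
}

/* Same loop after renaming: reads come from a copy, so the iterations
 * no longer constrain each other and may run in any order or in parallel. */
void shift_add_renamed(double a[N + 1], const double b[N]) {
    double old_a[N + 1];
    memcpy(old_a, a, sizeof old_a);
    for (size_t i = 0; i < N; i++)
        a[i] = old_a[i + 1] + b[i];
}

/* True dependence: iteration i reads a[i-1], produced by iteration i-1. */
void running_sum_scalar(double a[N], const double b[N]) {
    for (size_t i = 1; i < N; i++)
        a[i] = a[i - 1] + b[i];
}

/* Naive vectorization of the loop above: every read sees the OLD a[i-1],
 * so the result differs from the scalar loop. Renaming cannot fix this. */
void running_sum_naive_vector(double a[N], const double b[N]) {
    double old_a[N];
    memcpy(old_a, a, sizeof old_a);
    for (size_t i = 1; i < N; i++)
        a[i] = old_a[i - 1] + b[i];
}
```

The renamed shift loop produces the same values as its scalar version, whereas the naively vectorized running sum does not, which is why only true dependences block vectorization.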