## Data Parallel Architectures - SIMD

- Vectors
- Multimedia SIMD

Motivation

GPUs

- Graphics Processing Units (GPUs): graphics accelerators
- Heterogeneous computing: host CPU + GPU (device)
- Great for graphics: exploit lots of data parallelism
- Can we use GPUs for other computing?
- Multiple forms of parallelism: MIMD, SIMD, ILP, multithreading

How to program?

- 2007: Nvidia developed a C-like language, CUDA (Compute Unified Device Architecture)
- 2009: Khronos Group released OpenCL
- Recent: Heterogeneous System Architecture (HSA), with a unified virtual address space





## Programming Model: Memory

- Originally, separate memories for CPU and GPU: host memory vs. device (or global) memory
- Device/global memory is accessible by all streaming multiprocessors (SMs)
- Recent trend: shared virtual memory, integrated CPU+GPU
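The separate-memory model can be sketched with the CUDA runtime's explicit allocation and copy calls. This is a minimal illustration, not a reference implementation: the kernel `scale`, the helper `scale_on_device`, and the array size are hypothetical names and values chosen for the example.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical kernel for illustration: doubles each element in device memory.
__global__ void scale(int n, double *v)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) v[i] = 2.0 * v[i];
}

// Copies n doubles to device (global) memory, scales them there, copies them back.
// Returns 0 on success, 1 if a CUDA call fails.
int scale_on_device(int n, double *host_v)
{
    double *dev_v = nullptr;
    size_t bytes = n * sizeof(double);

    // Device memory is a separate allocation; the host cannot dereference dev_v.
    if (cudaMalloc((void **)&dev_v, bytes) != cudaSuccess) return 1;
    if (cudaMemcpy(dev_v, host_v, bytes, cudaMemcpyHostToDevice) != cudaSuccess) {
        cudaFree(dev_v);
        return 1;
    }

    scale<<<(n + 255) / 256, 256>>>(n, dev_v);

    // This copy runs after the kernel on the default stream, so it sees the results.
    cudaError_t err = cudaMemcpy(host_v, dev_v, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dev_v);
    return err == cudaSuccess ? 0 : 1;
}

int main()
{
    const int n = 1000;
    double *v = (double *)malloc(n * sizeof(double));
    for (int i = 0; i < n; i++) v[i] = i;

    if (scale_on_device(n, v) != 0) {
        printf("CUDA error\n");
        free(v);
        return 1;
    }
    printf("v[10] = %f\n", v[10]);
    free(v);
    return 0;
}
```

The two `cudaMemcpy` calls are the visible cost of separate memories: data must be staged explicitly in each direction before and after the kernel runs.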

## DAXPY

```
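// Sequential C: y = a*x + y over n elements
for (int i = 0; i < n; i++)
    y[i] = a*x[i] + y[i];

// CUDA: n threads, one per vector element, 256 threads per thread block

// Host code: round the block count up so that all n elements are covered
int nblocks = (n + 255) / 256;
daxpy<<<nblocks, 256>>>(n, 2.0, x, y);

// Device code: a kernel launched from the host must be declared __global__
__global__
void daxpy(int n, double a, double *x, double *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global element index
    if (i < n) y[i] = a*x[i] + y[i];                // guard the partial last block
}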
```
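The fragment above leaves out allocation, initialization, and synchronization. A minimal complete sketch follows, assuming unified (managed) memory via `cudaMallocManaged` so that host and device use the same pointers. The driver `run_daxpy`, the input values, and the check against a CPU result are hypothetical additions for illustration. The kernel is declared `__global__`, which CUDA requires for any function launched with `<<<...>>>`.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// One thread per element: y[i] = a*x[i] + y[i].
__global__ void daxpy(int n, double a, const double *x, double *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

// Hypothetical driver: sets x[i] = i and y[i] = 1, runs DAXPY on the GPU, and
// returns how many elements differ from the CPU result (-1 on a CUDA error).
int run_daxpy(int n, double a)
{
    double *x = nullptr, *y = nullptr;
    if (cudaMallocManaged((void **)&x, n * sizeof(double)) != cudaSuccess) return -1;
    if (cudaMallocManaged((void **)&y, n * sizeof(double)) != cudaSuccess) {
        cudaFree(x);
        return -1;
    }
    for (int i = 0; i < n; i++) { x[i] = i; y[i] = 1.0; }

    int nblocks = (n + 255) / 256;            // round up: last block may be partial
    daxpy<<<nblocks, 256>>>(n, a, x, y);
    cudaError_t err = cudaDeviceSynchronize(); // wait before the host reads y

    int mismatches = 0;
    if (err == cudaSuccess)
        for (int i = 0; i < n; i++)
            if (y[i] != a * i + 1.0) mismatches++;

    cudaFree(x);
    cudaFree(y);
    return err == cudaSuccess ? mismatches : -1;
}

int main()
{
    // n = 1000 is not a multiple of 256, so the (i < n) guard is exercised.
    int mismatches = run_daxpy(1000, 2.0);
    printf("mismatches: %d\n", mismatches);
    return mismatches == 0 ? 0 : 1;
}
```

With `n = 1000` the launch uses 4 blocks of 256 threads, so 24 threads in the last block fall outside the array and are masked off by the `i < n` test.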





## Key Features and Challenges

- Hardware-managed thread/thread block scheduling
  - Thread blocks to SMs
  - Warps within an SM
- Multithreading for latency tolerance
- Scratchpad, aka shared memory
- Caches, coherence, consistency
- Synchronization between SMs through atomics
- SIMD divergence
- Future of accelerators/heterogeneous computing?
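Several of these features show up together in a block-wise sum. The sketch below is one possible illustration, not a tuned reduction: each thread block accumulates its slice in shared memory, and one thread per block publishes the partial sum with an atomic. The names `block_sum`, `sum_on_gpu`, and `THREADS` are hypothetical, and managed memory is assumed for brevity.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define THREADS 256  // threads per block; must be a power of two for the loop below

__global__ void block_sum(int n, const int *in, int *total)
{
    __shared__ int partial[THREADS];   // scratchpad, visible only within this block
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;

    partial[tid] = (i < n) ? in[i] : 0;
    __syncthreads();                   // barrier across the block, not across SMs

    // Tree reduction. Active threads are tid < stride, so they stay contiguous:
    // whole warps drop out together, and the branch splits a warp only once
    // stride falls below the warp size (32 on current Nvidia GPUs).
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride) partial[tid] += partial[tid + stride];
        __syncthreads();
    }

    // Blocks may run on different SMs, so they combine results with an atomic.
    if (tid == 0) atomicAdd(total, partial[0]);
}

// Hypothetical driver: sums n ones on the GPU; returns the sum, or -1 on a CUDA error.
int sum_on_gpu(int n)
{
    int *in = nullptr, *total = nullptr;
    if (cudaMallocManaged((void **)&in, n * sizeof(int)) != cudaSuccess) return -1;
    if (cudaMallocManaged((void **)&total, sizeof(int)) != cudaSuccess) {
        cudaFree(in);
        return -1;
    }
    for (int i = 0; i < n; i++) in[i] = 1;
    *total = 0;

    block_sum<<<(n + THREADS - 1) / THREADS, THREADS>>>(n, in, total);
    cudaError_t err = cudaDeviceSynchronize();

    int result = (err == cudaSuccess) ? *total : -1;
    cudaFree(in);
    cudaFree(total);
    return result;
}

int main()
{
    printf("sum = %d\n", sum_on_gpu(1000));
    return 0;
}
```

Integer addition is used so the result does not depend on the order in which blocks reach the `atomicAdd`. A floating-point version would work the same way but could round differently from run to run.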