Spring 2018 :: CSE 502



# Data-Parallel Architectures

Nima Honarmand



## Overview

- Data-Level Parallelism (DLP) vs. Thread-Level Parallelism (TLP)
  - In DLP, parallelism arises from independent execution of the same code on a large number of data objects
  - In TLP, parallelism arises from independent execution of different threads of control
- Hypothesis: many applications that use massively parallel machines exploit data parallelism
  - Common in the Scientific Computing domain
  - Also, multimedia (image and audio) processing
  - And more recently data mining and AI



## Interlude: Flynn's Taxonomy (1966)

- Michael Flynn classified parallelism across two dimensions: Data and Control
  - Single Instruction, Single Data (SISD)
    - Our uniprocessors
  - Single Instruction, Multiple Data (SIMD)
    - Same inst. executed by different "processors" using different data
    - Basis of DLP architectures: vector, SIMD extensions, GPUs
  - Multiple Instruction, Multiple Data (MIMD)
    - TLP architectures: SMPs and multi-cores
  - Multiple Instruction, Single Data (MISD)
    - Just for the sake of completeness, no real architecture
- DLP originally associated w/ SIMD; now **SIMT** is also common
  - SIMT: Single Instruction Multiple Threads
  - SIMT found in NVIDIA GPUs



## **Examples of Data-Parallel Code**

• SAXPY: **Y** = a\***X** + **Y** 

for (i = 0; i < n; i++)
Y[i] = a \* X[i] + Y[i]</pre>

• Matrix-Vector Multiplication:  $\mathbf{A}_{m \times 1} = \mathbf{M}_{m \times n} \times \mathbf{V}_{n \times 1}$ 

for (i = 0; i < m; i++)
for (j = 0; j < n; j++)
 A[i] += M[i][j] \* V[j]</pre>



## Overview

- Many incarnations of DLP architectures over decades
  - Vector processors
    - Cray processors: Cray-1, Cray-2, ..., Cray X1
  - SIMD extensions
    - Intel MMX, SSE\* and AVX\* extensions
  - Modern GPUs
    - NVIDIA, AMD, Qualcomm, ...
- General Idea: use statically-known DLP to achieve higher throughput
  - instead of discovering parallelism in hardware as OOO super-scalars do
  - Focus on throughput rather than latency



# Vector Processors



## Vector Processors

- Basic idea:
  - Read sets of data elements into "vector registers"
  - Operate on those registers
  - Disperse the results back into memory
- Registers are controlled by compiler
  - Used to hide memory latency
  - Leverage memory bandwidth
- Hide memory latency by:
  - Issuing all memory accesses for a vector load/store together
  - Using chaining (later) to compute on earlier vector elements while waiting for later elements to be loaded

### **Vector Processors**



Scalar processors operate on single numbers (scalars)

 Vector processors operate on linear sequences of numbers (vectors)



## **Components of a Vector Processor**

- A scalar processor (e.g. a MIPS processor)
  - Scalar register file (32 registers)
  - Scalar functional units (arithmetic, load/store, etc)
- A vector register file (a 2D register array)
  - Each register is an array of elements
  - E.g. 32 registers with 32 64-bit elements per register
  - MVL = maximum vector length = max # of elements per register
- A set of vector functional units
  - Integer, FP, load/store, etc
  - Some times vector and scalar units are combined (share ALUs)



## Simple Vector Processor Organization





## **Basic Vector ISA**

| <b>Instruction</b>      | <b>Operation</b>       | <u>Comments</u>                      |
|-------------------------|------------------------|--------------------------------------|
| vadd.vv v1, v2,         | v3 v1=v2+v3            | vector + vector                      |
| vadd.sv v1, <b>r0</b> , | v2 v1=r0+v2            | scalar + vector                      |
| vmul.vv v1, v2,         | v3 v1=v2*v3            | vector x vector                      |
| vmul.sv v1, <b>r0</b> , | v2 v1=r0*v2            | scalar x vector                      |
| vld v1, r1              | v1=m[r1r1+63]          | load, stride=1                       |
| vld <b>s</b> v1, r1,    | r2 v1=m[r1r1+63*r2]    | load, stride=r2                      |
| vld <b>x</b> v1, r1,    | v2 v1=m[r1+v2[i], i=06 | 53] indexed load ( <i>gather</i> )   |
| vst v1, r1              | m[r1r1+63]=v1          | store, stride=1                      |
| vst <b>s</b> v1, r1,    | r2 v1=m[r1r1+63*r2]    | store, stride=r2                     |
| vst <b>x</b> v1, r1,    | v2 v1=m[r1+v2[i], i=06 | 53] indexed store ( <i>scatter</i> ) |

#### + regular scalar instructions



## SAXPY in Vector ISA vs. Scalar ISA

• For now, assume array length = vector length (say 32)

| loop: | addi<br>fld<br>fmul<br>fld<br>fadd<br>fst<br>addi<br>addi | <pre>f1, 0(x5) f1, f1, f0 f2, 0(x6) f2, f2, f1 f2, 0(x6) x5, x5, 4 x6, x6, 4</pre> | <pre># last addr to load # load x[i] # a * X[i] # Load Y[i] # a * X[i] + Y[i]</pre> | Scalar |
|-------|-----------------------------------------------------------|------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------|--------|
|       | vld<br>Vmul<br>vld<br>vadd                                | f0, a<br>v0, x5<br>v1, f0, v0<br>v2, x6<br>v3, v1, v2<br>v3, x6                    | ± ±                                                                                 | Vector |



# Vector Length (VL)

- Usually, array length not equal to (or a multiple of) maximum vector length (MVL)
- Can **strip-mine** the loop to make inner loops a multiple of MVL, and use an explicit VL register for the remaining part

| (i = j; ;<br>Y[i] = a<br>< n; i+4                      | i < mvl; i++)<br>* X[i] + Y[i<br>+)                                                                                                          |                                                                                                                                                                                                                    | Strip-mined<br>C code                                                                                                  |
|--------------------------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------------------|
| setvl<br>vld<br>Vmul<br>vld<br>vadd<br>vst<br>// decre | x1<br>v0, x5<br>v1, f0, v0<br>v2, x6<br>v3, v1, v2<br>v3, x6<br>ment x1 by VI                                                                | <pre># set VL = min(n, mvl) # load vector X # vector-scalar multiply # load vector Y # vector-vector add # store the sum in Y</pre>                                                                                | Strip-mined<br>Vector code                                                                                             |
|                                                        | <pre>(i = j;<br/>Y[i] = a<br/>&lt; n; i++<br/>= a * X<br/>fld<br/>setvl<br/>vld<br/>Vmul<br/>vld<br/>vld<br/>vadd<br/>vst<br/>// decre</pre> | <pre>Y[i] = a * X[i] + Y[:<br/>&lt; n; i++)<br/>= a * X[i] + Y[i];<br/>fld f0, a<br/>setvl x1<br/>vld v0, x5<br/>Vmul v1, f0, v0<br/>vld v2, x6<br/>vadd v3, v1, v2<br/>vst v3, x6<br/>// decrement x1 by VI</pre> | <pre>(i = j; i &lt; mvl; i++) Y[i] = a * X[i] + Y[i]; &lt; n; i++) = a * X[i] + Y[i]; fld f0, a  # load scalar a</pre> |



## Advantages of Vector ISA

- Compact: single instruction defines N operations
  - Amortizes the cost of instruction fetch/decode/issue
  - Also reduces the frequency of branches
- Parallel: N operations are (data) parallel
  - No dependencies
  - No need for complex hardware to detect parallelism
  - Can execute in parallel assuming N parallel functional units
- **Expressive**: memory operations describe patterns
  - Continuous or regular memory access pattern
  - Can prefetch or accelerate using wide/multi-banked memory
  - Can amortize high latency for 1st element over large sequential pattern



## **Optimization 1: Chaining**

• Consider the following code:

| vld     | v3, r4  |    |                                   |
|---------|---------|----|-----------------------------------|
| vmul.sv | v6, r5, | v3 | <pre># very long RAW hazard</pre> |
| vadd.vv | v4, v6, | v5 | <pre># very long RAW hazard</pre> |

#### • Chaining:

- v1 is not a single entity but a group of individual elements
- vmul can start working on individual elements of v1 as they become ready
- Same for v6 and vadd
- Can allow any vector operation to chain to any other active vector operation
  - By having register files with many read/write ports



## **Optimization 2: Multiple Lanes**



#### Modular, scalable design

- Elements for each vector register interleaved across the lanes
- Each lane receives identical control
- Multiple element operations executed per cycle
- No need for inter-lane communication for most vector instructions

## Chaining & Multi-Iane Example





## **Optimization 3: Vector Predicates**

• Suppose you want to vectorize this:

```
for (i=0; i<N; i++)
    if (A[i]!= B[i]) A[i] -= B[i];</pre>
```

- Solution: vector conditional execution (predication)
  - Add vector flag registers with single-bit elements (masks)
  - Use a vector compare to set the a flag register
  - Use flag register as mask control for the vector sub
    - Do subtraction only for elements w/ corresponding flag set

| vld         | v1, x5         | # load A           |
|-------------|----------------|--------------------|
| vld         | v2, x6         | # load B           |
| vcmp.neq.vv | m0, v1, v2     | # vector compare   |
| vsub.vv     | v1, v1, v2, m0 | # conditional vsub |
| vst         | v1, x5, m0     | # store A          |



## Strided Vector Load/Stores

• Consider the following matrix-matrix multiplication:

- Can vectorize multiplication of rows of B with **columns** of D
  - D's elements have non-unit stride
  - Use normal vld for B and vlds (strided vector load) for D



## Indexed Vector Load/Stores

- A.k.a, gather (indexed load) and scatter (indexed store)
- Consider the following **sparse** vector-vector addition:

- Can vectorize the addition operation?
  - Yes, but need a way to vector load/store to random addresses
  - Use indexed vector load/stores

| vld  | v0, x7                             | # load K[]                |  |
|------|------------------------------------|---------------------------|--|
| vldx | <b>v1, x5, v0</b>                  | <pre># load A[K[]]</pre>  |  |
| vld  | v2, x28                            | # load M[]                |  |
| vldx | <b>v</b> 3, <b>x</b> 6, <b>v</b> 2 | <pre># load C[M[]]</pre>  |  |
| vadd | v1, v1, v3                         | # add                     |  |
| vstx | <b>v1, x5, v0</b>                  | <pre># store A[K[]]</pre> |  |



## Memory System Design

- DLP workload are very memory intensive
  - Because of large data sets
  - Caches and compiler optimizations can help but not enough
- Supporting strided and indexed vector loads/stores can generate many parallel memory accesses
  - How to support efficiently?
- **Banking**: spread memory across many banks w/ fine interleaving
  - Can access all banks in parallel if no bank conflict; otherwise will need to stall (structural hazard)
- Example:
  - 32 processors, each generating 4 loads and 2 stores/cycle
  - Processor cycle time is 2.25 ns, Memory cycle time is 15 ns
  - How many memory banks needed?



# SIMD ISA Extensions



## SIMD Extensions (1)

- SIMD extensions are a smaller version of vector processors
  - Integrated with ordinary scalar processors
  - E.g., MMX, SSE and AVX extensions for x86
- The original idea was to use a functional unit built for a single large operation for many parallel smaller ops
  - E.g., using one 64-bit adder to do eight 8-bit addition by partitioning the carry chain
- Initially, they were not meant to focus on memory-intensive data-parallel applications, but rather digital signalprocessing (DSP) applications
  - DSP apps are more compute-bound than memory-bound
  - DSP apps usually use smaller data types

#### Hiding memory-latency was not originally an issue!



## SIMD Extensions (2)

- SIMD extensions were slow to add vector ideas such as vector length, strided and indexed load/stores, predicated execution, etc.
- Things are changing now because of Big Data applications that are memory bound
- E.g., AVX-512 (available in recent Intel processors)
  - Has vectors of 512 bits (8 64-bit elements or 64 8-bit elements)
  - Supports all of the above vector load/stores and other features



GDDR MC

TD

77

SOLE

77

SUC

77

SUC

TD

77

SOLE



Stony Brook University

Scalar RF

ALU 1

ALU O

L1 TLB and 32KB Data Cache



- Targeting HPC market (Goal: high GFLOPS, GFLOPS/Watt)

GDDR

- 4 hardware threads + wide SIMD units
  - Vector ISA: 32 vector registers (512b), 8 mask registers, scatter/gather

**VPU RF** 

VPU 512b SIMD X87 RF

X87

- In-order, short pipeline
  - Why in-order?

Spring 2018 :: CSE 502



# GPUs



## Graphics Processing Unit (GPU)

- An architecture for compute-intensive, highly dataparallel computation
  - Exactly what graphics rendering is about
  - Transistors devoted to data processing rather than caching and flow control







## Data Parallelism in GPUs

- GPUs take advantage of massive DLP to provide very high FLOP rates
  - More than 1 Tera DP FLOP in NVIDIA GK110
- *SIMT* execution model
  - Single instruction multiple threads
  - Trying to distinguish itself from both "vectors" and "SIMD"
  - A key difference: better support for conditional control flow
- Program it with CUDA or OpenCL (among other things)
  - Extensions to C
  - Perform a "shader task" (a snippet of scalar computation) over many elements
  - Internally, GPU uses scatter/gather and vector-mask-like operations



## CUDA

- Extension of the C language
- Function types
  - *Device code* (kernel) : run on the GPU
  - *Host code*: run on the CPU and calls device programs
- Extensions / API
  - Function type : \_\_global\_\_, \_\_device\_\_, \_\_host\_\_\_
  - Variable type : \_\_\_\_shared\_\_\_, \_\_\_constant\_\_\_
    - Affects allocation of variable in different types of memory
  - cudaMalloc(), cudaFree(), cudaMemcpy(),...
  - \_\_\_\_syncthread(), atomicAdd(),...



## **CUDA Software Model**

- A kernel is executed as a grid of thread blocks
  - Per-thread register and localmemory space
  - Per-block shared-memory space
  - Shared global memory space
- Blocks are considered cooperating arrays of threads
  - Share memory
  - Can synchronize
- Blocks within a grid are independent
  - can execute concurrently
  - No cooperation across blocks



Thread



### SAXPY in CUDA



- Each CUDA thread operates on one data element
  - That's the reason behind MT in SIMT
- Hardware tries to execute these threads in lock-step as long as they all execute the same instruction together
  - That's the SI part in SIMT
- We'll see how shortly

### **Heterogeneous Programming**



### Use the right processor for the right job



Compiling CUDA

- nvcc
  - Compiler driver
  - Invoke cudacc, g++, cl
- PTX

   Parallel Thread eXecution

ld.global.v4.f32 {\$f1,\$f3,\$f5,\$f7}, [\$r9+0]; mad.f32 \$f1, \$f5, \$f3, \$f1;



Stony Brook University



## CUDA Hardware Model

- Follows the software model closely
- Each thread block executed by a single multiprocessor
  - Synchronized using shared memory
- Many thread blocks assigned to a single multiprocessor
  - Executed concurrently in a FGMT fashion
  - Keep GPU as busy as possible
- Running many threads in parallel can hide DRAM memory latency
  - Global memory access can be several hundred cycles





Source: NVIDIA's Next Generation CUDA Compute Architecture: Kepler GK110

Stony Brook University

- 15 SMX processors, shared L2, 6 memory controllers
  - 1 TFLOP dual-precision FP
- HW thread scheduling
  - No OS involvement in scheduling



## Streaming Multiprocessor (SMX)

- Capabilities
  - 64K registers
  - 192 simple cores
    - Int and SP FPU
  - 64 DP FPUs
  - 32 LD/ST Units (LSU)
  - 32 Special Function Units (FSU)
- Warp Scheduling
  - 4 independent warp schedulers
  - 2 inst dispatch per warp

|      |       |       |         |      |      |        |          | Ins    | tructi   | on Ca | che   |        |         |      |      |        |         |        |    |
|------|-------|-------|---------|------|------|--------|----------|--------|----------|-------|-------|--------|---------|------|------|--------|---------|--------|----|
|      | War   | p Sch | neduler |      |      | Wa     | rp Schee | duler  |          |       | War   | rp Sch | eduler  |      |      | Wa     | rp Sche | duler  |    |
| Di   | spatc | h     | Dispat  | ch   | Di   | ispato | :h I     | Dispat | tch      | Di    | spatc | h      | Dispat  | ch   | D    | ispato | :h      | Dispat | ch |
|      |       |       |         |      |      |        | Regi     | ster F | File (I  | 65,53 | 6 x 3 | 2-bit) |         |      |      |        |         |        |    |
| Ŧ    | ÷     | ÷     | +       | ÷    | ÷    | ÷      | +        | ÷      |          |       | ÷     | +      | +       | ÷    | ÷    | ÷      | +       | ÷      | -  |
| Core | Core  | Core  | DP Unit | Core | Core | Core   | DP Unit  | LD/ST  | SFU      | Core  | Core  | Core   | DP Unit | Core | Core | Core   | DP Unit | LD/ST  | SF |
| Core | Core  | Core  | DP Unit | Core | Core | Core   | DP Unit  | LD/ST  | SFU      | Core  | Core  | Core   | DP Unit | Core | Core | Core   | DP Unit | LD/ST  | SF |
| Core | Core  | Core  | DP Unit | Core | Core | Core   | DP Unit  | LD/ST  | SFU      | Core  | Core  | Core   | DP Unit | Core | Core | Core   | DP Unit | LD/ST  | SF |
| Core | Core  | Core  | DP Unit | Core | Core | Core   | DP Unit  | LD/ST  | SFU      | Core  | Core  | Core   | DP Unit | Core | Core | Core   | DP Unit | LD/ST  | SF |
| Core | Core  | Core  | DP Unit | Core | Core | Core   | DP Unit  | LD/ST  | SFU      | Core  | Core  | Core   | DP Unit | Core | Core | Core   | DP Unit | LD/ST  | si |
| Core | Core  | Core  | DP Unit | Core | Core | Core   | DP Unit  | LD/ST  | SFU      | Core  | Core  | Core   | DP Unit | Core | Core | Core   | DP Unit | LD/ST  | si |
| Core | Core  | Core  | DP Unit | Core | Core | Core   | DP Unit  | LD/ST  | SFU      | Core  | Core  | Core   | DP Unit | Core | Core | Core   | DP Unit | LD/ST  | SI |
| Core | Core  | Core  | DP Unit | Core | Core | Core   | DP Unit  | LD/ST  | SFU      | Core  | Core  | Core   | DP Unit | Core | Core | Core   | DP Unit | LD/ST  | s  |
| Core | Core  | Core  | DP Unit | Core | Core | Core   | DP Unit  | LD/ST  | SFU      | Core  | Core  | Core   | DP Unit | Core | Core | Core   | DP Unit | LD/ST  | s  |
| Core | Core  | Core  | DP Unit | Core | Core | Core   | DP Unit  | LD/ST  | SFU      | Core  | Core  | Core   | DP Unit | Core | Core | Core   | DP Unit | LD/ST  | s  |
| Core | Core  | Core  | DP Unit | Core | Core | Core   | DP Unit  | LD/ST  | SFU      | Core  | Core  | Core   | DP Unit | Core | Core | Core   | DP Unit | LD/ST  | s  |
| Core | Core  | Core  | DP Unit | Core | Core | Core   | DP Unit  | LD/ST  | SFU      | Core  | Core  | Core   | DP Unit | Core | Core | Core   | DP Unit | LD/ST  | s  |
| Core | Core  | Core  | DP Unit | Core | Core | Core   | DP Unit  | LD/ST  | SFU      | Core  | Core  | Core   | DP Unit | Core | Core | Core   | DP Unit | LD/ST  | s  |
| Core | Core  | Core  | DP Unit | Core | Core | Core   | DP Unit  | LD/ST  | SFU      | Core  | Core  | Core   | DP Unit | Core | Core | Core   | DP Unit | LD/ST  | S  |
| Core | Core  | Core  | DP Unit | Core | Core | Core   | DP Unit  | LD/ST  | SFU      | Core  | Core  | Core   | DP Unit | Core | Core | Core   | DP Unit | LD/ST  | s  |
| Core | Core  | Core  | DP Unit | Core | Core | Core   | DP Unit  | LD/ST  | SFU      | Core  | Core  | Core   | DP Unit | Core | Core | Core   | DP Unit | LD/ST  | s  |
|      |       |       |         |      |      |        | 64 KB    |        |          | et Ne |       | Cac    | he      |      |      |        |         |        |    |
|      |       |       |         |      |      |        |          | B Rea  |          |       |       |        |         |      |      |        |         |        |    |
|      | Tex   |       | Tex     |      |      | Tex    |          | Tex    | ¢        |       | Tex   |        | Tex     | :    |      | Tex    |         | Tex    | :  |
|      | Tex   |       | Tex     |      |      | Tex    |          | Tex    | <b>,</b> |       | Tex   |        | Tex     |      |      | Tex    |         | Tex    |    |

Source: NVIDIA's Next Generation CUDA Compute Architecture: Kepler GK110

Spring 2018 :: CSE 502



- Warp: A set of threads that execute the same instruction (on different data elements)
- Fine-grained multithreading
  - One instruction per thread in pipeline at a time (No branch prediction)
  - Interleave warp execution to hide latencies
- Register values of all threads stay in register file
  - No OS context switching



Stony Brook University



### Warp-based SIMT vs. Traditional SIMD

- Traditional SIMD consists of a single thread
  - − SIMD Programming model (no threads) → SW needs to know vector length
  - ISA contains vector/SIMD instructions
- Warp-based SIMT consists of multiple scalar threads
  - Same instruction executed by all threads
    - Does not have to be lock step
  - Each thread can be treated individually
    - i.e., placed in a different warp  $\rightarrow$  programming model not SIMD
    - SW does not need to know vector length
    - Enables memory and branch latency tolerance
  - ISA is scalar  $\rightarrow$  vector instructions formed dynamically



# Warp Scheduling in Kepler

- 64 warps per SMX
  - 32 threads per warp
  - 64K registers/SMX
  - Up to 255 registers per thread
- Scheduling
  - 4 schedulers select 1 warp per cycle each
  - 2 independent instructions issued per warp
  - Total bandwidth = 4 \* 2 \* 32 = 256 ops/cycle
- Register Scoreboarding
  - To track ready instructions for long latency ops
- Compiler handles scheduling for fixedlatency operations
  - Binary incompatibility?



Source: NVIDIA's Next Generation CUDA Compute Architecture: Kepler GK110



















## Memory Hierarchy



Source: NVIDIA's Next Generation CUDA Compute Architecture: Kepler GK110

- Each SMX has 64KB of memory
  - Split between shared mem and L1 cache
    - 16/48, 32/32, 48/16
  - 256B per access
- 48KB read-only data cache
   Compiler controlled
- 1.5MB shared L2
- Support for atomic operations

   atomicCAS, atomicADD, ...
- Throughput-oriented main memory
  - Memory coalescing
  - Graphics DDR (GDDR)
    - Very wide channels: 256 bit vs. 64 bit for DDR
    - Lower clock rate than DDR