The MapReduce architecture gives us parallelization for ‘free’, but it comes with trade-offs and overhead that may be too large for certain applications, for example when the algorithm is compute-heavy but requires little data transfer and movement (e.g. scientific simulations).
In the 90s, if you wanted your program to run twice as fast, you could either do a lot of work optimizing the program or… wait 18 months and buy a new computer. Back then, scaling was simple: clock speeds kept increasing, so no work was needed to take advantage of the new hardware.
There is no free lunch any more - hardware no longer improves (quickly) in a way that makes your program run faster automatically. However, transistor density is still increasing and there are more parallel units on a chip; we must rewrite our programs to take advantage of this hardware.
Types of Parallelism
Data Dependence Graph
Break a problem down into its components, and create a DAG out of it. Vertices are tasks and edges are dependencies.
Critical path: the longest chain of dependent tasks; if any task on the critical path takes longer, it slows down the entire system
Independence: if a task is independent of another, it can run simultaneously with that task
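A minimal sketch of the idea in Python (the task names and one-unit costs below are made up for illustration): represent the DAG as a dependency dict and take the critical-path length as the longest chain of dependent task costs.
from functools import cache

# Toy data dependence graph: each task maps to the tasks it depends on.
# The task names and one-unit costs are illustrative only.
deps = {
    "load_a": [],
    "load_b": [],
    "c": ["load_a", "load_b"],   # c = 2*a + b
    "d": ["load_a"],             # d = a**2
    "e": ["c", "d"],             # e = c + d
}
cost = {task: 1 for task in deps}    # assume every task takes 1 time unit

@cache
def path_length(task):
    # Longest-cost chain of dependencies ending at `task`: no schedule,
    # however many workers it uses, can finish faster than this.
    return cost[task] + max((path_length(d) for d in deps[task]), default=0)

print(max(path_length(t) for t in deps))   # critical path length -> 3
Here load_a -> c -> e (or load_a -> d -> e) is a critical path; the remaining tasks are independent of it and can overlap with it.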
Data Parallelism
Independent tasks apply the same operation to different elements of a data set, so the operations can be performed concurrently, e.g. a = [b[i] + c[i] for i in range(100)].
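One way this could look above a single core, sketched with multiprocessing (the worker count and chunk size are arbitrary, and for an operation this cheap the process overhead would normally outweigh the gain):
from multiprocessing import Pool
from operator import add

b = list(range(100))
c = list(range(100))

def main():
    # Each worker applies the same operation (addition) to a different
    # chunk of the element pairs; results come back in order.
    with Pool(processes=4) as pool:
        a = pool.starmap(add, zip(b, c), chunksize=25)
    assert a == [b[i] + c[i] for i in range(100)]

if __name__ == "__main__":
    main()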
Functional Parallelism
Independent tasks apply different operations to different data elements, e.g.
a = 10
b = 4
c = 2 * a + b  # independent of d
d = a ** 2     # independent of c
e = c + d      # depends on both c and d
The processor will execute such instructions out of order so that independent instructions (here, the computations of c and d) run in parallel.
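The same independence can also be exploited above the hardware level, e.g. with threads. A rough sketch using concurrent.futures, where the lambdas stand in for genuinely expensive independent computations:
from concurrent.futures import ThreadPoolExecutor

a, b = 10, 4

# c and d do not depend on each other, so they can be computed concurrently;
# only e has to wait for both.
with ThreadPoolExecutor(max_workers=2) as pool:
    fc = pool.submit(lambda: 2 * a + b)   # stand-in for an expensive task
    fd = pool.submit(lambda: a ** 2)      # stand-in for another independent task
    e = fc.result() + fd.result()

print(e)  # 124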
Pipelining
Divide a process into stages to produce several items simultaneously - an assembly line.
Even if a task is highly sequential, so that each stage can only be done by a single process, all stages can be done simultaneously, with each stage working on a separate item.
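A small software analogue of an assembly line, assuming two made-up stages connected by a queue: each stage is sequential on its own, but the two stages overlap on different items.
import queue
import threading

SENTINEL = object()                 # marks the end of the stream
q = queue.Queue(maxsize=4)          # hand-off buffer between the two stages
results = []

def stage1(items, out):
    # First stage: produce partially processed items one at a time.
    for x in items:
        out.put(x * 2)
    out.put(SENTINEL)

def stage2(inp, out):
    # Second stage: consume items as soon as stage 1 hands them over,
    # while stage 1 is already working on the next item.
    while (x := inp.get()) is not SENTINEL:
        out.append(x + 1)

t1 = threading.Thread(target=stage1, args=(range(10), q))
t2 = threading.Thread(target=stage2, args=(q, results))
t1.start(); t2.start()
t1.join(); t2.join()
print(results)                      # [1, 3, 5, ..., 19]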
Parallelism
Why can’t this be done automatically by the compiler/hardware? Because it is difficult for programmers to communicate their intentions to the compiler and hardware.
Instruction-Level Parallelism
There is very little parallelism available within a basic block of code; hence, parallelism must be exploited across multiple basic blocks.
Pipelining
Instruction execution occurs in multiple stages (e.g. fetching the instruction, copying memory to registers, executing) that use separate parts of the processor. Hence, multiple instructions can be in flight at once, one instruction in each stage, although only one instruction finishes executing per cycle.
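A toy model of that overlap, assuming a simplified three-stage pipeline (fetch, decode, execute): at every cycle each stage holds a different instruction, and once the pipeline is full one instruction completes per cycle.
# Toy 3-stage pipeline diagram: fetch (F), decode (D), execute (X).
# Instruction i enters the pipeline at cycle i and finishes at cycle i + 2.
stages = ["F", "D", "X"]
instructions = ["i0", "i1", "i2", "i3"]

for cycle in range(len(instructions) + len(stages) - 1):
    row = []
    for s, stage in enumerate(stages):
        i = cycle - s   # index of the instruction occupying this stage, if any
        row.append(f"{stage}:{instructions[i]}" if 0 <= i < len(instructions) else f"{stage}:--")
    print(f"cycle {cycle}: " + "  ".join(row))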
Dependencies and hazards
If two instructions are data-dependent (one relies on a piece of information that the other stores in memory or a register), they cannot execute simultaneously.
The pipeline organization determines whether a dependence results in an actual hazard and whether that hazard causes a stall.
Data forwarding: pass a result directly from the instruction that produces it to the instruction that needs it, rather than waiting for it to be written back to a register, so the dependent instruction does not stall.
Out-of-order execution: some instructions require more cycles to execute e.g. floating point operations. Making every instruction execute for the same number of cycles is not acceptable - the performance hit is too large.
Out-of-order pipeline: a scheduler reads instructions and executes them out of order. However, results are committed/written to memory in order.
Flynn’s Taxonomy
- SISD: single instruction stream, single data stream
- SIMD: single instruction stream, multiple data streams (see the NumPy sketch after this list)
- Vector architectures
- Multimedia extensions
- Graphics processor units
- MISD: multiple instruction streams, single data stream
- MIMD: multiple instruction streams, multiple data streams
- Tightly/loosely-coupled MIMD
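As a rough illustration of the SIMD category: NumPy applies a single operation across many data elements at once, and on most CPUs this is dispatched to vector/multimedia instructions; the array size below is arbitrary.
import numpy as np

b = np.arange(1_000_000, dtype=np.float64)
c = np.arange(1_000_000, dtype=np.float64)

# One logical operation applied to many data elements at once: the element-wise
# add runs in compiled, vectorized code rather than a Python-level loop.
a = b + c

assert a[10] == b[10] + c[10]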
Memory Hierarchy
- CPU registers
- Cache: L1/L2/L3
- RAM
- Secondary/permanent storage: HDDs
The CPU has direct access to registers and cache.
There will always be a speed-size trade-off in memory, as latency is bounded from below by the physical distance between the CPU and the memory device.
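One rough way to observe the hierarchy from user code, assuming a C-ordered NumPy array of arbitrary size: traversing it row by row touches contiguous memory and stays cache-friendly, while traversing it column by column strides through memory and misses the cache more often. Exact timings depend on the machine.
import time
import numpy as np

a = np.random.rand(4096, 4096)   # C-ordered: each row is contiguous in memory

def traverse(rows):
    # Sum the array one 1-D slice at a time, in whatever order `rows` yields them.
    total = 0.0
    for r in rows:
        total += r.sum()
    return total

t0 = time.perf_counter()
traverse(a)        # row by row: contiguous, cache-friendly access
t1 = time.perf_counter()
traverse(a.T)      # column by column: strided access, more cache misses
t2 = time.perf_counter()

print(f"rows:    {t1 - t0:.3f} s")
print(f"columns: {t2 - t1:.3f} s")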
Memory Model
If two different processes access the same memory without coordination, bad things (e.g. race conditions and lost updates) can happen.
Distributed memory model: each process gets its own private slice of memory, with no built-in mechanism for data transfer between processes
Shared memory model: processes have their own private areas and a common shared area. Mechanisms are required to avoid multiple processes writing to the same address simultaneously.
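A minimal sketch of one such mechanism using Python's multiprocessing module (the counter and iteration counts are arbitrary): a shared value plus its lock keeps two processes from performing a read-modify-write on the same address at the same time.
from multiprocessing import Process, Value

def bump(counter, n):
    for _ in range(n):
        # The lock serialises each read-modify-write so no increment is lost.
        with counter.get_lock():
            counter.value += 1

def main():
    counter = Value("i", 0)        # an int living in memory shared by the processes
    workers = [Process(target=bump, args=(counter, 10_000)) for _ in range(2)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    print(counter.value)           # 20000; without the lock, updates could be lost

if __name__ == "__main__":
    main()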