The MapReduce architecture gives us parallelization for ‘free’, but it comes with trade-offs and overhead that may be too large for certain applications, for example when the algorithm is compute-heavy but requires little data transfer and movement (e.g. scientific simulations).
In the 90s, if you wanted your program to run twice as fast, you could either do a lot of work optimizing the program or… wait 18 months and buy a new computer. Back then, scaling was simple: clock speeds kept increasing, so no work was needed to take advantage of the new hardware.
There is no free lunch any more - hardware no longer improves (quickly) in a way that makes your program run faster automatically. However, transistor density is still increasing and there are more parallel units on a chip; we must rewrite our programs to take advantage of this hardware.
Types of Parallelism
Data Dependence Graph
Break a problem down into its components, and create a DAG out of it. Vertices are tasks and edges are dependencies.
Critical path: the longest chain of dependent tasks; if any task on the critical path takes longer, it slows down the entire system
Independence: if a task is independent of another, it can run simultaneously with that task
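A minimal sketch of the idea in Python (the task names and one-unit costs below are made up for illustration): represent the DAG as a dependency dict and take the critical-path length as the longest chain of dependent task costs.
from functools import cache

# Toy data dependence graph: each task maps to the tasks it depends on.
# The task names and one-unit costs are illustrative only.
deps = {
    "load_a": [],
    "load_b": [],
    "c": ["load_a", "load_b"],   # c = 2*a + b
    "d": ["load_a"],             # d = a**2
    "e": ["c", "d"],             # e = c + d
}
cost = {task: 1 for task in deps}    # assume every task takes 1 time unit

@cache
def path_length(task):
    # Longest-cost chain of dependencies ending at `task`: no schedule,
    # however many workers it uses, can finish faster than this.
    return cost[task] + max((path_length(d) for d in deps[task]), default=0)

print(max(path_length(t) for t in deps))   # critical path length -> 3
Here load_a -> c -> e (or load_a -> d -> e) is a critical path; the remaining tasks are independent of it and can overlap with it.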
Data Parallelism
Independent tasks apply the same operation to different elements of a data set, so the operations can be performed concurrently, e.g. a = [b[i] + c[i] for i in range(100)].
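One way this could look above a single core, sketched with multiprocessing (the worker count and chunk size are arbitrary, and for an operation this cheap the process overhead would normally outweigh the gain):
from multiprocessing import Pool
from operator import add

b = list(range(100))
c = list(range(100))

def main():
    # Each worker applies the same operation (addition) to a different
    # chunk of the element pairs; results come back in order.
    with Pool(processes=4) as pool:
        a = pool.starmap(add, zip(b, c), chunksize=25)
    assert a == [b[i] + c[i] for i in range(100)]

if __name__ == "__main__":
    main()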
Functional Parallelism
Independent tasks apply different operations to different data elements, e.g.
a = 10
b = 4
c = 2 * a + b  # independent of d
d = a ** 2     # independent of c
e = c + d      # depends on both c and d
The processor will execute such instructions out of order so that independent instructions (here, the computations of c and d) run in parallel.
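The same independence can also be exploited above the hardware level, e.g. with threads. A rough sketch using concurrent.futures, where the lambdas stand in for genuinely expensive independent computations:
from concurrent.futures import ThreadPoolExecutor

a, b = 10, 4

# c and d do not depend on each other, so they can be computed concurrently;
# only e has to wait for both.
with ThreadPoolExecutor(max_workers=2) as pool:
    fc = pool.submit(lambda: 2 * a + b)   # stand-in for an expensive task
    fd = pool.submit(lambda: a ** 2)      # stand-in for another independent task
    e = fc.result() + fd.result()

print(e)  # 124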
Pipelining
Divide a process into stages to produce several items simultaneously - an assembly line.
Even if a task is highly sequential, so that each stage can only be done by a single process, all stages can be done simultaneously, with each stage working on a separate item.
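A small software analogue of an assembly line, assuming two made-up stages connected by a queue: each stage is sequential on its own, but the two stages overlap on different items.
import queue
import threading

SENTINEL = object()                 # marks the end of the stream
q = queue.Queue(maxsize=4)          # hand-off buffer between the two stages
results = []

def stage1(items, out):
    # First stage: produce partially processed items one at a time.
    for x in items:
        out.put(x * 2)
    out.put(SENTINEL)

def stage2(inp, out):
    # Second stage: consume items as soon as stage 1 hands them over,
    # while stage 1 is already working on the next item.
    while (x := inp.get()) is not SENTINEL:
        out.append(x + 1)

t1 = threading.Thread(target=stage1, args=(range(10), q))
t2 = threading.Thread(target=stage2, args=(q, results))
t1.start(); t2.start()
t1.join(); t2.join()
print(results)                      # [1, 3, 5, ..., 19]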
Parallelism
Why can’t this be done automatically by the compiler/hardware? Because it is difficult for programmers to communicate their intentions to the compiler and hardware.
Instruction-Level Parallelism
There is very little parallelism available within a basic block of code; hence, parallelism must be exploited across multiple basic blocks.
Pipelining
Instruction execution occurs in multiple stages (e.g. fetching the instruction, copying memory to registers, executing) that use separate parts of the processor. Hence, multiple instructions can be in flight at once, one instruction in each stage, although only one instruction finishes executing per cycle.
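A toy model of that overlap, assuming a simplified three-stage pipeline (fetch, decode, execute): at every cycle each stage holds a different instruction, and once the pipeline is full one instruction completes per cycle.
# Toy 3-stage pipeline diagram: fetch (F), decode (D), execute (X).
# Instruction i enters the pipeline at cycle i and finishes at cycle i + 2.
stages = ["F", "D", "X"]
instructions = ["i0", "i1", "i2", "i3"]

for cycle in range(len(instructions) + len(stages) - 1):
    row = []
    for s, stage in enumerate(stages):
        i = cycle - s   # index of the instruction occupying this stage, if any
        row.append(f"{stage}:{instructions[i]}" if 0 <= i < len(instructions) else f"{stage}:--")
    print(f"cycle {cycle}: " + "  ".join(row))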
Dependencies and hazards
If two instructions are data-dependent (one relies on a piece of information that the other stores in memory or a register), they cannot execute simultaneously.
The pipeline organization determines whether a dependence results in an actual hazard and whether that hazard causes a stall.
Data forwarding: pass a result directly from the instruction that produces it to the instruction that needs it, rather than waiting for it to be written back to a register, so the dependent instruction does not stall.
Out-of-order execution: some instructions require more cycles to execute e.g. floating point operations. Making every instruction execute for the same number of cycles is not acceptable - the performance hit is too large.
Out-of-order pipeline: a scheduler reads instructions and executes them out of order. However, results are committed/written to memory in order.
Flynn’s Taxonomy
- SISD: single instruction stream, single data stream
- SIMD: single instruction stream, multiple data streams (see the NumPy sketch after this list)
- Vector architectures
- Multimedia extensions
- Graphics processor units
- MISD: multiple instruction streams, single data stream
- MIMD: multiple instruction streams, multiple data streams
- Tightly/loosely-coupled MIMD
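As a rough illustration of the SIMD category: NumPy applies a single operation across many data elements at once, and on most CPUs this is dispatched to vector/multimedia instructions; the array size below is arbitrary.
import numpy as np

b = np.arange(1_000_000, dtype=np.float64)
c = np.arange(1_000_000, dtype=np.float64)

# One logical operation applied to many data elements at once: the element-wise
# add runs in compiled, vectorized code rather than a Python-level loop.
a = b + c

assert a[10] == b[10] + c[10]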
Memory Hierarchy
- CPU registers
- Cache: L1/L2/L3
- RAM
- Secondary/permanent storage: HDDs
The CPU has direct access to registers and cache.
There will always be a speed-size trade-off in memory, as latency is bounded from below by the physical distance between the CPU and the memory device.
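One rough way to observe the hierarchy from user code, assuming a C-ordered NumPy array of arbitrary size: traversing it row by row touches contiguous memory and stays cache-friendly, while traversing it column by column strides through memory and misses the cache more often. Exact timings depend on the machine.
import time
import numpy as np

a = np.random.rand(4096, 4096)   # C-ordered: each row is contiguous in memory

def traverse(rows):
    # Sum the array one 1-D slice at a time, in whatever order `rows` yields them.
    total = 0.0
    for r in rows:
        total += r.sum()
    return total

t0 = time.perf_counter()
traverse(a)        # row by row: contiguous, cache-friendly access
t1 = time.perf_counter()
traverse(a.T)      # column by column: strided access, more cache misses
t2 = time.perf_counter()

print(f"rows:    {t1 - t0:.3f} s")
print(f"columns: {t2 - t1:.3f} s")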
Memory Model
If two different processes access the same memory without coordination, bad things (e.g. race conditions and lost updates) can happen.
Distributed memory model: each process gets its own private slice of memory, with no built-in mechanism for data transfer between processes
Shared memory model: processes have their own private areas and a common shared area. Mechanisms are required to avoid multiple processes writing to the same address simultaneously.
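A minimal sketch of one such mechanism using Python's multiprocessing module (the counter and iteration counts are arbitrary): a shared value plus its lock keeps two processes from performing a read-modify-write on the same address at the same time.
from multiprocessing import Process, Value

def bump(counter, n):
    for _ in range(n):
        # The lock serialises each read-modify-write so no increment is lost.
        with counter.get_lock():
            counter.value += 1

def main():
    counter = Value("i", 0)        # an int living in memory shared by the processes
    workers = [Process(target=bump, args=(counter, 10_000)) for _ in range(2)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    print(counter.value)           # 20000; without the lock, updates could be lost

if __name__ == "__main__":
    main()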