15. Parallel Computing

Parallel programming requires one of:

Extending the compiler
Extending sequential programming language
Adding parallel programming layer (e.g. Spark)
Creating a parallel language (will probably never happen)

Threads and Processes

Process

Process context (data registers, condition codes, SPs, PCs, kernel context (VM structures, descriptor table, brk pointer)) + code, data and stack.

Threads

Like a process, but with its own stack, thread context (data registers, condition codes, SP and PC).

A process can have multiple threads:

Each has its own logical flow
Each shares the same code, data and kernel context (e.g. shared libraries, heap, R/W data)
- Shared common virtual address space space
- Has its own thread context: registers, condition codes, stack pointers, process context

Thread Execution

Single-core processors simulates concurrency by time slicing, while multi-core processors can have true concurrency.

Suspension of threads has some overhead as the thread context must be saved and restored.

Two threads are logically concurrent if their flows overlap: if one thread is running between the time another starts and ends.

Threads vs Processes

Similarities:

Each has its own logical flow
Each runs concurrently with others
Is context switched
- ~1K cycles to swap out register values, change page table register

Differences:

Threads share code and some data; processes typically don’t
Threads are less expensive than processes
- Process control (creating and destroying) is twice as expensive as thread control; ~20k vs 10k cycles in Linux

Threads:

Sharing resources/data easy
- Maybe too easy - unintentional sharing can cause subtle and hard-to-reproduce bugs
More efficient

Considerations

If data is in a register, it is also everywhere in the cache hierarchy: it is in L1, L2 and L3 cache and main memory.

Memory Consistency

int a = 1; int b = 100;

Thread 1 writes a, reads b. Thread 2 writes b, reads a; four instructions. What gets outputted?

The only constraint is that the instructions within a thread each appear to run sequentially; there is no constraint in the order instructions run between threads, or even that instructions in a different thread run in-order - there is only a guarantee of sequential consistency within your own thread. Thus, there are six different orders in which it could run (with two impossible orders being those that require the reads to run before the writes within the same process).

Avoid relying on order of operations as this will slow down the whole program.

SIMD also does not require the programmer to change mindset from sequential programming.

Using DAXPY:

for(int i = 0; i < n; i++) {
  y[i] = a * x[i] + y[i];
}

Using VMIPS:

ADDVV.D: add two vectors
ADDVS.D: add vector to scalar
LS/SV: vector load/vector store from address

L.D      F0, a     ; load scalar a
LV       V0, Rx    ; load vector x
MULVS.D  V2, V1, F0; vector-scalar multiply
LV       V3, Ry    ; load vector y
ADDVV    V4, V2, V3; vector add
SV       Ry, V4    ; store result

Hard to do if there is accumulation: result of operation on one index independent of the rest.

Roofline Performance Model

Plot peak floating-point throughput as a function of arithmetic intensity. From low to high:

Sparse matrix vector multiplication
Structured grids
Spectral methods (e.g. FFTs)
Dense matrix (e.g. BLAS)
N-body

Less communication -> distributed

More communication -> shared

Greater arithmetic intensity (more operations on one piece of data) -> more low-level vectorized operations