15. Parallel Computing

Parallel programming requires one of: multiple processes, multiple threads within a process, or vector (SIMD) instructions.

Threads and Processes

Process

A process consists of a process context plus code, data, and stack. The process context includes the register state (data registers, condition codes, stack pointer, program counter) and the kernel context (VM structures, descriptor table, brk pointer).

Threads

A thread is like a process, but it keeps only its own stack and thread context (data registers, condition codes, stack pointer, and program counter); it shares the enclosing process's code, data, and kernel context.

A process can have multiple threads, all running within the same virtual address space.

Thread Execution

Single-core processors simulate concurrency by time slicing, while multi-core processors can provide true concurrency.

Suspending a thread has some overhead, as the thread context must be saved and later restored.

Two threads are logically concurrent if their flows overlap in time: one thread begins before the other ends, and ends after the other begins.

Threads vs Processes

Similarities:

- Each has its own logical flow of control.
- Each can run concurrently with others, possibly on different cores.
- Each is context switched.

Differences:

- Threads share all code and data (except their own stacks); processes generally do not.
- Threads are cheaper than processes: creating, switching, and reaping a thread is less expensive than the equivalent process control operations.

Considerations

If data is in a register, it was loaded through the cache hierarchy, so on a typical inclusive hierarchy a copy also sits in the L1, L2, and L3 caches and in main memory. Registers are private to a core, so a value held in a register will not reflect another thread's later writes.

Memory Consistency

int a = 1; int b = 100;

Thread 1 writes a then reads b; thread 2 writes b then reads a: four instructions in total. What gets printed?

The only constraint is that the instructions within a thread each appear to run in program order; there is no constraint on the order in which instructions from different threads run, and another thread's instructions may not even appear to you in order. The only guarantee is sequential consistency within your own thread. Of the 4! = 24 conceivable orderings of the four instructions, only the six that keep each thread's write before its own read are possible; in particular, the output where both reads return the initial values (a = 1 and b = 100) can never occur.
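
A minimal sketch of this example in C, assuming POSIX threads (compile with -pthread; the unsynchronized accesses are deliberately a data race, which is the point of the example):

#include <pthread.h>
#include <stdio.h>

int a = 1;
int b = 100;

void *thread1(void *arg) {
    a = 2;                  /* Wa: write a */
    printf("b = %d\n", b);  /* Rb: read b  */
    return NULL;
}

void *thread2(void *arg) {
    b = 200;                /* Wb: write b */
    printf("a = %d\n", a);  /* Ra: read a  */
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, thread1, NULL);
    pthread_create(&t2, NULL, thread2, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}

Running this repeatedly can print b = 100 or b = 200, and a = 1 or a = 2, but never b = 100 together with a = 1.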

Avoid writing code that relies on a particular ordering of operations across threads: enforcing an order requires synchronization, which slows down the whole program.

Hyperthreading

Hyperthreading replicates enough instruction-control hardware to process K instruction streams: the core keeps K copies of all registers, while the functional units are shared between the streams.

Vectorization

Applying the same operation to multiple pieces of data at once (see the DAXPY example below).

SIMD Parallelism

SIMD parallelism exploits data-level parallelism, for example in matrix-oriented image/sound processing or in scientific computing.

It is more energy-efficient than MIMD: only one instruction needs to be fetched to launch many data operations.

SIMD also does not require the programmer to change mindset: code is still written in a largely sequential style.

Using DAXPY (double-precision aX plus Y) as the example loop:

for(int i = 0; i < n; i++) {
  y[i] = a * x[i] + y[i];
}
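
As a sketch of explicit vectorization (not in the original notes), the same loop written with x86 AVX intrinsics, assuming an AVX-capable CPU, AVX enabled at compile time (e.g. -mavx), and n a multiple of 4:

#include <immintrin.h>

/* DAXPY with AVX intrinsics: 4 doubles per iteration.
   A scalar tail loop would be needed if n is not a multiple of 4. */
void daxpy_avx(int n, double a, const double *x, double *y) {
    __m256d va = _mm256_set1_pd(a);               /* broadcast scalar a   */
    for (int i = 0; i < n; i += 4) {
        __m256d vx = _mm256_loadu_pd(&x[i]);      /* load 4 elements of x */
        __m256d vy = _mm256_loadu_pd(&y[i]);      /* load 4 elements of y */
        vy = _mm256_add_pd(_mm256_mul_pd(va, vx), vy);  /* a*x[i] + y[i]  */
        _mm256_storeu_pd(&y[i], vy);              /* store result into y  */
    }
}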

Using VMIPS:

L.D      F0, a      ; load scalar a
LV       V1, Rx     ; load vector x
MULVS.D  V2, V1, F0 ; vector-scalar multiply
LV       V3, Ry     ; load vector y
ADDVV.D  V4, V2, V3 ; vector-vector add
SV       Ry, V4     ; store result

Vectorization is hard to do if there is accumulation: the result of the operation at one index is not independent of the rest, so the elements cannot be processed in parallel (see the sketch below).
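
For example (an illustration, not from the original notes), a summation loop carries its accumulator from one iteration to the next:

/* Loop-carried dependence: each iteration reads the sum produced by the
   previous iteration, so the iterations cannot all run independently. */
double sum = 0.0;
for (int i = 0; i < n; i++) {
  sum += x[i];
}

Such reductions can still be vectorized, but only by restructuring the computation (for example, keeping several partial sums and combining them at the end), not by a direct element-wise mapping.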

Roofline Performance Model

Plot attainable floating-point throughput as a function of arithmetic intensity (floating-point operations per byte of memory traffic). From low to high intensity, performance is bound first by memory bandwidth (the slanted part of the roof) and then by peak floating-point throughput (the flat part).
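
As a worked example (numbers mine, not from the original notes): each DAXPY iteration above performs 2 floating-point operations (one multiply, one add) while moving 24 bytes (two 8-byte loads and one 8-byte store), for an arithmetic intensity of 2/24 ≈ 0.083 FLOPs/byte, which sits well inside the memory-bandwidth-bound region on typical machines.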

- Less communication -> distributed memory
- More communication -> shared memory
- Greater arithmetic intensity (more operations on one piece of data) -> more low-level vectorized operations