Parallel programming requires one of:
- Extending the compiler
- Extending a sequential programming language
- Adding parallel programming layer (e.g. Spark)
- Creating a parallel language (will probably never happen)
Threads and Processes
Process
A process consists of its process context (data registers, condition codes, stack pointer (SP), program counter (PC), and kernel context: VM structures, descriptor table, brk pointer) plus its code, data and stack.
Threads
Like a process, but with only its own stack and thread context (data registers, condition codes, SP and PC).
A process can have multiple threads (a pthreads sketch follows this list):
- Each has its own logical flow
- Each shares the same code, data and kernel context (e.g. shared libraries, heap, R/W data)
- All share a common virtual address space
- Each has its own thread context: registers, condition codes, stack pointer (SP), program counter (PC)
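A minimal pthreads sketch of this layout (illustrative only; the variable and function names are made up): the global is shared by both threads, while each thread's local variable lives on its own stack.

#include <pthread.h>
#include <stdio.h>

int shared_counter = 0;              /* data segment: one copy, visible to every thread */

void *worker(void *arg) {
    int local = 0;                   /* lives on this thread's private stack */
    local++;
    shared_counter++;                /* both threads touch the same location (unsynchronized here) */
    printf("local=%d shared=%d\n", local, shared_counter);
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}

Each thread always prints local=1, but the value of shared_counter it prints depends on how the two threads interleave.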
Thread Execution
Single-core processors simulate concurrency by time slicing, while multi-core processors can provide true concurrency.
Suspension of threads has some overhead as the thread context must be saved and restored.
Two threads are logically concurrent if their flows overlap in time: one thread begins after the other begins but before the other finishes.
Threads vs Processes
Similarities:
- Each has its own logical flow
- Each runs concurrently with others
- Each is context switched; a switch takes ~1K cycles to swap out register values and change the page table register
Differences:
- Threads share code and some data; processes typically don't (see the fork sketch after this list)
- Threads are less expensive than processes
- Process control (creating and destroying) is roughly twice as expensive as thread control: ~20K vs ~10K cycles in Linux
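A small fork() sketch of that difference (illustrative only): after fork, parent and child each work on their own copy of the variable, whereas the threads in the pthreads sketch above share one copy.

#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int counter = 0;

int main(void) {
    pid_t pid = fork();
    if (pid == 0) {                          /* child: gets its own copy of counter */
        counter++;
        printf("child sees %d\n", counter);  /* prints 1 */
        return 0;
    }
    wait(NULL);                              /* parent: its copy is untouched by the child */
    printf("parent sees %d\n", counter);     /* prints 0 */
    return 0;
}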
Threads:
- Sharing resources/data is easy
- Maybe too easy: unintentional sharing can cause subtle and hard-to-reproduce bugs (see the race sketch below)
- More efficient
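A minimal sketch of such a bug (hypothetical counter, not from the notes): two threads increment the same global with no synchronization, so read-modify-write updates are silently lost and the final value varies from run to run.

#include <pthread.h>
#include <stdio.h>

#define NITERS 1000000

long count = 0;                     /* unintentionally shared state */

void *incr(void *arg) {
    for (long i = 0; i < NITERS; i++)
        count++;                    /* load, add, store: updates from the other thread can be lost */
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, incr, NULL);
    pthread_create(&t2, NULL, incr, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("count = %ld (expected %d)\n", count, 2 * NITERS);   /* usually less than 2000000 */
    return 0;
}

Protecting the increment with a mutex (or using an atomic) makes the result deterministic again, at the cost of synchronization overhead.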
Considerations
If data is in a register, a copy of it also sits at every level of the cache hierarchy: in the L1, L2 and L3 caches and in main memory.
Memory Consistency
int a = 1; int b = 100;
Thread 1 writes a, then reads b. Thread 2 writes b, then reads a: four instructions in total. What output is possible?
The only constraint is that the instructions within a thread appear to run in program order; there is no constraint on how instructions from different threads interleave with each other, only a guarantee of sequential order within your own thread. Thus there are six possible interleavings of the four instructions: of the 4! = 24 orderings, only those in which each thread's write comes before its own read can occur.
Avoid writing code that relies on a particular cross-thread ordering: enforcing one requires synchronization, which slows down the whole program.
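A pthreads sketch of the example above (the written values 2 and 200 are arbitrary illustrative choices; this is technically a data race in C and is only meant to show the interleavings):

#include <pthread.h>
#include <stdio.h>

int a = 1;
int b = 100;

void *thread1(void *arg) {
    a = 2;                       /* Wa: write a */
    printf("b = %d\n", b);       /* Rb: read b; may see 100 or 200 */
    return NULL;
}

void *thread2(void *arg) {
    b = 200;                     /* Wb: write b */
    printf("a = %d\n", a);       /* Ra: read a; may see 1 or 2 */
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, thread1, NULL);
    pthread_create(&t2, NULL, thread2, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}

Under sequential consistency the output pair b = 100, a = 1 is impossible: it would require both reads to run before both writes, which violates program order in at least one thread.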
Hyperthreading
Replicates enough instruction-control hardware to process K instruction streams: the core keeps K copies of all registers, while the functional units are shared.
Vectorization
Applying the same operation to multiple pieces of data:
SIMD Parallelism
Exploiting data-level parallelism in matrix-oriented scientific computing and media-oriented image/sound processing.
More energy-efficient than MIMD: only one instruction needs to be fetched and decoded for many data operations.
SIMD also does not require the programmer to change mindset from sequential programming.
Using DAXPY (double-precision y = a*x + y):
for(int i = 0; i < n; i++) {
y[i] = a * x[i] + y[i];
}
Using VMIPS:
- ADDVV.D: add two vectors
- ADDVS.D: add vector to scalar
- LV/SV: vector load / vector store from address
L.D F0, a ; load scalar a
LV V1, Rx ; load vector x
MULVS.D V2, V1, F0 ; vector-scalar multiply
LV V3, Ry ; load vector y
ADDVV.D V4, V2, V3 ; vector add
SV Ry, V4 ; store result
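The same DAXPY loop can also be vectorized on x86 with AVX intrinsics; a sketch (not part of the notes) that processes four doubles per iteration and falls back to scalar code for the leftover elements:

#include <immintrin.h>

/* y = a*x + y, four doubles at a time (requires AVX) */
void daxpy_avx(int n, double a, const double *x, double *y) {
    __m256d va = _mm256_set1_pd(a);             /* broadcast the scalar a */
    int i = 0;
    for (; i + 4 <= n; i += 4) {
        __m256d vx = _mm256_loadu_pd(&x[i]);    /* load 4 elements of x */
        __m256d vy = _mm256_loadu_pd(&y[i]);    /* load 4 elements of y */
        vy = _mm256_add_pd(_mm256_mul_pd(va, vx), vy);
        _mm256_storeu_pd(&y[i], vy);            /* store 4 results */
    }
    for (; i < n; i++)                          /* scalar tail */
        y[i] = a * x[i] + y[i];
}

In practice gcc and clang will often auto-vectorize the plain scalar loop at -O2/-O3, so intrinsics like these are only needed when the compiler cannot prove it is safe.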
Hard to do if there is accumulation: the result at one index depends on results from other indices, so the iterations are not independent.
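For instance, a plain reduction carries its result from one iteration to the next (illustrative sketch):

/* Each iteration reads the sum produced by the previous one,
   so iterations cannot simply run in separate vector lanes. */
double sum_array(int n, const double *x) {
    double sum = 0.0;
    for (int i = 0; i < n; i++)
        sum = sum + x[i];        /* loop-carried dependence on sum */
    return sum;
}

Compilers can still vectorize such reductions with per-lane partial sums, but only when allowed to reassociate floating-point additions (e.g. with -ffast-math), which can change rounding.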
Roofline Performance Model
Plot attainable floating-point throughput as a function of arithmetic intensity (floating-point operations per byte of memory traffic). Example kernels, from low to high arithmetic intensity:
- Sparse matrix vector multiplication
- Structured grids
- Spectral methods (e.g. FFTs)
- Dense matrix (e.g. BLAS)
- N-body
Less communication -> better suited to distributed memory
More communication -> better suited to shared memory
Greater arithmetic intensity (more operations per piece of data) -> more benefit from low-level vectorized operations
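As a worked example of arithmetic intensity (assuming 8-byte doubles and no cache reuse): each DAXPY element performs 2 floating-point operations (one multiply, one add) while moving 24 bytes (load x[i], load y[i], store y[i]), so its arithmetic intensity is 2/24 ≈ 0.08 FLOPs/byte. That places DAXPY at the far left of the roofline, under the sloped memory-bandwidth ceiling: it is memory-bound rather than limited by peak floating-point throughput.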