Big O Notation
Given input of size $n$, an algorithm's running time can be bounded as $O(f(n))$.
If $T(n) = O(f(n))$, then there exist constants $c > 0$ and $n_0$ such that $T(n) \le c \cdot f(n)$ for all $n \ge n_0$.
Parallelism
In an ideal case you can take a task that runs in time $T$ on one core and complete it in time $T/n$ on $n$ cores.
If this is not possible, we want to characterize the scalability:
- Vertically: add more resources to a single node (e.g. CPU, memory, storage)
- Horizontally: add more nodes to the system
Speed-up is a measure of how much parallelizing the program speeds up the task: $S = T_1 / T_p$, the ratio of the execution time on one process to the time on $p$ processes. This does not take into account the number of cores, model of processor etc. but simply takes the ratio of two times.
Amdahl’s Law
Some part of the code is inherently sequential and cannot be parallelized.
If you have an infinite number of processors, the time taken for the parallelizable part of the code will tend to zero and you will be left with the sequential part of the code. Using this, we can calculate the maximum speed-up assuming you had an infinite number of processors and the program scaled linearly.
If $f$ is the fraction of the program that is inherently sequential, and $T_1$ is the time on one processor:
Hence, the time taken on $p$ processors will be:
$$T(p) = f \, T_1 + \frac{(1-f) \, T_1}{p}$$
And the speed-up will be:
$$S(p) = \frac{1}{f + \frac{1-f}{p}} \;\xrightarrow{p \to \infty}\; \frac{1}{f}$$
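As a minimal sketch (assuming a sequential fraction `f` and linear scaling of the parallel part), Amdahl's speed-up can be computed directly:

```python
def amdahl_speedup(f, p):
    """Speed-up on p processors when a fraction f of the work is sequential."""
    return 1.0 / (f + (1.0 - f) / p)

# With 5% sequential code, the speed-up is capped at 1/f = 20x
# no matter how many processors are added.
print(amdahl_speedup(0.05, 10))    # 10 processors: well below 10x
print(amdahl_speedup(0.05, 1000))  # close to, but under, the 20x ceiling
```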
Gustafson’s Law
This recognizes that when you increase the number of processors, you also increase the problem size (and performance may not scale linearly). With $f$ the sequential fraction measured on the parallel system, the scaled speed-up is $S(p) = f + p(1-f) = p - f(p-1)$.
- Strong scalability: keep the problem size fixed
- Weak scalability: keep the ratio of problem size to number of cores fixed ($n/p$ constant)
- If the time increases as the system scales, it has sub-linear scalability. If the time grows towards an asymptote, then there will be a maximum problem size the system can handle
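A sketch contrasting the two laws (assuming the same sequential fraction `f` in both models): under Amdahl the speed-up plateaus, while under Gustafson it keeps growing because the problem size grows with the machine.

```python
def amdahl(f, p):
    """Fixed-size (strong scaling) speed-up with sequential fraction f."""
    return 1.0 / (f + (1.0 - f) / p)

def gustafson(f, p):
    """Scaled (weak scaling) speed-up; f is the sequential fraction
    measured on the parallel system."""
    return f + p * (1.0 - f)

for p in (1, 10, 100, 1000):
    print(p, amdahl(0.05, p), gustafson(0.05, p))
```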
Karp-Flatt Metric
This allows determination of the parallel overhead. The experimentally determined serial fraction is
$$e = \frac{\frac{1}{S} - \frac{1}{p}}{1 - \frac{1}{p}}$$
where $S$ is the measured speed-up on $p$ processors. If $e$ grows with $p$, the overhead is growing.
e.g. gathering data to the driver before distributing it to the nodes. As you add more nodes and increase the problem size, the time spent on communication increases, creating a sequential bottleneck.
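A minimal sketch of the metric: given a measured speed-up, it recovers the effective serial fraction (a perfect speed-up gives $e = 0$; no speed-up gives $e = 1$).

```python
def karp_flatt(speedup, p):
    """Experimentally determined serial fraction from the measured
    speed-up on p processors (p > 1)."""
    return (1.0 / speedup - 1.0 / p) / (1.0 - 1.0 / p)

# A measured speed-up of 6.9 on 8 processors implies a small but
# non-zero serial fraction (including parallel overhead).
print(karp_flatt(6.9, 8))
```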
Cost Measure for MapReduce
Quantifying cost of algorithm:
- Communication cost: total I/O cost of all processes
- Elapsed communication cost: max of I/O along any path
- Draw a communication graph and find the longest path
- (Elapsed) computation cost: analogous to elapsed communication cost, but counting running time of processes
Usually, either the I/O or processing cost dominates.
This can be measured empirically using elapsed wall-clock time.
For a map-reduce algorithm:
- Communication cost: input file size + 2 × (sum of the sizes of all files passed from Map to Reduce processes); each such file is counted once on the Map side and once on the Reduce side, hence the factor of 2
- Elapsed communication cost: sum of largest input + output file sizes for any map process, plus the same for reduce processes.
For a map-reduce join algorithm on relations $R(A, B)$ and $S(B, C)$, the communication cost is $O(|R| + |S|)$.
Elapsed communication cost is the largest input-plus-output size of any single Map or Reduce process.
Try to pick enough Reduce tasks that no single reducer receives a disproportionate share of the tuples (a skewed key creates a straggler).
If proper indexes are used, computation cost grows linearly with input/output size; hence, it should grow at approximately the same rate as the communication cost.
Counting the number of times each word appears in a file:
- Minimum communication cost: make one node do all the work
- Minimum computation cost: one line per node; laughable amount of computation that is dwarfed by communication time
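A minimal single-process sketch of the map-reduce word count (map emits `(word, 1)` pairs, the shuffle groups values by key, reduce sums them):

```python
from collections import defaultdict

def map_fn(line):
    """Map: emit (word, 1) for each word in the line."""
    for word in line.split():
        yield word, 1

def reduce_fn(word, counts):
    """Reduce: sum the counts for one word."""
    return word, sum(counts)

def word_count(lines):
    groups = defaultdict(list)          # shuffle: group values by key
    for line in lines:
        for word, one in map_fn(line):
            groups[word].append(one)
    return dict(reduce_fn(w, c) for w, c in groups.items())

print(word_count(["the cat sat", "the cat"]))  # {'the': 2, 'cat': 2, 'sat': 1}
```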
Frequent Item-sets
Example: N-Gram Statistical Machine Translation
Count the number of times every $n$-gram (sequence of $n$ consecutive words) appears in a corpus.
Word count is the 1-gram case; counting word pairs is the bi-gram case.
2 out of 3 words
Set of words: count the number of times two words occurred within a three-word context (e.g. “ate a salad”, “ate the salad” and “he ate salad” all contain the pair {ate, salad}).
Count the number of times each such pair of words appears across the corpus.
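A sketch of this counting, assuming a sliding three-word window (note that a pair appearing in several overlapping windows is counted once per window in this sketch):

```python
from collections import Counter

def window_pairs(words, window=3):
    """Count unordered word pairs that occur within `window`
    consecutive words of each other."""
    counts = Counter()
    for i in range(len(words) - window + 1):
        ctx = words[i:i + window]
        for j in range(window):
            for k in range(j + 1, window):
                if ctx[j] != ctx[k]:
                    # frozenset makes the pair order-independent
                    counts[frozenset((ctx[j], ctx[k]))] += 1
    return counts

c = window_pairs("he ate a salad".split())
print(c[frozenset(("ate", "salad"))])  # 1
```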
Market-Basket Model
A large set of items e.g. set of all words in the language.
A large set of baskets: each basket is a small subset of items
Want to discover association rules (e.g. people that bought x, y and z tend to buy w).
e.g. baskets = sentences, items = documents containing those sentences; items that appear together too often could indicate plagiarism.
Find sets of items that appear together frequently:
- Support for item-set $I$: the number of baskets containing all the items in $I$
- Should have a support threshold: the minimum support before an item-set is considered frequent
- If an item appears in most baskets, it is likely not to be ‘interesting’ and it is hard to draw any meaningful conclusions from it
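A sketch of computing support for all item-sets of a given size by brute-force enumeration (the example baskets are illustrative; real algorithms such as A-Priori avoid enumerating everything):

```python
from itertools import combinations
from collections import Counter

def support_counts(baskets, size=2):
    """For every item-set of the given size, count the number of
    baskets that contain all of its items."""
    counts = Counter()
    for basket in baskets:
        for itemset in combinations(sorted(set(basket)), size):
            counts[itemset] += 1
    return counts

baskets = [{"bread", "milk"}, {"bread", "milk", "eggs"}, {"milk", "eggs"}]
counts = support_counts(baskets)
frequent = {s for s, c in counts.items() if c >= 2}  # support threshold = 2
print(frequent)
```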