06. Performance: Cost and Complexity, Frequent Itemsets

Big O Notation

Given input of size n, T(n) is the total time and S(n) is the necessary storage.

If T(n) = O(f(n)) for some given algorithm, then T(n) \le c \cdot f(n) for some c > 0 and all sufficiently large n.
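For example, T(n) = 3n^2 + 5n is O(n^2):

3n^2 + 5n \le 4n^2 \quad \text{for all } n \ge 5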

Parallel

In an ideal case you can take O(f(n)) and divide it by the number of processors P; this is nearly achievable for “embarrassingly parallel” problems.

If this is not possible, we want to characterize the scalability:

Speed-up is a measure of how much parallelizing the program speeds up the task. It does not take into account the number of cores, processor model, etc., but simply takes the ratio of two times.

S = \frac{T_{\text{sequential}}}{T_{\text{parallel}}}
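e.g. if the sequential version takes 120 s and the parallel version takes 30 s:

S = \frac{120}{30} = 4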

Amdahl’s Law

Some part of the code is inherently sequential and cannot be parallelized.

If you have an infinite number of processors, the time taken for the parallelizable part of the code will tend to zero and you will be left with the sequential part of the code. Using this, we can calculate the maximum speed-up assuming you had an infinite number of processors and the program scaled linearly.

If s is the proportion of time spent on sequential code, 1 - s is the proportion spent on parallelizable code. With p processors, this parallel portion drops to (1 - s)/p.

Hence, the time taken will be:

s + \frac{1 - s}{p}

And the speed-up will be:

S = \frac{1}{s + \frac{1 - s}{p}} \le \frac{1}{s}
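e.g. if 5% of the program is inherently sequential (s = 0.05), then even with infinitely many processors:

S \le \frac{1}{0.05} = 20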

Gustafson’s Law

S(p) = p - s(p - 1)

This recognizes that when you increase the number of processors, you also tend to increase the problem size: the parallelizable portion grows while the sequential portion stays roughly fixed, so the achievable speed-up grows with p (though not perfectly linearly).
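e.g. with p = 16 processors and a sequential fraction s = 0.1:

S(16) = 16 - 0.1(16 - 1) = 14.5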

Karp-Flatt Metric

s, the proportion of time spent on sequential code, may vary as the size of the problem, n, increases.

Estimating s empirically, rather than assuming it fixed, allows the parallel overhead to be determined.

e.g. gathering data to the driver before distributing it to nodes: as you add more nodes and increase the problem size, the time spent on communication increases, creating a sequential bottleneck.
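Concretely, the Karp-Flatt metric computes an experimentally determined serial fraction e from a measured speed-up S on p processors:

e = \frac{1/S - 1/p}{1 - 1/p}

If e grows as p increases, overhead (such as communication) rather than inherently sequential code is the bottleneck.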

Cost Measure for MapReduce

Quantifying the cost of an algorithm:

Usually, either the I/O or processing cost dominates.

This can be measured empirically using elapsed wall-clock time.

For a map-reduce algorithm:

Communication cost = total I/O of all processes.

Elapsed communication cost = the maximum I/O along any path of parallel processes (a bound on wall-clock time).

(Elapsed) computation cost is defined analogously, counting the running time of processes instead of their I/O.

For a map-reduce join algorithm on relations R and S:

\text{Total communication cost} = O(|R| + |S| + |R \bowtie S|)
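Each tuple of R and S is communicated to a reduce process once, and each tuple of the join result is output once. e.g. with hypothetical sizes |R| = 10^6, |S| = 10^5 and |R \bowtie S| = 10^4:

10^6 + 10^5 + 10^4 = 1.11 \times 10^6 \text{ tuples communicated}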

Elapsed communication cost is O(s): pick the number of partitions k and the number of map processes so that the I/O limit s is respected, where s is the amount of I/O any one process can have (e.g. what fits in main memory or local storage).

Try to pick k so that the problem can be solved within one server/rack, since data transfer within a server or rack is faster than across the wider network.

If proper indexes are used, computation cost grows linearly with input/output size; hence, it should grow at approximately the same rate as the communication cost.

Counting the number of times each word appears in a file:
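As an illustration, a minimal single-machine sketch of the word-count map-reduce in plain Python; map_phase, shuffle and reduce_phase are illustrative stand-ins for what a real framework provides:

    from collections import defaultdict

    def map_phase(document):
        # Emit a (word, 1) pair for every word in the document.
        for word in document.split():
            yield (word, 1)

    def shuffle(pairs):
        # Group values by key, as the framework does between map and reduce.
        groups = defaultdict(list)
        for key, value in pairs:
            groups[key].append(value)
        return groups.items()

    def reduce_phase(word, counts):
        # Sum the partial counts for a single word.
        return (word, sum(counts))

    pairs = map_phase("the cat sat on the mat")
    result = dict(reduce_phase(word, counts) for word, counts in shuffle(pairs))
    # result == {'the': 2, 'cat': 1, 'sat': 1, 'on': 1, 'mat': 1}

In cost terms, the shuffle step is where the communication cost is incurred.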

Frequent Item-sets

Example: N-Gram Statistical Machine Translation

Count the number of times every n-word sequence occurs in a large corpus of documents.

Word count is the 1-gram case; counting pairs of adjacent words is the bi-gram case.
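A small sketch of n-gram extraction in Python (the ngrams helper is a hypothetical name, not a library function):

    def ngrams(tokens, n):
        # Slide a length-n window over the token list by zipping shifted copies.
        return list(zip(*(tokens[i:] for i in range(n))))

    tokens = "he ate the salad".split()
    print(ngrams(tokens, 2))
    # [('he', 'ate'), ('ate', 'the'), ('the', 'salad')]

Counting then reduces to the word-count pattern above, with n-grams as keys instead of words.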

2 out of 3 words

Set of words: count the number of times two given words occur within a three-word window, e.g. “ate a salad”, “ate the salad” and “he ate salad” all contain the pair {ate, salad}.

More generally: count the number of times K words occurred within a window of W words in a large corpus of N documents (and produce a list of the most frequent K-word sets).
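A naive sketch of the windowed count, assuming each K-word subset is counted once per W-word window that contains it (document-boundary handling is glossed over):

    from collections import Counter
    from itertools import combinations

    def windowed_ksets(tokens, K, W):
        # Count each K-word subset once per W-word window containing it.
        counts = Counter()
        for i in range(len(tokens) - W + 1):
            window = tokens[i:i + W]
            for kset in combinations(sorted(set(window)), K):
                counts[kset] += 1
        return counts

    counts = windowed_ksets("he ate the salad then ate a salad".split(), K=2, W=3)
    print(counts[("ate", "salad")])  # 3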

Market-Basket Model

A large set of items, e.g. the set of all words in a language.

A large set of baskets: each basket is a small subset of items.

We want to discover association rules (e.g. people who bought x, y and z tend to buy w).

e.g. baskets = sentences, items = documents containing those sentences; documents that appear together in too many baskets could indicate plagiarism.

Find sets of items that appear together frequently: the support of an itemset is the number of baskets that contain all of its items, and an itemset is “frequent” if its support meets some threshold.
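A naive single-machine sketch for the pair case (K = 2): count every pair of items in every basket, then keep the pairs that meet the support threshold. Practical algorithms (e.g. A-Priori) prune the candidate space instead of materializing all pairs:

    from collections import Counter
    from itertools import combinations

    def frequent_pairs(baskets, support):
        # Count every pair of items within each basket, then keep
        # the pairs whose count meets the support threshold.
        counts = Counter()
        for basket in baskets:
            for pair in combinations(sorted(set(basket)), 2):
                counts[pair] += 1
        return {pair: c for pair, c in counts.items() if c >= support}

    baskets = [{"milk", "bread", "beer"},
               {"milk", "bread"},
               {"bread", "beer"}]
    print(frequent_pairs(baskets, support=2))
    # {('beer', 'bread'): 2, ('bread', 'milk'): 2}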