06. Performance: Cost and Complexity, Frequent Itemsets

Big O Notation

Given input of size n, T(n) is the total time and S(n) is the necessary storage.

If T(n) = O(f(n)) for some given algorithm, then T(n) \le c \cdot f(n) for some c > 0 and all sufficiently large n.
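For example, T(n) = 3n^2 + 5n is O(n^2):

3n^2 + 5n \le 4n^2 \quad \text{for all } n \ge 5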

Parallel

In an ideal case you can take O(f(n)) and divide it by the number of processors P; this is nearly achievable for “embarrassingly parallel” problems.

If this is not possible, we want to characterize the scalability:

Speed-up is a measure of how much parallelizing the program speeds up the task. It does not take into account the number of cores, processor model, etc., but simply takes the ratio of two times.

S = \frac{T_{\text{sequential}}}{T_{\text{parallel}}}
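e.g. if the sequential version takes 120 s and the parallel version takes 30 s:

S = \frac{120}{30} = 4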

Amdahl’s Law

Some part of the code is inherently sequential and cannot be parallelized.

If you have an infinite number of processors, the time taken for the parallelizable part of the code will tend to zero and you will be left with the sequential part of the code. Using this, we can calculate the maximum speed-up assuming you had an infinite number of processors and the program scaled linearly.

If s is the proportion of time spent on sequential code, 1 - s is the proportion spent on parallelizable code. With p processors, this parallel portion drops to (1 - s)/p.

Hence, the time taken will be:

s + \frac{1 - s}{p}

And the speed-up will be:

S = \frac{1}{s + \frac{1 - s}{p}} \le \frac{1}{s}
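e.g. if 5% of the program is inherently sequential (s = 0.05), then even with infinitely many processors:

S \le \frac{1}{0.05} = 20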

Gustafson’s Law

S(p) = p - s(p - 1)

This recognizes that when you increase the number of processors, you also tend to increase the problem size: the parallelizable portion grows while the sequential portion stays roughly fixed, so the achievable speed-up grows with p (though not perfectly linearly).
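e.g. with p = 16 processors and a sequential fraction s = 0.1:

S(16) = 16 - 0.1(16 - 1) = 14.5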

Karp-Flatt Metric

s, the proportion of time spent on sequential code, may vary as the size of the problem, n, increases.

Estimating s empirically, rather than assuming it fixed, allows the parallel overhead to be determined.

e.g. gathering data to the driver before distributing it to nodes: as you add more nodes and increase the problem size, the time spent on communication increases, creating a sequential bottleneck.
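Concretely, the Karp-Flatt metric computes an experimentally determined serial fraction e from a measured speed-up S on p processors:

e = \frac{1/S - 1/p}{1 - 1/p}

If e grows as p increases, overhead (such as communication) rather than inherently sequential code is the bottleneck.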

Cost Measure for MapReduce

Quantifying the cost of an algorithm:

Usually, either the I/O or processing cost dominates.

This can be measured empirically using elapsed wall-clock time.

For a map-reduce algorithm:

Communication cost = total I/O of all processes.

Elapsed communication cost = the maximum I/O along any path of parallel processes (a bound on wall-clock time).

(Elapsed) computation cost is defined analogously, counting the running time of processes instead of their I/O.

For a map-reduce join algorithm on relations R and S:

\text{Total communication cost} = O(|R| + |S| + |R \bowtie S|)
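Each tuple of R and S is communicated to a reduce process once, and each tuple of the join result is output once. e.g. with hypothetical sizes |R| = 10^6, |S| = 10^5 and |R \bowtie S| = 10^4:

10^6 + 10^5 + 10^4 = 1.11 \times 10^6 \text{ tuples communicated}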

Elapsed communication cost is O(s): pick the number of partitions k and the number of map processes so that the I/O limit s is respected, where s is the amount of I/O any one process can have (e.g. what fits in main memory or local storage).

Try to pick k so that the problem can be solved within one server/rack, since data transfer within a server or rack is faster than across the wider network.

If proper indexes are used, computation cost grows linearly with input/output size; hence, it should grow at approximately the same rate as the communication cost.

Counting the number of times each word appears in a file:
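As an illustration, a minimal single-machine sketch of the word-count map-reduce in plain Python; map_phase, shuffle and reduce_phase are illustrative stand-ins for what a real framework provides:

    from collections import defaultdict

    def map_phase(document):
        # Emit a (word, 1) pair for every word in the document.
        for word in document.split():
            yield (word, 1)

    def shuffle(pairs):
        # Group values by key, as the framework does between map and reduce.
        groups = defaultdict(list)
        for key, value in pairs:
            groups[key].append(value)
        return groups.items()

    def reduce_phase(word, counts):
        # Sum the partial counts for a single word.
        return (word, sum(counts))

    pairs = map_phase("the cat sat on the mat")
    result = dict(reduce_phase(word, counts) for word, counts in shuffle(pairs))
    # result == {'the': 2, 'cat': 1, 'sat': 1, 'on': 1, 'mat': 1}

In cost terms, the shuffle step is where the communication cost is incurred.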

Frequent Item-sets

Example: N-Gram Statistical Machine Translation

Count the number of times every n-word sequence occurs in a large corpus of documents.

Word count is the 1-gram case; counting pairs of adjacent words is the bi-gram case.
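A small sketch of n-gram extraction in Python (the ngrams helper is a hypothetical name, not a library function):

    def ngrams(tokens, n):
        # Slide a length-n window over the token list by zipping shifted copies.
        return list(zip(*(tokens[i:] for i in range(n))))

    tokens = "he ate the salad".split()
    print(ngrams(tokens, 2))
    # [('he', 'ate'), ('ate', 'the'), ('the', 'salad')]

Counting then reduces to the word-count pattern above, with n-grams as keys instead of words.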

2 out of 3 words

Set of words: count the number of times two given words occur within a three-word window, e.g. “ate a salad”, “ate the salad” and “he ate salad” all contain the pair {ate, salad}.

More generally: count the number of times K words occurred within a window of W words in a large corpus of N documents (and produce a list of the most frequent K-word sets).
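A naive sketch of the windowed count, assuming each K-word subset is counted once per W-word window that contains it (document-boundary handling is glossed over):

    from collections import Counter
    from itertools import combinations

    def windowed_ksets(tokens, K, W):
        # Count each K-word subset once per W-word window containing it.
        counts = Counter()
        for i in range(len(tokens) - W + 1):
            window = tokens[i:i + W]
            for kset in combinations(sorted(set(window)), K):
                counts[kset] += 1
        return counts

    counts = windowed_ksets("he ate the salad then ate a salad".split(), K=2, W=3)
    print(counts[("ate", "salad")])  # 3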

Market-Basket Model

A large set of items, e.g. the set of all words in a language.

A large set of baskets: each basket is a small subset of items.

We want to discover association rules (e.g. people who bought x, y and z tend to buy w).

e.g. baskets = sentences, items = documents containing those sentences; documents that appear together in too many baskets could indicate plagiarism.

Find sets of items that appear together frequently: the support of an itemset is the number of baskets that contain all of its items, and an itemset is “frequent” if its support meets some threshold.
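A naive single-machine sketch for the pair case (K = 2): count every pair of items in every basket, then keep the pairs that meet the support threshold. Practical algorithms (e.g. A-Priori) prune the candidate space instead of materializing all pairs:

    from collections import Counter
    from itertools import combinations

    def frequent_pairs(baskets, support):
        # Count every pair of items within each basket, then keep
        # the pairs whose count meets the support threshold.
        counts = Counter()
        for basket in baskets:
            for pair in combinations(sorted(set(basket)), 2):
                counts[pair] += 1
        return {pair: c for pair, c in counts.items() if c >= support}

    baskets = [{"milk", "bread", "beer"},
               {"milk", "bread"},
               {"bread", "beer"}]
    print(frequent_pairs(baskets, support=2))
    # {('beer', 'bread'): 2, ('bread', 'milk'): 2}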