4. Greedy Algorithms

Greedy algorithms build a solution one step at a time, at each step making the choice that looks best locally and never revisiting it.

Examples

Minimum Spanning Tree

A proposed greedy algorithm can be disproved with a counter-example; proving one correct typically requires an exchange argument or a proof by contradiction.

Coin Changing Problem

Coins of denominations $C = \{c_1, \dots, c_n\}$. Find the minimum number of coins such that the total value is $V$. Assume an unlimited supply of each denomination and that $c_1 = 1$, so every value $V$ can be made.

Greedy strategy: repeatedly take the largest coin whose value does not exceed the remaining amount (see the sketch below).

Counter-example: $C = \{1, 10, 25\}$, $V = 30$. The optimal solution is $3 \cdot 10$ (three coins), but the greedy algorithm gives $1 \cdot 25 + 5 \cdot 1$ (six coins).

What is a sufficient property such that the greedy algorithm will always be optimal? Every coin is a multiple of the one below it, i.e. each denomination divides the next larger one (e.g. $\{1, 5, 25\}$).
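A minimal sketch of this strategy in Python (the function name and interface are illustrative, not from the notes):

```python
def greedy_change(coins, value):
    """Repeatedly take the largest coin that fits into the remaining value.

    Assumes the denominations include 1 so a solution always exists.
    Only guaranteed optimal for "canonical" coin systems, e.g. where
    each denomination divides the next larger one.
    """
    count = 0
    for coin in sorted(coins, reverse=True):
        count += value // coin   # take as many of this coin as possible
        value %= coin            # remainder still to be made up
    return count

print(greedy_change([1, 10, 25], 30))  # 6 coins (25 + 5*1); optimal is 3 (10*3)
```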

Example: Interval Scheduling

There is a set of jobs with start ($s_j$) and finish ($f_j$) times. Find the largest subset of mutually compatible (non-overlapping) jobs.

You could sort the jobs by earliest start time, shortest duration, fewest conflicts, or earliest finish time; only sorting by earliest finish time always yields an optimal solution (a sketch follows below).

To prove a greedy algorithm right, assume it is wrong and derive a contradiction. To prove it wrong, simply find a counter-example.
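A sketch of the earliest-finish-time rule, assuming jobs are given as (start, finish) pairs (the representation and function name are my choice):

```python
def max_jobs(jobs):
    """Sort jobs by finish time; take each job that starts no earlier
    than the previously selected job finished."""
    selected = []
    last_finish = float("-inf")
    for start, finish in sorted(jobs, key=lambda job: job[1]):
        if start >= last_finish:
            selected.append((start, finish))
            last_finish = finish
    return selected

# Four overlapping jobs; the greedy selects (1, 3) and (4, 7).
print(max_jobs([(1, 3), (2, 5), (4, 7), (1, 8)]))
```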

Fractional Knapsack Problem

There is a set $S$ of $n$ items. Each item has a positive value $b_i$ and weight $w_i$. Find the subset which yields the maximum value without exceeding a given weight limit $W$.

You can select a fraction $x_i$ of item $i$; its weight and value are scaled proportionally.

To solve this with a greedy algorithm, sort the items by benefit/weight ratio ($b_i / w_i$) in decreasing order and fill the knapsack with whole items in that order, taking a fraction of the final item if it does not fit completely.
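A sketch assuming each item is a (value, weight) pair:

```python
def fractional_knapsack(items, capacity):
    """Take items in decreasing order of value/weight ratio, taking a
    fraction of the last item if it does not fit whole."""
    total = 0.0
    for value, weight in sorted(items, key=lambda it: it[0] / it[1], reverse=True):
        if capacity <= 0:
            break
        take = min(weight, capacity)       # whole item, or whatever fits
        total += value * (take / weight)   # value scales with the fraction taken
        capacity -= take
    return total

# Ratios are 6, 5, 4: take the first two items whole and 20/30 of the third.
print(fractional_knapsack([(60, 10), (100, 20), (120, 30)], 50))  # 240.0
```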

0/1 Knapsack Problem

You can only take the whole item or not take it at all. The greedy ratio strategy no longer works, so this requires dynamic programming.
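For contrast, a minimal dynamic-programming sketch, assuming integer weights (the one-dimensional table over capacities is one standard formulation):

```python
def knapsack_01(items, capacity):
    """0/1 knapsack in O(n * capacity): dp[c] is the best value
    achievable with total weight at most c."""
    dp = [0] * (capacity + 1)
    for value, weight in items:
        # Iterate capacities downwards so each item is used at most once.
        for c in range(capacity, weight - 1, -1):
            dp[c] = max(dp[c], dp[c - weight] + value)
    return dp[capacity]

# Same items as above: a fraction of an item is no longer allowed;
# the optimum now takes the last two items whole.
print(knapsack_01([(60, 10), (100, 20), (120, 30)], 50))  # 220
```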

Huffman Coding

Variable vs Fixed length

Fixed-length codes are convenient for indexing and counting (e.g. ASCII: a 7-bit binary string per character, one character per byte).

Variable-length codes may be more efficient (e.g. Morse code: frequent letters have short encodings; UTF-8: the most common characters are 1 byte long).

Variable-Length Coding

Let $a = 0$, $b = 1$, $c = 01$, etc.

Problem: there is no spacing, so it is ambiguous where a character starts and ends.

Solution: ensure that no codeword in the code is a prefix of any other codeword (a prefix-free code).

For $a$ to $f$, one possible solution is $a = 000$, $b = 0010$, $c = 0011$, $d = 01$, $e = 10$, $f = 11$.

This can be represented as a tree with two children per node and characters being found in leaf nodes:

            .
           /  \
        0 /    \ 1
         .       .
        /\       /\
     0 /  \ 1 0 /  \ 1
      .    d   e    f
     /\
  0 /  \ 1
   a    .
        /\
     0 /  \ 1
      b    c
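A small decoding sketch using the table above: because the code is prefix-free, the first codeword matching the accumulated bits is always the right one (the dictionary representation is my own choice):

```python
def decode(bits, code):
    """Decode a bit string with a prefix-free code."""
    reverse = {word: ch for ch, word in code.items()}
    out, buffer = [], ""
    for bit in bits:
        buffer += bit
        if buffer in reverse:            # no codeword is a prefix of another,
            out.append(reverse[buffer])  # so the first match is unambiguous
            buffer = ""
    return "".join(out)

code = {"a": "000", "b": "0010", "c": "0011", "d": "01", "e": "10", "f": "11"}
print(decode("001000001", code))  # "bad"
```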

Encoding

The most frequent characters should have the shortest encodings. However, building the tree top-down by directly assigning short codes to frequent characters will fail to produce an optimal prefix code.

Instead, the key is that the two least frequent characters should be sibling leaves at the bottom of the tree. Hence, to create a Huffman tree: insert every character into a min-heap keyed by frequency; repeatedly pop the two least frequent nodes and merge them under a new node whose frequency is their sum; push the merged node back, and stop when only the root remains.

Popping from and inserting into a min-heap are $O(\log n)$ operations, and $n - 1$ merge iterations are required. Hence, the time complexity for building a Huffman tree is $O(n \log n)$.
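A sketch of the construction with Python's heapq; here each heap entry carries the partial code table for its subtree rather than an explicit tree (an implementation choice, not from the notes):

```python
import heapq

def huffman_codes(freqs):
    """Build a Huffman code from a {character: frequency} mapping."""
    # Heap entries: (frequency, tiebreaker, {char: code-so-far}); the unique
    # tiebreaker stops Python from ever comparing the dicts.
    heap = [(f, i, {ch: ""}) for i, (ch, f) in enumerate(freqs.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)   # two least frequent subtrees
        f2, _, right = heapq.heappop(heap)
        merged = {ch: "0" + c for ch, c in left.items()}         # left edge: 0
        merged.update({ch: "1" + c for ch, c in right.items()})  # right edge: 1
        heapq.heappush(heap, (f1 + f2, counter, merged))
        counter += 1
    return heap[0][2]

print(huffman_codes({"a": 45, "b": 13, "c": 12, "d": 16, "e": 9, "f": 5}))
```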

Disadvantages

The encoding table must be sent along with the encoded document. Hence, the overhead may be large for small documents.

Huffman encoding is optimal for global character frequencies. However, it considers single characters only: it cannot exploit repeated substrings, and frequencies that vary across the document are not captured by one global table.

Lempel-Ziv usually gives better compression by recognizing repeated substrings. The DEFLATE algorithm combines the LZ and Huffman algorithms.