12. Spark MMDS Wrap-up

Additional Spark Libraries

GraphX: represents graph as RDD; PageRank built in
MLib: common ML algorithms inc. clustering (for high-dimensional spaces that are niggly to represent as graphs)
Streaming: for live-streamed data (e.g. online ad matching)
SQL: wrapper around RDDs

Problem: what if everything is connected? e.g. roads connect all cities together. They may not be direct connections, or connections may not be connected (e.g. flying). Hence, it is a fully connected graph.

Hence we can always find cost between any two points; direct connection exists for every single point.

Given a set of points with a notion of distance between points, group the points into some number of clusters.

This is a problem, especially for higher-dimensional spaces. e.g. for movie recommendations, each movie is a dimension.

Point Assignment: $k$ -means

Pick the number of clusters $k$ and initialize clusters by picking one point for each cluster (hopefully each far away from the other points).

For each point, place it in the cluster whose centroid (‘center of mass’ of a cluster) is the nearest. Need to compute $nk$ distances per iteration.

After all points are assigned, update the locations of the centroids (a computed value so it is not necessarily an existing point).

Repeat until it converges (i.e. until centroids do not update).

Online Algorithms

Offline/classical algorithms: algorithms have access to the entire input

Online algorithms: algorithm given access to the input one piece at a time, and must make irrevocable decisions along the way.

Given a set of elements being selected and elements doing the selecting (and are willing to select some subset), determine which element should be given.

Example: Online Graph Matching

Two sets of vertices; connect vertex from set $(1, 2, 3, \dots)$ to a vertex on the set $(a, b, c, \dots)$ ` (e.g. advertisements and queries). This is called bipartite matching.

A greedy algorithm would simply pair the given vertex on the left with any eligible vertex on the right.

How do we assess how good the algorithm is?

Competitive Ratio

Algorithm’s worst performance over all possible inputs: ratio of algorithm’s solution to optimal solution (if future is known).

If average performance is taken, an adversary could ensure the worst case is always reached.

For input $I$ , suppose the greedy algorithm produces matching $M_\text{greedy}$ and the optimal matching is $M_\text{optimal}$ :

\text{Competitive ratio} = min_{\text{all possible inputs }I}{\left({\frac{|M_\text{greedy}|}{|M_\text{optimal|}}}\right)}

At worst, the greedy algorithm will make half the matches of the optimal algorithm.

AdWords

AdWords is a type of performance-based advertising: advertisers bid on search keywords; when someone searches for that the highest bidder’s ad is shown, and only charged if the ad is clicked.

A stream of queries $q_1, q_2, \dots$ arrives, with several advertisers bidding on each query. When query $q_i$ arrives, the engine must pick a subset of advertisers whose ads are shown.

Advertisers are only charged if the ads are clicked, so to maximize revenue, use the expected revenue per click: bid times the click-through rate (CTR).

Complications:

Each advertiser has a limited daily budget
CTR of the ad is unknown - it is measured historically

The latter is a very hard problem: exploration vs. exploitation: should we keep showing an ad for which we have good estimates of the CTR or show a brand-new ad to get an estimate of the CTR?

Greedy Algorithm

Assume a simplified environment where:

One ad is shown per query
All advertisers have the same budget $B$
All ads have the same CTR
The value of each ad is the same ( $1$ )

The greedy algorithm will simply pick any advertiser who has a bid of $1$ for that query and will have a competitive ratio of $1/2$ :

Advertiser $A$ bids on $x$
Advertiser $B$ bids on $x$ and $y$
Both have budgets of $4$
Query stream of $x, x, x, x, y, y, y, y$
Worst case choice of $B, B, B, B, \_, \_, \_, \_$
N.B. greedy algorithm is deterministic

BALANCE Algorithm

For each query, pick the advertiser with the largest unspent budget, breaking ties in an arbitrary but deterministic way. BALANCE is much better than greedy but not optimal.

Advertiser $A$ bids on $x$
Advertiser $B$ bids on $x$ and $y$
Both have budgets of $4$
Query stream of $x, x, x, x, y, y, y, y$
BALANCE will choose $A, B, A, B, B, B, \_, \_$