12. Spark MMDS Wrap-up

Additional Spark Libraries

Clustering

Problem: what if everything is connected? e.g. roads connect all cities together. They may not be direct connections, or connections may not be connected (e.g. flying). Hence, it is a fully connected graph.

Hence we can always find cost between any two points; direct connection exists for every single point.

Given a set of points with a notion of distance between points, group the points into some number of clusters.

This is a problem, especially for higher-dimensional spaces. e.g. for movie recommendations, each movie is a dimension.

Point Assignment: kk-means

Pick the number of clusters kk and initialize clusters by picking one point for each cluster (hopefully each far away from the other points).

For each point, place it in the cluster whose centroid (‘center of mass’ of a cluster) is the nearest. Need to compute nknk distances per iteration.

After all points are assigned, update the locations of the centroids (a computed value so it is not necessarily an existing point).

Repeat until it converges (i.e. until centroids do not update).

Online Algorithms

Offline/classical algorithms: algorithms have access to the entire input

Online algorithms: algorithm given access to the input one piece at a time, and must make irrevocable decisions along the way.

Given a set of elements being selected and elements doing the selecting (and are willing to select some subset), determine which element should be given.

Example: Online Graph Matching

Two sets of vertices; connect vertex from set (1,2,3,)(1, 2, 3, \dots) to a vertex on the set (a,b,c,)(a, b, c, \dots)` (e.g. advertisements and queries). This is called bipartite matching.

A greedy algorithm would simply pair the given vertex on the left with any eligible vertex on the right.

How do we assess how good the algorithm is?

Competitive Ratio

Algorithm’s worst performance over all possible inputs: ratio of algorithm’s solution to optimal solution (if future is known).

If average performance is taken, an adversary could ensure the worst case is always reached.

For input II, suppose the greedy algorithm produces matching MgreedyM_\text{greedy} and the optimal matching is MoptimalM_\text{optimal}:

Competitive ratio=minall possible inputs I(MgreedyMoptimal|) \text{Competitive ratio} = min_{\text{all possible inputs }I}{\left({\frac{|M_\text{greedy}|}{|M_\text{optimal|}}}\right)}

At worst, the greedy algorithm will make half the matches of the optimal algorithm.

AdWords

AdWords is a type of performance-based advertising: advertisers bid on search keywords; when someone searches for that the highest bidder’s ad is shown, and only charged if the ad is clicked.

A stream of queries q1,q2,q_1, q_2, \dots arrives, with several advertisers bidding on each query. When query qiq_i arrives, the engine must pick a subset of advertisers whose ads are shown.

Advertisers are only charged if the ads are clicked, so to maximize revenue, use the expected revenue per click: bid times the click-through rate (CTR).

Complications:

The latter is a very hard problem: exploration vs. exploitation: should we keep showing an ad for which we have good estimates of the CTR or show a brand-new ad to get an estimate of the CTR?

Greedy Algorithm

Assume a simplified environment where:

The greedy algorithm will simply pick any advertiser who has a bid of 11 for that query and will have a competitive ratio of 1/21/2:

BALANCE Algorithm

For each query, pick the advertiser with the largest unspent budget, breaking ties in an arbitrary but deterministic way. BALANCE is much better than greedy but not optimal.