Additional Spark Libraries
- GraphX: represents graph as RDD; PageRank built in
- MLib: common ML algorithms inc. clustering (for high-dimensional spaces that are niggly to represent as graphs)
- Streaming: for live-streamed data (e.g. online ad matching)
- SQL: wrapper around RDDs
Clustering
Problem: what if everything is connected? e.g. roads connect all cities together. They may not be direct connections, or connections may not be connected (e.g. flying). Hence, it is a fully connected graph.
Hence we can always find cost between any two points; direct connection exists for every single point.
Given a set of points with a notion of distance between points, group the points into some number of clusters.
This is a problem, especially for higher-dimensional spaces. e.g. for movie recommendations, each movie is a dimension.
Point Assignment: -means
Pick the number of clusters
For each point, place it in the cluster whose centroid (‘center of mass’ of a cluster) is the nearest. Need to compute
After all points are assigned, update the locations of the centroids (a computed value so it is not necessarily an existing point).
Repeat until it converges (i.e. until centroids do not update).
Online Algorithms
Offline/classical algorithms: algorithms have access to the entire input
Online algorithms: algorithm given access to the input one piece at a time, and must make irrevocable decisions along the way.
Given a set of elements being selected and elements doing the selecting (and are willing to select some subset), determine which element should be given.
Example: Online Graph Matching
Two sets of vertices; connect vertex from set
A greedy algorithm would simply pair the given vertex on the left with any eligible vertex on the right.
How do we assess how good the algorithm is?
Competitive Ratio
Algorithm’s worst performance over all possible inputs: ratio of algorithm’s solution to optimal solution (if future is known).
If average performance is taken, an adversary could ensure the worst case is always reached.
For input
At worst, the greedy algorithm will make half the matches of the optimal algorithm.
AdWords
AdWords is a type of performance-based advertising: advertisers bid on search keywords; when someone searches for that the highest bidder’s ad is shown, and only charged if the ad is clicked.
A stream of queries
Advertisers are only charged if the ads are clicked, so to maximize revenue, use the expected revenue per click: bid times the click-through rate (CTR).
Complications:
- Each advertiser has a limited daily budget
- CTR of the ad is unknown - it is measured historically
The latter is a very hard problem: exploration vs. exploitation: should we keep showing an ad for which we have good estimates of the CTR or show a brand-new ad to get an estimate of the CTR?
Greedy Algorithm
Assume a simplified environment where:
- One ad is shown per query
- All advertisers have the same budget
- All ads have the same CTR
- The value of each ad is the same (
)
The greedy algorithm will simply pick any advertiser who has a bid of
- Advertiser
bids on - Advertiser
bids on and - Both have budgets of
- Query stream of
- Worst case choice of
- N.B. greedy algorithm is deterministic
BALANCE Algorithm
For each query, pick the advertiser with the largest unspent budget, breaking ties in an arbitrary but deterministic way. BALANCE is much better than greedy but not optimal.
- Advertiser
bids on - Advertiser
bids on and - Both have budgets of
- Query stream of
- BALANCE will choose