PageRank
Spark Matrix Encoding
Encode the spark matrix using only non-zero entries - most pages will only have a few links. Hence, store as an adjacency list, storing the source page, out-degree, and a list of destination pages.
Assume there is enough RAM to fit rank vector
Power-Iteration
Initialize of entries of
For
Topic-Specific PageRank
Teleport can go to:
- Standard PageRank; to avoid dead-ends and spider-traps
- Topic-specific PageRank: a topic-specific set of ‘relevant’ pages
Bias the random walk:
- When the walker teleports, page a page from set
which contains only pages relevant to the topic - Each teleport set
gives a different vector
- Each teleport set
The user got bored of the page they were on, then went to a page on the same topic, not a completely random page.
Web Spam
Boosting a page’s position in search engine results.
Early spam: Term Spam
- Add the word ‘movie’ 1000 times in invisible text
- Copy text from top result for the term ‘movie’
Solution: use words in the anchor text of an incoming link and around it; hard to control who links to your page.
This solution allows communities to boost results to a specific page for any topic, even if the page itself never has that term in it.
Round Two: Spam Farms and TrustRank
A way to concentrate PageRank onto a single page.
Spammer has access to some pages e.g. forums; make post with link to target page
Target page
To combat term spam, we can analyze text with statistical methods (similar to email spam filtering).
TrustRank
To combat link spam, we can detect and blacklist structures that look like spam farms.
In addition, we can have a topic-specific PageRank with a set of trusted pages as the teleport set. This is TrustRank.
The core idea is that of approximate isolation; a trusted page is unlikely to point to a spam page. Hence, the closer a page is to a trusted page, the more it will be trusted itself.
TrustRank uses a list of trusted domains (e.g. .edu domains) that are the entry point to the simulation; this is done manually by a human, called an oracle, and hence is usually small to reduce costs.
Topic-specific PageRank forces teleportation to only send you to pages within the topic set. A similar idea is done for TrustRank, only allowing teleportation to the human-curated trusted set (on the given topic).
Recommendation Systems
Self space is a scarce commodity: this forces retailers to stock the products that appeal to the largest portion of the customer base.
On the internet, the shelf space is essentially unlimited; there is an abundance of information that can be disseminated to customers.
Types of recommendations:
- Editorial/hand-curated
- Simple aggregates e.g. most popular
- Tailored recommendations
Utility Matrix Model
Matrix of users (rows) and movies (columns). Each user can choose to give a rating to a movie.
Using this information, recommend a movie that they have not yet ranked.
Once an initial set of ratings are gathered (which itself is an issue), the problem is to extrapolate (high) unknown ratings from known ones and evaluating the performance of the method.
Content-Based Recommendations
Recommending items that are similar to previous items rated highly by the customer:
- Recommending movies with the same actors, directors, genre etc.
- Recommending blogs/websites with similar content
Collaborative Filtering
Find a set
Then, estimate
This works for any kind of item (no feature selection is required), but has several cons:
- Cold start: need enough users to in the system to find a match
- Sparsity: hard to find users that have rated the same items
- First rater: cannot recommend items that have not been previously rated (new/niche items)
- Popularity bias: tends to recommend popular items and not work well for people with niche tastes
Networks & Communities
Figure out what group of users you are similar to to recommend things.
To do this, use information on the type of users you interact with. Networks are organized into modules, clusters and communities. Find densely linked clusters and calculate the ‘distance’ of user to cluster.
Strength of Weak Ties
Edge betweenness: number of shortest paths passing over the edge.
Edges with high values will be the backbones of the network.
Girvan-Newman Algorithm uses this edge betweenness to decompose a undirected, unweighted network into subnetworks:
- Calculate the betweenness of edges
- Find and remove the edge with greatest edge-betweenness
- Connected components are communities
- Repeat until there are no edges left