10. TrustRank, Recommendation Systems, Social Network Graph Analysis

PageRank

Spark Matrix Encoding

Encode the spark matrix using only non-zero entries - most pages will only have a few links. Hence, store as an adjacency list, storing the source page, out-degree, and a list of destination pages.

Assume there is enough RAM to fit rank vector $r^{new}$ into memory and store $r^{new}$ and $M$ on disk.

Power-Iteration

Initialize of entries of $r^{new}$ to $\frac{1 - \beta}{N}$ . For each page $i$ with out-degree $d_i$ , read into memory $d_i$ , $r^{old}(i)$ and out-links $dest_{1 \dots d_i}$

For $j = 1 \dots d_i$ : $$ r^{new}(dest_j) \mathrel{{+}{=}} \beta \cdot \frac{r^{old}(i)}{d_i} $$ In each iteration, $r^{old}$ and $M$ must be read and $r^{new}$ written to disk. Hence, each iteration is $2|r| + |m|$ .

Topic-Specific PageRank

Teleport can go to:

Standard PageRank; to avoid dead-ends and spider-traps
Topic-specific PageRank: a topic-specific set of ‘relevant’ pages

Bias the random walk:

When the walker teleports, page a page from set $S$ which contains only pages relevant to the topic
- Each teleport set $S$ gives a different vector $r_s$

The user got bored of the page they were on, then went to a page on the same topic, not a completely random page.

Web Spam

Boosting a page’s position in search engine results.

Early spam: Term Spam

Add the word ‘movie’ 1000 times in invisible text
Copy text from top result for the term ‘movie’

Solution: use words in the anchor text of an incoming link and around it; hard to control who links to your page.

This solution allows communities to boost results to a specific page for any topic, even if the page itself never has that term in it.

Round Two: Spam Farms and TrustRank

A way to concentrate PageRank onto a single page.

Spammer has access to some pages e.g. forums; make post with link to target page $t$ .

Target page $t$ and pages that are owned by the spammer link to each other.

To combat term spam, we can analyze text with statistical methods (similar to email spam filtering).

TrustRank

To combat link spam, we can detect and blacklist structures that look like spam farms.

In addition, we can have a topic-specific PageRank with a set of trusted pages as the teleport set. This is TrustRank.

The core idea is that of approximate isolation; a trusted page is unlikely to point to a spam page. Hence, the closer a page is to a trusted page, the more it will be trusted itself.

TrustRank uses a list of trusted domains (e.g. .edu domains) that are the entry point to the simulation; this is done manually by a human, called an oracle, and hence is usually small to reduce costs.

Topic-specific PageRank forces teleportation to only send you to pages within the topic set. A similar idea is done for TrustRank, only allowing teleportation to the human-curated trusted set (on the given topic).

Recommendation Systems

Self space is a scarce commodity: this forces retailers to stock the products that appeal to the largest portion of the customer base.

On the internet, the shelf space is essentially unlimited; there is an abundance of information that can be disseminated to customers.

Types of recommendations:

Editorial/hand-curated
Simple aggregates e.g. most popular
Tailored recommendations

Utility Matrix Model

Matrix of users (rows) and movies (columns). Each user can choose to give a rating to a movie.

Using this information, recommend a movie that they have not yet ranked.

Once an initial set of ratings are gathered (which itself is an issue), the problem is to extrapolate (high) unknown ratings from known ones and evaluating the performance of the method.

Content-Based Recommendations

Recommending items that are similar to previous items rated highly by the customer:

Recommending movies with the same actors, directors, genre etc.
Recommending blogs/websites with similar content

Collaborative Filtering

Find a set $N$ of users whose ratings are ‘similar’ to user $x$ ’s ratings.

Then, estimate $x$ ’s ratings based on the ratings of users in $N$ .

This works for any kind of item (no feature selection is required), but has several cons:

Cold start: need enough users to in the system to find a match
Sparsity: hard to find users that have rated the same items
First rater: cannot recommend items that have not been previously rated (new/niche items)
Popularity bias: tends to recommend popular items and not work well for people with niche tastes

Networks & Communities

Figure out what group of users you are similar to to recommend things.

To do this, use information on the type of users you interact with. Networks are organized into modules, clusters and communities. Find densely linked clusters and calculate the ‘distance’ of user to cluster.

Strength of Weak Ties

Edge betweenness: number of shortest paths passing over the edge.

Edges with high values will be the backbones of the network.

Girvan-Newman Algorithm uses this edge betweenness to decompose a undirected, unweighted network into subnetworks:

Calculate the betweenness of edges
Find and remove the edge with greatest edge-betweenness
- Connected components are communities
Repeat until there are no edges left