10. TrustRank, Recommendation Systems, Social Network Graph Analysis

PageRank

Spark Matrix Encoding

Encode the spark matrix using only non-zero entries - most pages will only have a few links. Hence, store as an adjacency list, storing the source page, out-degree, and a list of destination pages.

Assume there is enough RAM to fit rank vector rnewr^{new} into memory and store rnewr^{new} and MM on disk.

Power-Iteration

Initialize of entries of rnewr^{new} to 1βN\frac{1 - \beta}{N}. For each page ii with out-degree did_i, read into memory did_i, rold(i)r^{old}(i) and out-links dest1didest_{1 \dots d_i}

For j=1dij = 1 \dots d_i: $$ r^{new}(dest_j) \mathrel{{+}{=}} \beta \cdot \frac{r^{old}(i)}{d_i} $$ In each iteration, roldr^{old} and MM must be read and rnewr^{new} written to disk. Hence, each iteration is 2r+m2|r| + |m|.

Topic-Specific PageRank

Teleport can go to:

Bias the random walk:

The user got bored of the page they were on, then went to a page on the same topic, not a completely random page.

Web Spam

Boosting a page’s position in search engine results.

Early spam: Term Spam

Solution: use words in the anchor text of an incoming link and around it; hard to control who links to your page.

This solution allows communities to boost results to a specific page for any topic, even if the page itself never has that term in it.

Round Two: Spam Farms and TrustRank

A way to concentrate PageRank onto a single page.

Spammer has access to some pages e.g. forums; make post with link to target page tt.

Target page tt and pages that are owned by the spammer link to each other.

To combat term spam, we can analyze text with statistical methods (similar to email spam filtering).

TrustRank

To combat link spam, we can detect and blacklist structures that look like spam farms.

In addition, we can have a topic-specific PageRank with a set of trusted pages as the teleport set. This is TrustRank.

The core idea is that of approximate isolation; a trusted page is unlikely to point to a spam page. Hence, the closer a page is to a trusted page, the more it will be trusted itself.

TrustRank uses a list of trusted domains (e.g. .edu domains) that are the entry point to the simulation; this is done manually by a human, called an oracle, and hence is usually small to reduce costs.

Topic-specific PageRank forces teleportation to only send you to pages within the topic set. A similar idea is done for TrustRank, only allowing teleportation to the human-curated trusted set (on the given topic).

Recommendation Systems

Self space is a scarce commodity: this forces retailers to stock the products that appeal to the largest portion of the customer base.

On the internet, the shelf space is essentially unlimited; there is an abundance of information that can be disseminated to customers.

Types of recommendations:

Utility Matrix Model

Matrix of users (rows) and movies (columns). Each user can choose to give a rating to a movie.

Using this information, recommend a movie that they have not yet ranked.

Once an initial set of ratings are gathered (which itself is an issue), the problem is to extrapolate (high) unknown ratings from known ones and evaluating the performance of the method.

Content-Based Recommendations

Recommending items that are similar to previous items rated highly by the customer:

Collaborative Filtering

Find a set NN of users whose ratings are ‘similar’ to user xx’s ratings.

Then, estimate xx’s ratings based on the ratings of users in NN.

This works for any kind of item (no feature selection is required), but has several cons:

Networks & Communities

Figure out what group of users you are similar to to recommend things.

To do this, use information on the type of users you interact with. Networks are organized into modules, clusters and communities. Find densely linked clusters and calculate the ‘distance’ of user to cluster.

Strength of Weak Ties

Edge betweenness: number of shortest paths passing over the edge.

Edges with high values will be the backbones of the network.

Girvan-Newman Algorithm uses this edge betweenness to decompose a undirected, unweighted network into subnetworks: