09 Graph Structures, PageRank and TrustRank

Web as a Graph

Nodes: webpages; edges: hyperlinks

Not all pages are equally important. We need to rank webpages; how? Rank pages by link structure.

Encode the entire web as a sparse structure (an adjacency list): each page links to only a handful of others, so a dense adjacency matrix would be almost entirely zeros and very wasteful.
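A minimal sketch of this encoding, using a tiny hypothetical three-page graph (page names are illustrative):

```python
# Sparse encoding of a tiny, hypothetical web graph as an adjacency
# list: each page maps to the pages it links to. A real crawl has
# billions of pages but only a few links per page, so a dense
# N x N adjacency matrix would be almost entirely zeros.
out_links = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
}

# Out-degrees d_i, needed later for the flow equations.
out_degree = {page: len(links) for page, links in out_links.items()}
```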

PageRank: ‘Flow’ Formulation

Links are votes; a page is more important if it has more in-links.

Links from more important pages count more (enter: recursion):

Flow

Start with a base model that uses the number of in-links. Then, iterate, increasing the importance of links from highly-ranked pages.

The rank of a webpage is given as:

$$ r_j = \sum_{i \rightarrow j}\frac{r_i}{d_i} $$

Where $d_i$ is the out-degree of $i$ (the number of outgoing links from $i$).

Hence, the rank of each page is defined in terms of the (currently unknown) ranks of the pages that link to it. This gives a system of simultaneous equations: with $n$ pages, there are $n$ equations in $n$ unknowns.

By constraining the ranks to sum to $1$ (or any other arbitrary value), we can find a unique solution to the problem.
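As a worked example, consider a hypothetical three-page graph with links $a \rightarrow b$, $b \rightarrow a$, $b \rightarrow c$, and $c \rightarrow b$. The flow equations are:

$$ r_a = \frac{r_b}{2}, \qquad r_b = r_a + r_c, \qquad r_c = \frac{r_b}{2} $$

Together with the constraint $r_a + r_b + r_c = 1$, this gives $r_b = 1/2$ and $r_a = r_c = 1/4$.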

Gaussian elimination works for small systems, but it is an $O(n^3)$ algorithm and does not scale to large web-sized graphs.

Matrix Formulation

Stochastic Adjacency Matrix $M$: $$ M_{ji} = \begin{cases} \frac{1}{d_i} & i\rightarrow j \\ 0 & \text{otherwise} \end{cases} $$ $M$ is a column-stochastic matrix: its columns sum to 1.

The rank vector $r$ is a vector where $r_i$ is the importance of page $i$. Initially, set each entry to $1/n$, where $n$ is the number of pages.

The flow equations can be written as $$ r = M \cdot r $$
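A sketch of constructing $M$ from an adjacency list, reusing a small hypothetical graph (page names are illustrative):

```python
# Build the column-stochastic matrix M from an adjacency list.
# Entry M[j][i] = 1/d_i when page i links to page j, else 0.
pages = ["a", "b", "c"]
out_links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}

idx = {p: k for k, p in enumerate(pages)}
n = len(pages)
M = [[0.0] * n for _ in range(n)]
for i, links in out_links.items():
    for j in links:
        M[idx[j]][idx[i]] = 1.0 / len(links)

# Every column sums to 1, i.e. M is column stochastic.
col_sums = [sum(M[j][i] for j in range(n)) for i in range(n)]
```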

Random Walk Interpretation

At time $t$, a surfer is on some page $i$. At time $t+1$, the surfer follows an out-link from $i$ uniformly at random. This process repeats indefinitely.

Let $p(t)$ be a vector whose $i$-th coordinate is the probability that the surfer is on page $i$ at time $t$; hence, $p(t)$ is a probability distribution over pages.

Stationary Distribution

At each step, the surfer follows a link at random: $$ p(t + 1) = M \cdot p(t) $$ For graphs satisfying certain conditions (irreducibility and aperiodicity), the stationary distribution $p$ is unique, and the walk converges to it regardless of the initial distribution at time $t = 0$.
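A minimal power-iteration sketch in plain Python, run on a hypothetical three-page graph (assumes no dead ends or spider traps, so the iteration converges):

```python
# Repeatedly apply p <- M p until the distribution stops changing.
def power_iterate(M, p, tol=1e-12, max_iter=10000):
    n = len(p)
    for _ in range(max_iter):
        nxt = [sum(M[j][i] * p[i] for i in range(n)) for j in range(n)]
        if max(abs(nxt[j] - p[j]) for j in range(n)) < tol:
            return nxt
        p = nxt
    return p

# Hypothetical graph: a -> b, a -> c, b -> c, c -> a
# (columns of M ordered a, b, c).
M = [[0.0, 0.0, 1.0],
     [0.5, 0.0, 0.0],
     [0.5, 1.0, 0.0]]
p = power_iterate(M, [1/3, 1/3, 1/3])
# p converges to the stationary distribution [0.4, 0.2, 0.4]
```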

Example

Graph: $y \leftrightarrow a \rightarrow m$, where $y$ and $m$ each also link to themselves.

If we let $r_0 = \begin{bmatrix} 1/3 \\ 1/3 \\ 1/3 \end{bmatrix}$ (ordering the pages $y, a, m$), then $r_1 = M \cdot r_0$ is:

[1/21/201/20001/21][1/31/31/3]=[2/61/63/6] \begin{bmatrix} 1/2 & 1/2 & 0 \\ 1/2 & 0 & 0 \\ 0 & 1/2 & 1 \end{bmatrix} \cdot \begin{bmatrix} 1/3 \\ 1/3 \\ 1/3 \end{bmatrix} = \begin{bmatrix} 2/6 \\ 1/6 \\ 3/6 \end{bmatrix}
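The step above can be checked numerically (pages ordered $y, a, m$):

```python
# One step of r <- M r for the y/a/m example.
M = [[0.5, 0.5, 0.0],
     [0.5, 0.0, 0.0],
     [0.0, 0.5, 1.0]]
r0 = [1/3, 1/3, 1/3]
r1 = [sum(M[j][i] * r0[i] for i in range(3)) for j in range(3)]
# r1 == [2/6, 1/6, 3/6]
```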

Teleports

In the above example, the importance of $m$ grows over time: $m$'s only out-link points back to itself, so importance flows in but never back out. After one more iteration, $m$ absorbs even more importance:

r2=M[2/61/63/6]=[3/122/127/12] r_2 = M \cdot \begin{bmatrix} 2/6 \\ 1/6 \\ 3/6 \end{bmatrix} = \begin{bmatrix} 3/12 \\ 2/12 \\ 7/12 \end{bmatrix}

This is the spider-trap problem: all of a group's out-links stay within the group (here, $m$ alone forms a trap, since it links only to itself), so the walker gets stuck and the trap eventually absorbs all importance. A related problem is the dead end: a page with no out-links at all, whose column in $M$ sums to 0, so importance 'leaks' out of the system entirely.
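Iterating on the example graph shows the trap at $m$ soaking up all the importance (a sketch; pages ordered $y, a, m$):

```python
# Without teleports, repeated application of M sends everything to m.
M = [[0.5, 0.5, 0.0],
     [0.5, 0.0, 0.0],
     [0.0, 0.5, 1.0]]
r = [1/3, 1/3, 1/3]
for _ in range(200):
    r = [sum(M[j][i] * r[i] for i in range(3)) for j in range(3)]
# r is now approximately [0, 0, 1]
```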

To solve this, simply teleport.

With probability $1 - \beta$, teleport to some random page at every time step; $\beta$ is usually somewhere between $0.8$ and $0.9$. This leads to the PageRank equation.

PageRank Equation

If $N$ is the number of pages, $r_j$ is given by: $$ r_j = \frac{1 - \beta}{N} + \beta \sum_{i \rightarrow j}{\frac{r_i}{d_i}} $$

The first term accounts for teleportation: with probability $1 - \beta$ the walker teleports, landing on page $j$ with probability $1/N$. The factor of $\beta$ in the second term reflects the fact that the walker only follows an out-link a fraction $\beta$ of the time.

If $\left[\frac{1}{N}\right]_{N\times N}$ is an $N \times N$ matrix in which every entry is $1/N$: $$ A = \beta M + (1 - \beta)\left[\frac{1}{N}\right]_{N\times N} $$

PageRank can then be applied iteratively using:

r=Ar r = A \cdot r
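Putting it together, a minimal sketch of iterative PageRank with teleports, applying the flow form of the equation directly to an adjacency list instead of materialising the dense matrix $A$ (graph and $\beta = 0.85$ are illustrative; assumes no dead ends):

```python
# Iterate r_j <- (1 - beta)/N + beta * sum_{i -> j} r_i / d_i
# until convergence. Assumes every page has at least one out-link.
def pagerank(out_links, beta=0.85, tol=1e-12, max_iter=10000):
    pages = sorted(out_links)
    n = len(pages)
    r = {p: 1.0 / n for p in pages}
    for _ in range(max_iter):
        nxt = {p: (1.0 - beta) / n for p in pages}
        for i, links in out_links.items():
            share = beta * r[i] / len(links)
            for j in links:
                nxt[j] += share
        if max(abs(nxt[p] - r[p]) for p in pages) < tol:
            return nxt
        r = nxt
    return r

# The spider-trap graph from above: with teleports, m keeps a large
# share of the importance but no longer absorbs all of it.
ranks = pagerank({"y": ["y", "a"], "a": ["y", "m"], "m": ["m"]})
```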