Nodes: webpages; edges: hyperlinks
Not all pages are equally important, so we need a way to rank webpages. How? By link structure.
Encode the entire web as a sparse structure (an adjacency list): each page only has a few links, so encoding the graph as a dense adjacency matrix would be very wasteful.
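A minimal sketch of such an adjacency-list encoding in Python (this toy three-page graph matches the y/a/m example used later in these notes):

```python
# Sparse adjacency-list encoding: only the links that exist are stored,
# unlike an n x n adjacency matrix that would be mostly zeros.
web_graph = {
    "y": ["y", "a"],   # y links to itself and to a
    "a": ["y", "m"],   # a links to y and to m
    "m": ["m"],        # m links only to itself
}

out_degree = {page: len(links) for page, links in web_graph.items()}
print(out_degree)  # {'y': 2, 'a': 2, 'm': 1}
```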
Links are votes; a page is more important if it has more in-links.
Links from more important pages count more (enter: recursion):
- If a page $j$ with importance $r_j$ has $d_j$ out-links, each link gets $r_j / d_j$ votes
- Page j’s importance is the sum of votes on its in-links
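A quick numeric instance of the vote-splitting rule (numbers chosen purely for illustration):

$$
r_j = 0.6,\quad d_j = 3 \;\Rightarrow\; \text{each out-link of } j \text{ carries } \frac{r_j}{d_j} = \frac{0.6}{3} = 0.2 \text{ votes}
$$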
Start with a base model that uses the number of in-links. Then, iterate, increasing the importance of links from highly-ranked pages.
The rank of a webpage is given as:
$$
r_j = \sum_{i \rightarrow j} \frac{r_i}{d_i}
$$
where $d_i$ is the out-degree of $i$ (the number of outgoing links from $i$).
Hence, the rank of each page is defined in terms of the (currently unknown) ranks of the pages that link to it. This gives a set of simultaneous equations: with $n$ pages, there are $n$ equations and $n$ unknowns.
By constraining all ranks to sum to 1 (or any other arbitrary value), we can find a unique solution to the problem.
Gaussian elimination works for small systems (an $O(n^3)$ algorithm) but does not scale to web-sized graphs.
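For small graphs the flow equations can be solved directly. A minimal numpy sketch, using a hypothetical three-page graph (a→b, b→a, b→c, c→a) chosen for illustration:

```python
import numpy as np

# Column-stochastic matrix M for the hypothetical graph a->b, b->{a,c}, c->a.
M = np.array([
    [0.0, 0.5, 1.0],   # a
    [1.0, 0.0, 0.0],   # b
    [0.0, 0.5, 0.0],   # c
])
n = M.shape[0]

# Flow equations (M - I) r = 0 plus the constraint sum(r) = 1,
# stacked into one over-determined system and solved by least squares.
A = np.vstack([M - np.eye(n), np.ones(n)])
b = np.zeros(n + 1)
b[-1] = 1.0
r, *_ = np.linalg.lstsq(A, b, rcond=None)
print(r)  # approximately [0.4, 0.4, 0.2]
```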
Stochastic Adjacency Matrix M:
$$
M_{ji} = \begin{cases}
\frac{1}{d_i} & i\rightarrow j,\\
0 & \text{otherwise}
\end{cases}
$$
M is a column stochastic matrix - columns sum to 1.
The rank vector $r$ has one entry per page: $r_i$ is the importance of page $i$. Initially, set each entry to $1/n$, where $n$ is the number of pages.
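A sketch of building M and the initial rank vector from an adjacency list (the dict-of-lists input and the helper name `build_stochastic_matrix` are assumptions for illustration; dead ends are not handled):

```python
import numpy as np

def build_stochastic_matrix(graph):
    """Return (M, pages) where M[j, i] = 1/d_i if page i links to page j.

    `graph` maps each page to the list of pages it links to.
    Assumes every page has at least one out-link (no dead ends),
    so every column of M sums to 1.
    """
    pages = sorted(graph)
    index = {page: k for k, page in enumerate(pages)}
    n = len(pages)
    M = np.zeros((n, n))
    for i, links in graph.items():
        for j in links:
            M[index[j], index[i]] = 1.0 / len(links)
    return M, pages

graph = {"y": ["y", "a"], "a": ["y", "m"], "m": ["m"]}
M, pages = build_stochastic_matrix(graph)
r = np.full(len(pages), 1.0 / len(pages))  # initial rank: 1/n per page
print(M.sum(axis=0))  # each column sums to 1
```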
The flow equations can be written as
$$
r = M \cdot r
$$
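In practice this fixed point is found by power iteration: start from the uniform vector and repeatedly multiply by M until the ranks stop changing. A minimal sketch (the function name and tolerance are illustrative choices):

```python
import numpy as np

def power_iterate(M, tol=1e-9, max_iter=1000):
    """Iterate r <- M r from the uniform vector until the L1 change is below tol."""
    n = M.shape[0]
    r = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        r_next = M @ r
        if np.abs(r_next - r).sum() < tol:
            return r_next
        r = r_next
    return r
```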
At time t, a surfer is on some page i. At time t+1, the surfer follows an out-link from i at random. This process repeats indefinitely.
Let $p(t)$ be a vector whose $i$-th coordinate is the probability that the surfer is on page $i$ at time $t$; hence, $p(t)$ is a probability distribution over pages.
At time t, the surfer will follow a link at random:
$$
p(t + 1) = M \cdot p(t)
$$
For graphs that satisfy certain conditions (irreducibility and aperiodicity), the stationary distribution $p$ is unique and is reached regardless of the initial probability distribution at time $t = 0$.
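One way to see this empirically is to simulate a single surfer and count page visits; on a well-behaved graph the visit frequencies settle to the same distribution regardless of the start page. A rough sketch (the graph and function name are hypothetical):

```python
import random
from collections import Counter

def simulate_surfer(graph, start, steps=100_000, seed=0):
    """Follow random out-links for `steps` steps; return visit frequencies."""
    rng = random.Random(seed)
    visits = Counter()
    page = start
    for _ in range(steps):
        visits[page] += 1
        page = rng.choice(graph[page])  # follow one out-link uniformly at random
    return {p: count / steps for p, count in visits.items()}

# Hypothetical graph with no dead ends or spider traps:
graph = {"a": ["b"], "b": ["a", "c"], "c": ["a"]}
print(simulate_surfer(graph, start="a"))
print(simulate_surfer(graph, start="c"))  # roughly the same frequencies
```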
Graph: ⟳y <---> a ---> m⟳ (y links to itself and to a; a links to y and to m; m links only to itself)
If we let $r_0 = \begin{bmatrix} 1/3 \\ 1/3 \\ 1/3 \end{bmatrix}$, $r_1$ will equal:
$$
\begin{bmatrix} 1/2 & 1/2 & 0 \\ 1/2 & 0 & 0 \\ 0 & 1/2 & 1 \end{bmatrix} \cdot \begin{bmatrix} 1/3 \\ 1/3 \\ 1/3 \end{bmatrix} = \begin{bmatrix} 2/6 \\ 1/6 \\ 3/6 \end{bmatrix}
$$
In the above example, the importance of m gets larger over time: its only out-link points back to itself, so importance flows in but never back out. After one more iteration, m absorbs even more importance.
$$
r_2 = M \cdot \begin{bmatrix} 2/6 \\ 1/6 \\ 3/6 \end{bmatrix} = \begin{bmatrix} 3/12 \\ 2/12 \\ 7/12 \end{bmatrix}
$$
This is a spider trap: all out-links stay within a group (here, just m itself), so the walker gets stuck in the trap, which eventually absorbs all importance. Dead ends, pages with no out-links at all, are a separate issue not present in this graph; there, importance 'leaks out' of the system.
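Iterating a few more times makes the trap explicit; a small numpy sketch using the matrix M defined above for this graph:

```python
import numpy as np

# Column-stochastic matrix for the y/a/m graph (m's only out-link is to itself).
M = np.array([
    [0.5, 0.5, 0.0],   # y
    [0.5, 0.0, 0.0],   # a
    [0.0, 0.5, 1.0],   # m
])
r = np.array([1/3, 1/3, 1/3])
for _ in range(30):
    r = M @ r
print(r)  # approaches [0, 0, 1]: m has absorbed almost all the importance
```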
To solve this, simply teleport.
At every time step, with probability $1-\beta$, the surfer teleports to a random page; $\beta$ is usually somewhere between 0.8 and 0.9. This leads to the PageRank equation.
If $N$ is the number of pages, $r_j$ is given by:
$$
r_j = \frac{1 - \beta}{N} + \beta \sum_{i \rightarrow j}{\frac{r_i}{d_i}}
$$
The first term comes from the walker teleporting (with probability $1-\beta$) and landing on page $j$ (with probability $1/N$). The multiplication by $\beta$ in the second term reflects the fact that the walker only reaches $j$ by following an out-link a fraction $\beta$ of the time.
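A sketch of PageRank updates in this summation form, applied directly to an adjacency list (the function name `pagerank_step` is an assumption; dead ends are not handled):

```python
def pagerank_step(graph, ranks, beta=0.85):
    """One update of r_j = (1 - beta)/N + beta * sum_{i -> j} r_i / d_i."""
    n = len(graph)
    new_ranks = {page: (1 - beta) / n for page in graph}
    for i, links in graph.items():
        share = beta * ranks[i] / len(links)  # each out-link of i carries this much
        for j in links:
            new_ranks[j] += share
    return new_ranks

graph = {"y": ["y", "a"], "a": ["y", "m"], "m": ["m"]}
ranks = {page: 1 / len(graph) for page in graph}
for _ in range(50):
    ranks = pagerank_step(graph, ranks)
print(ranks)  # m still ranks highest, but no longer absorbs all the importance
```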
If $\left[\frac{1}{N}\right]_{N\times N}$ is an $N \times N$ matrix where all entries are $1/N$:
$$
A = \beta M + (1 - \beta)\left[\frac{1}{N}\right]_{N\times N}
$$
PageRank can then be applied iteratively using:
$$
r = A \cdot r
$$
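Equivalently, A can be formed explicitly and power-iterated; a dense sketch for the small example above (real implementations keep M sparse and add the teleport term implicitly, since A itself is dense):

```python
import numpy as np

def google_matrix(M, beta=0.85):
    """A = beta * M + (1 - beta) * [1/N]_{N x N}."""
    n = M.shape[0]
    return beta * M + (1 - beta) * np.full((n, n), 1.0 / n)

M = np.array([
    [0.5, 0.5, 0.0],   # y
    [0.5, 0.0, 0.0],   # a
    [0.0, 0.5, 1.0],   # m
])
A = google_matrix(M)
r = np.full(3, 1 / 3)
for _ in range(100):
    r = A @ r
print(r)  # PageRank of y, a, m with teleportation (beta = 0.85)
```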