Large Graph Analysis with PageRank
Assumptions
Formal definition
PageRank equation
Eigenvectors
Power iteration
Random walk
Convergence
Increasing efficiency
Complete algorithm
Conclusion
References

Discover how the Google search engine ranks documents based on their link structure


Ranking is a vital problem in machine learning. Given a set of documents, the goal is to sort them in a particular order based on certain criteria. Ranking is widely used in information retrieval systems to sort search results and in recommender systems to filter content that will potentially be interesting to a specific user.

Depending on the problem and objective, there exists an abundance of ranking algorithms. The one we are going to study in this article is called PageRank. Its principal objective is to rank a set of documents (web pages) by using the information about their connectivity. The rank assigned to every web page indicates its importance: the higher the rank, the higher the importance. The algorithm is based on two assumptions, which we are going to look at in the following section.

Assumptions

We can define the term "importance" of a web page by making two assumptions.

The importance of a web page is high if there are many other web pages pointing to it.

Imagine we have a popular research paper and many other articles linking to it through quotes or results. Essentially, it makes sense to give this paper a large importance. On the other hand, if there is an unknown web page with no links to it from other resources, it seems logical to assign low importance to it.

In reality, we should also care about the quality of the incoming links.

The importance of a web page is proportional to the importance of the web pages pointing to it.

If a page is cited by a high-quality article on Wikipedia, then such a link should have a larger weight. Conversely, when a completely unknown resource points to another web page, the link should not normally contribute much importance.

Example of an importance distribution produced by the PageRank algorithm, taken from the official paper. Scores are normalized to sum up to 100. The node with the value of 38.4 has such high importance due to the large number of other nodes pointing to it. On the other hand, the node with the importance of 34.3 has only one incoming link, but its importance is still relatively high because its single input link comes from another influential node. Nodes with low importances of 1.6 do not have any incoming links.

Formal definition

Let us say that the importance of a node equals the sum of the weights of its incoming links.

Imagine a node i with importance rᵢ that has k outgoing links. How can we determine the weight of each link? The most straightforward approach is to take the node's importance and divide it equally between all of the outgoing links. This way, each outgoing link receives a weight of rᵢ / k.

Example of calculating the rank of a node
The rank of a node equals the sum of the ranks of its incoming nodes, each divided by that node's out-degree.

Given a graph of n web pages, we can create a system of n linear equations to find the weights of the graph. However, such a system can have an infinite number of solutions. That is why we should add another constraint that imposes a unique solution. PageRank adds the normalization condition that the sum of all node importances equals 1.

Finding a solution to the system of linear equations describing the graph structure

We have come up with a solution, but it is not scalable. Even with Gaussian elimination, we end up with O(n³) complexity. Keeping in mind that the number of analyzed web pages n can reach billions, we need to come up with a more efficient approach.

PageRank equation

First of all, let us simplify the notation. For this, we introduce the square adjacency matrix G which contains the link weights for every pair of linked web pages (if two web pages are not linked, we put 0 in the corresponding matrix element). More formally:

Definition of a matrix element G[j][i]

Matrix G is called stochastic because each of its columns sums up to 1.
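
As a quick illustration, here is a minimal sketch in Python (using NumPy) of how such a column-stochastic matrix can be built from an adjacency list; the toy graph and variable names are assumptions for the example, not taken from the original paper.

```python
import numpy as np

# Hypothetical toy graph: each node maps to the list of nodes it links to.
links = {0: [1, 2], 1: [2], 2: [0]}
n = len(links)

# G[i][j] = 1 / out_degree(j) if j links to i, and 0 otherwise,
# so every column of G sums up to 1 (the matrix is column-stochastic).
G = np.zeros((n, n))
for j, targets in links.items():
    for i in targets:
        G[i, j] = 1.0 / len(targets)

print(G)
print(G.sum(axis=0))  # [1. 1. 1.] -> each column sums to 1
```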

Next, we define the rank vector r whose i-th element equals the rank (importance) of page i. The sum of all elements of this vector also equals 1. Our ultimate goal is to find the values of this vector r.

Let us see what happens if we multiply matrix G by vector r. Based on the example with the graph from the section above, we can see that it results in the same vector r!

Multiplying matrix G by vector r results again in vector r

Why does this happen? Is it just a coincidence? Remember that the i-th row of matrix G contains the weights of all input links to page i. When we multiply the j-th element of the i-th row by r[j], we actually get the component r[j] / d[j]out, the importance which flows from node j to node i. If there is no link between nodes i and j, then the respective component is set to 0. Logically, the result of multiplying the i-th row by the vector r equals the sum of all importances which flow from any connected node of the graph to node i. By definition, this value equals the rank of node i. In general, we can write the following equation:

PageRank equation

Therefore, our goal is to find a vector r which, being multiplied by the input matrix G, remains the same.

Eigenvectors

We can find the solution to the equation above by recalling the theory of eigenvectors from linear algebra. Given a matrix A, the vector v is called an eigenvector if there exists a number α which satisfies the following equation:

Eigenvalue definition

The number α is called an eigenvalue. We can notice that the PageRank equation corresponds to the eigenvalue equation where A = G, v = r and α = 1. Normally, any square matrix has several eigenvalues and eigenvectors, but since our matrix G is stochastic, the theory claims that its largest eigenvalue equals 1.
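
To see this claim on a small example, the sketch below (NumPy; the toy matrix is the illustrative one built above) computes the eigenvalues of a column-stochastic matrix: the largest one equals 1, and the corresponding eigenvector, rescaled to sum to 1, gives the rank vector r.

```python
import numpy as np

# The column-stochastic matrix from the toy example above.
G = np.array([[0.0, 0.0, 1.0],
              [0.5, 0.0, 0.0],
              [0.5, 1.0, 0.0]])

eigenvalues, eigenvectors = np.linalg.eig(G)
idx = np.argmax(eigenvalues.real)      # position of the largest eigenvalue
print(eigenvalues.real[idx])           # ~1.0 for a stochastic matrix

r = eigenvectors[:, idx].real
r = r / r.sum()                        # normalize the ranks to sum up to 1
print(r)                               # the rank vector r
```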

Power iteration

One of the most popular ways of finding matrix eigenvectors is the power iteration method. It consists of initializing a vector r with some values (we will use 1 / n, where n is the number of web pages), then repeatedly computing the value of G * r and assigning this value back to r. If on any iteration the distance between vectors r and G * r is lower than a certain threshold ε, we stop the algorithm because it has converged successfully.

PageRank algorithm

In the example above, we can see that by setting ε to 0.0005 the algorithm correctly converges in just 9 iterations.

Obviously, this is just a toy example, but in practice this method works very well for a larger number of variables as well.
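
Here is a minimal sketch of the power iteration described above in Python with NumPy; it reuses the toy matrix from the earlier sketch, and the choice of the L1 distance as the stopping metric is an assumption.

```python
import numpy as np

def power_iteration(G: np.ndarray, eps: float = 0.0005, max_iter: int = 100) -> np.ndarray:
    """Iterate r <- G @ r until the L1 distance between consecutive vectors drops below eps."""
    n = G.shape[0]
    r = np.full(n, 1.0 / n)          # start with a uniform rank vector
    for _ in range(max_iter):
        r_next = G @ r
        if np.abs(r_next - r).sum() < eps:
            return r_next
        r = r_next
    return r

G = np.array([[0.0, 0.0, 1.0],
              [0.5, 0.0, 0.0],
              [0.5, 1.0, 0.0]])
print(power_iteration(G))            # converges to the rank vector r
```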

Random walk

Imagine a surfer (walker) located at some node of the graph at time t. Let us denote by p(t) the vector whose i-th component equals the probability that at time t the surfer is present at node i. Then the surfer randomly (with equal probabilities) chooses one of the nodes linked to the current one and moves there at time t + 1. Ultimately, we want to find the distribution vector p(t + 1) at moment t + 1.

Random walk of the surfer

We can notice that the probability of the surfer appearing at node i at moment t + 1 is the sum (over all nodes j linked to i) of the probability that the surfer was previously at the adjacent node j multiplied by the probability of moving from node j to i.

  • We already know the probability of the surfer appearing at node j at moment t: p(t)[j].
  • The probability of moving from node j to i equals G[i][j].

By summing up these probabilities, we get the value of p(t + 1)[i]. To find the values of p(t + 1) for all of the graph nodes, we can write the same equation in matrix form:
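
Written out in the notation above, this matrix form reads:

p(t + 1) = G · p(t)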

This equation has exactly the same form as the one we obtained for PageRank before! It means these two problems have the same solution and interpretation.

At some point, the distribution vector p(t) will converge: p(t + 1) = G * p(t) = p(t). The converged vector p(t) in this case is called the stationary distribution. At all following moments of time, the probability of residing at any given node does not change.

The PageRank score of a node equals the probability that the surfer will be located at this node in the long run while randomly walking through the graph.
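
As a quick sanity check of this interpretation, the sketch below (plain Python; the toy graph and the number of steps are assumptions) simulates a random surfer and measures the fraction of time it spends at each node; these frequencies approach the stationary distribution.

```python
import random
from collections import Counter

# Hypothetical toy graph given as adjacency lists (node -> outgoing links).
links = {0: [1, 2], 1: [2], 2: [0]}

def simulate_surfer(links, steps=100_000):
    """Let a surfer walk randomly and return the fraction of time spent at each node."""
    node = random.choice(list(links))
    visits = Counter()
    for _ in range(steps):
        node = random.choice(links[node])   # move to a random linked node
        visits[node] += 1
    return {v: count / steps for v, count in sorted(visits.items())}

print(simulate_surfer(links))  # close to the stationary distribution (about 0.4, 0.2, 0.4)
```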

The described process of walking through the graph is often referred to as a "Markov chain". There exists a theorem in Markov chain theory which states that:

Under certain conditions on the graph structure, the stationary distribution is unique and can be reached from any initial probability distribution at moment t = 0.

In the following section, we will go more in depth into the conditions that have to be satisfied for unique convergence. It turns out that unique convergence cannot be achieved for all graphs.

Convergence

In general, there exist two kinds of cases that we want to avoid at all costs.

Dead ends

Nodes that do not have outgoing links are called dead ends. The problem with such nodes is that, because of them, the total importance leaks out of the network. Here is an example:

Dead-end problem. At moment t = 2, the importance leaks out. At moment t = 3, the rank vector converges.

Spider trap

A group of nodes forms a spider trap if they do not have outgoing links to nodes outside of this group. Basically, once there, it is impossible to get out of this group of nodes. Spider traps lead to the two following problems:

  • The algorithm never converges.
  • The group of nodes forming a spider trap absorbs all of the graph's importance. As a result, these nodes have very high importance while all other nodes end up with an importance equal to 0.

The first problem is illustrated in the figure below:

Spider trap problem. Starting from moment t = 0, ranks of 1 and 0 infinitely alternate between the two nodes. As a result, the algorithm never converges.

The absorption of importance is demonstrated in the next figure. Though it may not seem like a significant issue in the toy example below, imagine a web network with millions of web pages where several of them form a spider trap. As a consequence, these few pages would share all of the available importance while the importance of all other web pages would be set to 0. Obviously, this is not what we normally want in real life.

Nodes b and d form a spider trap. As a result, at moment t = 18 they have already absorbed all of the importance while other nodes are left with zero importance. Starting from this moment, the importance alternates between nodes b and d, making the algorithm divergent.

Teleports

One of the solutions proposed by Google is to add the following rule before each move of the surfer:

  • With probability β, move to another linked node.
  • With probability (1 - β), move to a random node (through a teleport).

The parameter β is called the damping factor. The authors of the original PageRank algorithm recommend choosing β = 0.85, meaning that on average the surfer follows around 5–6 links before randomly jumping to another node. The idea is that if the surfer falls into a spider trap, then after a while it will eventually get out of there through a teleport.

The diagram below shows how teleports can help deal with the spider trap problem. If the surfer walks into node c, then it will stay there forever. Introducing teleports (blue lines) eliminates this problem, guaranteeing that after a while the surfer will have to move to another random node.

Teleports (blue lines) eliminate the spider trap problem

However, for dead-end nodes we need to slightly modify the approach. From one of the examples above, we know that dead ends lead to importance leakage in a graph. This phenomenon can be observed during the power iteration method, when the rank vector becomes filled with zeros because of a corresponding zero column in the initial matrix G. Ultimately, what we can do is the following:

Whenever the surfer lands on a dead-end node, it should immediately jump to a random node of the graph (chosen with equal probability).

In fact, we can modify the initial matrix G to satisfy this statement: we just need to replace the zeros with 1 / n in all the elements of the columns corresponding to dead-end nodes of matrix G. The example below demonstrates this principle.

Node c is a dead-end node with a corresponding column of zeros in the matrix G. Adding n = 3 teleports from c to all the nodes of the graph imposes an equal probability p = 1 / 3 of moving from c to any node. To account for this, we fill the column of the matrix G corresponding to node c with values of 1 / 3.

We can notice that after adding teleports the sum of every matrix column is now equal to 1. In other words, the matrix G becomes stochastic. This is an essential property which will be used later.
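
A minimal sketch of this fix (NumPy; the toy matrix is an assumption, with the third node playing the role of the dead-end node c): columns that sum to 0 are replaced with 1 / n.

```python
import numpy as np

# The third node is a dead end: its column contains only zeros.
G = np.array([[0.0, 0.0, 0.0],
              [0.5, 0.0, 0.0],
              [0.5, 1.0, 0.0]])

n = G.shape[0]
dead_ends = (G.sum(axis=0) == 0)   # boolean mask of zero columns (dead-end nodes)
G[:, dead_ends] = 1.0 / n          # teleport from every dead end to all nodes

print(G.sum(axis=0))               # [1. 1. 1.] -> G is now stochastic
```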

Convergence condition

There exists an important theorem from the theory of Markov chains that defines the convergence condition.

For any start vector, the transition matrix G converges to a unique positive stationary distribution vector r if the chain corresponding to G is stochastic, aperiodic and irreducible.

Let us recall these three properties from the theorem and check whether the introduced teleports solve the problems above.

A chain G is called stochastic if each of its columns sums to 1.

As we observed above, adding teleports from dead-end nodes eliminates zero columns in the matrix and makes the sum of every column equal to 1. This condition is therefore satisfied.

A chain G is called periodic if there exists a number k > 1 such that the path length between any pair of nodes is always a multiple of k. Otherwise, the chain is called aperiodic.

This condition means that any return to the same state must occur after a multiple of k steps. In the case of aperiodicity, returns occur at irregular times. Basically, this condition relates to the spider trap problem. Since we have already dealt with spider traps by adding teleports, the chain G is aperiodic.

A chain G is called irreducible if the probability of transitioning from any node to any other node is always greater than 0.

This condition implies that there always exists a link between any two nodes, so it is impossible to get stuck at any node. In other words, the matrix G must consist of all non-zero elements. We will see in the next section how this condition can be satisfied by connecting all the nodes of the graph.

Modifying the algorithm

The PageRank algorithm proposed by Google takes the initial matrix G and adjusts it by adding teleports from dead ends to other nodes. This ensures stochasticity. To guarantee aperiodicity and irreducibility, it then adds the rule described before to every node:

  • With probability β, move to another linked node.
  • With probability (1 - β), move to a random node.

Mathematically, this results in the following rank equation for every node:

Vector equation of PageRank

We can transform this equation into matrix form:

Matrix equation of PageRank from Google
The matrix R must satisfy the necessary conditions for the existence of the unique stationary distribution r which needs to be found.
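
Written out in the notation above (a reconstruction of the formula shown in the figure), the matrix equation is r = R · r, where R = β · G + (1 - β) · [1/n]ₙₓₙ and [1/n]ₙₓₙ denotes the n × n matrix with every element equal to 1 / n.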

Let us draw the modified graph and the corresponding transition matrix R from one of the examples above:

Matrix R composed from the original link matrix G and the teleport matrix. In this example, β = 0.9.

Increasing efficiency

The only problem left is how to store the transition matrix R. Remember that R is a square matrix of size n x n, where n is the number of web pages. Currently, Google indexes more than 25 billion web pages! The matrix R does not have any zeros, so it is dense, which means we have to store it fully. Let us assume that every matrix element requires 4 bytes of storage. The total memory required to store the matrix R then equals (25 * 10⁹)² * 4 bytes ≈ 2.5 * 10²¹ bytes. This is a gigantic amount of memory! We need to come up with another approach that reduces it by at least several orders of magnitude.

First, we can simply notice that adding teleports is equivalent to scaling the elements of the initial matrix G by β and distributing the remaining (1 - β) of the importance evenly across all nodes. Keeping this in mind, we can transform the matrix equation of PageRank into another form:

Transforming PageRank equation

Let us look at the last obtained equation. G is the initial link matrix, with most of its elements equal to 0. Why is that? In reality, if you take any web page, it will probably contain at most a few dozen links to other web pages. Keeping in mind that there are more than 25 billion web pages, the total number of links is incredibly small compared to the number of matrix entries. Therefore, there are a lot of zeros in G, so G is sparse.

Storing sparse matrices requires much less memory than dense ones. Let us assume that every web page links on average to 40 other pages. Then the matrix G contains only 25 * 10⁹ * 40 = 10¹² non-zero elements, which is on the order of 1 TB of storage. It turns out we only need about a terabyte to store G. Compared to what we had previously, this is a fabulous improvement!

In fact, at each iteration we only need to compute the multiplication of matrix G by vector r, multiply the result by β and add the constant (1 - β) / n to every element of the resulting vector.

Resulting PageRank equation

Complete algorithm

Also keep in mind that if the initial matrix G contains dead-end nodes, then the sum of the components of vector r at each iteration will be lower than 1. To deal with this, it is enough to renormalise the vector so that all of its components sum up to 1.

In the figure below we can see the full version of the PageRank algorithm. At each iteration, the update of ranks proceeds in two stages. The first stage includes only the update according to the initial matrix G. Then we sum up the components of the rank vector into the variable s. This way, the value of (1 - s) is the total amount by which the rank was reduced during the first stage. To compensate for this, in the second stage we account for teleports and add (1 - s) / n to the rank of every node.

Complete PageRank algorithm
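
The sketch below is a minimal Python version of this two-stage update using a sparse matrix (SciPy); the toy graph, β, ε and the function name are assumptions, not the exact code behind the figure.

```python
import numpy as np
from scipy.sparse import csr_matrix

def pagerank(G, beta=0.85, eps=1e-6, max_iter=100):
    """G is a sparse column-stochastic link matrix that may contain zero columns (dead ends)."""
    n = G.shape[0]
    r = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        # Stage 1: propagate rank along the real links only, scaled by beta.
        r_next = beta * (G @ r)
        # Stage 2: the rank lost to teleports and dead ends (1 - s) is spread evenly.
        s = r_next.sum()
        r_next += (1.0 - s) / n
        if np.abs(r_next - r).sum() < eps:
            return r_next
        r = r_next
    return r

# Toy example: the third node is a dead end (zero column in G).
G = csr_matrix(np.array([[0.0, 0.0, 0.0],
                         [0.5, 0.0, 0.0],
                         [0.5, 1.0, 0.0]]))
print(pagerank(G))   # the resulting ranks sum up to 1
```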

Conclusion

In this article, we have looked through different formulations of the PageRank algorithm to ultimately come up with its optimised version. Despite the existence and evolution of other methods for ranking search results, PageRank remains one of the most efficient ranking algorithms and works under the hood of Google's search engine.

References

The logical structure of this article is based on the lecture on large graphs from Stanford University.

All images unless otherwise noted are by the author
