Reference: Google’s PageRank algorithm that changed the world
The pagerank algorithm is used to calculate node importance
Core idea
A web page is more important if it has more in-links (that is, more pages reference it).
A citation from an important website counts for more than a citation from an ordinary website.
So to judge whether a website is important, you need to look at whether the websites that reference it are important. This makes the definition recursive.
Understand the five perspectives of pagerank
Iteratively solve a system of linear equations
example
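It looks like there are three equations and three unknowns, but in fact only two of the equations are independent, so a normalization constraint (the rank values sum to 1) is needed to pin down a unique solution.
Gaussian elimination can solve this system, but it scales poorly.
The rank value $r_j$ of node $j$ is computed by taking every node $i$ that links to $j$, dividing its rank value $r_i$ by its out-degree $d_i$, and summing the results: $r_j = \sum_{i \to j} \frac{r_i}{d_i}$.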
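The flow rule above can be sketched in a few lines of Python. This is a minimal illustration, not the original example: the three-node graph (`y`, `a`, `m`) and the helper name `flow_step` are assumptions made here, and the iteration count is arbitrary.

```python
# Hypothetical 3-node graph (an assumption for illustration).
# out_links[i] lists the nodes that node i points to.
out_links = {"y": ["y", "a"], "a": ["y", "m"], "m": ["a"]}


def flow_step(rank, out_links):
    """Apply r_j = sum over i -> j of r_i / d_i once."""
    new_rank = {node: 0.0 for node in out_links}
    for i, targets in out_links.items():
        share = rank[i] / len(targets)  # node i splits its rank evenly
        for j in targets:
            new_rank[j] += share
    return new_rank


# Start from a uniform guess and repeat the update.
rank = {node: 1.0 / len(out_links) for node in out_links}
for _ in range(100):
    rank = flow_step(rank, out_links)

print(rank)  # approaches y = 0.4, a = 0.4, m = 0.2
```

Because each node hands out exactly its own rank, the total stays at 1 throughout, which plays the role of the normalization constraint.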
Iterative solution
Iteratively left-multiply by the matrix M
The iterative process can be written in matrix form. (The entry $A_{ij}$ in row $i$ and column $j$ of the left matrix is non-zero when there is a directed edge from the j-th node to the i-th node.)
The matrix on the left is called the column stochastic matrix (also translated as column probability matrix or column transition matrix); when every node has at least one out-link, each of its columns sums to 1.
The vector on the right is called the pagerank vector.
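As a sketch of the matrix form, the snippet below builds a column stochastic matrix from an edge list and applies one left-multiplication. The edge list, node numbering, and the helper names `column_stochastic` and `mat_vec` are assumptions introduced for illustration.

```python
def column_stochastic(edges, n):
    """Build M where M[i][j] = 1 / out_degree(j) if there is an edge j -> i."""
    out_degree = [0] * n
    for src, _ in edges:
        out_degree[src] += 1
    M = [[0.0] * n for _ in range(n)]
    for src, dst in edges:
        M[dst][src] += 1.0 / out_degree[src]
    return M


def mat_vec(M, r):
    """Left-multiply the vector r by the matrix M."""
    return [sum(M[i][j] * r[j] for j in range(len(r))) for i in range(len(M))]


# Hypothetical graph: 0 -> 0, 0 -> 1, 1 -> 0, 1 -> 2, 2 -> 1
edges = [(0, 0), (0, 1), (1, 0), (1, 2), (2, 1)]
M = column_stochastic(edges, 3)

r = [1 / 3, 1 / 3, 1 / 3]
r_next = mat_vec(M, r)
print(r_next)  # approximately [0.333, 0.5, 0.167]
```

Storing the source node in the column index is what makes every column sum to 1, so multiplying by `M` redistributes rank without creating or destroying it.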
Eigenvectors of a matrix
Iteration formula:
$r = M \cdot r$ can actually be regarded as $1 \cdot r = M \cdot r$.
From this perspective, the pagerank vector is the eigenvector of the matrix $M$ with an eigenvalue of 1.
For a column stochastic matrix, according to the Perron-Frobenius theorem, the largest eigenvalue is 1; when the graph is irreducible, there is a unique principal eigenvector (the eigenvector corresponding to the eigenvalue 1), which can be normalized so that its elements sum to 1.
The pagerank vector can be computed quickly by power iteration.
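A minimal power-iteration sketch follows. The matrix `M` is a hypothetical column stochastic matrix for a three-node graph, and the function name `power_iteration`, the tolerance, and the iteration cap are assumptions chosen for illustration.

```python
def power_iteration(M, tol=1e-10, max_iter=1000):
    """Repeat r <- M r until the L1 change between iterations drops below tol."""
    n = len(M)
    r = [1.0 / n] * n
    for _ in range(max_iter):
        r_new = [sum(M[i][j] * r[j] for j in range(n)) for i in range(n)]
        if sum(abs(a - b) for a, b in zip(r_new, r)) < tol:
            return r_new
        r = r_new
    return r


# Hypothetical column stochastic matrix (each column sums to 1).
M = [
    [0.5, 0.5, 0.0],
    [0.5, 0.0, 1.0],
    [0.0, 0.5, 0.0],
]

r = power_iteration(M)

# Check the eigenvector property M r = 1 * r.
Mr = [sum(M[i][j] * r[j] for j in range(3)) for i in range(3)]
residual = max(abs(a - b) for a, b in zip(Mr, r))
print(r, residual)  # r is close to [0.4, 0.4, 0.2]
```

The residual being near zero is the eigenvalue-1 statement in numerical form: multiplying by `M` leaves the vector unchanged.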
random walk
Random walk -> count the visits to each node -> normalize the counts to probabilities; the result is the pagerank vector.
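The random-walk view can be simulated directly. The sketch below assumes a hypothetical three-node graph, a fixed seed, and an arbitrary step count; `random_walk_rank` is a name introduced here.

```python
import random

# Hypothetical graph: out_links[i] lists the nodes that node i points to.
out_links = {"y": ["y", "a"], "a": ["y", "m"], "m": ["a"]}


def random_walk_rank(out_links, steps=100_000, seed=0):
    """Walk along random out-links, count visits, normalize to probabilities."""
    rng = random.Random(seed)
    visits = {node: 0 for node in out_links}
    current = next(iter(out_links))  # arbitrary starting node
    for _ in range(steps):
        current = rng.choice(out_links[current])
        visits[current] += 1
    return {node: count / steps for node, count in visits.items()}


estimate = random_walk_rank(out_links)
print(estimate)
```

For this graph the visit frequencies should land near the values obtained by solving the flow equations (0.4, 0.4, 0.2), up to sampling noise.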
Markov chain
Solve for pagerank
Convergence analysis
1. Does it converge? Yes, and it converges to the same result regardless of the initial values
Ergodic Theorem
According to the Ergodic Theorem, for irreducible and aperiodic Markov chains:
1. There is a unique stationary distribution
2. All initial distributions converge to that same distribution
Reducible Markov chain and irreducible Markov chain
Reducible means that some states cannot be reached from others (for example, isolated states); irreducible means that every state is reachable from every other state.
Periodic Markov chain and non-periodic Markov chain
2. Does the result represent importance? Two types of problems
Spider trap problem
All out-links of a group of nodes stay inside the group, so the group eventually absorbs all the importance.
Dead end problem
A node has no out-links, so importance leaks out of the graph and every rank value eventually becomes 0.
In both cases, even if the iteration converges, the result is not a reasonable measure of importance in the network.
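Both failure modes are easy to reproduce numerically. The two matrices below are hypothetical three-node examples constructed for illustration (node 2 is the trap in one and the dead end in the other), and `iterate` is a helper name introduced here.

```python
def iterate(M, steps=200):
    """Apply r <- M r a fixed number of times, starting from uniform ranks."""
    n = len(M)
    r = [1.0 / n] * n
    for _ in range(steps):
        r = [sum(M[i][j] * r[j] for j in range(n)) for i in range(n)]
    return r


# Spider trap: node 2 links only to itself (column 2 is [0, 0, 1]).
M_trap = [
    [0.5, 0.5, 0.0],
    [0.5, 0.0, 0.0],
    [0.0, 0.5, 1.0],
]

# Dead end: node 2 has no out-links, so column 2 is all zeros.
M_dead = [
    [0.5, 0.5, 0.0],
    [0.5, 0.0, 0.0],
    [0.0, 0.5, 0.0],
]

r_trap = iterate(M_trap)
r_dead = iterate(M_dead)
print(r_trap)  # node 2 ends up with essentially all of the rank
print(r_dead)  # every rank value decays toward 0
```

In the dead-end case the matrix is no longer column stochastic, which is why the total rank shrinks instead of staying at 1.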
example
Solution
Solution to spider trap problem
Solution to dead end
final solution
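A commonly used remedy for both problems is random teleportation: with probability $\beta$ the walker follows an out-link, and with probability $1 - \beta$ it jumps to a node chosen uniformly at random; at a dead end it always jumps. The sketch below assumes this scheme, a damping value of `beta=0.85` (a conventional choice, not taken from the text above), and hypothetical example graphs.

```python
def pagerank(out_links, beta=0.85, iters=100):
    """PageRank with uniform teleportation; dead ends teleport with probability 1."""
    nodes = list(out_links)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iters):
        # Every node receives the teleport share (1 - beta) / n.
        new_rank = {v: (1.0 - beta) / n for v in nodes}
        for i in nodes:
            targets = out_links[i]
            if targets:
                share = beta * rank[i] / len(targets)
                for j in targets:
                    new_rank[j] += share
            else:
                # Dead end: spread this node's rank over all nodes.
                share = beta * rank[i] / n
                for j in nodes:
                    new_rank[j] += share
        rank = new_rank
    return rank


# Hypothetical graphs: "m" is a spider trap in the first, a dead end in the second.
trap_graph = {"y": ["y", "a"], "a": ["y", "m"], "m": ["m"]}
dead_graph = {"y": ["y", "a"], "a": ["y", "m"], "m": []}

r_trap = pagerank(trap_graph)
r_dead = pagerank(dead_graph)
print(r_trap)
print(r_dead)
```

With teleportation the trap node still ranks highest in the first graph but no longer absorbs everything, and in the second graph the ranks keep summing to 1 instead of leaking away.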
Pagerank upgrade-mapreduce work
The pagerank algorithm is used to calculate node similarity - used in recommendation systems
Given: A bipartite graph that represents the interactions between users and products.
Goal: Find the nodes most similar to the specified node.
Assumption: Nodes visited by the same user are more likely to be similar.
pagerank, inspired by random walk perspective
One interpretation of pagerank: a random walk that, with some probability, teleports to any node in the network and then continues walking.
Topic-Specific PageRank (also called personalized pagerank): a random walk that, with some probability, teleports to one of a specified set of nodes and then continues walking.
Random walks with restarts: a random walk that, with some probability, teleports back to a single specified node and then continues walking.
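The three variants above differ only in where the walker may teleport, which the following sketch makes explicit through a `teleport_nodes` argument. The graph, the function name `pagerank_with_teleport`, and `beta=0.85` are assumptions for illustration, and the code assumes every node has at least one out-link.

```python
def pagerank_with_teleport(out_links, teleport_nodes, beta=0.85, iters=100):
    """Power iteration where teleportation lands uniformly on teleport_nodes."""
    nodes = list(out_links)
    teleport = {v: 0.0 for v in nodes}
    for v in teleport_nodes:
        teleport[v] = 1.0 / len(teleport_nodes)
    rank = dict(teleport)  # start from the teleport distribution
    for _ in range(iters):
        new_rank = {v: (1.0 - beta) * teleport[v] for v in nodes}
        for i in nodes:
            share = beta * rank[i] / len(out_links[i])
            for j in out_links[i]:
                new_rank[j] += share
        rank = new_rank
    return rank


# Hypothetical graph in which every node has at least one out-link.
graph = {"y": ["y", "a"], "a": ["y", "m"], "m": ["a"]}

standard = pagerank_with_teleport(graph, ["y", "a", "m"])  # teleport anywhere
topic = pagerank_with_teleport(graph, ["y", "a"])          # teleport to a subset
restart = pagerank_with_teleport(graph, ["m"])             # teleport to one node
print(standard["m"], topic["m"], restart["m"])
```

Restricting the teleport set pulls rank toward the chosen nodes, which is why the score of `m` is higher under `restart` than under `standard`.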
Number of random walk visits - a measure of similarity
Given a node set query_nodes, simulate a random walk:
- Record the number of visits to each node
- With probability $\alpha$, restart the walk at a node in query_nodes
- Nodes with high visit counts are more similar to the nodes in query_nodes.
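The steps above can be sketched as a simulation on a small user-item bipartite graph. The interaction data, the function name `random_walk_with_restarts`, the value `alpha=0.5`, the step count, and the seed are all assumptions chosen for illustration.

```python
import random
from collections import Counter

# Hypothetical interactions: which items each user has visited.
user_items = {"u1": ["A", "B"], "u2": ["B", "C"], "u3": ["C", "D"]}

# Reverse index: which users have visited each item.
item_users = {}
for user, items in user_items.items():
    for item in items:
        item_users.setdefault(item, []).append(user)


def random_walk_with_restarts(query_nodes, alpha=0.5, steps=20_000, seed=0):
    """Count item visits of a walk that restarts in query_nodes with probability alpha."""
    rng = random.Random(seed)
    visits = Counter()
    item = rng.choice(query_nodes)
    for _ in range(steps):
        user = rng.choice(item_users[item])   # item -> random user
        item = rng.choice(user_items[user])   # user -> random item
        visits[item] += 1                     # record the visit
        if rng.random() < alpha:              # restart at a query node
            item = rng.choice(query_nodes)
    return visits


visits = random_walk_with_restarts(["A"])
# Rank the other items by visit count, most similar first.
ranking = [item for item, _ in visits.most_common() if item != "A"]
print(ranking)
```

In this toy data item B shares a user with A, C is two users away, and D is three users away, so the visit counts fall off in that order.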
pseudocode
advantage
Code practice
Reference: https://www.bilibili.com/video/BV1Wg411H7Ep/?p=16&spm_id_from=pageDriver