[Study Notes] PageRank Algorithm

Reference: Google’s PageRank algorithm that changed the world

The PageRank algorithm is used to calculate node importance

Thought

A web page is more important if it has more in-links (pages that reference it).
A link from an important website counts for more than a link from an ordinary website.
So to decide whether a website is important, you need to know whether the websites that reference it are important. This makes the problem recursive.

Five perspectives for understanding PageRank

Iteratively solve a system of linear equations

example

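It looks like there are three equations in three unknowns, but only two of the equations are independent, so a normalization constraint (the rank values sum to 1) is needed for a unique solution.
Gaussian elimination can solve the system, but it scales poorly to large graphs.
The rank value $r_j$ of node $j$ is computed by taking every node $i$ that links to $j$, dividing its rank value $r_i$ by its out-degree $d_i$, and summing: $r_j = \sum_{i \to j} r_i / d_i$.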

Iterative solution

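A minimal sketch of the iterative solution, assuming a hypothetical three-node graph (y, a, m) that is not taken from the original figures. Each node splits its current rank value evenly across its out-links, and the shares arriving at a node are summed to give its new rank value. The function name `flow_iteration` and the iteration count are illustrative choices.

```python
# Hypothetical 3-node graph (an assumption for illustration):
# each key links to the nodes in its list.
out_links = {"y": ["y", "a"], "a": ["y", "m"], "m": ["a"]}


def flow_iteration(out_links, num_iters=100):
    nodes = list(out_links)
    # Start from a uniform rank vector that sums to 1.
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(num_iters):
        new_rank = {n: 0.0 for n in nodes}
        for src, targets in out_links.items():
            share = rank[src] / len(targets)  # rank of src divided by its out-degree
            for dst in targets:
                new_rank[dst] += share        # sum the shares arriving at dst
        rank = new_rank
    return rank


ranks = flow_iteration(out_links)
print(ranks)
```

For this particular graph the values settle near y = 0.4, a = 0.4, m = 0.2, which also satisfies the flow equations when solved by hand.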

Iterative left-multiplication by the matrix M

The iterative process can be written in matrix form. (A non-zero entry $M_{ij}$ in row $i$, column $j$ of the matrix on the left indicates a directed edge from node $j$ to node $i$.)

The matrix on the left is called the column stochastic matrix (also known as the column transition matrix or column probability matrix): each of its columns sums to 1.
The vector on the right is called the PageRank vector.
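
A minimal sketch of the matrix form, assuming NumPy and a hypothetical three-node graph with node order [y, a, m] (the edge list is an illustration, not the graph from the original figures). It builds the column stochastic matrix from an edge list and repeatedly left-multiplies the rank vector by it. The sketch assumes every node has at least one out-link, otherwise the column normalization would divide by zero.

```python
import numpy as np

# Hypothetical edges as (source, target) pairs, node order [y, a, m].
edges = [(0, 0), (0, 1), (1, 0), (1, 2), (2, 1)]
n = 3

# adjacency[i, j] = 1 means there is a directed edge from node j to node i.
adjacency = np.zeros((n, n))
for src, dst in edges:
    adjacency[dst, src] = 1.0

# Divide each column by its out-degree so that every column sums to 1.
M = adjacency / adjacency.sum(axis=0)

# Start from the uniform vector and iterate r <- M r.
r = np.full(n, 1.0 / n)
for _ in range(100):
    r = M @ r

print(M)
print(r)
```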

Eigenvectors of a matrix

Iteration formula:
$r = M \cdot r$ can be regarded as
$1 \cdot r = M \cdot r$.
From this perspective, the PageRank vector is the eigenvector of the matrix $M$ with eigenvalue 1.

For a column stochastic matrix, by the Perron-Frobenius theorem the largest eigenvalue is 1 and, when the matrix is irreducible, there is a unique principal eigenvector (the eigenvector corresponding to eigenvalue 1), which can be scaled so that its elements sum to 1.
Power iteration can solve for the PageRank vector quickly.
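
A minimal sketch comparing the two routes to the same vector, assuming NumPy and a hypothetical 3x3 column stochastic matrix chosen for illustration. `np.linalg.eig` gives the eigenvector for the largest eigenvalue directly, while `power_iteration` (a name introduced here) reaches it by repeated multiplication.

```python
import numpy as np

# Hypothetical column stochastic matrix (each column sums to 1).
M = np.array([[0.5, 0.5, 0.0],
              [0.5, 0.0, 1.0],
              [0.0, 0.5, 0.0]])

# Route 1: eigendecomposition, then pick the eigenvector of the largest eigenvalue.
eigenvalues, eigenvectors = np.linalg.eig(M)
k = np.argmax(eigenvalues.real)
principal = eigenvectors[:, k].real
principal = principal / principal.sum()  # rescale so the elements sum to 1


# Route 2: power iteration, stopping when the L1 change is below tol.
def power_iteration(M, tol=1e-10, max_iters=1000):
    r = np.full(M.shape[0], 1.0 / M.shape[0])
    for _ in range(max_iters):
        r_next = M @ r
        if np.abs(r_next - r).sum() < tol:
            return r_next
        r = r_next
    return r


print(eigenvalues.real[k], principal, power_iteration(M))
```

On large graphs the full eigendecomposition is impractical, which is why power iteration on a sparse matrix is the usual choice.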

random walk

Random walk -> count the visits to each node -> normalize the counts into probabilities. The result is the PageRank vector.
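
A minimal sketch of the random-walk view, assuming a hypothetical three-node graph and a fixed seed so the run is repeatable. The walker follows a uniformly random out-link at each step, the visits are counted, and the counts are divided by the number of steps. The function name `random_walk_ranks` and the step count are illustrative.

```python
import random
from collections import Counter

# Hypothetical graph: each key links to the nodes in its list.
out_links = {"y": ["y", "a"], "a": ["y", "m"], "m": ["a"]}


def random_walk_ranks(out_links, num_steps=200_000, seed=0):
    rng = random.Random(seed)
    current = rng.choice(list(out_links))  # arbitrary starting node
    visits = Counter()
    for _ in range(num_steps):
        current = rng.choice(out_links[current])  # follow a random out-link
        visits[current] += 1                      # count the visit
    # Normalize the visit counts into probabilities.
    return {node: visits[node] / num_steps for node in out_links}


estimate = random_walk_ranks(out_links)
print(estimate)
```

The estimate is noisy, but for this graph it should land close to the values produced by solving the flow equations.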

Markov chain

Solve for pagerank

Convergence analysis

1. Does it converge? It converges, and to the same result

Ergodic Theorem

According to the ergodic theorem, for an irreducible and aperiodic Markov chain:
1. There is a unique stationary distribution.
2. Every initial distribution converges to that same stationary distribution.
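
A minimal sketch of the second point, assuming NumPy and a hypothetical three-state chain that is irreducible (every state can reach every other) and aperiodic (the first state has a self-loop). Four different starting distributions are iterated and end up at the same vector. The helper name `iterate` is illustrative.

```python
import numpy as np

# Hypothetical irreducible, aperiodic chain in column stochastic form.
M = np.array([[0.5, 0.5, 0.0],
              [0.5, 0.0, 1.0],
              [0.0, 0.5, 0.0]])


def iterate(M, r0, num_iters=200):
    r = np.asarray(r0, dtype=float)
    for _ in range(num_iters):
        r = M @ r
    return r


# Three "all mass on one state" starts plus one arbitrary distribution.
starts = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [0.2, 0.3, 0.5]]
results = [iterate(M, r0) for r0 in starts]
for r in results:
    print(r)
```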

Reducible Markov chain and irreducible Markov chain

Reducible means that some states cannot be reached from others (for example, isolated states); irreducible means that every state is reachable from every other state.

Periodic Markov chain and aperiodic Markov chain

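A minimal sketch of why periodicity matters, assuming NumPy and a hypothetical two-node cycle a -> b -> a (period 2). Starting with all mass on one node, the distribution keeps flipping between the two nodes instead of settling.

```python
import numpy as np

# Hypothetical 2-node cycle a -> b -> a, in column stochastic form.
M_periodic = np.array([[0.0, 1.0],
                       [1.0, 0.0]])

r = np.array([1.0, 0.0])  # all mass starts on node a
history = []
for _ in range(6):
    r = M_periodic @ r
    history.append(r.tolist())

print(history)  # alternates between [0.0, 1.0] and [1.0, 0.0]
```

The uniform start [0.5, 0.5] happens to be a fixed point of this chain, so the failure to converge only shows up for other initial distributions.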

2. Does the result represent importance? Two types of problems

Spider trap problem

All out-links of a group of nodes stay within the group, so the group ends up absorbing all the importance.
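
A minimal sketch of a spider trap, assuming NumPy and a hypothetical three-node graph in which node m links only to itself. Plain iteration of r = M r drains the rank of the other nodes into m.

```python
import numpy as np

# Hypothetical graph, node order [y, a, m]: y -> y, a; a -> y, m; m -> m.
# Node m only links to itself, so it acts as a spider trap.
M_trap = np.array([[0.5, 0.5, 0.0],
                   [0.5, 0.0, 0.0],
                   [0.0, 0.5, 1.0]])

r = np.full(3, 1.0 / 3)
for _ in range(200):
    r = M_trap @ r

print(r)  # nearly all of the rank ends up on m
```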

dead end problem

A node with no out-links passes its importance to nobody, so the importance leaks out of the graph and ultimately drops to 0.
In both cases, even if the iteration converges, the result is not a reasonable measure of importance in the network.
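
A minimal sketch of a dead end, assuming NumPy and a hypothetical three-node graph in which node m has no out-links. Its column in the matrix is all zeros, so the matrix is no longer column stochastic and the total rank shrinks at every step.

```python
import numpy as np

# Hypothetical graph, node order [y, a, m]: y -> y, a; a -> y, m; m has no out-links.
M_dead = np.array([[0.5, 0.5, 0.0],
                   [0.5, 0.0, 0.0],
                   [0.0, 0.5, 0.0]])

r = np.full(3, 1.0 / 3)
for _ in range(200):
    r = M_dead @ r

total = r.sum()
print(r, total)  # every entry, and therefore the total, decays towards 0
```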

example

Solution

Solution to spider trap problem

Solution to dead end

final solution

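A minimal sketch of one common way to handle both problems at once, assuming the widely used random-teleport formulation: with probability β the walker follows an out-link, with probability 1 - β it jumps to a uniformly random node, and a dead end is treated as always teleporting. The function name `pagerank`, the value β = 0.85, and the two example matrices are assumptions for illustration and are not taken from the original figures.

```python
import numpy as np


def pagerank(M, beta=0.85, tol=1e-10, max_iters=1000):
    """Power iteration with random teleports; M[i, j] is the weight of edge j -> i."""
    n = M.shape[0]
    M = M.astype(float)  # astype returns a copy, so the caller's matrix is untouched
    # Dead ends: replace an all-zero column with a uniform column (always teleport).
    dead_ends = M.sum(axis=0) == 0
    M[:, dead_ends] = 1.0 / n
    r = np.full(n, 1.0 / n)
    for _ in range(max_iters):
        # Follow a link with probability beta, teleport uniformly otherwise.
        r_next = beta * (M @ r) + (1.0 - beta) / n
        if np.abs(r_next - r).sum() < tol:
            return r_next
        r = r_next
    return r


# Hypothetical graphs, node order [y, a, m].
spider_trap = np.array([[0.5, 0.5, 0.0],
                        [0.5, 0.0, 0.0],
                        [0.0, 0.5, 1.0]])
dead_end = np.array([[0.5, 0.5, 0.0],
                     [0.5, 0.0, 0.0],
                     [0.0, 0.5, 0.0]])

r_trap = pagerank(spider_trap)
r_dead = pagerank(dead_end)
print(r_trap, r_dead)
```

With teleports, the trap node still ranks highest in the first graph but no longer absorbs everything, and the second graph keeps a total rank of 1 instead of leaking to 0.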

Pagerank upgrade-mapreduce work

Using the PageRank algorithm to calculate node similarity (used in recommendation systems)

Given: a bipartite graph representing the interactions between users and products.
Goal: find the nodes most similar to a specified node.
Assumption: nodes visited by the same users are more likely to be similar.

PageRank variants inspired by the random walk perspective

PageRank: a random walk that, with some probability, teleports to a random node anywhere in the network and then continues walking.
Topic-Specific PageRank (also called Personalized PageRank): a random walk that, with some probability, teleports to one of a specified set of nodes and then continues walking.
Random walk with restarts: a random walk that, with some probability, teleports back to a single specified node and then continues walking.
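
A minimal sketch showing that the three variants differ only in where the walker may teleport, assuming NumPy, a hypothetical 3x3 column stochastic matrix, and a follow-link probability β = 0.85. The function name `pagerank_with_teleport` and its parameters are introduced here for illustration.

```python
import numpy as np


def pagerank_with_teleport(M, teleport_nodes, beta=0.85, num_iters=200):
    n = M.shape[0]
    teleport_nodes = list(teleport_nodes)
    # Teleport distribution: uniform over teleport_nodes, zero elsewhere.
    v = np.zeros(n)
    v[teleport_nodes] = 1.0 / len(teleport_nodes)
    r = v.copy()
    for _ in range(num_iters):
        # Follow a link with probability beta, otherwise teleport according to v.
        r = beta * (M @ r) + (1.0 - beta) * v
    return r


# Hypothetical column stochastic matrix, node order [y, a, m].
M = np.array([[0.5, 0.5, 0.0],
              [0.5, 0.0, 1.0],
              [0.0, 0.5, 0.0]])

standard = pagerank_with_teleport(M, teleport_nodes=range(3))  # any node
personalized = pagerank_with_teleport(M, teleport_nodes=[0, 2])  # a chosen set
restart = pagerank_with_teleport(M, teleport_nodes=[2])          # a single node
print(standard, personalized, restart)
```

Restarting only at node m pulls rank towards m and its neighborhood, which is what makes the result usable as a proximity score relative to m.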

Number of random walk visits - a measure of similarity

Given a node set query_nodes, simulate a random walk:

  • Record visits
  • With probability $\alpha$, restart the walk at a node in query_nodes
  • Nodes with high visit counts are more similar to the nodes in query_nodes.
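
The steps above can be sketched as follows, assuming a small hypothetical user-item bipartite graph, a restart probability α = 0.5, and a fixed seed. The walk alternates item -> user -> item, counts the items it lands on, and restarts at a query node with probability α. The names `user_items`, `item_users`, and `visit_counts` are introduced here for illustration.

```python
import random
from collections import Counter

# Hypothetical interactions: which items each user has visited.
user_items = {
    "u1": ["i1", "i2"],
    "u2": ["i1", "i2", "i3"],
    "u3": ["i3", "i4"],
}
# Reverse index: which users visited each item.
item_users = {}
for user, items in user_items.items():
    for item in items:
        item_users.setdefault(item, []).append(user)


def visit_counts(query_nodes, alpha=0.5, num_steps=100_000, seed=0):
    rng = random.Random(seed)
    visits = Counter()
    item = rng.choice(query_nodes)
    for _ in range(num_steps):
        user = rng.choice(item_users[item])  # item -> random user who visited it
        item = rng.choice(user_items[user])  # user -> random item they visited
        visits[item] += 1                    # record the visit
        if rng.random() < alpha:             # restart with probability alpha
            item = rng.choice(query_nodes)
    return visits


counts = visit_counts(["i1"])
print(counts.most_common())
```

In this toy graph i2 shares both of its users with the query item i1, so it collects more visits than i3, and i4 (two hops further away) collects the fewest.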

pseudocode

advantage

Code practice

Reference: https://www.bilibili.com/video/BV1Wg411H7Ep/?p=16&spm_id_from=pageDriver

Origin blog.csdn.net/zhangyifeng_1995/article/details/132802525