The Beauty of Mathematics - Chapter 9 Personal Notes

Chapter 9 Graph Theory and Web Crawlers

1 Graph theory

The Seven Bridges of Königsberg problem: can one walk across each bridge exactly once and return to the original starting point?

 

One of the most important classes of graph algorithms is traversal: how to visit every node of the graph by following its arcs.

Breadth-First Search (BFS): starting from a node, first visit all the nodes directly connected to it, spreading out as 'widely' as possible.

Depth-First Search (DFS): follow one road all the way to the end before backtracking = =!
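
A minimal sketch of both traversals in Python, on a toy adjacency-list graph (the graph and node names are made up for illustration):

```python
from collections import deque

# A toy undirected graph as an adjacency list (illustrative only).
graph = {
    "A": ["B", "C"],
    "B": ["A", "D"],
    "C": ["A", "D"],
    "D": ["B", "C"],
}

def bfs(start):
    """Visit nodes level by level, 'as widely as possible'."""
    visited, order = {start}, []
    queue = deque([start])
    while queue:
        node = queue.popleft()
        order.append(node)
        for neighbor in graph[node]:
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append(neighbor)
    return order

def dfs(start):
    """Follow one path to the end before backtracking."""
    visited, order = set(), []
    stack = [start]
    while stack:
        node = stack.pop()
        if node in visited:
            continue
        visited.add(node)
        order.append(node)
        # Push neighbors so the last one pushed is explored first.
        stack.extend(reversed(graph[node]))
    return order

print(bfs("A"))  # ['A', 'B', 'C', 'D']
print(dfs("A"))  # ['A', 'B', 'D', 'C']
```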

 

2 Web crawler

Think of the Internet as a large graph: each web page is a node, and the hyperlinks between pages are the arcs connecting them.

In a web crawler, a hash table is used instead of a notepad to record whether a web page has already been downloaded.
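
A minimal sketch of that idea, using a Python set as the 'hash table' and a FIFO queue for the frontier; fetch_page and extract_links are hypothetical placeholders supplied by the caller, not real downloading code:

```python
from collections import deque

def crawl(seed_url, fetch_page, extract_links, max_pages=1000):
    """Toy BFS crawler: a set plays the role of the hash table
    that records which URLs have already been downloaded."""
    downloaded = set()            # the 'hash table' / notepad
    frontier = deque([seed_url])  # URLs waiting to be fetched
    while frontier and len(downloaded) < max_pages:
        url = frontier.popleft()
        if url in downloaded:     # O(1) membership test
            continue
        page = fetch_page(url)    # download the page (placeholder)
        downloaded.add(url)
        for link in extract_links(page):  # hyperlinks = arcs
            if link not in downloaded:
                frontier.append(link)
    return downloaded
```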

 

3 Further reading: Two additional explanations of graph theory

3.1 Proof of Euler's Seven Bridges Problem
Take each connected piece of land as a vertex and each bridge as an edge; the seven bridges are then abstracted into a graph on four vertices.

For each vertex in the graph, the number of edges connected to it is called its degree.

Theorem: If one can start from some vertex of a graph, traverse each edge exactly once, and return to that vertex, then the degree of every vertex must be even.

Proof: Suppose such a traversal exists. Each time the walk passes through a vertex, it enters along one edge and leaves along another, so the number of times it enters a vertex equals the number of times it leaves. Since every edge is used exactly once, the edges at each vertex pair up into entering and leaving edges, and therefore the degree of every vertex is even.
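
To make the degree argument concrete, here is a small check on the Königsberg graph itself: vertices A–D stand for the four pieces of land (A is the central island), and the edge list encodes the seven bridges.

```python
from collections import Counter

# The seven bridges of Königsberg as an edge list between
# the four land masses A, B, C, D (A is the central island).
bridges = [
    ("A", "B"), ("A", "B"),
    ("A", "C"), ("A", "C"),
    ("A", "D"),
    ("B", "D"),
    ("C", "D"),
]

# Degree of a vertex = number of edge endpoints touching it.
degree = Counter()
for u, v in bridges:
    degree[u] += 1
    degree[v] += 1

print(dict(degree))  # {'A': 5, 'B': 3, 'C': 3, 'D': 3}

# By the theorem, a closed walk that uses every bridge exactly once
# requires every degree to be even -- here all four are odd,
# so no such walk exists.
print(all(d % 2 == 0 for d in degree.values()))  # False
```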

 

3.2 Engineering points for building a web crawler

"How to build a crawler"? (My first reaction is Kuah = =! I am a salted fish)

① First of all: should the crawler use BFS or DFS?

The order in which a web crawler traverses pages is neither pure BFS nor pure DFS, but a more complex scheme that sorts pages by download priority. In practice, crawlers lean closer to BFS than to DFS.
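
One way to picture 'download priority sorting' is to replace BFS's plain FIFO queue with a priority queue; the score function below is a hypothetical ranking (for example, by how important a site is), not the actual scheme of any real crawler:

```python
import heapq

def crawl_by_priority(seed_urls, score, fetch_page, extract_links, max_pages=1000):
    """Frontier ordered by a priority score instead of strict BFS or DFS.
    `score(url)` is a hypothetical ranking function supplied by the caller."""
    downloaded = set()
    # heapq is a min-heap, so negate the score for 'highest priority first'.
    frontier = [(-score(u), u) for u in seed_urls]
    heapq.heapify(frontier)
    while frontier and len(downloaded) < max_pages:
        _, url = heapq.heappop(frontier)
        if url in downloaded:
            continue
        page = fetch_page(url)
        downloaded.add(url)
        for link in extract_links(page):
            if link not in downloaded:
                heapq.heappush(frontier, (-score(link), link))
    return downloaded
```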

② Second, parsing the page and extracting its URLs.

Extracting URLs is much more complicated than it was in the early days of the web, when pages were plain HTML; many links are now generated dynamically by scripts, so simple text parsing is no longer enough.
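
As a rough sketch, link extraction from static HTML can be done with Python's standard-library parser; pages whose links are generated by scripts would need a full browser engine, which this does not attempt:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collect href attributes from <a> tags in static HTML."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative links against the page URL.
                    self.links.append(urljoin(self.base_url, value))

html = '<a href="/about.html">About</a> <a href="https://example.com/x">X</a>'
parser = LinkExtractor("https://example.com/index.html")
parser.feed(html)
print(parser.links)  # ['https://example.com/about.html', 'https://example.com/x']
```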

③ Third, the 'small notebook' that records which web pages have been downloaded: the URL table.

To prevent a web page from being downloaded multiple times, a hash table records which pages have already been downloaded. The difficulty: building and maintaining such a hash table across thousands of servers is not easy. First, the table is too large to store on a single server. Second, every download server must consult and update the table before starting a download and after finishing one (so that different servers do not repeat each other's work), so communication with the servers that store the hash table becomes the bottleneck of the whole crawler system. How can this be solved?

A good approach: give each server a clear division of labor. On that basis, checks for whether a URL has been downloaded can be batched, for example by sending a large batch of queries to the hash table (itself a group of independent servers) at a time, or by updating a large batch of entries at a time.
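
A rough sketch of that division of labor: a URL's hash decides which server is responsible for it, and 'already downloaded?' checks are grouped into one bulk query per server. The cluster size and helper names below are illustrative assumptions:

```python
import hashlib
from collections import defaultdict

NUM_SERVERS = 1000  # illustrative cluster size

def server_for(url):
    """Map a URL to the server responsible for tracking it."""
    digest = hashlib.md5(url.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_SERVERS

def batch_queries(urls):
    """Group a batch of URLs by responsible server, so each server
    receives one bulk 'already downloaded?' query instead of many
    small ones -- this keeps communication from becoming the bottleneck."""
    batches = defaultdict(list)
    for url in urls:
        batches[server_for(url)].append(url)
    return batches

batches = batch_queries([
    "https://example.com/a",
    "https://example.org/b",
    "https://example.net/c",
])
for server_id, urls in batches.items():
    print(server_id, urls)
```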
