The Beauty of Mathematics Chapter 9 Graph Theory and Web Crawlers

1 Graph theory

    The origin of graph theory can be traced back to the era of the great mathematician Euler.

    A graph in graph theory consists of a set of nodes and the arcs (edges) that connect those nodes.

    Breadth-First Search (BFS)

    Depth-First Search (DFS)
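
    A minimal sketch of the two traversal orders in Python (the adjacency-list graph and the helper names bfs/dfs are made up for illustration, not taken from the book):

        from collections import deque

        def bfs(graph, start):
            # Breadth-first search: visit nodes level by level, nearest first.
            visited, order = {start}, []
            queue = deque([start])
            while queue:
                node = queue.popleft()
                order.append(node)
                for neighbor in graph.get(node, []):
                    if neighbor not in visited:
                        visited.add(neighbor)
                        queue.append(neighbor)
            return order

        def dfs(graph, start):
            # Depth-first search: follow one branch as deep as possible, then backtrack.
            visited, order = set(), []
            stack = [start]
            while stack:
                node = stack.pop()
                if node in visited:
                    continue
                visited.add(node)
                order.append(node)
                for neighbor in reversed(graph.get(node, [])):
                    if neighbor not in visited:
                        stack.append(neighbor)
            return order

        graph = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
        print(bfs(graph, "A"))  # ['A', 'B', 'C', 'D']
        print(dfs(graph, "A"))  # ['A', 'B', 'D', 'C']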

2 Web crawler

    In a web crawler, a hash table is used instead of a "notepad" to record whether or not a web page has already been downloaded.

    Today's Internet is enormous, so the download task cannot be completed by one or a few servers; a commercial web crawler requires thousands of servers connected by high-speed networks.
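
    As a single-machine illustration of how the hash table and the traversal fit together, here is a toy crawler sketch in Python (the helpers download and extract_links, the timeout, and the page limit are all assumptions made for this example):

        from collections import deque
        from urllib.request import urlopen
        from urllib.parse import urljoin
        import re

        def download(url):
            # Fetch a page; return its HTML, or an empty string on failure.
            try:
                with urlopen(url, timeout=5) as resp:
                    return resp.read().decode("utf-8", errors="ignore")
            except Exception:
                return ""

        def extract_links(base_url, html):
            # Very rough link extraction with a regular expression.
            return [urljoin(base_url, href)
                    for href in re.findall(r'href="([^"#]+)"', html)]

        def crawl(seed, max_pages=100):
            downloaded = set()     # the hash table playing the role of the notepad
            queue = deque([seed])  # FIFO queue, i.e. breadth-first traversal
            while queue and len(downloaded) < max_pages:
                url = queue.popleft()
                if url in downloaded:  # O(1) lookup: skip pages already downloaded
                    continue
                html = download(url)
                downloaded.add(url)
                for link in extract_links(url, html):
                    if link not in downloaded:
                        queue.append(link)
            return downloaded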

3 Two Supplements to Graph Theory

    3.1 Proof of Euler's Seven Bridges Problem

        For each vertex in the graph, the number of edges connected to it is defined as its degree.

        Theorem: If a graph has a route that starts from a vertex, traverses every edge exactly once, and returns to that vertex, then the degree of every vertex must be an even number.

        Proof: If it is possible to traverse every edge of the graph exactly once, then each time the route passes through a vertex it must enter along one edge and leave along another; for the starting vertex, the first departure is paired with the final return. The number of times the route enters a vertex therefore equals the number of times it leaves it, so the edges connected to each vertex come in pairs, i.e., the degree of each vertex must be an even number.
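
        The even-degree condition is easy to check mechanically. A small Python sketch (the edge list is my reconstruction of the seven-bridge multigraph and may differ in labels from the original map):

            from collections import Counter

            def all_degrees_even(edges):
                # Necessary condition for a closed walk that uses every edge once.
                degree = Counter()
                for u, v in edges:  # each edge adds 1 to the degree of both endpoints
                    degree[u] += 1
                    degree[v] += 1
                return all(d % 2 == 0 for d in degree.values()), dict(degree)

            bridges = [("north", "island"), ("north", "island"),
                       ("south", "island"), ("south", "island"),
                       ("island", "east"), ("north", "east"), ("south", "east")]

            ok, degrees = all_degrees_even(bridges)
            print(degrees)  # {'north': 3, 'island': 5, 'south': 3, 'east': 3}
            print(ok)       # False: no walk can cross every bridge exactly once and return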

    3.2 Engineering points for building a web crawler

        First, BFS or DFS?

            In practice, a web crawler does not traverse web pages in pure BFS or DFS order; it uses a relatively sophisticated scheme for sorting pages by download priority.

            The subsystem that manages this prioritization is generally called the scheduling system; whenever a page has been downloaded, it decides which page to download next.

            In a crawler, however, the BFS component dominates: the traversal order is closer to BFS than to DFS.
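
            A toy version of such a scheduler in Python, built on a priority queue (the scoring rule here is invented purely for illustration; real systems use far more elaborate priority functions):

                import heapq
                from urllib.parse import urlparse

                def priority(url):
                    # Invented rule: prefer shallow URLs (fewer path segments).
                    return len([p for p in urlparse(url).path.split("/") if p])

                class Scheduler:
                    # Decides which URL to download next, rather than plain BFS/DFS order.
                    def __init__(self):
                        self._heap = []

                    def add(self, url):
                        heapq.heappush(self._heap, (priority(url), url))

                    def next_url(self):
                        return heapq.heappop(self._heap)[1] if self._heap else None

                urls = ["http://example.com/a/b/c", "http://example.com/",
                        "http://example.com/a"]
                scheduler = Scheduler()
                for u in urls:
                    scheduler.add(u)
                print(scheduler.next_url())  # http://example.com/  (smallest depth first)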

        Second, parsing the page and extracting URLs

            If some web pages clearly exist but are not indexed by search engines, one possible reason is that the crawler's parser failed to handle the non-standard script code in those pages.
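
            A minimal sketch of the parsing step using Python's standard html.parser (it only handles ordinary <a href=...> tags; links generated by non-standard scripts, as noted above, would be missed by a parser this simple):

                from html.parser import HTMLParser
                from urllib.parse import urljoin

                class LinkExtractor(HTMLParser):
                    # Collect the href attribute of every <a> tag in a page.
                    def __init__(self, base_url):
                        super().__init__()
                        self.base_url = base_url
                        self.links = []

                    def handle_starttag(self, tag, attrs):
                        if tag == "a":
                            for name, value in attrs:
                                if name == "href" and value:
                                    self.links.append(urljoin(self.base_url, value))

                extractor = LinkExtractor("http://example.com/index.html")
                extractor.feed('<a href="/page1">One</a><a href="page2.html">Two</a>')
                print(extractor.links)
                # ['http://example.com/page1', 'http://example.com/page2.html']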

        Third, recording which pages have been downloaded in a "notepad": the URL table

            To prevent a web page from being downloaded multiple times, a hash table can be used to record which pages have already been downloaded; when the same page is encountered again, it can be skipped.

            In a distributed crawler, communication with the servers that store this hash table can become the bottleneck of the whole crawler system. Good designs generally rely on two techniques:

                First, make the division of labor among the download servers explicit, so that it is immediately clear which server is responsible for a given URL.

                Then, on the basis of this clear division of labor, the check for whether a URL has already been downloaded can be batched: for example, send a batch of queries to the hash table at once, or update the hash table with a large batch of entries at once.
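
            A minimal sketch of these two techniques in Python (the number of servers and the helper names are assumptions made for illustration): each URL is assigned to one server by its hash value, and look-ups are grouped into one bulk query per server instead of one round trip per URL.

                import hashlib

                NUM_SERVERS = 1000  # assumed number of servers holding the hash table

                def server_for(url):
                    # Division of labor: the URL's hash decides which server owns it.
                    digest = hashlib.md5(url.encode("utf-8")).hexdigest()
                    return int(digest, 16) % NUM_SERVERS

                def group_by_server(urls):
                    # Batching: group URLs so each server receives a single bulk query.
                    batches = {}
                    for url in urls:
                        batches.setdefault(server_for(url), []).append(url)
                    return batches

                urls = ["http://example.com/%d" % i for i in range(5)]
                for server_id, batch in group_by_server(urls).items():
                    # In a real crawler this would be one network request per server,
                    # asking which of these URLs have already been downloaded.
                    print(server_id, batch)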
