Basic Principles of Web Crawlers (1): Process and Strategy

A web crawler is an important part of a search engine's crawling system. Its main purpose is to download web pages from the Internet to local storage, forming a mirror backup of the content on the network. This post gives a brief overview of crawlers and crawling systems.

 

1. Basic structure and workflow of web crawler

    The framework of a general web crawler is shown in the figure

  The basic workflow of a web crawler is as follows:

    1. Start with a set of carefully selected seed URLs;

    2. Put these URLs into the URL queue to be crawled;

    3. Take a URL out of the queue of URLs to be crawled, resolve its DNS to get the host's IP address, download the web page that the URL points to, and store it in the library of downloaded pages. Then put the URL into the queue of crawled URLs.

    4. Analyze the URLs in the queue of crawled URLs, extract the other URLs contained in the corresponding pages, and put them into the queue of URLs to be crawled, thereby entering the next cycle. A minimal sketch of this loop is given below.
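    As a minimal sketch of this loop (assuming the third-party requests library for downloading and a crude regex-based link extractor; the names to_crawl, crawled and page_store are illustrative, not part of any particular system):

        import re
        from collections import deque
        from urllib.parse import urljoin

        import requests

        def crawl(seed_urls, max_pages=50):
            """Minimal crawl loop: take a URL from the frontier, resolve and
            download it, store the page, and push newly found links back."""
            to_crawl = deque(seed_urls)   # queue of URLs to be crawled
            crawled = set()               # URLs already crawled
            page_store = {}               # downloaded web page "library"
            while to_crawl and len(page_store) < max_pages:
                url = to_crawl.popleft()
                if url in crawled:
                    continue
                try:
                    html = requests.get(url, timeout=10).text  # DNS lookup + download
                except requests.RequestException:
                    continue
                page_store[url] = html
                crawled.add(url)
                # Extract links from the downloaded page; unseen ones join the frontier.
                for href in re.findall(r'href="([^"]+)"', html):
                    link = urljoin(url, href)
                    if link not in crawled:
                        to_crawl.append(link)
            return page_store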

 

2. Dividing the Internet from the perspective of crawlers

    From the crawler's perspective, all pages on the Internet can be divided into five parts:


    1. Downloaded but not expired web pages

    2. Downloaded but expired web pages: the crawled pages are essentially a mirror backup of Internet content, but the Internet is dynamic and some of its content has changed since it was crawled, so those crawled copies have expired.

    3. Web pages to be downloaded: the pages whose URLs sit in the queue of URLs to be crawled.

    4. Knowable web pages: pages that have not been crawled and are not in the queue of URLs to be crawled, but whose URLs can be obtained by analyzing pages that have already been crawled or the pages corresponding to URLs waiting to be crawled; these are considered knowable web pages.

    5. Finally, there are web pages that the crawler cannot directly crawl and download at all; these are called unknowable web pages.
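    As a small illustrative sketch (the enum and its names are assumptions for bookkeeping, not part of any particular crawler), these five categories could be tracked per URL like this:

        from enum import Enum, auto

        class PageState(Enum):
            """The five categories of Internet pages, from the crawler's point of view."""
            DOWNLOADED_FRESH = auto()    # downloaded and not expired
            DOWNLOADED_EXPIRED = auto()  # downloaded, but the live page has since changed
            TO_BE_DOWNLOADED = auto()    # URL is waiting in the frontier queue
            KNOWABLE = auto()            # not yet seen, but reachable from known pages
            UNKNOWABLE = auto()          # cannot be crawled or downloaded directly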

 

3. Crawling strategies

    In a crawler system, the queue of URLs to be crawled is an important component. The order in which the URLs in this queue are arranged also matters, because it determines which pages are crawled first and which later. The method used to decide this order is called the crawling strategy. Several common crawling strategies are described below:

    1. Depth-first traversal strategy

    The depth-first traversal strategy means that the crawler starts from a start page and follows links one after another along a single chain; only after finishing that chain does it move on to the next start page and continue following links. Take the following figure as an example:


    Path traversed: A-F-G, E-H-I, B-C-D
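    As a minimal sketch, the following snippet reproduces this depth-first order on a small hypothetical link graph (the original figure is not shown here, so the adjacency lists below are an assumption constructed to match the path quoted above):

        def dfs_order(links, start):
            """Visit pages depth-first: follow a chain of links to its end
            before backtracking and trying the next unvisited link."""
            visited, order = set(), []
            def visit(url):
                if url in visited:
                    return
                visited.add(url)
                order.append(url)
                for nxt in links.get(url, []):
                    visit(nxt)
            visit(start)
            return order

        # Hypothetical link graph standing in for the missing figure.
        LINKS = {"A": ["F", "E", "B", "C", "D"],
                 "F": ["G"], "E": ["H"], "H": ["I"], "D": ["G"]}
        print(dfs_order(LINKS, "A"))  # ['A', 'F', 'G', 'E', 'H', 'I', 'B', 'C', 'D']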

    2. Breadth-first traversal strategy

    The basic idea of breadth-first traversal is to append the links found in a newly downloaded page to the tail of the queue of URLs to be crawled. In other words, the crawler first crawls all pages linked from the starting page, then picks one of those linked pages and crawls all pages linked from it, and so on. Taking the figure above as an example again:

    Traversal path: A-B-C-D-E-F-G-H-I
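    Again as a minimal sketch, with the same hypothetical edge set but A's links listed in the order a breadth-first pass over the figure would meet them, a FIFO frontier produces this layer-by-layer order:

        from collections import deque

        def bfs_order(links, start):
            """Visit pages breadth-first: newly found links go to the tail of
            the frontier queue, so each layer is finished before the next."""
            seen, order, frontier = {start}, [], deque([start])
            while frontier:
                url = frontier.popleft()
                order.append(url)
                for nxt in links.get(url, []):
                    if nxt not in seen:
                        seen.add(nxt)
                        frontier.append(nxt)
            return order

        # Same hypothetical edge set as in the depth-first example.
        LINKS = {"A": ["B", "C", "D", "E", "F"],
                 "D": ["G"], "E": ["H"], "F": ["G"], "H": ["I"]}
        print(bfs_order(LINKS, "A"))  # ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I']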

    3. Backlink count strategy

    The backlink count is the number of links from other web pages that point to a given page. It indicates the degree to which a page's content is recommended by others. Therefore, the crawling systems of many search engines use this indicator to evaluate the importance of web pages and thereby decide the order in which different pages are crawled.

    In a real network environment, because of advertising links and hidden links, the raw backlink count cannot simply be equated with importance. Search engines therefore tend to count only backlinks they consider reliable.
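    A minimal sketch of ordering the frontier by backlink count (assuming the counts have already been gathered; the max-heap over negated counts is just one convenient way to pop the most-linked page first):

        import heapq

        def crawl_order_by_backlinks(frontier_urls, backlink_counts):
            """Yield frontier URLs so that pages with more (trusted) backlinks
            are crawled first; backlink_counts maps URL -> backlink count."""
            # heapq is a min-heap, so negate the counts to pop the largest first.
            heap = [(-backlink_counts.get(url, 0), url) for url in frontier_urls]
            heapq.heapify(heap)
            while heap:
                neg_count, url = heapq.heappop(heap)
                yield url, -neg_count

        # Illustrative counts only.
        counts = {"/a": 120, "/b": 5, "/c": 48}
        for url, n in crawl_order_by_backlinks(["/a", "/b", "/c"], counts):
            print(url, n)  # /a 120, then /c 48, then /b 5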

    4. Partial PageRank Strategy

    The Partial PageRank strategy borrows the idea of the PageRank algorithm: the pages already downloaded, together with the URLs in the queue to be crawled, form a set of pages, and a PageRank value is computed for each page in the set. The URLs are then sorted by PageRank value and crawled in that order.

    Recomputing the PageRank values every time a single page is crawled would be too costly, so a compromise is to recompute them once after every K pages have been crawled. One problem remains: the links extracted from downloaded pages point to pages that have not been downloaded yet (the not-yet-downloaded pages mentioned earlier), and these have no PageRank value for the time being. To solve this, such a page is given a temporary PageRank value: the PageRank contributions passed in over all of its in-links are summed to form the temporary value of the unseen page, which then takes part in the sorting.
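    A compact sketch of the idea (the link graph and page names are hypothetical, and pages with no known out-links are simply skipped for brevity): PageRank is computed by power iteration over the pages seen so far, a not-yet-downloaded URL picks up a temporary value from its in-links, and the frontier is sorted by those values.

        def pagerank(links, damping=0.85, iterations=20):
            """Plain power-iteration PageRank; links maps page -> out-linked pages."""
            pages = set(links) | {p for outs in links.values() for p in outs}
            rank = {p: 1.0 / len(pages) for p in pages}
            for _ in range(iterations):
                new_rank = {p: (1 - damping) / len(pages) for p in pages}
                for page, outs in links.items():
                    if not outs:
                        continue
                    share = damping * rank[page] / len(outs)
                    for out in outs:
                        new_rank[out] += share
                rank = new_rank
            return rank

        # A, B, C are downloaded; "D" has only been discovered as a link.
        downloaded_links = {"A": ["B", "C"], "B": ["C"], "C": ["A", "D"]}
        ranks = pagerank(downloaded_links)

        # D's value comes entirely from the PageRank passed in over its in-links,
        # i.e. the temporary value described above; sort the frontier by it.
        frontier = ["D"]
        print(sorted(frontier, key=lambda u: ranks.get(u, 0.0), reverse=True))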

    5. OPIC strategy (Online Page Importance Computation)

    The algorithm effectively assigns an importance score, or "cash", to each page. Before the algorithm starts, every page is given the same initial amount of cash. Whenever a page P is downloaded, P's cash is distributed among all the links extracted from P, and P's own cash is cleared. All pages in the queue of URLs to be crawled are then sorted by the amount of cash they hold.
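    A minimal sketch of the cash bookkeeping (the dictionaries and page names are illustrative assumptions):

        def opic_next(frontier_cash):
            """Pick the frontier URL currently holding the most cash."""
            return max(frontier_cash, key=frontier_cash.get)

        def opic_download(url, outlinks, cash):
            """On download, split the page's cash evenly among the links found
            on it, then clear the page's own cash."""
            if outlinks:
                share = cash.get(url, 0.0) / len(outlinks)
                for link in outlinks:
                    cash[link] = cash.get(link, 0.0) + share
            cash[url] = 0.0

        # Every page starts with the same initial cash.
        cash = {"A": 1.0, "B": 1.0, "C": 1.0}
        url = opic_next(cash)                     # crawl the richest page first
        opic_download(url, ["B", "C", "D"], cash)
        print(url, cash)                          # A's cash is now split among B, C, D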

    6. Big site priority strategy

    All URLs in the queue to be crawled are grouped by the website they belong to, and the websites with the most pages waiting to be downloaded are crawled first. This is therefore called the big-site-first strategy.
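    A minimal sketch of picking the next URL under this strategy (the frontier contents and site names are illustrative):

        from collections import Counter
        from urllib.parse import urlsplit

        def next_by_big_site(frontier):
            """Group pending URLs by site and take one from the site with the
            most pages still waiting to be downloaded."""
            pending_per_site = Counter(urlsplit(url).netloc for url in frontier)
            biggest_site, _ = pending_per_site.most_common(1)[0]
            for url in frontier:
                if urlsplit(url).netloc == biggest_site:
                    return url

        # Hypothetical frontier contents.
        frontier = ["http://site-a.example/page1", "http://site-a.example/page2",
                    "http://site-b.example/page1"]
        print(next_by_big_site(frontier))  # a URL from site-a.example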
