Introduction to web crawlers

The basic workflow of a web crawler is as follows (a short code sketch of this loop appears after the list):

1. Start with a set of carefully chosen seed URLs;

2. Put these seed URLs into the queue of URLs to be crawled;

3. Take a URL from the queue of URLs to be crawled, resolve its DNS name to obtain the host's IP address, download the web page the URL points to, and store it in the library of downloaded pages. Then move the URL into the queue of crawled URLs.

4. Analyze the pages corresponding to the URLs in the crawled queue, extract the other URLs they contain, and put those URLs into the queue of URLs to be crawled, thereby entering the next cycle.
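
Below is a minimal sketch of this loop, using only the Python standard library. It is an illustration under simplifying assumptions, not a production crawler: robots.txt handling, rate limiting, retries, URL normalization, and deduplication by content are all omitted.

```python
# A minimal sketch of the crawl loop described above (standard library only).
import socket
import urllib.request
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkParser(HTMLParser):
    """Collects href values from <a> tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed_urls, max_pages=50):
    to_crawl = deque(seed_urls)   # queue of URLs to be crawled
    crawled = set()               # URLs that have already been crawled
    page_store = {}               # the "downloaded web page library"

    while to_crawl and len(crawled) < max_pages:
        url = to_crawl.popleft()
        if url in crawled:
            continue
        try:
            # DNS resolution -> host IP (done here only to mirror the workflow;
            # urllib resolves the name again internally when downloading).
            host_ip = socket.gethostbyname(urlparse(url).hostname)
            html = urllib.request.urlopen(url, timeout=5).read().decode("utf-8", "ignore")
        except Exception:
            continue                       # skip pages that fail to resolve or download
        page_store[url] = html             # store the downloaded page
        crawled.add(url)                   # move the URL to the crawled set
        parser = LinkParser()
        parser.feed(html)
        for link in parser.links:          # extract new URLs from the page ...
            absolute = urljoin(url, link)
            if absolute not in crawled:
                to_crawl.append(absolute)  # ... and enqueue them for the next cycle
    return page_store
```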


From the crawler's point of view, all pages on the Internet can be divided into five categories (sketched as states in the code after this list):

1. Downloaded, unexpired pages.

2. Downloaded, expired pages: the crawled pages are essentially a mirror and backup of Internet content. The Internet is dynamic, and when content on the live site changes, the corresponding crawled copies become stale, i.e. expired.

3. Pages to be downloaded: the pages whose URLs are in the queue of URLs to be crawled.

4. Known pages: pages that have not been crawled and are not yet in the queue to be crawled, but whose URLs can be obtained by analyzing pages that have already been crawled or the pages corresponding to queued URLs; these can be regarded as known pages.

5. Unknowable pages: pages that the crawler cannot directly discover or download.
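
These five categories can be thought of as per-URL states that a crawler tracks; the sketch below simply gives the list above illustrative names.

```python
# Illustrative only: the five page categories expressed as states a crawler
# could record for each URL it knows about.
from enum import Enum, auto

class PageState(Enum):
    DOWNLOADED_FRESH = auto()    # downloaded and not yet expired
    DOWNLOADED_EXPIRED = auto()  # downloaded, but the live page has since changed
    TO_DOWNLOAD = auto()         # waiting in the queue of URLs to be crawled
    KNOWN = auto()               # discoverable from crawled or queued pages, not yet queued
    UNKNOWABLE = auto()          # cannot be discovered or downloaded directly
```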


Crawling strategies

In a crawler system, the queue of URLs to be crawled is a key component. The order in which URLs are arranged in this queue also matters, because it determines which pages are crawled first and which are crawled later. The method used to decide this ordering is called the crawling strategy. Several common crawling strategies are described below:

1. Depth-first traversal strategy

With the depth-first traversal strategy, the crawler starts from a start page and follows links one after another down a single path; only after that path has been fully processed does it move on to the next start page and continue following links.
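
A minimal sketch of this strategy, assuming a hypothetical get_links(url) helper that downloads a page and returns its outgoing links:

```python
# Depth-first crawling sketch: follow each link chain to the end (or to a
# depth limit) before backtracking. get_links() is a stand-in for a real
# downloader plus HTML link extractor.
def dfs_crawl(url, get_links, visited=None, max_depth=3, depth=0):
    if visited is None:
        visited = set()
    if depth > max_depth or url in visited:
        return visited
    visited.add(url)
    # ... download and process the page for `url` here ...
    for link in get_links(url):
        # Recurse immediately, so one branch is exhausted before the next.
        dfs_crawl(link, get_links, visited, max_depth, depth + 1)
    return visited
```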

2. Breadth-first traversal strategy

The basic idea of breadth-first traversal is to append the links found in a newly downloaded page to the end of the queue of URLs to be crawled. In other words, the crawler first crawls all pages linked from the start page, then picks one of those linked pages and crawls all pages linked from it, and so on.
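
A minimal sketch using a FIFO queue, with the same hypothetical get_links() helper as in the depth-first example:

```python
# Breadth-first crawling sketch: newly found links go to the tail of the
# queue, so pages are visited level by level from the start page outward.
from collections import deque

def bfs_crawl(start_url, get_links, max_pages=100):
    to_crawl = deque([start_url])
    visited = {start_url}
    while to_crawl and len(visited) < max_pages:
        url = to_crawl.popleft()           # take from the head of the queue
        # ... download and process the page for `url` here ...
        for link in get_links(url):
            if link not in visited:
                visited.add(link)
                to_crawl.append(link)      # append newly found links to the tail
    return visited
```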

3. Backlink count strategy

The number of backlinks is the number of links from other web pages that point to a given page. It indicates how strongly the page's content is recommended by others, so the crawling systems of search engines often use this metric to estimate a page's importance and thereby decide the order in which pages are crawled.

In a real network environment, because of advertising links and link spam, the raw backlink count cannot fully reflect how important a page actually is. Search engines therefore tend to count only the backlinks they consider reliable.
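
A sketch of ordering the frontier by backlink count; link_graph is a hypothetical mapping from each crawled page to the URLs it links to:

```python
# Backlink-count ordering sketch: count the distinct pages linking to each
# URL, then crawl the most-referenced URLs first.
from collections import Counter

def order_by_backlinks(frontier, link_graph):
    backlinks = Counter()
    for source, targets in link_graph.items():
        for target in set(targets):        # count each linking page only once
            backlinks[target] += 1
    # Higher backlink count first; URLs nothing links to yet default to 0.
    return sorted(frontier, key=lambda url: backlinks[url], reverse=True)
```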

4. Partial PageRank strategy

The Partial PageRank algorithm borrows the idea of the PageRank algorithm: the already downloaded pages, together with the URLs in the queue to be crawled, form a set of pages over which a PageRank value is computed for each page. The URLs in the queue are then sorted by PageRank value and crawled in that order.

Recomputing PageRank every time a single page is crawled would be too expensive, so a common compromise is to recompute the PageRank values once after every K pages have been crawled. One problem remains: the links extracted from downloaded pages, i.e. the known-but-not-yet-crawled pages mentioned earlier, do not have a PageRank value yet. To handle this, such a page is given a temporary PageRank value: the PageRank contributions passed in through all of its incoming links are summed to form a provisional value, which lets the page take part in the sorting.
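
A simplified sketch of this idea: link_graph maps each downloaded page to the links extracted from it, frontier is the queue of URLs to be crawled, and the damping factor and iteration count are illustrative choices rather than anything prescribed by the text. In practice this function would be re-run after every K newly crawled pages.

```python
# Simplified Partial PageRank sketch over the pages known so far. Frontier
# pages end up with the rank passed in through their incoming links, which
# serves as their temporary PageRank value.
def partial_pagerank(link_graph, frontier, damping=0.85, iterations=20):
    nodes = set(link_graph) | set(frontier)
    for targets in link_graph.values():
        nodes.update(targets)
    if not nodes:
        return []
    rank = {node: 1.0 / len(nodes) for node in nodes}
    for _ in range(iterations):
        new_rank = {node: (1.0 - damping) / len(nodes) for node in nodes}
        for source, targets in link_graph.items():
            if not targets:
                continue
            share = damping * rank[source] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    # Crawl the frontier URLs with the highest (possibly temporary) rank first.
    return sorted(frontier, key=lambda url: rank.get(url, 0.0), reverse=True)
```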

5. OPIC strategy (Online Page Importance Computation)

This algorithm effectively assigns each page an importance score in the form of "cash". Before the algorithm starts, every page is given the same initial amount of cash. When a page P is downloaded, P's cash is distributed among all the links extracted from P, and P's own cash is cleared. The pages in the queue of URLs to be crawled are then sorted by the amount of cash they hold.
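
A minimal sketch of the cash bookkeeping; splitting the cash evenly among the outlinks is the usual choice and an assumption here:

```python
# OPIC-style sketch: each known page holds some "cash"; downloading a page
# pays its cash out to the pages it links to and resets its own cash to zero.
INITIAL_CASH = 1.0   # every page starts with the same amount of cash

def distribute_cash(cash, page, outlinks):
    amount = cash.get(page, INITIAL_CASH)
    if outlinks:
        share = amount / len(outlinks)               # even split (assumed here)
        for link in outlinks:
            cash[link] = cash.get(link, INITIAL_CASH) + share
    cash[page] = 0.0                                 # P's cash is emptied after download

def order_frontier_by_cash(frontier, cash):
    # Crawl the pages holding the most cash first.
    return sorted(frontier, key=lambda url: cash.get(url, INITIAL_CASH), reverse=True)
```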

6. Big site priority strategy

The URLs in the queue to be crawled are grouped by the website they belong to, and the sites with the largest number of pages waiting to be downloaded are crawled first. For this reason the approach is also called the large-site-first strategy.
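
A small sketch of this grouping; using the host part of the URL as the notion of "website" is a simplifying assumption:

```python
# Big-site-first sketch: group the frontier by host and pick the host with
# the largest backlog of pages still to be downloaded.
from collections import defaultdict
from urllib.parse import urlparse

def pick_biggest_site(frontier):
    by_site = defaultdict(list)
    for url in frontier:
        by_site[urlparse(url).netloc].append(url)
    if not by_site:
        return None, []
    biggest = max(by_site, key=lambda site: len(by_site[site]))
    return biggest, by_site[biggest]       # crawl this site's URLs first
```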

