Web Crawlers: Classification and Principles

Table of contents

Classification of crawlers

1. General web crawler: the search engine crawler

2. Focused web crawler: a crawler targeting specific pages

3. Incremental web crawler

4. Deep web crawler

The principles of general crawlers and focused crawlers

General crawlers:

Focused crawlers:


Classification of crawlers

Web crawlers can be roughly divided into four categories according to system structure and implementation technology: general web crawlers, focused web crawlers, incremental web crawlers, and deep web crawlers.

1. General web crawler: the search engine crawler

        For example, when a user searches for a keyword on the Baidu search engine, Baidu analyzes and processes the keyword, finds relevant pages among the web pages it has indexed, sorts them according to certain ranking rules, and then displays them to the user. This requires the engine to have collected as many high-quality web pages from the Internet as possible.

        A general crawler collects web pages and information from the Internet. This page information is used to build the index that supports the search engine; it determines whether the content of the whole engine system is rich and whether its information is up to date, so the crawler's performance directly affects the quality of the search engine.

2. Focused web crawler: a crawler targeting specific pages

        Also called a topic web crawler, a focused crawler locates its target pages within pages related to a given topic and mainly serves a specific group of users, which saves a great deal of server and bandwidth resources. When crawling, a focused crawler processes and filters the content and tries to ensure that only page information relevant to the need is captured.

For example, if you want to obtain data in a certain vertical field or have clear retrieval requirements, you need to filter out some useless information.

For example: price-comparison websites crawl product information from other websites.
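As a rough sketch of how such topic filtering might look, the snippet below keeps a page only if it mentions enough keywords from a topic list; the keyword set and the threshold are illustrative assumptions, not a fixed rule.

```python
# A minimal sketch of topic filtering in a focused crawler: keep a page only if
# it mentions enough of the topic keywords. The keyword set and the threshold
# are illustrative assumptions, not a fixed rule.

TOPIC_KEYWORDS = {"price", "discount", "product", "deal"}  # hypothetical topic


def is_relevant(page_text: str, min_hits: int = 2) -> bool:
    text = page_text.lower()
    hits = sum(1 for kw in TOPIC_KEYWORDS if kw in text)
    return hits >= min_hits

# Only relevant pages are stored, and only their links are followed further.
```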

3. Incremental web crawler

Incremental web crawler: "incremental" here means incremental update, that is, when updating, only the parts that have changed are updated and the unchanged parts are left untouched. Such a crawler therefore only crawls pages whose content has changed or pages that are newly generated. A typical example is a crawler for a recruitment website, where new job postings appear continuously.
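As a rough sketch of the incremental idea, the code below keeps a fingerprint of every page stored on the previous run and re-processes a page only if it is new or its content hash has changed; the `fetch` callable and the URL list are hypothetical placeholders.

```python
import hashlib

# A minimal sketch of incremental crawling: keep a fingerprint of every page
# already stored, and re-process a page only if it is new or its content changed.
# The `fetch` callable and the URL list are hypothetical placeholders.

seen_fingerprints = {}  # url -> content hash from the previous crawl


def fingerprint(html: str) -> str:
    return hashlib.md5(html.encode("utf-8")).hexdigest()


def crawl_incrementally(urls, fetch):
    updated = []
    for url in urls:
        html = fetch(url)                      # download the current version
        digest = fingerprint(html)
        if seen_fingerprints.get(url) == digest:
            continue                           # unchanged -> skip re-processing
        seen_fingerprints[url] = digest        # new page or changed content
        updated.append((url, html))
    return updated
```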

4. Deep web crawler

Deep web crawler. First of all, what is a deep page?

On the Internet, web pages are divided into surface pages and deep pages according to how they exist. Surface pages are static pages that can be reached through static links without submitting a form, while deep pages are those that can only be obtained after submitting certain keywords, for example through a form. On the Internet, the number of deep pages is often far greater than the number of surface pages.

A deep web crawler mainly consists of a URL list, an LVS list (LVS stands for label/value set, the data used to fill in forms), a crawl controller, a parser, an LVS controller, a form analyzer, a form processor, a response analyzer, and so on.
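To make the idea of a deep page concrete, the sketch below shows how a crawler might fill in and submit a search form itself in order to reach content that has no static link; the URL, the form field names, and the use of the requests library are illustrative assumptions, not part of any standard deep-crawler module.

```python
import requests

# A minimal sketch of reaching a "deep" page: the content has no static link,
# so the crawler fills in and submits the search form itself. The URL and the
# form field names below are hypothetical examples.


def crawl_deep_page(keyword: str) -> str:
    form_data = {
        "q": keyword,   # the search keyword the form expects
        "page": 1,
    }
    # Submit the form the way a browser would, then return the resulting HTML.
    resp = requests.post("https://example.com/search", data=form_data, timeout=10)
    resp.raise_for_status()
    return resp.text
```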

Later we will mainly learn about focused crawlers. Once you learn focused crawlers, you can easily write other types of crawlers.

The principles of general crawlers and focused crawlers

General crawlers:

Step 1: Crawl web pages (URLs)

  1. Send a request to start_url and parse the response;

  2. Extract the required new URLs from the parsed response and put them into the queue of URLs to be crawled;

  3. Take a URL out of the to-be-crawled queue, resolve DNS to get the host's IP, download the page corresponding to that URL, store it in the downloaded-page library, and move the URL into the crawled-URL queue;

  4. Analyze the pages behind the URLs in the crawled queue, extract the other URLs they contain, and put those into the to-be-crawled queue, thus entering the next cycle.
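The loop above can be expressed in a few lines of Python. The sketch below is a minimal illustration under simplifying assumptions: it uses the requests library and a naive regex link extractor, and it skips robots.txt handling, politeness delays, and retries that a real general crawler would need.

```python
import re
from collections import deque
from urllib.parse import urljoin

import requests

# A minimal sketch of the crawl loop above: a queue of URLs to crawl, a set of
# URLs already crawled, and a naive regex link extractor. A real general
# crawler would also handle robots.txt, politeness delays, retries, etc.


def crawl(start_url: str, max_pages: int = 50) -> dict:
    to_crawl = deque([start_url])   # queue of URLs to be crawled
    crawled = set()                 # URLs that have already been crawled
    page_store = {}                 # the "downloaded web page library"

    while to_crawl and len(crawled) < max_pages:
        url = to_crawl.popleft()
        if url in crawled:
            continue
        try:
            html = requests.get(url, timeout=10).text   # download the page
        except requests.RequestException:
            continue                                    # skip unreachable URLs
        page_store[url] = html
        crawled.add(url)
        # Extract new URLs from the response and put them into the queue.
        for link in re.findall(r'href="([^"]+)"', html):
            absolute = urljoin(url, link)
            if absolute.startswith("http") and absolute not in crawled:
                to_crawl.append(absolute)
    return page_store
```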

Step 2: Data storage

The search engine uses crawlers to fetch web pages and stores the data in an original-page database. The stored page data is exactly the same as the HTML the user's browser would receive.

Search engine spiders also perform some duplicate-content detection while crawling. If they encounter a large amount of plagiarized, scraped, or copied content on a site with very low weight, they are likely to stop crawling it.
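As a rough stand-in for such duplicate detection, the sketch below fingerprints normalized page text and refuses to store a page whose fingerprint has been seen before; real engines use much stronger similarity measures (shingling, SimHash, and the like), so this only illustrates the idea.

```python
import hashlib

# A rough stand-in for duplicate-content detection at storage time: pages whose
# normalized text hashes to a value already seen are treated as copies and not
# stored again. Real engines use much stronger similarity measures.

stored_hashes = set()


def store_if_original(url: str, page_text: str, page_db: dict) -> bool:
    normalized = " ".join(page_text.split()).lower()
    digest = hashlib.sha1(normalized.encode("utf-8")).hexdigest()
    if digest in stored_hashes:
        return False             # duplicate / copied content -> not stored
    stored_hashes.add(digest)
    page_db[url] = page_text     # the original page database
    return True
```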

Step 3: Preprocessing

The search engine performs various preprocessing steps on the pages crawled back by the crawler.

  • Extract text

  • Chinese word segmentation

  • Eliminate noise (such as copyright statement text, navigation bar, advertisements, etc...)

  • Index processing

  • Link relationship calculation

  • Special file handling

  • ....

In addition to HTML files, search engines can usually crawl and index a variety of text-based file types, such as PDF, Word, WPS, XLS, PPT, TXT files, etc. We also often see these file types in search results.

However, search engines cannot yet process non-text content such as images, videos, and Flash, nor can they execute scripts and programs.
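The sketch below illustrates two of the preprocessing steps listed above, text extraction and word segmentation. It assumes the jieba library (a common Chinese word-segmentation package) is installed, and its crude tag stripping stands in for real noise elimination.

```python
import re

import jieba  # a common Chinese word-segmentation library; assumed installed

# A sketch of two preprocessing steps: extracting plain text from a crawled
# page and segmenting it into words for indexing. The crude tag stripping here
# stands in for real noise elimination (navigation bars, ads, and so on).


def extract_text(html: str) -> str:
    # Drop scripts and styles, then strip the remaining tags.
    html = re.sub(r"<(script|style)[^>]*>.*?</\1>", " ", html, flags=re.S | re.I)
    text = re.sub(r"<[^>]+>", " ", html)
    return " ".join(text.split())


def segment(text: str) -> list:
    # Chinese word segmentation; English words come through as whole tokens.
    return [w for w in jieba.lcut(text) if w.strip()]
```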

Step 4: Provide search services and website ranking

After the search engine organizes and processes the information, it provides users with keyword retrieval services and displays relevant information to users.

Focused crawlers:

Step 1: Send a request to start_url

Step 2: Get the response

Step 3: Parse the response; if the response contains new URLs that are needed, repeat the process from step 1 for those URLs

Step 4: Extract the data

Step 5: Save the data

Usually we complete getting the response and parsing it in a single step, so a focused crawler generally works in four steps in total.
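Putting those steps together, the sketch below is one minimal way the pipeline could look in Python; it uses the requests library, and the `<title>` regex and output path are stand-ins for whatever data and storage the actual task requires.

```python
import json
import re
from collections import deque
from urllib.parse import urljoin

import requests

# A minimal sketch of the focused-crawler pipeline: send a request, get the
# response, parse it (collecting both data and new URLs in one pass), then
# save the data. The <title> regex and output path are hypothetical examples.


def focused_crawl(start_url: str, max_pages: int = 20, out_path: str = "items.json"):
    queue, seen, items = deque([start_url]), set(), []
    while queue and len(seen) < max_pages:
        url = queue.popleft()
        if url in seen:
            continue
        seen.add(url)
        resp = requests.get(url, timeout=10)               # step 1: send a request
        html = resp.text                                    # step 2: get the response
        match = re.search(r"<title>(.*?)</title>", html, re.S)
        if match:                                           # step 4: extract data
            items.append({"url": url, "title": match.group(1).strip()})
        for link in re.findall(r'href="([^"]+)"', html):    # step 3: new URLs found
            absolute = urljoin(url, link)
            if absolute.startswith("http"):
                queue.append(absolute)
    with open(out_path, "w", encoding="utf-8") as f:        # step 5: save the data
        json.dump(items, f, ensure_ascii=False, indent=2)
```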

The basic classification and principles of crawlers have been introduced here. See you in the next issue!

