Table of contents
1. Universal web crawler: search engine crawler
2. Focused web crawlers: crawlers targeting specific web pages
The principles of universal crawlers and focused crawlers
Classification of crawlers
Web crawlers can be roughly divided into four categories according to system structure and implementation technology, namely universal web crawlers, focused web crawlers, incremental web crawlers, and deep web crawlers.
1. Universal web crawler: search engine crawler
A universal crawler collects web pages and information from across the Internet. This page information is used to build the index that supports the search engine, so it determines how rich and how up-to-date the engine's content is, and its performance directly affects the quality of the search engine. For example, when a user searches for a keyword on Baidu, Baidu analyzes and processes the keyword, finds relevant pages among those it has indexed, sorts them according to its ranking rules, and then displays them to the user. This requires the engine to have collected as many high-quality web pages as possible from the Internet beforehand.
2. Focused web crawlers: crawlers targeting specific web pages
Also called a topic web crawler, a focused crawler locates its target pages within pages related to a specific topic. It mainly provides services for a specific type of user, which can save a lot of server and bandwidth resources. When crawling, a focused crawler processes and filters the content, trying to ensure that only page information relevant to its needs is captured. For example, if you want to obtain data in a certain vertical field, or have a clear retrieval requirement, you need to filter out the useless information.
For example: price-comparison websites are focused crawlers, crawling product data from other websites.
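The filtering idea behind a focused crawler can be sketched very simply: keep a page only if it matches the topic. The keyword set, threshold, and sample pages below are made-up illustrations, not part of any real crawler:

```python
# Minimal sketch of focused-crawler filtering: keep only pages that
# mention enough topic keywords. Keywords and pages are hypothetical.

TOPIC_KEYWORDS = {"price", "discount", "product"}

def is_relevant(page_text, keywords=TOPIC_KEYWORDS, threshold=2):
    """Keep a page only if it mentions at least `threshold` topic keywords."""
    text = page_text.lower()
    hits = sum(1 for kw in keywords if kw in text)
    return hits >= threshold

pages = [
    "Big discount on this product, best price today",
    "Local news and weather report for the weekend",
]
relevant = [p for p in pages if is_relevant(p)]  # only the first page survives
```

A real focused crawler would score relevance with more robust methods (e.g. text classification), but the filter-before-store principle is the same.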
3. Incremental web crawler
Incremental Web Crawler. "Incremental" here means incremental update: when updating, only the parts that have changed are updated, while the unchanged parts are left alone. Such a crawler therefore only crawls pages whose content has changed or pages that are newly generated. For example: a crawler for a recruitment website, which only needs to fetch new or updated job postings.
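One common way to detect "changed content" is to fingerprint each page and re-crawl only when the fingerprint differs. This is a minimal sketch of that idea; the storage (a plain dict) and the example URL are hypothetical stand-ins for a real crawler's persistent store:

```python
import hashlib

# Minimal sketch of incremental crawling: store a hash of each page body
# and crawl a URL again only when the body's hash has changed.
# seen_hashes stands in for a persistent database.

seen_hashes = {}  # url -> hash of the content from the last crawl

def needs_crawl(url, body):
    """Return True if the page is new or its content changed since last crawl."""
    digest = hashlib.sha256(body.encode("utf-8")).hexdigest()
    if seen_hashes.get(url) == digest:
        return False           # unchanged: skip this page
    seen_hashes[url] = digest  # new or changed: record fingerprint and crawl
    return True

# First visit -> crawl; same content -> skip; changed content -> crawl again.
```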
4. Deep web crawler
Deep Web Crawler. First of all, what is a deep page?
On the Internet, web pages are divided into surface pages and deep pages. Surface pages are static pages that can be reached through static links without submitting a form, while deep pages can only be obtained by submitting certain keywords through a form. On the Internet, the number of deep pages is often far greater than the number of surface pages.
A deep web crawler mainly consists of a URL list, an LVS (label/value set) list, a crawl controller, a parser, an LVS controller, a form analyzer, a form processor, a response analyzer, and so on.
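The core of the "form processor" component is simply sending a POST request with the form fields filled in, since that is what reaches a deep page. Below is a minimal sketch using the standard library; the form action URL and field names are hypothetical:

```python
from urllib.parse import urlencode
from urllib.request import Request

# Minimal sketch of a deep web crawler's form-submission step: encode the
# form fields and build the POST request that would retrieve the deep page.
# The URL and field names below are made-up examples.

def build_form_request(action_url, form_fields):
    """Encode form fields and build the POST request for a deep page."""
    body = urlencode(form_fields).encode("utf-8")
    return Request(
        action_url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )

req = build_form_request(
    "https://example.com/search",        # hypothetical form action
    {"keyword": "python", "page": "1"},  # hypothetical form fields
)
# Sending req (e.g. with urllib.request.urlopen) would return the page
# that sits "behind" the form, which no static link points to.
```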
Later we will mainly study focused crawlers. Once you have learned focused crawlers, you can easily write the other types.
The principles of universal crawlers and focused crawlers
Universal crawlers:
Step 1 : Crawl the web page (url)
Send a request to start_url and parse the response;
Extract the needed new URLs from the parsed response and put them into the queue of URLs to be crawled;
Take a URL from that queue, resolve DNS to get the host's IP, download the page at that URL, store it in the downloaded-page library, and move the URL into the crawled-URL queue;
Parse the pages behind the crawled URLs for further URLs, put those into the queue of URLs to be crawled, and so enter the next cycle...
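The loop above can be sketched with two collections: a queue of URLs to crawl and a set of URLs already crawled. To keep the example runnable offline, an in-memory dict stands in for "download the page and parse new URLs out of the response"; the site and URLs are made up:

```python
from collections import deque

# Minimal sketch of the universal crawl loop. The `pages` dict simulates
# "download the web page and parse the URLs found on it"; all URLs are
# hypothetical examples.

pages = {  # fake site: url -> links found on that page
    "https://a.com/":  ["https://a.com/1", "https://a.com/2"],
    "https://a.com/1": ["https://a.com/2"],
    "https://a.com/2": ["https://a.com/"],
}

def crawl(start_url):
    to_crawl = deque([start_url])  # queue of URLs to be crawled
    crawled = set()                # crawled-URL queue (a set here)
    while to_crawl:
        url = to_crawl.popleft()
        if url in crawled:
            continue
        # A real crawler would resolve DNS, download the page,
        # and store it in the downloaded-page library here.
        crawled.add(url)
        for link in pages.get(url, []):  # URLs parsed from the response
            if link not in crawled:
                to_crawl.append(link)    # feeds the next cycle
    return crawled

print(sorted(crawl("https://a.com/")))
```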
Step 2 : Data Storage
The search engine stores the pages its crawler fetches into the original page database. The stored page data is exactly the same as the HTML the user's browser would receive.
Search engine spiders also perform some duplicate-content detection while crawling. If they encounter a large amount of plagiarized, scraped, or copied content on a site with very low weight, they are likely to stop crawling it.
Step 3 : Preprocessing
The search engine performs various preprocessing steps on the pages crawled back by the crawler.
Extract text
Chinese word segmentation
Eliminate noise (such as copyright notices, navigation bars, advertisements, etc.)
Index processing
Link relationship calculation
Special file handling
....
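The "extract text" and "eliminate noise" steps above can be sketched with the standard library's HTML parser: walk the page and keep visible text while skipping noisy blocks. Real engines do far more (word segmentation, indexing, link analysis); the noise-tag list and sample HTML below are made-up illustrations:

```python
from html.parser import HTMLParser

# Minimal sketch of text extraction with noise elimination: collect visible
# text but skip script, style and navigation blocks. The HTML is a made-up
# example page.

class TextExtractor(HTMLParser):
    NOISE_TAGS = {"script", "style", "nav"}

    def __init__(self):
        super().__init__()
        self.texts = []        # extracted text fragments
        self._skip_depth = 0   # >0 while inside a noise block

    def handle_starttag(self, tag, attrs):
        if tag in self.NOISE_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.NOISE_TAGS and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.texts.append(data.strip())

html = ("<html><nav>Home | About</nav>"
        "<p>Real article text.</p>"
        "<script>var x = 1;</script></html>")
parser = TextExtractor()
parser.feed(html)
print(parser.texts)  # only the paragraph text survives
```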
In addition to HTML files, search engines can usually crawl and index a variety of text-based file types, such as PDF, Word, WPS, XLS, PPT, TXT files, etc. We also often see these file types in search results.
However, search engines cannot yet process non-text content such as images, videos, and Flash, nor can they execute scripts and programs.
Step 4 : Provide search services and website ranking
After the search engine organizes and processes the information, it provides users with keyword retrieval services and displays relevant information to users.
Focused crawlers:
Step 1: Send a request to start_url
Step 2: Get the response (response)
Step 3: Parse the response. If the response contains needed new URL addresses, repeat from step 1 with them;
Step 4: Extract data
Step 5: Save data
Usually, we complete getting and parsing the response in a single step, so a focused crawler generally has four steps in total.
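The focused-crawler steps above can be sketched as four small functions. To keep the example runnable offline, `fetch()` fakes the request/response step; the page content, URL, and extracted fields are all hypothetical:

```python
import json
import re

# Minimal sketch of the focused-crawler workflow: request + response,
# parse, extract, save. fetch() fakes the HTTP step so the example runs
# offline; the HTML and URL are made-up examples.

def fetch(url):
    """Steps 1-2: send the request and get the response (faked here)."""
    return "<html><h1>Python 3.12 released</h1><a href='/news/2'>next</a></html>"

def parse(html):
    """Step 3: pull new URL addresses out of the response."""
    return re.findall(r"href='([^']+)'", html)

def extract(html):
    """Step 4: extract the data we actually want."""
    match = re.search(r"<h1>(.*?)</h1>", html)
    return {"title": match.group(1)} if match else {}

def save(record, path="data.json"):
    """Step 5: persist the extracted data."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(record, f, ensure_ascii=False)

html = fetch("https://example.com/news/1")
new_urls = parse(html)   # would be fed back into fetch() in a real loop
record = extract(html)
save(record)
```

In a real focused crawler the new URLs from `parse()` would go back into the request step, closing the loop described above.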
The basic classification and principles of crawlers have been introduced here. See you in the next issue!