Python crawler from beginner to proficient - crawler basics: the basic principles of crawlers, learned from scratch!

1. Overview of crawlers

Simply put, a crawler is an automated program that fetches web pages and then extracts and saves information from them. The sections below give a brief overview.
(1) Get the web page

The first thing a crawler has to do is fetch the web page, that is, obtain its source code. The source code contains useful information about the page, so once you have the source code you can extract the information you want from it.

We discussed the concepts of request and response earlier. When you send a request to a website's server, the response body it returns is the source code of the web page. The most critical part, therefore, is to construct a request, send it to the server, and then receive and parse the response. How do we implement this process? Surely we can't copy the source code of every page by hand?

Don't worry, Python provides many libraries to help with this, such as urllib and requests. We can use them to perform HTTP request operations: both the request and the response are represented by the data structures these libraries provide. After receiving the response, we only need to read its body, which is the source code of the web page. In this way we can obtain web pages programmatically.
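For instance, a minimal sketch of this step with the requests library might look like the following; the URL here is just a placeholder, and requests must be installed separately (e.g. via pip):

```python
# Fetch a page's source code with requests (a minimal sketch).
import requests

def fetch_html(url):
    """Send an HTTP GET request and return the response body (the page source)."""
    headers = {"User-Agent": "Mozilla/5.0"}      # look like an ordinary browser
    response = requests.get(url, headers=headers, timeout=10)
    response.raise_for_status()                  # raise an error on 4xx/5xx status codes
    return response.text                         # the HTML source of the page

if __name__ == "__main__":
    html = fetch_html("https://example.com/")    # placeholder URL
    print(html[:200])                            # show the first 200 characters
```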

(2) Extract information

After obtaining the source code of a web page, the next step is to analyze it and extract the data we want. The most common general-purpose method is to use regular expressions, but constructing them is relatively complicated and error-prone.

In addition, since web pages have a fairly regular structure, there are libraries that extract information based on node attributes, CSS selectors or XPath, such as Beautiful Soup, pyquery and lxml. With these libraries we can extract web page information, such as node attributes and text values, quickly and efficiently.
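As an illustration, here is a small sketch that uses Beautiful Soup with a CSS selector; the HTML snippet and the selector are made up purely for the example, and bs4 must be installed:

```python
# Extract links and text from an HTML snippet with Beautiful Soup (a sketch).
from bs4 import BeautifulSoup

html = """
<ul class="news">
  <li><a href="/a/1" title="First story">First story</a></li>
  <li><a href="/a/2" title="Second story">Second story</a></li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")     # or "lxml" if the lxml parser is installed
for link in soup.select("ul.news li a"):      # CSS selector: every link inside the list
    print(link["href"], link.get_text())      # node attribute and text value
```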

Extracting information is a very important part of crawling: it turns messy data into something organized, so that we can process and analyze it later.

(3) Save data

After extracting information, we generally save the data somewhere for later use. There are many ways to do this: it can be saved simply as TXT or JSON text, written to a database such as MySQL or MongoDB, or uploaded to a remote server, for example via SFTP.
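A minimal sketch of the simplest of these options, saving records as JSON and CSV files with the standard library (the records and file names are dummy values for illustration):

```python
# Save extracted records as JSON and CSV files (a sketch with dummy data).
import csv
import json

records = [
    {"title": "First story", "url": "/a/1"},
    {"title": "Second story", "url": "/a/2"},
]

# Save as JSON text
with open("data.json", "w", encoding="utf-8") as f:
    json.dump(records, f, ensure_ascii=False, indent=2)

# Save as CSV
with open("data.csv", "w", encoding="utf-8", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["title", "url"])
    writer.writeheader()
    writer.writerows(records)
```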

(4) Automated program

By "automated program" we mean that the crawler can carry out these operations in place of a human. We could of course collect this information by hand, but when the volume is particularly large or we need a lot of data quickly, a program is the only realistic option. A crawler is an automated program that does the crawling on our behalf; during crawling it can handle exceptions, retry on errors, and so on, to keep the crawl running efficiently.
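As a rough sketch of the "error retry" idea, the snippet below retries a failed request a few times before giving up; the retry count, delay and URL handling are arbitrary choices, not a prescribed recipe:

```python
# Retry a failed request a few times before giving up (a rough sketch).
import time
import requests

def fetch_with_retry(url, retries=3, delay=2.0):
    for attempt in range(1, retries + 1):
        try:
            response = requests.get(url, timeout=10)
            response.raise_for_status()
            return response.text
        except requests.RequestException as exc:   # network errors, timeouts, bad status codes
            print(f"Attempt {attempt} failed: {exc}")
            time.sleep(delay)                      # wait a little before trying again
    return None                                    # give up after all attempts fail
```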

2. The use of crawlers

The era of big data has arrived, and web crawler technology has become an indispensable part of it. Enterprises need data to analyze user behavior, the shortcomings of their own products, competitor information and so on, and the precondition for all of this is data collection. The value of web crawlers is really the value of data: in the Internet society data is priceless, and whoever holds a large amount of useful data holds the initiative in decision-making.

The main application areas of web crawlers today include search engines, data collection, data analysis, information aggregation, competitive product monitoring, cognitive intelligence, public opinion analysis, and so on. Countless companies are involved in crawler-related business, such as Baidu, Google, Tianyancha, Qichacha, Xinbang and Feigua. In the era of big data, crawlers are widely used and in great demand. Here are a few examples close to everyday life:

  • **Job-hunting needs:** obtain recruitment information and salary levels in various cities, making it easy to pick the positions that suit you;

  • **Rental demand:** Obtain rental information in various cities to select your favorite housing;

  • **Gourmet needs:** obtain well-reviewed food spots in various places, so that foodies don't lose their way;

  • **Shopping needs:** Obtain the price and discount information of the same product from various merchants to make shopping more affordable;

  • **Car-buying needs:** get the price fluctuations of your favorite vehicles over recent years, as well as the prices of different models through different channels, to help you choose a car.

3. The meaning of URI and URL

URI stands for Uniform Resource Identifier, and URL stands for Uniform Resource Locator. For example, https://www.kuaidaili.com/ is both a URI and a URL. URLs are a subset of URIs, and for ordinary web links it is customary to call them URLs. The basic composition of a URL is as follows:

scheme://[username:password@]host[:port][/path][;parameters][?query][#fragment]

The meaning of each part is as follows:

  1. **scheme:** the protocol used to obtain the resource, such as http, https or ftp; there is no default value. The scheme is also called the protocol;

  2. **username:password:** username and password. In some cases a URL requires a username and password for access; this is a special case, generally used when accessing FTP. It explicitly states the username and password for accessing the resource, but it is optional: if you omit it, you may be prompted to enter them;

  3. **host:** host address, which can be a domain name or an IP address, such as www.kuaidaili.com or 112.66.251.209;

  4. **port:** port, the service port configured on the server. The default port of the http protocol is 80 and that of https is 443; for example, https://www.kuaidaili.com/ is equivalent to https://www.kuaidaili.com:443;

  5. **path:** path, the address of the resource on the server. host:port locates the host, but a host holds many files, and the path locates the specific one. For example, in https://www.baidu.com/file/index.html the path is /file/index.html, meaning we are accessing the file /file/index.html;

  6. **parameters:** parameters, used to supply additional information when accessing a resource; their main function is to pass the server extra parameters describing characteristics of this request. For example, in https://www.kuaidaili.com/dps;kspider, kspider is a parameter. This part is rarely used today; the query part usually carries parameters instead;

  7. **query:** query, used to query for a certain type of resource; multiple query parameters passed via GET are separated by &. For example, in https://www.kuaidaili.com/dps/?username=kspider&type=spider, the query part is username=kspider&type=spider, meaning the username is kspider and the type is spider;

  8. **fragment:** fragment, a partial supplement to the resource description, used to identify a secondary resource. For example, in https://www.kuaidaili.com/dps#kspider, kspider is the value of the fragment:

    • Applications: single-page routing and HTML anchors;

    • # is different from ?: the query string after ? is sent to the server with the request, whereas the fragment is never sent to the server;

    • Changing the fragment does not trigger the browser to refresh the page, but it does create a new entry in the browsing history;

    • The fragment is processed by the browser according to the file's media type (MIME type);

    • By default, Google's search engine ignores # and everything after it. If you want the fragment to be read by the search engine, follow # with a !; Google then converts the content after #! into the value of the _escaped_fragment_ query parameter. For example, https://www.kuaidaili.com/dps#!kspider is converted to https://www.kuaidaili.com/dps?_escaped_fragment_=kspider.

Since the goal of a crawler is to obtain resources, and resources are stored on some host, a crawler must have a target URL before it can fetch data. The URL is therefore the basic starting point for a crawler, and accurately understanding its meaning is very helpful when learning about crawlers.
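You can check this breakdown yourself with the standard library's urllib.parse; the URL below is assembled from the example components above purely for illustration:

```python
# Split a URL into the components described above (standard library only).
from urllib.parse import urlparse

url = "https://www.kuaidaili.com/dps/?username=kspider&type=spider#kspider"
parts = urlparse(url)

print(parts.scheme)    # https
print(parts.netloc)    # www.kuaidaili.com (host, plus userinfo/port when present)
print(parts.path)      # /dps/
print(parts.params)    # the ;parameters part (empty here)
print(parts.query)     # username=kspider&type=spider
print(parts.fragment)  # kspider
```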

4. Basic process of a crawler

  1. **Initiate a request:** send a Request to the server via the URL (just as you would open a browser and enter an address). The request can carry extra information such as headers, cookies, proxies and form data. Python provides many libraries to perform these HTTP request operations, such as urllib and requests;
  2. **Get the response content:** if the server responds normally, we receive a Response, which is the content we requested: HTML (the page source), JSON data, binary data (video, audio, images), and so on;
  3. **Parse the content:** the response content then needs to be parsed to extract the data. If it is HTML (page source), it can be parsed with a web page parser such as regular expressions (re), Beautiful Soup, pyquery or lxml; if it is JSON, it can be converted to a JSON object; if it is binary data, it can be saved to a file for further processing;
  4. **Save the data:** the data can be saved to a local file (txt, json, csv, etc.), to a database (MySQL, Redis, MongoDB, etc.), or to a remote server, for example via SFTP. A minimal end-to-end sketch of these four steps follows below.
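This sketch ties the four steps together; it assumes requests and bs4 are installed, and the URL, the extracted fields and the output file name are placeholders chosen for the example:

```python
# Request -> response -> parse -> save, in one small sketch.
import json
import requests
from bs4 import BeautifulSoup

def crawl(url):
    # 1. Initiate a request (with a browser-like User-Agent header)
    response = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=10)
    response.raise_for_status()

    # 2. Get the response content (HTML source in this case)
    html = response.text

    # 3. Parse the content and extract what we need
    soup = BeautifulSoup(html, "html.parser")
    data = {
        "url": url,
        "title": soup.title.string if soup.title else "",
        "links": [a.get("href") for a in soup.find_all("a", href=True)],
    }

    # 4. Save the data to a local JSON file
    with open("page.json", "w", encoding="utf-8") as f:
        json.dump(data, f, ensure_ascii=False, indent=2)

if __name__ == "__main__":
    crawl("https://example.com/")   # placeholder URL
```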

5. Basic architecture of a crawler

The basic architecture of a crawler consists mainly of five parts: the crawler scheduler, the URL manager, the web page downloader, the web page parser, and data storage:

  1. **Crawler scheduler:** equivalent to the CPU of a computer, it coordinates the URL manager, downloader and parser and handles communication between the modules. It can be understood as the core of the crawler, and the crawler's execution strategy is defined in this module;
  2. **URL manager:** keeps track of URLs to be crawled and URLs already crawled, to prevent repeated or circular crawling. It is commonly implemented in one of three ways: in memory, in a database, or in a cache database (see the toy sketch after this list);
  3. **Web page downloader:** downloads web pages from their URLs, applying the appropriate disguises to simulate browser access. Commonly used libraries are urllib, requests, etc.;
  4. **Web page parser:** parses the web page information and extracts useful content as required, for example by walking the DOM tree. Tools include regular expressions (re), Beautiful Soup, pyquery and lxml, which can be used flexibly as the situation demands;
  5. **Data storage:** stores, displays and otherwise processes the parsed information.
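The toy sketch below keeps the URL manager in memory and drives it with a simple scheduler loop; the class and function names are invented for the example, and real projects usually back the URL manager with a database or Redis instead:

```python
# A toy in-memory URL manager plus a scheduler loop (illustration only).
class URLManager:
    def __init__(self):
        self.to_crawl = set()   # URLs waiting to be crawled
        self.crawled = set()    # URLs already crawled, to prevent repeats and loops

    def add(self, url):
        if url and url not in self.to_crawl and url not in self.crawled:
            self.to_crawl.add(url)

    def has_next(self):
        return bool(self.to_crawl)

    def get(self):
        url = self.to_crawl.pop()
        self.crawled.add(url)
        return url


def schedule(seed_url, download, parse, store, max_pages=10):
    """The scheduler loop: take a URL, download, parse, store, add new URLs, repeat."""
    manager = URLManager()
    manager.add(seed_url)
    while manager.has_next() and len(manager.crawled) < max_pages:
        url = manager.get()
        html = download(url)                # web page downloader
        data, new_urls = parse(url, html)   # web page parser returns data and new links
        store(data)                         # data storage
        for new_url in new_urls:
            manager.add(new_url)
```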


6. Robots protocol

The robots protocol, also known as the crawler protocol or crawler rules, lets a website tell search engines which pages may and may not be crawled by providing a robots.txt file; search engines read robots.txt to decide whether a page is allowed to be crawled. However, the robots protocol is not a firewall and has no enforcement power: a search engine could ignore robots.txt entirely and still take a snapshot of a page. If you want to define the behaviour of search engine robots for individual subdirectories, you can merge those custom settings into the robots.txt in the root directory, or use robots metadata (the robots meta tag).

The robots protocol is not a formal specification, just a convention, so it does not guarantee a site's privacy; it is commonly described as a "gentleman's agreement".

The meanings of common robots.txt directives are as follows:

  • User-agent: *, where * is a wildcard matching all search engine types
  • Disallow: /admin/, prohibits crawling anything under the admin directory
  • Disallow: /require/, prohibits crawling anything under the require directory
  • Disallow: /ABC/, prohibits crawling anything under the ABC directory
  • Disallow: /cgi-bin/*.htm, prohibits access to all URLs ending in ".htm" under the /cgi-bin/ directory (including subdirectories)
  • Disallow: /*?*, prohibits access to all URLs on the site that contain a question mark (?)
  • Disallow: /.jpg$, prohibits crawling all .jpg images on the site
  • Disallow: /ab/adc.html, prohibits crawling the file adc.html under the ab folder
  • Allow: /cgi-bin/, allows crawling anything under the cgi-bin directory
  • Allow: /tmp, allows crawling the entire tmp directory
  • Allow: .htm$, allows access only to URLs ending in ".htm"
  • Allow: .gif$, allows crawling web pages and .gif images
  • Sitemap: site map, tells crawlers which page is the site map

To check a website's robots protocol, simply append robots.txt to the site's URL. Take Kuaidaili (快代理) as an example:

https://www.kuaidaili.com/robots.txt

Its robots.txt declares the following rules (a programmatic check is sketched after the list):

  • Blocks all search engines from accessing any part of the site
  • Prohibits crawling anything under the /doc/using/ directory
  • Prohibits crawling all directories and files starting with sdk under the /doc/dev/ directory
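A crawler can also check these rules programmatically with the standard library's urllib.robotparser; the user agent name and test paths below are illustrative only:

```python
# Check whether paths may be fetched, according to the site's robots.txt (a sketch).
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("https://www.kuaidaili.com/robots.txt")
rp.read()   # download and parse the robots.txt file

for path in ["/", "/doc/using/", "/doc/dev/sdk"]:          # illustrative paths
    allowed = rp.can_fetch("kspider", "https://www.kuaidaili.com" + path)
    print(path, "->", "allowed" if allowed else "disallowed")
```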
