Crawler Basics: Notes, Part 2

1. HTTP response

As mentioned before, a URL request is issued over HTTP. Now let's look at what the server sends back after receiving that request. The response is the server's feedback to the request, and it consists of a response status code, response headers and a response body.
Common response status codes include: 200 for success, 400 for a malformed request, 401 for unauthorized, 403 for forbidden, 404 for resource not found, and 504 for gateway timeout. For the full list, see the HTTP response status code reference on the Rookie Tutorial site.
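As a small illustration, here is a minimal sketch using the requests library; the URL is only a placeholder:

```python
import requests

# A minimal sketch: issue a GET request and inspect the status code.
# The URL is just a placeholder for illustration.
response = requests.get("https://www.baidu.com")

if response.status_code == 200:
    print("OK:", response.status_code)              # 200: success
elif response.status_code == 404:
    print("Resource not found:", response.status_code)
else:
    print("Other status:", response.status_code)    # e.g. 400, 401, 403, 504
```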

Response header

The response headers contain important information from the server; Content-Type, Server and Set-Cookie are the key ones (a small sketch follows the list):
Content-Type, the document type, specifies the type of the returned data: text/html means an HTML document, application/x-javascript means a JavaScript script, image/jpeg means a JPEG image;
Content-Encoding, specifies the encoding of the response content;
Date, the time the response was generated;
Expires, specifies when the response expires; before that time the content can be served from a cache, so repeated requests from the same client get faster responses and shorter load times;
Last-Modified, specifies when the resource was last modified;
Set-Cookie, sets a cookie; the server uses it to tell the browser to store this content in a cookie and send it back as identification with the next request.
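A minimal sketch that prints a few of these headers, again with requests; which headers actually appear depends on the server:

```python
import requests

# A minimal sketch: inspect a few response headers of the example site
# used in this article. Missing headers simply come back as None.
response = requests.get("https://www.baidu.com")

print(response.headers.get("Content-Type"))   # e.g. text/html
print(response.headers.get("Server"))         # web server software
print(response.headers.get("Date"))           # when the response was generated
print(response.headers.get("Set-Cookie"))     # cookies the server asks us to store
```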

Response body

This is the important payload. When a crawler requests a web page, what it parses is the response body: the HTML text, the binary data of an image, and so on. The HTML code and JSON data inside it are what our crawler is after, although they still need further processing before we can output the information we want. In the www.baidu.com request entry opened earlier, click Response to see the page source code, while Preview shows the rendered form of the response body, as follows:
(Screenshot: the Response and Preview tabs for the www.baidu.com request in the browser developer tools.)
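A minimal sketch of reading the response body with requests; response.text gives the decoded HTML and response.content the raw bytes:

```python
import requests

# A minimal sketch: the response body as decoded text and as raw bytes.
response = requests.get("https://www.baidu.com")
response.encoding = response.apparent_encoding  # guard against garbled Chinese text

html_text = response.text      # decoded HTML source, what the Response tab shows
raw_bytes = response.content   # raw bytes, e.g. for images or other binary data

print(html_text[:200])         # first 200 characters of the page source
```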

2. HTTP/2

HTTP is an application-layer protocol implemented on top of TCP; for a simple hands-on implementation, see Chapter 24, "Making an HTTP Server", in Yin Shengyu's book. TCP itself has a flow control mechanism: the Advertised Window field in the TCP header tells the other side how much buffer space the local machine still has. Many people call this value the window, and the mechanism is known as sliding window flow control.

HTTP/2 flow control

In HTTP/1.x, to make multiple requests in parallel and improve performance, the client opens multiple TCP connections, and browsers limit a single domain to roughly 6-8 connections. In HTTP/2 this multiplexing is built on binary framing, so multiple TCP connections are no longer needed for parallelism: a single TCP connection to a domain can carry many requests and responses in parallel. What impact does that have?

It means TCP-level flow control is no longer precise enough, since it can only throttle the connection as a whole rather than the individual streams inside it. Without going into the internals, HTTP/2's approach is to provide simple building blocks that let both sides implement their own flow control at the stream level and at the connection level.
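The requests library does not speak HTTP/2; a commonly used option is the httpx library, which supports it when installed with its http2 extra. A minimal sketch, assuming that package is available:

```python
import httpx

# A minimal sketch, assuming httpx is installed with its HTTP/2 extra:
#   pip install "httpx[http2]"
with httpx.Client(http2=True) as client:
    response = client.get("https://www.example.com")
    print(response.http_version)   # e.g. "HTTP/2" if the server negotiated it
    print(response.status_code)
```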

Server push

The previous section was about the client issuing multiple parallel requests; this one is about the server returning multiple responses to a single client. That is, besides answering the client's original request, the server can push additional resources (it feels a bit like advertising?) without the client having to request them explicitly.

In addition, the server can push in advance content that the client is certain to request, and the client has the right to refuse it, for example because that content is already cached in the browser. Also, the resources pushed proactively must be ones both server and client have confirmed; third-party resources cannot be pushed casually.

3. Web page composition

For a website, the visible part usually consists of three things: HTML + CSS + JavaScript. HTML is a language that describes page components with tags; being able to combine different tags into a good-looking, usable page is a skill expected of front-end programmers and, these days, of some back-end programmers too. CSS stands for Cascading Style Sheets; the fonts, colors, sizes, element spacing and layout of a page are all "styles". JavaScript is a scripting language that has long been popular (though I haven't found time to learn it properly); it mainly implements the interaction between the user and the page, turning what we call static pages into dynamic ones.
The Rookie Tutorial's web page building guide covers the tags used in HTML. You can see that the content of a web page is an HTML node tree made up of different tags: the root node is the html tag, below it are the head and body tags, and the layers beneath those hold the resources we care about.
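A minimal sketch of walking such a node tree with Beautiful Soup; the HTML snippet is made up for illustration:

```python
from bs4 import BeautifulSoup

# A made-up HTML snippet, just to show the node tree: html is the root,
# with head and body below it, and the content nested further down.
html = """
<html>
  <head><title>Demo page</title></head>
  <body>
    <div class="content"><p>Hello, crawler</p></div>
  </body>
</html>
"""

soup = BeautifulSoup(html, "html.parser")
print(soup.html.name)                       # html, the root node
print(soup.head.title.string)               # Demo page
print(soup.body.div.p.string)               # Hello, crawler
print(soup.find("div", class_="content"))   # locate a node by tag and attribute
```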

4. Basic overview of crawlers

To summarize the above knowledge, a crawler is an automated program that simulates a browser to make a request, then parses the information in the response, and stores or displays relevant useful information.

Once that is clear, the first step in writing a crawler is fetching the page. We mainly use libraries such as urllib and requests to build the request sent to the site's server, then take the response body out of the response object the library gives us. After extracting the body we need to parse it: the general-purpose approach is to write regular expressions tailored to each site's different rules, and there are also libraries that extract information based on the attributes of the page's nodes, commonly pyquery and Beautiful Soup. Finally, the extracted information has to be stored or processed further: saved as txt, Excel or JSON, or written to a database.
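Putting those steps together, a minimal sketch: fetch a page with requests, pull out the title with a regular expression, and save the result as JSON. The URL and output file name are placeholders:

```python
import json
import re

import requests

# A minimal sketch of the whole pipeline: fetch, parse, store.
response = requests.get("https://www.baidu.com")
response.encoding = response.apparent_encoding
html = response.text

# Parse: extract the <title> text with a regular expression.
match = re.search(r"<title>(.*?)</title>", html, re.S)
title = match.group(1).strip() if match else ""

# Store: write the result to a JSON file.
with open("result.json", "w", encoding="utf-8") as f:
    json.dump({"url": response.url, "title": title}, f, ensure_ascii=False)
```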

Login verification

Besides the pages we can visit directly, many pages require logging in first. This involves a kind of credential, the joint product of the session and the cookie. Sessions and cookies are the technique used to maintain state across an HTTP conversation: HTTP itself is stateless, so the client sends a request, the server responds, the client sends another request and the server responds again without remembering anything about the earlier exchange; if the client needs information from before, it has to send it again. The session lives on the server side and the cookie on the client side. The browser automatically attaches the cookie when it revisits the same site, the server uses the cookie to identify the client and its login state, and then returns the corresponding response. In a crawler, therefore, we can put the cookie obtained after a successful login into the request headers, so that we do not have to log in again for every request; a minimal sketch of this follows, after which we will look at the concepts of session and cookie and their important attributes in more detail.
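A minimal sketch of both approaches with requests; the URLs, form fields and cookie string are placeholders, not a real site's API:

```python
import requests

# 1. Reuse a cookie copied from the browser after a successful login,
#    by putting it into the request headers. The cookie value is made up.
headers = {
    "Cookie": "sessionid=xxxx; other=yyyy",   # placeholder, not a real cookie
    "User-Agent": "Mozilla/5.0",
}
resp = requests.get("https://example.com/profile", headers=headers)

# 2. Let requests.Session keep cookies automatically across requests,
#    so the login state set by the first request carries over to the second.
session = requests.Session()
session.post("https://example.com/login", data={"user": "u", "password": "p"})
resp = session.get("https://example.com/profile")
print(resp.status_code)
```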

Session: in object-oriented terms, a session object stores the attributes and configuration needed for a particular user's visit, so that information such as login state is not lost as the user moves between pages. When a user sends a request, a session object is created if one does not exist yet, and it is destroyed when it expires.

Cookie: data stored on the client for tracking, used to identify the session. When the user makes the first request, the server's response carries a Set-Cookie field that tags the user; the client saves the cookie and submits it along with the next request. The server checks the session ID in the cookie to confirm the identity, determines the user's state, and returns the corresponding response; if the check fails, the user has to log in and be verified again.
(Screenshot: the Cookies panel for Baidu in the browser developer tools.)
Above is Baidu's cookie panel; each cookie entry has the following attributes (a small parsing sketch follows the list):
name, the cookie's name, which cannot be changed after creation;
value, the cookie's value, which involves encoding: Unicode values need character encoding, and binary values need Base64 encoding;
domain, the domain that may access the cookie;
path, the path under which the cookie can be used, i.e. pages under this path can access the cookie; if it is /, that is the root, and every page under the domain can access it;
max-age, the expiration time, measured in seconds;
http, the cookie's HttpOnly flag; if it is true, the cookie can only be carried in HTTP headers and cannot be read through document.cookie (if it is false, script access is allowed);
secure, whether the cookie may only be transmitted over a secure, encrypted protocol such as HTTPS/SSL.
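These attributes can also be inspected in code with Python's standard http.cookies module; a minimal sketch with a made-up Set-Cookie value:

```python
from http.cookies import SimpleCookie

# A minimal sketch: parse a made-up Set-Cookie header value and read its attributes.
cookie = SimpleCookie()
cookie.load("BAIDUID=abc123; Domain=.baidu.com; Path=/; Max-Age=31536000; HttpOnly; Secure")

morsel = cookie["BAIDUID"]
print(morsel.value)          # abc123
print(morsel["domain"])      # .baidu.com
print(morsel["path"])        # /
print(morsel["max-age"])     # 31536000
print(morsel["httponly"])    # True, because the HttpOnly flag is present
print(morsel["secure"])      # True, because the Secure flag is present
```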

A misconception worth recording: if we close the browser without logging out, the server does not destroy the session object; only the cookie stored locally may be cleared, so when we reopen the browser and visit the same page we have to log in again. Closing the browser therefore does not mean the session has been cleared; the session is only destroyed when the client logs out or when the max-age time expires. In both cases the visible behavior is the same, having to log in again, which is why the same symptom arising from different causes is so easy to mix up.

5. About proxies

Crawlers started out as just a convenient tool for finding information and collecting data, but if abused they can do real harm. Many websites therefore audit requests that arrive too frequently, and if a client is judged to be a crawler it gets a 403 Forbidden status code back; this is one of the anti-crawler measures sites use. Our counter-measure is to disguise our IP so the site does not recognize the requests as coming from this machine, and one way to do that is to use a proxy.

What is a proxy
A proxy is a proxy server: it fetches network resources on behalf of a network user, rather like a relay station. When we send a request it goes to the proxy server, the proxy server makes the request, receives the response and forwards it back to our machine, so our IP is masked. While crawling we keep switching proxies to maintain the disguise, so that our local IP does not get blocked by the server. (The annoying part is that the free proxies used for practice are usually already expired, so it seems there are other ways to achieve the same effect.)
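A minimal sketch of sending a request through an HTTP proxy with requests; the proxy address is a placeholder and would need to be replaced with a live one:

```python
import requests

# A minimal sketch: route requests through an HTTP proxy.
# The proxy address is a placeholder; free proxies found online expire quickly.
proxies = {
    "http": "http://127.0.0.1:8888",
    "https": "http://127.0.0.1:8888",
}
response = requests.get("https://httpbin.org/ip", proxies=proxies, timeout=10)
print(response.text)   # the origin IP seen by the server: the proxy's, not ours
```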

Proxy classification

By protocol, proxies can be divided into:
FTP proxy, which mainly accesses FTP servers, supports uploading, downloading and caching, and uses ports such as 21 and 2121;
HTTP proxy, which mainly accesses web pages, supports content filtering and caching, and uses ports 80 and 8080;
RTSP proxy, mainly used by RealPlayer to access Real streaming media servers, with caching, on port 554;
SSL/TLS proxy, used to access encrypted websites, with SSL/TLS encryption, commonly on port 443;
Telnet proxy, used for Telnet remote control, on port 23 (be careful with this one);
SOCKS proxy, which simply forwards data packets, is very fast, supports caching, and uses port 1080 (see the sketch after this list).
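For the SOCKS case mentioned above, requests can also go through a SOCKS proxy once the socks extra (PySocks) is installed; a minimal sketch with a placeholder address:

```python
import requests

# A minimal sketch, assuming the SOCKS extra is installed:
#   pip install "requests[socks]"
proxies = {
    "http": "socks5://127.0.0.1:1080",
    "https": "socks5://127.0.0.1:1080",
}
response = requests.get("https://httpbin.org/ip", proxies=proxies, timeout=10)
print(response.text)
```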

By degree of anonymity:
high-anonymity proxy, which forwards the data packets essentially unchanged, so the target server sees an ordinary client with the proxy's IP;
ordinary anonymous proxy, which makes some changes to the packets, so the server may detect that a proxy is being used;
transparent proxy, which not only changes the packets but also tells the server the client's real IP; it is used for caching to improve speed and for content filtering to improve security.

When choosing a proxy, use a high-anonymity one if you can, and a paid one if you can afford it. There are also options such as ADSL dial-up worth learning about (this is roughly the written advice of Cui Qingcai).

Finally, a word to anyone reading this: crawlers are sharp tools, but they are double-edged swords that can easily hurt others and ourselves. Keep a sense of proportion, and don't let crawler-oriented programming turn into prison-oriented programming. Be careful. Amitabha.

Origin blog.csdn.net/weixin_44948269/article/details/121899829