03 - Basic principles of Python crawlers

A crawler is a program that simulates a user's actions in a browser or an application and automates them.

When we enter a URL in the browser and press Enter, what happens in the background? Take http://www.sina.com.cn/ as an example.

 

Simply put, the following four steps take place in this process:

  • Find the IP address corresponding to the domain name.

  • Send a request to the server corresponding to the IP.

  • The server responds to the request and sends back the content of the web page.

  • The browser parses the content of the web page.

 

The essence of web crawlers

In essence, a crawler issues the same HTTP requests that a browser does.

Browsers and web crawlers are two different web clients, but both fetch web pages in the same way.

Put simply, a web crawler implements the fetching part of a browser: given a URL, it returns the data directly to the user, who no longer has to operate the browser manually, step by step.
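As a minimal sketch of that idea, the following uses only the Python standard library to request a page and pull out its title. The names fetch, extract_title, and TitleParser, as well as the User-Agent string, are made up for this illustration; real crawlers usually add error handling, retries, and politeness delays.

```python
from html.parser import HTMLParser
from urllib.request import Request, urlopen


class TitleParser(HTMLParser):
    """Collects the text found inside <title> elements."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def fetch(url, timeout=10):
    """Send a GET request and return the decoded page body."""
    # The User-Agent value is an arbitrary string chosen for this example.
    request = Request(url, headers={"User-Agent": "demo-crawler/0.1"})
    with urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def extract_title(html):
    """Parse HTML text and return the page title without extra whitespace."""
    parser = TitleParser()
    parser.feed(html)
    return parser.title.strip()
```

Calling extract_title(fetch("http://www.sina.com.cn/")) would then do in one line what the user otherwise does by hand in the browser, assuming the site is reachable.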

 

How does the browser send and receive this data?

Introduction to HTTP

HTTP (HyperText Transfer Protocol) was designed to provide a way to publish and receive HTML (HyperText Markup Language) pages.

The protocol layer where HTTP is located (for background understanding)

HTTP runs on top of TCP. In the TCP/IP reference model, HTTP is an application-layer protocol. The default port number is 80 for HTTP and 443 for HTTPS.

 

HTTP working process

An HTTP operation is called a transaction, and the whole process is as follows:

1) Address resolution

Suppose the client browser requests this page: http://localhost.com:8080/index.htm

The protocol name, host name, port, and object path are extracted from it. For our address, the result is: protocol name http, host name localhost.com, port 8080, object path /index.htm

In this step, DNS (the Domain Name System) is needed to resolve the domain name localhost.com to the IP address of the host.
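The same decomposition can be sketched in Python with urllib.parse and socket from the standard library. The helper names split_url and resolve are invented for this example, and falling back to port 80 or 443 when the URL has no port is an assumption that mirrors the default ports mentioned earlier.

```python
import socket
from urllib.parse import urlsplit


def split_url(url):
    """Break a URL into the parts a client needs before connecting."""
    parts = urlsplit(url)
    # Fall back to the scheme's default port when the URL omits one.
    default_port = 443 if parts.scheme == "https" else 80
    return {
        "protocol": parts.scheme,
        "host": parts.hostname,
        "port": parts.port or default_port,
        "path": parts.path or "/",
    }


def resolve(host, port):
    """Ask the system resolver (DNS, hosts file) for the host's IPv4 addresses."""
    infos = socket.getaddrinfo(host, port, family=socket.AF_INET,
                               type=socket.SOCK_STREAM)
    # Each entry ends with an (ip, port) pair; keep the unique IPs in order.
    return list(dict.fromkeys(info[4][0] for info in infos))
```

For the address above, split_url returns the protocol http, the host localhost.com, the port 8080, and the path /index.htm. What resolve returns for a real domain depends on the DNS answer at the time of the call.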

 

2) Encapsulate the HTTP request packet

The parts above are combined with information about the client machine itself and encapsulated into an HTTP request packet.
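To make the "packet" less abstract, here is a rough sketch of what a minimal GET request looks like as bytes. The function name build_request and the chosen header values are assumptions for illustration; a real browser sends many more headers.

```python
def build_request(host, port, path, user_agent="demo-crawler/0.1"):
    """Assemble the text of a minimal HTTP/1.1 GET request as bytes."""
    lines = [
        f"GET {path} HTTP/1.1",        # request line: method, path, version
        f"Host: {host}:{port}",        # which site on the server we want
        f"User-Agent: {user_agent}",   # information about the client itself
        "Accept: text/html",
        "Connection: close",           # ask the server to close when done
        "",                            # blank line ends the header section
        "",
    ]
    return "\r\n".join(lines).encode("ascii")
```

For example, build_request("localhost.com", 8080, "/index.htm") produces a request whose first line is GET /index.htm HTTP/1.1 and which ends with the blank line that marks the end of the headers.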

 

3) Encapsulate it into a TCP packet and establish a TCP connection (TCP three-way handshake)

Before HTTP can do any work, the client (the web browser) must first establish a connection to the server over the network. That connection is made with TCP, which together with IP forms the famous TCP/IP protocol suite on which the Internet is built. This is why the Internet is also called a TCP/IP network.

HTTP is an application-layer protocol that sits above TCP. A higher-level protocol can only communicate once the lower-level one is in place, so a TCP connection must be established first. By default, an HTTP connection uses TCP port 80; in this example it is port 8080.

 

4) The client sends the request command

After the connection is established, the client sends a request to the server. The request starts with a request line containing the method, the Uniform Resource Identifier (URI), and the protocol version number, followed by MIME-like header information including request modifiers and client information, and then the content, if any.
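Steps 3 and 4 can be sketched together with a raw socket. The function name send_request is made up for this example. It reads until the server closes the connection, so it assumes the request asks for that (HTTP/1.0, or a Connection: close header); otherwise the loop would wait until the timeout.

```python
import socket


def send_request(host, port, request, timeout=10):
    """Open a TCP connection, send raw request bytes, return the raw reply."""
    # create_connection performs the TCP three-way handshake for us.
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(request)
        chunks = []
        while True:
            chunk = sock.recv(4096)
            if not chunk:      # empty read: the server closed the connection
                break
            chunks.append(chunk)
    return b"".join(chunks)
```

The returned bytes contain everything the server sent back, headers and body together, exactly as they travelled over the TCP connection.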

 

5) Server response

After receiving the request, the server sends back a response. It starts with a status line containing the protocol version number and a success or error code, followed by MIME-like header information including server information and entity information, and then possibly the content itself.

 
  1. Entity information: after the server has sent the header information to the browser, it sends a blank line to indicate that the headers end there. It then sends the actual data requested by the user, in the format described by the Content-Type response header.
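The response layout described here (status line, headers, blank line, body) can be taken apart with a few lines of Python. This is only a sketch: parse_response is a hypothetical helper, and it ignores details such as chunked transfer encoding and repeated header names.

```python
def parse_response(raw):
    """Split raw HTTP response bytes into status, headers and body."""
    # The first blank line (CRLF CRLF) separates the headers from the body.
    head, _, body = raw.partition(b"\r\n\r\n")
    status_line, *header_lines = head.decode("iso-8859-1").split("\r\n")
    # Status line: protocol version, status code, optional reason phrase.
    version, code, *rest = status_line.split(" ", 2)
    reason = rest[0] if rest else ""
    headers = {}
    for line in header_lines:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return version, int(code), reason, headers, body
```

The body is returned as bytes because how to decode it depends on the Content-Type header, which is available in the returned headers dictionary under the key content-type.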

 

6) The server closes the TCP connection

Traditionally (as in HTTP/1.0), once the web server has sent the requested data to the browser, it closes the TCP connection. However, if the browser or the server adds this line to its headers:

Connection: keep-alive

The TCP connection then remains open after the response is sent, so the browser can keep sending requests over the same connection. A persistent connection saves the time needed to set up a new connection for each request and also saves network bandwidth. In HTTP/1.1, connections are persistent by default unless one side sends Connection: close.
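A small sketch of connection reuse with the standard http.client module follows. The function name fetch_many is invented for this example. Whether the connection really stays open depends on the server; if the server closes it anyway, http.client quietly reconnects for the next request.

```python
import http.client


def fetch_many(host, port, paths, timeout=10):
    """Request several paths over one persistent TCP connection."""
    conn = http.client.HTTPConnection(host, port, timeout=timeout)
    results = []
    try:
        for path in paths:
            # The header is sent explicitly only to mirror the text above.
            conn.request("GET", path, headers={"Connection": "keep-alive"})
            response = conn.getresponse()
            # The body must be read fully before the socket can be reused.
            results.append((response.status, response.read()))
    finally:
        conn.close()
    return results
```

A crawler that downloads many pages from the same site benefits most from this, because the TCP handshake is paid once instead of once per page.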

 

 

HTTPS

HTTPS (full name: Hypertext Transfer Protocol over Secure Socket Layer) is an HTTP channel designed for security; simply put, it is the secure version of HTTP. An SSL layer is added beneath HTTP, and SSL is the foundation of HTTPS security. The port number used is 443.

SSL (Secure Sockets Layer) is a secure transport protocol designed by Netscape and widely used on the web; its modern successor is TLS (Transport Layer Security). Certificate authentication is used to ensure that the data exchanged between the client and the website's server is encrypted and secure.
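In Python, this layer is available through the standard ssl module, which in current versions negotiates TLS. The sketch below shows where certificate checking happens; make_tls_context and tls_handshake_info are names made up for this example, and what the second function reports depends on the server contacted.

```python
import socket
import ssl


def make_tls_context():
    """Build a client-side context that verifies server certificates."""
    # create_default_context loads the system CA certificates and enables
    # both certificate verification and hostname checking.
    return ssl.create_default_context()


def tls_handshake_info(host, port=443, timeout=10):
    """Connect, complete the handshake, and report what was negotiated."""
    context = make_tls_context()
    with socket.create_connection((host, port), timeout=timeout) as raw_sock:
        # server_hostname is used for SNI and for checking the certificate.
        with context.wrap_socket(raw_sock, server_hostname=host) as tls_sock:
            return {
                "protocol": tls_sock.version(),
                "cipher": tls_sock.cipher()[0],
            }
```

If the server's certificate cannot be verified, wrap_socket raises an ssl.SSLCertVerificationError instead of returning, which is the certificate authentication step in action.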

There are two basic types of encryption and decryption algorithms:

1) Symmetric encryption: there is only one key; the same key is used for encryption and decryption, and both operations are fast. Typical symmetric encryption algorithms include DES, 3DES, AES, and RC5.

The main problem with symmetric encryption is sharing the secret key. Unless your computer (the client) knows the secret key used by the other computer (the server), it cannot encrypt or decrypt the communication stream. Asymmetric keys solve this problem.

2) Asymmetric encryption: two keys are used, a public key and a private key. The private key is kept secret by one party (usually the server), while anyone on the other side can obtain the public key.

The keys come in pairs, and the private key cannot feasibly be derived from the public key. Different keys are used for encryption and decryption: data encrypted with the public key can only be decrypted with the private key, and data encrypted with the private key can only be decrypted with the public key. Asymmetric encryption is slower than symmetric encryption. Typical asymmetric algorithms include RSA and DSA (DSA is used only for digital signatures).

Advantages of HTTPS communication:
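The difference between the two kinds can be seen in a toy example. The code below is for illustration only and is not secure: xor_cipher stands in for a real symmetric algorithm such as AES, and the RSA key pair is built from two tiny primes (61 and 53) chosen for readability. All function names are invented for this sketch.

```python
def xor_cipher(data, key):
    """Toy symmetric cipher: the same key both encrypts and decrypts."""
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))


def make_toy_rsa_keys(p=61, q=53, e=17):
    """Build a tiny textbook-RSA key pair from two small primes."""
    n = p * q
    phi = (p - 1) * (q - 1)
    d = pow(e, -1, phi)        # modular inverse, needs Python 3.8+
    return (e, n), (d, n)      # (public key, private key)


def rsa_apply(key, number):
    """Encrypt or decrypt one integer smaller than n with the given key."""
    exponent, n = key
    return pow(number, exponent, n)
```

With public, private = make_toy_rsa_keys(), a number passed through rsa_apply with the public key is recovered by applying the private key, and the other way round. With xor_cipher, applying the same key a second time restores the original bytes.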

  • The key generated by the client can only be obtained by the client and the server;

  • Only the client and server can get the plaintext of encrypted data;

  • The communication from the client to the server is secure.


Origin blog.csdn.net/bigzql/article/details/108685512