Invincible Python crawler tutorial study notes (3)

Series Article Directory

Invincible Python crawler tutorial study notes (1)
Invincible Python crawler tutorial study notes (2)
Invincible Python crawler tutorial study notes (3)
Invincible Python crawler tutorial study notes (4)



Foreword

This note walks through the web request process to show how a data request is sent and how its response comes back.
It also introduces the HTTP protocol.


Analysis of the web request process

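# 1. Server-side rendering: the server merges the requested data into the HTML and returns the finished page to the browser.
#    The data is visible in the page source.
# 2. Client-side rendering: the first request returns only an HTML skeleton; a second request fetches the data, which the browser then merges into the page and displays.
#    The data is not visible in the page source.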




# Get comfortable with the browser's built-in packet capture tool (F12).

There are two rendering modes. To crawl data we have to imitate the browser, so the first step is to work out how the page delivers its data: simply put, whether the data is already in the returned page. Get comfortable with the browser's built-in developer tools (press F12 or choose Inspect) and use them to find the request that carries the data you need.
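A quick way to tell the two modes apart is to check whether text you can see in the browser also appears in the raw page source. The sketch below is only an illustration: the two HTML strings are made up, and `is_server_rendered` is a hypothetical helper name, not part of any library.

```python
def is_server_rendered(page_source: str, expected_text: str) -> bool:
    """Return True if text shown in the browser is already in the raw HTML."""
    return expected_text in page_source


# Made-up page sources, for illustration only.
server_rendered = "<html><body><ul><li>Movie A</li><li>Movie B</li></ul></body></html>"
client_rendered = "<html><body><div id='app'></div><script src='app.js'></script></body></html>"

# Data is in the source: one request is enough.
print(is_server_rendered(server_rendered, "Movie A"))
# Data is missing: look for the second request in the F12 network panel.
print(is_server_rendered(client_rendered, "Movie A"))
```

When the check fails for a real page, the data usually arrives through a separate request that the browser's network panel will list.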

HTTP protocol

The Hypertext Transfer Protocol (HTTP) is a simple request-response protocol that typically runs on top of TCP. It specifies which messages the client may send to the server and what kind of response it gets back. The headers of request and response messages are given in ASCII form; the message content has a MIME-like format. This simple model was instrumental in the early success of the Web because it made development and deployment very straightforward.

Introduction

The World Wide Web (WWW) originated at CERN, the European particle physics laboratory near Geneva. WWW technology is what allowed the Internet to grow faster than anyone imagined. Built on TCP/IP, it became the largest information system on the Internet within about ten years, although the Internet itself had been developing for decades. Its success is attributed to its simplicity and practicality. Behind the WWW stands a family of protocols and standards that make it work. This is the Web protocol family, which includes HTTP, the Hypertext Transfer Protocol.
Around 1990, HTTP became the protocol underpinning the WWW. It was proposed by Tim Berners-Lee, the inventor of the Web. The World Wide Web Consortium (W3C) was founded later, and it worked with the IETF (Internet Engineering Task Force) to refine and publish HTTP.

HTTP is an application layer protocol. Like other application layer protocols, it serves one specific kind of application, and its functions are carried out by programs running in user space. HTTP itself is only a specification: it is written down in documents, and the actual communication is done by programs that implement it.
HTTP communicates using the B/S (browser/server) architecture. Server-side implementations include httpd and nginx. Client-side implementations are mainly web browsers such as Firefox, Internet Explorer, Google Chrome, Safari, and Opera; command-line clients include elinks and curl. Web services run over TCP, so to respond to client requests at any time the web server listens on port 80/TCP by default. This is what lets the client browser and the web server communicate over HTTP.

Working principle

HTTP is based on the client/server model and is connection-oriented. A typical HTTP transaction goes through the following steps:
(1) The client establishes a connection with the server;
(2) The client sends a request to the server;
(3) The server accepts the request and returns the corresponding file as a response;
(4) The client and the server close the connection.
In early versions of HTTP, the connection between the client and the server is one-shot: each connection handles only one request. Once the server has returned the response, the connection is closed immediately, and a new connection is established for the next request. The reasoning is that a WWW server faces thousands of users on the Internet but can only offer a limited number of connections, so it does not keep a connection waiting; releasing connections promptly greatly improves the server's efficiency.
HTTP is a stateless protocol: the server keeps no state about its transactions with the client. This greatly reduces the server's memory burden and helps keep responses fast. HTTP can carry data objects of any type. It identifies the content and size of the transmitted data by data type and length, and allows the data to be compressed in transit. When the user follows a hypertext link in an HTML document, the browser establishes a connection with the specified server over TCP/IP.
Later versions of HTTP support persistent connections. In HTTP/0.9 and 1.0, the connection is closed after a single request/response pair. HTTP/1.1 made the keep-alive mechanism the default, so one connection can be reused for multiple requests. Persistent connections can significantly reduce request latency, because the client does not have to repeat the TCP three-way handshake after the first request. Another positive side effect is that, in general, a connection gets faster over time thanks to TCP's slow-start mechanism.
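The difference between one-shot and persistent connections can be observed locally. The following is a minimal sketch, assuming a throwaway server on 127.0.0.1 built with the standard library; `CountingHandler` and `count_connections` are hypothetical names introduced here. The server counts how many TCP connections it accepts while the client sends three GET requests.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class CountingHandler(BaseHTTPRequestHandler):
    protocol_version = "HTTP/1.1"  # lets http.server keep connections alive
    connections = 0                # incremented once per accepted TCP connection

    def setup(self):
        super().setup()
        CountingHandler.connections += 1

    def do_GET(self):
        body = b"ok"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the demo output quiet


def count_connections(reuse: bool, requests: int = 3) -> int:
    """Send several GET requests; return how many TCP connections the server saw."""
    CountingHandler.connections = 0
    server = HTTPServer(("127.0.0.1", 0), CountingHandler)  # port 0 = any free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    host, port = server.server_address
    try:
        if reuse:
            # Persistent style: one connection carries every request.
            conn = http.client.HTTPConnection(host, port)
            for _ in range(requests):
                conn.request("GET", "/")
                conn.getresponse().read()  # finish reading before the next request
            conn.close()
        else:
            # One-shot style: a fresh connection for each request.
            for _ in range(requests):
                conn = http.client.HTTPConnection(host, port)
                conn.request("GET", "/", headers={"Connection": "close"})
                conn.getresponse().read()
                conn.close()
    finally:
        server.shutdown()
        server.server_close()
    return CountingHandler.connections
```

Calling `count_connections(reuse=True)` should report a single connection, while `count_connections(reuse=False)` should report one connection per request.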
Version 1.1 of the protocol also includes bandwidth optimizations over HTTP/1.0. For example, HTTP/1.1 introduced chunked transfer encoding, which allows content on persistent connections to be streamed rather than buffered. HTTP pipelining further reduces latency by letting clients send several requests without waiting for each response. Another addition is byte serving, in which the server transmits only the portion of the resource that the client explicitly requests.
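To make chunked transfer encoding concrete: each chunk is a hexadecimal size line, then that many bytes of data, and a chunk of size zero ends the body. The decoder below is a simplified sketch with a hypothetical name, `decode_chunked`; it ignores chunk extensions and trailers and assumes the whole body is already in memory.

```python
def decode_chunked(raw: bytes) -> bytes:
    """Decode a chunked HTTP body (chunk extensions and trailers are ignored)."""
    body = b""
    pos = 0
    while True:
        line_end = raw.index(b"\r\n", pos)
        size_field = raw[pos:line_end].split(b";", 1)[0]  # drop any chunk extension
        size = int(size_field, 16)                        # chunk sizes are hexadecimal
        if size == 0:
            break                                         # zero-size chunk ends the body
        start = line_end + 2
        body += raw[start:start + size]
        pos = start + size + 2                            # skip the CRLF after the data
    return body


sample = b"4\r\nWiki\r\n5\r\npedia\r\n0\r\n\r\n"
print(decode_chunked(sample))  # b'Wikipedia'
```

Libraries such as `http.client` handle this decoding automatically, so a crawler rarely needs to do it by hand.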
Technically the client opens a socket on a specific TCP port (usually port number 80). If the server has been listening for a connection on this well-known port, the connection will be established. The client then sends a request block containing the request method over the connection.
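The socket-level exchange can be sketched with the standard library alone. This example assumes a throwaway local server on a free port instead of a real site on port 80; `HelloHandler`, `fetch_raw`, and `demo` are hypothetical names used only here.

```python
import socket
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class HelloHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        body = b"hello"
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the demo output quiet


def fetch_raw(host: str, port: int, path: str = "/") -> bytes:
    """Open a TCP socket, send one GET request, and return the raw response bytes."""
    request = (
        f"GET {path} HTTP/1.1\r\n"
        f"Host: {host}:{port}\r\n"
        "Connection: close\r\n"
        "\r\n"
    ).encode("ascii")
    with socket.create_connection((host, port), timeout=5) as sock:  # open the connection
        sock.sendall(request)                                        # send the request block
        chunks = []
        while True:
            data = sock.recv(4096)                                   # read the response block
            if not data:
                break                                                # server closed its side
            chunks.append(data)
    return b"".join(chunks)                                          # socket is closed here


def demo() -> bytes:
    """Start a local server on a free port, fetch "/" once, and shut the server down."""
    server = HTTPServer(("127.0.0.1", 0), HelloHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        return fetch_raw(*server.server_address)
    finally:
        server.shutdown()
        server.server_close()
```

Printing `demo().decode()` shows the status line, the response headers, a blank line, and then the body.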
Counting PATCH, HTTP defines 9 request methods (GET, HEAD, POST, PUT, DELETE, CONNECT, OPTIONS, TRACE, and PATCH), each of which specifies a different way of exchanging information between the client and the server. The most commonly used are GET and POST. The server performs the operation the client asked for, returns the result to the client as a response block, and finally closes the connection.

Status messages

1xx: Informational. The request was received and processing continues.
2xx: Success. The request was received, understood, and accepted.
3xx: Redirection. Further action is needed to complete the request.
4xx: Client error. The request is malformed or cannot be fulfilled.
5xx: Server error. The server failed to fulfil a valid request.
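The first digit of a status code gives its class, so a crawler can sort responses with integer division. The sketch below uses the standard library's `http.HTTPStatus` for the reason phrases; `CLASSES` and `describe` are hypothetical names introduced for this example.

```python
from http import HTTPStatus

# Status class keyed by the first digit of the code.
CLASSES = {
    1: "informational",
    2: "success",
    3: "redirection",
    4: "client error",
    5: "server error",
}


def describe(code: int) -> str:
    """Return the code, its standard reason phrase, and its class."""
    category = CLASSES.get(code // 100, "unknown")
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:
        phrase = "unregistered code"  # not one of the codes HTTPStatus knows
    return f"{code} {phrase} ({category})"


for code in (200, 301, 404, 500):
    print(describe(code))
```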

HTTP request

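1. Request line -> request method, request URL, protocol version
2. Request headers -> additional information for the server to use
3. Request body -> usually holds the request parameters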
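The three parts can be seen by assembling a request message by hand. This is a sketch only: `build_request` is a hypothetical helper, and the host, path, and parameter are placeholders. Real code would normally leave this to a library such as `urllib` or `requests`.

```python
def build_request(method: str, path: str, headers: dict, body: bytes = b"") -> bytes:
    """Assemble request line + headers + blank line + body into one message."""
    headers = dict(headers)
    if body:
        headers["Content-Length"] = str(len(body))      # tells the server where the body ends
    lines = [f"{method} {path} HTTP/1.1"]               # 1. request line
    lines += [f"{name}: {value}" for name, value in headers.items()]  # 2. request headers
    head = "\r\n".join(lines) + "\r\n\r\n"              # a blank line ends the headers
    return head.encode("ascii") + body                  # 3. request body


raw = build_request(
    "POST",
    "/search",
    {"Host": "example.com", "Content-Type": "application/x-www-form-urlencoded"},
    b"q=movie",
)
print(raw.decode("ascii"))
```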

HTTP response

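1. Status line -> protocol version, status code
2. Response headers -> additional information for the client to use
3. Response body -> the content the client actually needs (HTML, JSON, etc.)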
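Going the other way, a raw response can be split back into those three parts. The parser below is a simplified sketch with a hypothetical name, `parse_response`; it assumes the full message is in memory, that header names are not repeated, and that the body is not chunked. The sample message is made up.

```python
def parse_response(raw: bytes):
    """Split a raw HTTP response into status line fields, headers, and body."""
    head, _, body = raw.partition(b"\r\n\r\n")          # a blank line separates head and body
    status_line, *header_lines = head.decode("iso-8859-1").split("\r\n")
    parts = status_line.split(" ", 2)
    version, code = parts[0], int(parts[1])
    reason = parts[2] if len(parts) > 2 else ""         # the reason phrase may be missing
    headers = {}
    for line in header_lines:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()   # header names are case-insensitive
    return version, code, reason, headers, body


sample = (
    b"HTTP/1.1 200 OK\r\n"
    b"Content-Type: application/json\r\n"
    b"Content-Length: 16\r\n"
    b"\r\n"
    b'{"name": "demo"}'
)
version, code, reason, headers, body = parse_response(sample)
print(version, code, reason)
print(headers)
print(body)
```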

What crawlers need to pay attention to

The most common important items in the request headers:

  1. User-Agent: identifies the request carrier (what kind of client sent the request)
  2. Referer: anti-hotlinking (which page this request came from; used by anti-crawling checks)
  3. Cookie: locally stored string data (user login information, anti-crawling tokens)
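These three headers can be attached to a request with the standard library's `urllib.request.Request`. This is a minimal sketch: the URLs, the User-Agent string, and the cookie value are placeholders, and `build_crawler_request` is a hypothetical helper. Nothing is sent until the request is passed to `urllib.request.urlopen`.

```python
from urllib.request import Request


def build_crawler_request(url: str) -> Request:
    """Build a GET request that carries browser-like headers (placeholder values)."""
    headers = {
        # Pretend to be an ordinary desktop browser.
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                      "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
        # The page this request claims to come from.
        "Referer": "https://example.com/list",
        # Login or anti-crawling state copied from a real browser session.
        "Cookie": "sessionid=placeholder-value",
    }
    return Request(url, headers=headers)


req = build_crawler_request("https://example.com/detail/1")
for name, value in req.header_items():
    print(f"{name}: {value}")
```

Note that `urllib` normalizes header names with `str.capitalize()`, so the first header is stored and printed as `User-agent`.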

The important items in the response headers:

  1. Cookie (sent in the Set-Cookie header): locally stored string data (user login information, anti-crawling tokens)
  2. Various odd-looking, unexplained strings (spotting them takes experience; they are usually named something like token and guard against attacks and crawling)
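Cookies arrive in `Set-Cookie` response headers and have to be sent back in a `Cookie` request header. The sketch below parses two made-up `Set-Cookie` values with the standard library's `http.cookies.SimpleCookie`; `cookies_from_set_cookie` is a hypothetical helper, and the cookie names and values are placeholders.

```python
from http.cookies import SimpleCookie


def cookies_from_set_cookie(header_values):
    """Collect cookie names and values from a list of Set-Cookie header values."""
    jar = {}
    for value in header_values:
        cookie = SimpleCookie()
        cookie.load(value)                  # parses "name=value; attribute=..."
        for name, morsel in cookie.items():
            jar[name] = morsel.value        # keep only the value, drop the attributes
    return jar


# Made-up Set-Cookie values, for illustration only.
set_cookie_headers = [
    "sessionid=abc123; Path=/",
    "csrftoken=xyz789; Path=/",
]
cookies = cookies_from_set_cookie(set_cookie_headers)

# The value to put in the Cookie header of the next request.
cookie_header = "; ".join(f"{name}={value}" for name, value in cookies.items())
print(cookie_header)
```

This drops attributes such as `Path`; a real crawler would more likely rely on `http.cookiejar` or a session object to track them.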

Request method

GET
explicit submission (the parameters are visible in the URL)
POST
implicit submission (the parameters travel in the request body)
The two need to be distinguished. For crawlers, however, the difference is small; interested readers can look into the details on their own.
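The difference shows up in where the parameters go. In this sketch, which uses the standard library's `urllib` and a placeholder URL and parameters, the GET request puts the encoded parameters in the URL, while the POST request puts the same string in the body. No request is actually sent.

```python
from urllib.parse import urlencode
from urllib.request import Request

params = {"q": "python crawler", "page": 2}  # placeholder parameters
encoded = urlencode(params)

# GET: parameters are part of the URL.
get_req = Request("https://example.com/search?" + encoded)

# POST: parameters go in the body; passing data makes urllib use POST.
post_req = Request("https://example.com/search", data=encoded.encode("utf-8"))

print(get_req.get_method(), get_req.full_url)
print(post_req.get_method(), post_req.full_url, post_req.data)
```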


Origin blog.csdn.net/qq_53571321/article/details/123097367