HTTP: A quick guide to understanding the HTTP protocol


At its core, a crawler works by simulating the HTTP requests a browser sends, so understanding the HTTP protocol is an essential foundation for writing crawlers. Job postings for crawler developers also routinely ask for mastery of the HTTP protocol specification, so learning to write crawlers has to start with HTTP.

What is the HTTP protocol?

Every web page you browse is delivered over HTTP. HTTP is an application-layer protocol of the Internet for data communication between a client (such as a browser) and a server. The protocol specifies the format in which the client must send a request to the server, and it also specifies the format of the response the server returns.

As long as requests are sent and responses are returned in the manner the protocol prescribes, anyone can implement their own web client (browser, crawler) or web server (Nginx, Apache, etc.) on top of HTTP.

The HTTP protocol itself is very simple. It specifies that only the client can initiate a request, and that the server returns a response after receiving and processing it. HTTP is also a stateless protocol: the protocol itself keeps no record of a client's earlier requests.


So how does HTTP define the request and response formats? In other words, in what format must a client send an HTTP request for it to be valid, and in what format must the server return a response for the client to parse it correctly?

HTTP request

An HTTP request consists of three parts: the request line, the request headers, and the request body. The headers and the body are optional in the sense that not every request needs them (although HTTP/1.1 does require a Host header).

[Figure: http-request.jpg]

Request line

The request line is a mandatory part of every request. It consists of three parts: the request method (Method), the request URL (URI), and the HTTP protocol version, separated by spaces.

The most commonly used HTTP request methods are GET, POST, PUT, and DELETE. GET is used to obtain resources from the server, and roughly 90% of crawlers fetch their data with GET requests.

The request URL is the path of the resource on the server. In the example in the diagram, the client wants to obtain the resource index.html, which sits directly under the root path (/) of the server foofish.net.
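As a minimal sketch of the three-part structure described above, a request line can be split on its spaces. The helper name `parse_request_line` is hypothetical and exists only for this illustration; real servers validate the line much more strictly.

```python
def parse_request_line(line: bytes):
    """Split a request line into (method, target, version)."""
    # A request line has exactly three space-separated parts.
    method, target, version = line.decode("ascii").strip().split(" ")
    return method, target, version


method, target, version = parse_request_line(b"GET /index.html HTTP/1.1")
print(method, target, version)
```

If the line does not contain exactly three parts, the unpacking raises a `ValueError`, which is a reasonable signal for a malformed request in a sketch like this.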

Request header

The request line carries very little information, so the many other things a client wants to tell the server have to go in the request headers (Header). Request headers give the server additional information. For example, User-Agent identifies the client, letting the server know whether the request comes from a browser or a crawler, and whether from Chrome or Firefox. HTTP/1.1 defines 47 header field types. The format of an HTTP header field resembles an entry in a Python dictionary: a key and a value separated by a colon. For example:

User-Agent: Mozilla/5.0 

The data (the message) a client sends is just a string of characters, so a blank line is used to separate the end of the request headers from the start of the request body. When a blank line is encountered, the headers are over and the body begins.
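The dictionary analogy and the blank-line rule can be sketched as follows. The raw request here is hand-written for illustration (it reuses the foofish.net host from the diagram), and the parsing is deliberately simplified: it does not handle repeated header names or other edge cases.

```python
raw = (
    b"GET /index.html HTTP/1.1\r\n"
    b"Host: foofish.net\r\n"
    b"User-Agent: Mozilla/5.0\r\n"
    b"\r\n"
)

# Everything before the first blank line (CRLF CRLF) is the request line
# plus the headers; everything after it is the body.
head, _, body = raw.partition(b"\r\n\r\n")
request_line, *header_lines = head.decode("ascii").split("\r\n")

headers = {}
for line in header_lines:
    # Split on the first colon only, since values may contain colons.
    name, _, value = line.partition(":")
    headers[name.strip()] = value.strip()

print(request_line)
print(headers)
```

This request has no body, so `body` is empty after the split.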

Request body

The request body is the actual content the client submits to the server, such as the username and password sent when logging in, the data of a file upload, or form data such as user registration information.
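To make this concrete, here is a rough sketch of how a form submission could be assembled into a request message. The path `/login`, the host `example.com`, and the form fields are all made up for illustration; nothing is sent over the network.

```python
from urllib.parse import urlencode

# Hypothetical login form fields, for illustration only.
form = {"username": "alice", "password": "secret"}
body = urlencode(form).encode("ascii")

request = b"\r\n".join([
    b"POST /login HTTP/1.1",
    b"Host: example.com",
    b"Content-Type: application/x-www-form-urlencoded",
    # Content-Length tells the server how many body bytes follow.
    b"Content-Length: " + str(len(body)).encode("ascii"),
    b"",  # blank line: end of the headers
    body,
])
print(request.decode("ascii"))
```

The Content-Length header is what lets the server know where the body ends, since the body itself has no terminator.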

Now let's use Python's lowest-level networking API, the socket module, to simulate sending an HTTP request to a server:

import socket

with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
    # 1. Connect to the server
    s.connect(("www.seriot.ch", 80))
    # 2. Build the request line, requesting the resource index.php
    request_line = b"GET /index.php HTTP/1.1"
    # 3. Build the request headers, specifying the host name
    headers = b"Host: seriot.ch"
    # 4. A blank line marks the end of the request headers
    blank_line = b"\r\n"

    # Join the request line, headers, and blank line with CRLF
    # to form the request message, then send it to the server
    message = b"\r\n".join([request_line, headers, blank_line])
    s.sendall(message)

    # The response returned by the server, analyzed later on
    response = s.recv(1024)
    print(response)
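The example above reads at most 1024 bytes with a single recv call. One possible way to read a whole response is to keep calling recv until the peer closes the connection, as sketched below. The helper `recv_all` is a hypothetical name, and the approach rests on an assumption: HTTP/1.1 servers keep connections open by default, so against a real server this loop only finishes if the request carries a `Connection: close` header or the server closes the connection on its own. The demonstration uses a local socket pair rather than a real server.

```python
import socket


def recv_all(sock: socket.socket, bufsize: int = 1024) -> bytes:
    """Read from sock until the peer closes the connection."""
    chunks = []
    while True:
        chunk = sock.recv(bufsize)
        if not chunk:  # empty bytes means the peer closed its side
            break
        chunks.append(chunk)
    return b"".join(chunks)


# Local demonstration with a connected socket pair instead of a real server.
left, right = socket.socketpair()
with left, right:
    payload = b"x" * 3000
    left.sendall(payload)
    left.shutdown(socket.SHUT_WR)  # signal that no more data will follow
    data = recv_all(right)
print(len(data))
```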

HTTP response

After receiving and processing the request, the server returns the response content to the client. Likewise, the response must follow a fixed format so that the browser can parse it correctly. An HTTP response also consists of three parts: the response line, the response headers, and the response body, mirroring the format of an HTTP request.

[Figure: http-response.jpg]

Response line

The response line also has three parts: the HTTP protocol version supported by the server, a status code, and a short reason phrase describing the status code.

The status code is a very important field in the response line. From the status code, the client can tell whether the server processed the request normally. A status code of 200 means the request was processed successfully; 500 means an error occurred on the server while processing the request; 404 means the requested resource was not found on the server. HTTP defines many other status codes as well, but they are beyond the scope of this article.
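As an illustrative sketch, the response line can be split into its three parts, and Python's standard `http.HTTPStatus` enum can look up the reason phrase for a code. The helper `parse_status_line` is a hypothetical name used only here.

```python
from http import HTTPStatus


def parse_status_line(line: bytes):
    """Split a response line into (version, status code, reason phrase)."""
    # The reason phrase may contain spaces, so split at most twice.
    version, code, reason = line.decode("ascii").strip().split(" ", 2)
    return version, int(code), reason


version, code, reason = parse_status_line(b"HTTP/1.1 404 Not Found")
print(version, code, reason)

# The standard library knows the registered codes and their phrases.
print(HTTPStatus(500).phrase)

# Codes in the 200 to 299 range indicate success.
succeeded = 200 <= code < 300
print(succeeded)
```

For a crawler, a check like `succeeded` is usually the first decision made after a response arrives: parse the body, retry, or skip the URL.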

Response header

Response headers are similar to request headers and carry supplementary information about the response. They can tell the client, for example, what type of data the response body contains, when the response was returned, whether the response body is compressed, and when the response body was last modified.

Response body

The response body (body) is the actual content returned by the server. It can be an HTML page, a picture, a video, and so on.

Continuing with the previous example, let's look at the response the server returned. Because only the first 1024 bytes were received, part of the response content is not shown.

b'HTTP/1.1 200 OK\r\n
Date: Tue, 04 Apr 2017 16:22:35 GMT\r\n
Server: Apache\r\n
Expires: Thu, 19 Nov 1981 08:52:00 GMT\r\n
Set-Cookie: PHPSESSID=66bea0a1f7cb572584745f9ce6984b7e; path=/\r\n
Transfer-Encoding: chunked\r\n
Content-Type: text/html; charset=UTF-8\r\n\r\n118d\r\n
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">\n\n
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">\n
<head>\n\t
<meta http-equiv="Content-Type" content="text/html;charset=iso-8859-1" />\n\t
<meta http-equiv="content-language" content="en" />\n\t
... </html>

As the result shows, the response matches the format laid out in the protocol. The first line is the response line, with status code 200 indicating that the request succeeded. The second part is the response headers, made up of multiple header fields, including the time the server returned the response, cookie information, and so on. The third part is the actual response body: the HTML text.
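The same three-part split can be done in code. The sketch below uses a shortened, simplified copy of the response shown above: several headers are dropped, and the `Transfer-Encoding: chunked` header and chunk-size prefix are left out so that the body is plain HTML. The one-line body is a stand-in, not the real page.

```python
# A shortened, simplified copy of the response shown above.
response = (
    b"HTTP/1.1 200 OK\r\n"
    b"Date: Tue, 04 Apr 2017 16:22:35 GMT\r\n"
    b"Server: Apache\r\n"
    b"Content-Type: text/html; charset=UTF-8\r\n"
    b"\r\n"
    b"<!DOCTYPE html>"
)

# The first blank line separates the response line and headers from the body.
head, _, body = response.partition(b"\r\n\r\n")
status_line, *header_lines = head.split(b"\r\n")
# Split each header on the first ": " only, since values may contain colons.
headers = dict(line.split(b": ", 1) for line in header_lines)

print(status_line)
print(headers[b"Server"])
print(body)
```

A real chunked response would need the chunk sizes decoded before the body is usable, which is one reason crawlers normally rely on an HTTP library rather than raw sockets.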

By now you should have a general understanding of the HTTP protocol. A crawler essentially simulates a browser sending HTTP requests, so understanding HTTP is necessary if you want to go deep in the crawler field.

Of course, there is far more to HTTP than this, and no single article could cover it all. This one is only meant as a starting point. If you want to learn more, you can follow the public account "Zen of Python" for recommended further reading.

Origin www.cnblogs.com/yinguo/p/11222315.html