Preface
Earlier we introduced each layer of the five-layer TCP/IP network model. Of these five layers, the application layer is the one most closely tied to us as programmers, because it is the layer we implement in our own code. We only touched on the application layer briefly before. Custom protocols are very flexible and can be changed whenever requirements change, but in practice relatively few systems use them; most rely on a ready-made application-layer protocol, and the most common one is HTTP. In this post I will share what I know about the HTTP protocol.
What is HTTP
HTTP stands for Hypertext Transfer Protocol. It is an application-layer protocol used to transmit hypertext (such as HTML) over the network, and it is one of the most widely used protocols on the Internet; documents on the World Wide Web are served according to this standard. HTTP defines how a client (a browser or another program) communicates with a web server. Hypertext is stored on web servers, and an HTTP client obtains it by sending requests to the server. HTTP works at the application layer of the TCP/IP protocol stack: the program that sends the request is called the HTTP client, and the server answers it with a response.
HTTP was born in 1991 and has since become the most mainstream application-layer protocol. It has gone through many iterations: HTTP 0.9 for personal and institutional home pages, HTTP 1.0 for portals, HTTP 1.1 for search engines and social networks, then HTTP 2.0, and now HTTP 3. HTTP 1.1 is still the version we mainly work with, so this post uses HTTP 1.1 as its example.
HTTP usually runs on top of TCP at the transport layer: HTTP 1.0, HTTP 1.1, and HTTP 2.0 all use TCP, while HTTP 3 runs over QUIC, which is built on UDP.
When we visit websites in our daily lives, we transmit data through the HTTP protocol.
When we enter the Baidu address (URL) in the browser, the browser sends an HTTP request to Baidu's server, and the Baidu server returns an HTTP response. The browser parses this response and renders it as the page we see. (The response can contain HTML, CSS, JavaScript, images, text, and other data.)
Understand HTTP request and response formats
Unlike the TCP and IP packet formats covered earlier, HTTP messages have to be analyzed as two separate kinds, requests and responses, because their formats differ.
To study the request and response formats, we first need some real HTTP requests and responses. We can capture them with Fiddler, the proxy-based HTTP packet capture tool I introduced earlier.
HTTP request format
An HTTP request is roughly divided into four parts: the first line, the request headers, a blank line, and the body.
1. First line
The first line of an HTTP request has three parts, separated by spaces.
The first part, GET, is the "method" of the request. GET is not the only method; there are others such as POST. For now it is enough to know that they exist; they are covered in detail later.
The second part is the URL (Uniform Resource Locator), which describes the location of a resource on the network. URLs are not specific to HTTP; they are used in many other places as well. A URL is made up of the following parts:
- Protocol scheme: defines the type of network service used, such as http or https.
- Login information: a username and password entered by the user. This is rarely used nowadays.
- Server address: the domain name of the website, for example www.aspxfans.com. An IP address can be used in place of the domain name.
- Server port number: the port on the host machine. The port is optional; if it is omitted, the default port is used, which is 80 for HTTP and 443 for HTTPS.
- Virtual directory: starts at the first "/" after the domain name and ends at the last "/". It is also an optional part of a URL.
  - Although it is written like a directory, the server does not necessarily store resources as a directory tree. The data may come from the hard disk, from memory, from another server reached over the network, or be computed on the fly by the CPU.
- Query string: carries parameters that pass additional information.
  - The query string begins with "?" and consists of key-value pairs. Each key is joined to its value with "=", and multiple pairs are joined with "&". The keys and values are defined by the programmer to supplement the request, and the query string is escaped with urlencode. For example, when I search for c++, the query string in the first line contains %2B%2B, which is the escaped form of ++. Special symbols like these could be confused with the URL's own delimiters, so each one is replaced by "%" followed by its hexadecimal byte value.
- Fragment identifier: points to a specific part of the web page, usually a particular piece of content or a navigation anchor on the page.
The third part in the first line is the HTTP version number.
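The URL parts listed above can be picked apart with Python's standard `urllib.parse` module. This is only a small sketch: the URL below is made up for illustration (it is not taken from the capture), and it deliberately includes every optional part.

```python
from urllib.parse import urlsplit, parse_qs, quote_plus

# Hypothetical URL, invented for this example; it contains every optional part.
url = "https://user:pass@www.example.com:8080/dir/page.html?wd=c%2B%2B&page=2#top"

parts = urlsplit(url)
print(parts.scheme)    # protocol scheme
print(parts.username)  # login information (rarely used today)
print(parts.hostname)  # server address
print(parts.port)      # server port number
print(parts.path)      # virtual directory plus file name
print(parts.query)     # raw, still-escaped query string
print(parts.fragment)  # fragment identifier

# parse_qs splits on "&" and "=", then undoes the urlencode escaping.
params = parse_qs(parts.query)
print(params)

# quote_plus applies the escaping in the other direction.
escaped = quote_plus("c++")
print(escaped)
```

Note that the fragment is handled by the browser itself and is normally not sent to the server as part of the request.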
2. Request header
The HTTP request headers are key-value pairs, and there can be many of them. Each key-value pair occupies its own line, and the key and value are separated by a colon and a space (": "). This layout is prescribed by the standard, so we have to write headers this way. The meanings of specific headers are introduced in detail later.
3. Blank line
The blank line marks the end of the request headers.
In the captured Baidu request we can see a blank line at the end; that is where the request headers end.
4. Body
The body in an HTTP packet usually refers to the message body of the request or response, which contains the actual transmitted data content. In an HTTP request, the body usually contains the data that the client wants to send to the server, such as form data, JSON data, etc. In an HTTP response, the body usually contains the data returned by the server to the client, such as HTML pages, JSON data, etc.
The HTTP body is composed of some bytes and can be any type of data, including text, binary data, etc. In the HTTP protocol, the body uses the Content-Type header to specify the type and encoding of its data. Common Content-Type types include text/html, application/json, etc.
Note that the body is optional in both requests and responses; not every HTTP message has one. If there is no body, the message simply ends after the blank line.
There is no body part in the HTTP request packet we captured here.
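To make the four parts of a request concrete, here is a minimal sketch that splits a raw request into first line, headers, and body. The request bytes and the helper name `parse_request` are invented for illustration; a real server would use a proper HTTP library instead of hand-written parsing.

```python
def parse_request(raw: bytes):
    """Split a raw HTTP/1.1 request into its four parts (illustrative only)."""
    # The first blank line (CRLF CRLF) separates the headers from the body.
    head, _, body = raw.partition(b"\r\n\r\n")
    lines = head.decode("iso-8859-1").split("\r\n")

    # First line: method, URL, version, separated by spaces.
    method, target, version = lines[0].split(" ", 2)

    # Headers: one "Key: value" pair per line.
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return method, target, version, headers, body


# Hypothetical GET request without a body, similar in shape to the capture.
raw_request = (
    b"GET /index.html?wd=http HTTP/1.1\r\n"
    b"Host: www.example.com\r\n"
    b"User-Agent: demo-client\r\n"
    b"\r\n"
)

method, target, version, headers, body = parse_request(raw_request)
print(method, target, version)
print(headers)
print(body)  # empty, because this GET request has no body
```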
HTTP response format
An HTTP response is also divided into four parts: the first line, the response headers, a blank line, and the body.
1. First line
The first line of the HTTP response message is also divided into three parts: HTTP version number, status code, and status code description.
- HTTP version number: Describes which HTTP version the HTTP response message uses.
- Status code: Describes the server's processing result of the client's request. Status codes are divided into 5 categories, each category has different meanings.
- Status code description: Status information is a simple text description corresponding to the status code. It provides the client with an understandable text description corresponding to the status code. For example, "OK" corresponds to status code 200, "Not Found" corresponds to status code 404, "Internal Server Error" corresponds to status code 500, etc.
2. Response header
The response headers are also key-value pairs. The key and value are separated by a colon and a space (": "), each pair occupies its own line, and the layout is likewise prescribed by the standard.
3. Blank line
The blank line here is also the end mark of the response header.
4. Body
The body here is similar to the body of the HTTP request message.
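The response layout can be sketched the same way. The response bytes below are made up for illustration; the reason phrases come from Python's standard `http.HTTPStatus` enum, and the first digit of the status code gives its category.

```python
from http import HTTPStatus

# Hypothetical response, built by hand for this example.
page = b"<h1>Hello</h1>"
raw_response = (
    b"HTTP/1.1 200 OK\r\n"
    b"Content-Type: text/html; charset=utf-8\r\n"
    + f"Content-Length: {len(page)}\r\n".encode("ascii")
    + b"\r\n"
    + page
)

# Blank line separates headers from body, exactly as in a request.
head, _, payload = raw_response.partition(b"\r\n\r\n")
status_line, *header_lines = head.decode("iso-8859-1").split("\r\n")

# First line: version, status code, status code description.
version, code, reason = status_line.split(" ", 2)
status = HTTPStatus(int(code))
category = int(code) // 100  # 1xx to 5xx: the five categories

print(version, status.value, status.phrase, category)
print(HTTPStatus(404).phrase, "/", HTTPStatus(500).phrase)
print(payload)
```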
First line
The method in the first line can take many different values.
Although there are many request methods, GET and POST account for the vast majority of everyday use, so this article concentrates on GET and POST.
What is the difference between GET and POST methods
GET is the most commonly used HTTP method. It is typically used to fetch a resource from the server. Entering a URL directly into the browser makes the browser send a GET request, and HTML tags such as link, img, and script also trigger GET requests. A GET request usually puts the data it passes to the server into the URL's query string. The POST method is mostly used to submit user-entered data to the server (for example, on a login page).
The POST method usually puts the data for the server into the body.
That said, a GET request is not strictly forbidden from carrying data in the body instead of the query string; it works as long as the client and the server agree on the same rules. Even so, the standard gives a GET body no defined meaning, and some servers or proxies may ignore or reject it, so it is still recommended to put GET data in the query string.
When I enter www.baidu.com in the address bar, the request that is sent is a GET request.
When I log in to Gitee, the request that is sent is a POST request.
Sometimes Fiddler does not capture anything: when I visit the same website several times, the HTTP request may not show up. Why? Because the request hit the browser's cache. When that happens, the browser does not actually send an HTTP request to the server.
The web pages a browser displays are HTML downloaded from the server. HTML can be large, so loading it over the network takes time. To speed things up, the browser keeps its own cache and stores previously loaded pages on the local hard disk. The next time you visit the site, it can read the data straight from the local disk, so that visit does not send an HTTP request to the server.
Suppose I upload a file: a POST request is sent to the server, and the body of the request looks like this.
The reason is that an image is binary data. When the binary data of an image is put into an HTTP request, it is often Base64-encoded. Base64 re-encodes binary data so that the result is plain text.
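As a small illustration of that re-encoding, the sketch below runs the first eight bytes of a PNG file (its fixed signature) through Python's standard `base64` module. The variable names are made up, and a real upload would of course encode the whole file.

```python
import base64

# The 8-byte PNG file signature: binary data that is not printable text.
image_bytes = bytes([0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A])

# After encoding, only ASCII letters, digits, "+", "/" and "=" remain.
encoded = base64.b64encode(image_bytes)
print(encoded)  # b'iVBORw0KGgo='

# Decoding on the server side recovers the original bytes exactly.
decoded = base64.b64decode(encoded)
print(decoded == image_bytes)  # True
```

The price of this is size: every 3 input bytes become 4 output characters, so the encoded data is about one third larger.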
The HTTP methods were originally designed to express different "semantics". In actual use this intent has gradually been forgotten and the methods are used much more casually, so in practice there is no essential difference between what GET and POST can do.
Some misconceptions about the difference between GET method and POST method
1. A GET request has an upper limit on how much data it can transfer, while POST has no limit.
This claim is really a historical legacy. In early browsers, hardware resources were scarce, so browsers limited the length of the URL. Because GET usually puts the data for the server into the URL's query string, a GET request carrying a lot of data produces a very long URL and could hit that limit.
In fact, the RFC standard does not specify a maximum URL length, and with today's technology URLs can be very long; a URL can even carry a small image. So as a statement about the protocol, this claim is wrong.
2. Transmitting data with GET is not secure, and transmitting it with POST is more secure.
Why do people say this? When a login uses GET, the username and password go into the URL and are displayed in the browser's address bar, where others can easily see them. POST is said to be more secure because it puts the username and password into the body, so they do not appear in the address bar. But this reasoning is wrong and only fools novices. What is security? Is it just a matter of not showing your username and password in the address bar? No. Security means that the request cannot easily be intercepted by an attacker, and that even if it is intercepted, cracking it costs the attacker more than the data is worth. Whether the request is GET or POST, once it is intercepted, both the first line and the body are easy to read.
To stop an attacker who intercepts the traffic from reading your data directly, the username and password are encrypted in transit (this is what HTTPS provides). Then, even if the request is captured, the encrypted username and password cannot easily be cracked.
3. GET can only transmit text data to the server, and POST can transmit text and binary data to the server.
Why do people say this? The claim again rests on the first misconception: it assumes that GET can only put data in the URL's query string, and data in a URL is usually text. But the body of a request can hold either text or binary data, and a GET request can carry data in the body too, so GET can also transmit binary data.
Claims about the difference between GET and POST that have some truth but are not rigorous
1. The GET method is idempotent, and the POST method is non-idempotent.
What are idempotent and non-idempotent?
Idempotent means that, in the same system and with the same parameters, one request and many identical repeated requests have the same effect on the resource. For example, if a request inserts a record only when it does not already exist, then sending it many times still leaves exactly one record, so it is idempotent.
Non-idempotent means that, in the same system and with the same parameters, one request and many identical repeated requests have different effects on the resource. For example, a user submits a request to create an order, the page hangs because of a network problem, and the user clicks the create-order button again; two identical orders are now created in the database. That is non-idempotent.
Why is GET called idempotent and POST non-idempotent? Because the RFC standard describes GET as idempotent in its intended semantics. But nothing in the protocol enforces this: it is a convention that the server-side programmer is expected to follow, not a hard constraint, so the statement is not rigorous.
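The difference can be sketched without any networking at all. In the toy example below, the in-memory `users` and `orders` stores and both handler functions are hypothetical stand-ins for server-side logic; they only show what "same effect when repeated" means.

```python
# Hypothetical in-memory "database" for this sketch.
users = {}
orders = []


def set_nickname(user_id: int, nickname: str) -> None:
    """Idempotent: repeating the call leaves the same state."""
    users[user_id] = nickname


def create_order(item: str) -> None:
    """Non-idempotent: every call adds another order."""
    orders.append(item)


# Simulate the user clicking the same button three times.
for _ in range(3):
    set_nickname(1, "alice")
    create_order("book")

print(users)        # still a single entry for user 1
print(len(orders))  # three duplicate orders
```

Whether a handler behaves like the first function or the second is decided by the server code, not by whether the request arrived as GET or POST.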
2. The GET request can be cached by the browser, but the POST method cannot be cached by the browser.
Although this statement is correct to a certain extent, it is not rigorous. This is because both GET and POST requests themselves can be cached, but the behavior and mechanisms of browsers when handling these two requests are different.
For GET requests, the browser can cache the response so that later requests for the same URL use the cached response directly instead of contacting the server again. This relies on GET being treated as idempotent: sending the same GET request several times is expected to return the same result without any additional effect on the resource.
However, for POST requests, browsers typically do not cache the response results. This is because in most cases, POST requests are used to submit data or perform update operations, and the results of each request may vary. In order to ensure that each POST request will send the latest data to the server and produce the latest results, the browser will not cache the response results of the POST request.
However, this does not mean that POST requests will never be cached. In fact, some browsers or proxy servers may cache POST requests, especially in certain circumstances, such as when using a caching proxy. In addition, some web applications may also implement caching of POST requests programmatically.
Therefore, although this statement "GET requests can be cached by the browser, and POST methods cannot be cached by the browser" reflects the browser's processing mechanism of GET and POST requests to a certain extent, it does not apply to all situations. In specific context, browser behavior, web application implementation, and other factors need to be considered to evaluate the accuracy of this statement.
3. GET requests can be bookmarked by the browser, but the data of POST requests cannot.
This statement is also imprecise. Both GET and POST requests can be saved in some way; what differs is how.
For GET requests, the browser saves the URL together with its parameters, so the user can later click the saved URL to reload the page or fetch the data again. Browser bookmarks are the usual example: when the user clicks the bookmark, the browser sends a GET request to the server, and the server returns the corresponding page. So GET requests can be bookmarked.
For POST requests, the data is carried in the request body rather than in the URL, so saving the URL does not save the data. However, users can still copy the URL and parameters of a POST request by hand, or view and copy the request in the browser's developer tools. So although the browser does not save POST data automatically, the user can still save it manually.
It should be noted that the GET request URL saved by the browser may contain sensitive information, such as authentication tokens or passwords. To protect user privacy and security, this information should be sent to the server via POST request or other secure mechanism.
Header
There are many key-value pairs in Header, but here we mainly select a few to introduce.
Host
Host: Indicates the address and port of the server host. The content here is usually reflected in the URL, but if a proxy is used, the content of the Host and the content of the URL may be different.
Content-Length
Content-Length: Indicates the length of the data in the body, in bytes.
What is Content-Length for? HTTP runs on TCP, and TCP is a byte stream, so it has the "sticky packet" problem: if several HTTP messages are sent over the same TCP connection, they sit back to back in the receive buffer, and the receiver has to work out where each one ends. A GET request usually has no body, so the blank line alone marks its end. A POST request usually has a body, so the receiver needs both the blank line and Content-Length to find the boundary between messages.
The Content-Length and Content-Type headers appear only when the request has a body.
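The boundary-finding logic described above can be sketched as follows. The function name `split_messages` and the sample byte stream are invented for this example, and the sketch ignores other framing mechanisms such as chunked transfer encoding.

```python
def split_messages(buffer: bytes):
    """Cut a receive buffer into complete HTTP messages (illustrative only).

    Returns (messages, leftover), where each message is a (head, body) pair.
    """
    messages = []
    while buffer:
        # The blank line ends the headers.
        head, sep, rest = buffer.partition(b"\r\n\r\n")
        if not sep:
            break  # headers not fully received yet

        # Content-Length says how many body bytes follow; default is 0.
        length = 0
        for line in head.split(b"\r\n")[1:]:
            name, _, value = line.partition(b":")
            if name.strip().lower() == b"content-length":
                length = int(value.strip().decode("ascii"))

        if len(rest) < length:
            break  # body not fully received yet
        messages.append((head, rest[:length]))
        buffer = rest[length:]
    return messages, buffer


# Two hypothetical requests stuck together in one TCP receive buffer.
stream = (
    b"GET /a HTTP/1.1\r\nHost: www.example.com\r\n\r\n"
    b"POST /login HTTP/1.1\r\nHost: www.example.com\r\n"
    b"Content-Length: 7\r\n\r\na=1&b=2"
)

messages, leftover = split_messages(stream)
for head, body in messages:
    print(head.split(b"\r\n")[0], body)
```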
Content-Type
Content-Type: Indicates the format of the data in the body.
The body can be in many formats:
Request:
- JSON
- form format
- form-data format
When the Content-Type is application/json, the body is in JSON format.
When the Content-Type is application/x-www-form-urlencoded, the body is in form format.
The form format is equivalent to moving the GET query string into the body.
The form-data format (multipart/form-data) is usually used when uploading files, although a file upload does not always use form-data; it may also use the form format.
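To compare the first two request formats, the sketch below encodes the same data once as JSON and once in form format, using only the standard library. The field names and values are made up for illustration; form-data is left out because it additionally needs a boundary string.

```python
import json
from urllib.parse import urlencode

# Hypothetical form fields, invented for this example.
data = {"username": "zhangsan", "language": "c++"}

# Content-Type: application/json
json_body = json.dumps(data)
print(json_body)

# Content-Type: application/x-www-form-urlencoded
# Same key=value&key=value shape as a GET query string, just placed in the body.
form_body = urlencode(data)
print(form_body)

# Content-Length must count bytes, not characters.
print(len(json_body.encode("utf-8")), len(form_body.encode("utf-8")))
```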
Response:
- HTML
- CSS
- JS
- JSON
- Images
Usually the blue packets captured in Fiddler are HTML format data packets; purple is CSS, green is JavaScript, and black is JSON.
HTML, CSS, and JS together make up the web page itself.
- HTML represents the skeleton of the page (what is on the page)
- CSS represents the style of the page (what the page looks like)
- JavaScript represents the behavior of the page
The server processes the data differently depending on the Content-Type it receives. The server must likewise state the Content-Type of the data it returns, and the browser handles the response differently depending on that Content-Type.
User-Agent (UA for short)
In the capture above, the User-Agent contains some very common fields. "Windows NT 10.0; Win64; x64" describes the host we are using, and "Chrome/118.0.0.0" is the version of the browser we are using. From these two pieces of information we can see that User-Agent carries information about the current browser and the system it runs on.
Why is UA part of an HTTP request? Long ago, when browsers were first being developed, web pages were plain text with no images, and browsers had very simple features. As browsers evolved and became able to display images and play sound, some people upgraded to the new versions. New browsers spread slowly, though, so old and new browsers coexisted. When a user visited a site, should the server return a page with images or one without? If it returned a page with images, an old browser could not display it. If it returned a page without images, what was the point of upgrading? This is where UA comes in: the server can return a different page depending on the browser version reported by the UA in the request.
Today UA is much less critical. Different browsers now offer nearly the same features, so UA is mainly used to tell whether the client is a PC or a mobile device. Even then, the page that is returned often does not depend on the UA; the UA is mostly used for statistics. Modern front-end development has "responsive web" techniques that let the same HTML adapt to different devices.
Referer
Referer describes which page the current request was navigated from. There is no Referer when the URL is typed directly into the address bar.
Referer is part of the HTTP request headers. When the browser sends a request to a web server, it usually includes the Referer to tell the server which page linked to this one, and the server can use that information in its processing.
Referer mainly has two functions:
- Anti-hotlinking: only pages on my own website may load resources from my image server. If my site's domain name is www.google.com, the image server checks the Referer of every request to see whether its domain is www.google.com. If it is, the request is served; if not, it is blocked.
- Prevent malicious requests: for some higher-risk file types, Referer can be used so that requests for such files are accepted only from websites I specify.
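A Referer check of this kind might look like the sketch below. The allowed host, the helper name `is_allowed`, and the choice to reject requests without a Referer are all assumptions made for illustration; real sites pick their own policy.

```python
from urllib.parse import urlsplit

# Hypothetical allow-list: only pages on this host may load our images.
ALLOWED_HOSTS = {"www.example.com"}


def is_allowed(headers: dict) -> bool:
    """Return True if the request's Referer points at an allowed host."""
    referer = headers.get("Referer")
    if referer is None:
        # Assumed policy for this sketch: no Referer means no access.
        return False
    return urlsplit(referer).hostname in ALLOWED_HOSTS


own_page = {"Referer": "https://www.example.com/gallery.html"}
other_site = {"Referer": "https://hotlinker.example.net/page"}
typed_url = {}  # URL entered directly, so no Referer header

print(is_allowed(own_page))
print(is_allowed(other_site))
print(is_allowed(typed_url))
```

Because Referer is just a request header, a non-browser client can set it to anything, so a check like this deters casual hotlinking rather than providing strong protection.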
Many search engines carry advertisements, and advertisers pay the search engine company for them. Some fees are based on how long the advertisement is displayed, but most are based on the number of clicks. So how does the search engine or the advertiser know which website a click came from? This is where the Referer in the HTTP request is used: it tells you which page the click came from.
But can the Referer in the request be maliciously changed to point at another company, so that the advertising fee ends up going to them? It can. Long ago, network traffic had to pass through the network operators' switches and routers, and the operators could rewrite the Referer in the traffic passing through their equipment and make themselves the beneficiaries. This was rampant at the time: the Internet was young and the relevant laws were immature, so the search engine companies could do little against the operators. They therefore turned to a technical fix and encrypted the HTTP traffic, so that network operators could no longer easily modify the Referer.
Cookie
An HTTP cookie is a small piece of data sent by the server to the user's browser and stored locally. It is usually sent to the browser by the server using the HTTP response header Set-Cookie. When the browser makes a request to the same server again, these cookies will be carried by default and sent to the server.
Information such as the last login time, the last visit time, the user's identity, and the cumulative number of visits is stored in the browser's cookies, and the cookies are sent along the next time you visit the site.
Because this data is temporary and may change at any time, the browser is the most suitable place to store it. It might seem easier to save it in an ordinary local file, but that is not possible: for security reasons, the browser forbids websites from accessing your local files directly, so web page code cannot create a file on the hard disk to store data. Cookies were introduced to allow data to be saved without giving up that security. Cookies are also stored as files on the hard disk, but the browser wraps the file, and a web page can only store key-value pairs in it.
Cookies usually come from the server, although a page can also create them itself. They are stored on the computer where the browser runs and are organized by domain name: each domain has its own cookies, and the cookies of different domains do not affect each other. The key-value pairs in a cookie are defined by the programmer. On later requests to the server, the browser automatically adds the cookie contents to the request, and the server uses them in its processing logic.
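The round trip described above can be sketched with Python's standard `http.cookies` module. The cookie names and values below (`session_id`, `visits`) are invented for illustration; real sites define their own keys.

```python
from http.cookies import SimpleCookie

# Server side: build the Set-Cookie response header.
response_cookie = SimpleCookie()
response_cookie["session_id"] = "abc123"    # hypothetical key and value
response_cookie["session_id"]["path"] = "/"
set_cookie_header = response_cookie.output(header="Set-Cookie:")
print(set_cookie_header)

# Browser side: on the next request to the same domain, the stored
# key-value pairs are sent back in a single Cookie request header.
cookie_header_value = "session_id=abc123; visits=3"

# Server side again: parse the Cookie header back into key-value pairs.
request_cookie = SimpleCookie()
request_cookie.load(cookie_header_value)
values = {name: morsel.value for name, morsel in request_cookie.items()}
print(values)
```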