Crawler basics: the HTTP protocol and the Chrome packet capture tool

What are the HTTP and HTTPS protocols

  1. HTTP protocol: The full name is HyperText Transfer Protocol. It is a protocol for publishing and receiving HTML pages. The default server port is 80.
  2. HTTPS protocol: The encrypted version of HTTP, in which an SSL/TLS layer is added beneath HTTP. The default server port is 443.
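
Python's standard library records these default ports as constants. A minimal sketch, in which `default_port` is a hypothetical helper named only for illustration:

```python
import http.client
from urllib.parse import urlsplit


def default_port(url: str) -> int:
    """Return the explicit port in the URL, or the default port of its scheme."""
    parts = urlsplit(url)
    if parts.port is not None:
        return parts.port
    # http.client defines HTTP_PORT (80) and HTTPS_PORT (443).
    defaults = {"http": http.client.HTTP_PORT, "https": http.client.HTTPS_PORT}
    return defaults[parts.scheme]


print(default_port("http://www.baidu.com"))    # 80
print(default_port("https://www.baidu.com"))   # 443
print(default_port("http://localhost:8080/"))  # 8080
```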

How the browser sends an HTTP request

  1. When the user enters a URL in the browser address bar and presses Enter, the browser sends an HTTP request to the HTTP server. The two request methods used most often are GET and POST.
  2. When we enter the URL http://www.baidu.com in the browser, the browser sends a request for the HTML file of http://www.baidu.com, and the server sends the response back to the browser.
  3. The browser parses the HTML in the response and finds that it references many other files, such as image files, CSS files, and JS files. The browser automatically sends further requests to fetch those images, CSS files, and JS files.
  4. Once all files have been downloaded, the web page is rendered in full according to the structure of the HTML.
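
Step 3 can be sketched with the standard library alone. The example below does not touch the network: `SAMPLE_HTML` is a made-up response body, the example.com URLs are placeholders, and `ResourceCollector` is a hypothetical class that lists the extra files a browser would go on to request.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ResourceCollector(HTMLParser):
    """Collect URLs of images, stylesheets and scripts referenced by a page."""

    def __init__(self, base_url: str):
        super().__init__()
        self.base_url = base_url
        self.resources = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in ("img", "script") and attrs.get("src"):
            # Resolve relative paths against the page URL, as a browser does.
            self.resources.append(urljoin(self.base_url, attrs["src"]))
        elif tag == "link" and attrs.get("rel") == "stylesheet" and attrs.get("href"):
            self.resources.append(urljoin(self.base_url, attrs["href"]))


# Hypothetical response body standing in for the HTML returned in step 2.
SAMPLE_HTML = """
<html><head>
<link rel="stylesheet" href="/static/site.css">
<script src="js/app.js"></script>
</head><body><img src="https://img.example.com/logo.png"></body></html>
"""

collector = ResourceCollector("http://www.example.com/index.html")
collector.feed(SAMPLE_HTML)
for url in collector.resources:
    print(url)
```

A crawler usually only needs the HTML itself, which is why it can skip these follow-up requests unless the data it wants lives in one of the referenced files.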

URL parsing

URL is the abbreviation of Uniform Resource Locator.
A URL consists of the following parts:

scheme://host:port/path/?query-string=xxx#anchor

  • scheme: the access protocol, usually http or https, sometimes ftp and others.
  • host: the host name or domain name, such as www.baidu.com
  • port: the port number. When you visit a website over http, the browser uses port 80 by default (443 for https).
  • path: the path of the resource. For example, in https://item.jd.com/40468351063.html the trailing /40468351063.html is the path.
  • query-string: the query string, that is, everything after the ? in https://www.baidu.com/s?ie=utf-8&f=8&rsv_bp=0&rsv_idx=1&tn=baidu&wd=python&rsv_pq=ada012bc00018917&rsv_t=2637WdXSzepoAL7f7r0vjB58CANn8ZHzUiloyQQrE%2Bf3FlZeufHmXuo2vNw&rqlang=cn&rsv_enter=1&rsv_sug3=7&rsv_sug1=7&rsv_sug7=101&rsv_sug2=0&inputT=2136&rsv_sug4=3071&rsv_sug=1. Here wd=python is one of its parameters.
  • anchor: the anchor point. The back end generally ignores it; the front end uses it to position the page.

When you request a URL in the browser, the browser encodes it. Apart from English letters, digits and a few symbols, every other character is encoded as a percent sign followed by its hexadecimal code value.
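
The parts listed above, and the percent-encoding rule, can be checked with `urllib.parse`. This is a small sketch: the URL is a shortened version of the Baidu example, and the explicit `:443` port and `#results` anchor are added here purely for illustration.

```python
from urllib.parse import urlsplit, parse_qs, quote, unquote

url = "https://www.baidu.com:443/s?ie=utf-8&wd=python#results"
parts = urlsplit(url)
print(parts.scheme)    # https
print(parts.hostname)  # www.baidu.com
print(parts.port)      # 443
print(parts.path)      # /s
print(parts.query)     # ie=utf-8&wd=python
print(parts.fragment)  # results

# Split the query string into individual parameters.
params = parse_qs(parts.query)
print(params["wd"])    # ['python']

# Non-ASCII characters become "%" plus the hex value of each UTF-8 byte.
encoded = quote("爬虫")
print(encoded)           # %E7%88%AC%E8%99%AB
print(unquote(encoded))  # 爬虫
```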

Commonly used request methods

  1. GET: Normally used when the request only fetches data from the server and has no effect on server resources.
  2. POST: Used to send data to the server (logging in, for example) or to upload data, that is, whenever the request affects server resources.

These are the two methods most commonly used in website development, and sites generally follow the usage rules above. However, as an anti-crawler measure, some websites and servers deliberately break with convention: a request that would normally use GET may have to be sent as POST. Handle this case by case.
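
The difference between the two methods can be sketched with `urllib.request.Request`, which builds a request without sending it. The login URL and form fields below are placeholders, not a real endpoint.

```python
from urllib.parse import urlencode
from urllib.request import Request

# GET: parameters travel in the query string of the URL.
query = urlencode({"wd": "python"})
get_request = Request("https://www.baidu.com/s?" + query)
print(get_request.get_method())  # GET

# POST: parameters travel in the request body, encoded as bytes.
# Placeholder URL and credentials, for illustration only.
form = urlencode({"username": "alice", "password": "secret"}).encode("utf-8")
post_request = Request("https://example.com/login", data=form)
print(post_request.get_method())  # POST
print(post_request.data)          # b'username=alice&password=secret'
```

Passing either object to `urllib.request.urlopen` would actually send it. If a site insists on POST where GET would be expected, supplying `data` (or `method="POST"`) is the only change needed.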

[Image: example of a request whose request method is limited to POST]

Common request parameters

In the HTTP protocol, a request can carry data to the server in three places: in the URL, in the body (for a POST request), and in the headers. The request headers used most often in web crawlers are:

  1. User-Agent: The browser name. This is used constantly in web crawlers. When a page is requested, the server uses this header to tell which browser sent the request. A crawler written in Python sends a library-specific value by default (for example Python-urllib or python-requests), so a website with an anti-crawler mechanism can easily tell that the request comes from a crawler. We therefore usually set this header to the value sent by a real browser to disguise our crawler.
  2. Referer: Indicates which URL the current request came from. This is also commonly used as an anti-crawler measure: if the request does not come from the expected page, the server does not return the normal response.
  3. Cookie: The HTTP protocol is stateless, that is, when the same person sends two requests, the server cannot tell that both came from the same person. Cookies are used as the identifier. In general, to crawl pages that can only be accessed after logging in, you need to send the cookie information.
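
A minimal sketch of setting these three headers with `urllib.request.Request`. The URLs and the cookie value are placeholders, and the User-Agent string is just one example of a desktop Chrome value.

```python
from urllib.request import Request

headers = {
    # Example browser string; copy the current one from your own browser.
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
    ),
    # Pretend the request was triggered from the list page.
    "Referer": "https://www.example.com/list",
    # Placeholder session cookie copied from a logged-in browser session.
    "Cookie": "sessionid=PLACEHOLDER",
}
request = Request("https://www.example.com/detail/1", headers=headers)

# urllib stores header names capitalized as "User-agent", so look them up that way.
print(request.get_header("User-agent"))
print(request.get_header("Referer"))
print(request.get_header("Cookie"))
```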

Common response status codes

Class | Description
1** | Informational. The server has received the request and the requester should continue the operation.
2** | Success. The request was received and processed successfully.
3** | Redirect. Further action is required to complete the request.
4** | Client error. The request contains a syntax error or cannot be fulfilled.
5** | Server error. An error occurred while the server was processing the request.
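
Because the class is simply the first digit, it can be derived by integer division. A small sketch, where `STATUS_CLASSES` and `status_class` are hypothetical names chosen for this example:

```python
from http import HTTPStatus

STATUS_CLASSES = {
    1: "Informational",
    2: "Success",
    3: "Redirect",
    4: "Client error",
    5: "Server error",
}


def status_class(code: int) -> str:
    """Map a status code to its class using the leading digit."""
    return STATUS_CLASSES.get(code // 100, "Unknown")


print(status_class(200))                   # Success
print(status_class(HTTPStatus.NOT_FOUND))  # Client error
print(status_class(503))                   # Server error
```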

Specific status codes

Status code | Name | Meaning
100 | Continue | The client should continue sending the request.
101 | Switching Protocols | The server switches protocols at the client's request. It may only switch to a more advanced protocol, for example a newer version of HTTP.
200 | OK | The request succeeded. Generally used for GET and POST requests. Note that a site may return 200 and still send back fake data when the User-Agent is not one it accepts.
201 | Created | The request succeeded and a new resource was created.
202 | Accepted | The request has been accepted but processing has not finished; it will be carried out asynchronously.
203 | Non-Authoritative Information | The request succeeded, but the returned meta information comes from a copy rather than from the origin server.
204 | No Content | The server processed the request successfully but returned no content. The browser keeps displaying the current document without updating the page.
205 | Reset Content | The server processed the request successfully and the user agent (e.g. the browser) should reset the document view. This code can be used to clear the browser's form fields.
206 | Partial Content | The server successfully processed a partial (range) GET request.
300 | Multiple Choices | The requested resource is available at multiple locations; the response may return a list of resource characteristics and addresses for the user agent (e.g. the browser) to choose from.
301 | Moved Permanently | The requested resource has been permanently moved to a new URI. The response includes the new URI and the browser is redirected to it automatically. All future requests should use the new URI.
302 | Found | Temporary move. Similar to 301, but the resource has only moved temporarily. Clients should continue to use the original URI.
303 | See Other | The response can be found at another URI, which the client should retrieve with a GET request.
304 | Not Modified | The requested resource has not been modified, so the server returns no resource body. Clients that cache resources typically send a header stating that they want the resource only if it was modified after a given date.
305 | Use Proxy | The requested resource must be accessed through a proxy.
306 | Unused | A deprecated status code that is no longer used.
307 | Temporary Redirect | Temporary redirection. Similar to 302, but the client must repeat the request with the same method.
400 | Bad Request | The client request has a syntax error and the server cannot understand it.
401 | Unauthorized | The request requires user authentication and did not include valid authentication information.
402 | Payment Required | Reserved for future use.
403 | Forbidden | The server understands the client's request but refuses to carry it out.
404 | Not Found | The server cannot find the requested resource (web page). Website designers can use this code to show a custom "The resource you requested cannot be found" page.
405 | Method Not Allowed | The method used in the client request is not allowed.
406 | Not Acceptable | The server cannot produce a response matching the content characteristics the client asked for.
407 | Proxy Authentication Required | Similar to 401, but the requester must authenticate with the proxy.
408 | Request Time-out | The server waited too long for the client to send the request and timed out.
409 | Conflict | The server encountered a conflict while processing the request. It may return this code when handling a client's PUT request.
410 | Gone | The resource requested by the client no longer exists. Unlike 404, 410 indicates that the resource has been permanently deleted. The website designer can point to a new location for the resource with a 301 code instead.
411 | Length Required | The server cannot process the request because it was sent without a Content-Length header.
412 | Precondition Failed | A precondition given in the client's request headers was not met.
413 | Request Entity Too Large | The request is rejected because the request entity is too large for the server to handle. The server may close the connection to stop the client from continuing. If the condition is temporary, the response includes a Retry-After header.
414 | Request-URI Too Large | The requested URI (usually a web address) is too long for the server to handle.
415 | Unsupported Media Type | The server cannot handle the media format attached to the request.
416 | Requested range not satisfiable | The client requested an invalid range.
417 | Expectation Failed | The server cannot satisfy the Expect request header.
500 | Internal Server Error | The server hit an internal error and could not complete the request.
501 | Not Implemented | The server does not support the requested functionality and cannot complete the request.
502 | Bad Gateway | A server acting as a gateway or proxy received an invalid response from the upstream server.
503 | Service Unavailable | The server is temporarily unable to process the request because of overload or maintenance. The length of the delay may be given in the server's Retry-After header.
504 | Gateway Time-out | A server acting as a gateway or proxy did not receive a response from the upstream server in time.
505 | HTTP Version not supported | The server does not support the HTTP protocol version used in the request and cannot complete it.

The status codes encountered most often are:

  1. 200: The request was normal and the server returned data normally.
  2. 301: Permanent redirect. For example, visiting www.jingdong.com redirects to www.jd.com.
  3. 302: Temporary redirect. For example, accessing a page that requires login while not logged in redirects to the login page.
  4. 404: The requested URL cannot be found on the server.
  5. 403: The server denies access because of insufficient permissions.
  6. 500: Internal server error. The server may have a bug.
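
One way a crawler might react to these codes is sketched below. `next_action` and the action names it returns are hypothetical; the policy (parse on 200, follow redirects, skip client errors, retry server errors) is an assumption for illustration, not a rule from the HTTP specification.

```python
from http import HTTPStatus


def next_action(status, location=None):
    """Decide what a crawler does next, given a status code and Location header."""
    if status == HTTPStatus.OK:
        # A 200 can still carry fake data, so validate the body after parsing.
        return "parse"
    if status in (HTTPStatus.MOVED_PERMANENTLY, HTTPStatus.FOUND) and location:
        return "follow " + location
    if 400 <= status < 500:
        return "skip"
    if 500 <= status < 600:
        return "retry"
    return "inspect"


print(HTTPStatus(404).phrase)                   # Not Found
print(next_action(200))                         # parse
print(next_action(301, "https://www.jd.com/"))  # follow https://www.jd.com/
print(next_action(403))                         # skip
print(next_action(500))                         # retry
```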

Chrome packet capture tool

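Right-click on the web page and choose "Inspect" to open the developer tools, then switch to the Network panel to see the requests the page sends.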
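
Request headers shown in the developer tools can be reused in a crawler. The sketch below assumes the headers were copied as plain "Name: value" lines; `parse_raw_headers` is a hypothetical helper and the header values are placeholders.

```python
def parse_raw_headers(raw: str) -> dict:
    """Turn header lines copied from the browser into a dict."""
    headers = {}
    for line in raw.strip().splitlines():
        line = line.strip()
        # Skip blank lines and HTTP/2 pseudo-headers such as ":authority".
        if not line or line.startswith(":"):
            continue
        # Split on the first colon only; values such as URLs contain colons too.
        name, _, value = line.partition(":")
        headers[name.strip()] = value.strip()
    return headers


RAW = """
:authority: www.example.com
Accept: text/html
Referer: https://www.example.com/
User-Agent: Mozilla/5.0 (X11; Linux x86_64)
"""

for name, value in parse_raw_headers(RAW).items():
    print(name, "=", value)
```

The resulting dict can be passed as the `headers` argument of `urllib.request.Request`.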
Some of the material in this article is reprinted; it is kept here as study notes.

Origin blog.csdn.net/Pang_ling/article/details/105419614