"Want to learn crawler must-see series" master http and https

Knowledge points

  • Master the concepts and default ports of http and https

  • Grasp the request headers and response headers that the crawler pays attention to

  • Understand common response status codes

  • Understand the difference between browser and crawler crawling


When the HTTP protocol is mentioned, everyone remembers that it is an application-layer protocol. So what does the HTTP protocol have to do with crawlers? Please see the picture below:

1. The concept and difference between http and https

HTTPS is more secure than HTTP, but has lower performance

  • HTTP: Hypertext Transfer Protocol, the default port number is 80

    • Hypertext: refers to more than text, not limited to text; it also includes pictures, audio, video and other files

    • Transfer protocol: a commonly agreed, fixed format for transferring the hypertext content after it has been converted into strings

  • HTTPS: HTTP + SSL (Secure Sockets Layer), that is, the Hypertext Transfer Protocol over a secure socket layer; the default port number is 443

    • SSL encrypts the transmitted content (hypertext, that is, the request body or the response body)

  • You can open a browser, visit a URL, right-click and choose Inspect, switch to the Network tab, click a request, and view the form of the HTTP protocol
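The relationship between a URL's scheme and its default port can be sketched in a few lines of Python. The mapping and the example URLs below are illustrative, not from the original text:

```python
from urllib.parse import urlsplit

# Default ports agreed for each scheme
DEFAULT_PORTS = {"http": 80, "https": 443}

def effective_port(url: str) -> int:
    """Return the explicit port of a URL, or the scheme's default port."""
    parts = urlsplit(url)
    if parts.port is not None:
        return parts.port
    return DEFAULT_PORTS[parts.scheme]

print(effective_port("http://example.com/index.html"))   # 80
print(effective_port("https://example.com/index.html"))  # 443
print(effective_port("https://example.com:8443/"))       # 8443
```

This is why `http://example.com` and `http://example.com:80` reach the same server: when no port is written, the scheme's default is used.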


Knowledge points: master the concepts and default ports of http and https


 

2. Request headers and response headers that the crawler pays special attention to

2.1 Request header fields of special concern

The form of the http request is shown in the figure above. The crawler pays special attention to the following request header fields

  • Content-Type

  • Host (host and port number)

  • Connection (connection type, e.g. keep-alive)

  • Upgrade-Insecure-Requests (upgrade to an HTTPS request)

  • User-Agent (browser identification)

  • Referer (the page the request was made from)

  • Cookie (state information carried by the client)

  • Authorization (credentials for resources that require authentication in the HTTP protocol, such as the JWT authentication used in the earlier web course)

Among these, User-Agent, Referer, and Cookie are the most commonly used request headers: servers rely on them most often to identify crawlers, so they matter more than the rest. Note, however, that this does not mean the others are unimportant; a site's operators or developers may deliberately use some of the less common request headers to screen out crawlers.
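A minimal sketch of attaching the headers a server commonly inspects. The URL, cookie value, and User-Agent string below are placeholders, not from the original text:

```python
headers = {
    # Identifies the client; the default of most HTTP libraries exposes a crawler
    "User-Agent": ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                   "AppleWebKit/537.36 (KHTML, like Gecko) "
                   "Chrome/120.0 Safari/537.36"),
    # The page the request "came from"; some sites reject requests without it
    "Referer": "https://example.com/list",
    # Session state; often required for pages behind a login
    "Cookie": "sessionid=placeholder",
}

# With the third-party requests library this dict would be passed as:
#   import requests
#   resp = requests.get("https://example.com/detail/1", headers=headers)

for name in headers:
    print(name)
```

Sending these headers makes the request look more like a browser's; which ones a given site actually checks varies and has to be discovered by experiment.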

2.2 Response header fields of special concern

The form of the HTTP response is shown in the figure above. The crawler only pays attention to one response header field:

  • Set-Cookie (the other party's server sets cookies to the cache of the user's browser)
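Parsing a Set-Cookie header can be demonstrated locally with the standard library. The header value below is a made-up example:

```python
from http.cookies import SimpleCookie

# A sample Set-Cookie header value (placeholder values, for illustration)
raw = "sessionid=abc123; Path=/; HttpOnly"

cookie = SimpleCookie()
cookie.load(raw)

# The browser (or a crawler's session object) stores the name/value pair
# and sends it back in the Cookie header of later requests
for name, morsel in cookie.items():
    print(name, morsel.value)  # sessionid abc123
```

This is exactly why Set-Cookie matters to a crawler: whatever the server sets here usually has to be echoed back in the Cookie request header to stay "logged in".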


Knowledge points: master the request header and response header that the crawler pays attention to


 

3. Common response status codes

  • 200: success

  • 302: Jump, the new url is given in the Location header of the response

  • 303: See Other; after a POST, the browser redirects to the new URL with a GET request

  • 307: Temporary Redirect; the browser redirects to the new URL keeping the original request method

  • 403: The resource is unavailable; the server understands the client's request, but refuses to process it (no permission)

  • 404: The page cannot be found

  • 500: server internal error

  • 503: the server is temporarily unable to respond due to maintenance or overload, and the response may carry a Retry-After header; a crawler visiting a URL too frequently can cause the server to throttle its requests and return 503

We already covered status codes when learning web development: they are feedback the server gives to the client, and we were taught that a server should report the real situation. When crawling, however, things are different. To keep data from being scraped easily, a site's developers or operations staff may tamper with status codes, which means the returned code is not necessarily the truth. For example, the server may have already recognized you as a crawler, but to keep you off guard it still returns status code 200, while the response body contains no data.

No status code can be fully trusted; what matters is whether the desired data can actually be extracted from the captured response.
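The advice above can be reduced to a small checking function: treat a response as successful only if both the status code and the body look right. The `keyword` check is a stand-in for whatever extraction your crawler actually performs:

```python
def looks_successful(status: int, body: str, keyword: str) -> bool:
    """A 200 alone is not proof of success: a server that has detected a
    crawler may return 200 with an empty or fake body. Verify that the
    data we expect (represented here by a keyword) is actually present."""
    return status == 200 and keyword in body

print(looks_successful(200, "<html>price: 9.99</html>", "price"))  # True
print(looks_successful(200, "<html></html>", "price"))             # False
print(looks_successful(503, "service busy", "price"))              # False
```

In a real crawler the keyword test would be replaced by the actual parsing step (e.g. checking that a CSS selector matches), but the principle is the same: validate the data, not the code.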


Knowledge point: understand common response status codes


 

4. The running process of the browser

Having reviewed the HTTP protocol, let's look at how a browser sends HTTP requests.

4.1 http request process

  1. After the browser resolves the domain name to an IP address, it first sends a request for the URL in the address bar and gets the response

  2. The returned response content (HTML) contains URL addresses for CSS, JS, images, and Ajax code. The browser sends these additional requests in the order they appear in the response content and obtains the corresponding responses

  3. Each time the browser receives a response, it adds (loads) it into the displayed result; JS and CSS modify the page's content, and JS can also send further requests and receive responses

  4. From displaying the first response in the browser until all responses have been received and their content added to or modified in the displayed result: this whole process is called browser rendering
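Step 2 above can be sketched with the standard library's HTML parser: collecting the sub-resource URLs that a browser would go on to request. The HTML snippet and its URLs are invented for illustration:

```python
from html.parser import HTMLParser

class ResourceCollector(HTMLParser):
    """Collect the sub-resource URLs a browser would request next.
    A crawler receives only the raw HTML; it never fetches these
    automatically."""
    def __init__(self):
        super().__init__()
        self.resources = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in ("script", "img") and "src" in attrs:
            self.resources.append(attrs["src"])
        elif tag == "link" and attrs.get("rel") == "stylesheet" and "href" in attrs:
            self.resources.append(attrs["href"])

# A toy HTML response (placeholder URLs)
html = """<html><head>
<link rel="stylesheet" href="/static/site.css">
<script src="/static/app.js"></script>
</head><body><img src="/img/logo.png"></body></html>"""

collector = ResourceCollector()
collector.feed(html)
print(collector.resources)  # ['/static/site.css', '/static/app.js', '/img/logo.png']
```

A browser fires off a request for each of these URLs and renders the results together; a crawler that wants any of them must request each one explicitly.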

 

4.2 Note:

In a crawler, by contrast, the crawler only requests the URL address and gets the single response corresponding to that URL (the response content may be HTML, CSS, JS, an image, and so on)

The page rendered by the browser therefore often differs from the page the crawler requests, because the crawler has no rendering ability (in later lessons we will use other tools or packages to help the crawler render response content)

  • The final result displayed by the browser is the result of multiple responses corresponding to multiple requests sent by multiple URL addresses.

  • Therefore, in a crawler, the data must be extracted from the single response corresponding to the one requested URL address


Knowledge point: Understand that the results displayed by the browser can be rendered by multiple responses corresponding to multiple requests, and the crawler is one request corresponding to one response


 

5. Other reference reading about http protocol


Origin blog.csdn.net/weixin_45293202/article/details/113574831