Python crawler (2): requests and responses

HTTP and HTTPS

HTTP (Hypertext Transfer Protocol) is the most widely used network protocol on the Internet. It is a request/response standard between client and server, running on top of TCP, and is used to transfer hypertext from WWW servers to the local browser. It makes browsing more efficient and reduces network traffic.
Because HTTP transfers data as unencrypted plaintext, it is efficient but unsafe for transmitting confidential information. To allow data to be encrypted in transit, Netscape designed the SSL (Secure Sockets Layer) protocol to encrypt data carried over HTTP, and from this HTTPS was born.
HTTPS is HTTP plus SSL (Secure Sockets Layer): a network protocol that supports encrypted transmission and identity authentication, and it is more secure than plain HTTP.

HTTP requests

In this section we use Chrome and Baidu as an example to look at the parts of a browser request that matter most for a Python crawler.

URL

Right-click the page and choose "Inspect", switch to the Network tab, and click our request; the Request URL appears in the panel on the right. A URL = request protocol (http/https) + the site's domain name (www.baidu.com) + resource path + parameters.
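As a quick illustration (a minimal sketch using Python's standard urllib.parse; the example URL and its query string are made up for demonstration and are not from the original page), the pieces of a URL can be pulled apart like this:

```python
from urllib.parse import urlparse, parse_qs

# A hypothetical Baidu search URL, used only to show the URL structure.
url = "https://www.baidu.com/s?wd=python"

parts = urlparse(url)
print(parts.scheme)            # protocol: 'https'
print(parts.netloc)            # domain name: 'www.baidu.com'
print(parts.path)              # resource path: '/s'
print(parse_qs(parts.query))   # parameters: {'wd': ['python']}
```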

[Figure: Chrome DevTools (Inspect) interface]
Many readers wonder why, after requesting just one domain name, the request list on the left shows so many requests. This is because, besides the data at the URL we requested, the browser also requests images, JS, CSS, and so on; once rendered, all of that together becomes the content of the Elements tab, i.e. the page we actually see.
In other words, Elements = response of the current URL + JS + CSS + images. Our crawler, however, only sends a request to the current URL address and does not send the follow-up requests for JS, CSS, and images, so when extracting data the crawler should rely on the response corresponding to the current URL address (i.e. the Response tab under Network, not the Elements tab).
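A minimal sketch of this point (assuming the third-party requests library is installed; the URL is simply Baidu's homepage), showing that a crawler only receives the raw response for the URL it requests:

```python
import requests

# The crawler fetches only the response for this one URL;
# it does not fetch or execute the JS/CSS/images the page references.
response = requests.get("https://www.baidu.com")

# response.text is the raw HTML, matching the Response tab under Network,
# not the fully rendered DOM shown in the Elements tab.
print(response.text[:200])
```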

Request Headers (request header)

Under the Network tab, open Request Headers and click "view source", as shown below:
[Figure: Request Headers]
  • The first part is the request line, which contains the request method (GET/POST), the request parameters, and the protocol version. Here we mainly need to note whether the request method is GET or POST.
  • The second part is Host, i.e. the site's domain name.
  • The third part is User-Agent, which is critical. It can be translated as "user agent" and serves as a client identifier. Different browsers and different versions carry different User-Agent strings, so the server can tell from the User-Agent what kind of client initiated a request. A crawler that wants to imitate the current browser must therefore include this header in its requests.
  • The fourth part is Cookie. Cookies are user information that the browser stores locally. They matter here for two reasons: first, for content that requires login, we need to carry the cookie in order to gain access; second, servers often judge whether a client is a crawler by checking whether it carries a cookie. A sketch of sending these headers with the requests library follows this list.
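A minimal sketch (assuming the requests library; the User-Agent string and cookie value below are placeholders, not the real ones) of attaching a User-Agent and Cookie to a request:

```python
import requests

# Placeholder header values for illustration; copy the real ones
# from the Request Headers shown in Chrome's Network tab.
headers = {
    "User-Agent": ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                   "AppleWebKit/537.36 (KHTML, like Gecko) "
                   "Chrome/74.0 Safari/537.36"),
    "Cookie": "BAIDUID=xxxxxxxx",  # hypothetical cookie value
}

response = requests.get("https://www.baidu.com", headers=headers)
print(response.status_code)
```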
Request Body (request body)

In this example you may wonder why there are only request headers and no request body. That is because a GET request carries its parameter data in the URL, whereas a POST request places its parameters in the request body. A sketch of both styles with the requests library follows.
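A minimal sketch (assuming the requests library; httpbin.org is used here only as a convenient echo service and is not part of the original example) contrasting where the parameters end up:

```python
import requests

params = {"wd": "python"}

# GET: the parameters are encoded into the URL's query string.
r_get = requests.get("https://httpbin.org/get", params=params)
print(r_get.url)            # ...?wd=python

# POST: the parameters travel in the request body, not in the URL.
r_post = requests.post("https://httpbin.org/post", data=params)
print(r_post.request.body)  # wd=python
```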
A brief summary of the differences between GET and POST requests:

  • In essence both GET and POST are TCP connections, with no inherent difference. But because of HTTP conventions and browser/server limitations, they behave somewhat differently in practice.
  • A GET request's URL has a length limit on its parameters, while a POST request has no length limit. The URL length limit is really a restriction of the browser and the server, and different browsers and servers impose different limits.
  • A GET produces one TCP packet, while a POST produces two. For a GET, once the browser has opened the TCP connection (the handshake) and the server has confirmed it, the browser sends the HTTP headers and data together and the server responds with 200 (returning the data). For a POST, the browser first sends the headers, the server responds with 100 Continue, the browser then sends the data, and the server responds with 200 OK (returning the data). Because POST takes two steps and costs a little more time, GET appears more efficient than POST.
    It should be emphasized that sending the headers and body separately is a behavior of particular browsers or frameworks, not something inherent to POST. Firefox, for example, sends only once.
    Measurements suggest that on a good network the time difference between sending one packet and sending two is negligible, while on a poor network sending two packets has a significant advantage for verifying TCP packet integrity.

HTTP responses

Response Status

Responses fall into several status classes (a small sketch of checking the status code follows the list):

  • 1xx Informational - the request has been received by the server and processing continues;
  • 2xx Success - the request has been successfully received, understood, and accepted by the server;
  • 3xx Redirection - further action is needed to complete the request;
  • 4xx Client Error - the request contains a syntax error or cannot be fulfilled;
  • 5xx Server Error - the server encountered an error while processing a valid request.
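A minimal sketch (again assuming the requests library; the URL is just Baidu's homepage) of checking a response's status code in a crawler:

```python
import requests

response = requests.get("https://www.baidu.com")

print(response.status_code)   # e.g. 200 on success

# Treat 4xx/5xx as failures; raise_for_status() raises an HTTPError for them.
if response.status_code == 200:
    print("request succeeded")
else:
    response.raise_for_status()
```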
Response headers and response body

An HTTP response is divided into a response header and a response body. In the response header we mainly care about Set-Cookie, the field through which the server sets a cookie on the local client; as for the response body, what we crawl should be the response corresponding to the URL address (Response under Network), not the result rendered from JS, CSS, and images (Elements).
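A minimal sketch (assuming the requests library; whether Baidu actually returns a Set-Cookie header on this request is not guaranteed) of reading the Set-Cookie response header and the raw response body:

```python
import requests

response = requests.get("https://www.baidu.com")

# Cookies set via the Set-Cookie response header, if any.
print(response.headers.get("Set-Cookie"))
print(response.cookies.get_dict())

# The raw response body for this URL -- what the crawler should parse.
print(response.text[:200])
```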

Reproduced from: https://www.jianshu.com/p/69587b309c6b
