Web crawler: request and response principles of HTTP and HTTPS

 Table of contents

Preface

Introduction

HTTP requests and responses

The process of the browser sending an HTTP request:

HTTP requests are mainly divided into two methods: Get and Post.

View web request

Common request headers

1. Host (host and port number)

2. Connection (link type)

3. Upgrade-Insecure-Requests (upgrade to HTTPS requests)

4. User-Agent (browser name)

5. Accept (transfer file type)

Example:

6. Referer (page jump point)

7. Accept-Encoding (file encoding and decoding format)

Example: Accept-Encoding:gzip;q=1.0, identity; q=0.5, *;q=0

8. Accept-Language (language type)

9. Accept-Charset (Character encoding)

Example: Accept-Charset:iso-8859-1,gb2312,utf-8

10. Cookie (Cookie)

11. Content-Type (POST data type)

Example: Content-Type: text/xml; charset=gb2312

Server HTTP response

Commonly used response headers (understand)

1. Cache-Control:must-revalidate, no-cache, private

2. Connection:keep-alive

3. Content-Encoding:gzip

4. Content-Type:text/html;charset=UTF-8

5. Date:Sun, 21 Sep 2016 06:18:21 GMT

6. Expires:Sun, 1 Jan 2000 01:00:00 GMT

7. Pragma:no-cache

8. Server:Tengine/1.4.6

9. Transfer-Encoding:chunked

10. Vary: Accept-Encoding

Cookies and Sessions:

Response status code

Common status codes:

Two ways to load web pages

Understand the composition of web page source code

Crawler protocol (understanding)


Preface

Before starting to learn crawlers, we must understand how web pages work, that is, how the HTTP and HTTPS protocols operate. Below I will introduce the relevant knowledge points in detail. Let's take a look!

Introduction

HTTP (HyperText Transfer Protocol): a protocol for publishing and receiving HTML pages.

HTTPS (Hypertext Transfer Protocol over Secure Socket Layer): simply put, the secure version of HTTP, which adds an SSL layer under HTTP.

SSL (Secure Sockets Layer): a secure transmission protocol mainly used for the Web. It encrypts network connections at the transport layer to ensure the security of data transmission on the Internet. Its modern successor is TLS (Transport Layer Security).

  • The default port number of HTTP is 80.

  • The default port number of HTTPS is 443.

HTTP requests and responses

HTTP communication consists of two parts: the client request message and the server response message.

The process of the browser sending an HTTP request:

  1. When the user enters a URL in the browser's address bar and presses the Enter key, the browser sends an HTTP request to the HTTP server. HTTP requests are mainly divided into two methods: "Get" and "Post".

  2. When we enter the URL of the Baidu home page in the browser, the browser sends a Request to obtain the HTML file of the Baidu home page, and the server sends the Response object back to the browser.

  3. The browser analyzes the HTML in the Response and finds that it references many other files, such as image files, CSS files, and JS files. The browser automatically sends further Requests to obtain those images, CSS files, and JS files.

  4. When all files are downloaded successfully, the web page will be displayed completely according to the HTML syntax structure.
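Step 3 above can be sketched with Python's standard html.parser module. This is only an illustration of how a client might discover the extra files referenced by a page; the class name ResourceCollector and the sample HTML are made up for this example and are not taken from any real site.

```python
from html.parser import HTMLParser


class ResourceCollector(HTMLParser):
    """Collect URLs of images, scripts and stylesheets referenced by a page."""

    def __init__(self):
        super().__init__()
        self.resources = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in ("img", "script") and attrs.get("src"):
            self.resources.append(attrs["src"])
        elif tag == "link" and attrs.get("rel") == "stylesheet" and attrs.get("href"):
            self.resources.append(attrs["href"])


# Hypothetical HTML, standing in for the body of the first response.
SAMPLE_HTML = """
<html>
  <head>
    <link rel="stylesheet" href="/static/site.css">
    <script src="/static/app.js"></script>
  </head>
  <body><img src="/static/logo.png"></body>
</html>
"""

collector = ResourceCollector()
collector.feed(SAMPLE_HTML)
print(collector.resources)
# ['/static/site.css', '/static/app.js', '/static/logo.png']
```

A browser would then send one further request for each collected URL.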

Uniform Resource Locator: URL (short for Uniform/Universal Resource Locator) is an identification method used to completely describe the address of web pages and other resources on the Internet.

Basic format: scheme://host[:port#]/path/…/[?query-string][#anchor]

That is: protocol://host[:port number]/path/…/[?query parameters][#anchor]

  • scheme: protocol (for example: http, https, ftp)

  • host: IP address or domain name of the server

  • port#: server port (optional; if omitted, the protocol's default port is used, for example 80 for HTTP and 443 for HTTPS)

  • path: path to access resources

  • query-string: parameters, that is, data sent to the HTTP server

  • anchor: anchor (used for jumping within the page)

For example:
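As a minimal sketch, Python's standard urllib.parse module can split a URL into the components listed above. The first URL below is invented for illustration (www.example.com, port 8080 and the query parameters are assumptions); the second is the Baidu search URL used elsewhere in this article.

```python
from urllib.parse import urlsplit, parse_qs

# Illustrative URL; host, port, path and parameters are made up.
url = "http://www.example.com:8080/news/index.html?page=2&lang=en#top"
parts = urlsplit(url)

print(parts.scheme)            # http
print(parts.hostname)          # www.example.com
print(parts.port)              # 8080
print(parts.path)              # /news/index.html
print(parts.query)             # page=2&lang=en
print(parts.fragment)          # top
print(parse_qs(parts.query))   # {'page': ['2'], 'lang': ['en']}

# When the URL carries no explicit port, urlsplit reports None and the
# client falls back to the protocol default (80 for http).
default_port = urlsplit("http://www.baidu.com/s?wd=Chinese").port
print(default_port)            # None
```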

HTTP requests are mainly divided into two methods: Get and Post.

  • GET is to obtain data from the server, and POST is to transmit data to the server.

  • The parameters of a GET request are displayed in the browser's URL. The HTTP server generates the response content based on the parameters in the URL contained in the request; that is, the parameters of a "Get" request are part of the URL. For example: http://www.baidu.com/s?wd=Chinese

  • The parameters of a POST request are in the request body, so they are not shown in the URL. The protocol itself places no limit on the length of the body (individual servers may), so POST is usually used to submit a relatively large amount of data to the HTTP server (for example, a request with many parameters or a file upload). The "Content-Type" message header indicates the media type and encoding of the message body.

Note: Avoid using the Get method to submit a form as it may cause security issues. For example, when using the Get method in the login form, the username and password entered by the user will be fully exposed in the address bar.
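The difference between the two methods can be sketched with urllib.request.Request from the Python standard library. Nothing is sent over the network here; the objects are only built and inspected. The POST endpoint www.example.com/submit and its form fields are hypothetical.

```python
from urllib.parse import urlencode
from urllib.request import Request

# GET: the parameters are encoded into the URL itself.
get_req = Request("http://www.baidu.com/s?" + urlencode({"wd": "Chinese"}))

# POST: the parameters travel in the request body as bytes.
# The endpoint and field names below are hypothetical.
post_body = urlencode({"user": "alice", "note": "hello world"}).encode("utf-8")
post_req = Request(
    "http://www.example.com/submit",
    data=post_body,
    headers={"Content-Type": "application/x-www-form-urlencoded"},
)

print(get_req.get_method(), get_req.full_url)
# GET http://www.baidu.com/s?wd=Chinese
print(post_req.get_method(), post_req.full_url, post_req.data)
# POST http://www.example.com/submit b'user=alice&note=hello+world'
```

Passing either object to urllib.request.urlopen would actually send the request.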

View web request

Taking the Chrome browser as an example: right-click on the web page and choose Inspect (or press F12 directly), select the Network tab, refresh the page, and select the first entry under All. You can then see the various request information of the web page, which is introduced in detail below.

Common request headers

1. Host (host and port number)

Host: corresponds to the domain name and port number in the URL. It specifies the Internet host and port number of the requested resource and is usually taken from the URL.

2. Connection (link type)

Connection: indicates the connection type between the client and the server.

  1. The client initiates a request containing Connection:keep-alive. HTTP/1.1 uses keep-alive as the default value.

  2. After the Server receives the request:

    • If the Server supports keep-alive, it replies with a response containing Connection:keep-alive and does not close the connection;

    • If the Server does not support keep-alive, it replies with a response containing Connection:close and closes the connection.

  3. If the client receives a response containing Connection:keep-alive, it sends the next request over the same connection until one party actively closes the connection.

Keep-alive can reuse connections in many cases, reduce resource consumption, and shorten response time. For example, when the browser requires multiple files (such as an HTML file and related graphic files), there is no need to request a connection every time.
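Connection reuse can be observed locally with the standard library alone. The sketch below starts a throwaway HTTP/1.1 server on 127.0.0.1 (the port is chosen by the operating system) and sends two requests through one http.client.HTTPConnection. The handler class and the requested paths are invented for this example; it demonstrates the idea and is not how a real site is configured.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class Handler(BaseHTTPRequestHandler):
    protocol_version = "HTTP/1.1"  # HTTP/1.1 keeps connections open by default

    def do_GET(self):
        body = b"ok"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo output quiet


server = HTTPServer(("127.0.0.1", 0), Handler)  # port 0: let the OS pick one
threading.Thread(target=server.serve_forever, daemon=True).start()

conn = http.client.HTTPConnection("127.0.0.1", server.server_port)
ports = []
for path in ("/index.html", "/logo.png"):  # hypothetical paths
    conn.request("GET", path)
    conn.getresponse().read()
    # Record the client's local port; it only changes if a new TCP
    # connection had to be opened.
    ports.append(conn.sock.getsockname()[1])

conn.close()
server.shutdown()
server.server_close()

print(ports[0] == ports[1])  # True: both requests used the same connection
```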

3. Upgrade-Insecure-Requests (upgrade to HTTPS requests)

Upgrade-Insecure-Requests: upgrade insecure requests. The client sends this header to tell the server that it prefers an encrypted response and supports automatically replacing http resources with https requests when loading a page, so that the browser no longer displays http request alerts in https pages.

HTTPS is an HTTP channel aimed at security, so HTTP requests are not allowed on pages hosted over HTTPS. Once they occur, a prompt or error is reported.

4. User-Agent (browser name)

User-Agent: identifies the client's browser, typically with its name, version, and operating system. We will talk about it in detail later.

5. Accept (transfer file type)

Accept: refers to the MIME (Multipurpose Internet Mail Extensions) file type that the browser or other client can accept, and the server can judge and return the appropriate file format based on it.

Example:

Accept: */*: Indicates that anything can be received.

Accept:image/gif: Indicates that the client wishes to accept resources in GIF image format;

Accept:text/html: Indicates that the client wishes to accept html text.

Accept: text/html, application/xhtml+xml;q=0.9, image/*;q=0.8: Indicates that the MIME types supported by the browser are HTML text, XHTML documents, and all image format resources.

q is the weight coefficient, in the range 0 <= q <= 1. The larger the q value, the more the client prefers the content type that precedes the ";". If the q value is not specified, it defaults to 1, and types with equal weight are ordered from left to right; a value of 0 indicates that the browser does not accept this content type.

Text: used for the standardized representation of text information; text messages can be in multiple character sets and/or multiple formats. Application: used for transmitting application data or binary data.
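The q weighting can be made concrete with a small parser. The function parse_accept below is a helper written for this article, not a standard-library API, and it handles only the simple header forms shown in the examples.

```python
def parse_accept(header):
    """Return (value, q) pairs sorted by descending q; q defaults to 1.0."""
    items = []
    for part in header.split(","):
        fields = [field.strip() for field in part.split(";")]
        value, q = fields[0], 1.0
        for param in fields[1:]:
            name, _, number = param.partition("=")
            if name.strip() == "q":
                q = float(number)
        items.append((value, q))
    # sorted() is stable, so entries with equal q keep their left-to-right order.
    return sorted(items, key=lambda item: item[1], reverse=True)


print(parse_accept("text/html, application/xhtml+xml;q=0.9, image/*;q=0.8"))
# [('text/html', 1.0), ('application/xhtml+xml', 0.9), ('image/*', 0.8)]
```

The same function works for other headers that use q values, such as Accept-Encoding.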

6. Referer (page jump point)

Referer: indicates which URL the request comes from, that is, the user reached the currently requested page from the Referer page. This header can be used to track which page or website a web request comes from.

Sometimes, when downloading pictures from a website, you need to send a corresponding Referer; otherwise the pictures cannot be downloaded. That is because the site has implemented anti-leeching (hotlink protection). The principle is to check, based on the Referer, whether the request comes from an address of this website: if not, the request is rejected; if so, the download is allowed.
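The anti-leeching check described above might look roughly like the following on the server side, and a crawler would respond by sending a matching Referer. Everything here is a sketch under assumptions: the host names, the function is_allowed_referer, and the policy of rejecting requests without a Referer are all made up for illustration, and real sites may check differently.

```python
from urllib.parse import urlsplit
from urllib.request import Request

ALLOWED_HOSTS = {"www.example.com", "img.example.com"}  # hypothetical site hosts


def is_allowed_referer(referer, allowed_hosts=ALLOWED_HOSTS):
    """Server side: accept the request only if the Referer is one of our own hosts."""
    if not referer:
        return False  # this sketch rejects requests that carry no Referer
    return urlsplit(referer).hostname in allowed_hosts


print(is_allowed_referer("http://www.example.com/gallery/1.html"))  # True
print(is_allowed_referer("http://other-site.test/page"))            # False

# Crawler side: attach the page the image was found on as the Referer.
image_req = Request(
    "http://img.example.com/pic.jpg",
    headers={"Referer": "http://www.example.com/gallery/1.html"},
)
print(image_req.get_header("Referer"))
```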

7. Accept-Encoding (file encoding and decoding format)

Accept-Encoding: Indicates the encoding method acceptable to the browser. Encoding is different from file format in that it compresses files and speeds up file delivery. The browser decodes the web response after receiving it and then checks the file format, which can save a lot of download time in many cases.

Example: Accept-Encoding:gzip;q=1.0, identity; q=0.5, *;q=0

If multiple encodings match at the same time, they are ranked by q value. In this example, gzip is preferred, followed by identity (no encoding), and any other encoding is refused. A server that supports gzip will return a gzip-encoded HTML page. If this field is not set in the request message, the server assumes that the client can accept various content encodings.

8. Accept-Language (language type)

Accept-Language: indicates the language types that the browser can accept, for example en or en-us for English and zh or zh-cn for Chinese. It is used when the server can provide more than one language version.

9. Accept-Charset (Character encoding)

Accept-Charset: Indicates the character encoding acceptable to the browser.

Example: Accept-Charset:iso-8859-1,gb2312,utf-8

  • ISO-8859-1: Commonly called Latin-1. Latin-1 includes additional characters indispensable for writing all Western European languages. The default value for English browsers is ISO-8859-1.

  • gb2312: Standard Simplified Chinese character set;

  • UTF-8: A variable-length character encoding of UNICODE, which can solve the problem of text display in multiple languages, thereby realizing application internationalization and localization.

If this field is not set in the request message, the default is that any character set is accepted.

10. Cookie (Cookie)

Cookie: the browser uses this header to send cookies to the server. A cookie is a small piece of data stored in the browser. It can record user information related to the server and can also be used to implement session functions, which will be discussed in detail later.

11. Content-Type (POST data type)

Content-Type: The content type used to represent the POST request.

Example: Content-Type: text/xml; charset=gb2312

Indicates that the message body of the request contains XML data in plain text, and that the character encoding is "gb2312".

Server HTTP response

The HTTP response also consists of four parts, namely: the status line, the message headers, a blank line, and the response body.

HTTP/1.1 200 OK
Server: Tengine
Connection: keep-alive
Date: Wed, 30 Nov 2016 07:58:21 GMT
Cache-Control: no-cache
Content-Type: text/html;charset=UTF-8
Keep-Alive: timeout=20
Vary: Accept-Encoding
Pragma: no-cache
X-NWS-LOG-UUID: bd27210a-24e5-4740-8f6c-25dbafa9c395
Content-Length: 180945

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" ....
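To make the four parts concrete, the sketch below splits a raw response string into status line, headers, and body at the blank line. The sample response is a shortened, made-up variant of the one above (the body and Content-Length are invented), and parse_response is a helper written for this article. In practice a library such as http.client does this parsing.

```python
# Shortened, illustrative response; body and Content-Length are made up.
RAW_RESPONSE = (
    "HTTP/1.1 200 OK\r\n"
    "Server: Tengine\r\n"
    "Content-Type: text/html;charset=UTF-8\r\n"
    "Content-Length: 13\r\n"
    "\r\n"
    "<html></html>"
)


def parse_response(raw):
    """Split a raw HTTP response into (version, code, reason, headers, body)."""
    head, _, body = raw.partition("\r\n\r\n")      # the blank line ends the headers
    status_line, *header_lines = head.split("\r\n")
    version, code, reason = status_line.split(" ", 2)
    headers = {}
    for line in header_lines:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return version, int(code), reason, headers, body


version, code, reason, headers, body = parse_response(RAW_RESPONSE)
print(version, code, reason)      # HTTP/1.1 200 OK
print(headers["content-type"])    # text/html;charset=UTF-8
print(body)                       # <html></html>
```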

Commonly used response headers (understand)

In theory, all response header information should correspond to the request headers. However, for efficiency, security, and other considerations, the server adds further response headers of its own, as you can see from the example above.

1. Cache-Control:must-revalidate, no-cache, private

This value tells the client that the server does not want the client to reuse cached resources directly. The next time the resource is requested, the client must ask the server again and cannot simply take the resource from a cached copy.

  • Cache-Control is very important information in the response header. When the client's request header contains Cache-Control:max-age=0, which clearly indicates that it does not want a cached copy of the server's resources, the response usually carries Cache-Control: no-cache, meaning "then don't cache it."

  • When the client does not include Cache-Control in the request header, the server often chooses different caching strategies for different resources. For example, oschina's strategy for caching image resources is Cache-Control: max-age=86400, which means that for 86400 seconds starting from the current time, the client can read the resource directly from the cached copy without requesting it from the server.
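The caching decision described above can be sketched as follows. This is a deliberately simplified model: parse_cache_control and is_fresh are helpers invented for this article, and they treat no-cache simply as "do not reuse the copy", ignoring revalidation and the other rules a real HTTP cache follows.

```python
def parse_cache_control(value):
    """Turn a Cache-Control value into a dict of directive -> argument (or None)."""
    directives = {}
    for part in value.split(","):
        name, _, arg = part.strip().partition("=")
        if name:
            directives[name.lower()] = arg or None
    return directives


def is_fresh(directives, age_seconds):
    """Simplified rule: a copy is reusable only while younger than max-age."""
    if "no-cache" in directives or "no-store" in directives:
        return False
    max_age = directives.get("max-age")
    return max_age is not None and age_seconds < int(max_age)


images = parse_cache_control("max-age=86400")
pages = parse_cache_control("must-revalidate, no-cache, private")

print(is_fresh(images, 3600))    # True: one hour old, allowed for one day
print(is_fresh(images, 90000))   # False: older than 86400 seconds
print(is_fresh(pages, 1))        # False: no-cache forbids direct reuse
```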

2. Connection:keep-alive

This field responds to the client's Connection: keep-alive, telling the client that the server's TCP connection is also a long (persistent) connection, and the client can continue to use this TCP connection to send HTTP requests.

3. Content-Encoding:gzip

Tell the client that the resources sent by the server are gzip encoded. After the client sees this information, it should use gzip to decode the resources.
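The effect of gzip encoding can be tried out with Python's standard gzip module. The page content below is made up, and its repetitive text is chosen so that the compression is easy to see; real pages compress by varying amounts.

```python
import gzip

# Made-up page content with a lot of repetition.
html = b"<html><body>" + b"<p>hello crawler</p>" * 200 + b"</body></html>"

compressed = gzip.compress(html)         # what a server would send with Content-Encoding: gzip
restored = gzip.decompress(compressed)   # what the client does after reading that header

print(len(html), len(compressed))        # the compressed body is much smaller
print(restored == html)                  # True
```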

4. Content-Type:text/html;charset=UTF-8

Tell the client the type of resource file and character encoding. The client decodes the resource through UTF-8 and then performs HTML parsing of the resource. Usually we will see that some websites are garbled, often because the server does not return the correct encoding.
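The garbling mentioned above is easy to reproduce. In the sketch below, charset_from_content_type is a helper written for this article; it borrows the header parser from the standard email package to read the charset parameter. The sample bytes are the word "中文" encoded as gb2312.

```python
from email.message import Message


def charset_from_content_type(value, default="utf-8"):
    """Extract the charset parameter from a Content-Type header value."""
    msg = Message()
    msg["Content-Type"] = value
    return msg.get_content_charset() or default


raw = "中文".encode("gb2312")  # body bytes as a gb2312 server would send them

# Decoding with the charset announced by the server gives the right text.
charset = charset_from_content_type("text/html; charset=gb2312")
decoded = raw.decode(charset)
print(charset, decoded)

# Decoding the same bytes with the wrong charset produces garbled output.
garbled = raw.decode("utf-8", errors="replace")
print(garbled)
```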

5. Date:Sun, 21 Sep 2016 06:18:21 GMT

This is the server time when the server sends resources, and GMT is the standard time in Greenwich. The times sent in the http protocol are all in GMT. This is mainly to solve the problem of time confusion when different time zones request resources from each other on the Internet.

6. Expires:Sun, 1 Jan 2000 01:00:00 GMT

This response header is also related to caching. It tells the client that it can use the cached copy directly before this time. Obviously there can be a problem with this value, because the clocks of the client and the server are not necessarily the same, and a difference between them leads to wrong decisions. Therefore, this response header is not as accurate as the Cache-Control: max-age=* response header, because the value in max-age is a relative time in seconds, which is not only easier to understand but also more accurate.

7. Pragma:no-cache

Pragma: no-cache has the same meaning as Cache-Control: no-cache; it is the older HTTP/1.0 form.

8. Server:Tengine/1.4.6

This is the server and the corresponding version, which only tells the client server information.

9. Transfer-Encoding:chunked

This response header tells the client that the resources sent by the server are sent in chunks. Generally, resources sent in chunks are dynamically generated by the server. The size of the resource is not known when sending, so it is sent in chunks. Each chunk is independent, and each independent chunk can indicate its own length. The last chunk is 0 length. When the client reads this 0 length block, it can be sure that the resource has been transferred.
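The chunk format described above can be decoded in a few lines. In the chunked format, each chunk starts with its length in hexadecimal on a line of its own, followed by the data and a line break. The function decode_chunked is a minimal sketch written for this article; it ignores trailers and does no error handling, which real HTTP libraries take care of.

```python
def decode_chunked(data):
    """Reassemble a chunked body: each chunk is '<hex size>\\r\\n<data>\\r\\n'."""
    body = b""
    pos = 0
    while True:
        line_end = data.index(b"\r\n", pos)
        # The size is hexadecimal; anything after ';' is a chunk extension.
        size = int(data[pos:line_end].split(b";")[0], 16)
        if size == 0:
            return body              # the zero-length chunk marks the end
        start = line_end + 2
        body += data[start:start + size]
        pos = start + size + 2       # skip the CRLF that ends the chunk


encoded = b"5\r\nHello\r\n7\r\n, world\r\n0\r\n\r\n"
print(decode_chunked(encoded))       # b'Hello, world'
```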

10. Vary: Accept-Encoding

Tells caching servers that the response varies with the request's Accept-Encoding header, so the compressed and uncompressed versions of a file must be cached separately. This field is not very useful now because modern browsers support compression.

Cookies and Sessions:

HTTP is stateless: the interaction between the server and the client is limited to one request/response exchange and ends when the exchange is over. On the next request, the server treats the client as a new one.

In order to maintain the link between them and let the server know that a request was sent by the same user as before, the client's information must be saved somewhere.

Cookie: determines the user's identity through information recorded on the client side.

Session: determines the user's identity through information recorded on the server side.
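How the two work together can be sketched without any network code: the server keeps the user data in a session store and hands the client only a session id inside a cookie. The names SESSIONS, login, current_user, and the cookie name sessionid are all assumptions made for this illustration. Real frameworks add expiry, signing, and secure cookie attributes.

```python
import secrets
from http.cookies import SimpleCookie

SESSIONS = {}  # server-side store: session id -> user data


def login(username):
    """Create a session and return the cookie the server would set on the client."""
    session_id = secrets.token_hex(16)
    SESSIONS[session_id] = {"user": username}
    cookie = SimpleCookie()
    cookie["sessionid"] = session_id
    return cookie["sessionid"].OutputString()   # e.g. 'sessionid=3f9c...'


def current_user(cookie_header):
    """Identify the user from the Cookie header of a later request."""
    cookie = SimpleCookie()
    cookie.load(cookie_header)
    morsel = cookie.get("sessionid")
    session = SESSIONS.get(morsel.value) if morsel else None
    return session["user"] if session else None


sent_cookie = login("alice")                  # the client stores this value
print(current_user(sent_cookie))              # alice
print(current_user("sessionid=unknown"))      # None: no such session on the server
```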

Response status code

The response status code consists of three digits. The first digit defines the response category and has five possible values.

Common status codes:

  • 100~199: Indicates that the server successfully received part of the request and requires the client to continue submitting the remaining requests to complete the entire processing process.

  • 200~299: Indicates that the server successfully received the request and completed the entire processing process. Commonly used is 200 (OK request successful).

  • 300~399: In order to complete the request, the client needs to refine the request further. For example, the requested resource has moved to a new address. Commonly used are 302 (the requested page has been temporarily moved to a new URL), 307 (temporary redirect), and 304 (not modified, use the cached resource).

  • 400~499: There is an error in the client's request. Commonly used are 404 (the server cannot find the requested page) and 403 (the server denies access, insufficient permissions).

  • 500~599: An error occurred on the server side. Commonly used is 500 (the request was not completed because the server encountered an unexpected situation).
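A crawler usually branches on the first digit of the status code. The sketch below uses http.HTTPStatus from the standard library for the official reason phrases; the function status_category and its category labels are written for this article.

```python
from http import HTTPStatus

CATEGORIES = {
    1: "informational",
    2: "success",
    3: "redirection",
    4: "client error",
    5: "server error",
}


def status_category(code):
    """Classify a status code by its first digit."""
    return CATEGORIES.get(code // 100, "unknown")


for code in (200, 302, 304, 403, 404, 500):
    print(code, HTTPStatus(code).phrase, "->", status_category(code))
```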

Two ways to load web pages

  • Synchronous loading: changing request parameters in the URL causes the web page to change, for example: www.itjuzi.com/company?page=1 (change the number after page= and the web page changes)

  • Asynchronous loading: the page content changes without the URL changing, for example: www.lagou.com/gongsi/ (the URL does not change after turning the page)
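For a synchronously loaded site, a crawler can therefore generate the URL of every page just by changing the parameter. A minimal sketch, assuming the site is reachable over http and that page is the only parameter that changes; page_urls is a helper written for this article:

```python
from urllib.parse import urlencode

BASE_URL = "http://www.itjuzi.com/company"  # the http scheme is an assumption


def page_urls(base_url, first, last):
    """Build one URL per page by changing only the page parameter."""
    return [f"{base_url}?{urlencode({'page': n})}" for n in range(first, last + 1)]


for url in page_urls(BASE_URL, 1, 3):
    print(url)
# http://www.itjuzi.com/company?page=1
# http://www.itjuzi.com/company?page=2
# http://www.itjuzi.com/company?page=3
```

For asynchronously loaded pages this does not work; the data request has to be found in the browser's Network panel instead.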

Understand the composition of web page source code

Right-click on the web page and choose to view the page source to see the source code of the web page. The source code generally consists of three parts, namely:

  • html: describes the content structure of the web page

  • css: describes the layout and style of the web page (some advanced anti-crawling measures rely on CSS)

  • JavaScript (js file): describes the event handling of the web page, that is, the code that runs after the mouse or keyboard acts on a web page element

Crawler protocol (understanding)

Robots protocol: Through the robots protocol, the website tells search engines which pages can be crawled and which pages cannot be crawled, but it is only a moral constraint.
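A crawler that wants to respect the protocol can check the rules with urllib.robotparser from the Python standard library. The robots.txt content, the crawler name MyCrawler, and the www.example.com URLs below are made up for illustration. In real use, the parser's set_url and read methods fetch the site's actual robots.txt.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

allowed = parser.can_fetch("MyCrawler", "http://www.example.com/index.html")
blocked = parser.can_fetch("MyCrawler", "http://www.example.com/private/data.html")
print(allowed)   # True
print(blocked)   # False
```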

[Robots protocol Baidu Encyclopedia] link:

https://baike.baidu.com/item/robots%E5%8D%8F%E8%AE%AE/2483797?fr=aladdin 

 The above is the entire content of this issue. I will introduce the relevant knowledge points about http and https here. See you in the next issue!


Origin blog.csdn.net/m0_73633088/article/details/133099929