Web crawler fundamentals: HTTP principles, page requests, and web page basics

Copyright: Attribution required; others may create derivative works based on this article, and those works must be distributed under the same license as the original (Creative Commons)

Table of contents

1, URI and URL
2, Hypertext
3, HTTP and HTTPS
4, HTTP request process
5, Request method
6, Request headers
7, Request body
8, Response
9, Web page basics

1, URI and URL

URI stands for Uniform Resource Identifier. URL is a subset of URI; URI also includes another subclass, URN (Uniform Resource Name), which only names a resource without specifying how to locate it.

URL stands for Uniform Resource Locator. For example, https://baidu.com/wd=leebeloved is a URL and therefore also a URI; broken down, it consists of the access protocol https, the access path baidu.com, and the resource name wd=leebeloved.
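The breakdown above can be checked with Python's standard library. This is a minimal sketch using urllib.parse.urlsplit on the example URL from the text. Note that a standard parser reports wd=leebeloved as part of the path, because the URL contains no "?" to start a query string.

```python
from urllib.parse import urlsplit

# The example URL from the text above.
url = "https://baidu.com/wd=leebeloved"

parts = urlsplit(url)

print("protocol:", parts.scheme)  # the access protocol
print("host:", parts.netloc)      # what the text calls the access path
print("path:", parts.path)        # contains the resource name
print("query:", repr(parts.query))  # empty: there is no "?" in the URL
```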

2, Hypertext: the web pages we see in a browser are rendered from hypertext; the source code of a web page is HTML code.

3, HTTP and HTTPS

HTTP, the Hypertext Transfer Protocol, is the protocol used to transfer hypertext data from the network to the local browser. HTTPS is the secure version of HTTP: it adds an SSL layer beneath HTTP, so content sent over HTTPS is encrypted by SSL.

HTTPS serves two purposes: 1, it establishes a secure channel for information; 2, it confirms the authenticity of the website.
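As a small illustration of the second purpose, the sketch below inspects the default client-side settings of Python's ssl module. It makes no network connection; it only shows that a default context insists on a valid certificate and on a matching host name, which is how a client checks that a site is authentic.

```python
import ssl

# Default client-side context from the standard library.
context = ssl.create_default_context()

# The server must present a certificate that validates against trusted CAs.
requires_certificate = context.verify_mode == ssl.CERT_REQUIRED

# The certificate must also match the host name being contacted.
checks_hostname = context.check_hostname

print("certificate required:", requires_certificate)
print("host name checked:", checks_hostname)
```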

4, HTTP request process

HTTP request process: from entering a URL to the page being displayed, the flow is: browser → sends a request → the server hosting the website → the server processes and parses the request → returns the corresponding response → back to the browser.

Each request listed in the browser's developer tools has the following columns:
1, Name: the name of the request, usually the last part of the URL;
2, Status: the response status code;
3, Type: the type of document requested;
4, Initiator: the source of the request, marking which object or process initiated it;
5, Size: the size of the file, or of the resource downloaded from the server;
6, Time: the time from initiating the request to receiving the response;
7, Waterfall: a visual waterfall chart of the network requests.

Request details in the developer tools: the General section shows Request URL (the URL of the request), Request Method (the request method), Status Code (the response status code), Remote Address (the address and port of the remote server), and Referrer Policy (the referrer policy). Response Headers are the headers of the response, and Request Headers are the headers of the request (the request headers contain the browser identification, Cookies, Host, and so on).
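The request and response cycle described above can be reproduced without a browser. The following is a minimal sketch that starts a throwaway HTTP server on the local machine and sends it one request. DemoHandler, fetch_once, the page content, and the /index.html path are all hypothetical names chosen for illustration; a real crawler would contact a remote site instead.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import ProxyHandler, build_opener


class DemoHandler(BaseHTTPRequestHandler):
    """Hypothetical handler standing in for the server hosting a site."""

    def do_GET(self):
        body = b"<html><body><p>hello</p></body></html>"
        self.send_response(200)  # response status code
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)  # response body

    def log_message(self, format, *args):
        pass  # keep the demo output quiet


def fetch_once():
    """Start a local server, send one GET request, return the response parts."""
    server = HTTPServer(("127.0.0.1", 0), DemoHandler)  # port 0: pick a free port
    port = server.server_address[1]
    threading.Thread(target=server.serve_forever, daemon=True).start()
    opener = build_opener(ProxyHandler({}))  # ignore any proxy settings
    try:
        with opener.open(f"http://127.0.0.1:{port}/index.html", timeout=5) as resp:
            return resp.status, resp.headers.get("Content-Type"), resp.read()
    finally:
        server.shutdown()
        server.server_close()


status, content_type, body = fetch_once()
print("Status:", status)
print("Type:", content_type)
print("Size:", len(body), "bytes")
```

The three printed values correspond to the Status, Type, and Size columns of the developer tools.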

5, Request method

A request sent from the client to the server consists of four parts: the request method, the request URL, the request headers, and the request body. The most commonly used methods are GET and POST.
1, The parameters of a GET request are included in the URL, so the data is visible in the URL. The URL of a POST request does not contain the data; the data is submitted as a form and carried in the request body, so it is not visible in the URL.
2, The data submitted by a GET request is limited in size, because browsers and servers cap the length of a URL (1024 bytes is a commonly cited figure), whereas the POST method has no such limit.
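The difference between the two methods can be seen by building the requests without sending them. This is a sketch using urllib.request.Request; the address https://example.com/search is a placeholder, and the wd parameter is borrowed from the earlier URL example.

```python
from urllib.parse import urlencode
from urllib.request import Request

params = {"wd": "leebeloved"}
encoded = urlencode(params)  # "wd=leebeloved"

# GET: the parameters are appended to the URL as a query string.
get_req = Request("https://example.com/search?" + encoded)

# POST: the parameters travel in the request body as form data.
post_req = Request("https://example.com/search", data=encoded.encode("ascii"))

for req in (get_req, post_req):
    print(req.get_method(), req.full_url, req.data)
```

Nothing is sent over the network here; urllib chooses POST automatically because data was supplied.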

6, Request headers

The request headers carry additional information for the server to use. The more important ones include Cookie, Referer, and User-Agent.
1, Accept: specifies which types of content the client can accept;
2, Accept-Language: specifies which languages the client can accept;
3, Accept-Encoding: specifies which content encodings the client can accept;
4, Host: specifies the host (domain name or IP) and port number of the requested resource, that is, the location of the origin server or gateway for the request URL;
5, Cookie: often written in the plural, Cookies. This is data that a website stores locally on the user's machine in order to identify the user and track the session. Its main function is to maintain the current session. The Cookies contain information that identifies our session on the server: each time the browser requests a page on the site, it adds the Cookies to the request headers and sends them to the server, and the server uses the Cookies to identify us and to determine that our current state is logged in. The response is therefore the content that can only be seen after logging in.
6, Referer: identifies the page from which the request was sent. The server can read this information and act on it, for example for traffic source statistics or anti-hotlinking.
7, User-Agent: UA for short, a special string header that lets the server identify the client's operating system and version, browser and version, and other information. Adding this header when writing a crawler lets it masquerade as a browser; without it, the crawler is easily identified.
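The headers listed above can be attached to a request as follows. This is a sketch only: the User-Agent string, the example.com addresses, and the cookie names sessionid and theme are made-up values, and the request is built but never sent.

```python
from http.cookies import SimpleCookie
from urllib.request import Request

# Hypothetical header values for illustration.
headers = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) ExampleBrowser/1.0",
    "Referer": "https://example.com/",
    "Accept-Language": "en-US,en;q=0.9",
    "Cookie": "sessionid=abc123; theme=dark",
}
req = Request("https://example.com/profile", headers=headers)

# urllib stores header names via str.capitalize(),
# so "User-Agent" is kept as "User-agent".
for name, value in req.header_items():
    print(f"{name}: {value}")

# A Cookie header is a list of name=value pairs; SimpleCookie can parse it.
cookie = SimpleCookie()
cookie.load(headers["Cookie"])
session_id = cookie["sessionid"].value
print("session id:", session_id)
```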

7, Request body: for a POST request, the request body generally carries the form data; for a GET request, the request body is empty.

8, Response: returned by the server to the client; it consists of the response status code, the response headers, and the response body.

Response status code: 200 indicates a normal response from the server, 404 indicates that the page was not found, 500 indicates an internal server error, and 403 indicates that the server refused the request (access forbidden).
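The four status codes above are available by name in Python's standard library. A short sketch that looks up their standard reason phrases:

```python
from http import HTTPStatus

# The status codes mentioned in the text.
codes = (200, 403, 404, 500)

# Map each code to its standard reason phrase.
phrases = {code: HTTPStatus(code).phrase for code in codes}

for code, phrase in phrases.items():
    print(code, phrase)
```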

Response body: when writing a crawler, the response body is where we mainly obtain the page source code or JSON data.
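A response body arrives as bytes, and what a crawler does next depends on whether it holds page source or JSON. The sketch below uses two hard-coded byte strings as stand-ins for real response bodies; their contents are invented for illustration.

```python
import json

# Stand-ins for response bodies (invented sample data).
html_body = b"<html><head><title>Demo</title></head><body></body></html>"
json_body = b'{"name": "leebeloved", "items": [1, 2, 3]}'

# Page source: decode the bytes into text for later parsing.
page_source = html_body.decode("utf-8")

# JSON data: decode, then parse into Python objects.
data = json.loads(json_body.decode("utf-8"))

print(page_source)
print(data["name"], data["items"])
```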

9, Web page basics

9.1 Composition of web pages: HTML is the language used to describe a web page; a page includes text, images, videos, buttons, and so on. (Different types of elements are represented by different tags: img for images, video for videos, p for paragraphs, div for layout. The overall frame of a page is built from different tags arranged and nested in various combinations.);

JavaScript: a scripting language. HTML and CSS together give the user only static information with no interactivity; JavaScript adds the interactive, dynamic behavior;

CSS: Cascading Style Sheets. "Cascading" means that when several style files are referenced in the HTML and their styles conflict, the browser resolves the conflict according to the cascade order. "Style" refers to the formatting of the page: text size, color, element spacing, arrangement, and so on.
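To see how a page is made of nested tags, the sketch below walks a small HTML fragment with the standard html.parser module and records every opening tag it meets. TagCollector and the sample page are hypothetical, written only for this illustration.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Hypothetical helper that records each opening tag in document order."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


# Invented sample page using the tags named in the text.
page = (
    "<html><body><div>"
    "<p>Some text</p>"
    '<img src="a.png">'
    '<video src="b.mp4"></video>'
    "</div></body></html>"
)

collector = TagCollector()
collector.feed(page)
print(collector.tags)
```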

9.2 Node trees and the relationships between nodes

DOM is the Document Object Model, which defines a standard for accessing HTML and XML documents.

The HTML DOM views an HTML document as a tree structure.

The nodes in the tree have hierarchical relationships, described with terms such as parent, child, and sibling. The top node of the tree is called the root node. Every node except the root has a parent node, and a node can have any number of children or siblings.
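These relationships can be explored with the DOM implementation in Python's standard library. The sketch below parses a tiny well-formed document with xml.dom.minidom; the markup is invented for illustration, and it is written as well-formed XML because minidom is an XML parser rather than a full HTML parser.

```python
from xml.dom.minidom import parseString

# Invented sample document, written without whitespace between tags
# so that no extra text nodes appear among the children.
doc = parseString(
    "<html><head><title>Demo</title></head>"
    "<body><p>one</p><p>two</p></body></html>"
)

root = doc.documentElement            # the root element, <html>
head, body = root.childNodes          # the two children of the root
first_p, second_p = body.childNodes   # sibling <p> nodes under <body>

print("root:", root.tagName)
print("children of root:", [node.tagName for node in root.childNodes])
print("parent of body:", body.parentNode.tagName)
print("p nodes are siblings:", first_p.nextSibling is second_p)
```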

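9.3 CSS selectors are used to locate nodes, and selectors can be nested. For example, #container selects the node whose id is container, .wrapper selects nodes whose class is wrapper, and the nested selector #container .wrapper p selects p nodes inside .wrapper nodes inside #container.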