Answer: web crawler question 1


Outline

Question 1

1, What is the relationship between cookies and JavaScript? How is a cookie generated? What does a cookie contain? In the crawler code, JavaScript functionality is provided through the third-party Python library selenium. Is selenium a scripting language for executing JavaScript?

In other words, what is selenium? A scripting language, or a runtime for a scripting language?

Background of cookie generation

Cookie background and significance: a web application contains multiple pages, and each page corresponds to a url address. Suppose the web browser sends two requests to the web server for two pages. The two pages are requested separately, using two separate HTTP connections. Since HTTP is a stateless protocol, the browser and the web server close the connection after the first request is completed and re-establish a connection for the second request.

Without a cookie, the web server cannot tell which client a request comes from. All requests are treated equally, and a separate connection is established for each of them. Advantage of a stateless protocol: each connection's resources can quickly be reused by other clients. Disadvantage: requests sent by the same user each have to establish a separate connection, which costs extra time.

The cookie records the user's login status and so makes up for the statelessness of HTTP. Setting the cookie's Domain to the parent domain lets the cookie be shared across third-level domains. That leaves one problem: how to share login information across two different domain names, for example between www.taobao.com and www.tmall.com.

HTTP 1.0 closes the connection after each request by default, while HTTP 1.1 supports persistent connections. Both versions are still stateless protocols.

One, cookie

Suppose: the web server takes the information it receives with an http request, such as the client IP (from the IP packet), the request headers (key-value mappings), the encoding (the client's language), the url entered in the browser, and other information, and stores it in some abstract data structure. If that structure is saved in the browser, the structure is the cookie.

Extension: 1, JSESSIONID is a cookie that Servlet containers (tomcat, jetty) use to record the user session

cookie

1, explanation from a web crawler book: a cookie is the credential a site uses to record and track a user's login, including whether the user has logged in. The cookie structure comprises: a token, the login validity period, and state tracking information

2, explanation from https://www.jianshu.com/p/6fc9cea6daa2: the http protocol itself is stateless. The web client uses the http protocol to interact with the web server. From each submitted request the web server can obtain the client IP and deliver the requested resources to the corresponding web client, but the web server cannot track the web client.

When the user enters a url address (uniform resource locator) in the web browser, the browser sends a request to the web server, and the web server issues a Cookie to the client browser in its response. The Cookie is saved in the web browser. When the web browser requests the site again, it submits the Cookie to the web server along with the requested URL. The web server checks the Cookie in order to track user state.
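The round trip described above can be sketched with the Python standard library alone. This is a minimal illustration, not how any particular site works: the cookie name `visitor`, its value `abc123`, and the helper `run_demo` are made up for the example, and `http.cookiejar.CookieJar` stands in for the browser's cookie store.

```python
import http.cookiejar
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class CookieHandler(BaseHTTPRequestHandler):
    """Toy server: issues a cookie on the first visit, recognizes it afterwards."""

    def do_GET(self):
        cookie_header = self.headers.get("Cookie")
        self.send_response(200)
        if cookie_header is None:
            # No Cookie header in the request: issue one in the response.
            self.send_header("Set-Cookie", "visitor=abc123; Path=/")
            body = b"new visitor"
        else:
            body = ("returning visitor: " + cookie_header).encode("utf-8")
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the demo quiet


def run_demo():
    """Send two requests through a cookie-aware client and return both bodies."""
    server = HTTPServer(("127.0.0.1", 0), CookieHandler)  # port 0: any free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    url = "http://127.0.0.1:%d/" % server.server_port

    jar = http.cookiejar.CookieJar()  # plays the role of the browser
    opener = urllib.request.build_opener(
        urllib.request.ProxyHandler({}),  # ignore any system proxy settings
        urllib.request.HTTPCookieProcessor(jar),
    )
    try:
        first = opener.open(url).read().decode("utf-8")
        second = opener.open(url).read().decode("utf-8")
    finally:
        server.shutdown()
        server.server_close()
    return first, second


if __name__ == "__main__":
    print(run_demo())
```

The first response carries the Set-Cookie header; on the second request the client sends the stored cookie back, which is the only reason the server can tell the two requests belong together.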

The structure of the cookie

1, analysis from https://blog.csdn.net/talking12391239/article/details/9665185:

  Cookie structure: Set-Cookie: NAME=VALUE; Expires=DATE; Path=PATH; Domain=DOMAIN_NAME; Secure

NAME=VALUE: NAME is the Cookie name, VALUE is the value of the Cookie. The string "NAME=VALUE" must not contain semicolons, commas, or spaces.

Expires=DATE: the Expires variable determines the date on which the Cookie stops being valid. The value DATE must be written in a specific format: day of the week, DD-Mon-YY HH:MM:SS GMT, where the month is a three-letter abbreviation and GMT indicates Greenwich Mean Time. If it is not written in this format, the system will not recognize it.

Note: the Expires variable can be omitted. If it is omitted, the Cookie is not saved on the user's hard drive but only kept in memory, and it disappears automatically when the browser is closed.

Domain=DOMAIN_NAME: the Domain variable determines which web servers in an Internet domain can read the Cookie stored by the browser, that is, only pages from this domain can use the information in the Cookie. This setting is optional. If it is omitted, the value defaults to the domain name of the web server that set the Cookie. In the older cookie specifications, an explicitly specified domain must begin with a dot (.), for example: .cnblogs.com

NOTE: an Internet domain can contain multiple web servers

Path=PATH: the Path attribute defines which paths on the web server can receive the Cookie that the server set. In general, when the path portion of the URL entered by the user begins with the string defined by the Path attribute, the browser considers the check passed. If the value of the Path attribute is "/", all WWW resources on the web server can read the Cookie. This setting is also optional. If it is omitted, the Path value defaults to the path of the resource that the web server delivered to the browser.
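The Domain and Path checks described above can be sketched as two small functions. This is an illustrative approximation rather than a browser's actual implementation: the function names are made up, and the path check adds a segment-boundary rule (as in RFC 6265) on top of the plain prefix comparison the text describes.

```python
def domain_matches(request_host: str, cookie_domain: str) -> bool:
    """Return True if a cookie scoped to cookie_domain applies to request_host."""
    host = request_host.lower()
    domain = cookie_domain.lower().lstrip(".")  # ".cnblogs.com" -> "cnblogs.com"
    return host == domain or host.endswith("." + domain)


def path_matches(request_path: str, cookie_path: str) -> bool:
    """Return True if request_path falls under cookie_path."""
    if cookie_path == "/" or request_path == cookie_path:
        return True
    if not request_path.startswith(cookie_path):
        return False
    # Require a segment boundary so "/docs" does not match "/docsearch".
    return cookie_path.endswith("/") or request_path[len(cookie_path)] == "/"


if __name__ == "__main__":
    print(domain_matches("news.cnblogs.com", ".cnblogs.com"))
    print(domain_matches("www.tmall.com", ".taobao.com"))
    print(path_matches("/docs/index.html", "/docs"))
```

The second call shows why the taobao/tmall case needs something beyond the Domain attribute: the two hosts share no parent domain, so no Domain value covers both.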

Secure: the Secure variable means that the browser submits the Cookie to the server only when the communication protocol between the web browser and the web server is an encrypted, authenticated protocol. At present there is only one such protocol, namely HTTPS.

max-age: has the same effect as expires. It tells the browser how long until the cookie expires (in seconds), rather than giving a fixed point in time. Normally max-age takes priority over expires. A negative maxAge means the Cookie is a temporary Cookie that is not persisted: it is valid only in the current browser window and the child windows opened from it, and it becomes invalid as soon as the browser is closed. Setting maxAge to 0 removes the Cookie immediately.
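To see these attributes on a concrete header, the standard-library `http.cookies.SimpleCookie` class can parse a Set-Cookie value. A minimal sketch: the helper name `parse_set_cookie` and the sample cookie (`sessionid=abc123`, the date, the 3600 seconds) are invented for illustration.

```python
from http.cookies import SimpleCookie


def parse_set_cookie(header_value: str) -> dict:
    """Parse one Set-Cookie value and return its name, value, and attributes."""
    cookie = SimpleCookie()
    cookie.load(header_value)
    name, morsel = next(iter(cookie.items()))
    return {
        "name": name,
        "value": morsel.value,
        "expires": morsel["expires"],
        "path": morsel["path"],
        "domain": morsel["domain"],
        "max_age": morsel["max-age"],
        "secure": bool(morsel["secure"]),  # flag attribute, no value in the header
    }


if __name__ == "__main__":
    sample = (
        "sessionid=abc123; Expires=Wed, 09 Jun 2027 10:18:14 GMT; "
        "Path=/; Domain=.cnblogs.com; Secure; Max-Age=3600"
    )
    for key, value in parse_set_cookie(sample).items():
        print(key, "=", value)
```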

Send a cookie

  The web server sends COOKIE information as key-value pairs in the Set-Cookie response header. RFC2109 defines the format of the Set-Cookie response header as:

Set-Cookie: Name=Value; Comment=value; Domain=value; Max-Age=value; Path=value; Secure; Version=1*DIGIT

Note: 1, the Name=Value pair must appear first; after it, the attribute-value pairs may appear in any order

   2, the Comment attribute is optional. Because a Cookie may contain the user's private information, this attribute lets the server describe what the Cookie is used for, and the user can inspect that description

   3, the web browser stores cookies that have the same domain and path in one file, with the cookies separated by *

   4, a cookie with multiple sub-keys has the format name=key1=value1&key2=value2. It can be understood as a single key whose value is a string holding several custom key-value pairs
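A short sketch of both directions, using only the Python standard library: building a Set-Cookie header on the server side, and splitting a multi-sub-key cookie as in note 4. The helper names `build_set_cookie` and `split_subkeys`, and the sample names and values, are hypothetical.

```python
from http.cookies import SimpleCookie
from urllib.parse import parse_qsl


def build_set_cookie(name: str, value: str, max_age: int, path: str = "/") -> str:
    """Return a complete 'Set-Cookie: ...' header line for one cookie."""
    cookie = SimpleCookie()
    cookie[name] = value              # the Name=Value pair always comes first
    cookie[name]["max-age"] = max_age
    cookie[name]["path"] = path
    cookie[name]["secure"] = True     # flag attribute, emitted without a value
    return cookie.output()


def split_subkeys(pair: str):
    """Split 'name=key1=value1&key2=value2' into the name and a dict of sub-keys."""
    name, _, raw = pair.partition("=")  # split on the first '=' only
    return name, dict(parse_qsl(raw))


if __name__ == "__main__":
    print(build_set_cookie("token", "abc123", 3600))
    print(split_subkeys("user=key1=value1&key2=value2"))
```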

Client access to cookies, server-side parsing of cookies

1, the server parses cookies: cookies can be set under different domains and paths, so the same name can be repeated across different domains and paths. The browser orders the cookies it sends by how well they match the requested url or the current page address, with the best match first.

2, client access to cookies: the web browser manages cookies behind the scenes. Developers can access cookies in JavaScript through document.cookie, but if a cookie has the HttpOnly attribute set, a js script cannot read it through document.cookie. document.cookie also behaves differently depending on how it is used.

2.1, assigning to document.cookie does not overwrite the other cookies. An existing cookie is replaced only when the name, domain, and path of the new cookie all match it.
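The behaviour in 2 and 2.1 can be modelled with a toy Python class. This is a simulation only, not a browser API: the class `CookieStore` and its methods are invented for the example, and the rule that a script cannot replace an HttpOnly cookie is taken from RFC 6265 rather than from the text above.

```python
class CookieStore:
    """Toy model of a browser cookie store as seen through document.cookie."""

    def __init__(self):
        # (name, domain, path) -> (value, http_only)
        self._cookies = {}

    def set_from_server(self, name, value, domain, path="/", http_only=False):
        """Store a cookie received in a Set-Cookie response header."""
        self._cookies[(name, domain, path)] = (value, http_only)

    def set_from_script(self, name, value, domain, path="/"):
        """Model `document.cookie = "name=value"`: touches one cookie only."""
        key = (name, domain, path)
        existing = self._cookies.get(key)
        if existing is not None and existing[1]:
            return  # a script may not replace an HttpOnly cookie
        self._cookies[key] = (value, False)

    def document_cookie(self):
        """Model reading document.cookie: HttpOnly cookies are hidden."""
        return "; ".join(
            f"{name}={value}"
            for (name, _domain, _path), (value, http_only) in self._cookies.items()
            if not http_only
        )


if __name__ == "__main__":
    store = CookieStore()
    store.set_from_server("sid", "secret", "example.com", http_only=True)
    store.set_from_script("theme", "dark", "example.com")
    store.set_from_script("lang", "en", "example.com")
    print(store.document_cookie())
```

In a selenium-driven crawler, the real values can be read with `driver.execute_script("return document.cookie")`, which sees only what a page script sees, or with `driver.get_cookies()`, which returns the cookies known to the browser for the current page.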

Cookie Security

       The data in a cookie usually contains the user's private data. Cookie security means, first, keeping the data confidential and, second, ensuring that the data cannot be forged or tampered with. For both reasons the cookie content needs to be encrypted. The encryption method is either symmetric encryption (a single key, such as DES) or asymmetric encryption (a pair of keys, such as RSA). The key must be kept in a safe place on the server side. Without the key, the data cannot be decrypted, and it cannot be forged or altered.

Important cookies need to be set to HttpOnly so that cross-site scripting cannot obtain them, which protects the cookie inside the browser. A cookie can also be restricted to the secure protocol (https): setSecure(boolean flag) in the Cookie class of JavaEE makes the cookie be sent only over https and never over http, which protects the cookie on its way to the server.
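The Python standard library has no DES or RSA cipher, so the sketch below swaps in a different technique: signing the cookie value with HMAC-SHA256. It covers only the "cannot be forged or tampered with" requirement and does not hide the value, so it is not a replacement for the encryption described above. `SECRET_KEY`, `sign_value`, and `verify_value` are hypothetical names.

```python
import hashlib
import hmac

# Hypothetical key for the example; a real key stays on the server only.
SECRET_KEY = b"server-side-secret"


def sign_value(value: str) -> str:
    """Append an HMAC-SHA256 signature so tampering can be detected."""
    digest = hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return f"{value}.{digest}"


def verify_value(signed: str):
    """Return the original value if the signature is valid, otherwise None."""
    value, sep, digest = signed.rpartition(".")
    if not sep:
        return None  # no signature part at all
    expected = hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()
    # compare_digest avoids leaking information through comparison timing
    if hmac.compare_digest(digest.encode("utf-8"), expected.encode("utf-8")):
        return value
    return None


if __name__ == "__main__":
    signed = sign_value("user=42")
    print(verify_value(signed))
    print(verify_value("user=43" + signed[len("user=42"):]))
```

A client that changes `user=42` to `user=43` cannot recompute the signature without the server's key, so the server rejects the altered cookie.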

Two, selenium


Origin www.cnblogs.com/yinminbo/p/12014453.html