Python Web Crawlers (1) - Basic Introduction

1. What is a web crawler

Concept: a web crawler is a program that simulates a browser accessing the Internet in order to fetch the information you want.

1.1 The value of crawlers

  • Collecting dedicated data sets
  • Turning data into products and commercializing them

1.2 Legality

  • Crawling in itself is not prohibited by law
  • Benign (good-faith) crawlers
  • Malicious crawlers

Risks that crawlers bring

  • Interfering with the normal operation of the site being crawled
  • Crawling specific types of data or information that are protected by law

How to avoid breaking the law

  • Strictly comply with the site's robots protocol
  • When working around anti-crawling measures, also optimize your code so that it does not affect the normal operation of the site
  • Before using or distributing crawled information, review the content; if it contains users' personal information, private data, or trade secrets, stop crawling promptly
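
One common way to keep a crawler from disturbing a site is to enforce a minimum pause between requests. The following is a minimal sketch of that idea; the `Throttle` class and its parameters are hypothetical names chosen for illustration, not part of any library.

```python
import time


class Throttle:
    """Enforce a minimum interval (in seconds) between consecutive requests."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        # clock and sleep are injectable so the class can be tested without waiting
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None

    def wait(self):
        """Sleep if the previous request was too recent; return the time slept."""
        waited = 0.0
        if self._last is not None:
            remaining = self.min_interval - (self._clock() - self._last)
            if remaining > 0:
                self._sleep(remaining)
                waited = remaining
        self._last = self._clock()
        return waited
```

Calling `throttle.wait()` right before each request would then space the requests at least `min_interval` seconds apart.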

1.3 Types of crawlers

  • General-purpose crawlers: these are an important part of the "crawling system" of a search engine (Baidu, Google, Yahoo, etc.). Their main purpose is to download web pages from the Internet to local storage, forming a mirror backup of Internet content. Simply put, they download as many pages as possible, store them on local servers as a backup, process those pages (extract keywords, strip ads), and finally provide users with a search interface.
  • Focused crawlers: these crawl only the specified data on the network according to specified requirements. For example, fetching only the movie titles and reviews on Douban instead of all the data on the entire page.
  • Incremental crawlers: these detect when a site's data has been updated and crawl only the updated data.
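
The incremental idea can be sketched by remembering a fingerprint of each page and reporting only pages whose fingerprint has changed. This is one possible approach under simple assumptions (the page contents are already downloaded and passed in as strings); the function names are hypothetical.

```python
import hashlib


def fingerprint(content):
    """Return a stable SHA-256 fingerprint of a page's text."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


def find_updates(pages, seen):
    """Return the URLs whose content is new or changed.

    pages: dict mapping URL -> page text fetched in this run
    seen:  dict mapping URL -> fingerprint from earlier runs (updated in place)
    """
    changed = []
    for url, content in pages.items():
        digest = fingerprint(content)
        if seen.get(url) != digest:
            changed.append(url)
            seen[url] = digest
    return changed
```

On the first run every URL is reported; on later runs only the pages whose text differs from the stored fingerprint are returned.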

1.4 Spear and shield

  • Anti-crawling mechanisms
  • Anti-anti-crawling strategies
  • robots protocol

The details of a site's robots protocol can be viewed by visiting the site's domain name + /robots.txt
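
Python's standard library can evaluate robots.txt rules through `urllib.robotparser`. The sketch below parses a made-up robots.txt (the rules, the `my-crawler` user agent, and the example.com URLs are all assumptions for illustration) and asks whether two URLs may be fetched.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real one is served at <domain>/robots.txt
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""


def build_parser(robots_text):
    """Build a parser from robots.txt text that is already in memory."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


parser = build_parser(SAMPLE_ROBOTS)
allowed_home = parser.can_fetch("my-crawler", "https://example.com/index.html")
allowed_private = parser.can_fetch("my-crawler", "https://example.com/private/data.html")
print(allowed_home, allowed_private)
```

Against a live site, `parser.set_url(...)` followed by `parser.read()` downloads the file instead of parsing a local string.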

2. HTTP protocol

2.1 What is HTTP?

Official concept: HTTP stands for Hypertext Transfer Protocol. It is the protocol used to transfer hypertext from World Wide Web (WWW) servers to the local browser.

Plain-language concept: HTTP is a form of data interaction (data is transmitted in both directions) between a server and a client.

2.2 How HTTP works

HTTP works on a client-server architecture. The browser, acting as the HTTP client, sends requests to the HTTP server (the web server) via a URL. The web server processes the received request and sends the response back to the client.
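
The request/response cycle can be observed without touching the Internet by running a tiny server on the local machine and sending it one request. This is only a sketch using the standard library; `HelloHandler` and `fetch_once` are hypothetical names, and the response body is made up.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class HelloHandler(BaseHTTPRequestHandler):
    """Server side: answer every GET request with a short plain-text body."""

    def do_GET(self):
        body = b"hello from the server"
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the demo output quiet


def fetch_once():
    """Start a local server, send one GET request as the client, return the reply."""
    server = HTTPServer(("127.0.0.1", 0), HelloHandler)  # port 0: pick a free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        conn = http.client.HTTPConnection("127.0.0.1", server.server_port, timeout=5)
        conn.request("GET", "/")
        response = conn.getresponse()
        result = (response.status, response.getheader("Content-Type"), response.read())
        conn.close()
        return result
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    print(fetch_once())
```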

2.2.1 Common request headers

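Accept: tells the server which data types the browser supports
Accept-Charset: tells the server which character sets the browser supports
Accept-Encoding: tells the server which compression formats the browser supports
Accept-Language: tells the server the browser's language environment
Host: tells the server which host the browser wants to access
If-Modified-Since: tells the server the time of the browser's cached copy
Referer: tells the server which page the client came from (used for hotlink protection)
Connection: tells the server whether to close the connection or keep it alive after the request completes
X-Requested-With: XMLHttpRequest indicates that the request was made via Ajax
User-Agent: the identity of the request carrier (the client making the request)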
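
A crawler usually sets some of these headers itself so that its requests resemble those of a browser. The sketch below builds a request object with `urllib.request.Request` without sending it; the header values and the example.com URL are placeholders, not recommendations.

```python
from urllib.request import Request


def build_request(url):
    """Build (but do not send) a GET request carrying browser-like headers."""
    headers = {
        "User-Agent": "Mozilla/5.0 (compatible; example-crawler/0.1)",
        "Accept": "text/html",
        "Accept-Language": "en-US,en;q=0.9",
        "Referer": "https://example.com/",
    }
    return Request(url, headers=headers)


req = build_request("https://example.com/page")
# urllib stores header names capitalized, e.g. "User-agent"
for name, value in req.header_items():
    print(f"{name}: {value}")
```

Passing `req` to `urllib.request.urlopen` would actually send it; the Host header is filled in by urllib at that point.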

2.2.2 Common response headers

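Location: tells the browser where to redirect to
Server: tells the browser the type of server software
Content-Encoding: tells the browser the compression format of the data
Content-Length: tells the browser the length of the returned data
Content-Language: tells the browser the language environment
Content-Type: tells the browser the type of the returned data
Refresh: tells the browser to refresh at a timed interval
Content-Disposition: tells the browser to open the data as a download
Transfer-Encoding: tells the browser that the data is returned in chunks
Expires: -1 tells the browser not to cache
Cache-Control: no-cache tells the browser not to cache
Pragma: no-cache tells the browser not to cache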
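
To show how a client might act on response headers, the sketch below parses a made-up header block with the standard library's `email` parser (HTTP headers share the same `Name: value` format) and extracts a few decisions from it. The sample values and the `summarize` helper are assumptions for illustration.

```python
from email import message_from_string

# Hypothetical response headers, as they might appear after the status line
RAW_HEADERS = """\
Content-Type: text/html; charset=utf-8
Content-Length: 1024
Content-Encoding: gzip
Cache-Control: no-cache
Location: https://example.com/new-page
"""


def summarize(raw):
    """Pull out the header values a crawler typically cares about."""
    msg = message_from_string(raw)
    return {
        "charset": msg.get_content_charset(),           # how to decode the body
        "length": int(msg["Content-Length"]),           # body size in bytes
        "compressed": msg.get("Content-Encoding") == "gzip",
        "cacheable": "no-cache" not in msg.get("Cache-Control", ""),
        "redirect_to": msg.get("Location"),             # where to go next, if set
    }


info = summarize(RAW_HEADERS)
print(info)
```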

2.2.3 HTTPS

HTTPS (Hypertext Transfer Protocol Secure) adds an SSL/TLS encryption layer on top of HTTP and encrypts the data in transit. It is the secure version of the HTTP protocol.
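
On the client side, Python exposes this encryption layer through the `ssl` module. As a small sketch, the default context created below is configured to require a valid server certificate and to check that it matches the host name; `make_client_context` is a hypothetical helper name.

```python
import ssl


def make_client_context():
    """Create a TLS context with the library's default, verifying settings."""
    return ssl.create_default_context()


context = make_client_context()
print(context.verify_mode == ssl.CERT_REQUIRED)  # server certificate must validate
print(context.check_hostname)                    # certificate must match the host
```

Such a context can be passed as the `context` argument of `urllib.request.urlopen` or `http.client.HTTPSConnection`.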

HTTPS encryption schemes

  • Symmetric key encryption: when the client sends a message to the server, it first encrypts the message with a known symmetric algorithm (such as AES). To decrypt the message, the receiver needs the key, which must itself be passed along in transit (the encryption key and the decryption key are the same). This approach seems safe, but it is still risky: if the connection is eavesdropped on or the traffic is hijacked, the key can be captured and the message broken. "Shared key encryption" therefore carries a security risk.
  • Asymmetric key encryption: asymmetric encryption uses two keys, a "private key" and a "public key". The server first tells the client to encrypt with the public key it provides; the client encrypts the message with that public key, and the server decrypts what it receives with its own private key. The advantage is that the decryption key is never transmitted, which avoids the risk of it being hijacked. Even if an eavesdropper obtains the public key, decryption is still hard, because it requires solving a problem such as the discrete logarithm, which is not easy to do.
  • Certificate-based key encryption: one shortcoming of asymmetric encryption is that the public key itself may be hijacked in transit, so there is no guarantee that the public key the client receives is the one the server issued. This is what the public key certificate mechanism addresses. A certificate authority is a third party trusted by both the client and the server. The certificate-based communication process is as follows:
    • The server's developer submits the public key to the certificate authority. After verifying the applicant's identity and approving the application, the certificate authority digitally signs the public key, then issues a certificate in which the public key and the signature are bound together.
    • The server sends this digital certificate to the client. Because the client also trusts the certificate authority, it can verify the digital signature in the certificate to confirm that the public key sent by the server is genuine. Under normal circumstances the digital signature is hard to forge; this depends on the credibility of the certificate authority. Once the check succeeds, the client encrypts its message with the public key and sends it, and the server decrypts it with its private key.
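
The public-key and certificate steps above can be illustrated with textbook RSA on tiny primes. This is a toy sketch only: the primes, helper names, and message are made up, the numbers are far too small to be secure, and real HTTPS uses vetted TLS libraries with padding and much larger keys.

```python
import hashlib


def make_keypair(p, q, e):
    """Return (public_key, private_key) for textbook RSA with primes p and q."""
    n = p * q
    phi = (p - 1) * (q - 1)
    d = pow(e, -1, phi)  # modular inverse of e; requires Python 3.8+
    return (e, n), (d, n)


def apply_key(value, key):
    """Raise value to the key's exponent modulo the key's modulus."""
    exponent, modulus = key
    return pow(value, exponent, modulus)


def digest_of(public_key, modulus):
    """Hash a public key down to a number smaller than the modulus."""
    data = repr(public_key).encode("utf-8")
    return int.from_bytes(hashlib.sha256(data).digest(), "big") % modulus


# The server and the certificate authority (CA) each own a key pair.
server_public, server_private = make_keypair(61, 53, 17)
ca_public, ca_private = make_keypair(89, 97, 17)

# Step 1: the CA signs a digest of the server's public key with its private key.
digest = digest_of(server_public, ca_public[1])
signature = apply_key(digest, ca_private)

# Step 2: the client checks the signature using the CA's public key.
certificate_ok = apply_key(signature, ca_public) == digest

# Step 3: the client encrypts with the server's public key;
# only the server's private key can decrypt.
message = 65
ciphertext = apply_key(message, server_public)
recovered = apply_key(ciphertext, server_private)

print(certificate_ok, recovered == message)
```

Note that the private keys are never sent anywhere in this flow; only the public key and its signature travel to the client.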

Origin www.cnblogs.com/drfung/p/11797164.html