Introduction to Python crawlers

What is a crawler?

  • A crawler is a program that simulates a browser surfing the Internet in order to fetch data from it.
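
As a rough illustration of the idea, the sketch below uses only Python's standard library to request a URL and read the response body, much as a browser would. The function name `fetch_page` and the `demo-crawler/0.1` User-Agent string are made up for this example, and a `data:` URL stands in for a real page so the snippet runs without network access.

```python
import urllib.request


def fetch_page(url: str, timeout: float = 10.0) -> str:
    """Request a URL and return the response body as text."""
    # "demo-crawler/0.1" is a hypothetical User-Agent chosen for this sketch.
    request = urllib.request.Request(url, headers={"User-Agent": "demo-crawler/0.1"})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        # Use the charset declared by the response, falling back to UTF-8.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


# A data: URL keeps the sketch offline; swap in a page you are allowed to crawl.
page = fetch_page("data:text/html,<h1>Hello</h1>")
print(page)
```

Real crawlers usually hand the returned text to a parser next; this sketch stops at the download step.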

The value of crawlers

  • Practical application
    • Crawl data from the Internet for your own use.
  • Employment

Are crawlers legal or illegal?

  • Not prohibited by law
  • Risk of illegality
  • Well-meaning crawlers/malicious crawlers

The risks brought by crawlers can be reflected in the following two aspects

  • The crawler interferes with the normal operation of the visited website
  • The crawler collects specific types of data or information protected by law

How to avoid breaking the law in the process of writing crawlers

  • Optimize your program from time to time to avoid interference with the normal operation of the visited website
  • Review the captured content before using or disseminating crawled data. If it contains sensitive content such as user privacy or trade secrets, stop crawling or disseminating it immediately
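
One common way to avoid disturbing a website is to space requests out. The sketch below is one possible approach, not a prescribed one: the class name `PoliteFetcher` and the `min_delay` parameter are invented for illustration, and the wrapped `fetch` callable is whatever download function your crawler already uses.

```python
import time


class PoliteFetcher:
    """Wrap a fetch function so consecutive calls are at least min_delay seconds apart."""

    def __init__(self, fetch, min_delay=1.0, clock=time.monotonic, sleep=time.sleep):
        self._fetch = fetch
        self._min_delay = min_delay
        self._clock = clock      # injectable so the delay logic can be tested
        self._sleep = sleep
        self._last = None        # time of the previous request, if any

    def __call__(self, url):
        if self._last is not None:
            wait = self._min_delay - (self._clock() - self._last)
            if wait > 0:
                self._sleep(wait)  # pause instead of hammering the server
        self._last = self._clock()
        return self._fetch(url)


# Usage with a stand-in fetch function (no network involved).
fetch = PoliteFetcher(lambda url: f"fetched {url}", min_delay=0.1)
print(fetch("https://example.com/a"))
print(fetch("https://example.com/b"))  # waits about 0.1 s before running
```

A suitable delay depends on the site; one second or more per request is a cautious starting point.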

Classification of crawlers in usage scenarios

  • General crawler
    • An important part of a crawling system. It crawls entire pages of data.
  • Focused crawler
    • Built on top of a general crawler. It crawls only specific parts of a page.
  • Incremental crawler
    • Monitors a website for data updates and crawls only the newly updated data.
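
To make the incremental idea concrete, here is a minimal sketch that remembers a fingerprint of each page and reports whether the content is new or changed. Hashing the body with SHA-256 is just one way to detect updates, and the class name `IncrementalStore` is an assumption made for this example.

```python
import hashlib


class IncrementalStore:
    """Remember a fingerprint per URL so unchanged pages can be skipped."""

    def __init__(self):
        self._fingerprints = {}  # url -> SHA-256 hex digest of the last seen body

    def is_new_or_changed(self, url: str, content: bytes) -> bool:
        digest = hashlib.sha256(content).hexdigest()
        if self._fingerprints.get(url) == digest:
            return False  # same content as last time: nothing to crawl
        self._fingerprints[url] = digest
        return True


store = IncrementalStore()
print(store.is_new_or_changed("https://example.com/news", b"v1"))  # True
print(store.is_new_or_changed("https://example.com/news", b"v1"))  # False
print(store.is_new_or_changed("https://example.com/news", b"v2"))  # True
```

A real incremental crawler would persist the fingerprints (for example in a database) between runs.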

Anti-crawling mechanism

  • Portal websites can prevent crawlers from crawling their data by adopting suitable strategies or technical measures.

Anti-anti-crawling strategy

  • A crawler can in turn adopt strategies or technical measures to get around a portal website's anti-crawling mechanism and obtain the data it wants.

robots.txt protocol (a gentleman's agreement)

  • It specifies which data on a website may be crawled and which may not. Compliance is voluntary: the file does not technically block access.
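
Python's standard library can read robots.txt rules through `urllib.robotparser`. The sketch below parses an inline example file rather than downloading one; the rules, the `demo-crawler` agent name, and the example.com URLs are all made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt: every crawler is asked to stay out of /private/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())  # parse text instead of fetching a URL
    return parser


parser = build_parser(ROBOTS_TXT)
print(parser.can_fetch("demo-crawler", "https://example.com/index.html"))         # True
print(parser.can_fetch("demo-crawler", "https://example.com/private/data.html"))  # False
```

Against a live site you would instead call `set_url()` with the site's robots.txt address followed by `read()`, then check `can_fetch()` before each request.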

HTTP protocol

  • Concept: a form of data exchange between a server and a client.

Common request header information

  • User-Agent: identifies the client software (the request carrier) making the request
  • Connection: whether to close the connection or keep it alive after the request completes
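
As a small illustration, these two headers can be set on a request object with `urllib.request.Request`. The helper name `build_request` and the User-Agent value are assumptions for this sketch; the request is only built and inspected, never sent.

```python
import urllib.request


def build_request(url: str, user_agent: str, keep_alive: bool = True) -> urllib.request.Request:
    headers = {
        "User-Agent": user_agent,
        "Connection": "keep-alive" if keep_alive else "close",
    }
    return urllib.request.Request(url, headers=headers)


request = build_request("https://example.com/", "demo-crawler/0.1", keep_alive=False)
# urllib stores header names via str.capitalize(), hence the key "User-agent".
print(request.get_header("User-agent"))
print(request.get_header("Connection"))
```

Note that `urllib.request` itself sends `Connection: close` when it actually opens a URL, so keeping connections alive in practice needs a different client such as `http.client` or a third-party library.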

Common response header information

  • Content-Type: the type of data the server sends back to the client
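
A crawler typically inspects the Content-Type header to decide how to decode a response. The sketch below parses a header value with the standard library's `email.message.Message`; the helper name `parse_content_type` and the sample values are made up for this example.

```python
from email.message import Message


def parse_content_type(value: str):
    """Split a Content-Type header value into (media type, charset or None)."""
    message = Message()
    message["Content-Type"] = value
    return message.get_content_type(), message.get_content_charset()


print(parse_content_type("text/html; charset=UTF-8"))  # ('text/html', 'utf-8')
print(parse_content_type("application/json"))          # ('application/json', None)
```

When no charset is declared, the crawler has to pick a fallback encoding itself, commonly UTF-8.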

HTTPS protocol

  • Secure Hypertext Transfer Protocol: HTTP carried over an encrypted SSL/TLS connection
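
In Python, HTTPS connections are handled by the `ssl` module. The short sketch below only creates the default client-side context and prints two of its settings; it opens no connection.

```python
import ssl

# The default context is what a client normally uses for HTTPS.
context = ssl.create_default_context()

# The server must present a certificate that chains to a trusted authority...
print(context.verify_mode == ssl.CERT_REQUIRED)  # True
# ...and the certificate must match the host name being contacted.
print(context.check_hostname)                    # True
```

Turning these checks off makes a crawler accept forged certificates, so it is best avoided outside of local testing.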

Encryption

  • Symmetric key encryption: The client encrypts the message with a key, using a known symmetric algorithm such as AES, and sends it to the server. (MD5 and Base64 are not encryption: MD5 is a one-way hash and Base64 is a keyless encoding.) The receiver needs the same key to decrypt the message, so the key has to be passed to the receiver as well. This looks safe, but if the transmission is eavesdropped on or hijacked, an attacker may obtain the key and decrypt the message. Therefore, "shared key encryption" carries a security risk.

  • Asymmetric key encryption

    • Asymmetric key encryption uses two keys: a "private key" and a "public key". The server first sends its public key to the client, and the client encrypts its message with that public key. The server then decrypts the received message with its own private key. The advantage is that the decryption key is never transmitted, which avoids the risk of it being hijacked. Even if an eavesdropper obtains the public key, decryption is hard, because recovering the private key requires solving a computationally hard problem such as integer factorization (RSA) or the discrete logarithm.

    • However, the asymmetric key encryption technology also has the following disadvantages:

      • The first: the client cannot be sure that the public key it receives is really the one the server sent and was not replaced in transit. Whenever a key is sent, it may be hijacked.
      • The second: asymmetric encryption is relatively slow and more complex to process, which reduces communication speed.
  • Certificate key encryption: As noted above, the first disadvantage of asymmetric encryption is that the public key may be hijacked, so the client cannot be sure that the public key it receives was issued by the server. The public key certificate mechanism addresses this. A digital certificate authority is a third-party organization trusted by both the client and the server. The certificate is issued and used as follows:

    • The server's developer submits the public key to the certificate authority. After verifying the applicant's identity and approving the application, the authority digitally signs the public key and issues a certificate that binds the signed public key to the applicant.
    • The server sends this digital certificate to the client. Because the client also trusts the certificate authority, it can use the digital signature in the certificate to verify that the public key really belongs to the server. Digital signatures are hard to forge, and trust in them rests on the credibility of the certificate authority. Once the certificate is verified, the client encrypts its message with the public key, and the server decrypts it with its private key.
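
The public key and certificate steps above can be sketched with textbook RSA. This is a toy for illustration only: it uses small well-known primes, no padding, and invented names (`make_keypair`, `sign`, `verify`, the `certificate` bytes), so it must not be used for real security. It assumes Python 3.8+ for `pow(e, -1, phi)`.

```python
import hashlib


def make_keypair(p: int, q: int, e: int = 65537):
    """Return (public_key, private_key) for textbook RSA with primes p and q."""
    n = p * q
    phi = (p - 1) * (q - 1)
    d = pow(e, -1, phi)  # modular inverse of e (Python 3.8+)
    return (e, n), (d, n)


def encrypt(public_key, m: int) -> int:
    e, n = public_key
    return pow(m, e, n)


def decrypt(private_key, c: int) -> int:
    d, n = private_key
    return pow(c, d, n)


def _digest(data: bytes, n: int) -> int:
    return int.from_bytes(hashlib.sha256(data).digest(), "big") % n


def sign(private_key, data: bytes) -> int:
    d, n = private_key
    return pow(_digest(data, n), d, n)


def verify(public_key, data: bytes, signature: int) -> bool:
    e, n = public_key
    return pow(signature, e, n) == _digest(data, n)


# Toy keys built from Mersenne primes; real keys are far larger and random.
server_public, server_private = make_keypair(2**61 - 1, 2**31 - 1)
ca_public, ca_private = make_keypair(2**89 - 1, 2**107 - 1)

# The authority signs the server's identity and public key (a made-up format).
certificate = f"example.com|{server_public[0]}|{server_public[1]}".encode()
signature = sign(ca_private, certificate)

# The client checks the signature with the authority's public key...
print(verify(ca_public, certificate, signature))  # True
# ...then encrypts with the server's public key; only the server can decrypt.
secret = int.from_bytes(b"hello", "big")
ciphertext = encrypt(server_public, secret)
print(decrypt(server_private, ciphertext).to_bytes(5, "big"))  # b'hello'
```

In real HTTPS the asymmetric step is used only to agree on a session key, and the bulk of the traffic is then protected with faster symmetric encryption.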


Origin blog.csdn.net/Han_V_Qin/article/details/112999872