What is a Python3 web crawler?

Definition:

A web crawler (also known as a web spider) is a program or script that automatically fetches website information according to certain rules. In essence, a crawler is a program you write to simulate a browser surfing the Internet, which you then send out onto the web to grab data.
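As a minimal illustration (assuming the third-party requests library is installed; the URL is a placeholder), the sketch below fetches a page the way a browser would and prints part of its source:

```python
import requests

# Fetch a page as a browser would; the URL is a placeholder for illustration.
url = "https://www.example.com"
response = requests.get(url)

print(response.status_code)   # HTTP status code, e.g. 200
print(response.text[:200])    # first 200 characters of the page source
```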

The value of crawlers:

Crawlers grab data from the Internet and put it to use for you. With a large amount of data in hand, it is like owning a data bank; the next step is working out how to turn that data into products and commercial value.

Are crawlers legal?

Web crawlers are not prohibited by law, but they do carry legal risk. Generally speaking, crawlers are divided into benign crawlers and malicious crawlers. The risks they bring show up in the following two aspects:

  • The crawler interferes with the normal operation of the visited website

  • The crawler captures specific types of data or information protected by law

So how do we stay out of legal trouble when writing and using crawlers?

  • Always optimize your program so that it does not interfere with the normal operation of the visited website (for example by throttling requests, as sketched after this list)

  • When using or disseminating crawled data, review the content first. If you find sensitive material such as user privacy or trade secrets, stop crawling or disseminating it promptly
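A minimal sketch of "polite" crawling, assuming the requests library and a placeholder list of URLs; the key idea is simply to space out requests so the target site is not overloaded:

```python
import time

import requests

# Placeholder URLs for illustration only.
urls = [
    "https://www.example.com/page1",
    "https://www.example.com/page2",
]

for url in urls:
    response = requests.get(url)
    print(url, response.status_code)
    # Pause between requests so the crawler does not burden the site.
    time.sleep(2)
```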

Classification of crawlers by usage scenario

  • General crawler: an important component of a crawling system; it grabs entire pages of data

  • Focused crawler: built on top of the general crawler; it grabs specific local content from a page (see the sketch after this list)

  • Incremental crawler: detects data updates on a website and crawls only the newly updated data
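As a rough illustration of the difference, the sketch below (assuming the requests library; the URL is a placeholder) first fetches a whole page as a general crawler would, then extracts only the <title> text, the way a focused crawler targets specific local content:

```python
from html.parser import HTMLParser

import requests

class TitleParser(HTMLParser):
    """Collects only the text inside the <title> tag."""
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data

# General-crawler step: grab the whole page (placeholder URL).
page = requests.get("https://www.example.com").text

# Focused-crawler step: extract only the piece we care about.
parser = TitleParser()
parser.feed(page)
print(parser.title)
```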

The crawler's spear and shield

Anti-crawling mechanism: portal websites can prevent crawlers from grabbing their data by adopting corresponding strategies or technical means.

Anti-anti-crawling strategy: a crawler program can break through the anti-crawling mechanisms of a portal website by adopting relevant strategies or technical means, and thereby obtain the site's data.
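One of the most common anti-crawling checks is inspecting the User-Agent request header. A minimal, illustrative counter-measure (assuming the requests library and a placeholder URL) is to send a browser-like User-Agent with the request:

```python
import requests

# A browser-like User-Agent string; many sites reject the default
# "python-requests" identity, so the crawler disguises itself.
headers = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/120.0.0.0 Safari/537.36"
    )
}

response = requests.get("https://www.example.com", headers=headers)
print(response.status_code)
```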

Next, let's look at an important convention in web crawling: the robots.txt protocol. robots.txt is a gentleman's agreement that specifies which data on a website may be crawled and which may not.
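Python's standard library can read robots.txt directly. The sketch below (the URL is a placeholder) checks whether a given path may be fetched:

```python
from urllib.robotparser import RobotFileParser

# Parse the site's robots.txt (placeholder URL for illustration).
rp = RobotFileParser()
rp.set_url("https://www.example.com/robots.txt")
rp.read()

# Ask whether any user agent ("*") may fetch this path.
print(rp.can_fetch("*", "https://www.example.com/some/page"))
```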

HTTP and HTTPS protocols

HTTP protocol: a form of data interaction between a server and a client. Commonly used request headers in HTTP (see the example after these lists):

  • User-Agent: the identity of the request carrier (for example, a particular browser or program)

  • Connection: whether to close or keep the connection alive after the request completes

Commonly used response headers in HTTP:

  • Content-Type: the type of data the server returns to the client

The HTTPS protocol is simply a secure version of HTTP, i.e. HTTP with encryption.
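To make these headers concrete, the sketch below (requests library; placeholder URL; header values are illustrative) sends a request with explicit User-Agent and Connection headers and prints the Content-Type the server returns:

```python
import requests

# Request headers discussed above (values are illustrative).
headers = {
    "User-Agent": "Mozilla/5.0",  # identity of the request carrier
    "Connection": "close",        # close the connection after the request
}

response = requests.get("https://www.example.com", headers=headers)

# Response header discussed above: the type of data the server returned.
print(response.headers.get("Content-Type"))
```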
