Python web crawler from 0 to 1 (2): characteristics, problems and specifications of web crawlers

  While web crawlers make it convenient for users to discover and collect information on the network, they also bring problems large and small, and can even harm network security. Therefore, before we really start studying web crawlers, we need to understand their characteristics, the problems they cause, and the specifications that must be followed when developing and using them.

Web Crawler

Size classification of web crawlers

  • Small scale: small data volume, insensitive to crawling speed; crawlers of this kind account for the vast majority. Purpose: crawling individual web pages and exploring their content. Typically implemented with the Requests library.
  • Medium scale: large data volume, more sensitive to crawling speed. Purpose: crawling whole websites or series of websites. Typically implemented with the Scrapy framework.
  • Large scale: extremely large data volume and scale, mostly used by search engines, where crawling speed is critical. Purpose: crawling all websites on the Internet. Requires a custom-built system.
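
  As a concrete illustration of the small-scale case, here is a minimal sketch of fetching a single page with the Requests library; the URL is only a placeholder and the error handling shown is one reasonable choice, not the only one.

import requests

# Minimal small-scale crawl: fetch one page with the Requests library.
# The URL is a placeholder for illustration; replace it with a real target.
url = "https://example.com"
try:
    response = requests.get(url, timeout=10)        # avoid hanging indefinitely
    response.raise_for_status()                     # raise on HTTP 4xx/5xx status codes
    response.encoding = response.apparent_encoding  # guess a reasonable text encoding
    print(response.text[:200])                      # print the first 200 characters
except requests.RequestException as exc:
    print(f"Request failed: {exc}")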

Problems caused by web crawlers

 System burden

  Due to technical limitations and the purpose of crawling, the network load generated by some web crawlers is far greater than that of ordinary user access, which places a heavy bandwidth burden and system resource overhead on the website and its servers and interferes with the site's normal operation. For website owners, therefore, web crawlers (especially crawlers that run without restraint) are a real nuisance.

 Legal risks

  Since the data on a website generally carries property rights, a web crawler that harvests that data and then analyzes or resells it for profit causes economic losses to the website and exposes the crawler's developers and users to legal risk. Crawler designers must therefore strictly follow the Internet's and each company's crawler management regulations; otherwise they may have to bear the consequences.

 Privacy leaks

  A website may hold private data that its owner does not want ordinary users to access, and some sites protect such data with only weak access controls, which some web crawlers are able to bypass or break through. If the website does not protect this information properly and it leaks, the result is a serious information security problem.


Specification of web crawlers

  Although web crawlers can cause the problems described above, they are not in themselves harmful tools, so the Internet and individual websites allow crawlers to run within given limits. Next, let's look at how websites and the Internet restrict crawlers.

 Restrictions on web crawlers

  Source review

  Some websites restrict access with a relatively simple source check: they examine the User-Agent field of the request headers in each incoming request and filter on it, so that the site responds only to requests from browsers and friendly crawlers. Source review places certain demands on the technical level of the website's developers.
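
  As a sketch of this idea (the header value and the whitelist below are made-up examples, not any site's real rules), a friendly crawler built on Requests can identify itself through the User-Agent header, and a website can filter on that field:

import requests

# A friendly crawler identifies itself through the User-Agent request header.
# The header value is a made-up example, not a real product name.
headers = {"User-Agent": "MyStudyCrawler/0.1 (contact: example@example.com)"}
response = requests.get("https://example.com", headers=headers, timeout=10)
print(response.status_code)

# Rough sketch of the server-side check: respond only to browsers and known friendly crawlers.
ALLOWED_KEYWORDS = ("Mozilla", "Baiduspider", "Googlebot")  # assumed whitelist, for illustration only

def is_allowed(user_agent: str) -> bool:
    # Accept the request if the User-Agent contains any whitelisted keyword.
    return any(keyword in user_agent for keyword in ALLOWED_KEYWORDS)

print(is_allowed(headers["User-Agent"]))  # False: the example crawler is not on the whitelist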

  Published announcements (the Robots protocol)

  A website can publish an announcement that tells all crawlers its crawling policy and asks every crawler that visits the site to comply. The mainstream announcement protocol on the Internet today is the Robots protocol, which specifies which content on a site may and may not be crawled. This approach is simpler and demands less of the website's developers. However, a published announcement is in many cases only a moral constraint on crawler developers: without other access control measures, a crawler that probes deliberately can still easily find private content. On the other hand, the developers and users of crawlers that violate the protocol face high legal risks, so we should also follow these announcements when designing crawlers.

 The Robots protocol

The Robots protocol (Robots Exclusion Standard) tells web crawlers which pages may be crawled and which may not through a robots.txt file placed in the root directory of the website. If a site does not provide this file in its root directory, then by default all crawlers are allowed to crawl the site's content.
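
  For example, a crawler can read a site's robots.txt directly from the root directory. The short sketch below simply downloads and prints Baidu's file with the Requests library:

import requests

# robots.txt always sits at the root of the site, e.g. https://www.baidu.com/robots.txt
response = requests.get("https://www.baidu.com/robots.txt", timeout=10)
response.raise_for_status()
print(response.text)  # the raw rules, in the format described below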

  Robots protocol format

The content of a robots.txt file consists mainly of the following keywords and wildcards, combined with specific crawler names and paths.

  • User-agent: begins a block of rules for the crawler with the specified name
  • Allow: paths that are allowed to be crawled
  • Disallow: paths that are forbidden to be crawled
  • * wildcard that matches any sequence of characters
  • $ wildcard meaning the URL ends with the content before it
  • Note: the colons after the keywords are ASCII (English) colons, and each colon is followed by a space
  • The User-agent keyword corresponds to the User-agent header field in the crawler's request
  • Leave a blank line between the rule blocks for different crawlers
  • The Allow keyword has the opposite effect of Disallow and can exempt certain subdirectories from a broader Disallow rule
  • Even if no page needs to be blocked, a rule line should still be written, usually Disallow: with nothing after the space (or Allow: / to allow everything under the root directory)
  • ? has no special meaning; it simply matches a literal ?

The following excerpt from Baidu's robots.txt serves as an illustration.

User-agent: Baiduspider
Disallow: /baidu
Disallow: /s?
Disallow: /ulink?
Disallow: /link?
Disallow: /home/news/data/
Disallow: /bh

User-agent: Googlebot
Disallow: /baidu
Disallow: /s?
Disallow: /shifen/
Disallow: /homepage/
Disallow: /cpro
Disallow: /ulink?
Disallow: /link?
Disallow: /home/news/data/
Disallow: /bh

...

User-agent: *
Disallow: /

Baiduspider, the main crawler of the Baidu search engine, is kept out of a few specific directories, while Googlebot, the main crawler of the Google search engine, is kept out of even more content. After all the blocks for the permitted crawlers there is a User-agent: * block that disallows the root directory, which means Baidu forbids every crawler other than the friendly crawlers listed above from crawling its pages.

  How to follow the Robots protocol

  A web crawler should identify, automatically or manually, the content of robots.txt and crawl the website in accordance with its requirements. Although the Robots protocol is only advisory with respect to how crawlers operate, that does not mean crawling a website without following it is free of legal risk; in particular, using a crawler to illegally harvest commercially valuable content for profit may expose you to the risk of being sued.
  In principle, all web crawlers should follow the Robots protocol. A crawler that generates a large volume of visits or serves commercial interests must abide by it; a crawler used only for personal, non-profit purposes with a small number of visits may treat the protocol's constraints somewhat more loosely. If a crawler's access to a website (in both frequency and content) is essentially the same as a human user's access pattern, it is called human-like behavior, and in principle such behavior need not strictly consult the Robots protocol.
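
  One practical way to check the rules automatically is the robotparser module in Python's standard library. The sketch below uses the Baidu robots.txt shown above; "MyStudyCrawler" is a made-up crawler name, so it falls under the User-agent: * rules:

from urllib.robotparser import RobotFileParser

# Download and parse the site's robots.txt, then ask whether a crawler may fetch a given URL.
parser = RobotFileParser()
parser.set_url("https://www.baidu.com/robots.txt")
parser.read()  # fetches and parses the file

# A made-up crawler name matches the "User-agent: *" block, which disallows the root directory.
print(parser.can_fetch("MyStudyCrawler", "https://www.baidu.com/s?wd=python"))  # expected: False
# Baiduspider has its own rule block; this URL is allowed unless one of its Disallow rules matches.
print(parser.can_fetch("Baiduspider", "https://www.baidu.com/duty/"))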

Origin blog.csdn.net/Zheng__Huang/article/details/108583115