The development of web crawlers makes it convenient for users to discover and collect information on the network, but it also brings problems large and small, and can even harm network security. Therefore, before we really start studying web crawlers, we need to understand their characteristics, the problems they cause, and the rules that must be followed when developing and using them.
Size classification of web crawlers
| Scale | Characteristics | Purpose | Typical implementation |
|---|---|---|---|
| Small | Small data volume, insensitive to crawling speed; such crawlers are by far the most numerous | Crawl individual web pages, explore page information | Requests library |
| Medium | Large data volume, more sensitive to crawling speed | Crawl whole websites or series of websites | Scrapy framework |
| Large | Extremely large data volume and scale, mostly used by search engines; crawling speed is critical | Crawl all websites on the Internet | Custom-built systems |
Problems caused by web crawlers
System burden
Due to technical limitations and their crawling goals, some web crawlers generate network load far greater than that of ordinary user access, imposing a heavy bandwidth and system-resource burden on websites and web servers and interfering with their normal operation. For website owners, web crawlers (especially those that run erratically) are therefore a source of harassment.
Legal risks
Since the data on a website generally carries property rights, a crawler that harvests that data and resells it for profit causes economic losses to the website and exposes the crawler's developers and users to legal risk. Crawler designers must therefore strictly follow Internet-wide and site-specific crawler management regulations, or they may bear the consequences.
Privacy leak
A website may hold private data that its owners do not want users to access. Some sites protect such data only with weak access controls, which some web crawlers are able to bypass or break through. If the site fails to protect the information properly and it leaks, the result is a serious information-security problem.
Specification of web crawlers
Because web crawlers can cause the problems above, yet are not inherently harmful tools, the Internet and individual websites allow crawlers to run within given restrictions. Next, let's look at how websites and the Internet restrict crawlers.
Restrictions on web crawlers
Source review
Some websites use a relatively simple source review to restrict access: they filter requests by inspecting the `User-Agent` field of the request header, and respond only to requests from browsers and friendly crawlers. Source review places certain demands on the technical level of the website's developers.
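As a sketch of what source review filters on, the request below sets an explicit `User-Agent` header using Python's standard `urllib` library (the crawler name and URL here are illustrative placeholders, not real endpoints):

```python
import urllib.request

# Hypothetical crawler name and placeholder URL, for illustration only.
headers = {"User-Agent": "MyCrawler/1.0 (+https://example.com/bot-info)"}
req = urllib.request.Request("https://example.com/page", headers=headers)

# A server performing source review inspects this field before responding.
print(req.get_header("User-agent"))
```

A site doing source review can then allow requests whose `User-Agent` identifies a browser or a known friendly crawler, and reject everything else.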
Published announcements (the Robots protocol)
A website can publish an announcement informing all crawlers of the site's crawling policy and requiring every crawler that visits to comply. The mainstream announcement protocol on today's Internet is the Robots protocol, which specifies which content on a site may and may not be crawled. This approach is simpler and demands less of website developers. However, an announcement by itself imposes only a moral constraint on crawler developers: without additional access controls, a determined crawler can easily reach private content. Crawlers that violate the protocol nevertheless expose their developers and users to high legal risk, so we should also follow these announcements when designing crawlers.
The Robots protocol
The Robots protocol (Robots Exclusion Standard) tells web crawlers which pages may and may not be crawled, via a file named `robots.txt` placed in the root directory of the website. If a site does not provide this file in its root directory, by default all crawlers on the network are allowed to crawl its content.
Robots protocol format
The contents of a Robots protocol file consist mainly of the following keywords and wildcards, combined with specific crawler names and paths:
| Keyword / wildcard | Description |
|---|---|
| `User-agent:` | Begins a group of rules for the named crawler |
| `Allow:` | Content that may be crawled |
| `Disallow:` | Content that may not be crawled |
| `*` | Wildcard matching any sequence of characters |
| `$` | Wildcard anchoring a match to the end of the path |
- Note: the colons after keywords are ASCII colons, each followed by a space.
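To make the two wildcards concrete, the sketch below implements a tiny, illustrative matcher for `Allow:`/`Disallow:` paths (the helper `rule_matches` is hypothetical, not part of any standard library, and real robots.txt parsers differ in details):

```python
import re

def rule_matches(rule: str, path: str) -> bool:
    """Illustrative matcher for a robots.txt path rule.
    '*' matches any sequence of characters; a trailing '$' anchors
    the rule to the end of the path; '?' is an ordinary character."""
    anchored = rule.endswith("$")
    if anchored:
        rule = rule[:-1]
    # Escape regex metacharacters, then restore '*' as "match anything".
    regex = re.escape(rule).replace(r"\*", ".*")
    if anchored:
        regex += "$"
    # Rules match as path prefixes unless anchored with '$'.
    return re.match(regex, path) is not None

print(rule_matches("/s?", "/s?wd=python"))    # True: '?' is literal, prefix match
print(rule_matches("/*.php$", "/index.php"))  # True: '*' spans, '$' anchors the end
print(rule_matches("/*.php$", "/index.php3")) # False: path does not end with .php
```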
- The `User-agent:` keyword corresponds to the `User-Agent` field in a crawler's request header.
- Rule groups for different crawlers are separated by a blank line.
- The `Allow:` keyword has the opposite effect of `Disallow:` and can exempt certain subdirectories from a broader `Disallow:` rule.
- Even if no pages need to be forbidden, you should still write a rule line: usually a `Disallow:` keyword with nothing after the space (or an `Allow:` keyword followed by `/` to indicate the root directory).
- `?` has no special meaning; it simply matches a literal `?`.
The following excerpt from Baidu's `robots.txt` serves as an illustration.
```
User-agent: Baiduspider
Disallow: /baidu
Disallow: /s?
Disallow: /ulink?
Disallow: /link?
Disallow: /home/news/data/
Disallow: /bh

User-agent: Googlebot
Disallow: /baidu
Disallow: /s?
Disallow: /shifen/
Disallow: /homepage/
Disallow: /cpro
Disallow: /ulink?
Disallow: /link?
Disallow: /home/news/data/
Disallow: /bh

...

User-agent: *
Disallow: /
```
`Baiduspider`, the main crawler of the Baidu search engine, is barred from specific directories, while `Googlebot`, the main crawler of the Google search engine, is barred from rather more content. After the rule groups for all the permitted crawlers comes a `User-agent: *` group that disallows the root directory, meaning that Baidu's site forbids all crawlers other than the friendly ones listed above from crawling its pages.
How to follow the Robots protocol
A web crawler should be able to identify, automatically or manually, the contents of the `robots.txt` file and crawl the site in accordance with the protocol's requirements. Although the Robots protocol is only advisory with respect to a crawler's operation, that does not mean crawling a site without following it carries no legal risk. In particular, for content with commercial value, using a crawler to harvest it unlawfully for profit may lead to being sued.
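One way to follow the protocol automatically from Python is the standard library's `urllib.robotparser`. The sketch below parses a shortened version of Baidu's rules directly from a string rather than fetching the live file over the network:

```python
from urllib import robotparser

# Shortened robots.txt rules, parsed from a string for a self-contained demo.
rules = """\
User-agent: Baiduspider
Disallow: /baidu
Disallow: /s?

User-agent: *
Disallow: /
"""

rp = robotparser.RobotFileParser()
rp.parse(rules.splitlines())

# Baiduspider may fetch paths not covered by its Disallow rules...
print(rp.can_fetch("Baiduspider", "https://www.baidu.com/duty/"))     # True
# ...but not the explicitly disallowed ones.
print(rp.can_fetch("Baiduspider", "https://www.baidu.com/baidu"))     # False
# Any other crawler falls under "User-agent: *", which disallows everything.
print(rp.can_fetch("MyCrawler", "https://www.baidu.com/anything"))    # False
```

In a real crawler you would call `rp.set_url("https://www.baidu.com/robots.txt")` followed by `rp.read()` to fetch the live file, then check `can_fetch()` before each request.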
In principle, every web crawler should follow the Robots protocol. A crawler that generates a large volume of visits or serves commercial interests must abide by it; a crawler run only for personal, non-profit purposes with a small number of visits may treat the protocol's constraints somewhat more loosely. If a crawler's access to a website (in both frequency and content crawled) is equivalent to an ordinary user's access pattern, this is called human-like behavior, and such behavior need not strictly follow the Robots protocol.