[Web Crawler Notes] Detailed explanation of crawler Robots protocol syntax

The Robots protocol, formally the Robots Exclusion Protocol, provides a standard access-control mechanism for search engine crawlers such as web spiders and robots, telling them which pages may be crawled and which may not. This article explains the syntax of the Robots protocol in detail, with code and examples.

1. Basic syntax of Robots protocol

The basic syntax of the Robots protocol is as follows:

User-agent: [user-agent name]
Disallow: [URL string not to be crawled]

Here, User-agent specifies the name of the search engine crawler the rules apply to, and Disallow specifies a URL path that the crawler is not allowed to crawl.

For example, here is an example of a Robots protocol file:

User-agent: Googlebot
Disallow: /private/
Disallow: /admin/
Disallow: /login/

In the above example, we target the Googlebot crawler by name and prohibit it from crawling the three paths /private/, /admin/, and /login/.
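To make the matching behavior concrete, here is a minimal Python sketch (an illustration added here, not part of the protocol itself): a URL path is blocked if it starts with any Disallow prefix. Real parsers handle more cases, such as Allow rules and precedence, so treat this only as a simplified model.

disallowed = ["/private/", "/admin/", "/login/"]

def is_blocked(path, rules=disallowed):
    # A path is blocked if it begins with any Disallow prefix.
    return any(path.startswith(rule) for rule in rules)

print(is_blocked("/private/orders"))  # True  -> must not be crawled
print(is_blocked("/products/123"))    # False -> may be crawled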

2. Common parameters of Robots protocol

The Robots protocol also has some commonly used parameters, including:

  • Allow: explicitly permits search engine crawlers to access the specified URL path;
  • Sitemap: specifies the URL of the site map, so that crawlers reading robots.txt can discover the structure of the entire site;
  • Crawl-delay: specifies the interval, in seconds, that a crawler should wait between requests.

For example, here is an example of a Robots protocol file:

User-agent: Googlebot
Disallow: /private/
Disallow: /admin/
Disallow: /login/
Allow: /public/
Sitemap: http://www.example.com/sitemap.xml
Crawl-delay: 10

In the above example, we added an Allow rule so that crawlers may access pages under the /public/ path, specified the site map URL as http://www.example.com/sitemap.xml, and set the crawl interval to 10 seconds.
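Python's standard library can read these parameters as well. The sketch below is an added illustration (it assumes Python 3.8+ for site_maps()): it parses rules like the example above directly with urllib.robotparser and queries the Crawl-delay and Sitemap values.

import urllib.robotparser

robots_txt = """\
User-agent: Googlebot
Disallow: /private/
Allow: /public/
Sitemap: http://www.example.com/sitemap.xml
Crawl-delay: 10
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(robots_txt.splitlines())   # parse rules from a list of lines

print(rp.crawl_delay("Googlebot"))  # 10
print(rp.site_maps())               # ['http://www.example.com/sitemap.xml']
print(rp.can_fetch("Googlebot", "http://www.example.com/public/index.html"))  # True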

3. Robots protocol case

Next, a practical case illustrates how to use the Robots protocol to limit the access of search engine crawlers.

Suppose we want to make an e-commerce website and don't want search engine crawlers to crawl our shopping cart page.

First, we need to create a file named robots.txt in the root directory of the website and list the page URLs that we do not want search engine crawlers to crawl. The sample file is as follows:

User-agent: *
Disallow: /cart/

In the above code, the `*` wildcard means the rules apply to all search engine crawlers, and the Disallow line forbids access to pages under the /cart/ path.

In this way, a compliant search engine crawler will first read the robots.txt file when visiting our website and then decide, based on its rules, whether to crawl the shopping cart page.
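As a quick illustration (an addition to the article, with a made-up crawler name), a well-behaved crawler written in Python could apply exactly this file before fetching any cart page; the sketch below feeds the two rule lines straight into urllib.robotparser.

import urllib.robotparser

rules = [
    "User-agent: *",
    "Disallow: /cart/",
]

rp = urllib.robotparser.RobotFileParser()
rp.parse(rules)  # load the rules without fetching a remote robots.txt

# Any crawler (matched by *) must skip the shopping cart pages.
print(rp.can_fetch("MyCrawler", "http://www.example.com/cart/checkout"))  # False
print(rp.can_fetch("MyCrawler", "http://www.example.com/products/1"))     # True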

4. Python implements Robots protocol

In Python, the robotparser module in the standard urllib package can be used to parse and apply the Robots protocol. The sample code is as follows:

import urllib.robotparser

# Create a parser, point it at the site's robots.txt file, and download it
rp = urllib.robotparser.RobotFileParser()
rp.set_url("http://www.example.com/robots.txt")
rp.read()

# Ask whether Googlebot may crawl the shopping cart URL
if rp.can_fetch("Googlebot", "http://www.example.com/cart/"):
    print("Googlebot is allowed to fetch the content!")
else:
    print("Googlebot is not allowed to fetch the content!")

In the above code, we first create a RobotFileParser object, point it at the URL of the robots.txt file, and read the file's contents. The can_fetch() method then tells us whether the specified crawler is allowed to crawl the given URL.
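In practice, this check is usually combined with the actual request. Below is a sketch added for illustration (the fetch_politely helper and the example URLs are hypothetical, not from the article) that only downloads a page when the robots.txt rules permit it.

import urllib.request
import urllib.robotparser

def fetch_politely(url, user_agent="MyCrawler",
                   robots_url="http://www.example.com/robots.txt"):
    # Check the site's robots.txt before issuing the request.
    rp = urllib.robotparser.RobotFileParser()
    rp.set_url(robots_url)
    rp.read()
    if not rp.can_fetch(user_agent, url):
        return None  # disallowed by the Robots protocol

    # Allowed: fetch the page with the same User-Agent we checked.
    req = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(req) as resp:
        return resp.read()

html = fetch_politely("http://www.example.com/")
print("fetched" if html else "blocked by robots.txt")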

Summary

The Robots protocol is a website management convention. By creating a robots.txt file in the root directory of a website, administrators can state the rules that search engine crawlers should follow when crawling the site's content. The protocol is simple and easy to understand, and mainstream crawlers respect it consistently, making it an important tool for website administrators doing search engine optimization.
