Sesame HTTP: Analyzing the Robots Protocol

Using the robotparser module of urllib, we can analyze a website's Robots protocol. In this section, let's take a brief look at how to use this module.

1. Robots Protocol

The Robots protocol, also known as the crawler protocol or the robot protocol, is formally called the Robots Exclusion Protocol. It is used to tell crawlers and search engines which pages may be crawled and which may not. It usually takes the form of a text file named robots.txt, placed in the root directory of the website.

When a search crawler visits a site, it first checks whether a robots.txt file exists in the root directory of the site. If it exists, the crawler crawls according to the scope defined there. If the file is not found, the crawler visits all directly accessible pages.
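
For illustration, here is a minimal sketch of the check described above, using urllib.request; the site URL is just a placeholder, and parsing the rules themselves is handled later in this section with the robotparser module.

from urllib.request import urlopen
from urllib.error import HTTPError

site = 'https://www.example.com'  # placeholder site root, not a real target

try:
    # Fetch robots.txt from the root directory of the site
    rules = urlopen(site + '/robots.txt').read().decode('utf-8')
    print('robots.txt found, crawl only within the scope it defines:')
    print(rules)
except HTTPError as e:
    if e.code == 404:
        # No robots.txt, so all directly accessible pages may be visited
        print('No robots.txt found; all directly accessible pages may be visited')
    else:
        raise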

Let's look at a sample robots.txt:

User-agent: *
Disallow: /
Allow: /public/

This means that all search crawlers are allowed to crawl only the public directory. Save the above content as a file named robots.txt and place it in the root directory of the website, together with the website's entry files (such as index.php, index.html, index.jsp, etc.).

The User-agent line describes the name of the search crawler. Setting it to * here means that the protocol applies to any crawler. For example, we can set:

User-agent: Baiduspider

This means that the rules we set apply only to Baidu's crawler. If there are multiple User-agent records, multiple crawlers will be subject to the crawling restrictions; at least one record must be specified.

Disallow specifies the directories that are not allowed to be crawled. For example, setting it to / as in the example above means that no pages may be crawled.

Allow is generally used together with Disallow rather than on its own, to carve out exceptions to a restriction. Here we set it to /public/, which means that crawling is disallowed everywhere except the public directory.

Let's look at a few more examples. The code to prohibit all crawlers from accessing any directory is as follows:

User-agent: *
Disallow: /

 The code to allow all crawlers to access any directory is as follows:

User-agent: *
Disallow:

 In addition, it is also possible to leave the robots.txt file blank.

The code to prohibit all crawlers from accessing certain directories of the website is as follows:

User-agent: *
Disallow: /private/
Disallow: /tmp/

 The code that only allows a certain crawler to access is as follows:

User-agent: WebCrawler
Disallow:
User-agent: *
Disallow: /

 These are some common ways to write robots.txt.

2. Crawler Names

You may be wondering where these crawler names come from and why they are called what they are. In fact, each crawler has a fixed name. For example, Baidu's crawler is called BaiduSpider. Table 3-1 lists the names of some common search crawlers and their corresponding websites.

Table 3-1 Names of some common search crawlers and their corresponding websites

Crawler name    Name          Website
BaiduSpider     Baidu         www.baidu.com
Googlebot       Google        www.google.com
360Spider       360 Search    www.so.com
YodaoBot        Youdao        www.youdao.com
ia_archiver     Alexa         www.alexa.cn
Scooter         AltaVista     www.altavista.com

3. robotparser

After understanding the Robots protocol, we can use the robotparser module to parse robots.txt. This module provides the RobotFileParser class, which can determine, according to a website's robots.txt file, whether a crawler has permission to crawl a given page.

This class is very simple to use: just pass the URL of robots.txt to the constructor. First, look at its declaration:

urllib.robotparser.RobotFileParser(url='')

Of course, you can also omit it when creating the object (the default is empty) and set it later with the set_url() method.

Several methods commonly used in this class are listed below.

  • set_url(): sets the URL of the robots.txt file. If the URL was already passed in when the RobotFileParser object was created, there is no need to call this method.
  • read(): reads the robots.txt file and analyzes it. Note that this method performs the actual reading and analysis; if it is not called, all subsequent judgments will be False, so be sure to call it. It does not return anything, but the read is performed.
  • parse(): parses the robots.txt file. It takes the lines of robots.txt as its argument and analyzes them according to the syntax rules of robots.txt.
  • can_fetch(): takes two arguments, the first being a User-agent and the second the URL to be crawled. It returns True or False, indicating whether that crawler is allowed to fetch the URL.
  • mtime(): returns the time robots.txt was last fetched and analyzed. This is useful for search crawlers that analyze and crawl over long periods, which may need to check regularly to fetch the latest robots.txt.
  • modified(): also helpful for long-running crawlers; it sets the current time as the time robots.txt was last fetched and analyzed.

Let's look at an example:

from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url('http://www.jianshu.com/robots.txt')
rp.read()
print(rp.can_fetch('*', 'http://www.jianshu.com/p/b67554025d7d'))
print(rp.can_fetch('*', "http://www.jianshu.com/search?q=python&page=1&type=collections"))

Here we take Jianshu as an example. We first create a RobotFileParser object, then set the URL of robots.txt with the set_url() method. Alternatively, we can skip that method and pass the URL directly when creating the object:

rp = RobotFileParser('http://www.jianshu.com/robots.txt')

Then we use the can_fetch() method to determine whether the pages may be crawled.

The result is as follows:

True
False

We can also use the parse() method to perform the reading and analysis. Here is an example:

from urllib.robotparser import RobotFileParser
from urllib.request import urlopen

rp = RobotFileParser()
rp.parse(urlopen('http://www.jianshu.com/robots.txt').read().decode('utf-8').split('\n'))
print(rp.can_fetch('*', 'http://www.jianshu.com/p/b67554025d7d'))
print(rp.can_fetch('*', "http://www.jianshu.com/search?q=python&page=1&type=collections"))

The result is the same:

True
False
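
The mtime() and modified() methods are not used in the examples above. For a search crawler that runs for a long time, a minimal sketch of how they might be combined is shown below; the can_fetch_fresh() helper and the one-hour refresh interval are assumptions for illustration, not part of the module:

import time
from urllib.robotparser import RobotFileParser

rp = RobotFileParser('http://www.jianshu.com/robots.txt')
rp.read()
rp.modified()  # record the current time as the time robots.txt was fetched and analyzed

def can_fetch_fresh(url, refresh_interval=3600):
    # Re-read robots.txt if the last fetch recorded by mtime() is older than the assumed interval
    if time.time() - rp.mtime() > refresh_interval:
        rp.read()
        rp.modified()
    return rp.can_fetch('*', url)

print(can_fetch_fresh('http://www.jianshu.com/p/b67554025d7d'))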

This section introduced the basic usage of the robotparser module with examples. With it, we can easily determine which pages may be crawled and which may not.
