Python crawler learning 12
-
robotparser
In the previous article, we learned about the Robots protocol. Now that we understand the protocol, we can use the robotparser module to parse robots.txt files.
-
RobotFileParser
-
Declaration:
When using this class, simply pass in the URL of the robots.txt file.
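As a small sketch (using the hypothetical site example.com), passing the URL to the constructor is equivalent to constructing first and calling set_url() afterwards:

```python
from urllib import robotparser

# Passing the URL at construction time...
rp1 = robotparser.RobotFileParser('https://www.example.com/robots.txt')

# ...is equivalent to constructing first and setting the URL later
rp2 = robotparser.RobotFileParser()
rp2.set_url('https://www.example.com/robots.txt')

# Neither parser has fetched anything yet, so mtime() is still 0
print(rp1.mtime(), rp2.mtime())  # 0 0
```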
-
set_url() method: used to set the URL of the robots.txt file
-
read() method: fetches the robots.txt file and parses it. It returns nothing, but it must be called before any can_fetch() check, otherwise every check returns False.
-
parse() method: parses the lines of a robots.txt file that you have fetched yourself
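Because parse() accepts the lines directly, it can be tried without any network access. Here is a minimal sketch using a made-up robots.txt that disallows a hypothetical /private/ path:

```python
from urllib import robotparser

# A hypothetical robots.txt, supplied as a list of lines
rules = """\
User-agent: *
Disallow: /private/
""".splitlines()

rp = robotparser.RobotFileParser()
rp.parse(rules)

# Paths under /private/ are disallowed; everything else is allowed
print(rp.can_fetch('*', 'https://www.example.com/public/'))           # True
print(rp.can_fetch('*', 'https://www.example.com/private/secret.html'))  # False
```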
-
can_fetch() method: takes two parameters, the first a User-Agent string and the second the URL to check. It returns True or False, indicating whether the crawler identified by that User-Agent is allowed to fetch the URL.
-
mtime() method: returns the time at which robots.txt was last fetched and parsed. This matters for long-running crawlers, which need to check periodically and re-fetch the latest robots.txt.
-
modified() method: also useful for long-running crawlers; it sets the last fetched-and-parsed time of robots.txt to the current time.
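The two methods above work together: a long-running crawler can compare mtime() against the clock to decide when to re-fetch. A minimal sketch (the example.com URL and the one-day threshold are our own choices, not part of the module):

```python
import time
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url('https://www.example.com/robots.txt')

# A fresh parser has never fetched robots.txt, so mtime() returns 0
print(rp.mtime())  # 0

# modified() stamps the current time as the last-fetched time
rp.modified()

# Re-fetch when the stamp is older than our chosen threshold
ONE_DAY = 24 * 60 * 60
if time.time() - rp.mtime() > ONE_DAY:
    rp.read()  # re-fetch and re-parse robots.txt

print(time.time() - rp.mtime() < ONE_DAY)  # True
```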
-
Example
```python
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url('https://www.baidu.com/robots.txt')
rp.read()

# Use can_fetch() to check whether a page may be crawled
print(rp.can_fetch('Baiduspider', 'https://www.baidu.com'))
print(rp.can_fetch('Baiduspider', 'https://www.baidu.com/homepage/'))
print(rp.can_fetch('Googlebot', 'https://www.baidu.com/homepage/'))

# From the results we can see that Baiduspider may crawl the homepage
# page, while Googlebot may not.
```
Running result:
Open Baidu's robots.txt and you can see that it places no restrictions on Baiduspider.
-
Conclusion of this chapter
This concludes our tour of the urllib library. We have covered the basic usage of its request, error, parse, and robotparser modules. In the next article we will learn about the more powerful requests library.
To be continued…