Python crawler learning 12

  • robotparser

    In the previous article, we learned about the Robots protocol. Now that we understand the protocol, we can use the robotparser module to parse robots.txt files.

    • RobotFileParser

      • Class declaration:

        When instantiating this class, just pass in the URL of the robots.txt file (see the sketch after set_url() below).

      • set_url() method: sets the URL of the robots.txt file.

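        A minimal sketch of both ways to point the parser at a robots.txt file (the example.com URL is only a placeholder):

        from urllib.robotparser import RobotFileParser
        
        # Pass the robots.txt URL directly when constructing the parser...
        rp = RobotFileParser('https://www.example.com/robots.txt')
        
        # ...or create an empty parser and set the URL afterwards
        rp = RobotFileParser()
        rp.set_url('https://www.example.com/robots.txt')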

      • read() method: fetches the robots.txt file and parses it. Note that if you never call read() (or parse()), every subsequent can_fetch() check returns False.

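        A quick sketch of that pitfall (again using example.com as a placeholder):

        from urllib.robotparser import RobotFileParser
        
        rp = RobotFileParser()
        rp.set_url('https://www.example.com/robots.txt')
        print(rp.can_fetch('*', 'https://www.example.com/'))  # False: nothing has been read yet
        rp.read()  # fetch and analyze the file
        print(rp.can_fetch('*', 'https://www.example.com/'))  # now reflects the actual rules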

      • parse() method: parses lines of a robots.txt file that you have fetched yourself, as an alternative to read().

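        For example, a sketch that downloads the file manually and hands the lines to parse(), which is equivalent to calling read():

        from urllib.request import urlopen
        from urllib.robotparser import RobotFileParser
        
        rp = RobotFileParser()
        # Download robots.txt ourselves, then let the parser analyze the lines
        lines = urlopen('https://www.baidu.com/robots.txt').read().decode('utf-8').split('\n')
        rp.parse(lines)
        print(rp.can_fetch('Baiduspider', 'https://www.baidu.com'))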

      • can_fetch() method: takes two parameters: the first is a User-Agent string, the second is the URL to check. It returns True or False, indicating whether the crawler identified by that User-Agent may fetch the URL; the example below demonstrates it.

      • mtime() method: returns the time at which robots.txt was last fetched and parsed. This matters for long-running crawlers, which need to check periodically and re-fetch the latest robots.txt.

      • modified() method: also aimed at long-running crawlers; it sets the time at which robots.txt was last fetched and parsed to the current time.
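
        A minimal sketch of a periodic refresh using both methods (the one-hour interval is only an illustrative choice):

        import time
        from urllib.robotparser import RobotFileParser
        
        rp = RobotFileParser()
        rp.set_url('https://www.example.com/robots.txt')
        rp.read()
        
        # Later: re-fetch robots.txt if our cached copy is more than an hour old
        if time.time() - rp.mtime() > 3600:
            rp.read()       # re-download and re-parse the rules
            rp.modified()   # record that we refreshed just now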

      • Example

        from urllib import robotparser
        
        rp = robotparser.RobotFileParser()
        rp.set_url('https://www.baidu.com/robots.txt')
        rp.read()
        
        # Use can_fetch() to check whether a page may be crawled
        print(rp.can_fetch('Baiduspider','https://www.baidu.com'))
        print(rp.can_fetch('Baiduspider','https://www.baidu.com/homepage/'))
        print(rp.can_fetch('Googlebot','https://www.baidu.com/homepage/'))
        
        # From the results we can see that Baiduspider may crawl the homepage page,
        # while Googlebot may not.
        

        Running this prints:

        True
        True
        False

        If you open Baidu's robots.txt (https://www.baidu.com/robots.txt), you can see why: it places no restrictions on Baiduspider for these pages, while its rules for Googlebot disallow the /homepage/ path.

  • Conclusion of this chapter

    This concludes our coverage of the urllib library. We have learned the basic usage of its request, error, parse, and robotparser modules. In the next article we will move on to the more powerful requests library.

To be continued…

Origin blog.csdn.net/szshiquan/article/details/123389610