Python crawler learning 11

  • Parsing links (urllib.parse)

    • urlencode
      • As mentioned before, urlencode serializes a dict of parameters into a URL query string

        from urllib import parse
        
        params = {
            'name': 'germey',
            'age': '25'
        }
        base_url = 'http://www.baidu.com?'
        url = base_url + parse.urlencode(params)
        print(url)
        

        Running result: the parameters have been serialized into a GET request query string

        http://www.baidu.com?name=germey&age=25
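
        urlencode can also handle multi-valued parameters. A minimal sketch (the 'hobby' parameter is just a made-up example) using the doseq flag so that list values are expanded into repeated keys:

        from urllib.parse import urlencode

        params = {
            'name': 'germey',
            'hobby': ['reading', 'coding']
        }
        # doseq=True expands the list into hobby=reading&hobby=coding
        print(urlencode(params, doseq=True))
        # prints: name=germey&hobby=reading&hobby=coding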

    • parse_qs
      • Restores serialized request parameters to a dictionary (each value is wrapped in a list)

        # parse_qs
        from urllib.parse import parse_qs
        
        query = 'name=germey&age=25'
        print(parse_qs(query))
        

        Running result:

        {'name': ['germey'], 'age': ['25']}
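
        Because a query string may repeat a key, parse_qs wraps every value in a list. A small sketch with a made-up repeated parameter:

        from urllib.parse import parse_qs

        # the repeated 'age' key is only for illustration
        print(parse_qs('name=germey&age=25&age=26'))
        # prints: {'name': ['germey'], 'age': ['25', '26']}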

    • parse_qsl
      • parse_qsl converts the query string into a list of (name, value) tuples

        # parse_qsl
        from urllib.parse import parse_qsl
        
        query = 'name=germey&age=25'
        print(parse_qsl(query))
        

        Running result:

        [('name', 'germey'), ('age', '25')]
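
        If you only need one value per key, the tuple list from parse_qsl can be passed straight to dict(); later duplicates would overwrite earlier ones. A minimal sketch:

        from urllib.parse import parse_qsl

        query = 'name=germey&age=25'
        print(dict(parse_qsl(query)))
        # prints: {'name': 'germey', 'age': '25'}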

    • quote
      • quote converts content into URL-encoded (percent-encoded) form, for example turning Chinese characters into their percent-encoded bytes

        # quote
        from urllib.parse import quote
        
        kw = '雪容融'
        url = 'http://www.baidu.com/s?wd=' + quote(kw)
        print(url)
        
        

        Running result:

        http://www.baidu.com/s?wd=%E9%9B%AA%E5%AE%B9%E8%9E%8D
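
        quote leaves '/' unescaped by default and encodes spaces as %20, while quote_plus also escapes '/' and turns spaces into '+', the form HTML forms use. A small sketch (the keyword is only an example):

        from urllib.parse import quote, quote_plus

        kw = 'python 爬虫'
        print(quote(kw))       # python%20%E7%88%AC%E8%99%AB
        print(quote_plus(kw))  # python+%E7%88%AC%E8%99%AB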

    • unquote
      • unquote is the inverse of quote: it decodes percent-encoded content back into the original characters

        # unquote
        from urllib.parse import unquote
        
        url = '%E9%9B%AA%E5%AE%B9%E8%9E%8D'
        print('%E9%9B%AA%E5%AE%B9%E8%9E%8D decodes to:', unquote(url))
        

        Running result:

        %E9%9B%AA%E5%AE%B9%E8%9E%8D decodes to: 雪容融
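
        The pieces above fit together: build an encoded URL with quote, then recover the original keyword with urlsplit and parse_qs (which unquotes values automatically). A minimal round-trip sketch reusing the earlier Baidu URL:

        from urllib.parse import quote, urlsplit, parse_qs

        url = 'http://www.baidu.com/s?wd=' + quote('雪容融')
        query = urlsplit(url).query   # 'wd=%E9%9B%AA%E5%AE%B9%E8%9E%8D'
        print(parse_qs(query))        # prints: {'wd': ['雪容融']}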

  • Analyzing the Robots protocol

    • The Robots protocol
      • Also known as the crawler protocol or robot exclusion protocol, its full name is the Robots Exclusion Protocol. It tells crawlers and search engines which pages may be crawled and which may not. It is usually a text file called robots.txt placed in the root directory of the website.

      • When a search crawler visits a website, it first checks whether a robots.txt file exists in the site's root directory. If it does, the crawler restricts itself to the crawling scope defined there; if not, the crawler visits every directly accessible page. A sketch of checking these rules programmatically appears at the end of this section.

        # Example

        User-agent: *    # the crawler(s) these rules apply to; * means all crawlers
        Disallow: /      # paths the crawler may not crawl; / alone disallows every page
        Allow: /public/  # usually paired with Disallow to carve out exceptions; here all pages are disallowed except those under the /public/ directory
        
    • Crawler names
      • Common search crawlers have fixed names; for example, Baidu's crawler is called Baiduspider.

      Some common crawler names: Baiduspider (Baidu), Googlebot (Google), Bingbot (Bing), 360Spider (360 Search).
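
      The rules above can also be evaluated in code: the standard library's urllib.robotparser downloads and parses robots.txt and answers whether a given crawler may fetch a URL. A minimal sketch (the Baidu URLs are only illustrative, and the printed results depend on the site's current robots.txt):

        from urllib.robotparser import RobotFileParser

        rp = RobotFileParser()
        rp.set_url('http://www.baidu.com/robots.txt')
        rp.read()  # download and parse robots.txt
        # can_fetch(user_agent, url) -> True if that crawler may fetch the URL
        print(rp.can_fetch('*', 'http://www.baidu.com/s?wd=python'))
        print(rp.can_fetch('Baiduspider', 'http://www.baidu.com/'))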

To be continued...

Origin: blog.csdn.net/szshiquan/article/details/123364718