Application of Python in the field of web crawlers

Python is a powerful and widely used programming language. So what are its main application areas? In fact, Python has a very broad range of applications, covering almost every corner of the Internet industry, and many large and medium-sized Internet companies rely on it for all kinds of tasks: abroad, Google and YouTube; in China, Baidu, Sina, Alibaba, NetEase, Taobao, Zhihu, Douban, Meituan, and others. Overall, the main fields where Python is applied include web application development, automated operations and maintenance, artificial intelligence, web crawlers, and game development.

Here we focus on the field of web crawlers. Python has been used to write web crawlers from early on, and search engine companies such as Baidu use it extensively for this purpose. From a technical point of view, Python provides many tools for writing crawlers, such as urllib, Selenium, and BeautifulSoup, as well as the web crawler framework Scrapy. Scrapy is a relatively mature, fast, high-level crawling framework written in Python that can efficiently crawl web pages and extract structured data. When crawling data with Scrapy, however, the target website often has a strict anti-crawling mechanism, the most common being IP-based access restrictions, so the question becomes how to add proxy IPs during crawling to bypass the anti-crawling mechanism and successfully obtain the data. As an example, take the task of visiting Baidu and searching for keywords, adding a proxy IP to carry out the data acquisition. The code implementation is as follows:

    # -*- coding: utf-8 -*-
    import base64
    import sys
    import random

    PY3 = sys.version_info[0] >= 3

    def base64ify(bytes_or_str):
        # Encode the proxy credentials as URL-safe Base64, returning str on Python 3.
        if PY3 and isinstance(bytes_or_str, str):
            input_bytes = bytes_or_str.encode('utf8')
        else:
            input_bytes = bytes_or_str

        output_bytes = base64.urlsafe_b64encode(input_bytes)
        if PY3:
            return output_bytes.decode('ascii')
        else:
            return output_bytes

    class ProxyMiddleware(object):
        def process_request(self, request, spider):
            # Proxy server (product site: www.16yun.cn)
            proxyHost = "t.16yun.cn"
            proxyPort = "31111"

            # Proxy authentication credentials
            proxyUser = "16MNGEBC"
            proxyPass = "854726"

            request.meta['proxy'] = "http://{0}:{1}".format(proxyHost, proxyPort)

            # Scrapy >= 2.6.2 (https://docs.scrapy.org/en/latest/news.html?highlight=2.6.2#scrapy-2-6-2-2022-07-25)
            # sets the Proxy-Authorization request header automatically, so no auth header is needed.
            # For Scrapy < 2.6.2, add the proxy authentication header manually:
            # request.headers['Proxy-Authorization'] = 'Basic ' + base64ify(proxyUser + ":" + proxyPass)

            # Set an IP-switching header if needed
            # tunnel = random.randint(1, 10000)
            # request.headers['Proxy-Tunnel'] = str(tunnel)

            # Close the TCP connection after every request to force an IP switch on each visit
            request.headers['Connection'] = "Close"
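
For the middleware to take effect, it has to be registered in the Scrapy project's settings.py. A minimal sketch, assuming the class above lives in a module named middlewares.py inside a project package named myproject (both names are placeholders to adjust for your project):

    # settings.py -- register the custom proxy middleware
    DOWNLOADER_MIDDLEWARES = {
        # 'myproject.middlewares.ProxyMiddleware' is a placeholder import path;
        # change it to match your actual project and module names.
        'myproject.middlewares.ProxyMiddleware': 543,
    }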

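For the Baidu keyword search scenario described above, a minimal spider sketch could look like the following (the spider name, keyword list, and CSS selector are illustrative assumptions and have not been verified against Baidu's current page structure):

    import scrapy
    from urllib.parse import quote

    class BaiduSearchSpider(scrapy.Spider):
        # Illustrative spider: each keyword is queried through the proxy middleware above.
        name = 'baidu_search'
        keywords = ['python', 'scrapy']  # placeholder keywords

        def start_requests(self):
            for kw in self.keywords:
                url = 'https://www.baidu.com/s?wd={}'.format(quote(kw))
                yield scrapy.Request(url, callback=self.parse)

        def parse(self, response):
            # 'h3 a::text' is an assumed selector for result titles; adjust as needed.
            for title in response.css('h3 a::text').getall():
                yield {'title': title.strip()}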