(Asynchronous crawler) The use of proxy IP in requests and aiohttp

For a crawler to work well, proxy IPs are indispensable. Most websites now have some anti-crawling measures: access them a little too fast and you'll find your IP blocked, or be asked to complete a verification challenge. Here I'll cover how to use proxy IPs with two commonly used modules. Without further ado, let's start.

Using proxy IPs in requests:
To use a proxy IP in requests, you only need to add a proxies parameter. Its value is a dictionary: the key is the proxy protocol (http/https) and the value is the proxy's IP and port. The format looks like this.

import requests

headers = {'User-Agent': 'Mozilla/5.0'}  # any reasonable User-Agent works

try:
    response = requests.get('https://httpbin.org/ip', headers=headers,
                            proxies={'https': 'https://221.122.91.74:9401'},
                            timeout=6)
    print('success')
    # Check whether the proxy IP was actually used:
    # Option 1: print the IP address that sent the request;
    # this requires passing stream=True to get()
    # print(response.raw._connection.sock.getpeername()[0])
    # Option 2: simply print the test site's response body
    print(response.text)
except Exception as e:
    print('error', e)

Note: the key of proxies (http/https) must match the scheme of the URL you request; otherwise the request is sent directly from your local IP.
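To make that matching rule concrete, here is a minimal sketch. proxy_for is a hypothetical helper of mine that mimics how requests looks up the proxy for a URL; it is not part of the requests API, and the proxy address is just the example one from above.

```python
from urllib.parse import urlparse

def proxy_for(url, proxies):
    # requests picks the proxies entry whose key equals the URL scheme;
    # if no key matches, the request goes out directly from the local IP.
    return proxies.get(urlparse(url).scheme)

proxies = {'https': 'https://221.122.91.74:9401'}

print(proxy_for('https://httpbin.org/ip', proxies))  # matches the 'https' key
print(proxy_for('http://httpbin.org/ip', proxies))   # None -> direct access
```

So a dictionary with only an 'https' key does nothing for plain http URLs, which is exactly the silent fallback the note warns about.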

Using proxy IPs in aiohttp:
Because the requests module does not support asynchronous requests, I had to use aiohttp, and fell into quite a few pits along the way.
Its usage is similar to requests: you also add a parameter to the get() method, but here the parameter is named proxy and its value is a string. The proxy protocol in that string only supports http; writing https raises an error.
Below is a record of my debugging process.
Following the usage I found online, I first tried the following code.

import asyncio
import aiohttp

headers = {'User-Agent': 'Mozilla/5.0'}  # any reasonable User-Agent works

async def func():
    async with aiohttp.ClientSession() as session:
        try:
            async with session.get('https://httpbin.org/ip', headers=headers,
                                   proxy='http://183.220.145.3:80',
                                   timeout=6) as response:
                page_text = await response.text()
                print('success')
                print(page_text)
        except Exception as e:
            print(e)
            print('error')

if __name__ == '__main__':
    asyncio.run(func())

After a first round of modification, I tried again:

async def func():
    # verify_ssl=False disables certificate verification
    # (an older spelling; newer aiohttp versions use ssl=False)
    connector = aiohttp.TCPConnector(verify_ssl=False)
    async with aiohttp.ClientSession(connector=connector) as session:
        try:
            async with session.get('https://httpbin.org/ip', headers=headers,
                                   proxy='http://183.220.145.3:80',
                                   timeout=6) as response:
                page_text = await response.text()
                print(page_text)
                print('success')
        except Exception as e:
            print(e)
            print('error')

That did not solve the problem; it only added a warning on top. Fortunately, one more change fixed it. I'll skip the intermediate attempts and go straight to the final version.

import asyncio
import aiohttp

headers = {'User-Agent': 'Mozilla/5.0'}  # any reasonable User-Agent works

# Change the event loop policy. This cannot go inside the coroutine;
# it must be executed first.
asyncio.set_event_loop_policy(asyncio.WindowsSelectorEventLoopPolicy())

async def func():
    # Add trust_env=True
    async with aiohttp.ClientSession(connector=aiohttp.TCPConnector(ssl=False),
                                     trust_env=True) as session:
        try:
            async with session.get('https://httpbin.org/ip', headers=headers,
                                   proxy='http://183.220.145.3:80',
                                   timeout=10) as response:
                page_text = await response.text()
                print(page_text)
                print('success')
        except Exception as e:
            print(e)
            print('error')

if __name__ == '__main__':
    asyncio.run(func())
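One caveat worth noting: asyncio.WindowsSelectorEventLoopPolicy only exists on Windows, so the unconditional policy call above raises AttributeError on Linux or macOS. A portable guard might look like this (a sketch of my own, not from the original post):

```python
import asyncio
import sys

# WindowsSelectorEventLoopPolicy exists only on Windows; guarding the
# call lets the same script also run on Linux/macOS.
if sys.platform == 'win32':
    asyncio.set_event_loop_policy(asyncio.WindowsSelectorEventLoopPolicy())

async def main():
    # stand-in for the real crawling coroutine
    return 'ok'

result = asyncio.run(main())
print(result)
```

On Windows this keeps the selector loop that aiohttp's proxy handling needs; elsewhere the default loop is left untouched.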

Although the debugging process was a bit long, at least I now know how to use proxies in both libraries.

That's all for this post. Sayonara~

Origin blog.csdn.net/qq_43965708/article/details/109622238