This article covers using aiohttp in a single-threaded asynchronous coroutine crawler: single-task and multi-task requests, data parsing, and the use of callbacks.

A brief introduction to aiohttp

aiohttp implements single-threaded concurrent IO operations. It is used in place of the non-asynchronous requests module to send requests, and a request's UA, headers, and parameters can be added as shown in the following methods:

Environment Installation

pip install aiohttp

Using aiohttp

1. Initiating a request

import asyncio
import aiohttp

async def fetch():
    async with aiohttp.ClientSession() as session:
        async with session.get('https://www.baidu.com') as response:
            print(await response.text())

loop = asyncio.get_event_loop()
tasks = [fetch()]
loop.run_until_complete(asyncio.wait(tasks))

2. Adding request parameters:

params = {'key': 'value', 'page': 10}

async def fetch():
    async with aiohttp.ClientSession() as session:
        async with session.get('https://www.baidu.com/s', params=params) as response:
            print(response.url)

loop = asyncio.get_event_loop()
tasks = [fetch()]
loop.run_until_complete(asyncio.wait(tasks))

3. Adding a disguised UA (User-Agent):

url = 'http://httpbin.org/user-agent'
headers = {'User-Agent': 'test_user_agent'}

async def fetch():
    async with aiohttp.ClientSession() as session:
        async with session.get(url, headers=headers) as response:
            print(await response.text())

loop = asyncio.get_event_loop()
tasks = [fetch()]
loop.run_until_complete(asyncio.wait(tasks))

4. Setting custom cookies:

url = 'http://httpbin.org/cookies'
cookies = {'cookies_name': 'test_cookies'}

async def fetch():
    async with aiohttp.ClientSession() as session:
        async with session.get(url, cookies=cookies) as response:
            print(await response.text())

loop = asyncio.get_event_loop()
tasks = [fetch()]
loop.run_until_complete(asyncio.wait(tasks))

5. Sending POST request parameters:

url = 'http://httpbin.org/post'
payload = {'username': 'zhang', 'password': '123456'}

async def fetch():
    async with aiohttp.ClientSession() as session:
        async with session.post(url, data=payload) as response:
            print(await response.text())

loop = asyncio.get_event_loop()
tasks = [fetch()]
loop.run_until_complete(asyncio.wait(tasks))

6. Setting a proxy:

url = "http://python.org"
async def fetch():
    async with aiohttp.ClientSession() as session:
        async with session.get(url, proxy="http://some.proxy.com") as resposne:
        print(resposne.status)

loop = asyncio.get_event_loop()
tasks = [fetch(), ]
loop.run_until_complete(asyncio.wait(tasks))

Asynchronous IO processing

# Environment installation: pip install aiohttp
# Use the ClientSession class from this module
import asyncio
import time
import aiohttp

start = time.time()
urls = [
    'http://127.0.0.1:5000/tiger', 'http://127.0.0.1:5000/jay', 'http://127.0.0.1:5000/tom',
    'http://127.0.0.1:5000/tiger', 'http://127.0.0.1:5000/jay', 'http://127.0.0.1:5000/tom',
    'http://127.0.0.1:5000/tiger', 'http://127.0.0.1:5000/jay', 'http://127.0.0.1:5000/tom',
    'http://127.0.0.1:5000/tiger', 'http://127.0.0.1:5000/jay', 'http://127.0.0.1:5000/tom',
]

async def get_page(url):
    async with aiohttp.ClientSession() as session:
        # get() / post() accept headers, params/data, proxy='http://ip:port'
        async with await session.get(url) as response:
            # text() returns the response data as a string
            # read() returns the response data in binary form
            # json() returns a JSON object
            # Note: the operation must be suspended manually with await before fetching the response data
            page_text = await response.text()
            print(page_text)

tasks = []

for url in urls:
    c = get_page(url)
    task = asyncio.ensure_future(c)
    tasks.append(task)

loop = asyncio.get_event_loop()
loop.run_until_complete(asyncio.wait(tasks))

end = time.time()

print('Total time:', end - start)
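The URLs above point to a local test server that is not shown in the post. The sketch below is an assumption about what it might look like: a Flask app listening on port 5000 whose /tiger, /jay and /tom routes each sleep for about two seconds before returning the 'Hello ...' strings seen in the output further down.

# Hypothetical local test server for the examples above (not part of the original post)
import time
from flask import Flask

app = Flask(__name__)

@app.route('/tiger')
def tiger():
    time.sleep(2)  # simulate a slow, IO-bound response
    return 'Hello tiger'

@app.route('/jay')
def jay():
    time.sleep(2)
    return 'Hello jay'

@app.route('/tom')
def tom():
    time.sleep(2)
    return 'Hello tom'

if __name__ == '__main__':
    app.run(threaded=True)  # threaded, so concurrent requests are not serialized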
# Use aiohttp in place of the requests module
import time
import asyncio
import aiohttp

async def get_page(url):
    async with aiohttp.ClientSession() as session:
        # any time-consuming blocking call must be suspended with await
        async with await session.get(url=url) as response:
            page_text = await response.text()  # read() for binary / json() for JSON
            print('Response data:', page_text)

start = time.time()
urls = [
    'http://127.0.0.1:5000/tiger',
    'http://127.0.0.1:5000/jay',
    'http://127.0.0.1:5000/tom',
]
loop = asyncio.get_event_loop()

tasks = []
for url in urls:
    c = get_page(url)
    task = asyncio.ensure_future(c)
    tasks.append(task)

loop.run_until_complete(asyncio.wait(tasks))
print('Total time:', time.time() - start)
aiohttp is a module that supports asynchronous web requests.

Here we replace the requests library with aiohttp and issue each request through the get() method of aiohttp's ClientSession class, which produces the following results:

Hello tom
Hello jay
Hello tiger
Hello tiger
Hello jay
Hello tiger
Hello tom
Hello jay
Hello jay
Hello tom
Hello tom
Hello tiger
Total time: 2.037203073501587

Success! The total time for the requests dropped from six seconds to about two seconds, roughly a third of the original.

In the code we place await before the get() method. While the five coroutines are executing, whenever an await is encountered the current coroutine is suspended and other coroutines run instead, until they in turn are suspended or finished, and then the next coroutine is executed.

When execution starts, the event loop runs the first task. When that task reaches the first await in front of the get() method it is suspended, but the first step of get() is non-blocking, so it is woken up immediately and continues: the ClientSession object is created, the second await is reached, session.get() is called to issue the request, and the task is suspended again. Because the request takes a long time, the task is not woken up for a while, so the first task remains suspended. What happens next? The event loop looks for a coroutine that is not currently suspended and continues with it, so it turns to the second task, which goes through exactly the same steps, and so on until the fifth task has executed its session.get() and every task is suspended. With all tasks suspended, there is nothing to do but wait. After about three seconds the requests receive their responses almost simultaneously, the tasks are woken up and run to completion, the results are printed, and the whole thing takes only about three seconds!

This is the convenience offered by asynchronous operation: when a blocking operation is encountered, the task is suspended and the program moves on to other tasks instead of waiting idly, so CPU time is used fully rather than wasted waiting on IO.

Clearly, with asynchronous coroutines we can make hundreds of times as many network requests in roughly the same amount of time; used in a crawler, the speed gain is very impressive.
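To watch this suspend-and-resume scheduling without running a web server, here is a minimal sketch (not from the original post) in which asyncio.sleep() stands in for the slow session.get() call; each of the three coroutines "blocks" for two seconds, yet the whole run finishes in about two seconds rather than six:

import asyncio
import time

async def fake_request(name):
    print(name, 'suspended at await')
    await asyncio.sleep(2)  # stands in for the time-consuming session.get() call
    print(name, 'woken up and finished')

start = time.time()
loop = asyncio.get_event_loop()
tasks = [asyncio.ensure_future(fake_request(n)) for n in ('tiger', 'jay', 'tom')]
loop.run_until_complete(asyncio.wait(tasks))
print('Total time:', time.time() - start)  # roughly 2 seconds, not 6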

 

How to perform data parsing: binding a callback to the task (the complete coroutine workflow)

import time
import asyncio
import aiohttp

# Callback: mainly used to parse the response data
def callback(task):
    print('this is callback')
    # Fetch the response data
    page_text = task.result()
    print('The callback function can then parse the data')

async def get_page(url):
    async with aiohttp.ClientSession() as session:
        # any time-consuming blocking call must be suspended with await
        async with await session.get(url=url) as response:
            page_text = await response.text()  # read() for binary / json() for JSON
            print('Response data:', page_text)
            return page_text

start = time.time()
urls = [
    'http://127.0.0.1:5000/tiger',
    'http://127.0.0.1:5000/jay',
    'http://127.0.0.1:5000/tom',
]

# Step 1: create the event loop object
loop = asyncio.get_event_loop()

# Task list
tasks = []
for url in urls:
    c = get_page(url)
    # Step 2: wrap the coroutine object into a task object
    task = asyncio.ensure_future(c)
    # Bind a callback to the task object so it can parse the response data
    task.add_done_callback(callback)
    # Step 3: add every task to the task list
    tasks.append(task)

# Step 4: run the event loop; asyncio.wait() runs the multiple tasks in the loop automatically
loop.run_until_complete(asyncio.wait(tasks))
print('Total time:', time.time() - start)
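As an illustration of the parsing step the callback is meant to perform, here is a small hypothetical variant (an assumption, not part of the original post) that extracts the name from the 'Hello ...' responses returned by the local test server:

# Hypothetical parsing callback (assumes responses look like 'Hello tom')
def parse_callback(task):
    page_text = task.result()  # the value returned by get_page()
    name = page_text.replace('Hello', '').strip()
    print('Parsed name:', name)

# Usage: bind it in place of the simple callback above
# task.add_done_callback(parse_callback)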



Origin www.cnblogs.com/caiwenjun/p/11761736.html