Use a thread pool combined with coroutines (multiple concurrent tasks) to crawl efficiently. The demo below crawls audio data.
Efficiently crawling a large amount of audio data
The main idea:
<variable names below are not exactly the same as in the code; they are chosen for ease of description>
1. First collect all the audio URL addresses and put them into two url_lists to simulate multiple tasks <if you had, say, 5,000,000 of them, you would split them into several lists and process each list as its own task; here we use two>
2. Create the coroutine function: it fetches the audio data and returns a dictionary <audio name, audio bytes>
3. Create the callback function callback_: it persists the audio data to storage
4. async_obj_list stores the instantiated coroutine objects
5. task_list stores the coroutines wrapped as task objects <i.e. tasks>, and callback_ is bound to each task at this stage
6. Instantiate the event loop objects
7. func_args_list stores dictionaries, each containing an event loop object and a list of task objects <i.e. tasks>
8. Create pool_func: it runs one batch of coroutine tasks asynchronously on its event loop
9. Instantiate the thread pool and map each <event loop, task list> pair to pool_func, officially starting the crawl
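The steps above can be sketched with dummy coroutines before looking at the full crawler code (no network needed; the names here are illustrative, not taken from the crawler below):

```python
import asyncio
from multiprocessing.dummy import Pool  # thread pool backed by threads

async def work(n):
    await asyncio.sleep(0.01)  # stand-in for an async HTTP request
    return n * 2

def run_tasks(args):
    loop, coros = args
    asyncio.set_event_loop(loop)  # bind this loop to the current worker thread
    tasks = [asyncio.ensure_future(c, loop=loop) for c in coros]  # wrap as tasks
    loop.run_until_complete(asyncio.wait(tasks))
    loop.close()
    return [t.result() for t in tasks]

# one independent event loop per worker thread; coroutine objects created up front
loops = [asyncio.new_event_loop(), asyncio.new_event_loop()]
batches = [[work(1), work(2)], [work(3), work(4)]]

pool = Pool(2)
results = pool.map(run_tasks, list(zip(loops, batches)))
print(results)  # [[2, 4], [6, 8]]
```

Each worker thread drives its own event loop to completion; the thread pool just provides the parallelism between the two batches.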
import requests
from time import time
from lxml import etree
from multiprocessing.dummy import Pool
import asyncio
import aiohttp
import os
pool = Pool(12)  # thread pool; only two task batches below, so 2 threads would suffice
url = 'https://www.ximalaya.com/revision/play/album?albumId=20337620&pageNum=1&sort=1&pageSize=30'
headers = {
"User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36"
}
index_dic = requests.get(url, headers=headers).json()
urls1 = []
urls2 = []
count = 0
# collect the audio urls, alternating between the two lists
for dic in index_dic['data']['tracksAudioPlay']:
    my_dic = {}
    my_dic['url'] = dic['src']
    my_dic['title'] = dic['trackName']
    if count == 0:
        urls1.append(my_dic)
        count += 1
    else:
        count -= 1
        urls2.append(my_dic)
# coroutine function: fetch the audio data as bytes
async def download(dic):
    down_dic = {}
    async with aiohttp.ClientSession() as s:  # session that supports async requests
        async with s.get(dic['url'], headers=headers) as response:
            bytes_music = await response.read()  # audio data as bytes
            down_dic['title'] = dic['title']
            down_dic['bytes_music'] = bytes_music
            return down_dic
# callback: persist the audio data; it receives the finished task object
def callback_(task):
    my_dic = task.result()
    path_ = 'xxx'  # target directory
    # note: trackName carries no file extension, so you may want to append one
    with open(os.path.join(path_, my_dic['title']), 'wb') as f:
        f.write(my_dic['bytes_music'])
    print("%s done!!!" % my_dic['title'])
# two independent event loops, one per worker thread
# (calling asyncio.get_event_loop() twice would return the same loop,
#  which cannot be run from two threads at once)
for_loop1 = asyncio.new_event_loop()
for_loop2 = asyncio.new_event_loop()
async_obj_list1 = [download(i) for i in urls1]  # instantiate coroutine objects
async_obj_list2 = [download(i) for i in urls2]
task_list1 = []
task_list2 = []
for async_obj in async_obj_list1:
    task = asyncio.ensure_future(async_obj, loop=for_loop1)  # wrap the coroutine as a task
    task.add_done_callback(callback_)
    task_list1.append(task)
for async_obj in async_obj_list2:
    task = asyncio.ensure_future(async_obj, loop=for_loop2)
    task.add_done_callback(callback_)
    task_list2.append(task)
func_args_list = [{'for_loop': for_loop1, 'task_list': task_list1},
                  {'for_loop': for_loop2, 'task_list': task_list2}]
def pool_func(dic):
    loop_ = dic['for_loop']
    task_list = dic['task_list']
    asyncio.set_event_loop(loop_)  # bind this loop to the current worker thread
    loop_.run_until_complete(asyncio.wait(task_list))
pool.map(pool_func, func_args_list)
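On Python 3.7+ the per-thread loop bookkeeping can be simplified: each worker thread can call asyncio.run, which creates and closes its own event loop. A minimal sketch with a dummy coroutine (the names fetch/run_batch are illustrative; the add_done_callback persistence step would move into the coroutine or a wrapper):

```python
import asyncio
from multiprocessing.dummy import Pool

async def fetch(n):
    await asyncio.sleep(0.01)  # stand-in for the aiohttp request
    return n * n

async def run_batch(values):
    # gather schedules all coroutines concurrently on this thread's loop
    return await asyncio.gather(*(fetch(v) for v in values))

def pool_func(values):
    return asyncio.run(run_batch(values))  # fresh event loop per call/thread

pool = Pool(2)
results = pool.map(pool_func, [[1, 2], [3, 4]])
print(results)  # [[1, 4], [9, 16]]
```

This avoids sharing or pre-creating event loops entirely; each thread owns its loop for exactly the duration of its batch.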