「This chapter does nothing fancy: just a quick comparison of crawler speeds.」
Results first. The figures below are the best of several runs; the time of day does affect throughput, so treat them only as a rough reference.
"""
Plain function (synchronous):  total time 3.330171585083008 s
Thread pool:                   total time 1.6058530807495117 s
Multithreading:                total time 1.8512330055236816 s
Async coroutines:              total time 1.091230869293213 s
Thread pool + coroutines:      total time 0.8936080932617188 s
"""
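The "best of several runs" numbers above can be reproduced with a small helper like the following. This is only a sketch; the name `best_of` and the run count are my own choices, not from the original post.

```python
import time

def best_of(func, runs=5):
    """Run func several times and return the fastest wall-clock time."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()  # monotonic, high-resolution timer
        func()
        samples.append(time.perf_counter() - start)
    return min(samples)  # the best case filters out network jitter

if __name__ == '__main__':
    # Dummy workload standing in for one crawler run.
    fastest = best_of(lambda: sum(range(100_000)), runs=3)
    print("total time {} s".format(fastest))
```

Taking the minimum rather than the mean is the usual choice for micro-benchmarks, since outside interference only ever makes a run slower.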
Plain function
That is, a synchronous crawler.
```python
import requests
import time

List_Url = ['https://img.soutula.com/large/006APoFYly8hbhsbsmciuj30hs0hfaam.jpg',
            'https://img.soutula.com/large/ceeb653ely8hbhrnv8vx1j20hs0hj3zo.jpg',
            'https://img.soutula.com/large/006APoFYly8hbhm20mb6rj30hs0hmt9j.jpg',
            'https://img.soutula.com/large/ceeb653ely8hbhixn8cvvj205k0563yf.jpg',
            'https://img.soutula.com/large/ceeb653ely8hbhbzwzy4ej20hs0f1753.jpg']

def run():
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.106 Safari/537.36 Edg/91.0.864.53',
        'Referer': 'https://www.fabiaoqing.com/biaoqing'
    }
    # Download the images one after another, in a single thread.
    for Url in List_Url:
        response = requests.get(Url, headers=headers)
        with open("爬fabiaoqing网图/" + Url[-8:-4] + '.jpg', 'wb') as w:
            w.write(response.content)

if __name__ == '__main__':
    s = time.time()
    run()
    print("total time {} s".format(time.time() - s))
```
Nothing much to see here; on to the next one.
Thread pool and multithreading
```python
import requests
import time
import threading
from concurrent.futures import ThreadPoolExecutor

List_Url = ['https://img.soutula.com/large/006APoFYly8hbhsbsmciuj30hs0hfaam.jpg',
            'https://img.soutula.com/large/ceeb653ely8hbhrnv8vx1j20hs0hj3zo.jpg',
            'https://img.soutula.com/large/006APoFYly8hbhm20mb6rj30hs0hmt9j.jpg',
            'https://img.soutula.com/large/ceeb653ely8hbhixn8cvvj205k0563yf.jpg',
            'https://img.soutula.com/large/ceeb653ely8hbhbzwzy4ej20hs0f1753.jpg']

def run(Url):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.106 Safari/537.36 Edg/91.0.864.53',
        'Referer': 'https://www.fabiaoqing.com/biaoqing'
    }
    response = requests.get(Url, headers=headers)
    with open("爬fabiaoqing网图/" + Url[-8:-4] + '.jpg', 'wb') as w:
        w.write(response.content)

if __name__ == '__main__':
    s = time.time()
    # Thread-pool version:
    with ThreadPoolExecutor(max_workers=5) as pool:
        # for Url in List_Url:
        #     pool.submit(run, Url)
        pool.map(run, List_Url)

    # Plain-threads version (uncomment to compare):
    # threads = []
    # for url in List_Url:
    #     thread = threading.Thread(target=run, args=(url,))
    #     thread.start()
    #     threads.append(thread)
    # [j.join() for j in threads]
    print("total time {} s".format(time.time() - s))
```
Generally speaking, these two are the most common choices for crawlers. The efficiency difference between them is small; the result depends mostly on response times at the moment you run it.
What is worth noting is the difference in how they are used.
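That usage difference largely comes down to `map` versus `submit` plus `as_completed`. A minimal sketch, with a hypothetical `fetch` standing in for the real download:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def fetch(url):
    # Stand-in for the real download; just echoes the url.
    return "done:" + url

urls = ['a.jpg', 'b.jpg', 'c.jpg']

with ThreadPoolExecutor(max_workers=3) as pool:
    # map: results come back in input order, like the loop it replaces.
    ordered = list(pool.map(fetch, urls))

    # submit + as_completed: futures yield in completion order,
    # and each future lets you handle per-task exceptions.
    futures = {pool.submit(fetch, u): u for u in urls}
    for fut in as_completed(futures):
        result = fut.result()  # raises here if that task failed

print(ordered)
```

`map` is the more compact form; `submit` gives you a `Future` per task, which matters once you want retries or per-URL error handling.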
Coroutines
```python
import asyncio
import time
import aiohttp

List_Url = ['https://img.soutula.com/large/006APoFYly8hbhsbsmciuj30hs0hfaam.jpg',
            'https://img.soutula.com/large/ceeb653ely8hbhrnv8vx1j20hs0hj3zo.jpg',
            'https://img.soutula.com/large/006APoFYly8hbhm20mb6rj30hs0hmt9j.jpg',
            'https://img.soutula.com/large/ceeb653ely8hbhixn8cvvj205k0563yf.jpg',
            'https://img.soutula.com/large/ceeb653ely8hbhbzwzy4ej20hs0f1753.jpg']

async def run(session, Url):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.106 Safari/537.36 Edg/91.0.864.53',
        'Referer': 'https://www.fabiaoqing.com/biaoqing'
    }
    async with session.get(Url, headers=headers) as response:
        content = await response.content.read()
        # Note: the file write is still blocking, but these files are small.
        with open("爬fabiaoqing网图/" + Url[-8:-4] + '.jpg', 'wb') as w:
            w.write(content)

async def main():
    # One shared session for all requests.
    async with aiohttp.ClientSession() as session:
        tasks = [asyncio.create_task(run(session, Url)) for Url in List_Url]
        await asyncio.wait(tasks)

if __name__ == '__main__':
    s = time.time()
    asyncio.run(main())  # preferred over get_event_loop()/run_until_complete
    print("total time {} s".format(time.time() - s))
```
If coroutines are unfamiliar, review the basics first and then come back. Running the downloads as concurrent coroutines gives a fairly clear speed-up here; when the network is stable, it is faster than multithreading.
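Once the URL list grows beyond a handful, it is common to cap concurrency with a semaphore and collect results with `asyncio.gather`. A sketch under stated assumptions: `asyncio.sleep` stands in for the real request, and the limit of 2 is an arbitrary example.

```python
import asyncio

async def fetch(url, sem):
    async with sem:                # only N downloads in flight at once
        await asyncio.sleep(0.01)  # stand-in for session.get(...)
        return "done:" + url

async def main(urls):
    sem = asyncio.Semaphore(2)     # cap concurrent requests at 2
    # gather returns results in the same order as its arguments
    return await asyncio.gather(*(fetch(u, sem) for u in urls))

if __name__ == '__main__':
    results = asyncio.run(main(['a.jpg', 'b.jpg', 'c.jpg']))
    print(results)
```

Without a cap, a few thousand URLs would open a few thousand simultaneous connections, which most servers will throttle or refuse.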
Thread pool + async coroutines
「A small caveat: nothing here has been rigorously verified; this was just a quick test.」Corrections are welcome.
```python
import asyncio
import time
import aiohttp
from concurrent.futures import ThreadPoolExecutor

List_Url = ['https://img.soutula.com/large/006APoFYly8hbhsbsmciuj30hs0hfaam.jpg',
            'https://img.soutula.com/large/ceeb653ely8hbhrnv8vx1j20hs0hj3zo.jpg',
            'https://img.soutula.com/large/006APoFYly8hbhm20mb6rj30hs0hmt9j.jpg',
            'https://img.soutula.com/large/ceeb653ely8hbhixn8cvvj205k0563yf.jpg',
            'https://img.soutula.com/large/ceeb653ely8hbhbzwzy4ej20hs0f1753.jpg']

async def run(session, Url):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.106 Safari/537.36 Edg/91.0.864.53',
        'Referer': 'https://www.fabiaoqing.com/biaoqing'
    }
    async with session.get(Url, headers=headers) as response:
        content = await response.content.read()
        with open("爬fabiaoqing网图/" + Url[-8:-4] + '.jpg', 'wb') as w:
            w.write(content)

async def main():
    async with aiohttp.ClientSession() as session:
        tasks = [asyncio.create_task(run(session, Url)) for Url in List_Url]
        await asyncio.wait(tasks)

if __name__ == '__main__':
    s = time.time()
    with ThreadPoolExecutor(max_workers=5) as pool:
        # Hand the coroutine to a worker thread; asyncio.run builds an
        # event loop inside that thread. (Calling run_until_complete()
        # directly here would run the loop in the main thread instead.)
        pool.submit(asyncio.run, main())
    print("total time {} s".format(time.time() - s))
```
Summary
In the code above, please forgive some redundancy in usage!
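As a final note, the combination that actually exercises both layers is to split the URL list into chunks and let each worker thread run its own event loop over one chunk. A sketch, not from the original post: `asyncio.sleep` stands in for the download, and the chunking scheme and worker count are arbitrary choices.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor

async def fetch(url):
    await asyncio.sleep(0.01)  # stand-in for an aiohttp request
    return "done:" + url

async def crawl(chunk):
    # Download one chunk concurrently within a single event loop.
    return await asyncio.gather(*(fetch(u) for u in chunk))

def run_chunk(chunk):
    # Each worker thread gets its own private event loop via asyncio.run.
    return asyncio.run(crawl(chunk))

def crawl_all(urls, n_threads=2):
    # Round-robin split of the url list across threads.
    chunks = [urls[i::n_threads] for i in range(n_threads)]
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        results = list(pool.map(run_chunk, chunks))
    # Flatten the per-chunk result lists.
    return [r for chunk_result in results for r in chunk_result]

if __name__ == '__main__':
    print(crawl_all(['a.jpg', 'b.jpg', 'c.jpg', 'd.jpg']))
```

For purely I/O-bound downloads like these, a single event loop is usually already enough; the thread pool starts to pay off when each task also does CPU-heavy work such as image processing.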