Single-threaded crawler vs. multiprocess distributed crawler, speeding up a crawler with asyncio, and asynchronous requests with aiohttp

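Single-threaded crawler vs. multiprocess distributed crawler:

Let's compare a single-threaded crawler with a multiprocess distributed crawler.

[Figure: schematic of the distributed crawler]

How the distributed crawler works:

We start from the home page of a website, which contains many URLs. Several Python processes download those URLs at the same time. Once the HTML is available, the pages are parsed in parallel to find links on the site that have not been crawled yet. This repeats until every page of the site has been crawled.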
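The round-by-round process described above can be sketched without any network access. This is only an illustration: the `SITE` dictionary and the `crawl_all` function are hypothetical names, and the "site" is a made-up link graph standing in for real pages.

```python
# Hypothetical in-memory "site": each page maps to the pages it links to.
SITE = {
    "/": ["/a/", "/b/"],
    "/a/": ["/", "/c/"],
    "/b/": ["/a/"],
    "/c/": [],
}


def crawl_all(start):
    unseen = {start}  # discovered but not yet visited
    seen = set()      # already visited
    rounds = 0
    while unseen:
        rounds += 1
        # "Download and parse" every page of this round and collect its links.
        found = set()
        for page in unseen:
            found.update(SITE[page])
        seen.update(unseen)
        unseen = found - seen  # keep only links that were not visited yet
    return seen, rounds


visited, rounds = crawl_all("/")
print(sorted(visited), rounds)
```

Each round handles all currently known unvisited pages as one batch, which is what makes it possible to hand the batch to several processes at once.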

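For example, suppose we want to download every page of this site:

https://morvanzhou.github.io/

Open the home page and view its source. The links to the site's own pages are a tags whose href is a site-relative path that starts and ends with a slash.

[Figure: link markup in the page source]

This gives us the basis for parsing the pages and extracting their URLs.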
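As a small sketch of that extraction step, the snippet below applies the same regular expression and `urljoin` call that the crawler uses to a handful of href values. The href values are made up for illustration and are not taken from the site.

```python
import re
from urllib.parse import urljoin

base_url = 'https://morvanzhou.github.io/'
pattern = re.compile('^/.+?/$')

# Hypothetical href values, made up for illustration.
hrefs = ['/tutorials/', '/about/', 'https://example.com/', '/', '/tutorials/']

# Keep only site-relative paths that start and end with a slash.
internal = [h for h in hrefs if pattern.match(h)]
# Turn them into absolute URLs; the set removes the duplicate.
absolute = {urljoin(base_url, h) for h in internal}

print(internal)
print(sorted(absolute))
```

The bare `/` is rejected because `.+?` needs at least one character between the two slashes, and the external link is rejected because it does not start with `/`.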

Let's try the single-threaded crawler first:

import time
from urllib.request import urlopen
from urllib.parse import urljoin
from bs4 import BeautifulSoup
import re

# urlopen downloads pages; BeautifulSoup parses them
base_url = 'https://morvanzhou.github.io/'


def crawl(url):
	response = urlopen(url)
	return response.read().decode('utf-8')


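# crawl() returns the downloaded page as text

def parse(html):
	# Parse one page
	soup = BeautifulSoup(html, 'lxml')
	# Build the parse tree with the lxml parser
	urls = soup.find_all('a', {"href": re.compile('^/.+?/$')})
	# ^ matches the start of the string.
	# $ matches the end of the string.
	# / is a literal slash here: the href must start and end with "/".
	# \ (backslash) escapes special characters; none is needed in this pattern.
	# .+? matches one or more of any character, as few as possible (non-greedy).
	title = soup.find('h1').get_text().strip()
	# strip() removes the given characters from both ends of a string (whitespace by default).
	page_urls = set([urljoin(base_url, url['href']) for url in urls])
	# A set drops duplicate URLs; urljoin combines the base URL with a relative path into an absolute URL.
	url = soup.find('meta', {'property': 'og:url'})['content']
	# On this site each page declares its own URL in a meta tag with property og:url; we only need its content attribute.
	return title, page_urls, url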


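unseen = set([base_url, ])
# URLs that have not been crawled yet
seen = set()
# URLs that have already been crawled

count, t1 = 1, time.time()
while len(unseen) != 0:
	if len(seen) > 20:
		break
	# Stop starting new rounds once more than 20 pages have been crawled
	print('\n开始爬取网页')  # "Start crawling pages"
	htmls = [crawl(url) for url in unseen]
	# Download the source of every page in unseen
	print('\n开始解析网页')  # "Start parsing pages"
	results = [parse(html) for html in htmls]
	# Each result is a (title, page_urls, url) tuple
	print('\n分析网页')  # "Analyzing pages"
	seen.update(unseen)
	# update() adds every element of the set unseen to seen.
	unseen.clear()
	# Remove all elements from the set

	for title, page_urls, url in results:
		print(count, title, url)
		# Print the page number, title and URL
		count += 1
		unseen.update(page_urls - seen)
		# Add the links that have not been crawled yet to unseen
print('total time:%.1f s' % (time.time() - t1,))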

Output:

开始爬取网页

开始解析网页

分析网页
1 教程 https://morvanzhou.github.io/

开始爬取网页

开始解析网页

分析网页
2 Numpy & Pandas 教程系列 https://morvanzhou.github.io/tutorials/data-manipulation/np-pd/
3 数据处理教程系列 https://morvanzhou.github.io/tutorials/data-manipulation/
4 Tkinter GUI 教程系列 https://morvanzhou.github.io/tutorials/python-basic/tkinter/
5 为了更优秀 https://morvanzhou.github.io/support/
6 Pytorch 教程系列 https://morvanzhou.github.io/tutorials/machine-learning/torch/
7 高级爬虫: 高效无忧的 Scrapy 爬虫库 https://morvanzhou.github.io/tutorials/data-manipulation/scraping/5-02-scrapy/
8 Git 版本管理 教程系列 https://morvanzhou.github.io/tutorials/others/git/
9 说吧~ https://morvanzhou.github.io/discuss/
10 近期更新 https://morvanzhou.github.io/recent-posts/
11 迁移学习 Transfer Learning https://morvanzhou.github.io/tutorials/machine-learning/tensorflow/5-16-transfer-learning/
12 关于莫烦 https://morvanzhou.github.io/about/
13 计算机视觉 教程系列 https://morvanzhou.github.io/tutorials/machine-learning/computer-vision/
14 Linux 简易教学 https://morvanzhou.github.io/tutorials/others/linux-basic/
15 Matplotlib 画图教程系列 https://morvanzhou.github.io/tutorials/data-manipulation/plt/
16 迁移学习 Transfer Learning https://morvanzhou.github.io/tutorials/machine-learning/ML-intro/2-9-transfer-learning/
17 基础教程系列 https://morvanzhou.github.io/tutorials/python-basic/basic/
18 Sklearn 通用机器学习 教程系列 https://morvanzhou.github.io/tutorials/machine-learning/sklearn/
19 multiprocessing 多进程教程系列 https://morvanzhou.github.io/tutorials/python-basic/multiprocessing/
20 机器学习系列 https://morvanzhou.github.io/tutorials/machine-learning/
21 推荐学习顺序 https://morvanzhou.github.io/learning-steps/
22 Theano 教程系列 https://morvanzhou.github.io/tutorials/machine-learning/theano/
23 网页爬虫教程系列 https://morvanzhou.github.io/tutorials/data-manipulation/scraping/
24 Keras 教程系列 https://morvanzhou.github.io/tutorials/machine-learning/keras/
25 其他教学系列 https://morvanzhou.github.io/tutorials/others/
26 有趣的机器学习系列 https://morvanzhou.github.io/tutorials/machine-learning/ML-intro/
27 Threading 多线程教程系列 https://morvanzhou.github.io/tutorials/python-basic/threading/
28 强化学习 Reinforcement Learning 教程系列 https://morvanzhou.github.io/tutorials/machine-learning/reinforcement-learning/
29 Why? https://morvanzhou.github.io/tutorials/data-manipulation/scraping/1-00-why/
30 进化算法 Evolutionary Strategies 教程系列 https://morvanzhou.github.io/tutorials/machine-learning/evolutionary-algorithm/
31 Tensorflow 教程系列 https://morvanzhou.github.io/tutorials/machine-learning/tensorflow/
32 高级爬虫: 让 Selenium 控制你的浏览器帮你爬 https://morvanzhou.github.io/tutorials/data-manipulation/scraping/5-01-selenium/
33 机器学习实践 https://morvanzhou.github.io/tutorials/machine-learning/ML-practice/
34 Python基础 教程系列 https://morvanzhou.github.io/tutorials/python-basic/
total time:24.5 s

Now let's try the multiprocess crawler:

import multiprocessing as mp
# the multiprocessing module is required
import time
from urllib.request import urlopen
from urllib.parse import urljoin
from bs4 import BeautifulSoup
import re

# urlopen downloads pages; BeautifulSoup parses them
base_url = 'https://morvanzhou.github.io/'


def crawl(url):
	response = urlopen(url)
	return response.read().decode('utf-8')


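# crawl() returns the downloaded page as text

def parse(html):
	# Parse one page
	soup = BeautifulSoup(html, 'lxml')
	# Build the parse tree with the lxml parser
	urls = soup.find_all('a', {"href": re.compile('^/.+?/$')})
	# ^ matches the start of the string.
	# $ matches the end of the string.
	# / is a literal slash here: the href must start and end with "/".
	# \ (backslash) escapes special characters; none is needed in this pattern.
	# .+? matches one or more of any character, as few as possible (non-greedy).
	title = soup.find('h1').get_text().strip()
	# strip() removes the given characters from both ends of a string (whitespace by default).
	page_urls = set([urljoin(base_url, url['href']) for url in urls])
	# A set drops duplicate URLs; urljoin combines the base URL with a relative path into an absolute URL.
	url = soup.find('meta', {'property': 'og:url'})['content']
	# On this site each page declares its own URL in a meta tag with property og:url; we only need its content attribute.
	return title, page_urls, url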


unseen = set([base_url, ])
# URLs that have not been crawled yet
seen = set()
# URLs that have already been crawled; empty at the start
if __name__ == '__main__':
	# On Windows, code that uses multiprocessing must sit under
	# if __name__ == '__main__':, otherwise it raises an error.
	pool = mp.Pool(4)
	# Create a pool of 4 worker processes
	count, t1 = 1, time.time()
	while len(unseen) != 0:
		if len(seen) > 20:
			break
		# Stop starting new rounds once more than 20 pages have been crawled
		crawl_jobs = [pool.apply_async(crawl, args=(url,)) for url in unseen]
		# apply_async() submits the call to the pool and returns an AsyncResult
		# object immediately, without waiting for the worker to finish.
		# Calling get() on that object blocks until the return value is ready.
		# apply() is the blocking variant: it waits for the call to finish
		# and returns the value directly.
		print('\n开始爬取网页')  # "Start crawling pages"
		htmls = [j.get() for j in crawl_jobs]
		# Collect all downloaded pages
		print('\n开始解析网页')  # "Start parsing pages"
		parse_jobs = [pool.apply_async(parse, args=(html,)) for html in htmls]
		# Parse the pages with pool.apply_async() as well.
		# Downloading and parsing now run in worker processes instead of a plain loop.
		results = [j.get() for j in parse_jobs]
		print('\n分析网页')  # "Analyzing pages"
		seen.update(unseen)
		# update() adds every element of the set unseen to seen.
		unseen.clear()
		# Remove all elements from the set

		for title, page_urls, url in results:
			print(count, title, url)
			# Print the page number, title and URL
			count += 1
			unseen.update(page_urls - seen)
			# Add the links that have not been crawled yet to unseen

	print('total time:%.1f s' % (time.time() - t1,))

Output:

开始爬取网页

开始解析网页

分析网页
1 教程 https://morvanzhou.github.io/

开始爬取网页

开始解析网页

分析网页
2 关于莫烦 https://morvanzhou.github.io/about/
3 网页爬虫教程系列 https://morvanzhou.github.io/tutorials/data-manipulation/scraping/
4 基础教程系列 https://morvanzhou.github.io/tutorials/python-basic/basic/
5 强化学习 Reinforcement Learning 教程系列 https://morvanzhou.github.io/tutorials/machine-learning/reinforcement-learning/
6 Linux 简易教学 https://morvanzhou.github.io/tutorials/others/linux-basic/
7 multiprocessing 多进程教程系列 https://morvanzhou.github.io/tutorials/python-basic/multiprocessing/
8 迁移学习 Transfer Learning https://morvanzhou.github.io/tutorials/machine-learning/tensorflow/5-16-transfer-learning/
9 为了更优秀 https://morvanzhou.github.io/support/
10 有趣的机器学习系列 https://morvanzhou.github.io/tutorials/machine-learning/ML-intro/
11 近期更新 https://morvanzhou.github.io/recent-posts/
12 推荐学习顺序 https://morvanzhou.github.io/learning-steps/
13 高级爬虫: 让 Selenium 控制你的浏览器帮你爬 https://morvanzhou.github.io/tutorials/data-manipulation/scraping/5-01-selenium/
14 计算机视觉 教程系列 https://morvanzhou.github.io/tutorials/machine-learning/computer-vision/
15 迁移学习 Transfer Learning https://morvanzhou.github.io/tutorials/machine-learning/ML-intro/2-9-transfer-learning/
16 机器学习实践 https://morvanzhou.github.io/tutorials/machine-learning/ML-practice/
17 进化算法 Evolutionary Strategies 教程系列 https://morvanzhou.github.io/tutorials/machine-learning/evolutionary-algorithm/
18 Keras 教程系列 https://morvanzhou.github.io/tutorials/machine-learning/keras/
19 高级爬虫: 高效无忧的 Scrapy 爬虫库 https://morvanzhou.github.io/tutorials/data-manipulation/scraping/5-02-scrapy/
20 数据处理教程系列 https://morvanzhou.github.io/tutorials/data-manipulation/
21 Threading 多线程教程系列 https://morvanzhou.github.io/tutorials/python-basic/threading/
22 说吧~ https://morvanzhou.github.io/discuss/
23 Tensorflow 教程系列 https://morvanzhou.github.io/tutorials/machine-learning/tensorflow/
24 Python基础 教程系列 https://morvanzhou.github.io/tutorials/python-basic/
25 Why? https://morvanzhou.github.io/tutorials/data-manipulation/scraping/1-00-why/
26 Theano 教程系列 https://morvanzhou.github.io/tutorials/machine-learning/theano/
27 机器学习系列 https://morvanzhou.github.io/tutorials/machine-learning/
28 Git 版本管理 教程系列 https://morvanzhou.github.io/tutorials/others/git/
29 其他教学系列 https://morvanzhou.github.io/tutorials/others/
30 Matplotlib 画图教程系列 https://morvanzhou.github.io/tutorials/data-manipulation/plt/
31 Pytorch 教程系列 https://morvanzhou.github.io/tutorials/machine-learning/torch/
32 Tkinter GUI 教程系列 https://morvanzhou.github.io/tutorials/python-basic/tkinter/
33 Sklearn 通用机器学习 教程系列 https://morvanzhou.github.io/tutorials/machine-learning/sklearn/
34 Numpy & Pandas 教程系列 https://morvanzhou.github.io/tutorials/data-manipulation/np-pd/
total time:7.6 s

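Comparing the two runs, both list 34 pages, but the multiprocess crawler finishes in 7.6 s instead of 24.5 s. The pages also appear in a different order. The order follows the iteration order of the unseen set, which is not fixed for strings from one Python run to the next, so the order can change every time either script is run.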
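The submit-then-collect pattern used by the multiprocess crawler can be looked at in isolation. The sketch below uses `multiprocessing.pool.ThreadPool` purely as a stand-in, because it offers the same `apply_async` interface as `mp.Pool` and runs anywhere without an `if __name__ == '__main__':` guard. The `square` function is a made-up placeholder for `crawl` or `parse`.

```python
from multiprocessing.pool import ThreadPool


def square(x):
    # Placeholder for real work such as downloading or parsing a page.
    return x * x


with ThreadPool(4) as pool:
    # Submit every job first; apply_async returns immediately.
    jobs = [pool.apply_async(square, args=(n,)) for n in [1, 2, 3]]
    # Then collect the results; get() blocks until each one is ready.
    results = [job.get() for job in jobs]

print(results)
```

Because `get()` is called on the jobs in the order they were submitted, the results come back in submission order even if the workers finish in a different order.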

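Speeding up a crawler with asyncio:

asyncio is part of the Python 3 standard library. It performs asynchronous computation inside a single thread: downloading a page and processing a page do not have to happen back to back, so the time spent waiting for a download can be used for other work.

asyncio is neither multiprocessing nor multithreading. It runs in one thread and switches between Python functions. Switch points are marked with await, and functions that can run asynchronously are marked with async.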
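To make the switching visible, here is a minimal sketch in which two coroutines record what they do in a shared list. The names `worker` and `log` are made up for the illustration, and `asyncio.sleep(0)` is used only as a switch point that yields control without actually waiting.

```python
import asyncio

log = []


async def worker(name, steps):
    for i in range(steps):
        log.append(f"{name}{i}")
        # Switch point: hand control back to the event loop.
        await asyncio.sleep(0)


async def main():
    # Run both workers concurrently in the same thread.
    await asyncio.gather(worker("a", 2), worker("b", 2))


asyncio.run(main())
print(log)
```

The entries of the two workers alternate in the log, which shows that a single thread moves from one coroutine to the other at every `await`.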

First, an example that does not use async. Each job simply sleeps, standing in for a download:

import time


def job(t):
	print('start job', t)
	time.sleep(t)
	print('job', t, 'takes', t, 's')


# job(t) sleeps for t seconds


def main():
	for t in range(1, 5):
		job(t)


t1 = time.time()
main()
print("no async total time: ", time.time() - t1)

Output:

start job 1
job 1 takes 1 s
start job 2
job 2 takes 2 s
start job 3
job 3 takes 3 s
start job 4
job 4 takes 4 s
no async total time:  10.000571966171265

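Now the same jobs with async:

import time
import asyncio


async def job(t):
	# Functions that can run asynchronously are marked with async.
	print('start job', t)
	await asyncio.sleep(t)
	# The switch point must be marked with await. asyncio.sleep(t) returns an
	# awaitable; without await it never runs and the job does not pause.
	# time.sleep(t) cannot be used here: it returns None, which is not
	# awaitable, and it would block the whole thread.
	print('job', t, 'takes', t, 's')


# job(t) sleeps for t seconds


async def main():
	tasks = [asyncio.create_task(job(t)) for t in range(1, 5)]
	# create_task schedules each job on the running event loop
	await asyncio.wait(tasks)
	# Wait here until all tasks have finished


t1 = time.time()
asyncio.run(main())
# asyncio.run (Python 3.7+) creates the event loop, runs main() and closes the loop
print("async total time: ", time.time() - t1)

Output:

start job 1
start job 2
start job 3
start job 4
job 1 takes 1 s
job 2 takes 2 s
job 3 takes 3 s
job 4 takes 4 s

The final line reports a total of roughly 4 seconds, the length of the longest sleep, instead of the 10 seconds of the sequential version, because all four jobs wait at the same time.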

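Asynchronous requests with aiohttp:

First install the third-party library aiohttp: python3 -m pip install aiohttp

asyncio is also very useful for crawlers: while one page is downloading, execution can switch to other code. asyncio cannot do this on its own, however. We also need the aiohttp module, which takes the place of requests as an asynchronous HTTP client.

First, an example that fetches pages with the requests module: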

import requests
import time

URL = 'https://morvanzhou.github.io/'


def normal():
	for i in range(4):
		r = requests.get(URL)
		url = r.url
		print(url)


t1 = time.time()
normal()
print("normal total time:", time.time() - t1)

Output:

https://morvanzhou.github.io/
https://morvanzhou.github.io/
https://morvanzhou.github.io/
https://morvanzhou.github.io/
normal total time: 2.9301674365997314

Now we fetch the same page with the aiohttp module:

import asyncio
import aiohttp
import time

URL = 'https://morvanzhou.github.io/'


async def job(session):
	# async with awaits the request and releases the connection when the block exits
	async with session.get(URL) as response:
		return str(response.url)


async def main():
	async with aiohttp.ClientSession() as session:
		tasks = [asyncio.create_task(job(session)) for _ in range(4)]
		finished, unfinished = await asyncio.wait(tasks)
		all_results = [r.result() for r in finished]
		print(all_results)


t1 = time.time()
asyncio.run(main())
print("async total time:", time.time() - t1)

Output:

['https://morvanzhou.github.io/', 'https://morvanzhou.github.io/', 'https://morvanzhou.github.io/', 'https://morvanzhou.github.io/']
async total time: 1.2340705394744873

The total time is clearly shorter: about 1.2 s with aiohttp compared with about 2.9 s with requests.
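The reason for the speed-up can be reproduced without any network access. In the sketch below, `fake_fetch` is a hypothetical stand-in for an awaited HTTP request, and `DELAY` is a made-up latency. It compares awaiting four "requests" one after another with awaiting them together.

```python
import asyncio
import time

DELAY = 0.05  # assumed per-request latency in seconds (made up)


async def fake_fetch(i):
    # Stand-in for an awaited network request.
    await asyncio.sleep(DELAY)
    return i


async def sequential(n):
    # Each request is awaited before the next one starts.
    return [await fake_fetch(i) for i in range(n)]


async def concurrent(n):
    # All requests are started first and then awaited together.
    return await asyncio.gather(*(fake_fetch(i) for i in range(n)))


def timed(coro):
    t0 = time.perf_counter()
    result = asyncio.run(coro)
    return result, time.perf_counter() - t0


seq_result, seq_time = timed(sequential(4))
con_result, con_time = timed(concurrent(4))
print(seq_result, con_result)
print(seq_time > con_time)
```

Run one after another, the waits add up. Run together, the total is close to the longest single wait, which is the effect seen in the aiohttp timing.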