Hands-on: single-threaded crawling, single-thread + coroutine crawling, and multi-threaded crawling

I. Target page: https://lusongsong.com/default_2.html. Crawl the detail content behind the article links on this page (17 of them) and save it to a local file.

II. Crawl it three ways: single-threaded, multi-threaded, and single thread + coroutines

2.1 Single-threaded crawling

import requests
from lxml import etree
import time

def get_request(url):
    # Download a detail page and return its HTML text
    response = requests.get(url).text
    return response

def parse(html):
    # Extract the article title and body, then append them to a local file
    tree = etree.HTML(html)
    title = tree.xpath('//div[@class="post-title"]/h1/a/text()')[0]
    text = tree.xpath('//dd[@class="con"]/p/text()')
    text = "".join(text)
    with open('1.txt', 'a+', encoding='utf-8') as fp:
        fp.write(title + '\n' + text + '\n')

if __name__ == '__main__':
    start = time.time()
    # Fetch the listing page and collect the 17 article links
    index = requests.get('https://lusongsong.com/default_2.html').text
    tree = etree.HTML(index)
    urls = tree.xpath('//div[@class="post"]/h2/a/@href')
    for url in urls:
        c = get_request(url)  # each request blocks until the previous one finishes
        parse(c)
    print('Total time:', time.time() - start)

Total time: 13.49609375
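Most of those 13 seconds are spent waiting on the network. Even before adding threads, one small single-threaded tweak (not in the original post, just a hedged suggestion) is to reuse a keep-alive connection via requests.Session:

import requests

# Hypothetical variant: a shared Session reuses the TCP connection,
# trimming handshake overhead while staying single-threaded.
session = requests.Session()

def get_request(url):
    return session.get(url).text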
2.2 Multi-threaded crawling: first use the requests module to collect the article links into a list, then download the pages with a thread pool

import requests
from lxml import etree
import time
from multiprocessing.dummy import Pool  # a thread pool, despite the module name

def get_request(url):
    response = requests.get(url).text
    return response

def parse(html):
    tree = etree.HTML(html)
    title = tree.xpath('//div[@class="post-title"]/h1/a/text()')[0]
    text = tree.xpath('//dd[@class="con"]/p/text()')
    text = "".join(text)
    with open('1.txt', 'a+', encoding='utf-8') as fp:
        fp.write(title + '\n' + text + '\n')

if __name__ == '__main__':
    start = time.time()
    p = Pool(3)  # three worker threads download pages concurrently
    index = requests.get('https://lusongsong.com/default_2.html').text
    tree = etree.HTML(index)
    urls = tree.xpath('//div[@class="post"]/h2/a/@href')
    res_list = p.map(get_request, urls)  # blocks until every download finishes
    for res in res_list:
        parse(res)
    print('Total time:', time.time() - start)

Total time: 1.737304925918579
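multiprocessing.dummy.Pool works, but the standard library's concurrent.futures offers the same map-over-a-thread-pool pattern with a more modern API. A sketch, assuming get_request, parse, and urls are defined as above:

from concurrent.futures import ThreadPoolExecutor

# executor.map keeps results in input order, just like Pool.map
with ThreadPoolExecutor(max_workers=3) as executor:
    res_list = list(executor.map(get_request, urls))
for res in res_list:
    parse(res)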

2.3 Single thread + coroutine crawling
import time
import asyncio
import aiohttp
from lxml import etree
import requests

async def get_request(url):
    # aiohttp downloads the page without blocking the event loop
    async with aiohttp.ClientSession() as sess:
        async with sess.get(url=url) as response:
            page_text = await response.text()
            return page_text

def parse(task):
    # Done-callback: read the finished task's result and save it
    page_text = task.result()
    tree = etree.HTML(page_text)
    title = tree.xpath('//div[@class="post-title"]/h1/a/text()')[0]
    text = tree.xpath('//dd[@class="con"]/p/text()')
    text = "".join(text)
    with open('3.txt', 'a+', encoding='utf-8') as fp:
        fp.write(title + '\n' + text + '\n')

if __name__ == '__main__':
    start = time.time()
    # The listing page itself is fetched synchronously; only the 17
    # detail pages go through the event loop
    index = requests.get('https://lusongsong.com/default_2.html').text
    tree = etree.HTML(index)
    urls = tree.xpath('//div[@class="post"]/h2/a/@href')
    tasks = []
    for url in urls:
        c = get_request(url)  # a coroutine object; nothing runs yet
        task = asyncio.ensure_future(c)
        task.add_done_callback(parse)  # parse as soon as each download completes
        tasks.append(task)
    loop = asyncio.get_event_loop()
    loop.run_until_complete(asyncio.wait(tasks))
    print('Total time:', time.time() - start)

Total time: 0.5029296875
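One caveat: calling asyncio.ensure_future and asyncio.get_event_loop before a loop is running is deprecated on recent Python versions. On Python 3.7+ the idiomatic entry point is asyncio.run. A sketch assuming the same get_request and parse as above:

import asyncio

async def main(urls):
    # create_task schedules each download on the already-running loop
    tasks = [asyncio.create_task(get_request(url)) for url in urls]
    for task in tasks:
        task.add_done_callback(parse)
    await asyncio.gather(*tasks)

asyncio.run(main(urls))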

Conclusion: for web crawlers, multi-threading and coroutines both raise throughput substantially. The reason: a single thread sits idle during every IO wait (the network round-trip), wasting time; with multiple threads, while thread A is blocked on IO, execution switches to thread B, so the CPU stays busy and total run time drops. Coroutines achieve the same overlap inside one thread by yielding at each await point. The toy timing below illustrates the effect.
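A minimal self-contained demonstration of that argument, simulating the network wait with time.sleep (the timings are approximate):

import time
from concurrent.futures import ThreadPoolExecutor

def fake_io(_):
    time.sleep(1)  # stands in for one network round-trip

start = time.time()
for i in range(3):
    fake_io(i)
print('sequential:', time.time() - start)  # ~3 s: the waits add up

start = time.time()
with ThreadPoolExecutor(max_workers=3) as ex:
    list(ex.map(fake_io, range(3)))
print('threaded:', time.time() - start)  # ~1 s: the waits overlap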

Reprinted from www.cnblogs.com/kkdadao/p/13394915.html