Crawling Tieba post content with coroutines (Redis as the task queue, MongoDB for storage)

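When using Redis as a task queue, there are two questions to think about:

Which data type should serve as the task queue?

How can duplicate crawling be prevented?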

First, a look at the data types Redis can store:

String

Hash

List

Set

Sorted set (zset)

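After going through what each of these types offers, I decided to use a list as the task queue and a set to solve the second question, preventing duplicate crawling.
The rough idea:

Use a list as the pending-task queue, holding the URLs that have not been crawled yet.

Use a set as the completed-task collection, holding the URLs that have already been crawled.

Each time the crawler takes a task from the pending list, it first checks the completed set. If the URL is already there, the task is discarded. If not, the crawler fetches it and then adds the URL to the completed set.

The convenience of this approach: when I push a task onto the pending list, I don't have to care whether it has been crawled before or is already sitting in the pending list. I just push it, and the check is left to the crawler at the moment it takes the task.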
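
The workflow above can be sketched in a few lines. This is only an in-memory illustration, not the crawler itself: a `deque` stands in for the Redis list and a Python `set` for the Redis set, and the names `pending`, `done`, `push_task`, `next_task`, and `crawl_all` are made up for this sketch. The comments note the Redis command each step corresponds to.

```python
from collections import deque

pending = deque()   # stands in for the Redis list (pending tasks)
done = set()        # stands in for the Redis set (completed tasks)

def push_task(url):
    """Producer side: push without any checks (LPUSH)."""
    pending.appendleft(url)

def next_task():
    """Consumer side: pop from the other end (RPOP) and skip URLs
    that are already in the completed set (SISMEMBER)."""
    while pending:
        url = pending.pop()
        if url not in done:
            return url
    return None

def crawl_all():
    """Drain the queue and return the URLs actually crawled, in order."""
    crawled = []
    while True:
        url = next_task()
        if url is None:
            break
        crawled.append(url)   # the real crawler would fetch and parse here
        done.add(url)         # mark as completed (SADD)
    return crawled

# Duplicates are pushed freely; the consumer filters them out.
for u in ['/p/1', '/p/2', '/p/1', '/p/3', '/p/2']:
    push_task(u)
result = crawl_all()
print(result)
```

Because the check happens on the consumer side, the producer stays trivial. The price is that duplicates still take up space in the pending list until they are popped and discarded.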

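Here is the actual code.
It is roughly a producer-consumer setup: master pushes tasks onto the queue, and parse takes the return value of get_html, parses it, and writes the result to the database.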

import asyncio
import time

import aiohttp
import pymongo
import redis
import requests
from lxml import etree

conn = pymongo.MongoClient('localhost', 27017)

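db = conn.nicedb  # select the nicedb database; it is created automatically if it does not exist
my_set = db.test_set  # use the test_set collection; it is created automatically if it does not exist
# Both steps above are lazy: the database and collection are only really created when the first document is inserted

# Remember to pass decode_responses=True; without it, values read back from Redis are bytes
r = redis.StrictRedis(host='127.0.0.1', port=6379, db=2, decode_responses=True)
# With an explicit pool, decode_responses has to be set on the pool itself:
# pool = redis.ConnectionPool(host='127.0.0.1', port=6379, db=2, decode_responses=True)
# r = redis.StrictRedis(connection_pool=pool)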

def master(page):
    url = 'https://tieba.baidu.com/f?kw=美女&ie=utf-8&pn={}'.format(page*50)
    base = 'https://tieba.baidu.com'
    res = requests.get(url).text
    html = etree.HTML(res)
    half_urls = html.xpath("//div[@class='threadlist_title pull_left j_th_tit ']/a/@href")
    full_urls = [base + i for i in half_urls]
    for full_url in full_urls:
        # Push each task (a thread URL) onto the head of the url_list list
        r.lpush('url_list', full_url)
    #print(r.llen('url_list'))

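# One semaphore shared by every coroutine caps concurrent requests at 5.
# Creating a new Semaphore inside each call would not limit anything.
sem = asyncio.Semaphore(5)

async def get_html(url):
    async with sem:
        async with aiohttp.ClientSession() as session:
            async with session.get(url) as resp:
                # Without errors='ignore' decoding raises an error; see the note after the code
                return await resp.text(encoding='utf-8', errors='ignore')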
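
async def parse():
    while True:
        # Take a task from the right end of the url_list list in Redis
        url = r.rpop('url_list')
        if url is None:
            break
        # Skip the task if it has already been done, i.e. the URL is in the history set
        if r.sismember('history', url):
            continue
        response = await get_html(url)
        html = etree.HTML(response)
        texts = html.xpath("//div[@class='left_section']/div[2]/div[1]//cc/div[1]/text()")
        # xpath returns an empty list when nothing matches, so guard before indexing
        content = texts[0].strip() if texts else ''
        if content != '':
            # Store non-empty content in MongoDB (insert_one replaces the removed save method)
            my_set.insert_one({'content': content})
            #print(content)
        # Add the crawled URL to the history set, i.e. the completed-task collection
        r.sadd('history', url)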

t1 = time.time()
# Crawl the first 10 pages
for i in range(10):
    master(i)

# Run 15 parse() workers concurrently.
# asyncio.wait no longer accepts bare coroutines (Python 3.11+), so gather them inside a main coroutine.
async def main():
    await asyncio.gather(*(parse() for _ in range(15)))

asyncio.run(main())

t2 = time.time()
print(t2-t1)
# Total time: 32.930299043655396 seconds
# After swapping MongoDB for MySQL, total time: 43.06192493438721 seconds
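
One detail worth illustrating is how a concurrency limit behaves with `asyncio.Semaphore`: the limit only applies when every coroutine acquires the same semaphore object. The sketch below is an illustration under stated assumptions rather than part of the crawler: `fake_fetch` and `run` are hypothetical names, and `asyncio.sleep` stands in for a network request. It records the peak number of simultaneous "requests" in both setups.

```python
import asyncio

async def fake_fetch(sem, state):
    # asyncio.sleep stands in for a real HTTP request
    async with sem:
        state['active'] += 1
        state['peak'] = max(state['peak'], state['active'])
        await asyncio.sleep(0.01)
        state['active'] -= 1

async def run(shared):
    state = {'active': 0, 'peak': 0}
    common = asyncio.Semaphore(5)
    tasks = []
    for _ in range(15):
        # shared=True: all tasks use one semaphore; otherwise each gets its own
        sem = common if shared else asyncio.Semaphore(5)
        tasks.append(fake_fetch(sem, state))
    await asyncio.gather(*tasks)
    return state['peak']

peak_shared = asyncio.run(run(shared=True))
peak_separate = asyncio.run(run(shared=False))
print(peak_shared, peak_separate)
```

With the shared semaphore the peak stays at 5. With one semaphore per call, all 15 tasks run at once, so the limit has no effect.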

This is the encoding error mentioned in the code comment:
[Image: screenshot of the error message]
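
The kind of decoding error that `errors='ignore'` works around can be reproduced with plain Python. This is a sketch under an assumption: the byte string `raw` is made up for illustration (valid UTF-8 text followed by a stray `0xff` byte, which is never valid in UTF-8), so the exact message from the original run may have differed.

```python
raw = 'tieba'.encode('utf-8') + b'\xff'   # made-up bytes: valid text plus one invalid byte

try:
    raw.decode('utf-8')                   # strict decoding is the default
    strict_error = None
except UnicodeDecodeError as exc:
    strict_error = exc

lenient = raw.decode('utf-8', errors='ignore')   # invalid bytes are dropped
print(type(strict_error).__name__, repr(lenient))
```

Note that `errors='ignore'` drops the offending bytes silently. Passing `errors='replace'` instead substitutes U+FFFD, which leaves a visible trace in the stored text.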
