Learning Web Crawling: Getting Started with the Scrapy Framework


The pages we crawl are the question-and-answer pairs on Baidu Muzhi (http://muzhi.baidu.com), using the Scrapy crawler framework.

A doctor's page shows at most 760 question-answer pairs, so we only crawl those. At 10 answers per page that is 76 pages, which is why the page count is capped at 76 in the spider below.

First open a cmd window, cd into the directory where you want the project to live, and run scrapy startproject projectname to create the crawler project (the project here is named baidumuzhi).

I open the project folder in VS Code.

Create a knowledge.py file under the spiders folder; this is where the crawler logic goes.
import json

from scrapy.http import Request
from scrapy.spiders import CrawlSpider

from baidumuzhi.items import BaidumuzhiItem

# uids used for testing: one uid per doctor
uids = ['73879479','1246344246','1231532126','618625720','484658950','201748607','200140822','1690937','38227344','930048074','797647705','795334291','161087120','83187968','949887302','591339998','359728620','111266359','63320665','924213326','900849154','838701150','838701150','680796252']

class KnowledgeSpider(CrawlSpider):
    name = 'knowledge'
    start_urls = ['http://muzhi.baidu.com/doctor/list/answer?pn=0&rn=10&uid=3450738847']

    def parse(self, response):
        # the endpoint returns JSON, so parse it instead of using selectors
        site = json.loads(response.text)
        targets = site['data']['list']
        num_of_page = site['data']['total'] // 10 + 1
        if num_of_page > 76:  # at most 760 answers are shown, i.e. 76 pages of 10
            num_of_page = 76

        # yield one item per question-answer pair on this page
        for target in targets:
            item = BaidumuzhiItem()
            item['qid'] = target['qid']
            item['title'] = target['title']
            item['createTime'] = target['createTime']
            item['answer'] = target['answer']
            yield item

        # queue the paginated listing pages for every doctor; Scrapy's built-in
        # duplicate filter drops requests that have already been scheduled
        for uid in uids:
            urls = ['http://muzhi.baidu.com/doctor/list/answer?pn={0}&rn=10&uid={1}'.format(i * 10, uid) for i in range(num_of_page)]
            for url in urls:
                yield Request(url, callback=self.parse)
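Since start_urls only contains the listing page for a single uid, you may want every doctor in uids to be requested right from the start. Below is a minimal sketch of how that could be done with a start_requests override; the URL pattern and uids come from the code above, but the override itself is an addition, not part of the original spider.

    # Possible addition inside KnowledgeSpider (not in the original code):
    # request page 0 for every doctor instead of relying on one start URL.
    def start_requests(self):
        for uid in uids:
            url = 'http://muzhi.baidu.com/doctor/list/answer?pn=0&rn=10&uid={0}'.format(uid)
            yield Request(url, callback=self.parse)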
Note that yield is not the same as return: parse is a generator, so Scrapy can keep pulling items and follow-up requests from it lazily instead of getting everything back in one list.
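As a quick illustration (a generic Python example, not from the project): a function containing yield returns a generator that produces values one at a time and remembers where it left off, whereas return hands back a single value and ends the call.

def count_up(n):
    for i in range(n):
        yield i  # execution pauses here and resumes on the next request

print(list(count_up(3)))  # [0, 1, 2]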

The items.py file defines the data fields to extract; here we extract the question id, the question title, the creation time, and the answer. The code is as follows:
import scrapy

class BaidumuzhiItem(scrapy.Item):
    qid = scrapy.Field()         # question id
    title = scrapy.Field()       # question title
    createTime = scrapy.Field()  # time the question was created
    answer = scrapy.Field()      # the doctor's answer
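A scrapy.Item behaves much like a dict, which is why the spider assigns to item['qid'] and the pipeline below can call dict(item). A quick illustrative example (the values are made up):

item = BaidumuzhiItem()
item['qid'] = '123456'
item['title'] = 'example question'
print(dict(item))  # {'qid': '123456', 'title': 'example question'}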
pipelines.py defines how the scraped items are written to the database; here we use MongoDB:
import pymongo

class BaidumuzhiPipeline(object):
    def __init__(self):
        # connect to a local MongoDB instance and use the qandaLast
        # collection in the mydata database
        client = pymongo.MongoClient('localhost', 27017)
        mydata = client['mydata']
        self.post = mydata['qandaLast']

    def process_item(self, item, spider):
        # convert the item to a plain dict and store it as one document
        infor = dict(item)
        self.post.insert_one(infor)
        return item
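Once a crawl has run, you can check what landed in MongoDB with a few lines of pymongo. This is just a sanity check, assuming the same localhost connection, database, and collection names as above:

import pymongo

client = pymongo.MongoClient('localhost', 27017)
qanda = client['mydata']['qandaLast']
print(qanda.count_documents({}))  # number of Q&A pairs stored
print(qanda.find_one())           # inspect one stored document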
In settings.py you need to set the request headers; the User-Agent can simply be copied from your browser.
The site did not block the crawler, so DOWNLOAD_DELAY is left unset.
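A minimal sketch of the relevant settings.py entries. The User-Agent string below is only a placeholder to replace with the one copied from your browser; the pipeline path follows the project layout above, and it must be listed in ITEM_PIPELINES or the MongoDB pipeline never runs:

# settings.py (only the relevant parts)
USER_AGENT = 'Mozilla/5.0 ...'  # copy the full string from your browser

ITEM_PIPELINES = {
    'baidumuzhi.pipelines.BaidumuzhiPipeline': 300,
}

# DOWNLOAD_DELAY = 1  # left unset here since the site did not block the crawler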
Scrapy documentation (Chinese translation): http://scrapy-chs.readthedocs.io/zh_CN/latest/index.html

That's basically it. Once MongoDB is up and running, start the crawl with scrapy crawl knowledge from the project directory.

All done!

I'm writing this post so I don't forget what I did.


Reposted from blog.csdn.net/ishandsomedog/article/details/79435560