Learning Web Scraping: Getting Started with the Scrapy Framework
The pages to crawl are the question-and-answer pairs on Baidu Muzhi (http://muzhi.baidu.com), using the Scrapy framework.
A doctor's page shows at most 760 Q&A pairs, so we only crawl those.
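Since the list API serves 10 answers per page (`rn=10`) and pages advance through the `pn` offset, 760 answers means at most 76 pages per doctor. A quick sketch of the URL list (the `uid` value here is a placeholder, not a real doctor id):

```python
# Sketch: build the 76 paginated answer-list URLs for one doctor.
# '123456' is a placeholder uid used only for illustration.
base = 'http://muzhi.baidu.com/doctor/list/answer?pn={0}&rn=10&uid={1}'
uid = '123456'
urls = [base.format(i * 10, uid) for i in range(76)]
print(len(urls))  # 76
print(urls[0])    # http://muzhi.baidu.com/doctor/list/answer?pn=0&rn=10&uid=123456
```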
First open a cmd window, `cd` into the target directory, and run `scrapy startproject projectname` to create the crawler project.
I open the project folder in VS Code.
Under the spiders directory, create a new file knowledge.py; this is where the crawling logic lives.
```python
import json

from scrapy.http import Request
from scrapy.spiders import CrawlSpider  # note: the old scrapy.spider module was renamed to scrapy.spiders

from baidumuzhi.items import BaidumuzhiItem

# uids of the doctors whose answer lists we want to crawl
uids = ['73879479', '1246344246', '1231532126', '618625720', '484658950',
        '201748607', '200140822', '1690937', '38227344', '930048074',
        '797647705', '795334291', '161087120', '83187968', '949887302',
        '591339998', '359728620', '111266359', '63320665', '924213326',
        '900849154', '838701150', '838701150', '680796252']


class KnowledgeSpider(CrawlSpider):
    name = 'knowledge'
    start_urls = ['http://muzhi.baidu.com/doctor/list/answer?pn=0&rn=10&uid=3450738847']

    def parse(self, response):
        site = json.loads(response.text)
        targets = site['data']['list']
        # 10 answers per page, at most 760 answers -> at most 76 pages
        num_of_page = site['data']['total'] // 10 + 1
        if num_of_page > 76:
            num_of_page = 76
        # Yield the items from this response once (not once per uid)
        for target in targets:
            item = BaidumuzhiItem()
            item['qid'] = target['qid']
            item['title'] = target['title']
            item['createTime'] = target['createTime']
            item['answer'] = target['answer']
            yield item
        # Follow the paginated answer lists of every doctor; Scrapy's
        # built-in duplicate filter drops URLs already requested.
        for uid in uids:
            urls = ['http://muzhi.baidu.com/doctor/list/answer?pn={0}&rn=10&uid={1}'.format(i * 10, uid)
                    for i in range(num_of_page)]
            for url in urls:
                yield Request(url, callback=self.parse)
```
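The list endpoint returns JSON; judging only from the fields `parse()` reads, the response is shaped roughly like the hand-made sample below (an assumption for illustration, not an actual API response):

```python
import json

# Hand-made sample mimicking only the fields parse() accesses;
# the real response surely contains more fields than shown here.
raw = json.dumps({
    'data': {
        'total': 760,
        'list': [
            {'qid': '1', 'title': 'sample question?',
             'createTime': '2017-01-01', 'answer': 'sample answer.'},
        ],
    }
})

site = json.loads(raw)
# Same page arithmetic as in the spider, capped at 76 pages
num_of_page = min(site['data']['total'] // 10 + 1, 76)
print(num_of_page)                      # 76
print(site['data']['list'][0]['qid'])   # 1
```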
Note that `yield` is not `return`: `parse` is a generator, so it can hand items and follow-up requests back to Scrapy one at a time without ending the function.
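A minimal plain-Python illustration of that difference (the `pages` helper is made up for this example): a generator can emit many values over time, where `return` would stop after the first.

```python
def pages(total, per_page=10):
    # Yield one page offset at a time, like the pn= parameter above;
    # a return here would end the function after the first offset.
    for i in range((total + per_page - 1) // per_page):
        yield i * per_page

offsets = list(pages(35))
print(offsets)  # [0, 10, 20, 30]
```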
items.py defines the data class for the fields to extract; here we extract the question id, the question title, the creation time, and the answer. The code is as follows:
```python
import scrapy


class BaidumuzhiItem(scrapy.Item):
    qid = scrapy.Field()
    title = scrapy.Field()
    createTime = scrapy.Field()
    answer = scrapy.Field()
```
pipelines.py handles writing the scraped data into the database; here I use MongoDB.
```python
import pymongo


class BaidumuzhiPipeline(object):
    def __init__(self):
        client = pymongo.MongoClient('localhost', 27017)
        mydata = client['mydata']
        self.post = mydata['qandaLast']

    def process_item(self, item, spider):
        # Convert the Item to a plain dict before writing it to MongoDB
        # (insert_one replaces the deprecated insert method)
        self.post.insert_one(dict(item))
        return item
```
In settings.py, the request headers need to be configured; the User-Agent can be copied straight from your browser.
The site did not ban the crawler, so DOWNLOAD_DELAY is left unset.
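A minimal settings.py fragment along those lines; the User-Agent string below is just an example copied from a desktop Chrome, and remember to register the pipeline so items actually reach MongoDB:

```python
# settings.py (fragment) -- example values, adjust to your own browser/project
USER_AGENT = ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
              'AppleWebKit/537.36 (KHTML, like Gecko) '
              'Chrome/58.0.3029.110 Safari/537.36')

# Enable the MongoDB pipeline defined in pipelines.py
ITEM_PIPELINES = {
    'baidumuzhi.pipelines.BaidumuzhiPipeline': 300,
}

# DOWNLOAD_DELAY is deliberately not set, since the site did not
# throttle or ban the crawler.
```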
The Scrapy documentation (Chinese translation): http://scrapy-chs.readthedocs.io/zh_CN/latest/index.html
That's basically it. Once MongoDB is set up, let's start crawling!
I'm writing this post so I don't forget what I did.