A hands-on Scrapy example: scraping Tencent job postings into MongoDB
1. Create the project
scrapy startproject tencent
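For reference, startproject generates a skeleton roughly like the following (the exact layout may vary slightly across Scrapy versions):

tencent/
    scrapy.cfg            # deploy configuration
    tencent/              # the project's Python package
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py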
2. Create the spider file
scrapy genspider tencent_spider aaaa
Here aaaa is just a placeholder for the allowed domain; the allowed_domains entry it generates is commented out in the spider below, since the spider requests the careers.tencent.com API directly.
3. Create run.py (used to run the spider from the IDE)
# @Time : 2019/10/27 13:27
# @Author : GKL
# FileName : run.py
# Software : PyCharm
from scrapy import cmdline

# Equivalent to running "scrapy crawl tencent_spider" in a terminal
cmdline.execute('scrapy crawl tencent_spider'.split())
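If you prefer not to go through cmdline, Scrapy's CrawlerProcess achieves the same thing; a minimal sketch, assuming run.py sits next to scrapy.cfg so the project settings can be found:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

# Build a crawler process from the project's settings.py
process = CrawlerProcess(get_project_settings())
# Look the spider up by its name attribute
process.crawl('tencent_spider')
process.start()  # blocks until the crawl finishes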
4. Write items.py (the fields to scrape)
# -*- coding: utf-8 -*-
# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html
import scrapy


class TencentItem(scrapy.Item):
    # Position title
    positionName = scrapy.Field()
    # Publish date
    publishDate = scrapy.Field()
    # Work location
    workPosition = scrapy.Field()
    # Link to the detail page
    detailLink = scrapy.Field()
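An Item behaves like a dict with a fixed set of allowed keys, which is why the pipeline below can call dict(item); a quick illustration (the field values are made up, and the import assumes the project package is on the path):

from tencent.items import TencentItem

item = TencentItem(positionName='Backend Engineer')
item['workPosition'] = 'Shenzhen'
print(dict(item))  # {'positionName': 'Backend Engineer', 'workPosition': 'Shenzhen'}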
5. Write tencent_spider.py (the spider)
# -*- coding: utf-8 -*-
import json

import scrapy

from ..items import TencentItem


class TencentSpiderSpider(scrapy.Spider):
    name = 'tencent_spider'
    # allowed_domains = ['aaaa']
    start_urls = ['https://careers.tencent.com/tencentcareer/api/post/Query?pageIndex=1&pageSize=10']
    offset = 1

    def parse(self, response):
        # Pull the list of postings out of the JSON response
        data_list = json.loads(response.text)['Data']['Posts']
        # Stop paginating once the API returns no more postings
        # ("if not" covers both null and an empty list)
        if not data_list:
            return
        for data in data_list:
            items = TencentItem()
            items['positionName'] = data['RecruitPostName']
            items['publishDate'] = data['LastUpdateTime']
            items['workPosition'] = data['LocationName']
            items['detailLink'] = data['PostURL']
            yield items
        # Request the next page
        self.offset += 1
        next_url = 'https://careers.tencent.com/tencentcareer/api/post/Query?pageIndex={}&pageSize=10'.format(self.offset)
        yield scrapy.Request(next_url, callback=self.parse)
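For orientation, the JSON the API returns has roughly the shape below. This is illustrative only: the values are invented, and the only parts taken from the source are the field names the spider actually reads.

# Illustrative (abridged) response shape; all values are made up
response_example = {
    'Data': {
        'Posts': [
            {
                'RecruitPostName': 'Backend Engineer',
                'LastUpdateTime': 'October 27, 2019',
                'LocationName': 'Shenzhen',
                'PostURL': 'https://careers.tencent.com/...',
            },
        ],
    },
}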
6. Write pipelines.py (save the data to MongoDB)
# -*- coding: utf-8 -*-
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
import pymongo


class MongoPipeline(object):
    collectionName = 'tencent'

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        # Pull the connection settings out of settings.py
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DB')
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        db = self.client[self.mongo_db]
        self.collection = db[self.collectionName]

    def process_item(self, item, spider):
        # insert_one replaces the deprecated insert(), which was removed in pymongo 4
        self.collection.insert_one(dict(item))
        return item

    def close_spider(self, spider):
        self.client.close()
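After a run you can sanity-check the stored documents with a few lines of pymongo; the connection details here assume the settings in the next step:

import pymongo

client = pymongo.MongoClient('127.0.0.1')
collection = client['tencent']['tencent']  # database 'tencent', collection 'tencent'
print(collection.count_documents({}))      # how many postings were stored
print(collection.find_one())               # a sample document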
7. Configure settings.py
# Enable the pipeline (the number sets its order; lower runs earlier)
ITEM_PIPELINES = {
    'tencent.pipelines.MongoPipeline': 300,
}
# MongoDB connection settings read by MongoPipeline.from_crawler
MONGO_URI = '127.0.0.1'
MONGO_DB = 'tencent'
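Depending on the defaults in your generated settings.py, you may also want the following two standard Scrapy settings; whether you need them is project-specific:

# Don't let robots.txt rules block the API requests
ROBOTSTXT_OBEY = False
# Be polite: pause between requests (in seconds)
DOWNLOAD_DELAY = 1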