Scraping Tencent job postings with Scrapy and saving them to MongoDB

Copyright notice: this is an original article by the author, released under the CC 4.0 BY-SA license. Please include the original source link and this notice when reposting.
Original link: https://blog.csdn.net/gklcsdn/article/details/102766972

A hands-on Scrapy case study

Target URL: https://careers.tencent.com/tencentcareer/api/post/Query?pageIndex=1&pageSize=10 (Tencent's recruitment JSON API, as used in the spider below)

1. Create the project
scrapy startproject tencent
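startproject generates the standard Scrapy layout; depending on your Scrapy version it looks roughly like this:

tencent/
├── scrapy.cfg
└── tencent/
    ├── __init__.py
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    └── spiders/
        └── __init__.py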
2. Generate the spider file (aaaa is just a throwaway placeholder domain; allowed_domains is commented out in the spider below)
scrapy genspider tencent_spider aaaa
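genspider produces a minimal spider skeleton, roughly the following (the exact template varies slightly by Scrapy version); steps 4 and 5 fill it in:

import scrapy


class TencentSpiderSpider(scrapy.Spider):
    name = 'tencent_spider'
    allowed_domains = ['aaaa']
    start_urls = ['http://aaaa/']

    def parse(self, response):
        pass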
3. Create run.py to launch the spider from inside the IDE
# @Time : 2019/10/27 13:27
# @Author : GKL
# FileName : run.py
# Software : PyCharm

from scrapy import cmdline

cmdline.execute('scrapy crawl tencent_spider'.split())
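Equivalently, you can run scrapy crawl tencent_spider from the project root. Another option, if you prefer not to shell out through cmdline, is Scrapy's in-process CrawlerProcess API; a minimal sketch:

# Alternative run.py using Scrapy's in-process API
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())
process.crawl('tencent_spider')  # spider name, as set in tencent_spider.py
process.start()                  # blocks until the crawl finishes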
4. Write items.py (the fields to scrape)
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class TencentItem(scrapy.Item):

    # Job title
    positionName = scrapy.Field()

    # Publish date
    publishDate = scrapy.Field()

    # Work location
    workPosition = scrapy.Field()

    # Detail page link
    detailLink = scrapy.Field()
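A scrapy.Item behaves like a dict, which is why the pipeline in step 6 can simply call dict(item) before inserting. A quick illustration (the sample value here is made up):

item = TencentItem()
item['positionName'] = 'Backend Engineer'
print(dict(item))  # {'positionName': 'Backend Engineer'}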
5. Write tencent_spider.py (the spider)
# -*- coding: utf-8 -*-

import json
import scrapy
from ..items import TencentItem


class TencentSpiderSpider(scrapy.Spider):
    name = 'tencent_spider'
    # allowed_domains = ['aaaa']
    start_urls = ['https://careers.tencent.com/tencentcareer/api/post/Query?pageIndex=1&pageSize=10']
    offset = 1

    def parse(self, response):
        # Extract the list of postings from the JSON response
        data_list = json.loads(response.text)['Data']['Posts']

        # Stop when the API returns no more postings
        # ("not data_list" also covers an empty list, not just null)
        if not data_list:
            return
        for data in data_list:
            items = TencentItem()
            items['positionName'] = data['RecruitPostName']
            items['publishDate'] = data["LastUpdateTime"]
            items["workPosition"] = data['LocationName']
            items['detailLink'] = data['PostURL']
            yield items

        # Pagination: build the next page URL and schedule another request
        self.offset += 1
        next_url = 'https://careers.tencent.com/tencentcareer/api/post/Query?pageIndex={}&pageSize=10'.format(self.offset)
        yield scrapy.Request(next_url, callback=self.parse)
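For reference, parse() assumes the API returns JSON shaped roughly like this. Only the keys are taken from the code above; the values are elided, not real data:

# Illustrative response shape only -- values elided
{
    "Data": {
        "Posts": [
            {
                "RecruitPostName": "...",  # job title
                "LastUpdateTime": "...",   # publish date
                "LocationName": "...",     # work location
                "PostURL": "..."           # detail page link
            }
        ]
    }
}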

6. Write pipelines.py (save the data)
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html

import pymongo


class MongoPipeline(object):

    collectionName = 'tencent'

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DB')
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        db = self.client[self.mongo_db]
        self.collection = db[self.collectionName]

    def process_item(self, item, spider):
        # insert_one replaces the deprecated Collection.insert (removed in PyMongo 4)
        self.collection.insert_one(dict(item))
        return item

    def close_spider(self, spider):
        self.client.close()
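Note that rerunning the crawler will create duplicate documents with insert_one. One possible variation, assuming detailLink uniquely identifies a posting (an assumption, not something the API guarantees), is to upsert instead:

    def process_item(self, item, spider):
        # Upsert keyed on detailLink to avoid duplicates across reruns
        # (assumes detailLink uniquely identifies a posting)
        self.collection.update_one(
            {'detailLink': item['detailLink']},
            {'$set': dict(item)},
            upsert=True
        )
        return item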

7. Configure settings.py
ITEM_PIPELINES = {
   'tencent.pipelines.MongoPipeline': 300,
}

MONGO_URI = '127.0.0.1'
MONGO_DB = 'tencent'
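MONGO_URI can also be a full connection string, and depending on the target site a couple of extra settings may help. The values below are suggestions to adapt, not requirements:

# Full-form connection string (equivalent to '127.0.0.1' with the default port)
MONGO_URI = 'mongodb://127.0.0.1:27017'

# Possibly needed depending on the site -- check its robots.txt before disabling
ROBOTSTXT_OBEY = False
DOWNLOAD_DELAY = 1  # be polite; wait 1 second between requests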
8. Run run.py to start the crawl; the scraped postings are written to the tencent collection in MongoDB

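To verify that the data landed, a minimal check with PyMongo (assuming the settings above):

import pymongo

client = pymongo.MongoClient('127.0.0.1')
collection = client['tencent']['tencent']      # MONGO_DB / collectionName
print(collection.count_documents({}))          # number of postings saved
print(collection.find_one())                   # sample document
client.close()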

9. Source code

GitHub
