Python Crawler [Hands-on]: Scraping a Job Site with the Scrapy Framework and Storing the Results in MongoDB

Create the project

scrapy startproject zhaoping

Create the spider

cd zhaoping
scrapy genspider hr zhaopingwang.com

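Directory structure

After these two commands the project has the standard Scrapy layout:

zhaoping/
    scrapy.cfg
    zhaoping/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            hr.py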

items.py

import scrapy

class TencentItem(scrapy.Item):
    title = scrapy.Field()
    position = scrapy.Field()
    publish_date = scrapy.Field()

pipelines.py

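from pymongo import MongoClient

# Connect once at import time; adjust host and port to your MongoDB server
mongoclient = MongoClient(host='192.168.226.150', port=27017)
# Database 'zhaoping', collection 'hr'
collection = mongoclient['zhaoping']['hr']

class TencentPipeline(object):
    def process_item(self, item, spider):
        print(item)
        # A Scrapy Item must be converted to a plain dict before inserting.
        # insert_one() replaces Collection.insert(), which was removed in PyMongo 4.
        collection.insert_one(dict(item))
        return item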
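The pipeline above opens its MongoDB connection when the module is imported. A common alternative is to manage the connection in the pipeline's open_spider and close_spider hooks and to read the connection details from the project settings. The following is a minimal sketch under stated assumptions, not the original author's code: the class name MongoPipeline and the setting names MONGO_URI and MONGO_DATABASE are hypothetical, and pymongo is imported inside open_spider only so that the class can be exercised without a database.

```python
class MongoPipeline:
    """Hypothetical variant of the pipeline with an explicit connection lifecycle."""

    def __init__(self, mongo_uri, mongo_db, collection_name="hr"):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db
        self.collection_name = collection_name
        self.client = None
        self.collection = None

    @classmethod
    def from_crawler(cls, crawler):
        # MONGO_URI and MONGO_DATABASE are assumed custom settings,
        # not Scrapy built-ins; the defaults mirror the values used above.
        return cls(
            mongo_uri=crawler.settings.get(
                "MONGO_URI", "mongodb://192.168.226.150:27017"
            ),
            mongo_db=crawler.settings.get("MONGO_DATABASE", "zhaoping"),
        )

    def open_spider(self, spider):
        # Imported here so the class can be loaded without pymongo installed.
        from pymongo import MongoClient

        self.client = MongoClient(self.mongo_uri)
        self.collection = self.client[self.mongo_db][self.collection_name]

    def close_spider(self, spider):
        # Release the connection when the crawl finishes.
        if self.client is not None:
            self.client.close()

    def process_item(self, item, spider):
        # Convert the Item to a plain dict before inserting it.
        self.collection.insert_one(dict(item))
        return item
```

Whichever pipeline class is used, Scrapy only calls it once it is registered in settings.py, for example ITEM_PIPELINES = {"zhaoping.pipelines.TencentPipeline": 300}.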

spiders/hr.py

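import scrapy
from zhaoping.items import TencentItem

class HrSpider(scrapy.Spider):
    name = 'hr'
    # allowed_domains and start_urls are generated by genspider and are not
    # shown in the original post; allowed_domains must cover the host crawled
    # below, otherwise the next-page requests are filtered as off-site.

    def parse(self, response):
        # Skip the first row and the last row of the table
        tr_list = response.xpath("//table[@class='tablelist']/tr")[1:-1]
        for tr in tr_list:
            item = TencentItem()
            # XPath positions are counted from 1, not 0
            item["title"] = tr.xpath("./td[1]/a/text()").extract_first()
            item["position"] = tr.xpath("./td[2]/text()").extract_first()
            item["publish_date"] = tr.xpath("./td[5]/text()").extract_first()
            yield item

        next_url = response.xpath("//a[@id='next']/@href").extract_first()
        # Build the next-page URL; on the last page the href is "javascript:;".
        # The extra truthiness check avoids a TypeError when the link is missing.
        if next_url and next_url != "javascript:;":
            print(next_url)
            next_url = "https://hr.tencent.com/" + next_url
            yield scrapy.Request(url=next_url, callback=self.parse)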
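The parse method above does three things: it drops the first and last table rows, reads cells by 1-based XPath position, and follows the "next" link unless its href is "javascript:;". The same logic can be tried offline with a rough sketch that swaps Scrapy selectors for the standard library's ElementTree and urljoin. The sample markup, its cell values, and the function name parse_page are hypothetical stand-ins for the real page, and ElementTree only accepts well-formed markup, so this illustrates the logic and is not a replacement for the spider.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

# Hypothetical, simplified stand-in for one listing page (well-formed XML).
SAMPLE = """
<html><body>
<table class="tablelist">
  <tr><td>Title</td><td>Col2</td><td>Col3</td><td>Col4</td><td>Date</td></tr>
  <tr><td><a href="a.html">Backend Engineer</a></td><td>Tech</td><td>2</td><td>Shenzhen</td><td>2019-01-10</td></tr>
  <tr><td><a href="b.html">Product Manager</a></td><td>Product</td><td>1</td><td>Beijing</td><td>2019-01-09</td></tr>
  <tr><td>pager row</td></tr>
</table>
<a id="next" href="position.php?start=10">next</a>
</body></html>
"""


def parse_page(markup, page_url):
    """Return (items, next_url) for one page; next_url is None on the last page."""
    root = ET.fromstring(markup.strip())

    # Drop the header row (first) and the pager row (last).
    rows = root.findall(".//table[@class='tablelist']/tr")[1:-1]

    items = []
    for tr in rows:
        # Positions in XPath predicates start at 1, so td[1] is the first cell.
        items.append({
            "title": tr.findtext("./td[1]/a"),
            "position": tr.findtext("./td[2]"),
            "publish_date": tr.findtext("./td[5]"),
        })

    # Resolve the next-page link against the current URL, unless the link is
    # missing or is the "javascript:;" placeholder used on the last page.
    link = root.find(".//a[@id='next']")
    href = link.get("href") if link is not None else None
    if href and href != "javascript:;":
        next_url = urljoin(page_url, href)
    else:
        next_url = None
    return items, next_url


if __name__ == "__main__":
    found, following = parse_page(SAMPLE, "https://hr.tencent.com/position.php")
    for entry in found:
        print(entry)
    print("next page:", following)
```

Resolving the link with urljoin against the current page URL avoids hard-coding the host; inside a spider, response.urljoin(next_url) plays the same role.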

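That is all it takes. With the pipeline enabled in settings.py, running scrapy crawl hr from the project directory scrapes the listings and writes them to MongoDB.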

Reposted from www.cnblogs.com/tangkaishou/p/10264628.html