Create the project
scrapy startproject tencent
cd tencent  # enter the project directory
scrapy genspider hr tencent.com  # hr is the spider's name, tencent.com is the allowed crawl domain
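The two commands above produce the standard Scrapy project layout, with the generated spider placed under spiders/:

```
tencent/
    scrapy.cfg          # deploy configuration
    tencent/
        __init__.py
        items.py        # item definitions
        middlewares.py
        pipelines.py    # item pipelines
        settings.py     # project settings
        spiders/
            __init__.py
            hr.py       # created by scrapy genspider
```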
hr.py
- Set the initial URL.
- Open the page source and locate the fields you need with XPath.
- Use text() to take a tag's text value; use @ to take an attribute value, e.g. the href of an a tag:
response.xpath("//a[@id='next']/@href").extract_first()
- Append extract_first() to each expression so it returns the first match as a string (or None if nothing matches).
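A minimal sketch of these XPath expressions evaluated directly with lxml (assumed available; Scrapy's selectors are built on it), against a made-up HTML snippet shaped like the job-listing page:

```python
# Hypothetical HTML snippet shaped like the tencent HR listing page,
# to show what text() and @href select.
from lxml import html

doc = html.fromstring("""
<table class="tablelist">
  <tr><th>heading row</th></tr>
  <tr><td><a href="position_detail.php?id=1">Backend Engineer</a></td></tr>
</table>
<a id="next" href="position.php?start=10">next page</a>
""")

# text() takes a tag's text value; @href takes an attribute value
titles = doc.xpath("//table[@class='tablelist']//a/text()")
next_href = doc.xpath("//a[@id='next']/@href")

print(titles[0])     # → Backend Engineer
print(next_href[0])  # → position.php?start=10
```

Scrapy's extract_first() does the same as taking element [0] here, except it returns None instead of raising when the list is empty.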
# -*- coding: utf-8 -*-
import scrapy


class HrSpider(scrapy.Spider):
    name = 'hr'
    allowed_domains = ['tencent.com']
    start_urls = ['https://hr.tencent.com/position.php']

    def parse(self, response):
        # Skip the header row and the trailing pagination row
        tr_list = response.xpath("//table[@class='tablelist']/tr")[1:-1]
        for tr in tr_list:
            item = {}
            item["title"] = tr.xpath("./td[1]/a/text()").extract_first()
            item["position"] = tr.xpath("./td[2]/text()").extract_first()
            item["publish_data"] = tr.xpath("./td[5]/text()").extract_first()
            yield item
        # First find the URL of the next page
        next_url = response.xpath("//a[@id='next']/@href").extract_first()
        if next_url != "javascript:;":  # the last page's "next" link is javascript:;
            next_url = "https://hr.tencent.com/" + next_url
            yield scrapy.Request(
                next_url,
                callback=self.parse,
                # meta={"item": item}
            )

    # def parse1(self, response):
    #     item = response.meta["item"]
    #     yield item
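The spider above completes the relative next-page href by string concatenation, which works because this site's hrefs are always relative to the root. The stdlib urljoin (which Scrapy's response.urljoin wraps) handles the general case and is harder to get wrong:

```python
# urljoin resolves a relative href against the page's own URL,
# so it works for both relative and already-absolute links.
from urllib.parse import urljoin

page_url = "https://hr.tencent.com/position.php"

# A relative href like the one in the "next" link
print(urljoin(page_url, "position.php?start=10"))
# → https://hr.tencent.com/position.php?start=10

# An already-absolute href passes through unchanged
print(urljoin(page_url, "https://hr.tencent.com/about.php"))
# → https://hr.tencent.com/about.php
```

Inside a spider you would write response.urljoin(next_url) and get the same result without hard-coding the base URL.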
Yield the item into the pipeline
pipelines.py
- Initialize MongoDB
- Create the tencent database and the hr collection
- Insert via the collection
# -*- coding: utf-8 -*-
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
from pymongo import MongoClient

client = MongoClient()
collection = client["tencent"]["hr"]


class TencentPipeline(object):
    def process_item(self, item, spider):
        # insert() was removed in pymongo 4; insert_one() replaces it
        collection.insert_one(dict(item))
        with open("./position.text", "a") as f:
            f.write(str(item))
        print(item)
        return item
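As the comment at the top of pipelines.py says, the pipeline only runs once it is registered in settings.py. For this project the entry would look like the following (300 is an arbitrary priority between 0 and 1000; lower numbers run earlier when several pipelines are enabled):

```python
# settings.py (fragment)
ITEM_PIPELINES = {
    "tencent.pipelines.TencentPipeline": 300,
}
```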
# Back to hr.py
- Get the href value of the next-page a tag with XPath.
- That href is incomplete, so it must be joined with the base URL before requesting it.
The yield scrapy.Request(next_url, callback=self.parse) call schedules the crawl of the next page; here the callback is the parse method itself. The callback could also be a different parse function, such as the commented-out parse1. In that case, to pass item between the two parse functions you need the meta parameter, done exactly as in the commented-out lines.
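A Scrapy-free sketch of that hand-off, using hypothetical minimal stand-ins for Request and Response: parse yields a request carrying the item in meta, and the second callback retrieves and extends it, just as parse1 would:

```python
# Hypothetical minimal stand-ins for scrapy's Request/Response,
# only to show how meta carries an item between callbacks.
class Request:
    def __init__(self, url, callback, meta=None):
        self.url = url
        self.callback = callback
        self.meta = meta or {}

class Response:
    def __init__(self, request):
        self.url = request.url
        self.meta = request.meta  # scrapy exposes request.meta on the response

def parse(response):
    item = {"title": "engineer"}
    # Hand the partially built item to the next callback via meta
    yield Request("https://hr.tencent.com/position_detail.php",
                  callback=parse1, meta={"item": item})

def parse1(response):
    item = response.meta["item"]       # retrieve what parse() stored
    item["detail_url"] = response.url  # enrich it with data from this page
    yield item

# Drive the two callbacks by hand, the way the scrapy engine would
request = next(parse(Response(Request("https://hr.tencent.com/position.php", parse))))
item = next(request.callback(Response(request)))
print(item)
```

In a real spider the engine performs the driving step: it downloads request.url and calls request.callback with the real response, whose meta is copied from the request.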