03--Building a spider to crawl Tencent recruitment, with pagination

Create the project

scrapy startproject tencent
cd tencent  # enter the project directory
scrapy genspider hr tencent.com  # hr is the name of the spider file; tencent.com is the domain the spider is allowed to crawl
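
The generated project should look roughly like this (the standard scrapy startproject layout):

tencent/
    scrapy.cfg            # deploy configuration
    tencent/
        __init__.py
        items.py          # item definitions
        middlewares.py    # middlewares
        pipelines.py      # item pipelines (used below)
        settings.py       # project settings
        spiders/
            __init__.py
            hr.py         # created by scrapy genspider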


hr.py

  • Set the initial start_urls.
  • Open the page source and use XPath to locate the fields you need.
  • Take a tag's text with text() and an attribute with @; for example, the next-page link's address: response.xpath("//a[@id='next']/@href").extract_first()
  • Append extract_first() to each expression to get the first matching value as a string (you can test these expressions in scrapy shell first, as shown below).
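
Before committing anything to hr.py, the expressions can be tried interactively in scrapy shell (the ones below are exactly those used in the spider):

scrapy shell https://hr.tencent.com/position.php
>>> response.xpath("//table[@class='tablelist']/tr")[1:-1]
>>> response.xpath("//a[@id='next']/@href").extract_first()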
# -*- coding: utf-8 -*-
import scrapy


class HrSpider(scrapy.Spider):
    name = 'hr'
    allowed_domains = ['tencent.com']
    start_urls = ['https://hr.tencent.com/position.php']

    def parse(self, response):
        tr_list = response.xpath("//table[@class='tablelist']/tr")[1:-1]  # skip the header row and the pagination row
        for tr in tr_list:
            item = {}
            item["title"] = tr.xpath("./td[1]/a/text()").extract_first()
            item["position"] = tr.xpath("./td[2]/text()").extract_first()
            item["publish_data"] = tr.xpath("./td[5]/text()").extract_first()
            yield item
        # find the url of the next page first
        next_url = response.xpath("//a[@id='next']/@href").extract_first()
        if next_url and next_url != "javascript:;":  # "javascript:;" marks the last page
            next_url = "https://hr.tencent.com/" + next_url  # the href is relative, prepend the site root
            yield scrapy.Request(
                next_url,
                callback=self.parse,
                # meta={"item":item}
            )

    # def parse1(self,response):
    #    item = response.meta["item"]
  • yield item hands the item over to the pipeline (the pipeline must be enabled in settings.py, as shown below).
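
For the pipeline to actually receive these items it has to be registered in settings.py; a minimal snippet (300 is just a priority value):

ITEM_PIPELINES = {
    "tencent.pipelines.TencentPipeline": 300,
}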

pipelines.py

  • Initialize the MongoDB client.
  • Use the tencent database and its hr collection (both are created on first insert).
  • Call insert_one on the collection for each item.
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
from pymongo import MongoClient

client = MongoClient()  # connects to localhost:27017 by default
collection = client["tencent"]["hr"]  # the tencent db and hr collection are created on first insert

class TencentPipeline(object):
    def process_item(self, item, spider):
        # insert_one replaces the long-deprecated insert(); insert a copy so the
        # _id that MongoDB adds does not leak into the file written below
        collection.insert_one(dict(item))
        with open("./position.text", "a", encoding="utf-8") as f:
            f.write(str(item) + "\n")
        print(item)
        return item
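
Run the spider from the project root:

scrapy crawl hr

Once it has run, a quick way to confirm the inserts (a sketch, assuming MongoDB is on the default localhost:27017):

from pymongo import MongoClient

# print the first few documents stored by the pipeline
for doc in MongoClient()["tencent"]["hr"].find().limit(3):
    print(doc)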

Back to hr.py for a closer look

  • Use XPath to grab the href of the next-page a tag.
  • That href is incomplete, so it has to be joined onto the site root.
  • yield scrapy.Request(next_url, callback=self.parse) schedules the request for the next page; here the callback is parse itself.
  • The callback can also be a different parse function, such as the commented-out parse1 below. In that case, to pass the item between parse functions you need the meta parameter, done exactly as in the commented-out lines (see the sketch after the code below).
# -*- coding: utf-8 -*-
import scrapy


class HrSpider(scrapy.Spider):
    name = 'hr'
    allowed_domains = ['tencent.com']
    start_urls = ['https://hr.tencent.com/position.php']

    def parse(self, response):
        tr_list = response.xpath("//table[@class='tablelist']/tr")[1:-1]  # skip the header row and the pagination row
        for tr in tr_list:
            item = {}
            item["title"] = tr.xpath("./td[1]/a/text()").extract_first()
            item["position"] = tr.xpath("./td[2]/text()").extract_first()
            item["publish_data"] = tr.xpath("./td[5]/text()").extract_first()
            yield item
        # find the url of the next page first
        next_url = response.xpath("//a[@id='next']/@href").extract_first()
        if next_url and next_url != "javascript:;":  # "javascript:;" marks the last page
            next_url = "https://hr.tencent.com/" + next_url  # the href is relative, prepend the site root
            yield scrapy.Request(
                next_url,
                callback=self.parse,
                # meta={"item":item}
            )

    # def parse1(self,response):
    #    item = response.meta["item"]
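
For reference, here is what that meta hand-off looks like with the comments removed; these would be methods on the same HrSpider class, and parse_detail plus the detail URL are hypothetical, only to illustrate passing the item between callbacks:

    def parse(self, response):
        item = {"title": "..."}  # fields scraped on the list page
        yield scrapy.Request(
            response.urljoin("position_detail.php?id=1"),  # hypothetical detail page
            callback=self.parse_detail,
            meta={"item": item},  # attach the item to the request
        )

    def parse_detail(self, response):
        item = response.meta["item"]  # recover the item passed from parse
        # fill in more fields from the detail page here, then hand it to the pipeline
        yield item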

Reposted from blog.csdn.net/qq_34788903/article/details/89681360