Advanced Scrapy crawler case: crawling 51job recruitment information

Last time we walked through an introductory Scrapy case, so you should already have a basic feel for how it works; for details, see Scrapy crawler operation for beginners: a super detailed case to get you started. Now let's work through another case to consolidate those Scrapy skills.

1. The target website

Here I chose data analysis positions in Hangzhou. The URL is: https://search.51job.com/list/080200,000000,0000,32,9,99,%25E6%2595%25B0%25E6%258D%25AE%25E5%2588%2586%25E6%259E%2590,2,1.html?lang=c&postchannel=0000&workyear=99&cotype=99&degreefrom=99&jobterm=99&companysize=99&ord_field=0&dibiaoid=0&line=&welfare=

2. Detailed steps of crawling

The basic Scrapy operations, such as creating a Scrapy project, will not be repeated here. If you have forgotten them, read my previous article: Scrapy crawler operation for beginners: a super detailed case to get you started.
Goal: crawl the job title, company name, company type, salary, job information (city, experience, number of openings, release date), position description, work address, and job detail link, and save these fields to MySQL.

1. Analyzing the information to crawl

Since each job's full information lives on its own details page, we need to follow the link for each posting and crawl the details page. On the list page, each posting corresponds to one div.
Clicking into a div reveals the link to the job details, so the plan is to use XPath to extract each posting's detail link and then follow it to get the required information.
The black box in the screenshot is the Chrome extension XPath Helper, which is very handy for testing XPath expressions; you can install it if you like.
A small shortcut: right-click the element you want and choose Copy XPath to get that element's XPath path, then generalize it so it matches all of the links.
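Before writing the spider, you can sanity-check the XPath in the Scrapy shell. The snippet below is only a sketch: it combines the two XPath steps used in the spider later on, and 51job's markup may change over time.

# From the command line, open the Scrapy shell on the list page (one line):
#   scrapy shell "https://search.51job.com/list/080200,000000,0000,32,9,99,%25E6%2595%25B0%25E6%258D%25AE%25E5%2588%2586%25E6%259E%2590,2,1.html"
# Then, inside the shell, try the combined path used by the spider below:
links = response.xpath("//div[2]/div[4]/div/div/a/@href").getall()
print(len(links), links[:3])  # expect roughly one detail URL per posting if the XPath matches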

Clicking a link takes us to the job details page, where we mark out the fields that need to be crawled.

2. Specific crawling code

Here is what the files in a Scrapy project are for:

scrapy.cfg: the project configuration file
spiders/: the folder where the spider files live; mine is job_detail.py
__init__.py: usually an empty file, but it must exist; without it the directory is just a directory, not a Python package
items.py: the project's item definitions, i.e. the structured fields that hold the crawled data
middlewares.py: the project middleware
pipelines.py: the project pipeline file
settings.py: the project settings file
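For reference, a typical layout after running scrapy startproject looks roughly like this (the project name scrapy_qcwy matches the pipeline paths in settings.py below; substitute whatever name you used):

scrapy_qcwy/
    scrapy.cfg
    scrapy_qcwy/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            job_detail.py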

(1) Write items.py

Fields to be crawled:

import scrapy

class ScrapyjobItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    # job title
    positionName = scrapy.Field()
    # company name
    companyName = scrapy.Field()
    # company type
    companyType = scrapy.Field()
    # salary
    salary = scrapy.Field()
    # job info (city, experience, number of openings, release date)
    jobMsg = scrapy.Field()
    # position description
    positionMsg = scrapy.Field()
    # work address
    address = scrapy.Field()
    # link to the job details page
    link = scrapy.Field()
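As a side note, a scrapy.Item behaves like a dictionary restricted to the fields declared above; that is what lets the MySQL pipeline later convert it with dict(item). A tiny illustration with made-up values (assuming the project package is named scrapy_job, as in the spider's import below):

from scrapy_job.items import ScrapyjobItem

item = ScrapyjobItem()
item['positionName'] = 'Data Analyst'   # made-up example value
item['salary'] = '10-15k/month'         # made-up example value
print(dict(item))                       # {'positionName': 'Data Analyst', 'salary': '10-15k/month'}
# assigning an undeclared field, e.g. item['foo'] = 1, would raise a KeyError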

(2) Write the spider file under the spiders folder

Note: there is a pitfall here. I originally set allowed_domains to www.search.51job.com and found that no data was ever crawled. After some searching I realized that the requests to the job detail links were being filtered out because they are not under that subdomain, so the setting is changed to the top-level domain 51job.com.

import scrapy
from scrapy_job.items import ScrapyjobItem

class JobSpiderDetail(scrapy.Spider):
    # spider name, required when launching the crawl
    name = 'job_detail'
    # the detail pages were filtered out on the second hop, so use the top-level domain
    allowed_domains = ['51job.com']
    # start URL for the crawl
    start_urls = [
        'https://search.51job.com/list/080200,000000,0000,32,9,99,%25E6%2595%25B0%25E6%258D%25AE%25E5%2588%2586%25E6%259E%2590,2,1.html?lang=c&postchannel=0000&workyear=99&cotype=99&degreefrom=99&jobterm=99&companysize=99&ord_field=0&dibiaoid=0&line=&welfare=']

    # find the links to the detailed job pages and follow them
    def parse(self, response):
        # locate each posting's detail-page URL and hand it to the parse_detail callback
        node_list = response.xpath("//div[2]/div[4]")
        for node in node_list:
            # extract the detail-page link
            link = node.xpath("./div/div/a/@href").get()
            print(link)
            if link:
                yield scrapy.Request(link, callback=self.parse_detail)

        # pagination: get the URL of the next list page
        next_page = response.xpath("//li[@class='bk'][last()]/a/@href").get()
        if next_page:
            # hand the next page back to the scheduler; dont_filter=True keeps it from being deduplicated
            yield scrapy.Request(next_page, callback=self.parse, dont_filter=True)

    # extract the fields from a job detail page
    def parse_detail(self, response):
        item = ScrapyjobItem()
        # job information on the detail page
        item['positionName'] = response.xpath("//div[@class='cn']/h1/@title").get()
        item['companyName'] = response.xpath("//div[@class='com_msg']//p/text()").get()
        item['companyType'] = response.xpath("//div[@class='com_tag']//p/@title").extract()
        item['salary'] = response.xpath("//div[@class='cn']/strong/text()").get()
        item['jobMsg'] = response.xpath("//p[contains(@class, 'msg')]/@title").extract()
        item['positionMsg'] = response.xpath("//div[contains(@class, 'job_msg')]//text()").extract()
        item['address'] = response.xpath("//p[@class='fp'][last()]/text()").get()
        item['link'] = response.url
        # print(item['positionMsg'])
        yield item
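With the spider written, it can be launched from the project root (the directory containing scrapy.cfg); the name passed to scrapy crawl must match the spider's name attribute above. You can also export the items to a file first, as a quick check before wiring up MySQL.

scrapy crawl job_detail
scrapy crawl job_detail -o jobs.json   # quick check: dump items to a JSON file instead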

(3) Write pipelines.py

# A pipeline in pipelines.py that saves the data to MySQL
import pymysql


class MysqlPipeline(object):
    # the crawler argument of from_crawler represents the running project;
    # crawler.settings.get reads the configuration from settings.py
    @classmethod
    def from_crawler(cls, crawler):
        cls.host = crawler.settings.get('MYSQL_HOST')
        cls.user = crawler.settings.get('MYSQL_USER')
        cls.password = crawler.settings.get('MYSQL_PASSWORD')
        cls.database = crawler.settings.get('MYSQL_DATABASE')
        cls.table_name = crawler.settings.get('MYSQL_TABLE_NAME')
        return cls()

    # open_spider is called when the spider starts (e.g. open the database)
    def open_spider(self, spider):
        # connect to the database
        self.db = pymysql.connect(host=self.host, user=self.user, password=self.password,
                                  database=self.database, charset='utf8')
        self.cursor = self.db.cursor()

    # process_item receives each item during the crawl and handles it
    def process_item(self, item, spider):
        # insert the crawled data into the table; convert the item to a dict first
        data = dict(item)
        table_name = self.table_name
        keys = ','.join(data.keys())
        values = ','.join(['%s'] * len(data))
        sql = 'insert into %s (%s) values (%s)' % (table_name, keys, values)
        self.cursor.execute(sql, tuple(data.values()))
        self.db.commit()
        return item

    # close_spider is called when the spider finishes (e.g. close the database)
    def close_spider(self, spider):
        self.db.close()


# StripPipeline: a pipeline that cleans up whitespace and blank lines
class StripPipeline(object):
    def process_item(self, item, spider):
        item['positionName'] = ''.join(item['positionName']).strip()
        item['companyName'] = ''.join(item['companyName']).strip()
        # join the company tags (type, size, industry) with '|'
        item['companyType'] = '|'.join([i.strip() for i in item['companyType']]).strip()
        item['salary'] = ''.join(item['salary']).strip()
        item['jobMsg'] = ''.join([i.strip() for i in item['jobMsg']]).strip()
        item['positionMsg'] = ''.join([i.strip() for i in item['positionMsg']]).strip()
        item['address'] = ''.join(item['address']).strip()
        return item
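One thing the article does not show is creating the target table: MysqlPipeline assumes the qcwy database and the job_detail table already exist. Below is a minimal one-off sketch using pymysql, with column names matching items.py and every column simply typed as text; treat the types and lengths as assumptions and adjust them as needed.

import pymysql

# One-off setup script; connection values match the MYSQL_* settings in settings.py below
db = pymysql.connect(host='localhost', user='root', password='root', charset='utf8')
cursor = db.cursor()
cursor.execute("CREATE DATABASE IF NOT EXISTS qcwy DEFAULT CHARACTER SET utf8")
cursor.execute("""
    CREATE TABLE IF NOT EXISTS qcwy.job_detail (
        positionName VARCHAR(255),
        companyName  VARCHAR(255),
        companyType  VARCHAR(255),
        salary       VARCHAR(100),
        jobMsg       TEXT,
        positionMsg  TEXT,
        address      VARCHAR(255),
        link         VARCHAR(500)
    ) DEFAULT CHARACTER SET utf8
""")
db.close()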

(4) Configure settings.py

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Register the two pipelines we just wrote; the lower the number, the higher the priority
# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    # 'scrapy_qcwy.pipelines.ScrapyQcwyPipeline': 300,
    'scrapy_qcwy.pipelines.MysqlPipeline': 200,
    'scrapy_qcwy.pipelines.StripPipeline': 199,
}

# MySQL configuration
MYSQL_HOST = 'localhost'
MYSQL_USER = 'root'
MYSQL_PASSWORD = 'root'
MYSQL_DATABASE = 'qcwy'
MYSQL_TABLE_NAME = 'job_detail'

Check the results in the database
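If you prefer checking from Python rather than a MySQL client, here is a quick query sketch (connection values taken from settings.py above):

import pymysql

db = pymysql.connect(host='localhost', user='root', password='root', database='qcwy', charset='utf8')
cursor = db.cursor()
cursor.execute("SELECT positionName, companyName, salary FROM job_detail LIMIT 5")
for row in cursor.fetchall():
    print(row)
db.close()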

The complete source code is available at: https://github.com/zmk-c/scrapy/tree/master/scrapy_qcwy


Original article: blog.csdn.net/qq_40169189/article/details/107790834