Crawling 51job (前程无忧) with Scrapy

One, determine the content to crawl and create a MySQL table

1. Determine the URLs to be crawled.
Observation shows that the URL follows the pattern
https://search.51job.com/list/000000,000000,0000,32,9,99,+,2,xxxx.html
where xxxx is the page number; changing it lets you crawl page after page of results.
2. 51job loads its search results dynamically: the page embeds the result set as JSON assigned to a JavaScript variable inside a script tag. The crawler therefore has to parse that variable out of the script tag rather than read the rendered HTML.
3. Determine the fields to crawl, then create a MySQL table.
The MySQL table structure is as follows:
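The original screenshot of the table structure is lost. Judging from the INSERT statement used in the pipeline below (six columns, the first filled with NULL), a plausible reconstruction of the DDL looks like this; the column types and lengths are assumptions:

CREATE TABLE qcwy (
    id INT PRIMARY KEY AUTO_INCREMENT,  -- receives the NULL in the insert
    company VARCHAR(128),
    job_name VARCHAR(128),
    salary VARCHAR(64),
    requirement VARCHAR(256),
    welfare VARCHAR(256)
) DEFAULT CHARSET = utf8;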

Two, crawling with a Scrapy project

(1) Preparation:
1. Run scrapy startproject qcwy to create a Scrapy project.
2. Run scrapy genspider qcwyCrawler www.xxx.com to create a spider file (the full command sequence is shown below).
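In full, with the intermediate cd step that genspider implies (it must be run inside the project directory):

scrapy startproject qcwy
cd qcwy
scrapy genspider qcwyCrawler www.xxx.com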
(2) Change the project configuration file settings.py:

# Scrapy settings for qcwy project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html
from fake_useragent import UserAgent

BOT_NAME = 'qcwy'

SPIDER_MODULES = ['qcwy.spiders']
NEWSPIDER_MODULE = 'qcwy.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = UserAgent().random  # generate a random User-Agent (picked once, at startup)

# Obey robots.txt rules
ROBOTSTXT_OBEY = False  # do not obey the robots.txt protocol
LOG_LEVEL = 'ERROR'  # only log ERROR-level messages

ITEM_PIPELINES = {
    'qcwy.pipelines.QcwyPipeline': 300,
}  # enable the item pipeline
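Note that UserAgent().random is evaluated only once, when the settings module loads, so every request in the run shares a single random User-Agent. For a fresh one per request, a downloader middleware is the usual route. A minimal sketch (RandomUserAgentMiddleware is a name introduced here, not part of the original project):

# qcwy/middlewares.py
from fake_useragent import UserAgent


class RandomUserAgentMiddleware:
    def __init__(self):
        self.ua = UserAgent()

    def process_request(self, request, spider):
        # assign a fresh random User-Agent to every outgoing request
        request.headers['User-Agent'] = self.ua.random

It would then be enabled in settings.py with:

DOWNLOADER_MIDDLEWARES = {
    'qcwy.middlewares.RandomUserAgentMiddleware': 543,
}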

(3) Change the items.py file and define the fields to crawl

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class QcwyItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    company = scrapy.Field()
    job_name = scrapy.Field()
    salary = scrapy.Field()
    requirement = scrapy.Field()
    welfare = scrapy.Field()

(4) Change the qcwyCrawler.py file and write the crawler code

import scrapy
import json
from qcwy.items import QcwyItem


class QcwycrawlerSpider(scrapy.Spider):
    name = 'qcwyCrawler'
    # allowed_domains = ['www.xxx.com']
    start_urls = []  # URLs in start_urls are requested automatically by Scrapy

    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        url = 'https://search.51job.com/list/000000,000000,0000,32,9,99,+,2,{}.html'  # URL template for the pages to crawl
        for i in range(1, 1001):
            self.start_urls.append(url.format(i))

    def parse(self, response):  # parse the crawled data with XPath and json
        # the second script tag holds a JS variable assignment whose right-hand
        # side is JSON; slicing off the fixed-length prefix leaves pure JSON
        json_str = response.xpath('/html/body/script[2]/text()').extract_first()[29:]
        data = json.loads(json_str)
        for row in data['engine_search_result']:
            item = QcwyItem()  # create a fresh item per row rather than reusing one instance
            item['company'] = row['company_name']
            item['job_name'] = row['job_name']
            item['salary'] = row['providesalary_text']
            item['requirement'] = ' '.join(row['attribute_text'])
            item['welfare'] = row['jobwelf']
            yield item
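The [29:] slice assumes the script body always begins with the same fixed-length assignment prefix. A slightly more defensive sketch extracts the JSON with a regular expression instead (window.__SEARCH_RESULT__ is the variable name 51job used at the time of writing; verify it against the live page):

import json
import re


def extract_search_result(script_text):
    # pull the JSON object out of the `window.__SEARCH_RESULT__ = {...}` assignment
    match = re.search(r'window\.__SEARCH_RESULT__\s*=\s*(\{.*\})', script_text, re.S)
    return json.loads(match.group(1)) if match else None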

(5) Create a dbutil package with a connection module, and write the MySQL connection class

from pymysql import connect


class MysqlConnection:
    host = '127.0.0.1'
    port = 3306
    user = 'root'
    password = 'qwe12333'
    db = 'study'
    charset = 'utf8'

    @classmethod
    def getConnection(cls):
        conn = None
        try:
            conn = connect(
                host=cls.host,
                port=cls.port,
                user=cls.user,
                password=cls.password,
                db=cls.db,
                charset=cls.charset
            )
        except Exception as e:
            print(e)
        return conn

    @classmethod
    def closeConnection(cls, conn):
        if conn:  # close only if the connection was actually opened
            conn.close()
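A quick way to sanity-check the connection settings (a sketch; the credentials above are whatever your local MySQL uses):

if __name__ == '__main__':
    conn = MysqlConnection.getConnection()
    if conn:
        with conn.cursor() as cursor:
            cursor.execute('select version()')
            print(cursor.fetchone())  # e.g. ('8.0.22',)
        MysqlConnection.closeConnection(conn)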

(6) Change the crawler pipeline file to save the data into the MySQL table

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html


# useful for handling different item types with a single interface
from itemadapter import ItemAdapter
from dbutil.connection import MysqlConnection
from time import time


class QcwyPipeline:
    def __init__(self):
        self.start_time = time()
        self.conn = MysqlConnection.getConnection()
        self.cursor = None  # created lazily on the first item
        self.sql = 'insert into qcwy values(null, %s, %s, %s, %s, %s);'
        self.count = 0

    def process_item(self, item, spider):
        if self.cursor is None:
            self.cursor = self.conn.cursor()
        company = item['company']
        job_name = item['job_name']
        salary = item['salary']
        requirement = item['requirement']
        welfare = item['welfare']
        print('{}: {}'.format(company, job_name))
        self.count += self.cursor.execute(self.sql, (company, job_name, salary, requirement, welfare))
        return item  # a pipeline must return the item for any later pipelines

    def close_spider(self, spider):
        if self.cursor:
            self.cursor.close()
        self.conn.commit()  # all rows are committed in one transaction at the end of the crawl
        MysqlConnection.closeConnection(self.conn)
        print('Scraped {} records in total, elapsed: {} seconds'.format(self.count, time() - self.start_time))
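One caveat of this design: every row sits in a single uncommitted transaction until the spider closes, so a crash mid-crawl loses everything. A variant that commits periodically (a sketch; BatchedQcwyPipeline and BATCH_SIZE are names introduced here):

from dbutil.connection import MysqlConnection


class BatchedQcwyPipeline:
    BATCH_SIZE = 100  # arbitrary batch size; tune to taste

    def open_spider(self, spider):
        self.conn = MysqlConnection.getConnection()
        self.cursor = self.conn.cursor()
        self.sql = 'insert into qcwy values(null, %s, %s, %s, %s, %s);'
        self.count = 0

    def process_item(self, item, spider):
        self.count += self.cursor.execute(self.sql, (
            item['company'], item['job_name'], item['salary'],
            item['requirement'], item['welfare']))
        if self.count % self.BATCH_SIZE == 0:
            self.conn.commit()  # flush a batch so progress survives a crash
        return item

    def close_spider(self, spider):
        self.conn.commit()  # flush the final partial batch
        self.cursor.close()
        MysqlConnection.closeConnection(self.conn)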

(7) Write a main.py file to launch the Scrapy project

from scrapy import cmdline

cmdline.execute('scrapy crawl qcwyCrawler'.split())

Running python main.py is equivalent to executing scrapy crawl qcwyCrawler from the project root. The created project structure is as follows:
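(The original screenshot is lost; the sketch below is the layout scrapy startproject generates, plus the files added above. Placing dbutil next to main.py is an assumption, made so that from dbutil.connection import ... resolves when the crawl is started from the project root.)

qcwy/
├── scrapy.cfg
├── main.py
├── dbutil/
│   ├── __init__.py
│   └── connection.py
└── qcwy/
    ├── __init__.py
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    └── spiders/
        ├── __init__.py
        └── qcwyCrawler.py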
Operation result: as the spider runs, the console prints a company: job_name line for each inserted row, followed by the total record count and elapsed time (screenshots omitted).

Origin: blog.csdn.net/yeyu_xing/article/details/113101681