Scrapy Framework in Practice (3): The Scrapy Item Pipeline in Detail

After a Spider parses a Response, the data it extracted is stored in Items and handed to the Item Pipeline. In the Item Pipeline you create a class for processing that data; this class is the pipeline component, and by applying a series of processing steps it can clean the data and save it to storage.

1. The core methods of the Item Pipeline

Typical uses of an Item Pipeline are as follows (a small sketch illustrating points 2 and 3 follows this list):

  1. Clean up HTML data.
  2. Validate the scraped data (check that the item contains certain fields).
  3. Check for duplicates (and drop them).
  4. Store the scraped results in a database.
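As a small illustration of points 2 and 3, the sketch below drops items that are missing a required field and filters out duplicates; the field name book_name is only an assumption for the example.

from scrapy.exceptions import DropItem


class DedupPipeline:
    def __init__(self):
        self.seen = set()  # values already seen in this run

    def process_item(self, item, spider):
        name = item.get('book_name')  # assumed required field
        if not name:
            raise DropItem('missing book_name')  # validation: drop incomplete items
        if name in self.seen:
            raise DropItem(f'duplicate item: {name}')  # de-duplication
        self.seen.add(name)
        return item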

When writing a custom Item Pipeline, you can implement the following methods (a minimal skeleton follows this list):

  1. process_item(): the one method a custom Item Pipeline must implement. It takes two parameters, with the following meanings:
    1. item: the Item object (or dictionary) currently being processed.
    2. spider: the Spider object that scraped the item.
  2. open_spider(): called when the spider is opened, so initialization work can be done here; the spider parameter is the Spider that was opened.
  3. close_spider(): the counterpart of the previous method, called when the spider is closed, so finishing work can be done here; the spider parameter is the Spider being closed.
  4. from_crawler(): a class method, marked with @classmethod. It receives the class as cls and a crawler parameter, creates an instance with cls() and returns it. Through the crawler parameter you can access Scrapy's core components, such as the settings.
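A minimal skeleton of a custom pipeline implementing these four methods is shown below; ExamplePipeline and the EXAMPLE_SETTING settings key are placeholder names, not part of the JD.com project built in the next section.

class ExamplePipeline:
    def __init__(self, example_setting):
        self.example_setting = example_setting  # value read from settings.py

    @classmethod
    def from_crawler(cls, crawler):
        # build the pipeline instance from the crawler's settings and return it
        return cls(example_setting=crawler.settings.get('EXAMPLE_SETTING'))

    def open_spider(self, spider):
        # called once when the spider is opened: do initialization here
        pass

    def close_spider(self, spider):
        # called once when the spider is closed: do cleanup here
        pass

    def process_item(self, item, spider):
        # called for every item the spider yields; must return the item
        # (or raise DropItem to discard it)
        return item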

2. Crawling JD.com data and storing it in a MySQL database

Now that the role of the Item Pipeline is clear, you can use it to store scraped data in a database. The example below scrapes the JD.com book ranking and saves the results to a MySQL database. The steps are as follows:

(1) Install and set up MySQL, then use Navicat for MySQL to create a database named jd_data.

(2) In the jd_data database, create a data table named ranking.
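If you prefer to create the database and table with a script instead of Navicat, here is a minimal sketch; only the table name ranking and the column names book_name, author and press are required by the pipeline code later in this article, while the id column and the VARCHAR types are assumptions.

import pymysql  # pip install pymysql

# connection parameters are examples; adjust them to your MySQL setup
db = pymysql.connect(host='localhost', user='root', password='mysql', port=3306, charset='utf8')
cursor = db.cursor()
cursor.execute('CREATE DATABASE IF NOT EXISTS jd_data DEFAULT CHARACTER SET utf8')
cursor.execute('USE jd_data')
cursor.execute(
    'CREATE TABLE IF NOT EXISTS ranking ('
    '  id INT PRIMARY KEY AUTO_INCREMENT,'
    '  book_name VARCHAR(255),'  # book title
    '  author VARCHAR(255),'     # author
    '  press VARCHAR(255)'       # publisher
    ')'
)
db.close()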
(3) Open the JD.com book ranking page in Google Chrome; the address is:

https://book.jd.com/booktop/0-0-0.html?category=3287-0-0-0-10001-1

Once the page has loaded successfully, press F12 (or right-click and choose the developer tools option) to open the browser's developer tools. First select the Network tab, find and click the request for the page URL, then switch to the Response tab and confirm that the data we want to scrape is present in the response. If it is not, we need to look for the URL of the request that actually returns the page data.
Then switch to the Elements tab, click the arrow icon in the upper-left corner, hover over the data to be extracted on the page, and locate where it sits in the HTML.
Following the same steps, locate the elements that hold the book title, author, and publisher.
(4) Once the location of the data in the HTML is known, open a command-line window and create the project folder and the crawler file.
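Based on the project name jd and the spider name JdSpider used in the code below, the commands are presumably the standard Scrapy ones:

scrapy startproject jd
cd jd
scrapy genspider JdSpider book.jd.com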
Then open the project with the PyCharm IDE and check the generated project structure.

(5) Open the items.py file in the project structure and define the Item in it. The code is as follows:

import scrapy


class JdItem(scrapy.Item):
    book_name = scrapy.Field()  # book title
    author = scrapy.Field()  # author
    press = scrapy.Field()  # publisher

(6) Open the JdSpider.py crawler file and override the start_requests() method to send the request for the JD.com book ranking page. Below it, override the parse() method to extract the data from the page and fill an Item object with it. The code is as follows:

import scrapy
from jd.items import JdItem


class JdspiderSpider(scrapy.Spider):
    name = 'JdSpider'
    allowed_domains = ['book.jd.com']
    start_urls = ['http://book.jd.com/']

    def start_requests(self):
        # URL of the JD.com book ranking page
        url = "https://book.jd.com/booktop/0-0-0.html?category=3287-0-0-0-10001-1"
        yield scrapy.Request(url=url, callback=self.parse)  # send the request

    def parse(self, response):
        li_list = response.xpath('//div[@class="m m-list"]/div/ul/li')
        for li in li_list:
            # extract the book title (XPath relative to the current li)
            book_name = li.xpath('div[@class="p-detail"]/a/text()').extract_first()
            # extract the author (relative path, so each li yields its own author)
            author = li.xpath('div[@class="p-detail"]/dl[1]/dd/a[1]/text()').extract_first()
            # extract the publisher
            press = li.xpath('div[@class="p-detail"]/dl[2]/dd/a[1]/text()').extract_first()
            item = JdItem()  # create the Item object
            # add the data to the Item object
            item['book_name'] = book_name
            item['author'] = author
            item['press'] = press
            yield item

(8) Use PyCharm to run the crawler. After the crawler starts, the console prints all of the scraped information held in the Item objects.
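One common way to run a Scrapy spider from PyCharm is a small launcher script; a minimal sketch of the main.py file referred to in step (11) could look like this:

from scrapy import cmdline

# run the spider exactly as "scrapy crawl JdSpider" would on the command line
cmdline.execute('scrapy crawl JdSpider'.split())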
(9) Having confirmed that the data is being scraped, the next step is to store it in the MySQL database from the item pipeline. Open the pipelines.py file, import the PyMySQL module, initialize the database connection parameters in the __init__() method, and implement the from_crawler(), open_spider(), close_spider() and process_item() methods. The code is as follows:

import pymysql  # PyMySQL module for connecting to MySQL


class JdPipeline:
    # initialize the database parameters
    def __init__(self, host, database, user, password, port):
        self.host = host
        self.database = database
        self.user = user
        self.password = password
        self.port = port

    @classmethod
    def from_crawler(cls, crawler):
        # return a cls() instance built from the database parameters read from the settings via crawler
        return cls(
            host=crawler.settings.get('SQL_HOST'),
            database=crawler.settings.get('SQL_DATABASE'),
            user=crawler.settings.get('SQL_USER'),
            password=crawler.settings.get('SQL_PASSWORD'),
            port=crawler.settings.get('SQL_PORT'),
        )

    # called when the spider is opened
    def open_spider(self, spider):
        # connect to the database (keyword arguments, since newer PyMySQL versions no longer accept positional ones)
        self.db = pymysql.connect(host=self.host, user=self.user, password=self.password,
                                  database=self.database, port=self.port, charset="utf8")
        self.cursor = self.db.cursor()  # create a cursor

    # called when the spider is closed
    def close_spider(self, spider):
        self.db.close()

    def process_item(self, item, spider):
        data = dict(item)  # convert the Item to a dict
        # SQL statement
        sql = "insert into ranking(book_name,press,author) values (%s,%s,%s)"
        # execute the insert
        self.cursor.execute(sql, (data['book_name'], data['press'], data['author']))
        self.db.commit()  # commit
        return item  # return the Item

(10) Open the settings.py file, find the ITEM_PIPELINES setting that activates the project pipeline and uncomment it, then add the variables that hold the database information. The code is as follows:

ROBOTSTXT_OBEY = False
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36'
SQL_HOST = 'localhost'  # database host
SQL_USER = 'root'  # user name
SQL_PASSWORD = 'mysql'  # password
SQL_DATABASE = 'jd_data'  # database name
SQL_PORT = 3306  # port
# enable the jd project pipeline
ITEM_PIPELINES = {
    'jd.pipelines.JdPipeline': 300,
}

(11) Open the main.py file and start the crawler again from it. After the crawler has finished, open the ranking data table and you will see the scraped records.
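To double-check the result without Navicat, a quick query from Python (a sketch reusing the connection parameters from settings.py) could look like this:

import pymysql

db = pymysql.connect(host='localhost', user='root', password='mysql',
                     database='jd_data', port=3306, charset='utf8')
cursor = db.cursor()
cursor.execute('SELECT book_name, author, press FROM ranking LIMIT 5')
for row in cursor.fetchall():
    print(row)  # print the first few scraped records
db.close()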
That concludes today's case. The author declares that this article was written only for learning and exchange, and to help readers learning the basics of Python avoid detours and save time; it is not to be used for any other purpose. If there is any infringement, contact the blogger to have it removed. Thank you for reading this post, and I hope it helps you on your programming journey. Happy reading!



A good book never tires of being read a hundred times. If I want to be the brightest person in the room, I must keep acquiring knowledge through study, use knowledge to change my destiny, use this blog to witness my growth, and use action to prove that I am working hard.
If my blog helps you and you like it, please give it a like, a comment, and a favorite! I hear that readers who do never run into bad luck and are full of energy every day! And if you would rather just read for free, I still wish you happiness every day, and you are welcome back to my blog any time.
Writing is not easy, and your support is what keeps me going. After liking, don't forget to follow me!


Origin blog.csdn.net/xw1680/article/details/111321681