Lecture 45: Store It Anywhere: The Usage of Item Pipeline

In the previous example, we already covered the basic concept of the Item Pipeline. In this lesson, we will explain its usage in detail.

First, let's look at the architecture of Item Pipeline in Scrapy, as shown in the figure.
(Figure: Scrapy architecture, with the Item Pipeline on the far left)
The component on the far left of the figure is the Item Pipeline, which is called after the Spider generates an Item. Once the Spider has parsed a Response, the Item is passed to the Item Pipeline, and the defined Item Pipeline components are called in sequence to carry out a series of processing steps, such as data cleaning and storage.

Its main functions are:

  • Clean HTML data;

  • Verify the crawled data and check the crawled fields;

  • Check duplicate content and discard duplicate content;

  • Store the crawling results in the database.

1. Core Methods

We can customize an Item Pipeline simply by implementing a few specified methods. The one method that must be implemented is:

  • process_item(item, spider)

In addition, there are several other practical methods:

  • open_spider(spider)

  • close_spider(spider)

  • from_crawler(cls, crawler)

Below we give a detailed introduction to the usage of these methods:

process_item(item, spider)

process_item() is the method that must be implemented. Every enabled Item Pipeline calls this method to process each Item, so this is where we perform operations such as processing the data or writing it to a database. The method must either return an Item object or raise a DropItem exception.

The parameters of the process_item() method are as follows:

  • item is the Item object, that is, the Item being processed;

  • spider is the Spider object, that is, the Spider that generated the Item.

The return type of the method is summarized as follows:

  • If an Item object is returned, this Item continues to be processed by the process_item() methods of the lower-priority Item Pipelines until all of them have been called.

  • If a DropItem exception is raised, the Item is discarded and no further processing is performed (see the sketch below).
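For example, a minimal sketch of a validation pipeline that relies on this behavior might look like the following; the required field name url matches the Item defined later in this lesson, but the pipeline itself is purely illustrative:

from scrapy.exceptions import DropItem

class RequiredFieldPipeline(object):
    def process_item(self, item, spider):
        # Discard the Item if the (illustrative) required field is missing
        if not item.get('url'):
            raise DropItem('Missing url in %s' % item)
        # Returning the Item lets lower-priority pipelines keep processing it
        return item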

open_spider(spider)

The open_spider() method is automatically called when the Spider is started. Here we can do some initialization operations, such as opening a database connection. The parameter spider is the Spider object that is opened.

close_spider(spider)

The close_spider() method is automatically called when the Spider is closed. Here we can do some finishing work, such as closing the database connection, etc. The parameter spider is the Spider object being closed.

from_crawler(cls, crawler)

The from_crawler() method is a class method, marked with @classmethod, and is a form of dependency injection. Its parameter is crawler; through the crawler object we can access Scrapy's core components, such as every item of the global settings, and then create a Pipeline instance. The parameter cls is the Pipeline class itself, and the method finally returns an instance of that class.
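Putting these methods together, here is a minimal sketch of a custom pipeline that writes each Item to a text file. The OUTPUT_PATH setting and the file-writing logic are purely illustrative:

class TextExportPipeline(object):
    def __init__(self, output_path):
        self.output_path = output_path

    @classmethod
    def from_crawler(cls, crawler):
        # Build the instance from a (hypothetical) value in the global settings
        return cls(output_path=crawler.settings.get('OUTPUT_PATH', 'items.txt'))

    def open_spider(self, spider):
        # Called when the Spider starts: perform initialization here
        self.file = open(self.output_path, 'a', encoding='utf-8')

    def process_item(self, item, spider):
        # Called for every Item: process it, then return it for later pipelines
        self.file.write(str(dict(item)) + '\n')
        return item

    def close_spider(self, spider):
        # Called when the Spider closes: release resources here
        self.file.close()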

Below we work through an example to deepen our understanding of how the Item Pipeline is used.

2. Objectives of this section

We will crawl the beautiful photography pictures on 360 Images and implement three pipelines: MongoDB storage, MySQL storage, and image storage.

3. Preparation

Please make sure that you have installed the MongoDB and MySQL databases, as well as Python's PyMongo and PyMySQL libraries and the Scrapy framework. In addition, you need to install the Pillow image processing library. If you haven't, you can refer to the earlier installation instructions.

4. Crawl analysis

The target website for this crawl is https://image.so.com. Open this page and switch to the photography category; the page shows many beautiful photographs. Open the browser's developer tools, switch the filter to the XHR option, and scroll down the page; you will see many Ajax requests appear below, as shown in the figure.
(Figure: Ajax requests listed in the browser's XHR panel)
We check the details of a request and observe the returned data structure, as shown in the figure.
(Figure: JSON structure of one Ajax response)
The response format is JSON. The list field contains the details of each picture, including the ID, name, link, thumbnail and other information for 30 pictures. Also observe the parameters of the Ajax request: there is a parameter sn that keeps changing, and it is clearly the offset. When sn is 30, the first 30 pictures are returned; when sn is 60, the 31st to 60th pictures are returned. In addition, the ch parameter is the photography category, listtype is the sort order, and the temp parameter can be ignored.
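As a quick sanity check outside Scrapy, we can request the interface directly. The sketch below assumes the endpoint and parameters behave as observed above and that the requests library is installed:

import requests

params = {'ch': 'photography', 'listtype': 'new', 'sn': 30}
resp = requests.get('https://image.so.com/zjl', params=params)
data = resp.json()
images = data.get('list') or []
print(len(images))                     # 30 per page if the interface is unchanged
if images:
    print(images[0].get('qhimg_url'))  # link of the first picture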

So we only need to change the value of sn when crawling. Next, we use Scrapy to capture images, save the image information to MongoDB, MySQL, and store the images locally.

5. New project

First create a new project, the command is as follows:

scrapy startproject images360

Next, create a new Spider, the command is as follows:

scrapy genspider images images.so.com

In this way, we have successfully created a Spider.

6. Constructing the request

Next, define the number of pages to crawl. For example, to crawl 50 pages with 30 images per page, 1500 images in total, we first define a MAX_PAGE variable in settings.py by adding the following:

MAX_PAGE = 50

Define the start_requests() method to generate 50 requests, as shown below:

def start_requests(self):
    data = {'ch': 'photography', 'listtype': 'new'}
    base_url = 'https://image.so.com/zjl?'
    for page in range(1, self.settings.get('MAX_PAGE') + 1):
        data['sn'] = page * 30
        params = urlencode(data)
        url = base_url + params
        yield Request(url, self.parse)

Here we first define the two fixed parameters, while the sn parameter is generated in the loop. We then use the urlencode() method to convert the dictionary into URL GET parameters, construct the complete URL, and build and yield a Request.
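For reference, urlencode() simply turns the parameter dictionary into a query string, so each constructed URL looks like this (shown here for the first page):

from urllib.parse import urlencode

data = {'ch': 'photography', 'listtype': 'new', 'sn': 30}
print(urlencode(data))
# ch=photography&listtype=new&sn=30
print('https://image.so.com/zjl?' + urlencode(data))
# https://image.so.com/zjl?ch=photography&listtype=new&sn=30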

We also need to import Request from scrapy and urlencode from urllib.parse, as shown below:

from scrapy import Spider, Request
from urllib.parse import urlencode

Then modify the ROBOTSTXT_OBEY variable in settings.py and set it to False; otherwise the site's robots.txt rules would prevent the pages from being crawled:

ROBOTSTXT_OBEY = False

Run the crawler with the following command, and you will see that the requests succeed:

scrapy crawl images

The result of running the example is shown in the figure.
(Figure: crawl output showing the requests returning status code 200)
All requests return status code 200, which shows the requests were made successfully.

7. Extract information

First define an Item called ImageItem, as follows:

from scrapy import Item, Field
class ImageItem(Item):
    collection = table = 'images'
    id = Field()
    url = Field()
    title = Field()
    thumb = Field()

Here we define 4 fields: the image ID, link, title, and thumbnail. There are also two other attributes, collection and table, both set to the string 'images'; they represent the collection name used in MongoDB and the table name used in MySQL.

Next, we extract the relevant information in the Spider and rewrite the parse method as follows:

def parse(self, response):
    result = json.loads(response.text)
    for image in result.get('list'):
        item = ImageItem()
        item['id'] = image.get('id')
        item['url'] = image.get('qhimg_url')
        item['title'] = image.get('title')
        item['thumb'] = image.get('qhimg_thumb')
        yield item

First we parse the JSON (this requires importing the json module at the top of the Spider), traverse the list field, take out each picture's information, and assign the values to an ImageItem to generate the Item object.
In this way, we have completed the extraction of information.

8. Store information

Next we need to save the picture information to MongoDB and MySQL, and save the pictures locally.

MongoDB

First make sure that MongoDB has been installed and can run normally.

We use a MongoPipeline to save the information to MongoDB. Add the following class to pipelines.py:

import pymongo

class MongoPipeline(object):
    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DB')
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def process_item(self, item, spider):
        # insert_one() replaces the insert() method, which is deprecated in newer PyMongo versions
        self.db[item.collection].insert_one(dict(item))
        return item

    def close_spider(self, spider):
        self.client.close()

Two variables are needed here, MONGO_URI and MONGO_DB, which are the MongoDB connection address and database name. We add them to settings.py as follows:

MONGO_URI = 'localhost'
MONGO_DB = 'images360'

With that, the Pipeline that saves to MongoDB is complete. The main method here is process_item(), which calls the insert_one() method of the Collection object to insert the data and finally returns the Item object.
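Once the crawl has run, a quick check with PyMongo can confirm the stored data; this sketch assumes the MONGO_URI and MONGO_DB settings above:

import pymongo

client = pymongo.MongoClient('localhost')
db = client['images360']
print(db['images'].count_documents({}))  # number of stored image records
print(db['images'].find_one())           # one sample record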

MySQL

First, you need to make sure that MySQL has been installed and running properly.

Create a new database, still named images360; the SQL statement is as follows:

CREATE DATABASE images360 DEFAULT CHARACTER SET utf8 COLLATE utf8_general_ci

Create a new data table with the four fields id, url, title, and thumb; the SQL statement is as follows:

CREATE TABLE images (id VARCHAR(255) NOT NULL PRIMARY KEY, url VARCHAR(255) NULL, title VARCHAR(255) NULL, thumb VARCHAR(255) NULL)

After executing the SQL statement, we successfully created the data table. Then you can store data in the table.

Next we implement a MySQLPipeline, the code is as follows:

import pymysql

class MysqlPipeline():
    def __init__(self, host, database, user, password, port):
        self.host = host
        self.database = database
        self.user = user
        self.password = password
        self.port = port

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            host=crawler.settings.get('MYSQL_HOST'),
            database=crawler.settings.get('MYSQL_DATABASE'),
            user=crawler.settings.get('MYSQL_USER'),
            password=crawler.settings.get('MYSQL_PASSWORD'),
            port=crawler.settings.get('MYSQL_PORT'),
        )

    def open_spider(self, spider):
        # Keyword arguments are used because recent PyMySQL versions no longer accept positional ones
        self.db = pymysql.connect(host=self.host, user=self.user, password=self.password,
                                  database=self.database, charset='utf8', port=self.port)
        self.cursor = self.db.cursor()

    def close_spider(self, spider):
        self.db.close()

    def process_item(self, item, spider):
        data = dict(item)
        keys = ', '.join(data.keys())
        values = ', '.join(['%s'] * len(data))
        sql = 'insert into %s (%s) values (%s)' % (item.table, keys, values)
        self.cursor.execute(sql, tuple(data.values()))
        self.db.commit()
        return item

As mentioned earlier, the insertion here uses a dynamically constructed SQL statement: the field names and placeholders are built from the Item's keys, so the statement adapts to whatever fields the Item contains.
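For the ImageItem defined above, the construction works out as follows (field values elided, placeholders shown literally):

data = {'id': '1', 'url': 'http://...', 'title': 't', 'thumb': 'http://...'}
keys = ', '.join(data.keys())            # 'id, url, title, thumb'
values = ', '.join(['%s'] * len(data))   # '%s, %s, %s, %s'
sql = 'insert into %s (%s) values (%s)' % ('images', keys, values)
# insert into images (id, url, title, thumb) values (%s, %s, %s, %s)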

A few more MySQL settings are needed. We add the following variables to settings.py:

MYSQL_HOST = 'localhost'
MYSQL_DATABASE = 'images360'
MYSQL_PORT = 3306
MYSQL_USER = 'root'
MYSQL_PASSWORD = '123456'

These define the MySQL host, database name, port, user name, and password respectively. With that, the MySQL Pipeline is complete.

Image Pipeline

Scrapy provides pipelines dedicated to downloading, covering both file downloads and image downloads. Downloading files and pictures works on the same principle as crawling pages, so the download process is asynchronous and very efficient. Let's look at the concrete implementation.

The official document address is: https://doc.scrapy.org/en/latest/topics/media-pipeline.html .

First define the storage path by adding an IMAGES_STORE variable in settings.py:

IMAGES_STORE = './images'

Here we define the path as the images subfolder under the current path, that is, the downloaded pictures will be saved in the images folder of this project.

By default, the built-in ImagesPipeline reads the Item's image_urls field and assumes it is a list: it traverses the field and takes out each URL to download.

However, in the Item we generate, the image link is not stored in an image_urls field, nor is it a list; it is a single URL. So to make downloading work, we need to redefine part of the download logic: we define a custom ImagePipeline that inherits from the built-in ImagesPipeline and overrides a few of its methods.

We define ImagePipeline as follows:

from scrapy import Request
from scrapy.exceptions import DropItem
from scrapy.pipelines.images import ImagesPipeline
class ImagePipeline(ImagesPipeline):
    def file_path(self, request, response=None, info=None):
        url = request.url
        file_name = url.split('/')[-1]
        return file_name

    def item_completed(self, results, item, info):
        image_paths = [x['path'] for ok, x in results if ok]
        if not image_paths:
            raise DropItem('Image Downloaded Failed')
        return item

    def get_media_requests(self, item, info):
        yield Request(item['url'])

Here we implement ImagePipeline, inheriting Scrapy's built-in ImagesPipeline and overriding the following methods.

  • get_media_requests(). Its first parameter item is the Item object produced by the crawl. We take out its url field and generate a Request object directly. This Request joins the scheduling queue, waits to be scheduled, and is then downloaded.

  • file_path(). Its first parameter request is the Request object for the current download. This method returns the file name to save under; here we simply use the last segment of the image link as the file name, splitting the URL with split() and taking the final part. The downloaded picture is then saved under the name returned by this method.

  • item_completed(). This is the method called when the downloads for a single Item have finished. Since not every picture downloads successfully, we need to inspect the download results and weed out the failures; if a picture fails to download, there is no need to save the Item to the database. The first parameter, results, holds the download results for the Item. It is a list whose elements are tuples containing the success or failure information of each download. Here we traverse the results to collect the paths of all successful downloads. If that list is empty, the Item's image failed to download, so we raise DropItem and the Item is discarded; otherwise we return the Item, indicating that it is valid. A sketch of the results structure follows this list.
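For reference, the results list for a successful download is shaped roughly like this; the values here are made up for illustration:

results = [
    (True, {'url': 'https://example.com/abc.jpg',
            'path': 'abc.jpg',
            'checksum': '0123456789abcdef0123456789abcdef'}),
]
image_paths = [x['path'] for ok, x in results if ok]  # ['abc.jpg']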

So far, the definitions of the three Item Pipelines are complete. Finally, we just need to enable them by setting ITEM_PIPELINES in settings.py as follows:

ITEM_PIPELINES = {
    'images360.pipelines.ImagePipeline': 300,
    'images360.pipelines.MongoPipeline': 301,
    'images360.pipelines.MysqlPipeline': 302,
}

Pay attention to the calling order here: the pipeline with the smaller number runs first. ImagePipeline is called first to filter Items after downloading, so Items whose images fail to download are dropped directly and never reach MongoDB or MySQL. The two storage pipelines are called afterwards, which ensures that every picture recorded in the databases was downloaded successfully.
Next, run the program to start crawling, as shown below:

scrapy crawl images

The crawler downloads images as it crawls; the download speed is very fast, and the corresponding output log is shown in the figure.
(Figure: crawl log showing the image downloads)
Check the local images folder and find that the pictures have been downloaded successfully, as shown in the figure.
(Figure: downloaded pictures in the local images folder)
Check MySQL, the downloaded image information has been successfully saved, as shown in the figure.
(Figure: image records stored in MySQL)
Check MongoDB, the downloaded image information has also been successfully saved, as shown in the figure.
(Figure: image records stored in MongoDB)
In this way, we have successfully downloaded the pictures and stored their information in the databases.

9. Code for this section

The code address of this section is:
https://github.com/Python3WebSpider/Images360 .

10. Conclusion

Item Pipeline is a very important component of Scrapy, and data storage is almost all realized through this component. Please be sure to master this content carefully.
