Python crawler --- a first look at the Scrapy framework, with a hands-on example

Scrapy framework installation

Operating environment

Operating system: Ubuntu 19.10
Python version: Python 3.7.4
IDE: PyCharm Community Edition

Installing the Scrapy framework (on Linux)

The most cumbersome part of installing the Scrapy framework is installing its many dependencies; if any of them are missing, the installation will fail with an error.
If you work in an Anaconda environment, however, modules such as lxml are already included, which saves a few of these steps.
Enter the following commands in the terminal to install the dependencies:

sudo apt-get install python3-dev
sudo apt-get install libevent-dev

Enter y to confirm when prompted, and wait patiently for the installation to complete. Once the required dependencies are in place, the Scrapy framework itself can be installed directly with pip:

pip install scrapy

After entering the command, you may run into errors such as a network response timeout.

After a few attempts it became clear that the problem was the package source: the default one is simply too slow. It is worth switching to the Tsinghua mirror:

pip install scrapy -i https://pypi.tuna.tsinghua.edu.cn/simple

Appending -i to the command tells pip to temporarily use the specified mirror for this installation.
If, after ruling out other problems, the download speed is still not ideal, you can try one of the other mirrors below (a way to make a mirror the default is shown after the list).

Aliyun                 http://mirrors.aliyun.com/pypi/simple/
USTC                   https://pypi.mirrors.ustc.edu.cn/simple/
Douban                 http://pypi.douban.com/simple/
Tsinghua University    https://pypi.tuna.tsinghua.edu.cn/simple/
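If one of these mirrors works well for you, pip can also be told to use it by default instead of passing -i every time (this needs a reasonably recent pip that supports the pip config command):

pip config set global.index-url https://pypi.tuna.tsinghua.edu.cn/simple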

OK, after switching sources the download speed improves noticeably; after a few seconds you should see the message that the installation succeeded.

Check if the installation is successful

Open a terminal and enter the following command to print the Scrapy version:

scrapy version

You should see the installed version information printed.
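Alternatively, the version can be checked from inside Python itself; Scrapy exposes its version string as scrapy.__version__:

import scrapy

# print the installed Scrapy version string
print(scrapy.__version__)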

Then enter the command

scrapy bench

This command runs Scrapy's built-in benchmark: it starts a simple spider that crawls a locally generated dummy site as fast as it can. If you see Scrapy running and logging pages crawled, the framework has been installed successfully.

The Scrapy crawling architecture

The Scrapy framework has five main components:

Scrapy Engine: responsible for the communication, signals, and data transfer among the Spider, Item Pipeline, Downloader, and Scheduler.
Scheduler: accepts the Requests sent by the engine, sorts them into a queue in a certain order, and hands them back to the engine when the engine asks for them.
Downloader: downloads all the Requests sent by the Scrapy Engine and returns the Responses it obtains to the engine, which passes them on to the Spider for processing.
Spider: processes all Responses, analyzes them to extract data, fills in the fields required by the Item, and submits any follow-up URLs to the engine so that they enter the Scheduler again.
Item Pipeline: processes the Items obtained from the Spider and performs post-processing on them (detailed analysis, filtering, storage, and so on); a minimal example is sketched below.
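As a quick illustration of that last point, an item pipeline is just a class with a process_item method. The sketch below is my own example (the class name and the filtering rule are invented for illustration and are not used later in this tutorial); it would take effect only after being registered in the ITEM_PIPELINES setting:

from scrapy.exceptions import DropItem

class DropEmptyNamePipeline:
    """Hypothetical pipeline: discard items that have no book name."""
    def process_item(self, item, spider):
        if not item.get('name'):
            # DropItem tells Scrapy to stop processing this item
            raise DropItem("missing book name")
        return item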

It also has two kinds of middleware for customizing behaviour:

Downloader Middlewares: components that let you customize and extend the download functionality (a minimal sketch follows this list).
Spider Middlewares: components that let you customize and extend the communication between the engine and the Spider.
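For example, a downloader middleware is just a class whose process_request hook sees every request before it is downloaded. The sketch below is my own illustration (the class name and the User-Agent string are invented); it would take effect only after being registered in the DOWNLOADER_MIDDLEWARES setting:

class CustomUserAgentMiddleware:
    """Hypothetical middleware: attach a fixed User-Agent header to every request."""
    def process_request(self, request, spider):
        request.headers['User-Agent'] = 'my-crawler/0.1'
        # returning None lets Scrapy continue processing the request normally
        return None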

How the Scrapy framework works

The engine is the brain of the framework: it coordinates the operation of all the other parts.

First, the crawler file sends a URL request to the engine, and the engine passes the URL on to the scheduler. The scheduler queues the incoming URLs through its internal functions and sends the queued requests back to the engine. The engine then forwards each request to the downloader, which fetches the page source from the website and sends the resulting response back to the engine. The engine returns the page source to the crawler file, which is responsible for extracting data from the HTML. Once extraction is complete, the data is sent back to the engine. The engine inspects what the crawler file sends: if it judges the content to be data, it passes it to the pipeline, which is dedicated to saving it; the pipeline can then selectively store the data in a database or a file.

At this point, one complete crawl cycle is finished.

Scrapy framework example

Use Scrapy to crawl Ali Literature

General steps for crawling with the Scrapy framework

1. Create a crawler project
2. Create a crawler file
3. Write a crawler file
4. Write items
5. Write pipelines
6. Configure settings
7. Run scrapy framework

Note: these are the general steps for using Scrapy. In practice they can be adjusted according to how difficult the target page is to crawl.

1. Create a crawler project

First pick a folder in which to keep the project.
Open a terminal in that directory and enter:

scrapy startproject [project_name]
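For example, if we call the project aliwx_project (the name here is my own choice for illustration; the original post does not show the one it used), the command would be:

scrapy startproject aliwx_project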

If the project is created successfully, you will find a new folder named after the project.

Go into that folder and you will see that it already contains a set of framework template files:

__init__.py     // initialization file
items.py        // defines the target: what information you want to scrape
pipelines.py    // post-processing of the scraped data
middlewares.py  // middleware
settings.py     // project settings

2. Create a crawler file

scrapy genspider [spider_name] [domain_of_the_site_to_crawl]

Note: the crawler name must not be the same as the project name, otherwise the creation will fail.
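Here the spider is named aliwx and the domain of the site to crawl is aliwx.com.cn, so the command is:

scrapy genspider aliwx aliwx.com.cn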

3. Analyze the page and write the crawler file

After creating the crawler file in the previous step, you will find an extra .py file under the project directory.

Open the aliwx.py file and start analyzing the web page to work out how to extract the information we want to crawl.
Open the homepage of the Ali Literature official site; the content we want to crawl is the title and the latest chapter of each recently updated book, and we will extract it with XPath expressions.
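Before writing the expressions into the spider, it can help to try them out interactively with the Scrapy shell (the expression below is the one the spider uses later; the exact output depends on the live page):

scrapy shell "https://www.aliwx.com.cn/"
>>> response.xpath("//ul[@class='list']/li[@class='tr']/a/span[2]/text()").getall()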

Open the spiders folder inside the project folder, and then open the crawler file we just created.
Inside you will see a basic spider template, which we need to modify to fit our needs.

There are two places in this template that we mainly need to change.
First, replace the start URL with the address we want to crawl, the homepage of the Ali Literature site, https://www.aliwx.com.cn/.
Second, fill in the parse method with our own extraction logic for the content we want to crawl:

import scrapy

class AliwxSpider(scrapy.Spider):
    name = 'aliwx'
    allowed_domains = ['aliwx.com.cn']
    start_urls = ['https://www.aliwx.com.cn/']

    def parse(self, response):
        # select every <a> tag in the latest-updates list
        selectors = response.xpath("//ul[@class='list']/li[@class='tr']/a")
        # loop over each <a> tag so that each book name is matched with its latest chapter
        for selector in selectors:
            # extract the book name; "." means keep selecting inside the current tag
            book_name = selector.xpath("./span[2]/text()").get()
            # extract the name of the latest chapter
            book_new = selector.xpath("./span[3]/text()").get()

            # print the book name and chapter name
            print(book_name, book_new)

4. Configure the settings

Open the settings.py file and find the line ROBOTSTXT_OBEY = True.

This line asks whether the crawler should obey the site's robots.txt protocol. Generally speaking, the protocol restricts us to extracting only the content the website allows, so here we change True to False, otherwise we may not be able to extract the content we want.
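For reference, the relevant part of settings.py before and after the change (the DOWNLOAD_DELAY line is an optional extra I like to add to be polite to the site, not something from the original post):

# before
ROBOTSTXT_OBEY = True

# after
ROBOTSTXT_OBEY = False
# optionally, slow the crawler down a little (seconds between requests)
DOWNLOAD_DELAY = 1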

5. Run the scrapy framework

Open a terminal, change into the project folder, and run the project with:

scrapy crawl [spider_name]

Wait for the run to finish and you will see the crawled content printed.

6. Save the crawled content to a file

To save the content to a file, we need to yield the items instead of printing them:
replace the print(book_name, book_new) statement with

items = {
    'name': book_name,
    'new': book_new,
}

yield items
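Put together, the parse method in aliwx.py now ends with a yield instead of the print call (this is simply the code from above assembled in one place):

    def parse(self, response):
        # select every <a> tag in the latest-updates list
        selectors = response.xpath("//ul[@class='list']/li[@class='tr']/a")
        for selector in selectors:
            book_name = selector.xpath("./span[2]/text()").get()
            book_new = selector.xpath("./span[3]/text()").get()
            # yield a plain dict; Scrapy treats it as a scraped item
            items = {
                'name': book_name,
                'new': book_new,
            }
            yield items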

Then open a terminal and enter the command that runs the spider and saves the output:

scrapy crawl aliwx -o book.csv

Wait for the program to finish running and a book.csv file appears in the project directory; open it to see the data we saved.
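The -o option picks the output format from the file extension, so the same data can just as easily be saved as JSON or JSON lines:

scrapy crawl aliwx -o book.json
scrapy crawl aliwx -o book.jl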

So far, one of the most basic crawler projects has been completed. If there is anything wrong with it, please criticize and correct me!
