[Python crawler] 14. The Scrapy framework explained

Preface

In the previous two levels, we learned how coroutines can speed up crawlers, and we put that knowledge into practice by using coroutines to crawl food data from Mint.com.

Perhaps, after experiencing the complete process of developing a crawler project, you felt something like this: it turns out there is a lot of tedious work involved in finishing a complete crawler program.

For example, you need to import modules with different functions and write code for each stage of the crawling process. And depending on the project, the code you have to write each time is different.

I wonder whether you have had this thought: is there a ready-made crawler template we can apply, just like a PPT template? Then we wouldn't have to take care of the crawler's entire workflow; we would only be responsible for filling in the crawler's core logic. If such a thing existed, writing code would be very convenient and painless.

In fact, Python does have such a crawler template, except that it is called a framework.

A crawler framework contains modules covering every step of the crawler workflow, just as a PPT template sets the theme colors and layout for you from the start.

At this level, what we need to learn is Scrapy, a powerful crawler framework.

What is Scrapy

In the past, when we wrote crawlers, we had to import and operate different modules, such as the requests module, the gevent library, and the csv module. With Scrapy, you don't need to do this, because many of the things a crawler has to deal with, such as the troublesome asynchronous handling, are implemented automatically inside the Scrapy framework.

The way we used to write crawlers was like assembling parts one by one to build a car that could run. The Scrapy framework is a ready-made car that has already been built; we just step on the accelerator and it starts running. This saves us time when developing projects.

Next, let’s learn about the basics of Scrapy, including its structure and how it works.

The structure of Scrapy

[Image: the overall structure of Scrapy]
The diagram above shows the entire structure of Scrapy. You can think of the whole Scrapy framework as a crawler company. The Scrapy Engine at the very center is the boss of this company, responsible for coordinating its four major departments; each department takes orders only from it and reports only to it.

I will introduce the four major departments of the Scrapy crawler company in the order of the crawler workflow.

The Scheduler (scheduler) department is mainly responsible for handling the requests objects sent by the engine (that is, the collection of information needed for a web request, including params, data, cookies, request headers, and so on), arranging the requested URLs into an orderly queue, and waiting for the engine to extract them (functionally similar to the queue module of the gevent library).

The Downloader department is responsible for processing requests sent by the engine, crawling web pages, and handing the returned response (crawled content) to the engine. It corresponds to the [obtaining data] step of the crawler process.

The Spiders (crawler) department is the company's core business department. Its main tasks are to create requests objects and to receive the responses sent over by the engine (the content crawled by the Downloader department), then parse them and extract useful data. It corresponds to the [parsing data] and [extracting data] steps of the crawler process.

The Item Pipeline (data pipeline) department is the company's data department, which is only responsible for storing and processing useful data extracted by the Spiders department. This corresponds to the [storage data] step of the crawler process.

Downloader Middlewares acts like the secretary of the Downloader department; for example, it pre-processes the many requests sent over by the engine boss before they reach the Downloader.

Spider Middlewares acts like the secretary of the Spiders department; for example, it receives and pre-processes the responses sent over by the engine boss, filtering out repetitive and useless content.


How Scrapy works

You will find that in the Scrapy crawler company, each department performs its own duties, forming a very efficient operating process.

The logic of this workflow is very simple: whatever the engine boss asks for is the top priority.

[Image: how Scrapy works, with the engine at the center scheduling the other components]
The above figure also shows the working principle of the Scrapy framework - the engine is the center, and other components are scheduled by the engine.

In Scrapy, we don't need to worry about the overall flow of the crawler program, and everything in Scrapy runs in asynchronous mode: all requests and returned responses are automatically dispatched by the engine for processing.

Even if a particular request raises an exception, the program will handle it, skip the failing request, and keep running.

To a certain extent, Scrapy can be said to be a very worry-free crawler framework.

How to use Scrapy

Now you have a preliminary understanding of Scrapy's structure and how it works. Next, to familiarize you with using Scrapy, we will use it to complete a small project: crawling the Douban Top 250 books.


Clarify goals and analysis process

We still follow the three steps of writing code (clarify the goal, analyze the process, implement the code) to complete the project. I will focus on how Scrapy is used in the code implementation step.

First, the goal must be clear. Please be sure to open the following link to Douban's Top 250 books.

https://book.douban.com/top250

The Douban Top 250 books span 10 pages, with 25 books per page. Our goal is to crawl only the first three pages for now, that is, the information for the first 75 books (book titles, publishing information, and ratings).

Next, let's analyze the web page. Since we want to crawl book information, we must first determine where the information is stored.

You should already know how to check this: right-click and open the "Inspect" tool, click Network, refresh the page, and then click the 0th request, top250, to look at its Response.

We can find the book title and publishing information inside, which means that the book information we want is hidden in the HTML of this URL.

After confirming that the book information exists in the HTML of this URL, let's take a closer look at this website.

Click to turn to page 2 of the Douban Top 250 books.

You will notice that the URL has changed, with ?start=25 appended at the end. We guess that this number has something to do with the 25 books per page.

You can turn to page 3 to verify whether our guess is correct.

It turns out we were right: every time you turn a page, the number after the URL increases by 25, which means the start parameter is the offset of the first book on each page (25 books per page).

With this observation, the structural rule of the URLs we want to crawl emerges: as long as we change the number after ?start= (add 25 for each page turned), we get the URL of each page.
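As a quick, purely illustrative check, a small loop reproduces the page URLs we just reasoned about (change range(3) to range(10) if you want all 10 pages):

for x in range(3):
    print('https://book.douban.com/top250?start=' + str(x * 25))
# https://book.douban.com/top250?start=0
# https://book.douban.com/top250?start=25
# https://book.douban.com/top250?start=50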

Now that we have found the structure of the URL, we can focus on analyzing the structure of HTML to see how we can extract the book information we want.

Still, right-click to open the "Inspect" tool, click Elements, then click the cursor icon, and hover over the book title, publishing information, and rating in turn to locate this information in the HTML. As shown in the picture below, all the information for The Kite Runner sits inside a <table width="100%"> tag.

[Image: the Elements panel, showing the information for The Kite Runner inside a <table width="100%"> tag]

Soon, you will find that the information for the 25 books on each page is actually hidden inside <table width="100%"> tags. However, this tag has no class attribute and no id attribute, which makes it inconvenient for us to extract the information.

We have to find another tag that is easy for us to extract and can contain all the book information.

The <tr class="item"> elements under the <table width="100%"> tag meet our requirements exactly: they have a class attribute, and they contain the book information.

As long as we take, from each <tr class="item"> element, the value of the title attribute of its <a> element, its <p class="pl"> element, and its <span class="rating_nums"> element, we can get the book title, publishing information, and rating data.

With the page analysis done, we move on to the code implementation.

Code implementation - create project

From here on, I'll take you through writing our project's crawler with Scrapy. It involves a lot of Scrapy usage, so please read carefully!

If you want to use Scrapy on your own computer, you need to install it first. (Installation: on Windows, enter pip install scrapy in the terminal; on Mac, enter pip3 install scrapy, then press Enter.)

First, open the terminal on your computer (Windows: Win+R, then type cmd; Mac: command+space, then search for "Terminal"), and jump to the directory where you want to save the project.

Suppose you want to jump to the Pythoncode subfolder of the Python folder on drive E. You would enter e: on the command line to switch to drive E, then cd Python to enter the Python folder, and then cd Pythoncode to enter the Pythoncode subfolder.
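Assuming the target path really is E:\Python\Pythoncode, the terminal commands would look like this:

e:
cd Python
cd Pythoncode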

Then enter one more command, which creates a Scrapy project for us: scrapy startproject douban, where douban is the name of the Scrapy project. Press Enter and the Scrapy project is created.


The structure of the entire Scrapy project is shown below:

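For reference, a project generated with scrapy startproject douban typically has a layout like the sketch below (the exact files can vary slightly with your Scrapy version):

douban/
    scrapy.cfg            # project configuration file
    douban/
        __init__.py
        items.py          # define the data to crawl
        middlewares.py    # downloader / spider middlewares
        pipelines.py      # process and store the data
        settings.py       # project settings
        spiders/          # directory for the crawler files
            __init__.py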
Each file in the Scrapy project has a specific role. For example, settings.py holds Scrapy's various settings; items.py is used to define the data, and pipelines.py is used to process the data, corresponding to the Item Pipeline (data pipeline) in Scrapy's structure.

You may not understand them right now, and that's okay; things will become clearer bit by bit. Let's go through them.

Code implementation - editing crawler

As mentioned before, spiders is the directory where the crawlers live. We can create a crawler file inside the spiders folder; let's name it top250.py. Most of the code that follows is written in this top250.py file.

First import the modules we need in the top250.py file.

import scrapy
import bs4

We import bs4 so that we can use BeautifulSoup to parse and extract data. This shouldn't need much explanation; you became very familiar with it in levels 2 and 3.

We import scrapy because the crawler we are about to write will be a class that directly inherits from the scrapy.Spider class. That way, many useful properties and methods become available for us to use directly.

Then we start writing the core code of the crawler.

In Scrapy, the code structure of each crawler is basically as follows:

class DoubanSpider(scrapy.Spider):
    name = 'douban'
    allowed_domains = ['book.douban.com']
    start_urls = ['https://book.douban.com/top250?start=0']
    
    def parse(self, response):
        print(response.text)

Line 1 of code: Define a crawler class DoubanSpider. Like I just said, the DoubanSpider class inherits from the scrapy.Spider class.

Line 2 of code: name is the name that defines the crawler. This name is the unique identifier of the crawler. name = 'douban' means to define the name of the crawler as douban. We will use this name when we start the crawler later.

Line 3 of code: allowed_domains defines the URL domain names that are allowed to be crawled by crawlers (no need to add https://). If the domain name of the URL is not in this list, it will be filtered out.

Why have this setting? When you crawl large amounts of data, you often start from one URL and then follow links to crawl more pages. For example, suppose today's goal were not the book information but the reviews of the Douban Top 250 books: we would first crawl the book list, then find each book's URL, and then enter each book's detail page to crawl the comments.

With allowed_domains set, the URLs we follow along the way must be under the domain book.douban.com, so the crawler will not jump to some strange advertising page.

Line 4 of code: start_urls defines the starting URL, which is the URL from which the crawler starts crawling. Here, the setting of allowed_domains will not affect the URLs in start_urls.

Line 6 of code: parse is Scrapy's default method for handling a response (the word parse means to analyze and extract).

You may be wondering: isn't a line like requests.get() missing here? Indeed, we don't need to write it; the Scrapy framework does this for us. Once the requests are defined, you can go straight to writing how the response should be handled. I'll give an example of this later.
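If you are curious, here is a simplified sketch of what the framework does for us by default (you do not need to write this yourself): it builds a request for each URL in start_urls and registers parse as the callback for the response.

import scrapy

class DoubanSpider(scrapy.Spider):
    name = 'douban'
    start_urls = ['https://book.douban.com/top250?start=0']

    def start_requests(self):
        # Simplified sketch of Scrapy's default behavior:
        # build a request for each starting URL and hand it to the engine.
        for url in self.start_urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        print(response.text)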

After understanding the basic structure of the crawler code, we continue to improve the code for crawling the Douban Top 250 books.

The Douban Top 250 books span 10 pages, and we know the URL of every page. We could choose to stuff all 10 page URLs into the start_urls list.

But that approach is not elegant, and if you wanted to crawl hundreds of URLs and stuff them all into the start_urls list, the code would become very long.

In fact, we can use the URL rule of the Douban Top 250 books to construct each URL with a for loop and then append it to the start_urls list. The code looks much better this way.

The improved code is as follows:

class DoubanSpider(scrapy.Spider):
    name = 'douban'
    allowed_domains = ['book.douban.com']
    start_urls = []
    for x in range(3):
        url = 'https://book.douban.com/top250?start=' + str(x * 25)
        start_urls.append(url)

We only crawl the first 3 pages of Douban Top 250 book information.

Next, we just need to use the parse method to handle the response and use BeautifulSoup to extract the book information we want, and the code will be complete.

When we analyzed the page earlier, we already worked out which elements the book information is hidden in. Now we can extract it with the find_all and find methods. For example, the book title is the value of the title attribute of the <a> element under the <tr class="item"> element; the publishing information is in the <p class="pl"> element; and the rating is in the <span class="rating_nums"> element.

Based on past knowledge, we might write the code like this:

import scrapy
import bs4

class DoubanSpider(scrapy.Spider):
# Define a crawler class, DoubanSpider.
    name = 'douban'
    # Define the crawler's name as douban.
    allowed_domains = ['book.douban.com']
    # Define the domain names the crawler is allowed to crawl.
    start_urls = []
    # Define the starting URLs.
    for x in range(3):
        url = 'https://book.douban.com/top250?start=' + str(x * 25)
        start_urls.append(url)
        # Add the first 3 pages of the Douban Top 250 books to start_urls.

    def parse(self, response):
    # parse is the default method for handling the response.
        bs = bs4.BeautifulSoup(response.text, 'html.parser')
        # Parse the response with BeautifulSoup.
        datas = bs.find_all('tr', class_="item")
        # Use find_all to extract the <tr class="item"> elements, which contain the book information.
        for data in datas:
        # Loop over datas.
            title = data.find_all('a')[1]['title']
            # Extract the book title.
            publish = data.find('p', class_='pl').text
            # Extract the publishing information.
            score = data.find('span', class_='rating_nums').text
            # Extract the rating.
            print([title, publish, score])
            # Print the information above.

In the past, we would assign the book title, publishing information, and rating to separate variables and then process them together, either printing or storing them. Here in Scrapy, things are different.

Spiders (such as top250.py) only do what spiders should do; another component is responsible for the subsequent handling of the data.

Code implementation - defining data

In Scrapy, we define a dedicated class for recording data.

Every time we want to record data, for example in each innermost loop iteration where we record a book title, its publishing information, and its rating, we instantiate an object and use that object to hold the data.

Each time an object finishes recording its data, it leaves the spiders and travels to the Scrapy Engine, which sends it to the Item Pipeline for processing.

The py file that defines this class is items.py.

We already know that the data we want to crawl is the book title, publishing information, and rating. Let's look at how to define these in items.py. The code is as follows:

import scrapy
# Import scrapy.
class DoubanItem(scrapy.Item):
# Define a class DoubanItem, which inherits from scrapy.Item.
    title = scrapy.Field()
    # Define the data field for the book title.
    publish = scrapy.Field()
    # Define the data field for the publishing information.
    score = scrapy.Field()
    # Define the data field for the rating.

In the first line of code, we import scrapy. This is because the class we are about to create will directly inherit from the scrapy.Item class, so that many useful properties and methods are available to us directly; for example, later on the engine can send item objects of this class to the Item Pipeline (data pipeline) for processing.

Line 3 of code: We define a DoubanItem class. It inherits from scrapy.Item class.

Lines 5, 7, and 9 of code: we define three kinds of data: book title, publishing information, and rating. What scrapy.Field() does is let the data be recorded in a dictionary-like form. You may not quite understand what that means yet, and that's okay. Let me show you what it feels like:

import scrapy
# Import scrapy.
class DoubanItem(scrapy.Item):
# Define a class DoubanItem, which inherits from scrapy.Item.
    title = scrapy.Field()
    # Define the data field for the book title.
    publish = scrapy.Field()
    # Define the data field for the publishing information.
    score = scrapy.Field()
    # Define the data field for the rating.

book = DoubanItem()
# Instantiate a DoubanItem object.
book['title'] = '海边的卡夫卡'
book['publish'] = '[日] 村上春树 / 林少华 / 上海译文出版社 / 2003'
book['score'] = '8.1'
print(book)
print(type(book))

The output:

{'publish': '[日] 村上春树 / 林少华 / 上海译文出版社 / 2003',
 'score': '8.1',
 'title': '海边的卡夫卡'}
<class '__main__.DoubanItem'>

You will see that the printed result does look very much like a dictionary, but it is not a dict: its type is the DoubanItem we defined, a kind of "custom Python dictionary". We can rewrite top250.py in a style similar to the code above, as follows:

import scrapy
import bs4
from ..items import DoubanItem
# We need to import DoubanItem, which is defined in items.py. Because items.py sits in the
# directory one level above top250.py, we write ..items; this is a fixed usage (a relative import).

class DoubanSpider(scrapy.Spider):
# Define a crawler class, DoubanSpider.
    name = 'douban'
    # Define the crawler's name as douban.
    allowed_domains = ['book.douban.com']
    # Define the domain names the crawler is allowed to crawl.
    start_urls = []
    # Define the starting URLs.
    for x in range(3):
        url = 'https://book.douban.com/top250?start=' + str(x * 25)
        start_urls.append(url)
        # Add the first 3 pages of the Douban Top 250 books to start_urls.

    def parse(self, response):
    # parse is the default method for handling the response.
        bs = bs4.BeautifulSoup(response.text, 'html.parser')
        # Parse the response with BeautifulSoup.
        datas = bs.find_all('tr', class_="item")
        # Use find_all to extract the <tr class="item"> elements, which contain the book information.
        for data in datas:
        # Loop over datas.
            item = DoubanItem()
            # Instantiate the DoubanItem class.
            item['title'] = data.find_all('a')[1]['title']
            # Extract the book title and store it in the item's title field.
            item['publish'] = data.find('p', class_='pl').text
            # Extract the publishing information and store it in the item's publish field.
            item['score'] = data.find('span', class_='rating_nums').text
            # Extract the rating and store it in the item's score field.
            print(item['title'])
            # Print the book title.
            yield item
            # yield item hands the item we obtained over to the engine.

On line 3, we need to import DoubanItem, which lives in items.py. Because items.py is in the directory one level above top250.py, we write ..items; this is a fixed usage (a relative import).

Every time we want to record data, for example in each innermost loop iteration where we record a book title, its publishing information, and its rating, we instantiate an item object and use it to hold the data.

Each time an item finishes recording its data, it leaves the spiders and travels to the Scrapy Engine, which sends it to the Item Pipeline for processing. This is what the yield statement is used for here.

You may not know much about the yield statement yet. For now, you can simply understand it as being somewhat similar to return, with the difference that it does not end the function and can hand back values multiple times.
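To get a quick feel for yield, here is a tiny, unrelated example (the function count_up_to is invented purely for illustration):

def count_up_to(n):
    # yield hands a value back to the caller but does not end the function;
    # execution resumes right after the yield the next time a value is requested.
    for i in range(1, n + 1):
        yield i

for number in count_up_to(3):
    print(number)   # prints 1, then 2, then 3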

[Image: the data flow in Scrapy: Spiders → Engine → Scheduler → Engine → Downloader → Engine → Spiders → Engine → Item Pipeline]
If we present the running process visually, it looks like the picture above: the Spiders wrap the Douban URLs into requests objects, and the engine takes these requests objects from the Spiders and hands them to the Scheduler, which sorts and queues them.

The engine then sends the requests objects processed by the Scheduler to the Downloader, which immediately crawls as instructed and returns the responses to the engine.

Then the engine sends the response back to the Spiders. At this point the crawler starts its default parse method to handle the response, parses and extracts the book information, records it with an item, and returns the item to the engine, which sends it on to the Item Pipeline (data pipeline) for processing.

Code implementation - modifying settings

At this point, we have written the crawler code. In actual operation, however, it may still report an error.

The reason is that Scrapy's default settings have not been modified. For example, we need to modify the request header. Open the settings.py file and you can find the following default settings inside:

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'douban (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = True

Please uncomment USER_AGENT (delete the #), and then replace the content of the user-agent, which is how we modify the request header.

And because Scrapy complies with the robots protocol by default, it will not crawl content that the robots protocol forbids, so we also have to modify this default setting in Scrapy.

Changing ROBOTSTXT_OBEY = True to ROBOTSTXT_OBEY = False means switching from complying with the robots protocol to not complying with it, so that Scrapy can run without that restriction.

The modified code should look like this:

# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

Now we have written the spider and modified the settings. Everything is ready; all that's left is to run Scrapy.

Code implementation - running

There are two ways to run Scrapy. One is to jump to the Scrapy project's folder in your computer's terminal (with cd followed by the folder path), and then enter the command scrapy crawl douban (douban is the name of our crawler).
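Assuming the project was created under E:\Python\Pythoncode as in the example above, the terminal session would look roughly like this:

cd E:\Python\Pythoncode\douban
scrapy crawl douban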

The other way requires us to create a new main.py file in the outermost folder (at the same level as scrapy.cfg).


We only need to enter the following code in the main.py file, click Run, and the Scrapy program will start.

from scrapy import cmdline
# Import the cmdline module, which lets us run terminal commands from our code.
cmdline.execute(['scrapy','crawl','douban'])
# Use the execute() method to issue the command that runs Scrapy.

Line 1 of code: Scrapy has a module called cmdline that can control the terminal command line. After importing this module, we can issue terminal commands from our code.

Line 3 of code: the cmdline module has an execute method that can run a terminal command line, but it needs the command passed in as a list. To run scrapy crawl douban, we write it as ['scrapy','crawl','douban'].
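As a small convenience (not required), splitting the command string produces the same list:

from scrapy import cmdline
# 'scrapy crawl douban'.split() evaluates to ['scrapy', 'crawl', 'douban']
cmdline.execute('scrapy crawl douban'.split())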

At this point, we have finished learning how to use Scrapy.

It is worth mentioning that in this level, for ease of teaching and understanding, we wrote the crawler first and then defined the data. In actual projects, the order is usually the opposite: define the data first, then write the crawler. So the flow chart should look like this:

[Image: the project flow: define the data first, then write the crawler]
If you read carefully, you may have noticed that this level does not cover the step of storing the data.

Right: storing data requires modifying the pipelines.py file. The content of this level is already quite full, so we will leave that knowledge point to the next level.

Review

Finally, let's review the key knowledge of this level.

The structure of Scrapy——

[Image: review of Scrapy's structure: the engine at the center, coordinating the Scheduler, Downloader, Spiders, and Item Pipeline]
How Scrapy works——

[Image: review of how Scrapy works: the engine dispatches all requests and responses among the components, asynchronously]
How to use Scrapy——

[Image: review of how to use Scrapy: create the project, edit the spider, define the data, modify the settings, then run]
In the next level, we will use Scrapy to implement a bigger project: crawling recruitment information from popular companies.

See you in the next level~


Source: blog.csdn.net/qq_41308872/article/details/132665268