Python web crawler frameworks: using the Scrapy crawler framework


1. Introduction

  • Personal homepage: ζ Xiaocaiji
  • Hello everyone, I'm Xiaocaiji. Let's learn how to use Python's web crawler framework, the Scrapy crawler framework.
  • If this article helps you, please follow, like, and bookmark (one-click triple support).

2. Set up the Scrapy crawler framework

   The Scrapy crawler framework depends on many libraries; on the Windows system in particular, it requires at least Twisted, lxml, pyOpenSSL, and pywin32. The specific steps to set up the Scrapy crawler framework are as follows:

1. Install the Twisted module

   (1) Open the URL https://pypi.org/project/Twisted/#files, as shown in the figure below:

[Figure: Twisted download page on PyPI]


   (2) Download the Twisted module file, as shown in the figure below:

[Figure: downloading the Twisted .whl file]


   (3) After the "Twisted-22.10.0-py3-none-any.whl" binary file has been downloaded, run a command prompt window as administrator, use the cd command to change to the directory where the "Twisted-22.10.0-py3-none-any.whl" file is located, and finally enter "pip install Twisted-22.10.0-py3-none-any.whl" in the window to install the Twisted module, as shown in the figure below:

[Figure: installing the Twisted .whl file with pip in the command prompt]


2. Install the Scrapy framework

   (1) Open the command line window, and then enter the "pip install Scrapy" command to install the Scrapy framework, as shown in the following figure:

[Figure: installing the Scrapy framework with pip]


   (2) After the installation is complete, enter "scrapy" on the command line. If no exception or error message appears, the Scrapy framework has been installed successfully, as shown below:

[Figure: output of the "scrapy" command after a successful installation]

Note: During the installation of the Scrapy framework, the lxml module and the pyOpenSSL module will also be installed in the Python environment.


3. Install the pywin32 module

  Open a command window and enter "pip install pywin32". If no error message appears, the installation was successful.
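
  If you want to confirm that everything is in place before moving on, a minimal check (assuming the installation steps above all completed without errors) is to import each dependency from Python; any import that fails points to the module that was not installed correctly:

import scrapy    # the Scrapy framework itself
import lxml      # installed together with Scrapy
import OpenSSL   # the import name of the pyOpenSSL module
import twisted   # the Twisted module installed from the .whl file
import win32api  # provided by the pywin32 module (Windows only)

print(scrapy.__version__)  # print the installed Scrapy version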


3. Create a Scrapy project

  Create a project folder at any path, for example "D:\python". Run a command-line window in that folder and enter "scrapy startproject scrapyDemo" to create a project named "scrapyDemo", as shown in the figure below:

[Figure: output of "scrapy startproject scrapyDemo"]


  To improve development efficiency, Xiaocaiji uses the third-party PyCharm development tool to open the newly created scrapyDemo project. After the project is opened, the project's directory structure on the left shows the content in the figure below:

[Figure: directory structure of the scrapyDemo project in PyCharm]
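
  For reference, the project generated by "scrapy startproject scrapyDemo" usually has a layout like the following (the exact file list can vary slightly between Scrapy versions):

scrapyDemo/
    scrapy.cfg            # deployment configuration file
    scrapyDemo/           # the project's Python module
        __init__.py
        items.py          # Item definitions for structured data
        middlewares.py    # project middlewares
        pipelines.py      # item pipelines
        settings.py       # project settings
        spiders/          # folder where crawler modules are placed
            __init__.py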


4. Create a crawler

   To create a crawler, first create a crawler module file and place it in the spiders folder. A crawler module is a class used to crawl data from one or more websites, and it must inherit from the scrapy.Spider class. The following example crawler saves the crawled web pages into the project folder as HTML files. The sample code is as follows:

import scrapy  # import the framework


class QuotesSpider(scrapy.Spider):
    name = "quotes"  # define the crawler name

    def start_requests(self):
        # set the target URLs to crawl
        urls = [
            "http://quotes.toscrape.com/page/1/",
            "http://quotes.toscrape.com/page/2/",
        ]
        # iterate over all URLs, sending one request per URL
        for url in urls:
            # send the network request
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # get the page identifier from the URL
        page = response.url.split("/")[-2]

        # build the file name from the page number
        filename = "quotes-%s.html" % page
        # open the file in write mode; it will be created if it does not exist
        with open(filename, "wb") as f:
            f.write(response.body)
        # log the name of the saved file
        self.log("Saved file %s" % filename)

   To run a crawler project created with Scrapy, enter "scrapy crawl quotes" in the command window, where "quotes" is the crawler name you defined. Since Xiaocaiji uses the third-party PyCharm development tool, the command is entered in the "Terminal" window at the bottom of PyCharm. After it runs, the information shown in the figure below is displayed:

[Figure: output of running "scrapy crawl quotes"]


Note: In addition to entering the "scrapy crawl quotes" command in a command window, Scrapy also provides an API for starting a crawler from within a program: the CrawlerProcess class. First, pass in the project's settings when initializing CrawlerProcess, then pass the crawler name to the crawl() method, and finally start the crawler with the start() method. The code is as follows:

# import the CrawlerProcess class
from scrapy.crawler import CrawlerProcess
# get the project settings
from scrapy.utils.project import get_project_settings

# program entry point
if __name__ == "__main__":
    # create a CrawlerProcess object, passing in the project settings
    process = CrawlerProcess(get_project_settings())
    # set the name of the crawler to start
    process.crawl("quotes")
    # start the crawler
    process.start()
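
  To use this approach, the script can be saved as a file in the project's root directory (the folder containing scrapy.cfg), for example as main.py, and run with "python main.py"; the file name is only an illustration. Running it from inside the project is what allows get_project_settings() to locate the project's settings.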


5. Get data

   The Scrapy crawler framework can select a specific part of an HTML file with a CSS or XPath expression and extract the corresponding data. CSS (Cascading Style Sheets) is used to control the layout, fonts, colors, backgrounds, and other effects of HTML pages; here its selectors are used to locate elements. XPath is a language for finding information in an XML document based on its elements and attributes.

1. Extract data with CSS

  When using CSS to extract data from an HTML file, you can specify a tag name in the HTML file. For example, to get the title tag of the web page in the example above, you can use the following code:

response.css("title").extract()

  The obtained result is as shown in the figure:

[Figure: result of response.css("title").extract()]

Note: The returned content is a list of nodes matching the CSS expression, so when extracting the data of a single tag, you can use either of the following:

response.css("title::text").extract_first()

  or

response.css("title")[0].extract()

2. Extract data with XPath

  When using an XPath expression to extract data from an HTML file, the data must be located according to the syntax of the XPath expression. For example, to obtain the text inside the title tag, the following code can be used:

response.xpath("//title/text()").extract_first()

  The following example uses XPath expressions to obtain several pieces of information from the page in the example above. The sample code is as follows:

    def parse(self, response):
        # get the information
        for quote in response.xpath(".//*[@class='quote']"):
            # get the quote text
            text = quote.xpath(".//*[@class='text']/text()").extract_first()
            # get the author
            author = quote.xpath(".//*[@class='author']/text()").extract_first()
            # get the tags
            tags = quote.xpath(".//*[@class='tag']/text()").extract()
            # print the information as a dictionary
            print(dict(text=text, author=author, tags=tags))
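
  The same information can also be collected with CSS selectors instead of XPath. A rough equivalent of the method above, assuming the usual markup of the quotes.toscrape.com pages (each quote in a div with class "quote", the text in an element with class "text", the author in an element with class "author", and the tags in elements with class "tag"), might look like this:

    def parse(self, response):
        # get every quote block with a CSS selector
        for quote in response.css("div.quote"):
            # get the quote text
            text = quote.css(".text::text").extract_first()
            # get the author
            author = quote.css(".author::text").extract_first()
            # get the tags
            tags = quote.css(".tag::text").extract()
            # print the information as a dictionary
            print(dict(text=text, author=author, tags=tags))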

3. Extract data by turning pages

  The example above obtains the data of a single web page. To obtain the information of the whole site, the page-turning feature is needed. For example, to get the author names from the entire site in the example above, the following code could be used:

    def parse(self, response):
        # get every quote block (the CSS equivalent would be div.quote)
        for quote in response.xpath(".//*[@class='quote']"):
            # get the author
            author = quote.xpath(".//*[@class='author']/text()").extract_first()
            print(author)  # print the author name

        # turn to the next page
        for href in response.css("li.next a::attr(href)"):
            yield response.follow(href, self.parse)
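
  An equivalent and commonly used variant is to extract the next-page link explicitly and only follow it when it exists. A sketch of this form:

    def parse(self, response):
        # get every quote block and print its author
        for quote in response.xpath(".//*[@class='quote']"):
            author = quote.xpath(".//*[@class='author']/text()").extract_first()
            print(author)

        # extract the href of the "next" link, if there is one
        next_page = response.css("li.next a::attr(href)").extract_first()
        if next_page is not None:
            # follow the relative link and parse the next page with the same method
            yield response.follow(next_page, callback=self.parse)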

4. Create Items

  Crawling web page data means extracting structured data from unstructured data sources. For example, in the parse() method of the QuotesSpider class, the text, author, and tags information can be obtained. To package these values as structured data, Scrapy provides the Item class. An Item object is a simple container for storing crawled data; it offers a dictionary-like API and a convenient syntax for declaring its fields. Items are declared using a simple class-definition syntax together with Field objects. When the scrapyDemo project was created, an items.py file was automatically generated in the project's directory structure; it is used to define the Item class that stores the data, and this class must inherit from scrapy.Item. The sample code is as follows:

import scrapy


class ScrapydemoItem(scrapy.Item):
    # define the fields for your item here like:
    # the quote text
    text = scrapy.Field()
    # the author
    author = scrapy.Field()
    # the tags
    tags = scrapy.Field()
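
  An Item object can then be used much like a dictionary. A small sketch (the field values here are made up purely for illustration):

item = ScrapydemoItem(text="An example quote.", author="Somebody", tags=["example"])
print(item["author"])  # read a field with dictionary-style access
print(dict(item))      # convert the item to an ordinary dictionary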

  After the Item is created, go back to the crawler code, create an Item object in the parse() method (the ScrapydemoItem class first needs to be imported from the project's items module), and then yield the item. The code is as follows:

    def parse(self, response):
        # get the information
        for quote in response.xpath(".//*[@class='quote']"):
            # get the quote text
            text = quote.xpath(".//*[@class='text']/text()").extract_first()
            # get the author
            author = quote.xpath(".//*[@class='author']/text()").extract_first()
            # get the tags
            tags = quote.xpath(".//*[@class='tag']/text()").extract()
            # wrap the information in an Item and yield it
            item = ScrapydemoItem(text=text, author=author, tags=tags)
            yield item  # output the item
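
  Because the parse() method now yields items instead of printing them, the crawled data can also be written to a file with Scrapy's feed export feature, for example by running "scrapy crawl quotes -o quotes.json" in the command window; the file name quotes.json is only an example, and other formats such as CSV and XML are supported as well.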

Note: Because the Scrapy crawler framework covers a great deal of material, this article only briefly introduces installing the framework, creating a project and a crawler, and extracting data. For detailed tutorials, please refer to the official Scrapy documentation ( https://docs.scrapy.org/en/latest/ ).


  This concludes the introduction to using Python's web crawler framework, the Scrapy crawler framework. Thank you for reading. If this article helps you, please follow, like, and bookmark (one-click triple support).



Source: https://blog.csdn.net/weixin_45191386/article/details/131620901