How to use Scrapy to build an efficient crawler

As the amount of data on the Internet keeps growing, so does the need to collect large amounts of it, and crawler programs have become one of the most practical ways to meet that need. Scrapy, an excellent Python crawler framework, is efficient, stable and easy to use, and is widely applied in many fields. This article introduces how to use Scrapy to build an efficient crawler program, with code examples.

  1. The basic structure of a crawler

Scrapy's crawler program is mainly composed of the following components:

  • Spider (the crawler): defines which pages to crawl, how to parse data out of them, which links to follow, and so on.
  • Item pipeline: processes the data the spider extracts from pages and handles follow-up work such as storing it in a database or exporting it to a file.
  • Downloader middleware: handles outgoing requests and incoming page content, and can perform operations such as setting the User-Agent or switching proxy IPs (a minimal sketch follows this list).
  • Scheduler: manages all pending requests and schedules them according to a given strategy.
  • Downloader: downloads the requested page content and returns it to the spider.
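
As an illustration of the downloader middleware component, the sketch below sets a random User-Agent header on every outgoing request. It is only a minimal sketch: the class name, the USER_AGENTS list and the priority value are illustrative assumptions, not something Scrapy ships with.

import random

# Minimal downloader middleware sketch: rotate the User-Agent header.
# The class name and the USER_AGENTS list are illustrative only.
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)',
]

class RandomUserAgentMiddleware:
    def process_request(self, request, spider):
        # Called for every request before it is sent to the downloader.
        request.headers['User-Agent'] = random.choice(USER_AGENTS)
        return None  # returning None lets Scrapy continue processing the request

It would then be enabled in settings.py with something like DOWNLOADER_MIDDLEWARES = {'myspider.middlewares.RandomUserAgentMiddleware': 543}.
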
  2. Write a crawler

To write a crawler with Scrapy, we first need to create a new crawler project. Run the following command on the command line:

scrapy startproject myspider

This creates a project folder called "myspider" containing some default files and folders. We can then go into that folder and generate a new spider:

cd myspider
scrapy genspider example example.com

This creates a spider called "example" for scraping data from the "example.com" domain. We can write the specific crawling logic in the generated "example.py" file under the project's "spiders" directory.
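
The generated file contains a spider skeleton roughly like the following (the exact template differs slightly between Scrapy versions):

import scrapy

class ExampleSpider(scrapy.Spider):
    name = 'example'
    allowed_domains = ['example.com']
    start_urls = ['http://example.com/']

    def parse(self, response):
        # parsing logic goes here
        pass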

Below is a simple example for crawling news headlines and links on a website.

import scrapy

class ExampleSpider(scrapy.Spider):
    name = 'example'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com/news']

    def parse(self, response):
        for news in response.xpath('//div[@class="news-item"]'):
            yield {
                'title': news.xpath('.//h2/text()').get(),
                'link': news.xpath('.//a/@href').get(),
            }
        next_page = response.xpath('//a[@class="next-page"]/@href').get()
        if next_page:
            yield response.follow(next_page, self.parse)

In the above code, we define a spider class named "ExampleSpider" with three attributes: name is the spider's name, allowed_domains lists the domains the spider is allowed to crawl, and start_urls holds the starting URLs. We then override the parse method, which parses the page content, extracts the news titles and links, and returns the results with yield.
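
While writing XPath expressions like the ones above, Scrapy's interactive shell is a convenient way to try them against a live page before putting them into the spider. The URL below is just the article's example address, not a real news page:

scrapy shell "http://www.example.com/news"
>>> response.xpath('//div[@class="news-item"]//h2/text()').getall()
>>> response.xpath('//a[@class="next-page"]/@href').get()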

  3. Configure the item pipeline

In Scrapy, the scraped data is passed through the item pipeline, where it can be stored in a database, written to a file, or processed further in some other way.

Open the "settings.py" file in the project folder, find the ITEM_PIPELINES configuration item in it, and uncomment it. Then add the following code:

ITEM_PIPELINES = {
    'myspider.pipelines.MyPipeline': 300,
}

This enables the custom pipeline class "myspider.pipelines.MyPipeline" and assigns it a priority (the lower the number, the earlier the pipeline runs).
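
For illustration, if the project later added a second pipeline class (ExportPipeline here is purely hypothetical), the numbers would decide the order in which each item passes through them, from lowest to highest:

ITEM_PIPELINES = {
    'myspider.pipelines.MyPipeline': 300,      # runs first
    'myspider.pipelines.ExportPipeline': 800,  # hypothetical later stage
}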

Next, we need a pipeline class to process the data. Open the "pipelines.py" file in the project folder (scrapy startproject creates it with a placeholder class) and add the following code:

import json

class MyPipeline:

    def open_spider(self, spider):
        self.file = open('news.json', 'w')

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        line = json.dumps(dict(item)) + "\n"
        self.file.write(line)
        return item

In this example, we define a pipeline class called "MyPipeline" with three methods: open_spider, close_spider and process_item. In the open_spider method, we open a file to store data. In the close_spider method, we close the file. In the process_item method, we convert the data into JSON format and write it to a file.
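
As a side note, for a simple case like this Scrapy's built-in feed exports can write the scraped items to a JSON file without any custom pipeline, by adding an output option to the crawl command shown in the next section:

scrapy crawl example -o news.json

The custom pipeline is still the right place for extra work such as cleaning the data or storing it in a database.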

  4. Run the crawler

After writing the spider and the item pipeline, we can run the crawler by executing the following command on the command line:

scrapy crawl example

This starts the spider named "example" and begins scraping data; the crawled items are processed in the way we defined in the pipeline class.
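
Since the goal is an efficient crawler, a few standard options in settings.py that control concurrency and politeness are also worth knowing. The values below are only illustrative starting points to tune for the target site, not recommendations from the original article:

# settings.py -- illustrative values only
CONCURRENT_REQUESTS = 32             # total concurrent requests
CONCURRENT_REQUESTS_PER_DOMAIN = 8   # concurrency per domain
DOWNLOAD_DELAY = 0.25                # delay (seconds) between requests to the same site
AUTOTHROTTLE_ENABLED = True          # adapt the delay to observed latency
HTTPCACHE_ENABLED = True             # cache responses while developing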

The above covers the basic workflow and sample code for building an efficient crawler with Scrapy. Scrapy also provides many other features and options, which can be adjusted and extended according to specific needs. I hope this article helps readers understand and use Scrapy and build efficient crawler programs of their own.
