A super simple introduction to the Scrapy crawler framework

After you have been writing web crawlers for a while, you will find that doing everything by hand involves too many repetitive chores; it is often more convenient to rely on a framework. So once you start writing crawler programs, you will gradually come into contact with crawler frameworks, which improve efficiency and are easy to extend. Below I use the Scrapy crawler framework to record my own learning process, for your reference and correction.


1. Installation

$ pip install scrapy
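
If the installation succeeded, the scrapy command-line tool should now be available on your PATH; you can confirm which version was installed with:

$ scrapy version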

2. Create a crawler project

$ scrapy startproject wikiSpider

3. The crawler project directory structure

The directory structure of the wikiSpider project folder is as follows:

scrapy.cfg
- wikiSpider
    - __init__.py
    - items.py
    - pipelines.py
    - settings.py
    - spiders
        - __init__.py

4. Define the data fields that need to be crawled

We are going to crawl each page's title. In the items.py file, define an Article class and write the following code:

from scrapy import Item, Field

class Article(Item):
    title = Field()
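
An Item behaves much like a Python dictionary whose only allowed keys are the declared fields; assigning a key that was not declared raises a KeyError. A quick sketch of how the Article item can be used (the values here are purely illustrative):

article = Article()
article['title'] = 'Main Page'   # fill in a declared field
print(article['title'])          # read it back like a dict entry
# article['url'] = '...'         # would raise KeyError: 'url' is not a declared field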

5. Create a crawler file

To create a spider, we need to add an articleSpider.py file in the wikiSpider/wikiSpider/spiders/ folder.
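
(As an aside, Scrapy can also scaffold a skeleton spider for you with the genspider command, run from the project's top-level directory; here we simply create the file by hand.)

$ scrapy genspider article en.wikipedia.org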

In the newly created articleSpider.py file, write the following code:

from scrapy.selector import Selector
from scrapy import Spider
from wikiSpider.items import Article

class ArticleSpider(Spider):
    name = 'article'                          # the name used by "scrapy crawl"
    allowed_domains = ["en.wikipedia.org"]    # links outside this domain are not followed
    start_urls = ['http://en.wikipedia.org/wiki/Main_Page',
                  'http://en.wikipedia.org/wiki/Python_%28programming_language%29']

    def parse(self, response):
        # take the text of the first <h1> on the page as the article title
        item = Article()
        title = response.xpath('//h1/text()')[0].extract()
        print("Title is: " + title)
        item['title'] = title
        return item
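
For reference, recent Scrapy versions also accept the CSS selector shorthand and the .get() helper, so the same extraction could be sketched as follows (an equivalent variant, not something the rest of this article depends on):

    def parse(self, response):
        item = Article()
        # '::text' selects the text nodes inside the matching <h1>
        item['title'] = response.css('h1::text').get()
        yield item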

Note that the name of this class (ArticleSpider) is different from the name of the project (wikiSpider): this class is just one member of the wikiSpider project, and it is used only to collect Wikipedia article pages.

6. Run the crawler

You can run ArticleSpider from the top-level wikiSpider directory (the one that contains scrapy.cfg) with the following command:

$ scrapy crawl article

This command invokes the spider by the name defined in the ArticleSpider class ('article'). The crawler first visits the two pages listed in start_urls, collects the information, and then stops.
Scrapy can save the collected information in different output formats, such as CSV, JSON, or XML; the corresponding commands are as follows:

$ scrapy crawl article -o articles.csv -t csv 
$ scrapy crawl article -o articles.json -t json 
$ scrapy crawl article -o articles.xml -t xml
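
In recent Scrapy releases the -t flag is deprecated: the output format is inferred from the file extension, and -O can be used instead of -o to overwrite the output file rather than append to it. Assuming such a version, the equivalent commands would look like this:

$ scrapy crawl article -o articles.json
$ scrapy crawl article -O articles.json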

Of course, you can also customize the Item object and write the results to whatever file or database you need; just add the corresponding code to the spider's parse method or to an item pipeline, as sketched below. If you found this content helpful, share it with more friends and improve your programming skills together.
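
As a minimal sketch of the pipeline approach (the TitleFilePipeline class name and the titles.txt filename are just illustrative assumptions), pipelines.py could append every collected title to a text file:

# wikiSpider/pipelines.py
class TitleFilePipeline:
    def open_spider(self, spider):
        # open the output file once, when the spider starts
        self.file = open('titles.txt', 'a', encoding='utf-8')

    def process_item(self, item, spider):
        # write each collected title on its own line
        self.file.write(item['title'] + '\n')
        return item

    def close_spider(self, spider):
        self.file.close()

The pipeline then has to be enabled in settings.py:

ITEM_PIPELINES = {
    'wikiSpider.pipelines.TitleFilePipeline': 300,
}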
