Getting started with the Python crawler framework scrapy

Learning objectives:

  1. Master the installation of scrapy

  2. Learn to create a scrapy project

  3. Learn to create a scrapy crawler

  4. Learn to run a scrapy crawler

  5. Learn the methods scrapy uses to locate and extract data or attribute values

  6. Master the common attributes of the response object

1. Install scrapy

Command: sudo apt-get install scrapy or: pip/pip3 install scrapy
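To confirm that the installation succeeded, the installed version can be checked from the command line:

scrapy version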

2. Scrapy project development process

  1. Create project: scrapy startproject mySpider

  2. Generate a crawler: scrapy genspider itcast itcast.cn

  3. Extract data: implement the data collection in the spider according to the structure of the website

  4. Save data: Use pipeline for subsequent data processing and storage

3. Create Project

The scrapy project files are generated by commands, and the subsequent steps are operations inside those project files. The following example crawls the itcast teacher list to learn the introductory use of scrapy.

The command to create a scrapy project: scrapy startproject <project name>

Example: scrapy startproject myspider

The results of the generated directories and files are as follows:
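As a rough sketch (the exact files can vary slightly between scrapy versions), the layout created by scrapy startproject myspider looks like this:

myspider/
    scrapy.cfg            # deployment configuration file
    myspider/             # the project's Python module
        __init__.py
        items.py          # item definitions
        middlewares.py    # downloader / spider middlewares
        pipelines.py      # item pipelines
        settings.py       # project settings
        spiders/          # directory that will hold the crawler files
            __init__.py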

4. Create a crawler

The crawler file is created through a command. It is the main working code file: the crawling logic for a website is usually written in this crawler file.

Command (execute under the project path): scrapy genspider <crawler name> <allowed domain name>

Crawler name: used as a parameter when running the crawler. Allowed domain: sets the crawling range for the crawler; it is used to filter the URLs to be crawled, and URLs outside the allowed domain are filtered out.

Example:

    
cd myspider
scrapy genspider itcast itcast.cn

The results of the generated directories and files are as follows:
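As a sketch, genspider adds an itcast.py file under myspider/myspider/spiders/ whose generated content is roughly the following template (the default start_urls value may differ by scrapy version):

import scrapy


class ItcastSpider(scrapy.Spider):
    name = 'itcast'                      # crawler name
    allowed_domains = ['itcast.cn']      # allowed crawling range
    start_urls = ['http://itcast.cn/']   # default start url derived from the domain

    def parse(self, response):
        pass                             # parsing logic is filled in later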

5. Complete the crawler

Write the data collection logic for the target website in the crawler file generated in the previous step to implement data extraction.

5.1 Modify the content in /myspider/myspider/spiders/itcast.py as follows:

import scrapy

class ItcastSpider(scrapy.Spider):  # inherit from scrapy.Spider
    # crawler name
    name = 'itcast'
    # allowed crawling range
    allowed_domains = ['itcast.cn']
    # start url address for crawling
    start_urls = ['http://www.itcast.cn/channel/teacher.shtml']

    # data extraction method; receives the response passed on by the downloader middlewares
    def parse(self, response):
        # scrapy's response object supports xpath directly
        names = response.xpath('//div[@class="tea_con"]//li/div/h3/text()')
        print(names)
        # how to get the concrete data text:
        # group by list item
        li_list = response.xpath('//div[@class="tea_con"]//li')
        for li in li_list:
            # create a data dict
            item = {}
            # use scrapy's xpath selectors to locate elements, and get results with extract() or extract_first()
            item['name'] = li.xpath('.//h3/text()').extract_first()   # teacher's name
            item['level'] = li.xpath('.//h4/text()').extract_first()  # teacher's level
            item['text'] = li.xpath('.//p/text()').extract_first()    # teacher's introduction
            print(item)

Note:

  • A scrapy.Spider crawler class must contain a parsing method named parse

  • If the structure of the website is more complex, you can also define additional parsing functions

  • To send a request for a URL extracted in a parsing function, that URL must fall within allowed_domains; the URLs in start_urls are not subject to this restriction. How to construct and send requests from a parsing function is covered in later lessons

  • When starting the crawler, pay attention to where you start it: run the command under the project path

  • The parse() function returns data with yield. Note: the only objects that can be yielded in the parse function are BaseItem, Request, dict, and None (see the sketch below)
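For illustration only, a minimal sketch of the same parse method rewritten to yield each item dict instead of printing it (the xpath rules are the ones used above and may need adjusting if the page structure changes):

    def parse(self, response):
        li_list = response.xpath('//div[@class="tea_con"]//li')
        for li in li_list:
            item = {}
            item['name'] = li.xpath('.//h3/text()').extract_first()
            item['level'] = li.xpath('.//h4/text()').extract_first()
            item['text'] = li.xpath('.//p/text()').extract_first()
            # yield hands the dict to the engine, which passes it on to the pipelines
            yield item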

5.2 Methods of locating elements and extracting data and attribute values

Parsing and extracting data in the scrapy crawler: use xpath rule strings to locate and extract it.

  1. The return value of the response.xpath method is a list-like type containing selector objects; it can be operated on like a list, but it provides some extra methods

  2. Extra method extract(): returns a list of strings

  3. Extra method extract_first(): returns the first string in the list; if the list is empty, it returns None instead of raising an error
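A small standalone sketch of the difference, using scrapy's Selector on a made-up HTML snippet (the HTML and the names in it are purely illustrative):

from scrapy.selector import Selector

html = '<ul><li><h3>Teacher A</h3></li><li><h3>Teacher B</h3></li></ul>'
sel = Selector(text=html)

names = sel.xpath('//li/h3/text()')              # a SelectorList, usable like a list
print(names.extract())                           # ['Teacher A', 'Teacher B'] -- list of strings
print(names.extract_first())                     # 'Teacher A' -- the first string
print(sel.xpath('//h4/text()').extract_first())  # None -- empty result, no exception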

5.3 Common attributes of response object

  • response.url: the url address of the current response

  • response.request.url: The URL address of the request corresponding to the current response

  • response.headers: the response headers

  • response.request.headers: the request headers of the current response

  • response.body: the response body (the HTML source), as bytes

  • response.status: response status code
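As an illustrative sketch, these attributes can be inspected inside any parse method:

    def parse(self, response):
        print(response.url)              # url of the current response
        print(response.request.url)      # url of the request that produced it
        print(response.status)           # e.g. 200
        print(response.headers)          # response headers
        print(response.request.headers)  # request headers
        print(response.body[:100])       # first bytes of the html body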

 

6. Save data

Use a pipeline to process (save) the data.

6.1 Define operations on data in the pipelines.py file

  1. Define a pipeline class

  2. Override the process_item method of the pipeline class

  3. After processing the item, the process_item method must return it to the engine

import json

class ItcastPipeline:
    # this method runs once for every item yielded in the crawler file
    # the method name process_item is fixed
    def process_item(self, item, spider):
        print(item)
        return item
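This minimal pipeline only prints the item. As a hedged sketch of an actual save step (the class name ItcastFilePipeline and the filename itcast.json are illustrative assumptions), a pipeline that writes each item to a JSON lines file could look like this:

import json

class ItcastFilePipeline:
    def open_spider(self, spider):
        # runs once when the crawler starts
        self.file = open('itcast.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # write each item as one json line
        self.file.write(json.dumps(item, ensure_ascii=False) + '\n')
        return item  # still must return the item to the engine

    def close_spider(self, spider):
        # runs once when the crawler finishes
        self.file.close()

Like any pipeline, it would also need to be enabled in ITEM_PIPELINES in settings.py, as shown in the next step.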

6.2 Enable the pipeline in settings.py

ITEM_PIPELINES = {
    'myspider.pipelines.ItcastPipeline': 400
}

The key in the configuration item is the pipeline class to use; it is split by dots: the first part is the project directory, the second is the file, and the third is the pipeline class defined there. The value in the configuration item is the priority of the pipeline: the smaller the value, the earlier it runs. The value is usually kept within 1000.
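For illustration, assuming the ItcastFilePipeline from the sketch above also existed in pipelines.py, both pipelines could be enabled and ordered by their values:

ITEM_PIPELINES = {
    'myspider.pipelines.ItcastPipeline': 400,      # smaller value, runs first
    'myspider.pipelines.ItcastFilePipeline': 500,  # larger value, runs second
}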

7. Run scrapy

Command: execute scrapy crawl <crawler name> in the project directory

Example: scrapy crawl itcast
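As a side note, the items yielded by the crawler can also be dumped to a file directly from the command line with the -o option (the output filename here is just an example):

scrapy crawl itcast -o itcast.json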


Summary

  1. Installation of scrapy: pip install scrapy

  2. Create a scrapy project: scrapy startproject myspider

  3. Create scrapy crawler: execute scrapy genspider itcast itcast.cn in the project directory

  4. Run scrapy crawler: execute scrapy crawl itcast in the project directory

  5. Parsing and extracting data in the scrapy crawler:

    1. The return value of the response.xpath method is a list-like type containing selector objects; it can be operated on like a list, but it provides some extra methods

    2. extract() returns a list of strings

    3. extract_first() returns the first string in the list; if the list is empty, it returns None

  6. Basic use of scrapy pipeline:

    1. Complete the process_item method in pipelines.py

    2. Enable the pipeline in settings.py

  7. Common attributes of response object

    1. response.url: the url address of the current response

    2. response.request.url: The URL address of the request corresponding to the current response

    3. response.headers: the response headers

    4. response.request.headers: the request headers of the current response

    5. response.body: the response body (the HTML source), as bytes

    6. response.status: response status code

 
