Scrapy from entry to abandon (1): the development process

This introduction to the scrapy framework is a record of notes the author took while studying the Heima (黑马) Python course.


1 Install scrapy

Linux command:

sudo apt-get install scrapy

Windows:

pip install scrapy

If the download is slow, see this reference on speeding up installs:
Python third-party library installation speed
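For example, one common fix is to install from a domestic PyPI mirror (the Tsinghua TUNA mirror is used here only as an illustration; any mirror works):

pip install scrapy -i https://pypi.tuna.tsinghua.edu.cn/simple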

2 scrapy project development process

  1. Create the project:
     scrapy startproject mySpider
  2. Generate a crawler:
     scrapy genspider itcast itcast.cn
  3. Extract data:
     implement the data-collection logic in the spider according to the website structure
  4. Save data:
     use a pipeline for subsequent processing and storage of the data

3. Create a project

The scrapy project files are generated by a command, and the subsequent steps are all operations inside the generated project. Below we crawl the Chuanzhi (itcast) teacher list page to learn how to use scrapy: http://www.itcast.cn/channel/teacher.shtml

The command to create a scrapy project:

scrapy startproject <project_name>

Example:

scrapy startproject myspider
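
For reference, the command above generates a project skeleton. With a typical scrapy version the layout looks roughly like this (exact file names may differ slightly between versions):

myspider/
    scrapy.cfg            # deployment configuration
    myspider/
        __init__.py
        items.py          # item definitions
        middlewares.py    # downloader / spider middlewares
        pipelines.py      # data pipelines
        settings.py       # project settings
        spiders/
            __init__.py   # spider files will be created here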

4. Create a crawler

The crawler file is created with a command. The crawler file is the main place where code is written: the crawling logic for a website usually lives in this file.

Command:
execute in the project path :

scrapy genspider <spider_name> <allowed_domain>

Crawler name: used as a parameter when running the crawler.

Allowed domain: the crawl range set for the crawler. Once set, it is used to filter the URLs to be crawled; if a crawled URL does not belong to the allowed domain, it is filtered out.

Example:

    cd myspider
    scrapy genspider itcast itcast.cn
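
After running genspider, scrapy creates spiders/itcast.py from its default template. The generated file looks approximately like this (the exact template varies by scrapy version); the next section fills in this skeleton:

import scrapy

class ItcastSpider(scrapy.Spider):
    name = 'itcast'
    allowed_domains = ['itcast.cn']
    start_urls = ['http://itcast.cn/']

    def parse(self, response):
        pass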

5. Complete the crawler

Write the data-collection logic for the target website in the crawler file generated in the previous step, implementing the data extraction.

5.1 Modify the content in /myspider/myspider/spiders/itcast.py as follows:

import scrapy

class ItcastSpider(scrapy.Spider):  # inherits from scrapy.Spider
    # name of the crawler
    name = 'itcast'
    # scope allowed to crawl
    allowed_domains = ['itcast.cn']
    # URL address to start crawling from
    start_urls = ['http://www.itcast.cn/channel/teacher.shtml']

    # data-extraction method; receives the response passed in from the downloader middleware
    def parse(self, response):
        # scrapy's response object supports xpath directly
        names = response.xpath('//div[@class="tea_con"]//li/div/h3/text()')
        print(names)

        # to get the concrete text data, proceed as follows
        # group the elements
        li_list = response.xpath('//div[@class="tea_con"]//li')
        for li in li_list:
            # create a data dictionary
            item = {}
            # locate elements with scrapy's xpath selector, then get the result with extract() or extract_first()
            item['name'] = li.xpath('.//h3/text()').extract_first()   # the teacher's name
            item['level'] = li.xpath('.//h4/text()').extract_first()  # the teacher's title
            item['text'] = li.xpath('.//p/text()').extract_first()    # the teacher's introduction
            print(item)
            yield item  # hand the item over to the engine/pipeline (see the note on yield below)

Note:

  • There must be a parsing method named parse in a scrapy.Spider crawler class
  • If the website structure is more complex, you can also define additional parse functions
  • A URL extracted in a parse function must belong to allowed_domains before a request can be sent to it, but the URLs in start_urls are not subject to this restriction; how to construct and send requests from a parse function is covered in later lessons
  • When starting the crawler, pay attention to where you start it from: it is run from the project path
  • The parse() function returns data with yield. Note: in the parse function, yield can only pass objects of type BaseItem, Request, dict, or None
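
As a small preview of constructing requests in a parse function, here is a minimal sketch (not part of the original example). It assumes each teacher entry contains a link, and parse_detail is a hypothetical callback; scrapy.Request and response.urljoin are standard scrapy APIs.

import scrapy

class ItcastSpider(scrapy.Spider):
    name = 'itcast'
    allowed_domains = ['itcast.cn']
    start_urls = ['http://www.itcast.cn/channel/teacher.shtml']

    def parse(self, response):
        for li in response.xpath('//div[@class="tea_con"]//li'):
            # hypothetical: assumes each entry links to a detail page
            detail_url = li.xpath('.//a/@href').extract_first()
            if detail_url:
                # the absolute URL must fall within allowed_domains, or the request is filtered out
                yield scrapy.Request(response.urljoin(detail_url), callback=self.parse_detail)

    def parse_detail(self, response):
        # a custom parse function; it can yield a dict just like parse()
        yield {'url': response.url}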

5.2 Methods for locating elements and extracting data and attribute values

Parsing and extracting data in a scrapy crawler: locate and extract using xpath rule strings.

  1. The return value of the response.xpath method is a list-like type containing selector objects; it behaves like a list but has some extra methods
  2. Extra method extract(): returns a list of strings
  3. Extra method extract_first(): returns the first string in the list; if the list is empty, it returns None
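
A short illustration of the difference between the two methods, inside a parse function:

# inside parse(self, response):
names = response.xpath('//div[@class="tea_con"]//li/div/h3/text()')
print(names.extract())        # a list of strings (possibly empty)
print(names.extract_first())  # the first string, or None if nothing matched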

5.3 Common attributes of the response object

  • response.url: the URL of the current response
  • response.request.url: the URL of the request corresponding to the current response
  • response.headers: the response headers
  • response.request.headers: the request headers of the current response
  • response.body: the response body, i.e. the HTML, as bytes
  • response.status: the response status code
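
For example, these attributes can be inspected directly inside parse (a minimal sketch):

def parse(self, response):
    print(response.url)              # URL of the current response
    print(response.request.url)      # URL of the corresponding request
    print(response.status)           # status code, e.g. 200
    print(response.headers)          # response headers
    print(response.request.headers)  # request headers
    print(response.body[:200])       # the body is bytes; call .decode() to get a str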

6 Save data

Use pipeline pipeline to process (save) data

6.1 Define operations on data in the pipelines.py file

  1. Define a pipeline class
  2. Override the pipeline class's process_item method
  3. The process_item method must return the item to the engine after processing it

import json

class ItcastPipeline():
    # this method runs once for every item yielded by the data-extraction method in the spider file
    # the method name is fixed
    def process_item(self, item, spider):
        print(item)
        return item
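
Since this section is about saving data, here is a minimal sketch of a pipeline that writes each item to a JSON lines file. The class name ItcastSavePipeline and the file name itcast.json are just examples, not part of the original tutorial; open_spider and close_spider are standard pipeline hooks.

import json

class ItcastSavePipeline:
    def open_spider(self, spider):
        # called once when the spider starts; open the output file
        self.file = open('itcast.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # write one JSON object per line, then hand the item back to the engine
        self.file.write(json.dumps(item, ensure_ascii=False) + '\n')
        return item

    def close_spider(self, spider):
        # called once when the spider closes; release the file handle
        self.file.close()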

6.2 Configure to enable pipeline in settings.py

ITEM_PIPELINES = {
    'myspider.pipelines.ItcastPipeline': 400
}

The key in the configuration is the pipeline class to use; its parts are separated by dots: the first is the project directory, the second is the file, and the third is the pipeline class defined in it.

The value in the configuration is the order in which the pipeline runs: the smaller the value, the earlier it executes. The value is usually kept within 1000.
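
For example, to enable both pipelines from this section (ItcastSavePipeline being the hypothetical sketch above), the one with the smaller value runs first:

ITEM_PIPELINES = {
    'myspider.pipelines.ItcastPipeline': 300,
    'myspider.pipelines.ItcastSavePipeline': 400,
}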

7. Run scrapy

Command: execute scrapy crawl <crawler name> in the project directory
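
For example, for the itcast crawler created above:

cd myspider
scrapy crawl itcast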



This article is a record the author made while studying the Heima (黑马) Python course. If there are any errors, please point them out in the comments.

This is the end. If it helped you, you are welcome to like and follow; your likes mean a lot to me.


Origin blog.csdn.net/qq_45176548/article/details/111587388