Scrapy from Entry to Abandonment 3: Data Modeling and Requests

Scrapy data modeling and requests


Learning objectives:
  1. Apply data modeling in a scrapy project
  2. Construct a Request object and send the request
  3. Use the meta parameter to pass data between different parsing functions

1. Data Modeling

In a typical project, data modeling is done in the items.py file.

1.1 Why model

  1. Defining the item means planning in advance which fields need to be scraped; once defined, Scrapy checks the fields automatically at runtime, which prevents manual mistakes
  2. Together with the comments, you can clearly see which fields to scrape; fields that have not been defined cannot be stored in the item. When there are only a few target fields, a plain dictionary can be used instead
  3. Some specific scrapy components require Item support, such as scrapy's ImagesPipeline pipeline class (search online for more details)

1.2 How to model

Define the fields to be extracted in the items.py file:

import scrapy

class MyspiderItem(scrapy.Item):
    name = scrapy.Field()   # the lecturer's name
    title = scrapy.Field()  # the lecturer's title
    desc = scrapy.Field()   # the lecturer's introduction

1.3 How to use the Item class

The Item class defined in items.py needs to be imported and instantiated in the crawler; after that, it is used in the same way as a dictionary

job.py:

from myspider.items import MyspiderItem   # import the Item class; mind the import path
...
    def parse(self, response):

        item = MyspiderItem()  # can be used directly once instantiated

        # node is assumed to come from an earlier response.xpath(...) selection
        item['name'] = node.xpath('./h3/text()').extract_first()
        item['title'] = node.xpath('./h4/text()').extract_first()
        item['desc'] = node.xpath('./p/text()').extract_first()

        print(item)

Note:

  1. For the line from myspider.items import MyspiderItem, make sure the import path of the Item is correct, and ignore the errors that PyCharm may flag
  2. The key to import paths in Python: imports are resolved relative to the directory from which the program is run

1.4 Summary of development process

  1. Create the project

    scrapy startproject <project name>
  2. Clarify the goal

    Model the data in the items.py file
  3. Create the crawler

    3.1 Create the crawler

    scrapy genspider <crawler name> <allowed domain>
    3.2 Complete the crawler

    Modify start_urls
    Check and modify allowed_domains
    Write the parsing method(s)
  4. Save the data

    Define pipelines for data processing in the pipelines.py file
    Register and enable the pipeline in the settings.py file (a sketch follows below)
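
As a hedged illustration of that last step (the class name MyspiderPipeline is a placeholder, not taken from the original text), a minimal pipeline and its registration might look like this:

# pipelines.py - a minimal pipeline sketch; the storage logic is up to you
class MyspiderPipeline:
    def process_item(self, item, spider):
        # process or save the item here, then return it so later pipelines can receive it
        return item

# settings.py - register and enable the pipeline; the lower the number, the earlier it runs
ITEM_PIPELINES = {
    'myspider.pipelines.MyspiderPipeline': 300,
}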

2. The idea of page turning requests

What should we do to extract the data from every page of a paginated job listing?

Recall how the requests module implements page turning (a short sketch follows this list):

  1. Find the URL address of the next page
  2. Call requests.get(url)
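
As a rough sketch for comparison (the start URL, the XPath, and the use of lxml are assumptions, not part of the original example), a requests-based page turning loop could look like this:

import requests
from lxml import etree

url = 'https://example.com/list?page=1'  # placeholder start url
while url:
    response = requests.get(url)
    html = etree.HTML(response.text)
    # ... extract the data of the current page here ...
    # find the next-page url; stop when there is none (the xpath is a placeholder)
    next_hrefs = html.xpath('//a[@class="next"]/@href')
    url = next_hrefs[0] if next_hrefs else None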

Scrapy's approach to page turning:

  1. Find the URL address of the next page
  2. Construct a Request object for that URL and hand it to the engine

3. Construct the Request object and send the request

3.1 Implementation method

  1. Determine the URL address
  2. Construct the request: scrapy.Request(url, callback)
    • callback: specifies the name of the parsing function, i.e. which function will parse the response returned for this request
  3. Hand the request to the engine: yield scrapy.Request(url, callback) (see the sketch below)
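
Putting the three steps together, the pattern inside a parsing function looks roughly like the sketch below (the XPath and parse usage are placeholders; the real Netease example follows in 3.3):

# inside a scrapy.Spider subclass; scrapy is imported at the top of the file
def parse(self, response):
    # ... extract the data of the current page here ...
    next_url = response.xpath('//a[@class="next"]/@href').extract_first()  # placeholder xpath
    if next_url:
        # hand the new request to the engine; its response will come back to self.parse
        yield scrapy.Request(response.urljoin(next_url), callback=self.parse)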

3.2 Netease recruitment crawler

Learn how to implement page turning requests by crawling the job postings on Netease's recruitment page

Address: https://hr.163.com/position/list.do

Thinking analysis:
  1. Get the data of the first page
  2. Find the address of the next page, turn the page, and get its data

Note:
  1. The ROBOTS protocol can be set in settings.py:

# False means ignore the site's robots.txt; the default is True
ROBOTSTXT_OBEY = False

  2. The User-Agent can be set in settings.py:

# every request scrapy sends uses this User-Agent by default
USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36'

3.3 Code implementation

In the parse method of the crawler file:

......
    # extract the href of the next page
    next_url = response.xpath('//a[contains(text(),">")]/@href').extract_first()

    # check whether this is the last page
    if next_url != 'javascript:void(0)':

        # build the complete url
        url = 'https://hr.163.com/position/list.do' + next_url

        # construct a scrapy.Request object and yield it to the engine
        # the callback parameter specifies which function will parse the response of this Request
        yield scrapy.Request(url, callback=self.parse)
......

3.4 More parameters of scrapy.Request

scrapy.Request(url[,callback,method="GET",headers,body,cookies,meta,dont_filter=False])
Parameter explanation (a usage sketch follows this list):
  1. The parameters in brackets are optional
  2. callback: indicates which function will handle the response of the current URL
  3. meta: passes data between different parsing functions; meta also carries some data by default, such as the download delay and the request depth
  4. dont_filter: defaults to False, which means requested URL addresses are filtered, i.e. a URL that has already been requested will not be requested again; set it to True for URLs that need to be requested repeatedly, for example Tieba page turning requests, where the data on a page keeps changing. The addresses in start_urls are requested even if repeated, otherwise the program would not start
  5. method: specifies a POST or GET request
  6. headers: receives a dictionary, which must not contain cookies
  7. cookies: receives a dictionary; place cookies specifically here
  8. body: receives a JSON string; this is the POST data, used when sending a payload POST request (POST requests are introduced in the next chapter)
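
A hedged sketch of these parameters in use (the url, headers, cookies and item values below are made-up placeholders):

yield scrapy.Request(
    url='https://example.com/detail/1',          # placeholder url
    callback=self.parse_detail,                  # which function parses the response
    method='GET',                                # GET is the default
    headers={'Referer': 'https://example.com'},  # do NOT put cookies into headers
    cookies={'sessionid': 'xxxx'},               # cookies go here as a dict
    meta={'item': item},                         # pass data on to parse_detail
    dont_filter=False,                           # keep the default url de-duplication
)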

4. Use of meta parameters

The role of meta: it makes it possible to pass data between different parsing functions

In the parse method of the crawler file, extract the detail page URL and hand its response to the parse_detail function specified via callback:

def parse(self, response):
    ...
    yield scrapy.Request(detail_url, callback=self.parse_detail, meta={"item": item})
...

def parse_detail(self, response):
    # get the item that was passed in earlier
    item = response.meta["item"]

Pay attention:
  1. The meta parameter is a dictionary
  2. The meta dictionary has a fixed key, proxy, which represents a proxy IP; the use of proxy IPs will be introduced when we learn about scrapy's downloader middleware (a fuller sketch of the meta pattern follows below)
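
A slightly fuller, hedged sketch of the meta pattern (detail_url, the XPaths and the field names are placeholders; the finished item is yielded from parse_detail):

def parse(self, response):
    item = {}
    item['name'] = response.xpath('//h1/text()').extract_first()  # placeholder xpath
    detail_url = response.xpath('//a[@class="detail"]/@href').extract_first()  # placeholder xpath
    # carry the half-filled item over to the detail page's parsing function
    yield scrapy.Request(response.urljoin(detail_url),
                         callback=self.parse_detail,
                         meta={'item': item})

def parse_detail(self, response):
    item = response.meta['item']  # take out the item passed in via meta
    item['desc'] = response.xpath('//p/text()').extract_first()  # placeholder xpath
    yield item  # hand the completed item to the pipelines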

Summary

  1. Improve and use the Item data class:
    • Complete the fields to be crawled in items.py
    • Import the Item class in the crawler file
    • After instantiating the Item object, use it directly like a dictionary
  2. Construct the Request object and send the request:
    • Import the scrapy.Request class
    • Extract the url in the parsing function
    • yield scrapy.Request(url, callback=self.parse_detail, meta={})
  3. Use the meta parameter to pass data between different parsing functions:
    • In the previous parsing function, pass the meta: yield scrapy.Request(url, callback=self.xxx, meta={})
    • In the self.xxx function, use response.meta.get('key', '') or response.meta['key'] to retrieve the passed data

Reference Code

wangyi/spiders/job.py

import scrapy


class JobSpider(scrapy.Spider):
    name = 'job'
    # 2. check the allowed domains
    allowed_domains = ['163.com']
    # 1. set the start url
    start_urls = ['https://hr.163.com/position/list.do']

    def parse(self, response):
        # get the list of all position nodes
        node_list = response.xpath('//*[@class="position-tb"]/tbody/tr')
        # print(len(node_list))

        # iterate over the position nodes
        for num, node in enumerate(node_list):
            # only nodes with an even index contain data, so filter by index
            if num % 2 == 0:
                item = {}

                item['name'] = node.xpath('./td[1]/a/text()').extract_first()
                item['link'] = node.xpath('./td[1]/a/@href').extract_first()
                item['depart'] = node.xpath('./td[2]/text()').extract_first()
                item['category'] = node.xpath('./td[3]/text()').extract_first()
                item['type'] = node.xpath('./td[4]/text()').extract_first()
                item['address'] = node.xpath('./td[5]/text()').extract_first()
                item['num'] = node.xpath('./td[6]/text()').extract_first().strip()
                item['date'] = node.xpath('./td[7]/text()').extract_first()
                yield item

        # handle pagination
        # get the next-page url
        part_url = response.xpath('//a[contains(text(),">")]/@href').extract_first()

        # check whether this is the last page; if not, request the next page
        if part_url != 'javascript:void(0)':
            # build the complete next-page url
            next_url = 'https://hr.163.com/position/list.do' + part_url

            yield scrapy.Request(
                url=next_url,
                callback=self.parse
            )

wangyi/items.py

import scrapy


class WangyiItem(scrapy.Item):
    # define the fields for your item here:

    name = scrapy.Field()
    link = scrapy.Field()
    depart = scrapy.Field()
    category = scrapy.Field()
    type = scrapy.Field()
    address = scrapy.Field()
    num = scrapy.Field()
    date = scrapy.Field()

That's all for this part. If it helped you, feel free to like and follow; your support means a lot to me.

Origin: blog.csdn.net/qq_45176548/article/details/111991181