Implementing a POST request crawler with the Scrapy framework
Preface
This article shows how to implement a POST request crawler with the Scrapy framework, using the scraping of KFC store information for a specified city as a worked example.
Main text
1. How the Scrapy framework handles POST requests
The Scrapy framework provides the FormRequest() method for sending POST requests.
Compared with Request(), FormRequest() adds a formdata parameter, which accepts a dictionary or an iterable of tuples containing the form data and URL-encodes it into the body of the request.
POST request: `yield scrapy.FormRequest(url=post_url, formdata={}, meta={}, callback=...)`
Note: to make the spider's initial requests POST requests with FormRequest(), you must override the start_requests() method, because the default implementation issues plain GET requests for the URLs in start_urls.
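A minimal sketch of such a spider (the endpoint and form fields here are placeholders for illustration, not part of this article's example):

```python
import scrapy


class PostDemoSpider(scrapy.Spider):
    name = "post_demo"

    def start_requests(self):
        # Placeholder endpoint and form fields; replace with your target.
        post_url = "http://example.com/api/search"
        formdata = {"keyword": "scrapy", "page": "1"}  # values must be strings
        # FormRequest URL-encodes formdata into the POST body.
        yield scrapy.FormRequest(url=post_url, formdata=formdata, callback=self.parse)

    def parse(self, response):
        self.logger.info("Got %d bytes from %s", len(response.body), response.url)
```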
2. A Scrapy POST request case study
- Project requirements: scrape KFC store information for a specified city. The terminal prompts "Please enter the city:"; entering xx scrapes all KFC store data for city xx.
- Required data: store number, store name, store address, city, province
- URL address: http://www.kfc.com.cn/kfccda/storelist/index.aspx
- POST request URL: http://www.kfc.com.cn/kfccda/ashx/GetStoreList.ashx?op=cname
- F12 packet capture analysis: use the browser's developer tools (F12) to locate the data to be scraped, namely the store information and the total number of stores; the response shape is sketched below.
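Based on the fields used later in the spider, the JSON response looks roughly like this (a sketch reconstructed from the parsing code, not a verbatim capture):

```python
# Approximate response shape, inferred from the spider code below:
response_example = {
    "Table": [{"rowcount": 123}],  # total number of stores in the city
    "Table1": [                    # one dict per store on the current page
        {
            "rownum": 1,
            "storeName": "...",
            "addressDetail": "...",
            "cityName": "...",
            "provinceName": "...",
        },
    ],
}
```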
- Inspect the form data: from the same packet capture, read the form fields the page sends with the POST request (cname, pid, pageIndex, pageSize); a quick way to verify them is sketched below.
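Before writing the spider, you can confirm the endpoint and form fields with a quick standalone request (a sketch using the requests library; the User-Agent value is an illustrative assumption):

```python
import requests

post_url = "http://www.kfc.com.cn/kfccda/ashx/GetStoreList.ashx?op=cname"
formdata = {
    "cname": "北京",   # city name
    "pid": "",
    "pageIndex": "1",
    "pageSize": "10",
}
# A browser-like User-Agent; the exact value is illustrative.
headers = {"User-Agent": "Mozilla/5.0"}

resp = requests.post(post_url, data=formdata, headers=headers)
print(resp.json()["Table"][0]["rowcount"])  # total store count for the city
```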
- Create a Scrapy project and write the items.py file:
```python
import scrapy


class KfcspiderItem(scrapy.Item):
    # store number
    rownum = scrapy.Field()
    # store name
    storeName = scrapy.Field()
    # store address
    addressDetail = scrapy.Field()
    # city
    cityName = scrapy.Field()
    # province
    provinceName = scrapy.Field()
```
- Write the spider file:
```python
import scrapy
import json
from ..items import KfcspiderItem


class KfcSpider(scrapy.Spider):
    name = "kfc"
    allowed_domains = ["www.kfc.com.cn"]
    post_url = 'http://www.kfc.com.cn/kfccda/ashx/GetStoreList.ashx?op=cname'
    city_name = input("Please enter the city name: ")
    # start_urls = ["http://www.kfc.com.cn/"]

    def start_requests(self):
        """
        Override start_requests() to get the total number of
        KFC stores in the given city.
        """
        formdata = {
            "cname": self.city_name,
            "pid": "",
            "pageIndex": '1',
            "pageSize": '10'
        }
        # dont_filter=True keeps this request's fingerprint out of the dupe
        # filter, so the identical page-1 request queued later in get_total()
        # is not dropped as a duplicate.
        yield scrapy.FormRequest(url=self.post_url, formdata=formdata,
                                 callback=self.get_total, dont_filter=True)

    def parse(self, response):
        """
        Parse and extract the individual store data.
        """
        html = json.loads(response.text)
        for one_shop_dict in html["Table1"]:
            item = KfcspiderItem()
            item["rownum"] = one_shop_dict['rownum']
            item["storeName"] = one_shop_dict['storeName']
            item["addressDetail"] = one_shop_dict['addressDetail']
            item["cityName"] = one_shop_dict['cityName']
            item["provinceName"] = one_shop_dict['provinceName']
            # One complete store record extracted; hand it to the item pipeline.
            yield item

    def get_total(self, response):
        """
        Compute the total number of pages and enqueue a request for each page.
        """
        html = json.loads(response.text)
        count = html['Table'][0]['rowcount']
        total_page = count // 10 if count % 10 == 0 else count // 10 + 1
        # Hand every page's request to the scheduler for queuing.
        for page in range(1, total_page + 1):
            formdata = {
                "cname": self.city_name,
                "pid": "",
                "pageIndex": str(page),
                "pageSize": '10'
            }
            yield scrapy.FormRequest(url=self.post_url, formdata=formdata,
                                     callback=self.parse)
```
- Write the settings file:
```python
BOT_NAME = "KFCSpider"

SPIDER_MODULES = ["KFCSpider.spiders"]
NEWSPIDER_MODULE = "KFCSpider.spiders"

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 1

# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en",
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.1 (KHTML, like Gecko)"
}

# Log level: DEBUG < INFO < WARNING < ERROR < CRITICAL
LOG_LEVEL = 'INFO'
# Write the log to a file
LOG_FILE = 'KFC.log'

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    "KFCSpider.pipelines.KfcspiderPipeline": 300,
}

# Set settings whose default value is deprecated to a future-proof value
REQUEST_FINGERPRINTER_IMPLEMENTATION = "2.7"
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
FEED_EXPORT_ENCODING = "utf-8"
```
- Print the items directly in the pipeline file, as sketched below.
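The article does not list the pipeline code; a minimal version that just prints each item might look like this (a sketch; the class name matches the ITEM_PIPELINES entry above):

```python
class KfcspiderPipeline:
    def process_item(self, item, spider):
        # Print each store record as it arrives from the spider.
        print(dict(item))
        return item
```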
- Create a run.py file to run the crawler:
```python
from scrapy import cmdline

cmdline.execute("scrapy crawl kfc".split())
```
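Alternatively, you can run `scrapy crawl kfc` directly from the project root; run.py simply wraps that command so the spider can be started from an IDE.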
- Running result: enter a city name at the prompt; the spider prints each store item through the pipeline and writes the log to KFC.log.