[Python_Scrapy study notes (12)] Implementing a POST request crawler based on the Scrapy framework


Preface

This article introduces how to implement a POST request crawler with the Scrapy framework, demonstrated by capturing KFC store information for a specified city.

Main text

1. How the Scrapy framework handles POST requests

The Scrapy framework provides the FormRequest class for sending POST requests;
compared with Request(), FormRequest() adds a formdata parameter, which accepts a dictionary (or an iterable of tuples) of form data and encodes it into the body of the request.
POST request: yield scrapy.FormRequest(url=post_url, formdata={}, meta={}, callback=...)
Note: to make the spider's first request a POST with FormRequest(), you must override the start_requests() method (start_urls always issues GET requests).
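
A minimal sketch of this pattern is shown below; the URL and form field here are placeholders for illustration only, not the KFC endpoint used later in this article.

    import scrapy


    class PostDemoSpider(scrapy.Spider):
        name = "post_demo"

        def start_requests(self):
            # Override start_requests() so the very first request is a POST.
            yield scrapy.FormRequest(
                url="http://example.com/search",     # placeholder URL
                formdata={"keyword": "scrapy"},      # form fields are URL-encoded into the request body
                callback=self.parse,
            )

        def parse(self, response):
            self.logger.info("response status: %s", response.status)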

2. Scrapy framework handles POST request cases

  1. Project requirements: capture KFC store information for a specified city. The terminal prompts "Please enter the city:"; entering a city name xx captures all KFC store data for city xx.

  2. Required data: store number, store name, store address, city, province

  3. URL address: http://www.kfc.com.cn/kfccda/storelist/index.aspx

  4. POST request url address: http://www.kfc.com.cn/kfccda/ashx/GetStoreList.ashx?op=cname

  5. F12 packet capture analysis: find the data to be crawled; the JSON response contains the store information and the total number of stores (an illustrative sketch of the response structure follows).
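
    Based on the fields read by the spider code below, the JSON response is roughly shaped as follows (all values here are placeholders, not real data):

        # Illustrative response structure; field names come from the spider below,
        # values are placeholders only.
        {
            "Table": [{"rowcount": 123}],   # total number of stores in the city
            "Table1": [                     # one dict per store on the current page
                {
                    "rownum": 1,
                    "storeName": "...",
                    "addressDetail": "...",
                    "cityName": "...",
                    "provinceName": "...",
                },
            ],
        }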

  6. Get the form data: copy the POST form fields from the captured request (a sketch of the payload is shown below).
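
    Judging from the formdata built in the spider code below, the captured form payload contains four fields (the values shown here are examples only):

        # POST form fields sent to GetStoreList.ashx (example values)
        formdata = {
            "cname": "北京",      # city name entered by the user
            "pid": "",            # left empty in this crawler
            "pageIndex": "1",     # page number; the API returns 10 stores per page
            "pageSize": "10",
        }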

  7. Create a Scrapy project: write the items.py file

    import scrapy
    
    
    class KfcspiderItem(scrapy.Item):
        # store number
        rownum = scrapy.Field()
        # store name
        storeName = scrapy.Field()
        # store address
        addressDetail = scrapy.Field()
        # city
        cityName = scrapy.Field()
        # province
        provinceName = scrapy.Field()
    
  8. Write the spider file:

    import scrapy
    import json
    from ..items import KfcspiderItem
    
    class KfcSpider(scrapy.Spider):
        name = "kfc"
        allowed_domains = ["www.kfc.com.cn"]
        post_url = 'http://www.kfc.com.cn/kfccda/ashx/GetStoreList.ashx?op=cname'
        city_name = input("Please enter the city name: ")
    
        # start_urls = ["http://www.kfc.com.cn/"]
        def start_requests(self):
            """
            Override start_requests() to fetch the total number of KFC stores in the given city.
            """
            formdata = {
                "cname": self.city_name,
                "pid": "",
                "pageIndex": '1',
                "pageSize": '10'
            }
            yield scrapy.FormRequest(url=self.post_url, formdata=formdata, callback=self.get_total, dont_filter=True)
    
        def parse(self, response):
            """
            Parse and extract the data of each individual store.
            """
            html = json.loads(response.text)
            for one_shop_dict in html["Table1"]:
                item = KfcspiderItem()
                item["rownum"] = one_shop_dict['rownum']
                item["storeName"] = one_shop_dict['storeName']
                item["addressDetail"] = one_shop_dict['addressDetail']
                item["cityName"] = one_shop_dict['cityName']
                item["provinceName"] = one_shop_dict['provinceName']
                # one complete store record has been extracted; hand it to the item pipeline
                yield item
    
        def get_total(self, response):
            """
            Compute the total number of pages and hand one request per page to the scheduler.
            """
            html = json.loads(response.text)
            count = html['Table'][0]['rowcount']
            total_page = count // 10 if count % 10 == 0 else count // 10 + 1
            # hand the URL of every page to the scheduler to be enqueued
            for page in range(1, total_page + 1):
                formdata = {
                    "cname": self.city_name,
                    "pid": "",
                    "pageIndex": str(page),
                    "pageSize": '10'
                }
                # hand the request to the scheduler; dont_filter=True keeps the page-1 request
                # (same URL and body as the initial request) from being dropped by the duplicate filter
                yield scrapy.FormRequest(url=self.post_url, formdata=formdata, callback=self.parse, dont_filter=True)
    
    
  9. Write the settings file:

    BOT_NAME = "KFCSpider"
    
    SPIDER_MODULES = ["KFCSpider.spiders"]
    NEWSPIDER_MODULE = "KFCSpider.spiders"
    
    # Obey robots.txt rules
    ROBOTSTXT_OBEY = False
    
    # See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
    # See also autothrottle settings and docs
    DOWNLOAD_DELAY = 1
    
    # Override the default request headers:
    DEFAULT_REQUEST_HEADERS = {
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "en",
        "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.1 (KHTML, like Gecko)"
    }
    
    # Log level: DEBUG < INFO < WARNING < ERROR < CRITICAL
    LOG_LEVEL = 'INFO'
    # Write log output to a file
    LOG_FILE = 'KFC.log'
    
    # Configure item pipelines
    # See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
    ITEM_PIPELINES = {
        "KFCSpider.pipelines.KfcspiderPipeline": 300,
    }
    
    # Set settings whose default value is deprecated to a future-proof value
    REQUEST_FINGERPRINTER_IMPLEMENTATION = "2.7"
    TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
    FEED_EXPORT_ENCODING = "utf-8"
    
    
  10. Write the pipeline file: for this example the items are simply printed in the pipeline (a minimal sketch follows).
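
    A minimal pipelines.py sketch for this step; the class name KfcspiderPipeline is taken from the ITEM_PIPELINES setting above:

        class KfcspiderPipeline:
            def process_item(self, item, spider):
                # print the extracted store data; a real pipeline would persist it instead
                print(dict(item))
                return item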

  11. Create a run.py file to run the crawler:

    from scrapy import cmdline
    cmdline.execute("scrapy crawl kfc".split())
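
    Running run.py from inside the project directory is equivalent to executing scrapy crawl kfc on the command line; it simply makes it convenient to start the spider from an IDE.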
    
  12. Running result: the store data for the entered city is printed by the pipeline.

Origin blog.csdn.net/sallyyellow/article/details/130206083