Implementing a POST request crawler with the Scrapy framework
Preface
This article shows how to implement a POST request crawler with the Scrapy framework, using the scraping of KFC store information for a specified city as a worked example.
Main text
1. How the Scrapy framework handles POST requests
The Scrapy framework provides the FormRequest() method for sending POST requests.
Compared with Request(), FormRequest() adds a formdata parameter, which accepts a dictionary or an iterable of tuples containing the form data and URL-encodes it into the body of the request.
POST request: `yield scrapy.FormRequest(url=post_url, formdata={}, meta={}, callback=...)`
Note: to make the spider's initial requests POST requests with FormRequest(), you must override the start_requests() method, because the default implementation issues plain GET requests for the URLs in start_urls.
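A minimal sketch of such a spider (the endpoint and form fields here are placeholders for illustration, not part of this article's example):

```python
import scrapy


class PostDemoSpider(scrapy.Spider):
    name = "post_demo"

    def start_requests(self):
        # Placeholder endpoint and form fields; replace with your target.
        post_url = "http://example.com/api/search"
        formdata = {"keyword": "scrapy", "page": "1"}  # values must be strings
        # FormRequest URL-encodes formdata into the POST body.
        yield scrapy.FormRequest(url=post_url, formdata=formdata, callback=self.parse)

    def parse(self, response):
        self.logger.info("Got %d bytes from %s", len(response.body), response.url)
```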
2. A Scrapy POST request case study
- Project requirements: scrape KFC store information for a specified city. The terminal prompts "Please enter the city:"; entering xx scrapes all KFC store data for city xx.
- Required data: store number, store name, store address, city, province
- URL address: http://www.kfc.com.cn/kfccda/storelist/index.aspx
- POST request URL: http://www.kfc.com.cn/kfccda/ashx/GetStoreList.ashx?op=cname
- F12 packet capture analysis: use the browser's developer tools (F12) to locate the data to be scraped, namely the store information and the total number of stores; the response shape is sketched below.
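Based on the fields used later in the spider, the JSON response looks roughly like this (a sketch reconstructed from the parsing code, not a verbatim capture):

```python
# Approximate response shape, inferred from the spider code below:
response_example = {
    "Table": [{"rowcount": 123}],  # total number of stores in the city
    "Table1": [                    # one dict per store on the current page
        {
            "rownum": 1,
            "storeName": "...",
            "addressDetail": "...",
            "cityName": "...",
            "provinceName": "...",
        },
    ],
}
```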
- Inspect the form data: from the same packet capture, read the form fields the page sends with the POST request (cname, pid, pageIndex, pageSize); a quick way to verify them is sketched below.
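Before writing the spider, you can confirm the endpoint and form fields with a quick standalone request (a sketch using the requests library; the User-Agent value is an illustrative assumption):

```python
import requests

post_url = "http://www.kfc.com.cn/kfccda/ashx/GetStoreList.ashx?op=cname"
formdata = {
    "cname": "北京",   # city name
    "pid": "",
    "pageIndex": "1",
    "pageSize": "10",
}
# A browser-like User-Agent; the exact value is illustrative.
headers = {"User-Agent": "Mozilla/5.0"}

resp = requests.post(post_url, data=formdata, headers=headers)
print(resp.json()["Table"][0]["rowcount"])  # total store count for the city
```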
- Create a Scrapy project and write the items.py file:
```python
import scrapy


class KfcspiderItem(scrapy.Item):
    # store number
    rownum = scrapy.Field()
    # store name
    storeName = scrapy.Field()
    # store address
    addressDetail = scrapy.Field()
    # city
    cityName = scrapy.Field()
    # province
    provinceName = scrapy.Field()
```
- Write the spider file:
```python
import scrapy
import json
from ..items import KfcspiderItem


class KfcSpider(scrapy.Spider):
    name = "kfc"
    allowed_domains = ["www.kfc.com.cn"]
    post_url = 'http://www.kfc.com.cn/kfccda/ashx/GetStoreList.ashx?op=cname'
    city_name = input("Please enter the city name: ")
    # start_urls = ["http://www.kfc.com.cn/"]

    def start_requests(self):
        """
        Override start_requests() to get the total number of
        KFC stores in the given city.
        """
        formdata = {
            "cname": self.city_name,
            "pid": "",
            "pageIndex": '1',
            "pageSize": '10'
        }
        # dont_filter=True keeps this request's fingerprint out of the dupe
        # filter, so the identical page-1 request queued later in get_total()
        # is not dropped as a duplicate.
        yield scrapy.FormRequest(url=self.post_url, formdata=formdata,
                                 callback=self.get_total, dont_filter=True)

    def parse(self, response):
        """
        Parse and extract the individual store data.
        """
        html = json.loads(response.text)
        for one_shop_dict in html["Table1"]:
            item = KfcspiderItem()
            item["rownum"] = one_shop_dict['rownum']
            item["storeName"] = one_shop_dict['storeName']
            item["addressDetail"] = one_shop_dict['addressDetail']
            item["cityName"] = one_shop_dict['cityName']
            item["provinceName"] = one_shop_dict['provinceName']
            # One complete store record extracted; hand it to the item pipeline.
            yield item

    def get_total(self, response):
        """
        Compute the total number of pages and enqueue a request for each page.
        """
        html = json.loads(response.text)
        count = html['Table'][0]['rowcount']
        total_page = count // 10 if count % 10 == 0 else count // 10 + 1
        # Hand every page's request to the scheduler for queuing.
        for page in range(1, total_page + 1):
            formdata = {
                "cname": self.city_name,
                "pid": "",
                "pageIndex": str(page),
                "pageSize": '10'
            }
            yield scrapy.FormRequest(url=self.post_url, formdata=formdata,
                                     callback=self.parse)
```
- Write the settings file:
```python
BOT_NAME = "KFCSpider"

SPIDER_MODULES = ["KFCSpider.spiders"]
NEWSPIDER_MODULE = "KFCSpider.spiders"

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 1

# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en",
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.1 (KHTML, like Gecko)"
}

# Log level: DEBUG < INFO < WARNING < ERROR < CRITICAL
LOG_LEVEL = 'INFO'
# Write the log to a file
LOG_FILE = 'KFC.log'

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    "KFCSpider.pipelines.KfcspiderPipeline": 300,
}

# Set settings whose default value is deprecated to a future-proof value
REQUEST_FINGERPRINTER_IMPLEMENTATION = "2.7"
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
FEED_EXPORT_ENCODING = "utf-8"
```
- Print the items directly in the pipeline file, as sketched below.
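The article does not list the pipeline code; a minimal version that just prints each item might look like this (a sketch; the class name matches the ITEM_PIPELINES entry above):

```python
class KfcspiderPipeline:
    def process_item(self, item, spider):
        # Print each store record as it arrives from the spider.
        print(dict(item))
        return item
```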
- Create a run.py file to run the crawler:
```python
from scrapy import cmdline

cmdline.execute("scrapy crawl kfc".split())
```
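Alternatively, you can run `scrapy crawl kfc` directly from the project root; run.py simply wraps that command so the spider can be started from an IDE.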
- Running result: enter a city name at the prompt; the spider prints each store item through the pipeline and writes the log to KFC.log.