Use Scrapy to grab all the course names and prices under the target URL and save the data in JSON format. url=http://www.tanzhouedu.com/mall/course/initAllCourse
Observe the web page and analyze it:
It is an AJAX-loaded page: the data changes each time, but the URL does not change.
By inspecting the requests in the headers panel, we get the URL that is actually requested each time the next page is clicked.
Observation shows that on every page turn, the only request parameters that change are the offset value and the timestamp.
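As a sketch, the paginated request URL can be assembled from those two changing parts (the endpoint and parameter names are the ones observed in the headers; the helper function name is my own):

```python
import time

def build_page_url(offset):
    # Only params.offset and the millisecond timestamp (_) change between pages
    return ("http://www.tanzhouedu.com/mall/course/initAllCourse"
            "?params.offset=" + str(offset)
            + "&params.num=20&keyword=&_="
            + str(int(time.time() * 1000)))

print(build_page_url(20))
```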
1. Create a project
Use the command: scrapy startproject project_name to get the project folder, which contains the necessary components of a Scrapy project.
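A sketch of the layout that scrapy startproject generates, assuming the project is named tanzhou (matching the tanzhou.items import used later in this post):

```
tanzhou/                  # outer project folder
    scrapy.cfg            # project configuration file
    tanzhou/              # the project's Python module
        __init__.py
        items.py          # item (field) definitions -- step 2
        middlewares.py    # downloader/spider middlewares
        pipelines.py      # item pipelines -- step 4
        settings.py       # project settings
        spiders/          # spiders live here -- step 3
            __init__.py
```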
For specific file meanings, see the link: http://www.cnblogs.com/pythoner6833/p/9012292.html
2. Clearly grasp the target.
Edit the items.py file and define the data field names that need to be captured
The code is as follows:
# -*- coding: utf-8 -*-
# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html
import scrapy


class TanzhouItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    """
    Define the goal of the crawl; in this case only the title and price
    are crawled, so define two fields.
    """
    # Course price
    money = scrapy.Field()
    # Course name
    title = scrapy.Field()
3. Edit the crawler.
Go to the spiders folder and create a crawler file with the command: scrapy genspider spider_name start_url
You will get a file named spider_name.py, in which the logic of the crawler is written:
# -*- coding: utf-8 -*-
"""
Crawl all course names and prices under
http://www.tanzhouedu.com/mall/course/initAllCourse
and save them in JSON format.

Web page analysis:
It is an AJAX-loaded page: the data changes each time, but the URL does not.
By inspecting the headers, we get the link that is actually requested each
time the next page is clicked. Observation shows that on every page turn,
the only request parameters that change are the offset value and the timestamp.

1. First create a crawler project.
   Use the command: scrapy startproject pro_name   # pro_name is the project name
   After entering the command, a project folder named pro_name appears,
   containing the files a Scrapy project needs.
2. Define the crawl target: edit items.py and define the fields to be crawled.
3. Edit the crawler. Go to the spiders folder and create a crawler file
   with the command: scrapy genspider spider_name start_url
   This generates a spider_name.py file in the spiders folder. It contains
   name = 'spider_name'; the name is the unique identifier of the crawler
   and cannot be repeated. start_urls holds the crawler's first link
   (modifiable); the request returns a response, from which further links
   and data are parsed.
4. Pass the crawled data via yield to the pipelines.py file,
   and write the file-saving logic there.
5. Run the crawler with the command: scrapy crawl spider_name
Note: enable the headers and the pipeline in the settings file.
"""
import time

import scrapy

# Import the crawl targets (money and title) defined in items.py
from tanzhou.items import TanzhouItem


class TzSpider(scrapy.Spider):
    name = 'tz'  # Crawler name: the unique ID distinguishing it from other crawlers
    allowed_domains = ['tanzhouedu.com']  # Allowed domains

    # The first link the crawler requests on startup;
    # its response is passed to the parse function
    start_urls = ['http://www.tanzhouedu.com/mall/course/initAllCourse']
    offset = 0

    def parse(self, response):
        item = TanzhouItem()  # Instantiate an item object for the crawled fields

        # Parse the response with XPath to get the list of course nodes
        node_list = response.xpath('//div[@id="newCourse"]/div/div/ul/li')
        for node in node_list:
            # extract_first() takes the selector's value and returns a string
            item['money'] = node.xpath('./div/span/text()').extract_first()
            item['title'] = node.xpath('./a/@title').extract_first()
            # yield hands the item to the scrapy engine, which passes it
            # through the pipeline; pipelines.py saves the crawled results
            yield item

        if not node_list:
            # When the last page is passed, the XPath matches an empty list:
            # there are no more pages to crawl, so return to end the crawl
            return

        self.offset += 20  # The offset grows by 20 on each page turn

        # yield the new request to the scheduler, which hands it to the
        # downloader to fetch the page; callback=self.parse loops the crawl
        yield scrapy.Request(
            url="http://www.tanzhouedu.com/mall/course/initAllCourse"
                "?params.offset=" + str(self.offset)
                + "&params.num=20&keyword=&_="
                + str(int(time.time() * 1000)),
            callback=self.parse,
        )
4. Write the logic to save the data.
Write the logic to save the data in the pipelines.py file
# -*- coding: utf-8 -*-
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
import json


class TanzhouPipeline(object):
    """Write the logic for saving the crawled data."""

    def __init__(self):
        """Optional: do any initialization of parameters here."""
        pass

    def open_spider(self, spider):
        """Runs automatically when the spider starts: open the output file."""
        self.file = open("tz.json", 'w', encoding='utf-8')

    def process_item(self, item, spider):
        """Process and save each item thrown by yield."""
        # The item passed through the pipeline is an object;
        # convert it to a dict before serializing
        content = json.dumps(dict(item), ensure_ascii=False) + '\n'
        self.file.write(content)
        return item

    def close_spider(self, spider):
        """Runs automatically after the spider finishes: close the file."""
        self.file.close()
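Outside of Scrapy, the saving logic above boils down to writing one JSON object per line (the JSON Lines format). A minimal stdlib-only sketch, with a hypothetical sample item and my own function name:

```python
import json

def save_items(items, path):
    # One JSON object per line; ensure_ascii=False keeps non-ASCII text readable
    with open(path, 'w', encoding='utf-8') as f:
        for item in items:
            f.write(json.dumps(dict(item), ensure_ascii=False) + '\n')

# Hypothetical sample item, mirroring the money/title fields of TanzhouItem
save_items([{'money': '¥1024', 'title': 'Example course'}], 'tz_demo.json')
```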
5. Run the crawler.
Use the command: scrapy crawl spider_name
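As the spider's docstring notes, the pipeline (and any request headers) must be enabled in the settings file before running. A minimal settings.py sketch, assuming the project is named tanzhou; the User-Agent value is a hypothetical example:

```python
# settings.py (fragment)

# Enable the pipeline so process_item actually runs
# (300 is the pipeline's priority; lower numbers run first)
ITEM_PIPELINES = {
    'tanzhou.pipelines.TanzhouPipeline': 300,
}

# Send browser-like default headers with every request
# (hypothetical User-Agent string)
DEFAULT_REQUEST_HEADERS = {
    'User-Agent': 'Mozilla/5.0',
}
```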
Result:
A tz.json file holding the scraped results.