A simple little Scrapy project

Use Scrapy to grab all the course names and prices under the target URL, and save the data in JSON format. Target URL: http://www.tanzhouedu.com/mall/course/initAllCourse

Observe the web page and analyze it:

It is an AJAX-loaded page: the data changes with each page turn, but the URL in the address bar does not.
By viewing the information in the request headers, we can find the URL that is actually requested each time the next page is clicked.
Each time the page is turned, the only parts of the request that change are the offset value and the timestamp.
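
Based on the request URL used later in the spider code, the paginated request seems to follow a fixed pattern; here is a minimal sketch of building it (parameter names are taken from that code, not independently verified):

import time

BASE = "http://www.tanzhouedu.com/mall/course/initAllCourse"

def page_url(offset):
    """Build the AJAX request URL for one page of 20 courses."""
    timestamp = int(time.time() * 1000)  # millisecond timestamp, as seen in the captured request
    return (BASE + "?params.offset=" + str(offset)
            + "&params.num=20&keyword=&_=" + str(timestamp))

# e.g. page_url(0), page_url(20), page_url(40), ...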

1. Create a project

Use the command: scrapy startproject project_name to generate the project folder, which contains the necessary components of a Scrapy project

as follows:
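A typical generated layout (files may vary slightly with the Scrapy version; the project here is named tanzhou, judging by the import from tanzhou.items used later):

tanzhou/
    scrapy.cfg          # deployment / project configuration
    tanzhou/
        __init__.py
        items.py        # item field definitions (step 2)
        pipelines.py    # saving logic (step 4)
        settings.py     # project settings (enable headers and pipelines here)
        spiders/        # crawler code lives here (step 3)
            __init__.py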

For specific file meanings, see the link: http://www.cnblogs.com/pythoner6833/p/9012292.html

2. Define the crawl target clearly.

Edit the items.py file and define the data field names that need to be captured

The code is as follows:

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class TanzhouItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    """
    Define the crawl target. In this case only the course title and price are crawled,
    so two fields are defined.
    """
    # Course price
    money = scrapy.Field()
    # Course name
    title = scrapy.Field()

 

3. Edit the crawler.

Go into the spiders folder and create a crawler file with the command: scrapy genspider spider_name start_url

You will get a file named spider_name.py, in which the logic of the crawler is written.
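
For this project the spider is named tz and limited to tanzhouedu.com (both taken from the code below), so the command would be roughly:

scrapy genspider tz tanzhouedu.com

The resulting spiders/tz.py, filled in with the crawling logic, looks like this: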

# -*- coding: utf-8 -*-
"""
Crawl all course names and prices under:
http://www.tanzhouedu.com/mall/course/initAllCourse
and save them in JSON format.

Web Analysis:
It is an AJAX-loaded page: the data changes with each page turn, but the URL does not change.
By looking at the information in the request headers, we get the URL that is actually requested each time the next page is clicked.
Each time the page is turned, the only parts of the request that change are the offset value and the timestamp.


1. First create a crawler project.
    Use the command: scrapy startproject 'pro_name' # pro_name is the project name
    After entering the command, a project folder named pro_name is created automatically;
    it contains the necessary files for a Scrapy project.

2. Define the crawling target, edit the items.py file, and define the fields to be crawled.

3. Edit the crawler. Go to the spiders folder and create a crawler file.
    Use the command: scrapy genspider 'spider_name' 'start_url'
    This generates a crawler named spider_name whose initial crawl URL is start_url.
    A spider_name.py file is created in the spiders folder.
    It contains name = 'spider_name'; the name uniquely identifies the crawler and must not be repeated.
    start_urls holds the first URL the crawler requests (it can be modified); the request returns a response,
    from which further links and data are parsed.

4. Yield the crawled data so that it is handed over to pipelines.py to be saved;
    the logic for saving the data is written in the pipelines.py file.

5. Run the crawler, use the command: scrapy crawl "spider_name"

Note: enable the default request headers and the item pipelines in the configuration file (settings.py)
"""

import scrapy

# Import the crawl targets (money and title) defined in the items file
from tanzhou.items import TanzhouItem
import time

class TzSpider(scrapy.Spider):
    name = 'tz'  # Crawler name: the unique ID that distinguishes it from other crawlers
    allowed_domains = ['tanzhouedu.com']  # Allowed domains

    # The first URL the crawler requests when it starts; the response is returned to the parse function
    start_urls = ['http://www.tanzhouedu.com/mall/course/initAllCourse']
    offset = 0

    def parse(self, response):
        item = TanzhouItem()  # Instantiate an item object for the fields to be crawled

        # Parse the response with XPath and extract the data, getting a list of selector objects
        node_list = response.xpath('//div[@id="newCourse"]/div/div/ul/li')
        for node in node_list:
            # extract_first() takes the value out of the selector and returns a string
            item['money'] = node.xpath('./div/span/text()').extract_first()
            item['title'] = node.xpath('./a/@title').extract_first()

            yield item
            # yield returns the item; the scrapy engine passes it through the pipeline
            # to pipelines.py, where the crawled results are saved

        if not node_list:
            # When the last page has been passed, the XPath matches an empty list;
            # there are no more pages to crawl, so return ends the crawl.
            return

        self.offset += 20  # Construct the changing offset, which increases by 20 with each page turn

        # yield hands the new request to the scheduler, which gives it to the downloader to fetch the page;
        # callback sends the response back to parse, achieving cyclic crawling
        yield scrapy.Request(
            url="http://www.tanzhouedu.com/mall/course/initAllCourse?params.offset="
                + str(self.offset) + "&params.num=20&keyword=&_=" + str(int(time.time() * 1000)),
            callback=self.parse)

 

4. Write the logic to save the data.

Write the logic to save the data in the pipelines.py file

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html

import json

class TanzhouPipeline(object):
    """
    Write the logic for saving the crawled data
    """
    def __init__(self):
        """
        Optional implementation, do some initialization processing for parameters
        """
        pass

    def open_spider(self, spider):
        """
        Override open_spider, which is executed automatically when the crawler starts
        :param spider:
        :return:
        """
        self.file = open("tz.json", 'w', encoding='utf-8')

    def process_item(self, item, spider):
        """
        Process each item yielded by the spider and save it
        :param item:
        :param spider:
        :return:
        """ 
        # The item passed to the pipeline is an Item object; convert it to a dict, then store it
        content = json.dumps(dict(item), ensure_ascii=False) + '\n'
        self.file.write(content)
        return item

    def close_spider(self, spider):
        """
        Override close_spider; it is executed automatically after the crawler finishes
        :param spider:
        :return:
        """
        self.file.close()
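
As the note in the spider's docstring says, the pipeline (and, if needed, default request headers) must be enabled in settings.py before any items reach it. A minimal sketch, assuming the project is named tanzhou; the header values are just illustrative:

# settings.py (excerpt)

# Enable the pipeline; the number (0-1000) controls execution order
ITEM_PIPELINES = {
    'tanzhou.pipelines.TanzhouPipeline': 300,
}

# Optional browser-like default headers sent with every request
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
}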

 

5. Run the crawler.

Use the command: scrapy crawl "spider_name"
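
With the spider name tz defined above, that is:

scrapy crawl tz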

Running result:

You get a JSON file (tz.json, as opened in the pipeline) that holds the scraped results.
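
Because the pipeline writes one json.dumps(dict(item)) per line, each line of tz.json is a small JSON object with the two fields, for example (the values here are placeholders, not real scraped data):

{"money": "<course price>", "title": "<course name>"}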

 
