Basic use of Scrapy crawlers and a stock-data Scrapy crawler

Common commands for Scrapy crawlers 

Scrapy command-line format

scrapy <command> [options] [args]

The three most commonly used commands, highlighted in red in the original table, are startproject, genspider, and crawl.

Why does Scrapy use the command line to create and run crawlers?

The command line (rather than a graphical interface) is easier to automate and better suited to scripted control. In essence, Scrapy is aimed at programmers, for whom functionality matters more than a user interface.
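For reference, here is a brief, non-exhaustive summary of the commonly used commands (run scrapy -h for the full list in your installation):

scrapy startproject <name>         # create a new crawler project
scrapy genspider <name> <domain>   # generate a new spider inside a project
scrapy settings [options]          # show the crawler's configuration values
scrapy crawl <spider>              # run a spider
scrapy list                        # list all spiders in the project
scrapy shell [url]                 # start an interactive scraping console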

Basic use of Scrapy crawlers

Using the Scrapy crawler framework is mainly a matter of writing configuration code.

Step 1: Create a Scrapy crawler project

Select a directory and execute the following command

scrapy startproject python123demo
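This command typically creates a project directory with the following layout (a sketch of the standard Scrapy project template; minor details vary between versions):

python123demo/
    scrapy.cfg             # deployment configuration file
    python123demo/         # the project's Python module
        __init__.py
        items.py           # item definitions
        middlewares.py     # spider and downloader middlewares
        pipelines.py       # item pipelines
        settings.py        # project settings
        spiders/           # directory where the spiders live
            __init__.py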

 Step 2: Generate a Scrapy crawler in the project

To generate a Scrapy spider in a project you only need to run one command, but the command requires two arguments: the name of the spider and the website to crawl.

 Enter the project directory, and then execute the following command:

scrapy genspider demo python123.io

This command generates a spider named demo whose crawl is limited to the domain python123.io.

Of course, this file can also be written by hand instead of being generated.

name = "demo" the name of the crawler
allowed_domains = ["python123.io"] The domain name submitted by the user to the command line at the beginning means that when the crawler crawls the website, it can only crawl the relevant links under this domain name
start_urls = ["http://python123.io/"] One or more domain names contained in the form of a list are the initial pages of the pages to be crawled by the scrapy framework
def parse(self, response): parse() is used to process the response, parse the content to form a dictionary, and discover new URL crawling requests
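Taken together, the generated spiders/demo.py skeleton looks roughly like this (the exact template produced by genspider varies slightly between Scrapy versions):

import scrapy


class DemoSpider(scrapy.Spider):
    name = "demo"
    allowed_domains = ["python123.io"]
    start_urls = ["http://python123.io/"]

    def parse(self, response):
        # generated as an empty stub; filled in during Step 3
        pass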

 Step 3: Configure the generated spider crawler

Configuration here means setting (1) the initial URL(s) to crawl and (2) the parsing method applied to each page once it has been fetched.
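A minimal sketch of such a configuration, consistent with the demo.html output described in Step 4 (the exact start URL is an assumption for illustration):

import scrapy


class DemoSpider(scrapy.Spider):
    name = "demo"
    # illustrative start URL pointing at a single demo page
    start_urls = ["http://python123.io/ws/demo.html"]

    def parse(self, response):
        # save the page body to a local file named after the last segment of the URL
        fname = response.url.split('/')[-1]
        with open(fname, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s.' % fname)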

Step 4: Run the crawler to fetch the web page. On the command line, execute the following command:

scrapy crawl demo

Note: be sure to run this command from the project directory, because Scrapy needs to locate and execute the demo spider defined in demo.py.

One more thing to note: the spider code must be written in the demo.py file under the spiders directory.

After running, if [scrapy] INFO: Spider closed (finished) appears in the printed log, the crawl completed successfully. A demo.html file then appears in the current folder, containing the full source code of the web page we just crawled.

 

Use of the yield keyword

A generator works like this: the for loop starts executing, and when the line containing yield is reached the function is frozen and the value produced by that yield is returned. So the first time the function is iterated it produces one value, the square of i when i equals 0, and each subsequent iteration resumes the loop and produces the next value.

 

Using the yield keyword, the function becomes a generator and can be iterated over to produce every value in the sequence, one at a time; with return, the function body executes only once, returns a single value, and exits, so the remaining values are never produced.

 

The advantage: with the plain approach we would have to compute and store all of these values up front, which takes a lot of memory, whereas a generator built with yield only ever occupies the space of a single element.
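A minimal sketch of the two approaches (the function names gen and square are illustrative):

# Generator: yields one square at a time; only one value exists in memory at once
def gen(n):
    for i in range(n):
        yield i ** 2

# Plain function: computes and stores all n squares before returning them
def square(n):
    return [i ** 2 for i in range(n)]

for v in gen(5):
    print(v)   # prints 0, 1, 4, 9, 16, one value per iteration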

Basic use of Scrapy crawlers

Scrapy crawler data types

The framework works with three main classes:

Request: an object representing an HTTP request, generated by a Spider and executed by the Downloader.

Response: an object representing an HTTP response, produced by the Downloader and handed back to the Spider for parsing.

Item: an object representing the information extracted from an HTML page, generated by a Spider and processed by the item pipelines.
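A small illustrative sketch of these three types (the HTML body and the item field are made up for demonstration):

import scrapy
from scrapy.http import HtmlResponse

# Request: created in a Spider, executed by the Downloader
request = scrapy.Request(url="http://python123.io/ws/demo.html")
print(request.url, request.method)        # other attributes: .headers, .body, .meta

# Response: returned by the Downloader and passed to the spider's callback
response = HtmlResponse(url=request.url, body=b"<html><body>demo</body></html>")
print(response.url, response.status)      # other attributes: .headers, .body, .request

# Item: a dict-like container for extracted data; a plain dict (like infoDict in
# the stock spider below) can also be yielded from a callback
class StockItem(scrapy.Item):
    name = scrapy.Field()

print(dict(StockItem(name="demo")))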

Stock Data Scrapy Crawler 

Functional description

Goal: obtain the names and trading information of all stocks listed on the Shanghai Stock Exchange and the Shenzhen Stock Exchange. Output: save the results to a file.

Identification of Data Sites

Get a list of stocks:

Eastmoney market center (quote.eastmoney.com): quotes for stocks, funds, futures, US stocks, Hong Kong stocks, foreign exchange, gold, and bonds

Xueqiu quote center (xueqiu.com): the latest stock quotes and price trends

This is a Python script written using the Scrapy framework to scrape information about stocks from a website.

  1. The script imports the Scrapy and BeautifulSoup modules, as well as the re module.
  2. The StockSpider class is a subclass of the Spider class, which is used to define the behavior of the crawler. In this class, the name attribute defines the name of the crawler, and the start_urls attribute defines the URL that the crawler starts crawling.
  3. The parse method is the default callback for processing the response. It uses response.css to obtain the href attribute of every a tag in the response, and then applies a regular expression to extract the stock code. If a stock code is found, the URL of that stock's detail page is constructed and a new request is sent, with parse_stock as its callback.
  4. The parse_stock method handles the response for a stock's detail page. It parses the HTML in the response using regular expressions and the BeautifulSoup library, extracting the stock's name and its details, and finally yields the information as a dictionary.

In summary, this script uses Scrapy and BeautifulSoup modules to scrape stock information, including stock names and details.

# -*- coding: utf-8 -*-
import scrapy
from bs4 import BeautifulSoup
import re

class StockSpider(scrapy.Spider):
    name = 'stock'
    # allowed_domains = ['quote.eastmoney.com']
    start_urls = ['http://quote.eastmoney.com/stock_list.html']

    def parse(self, response):
        # Collect the href attribute of every <a> tag on the stock list page
        for href in response.css('a::attr(href)').extract():
            try:
                # Match stock codes such as sh600000 or sz000001
                stock = re.search(r"[s][hz]\d{6}", href).group(0)
                stock = stock.upper()
                # Build the Xueqiu detail-page URL, e.g. https://xueqiu.com/S/SH600000
                url = 'https://xueqiu.com/S/' + stock
                yield scrapy.Request(url, callback=self.parse_stock)
            except AttributeError:
                # re.search returned None: this link is not a stock code, skip it
                continue

    def parse_stock(self, response):
        infoDict = {}
        if not response.text:   # nothing to parse in an empty response
            return
        try:
            # Stock name, taken from <div class="stock-name">...</div>
            name = re.search(r'<div class="stock-name">(.*?)</div>', response.text).group(1)
            infoDict.update({'股票名称': name})   # key means "stock name"
            # The quote table is embedded in the page source as an HTML string
            tableHtml = re.search(r'"tableHtml":"(.*?)",', response.text).group(1)
            soup = BeautifulSoup(tableHtml, "html.parser")
            table = soup.table
            for i in table.find_all("td"):
                line = i.text
                # Note: split on the full-width Chinese colon (：), not the ASCII colon (:)
                l = line.split("：")
                infoDict.update({l[0]: l[1]})
            yield infoDict
        except Exception:
            print("error")

The following code defines two item pipeline classes, which process the data after the crawler has scraped it. The details are as follows:

  1. DemoPipeline class: an empty pipeline class that simply returns each item unchanged, without any processing.
  2. stockPipeline class: a custom pipeline class that stores the scraped data in a file. Its open_spider() method is called when the spider starts and opens a text file ('XueQiuStock.txt') for writing; close_spider() is called when the spider finishes and closes the file; process_item() is called for every item the crawler produces, converts the item into a dictionary, writes it to the file, and then returns the item.

In short, this code defines two item pipeline classes with different behavior: DemoPipeline just passes items through, while stockPipeline writes them to a file.

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html


class DemoPipeline(object):
    def process_item(self, item, spider):
        return item

class stockPipeline(object):
    def open_spider(self, spider):
        # Called when the spider is opened: create the output file
        self.f = open('XueQiuStock.txt', 'w')

    def close_spider(self, spider):
        # Called when the spider is closed: release the file handle
        self.f.close()

    def process_item(self, item, spider):
        # Called for every scraped item: write it to the file as one line
        try:
            line = str(dict(item)) + '\n'
            self.f.write(line)
        except Exception:
            pass
        return item

The following code is the Scrapy project's configuration file, settings.py, which defines several crawler settings:

  1. BOT_NAME: the name of the crawler project, which can be any string.
  2. SPIDER_MODULES and NEWSPIDER_MODULE: the modules where the spider code lives. SPIDER_MODULES is a list of all modules containing spiders, and NEWSPIDER_MODULE is the module in which newly generated spiders are placed; here both point to demo.spiders.
  3. USER_AGENT: the user-agent string the crawler identifies itself with; it can be any string, and here one imitating the Chrome browser is used.
  4. ROBOTSTXT_OBEY: a boolean value indicating whether to obey the robots.txt protocol. If set to True, the crawler will not visit the parts of a website that robots.txt disallows.
  5. ITEM_PIPELINES: defines the set of item pipeline classes that process the data after the crawler scrapes it. Here only one pipeline class, stockPipeline, is enabled, with a value of 300; the values range from 0 to 1000, and when several pipelines are enabled, items pass through them in ascending order of this value (see the sketch after the settings file below).

# -*- coding: utf-8 -*-
BOT_NAME = 'demo'

SPIDER_MODULES = ['demo.spiders']
NEWSPIDER_MODULE = 'demo.spiders'


# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6"

# Obey robots.txt rules
ROBOTSTXT_OBEY = True

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
   'demo.pipelines.stockPipeline': 300,
}
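For illustration, if both pipeline classes from pipelines.py were enabled, the integer values would decide the order in which each item passes through them, with lower values running first (a hypothetical configuration):

ITEM_PIPELINES = {
   'demo.pipelines.DemoPipeline': 100,    # runs first (lower value = earlier)
   'demo.pipelines.stockPipeline': 300,   # runs second
}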

Effect

Running scrapy crawl stock from the project directory produces the file XueQiuStock.txt, with one dictionary of stock information per line.


Origin blog.csdn.net/weixin_64612659/article/details/130041047