Python crawler example: crawling stock information into a SQL database with Scrapy

Today I will share a crawler example I completed a while ago, which uses the Scrapy library and Docker to crawl stock information.
The goal of this case is to crawl stock information and store it in SQL Server.
It can be considered a must-learn case for getting started with crawlers. Without further ado, let's get to the good stuff.

First look at the libraries and tools we will use today:

from scrapy import Spider, Request
from scrapy_splash import SplashRequest
import scrapy
import re
from getStock.items import GetstockItem

The libraries used are:
scrapy, the crawler framework;
re, used to match the information we need on the page with regular expressions;
scrapy_splash, which helps us crawl dynamic websites
(if you don't know what a dynamic website is, the simplest explanation is that it is a site you cannot crawl directly with the requests library).
Note that scrapy_splash needs to be installed separately:

pip install scrapy_splash

Tools used:
Docker
SQL Server 2019

I believe everyone is already familiar with SQL, but Docker may be relatively new to you.

Docker is an open-source application container engine, written in Go and released under the Apache 2.0 license.
Its main application scenarios are:
1. Automated packaging and publishing of web applications.
2. Automated testing and continuous integration and release.
3. Deploying and scaling databases or other backend applications in a service-oriented environment.
4. Building your own PaaS environment, either from scratch or by extending an existing OpenShift or Cloud Foundry platform.

You can follow the links below to learn more about Docker:
Docker Simple Tutorial
Installing Docker and SQL Server 2019 under Win10
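
In this project, Docker's only job is to run Splash, the rendering service that scrapy_splash talks to. A minimal sketch of starting it, assuming the standard scrapinghub/splash image and its default port 8050:

docker pull scrapinghub/splash
docker run -d -p 8050:8050 scrapinghub/splash

Once the container is up, the Splash web console should be reachable at http://localhost:8050 (or at your Docker machine's IP).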

Next, let's move on to the code.
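
If you want to follow along, the project layout assumed here is the standard one generated by Scrapy; the project name getStock is taken from the import of getStock.items above, and the spider name stock comes from the spider class below (the genspider step is optional, since you can also create the spider file by hand):

scrapy startproject getStock
cd getStock
scrapy genspider stock quote.eastmoney.com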

headers = {
    'User-Agent': 'Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; en) Presto/2.8.131 Version/11.11'
}

# add a disguised header to our crawler so the website is less likely to detect it

Our goal is to first go to the Eastmoney website to get the codes of all the stocks, and then go to the Sohu stock site to get the information we need for each corresponding stock.
Right-click a stock code on the page and choose Inspect; this jumps to the position of that element in the page source.
Our goal is to extract the circled stock codes, which is easily done with re and CSS selectors.

class StockSpider(scrapy.Spider):
    name = 'stock'
    start_urls = ['http://quote.eastmoney.com']

    def start_requests(self):
        # render the stock list page through Splash
        url = 'http://quote.eastmoney.com/stock_list.html'
        yield SplashRequest(url, self.parse, headers=headers)

    def parse(self, response):
        # pull every link on the list page and keep the ones containing a 6-digit stock code
        for href in response.css('a::attr(href)').extract():
            try:
                stock = re.findall(r"\d{6}", href)[0]
                url = 'http://q.stock.sohu.com/cn/' + stock
                yield SplashRequest(url, self.parse1, args={'wait': 5}, headers=headers)
            except IndexError:
                # skip links that do not contain a stock code
                continue

Note that the URL we yield in parse is the Sohu link plus the stock code; this URL takes us to the information page for each stock.
For example, the URL for Ping An Bank (000001) is: https://q.stock.sohu.com/cn/000001
Next we extract the information we need from each stock page. For every stock I extracted: the stock code, stock name, current price, yesterday's closing price, opening price, highest price, lowest price, total transaction volume, industry, date, and the preliminary suggestion.
As before, right-click and Inspect to jump to the position of the element we need to extract.

Here you can find the stock information we need, and then extract it with simple, crude XPath expressions.

    def parse1(self, response):
        item = GetstockItem()

        try:
            # basic info (name, code, date, current price) sits in the header block of the page
            stockinfo = response.xpath('//*[@id="contentA"]/div[2]/div/div[1]')
            item['name'] = stockinfo.xpath('//*[@class="name"]/a/text()').extract()[0]
            item['code'] = stockinfo.xpath('//*[@class="code"]/text()').extract()[0].replace('(','').replace(')','')
            item['date'] = stockinfo.xpath('//*[@class="date"]/text()').extract()[0]
            item['nprice'] = float(stockinfo.xpath('//li[starts-with(@class,"e1 ")]/text()').extract()[0])
            # price details come from the FT_priceA2 table
            item['high'] = float(response.xpath('//*[@id="FT_priceA2"]/tbody/tr[1]/td[5]/span/text()').extract()[0])
            item['low'] = float(response.xpath('//*[@id="FT_priceA2"]/tbody/tr[2]/td[5]/span/text()').extract()[0])
            item['ed'] = float(response.xpath('//*[@id="FT_priceA2"]/tbody/tr[1]/td[7]/span/text()').extract()[0])
            item['op'] = float(response.xpath('//*[@id="FT_priceA2"]/tbody/tr[2]/td[7]/span/text()').extract()[0])
            item['volume'] = float(response.xpath('//*[@id="FT_priceA2"]/tbody/tr[2]/td[3]/span/text()').extract()[0].replace('亿',''))
            item['hangye'] = response.xpath('//*[@id="FT_sector"]/div/ul/li[1]/a/text()').extract()[0]
            suggests = response.xpath('//*[@id="contentA"]/div[2]/div/div[3]/div[2]/div[2]/div[1]/div[2]/table/tbody/tr[1]')
            item['suggest'] = suggests.xpath('//td[starts-with(@class,"td1 ")]/span/text()').extract()[0]

        except:
            # if any field is missing on the page, yield whatever has been collected so far
            pass

        yield item

That is all the code for the crawler itself. It looks simple, doesn't it? But for Scrapy to do its job, its configuration files also need to be set up.
settings.py:

ROBOTSTXT_OBEY = False  # originally True
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810
}
# change the DOWNLOADER_MIDDLEWARES configuration as shown
ITEM_PIPELINES = {
   'getStock.pipelines.GetstockPipeline': 300,
}
# the entry here must match the pipeline class defined in pipelines.py, otherwise an error is raised
HTTPCACHE_ENABLED = True
HTTPCACHE_EXPIRATION_SECS = 0
HTTPCACHE_DIR = 'httpcache'
SPLASH_URL = "http://192.168.5.185:8050/"
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
RETRY_HTTP_CODES = [500, 502, 503, 504, 400, 403, 404, 408]
# change SPLASH_URL to the Splash address exposed by your own Docker setup
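
As an optional sanity check (not part of the original project), you can hit Splash's render.html endpoint directly to confirm that the address in SPLASH_URL is reachable; a minimal sketch using the requests library:

import requests

# replace the host and port with your own SPLASH_URL
resp = requests.get(
    "http://192.168.5.185:8050/render.html",
    params={"url": "http://quote.eastmoney.com/stock_list.html", "wait": 5},
)
print(resp.status_code)  # 200 means Splash rendered the page successfully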

items.py settings:

import scrapy


class GetstockItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    name = scrapy.Field()
    code = scrapy.Field()
    nprice = scrapy.Field()
    op = scrapy.Field()
    ed = scrapy.Field()
    high = scrapy.Field()
    low = scrapy.Field()
    volume = scrapy.Field()
    date = scrapy.Field()
    hangye = scrapy.Field()
    suggest = scrapy.Field()

# every piece of information you want to take from the website needs a corresponding item field, otherwise the data cannot be passed back through the crawler

The final piece of configuration is pipelines.py. To store the scraped information in the database, we must configure it here. When you implement it yourself, remember to change the connection details to your own database information.

First, create a table in SQL Server to hold the data.
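
The original post shows the table only in a screenshot, so here is a minimal sketch of creating an equivalent table with pyodbc. The column names are taken from the INSERT and UPDATE statements in the pipeline below; the data types are my assumptions:

import pyodbc

# fill in your own server, user, password and database, as in the pipeline below
conn = pyodbc.connect("DRIVER={SQL SERVER};SERVER=your_server;UID=your_user;PWD=your_password;DATABASE=your_database")
cursor = conn.cursor()

cursor.execute("""
CREATE TABLE dbo.stock_data (
    股票代码 VARCHAR(10),     -- stock code
    股票名称 NVARCHAR(50),    -- stock name
    现时价格 FLOAT,           -- current price
    昨收价格 FLOAT,           -- yesterday's closing price
    开盘价格 FLOAT,           -- opening price
    最高价格 FLOAT,           -- highest price
    最低价格 FLOAT,           -- lowest price
    总交易额 FLOAT,           -- total transaction volume
    所属行业 NVARCHAR(50),    -- industry
    日期 NVARCHAR(20),        -- date
    初步建议 NVARCHAR(100)    -- preliminary suggestion
)
""")
conn.commit()
conn.close()

With the table in place, the pipeline itself looks like this: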

import pyodbc


class GetstockPipeline(object):
    def __init__(self):
        # fill in your own server name, user name, password and database name
        self.conn = pyodbc.connect("DRIVER={SQL SERVER};SERVER=your_server;UID=your_user;PWD=your_password;DATABASE=your_database")
        self.cursor = self.conn.cursor()

    def process_item(self, item, spider):
        try:
            sql = "INSERT INTO dbo.stock_data(股票代码,股票名称,现时价格,昨收价格,开盘价格,最高价格,最低价格,总交易额,所属行业,日期) VALUES('%s','%s','%.2f','%.2f','%.2f','%.2f','%.2f','%.2f','%s','%s')"
            data = (item['code'],item['name'],item['nprice'],item['ed'],item['op'],item['high'],item['low'],item['volume'],item['hangye'],item['date'])
            self.cursor.execute(sql % data)
            # try/except is used when filling in the preliminary suggestion, because some stocks
            # have no such suggestion; the crawler cannot extract it and would raise an error
            try:
                sql = "update dbo.stock_data set 初步建议='%s' where dbo.stock_data.股票代码=%s"
                data = (item['suggest'],item['code'])
                self.cursor.execute(sql % data)
                print('success')
            except:
                # '该股票暂无初步建议' means "no preliminary suggestion for this stock yet"
                sql = "update dbo.stock_data set 初步建议='该股票暂无初步建议' where dbo.stock_data.股票代码=%s"
                data = item['code']
                self.cursor.execute(sql % data)
                print("no preliminary suggestion for this stock")
            self.conn.commit()
            print('record written successfully')

        except Exception as ex:
            print(ex)
        return item
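
A side note that is not from the original post: pyodbc also supports parameterized queries with ? placeholders, which avoids the manual % formatting and the quoting pitfalls that come with it. A minimal sketch of the same insert written that way, inside process_item:

sql = ("INSERT INTO dbo.stock_data"
       "(股票代码,股票名称,现时价格,昨收价格,开盘价格,最高价格,最低价格,总交易额,所属行业,日期) "
       "VALUES (?,?,?,?,?,?,?,?,?,?)")
self.cursor.execute(sql, (item['code'], item['name'], item['nprice'], item['ed'], item['op'],
                          item['high'], item['low'], item['volume'], item['hangye'], item['date']))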

That is all of the code. Now all that is left is to run the crawler:

scrapy crawl stock
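
(As an aside, if you just want to inspect the scraped fields without touching the database, Scrapy can also dump the items to a file with its built-in feed export, e.g. scrapy crawl stock -o stocks.json, with the SQL pipeline temporarily disabled in settings.py.)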

All the stock information was crawled successfully. Now let's check whether it shows up in the database.
All information has been successfully stored in the database!

That is all for this case. The main difficulty was actually installing and getting Docker to work; it took me quite a while to get it running successfully. If you have any questions or better approaches, feel free to raise them in the comments, and I will reply as soon as I see them.

Thank you for reading!


Origin blog.csdn.net/kiligso/article/details/108716391