Crawling Lotto data with the Scrapy framework

GitHub project address: https://github.com/v587xpt/lottery_spider


Last time I crawled the double color ball data. The Lotto data could actually be crawled quite easily with requests alone, but to keep improving, this time the Lotto crawl uses the Scrapy framework.

I won't go over how the Scrapy framework works internally; if you are not familiar with it, look it up on Google first.



First, create a project

I develop on Windows, so Scrapy needs to be installed on Windows; the steps below assume the framework is already installed.
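If Scrapy is not installed yet, it can usually be installed with pip:

pip install scrapy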

1. Open cmd and run

scrapy startproject lottery_spider

This command generates a lottery_spider project in the directory where it is run.

2. Then run cd lottery_spider to enter the lottery_spider project, and execute

scrapy genspider lottery "www.lottery.gov.cn"

lottery is the name of the spider file;

www.lottery.gov.cn is the target site;

Once created, a spider file named lottery.py is generated in the project's spiders folder.
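For reference, the skeleton that scrapy genspider produces looks roughly like the following (the exact boilerplate can vary a little between Scrapy versions); the parse method is filled in later:

# -*- coding: utf-8 -*-
import scrapy

class LotterySpider(scrapy.Spider):
    name = 'lottery'
    allowed_domains = ['www.lottery.gov.cn']
    start_urls = ['http://www.lottery.gov.cn/']

    def parse(self, response):
        pass    # the crawling logic will go here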
 

Second, the code in each file of the project

1. items.py

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy

class LotterySpiderItem(scrapy.Item):
    qihao = scrapy.Field()
    bule_ball = scrapy.Field()
    red_ball = scrapy.Field()

This file defines the data model, i.e. the item fields: qihao, bule_ball, red_ball.
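A scrapy.Item behaves much like a dict, which is what the pipeline relies on later. Just as an illustration (run inside the project; the draw values below are made up, not real data):

from lottery_spider.items import LotterySpiderItem

item = LotterySpiderItem(qihao='19001',
                         bule_ball=['01', '05', '12', '18', '33'],
                         red_ball=['03', '09'])
print(dict(item))   # the item converts cleanly to a plain dict for serialization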
2. lottery.py

# -*- coding: utf-8 -*-
import scrapy
from lottery_spider.items import LotterySpiderItem

class LotterySpider(scrapy.Spider):
    name = 'lottery'
    allowed_domains = ['gov.cn']        # domains the spider may crawl; URLs outside them are ignored
    start_urls = ['http://www.lottery.gov.cn/historykj/history_1.jspx?_ltype=dlt']  # start page; crawling begins here

    def parse(self, response):
        # use xpath to select the result rows; this returns a list of selectors
        results = response.xpath("//div[@class='yylMain']//div[@class='result']//tbody//tr")
        for result in results:  # iterate over the rows with a for loop
            qihao = result.xpath(".//td[1]//text()").get()
            bule_ball_1 = result.xpath(".//td[2]//text()").get()
            bule_ball_2 = result.xpath(".//td[3]//text()").get()
            bule_ball_3 = result.xpath(".//td[4]//text()").get()
            bule_ball_4 = result.xpath(".//td[5]//text()").get()
            bule_ball_5 = result.xpath(".//td[6]//text()").get()
            red_ball_1 = result.xpath(".//td[7]//text()").get()
            red_ball_2 = result.xpath(".//td[8]//text()").get()

            bule_ball_list = []     # list holding the five blue balls
            bule_ball_list.append(bule_ball_1)
            bule_ball_list.append(bule_ball_2)
            bule_ball_list.append(bule_ball_3)
            bule_ball_list.append(bule_ball_4)
            bule_ball_list.append(bule_ball_5)

            red_ball_list = []      # list holding the two red balls
            red_ball_list.append(red_ball_1)
            red_ball_list.append(red_ball_2)

            print("===================================================")
            print("❤Issue: " + str(qihao) + " ❤" + "Blue balls: " + str(bule_ball_list) + " ❤" + "Red balls: " + str(red_ball_list))

            item = LotterySpiderItem(qihao=qihao, bule_ball=bule_ball_list, red_ball=red_ball_list)
            yield item

        next_url = response.xpath("//div[@class='page']/div/a[3]/@href").get()
        if not next_url:
            return
        else:
            last_url = "http://www.lottery.gov.cn/historykj/" + next_url
            yield scrapy.Request(last_url, callback=self.parse)  # pass the parse method itself here, without ()

This is the spider file that actually runs the crawl.
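A convenient way to verify the xpath expressions before running the full crawl is the Scrapy shell; a minimal sketch of such a session (the printed values depend on the live page):

scrapy shell "http://www.lottery.gov.cn/historykj/history_1.jspx?_ltype=dlt"

# inside the shell, test the same selectors the spider uses:
rows = response.xpath("//div[@class='yylMain']//div[@class='result']//tbody//tr")
first = rows[0]
print(first.xpath(".//td[1]//text()").get())    # issue number of the first row
print(first.xpath(".//td[2]//text()").get())    # first blue ball of that row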

3. pipelines.py

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
import json

class LotterySpiderPipeline(object):
    def __init__(self):
        print("Spider started......")
        self.fp = open("daletou.json", 'w', encoding='utf-8')  # open a json file

    def process_item(self, item, spider):
        item_json = json.dumps(dict(item), ensure_ascii=False)      # note: the item must be converted with dict() before serializing
        self.fp.write(item_json + '\n')
        return item

    def close_spider(self, spider):
        self.fp.close()
        print("Spider finished......")

This file is responsible for saving the data; the code above writes each crawled item to a json file.
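After a crawl, each line of daletou.json holds one draw as a JSON object. A small sketch for reading the file back (it assumes the crawl has already been run, so daletou.json exists):

import json

with open("daletou.json", encoding="utf-8") as fp:
    for line in fp:
        record = json.loads(line)   # one JSON object per line
        print(record["qihao"], record["bule_ball"], record["red_ball"])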
4. settings.py

# -*- coding: utf-8 -*-

# Scrapy settings for lottery_spider project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'lottery_spider'

SPIDER_MODULES = ['lottery_spider.spiders']
NEWSPIDER_MODULE = 'lottery_spider.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'lottery_spider (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False    # False: do not look for the site's robots.txt file

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 1      # throttle the crawl: one request per second
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {        # request headers for the spider, to mimic a browser request
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
    'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.120 Safari/537.36'
}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'lottery_spider.middlewares.LotterySpiderSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'lottery_spider.middlewares.LotterySpiderDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {    # uncomment this setting so that pipelines.py runs
   'lottery_spider.pipelines.LotterySpiderPipeline': 300,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

This file is the configuration file for the whole crawler project.
5. start.py

from scrapy import cmdline

cmdline.execute("scrapy crawl lottery".split())
# Equivalent to:
# cmdline.execute(["scrapy", "crawl", "lottery"])

This file is newly added; once it is set up, you can run the project by executing this script instead of typing the command in cmd each time.
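If you prefer not to go through the command-line module at all, an alternative sketch (not part of the original project) runs the spider in-process with Scrapy's CrawlerProcess API:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

from lottery_spider.spiders.lottery import LotterySpider

# Run this from the project directory so get_project_settings() can find scrapy.cfg;
# the loaded settings include ITEM_PIPELINES, so the json pipeline still runs.
process = CrawlerProcess(get_project_settings())
process.crawl(LotterySpider)
process.start()     # blocks until the crawl finishes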


Origin: blog.51cto.com/13577495/2445700