Python crawler case (5) ——— scrapy-1

1. Scrapy

Scrapy is an application framework written to crawl websites and extract structured data. It can be used for a wide range of purposes, such as data mining, information processing, or storing historical data.

2. Creating and running a Scrapy project

1. Create a Scrapy project: in the terminal run scrapy startproject <project name> (see the example after this list)
2. Project composition:

  • spiders
    • __init__.py
    • custom crawler file.py ‐‐‐》created by ourselves; the file that implements the core functionality of the crawler
  • __init__.py
  • items.py ‐‐‐》where the data structure is defined, as a class that inherits from scrapy.Item
  • middlewares.py ‐‐‐》middleware, e.g. for proxies
  • pipelines.py ‐‐‐》pipeline file; it contains one class by default, used for post-processing the downloaded data. The default priority is 300; the smaller the value, the higher the priority (range 1‐1000)
  • settings.py ‐‐‐》configuration file, e.g. whether to obey the robots protocol, the User‐Agent definition, etc.
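For example, creating a project called scrapy_demo (the name here is just a placeholder) produces roughly the following layout:

scrapy startproject scrapy_demo

scrapy_demo/
    scrapy.cfg
    scrapy_demo/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py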

3. Create a crawler file:

(1) Change into the spiders folder: cd <directory name>/<directory name>/spiders
(2) Run scrapy genspider <crawler name> <webpage domain name>

The basic composition of a crawler file:

  • Inherits from the scrapy.Spider class
    • name = 'baidu' ‐‐‐》the name used when running the crawler file
    • allowed_domains ‐‐‐》the domain names the crawler is allowed to visit; URLs outside these domains are filtered out while crawling
    • start_urls ‐‐‐》declares the starting address(es) of the crawler; you can list multiple URLs, but usually there is only one
    • response.text ‐‐‐》the response body as a string
    • response.body ‐‐‐》the response body as bytes
    • response.xpath() ‐‐‐》the xpath method returns a list of selectors
    • extract() ‐‐‐》extracts the data from the selector objects
    • extract_first() ‐‐‐》extracts the data from the first selector in the list
    • parse(self, response) ‐‐‐》callback function for parsing the data

Run the crawler file:
scrapy crawl <crawler name>   Note: it should be executed inside the spiders folder (see the example below)
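Putting the commands together, a typical session for the scrapy_demo project above might look like this (baidu and www.baidu.com are simply the values used in the next example):

cd scrapy_demo/scrapy_demo/spiders
scrapy genspider baidu www.baidu.com
scrapy crawl baidu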

4. Simple example

4.1 File structure

(Screenshot of the project file structure.)

4.2 Code

import scrapy

class BaiduSpider(scrapy.Spider):
    # the name of the crawler: the value used when running the crawler file
    name = 'baidu'
    # domain names the crawler is allowed to visit
    allowed_domains = ['www.baidu.com']
    # the starting url address, i.e. the first address to be visited
    # start_urls is allowed_domains with "http://" added in front
    #             and "/" added at the end
    start_urls = ['http://www.baidu.com/']

    # the method executed after the start_urls have been requested; response is the returned object,
    # equivalent to response = urllib.request.urlopen()
    #            or response = requests.get()
    def parse(self, response):
        print('苍茫的天涯是我的爱')
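To run this spider, execute scrapy crawl baidu from inside the project. Note that Baidu's robots.txt disallows generic crawlers, so with the default ROBOTSTXT_OBEY = True in settings.py the request may be filtered out and parse may never be called; for this demo you may need to set ROBOTSTXT_OBEY = False (or comment that line out).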

5. Simple example: extracting data with XPath

5.1 File structure

(Screenshot of the project file structure.)

5.2 Code

import scrapy


class CarSpider(scrapy.Spider):
    name = 'car'
    allowed_domains = ['car.autohome.com.cn']
    # if the url ends with .html, there is no need to add a trailing /
    start_urls = ['https://car.autohome.com.cn/price/brand-15.html']

    def parse(self, response):
        # //div[@class='main-title']/a/text()
        # //div[@class="main-lever"]//span/span/text()
        name_list = response.xpath('//div[@class="main-title"]/a/text()')
        price_list = response.xpath('//div[@class="main-lever"]//span/span/text()')
        print(name_list)  # prints a list of Selector objects
        '''
        The print above outputs a list of Selector objects, e.g.:
        [<Selector xpath='//div[@class="main-title"]/a/text()' data='宝马1系'>,
         <Selector xpath='//div[@class="main-title"]/a/text()' data='宝马3系'>,
         ...]

        The loop below prints the data value of every item in the list:
        宝马1系
        宝马3系
        宝马i3
        宝马5系
        宝马5系新能源
        宝马X1
        宝马X2
        宝马iX3
        宝马X3
        宝马X5
        宝马2系
        宝马4系
        宝马i4
        宝马5系(进口)
        '''
        for name in name_list:
            print(name.extract())
        print('============')
        print(price_list.extract_first())  # gets the data value of the first Selector in the list
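As a side note, newer Scrapy versions also provide get() and getall() as the preferred aliases for extract_first() and extract(), so the loop above could equivalently be written as:

for name in response.xpath('//div[@class="main-title"]/a/text()').getall():
    print(name)  # getall() returns plain strings instead of Selector objects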

6. Complete example

6.1 Code structure

(Screenshots of the project code structure.)

6.2 Code

dang.py

import scrapy
from sccrapy_dangdang_095.items import SccrapyDangdang095Item

class DangSpider(scrapy.Spider):
    name = 'dang'
    # allowed_domains = ['http://e.dangdang.com/list-AQQG-dd_sale-0-1.html']
    # start_urls = ['http://e.dangdang.com/list-AQQG-dd_sale-0-1.html']  # if the url ends with .html, there is no need to add a trailing /

    allowed_domains = ['category.dangdang.com']
    start_urls = ['http://category.dangdang.com/cp01.01.02.00.00.00.html']

    base_url = "http://category.dangdang.com/pg"
    page = 1

    def parse(self, response):
        # //div[@class="title"]/text()
        # //div[@class="price"]/span/text()

        # pipelines: download the data
        # items: define the data structure
        # every Selector object can call the xpath method again
        li_list = response.xpath('//ul[@id="component_59"]/li')

        for li in li_list:
            # lazily loaded images keep the real url in data-original
            src = li.xpath('.//img/@data-original').extract_first()
            # only the first image has its url in src
            if not src:
                src = li.xpath('.//img/@src').extract_first()

            name = li.xpath('.//img/@alt').extract_first()  # get the name
            price = li.xpath('.//p[@class="price"]/span[1]/text()').extract_first()  # get the price from the p tag

            book = SccrapyDangdang095Item(src=src, name=name, price=price)

            # every time we get a book, hand it to the pipelines
            yield book

        if self.page < 100:
            self.page = self.page + 1

            url = self.base_url + str(self.page) + '-cp01.01.02.00.00.00.html'

            # how to call the parse method again:
            # scrapy.Request is scrapy's GET request
            # url is the request address
            # callback is the function to execute -- note: do not add ()
            yield scrapy.Request(url=url, callback=self.parse)

        print('============')
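Because parse uses yield, it is a generator: every yielded book item is handed to the pipelines, and every yielded scrapy.Request is put back on the scheduler with parse as its callback, which is how the next page gets crawled by the same method.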


items.py

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class SccrapyDangdang095Item(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    # put simply: these are the fields of the data we want to download

    # image url
    src = scrapy.Field()
    # name
    name = scrapy.Field()
    # price
    price = scrapy.Field()
    pass

pipelines.py

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html


# useful for handling different item types with a single interface
from itemadapter import ItemAdapter

# to use a pipeline, it must be enabled in settings
class SccrapyDangdang095Pipeline:

    # method executed before the spider starts
    def open_spider(self, spider):
        self.fp = open('book.json', 'w', encoding='utf-8')

    # item is the book object yielded by the spider
    def process_item(self, item, spider):
        self.fp.write(str(item))
        return item

    # method executed after the spider has finished
    def close_spider(self, spider):
        self.fp.close()



import urllib.request

# opening multiple pipelines:
#   (1) define the pipeline class
#   (2) enable it in settings:
#       'sccrapy_dangdang_095.pipelines.DangDangDownloadPipeline': 301
class DangDangDownloadPipeline:
    def process_item(self, item, spider):

        url = 'http:' + item.get('src')
        # note: the ./books/ directory must already exist
        filename = './books/' + item.get('name') + '.jpg'
        urllib.request.urlretrieve(url=url, filename=filename)

        return item
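Two small pitfalls with these pipelines: urllib.request.urlretrieve does not create missing directories, and str(item) does not produce valid JSON. A minimal sketch of one way to handle both (the class names below are hypothetical, not part of the original project):

import json
import os
import urllib.request

class SafeDangDangDownloadPipeline:
    # hypothetical variant of DangDangDownloadPipeline
    def open_spider(self, spider):
        # urlretrieve does not create directories, so create ./books up front
        os.makedirs('./books', exist_ok=True)

    def process_item(self, item, spider):
        url = 'http:' + item.get('src')
        filename = './books/' + item.get('name') + '.jpg'
        urllib.request.urlretrieve(url=url, filename=filename)
        return item


class JsonLinesPipeline:
    # hypothetical variant that writes one valid JSON object per line instead of str(item)
    def open_spider(self, spider):
        self.fp = open('book.jsonl', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        self.fp.write(json.dumps(dict(item), ensure_ascii=False) + '\n')
        return item

    def close_spider(self, spider):
        self.fp.close()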

settings.py

# Scrapy settings for sccrapy_dangdang_095 project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'sccrapy_dangdang_095'

SPIDER_MODULES = ['sccrapy_dangdang_095.spiders']
NEWSPIDER_MODULE = 'sccrapy_dangdang_095.spiders'


# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'sccrapy_dangdang_095 (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = True

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'sccrapy_dangdang_095.middlewares.SccrapyDangdang095SpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'sccrapy_dangdang_095.middlewares.SccrapyDangdang095DownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
   # there can be multiple pipelines; each has a priority in the range 1 to 1000, and the smaller the value, the higher the priority
   'sccrapy_dangdang_095.pipelines.SccrapyDangdang095Pipeline': 300,
   # DangDangDownloadPipeline
   'sccrapy_dangdang_095.pipelines.DangDangDownloadPipeline': 301,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

The download succeeds.

(Screenshots of the downloaded results.)
