Python crawler series 2 ------- Scrapy directory structure introduction and configuration details

A detailed explanation of the Scrapy directory structure and configuration files

    Let's start with the architecture diagram (found online). Whether or not you understand it yet, form a first impression of it, then read it alongside the file directory and the explanations below; combined with later practice, the principles become clear at a glance.

(Scrapy architecture diagram)

  • A newly created Scrapy project has the following directory structure (the command that generates it is shown after the tree)
├── mySpider
│   ├── __init__.py
│   ├── items.py
│   ├── middlewares.py
│   ├── pipelines.py
│   ├── __pycache__
│   ├── settings.py
│   └── spiders
│       ├── __init__.py
│       └── __pycache__
└── scrapy.cfg
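    For reference, this skeleton is what the startproject command generates (a minimal sketch; the project name mySpider matches the tree above):
# Create the mySpider project and enter its directory
scrapy startproject mySpider
cd mySpider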
  • scrapy.cfg file
# Automatically created by: scrapy startproject
#
# For more information about the [deploy] section see:
# https://scrapyd.readthedocs.io/en/latest/deploy.html

[settings]
default = mySpider.settings

[deploy]
#url = http://localhost:6800/
project = mySpider
    This is the project's base configuration file: it identifies the settings module the crawler uses and holds deployment settings. The features the crawler enables, such as concurrency and pipeline files, are configured in that settings file (mySpider.settings).
  • __init__.py file, the Python package initialization file
    This file initializes the Python package. You can use the __all__ variable to control what the package exports, or leave the file empty, but it must exist, otherwise an import error is raised.
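    For illustration, a minimal sketch (not part of the generated project) of an __init__.py that limits what "from mySpider import *" exposes:
# mySpider/__init__.py -- hypothetical example; an empty file works just as well
__all__ = ['items', 'pipelines']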
  • items.py file
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class MyspiderItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    pass
    This file is commonly called the model file: it defines the fields an item stores. The code above is the generated stub; you define a field for each piece of data you want to collect and then access the data in whatever form suits you.
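    For example, a minimal sketch of an item with two fields (the field names name and url are illustrative, not required by Scrapy):
import scrapy


class MyspiderItem(scrapy.Item):
    # One Field per piece of data you want to collect; the names are up to you
    name = scrapy.Field()
    url = scrapy.Field()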
  • pipelines.py pipeline file
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html


class MyspiderPipeline(object):
    def process_item(self, item, spider):
        return item
    process_item is a fixed method name: when scraped data is handed to the pipeline for processing, it arrives in this method, which should return the item when it is done (or raise DropItem to discard it).
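    As a minimal sketch (assuming you want to store items as JSON Lines; the filename items.jl is arbitrary), a pipeline that writes every item to a file could look like this. It only runs once it is enabled in ITEM_PIPELINES in settings.py, shown further below:
import json


class MyspiderPipeline(object):
    def open_spider(self, spider):
        # Called once when the spider starts: open the output file
        self.file = open('items.jl', 'w', encoding='utf-8')

    def close_spider(self, spider):
        # Called once when the spider finishes: close the file
        self.file.close()

    def process_item(self, item, spider):
        # Write each item as one JSON line, then pass the item on
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + '\n')
        return item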
  • settings.py, the core configuration file, with contents as follows
# -*- coding: utf-8 -*-

# Scrapy settings for mySpider project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://doc.scrapy.org/en/latest/topics/settings.html
#     https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://doc.scrapy.org/en/latest/topics/spider-middleware.html
# Project name
BOT_NAME = 'mySpider'
# Where the project's spider modules live
SPIDER_MODULES = ['mySpider.spiders']
# Where new spiders created with genspider are placed
NEWSPIDER_MODULE = 'mySpider.spiders'


# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'mySpider (+http://www.yourdomain.com)'

# Obey robots.txt rules
# Whether the crawler obeys the robots.txt protocol; the default is to obey it
ROBOTSTXT_OBEY = True

# Configure maximum concurrent requests performed by Scrapy (default: 16)

# Maximum number of concurrent requests made by the crawler; the default is 16
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs

# Download delay of 3 seconds; downloading too fast puts pressure on the site, so set this as appropriate
#DOWNLOAD_DELAY = 3

# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)

# Whether to enable cookies. A double-edged sword: they are needed when crawling sites that require login, but if login is not required, disabling them hides your identity better
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:

# Default request headers; values set here (such as User-Agent) no longer need to be set in code
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html

# Spider middlewares
#SPIDER_MIDDLEWARES = {
#    'mySpider.middlewares.MyspiderSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html

# Downloader middlewares; the number indicates priority
#DOWNLOADER_MIDDLEWARES = {
#    'mySpider.middlewares.MyspiderDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://doc.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html

# ** Important: item pipelines, which determine how the scraped data is processed

#ITEM_PIPELINES = {
#    'mySpider.pipelines.MyspiderPipeline': 300,
#}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
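    Most of these settings ship commented out; you enable a feature by uncommenting and editing it. As a minimal sketch, here is how you might enable the pipeline defined earlier (the number is a priority between 0 and 1000; lower numbers run first) and add a small download delay:
ITEM_PIPELINES = {
    'mySpider.pipelines.MyspiderPipeline': 300,
}
# The 1-second delay is just an example value
DOWNLOAD_DELAY = 1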
  • The spiders directory, which holds the crawler (spider) files
cd spiders
# Create a spider file with the genspider command; the last argument is the domain of the target site
scrapy genspider myspider alinetgo.com

The generated spider template:

# -*- coding: utf-8 -*-
import scrapy


class MyspiderSpider(scrapy.Spider):
    name = 'myspider'
    allowed_domains = ['alinetgo.com']
    start_urls = ['http://alinetgo.com/']

    def parse(self, response):
        pass
    If this is not all clear yet, don't worry; it will be covered in detail later.
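    Still, as a preview, here is a minimal sketch of a filled-in parse method that yields the MyspiderItem sketched earlier (the CSS selector and the name/url fields are illustrative assumptions, not part of the generated template):
import scrapy

from mySpider.items import MyspiderItem


class MyspiderSpider(scrapy.Spider):
    name = 'myspider'
    allowed_domains = ['alinetgo.com']
    start_urls = ['http://alinetgo.com/']

    def parse(self, response):
        # Extract the page title and URL into an item and hand it to the pipeline
        item = MyspiderItem()
        item['name'] = response.css('title::text').get()
        item['url'] = response.url
        yield item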


To be continued...
