Scrapy project architecture and configuration files

Scrapy project architecture

-project              # project name
  -project            # folder with the same name as the project
    -spiders          # spiders generated by `scrapy genspider` are placed here
      -__init__.py
      -chouti.py      # chouti spider
      -cnblogs.py     # cnblogs spider
    -items.py         # similar to models.py in Django; item classes are defined here
    -middlewares.py   # middleware (spider middleware and downloader middleware) is written here
    -pipelines.py     # persistence logic (to a file, MySQL, Redis, MongoDB) is written here
    -settings.py      # project settings
  -scrapy.cfg         # configuration file used when deploying Scrapy
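The pipelines.py component in the tree above is just a plain Python class; Scrapy calls its `open_spider`, `process_item`, and `close_spider` methods by convention. A minimal sketch (the `FilePipeline` name and the in-memory storage are illustrative, not part of Scrapy):

```python
# A minimal pipeline sketch: Scrapy calls these three methods by convention.
# No scrapy import is needed; a pipeline is an ordinary Python class.

class FilePipeline:
    def open_spider(self, spider):
        # called once when the spider starts; open the output resource here
        self.lines = []

    def process_item(self, item, spider):
        # called for every item the spider yields; persist it here
        self.lines.append(str(item))
        return item  # return the item so pipelines with higher priority numbers also receive it

    def close_spider(self, spider):
        # called once when the spider closes; release the resource here
        self.result = "\n".join(self.lines)
```

In a real project the pipeline would write to a file or database instead of a list, and it must be registered in `ITEM_PIPELINES` in settings.py to take effect.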

Scrapy configuration file

settings.py

# Whether the crawler obeys the robots.txt protocol; set to False to crawl regardless
ROBOTSTXT_OBEY = False

# Request header User-Agent
USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36'

# With this setting, the program only logs error messages
LOG_LEVEL = 'ERROR'
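Other commonly used settings follow the same `KEY = value` pattern. For example, registering the pipeline from pipelines.py and tuning request rates (a sketch; the pipeline path and the numeric values are illustrative):

```python
# Register pipelines: the key is the import path, the value is a priority
# (lower numbers run earlier in the pipeline chain)
ITEM_PIPELINES = {
    'project.pipelines.FilePipeline': 300,
}

CONCURRENT_REQUESTS = 16   # maximum number of concurrent requests
DOWNLOAD_DELAY = 1         # seconds to wait between requests to the same site
```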

Spider file

import scrapy

class ChoutiSpider(scrapy.Spider):
    name = 'chouti'    # the unique name of each spider, used to distinguish spiders
    allowed_domains = ['dig.chouti.com']   # domains the spider is allowed to crawl
    start_urls = ['https://dig.chouti.com/']    # starting URLs; when the spider starts, requests are sent to these first

    def parse(self, response):   # when the response comes back, parse() is executed automatically; parsing is done in this method
        print('---------------------------', response)
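The fact that parse() is called automatically can be illustrated without running a crawl: Scrapy builds a response object for each downloaded page and passes it to the callback. A minimal stand-in (`FakeResponse` is an invented stub for demonstration, not part of Scrapy):

```python
# FakeResponse is a hypothetical stub standing in for scrapy.http.Response,
# just to show that parse(self, response) is an ordinary callback method.

class FakeResponse:
    def __init__(self, url, text):
        self.url = url
        self.text = text

class ChoutiSpider:
    name = 'chouti'

    def parse(self, response):
        # a real spider would use response.css() or response.xpath() here
        return {'url': response.url, 'length': len(response.text)}

resp = FakeResponse('https://dig.chouti.com/', '<html>...</html>')
item = ChoutiSpider().parse(resp)
```

In a real run, Scrapy sends requests to every URL in `start_urls` and invokes `parse` with each response; the method typically yields items (handed to the pipelines) or further `scrapy.Request` objects.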

 

Origin www.cnblogs.com/baohanblog/p/12675200.html