Scrapy project architecture
-project                  # project root folder, named after the project
    -project              # package with the same name as the project
        -spiders          # spiders generated by `scrapy genspider` go here
            -__init__.py
            -chouti.py    # chouti spider
            -cnblogs.py   # cnblogs spider
        -items.py         # like models.py in Django: item (model) classes are written here
        -middlewares.py   # middleware (spider middleware and downloader middleware) goes here
        -pipelines.py     # persistence code (to a local file, MySQL, Redis, MongoDB)
        -settings.py      # project settings
    -scrapy.cfg           # configuration file used when deploying Scrapy
Scrapy configuration file
settings.py
# Whether to obey the robots.txt protocol; set to False to crawl regardless
ROBOTSTXT_OBEY = False

# USER_AGENT request header
USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36'

# With this setting, the program only logs error messages
LOG_LEVEL = 'ERROR'
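Besides these project-wide options in settings.py, an individual spider can override settings through a `custom_settings` class attribute. A minimal sketch, assuming a spider named `cnblogs`; the `Spider` class below is a stand-in stub so the snippet runs without Scrapy installed (in a real project you would subclass `scrapy.Spider`):

```python
class Spider:
    # Stub standing in for scrapy.Spider, so this sketch is self-contained.
    pass

class CnblogsSpider(Spider):
    name = 'cnblogs'
    # These keys mirror the settings.py options above; values here win
    # over the project-wide ones, but only for this spider.
    custom_settings = {
        'ROBOTSTXT_OBEY': False,
        'LOG_LEVEL': 'ERROR',
    }

print(CnblogsSpider.custom_settings['LOG_LEVEL'])  # ERROR
```

This is handy when one spider needs a different log level or politeness policy than the rest of the project.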
Crawler program file
import scrapy

class ChoutiSpider(scrapy.Spider):
    name = 'chouti'  # unique name of this spider, used to tell spiders apart
    allowed_domains = ['dig.chouti.com']  # domains the spider is allowed to crawl
    start_urls = ['https://dig.chouti.com/']  # starting URLs; the spider sends requests here first

    def parse(self, response):
        # Default callback: executed automatically when a response comes back.
        # Do the parsing in this method.
        print('---------------------------', response)
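The contract above can be sketched without Scrapy installed: the engine downloads each URL in `start_urls` and hands the result to `parse(response)`, which yields items. Below, `StubResponse` is an illustrative stand-in for Scrapy's real `Response` object, and the regex replaces the `response.css()`/`response.xpath()` selectors you would use in practice:

```python
import re

class StubResponse:
    """Stand-in for scrapy.http.Response: just a url and its body text."""
    def __init__(self, url, text):
        self.url = url
        self.text = text

class ChoutiSpider:
    name = 'chouti'
    start_urls = ['https://dig.chouti.com/']

    def parse(self, response):
        # Real code would use response.css() / response.xpath();
        # a plain regex keeps this sketch dependency-free.
        for title in re.findall(r'<a[^>]*>(.*?)</a>', response.text):
            yield {'title': title}  # yielded dicts flow on to the pipelines

html = '<a href="/1">first</a><a href="/2">second</a>'
spider = ChoutiSpider()
items = list(spider.parse(StubResponse('https://dig.chouti.com/', html)))
print(items)  # [{'title': 'first'}, {'title': 'second'}]
```

Because `parse` is a generator, Scrapy can consume scraped items one at a time and pass each through the pipelines defined in pipelines.py.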