Configure settings.py
Disable obeying the Robots Exclusion Protocol (robots.txt):
ROBOTSTXT_OBEY = False
Set the User-Agent in the default request headers:
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3294.99 Safari/537.36',
}
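As a side note, Scrapy also exposes a dedicated USER_AGENT setting, so the same override could instead be written as a one-line settings.py entry (a sketch, assuming you prefer this over embedding it in DEFAULT_REQUEST_HEADERS):

```python
# settings.py — alternative: set the User-Agent via Scrapy's dedicated setting
USER_AGENT = ('Mozilla/5.0 (Windows NT 6.1; Win64; x64) '
              'AppleWebKit/537.36 (KHTML, like Gecko) '
              'Chrome/67.0.3294.99 Safari/537.36')
```

Either approach works; keeping it in DEFAULT_REQUEST_HEADERS as above groups all header overrides in one place.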
Add start.py
To run the spider conveniently from an IDE, create a start.py launcher file in the project directory (alongside settings.py):
from scrapy import cmdline
cmdline.execute("scrapy crawl wx_spider".split())
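The .split() idiom is worth noting: cmdline.execute() expects an argv-style list of strings (the same shape as sys.argv), so splitting the command string on whitespace produces exactly that. A quick standalone check:

```python
# str.split() with no arguments splits on runs of whitespace,
# turning the command string into an argv-style list
cmd = "scrapy crawl wx_spider".split()
print(cmd)  # ['scrapy', 'crawl', 'wx_spider']
```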
Directory tree
E:.
│  scrapy.cfg
│
└─BookSpider
    │  items.py
    │  middlewares.py
    │  pipelines.py
    │  settings.py
    │  start.py
    │  __init__.py
    │
    ├─spiders
    │  │  biqubao_spider.py
    │  │  __init__.py
    │  │
    │  └─__pycache__
    │          biqubao_spider.cpython-36.pyc
    │          __init__.cpython-36.pyc
    │
    └─__pycache__
            settings.cpython-36.pyc
            __init__.cpython-36.pyc
Add the following code to the spider to print out the page content:
#biqubao_spider.py
def parse(self, response):
    print("*" * 50)
    print(response.text)
    print("*" * 50)
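To see what this parse() body prints without launching a crawl, the same logic can be exercised with a stub in place of a real scrapy Response object (a stdlib-only sketch; the FakeResponse class and its sample HTML are illustrative, not part of the project):

```python
# Stub standing in for scrapy's Response: only the .text attribute is used here
class FakeResponse:
    text = "<html><body>hello</body></html>"

def parse(response):
    # mirrors the body of the spider's parse() method:
    # a banner line, the raw page HTML, and a closing banner
    print("*" * 50)
    print(response.text)
    print("*" * 50)

parse(FakeResponse())
```

In the real spider, Scrapy calls parse() with each downloaded response, so the page HTML appears between the two banner lines in the crawl log.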