Building a Scrapy crawler
1. Create a new project (on the command line, run: scrapy startproject xxx): this generates a new crawler project.
2. Open the project in PyCharm and look over the project directory.
3. Define your targets (edit items.py, which holds the data-model code): be clear about what you want to crawl.
4. Write the spider (spiders/xxspider.py): the code that actually crawls the pages.
(1) Generate the spider file; a new xxspider.py will appear under the spiders/ directory:
scrapy genspider xxx xxx.com
(2) Edit the spider file: send requests, handle responses, and extract the data (yield item).
Key spider attributes:
① name = 'tencent'  # spider name; required as the argument when starting the crawl
② allowed_domains = ['tencent.com']  # crawl scope: domains the spider is allowed to crawl (optional)
③ start_urls = []  # list of start URLs; the spider's first requests are taken from this list
5. Store the content (edit the pipeline file pipelines.py): design pipelines that store and process the items returned by the spider, e.g. persisting them locally.
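A pipeline is just a plain Python class with a process_item() method, so the sketch below runs without Scrapy itself; the file name and behavior (appending each item to a JSON Lines file) are my own example, not from the original post:

```python
import json


class JsonLinesPipeline:
    """Append each item to a local .jsonl file (simple local persistence)."""

    def open_spider(self, spider):
        # called once when the spider starts
        self.file = open('items.jsonl', 'a', encoding='utf-8')

    def close_spider(self, spider):
        # called once when the spider finishes
        self.file.close()

    def process_item(self, item, spider):
        # dict(item) also works for scrapy.Item objects
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + '\n')
        return item  # return the item so later pipelines still receive it
```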
6. Edit the settings file settings.py: enable the pipeline components and adjust related settings.
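Enabling a pipeline means registering its dotted path in ITEM_PIPELINES. A settings.py fragment might look like this ('myproject' and the pipeline class name are placeholders):

```python
# settings.py fragment: register pipelines by dotted path.
# The value (0-1000) is the execution order -- lower runs first.
ITEM_PIPELINES = {
    'myproject.pipelines.JsonLinesPipeline': 300,
}

ROBOTSTXT_OBEY = True  # respect robots.txt (the default in newly generated projects)
```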
7. Run the crawler:
scrapy crawl xxx
8. Four built-in ways to export the scraped data; specify the output file format with -o:
(1) JSON format (Unicode-escaped by default): scrapy crawl xxx -o xxx.json
(2) JSON Lines format (Unicode-escaped by default): scrapy crawl xxx -o xxx.jsonl
(3) CSV (comma-separated values; can be opened in Excel): scrapy crawl xxx -o xxx.csv
(4) XML format: scrapy crawl xxx -o xxx.xml
Reproduced from: https://www.jianshu.com/p/f94f4514e60d