Scrapy Crawler Basics: Getting Started

Building a Scrapy crawler

1. Create a new project (on the command line, run: scrapy startproject xxx): this creates a new crawler project

2. Open the project in PyCharm and look over the project directory


3. Define clear objectives (write items.py, which holds the data-model code): be explicit about what you want to crawl


4. Make the spider (spiders/xxspider.py): write the crawler that starts fetching web pages

(1) Create the spider file; a new xxspider.py will appear under the spiders directory:

scrapy genspider xxx xxx.com

(2) Write the spider file: send requests, handle responses, and extract the data (yield item)


Key spider attributes:

① name = 'tencent'  # spider name, required as the startup parameter when running the crawler

② allowed_domains = ['tencent.com']  # crawl scope: the spider is only allowed to crawl within these domains (optional)

③ start_urls = []  # list of start URLs; after the spider starts, its first requests are taken from this list

5. Store the content (write the pipeline file pipelines.py): design pipelines that store and process the item data returned by the spider, e.g. persisting it locally
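A common minimal pipeline persists each item to a local JSON Lines file; the sketch below assumes that pattern (the class name and output filename are illustrative):

```python
# pipelines.py -- persist each item to a local JSON Lines file
# (class name and output filename are illustrative assumptions)
import json


class TencentPipeline:
    def open_spider(self, spider):
        # called once when the spider starts
        self.file = open('items.jsonl', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # called for every item the spider yields
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + '\n')
        return item  # return the item so later pipelines can also process it

    def close_spider(self, spider):
        # called once when the spider finishes
        self.file.close()
```

Note that process_item must return the item (or raise DropItem), otherwise any lower-priority pipelines never see it.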

6. Edit the settings.py file: enable the pipeline component and make other related settings

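Enabling the pipeline amounts to registering its dotted path in ITEM_PIPELINES; a sketch, assuming the project is named xxx and the pipeline class from pipelines.py is called TencentPipeline:

```python
# settings.py -- enable the pipeline component
# ('xxx' is the project name from `scrapy startproject xxx`;
#  TencentPipeline is an assumed class name)
ITEM_PIPELINES = {
    # the number sets execution order (0-1000): lower values run first
    'xxx.pipelines.TencentPipeline': 300,
}
```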

7. Run the crawler:

scrapy crawl xxx

8. Four ways to store the crawled data; specify the output file format with -o:

(1) JSON format, Unicode-encoded by default: scrapy crawl xxx -o xxx.json

(2) JSON Lines format, Unicode-encoded by default: scrapy crawl xxx -o xxx.jsonl

(3) CSV (comma-separated values), can be opened in Excel: scrapy crawl xxx -o xxx.csv

(4) XML format: scrapy crawl xxx -o xxx.xml

Reproduced from: https://www.jianshu.com/p/f94f4514e60d


Origin: blog.csdn.net/weixin_34200628/article/details/91093605