Python day_09

Scrapy related

Scrapy is a crawler framework:
send request ---> fetch response data ---> parse data ---> store data

** Scrapy framework introduction **

1. Engine (ENGINE)
The engine is responsible for controlling the data flow between all components of the system and for triggering events when certain actions occur. For details, see the data flow section of the official documentation.

2. Scheduler (SCHEDULER)
Receives requests sent over by the engine, pushes them into a queue, and returns them when the engine asks for them again. It can be thought of as a priority queue of URLs: it decides which URL to crawl next, and it also removes duplicate URLs.
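The scheduler's two jobs above (queueing requests and filtering duplicates) can be illustrated with a toy model in plain Python. This is only a sketch of the idea; Scrapy's real scheduler uses request fingerprints and pluggable queue backends, and the `Scheduler` class below is invented for illustration:

```python
from collections import deque

class Scheduler:
    """Toy model of a crawler scheduler: a request queue with duplicate filtering."""

    def __init__(self):
        self.queue = deque()
        self.seen = set()  # URLs that have already been enqueued

    def enqueue_request(self, url):
        if url in self.seen:  # duplicate URL: silently drop it
            return False
        self.seen.add(url)
        self.queue.append(url)
        return True

    def next_request(self):
        # hand the next URL back to the engine, or None if the queue is empty
        return self.queue.popleft() if self.queue else None

sched = Scheduler()
sched.enqueue_request("http://example.com/a")
sched.enqueue_request("http://example.com/b")
sched.enqueue_request("http://example.com/a")  # duplicate, dropped
print(len(sched.queue))  # 2 requests remain queued
```

In real Scrapy, this dedup behavior is why a spider does not re-download a URL it has already requested unless `dont_filter=True` is set on the request.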

3. Downloader (DOWNLOADER)
Downloads web content and returns it to the ENGINE. The downloader is built on Twisted, an efficient asynchronous model.

4. Spiders (SPIDERS)
SPIDERS are developer-defined classes used to parse responses, extract items, or send new requests.

5. Item pipelines (ITEM PIPELINES)
Responsible for processing items after they are extracted, including cleaning, validation, persistence (for example, to a database), and other operations.
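The clean/validate/persist steps above can be sketched as a pipeline class. In real Scrapy a pipeline implements `process_item(self, item, spider)` and raises `scrapy.exceptions.DropItem` to discard an item; here `DropItem` is a local stand-in and `CleaningPipeline` is an invented name, so the sketch runs without Scrapy installed:

```python
class DropItem(Exception):
    """Stand-in for scrapy.exceptions.DropItem."""

class CleaningPipeline:
    """Sketch of an item pipeline: clean, validate, then persist an item."""

    def __init__(self):
        self.stored = []  # stand-in for a database table

    def process_item(self, item, spider):
        # cleaning: strip whitespace from string fields
        item = {k: v.strip() if isinstance(v, str) else v for k, v in item.items()}
        # validation: discard items that have no title
        if not item.get("title"):
            raise DropItem("missing title")
        # persistence: a real pipeline might insert into a database here
        self.stored.append(item)
        return item

pipeline = CleaningPipeline()
pipeline.process_item({"title": "  Scrapy notes  ", "url": "http://example.com"}, spider=None)
print(pipeline.stored[0]["title"])  # "Scrapy notes"
```

Returning the item from `process_item` is what lets several pipelines be chained, each receiving the previous one's output.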
Downloader middlewares (Downloader Middlewares) sit between the Scrapy engine and the downloader. They mainly process requests passed from the ENGINE to the DOWNLOADER, and responses passed from the DOWNLOADER back to the ENGINE.
You can use this middleware to do several things:
  (1) process a request just before it is sent to the Downloader (i.e. right before Scrapy sends the request to the website);
  (2) change a received response before passing it to a spider;
  (3) send a new Request instead of passing the received response to a spider;
  (4) pass a response to a spider without fetching a web page;
  (5) silently drop some requests.
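Case (1) above, touching every request before it goes to the downloader, is the most common use. A real Scrapy middleware is just a class with `process_request`/`process_response` methods registered in `DOWNLOADER_MIDDLEWARES`; the sketch below models requests as plain dicts so it runs standalone, and the class name and header value are invented for illustration:

```python
class UserAgentMiddleware:
    """Sketch of a downloader middleware: set a header on every outgoing request."""

    def process_request(self, request, spider):
        # case (1): modify the request right before it is sent to the downloader
        request["headers"]["User-Agent"] = "my-crawler/1.0"
        return None  # None means: continue processing this request normally

    def process_response(self, request, response, spider):
        # case (2) would modify the response here; we pass it through unchanged
        return response

# requests and responses modeled as plain dicts for illustration
request = {"url": "http://example.com", "headers": {}}
mw = UserAgentMiddleware()
mw.process_request(request, spider=None)
print(request["headers"]["User-Agent"])  # my-crawler/1.0
```

In real Scrapy, returning a `Response` from `process_request` would short-circuit the download (case 4), and raising `IgnoreRequest` drops the request (case 5).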

6. Spider middlewares (Spider Middlewares)
Sit between the ENGINE and the SPIDERS; their main job is to process the SPIDERS' input (i.e., responses) and output (i.e., requests).

** Scrapy installation **
1. pip3 install wheel
2. pip3 install lxml
3. pip3 install pyopenssl
4. pip3 install pypiwin32
5. Install the Twisted framework
   Download Twisted:
   http://www.lfd.uci.edu/~gohlke/pythonlibs/#twisted
   Install the downloaded Twisted wheel:
   pip3 install <download directory>\Twisted-17.9.0-cp36-cp36m-win_amd64.whl

6. pip3 install scrapy

** Scrapy usage **
1. Open a cmd terminal
   - scrapy
   C:\Users\administortra> scrapy
   Scrapy 1.6.0 - no active project

2. Create a scrapy project
   1. Create a folder dedicated to storing scrapy projects
      - D:\Scrapy_prject
   2. Enter the command in the cmd terminal:
      scrapy startproject Spider_Project (project name)
      - This generates a folder under D:\Scrapy_prject:
        Spider_Project: the Scrapy project files

3. Create a spider
cd Spider_Project  # switch into the scrapy project directory
# spider name, target site domain
scrapy genspider baidu www.baidu.com  # create the spider
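The genspider command above creates a baidu.py file under the project's spiders/ directory. Its contents look roughly like the following (the exact template varies slightly between Scrapy versions):

```python
import scrapy

class BaiduSpider(scrapy.Spider):
    name = 'baidu'                          # name used by `scrapy crawl baidu`
    allowed_domains = ['www.baidu.com']     # domains the spider may crawl
    start_urls = ['http://www.baidu.com/']  # URLs crawling starts from

    def parse(self, response):
        # parse the response here: extract items or yield new requests
        pass
```

The `name` attribute is what the `scrapy crawl` command in the next step refers to, not the file name.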

4. Start the scrapy project and run the spider

# Locate the spider file and execute it directly
scrapy runspider can only execute a single spider .py file
# switch to the directory containing the spider file
- cd D:\Scrapy_prject\Spider_Project\Spider_Project\spiders
- scrapy runspider baidu.py

# Look up the spider by its name and execute it
scrapy crawl <spider name>
# switch to the project directory
- cd D:\Scrapy_prject\Spider_Project
- scrapy crawl baidu


Finally, thanks to teacher Tank for his dedicated teaching over the past few days; I have benefited greatly in my Python studies.


Origin www.cnblogs.com/moxianyinzou/p/11069520.html