Foreword
Scrapy is a crawler framework written in pure Python. Its simplicity, ease of use, and high extensibility have made it a mainstream tool for Python web scraping. This article is based on the latest official release, 1.6, and moves from basic usage toward a deeper look at how it works.
Before going any further: no tutorial explains things as precisely as the official documentation. If this article raises your interest in Scrapy and you want to understand its original design better, please get into the habit of reading the official docs!
Contents
This article covers the following topics:
- Why Scrapy?
- Hello Scrapy! (practice)
- How does Scrapy work?
In the first section, "Why Scrapy?", I will share my understanding of the business scenarios where Scrapy fits.
For the remaining two sections, my original intent was to cover "How does Scrapy work?" before "Hello Scrapy!", but not everyone wants to start with theory. So I put the hands-on demo first, hoping it raises your interest, since interest is what drives us to understand something more deeply. "How does Scrapy work?" therefore comes last, and it also leads into the next chapter on Scrapy's internals.
Why Scrapy?
Although Scrapy is designed to cover the vast majority of crawling work, there are some scenarios where it is not a good fit.
- When is Scrapy not the first choice?
- When you are crawling a small number of pages or a small site as a one-off task, Scrapy is not the first choice. For jobs such as scraping a movie chart or a handful of news articles, Requests + PyQuery is already enough to get the work done with less code, and for requests and page parsing at this scale, Requests and PyQuery are no slower than the modules Scrapy ships with. (A short sketch of this approach follows this list.)
- When you have no need for a general-purpose crawler, Scrapy is optional. In my view, Scrapy's real strength is that you can customize the spider's behaviour for many different kinds of websites, and that its powerful ItemLoader lets you define a series of processors for data input and output. If you do not need to keep adding new information sources, Scrapy cannot really show its full power.
- When you need incremental crawling, Scrapy looks rather weak. Scrapy has no built-in incremental-crawl feature, because what "incremental" means differs from case to case. For simple requirements, a little surgery on Scrapy is enough; for demanding incremental requirements, modifying Scrapy may be a lot of trouble.
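For reference, a minimal sketch of the Requests + PyQuery approach mentioned above might look like this. The selectors and headers here are illustrative assumptions, not part of the Scrapy project built later in this article:

import requests
from pyquery import PyQuery as pq

# A small one-off crawl: fetch the Douban movie chart and print title + detail link.
headers = {'User-Agent': 'Mozilla/5.0'}  # Douban rejects requests without a UA
resp = requests.get('https://movie.douban.com/chart', headers=headers)
doc = pq(resp.text)

for item in doc('tr.item').items():               # one row per movie
    name = item.find('div.pl2 > a').text()        # title text (may include alternate names)
    detail_url = item.find('a.nbg').attr('href')  # link to the movie's detail page
    print(name, detail_url)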
Note: in the three cases above, Scrapy is merely not the first choice; I am not saying it is not recommended. I only hope readers will not choose a framework blindly or just follow the trend; thinking it through carefully at the design stage benefits the healthy development of a project enormously.
- When is Scrapy a good fit?
- When you need a distributed design, the unofficial component Scrapy-redis works well with Scrapy. Scrapy itself does not implement any distribution mechanism, but Scrapy-redis, developed by rmax, adds it; I will cover this gradually later on.
- When the requirements keep expanding, Scrapy is a powerful weapon. The reasons have already been explained above, so I will not repeat them.
Note: all of the above is summarized from my own experience with Scrapy and is for reference only!
Hello Scrapy
The demo takes Douban (a perennial victim of web crawlers), specifically its popular movie chart and all of the comments on those movies, as the experimental target, and walks through Scrapy's basic features. I believe that after working through this demo in practice, readers will be able to use Scrapy quite well.
You need to install:
- Python (3.7 is used in this article)
- scrapy
Setting up the environment
- Install Scrapy
On the command line, type pip install scrapy
Create a Scrapy project
On the command line, type scrapy startproject douban_demo; the result looks like the figure below.
Afterwards you can see that Scrapy also suggests we can use the genspider command to create our spider file. Before doing that, let's look at what actually happened when the startproject command was executed.
Looking at the file directory, we can see the following structure:
douban_demo
├── douban_demo
│   ├── items.py        # data model (Item) definitions
│   ├── middlewares.py  # middleware file; all middlewares are configured here
│   ├── pipelines.py    # pipeline file, used to process output data
│   ├── settings.py     # settings for douban_demo
│   └── spiders         # folder for Spider classes; all spiders live here
└── scrapy.cfg          # project-wide Scrapy configuration, generated by Scrapy automatically
With this overview of what each file is for, let's begin our crawling journey.
Writing a spider
Use scrapy genspider douban douban.com to create a new spider file; this new file will be placed under douban_demo/spiders.
PS: the usage of genspider is scrapy genspider [options] <name> <domain>
At this point, a douban.py file will appear under spiders, with the following initial content:
# -*- coding: utf-8 -*-
import scrapy


class DoubanSpider(scrapy.Spider):
    name = 'douban'  # name of the spider
    allowed_domains = ['douban.com']  # list of domains the spider is allowed to crawl
    start_urls = ['http://douban.com/']  # list of URLs to start crawling from

    def parse(self, response):  # method that parses the response data
        pass
Every Spider class in a Scrapy project must inherit from scrapy.Spider, and name, start_urls and parse are the members that every Spider class should declare. For more Spider attributes and member methods, you can click on this link.
We only need to put the target link we want to crawl into start_urls; here we take https://movie.douban.com/chart as the experimental subject.
In DoubanSpider, replace the value of start_urls with start_urls = ['https://movie.douban.com/chart'].
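After the change, douban.py looks roughly like this (only start_urls differs from the generated file):

# -*- coding: utf-8 -*-
import scrapy


class DoubanSpider(scrapy.Spider):
    name = 'douban'
    allowed_domains = ['douban.com']
    start_urls = ['https://movie.douban.com/chart']  # our experimental target

    def parse(self, response):
        pass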
Testing pages with the shell
Scrapy also provides a shell command that opens an interactive shell for testing page data extraction, which is more efficient than the requests + pyquery approach.
The format is: scrapy shell <url>
Type scrapy shell on the command line to enter shell mode.
Note: do not pass a URL yet, because our test target checks the User-Agent, and requesting the link directly would return a 403. It does not matter which directory you run this command from.
Output as follows:
(venv) ➜ douban_demo scrapy shell --nolog
[s] Available Scrapy objects:
[s] scrapy scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s] crawler <scrapy.crawler.Crawler object at 0x106c5c550>
[s] item {}
[s] settings <scrapy.settings.Settings object at 0x108e18898>
[s] Useful shortcuts:
[s] fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s] fetch(req) Fetch a scrapy.Request and update local objects
[s] shelp() Shell help (print this help)
[s] view(response) View response in a browser
At this point you can see that we have entered something very much like an interactive Python command line. To avoid a 403 from Douban, we should first add the DEFAULT_REQUEST_HEADERS option to settings; it is a dictionary of request headers, and whenever Scrapy detects this option it adds its values to the request headers.
The value is as follows:
DEFAULT_REQUEST_HEADERS = {
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
'Accept-Language': 'en',
'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_5) AppleWebKit/537.36 \
(KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36'
}
In the interactive interface, type the following to add the default request headers:
>>> settings.DEFAULT_REQUEST_HEADERS = {
... 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
... 'Accept-Language': 'en',
... 'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_5) AppleWebKit/537.36 \
... (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36'
... }
Enter settings.DEFAULT_REQUEST_HEADERS again to check whether it was added successfully.
Once configured, we can use the fetch(url) command to fetch the page we want to test.
Type fetch('https://movie.douban.com/chart') and you will see something like the following:
2019-06-03 23:06:13 [scrapy.core.engine] INFO: Spider opened
2019-06-03 23:06:13 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://movie.douban.com/robots.txt> (referer: None)
2019-06-03 23:06:14 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://movie.douban.com/chart> (referer: None)
We can see from the log that the target page has been fetched successfully, and that before fetching it Scrapy first visits the robots.txt file, which is a good crawler habit; every page Scrapy fetches will then obey the rules in robots.txt. If you do not want to follow those rules, you can set ROBOTSTXT_OBEY = False in settings.
At this point you can use response.text to check whether we obtained the source code of the whole page. All of Scrapy's operations for parsing a fetched resource are integrated into this response object; for more about response, you can click on this link.
Analyzing the pages
Movie list page
Inspect the page elements.
We can see that the data we want to crawl sits inside table elements. Since the page has multiple table elements, we only need to iterate over them.
In the shell, response.css('table') gets all of the table elements. CSS selectors are used for element selection throughout this article; you can switch to xpath on your own if you prefer.
The information about each movie sits inside tr.item elements under the table tag.
The link to a movie's detail page can be obtained with a.nbg::attr(href).
The movie's poster image can be obtained with a.nbg > img::attr(src).
The movie name takes a little more work: a movie may have several names, all wrapped under div.pl2 > a, with the alternative names under div.pl2 > a > span, so the names need some formatting, such as removing spaces and line breaks.
So the main name and the alternative names can be obtained separately with div.pl2 > a::text and div.pl2 > a > span::text. Since there is more than one a tag under div.pl2 and we only want the first, we use the extract_first() method, which takes the content of the first Selector element and converts it to a str.
The movie's brief description can be obtained with p.pl::text.
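To sanity-check these selectors, you can try them in the shell session opened earlier; a quick run-through might look like this (commands only, output omitted):

>>> item = response.css('tr.item')[0]                     # the first movie row
>>> item.css('a.nbg::attr(href)').extract_first()         # detail page link
>>> item.css('a.nbg > img::attr(src)').extract_first()    # poster image
>>> item.css('div.pl2 > a::text').extract_first()         # main name (still needs stripping)
>>> item.css('div.pl2 > a > span::text').extract_first()  # alternative names
>>> item.css('p.pl::text').extract_first()                # brief description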
Movie Reviews page
Appending comments?status=P to the corresponding detail link takes you to the movie's comments page.
You can see that the comments are made up of multiple comment-item elements, with each comment's content wrapped under div.comment, so the same analysis approach as above will give you the corresponding extraction patterns. I will not elaborate here.
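The selectors used later in the sample code for this page are roughly the following; you can verify them in the shell the same way (a sketch, output omitted):

>>> detail_url = response.css('tr.item a.nbg::attr(href)').extract_first()  # first movie's detail link (from the chart page)
>>> fetch(detail_url + 'comments?status=P')                                 # jump to its comments page
>>> c = response.css('.comment-item')[0]                                    # the first comment block
>>> c.css('span.comment-info > a::text').extract_first()                    # commenter's username
>>> c.css('span.short::text').extract_first()                               # comment text
>>> response.css('a.next::attr(href)').extract_first()                      # link fragment for the next page (see the sample code for how it is joined)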
Implementation plan
- Create two parse methods: parse_rank and parse_comments. parse_rank handles the movie chart page, and parse_comments handles the corresponding comments pages.
- Override the start_requests method of the Spider class and fill in the url and callback values. Since the comments addresses we want are built from the detail links found on the chart page, the Request returned by start_requests should have its callback set to self.parse_rank.
- In parse_rank, process the returned response, parse out the data following the "Analyzing the pages" section, and use yield to throw out the Requests for the comments pages, with callback set to self.parse_comments.
- In parse_comments, process the returned comments page, yield the data, and yield the Request for the next page.
Note on a Spider's parse methods: every parse method must return Items (for now, think of an Item as a data record) or Requests (the next requests to make). Here "every parse method" does not mean only the parse method generated inside the Spider class; every function that serves as a parse callback should return Items or Requests.
The sample code
# -*- coding: utf-8 -*-
import scrapy
from scrapy.http.request import Request


class DoubanSpider(scrapy.Spider):
    name = 'douban'

    def start_requests(self):
        yield Request(url='https://movie.douban.com/chart', callback=self.parse_rank)

    def parse_rank(self, response):
        for item in response.css('tr.item'):
            detail_url = item.css('a.nbg::attr(href)').extract_first()
            img_url = item.css('a.nbg > img::attr(src)').extract_first()
            main_name = item.css('div.pl2 > a::text').extract_first()
            other_name = item.css('div.pl2 > a > span::text').extract_first()
            brief = item.css('p.pl::text').extract_first()
            # strip the line breaks and spaces around the main title
            main_name = main_name.replace('\n', '').replace(' ', '')

            yield {
                'detail_url': detail_url,
                'img_url': img_url,
                'name': main_name + (other_name or ''),  # guard against a missing alternate name
                'brief': brief
            }

            # follow the detail link to the movie's comments page
            yield Request(url=detail_url + 'comments?status=P',
                          callback=self.parse_comments,
                          meta={'movie': main_name})

    def parse_comments(self, response):
        for comments in response.css('.comment-item'):
            username = comments.css('span.comment-info > a::text').extract_first()
            comment = comments.css('span.short::text').extract_first()
            yield {
                'movie': response.meta['movie'],
                'username': username,
                'comment': comment
            }
        # queue the next page of comments, keeping the movie name in meta
        nexturl = response.css('a.next::attr(href)').extract_first()
        if nexturl:
            yield Request(url=response.url[:response.url.find('?')] + nexturl,
                          callback=self.parse_comments,
                          meta=response.meta)
Starting the crawler
Everything is ready. In the top-level douban_demo directory, type the command scrapy crawl douban and you will see a lot of log output, along with plenty of movie information and comments being printed.
With this we have completed an initial crawl of the Douban chart and its reviews. Of course, Douban limits how many comments non-logged-in users can view, detects crawler behaviour, and so on; we will come back to these anti-crawling mechanisms in the future.
Now there is a question: what should I do if I need to save the data?
Scrapy provides a number of Feed export methods; the output data can be saved as json, json lines, csv or xml.
Adding -o xx.json to the crawl command saves the output as a file in json format.
For example: scrapy crawl douban -o result.json
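The other formats work the same way; for example, saving to csv instead would be (an equivalent variant, not used further in this article):
scrapy crawl douban -o result.csv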
Because the data contains Chinese text and Scrapy's default json encoder emits everything as ascii, we need to set the export encoding to utf-8.
Just add FEED_EXPORT_ENCODING = 'utf-8' to settings.py.
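For reference, after all the changes made in this article, the modified part of settings.py looks roughly like this (all other generated defaults are omitted):

# settings.py (excerpt) -- only the options touched in this article
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
    'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_5) AppleWebKit/537.36 \
(KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36'
}

# ROBOTSTXT_OBEY = False        # only if you do not want to obey robots.txt (left at the default here)

FEED_EXPORT_ENCODING = 'utf-8'  # keep Chinese text readable in the exported file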
This time the Chinese in the data displays properly.
At this point roughly 2000 data records are generated.
Summary
With that we have completed an initial crawl of Douban's chart and reviews. Although we crawled the data successfully, it can feel like "I only wrote the page-parsing code and typed the crawl command, and Scrapy finished everything else for me, from requesting the pages to outputting the data." So next we should explore what exactly Scrapy does after we type the start command scrapy crawl douban -o result.json.
How does Scrapy work?
Readers who want to understand Scrapy's design should save the figure below (the Scrapy architecture diagram); it is particularly critical for learning Scrapy.
Walking through this diagram, when we type scrapy crawl douban -o result.json, Scrapy does the following work:
1. The Crawler receives the crawl command and activates the Spider whose name is douban, creating the Engine at the same time; at this point our DoubanSpider is started.
2. Once DoubanSpider has been created, the Engine looks for the Spider's request source, which is our start_urls attribute or start_requests method. Both must be iterable, which is why the start_requests method in our sample code throws its requests with yield. The Request objects generated here all pass through the Spider Middlewares; for now just think of a middleware as a bridge, without worrying about what happens on that bridge.
3. The Request objects generated by the Spider pass through the Engine into the Scheduler, which puts every Request into the request queue. Once a Request is scheduled, it passes through the Downloader Middlewares (more bridges) and arrives at the Downloader, which fetches the requested network resource; this process is asynchronous.
4. When the Downloader finishes a Request, it wraps the fetched resource into a Response, which also carries the original Request's information, the parser to call back, and so on. In the example, the Request thrown in parse_rank carries meta data, and that meta is still available afterwards in the response inside parse_comments.
5. Every Response then passes back through the Downloader Middlewares (the same bridges), and via the Engine and the Spider Middlewares returns to the corresponding Spider, where the corresponding callback is activated; in the end that is the parse method we wrote in the code. Whenever a parse method throws another Request, steps 3-5 are executed again.
6. When the Spider throws data (an Item), it passes once more through the Spider Middlewares and arrives at the Item Pipeline. Since we did not define any action in the Item Pipeline, the Item is simply handed on as-is and then captured by the logger and printed, which is why we can see the data in the console. And because we used the -o option, the exporter writes the items out in the requested format, producing the result.json dataset we specified.
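As an aside, a rough idea of what the crawl command wires up can be seen by driving the same spider from a small script with CrawlerProcess. The snippet below is a minimal sketch, not the exact internals of the command: the file name and feed settings are assumptions, and it assumes it is run from the project root so settings.py is found.

# run_douban.py -- a minimal sketch of running DoubanSpider from a script
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

from douban_demo.spiders.douban import DoubanSpider

if __name__ == '__main__':
    settings = get_project_settings()      # load settings.py (headers, export encoding, ...)
    settings.set('FEED_FORMAT', 'json')    # the "-o result.json" part (Scrapy 1.6 feed options)
    settings.set('FEED_URI', 'result.json')
    process = CrawlerProcess(settings)     # builds the Engine / Scheduler / Downloader machinery
    process.crawl(DoubanSpider)            # register our spider
    process.start()                        # start crawling and block until finished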
To sum up
In this article we learned how to write a simple crawler with Scrapy and gained a general understanding of Scrapy's workflow. Next time we will dig deeper into Scrapy's other components and how to use them to get past anti-crawling mechanisms.
If any of the views in this article are wrong, corrections are welcome!