A Concise Guide to Scrapy (Part 1)


Foreword

Scrapy is a crawler framework written in pure Python. Its simplicity, ease of use, and high extensibility have made it a mainstream tool for Python web crawling. This article is based on the latest official release, 1.6, and moves from basic usage toward exploring its internals and extension points.

Before getting started, a word of advice: no matter how many tutorials you read, none explains things as aptly as the official documentation! If this article sparks your interest in Scrapy, then to better understand Scrapy's original design intent, please get into the habit of reading the official docs!

Scrapy official documentation

Contents

This article covers the following topics:

  • Why Scrapy?
  • Hello Scrapy! (hands-on)
  • How does Scrapy work?

The first section, "Why Scrapy?", is worth reading for everyone; in it I analyze my understanding of the business scenarios where Scrapy fits.

For the remaining two sections, my original intent was to cover "How does Scrapy work?" before "Hello Scrapy", but not everyone is willing to start by digesting theory. So I put the hands-on demo first, hoping to spark the reader's interest, because interest is what lets us understand something more deeply. That is why "How does Scrapy work?" comes last; it also serves as a lead-in to the next chapter on Scrapy's internals!

Why Scrapy?

Although Scrapy is designed to handle the vast majority of crawling work, there are some scenarios where it simply is not a good fit.

  • When is Scrapy not the first choice?
  1. When you only need to crawl a small number of pages or a small site, one time, Scrapy is not the first choice. For things like crawling a movie chart or a handful of news articles, Requests + PyQuery can already finish the task with less code than Scrapy would need, and at this scale the request efficiency and page-parsing speed of Requests and PyQuery beat Scrapy's built-in modules (a short Requests + PyQuery sketch follows this list).

  2. When there is no need for a general-purpose, extensible crawler, Scrapy is dispensable. In my opinion, Scrapy's real benefit is the ability to customize the corresponding "Spider actions" for many different kinds of websites, together with the powerful "ItemLoader", which defines a series of actions for handling data input and output. If you do not need to keep adding new information sources, Scrapy cannot really show its full power!

  3. When you need incremental crawling, Scrapy looks rather weak. Scrapy has no built-in incremental-crawling feature, because incremental requirements differ too much from case to case to implement generically. My estimate is that for simple requirements some minor surgery on Scrapy will get the job done, but for demanding incremental crawls it may really take a lot of painful work on Scrapy!

Note: the three cases above only mean that Scrapy is not the first choice; they do not mean it is not recommended! I simply hope readers will not choose a technology or framework blindly or just follow the trend; careful consideration early in the design phase pays great dividends for the healthy development of a project.
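
As a point of comparison with case 1, here is a minimal sketch of the Requests + PyQuery approach (not part of this article's demo; the URL, header, and selector are only illustrative and may need adjusting):

# A minimal Requests + PyQuery sketch for a small, one-off crawl.
import requests
from pyquery import PyQuery as pq

# Many sites reject the default Requests User-Agent, so set a browser-like one.
headers = {'user-agent': 'Mozilla/5.0'}
html = requests.get('https://movie.douban.com/chart', headers=headers).text
doc = pq(html)

# The same selector the Scrapy demo below uses for the chart page.
for a in doc('div.pl2 > a').items():
    title = a.text().split('/')[0].strip()
    print(a.attr('href'), title)

For a one-off script like this there is simply no framework left to configure, which is the whole point of case 1.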

  • When is Scrapy a good fit?
  1. When you need a distributed design, the unofficial Scrapy component Scrapy-redis works very well. Scrapy itself does not implement a distributed mechanism, but with rmax's Scrapy-redis you can make it distributed; I will get to this step by step in later articles.

  2. When the requirements keep expanding, Scrapy is a sharp tool. The specific reasons have already been explained above, so I won't repeat them here.

Note: all of the above is a summary of my own experience using Scrapy, for reference only!

Hello Scrapy

The demo takes Douban (the perennial victim of crawlers), specifically its popular movie chart and all of the comments on those movies, as the experimental target, and walks through Scrapy's basic features. I believe that after working through this demo hands-on, readers will be able to use Scrapy quite well.

Project on GitHub

You need to have installed:

  • Python (this article uses 3.7)
  • Scrapy

Setting up the environment

  • Installing Scrapy

At the command line, type pip install scrapy

Creating a Scrapy project

At the command line, type scrapy startproject douban_demo. The results are shown in the figure below.

Afterwards you can see that Scrapy also suggests we use the genspider command to create our spider file. Before that, let's take a look at what actually happened when we executed the previous command.

Looking at the files in the project directory, we can see the following:

douban_demo
├── douban_demo
│   ├── items.py       # data model definitions
│   ├── middlewares.py # middleware file, where all middlewares are configured
│   ├── pipelines.py   # pipeline file, used to process data output
│   ├── settings.py    # configuration for douban_demo
│   └── spiders        # Spider folder; all spiders live here
└── scrapy.cfg         # Scrapy's own configuration file, generated automatically

Now that we have an overview of what each file is for, let's begin our crawling journey.

Describing a spider

Use scrapy genspider douban douban.com to create a new spider file; this new spider file will be placed under douban_demo/spiders.

PS: genspider usage: scrapy genspider [options] <name> <domain>

At this point a douban.py file will appear under spiders, with the following initial contents:

# -*- coding: utf-8 -*-
import scrapy


class DoubanSpider(scrapy.Spider):
    name = 'douban'                       # spider name
    allowed_domains = ['douban.com']      # list of domains allowed to be crawled
    start_urls = ['http://douban.com/']   # list of URLs to start crawling from

    def parse(self, response):            # method that parses the response data
        pass

Every spider in a Scrapy project must inherit from scrapy.Spider. The name attribute is required, and start_urls together with the parse method are the members you will normally declare as well (our sample code later replaces start_urls with a custom start_requests). More Spider attributes and member methods can be found via this link.

Next we just need to put the target link we want to crawl into start_urls. We will use https://movie.douban.com/chart as the experimental subject.

Replace the start_urls value in DoubanSpider with start_urls = ['https://movie.douban.com/chart']
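
At this stage douban.py looks roughly like the sketch below (only start_urls has changed; the parse stub is untouched for now):

# douban_demo/spiders/douban.py after swapping in the chart URL
import scrapy


class DoubanSpider(scrapy.Spider):
    name = 'douban'
    allowed_domains = ['douban.com']                 # subdomains such as movie.douban.com are allowed too
    start_urls = ['https://movie.douban.com/chart']  # the popular movie chart we want to crawl

    def parse(self, response):
        pass                                         # parsing logic comes later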

Testing the page with the shell

Scrapy also provides the shell command so that we can test page data extraction in a shell, which is more efficient than the requests + pyquery approach.

Format: scrapy shell <url>

Type scrapy shell at the command line to enter shell mode.

Note: do not add the URL yet, because our test target checks the User-Agent; fetching the link directly would give a 403. It does not matter which directory you run this command from.

The output is as follows:

(venv) ➜  douban_demo scrapy shell --nolog                                 
[s] Available Scrapy objects:
[s]   scrapy     scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s]   crawler    <scrapy.crawler.Crawler object at 0x106c5c550>
[s]   item       {}
[s]   settings   <scrapy.settings.Settings object at 0x108e18898>
[s] Useful shortcuts:
[s]   fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s]   fetch(req)                  Fetch a scrapy.Request and update local objects 
[s]   shelp()           Shell help (print this help)
[s]   view(response)    View response in a browser

At this point we are in what is essentially an interactive Python command line. To avoid Douban's 403, we should first add the DEFAULT_REQUEST_HEADERS option to settings. It is a dictionary of request headers; as long as Scrapy detects this option, it adds the values inside it to the request headers.

The values are as follows:

DEFAULT_REQUEST_HEADERS = {
  'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
  'Accept-Language': 'en',
  'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_5) AppleWebKit/537.36 \
  (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36'
}

In the interactive shell, type the following to add the default request headers:

>>> settings.DEFAULT_REQUEST_HEADERS = {
...   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
...   'Accept-Language': 'en',
...   'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_5) AppleWebKit/537.36 \
...   (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36'
... }

Type settings.DEFAULT_REQUEST_HEADERS again to check whether it was added successfully.

Once that is configured, we can use the fetch(url) command to fetch the page we want to test.

Type fetch('https://movie.douban.com/chart') and you will see something like:

2019-06-03 23:06:13 [scrapy.core.engine] INFO: Spider opened
2019-06-03 23:06:13 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://movie.douban.com/robots.txt> (referer: None)
2019-06-03 23:06:14 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://movie.douban.com/chart> (referer: None)

From the log we can see that the target page was fetched successfully, and that before fetching it Scrapy first visited the robots.txt file. This is a good crawling habit; by default, every page Scrapy fetches complies with the rules in robots.txt. If you do not want to follow those rules, set ROBOTSTXT_OBEY = False in settings.

At this point you can use response.text to check whether we obtained the source code of the whole page. Scrapy integrates all resource-parsing operations into the response object; more about response can be found via this link.

Analyzing the pages

Movie list page

Inspect the page with the browser's element inspector.

We can see that what we need to crawl is inside a table. Because there are multiple tables on the page, we just need to iterate over them.

In the shell, use response.css('table') to get all the table elements. This article uses CSS selectors throughout for element selection; you can switch to XPath on your own if you prefer.

The information for each movie sits inside tr.item under the table tag.

The movie detail link can be obtained with a.nbg::attr(href).

The movie poster image can be obtained with a.nbg > img::attr(src).

The movie name takes a bit more work. As you can see, a movie may have several names; they are all wrapped under div.pl2 > a, with the other names under div.pl2 > a > span, so the names need some formatting, such as removing spaces and line breaks.

So the main name and the other names can be obtained with div.pl2 > a::text and div.pl2 > a > span::text respectively. However, since there is more than one a tag under div.pl2, we only need the first match; use the extract_first() method to take the content of the first Selector element and convert it to a str.

The movie brief can be obtained with p.pl::text.
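
Putting the selectors above together, a quick check in the shell might look like the sketch below (commands only; run them after the fetch(...) call above):

>>> len(response.css('table'))                             # how many ranking tables the page has
>>> item = response.css('tr.item')[0]                      # take the first movie row as a sample
>>> item.css('a.nbg::attr(href)').extract_first()          # detail page link
>>> item.css('a.nbg > img::attr(src)').extract_first()     # poster image
>>> item.css('div.pl2 > a::text').extract_first()          # main name, still wrapped in whitespace
>>> item.css('div.pl2 > a > span::text').extract_first()   # other names
>>> item.css('p.pl::text').extract_first()                 # brief information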

Movie comments page

Splicing comments?status=P onto the corresponding movie detail link takes you to the movie's comments page.

As can be seen, the comments data consists of multiple comment-item elements, and the comment content is wrapped under div.comment, so the corresponding data can be extracted with the same analysis method as above. I won't elaborate on it here.
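
For reference, a corresponding shell sketch for a comments page (the movie URL is only an example; the selectors are the ones the sample code below relies on):

>>> fetch('https://movie.douban.com/subject/1292052/comments?status=P')
>>> c = response.css('.comment-item')[0]                   # first comment block
>>> c.css('span.comment-info > a::text').extract_first()   # commenter's username
>>> c.css('span.short::text').extract_first()              # comment text
>>> response.css('a.next::attr(href)').extract_first()     # query string for the next page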

Implementation approach

  1. Create two parse methods, parse_rank and parse_comments: parse_rank handles the movie list page, and parse_comments handles the corresponding comments pages.

  2. Override the Spider class's start_requests method, filling in the url and callback values. Because the comments addresses we want are obtained from the details on the movie list page, which has to be fetched first, the callback of the Request returned by start_requests should be set to self.parse_rank.

  3. In parse_rank, process the returned response, parse the data according to the "page analysis" above, and use yield to throw out both the data and a Request for the comments page, with callback set to self.parse_comments.

  4. In parse_comments, process the returned comments page, and likewise yield the data and a Request for the next page's link.

NOTE on Spider parse methods: every parse method must return an Item (for now, just think of it as a data item) or a Request (the next request). "Every parse method" here does not refer only to the parse method generated in the Spider class; it means that every function acting as a parse callback should return Items or Requests.

The sample code

# -*- coding: utf-8 -*-
import scrapy
from scrapy.http.request import Request


class DoubanSpider(scrapy.Spider):
    name = 'douban'

    def start_requests(self):
        yield Request(url='https://movie.douban.com/chart', callback=self.parse_rank)

    def parse_rank(self, response):
        for item in response.css('tr.item'):
            detail_url = item.css('a.nbg::attr(href)').extract_first()
            img_url = item.css('a.nbg > img::attr(src)').extract_first()
            main_name = item.css('div.pl2 > a::text').extract_first()
            # guard against movies that have no alternative-name span
            other_name = item.css('div.pl2 > a > span::text').extract_first() or ''
            brief = item.css('p.pl::text').extract_first()
            # strip the line breaks and spaces wrapped around the main title
            main_name = main_name.replace('\n', '').replace(' ', '')

            yield {
                'detail_url': detail_url,
                'img_url': img_url,
                'name': main_name+other_name,
                'brief': brief
            }

            yield Request(url=detail_url+'comments?status=P',
                          callback=self.parse_comments,
                          meta={'movie': main_name})

    def parse_comments(self, response):
        for comments in response.css('.comment-item'):
            username = comments.css('span.comment-info > a::text').extract_first()
            comment = comments.css('span.short::text').extract_first()

            yield {
                'movie': response.meta['movie'],
                'username': username,
                'comment': comment
            }
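        # The "next page" link is just a query string (e.g. "?start=20..."),
        # so splice it onto the base URL of the current comments page.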
        nexturl = response.css('a.next::attr(href)').extract_first()
        if nexturl:
            yield Request(url=response.url[:response.url.find('?')]+nexturl,
                          callback=self.parse_comments,
                          meta=response.meta)


Starting the spider

With everything ready, we can go to the top-level douban_demo directory and type the command scrapy crawl douban. You will see a lot of log output, along with plenty of movie information and comments being printed.

With this we have completed an initial crawl of the Douban chart and its reviews. Of course, Douban limits the number of comments a non-logged-in user can view, detects crawler behaviour, and so on; we will come back to these anti-crawling mechanisms later.

So now there is a question: what should I do if I need to save the data?

Scrapy provides a number of Feed export formats; the output data can be saved as json, json lines, csv, or xml.

Adding -o xx.json to the crawl command saves the output as a file in json format.

For example: scrapy crawl douban -o result.json

Because the data contains Chinese, and scrapy's default json encoder escapes all data to ascii, we need to set the data encoding to utf-8.

Just add FEED_EXPORT_ENCODING = 'utf-8' to settings.py.
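
For reference, the settings.py additions touched in this article boil down to a sketch like this:

# douban_demo/settings.py -- only the options this article discusses
DEFAULT_REQUEST_HEADERS = {
  'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
  'Accept-Language': 'en',
  'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_5) AppleWebKit/537.36 \
  (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36'
}
# ROBOTSTXT_OBEY = False          # only if you decide not to honour robots.txt
FEED_EXPORT_ENCODING = 'utf-8'    # keep Chinese characters readable in the exported file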

Now the Chinese in the data displays correctly.

This run produced roughly 2,000 items of data.

Summary

With that we have completed an initial crawl of Douban's chart and its reviews. Although we crawled the data successfully, the impression it leaves is: "I only wrote the page-parsing code and typed a command to start the spider, and Scrapy did everything else for me, from requesting the web pages to outputting the data." So we should keep exploring what exactly Scrapy does after we type the start command scrapy crawl douban -o result.json.

How does Scrapy work?

Readers who want to understand Scrapy's internals should save the figure below; this diagram is particularly important when learning Scrapy.

Following this diagram, when we type scrapy crawl douban -o result.json, Scrapy does the following work:

  1. The Crawler receives the crawl command and is activated; it activates the Spider whose name is douban and at the same time creates the Engine. At this point our DoubanSpider is started.

  2. Once DoubanSpider has been created, the Engine checks the Spider's request queue, that is, our start_urls attribute or start_requests method. Both must be iterable, which is why the start_requests method in our sample code uses yield to throw requests. The Request objects generated at this point all pass through the Spider Middlewares; at this stage we only need to think of middleware as a bridge, without going into what the bridge does.

  3. The Request objects generated by the Spider pass through the Engine into the Scheduler. The Scheduler adds every Request to the request queue, and once a Request is scheduled it crosses the Downloader Middlewares bridges to reach the Downloader, which fetches the specified Internet resource as requested; this process is asynchronous.

  4. When the Downloader completes a Request task, it packages the resource into a Response, which includes the original Request information, the packaged parser, and so on. That is why, in the example, the meta carried by the Request thrown in parse_rank is still available afterwards in the response inside parse_comments.

  5. All Responses then cross the Downloader Middlewares bridges once more, pass through the Engine and the Spider Middlewares back to the corresponding Spider, and activate the corresponding callback function, which in the end executes the parse methods we wrote in the code. When a parse method throws another Request object, steps 3-5 are executed again.

  6. When the Spider throws data (an Item), it passes through the Spider Middlewares once more and arrives at the Item Pipeline. Since we did not specify any action in the Item Pipeline, the Item is simply thrown to the outside world, where it is captured by the logger and printed, which is the data we see generated in the console. Because we used the -o option, the exporter also writes each item out in the corresponding format, producing the result.json data set we specified (a minimal pipeline sketch follows this list).
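
To make step 6 more concrete, here is a minimal, illustrative pipeline sketch (the demo never enables one; the class name matches what startproject generates, and it only counts items as they pass through):

# douban_demo/pipelines.py -- illustrative only, not used in the demo
class DoubanDemoPipeline:

    def open_spider(self, spider):
        self.count = 0

    def process_item(self, item, spider):
        # every item a Spider yields passes through here before reaching the exporter
        self.count += 1
        return item          # returning the item hands it on to the next stage

    def close_spider(self, spider):
        spider.logger.info('pipeline saw %d items', self.count)

To enable it you would add ITEM_PIPELINES = {'douban_demo.pipelines.DoubanDemoPipeline': 300} to settings.py.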

Conclusion

In this article we wrote a simple crawler with Scrapy and gained a general understanding of Scrapy's workflow. Next we will look more deeply into Scrapy's other components and how to use them to break through anti-crawling mechanisms.

If anything in this article is wrong, corrections are very welcome!


Source: blog.csdn.net/weixin_33922672/article/details/91370103