Scrapy, a Powerful Crawling Tool: Part 1: Overview

Scrapy is a web scraping framework implemented in Python. This article gives an overview of Scrapy, explains how to install it, and walks through a simple scrapy shell example that extracts a page title, to give a first feel for how Scrapy is used.

Overview

Scrapy is a web scraping framework implemented in Python, well suited to crawling websites and extracting structured data. Compared with Apache Nutch, which targets general-purpose search, it is smaller and more flexible. Key facts are summarized in the table below:

Item                      Description
Official site             https://scrapy.org/
Open/closed source        Open source
Source repository         https://github.com/scrapy/scrapy
Implementation language   Python
Current stable version    2.0.1 (2020/03)

Installation

Scrapy can be installed directly with pip:

Command: pip install scrapy

This article uses an environment where Python 3 and Python 2 coexist, so pip3 is used for the installation. The installation log is shown below:

liumiaocn:scrapy liumiao$ pip3 install scrapy
Collecting scrapy
  Downloading 
... (omitted)
Successfully built protego PyDispatcher zope.interface
Installing collected packages: six, pycparser, cffi, cryptography, pyasn1, pyasn1-modules, attrs, service-identity, protego, cssselect, pyOpenSSL, w3lib, PyDispatcher, incremental, constantly, Automat, PyHamcrest, zope.interface, idna, hyperlink, Twisted, lxml, parsel, queuelib, scrapy
Successfully installed Automat-20.2.0 PyDispatcher-2.0.5 PyHamcrest-2.0.2 Twisted-20.3.0 attrs-19.3.0 cffi-1.14.0 constantly-15.1.0 cryptography-2.8 cssselect-1.1.0 hyperlink-19.0.0 idna-2.9 incremental-17.5.0 lxml-4.5.0 parsel-1.5.2 protego-0.1.16 pyOpenSSL-19.1.0 pyasn1-0.4.8 pyasn1-modules-0.2.8 pycparser-2.20 queuelib-1.5.0 scrapy-2.0.1 service-identity-18.1.0 six-1.14.0 w3lib-1.21.0 zope.interface-5.0.1
liumiaocn:scrapy liumiao$
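
As a quick sanity check, the installed package can also be confirmed directly from Python. Assuming the pip3 installation above succeeded, the following one-liner should print the installed version (2.0.1 here):

Command: python3 -c "import scrapy; print(scrapy.__version__)"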

Version check

liumiaocn:scrapy liumiao$ scrapy -h
Scrapy 2.0.1 - no active project

Usage:
  scrapy <command> [options] [args]

Available commands:
  bench         Run quick benchmark test
  fetch         Fetch a URL using the Scrapy downloader
  genspider     Generate new spider using pre-defined templates
  runspider     Run a self-contained spider (without creating a project)
  settings      Get settings values
  shell         Interactive scraping console
  startproject  Create new project
  version       Print Scrapy version
  view          Open URL in browser, as seen by Scrapy

  [ more ]      More commands available when run from project directory

Use "scrapy <command> -h" to see more info about a command
liumiaocn:scrapy liumiao$ 

Getting the page title

A crawler is essentially a program that processes HTML. The simplest way to confirm what Scrapy can do is through scrapy shell, which provides an interactive console for scraping data and is also handy for debugging scraping logic.

Example goal: get the title of the Scrapy official homepage (https://scrapy.org/).

Step 1: Run scrapy shell

Run the following command:

Command: scrapy shell https://scrapy.org/

liumiaocn:scrapy liumiao$ scrapy shell https://scrapy.org/
2020-03-28 05:38:09 [scrapy.utils.log] INFO: Scrapy 2.0.1 started (bot: scrapybot)
2020-03-28 05:38:09 [scrapy.utils.log] INFO: Versions: lxml 4.5.0.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.5.2, w3lib 1.21.0, Twisted 20.3.0, Python 3.7.5 (default, Nov  1 2019, 02:16:32) - [Clang 11.0.0 (clang-1100.0.33.8)], pyOpenSSL 19.1.0 (OpenSSL 1.1.1d  10 Sep 2019), cryptography 2.8, Platform Darwin-19.2.0-x86_64-i386-64bit
2020-03-28 05:38:09 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2020-03-28 05:38:09 [scrapy.crawler] INFO: Overridden settings:
{'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilter',
 'LOGSTATS_INTERVAL': 0}
2020-03-28 05:38:09 [scrapy.extensions.telnet] INFO: Telnet Password: 5e36afd357190e93
2020-03-28 05:38:09 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage']
2020-03-28 05:38:09 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2020-03-28 05:38:09 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2020-03-28 05:38:09 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2020-03-28 05:38:09 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2020-03-28 05:38:09 [scrapy.core.engine] INFO: Spider opened
2020-03-28 05:38:10 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://scrapy.org/> (referer: None)
[s] Available Scrapy objects:
[s]   scrapy     scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s]   crawler    <scrapy.crawler.Crawler object at 0x1073043d0>
[s]   item       {}
[s]   request    <GET https://scrapy.org/>
[s]   response   <200 https://scrapy.org/>
[s]   settings   <scrapy.settings.Settings object at 0x1075d05d0>
[s]   spider     <DefaultSpider 'default' at 0x107acbd90>
[s] Useful shortcuts:
[s]   fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s]   fetch(req)                  Fetch a scrapy.Request and update local objects 
[s]   shelp()           Shell help (print this help)
[s]   view(response)    View response in a browser
>>> 

Step 2: Get the title with response.css

Enter response.css('title') and press Enter; the title element appears in the output as a Selector:

>>> response.css('title')
[<Selector xpath='descendant-or-self::title' data='<title>Scrapy | A Fast and Powerful S...'>]
>>> 
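
As an aside, Scrapy selectors accept XPath as well as CSS. The same query can be written with response.xpath; the snippet below is an equivalent form added for illustration, not taken from the original session, but given the fetch above it should return the same Selector:

>>> response.xpath('//title')
[<Selector xpath='//title' data='<title>Scrapy | A Fast and Powerful S...'>]
>>>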

The full title element can then be extracted as a string:

>>> response.css('title').extract_first()
'<title>Scrapy | A Fast and Powerful Scraping and Web Crawling Framework</title>'
>>> 
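
Two refinements of the standard Scrapy selector API are worth knowing here (not shown in the session above): the '::text' pseudo-element selects only the text inside the tag, and .get() is the modern alias for extract_first(). Combined, they return the bare title string:

>>> response.css('title::text').get()
'Scrapy | A Fast and Powerful Scraping and Web Crawling Framework'
>>>

The same extraction can also run outside the interactive shell. Below is a minimal, self-contained spider sketch (the file name title_spider.py and the spider name "title" are illustrative) that can be executed with scrapy runspider:

# title_spider.py -- minimal standalone spider (illustrative name)
# Run with: scrapy runspider title_spider.py
import scrapy

class TitleSpider(scrapy.Spider):
    name = "title"                        # arbitrary spider name
    start_urls = ["https://scrapy.org/"]  # same page as the shell example

    def parse(self, response):
        # Same selector as in the shell session above; '::text'
        # extracts the text node inside <title>
        yield {"title": response.css("title::text").get()}

Running it as scrapy runspider title_spider.py -o titles.json writes the scraped item to a JSON file, which is the usual next step once shell experimentation is done.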

Source: blog.csdn.net/liumiaocn/article/details/105154963