Scrapy is an application framework written in Python for crawling websites and extracting structured data.

Scrapy is often used for data mining, information processing, storing historical data, and similar tasks.

With the Scrapy framework we can easily build a crawler that scrapes the content or images of a website.


Figure: Scrapy architecture (the green lines show the data flow)

  • Scrapy Engine: responsible for the communication, signals, and data transfer among the Spider, Item Pipeline, Downloader, and Scheduler.

  • Scheduler: accepts Request objects sent by the engine, arranges and enqueues them in a certain order, and hands them back to the engine when the engine asks for them.

  • Downloader: downloads all the Requests sent by the Scrapy Engine and returns the Responses it obtains to the engine, which hands them to the Spider for processing.

  • Spider: processes all Responses, parses them to extract the data and fill in the fields the Item needs, and submits any URLs that need follow-up to the engine, which sends them back to the Scheduler.

  • Item Pipeline: handles the Items obtained by the Spider and performs post-processing on them (detailed parsing, filtering, storage, and so on).

  • Downloader Middlewares: components you can customize to extend the download functionality (a minimal skeleton is sketched after this list).

  • Spider Middlewares: components you can customize to extend the communication between the engine and the Spider (for example, Responses entering the Spider and Requests leaving it).
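As a rough sketch of what these pluggable components look like in code, here is a minimal, hedged example of a custom Downloader Middleware and a custom Item Pipeline. The class names and behavior are illustrative assumptions, not part of the project built below, and they would only take effect after being enabled in settings.py via DOWNLOADER_MIDDLEWARES and ITEM_PIPELINES:

# middlewares.py -- bare-bones Downloader Middleware (hypothetical example)
class DefaultUserAgentMiddleware(object):
    def process_request(self, request, spider):
        # Called for every request flowing from the engine to the Downloader;
        # a real middleware might rotate proxies or User-Agent headers here.
        request.headers.setdefault('User-Agent', 'my-crawler/1.0')
        return None  # returning None lets the request continue downstream


# pipelines.py -- bare-bones Item Pipeline (hypothetical example)
class PrintItemPipeline(object):
    def process_item(self, item, spider):
        # Called for every Item the Spider hands back to the engine;
        # a real pipeline might clean, validate, or store the item.
        print(item)
        return item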

How Scrapy works

Once the code is written, the program starts running ...

  • 1 Engine: Hi! Spider, which website do you want to handle?
  • 2 Spider: The boss wants me to handle xxxx.com.
  • 3 Engine: Give me the first URL that needs to be processed.
  • 4 Spider: Here you go, the first URL is xxxxxxx.com.
  • 5 Engine: Hi! Scheduler, I have a request here; please enqueue it for me.
  • 6 Scheduler: OK, it's in the queue; please wait a moment.
  • 7 Engine: Hi! Scheduler, give me a request you have queued up.
  • 8 Scheduler: Here you go, this is a request I have prepared.
  • 9 Engine: Hi! Downloader, please download this request for me according to the Downloader Middleware settings the boss configured.
  • 10 Downloader: OK! Here you go, here is the downloaded result. (If it fails: sorry, this request failed to download. The engine then tells the Scheduler: this request failed to download, record it and we will try again a bit later.)
  • 11 Engine: Hi! Spider, here is something that has been downloaded and already processed according to the boss's Downloader Middlewares; please handle it. (Note: by default the responses are passed to the def parse() method.)
  • 12 Spider (after processing the data, for URLs that need follow-up): Hi! Engine, I have two results here: this is the URL I need to follow up, and this is the Item data I obtained. (This hand-off is sketched in code after the note below.)
  • 13 Engine: Hi! Pipeline, I have an Item here; please handle it! Scheduler! This is a URL that needs follow-up; please handle it next. Then the cycle starts again from step 4 until all the information the boss needs has been obtained.
  • 14 Pipeline and Scheduler: OK, doing it right away!

Note: the whole program stops only when there are no requests left in the Scheduler. (Also, URLs that failed to download will be downloaded again by Scrapy.)
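In code, steps 11 to 13 of this dialogue boil down to the Spider's parse() method handing both extracted data and follow-up URLs back to the engine. A minimal sketch, with placeholder selectors and a placeholder URL (not the project built below):

import scrapy

class ExampleSpider(scrapy.Spider):
    name = "example"
    start_urls = ["http://xxxx.com/"]

    def parse(self, response):
        # Step 12: hand extracted data back to the engine as an item (here a plain dict) ...
        yield {"title": response.xpath("/html/head/title/text()").extract_first()}

        # ... and hand back a follow-up URL as well; the engine passes it to the
        # Scheduler (step 13) and the cycle repeats from step 4.
        next_url = response.xpath("//a[@class='next']/@href").extract_first()
        if next_url:
            yield scrapy.Request(response.urljoin(next_url), callback=self.parse)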


Building a Scrapy crawler takes four steps in total:

  1. Create a new project (scrapy startproject xxx): create a new crawler project
  2. Define the targets (write items.py): specify the data you want to crawl
  3. Build the spider (spiders/xxspider.py): write the spider and start crawling pages
  4. Store the content (pipelines.py): design the pipeline that stores the crawled content

Installation

Windows Installation

Upgrade pip to the latest version:

pip install --upgrade pip

Install the Scrapy framework via pip:

pip install Scrapy

Ubuntu Installation

Install the non-Python dependencies:

sudo apt-get install python-dev python-pip libxml2-dev libxslt1-dev zlib1g-dev libffi-dev libssl-dev

Install the Scrapy framework via pip:

sudo pip install scrapy

Mac OS Installation

On Mac OS the system ships with its own Python 2.x libraries, and the pre-installed packages cannot be removed. Installing Scrapy against that Python 2.x raises errors, and installing it with python3.x also failed for me; I couldn't find a way to install Scrapy directly, so I looked for another approach. The solution is to install it inside a virtualenv:

$ sudo pip install virtualenv
$ virtualenv scrapyenv
$ cd scrapyenv
$ source bin/activate
$ pip install Scrapy

Once installed, simply type scrapy in a terminal; if output similar to the following appears, the installation was successful.


Getting started example

Learning goals

  • Create a Scrapy project
  • Define the structure of the data to extract (Item)
  • Write a Spider that crawls the website and extracts the structured data (Item)
  • Write an Item Pipeline that stores the extracted Items (i.e. the structured data)

1. Create a new project (scrapy startproject)

Before you begin crawling, you must create a new Scrapy project. Enter a directory of your choice and run the following command:

scrapy startproject mySpider

Here mySpider is the project name. You will see that a mySpider folder has been created, with the following directory structure:

mySpider/
    scrapy.cfg
    mySpider/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            ...

Let's briefly explain the role of each key file.

These files are:

  • scrapy.cfg: the project's configuration file.
  • mySpider/: the project's Python module; code will be imported from here.
  • mySpider/items.py: the project's item definition file.
  • mySpider/pipelines.py: the project's pipeline file.
  • mySpider/settings.py: the project's settings file.
  • mySpider/spiders/: the directory that stores the spider code.

2. Define the targets (mySpider/items.py)

We plan to scrape the name, title, and profile of every instructor listed on http://www.itcast.cn/channel/teacher.shtml.

    1. Open items.py in the mySpider directory.

    2. An Item defines structured data fields used to hold the crawled data. It behaves a bit like a Python dict, but provides some extra protection against mistakes (illustrated after the class definition below).

    3. You define an Item by creating a class that inherits from scrapy.Item and declaring class attributes of type scrapy.Field (you can think of it as something like an ORM mapping).

Next, create an ItcastItem class and build the item model:

import scrapy

class ItcastItem(scrapy.Item):
    name = scrapy.Field()
    title = scrapy.Field()
    info = scrapy.Field()
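The "extra protection" mentioned in point 2 above means that an Item only accepts the fields declared on its class. A minimal sketch using the ItcastItem just defined:

item = ItcastItem()
item['name'] = "some teacher"   # declared field: accepted
item['title'] = "lecturer"      # declared field: accepted
# item['nickname'] = "oops"     # undeclared field: raises KeyError,
#                               # unlike a plain dict, which would accept it silently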

3. Build the spider (spiders/itcastSpider.py)

Building the spider involves two steps:

1. Crawl the data

Run the following command in the project directory. It creates a spider named itcast in the mySpider/spiders directory and restricts the crawl to the given domain:

scrapy genspider itcast "itcast.cn"

Open itcast.py in the mySpider/spiders directory; the following code has been generated by default:

import scrapy

class ItcastSpider(scrapy.Spider):
    name = "itcast"
    allowed_domains = ["itcast.cn"]
    start_urls = (
        'http://www.itcast.cn/',
    )

    def parse(self, response):
        pass

We could also create itcast.py ourselves and write the code above by hand; using the command simply saves us from writing the boilerplate.

To build a Spider, you must create a subclass of scrapy.Spider and define three mandatory attributes and one method.

name = "": the identifying name of the spider. It must be unique; different spiders must use different names.

allowed_domains = []: the domain range of the search, i.e. the spider's constraint area. The spider only crawls pages under these domains; URLs outside them are ignored.

start_urls = (): the tuple/list of URLs to crawl. The spider starts fetching data from here, so the first pages downloaded are these URLs; further URLs are generated from these starting points.

parse(self, response): the parsing method. It is called once each initial URL has finished downloading, and receives the Response object returned from that URL as its only argument. Its main jobs are:

  • parsing the returned page data (response.body) and extracting structured data (generating Items)
  • generating the URL Requests needed for the next pages

Change the value of start_urls to the first URL we want to crawl:

start_urls = ("http://www.itcast.cn/channel/teacher.shtml",)

Modify the parse() method:

def parse(self, response):
    filename = "teacher.html"
    open(filename, 'w').write(response.body)
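Note that the snippet above assumes Python 2, where response.body (a byte string) can be written to a file opened in text mode. On Python 3, response.body is a bytes object, so the file has to be opened in binary mode; a minimal variant:

def parse(self, response):
    filename = "teacher.html"
    # response.body is bytes in Python 3, so open the file in binary mode
    with open(filename, 'wb') as f:
        f.write(response.body)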

Then run it and see what happens. In the mySpider directory, execute:

scrapy crawl itcast

Yes, itcast. Look at the code above: it is the name attribute of the ItcastSpider class, i.e. the unique spider name we gave the scrapy genspider command.

After the run, if the printed log contains [scrapy] INFO: Spider closed (finished), execution completed successfully. A teacher.html file then appears in the current folder, containing the full page source of the site we just crawled.

Note: the default encoding in Python 2.x is ASCII. When this does not match the encoding of the fetched data, the output may be garbled. We can specify the encoding used when saving content; usually we add the following at the very top of the code:

import sys
reload(sys)
sys.setdefaultencoding("utf-8")

These three lines are the universal key for fixing Chinese encoding issues in Python 2.x. After years of complaints, Python 3 finally learned its lesson and uses Unicode as the default encoding... (May everyone embrace Python 3 soon.)

2. Extract the data

Now that the whole page has been crawled, the next step is extraction. First, look at the page source:

<div class="li_txt">
    <h3>  xxx  </h3>
    <h4>  xxxxx  </h4>
    <p>  xxxxxxxx  </p>
</div>

Clear at a glance, isn't it? Let's go straight to XPath and start extracting the data.

With the xpath method, we only need to supply an XPath expression to locate the corresponding HTML nodes; see an XPath tutorial for the details.

If you don't know XPath syntax, that's fine: Chrome provides a one-click way to get the XPath of an element (right-click -> Inspect -> Copy -> Copy XPath), as shown in the figure below:

Here are some example XPath expressions and their meanings (you can also try them interactively in the Scrapy shell, sketched after this list):

  • /html/head/title: selects the <title> element inside the <head> of the HTML document
  • /html/head/title/text(): selects the text of the <title> element above
  • //td: selects all <td> elements
  • //div[@class="mine"]: selects all div elements that have the attribute class="mine"
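If you want to experiment with such expressions before putting them into the spider, the Scrapy shell is handy (an extra suggestion, not part of the original walkthrough). Running scrapy shell "http://www.itcast.cn/channel/teacher.shtml" drops you into an interactive session with a ready-made response object:

# Inside the Scrapy shell, `response` already holds the downloaded page:
response.xpath('/html/head/title/text()').extract_first()      # the page title text
response.xpath("//div[@class='li_txt']/h3/text()").extract()   # all teacher names on the page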

As an example, let's read the title of the site http://www.itcast.cn/. Modify itcast.py as follows:

# -*- coding: utf-8 -*-
import scrapy

# The following three lines work around encoding issues in Python 2.x and can be removed in Python 3.x
import sys
reload(sys)
sys.setdefaultencoding("utf-8")

class Opp2Spider(scrapy.Spider):
    name = 'itcast'
    allowed_domains = ['itcast.cn']
    start_urls = ['http://www.itcast.cn/']

    def parse(self, response):
        # Select the site title node
        context = response.xpath('/html/head/title/text()')

        # Extract the title text
        title = context.extract_first()
        print(title)

Run the following command:

$ scrapy crawl itcast
...
...
传智播客官网-好口碑IT培训机构,一样的教育,不一样的品质 ... ...

We defined an ItcastItem class earlier in mySpider/items.py. Import it here:

from mySpider.items import ItcastItem

Then wrap the data we obtain into ItcastItem objects, which can hold each teacher's attributes:

from mySpider.items import ItcastItem

def parse(self, response):
    #open("teacher.html","wb").write(response.body).close()

    # Collection that holds the teacher information
    items = []

    for each in response.xpath("//div[@class='li_txt']"):
        # Wrap the data we obtain into an `ItcastItem` object
        item = ItcastItem()
        # extract() always returns unicode strings
        name = each.xpath("h3/text()").extract()
        title = each.xpath("h4/text()").extract()
        info = each.xpath("p/text()").extract()

        # xpath returns a list containing one element
        item['name'] = name[0]
        item['title'] = title[0]
        item['info'] = info[0]

        items.append(item)

    # Return the collected data directly
    return items

We will not deal with pipelines for now; they are covered in detail later.

Saving the data

Scrapy offers four simple built-in ways to save the scraped information, all using -o to output a file in the specified format (a note on output encoding follows the commands).

JSON format (Unicode-escaped by default):

scrapy crawl itcast -o teachers.json

JSON Lines format (Unicode-escaped by default):

scrapy crawl itcast -o teachers.jsonl

CSV format (comma-separated values, can be opened with Excel):

scrapy crawl itcast -o teachers.csv

XML format:

scrapy crawl itcast -o teachers.xml
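One caveat: the JSON exporters escape non-ASCII characters as \uXXXX sequences by default, which is why the JSON output above is described as Unicode-escaped. If you would rather see readable UTF-8 Chinese text in teachers.json, newer Scrapy versions support the FEED_EXPORT_ENCODING setting (an optional tweak, not required by the steps above):

# mySpider/settings.py
FEED_EXPORT_ENCODING = 'utf-8'   # write UTF-8 instead of \uXXXX escapes in feed exports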

Food for thought

If the code is changed into the following form, the result is exactly the same.

Think about what role yield plays here (see "A brief look at using Python's yield"):

# -*- coding: utf-8 -*-
import scrapy
from mySpider.items import ItcastItem

# The following three lines work around encoding issues in Python 2.x and can be removed in Python 3.x
import sys
reload(sys)
sys.setdefaultencoding("utf-8")

class Opp2Spider(scrapy.Spider):
    name = 'itcast'
    allowed_domains = ['itcast.cn']
    start_urls = ("http://www.itcast.cn/channel/teacher.shtml",)

    def parse(self, response):
        #open("teacher.html","wb").write(response.body).close()

        for each in response.xpath("//div[@class='li_txt']"):
            # Wrap the data we obtain into an `ItcastItem` object
            item = ItcastItem()
            # extract() always returns unicode strings
            name = each.xpath("h3/text()").extract()
            title = each.xpath("h4/text()").extract()
            info = each.xpath("p/text()").extract()

            # xpath returns a list containing one element
            item['name'] = name[0]
            item['title'] = title[0]
            item['info'] = info[0]

            # Hand each item to the engine one at a time instead of
            # collecting them in a list and returning them at the end
            yield item