Scrapy

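Scrapy is a very comprehensive crawling framework, arguably the Django of the crawler world. It ships with many components, including items for structuring scraped data, pipelines for persistence, and spiders for the crawling logic itself.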

First, just as with Django, we install it with pip.

Linux
    pip3 install scrapy

Windows
    a. pip3 install wheel
    b. Download the Twisted wheel from http://www.lfd.uci.edu/~gohlke/pythonlibs/#twisted
    c. Change to the download directory and run pip3 install Twisted-xxxxx.whl

    d. pip3 install scrapy  -i http://pypi.douban.com/simple --trusted-host pypi.douban.com
    e. pip3 install pywin32  -i http://pypi.douban.com/simple --trusted-host pypi.douban.com

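Creating the first Scrapy program

Open a shell.

Create a Scrapy project:

scrapy startproject xxx  (xxx is the project name; the examples below use xianglong)

cd xianglong
scrapy genspider chouti chouti.com  (the domain given here shows up in start_urls of the generated spider)
Run the spider (with log output):
scrapy crawl chouti
Run it without log output:
scrapy crawl chouti --nolog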

import scrapy


class ChoutiSpider(scrapy.Spider):
    name = 'chouti'
    allowed_domains = ['chouti.com']
    start_urls = ['http://chouti.com/']

    def parse(self, response):
        print(response.text)

Here parse is a callback: Scrapy wraps the downloaded result in a response object and passes it to parse.

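To extract data from the page we can use Scrapy's built-in selectors instead of bs4; mixing the two makes the code feel like neither one thing nor the other.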

import scrapy
from scrapy.http import Request
from scrapy.selector import Selector  # used here in place of the deprecated HtmlXPathSelector
from ..items import XianglongItem


class ChoutiSpider(scrapy.Spider):
    name = 'chouti'
    allowed_domains = ['chouti.com']
    start_urls = ['http://dig.chouti.com/',]

    def parse(self, response):
        """
        Called automatically once a start URL has been downloaded; response wraps everything about the HTTP response.
        :param response:
        :return:
        """

        hxs = Selector(response=response)

        # Find the news entries in the downloaded page.
        # // searches all descendants: div[@id='content-list'] is the div whose id is content-list
        # /  searches direct children only: div[@class='item'] is a div whose class attribute is item

        items = hxs.xpath("//div[@id='content-list']/div[@class='item']")
        for row in items:
            href = row.xpath('.//div[@class="part1"]//a[1]/@href').extract_first()
            text = row.xpath('.//div[@class="part1"]//a[1]/text()').extract_first()
            # Yield a new item object instead of rebinding the loop variable.
            yield XianglongItem(title=text, href=href)

        # Collect the pagination links and schedule each page with the same callback.
        pages = hxs.xpath('//div[@id="page-area"]//a[@class="ct_pagepa"]/@href').extract()
        for page_url in pages:
            page_url = "https://dig.chouti.com" + page_url
            yield Request(url=page_url, callback=self.parse)

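If parse yields an Item object, Scrapy hands it to the pipelines defined in pipelines.py for processing.

To use this feature, the pipeline has to be enabled in the settings file.

Item pipeline
Configuration (settings.py):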
ITEM_PIPELINES = {
	'xianglong.pipelines.XianglongPipeline': 300,
}
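
The number attached to each pipeline is its order: Scrapy runs enabled pipelines in ascending order of this value, conventionally chosen in the 0 to 1000 range. As a sketch, assuming a second pipeline named StripWhitespacePipeline (a hypothetical name, not part of the original project), the setting could look like this:

```python
# settings.py (fragment). StripWhitespacePipeline is a hypothetical example class.
ITEM_PIPELINES = {
    'xianglong.pipelines.StripWhitespacePipeline': 100,  # lower value, runs first
    'xianglong.pipelines.XianglongPipeline': 300,        # higher value, runs afterwards
}

# Plain Python, only to make the resulting order visible.
run_order = sorted(ITEM_PIPELINES, key=ITEM_PIPELINES.get)
print(run_order)
```

Each pipeline receives whatever the previous one returned from process_item, so cleaning steps usually get lower numbers than storage steps.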

items.py defines the structure of the scraped data.

import scrapy


class XianglongItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    href = scrapy.Field()

The persistence component: pipelines.py

class XianglongPipeline(object):

    def process_item(self, item, spider):
        self.f.write(item['href']+'\n')
        self.f.flush()

        return item

    def open_spider(self, spider):
        """
        Called when the spider starts running.
        :param spider:
        :return:
        """
        self.f = open('url.log','w')

    def close_spider(self, spider):
        """
        Called when the spider is closed.
        :param spider:
        :return:
        """
        self.f.close()

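Persisting items means working with a file or a database, so we open the file handle or database connection once in open_spider, use it in process_item for every item, and release it in close_spider.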
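
The order in which these hooks fire can be imitated without Scrapy, since a pipeline is an ordinary Python class. The sketch below is only an approximation of what the framework does: UrlLogPipeline, the temporary file path, and the sample items are assumptions made for the illustration, and the spider argument is replaced by None.

```python
import os
import tempfile


class UrlLogPipeline:
    """Hypothetical pipeline with the same three hooks as XianglongPipeline."""

    def __init__(self, path):
        self.path = path
        self.f = None

    def open_spider(self, spider):
        # Acquire the resource once, before any item arrives.
        self.f = open(self.path, "w", encoding="utf-8")

    def process_item(self, item, spider):
        # Called once per item; return the item so later pipelines receive it.
        self.f.write(item["href"] + "\n")
        return item

    def close_spider(self, spider):
        # Release the resource after the last item.
        self.f.close()


path = os.path.join(tempfile.mkdtemp(), "url.log")
pipeline = UrlLogPipeline(path)
spider = None  # stand-in for the spider object Scrapy would pass

pipeline.open_spider(spider)
for item in [{"href": "https://example.com/1"}, {"href": "https://example.com/2"}]:
    pipeline.process_item(item, spider)
pipeline.close_spider(spider)

with open(path, encoding="utf-8") as fh:
    lines = fh.read().splitlines()
print(lines)
```

Opening the file per item would also work, but it repeats the setup cost for every record; the same reasoning applies even more strongly to database connections.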

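Once the current page has been processed and the links to further pages have been extracted, we want the spider to keep crawling.

We can set it up like this: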

# -*- coding: utf-8 -*-
import scrapy
from scrapy.selector import Selector  # used here in place of the deprecated HtmlXPathSelector
from scrapy.http import Request
from ..items import XianglongItem

class ChoutiSpider(scrapy.Spider):
    name = 'chouti'
    allowed_domains = ['chouti.com']
    start_urls = ['http://dig.chouti.com/',]

    def parse(self, response):
        """
        Called automatically once a start URL has been downloaded; response wraps everything about the HTTP response.
        :param response:
        :return:
        """

        # Build the selector first; the pagination query below depends on it.
        hxs = Selector(response=response)

        pages = hxs.xpath('//div[@id="page-area"]//a[@class="ct_pagepa"]/@href').extract()
        for page_url in pages:
            page_url = "https://dig.chouti.com" + page_url
            yield Request(url=page_url, callback=self.parse)

Whenever parse yields a Request object, Scrapy schedules that request and later calls the callback set on it with the new response.
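
Building the next-page URL by string concatenation only works when every href is a site-relative path starting with a slash. A slightly more forgiving approach, sketched here with the standard library, is to resolve each href against the page URL; the href values below are hypothetical examples, not links taken from the site.

```python
from urllib.parse import urljoin

base = "https://dig.chouti.com/"

# Hypothetical hrefs in the three forms a page might contain.
hrefs = [
    "/all/hot/recent/2",                        # site-relative path
    "all/hot/recent/3",                         # path relative to the current page
    "https://dig.chouti.com/all/hot/recent/4",  # already absolute
]

# urljoin resolves each href against the base URL.
absolute = [urljoin(base, href) for href in hrefs]
for url in absolute:
    print(url)
```

Inside a spider, response.urljoin(page_url) performs the same resolution against the URL of the response being parsed.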
