Python Distributed Crawler Framework Scrapy 4-6: Writing a Spider to Crawl All Articles

Run from the command line:

scrapy shell https://news.cnblogs.com/

Debug in the shell to get the URLs of all articles on the page:

>>> articles_link_xpath = '//*[@class="news_entry"]/a/@href'
>>> articles_link_selector = response.xpath(articles_link_xpath)
>>> articles_link_selector
[<Selector xpath='//*[@class="news_entry"]/a/@href' data='/n/630083/'>,
 <Selector xpath='//*[@class="news_entry"]/a/@href' data='/n/630082/'>,
 <Selector xpath='//*[@class="news_entry"]/a/@href' data='/n/630081/'>,
 <Selector xpath='//*[@class="news_entry"]/a/@href' data='/n/630080/'>,
 <Selector xpath='//*[@class="news_entry"]/a/@href' data='/n/630079/'>,
 <Selector xpath='//*[@class="news_entry"]/a/@href' data='/n/630078/'>,
 <Selector xpath='//*[@class="news_entry"]/a/@href' data='/n/630077/'>,
 <Selector xpath='//*[@class="news_entry"]/a/@href' data='/n/630076/'>,
 <Selector xpath='//*[@class="news_entry"]/a/@href' data='/n/630075/'>,
 <Selector xpath='//*[@class="news_entry"]/a/@href' data='/n/630074/'>,
 <Selector xpath='//*[@class="news_entry"]/a/@href' data='/n/630073/'>,
 <Selector xpath='//*[@class="news_entry"]/a/@href' data='/n/630072/'>,
 <Selector xpath='//*[@class="news_entry"]/a/@href' data='/n/630071/'>,
 <Selector xpath='//*[@class="news_entry"]/a/@href' data='/n/630070/'>,
 <Selector xpath='//*[@class="news_entry"]/a/@href' data='/n/630069/'>,
 <Selector xpath='//*[@class="news_entry"]/a/@href' data='/n/630068/'>,
 <Selector xpath='//*[@class="news_entry"]/a/@href' data='/n/630067/'>,
 <Selector xpath='//*[@class="news_entry"]/a/@href' data='/n/630066/'>,
 <Selector xpath='//*[@class="news_entry"]/a/@href' data='/n/630065/'>,
 <Selector xpath='//*[@class="news_entry"]/a/@href' data='/n/630064/'>,
 <Selector xpath='//*[@class="news_entry"]/a/@href' data='/n/630063/'>,
 <Selector xpath='//*[@class="news_entry"]/a/@href' data='/n/630061/'>,
 <Selector xpath='//*[@class="news_entry"]/a/@href' data='/n/630062/'>,
 <Selector xpath='//*[@class="news_entry"]/a/@href' data='/n/630060/'>,
 <Selector xpath='//*[@class="news_entry"]/a/@href' data='/n/630059/'>,
 <Selector xpath='//*[@class="news_entry"]/a/@href' data='/n/630058/'>,
 <Selector xpath='//*[@class="news_entry"]/a/@href' data='/n/630057/'>,
 <Selector xpath='//*[@class="news_entry"]/a/@href' data='/n/630056/'>,
 <Selector xpath='//*[@class="news_entry"]/a/@href' data='/n/630055/'>,
 <Selector xpath='//*[@class="news_entry"]/a/@href' data='/n/630054/'>]
>>> articles_link_list = articles_link_selector.extract()
>>> articles_link_list
['/n/630083/', '/n/630082/', '/n/630081/', '/n/630080/', '/n/630079/', '/n/630078/',
 '/n/630077/', '/n/630076/', '/n/630075/', '/n/630074/', '/n/630073/', '/n/630072/',
 '/n/630071/', '/n/630070/', '/n/630069/', '/n/630068/', '/n/630067/', '/n/630066/',
 '/n/630065/', '/n/630064/', '/n/630063/', '/n/630061/', '/n/630062/', '/n/630060/',
 '/n/630059/', '/n/630058/', '/n/630057/', '/n/630056/', '/n/630055/', '/n/630054/']
>>> articles_link = ['https://news.cnblogs.com' + url for url in articles_link_list]
>>> articles_link
['https://news.cnblogs.com/n/630083/', 'https://news.cnblogs.com/n/630082/', 'https://news.cnblogs.com/n/630081/',
 'https://news.cnblogs.com/n/630080/', 'https://news.cnblogs.com/n/630079/', 'https://news.cnblogs.com/n/630078/',
 'https://news.cnblogs.com/n/630077/', 'https://news.cnblogs.com/n/630076/', 'https://news.cnblogs.com/n/630075/',
 'https://news.cnblogs.com/n/630074/', 'https://news.cnblogs.com/n/630073/', 'https://news.cnblogs.com/n/630072/',
 'https://news.cnblogs.com/n/630071/', 'https://news.cnblogs.com/n/630070/', 'https://news.cnblogs.com/n/630069/',
 'https://news.cnblogs.com/n/630068/', 'https://news.cnblogs.com/n/630067/', 'https://news.cnblogs.com/n/630066/',
 'https://news.cnblogs.com/n/630065/', 'https://news.cnblogs.com/n/630064/', 'https://news.cnblogs.com/n/630063/',
 'https://news.cnblogs.com/n/630061/', 'https://news.cnblogs.com/n/630062/', 'https://news.cnblogs.com/n/630060/',
 'https://news.cnblogs.com/n/630059/', 'https://news.cnblogs.com/n/630058/', 'https://news.cnblogs.com/n/630057/',
 'https://news.cnblogs.com/n/630056/', 'https://news.cnblogs.com/n/630055/', 'https://news.cnblogs.com/n/630054/']

As the code above also shows, appending the following to an XPath expression:

/@href

extracts the href attribute of the selected tag.

Restore start_urls to:

https://news.cnblogs.com/

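The debugging session above joined the URLs with a list comprehension, but in the spider we will use the method that urllib provides instead.

The code below implements the first task of the parse function: get the article URLs from the list page and hand them to Scrapy to download, after which the individual fields are parsed.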

from scrapy.http import Request
from urllib import parse

Change the parse function to:

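    def parse(self, response):
        """
        1. Get the article URLs from the list page and hand them to Scrapy to download; the fields are then parsed
        2. Get the URL of the next page and hand it to Scrapy to download; the response goes back to parse
        """
        # Get the article URLs from the list page and hand them to Scrapy to download
        articles_link = response.xpath('//*[@class="news_entry"]/a/@href').extract()
        for url in articles_link:
            # urljoin turns the relative href into a full URL
            yield Request(url=parse.urljoin(response.url, url), callback=self.parse_detail)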

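The code above first uses XPath to extract the URLs in their relative, not-yet-joined form. Inside the for loop, urljoin builds the full URL, which is more general than string concatenation. The URL and the callback are passed to Request, and yielding the Request is all it takes for Scrapy to download the page and call the callback with the response.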
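To make the "more general" point concrete, here is a minimal sketch comparing urljoin with plain concatenation. The base URL is a list page like the ones seen later in this article; the variable names are only for illustration.

```python
from urllib import parse

# A list page deeper in the site, used as the base for joining.
base = "https://news.cnblogs.com/n/page/2/"

# A root-relative href replaces the whole path of the base URL.
joined_relative = parse.urljoin(base, "/n/630083/")

# An href that is already absolute is returned unchanged.
joined_absolute = parse.urljoin(base, "https://news.cnblogs.com/n/630082/")

# Plain concatenation only works when the base happens to be the site root.
concatenated = base + "/n/630083/"

print(joined_relative)   # https://news.cnblogs.com/n/630083/
print(joined_absolute)   # https://news.cnblogs.com/n/630082/
print(concatenated)      # https://news.cnblogs.com/n/page/2//n/630083/
```

Because response.url is always the page that was actually downloaded, passing it as the base keeps the join correct on every list page, not just the first.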

The field parsing written earlier is wrapped in the class's parse_detail method (with import re moved to the top of the file):

    def parse_detail(self, response):
        """
        Parse the individual fields of an article
        """
        # Title
        title = response.xpath('//*[@id="news_title"]/a/text()').extract_first("Untitled")
        # Publication date; fall back to a placeholder if no date is found
        date_text = response.xpath('//*[@id="news_info"]/span[2]/text()').extract_first("")
        date_match = re.search(r'\d{4}-\d{2}-\d{2}', date_text)
        create_date = date_match.group() if date_match else "0000-00-00"
        # Body
        content = response.xpath('//*[@id="news_body"]').extract_first("No content")
        # Tags
        tags = ','.join(response.xpath('//*[@id="news_more_info"]/div/a/text()').extract())
        # Source
        source = response.xpath('//*[@id="link_source2"]/text()').extract_first('Unknown')
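The publication date is pulled out of a longer text node with a regular expression. The following standalone sketch shows that step on its own, with a fallback for text that contains no date. The helper name extract_date and the sample strings are hypothetical and not part of the spider.

```python
import re

# Matches dates written as YYYY-MM-DD.
DATE_PATTERN = re.compile(r"\d{4}-\d{2}-\d{2}")


def extract_date(text, default="0000-00-00"):
    """Return the first YYYY-MM-DD substring in text, or default if there is none."""
    match = DATE_PATTERN.search(text)
    return match.group() if match else default


# Hypothetical text nodes, standing in for what the XPath might return.
print(extract_date("Posted on 2019-06-18 10:30"))  # 2019-06-18
print(extract_date("no date here"))                # 0000-00-00
```

Using a raw string (the r prefix) for the pattern avoids Python treating \d as a string escape, and checking the match before using it avoids an IndexError on text without a date.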

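Next comes the second task of the parse function: get the URL of the next page and hand it to Scrapy to download, with the response going back to parse.

Start by debugging in the Scrapy shell: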

>>> next_url_xpath = '//div[@class="pager"]/a[last()]/@href'
>>> next_url_selector = response.xpath(next_url_xpath)
>>> next_url_selector
[<Selector xpath='//div[@class="pager"]/a[last()]/@href' data='/n/page/2/'>]
>>> next_url = next_url_selector.extract_first()
>>> next_url
'/n/page/2/'

Add the following logic to the parse function:

        # Get the URL of the next page and hand it to Scrapy to download; the response goes back to parse
        next_url = response.xpath('//div[@class="pager"]/a[not(@class)]/@href').extract_first('end')
        if next_url != 'end':
            next_url = parse.urljoin(response.url, next_url)
            yield Request(url=next_url, callback=self.parse)

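This accounts for the last page, where the last a tag in the pager is no longer the next-page link. So instead of last(), we select the a tag that has no class attribute, using: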

/a[not(@class)]

Altogether, cnblogs.py now reads:

# -*- coding: utf-8 -*-
import scrapy
import re
from scrapy.http import Request
from urllib import parse


class CnblogsSpider(scrapy.Spider):
    name = 'cnblogs'
    allowed_domains = ['news.cnblogs.com']  # allowed domains
    start_urls = ['https://news.cnblogs.com']  # start URLs

    def parse(self, response):
        """
        1. Get the article URLs from the list page and hand them to Scrapy to download; the fields are then parsed
        2. Get the URL of the next page and hand it to Scrapy to download; the response goes back to parse
        """
        # Get the article URLs from the list page and hand them to Scrapy to download
        articles_link = response.xpath('//*[@class="news_entry"]/a/@href').extract()
        for url in articles_link:
            # urljoin turns the relative href into a full URL
            yield Request(url=parse.urljoin(response.url, url), callback=self.parse_detail)

        # Get the URL of the next page and hand it to Scrapy to download; the response goes back to parse
        next_url = response.xpath('//div[@class="pager"]/a[not(@class)]/@href').extract_first('end')
        if next_url != 'end':
            next_url = parse.urljoin(response.url, next_url)
            yield Request(url=next_url, callback=self.parse)

    def parse_detail(self, response):
        """
        Parse the individual fields of an article
        """
        # Title
        title = response.xpath('//*[@id="news_title"]/a/text()').extract_first("Untitled")
        # Publication date; fall back to a placeholder if no date is found
        date_text = response.xpath('//*[@id="news_info"]/span[2]/text()').extract_first("")
        date_match = re.search(r'\d{4}-\d{2}-\d{2}', date_text)
        create_date = date_match.group() if date_match else "0000-00-00"
        # Body
        content = response.xpath('//*[@id="news_body"]').extract_first("No content")
        # Tags
        tags = ','.join(response.xpath('//*[@id="news_more_info"]/div/a/text()').extract())
        # Source
        source = response.xpath('//*[@id="link_source2"]/text()').extract_first('Unknown')

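Note that because we are using Scrapy, some of the work no longer has to be done by hand with urllib or requests. The callback passed to Request is driven by an asynchronous mechanism. One visible consequence is that responses do not arrive in the order in which the URLs were passed to Request, so the order of articles from the same list page is not strictly preserved. Across list pages, content from earlier pages mostly arrives first, but occasionally an earlier page's content arrives after a later page's.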
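The out-of-order arrival can be reproduced without Scrapy. The sketch below uses asyncio purely as a stand-in for an asynchronous downloader (it is not Scrapy's actual engine), and the delays are made-up values that simulate pages taking different times to download.

```python
import asyncio

# Article paths mapped to simulated download times in seconds (made-up values).
SIMULATED_DELAYS = {
    "/n/630083/": 0.15,
    "/n/630082/": 0.05,
    "/n/630081/": 0.10,
}


async def download(url, finished):
    """Pretend to download url, then record it the way a callback would."""
    await asyncio.sleep(SIMULATED_DELAYS[url])
    finished.append(url)


async def crawl(urls):
    """Start every download at once and return the URLs in completion order."""
    finished = []
    await asyncio.gather(*(download(url, finished) for url in urls))
    return finished


requested = list(SIMULATED_DELAYS)
completed = asyncio.run(crawl(requested))
print("requested:", requested)
print("completed:", completed)
```

The requests are issued in one order and the "callbacks" run in another, because each one fires as soon as its own download finishes. Code that depends on article order should therefore sort by a field such as the publication date rather than rely on arrival order.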

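When does the crawler stop? The chain of callbacks can be seen as a kind of recursion, or equivalently as a loop. Viewed as a loop, the answer is no longer hard to see: the crawl ends once no callback yields a new Request, which here happens when parse finds no next-page link.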
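The loop view can be written out directly. This is a minimal sketch, not Scrapy's scheduler: FAKE_SITE is a hypothetical two-page site, and the queue plays the role of the pending requests.

```python
from collections import deque

# Hypothetical site: each list page maps to (article URLs, next page or None).
FAKE_SITE = {
    "/": (["/n/3/", "/n/2/"], "/n/page/2/"),
    "/n/page/2/": (["/n/1/"], None),
}


def crawl(start_url):
    """Process pending URLs until none are left, as a plain loop."""
    pending = deque([start_url])
    seen = {start_url}
    visited_articles = []
    while pending:  # the crawl ends when this queue runs dry
        url = pending.popleft()
        if url not in FAKE_SITE:
            # Article page: it is parsed but schedules nothing new.
            visited_articles.append(url)
            continue
        articles, next_page = FAKE_SITE[url]
        new_urls = articles + ([next_page] if next_page else [])
        for new_url in new_urls:
            if new_url not in seen:  # skip URLs that were already scheduled
                seen.add(new_url)
                pending.append(new_url)
    return visited_articles


print(crawl("/"))  # ['/n/3/', '/n/2/', '/n/1/']
```

The last list page has no next-page entry, so it adds only its articles to the queue; once those are processed the queue is empty and the loop exits.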

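Also, on the site we are crawling, articles from the last three days can be viewed without logging in, while articles older than three days require a login. We will solve that in the next chapter; in this chapter we only crawl the last three days. So parse_detail is rewritten once more: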

    def parse_detail(self, response):
        """
        Parse the individual fields of an article
        """
        # Skip responses whose URL contains 'account' (pages that require a login)
        if 'account' in response.url:
            return

        # Title
        title = response.xpath('//*[@id="news_title"]/a/text()').extract_first("Untitled")
        # Publication date; fall back to a placeholder if no date is found
        date_text = response.xpath('//*[@id="news_info"]/span[2]/text()').extract_first("")
        date_match = re.search(r'\d{4}-\d{2}-\d{2}', date_text)
        create_date = date_match.group() if date_match else "0000-00-00"
        # Body
        content = response.xpath('//*[@id="news_body"]').extract_first("No content")
        # Tags
        tags = ','.join(response.xpath('//*[@id="news_more_info"]/div/a/text()').extract())
        # Source
        source = response.xpath('//*[@id="link_source2"]/text()').extract_first('Unknown')
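The substring test on response.url is deliberately simple. If it ever proves too broad, one possible tightening is to look only at the host name. The sketch below assumes the login page is served from a host whose first label is "account"; that assumption, the helper name is_login_redirect, and the example.com URLs are all hypothetical.

```python
from urllib.parse import urlparse


def is_login_redirect(url):
    """Return True if url is on a host whose first label is 'account' (assumed login host)."""
    host = urlparse(url).hostname or ""
    return host.split(".")[0] == "account"


# Hypothetical URLs for illustration only.
print(is_login_redirect("https://account.example.com/signin"))      # True
print(is_login_redirect("https://news.example.com/n/630083/"))      # False
print(is_login_redirect("https://news.example.com/n/account-tips/"))  # False
```

The third case is where the two checks differ: a plain substring test would treat an article whose path happens to contain "account" as a login page.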

dmxjhg
