There are two ways to extract links:
- Selector
- LinkExtractor
LinkExtractor is used in two different situations: in a spider created from the crawl template, and in one created without it.
# Extracting links with LinkExtractor
from scrapy.linkextractors import LinkExtractor

le = LinkExtractor(restrict_css='ul.pager li.next')
links = le.extract_links(response)  # returns Link objects with absolute URLs
if links:
    next_url = links[0].url
    yield scrapy.Request(next_url, callback=self.parse)
# The next-page URL sits in ul.pager > li.next > a; the Selector-based
# equivalent would be:
# next_url = response.css('ul.pager li.next a::attr(href)').extract_first()
# if next_url:
#     # If a next-page URL is found, build the absolute URL and issue a new Request
#     next_url = response.urljoin(next_url)
#     yield scrapy.Request(next_url, callback=self.parse)
A detailed explanation (in Chinese) of LinkExtractor's parameters: https://blog.csdn.net/keenshinsword/article/details/79091859
Practical example 1: without the crawl template
import scrapy
from scrapy.linkextractors import LinkExtractor

class TengxSpider(scrapy.Spider):
    name = 'tengx'
    # allowed_domains = ['jianshu.com']
    start_urls = ['http://www.yub2b.com/news/list-49.html']

    def parse(self, response):
        le = LinkExtractor(restrict_css='.pages a')  # the filtering rule
        links = le.extract_links(response)  # applying the rule to this response
        if links:
            # links[1] is the "下一页" (next page) link in the output shown below;
            # extract_links already returns absolute URLs, so urljoin is not needed
            next_url = links[1].url
            print(next_url)
            yield scrapy.Request(url=next_url, callback=self.parse)
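Once the project is set up, this spider runs in the usual way, e.g. with scrapy crawl tengx from the project root.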
You can check each step with print statements. Printing the LinkExtractor itself shows that it is essentially a filtering rule for certain tags on the page:
<scrapy.linkextractors.lxmlhtml.LxmlLinkExtractor object at 0x033345C8>
Its extract_links method is what actually applies that rule to a response and returns the matching links. Its output looks like this:
[Link(url='http://www.yub2b.com/news/list-49.html', text='\xa0首页\xa0', fragment='', nofollow=False),
Link(url='http://www.yub2b.com/news/list-49-2.html', text='\xa0下一页\xa0', fragment='', nofollow=False),
Link(url='http://www.yub2b.com/news/list-49-94.html', text='\xa0上一页\xa0', fragment='', nofollow=False),
Link(url='http://www.yub2b.com/news/list-49-94.html', text='\xa0尾页\xa0', fragment='', nofollow=False)]
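Indexing links[1] works here only because of the order the links happen to come back in. A more robust variant (a minimal sketch, assuming the next-page link keeps the text '下一页') selects the link by its text instead of by position:

# Select the "next page" link by its text rather than by index
next_links = [link for link in links if '下一页' in link.text]
if next_links:
    yield scrapy.Request(next_links[0].url, callback=self.parse)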
LinkExtractor takes a number of parameters, such as allow, restrict_xpaths and restrict_css, tags, attrs, and process_value.
A reference (in Chinese) is https://blog.csdn.net/zjkpy_5/article/details/89812626
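The structural parameters are not covered in detail below, so here is a minimal sketch of them; the XPath (and the pages class name in it) is an assumption about the page, while tags and attrs are shown with their documented defaults:

from scrapy.linkextractors import LinkExtractor

le = LinkExtractor(
    restrict_xpaths='//div[@class="pages"]',  # only extract links inside this region
    tags=('a', 'area'),                       # default tags scanned for links
    attrs=('href',),                          # default attribute holding the URL
)
links = le.extract_links(response)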
The two parameters covered in detail here are allow and process_value.
allow
# allow takes a regular expression or a list of regular expressions and
# extracts the links whose absolute URL matches; if the parameter is empty
# (the default), all links are extracted
from scrapy.linkextractors import LinkExtractor

pat = r'your-regex'  # placeholder pattern
le = LinkExtractor(allow=pat)
links = le.extract_links(response)
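As a concrete illustration (an assumption based on the paging URLs in the output above), a pattern that matches only those list pages could look like this:

le = LinkExtractor(allow=r'/news/list-49(-\d+)?\.html')
links = le.extract_links(response)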
process_value
# process_value takes a callback. If supplied, LinkExtractor calls it on every
# extracted link value (e.g. the href of an <a> tag); the callback normally
# returns a string (the processed result) and returns None to discard the link
import re
from scrapy.linkextractors import LinkExtractor

def process(value):
    m = re.search(r'(pattern)', value)  # placeholder: must contain one capture group
    # If it matches, extract the embedded URL and return it;
    # otherwise return the original value unchanged
    if m:
        value = m.group(1)
    return value

le = LinkExtractor(process_value=process)
links = le.extract_links(response)
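A common use of process_value (a sketch assuming hypothetical javascript: pseudo-links such as javascript:goToPage('/news/list-49-2.html')) is to recover the real URL and discard everything else:

import re
from scrapy.linkextractors import LinkExtractor

def get_real_url(value):
    # Pull the real URL out of a javascript: pseudo-link
    m = re.search(r"javascript:goToPage\('(.*?)'\)", value)
    if m:
        return m.group(1)  # keep the URL embedded in the pseudo-link
    return None            # returning None discards the link

le = LinkExtractor(process_value=get_real_url)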
Using the crawl template
There are three concrete steps.
Step 1: launch the spider from a driver script via CrawlerProcess, for example:
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())
process.crawl('tengx')  # spider name
process.start()  # the script will block here until the crawling is finished
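If this driver script is saved as, say, run.py at the project root (the filename is just an assumption), the whole crawl starts with python run.py; get_project_settings() picks up the project's settings automatically.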
Step 2: change the spider's base class from scrapy.Spider to CrawlSpider:
class TengxSpider(CrawlSpider):
Step 3: define the rules in rules and use a callback to verify and extract the data:
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class TengxSpider(CrawlSpider):
    name = 'tengx'
    allowed_domains = ['blog.csdn.net']
    start_urls = ['https://blog.csdn.net']

    # Rules specifying which links to extract
    rules = (
        # follow: whether to keep extracting links from the pages fetched
        # via this rule and continue crawling from them
        Rule(LinkExtractor(allow=r'.*/article/.*'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        print('-' * 100)
        print(response.url)
        title = response.css('h1::text').extract()[0]
        print(title)
        print('-' * 100)
        return None
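One caveat worth noting: CrawlSpider implements its own parse method to drive the rules, so a Rule's callback must never be named parse; pick another name such as parse_item, as above.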