【Scrapy Dynamically Configured Spiders: Extracting Links with LinkExtractor】

       Two ways to extract links:

  • Selector
  • LinkExtractor

Using LinkExtractor breaks into two cases: spiders created from the crawl template, and spiders created without it.

     # Extracting the link with LinkExtractor
            from scrapy.linkextractors import LinkExtractor

            le = LinkExtractor(restrict_css='ul.pager li.next')
            links = le.extract_links(response)
            if links:
                next_url = links[0].url
                yield scrapy.Request(next_url, callback=self.parse)

            # The next-page URL lives inside ul.pager > li.next > a
            # next_url = response.css('ul.pager li.next a::attr(href)').extract_first()
            # if next_url:
            #     # If a next-page URL is found, resolve it to an absolute URL
            #     # and construct a new Request
            #     next_url = response.urljoin(next_url)
            #     yield scrapy.Request(next_url, callback=self.parse)

A detailed explanation of LinkExtractor's parameters can be found at https://blog.csdn.net/keenshinsword/article/details/79091859.

Practical example 1: without the crawl template

import scrapy
from scrapy.linkextractors import LinkExtractor


class TengxSpider(scrapy.Spider):
    name = 'tengx'
    # allowed_domains = ['jianshu.com']
    start_urls = ['http://www.yub2b.com/news/list-49.html']

    def parse(self, response):
        le = LinkExtractor(restrict_css='.pages a')
        # extract_links applies the filtering rule and returns Link objects
        links = le.extract_links(response)
        if links:
            # links[1] is the "next page" entry in this pagination bar;
            # extract_links already returns absolute URLs, so no urljoin is needed
            next_url = links[1].url
            print(next_url)
            yield scrapy.Request(url=next_url, callback=self.parse)

       You can verify this with print statements: a LinkExtractor is essentially a filtering rule for finding particular tags on the page. Printing the extractor itself shows:

       <scrapy.linkextractors.lxmlhtml.LxmlLinkExtractor object at 0x033345C8>

     Its extract_links method is what actually applies that filtering rule to a response and returns the matching links. Its output looks like this:

[Link(url='http://www.yub2b.com/news/list-49.html', text='\xa0首页\xa0', fragment='', nofollow=False), 
Link(url='http://www.yub2b.com/news/list-49-2.html', text='\xa0下一页\xa0', fragment='', nofollow=False), 
Link(url='http://www.yub2b.com/news/list-49-94.html', text='\xa0上一页\xa0', fragment='', nofollow=False), 
Link(url='http://www.yub2b.com/news/list-49-94.html', text='\xa0尾页\xa0', fragment='', nofollow=False)]
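Each Link object exposes url, text, fragment, and nofollow attributes. Rather than hardcoding links[1] as in the spider above, the next-page link can be picked by its text. A minimal sketch, using a namedtuple as a stand-in for scrapy.link.Link (same attribute names):

```python
from collections import namedtuple

# Stand-in with the same attribute names as scrapy.link.Link
Link = namedtuple('Link', ['url', 'text'])

links = [
    Link('http://www.yub2b.com/news/list-49.html', '\xa0首页\xa0'),
    Link('http://www.yub2b.com/news/list-49-2.html', '\xa0下一页\xa0'),
    Link('http://www.yub2b.com/news/list-49-94.html', '\xa0尾页\xa0'),
]

# Pick the link whose text contains "下一页" (next page) instead of trusting its position
next_url = next((l.url for l in links if '下一页' in l.text), None)
print(next_url)  # → http://www.yub2b.com/news/list-49-2.html
```

Selecting by text survives pagination bars that drop the "上一页" entry on the first page, which would shift the indexes.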

LinkExtractor's parameters include allow, restrict_xpaths and restrict_css, tags, attrs, and process_value.

A reference for all of them is https://blog.csdn.net/zjkpy_5/article/details/89812626

The two covered here are allow and process_value.

allow

# allow accepts a regular expression or a list of regular expressions, and keeps
# only links whose absolute URL matches; if omitted (the default), every link is extracted
from scrapy.linkextractors import LinkExtractor

pat = 'regular expression'
le = LinkExtractor(allow=pat)
links = le.extract_links(response)
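How an allow pattern filters can be checked with plain re before handing it to LinkExtractor, since allow does a regex search against each absolute URL. A sketch (the pattern is made up, based on the pagination pages above):

```python
import re

# Hypothetical pattern: keep only the numbered pagination pages
pat = re.compile(r'list-49-\d+\.html')

urls = [
    'http://www.yub2b.com/news/list-49.html',
    'http://www.yub2b.com/news/list-49-2.html',
    'http://www.yub2b.com/news/list-49-94.html',
]

# Simulate allow: a regex search against each absolute URL
matched = [u for u in urls if pat.search(u)]
print(matched)
```

Here the first (unnumbered) page is filtered out and only the list-49-2 and list-49-94 pages survive.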

process_value

# process_value accepts a callback function; if given, LinkExtractor calls it on
# every extracted link value (e.g. the href of an <a> tag). The callback normally
# returns a string (the processed value); return None to drop the link entirely
import re

from scrapy.linkextractors import LinkExtractor


def process(value):
    m = re.search('regex', value)
    # If it matches, extract the captured URL and return it;
    # otherwise return the value unchanged
    if m:
        value = m.group(1)
    return value


le = LinkExtractor(process_value=process)
links = le.extract_links(response)
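As a concrete sketch of such a callback (the javascript: href format here is hypothetical), pulling a real URL out of a javascript link while passing ordinary hrefs through unchanged:

```python
import re


def process(value):
    # Hypothetical case: the site wraps real URLs in javascript:goToPage('...')
    m = re.search(r"javascript:goToPage\('(.*?)'\)", value)
    if m:
        # Return just the captured URL
        return m.group(1)
    # Ordinary hrefs pass through unchanged
    return value


print(process("javascript:goToPage('http://example.com/page2.html')"))
print(process('http://example.com/other.html'))
```

Because the function never returns None, no link is dropped; returning None for unmatched values would instead discard them.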

Using the crawl template

There are three steps.

Step 1: launch the spider from a script with CrawlerProcess, for example:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())
process.crawl('tengx')  # the spider's name
process.start()  # the script will block here until the crawling is finished

Step 2: change the spider class to inherit from CrawlSpider instead of scrapy.Spider:

class TengxSpider(CrawlSpider):

Step 3: define the rules in rules, using a callback to validate and extract the data:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class TengxSpider(CrawlSpider):
    name = 'tengx'
    allowed_domains = ['blog.csdn.net']
    start_urls = ['https://blog.csdn.net']
    # Rules describing which links to extract and follow
    rules = (
        # follow=True: after a matched page is crawled, keep extracting links
        # from it and continue crawling
        Rule(LinkExtractor(allow=r'.*/article/.*'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        print('-' * 100)
        print(response.url)
        title = response.css('h1::text').extract_first()
        print(title)
        print('-' * 100)
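The allow pattern in that rule can likewise be tried against sample URLs with plain re to confirm which pages the spider would follow (the URLs below are made-up examples in the shape of CSDN blog links):

```python
import re

# The same pattern used in the Rule above
pat = re.compile(r'.*/article/.*')

urls = [
    'https://blog.csdn.net/someuser/article/details/105089248',
    'https://blog.csdn.net/nav/python',
]

# Only URLs containing /article/ would be extracted and followed
followed = [u for u in urls if pat.search(u)]
print(followed)
```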

Reposted from blog.csdn.net/fan13938409755/article/details/105089248