scrapy -- Understanding Rule() and LinkExtractor()


These two classes are used inside the rules attribute of a CrawlSpider. Their parameters are documented in plenty of places online, so I won't repeat that here. What I want to write down are a few gotchas that almost drove me crazy.

1. Where they come from:

from scrapy.spiders import CrawlSpider, Rule  # scrapy.contrib.spiders is deprecated in newer Scrapy versions
from scrapy.linkextractors import LinkExtractor

2. Gotchas:

1. rules defines how URLs are extracted from responses. Each extracted URL is requested again, and the resulting response is parsed or followed further according to the Rule's callback and follow settings.
Two points are worth stressing: first, URL extraction is applied to every response that comes back, including the responses from the initial start_urls requests; second, every Rule in the rules list is applied, not just the first one.

2. The allow parameter does not need a regular expression matching the complete URL; a partial pattern is enough, as long as it distinguishes the links you want. Most importantly, even when the page itself contains relative URLs, LinkExtractor resolves them and returns absolute URLs, which is remarkably handy. A minimal CrawlSpider combining these points is sketched right after this list.
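Here is a minimal sketch of how Rule and LinkExtractor fit together for the site used later in this post. The spider name, the parse_chapter callback, the follow=False choice, and the extracted field are my own illustrative assumptions, not from the original post.

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class NovelSpider(CrawlSpider):  # hypothetical spider name
    name = 'novel'
    start_urls = ['https://www.kanunu8.com/book2/10935/index.html']

    rules = (
        # A partial pattern is enough; relative hrefs are resolved to absolute URLs.
        Rule(LinkExtractor(allow=r'\d{6}\.html'),
             callback='parse_chapter',  # parse each chapter page
             follow=False),             # assumption: do not follow further links from chapter pages
    )

    def parse_chapter(self, response):
        # placeholder callback: yield the page title as an example item
        yield {'title': response.xpath('//title/text()').get()}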

3. Using LinkExtractor on its own

# inside a Spider subclass
start_urls = ['https://www.kanunu8.com/book2/10935/index.html']

def parse(self, response):
    # raw string for the regex; restrict extraction to <a> tags inside the table
    link = LinkExtractor(allow=r'\d{6}\.html', restrict_xpaths='//div//table//a')
    links = link.extract_links(response)
    print(links)

[Link(url='https://www.kanunu8.com/book2/10935/194600.html', text='楔子', fragment='', nofollow=False),
Link(url='https://www.kanunu8.com/book2/10935/194601.html', text='第一章', fragment='', nofollow=False),
Link(url='https://www.kanunu8.com/book2/10935/194602.html', text='第二章', fragment='', nofollow=False),
Link(url='https://www.kanunu8.com/book2/10935/194603.html', text='第三章', fragment='', nofollow=False),
Link(url='https://www.kanunu8.com/book2/10935/194604.html', text='第四章', fragment='', nofollow=False),
Link(url='https://www.kanunu8.com/book2/10935/194605.html', text='第五章', fragment='', nofollow=False),
Link(url='https://www.kanunu8.com/book2/10935/194606.html', text='第六章', fragment='', nofollow=False),
Link(url='https://www.kanunu8.com/book2/10935/194607.html', text='第七章', fragment='', nofollow=False),
Link(url='https://www.kanunu8.com/book2/10935/194608.html', text='第八章', fragment='', nofollow=False),
Link(url='https://www.kanunu8.com/book2/10935/194609.html', text='第九章', fragment='', nofollow=False),
Link(url='https://www.kanunu8.com/book2/10935/194610.html', text='第十章', fragment='', nofollow=False),
Link(url='https://www.kanunu8.com/book2/10935/194611.html', text='第十一章', fragment='', nofollow=False),
Link(url='https://www.kanunu8.com/book2/10935/194612.html', text='第十二章', fragment='', nofollow=False),
Link(url='https://www.kanunu8.com/book2/10935/194613.html', text='第十三章', fragment='', nofollow=False),
Link(url='https://www.kanunu8.com/book2/10935/194614.html', text='后记', fragment='', nofollow=False)]

Notice that the page itself uses relative addresses, yet LinkExtractor computes and returns absolute ones. Also, links is a list of Link objects, so the URLs can be pulled out with:

for link in links:
    print(link.url)

This prints the absolute URLs directly, which is very convenient: there is no need to call response.urljoin() yourself.
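As a small follow-up sketch (not from the original post), the extracted links could be turned into new requests inside parse; parse_chapter here is a hypothetical callback:

import scrapy
from scrapy.linkextractors import LinkExtractor

def parse(self, response):
    extractor = LinkExtractor(allow=r'\d{6}\.html', restrict_xpaths='//div//table//a')
    for link in extractor.extract_links(response):
        # link.url is already absolute, so it can be requested directly
        yield scrapy.Request(link.url, callback=self.parse_chapter)  # hypothetical callback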
