There are two ways to extract links:
- Selector
- LinkExtractor
LinkExtractor is used in two different situations: in a spider created from the crawl template, and in one created without it.
# Extracting links with LinkExtractor
from scrapy.linkextractors import LinkExtractor

le = LinkExtractor(restrict_css='ul.pager li.next')
links = le.extract_links(response)  # returns Link objects with absolute URLs
if links:
    next_url = links[0].url
    yield scrapy.Request(next_url, callback=self.parse)
# The next-page URL sits in ul.pager > li.next > a; the Selector-based
# equivalent would be:
# next_url = response.css('ul.pager li.next a::attr(href)').extract_first()
# if next_url:
#     # If a next-page URL is found, build the absolute URL and issue a new Request
#     next_url = response.urljoin(next_url)
#     yield scrapy.Request(next_url, callback=self.parse)
A detailed explanation (in Chinese) of LinkExtractor's parameters: https://blog.csdn.net/keenshinsword/article/details/79091859
Practical example 1: without the crawl template
import scrapy
from scrapy.linkextractors import LinkExtractor

class TengxSpider(scrapy.Spider):
    name = 'tengx'
    # allowed_domains = ['jianshu.com']
    start_urls = ['http://www.yub2b.com/news/list-49.html']

    def parse(self, response):
        le = LinkExtractor(restrict_css='.pages a')  # the filtering rule
        links = le.extract_links(response)  # applying the rule to this response
        if links:
            # links[1] is the "下一页" (next page) link in the output shown below;
            # extract_links already returns absolute URLs, so urljoin is not needed
            next_url = links[1].url
            print(next_url)
            yield scrapy.Request(url=next_url, callback=self.parse)
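Once the project is set up, this spider runs in the usual way, e.g. with scrapy crawl tengx from the project root.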
You can check each step with print statements. Printing the LinkExtractor itself shows that it is essentially a filtering rule for certain tags on the page:
<scrapy.linkextractors.lxmlhtml.LxmlLinkExtractor object at 0x033345C8>
Its extract_links method is what actually applies that rule to a response and returns the matching links. Its output looks like this:
[Link(url='http://www.yub2b.com/news/list-49.html', text='\xa0首页\xa0', fragment='', nofollow=False),
Link(url='http://www.yub2b.com/news/list-49-2.html', text='\xa0下一页\xa0', fragment='', nofollow=False),
Link(url='http://www.yub2b.com/news/list-49-94.html', text='\xa0上一页\xa0', fragment='', nofollow=False),
Link(url='http://www.yub2b.com/news/list-49-94.html', text='\xa0尾页\xa0', fragment='', nofollow=False)]
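Indexing links[1] works here only because of the order the links happen to come back in. A more robust variant (a minimal sketch, assuming the next-page link keeps the text '下一页') selects the link by its text instead of by position:

# Select the "next page" link by its text rather than by index
next_links = [link for link in links if '下一页' in link.text]
if next_links:
    yield scrapy.Request(next_links[0].url, callback=self.parse)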
LinkExtractor takes a number of parameters, such as allow, restrict_xpaths and restrict_css, tags, attrs, and process_value.
A reference (in Chinese) is https://blog.csdn.net/zjkpy_5/article/details/89812626
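The structural parameters are not covered in detail below, so here is a minimal sketch of them; the XPath (and the pages class name in it) is an assumption about the page, while tags and attrs are shown with their documented defaults:

from scrapy.linkextractors import LinkExtractor

le = LinkExtractor(
    restrict_xpaths='//div[@class="pages"]',  # only extract links inside this region
    tags=('a', 'area'),                       # default tags scanned for links
    attrs=('href',),                          # default attribute holding the URL
)
links = le.extract_links(response)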
The two parameters covered in detail here are allow and process_value.
allow
# allow takes a regular expression or a list of regular expressions and
# extracts the links whose absolute URL matches; if the parameter is empty
# (the default), all links are extracted
from scrapy.linkextractors import LinkExtractor

pat = r'your-regex'  # placeholder pattern
le = LinkExtractor(allow=pat)
links = le.extract_links(response)
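As a concrete illustration (an assumption based on the paging URLs in the output above), a pattern that matches only those list pages could look like this:

le = LinkExtractor(allow=r'/news/list-49(-\d+)?\.html')
links = le.extract_links(response)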
process_value
# process_value takes a callback. If supplied, LinkExtractor calls it on every
# extracted link value (e.g. the href of an <a> tag); the callback normally
# returns a string (the processed result) and returns None to discard the link
import re
from scrapy.linkextractors import LinkExtractor

def process(value):
    m = re.search(r'(pattern)', value)  # placeholder: must contain one capture group
    # If it matches, extract the embedded URL and return it;
    # otherwise return the original value unchanged
    if m:
        value = m.group(1)
    return value

le = LinkExtractor(process_value=process)
links = le.extract_links(response)
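A common use of process_value (a sketch assuming hypothetical javascript: pseudo-links such as javascript:goToPage('/news/list-49-2.html')) is to recover the real URL and discard everything else:

import re
from scrapy.linkextractors import LinkExtractor

def get_real_url(value):
    # Pull the real URL out of a javascript: pseudo-link
    m = re.search(r"javascript:goToPage\('(.*?)'\)", value)
    if m:
        return m.group(1)  # keep the URL embedded in the pseudo-link
    return None            # returning None discards the link

le = LinkExtractor(process_value=get_real_url)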
Using the crawl template
There are three concrete steps.
Step 1: launch the spider from a driver script via CrawlerProcess, for example:
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())
process.crawl('tengx')  # spider name
process.start()  # the script will block here until the crawling is finished
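If this driver script is saved as, say, run.py at the project root (the filename is just an assumption), the whole crawl starts with python run.py; get_project_settings() picks up the project's settings automatically.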
Step 2: change the spider's base class from scrapy.Spider to CrawlSpider:
class TengxSpider(CrawlSpider):
Step 3: define the rules in rules and use a callback to verify and extract the data:
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class TengxSpider(CrawlSpider):
    name = 'tengx'
    allowed_domains = ['blog.csdn.net']
    start_urls = ['https://blog.csdn.net']

    # Rules specifying which links to extract
    rules = (
        # follow: whether to keep extracting links from the pages fetched
        # via this rule and continue crawling from them
        Rule(LinkExtractor(allow=r'.*/article/.*'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        print('-' * 100)
        print(response.url)
        title = response.css('h1::text').extract()[0]
        print(title)
        print('-' * 100)
        return None
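One caveat worth noting: CrawlSpider implements its own parse method to drive the rules, so a Rule's callback must never be named parse; pick another name such as parse_item, as above.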