8_2 Scrapy in Practice: CrawlSpider (crawling the WeChat Mini Program community tutorials)

CrawlSpider is designed for sites whose URLs follow regular patterns; it can crawl an entire site by following links that match a set of rules.

1. Create the project

scrapy startproject wxapp

cd wxapp

scrapy genspider -t crawl wxapp_spider wxapp-union.com
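The -t crawl option tells genspider to use the CrawlSpider template rather than the default Spider template. The generated wxapp_spider.py typically looks roughly like the sketch below (placeholder values may differ slightly between Scrapy versions); the rest of this post edits this skeleton step by step.

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class WxappSpiderSpider(CrawlSpider):
    name = 'wxapp_spider'
    allowed_domains = ['wxapp-union.com']
    start_urls = ['http://wxapp-union.com/']

    rules = (
        Rule(LinkExtractor(allow=r'Items/'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        item = {}
        return item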

2. Edit settings.py

ROBOTSTXT_OBEY = False
DOWNLOAD_DELAY = 3
DEFAULT_REQUEST_HEADERS = {...}
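DEFAULT_REQUEST_HEADERS is left elided above; a minimal illustrative fill-in is shown below. The Accept and Accept-Language entries are the ones Scrapy ships commented out in settings.py; the User-Agent value is only a placeholder, substitute a real browser UA string if the site requires one.

# settings.py (excerpt)
ROBOTSTXT_OBEY = False      # do not honor robots.txt for this tutorial target
DOWNLOAD_DELAY = 3          # wait 3 seconds between requests
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',  # placeholder value
}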

3. Write wxapp_spider.py (the key part)

  The code:

# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class WxappSpiderSpider(CrawlSpider):
    name = 'wxapp_spider'
    allowed_domains = ['wxapp-union.com']
    # start_urls = ['http://wxapp-union.com/']
    start_urls = ['http://www.wxapp-union.com/portal.php?mod=list&catid=2&page=1']

    rules = (
        # remember to escape regex special characters with a backslash
        Rule(LinkExtractor(allow=r'.+list&catid=2&page=\d'), follow=True),
        Rule(LinkExtractor(allow=r'.+article.+\.html'), callback='parse_detail', follow=False)
    )

    def parse_detail(self, response):
        print(type(response))

  Notes and common pitfalls:

    1. parse_detail(self, response) is the callback used in the Rule. To debug it, the domain in start_urls must be consistent with allowed_domains; if they differ, the requests are filtered out automatically and the program never enters parse_detail.

    2. LinkExtractor and Rule are required; together they determine how the spider crawls.

      2.1 Setting the allow pattern: restrict it to exactly the URLs the program needs to crawl, and remember to escape regex special characters (a quick way to test a pattern is sketched after this list).

      2.2 When to set follow: if, while crawling a page, you want to keep following the URLs that match the current rule, set it to True; otherwise set it to False.

      2.3 When to set callback: if you want to extract data from the page a URL points to, specify a parsing function as the callback. If a page is fetched only to discover more URLs and its data is not needed, no callback is required.
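A quick way to verify an allow pattern before a full crawl is to open the list page in scrapy shell and run the LinkExtractor by hand. This is only a sanity-check sketch; the pattern shown is the article rule from above.

# first run:  scrapy shell "http://www.wxapp-union.com/portal.php?mod=list&catid=2&page=1"
# then, inside the shell:
from scrapy.linkextractors import LinkExtractor

le = LinkExtractor(allow=r'.+article.+\.html')
for link in le.extract_links(response):   # 'response' is provided by scrapy shell
    print(link.url)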

4. Scrape the page data

# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class WxappSpiderSpider(CrawlSpider):
    name = 'wxapp_spider'
    allowed_domains = ['wxapp-union.com']
    # start_urls = ['http://wxapp-union.com/']
    start_urls = ['http://www.wxapp-union.com/portal.php?mod=list&catid=2&page=1']

    rules = (
        # remember to escape regex special characters with a backslash
        Rule(LinkExtractor(allow=r'.+list&catid=2&page=\d'), follow=True),
        Rule(LinkExtractor(allow=r'.+article.+\.html'), callback='parse_detail', follow=False)
    )

    def parse_detail(self, response):
        title = response.xpath("//h1[@class='ph']/text()").get()
        author_p = response.xpath("//p[@class='authors']")
        author = author_p.xpath(".//a/text()").get()
        pub_time = author_p.xpath(".//span/text()").get()
        # getall() returns every text node under the article body; join them into one string
        content = response.xpath("//td[@id='article_content']//text()").getall()
        content = "".join(content).strip()

5. Store the data

  1) items.py

import scrapy


class WxappItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    author = scrapy.Field()
    pub_time = scrapy.Field()
    content = scrapy.Field()

  2) pipelines.py

from scrapy.exporters import JsonLinesItemExporter


class WxappPipeline:
    def __init__(self):
        # the exporter writes bytes, so open the output file in binary mode
        self.fp = open('wxjc.json', 'wb')
        self.exporter = JsonLinesItemExporter(self.fp,
                                              ensure_ascii=False,
                                              encoding='utf-8')

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item

    def close_spider(self, spider):
        self.fp.close()
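JsonLinesItemExporter writes each item as a single JSON object on its own line, so it works without the start_exporting/finish_exporting bookkeeping that JsonItemExporter needs and keeps memory use flat on large crawls. Each line of wxjc.json will have roughly this shape (the values below are placeholders, not real scraped data):

{"title": "...", "author": "...", "pub_time": "...", "content": "..."}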

  3) wxapp_spider.py

# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from wxapp.items import WxappItem


class WxappSpiderSpider(CrawlSpider):
    name = 'wxapp_spider'
    allowed_domains = ['wxapp-union.com']
    # start_urls = ['http://wxapp-union.com/']
    start_urls = ['http://www.wxapp-union.com/portal.php?mod=list&catid=2&page=1']

    rules = (
        # remember to escape regex special characters with a backslash
        Rule(LinkExtractor(allow=r'.+list&catid=2&page=\d'), follow=True),
        Rule(LinkExtractor(allow=r'.+article.+\.html'), callback='parse_detail', follow=False)
    )

    def parse_detail(self, response):
        title = response.xpath("//h1[@class='ph']/text()").get()
        author_p = response.xpath("//p[@class='authors']")
        author = author_p.xpath(".//a/text()").get()
        pub_time = author_p.xpath(".//span/text()").get()
        content = response.xpath("//td[@id='article_content']//text()").getall()
        content = "".join(content).strip()

        # hand the data to the item pipeline
        item = WxappItem(title=title, author=author, pub_time=pub_time, content=content)
        yield item

  4) Update settings.py

ITEM_PIPELINES = {
   'wxapp.pipelines.WxappPipeline': 300,
}
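With the pipeline registered, run the spider from the project root; scraped items are appended to wxjc.json as the crawl progresses:

scrapy crawl wxapp_spider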

Reposted from www.cnblogs.com/sruzzg/p/13185783.html