Crawling a Forum with Scrapy and Storing It in Elasticsearch

A few days after I finished crawling Weibo, my boss saw me watching AI videos on my own and apparently decided I needed more work: yesterday he asked me to crawl a small forum. Luckily I had just finished teaching myself Scrapy and was itching for a chance to practice, so I wrote the whole thing in one afternoon. Without further ado, let's get into it.

I. Environment

        Python version: Python 3.6.5 :: Anaconda

        IDE: Eclipse Oxygen.3 Release (4.7.3)

        Development machine: macOS 10.13.2

        Deployment server: CentOS 7

        PyPI package used: ScrapyElasticSearch 0.9.1

II. Preparation

1. Installing Python 3 needs no walkthrough here; Anaconda handles it smoothly.

2. Set up the Scrapy project. Open a terminal, cd to the directory where you want to keep the project, and run the following commands:

MacBookPro:forum songyao$ scrapy startproject [project name]

MacBookPro:forum songyao$ cd forum/

MacBookPro:forum songyao$ scrapy genspider forum_spider "http://bbs.07430743.com"

MacBookPro:forum songyao$ scrapy crawl forum_spider

The last command above is the one that actually runs the Scrapy spider. Typing it into the terminal every time gets tedious, and on Windows you additionally have to install the pywin32 package before the development environment even works, which is a hassle. Instead, create a new file in the project directory named start.py, put the command in it, and from then on just run start.py directly.

#encoding: utf-8

from scrapy import cmdline

cmdline.execute("scrapy crawl forum_spider".split())
# cmdline.execute(["scrapy","crawl","forum_spider"])

Once that's done, the project looks like the tree below:
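(This is the standard layout that scrapy startproject generates, with our start.py added alongside scrapy.cfg:)

forum/
├── scrapy.cfg
├── start.py
└── forum/
    ├── __init__.py
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    └── spiders/
        ├── __init__.py
        └── forum_spider.py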

A quick tour of the structure:

items: defines the models for the scraped data, i.e. the fields and their types (time, body text, author, and so on), so you don't have to pass raw dictionaries around.

middlewares: holds all the middleware. A proxy, for example, can be wired in here (see the short sketch after this list), though small projects rarely need one.

pipelines: processes the scraped data: store it in MySQL or ES, or write it straight to a JSON file.

settings: configures the crawler: default request headers, whether cookies are enabled, whether to delay before each download, and so on.
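To illustrate the middleware point, here is a minimal sketch of a proxy downloader middleware; the proxy address is a made-up placeholder, and the class would still need to be enabled via DOWNLOADER_MIDDLEWARES in settings.py:

class ProxyMiddleware(object):
    # Route every outgoing request through a proxy. The address below is
    # a hypothetical placeholder, not a working proxy.
    def process_request(self, request, spider):
        request.meta['proxy'] = 'http://127.0.0.1:8888'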

3. Install ScrapyElasticSearch

This plugin is genuinely, genuinely good. I read through a lot of hand-rolled Elasticsearch storage code online and it all felt overcomplicated; with this plugin, a few lines of configuration get the whole job done. Strongly recommended!!!

Project page: https://pypi.org/project/ScrapyElasticSearch/

Open a terminal; one command does it: pip install ScrapyElasticSearch

III. Development

Here is my code (you may kneel):

forum_spider.py

# -*- coding: utf-8 -*-
import scrapy
import time
from forum.items import ForumItem


class ForumSpiderSpider(scrapy.Spider):
    name = 'forum_spider'
    allowed_domains = ['bbs.07430743.com']
    start_urls = ['http://bbs.07430743.com/thread-1616404-1-1.html']

    def parse(self, response):
        commands = response.xpath("//div//td[@class='t_f']/text()")
        times = response.xpath("//div[@class='authi']//span/@title")
        # Fall back to the current time if the page carries no post time.
        # Note the variable must not be named `time`: assigning to that
        # name inside parse() would shadow the time module and raise
        # UnboundLocalError on the very line that tries to use it.
        post_time = time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(time.time()))
        if times.get():
            post_time = times.get().strip()
        for forum_url in commands:
            command = forum_url.get().strip()
            if command != "":
                source = "luntan"
                url = "http://bbs.07430743.com"
                item = ForumItem(source=source, command=command, time=post_time, url=url)
                yield item
        # Follow the "next page" link until there is none left;
        # urljoin() handles relative hrefs safely.
        next_url = response.xpath("//div[@class='pg']/a[last()]/@href").get()
        if not next_url:
            return
        yield scrapy.Request(response.urljoin(next_url), callback=self.parse)
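Before kicking off the full crawl, it's worth sanity-checking the XPath expressions interactively with the scrapy shell (here against the same start URL):

MacBookPro:forum songyao$ scrapy shell "http://bbs.07430743.com/thread-1616404-1-1.html"
>>> response.xpath("//div//td[@class='t_f']/text()").getall()
>>> response.xpath("//div[@class='pg']/a[last()]/@href").get()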

start.py

#encoding: utf-8

from scrapy import cmdline

cmdline.execute("scrapy crawl forum_spider".split())
# cmdline.execute(["scrapy","crawl","forum_spider"])

settings.py

# -*- coding: utf-8 -*-


BOT_NAME = 'forum'

SPIDER_MODULES = ['forum.spiders']
NEWSPIDER_MODULE = 'forum.spiders'



DEFAULT_REQUEST_HEADERS = {
   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
   'Accept-Language': 'en',
   'User-Agent':'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36',
   'Referer':'http://tianqi.2345.com/plugin/widget/index.htm?s=1&z=2&t=1&v=0&d=2&bd=0&k=&f=&ltf=009944&htf=cc0000&q=0&e=0&a=0&c=60011&w=410&h=60&align=center'

}

# Writes scraped items to ES; for real use, point this at your ES server's IP + port.
# Inspect the results at http://localhost:9200/forum/_search

ITEM_PIPELINES = {
    'scrapyelasticsearch.scrapyelasticsearch.ElasticSearchPipeline': 100
}

ELASTICSEARCH_SERVERS = ['192.168.1.80:9200']
ELASTICSEARCH_INDEX = 'forum'
ELASTICSEARCH_TYPE = 'jishou'
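If the little forum needs gentler treatment, two core Scrapy settings mentioned in the overview above can also go right here; a minimal example:

DOWNLOAD_DELAY = 1        # pause one second between consecutive requests
COOKIES_ENABLED = False   # the forum needs no login, so cookies are unnecessary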

pipelines.py

# -*- coding: utf-8 -*-

from scrapy.exporters import JsonLinesItemExporter


class ForumPipeline(object):
    def __init__(self):
        # One JSON object per line; the exporter expects a binary file handle.
        self.fp = open("luntan.json", 'wb')
        self.exporter = JsonLinesItemExporter(self.fp, ensure_ascii=False, encoding='utf-8')

    def open_spider(self, spider):
        print('Forum spider starting...')

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item

    def close_spider(self, spider):
        self.fp.close()
        print('Forum spider finished...')
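One thing to notice: ITEM_PIPELINES in settings.py above only registers the Elasticsearch pipeline, so this ForumPipeline never actually runs as configured. If you want the local luntan.json file as well, register both (a sketch; lower numbers run first):

ITEM_PIPELINES = {
    'scrapyelasticsearch.scrapyelasticsearch.ElasticSearchPipeline': 100,
    'forum.pipelines.ForumPipeline': 200,
}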

items.py

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class ForumItem(scrapy.Item):
    source = scrapy.Field()
    command = scrapy.Field()
    time = scrapy.Field()
    url = scrapy.Field()

IV. Running It

Copy the project to the server, cd into the project directory, and run python start.py.
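To confirm that documents actually landed in Elasticsearch, query the index directly using the server address from settings.py (the size parameter just trims the output):

curl 'http://192.168.1.80:9200/forum/_search?pretty&size=3'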

Tested it myself. Rock solid.

Feel free to leave a comment if you run into anything. Now off to the cafeteria for lunch~


Reposted from blog.csdn.net/IncubusSong/article/details/82658269