Implementing Elasticsearch and scrapy-redis distributed crawling

kibana-5.1.2-windows-x86

elasticsearch-rtf

elasticsearch-head

The elasticsearch-rtf version should be close to the kibana version; detailed setup instructions can be found on GitHub.

If npm is needed (e.g. for elasticsearch-head), install Node.js first.



Create a models folder in the project, similar to Django's layout.


from elasticsearch_dsl import DocType, Date, Keyword, Text, Integer
from elasticsearch_dsl.connections import connections

# Register a default connection to the local Elasticsearch instance
connections.create_connection(hosts=["localhost"])

class jobboleItemsType(DocType):
    # ik_max_word is the fine-grained Chinese analyzer from the IK plugin
    # (bundled with elasticsearch-rtf)
    title = Text(analyzer="ik_max_word")
    date_time = Date()
    style = Text(analyzer='ik_max_word')
    content = Text(analyzer='ik_max_word')
    cherish = Integer()
    image_url = Keyword()
    img_path = Keyword()

    class Meta:
        index = 'job_bole'
        doc_type = 'article'

if __name__ == '__main__':
    # Create the index and its mapping in Elasticsearch
    jobboleItemsType.init()

The above defines the fields corresponding to the item, much like creating a table in a database; running init() creates the index and its mapping in Elasticsearch.


Then, in the corresponding Item class, add a save method:

def save_to_es(self):
    """Convert this scraped item into an ES document and index it."""
    article = jobboleItemsType()
    article.title = self['title']
    article.content = self['content']
    article.date_time = self['date_time']
    article.cherish = self['cherish']
    article.image_url = self['image_url']
    # article.img_path = self['img_path']
    # Use our own id as the document id so re-crawling the same page
    # overwrites the document instead of duplicating it
    article.meta.id = self['id']
    article.save()

Call this method from the corresponding pipeline to write the data into Elasticsearch, as in the sketch below.
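
A minimal pipeline sketch (the class name ElasticsearchPipeline is illustrative; it assumes every item passing through defines save_to_es as above):

class ElasticsearchPipeline(object):
    """Write each crawled item into Elasticsearch."""

    def process_item(self, item, spider):
        # Delegate to the item's own save logic defined above
        item.save_to_es()
        return item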

After writing it, remember to register the pipeline in settings.
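
For example, in settings.py (the module path myproject.pipelines is an assumption; use your project's actual path):

# settings.py
ITEM_PIPELINES = {
    'myproject.pipelines.ElasticsearchPipeline': 300,
}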

Distributed crawling -------------------------------------

Download and install scrapy-redis.

Copy the folder C:\Users\chase\Desktop\scrapy-redis-master\src\scrapy_redis into the project, then configure it as described in the README on GitHub.
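
A minimal settings.py sketch following the usual scrapy-redis setup (the Redis host and port are assumptions for a local instance):

# settings.py -- enable the scrapy-redis components
# Route all requests through the shared Redis-backed scheduler
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
# Deduplicate requests across all crawler processes via Redis
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
# Keep the request queue in Redis when a spider stops, so crawls can resume
SCHEDULER_PERSIST = True

# Connection to the shared Redis instance (assumed to be local here)
REDIS_HOST = 'localhost'
REDIS_PORT = 6379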

Then run the following two steps, substituting your own spider name (a spider sketch follows the commands):

  1. Run the spider:

     scrapy runspider myspider.py

  2. Push start URLs to redis:

     redis-cli lpush myspider:start_urls http://google.com
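
For reference, a minimal RedisSpider sketch based on the scrapy-redis README (the spider name and parse logic are placeholders):

from scrapy_redis.spiders import RedisSpider

class MySpider(RedisSpider):
    """Reads its start URLs from a Redis list instead of start_urls."""
    name = 'myspider'
    # The Redis list key fed by redis-cli lpush in step 2 above
    redis_key = 'myspider:start_urls'

    def parse(self, response):
        # Placeholder extraction; replace with real parsing logic
        yield {'url': response.url,
               'title': response.css('title::text').extract_first()}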





Reposted from blog.csdn.net/chasejava/article/details/80024698