[Elasticsearch]如何通过python操作ES数据库 pythonElasticsearch入门

[Elasticsearch]如何通过python操作ES数据库 python Elasticsearch

Elasticsearch基本介绍
Elasticsearch入门
说明

Elasticsearch基本介绍

ES是一个分布式文档储存中间件，存储的方式是已序列化的JSON文档的复杂数据结构。
使用倒排索引的数据结构，支持快速全文搜索。在倒排索引里列出了所有文档中出现的每一个唯一单词并分别标识了每个单词在哪一个文档中。
- 正向索引：文档->关键词
  - 例如，搜索ABC这一字段，方法：每一行的单词逐一扫描，扫描到ABC时提取它。
- 倒排索引：关键词->文档
  - 倒排索引表，表内的关键词对应一个倒排列表，列表内有包含该关键字的文档的DocID的集合。
采用RestfulAPI标准：通过http接口使用JSON格式进行操作数据
数据存储的最小单位是文档，本质上是JSON文本

图片转载，侵删
上图转载

Elasticsearch入门

安装与启动

python操作ES数据库

连接ES数据库

无用户名密码状态

from elasticsearch import Elasticsearch

es=Elasticsearch([{
    
    "host":"xxx.xxx.xxx.xxx","port":xxxx}])

有密码

es = Elasticsearch(['10.10.13.12'], http_auth=('xiao', '123456'), timeout=3600)

创建索引（ES中的索引即数据库）

# 创建索引（数据库）
es.indices.create(index="索引名字，字母小写")

已经存在该索引时会报错

in _raise_error
raise HTTP_EXCEPTIONS.get(status_code, TransportError)(
elasticsearch.exceptions.RequestError: RequestError(400, ‘resource_already_exists_exception’, ‘index [es_test/CvW-H_EpTK6YmcnQ7vk2Wg] already exists’

可以使用下面的语句来忽略上述错误

es.indices.create(index="es_test",ignore=400)

插入数据

单条数据

# 插入数据
body={
    
    'keyword':'测试',"content":"这是一个测试数据1"}
es.index(index='es_test',doc_type='_doc',body=body)

多条数据

#插入多条数据

doc=[{
    
    'index':{
    
    '_index':'es_zilongtest','_type':'_doc','_id':4}},
     {
    
    'keyword':'食物',"content":"我喜欢吃大白菜"},
     {
    
    'index': {
    
    '_index': 'es_zilongtest', '_type': '_doc', '_id': 5}},
     {
    
    'keyword': '食物', "content": "鸡胸肉很好吃"},
     {
    
    'index':{
    
    '_index':'es_zilongtest','_type':'_doc','_id':6}},
     {
    
    'keyword':'食物',"content":"小白菜也好吃"},
     ]
es.bulk(index='es_zilongtest',doc_type='_doc',body=doc)

查询数据

查询结果返回参数各字段含义

took
该命令请求花费了多长时间，单位：毫秒。

timed_out
搜索是否超时。

shards
搜索分片信息。

total
搜索分片总数。

successful
搜索成功的分片数量。

skipped
没有搜索的分片，跳过的分片。

failed
搜索失败的分片数量。

hits
搜索结果集。项目中，我们需要的一切数据都是从hits中获取。

total
返回多少条数据。

max_score
返回结果中，最大的匹配度分值。

hits
默认查询前十条数据，根据分值降序排序。

_index
索引库名称。

_type
类型名称。

_id
该条数据的id。

_score
关键字与该条数据的匹配度分值。

_source
索引库中类型，返回结果字段，不指定的话，默认全部显示出来。

参考资料ElasticSearch之查询返回结果各字段含义

最直接的查询方法

print(es.search(index='es_zilongtest'))

只需指定索引（数据库），会返回数据库中的信息

{'took': 4, 'timed_out': False, '_shards': {'total': 1, 'successful': 1, 'skipped': 0, 'failed': 0}, 'hits': {'total': {'value': 8, 'relation': 'eq'}, 'max_score': 1.0, 'hits': [{'_index': 'es_zilongtest', '_type': '_doc', '_id': 'mTZQK3wBr1SJ1UhpryaJ', '_score': 1.0, '_source': {'keyword': '测试', 'content': '这是一个测试数据'}}, {'_index': 'es_zilongtest', '_type': '_doc', '_id': 'mjZRK3wBr1SJ1UhpCSZj', '_score': 1.0, '_source': {'keyword': '测试', 'content': '这是一个测试数据1'}}, {'_index': 'es_zilongtest', '_type': '_doc', '_id': '1', '_score': 1.0, '_source': {'keyword': '动物', 'content': '大白把隶属家的小黄咬了'}}, {'_index': 'es_zilongtest', '_type': '_doc', '_id': '2', '_score': 1.0, '_source': {'keyword': '动物', 'content': '王博家里买了很多小鸡'}}, {'_index': 'es_zilongtest', '_type': '_doc', '_id': '3', '_score': 1.0, '_source': {'keyword': '动物', 'content': '王叔家的小白爱吃鸡胸肉'}}, {'_index': 'es_zilongtest', '_type': '_doc', '_id': '4', '_score': 1.0, '_source': {'keyword': '食物', 'content': '我喜欢吃大白菜'}}, {'_index': 'es_zilongtest', '_type': '_doc', '_id': '5', '_score': 1.0, '_source': {'keyword': '食物', 'content': '鸡胸肉很好吃'}}, {'_index': 'es_zilongtest', '_type': '_doc', '_id': '6', '_score': 1.0, '_source': {'keyword': '食物', 'content': '小白菜也好吃'}}]}}

这里的数据将用于后面的讲解，如有需要，请留意

常用参数

index - 索引名
q - 查询指定匹配使用Lucene查询语法
from_ - 查询起始点默认0
doc_type - 文档类型
size - 指定查询条数默认10
field - 指定字段逗号分隔
sort - 排序字段：asc/desc
body - 使用Query DSL
scroll - 滚动查询

用body指定条件

# 查询数据
body ={
    
    
     'from':0, #从0开始
    'size':10 #size可以在es.search中指定，也可以在此指定，默认是10
}

print(es.search(index='es_zilongtest',body=body))

# size的另一种指定方法
es.search(index='es_python', filter_path=filter_path, body=body, size=200)

如果觉得查出来的结果太复杂，可以设定过滤字段

# 有过滤字段查询数据
body ={
    
    
     'from':0, #从0开始
}
# 定义过滤字段，最终只显示此此段信息 hits.hits._source.写在前面 后面写你自己定义的字段名 我这里是keyword和content
filter_path=['hits.hits._source.keyword',  # keyword为第一个需要显示的字段
             'hits.hits._source.content']  # content为字段2
# print(es.search(index='es_zilongtest'))
print(es.search(index='es_zilongtest',filter_path=filter_path,body=body))

模糊查询

模糊查询需要用query命令查询方法match 注意只能查一个字段

body ={
    
    
     'from':0,
     'query':{
    
        # 查询命令
          'match':{
    
     # 查询方法：模糊查询
               'content':'小白菜'  #content为字段名称，match这个查询方法只支持查找一个字段
          }
     }
}

filter_path=['hits.hits._source.keyword',  # 字段1
             'hits.hits._source.content']  # 字段2
# print(es.search(index='es_zilongtest'))
print(es.search(index='es_zilongtest',filter_path=filter_path,body=body))

查询结果：

{‘hits’: {‘hits’: [{’_source’: {‘keyword’: ‘食物’, ‘content’: ‘小白菜也好吃’}}, {’_source’: {‘keyword’: ‘食物’, ‘content’: ‘我喜欢吃大白菜’}}, {’_source’: {‘keyword’: ‘动物’, ‘content’: ‘大白把隶属家的小黄咬了’}}, {’_source’: {‘keyword’: ‘动物’, ‘content’: ‘王叔家的小白爱吃鸡胸肉’}}, {’_source’: {‘keyword’: ‘动物’, ‘content’: ‘王博家里买了很多小鸡’}}]}}

可以看到 content中不仅出现了小白菜 还出现了大白菜 大白 小白等内容因为模糊查询把小白菜进行了拆分

如果不进行过滤，会看到更加详细的内容

{‘took’: 8, ‘timed_out’: False, ‘_shards’: {‘total’: 1, ‘successful’: 1, ‘skipped’: 0, ‘failed’: 0}, ‘hits’: {‘total’: {‘value’: 5, ‘relation’: ‘eq’}, ‘max_score’: 3.0320468, ‘hits’: [{’_index’: ‘es_zilongtest’, ‘_type’: ‘_doc’, ‘_id’: ‘6’, ‘_score’: 3.0320468, ‘_source’: {‘keyword’: ‘食物’, ‘content’: ‘小白菜也好吃’}}, {’_index’: ‘es_zilongtest’, ‘_type’: ‘_doc’, ‘_id’: ‘4’, ‘_score’: 2.1276836, ‘_source’: {‘keyword’: ‘食物’, ‘content’: ‘我喜欢吃大白菜’}}, {’_index’: ‘es_zilongtest’, ‘_type’: ‘_doc’, ‘_id’: ‘1’, ‘_score’: 1.2374083, ‘_source’: {‘keyword’: ‘动物’, ‘content’: ‘大白把隶属家的小黄咬了’}}, {’_index’: ‘es_zilongtest’, ‘_type’: ‘_doc’, ‘_id’: ‘3’, ‘_score’: 1.2374083, ‘_source’: {‘keyword’: ‘动物’, ‘content’: ‘王叔家的小白爱吃鸡胸肉’}}, {’_index’: ‘es_zilongtest’, ‘_type’: ‘_doc’, ‘_id’: ‘2’, ‘_score’: 0.6464764, ‘_source’: {‘keyword’: ‘动物’, ‘content’: ‘王博家里买了很多小鸡’}}]}}

其中score为关键字与该条数据的匹配度分值，默认降序排序

term 精确查询

#精确单值查询
body1={
    
    
     "query":{
    
    
          "terms":{
    
    
               "keyword.keyword":["食物","测试"] # 查询keyword="食物"或"测试"...的数据
          }
     }
}
print(es.search(index='es_zilongtest',body=body1))

注意这里的第一个keyword是我自己设定的字段名，第二个是接口要求的必须为keyword，所以我此处可以改成

#精确多值查询
body1={
    
    
     "query":{
    
    
          "terms":{
    
    
               "content.keyword":["小白菜","大白"] # 查询keyword="小白菜"或"大白"...的数据
          }
     }
}

这样搜索结果为空，因为并没有content是小白菜或大白（文中含有这个字段也不行，必须完全相同）

multi_match，多字段查询

# 查询多个字段中都包含指定内容的数据
body3 = {
    
    
    "query":{
    
    
        "multi_match":{
    
    
            "query":"小白菜",  # 指定查询内容，注意：会被分词
            "fields":["keyword", "content"]  # 指定字段
        }
    }
}
print(es.search(index='es_zilongtest',body=body3))

prefix，前缀查询

body3 = {
    
    
    "query":{
    
    
        "prefix":{
    
    
            "content.keyword":"小白菜",  # 查询前缀是指定字符串的数据
        }
    }
}
# 注：英文不需要加keyword
print(es.search(index='es_zilongtest',body=body3))

wildcard，通配符查询

body = {
    
    
    'query': {
    
    
        'wildcard': {
    
    
            'ziduan1.keyword': '?刘婵*'  # ?代表一个字符，*代表0个或多个字符
        }
    }
}
# 注：此方法只能查询单一格式的（都是英文字符串，或者都是汉语字符串）。两者混合不能查询出来。

regexp，正则匹配

body = {
    
    
    'query': {
    
    
        'regexp': {
    
    
            'ziduan1': 'W[0-9].+'   # 使用正则表达式查询
        }
    }
}
1234567

bool，多条件查询

# must：[] 各条件之间是and的关系
body = {
    
    
        "query":{
    
    
            "bool":{
    
    
                'must': [{
    
    "term":{
    
    'ziduan1.keyword': '我爱你中国'}},
                         {
    
    'terms': {
    
    'ziduan2': ['I love', 'China']}}]
            }
        }
    }

# should: [] 各条件之间是or的关系
body = {
    
    
        "query":{
    
    
            "bool":{
    
    
                'should': [{
    
    "term":{
    
    'ziduan1.keyword': '我爱你中国'}},
                         {
    
    'terms': {
    
    'ziduan2': ['I love', 'China']}}]
            }
        }
    }

# must_not：[]各条件都不满足
body = {
    
    
        "query":{
    
    
            "bool":{
    
    
                'must_not': [{
    
    "term":{
    
    'ziduan1.keyword': '我爱你中国'}},
                         {
    
    'terms': {
    
    'ziduan2': ['I love', 'China']}}]
            }
        }
    }



# bool嵌套bool
# ziduan1、ziduan2条件必须满足的前提下，ziduan3、ziduan4满足一个即可
body = {
    
    
    "query":{
    
    
        "bool":{
    
    
            "must":[{
    
    "term":{
    
    "ziduan1":"China"}},  #  多个条件并列  ，注意：must后面是[{}, {}],[]里面的每个条件外面有个{}
                    {
    
    "term":{
    
    "ziduan2.keyword": '我爱你中国'}},
                    {
    
    'bool': {
    
    
                        'should': [
                            {
    
    'term': {
    
    'ziduan3': 'Love'}},
                            {
    
    'term': {
    
    'ziduan4': 'Like'}}
                        ]
                    }}
            ]
        }
    }
}

说明

本文在ES数据库入门之python操作ES数据库这部分内容主要参考自参考资料1，并在其中增加了基于自身实践测试的感想，自认为对于新手可以少绕一些弯。

对于查询结果字段不理解的可以看查询数据中的查询结果返回参数各字段含义一节

参考资料:

1 python操作ES数据库

2 下一阶段阅读的内容，掌握更加详细的 Python Elasticsearch api