[Elasticsearch] How to operate an ES database with Python: getting started with Python Elasticsearch

Basic introduction to Elasticsearch
Getting started with Elasticsearch
Notes

Basic introduction to Elasticsearch

Elasticsearch (ES) is a distributed document store: it stores complex data structures as serialized JSON documents.
It uses an inverted index to support fast full-text search. An inverted index lists every unique word that appears in any document and identifies all the documents each word occurs in.
- Forward index: document -> keywords
  - For example, to search for the term ABC, the words of every document are scanned one by one, and a document is returned once ABC is found in it.
- Inverted index: keyword -> documents
  - Each keyword in the index table maps to a postings list, which holds the DocIDs of the documents that contain the keyword.
ES follows the RESTful API convention: data is manipulated as JSON over an HTTP interface.
The smallest unit of storage is a document, which is essentially JSON text.
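
The keyword -> document mapping described above can be sketched in a few lines of plain Python. This is only an illustration of the data structure, not how ES implements it internally; the sample documents and the function name build_inverted_index are made up for the example.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each word to the sorted list of ids of the documents containing it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            postings[word].add(doc_id)
    return {word: sorted(ids) for word, ids in postings.items()}


# Hypothetical sample documents, keyed by DocID
docs = {
    1: "quick brown fox",
    2: "lazy brown dog",
    3: "quick red fox",
}

index = build_inverted_index(docs)
print(index["brown"])  # [1, 2]
print(index["quick"])  # [1, 3]
```

Looking up a word is now a single dictionary access, whereas the forward-index approach has to scan every document.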


Getting started with Elasticsearch

Installation and startup

Operating an ES database from Python

Connect to ES database

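Without a username and password

from elasticsearch import Elasticsearch

# Replace the placeholders with the host and port of your ES node
es = Elasticsearch([{"host": "xxx.xxx.xxx.xxx", "port": xxxx}])

With a username and password

# http_auth takes a (username, password) tuple; timeout is in seconds
es = Elasticsearch(['10.10.13.12'], http_auth=('xiao', '123456'), timeout=3600)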

Create an index (an index in ES plays roughly the role of a database)

# Create an index (database); index names must be lowercase
es.indices.create(index="index_name")

An error is raised if the index already exists:

in _raise_error
raise HTTP_EXCEPTIONS.get(status_code, TransportError)(
elasticsearch.exceptions.RequestError: RequestError(400, 'resource_already_exists_exception', 'index [es_test/CvW-H_EpTK6YmcnQ7vk2Wg] already exists')

Passing ignore=400 tells the client to ignore this error:

es.indices.create(index="es_test",ignore=400)
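
An alternative to ignoring the 400 is to check whether the index exists before creating it. The sketch below assumes es is a client object like the one created above; ensure_index is a helper name invented for this example. Another client could still create the index between the check and the create call, so treat this as a convenience rather than a guarantee.

```python
def ensure_index(es, name):
    """Create the index only if it is missing; return True if it was created."""
    if es.indices.exists(index=name):
        return False
    es.indices.create(index=name)
    return True
```

Calling ensure_index(es, "es_test") twice should create the index on the first call and do nothing on the second.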

Insert data

A single document

# Insert one document
body = {'keyword': '测试', "content": "这是一个测试数据1"}
es.index(index='es_test', doc_type='_doc', body=body)

Multiple documents

# Insert several documents: each action line is followed by the document it applies to
doc = [
    {'index': {'_index': 'es_zilongtest', '_type': '_doc', '_id': 4}},
    {'keyword': '食物', "content": "我喜欢吃大白菜"},
    {'index': {'_index': 'es_zilongtest', '_type': '_doc', '_id': 5}},
    {'keyword': '食物', "content": "鸡胸肉很好吃"},
    {'index': {'_index': 'es_zilongtest', '_type': '_doc', '_id': 6}},
    {'keyword': '食物', "content": "小白菜也好吃"},
]
es.bulk(index='es_zilongtest', doc_type='_doc', body=doc)
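
Writing the alternating action/document pairs by hand gets tedious for larger batches. A small helper can generate the same list shape as the example above; build_bulk_body and its parameters are names chosen for this sketch, and the '_type' entry simply mirrors the article's code.

```python
def build_bulk_body(index, docs, doc_type='_doc'):
    """Turn {doc_id: source} into the alternating action/source list used by es.bulk."""
    body = []
    for doc_id, source in docs.items():
        # Action line: says where the following document goes
        body.append({'index': {'_index': index, '_type': doc_type, '_id': doc_id}})
        # Source line: the document itself
        body.append(source)
    return body


docs = {
    4: {'keyword': '食物', 'content': '我喜欢吃大白菜'},
    5: {'keyword': '食物', 'content': '鸡胸肉很好吃'},
}
bulk_body = build_bulk_body('es_zilongtest', docs)
print(len(bulk_body))  # 4
```

The resulting list can be passed as body to es.bulk in the same way as the hand-written one.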

Query data

Meaning of each field in the query response

- took: how long the request took, in milliseconds.
- timed_out: whether the search timed out.
- _shards: information about the shards that were searched.
  - total: total number of shards searched.
  - successful: number of shards searched successfully.
  - skipped: number of shards that were skipped.
  - failed: number of shards on which the search failed.
- hits: the search result set. In a project, all the data we need comes from hits.
  - total: number of documents that matched the query.
  - max_score: the highest matching score among the results.
  - hits: the matching documents; by default the top ten, sorted by score in descending order.
    - _index: name of the index.
    - _type: name of the type.
    - _id: id of the document.
    - _score: matching score between the query and this document.
    - _source: the fields of the stored document; all fields are returned unless specified otherwise.

Reference: The meaning of each field in the query return result of ElasticSearch

The simplest query

print(es.search(index='es_zilongtest'))

Specifying only the index (database) returns the documents stored in it:

{'took': 4, 'timed_out': False, '_shards': {'total': 1, 'successful': 1, 'skipped': 0, 'failed': 0}, 'hits': {'total': {'value': 8, 'relation': 'eq'}, 'max_score': 1.0, 'hits': [{'_index': 'es_zilongtest', '_type': '_doc', '_id': 'mTZQK3wBr1SJ1UhpryaJ', '_score': 1.0, '_source': {'keyword': '测试', 'content': '这是一个测试数据'}}, {'_index': 'es_zilongtest', '_type': '_doc', '_id': 'mjZRK3wBr1SJ1UhpCSZj', '_score': 1.0, '_source': {'keyword': '测试', 'content': '这是一个测试数据1'}}, {'_index': 'es_zilongtest', '_type': '_doc', '_id': '1', '_score': 1.0, '_source': {'keyword': '动物', 'content': '大白把隶属家的小黄咬了'}}, {'_index': 'es_zilongtest', '_type': '_doc', '_id': '2', '_score': 1.0, '_source': {'keyword': '动物', 'content': '王博家里买了很多小鸡'}}, {'_index': 'es_zilongtest', '_type': '_doc', '_id': '3', '_score': 1.0, '_source': {'keyword': '动物', 'content': '王叔家的小白爱吃鸡胸肉'}}, {'_index': 'es_zilongtest', '_type': '_doc', '_id': '4', '_score': 1.0, '_source': {'keyword': '食物', 'content': '我喜欢吃大白菜'}}, {'_index': 'es_zilongtest', '_type': '_doc', '_id': '5', '_score': 1.0, '_source': {'keyword': '食物', 'content': '鸡胸肉很好吃'}}, {'_index': 'es_zilongtest', '_type': '_doc', '_id': '6', '_score': 1.0, '_source': {'keyword': '食物', 'content': '小白菜也好吃'}}]}}

The later examples use this data, so refer back to it as needed.
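
In practice you usually want only the documents out of such a response. The following sketch pulls the match count and the _source dicts from a response shaped like the one above; the sample is cut down to two hits (so its total is 2 rather than 8), and extract_sources is a name invented here.

```python
def extract_sources(response):
    """Return (total, [_source, ...]) from a search response dict."""
    hits = response.get('hits', {})
    total = hits.get('total', {})
    # ES 7.x reports total as {'value': n, 'relation': ...}; older versions use a plain int
    count = total.get('value', 0) if isinstance(total, dict) else total
    return count, [hit['_source'] for hit in hits.get('hits', [])]


# Shortened copy of the response shown above (two hits only)
sample = {
    'took': 4, 'timed_out': False,
    '_shards': {'total': 1, 'successful': 1, 'skipped': 0, 'failed': 0},
    'hits': {
        'total': {'value': 2, 'relation': 'eq'},
        'max_score': 1.0,
        'hits': [
            {'_index': 'es_zilongtest', '_type': '_doc', '_id': '4', '_score': 1.0,
             '_source': {'keyword': '食物', 'content': '我喜欢吃大白菜'}},
            {'_index': 'es_zilongtest', '_type': '_doc', '_id': '5', '_score': 1.0,
             '_source': {'keyword': '食物', 'content': '鸡胸肉很好吃'}},
        ],
    },
}

count, sources = extract_sources(sample)
print(count)                  # 2
print(sources[0]['content'])  # 我喜欢吃大白菜
```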

Common parameters

index - index name
q - query string in Lucene query syntax
from_ - offset of the first result to return, default 0
doc_type - document type
size - number of results to return, default 10
field - comma-separated list of fields
sort - comma-separated field:direction pairs, where direction is asc or desc
body - the search definition written in Query DSL
scroll - how long to keep the search context alive for a scroll query

Use body to specify conditions

# Query data
body = {
    'from': 0,   # start from offset 0
    'size': 10,  # size can be set here or passed to es.search; the default is 10
}

print(es.search(index='es_zilongtest', body=body))

# Alternative: pass size as an argument to es.search
es.search(index='es_zilongtest', body={'from': 0}, size=200)
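
Together, from and size give simple pagination. A minimal sketch, assuming 1-based page numbers (page_body is a hypothetical helper, not part of the client):

```python
def page_body(page, per_page=10):
    """Build the from/size part of a search body for a 1-based page number."""
    if page < 1 or per_page < 1:
        raise ValueError("page and per_page must be positive")
    return {'from': (page - 1) * per_page, 'size': per_page}


print(page_body(1))      # {'from': 0, 'size': 10}
print(page_body(3, 20))  # {'from': 40, 'size': 20}
```

The returned dict can be merged into a larger body that also carries a query. By default ES rejects requests where from + size exceeds 10000 (the index.max_result_window setting), which is where the scroll parameter listed above becomes useful.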

If the results are too verbose, you can restrict the response to the fields you need with filter_path

# Query with a response filter
body = {
    'from': 0,  # start from offset 0
}
# filter_path limits the response to the listed paths. Write hits.hits._source.
# first, followed by your own field name (here: keyword and content).
filter_path = ['hits.hits._source.keyword',  # first field to show
               'hits.hits._source.content']  # second field to show
print(es.search(index='es_zilongtest', filter_path=filter_path, body=body))

Fuzzy query (match)

A fuzzy search uses the match method inside the query clause. match analyzes the search text before matching, and it accepts only one field.

body = {
    'from': 0,
    'query': {             # the query clause
        'match': {         # query method: analyzed ("fuzzy") matching
            'content': '小白菜'  # content is the field name; match supports only one field
        }
    }
}

filter_path = ['hits.hits._source.keyword',  # field 1
               'hits.hits._source.content']  # field 2
print(es.search(index='es_zilongtest', filter_path=filter_path, body=body))

Search result:

{'hits': {'hits': [{'_source': {'keyword': '食物', 'content': '小白菜也好吃'}}, {'_source': {'keyword': '食物', 'content': '我喜欢吃大白菜'}}, {'_source': {'keyword': '动物', 'content': '大白把隶属家的小黄咬了'}}, {'_source': {'keyword': '动物', 'content': '王叔家的小白爱吃鸡胸肉'}}, {'_source': {'keyword': '动物', 'content': '王博家里买了很多小鸡'}}]}}

The results include not only the document whose content contains 小白菜, but also documents containing 大白菜, 大白, 小白, or just 小. This is because the match query splits 小白菜 into separate tokens and returns every document that contains at least one of them.

Without the filter, the response contains more detail:

{'took': 8, 'timed_out': False, '_shards': {'total': 1, 'successful': 1, 'skipped': 0, 'failed': 0}, 'hits': {'total': {'value': 5, 'relation': 'eq'}, 'max_score': 3.0320468, 'hits': [{'_index': 'es_zilongtest', '_type': '_doc', '_id': '6', '_score': 3.0320468, '_source': {'keyword': '食物', 'content': '小白菜也好吃'}}, {'_index': 'es_zilongtest', '_type': '_doc', '_id': '4', '_score': 2.1276836, '_source': {'keyword': '食物', 'content': '我喜欢吃大白菜'}}, {'_index': 'es_zilongtest', '_type': '_doc', '_id': '1', '_score': 1.2374083, '_source': {'keyword': '动物', 'content': '大白把隶属家的小黄咬了'}}, {'_index': 'es_zilongtest', '_type': '_doc', '_id': '3', '_score': 1.2374083, '_source': {'keyword': '动物', 'content': '王叔家的小白爱吃鸡胸肉'}}, {'_index': 'es_zilongtest', '_type': '_doc', '_id': '2', '_score': 0.6464764, '_source': {'keyword': '动物', 'content': '王博家里买了很多小鸡'}}]}}

Here _score is the matching score between the search text and each document; results are sorted by it in descending order by default.
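
To see why five documents come back, the splitting can be imitated locally. The sketch assumes the content field uses the default standard analyzer, which indexes each Chinese character as its own token (the article does not show its mapping, so this is an assumption). It only counts shared characters; the real _score uses a relevance formula that this sketch does not reproduce.

```python
def match_any_char(query, docs):
    """Return {doc_id: number of distinct query characters found} for matching docs."""
    tokens = set(query)
    overlap = {doc_id: len(tokens & set(text)) for doc_id, text in docs.items()}
    # A document matches as soon as it shares at least one token with the query
    return {doc_id: n for doc_id, n in overlap.items() if n > 0}


# The content values of documents 1-6 from the index shown earlier
docs = {
    1: '大白把隶属家的小黄咬了',
    2: '王博家里买了很多小鸡',
    3: '王叔家的小白爱吃鸡胸肉',
    4: '我喜欢吃大白菜',
    5: '鸡胸肉很好吃',
    6: '小白菜也好吃',
}

print(match_any_char('小白菜', docs))  # {1: 2, 2: 1, 3: 2, 4: 2, 6: 3}
```

Document 5 shares no character with 小白菜 and is the only one left out, which agrees with the five hits returned by ES.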

term: exact-match query

# Exact-match query; terms accepts a list of values
body1 = {
    "query": {
        "terms": {
            "keyword.keyword": ["食物", "测试"]  # documents whose keyword is "食物" or "测试"
        }
    }
}
print(es.search(index='es_zilongtest', body=body1))

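Note that the first keyword is the field name I chose myself, while the second .keyword is required: it addresses the un-analyzed keyword sub-field that ES's default dynamic mapping adds to text fields. The same pattern applied to the content field looks like this:

# Exact-match query on several values of the content field
body1 = {
    "query": {
        "terms": {
            "content.keyword": ["小白菜", "大白"]  # documents whose content is exactly "小白菜" or "大白"
        }
    }
}
print(es.search(index='es_zilongtest', body=body1))

This search returns nothing, because no document's content is exactly 小白菜 or 大白. Containing the text is not enough; the whole field value must be identical.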
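
The difference between the two terms queries can be imitated with plain equality checks. This is a local stand-in for what a terms query on a keyword sub-field does, not a call into ES; terms_filter and the three sample documents are chosen for illustration.

```python
def terms_filter(field, values, docs):
    """Keep documents whose whole field value equals one of the given values."""
    wanted = set(values)
    return [doc for doc in docs if doc.get(field) in wanted]


docs = [
    {'keyword': '食物', 'content': '我喜欢吃大白菜'},
    {'keyword': '食物', 'content': '小白菜也好吃'},
    {'keyword': '动物', 'content': '王博家里买了很多小鸡'},
]

print(len(terms_filter('keyword', ['食物', '测试'], docs)))   # 2
print(len(terms_filter('content', ['小白菜', '大白'], docs)))  # 0
```

The second call finds nothing even though one content value contains 小白菜, because the comparison is against the entire value.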

multi_match, multi-field query

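# Search several fields for the given text; a match in any listed field is enough
body3 = {
    "query": {
        "multi_match": {
            "query": "小白菜",  # the search text; note that it is split into tokens
            "fields": ["keyword", "content"]  # the fields to search
        }
    }
}
print(es.search(index='es_zilongtest', body=body3))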

prefix, prefix query

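body3 = {
    "query": {
        "prefix": {
            "content.keyword": "小白菜",  # documents whose value starts with this string
        }
    }
}
# Note: for English values the author queries the field directly, without .keyword
print(es.search(index='es_zilongtest', body=body3))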

wildcard, wildcard query

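body = {
    'query': {
        'wildcard': {
            'ziduan1.keyword': '?刘婵*'  # ? stands for exactly one character, * for zero or more
        }
    }
}
# Note: in the author's tests this only matched values written in a single script
# (all English or all Chinese); values mixing the two were not found.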
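
The meaning of ? and * can be tried out locally with Python's fnmatch module, which uses the same two wildcards. It is only an approximation of the ES behaviour (fnmatch additionally treats [...] as a character set), and the sample values are invented.

```python
from fnmatch import fnmatchcase


def wildcard_matches(value, pattern):
    """Check a whole value against a pattern where ? is one character and * is any run."""
    return fnmatchcase(value, pattern)


pattern = '?刘婵*'
print(wildcard_matches('大刘婵', pattern))      # True: one character, then 刘婵, then nothing
print(wildcard_matches('小刘婵来了', pattern))  # True: * absorbs the trailing characters
print(wildcard_matches('刘婵', pattern))        # False: ? needs a character before 刘婵
```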

regexp, regular expression matching

body = {
    'query': {
        'regexp': {
            'ziduan1': 'W[0-9].+'  # query with a regular expression
        }
    }
}
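
A regexp query has to match an entire indexed term, not just part of it, so in Python terms it behaves like re.fullmatch rather than re.search. The sketch below illustrates this for the pattern above; Lucene's regular expression syntax differs from Python's in details, so only simple patterns like this one carry over. Also keep in mind that terms of an analyzed text field are normally lowercased, so a pattern starting with an uppercase W may only match on a keyword field.

```python
import re


def regexp_matches(value, pattern):
    """Approximate the regexp query: the pattern must match the entire value."""
    return re.fullmatch(pattern, value) is not None


print(regexp_matches('W1abc', 'W[0-9].+'))   # True
print(regexp_matches('W1', 'W[0-9].+'))      # False: .+ needs at least one more character
print(regexp_matches('xW1abc', 'W[0-9].+'))  # False: the value must start with W
```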

bool, multi-condition query

# must: [] all conditions in the list must hold (AND)
body = {
    "query": {
        "bool": {
            'must': [{"term": {'ziduan1.keyword': '我爱你中国'}},
                     {'terms': {'ziduan2': ['I love', 'China']}}]
        }
    }
}

# should: [] at least one condition in the list must hold (OR)
body = {
    "query": {
        "bool": {
            'should': [{"term": {'ziduan1.keyword': '我爱你中国'}},
                       {'terms': {'ziduan2': ['I love', 'China']}}]
        }
    }
}

# must_not: [] none of the conditions in the list may hold
body = {
    "query": {
        "bool": {
            'must_not': [{"term": {'ziduan1.keyword': '我爱你中国'}},
                         {'terms': {'ziduan2': ['I love', 'China']}}]
        }
    }
}



# bool nested inside bool
# The ziduan1 and ziduan2 conditions must both hold; of ziduan3 and ziduan4, one is enough
body = {
    "query": {
        "bool": {
            # Conditions are listed side by side: must takes a list [{...}, {...}],
            # and each condition is wrapped in its own {}
            "must": [
                {"term": {"ziduan1": "China"}},
                {"term": {"ziduan2.keyword": '我爱你中国'}},
                {'bool': {
                    'should': [
                        {'term': {'ziduan3': 'Love'}},
                        {'term': {'ziduan4': 'Like'}}
                    ]
                }}
            ]
        }
    }
}
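
Nested bool bodies are easy to get wrong by hand, so a pair of small builder functions can help. The sketch below rebuilds the nested query above; bool_query and term are helper names invented for this example. The nesting matters because a should list placed directly next to must is optional by default and only influences scoring, whereas a bool containing only should requires at least one of its clauses to match.

```python
def term(field, value):
    """Build a single term clause."""
    return {'term': {field: value}}


def bool_query(must=None, should=None, must_not=None):
    """Build a bool query from lists of clauses, leaving out empty sections."""
    sections = {'must': must, 'should': should, 'must_not': must_not}
    return {'bool': {name: clauses for name, clauses in sections.items() if clauses}}


body = {
    'query': bool_query(must=[
        term('ziduan1', 'China'),
        term('ziduan2.keyword', '我爱你中国'),
        # Nested bool: at least one of these two conditions has to hold
        bool_query(should=[term('ziduan3', 'Love'), term('ziduan4', 'Like')]),
    ])
}
print(body['query']['bool']['must'][2])
```

The resulting body has the same structure as the hand-written one and can be passed to es.search unchanged.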

Notes

The part of this article on operating an ES database from Python mainly follows reference 1, supplemented with observations from my own hands-on tests. I hope it saves newcomers some detours.

If the fields of a query result are unclear, see the description of each response field in the section on querying data.

References:

1 Python operates ES database

2 Further reading: the more detailed Python Elasticsearch API