ElasticSearch Quick Start Tips

1. Preface

I use ElasticSearch at work. It is a full-text search engine that can store, search, and analyze massive amounts of data quickly, and it is widely used at major companies. This article is my quick-start notes on ElasticSearch. My approach is to learn how to use it from a few materials first and get hands-on; if I need the theory later, it will be quicker to pick up.

The notes below cover installation, basic concepts, and using ElasticSearch through Postman and the Python API.

2. The core of the Elastic Stack

The Elastic Stack, which includes ElasticSearch, Kibana, Beats, and Logstash (also known as the ELK Stack), can securely and reliably ingest data from any source in any format, and then search, analyze, and visualize it in real time.

ElasticSearch (ES) is an open-source, highly scalable, distributed full-text search engine and the core of the entire Elastic Stack. It can store and retrieve data in near real time, and it scales out to hundreds of servers to handle PB-scale data.

Official address of ElasticSearch: https://www.elastic.co/cn/

3. ElasticSearch installation

I won't go into the details of installing ElasticSearch here; online tutorials cover it. Some of them require configuring an environment, but as a beginner I wanted to get going first, so I simply pulled an ElasticSearch image and ran a container to play with. Here are the Docker commands:

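docker pull elasticsearch:7.6.2   # stores and retrieves data
docker pull kibana:7.6.2          # visualizes and queries the data


# Create the working directories (under /home/wuzhongqiang)
mkdir -p ./mydata/elasticsearch/config
mkdir -p ./mydata/elasticsearch/data
echo "http.host: 0.0.0.0" >> ./mydata/elasticsearch/config/elasticsearch.yml

# Create a network so that ES and Kibana can talk to each other later
docker network create es_net

# /wuzhongqiang/mydata/elasticsearch/config
# -p 9200:9200  maps the container port to the Linux host; 9200 is the REST API port that clients send requests to
# -p 9300:9300  9300 is the port ES nodes use to talk to each other in a distributed cluster
# -e "discovery.type=single-node"  runs ES in single-node mode
# -e ES_JAVA_OPTS="-Xms64m -Xmx128m"  if unset, ES may take up all available memory; here the heap starts at 64 MB and is capped at 128 MB (production typically uses about 32 GB)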
docker run --name elasticsearch --network es_net -p 9200:9200 -p 9300:9300 \
-e "discovery.type=single-node" \
-e ES_JAVA_OPTS="-Xms64m -Xmx128m" \
-v /home/wuzhongqiang/mydata/elasticsearch/config/elasticsearch.yml:/usr/share/elasticsearch/config/elasticsearch.yml \
-v /home/wuzhongqiang/mydata/elasticsearch/data:/usr/share/elasticsearch/data \
-v /home/wuzhongqiang/mydata/elasticsearch/plugins:/usr/share/elasticsearch/plugins \
-d elasticsearch:7.6.2

Next, download Postman (www.getpostman.com), an API debugging tool for building and sending Web API and HTTP requests.

It can send any type of HTTP request (GET, HEAD, POST, PUT, ...) with any type of request body, not just form submissions.

4. ElasticSearch data format

ElasticSearch is a document-oriented database: each record is a document. Compared with a relational database such as MySQL, an Index in ES corresponds to a database, a Type to a table, and a Document to a row of a table.

Basic concepts in ElasticSearch:

  • Node and Cluster: ElasticSearch is essentially a distributed database in which multiple servers work together, and each server can run multiple ElasticSearch instances. A single instance is called a node (Node); a group of nodes forms a cluster (Cluster).
  • Index: The top-level unit of data management in ElasticSearch, equivalent to a database in MySQL, MongoDB, etc. ES indexes every field and, after processing, writes the result to an inverted index; searches run directly against that index. Note: index names must be lowercase.
  • Document: A single record in an Index; many Documents make up an Index. Documents are expressed as JSON. Documents in the same Index are not required to share a structure (schema), but keeping the structure consistent helps search efficiency.
  • Type: Documents can be grouped. For example, a weather index can be grouped by city (Beijing and Shanghai) or by climate (sunny and rainy). Such a grouping is called a Type: a virtual, logical grouping used to filter Documents, similar to a table in MySQL or a Collection in MongoDB. Different Types should have a similar structure (schema); for example, the id field cannot be a string in one group and a number in another. This is a difference from tables in relational databases. Data of completely different kinds (such as products and logs) should be stored as two Indexes, not as two Types in one Index (although the latter is possible). Note: the Type concept has been phased out. In version 6.x an index can contain only one type, and in 7.x types are deprecated (they are removed entirely in 8.x).
  • Field: Each Document is a JSON structure made up of many fields, each with a corresponding value; this is analogous to the columns of a MySQL table.

I covered forward and inverted indexes in an earlier article. In short:

Forward index: article id -> article content -> article keywords

Inverted index: article keywords -> article id -> article content
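
To make the two directions concrete, here is a minimal sketch in plain Python. It is a toy, not how ES implements its index: the sample articles and the whitespace tokenizer are assumptions made up for illustration.

```python
from collections import defaultdict

# Hypothetical sample data: article id -> article content (the forward direction).
articles = {
    1: "elasticsearch is a search engine",
    2: "kibana visualizes search results",
}

def build_inverted_index(docs):
    """Map each keyword to the sorted list of article ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.split():  # naive whitespace tokenizer
            index[word].add(doc_id)
    return {word: sorted(ids) for word, ids in index.items()}

inverted = build_inverted_index(articles)
print(inverted["search"])  # ids of the articles containing "search"
```

A lookup by keyword is now a single dictionary access instead of a scan over every article, which is the reason search engines store the inverted direction.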

5. ES - Basic Operations (Postman Demo)

5.1 ES - Index Operations

5.1.1 Index Creation

Use Postman to send a PUT request to http://xxx.xxx.xxx.xx:9200/shopping to create an index named shopping in ElasticSearch.

Scenario description:

  • If you click Send again to repeat the PUT request, ES reports that the shopping index already exists. PUT is idempotent, so repeating it cannot create a second index.
  • If you change the method from PUT to POST, the request is rejected. POST is not idempotent, so repeated requests could produce different results, which is not allowed when creating an index.

5.1.2 Index query and deletion

To get the details of an index, keep the index URL and change the request method to GET.

To list all indexes, keep the GET method and change the URL to the index listing endpoint (_cat/indices?v).

To delete an index, send a DELETE request to that index's URL.
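
The same index requests can be sketched in Python with only the standard library. This is an illustration, not part of the original Postman walkthrough: BASE_URL is an assumed local address, index_request is a hypothetical helper, and the requests are built but not sent.

```python
import urllib.request

BASE_URL = "http://localhost:9200"  # assumption: a local single-node ES

def index_request(method, path):
    """Build (but do not send) an HTTP request for an index-level operation."""
    return urllib.request.Request(f"{BASE_URL}/{path}", method=method)

create_req = index_request("PUT", "shopping")      # create the index
info_req = index_request("GET", "shopping")        # inspect one index
list_req = index_request("GET", "_cat/indices?v")  # list all indexes
delete_req = index_request("DELETE", "shopping")   # delete the index

for req in (create_req, info_req, list_req, delete_req):
    print(req.get_method(), req.full_url)
    # To actually send a request: urllib.request.urlopen(req).read()
```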

5.2 ES - Document Operations

5.2.1 Document creation

With the index created, the next step is to create documents, that is, to add data. A document is comparable to a row of table data in a relational database, and it is submitted as JSON.

In Postman, send a POST request to http://xxx.xxx.xxx.xx:9200/shopping/_doc with the document as a JSON body. The _doc segment in the URL indicates document data.

Only POST works here, not PUT. After clicking Send, the response contains a generated id that uniquely identifies the document. Clicking Send repeatedly produces a different random id each time, so this insert request is not idempotent, whereas PUT must be idempotent.

One problem remains. The id uniquely identifies the document and we will use it to operate on the document later, but a randomly generated id is long and hard to remember. To choose the id yourself, append it to the URL: in http://xxx.xxx.xxx.xx:9200/shopping/_doc/1001, the trailing 1001 is a custom id. Once set, it does not change, and you can use it to access the document. Because sending the same request with the same id repeatedly leaves the same document in place, the request is idempotent, so PUT can be used here as well.
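
The POST versus PUT rule can be captured in a small helper. The sketch below only builds the requests; BASE_URL, doc_request, and the sample document are assumptions for illustration.

```python
import json
import urllib.request

BASE_URL = "http://localhost:9200"  # assumption: a local single-node ES

def doc_request(index, doc, doc_id=None):
    """POST when ES should generate the id, PUT when the caller chooses it."""
    if doc_id is None:
        method, url = "POST", f"{BASE_URL}/{index}/_doc"
    else:
        method, url = "PUT", f"{BASE_URL}/{index}/_doc/{doc_id}"
    return urllib.request.Request(
        url,
        data=json.dumps(doc).encode("utf-8"),
        method=method,
        headers={"Content-Type": "application/json"},
    )

phone = {"title": "sample phone", "category": "小米", "price": 1999}  # hypothetical document
auto_id_req = doc_request("shopping", phone)
fixed_id_req = doc_request("shopping", phone, doc_id=1001)
print(auto_id_req.get_method(), auto_id_req.full_url)
print(fixed_id_req.get_method(), fixed_id_req.full_url)
```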

5.2.2 Document query

To fetch a single document, put its id in the URL (for example .../shopping/_doc/1001) and send a GET request.

The id 1001 works like a primary key, and the result resembles a primary-key lookup.

GET with an id returns one document from ElasticSearch. To fetch all documents under an index, send a GET request to the index's _search endpoint (.../shopping/_search).

5.2.3 Document modification

To modify a document after it has been created, resend the complete document to the document's URL.
This is a full update: the stored document is overwritten completely, no matter how many times the request is sent.

To change a single field instead of overwriting everything, use a partial update. Since the result of each update is not necessarily the same, this is not an idempotent operation, so use POST.

To delete a document, send a DELETE request to the document's URL.
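
The difference between a full and a partial update can be illustrated locally with plain dictionaries. This is a simplified sketch: the stored document and helper names are made up, and the merge here is shallow, while ES merges nested objects recursively. In ES 7.x the partial update is sent to the _update endpoint with the changes wrapped under a "doc" key.

```python
def full_update(stored, new_doc):
    """Overwrite: the stored document is replaced entirely by the new one."""
    return dict(new_doc)

def partial_update(stored, changes):
    """Partial update: only the given fields change (shallow merge in this sketch)."""
    merged = dict(stored)
    merged.update(changes)
    return merged

def partial_update_body(changes):
    """Request body for a partial update: changes go under the "doc" key."""
    return {"doc": changes}

stored = {"title": "sample phone", "category": "小米", "price": 1999}  # hypothetical
print(full_update(stored, {"title": "new phone"}))
print(partial_update(stored, {"price": 2999}))
print(partial_update_body({"price": 2999}))
```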

5.3 Complex query operations

5.3.1 Attribute value filtering

The first way to filter by a specific attribute value is to put the condition in the URL's query string (the q parameter of _search).

However, writing query conditions in the URL is inelegant, so the second way is generally recommended: put the query in the request body.

Adjust the body to suit different needs (the # comments in the examples below are annotations only; remove them before sending, since JSON has no comment syntax):

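{
    "query": {
        # filter condition; to fetch everything use "match_all": {}
        "match": {
            "category": "小米"
        }
    },

    # pagination (these keys sit next to "query", not inside it)
    "from": 0,     # offset of the first hit; for page n use (n - 1) * size
    "size": 2,     # hits per page

    # return only some fields instead of the whole document, here just title
    "_source": ["title"],

    # sorting
    "sort": {
        # field to sort by
        "price": {
            # ascending or descending
            "order": "desc"
        }
    }
}

Adjust these settings to your own needs.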
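
When the body is built from code, the paging formula is easy to wrap in a helper. The function below is a hypothetical sketch (search_body and its parameters are my own names) that produces a body of the same shape as the one above.

```python
def search_body(field, value, page=1, size=2, fields=None,
                sort_field=None, descending=True):
    """Build a match query with paging, field selection, and sorting."""
    if page < 1 or size < 1:
        raise ValueError("page and size must be positive")
    body = {
        "query": {"match": {field: value}},
        "from": (page - 1) * size,  # offset of the first hit on this page
        "size": size,
    }
    if fields is not None:
        body["_source"] = list(fields)
    if sort_field is not None:
        body["sort"] = {sort_field: {"order": "desc" if descending else "asc"}}
    return body

body = search_body("category", "小米", page=3, size=2,
                   fields=["title"], sort_field="price")
print(body)
```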

5.3.2 Multi-condition query

Combining several conditions works much like AND in SQL:

{
    "query": {
        "bool": {
            # all conditions must hold (AND)
            "must": [
                {
                    "match": {
                        "category": "小米"
                    }
                },
                {
                    "match": {
                        "price": 1999
                    }
                }
                # ... further match clauses
            ]
        }
    }
}

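To get the equivalent of OR in SQL:

{
    "query": {
        "bool": {
            # at least one condition must hold (OR)
            "should": [
                {
                    "match": {
                        "category": "小米"
                    }
                },
                {
                    "match": {
                        "category": "华为"
                    }
                }
                # ... further match clauses
            ],
            # once a filter is present, should clauses become optional unless this is set
            "minimum_should_match": 1,

            # to additionally restrict by range
            "filter": {
                "range": {
                    "price": {
                        "gt": 5000    # price > 5000
                    }
                }
            }
        }
    }
}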
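
Composing bool queries by hand gets error-prone as clauses pile up, so a small builder can help. The sketch below is illustrative only: bool_query is a hypothetical helper, and it always sets minimum_should_match to 1 when should clauses are given, which is one reasonable choice rather than the only one.

```python
def bool_query(must=None, should=None, filters=None):
    """Compose a bool query from lists of clause dictionaries."""
    clause = {}
    if must:
        clause["must"] = list(must)          # AND
    if should:
        clause["should"] = list(should)      # OR
        clause["minimum_should_match"] = 1   # keep OR semantics next to must/filter
    if filters:
        clause["filter"] = list(filters)     # restricts hits without affecting scores
    return {"query": {"bool": clause}}

either_brand = [
    {"match": {"category": "小米"}},
    {"match": {"category": "华为"}},
]
expensive = [{"range": {"price": {"gt": 5000}}}]
body = bool_query(should=either_brand, filters=expensive)
print(body)
```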

5.3.3 Document aggregation query

Aggregation functions are available, for example averaging a field:

{
    "aggs": {                      # aggregation section
        "price_avg": {             # name of the result
            "avg": {               # aggregation function
                "field": "price"   # field to average
            }
        }
    }
}

Grouped aggregation:

{
    "aggs": {                      # aggregation section
        "price_groups": {          # name of the grouped result
            "terms": {             # grouping, similar to GROUP BY
                "field": "price"   # field to group by
            }
        }
    }
}

With the body above, the response shows the number of documents in each group.
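
To see what these two aggregations compute, the sketch below reproduces them locally on a made-up list of prices. It only mirrors the arithmetic; the actual ES response wraps the numbers in its own JSON structure.

```python
from collections import Counter
from statistics import mean

prices = [1999, 1999, 3999, 5999]  # hypothetical price values of four documents

price_avg = mean(prices)        # what the avg aggregation reports
price_groups = Counter(prices)  # what the terms aggregation reports: value -> document count

print(price_avg)
print(dict(price_groups))
```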

5.4 Full Text Search & Exact Match

When ES builds the inverted index for a document, it first splits the text into tokens and indexes each token. A query at this point is therefore a full-text search. What does that mean?

{
    "query": {
        # filter condition; to fetch everything use "match_all": {}
        "match": {
            "category": "小"  # returns every document whose category contains 小; "小华" would return every document containing 小 or 华
        }
    }
}

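That is often not what you want. To match the query as a whole phrase instead of token by token:

{
    "query": {
        # filter condition; to fetch everything use "match_all": {}
        "match_phrase": {
            "category": "小"  # phrase query: the query's tokens must appear adjacent and in order (for "小华", only values containing 小华 as a unit match)
        }
    },

    # highlight the matched field in the results
    "highlight": {
        "fields": {
            "category": {}
        }
    }
}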
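
The difference between the two queries can be imitated in a few lines of Python. This is a toy model under stated assumptions: text is tokenized into single characters (roughly how the default analyzer treats Chinese text), and the function names are my own, not ES APIs.

```python
def tokenize(text):
    """Split text into single-character tokens, ignoring whitespace."""
    return [ch for ch in text if not ch.isspace()]

def match(field_value, query):
    """match: a hit if ANY query token appears in the field."""
    tokens = set(tokenize(field_value))
    return any(tok in tokens for tok in tokenize(query))

def match_phrase(field_value, query):
    """match_phrase: a hit only if ALL query tokens appear adjacent and in order."""
    doc, q = tokenize(field_value), tokenize(query)
    return any(doc[i:i + len(q)] == q for i in range(len(doc) - len(q) + 1))

for category in ["小米", "华为", "小华为"]:
    print(category, match(category, "小华"), match_phrase(category, "小华"))
```

For a whole-value comparison rather than a phrase match, a term query on a keyword field is the usual tool.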

5.5 ES - Mapping relationship

A mapping is similar to the table schema defined when creating a database table. It declares the query rules for the fields of documents under an index: which fields support full-text search and which only exact matching, which fields are indexed and which are not, and so on. The mapping lets you state all of this up front.

{
    "properties": {
        "name": {
            "type": "text",     # text type, supports full-text search
            "index": true       # indexed, so it can be searched
        },
        "sex": {
            "type": "keyword",  # keyword type: not tokenized, exact match only
            "index": true       # indexed
        },
        "tel": {
            "type": "keyword",  # exact match
            "index": false      # not indexed, so this field cannot be queried
        }
    }
}

As a detailed example, first create an index called user.

Then create the mapping for this index by sending the body above to the index's _mapping endpoint.

Switching to a GET request on the same URL returns the current mapping of the index. Next, insert a document and run some queries against it.

Querying on name shows that the field supports full-text search. Querying on sex only matches when the whole value is given, because it is a keyword field. Querying on tel fails with an error because tel is not indexed.
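
As a quick sanity check before running queries, the mapping itself can be inspected in code. The helpers below are hypothetical and only read the mapping dictionary shown above; they do not talk to ES.

```python
user_mapping = {
    "properties": {
        "name": {"type": "text", "index": True},
        "sex": {"type": "keyword", "index": True},
        "tel": {"type": "keyword", "index": False},
    }
}

def searchable_fields(mapping):
    """Fields that can be queried at all (index defaults to true when omitted)."""
    return [name for name, spec in mapping["properties"].items()
            if spec.get("index", True)]

def full_text_fields(mapping):
    """Indexed text fields, i.e. the ones that are tokenized for full-text search."""
    return [name for name, spec in mapping["properties"].items()
            if spec["type"] == "text" and spec.get("index", True)]

print(searchable_fields(user_mapping))
print(full_text_fields(user_mapping))
```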

6. Using ElasticSearch from Python

ElasticSearch provides a Python API, so index creation and the create, delete, update, and query operations above can all be done from Python code. See the official documentation for details; the commonly used commands follow.

Install in a virtual environment:

# Pin the version so that it matches the installed ES version; otherwise errors may occur
pip install elasticsearch==7.6

6.1 Initialize ES

from elasticsearch import Elasticsearch

es = Elasticsearch([{
    "host": "xx.xxx.xxx.72",
    "port": 9200
}], timeout=3600)

If this call raises

TypeError: NodeConfig.__init__() missing 1 required positional argument: 'scheme'

then the installed elasticsearch Python package is 8.x while the ES server is 7.x. I checked the version of the installed ES server first and then pinned the matching version when installing the package.

6.2 Index operations

mappings = {
    "mappings": {
        "properties": {
            "id": {
                "type": "long",
                "index": False
            },
            "serial": {
                "type": "text",   # keyword is not tokenized; text is
                "index": False    # do not index this field
            },
            # tags can hold a JSON object; access it as tags.content
            "tags": {
                "type": "object",
                "properties": {
                    "content": {"type": "keyword", "index": True},
                    "dominant_color_name": {"type": "keyword", "index": True},
                    "skill": {"type": "keyword", "index": True},
                }
            },
            "hasTag": {
                "type": "long",
                "index": True
            },
            "status": {
                "type": "long",
                "index": True
            },
            "createTime": {
                "type": "date",
                "format": "yyyy-MM-dd HH:mm:ss||yyyy-MM-dd||epoch_millis"
            },
            "updateTime": {
                "type": "date",
                "format": "yyyy-MM-dd HH:mm:ss||yyyy-MM-dd||epoch_millis"
            }
        }
    }
}
# Create the index. On failure ES returns an error code; with ignore set, the listed
# codes are ignored so the program keeps running instead of raising an exception
es.indices.create(index='test', body=mappings, ignore=[400, 404])

Delete the index:

result = es.indices.delete(index='test', ignore=[400, 404])

6.3 Data manipulation

6.3.1 Data Insertion

Insert a single piece of data

action = {
    "id": "111",
    "serial": "版本",
    # Writing tags.content like this is wrong:
    # "tags.content": "标签2",
    # "tags.dominant_color_name": "域名的颜色黄色",
    # The correct form is:
    "tags": {"content": "标签3", "dominant_color_name": "域名的颜色黄色"},
    # Write nested data as a dict. The dotted form above is written as a literal
    # tags.content field instead of adding content inside tags, so take care here
    "tags.skill": "分类信息",
    "hasTag": "123",
    "status": "11",
    "createTime": "2018-02-02",
    "updateTime": "2018-02-03",
}
es.index(index="test", doc_type="_doc", body=action, id="111")  # without id, one is generated automatically
# Testing shows that the "id" inside action is not the document id; it is stored as an ordinary field

# create can be used instead, but it requires an id to uniquely identify the document
# (and it raises a conflict error if a document with that id already exists)
es.create(index="test", doc_type="_doc", body=action, id="111")

Insert multiple documents

from elasticsearch import helpers

doc = [
    {
        "_index": "test",
        "_source": {
            "id": "111",
            "serial": "版本",
            "tags": {"content": "标签3", "dominant_color_name": "域名的颜色黄色"},
            "tags.skill": "分类信息",
            "hasTag": "123",
            "status": "11",
            "createTime": "2018-02-02",
            "updateTime": "2018-02-03",
        }
    },
    {
        "_index": "test",
        "_source": {
            "id": "222",
            "serial": "版本",
            "tags": {"content": "标签3", "dominant_color_name": "域名的颜色黄色"},
            "tags.skill": "分类信息",
            "hasTag": "123",
            "status": "11",
            "createTime": "2018-02-02",
            "updateTime": "2018-02-03",
        }
    },
    # ...
]

# Actions in the _index/_source form are consumed by the bulk helper
a = helpers.bulk(es, doc)

# The following form also works: es.bulk takes alternating action and document entries
doc = [
    {"index": {}},
    {
        "id": "111",
        "serial": "版本1",
        "tags": {"content": "标签3", "dominant_color_name": "域名的颜色黄色"},
        "tags.skill": "分类信息",
        "hasTag": "123",
        "status": "11",
        "createTime": "2018-02-02",
        "updateTime": "2018-02-03",
    },
    {"index": {}},
    {
        "id": "222",
        "serial": "版本2",
        "tags": {"content": "标签3", "dominant_color_name": "域名的颜色黄色"},
        "tags.skill": "分类信息",
        "hasTag": "123",
        "status": "11",
        "createTime": "2018-02-02",
        "updateTime": "2018-02-03",
    },
]
a = es.bulk(index='test', doc_type='_doc', body=doc)
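
Writing the alternating action/document list by hand is tedious for more than a few records. A small helper can generate it from plain documents; bulk_actions and its id_field parameter are hypothetical names for this sketch, and the result is meant to be passed as the body of es.bulk.

```python
def bulk_actions(docs, id_field=None):
    """Interleave one "index" action entry and one document entry per document."""
    body = []
    for doc in docs:
        action = {"index": {}}
        if id_field is not None:
            action["index"]["_id"] = doc[id_field]  # reuse a field as the document id
        body.append(action)
        body.append(doc)
    return body

records = [
    {"id": "111", "serial": "版本1"},
    {"id": "222", "serial": "版本2"},
]
bulk_body = bulk_actions(records, id_field="id")
print(bulk_body)
# With a live client: es.bulk(index="test", body=bulk_body)
```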

6.3.2 Update data

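modify_data = {
    "doc": {   # a partial update must be wrapped in "doc"
        "tags": {"content": "标签5"}
    }
}
response = es.update(index="test", doc_type="_doc", id="222", body=modify_data)
# index can be used instead and covers two cases: it inserts the document if it does not
# exist and otherwise replaces it with the given body (pass the plain document, without "doc")
response = es.index(index="test", doc_type="_doc", id="222", body=modify_data["doc"])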

6.4 Delete data

To delete a piece of data, call the delete method and specify the id of the data to be deleted.

response = es.delete(index="test", doc_type="_doc", id="222")

6.5 Query data

6.5.1 id query

response = es.get(index="test", id="111")

6.5.2 Query based on specific fields

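As with Postman above, the query is written as a JSON body, for example:

# Note: term looks up the exact value without analysis, so it suits indexed keyword
# fields; with the mapping from 6.2 (serial is text and not indexed), adjust the field
# or the mapping before running this
query = {
    "query": {
        "bool": {
            "must": [
                {
                    "term": {
                        "serial": {
                            "value": "版本1"
                        }
                    }
                }
            ]
        }
    }
}
response = es.search(index="test", size=1, body=query)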
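
The search response nests the matching documents under hits.hits, each with its original body in _source. The helper below is a small sketch for pulling them out; hit_sources is a hypothetical name and sample_response is a hand-written example of that structure, not real output.

```python
def hit_sources(response):
    """Return the stored documents (_source) of all hits in a search response."""
    return [hit["_source"] for hit in response["hits"]["hits"]]

# Hand-written sample shaped like an ES 7.x search response (abridged).
sample_response = {
    "hits": {
        "total": {"value": 1, "relation": "eq"},
        "hits": [
            {"_index": "test", "_id": "111", "_score": 1.0,
             "_source": {"id": "111", "serial": "版本1"}},
        ],
    }
}

for source in hit_sources(sample_response):
    print(source["serial"])
```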

ES offers other powerful query features as well, such as filter queries, aggregation queries, and grouped queries.

All of them go through the search function with a different body. I think it is better to look them up when needed, so I won't go through them one by one. The third reference link below has many query examples to consult when the time comes.

If I learn anything new later, I will keep adding to these notes.

Reference :

Origin blog.csdn.net/wuzhongqiang/article/details/125889153