Elasticsearch 7.6 study notes 1: Getting started with Elasticsearch
Foreword
The Chinese edition of the Definitive Guide only covers 2.x, but es has now reached 7.6, so I install the latest version to learn.
Installation
This is an installation for learning; a production installation is a different matter.
Windows
es download address:
https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.6.0-windows-x86_64.zip
Kibana download address:
https://artifacts.elastic.co/downloads/kibana/kibana-7.6.0-windows-x86_64.zip
The latest official release is currently 7.6.0, but the download speed is terrible. With Thunder (Xunlei) the download speed can reach xM.
bin\elasticsearch.bat
bin\kibana.bat
Double-click each .bat file to start it.
Docker installation
For testing and learning, it is faster and more convenient to use the official Docker image directly.
For the installation method, see: https://www.cnblogs.com/woshimrf/p/docker-es7.html
The following is from:
https://www.elastic.co/guide/en/elasticsearch/reference/7.6/getting-started.html
Index some documents
This test uses Kibana directly. You can of course also reach localhost:9200 with curl or Postman.
Visit localhost:5601, then click Dev Tools.
Create a new index named customer
PUT /{index-name}/_doc/{id}
PUT /customer/_doc/1
{
"name": "John Doe"
}
PUT is the HTTP method. If the customer index does not exist in es, es creates it and inserts one document with id = 1 and name = John Doe.
If the document already exists, it is updated. Note that an update overwrites the whole document: whatever the JSON body contains is the final result.
The return is as follows:
{
"_index" : "customer",
"_type" : "_doc",
"_id" : "1",
"_version" : 7,
"result" : "updated",
"_shards" : {
"total" : 2,
"successful" : 2,
"failed" : 0
},
"_seq_no" : 6,
"_primary_term" : 1
}
- _index: the index name
- _type: can only be _doc
- _id: the primary key of the document, i.e. the pk of a record
- _version: the number of times this _id has been updated; I have updated it 7 times here
- _shards: the result across shards. We have deployed two nodes here, and both were written successfully.
You can check the status of the index under Index Management in Kibana. For example, our record has two shard copies, a primary and a replica.
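The same index request can also be sent from a script instead of Dev Tools. The following is a minimal Python sketch that only builds the request with the standard library. The helper name build_index_request and the base URL http://localhost:9200 are assumptions for illustration, and nothing is sent until urllib.request.urlopen is called on the request.

```python
import json
import urllib.request


def build_index_request(index, doc_id, document, base_url="http://localhost:9200"):
    """Build (but do not send) a PUT /{index}/_doc/{id} request.

    base_url is an assumption: a local single-node setup on the default port.
    """
    url = f"{base_url}/{index}/_doc/{doc_id}"
    body = json.dumps(document).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


req = build_index_request("customer", 1, {"name": "John Doe"})
print(req.get_method(), req.full_url)

# To actually send it against a running node:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp))
```

Because PUT replaces the whole document, sending a body without the name field would leave a document without a name, not a merged one.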
After the record is successfully saved, it can be read immediately:
GET /customer/_doc/1
Response:
{
"_index" : "customer",
"_type" : "_doc",
"_id" : "1",
"_version" : 15,
"_seq_no" : 14,
"_primary_term" : 1,
"found" : true,
"_source" : {
"name" : "John Doe"
}
}
- _source: the document content we stored
Bulk insert
When there are many documents to insert, we can insert them in bulk. Download the prepared documents, then import them into es with an HTTP request.
Create the bank index: the number of primary shards cannot be changed after the index is created (the number of replicas can still be adjusted later), so configure the shards at creation time. Here we use 3 shards and 2 replicas.
PUT /bank
{
"settings": {
"index": {
"number_of_shards": "3",
"number_of_replicas": "2"
}
}
}
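As a quick check of what these settings imply, the sketch below counts shard copies, assuming the usual rule that every primary shard gets number_of_replicas additional copies. The helper name total_shard_copies is made up for illustration.

```python
def total_shard_copies(number_of_shards, number_of_replicas):
    """Each primary shard has `number_of_replicas` additional copies."""
    return number_of_shards * (1 + number_of_replicas)


# The bank index: 3 primaries, each with 2 replicas.
bank_copies = total_shard_copies(3, 2)
# An index with 1 primary and 1 replica.
single_copies = total_shard_copies(1, 1)
print(bank_copies, single_copies)
```

Under the same rule, an index with 1 primary and 1 replica has 2 copies, which would be consistent with the "_shards" total of 2 in the customer response earlier.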
Document address: https://gitee.com/mirrors/elasticsearch/raw/master/docs/src/test/resources/accounts.json
After downloading, send the file with curl or Postman:
curl -H "Content-Type: application/json" -XPOST "localhost:9200/bank/_bulk?pretty&refresh" --data-binary "@accounts.json"
curl "localhost:9200/_cat/indices?v"
The format of each record is as follows:
{
"_index": "bank",
"_type": "_doc",
"_id": "1",
"_version": 1,
"_score": 0,
"_source": {
"account_number": 1,
"balance": 39225,
"firstname": "Amber",
"lastname": "Duke",
"age": 32,
"gender": "M",
"address": "880 Holmes Lane",
"employer": "Pyrami",
"email": "[email protected]",
"city": "Brogan",
"state": "IL"
}
}
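The file sent to _bulk is newline-delimited JSON: each document is preceded by an action line, and the body ends with a newline. The following Python sketch shows how such a body could be assembled from plain documents. The helper name build_bulk_body and the choice of account_number as the _id are assumptions for illustration; the two sample documents are abbreviated from the records shown in these notes.

```python
import json


def build_bulk_body(docs, id_field="account_number"):
    """Return an NDJSON body: one action line, then one source line, per document."""
    lines = []
    for doc in docs:
        # Action line: index this document under the given _id.
        lines.append(json.dumps({"index": {"_id": str(doc[id_field])}}))
        # Source line: the document itself, on a single line.
        lines.append(json.dumps(doc))
    # The bulk body must be terminated by a newline.
    return "\n".join(lines) + "\n"


sample = [
    {"account_number": 1, "balance": 39225, "firstname": "Amber", "lastname": "Duke"},
    {"account_number": 2, "balance": 28838, "firstname": "Roberta", "lastname": "Bender"},
]
body = build_bulk_body(sample)
print(body, end="")
```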
On the Monitoring page in Kibana, select self monitoring. Then find the bank index under Indices to see how the imported data is distributed.
As you can see, the 3 shards sit on different nodes, and there are 2 replicas.
Start query
After inserting some data in bulk, we can start learning to query. As noted above, the data is the bank employee table. We query all users and sort them by account number.
The SQL equivalent would be
select * from bank order by account_number asc limit 3 offset 2
Query DSL
GET /bank/_search
{
"query": { "match_all": {} },
"sort": [
{ "account_number": "asc" }
],
"size": 3,
"from": 2
}
- _search: indicates a search request
- query: the query condition; here it matches everything
- size: the number of hits returned per query, i.e. the page size. If omitted, the default is 10. The hits appear under hits in the returned results.
- from: the offset of the first hit to return
Response:
{
"took" : 1,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 1000,
"relation" : "eq"
},
"max_score" : null,
"hits" : [
{
"_index" : "bank",
"_type" : "_doc",
"_id" : "2",
"_score" : null,
"_source" : {
"account_number" : 2,
"balance" : 28838,
"firstname" : "Roberta",
"lastname" : "Bender",
"age" : 22,
"gender" : "F",
"address" : "560 Kingsway Place",
"employer" : "Chillium",
"email" : "[email protected]",
"city" : "Bennett",
"state" : "LA"
},
"sort" : [
2
]
},
{
"_index" : "bank",
"_type" : "_doc",
"_id" : "3",
"_score" : null,
"_source" : {
"account_number" : 3,
"balance" : 44947,
"firstname" : "Levine",
"lastname" : "Burks",
"age" : 26,
"gender" : "F",
"address" : "328 Wilson Avenue",
"employer" : "Amtap",
"email" : "[email protected]",
"city" : "Cochranville",
"state" : "HI"
},
"sort" : [
3
]
},
{
"_index" : "bank",
"_type" : "_doc",
"_id" : "4",
"_score" : null,
"_source" : {
"account_number" : 4,
"balance" : 27658,
"firstname" : "Rodriquez",
"lastname" : "Flores",
"age" : 31,
"gender" : "F",
"address" : "986 Wyckoff Avenue",
"employer" : "Tourmania",
"email" : "[email protected]",
"city" : "Eastvale",
"state" : "HI"
},
"sort" : [
4
]
}
]
}
}
The returned result provides the following information:
- took: the time es spent on the query, in milliseconds
- timed_out: whether the search timed out
- _shards: how many shards were searched, and how many succeeded, failed or were skipped. A shard can be understood simply as a data partition: the data in an index is split into several pieces, much like a table partitioned by id.
- max_score: the score of the most relevant document
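The from and size parameters in the query above behave like an offset and a limit over the sorted hits. A minimal Python sketch of that slicing, using a hypothetical list of account numbers rather than the real index:

```python
def paginate(sorted_hits, from_=0, size=10):
    """Return the slice that `from` and `size` would select from sorted hits."""
    return sorted_hits[from_:from_ + size]


# Hypothetical account numbers, already sorted ascending.
account_numbers = list(range(10))
page = paginate(account_numbers, from_=2, size=3)
print(page)  # [2, 3, 4]
```

This lines up with the response above, where the first hit has account_number 2 because the first two hits were skipped.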
Next, you can try a conditional query.
Word search
Query for addresses that contain mill or lane.
GET /bank/_search
{
"query": { "match": { "address": "mill lane" } },
"size": 2
}
Response:
{
"took" : 3,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 19,
"relation" : "eq"
},
"max_score" : 9.507477,
"hits" : [
{
"_index" : "bank",
"_type" : "_doc",
"_id" : "136",
"_score" : 9.507477,
"_source" : {
"account_number" : 136,
"balance" : 45801,
"firstname" : "Winnie",
"lastname" : "Holland",
"age" : 38,
"gender" : "M",
"address" : "198 Mill Lane",
"employer" : "Neteria",
"email" : "[email protected]",
"city" : "Urie",
"state" : "IL"
}
},
{
"_index" : "bank",
"_type" : "_doc",
"_id" : "970",
"_score" : 5.4032025,
"_source" : {
"account_number" : 970,
"balance" : 19648,
"firstname" : "Forbes",
"lastname" : "Wallace",
"age" : 28,
"gender" : "M",
"address" : "990 Mill Road",
"employer" : "Pheast",
"email" : "[email protected]",
"city" : "Lopezo",
"state" : "AK"
}
}
]
}
}
- I set size to 2, so only 2 documents are returned, but 19 actually matched
Exact match query
GET /bank/_search
{
"query": { "match_phrase": { "address": "mill lane" } }
}
This time only one document matches the exact phrase.
{
"took" : 1,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 1,
"relation" : "eq"
},
"max_score" : 9.507477,
"hits" : [
{
"_index" : "bank",
"_type" : "_doc",
"_id" : "136",
"_score" : 9.507477,
"_source" : {
"account_number" : 136,
"balance" : 45801,
"firstname" : "Winnie",
"lastname" : "Holland",
"age" : 38,
"gender" : "M",
"address" : "198 Mill Lane",
"employer" : "Neteria",
"email" : "[email protected]",
"city" : "Urie",
"state" : "IL"
}
}
]
}
}
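The difference between match and match_phrase can be sketched in plain Python. This is a deliberately simplified model, assuming an analyzer that only lowercases and splits on whitespace, and it ignores scoring entirely; the function names are made up for illustration.

```python
def tokens(text):
    """Very rough stand-in for an analyzer: lowercase and split on whitespace."""
    return text.lower().split()


def match(field_value, query):
    """True if any query token occurs in the field (OR semantics)."""
    field = set(tokens(field_value))
    return any(tok in field for tok in tokens(query))


def match_phrase(field_value, query):
    """True if the query tokens occur consecutively and in order in the field."""
    field, wanted = tokens(field_value), tokens(query)
    return any(
        field[i:i + len(wanted)] == wanted
        for i in range(len(field) - len(wanted) + 1)
    )


addresses = ["198 Mill Lane", "990 Mill Road", "880 Holmes Lane", "560 Kingsway Place"]
matched = [a for a in addresses if match(a, "mill lane")]
phrase_matched = [a for a in addresses if match_phrase(a, "mill lane")]
print(matched)
print(phrase_matched)
```

In this model, "990 Mill Road" satisfies match because it contains mill, but only "198 Mill Lane" contains the two words next to each other.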
Multi-condition query
In practice, a query usually combines several conditions:
GET /bank/_search
{
"query": {
"bool": {
"must": [
{ "match": { "age": "40" } }
],
"must_not": [
{ "match": { "state": "ID" } }
]
}
}
}
- bool: combines multiple query conditions
- must, should, must_not: the clauses of a boolean query
- must, should: determine the relevance score; results are sorted by score by default
- must_not: acts as a filter. It affects which documents are returned but not the score; it only filters documents out of the results.
You can also specify arbitrary filters explicitly to include or exclude documents based on structured data.
For example, query accounts whose balance is between 20,000 and 30,000 (inclusive).
GET /bank/_search
{
"query": {
"bool": {
"must": { "match_all": {} },
"filter": {
"range": {
"balance": {
"gte": 20000,
"lte": 30000
}
}
}
}
}
}
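The way the bool clauses select documents can be sketched over an in-memory list. This is a simplified model under stated assumptions: it only decides which documents are kept and does not model scoring, should clauses are left out, and the documents and the helper name bool_select are hypothetical.

```python
def bool_select(docs, must=(), must_not=(), filters=()):
    """Keep docs where every must/filter predicate holds and no must_not predicate holds."""
    return [
        d for d in docs
        if all(p(d) for p in must)
        and all(p(d) for p in filters)
        and not any(p(d) for p in must_not)
    ]


docs = [  # hypothetical documents
    {"age": 40, "state": "ID", "balance": 25000},
    {"age": 40, "state": "TX", "balance": 31000},
    {"age": 40, "state": "MD", "balance": 20000},
    {"age": 22, "state": "TX", "balance": 28000},
]

# age must be 40, state must not be ID
age_40_not_id = bool_select(
    docs,
    must=[lambda d: d["age"] == 40],
    must_not=[lambda d: d["state"] == "ID"],
)

# gte/lte are inclusive bounds, so 20000 itself is kept
in_range = bool_select(docs, filters=[lambda d: 20000 <= d["balance"] <= 30000])

print([d["state"] for d in age_40_not_id])
print([d["balance"] for d in in_range])
```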
Aggregations (group by)
Count by state
In SQL it might be written as
select state AS group_by_state, count(*) from tbl_bank group by state order by count(*) desc limit 3;
The corresponding es request is
GET /bank/_search
{
"size": 0,
"aggs": {
"group_by_state": {
"terms": {
"field": "state.keyword",
"size": 3
}
}
}
}
- size=0: limits the returned content. es would otherwise return the matching records, and we only want the aggregated values.
- aggs: the keyword that introduces aggregations
- group_by_state: the name of one aggregation result; the name is user-defined
- terms: groups by the exact values of a field; here it names the field to group on
- state.keyword: state is a text field; to count and group on a string field, the type must be keyword
- size=3: limits the number of groups returned, here the top 3. The default is the top 10 and the system maximum is 10000, which can be changed by modifying search.max_buckets. Note that multiple shards cause accuracy problems, which we will study in depth later.
Return value:
{
"took" : 5,
"timed_out" : false,
"_shards" : {
"total" : 3,
"successful" : 3,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 1000,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
},
"aggregations" : {
"group_by_state" : {
"doc_count_error_upper_bound" : 26,
"sum_other_doc_count" : 928,
"buckets" : [
{
"key" : "MD",
"doc_count" : 28
},
{
"key" : "ID",
"doc_count" : 23
},
{
"key" : "TX",
"doc_count" : 21
}
]
}
}
}
- hits: the records matching the query condition; because size = 0 is set, it returns []
- total: this query hit 1000 records
- aggregations: the aggregation results
- group_by_state: the name we chose in the query
- doc_count_error_upper_bound: an upper bound on the document counts of terms that were not returned in this aggregation but could potentially belong in the result. As the name says, it is an upper bound: the worst-case estimate of what the final result may have left out. The larger doc_count_error_upper_bound is, the more likely the final data is inaccurate. A value of 0 means the data is completely correct, but a non-zero value does not necessarily mean the aggregated data is wrong.
- sum_other_doc_count: the number of documents that were not counted in the returned buckets
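On a single machine, the terms aggregation above corresponds to counting values and keeping the most common ones. A minimal Python sketch with hypothetical data; the helper name terms_aggregation is made up for illustration and the sharding effects are ignored here.

```python
from collections import Counter


def terms_aggregation(values, size):
    """Count values and return the top `size` buckets plus the uncounted remainder."""
    counts = Counter(values)
    buckets = counts.most_common(size)
    # Documents that fall outside the returned buckets.
    sum_other = sum(counts.values()) - sum(count for _, count in buckets)
    return buckets, sum_other


states = ["TX"] * 4 + ["MD"] * 3 + ["ID"] * 2 + ["AK"]  # hypothetical data
buckets, sum_other_doc_count = terms_aggregation(states, size=3)
print(buckets, sum_other_doc_count)
```

The same arithmetic holds for the response above: 1000 - (28 + 23 + 21) = 928, which is the reported sum_other_doc_count.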
It is worth asking whether this top 3 is accurate. doc_count_error_upper_bound reports a non-zero error, which means the statistics are quite possibly inaccurate; the top 3 counts obtained are 28, 23 and 21. Let us add another query parameter and compare the results:
GET /bank/_search
{
"size": 0,
"aggs": {
"group_by_state": {
"terms": {
"field": "state.keyword",
"size": 3,
"shard_size": 60
}
}
}
}
Response (aggregations part):
"aggregations" : {
"group_by_state" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 915,
"buckets" : [
{
"key" : "TX",
"doc_count" : 30
},
{
"key" : "MD",
"doc_count" : 28
},
{
"key" : "ID",
"doc_count" : 27
}
]
}
}
- shard_size: the number of buckets each shard computes. An aggregation computes a result on every shard and then merges them into the final result. The data is distributed unevenly across shards and each shard's top N is different, so the merged result may undercount some terms. That is why doc_count_error_upper_bound is not 0. The default shard_size in es is size*1.5+10, which for size = 3 gives 14.5; I verified that the result with shard_size = 14.5 is indeed the same as when the parameter is omitted. With shard_size set to 60 the error finally becomes 0, which guarantees that these 3 really are the top 3. In other words, an aggregation should set shard_size as large as practical, for example 20 times size.
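Why a small shard_size can produce a wrong top N can be shown with a small simulation. This is a sketch under simplifying assumptions, not how es is implemented: each shard reports only its local top shard_size terms, and a coordinator sums what it receives. The per-shard counts and the function names are hypothetical.

```python
from collections import Counter


def default_shard_size(size):
    """The default as stated in these notes: size * 1.5 + 10."""
    return size * 1.5 + 10


def sharded_terms(shards, size, shard_size):
    """Merge each shard's local top `shard_size` terms, then keep the top `size`."""
    merged = Counter()
    for shard_counts in shards:
        for term, count in Counter(shard_counts).most_common(shard_size):
            merged[term] += count
    return merged.most_common(size)


# Hypothetical per-shard term counts: "b" is never first on any shard,
# yet it is the most frequent term overall (9 + 9 + 9 = 27).
shards = [
    {"a": 10, "b": 9},
    {"c": 10, "b": 9},
    {"d": 10, "b": 9},
]
too_small = sharded_terms(shards, size=1, shard_size=1)
large_enough = sharded_terms(shards, size=1, shard_size=2)
print(too_small, large_enough)
print(default_shard_size(3))
```

With shard_size=1 the coordinator never sees "b" at all, so the reported winner has a count of only 10; with shard_size=2 every shard also reports "b" and the merged result is exact.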
Count the number of people by state and calculate the average balance
We want to check the average balance of each state; the SQL might be
select
state, avg(balance) AS average_balance, count(*) AS group_by_state
from tbl_bank
group by state
limit 3
You can query in es like this:
GET /bank/_search
{
"size": 0,
"aggs": {
"group_by_state": {
"terms": {
"field": "state.keyword",
"size": 3,
"shard_size": 60
},
"aggs": {
"average_balance": {
"avg": {
"field": "balance"
}
},
"sum_balance": {
"sum": {
"field": "balance"
}
}
}
}
}
}
- The second aggs computes the aggregated metrics for each state
- average_balance: a user-defined name; the value is the avg of balance within the same state
- sum_balance: a user-defined name; the value is the sum of balance within the same state
The result is as follows:
{
"took" : 12,
"timed_out" : false,
"_shards" : {
"total" : 3,
"successful" : 3,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 1000,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
},
"aggregations" : {
"group_by_state" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 915,
"buckets" : [
{
"key" : "TX",
"doc_count" : 30,
"sum_balance" : {
"value" : 782199.0
},
"average_balance" : {
"value" : 26073.3
}
},
{
"key" : "MD",
"doc_count" : 28,
"sum_balance" : {
"value" : 732523.0
},
"average_balance" : {
"value" : 26161.535714285714
}
},
{
"key" : "ID",
"doc_count" : 27,
"sum_balance" : {
"value" : 657957.0
},
"average_balance" : {
"value" : 24368.777777777777
}
}
]
}
}
}
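The nested aggregation corresponds to grouping by state and then computing metrics inside each group. A minimal Python sketch over hypothetical accounts; the helper name balance_stats_by_state is made up for illustration.

```python
from collections import defaultdict


def balance_stats_by_state(accounts):
    """Group accounts by state and compute doc_count, sum and average of balance."""
    groups = defaultdict(list)
    for account in accounts:
        groups[account["state"]].append(account["balance"])
    return {
        state: {
            "doc_count": len(balances),
            "sum_balance": sum(balances),
            "average_balance": sum(balances) / len(balances),
        }
        for state, balances in groups.items()
    }


accounts = [  # hypothetical accounts
    {"state": "TX", "balance": 1000},
    {"state": "TX", "balance": 3000},
    {"state": "MD", "balance": 500},
]
stats = balance_stats_by_state(accounts)
print(stats)
```

The response above is consistent with this relationship: for TX, 782199.0 / 30 gives the reported average of 26073.3.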
Count by state and sort by average balance
The default order of a terms aggregation is count descending. If we want a different order, the SQL might look like this:
select
state, avg(balance) AS average_balance, count(*) AS group_by_state
from tbl_bank
group by state
order by average_balance desc
limit 3
The corresponding es can be queried like this:
GET /bank/_search
{
"size": 0,
"aggs": {
"group_by_state": {
"terms": {
"field": "state.keyword",
"order": {
"average_balance": "desc"
},
"size": 3
},
"aggs": {
"average_balance": {
"avg": {
"field": "balance"
}
}
}
}
}
}
The top 3 returned is no longer the same as before:
"aggregations" : {
"group_by_state" : {
"doc_count_error_upper_bound" : -1,
"sum_other_doc_count" : 983,
"buckets" : [
{
"key" : "DE",
"doc_count" : 2,
"average_balance" : {
"value" : 39040.5
}
},
{
"key" : "RI",
"doc_count" : 5,
"average_balance" : {
"value" : 36035.4
}
},
{
"key" : "NE",
"doc_count" : 10,
"average_balance" : {
"value" : 35648.8
}
}
]
}
}
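Ordering buckets by a sub-aggregation amounts to sorting the groups on the computed metric instead of on their document count. A minimal Python sketch with hypothetical buckets; the helper name top_by_average is made up for illustration.

```python
def top_by_average(buckets, size):
    """Order buckets by average_balance descending and keep the first `size`."""
    return sorted(buckets, key=lambda b: b["average_balance"], reverse=True)[:size]


buckets = [  # hypothetical buckets
    {"key": "TX", "doc_count": 30, "average_balance": 26000.0},
    {"key": "DE", "doc_count": 2, "average_balance": 39000.0},
    {"key": "RI", "doc_count": 5, "average_balance": 36000.0},
    {"key": "MD", "doc_count": 28, "average_balance": 26100.0},
]
top = top_by_average(buckets, size=3)
print([b["key"] for b in top])  # ['DE', 'RI', 'MD']
```

As in the response above, groups with very few documents can come first, because a high average says nothing about how many documents are in the bucket.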
References
- Chinese community: https://elasticsearch.cn/
- es official documentation (documents and indices): https://www.elastic.co/guide/en/elasticsearch/reference/current/documents-indices.html
- es official documentation (getting started, index): https://www.elastic.co/guide/en/elasticsearch/reference/current/getting-started-index.html
- Terms aggregation calculation is inaccurate: https://www.dongwm.com/post/elasticsearch-terms-agg-is-not-accurate/