Elasticsearch 7.6 study notes 1: Getting started with Elasticsearch

Foreword

The Chinese edition of the Definitive Guide only covers 2.x, but Elasticsearch has now reached 7.6, so these notes install the latest version and learn from that.

Installation

This is an installation for learning; a production installation follows a different procedure.

Windows

Elasticsearch download address:

https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.6.0-windows-x86_64.zip

Kibana download address:

https://artifacts.elastic.co/downloads/kibana/kibana-7.6.0-windows-x86_64.zip

The latest official release at the time of writing is 7.6.0, but the download is very slow. With the Thunder (Xunlei) download manager the speed can reach xM.

bin\elasticsearch.bat
bin\kibana.bat

Double-click each .bat file to start it.

Docker installation

For testing and learning, it is faster and more convenient to use the official Docker image directly.

For the installation method, see: https://www.cnblogs.com/woshimrf/p/docker-es7.html

The rest of these notes follows:

https://www.elastic.co/guide/en/elasticsearch/reference/7.6/getting-started.html

Index some documents

These tests use Kibana directly. You can also send the requests to localhost:9200 with curl or Postman.

Visit localhost:5601, and then click Dev Tools.

Create a new customer index

PUT /{index-name}/_doc/{id}

PUT /customer/_doc/1
{
  "name": "John Doe"
}

PUT is the HTTP method. If there is no customer index, Elasticsearch creates one and inserts a document with id 1 and name John Doe.
If the document already exists, it is updated. Note that an update overwrites the whole document: whatever JSON is in the body becomes the final result.
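The same request can be built outside Kibana. A minimal sketch using only Python's standard library, assuming Elasticsearch listens on localhost:9200 without authentication. The helper name build_index_request is hypothetical, and the request is only constructed here, not sent:

```python
import json
import urllib.request

ES_URL = "http://localhost:9200"  # assumption: local, unsecured learning install


def build_index_request(index, doc_id, document):
    """Build PUT /{index}/_doc/{id}; the body replaces any existing document."""
    return urllib.request.Request(
        url=f"{ES_URL}/{index}/_doc/{doc_id}",
        data=json.dumps(document).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


request = build_index_request("customer", 1, {"name": "John Doe"})
print(request.get_method(), request.full_url)
# To actually send it: urllib.request.urlopen(request).read()
```

Because PUT replaces the whole document, sending a body without the name field would leave the stored document without a name.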

The response is as follows:

{
  "_index" : "customer",
  "_type" : "_doc",
  "_id" : "1",
  "_version" : 7,
  "result" : "updated",
  "_shards" : {
    "total" : 2,
    "successful" : 2,
    "failed" : 0
  },
  "_seq_no" : 6,
  "_primary_term" : 1
}

  • _index is the index name
  • _type is always _doc
  • _id is the primary key of the document, like the pk of a database record
  • _version is incremented on every write to this _id; it is 7 here because I have written this document 7 times
  • _shards reports the result per shard copy. Two nodes are deployed here, and the write succeeded on both.

You can check the status of the index under Index Management in Kibana. For example, our index has two shard copies, a primary and a replica.

After the record is successfully saved, it can be read immediately:

GET /customer/_doc/1

Response:

{
  "_index" : "customer",
  "_type" : "_doc",
  "_id" : "1",
  "_version" : 15,
  "_seq_no" : 14,
  "_primary_term" : 1,
  "found" : true,
  "_source" : {
    "name" : "John Doe"
  }
}

  • _source is the document we stored

Bulk insert

When there are many documents to insert, we can insert them in bulk. Download the prepared documents, then import them into Elasticsearch with an HTTP request.

Create the bank index. The number of primary shards cannot be changed after an index is created (the number of replicas can), so configure the shards at creation time. Here there are 3 shards and 2 replicas.

PUT /bank
{
  "settings": {
    "index": {
      "number_of_shards": "3",
      "number_of_replicas": "2"
    }
  }
}
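As a quick check of what these settings imply, the sketch below counts the shard copies the cluster has to place. The function name is hypothetical; the arithmetic only restates that every primary shard gets number_of_replicas additional copies:

```python
def total_shard_copies(number_of_shards, number_of_replicas):
    """Each primary shard has number_of_replicas extra copies."""
    return number_of_shards * (1 + number_of_replicas)


copies = total_shard_copies(3, 2)
print(copies)  # 3 primaries plus 6 replica copies
```

A replica is never allocated to the same node as its primary, so on a cluster with fewer than three nodes some of these copies stay unassigned.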

Document address: https://gitee.com/mirrors/elasticsearch/raw/master/docs/src/test/resources/accounts.json

After downloading, send the file with curl or Postman:

curl -H "Content-Type: application/json" -XPOST "localhost:9200/bank/_bulk?pretty&refresh" --data-binary "@accounts.json"
curl "localhost:9200/_cat/indices?v"
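The _bulk endpoint expects newline-delimited JSON: an action line followed by the document itself, repeated for each document. A minimal sketch of producing such a body, assuming documents are plain dicts and account_number is used as the _id; the function name to_bulk_body is hypothetical, and the two sample accounts are trimmed from the records shown in these notes:

```python
import json


def to_bulk_body(documents, id_field="account_number"):
    """Serialize documents as NDJSON: one action line, then one source line."""
    lines = []
    for doc in documents:
        lines.append(json.dumps({"index": {"_id": str(doc[id_field])}}))
        lines.append(json.dumps(doc))
    # The bulk API requires the body to end with a newline.
    return "\n".join(lines) + "\n"


accounts = [
    {"account_number": 1, "balance": 39225, "state": "IL"},
    {"account_number": 2, "balance": 28838, "state": "LA"},
]
body = to_bulk_body(accounts)
print(body)
```

This is also why curl is called with --data-binary: it sends the file unchanged, so the newlines that separate the lines are kept.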

After the import, each record looks like this:

{
  "_index": "bank",
  "_type": "_doc",
  "_id": "1",
  "_version": 1,
  "_score": 0,
  "_source": {
    "account_number": 1,
    "balance": 39225,
    "firstname": "Amber",
    "lastname": "Duke",
    "age": 32,
    "gender": "M",
    "address": "880 Holmes Lane",
    "employer": "Pyrami",
    "email": "[email protected]",
    "city": "Brogan",
    "state": "IL"
  }
}

In Kibana's Monitoring section, turn on self monitoring, then find the bank index under Indices. There you can see how the imported data is distributed.

As you can see, the 3 shards sit on different nodes, and there are 2 replicas.

Start query

After inserting some data in bulk, we can start learning to query. As seen above, the data is a table of bank accounts. We query all accounts and sort them by account number.

This is similar to the following SQL:

select * from bank order by account_number asc limit 3 offset 2

Query DSL


GET /bank/_search
{
  "query": { "match_all": {} },
  "sort": [
    { "account_number": "asc" }
  ],
  "size": 3,
  "from": 2
}

  • _search indicates a search request
  • query is the query condition; here it matches all documents
  • size is the number of hits returned per request, that is, the page size. If omitted, the default is 10. The hits appear in the hits array of the response.
  • from is the offset of the first hit to return
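The relationship between a page number and from/size can be sketched as follows. The helper page_query and its 1-based page numbering are assumptions made for illustration:

```python
def page_query(page, page_size):
    """Build a match_all search body for a 1-based page number."""
    if page < 1 or page_size < 1:
        raise ValueError("page and page_size must be positive")
    return {
        "query": {"match_all": {}},
        "sort": [{"account_number": "asc"}],
        "from": (page - 1) * page_size,  # number of hits to skip
        "size": page_size,               # number of hits to return
    }


body = page_query(page=2, page_size=3)
print(body["from"], body["size"])
```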

Response:


{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1000,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [
      {
        "_index" : "bank",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : null,
        "_source" : {
          "account_number" : 2,
          "balance" : 28838,
          "firstname" : "Roberta",
          "lastname" : "Bender",
          "age" : 22,
          "gender" : "F",
          "address" : "560 Kingsway Place",
          "employer" : "Chillium",
          "email" : "[email protected]",
          "city" : "Bennett",
          "state" : "LA"
        },
        "sort" : [
          2
        ]
      },
      {
        "_index" : "bank",
        "_type" : "_doc",
        "_id" : "3",
        "_score" : null,
        "_source" : {
          "account_number" : 3,
          "balance" : 44947,
          "firstname" : "Levine",
          "lastname" : "Burks",
          "age" : 26,
          "gender" : "F",
          "address" : "328 Wilson Avenue",
          "employer" : "Amtap",
          "email" : "[email protected]",
          "city" : "Cochranville",
          "state" : "HI"
        },
        "sort" : [
          3
        ]
      },
      {
        "_index" : "bank",
        "_type" : "_doc",
        "_id" : "4",
        "_score" : null,
        "_source" : {
          "account_number" : 4,
          "balance" : 27658,
          "firstname" : "Rodriquez",
          "lastname" : "Flores",
          "age" : 31,
          "gender" : "F",
          "address" : "986 Wyckoff Avenue",
          "employer" : "Tourmania",
          "email" : "[email protected]",
          "city" : "Eastvale",
          "state" : "HI"
        },
        "sort" : [
          4
        ]
      }
    ]
  }
}



The returned result provides the following information:

  • took is the time Elasticsearch spent on the query, in milliseconds
  • timed_out indicates whether the search timed out
  • _shards reports how many shards were searched, and how many of them succeeded, failed, or were skipped. A shard can be understood simply as a slice of data: the documents in an index are divided into several pieces, much like a table partitioned by id.
  • max_score is the score of the most relevant document
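A small sketch of reading these fields from a response that has already been parsed into a dict. The summarize helper is hypothetical, and the sample is a trimmed copy of the response above:

```python
def summarize(response):
    """Pull the commonly inspected fields out of a _search response."""
    hits = response["hits"]
    return {
        "took_ms": response["took"],
        "complete": not response["timed_out"]
        and response["_shards"]["failed"] == 0,
        "total": hits["total"]["value"],
        "ids": [hit["_id"] for hit in hits["hits"]],
    }


sample = {
    "took": 1,
    "timed_out": False,
    "_shards": {"total": 1, "successful": 1, "skipped": 0, "failed": 0},
    "hits": {
        "total": {"value": 1000, "relation": "eq"},
        "max_score": None,
        "hits": [{"_id": "2"}, {"_id": "3"}, {"_id": "4"}],
    },
}
summary = summarize(sample)
print(summary)
```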

Next, you can try a conditional query.

Word search

Find the documents whose address contains mill or lane.

GET /bank/_search
{
  "query": { "match": { "address": "mill lane" } },
  "size": 2
}

Response:

{
  "took" : 3,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 19,
      "relation" : "eq"
    },
    "max_score" : 9.507477,
    "hits" : [
      {
        "_index" : "bank",
        "_type" : "_doc",
        "_id" : "136",
        "_score" : 9.507477,
        "_source" : {
          "account_number" : 136,
          "balance" : 45801,
          "firstname" : "Winnie",
          "lastname" : "Holland",
          "age" : 38,
          "gender" : "M",
          "address" : "198 Mill Lane",
          "employer" : "Neteria",
          "email" : "[email protected]",
          "city" : "Urie",
          "state" : "IL"
        }
      },
      {
        "_index" : "bank",
        "_type" : "_doc",
        "_id" : "970",
        "_score" : 5.4032025,
        "_source" : {
          "account_number" : 970,
          "balance" : 19648,
          "firstname" : "Forbes",
          "lastname" : "Wallace",
          "age" : 28,
          "gender" : "M",
          "address" : "990 Mill Road",
          "employer" : "Pheast",
          "email" : "[email protected]",
          "city" : "Lopezo",
          "state" : "AK"
        }
      }
    ]
  }
}

  • size limits the response to 2 hits, but 19 documents actually matched

Exact match query

GET /bank/_search
{
  "query": { "match_phrase": { "address": "mill lane" } }
}

This time only one document matches the whole phrase.

{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 9.507477,
    "hits" : [
      {
        "_index" : "bank",
        "_type" : "_doc",
        "_id" : "136",
        "_score" : 9.507477,
        "_source" : {
          "account_number" : 136,
          "balance" : 45801,
          "firstname" : "Winnie",
          "lastname" : "Holland",
          "age" : 38,
          "gender" : "M",
          "address" : "198 Mill Lane",
          "employer" : "Neteria",
          "email" : "[email protected]",
          "city" : "Urie",
          "state" : "IL"
        }
      }
    ]
  }
}
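The difference between match and match_phrase can be modelled in a few lines. This is a deliberately simplified sketch: analyze only lowercases and splits on whitespace as a rough stand-in for the standard analyzer, there is no relevance scoring, and all function names are hypothetical:

```python
def analyze(text):
    """Rough stand-in for the standard analyzer: lowercase, split on whitespace."""
    return text.lower().split()


def match(field_value, query):
    """match: the document qualifies if it contains ANY query term."""
    tokens = set(analyze(field_value))
    return any(term in tokens for term in analyze(query))


def match_phrase(field_value, query):
    """match_phrase: all query terms must appear adjacent and in order."""
    tokens, terms = analyze(field_value), analyze(query)
    return any(
        tokens[i:i + len(terms)] == terms
        for i in range(len(tokens) - len(terms) + 1)
    )


addresses = ["198 Mill Lane", "990 Mill Road", "880 Holmes Lane"]
loose = [a for a in addresses if match(a, "mill lane")]
exact = [a for a in addresses if match_phrase(a, "mill lane")]
print(loose)
print(exact)
```

In this model all three addresses satisfy match, since each contains mill or lane, while only "198 Mill Lane" satisfies match_phrase.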

Multi-condition query

In real queries, several conditions are usually combined:

GET /bank/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "age": "40" } }
      ],
      "must_not": [
        { "match": { "state": "ID" } }
      ]
    }
  }
}

  • bool is used to combine multiple query conditions
  • must, should, and must_not are the clauses of a boolean query. must and should contribute to the relevance score, and results are sorted by score by default.
  • must_not acts as a filter: it decides which documents are excluded from the results but does not affect the score.
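The inclusion logic of must and must_not can be sketched on in-memory records. This models only which documents are kept, not scoring, and it treats each clause as an exact field comparison; the helper bool_filter and the three sample records are hypothetical:

```python
def bool_filter(docs, must=None, must_not=None):
    """Keep docs where every `must` field matches and no `must_not` field does."""
    must, must_not = must or {}, must_not or {}
    return [
        doc for doc in docs
        if all(doc.get(field) == value for field, value in must.items())
        and not any(doc.get(field) == value for field, value in must_not.items())
    ]


people = [
    {"name": "a", "age": 40, "state": "ID"},
    {"name": "b", "age": 40, "state": "TX"},
    {"name": "c", "age": 31, "state": "TX"},
]
kept = bool_filter(people, must={"age": 40}, must_not={"state": "ID"})
print([doc["name"] for doc in kept])
```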

You can also specify any filters explicitly to include or exclude documents based on structured data.

For example, query the balance between 20,000 and 30,000.

GET /bank/_search
{
  "query": {
    "bool": {
      "must": { "match_all": {} },
      "filter": {
        "range": {
          "balance": {
            "gte": 20000,
            "lte": 30000
          }
        }
      }
    }
  }
}

Aggregate operation group by

Count by state

In SQL this might be written as:

select state AS group_by_state, count(*) from tbl_bank group by state order by count(*) desc limit 3;

The corresponding Elasticsearch request is:


GET /bank/_search
{
  "size": 0,
  "aggs": {
    "group_by_state": {
      "terms": {
        "field": "state.keyword",
        "size": 3
      }
    }
  }
}

  • size = 0 suppresses the hits, because Elasticsearch would otherwise return the matching records and we only want the aggregated values
  • aggs is the keyword that introduces aggregations
  • group_by_state is the name of this aggregation; the name is user-defined
  • terms is the aggregation type, which buckets documents by the exact value of a field; here it specifies the field to group by
  • state.keyword: state is a text field, and counting or grouping on a string field requires the keyword type, so the state.keyword sub-field is used
  • size = 3 limits the number of buckets returned, here the top 3. The default is the top 10. The system-wide maximum number of buckets is 10000, which can be changed through the search.max_buckets setting. Note that with multiple shards the counts can be inaccurate; this is examined further below.
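Conceptually, a terms aggregation on a single shard is a count per distinct value followed by a top-N cut. A minimal in-memory sketch, with a hypothetical helper top_states and made-up sample states:

```python
from collections import Counter


def top_states(docs, size=3):
    """Count documents per state and keep the `size` largest buckets."""
    counts = Counter(doc["state"] for doc in docs)
    return [
        {"key": state, "doc_count": count}
        for state, count in counts.most_common(size)
    ]


docs = [{"state": s} for s in ["TX", "MD", "TX", "ID", "MD", "TX", "HI"]]
buckets = top_states(docs, size=2)
print(buckets)
```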

Response:

{
  "took" : 5,
  "timed_out" : false,
  "_shards" : {
    "total" : 3,
    "successful" : 3,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1000,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "group_by_state" : {
      "doc_count_error_upper_bound" : 26,
      "sum_other_doc_count" : 928,
      "buckets" : [
        {
          "key" : "MD",
          "doc_count" : 28
        },
        {
          "key" : "ID",
          "doc_count" : 23
        },
        {
          "key" : "TX",
          "doc_count" : 21
        }
      ]
    }
  }
}


  • hits holds the records that matched the query; because size = 0 is set, it is []. total shows that this query matched 1000 records
  • aggregations holds the aggregation results
  • group_by_state is the name we gave the aggregation in the query
  • doc_count_error_upper_bound is an upper bound on the document count of any term that was left out of the final result because individual shards did not return it. The larger the value, the more likely the final counts are inaccurate. A value of 0 means the counts are exact; a non-zero value does not necessarily mean the aggregated data is wrong.
  • sum_other_doc_count is the total number of documents in buckets that are not included in the response

It is worth asking whether this top 3 is accurate. doc_count_error_upper_bound is non-zero, so the counts may well be inaccurate; the top 3 obtained has counts 28, 23, and 21. Let's add another query parameter and compare the results:

GET /bank/_search
{
  "size": 0,
  "aggs": {
    "group_by_state": {
      "terms": {
        "field": "state.keyword",
        "size": 3,
        "shard_size":  60
      }
    }
  }
}
-----------------------------------------
  "aggregations" : {
    "group_by_state" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 915,
      "buckets" : [
        {
          "key" : "TX",
          "doc_count" : 30
        },
        {
          "key" : "MD",
          "doc_count" : 28
        },
        {
          "key" : "ID",
          "doc_count" : 27
        }
      ]
    }
  }

  • shard_size is the number of terms each shard computes and returns. A terms aggregation first computes a result on each shard and then merges them into the final result. Because data is distributed unevenly across shards, the top N of each shard differs, so a term can be missing from some shards' lists and its final count can come out too low. That is why doc_count_error_upper_bound is not 0. The default shard_size is size * 1.5 + 10, which for size = 3 gives 14.5; I verified that setting shard_size = 14.5 explicitly returns the same result as not passing it. With shard_size set to 60 the error is finally 0, which guarantees that these 3 really are the top 3. In other words, an aggregation should set shard_size generously, for example 20 times size, at the cost of more work per shard.
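The effect described above can be reproduced with a toy model. The sketch below is a simplification under stated assumptions: each shard is a Counter of term counts, the coordinator merges each shard's local top shard_size terms, and the error bound is the sum of the last returned count on every shard that cut terms off. The function names and the two-shard data are made up, and truncating the default to an integer is an assumption:

```python
from collections import Counter


def default_shard_size(size):
    # Formula quoted in the text; truncating to an integer is an assumption.
    return int(size * 1.5 + 10)


def terms_agg(shards, size, shard_size):
    """Merge each shard's local top `shard_size` terms, then keep the top `size`."""
    merged = Counter()
    error_bound = 0
    for shard in shards:
        local_top = shard.most_common(shard_size)
        merged.update(dict(local_top))
        if len(shard) > shard_size:
            # A term cut off on this shard has at most the last returned count.
            error_bound += local_top[-1][1]
    return merged.most_common(size), error_bound


shards = [
    Counter({"MD": 10, "TX": 9, "ID": 1}),
    Counter({"ID": 8, "TX": 7, "MD": 2}),
]
approx, approx_error = terms_agg(shards, size=1, shard_size=1)
exact, exact_error = terms_agg(shards, size=1, shard_size=3)
print(approx, approx_error)
print(exact, exact_error)
```

With shard_size = 1 the model reports MD with a count of 10 and an error bound of 18, although TX is the true leader with 16 documents: TX was second on both shards and was never sent to the coordinator. With shard_size = 3 every term is returned, the error bound drops to 0, and TX comes first.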

Count accounts by state and calculate the average balance

We want the average balance for each state. In SQL this might be:

select 
  state, avg(balance) AS average_balance, count(*) AS group_by_state 
from tbl_bank
group by state
limit 3

You can query in es like this:

GET /bank/_search
{
  "size": 0,
  "aggs": {
    "group_by_state": {
      "terms": {
        "field": "state.keyword",
        "size": 3,
        "shard_size":  60
      },
      "aggs": {
        "average_balance": {
          "avg": {
            "field": "balance"
          }
        },
        "sum_balance": {
          "sum": {
            "field": "balance"
          }
        }
      }
    }
  }
}

  • The second, nested aggs computes metrics within each state bucket
  • average_balance is a user-defined name; its value is the avg of balance over the documents of the same state
  • sum_balance is a user-defined name; its value is the sum of balance over the documents of the same state
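A nested metric aggregation amounts to computing a value per bucket. A minimal in-memory sketch, using three balances taken from the search results earlier in these notes; the helper name balance_stats_by_state is hypothetical:

```python
from collections import defaultdict


def balance_stats_by_state(docs):
    """Group balances by state, then compute count, sum and average per group."""
    balances = defaultdict(list)
    for doc in docs:
        balances[doc["state"]].append(doc["balance"])
    return {
        state: {
            "doc_count": len(values),
            "sum_balance": sum(values),
            "average_balance": sum(values) / len(values),
        }
        for state, values in balances.items()
    }


docs = [
    {"state": "HI", "balance": 44947},
    {"state": "HI", "balance": 27658},
    {"state": "LA", "balance": 28838},
]
stats = balance_stats_by_state(docs)
print(stats["HI"])
```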

The result is as follows:

{
  "took" : 12,
  "timed_out" : false,
  "_shards" : {
    "total" : 3,
    "successful" : 3,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1000,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "group_by_state" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 915,
      "buckets" : [
        {
          "key" : "TX",
          "doc_count" : 30,
          "sum_balance" : {
            "value" : 782199.0
          },
          "average_balance" : {
            "value" : 26073.3
          }
        },
        {
          "key" : "MD",
          "doc_count" : 28,
          "sum_balance" : {
            "value" : 732523.0
          },
          "average_balance" : {
            "value" : 26161.535714285714
          }
        },
        {
          "key" : "ID",
          "doc_count" : 27,
          "sum_balance" : {
            "value" : 657957.0
          },
          "average_balance" : {
            "value" : 24368.777777777777
          }
        }
      ]
    }
  }
}

Count by state and sort by average balance

By default a terms aggregation orders its buckets by count, descending. If we want a different order, the SQL might look like this:

select 
  state, avg(balance) AS average_balance, count(*) AS group_by_state 
from tbl_bank
group by state
order by average_balance desc
limit 3

The corresponding es can be queried like this:

GET /bank/_search
{
  "size": 0,
  "aggs": {
    "group_by_state": {
      "terms": {
        "field": "state.keyword",
        "order": {
          "average_balance": "desc"
        },
        "size": 3
      },
      "aggs": {
        "average_balance": {
          "avg": {
            "field": "balance"
          }
        }
      }
    }
  }
}

The top 3 returned is no longer the same as before:

  "aggregations" : {
    "group_by_state" : {
      "doc_count_error_upper_bound" : -1,
      "sum_other_doc_count" : 983,
      "buckets" : [
        {
          "key" : "DE",
          "doc_count" : 2,
          "average_balance" : {
            "value" : 39040.5
          }
        },
        {
          "key" : "RI",
          "doc_count" : 5,
          "average_balance" : {
            "value" : 36035.4
          }
        },
        {
          "key" : "NE",
          "doc_count" : 10,
          "average_balance" : {
            "value" : 35648.8
          }
        }
      ]
    }
  }
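Ordering buckets by a sub-aggregation can be sketched as a sort on the metric value followed by a cut. The helper order_buckets is hypothetical; the four buckets reuse averages from the responses above:

```python
def order_buckets(buckets, metric, direction="desc", size=3):
    """Sort buckets by a named metric value and keep the first `size`."""
    ordered = sorted(
        buckets,
        key=lambda bucket: bucket[metric]["value"],
        reverse=(direction == "desc"),
    )
    return ordered[:size]


buckets = [
    {"key": "TX", "doc_count": 30, "average_balance": {"value": 26073.3}},
    {"key": "DE", "doc_count": 2, "average_balance": {"value": 39040.5}},
    {"key": "NE", "doc_count": 10, "average_balance": {"value": 35648.8}},
    {"key": "RI", "doc_count": 5, "average_balance": {"value": 36035.4}},
]
top = order_buckets(buckets, "average_balance")
print([bucket["key"] for bucket in top])
```

States with few documents, such as DE with only 2, can rise to the top because a small group's average is easily dominated by one large balance.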

Origin www.cnblogs.com/woshimrf/p/es7-start.html