Elasticsearch Production Index Management in Depth: Hands-On with an Online Search System

This technical column (by Qin Kaixin) focuses on demystifying core big data and cloud-native container technologies. Drawing on 5 years of experience building big data cloud platforms for industrial IoT, the author offers full-stack big data + cloud-native platform consulting. Please follow this blog series. QQ email: [email protected]; feel free to get in touch for academic exchange.

1. Index Management

1.1 Creating an index

  • When creating an index you can pass settings (for example shard and replica counts) and initialize the mappings for one or more types:

      curl -XPUT 'http://elasticsearch02:9200/twitter?pretty' -d '
      {
          "settings" : {
              "index" : {
                  "number_of_shards" : 3, 
                  "number_of_replicas" : 2 
              }
          },
          "mappings" : {
              "type1" : {
                  "properties" : {
                      "field1" : { "type" : "text" }
                  }
              }
          }
      }'

1.2 Interpreting the index-creation response

  • By default, the create-index command returns a response like the one below once every primary shard's replicas have begun replicating, or once the request times out.

  • acknowledged indicates whether the index was created successfully, and shards_acknowledged indicates whether the required number of shard copies started up for each primary shard. Both parameters can be false while the index is still created successfully: they only report whether those two actions completed before the request timed out. The request may time out with neither having succeeded yet, while the ES server still carries them out after the timeout.

  • If acknowledged is false, the request timed out: when the response was received, the cluster state had not yet been updated with the newly created index, but the index may still be created afterwards. If shards_acknowledged is false, the replication of the primary shards timed out, but by that point the index may already have been created successfully and added to the cluster state.

      {
          "acknowledged": true,
          "shards_acknowledged": true
      }
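  • If you want the create call to wait for more shard copies before returning, you can pass wait_for_active_shards and a timeout. A minimal sketch (the index name twitter2 is illustrative; wait_for_active_shards is available on index creation in ES 5.x+):

      # Wait until 2 copies of every shard (the primary + 1 replica) are
      # active, giving up after 30 seconds. On timeout the index may still
      # be created later; shards_acknowledged will come back false.
      curl -XPUT 'http://elasticsearch02:9200/twitter2?wait_for_active_shards=2&timeout=30s&pretty' -d '
      {
          "settings" : {
              "index" : {
                  "number_of_shards" : 3,
                  "number_of_replicas" : 2
              }
          }
      }'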

1.3 Deleting an index

Delete an index:

curl -XDELETE 'http://elasticsearch02:9200/twitter?pretty'

1.4 Querying index information

curl -XGET 'http://elasticsearch02:9200/twitter?pretty'

1.5 Opening / closing an index

  • A closed index carries almost no overhead: only its metadata is retained, and read and write operations against it will fail. A closed index can later be reopened, at which point it goes through the shard recovery process.

  • A typical use is maintenance: to change certain settings on an index, close it, apply the configuration (writes will fail while it is closed), then reopen it:

      curl -XPOST 'http://elasticsearch02:9200/twitter/_close?pretty'
      curl -XPOST 'http://elasticsearch02:9200/twitter/_open?pretty'
      
      curl -XPUT 'http://elasticsearch02:9200/twitter/type1/1?pretty' -d '
      {
      	"field1": "1"
      }'

1.6 Shrinking an index

  • The shrink command compresses an existing index into a new index with fewer primary shards.

  • The number of primary shards cannot be changed after index creation, because document routing hashes into a fixed number of shards. If you need fewer primary shards, use the shrink command; the source shard count must be divisible by the target count. For example, an index with 8 primary shards can be shrunk to 4, 2, or 1.

  • For example, an index sized to retain 7 days of data was given 10 shards, but the requirement changes to retaining only 3 days. With far less data, 10 shards are unnecessary, so you can shrink the index down to 5 shards.

  • The shrink command works as follows:

      (1) First, it creates a target index with the same definition as the
          source index, except that the number of primary shards is the
          specified smaller number.
      (2) Then it hard-links the source index's segment files into the
          target index. If the operating system does not support hard
          links, the segment files are copied into the target index's
          data dir instead, which is much slower; hard links are fast.
      (3) Finally, the target index goes through shard recovery.
  • Before an index can be shrunk, it must be marked read-only, and a copy of every one of its shards, primary or replica, must be relocated onto a single node.

  • By default an index's shards may sit on different machines: with 5 shards, shard0 and shard1 might be on machine 1, shard2 and shard3 on machine 2, and shard4 on machine 3. To shrink, one copy of each of shard0 through shard4 must be moved onto the same machine; a replica copy is fine, but every shard must be represented. The command below does this; index.routing.allocation.require._name must be the name of one of your nodes, which you choose yourself.

      curl -XPUT 'http://elasticsearch02:9200/twitter/_settings?pretty' -d '
      {
        "settings": {
          "index.routing.allocation.require._name": "node-elasticsearch-02", 
          "index.blocks.write": true 
        }
      }'
  • Copying one copy of each of the source index's shards to the designated node takes some time; you can track progress with the GET _cat/recovery?v command.
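  • For example (same host as above; recovery of a shard copy is finished when its stage column reads done):

      curl -XGET 'http://elasticsearch02:9200/_cat/recovery?v'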

  • Once the shard copies have been relocated, you can shrink the index with the following command:

      POST my_source_index/_shrink/my_target_index
  • This command returns as soon as the target index has been added to the cluster state; it does not wait for the shrink process to complete. You can also pass settings in the shrink request to configure the target index, for example its number of primary shards:

      curl -XPOST 'http://elasticsearch02:9200/twitter/_shrink/twitter_shrinked?pretty' -d '
      {
        "settings": {
          "index.number_of_replicas": 1,
          "index.number_of_shards": 1, 
          "index.codec": "best_compression" 
        }
      }'
  • To monitor the whole shrink process, use GET _cat/recovery?v.
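  • After the shrink completes, the write block and the allocation requirement set earlier are carried over onto the target index, so it is worth clearing them; otherwise the new index stays read-only. A minimal sketch, assuming the twitter_shrinked target from the example above (setting a value to null resets it):

      curl -XPUT 'http://elasticsearch02:9200/twitter_shrinked/_settings?pretty' -d '
      {
        "settings": {
          "index.routing.allocation.require._name": null,
          "index.blocks.write": null
        }
      }'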

1.7 Rolling over to a new index

  • The rollover command re-points an alias to a newly created index when the existing index is deemed too large or too old. The command takes an alias name and a set of conditions.

  • If the conditions are met, a new index is created and the alias switches to it. For example, suppose index logs-000001 carries the alias logs_write. When a rollover command is issued, if the index the alias currently points to (logs-000001) was created more than 7 days ago or contains more than 1000 documents, a new index logs-000002 is created and logs_write is re-pointed to it.

  • This is genuinely useful for streams such as user access logs or online transaction-system logs. You can write a shell script that runs the rollover command at 00:00 every day: if the current index is more than a day old, a new index is created and the alias moves to it. This way a new index is rolled automatically on schedule, scoping each index to one hour, one day, three days, seven days, a month, and so on (see the cron sketch after the example below).

  • Likewise, if ES backs the log platform of a distributed e-commerce site, the order system's logs might get their own index retaining only the last 3 days, while the trading system's logs get a separate index retaining the last 30 days.

      curl -XPUT 'http://elasticsearch02:9200/logs-000001?pretty' -d ' 
      {
        "aliases": {
          "logs_write": {}
        }
      }'
      
      # Add 3 documents to logs-000001 (enough to satisfy max_docs below)
      
      curl -XPUT 'http://elasticsearch02:9200/logs-000001/data/1?pretty' -d '
      {
      	"userid": 1,
      	"page": 1
      }'
      
      curl -XPUT 'http://elasticsearch02:9200/logs-000001/data/2?pretty' -d '
      {
      	"userid": 2,
      	"page": 2
      }'
      
      curl -XPUT 'http://elasticsearch02:9200/logs-000001/data/3?pretty' -d '
      {
      	"userid": 3,
      	"page": 3
      }'
      
      curl -XPOST 'http://elasticsearch02:9200/logs_write/_rollover?pretty' -d ' 
      {
        "conditions": {
          "max_age":   "1d",
          "max_docs":  3
        }
      }'
      
      {
        "acknowledged": true,
        "shards_acknowledged": true,
        "old_index": "logs-000001",
        "new_index": "logs-000002",
        "rolled_over": true, 
        "dry_run": false, 
        "conditions": { 
          "[max_age: 7d]": false,
          "[max_docs: 1000]": true
        }
      }
  • This pattern is common for site user-behavior log data: a scheduled script performs the rollover (say, daily), continually creating new indices, while external users always go through the alias, which points at the index holding the latest data.

  • As a simple example of doing near-real-time user-behavior analytics with ES: if the requirement is that the current index hold only today's data, this rollover strategy guarantees each index contains just the latest day. Older data lands in previous indices, and a shell script can delete indices that fall outside the retention window, so ES keeps only recent data. You might, say, retain the last 7 days overall while keeping the newest day in a single index for analysis queries.
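  • A minimal cron-driven sketch of that idea; the host, alias, conditions, and retention window are assumptions carried over from the examples here, and the deletion step assumes the date-math index naming shown in the next example:

      #!/bin/bash
      # rollover.sh -- run from cron at 00:00 every day, e.g.:
      #   0 0 * * * /opt/scripts/rollover.sh
      ES='http://elasticsearch02:9200'

      # Roll logs_write over to a fresh index if the current one is more
      # than a day old or already holds too many documents.
      curl -s -XPOST "$ES/logs_write/_rollover?pretty" -d '
      {
        "conditions": { "max_age": "1d", "max_docs": 1000 }
      }'

      # Drop the daily index that has just fallen out of the 7-day
      # retention window (GNU date syntax; adjust on BSD/macOS).
      old_index="logs-$(date -d '8 days ago' +%Y.%m.%d)-*"
      curl -s -XDELETE "$ES/$old_index?pretty"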

  • By default, if the existing index name ends with a dash followed by a number, such as logs-000001, the new index name increments that number, zero-padded to six digits, e.g. logs-000002. But you can also name the new index yourself:

      POST /my_alias/_rollover/my_new_index_name
      {
        "conditions": {
          "max_age":   "7d",
          "max_docs":  1000
        }
      }
  • The rollover command can be combined with date math in the index name, as in the example below, which creates indices named like logs-2016.10.31-1. On each successful rollover within the same day, the trailing number is incremented; if the rollover happens on a later day, the date part updates automatically while the trailing number is kept.

      # URL-encoded form of <logs-{now/d}-1>
      PUT /%3Clogs-%7Bnow%2Fd%7D-1%3E
      {
        "aliases": {
          "logs_write": {}
        }
      }
      
      PUT logs_write/log/1
      {
        "message": "a dummy log"
      }
      
      POST logs_write/_refresh
      
      # Wait for a day to pass
      
      POST /logs_write/_rollover 
      {
        "conditions": {
          "max_docs":   "1"
        }
      }
  • You can also apply new settings to the new index at rollover time:

      POST /logs_write/_rollover
      {
        "conditions" : {
          "max_age": "7d",
          "max_docs": 1000
        },
        "settings": {
          "index.number_of_shards": 2
        }
      }

1.8 Mapping management

  • The put mapping command lets us add a new type to an existing index, or modify an existing type, for example by adding fields to it.

  • The following command creates a type at the same time as the index:

      curl -XPUT 'http://elasticsearch02:9200/twitter?pretty' -d ' 
      {
        "mappings": {
          "tweet": {
            "properties": {
              "message": {
                "type": "text"
              }
            }
          }
        }
      }'
  • The following command adds a type to an existing index:

      curl -XPUT 'http://elasticsearch02:9200/twitter/_mapping/user?pretty' -d ' 
      {
        "properties": {
          "name": {
            "type": "text"
          }
        }
      }'
  • The following command adds a field to an existing type:

      curl -XPUT 'http://elasticsearch02:9200/twitter/_mapping/tweet?pretty' -d '
      {
        "properties": {
          "user_name": {
            "type": "text"
          }
        }
      }'
    
      # View the mapping of a type:
      curl -XGET 'http://elasticsearch02:9200/twitter/_mapping/tweet?pretty'

      # View the mapping of a single field of a type:
      curl -XGET 'http://elasticsearch02:9200/twitter/_mapping/tweet/field/message?pretty'

1.9 Managing index aliases

    curl -XPOST 'http://elasticsearch02:9200/_aliases?pretty' -d '
    {
        "actions" : [
            { "add" : { "index" : "twitter", "alias" : "twitter_prod" } }
        ]
    }'
    
    curl -XPOST 'http://elasticsearch02:9200/_aliases?pretty' -d '
    {
        "actions" : [
            { "remove" : { "index" : "twitter", "alias" : "twitter_prod" } }
        ]
    }'
    
    POST /_aliases
    {
        "actions" : [
            { "remove" : { "index" : "test1", "alias" : "alias1" } },
            { "add" : { "index" : "test2", "alias" : "alias1" } }
        ]
    }
    
    POST /_aliases
    {
        "actions" : [
            { "add" : { "indices" : ["test1", "test2"], "alias" : "alias1" } }
        ]
    }
  • Index aliases are quite useful. The key idea is that a single alias can be mounted over multiple underlying indices, for example the last 7 days of data.

  • Index aliases are often combined with the rollover pattern above: for performance and manageability we roll a fresh index out daily, but for analysis we might keep an alias access-log pointing at the index with the most recent day's data for near-real-time computation, and an alias access-log-7days pointing at the seven daily indices so we can run weekly statistics.
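  • A sketch of that layout, assuming daily indices named with date math as in section 1.7 (all index names are illustrative):

      # Keep access-log on today's index only, and mount the last seven
      # daily indices under access-log-7days.
      curl -XPOST 'http://elasticsearch02:9200/_aliases?pretty' -d '
      {
          "actions" : [
              { "remove" : { "index" : "access-log-2016.10.30-1", "alias" : "access-log" } },
              { "add" : { "index" : "access-log-2016.10.31-1", "alias" : "access-log" } },
              { "add" : { "indices" : [ "access-log-2016.10.25-1", "access-log-2016.10.26-1",
                                        "access-log-2016.10.27-1", "access-log-2016.10.28-1",
                                        "access-log-2016.10.29-1", "access-log-2016.10.30-1",
                                        "access-log-2016.10.31-1" ],
                          "alias" : "access-log-7days" } }
          ]
      }'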

1.10 Index settings management

  • You will often need to adjust index settings, frequently in combination with the open/close operations described earlier:

      curl -XPUT 'http://elasticsearch02:9200/twitter/_settings?pretty' -d '
      {
          "index" : {
              "number_of_replicas" : 1
          }
      }'
      
      curl -XGET 'http://elasticsearch02:9200/twitter/_settings?pretty'
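  • Some settings, such as analyzers, can only be changed while the index is closed, which is exactly where the open/close commands from section 1.5 come in. A minimal sketch against the twitter index from the examples above:

      curl -XPOST 'http://elasticsearch02:9200/twitter/_close?pretty'

      # analysis settings are static, so the index must be closed first
      curl -XPUT 'http://elasticsearch02:9200/twitter/_settings?pretty' -d '
      {
        "analysis" : {
          "analyzer" : {
            "content" : {
              "type" : "custom",
              "tokenizer" : "whitespace"
            }
          }
        }
      }'

      curl -XPOST 'http://elasticsearch02:9200/twitter/_open?pretty'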

1.11 Index template management

  • You can define index templates that are applied automatically to newly created indices. A template can contain settings and mappings, plus a pattern that determines which indices it applies to. Templates only take effect at index-creation time; modifying a template does not affect existing indices.

      curl -XPUT 'http://elasticsearch02:9200/_template/template_access_log?pretty' -d '
      {
        "template": "access-log-*",
        "settings": {
          "number_of_shards": 2
        },
        "mappings": {
          "log": {
            "_source": {
              "enabled": false
            },
            "properties": {
              "host_name": {
                "type": "keyword"
              },
              "created_at": {
                "type": "date",
                "format": "EEE MMM dd HH:mm:ss Z YYYY"
              }
            }
          }
        },
        "aliases" : {
            "access-log" : {}
        }
      }'
      
      curl -XDELETE 'http://elasticsearch02:9200/_template/template_access_log?pretty'
      curl -XGET 'http://elasticsearch02:9200/_template/template_access_log?pretty'
      curl -XPUT 'http://elasticsearch02:9200/access-log-01?pretty'
      curl -XGET 'http://elasticsearch02:9200/access-log-01?pretty'
  • A typical index template use case: you regularly create similar indices, say one index per product category, each holding a lot of data but all configured roughly the same. Define a single product-index template, and every new product index automatically picks up the template's settings.

2 Index statistics

  • The index stats API provides statistics for the different kinds of operations happening on an index. It reports at the index level, though most of the statistics are also available at the node level. It covers doc counts, index size, and the underlying machinery: segments, merges, flushes, refreshes, the translog, memory usage, and so on.

      curl -XGET 'http://elasticsearch02:9200/twitter/_stats?pretty'

2.1 Segment statistics

  • This shows low-level Lucene segment information for an index's shards, useful for inspecting the index in detail, including optimization-related information such as the space wasted by deleted documents.

      curl -XGET 'http://elasticsearch02:9200/twitter/_segments?pretty'
      
      {
          ...
              "_3": {
                  "generation": 3,
                  "num_docs": 1121,
                  "deleted_docs": 53,
                  "size_in_bytes": 228288,
                  "memory_in_bytes": 3211,
                  "committed": true,
                  "search": true,
                  "version": "4.6",
                  "compound": true
              }
          ...
      }
  • _3: the segment's name, which is also the prefix of the segment's file names; all files belonging to this segment start with this name

  • generation: incremented each time a new segment is generated; the segment name is derived from this value

  • num_docs: the number of non-deleted documents stored in this segment

  • deleted_docs: the number of deleted documents stored in this segment; this value rarely matters, because these documents are purged whenever segments are merged

  • size_in_bytes: the disk space occupied by this segment

  • memory_in_bytes: segments cache some data in memory to make searches faster; this value is the size of that in-memory footprint

  • committed: whether the segment has been synced to disk. A committed/synced segment's data cannot be lost, but even false is acceptable, because the data is also held in the translog, which ES replays on restart to recover the data

  • search: whether this segment is searchable. If false, the segment has most likely been synced to disk but not yet refreshed, so it cannot be searched yet

  • version: the Lucene version number

  • compound: if true, Lucene merged all of this segment's files into a single compound file to save file descriptors

2.2 Shard store information

  • This queries the store information of an index's shard copies: which nodes hold copies, each copy's allocation id and unique identifier, and any errors encountered while opening the index store. By default it only lists shards that have at least one unassigned copy: with yellow cluster health, shards with an unassigned replica; with red health, shards with an unassigned primary. Passing status=green shows information for every shard.

      curl -XGET 'http://elasticsearch02:9200/twitter/_shard_stores?pretty'
      curl -XGET 'http://elasticsearch02:9200/twitter/_shard_stores?status=green&pretty'
    
      {
          ...
         "0": { 
              "stores": [ 
                  {
                      "sPa3OgxLSYGvQ4oPs-Tajw": { 
                          "name": "node_t0",
                          "transport_address": "local[1]",
                          "attributes": {
                              "mode": "local"
                          }
                      },
                      "allocation_id": "2iNySv_OQVePRX-yaRH_lQ", 
                      "legacy_version": 42, 
                      "allocation" : "primary" | "replica" | "unused", 
                      "store_exception": ... 
                  },
                  ...
              ]
         },
          ...
      }
  • 0: the shard id

  • stores: the store information for each copy of the shard

  • sPa3OgxLSYGvQ4oPs-Tajw: the id of the node holding this copy

  • allocation_id: the allocation id of this copy

  • allocation: the role of this shard copy (primary, replica, or unused)

2.3 Clearing caches

This command clears all of the index's caches:

curl -XPOST 'http://elasticsearch02:9200/twitter/_cache/clear?pretty'
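You can also clear individual caches instead of everything; a sketch using the query-string flags available in ES 5.x:

curl -XPOST 'http://elasticsearch02:9200/twitter/_cache/clear?query=true&pretty'      # query cache only
curl -XPOST 'http://elasticsearch02:9200/twitter/_cache/clear?fielddata=true&pretty'  # field data cache only
curl -XPOST 'http://elasticsearch02:9200/twitter/_cache/clear?request=true&pretty'    # request cache only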

2.4 Flush

  • The flush API forces a flush of one or more indices. Flushing frees the memory the index occupies by fsync'ing data from the OS cache to disk and clearing out the translog. By default, ES triggers flushes automatically from time to time to release memory promptly. POST twitter/_flush is enough.

  • The flush command accepts two parameters. wait_if_ongoing: if true, the flush API waits for any flush already in progress to finish before executing and returning; the default is false, in which case an error is returned if another flush is already running. force: forces a flush even if one is not strictly necessary.

      curl -XPOST 'http://elasticsearch02:9200/twitter/_flush?pretty'
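  • For example, to wait on a concurrent flush instead of getting an error, and to force a flush even when none is strictly needed (the two parameters described above):

      curl -XPOST 'http://elasticsearch02:9200/twitter/_flush?wait_if_ongoing=true&force=true&pretty'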

2.5 Refresh

  • The refresh API explicitly refreshes an index, making all operations performed before the refresh visible to search: POST twitter/_refresh

      curl -XPOST 'http://elasticsearch02:9200/twitter/_refresh?pretty'
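  • Individual write requests can also ask for a refresh so the document becomes searchable immediately; a sketch using the refresh parameter supported on index/update/delete requests in ES 5.x:

      # refresh=true refreshes the affected shards right away;
      # refresh=wait_for instead blocks until the next scheduled refresh.
      curl -XPUT 'http://elasticsearch02:9200/twitter/tweet/1?refresh=true&pretty' -d '
      {
        "message": "hello"
      }'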

2.6 Force merge

  • The force merge API force-merges the index files of one or more indices. A shard is a Lucene index made up of multiple segment files, and force merging combines them, reducing the segment count: POST /twitter/_forcemerge.

      curl -XPOST 'http://elasticsearch02:9200/twitter/_forcemerge?pretty'
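  • Force merge is best run on indices that no longer receive writes, such as yesterday's rolled-over log index; the max_num_segments parameter controls how far to merge:

      # Merge each shard of the index down to a single segment
      curl -XPOST 'http://elasticsearch02:9200/twitter/_forcemerge?max_num_segments=1&pretty'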

3 Summary

There is a great deal of work in a production deployment; this article walked through the main ideas of index management and how the pieces fit together.


Qin Kaixin


Origin juejin.im/post/5d3dbb57f265da1b6a34dd2c