Elasticsearch _reindex and Index Aliases

Background:

1. When your data volume has grown too large and the index was originally created with too few shards, ingestion slows down. You then need to increase the shard count, and Reindex is one way to do it.

2. When you need to change the mapping of a field but a large amount of data has already been imported into the index, re-ingesting everything into a new index from the source is too time-consuming. In ES, a field's mapping cannot be changed once data has been imported, so in this case you can also consider Reindex.

Reindex:
ES provides the _reindex API for this. It is much faster than re-importing the data from the source; in the author's measurements it was roughly 5-10 times the speed of a bulk import.

Data migration steps:
1. Create the new index (directly via the head plugin, or from a Java program).

Note: the index structure, i.e. the mapping, should be created along with the index, as sketched below.
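For example, a minimal sketch of step 1 (the settings and field names here are illustrative, not from the original data; on ES 6 and earlier the mappings block also needs a type level):

PUT new_index
{
  "settings": {
    "number_of_shards": 5,
    "number_of_replicas": 1
  },
  "mappings": {
    "properties": {
      "title":      { "type": "text" },
      "created_at": { "type": "date" }
    }
  }
}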

2. Copy the data.

The simplest, most basic form:

1) Request body:

POST _reindex
{
  "source": {
    "index": "old_index"
  },
  "dest": {
    "index": "new_index"
  }
}
2) Using curl:

curl -XPOST 'http://<es-host>:9200/_reindex' -H 'Content-Type: application/json' -d '{"source":{"index":"old_index"},"dest":{"index":"new_index"}}'

 

But if the new index already contains documents and conflicts are possible, you can set "version_type": "internal" (or leave it unset). Elasticsearch will then blindly dump documents into the destination, overwriting any document with the same type and ID:

POST _reindex
{
  "source": {
    "index": "old_index"
  },
  "dest": {
    "index": "new_index",
    "version_type": "internal"
  }
}
Setting op_type to create instead makes _reindex create only the documents missing from the target; every document that already exists in the target will trigger a version conflict:

POST _reindex
{
  "source": {
    "index": "twitter"
  },
  "dest": {
    "index": "new_twitter",
    "op_type": "create"
  }
}
By default a version conflict aborts the whole _reindex run, but you can set "conflicts": "proceed" in the request body to just count conflicts and continue:

POST _reindex
{
  "conflicts": "proceed",
  "source": {
    "index": "twitter"
  },
  "dest": {
    "index": "new_twitter",
    "op_type": "create"
  }
}
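With "conflicts": "proceed", conflicting documents are skipped and tallied in the version_conflicts counter of the response instead of aborting the run.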

Documents can also be limited by adding a type or a query to the source (a query sketch follows the next example). The options shown above can also be combined:

POST _reindex
{
  "conflicts": "proceed",
  "source": {
    "index": "old_ijndex"
  },
  "dest": {
    "index": "new_index",
    "version_type": "internal",
    "op_type": "create"
  }
}
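As mentioned above, a query in source restricts which documents get copied. A minimal sketch (the term field and value here are illustrative):

POST _reindex
{
  "source": {
    "index": "old_index",
    "query": {
      "term": {
        "user": "kimchy"
      }
    }
  },
  "dest": {
    "index": "new_index"
  }
}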
Adding a query (here reindexing from a remote cluster):

curl -u ${newClusterUser}:${newClusterPass} -XPOST "${newClusterHost}/_reindex?pretty" -H "Content-Type: application/json" -d '{
  "source": {
    "remote": {
      "host": "'${oldClusterHost}'",
      "username": "'${oldClusterUser}'",
      "password": "'${oldClusterPass}'"
    },
    "index": "'${indexName}'",
    "query": {
      "match_all": {}
    }
  },
  "dest": {
    "index": "'${indexName}'"
  }
}'
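Note that reindex-from-remote also requires the source host to be whitelisted on the destination cluster; a sketch of the setting (host and port are illustrative):

# elasticsearch.yml on the nodes of the destination cluster
reindex.remote.whitelist: "oldhost:9200"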
For a large data set with no deletions, you can migrate incrementally by update time, copying only the documents whose time field falls within a window:

curl -u ${newClusterUser}:${newClusterPass} -XPOST "${newClusterHost}/_reindex?pretty" -H "Content-Type: application/json" -d '{
  "source": {
    "remote": {
      "host": "'${oldClusterHost}'",
      "username": "'${oldClusterUser}'",
      "password": "'${oldClusterPass}'"
    },
    "index": "'${indexName}'",
    "query": {
      "range": {
        "'${timeField}'": {
          "gte": '${lastTimestamp}',
          "lt": '${curTimestamp}'
        }
      }
    }
  },
  "dest": {
    "index": "'${indexName}'"
  }
}'


 

Changing field contents during reindex

POST _reindex
{
  "source": {
    "index": "twitter"
  },
  "dest": {
    "index": "new_twitter",
    "version_type": "external"
  },
  "script": {
    "source": "if (ctx._source.foo == 'bar') {ctx._version++; ctx._source.remove('foo')}",
    "lang": "painless"
  }
}
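A related sketch in the same style: renaming a field during the copy (the field names are illustrative):

POST _reindex
{
  "source": {
    "index": "twitter"
  },
  "dest": {
    "index": "new_twitter"
  },
  "script": {
    "source": "ctx._source.new_name = ctx._source.remove('old_name')",
    "lang": "painless"
  }
}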
 

Data migration efficiency
Problem:

For routine migrations of small amounts of data, a plain reindex meets the need very well. But when the data to migrate is large, reindex becomes very slow.

With tens of GB of data on site, elasticsearch reindex is too slow when rebuilding the new index from the old one. What is the best approach?
Cause analysis:

At its core, reindex does cross-index (and cross-cluster) data migration.
The reasons it is slow, and the corresponding optimization ideas, come down to:
    1) The batch size may be too small. It needs to be tuned together with heap memory and thread pool size;
    2) Reindex is implemented on top of scroll underneath, so parallelizing the scroll (sliced scroll) improves efficiency;
    3) The heart of cross-index and cross-cluster migration is writing data, so write-side optimizations also improve efficiency.

Possible options:

1) Increase the write batch size

By default, _reindex operates in batches of 1000 documents; you can adjust this with the size parameter inside source.


POST _reindex
{
  "source": {
    "index": "source",
    "size": 5000
  },
  "dest": {
    "index": "dest",
    "routing": "=cat"
  }
}
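Here "routing": "=cat" forces the routing of every copied document to the value cat; the other routing options are keep (the default, preserving each document's original routing) and discard.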
Guidelines for choosing the batch size:

1. Use bulk requests to get the best indexing performance.

The right batch size depends on the data, the analysis chain, and the cluster configuration, but 5-15 MB per batch is a good starting point.

Note that this is physical size. Document count is not a good measure of batch size. For example, with 1000 documents per batch:

1) 1000 documents of 1 KB each is 1 MB.

2) 1000 documents of 100 KB each is 100 MB.

Those are completely different batch volumes.

2. Tune by increasing the batch size gradually.

1) Start around 5-15 MB per batch and increase slowly until you see no further performance gain. Then start raising the concurrency of the bulk writes (multi-threading and so on).

2) Use kibana, cerebro, or tools such as iostat, top, and ps to monitor the nodes and spot resource bottlenecks as they appear. If you start receiving EsRejectedExecutionException, your cluster can no longer keep up: at least one resource has hit its limit.

Either reduce the concurrency, provide more of the limited resource (for example, switch from spinning disks to SSDs), or add more nodes.
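One way to watch for such rejections is the thread pool cat API (a sketch; on versions before 6.3 the pool is named bulk rather than write):

GET _cat/thread_pool/write?v&h=name,active,queue,rejected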

2) Use sliced scroll to improve write efficiency

Reindex supports sliced scroll to parallelize the rebuild. This parallelism improves efficiency and provides a convenient way to break the request into smaller parts.

The principle of slicing (from medcl):

1) Finding the scroll API very slow? Traversing a large data set with a single scroll really is unbearable; the scroll API can now traverse the data concurrently.
2) Each scroll request can be divided into multiple slices, which you can think of as shards of the scroll. Each slice runs independently and in parallel, many times faster than traversing with a single scroll.

Applying slicing

Slices can be set in two ways: manually or automatically. With manual slicing, each request names its slice via an id and the total count max, and you issue one such request per slice id. A sketch of slice 0 of 2, reusing the index names above (a second request with "id": 1 covers the other half):
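POST _reindex
{
  "source": {
    "index": "twitter",
    "slice": {
      "id": 0,
      "max": 2
    }
  },
  "dest": {
    "index": "new_twitter"
  }
}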
Automatic slicing is set as follows:


POST _reindex?slices=5&refresh
{
  "source": {
    "index": "twitter"
  },
  "dest": {
    "index": "new_twitter"
  }
}
Notes on the slices setting:

1) slices can be set manually to a specific number, or to auto. auto means: for a single index, slices = the number of shards; for multiple indexes, slices = the smallest shard count among them.
2) Performance is best when slices equals the number of shards in the index. Setting slices higher than the shard count does not improve efficiency and only adds overhead.
3) If the computed slices number is large (e.g. 500), choose a lower value, since too many slices hurts performance.

In practice, a tuned reindex can be 10x+ faster than the default settings.
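For long-running jobs, a useful pattern (a sketch; slices=auto needs a reasonably recent ES version) is to run the reindex asynchronously and poll the tasks API for progress:

POST _reindex?slices=auto&wait_for_completion=false
{
  "source": {
    "index": "twitter"
  },
  "dest": {
    "index": "new_twitter"
  }
}

GET _tasks?detailed=true&actions=*reindex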

Index aliases
When operating on a given index, the Elasticsearch APIs accept the index name; many of them accept multiple index names as well.

The index aliases API lets you give an index an alias. An alias can map to more than one index: once the alias has also been assigned to other indexes, operations on the alias automatically expand to all of them. An alias can also carry a filter, which is applied automatically when searching, and routing values (a filtered-alias sketch appears at the end of this section).

Adding an alias
Example: associate the alias alias1 with the index test1.

curl -XPOST 'http://localhost:9200/_aliases' -d '
{
    "actions" : [
        { "add" : { "index" : "test1", "alias" : "alias1" } }
    ]
}'

On success it returns:

{"acknowledged":true}

Removing an alias
An alias can be deleted as well:

curl -XPOST 'http://localhost:9200/_aliases' -d '
{
    "actions" : [
        { "remove" : { "index" : "test1", "alias" : "alias1" } }
    ]
}'

On success it returns:

{"acknowledged":true}

Renaming an alias
Renaming an alias is very simple: just a remove and an add in the same API call. The operation is atomic, so there is no need to worry about a brief window in which the alias points to no index.

curl -XPOST 'http://localhost:9200/_aliases' -d '
{
    "actions" : [
        { "remove" : { "index" : "test1", "alias" : "alias1" } },
        { "add" : { "index" : "test1", "alias" : "alias2" } }
    ]
}'

On success it returns:

{"acknowledged":true}

Switching an alias
The same API can also switch an alias to a different index. In practice, if applications access Elasticsearch through an alias, you can switch them to a different version of the data almost instantly by repointing the alias to another index.

curl -XPOST 'http://localhost:9200/_aliases' -d '
{
    "actions" : [
        { "remove" : { "index" : "test1", "alias" : "alias2" } },
        { "add" : { "index" : "test", "alias" : "alias2" } }
    ]
}'

On success it returns:

{"acknowledged":true}

Pointing an alias at multiple indexes
An alias can point to more than one index; simply use multiple add actions:

curl -XPOST 'http://localhost:9200/_aliases' -d '
{
    "actions" : [
        { "add" : { "index" : "test1", "alias" : "alias1" } },
        { "add" : { "index" : "test2", "alias" : "alias1" } }
    ]
}'

On success it returns:

{"acknowledged":true}