How to elegantly read all the data in an Elasticsearch index

(1) Introduction to scroll

Sometimes we want to read all, or most, of the data in an es index, for example to rebuild the index or to reprocess the data. Most people will say this is easy: just page through it with from+size. In reality, from+size paging is not suited to this kind of full extraction, because each page forces every shard to collect and sort from+size documents, so performance drops the deeper you page. This is also why es limits a single query to no more than 10,000 results by default (the index.max_result_window setting).
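For contrast, here is a minimal from+size paging sketch using the Java API (the index name "twitter" and the client variable are illustrative assumptions); note that even a deep page returning only 100 documents makes every shard collect and sort nearly 10,000:

//deep paging with from+size: every shard must collect and sort
//from+size docs per request, so deep pages get progressively slower
SearchResponse page = client.prepareSearch("twitter")
        .setQuery(QueryBuilders.matchAllQuery())
        .setFrom(9900)  // already close to the 10,000-result window
        .setSize(100)
        .get();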

es provides the scroll API for reading index data in full, which is conceptually very similar to a cursor in a database. When reading with scroll you only need to send an initial query request; the es server then generates a snapshot data set of the index for that request, and afterwards we read batches of the specified size via the scrollId until the entire index has been read.

Note that when the index snapshot set is generated, es actually maintains a search context internally. This context is read-only and immutable within the specified time window; in other words, once it has been created, any subsequent adds, deletes, and updates will not be visible to the scroll.

(2) The use of scroll

Here's how to use it:

(1) Reading data with scroll takes two steps. The first step is to initialize a search context, as follows:

curl -XGET 'localhost:9200/twitter/tweet/_search?scroll=1m' -d '
{
    "query": {
        "match" : {
            "title" : "elasticsearch"
        }
    }
}
'

Note that scroll=1m in the url above means this search context is only kept alive for one minute.

(2) The first request returns a scrollId, and so does every subsequent read. To fetch the next batch of data, we pass back the scrollId from the most recent response, as follows:

curl -XGET  'localhost:9200/_search/scroll'  -d'
{
    "scroll" : "1m", 
    "scroll_id" : "c2Nhbjs2OzM0NDg1ODpzRlBLc0FXNlNyNm5JWUc1" 
}
'

Or via the search lite api, passing the scrollId directly in the body:

curl -XGET 'localhost:9200/_search/scroll?scroll=1m' -d 'c2Nhbjs2OzM0NDg1ODpzRlBLc0FXNlNyNm5JWUc1'

Repeating this request reads the data batch by batch until the searchHits array comes back empty.

The same applies to a scroll request that contains aggregations, except that the aggregation results are only returned by the initial search request, which is worth noting. In practice, though, scrolling an aggregation rarely has a use case: aggregated results are generally several orders of magnitude smaller.

In addition, a scroll request can carry one or more sort fields. If you do not care at all about the order in which the index data is read back, sorting by _doc gives the best performance:

curl -XGET 'localhost:9200/_search?scroll=1m' -d '
{
  "sort": [
    "_doc"
  ]
}
'

ok, now let's add how to read the full es index data with the java api:

import org.elasticsearch.action.search.SearchRequestBuilder;
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.common.unit.TimeValue;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.search.SearchHit;
import org.elasticsearch.search.sort.SortOrder;

//specify an index and type
SearchRequestBuilder search = client.prepareSearch("active2018").setTypes("active");
//sort by _doc to optimize performance
search.addSort("_doc", SortOrder.ASC);
//set the number of docs to read per batch
search.setSize(100);
//query all documents by default
search.setQuery(QueryBuilders.queryStringQuery("*:*"));
//keep the search context alive for 1 minute
search.setScroll(TimeValue.timeValueMinutes(1));

//get the first batch of results
SearchResponse scrollResp = search.get();
//print the total number of hits
System.out.println("total hits: " + scrollResp.getHits().getTotalHits());
//batch counter
int count = 1;
do {
    System.out.println("printing batch " + count + ":");
    //read this batch of hits
    for (SearchHit hit : scrollResp.getHits().getHits()) {
        System.out.println(hit.getSource());
    }
    count++;
    //pass the scrollId back to fetch the next batch
    scrollResp = client.prepareSearchScroll(scrollResp.getScrollId())
            .setScroll(TimeValue.timeValueMinutes(1))
            .execute().actionGet();

    //stop when the searchHits array is empty: all data has been read
} while (scrollResp.getHits().getHits().length != 0);

(3) Deleting scrolls that are no longer needed

As mentioned above, a scroll request maintains a snapshot of the index through a search context. How is this done? From the earlier articles we know that as es writes data it keeps generating small segments in memory, while a merge thread continuously merges small segments into larger ones and then deletes the old segments, reducing the system resources es occupies, file handles in particular. Keeping an index snapshot alive for a period of time means the segments backing that snapshot cannot be removed during that period, otherwise the immutability of the snapshot would be broken. Those small segments that temporarily cannot be merged away hold on to a large number of file handles and other system resources, which is why scroll should be used for offline work rather than near-real-time queries.

We should develop the good habit of clearing a scroll manually as soon as we are done with it, even though the search context is also cleared automatically when it times out.

es provides an api to check how many open search contexts currently exist in the system (the open_contexts field under indices.search in the response):

curl -XGET localhost:9200/_nodes/stats/indices/search?pretty

Let's see how to delete scrollIds:

(1) Delete a single scrollId

DELETE /_search/scroll
{
    "scroll_id" : "UQlNsdDcwakFMNjU1QQ=="
}

(2) Delete multiple scrollIds

DELETE /_search/scroll
{
    "scroll_id" : [
      "aNmRMaUhiQlZkMWFB==",
      "qNmRMaUhiQlZkMWFB=="
    ]
}


(3) Delete all scrollIds

DELETE /_search/scroll/_all

(4) Delete multiple scrollIds via the search lite api

DELETE /_search/scroll/aNmRMaUhiQlZkMWFB==,qNmRMaUhiQlZkMWFB==
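
The same cleanup can be done from the java api with a clear scroll request; a minimal sketch, assuming the client and scrollResp variables from the loop above:

import org.elasticsearch.action.search.ClearScrollResponse;

//clear the scroll explicitly once reading is finished
ClearScrollResponse cleared = client.prepareClearScroll()
        .addScrollId(scrollResp.getScrollId())
        .get();
System.out.println("scroll cleared: " + cleared.isSucceeded());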

All of the above has been verified on es 2.3.4. In addition, es 5.x added sliced scroll, which splits a scroll into slices so they can be read in parallel to improve export efficiency:

An example is as follows:

GET /twitter/_search?scroll=1m
{
    "slice": {
        "id": 0, 
        "max": 2 
    },
    "query": {
        "match" : {
            "title" : "elasticsearch"
        }
    }
}
GET /twitter/_search?scroll=1m
{
    "slice": {
        "id": 1,
        "max": 2
    },
    "query": {
        "match" : {
            "title" : "elasticsearch"
        }
    }
}

Note the slice parameter above: the id field identifies which slice this request reads, and the max parameter is the total number of slices the index data is divided into. The default slicing algorithm is:

slice(doc) = floorMod(hashCode(doc._uid), max) 

As you can see, slicing is based on the hashCode of the _uid field and the maximum number of slices. Note that floorMod gives the same result as the % modulo operator when both operands are positive integers.
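
Here is a sketch of driving both slices in parallel from the java api, assuming ES 5.x and the client, index, and query from the examples above (SliceBuilder lives in org.elasticsearch.search.slice):

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.common.unit.TimeValue;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.search.SearchHit;
import org.elasticsearch.search.slice.SliceBuilder;

final int maxSlices = 2;
ExecutorService pool = Executors.newFixedThreadPool(maxSlices);
for (int i = 0; i < maxSlices; i++) {
    final int sliceId = i;
    pool.submit(() -> {
        //each worker scrolls only the docs where floorMod(hashCode(_uid), max) == sliceId
        SearchResponse resp = client.prepareSearch("twitter")
                .setQuery(QueryBuilders.matchQuery("title", "elasticsearch"))
                .setScroll(TimeValue.timeValueMinutes(1))
                .setSize(100)
                .slice(new SliceBuilder(sliceId, maxSlices))
                .get();
        while (resp.getHits().getHits().length != 0) {
            for (SearchHit hit : resp.getHits().getHits()) {
                //process hit.getSource() here
            }
            resp = client.prepareSearchScroll(resp.getScrollId())
                    .setScroll(TimeValue.timeValueMinutes(1))
                    .get();
        }
    });
}
pool.shutdown();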

The slice can also use a custom field to drive the split, for example a date field:

    "slice": {
        "field": "date",
        "id": 0,
        "max": 10
    }

The field used for slicing must be a numeric field with doc values enabled. Also, max should not be set higher than the number of shards, otherwise query performance degrades. By default es limits the number of slices per scroll to 1,024, which can be changed via the index.max_slices_per_scroll setting.
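
In the java api the custom-field variant corresponds to the SliceBuilder constructor that takes a field name; a one-line sketch under the same ES 5.x assumption, where search is a SearchRequestBuilder as above:

//slice on a numeric, doc-values-enabled field instead of the default _uid
search.slice(new SliceBuilder("date", 0, 10));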

(4) Summary

This article introduced how to elegantly read es index data in full, along with some of the underlying principles and caveats. Understanding these helps us make better use of es in daily work and deepens our understanding of es.
