Problems encountered when using a thread pool to query an ES index with tens of millions of documents

https://www.jylt.cc/#/detail?id=f41997ce9c8828d627a68ca7a9fc2de5

Scenario:

        The company received a requirement: query all the data in ES index A, then use a certain field of each returned document to query another index B, merge the two into the final data set, generate an Excel file, upload it to OSS, and so on. Both index A and index B hold tens of millions of documents. A former colleague had implemented this single-threaded, paging through index A with from/limit deep paging. How long the final export would have taken is anyone's guess; it might not have finished within a month. Then the requirement landed on me.

        I had never used ES before taking on this requirement, and I didn't know much about thread pools either, but I figured a thread pool would speed things up. After some research I raised the throughput from 1,000 items in 4 minutes to 6,000 items in 1 minute. The code is as follows:

(The most time-consuming step is actually querying the tens of millions of documents in index A, so that is the step whose code I'm posting here.)


int i = 0;
// Query the total number of documents in index A
int count = esService.queryNum("name of index A");
while (true) {
    // Only submit a new task when the pool still has a free thread and we have not yet
    // queried everything (totalCount is maintained elsewhere as batches complete).
    // I thought about this step for a long time, because without the check the tasks
    // would flood into the pool all at once and it would start throwing errors almost
    // immediately. MAXIMUMPOOLSIZE is the maximum pool size.
    if (threadPool.getActiveCount() < MAXIMUMPOOLSIZE && totalCount < count) {
        // A task submitted to the pool can only capture outside variables declared final
        final int n = i;
        i++;
        threadPool.execute(new Runnable() {
            @Override
            public void run() {
                int limit = 1000;
                long queryStart = System.currentTimeMillis();
                List<String> dataSetEsQueryList = esService.queryData(yuliaoIndex, n * limit, limit);
                long queryEnd = System.currentTimeMillis();
                logger.info("Queried 1,000 corpus documents, took: " + (queryEnd - queryStart) / 1000 + "s");
            }
        });
    }
}
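
        The threadPool and MAXIMUMPOOLSIZE used above are defined elsewhere in the project and are not shown in the original article. Purely as an illustration, a pool of that shape could be created roughly like this (pool size, queue capacity and rejection policy are assumptions, not the values actually used; requires java.util.concurrent.ThreadPoolExecutor, LinkedBlockingQueue and TimeUnit):

// Hypothetical pool setup; the real configuration is not given in the article.
private static final int MAXIMUMPOOLSIZE = 10;
private static final ThreadPoolExecutor threadPool = new ThreadPoolExecutor(
        MAXIMUMPOOLSIZE,                         // core pool size
        MAXIMUMPOOLSIZE,                         // maximum pool size
        60L, TimeUnit.SECONDS,                   // keep-alive for idle threads above the core size
        new LinkedBlockingQueue<Runnable>(100),  // bounded queue so tasks cannot pile up without limit
        new ThreadPoolExecutor.CallerRunsPolicy()); // run in the caller as back-pressure instead of rejecting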

        queryData method:

public List<String> queryData(String dataSetEsIndex, int from, int size) {
    SearchRequest searchRequest = new SearchRequest();
    searchRequest.indices(dataSetEsIndex);
    SearchSourceBuilder sourceBuilder = new SearchSourceBuilder();
    // Sort by _id so the pages are stable across requests
    sourceBuilder.sort("_id");
    sourceBuilder.from(from);
    sourceBuilder.size(size);
    // Previously every field of the index was returned, but only one field is needed,
    // so restrict the returned _source here
    sourceBuilder.fetchSource(new String[]{"query"}, null);
    searchRequest.source(sourceBuilder);
    List<String> dataSetQueryEsList = new ArrayList<>();
    try {
        SearchResponse rp = client.search(searchRequest, RequestOptions.DEFAULT);
        if (rp != null) {
            SearchHits hits = rp.getHits();
            if (hits != null) {
                for (SearchHit hit : hits.getHits()) {
                    String source = hit.getSourceAsString();
                    DataSetEsTwo index = GsonUtil.GSON_FORMAT_DATE.fromJson(source,
                            new TypeToken<DataSetEsTwo>() {
                            }.getType());
                    index.setId(hit.getId());
                    dataSetQueryEsList.add(index.getQuery());
                }
            }
        }
    } catch (IOException e) {
        logger.error("query ES error: " + e.getMessage(), e);
    }
    return dataSetQueryEsList;
}
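
        For context, this is essentially how queryData was driven in the original single-threaded version: walk through index A page by page with from/size. A minimal sketch (the index name is a placeholder; esService is the same service object used above):

// Page through index A 1,000 documents at a time using from/size deep paging.
int limit = 1000;
int count = esService.queryNum("name of index A"); // total number of documents in index A
for (int page = 0; page * limit < count; page++) {
    List<String> batch = esService.queryData("name of index A", page * limit, limit);
    // ... use the "query" field of every document in this batch ...
}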

        Processing result:

        That was roughly it. As described above, I raised the processing speed from 1,000 items in 4 minutes to 6,000 items in 1 minute and thought I was done, but then came the problem: those numbers were measured against an index holding only about 7,000 documents. Since the index I actually needed to query holds tens of millions of documents, I wanted to see whether the processing time would stay in the same ratio at that scale; I assumed it would grow proportionally. So I kicked the job off in the evening, let it run, and went to sleep with an easy mind. In the morning, before even looking, I was happily wondering how much data had been processed by now, and then I discovered something that made me want to tear my hair out.

        As the figure showed, the further the query progressed, the longer each batch took. When the index held only 7,000 documents, querying a thousand of them took about 4 seconds; with tens of millions of documents in the index, the latency became unacceptable. A whole night processed only a little over 10,000 documents, and at first I couldn't work out why. I didn't initially suspect the ES query at all and assumed the downstream processing was the bottleneck. Eventually I found that the ES queries were where most of the time went, but I still thought: isn't this just an ordinary paged query? After searching online, I found the problem.

(For an easy-to-follow explanation, the following passage is copied from: es deep paging query_weixin_30872671's blog - CSDN blog)

        Assume our ES cluster has three nodes. When a paged query arrives, say at node1, node1 forwards the same query to node2 and node3. Each node returns its top N documents (only the document id and the scoring/sorting fields, to keep the transferred data small). node1 then sorts all 3*N candidates, takes the top N, fetches the full documents from the owning nodes by id, and finally returns them to the client.

        For a paged query such as from=10000, size=10000, each node actually has to collect from+size=20000 documents and then discard the first 10000 after sorting. Deep paging makes this worse: to fetch the tenth page with that page size, each node has to collect 10*size = 100,000 documents, which is terrible. Moreover, by default the query throws an exception once from+size exceeds 10,000. Since ES 2.0 there is a max_result_window index setting, which defaults to 10000 and caps from+size. You can raise this value as a temporary workaround, but it treats the symptom rather than the cause, and things only get worse as the data grows!
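
        For reference, raising that limit is just an index settings update. A minimal sketch with the Java high-level client (the index name is a placeholder, and client is the existing RestHighLevelClient):

// Temporarily raise max_result_window so that from + size may exceed 10,000.
// This is only a stopgap: deep pages still get slower the deeper you go.
// Requires org.elasticsearch.action.admin.indices.settings.put.UpdateSettingsRequest
// and org.elasticsearch.common.settings.Settings.
UpdateSettingsRequest settingsRequest = new UpdateSettingsRequest("name of index A");
settingsRequest.settings(Settings.builder()
        .put("index.max_result_window", 20000)
        .build());
client.indices().putSettings(settingsRequest, RequestOptions.DEFAULT);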

        In other words, with from/limit paging, the more data the ES index holds (beyond roughly 10,000 documents), the slower each query becomes, and the latency keeps climbing the deeper you page, which is exactly what the timings above showed. So what can we do? Fortunately, ES gives us another way to read the data: the magical scroll query!

        Scroll query is also called cursor query or scrolling query. For details, please refer to the official document: Search After | Elasticsearch Guide [6.5] | Elastic
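
        As an aside, the page linked above actually documents search_after, which is yet another way to page deeply: instead of keeping a scroll context on the server, each request passes the sort values of the last hit of the previous page. A rough sketch under that assumption (field names, page size and the client variable are illustrative, not from the original):

// search_after sketch: walk the index without deep from/size offsets.
SearchSourceBuilder sourceBuilder = new SearchSourceBuilder()
        .query(QueryBuilders.matchAllQuery())
        .sort("_id", SortOrder.ASC)   // a unique, total ordering is required
        .size(1000);
Object[] lastSortValues = null;
while (true) {
    if (lastSortValues != null) {
        sourceBuilder.searchAfter(lastSortValues); // continue after the previous page's last hit
    }
    SearchResponse response = client.search(
            new SearchRequest("name of index A").source(sourceBuilder), RequestOptions.DEFAULT);
    SearchHit[] pageHits = response.getHits().getHits();
    if (pageHits.length == 0) {
        break; // no more pages
    }
    // ... process pageHits ...
    lastSortValues = pageHits[pageHits.length - 1].getSortValues();
}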

        I then modified the code again; the revised version is as follows:

String queryEnd = "false";
long startTime = System.currentTimeMillis();
//        1. 创建查询对象
SearchRequest searchRequest = new SearchRequest("索引名称");//指定索引
searchRequest.scroll(TimeValue.timeValueMinutes(1L));//指定存在内存的时长为1分钟
//    2. 封装查询条件
SearchSourceBuilder searchSourceBuilder = new SearchSourceBuilder();
searchSourceBuilder.sort("id", SortOrder.DESC); //按照哪个字段进行排序
searchSourceBuilder.size(2);    //一次查询多少条
searchSourceBuilder.fetchSource(new String[]{"query"}, null);   //只查询哪些字段或不查询哪些字段
searchSourceBuilder.query(QueryBuilders.matchAllQuery());
searchRequest.source(searchSourceBuilder);
//        3.执行查询
// client执行
HttpHost httpHost = new HttpHost("ip", "端口号(int类型)", "http");
RestClientBuilder restClientBuilder = RestClient.builder(httpHost);
//也可以多个结点
//RestClientBuilder restClientBuilder = RestClient.builder(
//    new HttpHost("ip", "端口号(int类型)", "http"),
//        new HttpHost("ip", "端口号(int类型)", "http"),
//        new HttpHost("ip", "端口号(int类型)", "http"));
RestHighLevelClient restHighLevelClient = new RestHighLevelClient(restClientBuilder);

SearchResponse searchResponse = restHighLevelClient.search(searchRequest, RequestOptions.DEFAULT);
String scrollId = searchResponse.getScrollId();

// 4. Read the first batch of results
SearchHit[] hits = searchResponse.getHits().getHits();
totalCount = totalCount + hits.length;
for (SearchHit searchHit : hits) {
    String source = searchHit.getSourceAsString();
    DataSetEsTwo index2 = GsonUtil.GSON_FORMAT_DATE.fromJson(source,
            new TypeToken<DataSetEsTwo>() {
            }.getType());
    // index2 is the data I need
    index2.setId(searchHit.getId());
}
// Fetch all the remaining pages
while (true) {
    // Once no more data comes back, stop. The pool check is needed because some
    // threads may still be running at this point; every thread has to finish
    // before the result set can be considered complete.
    if (queryEnd) {
        if (threadPool.getActiveCount() == 0) {
            break;
        }
    }
    SearchHit[] hits1 = null;
    try {
        // Create a SearchScrollRequest to pull the next batch
        SearchScrollRequest searchScrollRequest = new SearchScrollRequest(scrollId);
        searchScrollRequest.scroll(TimeValue.timeValueMinutes(3L));
        SearchResponse scroll = restHighLevelClient.scroll(searchScrollRequest, RequestOptions.DEFAULT);
        hits1 = scroll.getHits().getHits();
    } catch (Exception e) {
        logger.error("Scroll query failed: " + e.getMessage());
    }

    // Hand the fetched batch to the thread pool for processing.
    // If the pool is currently full, wait until a thread frees up;
    // as before, tasks must not be dumped into the pool all at once.
    while (threadPool.getActiveCount() >= MAXIMUMPOOLSIZE) {
        try {
            Thread.sleep(100);
        } catch (Exception e) {
            logger.error("Sleep interrupted...");
        }
    }

    if (hits1 != null && hits1.length > 0) {
        // Reaching this point means the pool has a free slot
        final SearchHit[] hits1Fin = hits1;
        threadPool.execute(new Runnable() {
            @SneakyThrows
            @Override
            public void run() {
                // Process this batch of query results inside the thread pool
                for (SearchHit searchHit : hits1Fin) {
                    try {
                        String source = searchHit.getSourceAsString();
                        DataSetEsTwo index2 = GsonUtil.GSON_FORMAT_DATE.fromJson(source,
                                new TypeToken<DataSetEsTwo>() {
                                }.getType());
                        // index2 is the data I need
                        index2.setId(searchHit.getId());
                    } catch (Exception e) {
                        logger.error("Thread execution error: " + e.getMessage());
                    }
                }
            }
        });
    } else {
        logger.info("------------ Corpus query finished --------------");
        queryEnd = true;
    }
}
// Clear the scroll context
try {
    ClearScrollRequest clearScrollRequest = new ClearScrollRequest();
    clearScrollRequest.addScrollId(scrollId);
    ClearScrollResponse clearScrollResponse = restHighLevelClient.clearScroll(clearScrollRequest, RequestOptions.DEFAULT);
} catch (Exception e) {
    logger.error("Failed to clear scroll id: " + e.getMessage());
}
long endTime = System.currentTimeMillis();
logger.info("Data query running time: " + (endTime - startTime) / 1000 / 60 + "min");
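
        One housekeeping detail the snippet above leaves out: once the export is finished, the thread pool and the RestHighLevelClient should be released, otherwise non-daemon worker threads and open connections linger. A minimal sketch, assuming the same threadPool and restHighLevelClient variables (and the usual java.util.concurrent.TimeUnit / java.io.IOException imports):

// Wait for in-flight tasks to finish, then release the pool and the ES client.
threadPool.shutdown();
try {
    if (!threadPool.awaitTermination(10, TimeUnit.MINUTES)) {
        threadPool.shutdownNow(); // give up and cancel whatever is still queued
    }
} catch (InterruptedException e) {
    Thread.currentThread().interrupt();
    threadPool.shutdownNow();
}
try {
    restHighLevelClient.close();
} catch (IOException e) {
    logger.error("Failed to close ES client: " + e.getMessage());
}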

        The optimized code processes roughly 3,000 items per minute, and the rate no longer degrades as the index grows: however much data the index holds, the total time increases roughly in proportion. A happy ending!

Origin blog.csdn.net/wuchenxiwalter/article/details/123909237