Elasticsearch in Practice | When Necessary, Trade Space for Time!

1. Application scenario

Real-time data streams into Kafka. Depending on business needs, one part of it is written directly into Elasticsearch indices through the various Kafka connectors.
The other part must first be clustered and classified; the resulting clusters (the aggregation results) are then indexed into the ES cluster. As shown below:
The business system can be divided into layers: the access layer, the data processing layer, the data storage layer, and the interface layer.
So here is the question:
We need to run retrieval and analysis on the aggregation results produced by the data processing layer. How can retrieval and aggregation analysis over those results be made faster and more efficient?

2. Choosing a scheme

Option One:
Create only one index: aggs_index.
The aggregation results from the data processing layer are stored in this ES index, and the data related to each aggregated event (the IDs of its associated articles) is stored in a field of that event's document. As shown in the following schematic:

Schematic of Option One

Option Two:
Create two indices: aggs_index and aggs_detail_index.
Where:
1) aggs_index stores the event list information.
2) aggs_detail_index stores the content of the articles associated with each event.
As shown below:

Schematic of Option Two

3. Comparing the schemes

Option One advantage: it saves storage space. Only the IDs of the related articles are stored, so no data is stored twice.
Option One disadvantage: retrieval and aggregation are slow, and performance falls short of the requirement.
Every subsequent operation under Option One first has to walk through a pile of IDs to fetch the articles, and only then can it run the retrieval and aggregation analysis.

An example follows (the real case is more complex than this):
Step 1: from the event ID, get the list of associated article IDs;
Step 2: using that list of article IDs, run the retrieval and aggregation operations.

POST aggs_index/_search
{
  "_source": {
    "includes": [
      "title",
      "abstract",
      "publish_time",
      "author"
    ]
  },
  "query": {
    "terms": {
      "_id": [
        "789b4cb872be00a04560d95bf13ec8f42c",
        "792d9610b03676dc5644c2ff4db372dec4",
        "817f5cff3dd0ec3564d45615f940cb7437",
        "....."
      ]
    }
  }
}
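
The two steps can be sketched in Python as follows. This is a minimal sketch, not the author's code: the field name `article_ids`, the helper names, and the sample event document are assumptions made for illustration. Only the `_source` fields and the `terms` query on `_id` come from the request above.

```python
import json

# Fields returned for each article, taken from the request above.
SOURCE_FIELDS = ["title", "abstract", "publish_time", "author"]


def extract_article_ids(event_doc):
    """Step 1: read the associated article IDs from an event document.

    `article_ids` is a hypothetical field name; the article does not
    say what the field is actually called.
    """
    return list(event_doc.get("article_ids", []))


def build_terms_query(article_ids):
    """Step 2: build the search body that fetches articles by _id."""
    return {
        "_source": {"includes": SOURCE_FIELDS},
        "query": {"terms": {"_id": article_ids}},
    }


if __name__ == "__main__":
    # Hypothetical event document as it might be stored under Option One.
    event = {
        "event_name": "example event",
        "article_ids": [
            "789b4cb872be00a04560d95bf13ec8f42c",
            "792d9610b03676dc5644c2ff4db372dec4",
        ],
    }
    body = build_terms_query(extract_article_ids(event))
    print(json.dumps(body, indent=2))
```

The size of the step-2 request grows with the number of associated articles, which is where the trouble with Option One starts.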

When the number of IDs in step 2 is large, the following error is returned:

{
  "error": {
    "root_cause": [
      {
        "type": "too_many_clauses",
        "reason": "too_many_clauses: maxClauseCount is set to 1024"
      },

...
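
One common way to stay under this limit without leaving Option One is to split the ID list into batches and send one request per batch. The sketch below illustrates that workaround; it is not something the article itself does. The batch size of 1000 is an assumption chosen to stay below the 1024 reported in the error, and `chunk_ids` and `build_batched_queries` are hypothetical helper names.

```python
def chunk_ids(ids, batch_size=1000):
    """Yield consecutive slices of `ids`, each at most `batch_size` long."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(ids), batch_size):
        yield ids[start:start + batch_size]


def build_batched_queries(ids, batch_size=1000):
    """Build one terms-query body per batch of IDs."""
    return [
        {"query": {"terms": {"_id": batch}}}
        for batch in chunk_ids(ids, batch_size)
    ]


if __name__ == "__main__":
    ids = [f"id-{n}" for n in range(2500)]  # made-up IDs for illustration
    bodies = build_batched_queries(ids)
    # Show how many requests are needed and how many IDs each one carries.
    print(len(bodies), [len(b["query"]["terms"]["_id"]) for b in bodies])
```

Batching removes the error but not the cost: each batch is a separate round trip, and any aggregation over the full set has to be merged on the client side.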

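Option Two advantage: the data is stored separately, so retrieval and aggregation analysis can be done within a single index.
It trades space for time and greatly improves retrieval efficiency and aggregation speed.
Option Two disadvantage: the same data is stored one extra time.
The corresponding retrieval operation is as follows: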

POST aggs_index/_search
{
  "_source": {
    "includes": [
      "title",
      "abstract",
      "publish_time",
      "author"
    ]
  },
  "query": {
    "term": {
      "topic_id": "WIAEgRbI0k9s1D2JrXPC"
    }
  }
}
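
On the write side, Option Two means storing a copy of each associated article together with the ID of its event, so that a single term filter on `topic_id` finds them all. The following is a minimal sketch under assumptions: the input shapes, the function names, and the idea that `topic_id` holds the event's ID are illustrative guesses. Only `topic_id` and the four article fields appear in the article.

```python
def build_detail_docs(topic_id, articles):
    """Return one detail document per article, tagged with its topic_id.

    Each output document duplicates the article's fields; this extra
    copy is the space that Option Two spends to save time.
    """
    docs = []
    for article in articles:
        doc = dict(article)         # copy so the input is not mutated
        doc["topic_id"] = topic_id  # the key later used by the term query
        docs.append(doc)
    return docs


def build_topic_query(topic_id):
    """Build the single term query that replaces the ID-list lookup."""
    return {
        "_source": {
            "includes": ["title", "abstract", "publish_time", "author"]
        },
        "query": {"term": {"topic_id": topic_id}},
    }


if __name__ == "__main__":
    # Made-up articles for illustration only.
    articles = [
        {"title": "first", "author": "alice"},
        {"title": "second", "author": "bob"},
    ]
    for doc in build_detail_docs("WIAEgRbI0k9s1D2JrXPC", articles):
        print(doc)
    print(build_topic_query("WIAEgRbI0k9s1D2JrXPC"))
```

The query body stays the same size no matter how many articles belong to the event, which is the point of the extra copy.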

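Is that really so?
Let the facts speak:
The response times below are in ms.
Option One has to search across N indices (N close to 10), each holding nearly ten million documents.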

Comparison of the two schemes

Response time comparison chart for the two schemes

4. Summary

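  • As the comparison above shows, Option Two trades space for time: the data is stored one extra time, but performance improves by more than 10x.

  • In real-world development we should choose the storage scheme rationally. Now that disk is getting cheaper by the day, put performance first so that users get a smooth experience!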

Origin www.cnblogs.com/springforall/p/11334539.html