Using Elasticsearch to calculate storage usage

Introduction to object storage

Object storage is widely used in current projects, mainly to hold static resources such as images, video, audio, and other files. Nearly every cloud vendor offers it, and it is generally billed per GB per month: for example, 0.098 yuan/GB/month at Qiniu and 0.12 yuan/GB/month at Alibaba Cloud. If I used 30 GB last month at the 0.098 rate, the bill for that month is 30 × 0.098 = 2.94 yuan.

Note that "30 GB used last month" does not mean the Bucket held 30 GB of data at the end of the month. It means the average daily usage over the month was 30 GB. For example, if Xiao Ming uploaded a 1 GB file every morning last month, his usage for the month was (1+2+3+...+30)/30 = 15.5 GB.

This raises a new problem. Suppose Xiao Ming uploads a 1 GB file every morning and deletes it every afternoon: what was his storage usage last month? It cannot be 0, or he would be using the storage for free. To prevent this kind of freeloading, define the daily usage as the maximum space the Bucket occupied on that day. Xiao Ming uploads 1 GB in the morning and deletes it in the afternoon, so the maximum for each day is 1 GB and the usage for the month is (1+1+...+1)/30 = 1 GB.

To calculate the daily maximum exactly, you have to recompute the current usage every time a file is added or deleted and take the maximum over the day. If that precision is not required, you can sample the usage at fixed intervals instead. This article shows how to use Elasticsearch to compute daily storage usage from such samples.
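
The billing arithmetic above can be checked with a few lines of Java. This is only a minimal sketch: the class and method names (MonthlyUsageExample, averageDailyUsage, monthlyCost) are hypothetical, and the price is the 0.098 yuan/GB/month figure quoted above.

```java
import java.util.Arrays;
import java.util.stream.IntStream;

/** Illustrative only: monthly billable usage as the mean of the daily usage figures. */
final class MonthlyUsageExample {

    /** Mean of the daily usage values in GB; returns 0 for an empty month. */
    static double averageDailyUsage(double[] dailyUsageGb) {
        double sum = 0;
        for (double d : dailyUsageGb) {
            sum += d;
        }
        return dailyUsageGb.length == 0 ? 0 : sum / dailyUsageGb.length;
    }

    /** Cost for one month at a flat price per GB per month. */
    static double monthlyCost(double[] dailyUsageGb, double pricePerGbMonth) {
        return averageDailyUsage(dailyUsageGb) * pricePerGbMonth;
    }

    public static void main(String[] args) {
        // Case 1: 1 GB uploaded every morning, nothing deleted, so day n holds n GB.
        double[] growing = IntStream.rangeClosed(1, 30).asDoubleStream().toArray();

        // Case 2: 1 GB uploaded each morning and deleted each afternoon,
        // with daily usage defined as the daily maximum (1 GB).
        double[] uploadThenDelete = new double[30];
        Arrays.fill(uploadThenDelete, 1.0);

        System.out.println(averageDailyUsage(growing));          // prints 15.5
        System.out.println(averageDailyUsage(uploadThenDelete)); // prints 1.0
        System.out.println(monthlyCost(growing, 0.098));         // roughly 1.519 yuan
    }
}
```

Defining daily usage as the daily maximum is what keeps the second case at 1 GB rather than 0.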

Basic statistics workflow

Every 30 minutes, measure the current storage usage and store it in ES. The main fields are as follows:

Tenant ID    Statistics time        Size
1            2023-07-10 00:00:00    1024
1            2023-07-10 00:30:00    2024
1            2023-07-10 01:00:00    1024
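
As a rough sketch of the sampling step, the Java below aligns a timestamp to its 30-minute slot and serializes one record. All names here (UsageSampleExample, slotStart, delayUntilNextSlot, toJson) are hypothetical; the field names tenantId, time, and size and the time format are taken from the index mapping used in this article. Measuring the actual Bucket size and writing the record to ES are left out.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.time.temporal.ChronoUnit;

/** Illustrative only: align sampling times to 30-minute slots and build one record. */
final class UsageSampleExample {

    /** Matches the "yyyy-MM-dd HH:mm:ss" format accepted by the time field. */
    static final DateTimeFormatter ES_TIME = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss");

    /** Rounds a timestamp down to the start of its 30-minute slot. */
    static LocalDateTime slotStart(LocalDateTime t) {
        LocalDateTime hour = t.truncatedTo(ChronoUnit.HOURS);
        return t.getMinute() < 30 ? hour : hour.plusMinutes(30);
    }

    /** Time from now until the next slot boundary, usable as a scheduler's initial delay. */
    static Duration delayUntilNextSlot(LocalDateTime now) {
        return Duration.between(now, slotStart(now).plusMinutes(30));
    }

    /** Serializes one sample as a JSON document with the fields tenantId, time, and size. */
    static String toJson(long tenantId, LocalDateTime time, long size) {
        return "{\"tenantId\":" + tenantId
                + ",\"time\":\"" + ES_TIME.format(time) + "\""
                + ",\"size\":" + size + "}";
    }

    public static void main(String[] args) {
        LocalDateTime now = LocalDateTime.of(2023, 7, 10, 0, 41, 17);
        // prints {"tenantId":1,"time":"2023-07-10 00:30:00","size":2024}
        System.out.println(toJson(1L, slotStart(now), 2024L));
        // prints 1123 (seconds until 01:00:00)
        System.out.println(delayUntilNextSlot(now).getSeconds());
    }
}
```

Aligning every sample to a slot boundary keeps the timestamps of different tenants comparable and makes it easy to spot a missed sample.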

Create ES index

PUT /bucket_size
{
  "settings": {
    "number_of_shards": 6,
    "number_of_replicas": 0
  },
  "mappings": {
    "properties": {
      "id": {
        "type": "long"
      },
      "size": {
        "type": "long"
      },
      "tenantId": {
        "type": "long"
      },
      "time": {
        "type": "date",
        "format": "yyyy-MM-dd HH:mm:ss||yyyy-MM-dd||epoch_millis"
      }
    }
  }
}

Test Data

{
    "id": "1",
    "tenantId": 1,
    "size": 1024,
    "time": "2023-07-17 18:00:00"
}
{
    "id": "2",
    "tenantId": 1,
    "size": 2048,
    "time": "2023-07-17 19:00:00"
}
{
    "id": "3",
    "tenantId": 1,
    "size": 1024,
    "time": "2023-07-17 10:00:00"
}
{
    "id": "4",
    "tenantId": 2,
    "size": 1024,
    "time": "2023-07-17 09:00:00"
}
{
    "id": "5",
    "tenantId": 2,
    "size": 0,
    "time": "2023-07-17 10:00:00"
}
{
    "id": "6",
    "tenantId": 2,
    "size": 1024,
    "time": "2023-07-17 11:11:00"
}
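
One way to load documents like these in a single request is the Elasticsearch _bulk API, which takes newline-delimited JSON: an action line followed by a source line for each document. The sketch below only builds that request body. The class and method names (BulkBodyExample, Sample, bulkBody) are hypothetical, and it writes id as a number to match the long mapping, whereas the test data above writes it as a string.

```java
import java.util.Arrays;
import java.util.List;

/** Illustrative only: build an NDJSON body for the Elasticsearch _bulk API. */
final class BulkBodyExample {

    /** One test document; field names follow the bucket_size mapping. */
    static final class Sample {
        final long id;
        final long tenantId;
        final long size;
        final String time; // "yyyy-MM-dd HH:mm:ss"

        Sample(long id, long tenantId, long size, String time) {
            this.id = id;
            this.tenantId = tenantId;
            this.size = size;
            this.time = time;
        }
    }

    /** Emits one action line and one source line per sample, each ending in a newline. */
    static String bulkBody(String index, List<Sample> samples) {
        StringBuilder sb = new StringBuilder();
        for (Sample s : samples) {
            sb.append("{\"index\":{\"_index\":\"").append(index)
              .append("\",\"_id\":\"").append(s.id).append("\"}}\n");
            sb.append("{\"id\":").append(s.id)
              .append(",\"tenantId\":").append(s.tenantId)
              .append(",\"size\":").append(s.size)
              .append(",\"time\":\"").append(s.time).append("\"}\n");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<Sample> samples = Arrays.asList(
                new Sample(1, 1, 1024, "2023-07-17 18:00:00"),
                new Sample(2, 1, 2048, "2023-07-17 19:00:00"));
        // Send the printed text as the body of POST /_bulk
        // with the header Content-Type: application/x-ndjson.
        System.out.print(bulkBody("bucket_size", samples));
    }
}
```

The body must end with a newline, which is why every line, including the last, is terminated explicitly.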

Query the tenant's daily usage

Requirement: given a list of tenant IDs, a start time, and an end time, return the daily usage of each tenant within that time range.

GET /bucket_size/_search
{
    "query": {
        "bool": {
            "must": [
                {
                    "terms": {
                        "tenantId": [
                            1,
                            2
                        ],
                        "boost": 1
                    }
                },
                {
                    "range": {
                        "time": {
                            "from": "2023-07-01",
                            "to": "2023-07-31",
                            "include_lower": true,
                            "include_upper": true,
                            "boost": 1
                        }
                    }
                }
            ],
            "adjust_pure_negative": true,
            "boost": 1
        }
    },
    "aggregations": {
        "tenantGroup": {
            "terms": {
                "field": "tenantId",
                "size": 10,
                "min_doc_count": 1,
                "shard_min_doc_count": 0,
                "show_term_doc_count_error": false,
                "order": [
                    {
                        "_count": "desc"
                    },
                    {
                        "_key": "asc"
                    }
                ]
            },
            "aggregations": {
                "groupDay": {
                    "date_histogram": {
                        "field": "time",
                        "format": "yyyy-MM-dd",
                        "calendar_interval": "1d",
                        "offset": 0,
                        "order": {
                            "_key": "asc"
                        },
                        "keyed": false,
                        "extended_bounds": {
                            "min": "2023-07-01",
                            "max": "2023-07-31"
                        }
                    },
                    "aggregations": {
                        "maxSize": {
                            "max": {
                                "field": "size",
                                "missing": 0
                            }
                        }
                    }
                }
            }
        }
    }
}

Result (abridged)

{
    "took": 3,
    "timed_out": false,
    "_shards": {
        "total": 6,
        "successful": 6,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": {
            "value": 6,
            "relation": "eq"
        },
        "max_score": 2,
        "hits": [
            {
                "_index": "bucket_size",
                "_type": "_doc",
                "_id": "2",
                "_score": 2,
                "_source": {
                    "id": "2",
                    "tenantId": 1,
                    "size": 2048,
                    "time": "2023-07-17 19:00:00"
                }
            }
        ]
    },
    "aggregations": {
        "tenantGroup": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 0,
            "buckets": [
                {
                    "key": 1,
                    "doc_count": 3,
                    "groupDay": {
                        "buckets": [
                            {
                                "key_as_string": "2023-07-01",
                                "key": 1688169600000,
                                "doc_count": 0,
                                "maxSize": {
                                    "value": null
                                }
                            },
                            {
                                "key_as_string": "2023-07-02",
                                "key": 1688256000000,
                                "doc_count": 0,
                                "maxSize": {
                                    "value": null
                                }
                            }
                        ]
                    }
                },
                {
                    "key": 2,
                    "doc_count": 3,
                    "groupDay": {
                        "buckets": [
                            {
                                "key_as_string": "2023-07-31",
                                "key": 1690761600000,
                                "doc_count": 0,
                                "maxSize": {
                                    "value": null
                                }
                            }
                        ]
                    }
                }
            ]
        }
    }
}
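
To make the grouping concrete, the sketch below reproduces the same computation in memory on the six test documents: group by tenant, then by calendar day, then take the largest size. It is not how ES executes the query, and the names (DailyMaxExample, Sample, testData, dailyMax) are hypothetical. Unlike the date_histogram with extended_bounds, it only produces entries for days that actually have samples.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Illustrative only: the tenant -> day -> max(size) grouping, computed in memory. */
final class DailyMaxExample {

    /** One usage sample; field names follow the bucket_size mapping. */
    static final class Sample {
        final long tenantId;
        final long size;
        final String time; // "yyyy-MM-dd HH:mm:ss"

        Sample(long tenantId, long size, String time) {
            this.tenantId = tenantId;
            this.size = size;
            this.time = time;
        }
    }

    /** The six test documents from this article. */
    static List<Sample> testData() {
        return Arrays.asList(
                new Sample(1, 1024, "2023-07-17 18:00:00"),
                new Sample(1, 2048, "2023-07-17 19:00:00"),
                new Sample(1, 1024, "2023-07-17 10:00:00"),
                new Sample(2, 1024, "2023-07-17 09:00:00"),
                new Sample(2, 0, "2023-07-17 10:00:00"),
                new Sample(2, 1024, "2023-07-17 11:11:00"));
    }

    /** Groups by tenant, then by calendar day, keeping the largest size seen per day. */
    static Map<Long, Map<String, Long>> dailyMax(List<Sample> samples) {
        Map<Long, Map<String, Long>> result = new TreeMap<>();
        for (Sample s : samples) {
            String day = s.time.substring(0, 10); // the "yyyy-MM-dd" prefix
            result.computeIfAbsent(s.tenantId, k -> new TreeMap<>())
                  .merge(day, s.size, Long::max);
        }
        return result;
    }

    public static void main(String[] args) {
        // prints {1={2023-07-17=2048}, 2={2023-07-17=1024}}
        System.out.println(dailyMax(testData()));
    }
}
```

For 2023-07-17 the daily usage is therefore 2048 for tenant 1 and 1024 for tenant 2, which is what the maxSize value of that day's bucket should contain in the full ES response.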

Implementation in Java

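// Imports inferred from the class names used below:
// Elasticsearch 7.x high-level REST client and Hutool.
import cn.hutool.core.collection.CollUtil;
import cn.hutool.core.date.DatePattern;
import org.elasticsearch.index.query.BoolQueryBuilder;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.search.aggregations.AggregationBuilder;
import org.elasticsearch.search.aggregations.AggregationBuilders;
import org.elasticsearch.search.aggregations.Aggregations;
import org.elasticsearch.search.aggregations.BucketOrder;
import org.elasticsearch.search.aggregations.bucket.histogram.DateHistogramInterval;
import org.elasticsearch.search.aggregations.bucket.histogram.Histogram;
import org.elasticsearch.search.aggregations.bucket.histogram.LongBounds;
import org.elasticsearch.search.aggregations.bucket.histogram.ParsedDateHistogram;
import org.elasticsearch.search.aggregations.bucket.terms.Terms;
import org.elasticsearch.search.aggregations.metrics.ParsedMax;

import java.io.IOException;
import java.util.Arrays;
import java.util.Map;
import java.util.TreeMap;

public Map<Long, Map<String, Long>> getTenantSize(Long[] tenantIds, String monthStartDate, String monthEndDate) throws IOException {
    // tenantId -> (day as yyyy-MM-dd -> largest size sampled that day); TreeMap keeps both levels sorted.
    Map<Long, Map<String, Long>> map = new TreeMap<>();

    // Filter: the requested tenants, within [monthStartDate, monthEndDate].
    BoolQueryBuilder queryBuilder = QueryBuilders.boolQuery();
    queryBuilder.must(QueryBuilders.termsQuery("tenantId", Arrays.asList(tenantIds)));
    queryBuilder.must(QueryBuilders.rangeQuery("time").gte(monthStartDate).lte(monthEndDate));

    // Group by tenant, then by calendar day, then take max(size) per day.
    // Note: a terms aggregation returns 10 buckets by default; set .size(...) when querying more tenants.
    AggregationBuilder tenantGroup = AggregationBuilders.terms("tenantGroup").field("tenantId")
            .subAggregation(AggregationBuilders.dateHistogram("groupDay").field("time")
                    .calendarInterval(DateHistogramInterval.DAY)
                    .format(DatePattern.NORM_DATE_PATTERN)
                    .order(BucketOrder.key(true))
                    // extendedBounds returns a bucket for every day in the range, including empty days.
                    .extendedBounds(new LongBounds(monthStartDate, monthEndDate))
                    .subAggregation(AggregationBuilders.max("maxSize").field("size")));

    // esClient is the project's own wrapper; it runs the search on the given index and returns the aggregations.
    Aggregations aggregations = esClient.search(queryBuilder, tenantGroup, "bucket_size");
    Terms terms = aggregations.get("tenantGroup");
    if (terms == null || CollUtil.isEmpty(terms.getBuckets())) {
        return map;
    }
    for (Terms.Bucket bucket : terms.getBuckets()) {
        Map<String, Long> daySizeMap = new TreeMap<>();
        ParsedDateHistogram groupDay = bucket.getAggregations().get("groupDay");
        if (groupDay != null) {
            for (Histogram.Bucket daySize : groupDay.getBuckets()) {
                ParsedMax maxSize = daySize.getAggregations().get("maxSize");
                // An empty day has no max; the parsed value is then negative infinity, so store 0 instead.
                double value = maxSize.getValue();
                daySizeMap.put(daySize.getKeyAsString(), Double.isInfinite(value) ? 0L : (long) value);
            }
        }
        map.put(Long.valueOf(bucket.getKeyAsString()), daySizeMap);
    }
    return map;
}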

Summary

This article showed how to use Elasticsearch to calculate storage usage, covering Elasticsearch grouped (aggregation) queries and how to call them from Java code. Keep the following points in mind:

  1. When querying a date range such as July 1st to July 31st, days with no data in ES must still appear in the result. date_histogram with extended_bounds forces a bucket for every day in the range; empty days come back with a null value.
  2. After the query results are grouped, they should be sorted by time.
  3. Aggregate by day and use max to take the largest size of the day as that day's storage usage.
  4. Elasticsearch aggregation queries consume a lot of memory, and this one nests three levels (tenant, day, max). Keep the time range and the number of tenants per query small, otherwise the query can cause an OOM.
  5. In this example, usage is sampled every 30 minutes. A file that is uploaded and deleted between two samples is never counted, so that usage goes unbilled.
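
The sampling gap in the last point disappears if usage is updated on every upload and delete, as described in the introduction. The following is a minimal single-threaded sketch of that idea, not the approach implemented in this article. DailyPeakTracker and its methods are hypothetical names, and a day with no events gets no entry here, so a real implementation would also need to carry the previous size forward into such days.

```java
import java.time.LocalDate;
import java.util.Map;
import java.util.TreeMap;

/** Illustrative only: track the exact daily peak by updating on every upload and delete. */
final class DailyPeakTracker {

    private long currentSize;
    private final Map<LocalDate, Long> peakByDay = new TreeMap<>();

    /** Applies a size change on the given day: positive for an upload, negative for a delete. */
    void apply(LocalDate day, long delta) {
        // The size carried into this event also counts toward the day's peak.
        peakByDay.merge(day, currentSize, Long::max);
        currentSize += delta;
        peakByDay.merge(day, currentSize, Long::max);
    }

    /** Peak size recorded for the day, or 0 if no event was seen on that day. */
    long peak(LocalDate day) {
        return peakByDay.getOrDefault(day, 0L);
    }

    /** Size currently held after all applied events. */
    long currentSize() {
        return currentSize;
    }

    public static void main(String[] args) {
        DailyPeakTracker tracker = new DailyPeakTracker();
        LocalDate day = LocalDate.of(2023, 7, 17);
        tracker.apply(day, 1024L);   // morning upload
        tracker.apply(day, -1024L);  // afternoon delete
        System.out.println(tracker.peak(day));      // prints 1024
        System.out.println(tracker.currentSize());  // prints 0
    }
}
```

The trade-off is that every write path has to call the tracker, whereas interval sampling needs no changes to upload and delete handling.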

Origin blog.csdn.net/whzhaochao/article/details/131822058