Introduction
Object storage is widely used in modern projects, mainly to store static resources such as images, videos, audio, and files, and virtually every cloud vendor offers it. Object storage is generally billed per GB per month, for example Qiniu at 0.098 yuan/GB/month and Alibaba Cloud at 0.12 yuan/GB/month. If I used 30 GB last month, last month's cost is 30 * 0.098. Note that "30 GB used last month" does not mean the bucket held 30 GB of data at the end of the month; it means the average daily usage was 30 GB. For example, if Xiao Ming uploaded a 1 GB file every morning last month, his usage for the month was (1+2+3+...+30)/30 = 15.5 GB. This raises a new question: if Xiao Ming uploads a 1 GB file every morning and deletes a 1 GB file every afternoon, what is his storage usage for the month? It certainly cannot be 0, otherwise he would be getting storage for free. To prevent that, a day's usage can be defined as the bucket's maximum occupied space on that day: Xiao Ming uploads 1 GB in the morning and deletes 1 GB in the afternoon, so the day's maximum is 1 GB, and the month's usage is (1+1+...+1)/30 = 1 GB. Calculating the daily maximum precisely requires recording the current usage on every file upload and deletion and taking the maximum within each day; if the precision requirements are lower, you can instead sample the usage at fixed intervals. This article shows how to use Elasticsearch to compute daily storage usage.
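The billing model described above can be sketched in a few lines of Java. This is only an illustration of the arithmetic, not vendor billing code; the method name `monthlyUsageGb` is made up for this example:

```java
import java.util.Collections;
import java.util.List;
import java.util.stream.IntStream;

public class MonthlyUsage {

    // Monthly billable usage = average of each day's maximum bucket size (in GB).
    static double monthlyUsageGb(List<Double> dailyMaxGb) {
        return dailyMaxGb.stream().mapToDouble(Double::doubleValue).sum() / dailyMaxGb.size();
    }

    public static void main(String[] args) {
        // Xiao Ming uploads 1 GB every morning and never deletes:
        // daily maxima are 1, 2, 3, ..., 30 -> (1+2+...+30)/30 = 15.5 GB
        List<Double> growing = IntStream.rangeClosed(1, 30).asDoubleStream().boxed().toList();
        System.out.println(monthlyUsageGb(growing)); // 15.5

        // Upload 1 GB every morning, delete it every afternoon:
        // each day's maximum is still 1 GB, so the month bills as 1 GB.
        List<Double> churn = Collections.nCopies(30, 1.0);
        System.out.println(monthlyUsageGb(churn)); // 1.0
    }
}
```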
Basic Statistics Process
Every 30 minutes, sample the current storage usage and store it in ES. The main fields are as follows:
tenant ID | statistics time | size |
---|---|---|
1 | 2023-07-10 00:00:00 | 1024 |
1 | 2023-07-10 00:30:00 | 2048 |
1 | 2023-07-10 01:00:00 | 1024 |
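One way to produce these rows is a scheduled job that samples each bucket every 30 minutes and indexes a document per tenant. A minimal sketch of building such a document is shown below; the field names follow the mapping used in this article, while the class and method names (`UsageSampler`, `buildDoc`) are invented for illustration, and the actual indexing call depends on your ES client:

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.LinkedHashMap;
import java.util.Map;

public class UsageSampler {
    static final DateTimeFormatter FMT = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss");

    // Build one statistics document matching the index mapping below
    // (id, tenantId, size, time). Indexing it is left to the ES client in use.
    static Map<String, Object> buildDoc(long id, long tenantId, long sizeBytes, LocalDateTime sampledAt) {
        Map<String, Object> doc = new LinkedHashMap<>();
        doc.put("id", id);
        doc.put("tenantId", tenantId);
        doc.put("size", sizeBytes);
        doc.put("time", sampledAt.format(FMT));
        return doc;
    }

    public static void main(String[] args) {
        Map<String, Object> doc = buildDoc(1, 1, 1024, LocalDateTime.of(2023, 7, 10, 0, 0, 0));
        System.out.println(doc); // {id=1, tenantId=1, size=1024, time=2023-07-10 00:00:00}
    }
}
```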
Create ES index
PUT /bucket_size
{
  "settings": {
    "number_of_shards": 6,
    "number_of_replicas": 0
  },
  "mappings": {
    "properties": {
      "id": {
        "type": "long"
      },
      "size": {
        "type": "long"
      },
      "tenantId": {
        "type": "long"
      },
      "time": {
        "type": "date",
        "format": "yyyy-MM-dd HH:mm:ss||yyyy-MM-dd||epoch_millis"
      }
    }
  }
}
Test Data
{
  "id": "1",
  "tenantId": 1,
  "size": 1024,
  "time": "2023-07-17 18:00:00"
}
{
  "id": "2",
  "tenantId": 1,
  "size": 2048,
  "time": "2023-07-17 19:00:00"
}
{
  "id": "3",
  "tenantId": 1,
  "size": 1024,
  "time": "2023-07-17 10:00:00"
}
{
  "id": "4",
  "tenantId": 2,
  "size": 1024,
  "time": "2023-07-17 09:00:00"
}
{
  "id": "5",
  "tenantId": 2,
  "size": 0,
  "time": "2023-07-17 10:00:00"
}
{
  "id": "6",
  "tenantId": 2,
  "size": 1024,
  "time": "2023-07-17 11:11:00"
}
Query the tenant's daily usage
Requirement: given a list of tenant IDs, a start time, and an end time, return each tenant's daily usage within the specified period.
GET /bucket_size/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "terms": {
            "tenantId": [1, 2],
            "boost": 1
          }
        },
        {
          "range": {
            "time": {
              "from": "2023-07-01",
              "to": "2023-07-31",
              "include_lower": true,
              "include_upper": true,
              "boost": 1
            }
          }
        }
      ],
      "adjust_pure_negative": true,
      "boost": 1
    }
  },
  "aggregations": {
    "tenantGroup": {
      "terms": {
        "field": "tenantId",
        "size": 10,
        "min_doc_count": 1,
        "shard_min_doc_count": 0,
        "show_term_doc_count_error": false,
        "order": [
          { "_count": "desc" },
          { "_key": "asc" }
        ]
      },
      "aggregations": {
        "groupDay": {
          "date_histogram": {
            "field": "time",
            "format": "yyyy-MM-dd",
            "calendar_interval": "1d",
            "offset": 0,
            "order": { "_key": "asc" },
            "keyed": false,
            "extended_bounds": {
              "min": "2023-07-01",
              "max": "2023-07-31"
            }
          },
          "aggregations": {
            "maxSize": {
              "max": {
                "field": "size",
                "missing": 0
              }
            }
          }
        }
      }
    }
  }
}
Result
{
  "took": 3,
  "timed_out": false,
  "_shards": {
    "total": 6,
    "successful": 6,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 6,
      "relation": "eq"
    },
    "max_score": 2,
    "hits": [
      {
        "_index": "bucket_size",
        "_type": "_doc",
        "_id": "2",
        "_score": 2,
        "_source": {
          "id": "2",
          "tenantId": 1,
          "size": 2048,
          "time": "2023-07-17 19:00:00"
        }
      }
    ]
  },
  "aggregations": {
    "tenantGroup": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": 1,
          "doc_count": 3,
          "groupDay": {
            "buckets": [
              {
                "key_as_string": "2023-07-01",
                "key": 1688169600000,
                "doc_count": 0,
                "maxSize": {
                  "value": null
                }
              },
              {
                "key_as_string": "2023-07-02",
                "key": 1688256000000,
                "doc_count": 0,
                "maxSize": {
                  "value": null
                }
              }
            ]
          }
        },
        {
          "key": 2,
          "doc_count": 3,
          "groupDay": {
            "buckets": [
              {
                "key_as_string": "2023-07-31",
                "key": 1690761600000,
                "doc_count": 0,
                "maxSize": {
                  "value": null
                }
              }
            ]
          }
        }
      ]
    }
  }
}
Implementation in Java
public Map<Long, Map<String, Long>> getTenantSize(Long[] tenantIds, String monthStartDate, String monthEndDate) throws IOException {
    Map<Long, Map<String, Long>> map = new TreeMap<>();
    // Filter by tenant IDs and by the statistics time range
    BoolQueryBuilder queryBuilder = QueryBuilders.boolQuery();
    queryBuilder.must(QueryBuilders.termsQuery("tenantId", Arrays.asList(tenantIds)));
    queryBuilder.must(QueryBuilders.rangeQuery("time").gte(monthStartDate).lte(monthEndDate));
    // Group by tenant, then by day; take the maximum size within each day
    AggregationBuilder tenantGroup = AggregationBuilders.terms("tenantGroup").field("tenantId")
            .subAggregation(AggregationBuilders.dateHistogram("groupDay").field("time")
                    .calendarInterval(DateHistogramInterval.DAY)
                    .format(DatePattern.NORM_DATE_PATTERN)
                    .order(BucketOrder.key(true))
                    // extendedBounds forces empty buckets for days without data
                    .extendedBounds(new LongBounds(monthStartDate, monthEndDate))
                    .subAggregation(AggregationBuilders.max("maxSize").field("size")));
    Aggregations aggregations = esClient.search(queryBuilder, tenantGroup, "bucket_size");
    Map<String, Aggregation> tenantGroupMap = aggregations.asMap();
    if (MapUtil.isNotEmpty(tenantGroupMap)) {
        tenantGroupMap.forEach((k, v) -> {
            Terms terms = (Terms) v;
            List<? extends Terms.Bucket> buckets = terms.getBuckets();
            if (CollUtil.isNotEmpty(buckets)) {
                buckets.forEach(bucket -> {
                    Map<String, Long> daySizeMap = new TreeMap<>();
                    Map<String, Aggregation> dayGroup = bucket.getAggregations().asMap();
                    if (MapUtil.isNotEmpty(dayGroup)) {
                        dayGroup.forEach((key, value) -> {
                            ParsedDateHistogram daySizeTerms = (ParsedDateHistogram) value;
                            List<? extends Histogram.Bucket> daySizeBucket = daySizeTerms.getBuckets();
                            if (CollUtil.isNotEmpty(daySizeBucket)) {
                                daySizeBucket.forEach(daySize -> {
                                    ParsedMax maxSize = daySize.getAggregations().get("maxSize");
                                    // Empty buckets report -Infinity for max; treat them as 0
                                    long size = maxSize.getValue() != Double.NEGATIVE_INFINITY
                                            ? Double.valueOf(maxSize.getValue()).longValue() : 0L;
                                    daySizeMap.put(daySize.getKeyAsString(), size);
                                });
                            }
                        });
                    }
                    map.put(Long.valueOf(bucket.getKeyAsString()), daySizeMap);
                });
            }
        });
    }
    return map;
}
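The per-day map returned above can then be reduced to each tenant's monthly usage, i.e. the average of the daily maxima. A minimal sketch, assuming sizes are in bytes and the map covers exactly the billing period (the class and method names are invented for this example):

```java
import java.util.Map;
import java.util.TreeMap;

public class MonthlyBilling {

    // daySizeMap: day ("yyyy-MM-dd") -> maximum bytes used that day,
    // as produced by getTenantSize. Monthly usage = average of the daily maxima.
    static double monthlyUsageBytes(Map<String, Long> daySizeMap) {
        if (daySizeMap.isEmpty()) {
            return 0;
        }
        return daySizeMap.values().stream().mapToLong(Long::longValue).sum()
                / (double) daySizeMap.size();
    }

    public static void main(String[] args) {
        Map<String, Long> days = new TreeMap<>();
        days.put("2023-07-01", 1024L);
        days.put("2023-07-02", 2048L);
        System.out.println(monthlyUsageBytes(days)); // 1536.0
    }
}
```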
Summary
This article introduced how to compute storage usage with Elasticsearch, covering Elasticsearch grouped (aggregation) queries and how to call them from Java code. A few points to note:
- When querying July 1st to July 31st, days with no data in ES are still returned: the date_histogram aggregation with extended_bounds forces empty buckets, whose maxSize is null.
- After grouping, the results should be sorted by time.
- Within each day's bucket, the max aggregation takes the largest size of the day as that day's storage usage.
- Elasticsearch grouped queries consume a lot of memory, and this query nests three levels of aggregation; the time range and number of tenants should not be too large, otherwise it can cause an OOM.
- In this example, storage is sampled every 30 minutes, so anything uploaded and deleted within the same 30-minute window escapes billing.
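To close that gap, the alternative mentioned in the introduction is to update a counter on every upload and delete and keep the day's running maximum. A minimal, thread-safe, in-memory sketch (the class name is invented; persisting the maximum to ES at the end of the day is left out):

```java
import java.util.concurrent.atomic.AtomicLong;

public class ExactDailyMax {
    private final AtomicLong current = new AtomicLong(); // current bucket usage in bytes
    private final AtomicLong dayMax = new AtomicLong();  // maximum usage seen today

    // Call on every upload: bump current usage and record a new maximum if reached.
    void onUpload(long bytes) {
        long now = current.addAndGet(bytes);
        dayMax.accumulateAndGet(now, Math::max);
    }

    // Call on every delete: only lowers current usage, never the recorded maximum.
    void onDelete(long bytes) {
        current.addAndGet(-bytes);
    }

    long dayMax() {
        return dayMax.get();
    }

    public static void main(String[] args) {
        ExactDailyMax stats = new ExactDailyMax();
        stats.onUpload(1024);  // morning: upload 1024 bytes, peak becomes 1024
        stats.onDelete(1024);  // afternoon: delete them, peak is unchanged
        System.out.println(stats.dayMax()); // 1024
    }
}
```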