[Elasticsearch] How does Elasticsearch physically delete historical data for a given period?

1 Overview

Reprinted: https://blog.csdn.net/laoyang360/article/details/80038930

1. Preface
When it comes to deleting data in Elasticsearch, the first thing that comes to mind is the delete API, which covers deleting documents and deleting indexes. For historical data, the obvious approach is to delete the documents that match a given condition with delete_by_query.
In practice, two questions come up:

  • After documents are deleted, why does disk usage not drop immediately, and sometimes even grow?
  • Apart from scheduled tasks + delete_by_query, is there a better way?

2. Common delete operations
2.1 Delete a single document

DELETE /twitter/_doc/1

2.2 Delete documents that meet the given conditions

POST twitter/_delete_by_query
{
  "query": {
    "match": {
      "message": "some message"
    }
  }
}

Note: A bulk delete may hit version conflicts, which abort the request by default. To count the conflicts and keep going instead, add conflicts=proceed:

POST twitter/_doc/_delete_by_query?conflicts=proceed
{
  "query": {
    "match_all": {}
  }
}
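
As a rough illustration of the note above, the following Python sketch builds the same kind of request and reads the counters that reveal whether conflicts were skipped. Nothing is sent over the network. The helper names `build_delete_by_query` and `summarize` are made up for this example, the sample response is hypothetical and trimmed, and the path uses the typeless form from the first example in 2.2.

```python
import json
from urllib.parse import urlencode


def build_delete_by_query(index, query, proceed_on_conflict=True):
    """Return (path, body) for a _delete_by_query request; nothing is sent."""
    path = f"/{index}/_delete_by_query"
    if proceed_on_conflict:
        # conflicts=proceed: count version conflicts instead of aborting
        path += "?" + urlencode({"conflicts": "proceed"})
    return path, json.dumps({"query": query})


def summarize(response):
    """Pick out the counters that show how the bulk delete went."""
    return {
        "deleted": response.get("deleted", 0),
        "version_conflicts": response.get("version_conflicts", 0),
        "failures": len(response.get("failures", [])),
    }


path, body = build_delete_by_query("twitter", {"match_all": {}})
print(path)

# Hypothetical, trimmed response used only to exercise summarize()
sample = {"total": 119, "deleted": 117, "version_conflicts": 2, "failures": []}
print(summarize(sample))
```

A non-zero version_conflicts count means some documents changed while the delete was running and were skipped, so the request may need to be run again.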

2.3 Delete a single index

DELETE /twitter

2.4 Delete all indexes

DELETE /_all

or

DELETE /*

Deleting all indexes is a very dangerous operation, so be careful.

3. What happens behind the scenes when a document is deleted?
The response to a delete request looks like this:

{
  "_index": "test_index",
  "_type": "test_type",
  "_id": "22",
  "_version": 2,
  "result": "deleted",
  "_shards": {
    "total": 2,
    "successful": 1,
    "failed": 0
  },
  "_seq_no": 2,
  "_primary_term": 17
}

Interpretation:

Every document in an index is versioned.
When deleting a document, you can specify a version to make sure the document you are deleting has not been changed in the meantime.

Every write operation on a document, including a delete, increments its version.

When the data is actually removed:

Deleting a document doesn't immediately remove it from disk; it is only marked as deleted. Elasticsearch cleans up deleted documents in the background as you continue to index more data.

4. What is the difference between deleting an index and deleting a document?
1) Deleting an index releases disk space immediately; there is no "mark as deleted" step.

2) Deleting a document only marks it as deleted (an update likewise writes a new document and marks the old one as deleted). The space is released only when the segment file holding the marked document is rewritten, so the background segment merge in ES may physically remove old documents while it merges segment files.

But a shard may hold hundreds of segment files, so there is a good chance that the segments containing deleted documents are not merged for a long time and their space is not reclaimed. To free the space manually, run a force merge periodically with max_num_segments set to 1:

POST /_forcemerge?max_num_segments=1
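
The difference between marking and physically removing can be pictured with a toy model. The sketch below is not how Lucene stores data; `Segment`, `delete`, and `force_merge` are invented names, and document sizes are arbitrary numbers. It only shows why disk usage stays flat after a delete and drops after a merge.

```python
from dataclasses import dataclass, field


@dataclass
class Segment:
    docs: dict                                   # doc_id -> size in bytes
    deleted: set = field(default_factory=set)    # ids marked as deleted

    def disk_bytes(self):
        # Marked documents still occupy space until the segment is rewritten.
        return sum(self.docs.values())

    def live_docs(self):
        return {k: v for k, v in self.docs.items() if k not in self.deleted}


def delete(segments, doc_id):
    """Mark the document as deleted; the segment file itself is untouched."""
    for seg in segments:
        if doc_id in seg.docs:
            seg.deleted.add(doc_id)


def force_merge(segments):
    """Rewrite all segments into one, copying only the live documents."""
    merged = {}
    for seg in segments:
        merged.update(seg.live_docs())
    return [Segment(docs=merged)]


segments = [Segment({"a": 100, "b": 200}), Segment({"c": 300})]
before = sum(s.disk_bytes() for s in segments)
delete(segments, "b")
after_delete = sum(s.disk_bytes() for s in segments)
segments = force_merge(segments)
after_merge = sum(s.disk_bytes() for s in segments)
print(before, after_delete, after_merge)  # 600 600 400
```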

5. How to keep only the last 100 days of data?
With the above in mind, keeping only the most recent 100 days of data breaks down into two steps:

  • 1) Use delete_by_query to delete the data older than 100 days;
  • 2) Run a forcemerge to release the disk space manually.

The delete script is as follows:

#!/bin/sh
# Delete documents whose "pt" timestamp is more than 100 days old.
curl -H 'Content-Type: application/json' \
  -XPOST "http://192.168.1.101:9200/logstash_*/_delete_by_query?conflicts=proceed" \
  -d '{
    "query": {
        "range": {
            "pt": {
                "lt": "now-100d",
                "format": "epoch_millis"
            }
        }
    }
}'
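
For anyone who prefers to compute the cutoff on the client instead of relying on the now-100d date math, here is a minimal Python sketch of the same request. It assumes, as the script does, that pt holds epoch milliseconds; `cutoff_millis` and `build_request` are names invented for the example, and the request object is only constructed, never sent.

```python
import json
from datetime import datetime, timedelta, timezone
from urllib.request import Request


def cutoff_millis(now, days=100):
    """Epoch milliseconds of the instant `days` days before `now`."""
    return int((now - timedelta(days=days)).timestamp() * 1000)


def build_request(base_url, index_pattern, ts_field, cutoff_ms):
    """Build (but do not send) the delete_by_query request."""
    body = {
        "query": {
            "range": {ts_field: {"lt": cutoff_ms, "format": "epoch_millis"}}
        }
    }
    url = f"{base_url}/{index_pattern}/_delete_by_query?conflicts=proceed"
    return Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# A fixed "now" keeps the example reproducible; use datetime.now(timezone.utc) in practice.
now = datetime(2018, 4, 21, tzinfo=timezone.utc)
cutoff = cutoff_millis(now)
req = build_request("http://192.168.1.101:9200", "logstash_*", "pt", cutoff)
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
```

Passing the request to urllib.request.urlopen would execute the delete, so a dry run against a test index is advisable first.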

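The merge script is as follows:

#!/bin/sh
# Rewrite only the segments that contain deleted documents.
# max_num_segments is left out: newer Elasticsearch versions document it
# as mutually exclusive with only_expunge_deletes.
curl -XPOST 'http://192.168.1.101:9200/_forcemerge?only_expunge_deletes=true'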

6. Is there a more general method?
Yes: use Curator, the index management tool published by Elastic.

6.1 Introduction to curator
Main purpose: planning and managing ES indexes. It supports common operations such as create, delete, merge, reindex, and snapshot.

6.2 Curator official website address
http://t.cn/RuwN0oM

Git address: https://github.com/elastic/curator

6.3 Curator installation guide
Address: http://t.cn/RuwCkBD

Note:
There is no shortage of Curator blogs and tutorials, but old and new versions of Curator differ significantly. Follow the latest manual on the official website when deploying.
The old command-line syntax is no longer supported by the new version.

6.4 Curator command line operation

$ curator --help
Usage: curator [OPTIONS] ACTION_FILE

  Curator for Elasticsearch indices.

  See http://elastic.co/guide/en/elasticsearch/client/curator/current

Options:
  --config PATH  Path to configuration file. Default: ~/.curator/curator.yml
  --dry-run      Do not perform any changes.
  --version      Show the version and exit.
  --help         Show this message and exit.

Core files:

  • Configuration file config.yml: configures the ES address to connect to, logging, log level, etc.;
  • Action file action.yml: configures the operations to perform (several can be batched) and the index matching rules (prefix matching, regular expression matching, etc.).

6.5 Curator applicable scenarios
The most important point:

Taking deletion as an example: Curator makes it very simple to delete indexes older than x days. The prerequisite is that index names follow a specific pattern, such as one index per day: logstash_2018.04.05.

The naming pattern must correspond to the timestring under delete_indices in action.yml.
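
To see why the naming pattern matters, the sketch below mimics name-based age selection in plain Python. It is an illustration of the idea, not Curator's implementation; `indices_older_than` is a made-up helper, and the '%Y.%m.%d' timestring is an assumption matching the logstash_2018.04.05 example.

```python
from datetime import datetime, timedelta


def indices_older_than(names, prefix, timestring, days, today):
    """Return index names whose embedded date is more than `days` days before `today`."""
    cutoff = today - timedelta(days=days)
    selected = []
    for name in names:
        if not name.startswith(prefix):
            continue
        try:
            stamp = datetime.strptime(name[len(prefix):], timestring)
        except ValueError:
            # The name does not follow the pattern, so its age is unknown; skip it.
            continue
        if stamp < cutoff:
            selected.append(name)
    return selected


names = [
    "logstash_2018.04.05",
    "logstash_2017.12.01",
    "logstash_misc",
    "kibana_2017.01.01",
]
old = indices_older_than(names, "logstash_", "%Y.%m.%d", 100, datetime(2018, 4, 21))
print(old)  # ['logstash_2017.12.01']
```

An index whose name does not match the pattern is never selected, which is why a consistent naming scheme is a prerequisite for this kind of cleanup.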

7. Summary
Refer to the latest documentation on the official website; documentation for older versions can easily mislead.
Practice more; knowing is not enough.
medcl: The new ES 6.3 release has Index Lifecycle Management, which makes it easy to manage index retention periods.


Origin blog.csdn.net/qq_21383435/article/details/109280911