Deleting data from an Elasticsearch index (delete_by_query)

1. Delete data older than two months

To delete data older than two months from Elasticsearch, take the following steps.

First, calculate the date two months before the current time. Python's datetime module can do this; 60 days is used here as an approximation of two months.

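import datetime

# Get the current time in UTC (Elasticsearch treats date strings
# without a time zone as UTC)
now = datetime.datetime.now(datetime.timezone.utc)

# Compute the date two months ago, approximated as 60 days
two_months_ago = now - datetime.timedelta(days=60)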

Next, build the delete request and send it to Elasticsearch with the elasticsearch-py client library.

from elasticsearch import Elasticsearch

# Create the Elasticsearch connection (the URL form works with both
# 7.x and 8.x clients)
es = Elasticsearch("http://localhost:9200")

# Build the delete request
delete_query = {
    "query": {
        "range": {
            "timestamp": {
                "lt": two_months_ago.strftime("%Y-%m-%dT%H:%M:%S")  # format the date in a form Elasticsearch accepts
            }
        }
    }
}

# Send the delete request
es.delete_by_query(index='your_index_name', body=delete_query)

This removes documents older than two months from the index. Deletion is irreversible, so use it with caution and test it thoroughly before running it in production. Adjust the index name and the date field name (timestamp here) to match your own setup.
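The timedelta(days=60) above only approximates two months. If a calendar-accurate cutoff matters, a minimal sketch using only the standard library could look like the following. The helper name months_ago is hypothetical and not part of the original code.

```python
import calendar
from datetime import datetime, timezone


def months_ago(moment: datetime, months: int) -> datetime:
    """Return `moment` shifted back by whole calendar months.

    The day is clamped to the last day of the target month, so
    30 April minus two months gives 28 (or 29) February.
    """
    # Count months from year 0 so crossing a year boundary is plain arithmetic.
    total = moment.year * 12 + (moment.month - 1) - months
    year, month_index = divmod(total, 12)
    month = month_index + 1
    last_day = calendar.monthrange(year, month)[1]
    return moment.replace(year=year, month=month, day=min(moment.day, last_day))


now = datetime.now(timezone.utc)
cutoff = months_ago(now, 2)
# Same string format as the strftime call in the delete request above
print(cutoff.strftime("%Y-%m-%dT%H:%M:%S"))
```

The resulting string can be used as the "lt" bound of the range query in place of the 60-day value.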

2. Delete data by alias

An Elasticsearch alias can be used to delete data from multiple indexes at once. An alias is a reference to one or more indexes, so a delete_by_query request sent to the alias runs against every index the alias points to. (Deleting the alias itself removes only the reference, not the data.)

The following sample code uses the elasticsearch-py library to delete data from multiple indexes at once:

from elasticsearch import Elasticsearch

# Create the Elasticsearch connection (the URL form works with both
# 7.x and 8.x clients)
es = Elasticsearch("http://localhost:9200")

# Alias of the indexes whose data should be deleted
index_alias = "your_index_alias"

# Build the delete request
delete_query = {
    "query": {
        "range": {
            "timestamp": {
                "lt": "now-2M"  # delete data older than two months
            }
        }
    }
}

# Send the delete request, targeting the alias
es.delete_by_query(index=index_alias, body=delete_query)

In this example, the index_alias variable holds the alias of the indexes to delete from, and es.delete_by_query() sends the delete request with the range of data to remove. The Elasticsearch date math expression "now-2M" stands for the point in time two months ago, so the request deletes data older than two months from every index associated with the alias.

Deleting through an alias is a powerful operation because one request touches many indexes. Use it with caution and test it thoroughly before running it in production. Adjust the alias and field names to match your own setup.
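One way to act on that caution is to count the matching documents first and refuse to delete if the number looks wrong. The sketch below assumes an elasticsearch-py client is passed in and follows this article's body= calling convention. The function name delete_if_expected and the max_expected threshold are hypothetical.

```python
def delete_if_expected(client, index, query_body, max_expected):
    """Run delete_by_query only if the number of matches looks sane.

    `client` is assumed to be an elasticsearch-py client (or anything with
    compatible count() and delete_by_query() methods).
    """
    # The count API accepts the same {"query": ...} body as delete_by_query.
    matched = client.count(index=index, body=query_body)["count"]
    if matched > max_expected:
        raise RuntimeError(
            f"{matched} documents match, more than the expected {max_expected}; "
            "refusing to delete"
        )
    if matched == 0:
        return 0
    response = client.delete_by_query(index=index, body=query_body)
    return response["deleted"]


# Example usage against a real cluster (not executed here):
# deleted = delete_if_expected(es, index_alias, delete_query, max_expected=100_000)
```

The count and the delete are two separate requests, so the numbers can differ slightly if documents are being indexed in between.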

3. Is this a logical (soft) delete?

Deleting data with Elasticsearch's delete_by_query method is not a logical delete. It runs a query and then deletes every document that matches it, so those documents are really removed from the index rather than just hidden.

Logical deletion is usually implemented by adding a flag field (such as deleted) to the document and setting it to true, then filtering out the flagged documents at query time. The documents stay in the index, which preserves their history for later recovery or querying.
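As an illustration of the flag-field approach, the following sketch builds the two request bodies involved: a partial update that sets the flag, and a search body that excludes flagged documents. The helper names are hypothetical, and the field name deleted is the example used above.

```python
def soft_delete_update(flag_field="deleted"):
    """Partial-document body for the update API that flags a document as deleted."""
    return {"doc": {flag_field: True}}


def exclude_soft_deleted(query, flag_field="deleted"):
    """Wrap a query so that documents flagged as deleted are filtered out.

    Documents that do not have the flag field at all are still returned.
    """
    return {
        "query": {
            "bool": {
                "must": [query],
                "must_not": [{"term": {flag_field: True}}],
            }
        }
    }


# Example usage with an elasticsearch-py client (not executed here):
# es.update(index="your_index_name", id="some_id", body=soft_delete_update())
# es.search(index="your_index_name", body=exclude_soft_deleted({"match_all": {}}))
print(exclude_soft_deleted({"match": {"title": "report"}}))
```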

With delete_by_query, by contrast, the matching documents are removed from the index and cannot be restored. The operation also carries some overhead and can affect Elasticsearch performance. Use it only after careful consideration, and test and verify it thoroughly before running it in production.

4. Does it still take up storage space after deleting data?

When documents are deleted with delete_by_query, Elasticsearch only marks them as deleted; it does not remove them from disk immediately. The marked documents are purged from disk later, during segment merge operations.

Deleted documents therefore keep occupying storage until a segment merge is triggered and completes. Even after a large number of documents has been deleted with delete_by_query, disk space is not released right away; it is released once merging has cleaned up the affected segments.

The _delete_by_query API accepts a refresh parameter. With refresh=true, the affected shards are refreshed once the request completes, so the deletions become visible to search immediately (by default, Elasticsearch refreshes periodically to make recent changes searchable). A refresh does not free disk space; that still happens only when segments are merged. If space must be reclaimed sooner, a force merge with only_expunge_deletes=true can be run, although it is an expensive operation in its own right. When deleting large amounts of data in production, proceed cautiously and monitor storage usage closely.
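Because space is only reclaimed when segments merge, one way to hurry that along is a force merge restricted to segments that contain deleted documents. The sketch below assumes an elasticsearch-py client is passed in and uses its indices.stats and indices.forcemerge calls. The function name expunge_deleted_docs is hypothetical.

```python
def expunge_deleted_docs(client, index):
    """Force-merge away deleted documents and report the counts before and after.

    `client` is assumed to be an elasticsearch-py client. Returns a tuple
    (deleted_before, deleted_after) of documents still marked as deleted
    on the primary shards.
    """
    def deleted_count():
        stats = client.indices.stats(index=index, metric="docs")
        return stats["_all"]["primaries"]["docs"]["deleted"]

    before = deleted_count()
    # only_expunge_deletes limits the merge to segments containing deletes.
    client.indices.forcemerge(index=index, only_expunge_deletes=True)
    after = deleted_count()
    return before, after


# Example usage against a real cluster (not executed here):
# before, after = expunge_deleted_docs(es, "your_index_name")
```

A force merge is I/O intensive, so it is best run during quiet periods, and the count after the merge will not necessarily drop to zero.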

5. Does using delete_by_query affect large-scale data storage?

Deleting a large amount of data with delete_by_query can affect both performance and storage space. The method has to run a query and mark every matching document as deleted, and the marked documents keep occupying storage until later segment merges clean them up.

For large-scale deletions, more efficient approaches are recommended, such as deleting an entire index or using time-based indexes so that old data can be dropped periodically. This avoids the query load and storage pressure that delete_by_query creates.
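When delete_by_query cannot be avoided for a large deletion, its load can be reduced by running it as a throttled background task. The sketch below assumes an elasticsearch-py client is passed in and uses the conflicts, wait_for_completion and requests_per_second parameters of delete_by_query. The function name and the default throttle value are hypothetical.

```python
def start_background_delete(client, index, query_body, requests_per_second=500):
    """Start a throttled delete_by_query as a background task and return its task id.

    `client` is assumed to be an elasticsearch-py client.
    """
    response = client.delete_by_query(
        index=index,
        body=query_body,
        conflicts="proceed",          # skip version conflicts instead of aborting
        wait_for_completion=False,    # return at once with a task id
        requests_per_second=requests_per_second,  # throttle the delete batches
    )
    return response["task"]


# Example usage against a real cluster (not executed here):
# task_id = start_background_delete(es, "your_index_name", delete_query)
# es.tasks.get(task_id=task_id)  # check progress later
```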

For example, the following approaches can be considered:

  • Delete the entire index: If the data to delete is large and covers a clear time range, delete the whole index. For example, with one index per month, an index can be deleted as soon as all of its data is more than two months old.
  • Use time-based indexes: Store data in separate indexes according to its timestamp, for example one index per day or per hour. When the data expires, deleting the corresponding index removes a large amount of data quickly.

Both approaches avoid the performance and storage impact of delete_by_query and suit scenarios where large amounts of data are deleted. Choose a deletion strategy based on your business needs and data volume.
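A minimal sketch of the one-index-per-month approach: given a list of index names, pick those whose month lies entirely outside the retention window. The naming scheme prefix plus YYYY.MM (for example logs-2023.01) and the function name are assumptions made for illustration.

```python
from datetime import date


def expired_monthly_indexes(index_names, prefix, today, keep_months=2):
    """Return index names such as 'logs-2023.01' that are past retention.

    An index is expired when its month is more than `keep_months` months
    before the month of `today`, so all of its data is older than that.
    Names that do not match `prefix` + 'YYYY.MM' are ignored.
    """
    current = today.year * 12 + today.month
    expired = []
    for name in index_names:
        if not name.startswith(prefix):
            continue
        try:
            year_text, month_text = name[len(prefix):].split(".")
            index_month = int(year_text) * 12 + int(month_text)
        except ValueError:
            continue  # not in the expected naming scheme
        if current - index_month > keep_months:
            expired.append(name)
    return expired


names = ["logs-2023.01", "logs-2023.02", "logs-2023.03", "logs-2023.04"]
print(expired_monthly_indexes(names, "logs-", date(2023, 4, 17)))
# Each expired index could then be removed with an elasticsearch-py client:
# es.indices.delete(index=name)
```

Deleting an index removes its files directly, so the space is released without waiting for segment merges.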

Origin blog.csdn.net/weixin_44799217/article/details/130192119