Elasticsearch reference version: 5.5
_delete_by_query deletes every document that matches the query. It is used as follows:
curl -X POST "localhost:9200/twitter/_delete_by_query" -H 'Content-Type: application/json' -d'
{
"query": {
"match": {
"name": "测试删除"
}
}
}
'
The query must be passed as the value of a query key, in the same way as the Search API. You can also use the q URL parameter, which has the same effect as the request body above.
The response tells you how long the operation took and how many documents were deleted:
{
  "took" : 147,
  "timed_out": false,
  "deleted": 119,
  "batches": 1,
  "version_conflicts": 0,
  "noops": 0,
  "retries": {
    "bulk": 0,
    "search": 0
  },
  "throttled_millis": 0,
  "requests_per_second": -1.0,
  "throttled_until_millis": 0,
  "total": 119,
  "failures" : [ ]
}
When the request starts, _delete_by_query takes a snapshot of the index (database) and uses the internal version numbers to find the documents to delete. This means that if a document changes between the time the snapshot is taken and the time the delete request is processed, there will be a version conflict; only documents whose version still matches are deleted.
Because internal versioning does not treat 0 as a valid version number, documents with version 0 cannot be deleted with _delete_by_query, and such a request will fail.
During execution, _delete_by_query runs multiple sequential search requests in order to find all matching documents. Each time a batch of documents is found, a corresponding bulk request is executed to delete them. If a search or bulk request is rejected, _delete_by_query retries the rejected request using the default policy (up to 10 times). Once the maximum number of retries is exceeded, _delete_by_query aborts and reports all failures in the failures field of the response. Deletions that have already been performed are kept; in other words, the process is not rolled back, only interrupted.
When the abort is caused by failures on a request, all error messages of the bulk request that caused the abort are recorded in the failures element of the response, so there may be quite a few failed entries.
If you want to count version conflicts rather than abort on them, set conflicts=proceed in the URL or "conflicts": "proceed" in the request body.
Returning to the API format: you can limit _delete_by_query to a single type (analogous to a table).
The following deletes all documents of type tweet in the index (analogous to a database) twitter:
curl -X POST "localhost:9200/twitter/tweet/_delete_by_query?conflicts=proceed" -H 'Content-Type: application/json' -d'
{
"query": {
"match_all": {}
}
}
'
It is also possible to delete documents of multiple types (tables) across multiple indexes (databases) at once. For example:
curl -X POST "localhost:9200/twitter,blog/tweet,post/_delete_by_query" -H 'Content-Type: application/json' -d'
{
"query": {
"match_all": {}
}
}
'
If you provide routing, the routing value is copied to the scroll query, limiting the process to the shards that match that routing value:
curl -X POST "localhost:9200/twitter/_delete_by_query?routing=1" -H 'Content-Type: application/json' -d'
{
"query": {
"range" : {
"age" : {
"gte" : 10
}
}
}
}
'
By default, _delete_by_query scrolls through batches of 1000 documents. You can change the batch size with the scroll_size URL parameter:
curl -X POST "localhost:9200/twitter/_delete_by_query?scroll_size=5000" -H 'Content-Type: application/json' -d'
{
"query": {
"term": {
"user": "kimchy"
}
}
}
'
URL Parameters
In addition to standard parameters like pretty, the Delete By Query API also supports refresh, wait_for_completion, wait_for_active_shards and timeout.
Sending the refresh parameter causes all shards involved in the delete by query to be refreshed once the request completes. This is different from the refresh parameter of the Delete API, which only refreshes the shard that received the delete request.
If the request contains wait_for_completion=false, Elasticsearch performs some preflight checks, launches the request, and returns a task that can be used with the Tasks APIs to cancel the task or get its status. Elasticsearch also creates a document for this task at .tasks/task/${taskId}. You can keep or delete this document as you see fit; once you delete it, Elasticsearch reclaims its space.
wait_for_active_shards controls how many shard copies must be active before the request is processed (see the wait_for_active_shards documentation for details). timeout controls how long each write request waits for unavailable shards to become available. Both work exactly as they do in the Bulk API.
requests_per_second can be set to any positive decimal number (1.4, 6, 1000, etc.) to throttle the rate at which delete by query issues its batches of delete operations, or to -1 to disable the throttling. The throttling is done by waiting between batches, so that the scroll used internally can be given a timeout that accounts for the padding. The padding time is the difference between the time the batch took to complete and the target time of requests_in_the_batch / requests_per_second. Since the batch is not broken into multiple bulk requests, a large batch causes Elasticsearch to create many requests at once and then wait before starting the next batch; the throttling is therefore "bursty" rather than "smooth". The default is -1.
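The padding formula above can be sketched in a few lines of Python. This is an illustration only; the function name is mine, not Elasticsearch internals:

```python
# Sketch of the wait inserted between batches: the target time for a
# batch is requests_in_the_batch / requests_per_second, and the API
# sleeps for whatever part of it the bulk write did not use up.
def throttle_wait(requests_in_batch, requests_per_second, write_time):
    """Seconds to sleep before the next batch (never negative)."""
    if requests_per_second <= 0:  # -1 disables throttling
        return 0.0
    target_time = requests_in_batch / requests_per_second
    return max(0.0, target_time - write_time)

# A 1000-document batch at requests_per_second=500 targets 2 s per batch;
# if the bulk write took 0.5 s, the API sleeps the remaining 1.5 s.
print(throttle_wait(1000, 500, 0.5))  # → 1.5
```

Note how a batch that takes longer than its target time produces no extra wait at all, which is exactly why a large scroll_size makes the throttling bursty.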
Response body
The JSON response looks like this:
{
  "took" : 639,
  "deleted": 0,
  "batches": 1,
  "version_conflicts": 2,
  "retries": 0,
  "throttled_millis": 0,
  "failures" : [ ]
}
Parameter Description
took: the number of milliseconds from the start to the end of the whole operation.
deleted: the number of documents that were successfully deleted.
batches: the number of scroll responses pulled back by the delete by query.
version_conflicts: the number of version conflicts the delete by query hit during execution.
retries: the number of retries attempted by the delete by query in response to a full queue.
throttled_millis: the number of milliseconds the request slept to conform to requests_per_second.
failures: an array of failures; if it is non-empty, the request aborted because of those failures. See the section on conflicts=proceed above for how to prevent version conflicts from aborting the operation.
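As a client-side illustration of these fields, here is a small sketch that summarizes a delete by query response and flags an aborted request. The helper name and message format are my own; the sample dict mirrors the response body shown above:

```python
# Hypothetical helper: summarize a _delete_by_query response dict.
# Field names follow the response body documented above.
def summarize(resp):
    # A non-empty failures array means the request aborted.
    if resp.get("failures"):
        raise RuntimeError(f"aborted with {len(resp['failures'])} failure(s)")
    return (f"deleted {resp['deleted']} docs in {resp['batches']} batch(es), "
            f"{resp['version_conflicts']} version conflict(s), "
            f"took {resp['took']} ms")

resp = {"took": 639, "deleted": 0, "batches": 1, "version_conflicts": 2,
        "retries": 0, "throttled_millis": 0, "failures": []}
print(summarize(resp))
# → deleted 0 docs in 1 batch(es), 2 version conflict(s), took 639 ms
```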
Works with the Task API
You can fetch the status of any running delete-by-query request with the Task API:
curl -X GET "localhost:9200/_tasks?detailed=true&actions=*/delete/byquery"
The response:
{
  "nodes" : {
    "r1A2WoRbTwKZ516z6NEs5A" : {
      "name" : "r1A2WoR",
      "transport_address" : "127.0.0.1:9300",
      "host" : "127.0.0.1",
      "ip" : "127.0.0.1:9300",
      "attributes" : {
        "testattr" : "test",
        "portsfile" : "true"
      },
      "tasks" : {
        "r1A2WoRbTwKZ516z6NEs5A:36619" : {
          "node" : "r1A2WoRbTwKZ516z6NEs5A",
          "id" : 36619,
          "type" : "transport",
          "action" : "indices:data/write/delete/byquery",
          "status" : {    ①
            "total" : 6154,
            "updated" : 0,
            "created" : 0,
            "deleted" : 3500,
            "batches" : 36,
            "version_conflicts" : 0,
            "noops" : 0,
            "retries": 0,
            "throttled_millis": 0
          },
          "description" : ""
        }
      }
    }
  }
}
① This object contains the actual status. Its body has the same format as the delete by query response JSON, with the important addition of the total field: total is the number of operations the whole process expects to perform. You can estimate progress by adding up the updated, created, and deleted fields; the request finishes when their sum equals total.
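The progress estimate just described can be computed client-side from the status object. A minimal sketch (the helper name is mine; field names follow the sample response above):

```python
# Hypothetical helper: estimate progress from a task "status" object
# like the one in the sample response above.
def task_progress(status):
    """Fraction of the expected operations already performed (0.0-1.0)."""
    done = status["updated"] + status["created"] + status["deleted"]
    return done / status["total"] if status["total"] else 1.0

status = {"total": 6154, "updated": 0, "created": 0, "deleted": 3500,
          "batches": 36, "version_conflicts": 0, "noops": 0,
          "retries": 0, "throttled_millis": 0}
print(f"{task_progress(status):.1%}")  # → 56.9%
```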
With the task id you can look up the task directly:
curl -X GET "localhost:9200/_tasks/taskId:1"
The advantage of this API is that it integrates with wait_for_completion=false to transparently return the status of completed tasks. If the task is completed and wait_for_completion=false was set on it, the call returns a results or an error field. The cost of this feature is that wait_for_completion=false creates a document at .tasks/task/${taskId}; it is up to you to delete that document.
curl -X POST "localhost:9200/_tasks/task_id:1/_cancel"
You can find the task_id with the Task API described above.
Cancellation should happen quickly, but may take a few seconds; the task status API above will keep listing the task until it wakes up and cancels itself.
Rethrottling
The value of requests_per_second can be changed on a running delete by query using the _rethrottle API:
curl -X POST "localhost:9200/_delete_by_query/task_id:1/_rethrottle?requests_per_second=-1"
Use the Task API described above to find the task_id.
Just as when setting it on _delete_by_query, requests_per_second can be -1 to disable throttling, or any decimal number like 1.7 or 12 to throttle to that level. Rethrottling that speeds up the query takes effect immediately, while rethrottling that slows it down takes effect only after the current batch completes. This prevents scroll timeouts.
Manual slicing
Delete by query supports sliced scroll, which makes it relatively easy to parallelize the process manually:
curl -X POST "localhost:9200/twitter/_delete_by_query" -H 'Content-Type: application/json' -d'
{
"slice": {
"id": 0,
"max": 2
},
"query": {
"range": {
"likes": {
"lt": 10
}
}
}
}
'
curl -X POST "localhost:9200/twitter/_delete_by_query" -H 'Content-Type: application/json' -d'
{
"slice": {
"id": 1,
"max": 2
},
"query": {
"range": {
"likes": {
"lt": 10
}
}
}
}
'
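The two requests above differ only in the slice id. Building one request body per slice can be sketched client-side like this (the helper is hypothetical; the query mirrors the range query used above):

```python
# Hypothetical helper: build one _delete_by_query body per manual slice.
# Each body would be POSTed to /twitter/_delete_by_query, one request
# per slice, exactly as in the two curl commands above.
def sliced_bodies(max_slices, query):
    return [{"slice": {"id": i, "max": max_slices}, "query": query}
            for i in range(max_slices)]

bodies = sliced_bodies(2, {"range": {"likes": {"lt": 10}}})
# bodies[0]["slice"] == {"id": 0, "max": 2}
# bodies[1]["slice"] == {"id": 1, "max": 2}
```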
You can verify the result as follows:
curl -X GET "localhost:9200/_refresh"
curl -X POST "localhost:9200/twitter/_search?size=0&filter_path=hits.total" -H 'Content-Type: application/json' -d'
{
"query": {
"range": {
"likes": {
"lt": 10
}
}
}
}
'
A reasonable response has a total like this:
{
"hits": {
"total": 0
}
}
Automatic slicing
You can also let delete by query parallelize automatically, using sliced scroll to slice on _uid:
curl -X POST "localhost:9200/twitter/_delete_by_query?refresh&slices=5" -H 'Content-Type: application/json' -d'
{
"query": {
"range": {
"likes": {
"lt": 10
}
}
}
}
'
You can verify this by the following:
curl -X POST "localhost:9200/twitter/_search?size=0&filter_path=hits.total" -H 'Content-Type: application/json' -d'
{
"query": {
"range": {
"likes": {
"lt": 10
}
}
}
}
'
A total like the following is a reasonable result:
{
"hits": {
"total": 0
}
}
Adding slices to _delete_by_query automates the manual slicing shown in the section above, creating sub-requests, which means it has some quirks:
- You can see these sub-requests in the Tasks APIs. They are "child" tasks of the task for the request with slices.
- Fetching the status of the task for the request with slices only contains the status of completed slices.
- These sub-requests are individually addressable, for example for cancellation and rethrottling.
- Rethrottling the request with slices will rethrottle the unfinished sub-requests proportionally.
- Cancelling the request with slices will cancel each sub-request.
- Due to the nature of slices, each sub-request won't get a perfectly even share of the documents. All documents will be processed, but some slices will be larger than others; expect larger slices to have a more even distribution.
- Parameters like requests_per_second and size on a request with slices are distributed proportionally to each sub-request. Combined with the uneven distribution above, you should conclude that using size together with slices in _delete_by_query may not delete exactly size documents.
- Each sub-request gets a slightly different snapshot of the source index, though the snapshots are taken at approximately the same time.
Picking the number of slices
Here are some recommendations on the number of slices (if slicing manually, this is the max parameter of the slice API):
- Don't use large numbers. Setting slices to something like 500 creates sizeable CPU thrash.
(Thrashing here means the phenomenon where the CPU spends most of its time on overhead while the time spent doing real work is very short.)
- From a query performance standpoint, using some multiple of the number of shards in the source index is more efficient.
- From a query performance standpoint, using the same number of slices as shards in the source index is most efficient.
- Delete performance should scale linearly across available resources with the number of slices.
- Whether delete or query performance dominates depends on many factors, such as the documents being processed and cluster resources.
reference:
https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-delete-by-query.html
https://blog.csdn.net/u013066244/article/details/76258188