Elasticsearch: deleting data with _delete_by_query

ES reference version: Elasticsearch 5.5
_delete_by_query deletes every document that matches the query. It is used as follows:

curl -X POST "localhost:9200/twitter/_delete_by_query" -H 'Content-Type: application/json' -d'
{
  "query": { 
    "match": {
      "name": "测试删除"
    }
  }
}
'

The query must be passed as the value of a query key, in the same way as the Search API. You can also use the q URL parameter, just as in the Search API, to the same effect as the request above.
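
For example, a minimal sketch of the q form; the field and value here are illustrative:

curl -X POST "localhost:9200/twitter/_delete_by_query?q=user:kimchy"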

The response tells you how long the request took and how many documents were deleted:

{
  "took" : 147,
  "timed_out": false,
  "deleted": 119,
  "batches": 1,
  "version_conflicts": 0,
  "noops": 0,
  "retries": {
    "bulk": 0,
    "search": 0
  },
  "throttled_millis": 0,
  "requests_per_second": -1.0,
  "throttled_until_millis": 0,
  "total": 119,
  "failures" : [ ]
}

When it starts, _delete_by_query takes a snapshot of the index (database) and deletes the documents it finds using internal versioning. This means that if a document changes between the time the snapshot was taken and the time its deletion is processed, there is a version conflict. Only documents whose version still matches are deleted.

Because internal versioning does not support 0 as a valid version number, documents with version 0 cannot be deleted with _delete_by_query, and such requests will fail.

While _delete_by_query executes, multiple search requests are run in sequence in order to find all the matching documents. Each time a batch of documents is found, a corresponding bulk request is executed to delete them. If a search or bulk request is rejected, _delete_by_query retries the rejected request according to the default policy (up to 10 times). Exhausting the retries causes _delete_by_query to abort, and all failures are reported in the failures field of the response. Deletions that were already performed still stand; in other words, the process is not rolled back, only aborted.
Although it is the first failure that causes the abort, all failures produced by the failing bulk request are recorded in the failures element and returned, so it is possible for there to be quite a few failed entries.
If you would rather count version conflicts than have them abort the request, set conflicts=proceed in the URL or "conflicts": "proceed" in the request body.
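
For example, a minimal sketch of the request-body form, reusing the twitter index from above:

curl -X POST "localhost:9200/twitter/_delete_by_query" -H 'Content-Type: application/json' -d'
{
  "conflicts": "proceed",
  "query": {
    "match_all": {}
  }
}
'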

Back to the API format: you can limit _delete_by_query to a single type (think: table).
The following deletes all documents of type tweet from the index twitter (think: database):

curl -X POST "localhost:9200/twitter/_doc/_delete_by_query?conflicts=proceed" -H 'Content-Type: application/json' -d'
{
  "query": {
    "match_all": {}
  }
}
'

It is also possible to delete documents of multiple types (tables) from multiple indexes (databases) at once. For example:

curl -X POST "localhost:9200/twitter,blog/_docs,post/_delete_by_query" -H 'Content-Type: application/json' -d'
{
  "query": {
    "match_all": {}
  }
}
'

If you provide routing, the routing is copied to the scroll query, limiting the request to the shards that match that routing value:

curl -X POST "localhost:9200/twitter/_delete_by_query?routing=1" -H 'Content-Type: application/json' -d'
{
  "query": {
    "range" : {
        "age" : {
           "gte" : 10
        }
    }
  }
}
'

By default _delete_by_query uses scroll batches of 1000 documents. You can change the batch size with the scroll_size URL parameter:

curl -X POST "localhost:9200/twitter/_delete_by_query?scroll_size=5000" -H 'Content-Type: application/json' -d'
{
  "query": {
    "term": {
      "user": "kimchy"
    }
  }
}
'

URL parameters

In addition to standard parameters like pretty, the Delete By Query API also supports refresh, wait_for_completion, wait_for_active_shards, and timeout.

Sending the refresh parameter causes all shards involved in the delete-by-query to be refreshed once the request completes. This differs from the Delete API's refresh parameter, which refreshes only the shard that received the delete request.

If the request contains wait_for_completion=false, Elasticsearch performs a pre-flight check, starts the request, and returns a task that can be used with the Tasks API to cancel the task or get its status. Elasticsearch also records this task as a document at .tasks/task/${taskId}. You can keep or delete this record as you see fit; when you delete it, Elasticsearch reclaims its space.
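
For example, a minimal sketch of launching the delete as a background task; the query is illustrative:

curl -X POST "localhost:9200/twitter/_delete_by_query?wait_for_completion=false" -H 'Content-Type: application/json' -d'
{
  "query": {
    "match_all": {}
  }
}
'

The response contains a task field, e.g. {"task" : "r1A2WoRbTwKZ516z6NEs5A:36619"}, which you can pass to the Tasks API described below.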

wait_for_active_shards controls how many copies of a shard must be active before the request is processed. timeout controls how long each write request waits for unavailable shards to become available. Both work exactly as they do in the Bulk API.
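
For example, a sketch combining the two parameters; the values are illustrative, not recommendations:

curl -X POST "localhost:9200/twitter/_delete_by_query?wait_for_active_shards=2&timeout=1m" -H 'Content-Type: application/json' -d'
{
  "query": {
    "match_all": {}
  }
}
'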

requests_per_second can be set to any positive decimal number (1.4, 6, 1000, and so on) and throttles the rate at which _delete_by_query issues its delete operations, or it can be set to -1 to disable the throttling. The throttling is done by waiting between batches so that the internal scroll can be given a timeout that takes the wait into account. The wait time is the difference between the time the batch took to complete and the time requests_per_second * requests_in_the_batch allows. Since a batch is not broken into multiple bulk requests, a large batch causes Elasticsearch to issue several requests at once and then wait for a while before starting the next batch; the throttling is therefore "bursty" rather than "smooth". The default is -1.
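
For example, a minimal sketch that throttles the delete to roughly 500 requests per second; the value is illustrative:

curl -X POST "localhost:9200/twitter/_delete_by_query?requests_per_second=500" -H 'Content-Type: application/json' -d'
{
  "query": {
    "match_all": {}
  }
}
'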

Response body

The JSON response looks like this:

{
  "took" : 639,
  "deleted": 0,
  "batches": 1,
  "version_conflicts": 2,
  "retries": 0,
  "throttled_millis": 0,
  "failures" : [ ]
}

Parameter description
took — the number of milliseconds from the start to the end of the whole operation
deleted — the number of documents that were successfully deleted
batches — the number of scroll responses pulled back by the delete by query
version_conflicts — the number of version conflicts the delete by query hit during execution
retries — the number of retries attempted by the delete by query in response to a full queue
throttled_millis — the number of milliseconds the request slept to conform to requests_per_second
failures — an array of all failures during indexing (deletion); if it is non-empty, the request aborted because of those failures. See the section on conflicts above for how to prevent version conflicts from aborting the operation.

Works with the Task API

You can fetch the status of any running delete-by-query request with the Task API:

curl -X GET "localhost:9200/_tasks?detailed=true&actions=*/delete/byquery"

The response:

{
  "nodes" : {
    "r1A2WoRbTwKZ516z6NEs5A" : {
      "name" : "r1A2WoR",
      "transport_address" : "127.0.0.1:9300",
      "host" : "127.0.0.1",
      "ip" : "127.0.0.1:9300",
      "attributes" : {
        "testattr" : "test",
        "portsfile" : "true"
      },
      "tasks" : {
        "r1A2WoRbTwKZ516z6NEs5A:36619" : {
          "node" : "r1A2WoRbTwKZ516z6NEs5A",
          "id" : 36619,
          "type" : "transport",
          "action" : "indices:data/write/delete/byquery",
          "status" : {    ①
            "total" : 6154,
            "updated" : 0,
            "created" : 0,
            "deleted" : 3500,
            "batches" : 36,
            "version_conflicts" : 0,
            "noops" : 0,
            "retries": 0,
            "throttled_millis": 0
          },
          "description" : ""
        }
      }
    }
  }
}

① This object contains the actual status. The status has the same format as the response JSON, with the important addition of the total field. total is the total number of operations the process expects to perform. You can estimate the progress by adding up the updated, created, and deleted fields; the request finishes when their sum equals the total field.
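
For example, a sketch (an assumption, not from the reference) that pulls back just the status object for progress monitoring, reusing the task id from the listing above; filter_path merely trims the response:

curl -X GET "localhost:9200/_tasks/r1A2WoRbTwKZ516z6NEs5A:36619?filter_path=task.status"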

With the task id you can look up the task directly:

curl -X GET "localhost:9200/_tasks/taskId:1"

The advantage of this API is that it integrates with wait_for_completion=false to transparently return the status of completed tasks. If the task has completed and wait_for_completion=false was set on it, the lookup returns a results or an error field. The cost of this feature is the document that wait_for_completion=false creates at .tasks/task/${taskId}; it is up to you to delete that document.

Any running delete-by-query can be canceled with the task cancel API:

curl -X POST "localhost:9200/_tasks/task_id:1/_cancel"

The task_id can be found using the Tasks API above.
Cancellation should happen quickly, but may take a few seconds. The Tasks API above will continue to list the task until it wakes up and cancels itself.

curl -X POST "localhost:9200/_delete_by_query/task_id:1/_rethrottle?requests_per_second=-1"

Rethrottling

The value of requests_per_second can be changed on a running delete by query using the _rethrottle API:

curl -X POST "localhost:9200/_delete_by_query/task_id:1/_rethrottle?requests_per_second=-1"

As above, use the Tasks API to find the task_id.

Just like when setting it on _delete_by_query, requests_per_second can be either -1 to disable throttling or any decimal number, like 1.7 or 12, to throttle to that level. Rethrottling that speeds up the query takes effect immediately, but rethrottling that slows down the query takes effect only after completing the current batch. This prevents scroll timeouts.
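
For example, a sketch that slows a running delete down to roughly 500 requests per second instead of disabling throttling; the task id is the same placeholder used above:

curl -X POST "localhost:9200/_delete_by_query/task_id:1/_rethrottle?requests_per_second=500"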

Manually slicing

Delete-by-query supports Sliced Scroll, which makes it relatively easy to parallelize the delete process manually:

curl -X POST "localhost:9200/twitter/_delete_by_query" -H 'Content-Type: application/json' -d'
{
  "slice": {
    "id": 0,
    "max": 2
  },
  "query": {
    "range": {
      "likes": {
        "lt": 10
      }
    }
  }
}
'
curl -X POST "localhost:9200/twitter/_delete_by_query" -H 'Content-Type: application/json' -d' { "slice": { "id": 1, "max": 2 }, "query": { "range": { "likes": { "lt": 10 } } } } ' 

Which you can verify as follows:

curl -X GET "localhost:9200/_refresh"
curl -X POST "localhost:9200/twitter/_search?size=0&filter_path=hits.total" -H 'Content-Type: application/json' -d'
{
  "query": {
    "range": {
      "likes": {
        "lt": 10
      }
    }
  }
}
'

A total like this would be a reasonable result:

{
  "hits": {
    "total": 0
  }
}

Automatic slicing

You can also use Sliced Scroll to have the delete-by-query API parallelize automatically, slicing on _uid:

curl -X POST "localhost:9200/twitter/_delete_by_query?refresh&slices=5" -H 'Content-Type: application/json' -d'
{
  "query": {
    "range": {
      "likes": {
        "lt": 10
      }
    }
  }
}
'

You can verify that it worked as follows:

curl -X POST "localhost:9200/twitter/_search?size=0&filter_path=hits.total" -H 'Content-Type: application/json' -d'
{
  "query": {
    "range": {
      "likes": {
        "lt": 10
      }
    }
  }
}
'

A total like the following would be a reasonable result:

{
  "hits": {
    "total": 0
  }
}

Setting slices makes _delete_by_query automate the manual process shown in the section above, creating sub-requests. This has a few quirks:

  1. You can see these requests in the Tasks API. They are "child" tasks of the task for the request that used slices.
  2. Fetching the status of the task for the request with slices only contains the status of the completed slices.
  3. These sub-requests are individually addressable for things like cancellation and rethrottling.
  4. Rethrottling the request with slices will rethrottle the unfinished sub-requests proportionally.
  5. Canceling the request with slices will cancel each sub-request.
  6. Due to the nature of slices, each sub-request won't get a perfectly even portion of the documents. All documents will be addressed, but some slices may be larger than others. Expect larger slices to have a more even distribution.
  7. Parameters like requests_per_second and size on a request with slices are distributed proportionally to each sub-request. Combined with the uneven distribution noted above, you should conclude that using size with slices might not result in exactly size documents being deleted.
  8. Each sub-request gets a slightly different snapshot of the source index, though these snapshots are all taken at approximately the same time.

Picking the number of slices

Here are some recommendations for the number of slices (when slicing manually, this is the max parameter of the slice object):

  1. Do not use large numbers. Setting slices to something like 500 creates fairly massive CPU thrash.
    (A note on what "thrashing" means here: it is the phenomenon where the CPU spends most of its time on overhead while very little time goes to actual work.)
  2. From a query performance standpoint, it is more efficient for the source index to have multiple shards.
  3. From a query performance standpoint, using the same number of slices as there are shards in the source index is most efficient (see the lookup sketch after this list).
  4. Delete performance should scale linearly across available resources with the number of slices.
  5. Whether indexing (deleting) or query performance dominates the runtime depends on many factors, such as the documents being reindexed and the cluster's resources.
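
For example, a minimal sketch (an assumption, not from the reference) for looking up the shard count of the source index when choosing slices; filter_path merely trims the response:

curl -X GET "localhost:9200/twitter/_settings?filter_path=*.settings.index.number_of_shards"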

References:

https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-delete-by-query.html
https://blog.csdn.net/u013066244/article/details/76258188



