Elasticsearch Learning Path - Day 07

This article is adapted from https://www.elastic.co/guide/cn/elasticsearch/guide/current/index.html , Elasticsearch version 6.3.0.

Retrieving multiple documents
Elasticsearch is already fast, but it can be even faster. Combining multiple requests into one avoids the network latency and overhead of processing each request individually. If you need to retrieve many documents from Elasticsearch, using the multi-get (mget) API to bundle those retrievals into a single request is faster than fetching the documents one by one.
The mget API expects a docs array as a parameter, where each element specifies the _index, _type, and _id metadata of the document to retrieve. If you want only one or a few specific fields, you can also specify a _source parameter:

POST /_mget
{
   "docs" : [
      {
         "_index" : "website2",
         "_type" :  "blog",
         "_id" :    1
      },
      {
         "_index" : "website",
         "_type" :  "pageviews",
         "_id" :    1,
         "_source": "views"
      }
   ]
}
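For illustration, the docs array above can be assembled programmatically before being sent. The following is a minimal Python sketch; the mget_body helper is hypothetical and not part of any Elasticsearch client library:

```python
import json

def mget_body(specs):
    """Build an mget request body from (index, type, id[, source]) tuples."""
    docs = []
    for spec in specs:
        doc = {"_index": spec[0], "_type": spec[1], "_id": spec[2]}
        if len(spec) > 3:
            # Restrict the returned _source to the listed field(s).
            doc["_source"] = spec[3]
        docs.append(doc)
    return {"docs": docs}

# Reproduces the request body shown above.
body = mget_body([
    ("website2", "blog", 1),
    ("website", "pageviews", 1, "views"),
])
print(json.dumps(body, indent=2))
```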

NOTE: Elasticsearch 6.3 does not allow multiple document types in a single index, so this example operates on two indices.
The response body also contains a docs array, with one response per document, in the same order as the request. Each of these responses has the same body as the response to a standalone get request:

{
  "docs": [
    {
      "_index": "website2",
      "_type": "blog",
      "_id": "1",
      "_version": 1,
      "found": true,
      "_source": {
        "title": "My first blog entry",
        "text": "Just trying this out..."
      }
    },
    {
      "_index": "website",
      "_type": "pageviews",
      "_id": "1",
      "_version": 2,
      "found": true,
      "_source": {
        "views": 2
      }
    }
  ]
}

In fact, if all the documents share the same _index and _type, you can replace the full docs array with a simple ids array:

POST /website2/blog/_mget
{
   "ids" : [ "1", "2" ]
}

Note that the second document we requested does not exist: there is no document with ID 2 of type blog in this index. Its absence is reported in the response body:

{
  "docs": [
    {
      "_index": "website2",
      "_type": "blog",
      "_id": "1",
      "_version": 1,
      "found": true,
      "_source": {
        "title": "My first blog entry",
        "text": "Just trying this out..."
      }
    },
    {
      "_index": "website2",
      "_type": "blog",
      "_id": "2",
      "found": false
    }
  ]
}

The fact that the second document does not exist does not affect the retrieval of the first: each document is retrieved and reported individually.

Even though one of the documents was not found, the HTTP status code of the request is still 200. In fact, even if none of the requested documents are found, the request still returns 200, because the mget request itself completed successfully. To know whether each individual document was found, you need to check its found flag.
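Since the HTTP status code is 200 even when documents are missing, client code has to inspect the found flag of each entry itself. A minimal Python sketch (the missing_docs helper is hypothetical; the response dict is a trimmed version of the example above):

```python
def missing_docs(mget_response):
    """Return (index, type, id) for every doc the mget response marked as not found."""
    return [
        (d["_index"], d["_type"], d["_id"])
        for d in mget_response["docs"]
        if not d.get("found", False)
    ]

response = {
    "docs": [
        {"_index": "website2", "_type": "blog", "_id": "1", "found": True,
         "_source": {"title": "My first blog entry"}},
        {"_index": "website2", "_type": "blog", "_id": "2", "found": False},
    ]
}
print(missing_docs(response))  # the one document that was not found
```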

Cheaper in bulk
Just as mget allows us to retrieve multiple documents at once, the bulk API allows us to make multiple create, index, update, or delete requests in a single call. This is particularly useful for indexing data streams, such as log events, which can be queued up and indexed in batches of hundreds or thousands.
The bulk request body has a slightly unusual format:

{ action: { metadata }}\n
{ request body        }\n
{ action: { metadata }}\n
{ request body        }\n
...

This format is like a stream of JSON documents joined together by newline ("\n") characters. Two important points to note:

  • Every line must end with a newline character ("\n"), including the last line. The newlines are used as markers that separate the individual lines.
  • A line must not contain an unescaped newline character, as it would interfere with parsing. This means that the JSON must not be pretty-printed.

The action/metadata line specifies what action to take on which document.
The action must be one of the following:

  • create: Create a document only if it does not already exist. See "Creating a New Document".
  • index: Create a new document or replace an existing document. See "Indexing a Document" and "Updating a Whole Document".
  • update: Partially update a document. See "Partial Updates to Documents".
  • delete: Delete a document. See "Deleting a Document".

For index, create, update, and delete actions you must specify the document's _index, _type, and _id metadata.
A delete request, for example, looks like this:

{ "delete": { "_index": "website", "_type": "blog", "_id": "123" }}

The request body consists of the document's _source: the fields and values the document contains. It is required for create and index actions, which makes sense: you have to supply the document to be indexed.
It is also required for update actions, but there the body should follow the same format as the update API (doc, upsert, script, and so on). A delete action does not require a request body.

{ "create":  { "_index": "website", "_type": "blog", "_id": "123" }}
{ "title":    "My first blog post" }

If you do not define _id, ID will be created automatically:

{ "index": { "_index": "website", "_type": "blog" }}
{ "title":    "My second blog post" }

Putting this all together, a complete bulk request looks like this:

POST /_bulk 
{"delete":{"_index":"website2","_type":"blog","_id":"1"}}  ## the delete action takes no request body
{"create":{"_index":"website2","_type":"blog","_id":"1"}}
{"title":"My first blog post"}
{"index":{"_index":"website2","_type":"blog"}}
{"title":"My second blog post"}
{"update":{"_index":"website2","_type":"blog","_id":"1","_retry_on_conflict":3}}
{"doc":{"title":"My updated blog post"}} ## remember the final newline character
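The newline rules above can be captured in a small serialization helper. The sketch below is hypothetical Python, not part of any Elasticsearch client: it emits compact (non-pretty-printed) JSON, appends the required trailing newline, and shows that a delete action contributes no body line:

```python
import json

def bulk_payload(actions):
    """Serialize (action_line, optional_body) pairs into the bulk NDJSON format.

    Every line is compact JSON (no pretty-printing) and the payload ends
    with a trailing newline, as the bulk API requires. A delete action
    passes None as the body and contributes only its metadata line.
    """
    lines = []
    for action_line, body in actions:
        lines.append(json.dumps(action_line, separators=(",", ":")))
        if body is not None:
            lines.append(json.dumps(body, separators=(",", ":")))
    return "\n".join(lines) + "\n"

payload = bulk_payload([
    ({"delete": {"_index": "website2", "_type": "blog", "_id": "1"}}, None),
    ({"create": {"_index": "website2", "_type": "blog", "_id": "1"}},
     {"title": "My first blog post"}),
])
print(payload)
```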

The Elasticsearch response contains an items array that lists the result of each sub-request, in the same order as the requests were submitted:

{
  "took": 101,
  "errors": false, ## all sub-requests completed successfully
  "items": [
    {
      "delete": {
        "_index": "website2",
        "_type": "blog",
        "_id": "1",
        "_version": 2,
        "result": "deleted",
        "_shards": {
          "total": 2,
          "successful": 1,
          "failed": 0
        },
        "_seq_no": 1,
        "_primary_term": 3,
        "status": 200
      }
    },
    {
      "create": {
        "_index": "website2",
        "_type": "blog",
        "_id": "1",
        "_version": 3,
        "result": "created",
        "_shards": {
          "total": 2,
          "successful": 1,
          "failed": 0
        },
        "_seq_no": 2,
        "_primary_term": 3,
        "status": 201
      }
    },
    {
      "index": {
        "_index": "website2",
        "_type": "blog",
        "_id": "3obs92cB_9PWWV036uyh",
        "_version": 1,
        "result": "created",
        "_shards": {
          "total": 2,
          "successful": 1,
          "failed": 0
        },
        "_seq_no": 0,
        "_primary_term": 3,
        "status": 201
      }
    },
    {
      "update": {
        "_index": "website2",
        "_type": "blog",
        "_id": "1",
        "_version": 4,
        "result": "updated",
        "_shards": {
          "total": 2,
          "successful": 1,
          "failed": 0
        },
        "_seq_no": 3,
        "_primary_term": 3,
        "status": 200
      }
    }
  ]
}

Each sub-request is executed independently, so the failure of one sub-request does not affect the others. If any sub-request fails, the top-level errors flag is set to true, and the details of the error are reported under the corresponding request:

POST /_bulk
{"create":{"_index":"website2","_type":"blog","_id":"1"}}
{"title":"Cannot create - it already exists"}
{"index":{"_index":"website2","_type":"blog","_id":"1"}}
{"title":"But we can update it"}

In the response, we can see that creating document 1 failed because a document with that ID already exists, but the subsequent index request, also on document 1, succeeded:

{
  "took": 40,
  "errors": true, ## one or more requests failed
  "items": [
    {
      "create": {
        "_index": "website2",
        "_type": "blog",
        "_id": "1",
        "status": 409, ## the status of this request is reported as 409 Conflict
        "error": {
          "type": "version_conflict_engine_exception", 
          "reason": "[blog][1]: version conflict, document already exists (current version [4])",  ## the error message explains what went wrong with the request
          "index_uuid": "W3VTB9NyRNC3tgfYpnqkvA",
          "shard": "3",
          "index": "website2"
        }
      }
    },
    {
      "index": {
        "_index": "website2",
        "_type": "blog",
        "_id": "1",
        "_version": 5,
        "result": "updated",
        "_shards": {
          "total": 2,
          "successful": 1,
          "failed": 0
        },
        "_seq_no": 4,
        "_primary_term": 3,
        "status": 200 ## the second request succeeded with status code 200
      }
    }
  ]
}
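Because the HTTP response is 200 even when sub-requests fail, client code should walk the items array and check the status of each entry. A hypothetical Python sketch, run against a trimmed version of the response above:

```python
def failed_items(bulk_response):
    """Collect (action, _id, status, error type) for every failed sub-request."""
    failures = []
    for item in bulk_response["items"]:
        # Each item is a single-key dict: {action_name: result}.
        action, result = next(iter(item.items()))
        if result["status"] >= 300:
            error = result.get("error", {})
            failures.append((action, result["_id"], result["status"],
                             error.get("type")))
    return failures

# Trimmed version of the response shown above.
response = {
    "errors": True,
    "items": [
        {"create": {"_index": "website2", "_type": "blog", "_id": "1",
                    "status": 409,
                    "error": {"type": "version_conflict_engine_exception"}}},
        {"index": {"_index": "website2", "_type": "blog", "_id": "1",
                   "_version": 5, "result": "updated", "status": 200}},
    ],
}
print(failed_items(response))
```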

This also means that bulk requests are not atomic: they cannot be used to implement transactions. Each request is processed separately, so the success or failure of one request does not interfere with the others.

Don't repeat yourself
You may be indexing log data into the same index, with the same type, in one batch. Specifying the full metadata for every document is redundant. Just like the mget API, the bulk API accepts a default /_index or /_index/_type in the URL:

POST /website3/_bulk
{ "index": { "_type": "log" }}
{ "event": "User logged in" }

You can still override the _index and _type on the metadata line; when they are absent, the values from the URL are used as defaults.

How big is too big?
The entire bulk request needs to be loaded into memory by the node that receives the request, so the bigger the request, the less memory is available for other requests. There is an optimal size for a bulk request; above that size, performance no longer improves and may even drop.
The optimal size is, of course, not a fixed number. It depends entirely on your hardware, on the size and complexity of your documents, and on your indexing and search load. Fortunately, this sweet spot is easy to find:
Try indexing typical documents in batches of increasing size. When performance starts to fall off, your batch size is too big. A good starting point is between 1,000 and 5,000 documents per batch; if your documents are very large, use smaller batches.
It is often useful to keep an eye on the physical size of your bulk requests as well: one thousand 1 KB documents are very different from one thousand 1 MB documents. A good bulk request is best kept between roughly 5 and 15 MB.
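One way to apply the 5-15 MB guideline is to group documents into batches by serialized size rather than by document count. A hypothetical Python sketch (the 10 MB default is simply the midpoint of that range, not a recommended value from the guide):

```python
def batches_by_size(docs, max_bytes=10 * 1024 * 1024):
    """Group already-serialized documents into batches no larger than max_bytes each.

    docs is an iterable of strings (e.g. NDJSON action/body line pairs).
    A single oversized document still gets its own batch.
    """
    out, batch, batch_size = [], [], 0
    for doc in docs:
        size = len(doc.encode("utf-8"))
        if batch and batch_size + size > max_bytes:
            # Flush the current batch before it exceeds the limit.
            out.append(batch)
            batch, batch_size = [], 0
        batch.append(doc)
        batch_size += size
    if batch:
        out.append(batch)
    return out

# Tiny demonstration: 4-byte docs with a 10-byte limit split into 2 + 2 + 1.
print(batches_by_size(["aaaa"] * 5, max_bytes=10))
```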


Origin blog.csdn.net/qq_23536449/article/details/90898185