Elasticsearch Learning Path - Day 07

This article is reposted from https://www.elastic.co/guide/cn/elasticsearch/guide/current/index.html; the ES version used is 6.3.0.

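Retrieving Multiple Documents
Elasticsearch is already fast, but it can be faster still. Combining multiple requests into one avoids the network latency and overhead of handling each request individually. If you need to retrieve many documents from Elasticsearch, using the multi-get (mget) API to place those retrievals in a single request fetches all of them faster than requesting the documents one by one.
The mget API takes a docs array, each element of which specifies the _index, _type, and _id metadata of one document. If you only want to retrieve one or a few specific fields, you can also specify a _source parameter: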

POST /_mget
{
   "docs" : [
      {
         "_index" : "website2",
         "_type" :  "blog",
         "_id" :    1
      },
      {
         "_index" : "website",
         "_type" :  "pageviews",
         "_id" :    1,
         "_source": "views"
      }
   ]
}

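Note: indices in the Elasticsearch 6.3 I am using do not support multiple document types, so this mget operation runs across two indices.
The response body also contains a docs array, with one response per document, in the order specified in the request. Each of these responses is identical to the response body of an individual get request: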

{
  "docs": [
    {
      "_index": "website2",
      "_type": "blog",
      "_id": "1",
      "_version": 1,
      "found": true,
      "_source": {
        "title": "My first blog entry",
        "text": "Just trying this out..."
      }
    },
    {
      "_index": "website",
      "_type": "pageviews",
      "_id": "1",
      "_version": 2,
      "found": true,
      "_source": {
        "views": 2
      }
    }
  ]
}

In fact, if all the documents share the same _index and _type, you can pass a simple ids array instead of the full docs array:

POST /website2/blog/_mget
{
   "ids" : [ "1", "2" ]
}

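Note that the second document we requested does not exist: there is no document with ID 2 under index website2 and type blog. The missing document is reported in the response body.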

{
  "docs": [
    {
      "_index": "website2",
      "_type": "blog",
      "_id": "1",
      "_version": 1,
      "found": true,
      "_source": {
        "title": "My first blog entry",
        "text": "Just trying this out..."
      }
    },
    {
      "_index": "website2",
      "_type": "blog",
      "_id": "2",
      "found": false
    }
  ]
}

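The fact that the second document does not exist does not affect the retrieval of the first. Each document is retrieved and reported independently.

Even though one document was not found, the HTTP status code of the request is still 200. In fact, it would still be 200 even if none of the documents were found, because the mget request itself completed successfully. To know whether each document was retrieved, you need to check its found flag.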
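
Checking the found flag by hand gets tedious for large requests. Below is a minimal Python sketch, assuming the mget response has already been parsed into a dict. The helper name split_mget_response is hypothetical and not part of any Elasticsearch client:

```python
def split_mget_response(response):
    """Separate an mget response into found sources and missing document keys."""
    found, missing = {}, []
    for doc in response.get("docs", []):
        key = (doc["_index"], doc["_type"], doc["_id"])
        if doc.get("found"):
            found[key] = doc.get("_source")
        else:
            # found is false (or absent), so the document was not retrieved
            missing.append(key)
    return found, missing


# The response shown above for POST /website2/blog/_mget
response = {
    "docs": [
        {"_index": "website2", "_type": "blog", "_id": "1", "_version": 1,
         "found": True,
         "_source": {"title": "My first blog entry", "text": "Just trying this out..."}},
        {"_index": "website2", "_type": "blog", "_id": "2", "found": False},
    ]
}

found, missing = split_mget_response(response)
print(missing)  # [('website2', 'blog', '2')]
```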

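Bulk Operations for Updates
Just as mget lets us retrieve multiple documents at once, the bulk API lets us perform multiple create, index, update, or delete requests in a single request. This is particularly useful for indexing data streams such as log events, which can be indexed in order in batches of hundreds or thousands.
The bulk request body has the following, slightly unusual, format: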

{ action: { metadata }}\n
{ request body }\n
{ action: { metadata }}\n
{ request body }\n
...

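This format is like a stream of one-line JSON documents joined together by "\n" characters. Two important points to note:

Every line must end with a "\n" character, including the last line. These characters serve as markers that separate the lines.
No line may contain an unescaped newline character, because that would interfere with parsing. This means the JSON must not be pretty-printed.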
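
Both rules can be satisfied mechanically when the request body is built in code. Below is a minimal Python sketch using only the standard json module. The helper name to_ndjson is hypothetical:

```python
import json


def to_ndjson(lines):
    """Serialize each dict as single-line JSON, each terminated by a newline."""
    # Without the indent argument, json.dumps emits no raw newlines;
    # newlines inside string values are escaped as backslash-n.
    return "".join(json.dumps(line, separators=(",", ":")) + "\n" for line in lines)


body = to_ndjson([
    {"index": {"_index": "website", "_type": "blog"}},
    {"title": "Line one\nline two"},  # embedded newline gets escaped
])
print(repr(body))
```

The final line ends with a newline as well, so the body is ready to be sent as-is.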

The action/metadata line specifies what action to perform on which document.
The action must be one of the following:

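Action	Description
create	Create a document only if it does not already exist. See "Creating a Document".
index	Create a new document or replace an existing one. See "Indexing a Document" and "Updating a Document".
update	Perform a partial update on a document. See "Partial Updates".
delete	Delete a document. See "Deleting a Document".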

The metadata must specify the _index, _type, and _id of the document to be indexed, created, updated, or deleted.
For example, a delete request looks like this:

{ "delete": { "_index": "website", "_type": "blog", "_id": "123" }}

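The request body consists of the document's _source: the fields and values that the document contains. It is required for index and create operations, which makes sense: you must supply the document to index.
It is also required for update operations, where it should have the same form as the request body of the update API (doc, upsert, script, and so on). Delete operations do not need a request body.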

{ "create": { "_index": "website", "_type": "blog", "_id": "123" }}
{ "title": "My first blog post" }

If no _id is specified, an ID is generated automatically:

{ "index": { "_index": "website", "_type": "blog" }}
{ "title": "My second blog post" }
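
The rules about which actions take a request body can be captured in a small helper. This is a Python sketch only, not part of any client library; bulk_pair and its error messages are hypothetical names chosen for illustration:

```python
import json

ACTIONS_WITH_BODY = {"create", "index", "update"}
ACTIONS_WITHOUT_BODY = {"delete"}


def bulk_pair(action, metadata, body=None):
    """Return the newline-terminated line(s) for one bulk sub-request."""
    if action in ACTIONS_WITH_BODY:
        if body is None:
            raise ValueError(f"{action} requires a request body")
        lines = [{action: metadata}, body]
    elif action in ACTIONS_WITHOUT_BODY:
        if body is not None:
            raise ValueError(f"{action} takes no request body")
        lines = [{action: metadata}]
    else:
        raise ValueError(f"unknown bulk action: {action}")
    return "".join(json.dumps(line) + "\n" for line in lines)


payload = (
    bulk_pair("delete", {"_index": "website", "_type": "blog", "_id": "123"})
    # No _id in the metadata, so Elasticsearch generates one
    + bulk_pair("index", {"_index": "website", "_type": "blog"},
                {"title": "My second blog post"})
)
print(payload, end="")
```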

To put it all together, a complete bulk request looks like this:

POST /_bulk
{"delete":{"_index":"website2","_type":"blog","_id":"1"}} ## the delete action has no request body
{"create":{"_index":"website2","_type":"blog","_id":"1"}}
{"title":"My first blog post"}
{"index":{"_index":"website2","_type":"blog"}}
{"title":"My second blog post"}
{"update":{"_index":"website2","_type":"blog","_id":"1","_retry_on_conflict":3}}
{"doc":{"title":"My updated blog post"}} ## remember the final newline

The Elasticsearch response contains an items array, which lists the result of each request in the same order as we sent them:

{
  "took": 101,
  "errors": false, ## all sub-requests completed successfully
  "items": [
    {
      "delete": {
        "_index": "website2",
        "_type": "blog",
        "_id": "1",
        "_version": 2,
        "result": "deleted",
        "_shards": {
          "total": 2,
          "successful": 1,
          "failed": 0
        },
        "_seq_no": 1,
        "_primary_term": 3,
        "status": 200
      }
    },
    {
      "create": {
        "_index": "website2",
        "_type": "blog",
        "_id": "1",
        "_version": 3,
        "result": "created",
        "_shards": {
          "total": 2,
          "successful": 1,
          "failed": 0
        },
        "_seq_no": 2,
        "_primary_term": 3,
        "status": 201
      }
    },
    {
      "index": {
        "_index": "website2",
        "_type": "blog",
        "_id": "3obs92cB_9PWWV036uyh",
        "_version": 1,
        "result": "created",
        "_shards": {
          "total": 2,
          "successful": 1,
          "failed": 0
        },
        "_seq_no": 0,
        "_primary_term": 3,
        "status": 201
      }
    },
    {
      "update": {
        "_index": "website2",
        "_type": "blog",
        "_id": "1",
        "_version": 4,
        "result": "updated",
        "_shards": {
          "total": 2,
          "successful": 1,
          "failed": 0
        },
        "_seq_no": 3,
        "_primary_term": 3,
        "status": 200
      }
    }
  ]
}

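Each sub-request is executed independently, so the failure of one sub-request does not affect the others. If any request fails, the top-level errors flag is set to true, and the error details are reported under the corresponding request: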

POST /_bulk
{"create":{"_index":"website2","_type":"blog","_id":"1"}}
{"title":"Cannot create - it already exists"}
{"index":{"_index":"website2","_type":"blog","_id":"1"}}
{"title":"But we can update it"}

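In the response, we can see that creating document 1 failed because the document already exists, but the subsequent index request on document 1 succeeded: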

{
  "took": 40,
  "errors": true, ## one or more requests failed
  "items": [
    {
      "create": {
        "_index": "website2",
        "_type": "blog",
        "_id": "1",
        "status": 409, ## the status code for this request is reported as 409 Conflict
        "error": {
          "type": "version_conflict_engine_exception",
          "reason": "[blog][1]: version conflict, document already exists (current version [4])", ## the error message explains why the request failed
          "index_uuid": "W3VTB9NyRNC3tgfYpnqkvA",
          "shard": "3",
          "index": "website2"
        }
      }
    },
    {
      "index": {
        "_index": "website2",
        "_type": "blog",
        "_id": "1",
        "_version": 5,
        "result": "updated",
        "_shards": {
          "total": 2,
          "successful": 1,
          "failed": 0
        },
        "_seq_no": 4,
        "_primary_term": 3,
        "status": 200 ## the second request succeeded with status code 200
      }
    }
  ]
}

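This shows that bulk requests are not atomic: they cannot be used to implement transactions. Each request is processed separately, so the success or failure of one request does not interfere with the others.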
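
Because sub-requests succeed or fail individually, a client usually has to scan items to find the ones that need attention. Below is a minimal Python sketch, assuming the bulk response has been parsed into a dict. failed_items is a hypothetical helper name:

```python
def failed_items(response):
    """Return (position, action, status, reason) for each failed sub-request."""
    if not response.get("errors"):
        return []  # the top-level flag says every sub-request succeeded
    failures = []
    for position, item in enumerate(response["items"]):
        # Each item is a single-key dict of the form {action: result}
        action, result = next(iter(item.items()))
        if "error" in result:
            reason = result["error"].get("reason")
            failures.append((position, action, result["status"], reason))
    return failures


# Abbreviated form of the response shown above
response = {
    "took": 40,
    "errors": True,
    "items": [
        {"create": {"_index": "website2", "_type": "blog", "_id": "1", "status": 409,
                    "error": {"type": "version_conflict_engine_exception",
                              "reason": "[blog][1]: version conflict, document already exists (current version [4])"}}},
        {"index": {"_index": "website2", "_type": "blog", "_id": "1", "_version": 5,
                   "result": "updated", "status": 200}},
    ],
}

for failure in failed_items(response):
    print(failure)
```

Only the failed positions need to be logged or retried; the successful index request at position 1 has already taken effect.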

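Don't Repeat Yourself
You may be bulk-indexing log data into the same index, under the same type. Specifying the same metadata for every document is redundant. As with the mget API, a bulk request accepts a default /_index or /_index/_type in the URL:

POST /website3/_bulk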
{ "index": { "_type": "log" }}
{ "event": "User logged in" }

You can still override the _index and _type in the metadata line; where they are not overridden, the values from the URL are used as defaults.
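
The defaulting rule can be illustrated in Python. This sketch only models the rule described above for explanation; it does not show how Elasticsearch implements it, and resolve_target is a hypothetical name:

```python
def resolve_target(metadata, url_index=None, url_type=None):
    """Apply URL defaults to a metadata line; explicit values take precedence."""
    resolved = dict(metadata)
    if url_index is not None:
        resolved.setdefault("_index", url_index)
    if url_type is not None:
        resolved.setdefault("_type", url_type)
    return resolved


# Sent to POST /website3/_bulk: the index comes from the URL
from_url = resolve_target({"_type": "log"}, url_index="website3")

# An explicit _index in the metadata line overrides the URL default
overridden = resolve_target({"_index": "website2", "_type": "blog"}, url_index="website3")

print(from_url)
print(overridden)
```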

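How Big Is Too Big?
The entire bulk request must be loaded into memory by the node that receives it, so the bigger the request, the less memory is available for other requests. There is an optimal size for a bulk request. Beyond that size, performance no longer improves and may even drop.
The optimal size is, of course, not a fixed number. It depends entirely on your hardware, the size and complexity of your documents, and your indexing and search load. Fortunately, this sweet spot is easy to find:
Try bulk-indexing typical documents in batches of increasing size. When performance starts to drop, your batch size is too big. A good starting point is between 1,000 and 5,000 documents per batch, or fewer if your documents are very large.
It is often useful to keep an eye on the physical size of your bulk requests. One thousand 1 kB documents are very different from one thousand 1 MB documents. A good batch is best kept between 5 and 15 MB.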
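
One way to respect both guidelines is to cap each batch by document count and by serialized size. Below is a minimal Python sketch. The function name batch_bulk_lines and the default limits (1,000 documents, 5 MB) are assumptions picked from the ranges above and should be tuned against your own measurements:

```python
import json


def batch_bulk_lines(docs, index, doc_type, max_docs=1000, max_bytes=5 * 1024 * 1024):
    """Yield bulk bodies holding at most max_docs documents each.

    A body also stays within max_bytes, unless a single document is
    larger than that on its own, in which case it gets its own batch.
    """
    action = json.dumps({"index": {"_index": index, "_type": doc_type}}) + "\n"
    batch, count, size = [], 0, 0
    for doc in docs:
        pair = action + json.dumps(doc) + "\n"
        pair_size = len(pair.encode("utf-8"))
        # Flush before adding a pair that would exceed either limit
        if batch and (count + 1 > max_docs or size + pair_size > max_bytes):
            yield "".join(batch)
            batch, count, size = [], 0, 0
        batch.append(pair)
        count += 1
        size += pair_size
    if batch:
        yield "".join(batch)


docs = ({"event": "User logged in", "n": i} for i in range(2500))
bodies = list(batch_bulk_lines(docs, "website3", "log"))
print([body.count("\n") // 2 for body in bodies])  # [1000, 1000, 500]
```

Each yielded string already ends with a newline and can be sent as the body of one bulk request.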
