Elasticsearch: Fast Requests

 
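Bundling multiple requests into a single request avoids the network overhead of handling each request separately.

(Nothing groundbreaking here; pretty much every system supports this.)

If you know you need to retrieve several documents, fetching them all in one request is faster than fetching them one by one.

The mget API expects a docs array in which each element contains the _index, _type, and _id metadata of one document.

You can also specify a _source parameter to filter the fields that are returned.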

 
GET /_mget
{
   "docs" : [
      {
         "_index" : "website",
         "_type" :  "blog",
         "_id" :    2
      },
      {
         "_index" : "website",
         "_type" :  "pageviews",
         "_id" :    1,
         "_source": "views"
      }
   ]
}
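The request body above can also be assembled programmatically. The following is a minimal Python sketch, not part of any Elasticsearch client: the helper name build_mget_body and the tuple layout (index, type, id, optional source filter) are assumptions made for illustration. It only builds the JSON body; sending it to /_mget is left to whatever HTTP client you use.

```python
import json


def build_mget_body(specs):
    """Build an mget request body from (index, type, id[, source]) tuples.

    The tuple layout is a convention invented for this sketch.
    """
    docs = []
    for spec in specs:
        index, doc_type, doc_id = spec[:3]
        doc = {"_index": index, "_type": doc_type, "_id": doc_id}
        if len(spec) > 3:
            # Optional fourth element: restrict the returned _source fields.
            doc["_source"] = spec[3]
        docs.append(doc)
    return {"docs": docs}


body = build_mget_body([
    ("website", "blog", 2),
    ("website", "pageviews", 1, "views"),
])
print(json.dumps(body, indent=2))
```

The printed JSON has the same structure as the request shown above.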

 
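The response body also contains a docs array, with the entries returned in the same order as they were requested.

Each entry is the response for one document: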

{
   "docs" : [
      {
         "_index" :   "website",
         "_id" :      "2",
         "_type" :    "blog",
         "found" :    true,
         "_source" : {
            "text" :  "This is a piece of cake...",
            "title" : "My first external blog entry"
         },
         "_version" : 10
      },
      {
         "_index" :   "website",
         "_id" :      "1",
         "_type" :    "pageviews",
         "found" :    true,
         "_version" : 2,
         "_source" : {
            "views" : 2
         }
      }
   ]
}

 
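If the documents you want to retrieve are all in the same index (or even of the same type), you can specify a default /_index or /_index/_type in the URL.

You can still override these values for individual documents.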

 
GET /website/blog/_mget
{
   "docs" : [
      { "_id" : 2 },
      { "_type" : "pageviews", "_id" : 1 }
   ]
}

 
In fact, if all the documents share the same _index and _type, you can simply pass an array of ids:

 
GET /website/blog/_mget
{
   "ids" : [ "2", "1" ]
}

 
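Note that the second document requested here does not exist. When a document is missing, the response reports it like this: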

{
   "docs" : [
      {
         "_index" :   "website",
         "_type" :    "blog",
         "_id" :      "2",
         "_version" : 10,
         "found" :    true,
         "_source" : {
            "title" : "My first external blog entry",
            "text" :  "This is a piece of cake..."
         }
      },
      {
         "_index" :   "website",
         "_type" :    "blog",
         "_id" :      "1",
         "found" :    false
      }
   ]
}

 
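The second document was not found ("found": false), but that does not affect the first one: each document is retrieved and reported independently.

The HTTP status code of the response is 200 even though one document was not found. In fact, it would still be 200 if none of the documents were found, because the mget request itself completed successfully.

To tell whether an individual document was retrieved, check the value of its found field.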
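Because the HTTP status does not tell you which documents are missing, client code has to inspect each entry. The following is a small Python sketch of that check; the function name split_found is hypothetical, and the sample response is a trimmed copy of the one shown above.

```python
def split_found(mget_response):
    """Separate an mget response into found and missing documents."""
    found, missing = [], []
    for doc in mget_response["docs"]:
        # A missing document carries "found": false and no _source.
        (found if doc.get("found") else missing).append(doc)
    return found, missing


# Sample response, shortened from the example in the text.
response = {
    "docs": [
        {"_index": "website", "_type": "blog", "_id": "2", "_version": 10,
         "found": True,
         "_source": {"title": "My first external blog entry"}},
        {"_index": "website", "_type": "blog", "_id": "1", "found": False},
    ]
}

found, missing = split_found(response)
print("found:", [doc["_id"] for doc in found])
print("missing:", [doc["_id"] for doc in missing])
```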

http://my.oschina.net/qiangzigege/blog/264370

 
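Just as mget lets us retrieve multiple documents at once, the bulk API lets us make multiple create, index, update, and delete requests in a single step.

This is very useful when many documents have to be written at once.

The bulk request body has the following format: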
 
{ action: { metadata }}\n
{ request body        }\n
{ action: { metadata }}\n
{ request body        }\n
...
 
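Two points are worth noting:

1. Every line must end with a newline character (\n), including the last line.

2. Lines must not contain unescaped newline characters, because these would interfere with parsing.

The action/metadata line specifies which action to perform on which document.

The action must be one of index, create, update, or delete.

The metadata specifies the _index, _type, and _id of the document to be indexed, created, updated, or deleted.

For example, a delete request looks like this: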
 
{ "delete": { "_index": "website", "_type": "blog", "_id": "123" }}
 
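The request body line consists of the document _source itself: the fields and values that the document contains.

It is required for index and create actions, because you must supply the document to be indexed.

It is also required for update actions, where it holds the same content you would pass to the update API: doc, upsert, script, and so on.

Delete actions do not need a request body line.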
 
{ "create":  { "_index": "website", "_type": "blog", "_id": "123" }}
{ "title":    "My first blog post" }
 
If no _id is specified, an ID is generated automatically:
 
{ "index": { "_index": "website", "_type": "blog" }}
{ "title":    "My second blog post" }
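The formatting rules above (one JSON object per line, a trailing newline after every line, no raw newlines inside a line) are easy to get wrong by hand. The following Python sketch shows one way to serialize a list of operations into that format. The function name to_bulk_body and the (action, metadata, body) tuple layout are assumptions for illustration, not part of any client library. It relies on json.dumps, which without an indent argument emits a single line and escapes newlines inside strings.

```python
import json


def to_bulk_body(operations):
    """Serialize (action, metadata, body_or_None) tuples into a bulk payload."""
    lines = []
    for action, metadata, body in operations:
        lines.append(json.dumps({action: metadata}))
        if body is not None:
            # delete actions pass None and therefore get no body line.
            lines.append(json.dumps(body))
    # Every line, including the last one, ends with "\n".
    return "".join(line + "\n" for line in lines)


operations = [
    ("delete", {"_index": "website", "_type": "blog", "_id": "123"}, None),
    ("create", {"_index": "website", "_type": "blog", "_id": "123"},
     {"title": "My first\nblog post"}),  # the embedded newline gets escaped
]
payload = to_bulk_body(operations)
print(payload, end="")
```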
 
Let's look at a complete example.
 
POST /_bulk
{ "delete": { "_index": "website", "_type": "blog", "_id": "123" }}
{ "create": { "_index": "website", "_type": "blog", "_id": "123" }}
{ "title":    "My first blog post" }
{ "index":  { "_index": "website", "_type": "blog" }}
{ "title":    "My second blog post" }
{ "update": { "_index": "website", "_type": "blog", "_id": "123", "_retry_on_conflict" : 3} }
{ "doc" : {"title" : "My updated blog post"} }
 
 
The response looks like this:
 
{
   "took": 4,
   "errors": false,
   "items": [
      {  "delete": {
            "_index":   "website",
            "_type":    "blog",
            "_id":      "123",
            "_version": 2,
            "status":   200,
            "found":    true
      }},
      {  "create": {
            "_index":   "website",
            "_type":    "blog",
            "_id":      "123",
            "_version": 3,
            "status":   201
      }},
      {  "create": {
            "_index":   "website",
            "_type":    "blog",
            "_id":      "EiwfApScQiiy7TIKFxRCTw",
            "_version": 1,
            "status":   201
      }},
      {  "update": {
            "_index":   "website",
            "_type":    "blog",
            "_id":      "123",
            "_version": 4,
            "status":   200
      }}
   ]
}
 
  ~~~~~~~~~~~~~~~~~~ 
 
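Each subrequest is executed independently, so the failure of one does not affect the others.

If any subrequest fails, the top-level errors flag is set to true and the error details are reported under the corresponding item.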
 
POST /_bulk
{ "create": { "_index": "website", "_type": "blog", "_id": "123" }}
{ "title":    "Cannot create - it already exists" }
{ "index":  { "_index": "website", "_type": "blog", "_id": "123" }}
{ "title":    "But we can update it" }
 
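In the response we can see that creating document 123 failed because it already exists, but the subsequent index request on the same document succeeded: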
 
{
   "took": 3,
   "errors": true,
   "items": [
      {  "create": {
            "_index":   "website",
            "_type":    "blog",
            "_id":      "123",
            "status":   409,
            "error":    "DocumentAlreadyExistsException [[website][4] [blog][123]: document already exists]"
      }},
      {  "index": {
            "_index":   "website",
            "_type":    "blog",
            "_id":      "123",
            "_version": 5,
            "status":   200
      }}
   ]
}
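Since a bulk response can mix successes and failures, client code typically scans the items array when the errors flag is true. The following Python sketch shows one way to do that; the function name failed_items and the returned tuple layout are hypothetical, and the sample response is a shortened copy of the one above.

```python
def failed_items(bulk_response):
    """Return (position, action, _id, status) for every failed subrequest."""
    failures = []
    if not bulk_response.get("errors"):
        return failures  # nothing failed, no need to scan the items
    for position, item in enumerate(bulk_response["items"]):
        # Each item is a single-key object: {action: result}.
        for action, result in item.items():
            if "error" in result:
                failures.append(
                    (position, action, result["_id"], result["status"]))
    return failures


# Sample response, shortened from the example in the text.
response = {
    "took": 3,
    "errors": True,
    "items": [
        {"create": {"_index": "website", "_type": "blog", "_id": "123",
                    "status": 409,
                    "error": "DocumentAlreadyExistsException ..."}},
        {"index": {"_index": "website", "_type": "blog", "_id": "123",
                   "_version": 5, "status": 200}},
    ],
}
failures = failed_items(response)
print(failures)
```

The position in the items array matches the order of the actions in the request, so it can be used to find the operation that has to be retried or logged.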
 
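This also means that bulk requests are not atomic and cannot be used to implement transactions. Each request is processed independently, so the success or failure of one request does not affect the others.

Don't Repeat Yourself

Perhaps you are bulk indexing log data into the same index, with the same type. Specifying the same metadata for every document is wasteful. Just as with the mget API, the bulk request accepts a default /_index or /_index/_type in the URL: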
 
POST /website/_bulk
{ "index": { "_type": "log" }}
{ "event": "User logged in" }
 
You can still override _index and _type on the metadata line, but by default the values from the URL are used:
 
POST /website/log/_bulk
{ "index": {}}
{ "event": "User logged in" }
{ "index": { "_type": "blog" }}
{ "title": "Overriding the default type" }
 
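How Big Is Too Big?

The entire bulk request has to be held in memory by the node that receives it, so the bigger the request, the less memory is available for other requests. There is an optimal size for a bulk request; beyond it, performance no longer improves and may even drop.

That optimal size is not a fixed number, however. It depends on your hardware, on the size and complexity of your documents, and on your indexing and search load. Fortunately, the sweet spot is easy to find.

Try indexing typical documents in batches of increasing size. When performance starts to drop, the batch is too big.

A good starting point is a batch of 1,000 to 5,000 documents.

Also keep an eye on the physical size of your requests: one thousand 1 KB documents is very different from one thousand 1 MB documents.

A good bulk size to start with is around 5 to 15 MB.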
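The two limits mentioned above (a document count and a payload size) can be applied together when splitting a large stream of documents into bulk requests. The following Python sketch illustrates the idea; the function name chunk_documents and its default limits are assumptions picked from the ranges in the text, and the size estimate ignores the action/metadata lines, so treat it as a rough lower bound.

```python
import json


def chunk_documents(documents, max_docs=1000, max_bytes=5 * 1024 * 1024):
    """Yield lists of documents capped by count and approximate byte size."""
    batch, batch_bytes = [], 0
    for doc in documents:
        # Size of the serialized body line plus its trailing newline.
        size = len(json.dumps(doc).encode("utf-8")) + 1
        if batch and (len(batch) >= max_docs or batch_bytes + size > max_bytes):
            yield batch
            batch, batch_bytes = [], 0
        batch.append(doc)
        batch_bytes += size
    if batch:
        yield batch  # final, possibly smaller, batch


docs = [{"event": "User logged in", "n": i} for i in range(2500)]
batch_sizes = [len(batch) for batch in chunk_documents(docs, max_docs=1000)]
print(batch_sizes)
```

A single document larger than max_bytes still ends up in a batch of its own, so the cap is a target rather than a hard guarantee. To find the sweet spot described above, you would time requests while increasing max_docs step by step.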
 
  http://my.oschina.net/qiangzigege/blog/264382
