Elasticsearch Notes (Part 2)

I. Elasticsearch Notes (Part 1)

II. IK Analyzer

1. ik_smart: performs the coarsest-grained segmentation. For example, it splits "中华人民共和国国歌" into "中华人民共和国, 国歌".

GET my_index/_analyze
{
 "analyzer": "ik_smart",
 "text":"安徽省长江流域"
}

2. ik_max_word: performs the finest-grained segmentation and exhausts every possible combination. For example, it splits "中华人民共和国国歌" into "中华人民共和国, 中华人民, 中华, 华人, 人民共和国, 人民, 人, 民, 共和国, 共和, 和, 国国, 国歌".

GET _analyze?pretty
{
  "analyzer": "ik_max_word",
  "text":"安徽省长江流域"
}
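
The difference between the two modes can be sketched with a toy dictionary matcher. This is only an illustration under stated assumptions: DICTIONARY, coarse_segment, and fine_segment are hypothetical names, the dictionary is a tiny hand-picked set, and IK's real algorithm is more elaborate, so the fine-grained output here is not identical to the ik_max_word result quoted above.

```python
# Toy dictionary (hypothetical); IK ships a much larger one.
DICTIONARY = {
    "中华人民共和国", "中华人民", "中华", "华人",
    "人民共和国", "人民", "共和国", "共和", "国歌",
}


def coarse_segment(text, dictionary):
    """Greedy forward longest match: one token per position, as long as possible."""
    max_len = max(len(word) for word in dictionary)
    tokens, i = [], 0
    while i < len(text):
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in dictionary:
                tokens.append(text[i:j])
                i = j
                break
        else:
            # No dictionary word starts here: emit the single character.
            tokens.append(text[i])
            i += 1
    return tokens


def fine_segment(text, dictionary):
    """Emit every dictionary word found at every start offset."""
    tokens = []
    for i in range(len(text)):
        for j in range(len(text), i, -1):
            if text[i:j] in dictionary:
                tokens.append(text[i:j])
    return tokens


coarse = coarse_segment("中华人民共和国国歌", DICTIONARY)
fine = fine_segment("中华人民共和国国歌", DICTIONARY)
```

With this dictionary, coarse holds two tokens while fine holds nine overlapping ones, which is the coarse versus fine trade-off in miniature.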

3. Define the mapping

POST my_index/fulltext/_mapping
{
  "properties": {
    "content": {
      "type": "text",
      "analyzer": "ik_max_word",
      "search_analyzer": "ik_max_word"
    }
  }
}

To segment text according to your own vocabulary, define a custom dictionary for the IK analyzer.

List the existing dictionaries

ll /root/apps/elasticsearch-6.3.1/plugins/ik/config

Create a custom dictionary

 mkdir custom
 vi custom/new_word.dic
 cat custom/new_word.dic 
老铁
王者荣耀
洪荒之力
共有产权房
一带一路

 vi IKAnalyzer.cfg.xml 

    <!-- Users can configure their own extension dictionaries here -->
    <entry key="ext_dict">custom/new_word.dic</entry>

III. Search

Data source

vi  website.json 

{ "index":{ "_index": "website", "_type": "blog", "_id": "1" }}
{ "title": "Ambari源码编译","author":"小明","postdate":"2016-12-21","abstract":"CentOS7.x下的Ambari2.4源码编译","url":"http://url.cn/53788351"}
{ "index":{ "_index": "website", "_type": "blog", "_id": "2" }}
{ "title": "watchman源码编译","author":"小明","postdate":"2016-12-23","abstract":"CentOS7.x的watchman源码编译","url":"http://url.cn/53844169"}
{ "index":{ "_index": "website", "_type": "blog", "_id": "3" }}
{ "title": "CentOS升级gcc","author":"小明","postdate":"2016-12-25","abstract":"CentOS升级gcc","url":"http://url.cn/53868915"}
{ "index":{ "_index": "website", "_type": "blog", "_id": "4" }}
{ "title": "vmware复制虚拟机","author":"小明","postdate":"2016-12-29","abstract":"vmware复制虚拟机","url":"http://url.cn/53946664"}
{ "index":{ "_index": "website", "_type": "blog", "_id": "5" }}
{ "title": "libstdc++.so.6","author":"小明","postdate":"2016-12-30","abstract":"libstdc++.so.6问题解决","url":"http://url.cn/53946911"}
{ "index":{ "_index": "website", "_type": "blog", "_id": "6" }}
{ "title": "CentOS更换国内yum源","author":"小明","postdate":"2016-12-30","abstract":"CentOS更换国内yum源","url":"http://url.cn/53946911"}
{ "index":{ "_index": "website", "_type": "blog", "_id": "7" }}
{ "title": "搭建Ember开发环境","author":"小明","postdate":"2016-12-30","abstract":"CentOS下搭建Ember开发环境","url":"http://url.cn/53947507"}
{ "index":{ "_index": "website", "_type": "blog", "_id": "8" }}
{ "title": "es高亮","author":"小明","postdate":"2017-01-03","abstract":"Elasticsearch查询关键字高亮","url":"http://url/53991802"}
{ "index":{ "_index": "website", "_type": "blog", "_id": "9" }}
{ "title": "to be or not to be","author":"somebody","postdate":"2018-01-03","abstract":"to be or not to be,that is the question","url":"http://url/63991802"}

Create the index

PUT website
{
  "settings": {
    "number_of_replicas": 1,
    "number_of_shards": 5
  },
  "mappings": {
    "blog":{
      "properties": {
        "title":{
          "type":"text",
          "analyzer": "ik_max_word"
        },
        "author":{
          "type":"text"
        },
        "postdate":{
          "type":"date",
          "format": "yyyy-MM-dd"
        },
        "abstract":{
          "type":"text",
          "analyzer": "ik_max_word"
        },
        "url":{
          "type":"text"
        }
      }
    }  
  }
}

Bulk import (from a file)

curl -XPOST "http://hdp-1:9200/_bulk?pretty" -H "Content-Type: application/json;charset=UTF-8" --data-binary @website.json
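
A file in the format of website.json can also be generated instead of typed by hand. The sketch below is a hypothetical helper (build_bulk_body is not part of any Elasticsearch client): it emits one action line and one source line per document and ends the body with a newline, which the bulk API requires.

```python
import json


def build_bulk_body(index, doc_type, docs):
    """Build a newline-delimited bulk body from (id, source) pairs."""
    lines = []
    for doc_id, source in docs:
        action = {"index": {"_index": index, "_type": doc_type, "_id": str(doc_id)}}
        # ensure_ascii=False keeps the Chinese text readable in the file.
        lines.append(json.dumps(action, ensure_ascii=False))
        lines.append(json.dumps(source, ensure_ascii=False))
    # The bulk API expects the last line to end with a newline too.
    return "\n".join(lines) + "\n"


docs = [
    (1, {"title": "Ambari源码编译", "author": "小明", "postdate": "2016-12-21"}),
    (2, {"title": "watchman源码编译", "author": "小明", "postdate": "2016-12-23"}),
]
body = build_bulk_body("website", "blog", docs)
```

Writing body to website.json with UTF-8 encoding gives a file that the curl command above can send with --data-binary.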

term query

GET website/_search
{
  "query": {
    "term": {
        "title": "vmware"
    }
  }
}

Pagination

GET website/_search
{
  "from":0,
  "size":3,
  "query": {
    "match_all": {}
  }
}
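
As a rough mental model, from and size behave like a slice over the sorted hit list. The helpers below (page_to_from and paginate) are hypothetical names used only to show the arithmetic, with a list of nine document ids standing in for the hits.

```python
def page_to_from(page, size):
    """Convert a 1-based page number to the from offset."""
    if page < 1 or size < 1:
        raise ValueError("page and size must be positive")
    return (page - 1) * size


def paginate(hits, from_, size):
    """Return the window of hits selected by from/size."""
    return hits[from_:from_ + size]


hits = list(range(1, 10))  # stand-in for nine matching documents
first_page = paginate(hits, page_to_from(1, 3), 3)   # ids 1..3
third_page = paginate(hits, page_to_from(3, 3), 3)   # ids 7..9
```

Note that from + size is capped by the index.max_result_window setting (10000 by default), so very deep pages need another mechanism such as scroll or search_after.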

Source filtering (return only selected fields)

GET website/_search
{
  "_source": ["title","author"], 
  "query": {
    "term": {
        "title": "centos"
    }
  }
}

Show the version

GET website/_search
{
  "_source": ["title"], 
  "version": true, 
  "query": {
    "term": {
        "title": "centos"
    }
  }
}

Filter by minimum score

GET website/_search
{
  "min_score": 0.5,
  "query": {
    "term": {
        "title": "centos"
    }
  }
}

Highlight keywords

GET website/_search
{
  "query": {
    "term": {
        "title": "centos"
    }
  },
  "highlight": {
    "fields": {
      "title": {}
    }
  }
}

IV. Aggregations

Metrics: compute a value such as avg or max over the filtered set of documents. The result is a single number.

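Bucket: split the filtered set of documents into several smaller sets according to some criterion. Metrics are then computed on each of these smaller sets separately.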

Official documentation:
https://www.elastic.co/guide/en/elasticsearch/reference/6.1/search-aggregations-bucket.html
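
The relation between metrics and buckets can be sketched in a few lines of plain Python. The three sample documents and the helper names (metric_avg, bucket_by) are made up for illustration and do not touch Elasticsearch.

```python
from collections import defaultdict

# Hypothetical sample documents.
docs = [
    {"dep": "bigdata", "salary": 15000},
    {"dep": "AI", "salary": 17000},
    {"dep": "bigdata", "salary": 20000},
]


def metric_avg(documents, field):
    """A metric: reduce a set of documents to one number."""
    values = [d[field] for d in documents]
    return sum(values) / len(values)


def bucket_by(documents, field):
    """A bucket step: split documents into groups keyed by a field value."""
    groups = defaultdict(list)
    for d in documents:
        groups[d[field]].append(d)
    return dict(groups)


# Metric over the whole set, then the same metric inside each bucket.
overall_avg = metric_avg(docs, "salary")
avg_per_dep = {dep: metric_avg(group, "salary") for dep, group in bucket_by(docs, "dep").items()}
```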

Metrics aggregations: prepare the data

POST my-index/person/1
{
  "name":"小明",
  "age":28,
  "salary":10000
}
PUT my-index/person/2
{
  "name":"hadron",
  "age":19,
  "salary":5000
}

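1. max: returns the maximum value; to get the minimum, replace max with min in the request. aggs is short for aggregations, and max_age is the name given to the aggregation result. The aggregated field must be numeric; age is mapped as long here.

GET my-index/person/_search
{
  "aggs": {
    "max_age": {
      "max": {
        "field": "age"
      }
    }
  }
}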

2. avg: average value

GET my-index/_search
{
  "size": 0, 
  "aggs": {
    "avg_salary": {
      "avg": {"field": "salary"}
    }
  }
}

3. sum: total

GET my-index/_search
{
  "size": 0, 
  "aggs": {
    "sum_salary": {
      "sum": {"field": "salary"}
    }
  }
}

4. stats: returns count, min, max, avg, and sum in a single response

GET my-index/_search
{
  "size": 0, 
  "aggs": {
    "stats_salary": {
      "stats": {"field": "salary"}
    }
  }
}
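
For the two person documents indexed above (salaries 10000 and 5000), the values that stats reports can be reproduced by hand. The stats function below is a hypothetical stand-in for the aggregation and assumes a non-empty list of numbers.

```python
def stats(values):
    """Compute the five numbers reported by a stats aggregation."""
    values = list(values)
    total = sum(values)
    return {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "avg": total / len(values),
        "sum": total,
    }


salaries = [10000, 5000]
stats_salary = stats(salaries)
```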

5. extended_stats: more detailed than stats; it also returns the sum of squares, the variance, and the standard deviation

GET my-index/_search
{
  "size": 0, 
  "aggs": {
    "stats_salary": {
      "extended_stats": {"field": "salary"}
    }
  }
}
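
The extra fields can be checked by hand for the salaries 10000 and 5000. This sketch assumes the population variance and bounds of avg plus or minus two standard deviations, which is how I understand the 6.x defaults; extended_stats here is a hypothetical helper, not an Elasticsearch call.

```python
import math


def extended_stats(values, sigma=2):
    """Compute the additional fields of an extended_stats aggregation."""
    values = list(values)
    count = len(values)
    avg = sum(values) / count
    sum_of_squares = sum(v * v for v in values)
    # Population variance: mean of squares minus square of the mean.
    variance = sum_of_squares / count - avg * avg
    std_deviation = math.sqrt(variance)
    return {
        "sum_of_squares": sum_of_squares,
        "variance": variance,
        "std_deviation": std_deviation,
        "std_deviation_bounds": {
            "upper": avg + sigma * std_deviation,
            "lower": avg - sigma * std_deviation,
        },
    }


extended = extended_stats([10000, 5000])
```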

cardinality: counts distinct values, here the number of different salary levels

GET my-index/_search
{
  "size": 0, 
  "aggs": {
    "class_salary": {
      "cardinality": {"field": "salary"}
    }
  }
}

value_count: counts the values of a field; for a text field, aggregate on its .keyword sub-field

GET my-index/_search
{
  "size": 0, 
  "aggs": {
    "doc_count": {
      "value_count": {"field": "salary"}
    }
  }
}
GET my-index/_search
{
  "size": 0, 
  "aggs": {
    "doc_count": {
      "value_count": {"field": "name.keyword"}
    }
  }
}
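
cardinality and value_count are easy to confuse, so here is a small comparison on made-up documents (the docs list and both helper names are hypothetical). value_count counts every value, duplicates included, while cardinality counts distinct values; documents without the field contribute to neither.

```python
# Hypothetical documents; the last one has no salary field.
docs = [
    {"name": "a", "salary": 15000},
    {"name": "b", "salary": 15000},
    {"name": "c", "salary": 17000},
    {"name": "d"},
]


def value_count(documents, field):
    """Number of values present for the field, duplicates included."""
    return sum(1 for d in documents if field in d)


def cardinality(documents, field):
    """Number of distinct values of the field (exact here)."""
    return len({d[field] for d in documents if field in d})


n_values = value_count(docs, "salary")
n_distinct = cardinality(docs, "salary")
```

The count here is exact, whereas the Elasticsearch cardinality aggregation is an approximation that can drift on high-cardinality fields.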

percentiles

GET my-index/_search
{
  "size": 0, 
  "aggs": {
    "percentile_salary": {
      "percentiles": {"field": "salary"}
    }
  }
}

Bucket aggregations

Prepare the data

DELETE my-index

PUT my-index

PUT my-index/persion/1
{
  "name":"张三",
  "age":27,
  "gender":"男",
  "salary":15000,
  "dep":"bigdata"
}

PUT my-index/persion/2
{
  "name":"李四",
  "age":26,
  "gender":"女",
  "salary":15000,
  "dep":"bigdata"
}

PUT my-index/persion/3
{
  "name":"王五",
  "age":26,
  "gender":"男",
  "salary":17000,
  "dep":"AI"
}
PUT my-index/persion/4
{
  "name":"刘六",
  "age":27,
  "gender":"女",
  "salary":18000,
  "dep":"AI"
}

PUT my-index/persion/5
{
  "name":"程裕强",
  "age":31,
  "gender":"男",
  "salary":20000,
  "dep":"bigdata"
}
PUT my-index/persion/6
{
  "name":"hadron",
  "age":30,
  "gender":"男",
  "salary":20000,
  "dep":"AI"
}

1. Terms aggregation: groups documents by the value of a field. The request below groups by salary and returns the number of documents in each group.

GET my-index/_search
{
  "size": 0, 
  "aggs": {
    "group_count": {
      "terms": {"field": "salary"}
    }
  }
}
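
On the six documents above, the grouping can be reproduced with a Counter. The ordering used here (doc_count descending, ties broken by ascending key) is my understanding of the default terms ordering and should be treated as an assumption.

```python
from collections import Counter

# Salaries of the six documents indexed above.
salaries = [15000, 15000, 17000, 18000, 20000, 20000]

# One bucket per distinct salary, ordered by count (desc) then key (asc).
buckets = sorted(Counter(salaries).items(), key=lambda kv: (-kv[1], kv[0]))
```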

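2. Group by salary and show, for each group, the number of people and their average age (a nested aggs inside the terms aggregation from the previous example).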

GET my-index/_search
{
  "size": 0, 
  "aggs": {
    "group_count": {
      "terms": {"field": "salary"},
      "aggs": {
        "myavg": {
          "avg": {
            "field": "age"
          }
        }
      }
    }
  }
}
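
The nested aggregation amounts to "bucket first, then run the metric inside each bucket". A plain Python sketch on the same six documents, with (salary, age) pairs copied from the data above:

```python
from collections import defaultdict

# (salary, age) for the six documents indexed above.
people = [(15000, 27), (15000, 26), (17000, 26), (18000, 27), (20000, 31), (20000, 30)]

ages_by_salary = defaultdict(list)
for salary, age in people:
    ages_by_salary[salary].append(age)

# doc_count comes from the terms bucket, myavg from the nested avg.
group_count = {
    salary: {"doc_count": len(ages), "myavg": sum(ages) / len(ages)}
    for salary, ages in ages_by_salary.items()
}
```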

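3. Count the people in each department. Use dep.keyword so that the value is not tokenized; aggregating directly on the text field dep raises an error, because fielddata is disabled on text fields by default.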

GET my-index/_search
{
  "size": 0, 
  "aggs": {
    "group_count": {
      "terms": {"field": "dep.keyword"}
    }
  }
}

4. filter

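Filter aggregation: puts the documents that match a filter into a single bucket. The request below computes the average age of the male employees, that is, the average age over the documents whose gender field contains the term "男".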

GET my-index/_search
{
  "size": 0, 
  "aggs": {
    "group_count": {
      "filter": {
        "term":{"gender": "男"}
      },
      "aggs":{
        "avg_age":{
          "avg":{"field": "age"}
        }
      }
    }
  }
}

Average age of the male employees and of the female employees, computed with a filters aggregation

GET my-index/_search
{
  "size": 0, 
  "aggs": {
    "group_count": {
      "filters":{
        "filters": [
          {"match":{"gender": "男"}},
          {"match":{"gender": "女"}}
        ]
      },
      "aggs":{
        "avg_age":{
            "avg":{"field": "age"}
        }
      }
    }
  }
}
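
The expected numbers for the filter and filters requests can be worked out from the six documents above. In this sketch avg_age_where is a hypothetical helper; with anonymous filters, the buckets come back in the order the filters were listed.

```python
# (gender, age) for the six documents indexed above.
people = [("男", 27), ("女", 26), ("男", 26), ("女", 27), ("男", 31), ("男", 30)]


def avg_age_where(rows, gender):
    """Average age over the rows that pass the gender filter."""
    ages = [age for g, age in rows if g == gender]
    return sum(ages) / len(ages)


# Single filter bucket (the filter aggregation).
male_avg = avg_age_where(people, "男")

# One bucket per filter, in request order (the filters aggregation).
avg_by_filter = [avg_age_where(people, g) for g in ["男", "女"]]
```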

Count the documents whose body field contains "error" and those whose body field contains "warning"

Data

PUT /logs/message/_bulk?refresh
{ "index" : { "_id" : 1 } }
{ "body" : "warning: page could not be rendered" }
{ "index" : { "_id" : 2 } }
{ "body" : "authentication error" }
{ "index" : { "_id" : 3 } }
{ "body" : "warning: connection timed out" }

Query

GET logs/_search
{
  "size": 0,
  "aggs" : {
    "messages" : {
      "filters" : {
        "filters" : {
          "errors" :   { "match" : { "body" : "error"   }},
          "warnings" : { "match" : { "body" : "warning" }}
        }
      }
    }
  }
}
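
With named filters, each bucket is keyed by the filter name. The sketch below imitates the match on body with a crude lowercase word split, which is only an approximation of the standard analyzer; the helper name count_matching is hypothetical.

```python
import re

# The three log messages indexed above.
bodies = [
    "warning: page could not be rendered",
    "authentication error",
    "warning: connection timed out",
]


def count_matching(texts, word):
    """Count the texts whose lowercase word tokens include the given word."""
    return sum(1 for text in texts if word in re.findall(r"\w+", text.lower()))


# Named buckets, mirroring the "errors" and "warnings" filter names.
messages = {
    "errors": count_matching(bodies, "error"),
    "warnings": count_matching(bodies, "warning"),
}
```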

5. Range aggregation

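Range aggregation: each from...to range is the half-open interval [from, to), so it includes from and excludes to. The request below counts the employees whose salary is below 10000, in [10000, 20000), and in [20000, +∞).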

GET my-index/_search
{
  "size": 0, 
  "aggs": {
    "group_count": {
      "range": {
        "field": "salary",
        "ranges": [
            {"to": 10000},
            {"from": 10000,"to":20000},  
            {"from": 20000}
        ]
      }
    }
  }
}
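
The half-open [from, to) rule matters for values that sit exactly on a boundary: a salary of 20000 belongs to the last range, not the middle one. A small sketch on the six salaries above, where range_counts is a hypothetical helper:

```python
def range_counts(values, ranges):
    """Count values per range; from is inclusive, to is exclusive."""
    counts = []
    for r in ranges:
        low, high = r.get("from"), r.get("to")
        counts.append(sum(
            1 for v in values
            if (low is None or v >= low) and (high is None or v < high)
        ))
    return counts


salaries = [15000, 15000, 17000, 18000, 20000, 20000]
ranges = [{"to": 10000}, {"from": 10000, "to": 20000}, {"from": 20000}]
counts = range_counts(salaries, ranges)
```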

Range aggregation with a date format: count the documents whose publication date falls before 2016-12-01, between 2016-12-01 and 2017-01-01, and on or after 2017-01-01

GET website/_search
{
  "size": 0, 
  "aggs": {
    "group_count": {
      "range": {
        "field": "postdate",
        "format":"yyyy-MM-dd",
        "ranges": [
            {"to": "2016-12-01"},
            {"from": "2016-12-01","to":"2017-01-01"},  
            {"from": "2017-01-01"}
        ]
      }
    }
  }
}
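
For the nine website documents, the expected counts can be checked offline. The sketch parses the dates with the Python equivalent of yyyy-MM-dd and applies the same [from, to) rule; date_range_counts is a hypothetical helper.

```python
from datetime import datetime

DATE_FORMAT = "%Y-%m-%d"  # Python counterpart of yyyy-MM-dd


def parse(text):
    return datetime.strptime(text, DATE_FORMAT).date()


def date_range_counts(dates, ranges):
    """Count dates per (from, to) pair; from inclusive, to exclusive, None = open."""
    parsed = [parse(d) for d in dates]
    counts = []
    for start, end in ranges:
        low = parse(start) if start else None
        high = parse(end) if end else None
        counts.append(sum(
            1 for d in parsed
            if (low is None or d >= low) and (high is None or d < high)
        ))
    return counts


# postdate values of the nine website documents.
postdates = ["2016-12-21", "2016-12-23", "2016-12-25", "2016-12-29", "2016-12-30",
             "2016-12-30", "2016-12-30", "2017-01-03", "2018-01-03"]
ranges = [(None, "2016-12-01"), ("2016-12-01", "2017-01-01"), ("2017-01-01", None)]
counts = date_range_counts(postdates, ranges)
```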

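Date Range aggregation
A range aggregation dedicated to date values. The main difference from the normal range aggregation is that the from and to values can be expressed as date math expressions, and that a date format can be specified for the from and to fields returned in the response. Note that this aggregation includes the from value and excludes the to value of each range.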

Count the blog posts published more than a year ago and those published within the last year

GET website/_search
{
  "size": 0, 
  "aggs": {
    "group_count": {
      "range": {
        "field": "postdate",
        "format":"yyyy-MM-dd",
        "ranges": [
            {"to": "now-12M/M"},
            {"from": "now-12M/M"}
        ]
      }
    }
  }
}
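
The expression now-12M/M reads as "subtract 12 months from now, then round down to the start of that month". A sketch of that arithmetic follows; the function name is hypothetical, and time zones are ignored.

```python
from datetime import datetime


def months_ago_start_of_month(now, months):
    """Evaluate now-<months>M/M: go back whole months, then floor to the month start."""
    total_months = now.year * 12 + (now.month - 1) - months
    year, month_index = divmod(total_months, 12)
    return datetime(year, month_index + 1, 1)


boundary = months_ago_start_of_month(datetime(2018, 6, 15, 10, 30), 12)
```

Evaluated on 2018-06-15, the boundary is 2017-06-01 00:00, so the first bucket holds posts before that instant and the second bucket holds the rest.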

6. Missing aggregation

Returns the number of documents that have no value for the given field

GET my-index/_count                   # total number of documents: 9

GET my-index/_search                  # number of documents without a salary field: 3
{
  "size": 0, 
  "aggs": {
    "noDep_count": {
      "missing": {"field": "salary"}
    }
  }
}
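
In plain Python terms, the missing aggregation counts the documents where the field is absent or null. A sketch on made-up documents (both docs and missing_count are hypothetical):

```python
# Hypothetical documents: two of them carry no usable salary.
docs = [
    {"name": "a", "salary": 15000},
    {"name": "b"},
    {"name": "c", "salary": None},
    {"name": "d", "salary": 20000},
]


def missing_count(documents, field):
    """Number of documents where the field is absent or null."""
    return sum(1 for d in documents if d.get(field) is None)


no_salary = missing_count(docs, "salary")
```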

7. Children aggregation

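A special single-bucket aggregation that selects child documents of the specified type, as defined in a join field. It takes a single option: type, the child type to select.
(1) Index definition
The mapping below uses a join field to define a single relation: in the doc type of join_index, question is the parent of answer. In relations, the parent name is the key and the child name is the value.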

PUT join_index
{
  "mappings": {
    "doc": {
      "properties": {
        "my_join_field": { 
          "type": "join",
          "relations": {
            "question": "answer"
          }
        }
      }
    }
  }
}

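Parent documents (question). The ?refresh parameter refreshes the index right after the write, so the document is immediately visible to searches.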

PUT join_index/doc/1?refresh
{
  "text": "This is a question",
  "my_join_field": {
    "name": "question" 
  }
}

PUT join_index/doc/2?refresh
{
  "text": "This is another question",
  "my_join_field": {
    "name": "question"
  }
}

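Child documents (answer). routing=1 sends each child to the same shard as its parent (document 1), which the join field requires.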

PUT join_index/doc/3?routing=1&refresh 
{
  "text": "This is an answer",
  "my_join_field": {
    "name": "answer", 
    "parent": "1" 
  }
}
PUT join_index/doc/4?routing=1&refresh
{
  "text": "This is another answer",
  "my_join_field": {
    "name": "answer",
    "parent": "1"
  }
}
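
Why the child documents carry routing=1 can be sketched as follows. Elasticsearch picks a shard from a hash of the routing value (the document id by default) modulo the number of primary shards. The code uses zlib.crc32 purely as a stand-in hash, since the real implementation uses a different hash function, and shard_for is a hypothetical name.

```python
import zlib

NUMBER_OF_SHARDS = 5  # assumed primary shard count for this sketch


def shard_for(doc_id, routing=None, shards=NUMBER_OF_SHARDS):
    """Pick a shard from the routing value, falling back to the document id."""
    key = routing if routing is not None else doc_id
    # crc32 is a stand-in; Elasticsearch uses its own hash internally.
    return zlib.crc32(key.encode("utf-8")) % shards


parent_shard = shard_for("1")
child_shards = [shard_for("3", routing="1"), shard_for("4", routing="1")]
```

Because both children are routed by the parent id, they always land on parent_shard, whatever the hash function is. Without routing, documents 3 and 4 would be placed by their own ids and could end up elsewhere.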

Count the child documents

POST join_index/_search
{
  "size": 0, 
  "aggs": {
    "to-answers": {
        "children": {
          "type" : "answer" 
        }
    }
  }
}
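
For the four documents indexed above, the children aggregation should report two answer documents. The sketch below mirrors the join field in plain dictionaries to show what is being counted; the variable names are illustrative.

```python
from collections import Counter

# The join field of the four documents indexed above, keyed by document id.
join_fields = {
    "1": {"name": "question"},
    "2": {"name": "question"},
    "3": {"name": "answer", "parent": "1"},
    "4": {"name": "answer", "parent": "1"},
}

# Documents selected by children with type "answer".
answers = [doc_id for doc_id, join in join_fields.items() if join["name"] == "answer"]
doc_count = len(answers)

# Number of answers attached to each parent question.
answers_per_question = Counter(join_fields[doc_id]["parent"] for doc_id in answers)
```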

辛聪明
