I. ElasticSearch Notes (Part 1)
II. IK Analyzer
1. ik_smart: performs the coarsest-grained split; for example, it splits "中华人民共和国国歌" into "中华人民共和国, 国歌".
GET my_index/_analyze
{
"analyzer": "ik_smart",
"text":"安徽省长江流域"
}
2. ik_max_word: performs the finest-grained split; for example, it splits "中华人民共和国国歌" into "中华人民共和国, 中华人民, 中华, 华人, 人民共和国, 人民, 人, 民, 共和国, 共和, 和, 国国, 国歌", exhausting the possible combinations.
GET _analyze?pretty
{
"analyzer": "ik_max_word",
"text":"安徽省长江流域"
}
3. Define the mapping
POST my_index/fulltext/_mapping
{
"properties": {
"content": {
"type": "text",
"analyzer": "ik_max_word",
"search_analyzer": "ik_max_word"
}
}
}
To control how specific words are segmented, add a custom dictionary to the IK analyzer.
List the existing dictionaries
ll /root/apps/elasticsearch-6.3.1/plugins/ik/config
Create a custom dictionary
mkdir custom
vi custom/new_word.dic
cat custom/new_word.dic
老铁
王者荣耀
洪荒之力
共有产权房
一带一路
vi IKAnalyzer.cfg.xml
<!-- Users can configure their own extension dictionaries here -->
<entry key="ext_dict">custom/new_word.dic</entry>
III. Search
Source data
vi website.json
{ "index":{ "_index": "website", "_type": "blog", "_id": "1" }}
{ "title": "Ambari源码编译","author":"小明","postdate":"2016-12-21","abstract":"CentOS7.x下的Ambari2.4源码编译","url":"http://url.cn/53788351"}
{ "index":{ "_index": "website", "_type": "blog", "_id": "2" }}
{ "title": "watchman源码编译","author":"小明","postdate":"2016-12-23","abstract":"CentOS7.x的watchman源码编译","url":"http://url.cn/53844169"}
{ "index":{ "_index": "website", "_type": "blog", "_id": "3" }}
{ "title": "CentOS升级gcc","author":"小明","postdate":"2016-12-25","abstract":"CentOS升级gcc","url":"http://url.cn/53868915"}
{ "index":{ "_index": "website", "_type": "blog", "_id": "4" }}
{ "title": "vmware复制虚拟机","author":"小明","postdate":"2016-12-29","abstract":"vmware复制虚拟机","url":"http://url.cn/53946664"}
{ "index":{ "_index": "website", "_type": "blog", "_id": "5" }}
{ "title": "libstdc++.so.6","author":"小明","postdate":"2016-12-30","abstract":"libstdc++.so.6问题解决","url":"http://url.cn/53946911"}
{ "index":{ "_index": "website", "_type": "blog", "_id": "6" }}
{ "title": "CentOS更换国内yum源","author":"小明","postdate":"2016-12-30","abstract":"CentOS更换国内yum源","url":"http://url.cn/53946911"}
{ "index":{ "_index": "website", "_type": "blog", "_id": "7" }}
{ "title": "搭建Ember开发环境","author":"小明","postdate":"2016-12-30","abstract":"CentOS下搭建Ember开发环境","url":"http://url.cn/53947507"}
{ "index":{ "_index": "website", "_type": "blog", "_id": "8" }}
{ "title": "es高亮","author":"小明","postdate":"2017-01-03","abstract":"Elasticsearch查询关键字高亮","url":"http://url/53991802"}
{ "index":{ "_index": "website", "_type": "blog", "_id": "9" }}
{ "title": "to be or not to be","author":"somebody","postdate":"2018-01-03","abstract":"to be or not to be,that is the question","url":"http://url/63991802"}
Create the index
PUT website
{
"settings": {
"number_of_replicas": 1,
"number_of_shards": 5
},
"mappings": {
"blog":{
"properties": {
"title":{
"type":"text",
"analyzer": "ik_max_word"
},
"author":{
"type":"text"
},
"postdate":{
"type":"date",
"format": "yyyy-MM-dd"
},
"abstract":{
"type":"text",
"analyzer": "ik_max_word"
},
"url":{
"type":"text"
}
}
}
}
}
Bulk import (from a file)
curl -XPOST "http://hdp-1:9200/_bulk?pretty" -H "Content-Type: application/json;charset=UTF-8" --data-binary @website.json
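The website.json file above follows the newline-delimited layout that _bulk expects: an action line followed by a source line for each document, ending with a newline. As a rough sketch, assuming the documents are already available as Python dictionaries, such a file body could be generated like this (build_bulk_body is a hypothetical helper, not part of any Elasticsearch client, and only two of the documents are shown):

```python
import json

def build_bulk_body(index, doc_type, docs):
    """Return an NDJSON _bulk body: one action line, then one source line per document."""
    lines = []
    for doc_id, source in docs:
        action = {"index": {"_index": index, "_type": doc_type, "_id": str(doc_id)}}
        # ensure_ascii=False keeps the Chinese titles readable in the file.
        lines.append(json.dumps(action, ensure_ascii=False))
        lines.append(json.dumps(source, ensure_ascii=False))
    # The _bulk endpoint requires the body to end with a newline.
    return "\n".join(lines) + "\n"

docs = [
    (1, {"title": "Ambari源码编译", "author": "小明", "postdate": "2016-12-21"}),
    (2, {"title": "watchman源码编译", "author": "小明", "postdate": "2016-12-23"}),
]
body = build_bulk_body("website", "blog", docs)
print(body, end="")
```

Writing body to website.json with UTF-8 encoding would give a file usable with the curl command above.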
term query (the query text is not analyzed, so it must match an indexed token exactly)
GET website/_search
{
"query": {
"term": {
"title": "vmware"
}
}
}
Pagination
GET website/_search
{
"from":0,
"size":3,
"query": {
"match_all": {}
}
}
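The from value is an offset into the result list, not a page number. A minimal sketch of the conversion, assuming 1-based page numbers (page_to_from_size is a hypothetical helper name):

```python
def page_to_from_size(page, page_size):
    """Convert a 1-based page number into the from/size pair of a search request."""
    if page < 1 or page_size < 1:
        raise ValueError("page and page_size must be positive")
    return {"from": (page - 1) * page_size, "size": page_size}

# First page of three hits, matching the request above.
request_body = {**page_to_from_size(1, 3), "query": {"match_all": {}}}
print(request_body)
```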
Source filtering (return only the listed fields)
GET website/_search
{
"_source": ["title","author"],
"query": {
"term": {
"title": "centos"
}
}
}
Show the document version
GET website/_search
{
"_source": ["title"],
"version": true,
"query": {
"term": {
"title": "centos"
}
}
}
Filter by minimum score (hits whose _score is below min_score are dropped)
GET website/_search
{
"min_score": 0.5,
"query": {
"term": {
"title": "centos"
}
}
}
Highlight matched keywords
GET website/_search
{
"query": {
"term": {
"title": "centos"
}
},
"highlight": {
"fields": {
"title": {}
}
}
}
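IV. Aggregations
Metrics: compute a value such as avg or max over the filtered document set; the result is a single number.
Bucket: split the filtered document set into smaller sets according to some criterion; metrics are then computed on each of these sets separately.
Official documentation: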
https://www.elastic.co/guide/en/elasticsearch/reference/6.1/search-aggregations-bucket.html
POST my-index/person/1
{
"name":"小明",
"age":28,
"salary":10000
}
PUT my-index/person/2
{
"name":"hadron",
"age":19,
"salary":5000
}
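1. max returns the maximum value of a field; replace max with min to get the minimum. aggs is short for aggregations, and max_age is the name given to the result. The field must be numeric (age is mapped as long here).
GET my-index/person/_search
{
"aggs": {
"max_age": {
"max": {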
"field": "age"
}
}
}
}
2. avg: average value
GET my-index/_search
{
"size": 0,
"aggs": {
"avg_salary": {
"avg": {"field": "salary"}
}
}
}
3. sum: total
GET my-index/_search
{
"size": 0,
"aggs": {
"sum_salary": {
"sum": {"field": "salary"}
}
}
}
4. stats: returns count, min, max, avg, and sum in a single response
GET my-index/_search
{
"size": 0,
"aggs": {
"stats_salary": {
"stats": {"field": "salary"}
}
}
}
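As a plain-Python illustration of what stats returns (a sketch of the arithmetic only, not a call to Elasticsearch), the two person documents indexed above have salaries 10000 and 5000:

```python
def stats(values):
    """Compute the five values reported by a stats aggregation."""
    if not values:
        # Assumption for this sketch: an empty input yields a zero count and no other values.
        return {"count": 0, "min": None, "max": None, "avg": None, "sum": 0}
    total = sum(values)
    return {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "avg": total / len(values),
        "sum": total,
    }

salaries = [10000, 5000]  # salary values of the two person documents
result = stats(salaries)
print(result)
```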
5. extended_stats: like stats, with additional values such as the sum of squares, variance, and standard deviation
GET my-index/_search
{
"size": 0,
"aggs": {
"stats_salary": {
"extended_stats": {"field": "salary"}
}
}
}
cardinality: counts the distinct values of a field, for example how many different salary levels there are
GET my-index/_search
{
"size": 0,
"aggs": {
"class_salary": {
"cardinality": {"field": "salary"}
}
}
}
value_count: counts the values of a field; for a text field, use its .keyword sub-field
GET my-index/_search
{
"size": 0,
"aggs": {
"doc_count": {
"value_count": {"field": "salary"}
}
}
}
GET my-index/_search
{
"size": 0,
"aggs": {
"doc_count": {
"value_count": {"field": "name.keyword"}
}
}
}
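The difference between value_count and cardinality is easy to miss: the first counts every value, the second counts distinct values. A small sketch with a hypothetical salary list (note that Elasticsearch computes cardinality approximately, while this sketch is exact):

```python
def value_count(values):
    """Number of values, duplicates included."""
    return len(values)

def cardinality(values):
    """Number of distinct values (exact here; Elasticsearch uses an approximation)."""
    return len(set(values))

salaries = [10000, 5000, 5000]  # hypothetical data for illustration
print(value_count(salaries), cardinality(salaries))
```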
percentiles: returns percentile values of a field (by default the 1st, 5th, 25th, 50th, 75th, 95th, and 99th)
GET my-index/_search
{
"size": 0,
"aggs": {
"persion_salary": {
"percentiles": {"field": "salary"}
}
}
}
Bucket aggregations
Prepare the data
DELETE my-index
PUT my-index
PUT my-index/persion/1
{
"name":"张三",
"age":27,
"gender":"男",
"salary":15000,
"dep":"bigdata"
}
PUT my-index/persion/2
{
"name":"李四",
"age":26,
"gender":"女",
"salary":15000,
"dep":"bigdata"
}
PUT my-index/persion/3
{
"name":"王五",
"age":26,
"gender":"男",
"salary":17000,
"dep":"AI"
}
PUT my-index/persion/4
{
"name":"刘六",
"age":27,
"gender":"女",
"salary":18000,
"dep":"AI"
}
PUT my-index/persion/5
{
"name":"程裕强",
"age":31,
"gender":"男",
"salary":20000,
"dep":"bigdata"
}
PUT my-index/persion/6
{
"name":"hadron",
"age":30,
"gender":"男",
"salary":20000,
"dep":"AI"
}
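1. Terms Aggregation: groups documents by the distinct values of a field. The query below groups by salary and shows how many documents fall into each group.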
GET my-index/_search
{
"size": 0,
"aggs": {
"group_count": {
"terms": {"field": "salary"}
}
}
}
2. Group by salary and show the number of people and the average age in each group (the previous terms aggregation with a nested aggs).
GET my-index/_search
{
"size": 0,
"aggs": {
"group_count": {
"terms": {"field": "salary"},
"aggs": {
"myavg": {
"avg": {
"field": "age"
}
}
}
}
}
}
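To make the bucket-then-metric idea concrete, here is a plain-Python sketch of the same grouping over the six sample documents (only the two fields involved are kept). It mirrors the logic of the request, not the Elasticsearch implementation; terms_with_avg is a hypothetical helper name:

```python
from collections import defaultdict

people = [
    {"name": "张三", "age": 27, "salary": 15000},
    {"name": "李四", "age": 26, "salary": 15000},
    {"name": "王五", "age": 26, "salary": 17000},
    {"name": "刘六", "age": 27, "salary": 18000},
    {"name": "程裕强", "age": 31, "salary": 20000},
    {"name": "hadron", "age": 30, "salary": 20000},
]

def terms_with_avg(docs, group_field, metric_field):
    """Bucket docs by group_field, then average metric_field inside each bucket."""
    groups = defaultdict(list)
    for doc in docs:
        groups[doc[group_field]].append(doc[metric_field])
    buckets = [
        {"key": key, "doc_count": len(vals), "myavg": sum(vals) / len(vals)}
        for key, vals in groups.items()
    ]
    # Largest buckets first, ties broken by key.
    buckets.sort(key=lambda b: (-b["doc_count"], b["key"]))
    return buckets

buckets = terms_with_avg(people, "salary", "age")
for bucket in buckets:
    print(bucket)
```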
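3. Count the people in each department. Use dep.keyword so that the value is not split into tokens; aggregating on the text field dep directly returns an error, because fielddata is disabled on text fields by default.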
GET my-index/_search
{
"size": 0,
"aggs": {
"group_count": {
"terms": {"field": "dep.keyword"}
}
}
}
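4. filter
Filter Aggregation: puts all documents that match a filter into a single bucket. The example computes the average age of men, that is, the average of age over the documents whose gender field contains the term "男".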
GET my-index/_search
{
"size": 0,
"aggs": {
"group_count": {
"filter": {
"term":{"gender": "男"}
},
"aggs":{
"avg_age":{
"avg":{"field": "age"}
}
}
}
}
}
Average age of male and female employees (filters aggregation, one bucket per filter)
GET my-index/_search
{
"size": 0,
"aggs": {
"group_count": {
"filters":{
"filters": [
{"match":{"gender": "男"}},
{"match":{"gender": "女"}}
]
},
"aggs":{
"avg_age":{
"avg":{"field": "age"}
}
}
}
}
}
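The filters aggregation can be pictured as running each filter over the documents and computing the metric per matching set. A plain-Python sketch over the sample data (the bucket names male and female and the helper filters_avg are hypothetical):

```python
people = [
    {"name": "张三", "age": 27, "gender": "男"},
    {"name": "李四", "age": 26, "gender": "女"},
    {"name": "王五", "age": 26, "gender": "男"},
    {"name": "刘六", "age": 27, "gender": "女"},
    {"name": "程裕强", "age": 31, "gender": "男"},
    {"name": "hadron", "age": 30, "gender": "男"},
]

def filters_avg(docs, filters, metric_field):
    """One bucket per named filter, with the document count and the average of metric_field."""
    buckets = {}
    for name, predicate in filters.items():
        matched = [doc[metric_field] for doc in docs if predicate(doc)]
        avg = sum(matched) / len(matched) if matched else None
        buckets[name] = {"doc_count": len(matched), "avg_age": avg}
    return buckets

result = filters_avg(
    people,
    {
        "male": lambda doc: doc["gender"] == "男",
        "female": lambda doc: doc["gender"] == "女",
    },
    "age",
)
print(result)
```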
Count the documents whose body field contains "error" and those whose body field contains "warning"
Data
PUT /logs/message/_bulk?refresh
{ "index" : { "_id" : 1 } }
{ "body" : "warning: page could not be rendered" }
{ "index" : { "_id" : 2 } }
{ "body" : "authentication error" }
{ "index" : { "_id" : 3 } }
{ "body" : "warning: connection timed out" }
Query
GET logs/_search
{
"size": 0,
"aggs" : {
"messages" : {
"filters" : {
"filters" : {
"errors" : { "match" : { "body" : "error" }},
"warnings" : { "match" : { "body" : "warning" }}
}
}
}
}
}
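5. Range aggregation
Range Aggregation: each from...to range is the half-open interval [from, to), that is, it includes from and excludes to. The query below counts the employees whose salary is below 10000, in [10000, 20000), and 20000 or above.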
GET my-index/_search
{
"size": 0,
"aggs": {
"group_count": {
"range": {
"field": "salary",
"ranges": [
{"to": 10000},
{"from": 10000,"to":20000},
{"from": 20000}
]
}
}
}
}
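The half-open [from, to) rule decides where boundary values such as 20000 land. A plain-Python sketch of the bucketing over the six sample salaries (an illustration of the rule, not the Elasticsearch implementation):

```python
def range_buckets(values, ranges):
    """Count values per range, including 'from' and excluding 'to'."""
    buckets = []
    for r in ranges:
        lower = r.get("from", float("-inf"))  # a missing "from" means unbounded below
        upper = r.get("to", float("inf"))     # a missing "to" means unbounded above
        count = sum(1 for v in values if lower <= v < upper)
        buckets.append({"from": r.get("from"), "to": r.get("to"), "doc_count": count})
    return buckets

salaries = [15000, 15000, 17000, 18000, 20000, 20000]
ranges = [{"to": 10000}, {"from": 10000, "to": 20000}, {"from": 20000}]
buckets = range_buckets(salaries, ranges)
for bucket in buckets:
    print(bucket)
```

The two salaries of exactly 20000 fall into the last bucket, because the middle range excludes its to value.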
Range aggregation with a date format: count the documents published before 2016-12-01, from 2016-12-01 up to (but not including) 2017-01-01, and on or after 2017-01-01
GET website/_search
{
"size": 0,
"aggs": {
"group_count": {
"range": {
"field": "postdate",
"format":"yyyy-MM-dd",
"ranges": [
{"to": "2016-12-01"},
{"from": "2016-12-01","to":"2017-01-01"},
{"from": "2017-01-01"}
]
}
}
}
}
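Date Range aggregation
A range aggregation dedicated to date values. The main difference from the normal range aggregation is that the from and to values can be written as date math expressions, and that a date format can be specified for the from and to fields returned in the response. Note that this aggregation includes the from value and excludes the to value of each range.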
Count the blog posts published more than a year ago and those published within the last year
GET website/_search
{
"size": 0,
"aggs": {
"group_count": {
"date_range": {
"field": "postdate",
"format":"yyyy-MM-dd",
"ranges": [
{"to": "now-12M/M"},
{"from": "now-12M/M"}
]
}
}
}
}
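The expression now-12M/M means: take the current time, subtract 12 months, then round down to the start of that month. A plain-Python sketch of that evaluation, applied to the postdate values of the nine website documents; the fixed value of now is an assumption for illustration, and months_ago_floor_month is a hypothetical helper:

```python
from datetime import datetime

def months_ago_floor_month(now, months):
    """Evaluate 'now-<months>M/M': subtract whole months, then round down to the month start."""
    total_months = now.year * 12 + (now.month - 1) - months
    year, month_index = divmod(total_months, 12)
    return datetime(year, month_index + 1, 1)

now = datetime(2018, 3, 15, 10, 30)  # assumed current time for this example
boundary = months_ago_floor_month(now, 12)

postdates = [
    "2016-12-21", "2016-12-23", "2016-12-25", "2016-12-29", "2016-12-30",
    "2016-12-30", "2016-12-30", "2017-01-03", "2018-01-03",
]
parsed = [datetime.strptime(d, "%Y-%m-%d") for d in postdates]
before = sum(1 for d in parsed if d < boundary)   # bucket {"to": "now-12M/M"}
since = sum(1 for d in parsed if d >= boundary)   # bucket {"from": "now-12M/M"}
print(boundary.date(), before, since)
```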
6. Missing aggregation
Returns the number of documents that have no value for the given field
# Total number of documents (9 in the original run)
GET my-index/_count
# Number of documents without a salary field (3 in the original run)
GET my-index/_search
{
"size": 0,
"aggs": {
"noSalary_count": {
"missing": {"field": "salary"}
}
}
}
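In plain Python, the count that missing produces corresponds to the documents where the field is absent or null. A small sketch with hypothetical documents (not the data of the original run):

```python
def missing_count(docs, field):
    """Count documents where the field is absent or explicitly null."""
    return sum(1 for doc in docs if doc.get(field) is None)

# Hypothetical documents for illustration only.
docs = [
    {"name": "a", "salary": 15000},
    {"name": "b"},                  # no salary field at all
    {"name": "c", "salary": None},  # salary present but null
]
print(missing_count(docs, "salary"))
```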
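7. children aggregation
A special single-bucket aggregation that selects child documents of the specified type, as defined in a join field. The aggregation has a single option: type, the child type to select.
(1) Index definition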
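The mapping below uses a join field to define a single relation: within the doc type of join_index, question is the parent of answer. In relations, the parent name is the key and the child name is the value.
PUT join_index
{
"mappings": {
"doc": {
"properties": {
"my_join_field": {
"type": "join",
"relations": {
"question": "answer"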
}
}
}
}
}
}
Parent documents (question). The ?refresh parameter refreshes the index so that the document is searchable immediately.
PUT join_index/doc/1?refresh
{
"text": "This is a question",
"my_join_field": {
"name": "question"
}
}
PUT join_index/doc/2?refresh
{
"text": "This is another question",
"my_join_field": {
"name": "question"
}
}
Child documents (answer). routing=1 places each child on the same shard as its parent.
PUT join_index/doc/3?routing=1&refresh
{
"text": "This is an answer",
"my_join_field": {
"name": "answer",
"parent": "1"
}
}
PUT join_index/doc/4?routing=1&refresh
{
"text": "This is another answer",
"my_join_field": {
"name": "answer",
"parent": "1"
}
}
Count the child documents
POST join_index/_search
{
"size": 0,
"aggs": {
"to-answers": {
"children": {
"type" : "answer"
}
}
}
}
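With no query restricting the parents, the to-answers bucket should contain every document whose join field names the answer type. A plain-Python sketch of that selection over the four documents indexed above (it mimics the result, not how Elasticsearch resolves the join; children_count is a hypothetical helper):

```python
from collections import Counter

docs = [
    {"_id": "1", "my_join_field": {"name": "question"}},
    {"_id": "2", "my_join_field": {"name": "question"}},
    {"_id": "3", "my_join_field": {"name": "answer", "parent": "1"}},
    {"_id": "4", "my_join_field": {"name": "answer", "parent": "1"}},
]

def children_count(docs, join_field, child_type):
    """Count the documents whose join field has the given child type."""
    return sum(1 for doc in docs if doc[join_field]["name"] == child_type)

# Number of answers per parent question id.
answers_per_parent = Counter(
    doc["my_join_field"]["parent"]
    for doc in docs
    if doc["my_join_field"]["name"] == "answer"
)

print(children_count(docs, "my_join_field", "answer"), dict(answers_per_parent))
```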