Elasticsearch Tutorial (Part 2): Basic Usage

Basic Usage

Basic Concepts

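Elasticsearch is document-oriented: it stores whole documents and indexes the content of each one so that it can be searched. ES uses JSON as its document serialization format.

Index (noun): similar to a database; a place where related documents are stored.

Index (verb): the act of storing a document in an index, comparable to an insert.

ES uses an inverted index to index documents, and only fields that are present in the inverted index can be searched.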

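Inverted Index

To build an inverted index, an analyzer splits each document into terms, and the mapping from each term to the documents that contain it is stored in a single structure.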

term	doc1	doc2
run	X	-
jump	-	X
swim	X	X
fight	X	-

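A search returns every document that contains at least one of the query terms. For example, a search for run swim matches both doc1 and doc2, but doc1 is the better match because it contains both terms.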

term	doc1	doc2
run	X	-
swim	X	X
total	2	1
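
The two tables above can be reproduced with a small sketch. It is only an illustration: the whitespace tokenizer, the function names, and the "count of matching terms" ranking are assumptions made for this example, and the real ES relevance scoring is more elaborate.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.split():  # naive whitespace tokenizer (assumption)
            index[term].add(doc_id)
    return index

def search(index, query):
    """Count, per document, how many of the query terms it contains."""
    totals = defaultdict(int)
    for term in query.split():
        for doc_id in index.get(term, ()):
            totals[doc_id] += 1
    # Highest count first; ties are broken by document id.
    return sorted(totals.items(), key=lambda item: (-item[1], item[0]))

# Same content as the term table: doc1 has run/swim/fight, doc2 has jump/swim.
docs = {
    "doc1": "run swim fight",
    "doc2": "jump swim",
}
index = build_inverted_index(docs)
results = search(index, "run swim")
print(results)
```

Both documents are returned, and doc1 is ranked first because it matches two of the query terms while doc2 matches one.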

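The limitation of an inverted index is that a query term must match an indexed term exactly; if the word differs in any way, no document is found. For example, if the indexed term is swimming, a search for swim or SWIM matches nothing.

The remedy is to normalize terms. SWIM is lowercased to swim before matching, and words such as swimming and users are reduced by stemming to swim and user. An inverted index can also grow very large, so it needs to be compressed.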
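
As a rough illustration of that normalization step, here is a toy normalizer. The suffix rules are assumptions chosen only to cover the words mentioned above; ES itself relies on configurable analyzers rather than anything this crude.

```python
def normalize(term):
    """Lowercase a term and strip a few common English suffixes (toy rules)."""
    term = term.lower()
    if term.endswith("ing") and len(term) > 5:
        term = term[:-3]
        # Undo consonant doubling: "swimm" -> "swim".
        if len(term) >= 2 and term[-1] == term[-2]:
            term = term[:-1]
    elif term.endswith("s") and not term.endswith("ss") and len(term) > 3:
        term = term[:-1]
    return term

for word in ("SWIM", "swimming", "users"):
    print(word, "->", normalize(word))
```

The same function has to be applied both when documents are indexed and when the query is parsed; otherwise the two sides would still disagree on the term.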

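Indexing Documents

An Elasticsearch cluster can contain multiple indices, each index can contain multiple types, and each type has multiple fields.

To index an employee directory, we proceed as follows:

Each employee is a document.
Each document is placed under the employee type.
The employee type lives in the megacorp index.
That index is stored in the Elasticsearch cluster.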

curl -X PUT "localhost:9200/megacorp/employee/1" -H 'Content-Type: application/json' -d'
{
    "first_name": "John",
    "last_name": "Smith",
    "age": 25,
    "about": "I love to go rock climbing",
    "interests": ["sports", "music"]
}
'

{
    "_index": "megacorp",
    "_type": "employee",
    "_id": "1",
    "_version": 1,
    "result": "created",
    "_shards": {
        "total": 2,
        "successful": 1,
        "failed": 0
    },
    "_seq_no": 0,
    "_primary_term": 1
}

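Field	Meaning
_index	Index the document is stored in
_type	Type of the document
_shards	Shard information for the operation
_id	ID of the inserted document
total	Number of shard copies the operation should run on
successful	Number of shard copies on which the operation succeeded
failed	Number of shard copies on which the operation failed
result	Operation result, such as created or updated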

Insert a few more documents:

curl -X PUT "localhost:9200/megacorp/employee/2" -H 'Content-Type: application/json' -d'
{
    "first_name" :  "Jane",
    "last_name" :   "Smith",
    "age" :         32,
    "about" :       "I like to collect rock albums",
    "interests":  [ "music" ]
}
'

curl -X PUT "localhost:9200/megacorp/employee/3" -H 'Content-Type: application/json' -d'
{
    "first_name" :  "Douglas",
    "last_name" :   "Fir",
    "age" :         35,
    "about":        "I like to build cabinets",
    "interests":  [ "forestry" ]
}
'

Retrieving Documents

Use GET to retrieve a document.

curl -X GET "localhost:9200/megacorp/employee/1"

Response:

{
    "_index": "megacorp",
    "_type": "employee",
    "_id": "1",
    "_version": 1,
    "found": true,
    "_source": {
        "first_name": "John",
        "last_name": "Smith",
        "age": 25,
        "about": "I love to go rock climbing",
        "interests": ["sports", "music"]
    }
}

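Use GET to retrieve a document, PUT to index one (insert or update), DELETE to delete one, and HEAD to check whether one exists. In each case the document is addressed by its index, type, and ID.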
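
The four verbs can be lined up side by side with a short sketch that builds the requests without sending them. The helper name document_request and the BASE_URL constant are assumptions for this example; the host and path layout follow the curl commands above.

```python
from urllib.request import Request

BASE_URL = "http://localhost:9200"  # assumed local node, as in the curl examples

def document_request(method, index, doc_type, doc_id, body=None):
    """Build (but do not send) an HTTP request that addresses one document."""
    url = f"{BASE_URL}/{index}/{doc_type}/{doc_id}"
    headers = {"Content-Type": "application/json"} if body is not None else {}
    return Request(url, data=body, headers=headers, method=method)

get_req = document_request("GET", "megacorp", "employee", 1)        # retrieve
put_req = document_request("PUT", "megacorp", "employee", 1,
                           b'{"first_name": "John"}')               # index
delete_req = document_request("DELETE", "megacorp", "employee", 1)  # delete
head_req = document_request("HEAD", "megacorp", "employee", 1)      # existence check

for req in (get_req, put_req, delete_req, head_req):
    print(req.get_method(), req.full_url)
```

Passing one of these objects to urllib.request.urlopen would actually send it, which requires a running node.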

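ES supports searching with a query string passed as a URL parameter, and also searching with a request body.

Searching with a Query String

The example above fetches a document directly by its ID; the examples below obtain documents by running a search.

The _search endpoint searches the documents of a given type in an index. With no query, it returns all documents: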

curl -X GET "localhost:9200/megacorp/employee/_search"

Find all documents whose first_name is John:

curl -X GET "localhost:9200/megacorp/employee/_search?q=first_name:John"

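This simple form of search is an ad hoc query.

In an ad hoc query, users choose the query conditions freely according to their own needs, and the system produces the corresponding result from those choices. The main difference from an ordinary application query is that an application query is developed in advance for a fixed purpose, whereas the conditions of an ad hoc query are defined by the user, for example "select id from user where user_no = " + "001".

The other kind is a parameterized query, such as "select id from user where user_no = #{userNo}".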
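
The difference between the two styles can be shown with an in-memory SQLite database. This is only a sketch: the table contents are invented for the example, and sqlite3 uses ? as its placeholder rather than the #{userNo} form quoted above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE user (id INTEGER PRIMARY KEY, user_no TEXT)")
conn.executemany(
    "INSERT INTO user (id, user_no) VALUES (?, ?)",
    [(1, "001"), (2, "002")],  # made-up sample rows
)

user_no = "001"

# Ad hoc style: the SQL text itself is assembled from the input.
adhoc_sql = "select id from user where user_no = '" + user_no + "'"
adhoc_rows = conn.execute(adhoc_sql).fetchall()

# Parameterized style: the SQL text is fixed and the value is bound separately.
param_rows = conn.execute(
    "select id from user where user_no = ?", (user_no,)
).fetchall()

print(adhoc_rows, param_rows)
```

Both calls return the same row here; the difference is that in the first one the caller controls the text of the query, not just a value inside it.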

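To require several conditions at once, prefix each condition with + (meaning the condition must match) and separate the conditions with spaces. In a URL, however, a literal + is decoded as a space, so it must be percent-encoded as %2B, and the separating space is written as %20:

curl -X GET "localhost:9200/megacorp/employee/_search?q=%2Bfirst_name:John%20%2Blast_name:Smith"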
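
The encoding does not have to be done by hand. The sketch below uses the Python standard library to encode a query string with two illustrative conditions (chosen for this example) and shows what happens to a + that is left unencoded.

```python
from urllib.parse import urlencode, parse_qs

query = "+first_name:John +last_name:Smith"  # illustrative conditions

# urlencode turns "+" into %2B, ":" into %3A, and the space into "+".
encoded = urlencode({"q": query})
url = "http://localhost:9200/megacorp/employee/_search?" + encoded
print(url)

# Decoding the encoded form gives back the original query.
decoded = parse_qs(encoded)["q"][0]

# A raw "+" in the URL is decoded as a space, so the operator is lost.
lost = parse_qs("q=+first_name:John")["q"][0]
print(repr(lost))
```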

You can also search across every index in the cluster instead of naming one:

curl -X GET "localhost:9200/_all/_search?q=first_name:John"

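Because ad hoc queries let users run potentially slow, heavyweight queries against any field in the index, they can expose private information or even bring the cluster down. Exposing query-string search directly to end users is therefore not recommended.

Searching with a Request Body

A request body can express more complex queries. For example, to find documents whose first_name is John: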

curl -X GET "localhost:9200/megacorp/employee/_search" -H 'Content-type: application/json' -d'
{
    "query": {
        "match": {
            "first_name": "John"
        }
    }
}'

Result:

{
    "took": 2,
    "timed_out": false,
    "_shards": {
        "total": 5,
        "successful": 5,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": 1,
        "max_score": 0.2876821,
        "hits": [{
            "_index": "megacorp",
            "_type": "employee",
            "_id": "1",
            "_score": 0.2876821,
            "_source": {
                "first_name": "John",
                "last_name": "Smith",
                "age": 25,
                "about": "I love to go rock climbing",
                "interests": ["sports", "music"]
            }
        }]
    }
}

A range filter on age can be added, here restricting the results to employees older than 30:

curl -X GET "localhost:9200/megacorp/employee/_search" -H 'Content-type: application/json' -d'
{
    "query": {
        "bool": {
            "must": {
                "match": {
                    "first_name": "John"
                }
            },
            "filter": {
                "range": {
                    "age": {
                        "gt": 30
                    }
                }
            }
        }
    }
}'
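
When such a body is built in application code, it is easier to assemble it as a dictionary and serialize it than to concatenate JSON strings. The following is a minimal sketch; the function name is hypothetical, and the local check at the end is only a rough stand-in for what the server does.

```python
import json

def name_with_min_age(first_name, min_age):
    """Build a bool query: match on first_name, filtered to age > min_age."""
    return {
        "query": {
            "bool": {
                "must": {"match": {"first_name": first_name}},
                "filter": {"range": {"age": {"gt": min_age}}},
            }
        }
    }

body = name_with_min_age("John", 30)
payload = json.dumps(body)  # string to send as the request body
print(payload)

# Rough local emulation on the three sample employees (exact-name comparison
# is an assumption that happens to hold for these single-word names).
employees = [
    {"first_name": "John", "age": 25},
    {"first_name": "Jane", "age": 32},
    {"first_name": "Douglas", "age": 35},
]
hits = [e for e in employees if e["first_name"] == "John" and e["age"] > 30]
print(hits)
```

With the sample data, the only John is 25, so the age filter excludes him and the list of hits is empty.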

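ES search results include a relevance score for every matching document, which an ordinary MySQL WHERE-clause query does not provide. For example, a search for "go climbing" finds the documents whose field contains "go" or "climbing" and assigns each of them a relevance score.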

curl -X GET "localhost:9200/megacorp/employee/_search" -H 'Content-type: application/json' -d'
{"query":{"match":{"about":"go climbing"}}}'

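Phrase Search

To match the query as a whole phrase, with its terms adjacent and in order, rather than matching any of its individual terms, use match_phrase. For example, to search for the phrase "rock climbing":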

curl -X GET "localhost:9200/megacorp/employee/_search" -H 'Content-type: application/json' -d'
{
    "query": {
        "match_phrase": {
            "about": "rock climbing"
        }
    }
}'
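
The difference between the two query types can be approximated locally. The sketch below is a simplification under stated assumptions: it tokenizes on whitespace, ignores scoring, and uses helper names invented for this example.

```python
def tokenize(text):
    """Lowercase and split on whitespace (a stand-in for a real analyzer)."""
    return text.lower().split()

def match_any(text, query):
    """Rough analogue of match: true if any query term occurs in the text."""
    tokens = set(tokenize(text))
    return any(term in tokens for term in tokenize(query))

def match_phrase(text, query):
    """Rough analogue of match_phrase: terms must be adjacent and in order."""
    tokens, phrase = tokenize(text), tokenize(query)
    n = len(phrase)
    return any(tokens[i:i + n] == phrase for i in range(len(tokens) - n + 1))

# The "about" fields of the three sample employees, keyed by document id.
about = {
    "1": "I love to go rock climbing",
    "2": "I like to collect rock albums",
    "3": "I like to build cabinets",
}
any_hits = [d for d, text in about.items() if match_any(text, "rock climbing")]
phrase_hits = [d for d, text in about.items() if match_phrase(text, "rock climbing")]
print(any_hits, phrase_hits)
```

Document 2 mentions rock but not climbing, so it is found by the term-based match and dropped by the phrase match.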

Highlighting

Adding a "highlight" section wraps every matched keyword in the listed fields in <em></em> tags. Field names support wildcard notation.

curl -X GET "localhost:9200/megacorp/employee/_search" -H 'Content-type: application/json' -d'
{
    "query": {
        "match_phrase": {
            "about": "rock climbing"
        }
    },
    "highlight": {
        "fields": {
            "about": {}
        }
    }
}'

Result:

{
    "took": 3,
    "timed_out": false,
    "_shards": {
        "total": 5,
        "successful": 5,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": 1,
        "max_score": 0.5753642,
        "hits": [{
            "_index": "megacorp",
            "_type": "employee",
            "_id": "1",
            "_score": 0.5753642,
            "_source": {
                "first_name": "John",
                "last_name": "Smith",
                "age": 25,
                "about": "I love to go rock climbing",
                "interests": ["sports", "music"]
            },
            "highlight": {
                "about": ["I love to go <em>rock</em> <em>climbing</em>"]
            }
        }]
    }
}
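
The effect of highlighting on a single field value can be imitated with a regular expression. This is a sketch of the visible result only, not of how ES computes highlights; the function name and its parameters are assumptions for this example.

```python
import re

def highlight(text, terms, pre="<em>", post="</em>"):
    """Wrap each whole-word occurrence of the given terms in pre/post tags."""
    alternatives = "|".join(re.escape(term) for term in terms)
    pattern = re.compile(r"\b(" + alternatives + r")\b", re.IGNORECASE)
    return pattern.sub(lambda m: pre + m.group(0) + post, text)

snippet = highlight("I love to go rock climbing", ["rock", "climbing"])
print(snippet)
```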

Aggregations

curl -X GET "localhost:9200/megacorp/employee/_search" -H 'Content-type: application/json' -d'
{"aggs": {"all_interests": {"terms": {"field": "interests" } } } }'

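Note that the field used in an aggregation must not be of type text. ES 6 distinguishes two string types (a split introduced in 5.0): keyword, which supports aggregations, and text, which does not by default. Aggregating on a text field produces this error: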

Fielddata is disabled on text fields by default. Set fielddata=true on [interests] in order to load fielddata in memory by uninverting the inverted index. Note that this can however use significant memory. Alternatively use a keyword field instead.
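
What the terms aggregation is meant to compute can be worked out locally from the three sample documents. The sketch below counts the interests with a Counter and also builds a request body that targets interests.keyword; that sub-field name is an assumption, valid only if the index was created with the default dynamic mapping, which adds a keyword sub-field to string fields.

```python
from collections import Counter

# Interests of the three sample employees indexed earlier.
employees = [
    {"first_name": "John", "interests": ["sports", "music"]},
    {"first_name": "Jane", "interests": ["music"]},
    {"first_name": "Douglas", "interests": ["forestry"]},
]

# Local equivalent of a terms aggregation: one bucket per distinct value.
counts = Counter(i for e in employees for i in e["interests"])
buckets = [{"key": key, "doc_count": n} for key, n in counts.most_common()]
print(buckets)

# Request body aimed at the keyword sub-field (assumed to exist, see above).
agg_body = {
    "aggs": {
        "all_interests": {
            "terms": {"field": "interests.keyword"}
        }
    }
}
```

The other route named in the error message, setting fielddata=true on the text field, also works but loads the field data into memory, as the message itself warns.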