Elasticsearch Analysis and Mapping

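Elasticsearch analyzers

An analyzer first splits a block of text into individual terms suitable for an inverted index, then normalizes those terms into a standard form. It does this in three steps.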

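Character filters
Tidy up the string before it is tokenized, for example by stripping HTML or converting & to and.

Tokenizer
Splits the string into individual terms, for example on whitespace or commas.

Token filters
Every term produced by the tokenizer passes through the token filters, which can change it (for example lowercasing Quick to quick, or stemming books to book).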
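The three steps can be sketched in a few lines of Python. This is only a toy illustration, not how Elasticsearch implements analysis: the function names are made up for this example, the HTML stripping is a naive regular expression, and removing a trailing "s" is a crude stand-in for real stemming.

```python
import re

def char_filter(text):
    """Step 1: tidy the raw string (drop HTML tags, spell out '&')."""
    text = re.sub(r"<[^>]+>", "", text)  # naive tag stripping, not a real HTML parser
    return text.replace("&", " and ")

def tokenize(text):
    """Step 2: split on whitespace or commas, discarding empty pieces."""
    return [tok for tok in re.split(r"[\s,]+", text) if tok]

def token_filter(tokens):
    """Step 3: lowercase each term and strip a trailing 's' (toy stemming)."""
    result = []
    for tok in tokens:
        tok = tok.lower()
        if len(tok) > 3 and tok.endswith("s"):
            tok = tok[:-1]
        result.append(tok)
    return result

def analyze(text):
    """Run the three steps in order, as an analyzer does."""
    return token_filter(tokenize(char_filter(text)))

print(analyze("<p>Quick books & pens</p>"))
# ['quick', 'book', 'and', 'pen']
```

The order matters: character filters see the whole string, the tokenizer produces the terms, and token filters only ever see one term at a time.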

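Built-in Elasticsearch analyzers

1. Standard analyzer
The default analyzer in Elasticsearch and the best general-purpose choice. It splits text on word boundaries, removes most punctuation, and lowercases the terms.

2. Simple analyzer
Splits the text on anything that is not a letter and lowercases the terms. For example, "Hello World" becomes hello, world.

3. Whitespace analyzer
Splits the text on whitespace and does not change case.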
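To make the difference between the simple and whitespace analyzers concrete, here is a rough Python approximation of both. The function names are invented for this sketch and the behavior only mimics the descriptions above; the standard analyzer is left out because its word-boundary rules are more involved.

```python
import re

def simple_analyzer(text):
    """Approximation: keep runs of letters only, then lowercase them."""
    # [^\W\d_]+ matches one or more Unicode letters (word chars minus digits and '_').
    return [tok.lower() for tok in re.findall(r"[^\W\d_]+", text)]

def whitespace_analyzer(text):
    """Approximation: split on whitespace and leave the case untouched."""
    return text.split()

sample = "Hello World-2 foo"
print(simple_analyzer(sample))      # ['hello', 'world', 'foo']
print(whitespace_analyzer(sample))  # ['Hello', 'World-2', 'foo']
```

Note how the digit and the hyphen disappear under the simple analyzer but survive under the whitespace analyzer.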

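Using analyzers

When a document is indexed, each full-text field is analyzed into individual terms, which are used to build the inverted index.
A search on a full-text field passes the query string through the same analysis, so that the query terms take the same form as the indexed terms.
A search on an exact-value field does not analyze the query string; it looks for the exact value given.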
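The following sketch shows why the same analysis has to run at index time and at search time. ToyIndex is a hypothetical in-memory class written for this example, and its analyzer simply lowercases and splits on whitespace; real Elasticsearch indexes are far more elaborate.

```python
from collections import defaultdict

def analyze(text):
    """Stand-in analyzer: lowercase, then split on whitespace."""
    return text.lower().split()

class ToyIndex:
    def __init__(self):
        self.full_text = defaultdict(set)  # analyzed term -> document ids
        self.exact = defaultdict(set)      # untouched value -> document ids

    def add(self, doc_id, text, status):
        for term in analyze(text):         # full-text field: analyzed
            self.full_text[term].add(doc_id)
        self.exact[status].add(doc_id)     # exact-value field: stored as given

    def search_text(self, query):
        """Full-text search: the query goes through the same analyzer."""
        ids = set()
        for term in analyze(query):
            ids |= self.full_text.get(term, set())
        return ids

    def search_exact(self, value):
        """Exact-value search: no analysis, the value must match as is."""
        return set(self.exact.get(value, set()))

idx = ToyIndex()
idx.add(1, "Quick Brown Fox", "Published")
print(idx.search_text("QUICK"))         # {1}
print(idx.search_exact("published"))    # set()
print(idx.search_exact("Published"))    # {1}
```

The full-text search finds the document despite the different case, because both sides were lowercased. The exact-value search only matches when the value is identical.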

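Elasticsearch mapping

To handle data correctly, Elasticsearch needs to know what each field contains, so that a date field is treated as a date and a numeric field as a number. The mapping records this information: the data type of every field.
When a new document is indexed, Elasticsearch uses dynamic mapping to guess the type of each field it has not seen before, starting from the basic data types available in JSON.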

Field types

Type	Elasticsearch data types
String	string
Whole number	byte, short, integer, long
Floating point	float, double
Boolean	boolean
Date	date

Dynamic mapping examples:

JSON value	Mapped field type
Boolean: true or false	"boolean"
Whole number: 123	"long"
Floating point: 123.45	"double"
String, valid date: "2014-09-15"	"date"
String: "foo bar"	"string"
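The guessing rules in the table can be imitated with a small Python function. This is a simplified sketch, not Elasticsearch's actual detection logic: guess_field_type is a name made up here, and date detection is reduced to "parses as an ISO date", whereas the real behavior depends on the configured date formats.

```python
import json
from datetime import date

def guess_field_type(value):
    """Guess a mapping type from a decoded JSON value, following the table above."""
    # bool must be tested before int: in Python, bool is a subclass of int.
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "double"
    if isinstance(value, str):
        try:
            date.fromisoformat(value)  # e.g. "2014-09-15"
            return "date"
        except ValueError:
            return "string"
    raise TypeError(f"no simple type for {value!r}")

doc = json.loads(
    '{"ok": true, "count": 123, "price": 123.45,'
    ' "day": "2014-09-15", "text": "foo bar"}'
)
guessed = {field: guess_field_type(value) for field, value in doc.items()}
print(guessed)
```

A practical consequence of guessing: the string "123" would be mapped as a string, not as a long, because the decision is based on the JSON type rather than on what the text looks like.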

Besides the simple types above, JSON also has null values, arrays, and objects.

Arrays
To give a field multiple values, index an array of values:

{ "tag": [ "search", "nosql" ]}

Empty fields
Empty fields are not indexed. The following are all treated as empty fields:

"empty_string": "",
"null_value": null,
"empty_array": [],
"array_with_null_value": [ null ]

Objects
Example:

{
    "tweet": "Elasticsearch is very flexible",
    "user": {
        "age": 26,
        "gender": "male",
        "id": "@johnsmith",
        "name": {
            "first": "John",
            "full": "John Smith",
            "last": "Smith"
        }
    }
}

The mapping produced by dynamic mapping:

{
    "gb": {
        "tweet": { <1>
            "properties": {
                "tweet": {
                    "type": "string"
                },
                "user": { <2>
                    "properties": {
                        "age": {
                            "type": "long"
                        },
                        "gender": {
                            "type": "string"
                        },
                        "id": {
                            "type": "string"
                        },
                        "name": { <3>
                            "properties": {
                                "first": {
                                    "type": "string"
                                },
                                "full": {
                                    "type": "string"
                                },
                                "last": {
                                    "type": "string"
                                }
                            },
                            "type": "object"
                        }
                    },
                    "type": "object"
                }
            }
        }
    }
}

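Here <1> marks the root object, and <2> and <3> mark inner objects.

Internally, the document is flattened into simple key-value pairs before it is indexed, like this: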

{
    "tweet": [elasticsearch, flexible, very],
    "user.id": [@johnsmith],
    "user.gender": [male],
    "user.age": [26],
    "user.name.full": [john, smith],
    "user.name.first": [john],
    "user.name.last": [smith]
}
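The flattening of inner objects into dotted field names can be sketched as a short recursive function. This is an illustration only: flatten is a hypothetical helper, and it keeps the raw values rather than the analyzed terms shown above.

```python
def flatten(obj, prefix=""):
    """Turn nested dicts into a single dict keyed by dotted paths."""
    flat = {}
    for key, value in obj.items():
        path = prefix + key
        if isinstance(value, dict):
            # Inner object: recurse, extending the path with a dot.
            flat.update(flatten(value, prefix=path + "."))
        else:
            flat[path] = value
    return flat

doc = {
    "tweet": "Elasticsearch is very flexible",
    "user": {
        "age": 26,
        "gender": "male",
        "id": "@johnsmith",
        "name": {"first": "John", "full": "John Smith", "last": "Smith"},
    },
}

for path, value in flatten(doc).items():
    print(path, "=", value)
```

Each leaf value ends up under its full path, such as user.name.first, which is the name used to refer to an inner field.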

Arrays of objects
Example:

{
	"followers": [
		{ "age": 35, "name": "Mary White"},
		{ "age": 26, "name": "Alex Jones"},
		{ "age": 19, "name": "Lisa Smith"}
	]
}

The values of each field are collected from all the objects and flattened into separate multi-value fields:

{
"followers.age": [19, 26, 35],
"followers.name": [alex, jones, lisa, smith, mary, white]
}

The association between the properties of a single object is lost, for example between age: 35 and name: Mary White.
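A small Python sketch makes the lost association visible. Both functions are hypothetical helpers written for this example, and the name analysis is reduced to lowercasing and splitting on whitespace.

```python
def flatten_followers(followers):
    """Collect each property across all objects into one multi-value field."""
    ages = sorted(f["age"] for f in followers)
    names = set()
    for f in followers:
        names.update(f["name"].lower().split())
    return {"followers.age": ages, "followers.name": names}

def matches(flat, age, name):
    """Does the flattened document contain this age AND this name term?"""
    return age in flat["followers.age"] and name in flat["followers.name"]

followers = [
    {"age": 35, "name": "Mary White"},
    {"age": 26, "name": "Alex Jones"},
    {"age": 19, "name": "Lisa Smith"},
]
flat = flatten_followers(followers)

print(matches(flat, 35, "mary"))  # True, and correct: Mary is 35
print(matches(flat, 26, "mary"))  # True as well, although no follower is a 26-year-old Mary
```

The second query matches because 26 and mary both occur somewhere in the flattened fields, even though they came from different objects.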

Viewing the mapping

The mapping that Elasticsearch holds for a type can be retrieved through the _mapping endpoint.

GET /gb/_mapping/tweet

Result:

{
    "gb": {
        "mappings": {
            "tweet": {
                "properties": {
                    "date": {
                        "format": "strict_date_optional_time||epoch_millis",
                        "type": "date"
                    },
                    "name": {
                        "type": "string"
                    },
                    "tweet": {
                        "type": "string"
                    },
                    "user_id": {
                        "type": "long"
                    }
                }
            }
        }
    }
}
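Once the response has been decoded from JSON, the field types are easy to pull out. The sketch below assumes the response has exactly the shape shown above (index, then mappings, then type, then properties); field_types is a name invented for this example, and no request is actually sent.

```python
def field_types(response, index, doc_type):
    """Return {field name: type} from a decoded _mapping response."""
    properties = response[index]["mappings"][doc_type]["properties"]
    # Inner objects may carry only "properties", so fall back to "object".
    return {name: spec.get("type", "object") for name, spec in properties.items()}

response = {
    "gb": {
        "mappings": {
            "tweet": {
                "properties": {
                    "date": {
                        "format": "strict_date_optional_time||epoch_millis",
                        "type": "date",
                    },
                    "name": {"type": "string"},
                    "tweet": {"type": "string"},
                    "user_id": {"type": "long"},
                }
            }
        }
    }
}

types = field_types(response, "gb", "tweet")
print(types)
```

Comparing this result with the types you expected is a quick way to spot a field that dynamic mapping guessed wrongly.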

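index

The index attribute controls how a string field is indexed.
It accepts the following values:

Value	Meaning
analyzed	First analyze the string, then index it. In other words, index this field as full text.
not_analyzed	Index this field so that it is searchable, but index the value exactly as given. Do not analyze it.
no	Do not index this field at all. The field cannot be searched.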
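The effect of the three values can be imitated by looking at which terms would end up in the index. This is a toy model: indexed_terms is a made-up function, and lowercasing plus whitespace splitting stands in for a real analyzer.

```python
def indexed_terms(value, index="analyzed"):
    """Return the terms that would be indexed for a string value."""
    if index == "analyzed":
        return value.lower().split()  # stand-in for running an analyzer
    if index == "not_analyzed":
        return [value]                # one term, exactly as given
    if index == "no":
        return []                     # nothing indexed, so nothing to search
    raise ValueError(f"unknown index option: {index!r}")

for option in ("analyzed", "not_analyzed", "no"):
    print(option, indexed_terms("Quick Brown Fox", option))
```

With not_analyzed, a search only succeeds for the whole original value, which suits fields such as tags, identifiers, or status codes.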

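Setting an analyzer for a field

The analyzer attribute sets the analyzer that is applied to an analyzed string field when it is indexed: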

{
    "tweet": {
        "analyzer": "english",
        "type": "string"
    }
}
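When there are several fields, it can be convenient to assemble such a mapping in code and serialize it to JSON. The sketch below only builds the request body and sends nothing. The helpers string_field and build_index_body are hypothetical, the tag field is added purely for illustration, and the body layout (mappings, then type, then properties) is an assumption based on the _mapping output shown earlier.

```python
import json

def string_field(analyzer=None, index=None):
    """Build the definition of a string field with optional attributes."""
    spec = {"type": "string"}
    if analyzer is not None:
        spec["analyzer"] = analyzer
    if index is not None:
        spec["index"] = index
    return spec

def build_index_body(doc_type, properties):
    """Wrap field definitions in the mappings/type/properties layout."""
    return {"mappings": {doc_type: {"properties": properties}}}

body = build_index_body(
    "tweet",
    {
        "tweet": string_field(analyzer="english"),  # full text, English analyzer
        "tag": string_field(index="not_analyzed"),  # illustrative exact-value field
    },
)
print(json.dumps(body, indent=4, sort_keys=True))
```

The mapping of an existing field generally cannot be changed afterwards, so settings like these are best decided before the first document is indexed.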

SW_LCC
