Learning Elasticsearch (Part 1): Creating an Index and Defining Mapping Rules

I. Create an index and set IK as the default analyzer

curl -H "Content-Type: application/json" -XPUT http://localhost:9200/myindex -d '
{
    "settings" : {
        "index" : {
            "max_result_window" : 100000000
        },
        "analysis": {
            "analyzer": {
                "default": {
                    "type": "ik_max_word"
                }
            }
        }
    }
}
'

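The max_result_window setting caps how deep a paginated search can go. By default ES requires from + size <= 10000 for each search request. A value as large as 100000000 effectively removes that cap, but deep pages cost memory and time on every shard, so choose a value no larger than you actually need.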
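To make the limit concrete, here is a minimal sketch of the check described above. The function name page_allowed is made up for illustration and is not part of any ES client; only the from + size rule and the default of 10000 come from the text.

```python
# Default value of index.max_result_window, as stated in the text above.
DEFAULT_MAX_RESULT_WINDOW = 10_000


def page_allowed(from_: int, size: int,
                 max_result_window: int = DEFAULT_MAX_RESULT_WINDOW) -> bool:
    """Return True if a page starting at from_ with size hits fits the window."""
    return from_ + size <= max_result_window


# The last page that still fits the default window, and the first one that does not.
print(page_allowed(9990, 10))                # True
print(page_allowed(9991, 10))                # False
# With the raised window used in this article the same page is accepted.
print(page_allowed(9991, 10, 100_000_000))   # True
```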

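If the index already exists but is still empty and you need to switch analyzers, close the index, apply the new analyzer settings, and then reopen it. (If the index already contains documents, it is better to delete it and rebuild it from scratch: documents indexed with the old analyzer do not match expectations, so there is little point in keeping them.)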

Close the index:

curl -H "Content-Type: application/json" -XPOST http://localhost:9200/myindex/_close -d '
{}
'

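Update the analyzer settings (this goes to the _settings endpoint, because a plain PUT to /myindex would try to create the index again and fail since it already exists):

curl -H "Content-Type: application/json" -XPUT http://localhost:9200/myindex/_settings -d '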
{
    "settings" : {
        "index" : {
            "max_result_window" : 100000000
        },
        "analysis": {
            "analyzer": {
                "default": {
                    "type": "ik_max_word"
                }
            }
        }
    }
}
'

Reopen the index:

curl -H "Content-Type: application/json" -XPOST http://localhost:9200/myindex/_open -d '
{}
'
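The close, update, reopen sequence can also be scripted. The sketch below only builds the three HTTP requests with Python's standard library and does not send them. The helper name reconfigure_requests is hypothetical, and it assumes the settings update goes to the index's _settings endpoint with the analysis section as the request body.

```python
import json
from urllib.request import Request


def reconfigure_requests(base_url: str, index: str, analysis: dict) -> list:
    """Build the close -> update settings -> open requests, in order."""
    headers = {"Content-Type": "application/json"}
    body = json.dumps({"analysis": analysis}).encode("utf-8")
    return [
        Request(f"{base_url}/{index}/_close", method="POST"),
        Request(f"{base_url}/{index}/_settings", data=body,
                headers=headers, method="PUT"),
        Request(f"{base_url}/{index}/_open", method="POST"),
    ]


requests_to_send = reconfigure_requests(
    "http://localhost:9200",
    "myindex",
    {"analyzer": {"default": {"type": "ik_max_word"}}},
)
for req in requests_to_send:
    print(req.get_method(), req.full_url)
# To actually run them against a live node, pass each request to
# urllib.request.urlopen(req) in this order.
```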

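Custom analysis:

First, a quick overview of a few concepts:

Analyzer: the component that turns text into index terms. It combines zero or more character filters, exactly one tokenizer, and zero or more token filters. ES ships with many built-in analyzers, such as standard, simple, and whitespace.

Tokenizer: splits the given text into individual tokens.

Character Filter: runs before the text reaches the Tokenizer and processes its characters, for example replacing specified characters with others.

Filter (token filter): post-processes the tokens produced by the Tokenizer, for example converting them to lowercase.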
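The order in which these pieces run can be sketched in a few lines of Python. This is only an illustration of the pipeline (character filters, then one tokenizer, then token filters), not how ES implements it; the helper names and the sample text are made up.

```python
from typing import Callable, List


def analyze(text: str,
            char_filters: List[Callable[[str], str]],
            tokenizer: Callable[[str], List[str]],
            token_filters: List[Callable[[List[str]], List[str]]]) -> List[str]:
    """Run text through character filters, one tokenizer, then token filters."""
    for char_filter in char_filters:
        text = char_filter(text)          # operates on the raw string
    tokens = tokenizer(text)              # string -> list of tokens
    for token_filter in token_filters:
        tokens = token_filter(tokens)     # operates on the token list
    return tokens


def replace_ampersand(text: str) -> str:
    """Character filter: replace '&' with the word 'and'."""
    return text.replace("&", " and ")


def whitespace_tokenizer(text: str) -> List[str]:
    """Tokenizer: split on runs of whitespace."""
    return text.split()


def lowercase(tokens: List[str]) -> List[str]:
    """Token filter: lowercase every token."""
    return [token.lower() for token in tokens]


print(analyze("Tom&Jerry RUN", [replace_ampersand], whitespace_tokenizer, [lowercase]))
# ['tom', 'and', 'jerry', 'run']
```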

Now an example. Suppose the input text is "张三吃饭#李四洗澡" and I want to split it only on #, so that the result is exactly two terms, "张三吃饭" and "李四洗澡", and "张三吃饭" is not split further into "张三" and "吃饭".

curl -H "Content-Type: application/json" -XPUT http://localhost:9200/myindex -d '
{
    "settings": {
        "index": {
            "number_of_shards": 5,
            "number_of_replicas": 1,
            "max_result_window" : 100000000
        },
        "analysis": {
            "analyzer": {
                "default": {
                    "type": "ik_smart"
                },
                "my_analyzer": {
                    "type": "custom",
                    "tokenizer": "my_tokenizer"
                }
            },
            "tokenizer": {
                "my_tokenizer": {
                    "type": "pattern",
                    "pattern": "#"
                }
            }
        }
    },
    "mappings": {
        "mytype": {
            "properties": {
                "title": {
                    "type": "text",
                    "analyzer": "my_analyzer"
                }
            }
        }
    }
}
'

Next, index the text "张三吃饭#李四洗澡" into myindex/mytype.

curl -H "Content-Type: application/json" -XPOST http://localhost:9200/myindex/mytype/1 -d '
{
    "title": "张三吃饭#李四洗澡"
}
'

Then check how the field was tokenized:

curl -XGET 'http://localhost:9200/myindex/mytype/1/_termvectors?fields=title'

The response:

{
    "_index": "myindex",
    "_type": "mytype",
    "_id": "1",
    "_version": 1,
    "found": true,
    "took": 46,
    "term_vectors": {
        "title": {
            "field_statistics": {
                "sum_doc_freq": 2,
                "doc_count": 1,
                "sum_ttf": 2
            },
            "terms": {
                "张三吃饭": {
                    "term_freq": 1,
                    "tokens": [
                        {
                            "position": 0,
                            "start_offset": 0,
                            "end_offset": 4
                        }
                    ]
                },
                "李四洗澡": {
                    "term_freq": 1,
                    "tokens": [
                        {
                            "position": 1,
                            "start_offset": 5,
                            "end_offset": 9
                        }
                    ]
                }
            }
        }
    }
}

The tokens match what we expected.
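To see where the positions and offsets in the response come from, here is a rough Python imitation of a pattern tokenizer that splits on a separator. It is a sketch rather than the ES implementation, and it counts offsets in Python characters, which agrees with ES for this sample text.

```python
import re


def pattern_tokenize(text: str, pattern: str = "#") -> list:
    """Split text on every match of pattern, keeping position and offsets."""
    tokens = []
    start = 0
    position = 0

    def emit(begin: int, end: int) -> None:
        nonlocal position
        if end > begin:  # skip empty tokens, e.g. from "a##b"
            tokens.append({
                "term": text[begin:end],
                "position": position,
                "start_offset": begin,
                "end_offset": end,
            })
            position += 1

    for match in re.finditer(pattern, text):
        emit(start, match.start())
        start = match.end()
    emit(start, len(text))  # trailing token after the last separator
    return tokens


for token in pattern_tokenize("张三吃饭#李四洗澡"):
    print(token)
# {'term': '张三吃饭', 'position': 0, 'start_offset': 0, 'end_offset': 4}
# {'term': '李四洗澡', 'position': 1, 'start_offset': 5, 'end_offset': 9}
```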

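II. Define mapping rules (mapping)

By default, even without an explicit mapping, ES creates one automatically from the JSON documents you submit. The generated mapping is fairly basic, though: types are inferred from the JSON values (for example, integers become long and strings become text with a keyword sub-field), so you get no control over analyzers or over narrower types such as integer.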
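As a rough picture of what automatic mapping does, the sketch below guesses a field type from a JSON value. It is a deliberately simplified approximation: it ignores date detection, arrays, and the keyword sub-field that ES adds to dynamically mapped strings, and the function name is made up.

```python
def guess_field_type(value) -> str:
    """Guess a mapping type from a decoded JSON value (simplified)."""
    # bool must be checked before int, because bool is a subclass of int in Python.
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "float"
    if isinstance(value, str):
        return "text"
    if isinstance(value, dict):
        return "object"
    raise ValueError(f"no rule in this sketch for {type(value).__name__}")


doc = {"id": 1, "title": "hello", "price": 9.5, "active": True}
print({field: guess_field_type(value) for field, value in doc.items()})
# {'id': 'long', 'title': 'text', 'price': 'float', 'active': 'boolean'}
```

Note that id comes out as long here, which is why the manual mapping is needed if you want integer instead.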

To define the mapping manually:

curl -H "Content-Type: application/json" -XPOST http://localhost:9200/myindex/mytype/_mappings -d '
{
    "properties": {
        "id": {
            "type": "integer"
        },
        "title": {
            "type": "text"
        },
        "content": {
            "type": "text"
        }
    }
}
'

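For string fields, the type can be text or keyword. text values are analyzed (split into tokens), while keyword values are not analyzed and are indexed as a single term. So if you need exact-match queries on a field, map it as keyword (available since version 5.0).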
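The difference shows up when you look at which terms end up in the index. The following sketch uses a whitespace split plus lowercasing as a stand-in for a real analyzer, so it only illustrates the idea. The function names are hypothetical.

```python
def index_terms(value: str, field_type: str) -> set:
    """Return the set of terms that would be indexed for value."""
    if field_type == "keyword":
        return {value}                      # whole value, untouched
    if field_type == "text":
        return set(value.lower().split())   # stand-in for a real analyzer
    raise ValueError(f"unsupported field type: {field_type}")


def exact_term_matches(query: str, value: str, field_type: str) -> bool:
    """An exact (term-level) lookup only matches a term present in the index."""
    return query in index_terms(value, field_type)


title = "Quick Brown Fox"
print(exact_term_matches("Quick Brown Fox", title, "keyword"))  # True
print(exact_term_matches("Quick Brown Fox", title, "text"))     # False
print(exact_term_matches("quick", title, "text"))               # True
```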

If an index already holds documents and you want to change its mapping, the best approach is to delete the index, recreate it with the new mapping, and then index the data again.

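Limitations of mapping:

In 5.x and later, fields with the same name within one index share a single mapping, even when they belong to different mapping types (and the concept of mapping types is slated for removal starting with 7.0). For example, if student records have a name field and student score records also have a name field, both are stored in the same Lucene field. In practice this means each field name in an index should be treated as unique.

Early ES concepts compared an index to a database and a type to a table. That analogy does not hold for an index, and the latest documentation acknowledges the problem: in a database, columns with the same name in different tables do not affect each other, whereas in Lucene fields with the same name are backed by a single index structure.

As a result, indexing two related tables (for example, tables linked by a foreign key) as two separate types in one index runs into trouble whenever they share a field name whose definitions differ. If you do need to index data from both tables, it is better to build a standalone data object that contains all fields from both tables (with duplicates removed) and index that object as a whole.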
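A minimal sketch of building such a combined object is shown below. The table and field names (student_id, name, course, score) are invented for illustration, and the conflict check is just one reasonable policy for handling duplicate fields.

```python
def merge_records(student: dict, score: dict) -> dict:
    """Merge two related rows into one document, dropping duplicate fields."""
    doc = dict(student)
    for field, value in score.items():
        # A shared field (such as the foreign key) must agree in both rows.
        if field in doc and doc[field] != value:
            raise ValueError(f"conflicting values for field {field!r}")
        doc[field] = value
    return doc


student_row = {"student_id": 1, "name": "Zhang San"}
score_row = {"student_id": 1, "course": "math", "score": 90}
print(merge_records(student_row, score_row))
# {'student_id': 1, 'name': 'Zhang San', 'course': 'math', 'score': 90}
```

The resulting dictionary can then be serialized to JSON and indexed as a single document.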

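III. Rebuild settings or mapping and migrate production data seamlessly

Application designs always turn out to have flaws. When the settings or mapping of an index need to be rebuilt, the crude approach is to delete the index, recreate it under the new rules, and index everything again. That leaves your service unavailable for a while, and the more data you have, the longer the outage.

An index alias solves this problem. The basic idea: give the old index an alias, index-alias, and have your code access the index only through index-alias. Then create index2 under the new rules, reindex all data from index into index2, and finally bind index-alias to index2. This switches over to the rebuilt index seamlessly. The service stays available throughout the switch; some search results may be off for a while, but that is far better than a complete outage.

1. Add an alias to the existing index

curl -H "Content-Type: application/json" -X POST http://localhost:9200/_aliases -d '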
{
    "actions": [
        { "add" : { "index" : "index", "alias" : "index-alias" } }
    ]
}
'

2. Create the new index2

curl -H "Content-Type: application/json" -X PUT http://localhost:9200/index2 -d '
{
    "settings": {
        "index": {
            "number_of_shards": 5,
            "number_of_replicas": 1,
            "max_result_window" : 100000000
        },
        "analysis": {
            "analyzer": {
                "default": {
                    "type": "ik_max_word"
                }
            }
        }
    },
    "mappings": {
        "user": {
            "properties": {
                "name": {
                    "type": "keyword"
                }
            }
        }
    }
}
'

3. Reindex all data from index into index2

curl -H "Content-Type: application/json" -X POST http://localhost:9200/_reindex -d '
{
    "source": {
        "index": "index"
    },
    "dest": {
        "index": "index2"
    }
}
'

4. Switch the alias

curl -H "Content-Type: application/json" -X POST http://localhost:9200/_aliases -d '
{
    "actions": [
        { "remove" : { "index" : "index", "alias" : "index-alias" } },
        { "add" : { "index" : "index2", "alias" : "index-alias" } }
    ]
}
'

5. Delete the old index

curl -X DELETE http://localhost:9200/index

The index rebuild is now complete.
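The five steps above can be sketched as an ordered plan of HTTP calls. The code only assembles the method, path, and body for each step and prints them. The function name rebuild_plan is hypothetical, and the empty settings and mappings passed in the example are placeholders for your real index definition.

```python
import json


def rebuild_plan(alias: str, old_index: str, new_index: str,
                 new_index_body: dict) -> list:
    """Return the ordered (method, path, body) steps of an alias-based rebuild."""
    return [
        # 1. Point the alias at the old index.
        ("POST", "/_aliases",
         {"actions": [{"add": {"index": old_index, "alias": alias}}]}),
        # 2. Create the new index with the new settings and mapping.
        ("PUT", f"/{new_index}", new_index_body),
        # 3. Copy all documents from the old index into the new one.
        ("POST", "/_reindex",
         {"source": {"index": old_index}, "dest": {"index": new_index}}),
        # 4. Move the alias; both actions are sent in one _aliases request.
        ("POST", "/_aliases",
         {"actions": [
             {"remove": {"index": old_index, "alias": alias}},
             {"add": {"index": new_index, "alias": alias}},
         ]}),
        # 5. Drop the old index.
        ("DELETE", f"/{old_index}", None),
    ]


plan = rebuild_plan("index-alias", "index", "index2",
                    {"settings": {}, "mappings": {}})
for method, path, body in plan:
    print(method, path, json.dumps(body) if body is not None else "")
```

Putting the remove and add actions in the same _aliases request matters: the alias never points at nothing, so clients using index-alias do not see a gap.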

If you only need to add a new field to the mapping, there is a simpler way (because the field is new, ES should not yet contain any data for it):

curl -H "Content-Type: application/json" -X POST http://localhost:9200/index-alias/type1/_mapping -d '
{
    "properties": {
        "newPropertity": {
            "type": "nested"
        }
    }
}
'
