ElasticSearch
Inverted index
Building an inverted index: segment each document title into terms and store every term; each term maps to the ids of the documents that contain it.
Inverted index retrieval: suppose we search for "Huawei mobile phone"
- Tokenize the query into "Huawei" and "mobile phone"
- Look up the two terms in the inverted index and fetch their document ids
- The document ids are 2,3 and 1,2 respectively; document id=2 appears in both lists, so it overlaps the query the most, matches best, and is ranked first
- The matching documents are collected into the result set
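The retrieval steps above can be sketched in a few lines of Python (the documents, ids, and terms are made-up illustrations, not how es implements this internally):

```python
# Minimal sketch of an inverted index with overlap ranking.
from collections import defaultdict

def build_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, terms in docs.items():
        for term in terms:
            index[term].add(doc_id)
    return index

def search(index, query_terms):
    """Rank documents by how many query terms they contain."""
    hits = defaultdict(int)
    for term in query_terms:
        for doc_id in index.get(term, set()):
            hits[doc_id] += 1
    # Documents containing more query terms rank first.
    return sorted(hits, key=lambda d: -hits[d])

# Titles already segmented into terms, as the tokenizer would do.
docs = {
    1: ["mobile phone", "case"],
    2: ["Huawei", "mobile phone"],
    3: ["Huawei", "charger"],
}
index = build_index(docs)
print(search(index, ["Huawei", "mobile phone"]))  # document 2 ranks first
```

Document 2 matches both terms while 1 and 3 match only one each, so it comes back at the top, mirroring the ranking described above.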
Environment configuration
First you need to download the following three things (version 7.8 is chosen here to stay compatible with older versions of Java; newer ES releases require newer JDKs, which is very inconvenient)
Attention! Since we are setting up the environment under Windows, when downloading the ik tokenizer be sure to download the compiled package elasticsearch-analysis-ik-7.8.0.zip. Do not download the source package!
All three components must be the same version! There is no such thing as downward or upward compatibility!
Installation under Windows is very simple: unzip all three archives into a directory whose path contains no Chinese characters.
First, extract the entire contents of the ik tokenizer archive into a subfolder (here analysis-ik) under the plugins folder in the es7.8 root directory
Open the JVM configuration file of es7.8: es7.8/config/jvm.options
Lower the heap size, otherwise es can eat too much memory and crash after startup:
-Xms1g
-Xmx1g
You're done. Double-click to run the following two bat files (note the order):
<es root directory>/bin/elasticsearch.bat
<kibana root directory>/bin/kibana.bat
es runs on port 9200 by default, and kibana runs on port 5601 by default.
Test ik tokenizer
Open the Kibana console at localhost:5601
Click the menu in the upper-left corner, scroll to the bottom, and select Dev Tools
Here you can freely test es requests, such as creating indexes and running queries
Using the format below, we run the ik smart analyzer on a line of text containing both Chinese and English:
POST _analyze
{
"text": "我再也不想学JAVA语言了",
"analyzer": "ik_smart"
}
Add extended dictionary
The ik tokenizer's built-in dictionary cannot keep up with every internet hot word (or even every ordinary Chinese word), so in special cases we need to add an extension dictionary to help the ik tokenizer recognize new internet words correctly.
First open the ik tokenizer's extension settings file: <es root directory>/plugins/analysis-ik/config/IKAnalyzer.cfg.xml
Change it to the following
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
<comment>IK Analyzer extension configuration</comment>
<!-- configure your own extension dictionary here -->
<entry key="ext_dict">ext.dic</entry>
<!-- configure your own extension stopword dictionary here -->
<entry key="ext_stopwords">stopword.dic</entry>
<!-- configure a remote extension dictionary here -->
<!-- <entry key="remote_ext_dict">words_location</entry> -->
<!-- configure a remote extension stopword dictionary here -->
<!-- <entry key="remote_ext_stopwords">words_location</entry> -->
</properties>
Create a new file ext.dic in the same directory to store the extension words, one word per line.
For example, we can add the following extension words:
小黑子
煤油树枝
香精煎鱼
香菜凤仁鸡
梅素汁
Restart es7.8 and return to the Kibana console.
You can see that the ik tokenizer now recognizes these internet hot words and segments them correctly!
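To confirm that the extension dictionary took effect, re-run the analyzer on one of the new words using the same _analyze endpoint as before:

```json
POST _analyze
{
  "text": "小黑子",
  "analyzer": "ik_smart"
}
```

If the dictionary is loaded, the word comes back as a single token instead of being split character by character.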
Operating on indexes
To create a simple index, you only need to lightly adapt the following request:
PUT /heima
{
  "mappings": {
    "properties": {
      "info": {
        // mapping for the field "info"
        "type": "text",          // "text" fields are analyzed
        "analyzer": "ik_smart"   // use the Chinese analyzer "ik_smart"
      },
      "email": {
        // mapping for the field "email"
        "type": "keyword",  // "keyword" fields are not analyzed
        "index": false      // not indexed, so this field cannot be searched
      },
      "name": {
        // mapping for the field "name"
        "type": "object",  // a nested object
        "properties": {
          // properties of the nested object
          "firstname": {
            "type": "keyword"  // not analyzed
          },
          "lastname": {
            "type": "keyword"  // not analyzed
          }
        }
      }
    }
  }
}
The result after executing it in Dev Tools is:
{
"acknowledged": true,
"shards_acknowledged": true,
"index": "heima"
}
Indexing and document operations
Once created, an index and its mapping in es cannot be modified, but new fields can be added to it.
The following request adds a new field called age to the index heima:
PUT /heima/_mapping
{
"properties":{
"age":{
"type":"keyword"
}
}
}
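We can confirm that the new field was added by fetching the mapping back:

```json
GET /heima/_mapping
```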
Get an index: GET /<index name>
Delete an index: DELETE /<index name>
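The heading above also promises document operations; as a minimal sketch (the id and field values here are made-up examples), a document can be inserted, fetched, and deleted like this:

```json
// insert (or overwrite) the document with id 1
POST /heima/_doc/1
{
  "info": "Java teacher",
  "email": "test@example.com",
  "name": { "firstname": "三", "lastname": "张" }
}

// fetch it back
GET /heima/_doc/1

// delete it
DELETE /heima/_doc/1
```

Note that the email field was mapped with "index": false, so it is stored with the document but cannot be used as a search condition.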