Elasticsearch (ES) Search Engine: Text search: analyzer/tokenizer, synonyms/stop words, pinyin search, highlighting, spelling correction

Original link: https://xiets.blog.csdn.net/article/details/132349032

Copyright Statement: Reprinting of original articles is prohibited

Column directory: Elasticsearch column (general directory)

Text search mainly refers to full-text search, which is the core function of search engines. Unlike structured data, which is matched exactly, text data requires additional processing both when building the index and when searching.

Elasticsearch relies on the analyzer component when storing and searching text data. Lucene handles the physical construction and sorting of the index, while the analyzer performs word segmentation and lexical processing on the text before it is indexed. When searching text data, the search terms are likewise segmented and processed first, and the resulting sub-words are then used to perform multiple sub-searches.

Full-text search mainly targets fields of type text and uses match queries, and the analyzer is the core of full-text search.
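For reference, a minimal match query against a text field looks like the following (a sketch; the index my-index and field content are hypothetical names used only for illustration):

GET /my-index/_search
{
    "query": {
        "match": {
            "content": "hello world"    // the query string is analyzed first, then the resulting tokens are searched
        }
    }
}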

1. Analyzer

1.1 Filters and tokenizers

The analyzer is used to segment the text before establishing an index when storing or updating text data. When querying text-type fields, you also need to use an analyzer to segment the query words first.

ES has a variety of built-in analyzers. An analyzer consists of three parts:

  • Character filters (char_filter): zero or more; they perform coarse-grained processing of the original text, such as stripping HTML tags or escaping special characters.
  • Tokenizer (tokenizer): exactly one; it splits the text into words according to the specified rules.
  • Token filters (filter): zero or more; they receive the tokenizer's output and normalize or enrich it, for example lowercasing, adding synonyms, removing stop words, or adding pinyin.

When the analyzer processes text data, the data flows through the three parts as follows:

            Text
             |
            \|/
     Character filter 1
     Character filter 2
             |
            \|/
         Tokenizer
             |
            \|/
      Token filter 1
      Token filter 2
             |
            \|/
    Token 1, Token 2, Token 3, ...

1.2 Built-in analyzer

ES comes with various built-in analyzers, see Built-in analyzer reference .

The built-in analyzers can be used directly in any index without further configuration. Here are some of the built-in analyzers:

  1. Standard analyzer (standard): divides text into terms on word boundaries, as defined by the Unicode text segmentation algorithm. It removes most punctuation, lowercases terms, and supports removing stop words.
  2. Simple analyzer (simple): breaks text into terms whenever it encounters a non-letter character (e.g. a space or punctuation mark), and lowercases all terms.
  3. Whitespace analyzer (whitespace): breaks text into terms on any whitespace character; it does not lowercase terms.
  4. Stop analyzer (stop): similar to simple, but also supports removing stop words.
  5. Keyword analyzer (keyword): a "noop" analyzer that outputs the entire input text as a single term.
  6. Pattern analyzer (pattern): splits text into terms using a regular expression; it supports lowercasing and stop words.
  7. Language analyzers (english, french, ...): a number of language-specific analyzers are provided.
  8. Fingerprint analyzer (fingerprint): a specialized analyzer that creates a fingerprint which can be used for duplicate detection.

If no analyzer is specified, text fields use the standard analyzer (standard) by default for both indexing and querying.

The standard analyzer (standard) is composed of the Standard Tokenizer, the Lower Case Token Filter, and the Stop Token Filter; the stop token filter is disabled by default.
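This composition can be checked with the _analyze API by specifying the same components manually (a sketch; the output should match the standard analyzer example shown below):

GET /_analyze
{
    "tokenizer": "standard",        // the tokenizer used by the standard analyzer
    "filter": ["lowercase"],        // its lower case token filter (the stop filter is disabled by default)
    "text": "ES, Hello World!"
}

The returned tokens (es, hello, world) should be the same as those produced by the standard analyzer.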

1.3 Test Analyzer

Analyzer API: Analyze API , Test an analyzer

ES provides an API for testing how an analyzer analyzes text. The related requests are:

  • GET /_analyze
  • POST /_analyze
  • GET /<index>/_analyze
  • POST /<index>/_analyze

Test analyzer request format:

POST /_analyze
{
    "analyzer": "standard",             // 指定分析器名称
    "text": "ES, Hello World!"          // 需要分析的文本
}

POST /_analyze
{
    "analyzer": "standard",
    "text": ["test es", "the test"]     // 分析文本也可是传递一个数组, 同时分析多个文本
}

POST /<index>/_analyze                  // 使用索引中自定义的分析器
{
    "analyzer": "my_analyzer",          // 创建索引时在 settings 中自定义的分析器
    "text": "ES, Hello World!"          // 需要分析的文本
}

POST /<index>/_analyze                  // 基于现有索引的 text 类型字段分析
{
    "field": "<field_name>",            // 使用给定字段使用的分析器, 如果字段不存在则使用默认分析器
    "text": "ES, Hello World!"          // 需要分析的文本
}

GET /_analyze                           // 也可以自定义分析器 (手动指定分词器和过滤器)
{
    "char_filter" : ["html_strip"],     // 字符过滤器
    "tokenizer" : "keyword",            // 分词器
    "filter" : ["lowercase"],           // 分词过滤器
    "text" : "this is a <b>test</b>"    // 需要分析的文本
}

Test analyzer request:

POST /_analyze
{
    
    
    "analyzer": "standard",
    "text": "ES, Hello World!"
}

// 返回
{
    
    
    "tokens": [                     // 每一个分词表示一个 token
        {
    
    
            "token": "es",          // 分词
            "start_offset": 0,      // 分词在原文本中的开始位置
            "end_offset": 2,        // 分词在原文本中的结束位置(不包括)
            "type": "<ALPHANUM>",   // 词类型
            "position": 0           // 该分词是所有分词中的第几个分词
        },
        {
    
    
            "token": "hello",
            "start_offset": 4,
            "end_offset": 9,
            "type": "<ALPHANUM>",
            "position": 1
        },
        {
    
    
            "token": "world",
            "start_offset": 10,
            "end_offset": 15,
            "type": "<ALPHANUM>",
            "position": 2
        }
    ]
}

Use the default analyzer to analyze Chinese:

POST /_analyze
{
    
    
    "analyzer": "standard",
    "text": "我是中国人"
}

// 返回
{
    
    
    "tokens": [
        {
    
    
            "token": "我",
            "start_offset": 0,
            "end_offset": 1,
            "type": "<IDEOGRAPHIC>",
            "position": 0
        },
        {
    
    
            "token": "是",
            "start_offset": 1,
            "end_offset": 2,
            "type": "<IDEOGRAPHIC>",
            "position": 1
        },
        {
    
    
            "token": "中",
            "start_offset": 2,
            "end_offset": 3,
            "type": "<IDEOGRAPHIC>",
            "position": 2
        },
        {
    
    
            "token": "国",
            "start_offset": 3,
            "end_offset": 4,
            "type": "<IDEOGRAPHIC>",
            "position": 3
        },
        {
    
    
            "token": "人",
            "start_offset": 4,
            "end_offset": 5,
            "type": "<IDEOGRAPHIC>",
            "position": 4
        }
    ]
}

When the standard analyzer (standard) processes Chinese text, it simply splits each Chinese character into its own token. Chinese sentences are made up of multi-character words, so the standard analyzer is not suitable for analyzing Chinese.

1.4 Specify the analyzer when creating the index

Text fields in an index mapping need an analyzer. If none is specified, the standard analyzer (standard) is used by default. You can also specify the analyzer when creating the index.

Specify the default analyzer when creating the index, request format:

PUT /<index>                            // 创建索引
{
    "settings": {                       // 索引设置
        "analysis": {                   // 分析器的所有设置
            "analyzer": {               // 分析器设置
                "default": {            // 当前索引默认的分析器, text字段如果没有指定分析器则用这个
                    "type": "simple"    // "simple" 表示分析器名称
                }
            }
        }
    },
    "mappings": {                       // 映射
        "properties": {                 // 映射的属性
            // ... 字段列表
        }
    }
}

When creating an index, specify an analyzer for a specific text type field. The request format is:

PUT /<index>                                    // 创建索引
{
    "mappings": {                               // 映射
        "properties": {                         // 映射的属性
            "<field>": {                        // 字段
                "type": "text",                 // 字段类型
                "analyzer": "standard",         // 当前字段建立索引(写入文档)时使用的分析器
                "search_analyzer": "simple"     // 搜索当前字段时分析搜索词使用的分析器, 如果没有设置则默认与建立索引使用的分析器相同
            }
        }
    }
}

1.5 Custom Analyzer

An analyzer is a combination of character filters, a tokenizer, and token filters. ES's built-in analyzers are pre-made combinations. A custom analyzer here means a new analyzer assembled from ES's existing tokenizers and filters.

Use a custom analyzer when creating an index, request format:

PUT /<index>
{
    "settings": {
        "analysis": {
            "tokenizer": {                      // 分词器
                "<tokenizer_name>": {           // 自定义分词器, 节点名称就是自定义的分词器名称
                    "type": "pattern",          // 使用现有分词器, "pattern"表示正则分词器
                    "pattern": "<regex>"        // 用于切分词语的正则表达式, "pattern"分词器的参数
                }
            },
            "char_filter": {                    // 字符过滤器
                "<char_filter_name>": {         // 自定义字符过滤器, 节点名称就是自定义的字符过滤器名称
                    "type": "html_strip"        // 使用现有的字符过滤器, "html_strip"表示用于过滤HTML标签字符的过滤器
                }
            },
            "filter": {                         // 词语过滤器
                "<filter_name>": {              // 自定义词语过滤器, 节点名称就是自定义的词语过滤器名称
                    "type": "stop",             // 使用现有的词语过滤器, "stop"表示停用词过滤器
                    "ignore_case": true,        // 忽略大小写, "stop"词语过滤器的参数
                    "stopwords": ["the", "is"]  // 停用词, "stop"词语过滤器的参数
                }
            },
            "analyzer": {                       // 分析器 (分词器 + 字符过滤器 + 词语过滤器)
                "<analyzer_name>": {            // 自定义分析器, 节点名称就是自定义的分析器名称
                    "tokenizer": "<tokenizer_name>",        // 指定使用的分词器, 必须是现有的或者前面自定义的
                    "char_filter": ["<char_filter_name>"],  // 指定使用的字符过滤器, 必须是现有的或者前面自定义的
                    "filter": ["<filter_name>"]             // 指定使用的词语过滤器, 必须是现有的或者前面自定义的
                }
            }
        }
    },
    "mappings": {
        "properties": {
            "<field>": {
                "type": "text",
                "analyzer": "<analyzer_name>"   // 指定字段使用的分析器, 必须是现有的或者前面自定义的
            }
        }
    }
}

Custom analyzer example:

PUT /demo-index
{
    
    
    "settings": {
    
    
        "analysis": {
    
    
            "tokenizer": {
    
    
                "my_tokenizer": {
    
    
                    "type": "pattern",      // 正则分词器
                    "pattern": ","          // 用于切分词语的正则表达式 (用","分隔)
                }
            },
            "char_filter": {
    
    
                "my_char_filter": {
    
    
                    "type": "html_strip"
                }
            },
            "filter": {
    
    
                "my_filter": {
    
    
                    "type": "stop",         // 停用词词语过滤器
                    "ignore_case": true,
                    "stopwords": ["the", "is"]
                }
            },
            "analyzer": {
    
    
                "my_analyzer": {
    
    
                    "tokenizer": "my_tokenizer",
                    "char_filter": ["my_char_filter"],
                    "filter": ["my_filter"]
                }
            }
        }
    },
    "mappings": {
    
    
        "properties": {
    
    
            "title": {
    
    
                "type": "text",
                "analyzer": "my_analyzer"   // 使用 settings 中自定义的分析器
            }
        }
    }
}

// 测试索引字段中使用的自定义分析器
POST /demo-index/_analyze
{
    
    
    "field": "title",
    "text": "The,Happy,中国人"
}
// 返回
{
    
    
    "tokens": [
        {
    
    
            "token": "Happy",
            "start_offset": 4,
            "end_offset": 9,
            "type": "word",
            "position": 1
        },
        {
    
    
            "token": "中国人",
            "start_offset": 10,
            "end_offset": 13,
            "type": "word",
            "position": 2
        }
    ]
}

2. Chinese analyzer

For English, word segmentation is easy: words are separated by spaces and punctuation. For Chinese word segmentation, however, ES's built-in analyzers are not adequate, and additional word segmentation plug-ins are required. ES supports third-party analyzers through plug-ins; the most commonly used third-party Chinese analyzers are IK and HanLP.

The IK analyzer uses a dictionary-based word segmentation algorithm. Following a chosen strategy, it matches the text to be analyzed against a prepared dictionary, and whenever a span of text matches a dictionary entry, that entry is emitted as a token. Dictionary-based segmentation is the simplest and fastest class of word segmentation algorithms, and it comes in three matching variants: forward maximum matching, reverse maximum matching, and bidirectional maximum matching.

The HanLP analyzer is based on statistical machine learning. It first requires a corpus annotated with word segmentation. When analyzing text, it uses word frequencies counted from the corpus to decide, statistically, how the text should be segmented in a given context.

2.1 IK analyzer

The IK analyzer is an open-source, lightweight Chinese word segmentation toolkit developed in Java. It provides an ES plug-in (IK Analysis for Elasticsearch) and supports both cold and hot updates of its dictionary.

IK analyzer ES plug-in: https://github.com/medcl/elasticsearch-analysis-ik

2.1.1 Install IK plug-in

Install the analyzer plug-in. There are two installation methods.

Installation method one: Manual download and installation

  • Download the plug-in version corresponding to the ES version from the plug-in download address: elasticsearch-analysis-ik-X.X.X.zip;
  • Create an analysis-ik folder in the ES plugins directory: cd <es-root>/plugins/ && mkdir analysis-ik;
  • Unzip the plug-in into the plugins/analysis-ik folder: unzip elasticsearch-analysis-ik-X.X.X.zip -d <es-root>/plugins/analysis-ik.

The official name of the plug-in is defined by the name=analysis-ik field in the plug-in descriptor file (plugins/analysis-ik/plugin-descriptor.properties), but the name of the plug-in folder under the plugins directory can be customized.

Installation method two: Install using ES plug-in command

./bin/elasticsearch-plugin install https://github.com/medcl/elasticsearch-analysis-ik/releases/download/vX.X.X/elasticsearch-analysis-ik-X.X.X.zip

The plug-in install command also supports file:// URLs, meaning you can download the plug-in zip package locally first and then install it from the local file.

When the plug-in is installed with the plug-in command, a folder with the same name as the plug-in (analysis-ik) is created in the ES configuration directory (<es-root>/config/), and the plug-in's configuration files are installed into that directory (<es-root>/config/analysis-ik).

To view installed third-party plug-ins, send the request GET /_cat/plugins?v or run the command ./bin/elasticsearch-plugin list.

After installing the plug-in, restart ES. If the console prints the following log line, the IK analyzer plug-in was installed successfully:

[2023-08-09T20:00:00,000][INFO] ... loaded plugin [analysis-ik]

Load the configuration file of the IK plug-in:

When ES starts, it loads the IK plug-in and the IK dictionary configuration. The dictionary configuration file is named IKAnalyzer.cfg.xml. It is loaded first from <es-root>/config/analysis-ik/IKAnalyzer.cfg.xml; if that file cannot be found, it is loaded from the plug-in directory at <es-root>/plugins/analysis-ik/config/IKAnalyzer.cfg.xml.

2.1.2 Using the IK Analyzer

The IK plug-in provides two analyzers, ik_smart and ik_max_word, and two tokenizers with the same names.

The difference between ik_smart and ik_max_word:

  • ik_smart splits the text at the coarsest granularity and is suitable for phrase queries.
  • ik_max_word splits the text at the finest granularity, exhausting all possible word combinations; it is suitable for term queries.

Test text analysis with the IK analyzer:

GET /_analyze
{
    
    
    "analyzer": "ik_smart",         // 最粗粒度拆分
    "text": "我是中国人"
}
// 返回
{
    
    
    "tokens": [
        {
    
    
            "token": "我",
            "start_offset": 0,
            "end_offset": 1,
            "type": "CN_CHAR",
            "position": 0
        },
        {
    
    
            "token": "是",
            "start_offset": 1,
            "end_offset": 2,
            "type": "CN_CHAR",
            "position": 1
        },
        {
    
    
            "token": "中国人",
            "start_offset": 2,
            "end_offset": 5,
            "type": "CN_WORD",
            "position": 2
        }
    ]
}

GET /_analyze
{
    
    
    "analyzer": "ik_max_word",      // 最细粒度拆分
    "text": "我是中国人"
}
// 返回
{
    
    
    "tokens": [
        {
    
    
            "token": "我",
            "start_offset": 0,
            "end_offset": 1,
            "type": "CN_CHAR",
            "position": 0
        },
        {
    
    
            "token": "是",
            "start_offset": 1,
            "end_offset": 2,
            "type": "CN_CHAR",
            "position": 1
        },
        {
    
    
            "token": "中国人",
            "start_offset": 2,
            "end_offset": 5,
            "type": "CN_WORD",
            "position": 2
        },
        {
    
    
            "token": "中国",
            "start_offset": 2,
            "end_offset": 4,
            "type": "CN_WORD",
            "position": 3
        },
        {
    
    
            "token": "国人",
            "start_offset": 3,
            "end_offset": 5,
            "type": "CN_WORD",
            "position": 4
        }
    ]
}

2.1.3 IK extended dictionary

The IK tokenizer is dictionary-based; words that are not in the dictionary cannot be split out as tokens:

GET /_analyze
{
    
    
    "analyzer": "ik_max_word",
    "text": "雄安新区"
}
// 返回
{
    
    
    "tokens": [
        {
    
    
            "token": "雄",
            "start_offset": 0,
            "end_offset": 1,
            "type": "CN_CHAR",
            "position": 0
        },
        {
    
    
            "token": "安新",
            "start_offset": 1,
            "end_offset": 3,
            "type": "CN_WORD",
            "position": 1
        },
        {
    
    
            "token": "新区",
            "start_offset": 2,
            "end_offset": 4,
            "type": "CN_WORD",
            "position": 2
        }
    ]
}

Splitting "Xiong'an New Area" did not split the word "Xiong'an" because this word is not in the default dictionary, and the dictionary can be expanded manually.

The IK dictionary configuration file is located at <es-root>/config/analysis-ik/IKAnalyzer.cfg.xml or <es-root>/plugins/analysis-ik/config/IKAnalyzer.cfg.xml.

Contents of the IKAnalyzer.cfg.xml configuration file:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
    <comment>IK Analyzer 扩展配置</comment>
    <!-- 用户可以在这里配置自己的扩展字典 -->
    <entry key="ext_dict"></entry>
     <!-- 用户可以在这里配置自己的扩展停止词字典 -->
    <entry key="ext_stopwords"></entry>
    <!-- 用户可以在这里配置远程扩展字典 -->
    <!-- <entry key="remote_ext_dict">words_location</entry> -->
    <!-- 用户可以在这里配置远程扩展停止词字典 -->
    <!-- <entry key="remote_ext_stopwords">words_location</entry> -->
</properties>

The dictionary is a plain text file with one word per line. In the directory where the IKAnalyzer.cfg.xml configuration file is located, create a file named my.dic with the following content:

雄安
雄安新区

Modify the IKAnalyzer.cfg.xml configuration:

<!-- 用户可以在这里配置自己的扩展字典 -->
<!-- 配置本地字典文件的路径, 路径必须是相对于配置文件所在目录的相对路径, 多个路径使用 ; 分隔, 文件必须是 UTF-8 编码 -->
<entry key="ext_dict">my.dic</entry>

Save the configuration and restart ES. If the console prints the following log, it means the extended dictionary is loaded successfully:

[2023-08-09T20:30:00,000][INFO] ... [Dict Loading] .../plugins/analysis-ik/config/my.dic

Test the word segmentation again:

GET /_analyze
{
    
    
    "analyzer": "ik_max_word",
    "text": "雄安新区"
}

// 返回
{
    
    
    "tokens": [
        {
    
    
            "token": "雄安新区",
            "start_offset": 0,
            "end_offset": 4,
            "type": "CN_WORD",
            "position": 0
        },
        {
    
    
            "token": "雄安",
            "start_offset": 0,
            "end_offset": 2,
            "type": "CN_WORD",
            "position": 1
        },
        {
    
    
            "token": "安新",
            "start_offset": 1,
            "end_offset": 3,
            "type": "CN_WORD",
            "position": 2
        },
        {
    
    
            "token": "新区",
            "start_offset": 2,
            "end_offset": 4,
            "type": "CN_WORD",
            "position": 3
        }
    ]
}

2.1.4 IK hot update expanded dictionary

The IK plug-in supports hot updates of the dictionary (no ES restart required). The hot update configuration in the IKAnalyzer.cfg.xml configuration file:

<!-- 用户可以在这里配置远程扩展字典 -->
<entry key="remote_ext_dict">words_location</entry>
<!-- 用户可以在这里配置远程扩展停止词字典 -->
<entry key="remote_ext_stopwords">words_location</entry>

Here words_location is one or more URLs, for example http://mysite.com/my.dic. Multiple URLs can be configured, separated by ;. The dictionary URL must satisfy the following two requirements:

  1. The HTTP response returns plain text with one word per line.
  2. The HTTP server must support conditional (cache) requests (essentially all HTTP servers do), i.e. the response must include a Last-Modified or ETag header.

If a remote dictionary is configured, the IK plug-in sends a GET request to load it the first time. After that, it sends a HEAD request every minute carrying the If-Modified-Since or If-None-Match header. If the HTTP server returns 304 Not Modified, nothing is done and the plug-in waits for the next check. If 200 OK is returned, a GET request is sent immediately to fetch the dictionary content and update it.

When loading the remote dictionary, the console will print the following log:

[2023-08-09T20:35:00.000][INFO] ... [Dict Loading] http://mysite.com/my.dic

Note: a hot update of the remote dictionary only updates the analyzer's dictionary. Fields of documents that were indexed earlier are not re-indexed; only newly written documents, and the query-time analysis of search terms, use the updated dictionary. If existing documents need to be segmented with the updated dictionary, they must be updated again so the index is rebuilt.
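One way to rebuild existing documents in place is the _update_by_query API; a minimal sketch, assuming the index is named demo-index (note that this can be expensive on large indices):

POST /demo-index/_update_by_query?conflicts=proceed
{
    "query": { "match_all": {} }    // re-index every document so it is analyzed with the updated dictionary
}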

2.2 HanLP analyzer

HanLP is a multilingual natural language processing toolkit for production environments. HanLP supports multiple language processing functions such as Chinese word segmentation, part-of-speech tagging, and syntactic analysis. HanLP also has an Elasticsearch analyzer plugin.

HanLP related links:

2.2.1 Install HanLP plug-in

Install HanLP ES plug-in:

# 需使用兼容 ES 版本的版本
./bin/elasticsearch-plugin install https://github.com/KennFalcon/elasticsearch-analysis-hanlp/releases/download/vX.X.X/elasticsearch-analysis-hanlp-X.X.X.zip

After successful installation, restart ES and the console will output the following log:

[2023-08-10T20:30:00,000][INFO] ... loaded plugin [analysis-hanlp]

2.2.2 Using HanLP analyzer

Analyzers/tokenizers provided by the HanLP ES plug-in:

  • hanlp: HanLP default segmentation
  • hanlp_standard: standard segmentation
  • hanlp_index: index segmentation
  • hanlp_nlp: NLP segmentation
  • hanlp_crf: CRF segmentation
  • hanlp_n_short: N-shortest-path segmentation
  • hanlp_dijkstra: shortest-path (Dijkstra) segmentation
  • hanlp_speed: extreme-speed dictionary segmentation

Use the HanLP analyzer to analyze text:

POST /_analyze
{
    
    
    "analyzer": "hanlp_standard",
    "text": "雄安新区"
}

// 返回
{
    
    
    "tokens": [
        {
    
    
            "token": "雄",
            "start_offset": 0,
            "end_offset": 1,
            "type": "ag",
            "position": 0
        },
        {
    
    
            "token": "安",
            "start_offset": 1,
            "end_offset": 2,
            "type": "ag",
            "position": 1
        },
        {
    
    
            "token": "新区",
            "start_offset": 2,
            "end_offset": 4,
            "type": "n",
            "position": 2
        }
    ]
}

The segmentation result did not produce "雄安" (Xiong'an); it needs to be added to an extended dictionary.

2.2.3 HanLP extended dictionary

Create a file named mydict.txt in the <es_root>/plugins/analysis-hanlp/data/dictionary/custom/ directory and enter the following content:

雄安

Each line of the dictionary file is one word.

Configuration file location for the HanLP plugin: <es_root>/config/analysis-hanlp/hanlp.propertiesor <es_root>/plugins/analysis-hanlp/config/hanlp.properties.

Modify the value of the CustomDictionaryPath field in the hanlp.properties configuration file and append the extended dictionary file path to the end:

# 路径默认相对于 analysis-hanlp 插件根目录, 多个路径使用 ; 分隔
CustomDictionaryPath=...;data/dictionary/custom/mydict.txt;

Restart ES and test the analyzer again:

POST /_analyze
{
    
    
    "analyzer": "hanlp_standard",
    "text": "雄安新区"
}

// 返回
{
    
    
    "tokens": [
        {
    
    
            "token": "雄安",
            "start_offset": 0,
            "end_offset": 2,
            "type": "n",
            "position": 0
        },
        {
    
    
            "token": "新区",
            "start_offset": 2,
            "end_offset": 4,
            "type": "n",
            "position": 1
        }
    ]
}

2.2.4 HanLP hot update extended dictionary

The HanLP plug-in's hot update (remote) dictionary configuration file is located at <es_root>/config/analysis-hanlp/hanlp-remote.xml or <es_root>/plugins/analysis-hanlp/config/hanlp-remote.xml.

Contents of the hanlp-remote.xml configuration file:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
    <comment>HanLP Analyzer 扩展配置</comment>
    <!-- 用户可以在这里配置远程扩展字典 -->
    <!-- <entry key="remote_ext_dict">words_location</entry>-->
    <!-- 用户可以在这里配置远程扩展停止词字典 -->
    <!-- <entry key="remote_ext_stopwords">stop_words_location</entry> -->
</properties>

Here words_location is either a URL, or a URL followed by a part-of-speech tag (词性), for example:

  • http://mysite.com/mydic.txt
  • http://mysite.com/mydic.txt nt

For details, please refer to: HanLP remote dictionary configuration .

The loading rules for the HanLP remote extended dictionary are the same as for the IK remote extended dictionary: the URL is polled every minute to check for updates, and the remote dictionary is reloaded whenever it has changed.

3. Use synonyms

In search engines, synonyms are used to handle different query terms that refer to the same target. For example, a user searching for "安卓手机" and one searching for "Android手机" want the same thing. Likewise, in e-commerce search, "凤梨" and "菠萝" (pineapple), "番茄" and "西红柿" (tomato), "电饭煲" and "电饭锅" (rice cooker), or "李宁" and "LI-NING" refer to the same products.

ES implements synonym search through token filters in the analyzer. Synonyms can be applied either at index time, so that synonyms are indexed along with the original tokens, or at search time, so that the search terms are expanded with their synonyms.

ES's built-in token filters synonym and synonym_graph provide synonym support.

3.1 Use synonyms when creating indexes

To apply synonyms at index time, ES provides the built-in synonym token filter. The following creates an index using the IK tokenizer together with the synonym token filter:

PUT /menu
{
    "settings": {                               // 索引设置
        "analysis": {
            "filter": {                         // 过滤器定义
                "my_filter": {                  // 自定义分词过滤器, 名称为 "my_filter"
                    "type": "synonym",          // 使用内置的 同义词过滤器
                    "synonyms": [               // 同义词列表, 同义词之间用 , 分隔
                        "凤梨,菠萝",
                        "番茄,西红柿"
                    ]
                }
            },
            "analyzer": {                       // 分析器定义
                "my_analyzer": {                // 自定义分析器, 名称为 "my_analyzer"
                    "tokenizer": "ik_max_word", // 分词器, 使用安装的第三方分词器 "ik_max_word"
                    "filter": [                 // 分词过滤器
                        "lowercase",            // 自带的分词过滤器
                        "my_filter"             // 前面自定义的分词过滤器
                    ]
                }
            }
        }
    },
    "mappings": {                               // 映射
        "properties": {
            "name": {                           // text类型的字段
                "type": "text",
                "analyzer": "my_analyzer"       // 使用上面自定义的分析器
            }
        }
    }
}

Test a custom analyzer in the index:

GET /menu/_analyze
{
    
    
    "field": "name",
    "text": "西红柿炒蛋"
}

// 返回, 西红柿除了拆分出“西红柿”外, 还拆分出了它的同义词“番茄”
{
    
    
    "tokens": [
        {
    
    
            "token": "西红柿",
            "start_offset": 0,
            "end_offset": 3,
            "type": "CN_WORD",
            "position": 0
        },
        {
    
    
            "token": "番茄",
            "start_offset": 0,
            "end_offset": 3,
            "type": "SYNONYM",
            "position": 0
        },
        {
    
    
            "token": "炒蛋",
            "start_offset": 3,
            "end_offset": 5,
            "type": "CN_WORD",
            "position": 1
        }
    ]
}

Write some documents:

PUT /menu/_doc/001
{ "name": "番茄炒蛋" }

PUT /menu/_doc/002
{ "name": "凤梨炒西红柿" }

PUT /menu/_doc/003
{ "name": "菠萝猪扒包" }

Search documents:

GET /menu/_search
{
    
    
    "query": {
    
    
        "match": {
    
    
          "name": "番茄"
        }
    }
}

// 返回
{
    
    
    "took": 436,
    "timed_out": false,
    "_shards": {
    
    
        "total": 1,
        "successful": 1,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
    
    
        "total": {
    
    
            "value": 2,
            "relation": "eq"
        },
        "max_score": 0.7520058,
        "hits": [
            {
    
    
                "_index": "menu",
                "_id": "001",
                "_score": 0.7520058,
                "_source": {
    
    
                    "name": "番茄炒蛋"
                }
            },
            {
    
    
                "_index": "menu",
                "_id": "002",
                "_score": 0.6951314,
                "_source": {
    
    
                    "name": "凤梨炒西红柿"
                }
            }
        ]
    }
}

As the results show, searching for "番茄" (tomato) matched not only "番茄炒蛋" but also "凤梨炒西红柿", which contains the synonym "西红柿".
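Because the synonym group "凤梨,菠萝" is also indexed, searching for either term should match documents containing the other. A quick check (a sketch; the scores will differ from the example above):

GET /menu/_search
{
    "query": {
        "match": { "name": "凤梨" }     // expected to match "凤梨炒西红柿" and the synonym document "菠萝猪扒包"
    }
}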

3.2 Use synonyms when searching

Synonym matching can be achieved either when documents are indexed or when searching, so there is no need to apply synonyms in both places. By not expanding synonyms at index time and instead applying them at search time, you store fewer terms in the index, and when the synonym list is updated later the documents do not need to be re-indexed.

Creating an index that applies the synonym filter at search time:

PUT /menu
{
    "settings": {                               // 索引设置
        "analysis": {
            "filter": {                         // 过滤器定义
                "my_filter": {                  // 自定义分词过滤器, 名称为 "my_filter"
                    "type": "synonym",          // 使用内置的 同义词过滤器
                    "synonyms": [               // 同义词列表, 同义词之间用 , 分隔
                        "凤梨,菠萝",
                        "番茄,西红柿"
                    ]
                }
            },
            "analyzer": {                       // 分析器定义
                "my_analyzer": {                // 自定义分析器, 名称为 "my_analyzer"
                    "tokenizer": "ik_max_word", // 分词器, 使用安装的第三方分词器 "ik_max_word"
                    "filter": [                 // 分词过滤器
                        "lowercase",            // 自带的分词过滤器
                        "my_filter"             // 前面自定义的分词过滤器
                    ]
                }
            }
        }
    },
    "mappings": {                               // 映射
        "properties": {
            "name": {                           // text类型的字段
                "type": "text",
                "analyzer": "ik_max_word",          // 文档建立索引时使用 "ik_max_word" 分析器
                "search_analyzer": "my_analyzer"    // 搜索时使用上面自定义的分析器
            }
        }
    }
}
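With this mapping, documents are indexed with ik_max_word only and the synonyms are applied to the query string. As a sketch (assuming documents like those in section 3.1 have been written), searching "番茄" should still match documents that contain only "西红柿":

GET /menu/_search
{
    "query": {
        "match": { "name": "番茄" }     // expanded to its synonym "西红柿" by the search analyzer
    }
}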

3.3 Using a synonym dictionary

If there are many synonyms, listing them in settings becomes cumbersome and makes the synonym dictionary hard to update. ES's built-in synonym filters support reading synonyms from a local text file. The file must be under the <es_root>/config/ directory and must be present on every cluster node.

Create a mydict folder in the <es_root>/config/ directory, and create a text file synonyms.txt in it to hold the synonyms, one synonym group per line, with the synonyms in a group separated by commas, as shown below:

<es_root>$ cat config/mydict/synonyms.txt
凤梨,菠萝
番茄,西红柿

Create an index with a custom analyzer that uses ES's built-in synonym_graph filter and reads synonyms from the local dictionary file:

PUT /menu
{
    "settings": {
        "analysis": {
            "filter": {
                "my_filter": {                              // 自定义分词过滤器
                    "type": "synonym_graph",                // 使用内置的 同义词过滤器
                    "synonyms_path": "mydict/synonyms.txt"  // 指定同义词词典, 文件路径相对于 `<es_root>/config/` 目录
                }
            },
            "analyzer": {
                "my_analyzer": {                // 自定义分析器
                    "tokenizer": "ik_max_word",
                    "filter": [                 // 分词过滤器
                        "lowercase",
                        "my_filter"             // 前面自定义的分词过滤器
                    ]
                }
            }
        }
    },
    "mappings": {                               // 映射
        "properties": {
            "name": {                           // text类型的字段
                "type": "text",
                "analyzer": "ik_max_word",
                "search_analyzer": "my_analyzer"    // 使用上面自定义的分析器
            }
        }
    }
}
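To confirm that the dictionary file is being read, the search analyzer can be tested directly (a sketch; the output depends on the contents of synonyms.txt):

GET /menu/_analyze
{
    "analyzer": "my_analyzer",
    "text": "番茄"
}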

4. Use stop words

Stop words are tokens that carry no search value after text segmentation. For example, after segmenting the text "我喜欢这里的风景" ("I like the scenery here"), the tokens "我", "这里", and "的" occur very frequently and contribute nothing distinctive to retrieval. Such stop words can therefore be ignored when building the index and when searching, which improves indexing and search efficiency and saves storage space.

Lists of commonly used English and Chinese stop words can be found online.

Special characters such as HTML tags are removed with character filters, while stop words are removed with stop word token filters.
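As a quick illustration, stop word removal can be tried with an ad-hoc analyzer in the _analyze API (a sketch using the built-in stop filter, whose default stop word list is English):

GET /_analyze
{
    "tokenizer": "standard",
    "filter": ["lowercase", "stop"],    // the built-in stop filter removes English stop words such as "the" and "is"
    "text": "The view is beautiful"
}

Only the tokens "view" and "beautiful" should remain.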

4.1 Stop word filter

ES's built-in stop token filter can be used to filter out stop words.

Create an index with a custom token filter that removes stop words:

PUT /demo
{
    "settings": {
        "analysis": {
            "filter": {
                "my_stop_filter": {                     // 自定义分词过滤器
                    "type": "stop",                     // 使用内置的停用词过滤器
                    "stopwords": ["我", "这里", "的"]    // 停用词列表
                }
            },
            "analyzer": {
                "my_analyzer": {                // 自定义分析器
                    "tokenizer": "ik_max_word",
                    "filter": ["my_stop_filter"]
                }
            }
        }
    },
    "mappings": {
        "properties": {
            "title": {
                "type": "text",
                "analyzer": "my_analyzer"   // 使用自定义的分析器
            }
        }
    }
}

Test the analyzer for an index mapped field:

POST /demo/_analyze
{
    
    
    "field": "title",
    "text": "我喜欢这里的风景"
}

// 返回, 从返回中可以看出, “我”、“这里”、“的” 被过滤掉了
{
    
    
    "tokens": [
        {
    
    
            "token": "喜欢",
            "start_offset": 1,
            "end_offset": 3,
            "type": "CN_WORD",
            "position": 1
        },
        {
    
    
            "token": "风景",
            "start_offset": 6,
            "end_offset": 8,
            "type": "CN_WORD",
            "position": 4
        }
    ]
}

Test the analyzer without the stop word filter:

POST /_analyze
{
    
    
    "analyzer": "ik_max_word",
    "text": "我喜欢这里的风景"
}

// 返回, 没有使用停止词过滤器 “我”、“这里”、“的” 没有被过滤掉
{
    
    
    "tokens": [
        {
    
    
            "token": "我",
            "start_offset": 0,
            "end_offset": 1,
            "type": "CN_CHAR",
            "position": 0
        },
        {
    
    
            "token": "喜欢",
            "start_offset": 1,
            "end_offset": 3,
            "type": "CN_WORD",
            "position": 1
        },
        {
    
    
            "token": "这里",
            "start_offset": 3,
            "end_offset": 5,
            "type": "CN_WORD",
            "position": 2
        },
        {
    
    
            "token": "的",
            "start_offset": 5,
            "end_offset": 6,
            "type": "CN_CHAR",
            "position": 3
        },
        {
    
    
            "token": "风景",
            "start_offset": 6,
            "end_offset": 8,
            "type": "CN_WORD",
            "position": 4
        }
    ]
}

4.2 Built-in analyzer uses stop words

Many built-in analyzers come with stop word filters, which can be used as long as the relevant parameters are set.

Below, a custom analyzer is derived from the built-in standard analyzer by setting its stopwords parameter, and an index is created with it:

PUT /demo
{
    
    
    "settings": {
    
    
        "analysis": {
    
    
            "analyzer": {
    
    
                "my_standard": {
    
    
                    "type": "standard",
                    "stopwords": ["the", "a"]
                }
            }
        }
    },
    "mappings": {
    
    
        "properties": {
    
    
            "title": {
    
    
                "type": "text",
                "analyzer": "my_standard"
            }
        }
    }
}

Test the analyzer for indexed fields:

POST /demo/_analyze
{
    
    
    "field": "title",
    "text": "the a beautiful view."
}

// 返回
{
    
    
    "tokens": [
        {
    
    
            "token": "beautiful",
            "start_offset": 6,
            "end_offset": 15,
            "type": "<ALPHANUM>",
            "position": 2
        },
        {
    
    
            "token": "view",
            "start_offset": 16,
            "end_offset": 20,
            "type": "<ALPHANUM>",
            "position": 3
        }
    ]
}

4.3 IK analyzer uses stop words

The IK analyzer uses only English stop words by default and no Chinese stop words. An extended stop word dictionary and a remote extended stop word dictionary can be configured in the IK plug-in's dictionary configuration file IKAnalyzer.cfg.xml.

The IKAnalyzer.cfg.xml configuration file:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
    <comment>IK Analyzer 扩展配置</comment>
    <!-- 用户可以在这里配置自己的扩展字典 -->
    <entry key="ext_dict"></entry>
     <!-- 用户可以在这里配置自己的扩展停止词字典 -->
    <entry key="ext_stopwords"></entry>
    <!-- 用户可以在这里配置远程扩展字典 -->
    <!-- <entry key="remote_ext_dict">words_location</entry> -->
    <!-- 用户可以在这里配置远程扩展停止词字典 -->
    <!-- <entry key="remote_ext_stopwords">words_location</entry> -->
</properties>

Configuring the stop word dictionary in IKAnalyzer.cfg.xml works the same way as configuring the extended dictionary. Create a my_stopwords.dic file in the directory where the IK dictionary configuration file is located and write some content:

$ cat my_stopwords.dic
我
这里
的

Modify the IK dictionary configuration file and configure local extended stop words:

<entry key="remote_ext_dict">my_stopwords.dic</entry>

Save the configuration and restart ES. The console prints the following log indicating that the stop word dictionary is loaded successfully:

[2023-08-10T20:30:00,000][INFO] ... [Dict Loading] .../plugins/analysis-ik/config/my_stopwords.dic

Analyze the text using the IK analyzer configured with stop words:

POST /_analyze
{
    
    
    "analyzer": "ik_max_word",
    "text": "我喜欢这里的风景"
}

// 返回
{
    
    
    "tokens": [
        {
    
    
            "token": "喜欢",
            "start_offset": 1,
            "end_offset": 3,
            "type": "CN_WORD",
            "position": 0
        },
        {
    
    
            "token": "风景",
            "start_offset": 6,
            "end_offset": 8,
            "type": "CN_WORD",
            "position": 1
        }
    ]
}

The IK analyzer supports configuring a remote extended stop word dictionary (stop word hot update), and the configuration method is the same as configuring a remote extended dictionary.

4.4 HanLP analyzer uses stop words

The HanLP plug-in's analyzers do not enable stop words by default:

POST /_analyze
{
    
    
    "analyzer": "hanlp_standard",
    "text": "我喜欢这里的风景"
}

// 返回
{
    
    
    "tokens": [
        {
    
    
            "token": "我",
            "start_offset": 0,
            "end_offset": 1,
            "type": "rr",
            "position": 0
        },
        {
    
    
            "token": "喜欢",
            "start_offset": 1,
            "end_offset": 3,
            "type": "vi",
            "position": 1
        },
        {
    
    
            "token": "这里",
            "start_offset": 3,
            "end_offset": 5,
            "type": "rzs",
            "position": 2
        },
        {
    
    
            "token": "的",
            "start_offset": 5,
            "end_offset": 6,
            "type": "ude1",
            "position": 3
        },
        {
    
    
            "token": "风景",
            "start_offset": 6,
            "end_offset": 8,
            "type": "n",
            "position": 4
        }
    ]
}

Stop words for the HanLP analyzer can be enabled in a custom analyzer:

PUT /demo
{
    "settings": {
        "analysis": {
            "analyzer": {
                "my_hanlp": {                           // 自定义分析器
                    "type": "hanlp_standard",           // 使用 HanLP 插件中的分析器
                    "enable_stop_dictionary": true      // 启用停用词词典
                }
            }
        }
    },
    "mappings": {
        "properties": {
            "title": {
                "type": "text",
                "analyzer": "my_hanlp"      // 使用自定义分析器
            }
        }
    }
}

Test the analyzer for indexed fields:

POST /demo/_analyze
{
    
    
    "field": "title",
    "text": "我喜欢这里的风景"
}

// 返回
{
    
    
    "tokens": [
        {
    
    
            "token": "喜欢",
            "start_offset": 1,
            "end_offset": 3,
            "type": "vi",
            "position": 0
        },
        {
    
    
            "token": "风景",
            "start_offset": 6,
            "end_offset": 8,
            "type": "n",
            "position": 1
        }
    ]
}

The HanLP analyzer's stop word dictionary supports both local and remote extension. Local extension is configured via the CoreStopWordDictionaryPath field in the plug-in configuration file <es_root>/config/analysis-hanlp/hanlp.properties; its default value is <es_root>/plugins/analysis-hanlp/data/dictionary/stopwords.txt. You can also append extended stop words directly to the end of that file.

The HanLP analyzer's remote stop word dictionary is configured in the same way as the remote extended dictionary, in the <es_root>/config/analysis-hanlp/hanlp-remote.xml configuration file.

5. Pinyin search

Pinyin search is very practical for Chinese search. Chinese users generally type with a pinyin input method, so if pinyin search is supported, users only need to enter the pinyin, or the first letters of the pinyin, of a keyword to find relevant results.

To support Pinyin search, you need to install the ES Pinyin analyzer plug-in: https://github.com/medcl/elasticsearch-analysis-pinyin

5.1 Install Pinyin plug-in

The analysis-pinyin plug-in is not distributed pre-built, so you need to build the project yourself. To compile the ES pinyin plug-in, make sure Git and Maven are installed locally.

First clone the Git repository locally:

git clone https://github.com/medcl/elasticsearch-analysis-pinyin.git

In the project's pom.xml file, change the value of the elasticsearch.version property to the same version number as your ES:

<project>
    <properties>
        <elasticsearch.version>8.8.2</elasticsearch.version>
    </properties>
</project>

Build the project with Maven by executing the following command in the project root directory (the directory containing pom.xml):

mvn install

During compilation, dependencies and Maven plug-ins are downloaded first, so be patient. After a successful build, a compressed package elasticsearch-analysis-pinyin-X.X.X.zip is generated in the <project_root>/target/releases directory; this is the packaged ES plug-in.

To install it, either extract the plug-in package into the <es_root>/plugins/ directory, or install it with the plug-in command:

$ ./bin/elasticsearch-plugin install file:///.../target/releases/elasticsearch-analysis-pinyin-X.X.X.zip

View installed plugins:

$ ./bin/elasticsearch-plugin list
analysis-hanlp
analysis-ik
analysis-pinyin

Restart ES, and the console prints the following log indicating that the plug-in is loaded successfully:

[2023-08-10T20:30:00,000][INFO] ... loaded plugin [analysis-hanlp]
[2023-08-10T20:30:00,000][INFO] ... loaded plugin [analysis-ik]
[2023-08-10T20:30:00,000][INFO] ... loaded plugin [analysis-pinyin]

5.2 Using Pinyin plug-in

The analysis-pinyin plugin includes:

  • Analyzer: pinyin
  • Tokenizer: pinyin
  • Token filter: pinyin

Test the pinyin analyzer:

POST /_analyze
{
    
    
    "analyzer": "pinyin",
    "text": "美丽的风景"
}

// 返回
{
    
    
    "tokens": [
        {
    
    
            "token": "mei",
            "start_offset": 0,
            "end_offset": 0,
            "type": "word",
            "position": 0
        },
        {
    
    
            "token": "mldfj",
            "start_offset": 0,
            "end_offset": 0,
            "type": "word",
            "position": 0
        },
        {
    
    
            "token": "li",
            "start_offset": 0,
            "end_offset": 0,
            "type": "word",
            "position": 1
        },
        {
    
    
            "token": "de",
            "start_offset": 0,
            "end_offset": 0,
            "type": "word",
            "position": 2
        },
        {
    
    
            "token": "feng",
            "start_offset": 0,
            "end_offset": 0,
            "type": "word",
            "position": 3
        },
        {
    
    
            "token": "jing",
            "start_offset": 0,
            "end_offset": 0,
            "type": "word",
            "position": 4
        }
    ]
}

The pinyin analyzer's output is fairly simple: by default it produces the full pinyin of each Chinese character plus a token made of the first letters of all the characters. It has many optional parameters; refer to the documentation in the GitHub repository.

The following combines the IK tokenizer and the pinyin token filter into a custom analyzer:

PUT /demo
{
    "settings": {
        "analysis": {
            "filter": {
                "my_pinyin": {                      // 自定义分词过滤器
                    "type": "pinyin",               // 使用 "pinyin" 分词过滤器
                    "keep_first_letter": true,      // 保留首字母
                    "keep_full_pinyin": true,       // 保留全拼
                    "keep_original": true,          // 保留原始输入
                    "lowercase": true               // 小写
                }
            },
            "analyzer": {
                "my_ik_pinyin": {                   // 自定义分析器
                    "tokenizer": "ik_max_word",     // 使用 "ik_max_word" 分词器
                    "filter": ["my_pinyin"]         // 使用自定义的分词过滤器
                }
            }
        }
    },
    "mappings": {
        "properties": {
            "title": {                              // text 类型的索引字段
                "type": "text",
                "analyzer": "my_ik_pinyin"          // 使用自定义的分析器
            }
        }
    }
}

Test a custom analyzer in the index:

POST /demo/_analyze
{
    
    
    "analyzer": "my_ik_pinyin",
    "text": "美丽的风景"
}

// 返回
{
    
    
    "tokens": [
        {
    
    
            "token": "mei",
            "start_offset": 0,
            "end_offset": 2,
            "type": "CN_WORD",
            "position": 0
        },
        {
    
    
            "token": "li",
            "start_offset": 0,
            "end_offset": 2,
            "type": "CN_WORD",
            "position": 1
        },
        {
    
    
            "token": "美丽",
            "start_offset": 0,
            "end_offset": 2,
            "type": "CN_WORD",
            "position": 1
        },
        {
    
    
            "token": "ml",
            "start_offset": 0,
            "end_offset": 2,
            "type": "CN_WORD",
            "position": 1
        },
        {
    
    
            "token": "feng",
            "start_offset": 3,
            "end_offset": 5,
            "type": "CN_WORD",
            "position": 2
        },
        {
    
    
            "token": "jing",
            "start_offset": 3,
            "end_offset": 5,
            "type": "CN_WORD",
            "position": 3
        },
        {
    
    
            "token": "风景",
            "start_offset": 3,
            "end_offset": 5,
            "type": "CN_WORD",
            "position": 3
        },
        {
    
    
            "token": "fj",
            "start_offset": 3,
            "end_offset": 5,
            "type": "CN_WORD",
            "position": 3
        }
    ]
}

Write some documents:

PUT /demo/_doc/001
{ "title": "美丽的风景" }

PUT /demo/_doc/002
{ "title": "附近的景区" }

Pinyin search:

POST /demo/_search
{
    
    
    "query": {
    
    
        "match": {
    
    
            "title": "fujin"
        }
    }
}

// 返回
{
    
    
    "took": 1,
    "timed_out": false,
    "_shards": {
    
    
        "total": 1,
        "successful": 1,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
    
    
        "total": {
    
    
            "value": 1,
            "relation": "eq"
        },
        "max_score": 1.7427702,
        "hits": [
            {
    
    
                "_index": "demo",
                "_id": "002",
                "_score": 1.7427702,
                "_source": {
    
    
                    "title": "附近的景区"
                }
            }
        ]
    }
}

Pinyin first letter search:

POST /demo/_search
{
    
    
    "query": {
    
    
        "match": {
    
    
            "title": "fj"
        }
    }
}

// 返回
{
    
    
    "took": 1,
    "timed_out": false,
    "_shards": {
    
    
        "total": 1,
        "successful": 1,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
    
    
        "total": {
    
    
            "value": 2,
            "relation": "eq"
        },
        "max_score": 0.22920428,
        "hits": [
            {
    
    
                "_index": "demo",
                "_id": "001",
                "_score": 0.22920428,
                "_source": {
    
    
                    "title": "美丽的风景"
                }
            },
            {
    
    
                "_index": "demo",
                "_id": "002",
                "_score": 0.22920428,
                "_source": {
    
    
                    "title": "附近的景区"
                }
            }
        ]
    }
}

6. Highlight

When searching for text, it is sometimes necessary to highlight the part that matches the keyword, that is, to perform color or font processing on the matching keyword.

Highlighting official website introduction: Highlighting

First create an index and write some text:

PUT /demo
{
    "mappings": {
        "properties": {
            "title": {
                "type": "text",
                "analyzer": "ik_max_word"
            }
        }
    }
}

PUT /demo/_doc/001
{ "title": "美丽的风景" }

PUT /demo/_doc/002
{ "title": "附近的风景" }

6.1 Highlight search results

Highlighting works by wrapping the matched keywords in the returned results with tags. An example of a search returning highlighted results:

POST /demo/_search          // 搜索
{
    "query": {
        "match": {          // 分词查询
            "title": "风景"
        }
    },
    "highlight": {          // 高亮设置
        "fields": {         // 需要高亮的字段
            "title": {      // 高亮 "title" 字段匹配部分
                "pre_tags": "<font color='red'>",       // 高亮词的前标签, 默认为 "<em>"
                "post_tags": "</font>"                  // 高亮词的后标签, 默认为 "</em>"
            }
        }
    }
}

// 返回
{
    
    
    "took": 2,
    "timed_out": false,
    "_shards": {
    
    
        "total": 1,
        "successful": 1,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
    
    
        "total": {
    
    
            "value": 2,
            "relation": "eq"
        },
        "max_score": 0.18232156,
        "hits": [
            {
    
    
                "_index": "demo",
                "_id": "001",
                "_score": 0.18232156,
                "_source": {
    
    
                    "title": "美丽的风景"
                },
                "highlight": {
    
    
                    "title": [
                        "美丽的<font color='red'>风景</font>"
                    ]
                }
            },
            {
    
    
                "_index": "demo",
                "_id": "002",
                "_score": 0.18232156,
                "_source": {
    
    
                    "title": "附近的风景"
                },
                "highlight": {
    
    
                    "title": [
                        "附近的<font color='red'>风景</font>"
                    ]
                }
            }
        ]
    }
}

6.2 Highlighting strategy

ES supports three highlighting strategies: unified, plain, and fvh. The strategy is selected with the highlight.fields.<field>.type parameter in the search request.

unified is ES's default highlighting strategy. It is implemented with the Lucene Unified Highlighter and uses the BM25 algorithm to score individual sentences as if they were documents in the corpus.

The plain strategy uses the standard Lucene highlighter. It is more accurate, but it loads documents into memory and re-runs the query and analysis, so it consumes a lot of resources when highlighting large numbers of documents; it is better suited to highlighting a single field.

The fvh (fast vector highlighter) strategy uses the Lucene Fast Vector Highlighter and is suitable when documents contain large text fields.
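Note that the fvh strategy requires the field to store term vectors with positions and offsets, which is enabled in the field mapping. A sketch, using the title field from the example above:

PUT /demo
{
    "mappings": {
        "properties": {
            "title": {
                "type": "text",
                "analyzer": "ik_max_word",
                "term_vector": "with_positions_offsets"     // required by the fvh highlighting strategy
            }
        }
    }
}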

Specify a highlighting strategy when searching:

POST /demo/_search
{
    
    
    "query": {
    
    
        "match": {
    
    
          "title": "风景"
        }
    },
    "highlight": {
    
    
        "fields": {
    
    
            "title": {
    
    
                "pre_tags": "<font color='red'>",
                "post_tags": "</font>",
                "type": "plain"         // 指定高亮策略
            }
        }
    }
}

7. Spelling correction

When users type search keywords, they may enter wrong characters with a similar pronunciation or shape. When such keywords are searched, many search engines automatically detect the mistake, correct it, and search with the corrected keyword.

7.1 Implementation Principle of Spelling Correction

When ES performs a full-text search (match query), you can specify an edit distance parameter (fuzziness). The edit distance is the number of single-character edits needed to turn one word into another; an edit is replacing a character, deleting a character, inserting a character, or swapping two adjacent characters, and each edit counts as a distance of 1. For example, to turn a misspelled query into "深南大道" (Shennan Avenue), you might replace one wrong character with "南" and then insert "深" at the front, which is an edit distance of 2.
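As a minimal sketch (the hotword index and field used here are defined in the next subsection), fuzziness can be set to a fixed edit distance or to AUTO, which chooses the allowed distance based on the term length:

POST /hotword/_search
{
    "query": {
        "match": {
            "hotword": {
                "query": "幻城大道",
                "fuzziness": "AUTO"     // AUTO allows 0, 1 or 2 edits depending on how long the term is
            }
        }
    }
}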

To correct spelling errors, you can first create an index dedicated to popular (hot) search words. When the user's keywords fail to find content, search the hot-word index with those keywords and a specified edit distance to obtain corrected search words, and then search the content again with the corrected hot words.

7.2 Use hot search terms to correct search terms

Create an index of hot search words and write some documents:

PUT /hotword
{
    "mappings": {
        "properties": {
            "hotword": {                        // text类型的字段
                "type": "text",
                "analyzer": "ik_max_word"
            }
        }
    }
}

PUT /hotword/_doc/001
{ "hotword": "环城大道" }

PUT /hotword/_doc/002
{ "hotword": "林荫小道" }

PUT /hotword/_doc/003
{ "hotword": "滨海立交" }

Specify the edit distance as 1 and search for keywords:

POST /hotword/_search
{
    
    
    "query": {
    
    
        "match": {
    
    
            "hotword": {
    
    
                "query": "幻城大道",      // 搜索词, 将分词为 "幻城" 和 "大道"
                "fuzziness": 1          // 最大编辑距离为 1
            }
        }
    }
}

// 返回
{
    
    
    "took": 2,
    "timed_out": false,
    "_shards": {
    
    
        "total": 1,
        "successful": 1,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
    
    
        "total": {
    
    
            "value": 2,
            "relation": "eq"
        },
        "max_score": 0.9808291,
        "hits": [
            {
    
    
                "_index": "hotword",
                "_id": "001",
                "_score": 0.9808291,
                "_source": {
    
    
                    "hotword": "环城大道"       // "大道" 匹配 "环城大道"
                }
            },
            {
    
    
                "_index": "hotword",
                "_id": "002",
                "_score": 0.49041456,
                "_source": {
    
    
                    "hotword": "林荫小道"       // "大道" 经过1个编辑距离后得到 "小道", 匹配 "林荫小道"
                }
            }
        ]
    }
}

The search term "幻城大道" is segmented into "幻城" and "大道". "大道" directly matches "环城大道"; "大道" is also within one edit of "小道", which matches "林荫小道". That is why the results include both "环城大道" and "林荫小道". A match obtained through an edit scores lower than a direct match, so "林荫小道" scores much lower than "环城大道".

When users use "Huancheng Avenue" to search for documents, if the returned result is empty or the amount of data is very small, they can try to use "Huancheng Avenue" to search from the hot search word index, and the result will be the keyword "Huancheng Avenue" with the highest score. Then use the keyword "Ringcheng Avenue" to search the original document, thus achieving a certain degree of spelling correction.

Another approach to spelling correction: most users type with a pinyin input method, so a keyword with a wrong character is still likely to have the correct pinyin. The keyword field of the hot-word index can therefore be analyzed with the pinyin analyzer; a wrong keyword is then converted to pinyin when searching the hot-word index, and the correct keyword is found by its pinyin.
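A sketch of such a hot-word index (it assumes the analysis-pinyin plug-in from section 5 is installed; the index name pinyin_hotword and the analyzer ik_pinyin are illustrative):

PUT /pinyin_hotword
{
    "settings": {
        "analysis": {
            "analyzer": {
                "ik_pinyin": {                  // illustrative analyzer: IK tokens converted to pinyin
                    "tokenizer": "ik_max_word",
                    "filter": ["pinyin"]        // the pinyin token filter provided by the plug-in
                }
            }
        }
    },
    "mappings": {
        "properties": {
            "hotword": {
                "type": "text",
                "analyzer": "ik_pinyin"         // "幻城" and "环城" produce the same pinyin tokens
            }
        }
    }
}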
