Elasticsearch Mapping: the text field type, term and match queries, and analyzers
1. When to use text

- The `text` type is designed for full-text search: ES analyzes the text into multiple tokens and indexes them.
- The `text` type suits human-readable, unstructured text such as email bodies, comments, and product descriptions. For machine-generated data with poor readability, such as system logs and HTTP request bodies, you can use the `wildcard` type instead.
- The `text` type is not suitable for sorting and aggregations (possible, but not recommended). If you want to sort or aggregate, use the `keyword` type.
- You can therefore add `keyword` and `token_count` sub-fields to a `text` field, so that each serves its own purpose.
PUT pigg_test_text
{
"mappings": {
"properties": {
"name": {
# the hero's name
"type": "text",
"fields": {
"keyword": {
# sub-field name.keyword
"type": "keyword",
"ignore_above" : 256
},
"length": {
# sub-field name.length
"type": "token_count",
"analyzer": "standard"
}
}
},
"tag": {
# the hero's tags
"type": "keyword"
},
"word": {
# the hero's lines
"type": "text"
}
}
}
}
2. The term query

- `term` checks whether a field contains a given value. It is generally used on `keyword`, `integer`, `date`, `token_count`, `ip`, and similar types.
- Avoid using `term` on `text` fields; on `text` you should use `match` or `match_phrase` to search the full text.
First, insert documents for two heroes from "Honor of Kings" (王者荣耀):
PUT pigg_test_text/_doc/1
{
"name": "亚瑟王",
"tag": ["对抗路", "打野"],
"word": [
"王者背负,王者审判,王者不可阻挡"
]
}
PUT pigg_test_text/_doc/2
{
"name": "关羽",
"tag": ["对抗路", "辅助"],
"word": [
"把眼光从二爷的绿帽子上移开",
"聪明的人就应该与我的大刀保持安全距离"
]
}
(1) Query the hero whose name is 亚瑟王

- Use a `term` query on `name.keyword`, which returns the document with id=1.
- Note: `term` does not mean "equals"; it means "contains".
GET pigg_test_text/_search
{
"query": {
"term": {
"name.keyword": "亚瑟王"
}
}
}
(2) Query heroes who can play the confrontation lane (对抗路)

- Use a `term` query on `tag`, which returns the documents with id=1 and id=2, because both of their `tag` arrays contain 对抗路.
GET pigg_test_text/_search
{
"query": {
"term": {
"tag": "对抗路"
}
}
}
(3) Query heroes who are junglers (打野) or supports (辅助)

- Use a `terms` query on `tag` (note the extra "s"), which returns the documents with id=1 and id=2.
- A `terms` query matches as long as the field contains any one of the values in the array.
GET pigg_test_text/_search
{
"query": {
"terms": {
"tag": ["打野", "辅助"]
}
}
}
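The containment semantics of `term` and `terms` on an array field can be sketched in Python (a conceptual illustration only, not how ES evaluates queries internally):

```python
# Conceptual sketch: term checks that the field's values contain one
# given value; terms matches if any of the supplied values is contained.

doc_tags = ["对抗路", "打野"]  # the tag field of document id=1

def term_matches(field_values, value):
    return value in field_values

def terms_matches(field_values, values):
    return any(v in field_values for v in values)

print(term_matches(doc_tags, "对抗路"))           # True
print(terms_matches(doc_tags, ["打野", "辅助"]))  # True: 打野 is contained
```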
(4) Query heroes whose name is exactly 3 characters long

- Do an exact match on the `token_count` sub-field `name.length`; the returned document is 亚瑟王.
GET pigg_test_text/_search
{
"query": {
"term": {
"name.length": 3
}
}
}
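Conceptually, a `token_count` sub-field stores the number of tokens its analyzer produces for the text. The following sketch (an illustration, assuming the standard analyzer emits one token per Chinese character, while real analysis is done by Lucene) shows why `name.length` is 3 for 亚瑟王:

```python
# Rough stand-in for the standard analyzer on Chinese text:
# one token per CJK character.
def cjk_tokens(text: str) -> list[str]:
    return [ch for ch in text if "\u4e00" <= ch <= "\u9fff"]

tokens = cjk_tokens("亚瑟王")
print(tokens)       # ['亚', '瑟', '王']
print(len(tokens))  # 3, the value indexed in name.length
```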
3. The match query

Although the two documents above (亚瑟王 and 关羽) contain Chinese and I have not configured the ik Chinese analyzer, this does not affect our learning. We only need to know that by default the standard analyzer splits Chinese into individual characters.

A full-text `match` search for 鼓励王 returns the 亚瑟王 document, because the character 王 matches.
GET pigg_test_text/_search
{
"query": {
"match": {
"name": "鼓励王"
}
}
}
If the result above is puzzling, we can explain it from two angles: how is 亚瑟王 stored, and how is 鼓励王 searched?

1. How is 亚瑟王 stored?

- `_termvectors` lets us see how a text is split into terms.
- Chinese blogs translate "term" in many different ways (词条, 词根, 术语, etc.); don't worry about the translation, just know what it means.
Query the term vectors of the name field for the document with id=1:
GET pigg_test_text/_doc/1/_termvectors?fields=name
The three terms 亚, 瑟, and 王 are returned, indicating that the inverted index (倒排索引) contains a relationship like this:

term | document ID |
---|---|
亚 | 1 |
瑟 | 1 |
王 | 1 |
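The table above is essentially an inverted index: each term points at the documents that contain it. A minimal Python sketch of the idea (illustrative only):

```python
from collections import defaultdict

# doc id -> analyzed tokens of the name field
docs = {1: ["亚", "瑟", "王"], 2: ["关", "羽"]}

# Build the inverted index: term -> set of document ids
inverted = defaultdict(set)
for doc_id, tokens in docs.items():
    for token in tokens:
        inverted[token].add(doc_id)

print(inverted["王"])  # {1}: the term 王 points at document 1
```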
2. How is 鼓励王 searched?

Method 1: use `_analyze` to see how the standard analyzer segments the search keywords. Strictly speaking, this should be the `name` field's search-time analyzer, i.e. its `search_analyzer`.
GET /_analyze
{
"analyzer" : "standard",
"text" : "鼓励王"
}
The 3 tokens 鼓, 励, and 王 are returned.
Method 2: use `_validate` to check whether a query is legal; its `explain` parameter (default false) makes the response explain the execution plan of the query.
GET pigg_test_text/_validate/query?explain
{
"query": {
"match": {
"name": "鼓励王"
}
}
}
The result below contains `name:鼓 name:励 name:王`, indicating that 鼓励王 is split into 3 characters, each of which is matched against the `name` field.
"valid" : true,
"explanations" : [
{
"index" : "pigg_test_text",
"valid" : true,
"explanation" : "name:鼓 name:励 name:王"
}
]
Method 3: use `_explain` to see why the query 鼓励王 matches the document with id=1. The premise here is that we already know which document the keyword matched, and we want to know the reason for the match.

Explain why 鼓励王 matches the document with id=1 on the name field:
GET /pigg_test_text/_explain/1
{
"query" : {
"match" : {
"name" : "鼓励王" }
}
}
The returned content is relatively long and complicated because it involves the scoring mechanism; here is the key line:

"description" : "weight(name:王 in 0) [PerFieldSimilarity], result of:",

It says that the character 王 in 鼓励王 matched the `name` field of the document with id=1.
3. match parameters

`match` has two important parameters, `operator` and `minimum_should_match`, which control the behavior of the query.
3.1 operator

The `match` query for 鼓励王 above can equivalently be written as:
GET pigg_test_text/_search
{
"query": {
"match": {
"name": {
"query": "鼓励王",
"operator": "or"
}
}
}
}
- The default value of `operator` is `or`, meaning the document matches as long as any one token matches.
- If you want all three characters of 鼓励王 to match, set `"operator": "and"`:
GET pigg_test_text/_validate/query?explain=true
{
"query": {
"match": {
"name": {
"query": "鼓励王",
"operator": "and"
}
}
}
}
The response below shows that all 3 characters must match:
"explanations" : [
{
"index" : "pigg_test_text",
"valid" : true,
"explanation" : "+name:鼓 +name:励 +name:王"
}
]
3.2 minimum_should_match

- `minimum_should_match` sets the minimum number of tokens that must match. Don't use it together with `operator`, as their meanings conflict.
- It accepts positive numbers, negative numbers, percentages, and so on, but most often we set a positive number, which specifies the minimum number of tokens that must match.

Require at least 2 characters to match before the document counts as a match:
GET pigg_test_text/_search
{
"query": {
"match": {
"name": {
"query": "鼓励王",
"minimum_should_match": "2"
}
}
}
}
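The interaction of `operator` and `minimum_should_match` can be sketched as simple token counting (a conceptual model; the real query also scores and ranks results):

```python
def match(doc_tokens, query_tokens, operator="or", minimum_should_match=1):
    # Count how many query tokens appear in the document.
    hits = sum(1 for t in query_tokens if t in doc_tokens)
    if operator == "and":
        return hits == len(query_tokens)   # and: every token must match
    return hits >= minimum_should_match    # or: at least N tokens must match

doc = ["亚", "瑟", "王"]    # analyzed 亚瑟王
query = ["鼓", "励", "王"]  # analyzed 鼓励王

print(match(doc, query))                          # True: 王 matches (or is the default)
print(match(doc, query, operator="and"))          # False: 鼓 and 励 are missing
print(match(doc, query, minimum_should_match=2))  # False: only 1 token matches
```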
4. Phrase query: match_phrase

`match_phrase` is a phrase query: "绿帽子" is matched as a whole phrase, i.e. the tokens 绿, 帽, 子 must appear consecutively and in order, rather than any one of them being enough.

This query returns the 关羽 document, because his lines contain "绿帽子":
GET pigg_test_text/_search
{
"query": {
"match_phrase": {
"word": "绿帽子"
}
}
}
The execution plan of the query statement:
GET pigg_test_text/_validate/query?explain
{
"query": {
"match_phrase": {
"word": "绿帽子"
}
}
}
The response is as follows:
"explanations" : [
{
"index" : "pigg_test_text",
"valid" : true,
"explanation" : "word:\"绿 帽 子\""
}
]
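The positional requirement of `match_phrase` can be sketched like this (illustrative only; Lucene actually uses token positions stored in the index):

```python
def match_phrase(doc_tokens, phrase_tokens):
    # The phrase tokens must appear consecutively and in order.
    n, m = len(doc_tokens), len(phrase_tokens)
    return any(doc_tokens[i:i + m] == phrase_tokens for i in range(n - m + 1))

# one token per character, as the standard analyzer does for Chinese
line = list("把眼光从二爷的绿帽子上移开")

print(match_phrase(line, list("绿帽子")))  # True: consecutive and in order
print(match_phrase(line, list("帽绿子")))  # False: wrong order
```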
4. The analyzer (分词器)

The most important parameter of the `text` type is `analyzer`, which determines how the text is tokenized at index time (when creating or updating documents) and at search time (when searching documents).

- `analyzer`: when only `analyzer` is configured, the configured analyzer is used at both index time and search time.
- `search_analyzer`: when `search_analyzer` is configured, it is used at search time instead.

The standard analyzer is the default analyzer of the `text` type. It splits text on word boundaries (for example, English on spaces and Chinese into individual characters), removes most punctuation, and lowercases terms (it can also remove stop words, but does not by default). The standard analyzer is mostly suited to Western languages such as English.
The configuration of an analyzer consists of 3 parts, applied in order: character filters, tokenizer, token filters. Whenever a document is ingested, it goes through the following steps before it is finally written into ES:

Part | Configuration key | Count | Description |
---|---|---|---|
character filters | char_filter | 0~n | Strip HTML tags, convert special characters such as "&" to "and" |
tokenizer | tokenizer | 1 | Split the text into tokens according to certain rules, e.g. standard, whitespace |
token filters | filter | 0~n | Normalize the tokens from the previous step: lowercase, delete or add terms, synonym conversion, etc. |
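The three-stage pipeline in the table can be sketched as plain functions (a toy model with made-up filter implementations, not ES code):

```python
import re

def char_filter(text):             # e.g. strip HTML tags, map "&" to "and"
    text = re.sub(r"<[^>]+>", "", text)
    return text.replace("&", " and ")

def whitespace_tokenizer(text):    # split the text into tokens
    return text.split()

def lowercase_filter(tokens):      # normalize the tokens
    return [t.lower() for t in tokens]

def analyze(text):                 # char_filter -> tokenizer -> token filters
    return lowercase_filter(whitespace_tokenizer(char_filter(text)))

print(analyze("<b>Tom & Jerry</b>"))  # ['tom', 'and', 'jerry']
```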
Example: use the simple_pattern_split tokenizer, configured to split text on the underscore (_).
PUT my-index-000001
{
"settings": {
"analysis": {
"analyzer": {
"my_analyzer": {
"tokenizer": "my_tokenizer"
}
},
"tokenizer": {
"my_tokenizer": {
"type": "simple_pattern_split",
"pattern": "_"
}
}
}
}
}
POST my-index-000001/_analyze
{
"analyzer": "my_analyzer",
"text": "亚瑟王__鼓励王_可丽王"
}
After splitting on the underscore, the tokens are ["亚瑟王", "鼓励王", "可丽王"].
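The result can be reproduced in Python; note that, like Lucene's simple_pattern_split, we must drop the empty string that a plain split produces between the two consecutive underscores:

```python
text = "亚瑟王__鼓励王_可丽王"
tokens = [t for t in text.split("_") if t]  # drop empty tokens from "__"
print(tokens)  # ['亚瑟王', '鼓励王', '可丽王']
```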
- In practice we rarely need to configure much of this; a general understanding is enough, and there is no need to dig into every option.
- ES provides many built-in analyzers out of the box, and we can choose one according to the scenario.
- For Chinese word segmentation there are the well-known IK analyzer and Pinyin analyzer; you can refer to my earlier posts:
  - IK word breaker: ik_max_word, ik_smart
  - Pinyin word breaker
  - ik Chinese word breaker + pinyin word breaker + synonyms