[Elasticsearch Tutorial 18] Mapping field type text, and the term, match, and analyzer

1. Text scenarios

The text type is suited to full-text search: ES analyzes the text into multiple terms and indexes them.

  • The text type is suited to storing human-readable, unstructured text, such as email bodies, comments, and product descriptions.
  • For text with poor readability, such as system logs, HTTP request bodies, and other machine-generated data, you can use the wildcard type.
  • The text type is not suitable for sorting and aggregation (possible, but not recommended).
  • If you want to sort or aggregate, it is recommended to use the keyword type.
  • You can therefore add keyword and token_count sub-fields to a text field, so each serves its own purpose.
PUT pigg_test_text
{
  "mappings": {
    "properties": {
      "name": {                  # name
        "type": "text",
        "fields": {
          "keyword": {           # sub-field name.keyword
            "type": "keyword",
            "ignore_above": 256
          },
          "length": {            # sub-field name.length
            "type": "token_count",
            "analyzer": "standard"
          }
        }
      },
      "tag": {                   # tag
        "type": "keyword"
      },
      "word": {                  # lines (quotes)
        "type": "text"
      }
    }
  }
}

2. term query

  • term checks whether a field contains a certain value. It is generally used on the keyword, integer, date, token_count, ip and similar types.
  • Avoid using term on the text type; for text you should use match or match_phrase to search the full text.

First, insert two heroes from "Honor of Kings" (王者荣耀):

PUT pigg_test_text/_doc/1
{
  "name": "亚瑟王",
  "tag": ["对抗路", "打野"],
  "word": [
    "王者背负,王者审判,王者不可阻挡"
  ]
}

PUT pigg_test_text/_doc/2
{
  "name": "关羽",
  "tag": ["对抗路", "辅助"],
  "word": [
    "把眼光从二爷的绿帽子上移开",
    "聪明的人就应该与我的大刀保持安全距离"
  ]
}

(1) Query the person whose name is 亚瑟王

  • Using term on name.keyword as above returns the document with id=1
  • Note: term does not mean "equals"; it means "contains"
GET pigg_test_text/_search
{
  "query": {
    "term": {
      "name.keyword": "亚瑟王"
    }
  }
}

(2) Query heroes who can play the solo lane (对抗路)

  • Using term on tag as above returns the documents with id=1 and id=2, because both of their tags contain 对抗路
GET pigg_test_text/_search
{
  "query": {
    "term": {
      "tag": "对抗路"
    }
  }
}

(3) Query heroes who are junglers or supports

  • Using terms on tag (note the extra s) returns the documents with id=1 and id=2
  • A terms query matches as long as the field contains any one of the values in the array

GET pigg_test_text/_search
{
  "query": {
    "terms": {
      "tag": ["打野", "辅助"]
    }
  }
}
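
The "match any one value" semantics of terms can be sketched in Python. This is only an illustration of the behavior, not how ES is implemented; the in-memory docs dict is a stand-in for the index.

```python
# Sketch of the "terms" query semantics: a document matches when its
# keyword array shares at least one value with the query list.
docs = {
    1: {"name": "亚瑟王", "tag": ["对抗路", "打野"]},
    2: {"name": "关羽", "tag": ["对抗路", "辅助"]},
}

def terms_query(field, values):
    """Return ids of documents whose field contains ANY of the values."""
    wanted = set(values)
    return sorted(doc_id for doc_id, doc in docs.items()
                  if wanted & set(doc[field]))

print(terms_query("tag", ["打野", "辅助"]))  # → [1, 2]
```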

(4) Query people whose name is 3 characters long

  • An exact match on the token_count sub-field name.length returns the 亚瑟王 document
GET pigg_test_text/_search
{
  "query": {
    "term": {
      "name.length": 3
    }
  }
}
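
The token_count behavior can be sketched as follows. The assumption here is that, on pure Chinese text, the standard analyzer emits one token per character, so name.length equals the character count; this is a simplified model, not real ES code.

```python
# Sketch of the token_count sub-field: name.length stores the number of
# tokens the analyzer produces, and a term query on it is an equality check.
def standard_tokens(text):
    # crude stand-in for the standard analyzer on pure Chinese text:
    # one token per character
    return list(text)

names = {1: "亚瑟王", 2: "关羽"}

def query_name_length(n):
    return [doc_id for doc_id, name in names.items()
            if len(standard_tokens(name)) == n]

print(query_name_length(3))  # → [1]
```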

3. match query

Although the two documents above, 亚瑟王 and 关羽, are in Chinese and I have not configured the ik Chinese analyzer, this does not affect our learning. We only need to know that by default the standard analyzer splits Chinese into individual characters.

A match full-text search for 鼓励王 returns the 亚瑟王 document, because the character 王 matches.

GET pigg_test_text/_search
{
  "query": {
    "match": {
      "name": "鼓励王"
    }
  }
}

If you don't understand the statement above, let's explain it from two angles: how is 亚瑟王 stored, and how is 鼓励王 searched?

1. How is 亚瑟王 stored?

  • _termvectors can help us see how text is split into terms
  • Chinese blogs translate "term" in many different ways (词条, 词根, 词项, etc.); don't worry about it, just understand the meaning

Query the term vectors of the name field of the document with id=1:

GET pigg_test_text/_termvectors/1?fields=name

Three terms are returned: 亚, 瑟, 王, indicating that the inverted index contains a relationship like the following:

term    document ID
亚      1
瑟      1
王      1
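
A minimal sketch of that inverted index, again assuming the standard analyzer emits one token per Chinese character:

```python
def build_inverted_index(docs):
    """Map each single-character token to the ids of the documents containing it."""
    index = {}
    for doc_id, text in docs.items():
        for token in text:  # standard analyzer: one token per Chinese character
            index.setdefault(token, set()).add(doc_id)
    return index

index = build_inverted_index({1: "亚瑟王"})
print(index)  # → {'亚': {1}, '瑟': {1}, '王': {1}}
```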

2. How is 鼓励王 searched?

Method 1: use _analyze to see how the search keywords are segmented. Strictly speaking, you should use the name field's search-time analyzer (its search_analyzer), which here is standard.

GET /_analyze
{
  "analyzer" : "standard",
  "text" : "鼓励王"
}

Three tokens are returned: 鼓, 励, 王.

Method 2: use _validate to verify whether a query is legal; its explain parameter (default false) explains how the query will be executed.

GET pigg_test_text/_validate/query?explain
{
  "query": {
    "match": {
      "name": "鼓励王"
    }
  }
}

The result is as follows; name:鼓 name:励 name:王 indicates that 鼓励王 is split into 3 characters, each matched against the name field.

"valid" : true,
"explanations" : [
  {
    
    
    "index" : "pigg_test_text",
    "valid" : true,
    "explanation" : "name:鼓 name:励 name:王"
  }
]

Method 3: use _explain to see why 鼓励王 matches the document with id=1. The premise of this method is that we already know which document the keyword matches and want to know the reason for the match.

Explain why 鼓励王 matches the document with id=1 on the name field:
GET /pigg_test_text/_explain/1
{
  "query" : {
    "match" : {
      "name" : "鼓励王"
    }
  }
}

The response is relatively long and complicated because it involves the scoring mechanism; here is the key line:

"description" : "weight(name:王 in 0) [PerFieldSimilarity], result of:",

It says that the term 王 from 鼓励王 matched the name field of the document with id=1.

3. match parameters

match has two important parameters, operator and minimum_should_match, which control the behavior of the match query.

3.1 operator

The match query for 鼓励王 above can also be written as follows:

GET pigg_test_text/_search
{
  "query": {
    "match": {
      "name": {
        "query": "鼓励王",
        "operator": "or"
      }
    }
  }
}
  • The default value of operator is or: the match succeeds as long as any one term matches
  • If you want all three characters of 鼓励王 to match, set "operator": "and"
GET pigg_test_text/_validate/query?explain=true
{
  "query": {
    "match": {
      "name": {
        "query": "鼓励王",
        "operator": "and"
      }
    }
  }
}

The response below shows that all 3 characters must match:
"explanations" : [
  {
    "index" : "pigg_test_text",
    "valid" : true,
    "explanation" : "+name:鼓 +name:励 +name:王"
  }
]
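
The difference between the two operator values can be sketched in Python; this is a behavioral illustration under the one-token-per-character assumption, not ES internals.

```python
# Sketch of match "operator": "or" needs any query token present in the
# field's tokens; "and" needs all of them.
def match(field_tokens, query_tokens, operator="or"):
    hits = [t for t in query_tokens if t in field_tokens]
    if operator == "and":
        return len(hits) == len(query_tokens)
    return len(hits) > 0

name_tokens = list("亚瑟王")
query_tokens = list("鼓励王")
print(match(name_tokens, query_tokens, "or"))   # → True  (王 matches)
print(match(name_tokens, query_tokens, "and"))  # → False (鼓 and 励 do not)
```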

3.2 minimum_should_match

  • minimum_should_match sets the minimum number of terms that must match; don't use it together with operator, as their meanings would conflict
  • It accepts positive numbers, negative numbers, percentages, etc., but what we use most often is a positive number specifying the minimum number of matching terms
Require at least 2 characters to match for a document to count as a match:
GET pigg_test_text/_search
{
  "query": {
    "match": {
      "name": {
        "query": "鼓励王",
        "minimum_should_match": "2"
      }
    }
  }
}
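
With a positive number, minimum_should_match behaves like a hit counter; here is a sketch of that case only (negative numbers and percentages follow different rules). The example token lists are made up for illustration.

```python
# Sketch of minimum_should_match with a positive number: at least n of
# the query tokens must appear among the field's tokens.
def match_min(field_tokens, query_tokens, minimum_should_match):
    hits = sum(1 for t in query_tokens if t in field_tokens)
    return hits >= minimum_should_match

print(match_min(list("亚瑟王"), list("鼓励王"), 2))  # → False (only 王 matches)
print(match_min(list("亚瑟王"), list("瑟王剑"), 2))  # → True  (瑟 and 王 match)
```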

4. Phrase query match_phrase

match_phrase is a phrase query: it matches 绿帽子 as a whole phrase (the tokens must appear consecutively and in order) rather than as 3 independent characters.

This query returns the 关羽 document, because his lines contain 绿帽子:
GET pigg_test_text/_search
{
  "query": {
    "match_phrase": {
      "word": "绿帽子"
    }
  }
}

The execution plan of this query:

GET pigg_test_text/_validate/query?explain
{
  "query": {
    "match_phrase": {
      "word": "绿帽子"
    }
  }
}

The response:
"explanations" : [
  {
    "index" : "pigg_test_text",
    "valid" : true,
    "explanation" : "word:\"绿 帽 子\""
  }
]
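
The phrase constraint can be sketched as a check for consecutive, in-order tokens; this mirrors the default behavior (slop 0) only, and is a model of the semantics, not of the position-based implementation ES actually uses.

```python
# Sketch of match_phrase: the query tokens must appear in the field
# consecutively and in the same order.
def match_phrase(field_tokens, phrase_tokens):
    n, m = len(field_tokens), len(phrase_tokens)
    return any(field_tokens[i:i + m] == phrase_tokens
               for i in range(n - m + 1))

line = list("把眼光从二爷的绿帽子上移开")
print(match_phrase(line, list("绿帽子")))  # → True  (adjacent, in order)
print(match_phrase(line, list("子帽绿")))  # → False (wrong order)
```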

5. The analyzer

The most important parameter of the text type is analyzer, which determines how text is tokenized at index-time (when creating or updating documents) and at search-time (when searching documents).

  • analyzer: when only analyzer is configured, it is used at both index-time and search-time
  • search_analyzer: when search_analyzer is configured, it is used at search-time

standard is the default analyzer of the text type. It splits text at word boundaries (for example, English by spaces and Chinese into individual characters), removes most punctuation, and lowercases terms. The standard analyzer is best suited to Western languages like English.

The configuration of an analyzer consists of 3 parts, applied in order: character filters, the tokenizer, and token filters. When a document is ingested, it goes through the following steps before it is finally written into ES:

English             Chinese       Config key    Count   Description
character filters   字符过滤器    char_filter   0~n     Strip HTML tags; convert special characters, e.g. & to "and"
tokenizer           分词器        tokenizer     1       Split text into tokens by certain rules, e.g. standard, whitespace
token filters       词元过滤器    filter        0~n     Normalize the tokens from the previous step, e.g. lowercasing, deleting or adding terms, synonym conversion
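
The three-stage pipeline can be sketched in Python. The specific filters chosen here (an HTML-stripping char filter, a whitespace tokenizer, a lowercase token filter) are illustrative assumptions, not a real ES analyzer definition.

```python
import re

# Sketch of the 3-stage analyzer pipeline:
# char filters -> tokenizer -> token filters
def char_filter(text):
    return re.sub(r"<[^>]+>", "", text)              # strip HTML tags

def tokenizer(text):
    return [t for t in re.split(r"\s+", text) if t]  # whitespace tokenizer

def token_filters(tokens):
    return [t.lower() for t in tokens]               # lowercase token filter

def analyze(text):
    return token_filters(tokenizer(char_filter(text)))

print(analyze("<b>Quick</b> Brown FOX"))  # → ['quick', 'brown', 'fox']
```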

Example: use the simple_pattern_split tokenizer with pattern _ to split text on underscores.

PUT my-index-000001
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "simple_pattern_split",
          "pattern": "_"
        }
      }
    }
  }
}

POST my-index-000001/_analyze
{
  "analyzer": "my_analyzer",
  "text": "亚瑟王__鼓励王_可丽王"
}

Split on the underscore _, the resulting tokens are ["亚瑟王", "鼓励王", "可丽王"].
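
The same splitting behavior can be sketched with a regular expression; the assumption (consistent with the output above) is that empty strings produced by repeated separators are not emitted as tokens.

```python
import re

# Sketch of the simple_pattern_split tokenizer with pattern "_":
# split on the pattern and drop empty pieces.
def simple_pattern_split(text, pattern="_"):
    return [t for t in re.split(pattern, text) if t]

print(simple_pattern_split("亚瑟王__鼓励王_可丽王"))  # → ['亚瑟王', '鼓励王', '可丽王']
```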


Origin blog.csdn.net/winterking3/article/details/126663540