Analysis (parsing)
Analysis can be understood as tokenization: splitting text into terms.
Analysis is performed by an analyzer; analyzers come in two kinds, built-in and custom.
1.1. Analyzers
1.1.1. Built-in analyzers
doc:https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-analyzers.html
Standard Analyzer: splits on word boundaries, discards most punctuation, lowercases terms, and supports stop-word removal.
Simple Analyzer: splits on any non-letter character and lowercases terms.
Whitespace Analyzer: splits on whitespace; does not lowercase.
Stop Analyzer: like the Simple Analyzer, but also removes stop words.
Pattern Analyzer: splits text with a regular expression.
Language Analyzers: language-specific analyzers (english, french, etc.).
Fingerprint Analyzer:
The fingerprint analyzer is a specialist analyzer which creates a fingerprint which can be used for duplicate detection.
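As a rough illustration of how the first three tokenizers differ, their behavior can be approximated in a few lines of Python. This is a toy sketch, not the actual Lucene implementations:

```python
import re

def standard_like(text):
    # rough approximation of the standard analyzer:
    # keep runs of letters/digits, lowercase them
    return [t.lower() for t in re.findall(r"[A-Za-z0-9]+", text)]

def simple_like(text):
    # simple analyzer: split on any non-letter character, lowercase
    return [t.lower() for t in re.findall(r"[A-Za-z]+", text)]

def whitespace_like(text):
    # whitespace analyzer: split on whitespace only, keep case
    return text.split()

text = "The 2 QUICK Brown-Foxes!"
print(standard_like(text))    # ['the', '2', 'quick', 'brown', 'foxes']
print(simple_like(text))      # ['the', 'quick', 'brown', 'foxes']
print(whitespace_like(text))  # ['The', '2', 'QUICK', 'Brown-Foxes!']
```

Note how the simple analyzer drops the digit entirely, while the whitespace analyzer keeps punctuation and case untouched.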
1.1.2. Custom analyzers
Not covered here.
1.2. Index-time and search-time analysis
Index-time analysis is straightforward: text is tokenized when a document is written, and the resulting terms form the inverted index.
Each text field can specify its own analyzer;
if none is specified, the index's default analyzer setting applies, which is essentially the standard analyzer.
Search-time analysis
The query string is also analyzed, by default with the same analyzer the field uses at index time;
a separate search analyzer can be configured, but this is rarely needed.
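For the rare case where a different search-time analyzer is wanted, the mapping can name both via the `search_analyzer` parameter. A minimal sketch (the field name `title` is illustrative; applying it requires a live cluster):

```python
# Mapping that analyzes documents with "english" at index time
# but analyzes query strings with "standard" at search time.
mapping = {
    "properties": {
        "title": {
            "type": "text",
            "analyzer": "english",          # used when documents are indexed
            "search_analyzer": "standard",  # used when queries are analyzed
        }
    }
}

# es.indices.put_mapping(body=mapping, index=["test_i"])  # needs a running cluster
print(mapping["properties"]["title"])
```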
1.2.1. Tokenization example
Take the built-in english analyzer as an example:
"The QUICK brown foxes jumped over the lazy dog!"
The text is lowercased, high-frequency stop words are removed, and each word is reduced to its stem; the final result is the token sequence:
[ quick, brown, fox, jump, over, lazi, dog ]
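The pipeline above (lowercase, stop-word removal, stemming) can be sketched in plain Python. This toy stemmer has just enough suffix rules to reproduce the example; the real english analyzer uses the full Porter stemmer:

```python
# Toy sketch of the english analyzer's pipeline; not the real implementation.
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to"}

def toy_stem(token):
    # a tiny rule set standing in for the Porter stemmer
    if token.endswith("es"):
        return token[:-2]        # foxes  -> fox
    if token.endswith("ed"):
        return token[:-2]        # jumped -> jump
    if token.endswith("y"):
        return token[:-1] + "i"  # lazy   -> lazi
    return token

def toy_english_analyze(text):
    tokens = [t.lower() for t in text.replace("!", " ").split()]
    return [toy_stem(t) for t in tokens if t not in STOP_WORDS]

print(toy_english_analyze("The QUICK brown foxes jumped over the lazy dog!"))
# ['quick', 'brown', 'fox', 'jump', 'over', 'lazi', 'dog']
```

Stemming at both index time and search time is why a query for "eat" can match a document containing "Eating", as the example in section 2 shows.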
2. Example
Environment setup:
Create an index test_i.
Create a field msg with the default configuration, i.e. the standard analyzer.
Create a field msg_english with the english analyzer.
# Create the test environment
d = {"msg": "Eating an apple a day keeps doctor away."}
rv = es.index("test_i", d)
print(rv)
d = {
    "properties": {
        "msg_english": {
            "type": "text",
            "analyzer": "english"
        }
    }
}
rv = es.indices.put_mapping(body=d, index=["test_i"])  # returns {"acknowledged": true} on success
# Inspect the resulting mapping
rv = es.indices.get_mapping(index="test_i")
{
  "test_i": {
    "mappings": {
      "properties": {
        "msg": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        },
        "msg_english": {
          "type": "text",
          "analyzer": "english"
        }
      }
    }
  }
}
Index a document:
d = {"msg_english":"Eating an apple a day keeps doctor away."}
rv = es.index("test_i", d)
Query: the test has two parts. Matching "eat" against the msg field returns no hits; matching it against the msg_english field does:
# search api
def search_api_test():
    data = {"query": {"match": {"msg_english": "eat"}}}
    rv = es.search(index="test_i", body=data)
    print(rv)

search_api_test()
Result:
{
  "took": 2,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 1,
      "relation": "eq"
    },
    "max_score": 0.2876821,
    "hits": [
      {
        "_index": "test_i",
        "_type": "_doc",
        "_id": "XG7KFG0BpAsDZnvvGLz2",
        "_score": 0.2876821,
        "_source": {
          "msg_english": "Eating an apple a day keeps doctor away."
        }
      }
    ]
  }
}
Supplement: a tokenization comparison that shows the difference between the standard and english analyzers directly.
Test code:
# Tokenization test
d1 = {"analyzer": "standard", "text": "Eating an apple a day keeps doctor away."}
d2 = {"analyzer": "english", "text": "Eating an apple a day keeps doctor away."}
rv1 = es.indices.analyze(body=d1, format="text")
rv2 = es.indices.analyze(body=d2, format="text")
print([x["token"] for x in rv1["tokens"]])  # tokens from d1
print([x["token"] for x in rv2["tokens"]])  # tokens from d2
Output:
['eating', 'an', 'apple', 'a', 'day', 'keeps', 'doctor', 'away']
['eat', 'appl', 'dai', 'keep', 'doctor', 'awai']