Elasticsearch series --- how the inverted index works with analyzers

Overview

This article explains the basic principles of the inverted index in ES and introduces several commonly used analyzers.

How an inverted index is built

The inverted index is the most common indexing method in search engines. For full-text search, it stores a mapping from each word to the documents (and locations) in which that word appears. With an inverted index, when we enter a keyword we can very quickly get the list of documents that contain it.

Let's look at an English example. Suppose we have two documents:

I have a friend who loves smile
love me, I love you

To create an inverted index, we take the simplest approach and split each document into words on spaces, which gives the following result.
An asterisk (*) means the term appears in the document of that column; a blank means it does not.

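| Term | doc1 | doc2 |
| :---- | :-: | :-: |
| I | * | * |
| have | * | |
| a | * | |
| friend | * | |
| who | * | |
| loves | * | |
| smile | * | |
| love | | * |
| me | | * |
| you | | * |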
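The table above can be reproduced with a few lines of Python. This is only a minimal sketch of the idea, not how ES builds its index: the function name `build_inverted_index` and the document ids are illustrative, and the only tokenization is a whitespace split plus stripping the comma.

```python
from collections import defaultdict

# The two example documents from the text.
docs = {
    "doc1": "I have a friend who loves smile",
    "doc2": "love me, I love you",
}

def build_inverted_index(documents):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        # Simplest possible tokenization: split on whitespace and
        # strip the comma so that "me," is stored as "me".
        for raw in text.split():
            term = raw.strip(",")
            if term:
                index[term].add(doc_id)
    return dict(index)

index = build_inverted_index(docs)
for term, doc_ids in index.items():
    print(f"{term:8} {sorted(doc_ids)}")
```

Note that `loves` and `love` end up as two separate terms, exactly as in the table.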

If we want to search for "I love you", we only need to look up the documents that contain each term:

| Term | doc1 | doc2 |
| :---- | :-: | :-: |
| I | * | * |
| love | | * |
| you | | * |

Both documents match, but if we rank by the number of matching terms, doc2 is a better match than doc1.
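The lookup and the hit count can be sketched as follows. The index is written out by hand from the example, and `search` is an illustrative helper that scores a document by how many query terms it contains; real ES relevance scoring is considerably more elaborate.

```python
from collections import Counter

# Inverted index for the two example documents, written out by hand.
index = {
    "I": {"doc1", "doc2"},
    "have": {"doc1"},
    "a": {"doc1"},
    "friend": {"doc1"},
    "who": {"doc1"},
    "loves": {"doc1"},
    "smile": {"doc1"},
    "love": {"doc2"},
    "me": {"doc2"},
    "you": {"doc2"},
}

def search(index, query):
    """Return (doc_id, hit_count) pairs, best match first."""
    hits = Counter()
    for term in query.split():
        # A term that is not in the index simply contributes no hits.
        for doc_id in index.get(term, ()):
            hits[doc_id] += 1
    return hits.most_common()

results = search(index, "I love you")
print(results)  # [('doc2', 3), ('doc1', 1)]
```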

This is the simplest form of an inverted index. The inverted index that ES actually stores also records the positions at which each term appears in a document.

Desired term processing

Looking back at how this index was built: is there any real difference between loves and love? No. Both mean love; one is the third-person singular form and the other is the base form. If we remove some of these grammatical differences, won't the search results come closer to what users actually need?
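Recording positions can be sketched by extending the same idea. This is an illustration only: the nested-dictionary layout and the name `build_positional_index` are assumptions made for the example and do not reflect the storage format ES uses internally.

```python
from collections import defaultdict

docs = {
    "doc1": "I have a friend who loves smile",
    "doc2": "love me, I love you",
}

def build_positional_index(documents):
    """Map term -> {doc_id: [positions of the term in that document]}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in documents.items():
        terms = [raw.strip(",") for raw in text.split()]
        for position, term in enumerate(terms):
            index[term][doc_id].append(position)
    # Convert to plain dicts so the result prints cleanly.
    return {term: dict(postings) for term, postings in index.items()}

positional = build_positional_index(docs)
print(positional["love"])  # {'doc2': [0, 3]}
print(positional["I"])     # {'doc1': [0], 'doc2': [2]}
```

With positions available, a search can check whether terms appear next to each other, which is what phrase matching needs.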
For example:

loves is stemmed to love
words with little meaning of their own, such as a, are dropped directly
and so on

The index now looks like this:

| Term | doc1 | doc2 |
| :---- | :-: | :-: |
| friend | * | |
| love | * | * |
| smile | * | |
| me | | * |
| you | | * |

Isn't that much leaner?
This process is called normalization. When the inverted index is built, each term that has been split out goes through a series of operations that raise the probability of a search returning relevant documents, such as tense conversion, singular/plural conversion, synonym conversion, and case conversion.
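A minimal sketch of normalization applied to the two example documents is shown below. The stop word set and the stem table are assumptions picked by hand so that the result matches the leaner index above; a real analyzer uses a proper stemming algorithm and a much larger stop word list.

```python
from collections import defaultdict

docs = {
    "doc1": "I have a friend who loves smile",
    "doc2": "love me, I love you",
}

# Assumption: hand-picked so the output matches the example index.
STOP_WORDS = {"i", "have", "a", "who"}
# Toy lookup table standing in for a real stemmer.
STEMS = {"loves": "love"}

def normalize(text):
    """Lowercase, stem, and drop stop words from a whitespace-split text."""
    terms = []
    for raw in text.split():
        term = raw.strip(",").lower()
        term = STEMS.get(term, term)
        if term and term not in STOP_WORDS:
            terms.append(term)
    return terms

def build_index(documents):
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in normalize(text):
            index[term].add(doc_id)
    return dict(index)

index = build_index(docs)
for term in sorted(index):
    print(term, sorted(index[term]))
```

The same `normalize` function has to be applied to the query text as well, otherwise a search for "loves" would never find the stored term `love`.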

Enter the analyzer

The analyzer's job is to split a whole document, according to certain rules, into individual terms that carry meaning. The goal is to improve the recall of documents and reduce the noise from irrelevant data.

Recall: also known as searchability. Improving recall means increasing the number of results that a search can find.
Noise reduction: reducing the interference that low-relevance terms in documents cause in the overall ranking of search results.

An analyzer processes a document in the following steps:

Character filter

Preprocesses the raw string, for example stripping HTML tags (Love wrapped in markup -> Love) and replacing special characters (I & you -> I and you).

Tokenizer

Splits the string into individual terms. English is split on spaces and punctuation, while Chinese is split by word. Different languages need different tokenizers: some are relatively simple, such as the standard tokenizer, and some are particularly complex, such as Chinese tokenizers, which contain sophisticated segmentation logic. For example:

I Love you -> I / Love / you

I and my motherland -> I / and / my / motherland

Token filter

Further processes the terms produced by the tokenizer, for example changing terms (English stemming, loves -> love), removing meaningless terms (a, and, this in English, and the equivalent function words in Chinese), and adding terms (supplementing synonyms).

Analysis is very important. A good analyzer can significantly improve recall, while an unsuitable one can make search results ambiguous. The final output of analysis is what is used to build the inverted index.
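The three stages above can be sketched as a small pipeline. Everything here is illustrative: the function names, the regular expressions, the toy stem table, and the synonym list (`smile` -> `grin`) are assumptions for the example, not ES internals.

```python
import re

STOP_WORDS = {"a", "and", "this"}   # the English examples from the text
STEMS = {"loves": "love"}           # toy table standing in for a stemmer
SYNONYMS = {"smile": ["grin"]}      # hypothetical synonym list

def char_filter(text):
    """Stage 1: clean the raw string before it is split."""
    text = re.sub(r"<[^>]+>", "", text)   # strip HTML tags
    return text.replace("&", "and")       # spell out the ampersand

def tokenize(text):
    """Stage 2: cut the string into terms on non-letter characters."""
    return re.findall(r"[A-Za-z]+", text)

def token_filter(tokens):
    """Stage 3: change, remove, and add terms."""
    out = []
    for token in tokens:
        term = token.lower()
        term = STEMS.get(term, term)           # change: stemming
        if term in STOP_WORDS:                 # remove: stop words
            continue
        out.append(term)
        out.extend(SYNONYMS.get(term, []))     # add: synonyms
    return out

def analyze(text):
    return token_filter(tokenize(char_filter(text)))

terms = analyze("<b>Love</b> this: I & you, a friend who loves smile")
print(terms)
# ['love', 'i', 'you', 'friend', 'who', 'love', 'smile', 'grin']
```

The order matters: the character filter must run before the tokenizer, because otherwise the tag names themselves would be emitted as terms.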

Common analyzers

Elasticsearch ships with built-in analyzers and also allows third-party analyzers to be used.

Built-in analyzers

Standard analyzer: standard

The default analyzer in ES. It splits text on word boundaries as defined by the Unicode Consortium, removes most punctuation, and finally lowercases the terms.

Simple analyzer: simple

Splits text wherever a character is not a letter, and lowercases the terms.

Whitespace analyzer: whitespace

Splits text on whitespace.

Language analyzers

Analyzers for specific languages. For example english, the English analyzer, maintains a set of English stop words (and, the, and the like) that are removed from the terms, and it can extract word stems according to English grammar rules.

The built-in analyzers work well mainly for English. For Chinese, we need to use an external analyzer.
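To make the difference between two of the built-in analyzers concrete, here is a rough Python approximation of the behavior described above. These functions only imitate the descriptions of `whitespace` and `simple`; they are not the ES implementations, and the sample sentence is made up for the comparison. The `standard` analyzer is left out because Unicode word segmentation cannot be captured in one line.

```python
import re

def whitespace_analyze(text):
    """Approximation of `whitespace`: split on whitespace, change nothing else."""
    return text.split()

def simple_analyze(text):
    """Approximation of `simple`: split at every non-letter, then lowercase."""
    # [^\W\d_] matches a word character that is neither a digit nor "_",
    # in other words a letter.
    return [term.lower() for term in re.findall(r"[^\W\d_]+", text)]

text = "I'm a Friend, who LOVES 2 smile!"
print(whitespace_analyze(text))
# ["I'm", 'a', 'Friend,', 'who', 'LOVES', '2', 'smile!']
print(simple_analyze(text))
# ['i', 'm', 'a', 'friend', 'who', 'loves', 'smile']
```

The whitespace version keeps punctuation and case, so `Friend,` would not match a search for `friend`; the simple version matches it but loses the digit entirely.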

External analyzers

IK Chinese analyzer: ik_max_word
Splits the text at the finest granularity, producing as many terms as possible.
For example, 南京市长江大桥 (Nanjing Yangtze River Bridge) -> 南京市 / 南京 / 市长 / 长江大桥 / 长江 / 大桥
IK Chinese analyzer: ik_smart
Splits the text at the coarsest granularity; characters that already belong to one term are not reused by another term.
For example, 南京市长江大桥 -> 南京市 / 长江大桥
CJK analyzer: cjk
Supports the Asian languages Chinese, Japanese, and Korean.
For example, 南京市长江大桥 -> 南京 / 京市 / 市长 / 长江 / 江大 / 大桥
Ali Chinese analyzer: aliws
A Chinese analyzer developed in-house by Alibaba.
For example, 南京市长江大桥 -> 南京 / 市 / 长江 / 大桥
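The cjk analyzer forms overlapping two-character terms (bigrams) from runs of CJK text, which is why it needs no dictionary. A minimal sketch of that idea follows; the function name `cjk_bigrams` is illustrative, and the sketch assumes the input is one uninterrupted run of Chinese characters, whereas the real analyzer also handles mixed text, case, and stop words.

```python
def cjk_bigrams(text):
    """Return the overlapping two-character terms of a run of CJK text."""
    if len(text) < 2:
        # A single character cannot form a pair, so it is kept as-is.
        return [text] if text else []
    return [text[i:i + 2] for i in range(len(text) - 1)]

terms = cjk_bigrams("南京市长江大桥")
print(" / ".join(terms))
# 南京 / 京市 / 市长 / 长江 / 江大 / 大桥
```

Bigrams such as 京市 and 江大 are not real words, which is the price of not using a dictionary: recall is high, but the index contains noise terms that a dictionary-based analyzer like IK avoids.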

Many external analyzers are open source, and they cover different languages and different domains. You can choose one that fits the characteristics of your own business. They are not introduced one by one here; interested readers can look them up.

Integrating an analyzer

Taking Elasticsearch 6.3.1 as an example, here is how to integrate the IK analyzer; the process for other analyzers is similar. Run the plugin installation command in the ES bin directory:
./elasticsearch-plugin install https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v6.3.1/elasticsearch-analysis-ik-6.3.1.zip

The address after install is the download URL of the elasticsearch-analysis-ik GitHub release that corresponds to the ES version.

After a successful installation, the following message appears in the ES startup log:
[2019-11-27T12:17:15,255][INFO ][o.e.p.PluginsService] [node-1] loaded plugin [analysis-ik]

Testing the analysis results

ES provides the analyze API to show how a piece of text is analyzed. It is useful for debugging and learning. The request is as follows:

GET /_analyze
{
  "analyzer": "ik_max_word",
  "text": "南京市长江大桥"
}

The response is:

{
  "tokens": [
    {
      "token": "南京市",
      "start_offset": 0,
      "end_offset": 3,
      "type": "CN_WORD",
      "position": 0
    },
    {
      "token": "南京",
      "start_offset": 0,
      "end_offset": 2,
      "type": "CN_WORD",
      "position": 1
    },
    {
      "token": "市长",
      "start_offset": 2,
      "end_offset": 4,
      "type": "CN_WORD",
      "position": 2
    },
    {
      "token": "长江大桥",
      "start_offset": 3,
      "end_offset": 7,
      "type": "CN_WORD",
      "position": 3
    },
    {
      "token": "长江",
      "start_offset": 3,
      "end_offset": 5,
      "type": "CN_WORD",
      "position": 4
    },
    {
      "token": "大桥",
      "start_offset": 5,
      "end_offset": 7,
      "type": "CN_WORD",
      "position": 5
    }
  ]
}
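The same request can be issued from a script. The sketch below uses only the Python standard library and assumes an ES node listening on `http://localhost:9200` without authentication, which the article does not specify; the helper names are illustrative. It sends the body with POST, which the analyze API accepts in addition to GET.

```python
import json
import urllib.request

ES_URL = "http://localhost:9200"  # assumption: local node, no authentication

def build_analyze_request(analyzer, text, base_url=ES_URL):
    """Build the HTTP request for the _analyze API without sending it."""
    body = json.dumps({"analyzer": analyzer, "text": text}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/_analyze",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def extract_tokens(response):
    """Reduce an _analyze response to (token, start_offset, end_offset) tuples."""
    return [
        (t["token"], t["start_offset"], t["end_offset"])
        for t in response["tokens"]
    ]

def analyze(analyzer, text):
    """Send the request to the node and return the extracted tokens."""
    with urllib.request.urlopen(build_analyze_request(analyzer, text)) as resp:
        return extract_tokens(json.load(resp))

# First two tokens of the response shown above, used to try the parser offline.
sample = {
    "tokens": [
        {"token": "南京市", "start_offset": 0, "end_offset": 3,
         "type": "CN_WORD", "position": 0},
        {"token": "南京", "start_offset": 0, "end_offset": 2,
         "type": "CN_WORD", "position": 1},
    ]
}
print(extract_tokens(sample))  # [('南京市', 0, 3), ('南京', 0, 2)]
```

With a running node, `analyze("ik_max_word", "南京市长江大桥")` would return the tokens from the live response. The offsets make the overlap visible: 南京市 and 南京 both start at offset 0, which is what "finest granularity" means in practice.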

Summary

This article described the basic idea of the inverted index, showed a simplified version of its structure, and walked through the basic steps of analysis. Many analysis components are popular today and the open source community is very active, so you can choose one that fits the actual needs of your project and integrate it. Just pay attention to version compatibility.
