A Brief Introduction to the Elasticsearch Inverted Index

Abstract

  • What is an inverted index, and why use one?
  • How does an inverted index work?

1. What is an inverted index?

Suppose there is a dating site with profile entries like the following:

Woman 1: "I'm looking for a guy in Shanghai who works with PHP."

To find a match, we need to query the gender, city, and language columns.

Woman 2: "I'm looking for a guy in Beijing who loves traveling, loves food, and works with Java."

This request is more complex, and real-world scenarios involve even more complicated combinations of criteria.

Ordinary relational database indexes struggle with this type of search. Full-text search calls for an inverted index.

An inverted index is a form of database index that stores a "content -> document" mapping, with the aim of making full-text search fast.

2. How does an inverted index work?

It involves two processes:

  • Creating the inverted index
  • Searching the inverted index

2.1 Creating the inverted index

For example, there are two documents:

  • Document#1

Recipe of pasta with sauce pesto

  • Document#2

Recipe of delicious carbonara pasta

First, split each document into tokens (that is, words), and then store the correspondence between each token and the documents that contain it.

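The results are as follows (assuming each word is lowercased and every word, including "of", is kept):

Token       Documents
recipe      #1, #2
of          #1, #2
pasta       #1, #2
with        #1
sauce       #1
pesto       #1
delicious   #2
carbonara   #2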
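The indexing step described above can be sketched in Python. This is a minimal illustration, not how Elasticsearch is implemented: the `tokenize` and `build_inverted_index` helpers are hypothetical names, and the tokenizer simply lowercases the text and splits on whitespace.

```python
from collections import defaultdict

# The two example documents from the text, keyed by document number.
documents = {
    1: "Recipe of pasta with sauce pesto",
    2: "Recipe of delicious carbonara pasta",
}


def tokenize(text):
    """Lowercase the text and split on whitespace (an assumed, simplistic analyzer)."""
    return text.lower().split()


def build_inverted_index(docs):
    """Map each token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return dict(index)


inverted_index = build_inverted_index(documents)
for token, doc_ids in inverted_index.items():
    print(f"{token:<10} -> {sorted(doc_ids)}")
```

Each token becomes a key, and looking up a token returns its documents directly, without scanning the text of every document.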

2.2 Searching the inverted index

Search examples:

  • Search for "pasta recipe"

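First, tokenize the query, which yields two tokens ("pasta" and "recipe").

Then match those tokens against the inverted index.

Both tokens match in both documents, so both documents are returned, with the same score.

  • Search for "carbonara pasta"

Again, both documents match, so both are returned.

This time document #2 scores higher than document #1,

because #2 matches two tokens ("carbonara" and "pasta"), while #1 matches only one ("pasta").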
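The two searches above can be reproduced with a short Python sketch. It assumes the simplified scoring used in the text, where a document's score is the number of distinct query tokens it matches. Elasticsearch's actual relevance scoring is more elaborate, so treat the `search` function here as a hypothetical illustration only.

```python
from collections import Counter

# Inverted index for the two example documents (token -> document ids).
inverted_index = {
    "recipe": {1, 2},
    "of": {1, 2},
    "pasta": {1, 2},
    "with": {1},
    "sauce": {1},
    "pesto": {1},
    "delicious": {2},
    "carbonara": {2},
}


def search(query, index):
    """Return (doc_id, score) pairs; score = number of distinct query tokens matched."""
    scores = Counter()
    for token in set(query.lower().split()):
        for doc_id in index.get(token, ()):
            scores[doc_id] += 1
    # Highest score first; ties are broken by document id for a stable order.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


print(search("Pasta Recipe", inverted_index))     # [(1, 2), (2, 2)]
print(search("carbonara pasta", inverted_index))  # [(2, 2), (1, 1)]
```

The first query scores both documents equally, while the second ranks document #2 ahead of document #1, matching the walkthrough.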

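2.3 Transformations

Sometimes we can apply transformations to the tokens before storing and searching them. The most common are:

  • Dropping stop words

Stop words are words that are used very frequently but carry little meaning.

In English, examples include "of", "the", and "for".

  • Lemmatization

Reducing a word to its standard dictionary form, for example:

"running" => "run"

"walks" => "walk"

"thought" => "think"

  • Stemming

The process of converting a word to its root form by cutting off the ending.

It cannot handle irregular verbs, but it can handle words that are not in the dictionary.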
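The three transformations can be compared side by side in a small Python sketch. The word lists and the suffix rules below are hypothetical and deliberately tiny; real analyzers rely on much larger dictionaries and more careful stemming rules.

```python
# Hypothetical, deliberately tiny word lists; real analyzers use much larger ones.
STOP_WORDS = {"of", "the", "for"}
LEMMAS = {"running": "run", "walks": "walk", "thought": "think"}
SUFFIXES = ("ing", "ed", "s")


def remove_stop_words(tokens):
    """Drop tokens that appear in the stop word list."""
    return [t for t in tokens if t not in STOP_WORDS]


def lemmatize(token):
    """Look the token up in a dictionary; unknown words are returned unchanged."""
    return LEMMAS.get(token, token)


def stem(token):
    """Cut off the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


print(remove_stop_words(["recipe", "of", "pasta"]))  # ['recipe', 'pasta']
print(lemmatize("thought"), stem("thought"))         # think thought
print(lemmatize("googling"), stem("googling"))       # googling googl
print(lemmatize("walks"), stem("walks"))             # walk walk
```

The dictionary lookup handles the irregular verb "thought" but leaves the unknown word "googling" untouched, while the suffix-cutting stemmer does the opposite, which is the trade-off described above.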



Origin blog.csdn.net/duysh/article/details/104047728