Abstract
- What is an inverted index, and why use one?
- How does an inverted index work?
1. What is an inverted index?
Suppose there is a dating site with the information table below:
Woman 1: "I'm looking for a guy in Shanghai who does PHP."
We need to match the gender, city, and language columns.
Woman 2: "I'm looking for a Java guy in Beijing who loves traveling and loves food."
This one is more complex, and real-world scenarios involve even more complicated permutations and combinations.
For this type of search, relational database indexes struggle to cope; an inverted index, as used in full-text search, is a better fit.
An inverted index is a form of database index that stores the "content -> document" mapping, with the aim of fast full-text search.
2. How does an inverted index work?
It includes two processes:
- Creating the inverted index
- Searching the inverted index
2.1 Creating the inverted index
For example, there are two documents:
- Document#1
“Recipe of pasta with sauce pesto”
- Document#2
“Recipe of delicious carbonara pasta”
First, split each document into tokens (that is, words), and then save the correspondence between each token and the documents that contain it.
The results are as follows:
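The construction step can be sketched in a few lines of Python. This is only a minimal illustration, not how any particular search engine does it: the `tokenize` and `build_index` names are made up for this example, and the tokenizer simply lowercases the text and splits on whitespace.

```python
from collections import defaultdict

# The two example documents, keyed by document number.
documents = {
    1: "Recipe of pasta with sauce pesto",
    2: "Recipe of delicious carbonara pasta",
}

def tokenize(text):
    """Lowercase the text and split on whitespace (a deliberately simple tokenizer)."""
    return text.lower().split()

def build_index(docs):
    """Map each token to the set of document numbers that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index

index = build_index(documents)

# Print the "token -> documents" mapping in alphabetical order.
for token in sorted(index):
    print(token, "->", sorted(index[token]))
```

With this sketch, tokens shared by both documents (such as "pasta" and "recipe") point to both document numbers, while tokens such as "pesto" or "carbonara" point to only one.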
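2.2 Searching the inverted index
Search example:
- Search "pasta recipe"
First tokenize the query, which yields 2 tokens (“pasta”, “recipe”).
Then match them against the inverted index.
Both tokens match in both documents, so both documents are returned, with the same score.
- Search “carbonara pasta”
Again, both documents match and both are returned.
This time document#2 scores higher than document#1.
That is because #2 matches 2 tokens (“carbonara”, “pasta”), while #1 matches only one (“pasta”).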
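The two searches above can be reproduced with a small Python sketch. It assumes the simplest possible scoring, where a document's score is the number of distinct query tokens it contains; that matches the reasoning in the examples, but the `search` function and this scoring rule are illustrative assumptions, and real engines typically use more elaborate ranking.

```python
from collections import Counter, defaultdict

documents = {
    1: "Recipe of pasta with sauce pesto",
    2: "Recipe of delicious carbonara pasta",
}

def tokenize(text):
    """Lowercase the text and split on whitespace."""
    return text.lower().split()

# Build the inverted index: token -> set of document numbers.
index = defaultdict(set)
for doc_id, text in documents.items():
    for token in tokenize(text):
        index[token].add(doc_id)

def search(query):
    """Return (doc_id, score) pairs, best score first.

    The score is the number of distinct query tokens found in the document
    (an assumption made for this illustration).
    """
    scores = Counter()
    for token in set(tokenize(query)):
        for doc_id in index.get(token, ()):
            scores[doc_id] += 1
    # Sort by descending score, then by document number for stable output.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))

print(search("Pasta Recipe"))     # both documents, equal scores
print(search("carbonara pasta"))  # document 2 ranks above document 1
```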
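2.3 Transformations
Sometimes we can apply transformations to the tokens before storing and searching. The most common ones are:
- Dropping stop words
Stop words are words that are used very frequently but carry little meaning.
For example, in English: “of”, “the”, “for” ...
- Lemmatization
Reduces a word to its standard dictionary form, for example:
“running” => “run”
“walks” => “walk”
“thought” => “think”
- Stemming
The process of converting a word to its root form by cutting off the end of the word.
It cannot handle irregular verbs, but it can handle words that are not in the dictionary.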
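The three transformations can be compared side by side with a toy Python sketch. Everything here is a simplifying assumption made for illustration: the stop-word set contains only the three words mentioned above, `LEMMAS` is a hand-written lookup table standing in for a real dictionary, and `stem` strips a few hard-coded suffixes rather than implementing a real stemming algorithm.

```python
# Hypothetical, hand-picked data for illustration only.
STOP_WORDS = {"of", "the", "for"}
LEMMAS = {"running": "run", "walks": "walk", "thought": "think"}
SUFFIXES = ("ing", "ed", "s")

def remove_stop_words(tokens):
    """Drop tokens that appear in the stop-word set."""
    return [t for t in tokens if t not in STOP_WORDS]

def lemmatize(token):
    """Look the token up in the dictionary; unknown words are left unchanged."""
    return LEMMAS.get(token, token)

def stem(token):
    """Cut off the first matching suffix, keeping at least 3 characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

print(remove_stop_words(["recipe", "of", "pasta"]))  # stop word "of" is dropped
print(lemmatize("thought"), stem("thought"))         # dictionary lookup vs. suffix cutting
print(lemmatize("googling"), stem("googling"))       # a word missing from the dictionary
```

The irregular verb “thought” is resolved only by the dictionary lookup, while the made-up word “googling” is shortened only by the suffix cutter, which mirrors the trade-off described above. The suffix cutter is also crude: it turns “running” into “runn”.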