Machine learning basics - inverted indexes and search engines

This article was first published on the author's personal public account: TechFlow

In today's article we continue exploring search engines and look at their most important component: the inverted index.

Before introducing the inverted index, let's look at what an index is. The index is a database concept. Wikipedia describes it roughly as follows: a database index is a sorted data structure in a database management system that helps to quickly query and update the data in a database table. Put simply, an index is like the lookup table at the front of a dictionary. Suppose we want to find the word "index": using the lookup table, we can quickly locate where the words starting with the letter i begin. A database index works the same way, except that what we look up is data rather than the first letter of a word.

In the earlier article introducing search engines, we said that after a spider crawls the text of a web page, the search engine runs word segmentation on it and then stores the result. What it stores is not the complete document but the keyword information within the document. The number of pages a search engine holds is obviously enormous, so to keep lookups efficient we must use an index.

We call each page a document and assign it a document ID, then chain the keywords of the document into a list attached to that ID. The resulting data structure maps each document ID to the list of keywords in that document.

In this structure, we use a document ID to look up the keywords that the document contains. We first locate the document and then read off its keywords, which matches our everyday way of thinking, so it is regarded as a "forward query". The index with this structure is therefore called a forward index.
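As a minimal sketch of the forward index just described, the structure can be modelled as a mapping from document ID to keyword list. The documents, keywords, and the helper name `keywords_of` are made up for illustration and do not come from any real search engine.

```python
# Forward index: document ID -> keywords extracted from that document.
# All IDs and keywords here are hypothetical sample data.
forward_index = {
    1: ["Beijing", "University", "campus"],
    2: ["Beijing", "weather"],
    3: ["University", "ranking"],
}


def keywords_of(doc_id):
    """Forward query: given a document ID, return its keywords."""
    return forward_index.get(doc_id, [])


print(keywords_of(2))
```

Going from ID to keywords is a single lookup. Going the other way, from a keyword to the documents that contain it, is the case this structure does not serve well.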

But a forward index alone is not enough. Suppose a user searches for "Peking University"; after segmentation we get two keywords, "Beijing" and "University". We want to retrieve the matching documents from these two keywords, but at this point we do not know any document ID. This is a reverse query. In terms of the dictionary metaphor, we want to start from a word and find its position in the dictionary.

With only a forward index, we would have to traverse all the documents and pick out, one by one, those that contain both "Beijing" and "University". That is clearly not workable. To solve the problem we must build a reverse index that maps keywords to documents, so that we can quickly pick out the documents corresponding to a keyword.

This reverse index is the inverted index.
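To make the contrast concrete, here is a small sketch that answers the same reverse query twice: once by scanning a forward index, and once through an inverted index built from it. The sample data and the function names (`scan_forward`, `build_inverted`, `lookup`) are assumptions for illustration only.

```python
from collections import defaultdict

# Hypothetical forward index: document ID -> keywords.
forward_index = {
    1: ["Beijing", "University", "campus"],
    2: ["Beijing", "weather"],
    3: ["University", "ranking"],
}


def scan_forward(index, terms):
    """Reverse query without an inverted index: check every document."""
    return sorted(
        doc_id for doc_id, keywords in index.items()
        if all(term in keywords for term in terms)
    )


def build_inverted(index):
    """Invert the mapping: keyword -> set of document IDs containing it."""
    inverted = defaultdict(set)
    for doc_id, keywords in index.items():
        for keyword in keywords:
            inverted[keyword].add(doc_id)
    return inverted


def lookup(inverted, terms):
    """Reverse query with an inverted index: intersect the posting sets."""
    postings = [inverted.get(term, set()) for term in terms]
    return sorted(set.intersection(*postings)) if postings else []


inverted_index = build_inverted(forward_index)
print(scan_forward(forward_index, ["Beijing", "University"]))
print(lookup(inverted_index, ["Beijing", "University"]))
```

Both calls return the same documents, but the scan touches every document, while the lookup only touches the posting sets of the two keywords.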

Once we have an inverted index, the rest is much easier. Given a keyword, we can easily recall every document that contains it, then use an appropriate algorithm to compute the relevance between the keyword and each document and filter by that relevance. This is the relevance filtering mentioned in the earlier introduction to search engines.

The idea behind the inverted index is easy to understand, but a real implementation is more complicated and involves many optimizations. Here we cover just one of them, the most widely used.
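The recall-then-rank flow can be sketched as follows. The article does not say which relevance algorithm is used, so this example swaps in plain term-frequency counting as a stand-in; whitespace tokenization, the sample documents, and the names `tokenize` and `search` are likewise assumptions.

```python
from collections import Counter, defaultdict

# Hypothetical documents; a real engine would use proper word segmentation.
docs = {
    1: "Beijing University is a university in Beijing",
    2: "Beijing weather today",
    3: "University ranking list",
}


def tokenize(text):
    """Naive tokenizer: lowercase and split on whitespace."""
    return text.lower().split()


# Inverted index with counts: term -> {doc_id: occurrences in that document}.
inverted = defaultdict(dict)
for doc_id, text in docs.items():
    for term, count in Counter(tokenize(text)).items():
        inverted[term][doc_id] = count


def search(query):
    """Recall every document containing any query term, then rank by score."""
    scores = Counter()
    for term in tokenize(query):
        for doc_id, count in inverted.get(term, {}).items():
            scores[doc_id] += count  # stand-in relevance: summed term frequency
    # Highest score first; ties broken by document ID for a stable order.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


print(search("Beijing University"))
```

A production system would replace the scoring line with a real relevance model; the point here is only that recall comes from the inverted index and ranking is a separate step applied to the recalled documents.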

ElasticSearch Optimization

No discussion of the inverted index is complete without ElasticSearch. ElasticSearch is arguably the most widely used open-source search engine in the world. From Wikipedia and GitHub to Baidu and Tencent, as well as countless small and medium-sized companies, ElasticSearch is everywhere. It combines search, full-text retrieval, structured analytics, and other functions in a single system, and it also offers simple configuration and strong performance.

As a distributed search engine, ElasticSearch is cleverly designed, and together with the inherent complexity of distributed systems there is a great deal in it worth studying. Since today's topic is the inverted index, we will only talk briefly about how it optimizes the inverted index.

As described above, we build an inverted index on keywords in order to search by keyword. Logically there is nothing wrong with this, but in practice the problems are not small. The biggest one is that there are far too many keywords and they are unordered, so to find one of them we can only traverse the whole keyword table. That is clearly unacceptable, so we need to optimize.

The simplest optimization is to sort the keywords, which gives us a sorted dictionary. With the dictionary in place, we can find a keyword quickly by binary search. This looks perfect, but there is still a problem.
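A sorted dictionary with binary search might look like the sketch below, which uses Python's standard `bisect` module. The term list and the function name `find_term` are illustrative assumptions.

```python
from bisect import bisect_left

# Hypothetical keyword dictionary, sorted once up front.
terms = sorted(["university", "beijing", "index", "search", "engine", "access"])


def find_term(sorted_terms, term):
    """Binary search: return the position of term, or -1 if it is absent."""
    pos = bisect_left(sorted_terms, term)
    if pos < len(sorted_terms) and sorted_terms[pos] == term:
        return pos
    return -1


print(find_term(terms, "index"))
print(find_term(terms, "zebra"))
```

Each lookup inspects about log2(N) entries instead of all N, which is the gain sorting buys over the unordered keyword table.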

The time complexity is not the problem: O(log N) is acceptable. What is not acceptable is the disk access. Because the dictionary is so large, every lookup has to go to disk, and each random disk read costs about 10 ms. For a high-performance system that is too slow.
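A rough back-of-the-envelope calculation shows why this matters. It assumes the worst case, in which every binary-search probe is a separate random disk read at the 10 ms figure quoted above, with no caching; the dictionary size of 100 million terms and the function name `lookup_latency_ms` are made up for illustration.

```python
import math


def lookup_latency_ms(num_terms, seek_ms=10.0):
    """Worst-case latency if every binary-search probe is one random read."""
    probes = math.ceil(math.log2(num_terms))
    return probes * seek_ms


# Hypothetical dictionary of 100 million terms: 27 probes at 10 ms each.
print(lookup_latency_ms(100_000_000))  # 270.0
```

Under these assumptions a single keyword lookup takes about a quarter of a second, even though the number of probes is small. The cost comes from the random reads, not from the comparisons.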

To optimize further, we have to reduce random disk reads.

The best way to reduce disk use is to put the data in memory. But as we said, the dictionary is too large and will probably not fit in memory. So we build another simple index on top of it, for example:

Keywords starting with the letter A are stored on page x
Keywords starting with the letter B are stored on page y
......

This is in fact how a dictionary works, and if all the keywords were English it would be fine. But a search engine does not only support English: keywords may come in many languages and may even be symbols. And even in English, the number of words under each letter differs. Words beginning with s are especially numerous, for example, while very few begin with z. So this simple scheme does not necessarily improve efficiency.

To solve this problem we introduce a data structure: the prefix tree (trie).
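The first-letter scheme and its imbalance can be sketched in a few lines. The word list is invented to exaggerate the skew, and `first_letter_index` is a hypothetical helper; positions in the sorted list stand in for the "pages" mentioned above.

```python
from collections import Counter

# Hypothetical sorted keyword dictionary, deliberately heavy on the letter s.
terms = sorted(["sand", "search", "sort", "spider", "store", "zoo", "access", "ada"])


def first_letter_index(sorted_terms):
    """Map each first letter to the position of its first keyword."""
    index = {}
    for pos, term in enumerate(sorted_terms):
        index.setdefault(term[0], pos)  # keep only the first position seen
    return index


index = first_letter_index(terms)
bucket_sizes = Counter(term[0] for term in terms)

print(index)
print(bucket_sizes)
```

With this data the s bucket holds five of the eight keywords while z holds one, so a lookup that lands in s still has to search most of the dictionary.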

A prefix tree looks roughly like this:

The principle of the prefix tree is not hard: strings that share a prefix are mapped onto the same branch of the tree. The end of each branch records the location of the content corresponding to that prefix. In other words, the originally flat index is mapped onto a tree held in memory. The prefix tree does not store the whole index, though, only the prefixes of some of the keywords. From a prefix we find a position in the dictionary and then scan sequentially from that position onward, which avoids excessive random reads on the hard drive and saves time.

Take the following figure as an example:

Suppose the keyword we are looking for is "Access". Through the prefix tree we first find the dictionary position corresponding to the prefix A, which is the position of Ada in the figure. We then scan forward in order from Ada until we reach Access. By constructing the prefix tree sensibly, we can control the cost of traversing the dictionary and thus achieve the optimization.
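The prefix-tree lookup can be sketched as below: a small trie maps a few prefixes to positions in a sorted dictionary, and a lookup walks the trie and then scans forward from the recorded position. This is a simplified illustration under stated assumptions, not how ElasticSearch implements its term index. The term list, the chosen prefixes, and the names `build_trie`, `locate`, and `find` are all hypothetical, and the dictionary is an in-memory list standing in for data on disk.

```python
# Hypothetical sorted term dictionary (would live on disk in a real system).
terms = ["ab", "abandon", "access", "ada", "adam", "bat", "bob"]

# Selected prefixes -> position of the first term that starts with the prefix.
prefix_positions = {"a": 0, "ad": 3, "b": 5}


def build_trie(positions):
    """Build a nested-dict trie; a node may carry 'pos' and 'children'."""
    root = {}
    for prefix, pos in positions.items():
        node = root
        for ch in prefix:
            node = node.setdefault("children", {}).setdefault(ch, {})
        node["pos"] = pos
    return root


def locate(root, term):
    """Follow term through the trie; return the deepest recorded position."""
    node, best = root, 0
    for ch in term:
        node = node.get("children", {}).get(ch)
        if node is None:
            break
        best = node.get("pos", best)
    return best


def find(sorted_terms, root, term):
    """Jump to the position given by the trie, then scan sequentially."""
    pos = locate(root, term)
    while pos < len(sorted_terms) and sorted_terms[pos] <= term:
        if sorted_terms[pos] == term:
            return pos
        pos += 1
    return -1


trie = build_trie(prefix_positions)
print(find(terms, trie, "access"))
print(find(terms, trie, "adam"))
```

Choosing which prefixes to store is the tuning knob: more prefixes mean a larger in-memory tree but shorter sequential scans, which is one way to read the remark about building the prefix tree sensibly.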

Beyond this, ElasticSearch also makes some optimizations to querying and merging combined indexes. Because these involve a number of other techniques and data structures, and space is limited, we will not describe them here and will share them in a later article.

If you found this useful, feel free to follow ~
