Inverted index
For details on the inverted index, refer to the official Elasticsearch documentation.
Elasticsearch uses a structure called an inverted index, which is designed for fast full-text search. An inverted index consists of a list of all the unique terms that appear in any document and, for each term, a list of the documents in which it appears.
For example, suppose we have two documents, and each document's content field contains the following:
- The quick brown fox jumped over the lazy dog
- Quick brown foxes leap over lazy dogs in summer
To create an inverted index, first split each document's content field into individual terms, create a sorted list of all unique terms, and then list the documents in which each term appears.
The result is as follows:
Term      Doc_1  Doc_2
-------------------------
Quick   |       |  X
The     |   X   |
brown   |   X   |  X
dog     |   X   |
dogs    |       |  X
fox     |   X   |
foxes   |       |  X
in      |       |  X
jumped  |   X   |
lazy    |   X   |  X
leap    |       |  X
over    |   X   |  X
quick   |   X   |
summer  |       |  X
the     |   X   |
-------------------------
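The steps above can be sketched in a few lines of Python. This is only an illustration of the idea, not how Elasticsearch implements it: the function name build_inverted_index is made up for this example, and tokenization is assumed to be a plain whitespace split with no lowercasing or stemming, which is why Quick and quick remain separate terms.

```python
from collections import defaultdict

# The two example documents, keyed by document ID.
docs = {
    "Doc_1": "The quick brown fox jumped over the lazy dog",
    "Doc_2": "Quick brown foxes leap over lazy dogs in summer",
}


def build_inverted_index(documents):
    """Map each unique term to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, content in documents.items():
        # Assumption: naive whitespace tokenization, no normalization.
        for term in content.split():
            index[term].add(doc_id)
    return index


inverted_index = build_inverted_index(docs)

# Print the terms in sorted order, one line per term with its documents.
for term in sorted(inverted_index):
    print(f"{term:<8} {sorted(inverted_index[term])}")
```

Sorting the keys places capitalized terms before lowercase ones, which matches the order of the table.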
Now, if we want to search for quick brown, we just need to find the documents in which each term appears:
Term      Doc_1  Doc_2
-------------------------
brown   |   X   |  X
quick   |   X   |
-------------------------
Total   |   2   |  1
Both documents match, but the first document contains more of the query terms than the second. With a naive similarity algorithm that simply counts the number of matching terms, the first document is a better match for the query than the second.
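The lookup and the term-counting score described above can be sketched as follows. The search function is a hypothetical helper written for this example, and counting matching terms is the naive scoring from the text, not the relevance scoring Elasticsearch actually uses.

```python
from collections import Counter

docs = {
    "Doc_1": "The quick brown fox jumped over the lazy dog",
    "Doc_2": "Quick brown foxes leap over lazy dogs in summer",
}

# Build the postings (term -> documents containing it) with a naive
# whitespace split, as assumed throughout this example.
index = {}
for doc_id, content in docs.items():
    for term in content.split():
        index.setdefault(term, set()).add(doc_id)


def search(query, index):
    """Return (doc_id, matching_term_count) pairs, best match first."""
    scores = Counter()
    for term in query.split():
        # Terms missing from the index simply contribute nothing.
        for doc_id in index.get(term, ()):
            scores[doc_id] += 1
    return scores.most_common()


results = search("quick brown", index)
print(results)  # [('Doc_1', 2), ('Doc_2', 1)]
```

Only the postings for the query terms are read, so the cost of a search depends on those terms rather than on the total number of documents.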
Columnar storage
For details on columnar storage, refer to the official Elasticsearch documentation.
In Elasticsearch, Doc Values is a columnar storage structure. By default, Doc Values is enabled for every field that supports it and is built at index time. When a field is indexed, Elasticsearch adds the field's value to the inverted index for fast retrieval, and it also stores the field's Doc Values in the index.
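Because Doc Values is on by default, a mapping only needs to mention it when turning it off. The sketch below builds such a mapping body as a Python dictionary. The field names status and session_id are hypothetical, and the example assumes session_id is never sorted or aggregated on; doc_values is the mapping parameter that controls this setting.

```python
import json

# Hypothetical mapping: "status" keeps the default (Doc Values enabled),
# while "session_id" opts out because we assume it is only ever searched,
# never sorted or aggregated on.
mapping = {
    "mappings": {
        "properties": {
            "status": {"type": "keyword"},
            "session_id": {"type": "keyword", "doc_values": False},
        }
    }
}

# The JSON request body that would be sent when creating the index.
print(json.dumps(mapping, indent=2))
```

A field with Doc Values disabled can still be searched through the inverted index, but it takes no space in the columnar structure.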
In Elasticsearch, Doc Values is often used in the following scenarios:
- Sorting on a field
- Aggregating on a field
- Certain filters, such as geolocation filters
- Certain scripts that refer to field values
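To see why a column layout suits sorting and aggregation, consider the conceptual sketch below. It is not Elasticsearch's on-disk format: the documents, the field names, and the use of a list position as a row number are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical documents; the list position serves as a row number.
documents = [
    {"title": "a", "price": 30, "category": "book"},
    {"title": "b", "price": 10, "category": "toy"},
    {"title": "c", "price": 20, "category": "book"},
]

# Column-oriented layout: one array per field, indexed by row number.
columns = {
    field: [doc[field] for doc in documents]
    for field in ("price", "category")
}

# Sort: order the rows by reading only the "price" column.
sorted_rows = sorted(range(len(documents)), key=lambda row: columns["price"][row])

# Aggregate: count rows per category by scanning only the "category" column.
category_counts = Counter(columns["category"])

print(sorted_rows)            # [1, 2, 0]
print(dict(category_counts))  # {'book': 2, 'toy': 1}
```

Both operations touch a single column and never load the other fields, which is the access pattern an inverted index (term to documents) cannot serve efficiently.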
Elasticsearch's columnar storage stores documents in the order they are written (not in the order of doc_id).