Basic concepts of Lucene and Elasticsearch

Basic concepts of Lucene

1. Index (Index)

An index corresponds to one inverted index and is the basic unit of retrieval. In Lucene, an index corresponds to a directory on disk.

2. Segment (Segment)

An index can contain multiple segments, and the segments are independent of one another. Adding new documents can produce a new segment, and different segments can be merged. The segment is the unit in which index data is stored.

3. Document (Document)

The document is the basic unit that we index. Different documents are stored in different segments, and one segment can contain multiple documents.

A newly added document is stored in a newly created segment. As segments are merged, documents from different segments end up in the same segment.

4. Field (Field)

A document contains different kinds of information that can be indexed separately, such as title, time, body text, and author. Each kind is stored in its own field.

Different fields can be indexed in different ways.

Term (Term)

The term is the smallest unit of the index: a string that has gone through lexical analysis and language processing.

The same word in two different fields is treated as two different terms. In other words, a term is the combination of the word text and the field name.

Term vector

Also called a document vector. It consists of the terms in a text together with their frequencies.
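
As a rough illustration of a term vector, the sketch below counts term frequencies for one piece of text. The function name term_vector and the regex-based tokenizer are assumptions made for this example; Lucene's real analysis chain is far more configurable.

```python
import re
from collections import Counter


def term_vector(text):
    """Return a {term: frequency} mapping for one piece of text."""
    # Stand-in for real analysis: lowercase, then keep runs of letters/digits.
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(tokens)


vector = term_vector("Lucene is a search library. Lucene is fast.")
for term, freq in sorted(vector.items()):
    print(term, freq)
```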

Semantic tree

An intermediate result of query processing. When a search is run, the query is first turned into a semantic tree, and the search is then executed from that tree.

Weights

The main indicator used when computing scores. A weight expresses how important a term is within a document, so talking about the weight of a term apart from any document is meaningless.

Term Frequency (tf): how many times the term appears in the document. The larger tf is, the more important the term is to that document.

Document Frequency (df): how many documents contain the term. The larger df is, the less important the term is, because a term that appears in most documents does little to tell them apart.
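
To make the two indicators concrete, here is a minimal sketch of the classic tf-idf weight, tf * log(N / df), where N is the number of documents. This textbook formula is an assumption chosen for illustration; it is not Lucene's actual scoring function, and the helper names are hypothetical.

```python
import math


def tf(term, doc_tokens):
    """Term frequency: occurrences of the term in one document."""
    return doc_tokens.count(term)


def df(term, corpus):
    """Document frequency: number of documents containing the term."""
    return sum(1 for doc_tokens in corpus if term in doc_tokens)


def weight(term, doc_tokens, corpus):
    """Textbook tf-idf: high tf raises the weight, high df lowers it."""
    doc_freq = df(term, corpus)
    if doc_freq == 0:
        return 0.0
    return tf(term, doc_tokens) * math.log(len(corpus) / doc_freq)


corpus = [
    ["lucene", "index", "segment"],
    ["lucene", "search"],
    ["lucene", "lucene", "shard"],
]
# "lucene" occurs in every document, so it carries no weight here;
# "shard" occurs in only one document, so it does.
print(weight("lucene", corpus[2], corpus))
print(weight("shard", corpus[2], corpus))
```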

Posting

The index entry that links a term to one document (generally represented by the document number) is called a posting. The list of such entries for all the documents containing the term is called a posting list.

Payload

Metadata attached to an index entry, also called the payload. Lucene lets the user add payloads while indexing and read them back when handling search results. Payloads give users a flexible, advanced indexing mechanism and make a richer search experience possible.

Inverted index (Inverted Index)

The inverted index is the data structure Lucene uses for indexing. It is organized around terms, so the documents that contain a given term can be found quickly. The posting list is one part of this structure. Lucene's data files together make up one large inverted index, but the inverted index is a logical structure and not one specific file on disk.
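
The structure can be sketched in a few lines: each term, taken as a (field, word) pair, maps to a posting list of document numbers. This is a toy in-memory model built on assumptions (whitespace tokenization, a hypothetical build_inverted_index helper); it does not reflect Lucene's on-disk format.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map (field, word) terms to posting lists of document numbers."""
    index = defaultdict(list)
    for doc_number, doc in enumerate(docs):
        for field, text in doc.items():
            # set() so each document appears once per term.
            for word in set(text.lower().split()):
                index[(field, word)].append(doc_number)
    return index


docs = [
    {"title": "lucene basics", "body": "an index has segments"},
    {"title": "search basics", "body": "lucene builds an inverted index"},
]
index = build_inverted_index(docs)

# The same word in different fields is a different term.
print(index[("title", "lucene")])  # [0]
print(index[("body", "lucene")])   # [1]
print(index[("body", "index")])    # [0, 1]
```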

Document Number:

Lucene refers to documents internally by document number. The number is unique within a segment: the first document in a segment is numbered 0, and the numbers increase from there. The number is meant for Lucene's internal use only, and it changes when segments are merged. To use document numbers outside a single segment, they must be remapped so that each number identifies exactly one document across the larger scope. One way to do this is to add a per-segment base to the number within the segment. For example, with two segments of five documents each, a document in the first segment gets 0 + its number within the segment, and a document in the second segment gets 5 + its number within the segment.
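
The remapping in the two-segment example can be written out as follows. This is only a sketch of the arithmetic described above; segment_bases and global_doc_number are hypothetical names, not Lucene APIs.

```python
from itertools import accumulate


def segment_bases(segment_sizes):
    """Base number of each segment: the count of documents in earlier segments."""
    return [0] + list(accumulate(segment_sizes))[:-1]


def global_doc_number(segment_sizes, segment, local_number):
    """Combine a segment's base with the document number inside that segment."""
    if not 0 <= local_number < segment_sizes[segment]:
        raise IndexError("local document number out of range")
    return segment_bases(segment_sizes)[segment] + local_number


sizes = [5, 5]  # two segments with five documents each
print(segment_bases(sizes))            # [0, 5]
print(global_doc_number(sizes, 0, 2))  # 2
print(global_doc_number(sizes, 1, 2))  # 7
```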

Basic concepts of Elasticsearch (ES):

Index (Index)

Elasticsearch stores data in one or more indexes. Compared with the relational database model, an index plays roughly the role of a database instance. The index is the basic unit for storing, reading, and writing data. Internally, Elasticsearch uses Apache Lucene to read and write index data. What Elasticsearch treats as a single index may correspond to more than one Lucene index, because in a distributed system Elasticsearch uses shards and replicas to store one index in several parts.

Document (Document)

In Elasticsearch, the document is the main entity. Every use of Elasticsearch can ultimately be modeled as one retrieval task: retrieve the relevant documents. A document consists of one or more fields, and each field consists of a field name and one or more values. Each document may have a different set of fields; that is, documents have no fixed schema or uniform structure. Documents can still share a similar structure with one another (as Lucene documents do). From the client's point of view, a document is a JSON object.
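
For illustration, the two documents below could live side by side even though their fields differ, and one field holds several values. The field names and values are invented for this example.

```python
import json

docs = [
    # A field may hold more than one value ("tags").
    {"title": "Lucene basics", "author": "alice", "tags": ["search", "index"]},
    # A different document may carry a different set of fields.
    {"title": "Cluster notes", "published": "2017-07-20"},
]

for doc in docs:
    print(json.dumps(doc))

shared_fields = set(docs[0]) & set(docs[1])
print(shared_fields)  # only "title" appears in both documents
```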

Mapping (Mapping)

All documents go through an analysis process before they are stored. The user can configure how input text is split into tokens, which tokens are filtered out, and what other processing is applied, such as stripping HTML tags. Elasticsearch also provides various other features, such as sorting information. The mapping is where all of this configuration is saved. Although Elasticsearch can detect the type of a field automatically from its value, in production applications you should configure the mapping yourself.
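
As a hedged sketch, a request body for creating an index with such a configuration might look like the dictionary below. It assumes the typeless mapping format of recent Elasticsearch versions (older versions nest the properties under a type name); the analyzer name html_text and the field names are hypothetical, while html_strip, standard, lowercase, and stop are built-in analysis components.

```python
import json

index_body = {
    "settings": {
        "analysis": {
            "analyzer": {
                # Hypothetical custom analyzer: strip HTML, tokenize, filter.
                "html_text": {
                    "type": "custom",
                    "char_filter": ["html_strip"],
                    "tokenizer": "standard",
                    "filter": ["lowercase", "stop"],
                }
            }
        }
    },
    "mappings": {
        "properties": {
            "title": {"type": "text", "analyzer": "html_text"},
            "author": {"type": "keyword"},
            "published": {"type": "date"},
        }
    },
}

# This JSON would be sent as the body of the index-creation request.
print(json.dumps(index_body, indent=2))
```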

Document type (Type):

Every document in Elasticsearch must specify its type. Document types allow documents with different structures to be stored in the same index; the mapping for a document is found from its type, which makes the documents easy to access.

Node (Node):

A single Elasticsearch server instance is called a node. For many use cases, deploying a single Elasticsearch node is sufficient. But to cope with data volumes beyond one machine and to provide fault tolerance, configuring multiple nodes as an Elasticsearch cluster is the wise choice.

Cluster (Cluster)

An Elasticsearch cluster is a collection of nodes. Together, these nodes handle search and data storage demands that a single node cannot. A cluster also copes with the case where some machines (nodes) cannot provide service because of an outage or an upgrade. Elasticsearch makes the nodes of a cluster work together almost seamlessly (that is, seen from outside, the cluster is one whole, and adding or removing a node is transparent to the user). Setting up a cluster in Elasticsearch is very simple, which in our view is its greatest advantage over competing products.

Shard (Shard)

As already mentioned, a cluster can store more data than a single machine can hold. To achieve this, Elasticsearch distributes the data across Lucene indexes stored on multiple physical machines. Each of these Lucene indexes is called a shard, and the distribution process is called sharding. In an Elasticsearch cluster, sharding is done automatically, and all the shards are presented to the user as one whole. The parameters do have to be set in advance, however, because the number of shards must be configured before the index is created and cannot be modified once the index is in service.
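
One way to see why the shard count is fixed up front: a document is placed on a shard chosen roughly as hash(routing value) modulo the number of primary shards, so changing the count would send existing documents to different shards. The sketch below uses zlib.crc32 as a stand-in hash, which is an assumption made for illustration and not the hash Elasticsearch uses; shard_for is a hypothetical helper.

```python
import zlib


def shard_for(doc_id, num_shards):
    """Pick a shard from a stable hash of the document id."""
    return zlib.crc32(doc_id.encode("utf-8")) % num_shards


doc_ids = [f"doc-{n}" for n in range(10)]

placement = {doc_id: shard_for(doc_id, 3) for doc_id in doc_ids}
print(placement)

# With a different shard count, the same ids may map to other shards,
# so lookups by id would go to the wrong place without reindexing.
moved = [d for d in doc_ids if shard_for(d, 3) != shard_for(d, 4)]
print(len(moved), "of", len(doc_ids), "documents would move")
```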

Replica (Replica)

Sharding lets an Elasticsearch cluster hold more data than a single machine can, and a client can read and write data through any node in the cluster. When the cluster load grows and user search requests pile up on a single node, replicas solve the problem. The idea behind replication is simple: create a copy of a shard that can handle user search requests just like the original primary shard. Replicas also keep the data safe: if the data on a primary shard is lost, Elasticsearch uses a replica so that nothing is lost. Replicas can be added or removed at any time, so the user can adjust their number dynamically when needed.

Gateway (Gateway)

While it is running, Elasticsearch collects information such as the cluster state and index parameters. This data is stored in the gateway.

Origin blog.csdn.net/baibaichenchen/article/details/75545381