ElasticSearch implements full-text search

1. What is ElasticSearch?

ElasticSearch is a Lucene-based search server. It provides a distributed, multi-user-capable full-text search engine with a RESTful web interface. It is developed in Java and was originally released as open source under the terms of the Apache License, and it is one of the most popular enterprise search engines. It supports near-real-time search and is designed to be stable, reliable, fast, and easy to install and use, with defaults that require little configuration.

Let's talk about the basic concepts of ES first.

1. Index

ES stores data in one or more indexes, each of which is a collection of documents with similar characteristics. By analogy with a traditional relational database, an index is equivalent to a database in SQL, or to a data storage scheme (schema).

An index is identified by its name (which must be in all lowercase characters), and documents are created, searched, updated, and deleted by referencing this name. An ES cluster can create as many indexes as needed.
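As a rough illustration of how every document operation refers to the index by name, the sketch below builds the HTTP method and path for each operation. It assumes the typeless `/{index}/_doc/{id}` layout of recent Elasticsearch versions (older versions put a type name where `_doc` appears); the helper names `document_path` and `crud_requests` are hypothetical, and nothing here contacts a server.

```python
def document_path(index: str, doc_id: str) -> str:
    """Return the REST path for one document in an index.

    Assumes the typeless /{index}/_doc/{id} layout; older versions of
    Elasticsearch used a type name in place of _doc.
    """
    # Index names must be lowercase, so reject anything else early.
    if index != index.lower():
        raise ValueError(f"index name must be lowercase: {index!r}")
    return f"/{index}/_doc/{doc_id}"


def crud_requests(index: str, doc_id: str) -> dict:
    """Map each document operation to an (HTTP method, path) pair."""
    path = document_path(index, doc_id)
    return {
        "create": ("PUT", path),
        "read": ("GET", path),
        "update": ("POST", f"/{index}/_update/{doc_id}"),
        "delete": ("DELETE", path),
        "search": ("POST", f"/{index}/_search"),
    }


# "blog" and "1" are illustrative values only.
for operation, (method, path) in crud_requests("blog", "1").items():
    print(f"{operation:7s} {method:6s} {path}")
```

Each pair could be handed to any HTTP client; the point is only that the index name appears in every path.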

2. Type

Types are logical partitions (categories) within an index, and their meaning is entirely up to the user. One or more types can therefore be defined inside an index. In general, a type is defined for documents that share the same set of fields. Note that types were deprecated in Elasticsearch 6.x and removed in later releases, so this concept applies to older versions.

For example, in one index you can define a type to store user data, a type to store log data, and a type to store comment data. By analogy with a traditional relational database, types are equivalent to "tables".

3. Document

A document is the atomic unit of Lucene indexing and searching. It is a container of one or more fields and is represented in JSON format.

A document consists of one or more fields, each of which has a name and one or more values. Fields with multiple values are often referred to as "multi-valued fields". Each document can store a different set of fields, but documents of the same type should have some similarity.
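To make the idea concrete, here is a minimal sketch of a document as a Python dictionary serialized to JSON. The field names and values are invented for illustration; `tags` shows a multi-valued field.

```python
import json

# A hypothetical blog-post document; the field names are illustrative only.
document = {
    "title": "ElasticSearch implements full-text search",
    "author": "alice",
    "tags": ["search", "lucene", "elasticsearch"],  # a multi-valued field
    "views": 42,
}

# Serialize to the JSON text that would be sent as a request body.
body = json.dumps(document, indent=2)
print(body)
```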

4. Mapping

In ES, all documents are analyzed before being stored. A mapping defines how a document and its fields are stored and indexed: users can define how text is split into tokens, which tokens are filtered out, and which text needs additional processing.

In addition, ES provides further functionality, such as sorting on the contents of a field on demand. ES can also automatically determine the type of a field from its value.
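The automatic type detection can be pictured with the toy function below. It is a simplified sketch, not the real algorithm: actual dynamic mapping also detects dates in strings and adds a keyword sub-field to text, among other rules. The names `guess_field_type` and `guess_mapping` and the "unknown" placeholder are hypothetical.

```python
def guess_field_type(value) -> str:
    """Guess an Elasticsearch-style field type from a JSON value (simplified)."""
    # bool must be checked before int, because bool is a subclass of int.
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "float"
    if isinstance(value, str):
        return "text"
    if isinstance(value, dict):
        return "object"
    if isinstance(value, list):
        # An array takes the type of its first element; empty arrays stay undecided.
        return guess_field_type(value[0]) if value else "unknown"
    return "unknown"  # placeholder used by this sketch only


def guess_mapping(document: dict) -> dict:
    """Return a field-name to guessed-type table for one document."""
    return {name: guess_field_type(value) for name, value in document.items()}


sample = {"title": "hello", "views": 3, "score": 1.5, "published": True, "tags": ["a", "b"]}
print(guess_mapping(sample))
```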

5. Cluster

An ES cluster is a collection of one or more nodes that collectively store the entire dataset and provide federated indexing and search capabilities across all nodes.

A multi-node cluster has redundancy, which ensures the overall availability of services in the event of one or several node failures.

A cluster is identified by its unique name, which defaults to "elasticsearch". A node uses its configured cluster name to decide which ES cluster to join, and a node can belong to only one cluster.

If features such as redundancy are not considered, an ES cluster with only one node can implement all storage and search functions.

6. Node

A single running instance of ES is called a node. It is a member of a cluster and can store data and take part in the cluster's indexing and search operations.

Like a cluster, a node is identified by its name. In early versions of ES, the default was a random Marvel character name generated at startup.

Users can customize any name they wish to use, but for administrative purposes, this name should be as recognizable as possible.

A node determines which cluster it wants to join by using the ES cluster name configured for it.
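The two names discussed above correspond to the `cluster.name` and `node.name` settings in elasticsearch.yml. The sketch below only renders those two lines as text; the helper `render_node_config` and the example values are hypothetical.

```python
def render_node_config(cluster_name: str, node_name: str) -> str:
    """Render the two identity settings as lines for elasticsearch.yml."""
    settings = {
        "cluster.name": cluster_name,  # a node only joins a cluster with this name
        "node.name": node_name,        # a recognizable name for this node
    }
    return "\n".join(f"{key}: {value}" for key, value in settings.items())


# Illustrative values; pick names that are easy to recognize when administering.
print(render_node_config("my-search-cluster", "search-node-1"))
```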

7. Shard and Replica

The "shard" mechanism of ES distributes the data of an index across multiple nodes. It divides an index into multiple underlying physical Lucene indexes, which partition and store the index data. Each physical Lucene index is called a shard.

Each shard is itself a fully functional, independent index, so it can be stored on any host in the cluster. When creating an index, the user can specify the number of shards. The default is 5 in ES versions before 7.0 and 1 in later versions.

An ES cluster can be composed of multiple nodes, and each shard is stored on these nodes in a distributed manner.

ES can automatically move shards between nodes as needed, for example when nodes are added or fail. A replica is a copy of a shard kept on a different node. In short, sharding gives the cluster distributed storage, and replicas give it distributed processing and redundancy.
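How a document ends up on a particular shard can be sketched as a hash of its routing key (by default the document ID) taken modulo the number of primary shards. This is a simplified form of the documented routing rule. In the sketch, `zlib.crc32` is a stand-in for the hash function ES actually uses, `pick_shard` is a hypothetical helper, and the settings values are illustrative.

```python
import zlib


def pick_shard(routing_key: str, number_of_shards: int) -> int:
    """Map a routing key (by default the document ID) to a primary shard number."""
    # zlib.crc32 is a stand-in; Elasticsearch uses its own hash function.
    return zlib.crc32(routing_key.encode("utf-8")) % number_of_shards


# Index settings body with illustrative values: 3 primary shards, 1 replica each.
index_settings = {"settings": {"number_of_shards": 3, "number_of_replicas": 1}}

shards = index_settings["settings"]["number_of_shards"]
for doc_id in ["1", "2", "3", "4"]:
    print(doc_id, "->", "shard", pick_shard(doc_id, shards))
```

Because the shard number depends on the shard count, the number of primary shards cannot simply be changed after documents have been routed.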

With the basic concepts and principles of ES covered, how does ES implement full-text retrieval?

The first step is to choose a tokenizer. ES ships with many built-in tokenizers, which are described in the official documentation; it is worth understanding how the main ones work.

Chinese text is usually handled by third-party tokenizers such as IK, mmseg, and paoding, which may have been built for Lucene originally and later ported to ES. At present, we use the IK tokenizer with the latest version of ES.

Installing the IK tokenizer into Elasticsearch is simple. The package contains a plugin directory, analysis-ik, and a configuration directory, ik; copy them into the plugins and config directories of ES respectively.
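Once the plugin is installed, a text field can be told to use it. The sketch below builds two request bodies as plain dictionaries: a mapping that applies the IK analyzers `ik_max_word` and `ik_smart` to a hypothetical `content` field, and a body for the `_analyze` API to inspect how a string is segmented. It assumes the typeless mapping layout of recent ES versions (older versions nest the properties under a type name), and nothing is sent to a server.

```python
import json

# Mapping that analyzes a hypothetical "content" field with IK.
mapping = {
    "mappings": {
        "properties": {
            "content": {
                "type": "text",
                "analyzer": "ik_max_word",      # fine-grained segmentation at index time
                "search_analyzer": "ik_smart",  # coarser segmentation at query time
            }
        }
    }
}

# Body for the _analyze API, useful for checking how text is segmented.
analyze_request = {"analyzer": "ik_max_word", "text": "中华人民共和国"}

print(json.dumps(mapping, indent=2))
print(json.dumps(analyze_request, ensure_ascii=False))
```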

When you have a large amount of text, ES segments it into words and saves those words in the index. When you enter a keyword in a query, ES looks the query words up in the index to find the documents that contain them, which is how full-text search is achieved.
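The segment-then-look-up process described above is the idea behind an inverted index. The toy version below is a sketch of that idea only, not of how ES or Lucene is implemented: the tokenizer simply lowercases and splits on non-word characters, a query matches only documents that contain all of its words, and all function names are hypothetical.

```python
import re
from collections import defaultdict


def tokenize(text: str) -> list:
    """Lowercase the text and split it into word tokens (a toy analyzer)."""
    return re.findall(r"\w+", text.lower())


def build_inverted_index(docs: dict) -> dict:
    """Map each token to the set of IDs of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index


def search(index: dict, query: str) -> set:
    """Return the IDs of documents that contain every token of the query."""
    tokens = tokenize(query)
    if not tokens:
        return set()
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result


docs = {
    1: "Elasticsearch is a Lucene-based search server",
    2: "Lucene is a Java library",
    3: "Full-text search finds documents by their words",
}
index = build_inverted_index(docs)
print(search(index, "lucene"))         # {1, 2}
print(search(index, "search server"))  # {1}
```

The query is analyzed with the same tokenizer as the documents, which is why "Lucene-based" in document 1 is found by the query "lucene".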
