A thorough understanding of Elasticsearch



This article shares the basics of Elasticsearch: what it does, how to use it, and how it works.

1. Data in life

Search engines retrieve data, so let's start with the data in our lives. It generally falls into two types:

  • structured data
  • unstructured data

Structured data: also called row data, is data logically expressed and stored in a two-dimensional table structure. It strictly follows data format and length specifications and is mainly stored and managed by relational databases. It refers to data with a fixed format or limited length, such as database records and metadata.

Unstructured data: also called full-text data, has variable length or no fixed format and is not suitable for representation in two-dimensional database tables. It includes office documents of all formats, XML, HTML, Word documents, emails, all kinds of reports, pictures, audio, video, and so on.

Note: to distinguish more carefully, XML and HTML can be classified as semi-structured data. Because they have their own specific tag formats, they can be processed as structured data when needed, or have their plain text extracted and processed as unstructured data.

According to the two data classifications, the search is also divided into two types:

  • structured data search
  • unstructured data search

For structured data, since it has a definite structure, we can generally store and search it in the two-dimensional tables (Table) of relational databases (MySQL, Oracle, etc.), and we can also build indexes on it.

For unstructured data, i.e. full-text data, there are two main search methods:

  • sequential scan
  • full-text search

Sequential scanning: as the name suggests, it searches by scanning the text sequentially for the specific keywords.

For example, given a newspaper, you are asked to find every place the words "Ping An" appear. You would have to scan the newspaper from beginning to end and mark in which sections and at which positions the keyword appears.

This method is undoubtedly the most time-consuming and least efficient. If the newspaper's type is small and there are many sections, or even several newspapers, your eyes will be just about worn out by the time you finish scanning.

Full-text search: sequential scanning of unstructured data is slow, so can we optimize it? Can we find a way to give our unstructured data some structure?

Extract part of the information in the unstructured data, reorganize it to make it have a certain structure, and then search the data with a certain structure, so as to achieve the purpose of relatively fast search.

This approach forms the basic idea of full-text retrieval. The information extracted from unstructured data and then reorganized is called the index.

The main cost of this approach is building the index up front, but once built it makes later searches fast and efficient.

2. A Brief Look at Lucene

With this brief look at the kinds of data in life, we know that SQL retrieval in relational databases cannot handle this kind of unstructured data.

Processing this kind of unstructured data relies on full-text retrieval, and the best open-source full-text search toolkit available today is Apache Lucene.

But Lucene is only a toolkit, not a complete full-text search engine. Its purpose is to provide software developers with an easy-to-use toolkit for adding full-text search to a target system, or as a foundation for building a complete full-text search engine.

At present, the open source and available full-text search engines based on Lucene are mainly Solr and Elasticsearch.

Both Solr and Elasticsearch are relatively mature full-text search engines, and their functions and performance are basically the same.

However, ES itself has the characteristics of distribution and easy installation and use, while the distribution of Solr needs to be realized by a third party, such as using ZooKeeper to achieve distributed coordination management.

Both Solr and Elasticsearch rely on Lucene underneath, and Lucene can provide full-text search mainly because it implements the inverted index structure.

How to understand the inverted index? If there are three existing data files, the contents of the files are as follows:

  • Java is the best programming language.
  • PHP is the best programming language.
  • Javascript is the best programming language.

To create the inverted index, we split the content field of each document into individual words (which we call terms) using a tokenizer, build a sorted list of all the unique terms, and then list, for each term, the documents it appears in.

The result looks like this:

Term          Doc_1   Doc_2   Doc_3
-------------------------------------
Java        |   X   |       |
is          |   X   |   X   |   X
the         |   X   |   X   |   X
best        |   X   |   X   |   X
programming |   X   |   X   |   X
language    |   X   |   X   |   X
PHP         |       |   X   |
Javascript  |       |       |   X
-------------------------------------

This structure consists of a list of all unique words in the document, and for each word there is a list of documents associated with it.

This structure of determining the position of the record by the attribute value is the inverted index. A file with an inverted index is called an inverted file.

Converting the above content into a diagram illustrates the structure of the inverted index.

There are mainly the following core terms that need to be understood:

  • Term: The smallest storage and query unit in the index. For English, it is a word. For Chinese, it generally refers to a word after word segmentation.
  • Dictionary (Term Dictionary): the collection of terms. A search engine's usual unit of indexing is the word, and the term dictionary is the set of strings made up of all the words that appear in the document collection. Each entry in the dictionary records some information about the word itself and a pointer to its posting list.
  • Posting List: a document usually consists of many words, and the posting list records in which documents a word appears and at which positions. Each such record is called a posting. A posting list records not only document IDs but also word frequency and other information.
  • Inverted File: the posting lists of all words are usually stored sequentially in a file on disk. This file is called the inverted file, and it is the physical file that stores the inverted index.

From the structure above we can see that an inverted index consists of two main parts:

  • dictionary
  • inverted file

The dictionary and the posting lists are two very important data structures in Lucene and an important cornerstone of fast retrieval. They are stored separately: the dictionary in memory and the inverted file on disk.


3. ES Core Concepts

After laying the groundwork for some basic knowledge, we officially enter the introduction of today's protagonist Elasticsearch.

ES is an open source search engine written in Java. It uses Lucene for indexing and searching internally. By encapsulating Lucene, it hides the complexity of Lucene and provides a simple and consistent RESTful API instead.

However, Elasticsearch is more than Lucene, and it's not just a full-text search engine.

It can be accurately described as follows:

  • A distributed real-time document store, each field can be indexed and searched.
  • A distributed real-time analytical search engine.
  • It is capable of scaling up to hundreds of service nodes and supports PB-level structured or unstructured data.

The introduction of Elasticsearch on the official website is that Elasticsearch is a distributed, scalable, near real-time search and data analysis engine.

Let's look at how Elasticsearch achieves distributed, scalable and near real-time search through some core concepts.

Cluster

The cluster construction of ES is very simple. It does not need to rely on third-party coordination and management components, and the cluster management function is realized internally.

An ES cluster consists of one or more Elasticsearch nodes. Each node can join the cluster by configuring the same cluster.name. The default value is "elasticsearch".

Make sure to use different cluster names in different environments, otherwise you will end up with nodes joining the wrong cluster.

An Elasticsearch service startup instance is a node (Node). The node sets the node name through node.name, if not set, a random universally unique identifier is assigned to the node as the name at startup.
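
As an illustration, a minimal elasticsearch.yml sketch (the cluster and node names here are made-up examples):

# elasticsearch.yml -- cluster membership settings
cluster.name: my-es-cluster   # every node that should join the same cluster uses the same name
node.name: node-1             # optional; a random identifier is generated if omitted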

① Discovery mechanism

This raises a question: how does ES connect different nodes with the same cluster.name setting into one cluster? The answer is Zen Discovery.

Zen Discovery is Elasticsearch's built-in default discovery module (the responsibility of the discovery module is to discover nodes in the cluster and elect the Master node).

It provides unicast and file-based discovery, and can be extended to support cloud environments and other forms of discovery through plugins.

Zen Discovery is integrated with other modules, for example, all communication between nodes is done using the Transport module. Nodes use the discovery mechanism to find other nodes by means of Ping.

Elasticsearch is configured by default to use unicast discovery to prevent nodes from accidentally joining the cluster. Only nodes running on the same machine will automatically form a cluster.

If the nodes of the cluster are running on different machines, using unicast, you can provide Elasticsearch with some list of nodes it should try to connect to.

When a node contacts members in the unicast list, it will get the status of all nodes in the entire cluster, and then it will contact the Master node and join the cluster.

This means that the unicast list does not need to contain all the nodes in the cluster, it just needs enough nodes that a new node can contact one of them and talk to it.

If you use the candidate master nodes as the unicast list, you only need to list those three nodes. The configuration lives in the elasticsearch.yml file:

discovery.zen.ping.unicast.hosts: ["host1", "host2:port"]  

After a node starts, it pings first. If discovery.zen.ping.unicast.hosts is set, it pings the hosts in that setting; otherwise it tries to ping several ports on localhost.

Elasticsearch supports the same host to start multiple nodes, and the Ping Response will contain the basic information of the node and the master node that the node considers.

At the start of the election, each node chooses from the Masters that the nodes already recognize. The rule is simple: sort by node ID lexicographically and take the first. If no node has a recognized Master, the choice is made from all nodes using the same rule.

There is one restriction here, discovery.zen.minimum_master_nodes: if the number of nodes does not reach this minimum, the above process repeats until enough nodes are available to start an election.

The final outcome is that a Master is always elected; if there is only one local node, it elects itself.

If the current node is Master, it starts to wait for the number of nodes to reach discovery.zen.minimum_master_nodes, and then provides services.

If the current node is not the Master, try to join the Master. Elasticsearch calls the above process of service discovery and master selection Zen Discovery.

Since it supports clusters of any size (1 to N nodes), it cannot require an odd number of nodes the way ZooKeeper does, so instead of a voting mechanism it uses a deterministic rule to elect the master.

As long as all nodes follow the same rules, the information obtained is equal, and the selected master nodes must be consistent.

However, the problem of distributed systems lies in the situation of unequal information. At this time, the problem of split-brain (Split-Brain) is easy to appear.

Most solutions are to set a Quorum value, which requires that the available nodes must be greater than the Quorum (generally more than half of the nodes) in order to provide external services.

In Elasticsearch, this Quorum configuration is discovery.zen.minimum_master_nodes.
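
As a sketch, with three candidate master nodes the quorum would be (3 / 2) + 1 = 2, so elasticsearch.yml would contain:

# assuming 3 master-eligible nodes: quorum = (3 / 2) + 1 = 2
discovery.zen.minimum_master_nodes: 2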

② Node roles

Each node can be either a candidate master node or a data node. This is set in the configuration file .../config/elasticsearch.yml, and both settings default to true.

node.master: true  # whether the node is a candidate (master-eligible) master node
node.data: true    # whether the node is a data node

Data nodes are responsible for storing data and performing data-related operations such as adding, deleting, updating, querying, and aggregating. Data nodes therefore require relatively high machine specifications and consume a lot of CPU, memory, and I/O.

Usually as the cluster expands, more data nodes need to be added to improve performance and availability.

Candidate master nodes can be elected as master nodes (Master nodes). Only candidate master nodes in the cluster have the right to vote and be elected, and other nodes do not participate in the election.

The master node is responsible for creating indexes, deleting indexes, tracking which nodes are part of the cluster, and deciding which shards are assigned to relevant nodes, tracking the status of nodes in the cluster, etc. A stable master node is very important to the health of the cluster.

A node can be both a candidate master node and a data node, but data nodes consume a lot of CPU, memory, and I/O.

So if a node is both a data node and a master node, it may affect the master node and thus the state of the entire cluster.

Therefore, to improve the health of the cluster, we should separate and isolate node roles in an Elasticsearch cluster. A few low-spec machines can be set aside as a group of candidate master nodes.

The master node and other nodes check each other through Ping, and the master node is responsible for Ping all other nodes to determine whether any nodes have hung up. Other nodes also judge whether the master node is available through Ping.

Although the roles of the nodes are distinguished, the user's request can be sent to any node, and the node is responsible for distributing the request, collecting the results, etc., without the master node forwarding.

This kind of node can be called a coordinator node. The coordinator node does not need to be specified and configured. Any node in the cluster can act as the coordinator node.
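
A sketch of how the roles could be separated in elasticsearch.yml (the split itself is illustrative):

# dedicated candidate master node (a low-spec machine is enough)
node.master: true
node.data: false

# dedicated data node (high CPU, memory and I/O requirements)
node.master: false
node.data: true

# pure coordinating node (neither master-eligible nor a data node)
node.master: false
node.data: false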

③ Split-brain phenomenon

If, due to network or other problems, multiple Master nodes are elected in the cluster and data updates become inconsistent, this is called split-brain: different nodes in the cluster disagree on the choice of Master, and multiple Masters compete.

The "split brain" problem can be caused by several reasons:

  • Network problems: network delays within the cluster cause some nodes to lose access to the Master. Believing the Master has died, they elect a new Master, mark the shards and replicas on the old Master as red, and allocate new primary shards.
  • Node load: the master node acts as both Master and Data. Under heavy traffic, ES may stop responding (appear dead) and cause long delays; other nodes get no response from the master, assume it has died, and re-elect a master node.
  • Memory recovery: the master node acts as both Master and Data. When the ES process on a data node occupies a large amount of memory, it triggers large-scale JVM garbage collection, causing the ES process to become unresponsive.

In order to avoid the split-brain phenomenon, we can start from the reasons and make optimization measures through the following aspects:

  • Appropriately increase the response time to reduce misjudgment. Set the response time of the node status through the parameter discovery.zen.ping_timeout, the default is 3s, which can be adjusted appropriately.

If the Master does not respond within this window, the node concludes that the Master has died. Increasing the parameter (for example discovery.zen.ping_timeout: 6s) appropriately reduces misjudgments.

  • Election trigger. We need to set the value of discovery.zen.minimum_master_nodes.

This parameter indicates how many candidate master nodes must take part when electing a master. The default value is 1, and the officially recommended value is (master_eligible_nodes / 2) + 1, where master_eligible_nodes is the number of candidate master nodes.

This both prevents split-brain and maximizes the cluster's high availability, because the election can proceed normally as long as at least discovery.zen.minimum_master_nodes candidate nodes survive.

When fewer candidate nodes than this are available, the election cannot be triggered and the cluster cannot be used, but no shard chaos is caused either.

  • Role separation. That is, the role separation of candidate master nodes and data nodes we mentioned above can reduce the burden on the master node, prevent the false death of the master node, and reduce the misjudgment that the master node is "dead".

Shards

ES supports PB-scale full-text search. When the amount of data in an index grows too large, ES splits the data in the index horizontally and spreads it across different data blocks; each split block of data is called a shard.

This is similar to MySQL's sub-database and sub-table, except that MySQL sub-database and sub-table need to rely on third-party components, and ES implements this function internally.

When writing data in a multi-shard index, the specific shard to be written is determined by routing, so the number of shards needs to be specified when creating the index, and once the number of shards is determined, it cannot be modified.

The number of shards and the number of copies described below can be configured through the Settings when creating an index. By default, ES creates 5 primary shards for an index, and creates a copy for each shard.

PUT /myIndex  
{  
   "settings" : {  
      "number_of_shards" : 5,  
      "number_of_replicas" : 1  
   }  
}  

ES uses sharding to increase the scale and performance of an index. Each shard is a Lucene index file, and every shard has one primary shard and zero or more replica shards.

Replicas

A replica is a copy of a shard. Each primary shard has one or more replica shards. When the primary shard is abnormal, the replica can provide data query and other operations.

The primary shard and the corresponding replica shard will not be on the same node, so the maximum number of replica shards is N-1 (where N is the number of nodes).

Document creation, indexing, and deletion requests are all write operations and must be completed on the primary shard before being replicated to the relevant replica shards.

To improve write throughput, ES performs this replication concurrently, and to handle the data conflicts that concurrent writes can cause, ES uses optimistic locking: every document has a _version number that is incremented whenever the document is modified.

Once all replica shards report success, they report success to the coordinator node, and the coordinator node reports success to the client.
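
A sketch of how this optimistic locking surfaces through the API on a hypothetical index (behavior as in ES 6.x):

# first write: the response contains "_version": 1
PUT /my_index/_doc/1
{ "title": "hello" }

# second write: the response contains "_version": 2
PUT /my_index/_doc/1
{ "title": "hello again" }

# a write that carries a stale expected version is rejected with
# a version_conflict_engine_exception (HTTP 409)
PUT /my_index/_doc/1?version=1
{ "title": "stale write" }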

To achieve high availability, the Master node avoids placing a primary shard and its replica shards on the same node.

Suppose the Node1 service goes down or its network becomes unreachable; the primary shard S0 on that node then becomes unavailable as well.

Fortunately, there are two other nodes that can work normally. At this time, ES will re-elect a new master node, and all the data of S0 we need exist on these two nodes.

We will promote the replica shard of S0 to the primary shard, and this process of promoting the primary shard happens instantaneously. At this point the status of the cluster will be Yellow.

Why is the cluster status Yellow instead of Green? Although both primary shards are present, each primary shard is configured with two replica shards, and at this moment only one replica exists for each, so the cluster cannot reach the Green state.

If we also shut down Node2, our program can still keep running without losing any data, because Node3 keeps a copy for each shard.

If we restart Node1, the cluster can redistribute the missing replica fragments, and the state of the cluster will return to the original normal state.

If Node1 still holds its previous shards, it will try to reuse them; they simply become replica shards rather than primary shards, and only the data that changed in the meantime needs to be copied over.

summary:

  • Sharding data is to increase the capacity of data that can be processed and facilitate horizontal expansion, and making copies for shards is to improve the stability of the cluster and increase the amount of concurrency.
  • Replicas are multiplication: the more you have, the more resources you consume but the safer your data is. Shards are division: the more shards, the smaller and more dispersed each shard's data becomes.
  • The more copies, the higher the availability of the cluster, but since each shard is equivalent to a Lucene index file, it will occupy a certain amount of file handles, memory and CPU. In addition, the data synchronization between shards will also occupy a certain amount of network bandwidth, so the number of shards and copies of the index is not as high as possible.

Mapping

Mapping is used to define information such as the storage type, word segmentation method and whether to store the fields in the index of ES, just like the Schema in the database, which describes the fields or attributes that the document may have, and the data type of each field.

The difference is that a relational database requires field types to be specified when a table is created, whereas ES can either leave field types unspecified and guess them dynamically, or have them specified explicitly when the index is created.

The mapping that automatically recognizes the field type according to the data format is called dynamic mapping (Dynamic Mapping), and the mapping that specifically defines the field type when we create an index is called static mapping or explicit mapping (Explicit Mapping).

Before explaining how dynamic and static mapping are used, let's first look at what field types exist in ES. Afterwards we will explain why an index should usually be created with static mapping rather than dynamic mapping.

The field data types in ES (v6.8) mainly include the following categories:

Text A field for indexing full-text values, such as the body of an email or a product description. These fields are tokenized, and they are passed through a tokenizer to convert the string into a list of individual terms before being indexed.

The analysis process allows Elasticsearch to search for individual words within each full-text field. Text fields are not used for sorting and are rarely used for aggregations.

Keyword A field for indexing structured content such as email addresses, hostnames, status codes, zip codes, or labels. They are commonly used for filtering, sorting, and aggregation. The Keyword field can only be searched by its exact value.
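
A small sketch of the difference (index and field names are invented): a Text field is analyzed into terms and matched word by word, while a Keyword field only matches its exact value.

PUT /products
{
  "mappings": {
    "_doc": {
      "properties": {
        "description": { "type": "text" },
        "status":      { "type": "keyword" }
      }
    }
  }
}

# text field: analyzed, so this matches any description containing the word "wireless"
GET /products/_search
{ "query": { "match": { "description": "wireless" } } }

# keyword field: not analyzed, only an exact value match succeeds
GET /products/_search
{ "query": { "term": { "status": "in_stock" } } }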

From these field types we can see that some fields need to be defined explicitly. For example, whether a field is Text or Keyword makes a big difference; a time field may need its format specified; and some fields may need a specific tokenizer, and so on.

Dynamic mapping often cannot get this right, and the automatically recognized types tend to differ from what we expect.

Therefore, when creating an index, a complete format should be to specify the number of shards and replicas, as well as the definition of Mapping, as follows:

PUT my_index
{
  "settings" : {
    "number_of_shards" : 5,
    "number_of_replicas" : 1
  },
  "mappings": {
    "_doc": {
      "properties": {
        "title":   { "type": "text" },
        "name":    { "type": "text" },
        "age":     { "type": "integer" },
        "created": {
          "type":   "date",
          "format": "strict_date_optional_time||epoch_millis"
        }
      }
    }
  }
}

4. Basic Use of ES

The first thing to consider when deciding to use Elasticsearch is the version issue. Elasticsearch (excluding 0.x and 1.x) currently has the following commonly used stable major versions: 2.x, 5.x, 6.x, 7.x (current).

You may notice that there is no 3.x or 4.x: ES jumped directly from 2.4.6 to 5.0.0. This was done to unify the versions of the ELK (Elasticsearch, Logstash, Kibana) stack and avoid confusing users.

When Elasticsearch was at 2.x (the last 2.x release, 2.4.6, came out on July 25, 2017), Kibana was already at 4.x (Kibana 4.6.5 was released the same day).

Then the next major version of Kibana must be 5.x, so Elasticsearch directly released its main version as 5.0.0.

With the versions unified, there is no hesitation in choosing: pick a version of Elasticsearch, then choose the same version of Kibana, and there is no need to worry about version incompatibility.

Elasticsearch is built using Java, so in addition to paying attention to the unified version of ELK technology, we also need to pay attention to the version of JDK when choosing the version of Elasticsearch.

Because each major version depends on different JDK versions, the current version 7.2 can already support JDK11.

Installation and use

① Download and extract Elasticsearch. It can be used right after extraction without installation. The extracted directory contains:

  • bin: Binary system command directory, including startup commands and plugin installation commands, etc.
  • config: Configuration file directory.
  • data: Data storage directory.
  • lib: Dependency package directory.
  • logs: The log file directory.
  • modules: module library, such as the module of x-pack.
  • plugins: Plugin directory.

② Run bin/elasticsearch in the installation directory to start ES.

③ By default it runs on port 9200. Request curl http://localhost:9200/ or open http://localhost:9200 in a browser to get a JSON object containing information about the current node, cluster, version, and so on.

{  
  "name" : "U7fp3O9",  
  "cluster_name" : "elasticsearch",  
  "cluster_uuid" : "-Rj8jGQvRIelGd9ckicUOA",  
  "version" : {  
    "number" : "6.8.1",  
    "build_flavor" : "default",  
    "build_type" : "zip",  
    "build_hash" : "1fad4e1",  
    "build_date" : "2019-06-18T13:16:52.517138Z",  
    "build_snapshot" : false,  
    "lucene_version" : "7.7.0",  
    "minimum_wire_compatibility_version" : "5.6.0",  
    "minimum_index_compatibility_version" : "5.0.0"  
  },  
  "tagline" : "You Know, for Search"  
}  

Cluster health status

To check the cluster's health, we can run GET /_cluster/health in the Kibana console and get information like the following:

{  
  "cluster_name" : "wujiajian",  
  "status" : "yellow",  
  "timed_out" : false,  
  "number_of_nodes" : 1,  
  "number_of_data_nodes" : 1,  
  "active_primary_shards" : 9,  
  "active_shards" : 9,  
  "relocating_shards" : 0,  
  "initializing_shards" : 0,  
  "unassigned_shards" : 5,  
  "delayed_unassigned_shards" : 0,  
  "number_of_pending_tasks" : 0,  
  "number_of_in_flight_fetch" : 0,  
  "task_max_waiting_in_queue_millis" : 0,  
  "active_shards_percent_as_number" : 64.28571428571429  
}  

The cluster status is indicated by green, yellow and red:

  • Green: The cluster is healthy, everything is fully functional, and all shards and replicas are working properly.
  • Yellow: Warning status, all primary shards are functioning normally, but at least one replica is not working properly. At this point the cluster is functional, but high availability is compromised to some extent.
  • Red: The cluster cannot be used normally. One or some shards and their replicas are abnormally unavailable. At this time, the query operation of the cluster can still be executed, but the returned results will be inaccurate. Write requests assigned to this shard will report an error, which will eventually lead to data loss.

When the cluster status is red, it will continue to serve search requests from available shards, but you need to fix those unassigned shards as soon as possible.

5. How ES Works

After the basic concepts and basic operations of ES are introduced, we may still have a lot of doubts:

  • How do they work internally?
  • How are primary and replica shards synchronized?
  • What is the process for creating an index?
  • How does ES allocate index data to different shards? And how are these index data stored?
  • Why is it said that ES is a near real-time search engine and the CRUD (create-read-update-delete) operation of documents is real-time?
  • And how does Elasticsearch ensure that updates are persisted without losing data when the power is off?
  • Also why deleting a document doesn't immediately free up space?

With these questions we proceed to the next content.

Write index principle

Consider a cluster of 3 nodes with 12 shards in total: 4 primary shards (S0, S1, S2, S3) and 8 replica shards (R0, R1, R2, R3), two replicas for each primary shard. Node 1 is the master node, responsible for the state of the whole cluster.

Index writes can only go to a primary shard, which then synchronizes the data to its replica shards. With four primary shards here, by what rule does ES decide which shard a given piece of data is written to?

Why is this index data written to S0 but not to S1 or S2? Why is that piece of data written to S3 instead of S0?

First of all, this will definitely not be random, otherwise we will not know where to look for it when we want to obtain documents in the future.

In fact, this process is determined according to the following formula:

shard = hash(routing) % number_of_primary_shards  

Routing is a variable value, which defaults to the document's _id and can also be set to a custom value.

Routing is passed through a hash function to produce a number, which is then divided by number_of_primary_shards (the number of primary shards), and the remainder is taken.

This remainder, which falls between 0 and number_of_primary_shards - 1, identifies the shard where the document lives.

This explains why the number of primary shards must be fixed when the index is created and can never be changed: if it changed, all previous routing values would become invalid and documents could never be found again.
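
A sketch of supplying a custom routing value (index name and routing value are made up); the same routing value must be passed when reading, otherwise the document may not be found on the expected shard:

# write with a custom routing value instead of the default _id
PUT /my_index/_doc/1?routing=user123
{ "title": "routed document" }

# read it back with the same routing value
GET /my_index/_doc/1?routing=user123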

Since each node in the ES cluster knows the storage location of the documents in the cluster through the above calculation formula, each node has the ability to process read and write requests.

After a write request is sent to a node, the node is the coordinating node mentioned above. The coordinating node will calculate which shard to write to according to the routing formula, and then forward the request to the main shard of the shard on the node.

Suppose the remainder obtained for a piece of data from the routing formula is shard = hash(routing) % 4 = 0.

The specific process is as follows:

  • The client sends a write request to the ES1 node (coordinating node), and the value obtained through the routing calculation formula is 0, then the current data should be written to the primary shard S0.
  • The ES1 node forwards the request to node ES3, where the primary shard S0 is located, and ES3 accepts the request and writes it to disk.
  • The data is replicated concurrently to the two replica shards R0. Data conflicts are controlled through optimistic concurrency. Once all replica shards report success, node ES3 reports success to the coordinating node, and the coordinating node reports success to the client.

Storage principle

The above describes the write handling process for an index inside ES. This process runs in ES memory; after the data has been assigned to specific shards and replicas, it is ultimately stored on disk so that no data is lost if the power goes off.

The storage path can be set in ../config/elasticsearch.yml; by default, data is stored in the data folder under the installation directory.

It is recommended not to use the default value, because if ES is upgraded, all data may be lost:

path.data: /path/to/data  # index data
path.logs: /path/to/logs  # log files

① Segmented storage

Index documents are stored on disk in the form of segments. What is a segment? The index file is split into multiple sub-files, and each sub-file is called a segment. Each segment is an inverted index itself, and the segment is immutable. Once the index data is written to the hard disk, it cannot be modified.

The segmented storage mode is adopted at the bottom layer, which makes it almost completely avoid the occurrence of locks when reading and writing, and greatly improves the performance of reading and writing.

After the segment is written to disk, a commit point is generated. The commit point is a file used to record all the segment information after committing.

Once a segment has a commit point, it means that the segment only has read permission and loses write permission. On the contrary, when the segment is in memory, it only has permission to write, but not to read data, which means it cannot be retrieved.

The concept of segment is proposed mainly because: In the early full-text search, a large inverted index was established for the entire document collection and written to the disk.

If the index is updated, you need to re-create a full index to replace the original index. This method is very inefficient when the amount of data is large, and because the cost of creating an index is very high, the data cannot be updated too frequently, and the timeliness cannot be guaranteed.

Index files are stored in segments and cannot be modified, so how to deal with adding, updating and deleting?

  • Add: new documents are easy to handle. Since the data is new, just add a new segment for the current document.
  • Delete, because it cannot be modified, so for the delete operation, the document will not be removed from the old segment, but a new .del file will be added, and the segment information of these deleted documents will be listed in the file. The document marked for deletion can still be matched by the query, but it will be removed from the result set before the final result is returned.
  • Update, the old segment cannot be modified to reflect the update of the document. In fact, the update is equivalent to the two actions of deleting and adding. The old document is marked for deletion in the .del file, and the new version of the document is indexed into a new segment. It is possible that both versions of the document will be matched by a single query, but the older version of the document that was deleted will be removed before the result set is returned.

Making segments immutable has both advantages and disadvantages. The advantages are mainly:

  • No lock is required. If you never update the index, you don't need to worry about multiple processes modifying data at the same time.
  • Once an index is read into the kernel's filesystem cache, it stays there due to its immutability. As long as there is enough space in the file system cache, most read requests will go directly to memory without hitting disk. This provides a big performance boost.
  • Other caches (like Filter caches) are always valid during the lifetime of the index. They don't need to be rebuilt every time the data changes because the data doesn't change.
  • Writing to a single large inverted index allows the data to be compressed, reducing disk I/O and usage of the index that needs to be cached in memory.

The disadvantages of segment immutability are as follows:

  • When old data is deleted, it is not removed immediately but marked as deleted in a .del file. The old data is only physically removed when segments are merged, so it wastes a lot of space in the meantime.
  • If there is a piece of data that is updated frequently, and each update is to add a new one to mark the old one, there will be a lot of wasted space.
  • Every time new data is added, a new segment is required to store the data. When the number of segments is too large, the consumption of server resources such as file handles will be very large.
  • Every query must filter out from its results the old data that is marked as deleted, which adds overhead to queries.

② Delayed write strategy

After introducing the form of storage, what is the process of writing the index to disk? Is it directly calling Fsync to physically write to the disk?

The answer is obvious. If it is written directly to the disk, the I/O consumption of the disk will seriously affect the performance.

Then when writing a large amount of data, it will cause the ES to pause and freeze, and the query cannot respond quickly. If this is the case, ES will not be called a near real-time full-text search engine.

In order to improve the writing performance, ES does not add a segment to the disk every time a new piece of data is added, but adopts a delayed writing strategy.

Whenever there is new data, it is first written into the memory, and there is a file system cache between the memory and the disk.

When the default interval (1 second) is reached, or the data in memory reaches a certain amount, a refresh (Refresh) is triggered: the data in memory is written as a new segment into the filesystem cache. Only later is the segment flushed to disk, generating a commit point.

The memory here uses the JVM memory of ES, while the file cache system uses the memory of the operating system.

New data will continue to be written into the memory, but the data in the memory is not stored in the form of segments, so no retrieval function can be provided.

When the memory is flushed to the file cache system, a new segment will be generated and the segment will be opened for search without waiting to be flushed to disk.

In Elasticsearch, the lightweight process of writing and opening a new segment is called Refresh (that is, the memory is flushed to the file cache system).

By default each shard is automatically refreshed every second. This is why we say that Elasticsearch is near real-time search, because changes to documents are not immediately visible to search, but become visible within a second.

We can also trigger a refresh manually: POST /_refresh refreshes all indexes, and POST /nba/_refresh refreshes the specified index.

Tips: Although a refresh is a much lighter operation than a commit, it still has a performance cost. Manually refreshing is useful when writing tests, but do not refresh manually after every indexed document in a production environment, and not every use case needs refreshes every second.

Perhaps you are using Elasticsearch to index a large number of log files and would rather optimize indexing speed than near real-time search.

In that case you can lower the refresh frequency of an index by increasing refresh_interval, for example to "30s". When setting the value, include a time unit, otherwise it defaults to milliseconds. Setting refresh_interval: -1 turns automatic refresh off for the index.
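
For example, a sketch of adjusting the refresh interval of a hypothetical index through the settings API:

# slow automatic refresh down to once every 30 seconds
PUT /my_index/_settings
{ "index": { "refresh_interval": "30s" } }

# disable automatic refresh entirely, e.g. during a bulk import
PUT /my_index/_settings
{ "index": { "refresh_interval": "-1" } }

# restore the default of one second afterwards
PUT /my_index/_settings
{ "index": { "refresh_interval": "1s" } }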

Although the delayed-write strategy reduces how often data is written to disk and improves overall write performance, the file cache system is still memory that belongs to the operating system, so the data in it can be lost if the machine loses power or crashes abnormally.

In order to avoid data loss, Elasticsearch adds a transaction log (Translog), which records all data that has not yet been persisted to disk.


With the transaction log added, the entire index-writing process works as follows:

  • After a new document is indexed, it is first written into memory, but in order to prevent data loss, a copy of data will be appended to the transaction log.

    New documents are constantly being written to memory and are also recorded in the transaction log. At this point new data cannot yet be retrieved and queried.

  • When the default refresh interval is reached, or the data in memory reaches a certain amount, a Refresh is triggered, flushing the in-memory data to the filesystem cache as a new segment and clearing memory. Although the new segment has not yet been committed to disk, it can already serve document searches; it can no longer be modified.

  • As the new document index is continuously written, when the log data size exceeds 512M or the time exceeds 30 minutes, a Flush will be triggered.

    The data in memory is written to a new segment and is written to the file cache system at the same time. The data in the file system cache is flushed to disk through Fsync, a commit point is generated, the log file is deleted, and an empty new log is created.

In this way, when the power goes off or the node needs to restart, ES not only loads the persisted segments according to the commit point, but also replays the records in the Translog to re-persist the data that had not yet reached disk, avoiding any possibility of data loss.
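
The flush and translog behavior can be tuned per index. A sketch on a hypothetical index: async durability trades a small durability window for write throughput, while sync_interval and flush_threshold_size are shown at their commonly cited 6.x defaults.

PUT /my_index
{
  "settings": {
    "index.translog.durability": "async",
    "index.translog.sync_interval": "5s",
    "index.translog.flush_threshold_size": "512mb"
  }
}

# a flush (commit point + translog truncation) can also be triggered manually
POST /my_index/_flush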

③ Segment merging

Since the automatic refresh process creates a new segment every second, the number of segments can explode in a short time, and having too many segments causes real trouble.

Each segment consumes file handles, memory, and CPU cycles. What's more, each search request has to examine each segment in turn and then combine the query results, so the more segments, the slower the search.

Elasticsearch solves this problem by doing periodic segment merging in the background. Small segments are merged into larger segments, which are then merged into larger segments.

Old deleted documents are purged from the filesystem when segments are merged: deleted documents are not copied into the new, larger segment. Indexing and searching are not interrupted during a merge.

Segment merging happens automatically while indexing and searching. The merge process selects a small number of similarly sized segments and merges them in the background into a larger segment; the segments involved can be either committed or uncommitted.

After the merge is complete, the old segment will be deleted, the new segment will be flushed to disk, and a new commit point will be written that includes the new segment and excludes the old and smaller segments, and the new segment will be opened for searching.

Segment merging requires a huge amount of calculation and consumes a lot of disk I/O. Segment merging will slow down the write rate, and if left unchecked, it will affect search performance.

Elasticsearch resource-limits the merge process by default, so searches still have enough resources to perform well.
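
A merge can also be requested explicitly, usually only on an index that is no longer being written to; a sketch against a hypothetical index:

# merge the segments of a read-only index down to a single segment
POST /my_index/_forcemerge?max_num_segments=1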

6. ES Performance Optimization

Storage devices

Disks are often the bottleneck on modern servers. Elasticsearch uses disk heavily, and the more throughput your disks can handle, the more stable your nodes will be.

Here are some tips for optimizing disk I/O:

  • Use SSDs. As mentioned elsewhere, they are far superior to mechanical disks.
  • Use RAID0. Striping RAID will increase disk I/O, at the obvious cost of failing when one hard drive fails. Do not use mirroring or parity RAID as replicas already provide this functionality.
  • Alternatively, use multiple hard drives and let Elasticsearch stripe data across them via multiple path.data directory configurations (see the sketch after this list).
  • Do not use remotely mounted storage such as NFS or SMB/CIFS. This introduced delay is completely counterproductive to performance.
  • If you're using EC2, beware of EBS. Even SSD-based EBS is usually slower than local instance storage.
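
A sketch of the multi-path option above in elasticsearch.yml (the mount points are placeholders):

# spread shard data across several local disks
path.data:
  - /mnt/disk1/es-data
  - /mnt/disk2/es-data
  - /mnt/disk3/es-data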

Internal index optimization

To find a term quickly, Elasticsearch first sorts all terms and then uses binary search to locate a term, with O(log N) time complexity, just like looking a word up in a dictionary. This sorted list is the Term Dictionary.

Looking at it now, it seems to be similar to the way traditional databases use B-Tree. But if there are too many terms, the Term Dictionary will be very large, and it is unrealistic to store it in memory, so there is a Term Index.

Just like the index page in a dictionary, which terms start with A and which pages are they on, it can be understood that the Term Index is a tree.

This tree does not contain all Term, it contains some prefixes of Term. Through the Term Index, you can quickly locate a certain Offset in the Term Dictionary, and then search sequentially from this position.

The Term Index is compressed in memory with an FST (finite state transducer), which stores all terms as byte sequences. This compression effectively reduces storage space so that the Term Index fits in memory, although it also costs more CPU when looking terms up.

For the posting list stored on the disk, compression technology is also used to reduce the space occupied by storage.

Adjust configuration parameters

Suggestions for adjusting configuration parameters are as follows:

  • Assign each document an ordered ID with a well-compressed sequence pattern, and avoid random IDs like UUID-4, which have a low compression ratio and will significantly slow down Lucene.

  • Disable doc values for indexed fields that do not need aggregation or sorting. Doc values are ordered document => field value lists.

  • For fields that do not require fuzzy retrieval, use the Keyword type instead of the Text type, so as to avoid word segmentation of these texts before indexing.

  • If your search results do not require near real-time accuracy, consider increasing index.refresh_interval.

    If you are doing a bulk import, you can turn off refreshing during the import by setting this value to -1, and you can also turn off replicas with index.number_of_replicas: 0. Don't forget to turn them back on when you're done.

  • Avoid deep paging and use Scroll for paged queries instead (see the sketch after this list). In a normal paged query, an empty priority queue of size from + size is created, and each shard returns from + size pieces of data, by default containing only the document ID and score, to the coordinating node.

    If there are N shards, the coordinating node re-sorts (from + size) × N pieces of data and then picks out the documents actually needed. When from is large, this sort becomes very heavy and consumes CPU heavily.

  • Reduce the number of mapped fields; provide only the fields that need to be searched, aggregated, or sorted. Other fields can be kept in another store such as HBase: after getting the results from ES, fetch those fields from HBase.

  • Specify the Routing value when creating an index and query, so that you can accurately query specific shards and improve query efficiency. The choice of routing needs to pay attention to the balanced distribution of data.
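
A sketch of a scroll-based paged query against a hypothetical index (the scroll_id below is a truncated placeholder returned by the first call):

# open a scroll context that is kept alive for 1 minute, returning the first batch
POST /my_index/_search?scroll=1m
{
  "size": 1000,
  "query": { "match_all": {} }
}

# fetch the next batch by passing back the _scroll_id from the previous response
POST /_search/scroll
{
  "scroll": "1m",
  "scroll_id": "DXF1ZXJ5QW5kRmV0Y2gBAAAAAA..."
}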

JVM tuning

The JVM tuning recommendations are as follows:

  • Make sure the minimum (Xms) and maximum (Xmx) heap sizes are the same, so the JVM does not resize the heap at runtime. The heap Elasticsearch sets after installation defaults to 1GB. It can be configured via the ../config/jvm.options file (see the sketch after this list), but it is best not to exceed 50% of physical memory and never to exceed 32GB.
  • The GC adopts the CMS method by default, and it is concurrent but has STW problems. You can consider using the G1 collector.
  • ES relies heavily on the filesystem cache (Filesystem Cache) for fast search. In general, you should make sure at least half of the available physical memory is left to the filesystem cache.
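
A sketch of the corresponding lines in config/jvm.options (31g is an illustrative value that stays below the 32GB threshold; size it to your own machine):

# config/jvm.options -- keep the minimum and maximum heap identical
-Xms31g
-Xmx31g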


Origin blog.csdn.net/Tyson0314/article/details/130144342