Why build an ElasticSearch cluster instead of a stand-alone ElasticSearch?

  ElasticSearch (hereinafter ES) is a search engine that can store massive amounts of data and return the information an application needs in a very short time. Its components are open source, so it is widely used. Projects that use ES generally run it in cluster mode rather than stand-alone mode. Why? What advantages does cluster mode have over stand-alone mode? A brief summary follows.

1. High Availability

  As a search engine, ES is expected to store massive amounts of data and to return the information we want in a very short time. So the first thing to ensure is that ES is highly available. What is high availability? It usually means designing the system so that the time during which it cannot provide service is as short as possible. If the system is always able to provide service, we say its availability is 100%. If it goes down at some point, for example a website that stops responding for a while, it is temporarily unavailable. To keep ES highly available, we should therefore minimize the time during which ES is unavailable.
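  To put a number on this, availability can be written as uptime divided by total time. The snippet below is only a small arithmetic sketch; the function name and the sample figures (a 720-hour month with 3 hours of downtime) are made up for illustration.

```python
def availability(uptime_hours, downtime_hours):
    """Return availability as a percentage of total time."""
    total = uptime_hours + downtime_hours
    if total <= 0:
        raise ValueError("total time must be positive")
    return 100.0 * uptime_hours / total


# Hypothetical example: a 30-day month (720 hours) with 3 hours of downtime.
print(round(availability(717, 3), 2))  # -> 99.58
```
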
  So how do we improve the availability of ES? This is where clusters come in. If ES runs on only one server, and that host suddenly loses its network connection or is attacked, the entire ES service becomes unavailable. In an ES cluster, when one host goes down the other hosts keep serving requests, so the service stays up.
  Some readers may ask: if a host goes down, doesn't the data on that host become inaccessible? If the data I want happens to live on that host, can I no longer get it? Or is the same data also stored on other hosts? Isn't that wasteful?
  To answer these questions we need to look at how ES stores data. The short answer first: if a host goes down, the data stored on it can still be accessed, because other hosts hold backups. The backup is not a copy of the entire host, though. The data is backed up piece by piece, which brings us to another concept: sharding.
  A shard (Shard) is, as the name suggests, one of several parts that the data is divided into. An index (Index) in ES is roughly equivalent to a database. For example, to store the user information of a website, we would create an index named user. The index is not stored in one piece, however. It is stored in shards. By default, ES versions before 7.0 divide an index into five primary shards, while 7.0 and later default to one. Either way, the number can be set when the index is created. A shard is a container for data: documents are stored in shards, and shards are allocated to the nodes in the cluster. When the cluster scales up or down, ES automatically migrates shards between nodes so that the data stays evenly distributed. In effect, one body of data is split into several pieces that are stored on different hosts.
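  How does a document end up in a particular shard? Conceptually, ES hashes a routing value (by default the document ID) and takes the remainder modulo the number of primary shards. The sketch below only illustrates that idea: `pick_shard` is a hypothetical helper, and `zlib.crc32` is a stand-in for the hash function ES really uses.

```python
import zlib
from collections import Counter


def pick_shard(routing_key, num_primary_shards):
    """Map a routing key to a shard number in [0, num_primary_shards)."""
    # Stand-in hash for illustration only; ES uses its own hash function.
    return zlib.crc32(routing_key.encode("utf-8")) % num_primary_shards


# Spread 1000 hypothetical user documents over five primary shards.
doc_ids = [f"user-{i}" for i in range(1000)]
counts = Counter(pick_shard(doc_id, 5) for doc_id in doc_ids)
print(sorted(counts.items()))
```

  Because the shard number depends on the number of primary shards, the same document would map to a different shard if that number changed, which is why the number of primary shards is fixed when the index is created.
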
  But that still doesn't solve the problem. If a host goes down, isn't the shard on it inaccessible, since the other hosts store other shards? In fact it is still accessible, because another host stores a backup of that shard. This brings us to another concept: the replica.
  A replica (Replica) is, as the name suggests, a copy of an original (primary) shard with the same content. By default ES creates one replica of each primary shard, so an index with five primary shards has five primaries and five replica shards: two copies of the data, spread over ten shards. The number of replicas can also be customized. As long as the replica of a shard is stored on a different host from its primary, we can still find the data in the replica when one host goes down. From the outside, the results look no different.
  In general, ES tries to store the different shards of an index on different hosts, and to keep the replicas of a shard on different hosts from its primary wherever possible, which improves fault tolerance and availability.
  If you only have one host, what can you do? Sharding and replication lose their purpose: if that host goes down, everything goes down.
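  The effect of spreading primaries and replicas over several hosts can be shown with a toy simulation. This is not the real ES allocator, just a round-robin sketch under simple assumptions; the function names and node names are invented for the example.

```python
def allocate(num_shards, num_replicas, nodes):
    """Toy round-robin allocator (not the real ES allocator).

    Copies of the same shard always land on different nodes. Copies that
    cannot be placed because there are too few nodes stay unassigned.
    """
    placement = {node: [] for node in nodes}
    for shard in range(num_shards):
        copies = min(1 + num_replicas, len(nodes))
        for copy in range(copies):
            node = nodes[(shard + copy) % len(nodes)]
            role = "primary" if copy == 0 else "replica"
            placement[node].append((shard, role))
    return placement


def reachable_shards(placement, failed_node):
    """Shard numbers that still have at least one copy after a node fails."""
    return {
        shard
        for node, copies in placement.items()
        if node != failed_node
        for shard, _role in copies
    }


cluster = allocate(5, 1, ["node-a", "node-b", "node-c"])
single = allocate(5, 1, ["node-a"])

for node in cluster:
    print(node, "down ->", sorted(reachable_shards(cluster, node)))
print("single host down ->", sorted(reachable_shards(single, "node-a")))
```

  With three nodes, every shard number is still reachable whichever node fails. With a single node, the replicas have nowhere to go, and losing that node leaves nothing reachable.
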

2. Health status

  ES has a dedicated indicator for the health of an index, with the following three levels.

  • green. All primary and replica shards have been allocated. The cluster is 100% available.
  • yellow. All primary shards have been allocated, but at least one replica is missing. No data is lost, so search results are still complete. However, high availability is weakened to some extent: if more shards disappear, you will lose data. Think of yellow as a warning that requires prompt investigation.
  • red. At least one primary shard and all of its replicas are missing. This means data is missing: searches return only part of the data, and write requests routed to that shard return an exception.

  If you only have one host, the health status of an index with replicas is also yellow, because there is no other host on which to place the replicas. This is an unhealthy state, which is another reason clustering is necessary.
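  The three levels can be expressed as a small decision rule. The sketch below is a simplified model, not ES code: each shard is described by a hypothetical pair (copies currently assigned, copies wanted), where the copies include the primary and its replicas. On a real cluster the status is reported by the cluster health API (`GET _cluster/health`).

```python
def index_health(shards):
    """Classify index health from (assigned_copies, wanted_copies) pairs."""
    if any(assigned == 0 for assigned, _wanted in shards):
        return "red"      # some shard has no copy left at all
    if any(assigned < wanted for assigned, wanted in shards):
        return "yellow"   # all data present, but some replica is missing
    return "green"        # every primary and replica is allocated


# Five shards, one replica each, so two copies wanted per shard.
print(index_health([(2, 2)] * 5))              # enough hosts -> green
print(index_health([(1, 2)] * 5))              # single host  -> yellow
print(index_health([(2, 2)] * 4 + [(0, 2)]))   # a shard lost -> red
```
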

3. Storage space

  A cluster also pools storage. The storage space of a single host is fixed, so a cluster has more total storage space than a single machine and can hold more data.
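  One thing to keep in mind is that replicas take up space too. As a rough estimate, assuming identical hosts and ignoring index overhead and free-space reserves, the amount of distinct data a cluster can hold is its raw capacity divided by the number of copies. The function name and figures below are illustrative only.

```python
def usable_capacity_gb(num_nodes, disk_per_node_gb, num_replicas):
    """Rough capacity for distinct data: raw space divided by copy count."""
    raw = num_nodes * disk_per_node_gb
    return raw / (1 + num_replicas)


print(usable_capacity_gb(1, 500, 0))  # one host, no replicas    -> 500.0
print(usable_capacity_gb(3, 500, 1))  # three hosts, one replica -> 750.0
```
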

  In summary, when ES is used in a project, cluster mode is generally chosen over stand-alone mode.

Origin blog.csdn.net/piaoranyuji/article/details/114022056