Looking at distributed system design through the Elasticsearch cluster and data layer architecture

Original link: https://mp.weixin.qq.com/s/jJ1LH2MLxRPAvma3Hd6-XA


There are many types of distributed systems, covering a wide range of areas, and different types of systems have very different characteristics: batch computing and real-time computing, for example, differ greatly. This article focuses on the design of distributed data systems, such as distributed storage systems, distributed search systems, and distributed analysis systems.


Let's take a brief look at the architecture of Elasticsearch.


1. Elasticsearch cluster architecture


Elasticsearch is a well-known open source search and analysis system, widely used across the Internet industry, especially in the following three areas.


The first is search. Compared with Solr, Elasticsearch is the up-and-coming star and has become the first choice for many search systems.


The second is as a JSON document database. Compared with MongoDB, it has better read and write performance and supports richer geo queries as well as mixed numeric and text queries.


The third is time-series data analysis and processing. It currently does a very good job in log processing and in the storage, analysis, and visualization of monitoring data, and can fairly be called a leader in this area.


A detailed introduction to Elasticsearch can be found on the official website. Let's first look at a few key concepts:


  • Node: a physical concept; a running Elasticsearch instance, generally one process on a machine.


  • Index: a logical concept that includes the configuration (such as mappings) and the inverted and forward index data files. An index's data files may sit on one machine or be spread across several. Note that "index" also has a second meaning: the inverted index files themselves.


  • Shard: to support larger data volumes, an index is usually split along some dimension into multiple parts, each of which is a shard. Shards are managed by Nodes; a Node generally manages multiple shards, which may belong to the same index or to different indexes. For reliability and availability, however, the shards of one index are spread across different Nodes as much as possible. There are two kinds of shards: primary shards and replica shards.


  • Replica: a backup copy of a shard's data. A shard may have zero or more replicas, and the data across these copies is kept strongly or eventually consistent.


Graphically it might look like this:



  • Index 1: the blue part. It has 3 shards (P1, P2, P3) located on 3 different Nodes, with no replicas.


  • Index 2: the green part. It has 2 shards (P1, P2) located on 2 different Nodes, and each shard has one replica (R1 and R2 respectively). For availability, the primary and replica of the same shard must not sit on the same Node: here P1 and R1 of Shard 1 are on Node 3 and Node 2 respectively, so if Node 2 goes down at some moment the service is unaffected, because P1 and R2 are still available. Because this is a primary/replica architecture, when a primary shard fails a switch is needed: one replica must be elected as the new primary. Besides taking a little time, this also carries some risk of data loss.
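
To make the figure concrete, below is a minimal sketch that creates the two indexes described above through the REST API, assuming a test cluster on localhost:9200 and made-up index names:

```python
import requests

ES = "http://localhost:9200"  # assumed local test cluster

# Index 1 from the figure: 3 primary shards, no replicas.
requests.put(f"{ES}/index_1", json={
    "settings": {"number_of_shards": 3, "number_of_replicas": 0}
})

# Index 2 from the figure: 2 primary shards, 1 replica each.
requests.put(f"{ES}/index_2", json={
    "settings": {"number_of_shards": 2, "number_of_replicas": 1}
})

# Show where the primaries (p) and replicas (r) were placed.
print(requests.get(f"{ES}/_cat/shards/index_1,index_2?v").text)
```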


Indexing process


When indexing, a doc is first routed to its primary shard according to the routing rules. The doc is sent to the primary shard to build the index there, then forwarded to the shard's replicas to build their indexes, and success is returned only after the replicas have also succeeded.
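
Conceptually the routing rule is shard = hash(_routing) % number_of_primary_shards. Below is a toy Python version; Elasticsearch actually uses murmur3 (plus a routing-shards indirection in recent versions), so md5 here is only a stand-in for a stable hash:

```python
import hashlib

def route_to_shard(routing: str, num_primary_shards: int) -> int:
    """Toy routing rule: hash the routing value (the doc id by default)
    and take it modulo the primary shard count."""
    h = int(hashlib.md5(routing.encode()).hexdigest(), 16)
    return h % num_primary_shards

# The same doc always lands on the same primary shard, which is also
# why the primary shard count cannot change after index creation.
print(route_to_shard("doc-42", 3))
```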


In this architecture, all index data lives in the shards, with the primary shard and each replica shard storing a full copy. When a replica shard or the primary shard is lost (machine down, network partition, and so on), the lost shard must be rebuilt on another node: all of its data has to be fully copied from a surviving replica to construct a new shard on the new Node.


This copy takes time, during which traffic can only be carried by the remaining copies. Until recovery completes, the whole system stays in an at-risk state; the danger ends only when the failover finishes.


This reflects one reason Replicas exist: avoiding data loss and improving data reliability. The other reason is query scaling: when read traffic exceeds what one Node can carry, replicas are needed to spread the query load and expand query capacity.


How to deploy roles


Next, let's look at two different ways of dividing roles:



Elasticsearch supports the above two methods:


1. Hybrid deployment (left):


  • This is the default mode.


  • Setting aside the Master Node, there are two Node roles, Data Node and Transport Node. In this deployment mode, both roles live in the same Node, so each Node carries both the Data and the Transport functions.


  • When an index or query request arrives, it is sent to an arbitrary Node (randomly by default; this can be customized). That Node holds a global routing table, uses it to pick the appropriate Nodes, forwards the request to them, merges the results once all sub-requests return, and sends them back to the user. One Node thus plays both roles (a toy sketch of this scatter/gather follows this list).


  • The advantage is that it is extremely simple to get started and easy to use, which matters a great deal for adoption. In the simplest scenario, a single Node provides all the functionality.


  • The disadvantage is that different kinds of requests interfere with each other. In a large cluster, a hotspot on one data node affects every cross-node request that passes through it, and the blast radius of a failure is much larger.


  • Each Node in Elasticsearch maintains 13 connections to every other Node (the default transport connection count per node pair). Since every Node must connect to all others and a system has an upper bound on connections, the connection count limits how large the cluster can grow.


  • In addition, this mode cannot support hot (rolling) upgrades of the cluster.
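
Below is the toy scatter/gather sketch promised above. The Shard class is an imaginary stand-in for a per-shard search call, not a real Elasticsearch API:

```python
from concurrent.futures import ThreadPoolExecutor

class Shard:
    """Imaginary stand-in for one data shard."""
    def __init__(self, hits):
        self.hits = hits  # pretend local index contents

    def search(self, query, top_k):
        # A real shard would run the query against its local index.
        return sorted(self.hits, key=lambda h: h["score"], reverse=True)[:top_k]

def coordinate(query, shards, top_k=10):
    """What a Node does in the Transport role: fan the query out to all
    shards in parallel, wait for the partial results, merge them, and
    return the global top-k."""
    with ThreadPoolExecutor() as pool:
        partials = pool.map(lambda s: s.search(query, top_k), shards)
    hits = [hit for part in partials for hit in part]
    return sorted(hits, key=lambda h: h["score"], reverse=True)[:top_k]

shards = [Shard([{"doc": i, "score": i * 0.1}]) for i in range(5)]
print(coordinate("any-query", shards, top_k=3))
```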


2. Layered deployment (right):


  • Roles are isolated through configuration.


  • Some Nodes are set as Transport Nodes, dedicated to request forwarding and result merging; the other Nodes are set as DataNodes, dedicated to processing data.


  • The disadvantage is that setup is more complex: the number of Transport Nodes must be decided in advance, and it depends on the number of data nodes, the traffic, and so on; get it wrong and either resources sit idle or the machines get overwhelmed.


  • The advantage is that the roles are independent and do not interfere with each other. Traffic on Transport Nodes is generally evenly distributed, and a single machine rarely saturates its CPU or bandwidth; DataNodes, because they do the data processing, easily saturate a single machine's resources such as CPU, network, and disk. With the roles separated, a DataNode failure affects only that node's data processing, not requests on other nodes, keeping the impact to the smallest possible scope.


  • Once the roles are separated, only Transport Nodes need to connect to all DataNodes; DataNodes no longer need connections to one another. Since a cluster has far more DataNodes than Transport Nodes, the cluster can scale much larger. Furthermore, with grouping, each Transport Node can connect only to a fixed group of DataNodes, which eliminates Elasticsearch's connection-count problem entirely.


  • It supports hot upgrades: first upgrade the DataNodes one by one, then upgrade the Transport Nodes once that is complete. The whole process can be invisible to users.
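
As a sketch of how the layered layout can be set up and checked, assuming a cluster on localhost:9200. The yml flags shown are the legacy (pre-7.9) style; newer versions express the same split with a node.roles list:

```python
import requests

ES = "http://localhost:9200"  # assumed cluster address

# Per-node settings in elasticsearch.yml, e.g.:
#   coordinating-only ("Transport") node:   node.master: false
#                                           node.data:   false
#   data node:                              node.master: false
#                                           node.data:   true

# Verify the layout: in _cat/nodes, data nodes show "d" in node.role,
# while coordinating-only nodes show "-" (exact letters vary by version).
print(requests.get(f"{ES}/_cat/nodes?v&h=name,node.role").text)
```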


The deployment-layer architecture of Elasticsearch has been introduced above. Different deployment modes suit different scenarios; choose the one that fits your needs.


2. Elasticsearch data layer architecture


Next, let's take a look at the current data layer architecture of Elasticsearch.


Data storage


Elasticsearch's index and meta data are currently stored on the local file system, with several loading methods supported, such as niofs, mmapfs, simplefs, and smb. The best performance comes from mmapfs, which maps the index files directly into memory. By default Elasticsearch selects the loading method automatically, and it can also be set explicitly in the configuration; the details can be found in the official documentation.
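
For example, the store type can be pinned per index; it is a static setting, so it has to be chosen at creation time. A small sketch, assuming a local test cluster and a made-up index name:

```python
import requests

ES = "http://localhost:9200"  # assumed local test cluster

# Create an index whose files are loaded via mmap.
requests.put(f"{ES}/logs_mmap", json={
    "settings": {
        "index.store.type": "mmapfs",  # map index files into memory
        "number_of_shards": 1
    }
})
```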


Both the index and the meta data live locally, which creates a problem: when a machine goes down or a disk is damaged, the data is lost. The Replica feature exists to solve this.


Replica


Each Index has a configuration item for the number of replicas. If it is set to 2, every shard has 3 copies: one PrimaryShard and two ReplicaShards. The Master schedules these three shards onto different machines, or even different racks, as far as possible; their data is identical and they provide the same service capability.
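
Unlike the shard count, the replica count is a dynamic setting and can be changed on a live index. A small sketch, again with an assumed cluster address and a hypothetical index name:

```python
import requests

ES = "http://localhost:9200"  # assumed cluster address

# Raise the replica count to 2: each shard now has 3 copies
# (1 primary + 2 replicas), as described above.
requests.put(f"{ES}/my_index/_settings", json={
    "index": {"number_of_replicas": 2}
})
```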


Replica has three purposes:


  • Guarantee service availability: with multiple Replicas, if one copy becomes unavailable, request traffic can keep flowing to the remaining copies and service resumes quickly.


  • Guarantee data reliability: with only a Primary and no Replica, if the Primary machine's disk is damaged, the data of every Shard on that Node is lost, and the only remedy is to re-index.


  • Provide greater query capacity: when the existing shards cannot meet the business's query demand, N more replicas can be added, multiplying query capacity roughly N-fold and easily raising the system's concurrency.


Problems


Some advantages have been mentioned above, but this architecture also has some problems in some scenarios.


Elasticsearch builds on the local file system and uses Replicas to ensure data reliability. This architecture meets most needs and scenarios reasonably well, but it also has some shortcomings:


  • Replicas waste capacity. They must be used to guarantee data reliability, but when one shard is enough for the processing load, the other shard's compute power goes unused.


  • Replicas reduce write performance and throughput. Every index or update operation must first update the Primary Shard, and only after that succeeds are the Replicas updated in parallel. Besides the long-tail latency, write performance drops considerably.


  • Dynamically adding Replicas is slow when a hotspot appears or an urgent scale-out is needed: the new shard's data must be copied in full from existing shards, which takes a long time.


The Elasticsearch data layer architecture, and the pros and cons its replication strategy brings, have been introduced above. The following briefly introduces several different forms of distributed data system architecture.


3. Distributed systems


The first: distributed systems based on the local file system



The figure above shows a distributed system that stores its data on local disks. The Index has 3 Shards in total, and each Shard has a Replica Shard in addition to its Primary Shard.


When the Node 3 machine goes down or its disk is damaged, P3 is first confirmed unavailable and R3 is elected as the new primary, completing the failover for that shard. Then a new machine, Node 7, is found, and a new Replica of Shard 3 is started on Node 7.


Since the data lives on local disks, Shard 3's data must be copied from Node 6 to Node 7. With 200 GB of data over a gigabit network, the copy takes about 1600 seconds. Without a replica, the shard would be unable to serve during those 1600 seconds.
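
The arithmetic as a tiny helper. This is only the back-of-the-envelope bound from the text; real recoveries are usually slower because disk throughput, recovery throttling, and protocol overhead are ignored here:

```python
def copy_seconds(data_gb: float, link_gbps: float = 1.0) -> float:
    """Time to push data_gb gigabytes through a link of link_gbps Gbit/s."""
    return data_gb * 8 / link_gbps  # GB -> gigabits, then divide by Gbit/s

print(copy_seconds(200))  # 1600.0 seconds, matching the example above
```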


To ensure reliability, redundant shards are required, which consumes more physical resources. Another form of the same idea is dual clusters, backing up at the cluster level.


In this architecture, if your data is produced in another storage system such as HDFS or HBase, you also need a data transport system to distribute the prepared data to the corresponding machines.


In this architecture, a production environment must use dual clusters or Replicas to ensure availability and reliability. Their advantages and side effects were covered in the Elasticsearch discussion above, so they are not repeated here.


Elasticsearch uses this architecture.


The second: distributed systems based on a distributed file system (shared storage)



In response to the problems of the first architecture, another idea is the separation of storage and compute.


The root of the first architecture's problem is the sheer volume of data: copying it takes a long time. Is there a way to avoid copying it at all? One approach is to build the storage layer on shared storage: each shard just attaches to a directory or file in a distributed file system and holds no data itself, only the compute part. Each Node is then responsible only for computation, while storage sits in an underlying distributed file system such as HDFS.


In the diagram above, Node 1 attaches to the first file, Node 2 to the second, and Node 3 to the third. When the Node 3 machine goes down, all that is needed is to create an empty shard on Node 4 and build a new connection to the third file in the underlying distributed file system. Creating the connection is very fast, so the total recovery time is very short.
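
A minimal sketch of what "the shard holds no data, only compute" means. The classes below are imaginary stand-ins; with HDFS one would use a real client library instead:

```python
class DfsFile:
    """Stand-in for a file that lives in a distributed file system."""
    def __init__(self, path):
        self.path = path  # the bytes themselves never leave the DFS

class ComputeShard:
    """A shard that stores nothing locally, only a handle to a DFS file."""
    def __init__(self, dfs_file):
        self.dfs_file = dfs_file

# Node 3 dies: failover is just re-attaching on Node 4 -- no bulk copy.
shard3 = ComputeShard(DfsFile("/index/shard3"))   # originally on Node 3
shard3_on_node4 = ComputeShard(shard3.dfs_file)   # near-instant re-attach
print(shard3_on_node4.dfs_file.path)
```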


This is a typical architecture that separates storage and computing, and has the following advantages:


  • Resources are more elastic. When storage runs short, only the storage system needs to grow; when compute runs short, only compute capacity needs to be added.


  • Storage and compute are managed independently, at a finer granularity and with less waste, so the overall cost can be lower.


  • Better tolerance of load spikes and hotspots. Hotspots generally appear in the compute layer, and in a storage/compute-separated system the compute layer can scale out, scale in, and migrate in real time because no data is bound to it: when a hotspot appears, computation can simply be scheduled onto new nodes.


This architecture also has a disadvantage:


Accessing a distributed file system may not perform as well as accessing a local file system. With the previous generation of distributed file systems this gap was quite noticeable, but with various user-space protocol stacks it has become smaller and smaller.


This is what HBase uses. Solr also supports this form of architecture.


Summary


The two architectures above each have advantages and disadvantages. For the shortcomings of a given architecture, different lines of thinking lead to very different solutions, and in general the bigger the conceptual leap, the bigger the potential payoff.


The above introduced only two storage-layer architectures for distributed data systems (storage, search, analysis, and so on); hopefully it is useful. Distributed system design covers a very wide range of topics, with many details and many trade-offs. If you are interested in a particular area or aspect, leave a comment and we can discuss it further.

