Learning architecture from scratch - high-availability storage architecture

Two-machine architecture

The essence of storage high-availability solutions is to replicate data to multiple storage devices and achieve high availability through data redundancy. The complexity mainly lies in how to deal with the data inconsistency caused by replication delays and interruptions. Therefore, for any high-availability storage solution, we need to think through the following questions:
How to replicate data?
What are the responsibilities of each node?
How to deal with replication lag?
How to deal with replication outages?
Common high-availability storage architectures include master-standby, master-slave, master-master, cluster, and partition.

Master/Standby Replication

Master-standby replication is the most common and simplest storage high-availability solution. Almost all storage systems provide it, for example MySQL, Redis, and MongoDB.

  1. Basic implementation

The following is a standard master/standby architecture diagram:
[figure: master/standby replication architecture]

  2. Advantages and disadvantages analysis
  • Advantages:
    1. The client does not need to be aware of the standby machine's existence;
    2. The master and standby machines only need to replicate data; neither performs state judgment or master/standby switchover;

  • Disadvantages:
    1. The standby machine is only a backup and does not serve read or write operations;
    2. Manual intervention is required after a failure; recovery is not automatic;

  • Usage scenario:
    Internal back-office management systems often use the master-standby replication architecture, for example student management systems, employee management systems, and vacation management systems, because data in such systems changes infrequently, and even if some data is lost it can be re-entered manually.
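
To make the division of labor concrete, here is a toy sketch of asynchronous master/standby replication (an illustration of the pattern, not any particular database's mechanism): the master appends each write to a replication log and ships it to the standby, and the standby does nothing but apply it.

```python
import threading
import time
from queue import Empty, Queue

class Master:
    """Serves all reads and writes; ships changes to the standby asynchronously."""
    def __init__(self):
        self.data = {}
        self.replication_log = Queue()  # changes waiting to be shipped

    def write(self, key, value):
        self.data[key] = value
        self.replication_log.put((key, value))  # replicated later, asynchronously

class Standby:
    """Only applies the replication stream; serves no client traffic."""
    def __init__(self, master):
        self.data = {}
        self.master = master

    def apply_forever(self):
        while True:
            try:
                key, value = self.master.replication_log.get(timeout=0.1)
                self.data[key] = value  # pure copying: no state judgment, no switchover
            except Empty:
                pass  # nothing to replicate right now

master = Master()
standby = Standby(master)
threading.Thread(target=standby.apply_forever, daemon=True).start()

master.write("user:1", "alice")  # the client only ever talks to the master
time.sleep(0.5)                  # the replication lag window: the standby may briefly lag
assert standby.data.get("user:1") == "alice"
```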

Master-Slave Replication

The master is responsible for read and write operations; the slave is responsible only for read operations, not for writes.

  1. Basic implementation

The following is the standard master-slave replication architecture:
[figure: master-slave replication architecture]

  2. Advantages and disadvantages analysis
  • Advantages:
    1. If the master fails, read-related business can continue to run on the slaves;
    2. The slaves serve read operations, making full use of the hardware's performance;

  • Disadvantages:
    1. The client must be aware of the master-slave relationship and send different operations to different machines (see the sketch at the end of this section);
    2. If the master-slave replication lag is large, data inconsistency will cause business problems;
    3. Manual intervention is required after a failure;

  • Usage scenarios:
    Read-heavy, write-light business scenarios use the master-slave replication architecture most, for example forums, BBS sites, and news websites, where read operations outnumber writes by a factor of 10 or even 100.
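
As noted above, the client must route writes to the master and reads to the slaves. A minimal read/write-splitting client might look like the following sketch; the Node class and the random routing policy are illustrative assumptions:

```python
import random

class Node:
    """A toy storage node backed by a dict."""
    def __init__(self, name):
        self.name = name
        self.data = {}

    def write(self, key, value):
        self.data[key] = value

    def read(self, key):
        return self.data.get(key)

class ReadWriteSplittingClient:
    """Sends writes to the master and spreads reads across the slaves."""
    def __init__(self, master, slaves):
        self.master = master
        self.slaves = slaves

    def write(self, key, value):
        self.master.write(key, value)

    def read(self, key):
        # Caveat from the disadvantages above: under replication lag,
        # a slave may return stale data for a freshly written key.
        return random.choice(self.slaves).read(key)

client = ReadWriteSplittingClient(Node("master"), [Node("slave1"), Node("slave2")])
client.write("post:1", "hello")   # goes to the master only
print(client.read("post:1"))      # may be stale/None until replication catches up
```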

Dual-Machine Switchover

Both the master-standby and master-slave replication schemes have two common problems:

  • 1. After the master fails, write operations cannot be performed;
  • 2. If the master cannot be recovered, a new master must be assigned manually;

Dual-machine switchover was created to solve these two problems, and it includes master-standby switchover and master-slave switchover schemes. Simply put, these schemes add a "switching" capability on top of the original ones: the system automatically decides which machine is the master and completes the role switch.

To implement a complete switchover scheme, several key design points must be considered:

State judgment between the master and standby machines

  • Channel of state transfer: do the master and standby connect to each other directly, or does a third party arbitrate?
  • Content of state detection: for example, whether the machine has lost power, whether the process still exists, whether responses are slow, and so on.

Switching decision

  • Switching timing: under what circumstances should the standby be promoted to master? For example, only after the master machine loses power, or as soon as the master process no longer exists?
  • Switching strategy: after the original master recovers from a failure, should the system switch back so that it becomes the master again, or should it automatically become the new standby?
  • Degree of automation: is the switchover fully automatic or semi-automatic?

Data conflict resolution
When the failed original master recovers, there may be data conflicts between the old and new masters. For example, a user inserted a record with ID 100 on the old master, but that record had not yet been replicated to the old standby when the failure occurred; after the old master recovers, how should the duplicate-ID records be handled?
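
Putting these design points together, a minimal (and deliberately naive) standby-side decision loop might look like this sketch; the thresholds and policy flags are assumptions, not any product's defaults:

```python
import time

# Assumed policy knobs -- real systems tune these carefully.
HEARTBEAT_TIMEOUT = 5.0   # seconds without a heartbeat before a miss is counted
SUSPECT_LIMIT = 3         # consecutive misses before switching (switching timing)
AUTO_SWITCH = True        # degree of automation: fully vs. semi-automatic

class StandbyController:
    """Runs on the standby; decides when to take over the master role."""

    def __init__(self):
        self.role = "standby"
        self.missed = 0
        self.last_heartbeat = time.monotonic()

    def on_heartbeat(self, master_state):
        # master_state could carry process liveness, response time, etc.
        self.last_heartbeat = time.monotonic()
        self.missed = 0

    def tick(self):
        """Called periodically, e.g. once per second."""
        if time.monotonic() - self.last_heartbeat > HEARTBEAT_TIMEOUT:
            self.missed += 1
            self.last_heartbeat = time.monotonic()  # count each timeout window once
        if self.role == "standby" and self.missed >= SUSPECT_LIMIT:
            if AUTO_SWITCH:
                self.role = "master"  # switching strategy: old master must rejoin as standby
            else:
                print("master suspected down; waiting for operator confirmation")
```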

Depending on the state transfer channel, the master/standby switchover architecture takes three common forms: interconnection, intermediary, and simulation.

Interconnection

Interconnection means that the master and standby machines directly establish a channel for state transfer.
[figure: interconnection-style switchover architecture]
The channel can work in either direction: the master can send its state to the standby, or the standby can fetch state information from the master. The channel can be shared with the data replication channel, or it can be an independent channel. To take full advantage of the switchover scheme's ability to determine the master automatically, the client also needs some corresponding changes. Common approaches are:
1. The master and standby machines share an address that is unique from the client's point of view, for example a virtual IP that the current master binds to;
2. The client records the addresses of both the master and the standby machines;
3. The standby machine can receive operation requests from the client but rejects them directly, on the grounds that "the standby does not provide external service";

Disadvantages of interconnection:
1. When the state transfer channel itself fails, the standby may conclude that the master has failed and promote itself, resulting in two masters;
2. Adding more channels to improve the reliability of state transfer only reduces the probability of channel failure; it cannot fundamentally eliminate this weakness, and the more channels there are, the more complex state decisions become.
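
A sketch of the standby-side check over such a channel, assuming a TCP endpoint on the master; note how ambiguous the failure signal is, which is exactly the two-masters risk described above:

```python
import socket

def probe_master(host, port, timeout=2.0):
    """Standby-side check over a dedicated state channel (here: a TCP connect).

    Key limitation of the interconnection style: a False return cannot
    distinguish "the master is down" from "only the channel is down" --
    promoting on this signal alone risks ending up with two masters.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Assumed address of the master's state-channel endpoint.
if not probe_master("master.internal", 9000):
    print("no state received: master failed, or just the channel? cannot tell")
```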

Intermediary

The intermediary style introduces a third-party intermediary besides the master and standby machines. The master and standby are not connected directly; each connects to the intermediary, and state information is passed through it.
[figure: intermediary-style switchover architecture]

Although the intermediary style is simpler in terms of state transfer and state decision, it raises the question of how to guarantee the high availability of the intermediary itself. If the intermediary goes down, the whole system degrades to a dual-standby state, and write-related services become unavailable.

MongoDB's Replica Set adopts the intermediary style; its architecture diagram is as follows:
[figure: MongoDB Replica Set architecture]

  • MongoDB(M), primary node: stores data
  • MongoDB(S), secondary node: stores data
  • MongoDB(A), arbiter node: does not store data
    The client connects to the primary and secondary nodes.

There are already relatively mature open-source intermediary solutions, such as ZooKeeper and Keepalived. ZooKeeper itself runs as a high-availability cluster, so it solves the reliability problem of the intermediary for us. In engineering practice, building the intermediary-style switchover architecture on ZooKeeper is recommended.
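
For example, a minimal master-election sketch using the kazoo Python client for ZooKeeper; the ensemble hosts, election path, and identifier are assumptions:

```python
import time
from kazoo.client import KazooClient

# Assumed ZooKeeper ensemble addresses and election path.
zk = KazooClient(hosts="zk1:2181,zk2:2181,zk3:2181")
zk.start()

def become_master():
    # Runs only while this process holds leadership. ZooKeeper's own
    # replicated ensemble keeps the intermediary itself highly available.
    print("elected master: bind the virtual IP / start accepting writes")
    while True:
        time.sleep(1)  # serve as master until the session is lost

election = zk.Election("/ha/master-election", identifier="server-1")
election.run(become_master)  # blocks; rejoins the election if leadership is lost
```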

Simulation

In the simulation style, no state data is transferred between the master and standby machines. Instead, the standby machine acts as a client, issuing simulated read and write operations to the master and judging the master's state from the responses to those operations.
[figure: simulation-style switchover architecture]

Compared with the interconnection style, simulation-style switchover is simpler to implement, because establishing and managing the state transfer channel is no longer needed.

Simplicity is both an advantage and a disadvantage. The state information obtained through simulated reads and writes is limited to response information (for example, HTTP 404, a timeout, a response time over 3 seconds), whereas the interconnection style can also transfer CPU load, I/O load, throughput, response time, and so on. State decisions based on such limited information may be wrong.
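
A sketch of such a simulated probe, assuming a hypothetical health-check URL on the master; note that only response information is available to the decision:

```python
import time
import urllib.error
import urllib.request

MASTER_PROBE_URL = "http://master.internal/health"  # assumed probe endpoint
FAILURE_LIMIT = 3

def simulated_probe(url, timeout=3.0):
    """The standby acts as a client: all it can observe is the response
    (status code, timeout, latency) -- not CPU load, I/O load, and so on."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            ok = resp.status == 200
    except (urllib.error.URLError, OSError):
        ok = False
    return ok and (time.monotonic() - start) < 3.0

failures = 0
while failures < FAILURE_LIMIT:
    failures = 0 if simulated_probe(MASTER_PROBE_URL) else failures + 1
    time.sleep(1.0)
print("master judged failed from responses alone; the judgment may be wrong")
```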

Master-Master Replication

Master-master replication means that both machines are masters: each replicates data to the other, and the client can choose either machine for read and write operations.

[figure: master-master replication architecture]

Master-master replication itself is generally much simpler: there is no state transfer, no state decision, and no state switchover. However, it constrains the usage scenarios: data must be replicated bidirectionally, and many kinds of data cannot be. For example:

  • 1. Auto-incremented user IDs generated at registration cannot be replicated bidirectionally; otherwise the two masters may generate the same ID;
  • 2. Inventory counts cannot be replicated bidirectionally: if one master decrements the stock and the other does too, replication will overwrite one of the updates;

Therefore, the master-master replication architecture places strict requirements on data design, and generally suits data that is temporary, lossable, or overwritable: for example, session data generated at login (regenerated by logging in again), user behavior logs (can be lost), and forum drafts (can be lost).
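
For instance, ID collisions in a two-master setup are commonly avoided by giving each master a disjoint ID sequence (the same idea as MySQL's auto_increment_increment / auto_increment_offset settings); a sketch:

```python
class InterleavedIdGenerator:
    """Each master allocates from a disjoint arithmetic sequence, so the two
    masters can never hand out the same ID. This mirrors the idea behind
    MySQL's auto_increment_increment / auto_increment_offset settings."""
    def __init__(self, node_id, num_nodes):
        assert 0 <= node_id < num_nodes
        self.next_id = node_id   # the offset of this node's sequence
        self.step = num_nodes    # the stride shared by all nodes

    def allocate(self):
        current = self.next_id
        self.next_id += self.step
        return current

host_a = InterleavedIdGenerator(node_id=0, num_nodes=2)  # yields 0, 2, 4, ...
host_b = InterleavedIdGenerator(node_id=1, num_nodes=2)  # yields 1, 3, 5, ...
ids_a = {host_a.allocate() for _ in range(5)}
ids_b = {host_b.allocate() for _ in range(5)}
assert not ids_a & ids_b  # no collisions, so bidirectional replication is safe
```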

Clusters and Partitions

Data Cluster

The master-standby, master-slave, and master-master architectures all share an implicit assumption: the master can store all of the data. But the storage and processing capacity of a single host is limited. When a single server cannot store or process the data, we must use multiple servers to store it; this is the data cluster architecture.

A cluster is a unified system composed of multiple machines, where "multiple" means at least 3; by comparison, the master-standby and master-slave architectures use 2 machines. According to the roles the machines play, clusters fall into two types: data centralized clusters and data dispersed clusters.

Data Centralized Cluster

A data centralized cluster is "1 master, multiple standbys" or "1 master, multiple slaves". Whether it is 1 master 1 slave, 1 master 1 standby, 1 master multiple standbys, or 1 master multiple slaves, data can only be written to the master, while read operations can be arranged flexibly along the lines of the master-standby and master-slave architectures. The figure below shows an architecture where all reads and writes go to the master:
[figure: data centralized cluster with all reads and writes on the master]
Although the architecture is similar to the two-machine schemes, the larger number of servers makes the overall complexity higher, specifically in the following respects:

  • 1. How does the master replicate data to the standby machines?

In the master-standby and master-slave architectures there is only one replication channel, while in the data centralized cluster architecture there are multiple. Multiple replication channels increase the replication pressure on the master; in some scenarios we need to consider how to reduce that pressure, or to reduce its impact on normal reads and writes.

Second, multiple replication channels may lead to data inconsistency among the standby machines; in some scenarios we need to check and repair the consistency between them.

  • 2. How do the standby machines detect the master's state?

In the master-standby and master-slave architectures, only one standby machine judges the master's state. In a data centralized cluster, multiple standby machines judge the master's state, and their conclusions may differ; reconciling these differing judgments is a complicated problem.

  • 3. After the master fails, how is a new master determined?

In the master-standby architecture, the single standby is simply promoted to master. In a data centralized cluster, any of several standby machines could be promoted, but in fact only one may be; choosing which standby becomes the new master, and coordinating between the standbys, becomes a complex problem.

A typical open-source data centralized cluster is ZooKeeper, which solves the problems above with the ZAB algorithm; the ZAB algorithm, however, is very complex.
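
To illustrate just the majority idea (this is not the ZAB algorithm, merely the quorum principle such coordination builds on), a toy vote-counting sketch:

```python
def should_promote(votes, quorum):
    """votes: standby name -> whether that standby considers the master dead.
    Promote only when a majority agrees, so a single standby's flaky view of
    the network cannot trigger a false switchover."""
    return sum(1 for failed in votes.values() if failed) >= quorum

standby_votes = {"standby-1": True, "standby-2": True, "standby-3": False}
quorum = len(standby_votes) // 2 + 1          # a majority of 3 is 2
print(should_promote(standby_votes, quorum))  # True: two of three agree
```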

Data Dispersed Cluster

A data dispersed cluster consists of multiple servers, each responsible for storing part of the data; at the same time, to improve hardware utilization, each server also backs up part of the data.

The complexity of a data dispersed cluster lies in how to distribute the data across the servers. The allocation algorithm needs to consider these design points (a sketch follows the list):

  • Balance
    The algorithm needs to keep the data partitions roughly balanced across servers; the number of partitions on one server must not be several times that on another.

  • Fault tolerance
    When some servers fail, the algorithm needs to reassign the data partitions that were on the failed servers to other servers.

  • Scalability
    When cluster capacity runs out and new servers are added, the algorithm can automatically migrate some data partitions to the new servers, keeping all servers balanced after the expansion.
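
Consistent hashing with virtual nodes is one well-known algorithm that addresses all three design points; a minimal sketch:

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Consistent hashing with virtual nodes. It keeps partitions roughly
    even (balance), reassigns only a failed server's share (fault tolerance),
    and moves roughly 1/N of the keys when a server joins (scalability)."""

    def __init__(self, servers, vnodes=100):
        self.ring = []  # sorted list of (hash, server) pairs
        self.vnodes = vnodes
        for server in servers:
            self.add_server(server)

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add_server(self, server):
        for i in range(self.vnodes):
            bisect.insort(self.ring, (self._hash(f"{server}#{i}"), server))

    def remove_server(self, server):
        self.ring = [(h, s) for h, s in self.ring if s != server]

    def locate(self, key):
        """Map a key to the first virtual node clockwise from its hash."""
        idx = bisect.bisect_right(self.ring, (self._hash(key), "")) % len(self.ring)
        return self.ring[idx][1]

ring = ConsistentHashRing(["server-1", "server-2", "server-3"])
print(ring.locate("user:42"))  # the server owning this key
ring.add_server("server-4")    # only about 1/4 of the keys move to the newcomer
```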

The difference between a data dispersed cluster and a data centralized cluster is that in a data dispersed cluster every server can handle read and write requests, so there is no role like the data centralized cluster's master that alone handles writes. However, a data dispersed cluster must have a role that executes the data allocation algorithm. This role can be an independent server, or a server elected by the cluster itself. If a machine elected by the cluster takes on the responsibility of data partition allocation, that server is generally also called the master, but note the difference between this "master" and the master in a data centralized cluster.

In Hadoop's implementation, an independent server is responsible for data partition allocation; this server is called the Namenode. Hadoop's data partition management architecture is as follows:
[figure: HDFS data partition management architecture (Namenode and Datanodes)]

Hadoop official website - HDFS architecture

The following explanation from the official Hadoop documentation illustrates this centralized style of data partition management:

  • HDFS adopts a master/slave architecture. An HDFS cluster consists of a single Namenode and a number of Datanodes.
  • The Namenode is a central server responsible for managing the file system namespace and client access to files.
  • There is typically one Datanode per node in the cluster, responsible for managing the storage on the node it runs on. HDFS exposes a file system namespace, and users store data in the form of files. Internally, a file is split into one or more blocks, and these blocks are stored on a set of Datanodes.
  • The Namenode performs file system namespace operations such as opening, closing, and renaming files and directories. It also determines the mapping of blocks to specific Datanodes.
  • Datanodes serve read and write requests from file system clients, and create, delete, and replicate blocks under the unified scheduling of the Namenode.

Unlike Hadoop, an Elasticsearch cluster allocates data partitions through an elected server, called the master node. Its data partition management architecture is as follows:
[figure: Elasticsearch data partition management architecture]
The responsibilities of the master node, as the official documentation describes them, are:

The master node is responsible for lightweight cluster-wide actions such as creating or deleting an index, tracking which nodes are part of the cluster, and deciding which shards to allocate to which nodes. It is important for cluster health to have a stable master node.
Elasticsearch official documentation - modules-node

In the data centralized cluster architecture, the client can only write data to the master; in the data dispersed cluster architecture, the client can read and write on any server. Because of this key difference, the two kinds of cluster suit different scenarios. Generally speaking, data centralized clusters suit scenarios with a small data volume and few cluster machines; for example, a ZooKeeper cluster usually recommends about 5 machines, and a single server can hold the whole data set. Data dispersed clusters, thanks to their good scalability, suit businesses with huge data volumes and large numbers of machines, such as Hadoop and HBase clusters, where large deployments reach hundreds or even thousands of servers.

Data Partition

The storage high-availability architectures discussed above are all designed around hardware failures: the main concern is how the system should respond when some hardware is damaged. But some far-reaching disasters or accidents can take out all the hardware at once. Extreme events such as the New Orleans flood, the US-Canada blackout, and the Los Angeles earthquake can paralyze all the infrastructure of a city or even a region. In such cases, a high-availability architecture designed for hardware failures no longer applies; we need a high-availability architecture designed for geographic-level failures. This is the background of the data partition architecture.

Different partitions are deployed in different geographic locations, and each partition stores part of the data; in this way, the massive impact of a geographic-level failure can be avoided.

Data Volume

The size of the data volume directly determines the complexity of the partition rules. For example, suppose MySQL is used to store the data and one MySQL server can store 500GB; then 2TB of data needs at least 4 MySQL servers. But 200TB of data is not simply a matter of growing to 400 MySQL servers: if 400 servers were managed the same way as 4, the complexity would change fundamentally, as follows:

Among 400 servers, one or two may fail in any given week. Locating the failed machines among 400 is often not easy, so the operations and maintenance complexity is high.

When new servers are added, partition-related configuration and even the partition rules must be modified, and in theory every modification may affect the 400 servers already running; accidentally applying a wrong configuration change is all too common in practice.

If such a large volume of data is concentrated in a single city, the risk is very high: a catastrophic failure such as a flood or blackout could lose all of it. The partition rules must therefore take geographic disaster recovery into account.
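
For concreteness, the capacity arithmetic above (using the assumed 500GB-per-server figure and ignoring replicas and headroom) is simply:

```python
import math

SERVER_CAPACITY_GB = 500   # assumed storage capacity of a single MySQL server

def servers_needed(total_gb):
    """Minimum server count, ignoring replicas and headroom."""
    return math.ceil(total_gb / SERVER_CAPACITY_GB)

print(servers_needed(2_000))     # 2TB   -> 4 servers
print(servers_needed(200_000))   # 200TB -> 400 servers: a different class of problem
```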

Partition Rules

Common partition granularities are intercontinental, country, and city:

  • Intercontinental partitions mainly serve different continents. Because intercontinental network latency is too high for online services, the intercontinental data centers usually either do not communicate with each other or act only as backups of one another.
  • Country partitions mainly serve different countries. Since languages, laws, and services differ between countries, country partitions are generally used only as backups.
  • City partitions are all within the same country or region, so network latency is low and the business is similar; the partitions serve traffic simultaneously, which supports geo-distributed active-active deployments.

Replication Rules

Since the data is scattered across multiple regions, the partition architecture must also consider the replication scheme.

There are three common replication rules: centralized, mutual backup, and independent.

  1. Centralized

Centralized backup means that there is one overall backup center, and all partitions back up their data to this center. Its basic structure is as follows:
[figure: centralized backup architecture]
The advantages and disadvantages are:

The design is simple: the partitions have no direct connections and do not affect each other.

Expansion is easy: to add a fourth partition (say, a Wuhan partition), you only need to replicate the Wuhan partition's data to the Xi'an backup center; the other partitions are unaffected.

The cost is high: an independent backup center must be built.

  2. Mutual backup

Mutual backup means that each partition backs up the data of another partition. Its basic structure is as follows:
[figure: mutual backup architecture]
The advantages and disadvantages are:

The design is relatively complex: besides storing its own business data, each partition also takes on backup duties, so the partitions are coupled and affect each other.

Expansion is troublesome. To add a Wuhan partition, you must repoint the Guangzhou partition's replication to Wuhan, then point Wuhan's replication to Beijing. And the Guangzhou data already backed up in Beijing is itself a problem: whether you migrate it, or keep Guangzhou's historical data in Beijing and back up only new data to Wuhan, either way is troublesome.

The cost is low: existing equipment is used directly.

  3. Independent

Independent backup means that each partition has its own independent backup center, and its basic structure is as follows:

[figure: independent backup architecture]

The advantages and disadvantages of the independent backup architecture are:

The design is simple: the partitions do not affect each other.

Expansion is easy: a newly added partition only needs to build its own backup center.

The cost is high: each partition needs its own independent backup center, and the site cost of a backup center dominates, so the independent style costs much more than the centralized style.
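
To summarize the three rules side by side, here is a toy contrast of the replication topologies as partition-to-backup-target maps; the city names follow the examples above, and the exact set of partitions is an assumption:

```python
centralized = {            # every partition ships to one overall backup center
    "beijing": "xian-backup-center",
    "guangzhou": "xian-backup-center",
    "shanghai": "xian-backup-center",
}

mutual_backup = {          # each partition backs up the next one, in a ring
    "beijing": "guangzhou",
    "guangzhou": "shanghai",
    "shanghai": "beijing",
}

independent = {            # each partition owns its own backup center
    "beijing": "beijing-backup-center",
    "guangzhou": "guangzhou-backup-center",
    "shanghai": "shanghai-backup-center",
}

# Adding a Wuhan partition is trivial for centralized and independent, but
# for mutual backup the ring links must be rewired (guangzhou -> wuhan -> beijing).
```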
