Distributed Architecture: Principle Analysis and Common Problem Solving

If you think the writing is decent, feel free to like, bookmark, and follow!
You can also visit my personal blog, which will keep being updated. Be friends with me! https://motongxue.cn


Distributed Architecture: Principle Analysis and Common Problem Solving

 

1. Distributed Terminology

1.1. Exceptions

server down

Memory errors, power failures, and similar faults can cause a server to go down. The node can then no longer work normally and is said to be unavailable.

Server downtime will cause the node to lose all memory information, so memory information needs to be saved to persistent media.

network anomaly

There is a special kind of network anomaly called a network partition: the nodes of the cluster are split into multiple areas; nodes within an area can communicate with each other, but the areas cannot communicate with one another.

disk failure

Disk failure is an anomaly with a relatively high probability of occurrence.

To tolerate it, data is stored redundantly on multiple servers.

1.2. Timeout

In a distributed system, in addition to the two states of success and failure, a request also has a timeout state.

The operations of the server can be designed to be idempotent, that is, executing an operation multiple times produces the same result as executing it once. With this approach, when a timeout occurs the client can simply keep retrying the request until it succeeds.
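
As a rough illustration (the function and variable names here are made up for the example), the following Python sketch shows why idempotence makes timeout retries safe: the operation overwrites state instead of accumulating it, so repeating it changes nothing.

```python
import time

def set_balance(store, account, value):
    """Idempotent server operation: executing it once or many times yields the same state."""
    store[account] = value  # overwrite, not increment, so retries are safe
    return True

def call_with_retry(op, *args, retries=5, timeout_backoff=0.5):
    """Keep re-sending the request on timeout; safe only because op is idempotent."""
    for attempt in range(retries):
        try:
            return op(*args)
        except TimeoutError:
            time.sleep(timeout_backoff * (attempt + 1))  # back off before retrying
    raise TimeoutError("operation still timing out after %d attempts" % retries)

store = {}
call_with_retry(set_balance, store, "acct-1", 100)
print(store)  # {'acct-1': 100}
```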

1.3. Metrics

performance

Common performance indicators are: throughput, response time.

Among them, throughput refers to the total number of requests that the system can handle in a certain period of time, usually the number of read operations or write operations per second; response time refers to the time consumed from sending a request to receiving the returned result.

These two indicators are often contradictory. It is often difficult to achieve low response time for a system that pursues high throughput. The explanation is as follows:

  • In a system without concurrency, the throughput is the reciprocal of the response time. For example, if the response time is 10 ms, the throughput is 100 req/s, so high throughput means low response time.
  • But in a concurrent system, a request that needs I/O resources must wait. Servers generally handle this asynchronously: a blocked request does not keep occupying CPU resources while it waits, which greatly improves CPU utilization. Continuing the example above, the response time of a single request in a non-concurrent system is 10 ms, but in a concurrent system the throughput can be well above 100 req/s. To pursue high throughput, the degree of concurrency is therefore usually increased. However, higher concurrency also raises the average response time, because a request can no longer be processed immediately and has to be handled concurrently with other requests.

availability

Availability refers to the ability of the system to provide normal services in the face of various abnormalities. It can be measured by the ratio of the system available time to the total time, and the availability of 4 nines means that the system is available 99.99% of the time.

consistency

Consistency can be understood from two perspectives: from the perspective of the client, whether the read and write operations meet certain characteristics; from the perspective of the server, whether multiple data copies are consistent.

scalability

Refers to the ability of the system to improve performance by expanding the scale of cluster servers. An ideal distributed system needs to achieve "linear scalability", that is, as the cluster size increases, the overall performance of the system will also increase linearly.

2. Data distribution

The data of a distributed storage system is distributed among multiple nodes, and commonly used data distribution methods include hash distribution and sequential distribution.

The horizontal sharding of the database (Sharding) is also a distributed storage method, and the following data distribution methods are also applicable to Sharding.

2.1. Hash distribution

Hash distribution is to calculate the hash value of the data and distribute it to different nodes according to the hash value. For example, if there are N nodes and the primary key of the data is key, then the serial number of the node assigned to the data is: hash(key)%N.

There is a problem in the traditional hash distribution algorithm: when the number of nodes changes, that is, the value of N changes, almost all data needs to be redistributed, which will lead to a large amount of data migration.
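
The migration cost is easy to see with a small experiment. The sketch below (illustrative only) assigns 10,000 keys to 4 nodes with hash(key) % N, then adds a fifth node and counts how many keys would have to move; typically around 80% of them do.

```python
import hashlib

def node_for(key, n_nodes):
    """Traditional hash distribution: node index = hash(key) % N."""
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return h % n_nodes

keys = [f"user:{i}" for i in range(10_000)]
before = {k: node_for(k, 4) for k in keys}   # 4 nodes
after = {k: node_for(k, 5) for k in keys}    # one node added, N = 5

moved = sum(1 for k in keys if before[k] != after[k])
print(f"{moved / len(keys):.0%} of keys must migrate")  # typically around 80%
```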

consistent hashing

Distributed Hash Table (DHT): the hash space [0, 2^n - 1] is treated as a hash ring, and each node is placed at a position on the ring. After a data object's hash value is computed, the object is stored on the first node encountered clockwise on the ring whose position is greater than or equal to that hash value.

The advantage of consistent hashing is that when adding or deleting a node, it will only affect the adjacent nodes in the hash ring. For example, if a new node X is added in the figure below, only data object C needs to be stored on node X again. There is no effect on nodes A, B, D.
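
Below is a minimal consistent-hash ring in Python, without the virtual nodes that production systems usually add; it is only meant to show the "first node clockwise" lookup and why adding a node X only affects its neighbor on the ring.

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Minimal consistent hash ring (no virtual nodes), for illustration only."""

    def __init__(self, nodes=()):
        self._ring = []                     # sorted list of (hash, node)
        for node in nodes:
            self.add_node(node)

    def _hash(self, key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add_node(self, node):
        bisect.insort(self._ring, (self._hash(node), node))

    def remove_node(self, node):
        self._ring.remove((self._hash(node), node))

    def node_for(self, key):
        h = self._hash(key)
        # first node clockwise whose position is >= the key's hash, wrapping around
        idx = bisect.bisect_left(self._ring, (h, ""))
        if idx == len(self._ring):
            idx = 0
        return self._ring[idx][1]

ring = ConsistentHashRing(["A", "B", "D"])
print(ring.node_for("object-C"))
ring.add_node("X")          # only keys between D and X (clockwise) move to X
print(ring.node_for("object-C"))
```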

2.2. Sequential distribution

Hash distribution destroys the order of data, sequential distribution does not.

Sequentially distributed data is divided into multiple consecutive parts and distributed to different nodes according to the ID or time of the data. For example, in the figure below, the User table ID ranges from 1 to 7000, and it can be divided into multiple sub-tables using sequential distribution. The corresponding primary key ranges from 1 to 1000, 1001 to 2000, ..., 6001 to 7000.

The advantage of sequential distribution is that the space of each node can be fully utilized, while hash distribution is difficult to control how much data a node stores.

But the sequential distribution needs to use a mapping table to store the mapping of data to nodes, and this mapping table is usually stored in a separate node. When the amount of data is very large, the mapping table also becomes larger, so one node may not be able to store the entire mapping table. Moreover, the overhead of maintaining the entire mapping table for a single node is very high, and the search speed will also be slowed down. In order to solve the above problems, an intermediate layer, namely the Meta table, is introduced to share the maintenance work of the mapping table.
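
A sketch of the idea, with a hypothetical three-entry range map standing in for the mapping/Meta table (in a real system this mapping would live on a separate node and be much larger):

```python
import bisect

# Hypothetical range map: each entry is (upper bound of the ID range, node name).
RANGE_MAP = [
    (1000, "node-1"),   # IDs 1..1000
    (2000, "node-2"),   # IDs 1001..2000
    (7000, "node-3"),   # IDs 2001..7000 (several sub-tables collapsed for brevity)
]

def node_for_id(user_id):
    """Look up the node that stores a given User ID via the range mapping table."""
    bounds = [upper for upper, _ in RANGE_MAP]
    idx = bisect.bisect_left(bounds, user_id)
    if idx == len(RANGE_MAP):
        raise KeyError(f"id {user_id} is outside all ranges")
    return RANGE_MAP[idx][1]

print(node_for_id(1500))   # node-2
print(node_for_id(6001))   # node-3
```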

2.3. Load balancing

There are many factors to measure the load, such as CPU, memory, disk and other resource usage, the number of read and write requests, and so on.

A distributed storage system should be able to balance load automatically: when the load of a node is high, some of its data is migrated to other nodes.

Each cluster has a master control node, and other nodes are working nodes. The master control node performs overall scheduling according to the global load information, and the working nodes regularly send heartbeat packets (Heartbeat) to send information related to node load to the master control node.

A newly added working node has a low load. If this is not controlled, the master control node will migrate a large amount of data to it all at once, leaving the node unable to work properly for a while. Load balancing therefore needs to be carried out smoothly, and it takes a fairly long time for newly added nodes to reach a relatively balanced state.

3. Distributed theory

3.1. CAP

A distributed system cannot simultaneously satisfy consistency (C: Consistency), availability (A: Availability), and partition tolerance (P: Partition Tolerance); at most two of the three can be satisfied at the same time.

 

 

consistency

Consistency refers to whether multiple data copies can maintain consistent characteristics.

Under the condition of consistency, the system can transfer from a consistent state to another consistent state after performing a data update operation.

After a data update to the system is successful, if all users can read the latest value, the system is considered to have strong consistency.

availability

Availability refers to the ability of a distributed system to provide normal services in the face of various abnormalities. It can be measured by the ratio of the system's available time to the total time. The availability of 4 nines means that the system is available 99.99% of the time.

Under the condition of availability, the services provided by the system are always available, and the results can always be returned within a limited time for each operation request of the user.

Partition tolerance

Network partition means that the nodes in the distributed system are divided into multiple areas, and each area can communicate within, but the areas cannot communicate with each other.

Under the condition of partition tolerance, when a distributed system encounters any network partition failure, it still needs to provide consistent and available services to the outside world, unless the entire network environment fails.

trade off

Partition tolerance is essential in a distributed system because the network needs to always be assumed to be unreliable. Therefore, the CAP theory is actually a trade-off between availability and consistency.

Availability and consistency are often in conflict, and it is difficult to satisfy both at the same time. When synchronizing data between multiple nodes,

  • In order to ensure consistency (CP), it is necessary to make all nodes offline and become unavailable, waiting for the synchronization to complete;
  • In order to ensure availability (AP), the data of all nodes is allowed to be read during the synchronization process, but the data may be inconsistent.

3.2. BASE

BASE is an acronym for the three phrases Basically Available, Soft State, and Eventually Consistent.

The BASE theory is the result of the trade-off between consistency and availability in CAP. Its core idea is: even if strong consistency cannot be achieved, each application can use an appropriate method, according to its own business characteristics, to make the system reach eventual consistency.

 

 

basically available

It means that when the distributed system fails, the core is guaranteed to be available, and some availability is allowed to be lost.

For example, when an e-commerce company conducts promotions, in order to ensure the stability of the shopping system, some consumers may be directed to a downgraded page.

soft state

It means the data in the system is allowed to be in an intermediate state, and this intermediate state is considered not to affect the overall availability of the system; in other words, replicas on different nodes are allowed to lag while they synchronize.

eventual consistency

Eventual consistency emphasizes that all data copies in the system can finally reach a consistent state after a period of synchronization.

ACID requires strong consistency and is usually used in traditional database systems. BASE, by contrast, requires only eventual consistency and trades strong consistency for availability; it is usually used in large-scale distributed systems.

In an actual distributed scenario, different business units and components have different requirements for consistency, so ACID and BASE are often used together.

4. Distributed transaction issues

4.1. Two-phase commit (2PC)

Two-phase commit (Two-phase Commit, 2PC)

It is mainly used to implement distributed transactions. Distributed transactions refer to transaction operations that span multiple nodes and require the ACID characteristics of transactions to be met.

2PC introduces a coordinator (Coordinator) to schedule the behavior of the participants and ultimately decide whether the participants actually commit the transaction.

working process

Prepare phase

The coordinator asks the participant whether the transaction is executed successfully, and the participant sends back the transaction execution result.

 

 

Commit phase

If the transaction is successfully executed on each participant, the transaction coordinator sends a notification to the participants to commit the transaction; otherwise, the coordinator sends a notification to the participants to roll back the transaction.

 

 

Note that during the prepare phase, a participant executes the transaction but does not commit it; it commits or rolls back only after receiving the coordinator's notification in the commit phase.
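
A toy, single-process sketch of the coordinator logic (it deliberately ignores timeouts, retries, and coordinator failure, which are exactly the weak points discussed next; all names are illustrative):

```python
class Participant:
    """Toy 2PC participant: executes the transaction locally but holds the commit."""

    def __init__(self, name, will_succeed=True):
        self.name, self.will_succeed = name, will_succeed
        self.state = "init"

    def prepare(self):
        # execute the transaction but do NOT commit yet
        self.state = "prepared" if self.will_succeed else "failed"
        return self.will_succeed

    def commit(self):
        self.state = "committed"

    def rollback(self):
        self.state = "rolled_back"

def two_phase_commit(participants):
    """Coordinator: phase 1 asks everyone to prepare, phase 2 commits or rolls back."""
    if all(p.prepare() for p in participants):      # prepare phase
        for p in participants:                       # commit phase
            p.commit()
        return "committed"
    for p in participants:
        p.rollback()
    return "rolled_back"

print(two_phase_commit([Participant("db1"), Participant("db2")]))                      # committed
print(two_phase_commit([Participant("db1"), Participant("db2", will_succeed=False)]))  # rolled_back
```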

Problems

synchronous blocking

All transaction participants are in a synchronous blocking state while waiting for other participants to respond, and cannot perform other operations.

single point problem

The coordinator plays a critical role in 2PC, so its failure has a large impact. In particular, if it fails during phase 2, all participants remain blocked waiting and cannot complete other operations.

data inconsistency

In phase 2, if the network is abnormal and the coordinator's Commit message reaches only some participants, then only those participants commit the transaction, leaving the system data inconsistent.

too conservative

The failure of any node will cause the failure of the entire transaction, and there is no perfect fault tolerance mechanism.

2PC pros and cons

  • Advantages: 2PC guarantees strong consistency of data as far as possible and is suitable for key operations that require high data consistency (in fact, 100% strong consistency still cannot be guaranteed).
  • Disadvantages: The implementation is complicated, availability is sacrificed, and performance suffers greatly; it is not suitable for high-concurrency, high-performance scenarios.

4.2. Compensation transaction (TCC)

The core idea of a compensating transaction (TCC, Try-Confirm-Cancel) is: for every operation, a corresponding confirmation and compensation (undo) operation must be registered. It is divided into three phases:

  1. The Try phase mainly checks the business system and reserves resources.
  2. The Confirm phase mainly confirms and commits on the business system. It assumes that if Try succeeded, Confirm will not fail; that is, as long as Try succeeds, Confirm must succeed.
  3. The Cancel phase cancels the business operation and releases the reserved resources when an error during execution requires a rollback.

For example, suppose Bob wants to transfer money to Smith, the idea is probably:

  1. First, in the Try phase, the remote interface is called to freeze the money of both Smith and Bob.
  2. In the Confirm phase, the remote transfer operation is executed, and once the transfer succeeds the frozen amounts are released.
  3. If step 2 succeeds, the transfer is complete. If step 2 fails, the unfreezing method (Cancel) corresponding to the remote freezing interface is called. A sketch of the three phases follows below.
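
A minimal in-process sketch of the three TCC phases for this transfer. In reality Try/Confirm/Cancel would be remote interfaces exposed by the account service; the classes and amounts here are made up for the example.

```python
class Account:
    def __init__(self, balance):
        self.balance = balance
        self.frozen = 0

def try_phase(src, amount):
    """Try: check and reserve resources (freeze the money)."""
    if src.balance < amount:
        return False
    src.balance -= amount
    src.frozen += amount
    return True

def confirm_phase(src, dst, amount):
    """Confirm: actually perform the transfer using the reserved resources."""
    src.frozen -= amount
    dst.balance += amount

def cancel_phase(src, amount):
    """Cancel: release the reserved resources (unfreeze)."""
    src.frozen -= amount
    src.balance += amount

bob, smith = Account(100), Account(0)
if try_phase(bob, 30):
    try:
        confirm_phase(bob, smith, 30)
    except Exception:
        cancel_phase(bob, 30)   # compensate if Confirm fails
print(bob.balance, smith.balance)  # 70 30
```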

Advantages and disadvantages of TCC

  • Advantages: Compared with 2PC, the implementation and process are simpler, but data consistency is also weaker than with 2PC.
  • Disadvantages: The drawbacks are fairly obvious: steps 2 and 3 can still fail. TCC is a compensation mechanism at the application layer, so programmers must write a lot of compensation code, and in some scenarios certain business flows are hard to define and handle with TCC.

4.3. Local message table (asynchronous guarantee)

The local message table and the business data table are in the same database, so that local transactions can be used to ensure that the operations on these two tables meet the transaction characteristics.

  1. After one side of the distributed transaction finishes writing its business data, it writes a message into the local message table; the local transaction guarantees that this message is recorded (see the sketch below).
  2. The messages in the local message table are then forwarded to a message queue (MQ) such as Kafka. If forwarding succeeds, the message is deleted from the local message table; otherwise forwarding is retried.
  3. The other side of the distributed transaction reads the message from the message queue and executes the operation it describes.
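
A small sketch with SQLite standing in for the business database and a lambda standing in for the MQ producer; the table and function names are made up for the example. The key point is that the business row and its outgoing message are written in one local transaction.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, item TEXT)")
conn.execute("CREATE TABLE local_messages (id INTEGER PRIMARY KEY, payload TEXT, sent INTEGER)")

def create_order_with_message(item):
    """Write the business row and its outgoing message in ONE local transaction."""
    with conn:  # commits both inserts atomically, or rolls both back
        conn.execute("INSERT INTO orders (item) VALUES (?)", (item,))
        conn.execute("INSERT INTO local_messages (payload, sent) VALUES (?, 0)",
                     (f"order created: {item}",))

def forward_pending_messages(publish):
    """Background job: push unsent messages to the MQ, delete them on success."""
    rows = conn.execute("SELECT id, payload FROM local_messages WHERE sent = 0").fetchall()
    for msg_id, payload in rows:
        if publish(payload):  # e.g. send to Kafka; retried next round on failure
            conn.execute("DELETE FROM local_messages WHERE id = ?", (msg_id,))
            conn.commit()

create_order_with_message("book")
forward_pending_messages(lambda payload: print("published:", payload) or True)
```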

 

 

This scheme follows the BASE theory and uses eventual consistency.

The local message table uses local transactions to implement distributed transactions, and uses message queues to ensure eventual consistency.

Advantages and disadvantages of local message tables

  • Advantages: A very classic implementation that avoids distributed transactions and achieves eventual consistency.
  • Disadvantage: The message table will be coupled to the business system. If there is no packaged solution, there will be a lot of chores to deal with.

4.4. MQ transaction message

Some third-party MQs support transactional messages, such as RocketMQ, and the way they support them is similar to two-phase commit. However, many mainstream MQs on the market, such as RabbitMQ and Kafka, do not support transactional messages.

Taking Ali's RocketMQ middleware as an example, the idea is roughly as follows:

  1. Send a Prepared (half) message; the address of the message is returned.
  2. Execute the local transaction.
  3. Use the address obtained in the first step to access the message and modify its state.

In other words, the business method submits two requests to the message queue: one to send the message and one to confirm it. If the confirmation fails to be sent, RocketMQ periodically scans the transaction messages in the broker cluster; when it finds a Prepared message, it checks back with the message sender. The producer therefore needs to implement a check interface, and based on the sender's policy RocketMQ decides whether to roll the message back or continue delivering confirmation messages. This ensures that sending the message succeeds or fails together with the local transaction.
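
The following is a conceptual sketch of that flow only; it does not use the real RocketMQ client API, and the broker and method names are invented for illustration.

```python
class FakeBroker:
    """Stand-in for an MQ broker that understands half (Prepared) messages."""

    def __init__(self):
        self.prepared = {}       # msg_id -> payload, invisible to consumers
        self.delivered = []      # messages visible to consumers

    def send_prepared(self, payload):
        msg_id = len(self.prepared) + 1
        self.prepared[msg_id] = payload
        return msg_id            # the "address" of the half message

    def confirm(self, msg_id, commit):
        # a real broker would also periodically call back a check interface
        # on the producer for messages that stay in `prepared` too long
        payload = self.prepared.pop(msg_id)
        if commit:
            self.delivered.append(payload)

def transfer_with_transactional_message(broker, do_local_tx):
    msg_id = broker.send_prepared("deduct 30 from Bob")   # step 1: half message
    try:
        do_local_tx()                                      # step 2: local transaction
        broker.confirm(msg_id, commit=True)                # step 3: commit the message
    except Exception:
        broker.confirm(msg_id, commit=False)               # roll the message back

broker = FakeBroker()
transfer_with_transactional_message(broker, lambda: None)
print(broker.delivered)   # ['deduct 30 from Bob']
```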

Advantages and disadvantages of MQ transaction messages

  • Advantages: Achieves eventual consistency and does not need to rely on local database transactions.
  • Disadvantages: It is difficult to implement, and mainstream MQ does not support it.

5. Consensus issues

5.1. Paxos

Paxos is used to solve the consensus problem: among the values proposed by multiple nodes, the algorithm guarantees that exactly one value is chosen.

There are three main types of nodes:

  • Proposer: propose a value;
  • Acceptor: vote on each proposal;
  • Learner: is informed of the voting result and does not participate in the voting process.

The algorithm needs to meet the constraints of safety and liveness (in fact, these two basic properties should be considered by most distributed algorithms):

  • safety: ensures that the result of the resolution is correct and unambiguous:
      • the chosen value (resolution) must be one of the values proposed by a proposer;
      • in a single execution instance, only one value is finally approved (chosen), i.e., a value accepted by a majority of acceptors becomes the resolution.
  • liveness: guarantees that the resolution process completes within a finite time:
      • a decision is eventually reached, and learners can learn the chosen decision.

 

The basic process includes the proposer making a proposal, first seeking the support of the majority of acceptors, and when more than half of the acceptors support it, sending the final result to everyone for confirmation. A potential problem is that the proposer fails during this process, which can be solved by a timeout mechanism. In an extremely coincidental situation, every time the proposer of a new round of proposals happens to fail, the system will never be able to reach a consensus (the probability is very small).

Paxos guarantees that the system can reach consensus as long as more than half of the nodes are functioning normally.

Single proposer + multiple receivers

If the system restricts a single specific node to be the proposer, then consensus can certainly be reached (there is only one proposal, which is either accepted or fails). As long as the proposer receives votes from a majority of acceptors, the proposal can be considered approved, because there are no competing proposals in the system.

But once the proposer fails, the system cannot work.

Multiple proposers + single receiver

Limit a node as a receiver. In this case, consensus is also easy to reach. The receiver receives multiple proposals, chooses the first proposal as a resolution, and rejects the subsequent proposals.

The flaw, again, is the single point of failure: the acceptor may fail, or the node that made the first proposal may fail.

The above two situations are actually similar to the master-slave mode. Although they are not so reliable, they are widely used because of their simple principles.

Some challenges arise when both proposers and acceptors are generalized to multiple situations.

Multiple proposers + multiple receivers

Since limiting the system to a single proposer or a single acceptor can fail, multiple proposers and multiple acceptors must be allowed, and the problem suddenly becomes much more complicated.

One situation is that only one proposer exists within a given time segment (such as one proposal cycle), which degenerates to the single-proposer case. A mechanism is then needed to ensure a proposer is chosen correctly, for example by time, by sequence, or by having every node pick a number and comparing them. Given the large amount of work a distributed system has to handle, this process must be as efficient as possible, and designing a mechanism that satisfies this is very difficult.

Another situation is to allow multiple proposers within the same time slice. A node may then receive multiple proposals, so how does it distinguish them? The approach of accepting only the first proposal and rejecting the rest no longer applies. Proposals therefore need to carry distinct serial numbers, and nodes decide which proposal to accept based on the serial number, for example by accepting the proposal with the larger serial number (which usually means accepting the newer proposal, since the older proposer has a higher probability of having failed).

How are serial numbers assigned to proposals? One possible solution is to give each node its own proposal-number interval so that numbers from different nodes never conflict; to keep the numbers increasing over time, a timestamp can be used as a prefix field.
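
One way such a numbering scheme could look is sketched below (timestamp as the high-order prefix, node id as the low-order suffix; the bit split and names are purely illustrative, not part of any specific Paxos implementation):

```python
import time

NODE_BITS = 10          # low-order bits reserved for the node id

def next_proposal_number(node_id, last_number=0):
    """Globally unique, roughly increasing proposal number:
    millisecond timestamp as the prefix, node id as the suffix,
    so the intervals of different nodes never collide."""
    candidate = (int(time.time() * 1000) << NODE_BITS) | node_id
    # never go backwards, even if the local clock does
    return max(candidate, last_number + 1)

n1 = next_proposal_number(node_id=1)
n2 = next_proposal_number(node_id=2)
print(n1 != n2)                                             # True: different nodes never collide
print(next_proposal_number(node_id=1, last_number=n1) > n1) # True: numbers keep increasing
```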

In addition, even if a proposer receives votes from a majority of acceptors, the proposal is not guaranteed to pass, because other proposals may appear in the system during the process.

5.2. Raft

The Raft algorithm is a simplified implementation of the Paxos algorithm.

Raft includes three roles: leader, candidate, and follower. The basic process is:

  • Leader election  - after a randomized timeout each candidate starts an election, and the node that receives the most votes in the latest round becomes the leader;
  • Log synchronization  - the leader locates the latest log record in the system and forces all followers to replicate up to that record;

Note: log here does not refer to log messages, but records of various events.

Single Candidate Election

There are three kinds of nodes: Follower, Candidate, and Leader. The Leader periodically sends heartbeat packets to the Followers. Each Follower sets a random election timeout, generally 150 ms to 300 ms; if it does not receive the Leader's heartbeat within this time, it becomes a Candidate and enters the election stage.
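
A minimal sketch of this randomized election timeout (the class and values are illustrative, not a full Raft node):

```python
import random
import time

ELECTION_TIMEOUT_RANGE = (0.150, 0.300)   # 150 ms .. 300 ms, as described above

class Follower:
    def __init__(self, name):
        self.name = name
        self.reset_election_timer()

    def reset_election_timer(self):
        """Called whenever a heartbeat from the Leader arrives."""
        self.deadline = time.monotonic() + random.uniform(*ELECTION_TIMEOUT_RANGE)

    def on_heartbeat(self):
        self.reset_election_timer()

    def should_become_candidate(self):
        """No heartbeat before the randomized deadline -> start an election."""
        return time.monotonic() >= self.deadline

f = Follower("A")
time.sleep(0.35)                      # the Leader stays silent longer than any timeout
print(f.should_become_candidate())    # True
```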

  • The figure below shows the initial stage of a distributed system. At this time, there are only Followers and no Leaders. Follower A waits for a random election timeout, but does not receive the heartbeat packet from the Leader, so it enters the election phase.

 

 

  • At this point A sends a voting request to all other nodes.

 

 

  • Other nodes will reply to the request. If more than half of the nodes reply, the Candidate will become the Leader.

 

 

  • Afterwards, the Leader will periodically send a heartbeat packet to the Follower, and the Follower will start timing again after receiving the heartbeat packet.

 

 

Multiple Candidate elections

  • If multiple Followers become Candidates and receive the same number of votes, a new round of voting is needed. For example, Candidate B and Candidate D in the figure below both get two votes, so voting must restart.

 

 

  • When voting is restarted, since each node sets a different random election timeout, the probability that multiple candidates will appear again next time and get the same number of votes is very low.

 

 

sync log

  • Modifications from the client will be passed to the Leader. Note that the change has not been committed, it is just written to the log.

 

 

  • Leader will replicate the modification to all Followers.

 

 

  • The Leader waits until a majority of the Followers have written the change before committing it.

 

 

  • Then the Leader notifies all Followers to commit the modification; at this point the values on all nodes have reached consensus. The sketch below summarizes this majority-commit rule.
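
A toy sketch of the "commit only after a majority has the entry" rule; real Raft also tracks terms, log indexes, and retries, all of which are omitted here, and the class names are made up for the example.

```python
class FollowerNode:
    def __init__(self):
        self.log, self.committed = [], []

    def append(self, entry):          # in a real cluster this is an RPC that may fail
        self.log.append(entry)
        return True

    def commit(self, entry):
        self.committed.append(entry)

def leader_replicate_and_commit(entry, followers):
    """Leader replicates the entry and commits only after a majority has written it."""
    acks = 1                                      # the leader's own copy counts
    for follower in followers:
        if follower.append(entry):
            acks += 1
    majority = (len(followers) + 1) // 2 + 1      # majority of the whole cluster
    if acks < majority:
        return "pending"                          # keep retrying replication
    for follower in followers:                    # notify everyone to commit
        follower.commit(entry)
    return "committed"

print(leader_replicate_and_commit("x = 5", [FollowerNode(), FollowerNode()]))  # committed
```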

 

 

6. Distributed cache problem

6.1. Cache Avalanche

Cache avalanche refers to the situation where, in a high-concurrency scenario, the original cache entries expire before new ones are in place (for example, when entries are given the same expiration time and a large portion of the cache expires at the same moment). All requests that should have hit the cache then go to the database, putting enormous pressure on the database CPU and memory and, in severe cases, bringing the database down. This triggers a chain reaction that can bring the whole system down.

solution:

  • Use locks or queues to ensure that a large number of threads do not read and write the database at the same moment, so that a flood of concurrent requests does not fall on the underlying storage system when caches expire.
  • A simpler solution is to spread out the cache expiration times instead of setting every entry to the same 5 or 10 minutes; for example, add a random value of 1 to 5 minutes to the base expiration time, so that entries rarely expire together and mass invalidation events are unlikely (see the sketch below).
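
A minimal sketch of the second point, adding random jitter to the TTL before writing to the cache (the redis-py style call in the comment only illustrates where the TTL would be used):

```python
import random

BASE_TTL_SECONDS = 600          # the nominal 10-minute cache lifetime

def ttl_with_jitter(base=BASE_TTL_SECONDS, max_jitter=300):
    """Spread expirations out by adding a random 1..max_jitter second offset."""
    return base + random.randint(1, max_jitter)

# e.g. with a redis-py style client (shown for illustration only):
# redis_client.set("product:42", payload, ex=ttl_with_jitter())
print([ttl_with_jitter() for _ in range(3)])   # three different expiration times
```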

The avalanche effect generated when the cache fails puts all requests on the database, which can easily reach the bottleneck of the database and cause the service to fail to be provided normally. Try to avoid this scenario from happening.

6.2. Cache penetration

Cache penetration refers to queries for data that does not exist in the database and therefore naturally does not exist in the cache either. Each such query misses the cache and goes to the database, only to return empty (two useless lookups). The request thus bypasses the cache and hits the database directly, which is also the frequently mentioned cache hit-rate problem.

When traffic is heavy, this keeps hitting the DB and can easily bring the service down.

solution:

  1. Add a step to the encapsulated cache SET and GET logic: if the queried KEY does not exist, set a marker KEY prefixed with that KEY. On later queries, check the marker KEY first; if it exists, return an agreed-upon non-false or NULL value and let the application handle it accordingly, so the request does not penetrate the cache layer. Of course, the expiration time of this marker KEY must not be too long.
  2. If a query returns an empty result (whether because the data does not exist or because the system failed), still cache the empty result, but with a very short expiration time, usually no more than a few minutes (a sketch of this follows below).
  3. Use a Bloom filter: hash all data that could possibly exist into a sufficiently large bitmap; a query for data that definitely does not exist is intercepted by the bitmap, avoiding query pressure on the underlying storage system.
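
A sketch of solution 2, caching empty results with a short TTL. An in-memory dict stands in for the cache server, and the sentinel value and TTLs are illustrative.

```python
import time

CACHE = {}                      # stand-in for Redis: key -> (value, expires_at)
NULL_SENTINEL = "__NULL__"      # agreed marker meaning "this key has no data"

def get_with_null_caching(key, load_from_db, null_ttl=120, ttl=600):
    entry = CACHE.get(key)
    if entry and entry[1] > time.time():
        value = entry[0]
        return None if value == NULL_SENTINEL else value   # hit, possibly a cached "miss"
    value = load_from_db(key)
    if value is None:
        # cache the empty result briefly so repeated misses stop hitting the DB
        CACHE[key] = (NULL_SENTINEL, time.time() + null_ttl)
        return None
    CACHE[key] = (value, time.time() + ttl)
    return value

print(get_with_null_caching("user:999", lambda k: None))   # first call goes to the DB
print(get_with_null_caching("user:999", lambda k: None))   # served from the cached empty result
```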

6.3. Cache Warming

Cache warming is a fairly common concept and should be easy to understand: after the system goes online, relevant cache data is loaded into the cache system directly. This avoids the problem of first querying the database and then populating the cache when the user requests the data; users query cache data that has already been warmed up in advance.

solution:

  1. Write a cache refresh page directly, and manually operate it when going online;
  2. The amount of data is not large, and it can be loaded automatically when the project starts;
  3. Regularly refresh the cache;

6.4. Cache update

In addition to the cache invalidation strategy that comes with the cache server (Redis has 6 default strategies to choose from), we can also customize the cache elimination according to specific business needs. There are two common strategies:

  1. Regularly clean up expired caches;
  2. When a user requests it, it is judged whether the cache used by the request has expired. If it expires, it will go to the underlying system to obtain new data and update the cache.

Both have their own advantages and disadvantages. The drawback of the first is that maintaining a large number of cache keys is troublesome; the drawback of the second is that the expiration check has to run on every user request, which makes the logic relatively complicated. Which solution to use depends on your own application scenario (a sketch of both follows below).
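
A sketch of the two strategies side by side, using an in-memory dict as a stand-in for the cache server; function names and TTLs are illustrative.

```python
import time

CACHE = {}   # key -> (value, expires_at)

def get(key, rebuild, ttl=300):
    """Strategy 2: check expiration on each read and refresh lazily."""
    entry = CACHE.get(key)
    if entry is None or entry[1] <= time.time():   # missing or expired
        value = rebuild(key)                       # fetch fresh data from the underlying system
        CACHE[key] = (value, time.time() + ttl)
        return value
    return entry[0]

def purge_expired():
    """Strategy 1: a periodic job that removes expired keys."""
    now = time.time()
    for key in [k for k, (_, exp) in CACHE.items() if exp <= now]:
        del CACHE[key]

print(get("config:site", lambda k: {"theme": "dark"}))
purge_expired()
```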

6.5. Cache downgrade

When traffic surges, when a service has problems (such as slow or missing responses), or when non-core services affect the performance of the core flow, the system must still remain available, even if the service is degraded. The system can downgrade automatically based on some key data, or switches can be configured to allow manual downgrade.

The ultimate goal of downgrading is to keep core services available, even if in a lossy way. Note that some services cannot be downgraded (such as adding to the shopping cart or checkout).


Updated on December 22, 2020
