Introduction to Distributed Solutions

  To overcome bottlenecks such as single-machine capacity, performance, and high availability, a variety of distributed technologies and solutions have emerged: distributed file systems, distributed computing, distributed messaging, distributed transactions, distributed databases, and so on. Each of them solves a different problem in a different scenario. This article briefly explains their principles and roles. My understanding may not be deep enough and much of the material is excerpted, so corrections are welcome.

1. Principles and tools

  • CAP principle: The CAP principle, also known as the CAP theorem, refers to Consistency, Availability, and Partition tolerance in a distributed system. It states that a distributed system can guarantee at most two of these three properties at the same time; it is impossible to satisfy all three.
  • BASE: BASE is the result of trading off consistency against availability in CAP. It comes from the distributed practice of large-scale Internet systems and gradually evolved from the CAP theorem. Its core idea is that even if strong consistency cannot be achieved, each application can adopt an appropriate approach, according to its own business characteristics, to make the system eventually consistent. BASE stands for three elements: Basically Available, meaning the distributed system is allowed to lose part of its availability when an unpredictable failure occurs; Soft state, meaning the system is allowed to have intermediate states that do not affect its overall availability; and Eventually consistent, meaning all data replicas reach a consistent state after a period of time.
  • PAXOS algorithm: The problem Paxos solves is how a distributed system can reach agreement on a value (a resolution). A typical scenario: in a distributed database system, if every node starts from the same initial state and executes the same sequence of operations, they all end up in the same consistent state. To ensure that every node executes the same command sequence, a consensus algorithm must be run for each instruction so that every node sees the same instructions. A general consensus algorithm can be applied in many scenarios and is an important problem in distributed computing, so research on consensus algorithms has not stopped since the 1980s. There are two models of node communication: shared memory and message passing. Paxos is a consensus algorithm based on the message-passing model.
  • Zookeeper service: ZooKeeper is a distributed, open-source coordination service for distributed applications; it provides consistency services such as configuration maintenance, naming, distributed synchronization, and group services. ZooKeeper is based on the Fast Paxos algorithm. Basic Paxos has a livelock problem: when multiple proposals are submitted in an interleaved way, they can block each other and none gets accepted. Fast Paxos optimizes this by electing a leader, and only the leader may submit proposals. A minimal usage sketch follows this list.
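
As a rough illustration of the "configuration maintenance" use case mentioned above, the sketch below uses ZooKeeper's Java client to publish a config value under a znode and read it back with a watch. The connection string, znode path, and config value are made-up placeholders; this is a minimal sketch, not a production setup.

```java
import java.nio.charset.StandardCharsets;
import java.util.concurrent.CountDownLatch;

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZkConfigDemo {
    public static void main(String[] args) throws Exception {
        CountDownLatch connected = new CountDownLatch(1);
        // "localhost:2181" is a placeholder connection string; 30 s session timeout.
        ZooKeeper zk = new ZooKeeper("localhost:2181", 30_000, event -> {
            if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
                connected.countDown();
            }
        });
        connected.await();

        // Configuration maintenance: publish a config value under a znode.
        String path = "/demo-config";
        byte[] value = "timeout=500ms".getBytes(StandardCharsets.UTF_8);
        if (zk.exists(path, false) == null) {
            zk.create(path, value, ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        } else {
            zk.setData(path, value, -1); // -1 = ignore the version check
        }

        // Any client can read the value and set a watch to be notified of changes.
        byte[] read = zk.getData(path, true, null);
        System.out.println(new String(read, StandardCharsets.UTF_8));

        zk.close();
    }
}
```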

2. Distributed File System (HDFS)

 Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. It has much in common with existing distributed file systems, but the differences are also significant. HDFS is highly fault-tolerant and is designed to be deployed on low-cost machines. It provides high-throughput access to data, which makes it well suited to applications with large data sets. HDFS relaxes a few POSIX constraints to enable streaming access to file system data.
 HDFS uses a master/slave architecture. An HDFS cluster consists of a single Namenode and a number of Datanodes. The Namenode is a central server that manages the file system namespace and client access to files. There is usually one Datanode per node in the cluster, and it manages the storage attached to the node it runs on. HDFS exposes a file system namespace and lets users store data in files. Internally, a file is split into one or more data blocks, which are stored on a set of Datanodes. The Namenode executes namespace operations such as opening, closing, and renaming files or directories, and it determines the mapping of blocks to specific Datanodes. Datanodes serve read and write requests from file system clients, and perform block creation, deletion, and replication under instruction from the Namenode.
 Namenode and Datanode are designed to run on ordinary commodity machines, which typically run the GNU/Linux operating system (OS). HDFS is written in Java, so any machine that supports Java can run a Namenode or Datanode; thanks to the portability of Java, HDFS can be deployed on many types of machines. A typical deployment has one machine running a single Namenode instance, while each of the other machines in the cluster runs one Datanode instance. The architecture does not preclude running multiple Datanodes on one machine, but such cases are rare.
 The existence of a single Namenode in the cluster greatly simplifies the system architecture. The Namenode is the arbitrator and repository of all HDFS metadata, and the system is designed so that user data never flows through the Namenode.
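
To make the client's view of this architecture concrete, here is a minimal sketch using the Hadoop Java FileSystem API to write and read a file. The Namenode address and file path are placeholders; in practice the address comes from core-site.xml.

```java
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder Namenode address; normally picked up from core-site.xml.
        conf.set("fs.defaultFS", "hdfs://namenode-host:8020");

        try (FileSystem fs = FileSystem.get(conf)) {
            Path path = new Path("/user/demo/hello.txt"); // placeholder path

            // The client asks the Namenode where to place blocks, then streams
            // the bytes to Datanodes; user data never flows through the Namenode.
            try (FSDataOutputStream out = fs.create(path, true /* overwrite */)) {
                out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
            }

            // Reads likewise fetch block locations from the Namenode and then
            // read the block contents from the Datanodes.
            try (FSDataInputStream in = fs.open(path)) {
                IOUtils.copyBytes(in, System.out, 4096, false);
            }
        }
    }
}
```
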
Reference document: http://hadoop.apache.org/docs/r1.0.4/cn/hdfs_design.html

3. Distributed Computing (Map/Reduce)

 Hadoop Map/Reduce is an easy-to-use software framework. Applications written with it can run on large clusters of thousands of commodity machines and process terabyte-scale data sets in parallel in a reliable, fault-tolerant manner.
 A Map/Reduce job usually splits the input data set into independent chunks, which the map tasks process in a completely parallel manner. The framework sorts the outputs of the maps and then feeds them to the reduce tasks. Typically both the input and the output of a job are stored in the file system. The framework takes care of scheduling and monitoring tasks and re-executing tasks that fail.
 Generally, the Map/Reduce framework and the distributed file system run on the same set of nodes, that is, the compute nodes and the storage nodes are the same. This configuration allows the framework to schedule tasks on the nodes where the data already resides, which makes very efficient use of the cluster's aggregate network bandwidth.

 The Map/Reduce framework consists of a single master JobTracker and one slave TaskTracker per cluster node. The master is responsible for scheduling all the tasks that make up a job; these tasks are distributed across the slaves. The master monitors their execution and re-executes tasks that fail. The slaves simply execute the tasks assigned by the master.
 At a minimum, an application specifies the input/output locations (paths) and supplies map and reduce functions by implementing the appropriate interfaces or abstract classes. These, together with other job parameters, make up the job configuration. Hadoop's job client then submits the job (jar package, executable program, etc.) and its configuration to the JobTracker, which distributes the software and configuration to the slaves, schedules the tasks, monitors them, and provides status and diagnostic information to the job client.
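
For example, the canonical word-count job looks roughly like the sketch below, written against the newer org.apache.hadoop.mapreduce API (the input and output paths are passed as command-line arguments).

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    // Map: emit (word, 1) for every word in the input split.
    public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();
        @Override
        protected void map(LongWritable key, Text value, Context ctx)
                throws IOException, InterruptedException {
            StringTokenizer it = new StringTokenizer(value.toString());
            while (it.hasMoreTokens()) {
                word.set(it.nextToken());
                ctx.write(word, ONE);
            }
        }
    }

    // Reduce: sum the counts for each word (the framework has already sorted/grouped by key).
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            ctx.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input path
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output path
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```
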
 Although the Hadoop framework is implemented in Java™, Map/Reduce applications do not have to be written in Java.
Reference document: http://hadoop.apache.org/docs/r1.0.4/cn/mapred_tutorial.html

4. Distributed Database (OceanBase)

 OceanBase is a distributed relational database developed entirely in-house by Ant Financial and Alibaba; the project started in 2010. OceanBase features strong data consistency, high availability, high performance, online scaling, high compatibility with the SQL standard and mainstream relational databases, and low cost, which makes it suitable for financial scenarios with demanding requirements on performance, cost, and scalability.

Main features:

  • High performance: OceanBase uses a read-write-separated storage architecture that divides data into baseline data and incremental data. Incremental data is kept in memory (MemTable) and baseline data is kept on SSD (SSTable). Modifications are written only as incremental data, touching only memory, so DML is a pure in-memory operation with very high performance. (A toy sketch after this list illustrates the idea.)
  • Low cost: OceanBase achieves a high compression ratio through data-encoding compression. Data encoding is a set of encoding methods based on the value ranges and type information of the fields in a relational table; it understands the data better than a general-purpose compression algorithm and therefore compresses more efficiently. Using commodity PC servers and low-end SSDs, the high storage compression ratio lowers storage cost, the high performance lowers compute cost, and multi-tenant co-location makes full use of system resources.
  • High availability: Data is stored in multiple replicas, and the failure of a minority of replicas does not affect data availability. With a "three regions, five data centers" deployment, city-level failures can be recovered from automatically and without data loss.
  • Strong consistency: Replicas synchronize transaction logs via the Paxos protocol, and a transaction can commit only after a majority of replicas have persisted it. By default, reads and writes are served by the leader replica, which guarantees strong consistency.
  • Scalable: Cluster nodes are peers; each node has both compute and storage capability and there is no single point of bottleneck. The cluster can be scaled out and scaled in linearly and online.
  • Compatibility: Compatible with commonly used MySQL/Oracle features and with the MySQL/Oracle front-end and back-end protocols, so a business can migrate from MySQL/Oracle to OceanBase with zero or very few changes.
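
The toy sketch below is only meant to illustrate the baseline-plus-increment idea described in the "High performance" item: writes go to an in-memory delta, reads overlay the delta on the baseline, and a periodic merge folds the delta back in. It is not OceanBase code, and it ignores deletions, persistence, and concurrency details beyond the bare minimum.

```java
import java.util.TreeMap;
import java.util.concurrent.ConcurrentSkipListMap;

// Toy illustration of baseline + incremental storage; not OceanBase's implementation.
public class TieredStore {
    private volatile TreeMap<String, String> baseline = new TreeMap<>();                            // "SSTable" (baseline data)
    private final ConcurrentSkipListMap<String, String> memtable = new ConcurrentSkipListMap<>();   // "MemTable" (incremental data)

    // All writes are memory-only updates to the incremental data.
    public void put(String key, String value) {
        memtable.put(key, value);
    }

    // Reads overlay the incremental data on top of the baseline.
    public String get(String key) {
        String v = memtable.get(key);
        return (v != null) ? v : baseline.get(key);
    }

    // "Daily merge" / major compaction: fold the increments into a new baseline.
    public synchronized void majorCompact() {
        TreeMap<String, String> merged = new TreeMap<>(baseline);
        merged.putAll(memtable);
        baseline = merged;
        memtable.clear();
    }
}
```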

Application scenarios:

  • OceanBase is positioned as a distributed relational database. It is suitable for finance, securities, and other scenarios involving transactions, payments, and accounting that require high availability and strong consistency as well as good performance, cost, and scalability, and more generally for OLTP applications with relational, structured storage.

Software Architecture:

  • OceanBase has a share-nothing design, so there is no shared storage. At least three zones need to be deployed, and data is stored in every zone. There is no single point of failure in the design: each zone has multiple ObServer nodes, which provides high reliability and high availability at the architecture level.
  • Each node is completely equal and has its own SQL engine and storage engine. The storage engine can access only local data, while the SQL engine can access the global schema and generate distributed query plans. The query executor can access the storage engines of every node, distributing and collecting data among the nodes to execute the distributed plan and return the result to the user.
  • One of the nodes additionally runs the RootService, and RootService instances are deployed across multiple zones. The primary RootService maintains leases with all ObServers; when an ObServer fails, the primary RootService detects it and performs failure recovery. RootService is a functional module of the ObServer process, and every ObServer has RootService capability. Its responsibilities include server and zone management, daily merge control, system bootstrap, DDL operations, and so on.

5. Distributed Messaging (Kafka)

Distributed architecture:

  • Each Broker is a Kafka server and serves multiple message producers and consumers at the same time.
  • Messages are organized by topic; each topic can have multiple partitions stored on different Brokers. (A minimal producer sketch follows this list.)
  • Kafka allows a topic's partitions to have multiple replicas, and the replication factor can be configured on the server side. When a node in the cluster fails, the cluster fails over automatically, keeping the data available.
  • ZooKeeper is used for Kafka's Broker registration and topic registration, storing the mapping between consumers and partitions, triggering consumer rebalancing, and storing consumer offsets.
  • Replicas are created per topic partition. Under normal circumstances each partition has one leader and zero or more followers; the total number of replicas, including the leader, is the replication factor. All reads and writes are handled by the leader. There are usually many more partitions than brokers, and partition leaders are distributed evenly across the brokers. Followers replicate the leader's log, keeping the messages and offsets consistent with the leader's (of course, at any given time a few messages at the end of the leader's log may not yet have been replicated). Followers pull messages from the leader just like ordinary consumers and append them to their own log files, and they can fetch the logs from the leader in batches.
  • As with most distributed systems, handling failures automatically requires a precise definition of what it means for a node to be "alive". Kafka judges liveness in two ways: the node must maintain its session with ZooKeeper, which checks each node's connection via heartbeats; and if the node is a follower, it must replicate the leader's writes in a timely fashion and not fall too far behind.
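
The minimal producer sketch below shows how these pieces look from a client's point of view, using the Kafka Java client. The broker address, topic name, key, and value are placeholders.

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class ActivityProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder broker list
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Records with the same key are hashed to the same partition of the topic,
            // so per-key events stay ordered; the partition leader handles the write
            // and the followers replicate it.
            producer.send(new ProducerRecord<>("user-activity", "user-42", "page_view:/home"));
            producer.flush();
        }
    }
}
```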

Application scenarios:

  • Kafka works well as a replacement for traditional message brokers in a variety of situations (e.g. decoupling data producers from data processing, buffering unprocessed messages). Compared with other messaging systems, Kafka offers better throughput and built-in partitioning, replication, and fault tolerance, which makes it a good fit for large-scale message processing applications.
  • Kafka's original use case was to rebuild a user-activity-tracking pipeline as a set of real-time publish-subscribe feeds. Site activity (page views, searches, or other user actions) is published to central topics, with one topic per activity type. These feeds support a range of use cases, including real-time processing, real-time monitoring, and loading into Hadoop or offline data warehouse systems for offline processing and reporting.
  • Kafka can serve as an external commit log for a distributed system. The log helps replicate data between nodes and acts as a re-syncing mechanism for failed nodes to restore their data. Kafka's log compaction feature supports this usage.
  • Data written to Kafka is written to disk and replicated for fault tolerance. Kafka lets producers wait for acknowledgement, so a write is not considered complete until it is fully replicated and guaranteed to persist even if the server it was written to fails. The disk structures Kafka uses scale well: 50 KB and 50 TB of data behave the same on the server. It can store large amounts of data, and the read position is controlled by the client, so you can think of Kafka as a special-purpose, high-performance, low-latency distributed file system dedicated to commit log storage, replication, and propagation.

Message delivery semantic guarantee:

  • At most once: messages may be lost but are never redelivered.
  • At least once: messages are never lost but may be redelivered.
  • Exactly once: what everyone actually wants; each message is delivered exactly once. (The producer configuration sketch after this list shows how these semantics map to client settings.)
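
The sketch below shows how, roughly speaking, the producer-side knobs map onto these semantics: acks and retries trade loss against duplication, and idempotence (plus transactions) moves toward exactly-once. The broker address and transactional.id are placeholders, and this is only one common way to configure the Java client.

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.common.serialization.StringSerializer;

public class ProducerSemanticsConfig {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        // At-most-once leaning: no broker ack required and no retries
        // (messages may be lost, but are never duplicated).
        // props.put(ProducerConfig.ACKS_CONFIG, "0");
        // props.put(ProducerConfig.RETRIES_CONFIG, "0");

        // At-least-once: wait for the in-sync replicas and retry on failure
        // (a lost acknowledgement can cause duplicates).
        props.put(ProducerConfig.ACKS_CONFIG, "all");
        props.put(ProducerConfig.RETRIES_CONFIG, String.valueOf(Integer.MAX_VALUE));

        // Exactly-once writes to a partition: idempotent producer; a transactional.id
        // additionally enables atomic writes across partitions via transactions.
        props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true");
        // props.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG, "demo-txn"); // placeholder id

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // ... send records as usual ...
        }
    }
}
```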

Other guarantees:

  • Messages sent by a producer to a particular topic partition are appended in the order they are sent. That is, if record M1 and record M2 are sent by the same producer and M1 is sent first, then M1 has a smaller offset than M2 and appears earlier in the log.
  • A consumer instance sees records in the order they are stored in the log. (See the consumer sketch after this list.)
  • For a topic with replication factor N, up to N-1 server failures can be tolerated without losing any records committed to the log.
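
On the consumer side, where the offset is committed relative to processing determines the delivery semantics in practice. The sketch below disables auto-commit and commits after processing, which gives at-least-once consumption; the broker address, group id, and topic are placeholders.

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class AtLeastOnceConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "demo-group");              // placeholder
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false"); // commit offsets manually

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("user-activity")); // placeholder topic
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> r : records) {
                    // Records within a partition arrive in offset (log) order.
                    System.out.printf("partition=%d offset=%d value=%s%n",
                            r.partition(), r.offset(), r.value());
                }
                // Committing AFTER processing gives at-least-once (reprocessing on crash);
                // committing BEFORE processing would give at-most-once instead.
                consumer.commitSync();
            }
        }
    }
}
```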

Reference document: https://kafka.apachecn.org/intro.html

6. Distributed Transactions (RocketMQ)

Strong consistency (database level): two-phase commit (2PC):

  • 2PC introduces a transaction coordinator that coordinates and manages the commit or rollback of each participant (also called the local resources). The two phases are the prepare (voting) phase and the commit phase. (A minimal sketch of this control flow appears after this list.)
  • In the prepare phase, the coordinator sends a prepare command to every participant. You can think of the prepare command as doing everything short of committing the transaction. After synchronously waiting for responses from all resources, the flow enters the second, commit phase (note that the commit phase does not necessarily commit the transaction; it may also roll it back).
  • If every participant reports a successful prepare in the first phase, the coordinator sends a commit command to all participants, waits for all of them to commit successfully, and then reports that the transaction succeeded.
  • If any participant reports failure in the first phase, the coordinator sends a rollback command to all participants, which means the distributed transaction has failed.
  • If the second phase is a rollback and it fails, the coordinator keeps retrying until every participant has rolled back; otherwise the participants that prepared successfully would stay blocked.
  • If the second phase is a commit and it fails, the coordinator also keeps retrying, because some participants may already have committed their transactions; at that point the only way is forward: keep retrying until the commit succeeds, and if it really cannot succeed, fall back to manual handling.
  • If the coordinator fails, a new coordinator is chosen by election.
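
The sketch below captures only the control flow described above (prepare everywhere, then commit or roll back everywhere, retrying the phase-2 decision). The Participant interface and class names are illustrative; a real coordinator also needs a persistent decision log, timeouts, and recovery after a coordinator crash.

```java
import java.util.List;

// Illustrative sketch of the 2PC control flow; not a production coordinator.
public class TwoPhaseCommitCoordinator {

    public interface Participant {
        boolean prepare();   // phase 1: do everything except the final commit, then vote
        void commit();       // phase 2a: make the prepared work durable/visible
        void rollback();     // phase 2b: undo the prepared work
    }

    public boolean execute(List<Participant> participants) {
        final boolean allPrepared = prepareAll(participants); // phase 1 (voting)

        // Phase 2: the decision must eventually reach every participant,
        // so keep retrying each one until it succeeds.
        for (Participant p : participants) {
            retryUntilSuccess(() -> {
                if (allPrepared) p.commit(); else p.rollback();
            });
        }
        return allPrepared;
    }

    private boolean prepareAll(List<Participant> participants) {
        for (Participant p : participants) {
            try {
                if (!p.prepare()) return false;   // any "no" vote aborts the transaction
            } catch (Exception e) {
                return false;                     // a crash/timeout counts as a "no" vote
            }
        }
        return true;
    }

    private void retryUntilSuccess(Runnable action) {
        while (true) {
            try { action.run(); return; }
            catch (Exception e) {
                // In practice: bounded retries, then escalate to manual handling.
                try { Thread.sleep(1_000); } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    return;
                }
            }
        }
    }
}
```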

TCC (Try - Confirm - Cancel):

  • Try refers to reservation: reserving and locking the resource. The emphasis is on reservation.
  • Confirm refers to the confirmation operation; this step actually performs the work.
  • Cancel refers to the cancellation operation, which undoes what was done in the reservation phase.
  • For example, if a transaction needs to perform operations A, B, and C, the reservation action is performed for all three first. If every reservation succeeds, the confirm operation is executed; if any reservation fails, the cancel action is executed for all of them. The TCC model also has a transaction manager role that records the global TCC transaction state and commits or rolls back the transaction. (See the sketch after this list.)
  • At the business level, because every operation must define three actions corresponding to Try-Confirm-Cancel, TCC is intrusive and tightly coupled to the business: the operations must be designed for the specific scenario and business logic. In addition, the cancel and confirm operations may need to be retried, so they must be idempotent.
  • Compared with 2PC and 3PC, TCC applies to a wider range of cases, but the development effort is also larger, since everything is implemented in business code, and sometimes these three methods are genuinely hard to write. However, precisely because it is implemented at the business level, TCC can span databases and even different business systems.
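
A minimal sketch of the pattern, assuming a hypothetical TccAction interface (the names are made up, not a real framework's API): reserve everything, then confirm everything, or cancel everything on the first failed reservation.

```java
import java.util.List;

// Sketch of the TCC pattern described above; names are illustrative only.
public class TccTransactionManager {

    public interface TccAction {
        boolean tryReserve();  // Try: reserve/lock the resource
        void confirm();        // Confirm: actually perform the operation (must be idempotent)
        void cancel();         // Cancel: release the reservation (must be idempotent)
    }

    public boolean run(List<TccAction> actions) {
        // Phase 1: reserve resources for every action.
        for (TccAction a : actions) {
            boolean ok;
            try { ok = a.tryReserve(); } catch (Exception e) { ok = false; }
            if (!ok) {
                // Any failed reservation cancels all actions. Cancel must be idempotent
                // and tolerate being called for actions whose Try never ran ("empty rollback").
                actions.forEach(TccAction::cancel);
                return false;
            }
        }
        // Phase 2: all reservations succeeded, so confirm every action.
        // Real frameworks retry confirm/cancel until they eventually succeed.
        actions.forEach(TccAction::confirm);
        return true;
    }
}
```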

Eventual consistency, local message table:

  • Take a payment service and an accounting service as an example. The rough flow is: after the user pays an order and the payment service succeeds, the payment service calls the accounting service interface to write an original accounting voucher to the database. Because the user must get immediate feedback after paying, all the payment service needs to do right away is tell the user that the payment succeeded.
  • The local message table is, as the name implies, a table for storing local messages, usually kept in the database. When the business runs, the business operation (the payment) and the insertion of the message into the message table are placed in the same local transaction, which guarantees that the message is recorded if and only if the business operation succeeds. (The sketch after this list shows this part.)
  • Then the next operation (the accounting service) is called. If the call succeeds, the message's status in the message table is simply updated to success.
  • If the call fails, that is fine: a background task periodically reads the local message table, picks out the messages that are not yet successful, calls the corresponding service again, and updates the message status once the downstream service has been updated successfully.
  • At this point the operation corresponding to the message may still fail, so it is retried. Retrying requires the corresponding service method to be idempotent, and there is usually a maximum retry count; once it is exceeded, an alert can be recorded for manual handling.
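
The sketch below shows the core of the pattern, the business write and the message write sharing one local transaction, using plain JDBC. The table and column names (payment, local_message, etc.) are made up for illustration; the background retry task is only described in a comment.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;

// Sketch of the local message table pattern; schema and names are hypothetical.
public class PaymentService {

    // The business write and the message write share one local transaction.
    public void pay(Connection conn, String orderId, long amountCents) throws Exception {
        try {
            conn.setAutoCommit(false);

            try (PreparedStatement ps = conn.prepareStatement(
                    "INSERT INTO payment (order_id, amount_cents, status) VALUES (?, ?, 'PAID')")) {
                ps.setString(1, orderId);
                ps.setLong(2, amountCents);
                ps.executeUpdate();
            }

            try (PreparedStatement ps = conn.prepareStatement(
                    "INSERT INTO local_message (order_id, payload, status, retry_count) "
                            + "VALUES (?, ?, 'PENDING', 0)")) {
                ps.setString(1, orderId);
                ps.setString(2, "{\"orderId\":\"" + orderId + "\",\"amountCents\":" + amountCents + "}");
                ps.executeUpdate();
            }

            conn.commit(); // either both rows exist or neither does
        } catch (Exception e) {
            conn.rollback();
            throw e;
        }
    }

    // A background job would periodically scan local_message for PENDING rows,
    // call the accounting service, mark the row SUCCESS on success, and increment
    // retry_count / raise an alert for manual handling after too many failures.
}
```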

Eventual consistency, transaction message (RocketMQ):

  • The transaction message scheme can replace the local message table scheme.
  • The first step is to send a transactional message, known as a half message, to the Broker. "Half message" does not mean half of a message; it means the message is invisible to consumers for now. After the send succeeds, the sender executes its local transaction and then sends a Commit or Rollback command to the Broker based on the outcome of the local transaction.
  • The RocketMQ sender also provides a transaction back-check interface. If the Broker receives no Commit/Rollback for a message within a certain period, it uses the back-check interface to ask whether the sender's local transaction succeeded, and then issues the Commit or Rollback itself.
  • If the result is Commit, subscribers can receive the message, perform the corresponding operations, and acknowledge consumption when done.
  • If the result is Rollback, subscribers never see the message, which means the downstream transaction is not executed.
  • As you can see, this is fairly easy to implement with RocketMQ: RocketMQ provides the transaction message feature, and we only need to implement the transaction back-check interface. (A sketch based on RocketMQ's Java client follows this list.)
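
A minimal sketch using RocketMQ's Java client (TransactionMQProducer / TransactionListener), as I understand that API. The producer group, name server address, topic, and message body are placeholders, and the local transaction and state lookup are left as comments.

```java
import org.apache.rocketmq.client.producer.LocalTransactionState;
import org.apache.rocketmq.client.producer.TransactionListener;
import org.apache.rocketmq.client.producer.TransactionMQProducer;
import org.apache.rocketmq.common.message.Message;
import org.apache.rocketmq.common.message.MessageExt;

public class TxMessageDemo {
    public static void main(String[] args) throws Exception {
        TransactionMQProducer producer = new TransactionMQProducer("pay_tx_group"); // placeholder group
        producer.setNamesrvAddr("localhost:9876");                                  // placeholder name server

        producer.setTransactionListener(new TransactionListener() {
            @Override
            public LocalTransactionState executeLocalTransaction(Message msg, Object arg) {
                try {
                    // Run the local transaction here (e.g. write the payment row).
                    return LocalTransactionState.COMMIT_MESSAGE;   // half message becomes visible
                } catch (Exception e) {
                    return LocalTransactionState.ROLLBACK_MESSAGE; // half message is discarded
                }
            }

            @Override
            public LocalTransactionState checkLocalTransaction(MessageExt msg) {
                // Back-check callback: the Broker asks whether the local transaction
                // committed when it received no Commit/Rollback for a while.
                // In practice, look up the real transaction state here.
                return LocalTransactionState.COMMIT_MESSAGE;
            }
        });

        producer.start();
        Message half = new Message("ACCOUNTING_TOPIC",            // placeholder topic
                "{\"orderId\":\"42\"}".getBytes());
        producer.sendMessageInTransaction(half, null);
        producer.shutdown();
    }
}
```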

Reference document: https://zhuanlan.zhihu.com/p/183753774
