A Preliminary Study of Distributed Storage

Background

Our company has recently been building an internal DMP service. The current plan is to build several redis clusters and load data into them to serve online queries. However, as data volume grows and data sources diversify, and because the online service must span multiple data centers, continuing to rely on redis clusters would become prohibitively expensive. I also considered backing the online service with hbase, but online serving has strict latency requirements and hbase carries a risk of high latency, which motivated this survey of distributed KV databases.

Why do we need a distributed database

Before adopting a distributed database, we generally use mysql to support typical online business. Even when single-machine storage is limited, we can use sharding (splitting databases and tables) to handle large data volumes, but sharding has its own drawbacks, such as the complexity of cross-node joins and the network transfers they require. Since a single machine's storage cannot satisfy our data storage and query needs, distributed storage came into being.

What are the basic problems that a distributed database needs to solve?

  1. How to store data
  2. How to query and index data
  3. How to ensure high availability (HA)
  4. How to ensure consistency

The following sections introduce how mature open-source products in the industry solve these four problems.

Data storage and query

Any persistent store must eventually write to disk, and in practice the way data is stored is closely tied to the way it is queried. The data structures used by mature storage engines to index data include B-Tree, B+Tree, and LSM-Tree. The similarities and differences of these three structures are explained in detail below.

B-Tree

A B-tree is a multi-way self-balancing search tree. It is similar to an ordinary binary search tree, but a B-tree allows each node to have more than two children. The schematic diagram of the B-tree is as follows:

b_tree

Features of B-trees:

  1. Keys are distributed across all nodes of the tree
  2. Any given key appears in exactly one node
  3. A search may terminate at a non-leaf node
  4. A search over the full key set performs close to binary search

B+ Tree

The B+ tree is a variant of the B-tree and is likewise a multi-way balanced search tree. The schematic diagram of the B+ tree is:

b_plus_tree

As the figure also shows, the B+ tree differs from the B-tree in that:

  1. All keys are stored in the leaf nodes; non-leaf nodes do not store real data
  2. All leaf nodes are linked together by chain pointers

B/B+ trees are commonly used to implement indexes in file systems and mysql. mysql is a disk-based database: indexes live on disk as index files, so searching an index incurs disk IO, which is several orders of magnitude more expensive than memory IO. The index structure should therefore be designed to minimize the number of disk IOs per key lookup. Since mysql manages records in pages, a key lookup requires only about h-1 disk IOs, where h is the height of the tree and $O(h) = O(\log_d N)$, with d the out-degree of each node and N the number of records. d is usually a very large number, so h is small, typically no more than 3.
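The height bound above can be checked with a short sketch, using only the definition of fan-out (no index internals assumed):

```python
def btree_height(num_records: int, fanout: int) -> int:
    """Smallest h with fanout**h >= num_records, i.e. h = ceil(log_d N)."""
    h, capacity = 0, 1
    while capacity < num_records:
        capacity *= fanout
        h += 1
    return h

# With an out-degree of ~1000 (a few hundred keys per 16 KB page),
# a billion records need a tree of height only 3.
print(btree_height(10**9, 1000))   # 3
print(btree_height(10**6, 1000))   # 2
```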

On the other hand, MySQL chooses to use B+ trees instead of B trees for the following reasons:

  1. The B+ tree is better suited to external storage (generally disk). Since internal (non-leaf) nodes do not store data, a single node can hold more keys, so each node indexes a larger and more precise key range. In other words, a single disk IO on a B+ tree brings in more index information than on a B-tree, making IO more efficient.
  2. MySQL is a relational database, and indexed columns are often accessed by range. The chain pointers linking the B+ tree's leaf nodes in key order make range traversal efficient, so the B+ tree is very friendly to range queries on an indexed column. In a B-tree, each node stores keys together with their data, and a range scan cannot be done this way.
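Point 2 can be illustrated with a toy sketch of the leaf chain (illustrative only, not InnoDB's actual page layout): after descending to the first qualifying leaf, a range query simply follows the chain pointers.

```python
class Leaf:
    """A B+ tree leaf: sorted keys plus a chain pointer to the next leaf."""
    def __init__(self, keys):
        self.keys = keys
        self.next = None

def range_scan(first_leaf, lo, hi):
    """Collect all keys in [lo, hi] by walking the leaf chain in order."""
    out, leaf = [], first_leaf
    while leaf is not None:
        for k in leaf.keys:
            if lo <= k <= hi:
                out.append(k)
            elif k > hi:          # keys are sorted: past hi means done
                return out
        leaf = leaf.next          # follow the chain pointer
    return out

a, b, c = Leaf([1, 3]), Leaf([5, 7]), Leaf([9, 11])
a.next, b.next = b, c
print(range_scan(a, 3, 9))   # [3, 5, 7, 9]
```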

LSM Tree (Log-Structured Merge Tree)

First of all, we need to understand one fact: random reads and writes to disk are very slow, while sequential disk access is at least three orders of magnitude faster than random disk access.

The LSM tree is designed to make disk writes sequential. Sequential writes save the seek time that random writes would spend, leaving more disk IO capacity for random reads and thereby improving read performance.

The design idea of LSM therefore emerges naturally: keep incremental modifications to the data in memory, and when they reach a size threshold, write these modifications to disk in a batch.

But how exactly does LSM implement this idea?

The LSM tree maintains many small sorted structures. For example, every m writes are sorted in memory into one small run, then the next m into another, and so on, producing N/m small sorted structures. At query time, since we don't know which run holds a given key, we binary-search the newest run first, return if found, and otherwise continue to the next run until the key is found. The complexity is $O((N/m) \cdot \log_2 m)$.

But there are some problems with the above method:

  1. Data is written to memory first, so a power failure or process crash in between would lose it; a WAL (write-ahead log) must be written as the basis for recovery.
  2. As small sorted runs accumulate, read performance degrades, so the small files must periodically be merged into larger sorted structures.
  3. As an optimization, the LSM tree uses a Bloom filter to cheaply test whether a small file might contain the key being sought.
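The scheme described above (in-memory buffer, batch flush to sorted runs, newest-first reads) can be sketched in a few lines. This toy model deliberately ignores the WAL, compaction, and Bloom filters just listed:

```python
class ToyLSM:
    """A toy LSM store: memtable + immutable sorted runs, newest-first reads."""
    def __init__(self, memtable_limit=2):
        self.memtable = {}
        self.sstables = []            # flushed sorted runs, newest last
        self.limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.limit:
            # flush: an immutable sorted run, written to disk sequentially
            self.sstables.append(sorted(self.memtable.items()))
            self.memtable = {}

    def get(self, key):
        if key in self.memtable:       # newest data lives in memory
            return self.memtable[key]
        for table in reversed(self.sstables):   # then newest run wins
            for k, v in table:
                if k == key:
                    return v
        return None

db = ToyLSM()
db.put("a", 1); db.put("b", 2)   # second put triggers a flush
db.put("a", 3)                   # newer value shadows the flushed one
print(db.get("a"), db.get("b"))  # 3 2
```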

The databases currently using LSM trees include hbase, leveldb, and rocksdb.

High availability and consistency issues

HA covers both service availability and data durability. For the first, a common approach is a master/standby setup, like redis's master-slave architecture; for the second, data is stored redundantly on multiple servers, as in hdfs and kafka. zookeeper is a typical system that satisfies both at once.

But master/standby and data redundancy mean that data or state lives on multiple servers at the same time. When the master crashes and the standby must take over, or a replica must be promoted, its state and data must be consistent with the master's. This is the classic consistency problem of distributed systems.

In the distributed field there are several classic algorithms for solving the consistency problem: Paxos, Zab, and Raft. Since Paxos is notoriously hard to understand, only raft is introduced here.

Raft algorithm

Raft achieves consistency by first electing a distinguished leader and then giving it full responsibility for managing the replicated log. The leader receives log entries from clients, replicates them to the other servers, and tells those servers when it is safe to apply the entries to their state machines. By going through a leader, Raft decomposes the consistency problem into three relatively independent sub-problems:

  1. leader election
  2. log replication
  3. safety

In the raft algorithm, each server is in one of three states: leader, candidate, or follower. Under normal operation there is exactly one leader and all other nodes are followers. Followers are passive: they issue no requests of their own and simply respond to requests from leaders and candidates. The leader handles all client requests (if a client contacts a follower, the follower redirects it to the leader). The third state, candidate, is used to elect a new leader. The diagram below shows these states and the transitions between them.

Term: In a cluster, time is divided into terms, and each term begins with an election. After a successful election, the leader manages the entire cluster until the end of the term. As shown in the figure:

The following describes how the raft algorithm works.

leader election

Raft uses a heartbeat mechanism to trigger leader elections. All servers start as followers. A follower then transitions to other states depending on what it observes:

  1. If it receives RPCs from other servers within the timeout period, it refreshes its local election timer and remains a follower.
  2. If no message arrives before the election timeout expires, it assumes there is no leader, becomes a candidate, and sends RPCs to the other servers asking them to vote for it as leader.

In case 2, if a candidate obtains votes from a majority of the servers in the cluster, it wins the election and becomes the leader; if the votes split so that no candidate gains a majority, the election produces no leader and a new election begins. In each election, every server votes for at most one candidate, on a first-come first-served basis. Once a candidate wins, it immediately becomes the leader and periodically sends heartbeat messages to all followers to assert its authority and prevent new elections.

While waiting for votes, a candidate may receive an AppendEntries RPC from another server claiming to be the leader. If that leader's term number (carried in the RPC) is at least as large as the candidate's current term, the candidate recognizes the leader as legitimate and returns to the follower state. If the term in the RPC is smaller than its own, the candidate rejects the RPC and remains a candidate.
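The two rules above, majority vote counting and stepping down to a legitimate leader, reduce to a pair of small predicates. This is a sketch with illustrative names, not any real implementation:

```python
def wins_election(votes_received: int, cluster_size: int) -> bool:
    """A candidate becomes leader only with a strict majority of votes."""
    return votes_received > cluster_size // 2

def on_leader_rpc(candidate_term: int, leader_term: int) -> str:
    """What a candidate does on receiving a leader's AppendEntries RPC."""
    if leader_term >= candidate_term:
        return "follower"    # recognize the leader and step down
    return "candidate"       # stale leader: reject the RPC, keep campaigning

print(wins_election(3, 5))   # True
print(wins_election(2, 5))   # False: split vote, wait for the next election
print(on_leader_rpc(4, 4))   # follower
print(on_leader_rpc(4, 3))   # candidate
```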

log replication

Once a leader is elected, it begins serving clients. Each client request contains an instruction to be executed by the replicated state machines. The leader appends the instruction to its log as a new entry, then issues AppendEntries RPCs in parallel to the other servers, asking them to replicate the entry. When the entry has been safely replicated, the leader applies it to its state machine and returns the result to the client. If followers crash or run slowly, or the network drops packets, the leader retries the AppendEntries RPCs indefinitely (even after it has responded to the client) until all followers eventually store all log entries.

Each log entry stores a state machine instruction together with the term number of the leader when it received the instruction. The term numbers in the log are used to detect inconsistencies; each entry also carries an integer index giving its position in the log.

Raft maintains a high level of coherency between the logs on different servers by guaranteeing the following properties:

  • If two entries in different logs have the same index and term number, then they store the same instruction.
  • If two entries in different logs have the same index and term number, then all log entries before them are also the same.

This property is essentially established by mathematical induction.

Now consider what a new leader does when a crashed leader has left some servers' logs inconsistent. In the Raft algorithm, the leader handles inconsistencies by forcing followers to replicate its own log directly: conflicting entries in a follower's log are overwritten by the leader's. Restrictions, discussed below, ensure that this overwriting is correct and safe.

The specific operations to ensure log consistency are as follows:

  1. The leader maintains a nextIndex for each follower: the index of the next log entry it will send to that follower.
  2. When a leader first comes to power, it initializes every nextIndex to its own last log index + 1.
  3. If a follower's log is inconsistent with the leader's, the consistency check in the next AppendEntries RPC fails; after the follower's rejection, the leader decrements nextIndex and retries.
  4. Eventually nextIndex reaches a position where the leader's and follower's logs agree.

When that happens, the AppendEntries RPC succeeds: the follower's conflicting entries are deleted and the leader's entries are appended. Once the RPC succeeds, the follower's log is consistent with the leader's and remains so for the rest of the term. A leader never overwrites or deletes entries in its own log.
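The back-off and overwrite described above can be sketched as follows, with logs as lists of (term, command) pairs and 0-based indices (a simplified model that collapses the RPC round trips into a local loop):

```python
def sync_follower(leader_log, follower_log):
    """Back nextIndex off until the logs match, then overwrite the rest."""
    next_index = len(leader_log)                 # last log index + 1
    while next_index > 0:
        prev = next_index - 1
        if prev < len(follower_log) and follower_log[prev] == leader_log[prev]:
            break                                # consistency check passes
        next_index -= 1                          # rejected: decrement, retry
    # AppendEntries succeeds: conflicting follower entries are overwritten
    return follower_log[:next_index] + leader_log[next_index:]

leader   = [(1, "x=1"), (1, "x=2"), (2, "x=3")]
follower = [(1, "x=1"), (3, "y=9")]              # diverged at index 1
print(sync_follower(leader, follower))
```

After the sync, the follower's log equals the leader's, with the conflicting `(3, "y=9")` entry replaced.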

safety

As mentioned above, when a newly elected leader's log conflicts with a follower's, restrictions ensure that overwriting the follower's entries is correct. These restrictions guarantee that the leader of any given term holds every log entry committed in previous terms: raft allows a candidate to win an election only if its log contains all committed entries. To win, a candidate must contact a majority of the nodes in the cluster; every committed entry must also already be stored on a majority of the nodes; and any two majorities must intersect. At least one voter therefore holds every committed entry in the cluster, and voters refuse to vote for candidates whose logs are less complete than their own. When comparing logs, a voter grants its vote only when:

  • The term of the candidate's last log entry is at least the term of the voter's last entry, that is, req.lastLogTerm >= lastEntry.term
  • If req.lastLogTerm == lastEntry.term, the index of the candidate's last entry is at least that of the voter's, that is, req.lastLogIndex >= lastEntry.index
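The two voting conditions combine into a single predicate (a sketch; field names follow the bullets above):

```python
def grant_vote(req_last_log_term: int, req_last_log_index: int,
               my_last_log_term: int, my_last_log_index: int) -> bool:
    """Vote for the candidate only if its log is at least as up to date."""
    if req_last_log_term != my_last_log_term:
        return req_last_log_term > my_last_log_term   # higher last term wins
    return req_last_log_index >= my_last_log_index    # tie: longer log wins

print(grant_vote(3, 5, 2, 9))   # True: higher last term beats a longer log
print(grant_vote(2, 4, 2, 5))   # False: same term but shorter log
```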

raft demo

For a more detailed introduction, see the raft paper

Several mature distributed database architectures

hbase

HBase is a distributed, column-oriented open-source database. Unlike typical relational databases, it is suited to unstructured data storage, and its schema is column-based rather than row-based. HBase uses the same data model as BigTable: users store data rows in a table; a row has a sortable key and any number of columns; one or more columns form a ColumnFamily, and the columns of one family are stored together in an HFile, which makes caching easier. Tables are stored sparsely, so different rows may define different columns. In HBase, data is sorted by primary key, and a table is split into multiple Regions by primary key.

In a distributed production environment, HBase runs on top of HDFS, using HDFS as its underlying storage. On top, HBase exposes a Java API through which applications access the stored data. An HBase cluster consists mainly of a Master and RegionServers, plus Zookeeper. The modules are shown in the following figure:

A brief introduction to the related modules in HBase:

  • Master The HBase Master coordinates the RegionServers: it monitors each RegionServer's status, balances load among them, and assigns Regions to RegionServers. With Zookeeper's help, HBase allows multiple Master nodes to coexist, but only one Master serves at a time while the others stand by; when the active Master goes down, another Master takes over the cluster.
  • Region Server A RegionServer hosts multiple Regions. Its role is to manage tables and serve reads and writes; clients connect directly to a RegionServer to fetch data from HBase. A Region is where HBase data is actually stored, making the Region the basic unit of availability and distribution. If a table is large and has multiple column families, its data is spread across multiple Regions, and each Region is associated with multiple storage units (Stores).
  • Zookeeper For HBase, Zookeeper's role is crucial. First, Zookeeper is the HA solution for the HBase Master, ensuring that at least one Master is running. Zookeeper also handles the registration of Regions and RegionServers. In fact, Zookeeper has by now become a standard fault-tolerance component for distributed big-data frameworks: not just HBase, but nearly all related open-source frameworks rely on Zookeeper for HA.

tair

A cluster usually contains two configservers and multiple dataservers. The two configservers act as master and standby for each other; through heartbeats with the dataservers they learn which dataservers in the cluster are alive and build the cluster's data-distribution information (the routing table). The dataservers store the data and carry out replication and migration as instructed by the configserver. On startup, a client fetches the data-distribution information from the configserver, then uses it to talk directly to the appropriate dataserver to serve user requests. The architecture diagram is as follows:

  • ConfigServer functions
    1. Track the live nodes in the cluster via heartbeats with the dataservers
    2. Build the cluster's data-distribution table from the live-node information
    3. Serve queries against the data-distribution table
    4. Schedule data migration and replication between dataservers
  • DataServer functions
    1. Provide the storage engine
    2. Serve client operations such as put/get/remove
    3. Perform data migration, replication, etc.
    4. Plugins: run custom logic while handling requests
    5. Collect access statistics

tidb

The overall structure is as follows:

TiDB cluster is mainly divided into three components:

  • TiDB Server TiDB Server receives SQL requests and handles the SQL logic: it locates, via PD, the TiKV addresses holding the data needed for a computation, interacts with TiKV to fetch the data, and returns the result. TiDB Server is stateless: it stores no data itself and only performs computation, so it can be scaled out horizontally without limit and can present a single external access address through load-balancing components (such as LVS, HAProxy, or F5).

  • PD Server The Placement Driver (PD for short) is the management module of the entire cluster. It has three main tasks: storing the cluster's metadata (which TiKV node each key lives on); scheduling and load-balancing the TiKV cluster (data migration, Raft group leader transfer, etc.); and allocating globally unique, monotonically increasing transaction IDs. PD is itself a cluster and should be deployed with an odd number of nodes; at least 3 are generally recommended in production.

  • TiKV Server TiKV Server is responsible for storing data. Externally, TiKV is a distributed key-value storage engine that provides transactions. The basic unit of storage is the Region: each Region stores the data for one key range (the left-closed, right-open interval [StartKey, EndKey)), and each TiKV node serves multiple Regions. TiKV uses the Raft protocol for replication to maintain data consistency and tolerate failures; replicas are managed per Region, and the replicas of a Region on different nodes form a Raft Group. Load balancing of data across the TiKV nodes is scheduled by PD, also at Region granularity.
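The Region lookup that PD performs for a client can be sketched as a search over sorted [StartKey, EndKey) ranges. This is a toy model with illustrative names, ignoring replicas and leader placement:

```python
import bisect

class ToyPD:
    """Maps keys to Regions via sorted [start_key, end_key) ranges."""
    def __init__(self, regions):
        # regions: list of (start_key, end_key, region_id), ranges disjoint
        self.regions = sorted(regions)
        self.starts = [r[0] for r in self.regions]

    def locate(self, key):
        """Return the region_id whose [start, end) range contains key."""
        i = bisect.bisect_right(self.starts, key) - 1
        start, end, region_id = self.regions[i]
        assert start <= key < end, "key outside known ranges"
        return region_id

pd = ToyPD([("a", "m", 1), ("m", "t", 2), ("t", "z", 3)])
print(pd.locate("hello"), pd.locate("tikv"))   # 1 3
```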
