Distributed Database (1): Introduction

1. Introduction to Distributed Database

2. What problems need to be solved for distributed databases


1. Introduction to Distributed Database

    Simply put, a distributed database is a relational database implemented with a distributed architecture. Why use a distributed architecture? The reason is simple: performance and reliability. For various reasons, special-purpose hardware such as IBM mainframes is no longer an option for most enterprises, while general-purpose x86 servers cannot meet the requirements for performance and reliability on a single machine. A distributed architecture has therefore become the inevitable choice.

    In recent years, Internet companies such as Alibaba, Tencent, Baidu, ByteDance, Meituan, Didi, Kuaishou, Zhihu, and 58 have all begun to use distributed databases. The traditional financial and telecommunications industries are quickly following: Bank of Communications, China CITIC Bank, China Everbright Bank, Bank of Beijing, and some city commercial banks have also launched distributed databases. Driven by these factors, distributed databases have become a technological trend, and even part of the new infrastructure. Common distributed databases include:

  • Google's Spanner
  • AWS's Aurora
  • TiDB from PingCAP
  • Alibaba's OceanBase and PolarDB
  • Tencent's TBase and TDSQL
  • Huawei's GaussDB

2. What problems need to be solved for distributed databases

  • Storage design: At bottom, a database does two things, reading and writing. The two sometimes conflict: a design that makes writes fast may make reads slow, and the cost of storage space must also be considered. The RUM conjecture states that, of read overhead, write (update) overhead, and storage space overhead, a design can minimize at most two of the three. This is the first part, storage design.
  • Transaction model: A system is always used by many people at once, which brings concurrency problems. What strategy is used when write-write conflicts and read-write conflicts occur? This is the second part, the transaction model.
  • Query engine: The operating interface of a database is SQL. The data structures and operation primitives are defined on the relational model, and various indexes and optimizations make SQL execute faster. This is the third part, the query engine.
  • Replication: Any architecture must avoid a single point of failure, so a database has a replication mechanism: multiple nodes form a primary-backup relationship and data is synchronized between primary and backup, which guarantees reliability. This is the fourth part, replication.
  • Auxiliary work: Finally, there is some necessary auxiliary work: client access, permission control, and metadata storage. With these in place, a basic database can run.
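
    The storage trade-off in the first bullet can be made concrete with a small sketch. The two classes below are hypothetical toy stores, not taken from any real database: AppendLog favors writes and pays for it with slow reads and stale versions, while SortedStore favors reads and space and pays for it on every insert.

```python
import bisect


class AppendLog:
    """Write-optimized toy store: every put is an append."""

    def __init__(self):
        self.entries = []  # (key, value) pairs, newest last

    def put(self, key, value):
        # O(1) write, but older versions of the key stay in the log.
        self.entries.append((key, value))

    def get(self, key):
        # O(n) read: scan from the newest entry backwards.
        for k, v in reversed(self.entries):
            if k == key:
                return v
        return None


class SortedStore:
    """Read-optimized toy store: keys are kept in sorted order."""

    def __init__(self):
        self.keys = []
        self.values = []

    def put(self, key, value):
        i = bisect.bisect_left(self.keys, key)
        if i < len(self.keys) and self.keys[i] == key:
            self.values[i] = value  # overwrite in place, no stale copy
        else:
            # O(n) write: later elements are shifted to keep the order.
            self.keys.insert(i, key)
            self.values.insert(i, value)

    def get(self, key):
        # O(log n) read via binary search.
        i = bisect.bisect_left(self.keys, key)
        if i < len(self.keys) and self.keys[i] == key:
            return self.values[i]
        return None


log, store = AppendLog(), SortedStore()
for key, value in [("b", 1), ("a", 2), ("b", 3)]:
    log.put(key, value)
    store.put(key, value)
```

    Both stores return the same answer for every key, but the log now holds three entries for two live keys, while the sorted store holds exactly two.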

    To sum up, a database has to do five things well: storage, transactions, queries, replication, and auxiliary work. A distributed database must keep doing these five things and take on one more, sharding. Of these six, storage and auxiliary work are much the same as in a single-node database; the difficulty lies in the other four: transactions, queries, replication, and sharding.

        

    Let's talk about these four things in detail.

    (1) The first thing, the extra one, is shard metadata storage and shard scheduling. With multiple nodes, should all of a table's data sit on one node? Shouldn't it be spread out to improve performance? Once it is, the table is no longer the smallest unit of data storage; the shard takes its place. A shard is a horizontal slice of a table, very similar to the concept of a partition. But once data is sharded, you need to know where to look for it, and that is what shard metadata records. Shards are also not static: many factors, such as a shard holding too much data or receiving too much access pressure, can require shards to be split, merged, and moved between nodes.
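
    As a rough illustration of shard metadata, here is a minimal sketch of a range-based routing table. The ShardMap class, its methods, and the node names are all hypothetical; real systems keep this metadata in a dedicated, replicated service and use much richer scheduling policies.

```python
import bisect


class ShardMap:
    """Toy shard metadata: sorted start keys mapped to the owning node."""

    def __init__(self, first_node):
        # Shard i covers keys in [starts[i], starts[i + 1]).
        # The last shard is unbounded above.
        self.starts = [""]
        self.nodes = [first_node]

    def _index(self, key):
        # Index of the shard whose range contains key.
        return bisect.bisect_right(self.starts, key) - 1

    def locate(self, key):
        """Return the node that currently holds the shard for key."""
        return self.nodes[self._index(key)]

    def split(self, split_key, new_node):
        """Split the shard containing split_key; the upper half goes to new_node."""
        if split_key in self.starts:
            raise ValueError("a shard already starts at this key")
        i = self._index(split_key) + 1
        self.starts.insert(i, split_key)
        self.nodes.insert(i, new_node)

    def move(self, key, target_node):
        """Scheduling: reassign the shard containing key to another node."""
        self.nodes[self._index(key)] = target_node


shards = ShardMap("node-1")
shards.split("m", "node-2")      # keys >= "m" now live on node-2
before = shards.locate("zebra")  # "node-2"
shards.move("zebra", "node-3")   # rebalance the upper shard
after = shards.locate("zebra")   # "node-3"
```

    Every read or write first consults this map, which is why the metadata itself has to be stored reliably and kept up to date as shards split and move.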

    (2) The second thing is transactions, or more precisely, distributed transactions, which are completely different from single-node transactions. Databases have long had the XA protocol as a standard, and in theory it supports cross-database transactions, but its performance is poor. With a MySQL cluster using the XA protocol, operation latency is 10 times that of a single machine, which makes it unusable in a production environment. We therefore have to look for a more efficient distributed transaction model.
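
    XA is built on two-phase commit, and a minimal sketch of that protocol shows where the extra cost comes from: every transaction needs a prepare round and then a commit round across all participants. The Participant class and the two_phase_commit function below are hypothetical, in-memory stand-ins; a real coordinator also has to log its decisions durably and handle timeouts and crashes, which this sketch ignores.

```python
class Participant:
    """Toy resource manager taking part in a distributed transaction."""

    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit  # simulates a local failure when False
        self.state = "idle"

    def prepare(self):
        # Phase 1: vote yes only if the local work can be made durable.
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def commit(self):
        self.state = "committed"

    def rollback(self):
        self.state = "aborted"


def two_phase_commit(participants):
    """Return True if the transaction committed on every participant."""
    # Phase 1: one round trip to every participant to collect votes.
    votes = [p.prepare() for p in participants]
    # Phase 2: a second round trip to deliver the global decision.
    if all(votes):
        for p in participants:
            p.commit()
        return True
    for p in participants:
        p.rollback()
    return False


ok_nodes = [Participant("db-1"), Participant("db-2")]
committed = two_phase_commit(ok_nodes)

bad_nodes = [Participant("db-1"), Participant("db-2", can_commit=False)]
aborted = not two_phase_commit(bad_nodes)
```

    A single "no" vote aborts the transaction everywhere, and participants that voted "yes" must hold their locks until the decision arrives, which is one source of the added latency.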

    (3) The third is queries. Finding data is easy; finding it with high performance is hard. Moreover, the data is sharded, so how to distribute a query task, whether to gather the data on one node or push the logic down to each node, is a design trade-off.
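
    The two options can be compared with a small sketch that computes the same SUM over sharded rows in both ways. The data, node names, and function names are made up for illustration, and the "transferred" count is only a crude proxy for network cost.

```python
# Hypothetical order rows, already split across two nodes.
shard_rows = {
    "node-1": [
        {"city": "Beijing", "amount": 10},
        {"city": "Shanghai", "amount": 5},
    ],
    "node-2": [
        {"city": "Beijing", "amount": 7},
        {"city": "Beijing", "amount": 3},
        {"city": "Shenzhen", "amount": 8},
    ],
}


def sum_centralized(shard_rows, city):
    """Pull every row to one node, then filter and aggregate there."""
    pulled = [row for rows in shard_rows.values() for row in rows]
    total = sum(row["amount"] for row in pulled if row["city"] == city)
    return total, len(pulled)  # (result, rows transferred)


def sum_pushdown(shard_rows, city):
    """Push the filter and a partial SUM to each node, then merge."""
    partials = [
        sum(row["amount"] for row in rows if row["city"] == city)
        for rows in shard_rows.values()
    ]
    return sum(partials), len(partials)  # (result, values transferred)


central_total, central_moved = sum_centralized(shard_rows, "Beijing")
pushed_total, pushed_moved = sum_pushdown(shard_rows, "Beijing")
```

    Both approaches give the same total, but the centralized version moves five rows while the pushdown version moves two partial sums. Pushdown is not always possible or cheaper (joins and sorts across shards are harder to decompose), which is why this remains a trade-off.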

    (4) The fourth is replication, which is the high-reliability design. The traditional primary-standby replication mechanism of a single-node database can still be used, but under it only the primary node does work while the standby node sits idle. The newer design uses the Paxos protocol to build replication groups on top of shards. This yields smaller, highly reliable units, and the primary replicas of the different replication groups can be spread across multiple nodes so that machine resources are fully used.
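
    The placement idea can be sketched as follows. This is not an implementation of Paxos itself; it only shows, under the assumption of a simple round-robin policy, how per-shard replication groups let every node act as primary for some shards. The function names and the shard and node identifiers are hypothetical.

```python
from collections import Counter


def place_groups(shard_ids, nodes, replicas=3):
    """Assign each shard a replication group, rotating the primary."""
    if replicas > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    groups = {}
    for i, shard in enumerate(shard_ids):
        # Pick `replicas` distinct nodes, starting at a rotating offset.
        members = [nodes[(i + j) % len(nodes)] for j in range(replicas)]
        groups[shard] = {"primary": members[0], "followers": members[1:]}
    return groups


def majority(replicas):
    """Replicas that must acknowledge a write in a majority-based protocol."""
    return replicas // 2 + 1


groups = place_groups(["s1", "s2", "s3", "s4", "s5", "s6"], ["n1", "n2", "n3"])
primaries_per_node = Counter(g["primary"] for g in groups.values())
```

    With six shards on three nodes, each node is primary for two shards and follower for the other four, so no machine is a purely idle standby. With three replicas, a write needs two acknowledgements, so a group keeps working when one of its nodes fails.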

Origin blog.csdn.net/MOU_IT/article/details/115339063