New Ideas for Cloud Native Database Design


Before getting to the new ideas, I will give a brief historical review for readers who have not followed database technology closely, and then talk about new trends and cutting-edge thinking in cloud-native database design. First, let's look at some mainstream distributed database design patterns.

Common Distributed Database Schools
I classify the development of distributed databases by era, into four generations so far. The first generation does data sharding and horizontal scaling through simple sub-database, sub-table schemes or middleware. The second generation is the NoSQL databases represented by Cassandra, HBase, and MongoDB, widely adopted by Internet companies and offering good horizontal scalability.

I personally consider the third generation to be the new wave of cloud databases represented by Google Spanner and AWS Aurora. Their hallmark is combining SQL with NoSQL-style scalability: they expose a SQL interface to the business layer while still being able to scale horizontally.

The fourth generation, with the current TiDB design as an example, enters the era of mixed workloads: one system that not only handles high-concurrency transactions but can also take over some of the work of a data warehouse or analytical database. This is called HTAP, an integrated database product.

As for what the future looks like, I will offer some outlook later in this talk. Looking at the whole timeline, from the 1970s to today, the database is considered an old industry, so I will not expand on each stage of its development.


Database middleware
The first-generation systems are middleware systems, and there are basically two mainstream models. The first is manual sub-database, sub-table in the business layer: the application itself decides, for example, that data for Beijing goes into one database while data for Shanghai goes into another database or a different table. This is the simplest form of manual sharding at the business layer, and anyone who has operated a database is familiar with it.

The second is to specify sharding rules through a database middleware layer. For example, the user's city, the user's ID, or a timestamp serves as the sharding key, and the middleware distributes data automatically, without involving the business layer, as in the sketch below.
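
To make this concrete, here is a minimal sketch of middleware-style routing, assuming a hash rule on the user ID; the shard count, connection strings, and `route` function are all hypothetical:

```python
# A minimal sketch of middleware-style sharding with a hash rule on user_id.
# Shard count and connection strings are hypothetical.

SHARD_COUNT = 4
SHARD_DSNS = [f"mysql://db-{i}.internal/orders" for i in range(SHARD_COUNT)]

def route(user_id: int) -> str:
    """Map a row to a physical database by hashing the sharding key."""
    return SHARD_DSNS[hash(user_id) % SHARD_COUNT]

# Any query that carries the sharding key hits exactly one shard:
print(route(user_id=42))
```

The business layer issues ordinary queries, and the middleware rewrites them against whichever shard `route` selects.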

The advantage of this approach is simplicity. If the workload is particularly simple, reads and writes can basically be served by a single shard, and once the application layer is fully adapted, latency stays relatively low. On the whole, if the workload is evenly spread, business TPS can also scale linearly.

But the disadvantages are also obvious. For more complex workloads, especially cross-shard operations such as queries or writes, maintaining strong data consistency across shards is troublesome. Another clear drawback is the operational burden on large clusters, especially for schema changes: imagine a table split across one hundred shards, where adding or dropping a column means performing the operation on one hundred machines. That is genuinely painful.

NoSQL: Not Only SQL
Around 2010, many Internet companies recognized this pain point. Examining their businesses carefully, they found the workloads were actually simple and did not need SQL's more advanced features, and so a new school emerged: the NoSQL database. NoSQL's defining trait is giving up advanced SQL capabilities; there are always trade-offs, and by giving something up you gain something else. NoSQL trades SQL away for horizontal scalability that is transparent to the business. Conversely, if your business was originally written against SQL, moving to NoSQL may carry a fairly large rewrite cost. Representative systems include MongoDB, Cassandra, and HBase, as mentioned earlier.

The most famous of these is MongoDB. Although MongoDB is distributed, it still resembles a sub-database, sub-table scheme: you must choose a shard key. Its advantages are familiar to everyone: no fixed table schema, write whatever you like, very friendly to document data. But the disadvantages are also obvious. Since a shard key is chosen and data is partitioned by a fixed rule, cross-shard aggregation is troublesome, and cross-shard ACID transactions were not well supported.


HBase is the well-known distributed NoSQL database of the Hadoop ecosystem, built on top of HDFS. Cassandra is a distributed KV database whose characteristic is offering multiple consistency models for KV operations. The disadvantages are those shared by many NoSQL systems, including operational complexity and the rewrite cost a KV interface imposes on existing business code.

The Third-Generation Distributed Databases: NewSQL
Both approaches just mentioned, sharding (whether manual or middleware-based) and NoSQL, intrude on the business: if your business relies heavily on SQL, neither is comfortable. So some technically advanced companies began asking whether the strengths of traditional databases, such as SQL expressiveness and transactional consistency, could be combined with the good properties of the NoSQL era, such as scalability, into a new kind of system: one that scales out yet is as convenient to use as a single-machine database. Out of this idea two schools were born, Spanner and Aurora, each the choice of a top Internet company facing this problem.

The Shared Nothing School

The Shared Nothing school is represented by Google Spanner. Its first advantage is nearly unlimited horizontal scalability: the system has no single bottleneck, and whether the data is 1 TB, 10 TB, or 100 TB, the business layer basically does not have to worry about scale. The second advantage is that its design goal is strong SQL support: no sharding rules or strategies need to be specified, and the system scales out automatically. The third is support for strongly consistent transactions, just like a single-machine database, making it usable for financial-grade services.


Representative products are Spanner and TiDB. This type of system also has shortcomings. In essence, a purely distributed database cannot behave exactly like a single machine in every respect. For example, a transaction on a single-machine database completes entirely on that machine, but when a distributed database implements the same semantics, the rows the transaction touches may live on different machines, requiring several rounds of network communication; the response time certainly cannot match a single local operation, so there are some differences in compatibility and behavior compared with a stand-alone database. Even so, for many businesses, distributed databases still hold many advantages over sub-database, sub-table schemes; in terms of ease of use, for example, they are far less intrusive.
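
To see where those extra network rounds come from, here is a toy two-phase-commit sketch, not Spanner's or TiDB's actual protocol: a transaction touching N machines pays at least a prepare and a commit message per participant, where a single-machine transaction pays one local write.

```python
# Toy two-phase commit: each cross-shard transaction costs two network
# round trips per participant (prepare, then commit), which is why it can
# never be as fast as a single-machine commit. Illustrative only.

class Participant:
    def __init__(self, name: str):
        self.name = name
        self.staged = None

    def prepare(self, writes) -> bool:   # round trip 1: stage and vote
        self.staged = writes
        return True

    def commit(self):                     # round trip 2: make it durable
        print(f"{self.name}: committed {self.staged}")

def two_phase_commit(participants, writes_per_node):
    # Phase 1: all participants must vote yes before anything is visible.
    if all(p.prepare(w) for p, w in zip(participants, writes_per_node)):
        # Phase 2: tell every participant to apply its staged writes.
        for p in participants:
            p.commit()

two_phase_commit([Participant("shard-A"), Participant("shard-B")],
                 [{"row-1": "x"}, {"row-9": "y"}])
```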

The Shared Everything School

The second school is Shared Everything, represented by AWS Aurora and Alibaba Cloud's PolarDB. Many databases define themselves as Cloud-Native Databases, but I think the "cloud-native" here mostly means that these offerings are usually delivered as public-cloud services; as for whether the technology itself is cloud-native, service providers offer no uniform standard. From a purely technical point of view, the core point is that compute and storage in this type of system are completely separated: compute nodes and storage nodes run on different machines, and storage is roughly equivalent to running MySQL on a cloud disk. I personally do not consider an architecture like Aurora's or PolarDB's to be purely distributed.


Traditional MySQL master-slave replication uses the binlog. Aurora, as the representative Shared Everything database on the cloud, replicates the entire I/O flow purely in the form of the redo log, rather than driving writes through the whole I/O path until the binlog is produced, shipping the binlog to another machine, and replaying it there. Aurora's I/O path is therefore much shorter, which was a big innovation.

The replication unit becomes smaller: only the physical log is shipped, not the binlog, let alone raw statements. Shipping the physical log directly means a shorter I/O path and smaller network packets, so the throughput of the whole system is much better than a traditional MySQL deployment.
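
A conceptual contrast, not Aurora's actual wire format, of what each scheme ships over the network:

```python
# Logical replication resends changes the replica must re-execute, while
# redo-style replication ships compact physical page deltas the storage
# layer applies directly. Field layouts here are purely illustrative.

from dataclasses import dataclass

@dataclass
class BinlogEvent:        # logical: the replica re-runs the change
    statement: str

@dataclass
class RedoRecord:         # physical: apply these bytes to this page
    page_id: int
    offset: int
    payload: bytes

logical = BinlogEvent("UPDATE users SET balance = balance - 100 WHERE id = 42")
physical = RedoRecord(page_id=8177, offset=96, payload=b"\x00\x00\x27\x0f")

print(len(logical.statement), "bytes of SQL vs", len(physical.payload), "bytes of page delta")
```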


Aurora's advantage is 100% MySQL compatibility; businesses can basically adopt it without modification, and for Internet scenarios where consistency requirements are not strict, reads can also scale out horizontally. That said, whether with Aurora or PolarDB, read performance has an upper limit.

Aurora's shortcomings are also visible. In essence it is still a single-machine database: all the data is stored together, and Aurora's compute layer is really a MySQL instance that knows nothing about how the underlying data is distributed. If you have a large write volume or large cross-shard queries, supporting big data volumes still requires sub-database, sub-table on top. So Aurora is best described as a better cloud-hosted single-machine database.

Fourth-Generation Systems: Distributed HTAP Databases
The fourth generation is the new form of HTAP database. The full name, Hybrid Transactional and Analytical Processing, is self-explanatory: the same system serves both transactions and real-time analysis. The advantage of an HTAP database is that it offers unlimited horizontal scalability like NoSQL and SQL queries with transaction support like NewSQL. More importantly, under mixed workloads, OLAP does not interfere with the OLTP business, and keeping everything in one system saves the trouble of shuttling data back and forth. At present, as far as I can see, only the TiDB 4.0 plus TiFlash architecture meets these requirements in the industry.

Distributed HTAP database: TiDB (with TiFlash)

Why can TiDB achieve complete isolation between OLAP and OLTP, with neither affecting the other? Because TiDB separates compute from storage, and the underlying storage is a multi-replica mechanism in which some replicas can be converted to columnar storage. OLAP requests can be routed directly to the columnar replicas, that is, the TiFlash replicas, which provide high-performance columnar analytics. The same data thus serves real-time transactions and real-time analysis at once. This is a great innovation and breakthrough at the level of TiDB's architecture.
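
As a small illustration of how reads are steered, the session variable and optimizer hint below appear in TiDB's documentation for 4.0 and later; the connection details and the `orders` table are hypothetical:

```python
# Sketch: route this session's analytical reads to TiFlash columnar replicas.
import pymysql

conn = pymysql.connect(host="tidb.internal", port=4000, user="root", database="app")
with conn.cursor() as cur:
    # Restrict reads in this session to the columnar (TiFlash) engine.
    cur.execute("SET SESSION tidb_isolation_read_engines = 'tiflash'")
    cur.execute("SELECT region, SUM(amount) FROM orders GROUP BY region")
    print(cur.fetchall())

    # Or per query: hint the optimizer to read one table from TiFlash.
    cur.execute("SELECT /*+ READ_FROM_STORAGE(TIFLASH[orders]) */ COUNT(*) FROM orders")
```

OLTP traffic meanwhile keeps hitting the row-store (TiKV) replicas, which is what keeps the two workloads from contending.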


The figure below shows a test of TiDB against MemSQL, with a workload constructed from a user scenario. The horizontal axis is OLTP concurrency and the vertical axis is OLTP performance; blue, yellow, and green correspond to different levels of OLAP concurrency. The purpose of the experiment is to run OLTP and OLAP on one system while continuously raising the concurrency pressure of both, to see whether the two workloads interfere with each other. On the TiDB side, as OLTP and OLAP concurrency increase, the performance of both workloads shows no significant change and stays almost flat. In the same experiment on MemSQL, performance degrades badly: as OLAP concurrency rises, OLTP performance drops significantly.

[Figure: mixed OLTP/OLAP benchmark, TiDB vs. MemSQL]

Below is an example of TiDB in a user's actual business scenario: while OLAP queries are running, the OLTP side still achieves smooth writes, with latency held at a consistently low level.

[Figure: OLTP write latency during concurrent OLAP queries, from a user's production workload]

Where is the future
Snowflake

Snowflake is a data warehouse built 100% on the cloud, with its underlying storage on S3; basically every public cloud provides an object storage service like S3. Snowflake is likewise a pure compute-storage-separation architecture. Its compute nodes, called Virtual Warehouses, can be thought of as EC2 units with local caches and log disks. Snowflake's primary data lives on S3, while the compute nodes run on public-cloud virtual machines.


This is what the data format Snowflake stores in S3 looks like: each S3 object is a roughly 10 MB, append-only file. Each file carries its own metadata, and the data inside is laid out on disk in columnar form.
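
A minimal sketch of that write pattern, assuming boto3 and a hypothetical bucket: a batch of rows is pivoted into column chunks and written as one immutable object under a fresh key; objects are never updated in place.

```python
import json
import boto3

s3 = boto3.client("s3")
BUCKET = "my-warehouse-data"  # hypothetical

def write_micro_partition(table: str, file_no: int, rows: list) -> str:
    # Pivot row-oriented input into column chunks.
    columns = {k: [r[k] for r in rows] for k in rows[0]}
    body = json.dumps({"meta": {"rows": len(rows)}, "columns": columns})
    key = f"{table}/part-{file_no:08d}.json"  # append-only: always a new key
    s3.put_object(Bucket=BUCKET, Key=key, Body=body.encode())
    return key

write_micro_partition("orders", 1, [{"id": 1, "amount": 30}, {"id": 2, "amount": 55}])
```

(The real format is a proprietary compressed columnar layout, not JSON; the point is the immutable, self-describing, bounded-size object.)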


Snowflake's most important feature is that different compute resources can be assigned to the same piece of data. One query may need only two machines while another needs far more compute, and it does not matter, because all of the data lives on S3. Simply put, it is as if any number of machines could mount the same disk to handle different workloads. This is an important example of decoupling compute from storage.

Google BigQuery

The second system is BigQuery, the big data analytics service on Google Cloud, whose architecture is similar to Snowflake's. BigQuery's data is stored in Google's internal distributed file system, Colossus; Jupiter is the internal high-performance network; and on top sit Google's compute nodes.


BigQuery's processing performance is excellent: the bidirectional bandwidth inside the data center can reach 1 PB per second. If you use 2,000 dedicated compute units (slots), the cost is about 40,000 US dollars a month. BigQuery is pay-as-you-go: if a query uses two slots, you are charged for exactly those two slots. Storage is relatively cheap, at about 20 US dollars per TB per month.

RockSet

The third system is RockSet. Everyone knows RocksDB, the well-known single-machine KV store whose storage engine is an LSM-Tree. The core idea of the LSM-Tree is tiered design: the colder the data, the lower the level it sinks to. RockSet places the lowest levels on S3, while the upper levels use local disk or local memory as the engine. The structure is naturally tiered, and the application does not perceive whether data sits on a cloud disk or a local one; with a good local cache, you can be entirely unaware of the cloud storage underneath.
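
A conceptual sketch of that tiered read path, with all backends as stand-ins: look in memory first, then the local SSD cache, and only on a miss fall through to the object store.

```python
class TieredStore:
    def __init__(self, s3_fetch):
        self.memtable = {}        # hottest data, in RAM
        self.ssd_cache = {}       # warm data; stand-in for a local-disk cache
        self.s3_fetch = s3_fetch  # coldest levels live on S3

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        if key in self.ssd_cache:
            return self.ssd_cache[key]
        value = self.s3_fetch(key)   # slow path: object-store latency
        self.ssd_cache[key] = value  # warm the cache for the next read
        return value

store = TieredStore(s3_fetch=lambda k: f"value-of-{k}-from-s3")
print(store.get("user:42"))  # first read pays the S3 round trip
print(store.get("user:42"))  # second read is served locally
```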

Looking across these three systems, I see several common traits. First, they are all natively distributed. Second, they are built on standard cloud services, especially S3 and EBS. Third, they are pay-as-you-go, fully exploiting the elasticity of the cloud in their architecture. Of these, I think storage is the most important point: the storage system determines the design direction of a database on the cloud.

Why is S3 the key?
Among storage options, I think S3 may be the most critical. We have in fact studied EBS as well; the first stage of TiDB's cloud work actually builds on EBS block storage. But looking further out, I think the more interesting direction lies with S3.

First, S3 is very cost-effective, much cheaper than EBS. Second, S3 provides high reliability, at nine nines. Third, its throughput scales linearly. Fourth, it is naturally cross-cloud: every cloud has an object storage service with an S3-compatible API. The problem with S3 is that random-write latency is very high, although throughput is good, so we have to exploit the good throughput while designing around the high latency. Below is an S3 benchmark showing that as instance types improve, throughput keeps improving as well.
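
The standard way to lean on that throughput is concurrency. A sketch, with hypothetical bucket, key, and sizes; `get_object` with a `Range` header is the standard S3 API:

```python
# Many concurrent range reads: single-request latency stays high, but
# aggregate throughput scales almost linearly with parallelism.
from concurrent.futures import ThreadPoolExecutor
import boto3

s3 = boto3.client("s3")
BUCKET, KEY = "my-data", "big-file.bin"      # hypothetical
OBJECT_SIZE, CHUNK = 1 << 30, 8 << 20        # 1 GiB object, 8 MiB ranges

def read_range(start: int) -> bytes:
    end = min(start + CHUNK, OBJECT_SIZE) - 1
    resp = s3.get_object(Bucket=BUCKET, Key=KEY, Range=f"bytes={start}-{end}")
    return resp["Body"].read()

with ThreadPoolExecutor(max_workers=32) as pool:
    data = b"".join(pool.map(read_range, range(0, OBJECT_SIZE, CHUNK)))
```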

[Figure: S3 throughput benchmark across instance types]

How to solve the latency problem?
To address S3's latency, there are several ideas: use SSDs or local disks as a cache, as RockSet does; write logs through Kinesis to reduce overall write latency; replicate data or process requests concurrently; and use zero-copy data cloning. These are all ways to bring latency down.
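
A sketch of such a write path, with every component a stand-in: acknowledge a write once it is in a low-latency durable log (Kinesis is one option, a local NVMe WAL another), and flush batches to S3 asynchronously so S3's write latency never sits on the commit path.

```python
import queue
import threading

log = []                 # stand-in for Kinesis or a local write-ahead log
pending = queue.Queue()

def upload_batch_to_s3(batch):
    # Stand-in: a real system would put_object one merged blob to S3.
    print(f"flushed {len(batch)} writes to S3 as one object")

def write(key, value):
    log.append((key, value))   # durable low-latency append: ack the client now
    pending.put((key, value))  # the S3 upload happens off the commit path

def s3_flusher():
    batch = []
    while True:
        batch.append(pending.get())
        if len(batch) >= 100:       # merge many small writes into one object
            upload_batch_to_s3(batch)
            batch = []

threading.Thread(target=s3_flusher, daemon=True).start()
```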

The examples above have something in common; have you noticed that they are all data warehouses? A data warehouse demands throughput more than latency: a query may run for five seconds before producing a result, and nobody requires an answer within five milliseconds. Point lookups are a different story. In a Shared Nothing database, a point lookup may need only one RPC from the client, but in an architecture that separates compute and storage, the network must be crossed twice. This is a core problem.


You might say it doesn't matter: compute and storage are separated anyway, so with enough brute force you can always add compute nodes. But I don't think the new ideas need to be that extreme. Aurora separates compute and storage yet remains a single-machine database; Spanner is a purely distributed database, and its pure Shared Nothing architecture does not exploit some of the advantages that cloud infrastructure provides.

For example, a future database could be designed like this: keep a little state in the compute layer, because every EC2 instance comes with a local disk, and mainstream instances now carry SSDs, so hot data can live at this layer. Shared Nothing, high availability, and random reads and writes are all handled here; once hot data misses the cache, the request falls through to S3, which holds only the lower, colder levels. The possible problem with this approach is that a miss that penetrates the local cache introduces some latency jitter.


The benefits of this architecture: first, it keeps data affinity for real-time computation, with plenty of data on local disk, so many traditional database performance-optimization techniques still apply; second, data migration becomes very simple, because the underlying storage is shared, all of it on S3. Moving data from machine A to machine B requires no actual migration: machine B simply starts reading the data.
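
A sketch of why that rebalancing is cheap: on shared storage the shard-to-node map is pure metadata, so "moving" a shard just changes who serves it. Names here are illustrative.

```python
placement = {"shard-1": "node-A", "shard-2": "node-A", "shard-3": "node-B"}

def move_shard(shard: str, target_node: str):
    # No bytes are copied: the data stays on S3. The target node simply
    # starts serving the shard and warms its local cache on demand.
    placement[shard] = target_node

move_shard("shard-2", "node-B")
print(placement)  # {'shard-1': 'node-A', 'shard-2': 'node-B', 'shard-3': 'node-B'}
```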

The drawbacks: first, latency becomes higher once a request penetrates the cache; second, compute nodes now carry state, so if a node goes down, failover has to deal with log replay, which adds some implementation complexity.


There are still many topics worth studying
The architecture above is only a hypothesis; TiDB is not built this way yet, though there may be attempts or research in this direction in the future. There are many open questions in this field that no one has answered, not the cloud vendors, not us, and not academia.

There are some concrete research topics already. First, if we use local disks, how much data should be cached, what should the LRU strategy look like, and how do those choices relate to performance and to the workload? Second, on the network side, we just saw that S3's throughput scales very well; what instance types and how many compute nodes should be matched to a given throughput, especially for the reshuffles in more complex queries? Third, what is the relationship between query complexity and the number and type of compute nodes? These are genuinely hard problems, especially to express mathematically, because all of these decisions need to be made automatically.
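
On the first question, the mechanism itself is simple; the open problem is choosing the capacity (and whether LRU is even the right policy) as a function of the workload. A minimal LRU cache for reference:

```python
from collections import OrderedDict

class LRUCache:
    def __init__(self, capacity: int):
        self.capacity = capacity     # the hard research question: how big?
        self.items = OrderedDict()

    def get(self, key):
        if key not in self.items:
            return None                  # miss: the caller falls back to S3
        self.items.move_to_end(key)      # mark as most recently used
        return self.items[key]

    def put(self, key, value):
        self.items[key] = value
        self.items.move_to_end(key)
        if len(self.items) > self.capacity:
            self.items.popitem(last=False)   # evict least recently used
```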

Even if all these problems are solved, I think it would only be the beginning of the cloud database era. How to design better databases for serverless and AI-driven scenarios is the direction of our efforts. To close with Qu Yuan's line: the road ahead is long. We still have a lot to do. Thank you.

