On the future of the database: written on the eve of PingCAP's fifth anniversary

Data is the center of the architecture

As an architect in the Internet industry, I deal with all kinds of data almost every day. After many years of experience across different industries and different systems, if I abstract everything to the highest technical level, it can be summarized in one sentence:

Data is the center of the architecture.

If you think about it, all the work we do actually revolves around data: its generation, its storage, its consumption, its flow... we merely change the form and serving of data according to different needs. Students who studied computer science may remember the classic line from class: program = algorithm + data structure. Let me boldly imitate it: system = business logic × data. It is fair to say that many architectural problems originate at the data layer; consider the problems caused by the common "siloed system", especially data islands. The root cause is that the data layer was never sorted out cleanly. If you only patch the architecture symptom by symptom, treating the head when the head aches and the foot when the foot hurts, you can tinker for a long time and it will still feel awkward. Conversely, if the data layer is managed well, it is like unblocking the body's main meridians: a small, well-placed effort yields an outsized result.

But ideals are rich and reality is thin. At least five years ago, when we started the company, we felt that no existing system solved the data problem well. Curious readers may ask: what about Hadoop? And NoSQL? Couldn't a relational database, at worst, be sharded across databases and tables? Indeed, those were almost all the candidates for storage problems back then. Their common feature: none of them was complete.

Specifically, each of these solutions covers only a narrow slice of data application scenarios. For a more complex service, you may need to combine N of them to achieve full coverage. This is why, as Internet services have grown more complex in recent years, data pipelines like Kafka have become more and more popular. From the perspective of data governance this is easy to understand: each data platform serves its own purpose, so to achieve full coverage, roads must be built between the "islands".

We kept wondering: could there be one system that covers as many scenarios as possible behind a unified interface?

We need a Single Source of Truth. Data should run through every corner of the application logic. In my ideal system, access to any data should be unrestricted (set aside permissions and security for now; that is a separate problem). "Unrestricted" here is broad. For example: no capacity ceiling, so that given sufficient physical resources the system can scale out indefinitely; no access-model restriction, so we can freely join and aggregate data; no consistency compromise; and operations that require almost no human intervention...

An architecture with a distributed database as its unified center

I was particularly fascinated by an American TV series: Person of Interest. In it there is a god-like artificial intelligence, The Machine, which collects all data and analyzes it to predict, or intervene in, people's future actions. The show itself is a fairly orthodox hero story, but what fascinated me more was: could we actually design The Machine? I am still no AI expert, but designing a database for The Machine seemed feasible. Over these years of building the company, what we found most exciting is:

It is feasible to build an architecture with a distributed database as its unified center.

How to understand this? Take the problems caused by the splitting mentioned above: fragmentation at the data layer necessarily means the business layer must become more complex to compensate. Many engineers prefer to think linearly about the cost of maintaining a system, but experience tells us otherwise: a system with ten databases is not simply 10x as complex as a system with one. Once you account for the flow of data between them, the maintenance cost can only be higher, not to mention the additional problems brought by heterogeneity.

What does a distributed-database-centric architecture look like? Simply put, the center of the whole architecture is a storage system with broad enough scenario coverage and unlimited horizontal scalability. Most data flow is confined to this database, so the application layer can be almost stateless: the central database holds most of the state, while each application accelerates itself with its own cache. Why do I keep emphasizing horizontal scalability? Because limited scalability is itself a major reason for splitting. We have never been able to predict the future accurately; it is hard to imagine how our business will change even a year out (think of the epidemic). As the old saying goes: the only constant is change.

Another frequently asked question: why do I emphasize that the cache layer should sit closer to the business layer? Or, why shouldn't the giant database at the center take on the caching responsibility itself? My understanding is that only the business truly understands the business: it knows what strategy should be used to cache its data. And for performance (low latency), it makes sense for the cache to be closer to the business.
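The idea above is essentially the cache-aside pattern: the application, not the central database, owns the caching policy. Here is a minimal sketch; the class name, the dict-backed store, and the TTL policy are all illustrative assumptions, not any particular database's API.

```python
import time

class CacheAside:
    """Toy cache-aside layer: the application decides what to cache
    and for how long (TTL); the backing store stays the single
    source of truth."""

    def __init__(self, backing_store, ttl_seconds=60.0):
        self.store = backing_store      # the central database (here: a dict)
        self.ttl = ttl_seconds
        self.cache = {}                 # key -> (value, expires_at)

    def get(self, key):
        entry = self.cache.get(key)
        if entry is not None and entry[1] > time.monotonic():
            return entry[0]             # cache hit: served near the business
        value = self.store[key]         # miss: read from the source of truth
        self.cache[key] = (value, time.monotonic() + self.ttl)
        return value

    def put(self, key, value):
        self.store[key] = value         # write to the source of truth first
        self.cache.pop(key, None)       # then invalidate the stale copy
```

Note that eviction on write rather than update-on-write is itself a business choice, which is exactly the point: the policy lives with the application.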

Corresponding to the saying above, "the only constant is change", the biggest benefit of this architecture is that it responds to change with constancy. In a single word: simplicity. Google actually thought about this problem very early, because they understood early on what real complexity is.

Another example is HTAP. If you follow the development of databases, you must be familiar with this recent term. In my opinion, the essence of HTAP is exactly the coverage mentioned above. Below is a typical traditional architecture:

The traditional data architecture usually separates OLTP, OLAP, and the offline data warehouse; each system performs its own duties, and they are synchronized through independent pipelines (sometimes with ETL). Here is what an HTAP system looks like:

On the surface this only unifies the interface layer, but the implications are far-reaching. First, the details of data synchronization are hidden, which means the database layer can decide for itself how to synchronize data. Because the OLTP engine and the OLAP engine live in the same system, many details, such as transaction information, are not lost during synchronization, so the internal analytical engine can do things a traditional OLAP system cannot. In addition, for the business layer, one less system means a more unified experience and lower costs of learning and migration. Do not underestimate the power of unification.
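To make "one interface, synchronization hidden inside" concrete, here is a deliberately tiny toy model, not TiDB's actual design: a single store that serves point reads and writes from a row layout while keeping a columnar copy in sync internally for aggregates.

```python
from collections import defaultdict

class ToyHTAP:
    """Toy 'HTAP' store: one interface, two internal engines.
    Writes land in a row store; a columnar copy is kept in sync
    inside the same system, so analytical scans never cross a
    pipeline and lose no detail on the way."""

    def __init__(self):
        self.rows = {}                    # row engine: id -> record dict
        self.columns = defaultdict(dict)  # column engine: field -> {id: value}

    def insert(self, row_id, record):
        # OLTP-style write; the columnar sync is invisible to the caller.
        self.rows[row_id] = record
        for field, value in record.items():
            self.columns[field][row_id] = value

    def get(self, row_id):
        # OLTP-style point read, served by the row engine.
        return self.rows[row_id]

    def sum(self, field):
        # OLAP-style aggregate, served by the columnar copy.
        return sum(self.columns[field].values())
```

The caller sees one `insert`/`get`/`sum` surface; whether a query hit the row or column engine is an internal decision, which is the unification the paragraph above describes.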

Where is the future?

The above is what happened over the past five years, and step by step it has almost all become reality, following the ideas we had when we started the company. So what will happen in the next five years? As my understanding of the industry and the technology deepens, there is at least one thing I am convinced of:

Elastic scheduling will be the core capability of future databases

No one will deny that the biggest change in IT over the last decade was brought by the cloud, and this revolution is still in progress. What is the core capability of the cloud? I think it is elasticity. The granularity of computing-resource allocation keeps getting finer, like going from only being able to buy a house, to renting one, to checking into a hotel for a night. What does this mean? In essence, we no longer have to pay in advance for "imagined" business peaks.

In the past, whether we purchased servers or leased cabinets, we had to commit capacity up front; before the business peak ever arrived, those costs were already paid. The cloud has turned elasticity into a fundamental capability of infrastructure, and I expect the same thing to happen with databases.
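The economics of "paying in advance for an imagined peak" can be shown with a little arithmetic; the numbers below are purely illustrative assumptions.

```python
def upfront_cost(peak_capacity, unit_price, months):
    """Pre-cloud model: provision for the imagined peak and pay
    for that full capacity every month, whether it is used or not."""
    return peak_capacity * unit_price * months

def elastic_cost(monthly_usage, unit_price):
    """Cloud model: pay only for what each month actually consumes."""
    return sum(usage * unit_price for usage in monthly_usage)
```

With an imagined peak of 100 capacity units but actual usage of 30 units in eleven months and 100 in one, the upfront model pays for 1200 unit-months while the elastic model pays for 430, which is the waste the paragraph above describes.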

Many friends may be puzzled: don't almost all databases already claim to support transparent horizontal scaling? I hope no one understands "elastic scheduling" narrowly as mere scalability; the emphasis of the phrase is on "scheduling". A few examples to aid understanding:

  1. Can the database automatically identify the workload and scale according to it? For example: anticipating that a peak is coming, it automatically provisions machines, creates more replicas of hot data, redistributes data, and expands capacity in advance; after the peak, it automatically reclaims machines to scale back down.

  2. Can the database perceive the characteristics of the business and decide data placement from access patterns? For example: if data has obvious geographic characteristics (say, Chinese users mostly access it from China, American users from the United States), the system automatically places data in different data centers according to those geographic characteristics.

  3. Can the database perceive the type and frequency of queries and automatically decide the storage medium for different kinds of data? For example: cold data is automatically moved to cheap storage such as S3, hot data is kept on high-end flash, and the exchange between hot and cold tiers is completely transparent to the business.
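The three examples above are all scheduling policies. As a sketch of what such policies might look like, here are two toy decision functions; the thresholds and names are illustrative assumptions, not any real database's algorithm.

```python
def place(access_count_per_day, hot_threshold=100):
    """Toy tiering policy (example 3): choose a storage tier for a
    data block from its observed access frequency."""
    if access_count_per_day >= hot_threshold:
        return "flash"   # hot data stays on fast local storage
    return "s3"          # cold data migrates to cheap object storage

def replicas(predicted_peak_qps, qps_per_replica=1000, min_replicas=3):
    """Toy elastic-scaling policy (example 1): size the replica count
    to the predicted peak, never dropping below a safe minimum."""
    needed = -(-predicted_peak_qps // qps_per_replica)  # ceiling division
    return max(min_replicas, needed)
```

A real scheduler would run loops like these continuously against live workload statistics and then drive data movement and machine provisioning from the decisions.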

Everything mentioned here ultimately relies on the capability of "elastic scheduling". I believe the cost of physical resources will keep falling; and as the unit price of computing keeps dropping, once storage and compute stop being the bottleneck, the problem becomes "how to allocate resources efficiently". If efficient allocation is the goal, then "being able to schedule" is the obvious foundation. Of course, as with the development of all things, one must learn to walk before learning to run. I believe that in the coming period we will see the first batch of new databases with the beginnings of this capability. Let's wait and see.

The next stage is intelligence

What does the more distant future look like? I don't know. But just like The Machine, intelligence can only be born from enough data. Just as we do not understand the universe or the ocean, our current understanding of data must be superficial; a great deal of data has not even been recorded yet. Greater mysteries must be hidden in this sea of data. What insights can be drawn from it, and how might they change our lives for the better? I do not know, but I suspect the protagonist of that story will not be human. What we discuss in this section may sound a bit like science fiction, but I am willing to believe in such a future: a new intelligence born from the ocean of data.

Epilogue

Looking back over these five years of building the company, the original, simplest starting point was: write a better database that thoroughly solves the annoying problem of MySQL sharding across databases and tables. It seems we have not strayed from that original intention, but along the journey we have, step by step, seen a bigger world, and gained more and more ability and confidence to turn what we believe into reality:

I have a dream that, in the future, software engineers will no longer work overtime through the night to maintain databases, and all kinds of data-related problems will be handled automatically and properly by the database itself;

I have a dream that, in the future, our handling of data will no longer be fragmented, and any business system will be able to store and retrieve data with ease;

I have a dream that, in the future, when facing the flood of data, we will be able to calmly respond to every change with constancy.

Recently I heard a saying that I personally love: half of ambition is patience. Building a perfect database is not the work of a single day, but I believe we are on the right path.

What is past is prologue.

