In Depth: Decoding the "Poetry and Distant Horizons" of the Database





Data intelligence industry innovation service media

——Focusing on digital intelligence and business transformation


Unlike the gold and oil of history, data has become our new treasure, an inexhaustible source that drives social progress and catalyzes innovation. But this varied and complex data needs a manager, a vault, and an interpreter, and that is the role of the database.

To sort out the development of the database systematically and in depth, Data Ape interviewed a number of industry experts, including Liu Wanggen, co-founder of Transwarp Technology; Liu Qi, founder and CEO of PingCAP; Hu Jun, deputy general manager of the Dameng Data Technology Service Center; Cui Zhiwei, general manager of the GBase 8s product line at GBASE (Nanda General); and Yang Shengwen, chief scientist of Kuke Data, to explore together the value and future of the database.

The Evolution of the Database

To understand the development direction of any field, you first need to explore its historical trajectory, and databases are no exception. Every leap in database technology is a response to changes in requirements and technical challenges in the past. Therefore, only by deeply understanding the development history of databases can we have a clearer insight into its future development trends, thereby exploring new innovation paths and leading database technology to new heights.

Looking back, in the 1970s the British computer scientist Edgar Codd, then working at IBM, was troubled by the inefficiency of data storage and retrieval. In a flash of inspiration he proposed the relational model, and the door to the development of relational databases swung open.

Oracle Corporation came into being on the foundation of the relational model; its relational database software took the world by storm and sparked a database revolution. Around the same time, another pioneer, Professor Michael Stonebraker, created PostgreSQL, a database with many advanced capabilities, such as GIS data storage, that further enriched the relational field.

At the beginning of the 21st century, the rapid growth of the Internet brought new challenges to databases. Google's engineers found traditional relational databases inadequate for large-scale data processing, so they proposed a distributed database technology, Bigtable, which stores data across many machines and improves the efficiency of processing data at scale.

At the same time, AWS has developed the cloud database service Amazon RDS, which makes the database more flexible in the cloud and reduces the operation and maintenance costs of enterprises. Subsequently, technology giants such as Microsoft, Google, and Alibaba followed suit and launched their own cloud database products.

In the development of these years, various new technologies have emerged in the database field. Many companies and people played a key role in this process, and their stories became a witness to the development of the database field. From relational database to NoSQL, and then to NewSQL, database technology is constantly evolving.

Analyzing two key questions about the database

Standing on the observation tower of history, we can see how database technology conforms to the trend of the times, gradually forming a glorious context from the past to the present. On this road, every leap in technological innovation and application represents the inheritance and transcendence of history. Now, we are standing at a new starting point, and the development of the database has entered a new stage.

At this fork in history, we need to find reasonable answers to some key questions. Two questions in the database field urgently need answering: Will distributed databases eventually dominate, or will they coexist with centralized databases for the long term? And as database types multiply, will the field end up with a dedicated database for every dedicated workload, or move toward multi-model integration?

In order to answer these two questions, Data Ape interviewed several experts in the database field.

1. Distributed vs. centralized databases

With the development of the Internet and the emergence of big data, distributed databases have gradually become an important direction. Yet the industry is divided on one question: will distributed databases eventually replace centralized databases entirely, or will the two coexist in the long run?

Cui Zhiwei, general manager of the GBase 8s product line at GBASE (Nanda General), told Data Ape, “I don’t think distributed databases will eventually completely replace centralized databases. The reasons are roughly as follows:

① Distributed and centralized databases each arose from specific business scenarios. Alongside the Internet's massive-data scenarios there are the smaller workloads of small and medium-sized banks, and neither kind of scenario shows any sign of disappearing or merging;

② Another driver of distributed databases was the limited performance of domestic hardware, which distributed systems work around through division of labor and cooperation. But domestic hardware is also improving rapidly and will gradually make up the performance shortfall;

③ In particular business scenarios, centralized databases retain the advantages of flexibility, speed, simplicity, and strong consistency, while distributed databases still struggle with cross-node joins;

I think databases will enter a stage where a hundred flowers bloom: different business scenarios will use different database products, and the various databases will work together to solve customers' problems. For example, before non-relational databases appeared, text, video, and the like were stored in the large-object types of relational databases, whereas now there are dedicated document databases; likewise, text-search optimization once had to be handled in the database, whereas much of it is now done in Elasticsearch.”

Regarding this question, Hu Jun, deputy general manager of the Dameng Data Technology Service Center, believes that "distributed databases will be one of the important directions, but they have their specific applicable scenarios, and in many fields the more general centralized-architecture database may still be the choice. We therefore see centralized and distributed as two tracks; although the two tracks may compete to some extent, in principle they complement each other.

Distributed and centralized are not two completely opposed technical routes, and neither is simply better than the other; different business scenarios call for different matching architectures. When choosing a distributed database, customers should weigh factors such as the business model, technology-stack selection, operation and maintenance costs, and the supplier ecosystem. Moreover, a distributed database is heavyweight technology with a relatively high barrier to adoption, which users should also take into account.

In general, we believe customers should not fixate on the type of database when selecting one, but let actual needs and pain points guide the choice, looking for products that meet their real requirements and technical indicators. For example, regardless of whether a customer labels the database that delivers the required functions as distributed or centralized, during project implementation the customer can combine different configurations of the general-purpose Dameng database with different cluster software to build centralized, distributed, or mixed database instances according to those needs and indicators."

Liu Wanggen, co-founder of Transwarp Technology, believes that "distributed and centralized databases will coexist for a long time in different scenarios, but in the end distributed databases will completely 'replace' centralized ones.

Distributed databases have two defining traits: they scale horizontally to provide larger storage and higher performance, and they provide high availability to keep data and systems safe. Even with a centralized database, users in production deploy at least dual machines for high availability and disaster recovery. And because applications and businesses keep diversifying, users' data volumes keep growing; users are no longer satisfied with merely storing data but want to analyze it in every way and extract value from it. So in both storage and computing, the demands placed on database systems keep rising, and distributed systems meet those demands well.

From the perspective of actual system iteration, a user's existing server hardware and database software also have a life cycle, and faced with the expansion limits of centralized systems, plus the localization requirements of some industries, users must consider replacement. When replacing, will they keep the old technology or adopt the new distributed technology? I believe users will gradually adopt the new: as noted above, distributed systems scale horizontally, so expansion is no longer difficult, and they supply the larger storage and greater computing power that more business scenarios demand."

2. Dedicated vs. multi-model databases

With the development of databases, especially non-relational databases, a large number of databases for specific application scenarios have emerged, typically including:

Real-time database: breakthroughs in core technologies such as in-memory storage, event-driven architecture, and stream processing have brought important progress in low latency and high throughput. This lets real-time databases respond to and process live data quickly, with applications in finance, the Internet of Things, and gaming.

Time-series database: mainly used to store and query time-series data. Core technical breakthroughs include data compression, efficient indexing, and time-window queries; application scenarios include the Internet of Things, monitoring systems, and the financial industry.
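
The time-window query just mentioned is easy to picture in code. Below is a minimal sketch, with hypothetical sensor readings, of the operation a time-series engine accelerates with time-ordered indexes and compression: aggregating every point that falls inside a window.

```python
from datetime import datetime, timedelta

def window_avg(points, start, end):
    """Average the values of all (timestamp, value) points in [start, end)."""
    vals = [v for t, v in points if start <= t < end]
    return sum(vals) / len(vals) if vals else None

# Hypothetical sensor readings: (timestamp, value) pairs.
readings = [
    (datetime(2024, 1, 1, 10, 0), 20.0),
    (datetime(2024, 1, 1, 10, 5), 22.0),
    (datetime(2024, 1, 1, 10, 10), 24.0),
    (datetime(2024, 1, 1, 11, 0), 30.0),
]

# Average over the one-hour window starting at 10:00.
avg = window_avg(readings, datetime(2024, 1, 1, 10, 0),
                 datetime(2024, 1, 1, 10, 0) + timedelta(hours=1))
```

A real time-series database answers this kind of query from an index over time rather than a linear scan, which is what makes it fast at scale.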

Graph database: stores and queries data as a graph structure, suited to processing complex relationship networks. Core technical breakthroughs include graph-traversal algorithms, subgraph matching, and graph analytics; application scenarios include social networks, knowledge graphs, and recommendation systems.
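
To make graph traversal concrete, here is a minimal sketch in plain Python over a hypothetical social graph; a graph database executes this kind of neighborhood expansion natively instead of through chains of relational joins.

```python
from collections import deque

def bfs_neighbors(graph, start, depth):
    """Collect all nodes reachable from `start` within `depth` hops --
    the traversal behind friend-of-friend queries in a graph database."""
    seen = {start}
    frontier = deque([(start, 0)])
    result = set()
    while frontier:
        node, d = frontier.popleft()
        if d == depth:
            continue  # do not expand past the hop limit
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                result.add(nxt)
                frontier.append((nxt, d + 1))
    return result

# Hypothetical social graph as an adjacency list.
graph = {"alice": ["bob", "carol"], "bob": ["dave"], "carol": [], "dave": ["erin"]}
```

For example, `bfs_neighbors(graph, "alice", 2)` returns Alice's friends and friends-of-friends.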

Columnar database: uses the column as its storage unit, optimizing query performance over large volumes of columnar data. Core technical breakthroughs include column storage, vectorized execution, and data compression; application scenarios include data warehouses, big-data analytics, and reporting systems.
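
The difference between the two layouts fits in a few lines. This is an illustrative sketch with made-up records: the aggregate over the row layout must touch every record, while the column layout reads only the single array it needs.

```python
# Row layout: each record stored together; scanning one field reads every record.
rows = [
    {"id": 1, "region": "east", "sales": 100},
    {"id": 2, "region": "west", "sales": 250},
    {"id": 3, "region": "east", "sales": 175},
]
row_total = sum(r["sales"] for r in rows)

# Column layout: each field stored contiguously, so an aggregate touches only
# the column it needs -- the access pattern a columnar database optimizes,
# and the shape that makes vectorized execution and compression effective.
columns = {
    "id": [1, 2, 3],
    "region": ["east", "west", "east"],
    "sales": [100, 250, 175],
}
col_total = sum(columns["sales"])

assert row_total == col_total  # same answer, very different I/O profile
```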

Serverless database: abstracts the database service away from the underlying infrastructure so that users need not worry about servers or operations. Core technical breakthroughs include elastic scaling, pay-as-you-go billing, and automated operation and maintenance; application scenarios include cloud-native applications and rapid prototyping.

GPU-accelerated database: exploits the GPU's parallel computing power to accelerate query and analysis. Core technical breakthroughs include GPU computing, data-parallel processing, and memory optimization; application scenarios include real-time data analysis and artificial intelligence.

Vector database: the explosion of large models has also propelled vector databases. Liu Wanggen pointed out that both general-purpose models and fine-tuned industry models have limitations, including data freshness, long-context (token) limits, and hallucination. The data used to train large models includes all kinds of unstructured data, such as documents, images, audio, and video. Users can convert this data into high-dimensional vectors through representation-learning preprocessing and store them in a vector database, which addresses those problems well.
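
The retrieval step described above boils down to nearest-neighbor search over embeddings. Below is a minimal, illustrative sketch using cosine similarity over tiny made-up vectors; production vector databases use approximate-nearest-neighbor indexes over vectors with hundreds of dimensions.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def nearest(store, query, k=1):
    """Return the ids of the k stored vectors most similar to the query --
    the lookup a vector database performs for retrieval-augmented generation."""
    ranked = sorted(store, key=lambda item: cosine(item[1], query), reverse=True)
    return [doc_id for doc_id, _ in ranked[:k]]

# Hypothetical document embeddings (real embeddings have hundreds of dimensions).
store = [
    ("doc_cats", [0.9, 0.1, 0.0]),
    ("doc_dogs", [0.8, 0.3, 0.1]),
    ("doc_cars", [0.0, 0.2, 0.9]),
]
```

A query embedding close to `doc_cats` retrieves that document first, which is exactly the grounding step that mitigates hallucination in large-model applications.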

Faced with ever more kinds of databases, one cannot help asking: will every type of business eventually run on its own dedicated database, or will these databases converge until a general-purpose database meets diverse data needs? Data Ape put this question to industry experts as well.

Hu Jun, deputy general manager of the Dameng Data Technology Service Center, believes, "For now dedicated databases are indeed a trend, because how well a particular database works varies greatly across scenarios. Whether they will converge looks like a question about the direction of technology, but in essence it is a question about the demand side: cost reduction and efficiency, data security, and the database's role in supporting new technologies. From that perspective, whether convergence happens still depends on changes in demand, and technology and products must keep evolving around demand."

Liu Wanggen, co-founder of Transwarp Technology, believes, "Dedicated databases will remain dedicated, yet they will gradually move toward integration. But by integration I do not mean the formation of a general-purpose database; I mean the integration of multiple models, that is, the multi-model database.

To analyze the relationships in massive data, you need a graph database: storing and analyzing the data through a graph model delivers higher performance and more intuitive analysis. Meanwhile, the data generated continuously in industrial Internet of Things scenarios cannot be handled by traditional relational methods, or only with excessive storage cost and poor analysis efficiency; here a dedicated time-series database is needed to provide high-performance real-time writes, complex analysis, and high compression ratios that cut storage costs. As another example, today's popular large models require a dedicated vector database.

But these scenario-specific databases bring problems of their own. Each independent system must be maintained separately, so operation and maintenance costs are very high. Interface standards are inconsistent, so users must learn and adapt to different interfaces, which drives up development costs. Each product also runs its own compute engine and storage, and data locked in these separate ecosystems cannot easily interoperate: moving data from one product to another means exporting and importing, the ETL transfer is inefficient, and the accuracy, consistency, and timeliness of the data are hard to guarantee. Data often becomes inconsistent in transit, which ultimately harms business accuracy.

Multi-model databases solve this problem well: a unified platform handles a variety of data models and exposes a unified interface. Transwarp has unified not only the interface but also the computing, storage-management, and resource-management frameworks. Users maintain just one system; data of multiple models is stored and managed uniformly; and a single SQL statement can operate on and query different data models, convert and move data between models, and run cross-model association analysis. Combining data of different models this way brings low complexity, low development cost, low operation and maintenance cost, and high processing efficiency."

Finding the future direction of development

It should be noted that although the database has decades of history behind it, that does not mean it has entered its twilight years. On the contrary, the database is in its prime, developing rapidly, with new technologies and new ideas emerging one after another.

So if the database is to develop further, what are the core directions for future breakthroughs?

Hu Jun, deputy general manager of the Dameng Data Technology Service Center, told Data Ape, "Database technology is developing rapidly, and many technical approaches deserve attention, such as HTAP, cloud technology, artificial intelligence, and new hardware. At this stage Dameng is focusing on distributed database, HTAP, and cloud database technologies, trends that will land within the next year or two. Trends such as AI for DB and the multi-model database will still take some time."

In Data Ape's view, the database can achieve further development by breaking through in the following directions:

Lake-warehouse integration

Lake-warehouse integration (the lakehouse) means merging data-lake and data-warehouse technologies into one system for data management, processing, and analysis. By combining the flexibility of the data lake with the structured management of the data warehouse, it resolves the tension between the rigidity of traditional warehouses and the looseness of lakes. Core technologies include metadata management, data integration, and data conversion; the main current challenges are data consistency, performance, and security.

Storage-compute separation

Storage-compute separation means decoupling data storage from computation to improve the efficiency of data processing and analysis. By keeping data in a distributed storage system and running processing and analysis on separate compute engines, it relieves the resource shortages and performance bottlenecks of traditional data warehouses. Core technologies include distributed storage and compute engines; the main current challenges are data security, data consistency, and compute-task scheduling.
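
The architecture can be caricatured in a few lines of Python. In this illustrative sketch (all class names are made up), the storage layer is a shared service and the compute workers are stateless, so workers can be added or removed elastically without moving any data:

```python
class ObjectStore:
    """Stands in for shared remote storage (e.g. a distributed object store)."""
    def __init__(self):
        self._blobs = {}

    def put(self, key, rows):
        self._blobs[key] = list(rows)

    def get(self, key):
        return list(self._blobs[key])

class ComputeWorker:
    """A stateless query engine: holds no data, reads from shared storage."""
    def __init__(self, store):
        self.store = store

    def total(self, key):
        return sum(self.store.get(key))

store = ObjectStore()
store.put("sales", [100, 250, 175])

# Two workers (think: elastic scale-out) answer from the same storage layer;
# tearing one down loses no data, because state lives only in the store.
w1, w2 = ComputeWorker(store), ComputeWorker(store)
assert w1.total("sales") == w2.total("sales")
```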

Stream-batch integration

Stream-batch integration combines stream processing with batch processing to unify real-time and offline data analysis. By joining the strengths of the two, it addresses the poor real-time performance and low batch efficiency of traditional data warehouses. Core technologies include real-time data processing and batch engines; the main current challenges are data consistency, computing performance, and data security.
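
The core promise of stream-batch integration is that one piece of business logic serves both paths. A minimal sketch with made-up numbers: the same accumulator function is folded over a bounded batch and also fed events one at a time, so the real-time and offline answers cannot drift apart.

```python
def add(acc, value):
    """One shared piece of business logic: a running (count, sum) accumulator."""
    count, total = acc
    return count + 1, total + value

# Batch mode: fold the function over a complete, bounded dataset.
batch = [3, 5, 7, 9]
acc = (0, 0)
for v in batch:
    acc = add(acc, v)
batch_result = acc

# Stream mode: the same function consumes events as they arrive,
# emitting an updated result after every event.
def run_stream(events):
    state = (0, 0)
    for v in events:
        state = add(state, v)
        yield state

stream_result = list(run_stream(batch))[-1]
assert batch_result == stream_result  # identical logic, identical answer
```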

Transaction and Analytics Convergence

Transaction-analytics convergence combines transaction processing with data analysis so that real-time analysis and decision support happen during data processing itself, ending the separation of analysis from transactions in traditional architectures. Core technologies include the transaction-processing engine and real-time data analysis; the main current challenges are performance and data security.

AI, especially the fusion of large models with databases

As Liu Qi, founder and CEO of PingCAP, said, "This time AI really is going to reshape the entire software industry. Its main impact on the industry has two aspects: one is code, the other is data.

In just seven months, AI has taken on half of humanity's code-writing work: over that period, more than 46% of new code on GitHub was generated by AI. Measured by software-development efficiency, AI is already doing nearly half the work. On the data side, users no longer need to write SQL; they simply describe in natural language what data they want and what analysis they intend, and charts can be generated automatically.

The capabilities AI brings have lowered the threshold of data consumption dramatically, which also poses huge challenges to databases. In the AI era we hope to provide a database that is 'available to everyone, with an open ecosystem.' On that basis, we believe the database of the future should at least be a cloud-native architecture with lower cost, elastic scaling, and large-scale data-integration capability. In a word, data-architecture modernization is a global trend."

Hu Jun, deputy general manager of the Dameng Data Technology Service Center, pointed out, "Large models can only be built, trained, and applied for inference on top of massive data. As the core software for storing and managing data, the database system occupies an important position in this technical architecture. How to better support the data types of large models, raise massive-data processing efficiency, secure model data, adapt to the ecosystem, and cut the cost of massive-data storage poses challenges to database systems, but it also brings opportunities. As an important carrier of traditional data and big data, today's database systems are already involved and applied in the large-model field, and the rapid development of large models will in turn drive the rapid development of databases."

Yang Shengwen, chief scientist of Kuke Data, told Data Ape, "Training large models relies on massive data and powerful computing, and inference with hundred-billion-parameter models also places high demands on computing resources. Most database products currently on the market cannot support large-model training and inference well.

Thanks to storage-compute separation, dual computing engines (an MPP engine and an ML engine), and a cloud-native architecture, HashData offers great value for the training, fine-tuning, inference, and application of large models. First, HashData can efficiently store and manage the original massive data, using a powerful data-processing engine to analyze, clean, and transform it into high-quality training data. Second, with HashData's ML engine, large models can be fine-tuned efficiently on enterprise data, or even trained from scratch. Third, HashData's built-in vector-database capability greatly simplifies building intelligent applications based on knowledge-enhanced large models. HashData has also developed HashML, a data-science toolbox for data scientists, data engineers, and application developers that eases the whole pipeline from data processing and model fine-tuning to intelligent-application development, greatly lowering the threshold for applying AI."

It should be noted that these technical directions are not independent but closely related. Liu Wanggen, co-founder of Transwarp Technology, believes database technology is showing a trend toward integration in several senses: lake-warehouse integration, multi-model processing integration, transaction-analytics integration, and so on. In the past everyone ran a hybrid architecture of a Hadoop lake plus an MPP warehouse, a product of historical development and technical limits. As lakehouse technology matures, the integration can now be achieved at the technical level; for example, when replacing traditional data warehouses such as Teradata, many users choose to upgrade to a lakehouse architecture.

As for multi-model integration: different database types were adopted to meet specific scenarios, but developing and operating all these separate systems causes users great trouble, so the field needs to move toward integration, namely the multi-model database. Similarly, OLAP and OLTP were originally one; as transaction and analysis workloads grew they split apart, and now, as database technology advances, they are gradually unifying again. In short, the database is moving toward integration, making data processing more intelligent and more accessible, and in turn cutting the cost and raising the efficiency of data processing.

In the complicated technological progress, we have glimpsed the clues of the future, and also explored the blueprint for the development of the database. Just as a new day is coming in the dawn of the morning, the database is also entering a new chapter in the interweaving of history and innovation.

In front of us, cloud computing, big data, artificial intelligence, Internet of Things, blockchain, 5G, and other unknown technological trends are coming like a tide, constantly shaping the new form of the database. And the database, like a ship that is not afraid of wind and waves, takes us forward bravely and braves the waves. Every voyage opens the door to the future. Every exploration is not only a technological innovation, but more importantly, it will become a new tool for us to understand and change the world, and a new way for us to explore the unknown and create the future.

How will databases change the digital world tomorrow? How would we change the database? This is a question full of suspense, but also an answer worth looking forward to.

Text: Misty Rain  /  Data Ape



Origin blog.csdn.net/YMPzUELX3AIAp7Q/article/details/131970908