The reason behind the rise of cloud-native distributed databases and data warehouses

Introduction: What changes are databases facing, and what unique advantages do cloud-native databases and data warehouses offer? At the recent DTCC 2020 conference, Li Feifei, Vice President of Alibaba Group, President of the Alibaba Cloud Database Product Division, and ACM Distinguished Scientist, gave a talk titled "Cloud-Native Distributed Databases and Data Warehouses Light the Road to the Data Cloud".

In the cloud computing era, cloud-native distributed databases and data warehouses are on the rise, offering capabilities such as elastic scaling, high availability, and distributed processing.



Li Feifei, Vice President of Alibaba Group, President of the Alibaba Cloud Database Product Division, ACM Distinguished Scientist

The following is an edited summary of the key points of the talk:

1. Background and Trends

1. Background

The essence of a database is managing data across its entire lifecycle: production, processing, storage, and consumption. In today's data era, data is one of the core assets of every enterprise, so the value of databases keeps growing, and new value keeps being discovered in new areas.

2. Industry trends

Trend 1: Data production/processing is undergoing qualitative changes.
Keywords: explosive growth in scale, real-time and intelligent production/processing, accelerated data migration to the cloud

The following conclusions can be drawn from analyses by Gartner, IDC, and various traditional vendors:

  • Data is growing explosively, and the proportion of unstructured data keeps increasing;
  • Demand for real-time and intelligent production/processing is rising, along with the pursuit of online-offline integration;
  • Database systems, big data systems, and data management/analytics systems show a clear trend of moving to the cloud, and the acceleration of data migration to the cloud is unstoppable.

Trend 2: Cloud computing accelerates the evolution of database systems.
Keywords: commercial origins, open source, analytics, heterogeneous NoSQL, cloud-native; integrated distributed, multi-model, HTAP

A snapshot of the evolution of databases and data warehouses from the 1980s to the present

Cloud computing faces two major challenges

Challenge 1: Combining distribution with ACID
In traditional big data processing, sacrificing ACID to gain distributed horizontal scalability works well and meets the needs of many scenarios, but application demand for ACID has never gone away. In distributed parallel computing scenarios, that demand has only grown stronger.

Challenge 2: How to use resources
The traditional von Neumann architecture tightly couples computing and storage. Multiple servers can be connected into one system through distributed protocols and processing methods, but between servers and between nodes, coordinating distributed transactions and optimizing distributed queries, especially while guaranteeing strong consistency and full ACID semantics, poses many challenges.

Global cloud database market structure
Keywords: resource pooling, resource decoupling


The essence of the cloud is using virtualization technology to pool and decouple resources. Alibaba Cloud, one of the core cloud vendors, has built a cloud-native database product system on this foundation. Representing China's database vendors, and against the backdrop of Gartner merging OPDBMS (operational DBMS, i.e. transaction processing) and DMSA (data management solutions for analytics) into a single Cloud DBMS market, Alibaba Cloud entered Gartner's Cloud DBMS Leaders quadrant for the first time, ranking third in the world by market share and leading the industry in China.

Database system architecture evolution
Keywords: single node, shared state, distributed

The comparison starts from tightly coupled storage and compute: DB denotes a compute node whose CPU cores and memory remain tightly coupled. On one side is the single-node architecture with tightly coupled resources. On the other is the distributed Shared Nothing architecture, which connects multiple nodes into one system; in theory it scales out very well and uses distributed protocols for distributed transaction processing and query processing, but in distributed scenarios it also runs into many challenges, such as distributed transactions and distributed queries.

Whether in the traditional form of middleware-based sharding or as an enterprise-grade transparent distributed database, one challenge always arises: once the architecture is distributed, data can only be sharded and partitioned along one logic, and business logic never aligns perfectly with sharding logic, so cross-database transactions and cross-shard processing are inevitable. Whenever ACID requirements are high, the distributed architecture puts heavy pressure on system performance; for example, at a high isolation level, once distributed commits exceed roughly 5% of all transactions, TPS drops significantly.

Perfect partitioning and sharding do not exist. These are the core challenges a distributed system must solve, together with the strong consistency guarantees that such an architecture must still deliver.
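To make the cost of distributed commits concrete, here is a back-of-envelope model (not a figure from the talk): it assumes a cross-shard commit costs several times a local commit and shows how average throughput falls as the share of distributed transactions grows. The base TPS and the 5x cost factor are illustrative assumptions.

```python
# Back-of-envelope model: how a growing share of distributed (2PC) commits
# erodes overall TPS. All numbers here are illustrative assumptions.

def effective_tps(base_tps: float, distributed_ratio: float, twopc_cost_factor: float) -> float:
    """Average throughput when `distributed_ratio` of transactions cost
    `twopc_cost_factor` times as much as a local commit."""
    avg_cost = (1 - distributed_ratio) * 1.0 + distributed_ratio * twopc_cost_factor
    return base_tps / avg_cost

if __name__ == "__main__":
    base_tps = 100_000        # throughput when every commit is local (assumed)
    twopc_cost_factor = 5.0   # a cross-shard commit is ~5x a local one (assumed)
    for ratio in (0.0, 0.01, 0.05, 0.10, 0.25):
        print(f"distributed commits {ratio:5.1%} -> ~{effective_tps(base_tps, ratio, twopc_cost_factor):,.0f} TPS")
```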

The cloud-native architecture is, in essence, distributed shared storage at the bottom and a distributed shared compute pool on top, with compute and storage decoupled in between. It provides great elasticity and high availability, deploys distributed technology in a centralized way that is transparent to the application, and avoids many of the challenges of the traditional architecture, such as distributed transaction processing and how to partition and shard distributed data.


With shared storage, shared resource pools, and shared compute pools alone, horizontal scalability still has certain limits. This problem can be solved by combining distributed and cloud-native architectures.

In this combined architecture, the capabilities of Shared Nothing and Shared Storage/Shared Everything are connected. Under each shard sits a highly capable, highly elastic resource pool, and these shards are then linked with distributed technology, so the system enjoys the benefits of distributed horizontal scaling while avoiding a large number of distributed transactions and distributed processing scenarios. Because a single node has particularly strong compute and storage capacity, consider 200 TB of data: under a traditional distributed architecture where one node can handle only 1 TB, 200 nodes are needed; under the cloud-native architecture, where one node can handle on the order of 100 TB, only two or three nodes are needed. The probability of distributed transaction processing and distributed queries drops sharply, and the efficiency of the entire system improves greatly.
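The arithmetic above can be checked with a tiny calculation. The sketch below, a simplified uniform model rather than a measurement, computes the number of shards needed for a given dataset and the chance that a transaction touching two random rows has to cross shards.

```python
import math

def shards_needed(total_tb: float, tb_per_node: float) -> int:
    """Shards required for a dataset, given per-node capacity."""
    return math.ceil(total_tb / tb_per_node)

def cross_shard_probability(num_shards: int) -> float:
    """Chance that a transaction touching two uniformly random rows
    lands on two different shards (simplified uniform model)."""
    return 1.0 - 1.0 / num_shards if num_shards > 1 else 0.0

total_tb = 200
for label, tb_per_node in (("traditional shared-nothing node", 1), ("cloud-native shared-storage node", 100)):
    n = shards_needed(total_tb, tb_per_node)
    print(f"{label}: {n} shards, ~{cross_shard_probability(n):.1%} of two-row transactions cross shards")
```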

Trend 3: Key technologies for next-generation enterprise-grade databases
Keywords: HTAP and big data-database integration, cloud-native + distributed, intelligence, Multi-Model, software-hardware co-design, security and trustworthiness


The trend of big data and database convergence includes online-offline integration, the integration of transaction and analytical processing (HTAP), and the integration of offline computing with online interactive analysis, collectively referred to as the integration of big data and databases.

Other key directions include the deep integration of cloud-native and distributed technologies; bringing intelligence, machine learning, and AI into databases to simplify operation, maintenance, and use; handling unstructured data such as text alongside structured data; software-hardware co-design, combining hardware capabilities such as RDMA and NVM to exploit the hardware fully; and, finally, the security and trustworthiness of the system.

2. Core technology & product introduction

2.1 Enterprise-level cloud-native distributed database

1) PolarDB, a cloud-native relational database

The core product of Alibaba Cloud's self-developed relational database line is the cloud-native relational database PolarDB. The idea behind PolarDB is as follows: storage and compute are separated, and RAFT is used to ensure high availability and high reliability, with compute organized as a pool. The next-generation version of PolarDB supports multi-master (multiple writers and readers), and its compute nodes will be further decoupled into a shared memory pool, so that CPU cores form a shared compute pool accessing shared memory, while PolarProxy handles read-write splitting and load balancing.
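To make the role of the proxy layer concrete, here is a minimal, hypothetical sketch of read-write splitting: write statements go to the primary endpoint, reads are balanced across read-only nodes. The endpoint names and the keyword-based routing rule are assumptions for illustration, not PolarProxy's actual implementation.

```python
import itertools

class ReadWriteSplittingProxy:
    """Toy proxy: route writes to the primary, round-robin reads across replicas."""

    WRITE_PREFIXES = ("INSERT", "UPDATE", "DELETE", "REPLACE", "CREATE", "ALTER", "DROP")

    def __init__(self, primary: str, read_replicas: list[str]):
        self.primary = primary
        self._replicas = itertools.cycle(read_replicas)

    def route(self, sql: str) -> str:
        """Return the endpoint that should execute this statement."""
        if sql.lstrip().upper().startswith(self.WRITE_PREFIXES):
            return self.primary
        return next(self._replicas)

proxy = ReadWriteSplittingProxy("primary:3306", ["replica-1:3306", "replica-2:3306"])
print(proxy.route("SELECT * FROM orders WHERE id = 1"))   # routed to a read replica
print(proxy.route("UPDATE orders SET status = 'paid'"))   # routed to the primary
```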

This architecture gave rise to PolarDB, which is 100% compatible with MySQL/PostgreSQL and highly compatible with Oracle. Many performance optimizations target both open-source and commercial database usage scenarios; for example, parallel query processing delivers excellent performance. Compared with traditional databases, overall TCO can be as low as 1/3 to 1/6, and TPC-C performance under the same load is greatly improved.

Global Database is built on top of PolarDB; its cross-region architecture meets the need of many overseas customers to read and write from the nearest region.

2) PolarDB-X, a cloud-native distributed database

PolarDB-X is the distributed version: X-DB and the original sharding middleware DRDS were combined into a transparent, integrated distributed database. Each distributed node group includes two data nodes and one log node, with an optimized Paxos protocol ensuring data consistency between data nodes and log nodes.

Its distinguishing feature is that the three nodes can be deployed across availability zones (AZs) to achieve same-city disaster recovery directly at the storage layer, without the data-synchronization links that traditional commercial databases rely on for disaster recovery.
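As a quick sanity check on why three replicas spread across availability zones give same-city disaster recovery, the snippet below counts surviving replicas after one AZ fails and tests whether a majority quorum remains. The AZ names and placement are hypothetical; the floor(n/2)+1 quorum rule is the standard majority rule used by Paxos-style protocols.

```python
# Majority-quorum check for a three-replica group spread across availability zones.
replicas = {"node-1": "AZ-A", "node-2": "AZ-B", "node-3": "AZ-C"}  # hypothetical placement
quorum = len(replicas) // 2 + 1  # 2 out of 3

def survives_az_failure(failed_az: str) -> bool:
    alive = [node for node, az in replicas.items() if az != failed_az]
    return len(alive) >= quorum

for az in ("AZ-A", "AZ-B", "AZ-C"):
    print(f"lose {az}: quorum kept = {survives_az_failure(az)}")  # True for every single-AZ failure
```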

For two-region, three-data-center and even more elaborate geo-disaster-recovery architectures, deploying directly across remote regions introduces large network latency that can hurt performance, so data synchronization through products such as ADG and DTS is still needed to build the geo-disaster-recovery architecture.

3) ADAM: database and application migration and transformation

ADAM (Advanced Database & Application Migration) analyzes application code and the logic tree to automatically generate an assessment report, which guides migration from traditional databases to PolarDB and ADB.

The one-click migration solution uses ADAM to scan application code and DTS to synchronize data in real time while migrating to the cloud-native database, allowing the customer's application to move over with little or no rework.


In summary, distribution is just a technology; many database applications do not actually need it, because cloud-native capabilities already satisfy their requirements for elasticity, high availability, and horizontal scaling. When distributed capability is genuinely needed, it can be added by combining the Shared Nothing architecture, so systems and application migration plans should be designed from the customer's perspective, according to application requirements.

2.2 Cloud Native Data Warehouse and Data Lake


Integrated design becomes the core concept of the next generation data analysis system

The database market is not just TP relational databases. That is why Gartner merged the traditional OPDBMS (transaction processing) and DMSA (management and analytics) categories into a single Cloud DBMS market and asserted that a modern DBMS can do both. Beyond transaction processing, a database system also needs to integrate data processing with computation and analysis, for example by playing a role in data warehouses and data lakes.


Cloud-native data warehouse + cloud-native data lake: building a new generation of data storage and processing solutions

The data analysis field is currently fragmented: online querying, offline computing, and many other sub-fields coexist. Applying cloud-native resource pooling and resource decoupling to them yields the next generation of cloud-native data systems. A next-generation cloud-native data warehouse should offer real-time, online insert, delete, update, and query capabilities, and on that basis support online-offline integration, handling both online interactive analysis and complex offline ETL and computation. Multi-dimensional data analysis on this basis is the core requirement for cloud-native data warehouses.

Insert, delete, update, and query in a data warehouse differ somewhat from those in a transactional database: the isolation requirements are lower, and snapshot isolation, for instance, is not required because it is an analytical system. But the warehouse must still support the online insert, delete, update, and query capabilities of a traditional database, not just batch insertion scenarios.

1) Cloud native data warehouse

A data warehouse suits standardized, structured data processing and standardized data management and applications, where business logic is already clear and stable and requires standardized management.


Cloud-native data warehouses use the cloud-native architecture to upgrade and transform traditional data warehouses. Resource pooling and resource decoupling deliver elasticity, high availability, horizontal scaling, and intelligent operation and maintenance, which are at the core of what cloud-native means.

Putting these together: low-cost object storage (OSS on Alibaba Cloud, S3 on Amazon) serves as the cold storage pool, while high-performance cloud disks act as a local cache to accelerate the decoupled compute nodes; everything is connected into a pool over a high-speed network and exposed to the application as a unified, transparent service.
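A minimal sketch of the "object storage as cold tier, local cloud disk as cache" idea follows. It assumes a generic object-store client exposing get(key); it is a conceptual illustration, not AnalyticDB's cache code.

```python
from collections import OrderedDict

class ReadThroughCache:
    """Toy read-through cache: serve hot blocks from a bounded local cache,
    fall back to slow-but-cheap object storage on a miss."""

    def __init__(self, object_store, capacity_blocks: int = 1024):
        self.object_store = object_store   # assumed to expose .get(key) -> bytes
        self.capacity = capacity_blocks
        self._cache = OrderedDict()        # stand-in for the local ESSD-backed cache

    def read(self, key: str) -> bytes:
        if key in self._cache:             # hit: serve from the fast local tier
            self._cache.move_to_end(key)
            return self._cache[key]
        data = self.object_store.get(key)  # miss: fetch from the OSS/S3-like cold tier
        self._cache[key] = data
        if len(self._cache) > self.capacity:
            self._cache.popitem(last=False)  # evict the least recently used block
        return data
```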

AnalyticDB cloud native data warehouse


At the bottom of this architecture is object storage, with the RAFT protocol ensuring data consistency and ESSD elastic cloud disks accelerating the local cache of each compute node. Above that sits the compute pool. To integrate big data and databases inside the warehouse, the compute nodes also need stronger offline big-data computing power: offline big-data systems are mostly built on BSP + DAG, while the traditional database field uses the MPP architecture, so the two are combined into a hybrid execution engine with a hybrid query and computation optimizer. On top of that sits metadata management, which strives to achieve metadata sharing.
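One way to picture the hybrid execution idea is a planner that sends short interactive queries down an MPP path and heavy ETL jobs down a batch BSP/DAG path. The routing rule and threshold below are purely illustrative assumptions, not AnalyticDB's actual optimizer logic.

```python
from dataclasses import dataclass

@dataclass
class QueryProfile:
    estimated_scan_gb: float   # planner's scan-size estimate (assumed available)
    is_interactive: bool       # submitted from an interactive session?

def choose_engine(profile: QueryProfile, batch_threshold_gb: float = 500.0) -> str:
    """Toy routing rule for a hybrid engine: MPP for low-latency interactive
    analysis, BSP+DAG batch execution for heavy offline ETL."""
    if profile.is_interactive and profile.estimated_scan_gb < batch_threshold_gb:
        return "MPP"
    return "BSP+DAG"

print(choose_engine(QueryProfile(estimated_scan_gb=20, is_interactive=True)))     # -> MPP
print(choose_engine(QueryProfile(estimated_scan_gb=5000, is_interactive=False)))  # -> BSP+DAG
```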

Cloud native data warehouse AnalyticDB MySQL

AnalyticDB (ADB) is a cloud-native data warehouse built on this idea. ADB for MySQL is compatible with the MySQL ecosystem and ranks first on the TPC-DS benchmark for both performance and price-performance, with unified support for interactive analysis and complex offline ETL computation. There is also a PostgreSQL-based version, ADB for PostgreSQL. For traditional data warehouses such as Teradata, it leverages PostgreSQL's compatibility with Oracle to upgrade them to a cloud-native, storage-compute-separated architecture, with many optimizations in the query executor and other modules.

Cloud native data warehouse AnalyticDB for PostgreSQL


Examples include vectorized execution and code generation. ADB for PostgreSQL can also vectorize unstructured data into high-dimensional vectors and then process those vectors together with structured data in a single engine, achieving fused processing of unstructured and structured data. ADB for PostgreSQL took first place on the TPC-H benchmark for performance and price-performance.
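To illustrate what processing vectorized unstructured data alongside structured data in one engine means, here is a small self-contained sketch: rows are filtered by a structured predicate and then ranked by vector similarity. Plain Python lists stand in for the vector column; this is not ADB for PostgreSQL's implementation.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

# Each row mixes structured columns with an embedding of some unstructured content.
rows = [
    {"id": 1, "category": "shoes",  "price": 59.0, "embedding": [0.1, 0.9, 0.2]},
    {"id": 2, "category": "shoes",  "price": 75.0, "embedding": [0.8, 0.1, 0.3]},
    {"id": 3, "category": "shirts", "price": 20.0, "embedding": [0.2, 0.8, 0.1]},
]

query_vec = [0.1, 0.85, 0.25]   # e.g. the embedding of a query image (assumed)

# Structured filter first, then vector-similarity ranking, in one pass.
candidates = [r for r in rows if r["category"] == "shoes" and r["price"] < 100]
best = max(candidates, key=lambda r: cosine(query_vec, r["embedding"]))
print(best["id"])   # -> 1
```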

2) A new generation of data warehouse solutions

On this basis, a new generation of data warehouse architecture was launched: at the bottom is the core cloud-native data warehouse ADB, and on top are data modeling and data asset management, because the data warehouse field is not only about the engine but also about modeling and a series of related problems. For traditional data warehouses, we built an upgrade path to the cloud-native data warehouse, combining ADB, ecosystem partners, and a full set of intelligent tools into an integrated solution.


DLA: cloud-native data lake analytics (serverless, unified metadata + open storage, analysis, and computing)


More complex and diverse data sources and scenarios are the biggest difference between a cloud-native data lake and a data warehouse. The core scenario of the data lake is unified management, computation, and analysis of multi-source heterogeneous data through a single interface. The key is metadata management and discovery, plus the integration of different compute engines to manage and analyze that heterogeneous data.

Data Lake Analytics + OSS Cloud Native Data Lake


The architecture of the cloud-native Data Lake Analytics service looks like this: at the bottom are object storage and other storage sources; on top, Kubernetes and container technology plus serverless execution provide analysis and computation, with isolation and security protection between tenants. This meets customers' needs for low-cost, elastic, and rich computation and analysis over multi-source heterogeneous data.

2.3 Intelligence, security and trustworthiness, and ecosystem tools

1) Cloud native + intelligent database management and control platform


The intelligent management and control platform applies cloud-native and AI technology to database management and operation, including partitioning, index recommendation, anomaly detection, slow SQL management, and parameter tuning, which greatly improves management efficiency. We have developed the Database Autonomy Service (DAS) module to make the database system self-driving and greatly improve the efficiency of operations, maintenance, and control.
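As a toy version of the kind of check a self-driving service might run on query latency, the sketch below flags samples far above the series baseline. The threshold and data are made up for illustration and have nothing to do with DAS internals.

```python
import statistics

def detect_latency_anomalies(latencies_ms, z_threshold: float = 2.5):
    """Flag latency samples more than `z_threshold` standard deviations
    above the mean of the series (a deliberately simple baseline)."""
    mean = statistics.fmean(latencies_ms)
    stdev = statistics.pstdev(latencies_ms)
    if stdev == 0:
        return []
    return [(i, x) for i, x in enumerate(latencies_ms) if (x - mean) / stdev > z_threshold]

samples = [12, 11, 13, 12, 14, 12, 11, 380, 13, 12]   # one obviously slow query
print(detect_latency_anomalies(samples))              # -> [(7, 380)]
```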

2) Encrypted data in the cloud will never leak


Beyond traditional access control and encryption in transit and on disk, we have developed a fully encrypted database to ensure that data stays secure: combined with hardware-based trusted execution environments (TEEs), it keeps data encrypted throughout processing.
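The client-side half of that idea, where the server only ever sees ciphertext, can be sketched in a few lines; the TEE-based computation over encrypted data that the talk describes is deliberately not modeled here, and the key handling is simplified for illustration.

```python
# Minimal illustration of the "server never sees plaintext" half of an
# always-encrypted design. Requires: pip install cryptography
from cryptography.fernet import Fernet

client_key = Fernet.generate_key()   # in a real system this key never leaves the client side
cipher = Fernet(client_key)

plaintext = b"id_card=330106...1234"            # sensitive value the application writes
stored_ciphertext = cipher.encrypt(plaintext)   # what the database actually stores: opaque bytes

print(stored_ciphertext[:20], b"...")           # the server-side view
print(cipher.decrypt(stored_ciphertext))        # only the key holder recovers the plaintext
```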

3) Database ecosystem tools


Besides the migration tool ADAM and the data synchronization tool DTS mentioned above, we provide a rich set of other database ecosystem tools, including the data management service DMS and the database backup service DBS. They offer enterprise-grade data processing capabilities and developer-facing services such as data lineage, data warehouse development and modeling, data security management, data backup and disaster recovery, and CDM.

4) Database backup solution DBS


DBS supports multi-cloud, multi-endpoint backup: it can back up on-premises data to the cloud or cloud data back to on-premises, achieve second-level RPO, handle multiple data sources and backup destinations, and support snapshot recovery.

3. Case analysis

1) Double Eleven Shopping Festival•Database Challenge


During Double Eleven 2020, system load instantly burst to a peak 145 times the normal level. The combination of cloud-native and distributed capabilities smoothly supported the high-concurrency, massive-data challenge of Double Eleven.

2) China Post•Replacement of large traditional commercial databases


China Post previously relied on a traditional commercial data warehouse and has now upgraded to the ADB cloud-native data warehouse, gaining more reliable online-offline integrated computing and analysis and consolidating its nationwide unified data delivery platform into one system.

3) A super-large ministry-level customer


The State Administration of Taxation's unified national tax data system uses the PolarDB-X distributed database together with DTS and ADB to build a complete solution spanning TP to AP: data processing, computation, analysis, and querying, with data development and management handled through DMS. It supports high-concurrency, low-latency complex queries; real-time visibility and efficient storage of massive real-time data; and finance-grade accuracy in calculations.

4) Alibaba Cloud database technology in the fight against COVID-19


The elasticity, high availability, and intelligent operation and maintenance of cloud-native databases, combined with distributed horizontal scaling, give a large number of enterprises and users highly elastic, highly available capabilities. During the epidemic, the online education industry began adopting cloud-native and distributed new-generation database architectures and products at scale, cutting costs and improving efficiency in the fight against COVID-19.

5) Customer case•China Unicom


China Unicom's core cBSS system was migrated off traditional commercial databases, using the distributed database PolarDB-X to handle real-time online transaction processing for this core billing system.

6) Customer case•Malaysian e-commerce giant PrestoMall


PrestoMall, Malaysia's third-largest e-commerce company, faced the high cost of the traditional commercial database Oracle and, in particular, the challenge of instantaneous high concurrency during big promotions. It replaced the traditional commercial database with the cloud-native database PolarDB and achieved a significant drop in TCO.

7) Customer case • Data lake analysis and computing solution for an international advertiser


A leading international advertising company could not unify the processing of multi-source heterogeneous data, such as text, images, and structured data, in a data warehouse, so a data lake was used as the unified analysis engine. Building a new-generation serverless data lake with DLA + OSS greatly improved access, processing, and computation over multi-source heterogeneous data while saving substantial compute cost, and the complex, rich analysis workloads were migrated smoothly from AWS.

4. Summary


Alibaba Cloud's database product family spans OLTP, OLAP, and NoSQL through database ecosystem tools and cloud-native intelligent management and control. With this rich cloud-native database product system, Alibaba Cloud hopes to provide enterprise customers and users with better, more reliable, and more cost-effective products and solutions.

【Related Reading】

[PPT download included] DTCC 2020 | Alibaba Cloud Ye Zhengsheng: Database 2025

[PPT download included] DTCC 2020 | Alibaba Cloud Zhao Diankui: PolarDB's road to smooth Oracle migration

[PPT download included] DTCC 2020 | Alibaba Cloud Zhu Jie: The latest technology trends in NoSQL

[PPT download included] DTCC 2020 | Alibaba Cloud Wang Tao: Alibaba's e-commerce databases on the cloud in practice

[PPT download included] DTCC 2020 | Alibaba Cloud Zhang Xin: Alibaba Cloud's cloud-native multi-active solution

DTCC 2020 | Alibaba Cloud Liang Gaozhong: Workload-based global automatic optimization practice with DAS

[PPT download included] DTCC 2020 | Alibaba Cloud Chengshi: Database management in the cloud-native era

[PPT download included] DTCC 2020 | Alibaba Cloud Ji Jiannan: Key technologies for online analytics entering the Fast Data era

Original link: https://developer.aliyun.com/article/781040?

