[Yunqi 2023] Wang Feng: Technical Interpretation of Open Source Big Data Platform 3.0

This article is compiled from the transcript of a speech given at the 2023 Yunqi Conference. The speech details are as follows:

Speaker: Wang Feng | Alibaba Cloud Researcher; Head of the Open Source Big Data Platform, Alibaba Cloud Computing Platform Division

Topic: Technical Interpretation of Open Source Big Data Platform 3.0

Real-time and serverless are inevitable choices in the open source big data 3.0 era

Alibaba Cloud's open source big data platform was incubated inside Alibaba Group's own businesses. As early as 2009, we began using the open source Hadoop technology stack to serve Alibaba's rapidly growing e-commerce business. Within Alibaba, this Hadoop system was known at the time as Yunti-1. Once it matured, we began migrating it to the cloud and launched the first open source big data product on Alibaba Cloud, E-MapReduce (EMR for short). We define this as the first stage of the open source big data platform, the 1.0 era; from that point on, we truly entered the cloud era.

As big data technology evolved, data processing moved from offline architectures toward real-time, and we began introducing Apache Flink stream computing technology. Alibaba has invested heavily in the Apache Flink community and gradually became its largest user and promoter. Today, Apache Flink has grown into the global standard for stream computing and real-time computing. At the same time, we launched the Realtime Compute for Apache Flink cloud service on Alibaba Cloud.

EMR has also kept evolving technically, upgrading from the traditional Hadoop data warehouse architecture to an architecture centered on the cloud-native data lake. We therefore call these two technology trends, real-time and data lake, the 2.0 stage of the open source big data platform.

Starting this year, we have been thinking about how the open source big data platform should evolve over the next period, and we have made the following technical explorations toward a 3.0 architecture to better serve our customers.

First, we tried to combine real-time analysis technology with the data lake architecture and launched a new-generation Streaming Lakehouse architecture for real-time data warehouse analysis.

Second, as the adoption of serverless architecture continues to deepen, we began to consider what the final form of cloud-native architecture should be. This year, we made all core compute and storage components of the open source big data platform serverless.

Third, we have now fully entered a period of explosive AI growth, and every industry has begun using AI technology to reinvent itself. We began to consider AI integration, hoping to bring new AI technology into the big data platform to achieve big data and AI convergence and to help the platform with intelligent operations and data management.

Starting this year, we adopted a new data analysis architecture, a fully cloud-native architecture, and deep integration with AI to create the new 3.0 architecture. Next, I will pick several core technical features of the 3.0 platform to share with you: what we have done, what results we have achieved, and how we will develop going forward.

A new generation of streaming lakehouse

First, let me introduce the new generation of data analysis architecture: the streaming lakehouse. I believe the vast majority of users are aware of the limitations of the traditional Hadoop/Hive data warehouse architecture and of where the technology is heading, and have begun to evolve traditional Hadoop toward the new-generation Lakehouse analysis architecture.

There are clearly many advantages to upgrading to the new Lakehouse data analysis architecture. For example, the Lakehouse fully separates storage from compute, giving better scalability and elasticity. At the same time, the new data lake formats bring better real-time support and improved query performance. The benefits of the Lakehouse architecture are obvious.

But is the Lakehouse architecture perfect? I don't think it is there yet. We can see that the Lakehouse architecture still has room to develop in the real-time direction. This is also the pain point many open source users hit when adopting the Lakehouse: once data has been migrated to it, how do you make the processing pipeline more real-time, and how do you analyze the data in the Lakehouse in real time, the way a traditional data warehouse can?

The current lakehouse cannot achieve fully real-time, or even quasi-real-time, results. The reason is that the data lake storage formats limit how real-time it can become. As you can see, today's data lake storage landscape is dominated by the "three musketeers": Iceberg, Delta Lake, and Hudi, and different users and vendors choose different lake formats. However, Iceberg and Delta are lake formats designed for batch processing and work best with batch computing engines. On the Lakehouse they implement batch processing, or at best fairly capable micro-batch processing, with updates applied through merge operations. That architecture cannot be fully real-time, or at least cannot reach fine time granularity: even minute-level, or ten-minute-level, granularity is very difficult.
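To make the cost of merge-style updates concrete, here is a deliberately simplified sketch (the data structures and function are invented for illustration, not taken from any of these formats): in a copy-on-write style batch format, every data file containing an affected key must be rewritten, even when the update batch itself is tiny, which is why fine-grained, frequent updates are expensive.

```python
# Conceptual sketch only: model each data file as a dict of key -> row.
# In a copy-on-write batch format, applying an update batch forces a
# rewrite of every file that contains any affected key.
def copy_on_write_upsert(files, updates):
    """files: list of dicts, one per data file. Returns files rewritten."""
    rewritten = 0
    for f in files:
        hit = {k: v for k, v in updates.items() if k in f}
        if hit:
            f.update(hit)   # in a real format: rewrite the whole file
            rewritten += 1
    return rewritten

files = [{"a": 1, "b": 2}, {"c": 3}, {"d": 4}]
# A two-key update still rewrites two whole files.
print(copy_on_write_upsert(files, {"a": 10, "d": 40}))  # → 2
```

Running this update batch every minute would mean rewriting a large fraction of the table's files every minute, which is the granularity wall the talk describes.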

Hudi's original intent was to solve exactly this problem: implement a real-time data lake format, improve real-time updates, and accelerate the timeliness of the data lake. However, its current architectural design and engineering implementation have not met expectations. Many customers have hit pitfalls using Hudi, facing serious challenges in system stability and operational complexity.

In fact, the root cause is that the lakehouse architecture has had no lake format designed for real-time data updates and real-time analysis. Last year, we explored this in the Flink community and launched a new sub-project there called Flink Table Store, with the aim of testing its product-market fit (PMF). Through Flink Table Store, we found that a data lake format truly designed for real-time updates is very much needed; combined with Flink, a real-time stream computing engine, it can fully realize real-time data pipelines on the Lakehouse architecture.

To give the project more room to grow, this year we decided to move it out of the Flink community and incubate it as an independent Apache Foundation project, naming it Apache Paimon.

Paimon is a data lake format truly designed for real-time updates, and it is completely open: it supports not only Flink but also mainstream computing engines such as Spark, Presto, and StarRocks.

And because it is designed for real-time workloads, its performance and stability are very good. In our typical application scenarios, compared with the open source Hudi solution, the Alibaba Cloud streaming lakehouse solution improves Upsert performance by more than 4x and Scan performance by more than 10x.

Therefore, based on Flink and Paimon, we launched a new generation of streaming lakehouse data analysis technology: from real-time ingestion into the lake to real-time ETL and data updates on the lake, a single set of unified SQL performs full-link real-time data processing in the Lakehouse. Because Paimon is open, we can also bring the commonly used open source analysis engines Spark, Presto, and StarRocks into this architecture, as well as Alibaba Cloud's self-developed engines MaxCompute and Hologres, all of which connect seamlessly to Paimon data. The result is a completely open lakehouse system with a complete ecosystem across the whole link: data flows in real time end to end, and the entire data link can be analyzed in real time. This is the evolutionary direction of the data analysis architecture in 3.0, pushing the lakehouse toward real-time.
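A minimal mental model of the primary-key update semantics this pipeline relies on (a conceptual sketch, not Paimon's actual implementation; the changelog operation codes mirror the +I/+U/-D convention used in stream processing):

```python
# Sketch: merge a changelog stream of inserts, updates, and deletes
# by primary key into the latest table snapshot.
def apply_changelog(snapshot, changelog):
    """snapshot: dict key -> row; changelog: list of (op, key, row)."""
    for op, key, row in changelog:
        if op in ("+I", "+U"):      # insert / update-after
            snapshot[key] = row
        elif op == "-D":            # delete
            snapshot.pop(key, None)
    return snapshot

events = [
    ("+I", "u1", {"clicks": 1}),
    ("+I", "u2", {"clicks": 5}),
    ("+U", "u1", {"clicks": 2}),
    ("-D", "u2", None),
]
print(apply_changelog({}, events))  # → {'u1': {'clicks': 2}}
```

Because each record is resolved by key as it arrives, downstream engines always read the latest state, which is what makes "a unified SQL, full-link real-time" practical on the lake.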

Fully Serverless

Second, I would like to introduce the product architecture: the integration of our products with cloud native has also taken an important step, and we want the open source big data platform to become fully serverless. In fact, we have been exploring serverless for several years. The platform's first serverless product, serverless Flink, was launched two years ago and is used by many customers on Alibaba Cloud.

Serverless Flink has brought a lot of positive customer feedback; everyone wants open source products that work out of the box. Therefore, this year we launched four more serverless open source big data products: two for compute and two for storage. For compute, we chose Spark and StarRocks, the engines most popular with users, and launched two serverless compute products: EMR Serverless StarRocks and the soon-to-be-released EMR Serverless Spark.

On the storage side, we have also launched two serverless products. The first is OSS-HDFS, a fully managed serverless HDFS product launched jointly with the OSS object storage team. The other is a data lake management product: a fully managed serverless metadata management service that is fully compatible with the HMS (Hive Metastore) protocol. Combining these products, we can cover the processing and analysis needs of almost all big data scenarios.

The reason we could launch four serverless big data products in quick succession within one year is our accumulated technology: all serverless requirements are consolidated into a common big data serverless platform base. This base shields Alibaba Cloud's heterogeneous hardware and resource pools and provides a complete multi-tenant management system, including network isolation and resource isolation. This lets us quickly incubate new serverless big data products.

Serverless Flink

The first product is serverless Flink, which connects to Alibaba Cloud's upstream and downstream storage. Whether a database, data lake, data warehouse, or message queue, all mainstream storage data sources on Alibaba Cloud can be connected with one click. It provides a one-stop SQL development platform with intelligent operations and management services, ready to use out of the box. We have also heavily optimized the core Flink engine in the serverless Flink product, and it is widely used inside Alibaba; compared with the open source Flink engine, performance is improved two to three times. So using the serverless Flink product not only improves development efficiency but also significantly saves cost through better runtime efficiency.

Another new serverless data product launched in the first half of this year is serverless StarRocks, which mainly addresses users' needs for real-time interactive analysis in OLAP scenarios. OLAP and real-time analysis are hot topics right now. In our evaluation, the most mainstream and best OLAP engine in the open source world is StarRocks, so we chose StarRocks for the first serverless OLAP product on EMR. Because StarRocks is a fully vectorized C++ engine, its performance is very good, supporting tens of thousands of concurrent queries.

Serverless StarRocks

At the same time, the latest version of StarRocks supports a storage-compute separation architecture. Combined with the product's cloud-native capabilities, the Virtual Warehouse feature provides both elasticity and isolation between user workloads. With storage and compute separated, StarRocks can also connect to the data lake. The streaming lakehouse accumulates a lot of real-time-updated data on the lake, and serverless StarRocks can then query that data instantly, achieving a good lake-warehouse integration effect, a layout we call "big lake, small warehouse".

Serverless Spark

Another major serverless product this year is serverless Spark. Spark is, I believe, the most commonly used computing engine in open source big data systems, and it is also the most important computing engine we see on EMR.

In recent years, we have continually heard users calling for a truly fully managed, operations-free, serverless Spark product that reduces the operations burden, improves development efficiency, and even improves job execution efficiency. So this year, under the goal of going fully serverless, we invested heavily to create the serverless Spark product, which will soon enter beta testing and commercialization.

The serverless Spark product integrates the strengths of the earlier serverless Flink and StarRocks products: one-stop development and intelligent operations out of the box, fully elastic pay-as-you-go, connection to the data lake, and more. In addition, we have built a serverless shuffle service based on Celeborn into serverless Spark, which removes the dependence on local disks and makes the entire computation fully serverless.

Serverless HDFS (OSS-HDFS)

I have just covered several serverless compute products. Next is another very important product: serverless storage. We call it serverless HDFS; the official product name is OSS-HDFS. It is a product built jointly with the OSS team.

As everyone knows, HDFS is considered the de facto standard file system protocol in the big data industry. As more and more users move data to the data lake, they want to keep using the HDFS protocol to access data on the lake, so that all existing computation stays compatible.

Therefore, we can package OSS data as a seemingly unlimited cloud HDFS, which meets many users' needs. This year, together with the OSS team, we released the OSS-HDFS serverless file system, fully compatible with HDFS. With it, many users no longer have to maintain local HDFS clusters themselves, eliminating operational complexity; it is completely pay-as-you-go with very good elasticity. Combined with intelligent hot and cold data tiering, it helps users further reduce costs and increase efficiency.

As mentioned, serverless is the cloud-native architectural progress in open source big data 3.0, and more serverless products will be launched in the future.

Smarter open source big data

AI is currently booming. Alibaba Cloud's open source big data platform has also brought AI technology into the big data platform to help with intelligent platform operations and data management. This year, we upgraded the intelligent operations tools EMR Doctor and Flink Advisor, which are widely used by customers and by Alibaba Cloud's internal platform operations. Average cluster problem-identification time has been reduced by 30%, and effective cluster resource utilization has increased by 75%.

As we all know, operations on EMR products are very challenging because EMR runs many components, such as Hadoop, Hive, Kafka, Spark, Flink, and Presto. Once a problem appears, quickly locating it is very troublesome for users. And even when nothing is wrong, users still want to improve the resource utilization and storage efficiency of the whole cluster.

Previously, all of this relied on human experience; over the past few years we also invested many engineers to solve these problems for customers by hand. In recent years, however, we have distilled this experience and knowledge into knowledge bases and rule bases, combined with traditional machine learning algorithms and data analysis. This approach can intelligently locate problems and give users suggestions for optimizing clusters and resolving issues.
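The rule-base side of this can be pictured with a tiny sketch. Every rule, metric name, and threshold below is invented for illustration and is not taken from EMR Doctor:

```python
# Hypothetical sketch of rule-base style diagnosis: encode operator
# experience as (predicate, advice) pairs and fire them against metrics.
RULES = [
    (lambda m: m.get("disk_used_pct", 0) > 90,
     "HDFS disks nearly full: expand capacity or clean cold data"),
    (lambda m: m.get("pending_containers", 0) > 100,
     "YARN scheduling backlog: increase queue capacity"),
    (lambda m: m.get("gc_time_pct", 0) > 20,
     "Excessive JVM GC: enlarge heap or tune memory settings"),
]

def diagnose(metrics):
    """Return the advice for every rule that fires on these metrics."""
    return [advice for fires, advice in RULES if fires(metrics)]

print(diagnose({"disk_used_pct": 95, "gc_time_pct": 25}))
```

In a real system, such rules are combined with learned models and historical data rather than standing alone, but the shape (metrics in, ranked suggestions out) is the same.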

We have also done a lot of this in the Flink products, launching the intelligent diagnosis service Flink Advisor. Across the full development and operations lifecycle, it helps users locate why a task failed, where it failed, and how to fix and improve it. Even when tasks have no problems, we still run health checks to identify potential risks, similar to a health score, helping users take precautions in advance and giving intelligent suggestions for optimizing tasks. All of this is done with analysis technology that combines big data and AI.

Finally, when it comes to AI, I think the first term that grabs developers' attention is vector retrieval. In the AI era, all unstructured data can be represented as vectors, and vector retrieval technologies are springing up like mushrooms after rain. There are various open source vector retrieval technologies in the industry today; after evaluation, we believe Milvus is currently the most popular and the most in demand among users. Therefore, the open source big data platform will also launch a fully managed serverless vector retrieval service based on open source Milvus. Together with the Milvus ecosystem, Alibaba Cloud's PAI machine learning platform, and various large models, it forms a complete integrated big data and AI solution to serve customers who need vector retrieval in AI scenarios.
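To make the idea of vector retrieval concrete, here is a brute-force nearest-neighbour search sketch. A service like Milvus replaces this linear scan with approximate indexes (such as HNSW or IVF) to scale to billions of vectors; the data here is invented for illustration:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def top_k(query, vectors, k=3):
    """vectors: mapping of id -> embedding; return the k most similar ids."""
    ranked = sorted(vectors, key=lambda v: cosine(query, vectors[v]),
                    reverse=True)
    return ranked[:k]

vecs = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [0.7, 0.7]}
print(top_k([1.0, 0.1], vecs, k=2))  # → ['a', 'c']
```

Embeddings for text, images, or audio come from a model (e.g. one hosted on PAI), and the retrieval service answers "which stored items are closest to this query vector".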

That concludes my sharing on the core technical architecture and technology trends of the open source big data platform 3.0. We hope these new technologies will land in products, serve customers, and earn customer feedback. Thank you all for listening.
