[2023 Yunqi] Chen Shouyuan: Alibaba Cloud’s annual release of open source big data products

This article is compiled from the transcript of a speech at the 2023 Yunqi Conference. The speech details are as follows:

Speaker: Chen Shouyuan | Director of Open Source Big Data Products, Alibaba Cloud Computing Platform Division

Topic: Alibaba Cloud's annual release of open source big data products

With the continuous development of cloud computing, future data processing and applications will revolve around three trends: Cloud Native, Serverless, and Data+AI. Cloud-native architecture has become mainstream because it improves the scalability and flexibility of data processing and applications, supporting large-scale deployment and faster response times. Serverless, as a new computing model, improves processing efficiency, lowers operating costs, and reduces resource waste; these characteristics make it an ideal choice for processing large-scale data. Finally, the integration of Data and AI is developing rapidly, with ever-greater intelligence and automation, while high-quality data is required to support the accuracy and effectiveness of the algorithms.

EMR: Toward the next-generation lakehouse and comprehensive serverless

Now let's move on to the product releases. Around the three trends above, we will describe the key releases of our products: what we are doing, and what we are releasing to better serve users on the cloud.

First, let's look at EMR. EMR is a cloud-native open source big data platform. A large number of users who have built on the open source Hadoop ecosystem in their own IDCs choose EMR as their first stop when moving to the cloud, because the transformation cost is extremely small and the migration can be almost seamless, saving users enormous human and machine costs. We therefore position Alibaba Cloud EMR as the first stop for users migrating big data to the cloud.

Our product matrix has been upgraded this year; we hope to provide diversified EMR product forms on top of the increasingly diversified IaaS on the cloud. The core problem solved by the EMR universal edition is helping users migrate their big data systems to the cloud, and it is the solution most compatible with users' on-premises deployments. The second is the EMR container edition, that is, EMR on ACK. Cloud-native containerization of IT infrastructure is now widely accepted, and a large number of our customers choose containerized platforms such as Alibaba Cloud's ACK when building IT systems on the cloud. Those users naturally want to move Data and AI workloads onto the same cluster, so that Data & AI workloads can be co-located with their other IT workloads. The EMR container edition, or EMR on ACK, is the product that helps users solve this problem.

The last thing we want to emphasize today is the EMR Serverless edition. Some features of the EMR Serverless sub-product line have been released at Yunqi before, but today we present a more complete matrix of the EMR Serverless product line, focusing on the serverlessization of EMR's two mainstream computing engines: Serverless Spark and Serverless StarRocks.

The EMR Serverless edition is the latest generation of product and technology in the EMR product line. In fact, EMR's serverless roadmap was already in full swing one and two years ago: OSS-HDFS, that is, Serverless HDFS, was released last year and the year before. This year we have gone much further, making the mainstream big data computing engines, the storage engine, the development platform, and metadata management on EMR all serverless; only in this way can we better serve cloud-native users and help them make better use of big data. Serverless Spark solves Data ETL processing in the lakehouse scenario, Serverless StarRocks solves data analytics in the lakehouse scenario, Serverless HDFS solves data storage in the lakehouse scenario, and finally EMR Studio helps users carry their on-premises experience to the cloud, letting them use cloud big data infrastructure without the burden of operations and maintenance. This year, therefore, EMR has made its main engines and platforms serverless across computing, storage, and the development environment. We hope to close the entire big data development and operation loop, further helping cloud-native developers use big data more efficiently.
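To make the lakehouse ETL role of Serverless Spark concrete, here is a minimal PySpark sketch of the kind of job such an engine runs. This is a generic Spark example, not the EMR Serverless submission API; the bucket paths, columns, and event schema are hypothetical.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# A generic lakehouse-style ETL job: read raw events from object storage,
# clean and aggregate them, and write the result back in a columnar format.
spark = SparkSession.builder.appName("lakehouse-etl-sketch").getOrCreate()

# Hypothetical input path on object storage (e.g. OSS); adjust to your bucket.
raw = spark.read.json("oss://my-bucket/raw/events/")

daily_orders = (
    raw.filter(F.col("event_type") == "order")        # keep order events only
       .withColumn("day", F.to_date("event_time"))    # derive a partition column
       .groupBy("day", "product_id")
       .agg(F.sum("amount").alias("gmv"),
            F.count("*").alias("orders"))
)

# Write partitioned Parquet back to the lake for downstream BI / AI consumers.
daily_orders.write.mode("overwrite").partitionBy("day") \
    .parquet("oss://my-bucket/dw/daily_orders/")

spark.stop()
```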

Let's return to EMR's main scenario. The EMR universal edition has made many updates around the lakehouse scenario, across lakehouse computing, storage, operations and maintenance, and development. At the computing level, our core goal is to reduce costs and improve efficiency: the IaaS layer is adapted to the new Yitian CPU, and the PaaS layer now has a native Spark runtime, both of which help users cut costs and gain performance. On the storage side, Serverless HDFS (also called OSS-HDFS) was released some time ago, but this year we want Serverless HDFS and local HDFS to offer the same user experience, with nearly identical file performance, data access, and metadata access. To achieve this we have done a great deal of system performance and security optimization; the improvements in open-file performance and in metadata access for du-style operations are this year's achievements.

EMR operations and maintenance is reflected in two aspects. On the cloud, EMR combined with cloud native can create greater platform value for users through elasticity; we have done a lot of elasticity optimization this year, and many customers report that EMR's platform elasticity is becoming more and more stable. The other key point is EMR Doctor: we hope to use AI, automation, and an intelligent operations platform to help users solve open source big data operations and maintenance problems. Judging from community feedback, the biggest and most painful point in using open source big data is system operations and maintenance; keeping a business running healthily on the cloud over the long term is a very large pain point for many users both on and off the cloud, and EMR Doctor solves this problem. On EMR development, that is, EMR Studio, we hope that cloud-native serverless hosting of the development and scheduling platforms will help users carry their on-premises experience over completely to a single experience on the cloud. These are EMR's major updates around the lakehouse scenario.

Finally, back to EMR for AI. Each of our products is embracing this change, in three parts: EMR DataScience, EMR Doctor, and Code Pilot in EMR + DataWorks. EMR DataScience lives in the EMR container edition: we provide a new cluster type called EMR DataScience with many of the most popular AI components built in, including PyTorch and TensorFlow, so users can process big data and run AI tooling natively on one cloud platform. EMR Doctor, as mentioned earlier, aims to use AI-based, intelligent methods to help users implement AIOps, automatically locating, diagnosing, and detecting problems early. As for EMR + DataWorks, DataWorks' big release this year is Code Pilot; DataWorks as a platform connects to EMR and other engines, and Code Pilot itself is a feature independent of the engine. It can generate Hive code for EMR, and users can use the DataWorks development platform to generate MaxCompute SQL from natural language and operate their business, greatly reducing the cost of writing code. You are welcome to try it when DataWorks opens the public beta.

Flink Streaming Lakehouse: a new generation of streaming lakehouse solution

Let's take a look at Flink Streaming Lakehouse. The Lakehouse concept has been very popular in the past few years because a Lakehouse system has the rigor of a Data Warehouse, including ACID, version management, and data format validation, while also having a Data Lake's flexibility: it can accommodate large amounts of unstructured data, including text, pictures, video, and audio. Because a Lakehouse can carry both structured and unstructured data, it is a very good underlying storage solution for users integrating AI and big data. However, when we looked at Lakehouse, we found it has a very big problem with timeliness. The core mission and value of Flink is to help our customers make big data real-time, so the Flink community worked with us to release the Streaming Lakehouse solution.

Back to Streaming Lakehouse, I will mainly cover three key scenario points from the product direction. As mentioned earlier, the Lakehouse solution will become more and more important in the AI era because it can store both structured and unstructured data, making it an important carrier for the integrated storage of big data and AI. However, Lakehouse still runs into timeliness problems in practice: when the stages of a Lakehouse data pipeline are connected in series, the end-to-end delay from initial data ingestion to realizing data value in BI or AI can reach the hour level, which is a huge obstacle for users trying to build a real-time lakehouse. Flink therefore hopes to make the Lakehouse real-time, bringing users great improvements through streaming.

Finally, there is Unified. The Flink community has focused on unifying batch and streaming for the past few years, hoping to achieve integration at the computing level. But when we promoted the unified stream-batch solution in the open source community, we found that integrating only at the computing level solves only half the problem; the other half lies in storage. With two separate storage systems and two copies of data, offline and real-time data become inconsistent, which is a very big problem for users. So the Flink team and the community built Paimon together. Paimon builds, on top of a distributed file system or object store such as OSS, a unified storage that serves both streams and batches; we call it stream-batch unified storage. Flink + Paimon thus constitutes a Lakehouse solution with both unified computing and unified storage, and this combination can truly and completely give users an integrated stream-batch solution. This is the value of our Streaming Lakehouse. Ultimately, we hope to provide users with a real-time, streaming, serverless lakehouse solution for the Data+AI era.
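As a minimal sketch of what the Flink + Paimon combination looks like in practice, the PyFlink snippet below registers a Paimon catalog on object storage and creates a table that a streaming job can write to and that either streaming or batch jobs can read. The warehouse path and table schema are hypothetical, and the Paimon connector JARs are assumed to be on the classpath.

```python
from pyflink.table import EnvironmentSettings, TableEnvironment

# Streaming mode here; the same Paimon table also supports batch reads.
t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# Register a Paimon catalog whose warehouse lives on object storage
# (the OSS bucket below is a placeholder).
t_env.execute_sql("""
    CREATE CATALOG paimon WITH (
        'type' = 'paimon',
        'warehouse' = 'oss://my-bucket/paimon-warehouse'
    )
""")
t_env.execute_sql("USE CATALOG paimon")

# One table shared by streaming writes and stream/batch reads:
# this is the "stream-batch unified storage" idea.
t_env.execute_sql("""
    CREATE TABLE IF NOT EXISTS dw_orders (
        order_id BIGINT,
        amount   DECIMAL(10, 2),
        ts       TIMESTAMP(3),
        PRIMARY KEY (order_id) NOT ENFORCED
    )
""")

# A streaming job continuously upserts into the Paimon table, while
# downstream BI can query the same table in batch mode with no second copy:
# t_env.execute_sql("INSERT INTO dw_orders SELECT ... FROM source_stream")
```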

Back to the main line of Flink: our mission has always been to help users make big data real-time, so pursuing cost-effectiveness in real-time scenarios has always been the Flink team's direction. There are two important points this year. First, Flink fully embraces the Yitian CPU; combined with Yitian, Flink's overall real-time computing performance has improved by 50%, thanks to many optimizations the Flink team made at the IaaS level. Second, we are doing a lot of optimization on the Flink enterprise-level kernel at the PaaS layer, including operator optimization, and we will announce a native runtime optimization in the future. Compared with the open source Flink engine, this optimization will double the performance of our Realtime Compute for Apache Flink, especially in throughput, addressing many users' high-throughput, high-traffic real-time computing scenarios.

Elasticsearch: Serverless and Search for Data & AI

Next, let's talk about Elasticsearch, which is also an important part of open source big data. When it comes to Elasticsearch, most people's impression may still be the relatively early "search for data": full-text retrieval, similar to what search engines do. But today I want to tell you that this impression needs refreshing. Elasticsearch is not only search for data, but also search for AI, and I will focus on how ES is transforming from a Data search system into a Data+AI search system.

The first is our Elasticsearch release. Frankly speaking, the current product form, the independent-cluster edition of ES as a PaaS offering, has served the needs of our Chinese public cloud and private cloud customers very well; many medium and large companies highly recognize Alibaba Cloud's ES product, and its customer base and future growth are both strong. But as customers have put cost reduction and efficiency improvement on the agenda over the past two years, we found that for a group of very large potential customers, as well as medium and long-tail customers, the cost of the independent-cluster edition is still a relatively large barrier to adoption on the cloud. They very much hope to start using ES on the cloud in a low-threshold or even zero-threshold way. This is the original intention of our ES Serverless: we hope to help users start using Elasticsearch on the cloud with zero threshold.

At the same time, Elasticsearch Serverless is also our first ES edition in China that supports general scenarios. Last year we released an Elasticsearch Serverless version, but it mainly addressed log (ELK) scenarios and had problems with data consistency, so this year we carried out a major reconstruction of the product's technical architecture. This release of ES Serverless is an upgrade for general scenarios: it supports not only log scenarios but also orders, finance, and other scenarios, and data consistency is well guaranteed. This is the big difference between this year's release and last year's. ES Serverless truly offers pay-as-you-go billing, second-level elasticity, and simple operations, and it is fully compatible with open source ES, which many other vendors may not be able to achieve.
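Because the service is positioned as fully compatible with open source ES, a standard open source client should work unchanged. The sketch below illustrates that assumption with the official elasticsearch Python client against a hypothetical endpoint, index, and credentials.

```python
from elasticsearch import Elasticsearch

# Hypothetical serverless endpoint and credentials; the point is that the
# standard open source client and APIs are used, with no proprietary SDK.
es = Elasticsearch(
    "https://my-es-serverless-endpoint:9200",
    basic_auth=("elastic", "my-password"),
)

# Ordinary index and full-text search calls, exactly as against open source ES.
es.index(index="articles", id="1",
         document={"title": "Streaming Lakehouse", "body": "Flink plus Paimon..."})
es.indices.refresh(index="articles")

resp = es.search(index="articles",
                 query={"match": {"body": "lakehouse"}})
for hit in resp["hits"]["hits"]:
    print(hit["_source"]["title"], hit["_score"])
```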

The following focuses on ES for AI and Data, which marks ES truly becoming a search engine for Data & AI rather than Data alone. There is a large advertising column outside the Yunqi venue highlighting the release of ESRE, a major release from Elastic. The core of the release is support for AI-related retrieval in the ES kernel, including vector retrieval and multi-channel parallel query optimization, all aimed at helping users do AI retrieval. Around ES's latest AI capabilities, Alibaba Cloud ES has integrated a large number of enhancement solutions: we have joined forces with DAMO Academy's AI solutions and PAI-EAS, and will work with the community on more joint solutions. These help our users combine Alibaba Cloud and DAMO Academy AI technology on the cloud with the community's ES. We hope this ES 8.9 release can help users build the next generation of Data+AI-oriented retrieval systems.
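As an illustration of the vector retrieval capability, here is a minimal sketch using the open source ES 8.x kNN search API via the Python client. The index mapping, embedding dimension, and query vector are hypothetical, and producing the embeddings (for example, from a model served on PAI-EAS) is outside the scope of the snippet.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("https://my-es-endpoint:9200",
                   basic_auth=("elastic", "my-password"))  # placeholder endpoint

# A mapping with a dense_vector field enables approximate kNN retrieval.
es.indices.create(index="docs", mappings={
    "properties": {
        "text":      {"type": "text"},
        "embedding": {"type": "dense_vector", "dims": 4,
                      "index": True, "similarity": "cosine"},
    }
})

es.index(index="docs", id="1", document={
    "text": "Serverless lakehouse on the cloud",
    "embedding": [0.1, 0.9, 0.2, 0.4],   # toy embedding; real ones come from a model
})
es.indices.refresh(index="docs")

# kNN query: find documents whose embeddings are nearest to the query vector.
resp = es.search(index="docs", knn={
    "field": "embedding",
    "query_vector": [0.1, 0.8, 0.3, 0.4],
    "k": 5,
    "num_candidates": 50,
})
print([h["_source"]["text"] for h in resp["hits"]["hits"]])
```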

Regarding the upgrade of ES's self-developed capabilities: Alibaba Cloud ES cooperates with Elastic and builds further optimization and incubation on top of open source ES; everything is based on open source and fully compatible with it, with many enhancements of our own. Three upgrades have been made here. First, the scenario upgrade: the transformation from log scenarios to general scenarios. Last year ES Serverless focused on logging and ELK scenarios; this year it is fully open to general scenarios. Second, optimization of the search kernel engine, including read-write separation and storage-compute separation, which better solve cluster stability, cost control, and resource elasticity. Finally, we have made a fairly large experience upgrade in the purchase flow and related consoles. We highly recommend trying the Alibaba Cloud ES Serverless edition to experience a completely serverless ES.

Milvus: Search engine in the AI era

The last item today is a completely new product this year; everything before this built on our existing functions and product lines. Milvus is the new search engine of the AI era that we are releasing this year. Milvus is currently just about the hottest and most eye-catching technology in vector retrieval worldwide. We will start external testing of the Milvus vector retrieval edition in December. Compared with open source Milvus, we will add enterprise-level enhancements, and on top of open source compatibility, we will also combine DAMO Academy technology to provide better enterprise-level vector retrieval capabilities. We will also do a lot of joint product work on the cloud, including making the large amount of unstructured data in our storage retrievable and queryable by users, as well as deeper integration with the PAI platform and DAMO Academy AI models for AI vector retrieval and larger-model vector support; these solutions will be built into our products in the future. Ultimately, we hope to help users of Milvus on the cloud build an AI-era search system faster, more conveniently, and with a lower threshold.
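As a minimal sketch of the kind of workload Milvus serves, the snippet below uses the open source pymilvus client to create a collection, insert toy embeddings, and run a similarity search. The host, collection schema, and vectors are hypothetical; a real deployment would use embeddings produced by a model (for example, one hosted on PAI).

```python
from pymilvus import (connections, Collection, CollectionSchema,
                      FieldSchema, DataType)

# Connect to a Milvus instance (placeholder host/port).
connections.connect(host="localhost", port="19530")

# A tiny collection: an auto-id primary key plus a 4-dim embedding field.
schema = CollectionSchema([
    FieldSchema("id", DataType.INT64, is_primary=True, auto_id=True),
    FieldSchema("embedding", DataType.FLOAT_VECTOR, dim=4),
])
coll = Collection("demo_vectors", schema)

# Insert toy vectors; real embeddings would come from an ML model.
coll.insert([[[0.1, 0.9, 0.2, 0.4], [0.8, 0.1, 0.5, 0.3]]])
coll.create_index("embedding",
                  {"index_type": "IVF_FLAT", "metric_type": "L2",
                   "params": {"nlist": 128}})
coll.load()

# Approximate nearest-neighbor search for the vectors closest to the query.
results = coll.search(
    data=[[0.1, 0.8, 0.3, 0.4]], anns_field="embedding",
    param={"metric_type": "L2", "params": {"nprobe": 10}}, limit=2,
)
for hit in results[0]:
    print(hit.id, hit.distance)
```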

Let's review the three big data trends we discussed. Cloud Native: IT investment across the industry is accelerating its shift to the cloud. Serverless: we believe all future PaaS platforms will eventually be serverless, whether AI products, big data products, or other PaaS products. Finally, Data+AI: AI and big data will be thoroughly integrated in the future. This is why our entire open source big data product line has been actively planned around these three points.

Finally, I hope everyone will pay more attention to Alibaba Cloud and its open source big data. Thank you all!
