[2023 Yunqi] Liu Yiming: Thoughts and releases on big data platform construction in the Data+AI era

Introduction: This article is compiled from the transcript of a speech at the 2023 Yunqi Conference. Speech details are as follows:

Speaker: Liu Yiming | Head of Alibaba Cloud’s self-developed big data products

Speech topic: How to build a big data platform in the Data+AI era

The topic I am sharing today is how to build a big data platform in the Data+AI era. This talk is both a reflection on and a summary of our work over the past year, and I hope that, whether or not you use Alibaba Cloud's platforms and technology, it offers some inspiration for the selection, operation, and innovation of future big data platforms. At the same time, we will also consider whether the roles and working methods of big data practitioners will change in the future.

At the core of Alibaba Cloud's big data stack are two distributed computing engines under the ODPS (Open Data Processing Platform) brand, and today's sharing will focus on them: MaxCompute, for batch data processing and massive storage, and Hologres, for real-time data warehousing and interactive analysis scenarios. Now let's get to the point. I would like to share our reflections on our past platform development: which capabilities are the key ones, and which capabilities we have improved this year.

Cost reduction capability: Flexible payment models drive significant reductions in big data costs

Cost reduction is a core capability of every big data platform. Especially as a service provider on the public cloud, we do not want the cloud big data platform to become a cost black hole, where the more you use it, the more expensive it gets, and every year the boss complains that money is being wasted without knowing where it went. We hope to provide not only a platform where costs and fees are clearly explained and accounted for, but also one where the unit cost of ownership can be continuously reduced through correct use of the products. Cost reduction never means simply buying cheaper specifications and fewer resources; that potentially sacrifices the service quality of the platform and is not the right approach. Low prices often come without quality assurance: in the end you receive lower-quality service, and lower-quality R&D investment eventually makes the platform unsustainable.

A reasonable way to reduce costs is to first choose a suitable procurement and payment strategy, and then choose suitable technology. Taking MaxCompute as an example, the platform provides a variety of payment methods, from the classic prepaid (annual or monthly subscription) model to the most commonly used postpaid (pay-as-you-go) model. Prepayment offers precise budget control and costs stated clearly in advance, but resource usage is capped, temporary demand cannot be met, and idle resources are wasted. The pay-as-you-go model charges according to actual business scale and requires no capacity planning in advance, but actual spending can easily exceed the budget. What we want now is a combination of the two modes.

We see that most data processing workloads follow a time pattern: there is often a peak period at night, so that results are ready when you come to work in the morning, while the daytime water level is relatively low. Here you can use MaxCompute's time-sharing elasticity: run on a low baseline of resources during the day, and flexibly scale out additional resources during peak periods. Time-sharing elasticity launched last year; this year, optimizations to inventory management have improved inventory efficiency, and starting September 20 the CU unit price of the elastic portion of MaxCompute is directly reduced by 50%. If your jobs are busy for only part of the day, say around 8 hours, the time-sharing approach will definitely reduce costs. We hope each user chooses a time-sharing strategy based on their actual usage scenario.
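To make the saving concrete, here is a minimal back-of-the-envelope sketch. The unit prices and workload shape below are hypothetical placeholders, not published MaxCompute prices; only the 50% discount on the elastic portion comes from the talk.

```python
# Hypothetical prices: only the 50% elastic discount is from the talk.
RESERVED_PRICE_PER_CU_HOUR = 1.0
ELASTIC_PRICE_PER_CU_HOUR = 0.5 * RESERVED_PRICE_PER_CU_HOUR  # 50% off

peak_cus, baseline_cus = 1000, 300   # hypothetical workload shape
peak_hours_per_day = 8               # nightly batch window

# Option A: reserve for peak capacity around the clock.
flat_cost = peak_cus * 24 * RESERVED_PRICE_PER_CU_HOUR

# Option B: reserve the daytime baseline, buy the nightly delta elastically.
elastic_cost = (baseline_cus * 24 * RESERVED_PRICE_PER_CU_HOUR
                + (peak_cus - baseline_cus) * peak_hours_per_day
                  * ELASTIC_PRICE_PER_CU_HOUR)

print(f"flat: {flat_cost:.0f}, time-shared: {elastic_cost:.0f}, "
      f"saving: {1 - elastic_cost / flat_cost:.0%}")
```

With these illustrative numbers, the time-shared plan costs 10,000 versus 24,000 for flat reservation, a saving of roughly 58%; your actual ratio depends entirely on how peaked your workload is.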

The principle is similar to Spot Instances on ECS. This year, MaxCompute launched spare-time jobs, commonly called SpotJobs, priced at one third of pay-as-you-go pricing. Spare-time jobs run on the idle resources of the big data cluster, so they do not guarantee that the same resources are available every day or that execution is equally fast; when the cluster is busy, jobs wait longer. But for latency-insensitive jobs, such as historical data imports and daily development and debugging, using spare-time jobs can effectively cut costs by 66%.

Time-sharing elasticity can satisfy both flexibility and budget management, so what is the optimal setting? MaxCompute has released a cost optimizer that analyzes the resource distribution of all your jobs over the past 30 days, displays peak and trough periods, and suggests how to design an elasticity strategy. On top of elasticity, we added a key job constraint called the baseline: jobs before the baseline need guaranteed resources so that results are produced on time, while jobs after the baseline may run slower, saving resources and cost. This differentiates jobs by priority and importance. After adopting the cost optimizer, most users see more than 20% cost reduction; we recommend adopting it as soon as possible.
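As an illustration of the kind of analysis the cost optimizer performs (this is not its actual implementation or API), the sketch below buckets a hypothetical 30-day job log by hour to expose peaks and troughs and suggest a reserved baseline:

```python
import pandas as pd

# Hypothetical job log: start time and CU-hours consumed per job.
jobs = pd.DataFrame({
    "start": pd.to_datetime(["2023-10-01 02:00", "2023-10-01 03:00",
                             "2023-10-01 14:00", "2023-10-02 02:30"]),
    "cu_hours": [800, 950, 120, 780],
})

# Aggregate usage by hour of day to see the daily shape.
hourly = jobs.groupby(jobs["start"].dt.hour)["cu_hours"].sum()

# One plausible heuristic: reserve around the median hourly usage,
# and cover everything above it with time-shared elastic resources.
baseline = hourly.quantile(0.5)
print("suggested reserved CUs ~", baseline)
print("elastic hours:", hourly[hourly > baseline].index.tolist())
```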

Next, let's talk about how to reduce storage costs. In practice, data falls into different access profiles. Some data is accessed frequently and tends to be more important; some is accessed rarely, read once or twice a month; some exists only for audit requirements, cannot be deleted, and may not be accessed even once a year. Since the value of data is distributed unevenly, should data costs also be tiered? Certainly. MaxCompute provides different storage capabilities for data with different access characteristics and value, and tiered storage comes with tiered unit prices. With tiered storage, the cost of infrequently accessed and long-term data can drop to one third of the previous level.
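A minimal sketch of moving cold tables to cheaper tiers with PyODPS follows. The 'storagetier' table property and tier names are assumptions based on MaxCompute's tiered-storage feature; check the current documentation before relying on them, and all table names are hypothetical.

```python
from odps import ODPS

o = ODPS("<access_id>", "<secret>", project="my_project",
         endpoint="https://service.cn-hangzhou.maxcompute.aliyun.com/api")

# Audit data: rarely read, must be retained -> long-term tier.
o.execute_sql(
    "ALTER TABLE audit_log SET TBLPROPERTIES('storagetier'='longterm')"
)

# Monthly-report inputs: read a couple of times a month -> low-frequency tier.
o.execute_sql(
    "ALTER TABLE monthly_report_src SET TBLPROPERTIES('storagetier'='lowfrequency')"
)
```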

Computing and storage costs can be reduced through platform usage strategies; further cost reductions can come from innovation in storage technology. JSON is a very widely used data structure on the Internet: it is semi-structured, flexible to query, convenient to store, and its schema can be adjusted at any time. However, in the past, if JSON was stored as a string, then even accessing a single field required reading and parsing the entire document, potentially megabytes of it, a huge waste of compute and IO. The alternative was to flatten the JSON structure into wide tables before the data landed in the warehouse, which requires a lot of processing and also wastes computing resources.

How to store and access the JSON data type efficiently has therefore become a key capability of big data platforms. This year, both MaxCompute and Hologres provide native JSON management, including metadata support and columnar compression of JSON storage, treating semi-structured data as a first-class supported type. In user practice, most users see JSON storage costs drop to one fifth of the previous level, while queries become faster.
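The contrast between the two approaches looks roughly like the sketch below. GET_JSON_OBJECT is a long-standing MaxCompute string function; the native JSON column type and JSON_EXTRACT usage are stated per my understanding of the new feature and may require engine flags, so treat the exact DDL as an assumption. Table names are hypothetical.

```python
from odps import ODPS

o = ODPS("<access_id>", "<secret>", project="my_project", endpoint="<endpoint>")

# Old approach: JSON kept as a STRING column; every access parses the whole document.
o.execute_sql("""
    SELECT GET_JSON_OBJECT(payload_str, '$.user.id')
    FROM events_str
""")

# New approach: a native JSON column; the engine stores fields columnar-compressed
# and only needs to read the accessed path.
o.execute_sql("CREATE TABLE IF NOT EXISTS events (payload JSON)")
o.execute_sql("SELECT JSON_EXTRACT(payload, '$.user.id') FROM events")
```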

Light operation and maintenance capabilities: Serverless changes the big data operation and maintenance model

A cloud big data platform should make operation and maintenance simple and easy, taking the dirty work off users' hands and upgrading the role of big data engineers. Instead of passively worrying every day about the stability and scalability of the platform, about resource allocation, backup, disaster recovery, upgrades, and bug fixes, engineers are freed to become data analysts, AI experts, and domain experts, rather than doing repetitive O&M work.

We believe a serverless architecture is the key to solving the O&M problem, so how is a serverless architecture implemented? From the perspective of big data architecture, we usually distinguish three types: 1. Shared-Nothing, where storage and compute are integrated on each node, and computing power and storage capacity grow by scaling nodes horizontally. 2. Shared-Everything, where compute and storage are fully decoupled and all resources can be shared. 3. Shared-Data, where the data is shared while compute is isolated, providing better isolation. Each technology chooses a different architecture.

MaxCompute chooses Shared-Everything, which demands strong isolation technology on the platform side and places even higher requirements on the O&M and scheduling sides: all compute and storage resources are shared in one unified public cluster. Hologres chooses the Shared-Data architecture, since a system serving online scenarios must care more about resource isolation and stability. Different systems thus choose different architectures.

Behind this architecture, we manage the entire cluster as one unified pool of computing resources. The greatest value for users is not only lower usage cost and no need for up-front capacity planning, but, more importantly, no complex upgrade O&M: version iteration happens with zero downtime. These are all values created by the serverless architecture. The platform side aims to absorb the dirty work, upgrades, backups, disaster recovery, and elasticity, through the architecture itself. This is the core idea behind Serverless.

In the past, people mostly said that Serverless saves money because you only pay for the resources you use. I believe Serverless is more about changing the O&M model and letting engineers focus on value creation.

Hologres keeps evolving on the serverless architecture, and this year proposed the concept of elastic compute groups. Behind a compute group sit shared data and a shared access layer, while resources are partitioned at the compute-node level. When different business teams use the same data, each team can flexibly allocate resources for its own usage scenarios, while data consistency is preserved and real-time writes and real-time queries are still supported. This is an innovation made in Hologres.

Openness capability: lake-warehouse integration and open interfaces

When people talk about the openness of a big data platform, they usually mean Open Storage + Open Format. Today, Alibaba's big data platform hopes to go one level further. Cloud computing raises the bar for technical openness. On one hand, cloud vendors should not lock users in; MaxCompute does not want users to feel held hostage by the platform, unable to switch after adopting it. On the other hand, the intensity and density of interaction between different technologies on a cloud platform are far greater than offline: technologies need to be deployed and connected in minutes, and users have high expectations for interoperability. We want to make openness very thorough; we do not want to keep innovation only in our own hands, we want to return innovation to users.

First of all, Alibaba Cloud's big data stack fully embraces Open Storage + Open Format, providing an integrated lake-warehouse solution with a near-native metadata management and data read/write experience. There are two schools of thought in the industry on what a lakehouse is. One grows a warehouse on the lake, turning the lake into a warehouse; its typical trait is that the data structures on the lake gain better update capabilities, approaching the development experience of a database. The other extends warehouse management capabilities outward, managing the semi-structured and unstructured data on the lake through metadata, which amounts to the warehouse managing the lake. MaxCompute takes the second form, using the warehouse to manage the lake: the Hudi and Delta Lake formats stored on OSS, as well as Alibaba's own Paimon format introduced this year, can be accessed directly as tables in MaxCompute and Hologres. We have also made an innovation: unstructured files on OSS can be defined as abstract directory tables, so that fine-grained security controls can be authorized within the data warehouse, governing which users can access which files and how, with audits recorded.
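A minimal sketch of "the warehouse managing the lake" is shown below: files on OSS are exposed as an external table in MaxCompute while the data stays on the lake in an open format. The Parquet external-table form follows documented-style syntax, but treat the exact location and access configuration as assumptions for your own setup; all names are hypothetical.

```python
from odps import ODPS

o = ODPS("<access_id>", "<secret>", project="my_project", endpoint="<endpoint>")

o.execute_sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS lake_events (
        user_id BIGINT,
        event   STRING,
        ts      DATETIME
    )
    STORED AS PARQUET
    LOCATION 'oss://<bucket>/events/'   -- data stays on the lake, open format
""")

# Query it like any warehouse table; the metadata lives in the warehouse catalog.
inst = o.execute_sql("SELECT event, COUNT(*) FROM lake_events GROUP BY event")
with inst.open_reader() as reader:
    for row in reader:
        print(row)
```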

The key to lake-warehouse integration is metadata management. Whether the data sits in the warehouse or on the lake, there must be one unified view of all the metadata: who defined the data, and how to parse it. That is the core concept of the lakehouse; it does not necessarily mean one system or two.

MaxCompute has made a big change in openness this year. The traditional notion of a warehouse is that data and computation all live inside it; today we want to offer MaxCompute storage as an independent product form serving the outside world, productizing the storage layer with a Storage API and a high-throughput, high-performance native IO interface. Whether you use the PAI platform for machine learning, or Spark or Presto, you can access the data in the warehouse just like MaxCompute's native SQL engine does. We hope to open up the data of the self-developed big data platform and support users in continuing to innovate with third-party engines.
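As a stand-in illustration of programmatic access to warehouse data from outside the SQL engine, the sketch below uses PyODPS's table reader. Note this is not the Storage API itself, which is a lower-level interface aimed at engines like Spark and Presto; the table name is hypothetical.

```python
from odps import ODPS

o = ODPS("<access_id>", "<secret>", project="my_project", endpoint="<endpoint>")

table = o.get_table("dwd_orders")
with table.open_reader() as reader:   # streams records without issuing SQL
    print("rows:", reader.count)
    for record in reader[0:5]:        # read the first five records
        print(record)
```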

Intelligent optimization capabilities: AI-powered smart data warehouse

In the past, optimization relied heavily on DBAs' understanding of the technical internals of a particular data warehouse. In the cloud era, users host their data on cloud platforms, and cloud platforms carry a great responsibility to help users optimize. We hope to move from the experience-based operations of the past to intelligent operations.

For example, MaxCompute recommends materialized views for common SQL sub-computations, so that results are reused, a very effective way of trading space for time. After more than a year of iteration, recommendation efficiency has improved greatly; most of the recommended materialized views are of high quality, saving cost and improving efficiency.
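Here is a minimal sketch of the space-for-time trade: materialize a sub-computation shared by many queries so it is computed once and reused. The table and view names are hypothetical; per the talk, MaxCompute can recommend such views automatically rather than requiring you to spot them by hand.

```python
from odps import ODPS

o = ODPS("<access_id>", "<secret>", project="my_project", endpoint="<endpoint>")

# Many dashboards aggregate the same raw events; compute the aggregate once.
o.execute_sql("""
    CREATE MATERIALIZED VIEW IF NOT EXISTS mv_daily_uv AS
    SELECT ds, COUNT(DISTINCT user_id) AS uv
    FROM ods_events
    GROUP BY ds
""")

# Later queries against ods_events with a matching shape can be rewritten
# by the optimizer to read mv_daily_uv instead of rescanning the raw data.
```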

Big data becomes the infrastructure of AI

AI is very hot this year, with many great innovations, but big data in fact plays a key infrastructure role in AI innovation. At the same time, we hope that users of the cloud big data platform no longer need to do inefficient, heavy O&M work and can instead focus on innovating in AI scenarios and applications. We have also proposed the integration of big data and AI. In fact, big data and AI have a division of labor: big data provides data support for AI, which means the big data platform must handle large-scale data, provide a distributed computing framework, and offer a one-stop development environment for scientific computing. In turn, the machine learning platform provides optimized algorithms and optimized models back to the big data platform.

Having been SQL-based in the past, we believe Python should also become a first-class development language of the MaxCompute platform. MaxCompute newly released One Env + One Data + One Code. At its core, this provides a Python runtime environment and a notebook-style interactive development experience, so that both users with a SQL background and users with Python experience who need Python libraries can develop and debug efficiently in one unified environment for data processing scenarios, with Python and MaxCompute data natively connected.

We have comprehensively upgraded the DataFrame capability and released MaxFrame, a distributed computing framework that is 100% compatible with data processing interfaces such as Pandas. With one line of code, native Pandas code can be automatically converted into MaxFrame distributed computation, connecting data management, large-scale data analysis and processing, and ML development in one flow. This breaks the boundary between big data and AI development and use, greatly improving development efficiency.
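A hedged sketch of what Pandas-style code running distributed via MaxFrame might look like follows. The module and function names (maxframe.dataframe, new_session, read_odps_table) and the execute/fetch pattern are assumptions based on MaxFrame's Pandas-compatible design; verify them against the MaxFrame documentation.

```python
from odps import ODPS
from maxframe import new_session          # assumed MaxFrame session API
import maxframe.dataframe as md           # assumed Pandas-compatible module

o = ODPS("<access_id>", "<secret>", project="my_project", endpoint="<endpoint>")
session = new_session(o)                  # computation is shipped to MaxCompute

df = md.read_odps_table("ods_events")     # looks like a Pandas DataFrame...
daily_uv = (df.groupby("ds")["user_id"]   # ...and familiar Pandas idioms apply
              .nunique()
              .reset_index(name="uv"))

print(daily_uv.execute().fetch())         # triggers distributed execution
```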

Finally, let's talk about the vector database. Hologres has built in Proxima, the DAMO Academy vector engine, supporting high-performance, real-time vector retrieval. Vector data can be accessed through the SQL interface, helping everyone bring AI scenarios into the familiar interactive analysis scenario.
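Since Hologres speaks the PostgreSQL protocol, vector retrieval over SQL might look like the sketch below. The Proxima distance function name is an assumption from memory, and the table, columns, and connection details are hypothetical; check the Hologres documentation for the exact function and index settings before relying on this.

```python
import psycopg2

conn = psycopg2.connect(host="<hologres-endpoint>", port=80,
                        dbname="my_db", user="<access_id>", password="<secret>")

query_vec = [0.12, 0.08, 0.33, 0.91]   # e.g. an embedding from your model

with conn.cursor() as cur:
    cur.execute(
        """
        SELECT id, title
        FROM   docs
        ORDER  BY pm_approx_squared_euclidean_distance(embedding, %s::float4[])
        LIMIT  5
        """,
        (query_vec,),
    )
    for row in cur.fetchall():          # top-5 approximate nearest neighbors
        print(row)
```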
