[Yunqi 2023] Lin Wei: Interpreting Big Data and AI Integration

This article is compiled from the transcript of a speech at the 2023 Yunqi Conference. Speech details follow:

Speaker: Lin Wei | Alibaba Cloud researcher, chief architect of the Alibaba Cloud Computing Platform Division, head of the Alibaba Cloud artificial intelligence platform PAI and the big data development and governance platform DataWorks

Topic: Interpreting big data and AI integration

This year has been a breakout year for AI. The birth of large language models has set off a wave of large-model enthusiasm sweeping the entire industry, and many believe the "iPhone moment of AI" has arrived. Training large models is actually not simple: growth in the number of model parameters means better computing power and more data are needed for training, along with the right tools so that developers can iterate models quickly. Only then can model accuracy improve faster. Over the past few years, Alibaba Cloud has been promoting AI engineering and scaling, which is in fact the main driver behind this round of the AI explosion.

Consider a typical model development process: data preprocessing, model training, and model deployment. We tend to focus so heavily on training that we neglect the rest of the production pipeline. But to train a good model, data matters more and more: data collection, data cleaning, feature extraction, and data management, and then, in the training process, deciding which data is distributed for training and which is used to evaluate model quality. All data needs a verification step to check its quality. This step is critical; the harm low-quality data does to a model is beyond imagination. This is why Andrew Ng has been promoting the idea that better machine learning is 80% data processing plus 20% modeling.
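As a concrete illustration of that verification step, here is a minimal sketch of pre-training data quality checks; the specific rules and thresholds are illustrative assumptions, not a production rule set.

```python
def check_record(record: dict) -> list[str]:
    """Return a list of quality problems found in one training record."""
    problems = []
    text = record.get("text", "")
    if not text.strip():
        problems.append("empty text")
    elif len(text) < 20:
        problems.append("too short to be informative")
    if "\ufffd" in text:
        problems.append("encoding damage (replacement characters)")
    return problems

def split_by_quality(records: list[dict]) -> tuple[list[dict], list[dict]]:
    """Route clean records to training and flagged ones to review."""
    clean, flagged = [], []
    for r in records:
        (flagged if check_record(r) else clean).append(r)
    return clean, flagged
```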

The evolution from "model-centric" to "data-centric" model development

This also reflects how model development itself has evolved. In the past, we often talked about model-centric development: algorithm engineers spent a lot of time adjusting model structures, hoping that changes to the architecture would improve generalization and handle various noise problems. If you look at papers from five years ago, you will find that a large share of the research revolved around model structures. The data and computing power of that time were not enough to support today's era of large models. Training back then was mostly supervised learning on fully labeled data, and that data was very expensive, which left little room to maneuver on the data side during training; we mostly considered changes to the model structure.

Today's large model training involves a great deal of unsupervised learning. The model structure, by contrast, changes much less; everyone seems to be converging on the Transformer architecture. Against this backdrop, we have gradually evolved into a data-centric model development paradigm. What is this paradigm? It uses massive amounts of data for unsupervised training, relying on large computing power, a large data engine, and a relatively fixed model structure to extract interesting, intelligent behavior.

As a result, the amount of training data skyrockets, and all kinds of methods are needed to clean and evaluate it. We can see many large-model research teams spending enormous energy on data processing, verifying data quality repeatedly and from multiple angles across various environments. Sometimes models are even trained specifically for evaluation, with model results feeding back on data quality. Along the way, a substantial set of data processing tools must be accumulated; only then can the data quality requirements of data-centric development be effectively supported. This is the core idea behind the data-centric model development paradigm everyone talks about.

It is under this trend that we have always believed big data and AI are two sides of the same coin, and that integrating them is necessary to keep pace with the evolution of the model development paradigm.

At Alibaba Cloud, we have been working hard to closely integrate the data and AI systems. At the computing infrastructure layer, we provide computing clusters suited to various scenarios, including CPU-based clusters for big data and heterogeneous computing clusters with the RDMA networks required for large model training. On top of this, we have built an integrated big data and AI platform that covers the entire model development process, starting with data collection and integration. The big data platform then performs large-scale offline analysis to verify data quality, and streaming computing capabilities are available as well. Once the data has been processed on the big data platform, it is "fed" to PAI, the platform responsible for AI development, for training and iteration. Finally, for incubating model applications, we rely on databases with vector engines, such as Hologres, to build scenario-specific applications.

Application scenarios of big data and AI integration

Before getting into the technical side of big data and AI integration, let's look at two application examples.

The first example is a large model question answering system enhanced by knowledge base retrieval. Many recent discussions of large models mention this scenario: with a large model, you can build a vertical knowledge base for a specific industry. How is this done? First, the data in the knowledge base is cleaned and split into fragments, each fragment is converted into a vector by a large model, and these vectors are stored in a data system built for vector retrieval. When a real request comes in, the system first finds the vectors matching the query, translates them back into knowledge, and then uses that knowledge to constrain the large model, reining in its impulse to "talk nonsense" so that its answers are more accurate.
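The flow just described can be sketched in a few lines. This is a minimal illustration under assumptions: embed() and generate() stand in for real embedding-model and large-model calls, and the brute-force cosine search stands in for a real vector retrieval system.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Placeholder: call a real embedding model here."""
    raise NotImplementedError

def generate(prompt: str) -> str:
    """Placeholder: call a real large model here."""
    raise NotImplementedError

class VectorStore:
    """Brute-force stand-in for a real vector retrieval system."""
    def __init__(self):
        self.vectors, self.chunks = [], []

    def add(self, chunk: str):
        self.vectors.append(embed(chunk))
        self.chunks.append(chunk)

    def search(self, query: str, k: int = 3) -> list[str]:
        q = embed(query)
        sims = [v @ q / (np.linalg.norm(v) * np.linalg.norm(q)) for v in self.vectors]
        top = np.argsort(sims)[-k:][::-1]
        return [self.chunks[i] for i in top]

def answer(store: VectorStore, question: str) -> str:
    # retrieve knowledge fragments, then use them to constrain the model
    context = "\n".join(store.search(question))
    prompt = ("Answer using only the context below.\n"
              f"Context:\n{context}\n\nQuestion: {question}")
    return generate(prompt)
```

In production the vector store would be a dedicated service (the talk mentions Hologres), but the chunk-embed-retrieve-constrain shape stays the same.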

This scenario exercises many capabilities. It uses large-scale distributed batch processing, because creating embeddings involves a very large amount of data. It also uses serving capabilities such as a vector database, since real business scenarios are very sensitive to query latency and the vectors must be returned very quickly. And of course it uses the ability to train large models, which requires a good AI system.

The second example is a personalized recommendation system. When making real-time recommendations, the interests of everyone being recommended to change dynamically, so the model of such a system is being updated all the time and must incorporate the latest behavioral data. We typically process the collected logs both offline and in real time: the offline data is used to produce a better base model, while features are also extracted from the real-time data; after model training, a delta of the model is generated, and this delta is applied to the online system, which is updated daily. Here you can see many data systems at work: real-time stream computing systems, AI systems, and batch processing systems.
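The base-plus-delta pattern can be sketched as follows; this is a minimal illustration with made-up training logic, only meant to show how an offline base model and a streaming delta combine at serving time.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Model:
    weights: np.ndarray

def train_base(offline_logs: np.ndarray) -> Model:
    # stand-in for a full offline training job over historical logs
    return Model(weights=offline_logs.mean(axis=0))

def train_delta(base: Model, recent_logs: np.ndarray, lr: float = 0.1) -> np.ndarray:
    # stand-in for incremental training on fresh behavioral data
    return lr * (recent_logs.mean(axis=0) - base.weights)

def apply_delta(base: Model, delta: np.ndarray) -> Model:
    # the online system patches the serving model without a full retrain
    return Model(weights=base.weights + delta)

base = train_base(np.random.rand(10_000, 32))       # batch path: historical logs
delta = train_delta(base, np.random.rand(100, 32))  # stream path: fresh logs
serving = apply_delta(base, delta)                  # online update
```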

Technical implementation of big data and AI integration

Unified data and AI workspace management

First, we connect the AI and big data processes at the outermost layer. This was the original idea behind building workspaces in the PAI product: unified management of multiple resources on one development platform. The Alibaba Cloud artificial intelligence platform PAI now supports a variety of computing resources, including ECS resources, stream computing platforms, PAI Lingjun intelligent computing clusters for large model training, and the container computing service ACS released at this Yunqi Conference.

Simply having access to these resources is not enough; users need to connect them organically. So we launched a Flow framework to link these processes, connecting the individual steps of model training and data processing. It provides multiple ways to build connections, including static composition, an SDK, and graphical interaction, so users can construct complex big data and AI workflow graphs.
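The PAI Flow SDK itself is not reproduced here; the following is a generic sketch of the underlying idea, declaring processing and training steps as nodes of a DAG and running them in dependency order, with all names hypothetical.

```python
from collections import deque

class Flow:
    """Hypothetical workflow builder: steps form a DAG and run in order."""
    def __init__(self):
        self.steps, self.deps = {}, {}

    def step(self, name, fn, after=()):
        self.steps[name], self.deps[name] = fn, list(after)
        return self

    def run(self):
        # topological execution: a step runs once all its dependencies are done
        indegree = {n: len(d) for n, d in self.deps.items()}
        ready = deque(n for n, c in indegree.items() if c == 0)
        while ready:
            name = ready.popleft()
            self.steps[name]()
            for n, d in self.deps.items():
                if name in d:
                    indegree[n] -= 1
                    if indegree[n] == 0:
                        ready.append(n)

(Flow()
 .step("clean", lambda: print("clean data"))
 .step("train", lambda: print("train model"), after=["clean"])
 .step("deploy", lambda: print("deploy model"), after=["train"])
 .run())
```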

Serverless cloud native services

To push the integration of big data and AI further, users want big data and AI services delivered on a single pool of resources, and this is where serverless cloud-native service technology becomes indispensable. We have been talking about cloud native for a long time, but cloud native has many dimensions. At its heart it is about shared resources, yet which resources are shared still needs to be defined.

This definition has many levels. You can share at the hardware level, in which case what you share are servers and virtual servers; you can also share higher-level virtual resources, such as containers and the services themselves. The higher the sharing level, the lower the unit computing cost, and of course the higher the technical complexity. This is why cloud computing teams keep improving the cloud-nativeness of their services: by mastering greater technical complexity, they can provide higher-level sharing of computing resources more economically, and thus offer big data and AI services more cost-effectively.

It is for this reason that all our big data products sit at the sixth level, Share Everything, while being built on the fifth level, shared containers, which is the container computing service layer. This lets us organically connect the big data and AI systems on a single pool of resources.

Unified scheduling: multi-workload, differentiated SLO-enhanced scheduling

Achieving this capability is not easy, because container computing services were originally created to support microservices, and the scheduling pressure of microservices is very different from big data and AI computing scenarios. To let different big data and AI tasks and services run on one resource pool, we had to do a great deal of work. For example, big data scenarios contain many high-concurrency, short-lived tasks, which requires greatly enhancing the throughput of K8s itself and solving its performance problems at every level, including latency and scale.

At the same time, our tasks are diverse, covering not only online services but also computing tasks, so we must enrich the scheduler's resource types and multi-scenario capabilities. For example, complex AI scenarios require network topology awareness, because large-model training places very high demands on the network. How do we perceive that topology at the container and computing service layer and schedule effectively? How do we land big data and AI workloads on these resources? This requires a great deal of workload-aware and QoS-aware scheduling.

Multi-tenant security isolation

For cloud services, the most important thing is secure multi-tenant isolation. We need to strengthen cloud-native K8s in this direction so that big data and AI can safely share the same resources. We apply many security isolation technologies at the storage layer and the network layer. In this way, multiple big data and AI products, and even users' own online services, can be consolidated into one resource pool for enterprises to use on the cloud.

Container Computing Service ACS

At this Yunqi Conference, the container computing service ACS was released, and PAI is among the first products it supports. On the ACS platform, users can flexibly allocate their resource ratio between big data and AI, and then connect the two more naturally on a unified resource base, network, and storage I/O.

Multi-level Quota

We all know that computing resources for large models are very expensive, so we will keep strengthening fine-grained resource management on this base. We will soon release a multi-level Quota capability, which lets cluster administrators manage resources better while each team manages its own. At critical moments, for example during a training sprint, the administrator can pool all the resources to train larger models. This is our multi-level Quota.
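Under assumed semantics (the exact product behavior is not spelled out in the talk), multi-level quota can be pictured like this: each team holds its own quota, and an administrator can temporarily fold idle child capacity up into the parent for a sprint.

```python
class QuotaNode:
    """Hypothetical multi-level quota: teams own leaves, admins own parents."""
    def __init__(self, name, limit_gpus, children=()):
        self.name, self.limit, self.used = name, limit_gpus, 0
        self.children = list(children)

    def allocate(self, gpus: int) -> bool:
        if self.used + gpus <= self.limit:
            self.used += gpus
            return True
        return False

    def pool_children(self):
        # sprint mode: fold unused child capacity up into this node
        for child in self.children:
            spare = child.limit - child.used
            child.limit -= spare
            self.limit += spare

cluster = QuotaNode("cluster", 0, [QuotaNode("team-a", 64), QuotaNode("team-b", 64)])
cluster.pool_children()        # administrator pools idle team capacity
assert cluster.allocate(100)   # a large training job can now use pooled GPUs
```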

Automatic topology-aware scheduling

For training very large models, we need to strengthen the scheduling capabilities of container services. To give an example, model training often includes a step called All-Reduce. Without scheduling control, a slightly disordered placement of the reduce ring produces cross-switch traffic. Comparing topology-aware scheduling against topology-unaware scheduling, we see performance improvements of 30-40%, which is very impressive.
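A small sketch shows why ring order matters; the topology and hop counting below are simplified assumptions, but they capture how grouping workers by switch keeps most ring hops switch-local.

```python
def cross_switch_hops(ring, switch_of):
    """Count ring hops that cross between switches."""
    n = len(ring)
    return sum(switch_of[ring[i]] != switch_of[ring[(i + 1) % n]] for i in range(n))

switch_of = {f"w{i}": f"sw{i // 4}" for i in range(8)}      # 4 workers per switch
naive_ring = ["w0", "w4", "w1", "w5", "w2", "w6", "w3", "w7"]
aware_ring = sorted(switch_of, key=lambda w: switch_of[w])  # group by switch

print(cross_switch_hops(naive_ring, switch_of))  # 8: every hop crosses switches
print(cross_switch_hops(aware_ring, switch_of))  # 2: only the group boundaries
```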

MaxCompute 4.0 Data+AI

Large model training often requires massive amounts of data. As we said before, we not only need to store the data; more importantly, we need batch processing to clean it, evaluate its quality repeatedly, and adjust it based on feedback. This calls for a big data platform with lake-warehouse (lakehouse) integration. Alibaba Cloud's data warehouse product MaxCompute has launched MaxFrame and an open data format, which organically and openly connect its powerful data management and computing capabilities with AI systems. In addition, there is Flink + Paimon: in stream computing scenarios, stream computing and online machine learning can be combined to open up the path from data to training.

Dataset acceleration: DatasetAcc

In the AI computing scenarios of PAI Lingjun clusters, there are not only high-density machine learning tasks but also data processing tasks. High-density computing resources are very precious, so one answer is to connect to a remote big data warehouse. But this creates another tension: remote data I/O cannot keep up with high-density computing. To solve this, we provide a dataset acceleration capability, DatasetAcc, which uses the local SSDs and local storage of the PAI Lingjun cluster as a near-end cache and asynchronously pulls data from the remote data warehouse to the near end. This nicely solves the problem of combining big data with AI computing clusters in training scenarios and improves training efficiency.
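The core mechanism, asynchronous prefetch into a bounded near-end cache, can be sketched as follows; fetch_remote_batch() is a stand-in for the real remote warehouse read.

```python
import queue, threading

def fetch_remote_batch(i: int) -> str:
    """Placeholder for a slow read from the remote data warehouse."""
    return f"batch-{i}"

def prefetcher(cache: queue.Queue, num_batches: int):
    for i in range(num_batches):
        cache.put(fetch_remote_batch(i))  # blocks when the near-end cache is full
    cache.put(None)                       # sentinel: no more data

def train(num_batches: int = 100, cache_size: int = 8):
    cache = queue.Queue(maxsize=cache_size)
    threading.Thread(target=prefetcher, args=(cache, num_batches), daemon=True).start()
    while (batch := cache.get()) is not None:
        pass  # run one training step on `batch` here

train()
```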

It is precisely because big data and AI computing clusters are effectively connected that we can make better use of big data analysis during large-scale LLM training. For example, while training Tongyi Qianwen, we found a large amount of repeated text. Removing duplicates is a very critical step; otherwise the entire training set is biased by the repeated data, which leads to overfitting. Using the FlinkML library we built, we implemented an efficient text deduplication algorithm, so algorithm engineers can run text deduplication quickly and repeatedly, improving the efficiency of the whole model development.
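The FlinkML implementation is not shown in the talk; below is a sketch of one common approach to text deduplication at scale, MinHash signatures, where documents with identical signatures are likely near-duplicates (production systems usually add LSH banding on top of this to catch fuzzier matches).

```python
import hashlib
from collections import defaultdict

def minhash_signature(text: str, num_hashes: int = 16, shingle: int = 5) -> tuple:
    """Signature that collides, with high probability, for near-duplicate texts."""
    shingles = {text[i:i + shingle] for i in range(max(1, len(text) - shingle + 1))}
    return tuple(
        min(int(hashlib.md5(f"{seed}:{s}".encode()).hexdigest(), 16) for s in shingles)
        for seed in range(num_hashes))

def near_duplicate_groups(docs: list[str]) -> list[list[str]]:
    groups = defaultdict(list)
    for doc in docs:
        groups[minhash_signature(doc)].append(doc)
    return [g for g in groups.values() if len(g) > 1]

docs = [
    "the quick brown fox jumps over the lazy dog",
    "the quick brown fox jumps over the lazy dog",   # exact duplicate
    "an entirely different sentence about training data",
]
print(near_duplicate_groups(docs))  # the two identical texts group together
```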

What we discussed so far is how big data helps AI training, which is what we often hear called Data for AI. But in the opposite direction, advances in AI technology can also help data systems improve their service quality and efficiency; data analysis today has moved from BI to BI + AI.

DataWorks Copilot

In the past, data analysis was mostly business intelligence, but today more AI technologies can help improve data analysis capabilities, and we have done some work in this area. For example, on the data development and governance platform DataWorks we launched DataWorks Copilot, a code assistant. It helps users find tables of interest using natural language, then helps them build SQL queries, and finally executes those queries.

Of course, a truly useful code assistant needs more than the base model. The DataWorks platform draws on a large number of public queries written in our own languages, MaxCompute SQL and Flink SQL, as a dataset; we fine-tune the base model on this dataset to produce a vertical model, and then run inference on that vertical model, yielding a much more effective code assistance tool for this specific scenario. In this way we were able to improve code development efficiency by 30%.
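At a high level (the product internals are not described in the talk), such a copilot can follow a retrieve-then-generate pattern; in this sketch, find_tables() and complete() are hypothetical stand-ins for schema search and the vertical model's inference endpoint.

```python
def find_tables(question: str) -> list[str]:
    """Placeholder: search table metadata for names relevant to the question."""
    raise NotImplementedError

def complete(prompt: str) -> str:
    """Placeholder: call the fine-tuned (vertical) code model."""
    raise NotImplementedError

def draft_sql(question: str) -> str:
    tables = find_tables(question)
    prompt = ("You translate questions into MaxCompute SQL.\n"
              f"Available tables: {', '.join(tables)}\n"
              f"Question: {question}\nSQL:")
    return complete(prompt)
```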

DataWorks AI-enhanced analytics

Beyond assisting code generation, this year we also released the DataWorks data insights feature. It uses AI methods and AI capabilities to automatically produce intelligent insights from existing data, helping users grasp the characteristics of their data faster and accelerating their understanding and analysis of it.

The technical points and cases shared above are meant to illustrate how the integration of AI and big data is evolving today. We firmly believe that big data and AI complement each other, and we hope to help data intelligence land and deliver value faster.

Source: my.oschina.net/u/5583868/blog/10141632