Talk Transcript | Application of Alluxio in AI/ML Scenarios

Welcome to [Weibo Live Room]: a 2-minute overview of the speaker's key points

This sharing mainly includes five aspects:

  • About Alluxio;
  • Take stock of the challenges companies face when adopting AI;
  • The position of Alluxio in the technology stack;
  • Applications of Alluxio in model training & model deployment scenarios;
  • Effect comparison: before using Alluxio vs. after using Alluxio.

1. About Alluxio

Alluxio is a data orchestration platform and a high-performance data access layer.

2. Take stock of the challenges companies face when adopting AI

1. GPU shortage;

2. Slow model deployment;

3. Low GPU utilization.

3. The position of Alluxio in the technology stack

√ Alluxio is not a persistent storage layer; persistent storage relies on distributed systems such as cloud object storage (S3), Ceph, or HDFS;

√ Alluxio is a high-performance data access layer for AI;

√ Alluxio includes many optimizations for the I/O performance of PyTorch and TensorFlow;

√ Further up sits the AI/ML orchestration layer, such as Ray or MLflow.

4. Application of Alluxio in Model Training & Model Deployment Scenarios

√ Start GPU clusters wherever they are needed;

√ Build AI/ML on the existing data lake;

√ Eliminate data copying and reduce cost/complexity;

√ Achieve faster model deployment and launch.

5. Effect comparison: before using Alluxio vs. after using Alluxio

√ Before: data loading takes more than 80% of the time, and GPU utilization is below 20%;

√ After: time spent in data loading drops from 82% to 1%, and GPU utilization rises from 17% to 93%.

The above is only an overview of the talk; click the video to watch the full content:

Attachment: the full text version of the talk can be found below


1. About Alluxio

Model training is becoming more and more popular, so riding that wave we will share how Alluxio is applied in AI/ML scenarios. I believe most of you already have a good understanding of Alluxio, Spark, and the surrounding ecosystem, but I still want to introduce it in some detail. Alluxio provides a virtual layer, a data orchestration layer: it not only offers a higher-performance data access layer, but also contains many optimizations upstream and downstream of the big data frameworks, covering the path from storage up to the compute engines, data access performance, and ease of use.

Alluxio is a data orchestration platform and a high-performance data access layer.

Project origin:

Alluxio (formerly known as Tachyon) began as a sister project of Apache Spark in the UC Berkeley AMP Lab, researching how to use distributed technology to manage external memory in a unified way and provide memory-speed data access acceleration for Apache Spark applications. The project was led by Li Haoyuan (then a doctoral candidate in the AMP Lab), with participation from other faculty and students in the same lab.

Alluxio initially focused on big data and is very tightly integrated with compute engines such as Spark and Presto. From 2020 onward, we have seen many problems in AI scenarios that the current system frameworks either cannot solve or can only solve at a relatively high cost. So while continuing to work on the big data stack, we also began exploring AI scenarios, and today we have a fairly productized solution that everyone can use. In this talk we will give a systematic overview based on the challenges that companies at home and abroad have encountered in AI scenarios:

2. Take stock of the challenges companies face when adopting AI

1. GPU shortage

In fact, over the past few years we have found that whether companies use GPUs in the cloud or buy GPUs to build their own IDCs (data centers), AI infrastructure is hard to get right. The reasons roughly fall into three situations:

1. Many companies cannot buy GPUs;

2. Even if some companies buy GPUs, the quantity is not very large, and it is difficult to meet business needs;

3. Some companies may be able to buy GPUs on Alibaba Cloud or Tencent Cloud, but how to form these GPUs into a systematic computing pool for upper-level business use is relatively difficult.

2. Slow model deployment

Companies' existing data warehouse/storage solutions are relatively old and hard to iterate on. After GPU training, getting the model onto the inference cluster is an indispensable step, and also a difficult one:

1) Many companies' data warehouses and underlying storage are still relatively traditional solutions such as HDFS, sometimes deployed more than ten years ago, and it is hard to change the storage setup now;

2) The data is in the cloud, where request throttling is severe and there are many usage restrictions.

We will talk in depth later on how to solve this problem.

3. Low GPU utilization

Today, GPU utilization during model training is generally quite low at many companies. This is of course not a problem Alluxio alone can solve; what we see is that most enterprise data sits in the data warehouse, and getting that data into the GPU cluster involves many challenges. Later we will also share how Alluxio solves this problem across different cloud vendors and large enterprises at home and abroad.

The points above are mostly business pressures, or business decision-making pressures, and they largely turn into technical pressures on engineers. To develop models faster, teams have several expectations:

1) Faster model development time;

2) More frequent model data updates;

3) Higher accuracy and traceability;

4) Adapt to rapidly growing datasets.

These pressures reflected on the technical side can be summarized into three points:

1. Extensive data copy task management

For example, with today's applications, building such a system often requires complex data copy jobs: copying data from the data warehouse to the GPU training cluster, whether to a local NAS or NFS system or to local disks, makes data management considerably more complicated.

2. Dedicated storage

To meet the needs of AI scenarios, the performance requirements are relatively high. It helps to remember that 20-30 years ago GPUs grew up alongside HPC (high-performance computing), so at that time people generally tended to have an IB (InfiniBand) network plus a set of high-performance all-flash storage to support the business. In the cloud or in an IDC, however, we find this very hard to reproduce, because most companies and cloud facilities have no way to provide such a high level of dedicated storage to support model training or model distribution tasks.

3. Cloud and infrastructure costs are out of control

After the model goes online, cloud and infrastructure costs can easily get out of control as the business scales. We have seen many scenarios, such as cloud costs growing fivefold in three years, and that is not unusual.

3. The position of Alluxio in the technology stack

Before entering the detailed technical discussion, let's systematically introduce the position of Alluxio in the AI/ML technology stack.

First of all, Alluxio is not a persistent storage layer. Persistent storage relies on distributed storage such as cloud object storage (S3), Ceph, or HDFS. These all sit underneath Alluxio as its interfaces; the persistent storage layer is a different concept from Alluxio.

Moving up, Alluxio is a high-performance access layer for AI. For the persistent storage layer, most companies pursue price and capacity efficiency: they want a very cheap storage pool that can hold a huge amount of data, not a very expensive high-performance system for persistence. The reason is that at many Internet companies and traditional enterprises we see total data volumes of hundreds of PB or even EB, while the training data is much smaller, perhaps tens of TB or a little over 1 PB. Putting all of that data into high-performance storage just to serve the layers above would be very poor value for users, so we simply dock onto the existing persistent storage layer: since it is already there, we can connect to the data directly without changing its architecture.

Going up, we have made many optimizations for the I/O performance of PyTorch and TensorFlow, including caching strategy, scheduling optimization and how to connect with them, and Kubernetes deployment. We will describe the integration in detail later.
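To make the integration concrete, here is a minimal sketch (not code from the talk) of a PyTorch dataset that reads training files through a POSIX path assumed to be an Alluxio FUSE mount; the mount point `/mnt/alluxio` and the dataset layout are hypothetical placeholders.

```python
# Minimal sketch, assuming data is exposed through an Alluxio FUSE mount at
# /mnt/alluxio (hypothetical path); the training code is plain PyTorch and does
# not change when the under-store is S3, HDFS, or Ceph.
import os
from PIL import Image
from torch.utils.data import Dataset

class AlluxioBackedImageDataset(Dataset):
    def __init__(self, root="/mnt/alluxio/training/images", transform=None):
        # Listings and reads go through the Alluxio cache; hot files are served
        # from cache instead of being fetched from remote storage every epoch.
        self.paths = sorted(
            os.path.join(root, f) for f in os.listdir(root)
            if os.path.isfile(os.path.join(root, f))
        )
        self.transform = transform

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        img = Image.open(self.paths[idx]).convert("RGB")
        return self.transform(img) if self.transform else img
```

Because the mount behaves like a local file system, the same pattern applies to a tf.data pipeline or to any framework that reads files by path.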

Further up sits the AI/ML orchestration layer, such as Ray or MLflow.

This is a fairly clear diagram. Alluxio is a company that grew out of big data scenarios, and we have been working on AI for about 4-5 years. Over those years, the value we have seen Alluxio deliver in customer/user environments can be summarized in four points:

1. Higher-performance, scalable AI/ML pipelines

We do not change the existing cluster deployment, such as the existing object storage or HDFS, while still letting the business expand. There are two key points here:

√ Although the big data and AI teams usually sit under the same overall architecture, their technology stacks are quite different. The big data stack has Spark, Trino, Hive, HBase, and so on, docking downstream onto HDFS or cloud object storage, and that data is always there, possibly hundreds of PB or even EB in volume. At the same time an AI Infra platform has to be built; the AI stack is essentially PyTorch and TensorFlow, mostly docking onto object storage such as Ceph or MinIO, plus some dedicated storage such as NFS or NAS systems serving the layers above;

√ The coexistence of these two systems creates a docking problem: the data lives in the data warehouse, but the processing happens on the AI Infra side, which makes for a very complicated system. Alluxio helps bridge the two, so that a very complicated data migration is not needed every time.

2. Obtain timely and accurate model data at any time

When model data comes out of the training cluster, it first has to land in storage and then be pulled up into the inference cluster, and this data pipeline is often very complicated. Many of the Internet companies we have talked to keep a temporary checkpoint store alongside a persistent checkpoint store, and shuttling data between the low-performance and high-performance stores is a very complicated process for them.
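As a hedged illustration of the simplified flow (paths and the model are placeholders, not the speaker's setup): the training job writes checkpoints to one Alluxio-backed path, and the inference side reads the same logical path, instead of shuttling files between a temporary and a persistent checkpoint store.

```python
# Sketch under the same FUSE-mount assumption: one logical checkpoint path,
# written by training and read by inference; persistence to the under-store
# (S3/HDFS/Ceph) is left to Alluxio's write policy rather than a manual copy job.
import torch
import torchvision

CKPT_PATH = "/mnt/alluxio/checkpoints/resnet50_step_1000.pt"  # hypothetical path

def save_checkpoint(model: torch.nn.Module, path: str = CKPT_PATH) -> None:
    torch.save(model.state_dict(), path)

def load_for_inference(path: str = CKPT_PATH) -> torch.nn.Module:
    model = torchvision.models.resnet50()
    model.load_state_dict(torch.load(path, map_location="cpu"))
    model.eval()
    return model
```

The same path can be mounted on both the training and inference clusters, which is what removes the manual shuttling between low-performance and high-performance stores described above.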

3. Avoid complicated data migration

4. Models go online 2-4x faster than with object storage or a traditional data warehouse

The underlying storage is generally object storage or traditional HDFS. Traditional HDFS, for example, was designed for massive data storage, not for performance; in most cases the priority is fault tolerance. The same goes for cloud storage: after talking with many cloud vendors, we learned that in many cases object storage on the cloud cannot directly support AI workloads.

Let's talk in detail about how Alluxio builds this system. There are many scenarios involved; here I would like to share the original intent behind Alluxio's architecture design:

First, at many Internet companies we see that most customers/users keep the great majority of their data (90-95%) in the data lake, rather than running a separate data cluster for AI. The lake holds a lot of data: tables tracked by the traditional Hive Metastore, data in the newer data lake formats, plenty of streaming data flowing in directly, and a large amount of unstructured data.

So how does Alluxio play a role in this?

It is now popular to preprocess the data with a Spark or Ray architecture and put it back into the data lake, after which TensorFlow and PyTorch pull the data from there for training. Looking at the diagram on the left, what can go wrong if you pull the data without Alluxio?

For example, the original data warehouse uses an HDFS cluster, and AI training uses a Ceph cluster:

√ First, the processed/unprocessed data must be pulled into the Ceph cluster, and only then can the pulled data serve the layers above. Several problems arise here. The pulling process itself is very complicated, and many companies end up developing their own data management system with several different workflows inside, for example using a metastore to track where these tables and data live;

√ Secondly, it is necessary to pull data incrementally;

√ Finally, the data needs to be checked to see if there are any problems.

There is a long delay in this process from pulling to availability, so we want to use the Alluxio cache function to help you solve this problem.

First, we can preload part of the data into Alluxio, placing it in storage closer to the computation and thereby reducing bandwidth consumption. Even without preloading, Alluxio's caching mechanism can quickly pull data to the training cluster. Think of it like moving from T+1 to T+0 settlement in stock trading: the data becomes usable from the moment it is first accessed, with no need to wait hours for a transfer, which saves a lot of time.
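A minimal cache-warming sketch under the same mount assumption (the path is hypothetical, and Alluxio also has its own preloading facilities; this only illustrates the idea): reading each file once before training pulls it from the remote data lake into the cache near the GPUs.

```python
# Warm the Alluxio cache by streaming each training file once before the job
# starts; subsequent epochs then hit the cache instead of the remote store.
import os

def warm_cache(root: str = "/mnt/alluxio/training/images",
               chunk_size: int = 8 * 1024 * 1024) -> int:
    warmed = 0
    for name in os.listdir(root):
        path = os.path.join(root, name)
        if not os.path.isfile(path):
            continue
        with open(path, "rb") as f:
            while f.read(chunk_size):  # discard the bytes; the read populates the cache
                pass
        warmed += 1
    return warmed

if __name__ == "__main__":
    print(f"warmed {warm_cache()} files")
```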

Second, Alluxio can also reduce the data governance problems that come with building everything in-house. If the user already has a data governance system, we provide a variety of APIs, including APIs for updating the raw data, to facilitate customized development.

In addition, we also focus on reducing cost and improving efficiency in the training cluster. In the past, many companies used high-performance storage clusters for training, but that can be very expensive and limits business expansion. We have found that if only the GPU compute nodes are equipped with disks, this storage cost is usually no more than 3-5% of the overall cost of the GPU cluster. Besides that, many companies have plenty of storage resources, but fully utilizing those resources remains a challenge.

Alluxio provides many integration points here. We can deploy the Alluxio cluster directly onto the training nodes, where it consumes very little (about 30-40GB of memory) yet provides high-performance training support. Users only need to pay about 3-5% of the cost of the whole compute cluster to make full use of the GPU cluster, helping them overcome I/O bottlenecks and drive GPU utilization toward 100%.

Beyond the training cluster, we also pay special attention to the cost and efficiency of the inference cluster. As the inference cluster scales, its cost can be much higher than the training cluster's, so we are committed to solving the problem of quickly deploying the model produced by training onto the online cluster.

In the traditional approach, the training result is written back to something like a Ceph store, and the online cluster, which may sit in the same or a different IDC, then has to pull it, which involves complex management. Many companies build their own storage gateway to handle cross-region or cross-IDC access, but the gateway has an obvious problem: it solves the cross-region/cross-IDC part, not the high-performance cross-region part. Put simply, the training cluster and the online ML cluster can be connected, but a gateway in AWS may be completely unable to keep up once the inference cluster scales to 100 or even 1,000 nodes, and deployments will then jitter badly. As another data point, Alluxio can deploy an entire model to the inference cluster within 2-3 minutes, whereas such systems generally take about 10x longer, with very long P95 and P99 latencies.

4. Application of Alluxio in Model Training & Model Deployment Scenarios

Next, we will explain in detail how Alluxio works in different scenarios:

The first is the problem mentioned earlier: GPUs are in very short supply. The companies we saw often had no multi-cloud strategy before, yet deployment is forced into one. For example, many customers/users have their data on AWS and do not want to use other clouds such as Azure or Google Cloud, but this year we ran into a situation where the GPUs could only be obtained on Azure. In that case it is hard to say that all the clusters can stay on AWS: the GPU clusters end up in Azure and still need a way to access the data in AWS. Accessing that data directly gives very poor performance: with low network bandwidth, GPU utilization usually does not exceed 10%, and even with a better network (such as a dedicated line) it only reaches 20-30%.

The second problem is that building multi-cluster data management yourself is very complicated, including keeping the data consistent and deciding how to update and pull it; Alluxio has done a lot of integration work, so you can use Alluxio directly to solve these problems. We also do not want everyone to have to buy a dedicated hardware solution. Before joining Alluxio my lab worked on HPC, and a big problem with HPC is cost: one HPC setup typically costs as much as 10 sets of Hadoop hardware, or the equivalent cloud storage. So buying proprietary hardware to build an AI Infra architecture is doing things the hard way at great expense. Seeing this, our vision is to build the AI/ML data path directly on the data lake: keep the existing storage system, support training without purchasing additional dedicated hardware such as RDMA/InfiniBand networking, and avoid having to isolate the data from the jobs in the original data warehouse (isolation here means migrating the data and then running two completely independent systems, which makes data pulling and access very problematic).

The picture above was mentioned earlier. Features Alluxio provides, such as automatic data lake loading/unloading and data update, can improve the productivity of the data engineering team. A common scenario: if you instead bolt a Ceph onto the original system, the basic timeline stretches to 3-6 months, and at overseas companies stretching beyond 6 months is very common, just to build the whole data pipeline. If you are interested, you can read the Zhihu case study, which explains in great detail how to build this kind of system.

The picture above shows another problem mentioned earlier: model deployment is limited by the underlying storage, including bandwidth, and by IDCs sitting in different locations. Alluxio can build a multi-cloud, multi-architecture setup; whether the model is deployed from public cloud to private cloud or between different public clouds, the problem is solved very quickly, because we provide a high-concurrency cache system to support highly concurrent pulls by the business.

To summarize, where does Alluxio sit in the AI architecture, and what problems does it help you solve?

√ First, reduce the cost of transformation and adaptation, so that everyone can focus more on the logic of getting models online;

√ Second, eliminate the dedicated storage architecture. Systems such as NAS and NFS used to be mandatory; with Alluxio they are no longer necessary, and the AI platform can be built with Alluxio on top of the existing HDFS and object storage;

√ Third, add a cache layer to raise GPU utilization to a higher level;

√ Fourth, meet the company's need to deploy GPUs freely: whether the GPUs are bought on or off the cloud, and no matter where the data sits, data access remains very efficient. A specific case is given later.

5. Effect comparison: before using Alluxio vs. after using Alluxio

This is data we pulled from TensorBoard; I believe many engineers doing AI Infra use this tool. We found a fairly big problem on the cloud. A common pattern in recent years is to pull data directly through S3 FUSE; alternatively, if there is a local disk, the data can be copied there first. So for model training you either run a copy job to place the data locally, or expose something like a FUSE interface, pull the data locally, and then serve it upward. With this approach the DataLoader takes a very large share of the time. If you look at how the AI architecture works, the DataLoader does the following: it pulls data from the storage system into CPU memory, the CPU performs the preprocessing (for example resizing), the result goes back into CPU memory, and then the GPU processes it. The latter two stages are fine on the cloud, because the CPU-to-GPU ratio and the memory ratio are generally reasonable, so problems there are small. But because the data originally lives in cloud storage and has to be pulled to the CPU, the first DataLoader stage performs very poorly, and although it is an asynchronous process, the later steps still have to wait for it to finish. As you can see, the DataLoader can account for more than 80% of the time while GPU utilization is only around 17%, measured with ResNet-50, a very standard benchmark.
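The staged pipeline described above maps onto a standard PyTorch DataLoader. The sketch below is hedged: the dataset path, batch size, and worker count are placeholders rather than the benchmark setup, and switching from an S3 FUSE mount to an Alluxio-backed mount only changes the `root` path.

```python
# Hedged sketch of the stages above: (1) read from storage into CPU memory,
# (2) preprocess on the CPU, (3) hand the batch to the GPU.
import torch
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

preprocess = transforms.Compose([
    transforms.RandomResizedCrop(224),   # stage 2: CPU-side resize/crop
    transforms.ToTensor(),
])

dataset = datasets.ImageFolder(root="/mnt/alluxio/training/imagenet",  # stage 1: read
                               transform=preprocess)
loader = DataLoader(dataset, batch_size=256,
                    num_workers=8,       # parallel CPU workers overlap I/O with compute
                    pin_memory=True)     # faster host-to-GPU copies

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
for images, labels in loader:
    images = images.to(device, non_blocking=True)  # stage 3: hand off to the GPU
    # ... forward/backward pass ...
    break  # sketch only
```

Nothing in the loop knows whether `root` is backed by S3 FUSE, a local copy, or the Alluxio cache; only the I/O latency behind the path changes.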

After we deployed Alluxio, DataLoader time dropped to less than 1% and GPU utilization rose to 93%. That does not mean it cannot go higher, but GPU utilization is limited partly by I/O and partly by CPU performance, so this is already a very high utilization.
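As a rough consistency check of these numbers, assuming training throughput scales with GPU utilization and the pure compute time stays fixed:

```python
# Back-of-the-envelope: if data loading falls from 82% to 1% of wall time and
# the compute time itself is unchanged, the implied speedup matches the jump
# in GPU utilization from 17% to 93% (both around 5.5x).
compute_fraction_before = 1.0 - 0.82                 # 18% of the original wall time is compute
new_total = compute_fraction_before / (1.0 - 0.01)   # compute now fills 99% of the new wall time
print(f"epoch time ~{new_total:.0%} of original, ~{1.0 / new_total:.1f}x faster")
print(f"GPU utilization ratio: {0.93 / 0.17:.1f}x")
```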

In addition, we have recently launched some programs for AI scenarios, including an "Alluxio-assisted model training" program. Many large models are in fact already running on Alluxio, using it as a high-performance data access layer. From July 1 to September 30, 2023, registration is open to everyone: you get 3 months of 1:1 support from our professional team to help you build large-scale model training or other currently popular training scenarios.


To read more of Alluxio's in-depth articles and learn about popular events and expert talks, click through to [Alluxio Think Tank]:
