How to pull off an elegant overtake on the large-model track [Book Giveaway | Issue 10: "Distributed Unified Big Data Virtual File System: Alluxio Principles, Technology and Practice"]


In artificial intelligence (AI) and machine learning (ML), data-driven decision-making and model training have become central to modern applications and research. With the rapid development of large-model technology, the volume of data required for training keeps growing, and data processing, storage, and transmission face enormous challenges: traditional storage and processing methods can no longer meet real-time and performance requirements, while data silos between different computing frameworks further restrict the effective use of data. How to stand out on the fiercely competitive large-model track and pull off an elegant overtake on the curve is a question into which many contenders are pouring enormous manpower and material resources.

Among these challenges, model training is the top priority. Training requires an efficient data platform architecture to produce analysis results quickly, and it relies heavily on large data sets. The first step of any training run is to move the training data from storage to the compute cluster, so the efficiency of the data workflow strongly affects the efficiency of model training. In real-world scenarios, AI/ML training tasks typically place the following demands on a data platform:

01 Efficient I/O for frequent access to massive numbers of small files

An AI/ML workflow includes not only model training and inference but also the earlier data loading and preprocessing steps, and those early stages have a large impact on the whole workflow. Compared with traditional data-analytics applications, AI/ML workloads issue far more frequent I/O requests against huge numbers of small files during loading and preprocessing. The data platform therefore needs to deliver higher I/O efficiency to accelerate the workflow.
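One common mitigation for small-file I/O pressure is to pack many tiny samples into a few large shards, so each storage request fetches one big object instead of one tiny file. Below is a minimal, generic sketch of that idea in Python; it is illustrative only and not Alluxio-specific code.

```python
# Minimal sketch: group many small samples into larger shards so that each
# storage request fetches one shard instead of one tiny file.

def pack_into_shards(sample_sizes, shard_capacity):
    """Greedily pack per-sample byte sizes into shards of at most
    shard_capacity bytes; returns a list of shards (lists of sample indices)."""
    shards, current, used = [], [], 0
    for idx, size in enumerate(sample_sizes):
        if current and used + size > shard_capacity:
            shards.append(current)
            current, used = [], 0
        current.append(idx)
        used += size
    if current:
        shards.append(current)
    return shards

# 10,000 small files of ~4 KB each, packed into 64 MB shards:
sizes = [4096] * 10_000
shards = pack_into_shards(sizes, 64 * 1024 * 1024)
# 10,000 tiny I/O requests collapse into a handful of large sequential reads.
print(len(shards))
```

Formats such as TFRecord and WebDataset apply the same principle in practice; a cache layer like Alluxio attacks the problem from the other side, by serving the small files from fast local storage.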

02 Improve GPU utilization, reduce costs and increase ROI

Machine learning model training is computationally intensive and requires a large amount of GPU resources to process data quickly and accurately. Since GPUs are expensive, optimizing their utilization matters. In this setting, I/O becomes the bottleneck: the workload is limited by the speed at which data can be fed to the GPU, not by the speed at which the GPU can perform the training computation. The data platform needs to deliver high throughput and low latency to keep GPU clusters fully saturated and thereby reduce cost.
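The effect of I/O on GPU utilization can be captured with a back-of-envelope model. The sketch below (illustrative numbers, not measurements) compares sequential load-then-compute steps against a pipeline where I/O overlaps with compute and against a fast cache hit.

```python
# Back-of-envelope model of GPU utilization per training step: the GPU
# computes for `compute_s` seconds and data delivery takes `io_s` seconds.

def gpu_utilization(compute_s, io_s, overlap=False):
    """Fraction of wall-clock time the GPU spends computing.
    With overlap, I/O hides behind compute and a step takes
    max(compute, io); without overlap the two times add up."""
    step = max(compute_s, io_s) if overlap else compute_s + io_s
    return compute_s / step

# 50 ms of compute per batch, 150 ms to fetch the batch from remote storage:
seq = gpu_utilization(0.05, 0.15, overlap=False)    # load, then compute
par = gpu_utilization(0.05, 0.15, overlap=True)     # overlapped but I/O-bound
cached = gpu_utilization(0.05, 0.02, overlap=True)  # cache hit: compute-bound
print(round(seq, 2), round(par, 2), round(cached, 2))
```

Even with perfect overlap, utilization is capped while fetches are slower than compute; only once cached delivery is faster than the training step does the GPU stay fully busy.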

03 Support native interfaces of various storage systems

As data volumes keep growing, it is hard for an enterprise to rely on a single storage system. Different business units use different kinds of storage, including on-premises distributed storage systems (such as HDFS and Ceph) and cloud storage (AWS S3, Azure Blob Storage, Google Cloud Storage, etc.). For efficient model training, all training data stored across these environments must be accessible, ideally through the native interface users already know.

04 Support single cloud, hybrid cloud and multi-cloud deployment

In addition to supporting different storage systems, the data platform must also support different deployment models. As data volumes grow, cloud storage has become a popular choice thanks to its scalability, low cost, and ease of use. Enterprises want to deploy on a single cloud, a hybrid cloud, or multiple clouds without restriction, enabling flexible and open model training. Moreover, the trend toward separating compute and storage is increasingly pronounced, which means storage systems are accessed remotely: data must travel over the network, and that brings performance challenges. The data platform needs to sustain high performance when accessing data across these heterogeneous environments.

In summary, AI/ML workloads require fast and low-cost access to large amounts of data in various types of heterogeneous environments. Enterprises need to continuously optimize and upgrade their data platforms to ensure that model training workloads can effectively access data and maintain high throughput and high GPU utilization.


As a powerful distributed unified big data virtual file system, Alluxio has demonstrated outstanding value in many fields and offers a brand-new solution for empowering AI/ML training. Its core recipe consists of four aspects:

01 Unify data silos through data abstraction

As a data abstraction layer, Alluxio can achieve seamless data access without copying or moving data. Whether it is local or in the cloud, the data remains in place. Through Alluxio, data is abstracted to present a unified view, greatly reducing the complexity of the data collection phase.

Because Alluxio is already integrated with storage systems, machine learning frameworks only need to interact with Alluxio to access data from any storage to which it is connected. Therefore, we can use data from any data source for training and improve the quality of model training. Without the need to manually move data to a centralized data source, all computing frameworks including Spark, Presto, PyTorch and TensorFlow can access the data without worrying about where the data is stored.
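The key idea behind this data abstraction is a mount table: prefixes of one logical file tree are mapped onto different physical storage systems, so jobs use a single path scheme regardless of where data lives. The sketch below is a drastically simplified, hypothetical illustration of that idea; the mount points and URIs are invented, and Alluxio's real mount mechanism is far richer.

```python
# Conceptual sketch of a unified namespace: a mount table maps logical path
# prefixes onto different physical storage backends. Paths and URIs here
# are invented examples, not real Alluxio configuration.

MOUNT_TABLE = {
    "/datasets/raw": "s3://company-bucket/raw",
    "/datasets/features": "hdfs://namenode:8020/features",
    "/models": "wasbs://mlstore@account.blob.core.windows.net/models",
}

def resolve(logical_path):
    """Translate a logical path into the underlying storage URI,
    matching the longest mount prefix first."""
    for prefix, backend in sorted(
        MOUNT_TABLE.items(), key=lambda kv: len(kv[0]), reverse=True
    ):
        if logical_path.startswith(prefix):
            return backend + logical_path[len(prefix):]
    raise FileNotFoundError(f"no mount covers {logical_path}")

print(resolve("/datasets/raw/day=2023-09-01/part-0.parquet"))
```

A training job only ever sees the logical tree; whether a path resolves to S3, HDFS, or blob storage is a deployment detail hidden behind the mount table.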

02 Achieving data locality through distributed caching

Alluxio's distributed cache spreads data evenly across the cluster instead of copying the entire data set to every machine, as shown in Figure 1. Distributed caching is especially useful when the training data set is far larger than a single node's storage capacity. When data is stored remotely, the distributed cache keeps a local copy, speeding up access; and because subsequent reads incur no network I/O, machine learning training becomes faster and more efficient.


Figure 1 Distributed cache

As shown in the figure above, all training data is stored in object storage, and two files (/path1/file1 and /path2/file2) represent the data set. Rather than storing every block on every training node, the file blocks are distributed across multiple machines. To prevent data loss and improve read concurrency, each block can be replicated on several servers simultaneously.
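The placement scheme in Figure 1 can be sketched as follows. This is a deliberately naive round-robin placement with a replication factor, written only to make the idea concrete; real cache placement policies also weigh worker capacity and load.

```python
# Sketch of distributing file blocks across cache workers with replication:
# each block lands on `replicas` distinct workers chosen round-robin, rather
# than every worker holding the whole data set. Simplified illustration only.

def place_blocks(num_blocks, workers, replicas=2):
    """Return {block_id: [worker, ...]} with `replicas` distinct workers each."""
    assert replicas <= len(workers), "need at least `replicas` workers"
    placement = {}
    for b in range(num_blocks):
        placement[b] = [workers[(b + r) % len(workers)] for r in range(replicas)]
    return placement

plan = place_blocks(num_blocks=6, workers=["node1", "node2", "node3"], replicas=2)
print(plan[0], plan[5])
```

With 6 blocks over 3 workers, each worker caches only a third of the primary copies, yet every block survives the loss of one node and can be read from two places concurrently.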

03 Optimize data sharing throughout the workflow

In model training, data reads and writes overlap to a large degree, both within a single job and across different jobs. Alluxio lets computing frameworks re-read data cached by earlier workloads, as shown in Figure 2. For example, if Spark performs ETL during the data preparation stage, data sharing ensures its output is cached for use in the subsequent stages. Through such sharing, the entire data workflow achieves better end-to-end performance.
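The mechanism can be pictured as a read-through cache shared by consecutive jobs: the ETL job's output lands in the cache once, and the training job reads the cached copy instead of fetching from remote storage again. The sketch below uses invented path names and a plain dict as the "cache" purely for illustration.

```python
# Sketch of cross-job data sharing through a shared cache. Paths and jobs
# are hypothetical; the dict stands in for a distributed cache tier.

cache = {}
stats = {"remote_reads": 0}

def read_through_cache(path, load_fn):
    """Return cached data for `path`, loading from remote storage on a miss."""
    if path not in cache:
        stats["remote_reads"] += 1
        cache[path] = load_fn(path)
    return cache[path]

def etl_job():
    # Data preparation: read raw data (one remote fetch), cache the output.
    data = read_through_cache("/raw/events", lambda p: list(range(5)))
    cache["/prepared/events"] = [x * 2 for x in data]

def training_job():
    # Reads the ETL output straight from the cache: no second remote fetch.
    return read_through_cache("/prepared/events", lambda p: None)

etl_job()
batch = training_job()
print(batch, stats["remote_reads"])
```

The point of the sketch: across both jobs only one remote read occurs, because the intermediate result never leaves the cache between pipeline stages.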


Figure 2 Passing data between workflows through Alluxio

04 Orchestrate the data workflow by running data preloading, caching, and training in parallel

Alluxio shortens model training time through preloading and on-demand caching. As shown in Figure 3, loading data from the data source into the cache can run in parallel with the training task itself, so training benefits from high data throughput without having to wait for the full data set to be cached before it starts.


Figure 3 Alluxio data loading improves GPU utilization

Although there is some I/O latency at the start, the I/O wait time shrinks as more and more data lands in the cache. In this solution, every stage (loading the training data set from object storage into the training cluster, caching, on-demand loading of data for training, and the training job itself) can run in parallel and interleave, greatly accelerating the whole training process.
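The overlap described above is the classic producer-consumer pattern: a loader thread prefetches batches into a bounded queue while the trainer consumes batches that are already available. Here is a minimal sketch of that pattern using only the Python standard library; the "training step" is a stand-in computation.

```python
# Sketch of overlapping data loading with training: a background thread
# prefetches batches into a bounded queue while the consumer "trains" on
# batches already available, so the two stages run in parallel.
import queue
import threading

def loader(batches, q):
    for b in batches:
        q.put(b)      # blocks if the trainer falls far behind (backpressure)
    q.put(None)       # sentinel: no more data

def train(q):
    results = []
    while True:
        batch = q.get()
        if batch is None:
            break
        results.append(sum(batch))  # stand-in for one training step
    return results

q = queue.Queue(maxsize=4)          # bounded buffer between the two stages
batches = [[i, i + 1] for i in range(5)]
t = threading.Thread(target=loader, args=(batches, q))
t.start()
losses = train(q)
t.join()
print(losses)
```

This is the same idea behind prefetching data loaders in training frameworks: the bounded queue lets loading run ahead of training by a few batches while preventing unbounded memory growth.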


For a comparative analysis of Alluxio against traditional AI/ML model training solutions, detailed performance tests, and application cases from a wide range of industries, see "Distributed Unified Big Data Virtual File System: Alluxio Principles, Technology and Practice".

Live broadcast preview

Live broadcast theme

Alluxio: Accelerating the Next Generation of Big Data and AI Transformation

Launch event for "Distributed Unified Big Data Virtual File System: Alluxio Principles, Technology and Practice"


Live broadcast time

September 21 (Thursday)

20:00 - 21:30

This live broadcast introduces Alluxio's technical principles, core features, and usage, along with practical cases of Alluxio in big data analytics, AI/ML, and other scenarios.

How to watch the live broadcast

Search for the video account "IT reading rankings" on WeChat and set a reminder for the live broadcast.


How the lottery works

  • Follow, like, and bookmark this article

  • Leave a comment such as "Learn the full stack of knowledge and find a winner" (following plus commenting enters you in the prize pool; each person may leave at most three comments)

  • Winners are drawn at random at 8 pm on Sunday

  • This round gives away 2 to 5 books; the more the article is read, the more copies we give away:
    500-1,000 reads: 2 books
    1,000-1,500 reads: 3 books
    1,500-2,000 reads: 4 books
    2,000+ reads: 5 books


Source: blog.csdn.net/weixin_44816664/article/details/132985306