InfoWorld Article | Using Data Orchestration Technology for AI Model Training

This article was originally published on InfoWorld on March 22, 2022, under the title "Orchestrating data for machine learning pipelines."
Reprinted with permission. IDG Communications, Inc., 2022. All rights reserved.


Overview

Artificial intelligence (AI) and machine learning workloads rely on large datasets and demand high data throughput, both of which can be addressed by optimizing the data workflow.
AI model training needs an efficient data platform architecture to produce results quickly, and training depends heavily on large datasets. The first step of any model training job is to deliver the training data from storage to the cluster of compute engines, so the efficiency of this data workflow largely determines the efficiency of model training.

Data platform and AI platform engineers need to consider the following questions about data architecture and data management:
Data accessibility: how can training data be obtained efficiently when it spans multiple data sources and is stored remotely?
Data workflow management: how can data be managed as a continuous workflow so that training never has to wait for data?
Performance and GPU utilization: how can low metadata latency and high data throughput be achieved at the same time, so that the GPUs are always busy?

This article discusses a new solution to the end-to-end data flow problem in model training described above: data orchestration. It first analyzes common challenges and misconceptions, and then introduces this new technique for optimizing AI model training.

Common Challenges in AI Model Training
An end-to-end machine learning workflow is a series of steps from data preprocessing and cleaning, to model training, to inference, and model training is the most important and resource-intensive part of the entire workflow.

As shown in the figure below, this is a typical machine learning workflow, starting with data collection, followed by data preparation, and finally model training. During the data collection phase, the data platform engineer usually spends a lot of time making sure the data engineer has access to the data, after which the data engineer prepares the data that the data scientist needs to build and iterate on the model.

[Figure: A typical machine learning workflow: data collection, data preparation, and model training]
The training phase must process massive amounts of data so that the GPUs are continuously fed and keep producing trained models. We therefore need to manage the data in a way that accommodates the complexity of machine learning and of the architecture that executes it. Each step of the data workflow comes with its own technical challenges.

Data Collection Challenges - Data Is Everywhere
The larger the dataset, the better the model training, so it is critical to collect data from all relevant data sources. When data is distributed on-premises, in the cloud, or across data lakes, data warehouses, and object stores, it is no longer feasible to centralize all of it into a single source of truth. Given these data silos, accessing data remotely over the network inevitably introduces latency. Ensuring that the data is accessible while achieving the required performance becomes a huge challenge.

Data Preparation Challenges - Serialized Data Preparation
Data preparation starts with importing the data gathered during collection and includes cleaning, ETL, and transformation; the prepared data is then used for model training. If this stage is considered in isolation, the data workflow is serialized, and the training cluster wastes a lot of time waiting for data to be prepared. AI platform engineers must therefore find ways to build parallelized data workflows with efficient data sharing and efficient storage of intermediate results.

Challenges in model training - I/O constrained and low GPU utilization
Model training has to process hundreds of terabytes of data, often in the form of huge numbers of small files such as images and audio clips. Training iterates over the dataset for multiple epochs, so the same data is accessed repeatedly. In addition, the GPUs must be kept busy by feeding them data continuously. Optimizing I/O while sustaining the throughput the GPUs require is not an easy task.
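As a rough illustration of this access pattern, the hedged sketch below builds a tf.data input pipeline over a directory of many small image files and iterates over it for several epochs. The directory path, file format, and batch size are assumptions for illustration, not taken from the article.

```python
import tensorflow as tf

DATA_DIR = "/data/train/images"   # assumed location of millions of small JPEG files
NUM_EPOCHS = 10
BATCH_SIZE = 128

def decode(path):
    # Each element triggers a metadata lookup plus a small read, which is why
    # small-file workloads put so much pressure on storage metadata.
    raw = tf.io.read_file(path)
    img = tf.io.decode_jpeg(raw, channels=3)
    return tf.image.resize(img, [224, 224])

files = tf.data.Dataset.list_files(DATA_DIR + "/*.jpg", shuffle=True)
dataset = (files
           .map(decode, num_parallel_calls=tf.data.AUTOTUNE)
           .batch(BATCH_SIZE)
           .repeat(NUM_EPOCHS)           # every epoch re-reads the same files
           .prefetch(tf.data.AUTOTUNE))  # overlap I/O with GPU compute

for batch in dataset:
    pass  # placeholder for the actual training step
```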

Traditional solutions and common misconceptions
Before discussing different solutions, let's look at a simplified scenario, as shown in the following figure:

[Figure: The simplified scenario: a multi-node GPU training cluster in the cloud, with preprocessed training data stored in Amazon S3]

We use a multi-node GPU cluster in the cloud, with TensorFlow as the machine learning framework for model training. The preprocessed data is stored in Amazon S3. Generally speaking, the training cluster has two options for getting this data, which we discuss next:

Option 1: Copy data to local storage
The first option, shown in the figure below, is to copy the complete training dataset from remote storage to the local storage of each server used for training. This guarantees data locality: the training job reads the data locally instead of accessing it remotely.
From a data workflow and I/O perspective, this option achieves maximum I/O throughput, since all data is local. The GPUs stay busy, except that training has to wait at the very beginning for the data to be completely copied from the object store to the training cluster.

[Figure: Option 1: the full dataset is copied from remote object storage to the local storage of every training node]
Nonetheless, this scheme is not suitable for all situations.

First, the dataset must fit within the total capacity of the local storage. As the input dataset grows, copying takes longer and becomes more error-prone, while GPU resources sit idle waiting.
Second, copying a large amount of data to every training machine puts enormous pressure on the storage system and the network, and keeping the data synchronized becomes complicated when the input data changes frequently.
Finally, manually copying data is time-consuming and error-prone, because it is very difficult to keep the data in cloud storage in sync with the training data.
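For concreteness, here is a minimal sketch of what option 1 usually looks like in practice: the dataset is synced from S3 to each node's local disk before training starts. The bucket, prefix, and local path are assumptions for illustration, not taken from the article.

```python
import subprocess

S3_URI = "s3://my-training-bucket/imagenet/"   # assumed dataset location
LOCAL_DIR = "/mnt/nvme/imagenet/"              # assumed local NVMe directory

# Step 1: copy the full dataset to local storage on every training node.
# Training cannot start until this finishes, and the dataset must fit on disk.
subprocess.run(["aws", "s3", "sync", S3_URI, LOCAL_DIR], check=True)

# Step 2: point the training input pipeline at LOCAL_DIR; from here on,
# all reads are local I/O and the GPUs are fed at local-disk speed.
```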

Option 2: Direct access to cloud storage
Another common option, shown in the figure below, is to let the training jobs directly access the target datasets in cloud storage. With this approach the dataset size is no longer a limitation, but several new challenges arise:
First, from an I/O and workflow point of view, the data is processed serially, and every data access has to go over the network between the object store and the training cluster, making I/O the performance bottleneck. Because I/O throughput is limited by the network speed, the GPUs sit idle waiting for data.
Second, at large training scale, all training nodes access the same dataset in the same cloud storage at the same time, putting enormous load on the cloud storage system. Under such highly concurrent access, the cloud storage is likely to become congested, resulting in low GPU utilization.
Finally, if the dataset consists of a large number of small files, metadata requests make up a large share of all data requests. Metadata operations that list or fetch large numbers of files or directories directly from object storage therefore become a performance bottleneck and also drive up the cost of metadata operations.
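As a rough sketch of option 2, the snippet below builds a TensorFlow input pipeline that reads directly from S3 at training time. The bucket and file pattern are assumptions; note that in recent TensorFlow releases the s3:// filesystem is provided by the separate tensorflow-io package, which is assumed to be installed.

```python
import tensorflow as tf
import tensorflow_io as tfio  # noqa: F401  registers the s3:// filesystem (assumed installed)

S3_PATTERN = "s3://my-training-bucket/imagenet/*/*.jpg"  # assumed dataset layout

# Every element is fetched over the network at training time; there is no local
# copy, so throughput is bounded by the network and by the object store itself.
files = tf.data.Dataset.list_files(S3_PATTERN, shuffle=True)
dataset = (files
           .map(tf.io.read_file, num_parallel_calls=tf.data.AUTOTUNE)
           .prefetch(tf.data.AUTOTUNE))
```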

Recommended Solution - Data Orchestration

To address these challenges and misconceptions, we need to rethink the data platform architecture around I/O across the machine learning workflow. Here, we recommend data orchestration to speed up the end-to-end model training workflow. Data orchestration technology abstracts data access across storage systems, virtualizes all of the data, and serves it to data-driven applications through standardized APIs and a global namespace.

Unify data silos through abstraction

This approach does not copy or move data: whether it sits on-premises or in the cloud, the data stays where it is. Data orchestration abstracts the data to present a unified view, which greatly reduces the complexity of the data collection phase.
Because the data orchestration platform is already integrated with the storage systems, a machine learning framework only needs to interact with the data orchestration platform to reach data in any storage it is connected to. We can therefore train on data from any source and improve the quality of model training. Computing frameworks such as Spark, Presto, PyTorch, and TensorFlow can all access the data without worrying about where it is stored and without manually moving it to a centralized location.
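To make the idea concrete, here is a hedged sketch of unified access using Alluxio, the data orchestration example used later in this article. The master address, mount point, and dataset path are hypothetical; the point is that different frameworks address the same logical namespace rather than a specific storage system.

```python
import os
from pyspark.sql import SparkSession

ALLUXIO_PATH = "alluxio://alluxio-master:19998/datasets/clean/"  # hypothetical master and path
FUSE_PATH = "/mnt/alluxio/datasets/clean/"                       # same namespace via a FUSE mount

# Spark reads through the alluxio:// scheme (the Alluxio client jar must be on
# the Spark classpath); it does not need to know where the bytes actually live.
spark = SparkSession.builder.appName("unified-access").getOrCreate()
df = spark.read.parquet(ALLUXIO_PATH)

# A Python or PyTorch training script can list and read the very same files
# through the FUSE mount as if they were local.
print(os.listdir(FUSE_PATH))
```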

Data locality through distributed caching

Instead of replicating the entire dataset onto every machine, we recommend distributed caching, which spreads the data evenly across the cluster. Distributed caching is especially useful when the training dataset is much larger than the storage capacity of a single node; when the data is stored remotely, it is cached locally, which benefits data access. In addition, machine learning training becomes faster and more efficient because accessing cached data does not incur network I/O to remote storage.

As shown in the figure above, all training data is stored in the object store, and two files (/path1/file1 and /path2/file2) represent the dataset. Instead of storing all file blocks on every training node, the blocks are distributed across multiple machines. To prevent data loss and improve read concurrency, each block can be stored on multiple servers simultaneously.
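The placement idea can be sketched in a few lines of Python. This is not Alluxio's actual placement algorithm, only a hash-based illustration of spreading replicated blocks of the two example files across cache nodes; the node names, block size, and replica count are assumptions.

```python
import hashlib

CACHE_NODES = ["node-1", "node-2", "node-3", "node-4"]  # hypothetical cache cluster
REPLICAS = 2            # each block cached on two distinct nodes
BLOCK_SIZE = 64 << 20   # 64 MiB blocks, an illustrative choice

def nodes_for_block(file_path: str, block_index: int) -> list[str]:
    """Deterministically pick which cache nodes hold a given block of a file."""
    key = f"{file_path}#{block_index}".encode()
    start = int(hashlib.md5(key).hexdigest(), 16) % len(CACHE_NODES)
    return [CACHE_NODES[(start + i) % len(CACHE_NODES)] for i in range(REPLICAS)]

# The two example files from the figure: their blocks end up spread over the cluster.
for path, size in [("/path1/file1", 200 << 20), ("/path2/file2", 130 << 20)]:
    num_blocks = (size + BLOCK_SIZE - 1) // BLOCK_SIZE
    for b in range(num_blocks):
        print(f"{path} block {b} -> {nodes_for_block(path, b)}")
```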

Optimize data sharing across the entire workflow

In model training jobs, there is a large degree of overlap in data reads and writes, both within a single job and across jobs. Data sharing ensures that all computing frameworks can access previously cached data, so output written by one workload can be read by the next. For example, when Spark is used for ETL in the data preparation stage, data sharing ensures that the output data is cached and available for the subsequent stages. With data sharing, the entire data workflow achieves better end-to-end performance.
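A small sketch of this hand-off, assuming a shared namespace mounted at a hypothetical path: a Spark ETL job writes its cleaned output into the shared namespace, and the next stage reads the already-cached result from the same path instead of going back to the remote object store. The paths and the event_id column are made up for illustration.

```python
from pyspark.sql import SparkSession

SHARED = "/mnt/alluxio/pipeline"  # hypothetical shared namespace (FUSE mount)

spark = SparkSession.builder.appName("etl").getOrCreate()

# Data preparation: clean the raw data and write the result into the shared
# namespace, where it stays cached for the next stage.
raw = spark.read.json(f"{SHARED}/raw/events/")
clean = raw.dropna().dropDuplicates(["event_id"])
clean.write.mode("overwrite").parquet(f"{SHARED}/prepared/events/")

# The training stage (possibly a different framework on different nodes) reads
# the cached output from the same path rather than from the remote object store.
train_df = spark.read.parquet(f"{SHARED}/prepared/events/")
```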

Orchestrate data workflows by performing data preloading, caching, and training in parallel

We orchestrate the data workflow by combining data preloading with on-demand caching. As shown in the figure below, loading data from the data source into the cache can run in parallel with the actual training task. Training therefore enjoys high data throughput when accessing data, without having to wait for the data to be fully cached before it can start.

[Figure: Loading data from the source into the cache runs in parallel with the training task]

While there is some I/O latency at the beginning, it keeps decreasing as more and more data is loaded into the cache. In this approach, all of the stages (loading the training dataset from object storage into the training cluster, caching the data, loading data on demand for training, and the training job itself) can be executed in parallel and interleaved, greatly accelerating the entire training process.
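Below is a toy Python sketch of this overlap, not any particular product's implementation: a background thread preloads files from a remote location into a local cache directory while the training loop consumes files immediately, fetching on demand anything that is not cached yet. The directory paths are assumptions.

```python
import os
import shutil
import threading

REMOTE_DIR = "/mnt/remote/dataset"   # hypothetical remote dataset location
CACHE_DIR = "/mnt/nvme/cache"        # hypothetical local cache directory
os.makedirs(CACHE_DIR, exist_ok=True)

def fetch(name: str) -> str:
    """Copy one file into the cache if it is not there yet (on-demand load)."""
    dst = os.path.join(CACHE_DIR, name)
    if not os.path.exists(dst):
        shutil.copy(os.path.join(REMOTE_DIR, name), dst)
    return dst

def preload(names):
    """Background warm-up: populate the cache ahead of the training loop."""
    for name in names:
        fetch(name)

names = sorted(os.listdir(REMOTE_DIR))
threading.Thread(target=preload, args=(names,), daemon=True).start()

# Training starts immediately: early batches may pay the on-demand fetch cost,
# while later batches increasingly hit the warm cache.
for epoch in range(3):
    for name in names:
        local_path = fetch(name)
        # ... run the training step on local_path ...
```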

Now let's see how this new approach compares with the two traditional options. By orchestrating the data needed at each step of the machine learning workflow, we avoid serializing the stages of the data pipeline and the resulting low I/O efficiency, and we also improve GPU utilization.

[Figure: Comparison of the data orchestration approach with the two traditional options]

How to Orchestrate Machine Learning Workloads
Here we take Alluxio as an example of how to use data orchestration. Again, we look at the simplified scenario, where Kubernetes or a public cloud service can be used to schedule the TensorFlow jobs.

[Figure: TensorFlow training jobs scheduled with Kubernetes or a public cloud service, with Alluxio deployed in the training cluster]

Orchestrating machine learning and deep learning training with Alluxio typically involves the following three steps:

  1. Deploy Alluxio in the training cluster;
  2. Mount the Alluxio service to the local file directory of the training node;
  3. Use the training scripts to access data cached in Alluxio and in the underlying storage through the Alluxio mount point.

Once Alluxio is mounted, data in different storage systems can be accessed through Alluxio, and benchmark or training scripts can access the data transparently without modifying TensorFlow. This greatly simplifies application development compared with the usual work of integrating each specific storage system and configuring credentials.
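As a minimal sketch of step 3, assuming the Alluxio namespace is FUSE-mounted at /mnt/alluxio (the mount point and data layout are assumptions), the training script simply reads ordinary local paths:

```python
import tensorflow as tf

MOUNT_POINT = "/mnt/alluxio"                            # assumed Alluxio FUSE mount point
DATA_PATTERN = MOUNT_POINT + "/training-data/*/*.jpg"   # assumed dataset layout

# TensorFlow sees plain local paths, so no storage-specific code or credentials
# are needed; cache hits are served locally and misses are fetched from the
# underlying storage on demand.
files = tf.data.Dataset.list_files(DATA_PATTERN, shuffle=True)
dataset = (files
           .map(tf.io.read_file, num_parallel_calls=tf.data.AUTOTUNE)
           .prefetch(tf.data.AUTOTUNE))
```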

See the reference guide for running image recognition with Alluxio and TensorFlow.

Best Practices for Data Orchestration

Since no solution is suitable for all scenarios, we recommend using data orchestration under the following conditions:

  1. You need to do distributed training;
  2. You have a large amount of training data (>= 10 TB), especially if the training dataset contains a large number of small files or images;
  3. Network I/O is not fast enough to keep your GPU resources busy all the time;
  4. Your workflow involves multiple data sources and multiple training/computing frameworks;
  5. You want to keep the underlying storage system stable while absorbing additional training workloads;
  6. Multiple training nodes or tasks share the same dataset.

As machine learning technology continues to evolve and frameworks take on more complex tasks, the way we manage data workflows will evolve as well. By applying data orchestration to the data workflow, the end-to-end training workflow can achieve better efficiency and resource utilization.

