How to use a distributed storage system to facilitate AI model training

When dealing with small datasets and simple algorithms, traditional machine learning models can be stored on a standalone machine or on a local hard drive. However, as deep learning developed, teams increasingly encountered storage bottlenecks when dealing with larger datasets and more complex algorithms.

This highlights the importance of distributed storage in the field of artificial intelligence (AI). JuiceFS is an open-source, high-performance distributed file system that provides a solution to this problem.

In this article, we discuss the storage challenges AI teams face, how JuiceFS helps improve model training efficiency, and common methods to speed up model training.

Challenges for AI teams

AI teams often encounter the following challenges:

  • Large datasets: As data and model sizes grow, standalone storage can no longer keep up with application demands, so distributed storage becomes a necessity.
  • Full archiving of historical datasets: In some cases, large numbers of new datasets are generated daily and must be archived as historical data. This is especially important in autonomous driving, where data collected by road test vehicles, such as radar and camera data, is a valuable company asset. Standalone storage proves insufficient in these cases, so distributed storage becomes a necessary consideration.
  • Too many small files and unstructured data: Traditional distributed file systems struggle to manage large numbers of small files, which places a heavy burden on metadata storage. This is especially problematic for vision models. To solve this problem, we need a distributed storage system optimized for small files, so that upper-layer training tasks stay efficient and massive numbers of small files remain easy to manage.
  • POSIX interfaces for training frameworks: Algorithm scientists often rely on local resources for research and data access during the initial stages of model development. When training later scales out to distributed storage, the original code should require minimal modification. Therefore, distributed storage systems should support POSIX interfaces to maximize compatibility with code developed in the local environment.
  • Sharing common datasets and data segregation: In some domains, such as computer vision, authoritative public datasets need to be shared among different teams within the company. To facilitate data sharing between teams, these datasets are often integrated and stored in shared storage solutions to avoid unnecessary data duplication and redundancy.
  • Inefficient data I/O in cloud-based training: Cloud-based model training typically uses object storage as the underlying storage for a storage-compute separation architecture. However, poor read and write performance of object storage can cause significant bottlenecks during training.

How JuiceFS Helps Improve Model Training Efficiency

What is JuiceFS?

JuiceFS is an open-source, cloud-native distributed file system compatible with the POSIX, HDFS, and S3 APIs. It adopts a decoupled architecture: metadata is stored in a metadata engine, while file data is uploaded to object storage, providing a cost-effective and highly elastic storage solution.

JuiceFS has users in more than 20 countries, including leading companies in artificial intelligence, Internet, automotive, telecommunications, financial technology and other industries.

The architecture of JuiceFS in a model training scenario

The figure above shows the architecture of JuiceFS in the model training scenario, which consists of three components:

  • Metadata engine: A database such as Redis or MySQL can be used as the metadata engine. Users can make a choice according to their needs.
  • Object storage: You can use any supported object storage service, either offered by a public cloud or self-hosted.
  • JuiceFS client: To access the JuiceFS file system like a local hard disk, users need to mount it on each GPU and compute node.

The underlying storage is the raw data in object storage, while each compute node keeps local caches of both metadata and data.
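To make this concrete, here is a minimal sketch of creating and mounting a JuiceFS volume on a training node. The Redis address, S3 bucket, credentials, volume name, and mount point are all placeholders; substitute your own metadata engine and object storage.

```bash
# Create a file system: metadata goes to Redis, file data goes to object storage.
juicefs format \
    --storage s3 \
    --bucket https://mybucket.s3.us-east-1.amazonaws.com \
    --access-key <ACCESS_KEY> \
    --secret-key <SECRET_KEY> \
    redis://:mypassword@192.168.1.10:6379/1 \
    myjfs

# Mount the volume in the background on every GPU/compute node so that
# training code can read the dataset through an ordinary POSIX path.
juicefs mount -d redis://:mypassword@192.168.1.10:6379/1 /mnt/jfs
```

After mounting, training scripts can simply read from /mnt/jfs as if it were a local directory.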

The JuiceFS design allows for multiple levels of local caching on each compute node:

  • Level 1: Memory-based cache
  • Level 2: Disk-based cache

Object storage is only accessed when a request misses every cache level.

For single-machine training, the dataset usually does not hit the cache during the first epoch. From the second epoch onwards, with sufficient cache capacity, there is little need to access object storage, which speeds up data I/O.

Read and write caching process in JuiceFS

We previously compared the efficiency of training when accessing object storage with and without caching. The results show that, compared with accessing object storage directly, JuiceFS's metadata cache and data cache deliver a performance improvement of more than 4 times on average, and nearly 7 times at peak.
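You can get a rough sense of these numbers in your own environment with the built-in benchmark, which measures large-file, small-file, and metadata performance against a mount point; running it twice contrasts a cold cache with a warm one. The mount path, bucket, and credentials below are placeholders.

```bash
# Benchmark the mounted file system (large files, small files, stat) with 4 threads.
juicefs bench -p 4 /mnt/jfs

# Benchmark the underlying object storage directly (available in JuiceFS 1.0+).
juicefs objbench \
    --storage s3 \
    --access-key <ACCESS_KEY> \
    --secret-key <SECRET_KEY> \
    https://mybucket.s3.us-east-1.amazonaws.com
```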

The following figure shows the process of reading and writing cache in JuiceFS:

JuiceFS read and write caching process

For the "block cache" in the above figure, a block is a logical concept in JuiceFS. Each file is divided into 64 MB chunks to improve read performance for large files. This information is cached in the memory of the JuiceFS process to speed up metadata access efficiency.
Read caching process in JuiceFS:

1. An application (which can be an AI model training application, or any application that initiates a read request) sends a request.

2. The request enters kernel space. The kernel checks whether the requested data is available in the kernel page cache. If not, the request falls back to the JuiceFS process in user space, which handles all read and write requests.

By default, JuiceFS maintains a read buffer in memory. When a request cannot be served from the buffer, JuiceFS consults the block cache index, which maps to a local disk-based cache directory. Since JuiceFS stores data in 4 MB blocks, the cache granularity is also 4 MB.

For example, when a client accesses a portion of a file, only the 4 MB blocks covering that portion of the data are cached in the local cache directory, not the entire file. This is a significant difference between JuiceFS and other file systems or caching systems.

3. The block cache index quickly locates the file block in the local cache directory. If the block is found, JuiceFS reads it from the local disk through the kernel into the JuiceFS process, which in turn returns the data to the application.

4. Data read from the local disk is also kept in the kernel page cache, because Linux caches data there by default unless direct I/O is used. The kernel page cache speeds up repeated access: if a request hits the page cache, the data is returned immediately without passing through the Filesystem in Userspace (FUSE) layer into the user-space JuiceFS process. If it misses, the JuiceFS client looks up the block in the local cache directory, and if it is not found there either, sends a network request to object storage, fetches the data, and returns it to the application.

5. When JuiceFS downloads data from object storage, the data will be asynchronously written to the local cache directory. This ensures that the next time the same block is accessed, it will be hit in the local cache without retrieving it from the object store again.

Unlike data, metadata is cached for a shorter period of time. To ensure strong consistency, open operations are not cached by default. Given the low volume of metadata traffic, its impact on overall I/O performance is minimal. However, in scenarios with many small files, metadata overhead accounts for a noticeable share of the total.

Why is AI model training slow?

When you use JuiceFS for model training, performance is a key factor to consider because it directly affects the speed of the training process. Several factors may affect training efficiency when using JuiceFS:

Metadata engine

The choice of metadata engine (such as Redis, TiKV, or MySQL) can significantly affect performance when processing small files. In general, Redis is 3-5 times faster than the other databases. If metadata requests are slow, try using a faster database as the metadata engine.
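The metadata engine is selected simply by the metadata URL passed to juicefs format and juicefs mount. The hosts, credentials, and database names below are illustrative placeholders.

```bash
# Redis as the metadata engine (usually the fastest option)
juicefs mount -d redis://:mypassword@192.168.1.10:6379/1 /mnt/jfs

# MySQL as the metadata engine
juicefs mount -d "mysql://user:password@(192.168.1.11:3306)/juicefs" /mnt/jfs

# TiKV as the metadata engine
juicefs mount -d tikv://192.168.1.12:2379,192.168.1.13:2379/jfs /mnt/jfs
```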

Object storage

Object storage affects the performance and throughput of data access. Public cloud object storage services provide stable performance. If you use self-hosted object storage (such as Ceph or MinIO), you can tune its components to improve performance and throughput.
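For example, a self-hosted MinIO deployment can serve as the data store when the volume is formatted; the endpoint, bucket, and credentials below are placeholders.

```bash
# Use a self-hosted MinIO service as the object storage backend.
juicefs format \
    --storage minio \
    --bucket http://192.168.1.20:9000/jfs-data \
    --access-key <MINIO_ACCESS_KEY> \
    --secret-key <MINIO_SECRET_KEY> \
    redis://:mypassword@192.168.1.10:6379/1 \
    myjfs
```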

Local disk

The location where the cache directory is stored has a significant impact on overall read performance. In cases of high cache hit ratios, the I/O efficiency of the cache disk affects the overall I/O efficiency. Therefore, you must consider factors such as storage type, storage medium, disk capacity, and dataset size.

Network bandwidth

After the first round of training, if the dataset is insufficient to be fully cached locally, network bandwidth or resource consumption will affect data access efficiency. In the cloud, different machine models have different NIC bandwidths. This also affects data access speed and efficiency.

Memory size

Memory size affects the size of the kernel page cache. When there is enough memory, the remaining available memory can be used as a data cache for JuiceFS. This can further speed up data access.

However, when little memory is available, data has to be read from the local disk, which increases access overhead. In addition, switching between kernel mode and user mode has performance implications, such as the context-switch overhead of system calls.

How to troubleshoot issues in JuiceFS

JuiceFS provides many tools to optimize performance and diagnose problems.

Tool #1: The juicefs profile command

You can run the juicefs profile command to analyze access logs for performance optimization. Each mounted file system generates an access log. However, access logs are not saved in real time; they are only produced while being viewed.

Compared to viewing the raw access log, this command aggregates the information and computes sliding-window statistics, sorting requests by response time from slowest to fastest. This helps you focus on requests with slow response times and further analyze how they relate to the metadata engine or object storage.
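A brief usage sketch (the mount point and file paths are placeholders):

```bash
# Watch aggregated access statistics for a mount point in real time.
juicefs profile /mnt/jfs

# Or capture the raw access log during a training run and analyze it
# afterwards in offline mode (stop the capture with Ctrl-C).
cat /mnt/jfs/.accesslog > /tmp/jfs-access.log
juicefs profile /tmp/jfs-access.log
```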

Tool #2: The juicefs stats command

This command collects monitoring data from a macro perspective and displays it in real time. It monitors CPU usage, memory usage, in-memory buffer usage, FUSE read/write requests, metadata requests, and object storage latency for the current mount point. These detailed metrics make it easy to spot potential bottlenecks or performance issues during model training.
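A brief usage sketch (the mount point is a placeholder; the --verbosity flag for extra detail is assumed from recent client versions):

```bash
# Show real-time metrics (CPU, memory, buffer usage, FUSE requests,
# metadata requests, object storage latency) for the mount point.
juicefs stats /mnt/jfs

# Show more detailed columns while a training job is running.
juicefs stats /mnt/jfs --verbosity 1
```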

Other tools

JuiceFS also provides tools for CPU and heap profiling (a hedged usage sketch follows the list):

  • The CPU profiling tool analyzes bottlenecks in the JuiceFS process's execution speed and is suitable for users familiar with the source code.
  • The heap profiling tool analyzes memory usage. It is especially useful when the JuiceFS process is consuming a lot of memory and you need to determine which functions or data structures are responsible.
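Both tools build on Go's pprof. The sketch below assumes the JuiceFS client exposes its pprof listener on the default local port (the first free port from 6060); check your client's actual port before running it.

```bash
# Sample a 30-second CPU profile of the running JuiceFS client.
go tool pprof "http://localhost:6060/debug/pprof/profile?seconds=30"

# Take a heap profile to see which functions and data structures
# hold the most memory.
go tool pprof "http://localhost:6060/debug/pprof/heap"
```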

Common methods to speed up AI model training

Metadata cache optimization

You can optimize metadata caching in two ways, as follows.

Adjust the timeout for the kernel metadata cache

The --attr-cache, --entry-cache, and --dir-entry-cache parameters correspond to different types of metadata:

  • attr represents file attributes such as size, modification time, and access time.
  • entry represents a file and its associated attributes in Linux.
  • dir-entry represents a directory and the files it contains.

These parameters control the cache timeout for their respective metadata types.

To ensure data consistency, the default timeout for these parameters is only 1 second. In a model training scenario, the original data is not modified, so it is possible to extend these timeouts to days or even a week. Note that the metadata cache cannot be actively invalidated; it is only refreshed after the timeout expires.
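For a read-only training dataset, a hedged mount sketch might extend these timeouts to a week (values are in seconds; the metadata URL and mount point are placeholders):

```bash
# Extend kernel metadata cache timeouts to 7 days for read-only datasets.
juicefs mount -d \
    --attr-cache 604800 \
    --entry-cache 604800 \
    --dir-entry-cache 604800 \
    redis://:mypassword@192.168.1.10:6379/1 /mnt/jfs
```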

Optimize user-level metadata cache for JuiceFS clients

When a file is opened, the metadata engine is typically queried for the latest file attributes to ensure strong consistency. However, since model training data is usually not modified, the --open-cache parameter can be enabled with a timeout to avoid repeatedly accessing the metadata engine every time the same file is opened.

Additionally, the --open-cache-limit parameter controls the maximum number of cached files. The default is 10,000, meaning that metadata for at most the 10,000 most recently opened files is cached in memory. This value can be adjusted based on the number of files in the dataset.
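A hedged example for a read-only dataset containing roughly a million files (the timeout and limit are illustrative):

```bash
# Cache open() results for 1 hour and keep metadata for up to
# 1,000,000 recently opened files in client memory.
juicefs mount -d \
    --open-cache 3600 \
    --open-cache-limit 1000000 \
    redis://:mypassword@192.168.1.10:6379/1 /mnt/jfs
```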

Data cache optimization

JuiceFS data cache includes kernel page cache and local data cache:

  • The kernel page cache cannot be tuned via parameters. Therefore, reserve enough free memory on the compute nodes so that JuiceFS can fully utilize it; if memory on a compute node is tight, JuiceFS data will not be cached in the kernel.
  • Local data caching can be controlled by users, and caching parameters can be adjusted according to specific scenarios.
    • --cache-size adjusts the cache size; the default value is 100 GB, which is enough for most scenarios. However, for datasets that occupy particularly large storage space, the cache size needs to be increased accordingly. Otherwise the 100 GB cache space may fill up quickly, preventing JuiceFS from caching more data.
    • Another parameter that works together with --cache-size is --free-space-ratio. It determines how much free space to keep on the cache disk. The default value is 0.1, which allows up to 90% of the disk space to be used for cached data.

JuiceFS also supports using multiple cache disks at the same time. It is recommended to use all available disks if possible. Data is distributed evenly across the disks in a round-robin fashion to achieve load balancing and maximize the storage advantage of multiple disks.
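Putting these options together, a hedged mount sketch for a node with two NVMe disks might look like this (paths and sizes are placeholders; --cache-size is specified in MiB):

```bash
# Spread the local block cache across two NVMe disks (paths separated by ":"),
# allow up to ~500 GiB of cache, and keep at least 10% of each disk free.
juicefs mount -d \
    --cache-dir /nvme1/jfscache:/nvme2/jfscache \
    --cache-size 512000 \
    --free-space-ratio 0.1 \
    redis://:mypassword@192.168.1.10:6379/1 /mnt/jfs
```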

Cache warming

To improve training efficiency, you can use cache warming to speed up training tasks. The juicefs warmup command warms up the metadata cache and the local data cache on the client side, building the cache in advance so that it is already available when the training task starts.
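A hedged example of warming up a dataset directory before training starts (the path and thread count are placeholders):

```bash
# Pre-fetch the dataset into the local cache with 8 concurrent threads
# so that even the first training epoch hits the cache.
juicefs warmup -p 8 /mnt/jfs/datasets/imagenet
```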

Increase the buffer size

Buffer size also affects read performance. By default, the buffer size is 300 MB. But in high-throughput training scenarios, this may not be enough. You can adjust the buffer size according to the memory resources of the training node.

In general, the larger the buffer, the better the read performance. But do not set the value too high, especially in a container environment with a limited memory ceiling. Set the buffer size according to the actual workload and find a reasonably balanced value. You can monitor buffer usage in real time using the juicefs stats command described earlier in this article.
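For example, on a training node with plenty of free memory, the read/write buffer can be raised from the 300 MB default; the value below is illustrative and is specified in MiB.

```bash
# Increase the in-memory read/write buffer to 1 GiB.
juicefs mount -d \
    --buffer-size 1024 \
    redis://:mypassword@192.168.1.10:6379/1 /mnt/jfs
```

Watch the buffer usage column in juicefs stats while tuning, and back off if the client's memory footprint grows too large.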
