Yanrong YRCloudFile: training-acceleration optimization strategies for massive-small-file scenarios

Foreword

As a general-purpose high-performance distributed file storage system, YRCloudFile is widely used in scenarios such as AI/autonomous driving, HPC, quantitative analysis, visual-effects rendering, and big data. Some of these workloads demand extreme bandwidth, some require high IOPS on large files, and others must support huge numbers of small files; these requirements drive many design decisions and optimizations.

Today, we will discuss how to optimize performance for a large number of small files in the AI training scenario. Since files in the training scenario are opened in read-only mode, this article focuses on optimizations for read-only small files.

The process of reading a file

First, let's briefly walk through the operations required to read a file, taking cat on a small file as an example. Before the file is read, its existence must be checked via lookup. Once its existence is confirmed, the file is opened via open, its contents are read via read, and after the contents have been read it is closed via close. For network file systems such as YRCloudFile and NFS, there are also revalidate and stat operations to refresh inodes. Note that lookup is only called the first time a file is opened through the file system. We can verify this sequence with strace:
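The sequence above can be mimicked in a short sketch (Python here purely for illustration; the in-kernel lookup and revalidate steps have no direct user-space equivalent, so they appear only as comments):

```python
import os

def cat(path):
    """Mimic the system-call sequence `cat` issues for a small file."""
    # lookup: performed inside the file system on first access; no syscall here.
    fd = os.open(path, os.O_RDONLY)   # open
    st = os.fstat(fd)                 # stat: refresh size, mtime, ...
    # revalidate: also internal to network file systems; no syscall here.
    data = b""
    while True:
        chunk = os.read(fd, 131072)   # read until EOF
        if not chunk:
            break
        data += chunk
    os.close(fd)                      # close
    return data

if __name__ == "__main__":
    with open("/tmp/demo_small_file.txt", "wb") as f:
        f.write(b"hello small file\n")
    print(cat("/tmp/demo_small_file.txt").decode(), end="")
```

For a small file the loop runs exactly once, so the four metadata interactions (lookup, open/stat, revalidate, close) dominate the single read.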

[Figure 1: strace output of cat on a small file]

The fadvise and mmap operations in the figure above are specific to cat, so there is no need to examine them in depth here. As for lookup and revalidate, the reason these two operations do not appear in the figure is that they happen inside the file system and are not exposed through system calls.

Metadata Bottleneck for Small Files

In small-file workloads, metadata operations account for a large share of total operations, as much as 70%-80%, while the actual business reads and writes make up only a small fraction. Metadata performance therefore becomes the bottleneck.

Through the above discussion, we know the operations required to read a file are lookup, open, read, close, stat, and revalidate, and each one carries a network round trip. Among them, lookup is called only the first time the file system reads a file, so across repeated reads its overhead is essentially negligible; for ease of description, lookup is omitted below. Open, close, stat, and revalidate are all metadata operations, each called exactly once, while read is the only data operation the business actually needs, and the larger the file, the more read calls are made. It follows that the larger the file, the more reads, the higher the proportion of data operations, and the lower the proportion of metadata operations, and vice versa. Next, let's look more closely at the large-file and small-file read scenarios:

First, consider reading a large file. For example, to read a 100M file, 1M at a time, 100 read calls are required. There are four corresponding metadata operations: open, close, stat, and revalidate, for a total of 104 operations. Data operations therefore account for 100/104 ≈ 96%, while metadata operations account for 4/104 ≈ 4%.

Now consider reading a small file. To read a 1M file, again 1M at a time, only one read call is required. The corresponding metadata operations are the same four: open, close, stat, and revalidate, for a total of 5 operations. Data operations therefore account for 1/5 = 20%, while metadata operations account for 4/5 = 80%.

The analysis makes it clear that the smaller the file, the higher the proportion of metadata operations, and metadata performance becomes the bottleneck that severely limits OPS. To optimize, we need to reduce the proportion of metadata operations in order to improve OPS.
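The arithmetic above generalizes to a simple ratio. A quick sketch, assuming the four metadata operations per file counted above:

```python
def metadata_share(file_bytes, io_bytes=1 << 20, meta_ops=4):
    """Fraction of total operations that are metadata, assuming meta_ops
    metadata calls (open, close, stat, revalidate) per file and one read
    call per io_bytes chunk."""
    reads = max(1, -(-file_bytes // io_bytes))  # ceiling division, at least one read
    return meta_ops / (reads + meta_ops)

# 100 MiB file, 1 MiB reads: 100 reads + 4 metadata ops -> ~4% metadata
print(round(metadata_share(100 << 20) * 100))  # -> 4
# 1 MiB file: 1 read + 4 metadata ops -> 80% metadata
print(round(metadata_share(1 << 20) * 100))    # -> 80
```

The curve drops off quickly: once a file needs dozens of reads, metadata overhead fades into the noise, which is exactly why only the small-file case needs special treatment.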

Technical solution

Based on the above discussion, we know there is a serious metadata performance bottleneck when handling small files. For each small file, the system must frequently issue the corresponding metadata operations, including open, close, stat, and revalidate, and these operations consume substantial network and disk resources. Therefore, we need to optimize for these problems.

First, to sustain low latency and high OPS on the metadata access path, the I/O framework adopted by Yanrong's distributed file storage YRCloudFile can deliver millions of IOPS. Because metadata must preserve POSIX semantics, it cannot match the performance of ordinary read/write I/O, but it can still provide hundreds of thousands of interactions per second.

Second, relying on a client-side caching mechanism, YRCloudFile provides memory-based metadata cache management. While preserving semantics, it can safely serve requests from the cache, reducing cross-network and disk access overhead.
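The idea of such a client-side attribute cache can be sketched as follows. This is a hypothetical illustration, not YRCloudFile's actual implementation: entries are served from memory within a validity window and revalidated against the server only after it expires, trading a bounded staleness window for far fewer round trips.

```python
import time

class MetadataCache:
    """Minimal client-side attribute cache sketch (hypothetical)."""

    def __init__(self, fetch, ttl=1.0):
        self.fetch = fetch    # callable: path -> attrs; one network round trip
        self.ttl = ttl        # seconds an entry stays valid before revalidation
        self.entries = {}     # path -> (attrs, expiry)

    def getattr(self, path):
        entry = self.entries.get(path)
        now = time.monotonic()
        if entry and now < entry[1]:
            return entry[0]                       # cache hit: no network traffic
        attrs = self.fetch(path)                  # miss or expired: revalidate
        self.entries[path] = (attrs, now + self.ttl)
        return attrs
```

In a training workload that stats the same files epoch after epoch, repeated getattr calls within the window cost nothing on the network, which is where the bulk of the metadata savings comes from.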

Third, the lazy size, lazy close, batch commit, and metadata readahead mechanisms we implemented offload part of the logic to the client while still preserving file-system semantics. The benefit is a substantial reduction in the load on the metadata service, and a large improvement in cluster metadata performance, in both latency and OPS.
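The lazy close and batch commit mechanisms named above can be sketched along these lines (again a hypothetical illustration, not the actual implementation): close requests are queued on the client and flushed to the metadata service in one batched RPC, so N closes cost one round trip instead of N.

```python
class LazyCloser:
    """Sketch of lazy close with batch commit (hypothetical)."""

    def __init__(self, send_batch, batch_size=64):
        self.send_batch = send_batch   # callable: list of handles -> None; one RPC
        self.batch_size = batch_size
        self.pending = []              # close requests not yet committed

    def close(self, handle):
        """Queue a close; commit once enough requests have accumulated."""
        self.pending.append(handle)
        if len(self.pending) >= self.batch_size:
            self.flush()

    def flush(self):
        """Commit all pending closes in a single batched request."""
        if self.pending:
            self.send_batch(self.pending)
            self.pending = []
```

A real implementation would also bound the queue by time (so a close is never delayed indefinitely) and handle errors per handle, but the amortization principle is the same.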

In summary, Yanrong's distributed file storage YRCloudFile optimizes small-file metadata performance through a series of techniques, including memory-based metadata cache management, lightweight open, lazy close, and batch close. Applying these techniques significantly improves YRCloudFile's performance when handling small files and better meets user needs.

Performance comparison before and after optimization

Next, let's look at YRCloudFile's optimization effect in a concrete vdbench test. The cluster is configured in multi-replica mode, with 3 MDS groups and 3 OSS groups, both built on NVMe SSDs. The vdbench script is:

hd=default,vdbench=/root/vdbench50406,shell=ssh,user=root
hd=hd01,system=10.16.11.141

fsd=fsd1_01,anchor=/mnt/yrfs/vdbench/4k-01/,depth=1,width=5,files=1000,size=4k,openflags=o_direct

fwd=fwd1_01,fsd=fsd1_01,host=hd01,operation=read,fileio=random,fileselect=random

rd=randr_4k,fwd=fwd1_*,xfersize=4k,threads=64,fwdrate=max,format=restart,elapsed=30,interval=1,pause=1m

Before optimization

Miscellaneous statistics:
(These statistics do not include activity between the last reported interval and shutdown.)
READ_OPENS          Files opened for read activity:             540,793     95,986/sec
FILE_BUSY           File busy:                                   5,129         570/sec
FILE_CLOSES         Close requests:                             540,793     95,986/sec

After optimization

Miscellaneous statistics:
(These statistics do not include activity between the last reported interval and shutdown.)
READ_OPENS          Files opened for read activity:           4,589,603    653,742/sec
FILE_BUSY           File busy:                                   56,129      1,870/sec
FILE_CLOSES         Close requests:                           4,589,603    653,742/sec

From the data above, YRCloudFile's small-file handling improves markedly after optimization: read opens per second rise from 95,986 to 653,742, a gain of more than 6x. This conclusion is based on extensive testing and evaluation.

Final words

In this article, we discussed the optimization of Yanrong's distributed file storage YRCloudFile for the read-only small-file scenario. We first reviewed the basic process of reading a file, analyzed the problems small files pose, and then designed a series of optimizations to remove the performance bottleneck of read-only small files in AI training scenarios. These optimizations let YRCloudFile provide faster, more efficient, and more reliable service to business applications, especially in AI training scenarios, improving user experience and satisfaction. We hope it serves as a useful reference.

Origin: juejin.im/post/7215424831209455677