Yunzhisheng: JuiceFS-based supercomputing platform storage practice

Yunzhisheng (Unisound) started out as a technology company focused on speech and language processing, and has since grown its technology stack into full-stack AI capabilities covering image, natural language, and signal processing. It is a leading artificial intelligence unicorn in China. The company embraces cloud computing and offers solutions for smart healthcare, smart hotels, and smart education.

Atlas is the underlying technology platform of Unisound, supporting the iteration of all Unisound models:

The first layer is the business layer, covering the company's businesses such as speech processing, image processing, and natural language processing.

The second layer is the control center, which covers the whole pipeline from data production and data access to model release in one stop.

The third layer is the core computing layer, which mainly supports deep learning and data preprocessing.

The bottom layer is the infrastructure layer, composed of GPU clusters, CPU clusters, and distributed storage. All machines are connected by a 100 Gbps InfiniBand high-speed network.

Storage Scenarios and Requirements

Unisound's initial goal was to build a one-stop AI platform covering AI model production: data preprocessing, model development, model training, and finally model launch.

As shown in the figure above, each step interacts with data, and data preprocessing and model training are relatively IO-intensive.

• Data preprocessing: in speech processing, speech features are extracted and converted into numpy-format files; in image processing, images are preprocessed and the training data is converted between formats.
• Model development: algorithm engineers edit code and debug model algorithms.
• Model training: data is read over multiple rounds and the model is written to the corresponding storage; the IO required in this step is very large.
• Model launch: the serving layer reads the model files from the storage system.

To summarize our storage requirements:

  1. It must support the full link of model development, connecting the core functional blocks above.
  2. It must support data-reading tasks on both CPUs and GPUs.
  3. Our scenarios mainly involve speech, text, and image data, which are characterized by relatively small files, so high-performance handling of small-file workloads must be supported.
  4. Our workload is read-heavy and write-light: during model training most data is read and almost no data is written.

Based on these requirements, we need a high-performance and reliable distributed storage system.

Unisound Storage Construction History

In the early days we had only about a dozen GPUs and used NFS to build a small-scale cluster. We also introduced a CephFS test environment in 2016, but that version of CephFS performed poorly with small files, so CephFS never went into production.

Later, our research showed that Lustre is the most commonly used high-performance file system in the HPC field. Tests showed that Lustre performed well at scale and in raw performance, so from 2017 to 2022 we used Lustre to carry all of our data services.

However, as the number of GPUs grew, with the cluster now providing a floating-point processing capability of about 570 exaflops, the IO of the underlying storage could no longer keep up with the computing power of the upper layers. We therefore began to explore new storage options for subsequent expansion and upgrades. At the same time, we ran into some problems while using Lustre.

First, operation and maintenance: Lustre runs mainly in the kernel and is embedded directly into it, so troubleshooting sometimes requires operations such as rebooting the machine.

Second, the technology stack: our cloud platform is developed mainly in Go, so we prefer storage that is more compatible with our development language. Lustre is written in C, which requires more manpower for customization and optimization.

Third, data reliability: Lustre relies mainly on hardware reliability (such as RAID), and its software layer mainly provides HA for the metadata nodes and the object/data nodes. In comparison, we prefer more reliable software-level mechanisms such as three replicas or erasure coding.

Fourth, the need for multi-level caching: in 2021 we used Fluid + Alluxio as a distributed acceleration layer on top of Lustre. Alluxio accelerated our cluster's computation and reduced the pressure on the underlying storage, but we kept exploring client-side caching done directly by the storage system, which would be more transparent to users.

When JuiceFS was first open-sourced in 2021, we researched its features.

First, product features: JuiceFS supports the POSIX interface and can be mounted via HostPath, which is exactly how we were already using our NAS, so users basically do not need to change anything. JuiceFS also offers many choices for the metadata engine and object storage: Redis and TiKV are good fits for AI workloads on the metadata side, and for the data store users can pick Ceph, MinIO, or public-cloud object storage.

Second, upper-layer scheduling: besides HostPath, JuiceFS also supports a CSI Driver mode, which lets users access the storage in a more cloud-native way.

Third, business framework adaptation: the POSIX interface fits deep learning frameworks without extra adaptation work.

Fourth, operation and maintenance: the available metadata engines and object stores are mature in the industry and there are many choices, and JuiceFS provides automatic metadata backup and a recycle bin. JuiceFS fit our business well, so we ran a POC test.

The test environment is shown in the figure above. In the comparison we found that JuiceFS uses the kernel page cache directly, whereas Lustre reads straight from mechanical disks, so JuiceFS delivered a large performance improvement (as shown in the figure below; smaller is better).

After the POC test, we decided to bring JuiceFS into the production environment. Currently, all GPU computing nodes of the Atlas cluster, as well as all development and debugging nodes, have the JuiceFS client installed.

JuiceFS connects directly to our Redis cluster and Ceph, and most computing nodes access it via HostPath. The JuiceFS CSI Driver is also deployed in the Atlas cluster, so users can access it in a cloud-native way.
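
For reference, below is a minimal sketch of how such a volume might be created once, assuming a Redis metadata engine and a self-built Ceph pool accessed through librados. The Redis URL, pool name, and volume name are placeholders, and the exact Ceph RADOS options should be double-checked against the JuiceFS documentation.

```python
# Hedged sketch: create a JuiceFS volume backed by Redis (metadata) and a Ceph
# RADOS pool (data). Values below are placeholders, not the Atlas production config.
import subprocess

subprocess.run(
    [
        "juicefs", "format",
        "--storage", "ceph",              # talk to Ceph via librados, no S3 gateway
        "--bucket", "ceph://jfs-pool",    # RADOS pool used as the data store (placeholder)
        "redis://:password@redis-host:6379/1",  # metadata engine URL (placeholder)
        "atlas-jfs",                      # volume name (placeholder)
    ],
    check=True,
)
# Cluster credentials and monitor addresses are taken from /etc/ceph/ceph.conf
# and the keyring present on the node running this command.
```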

How JuiceFS is used in Atlas

To ensure data security, each group on the supercomputing platform has its own directory; a directory is accessible only to the members of that group or department, and directories of different groups are invisible to each other.

Directory permissions are based on the Linux permission mechanism. When a user submits a training task to the Atlas cluster, the cluster's task submission tool automatically reads the user's UID and GID from the system and injects them into the SecurityContext field of the task Pod. The UIDs of all container processes running on the Atlas cluster are then consistent with the information on the storage system, ensuring that permissions do not cross boundaries.
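
The snippet below is a minimal sketch of that kind of injection using the Kubernetes Python client; it is not the actual Atlas submission tool, and the names (build_training_pod, the atlas namespace, the image) are illustrative.

```python
# Illustrative sketch: read the submitting user's UID/GID on the login node and
# inject them into the Pod's securityContext so in-container processes match the
# ownership on the JuiceFS mount. Names and paths are placeholders.
import os
import pwd

from kubernetes import client, config

def build_training_pod(user: str, image: str) -> client.V1Pod:
    pw = pwd.getpwnam(user)  # UID/GID as known to the storage system
    return client.V1Pod(
        metadata=client.V1ObjectMeta(name=f"train-{user}"),
        spec=client.V1PodSpec(
            security_context=client.V1PodSecurityContext(
                run_as_user=pw.pw_uid,    # container processes run with the user's UID
                run_as_group=pw.pw_gid,
                fs_group=pw.pw_gid,
            ),
            containers=[client.V1Container(
                name="train",
                image=image,
                volume_mounts=[client.V1VolumeMount(name="jfs", mount_path="/data")],
            )],
            volumes=[client.V1Volume(
                name="jfs",
                host_path=client.V1HostPathVolumeSource(path="/mnt/jfs"),  # HostPath to JuiceFS
            )],
            restart_policy="Never",
        ),
    )

if __name__ == "__main__":
    config.load_kube_config()
    pod = build_training_pod(os.environ["USER"], "registry.example.com/train:latest")
    client.CoreV1Api().create_namespaced_pod(namespace="atlas", body=pod)
```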

Nodes access JuiceFS through multiple cache levels (a mount sketch follows the list):

  • Level 1: the kernel page cache in memory.
  • Level 2: multiple SSDs on every computing node provide second-level acceleration.
  • Level 3: Ceph. If the three 1 TB SSDs cannot hold the user's data, it is read from Ceph.
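
Below is a hedged example of what such a cache-enabled mount can look like; the Redis URL, cache directories, and sizes are placeholders rather than the actual Atlas configuration.

```python
# Hedged sketch: mount JuiceFS with local SSD cache directories. Reads are served
# from the kernel page cache first, then these cache dirs, then Ceph.
import subprocess

subprocess.run(
    [
        "juicefs", "mount",
        "--cache-dir", "/ssd1/jfscache:/ssd2/jfscache:/ssd3/jfscache",  # one dir per SSD
        "--cache-size", "3145728",   # total local cache in MiB (~3 TiB across three 1 TB SSDs)
        "--buffer-size", "2048",     # read/write buffer in MiB
        "-d",                        # run in the background
        "redis://:password@redis-host:6379/1",  # metadata engine URL (placeholder)
        "/mnt/jfs",                  # mount point exposed to Pods via HostPath
    ],
    check=True,
)
```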

At the beginning of 2021, Unisound and the JuiceFS team integrated JuiceFSRuntime into Fluid. Because the cache had been used in a bare-metal way, we found that its visibility to users was poor: the system cleans the cache automatically and users have little control over it. That is why we integrated JuiceFS into Fluid.

Fluid starts the JuiceFS-related components, including the FUSE Pod and the Worker Pod. The FUSE Pod provides the JuiceFS client's caching capability, and the Worker Pod manages the cache life cycle. The offline AI training tasks on the Atlas platform read training data by interacting with the FUSE Pod client.

With the cache scheduling and dataset observability provided by Fluid, platform users can place caches on specific computing nodes through affinity scheduling, and can see cache usage at a glance (such as the size of the cached dataset, the percentage of the dataset that is cached, and the cache capacity).
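
As an illustration of this setup, the sketch below creates a Fluid Dataset plus a JuiceFSRuntime through the Kubernetes CustomObjectsApi. The field values (names, mount path, secret, cache quota) are placeholders, and the resource schema follows the Fluid JuiceFS examples at the time, so check the current Fluid documentation before relying on it.

```python
# Illustrative only: register a dataset with Fluid and attach a JuiceFSRuntime so
# cache workers are scheduled onto chosen nodes and cache usage becomes observable.
# Field names follow the Fluid JuiceFS examples at the time; treat this as a sketch.
from kubernetes import client, config

GROUP, VERSION, NS = "data.fluid.io", "v1alpha1", "atlas"

dataset = {
    "apiVersion": f"{GROUP}/{VERSION}",
    "kind": "Dataset",
    "metadata": {"name": "speech-train", "namespace": NS},
    "spec": {
        "mounts": [{
            "name": "speech-train",
            "mountPoint": "juicefs:///speech/train",  # subpath inside the JuiceFS volume
            "encryptOptions": [{  # metadata engine URL kept in a Secret (placeholder name)
                "name": "metaurl",
                "valueFrom": {"secretKeyRef": {"name": "jfs-secret", "key": "metaurl"}},
            }],
        }],
    },
}

runtime = {
    "apiVersion": f"{GROUP}/{VERSION}",
    "kind": "JuiceFSRuntime",
    "metadata": {"name": "speech-train", "namespace": NS},  # must match the Dataset name
    "spec": {
        "replicas": 2,  # worker Pods that manage the cache life cycle
        "tieredstore": {"levels": [{
            "mediumtype": "SSD",
            "path": "/ssd1/jfscache",  # cache directory on the node (placeholder)
            "quota": "1Ti",
        }]},
    },
}

config.load_kube_config()
api = client.CustomObjectsApi()
api.create_namespaced_custom_object(GROUP, VERSION, NS, "datasets", dataset)
api.create_namespaced_custom_object(GROUP, VERSION, NS, "juicefsruntimes", runtime)
```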

Construction practice of JuiceFS

Currently Atlas has no public network access and runs in a dedicated isolated network, so all of our deployments are on-premises.

Our production environment uses Redis as the metadata engine. In 2020, TiKV support in JuiceFS was not yet very mature, so we planned to use Redis as a transition and Ceph for the object storage. The Redis node's system disk is configured with RAID 1, and the persisted Redis data is periodically synchronized to a backup node. For Redis persistence we use the AOF + RDB scheme, persisting data every second.
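
As a hedged illustration of those persistence settings, the snippet below applies the AOF-every-second plus RDB configuration through redis-py; the host, port, and password are placeholders, and in practice the same directives would normally live in redis.conf.

```python
# Illustrative only: apply/verify the AOF + RDB persistence described above.
import redis

r = redis.Redis(host="redis-host", port=6379, password="***")  # placeholders
r.config_set("appendonly", "yes")         # enable AOF
r.config_set("appendfsync", "everysec")   # fsync the AOF once per second
r.config_set("save", "900 1 300 10")      # periodic RDB snapshots as a second safety net
print(r.config_get("appendonly"), r.config_get("appendfsync"))
```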

The object storage is a self-built Ceph cluster deployed with Cephadm; the current production environment runs the Octopus release. We borrowed many solutions from the industry, tuned memory usage, and made corresponding adjustments at the software level, mainly as follows:

Server level (for reference):
  • 42 cores, 256 GB RAM, 24 × 18 TB HDD
  • System disk: 2 × 960 GB SAS SSD
  • BlueStore
  • Disable NUMA
  • Upgrade the kernel to 5.4.146 and enable io_uring
  • Raise the kernel pid limit via /proc/sys/kernel/pid_max

Ceph configuration:
  • Ceph RADOS: call the librados interface directly instead of going through the S3 protocol
  • Bucket sharding
  • Disable the automatic PG adjustment feature
  • OSD journal storage (with BlueStore, the recommended raw-capacity ratio is block : block.db : block.wal = 100:1:1; SSD or NVMe SSD is recommended for the latter two)
  • 3 replicas

It is worth emphasizing that upgrading the Ceph cluster's kernel to a newer version and enabling io_uring greatly improves performance. On the software side, we call the RADOS interface directly instead of using the S3 protocol, which is slightly more efficient. All nodes are interconnected with a 100 Gbps InfiniBand high-speed network.
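
For illustration, the sketch below prepares a RADOS data pool in line with the settings above (PG autoscaling disabled, three replicas); the pool name and PG count are placeholders rather than our production values.

```python
# Hedged sketch: create and tune the RADOS pool that JuiceFS writes to.
import subprocess

POOL = "jfs-pool"  # placeholder pool name

for cmd in (
    ["ceph", "osd", "pool", "create", POOL, "1024"],                   # PG count sized to the cluster
    ["ceph", "osd", "pool", "set", POOL, "pg_autoscale_mode", "off"],  # disable automatic PG adjustment
    ["ceph", "osd", "pool", "set", POOL, "size", "3"],                 # 3 replicas
):
    subprocess.run(cmd, check=True)
```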

In the Unisound environment, the object storage connected to JuiceFS is Ceph RADOS. JuiceFS uses librados to interact with Ceph, so the JuiceFS client needs to be recompiled, and it is recommended that the librados version match the Ceph version. If you use the CSI Driver, /etc/ceph/ceph.conf is read when the PV/PVC is created, so also pay attention to version compatibility there.

A Complete Monitoring System

The whole link is now fairly long: the bottom layer has the metadata engine cluster and the Ceph object storage cluster, with clients and services on top. Every layer needs a corresponding monitoring solution.

For client nodes, we mainly collect logs. Note that the JuiceFS client logs of each mount point need to be aggregated, with alerts on errors, to prevent the logs from filling the system disk or the node from becoming unwritable.

Each JuiceFS client also needs its own monitoring, for example checking whether the .stats file and the log metrics of each mount point are normal, and then checking the IO and logs of the Redis and Ceph clusters, so that the whole link stays under control and problems are easier to locate.
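
A small probe along these lines can cover the client side: read the hidden .stats file under the mount point and scrape the client's Prometheus endpoint (9567 is the usual default port). The metric names and paths below are indicative and may differ between JuiceFS versions.

```python
# Hedged monitoring sketch: check that the JuiceFS mount is alive and pull
# block-cache hit/miss counters from the client's metrics endpoint.
import urllib.request

MOUNT_POINT = "/mnt/jfs"
METRICS_URL = "http://127.0.0.1:9567/metrics"  # assumed default client metrics port

def client_alive() -> bool:
    try:
        with open(f"{MOUNT_POINT}/.stats") as f:  # virtual file exported by the client
            return bool(f.read())
    except OSError:
        return False  # mount missing or hung

def cache_metrics() -> dict:
    text = urllib.request.urlopen(METRICS_URL, timeout=5).read().decode()
    result = {}
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue
        name, _, value = line.partition(" ")
        if "blockcache_hits" in name or "blockcache_miss" in name:
            result[name] = float(value.split()[0])
    return result

if __name__ == "__main__":
    print("client alive:", client_alive())
    print(cache_metrics())
```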

The picture above shows the Ceph monitoring dashboard. Because our client nodes use SSD caches, very little data is now read from Ceph; most data is served from the cache, so Ceph traffic is low.

The picture above is captured from JuiceFS monitoring. The cache hit rate on the nodes is roughly 90% to 100%, which is quite high, so most data is indeed served from the cache.

Participating in JuiceFS Community Building

Unisound has been actively participating in community building while using JuiceFS Community Edition. In 2021 we worked with the Juicedata team to develop the Fluid JuiceFS Runtime mentioned above. More recently, finding that the community edition did not yet have directory-based quotas, we developed a version a few months ago that limits the number and size of files in a directory. The PR has been submitted, and we are now working with the JuiceFS community to get it merged.

Usage Scenarios and Benefits of JuiceFS in Atlas

The JuiceFS client's multi-level cache is currently used mainly in our text recognition, speech noise reduction, and speech recognition scenarios. Since data access in AI model training is read-heavy and write-light, we make full use of the JuiceFS client cache to accelerate IO reads.

Benefit 1: Accelerate AI model training

1) Speech noise reduction test

The noise reduction model test uses scattered files: each sample is a small WAV voice file under 100 KB. In this scenario we measured I/O during the data-loading stage. The JuiceFS client node has a 512 GB memory cache, and the test was run with a batch size of 40 on a 500-hour dataset.

Looking at data-reading efficiency alone on small WAV files, JuiceFS reaches 6.45 it/s versus Lustre's 5.15 it/s, a 25% improvement. JuiceFS effectively accelerated our end-to-end model training and shortened the overall time to deliver a model.

2) Text recognition scenario

In the text recognition scenario, the model is a CRNN with MobileNet v2 as the backbone, and the test environment is as follows:

Here the data is generated as one large LMDB file, so the IO demand is for high bandwidth rather than small-file performance. The 200 GB memory cache can hold the entire dataset, so instead of hitting the underlying storage we read directly from the client cache, and overall performance is greatly improved.

This test mainly compares the speed of JuiceFS and Lustre. The results show that reading each batch takes 1.5 s from Lustre and 1.1 s from JuiceFS, a 36% improvement. In terms of model convergence time, it drops from 96 hours on Lustre to 86 hours on JuiceFS, so JuiceFS shortens the delivery time of the CRNN model by 10 hours.

Model debugging and data processing

During code debugging, multiple users run model tests and code traversal on the same debugging machine at the same time. Most users connect to the debugging nodes with remote IDEs, build their own virtual environments, and install a large number of packages on Lustre in advance.

Most of these packages are small files of tens or hundreds of KB that have to be imported into memory. With Lustre, the large number of users demanded high throughput and high small-file performance at the same time, and the results were not good: importing packages would stall, making code debugging slow and overall efficiency low.

Later, with the JuiceFS client cache, the first build was still relatively slow, but subsequent builds were much faster and more efficient because the data was already cached: code navigation and code-hinting imports were quicker. According to user tests, this gave roughly a 2-4x speedup.

Epilogue

From Lustre to JuiceFS

From 2017 to 2021, Lustre was fairly stable for us; when the cluster's storage utilization stayed below 50%, the software's stability was relatively high.

As a veteran storage system in the HPC field, Lustre has powered many of the world's largest supercomputers and has many years of production experience. Its advantages are POSIX compliance, support for various high-performance low-latency networks, and RDMA access. It suits high-performance computing in the traditional HPC field and is compatible with deep learning interfaces, so no business code needs to be modified. But it also has some disadvantages:

First, Lustre does not offer a cloud-native CSI Driver.

Second, Lustre places relatively high demands on operations staff: it is written entirely in C, some bugs cannot be resolved quickly, and overall the community is not very open or active.

JuiceFS has the following characteristics:

First, JuiceFS is a distributed storage product built for the cloud-native field. It provides a CSI Driver and Fluid integration for better interoperability with Kubernetes.

Second, JuiceFS is flexible to deploy, with many options for the metadata engine. If the user's network permits, connecting to public-cloud object storage is actually the better choice.

Third, it is relatively simple to expand and operate. It is fully POSIX-compatible, so deep learning applications can be migrated seamlessly, although due to the characteristics of the object storage backend, JuiceFS has high latency for random writes.

Fourth, JuiceFS supports local caching and the kernel page cache, which provides tiering and acceleration of hot and cold data. This is what we value most and what fits our business scenarios best, though it is not suitable for random writes, and the distributed cache feature is not yet available in the community edition.

Follow-up Plans

• Metadata engine upgrade: TiKV suits scenarios with more than 100 million files (it can support up to 10 billion files) and with high requirements for performance and data security. We have completed internal testing of TiKV and are actively following the community's progress; we will migrate the metadata engine to TiKV in the future.
• Directory quota optimization: the basic functionality has now been merged into JuiceFS Community Edition, and we have discussed with the JuiceFS community the scenarios where performance still needs optimization.
• Non-root support: currently every node has root access to all data, which is too much privilege. We hope to grant root access only on specific nodes.
• QoS: finally, we will check whether the community has a QoS solution, such as rate limiting by UID or GID.

If this is helpful to you, please follow our project Juicedata/JuiceFS! (0ᴗ0✿)

