Evolution and Implementation of ByteFUSE Distributed File System


Introduction: ByteFUSE is a project jointly developed by the ByteNAS team and the STE team. Thanks to its high reliability, high performance, POSIX-compatible semantics, and support for a wide range of usage scenarios, it is widely used by businesses. It currently serves online ES, AI training, system disks, database backup, message queues, symbol tables, compilation workloads, and more. Inside ByteDance, ByteFUSE has reached a scale of roughly 10,000 deployed machines and daily mount points, with an aggregate throughput of nearly 100 GB/s and a capacity of more than ten PB; its performance and stability meet business needs.

Background

ByteNAS is a fully self-developed, high-performance, highly scalable, low-latency distributed file system that supports multiple writers and multiple readers and is fully compatible with POSIX semantics. It currently supports key businesses inside ByteDance such as AI training, database backup, and online ES, and it is also the main product form that NAS on the cloud will take in the future. Early ByteNAS used the NFS protocol to provide service, relying on the TTGW layer-4 load balancer to spread external traffic, at the granularity of TCP connections, across several backend Proxies. A user mounts the VIP provided by TTGW and communicates with one of the Proxies. If the Proxy currently in use goes down because of a machine failure or another reason, TTGW detects the heartbeat timeout and triggers its failover mechanism, automatically redirecting the client's requests to a new, live Proxy; this mechanism is completely transparent to the client. However, using TTGW has the following disadvantages:

  • Cannot support high-throughput scenarios: user throughput is limited not only by the throughput of the TTGW cluster itself, but also by the NFS protocol's 1 MB limit on a single read or write. In addition, NFS uses a single TCP connection and the kernel's slot count bounds the number of concurrent requests, so throughput is capped and metadata and data traffic interfere with each other.
  • Additional network latency: accessing ByteNAS costs two extra network hops (user-side NFS Client -> TTGW -> Proxy -> ByteNAS).
  • Additional machine cost: extra machine resources are required for TTGW and the Proxies.
  • Hard to customize and optimize: constrained by the kernel NFS client, the NFS protocol, and TTGW, customized business features and performance optimizations are difficult to deliver.

To solve the above problems, ByteFUSE was created. ByteFUSE is a solution that connects to ByteNAS through the userspace file system (FUSE) framework: the ByteNAS SDK talks directly to the ByteNAS cluster, which both meets the low-latency goal and removes the protocol's throughput bottleneck. In addition, because part of the file system logic is moved into user space, troubleshooting, feature extension, and performance optimization all become much more convenient. The flows for accessing ByteNAS through ByteFUSE and through NFS are shown in the following figure:

Goals

  • High-performance, low-latency, business-friendly architectural design
  • Full compatibility with POSIX semantics
  • Support for single-writer-multi-reader and multi-writer-multi-reader access
  • Self-developed and maintainable, with support for customized features

Evolution route

1. ByteFUSE 1.0 — completing basic functionality, with cloud-native deployment support

Access ByteNAS through native FUSE

The overall architecture for accessing ByteNAS through native FUSE is shown below:

ByteFUSE Daemon: a FUSE daemon with the ByteNAS SDK built in. The user's file system requests are forwarded to the ByteFUSE Daemon through the FUSE protocol and then forwarded to the back-end storage cluster through the ByteNAS SDK.
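
To make this data path concrete, here is a minimal Go sketch of the forwarding step under stated assumptions: `ByteNASClient` and `FuseRequest` are hypothetical stand-ins for the (non-public) ByteNAS SDK and the decoded FUSE wire format, and the sketch only illustrates how a request is translated into an SDK call, not the actual daemon implementation.

```go
package bytefuse

import (
	"context"
	"fmt"
)

// FuseRequest is a simplified stand-in for a decoded FUSE request; the real
// daemon receives these from the kernel through the FUSE protocol.
type FuseRequest struct {
	Op     string // "read", "write", ...
	Inode  uint64
	Offset int64
	Size   int
	Data   []byte
}

// ByteNASClient is a hypothetical, simplified view of the ByteNAS SDK
// surface, used purely for illustration.
type ByteNASClient interface {
	Read(ctx context.Context, inode uint64, off int64, size int) ([]byte, error)
	Write(ctx context.Context, inode uint64, off int64, data []byte) (int, error)
}

// Dispatch forwards one decoded FUSE request to the back-end storage cluster
// through the SDK, which is the role the ByteFUSE Daemon plays in the text.
func Dispatch(ctx context.Context, c ByteNASClient, req *FuseRequest) (data []byte, written int, err error) {
	switch req.Op {
	case "read":
		data, err = c.Read(ctx, req.Inode, req.Offset, req.Size)
		return data, 0, err
	case "write":
		written, err = c.Write(ctx, req.Inode, req.Offset, req.Data)
		return nil, written, err
	default:
		return nil, 0, fmt.Errorf("unsupported op %q", req.Op)
	}
}
```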

Cloud-native deployment support

ByteFUSE provides a CSI plug-in based on the Kubernetes CSI specification [1], so that ByteFUSE can be used inside a Kubernetes cluster to access a ByteNAS cluster. Its architecture is shown in the following figure:

  • CSI-Driver: ByteFUSE's cloud-native architecture currently supports only static volumes. A Mount/Umount operation starts/destroys a FUSE Client from within the CSI-Driver. The CSI-Driver records the state of every mount point; when the CSI-Driver exits abnormally and restarts, it recovers all mount points to guarantee high availability.

  • FUSE Client: the ByteFUSE Daemon described above. In the 1.0 architecture, the CSI-Driver starts one FUSE Client per mount point to provide service; a rough sketch of this per-mount flow follows below.
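
The sketch below shows what a per-mount-point `NodePublishVolume` handler could look like in Go against the public CSI spec [1]. It is a hypothetical simplification of the 1.0 model, not the actual ByteFUSE CSI-Driver: the `bytefuse-client` binary name and the in-memory mount bookkeeping are assumptions made for illustration.

```go
package driver

import (
	"context"
	"fmt"
	"os/exec"
	"sync"

	"github.com/container-storage-interface/spec/lib/go/csi"
)

// nodeServer is a highly simplified, hypothetical node plugin for the 1.0
// model: one FUSE Client process per mount point, with mount state recorded
// so it can be restored if the driver restarts. A real driver implements the
// full csi.NodeServer interface; only NodePublishVolume is sketched here.
type nodeServer struct {
	mu     sync.Mutex
	mounts map[string]*exec.Cmd // target path -> FUSE Client process
}

// NodePublishVolume mounts one volume at the requested target path by
// launching a dedicated FUSE Client ("bytefuse-client" is a placeholder
// name, not the real command).
func (ns *nodeServer) NodePublishVolume(ctx context.Context,
	req *csi.NodePublishVolumeRequest) (*csi.NodePublishVolumeResponse, error) {

	target := req.GetTargetPath()
	volume := req.GetVolumeId()

	ns.mu.Lock()
	defer ns.mu.Unlock()

	if _, ok := ns.mounts[target]; ok {
		return &csi.NodePublishVolumeResponse{}, nil // already mounted; stay idempotent
	}

	cmd := exec.Command("bytefuse-client", "--volume", volume, "--mountpoint", target)
	if err := cmd.Start(); err != nil {
		return nil, fmt.Errorf("start FUSE client for %s: %w", target, err)
	}
	ns.mounts[target] = cmd
	// A production driver would also persist this state so all mount points
	// can be recovered after an abnormal restart, as described above.
	return &csi.NodePublishVolumeResponse{}, nil
}
```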

2. ByteFUSE 2.0 — cloud-native architecture upgrade; improved consistency, availability, and operability

Business needs and challenges

  • FUSE Client resource usage is uncontrollable and cannot be shared: in the multi-FUSE-Client mode, each mount point corresponds to one FUSE Client process, so FUSE Client resource usage grows directly with the number of mount points and cannot be kept under control.
  • Strong coupling between the FUSE Client and the CSI-Driver prevents smooth CSI-Driver upgrades: the FUSE Client process's lifecycle is tied to the CSI-Driver, so when the CSI needs to be upgraded the FUSE Client must be rebuilt as well, which interrupts business I/O; the duration of this interruption is tied to the CSI-Driver's upgrade time (on the order of seconds).
  • Some businesses want to access ByteFUSE from Kata containers: in cloud-native scenarios, some businesses run as Kata containers. To serve them, the CSI-Driver needs to support the Kata container runtime, i.e. the ByteNAS service must be reachable through ByteFUSE from inside the Kata virtual machine.
  • The native FUSE consistency model cannot meet some business requirements: some businesses are typical single-writer-multi-reader scenarios with extremely high requirements on read/write throughput, data visibility, and tail latency, but with the native FUSE kernel cache enabled, FUSE cannot provide a consistency model such as CTO (Close-to-Open).
  • The availability and operability of native FUSE are too weak for large-scale production: native FUSE has weak support for high availability, hot upgrades, and similar capabilities. When the FUSE process crashes, or a kernel module bug requires an upgrade, the business usually has to be asked to restart its Pod, or even the whole physical node, which is unacceptable for most businesses.

Cloud Native Architecture Upgrade

FUSE Client Architecture Upgrade: Single Daemonization

In response to the business needs and challenges above, we upgraded the architecture: a single-FUSE-Daemon mode solves the problems of uncontrollable and non-shareable resources, and separating the FUSE Daemon from the CSI-Driver solves the problem that the CSI-Driver cannot be upgraded smoothly. The architecture is shown in the following figure:

AdminServer: monitors the status of mount points and FUSE Daemons, upgrades FUSE Daemons, and collects cluster information.

FUSE Daemon: manages all mount points onto the ByteNAS cluster and processes read and write requests; after a restart it recovers all mount points, with recovery time at the millisecond level.
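
The following Go sketch illustrates, under assumed data structures, how a single daemon can own many mount points and rebuild them from a locally persisted record after a restart. The JSON state file and the `remount` hook are hypothetical; the real FUSE Daemon recovers mounts through its own mechanisms.

```go
package daemon

import (
	"encoding/json"
	"os"
	"sync"
)

// MountPoint records the minimum information needed to re-establish a mount.
// The fields are illustrative, not the real ByteFUSE on-disk format.
type MountPoint struct {
	Volume string `json:"volume"`
	Target string `json:"target"`
}

// Registry is the single daemon's view of every mount point it serves.
type Registry struct {
	mu     sync.Mutex
	state  string // path of the persisted state file
	mounts map[string]MountPoint
}

func NewRegistry(statePath string) *Registry {
	return &Registry{state: statePath, mounts: map[string]MountPoint{}}
}

// Add registers a mount point and persists the registry so that a restarted
// daemon can recover it.
func (r *Registry) Add(mp MountPoint) error {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.mounts[mp.Target] = mp
	return r.persistLocked()
}

func (r *Registry) persistLocked() error {
	buf, err := json.Marshal(r.mounts)
	if err != nil {
		return err
	}
	return os.WriteFile(r.state, buf, 0o600)
}

// Recover reloads the registry after a daemon restart and re-attaches each
// mount point; remount is a placeholder for the real recovery path.
func (r *Registry) Recover(remount func(MountPoint) error) error {
	buf, err := os.ReadFile(r.state)
	if err != nil {
		if os.IsNotExist(err) {
			return nil // nothing to recover
		}
		return err
	}
	r.mu.Lock()
	defer r.mu.Unlock()
	if err := json.Unmarshal(buf, &r.mounts); err != nil {
		return err
	}
	for _, mp := range r.mounts {
		if err := remount(mp); err != nil {
			return err
		}
	}
	return nil
}
```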

Kata Containers support

To support the Kata scenario and, at the same time, solve native FUSE's high availability and performance scalability problems, the 2.0 architecture introduces VDUSE [2], a framework developed in-house at ByteDance, to implement the ByteFUSE Daemon. VDUSE reuses the mature virtio software framework, which allows the ByteFUSE Daemon to serve mounts from virtual machines and from the host (containers) at the same time. Compared with the traditional FUSE framework, a FUSE Daemon implemented on top of VDUSE no longer depends on the /dev/fuse character device, but communicates with the kernel through shared memory. This is of great benefit for later performance optimization, and it also solves the crash recovery problem very well.

Consistency, availability and operability improvements

Consistency Model Enhancements

Performance and consistency are a fundamental trade-off in distributed system design: maintaining stronger consistency means more communication between nodes, and more communication means lower performance. To meet business needs, we continuously traded performance against consistency on top of FUSE's native cache modes and implemented a FUSE CTO (Close-to-Open) consistency model [4]. Depending on configuration, these consistency models are abstracted into the following five types:
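
As a rough illustration of what Close-to-Open semantics mean in practice: a client revalidates its cache when it opens a file and flushes dirty data when it closes it, so data written and closed by one client is visible to any client that opens the file afterwards. The Go sketch below captures only this idea; the types and the `RemoteVersion`/`Flush` hooks are hypothetical and are not the ByteFUSE implementation.

```go
package cto

import "context"

// fileCache is a toy per-file cache used to illustrate Close-to-Open (CTO)
// consistency: cached contents are trusted only between an open() that
// revalidated them and the matching close(). dirty would be set by a write
// path that is omitted here.
type fileCache struct {
	version uint64 // version of the remote file the cache was built from
	data    []byte
	dirty   bool
}

// backend abstracts the calls the sketch needs from the storage cluster.
// These are hypothetical hooks, not the ByteNAS SDK API.
type backend interface {
	RemoteVersion(ctx context.Context, inode uint64) (uint64, error)
	Flush(ctx context.Context, inode uint64, data []byte) error
}

// Open implements the "O" half of CTO: before serving reads from the cache,
// compare the cached version with the server and drop stale data.
func (c *fileCache) Open(ctx context.Context, b backend, inode uint64) error {
	v, err := b.RemoteVersion(ctx, inode)
	if err != nil {
		return err
	}
	if v != c.version {
		c.data = nil // invalidate: another client closed a newer version
		c.version = v
	}
	return nil
}

// Close implements the "C" half of CTO: push dirty data to the server so
// that the next Open() on any client observes it.
func (c *fileCache) Close(ctx context.Context, b backend, inode uint64) error {
	if !c.dirty {
		return nil
	}
	if err := b.Flush(ctx, inode, c.data); err != nil {
		return err
	}
	c.dirty = false
	return nil
}
```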

Daemon High Availability

Because the ByteFUSE 2.0 architecture introduces the VDUSE [2] framework, it can use the shared-memory-based virtio protocol as the transport layer. Virtio's built-in inflight I/O tracking feature persists the requests ByteFUSE is currently processing in real time, and ByteFUSE reprocesses the outstanding requests during recovery. This makes up for the lack of state preservation when native libfuse uses the /dev/fuse character device as the transport layer. On top of inflight I/O tracking, ByteFUSE further considers the consistency and idempotence of file system state before and after recovery, achieving crash recovery that is transparent to users [3], and builds Daemon hot upgrade on top of crash recovery.
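
The sketch below shows the general idea of inflight request tracking and replay under assumed types: each request is journalled before it is executed and removed when it completes, and after a crash the journal is replayed. It is a conceptual illustration only; the real mechanism is virtio's inflight I/O tracking inside VDUSE [2] backed by shared memory, not a user-space map.

```go
package inflight

import "sync"

// Request is a simplified in-flight FUSE request; Unique mirrors the unique
// ID the FUSE protocol attaches to every request.
type Request struct {
	Unique uint64
	Op     string
}

// Tracker journals requests that have been received but not yet completed so
// they can be re-executed after a crash. For brevity the journal is an
// in-memory map; the real feature keeps this state in shared memory, which
// is what survives a daemon restart.
type Tracker struct {
	mu       sync.Mutex
	inflight map[uint64]Request
}

func NewTracker() *Tracker {
	return &Tracker{inflight: map[uint64]Request{}}
}

// Begin records a request before it is handed to the execution path.
func (t *Tracker) Begin(r Request) {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.inflight[r.Unique] = r
}

// Complete removes a finished request from the journal.
func (t *Tracker) Complete(unique uint64) {
	t.mu.Lock()
	defer t.mu.Unlock()
	delete(t.inflight, unique)
}

// Replay re-executes every outstanding request after recovery. A request may
// already have taken partial effect before the crash, so execute must be
// idempotent (e.g. a replayed write lands at the same offset with the same
// data); this is the consistency/idempotence concern described above.
func (t *Tracker) Replay(execute func(Request) error) error {
	t.mu.Lock()
	defer t.mu.Unlock()
	for id, r := range t.inflight {
		if err := execute(r); err != nil {
			return err
		}
		delete(t.inflight, id)
	}
	return nil
}
```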

Kernel module hot upgrade

While ByteFUSE uses customized kernel modules to achieve better performance, availability, and consistency, this also makes upgrading and maintaining those kernel modules a challenge. To solve the problem that a binary kernel module cannot follow kernel upgrades, we deploy the customized kernel modules through DKMS, so that they are automatically recompiled and redeployed when the kernel is upgraded. To support hot upgrade of the kernel module itself, we bind the symbol names or device numbers exported by the kernel module to its version number, so that multiple versions of the same kernel module can coexist: new ByteFUSE mounts automatically use the new kernel module, while old mounts continue to use the old one.

Through DKMS and this "multi-version coexistence" technique, the upgrade of the ByteFUSE kernel module is decoupled from both the kernel and the ByteFUSE Daemon. In the future, we will further implement hot upgrade of the ByteFUSE kernel module, so that volumes already mounted online can also be upgraded in place.

3. ByteFUSE 3.0 — Extreme performance optimization, creating an industry-leading high-performance file storage system

Business needs and challenges

Storage system performance requirements for large model training scenarios

In large model training scenarios, training very large models requires enormous compute. As dataset and model sizes grow, the time an application spends loading data grows, which hurts application performance; slow I/O seriously wastes the GPU's compute power. At the same time, model evaluation and deployment need to read a large number of models in parallel, requiring the storage to provide very high throughput.

Cloud-native high-density deployment requires further reducing resource overhead

In cloud-native high-density deployment scenarios, as the number of ByteFUSE volumes grows by orders of magnitude, new requirements are placed on ByteFUSE's per-machine resource (CPU & memory) usage and isolation.

Extreme performance optimization

ByteFUSE 3.0 optimizes performance along the entire path: the threading model, data copies, the kernel side, and the network stack. Overall performance increases by 2.5x, and 2 cores are enough to drive a 100 Gb NIC at full bandwidth. The optimization directions are as follows:

Run-to-Completion threading model

In version 2.0, a read/write request goes through 4 thread switches; adopting Run-to-Completion (RTC) removes the overhead of those 4 switches. To achieve Run-to-Completion, we applied a shared-nothing design and non-blocking lock transformations to ByteFUSE and the ByteNAS SDK, ensuring that RTC threads are never blocked and that requests do not suffer extra latency.
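
A minimal sketch of the run-to-completion idea, under assumed types: each shard owns its requests exclusively (shared-nothing), and the worker that picks a request up carries it through every stage on the same thread without handing it off, so no thread switches occur on the request path. This is a conceptual model of RTC, not the ByteFUSE 3.0 code.

```go
package rtc

import "runtime"

// Request and its handler stages are placeholders; the point is that one
// worker executes decode -> backend I/O -> reply without thread hand-off.
type Request struct {
	ID   uint64
	Data []byte
}

// Shard owns its queue and all per-shard state exclusively (shared-nothing),
// so the hot path does not contend on shared locks or block on other threads.
type Shard struct {
	queue chan Request
}

func NewShard(depth int) *Shard {
	return &Shard{queue: make(chan Request, depth)}
}

// Run pins the worker to an OS thread and processes each request to
// completion before picking up the next one.
func (s *Shard) Run(process func(Request)) {
	go func() {
		runtime.LockOSThread() // keep the worker on one OS thread
		for req := range s.queue {
			// All stages of the request run here, on this thread,
			// with no context switch or cross-thread queueing.
			process(req)
		}
	}()
}

// Submit routes a request to its owning shard (e.g. by connection or inode);
// this is the only place any cross-thread communication happens.
func (s *Shard) Submit(req Request) {
	s.queue <- req
}
```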

RDMA & User Mode Protocol Stack

Compared with 2.0, the 3.0 architecture also greatly improves network transmission, mainly by introducing RDMA and a user-space network stack (Tarzan) to replace the traditional kernel TCP/IP stack. Compared with the kernel TCP/IP stack, RDMA/Tarzan avoids the latency of user/kernel mode switches and data copies, and further reduces CPU usage.

Full link zero copy

With RDMA/Tarzan in place, copies during network transmission are eliminated, but on the FUSE access path there are still two copies: from the page cache to the bounce buffer, and from the bounce buffer to the RDMA/Tarzan DMA buffer. To reduce this copy overhead (by our measurements, copying 1 MB of data costs about 100 us), the 3.0 architecture introduces the VDUSE umem [5] feature, which removes one copy by registering the RDMA/Tarzan DMA buffer with the VDUSE kernel module. In the future we will further implement the FUSE PageCache Extension feature to reach the goal of full-link zero copy.

FUSE kernel optimization

(1) Multiple queues

In the native FUSE/virtiofs kernel modules, many parts of the FUSE request processing path are single-queue: for example, each FUSE mount has only one IQ (input queue) and one BGQ (background queue), and the virtiofs device sends FUSE requests with a single-queue model. To reduce the lock contention caused by the single-queue model and improve scalability, we support per-CPU FUSE request queues and a configurable number of virtiofs virtqueues on the FUSE/virtiofs request path. Based on FUSE multi-queue support, ByteFUSE can configure different CPU affinity policies for different deployment environments to reduce inter-core communication or balance load across cores. ByteFUSE worker threads can also enable the load balancing scheduling provided by the FUSE multi-queue feature to alleviate local request queuing when requests are unevenly distributed across cores.
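
The following sketch is a user-space analogy for the per-CPU queue idea, under assumed types: instead of all submitters contending on one lock-protected queue, each CPU (or worker) gets its own queue, and a worker that finds its own queue empty can pull from another one, mimicking the load-balancing scheduling described above. The real change lives in the FUSE/virtiofs kernel modules and is written in C; this Go fragment only illustrates why multiple queues reduce contention.

```go
package mq

import "sync"

// Request is a placeholder for a FUSE request.
type Request struct{ Unique uint64 }

// queue is one independent, lock-protected request queue.
type queue struct {
	mu   sync.Mutex
	reqs []Request
}

// MultiQueue spreads requests over n queues so that submitters on different
// cores rarely touch the same lock, unlike a single shared input queue.
type MultiQueue struct {
	queues []queue
}

func New(n int) *MultiQueue {
	return &MultiQueue{queues: make([]queue, n)}
}

// Enqueue places the request on the queue chosen for this submitter
// (in the kernel this would be the per-CPU queue of the submitting CPU).
func (m *MultiQueue) Enqueue(cpu int, r Request) {
	q := &m.queues[cpu%len(m.queues)]
	q.mu.Lock()
	q.reqs = append(q.reqs, r)
	q.mu.Unlock()
}

// Dequeue pops from the worker's own queue first, then scans the other
// queues, alleviating local queue build-up when load is uneven across cores.
func (m *MultiQueue) Dequeue(cpu int) (Request, bool) {
	own := cpu % len(m.queues)
	for i := 0; i < len(m.queues); i++ {
		q := &m.queues[(own+i)%len(m.queues)]
		q.mu.Lock()
		if len(q.reqs) > 0 {
			r := q.reqs[0]
			q.reqs = q.reqs[1:]
			q.mu.Unlock()
			return r, true
		}
		q.mu.Unlock()
	}
	return Request{}, false
}
```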

(2) Huge block support

To meet the performance requirements of high-throughput scenarios, ByteFUSE 3.0 supports customized FUSE kernel module parameters. The native FUSE module in the Linux kernel hard-codes some transfer limits, such as a maximum of 1 MB per data transfer and 4 KB per directory read. In the ByteFUSE kernel module, we raise the maximum single data transfer to 8 MB and the maximum single directory read to 32 KB. In the database backup scenario, changing a single write to 8 MB increases single-machine throughput by about 20%.
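
To make the effect concrete, the small sketch below counts how many FUSE write requests a sequential write needs at a given maximum transfer size: raising the limit from 1 MB to 8 MB means 8x fewer kernel/daemon round trips for the same amount of data, which is consistent with (though not a proof of) the roughly 20% single-machine throughput gain reported above. The numbers are illustrative arithmetic, not a benchmark.

```go
package main

import "fmt"

// requestsFor returns how many FUSE WRITE requests are needed to push
// `total` bytes when each request can carry at most `maxWrite` bytes.
func requestsFor(total, maxWrite int64) int64 {
	return (total + maxWrite - 1) / maxWrite
}

func main() {
	const total = 1 << 30 // a 1 GiB sequential write
	for _, maxWrite := range []int64{1 << 20, 8 << 20} {
		fmt.Printf("max_write=%d MiB -> %d FUSE requests\n",
			maxWrite>>20, requestsFor(total, maxWrite))
	}
	// Prints 1024 requests at 1 MiB vs. 128 at 8 MiB, i.e. 8x fewer
	// kernel<->daemon round trips for the same data volume.
}
```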

Evolution benefits

Benefits overview

1.0 -> 2.0

  • Reduce resource occupation and facilitate resource control

Compared with multiple FUSE Clients, a single FUSE Daemon lets multiple mount points share resources such as threads, memory, and connections, which effectively reduces resource usage. In addition, running the FUSE Daemon alone in a Pod fits the Kubernetes ecosystem better and keeps it under Kubernetes control. Users can directly observe the FUSE Daemon Pods in the cluster, which gives good observability.

  • Decoupling CSI-Driver and FUSE Daemon

As two independently deployed services, the CSI-Driver and the FUSE Daemon can be deployed and upgraded independently without affecting each other, further reducing the impact of operations work on the business. In addition, we support hot upgrade of the FUSE Daemon within a Pod, and the whole upgrade is imperceptible to the business.

  • Support kernel module hot upgrade

Hot upgrade of the kernel module is supported for newly created (incremental) ByteFUSE volumes, so known kernel module bugs can be fixed and online risk reduced without the business noticing.

  • Support a unified monitoring and control platform to facilitate visual management

AdminServer monitors the status of all FUSE Daemons and mount points in a region, supports remote recovery of abnormal mount points, hot upgrade of FUSE Daemons within a Pod, and remote anomaly detection and alerting for mount points.

2.0 -> 3.0

The entire architecture implements the Run-to-Completion threading model, which reduces the performance loss caused by locks and context switches. In addition, we replaced kernel-mode TCP with user-mode TCP to bypass the kernel, and registered memory with the kernel to achieve full-link zero copy, further improving performance. For a 1 MB write request, the FUSE Daemon side saves several hundred microseconds.

Performance comparison

FUSE Daemon Machine Specifications:

  • CPU: 32 physical cores, 64 logical cores
  • Memory: 251.27GB
  • NIC: 100Gbps

Metadata performance comparison

We use mdtest for metadata performance testing, with the following command:

mdtest -d /mnt/mdtest/ -b 6 -I 8 -z 4 -i 30

The performance comparison is as follows:

Conclusion

The metadata performance of the 3.0 architecture is about 25% higher than that of the 1.0 architecture.

Data performance comparison

FIO is run with 4 threads; the performance is shown in the figure below:

We also tested the impact of the number of ByteFUSE 3.0 polling threads on performance. For writes, 2 polling threads are basically enough to saturate the 100G NIC, while reads need 4 polling threads (reads involve one more data copy than writes). In the future we will modify the user-space stack Tarzan to remove that data copy for reads and achieve zero copy for both reads and writes.

Business adoption

Adoption in the ES storage-compute separation scenario

Scenario description

ES's Shared Storage architecture lets multiple ES shard replicas use the same data, solving the problems of slow scale-out, slow shard migration, oscillating search scores, and high storage costs under the Shared Nothing architecture. The underlying storage uses ByteNAS to share the data of the primary and replica shards, with ByteFUSE as the access protocol, to meet the requirements of high performance, high throughput, and low latency.

Benefits

Rolling out the ES storage-compute separation architecture saves nearly 10 million per year in storage costs.

Adoption in the AI training scenario

Scenario description

In the AI Web IDE scenario, sharing the root file system via block storage + NFS cannot solve the problems that mounts become unusable when NFS disconnects (processes enter the D state) and that NFS disconnection can trigger kernel bugs. In addition, load-balancer-limited throughput and NFS protocol performance cannot meet the high-throughput, low-latency requirements of training tasks in AI training scenarios, whereas ByteNAS provides a shared file system with large throughput and low latency to support model training.

Benefits

ByteFUSE meets the high-throughput and low-latency requirements of AI training.

Adoption in other business scenarios

Limited by TTGW's throughput and stability, the database backup, message queue, symbol table, and compilation businesses were switched from NFS to ByteFUSE.

Future outlook

The ByteFUSE 3.0 architecture already meets the needs of most businesses, but to pursue even better performance and cover more business scenarios, there is still a lot of work ahead:

  • Extend ByteFUSE to ToB scenarios, to meet the ultra-low-latency and ultra-high-throughput needs of businesses on the cloud
  • Support non-POSIX semantics: customized interfaces to meet the needs of upper-layer applications, such as IO fencing semantics
  • FUSE PageCache Extension: FUSE supports a user-space page cache extension, so the FUSE Daemon can read and write the page cache directly
  • Support hot upgrade of kernel modules: upgrade the kernel modules of both existing and newly created ByteFUSE volumes without user awareness
  • Support GPU Direct Storage [6]: data is transferred directly between the RDMA NIC and the GPU, bypassing host memory and the CPU

References

[1] https://kubernetes-csi.github.io/docs/

[2] https://www.redhat.com/en/blog/introducing-vduse-software-defined-datapath-virtio

[3] https://juejin.cn/post/7171280231238467592

[4] https://lore.kernel.org/lkml/[email protected]/

[5] https://lwn.net/Articles/900178/

[6] https://docs.nvidia.com/gpudirect-storage/overview-guide/index.html
