Volcano Engine Cloud Native Storage Acceleration Practice

Most of the machine learning and data lake computing power in Volcano Engine-related businesses runs on the cloud-native K8s platform. Storage-compute separation and elastic scaling under the cloud-native architecture have greatly driven the development of storage acceleration, and the industry has derived a variety of storage acceleration services. However, facing diverse computing and customer scenarios, there is no industry-standard storage acceleration practice, and many users are confused when making a selection. We have built a cloud-native storage acceleration service on Volcano Engine that adapts to the various computing scenarios of machine learning and data lakes, and we are committed to providing businesses with easy-to-use, transparent acceleration. This article shares our experience and thinking on storage acceleration, based on our business practice on Volcano Engine.
Author: Guo Jun, Head of Big Data File Storage Technology at Volcano Engine

Cloud-native storage acceleration requirements

The basic services of a cloud-native business can be divided into three main parts: computing, storage, and middleware.
  • The top layer is the computing business, most of which runs on a K8s base. On top of the compute base run big data tasks and AI training tasks, and above those the various computing frameworks.
  • The bottom layer is the storage services. Storage-compute separation is the industry trend, and the standard storage services on the cloud fall into three categories:
    • The first is object storage, for which AWS S3 is the de facto standard product; each cloud vendor also offers innovative services built on top of the standard capabilities;
    • The second is NAS, traditionally positioned as remote file storage; all cloud vendors now offer standard NAS storage products;
    • The third is the various parallel file systems (PFS). They were originally designed to support traditional enterprise HPC scenarios with highly concurrent, high-throughput data reads; today they mainly support large-scale AI training on the cloud.
  • The middle layer is the various storage middleware. Because storage has inherent locality limitations, it often cannot match the computing services in large-scale concurrency or elastic scheduling, so the industry has introduced storage acceleration middleware between the computing business and the storage services. Alluxio is a typical representative of storage acceleration, and JuiceFS also has many caching and acceleration capabilities of its own. Storage acceleration essentially gives computing services more elastic read and write capability.

Pain points

From a business perspective, storage acceleration has the following pain points:
The first pain point is selection. There is no unified industry standard for the various acceleration middleware, and each has its own limitations. This can be viewed from several angles. The first is protocol compatibility: what protocol does the middleware present to the business? Is it an object storage protocol, a partially compatible POSIX protocol, or 100% POSIX? The second is the cost model: what does the same accelerated bandwidth cost? The third is the data format: is the data format and directory layout of the storage base passed through to the business transparently, or reassembled and converted inside the middleware?
The second pain point is the management of middleware products: how are operation and stability guaranteed, how does data flow between the underlying storage services, and are there supporting capabilities for quota and QoS management and control?

Common solutions

The figure above shows the storage acceleration solutions common in the industry today.
  • The first is object storage + Alluxio. The disadvantage is limited POSIX compatibility, which is mainly constrained by the capabilities of the object storage itself: atomic directory Rename, directory deletion, random writes, overwrites, and appends cannot be supported (see the sketch after this list). The advantage is that the overall cost is relatively low, because it is based on object storage, and Alluxio itself uses a transparent data format: the directory structure and data visible on the object storage can be presented to the business directly.
  • The second solution is object storage + JuiceFS. Its biggest advantage is excellent POSIX compatibility. The overall cost is also relatively low, because it often uses local disks on compute nodes as the cache acceleration media. Note that its data format is private: data is cut into chunks before being stored in object storage, so complete files cannot be seen from the object storage side. The governance cost of this solution varies. If all business runs on the JuiceFS service, there is almost no governance cost; but moving data between JuiceFS and other storage services requires a lot of governance work.
  • The third solution is based on the various parallel file systems. The advantages are good POSIX compatibility, a transparent data format, and low governance cost; however, because high-performance components are used, the industry price is relatively high.
  • The last solution is the object storage + PFS combination launched by various cloud vendors, whose vision is cold data stored in object storage and hot data in PFS. In practice, however, the business experience is not very convenient, and data flow between the two sides also incurs a lot of management cost.
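To make the POSIX gap in the first solution concrete, here is a minimal Go sketch of why directory Rename is expensive and non-atomic on plain object storage: without a native rename API, every key must be copied and then deleted. It assumes an S3-compatible bucket and the AWS SDK for Go; the bucket name and prefixes are illustrative, not CloudFS internals.

```go
package main

import (
	"fmt"
	"strings"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/s3"
)

// renamePrefix emulates `mv src/ dst/` on S3-compatible storage: a
// directory rename is O(number of objects) and not atomic across them.
func renamePrefix(svc *s3.S3, bucket, src, dst string) error {
	// List every object under the source "directory". A real
	// implementation would paginate with ContinuationToken.
	out, err := svc.ListObjectsV2(&s3.ListObjectsV2Input{
		Bucket: aws.String(bucket),
		Prefix: aws.String(src),
	})
	if err != nil {
		return err
	}
	for _, obj := range out.Contents {
		newKey := dst + strings.TrimPrefix(*obj.Key, src)
		// Copy, then delete: two calls per object, with no
		// atomicity guarantee spanning the objects.
		if _, err := svc.CopyObject(&s3.CopyObjectInput{
			Bucket:     aws.String(bucket),
			CopySource: aws.String(bucket + "/" + *obj.Key),
			Key:        aws.String(newKey),
		}); err != nil {
			return err
		}
		if _, err := svc.DeleteObject(&s3.DeleteObjectInput{
			Bucket: aws.String(bucket),
			Key:    obj.Key,
		}); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	svc := s3.New(session.Must(session.NewSession()))
	fmt.Println(renamePrefix(svc, "my-bucket", "data/v1/", "data/v2/"))
}
```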

What is “good” storage acceleration?

In our understanding, "good" storage acceleration should have the characteristics of transparent acceleration, multi-protocol compatibility, elastic scaling, and basic data governance capabilities.

Transparent acceleration

One requirement of transparent acceleration is that the acceleration capability be offered as a service: available out of the box, with a stable SLA guarantee and pay-as-you-go billing. Another requirement is to accelerate the native protocol of the base storage and expose it to the business directly. From the business side, no code changes are needed; only some configuration adjustments are required to see the original directory structure and data format on the base storage. Today, whether cloud storage or enterprise storage, all storage services are fairly mature. We are not reinventing the wheel; we simply want to do transparent acceleration well and solve business problems better.

Multi-protocol compatibility

Multi-protocol compatibility based on object storage requires optimization in the following four aspects:
  • The first is basic acceleration capability, including S3 protocol support, directory tree caching, and automatic write-back to object storage;
  • The second is Rename optimization. Many cloud vendors now support an atomic single-object Rename operation; by connecting to the single-object Rename API, directory Rename performance is optimized to a certain extent;
  • The third is Append support, which connects to the cloud vendors' appendable objects and supports the common Close-Open-Append write pattern (see the sketch after this list);
  • The fourth is the FUSE mount, which provides CSI support and high availability for FUSE mounts: after the FUSE process crashes and restarts, business IO can continue uninterrupted.
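To illustrate the Close-Open-Append pattern named above, here is a minimal Go sketch from the business side: on the FUSE mount this is plain POSIX append, and underneath, the acceleration layer would map it onto the vendor's appendable-object API. The mount path is hypothetical.

```go
package main

import (
	"log"
	"os"
)

// appendRecord performs one close-open-append cycle: reopen the file
// with O_APPEND, write one record, and close it again.
func appendRecord(path, record string) {
	f, err := os.OpenFile(path, os.O_CREATE|os.O_WRONLY|os.O_APPEND, 0o644)
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()
	if _, err := f.WriteString(record + "\n"); err != nil {
		log.Fatal(err)
	}
}

func main() {
	path := "/mnt/cloudfs/train/metrics.log" // hypothetical FUSE mount point
	appendRecord(path, "epoch=1 loss=0.82")
	appendRecord(path, "epoch=2 loss=0.57") // second cycle appends, not truncates
}
```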
In this acceleration solution based on object storage, we will mainly encounter the following three problems.
  1. The first problem is insufficient POSIX compatibility: many machine learning training jobs are built on a standard POSIX file system, so they cannot run on this solution.
  2. The second problem is that to adopt this architecture, users often need to transform the business-level IO model, which is very difficult for algorithm engineers.
  3. The third problem is that, because of the two limitations above, many users treat this solution as an efficient read-only cache, which caps the value it can deliver.
To solve the above problems, after researching related products on the market, we decided to solve the POSIX compatibility problem based on NAS. As a standard cloud storage product, NAS inherently comes with complete POSIX capabilities. We adapt NAS as a storage base at the acceleration layer, handling protocol adaptation and consistency guarantees there, and solving the bandwidth and performance bottlenecks of the NAS product itself. In terms of cost, capacity-mode NAS is somewhat more expensive than object storage, but the overall price/performance ratio is still within an acceptable range.

Elastic scaling

The acceleration layer also needs elastic scaling, and the acceleration components need to be built on a cloud-native architecture. The metadata plane is built on distributed metadata, and the data plane acceleration is built on NVMe SSDs on the cloud-native platform, so both the metadata plane and the data plane can scale elastically.

Data governance

The following important features are required in data governance:
  • Automatic write-back to the base: when businesses write data through the acceleration components, they care a great deal about when that data becomes visible on the object storage base, because many downstream jobs depend on the output files of upstream jobs. We therefore need a deterministic write-back strategy that requires little manual intervention.
  • Cache strategy customization: more cache policies are needed, such as the typical LRFU and TTL, plus support for the preheating mechanisms common in the industry.
  • Multi-task isolation: provides task-level acceleration guarantees.
  • Timely cache updates: supports active updates driven by object storage events, as well as passive pull-based updates driven by a TTL mechanism (a toy sketch follows this list).
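As a toy illustration of the last item (not CloudFS code), the Go sketch below combines the two update paths: an object storage event actively invalidates a cached entry, while a TTL bounds staleness and forces a passive re-pull when no event arrives. All names are illustrative.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

type entry struct {
	value   string // cached metadata, e.g. an object's size or ETag
	fetched time.Time
}

// MetaCache keeps metadata fresh via two paths: object storage events
// (active invalidation) and a TTL (passive re-pull on expiry).
type MetaCache struct {
	mu    sync.Mutex
	ttl   time.Duration
	m     map[string]entry
	fetch func(key string) string // pulls fresh metadata from the base store
}

// Get serves from cache while the entry is younger than the TTL,
// otherwise it re-pulls from the base (the passive update path).
func (c *MetaCache) Get(key string) string {
	c.mu.Lock()
	defer c.mu.Unlock()
	if e, ok := c.m[key]; ok && time.Since(e.fetched) < c.ttl {
		return e.value
	}
	v := c.fetch(key)
	c.m[key] = entry{value: v, fetched: time.Now()}
	return v
}

// OnEvent is called from the object storage event subscription; it drops
// the stale entry so the next Get re-pulls (the active update path).
func (c *MetaCache) OnEvent(key string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	delete(c.m, key)
}

func main() {
	c := &MetaCache{
		ttl:   30 * time.Second,
		m:     map[string]entry{},
		fetch: func(key string) string { return "meta-of-" + key },
	}
	fmt.Println(c.Get("datasets/train/part-0000")) // first call pulls from base
	c.OnEvent("datasets/train/part-0000")          // event invalidates immediately
	fmt.Println(c.Get("datasets/train/part-0000")) // re-pulls fresh metadata
}
```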

CloudFS acceleration practice

Based on the demands of ByteDance's internal business for the above storage acceleration capabilities, we launched a new file system service, evolved from ByteDance's internal HDFS, named CloudFS. The overall technical architecture of CloudFS is essentially the same set of components as the internal HDFS architecture, packaged on the cloud in a productized, miniaturized, multi-tenant form.
In addition to storage acceleration, CloudFS also supports a native HDFS mode and multi-data-source aggregation. The base currently supports object storage, while NAS support is still under adaptation and development. On the upper business layer it has been connected to the big data and AI training ecosystems and to several Volcano Engine products.

Metadata acceleration

In the example above, from the perspective of the training container, two objects are visible in the dataset. The view of the dataset directory tree is consistent with the directory structure of the underlying object storage. The most basic technical feature is caching the object storage's directory structure and pulling it on demand. The metadata service replicates the object storage's directory tree, but stores it as a directory hierarchy rather than the object storage's flat namespace. In addition, we subscribe to object storage event notifications to support active updates; active event notification plus passive on-demand pulls ensure metadata consistency as far as possible. Finally, if the same bucket is mounted multiple times, duplicate objects could appear; we deduplicate identical objects at the metadata level to maximize cache space utilization.
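The flat-to-hierarchical conversion described above can be sketched in a few lines of Go. This toy version only shows turning flat object keys into a directory tree; the real metadata service also does on-demand pulls, event-driven updates, and object deduplication.

```go
package main

import (
	"fmt"
	"strings"
)

type node struct {
	children map[string]*node
	isFile   bool
}

func newNode() *node { return &node{children: map[string]*node{}} }

// insert splits a flat key like "datasets/train/part-0000" into path
// components and materializes each intermediate directory once.
func (n *node) insert(key string) {
	cur := n
	parts := strings.Split(key, "/")
	for i, p := range parts {
		child, ok := cur.children[p]
		if !ok {
			child = newNode()
			cur.children[p] = child
		}
		child.isFile = i == len(parts)-1
		cur = child
	}
}

// print walks the tree, indenting by depth (child order is unspecified).
func (n *node) print(name string, depth int) {
	fmt.Printf("%s%s\n", strings.Repeat("  ", depth), name)
	for c, child := range n.children {
		child.print(c, depth+1)
	}
}

func main() {
	root := newNode()
	for _, key := range []string{ // flat key listing from the bucket
		"datasets/train/part-0000",
		"datasets/train/part-0001",
		"datasets/meta.json",
	} {
		root.insert(key)
	}
	root.print("/", 0)
}
```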

Data plane cache

Next, we introduce the caching of the data plane. An object is divided into multiple data blocks, and each block can have multiple replicas. As shown in the figure above, r1 and r2 are two replicas of the same data block. The data-plane caching strategy is relatively lazy: data is fetched only when user data is accessed for the first time. The replica count is adaptive, based on the business load on the replicas and on the current cache node, so the number of replicas adjusts itself according to business pressure. For cache management we adopt the ARC cache algorithm, which retains more data and ensures hot data stays in the cache.
In addition, we support a preheating mechanism, because many users want all their data cached before a job runs, so that the job does not have to wait at startup. Preheating is implemented as large-scale distribution of preheated replicas over a P2P protocol. For write caching, the multi-replica write cache can write back to the base asynchronously or synchronously. For example, a block is flushed to object storage as soon as it is written, but on the object storage side, the updated file length and content only become visible when the file is closed.
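A back-of-the-envelope Go sketch of the block and replica addressing just described: a read offset maps to one fixed-size block, then to one of that block's replicas, here chosen by load as a stand-in for the adaptive replica selection. The 4 MiB block size is an assumption for illustration only.

```go
package main

import "fmt"

const blockSize = 4 << 20 // assumed 4 MiB blocks (illustrative)

type replica struct {
	node string // cache node holding this copy
	load int    // current load on that node
}

// pickReplica routes the read to the least-loaded replica, a stand-in
// for the adaptive, load-aware replica selection described in the text.
func pickReplica(rs []replica) replica {
	best := rs[0]
	for _, r := range rs[1:] {
		if r.load < best.load {
			best = r
		}
	}
	return best
}

func main() {
	offset := int64(13_000_000)
	blockIdx := offset / blockSize // which cached block serves this read
	replicas := []replica{{"cache-node-a", 7}, {"cache-node-b", 2}}
	fmt.Printf("offset %d -> block %d on %s\n",
		offset, blockIdx, pickReplica(replicas).node)
}
```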

FUSE business entry transformation

We made several changes to the FUSE entry point to improve its stability. The first is the virtio-fs transformation, which replaces /dev/fuse and greatly improves performance. We also added high availability guarantees to the FUSE process: after FUSE crashes and restarts, it recovers its previous state in real time, so business IO may feel a brief stall, but the upper layer does not fail and can keep running. Many training jobs run for a long time in one shot, so this matters a great deal to their stability. FUSE also supports Page Cache to make full use of system memory. The last optimization is synchronous file close. In the original /dev/fuse based design, closing a file actually closes the underlying file asynchronously; we added synchronous close support, which better guarantees file visibility (a user-space sketch of these semantics follows below). This part of the transformation requires installing a kernel module first; it is built into veLinux, the default standard operating system on Volcano Engine, while other systems may need additional modules to enable it.
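The synchronous-close semantics can be sketched in user space as follows; this illustrates the visibility guarantee, not the kernel-module change itself. Close only returns after buffered data has been flushed and synced, so a downstream reader that opens the file afterwards sees complete content.

```go
package main

import (
	"bufio"
	"log"
	"os"
)

// syncCloseFile wraps a buffered file so that Close is synchronous:
// the caller only regains control once the data is durable.
type syncCloseFile struct {
	f *os.File
	w *bufio.Writer
}

func create(path string) (*syncCloseFile, error) {
	f, err := os.Create(path)
	if err != nil {
		return nil, err
	}
	return &syncCloseFile{f: f, w: bufio.NewWriter(f)}, nil
}

func (s *syncCloseFile) Write(p []byte) (int, error) { return s.w.Write(p) }

// Close flushes the buffer and fsyncs before closing the descriptor, so
// downstream readers opening the file after Close see the full content.
func (s *syncCloseFile) Close() error {
	if err := s.w.Flush(); err != nil {
		return err
	}
	if err := s.f.Sync(); err != nil {
		return err
	}
	return s.f.Close()
}

func main() {
	f, err := create("/tmp/output.done")
	if err != nil {
		log.Fatal(err)
	}
	if _, err := f.Write([]byte("job finished\n")); err != nil {
		log.Fatal(err)
	}
	if err := f.Close(); err != nil { // returns only after data is persisted
		log.Fatal(err)
	}
}
```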

Business Practice—AML Platform Training Acceleration

In the training acceleration practice of the Volcano Engine AML platform, there are many GPU machine models: some have local disks and some do not. For training scenarios without local disks, server-side accelerated storage space must be provided; for models with local disks, the local disks on the GPU machines need to be taken over for acceleration. The acceleration unit, the DataNode, is therefore essentially a form between semi-managed and fully managed, and mixed use is a common scenario.
We therefore built the control plane and metadata services on the CloudFS server side, which can also support acceleration-unit DataNodes; the server-side DataNodes are created on demand. If the ECS GPU machine has local disks with sufficient acceleration capability, there is no need to create server-side acceleration units; if the server side is needed to expand cache capacity, it can be expanded in real time. The acceleration-unit DataNodes connect to the business network through ENI, so there is no loss of overall cache bandwidth.

Business Practice—Data Lake Multi-Cloud Management Acceleration

In big data practice under the hybrid cloud scenario, CloudFS, as a component of the Volcano Engine cloud-native computing platform, is deployed in the customer's private data center. We have adapted the object buckets of other cloud vendors, and can accelerate/preheat new data from remote public-cloud object buckets into this private data center. On this basis, the business can complete its subsequent big data processing work in the private data center.
In the first part of the testing, a simple IO streaming test was run, comparing configurations with the caching function and with Page Cache disabled. Without preheating, the first pass must read from object storage, and the business needs much higher concurrency: at 256 concurrent readers, throughput reaches more than 6,000 images/second, a very demanding concurrency requirement. With all cache hits, only 32 concurrent readers are needed to reach 8,800 images/second; with Page Cache or the FUSE-side metadata cache enabled, the results go higher still.
The second part of the testing ran some simple machine learning training task loads on this data set and compared against Goofys. Whether an epoch hits the cache or not, performance improves considerably. Because the first epoch is fetched from the underlying storage, the improvement there is not very pronounced; once all accesses hit the cache, performance more than doubles.

Future plans

Future plans mainly include three aspects:
The first step is to continue polishing the NAS base;
The second is to achieve more fine-grained cache optimization;
Finally, a fine-grained elastic scaling mechanism is established for the cache.
 
Q&A:
Q: Which scenarios require CloudFS acceleration, and how does it perform when accelerating HDFS?
A: Which scenarios require acceleration depends on whether the underlying bandwidth is a bottleneck. If the bandwidth between the computing service and the HDFS cluster is sufficient and the required QPS is not too high, there is no need for storage acceleration. But on public cloud, object storage bandwidth is often limited and QPS generally has a cap; when that cannot meet business requirements, this kind of acceleration is necessary.
Q: How is cache elastic scaling done?
A: We are still optimizing this. The overall idea is simple: it depends on the business load. If bandwidth usage is relatively low, remove some acceleration-unit nodes; if the load reaches a threshold, expand by several acceleration-unit nodes. Because these nodes use Volcano Engine ECS, the mechanism relies on ECS to guarantee the elasticity of this bandwidth resource (a toy sketch follows).
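As a toy Go sketch of that threshold-driven loop (the thresholds and node accounting are illustrative, not the CloudFS control plane): watch cache bandwidth utilization, add ECS-backed acceleration nodes above a high-water mark, and release them below a low-water mark.

```go
package main

import "fmt"

const (
	highWater = 0.80 // scale out above 80% of provisioned bandwidth
	lowWater  = 0.30 // scale in below 30%
)

func main() {
	nodes := 4
	// Stand-in for a polling loop over observed bandwidth utilization.
	for _, utilization := range []float64{0.45, 0.85, 0.90, 0.20} {
		switch {
		case utilization > highWater:
			nodes++ // provision one more ECS cache node
		case utilization < lowWater && nodes > 1:
			nodes-- // drain and release an idle cache node
		}
		fmt.Printf("util=%.2f -> %d cache nodes\n", utilization, nodes)
	}
}
```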
Q: How much iNode metadata can CloudFS store at most? Will too much affect the availability and stability of the cluster?
A: The current scale limit is 5 billion. The related stability issues have been solved in ByteDance's internal HDFS distributed metadata architecture.
 