Most of the machine learning and data lake computing workloads in Volcano Engine-related businesses run on the cloud-native K8s platform. Storage-compute separation and elastic scaling under the cloud-native architecture have greatly driven the development of storage acceleration, and the industry has derived a variety of storage acceleration services. However, given the diversity of computing and customer scenarios, there is still no industry-standard storage acceleration practice, and many users face confusion when making a selection. We have built a cloud-native storage acceleration service on Volcano Engine that adapts to the various computing scenarios of machine learning and data lakes, and we are committed to providing businesses with an easy-to-use, transparent acceleration service. Based on our business practice on Volcano Engine, this article shares our experience and thinking on storage acceleration.
Author: Guo Jun, Head of Big Data File Storage Technology at Volcano Engine
Cloud-native storage acceleration requirements

Under the cloud-native architecture, the basic services supporting business can be divided into three parts: computing, storage, and middleware.
- The top layer is the computing business, most of which runs on the K8s base. On top of this computing base run big data tasks, AI training tasks, and various computing frameworks.
- The bottom layer is the storage service. Separation of storage and computing is the industry trend, and the standard storage services on the cloud can be divided into the following three categories:
- The first category is object storage. AWS S3 is the standard product, and each cloud vendor also offers innovative services built on these standard capabilities;
- The second category is NAS. Traditionally positioned as remote file storage, NAS is now a standard storage product offered by essentially all cloud vendors;
- The third category is the various parallel file systems, known as PFS. Originally designed to support traditional enterprise HPC scenarios with highly concurrent, high-throughput data reading, PFS is now mainly used to support large-scale AI training on the cloud.
- The middle layer consists of various storage middleware. Due to the inherent locality limitations of storage, it often cannot keep up with computing services under large-scale concurrency or elastic scheduling. The industry has therefore introduced storage acceleration middleware between the computing business and the storage services. Alluxio is a typical representative of storage acceleration, and JuiceFS also has many caching and acceleration capabilities built in. Storage acceleration essentially provides computing services with more elastic read and write capability.
Pain points
From a business perspective, storage acceleration has the following pain points:
The first pain point is selection. Because there is no unified industry standard for acceleration middleware, each middleware has different limitations. This can be viewed from several perspectives. The first is protocol compatibility: what protocol does the middleware present to the business? Is it an object storage protocol, a partially compatible POSIX protocol, or a 100% POSIX protocol? The second is the cost model: the price required for the same accelerated bandwidth differs across products. The third is the data format: is the data format and directory structure of the storage base passed through transparently to the business, or reassembled and converted inside the middleware?
The second pain point is the management of middleware products. For storage acceleration middleware, what capabilities are available to support operation and stability, data flow between underlying storage services, and quota and QoS management and control?
Common solutions
The picture above shows the common storage acceleration solutions in the current industry.
- The first is object storage + Alluxio. The disadvantage is limited POSIX compatibility, which is mainly constrained by the capabilities of the object storage itself: functions such as atomic directory Rename, directory deletion, random writes, overwrites, and appends cannot be supported. The advantage is that the overall cost is relatively low, because it is based on object storage, and Alluxio keeps the data format transparent: the directory structure and data seen on the object storage can be presented directly to the business.
- The second solution is object storage + JuiceFS. Its biggest advantage is excellent overall POSIX compatibility. The overall cost is also relatively low, because it often uses local disks on compute machines as the cache acceleration medium. Note, however, that its data format is private: the data stored in the object storage is cut into chunks, so complete files cannot be seen from the object storage side. The governance cost of this solution varies. If all business runs on the JuiceFS service, there is almost no governance cost; but if you want data to flow between JuiceFS and other storage services, a lot of governance work is required.
- The third solution is based on various parallel file systems. The advantages are good POSIX compatibility, a transparent data format, and low governance cost. However, because it uses high-performance components, its price in the industry is relatively high.
- The last solution is the combined object storage + PFS capability that various cloud vendors have launched. The vision is that cold data is stored in object storage and hot data in PFS. In practice, however, the business experience is not very convenient, and the data flow between the two sides also incurs a lot of management cost.
What is “good” storage acceleration?
In our understanding, "good" storage acceleration should have the characteristics of transparent acceleration, multi-protocol compatibility, elastic scaling, and basic data governance capabilities.
Transparent acceleration
One requirement for transparent acceleration is that the service-based acceleration capability must be available out of the box, with a stable SLA guarantee and pay-as-you-go billing. Another requirement is to accelerate the native protocol of the base storage and expose it directly to the business: from the business perspective, no code changes are needed, and after a few configuration adjustments the business sees the original directory structure and data format on the base storage. Whether cloud storage or enterprise storage, storage services today are relatively mature. We are not reinventing the wheel; we just want to do transparent acceleration well and solve business problems better.
Multi-protocol compatibility
Multi-protocol compatibility based on object storage requires optimization in the following four aspects:
- The first is basic acceleration capability, including S3 protocol support, directory tree caching, and the ability to automatically write back to object storage;
- The second is Rename optimization. Many cloud vendors now support an atomic Rename operation on a single object; by connecting to this single-object Rename API, directory Rename performance is optimized to a certain extent;
- The third is Append support, which connects to cloud vendors' appendable objects and supports the common Close-Open-Append writing pattern;
- The fourth is FUSE mounting, which provides CSI support for FUSE mounts with high availability: after the FUSE process crashes and restarts, business IO can continue uninterrupted.
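To see why directory Rename needs this optimization, here is a minimal Python sketch, not any vendor's API, of how a directory rename must be emulated on a plain flat object store: every object under the prefix is copied and deleted individually, which is O(n) and non-atomic. The dict standing in for a bucket is an illustrative assumption.

```python
# Toy flat object store: key -> bytes, as in S3. A "directory" rename must be
# emulated by copying every object under the prefix and deleting the original,
# one object at a time -- O(n) operations, and not atomic as a whole.
def rename_directory(store: dict, src_prefix: str, dst_prefix: str) -> int:
    moved = 0
    for key in [k for k in store if k.startswith(src_prefix)]:
        new_key = dst_prefix + key[len(src_prefix):]
        store[new_key] = store.pop(key)  # copy + delete, per object
        moved += 1
    return moved
```

A single-object atomic Rename API removes the copy-plus-delete round trip per object, but a directory rename still has to touch every key under the prefix, which is why it is only optimized "to a certain extent".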
In this acceleration solution based on object storage, we mainly encounter the following three problems.
- The first problem is insufficient POSIX compatibility: many machine learning training jobs are built on a standard POSIX file system, so they cannot run on this solution.
- The second problem is that users who want to build their business on this architecture often need to transform the IO model at the business level, which is very difficult for algorithm engineers.
- The third problem is that, due to the two limitations above, many users treat this solution as an efficient read-only cache, which limits the upper bound of its value.
To solve these problems, after researching related products on the market, we decided to solve the POSIX compatibility issue based on NAS. As a standard cloud storage product, NAS inherently has complete POSIX capabilities. With NAS adapted as the storage base, the acceleration layer performs protocol adaptation and consistency assurance to overcome the bandwidth and performance bottlenecks of the NAS product itself. In terms of cost, capacity-based NAS is somewhat more expensive than object storage, but the overall price/performance ratio is still within an acceptable range.
Elastic scaling
The acceleration layer also needs elastic scaling, so the acceleration component itself is built on a cloud-native architecture. The metadata plane is built as distributed metadata, and the data plane acceleration is built on NVMe SSDs on the cloud-native platform, so both the metadata and the data plane can scale elastically.
Data governance
The following important features are required in data governance:
- Automatic write-back to the base: when businesses write data through the acceleration component, they care greatly about when that data becomes visible on the object storage base, because many downstream jobs depend on the output files of upstream jobs. We therefore need a deterministic write-back strategy that does not require much manual intervention.
- Cache strategy customization: more cache strategies need to be supported, such as the typical LRFU and TTL, along with the preheating mechanisms common in the industry.
- Multi-task isolation: provide task-level acceleration guarantees.
- Timely cache updates: support active updates driven by object storage event notifications, as well as passive pull-based updates driven by a TTL mechanism.
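The two update paths in the last bullet can be sketched as follows. This is a toy, single-process illustration, not the CloudFS implementation; the `fetch` callback and the injectable clock are assumptions added for testability. Entries expire after a TTL (passive pull), and an object-storage event notification drops an entry early (active update).

```python
import time

class MetadataCache:
    """Toy metadata cache: entries expire after a TTL (passive pull) and can
    be invalidated early by object-storage change events (active update)."""

    def __init__(self, ttl_seconds: float, fetch, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.fetch = fetch          # callable that pulls fresh metadata from the base store
        self.clock = clock
        self._entries = {}          # path -> (metadata, fetched_at)

    def get(self, path):
        entry = self._entries.get(path)
        if entry is None or self.clock() - entry[1] >= self.ttl:
            meta = self.fetch(path)                     # pull on demand / after TTL expiry
            self._entries[path] = (meta, self.clock())
            return meta
        return entry[0]

    def on_event(self, path):
        # Active update: a change event from the object store drops the stale entry,
        # so the next get() re-pulls immediately instead of waiting for the TTL.
        self._entries.pop(path, None)
```

Combining both paths bounds staleness by the TTL even if an event notification is lost, which is why the two mechanisms are used together.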
CloudFS acceleration practice
Based on the demand of Byte's internal businesses for the storage acceleration capabilities above, we launched a new File System service evolved from Byte's internal HDFS, named CloudFS. The overall technical architecture of CloudFS and the internal HDFS architecture are essentially the same set of components, packaged on the cloud as a productized, miniaturized, multi-tenant service.
In addition to storage acceleration, CloudFS also supports a native HDFS mode and multi-data-source aggregation. The base currently supports object storage, while NAS support is still in the adaptation and development stage. On top, it has connected to the various big data and AI training ecosystems and to several Volcano Engine technical products.
Metadata acceleration
In the example above, from the perspective of the training container, two objects can be seen in the dataset, and the view of the dataset's directory tree is consistent with the directory structure of the underlying object storage. The most basic technical feature is caching the directory structure of the object storage and pulling it on demand. The metadata service replicates the directory tree of the object storage, but stores it as a directory hierarchy rather than the flat structure of the object storage. In addition, we subscribe to object storage event notifications to support active updates; active event notifications plus passive on-demand pulls ensure the consistency of the metadata as far as possible. Finally, if the same Bucket is mounted multiple times, there may be duplicate Objects; we deduplicate identical Objects at the metadata level to maximize cache space utilization.
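The conversion from the flat object-key listing to a hierarchical view can be illustrated with a small sketch. The nested-dict representation is purely illustrative and is not the metadata service's actual data structure.

```python
def build_tree(keys):
    """Fold a flat object-store key listing (e.g. "data/sub/b.txt") into a
    nested directory tree, as a metadata service mirroring a bucket would.
    Directories become dicts; a file leaf is marked with None."""
    root = {}
    for key in keys:
        parts = key.split("/")
        node = root
        for d in parts[:-1]:
            node = node.setdefault(d, {})  # create directory levels on demand
        node[parts[-1]] = None
    return root
```

Once the listing is folded this way, directory operations (listing, stat of a subtree) become tree lookups instead of repeated prefix scans over the flat namespace.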
Data plane cache
Next is the caching of the data plane. We divide each object into multiple data blocks, and each data block can have multiple Replicas; in the figure above, r1 and r2 are two Replicas of the same data block. The data-plane caching strategy is relatively lazy: data is fetched only when it is first accessed. The number of replicas is adaptive, based on the business load on each replica and on the current cache node, so the replica count adjusts itself according to business pressure. For cache management, the ARC cache algorithm is adopted, which retains more data and ensures that hot data stays in the cache.
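A minimal sketch of the two ideas above: splitting an object into fixed-size cache blocks, and deriving a block's replica count from its read load. The 4 MiB block size, the load metric, and the replica bounds are illustrative assumptions, not CloudFS internals.

```python
import math

BLOCK_SIZE = 4 * 1024 * 1024  # assumed 4 MiB block size, for illustration only

def split_into_blocks(object_size: int, block_size: int = BLOCK_SIZE):
    """Return the (offset, length) of each cache block of an object;
    the final block may be shorter than block_size."""
    return [(off, min(block_size, object_size - off))
            for off in range(0, object_size, block_size)]

def target_replicas(reads_per_sec: float, per_replica_capacity: float,
                    min_replicas: int = 1, max_replicas: int = 8) -> int:
    """Adaptive replica count: hot blocks get enough copies to carry their
    read load, cold blocks shrink back toward a single copy."""
    if reads_per_sec <= 0:
        return min_replicas
    need = math.ceil(reads_per_sec / per_replica_capacity)
    return max(min_replicas, min(max_replicas, need))
```

The capped range keeps a hotspot from consuming the whole cache while still letting the copy count follow business pressure, which is the self-adjusting behavior the text describes.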
In addition, we support a preheating mechanism, because many users want all of their data loaded into the cache before a job runs, so that the job does not have to wait later. Preheating is implemented as large-scale distribution of preheated replicas over a P2P protocol. For the write cache, the multi-replica write cache can be written back to the base asynchronously or synchronously. For example, after a Block is written, it is flushed to the object storage immediately; however, the update of the file's length or content becomes visible on the base only when the file is closed.
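The close-time visibility rule can be illustrated with a toy write-back file: blocks are pushed toward the base eagerly as they fill, but the object's length and content become observable only when `close()` commits them. The dict-backed store and the tiny 4-byte block size are assumptions for illustration.

```python
class WriteBackFile:
    """Toy write cache: each full block is flushed toward the base store
    eagerly, but the object only becomes visible on close()."""

    def __init__(self, store: dict, key: str, block_size: int = 4):
        self.store, self.key, self.block_size = store, key, block_size
        self._buf = bytearray()
        self._staged = []  # blocks already pushed toward the base, not yet visible

    def write(self, data: bytes):
        self._buf.extend(data)
        while len(self._buf) >= self.block_size:
            # Eager flush of a full block (stands in for the async upload).
            self._staged.append(bytes(self._buf[:self.block_size]))
            del self._buf[:self.block_size]

    def close(self):
        self._staged.append(bytes(self._buf))
        # Only now does the object's length/content update become observable.
        self.store[self.key] = b"".join(self._staged)
```

This mirrors the behavior described above: bytes leave the cache as soon as a block fills, yet readers of the base see nothing until close.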
FUSE business entry transformation
We made several changes to the FUSE entry to improve its stability. The first is the FUSE virtio transformation, which replaces /dev/fuse and greatly improves performance. We also added high availability guarantees for the FUSE process: after FUSE crashes and restarts, it is restored to its previous state in real time, so business IO may stall briefly, but the upper-layer job does not fail and can continue running. Many training jobs run for a long time in a single pass, so this matters greatly for their stability. In addition, FUSE supports Page Cache to make maximum use of system memory. The last optimization is synchronous file close: in FUSE's original /dev/fuse-based design, closing a file actually closes the underlying file asynchronously, so we added support for synchronous close, which better guarantees file visibility. This part of the transformation requires installing a kernel module first; it is already built into veLinux, the default standard operating system on Volcano Engine, while other systems may need to install some modules to enable this function.
Business Practice—AML Platform Training Acceleration
In the training acceleration practice of the Volcano Engine AML platform, there are many GPU machine models: some have local disks and some do not. For training scenarios without local disks, server-side accelerated storage space must be provided; for models with local disks, the local disk on the GPU machine needs to be taken over for acceleration. The acceleration unit DataNode is therefore essentially a form between semi-managed and fully managed, and mixed use is a common scenario.
So we built the control plane and metadata services on the CloudFS server side, which can also host acceleration unit DataNodes; these server-side DataNodes are created on demand. If the GPU ECS machine has local disks with sufficient acceleration capability, there is no need for server-side DataNodes; if the server side is needed to expand cache capacity, it can be expanded in real time. The acceleration unit DataNode connects to the business network through ENI, so there is no loss of overall cache bandwidth.
Business Practice—Data Lake Multi-Cloud Management Acceleration
In the practice of big data business in hybrid cloud scenarios, CloudFS, as a component of the Volcano Engine cloud-native computing platform, is deployed in the customer's private data center. We have adapted to the object buckets of other cloud vendors, and can remotely accelerate/preheat new data from public cloud object buckets into the private data center. On this basis, the relevant business can complete its subsequent big data processing within the private data center.
The first part of our testing was a simple test of the IO stream, comparing results with the caching function and Page Cache enabled or disabled. Without preheating, the first pass must read from the object storage, so the business needs higher concurrency: at 256 concurrent readers, throughput can exceed 6,000 images/second, which is a very high concurrency requirement. With all cache hits, only 32 concurrent readers are needed to reach 8,800 images/second; with Page Cache or FUSE-side metadata caching enabled, the results can be even higher.
The second part of the test ran some simple machine learning training workloads on this dataset and compared against Goofys. Whether or not an Epoch hits the cache, performance improves greatly. Because the first Epoch is fetched from the underlying storage, its improvement is not very obvious; once all accesses hit the cache, performance more than doubles.
Future plans
Future plans mainly include three aspects:
The first is to continue polishing the NAS base;
The second is to achieve more fine-grained cache optimization;
Finally, a fine-grained elastic scaling mechanism is established for the cache.
Question time
Q: Which scenarios require CloudFS acceleration, and how does the acceleration perform for HDFS?
A: Which scenarios require acceleration depends on whether the underlying bandwidth is a bottleneck. If the bandwidth between the computing service and the HDFS cluster is sufficient and the required QPS is not too high, there is no need for storage acceleration. But on the public cloud, the bandwidth of object storage is often limited and QPS generally has a cap; when these cannot meet business requirements, such acceleration becomes necessary.
Q: How is cache elastic scaling done?
A: We are still optimizing this. The overall idea is simple and depends on the business load: if bandwidth utilization is relatively low, remove some acceleration unit nodes; if the load reaches a threshold, expand by several acceleration unit nodes. Because these nodes use Volcano Engine ECS, the mechanism relies on ECS to guarantee the elasticity of this bandwidth resource.
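The threshold policy described in this answer can be sketched as a simple decision function. The utilization thresholds and step size are hypothetical values chosen for illustration.

```python
def scale_decision(current_nodes: int, bandwidth_util: float,
                   low: float = 0.3, high: float = 0.8,
                   min_nodes: int = 1, step: int = 2) -> int:
    """Hypothetical threshold policy for acceleration-unit autoscaling:
    add ECS-backed nodes when cache bandwidth utilization is high,
    shrink gradually when it is low, otherwise hold steady."""
    if bandwidth_util >= high:
        return current_nodes + step            # scale out under pressure
    if bandwidth_util <= low and current_nodes > min_nodes:
        return max(min_nodes, current_nodes - 1)  # scale in slowly when idle
    return current_nodes
```

Scaling out in larger steps than scaling in is a common choice here: it reacts quickly to load spikes while avoiding thrashing when utilization hovers near a threshold.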
Q: How many iNodes of metadata can CloudFS store at most? Will too many affect the availability and stability of the cluster?
A: The current scale limit is 5 billion. Stability-related issues have been addressed in the distributed metadata architecture of Byte's internal HDFS.