Born for data elasticity: Alibaba Cloud's cloud-native storage speeds up again

Author: Zhihao, Zhanyi

Running AI and big data applications on Kubernetes has become mainstream for enterprises. While resource elasticity and development and operations efficiency have improved significantly, the compute-storage separation architecture also brings challenges: high network latency, expensive network fees, and insufficient storage service bandwidth.

Take high-performance computing scenarios such as AI training, genomics computing, and industrial simulation as examples: massive computation must be executed concurrently within a short period of time, and many computing instances share the same data source through a file system. Many enterprises mount Alibaba Cloud File Storage NAS or CPFS to computing tasks running on Alibaba Cloud Container Service for Kubernetes (ACK) to achieve high-performance shared access across thousands of computing nodes.

However, as computing power and performance scale up, and as model sizes and workload complexity grow, the demands that high-performance computing places on the data access performance and flexibility of parallel file systems in cloud-native machine learning and big data scenarios keep rising.

How to provide an elastic and extremely fast experience for containerized computing engines has become a new challenge for storage.

To this end, we launched the Elastic File Client (EFC), which builds a cloud-native storage system on top of the high scalability, native POSIX interface, and high-performance directory tree structure of Alibaba Cloud's file storage services. In addition, EFC is combined with Fluid, the cloud-native data orchestration and acceleration system, to provide dataset visibility, elastic scaling, data migration, and compute acceleration, delivering a reliable, efficient, and high-performance shared-access file storage solution for cloud-native AI and big data applications.

Fluid: a new abstraction for cloud-native data

Fluid [1] is a cloud-native distributed data orchestration and acceleration system, designed mainly for data-intensive applications (such as big data and AI).

Different from the traditional storage-oriented PVC, Fluid proposes the concept of an elastic dataset (Dataset) from the application's perspective and abstracts "the process of using data on Kubernetes". Fluid is an open-source project in the Kubernetes ecosystem, jointly initiated by Nanjing University, Alibaba Cloud, and the Alluxio open-source community, and was donated to the CNCF community in 2021.

Fluid allows data to move, be copied, evicted, transformed, and managed flexibly and efficiently between various storage sources (such as NAS, CPFS, OSS, and Ceph) and upper-layer Kubernetes applications, flowing like a fluid.

Fluid implements functions such as CRUD operations, permission control, and access acceleration for datasets, and users can access the abstracted data just as they would access native Kubernetes data volumes. Fluid currently focuses on two important scenarios: dataset orchestration and application orchestration.

  • In terms of dataset orchestration, Fluid can cache the data of a specified dataset to Kubernetes nodes with specified characteristics to improve data access speed.

  • In terms of application orchestration, Fluid can schedule specified applications to nodes that have stored specified datasets to reduce data transmission costs and improve computing efficiency.

The two can also be combined for collaborative orchestration, that is, collaboratively consider data sets and application requirements for node resource scheduling.


Fluid provides a layer of efficient and convenient data abstraction for cloud-native AI and big data applications, and offers the following core capabilities around the abstracted data:

Unified abstraction of application-oriented datasets

The Dataset abstraction not only aggregates data from multiple storage sources, but also describes the data's mobility and characteristics, and provides observability, such as the dataset's total data volume, current cache space size, and cache hit rate. Users can use this information to decide whether to scale the cache system out or in.
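
For example, these observability fields can be read directly from the Dataset object with kubectl. The command and columns match the Dataset status output shown later in this article; the Dataset name demo and the values below are purely illustrative.

$ kubectl get dataset demo
NAME   UFS TOTAL SIZE   CACHED     CACHE CAPACITY   CACHED PERCENTAGE   PHASE   AGE
demo   10.00GiB         10.00GiB   15.00GiB         100.0%              Bound   5m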

Extensible data engine plugin

Although Dataset is a unified abstraction, different storage backends expose different runtime interfaces, and the actual data operations are implemented by different Runtimes. Fluid's Runtimes fall into two categories: CacheRuntime implements cache acceleration (including the open-source distributed caches AlluxioRuntime and JuiceFSRuntime, Alibaba Cloud's EFCRuntime and JindoRuntime, and Tencent Cloud's GooseFSRuntime), while ThinRuntime provides a unified access interface (for distributed storage clients such as s3fs and nfs-fuse) to make it easy to connect third-party storage.

Automated Data Operations

Data preheating, data migration, data backup, and other operations are provided in the form of CRDs, supporting one-off, scheduled, and event-driven modes, so that users can easily integrate them into their own automated operations systems.
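
For example, a cache warm-up can be declared with Fluid's DataLoad CRD. The sketch below assumes the efc-demo Dataset created later in this article; the fields follow Fluid's v1alpha1 API, and the options actually supported may vary with the Fluid and runtime versions.

apiVersion: data.fluid.io/v1alpha1
kind: DataLoad
metadata:
  name: efc-demo-warmup
spec:
  # The Dataset whose data should be preloaded into the cache
  dataset:
    name: efc-demo
    namespace: default
  # Optionally restrict the warm-up to a sub-path of the dataset
  target:
    - path: /

Applying this manifest with kubectl create -f triggers a one-off warm-up; the scheduled and event-driven modes mentioned above build on the same CRDs.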

General Data Acceleration

Distributed data caching is combined with capabilities such as autoscaling, portability, observability, and scheduling to improve data access performance.
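
As a sketch of the elasticity side, cache capacity can be adjusted by changing the number of cache workers declared by a Runtime. The command below is illustrative only; it assumes the efc-demo EFCRuntime defined later in this article and that the installed Fluid version supports changing replicas on a running runtime.

kubectl patch efcruntime efc-demo --type merge -p '{"spec":{"replicas":5}}'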

Runtime platform independence

Fluid supports various Kubernetes form factors such as native, edge, serverless, and multi-cluster, and can run in diverse environments including public cloud, edge, and hybrid cloud. Depending on the environment, the storage client can run in either CSI Plugin mode or sidecar mode.
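
For example, in serverless environments where a CSI Plugin cannot be installed, Fluid can inject the storage client as a sidecar. The Pod below is only a sketch based on Fluid's serverless support; the exact label and injection behavior may differ across Fluid versions, and the efc-demo PVC assumes the Dataset created later in this article.

apiVersion: v1
kind: Pod
metadata:
  name: demo-app
  labels:
    # Ask the Fluid webhook to inject the FUSE client as a sidecar
    serverless.fluid.io/inject: "true"
spec:
  containers:
    - name: app
      image: nginx
      volumeMounts:
        - mountPath: /data
          name: data-vol
  volumes:
    - name: data-vol
      persistentVolumeClaim:
        claimName: efc-demo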


EFC for cloud-native storage, elastic acceleration ensures business stability

After cloud-native modernization, enterprise applications can build more elastic services. The corresponding question is: how should application data be stored so that storage becomes cloud native as well?

What is Cloud Native Storage?

Cloud-native storage is neither simply a storage system built on the cloud nor storage deployed inside a Kubernetes container; it is a storage service that integrates seamlessly with the Kubernetes environment to meet the business's needs for elasticity and agility.

A cloud-native storage service needs to meet the following requirements:

1. Storage service stability: the stability and self-recovery capability of every node in the system must be up to standard. Taking file storage as an example, a failure of an NFS client or FUSE process affects only a single ECS instance, but in a cloud-native architecture a single-point storage failure may affect dozens of Pods in a container cluster.

2. Elastic storage capacity and performance: the performance of traditional distributed storage improves as capacity grows, but in a cloud-native environment performance requirements change rapidly as Pods scale out and in. The storage system must deliver performance elasticity when the computing scale grows rapidly.

3. Support for large-scale scaling of computing Pods: cloud-native application scenarios demand great agility and flexibility. Many scenarios expect fast container startup and flexible scheduling, and it is common for 1,000 to 2,000 Pods to be launched within one minute. This requires that storage volumes can also be mounted quickly in response to Pod changes.

4. Pod-granularity observability: most storage services provide adequate monitoring at the file system level, but only monitoring data from the cloud-native perspective of PVs and datasets truly helps cloud-native platform administrators.

5. Near-local performance under compute-storage separation: separating storage from compute brings flexibility and agility, but network latency and the overhead of remote access protocols also cause the I/O performance of Pods accessing storage to drop significantly. New techniques are needed to reduce this performance penalty.

However, none of the above requirements can be met by the storage backend service or the client alone.

Therefore, Alibaba Cloud launched the Elastic File Client (EFC), which combines the high scalability, native POSIX interface, and high-performance directory tree structure of Alibaba Cloud's file storage services into a cloud-native storage system. It replaces NAS's traditional kernel-mode NFS client and provides acceleration capabilities such as multi-link access, metadata caching, and distributed data caching, along with client-side performance monitoring, QoS, and hot-upgrade capabilities.

At the same time, EFC avoids the problem that open-source FUSE-based POSIX clients cannot fail over within seconds, ensuring business stability during large-scale computing.

Tailored for data-intensive applications: an overview of EFCRuntime's core capabilities

EFCRuntime is a Runtime implementation that provides Dataset access acceleration, with EFC as the cache engine behind it. By managing and scheduling EFCRuntime, Fluid provides dataset visibility, elastic scaling, data migration, and compute acceleration. Using and deploying EFCRuntime with Fluid is simple, compatible with a native Kubernetes environment, and improves data throughput automatically and controllably.

By accessing Alibaba Cloud file storage through EFCRuntime, you gain the following capabilities on top of the basic enterprise-grade features of file storage:

1. POSIX protocol: EFC provides a standard POSIX interface and, combined with the file storage NAS and CPFS services, gives container applications shared data access through the POSIX interface.

2. Second-level failover: EFC provides second-level failover. When the FUSE process crashes for any reason or undergoes a version upgrade, EFC can automatically restart within seconds, so business I/O is almost unaffected.

3. Strongly consistent semantics: EFC implements strong consistency for files and directories through a strongly consistent distributed lease mechanism: a file written in one Pod can be read by other Pods immediately, and after a new file is created all other clients can see it synchronously, making it easier for users to manage data across multiple nodes (see the short verification sketch after this list).

4. Powerful client-side caching: EFC optimizes the FUSE caching logic and delivers better small-file read and write performance, improving performance by more than 50% compared with a traditional NFS client.

5. Distributed caching: EFC includes Alibaba Cloud's self-developed distributed cache technology, which combines the memory of multiple nodes into one large cache pool. Hot data needed for computation no longer has to be read from the remote end every time, and both the throughput and the cache pool naturally scale as the computing scale expands.

6. Small-file prefetching: EFC prefetches hot data in hot directories in a targeted manner, saving the overhead of pulling data on demand.
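
As a quick check of the strong consistency described in item 3, you can write a file from one Pod of the efc-app StatefulSet created later in this article and read it immediately from the other; this is only a sketch and assumes that example's Pod names.

# Write a small file from the first Pod
kubectl exec -it efc-app-0 -- bash -c "echo hello > /data/consistency-check"
# Read it immediately from the second Pod; the new file and its content are visible right away
kubectl exec -it efc-app-1 -- cat /data/consistency-check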


Training time shortened by 87%, outperforming open-source NFS

On a Kubernetes cluster, we used the insightface (ms1m-ibug) dataset [2] and Arena [3] to verify concurrent read speed on this dataset. With EFCRuntime and the local cache enabled, performance is far better than with open-source NFS, and the training time is shortened by 87%. (This test scenario will be covered in detail in follow-up articles.)


How to quickly get started with EFCRuntime?

The following will take Alibaba Cloud file storage NAS as an example to introduce how to quickly use Fluid EFCRuntime to speed up NAS file access.

First, you need to prepare the Alibaba Cloud Container Service ACK Pro cluster and the Alibaba Cloud NAS file system.

Then it takes only about 5 minutes to create the required EFCRuntime environment. The process of using EFCRuntime is very simple; just follow the steps below.

Step 1: Create the Dataset and EFCRuntime

Create a dataset.yaml file, which contains two parts:

  1. First, it contains the custom resource definition of the Dataset, which declares the mount URL of the Alibaba Cloud NAS file system and the sub-path within the NAS (replace both placeholders with your own values).

  2. Second, it creates an EFCRuntime, which is equivalent to starting an EFC distributed cache cluster to provide caching services.

apiVersion: data.fluid.io/v1alpha1
kind: Dataset
metadata:
  name: efc-demo
spec:
  placement: Shared
  mounts:
    - mountPoint: "nfs://<nas_url>:<nas_dir>"
      name: efc
      path: "/"
---
apiVersion: data.fluid.io/v1alpha1
kind: EFCRuntime
metadata:
  name: efc-demo
spec:
  replicas: 3
  master:
    networkMode: ContainerNetwork
  worker:
    networkMode: ContainerNetwork
  fuse:
    networkMode: ContainerNetwork
  tieredstore:
    levels:
      - mediumtype: MEM
        path: /dev/shm
        quota: 15Gi

1. mountPoint: the path of the NAS or CPFS file system to mount. The NAS format is nfs://<nas_url>:<nas_dir> and the CPFS format is cpfs://<cpfs_url>:<cpfs_dir>; if no subdirectory is required, the root directory can be used. For details, see the documentation [4]: https://help.aliyun.com/document_detail/600930.html?spm=a2c4g.207353.0.0.431b113b6APACM

2. replicas: the number of cache workers in the distributed cache cluster, which can be adjusted according to the memory configuration of the compute nodes and the size of the dataset. It is recommended that quota multiplied by replicas be greater than the total size of the dataset to be cached.

3. networkMode: the optional values are ContainerNetwork and HostNetwork. In an ACK environment, ContainerNetwork is recommended; using the container network incurs no additional performance loss.

4. mediumtype: the cache medium, which supports exactly one of HDD, SSD, and MEM. MEM stands for memory and is recommended; when MEM is used, the cache data directory specified by path must be a memory file system (for example, tmpfs).

5. path: the cache data directory of the EFC cache Worker. It is recommended to keep /dev/shm.

6. quota: the maximum cache capacity provided by a single Worker, which can be adjusted according to the memory configuration of the compute nodes and the size of the dataset. It is recommended that quota multiplied by replicas be greater than the total size of the dataset to be cached.

Run the following command to create the Dataset and EFCRuntime:

kubectl create -f dataset.yaml

Check the status of the Dataset:

$ kubectl get dataset efc-demo

The expected output is:

NAME       UFS TOTAL SIZE   CACHED   CACHE CAPACITY   CACHED PERCENTAGE   PHASE   AGE
efc-demo                                                                  Bound   24m
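
You can also verify that the cache cluster declared by the EFCRuntime has started; this is a sketch, and the exact status columns in the output depend on the installed Fluid version.

kubectl get efcruntime efc-demo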

Step 2: Create an application container to experience the acceleration effect

You can use the EFC acceleration service by creating application containers, or by submitting machine learning jobs, to experience the related functions.

Next, we create two application containers that access the same 10 GB file in the dataset. You can also test with other files, which must be stored in the NAS file system in advance.

Define the following app.yaml file:

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: efc-app
  labels:
    app: nginx
spec:
  serviceName: nginx
  replicas: 2
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx
        command: ["/bin/bash"]
        args: ["-c", "sleep inf"]
        volumeMounts:
        - mountPath: "/data"
          name: data-vol
      volumes:
        - name: data-vol
          persistentVolumeClaim:
            claimName: efc-demo

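Deploy the StatefulSet (assuming the definition above was saved as app.yaml):

kubectl create -f app.yaml
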
Execute the following command to view the size of the data file to be accessed:

kubectl exec -it efc-app-0 -- du -h /data/allzero-demo
10G     /data/allzero-demo

Run the following command to measure the time to read the file in the first application container (if you use your own data file, replace /data/allzero-demo with the real file path):

kubectl exec -it efc-app-0 -- bash -c "time cat /data/allzero-demo > /dev/null"

The expected output is:

real    0m15.792s
user    0m0.023s
sys     0m2.404s

Then, in the other container, measure the time to read the same 10 GB file (if you use your own data file, replace /data/allzero-demo with the real file path):

kubectl exec -it efc-app-1 -- bash -c "time cat /data/allzero-demo > /dev/null"

Expected output:

real    0m9.970s
user    0m0.012s
sys     0m2.283s

From the output above, the read throughput increased from 648 MiB/s to 1034.3 MiB/s, a 59.5% improvement in read efficiency for the same file.

Summary and Outlook

Combining Fluid and EFC provides better support for AI and big data workloads in cloud-native scenarios: it improves data usage efficiency and, through standardized data warm-up and migration operations, integrates better with automated operations and maintenance.

In addition, we will also support running in Serverless scenarios, so as to provide a better distributed file storage access experience for Serverless containers.

Finally, you are welcome to join our DingTalk group (group number: 33214567) to participate in the discussion.

Related Links:

[1] Fluid

https://github.com/fluid-cloudnative/fluid

[2] insightface (ms1m-ibug) dataset

https://github.com/deepinsight/insightface/tree/master/recognition/_datasets_#ms1m-ibug-85k-ids38m-images-56

[3] Arena

https://help.aliyun.com/document_detail/212117.html?spm=a2c4g.212116.0.0.47f66806YlI7y4

[4] EFC accelerates NAS or CPFS file access

https://help.aliyun.com/document_detail/600930.html?spm=a2c4g.207353.0.0.431b113b6APACM
