Hybrid cloud optimized data access based on ACK Fluid (3): Accelerate read access to third-party storage while reducing costs and improving efficiency

Author: Cheyang

Previous review:

This series will introduce how to support and optimize hybrid cloud data access scenarios based on ACK Fluid. Please refer to related articles:

Hybrid cloud optimized data access based on ACK Fluid (1): Scenarios and architecture

Hybrid cloud optimized data access based on ACK Fluid (2): Building a bridge between elastic computing instances and third-party storage

In the previous article, "Building a Bridge between Elastic Computing Instances and Third-Party Storage", we introduced how to access third-party distributed storage through ACK Fluid, enabling elastic computing instances (ECI and ECS) to access and transfer data with storage systems outside the cloud. That solves the first phase of moving to the cloud: connectivity.

In production environments, once cloud-based computing routinely accesses off-cloud storage systems, performance, cost, and stability all need to be considered. For example, what is the annual dedicated-line cost of accessing off-cloud data from the cloud? Is there a significant gap between the running time of computing tasks on the cloud and that of the original IDC tasks? And if the dedicated line fails, how can the impact on cloud computing tasks be reduced?

In this article, we will focus on how to accelerate third-party storage access, achieve better performance, lower costs, and reduce dependence on dedicated line stability.

Overview

Even though cloud-based computing can access an enterprise's off-cloud storage through Kubernetes' standard PV storage volume interface, challenges remain in performance and cost:

  • Limited data access bandwidth and high latency: The latency and limited bandwidth of accessing off-cloud storage from the cloud cause high-performance computing tasks to run for a long time with low utilization of computing resources.
  • Redundant data reads and expensive network costs: The same data is read repeatedly during deep learning hyperparameter tuning and automatic parameter tuning tasks. Because the native Kubernetes scheduler is unaware of the data cache state, application scheduling results are poor, caches cannot be reused, and repeated data pulls drive up external network and dedicated-line costs.
  • Off-cloud distributed storage becomes a bottleneck under concurrent data access, with performance and stability risks: When large-scale computing power concurrently accesses off-cloud storage and the I/O pressure of deep learning training rises, off-cloud distributed storage easily becomes a performance bottleneck, affecting computing tasks and potentially bringing down the entire computing cluster.
  • Sensitive to network stability: If the network between the public cloud and the data center is unstable, data synchronization errors occur and applications become unavailable.
  • Data security requirements: Metadata and data must be protected and must not be persisted to cloud disks.

ACK Fluid provides general-purpose acceleration for PV storage volumes based on JindoRuntime. Any third-party storage exposed as a PVC can simply, quickly, and securely gain data access acceleration through a distributed cache, which brings the following benefits:

1. Zero adaptation cost: As long as the third-party storage exposes a PVC through the CSI protocol, it can be used immediately without additional development.

2. Greatly improved data access performance and engineering efficiency:

a. Through on-access and policy-based data warm-up, accessing off-cloud data performs on par with data residing inside the cloud computing cluster.

b. Elastic data access bandwidth copes with high concurrency: throughput can scale up to hundreds of Gbps and shrink back to zero, achieving a dynamic balance between low cost and high throughput.

c. Data-cache affinity-aware scheduling avoids cross-network data access and reduces latency.

3. Avoid repeated reads of hot data and save network costs: Hot data is kept on the cloud in the distributed cache, reducing repeated reads from remote storage and the associated network traffic.

4. Data-centric automated operation and maintenance enables efficient data access and improves O&M efficiency: this includes automated and scheduled data cache warm-up to avoid repeated data pulls (a sketch follows at the end of this overview), as well as cache expansion, shrinking, and cleanup for automated cache management.

5. Keep metadata and data off disk through distributed in-memory caching for better security: For users who are sensitive to data security, ACK Fluid provides a distributed in-memory cache that performs well while relieving concerns about data landing on disk.

Summary: For cloud computing that accesses third-party storage through PVC, ACK Fluid provides out-of-the-box benefits: high performance, low cost, automation, and no data landing on disk.
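As an illustration of the scheduled warm-up mentioned in point 4, below is a minimal, hypothetical sketch of a periodically triggered DataLoad. It assumes the installed Fluid version supports the Cron policy for DataLoad; the Dataset name and schedule are placeholders, and the DataLoad resource itself is walked through step by step in the demo below.

apiVersion: data.fluid.io/v1alpha1
kind: DataLoad
metadata:
  name: dataset-warmup-cron
spec:
  dataset:
    name: pv-demo-dataset    # placeholder: the Dataset to warm up
    namespace: default
  policy: Cron               # assumption: requires a Fluid version with Cron policy support
  schedule: "0 2 * * *"      # warm the cache every day at 02:00
  loadMetadata: true
  target:
    - path: /
      replicas: 1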

Demo

1. Prerequisites

  • An ACK Pro Edition cluster has been created, with cluster version 1.18 or later. For specific operations, see Creating an ACK Pro Edition Cluster [1].
  • The cloud native AI suite has been installed and the ack-fluid component has been deployed. Important: If you have installed open source Fluid, please uninstall it before deploying the ack-fluid component.


  • If the cloud native AI suite is not installed: enable Fluid data acceleration during installation. For specific operations, see Installing the Cloud Native AI Suite [2].
  • The cloud native AI suite has been installed: Deploy ack-fluid on the cloud native AI suite page of the Container Service management console.


  • The ACK cluster can be accessed through kubectl. For specific operations, see Connecting to the Cluster Through the kubectl Tool [3].
  • The PV storage volume and PVC storage volume claim for the storage system to be accessed have been created. In Kubernetes, different storage systems have different volume creation methods; to ensure a stable connection between the storage system and the Kubernetes cluster, prepare them according to the official documentation of your storage system (a minimal example is sketched below). Note: For hybrid cloud scenarios, for the sake of data security and performance, it is recommended to configure the data access mode as read-only.
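For reference, here is a minimal sketch of a read-only PV/PVC pair. It assumes an NFS-backed off-cloud volume; the server address, export path, and capacities are placeholders, and the official documentation of your actual storage system takes precedence.

apiVersion: v1
kind: PersistentVolume
metadata:
  name: demo-pv
spec:
  capacity:
    storage: 30Gi
  accessModes:
    - ReadOnlyMany
  persistentVolumeReclaimPolicy: Retain
  nfs:
    server: 192.168.0.1      # placeholder: off-cloud storage endpoint
    path: /share/data        # placeholder: exported directory
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: demo-pvc
spec:
  accessModes:
    - ReadOnlyMany
  resources:
    requests:
      storage: 5Gi
  storageClassName: ""       # empty string disables dynamic provisioning
  volumeName: demo-pv        # bind statically to the PV above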

2. Query the PV storage volume and PVC storage volume claim information

Execute the following command to query the information of the PV storage volume and the PVC storage volume claim in Kubernetes.

$ kubectl get pvc,pv

Expected output:

NAME                                          STATUS   VOLUME                          CAPACITY   ACCESS MODES   STORAGECLASS   AGE
persistentvolumeclaim/demo-pvc                Bound    demo-pv                         5Gi        ROX                           19h

NAME                                             CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM                           STORAGECLASS   REASON   AGE
persistentvolume/demo-pv                         30Gi       ROX            Retain           Bound    default/demo-pvc                                        19h

The PV storage volume demo-pv has a capacity of 30 GiB, supports the ReadOnlyMany (ROX) access mode, and is bound to the PVC named demo-pvc, so it can be used normally.

3. Create Dataset and JindoRuntime

1) Create the dataset.yaml file. The following YAML file contains two Fluid resource objects to be created: a Dataset and a JindoRuntime.

  • Dataset: declares the PVC storage volume to be mounted.
  • JindoRuntime: configures the JindoFS distributed cache system to be launched, including the number of replicas of the cache Worker component and the maximum cache capacity available to each Worker.
apiVersion: data.fluid.io/v1alpha1
kind: Dataset
metadata:
  name: pv-demo-dataset
spec:
  mounts:
    # The pvc:// scheme mounts an existing PVC in the same namespace as the data source.
    - mountPoint: pvc://demo-pvc
      name: data
      path: /
  accessModes:
    - ReadOnlyMany        # read-only access, recommended for hybrid cloud scenarios
---
apiVersion: data.fluid.io/v1alpha1
kind: JindoRuntime
metadata:
  # Must have the same name as the Dataset it serves.
  name: pv-demo-dataset
spec:
  replicas: 2             # number of cache Worker replicas
  tieredstore:
    levels:
      - mediumtype: MEM   # cache in memory to avoid persisting data to disk
        path: /dev/shm
        quota: 10Gi       # maximum cache capacity per Worker
        high: "0.9"       # upper watermark ratio that triggers cache eviction
        low: "0.8"        # lower watermark ratio that eviction drains down to

The key parameters of both resource objects are annotated inline in the configuration file above. Note that the total cache capacity is replicas × quota: with 2 Workers at 10 GiB each, the cache system provides 20 GiB, which matches the CACHE CAPACITY shown in step 4.

2) Execute the following command to create Dataset and JindoRuntime resource objects

$ kubectl create -f dataset.yaml

3) Execute the following command to check the deployment status of Dataset

$ kubectl get dataset pv-demo-dataset

Expected output is shown below. Description: The initial startup of the JindoFS cache system involves pulling container images, which may take 2 to 3 minutes depending on factors such as the network environment. Once the Dataset is in the Bound state, the JindoFS cache system has started normally in the cluster, and application Pods can access the data defined in the Dataset.
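An illustrative output follows; the exact values depend on your dataset and environment, and before any warm-up the cached amount is 0:

NAME              UFS TOTAL SIZE   CACHED   CACHE CAPACITY   CACHED PERCENTAGE   PHASE   AGE
pv-demo-dataset   10.96GiB         0.00B    20.00GiB         0.0%                Bound   2m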

4. Create DataLoad to perform cache warm-up

Because the first access cannot hit the data cache, the data access efficiency of application Pods may be low. Fluid provides the DataLoad cache warm-up operation to improve the efficiency of first-time data access.

1) Create the dataload.yaml file. The code example is as follows.

apiVersion: data.fluid.io/v1alpha1
kind: DataLoad
metadata:
  name: dataset-warmup
spec:
  # The Dataset whose cache should be warmed up.
  dataset:
    name: pv-demo-dataset
    namespace: default
  # Also pre-load file system metadata (file lists, sizes, and so on).
  loadMetadata: true
  target:
    - path: /          # warm up the entire dataset
      replicas: 1      # number of cache replicas for the warmed data

The key parameters of the above resource object are annotated inline.

2) Execute the following command to create the DataLoad object

$ kubectl create -f dataload.yaml

3) Execute the following command to check the DataLoad status

$ kubectl get dataload dataset-warmup

Expected output:

NAME             DATASET           PHASE      AGE   DURATION
dataset-warmup   pv-demo-dataset   Complete   62s   12s

4) Execute the following command to check the data cache status

$ kubectl get dataset

Expected output:

NAME              UFS TOTAL SIZE   CACHED     CACHE CAPACITY   CACHED PERCENTAGE   PHASE   AGE
pv-demo-dataset   10.96GiB         10.96GiB   20.00GiB         100.0%              Bound   3m13s

After the DataLoad warm-up operation completes, the cached data amount (CACHED) of the dataset has grown to the size of the entire dataset, meaning the whole dataset is cached, and the cache percentage (CACHED PERCENTAGE) is 100.0%.

5. Create an application container and access the data in the PV storage volume

1) Use the following YAML to create the pod.yaml file, and make sure the claimName in the YAML file matches the name of the Dataset created in step 3. (When a Dataset is created, Fluid automatically creates a PVC with the same name, which application Pods mount to access the cached data.)

apiVersion: v1
kind: Pod
metadata:
  name: nginx
spec:
  containers:
    - name: nginx
      image: nginx
      command:
      - "bash"
      - "-c"
      - "sleep inf"
      volumeMounts:
        - mountPath: /data
          name: data-vol
  volumes:
    - name: data-vol
      persistentVolumeClaim:
        claimName: pv-demo-dataset # Must match the Dataset name.

2) Execute the following command to create an application Pod

$ kubectl create -f pod.yaml

3) Execute the following command to log in to the Pod and access data

$ kubectl exec -it nginx -- bash

Expected output:

# In the nginx Pod, the /data directory contains a file named demofile with a size of 11 GB.
$ ls -lh /data
total 11G
-rw-r----- 1 root root 11G Jul 28  2023 demofile

# Run cat /data/demofile > /dev/null to read the entire demofile and discard it into /dev/null; this takes 11.004 seconds.
$ time cat /data/demofile > /dev/null
real    0m11.004s
user    0m0.065s
sys     0m3.089s

Since all data in the dataset has been cached in the distributed cache system, reads are served from the cache rather than from the remote storage system, reducing network transmission and improving data access efficiency. Reading the 11 GB file in about 11 seconds corresponds to roughly 1 GB/s of read throughput.
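Optionally, when you are done with the demo, the resources created above can be removed with a straightforward teardown (the file names match those created in the previous steps):

$ kubectl delete -f pod.yaml
$ kubectl delete -f dataload.yaml
$ kubectl delete -f dataset.yaml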

Related Links:

[1] Create an ACK Pro Edition Cluster
https://help.aliyun.com/zh/ack/ack-managed-and-ack-dedicated/user-guide/create-an-ack-managed-cluster-2#task-skz-qwk-qfb

[2] Install the Cloud Native AI Suite
https://help.aliyun.com/zh/ack/cloud-native-ai-suite/user-guide/deploy-the-cloud-native-ai-suite#task-2038811

[3] Connect to the Cluster Through the kubectl Tool
https://help.aliyun.com/zh/ack/ack-managed-and-ack-dedicated/user-guide/obtain-the-kubeconfig-file-of-a-cluster-and-use-kubectl-to-connect-to-the-cluster#task-ubf-lhg-vdb
