Author: Cheyang
Previously in this series:
This series describes how to support and optimize hybrid cloud data access scenarios with ACK Fluid. Related articles:
Hybrid cloud optimized data access based on ACK Fluid (1): Scenarios and architecture
In the previous article, "Building a Bridge between Elastic Computing Instances and Third-Party Storage", we showed how to access third-party distributed storage through ACK Fluid, so that elastic container instances (ECI) and ECS instances can connect to off-cloud storage systems and transfer data. That solves the first phase of cloud migration: connectivity.
In production, once cloud computing routinely accesses off-cloud storage, performance, cost, and stability all have to be considered. For example, what does the dedicated line cost per year when off-cloud data is accessed from the cloud? Is there a significant gap between the running time of computing tasks on the cloud and the original IDC computing tasks? And if the dedicated line fails, how can the loss to computing tasks on the cloud be reduced?
In this article, we will focus on how to accelerate third-party storage access, achieve better performance, lower costs, and reduce dependence on dedicated line stability.
Overview
Even when cloud computing can access an enterprise's off-cloud storage through PV storage volumes, the standardized Kubernetes interface, challenges in performance and cost remain:
- Limited data access bandwidth and high latency: When on-cloud computing reads from off-cloud storage, the added latency and limited bandwidth make high-performance computing jobs run longer and lower the utilization of computing resources.
- Redundant data reads and expensive network costs: Deep learning tasks such as hyperparameter tuning and automatic parameter tuning read the same data repeatedly. Because the native Kubernetes scheduler is not aware of the data cache state, applications are scheduled poorly and the cache cannot be reused, so repeated data pulls drive up external network and dedicated line costs.
- Off-cloud distributed storage becomes the bottleneck for concurrent data access, with performance and stability risks: When large-scale computing power accesses off-cloud storage concurrently and the IO pressure of deep learning training rises, the off-cloud distributed storage can easily become a performance bottleneck. This can affect computing tasks and even cause the entire computing cluster to fail.
- Strong dependence on network stability: If the network between the public cloud and the data center is unstable, data synchronization errors occur and applications become unavailable.
- Data security requirements: Metadata and data need to be protected and must not be persisted to cloud disks.
Based on JindoRuntime, ACK Fluid provides general-purpose acceleration for PV storage volumes. Any third-party storage that is exposed as a PVC can gain data access acceleration through a distributed cache simply, quickly, and securely, with the following benefits:
1. Zero adaptation cost: Third-party storage only needs to provide a PVC through the CSI protocol; it can then be used immediately without additional development.
2. Greatly improved data access performance and engineering efficiency:
   a. With access-based and policy-based data preheating, accessing off-cloud data performs on par with accessing data inside the cloud computing cluster.
   b. Elastic data access bandwidth handles high concurrency: throughput can scale up to hundreds of Gbps, and the cache can also scale down to 0, dynamically balancing low cost against high throughput.
   c. Data cache affinity-aware scheduling avoids cross-network data access and reduces latency.
3. No repeated reads of hotspot data, and lower network costs: Hotspot data is kept on the cloud by the distributed cache, reducing off-cloud data reads and network traffic.
4. Data-centric automated operation and maintenance: Automated and scheduled cache warm-up avoids repeated data pulls, and the data cache can be scaled out, scaled in, and cleaned up automatically.
5. Better security, because distributed memory caching keeps metadata and data off disk: For users who are sensitive to data security, ACK Fluid provides a distributed memory cache that performs well and removes the concern about data being written to disk.
Summary: ACK Fluid makes access to third-party storage PVCs from cloud computing out-of-the-box, high-performance, low-cost, automated, and free of data written to disk.
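To make the network-cost point concrete, the sketch below estimates how much data crosses the dedicated line when the same dataset is read by several runs, with and without a warm cache. It is a simplified illustration, not a measurement: it assumes the cache holds the whole dataset and nothing is evicted between runs, and the run count of 20 is a hypothetical figure. The 10.96 GiB size is the one shown later in the demo.

```python
def dedicated_line_traffic_gib(dataset_gib: float, runs: int, cached: bool) -> float:
    """Estimate the data pulled over the dedicated line across repeated runs.

    Simplifying assumption: with a cache, only the first pass reads from
    off-cloud storage; without one, every run reads the full dataset.
    """
    if dataset_gib < 0 or runs < 0:
        raise ValueError("dataset_gib and runs must be non-negative")
    if runs == 0:
        return 0.0
    passes = 1 if cached else runs
    return dataset_gib * passes


if __name__ == "__main__":
    size_gib = 10.96  # dataset size from the demo below
    runs = 20         # hypothetical number of tuning runs
    without = dedicated_line_traffic_gib(size_gib, runs, cached=False)
    with_cache = dedicated_line_traffic_gib(size_gib, runs, cached=True)
    print(f"without cache: {without:.2f} GiB, with cache: {with_cache:.2f} GiB")
```

Under these assumptions the off-cloud traffic no longer grows with the number of runs, which is where the dedicated line savings come from.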
Demo
1. Prerequisites
- An ACK Pro cluster has been created, and the cluster version is 1.18 or later. For specific operations, see Creating an ACK Pro Edition Cluster [1].
- The cloud native AI suite has been installed and the ack-fluid component has been deployed. Important: If you have installed open source Fluid, uninstall it before deploying the ack-fluid component.
  - If the cloud native AI suite is not installed: enable Fluid data acceleration during installation. For specific operations, see Installing the Cloud Native AI Suite [2].
  - If the cloud native AI suite is already installed: deploy ack-fluid on the cloud native AI suite page of the Container Service management console.
- You can connect to the ACK cluster with kubectl. For specific operations, see Connecting to the Cluster Through the kubectl Tool [3].
- The PV storage volume and the PVC (storage volume claim) for the storage system to be accessed have been created. In Kubernetes, different storage systems create storage volumes in different ways. To ensure a stable connection between the storage system and the Kubernetes cluster, prepare them according to the official documentation of the storage system. Note: For hybrid cloud scenarios, it is recommended to configure the data access mode as read-only, for both data security and performance.
2. Query the PV and PVC information
Execute the following command to query the PV storage volumes and PVCs in Kubernetes.
$ kubectl get pvc,pv
Expected output:
NAME                             STATUS   VOLUME    CAPACITY   ACCESS MODES   STORAGECLASS   AGE
persistentvolumeclaim/demo-pvc   Bound    demo-pv   5Gi        ROX                           19h

NAME                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM              STORAGECLASS   REASON   AGE
persistentvolume/demo-pv   30Gi       ROX            Retain           Bound    default/demo-pvc                           19h
The PV storage volume demo-pv has a capacity of 30 GiB, supports the ROX (ReadOnlyMany) access mode, is bound to the PVC named demo-pvc, and can be used normally.
3. Create Dataset and JindoRuntime
1) Create the dataset.yaml file. The following YAML file defines two Fluid resource objects, Dataset and JindoRuntime.
- Dataset: describes the PVC to be mounted.
- JindoRuntime: configures the JindoFS distributed cache system to be started, including the number of replicas of the Worker component and the maximum cache capacity available to each Worker.
apiVersion: data.fluid.io/v1alpha1
kind: Dataset
metadata:
  name: pv-demo-dataset
spec:
  mounts:
    - mountPoint: pvc://demo-pvc
      name: data
      path: /
  accessModes:
    - ReadOnlyMany
---
apiVersion: data.fluid.io/v1alpha1
kind: JindoRuntime
metadata:
  name: pv-demo-dataset
spec:
  replicas: 2
  tieredstore:
    levels:
      - mediumtype: MEM
        path: /dev/shm
        quota: 10Gi
        high: "0.9"
        low: "0.8"
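The main parameters in the configuration file are as follows:
- mountPoint (Dataset): the data source to mount, written as pvc://<PVC name>; here it points to demo-pvc.
- replicas (JindoRuntime): the number of Worker components in the JindoFS cache system.
- mediumtype: the cache medium; MEM means memory.
- path: the cache storage path on each Worker.
- quota: the maximum cache capacity available to each Worker.
- high and low: the upper and lower watermarks for cache usage, expressed as fractions of quota.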
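As a quick sanity check on the JindoRuntime settings above, the sketch below works out the total cache capacity implied by replicas and quota, plus the byte values of the high and low watermarks. The helper names are hypothetical, and the quantity parser only handles the binary suffixes used here. How JindoFS acts on the watermarks is not described in this article; the sketch only converts the ratios to bytes.

```python
from fractions import Fraction
from typing import Dict

# Binary suffixes as used in Kubernetes quantities such as "10Gi".
_BINARY_SUFFIXES = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}


def parse_quantity(quantity: str) -> int:
    """Convert a binary-suffixed quantity such as '10Gi' to bytes."""
    for suffix, factor in _BINARY_SUFFIXES.items():
        if quantity.endswith(suffix):
            return int(Fraction(quantity[: -len(suffix)]) * factor)
    return int(quantity)  # no suffix: treat as a plain byte count


def cache_plan(replicas: int, quota: str, high: str, low: str) -> Dict[str, int]:
    """Derive total capacity and per-Worker watermark sizes in bytes."""
    per_worker = parse_quantity(quota)
    return {
        "total_bytes": replicas * per_worker,
        "high_bytes": int(per_worker * Fraction(high)),
        "low_bytes": int(per_worker * Fraction(low)),
    }


if __name__ == "__main__":
    plan = cache_plan(replicas=2, quota="10Gi", high="0.9", low="0.8")
    for key, value in plan.items():
        print(f"{key}: {value / 2**30:.2f} GiB")
```

With 2 Workers of 10Gi each, the total is 20 GiB, which agrees with the CACHE CAPACITY of 20.00GiB reported for the dataset in step 4.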
2) Execute the following command to create Dataset and JindoRuntime resource objects
$ kubectl create -f dataset.yaml
3) Execute the following command to check the deployment status of Dataset
$ kubectl get dataset pv-demo-dataset
Check the PHASE column of the output. Note: The first startup of the JindoFS cache system involves pulling images, which may take 2 to 3 minutes depending on factors such as the network environment. Once the Dataset is in the Bound state, the JindoFS cache system has started normally in the cluster and application Pods can access the data defined in the Dataset.
4. Create DataLoad to perform cache warm-up
Because the first access cannot hit the data cache, the application Pod's initial data access may be slow. Fluid provides the DataLoad cache warm-up operation to speed up the first access.
1) Create the dataload.yaml file with the following content
apiVersion: data.fluid.io/v1alpha1
kind: DataLoad
metadata:
  name: dataset-warmup
spec:
  dataset:
    name: pv-demo-dataset
    namespace: default
  loadMetadata: true
  target:
    - path: /
      replicas: 1
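The main parameters of the resource object above are as follows:
- spec.dataset.name and spec.dataset.namespace: the name and namespace of the Dataset to warm up.
- spec.loadMetadata: whether to load metadata before warming up the data.
- spec.target[*].path: the path to warm up; / covers the whole dataset.
- spec.target[*].replicas: the number of cache replicas for the warmed-up path.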
2) Execute the following command to create the DataLoad object
$ kubectl create -f dataload.yaml
3) Execute the following command to check the DataLoad status
$ kubectl get dataload dataset-warmup
Expected output:
NAME             DATASET           PHASE      AGE   DURATION
dataset-warmup   pv-demo-dataset   Complete   62s   12s
4) Execute the following command to check the data cache status
$ kubectl get dataset
Expected output:
NAME              UFS TOTAL SIZE   CACHED     CACHE CAPACITY   CACHED PERCENTAGE   PHASE   AGE
pv-demo-dataset   10.96GiB         10.96GiB   20.00GiB         100.0%              Bound   3m13s
After the DataLoad cache warm-up operation is completed, the cached data amount (CACHED) of the data set has been updated to the size of the entire data set, which means that the entire data set has been cached, and the cache percentage (CACHED PERCENTAGE) is 100.0%.
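If you want to script this check rather than read the table by eye, one option is to parse the text output of kubectl get dataset. The sketch below is a minimal, hypothetical helper: it splits columns on runs of two or more spaces (which is how multi-word headers such as UFS TOTAL SIZE stay intact) and therefore assumes no cell is empty. The sample text is the expected output shown above.

```python
import re
from typing import Dict, List

_COLUMN_GAP = re.compile(r"\s{2,}")  # columns are separated by 2+ spaces


def parse_kubectl_table(text: str) -> List[Dict[str, str]]:
    """Parse aligned kubectl table output into one dict per row.

    Limitation: rows with empty cells are not handled correctly.
    """
    lines = [line for line in text.splitlines() if line.strip()]
    if not lines:
        return []
    header = _COLUMN_GAP.split(lines[0].strip())
    return [dict(zip(header, _COLUMN_GAP.split(line.strip()))) for line in lines[1:]]


def is_fully_cached(row: Dict[str, str]) -> bool:
    """True when the dataset is Bound and reports 100% cached."""
    percentage = float(row.get("CACHED PERCENTAGE", "0%").rstrip("%"))
    return row.get("PHASE") == "Bound" and percentage >= 100.0


SAMPLE = """\
NAME              UFS TOTAL SIZE   CACHED     CACHE CAPACITY   CACHED PERCENTAGE   PHASE   AGE
pv-demo-dataset   10.96GiB         10.96GiB   20.00GiB         100.0%              Bound   3m13s
"""

if __name__ == "__main__":
    for row in parse_kubectl_table(SAMPLE):
        print(row["NAME"], "fully cached:", is_fully_cached(row))
```

In a real script the input would come from running the kubectl command; the parsing is kept separate here so that it can be checked without a cluster.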
5. Create an application container and access the data in the PV storage volume
1) Use the following YAML to create the pod.yaml file, and make sure the claimName in the YAML file matches the name of the Dataset created in step 3.
apiVersion: v1
kind: Pod
metadata:
  name: nginx
spec:
  containers:
    - name: nginx
      image: nginx
      command:
        - "bash"
        - "-c"
        - "sleep inf"
      volumeMounts:
        - mountPath: /data
          name: data-vol
  volumes:
    - name: data-vol
      persistentVolumeClaim:
        claimName: pv-demo-dataset # The name must be the same as the Dataset name.
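The manifests in this walkthrough refer to one another by name, and a typo in any of them is an easy mistake to make. The sketch below checks those references on trimmed-down copies of the manifests written as Python dictionaries. The function name is hypothetical. The requirement that the JindoRuntime shares the Dataset's name is an assumption taken from how the example is written. The claimName rule comes from the step above.

```python
from typing import Any, Dict, List

Manifest = Dict[str, Any]


def check_name_references(
    dataset: Manifest, runtime: Manifest, dataload: Manifest, pod: Manifest
) -> List[str]:
    """Return a list of naming problems; an empty list means all names line up."""
    name = dataset["metadata"]["name"]
    problems: List[str] = []
    # Assumption: the runtime is paired with the Dataset through its name.
    if runtime["metadata"]["name"] != name:
        problems.append("JindoRuntime name differs from Dataset name")
    if dataload["spec"]["dataset"]["name"] != name:
        problems.append("DataLoad does not target the Dataset")
    claims = [
        volume["persistentVolumeClaim"]["claimName"]
        for volume in pod["spec"].get("volumes", [])
        if "persistentVolumeClaim" in volume
    ]
    if name not in claims:
        problems.append("Pod does not mount a PVC named after the Dataset")
    return problems


# Trimmed-down copies of the manifests used in this article.
DATASET = {"kind": "Dataset", "metadata": {"name": "pv-demo-dataset"}}
RUNTIME = {"kind": "JindoRuntime", "metadata": {"name": "pv-demo-dataset"}}
DATALOAD = {
    "kind": "DataLoad",
    "spec": {"dataset": {"name": "pv-demo-dataset", "namespace": "default"}},
}
POD = {
    "kind": "Pod",
    "spec": {
        "volumes": [
            {"name": "data-vol", "persistentVolumeClaim": {"claimName": "pv-demo-dataset"}}
        ]
    },
}

if __name__ == "__main__":
    print(check_name_references(DATASET, RUNTIME, DATALOAD, POD) or "all names match")
```

Note that the Pod mounts the PVC named after the Dataset (pv-demo-dataset), not the original demo-pvc; mounting demo-pvc directly would bypass the cache.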
2) Execute the following command to create an application Pod
$ kubectl create -f pod.yaml
3) Execute the following command to log in to the Pod and access data
$ kubectl exec -it nginx -- bash
Expected output:
# In the nginx Pod, the /data directory contains a file named demofile, 11 GB in size.
$ ls -lh /data
total 11G
-rw-r----- 1 root root 11G Jul 28 2023 demofile
# Run cat /data/demofile > /dev/null to read the contents of demofile and write them to /dev/null; this takes 11.004 seconds.
$ time cat /data/demofile > /dev/null
real 0m11.004s
user 0m0.065s
sys 0m3.089s
Since all the data in the data set has been cached in the distributed cache system, when reading the data, it will be read from the cache instead of the remote storage system, thus reducing network transmission and improving data access efficiency.
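As a rough worked check of the result above, the sketch below converts the real time reported by the time command into an average read throughput. It uses the 10.96 GiB dataset size from the kubectl get dataset output in step 4. The function names are hypothetical, and the figure is an average over this single run, not a benchmark.

```python
import re

_REAL_TIME = re.compile(r"^real\s+(\d+)m([\d.]+)s", re.MULTILINE)


def parse_real_seconds(time_output: str) -> float:
    """Extract the 'real' wall-clock time, in seconds, from bash `time` output."""
    match = _REAL_TIME.search(time_output)
    if match is None:
        raise ValueError("no 'real' line found in time output")
    return int(match.group(1)) * 60 + float(match.group(2))


def throughput_gib_per_s(size_gib: float, seconds: float) -> float:
    """Average throughput in GiB/s."""
    return size_gib / seconds


def gib_per_s_to_gbit_per_s(gib_per_s: float) -> float:
    """Convert GiB/s (binary) to Gbit/s (decimal)."""
    return gib_per_s * 2**30 * 8 / 1e9


TIME_OUTPUT = """\
real 0m11.004s
user 0m0.065s
sys 0m3.089s
"""

if __name__ == "__main__":
    seconds = parse_real_seconds(TIME_OUTPUT)
    rate = throughput_gib_per_s(10.96, seconds)
    print(f"{rate:.3f} GiB/s, about {gib_per_s_to_gbit_per_s(rate):.2f} Gbit/s")
```

The read averages roughly 1 GiB/s for a single Pod reading from the warmed cache.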
Related Links:
[1] Create ACK Pro version cluster https://help.aliyun.com/zh/ack/ack-managed-and-ack-dedicated/user-guide/create-an-ack-managed-cluster-2#task-skz-qwk-qfb
[2] Install the cloud native AI suite https://help.aliyun.com/zh/ack/cloud-native-ai-suite/user-guide/deploy-the-cloud-native-ai-suite#task-2038811
[3] Connect to the cluster through the kubectl tool