Author: Cheyang
Previous articles:
This series introduces how to support and optimize hybrid cloud data access scenarios based on ACK Fluid. Related articles:
- Hybrid cloud optimized data access based on ACK Fluid (1): Scenarios and architecture
In the previous article, we discussed Day 1 of combining Kubernetes and data in a hybrid cloud scenario: solving data access and connecting cloud compute with on-premises storage. On that basis, ACK Fluid further addresses the cost and performance of data access. Entering Day 2, when users run this solution in a production environment, the main challenge on the operations side is how to handle data synchronization across multi-region clusters.
Overview
Many enterprises establish multiple computing clusters in different regions for performance, security, stability, and resource isolation, and these clusters need remote access to a single, centralized data store. For example, as large language models mature, multi-region inference services built on them have become a capability many enterprises need to support. This is a concrete instance of the scenario, and it brings considerable challenges:
- Manually synchronizing data across multiple computing clusters in different data centers is very time-consuming.
- Taking large language models as an example: the parameters are numerous, the files are large and plentiful, and management is complex. Different businesses choose different base models and business data, so the resulting models differ.
- Model data is continuously iterated on business input and is therefore updated frequently.
- The model inference service starts slowly and takes a long time to pull files: the parameter scale of large language models is considerable, and the files are usually very large, often hundreds of GB, so pulling them into GPU memory is hugely time-consuming and startup is very slow.
- Model updates require all regions to be updated simultaneously, and replication jobs on an overloaded storage cluster severely impact the performance of existing workloads.
In addition to the acceleration capabilities of a generic storage client, ACK Fluid provides scheduled and triggered data migration and preheating, simplifying the complexity of data distribution.
- Save network and computing costs: cross-region traffic costs drop significantly and computing time shortens significantly, at the price of a slight increase in computing cluster cost, which can be further optimized through elasticity.
- Greatly accelerate application data updates: since data access happens within the same data center or availability zone, latency is reduced, and cache throughput and concurrency can be scaled linearly.
- Reduce complex data synchronization operations: control data synchronization through custom policies to reduce contention for data access, and lower operations complexity through automation.
Demo
This demo shows how to use ACK Fluid's scheduled preheating mechanism to update the data accessed by the user's computing clusters in different regions.
Prerequisites
- An ACK Pro cluster has been created, and the cluster version is 1.18 or above. For details, see Creating an ACK Pro Cluster [1].
- The cloud-native AI suite has been installed and the ack-fluid component has been deployed. Important: if you have installed open source Fluid, uninstall it before deploying the ack-fluid component.
  - If the cloud-native AI suite is not installed: enable Fluid data acceleration during installation. For details, see Installing the Cloud Native AI Suite [2].
  - If the cloud-native AI suite is already installed: deploy ack-fluid on the cloud-native AI suite page of the Container Service Management Console [3].
- The Kubernetes cluster can be reached through kubectl. For details, see Connecting to the Cluster Through kubectl [4].
Background Information
Prepare the K8s and OSS environments; completing the deployment of the JindoRuntime environment takes only about 10 minutes.
Step 1: Prepare OSS Bucket data
- Execute the following command to download a copy of the test data.
$ wget https://archive.apache.org/dist/hbase/2.5.2/RELEASENOTES.md
- Upload the downloaded test data to the corresponding Alibaba Cloud OSS bucket. You can use the ossutil client tool provided by OSS. For details, see Installing ossutil [5].
$ ossutil cp RELEASENOTES.md oss://<bucket>/<path>/RELEASENOTES.md
Step 2: Create Dataset and JindoRuntime
- Before creating the Dataset, create a mySecret.yaml file to store the OSS accessKeyId and accessKeySecret.
The sample YAML for mySecret.yaml is as follows:
apiVersion: v1
kind: Secret
metadata:
  name: mysecret
stringData:
  fs.oss.accessKeyId: xxx
  fs.oss.accessKeySecret: xxx
- Execute the following command to create the Secret.
$ kubectl create -f mySecret.yaml
- Use the following sample YAML to create a file named dataset.yaml containing two parts:
- Create a Dataset to describe the remote storage data set and UFS information.
- Create a JindoRuntime and start a JindoFS cluster to provide caching services.
apiVersion: data.fluid.io/v1alpha1
kind: Dataset
metadata:
  name: demo
spec:
  mounts:
    - mountPoint: oss://<bucket-name>/<path>
      options:
        fs.oss.endpoint: <oss-endpoint>
      name: hbase
      path: "/"
      encryptOptions:
        - name: fs.oss.accessKeyId
          valueFrom:
            secretKeyRef:
              name: mysecret
              key: fs.oss.accessKeyId
        - name: fs.oss.accessKeySecret
          valueFrom:
            secretKeyRef:
              name: mysecret
              key: fs.oss.accessKeySecret
  accessModes:
    - ReadOnlyMany
---
apiVersion: data.fluid.io/v1alpha1
kind: JindoRuntime
metadata:
  name: demo
spec:
  replicas: 1
  tieredstore:
    levels:
      - mediumtype: MEM
        path: /dev/shm
        quota: 2Gi
        high: "0.99"
        low: "0.8"
  fuse:
    args:
      - -okernel_cache
      - -oro
      - -oattr_timeout=60
      - -oentry_timeout=60
      - -onegative_timeout=60
The relevant parameters are explained in the following table:
Parameter | Description |
---|---|
mountPoint | oss://<oss_bucket>/<path> indicates the path of the UFS to mount; the path does not need to contain endpoint information. |
fs.oss.endpoint | The endpoint of the OSS bucket, which can be either a public or a private network address. |
accessModes | The access mode of the Dataset. |
replicas | The number of workers in the JindoFS cluster. |
mediumtype | The cache type. When defining the JindoRuntime template, JindoFS currently supports one of HDD/SSD/MEM. |
path | The cache storage path. Currently only a single path is supported. When MEM is selected for caching, a local path is required to store files such as logs. |
quota | The maximum cache capacity; configure it based on the size of the UFS data. |
high | The upper watermark of the storage capacity. |
low | The lower watermark of the storage capacity. |
fuse.args | Optional fuse client mount parameters, usually used in conjunction with the Dataset access mode. When the access mode is ReadOnlyMany, enable kernel_cache to use the kernel cache to optimize read performance; in this case you can set attr_timeout (file attribute cache retention time), entry_timeout (file name lookup cache retention time), and negative_timeout (failed lookup cache retention time), which default to 7200s. When the access mode is ReadWriteMany, the default configuration is recommended: -oauto_cache -oattr_timeout=0 -oentry_timeout=0 -onegative_timeout=0. auto_cache ensures the cache is invalidated when the file size or modification time changes, and the timeouts are set to 0. |
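For reference, the ReadWriteMany mount parameters described in the fuse.args row above would look like this in the JindoRuntime manifest (a sketch showing only the fuse section; the rest of the manifest matches the earlier example):

```yaml
# fuse section of the JindoRuntime when the Dataset access mode is ReadWriteMany
fuse:
  args:
    - -oauto_cache         # invalidate the cache when file size or mtime changes
    - -oattr_timeout=0     # do not cache file attributes
    - -oentry_timeout=0    # do not cache name lookups
    - -onegative_timeout=0 # do not cache failed lookups
```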
- Execute the following commands to create JindoRuntime and Dataset.
$ kubectl create -f dataset.yaml
- Execute the following command to view the deployment of Dataset.
$ kubectl get dataset
Expected output:
NAME UFS TOTAL SIZE CACHED CACHE CAPACITY CACHED PERCENTAGE PHASE AGE
demo 588.90KiB 0.00B 10.00GiB 0.0% Bound 2m7s
Step 3: Create a DataLoad that supports scheduled execution
- Use the following sample YAML file to create a file named dataload.yaml.
apiVersion: data.fluid.io/v1alpha1
kind: DataLoad
metadata:
  name: cron-dataload
spec:
  dataset:
    name: demo
    namespace: default
  policy: Cron
  schedule: "*/2 * * * *" # Run every 2 min
The relevant parameters are explained in the following table:
Parameter | Description |
---|---|
dataset | The name and namespace of the dataset on which the DataLoad runs. |
policy | The execution policy; Once and Cron are currently supported. Here a scheduled DataLoad task is created. |
schedule | The cron schedule that triggers the DataLoad. |
schedule uses the standard cron format:
# ┌───────────── minute (0 - 59)
# │ ┌───────────── hour (0 - 23)
# │ │ ┌───────────── day of month (1 - 31)
# │ │ │ ┌───────────── month (1 - 12)
# │ │ │ │ ┌───────────── day of week (0 - 6) (Sunday to Saturday; on some systems, 7 is also Sunday)
# │ │ │ │ │ or sun,mon,tue,wed,thu,fri,sat
# │ │ │ │ │
# │ │ │ │ │
# * * * * *
Also, cron supports the following operators:
- A comma (,) indicates enumeration. For example, 1,3,4,7 * * * * means the DataLoad runs at minutes 1, 3, 4, and 7 of every hour.
- A hyphen (-) indicates a range. For example, 1-6 * * * * means it runs every minute from minute 1 through minute 6 of every hour.
- An asterisk (*) matches any possible value. For example, an asterisk in the hour field means "every hour".
- A slash (/) describes a range increment. For example, */2 * * * * means execution every 2 minutes.
You can also view more information here.
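To make the operator semantics concrete, here is a small Python sketch (not part of Fluid; `expand_field` is a hypothetical helper) that expands a single cron field into the set of values it matches:

```python
def expand_field(field: str, lo: int, hi: int) -> set[int]:
    """Expand one cron field (e.g. the minute field) into matching values.

    Supports the operators described above: '*', ',', '-', and '/'.
    """
    values: set[int] = set()
    for part in field.split(","):              # comma: enumeration
        part, _, step_s = part.partition("/")  # slash: range increment
        step = int(step_s) if step_s else 1
        if part == "*":                        # asterisk: any value
            start, end = lo, hi
        elif "-" in part:                      # hyphen: range
            start_s, end_s = part.split("-")
            start, end = int(start_s), int(end_s)
        else:                                  # plain number
            start = end = int(part)
        values.update(range(start, end + 1, step))
    return values

# "*/2" in the minute field matches every even minute
print(sorted(expand_field("*/2", 0, 59))[:5])   # → [0, 2, 4, 6, 8]
# "1,3,4,7" matches minutes 1, 3, 4 and 7
print(sorted(expand_field("1,3,4,7", 0, 59)))   # → [1, 3, 4, 7]
```

With this, the schedule "*/2 * * * *" used in dataload.yaml reads as: every even minute, every hour, every day.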
For advanced configuration related to Dataload, please refer to the following configuration file:
apiVersion: data.fluid.io/v1alpha1
kind: DataLoad
metadata:
  name: cron-dataload
spec:
  dataset:
    name: demo
    namespace: default
  policy: Cron # Once or Cron
  schedule: "* * * * *" # set only when policy is Cron
  loadMetadata: true
  target:
    - path: <path1>
      replicas: 1
    - path: <path2>
      replicas: 2
The relevant parameters are explained in the following table:
Parameter | Description |
---|---|
policy | The DataLoad execution policy, one of [Once, Cron]. |
schedule | The cron schedule used; only valid when the policy is Cron. |
loadMetadata | Whether to synchronize metadata before the DataLoad. |
target | The targets of the DataLoad; multiple targets can be specified. |
path | The path on which the DataLoad runs. |
replicas | The number of cached replicas. |
- Execute the following command to create Dataload.
$ kubectl apply -f dataload.yaml
- Execute the following command to check the Dataload status.
$ kubectl get dataload
Expected output:
NAME DATASET PHASE AGE DURATION
cron-dataload demo Complete 3m51s 2m12s
- After waiting for the Dataload status to be Complete, execute the following command to view the current dataset status.
$ kubectl get dataset
Expected output:
NAME UFS TOTAL SIZE CACHED CACHE CAPACITY CACHED PERCENTAGE PHASE AGE
demo 588.90KiB 588.90KiB 10.00GiB 100.0% Bound 5m50s
It can be seen that all the files in OSS have been loaded into the cache.
Step 4: Create an application container to access data in OSS
This step creates an application container that accesses the above file, to observe the effect of the scheduled DataLoad.
- Using the following sample YAML file, create a file named app.yaml.
apiVersion: v1
kind: Pod
metadata:
  name: nginx
spec:
  containers:
    - name: nginx
      image: nginx
      volumeMounts:
        - mountPath: /data
          name: demo-vol
  volumes:
    - name: demo-vol
      persistentVolumeClaim:
        claimName: demo
- Execute the following command to create an application container.
$ kubectl create -f app.yaml
- Wait for the application container to be ready and execute the following command to view the data in OSS:
$ kubectl exec -it nginx -- ls -lh /data
Expected output:
total 589K
-rwxrwxr-x 1 root root 589K Jul 31 04:20 RELEASENOTES.md
- To verify that the scheduled DataLoad updates the underlying file, modify the contents of RELEASENOTES.md and re-upload it before the next scheduled DataLoad is triggered.
$ echo "hello, crondataload." >> RELEASENOTES.md
Re-upload the file to OSS:
$ ossutil cp RELEASENOTES.md oss://<bucket-name>/<path>/RELEASENOTES.md
- Wait for the DataLoad task to trigger. When it completes, execute the following command to view the running status of the DataLoad job:
$ kubectl describe dataload cron-dataload
Expected output:
...
Status:
Conditions:
Last Probe Time: 2023-07-31T04:30:07Z
Last Transition Time: 2023-07-31T04:30:07Z
Status: True
Type: Complete
Duration: 5m54s
Last Schedule Time: 2023-07-31T04:30:00Z
Last Successful Time: 2023-07-31T04:30:07Z
Phase: Complete
...
Here, Last Schedule Time in Status is the time the most recent DataLoad job was scheduled, and Last Successful Time is the time it completed.
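As an illustration (a hypothetical helper, not a Fluid or kubectl API), the two timestamps can be extracted from the describe output and compared to see how long the last run took to complete after being scheduled:

```python
from datetime import datetime

def parse_status_times(describe_output: str) -> dict[str, datetime]:
    """Extract 'Last Schedule Time' and 'Last Successful Time' from
    `kubectl describe dataload` output (hypothetical helper)."""
    times: dict[str, datetime] = {}
    for line in describe_output.splitlines():
        line = line.strip()
        for key in ("Last Schedule Time:", "Last Successful Time:"):
            if line.startswith(key):
                stamp = line[len(key):].strip()
                times[key.rstrip(":")] = datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%SZ")
    return times

# sample taken from the expected output above
sample = """
Status:
  Last Schedule Time:    2023-07-31T04:30:00Z
  Last Successful Time:  2023-07-31T04:30:07Z
"""
t = parse_status_times(sample)
delay = t["Last Successful Time"] - t["Last Schedule Time"]
print(delay.total_seconds())  # → 7.0
```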
At this point, you can execute the following command to view the current Dataset status:
$ kubectl get dataset
Expected output:
NAME UFS TOTAL SIZE CACHED CACHE CAPACITY CACHED PERCENTAGE PHASE AGE
demo 588.90KiB 1.15MiB 10.00GiB 100.0% Bound 10m
It can be seen that the updated file has also been loaded into the cache.
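One plausible reading of the 1.15MiB CACHED figure is that the cache now holds both the original copy and the updated copy of the file. A quick sanity check, assuming two roughly equal-sized copies:

```python
kib_per_copy = 588.90                 # UFS TOTAL SIZE from the dataset output, in KiB
cached_mib = 2 * kib_per_copy / 1024  # two cached copies, converted to MiB
print(round(cached_mib, 2))           # → 1.15
```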
- Execute the following command to view the updated files in the application container:
$ kubectl exec -it nginx -- tail /data/RELEASENOTES.md
Expected output:
<name>hbase.config.read.zookeeper.config</name>
<value>true</value>
<description>
Set to true to allow HBaseConfiguration to read the
zoo.cfg file for ZooKeeper properties. Switching this to true
is not recommended, since the functionality of reading ZK
properties from a zoo.cfg file has been deprecated.
</description>
</property>
hello, crondataload.
As you can see from the last line, the updated file is now accessible to the application container.
Environment Cleanup
When you no longer use the data acceleration feature, clean up the environment.
Execute the following commands to delete the application container, Dataset, and JindoRuntime.
$ kubectl delete -f app.yaml
$ kubectl delete -f dataset.yaml
Summary
This concludes the discussion of optimized hybrid cloud data access based on ACK Fluid. The Alibaba Cloud Container Service team will continue to iterate and optimize with users in this scenario, and this series will be updated as the practice deepens.
Related Links:
[1] Creating an ACK Pro Cluster: https://help.aliyun.com/document_detail/176833.html#task-skz-qwk-qfb
[2] Installing the Cloud Native AI Suite
[3] Container Service Management Console: https://account.aliyun.com/login/login.htm?oauth_callback=https%3A%2F%2Fcs.console.aliyun.com%2F
[4] Connecting to the Cluster Through kubectl
[5] Installing ossutil: https://help.aliyun.com/zh/oss/developer-reference/install-ossutil#concept-303829