Hybrid cloud optimized data access based on ACK Fluid (5): Automated data distribution across regional data centers

Author: Cheyang

Series recap:

This series introduces how to support and optimize hybrid cloud data access scenarios based on ACK Fluid. See the related articles:

- Hybrid cloud optimized data access based on ACK Fluid (1): Scenarios and architecture

- Hybrid cloud optimized data access based on ACK Fluid (2): Building a bridge between elastic computing instances and third-party storage

- Hybrid cloud optimized data access based on ACK Fluid (3): Accelerate read access to third-party storage, reduce costs and increase efficiency in parallel

- Hybrid cloud optimized data access based on ACK Fluid (4): Mount the third-party storage directory to Kubernetes to improve efficiency and standardization

In the previous articles, we discussed Day 1 of combining Kubernetes with data in hybrid cloud scenarios: solving the problem of data access and connecting cloud compute with offline storage. On that basis, ACK Fluid further addresses the cost and performance of data access. Entering Day 2, once users run this solution in a production environment, the main challenge on the operations side is how to handle data synchronization across multi-region clusters.

Overview

Many enterprises establish compute clusters in multiple regions for performance, security, stability, and resource isolation, and these clusters need remote access to a single, centralized data store. For example, as large language models mature, multi-region inference services built on them have become a capability that many enterprises need to support. This is a concrete instance of the scenario, and it poses considerable challenges:

  • Manually synchronizing data across compute clusters in different data centers is very time-consuming.
  • Taking large language models as an example, there are many parameters, the files are large and numerous, and management is complex: different businesses choose different base models and business data, so the final models differ.
  • Model data is continuously updated and iterated based on business input, so model updates are frequent.
  • Model inference services start slowly and take a long time to pull files: the parameter scale of large language models is substantial, and the files are usually very large, even hundreds of GB, so pulling them into GPU memory takes a long time and startup is very slow.
  • Model updates must be rolled out to all regions at the same time, and replication jobs on an already overloaded storage cluster severely impact the performance of existing workloads.

In addition to the acceleration capabilities of a universal storage client, ACK Fluid provides scheduled and triggered data migration and preheating capabilities, simplifying the complexity of data distribution:

  • Save network and compute costs: cross-region traffic costs drop significantly, compute time shortens significantly, and compute cluster costs increase only slightly; this can be optimized further through elasticity.
  • Application data updates are greatly accelerated: because compute now accesses data within the same data center or availability zone, latency is reduced, and cache throughput and concurrency scale linearly.
  • Reduce complex data synchronization operations: custom policies control data synchronization, reducing contention for data access and lowering operational complexity through automation.

Demo

This demo shows how to use ACK Fluid's scheduled preheating mechanism to keep the data accessible to compute clusters in different regions up to date.

Prerequisites

  • An ACK Pro version cluster has been created, and the cluster version is 1.18 or above. For specific operations, see Creating an ACK Pro Edition Cluster [1].
  • The cloud native AI suite has been installed and the ack-fluid component has been deployed. Important: if you have installed open source Fluid, uninstall it before deploying the ack-fluid component.
    • If the cloud native AI suite is not yet installed: enable Fluid data acceleration during installation. For specific operations, see Installing the Cloud Native AI Suite [2].
    • If the cloud native AI suite is already installed: deploy ack-fluid on the cloud native AI suite page of the Container Service Management Console [3].
  • The Kubernetes cluster can be accessed through kubectl. For specific operations, see Connecting to the Cluster Through the kubectl Tool [4].

Background Information

With the K8s and OSS environments prepared, it takes only about 10 minutes to deploy the JindoRuntime environment.

Step 1: Prepare OSS Bucket data

  1. Execute the following command to download a copy of the test data.
$ wget https://archive.apache.org/dist/hbase/2.5.2/RELEASENOTES.md
  2. Upload the downloaded test data to the corresponding Alibaba Cloud OSS bucket, for example with the ossutil client tool provided by OSS. For specific operations, see Installing ossutil [5].
$ ossutil cp RELEASENOTES.md oss://<bucket>/<path>/RELEASENOTES.md
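Optionally, verify that the object is now in the bucket. A quick check, assuming ossutil is already configured with your credentials:

$ ossutil ls oss://<bucket>/<path>/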

Step 2: Create Dataset and JindoRuntime

  1. Before creating the Dataset, create a mySecret.yaml file to store the OSS accessKeyId and accessKeySecret.

The YAML sample to create the mySecret.yaml file is as follows:

apiVersion: v1
kind: Secret
metadata:
  name: mysecret
stringData:
  fs.oss.accessKeyId: xxx
  fs.oss.accessKeySecret: xxx
  2. Execute the following command to generate the secret.
$ kubectl create -f mySecret.yaml
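Alternatively, the same Secret can be created directly from the command line and then verified; a sketch that assumes the literal key names match the fs.oss.* options used above:

$ kubectl create secret generic mysecret \
    --from-literal=fs.oss.accessKeyId=xxx \
    --from-literal=fs.oss.accessKeySecret=xxx
$ kubectl get secret mysecret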
  3. Use the following sample YAML to create a file named dataset.yaml that contains two parts:
  • A Dataset that describes the remote storage dataset and UFS information.
  • A JindoRuntime that starts a JindoFS cluster to provide caching services.
apiVersion: data.fluid.io/v1alpha1
kind: Dataset
metadata:
  name: demo
spec:
  mounts:
    - mountPoint: oss://<bucket-name>/<path>
      options:
        fs.oss.endpoint: <oss-endpoint>
      name: hbase
      path: "/"
      encryptOptions:
        - name: fs.oss.accessKeyId
          valueFrom:
            secretKeyRef:
              name: mysecret
              key: fs.oss.accessKeyId
        - name: fs.oss.accessKeySecret
          valueFrom:
            secretKeyRef:
              name: mysecret
              key: fs.oss.accessKeySecret
  accessModes:
    - ReadOnlyMany
---
apiVersion: data.fluid.io/v1alpha1
kind: JindoRuntime
metadata:
  name: demo
spec:
  replicas: 1
  tieredstore:
    levels:
      - mediumtype: MEM
        path: /dev/shm
        quota: 2Gi
        high: "0.99"
        low: "0.8"
  fuse:
    args:
      - -okernel_cache
      - -oro
      - -oattr_timeout=60
      - -oentry_timeout=60
      - -onegative_timeout=60

The relevant parameters are explained below:

  • mountPoint: oss://<oss_bucket>/<path> indicates the UFS path to mount; the path does not need to contain endpoint information.
  • fs.oss.endpoint: the endpoint of the OSS bucket, which can be either a public or a private network address.
  • accessModes: the access mode of the Dataset.
  • replicas: the number of workers in the JindoFS cluster.
  • mediumtype: the cache medium type. JindoFS currently supports one of HDD/SSD/MEM.
  • path: the storage path; currently only a single path is supported. When MEM is used as the cache medium, a local path is also needed to store files such as logs.
  • quota: the maximum cache capacity, which can be sized according to the UFS data size.
  • high: the upper watermark of the storage capacity.
  • low: the lower watermark of the storage capacity.
  • fuse.args: optional FUSE client mount parameters, usually used together with the Dataset access mode. When the access mode is ReadOnlyMany, kernel_cache is enabled to optimize reads with the kernel cache, and attr_timeout (file attribute cache duration), entry_timeout (file name lookup cache duration), and negative_timeout (failed lookup cache duration) can be tuned; each defaults to 7200s. When the access mode is ReadWriteMany, the default configuration is recommended: auto_cache invalidates the cache whenever the file size or modification time changes, and the three timeouts are set to 0 (see the sketch after this list).
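For reference, a ReadWriteMany Dataset would pair with the default-style fuse arguments described above; a minimal sketch:

  fuse:
    args:
      - -oauto_cache
      - -oattr_timeout=0
      - -oentry_timeout=0
      - -onegative_timeout=0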
  4. Execute the following commands to create the JindoRuntime and Dataset.
$ kubectl create -f dataset.yaml
  5. Execute the following command to view the deployment status of the Dataset.
$ kubectl get dataset

Expected output:

NAME    UFS TOTAL SIZE   CACHED      CACHE CAPACITY   CACHED PERCENTAGE   PHASE   AGE
demo    588.90KiB        0.00B       10.00GiB         0.0%                Bound   2m7s
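Once the Dataset is Bound, Fluid also exposes it as a PersistentVolumeClaim with the same name (demo), which is what the application container mounts in Step 4. A quick check:

$ kubectl get pvc demo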

Step 3: Create a Dataload that supports scheduled running

  1. Use the following sample YAML file to create a file named dataload.yaml.
apiVersion: data.fluid.io/v1alpha1
kind: DataLoad
metadata:
  name: cron-dataload
spec:
  dataset:
    name: demo
    namespace: default
  policy: Cron
  schedule: "*/2 * * * *" # Run every 2 min

The relevant parameters are explained below:

  • dataset: the name and namespace of the dataset on which the dataload runs.
  • policy: the execution policy; Once and Cron are currently supported. Here a scheduled dataload task is created.
  • schedule: the schedule that triggers the dataload.

schedule uses the following cron format:

# ┌───────────── minute (0 - 59)
# │ ┌───────────── hour (0 - 23)
# │ │ ┌───────────── day of the month (1 - 31)
# │ │ │ ┌───────────── month (1 - 12)
# │ │ │ │ ┌───────────── day of the week (0 - 6) (Sunday to Saturday;
# │ │ │ │ │                          7 is also Sunday on some systems)
# │ │ │ │ │                          or sun,mon,tue,wed,thu,fri,sat
# │ │ │ │ │
# * * * * *

Cron also supports the following operators:

  • Comma (,) denotes a list. For example, 1,3,4,7 * * * * runs the Dataload at minutes 1, 3, 4, and 7 of every hour.
  • Hyphen (-) denotes a range. For example, 1-6 * * * * runs every minute from minute 1 through minute 6 of every hour.
  • Asterisk (*) matches any value. For example, an asterisk in the hour field means "every hour".
  • Slash (/) denotes step values. For example, */2 * * * * runs every 2 minutes.

You can find more details on the cron schedule format in the Kubernetes CronJob documentation.
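For instance, to preheat once a day during off-peak hours rather than every two minutes, the schedule field could be set as follows (an illustrative value, not part of this demo):

schedule: "0 2 * * *" # Run at 02:00 every day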

For advanced configuration related to Dataload, please refer to the following configuration file:

apiVersion: data.fluid.io/v1alpha1
kind: DataLoad
metadata:
  name: cron-dataload
spec:
  dataset:
    name: demo
    namespace: default
  policy: Cron # including Once, Cron
  schedule: "* * * * *" # only set when the policy is Cron
  loadMetadata: true
  target:
    - path: <path1>
      replicas: 1
    - path: <path2>
      replicas: 2

The relevant parameters are explained below:

  • policy: the dataload execution policy, one of [Once, Cron].
  • schedule: the cron schedule; only valid when the policy is Cron.
  • loadMetadata: whether to synchronize metadata before the dataload.
  • target: the targets of the dataload; multiple targets can be specified.
  • path: the path on which the dataload runs.
  • replicas: the number of cache replicas.
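For a one-off preheat, for example before the first rollout in a new region, the same spec can be used with the Once policy; a minimal sketch (the name once-dataload is illustrative):

apiVersion: data.fluid.io/v1alpha1
kind: DataLoad
metadata:
  name: once-dataload
spec:
  dataset:
    name: demo
    namespace: default
  policy: Once
  loadMetadata: true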
  2. Execute the following command to create the Dataload.
$ kubectl apply -f dataload.yaml
  3. Execute the following command to check the Dataload status.
$ kubectl get dataload

Expected output:

NAME             DATASET   PHASE      AGE     DURATION
cron-dataload    demo      Complete   3m51s   2m12s
  4. After the Dataload status is Complete, execute the following command to view the current Dataset status.
$ kubectl get dataset

Expected output:

NAME    UFS TOTAL SIZE   CACHED      CACHE CAPACITY   CACHED PERCENTAGE   PHASE   AGE
demo    588.90KiB        588.90KiB   10.00GiB         100.0%              Bound   5m50s

As you can see, all the files in OSS have been loaded into the cache.
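For more detail on the runtime and cache state, you can also describe the Dataset:

$ kubectl describe dataset demo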

Step 4: Create an application container to access data in OSS

This article creates an application container that accesses the file above to observe the effect of the scheduled Dataload.

  1. Use the following sample YAML to create a file named app.yaml.
apiVersion: v1
kind: Pod
metadata:
  name: nginx
spec:
  containers:
    - name: nginx
      image: nginx
      volumeMounts:
        - mountPath: /data
          name: demo-vol
  volumes:
    - name: demo-vol
      persistentVolumeClaim:
        claimName: demo
  2. Execute the following command to create the application container.
$ kubectl create -f app.yaml
  3. Wait for the application container to be ready, then execute the following command to view the data in OSS:
$ kubectl exec -it nginx -- ls -lh /data

Expected output:

total 589K
-rwxrwxr-x 1 root root 589K Jul 31 04:20 RELEASENOTES.md
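Optionally, you can time a read to confirm the file is served quickly from the cache; a quick check, assuming bash is available in the image (it is in the official Debian-based nginx image):

$ kubectl exec -it nginx -- bash -c "time cat /data/RELEASENOTES.md > /dev/null"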
  4. To verify that the dataload periodically picks up changes to the underlying file, modify the contents of RELEASENOTES.md and re-upload it before the next scheduled dataload run.
$ echo "hello, crondataload." >> RELEASENOTES.md

Re-upload the file to OSS.

$ ossutil cp RELEASENOTES.md oss://<bucket-name>/<path>/RELEASENOTES.md
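Before the next run fires, you can optionally watch the Dataload instead of polling (the -w flag streams status updates):

$ kubectl get dataload cron-dataload -w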
  5. Wait for the dataload task to trigger. When it completes, execute the following command to view the running status of the Dataload job:
$ kubectl describe dataload cron-dataload

Expected output:

...
Status:
  Conditions:
    Last Probe Time:       2023-07-31T04:30:07Z
    Last Transition Time:  2023-07-31T04:30:07Z
    Status:                True
    Type:                  Complete
  Duration:                5m54s
  Last Schedule Time:      2023-07-31T04:30:00Z
  Last Successful Time:    2023-07-31T04:30:07Z
  Phase:                   Complete
...

Here, Last Schedule Time in Status is the time the most recent dataload job was scheduled, and Last Successful Time is the time it completed.

At this point, you can execute the following command to view the current Dataset status:

$ kubectl get dataset

Expected output:

NAME    UFS TOTAL SIZE   CACHED      CACHE CAPACITY   CACHED PERCENTAGE   PHASE   AGE
demo    588.90KiB        1.15MiB     10.00GiB         100.0%              Bound   10m

As you can see, the updated file has also been loaded into the cache.

  6. Execute the following command to view the updated file in the application container:
$ kubectl exec -it nginx -- tail /data/RELEASENOTES.md

Expected output:

  <name>hbase.config.read.zookeeper.config</name>
  <value>true</value>
  <description>
        Set to true to allow HBaseConfiguration to read the
        zoo.cfg file for ZooKeeper properties. Switching this to true
        is not recommended, since the functionality of reading ZK
        properties from a zoo.cfg file has been deprecated.
  </description>
</property>
hello, crondataload.

As the last line shows, the application container can now access the updated file.

Environment Cleanup

When you no longer use the data acceleration feature, you need to clean up the environment.

Execute the following commands to delete the application container, the Dataset, and the JindoRuntime.

$ kubectl delete -f app.yaml

$ kubectl delete -f dataset.yaml
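The Secret created in Step 2 is not removed by the commands above; if it is no longer needed, delete it as well:

$ kubectl delete -f mySecret.yaml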

Summary

This concludes the series on hybrid cloud optimized data access based on ACK Fluid. The Alibaba Cloud Container Service team will continue to iterate and optimize for this scenario together with users, and the series will be updated as practice deepens.

Related Links:

[1] Creating an ACK Pro Edition Cluster
https://help.aliyun.com/document_detail/176833.html#task-skz-qwk-qfb

[2] Installing the Cloud Native AI Suite
https://help.aliyun.com/zh/ack/cloud-native-ai-suite/user-guide/deploy-the-cloud-native-ai-suite#task-2038811

[3] Container Service Management Console
https://account.aliyun.com/login/login.htm?oauth_callback=https%3A%2F%2Fcs.console.aliyun.com%2F

[4] Connecting to the Cluster Through the kubectl Tool
https://help.aliyun.com/zh/ack/ack-managed-and-ack-dedicated/user-guide/obtain-the-kubeconfig-file-of-a-cluster-and-use-kubectl-to-connect-to-the-cluster#task-ubf-lhg-vdb

[5] Installing ossutil
https://help.aliyun.com/zh/oss/developer-reference/install-ossutil#concept-303829
