How to quickly add cloud elasticity to a self-built K8s cluster to handle burst traffic

Container technology, led by Kubernetes, has changed the way applications are delivered, and Kubernetes is rapidly becoming a unified API for data centers around the world.

To keep services stable and user access uninterrupted, application architectures constantly pursue high availability and high elasticity, and a multi-cluster architecture provides both by nature. It is only with the unified, standard Kubernetes API that multi-cluster and hybrid cloud capabilities begin to show their real value.

In the previous article, "Choose the right method, K8s multi-cluster management is not so difficult", we covered the application scenarios, architecture, and security hardening of ACK One registered clusters, and showed how the observability capabilities of Alibaba Cloud Container Service (ACK) can be used in K8s clusters on other clouds and in self-built IDC K8s clusters to unify operation and maintenance of K8s clusters on and off the cloud.

In this article, we focus on another important use case of ACK One registered clusters: elasticity on the cloud.

Typical Application Scenarios and Advantages of Elastic Capabilities on the Cloud

Cloud elasticity for ACK One registered clusters fits the following scenarios:

1. Rapid business growth: A K8s cluster deployed in a local IDC is limited by the IDC's computing resources and often cannot be expanded in time. Procuring and deploying computing resources takes a long time and cannot keep up with rapidly growing business traffic.

2. Periodic or sudden business growth: The amount of computing resources in a local IDC is relatively fixed, so it cannot cope with periodic business peaks or sudden traffic growth.

The fundamental answer to both scenarios is elastic computing resources: capacity that expands or shrinks as business traffic changes, meeting business needs while keeping costs balanced.

The cloud elasticity architecture of an ACK One registered cluster is shown in the following figure:

With an ACK One registered cluster, a K8s cluster in a local IDC can elastically scale out into an Alibaba Cloud ECS node pool, using the elasticity of Alibaba Cloud Container Service to add capacity as business traffic grows and to release it afterwards to save costs. For AI scenarios in particular, a registered cluster lets you connect GPU machines on the cloud to the K8s cluster in the IDC.

Best practice for adding Alibaba Cloud GPU computing power to a local IDC K8s cluster

1. Create an ACK One registered cluster

On the registered cluster page of the ACK One console, we have already created the registered cluster "ACKOneRegisterCluster1" and connected it to the K8s cluster in the local IDC. For details, see "Choose the right method, K8s multi-cluster management is not so difficult".

ACK One console registered cluster page:
https://account.aliyun.com/login/login.htm?oauth_callback=https%3A%2F%2Fcs.console.aliyun.com%2Fone

Once the cluster is connected, you can view the local IDC K8s cluster in the ACK One console. At this point, it has only one master node.

2. Create a GPU node pool and manually scale it out to create a GPU node

Create a node pool named GPU-P100 in the registered cluster to add a GPU machine on the cloud to the K8s cluster in the IDC.

Run kubectl in the IDC K8s cluster to view the node information.

kubectl get node
NAME                           STATUS   ROLES    AGE     VERSION
cn-zhangjiakou.172.16.217.xx   Ready    <none>   5m35s   v1.20.9    // GPU machine on the cloud
iz8vb1xtnuu0ne6b58hvx0z        Ready    master   20h     v1.20.9    // IDC machine

kubectl describe node cn-zhangjiakou.172.16.217.xx
Name:               cn-zhangjiakou.172.16.217.xx
Roles:              <none>
Labels:             aliyun.accelerator/nvidia_count=1             //nvidia labels
                    aliyun.accelerator/nvidia_mem=16280MiB        //nvidia labels 
                    aliyun.accelerator/nvidia_name=Tesla-P100-PCIE-16GB  //nvidia labels
                    beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/os=linux
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=cn-zhangjiakou.172.16.217.xx
                    kubernetes.io/os=linux
Capacity:
  cpu:                4
  ephemeral-storage:  123722704Ki
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             30568556Ki
  nvidia.com/gpu:     1              //nvidia gpu
  pods:               110
Allocatable:
  cpu:                4
  ephemeral-storage:  114022843818
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             30466156Ki
  nvidia.com/gpu:     1              //nvidia gpu
  pods:               110
System Info:
  OS Image:                   Alibaba Cloud Linux (Aliyun Linux) 2.1903 LTS (Hunting Beagle)
  Operating System:           linux
  Architecture:               amd64
  Container Runtime Version:  docker://19.3.13
  Kubelet Version:            v1.20.9
  Kube-Proxy Version:         v1.20.9
......
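
The same check can be scripted against the JSON form of the node list. The following is a minimal Python sketch, assuming the output of `kubectl get nodes -o json` is available as text. The embedded sample is trimmed to the fields shown above, the empty `allocatable` of the IDC node is a placeholder, and the helper name `gpu_nodes` is hypothetical.

```python
import json

# Trimmed sample in the shape of `kubectl get nodes -o json`, using values from
# the output above. Real output carries many more fields; the IDC node's
# allocatable map is left empty here as a placeholder.
SAMPLE = """
{
  "items": [
    {
      "metadata": {
        "name": "cn-zhangjiakou.172.16.217.xx",
        "labels": {"aliyun.accelerator/nvidia_name": "Tesla-P100-PCIE-16GB"}
      },
      "status": {"allocatable": {"cpu": "4", "nvidia.com/gpu": "1"}}
    },
    {
      "metadata": {"name": "iz8vb1xtnuu0ne6b58hvx0z", "labels": {}},
      "status": {"allocatable": {}}
    }
  ]
}
"""


def gpu_nodes(node_list):
    """Return {node name: allocatable GPU count} for nodes with at least one GPU."""
    result = {}
    for node in node_list.get("items", []):
        allocatable = node.get("status", {}).get("allocatable", {})
        # Kubernetes reports resource quantities as strings, e.g. "1".
        count = int(allocatable.get("nvidia.com/gpu", "0"))
        if count > 0:
            result[node["metadata"]["name"]] = count
    return result


if __name__ == "__main__":
    print(gpu_nodes(json.loads(SAMPLE)))  # {'cn-zhangjiakou.172.16.217.xx': 1}
```

In practice the sample string would be replaced by the real command output, for example by piping it into the script and reading standard input.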

3. Run a GPU test task

Submit a GPU test task to the K8s cluster in the IDC. The task runs successfully.

> cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  restartPolicy: Never
  containers:
    - name: cuda-container
      image: acr-multiple-clusters-registry.cn-hangzhou.cr.aliyuncs.com/ack-multiple-clusters/cuda10.2-vectoradd
      resources:
        limits:
          nvidia.com/gpu: 1 # requesting 1 GPU
EOF

> kubectl logs gpu-pod
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done
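
The launch configuration in the log is consistent with a ceiling division: 50000 elements at 256 threads per block need 196 blocks. The short Python sketch below reproduces that arithmetic. It assumes the test image sizes its grid so that every element gets one thread, which is the usual pattern for vector-add samples but is not confirmed by the log itself; the function name `blocks_needed` is illustrative.

```python
def blocks_needed(num_elements, threads_per_block):
    """Smallest number of blocks whose threads cover every element (ceiling division)."""
    return (num_elements + threads_per_block - 1) // threads_per_block


if __name__ == "__main__":
    blocks = blocks_needed(50000, 256)
    # 196 blocks * 256 threads = 50176 threads, enough for 50000 elements;
    # 195 blocks would provide only 49920 threads.
    print(blocks, blocks * 256)  # 196 50176
```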

Multi-level elastic scheduling: custom elastic resource priority

Custom elastic resource priority scheduling is an elastic scheduling policy provided by Alibaba Cloud. With a custom resource policy (ResourcePolicy), you set the order in which an application's Pods are scheduled onto different types of node resources when the application is released or scaled out. When the application is scaled in, Pods are removed in the reverse of that scheduling order.

As the demonstration above shows, with an ACK One registered cluster you can create a node pool from ECS resources on the cloud and add it to the IDC cluster. You can label the node pool or its nodes, and then use the Pod's node affinity or nodeSelector to choose whether the Pod runs on a local IDC node or on an ECS node on the cloud. This approach requires changing the configuration of each application Pod, and a production system with many applications needs scheduling rules written for each of them, so it is best suited to custom scheduling scenarios. For example, GPU training tasks that need a specific CUDA version can be scheduled to specific GPU ECS instances on the cloud.
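
As an illustration of the nodeSelector approach, the Python sketch below takes the gpu-pod manifest from the test above and adds a nodeSelector for one node pool. It assumes that nodes in the pool carry the `alibabacloud.com/nodepool-id` label used in the ResourcePolicy examples in this article; `np7b30xxx` is the placeholder ID from those examples, and the helper `pin_to_node_pool` is hypothetical.

```python
import copy
import json

# Label key taken from the ResourcePolicy examples in this article. Whether your
# nodes carry it is an assumption to verify with `kubectl describe node`.
NODEPOOL_LABEL = "alibabacloud.com/nodepool-id"

# The gpu-pod manifest from the GPU test, written as a Python dict.
gpu_pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "gpu-pod"},
    "spec": {
        "restartPolicy": "Never",
        "containers": [
            {
                "name": "cuda-container",
                "image": "acr-multiple-clusters-registry.cn-hangzhou.cr.aliyuncs.com/"
                         "ack-multiple-clusters/cuda10.2-vectoradd",
                "resources": {"limits": {"nvidia.com/gpu": 1}},
            }
        ],
    },
}


def pin_to_node_pool(pod, pool_id):
    """Return a copy of a Pod manifest whose nodeSelector targets one node pool."""
    pinned = copy.deepcopy(pod)  # leave the caller's manifest untouched
    selector = pinned.setdefault("spec", {}).setdefault("nodeSelector", {})
    selector[NODEPOOL_LABEL] = pool_id
    return pinned


if __name__ == "__main__":
    # kubectl accepts JSON as well as YAML, so the output can be piped to
    # `kubectl apply -f -`. Replace the placeholder ID with a real node pool ID.
    print(json.dumps(pin_to_node_pool(gpu_pod, "np7b30xxx"), indent=2))
```

Every application that should run on the cloud needs this kind of per-manifest change, which is the maintenance cost described above.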

To make it simpler for K8s clusters in an IDC to use ECS resources on the cloud, ACK One registered clusters provide multi-level elastic scheduling. After installing the ack-co-scheduler component, you can define ResourcePolicy CR objects to use it.

ResourcePolicy is a namespaced CR. Its key parameters are:

  • selector: Declares that the ResourcePolicy applies to Pods in the same namespace that carry the label key1=value1.
  • strategy: The scheduling strategy. Currently only prefer is supported.
  • units: User-defined scheduling units. When the application scales out, Pods are placed on resources in the order listed under units; when it scales in, Pods are removed in the reverse order.
  • resource: The type of elastic resource. Currently three types are supported: idc, ecs, and eci.
  • nodeSelector: Uses node labels to identify the nodes in the scheduling unit. It takes effect only for ecs resources.
  • max: The maximum number of instances that can be deployed in this unit.
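
The rules in the list above can be turned into a quick local sanity check before a policy is applied. The Python sketch below is only an illustration of those rules and is not the validation that ack-co-scheduler itself performs; the function name `check_resource_policy_spec` is hypothetical, and treating max as a non-negative integer is an assumption.

```python
ALLOWED_STRATEGIES = {"prefer"}            # "currently only supports prefer"
ALLOWED_RESOURCES = {"idc", "ecs", "eci"}  # the three resource types listed above


def check_resource_policy_spec(spec):
    """Return a list of problems found in a ResourcePolicy spec given as a dict."""
    problems = []
    if spec.get("strategy") not in ALLOWED_STRATEGIES:
        problems.append("strategy must be 'prefer'")
    for index, unit in enumerate(spec.get("units") or []):
        resource = unit.get("resource")
        if resource not in ALLOWED_RESOURCES:
            problems.append(f"units[{index}]: unknown resource {resource!r}")
        if "nodeSelector" in unit and resource != "ecs":
            problems.append(f"units[{index}]: nodeSelector only takes effect for ecs")
        max_pods = unit.get("max")
        if max_pods is not None:
            # Assumption: max is a non-negative whole number of instances.
            is_int = isinstance(max_pods, int) and not isinstance(max_pods, bool)
            if not is_int or max_pods < 0:
                problems.append(f"units[{index}]: max must be a non-negative integer")
    return problems


if __name__ == "__main__":
    spec = {
        "selector": {"app": "nginx"},
        "strategy": "prefer",
        "units": [
            {"resource": "idc", "max": 2},
            {"resource": "ecs",
             "nodeSelector": {"alibabacloud.com/nodepool-id": "np7b30xxx"},
             "max": 4},
        ],
    }
    print(check_resource_policy_spec(spec))  # []
```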

ResourcePolicy supports the following scenarios:

Scenario 1: Use IDC cluster resources first, then ECS resources on the cloud

apiVersion: scheduling.alibabacloud.com/v1alpha1
kind: ResourcePolicy
metadata:
  name: cost-balance-policy
spec:
  selector:
    app: nginx           # select the application Pods
  strategy: prefer
  units:
  - resource: idc        # use node resources in the IDC first
  - resource: ecs        # when IDC node resources are insufficient, use ECS on the cloud; nodeSelector picks the nodes
    nodeSelector:
      alibabacloud.com/nodepool-id: np7b30xxx

Scenario 2: Mixed use of IDC resources and ECS resources on the cloud

apiVersion: scheduling.alibabacloud.com/v1alpha1
kind: ResourcePolicy
metadata:
  name: load-balance-policy
spec:
  selector:
    app: nginx
  strategy: prefer
  units:
  - resource: idc
    max: 2             # start at most 2 application instances on IDC nodes
  - resource: ecs
    nodeSelector:
      alibabacloud.com/nodepool-id: np7b30xxx
    max: 4             # start at most 4 application instances in the ECS node pool
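
To make the ordering concrete, the Python sketch below models how replicas would be spread across the units of Scenario 2 when scaling out, and which units lose Pods first when scaling in. It is a deliberately simplified model: it treats max as the only limit and ignores the actual free capacity of nodes, which the real scheduler also has to respect. The function names `scale_up` and `scale_down` are illustrative.

```python
def scale_up(replicas, units):
    """Fill units in listed order; return (per-unit Pod counts, Pods left unplaced)."""
    counts = []
    remaining = replicas
    for unit in units:
        limit = unit.get("max")
        # A unit without max takes everything that is still unplaced (in this model).
        count = remaining if limit is None else min(remaining, limit)
        counts.append(count)
        remaining -= count
    return counts, remaining


def scale_down(counts, to_remove):
    """Remove Pods in reverse unit order; return the new per-unit counts."""
    counts = list(counts)
    for index in reversed(range(len(counts))):
        removed = min(counts[index], to_remove)
        counts[index] -= removed
        to_remove -= removed
    return counts


if __name__ == "__main__":
    # Units from Scenario 2: at most 2 Pods on idc, at most 4 on ecs.
    units = [{"resource": "idc", "max": 2}, {"resource": "ecs", "max": 4}]
    print(scale_up(5, units))      # ([2, 3], 0): 2 Pods on idc, 3 on ecs
    print(scale_up(7, units))      # ([2, 4], 1): one Pod exceeds both limits
    print(scale_down([2, 3], 4))   # [1, 0]: ecs is drained before idc
```

In this model the IDC nodes are the first to be filled and the last to be emptied, so the ECS capacity on the cloud is only used while the extra load lasts.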

Summary

In this demonstration, we added an Alibaba Cloud P100 GPU machine to the K8s cluster in the IDC to expand the IDC's GPU computing power.

With an ACK One registered cluster:

1. You can choose among the ECS instance types and specifications on Alibaba Cloud, including x86, ARM, and GPU.

2. You can manually scale the number of ECS instances out and in.

3. You can configure automatic scaling of the number of ECS instances.

4. You can use multi-level elastic scheduling to use IDC resources first. When IDC resources are insufficient, the ECS node pool on the cloud can be scaled out automatically to handle unexpected business traffic.

Author: Zhuang Yu

This article is the original content of Alibaba Cloud and may not be reproduced without permission.

Origin blog.csdn.net/yunqiinsight/article/details/131644992