How a K8s Cluster in a Local IDC Can Use Cloud Computing Resources in a Serverless Manner

Author: Zhuang Yu

In the previous article "How to Quickly Add Cloud Elastic Capability to Self-built K8s in Response to Burst Traffic", we introduced how to add cloud nodes to a K8s cluster in an IDC to cope with business traffic growth, use cloud resources flexibly through multi-level elastic scheduling, and improve utilization while reducing cloud costs through automatic elastic scaling.

This method of directly adding nodes is suitable for scenarios that require custom node configuration (runtime, kubelet, NVIDIA driver, etc.) or specific ECS instance types. At the same time, it means you must maintain the node pool on the cloud yourself.

If you do not want to maintain a node pool on the cloud, you can choose the serverless approach instead: run business Pods on Alibaba Cloud ECI elastic container instances, and use cloud CPU/GPU resources more efficiently and elastically.

Overview

Using cloud CPU/GPU resources in a serverless way still targets the same problem: the K8s cluster in the IDC lacks the elasticity to keep up with rapid business growth, periodic load changes, and sudden traffic bursts.

With the serverless approach, business Pods are submitted directly in the K8s cluster and run on Alibaba Cloud ECI elastic container instances. ECI instances start quickly, share the life cycle of the business Pods, and are billed by Pod running time. There is therefore no need to create cloud nodes for the IDC K8s cluster, plan cloud resource capacity, or wait for ECS instances to be created, which delivers extreme elasticity and saves node operation and maintenance costs.

A K8s cluster in an IDC can use cloud CPU/GPU resources in a serverless manner in the following business scenarios:

  • Online business with peak-and-trough elasticity: industries such as online education and e-commerce have pronounced peak and trough compute patterns. Serverless ECI significantly reduces the maintenance of a fixed resource pool and lowers computing costs.
  • Data computing: host computing workloads such as Spark, Presto, and Argo Workflows on Serverless ECI, billed by Pod running time, effectively reducing computing costs.
  • CI/CD pipelines: Jenkins, GitLab Runner.
  • Job tasks: scheduled jobs, AI tasks.
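For the Job-task scenario above, the same Pod label mechanism described later in this article applies to CronJob Pod templates. A minimal sketch (the CronJob name, schedule, and image are hypothetical, for illustration only):

```yaml
apiVersion: batch/v1          # use batch/v1beta1 on clusters older than v1.21
kind: CronJob
metadata:
  name: nightly-report        # hypothetical job name
spec:
  schedule: "0 2 * * *"       # run daily at 02:00
  jobTemplate:
    spec:
      template:
        metadata:
          labels:
            alibabacloud.com/eci: "true"   # run each job Pod on Serverless ECI
        spec:
          restartPolicy: Never
          containers:
          - name: report
            image: registry.example.com/report:latest   # hypothetical image
```

Because the ECI instance is billed only for the Pod's running time, a short nightly job like this consumes no cloud resources between runs.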


Demonstration - K8s cluster in IDC uses cloud resources in Serverless mode

1. Prerequisites

The ACK One registered cluster has been connected to the K8s cluster in the IDC; see "Choose the Right Method, K8s Multi-cluster Management Is Not So Difficult".

2. Install the ack-virtual-node component

Install the ack-virtual-node component through the ACK One registered cluster console. After the component is installed, check the cluster nodes with the registered cluster's kubeconfig: virtual-kubelet is a virtual node connected to Alibaba Cloud Serverless ECI.

kubectl get node
NAME                               STATUS   ROLES    AGE    VERSION
iz8vb1xtnuu0ne6b58hvx0z            Ready    master   4d3h   v1.20.9   # IDC cluster node; this example has one master node, which also acts as a worker and can run business containers
virtual-kubelet-cn-zhangjiakou-a   Ready    agent    99s    v1.20.9   # virtual node created by the ack-virtual-node component

3. Run Pods with Serverless ECI (CPU/GPU tasks)

Method 1: Configure a Pod label. Add the label alibabacloud.com/eci=true to a Pod, and the Pod will run on Serverless ECI. In this example, a GPU ECI instance runs a CUDA task; you do not need to install or configure the NVIDIA driver and runtime, achieving truly serverless operation.

a. Submit the Pod and run it using Serverless ECI.

> cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
  labels:
    alibabacloud.com/eci: "true"  # run this Pod on Serverless ECI
  annotations:
    k8s.aliyun.com/eci-use-specs: ecs.gn5-c4g1.xlarge  # ECI instance type with one NVIDIA P100 GPU
spec:
  restartPolicy: Never
  containers:
    - name: cuda-container
      image: acr-multiple-clusters-registry.cn-hangzhou.cr.aliyuncs.com/ack-multiple-clusters/cuda10.2-vectoradd
      resources:
        limits:
          nvidia.com/gpu: 1 # request one GPU
EOF

b. Check the Pod. It runs on the virtual node virtual-kubelet and is actually backed by an Alibaba Cloud Serverless ECI instance.

> kubectl get pod -o wide
NAME       READY   STATUS      RESTARTS   AGE     IP              NODE                               NOMINATED NODE   READINESS GATES
gpu-pod    0/1     Completed   0          5m30s   172.16.217.90   virtual-kubelet-cn-zhangjiakou-a   <none>           <none>

> kubectl logs gpu-pod
Using CUDA Device [0]: Tesla P100-PCIE-16GB
GPU Device has SM 6.0 compute capability
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done
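If a single GPU instance type may be out of stock in the region, the eci-use-specs annotation accepts a comma-separated list of candidate types, tried in order. A sketch under that assumption (the second instance type, ecs.gn6i-c4g1.xlarge, is an illustrative alternative, not from the original example):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod-fallback      # hypothetical name
  labels:
    alibabacloud.com/eci: "true"
  annotations:
    # candidate ECI instance types, tried in the order listed
    k8s.aliyun.com/eci-use-specs: "ecs.gn5-c4g1.xlarge,ecs.gn6i-c4g1.xlarge"
spec:
  restartPolicy: Never
  containers:
    - name: cuda-container
      image: acr-multiple-clusters-registry.cn-hangzhou.cr.aliyuncs.com/ack-multiple-clusters/cuda10.2-vectoradd
      resources:
        limits:
          nvidia.com/gpu: 1   # request one GPU
```

Listing more than one type improves the chance that the Pod can be scheduled when a specific GPU specification is temporarily unavailable.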

Method 2: Set the namespace label

Set the label alibabacloud.com/eci=true for the namespace, and all newly created pods in the namespace will run in Serverless ECI mode.

kubectl label namespace <namespace-name> alibabacloud.com/eci=true
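Equivalently, the label can be set when the namespace is created, so every workload deployed into it runs on Serverless ECI from the start. A minimal sketch (the namespace name is hypothetical):

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: eci-workloads                # hypothetical namespace name
  labels:
    alibabacloud.com/eci: "true"     # new Pods in this namespace run on Serverless ECI
```

Note that the label only affects Pods created after it is set; Pods already running in the namespace are not migrated.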

4. Multi-level flexible scheduling

In the demo above, we ran Pods on Serverless ECI by labeling the Pod or the namespace. If you instead want to prefer IDC node resources while the application runs, and fall back to Alibaba Cloud Serverless ECI only when IDC resources are insufficient, you can use the multi-level elastic scheduling of ACK One registered clusters: install the ack-co-scheduler component, then define ResourcePolicy CR objects to enable it.

ResourcePolicy is a namespaced resource. Its key fields are:

  • selector: selects the Pods in the same namespace whose labels match (for example, key1=value1)

  • strategy: the scheduling strategy; currently only prefer is supported

  • units: user-defined scheduling units. When the application scales out, resources are consumed in the order listed under units; when it scales in, Pods are removed in the reverse order

    • resource: the type of elastic resource; idc, ecs, and eci are currently supported
    • nodeSelector: identifies the nodes of this scheduling unit by node label; only valid for ecs resources
    • max: the maximum number of replicas deployed on this group of resources
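Putting these fields together, a fuller sketch with all three resource types might look as follows (the policy name, selector label, node label, and max value are hypothetical, chosen only to illustrate the field semantics):

```yaml
apiVersion: scheduling.alibabacloud.com/v1alpha1
kind: ResourcePolicy
metadata:
  name: demo-policy                 # hypothetical policy name
  namespace: default
spec:
  selector:
    key1: value1                    # applies to Pods labeled key1=value1 in this namespace
  strategy: prefer
  units:
  - resource: idc                   # 1st choice: nodes in the IDC
  - resource: ecs                   # 2nd choice: cloud ECS nodes matching the nodeSelector
    nodeSelector:
      alibabacloud.com/nodepool: cloud-pool   # hypothetical node label
    max: 3                          # at most 3 replicas on this unit
  - resource: eci                   # last resort: Serverless ECI
```

When the application scales in, replicas on eci are removed first, then ecs, then idc, mirroring the scale-out order in reverse.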

Proceed as follows:

  1. Define a ResourcePolicy CR that prefers cluster resources in the IDC and then uses Serverless ECI resources on the cloud.
> cat << EOF | kubectl apply -f -
apiVersion: scheduling.alibabacloud.com/v1alpha1
kind: ResourcePolicy
metadata:
  name: cost-balance-policy
spec:
  selector:
    app: nginx           # select the application's Pods
  strategy: prefer
  units:
  - resource: idc        # prefer node resources in the IDC
  - resource: eci        # use Serverless ECI when IDC node resources are insufficient
EOF
  2. Create an application Deployment with 2 replicas, each requiring 2 CPUs.
> cat << EOF | kubectl apply -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
  labels:
    app: nginx
spec:
  replicas: 2
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      name: nginx
      annotations:
        addannotion: "true"
      labels:
        app: nginx      # must match the selector of the ResourcePolicy created in the previous step
    spec:
      schedulerName: ack-co-scheduler
      containers:
      - name: nginx
        image: acr-multiple-clusters-registry.cn-hangzhou.cr.aliyuncs.com/ack-multiple-clusters/nginx
        resources:
          requests:
            cpu: 2
          limits:
            cpu: 2
EOF
  3. Run the following command to scale the application to 4 replicas. The K8s cluster in the IDC has only one 6-CPU node, which can run at most 2 nginx Pods (system resources are reserved, so a 3rd Pod cannot start). Once IDC node resources are exhausted, the remaining 2 replicas automatically run on Alibaba Cloud Serverless ECI.
kubectl scale deployment nginx --replicas 4
  4. Check the Pods. Two Pods run on the IDC node, and two Pods run on Alibaba Cloud Serverless ECI through the virtual node.
> kubectl get pod -o wide
NAME                     READY   STATUS    RESTARTS   AGE     IP              NODE
nginx-79cd98b4b5-97s47   1/1     Running   0          84s     10.100.75.22    iz8vb1xtnuu0ne6b58hvx0z
nginx-79cd98b4b5-gxd8z   1/1     Running   0          84s     10.100.75.23    iz8vb1xtnuu0ne6b58hvx0z
nginx-79cd98b4b5-k55rb   1/1     Running   0          58s     10.100.75.24    virtual-kubelet-cn-zhangjiakou-a
nginx-79cd98b4b5-m9jxm   1/1     Running   0          58s     10.100.75.25    virtual-kubelet-cn-zhangjiakou-a

Summary

This article introduced how a K8s cluster in an IDC can use Alibaba Cloud CPU and GPU computing resources through Serverless ECI, based on an ACK One registered cluster, to cope with business traffic growth. This approach is fully serverless: it requires no additional operation and maintenance of cloud nodes and is billed by Pod running time, making it flexible and efficient.

In the future, we will publish a series of articles on ACK One registered clusters, covering disaster recovery backup, security management, and more. You are welcome to join us by searching for the DingTalk group (group number: 35688562).

Reference documents:

[1] Register cluster overview

https://help.aliyun.com/document_detail/155208.html

[2] Use elastic container ECI to expand the cluster

https://help.aliyun.com/document_detail/164370.html

[3] Instance types supported by ECI

https://help.aliyun.com/document_detail/451262.html

[4] Multi-level flexible scheduling

https://help.aliyun.com/document_detail/446694.html

Click here to view more product details of ACK One


Origin blog.csdn.net/alisystemsoftware/article/details/131875034