Technical deep dive: inside the elasticity implementation of the most popular open source Serverless framework

Author: Yuan Yi

Knative is an open source Serverless application orchestration framework built on Kubernetes. Its goal is to establish a cloud-native, cross-platform standard for Serverless application orchestration. Knative's main features include request-based autoscaling, scale to zero, multi-version management, traffic-based gray (canary) release, and event-driven capabilities.

Elasticity is the core capability of Serverless. So, as the most popular open source Serverless application framework in the CNCF community, what unique elastic capabilities does Knative provide? This article takes an in-depth look at Knative's elasticity implementation. (Note: the analysis in this article is based on Knative 1.8.0.)

Knative offers a request-based autoscaler, KPA (Knative Pod Autoscaler), and also supports the Kubernetes HPA. In addition, Knative provides a pluggable mechanism that lets you extend the elasticity implementation according to your own business needs. This article also covers combining Knative with MSE for precise elasticity, and with AHPA for request-based elasticity prediction.

First, let's look at Knative's most attractive native elasticity capability: KPA.

Request-based autoscaling with KPA

Elasticity based on CPU or memory does not always reflect actual business load. For web services, concurrency or requests per second (QPS/RPS) reflects service performance far more directly, so Knative provides request-based autoscaling. To obtain the current number of requests, Knative Serving injects a queue-proxy container into each Pod, which collects the user container's concurrency or request-rate (rps) metrics. The Autoscaler periodically collects these metrics and adjusts the number of Pods in the Deployment according to the corresponding algorithm, achieving request-based automatic scale-out and scale-in.

Image source: https://knative.dev/docs/serving/request-flow/

Elasticity algorithm based on number of requests

The Autoscaler performs its elasticity calculation based on the average number of requests (or concurrency) per Pod. By default, Knative scales on concurrency, with a default maximum concurrency of 100 per Pod. Knative also defines a target utilization (target-utilization-percentage) with a value range of 0–1 and a default of 0.7.

Taking elasticity based on concurrency as an example, the number of Pods is calculated as follows:

Number of Pods = total concurrent requests / (max concurrency per Pod × target utilization)

For example, suppose the maximum concurrency per Pod is set to 10. If 100 concurrent requests arrive and the target utilization is 0.7, the Autoscaler creates 15 Pods (100 / (0.7 × 10) ≈ 15, rounded up).
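The formula above can be sketched in Python. This is a simplified illustration of the calculation, not the actual autoscaler code:

```python
import math

def desired_pods(total_concurrency: float,
                 container_concurrency: int,
                 target_utilization: float) -> int:
    """Request-based pod count as described above:
    pods = total concurrent requests / (max concurrency per Pod * target utilization),
    rounded up to the next whole Pod."""
    return math.ceil(total_concurrency / (container_concurrency * target_utilization))

# The example from the text: 100 concurrent requests,
# containerConcurrency = 10, target utilization = 0.7.
print(desired_pods(100, 10, 0.7))  # -> 15
```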

How scale to zero works

With KPA, when there are no incoming requests, the number of Pods is automatically reduced to 0; when requests arrive, Pods are scaled up from 0. How does Knative achieve this? The answer is mode switching.

Knative defines two request routing modes: Proxy and Serve. In Proxy mode, as the name suggests, requests are forwarded through the activator component. Serve mode is the direct path: requests go straight from the gateway to the Pods without passing through the activator. As shown below:

Mode switching is performed by the autoscaler component. When the request count drops to 0, the autoscaler switches routing to Proxy mode. Incoming requests then reach the activator via the gateway; the activator places them in a queue and pushes metrics to notify the autoscaler to scale up. Once the activator detects that the scaled-up Pods are ready, it forwards the buffered requests; the autoscaler likewise observes the ready Pods and switches the mode back to Serve.
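The core of this switch can be sketched as a tiny state decision. This is a deliberately simplified model (the real autoscaler also factors in burst capacity, covered later), and the names here are illustrative, not Knative APIs:

```python
from enum import Enum

class Mode(Enum):
    PROXY = "Proxy"   # requests flow through the activator, which buffers them
    SERVE = "Serve"   # requests go straight from the gateway to the Pods

def next_mode(ready_pods: int) -> Mode:
    """Simplified mode decision: with no ready Pods, requests must be
    buffered by the activator (Proxy); once Pods are ready, traffic is
    switched to the direct path (Serve)."""
    return Mode.PROXY if ready_pods == 0 else Mode.SERVE

print(next_mode(0).value)  # Proxy
print(next_mode(3).value)  # Serve
```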

Coping with burst traffic

How to scale resources quickly under sudden traffic

KPA involves two elasticity-related concepts: Stable mode and Panic mode. These two modes are the key to understanding how KPA achieves fine-grained, request-based elasticity.

First, stable mode uses the stable window, 60 seconds by default: the average concurrency per Pod is computed over this 60-second period.

Panic mode uses the panic window, which is derived from the stable window and the panic-window-percentage parameter. panic-window-percentage defaults to 10, and the panic window is computed as: panic window = stable window × panic-window-percentage / 100, i.e. 6 seconds by default. The average concurrency per Pod is then computed over this 6-second period.

KPA will calculate the required number of Pods based on the average number of concurrent Pods in stable mode and panic mode.

So which value actually drives scaling? That depends on whether the Pod count computed in panic mode crosses the panic threshold (PanicThreshold). The panic threshold equals panic-threshold-percentage / 100; panic-threshold-percentage defaults to 200, so the threshold defaults to 2. When the Pod count computed in panic mode is greater than or equal to 2 times the current number of ready Pods, the panic-mode value is used for scaling; otherwise the stable-mode value is used.
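The stable-versus-panic decision described above can be sketched as follows. This is a simplified illustration under the defaults from the text, not the actual KPA code:

```python
def effective_pod_count(stable_pods: int, panic_pods: int,
                        ready_pods: int,
                        panic_threshold: float = 2.0) -> int:
    """panic-threshold-percentage defaults to 200, i.e. a threshold of 2.
    If the panic-window estimate is at least 2x the current ready Pods,
    the panic value drives scaling; otherwise the stable value is used."""
    if panic_pods >= panic_threshold * ready_pods:
        return panic_pods
    return stable_pods

# Traffic spike: the 6-second panic window says 8 Pods, only 3 are ready,
# and 8 >= 2 * 3, so the panic value wins.
print(effective_pod_count(stable_pods=4, panic_pods=8, ready_pods=3))  # 8
# Steady state: the panic estimate stays below 2x ready, so stable wins.
print(effective_pod_count(stable_pods=4, panic_pods=5, ready_pods=3))  # 4
```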

Obviously, panic mode is designed to deal with sudden traffic scenarios. As for elastic sensitivity, it can be adjusted through the configurable parameters mentioned above.

How to prevent Pods from being overwhelmed by sudden traffic

KPA lets you set a burst request capacity (target-burst-capacity) to keep Pods from being overwhelmed by unexpected traffic. This parameter controls whether requests are switched to Proxy mode so that the activator component acts as a request buffer. If current ready Pods × max concurrency per Pod − target-burst-capacity − concurrency computed in panic mode < 0, the burst traffic exceeds the capacity threshold, and requests are switched to the activator for buffering. When target-burst-capacity is 0, the activator is used only when Pods have scaled to 0. When it is greater than 0 and container-concurrency-target-percentage is set to 100, requests always go through the activator. A value of −1 means unlimited burst capacity, and requests likewise always go through the activator.
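The capacity check above can be sketched like this. It is a simplified model of the inequality from the text (it ignores the container-concurrency-target-percentage interaction), and the function name is illustrative:

```python
def should_use_activator(ready_pods: int, container_concurrency: int,
                         target_burst_capacity: float,
                         panic_concurrency: float) -> bool:
    """If spare capacity (ready Pods * max concurrency per Pod) minus the
    configured burst capacity minus the observed panic-window concurrency
    drops below zero, requests are routed through the activator buffer."""
    if target_burst_capacity < 0:        # -1: unlimited, always buffer
        return True
    if target_burst_capacity == 0:       # 0: only when scaled to zero
        return ready_pods == 0
    spare = (ready_pods * container_concurrency
             - target_burst_capacity - panic_concurrency)
    return spare < 0

# 5 ready Pods x 10 concurrency = 50 capacity; burst capacity 20, panic
# load 35: 50 - 20 - 35 = -5 < 0, so traffic falls back to the activator.
print(should_use_activator(5, 10, 20, 35))  # True
```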

Some tips to reduce cold starts

Delayed scale-down

For Pods with high startup cost, KPA can reduce the frequency of Pod scale-up and scale-down by setting a scale-down delay and a scale-to-zero retention period.

apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: helloworld-go
  namespace: default
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/scale-down-delay: "60s"
        autoscaling.knative.dev/scale-to-zero-pod-retention-period: "1m5s"
    spec:
      containers:
        - image: registry.cn-hangzhou.aliyuncs.com/knative-sample/helloworld-go:73fbdd56

Lower the target utilization to warm up resources

Knative provides a target utilization configuration. By lowering this value, Pods can be scaled out beyond what current usage strictly requires, expanding capacity before requests reach the target concurrency and thereby indirectly warming up resources. For example, with containerConcurrency set to 10 and a target utilization of 70 (percent), the Autoscaler creates a new Pod once the average concurrency across all existing Pods reaches 7. Since a Pod takes time to become ready, lowering the target utilization scales Pods out in advance, reducing problems such as response latency caused by cold starts.
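As a quick sanity check of the numbers above (a hypothetical helper for illustration, not Knative code):

```python
def scale_out_trigger(container_concurrency: int,
                      target_utilization_pct: float) -> float:
    """Average concurrency per Pod at which the Autoscaler adds capacity:
    containerConcurrency * target utilization (given as a percentage)."""
    return container_concurrency * target_utilization_pct / 100

# containerConcurrency = 10, target utilization 70% -> a new Pod is
# created once average concurrency per Pod reaches 7.
print(scale_out_trigger(10, 70))  # 7.0
```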

apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: helloworld-go
  namespace: default
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/target-utilization-percentage: "70"
    spec:
      containers:
        - image: registry.cn-hangzhou.aliyuncs.com/knative-sample/helloworld-go:73fbdd56

Configure KPA

Through the above introduction, we have a better understanding of the working mechanism of Knative Pod Autoscaler. Next, we will introduce how to configure KPA. Knative provides two ways to configure KPA: global mode and Revision mode.

Global mode

Global mode is configured through the config-autoscaler ConfigMap in Kubernetes. To view config-autoscaler, use the following command:

kubectl -n knative-serving get cm config-autoscaler

apiVersion: v1
kind: ConfigMap
metadata:
 name: config-autoscaler
 namespace: knative-serving
data:
 container-concurrency-target-default: "100"
 container-concurrency-target-percentage: "70"
 requests-per-second-target-default: "200"
 target-burst-capacity: "211"
 stable-window: "60s"
 panic-window-percentage: "10.0"
 panic-threshold-percentage: "200.0"
 max-scale-up-rate: "1000.0"
 max-scale-down-rate: "2.0"
 enable-scale-to-zero: "true"
 scale-to-zero-grace-period: "30s"
 scale-to-zero-pod-retention-period: "0s"
 pod-autoscaler-class: "kpa.autoscaling.knative.dev"
 activator-capacity: "100.0"
 initial-scale: "1"
 allow-zero-initial-scale: "false"
 min-scale: "0"
 max-scale: "0"
 scale-down-delay: "0s"

Parameter Description:

Parameter — Description
container-concurrency-target-default — default maximum concurrency per Pod; default 100
container-concurrency-target-percentage — concurrency target utilization; 70 means 0.7
requests-per-second-target-default — default requests-per-second (rps) target; default 200
target-burst-capacity — burst request capacity
stable-window — stable window; default 60s
panic-window-percentage — panic window percentage; the default of 10 yields a 6-second panic window (60 × 0.1 = 6)
panic-threshold-percentage — panic threshold percentage; default 200
max-scale-up-rate — maximum scale-up rate, i.e. the upper bound of one scale-up step; computed as math.Ceil(MaxScaleUpRate * readyPodsCount)
max-scale-down-rate — maximum scale-down rate, i.e. the lower bound of one scale-down step; computed as math.Floor(readyPodsCount / MaxScaleDownRate); the default of 2 means scaling in by at most half at a time
enable-scale-to-zero — whether scale to zero is enabled; enabled by default
scale-to-zero-grace-period — grace period for scaling to zero, i.e. how long scale-to-zero is delayed; default 30s
scale-to-zero-pod-retention-period — how long Pods are retained before scaling to zero; useful when Pod startup cost is high
pod-autoscaler-class — elasticity plug-in type; currently supported plug-ins include kpa, hpa, ahpa, and mpa (in the ASK scenario, supports scale to zero with MSE)
activator-capacity — activator request capacity
initial-scale — number of Pods to start when a Revision is created; default 1
allow-zero-initial-scale — whether a Revision may be created with 0 initial Pods; default false, i.e. not allowed
min-scale — minimum number of Pods retained at the Revision level; the default of 0 means the minimum can be 0
max-scale — maximum number of Pods at the Revision level; the default of 0 means no upper limit
scale-down-delay — scale-down delay; the default of 0s means scale down immediately
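The max-scale-up-rate / max-scale-down-rate formulas in the table bound how far a single scaling step may go. A small sketch of those two formulas, using the defaults shown above (illustrative code, not the autoscaler's implementation):

```python
import math

def scale_bounds(ready_pods: int,
                 max_scale_up_rate: float = 1000.0,
                 max_scale_down_rate: float = 2.0) -> tuple:
    """Per-step scaling bounds, per the table above:
    upper bound = math.Ceil(MaxScaleUpRate * readyPodsCount),
    lower bound = math.Floor(readyPodsCount / MaxScaleDownRate)."""
    upper = math.ceil(max_scale_up_rate * ready_pods)
    lower = math.floor(ready_pods / max_scale_down_rate)
    return lower, upper

# With 10 ready Pods and the default max-scale-down-rate of 2,
# one step can shrink to at most 5 Pods (halving at most).
print(scale_bounds(10))  # (5, 10000)
```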

Revision mode

In Knative, elasticity metrics can be configured per Revision. Some of the configuration parameters are as follows:

  • Metric type

    • Per-Revision metric annotation: autoscaling.knative.dev/metric
    • Supported metrics: "concurrency", "rps", "cpu", "memory", and other custom metrics
    • Default metric: "concurrency"

  • Target threshold

    • autoscaling.knative.dev/target
    • Default value: "100"

  • Scale-to-zero Pod retention period

    • autoscaling.knative.dev/scale-to-zero-pod-retention-period

  • Target utilization

    • autoscaling.knative.dev/target-utilization-percentage

The configuration example is as follows:

apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: helloworld-go
  namespace: default
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/metric: "concurrency"
        autoscaling.knative.dev/target: "50"
        autoscaling.knative.dev/scale-to-zero-pod-retention-period: "1m5s"
        autoscaling.knative.dev/target-utilization-percentage: "80"

Support for HPA

For the Kubernetes HPA, Knative also provides native configuration support, so you can use CPU- or memory-based autoscaling in Knative.

  • CPU-based elastic configuration


apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: helloworld-go
  namespace: default
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/class: "hpa.autoscaling.knative.dev"
        autoscaling.knative.dev/metric: "cpu"
  • Memory-based elastic configuration


apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: helloworld-go
  namespace: default
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/class: "hpa.autoscaling.knative.dev"
        autoscaling.knative.dev/metric: "memory"

Enhanced elasticity capabilities

Knative provides a pluggable autoscaler mechanism (pod-autoscaler-class) that supports different elasticity strategies. The elasticity plug-ins currently supported by Alibaba Cloud Container Service Knative include kpa, hpa, mpa for precise scaling, and ahpa with predictive capability.

Reserved resource pool

On top of the native KPA capabilities, we provide the ability to reserve resource pools. This function can be applied in the following scenarios:

  • Mixing ECS with ECI. If you want to use ECS resources under normal conditions and ECI for burst traffic, a reserved resource pool makes this possible. For example, if a single Pod handles 10 concurrent requests and the reserved resource pool holds 5 Pods, then under normal conditions ECS resources can handle up to 50 concurrent requests. Once concurrency exceeds 50, Knative scales out new Pods to meet demand, and the newly added capacity runs on ECI.

  • Resource warm-up. For scenarios that run entirely on ECI, resource warm-up can also be achieved with a reserved resource pool. During business troughs, a reserved instance replaces the default compute instance; when the first request arrives, the reserved instance serves it while also triggering scale-out of default-specification instances. After the default instances finish scaling out, all new requests are forwarded to them; the reserved instance accepts no new requests and is taken offline once its in-flight requests complete. This seamless replacement balances cost and efficiency: it reduces the cost of resident instances without incurring noticeable cold start delays.

Precise scaling

A single Pod has limited request throughput; forwarding too many requests to the same Pod overloads the server side. It is therefore necessary to precisely control the number of requests a single Pod handles concurrently, especially in AIGC scenarios where a single request can occupy significant GPU resources and the per-Pod concurrency must be strictly limited.

Knative, combined with the MSE cloud-native gateway, provides precise concurrency-based elasticity control through the mpa elasticity plug-in.

mpa obtains concurrency figures from the MSE gateway and computes the number of Pods needed for scaling, while the MSE gateway forwards requests precisely based on concurrency.

The configuration example is as follows:

apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: helloworld-go
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/class: mpa.autoscaling.knative.dev
        autoscaling.knative.dev/max-scale: '20'
    spec:
      containerConcurrency: 5
      containers:
      - image: registry-vpc.cn-beijing.aliyuncs.com/knative-sample/helloworld-go:73fbdd56
        env:
        - name: TARGET
          value: "Knative"

Parameter Description:

Parameter — Description
autoscaling.knative.dev/class: mpa.autoscaling.knative.dev — mpa indicates scaling based on MSE metrics, with support for scale to zero
autoscaling.knative.dev/max-scale: '20' — upper limit of 20 Pods when scaling out
containerConcurrency: 5 — a single Pod handles at most 5 concurrent requests

Elasticity prediction with AHPA

Container Service AHPA (Advanced Horizontal Pod Autoscaler) automatically identifies elasticity cycles and predicts capacity based on historical business metrics, solving the problem of elasticity lag.

Knative currently supports AHPA's elasticity capability. When request patterns are periodic, elasticity prediction can warm up resources in advance. Compared with warming up resources by lowering the utilization threshold, AHPA maximizes resource utilization.

In addition, since AHPA supports custom metrics, combining Knative with AHPA enables autoscaling based on message queue depth as well as response time (rt).

The configuration example of using AHPA based on rps is as follows:

apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: autoscale-go
  namespace: default
spec:
  template:
    metadata:
      labels:
        app: autoscale-go
      annotations:
        autoscaling.knative.dev/class: ahpa.autoscaling.knative.dev
        autoscaling.knative.dev/target: "10"
        autoscaling.knative.dev/metric: "rps"
        autoscaling.knative.dev/minScale: "1"
        autoscaling.knative.dev/maxScale: "30"
        autoscaling.alibabacloud.com/scaleStrategy: "observer"
    spec:
      containers:
        - image: registry.cn-hangzhou.aliyuncs.com/knative-sample/autoscale-go:0.1

Parameter Description:

Parameter — Description
autoscaling.knative.dev/class: ahpa.autoscaling.knative.dev — specifies the AHPA elasticity plug-in.
autoscaling.knative.dev/metric: "rps" — sets the AHPA metric. Currently supports concurrency, rps, and response time (rt).
autoscaling.knative.dev/target: "10" — sets the AHPA metric threshold. In this example the rps threshold is 10, i.e. a single Pod handles at most 10 requests per second.
autoscaling.knative.dev/minScale: "1" — sets the minimum number of instances to 1.
autoscaling.knative.dev/maxScale: "30" — sets the maximum number of instances to 30.
autoscaling.alibabacloud.com/scaleStrategy: "observer" — sets the scaling mode; the default is observer. observer: only observes without performing actual scaling actions, which lets you verify whether AHPA works as expected; since prediction requires 7 days of historical data, services are created in observer mode by default. auto: AHPA performs scale-out and scale-in; the configured metrics and thresholds are fed to AHPA, which ultimately decides whether to act.

Summary

Starting from KPA, Knative's signature elasticity implementation, this article covered request-based autoscaling, scale to zero, and handling burst traffic, as well as our extensions to Knative's elasticity: reserved resource pools, precise scaling, and elasticity prediction.

Finally, we offer a hands-on activity for using Knative in an AIGC scenario. Welcome to participate: unlock an exclusive AI portrait of your pet! Event period: 2023/08/24–09/24.

https://developer.aliyun.com/adc/series/petsai#J_2264716120
