Shuhe uses Knative to accelerate AI model service deployment

Summary

The data processing, training, and inference behind AI services consume large amounts of computing resources and carry high operation and maintenance costs. In Shuhe Technology's financial business scenarios, models are iterated frequently, and multiple versions of a model are deployed online at the same time to evaluate its real-world effect, so resource costs are high. Improving AI service O&M efficiency and reducing resource costs while guaranteeing service quality is a real challenge.

Knative is an open source serverless application framework built on Kubernetes that provides request-based autoscaling, scale-to-zero, and grayscale release. Deploying serverless applications through Knative lets you focus on application logic and consume resources on demand. Combining AI services with Knative therefore improves efficiency and reduces costs.

Shuhe Technology currently deploys more than 500 AI model services through Knative, saving 60% of resource costs, and the average deployment cycle has been shortened from one day to half a day.

In this talk, we show how to deploy AI workloads based on Knative. The content covers:

  • Introduction to Knative
  • Shuhe's best practices based on Knative
  • How to deploy Stable Diffusion on Knative

Introduction to Knative

As we all know, Kubernetes has attracted the attention of many vendors and developers since it was open sourced in 2014. As an open source container orchestration system, K8s lets users reduce operation and maintenance costs and improve O&M efficiency, and its standardized APIs help users avoid lock-in to any single cloud vendor, which in turn has formed a cloud-native ecosystem with K8s at its core. According to the CNCF 2021 survey, 96% of enterprises are using or evaluating Kubernetes.

As cloud-native technology evolves, serverless technology, which is application-centric and consumes resources on demand, has gradually become mainstream. Gartner predicts that more than 50% of global enterprises will have deployed serverless by 2025.

We know that FaaS, represented by AWS Lambda, made serverless popular. FaaS does simplify programming: you write a piece of code and run it directly, without caring about the underlying infrastructure. However, FaaS still has obvious shortcomings, including an intrusive development model, function runtime limitations, and weak cross-cloud portability.

Container technology, represented by K8s, solves these problems well, and the core idea of serverless is precisely to focus on business logic while minimizing infrastructure concerns.

So how do we provide developers with serverless container technology based on the open K8s standard? The answer is: Knative.

Knative development trajectory

Knative is an open source serverless container orchestration framework based on Kubernetes, announced at Google Cloud Next 2018. Its goal is to define a cloud-native, cross-platform orchestration standard for serverless applications and to build an enterprise-grade serverless platform. Alibaba Cloud Container Service has offered productized Knative capabilities since 2019, and since Knative joined the CNCF in March 2022, more and more developers have been embracing it.

Knative overview

Knative's core modules are Serving, for deploying workloads, and Eventing, an event-driven framework.

The core capability of Knative Serving is simple, efficient application hosting, which is also the foundation of its serverless capabilities. Knative automatically scales out instances during peak periods based on your application's request volume and scales them back in when the request volume drops, which saves costs automatically.

Knative Eventing provides a complete event model. Once ingested, events flow internally as CloudEvents, and the Broker/Trigger mechanism provides an ideal way to process and distribute them.

Knative application model

The Knative application model is Knative Service:

  • A Knative Service contains two parts of configuration. One part, called Configuration, describes the workload; each update to the Configuration creates a new Revision.
  • The other part, the Route, is responsible for Knative's traffic management.
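
As a sketch, a minimal Knative Service looks like the following; the name and image are hypothetical. spec.template is the Configuration part, and spec.traffic drives the Route:

```yaml
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: model-demo                                    # hypothetical service name
spec:
  template:                                           # Configuration: each change creates a new Revision
    spec:
      containers:
        - image: registry.example.com/model-demo:v1   # hypothetical image
  traffic:                                            # Route: where requests are sent
    - latestRevision: true
      percent: 100
```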

Let's take a look at what we can do with traffic-based grayscale release:

Assume we initially created Revision V1. When a new version change is needed, we only have to update the Configuration in the Service, and a V2 Revision is created. We can then set different traffic ratios for V1 and V2 through the Route, say 70% for V1 and 30% for V2, so traffic is distributed to the two versions at a 7:3 ratio. Once V2 is verified and shows no problems, we keep adjusting the ratio to continue the grayscale rollout until V2 carries 100%. During this process, if any anomaly is found in the new version, we can roll back at any time by adjusting the ratio.

In addition, we can tag a Revision in the Route's traffic list. A tagged Revision gets its own URL and can be tested in isolation, which can be understood as canary verification; this does not affect access by normal traffic.
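
A sketch of such a Route, with hypothetical revision names: V1 and V2 split traffic 7:3, and the tag on V2 exposes a dedicated URL for canary verification:

```yaml
spec:
  traffic:
    - revisionName: model-demo-00001   # V1
      percent: 70
    - revisionName: model-demo-00002   # V2
      percent: 30
      tag: v2                          # dedicated URL such as v2-model-demo.<namespace>.<domain>
```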

Request-based autoscaling: KPA

Why do we need request-based autoscaling?

Scaling based on CPU or memory does not always reflect the actual load on the business. For web services, concurrency or requests per second (QPS/RPS) reflects service performance much more directly, so Knative provides request-based autoscaling.

How are metrics collected?

To obtain the current number of requests, Knative Serving injects a queue-proxy container into each Pod, which collects concurrency or request-rate (rps) metrics for the user container. The Autoscaler scrapes these metrics periodically and adjusts the Pod count of the Deployment according to the corresponding algorithm, achieving request-based scale out and in.

How is the desired Pod count calculated?

The Autoscaler calculates the required Pod count from the average number of requests (or concurrency) per Pod. By default, Knative scales on concurrency, with a default maximum concurrency of 100 per Pod. Knative also provides a setting called target-utilization-percentage, the target utilization.
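
The metric and target can be chosen per Revision through annotations; a sketch (fragment of a Knative Service) with assumed values:

```yaml
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/class: "kpa.autoscaling.knative.dev"  # KPA, the default autoscaler
        autoscaling.knative.dev/metric: "concurrency"                 # or "rps"
        autoscaling.knative.dev/target: "100"                         # per-Pod concurrency target
```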

Taking elasticity based on concurrency as an example, the number of Pods is calculated as follows:

Pod count = total concurrency / (per-Pod maximum concurrency × target utilization)
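
For example (illustrative numbers only): with 300 concurrent requests in flight, a per-Pod maximum concurrency of 100, and a target utilization of 0.7, the Autoscaler needs 300 / (100 × 0.7) ≈ 4.3, which rounds up to 5 Pods.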

Implementation mechanism of scale-to-zero

With KPA, when there are no traffic requests, the Pod count automatically drops to 0; when requests arrive, Pods scale up from 0. How does Knative achieve this? The answer is mode switching.

Knative defines two request access modes: Proxy and Serve. In Proxy mode, as the name suggests, requests are proxied through the activator component. Serve mode is direct: requests go from the gateway straight to the Pods without passing through the activator.

Mode switching is performed by the autoscaler component. When the request count drops to 0, the autoscaler switches the access mode to Proxy. Requests then reach the activator through the gateway; the activator buffers them in a queue and pushes metrics to notify the autoscaler to scale up. Once the activator detects Pods that are ready, it forwards the buffered requests to them.
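
Whether a Revision may scale to zero is controlled per Revision by the min-scale annotation; a sketch with an assumed upper bound:

```yaml
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/min-scale: "0"    # 0 allows scale-to-zero
        autoscaling.knative.dev/max-scale: "10"   # assumed upper bound
```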

Coping with burst traffic

How to scale out resources quickly under sudden traffic

KPA involves two elasticity-related modes: Stable and Panic. Looking at these two modes shows how KPA can scale resources quickly.

First, stable mode works over the stable window, 60 seconds by default: the average Pod concurrency is computed over this 60-second window.

Panic mode works over the panic window, which is derived from the stable window and the panic-window-percentage parameter: panic window = stable window × panic-window-percentage. By default this is 6 seconds, so the average Pod concurrency is computed over a 6-second window.

KPA computes a required Pod count from the average Pod concurrency of both stable mode and panic mode.

So which of the two values actually takes effect? That depends on whether the Pod count computed in panic mode crosses the panic threshold (PanicThreshold). By default, when the panic-mode Pod count is at least twice the current number of ready Pods, the panic-mode value is used for scaling; otherwise the stable-mode value is used.

Clearly, panic mode is designed for bursty traffic scenarios, and the sensitivity of scaling can be tuned through the configurable parameters mentioned above.
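
These parameters can be tuned per Revision through annotations; a sketch showing the default values spelled out explicitly:

```yaml
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/window: "60s"                        # stable window
        autoscaling.knative.dev/panic-window-percentage: "10.0"      # panic window = 10% of stable window (6s)
        autoscaling.knative.dev/panic-threshold-percentage: "200.0"  # enter panic at >= 2x ready Pods
```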

How to prevent Pods from being overwhelmed by sudden traffic

KPA lets you set a burst request capacity (target-burst-capacity) to protect Pods from unexpected traffic. Based on the value computed from this parameter, KPA decides whether to switch requests to Proxy mode, so that under sudden traffic the activator component acts as a request buffer and keeps the Pods from being overloaded.
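
The burst capacity is likewise a per-Revision annotation; the value below is an assumption:

```yaml
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/target-burst-capacity: "200"  # extra headroom; requests stay proxied via the activator until capacity covers bursts
```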

Some tips to reduce cold starts

Delayed scale-down

For Pods that are expensive to start, KPA can reduce the frequency of Pod scale-up and scale-down by setting a scale-down delay and a scale-to-zero retention period.

Lower the target utilization to warm up resources

Knative provides a target threshold utilization configuration. Lowering this value scales out more Pods than the actual demand strictly requires and triggers expansion before requests reach the target concurrency, which indirectly warms up resources.
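
Both tips map onto per-Revision annotations; a sketch with illustrative values:

```yaml
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/scale-down-delay: "15m"                    # delay scale-down after load drops
        autoscaling.knative.dev/scale-to-zero-pod-retention-period: "10m"  # keep the last Pod a while before scaling to 0
        autoscaling.knative.dev/target-utilization-percentage: "70"        # scale out before the target concurrency is reached
```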

Flexible policy configuration

With the introduction above, we now have a better understanding of how the Knative Pod Autoscaler works. Next, let's look at how to configure KPA. Knative provides two configuration modes: global and per-Revision. Global configuration is done through the config-autoscaler ConfigMap.
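
As a sketch, the global ConfigMap lives in the knative-serving namespace; the keys below are representative and shown with their usual defaults:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: config-autoscaler
  namespace: knative-serving
data:
  enable-scale-to-zero: "true"                   # allow scaling to 0
  scale-to-zero-grace-period: "30s"              # grace period before removing the last Pod
  stable-window: "60s"                           # stable-mode window
  panic-window-percentage: "10.0"                # panic window as a percentage of the stable window
  panic-threshold-percentage: "200.0"            # panic threshold (2x ready Pods)
  container-concurrency-target-default: "100"    # default per-Pod concurrency target
  container-concurrency-target-percentage: "70"  # target utilization percentage
```

Per-Revision configuration uses autoscaling.knative.dev/* annotations on the Revision template, as in the earlier examples; for a given Revision, annotations take precedence over the global settings.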

Alibaba Cloud Container Service Knative

Open source projects generally cannot be used in production as-is. Productizing Knative raises several problems:

  • Many control components, complex to operate and maintain
  • Diverse compute: how to schedule workloads onto different resource specifications on demand
  • Cloud-product-grade gateway capabilities
  • How to solve the cold start problem
  • …

To address these problems, we provide the Container Service Knative product. It is fully compatible with community Knative and supports the KServe AI inference framework. It enhances elasticity with reserved resource pools, precise elasticity, and elastic prediction; supports fully managed ALB, MSE, and ASM gateways; integrates with the EventBridge cloud product for event-driven workloads; and is comprehensively integrated with other Alibaba Cloud products such as ECI, ARMS, and Log Service.

Reserved resource pool

On top of the native KPA capabilities, we provide the ability to reserve resource pools. This function can be applied in the following scenarios:

  • Mixing ECS with ECI. If we want to use ECS resources in normal conditions and ECI for burst traffic, a reserved resource pool achieves this: day-to-day traffic is carried by ECS resources, while newly scaled-out capacity for sudden traffic uses ECI.
  • Resource warm-up. For scenarios that use ECI exclusively, a reserved resource pool also enables warm-up. During business troughs, a low-specification reserved instance stands in for the default compute instance; when the first request arrives, the reserved instance serves it while also triggering scale-up of default-specification instances. Once the default-specification instances are up, all new requests are forwarded to them and the reserved instance is taken offline. This seamless replacement balances cost and efficiency: it lowers the cost of a resident instance without incurring significant cold start time.

Precise elasticity

The throughput of a single Pod is limited; if too many requests are forwarded to the same Pod, the server side overloads. The number of requests a single Pod processes concurrently must therefore be controlled precisely. This matters especially in AIGC scenarios, where a single request occupies significant GPU resources and the concurrency handled by each Pod must be strictly limited.

Knative combined with the MSE cloud-native gateway provides concurrency-based precise elasticity through the mpa elasticity plug-in.

mpa obtains concurrency figures from the MSE gateway and computes the required Pod count for scale-out and scale-in; once the Pods are ready, the MSE gateway forwards requests to the corresponding Pods.

Elastic prediction

Container Service AHPA (Advanced Horizontal Pod Autoscaler) automatically identifies elasticity cycles and predicts capacity from historical business metrics, solving the problem of scaling lag.

Knative currently supports AHPA's elastic capabilities: when request patterns are periodic, elastic prediction can warm up resources. Compared with lowering the utilization threshold to warm up resources, AHPA maximizes resource utilization.

Through AHPA you can:

  • Prepare resources in advance: scale out before requests arrive
  • Stable and reliable: the predicted RPS throughput is sufficient to cover the actual number of requests

Event-driven

Knative provides event-driven capabilities through Eventing, and Eventing uses Broker/Trigger for event flow and distribution.

But using native Eventing directly also faces some problems:

  • How to cover enough event sources
  • How to ensure that events are not lost during transfer

How to build production-level event-driven capabilities? We integrate with Alibaba Cloud EventBridge.

EventBridge is a serverless event bus service provided by Alibaba Cloud. Alibaba Cloud Knative Eventing integrates EventBridge under the hood, giving Knative Eventing cloud-product-grade event-driven capabilities.

So how do we use Knative in real business scenarios, and what benefits can it bring? Next, I will introduce Shuhe Technology's best practices with Knative.

Shuhe's best practices based on Knative

Shuhe Technology (full name "Shanghai Shuhe Information Technology Co., Ltd.") was established in August 2015. Driven by big data and technology, Shuhe Technology provides financial institutions with efficient intelligent retail finance solutions, offering marketing and customer acquisition, risk prevention and control, and operations management services across fields such as consumer credit, small and micro enterprise credit, and scenario-based installment.

Huanbei, an app operated by Shuhe Technology, is an installment service platform built around multiple consumption scenarios, officially launched in February 2016. Cooperating with licensed financial institutions, it provides personal consumer credit services to the public and loan funding support to small and micro business owners. As of June 2023, Huanbei had accumulated 130 million activated users and provided credit services to 17 million users, helping users "borrow and repay easily."

Pain points of model release

Resource waste during the model launch phase

  • To keep services stable, buffers are generally reserved, and resources are usually provisioned at specifications that exceed actual usage.
  • Provisioned resources are not fully used all the time. For some online applications, and especially offline job-type applications, overall CPU and memory usage is very low most of the time and rises only in certain periods, showing an obvious tidal pattern.
  • Because resource usage lacks elasticity, a large amount of resources is often wasted.

Difficult resource reclamation during the model offline phase

  • A model may have multiple versions online at the same time.
  • Different versions are used to evaluate the model's real online effect. If a version performs poorly, it is removed from the decision flow, at which point the model no longer actually serves requests. If its resources cannot be taken offline in time, they are wasted.

Model continuous delivery technology architecture

Model platform

  • The model platform generates model files and registers them with BetterCDS.
  • BetterCDS generates an artifact corresponding to the model file.
  • Model release and model version management are completed through artifacts.

Knative management module

  • Configure the model's global Knative scaling configuration.
  • Each model can also carry its own Knative scaling configuration.

CI pipeline

The CI pipeline's main job is to pull the model file and build the model image from a Dockerfile.

Deployment process

The pipeline updates the Knative Service, setting the route version and the image address, and updates Knative's autoscaling configuration; Knative then generates a Revision corresponding to the artifact version.
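
As a sketch, what the pipeline applies might look like the following; the service name, image, and values are hypothetical:

```yaml
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: risk-model                                  # hypothetical model service
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/metric: "concurrency"
        autoscaling.knative.dev/target: "10"        # per-Pod concurrency target for this model
        autoscaling.knative.dev/min-scale: "0"      # idle versions scale to zero
    spec:
      containers:
        - image: registry.example.com/risk-model:1.2.0   # image built by the CI pipeline from the artifact
```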

Knative Service update process

  • A Knative Service update generates a new Revision.
  • The Revision generates a Deployment object.
  • The pipeline watches the Deployment's status; if all Pods are ready, the version is considered successfully deployed.
  • After successful deployment, the Revision is tagged to create a version route.

Model multi-version release

Multi-version artifact release of models is implemented through Knative's Configuration:

1. Artifact versions correspond one-to-one to the Configuration's Revisions.

2. Each artifact contains the model version's image; updating the image of the Knative Service generates the Revision corresponding to that artifact.

Model multi-version service coexistence capability:

1. A version route is created for each Revision through a tag, and different version services are called via different routes.

2. Because multiple versions coexist, the decision flow can call different model versions to observe their effect, and since multiple version routes exist, multiple traffic policies are supported.

There is also a "latest" route, which can be switched to any version without changing the decision flow, thus completing the switch of online traffic.
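
A sketch of such a Route with tagged version routes plus a "latest" alias; the revision and tag names are hypothetical:

```yaml
spec:
  traffic:
    - revisionName: risk-model-00003
      tag: v3               # reachable at v3-risk-model.<namespace>.<domain>
      percent: 0
    - revisionName: risk-model-00004
      tag: v4
      percent: 0
    - revisionName: risk-model-00004
      tag: latest-route     # the "latest" alias the decision flow calls; repoint it to switch versions
      percent: 100
```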

Request-based elastic scaling

Why the models adopt a request-count-based scaling strategy:

1. Most models are compute-intensive services, and model CPU usage is positively correlated with the request count, so once requests surge, the CPU saturates and eventually causes a service avalanche.

2. Besides request-based scaling there is also HPA scaling based on CPU and memory metrics, but HPA scaling has the following problems:

a. The metric-based scaling chain is long: the service exposes metrics, Prometheus collects them, and once a metric rises, HPA computes the number of Pods to scale out and only then starts expanding Pods; the overall chain is quite long.

b. Metrics are not always accurate: Java GC, for example, causes periodic fluctuations in memory.

c. Metrics do not reflect the real service state: by the time a metric rises, response latency is already very high; detecting the threshold breach is delayed, and scaling out Pods and waiting for them to become ready is delayed further.

3. We therefore need to bring the scale-out moment forward, using request counts to trigger model scaling so that the model service stays in a normal serving state.

The following is the workflow of a Revision's traffic going from 100% to 0%:

Knative's scaling path is as follows:

1. Traffic is requested from the service to the Pods.

2. The PodAutoscaler collects the traffic metrics of each Pod's queue-proxy in real time and computes the number of Pods to scale out based on the current requests.

3. The PodAutoscaler drives the Deployment to scale out Pods, and the newly scaled-out Pods join the Service, completing the Pod scale-out.

Compared with metric-based scaling, the request-based strategy responds faster and with higher sensitivity, which keeps services responsive. As for the old pain point of taking resources offline: since Knative can scale to zero, we set the minimum Pod count of a Revision to 0, so that with no requests it automatically scales down to 0. Combined with elastic nodes, scaling to 0 no longer occupies any resources, which realizes serverless model services.

BetterCDS one-stop model release

Models are deployed through the pipeline. The following are the steps of a BetterCDS release pipeline:

  • Deploy version step: triggers the Knative model deployment.
  • Add route: adds the Knative version route.
  • Update the latest route: updating the latest route is a pipeline parameter, so you can choose to update it once deployment completes.

Through the Knative scaling configuration management module, each model can carry both a global scaling configuration and per-version scaling configurations.

Cluster elastic scaling architecture

Container applications run on our ACK cluster and virtual-node clusters, mixing ECS and elastic nodes. Routine business is carried by subscription (annual/monthly) ECS nodes, and elastic business is carried by pay-as-you-go elastic nodes, achieving high elasticity at low cost. Built on the elasticity of virtual nodes, this meets elasticity requirements while avoiding resource waste and reducing usage costs.

1. ACK maintains a fixed node resource pool, reserving Nodes for resident Pods and routine business volume, at lower subscription cost.

2. Virtual nodes run elastic instances to provide elasticity. Since there are no nodes to manage, capacity expands on demand, easily absorbing bursty traffic and peak-trough cycles.

3. For jobs and offline tasks without strict real-time requirements, their periodic nature means resources are not occupied continuously, so the cost advantage is large.

4. O&M-free: with elastic nodes, Pods are destroyed as soon as they finish.

Results and benefits

Knative's elasticity maximizes resource utilization. From the resource and request curves we can see clear peaks, troughs, and periodicity: when requests increase, the Pod count rises, and when requests decrease, the Pod count falls. The cluster's peak Pod count approaches 2,000, costs are down about 60% compared with before, and the resource savings are considerable.

Thanks to Knative's multi-version release capability, release efficiency has also improved, from hours in the past to minutes now. Shuhe's Knative model practice has received authoritative recognition as well: it was selected as an "Excellent Cloud Native Application Case" by the Cloud Native Industry Alliance of the China Academy of Information and Communications Technology.

Knative and AIGC

As a well-known project in the AIGC field, Stable Diffusion helps users quickly and accurately generate the scenes and images they want. However, using Stable Diffusion directly on K8s currently faces the following problems:

  • The throughput of a single Pod is limited; if multiple requests are forwarded to the same Pod, the server side overloads, so the number of requests a single Pod processes concurrently must be controlled precisely.
  • GPU resources are precious; the goal is to use resources on demand and release GPU resources promptly when business is low.

For these two problems, Knative gives us concurrency-based precise elasticity, with capacity scaling down to 0 when resources are idle.
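
A sketch of such a deployment; the image name and resource values are assumptions, while containerConcurrency: 1 is the field that enforces one in-flight request per Pod:

```yaml
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: stable-diffusion              # hypothetical name
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/min-scale: "0"   # release GPUs when idle
    spec:
      containerConcurrency: 1          # exactly one request per Pod at a time
      containers:
        - image: registry.example.com/stable-diffusion:webui   # hypothetical image
          resources:
            limits:
              nvidia.com/gpu: "1"      # one GPU per Pod
```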

In addition, Knative provides an out-of-the-box observability dashboard, where you can view the request count, request success rate, Pod scaling trends, request response latency, and more.

Authors: Li Peng (Alibaba Cloud) and Wei Wenzhe (Shuhe Technology). This article is compiled from a talk at KubeCon China 2023.


This article is original content from Alibaba Cloud and may not be reproduced without permission.
