[Reposted] 14 major version upgrades in 40 days: the large-scale container practice behind Tencent Meeting


https://my.oschina.net/jxcdwangtao/blog/3197014

 

During the COVID-19 outbreak around the Spring Festival, online office applications quickly became popular. As an enterprise-grade online office product, Tencent Meeting was quickly adopted by users. While the user base grew explosively, the product also iterated at high speed: it shipped 14 major versions in 40 days, a textbook case of small steps and fast iteration. From the container perspective, this article shows how Tencent Meeting, running on Tencent Cloud Container Service (TKE), uses cloud native technology to sustain rapid version iteration while its back-end capacity reached 1 million cores.

As a key enterprise-grade product, Tencent Meeting has very high requirements for availability and stability. Any service instability can prevent users from joining meetings, interrupt meetings in progress, or degrade audio and video quality, which leads to user complaints, hurts the product's reputation, and erodes user trust.

At the same time, explosive user growth requires back-end capacity to keep pace in real time, which places very high demands on how quickly capacity can be expanded. In addition, a running product inevitably goes through iterative upgrades such as feature optimizations and bug fixes. To keep upgrades transparent to users, the platform must support releasing and updating business programs efficiently and controllably.

Solutions

To meet these requirements, Tencent Meeting runs on Tencent Cloud Container Service (TKE). With TKE's dynamic routing, fixed network, elastic scaling, and controllable upgrade capabilities, it carried the deployment and operation of fast-growing services during the outbreak, including Tencent Meeting, online education, and air classrooms. The management platform architecture is as follows:

The management platform deploys services on TKE clusters and extends them with additional functions. In the diagram, green marks capabilities that are already implemented, red marks work in progress, and gray marks planned work. The TKE clusters are cloud native k8s clusters built on Tencent Cloud's basic capabilities: CVM, CBS, CLB, and VPC. Operators that extend TKE add optimizations along several dimensions: routing (CLB / L5 Controller), network (Ipamd), resource management (NodeResourceOversell, DynamicQuota), and elastic scaling (VPA, HPAPlus). Together these support the deployment and operation of Tencent's self-developed services such as Tencent Meeting.

The business deployment model is as follows:

Services are deployed in three environments, each with its own namespace: test, pre-release, and production. Within each namespace, services are deployed as multiple microservices. Each set of services has its own independent and complete set of resources, such as workloads, service / ingress, configmap, secret, and pv / pvc.

TKE stands for Tencent Kubernetes Engine. It aims to provide users with a stable, secure, efficient, flexible, and easy-to-use Kubernetes container management platform. Built on native Kubernetes, TKE provides a container-centric, highly scalable, high-performance container management service. It is fully compatible with the native Kubernetes API and adds Kubernetes plugins for Tencent Cloud products such as cloud disks and load balancing. It offers a complete set of functions, including efficient deployment, resource scheduling, service discovery, and dynamic scaling. This solves environment consistency problems across development, testing, and operations, makes large-scale container clusters easier to manage, and helps users reduce costs and improve efficiency.

Specific solution analysis

The following sections describe how TKE supports the evolution of Tencent Meeting in three areas: quality assurance, efficiency, and controllable upgrades.

1. Quality assurance

The stability of business services directly determines a product's reputation. To guarantee the service quality of Tencent Meeting and other services, Tencent Cloud has put safeguards in place along four dimensions: dynamic routing, anomaly detection, parallel expansion, and business migration.

1.1 Dynamic routing

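Dynamic routing is built on L5, Tencent's self-developed routing system. It is implemented as the L5-controller component, which extends the service capability inside the TKE cluster. The control-plane logic of L5-controller is similar to that of a generic controller. The basic flow is: watch for change events, trigger callbacks, and enqueue the changes; then start workers that update the routes.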

To guarantee routing consistency, however, the controller is extended with two processes, a main one and an auxiliary one, marked in the figure above with orange and green arrows respectively. The main process makes changes take effect, and the auxiliary process guarantees consistency. The main process is triggered in real time by watched events and runs multiple workers concurrently, so routing updates take effect within seconds. The auxiliary process serially scans the services in the cluster, pulls the routing configuration for each service, reconciles it against the data in the L5 routing system, and fills any gaps. Any routing setting that was not made through the central console is detected and overwritten to enforce consistency, which takes effect at the minute level. At the data level, L5-controller therefore makes changes take effect within seconds and enforces strong consistency within minutes.
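The two processes can be illustrated with a minimal Python sketch. It is not the L5-controller code: `RouteTable` is a hypothetical in-memory stand-in for the L5 routing system, `cluster_state` is a plain dict standing in for the services seen through the apiserver, and all names are assumptions made for illustration. The main path drains an event queue with concurrent workers, and the auxiliary path scans every service serially and overwrites anything that drifted.

```python
import queue
import threading


class RouteTable:
    """Hypothetical in-memory stand-in for the external L5 routing system."""

    def __init__(self):
        self._lock = threading.Lock()
        self._routes = {}

    def set(self, service, endpoints):
        with self._lock:
            self._routes[service] = sorted(endpoints)

    def get(self, service):
        with self._lock:
            return list(self._routes.get(service, []))


class RouteController:
    def __init__(self, cluster_state, route_table, workers=4):
        self.cluster_state = cluster_state  # service name -> desired endpoints
        self.route_table = route_table
        self.events = queue.Queue()
        self.workers = workers

    def on_change(self, service):
        """Event callback: enqueue the service whose endpoints changed."""
        self.events.put(service)

    def _sync(self, service):
        self.route_table.set(service, self.cluster_state.get(service, []))

    def _worker(self):
        while True:
            service = self.events.get()
            try:
                if service is None:  # sentinel: stop this worker
                    return
                self._sync(service)
            finally:
                self.events.task_done()

    def run_main_path(self):
        """Main path: drain all queued events with concurrent workers."""
        threads = [threading.Thread(target=self._worker) for _ in range(self.workers)]
        for t in threads:
            t.start()
        self.events.join()          # wait until every queued event is handled
        for _ in threads:
            self.events.put(None)   # one stop sentinel per worker
        for t in threads:
            t.join()

    def reconcile_once(self):
        """Auxiliary path: serial scan; returns the services that were corrected."""
        fixed = []
        for service, endpoints in sorted(self.cluster_state.items()):
            if self.route_table.get(service) != sorted(endpoints):
                self._sync(service)
                fixed.append(service)
        return fixed
```

In a real controller the main path would run continuously and `reconcile_once` would be called on a timer (the article says at the minute level); here both are one-shot calls so the behavior is easy to follow.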

1.2 Anomaly detection

User-defined pod health checks are supported through the following native k8s capabilities, so that traffic is kept away from a service while it is abnormal:

  • livenessProbe: liveness check probe
  • readinessProbe: readiness check probe
  • postStart: the action to run right after a container is created, before the pod is considered to be running normally
  • preStop: the action to run before the pod is destroyed
  • terminationGracePeriodSeconds: how long to wait for the pod to terminate gracefully

Combining these features with dynamic routing makes business services highly stable.
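As an illustration of how the five settings fit together, the sketch below builds a Pod manifest as a Python dict and prints it as JSON, which `kubectl apply -f` also accepts. The field names are the standard Kubernetes ones; the image, port, probe paths (`/healthz`, `/ready`), and hook scripts (`/app/warmup.sh`, `/app/drain.sh`) are hypothetical placeholders, not values from Tencent Meeting.

```python
import json


def build_pod_spec(name, image, port=8080, grace_seconds=30):
    """Return a Pod manifest (as a dict) that wires up the five settings above."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            # Time allowed between the termination signal and a forced kill.
            "terminationGracePeriodSeconds": grace_seconds,
            "containers": [{
                "name": name,
                "image": image,
                "ports": [{"containerPort": port}],
                # Restart the container when this check keeps failing.
                "livenessProbe": {
                    "httpGet": {"path": "/healthz", "port": port},
                    "initialDelaySeconds": 10,
                    "periodSeconds": 5,
                    "failureThreshold": 3,
                },
                # Remove the pod from service endpoints while this check fails.
                "readinessProbe": {
                    "httpGet": {"path": "/ready", "port": port},
                    "periodSeconds": 3,
                },
                "lifecycle": {
                    # Hypothetical warm-up script run right after container creation.
                    "postStart": {"exec": {"command": ["/bin/sh", "-c", "/app/warmup.sh"]}},
                    # Hypothetical drain script run before the container is stopped.
                    "preStop": {"exec": {"command": ["/bin/sh", "-c", "/app/drain.sh"]}},
                },
            }],
        },
    }


print(json.dumps(build_pod_spec("meeting-api", "example.com/meeting-api:1.0"), indent=2))
```

A failing readiness probe only takes the pod out of rotation, while a failing liveness probe restarts the container, which is why the two are usually configured separately.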

1.3 Parallel expansion

With a massive Internet user base such as Tencent Meeting's, how quickly expansion reacts is critical. Traffic can grow explosively within a short period, so back-end expansion must keep up. Tencent Cloud therefore extends the cloud native HPA in two major ways. The first is concurrency: the expansion detection process runs with high concurrency, actively detects high load on the business, and triggers expansion in real time according to the actual situation. The second is the calculation cycle: users can set the detection calculation cycle, down to the second level at minimum. When the detection process finds high load within a calculation cycle, it can trigger expansion in real time.
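A rough Python sketch of these two ideas follows. It is an assumption-laden illustration, not Tencent Cloud's implementation: workloads are checked concurrently with a thread pool, the check repeats on a user-chosen period, and the replica count uses the proportional rule of the upstream Kubernetes HPA (with its tolerance band omitted). The `read_load` callable and the workload names are hypothetical, and load is expressed in whole percent to avoid floating-point rounding surprises.

```python
import math
import time
from concurrent.futures import ThreadPoolExecutor


def desired_replicas(current, load_pct, target_pct):
    """Proportional rule: ceil(current * load / target), never below 1."""
    return max(1, math.ceil(current * load_pct / target_pct))


def detect_once(workloads, read_load, target_pct, max_workers=8):
    """Check every workload concurrently; return only those that must scale up."""
    def check(item):
        name, replicas = item
        return name, replicas, desired_replicas(replicas, read_load(name), target_pct)

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(check, list(workloads.items())))
    return {name: want for name, current, want in results if want > current}


def run_detector(workloads, read_load, target_pct, period_seconds=1.0, rounds=1):
    """Repeat detection every period_seconds and apply scale-ups in place."""
    for i in range(rounds):
        workloads.update(detect_once(workloads, read_load, target_pct))
        if i + 1 < rounds:
            time.sleep(period_seconds)
    return workloads
```

For example, with `{"signaling": 4, "media": 2}` replicas, loads of 90% and 30%, and a 60% target, only `signaling` is scaled, from 4 to 6 replicas.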

1.4 Business migration

Business migration is mainly achieved through the following three modes:

  • Helm deployment: the cloud native way, with one configuration distributed to multiple clusters. The advantage is that it is efficient and manageable. The disadvantage is that the configuration can easily drift from what is running in production, so the result is not guaranteed to be fully usable. This mode depends heavily on deployment specifications.
  • Namespace packaging and copying: package a business namespace together with all resources under it, then copy and deploy them to another cluster, restoring the business's execution environment and configuration as faithfully as possible. This is available as a one-click operation in the central console and is fast and efficient.
  • Namespace packaging and reuse: similar to the process above, with an enhancement that lets users adjust the configuration, which reduces their operating cost.
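The article does not describe how the console packages a namespace, so the following Python sketch only illustrates the general idea under stated assumptions: resources exported from the source cluster are copied, fields that the source cluster assigned (such as `uid`, `resourceVersion`, and `status`) are dropped, and optional user overrides are applied to model the "packaging and reuse" mode. The function name and the shape of `overrides` are hypothetical.

```python
import copy

# Metadata fields assigned by the source cluster; they must not be carried over.
CLUSTER_SPECIFIC_METADATA = (
    "uid", "resourceVersion", "creationTimestamp",
    "generation", "managedFields", "selfLink",
)


def package_namespace(resources, overrides=None):
    """Return copies of the resources that are ready to apply to another cluster.

    overrides maps (kind, name) -> dict of top-level spec keys to replace,
    which models user-defined adjustments in the "packaging and reuse" mode.
    """
    overrides = overrides or {}
    packaged = []
    for resource in resources:
        out = copy.deepcopy(resource)      # never mutate the exported originals
        out.pop("status", None)            # status is recomputed by the new cluster
        metadata = out.get("metadata", {})
        for field in CLUSTER_SPECIFIC_METADATA:
            metadata.pop(field, None)
        key = (out.get("kind"), metadata.get("name"))
        if key in overrides:
            out.setdefault("spec", {}).update(overrides[key])
        packaged.append(out)
    return packaged
```

Calling `package_namespace(resources)` corresponds to the plain copy mode, while passing something like `{("Deployment", "api"): {"replicas": 10}}` adjusts the copy before it is deployed.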

2. Efficiency

During the COVID-19 epidemic, Tencent Meeting was widely adopted and its user base exploded, which made service deployment efficiency a core capability. To help the platform keep up with user growth, Tencent Cloud invested in four areas: process automation, CI / CD, authentication management, and elastic scaling.

2.1 Automation

Automation mainly improves process efficiency. The processes covered include building new clusters, expanding cluster capacity, initializing environments, and distributing components. Cluster creation and node addition are automated through the Tencent Cloud API. Environment initialization, kernel parameter tuning, system environment settings, tool installation, file system formatting, component distribution, and similar steps are all packaged as scripts and executed by tools. The processes and tools are integrated into the central console through platform channels, for example batch script execution, batch installation, and registering a new cluster with the console after it is created.

2.2 CI / CD

Tencent's internal businesses deployed on TKE clusters have largely connected the full CI / CD pipeline, which supports chained deployment across multiple network environments. By combining the existing CI / CD model with cloud native service orchestration, Tencent Meeting moves efficiently through development, deployment, testing, launch, and version iteration.

2.3 Authentication management

CMDB is a relationship management model for products, business modules, and IPs, and it is widely used and reused across the company's business scenarios. The CMDB Controller is implemented much like a generic controller: it watches pod change events from the apiserver, triggers workers through event callbacks, has the workers query cluster information and the product information related to the pod, and finally writes the relationship into the CMDB service system.

Real-time CMDB synchronization is a useful building block for many customized extensions, such as the following authentication application process:

As the process diagram shows, an init-container quickly synchronizes the business's CMDB record and links with Tencent's internal Zhiyun ("weaving cloud"), L5, and other systems. This effectively opens up the authentication channel, which had long been a headache for business teams, and supports business service deployment.

2.4 Elastic expansion and contraction

To meet the needs of key business scenarios such as Tencent Meeting, Tencent Cloud supports the cloud native VPA capability and also extends the cloud native HPA. The HPA function module is extracted and implemented separately as an Operator so that businesses can customize its behavior. The specific extensions are:

  • Multi-channel indicators: indicators can be collected from multiple channels, such as Metrics server, Prometheus, and business monitoring, for better compatibility with business scenarios.
  • Parallel implementation: multiple workers check business indicators in parallel and trigger elastic scaling in real time to guarantee timeliness.
  • Customization: users can customize the scale-out and scale-in thresholds, the calculation cycle, and the elastic coefficient.

The functional framework of the multi-channel indicator is as follows:
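Separately from the framework diagram, a small Python sketch can make the multi-channel and customization ideas concrete. Everything in it is an assumption for illustration: each metric source is a plain callable returning a utilization percentage, the decision uses the highest reading, and the "elastic coefficient" is interpreted as a multiplier on the scale-out step. The article does not define the coefficient, so that interpretation is a guess.

```python
import math


def collect(sources):
    """Call every metric source and return {source_name: utilization_percent}."""
    return {name: read() for name, read in sources.items()}


def scale_decision(current, readings, up_pct, down_pct,
                   coefficient=1.0, min_replicas=1, max_replicas=100):
    """Return the replica count suggested by the highest reading.

    up_pct / down_pct are the user-defined scale-out and scale-in thresholds.
    coefficient (assumed meaning) enlarges or shrinks the scale-out step.
    """
    load = max(readings.values())
    if load > up_pct:
        step = math.ceil(current * load / up_pct) - current
        want = current + math.ceil(step * coefficient)
    elif load < down_pct:
        want = math.ceil(current * load / down_pct)
    else:
        want = current  # inside the band: leave the workload alone
    return max(min_replicas, min(max_replicas, want))


# Hypothetical sources standing in for Metrics server, Prometheus, and business monitoring.
sources = {
    "metrics_server": lambda: 85,
    "prometheus": lambda: 90,
    "business": lambda: 70,
}
print(scale_decision(10, collect(sources), up_pct=60, down_pct=30, coefficient=1.5))
```

Taking the maximum across sources is a conservative choice: any one channel reporting high load is enough to scale out, while scale-in needs every channel to be below the lower threshold.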

3. Controllable upgrade

For users, Tencent Meeting is a key tool when meeting face to face is difficult. To keep improving the user experience, the product goes through frequent iterations of feature optimizations and bug fixes while in operation. Under the cloud native model, making updates reliable and controllable is therefore particularly important. Building on cloud native capabilities, Tencent Cloud has extended and optimized four aspects: fixed network, lightweight upgrade, batch upgrade, and capacity guarantee.

3.1 Fixed network

Fixed network means that a pod keeps the same IP after it is destroyed and rebuilt, for example after a pod anomaly or a workload version update. It is a highly valued feature for Tencent's self-developed businesses, because many back-end services depend on IP-based authentication, whitelists, and similar mechanisms. Many other companies likely have similar scenarios. The fixed network design is built on three functional modules: IPAMD Controller, TKE-ENI-AGENT, and CNI:

  • IPAMD Controller runs as an operator and is responsible for IP allocation management, network resource status updates, and updates to the association records between nodes and pods.
  • TKE-ENI-AGENT runs on each node as a daemonset and is responsible for routing reconciliation, generating the CNI configuration, and setting up node policy routing and the pod network stack.
  • CNI also runs on each node and is responsible for implementing node policy routing and the pod network stack.

The fixed network workflow, as shown above, is:

  1. The user initiates the creation of a statefulset.
  2. IPAMD Controller observes the creation request through the list-watch mechanism.
  3. IPAMD Controller uses the IP allocator to allocate IPs for all pods under the new statefulset.
  4. IPAMD Controller generates the corresponding network configuration, StaticIPConfig, for each pod, and then manages the association between each pod and its assigned IP, along with status updates, for the rest of the statefulset's life cycle.
  5. The IP allocator is the core of IPAMD Controller. Based on the pod information and the statefulset status, it determines whether the pod is a new request, a destroy-and-rebuild that needs its IP reserved, or a deletion whose IP should be reclaimed, and generates the corresponding configuration. When a statefulset is newly created, the allocator assigns new IPs to its pods. When a pod is abnormally destroyed and rebuilt, the allocator temporarily reserves the original IP and reuses it once the rebuild completes. When the statefulset is deleted, all allocated IPs are reclaimed.
  6. When the pod is created on a node, TKE-ENI-AGENT obtains the StaticIPConfig that IPAMD has generated, and then asks the IP allocator over gRPC to put the pod network configuration into effect.
  7. After receiving the request from TKE-ENI-AGENT, the IP allocator fetches the StaticIPConfig already allocated to the pod, calls the Tencent Cloud API to create an ENI (elastic network interface) according to that configuration, and finally generates the CNI configuration, CNIInfo.
  8. CNI sets up node policy routing and the pod network stack based on the CNIInfo generated by the IP allocator, and the pod network takes effect.
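The allocator behavior described in step 5 can be sketched as a small state machine. This Python sketch is illustrative only: the class name, method names, and in-memory tables are assumptions, and it leaves out everything IPAMD does with Tencent Cloud ENIs, gRPC, and CNI. It only shows the three cases: allocate on creation, reserve on destroy-and-rebuild, and reclaim on statefulset deletion.

```python
import ipaddress


class StaticIPAllocator:
    """Toy allocator that keeps a pod's IP stable across destroy-and-rebuild."""

    def __init__(self, cidr):
        self._free = [str(ip) for ip in ipaddress.ip_network(cidr).hosts()]
        self._bound = {}     # (statefulset, pod) -> IP of a running pod
        self._reserved = {}  # (statefulset, pod) -> IP held while the pod is rebuilt

    @property
    def free_count(self):
        return len(self._free)

    def allocate(self, statefulset, pod):
        """New pod: take a free IP. Rebuilt pod: hand back its reserved IP."""
        key = (statefulset, pod)
        if key in self._bound:
            return self._bound[key]
        ip = self._reserved.pop(key, None)
        if ip is None:
            if not self._free:
                raise RuntimeError("address pool exhausted")
            ip = self._free.pop(0)
        self._bound[key] = ip
        return ip

    def on_pod_destroyed(self, statefulset, pod):
        """Pod destroyed but statefulset still exists: reserve the IP for it."""
        key = (statefulset, pod)
        if key in self._bound:
            self._reserved[key] = self._bound.pop(key)

    def on_statefulset_deleted(self, statefulset):
        """Statefulset deleted: reclaim every IP it held, bound or reserved."""
        for table in (self._bound, self._reserved):
            for key in [k for k in table if k[0] == statefulset]:
                self._free.append(table.pop(key))
```

Because a reserved IP is keyed by statefulset and pod name, another workload cannot pick it up while the original pod is being rebuilt, which is what keeps IP-based whitelists valid across restarts.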

3.2 StatefulsetPlus Operator

Native Kubernetes workload types such as Deployment and StatefulSet cannot meet the needs of Tencent Meeting and similar services, such as lightweight upgrades, batch upgrades, and capacity guarantees. Tencent Cloud therefore developed an Operator based on StatefulSet to support a new workload type, StatefulsetPlus. It inherits all the core features of the StatefulSet built into Kubernetes, and its main logic is similar to StatefulSet's, but it adds finer-grained capabilities: fixed IPs for container (Pod) instances, multi-batch grayscale updates of applications, better compatibility with the release process of traditional applications, automatic Pod drift when a Node becomes disconnected, and in-place container upgrades. The application architecture and functional modules of StatefulsetPlus Operator are as follows:

StatefulsetPlus is deployed as a CRD. Some of its yaml parameters differ from Statefulset's, but it is used in basically the same way as Statefulset.

3.3 Upgrade in batches

Many key services, such as Tencent Meeting, require that updates and upgrades advance in a strictly controlled way. Batch upgrade was built for this scenario, and it is also more compatible with the release process of traditional applications. In practice, a batch upgrade proceeds as follows:

To make batch upgrades sufficiently controllable, the batches must be specified in advance when the upgrade is initiated, and each subsequent step is triggered manually. If a batch fails, you can continue after fixing the problem or go straight to the rollback process. Compared with other workloads, the advantage of StatefulsetPlus's batch upgrade is that users can configure the set of instances in each batch. For example, users can upgrade only the instances in a given service region first. The instances specified for each batch are updated concurrently, so the upgrade is efficient.

The upgrade is also safer. It supports waiting for user confirmation after each batch completes before triggering the next batch, and it can alternatively trigger the next batch automatically once the application probes detect that the current batch upgraded successfully. In addition, StatefulsetPlus supports manual scaling, elastic scaling based on basic indicators (CPU, memory, network I / O), and elastic scaling based on application-defined monitoring indicators.
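The batch flow described above (pre-declared batches, manual trigger per batch, concurrent updates within a batch, retry or rollback on failure) can be sketched in Python as follows. This is a simplified model, not the StatefulsetPlus Operator: the class, its methods, and the `upgrade_pod` callback are hypothetical, and a real operator would drive pods through the Kubernetes API rather than a dict of versions.

```python
from concurrent.futures import ThreadPoolExecutor


class BatchUpgrade:
    def __init__(self, versions, batches, new_version, upgrade_pod):
        self.versions = versions          # pod name -> currently running version
        self.batches = batches            # batches declared up front, as lists of pod names
        self.new_version = new_version
        self.upgrade_pod = upgrade_pod    # callable(pod, version) -> True on success
        self.previous = dict(versions)    # snapshot used by rollback
        self.next_batch = 0

    @property
    def done(self):
        return self.next_batch >= len(self.batches)

    def run_next_batch(self):
        """Manually triggered step: upgrade one batch concurrently.

        Returns the pods that failed. The upgrade only advances to the next
        batch when nothing failed, so a failed batch can be retried after a fix.
        """
        if self.done:
            raise RuntimeError("all batches are already upgraded")
        batch = self.batches[self.next_batch]
        pending = [p for p in batch if self.versions[p] != self.new_version]
        with ThreadPoolExecutor(max_workers=max(1, len(pending))) as pool:
            results = list(pool.map(lambda p: self.upgrade_pod(p, self.new_version), pending))
        failed = []
        for pod, ok in zip(pending, results):
            if ok:
                self.versions[pod] = self.new_version
            else:
                failed.append(pod)
        if not failed:
            self.next_batch += 1
        return failed

    def rollback(self):
        """Return every already-upgraded pod to the version it ran before."""
        for pod, version in self.previous.items():
            if self.versions[pod] != version:
                self.upgrade_pod(pod, version)
                self.versions[pod] = version
        self.next_batch = 0
```

Grouping pods by region, for example `[["gz-0", "gz-1"], ["sh-0"]]`, reproduces the region-by-region rollout mentioned above. An automatic mode would simply call `run_next_batch` again whenever it returns an empty list and the probes report healthy.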

Summary

The trend toward moving fully to the cloud is now clear. Tencent Meeting rode the cloud computing wave, using cloud native technology to iterate versions rapidly and upgrade features smoothly. In the cloud transformation of all industries, container technology plays a vital role. Tencent has been researching container-related technologies and services for a long time, and many of its successful businesses, such as games, WeChat, and advertising, have chosen to run on containers. It is fair to say that container technology supports billions of users.

As the cloud native technology ecosystem thrives, I, as a follower of the technology, also hope that TKE can support more businesses as they move forward.


Origin www.cnblogs.com/jinanxiaolaohu/p/12598770.html