Alibaba Cloud ACK Upgraded: Building a Modern Application Platform for the Era of Intelligent Computing

Author: Yi Li

Today, demand for containers and Kubernetes remains strong across every conceivable and unexpected field, making the technology truly ubiquitous.

At the 2023 Yunqi Conference, Yi Li, head of container services for Alibaba Cloud's cloud-native product line, impressed the audience by describing how Container Service for Kubernetes (ACK) was used at this year's Asian Games: "Take the Hangzhou Asian Games as an example. As the cloud-native technical base, ACK provided highly elastic, highly available, multi-region architectural support for core applications such as the Asian Games One-Stop app and Asian Games DingTalk, ensuring the event systems were foolproof."

Alibaba Cloud Container Service for Kubernetes (ACK) has grown into a cloud-native application operating system for enterprises, helping more and more customers achieve intelligent and digital innovation in emerging fields such as autonomous driving, intelligent scientific research, and fintech. It covers scenarios from public cloud and edge cloud to on-premises data centers, giving every place that needs cloud capabilities a unified container infrastructure.

Over the past year, Alibaba Cloud's container products have continued to earn broad industry recognition. In September 2023, in the Magic Quadrant for Container Management released by the consulting firm Gartner, Alibaba Cloud was named a global leader, the only vendor based in Asia, thanks to its complete product portfolio across public, private, and hybrid cloud environments. In Forrester's Public Cloud Development and Infrastructure Platform evaluation for Q4 2022, Alibaba Cloud was rated the best choice for cloud-native developers in China.

The era of intelligent computing has arrived. Yi Li explained that, to help enterprises build modern application platforms, Alibaba Cloud's container services deliver upgraded product capabilities in five core directions: efficient cloud-native computing power, high-performance intelligent-computing applications, intelligent operation and maintenance, trusted infrastructure, and distributed cloud architecture.

A new generation of cloud-native computing power improves enterprise computing performance

Larger scale: new breakthrough in elastic computing power pool

Alibaba Cloud provides a wealth of elastic computing power: multiple CPU architectures such as Intel, AMD, and Yitian (Arm); heterogeneous accelerators such as GPU and RDMA; and diverse purchasing options such as pay-as-you-go, Spot, and savings plans. With ACK, customers can make maximum use of Alibaba Cloud's overall elastic computing pool, choosing flexibly according to their own needs to improve efficiency and reduce costs.

ACK clusters support two data-plane forms: managed node pools and virtual nodes.

  • Managed node pools support any ECS bare-metal or virtual-machine instance as a K8s worker node. One worker node can run multiple Pods; the model is fully compatible with K8s semantics and combines flexibility with ease of use.
  • With virtual nodes, each Pod runs in an independent Elastic Container Instance (ECI). Each ECI instance is an independent security sandbox with high elasticity, strong isolation, and no node maintenance. Alibaba Cloud elastic computing is built on CIPU and can uniformly provision ECS bare-metal instances, virtual-machine instances, and elastic container instances, so ECI inherits elastic computing's abundant instance types and sufficient inventory.
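To make the two data-plane forms concrete, here is a minimal sketch of a Pod manifest steered to a virtual node. The node label and toleration values are illustrative assumptions, not verified ACK specifics; consult the ACK documentation for the exact selectors.

```python
# Sketch: steer a Pod to a virtual node (ECI) with a nodeSelector and toleration.
# The label and taint values below are assumptions for illustration only.
def pod_on_virtual_node(name: str, image: str) -> dict:
    """Build a minimal Pod manifest targeting a virtual-kubelet node."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "nodeSelector": {"type": "virtual-kubelet"},   # assumed node label
            "tolerations": [{
                "key": "virtual-kubelet.io/provider",      # assumed node taint
                "operator": "Exists",
            }],
            "containers": [{"name": "app", "image": image}],
        },
    }

manifest = pod_on_virtual_node("demo", "nginx:1.25")
```

A Pod without the selector would land on the managed node pool instead, with no change to the rest of the application definition.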

This year, through mutual awareness between ACK and elastic-computing scheduling, ACK clusters can schedule ECI instances more effectively, extending K8s cluster resource scheduling across the entire elastic computing pool and ensuring unified scheduling and consistent capabilities between ECS node pools and virtual nodes. Users can maximize their use of cloud resources without modifying existing K8s application definitions.

More and more customers are building large-scale microservice applications and data-computing workloads on ACK clusters. To keep pace with cluster growth, the maximum number of nodes supported by a single ACK cluster has been raised from 10,000 to 15,000, and the maximum number of ECI instances from 20,000 to 50,000. Control-plane components scale on demand with the size of the data plane to ensure stability.

Better cost performance: dedicated optimization for the Yitian architecture

More and more ACK customers are choosing Yitian chips as their new source of computing power, for three main reasons:

  • High cost performance: compared with the G7 instance family, cost performance improves by 50% for web applications, 80% for video encoding/decoding, and 28% for Spark tasks.
  • High throughput: the Armv9 architecture with independent physical cores delivers more deterministic performance; compared with the G7 instance family, web-application throughput increases by 22% and Spark TPC-DS benchmark speed by 15%.
  • Dedicated optimization: the container registry service ACR, together with the basic-software team and the OpenAnolis (Dragon Lizard) community, provides base and application software images specially optimized for Yitian chips, and KeenTune provides Yitian-specific parameter tuning based on AI and an expert knowledge base. In mainstream scenarios, performance after optimization improves by 30%.

To support smooth migration of container applications to the Yitian architecture, ACR provides multi-architecture image builds, producing application images containing both x86 and Arm variants from a single source tree. An ACK cluster can also contain Arm and x86 node pools or virtual nodes at the same time, so customer K8s applications can be scheduled across CPU architectures on demand and switched over gradually.
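The gradual switchover can be sketched with the standard Kubernetes node label `kubernetes.io/arch`, which the kubelet sets automatically on every node. A minimal sketch, with placeholder image names:

```python
# Sketch: pin replicas of the same app to x86 or Arm (Yitian) node pools using
# the well-known node label "kubernetes.io/arch" (values: amd64 / arm64).
def deployment_for_arch(name: str, image: str, arch: str, replicas: int = 1) -> dict:
    """Build a Deployment whose Pods only schedule onto one CPU architecture."""
    assert arch in ("amd64", "arm64")
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": f"{name}-{arch}"},
        "spec": {
            "replicas": replicas,
            "selector": {"matchLabels": {"app": name, "arch": arch}},
            "template": {
                "metadata": {"labels": {"app": name, "arch": arch}},
                "spec": {
                    "nodeSelector": {"kubernetes.io/arch": arch},
                    "containers": [{"name": name, "image": image}],
                },
            },
        },
    }

# Gradual switch: shift most replicas to arm64 while keeping an amd64 share.
arm = deployment_for_arch("web", "registry.example.com/web:multiarch", "arm64", 8)
x86 = deployment_for_arch("web", "registry.example.com/web:multiarch", "amd64", 2)
```

With a multi-architecture image, the same tag resolves to the right variant on each node, so only the replica split changes during migration.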

Higher elasticity: newly released node pool instant elasticity

Maximizing cloud elasticity is a key customer requirement for container products. Yi Li also brought a new ACK release: "On Alibaba Cloud, Container Service scales millions of cores of computing power up and down on demand every day, helping customers optimize computing costs. Today, we officially release the instant elasticity capability of ACK node pools."

ACK Node Pool Instant Elastic Scaler has the following features:

  1. Faster scaling: at a scale of 100 nodes per pool, end-to-end scale-out averages 45 seconds, 60% faster than the community Cluster Autoscaler.
  2. User-defined flexible specification matching: with the community Cluster Autoscaler, the CPU/memory specification of the nodes in each node pool is fixed, so meeting different needs requires creating multiple node pools. This complicates configuration management, introduces the possibility of resource fragmentation, and increases the risk of failed scale-outs due to insufficient inventory. The Instant Elastic Scaler supports user-defined specification-matching strategies: among the node specifications that match, the system generates an optimized bin-packing result based on the resource requests and scheduling constraints of the set of pending Pods, together with ECS inventory awareness. A single node pool can thus provide node elasticity across multiple specifications and availability zones, reducing configuration complexity while also reducing fragmentation and improving the scale-out success rate.
  3. Full compatibility: instant elasticity is fully compatible with existing node-pool capabilities and usage habits, and can be combined with managed node pools to automate node operation and maintenance.
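The specification-matching idea in point 2 is essentially a bin-packing problem. The following is a simplified sketch of just the packing step, using first-fit-decreasing; the instance shapes and prices are invented, and a real scaler would also weigh inventory and scheduling constraints:

```python
# Sketch: given pending Pod requests and a set of allowed instance shapes,
# compute a packing that provisions cheap nodes. Shapes/prices are made up.
from typing import Dict, List, Tuple

SHAPES = [  # (name, cpu_cores, mem_gib, hourly_price) - illustrative only
    ("large", 8, 32, 0.40),
    ("medium", 4, 16, 0.21),
    ("small", 2, 8, 0.11),
]

def pack(pods: List[Tuple[float, float]]) -> List[Dict]:
    """Pods are (cpu, mem) requests; returns provisioned nodes with placements."""
    nodes: List[Dict] = []
    for cpu, mem in sorted(pods, reverse=True):        # first-fit-decreasing
        for node in nodes:                             # reuse an open node if it fits
            if node["cpu_free"] >= cpu and node["mem_free"] >= mem:
                node["cpu_free"] -= cpu
                node["mem_free"] -= mem
                node["pods"] += 1
                break
        else:  # nothing fits: open the cheapest shape that can hold this Pod
            shape = min((s for s in SHAPES if s[1] >= cpu and s[2] >= mem),
                        key=lambda s: s[3])
            nodes.append({"shape": shape[0], "cpu_free": shape[1] - cpu,
                          "mem_free": shape[2] - mem, "pods": 1})
    return nodes

# Four pending Pods pack into two "medium" nodes instead of two "large" ones.
nodes = pack([(3, 12), (3, 12), (1, 2), (1, 2)])
```

Because the scaler chooses the shape per scale-out rather than per node pool, one pool can serve workloads with very different resource profiles.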

Simplified operation and maintenance: ContainerOS combined with fully managed node pool

For K8s clusters, node operation and maintenance is an important daily task for system stability and security, but doing it manually is complicated and tedious.

ACK managed node pools support automated node operation and maintenance across the full lifecycle, including automatic patching of high-risk CVE vulnerabilities, self-healing of node failures, and automatic upgrades of the OS and node components. The node self-healing success rate is 98%, and cluster node maintenance time is reduced by 90%.

ContainerOS is a container-optimized operating system released by the OpenAnolis (Dragon Lizard) community. Built on the idea of immutable infrastructure, it is simple, secure, and programmable. Scaling out 1,000 nodes takes 55 seconds at P90, 50% less than with nodes running a general-purpose OS such as CentOS.

ContainerOS pairs well with fully managed node pools to further improve node-pool elasticity and operability, letting enterprises focus on their own business rather than on K8s infrastructure maintenance.

Richer scenarios: Serverless containers increase efficiency and reduce costs for AI scenarios

Serverless container support is an important direction in the evolution of K8s, and ECI-based ACK Serverless is widely used in customer scenarios. ACK and ECI not only power the elastic scaling of online applications such as Weibo trending searches and DingTalk meetings, but also help many AI and big-data customers cut costs and improve efficiency.

  • Based on ACK and ECI, Shenzhen Technology built a multi-region AI research platform that is maintenance-free, creates experimental environments on demand, pulls large AI images in seconds, and improves resource utilization by 30%.
  • Based on ACK and ECI, miHoYo unified its big-data platform architecture across all regions worldwide, creating more than 2 million ECI instances in a single day to run Spark computing tasks. By making efficient use of ECI Spot instances, it cut overall resource costs by 50%.

There are four important releases of ECI elastic container instances this year:

  • Inclusive cost reduction: new "economical" specifications are 40% cheaper than the current general-purpose instances, designed for cost-sensitive workloads such as web applications, computing tasks, and development/testing. The price of existing general-purpose instances will also drop soon, by up to 15%.
  • Ultimate performance: "performance-enhanced" specifications are planned, offering higher-performance computing power and more deterministic performance than existing general-purpose instances for compute-intensive scenarios such as scientific research, high-performance computing, and gaming.
  • Elastic acceleration: by learning and predicting user load patterns, ECI pre-schedules underlying resources, raising scale-out speed to 7,000 Pods/min, well suited to large-scale data processing. ECI is also the first in the industry to support GPU driver version selection, giving AI applications more flexibility while improving cold-start speed by 60%.
  • Flexible efficiency: ECI released support for the Yitian Arm and AMD architectures this year, and ACK recently added Windows container support for richer enterprise scenarios. ECI also now supports fine-grained memory specifications, helping users match resources precisely and eliminate idle-resource overhead.

Cloud native intelligent computing infrastructure to build an efficient modern application platform

Fully supports Lingjun cluster to improve the efficiency of large model training

Over the past year, AIGC and large language models have undoubtedly been the most important development in AI. As model parameters, training data, and context length grow, the computation consumed in training large models increases exponentially.

ACK fully supports Alibaba Cloud Lingjun intelligent computing clusters, providing high-performance and efficient Kubernetes clusters for large-scale distributed AI applications. ACK provides comprehensive support for Lingjun's high-performance computing power, as well as batch AI task scheduling, data set acceleration, GPU observability and self-healing capabilities.

Through software and hardware co-design and cloud-native architecture optimization, ACK helps PAI Lingjun intelligent computing solution efficiently utilize powerful computing power to improve efficiency in many intelligent computing business scenarios such as AIGC, autonomous driving, finance, and scientific research.

ACK cloud-native AI suite has been enhanced to build an enterprise-specific AI engineering platform

ACK launched the cloud-native AI suite last year to help users make full use of elastic computing power on Alibaba Cloud via Kubernetes, supporting scenarios such as elastic training and inference. On top of it run not only Alibaba Cloud AI platforms and services such as PAI, Lingjun Intelligent Computing, and Tongyi Qianwen, but also containerized open source AI frameworks and models.

This year, for large-model scenarios, the AI suite added containerization support and optimization for the open source frameworks DeepSpeed, Megatron-LM, and TGI. Through the suite's scheduling optimization and data-access acceleration, AI training speed improves by 20%, large-model inference cold starts by 80%, and data-access efficiency by 30%.

ACK AI suite has been widely used by many domestic and overseas companies to help customers build their own exclusive AI platforms, significantly improving GPU resource efficiency and AI engineering efficiency.

  • The domestic AI painting tool Haiyi AI doubled inference performance using Fluid dataset acceleration and the AIACC model-optimization solution.
  • Soul "Anydoor": based on ACK, built an AI PaaS platform at nearly thousand-GPU-card scale, improving development iteration efficiency 2-5x.
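Fluid, used in the Haiyi AI case above, accelerates dataset access by caching remote data close to compute. A minimal sketch of its two core objects, expressed as Python dicts; field names follow the open source Fluid CRDs (`data.fluid.io/v1alpha1`), and the OSS path is a placeholder:

```python
# Sketch: a Fluid Dataset plus a matching AlluxioRuntime cache, as the open
# source Fluid project defines them. Bucket path and sizes are illustrative.
def fluid_dataset(name: str, mount_point: str) -> list:
    """Return a Dataset and an AlluxioRuntime that caches it in memory."""
    dataset = {
        "apiVersion": "data.fluid.io/v1alpha1",
        "kind": "Dataset",
        "metadata": {"name": name},
        "spec": {"mounts": [{"mountPoint": mount_point, "name": name}]},
    }
    runtime = {
        "apiVersion": "data.fluid.io/v1alpha1",
        "kind": "AlluxioRuntime",
        "metadata": {"name": name},           # must match the Dataset name
        "spec": {
            "replicas": 2,                    # cache workers
            "tieredstore": {"levels": [{
                "mediumtype": "MEM",          # cache hot data in memory
                "path": "/dev/shm",
                "quota": "2Gi",
            }]},
        },
    }
    return [dataset, runtime]

objs = fluid_dataset("training-data", "oss://example-bucket/datasets/")
```

Training Pods then mount the dataset through a PVC Fluid generates, reading from the cache instead of the remote store.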

ACK cluster scheduler, optimized for AI/big data load expansion

The ACK cluster scheduler is based on Koordinator, an open source Kubernetes scheduler born from Alibaba's large-scale colocation practice that can uniformly and efficiently support diverse workloads such as microservices, big data, and AI. For AI and big-data workloads, we have made the following optimizations and extensions:

  1. Batch-task scheduling primitives such as gang scheduling, elastic quota, and priority scheduling, built on full compatibility with Kubernetes' existing scheduling capabilities, and seamlessly integrated with community projects such as Kubeflow and KubeDL.
  2. Topology-aware performance optimization: based on the topology of interconnects such as PCIe, NVSwitch, and RDMA NICs, the scheduler automatically selects the GPU combination that provides the highest communication bandwidth, effectively improving model-training efficiency.
  3. Fine-grained GPU sharing, effectively improving GPU utilization in model-inference scenarios.
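Gang scheduling, the first primitive above, admits a batch job's Pods all-or-nothing, so two half-scheduled jobs never deadlock holding each other's resources. A toy model of just the admission rule:

```python
# Toy model of gang scheduling's all-or-nothing admission. Real schedulers
# evaluate per-node placements and queues; here a job is a list of per-Pod
# GPU requests and the cluster is a single free-GPU counter.
from typing import List

def gang_admit(group_requests: List[int], free_gpus: int) -> bool:
    """Admit a Pod group only if every member can be placed at once."""
    return sum(group_requests) <= free_gpus

free = 8
job_a = [2, 2, 2, 2]   # 4 workers, 8 GPUs total
job_b = [4, 4]         # 2 workers, 8 GPUs total

admitted = []
for name, job in (("A", job_a), ("B", job_b)):
    if gang_admit(job, free):       # all-or-nothing: no partial placement
        free -= sum(job)
        admitted.append(name)
# Job A takes all 8 GPUs; job B waits instead of grabbing 4 and stalling A.
```

Without the gang rule, each job could claim half the GPUs and both would stall forever waiting for the rest.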

Recently, we collaborated with Xiaohongshu in the community to release colocation of Hadoop YARN tasks with Kubernetes workloads, further improving Kubernetes cluster resource efficiency. This work raised resource efficiency in Xiaohongshu's ACK clusters by 10%.

We are also donating Koordinator to the CNCF to ensure the project's long-term healthy development, and we welcome everyone to build with us in the community.

Intelligent autonomous system reduces container operation and maintenance management costs

ACK AIOps intelligent product assistant accelerates K8s problem location and resolution

The technical complexity of Kubernetes itself is a significant factor hindering adoption among enterprise customers. Once a K8s cluster fails, troubleshooting problems with applications, clusters, OS, and cloud resources is full of challenges even for experienced engineers.

ACK's newly upgraded container AIOps suite combines large models with expert systems, allowing administrators to interact with the system using natural language through intelligent product assistants, accelerating Kubernetes problem location and resolution.

When an issue occurs, the AIOps suite collects the context-relevant definitions, status, and topology of the Kubernetes objects and cloud resources involved, for example a Deployment, its Pods, and the associated nodes, along with related observability data such as logs, monitoring, and alerts. It then analyzes this data with the large model and proposes likely causes and repair plans for the current problem. The model behind ACK is tuned on a cloud-native development and operations knowledge base, improving the accuracy of problem analysis.

Users can go further with the expert-experience system in intelligent diagnosis to locate root causes. The AIOps suite currently includes more than 200 diagnostic items covering Pod, node, network, and other scenarios, and can dig into problems such as network jitter, kernel deadlocks, and resource contention.

In addition to user-driven problem diagnosis, the AIOps suite is also strengthening automated inspections and automated real-time processing of abnormal events to provide more comprehensive protection for cluster stability and security and prevent problems before they occur.

ACK FinOps suite is fully upgraded, with detailed scenario-based analysis and allocation strategies

ACK released the FinOps cost management suite last year, which enables enterprise administrators to make costs "visible, controllable, and optimizable" for K8s clusters. Over the past year, the FinOps suite has supported hundreds of customers across a variety of industries, including:

  • Qianxiang Investment used the FinOps suite to optimize application configurations, raising cluster resource utilization by 20% and saving more than 100,000 yuan per month.
  • Zeekr achieved elastic cost reduction in its hybrid cloud through the FinOps suite, saving millions in IT costs a year.

This year, the FinOps suite has been fully upgraded with more scenario-based analysis and allocation strategies. In AI scenarios, for example, costs can be visualized by dimensions such as GPU cards and GPU memory. The suite also adds a one-click resource-waste check that quickly finds unused resources in the cluster, such as unattached cloud disks and idle SLB instances, further improving overall resource utilization.
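The idea behind the one-click waste check can be sketched as a scan over a resource inventory for items attached to nothing. The inventory format below is invented for illustration; a real implementation would query the cloud provider's APIs:

```python
# Sketch of a resource-waste check: flag resources with no attachment.
# The inventory schema here is hypothetical, for illustration only.
def find_idle(resources: list) -> list:
    """Return IDs of resources attached to nothing (candidate waste)."""
    return [r["id"] for r in resources if not r.get("attached_to")]

inventory = [
    {"id": "disk-1", "type": "cloud_disk", "attached_to": "node-a"},
    {"id": "disk-2", "type": "cloud_disk", "attached_to": None},   # unattached
    {"id": "slb-1",  "type": "slb",        "attached_to": ""},     # no backends
]
idle = find_idle(inventory)
```

The flagged items become candidates for release or review rather than being deleted automatically, which is the usual FinOps posture.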

End-to-end container security protects the construction of trustworthy AI applications

Trusted application delivery enhancement, ACK and ACR provide DevSecOps software supply chain

Software supply chain security is the biggest concern for enterprises implementing cloud native technologies. Gartner predicts that by 2025, 45% of organizations around the world will have suffered software supply chain attacks.

Alibaba Cloud's ACK and ACR services provide DevSecOps best practices, automating risk identification, blocking, and prevention from image build and distribution through to runtime, helping enterprises build a secure and trustworthy software supply chain.

The practice of DevSecOps relies on close collaboration between development, operations, and security teams. This year, we launched a cluster container-security overview that helps enterprise security administrators better understand risks in cluster configuration, application images, and the container runtime, making the supply-chain process more transparent and efficient.

With these DevSecOps supply-chain security capabilities, the carmaker Lotus runs thousands of security-configuration inspections every month to keep high-risk configurations from going live, and China Merchants United Financial Group uses supply-chain policy governance to intercept thousands of risky images in its daily CI/CD process, safeguarding its financial business.

The best of both worlds: a new service mesh that combines Sidecarless and Sidecar models

Service mesh has become the network infrastructure of cloud-native applications. Alibaba Cloud Service Mesh (ASM) has been upgraded, becoming the industry's first product to offer managed Istio Ambient Mesh, with integrated support for both Sidecarless and Sidecar modes.

The classic service mesh architecture uses the Sidecar mode, injecting an Envoy proxy sidecar into every Pod to intercept and forward traffic. This is extremely flexible, but it introduces extra resource overhead and adds operational complexity and connection-setup latency. In Sidecarless mode, the L4 proxy's capabilities move into the CNI component on the node, and an optional L7 proxy runs independently of the application. Applications gain service mesh features such as transparent encryption, traffic control, and observability without redeployment.

In typical customer scenarios, the Sidecarless service mesh mode reduces resource overhead by 60%, operation and maintenance cost by 50%, and latency by 40%.

Managed Istio Ambient Mesh effectively reduces the complexity of service mesh technology and advances the adoption of zero-trust networking.

Newly launched privacy-enhanced computing power to protect the construction of trusted AI applications

To address growing enterprise concern about data privacy, Alibaba Cloud, the DAMO Academy Operating System Lab, Intel, and the OpenAnolis community jointly launched a reference architecture for Confidential Containers (CoCo) based on a cloud Trusted Execution Environment (TEE). Combined with a trusted software supply chain and trusted data storage, it delivers an end-to-end secure and trusted container runtime environment, helping enterprises resist attacks from external applications, the cloud platform, and even insiders.

Based on the Trust Domain Extensions (TDX) technology of Alibaba Cloud's eighth-generation Intel instances, ACK now supports confidential containers and confidential virtual-machine node pools. With TDX, business applications can be deployed into a TEE without modification, greatly lowering the technical barrier and providing privacy-enhanced computing power for data-sensitive applications in finance, healthcare, and large models.
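In upstream Kubernetes, opting a workload into an alternative runtime such as a confidential container is typically done through a RuntimeClass. The RuntimeClass API below is standard (`node.k8s.io/v1`), but the handler name and the exact mechanism on ACK are assumptions for illustration:

```python
# Sketch: select a confidential-container runtime via a standard RuntimeClass.
# "kata-cc" is a placeholder handler name, not a verified ACK value.
runtime_class = {
    "apiVersion": "node.k8s.io/v1",
    "kind": "RuntimeClass",
    "metadata": {"name": "confidential"},
    "handler": "kata-cc",                     # assumed CRI handler name
}

pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "tee-inference"},
    "spec": {
        "runtimeClassName": "confidential",   # opt this Pod into the TEE runtime
        "containers": [{"name": "model",
                        "image": "registry.example.com/llm:enc"}],  # placeholder
    },
}
```

The application image itself is unchanged, which matches the "deploy into TEE without modification" claim above: only the runtime selection differs.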

In the AI era, models and data are core business assets of enterprises. Building on confidential containers, Alibaba Cloud's basic-software and container teams, together with Intel, provide a demonstration solution for trusted AI applications. In this example architecture, the application, the AI model, and the fine-tuning datasets are all stored encrypted in cloud services, and are decrypted and executed only inside the TEE by a confidential container at runtime.

  • The model inference and fine-tuning process is safe and trustworthy, ensuring the confidentiality and integrity of data.
  • High cost performance: with AMX instruction-set optimization, a 32-core CPU can complete Stable Diffusion inference in seconds.
  • Low overhead: the performance loss introduced by TDX is kept within 3%.

Easier cross-cloud collaboration makes business management more efficient

ACK One Fleet provides a unified control plane for multiple K8s clusters across regions, enabling unified cluster management, resource scheduling, application delivery, and backup and recovery for public-cloud, edge-cloud, and on-premises data-center clusters.

  • Zhaopin uses ACK One for hybrid-cloud load-aware elasticity and ECI to scale out tens of thousands of cores in 5 minutes.
  • Zeekr uses ACK One to uniformly manage dozens of hybrid-cloud K8s clusters, improving security and business continuity while reducing resource usage by 25% and raising operation and maintenance efficiency by 80%.

In large-scale data-computing workflow scenarios such as simulation and scientific computing, one batch of calculations may need tens or even hundreds of thousands of cores, exceeding the elastic supply of a single region and requiring cross-region computing supply. In scenarios such as IoT and healthcare, massive data is scattered across regions and needs to be computed close to where it lives. For this, ACK launched a fully managed Argo workflow cluster: event-driven, large-scale, maintenance-free, low-cost, and cross-region.

  • Argo workflow clusters make full use of elastic computing power across availability zones and regions, and automatically use ECI Spot to cut resource costs; compared with a self-built Argo workflow system, resource costs fall by 30%.
  • A built-in distributed data cache provides greater aggregate read bandwidth, raising data throughput 15x compared with direct access.
  • An optimized Argo engine raises parallel computing scale 10x.

Genetron Health used a fully managed Argo workflow cluster to process thousands of tumor genetic samples within 12 hours, 50% faster and at 30% lower cost.
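A managed Argo workflow cluster runs standard Argo `Workflow` objects. As a rough sketch of a sample-processing pipeline like the one above, fanning a list of inputs out to parallel steps with `withItems` (image and command are placeholders):

```python
# Sketch: a minimal Argo Workflow (argoproj.io/v1alpha1) that processes a list
# of samples in parallel via withItems. Image and command are placeholders.
def sample_workflow(name: str, samples: list) -> dict:
    """Build a Workflow that fans `samples` out to parallel container steps."""
    return {
        "apiVersion": "argoproj.io/v1alpha1",
        "kind": "Workflow",
        "metadata": {"generateName": f"{name}-"},
        "spec": {
            "entrypoint": "main",
            "templates": [
                {"name": "main",
                 "steps": [[{                      # inner list = parallel group
                     "name": "process",
                     "template": "process-one",
                     "arguments": {"parameters": [
                         {"name": "sample", "value": "{{item}}"}]},
                     "withItems": samples,         # one step per sample
                 }]]},
                {"name": "process-one",
                 "inputs": {"parameters": [{"name": "sample"}]},
                 "container": {
                     "image": "registry.example.com/pipeline:latest",
                     "command": ["process", "{{inputs.parameters.sample}}"],
                 }},
            ],
        },
    }

wf = sample_workflow("genomics", ["s1", "s2", "s3"])
```

On a managed cluster the controller handles fan-out scale; each step can land on an ECI instance, which is where the Spot savings above come from.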

Alibaba Cloud Container Service ACK, a cloud-native basic platform in the era of intelligent computing

Just as a society's technological level depends on its ability to harness energy, an enterprise's level of intelligence depends on its ability to harness computing power. Cloud computing brings unlimited possibilities to the era of intelligent computing. Alibaba Cloud Container Service's mission is to build the modern application platform for enterprises and maximize the use of Alibaba Cloud's powerful elastic computing power:

  • Improve computing performance through scenario-based, efficient use of diverse computing power;
  • Improve resource utilization through elasticity and scheduling;
  • Reduce operation and maintenance costs through intelligent autonomy;
  • Provide an end-to-end secure and trusted operating environment through best practices and technological innovation.
Origin my.oschina.net/u/3874284/blog/10364557