Embracing cloud native, next-generation edge computing cloud infrastructure


Editor's note: Faced with new application forms that generate massive data and demand low latency and distributed architectures, edge computing is becoming the new generation of cloud infrastructure. Volcano Engine covers a large number of edge nodes across the country, reserves bandwidth on the order of hundreds of terabits per second, and carries customer services across many scenarios such as live video, gaming and entertainment, smart transportation, and film and television special effects. LiveVideoStackCon 2023 Shanghai invited Guo Shaowei, edge computing architect at Volcano Engine Edge Cloud, to share "Embracing Cloud Native: Next-Generation Edge Computing Cloud Infrastructure".

Text/Guo Shaowei

Editor/LiveVideoStack

Hello everyone, my name is Guo Shaowei, and I work as an edge computing architect at Volcano Engine. My areas of expertise are cloud native and IaaS-related R&D, and in recent years I have focused on architecture design and development for edge computing.

The theme of this talk is "Embracing Cloud Native: Next-Generation Edge Computing Cloud Infrastructure".

The talk covers five aspects: 1. Business moves to the edge; 2. New challenges that business development brings to edge computing cloud infrastructure; 3. Coping with challenges: edge computing cloud infrastructure gradually improves; 4. How Volcano Engine builds a unified internal and external, edge-native cloud infrastructure architecture; 5. Future outlook.

-01-

Business moves to the edge


After more than a decade of development, cloud computing is no stranger to anyone. In using it, we have enjoyed many of its benefits, such as elasticity and reliability. Over that decade, many cloud vendors have emerged one after another, offering infrastructure as a service, platform as a service, and software as a service, and cloud computing has evolved into models such as public cloud, private cloud, and hybrid cloud. Today, the "cloud" touches every aspect of enterprise applications.

According to Gartner, by 2027 more than 90% of enterprises will use the "cloud" as their preferred infrastructure. Given such large-scale adoption, what is the next stage in the evolution of cloud computing? Let us keep this question in mind and look at the driving forces behind that evolution.


There are three key drivers of cloud computing evolution:

① Application-driven: More and more localized applications are emerging, including cloud gaming, AR/VR, industrial manufacturing, webcasting, smart parks, autonomous driving, and assisted driving. These local, business-driven applications are deployed closer to where users are.

② Evolution of infrastructure: Applications drive the improvement of infrastructure, which is gradually extending from the center to the edge and forming three new edge infrastructure forms: the on-site edge, the near-field edge, and the cloud edge. Each provides a different latency and computing-power guarantee: roughly 1-5 ms at the on-site edge, 5-20 ms at the near-field edge, and 20-40 ms at the cloud edge.

③ Combination of computing power and network: Driven by applications and infrastructure, users expect the same experience on edge infrastructure as on the cloud, so we bring cloud capabilities onto the edge infrastructure. By integrating computing power with the network, we can provide enterprises with better cloud services.


Next, let’s take a look at the evolution of business architecture:

We divide resources into three categories:

① Terminal resources: Provide users with real-time service responses; these are the phones, tablets, and in-car systems users interact with. Terminal resources are closest to the user, but their computing power is limited, which is where edge resources come in.

② Edge resources: Provide users with nearby access, wide-area business coverage, and accurate network-awareness capabilities. The edge gives terminals stronger computing power, but it is distributed across the country and the scale of a single edge node is limited; to achieve stronger elasticity and computing power, central resources must be combined.

③ Central resources: Provide users with more elastic system capacity and more powerful data aggregation capabilities.

The traditional centralized deployment architecture can no longer satisfy the deployment models these new resources enable. Business architectures are adopting cloud-edge-end hybrid deployment to fully leverage the advantages of each tier, and more and more businesses will move in this architectural direction.


The development of edge computing can be traced back to 1998, when Akamai first proposed the concept of the CDN, an Internet-based content caching network. A CDN caches content close to users, reduces network congestion, and improves access efficiency and hit rates; CDNs have become the basic service behind apps, websites, and clients.

In 2002, Microsoft, IBM, Akamai and other companies cooperated to deploy .Net and J2EE services on CDN PoP nodes, and the concept of edge computing appeared for the first time.

In 2009, CMU proposed the concept of Cloudlet, which combines VMs with edge infrastructure to create resource-rich trusted hosts deployed at the edge of the network. This is the prototype of edge IaaS services.

In 2012, against the background of the Internet of Everything, technologies such as mobile edge computing (MEC) and fog computing were proposed to cope with the massive data growth it brought. Cloud computing and the edge were subsequently combined, and the concept of edge computing emerged: lightweight, elastic, intelligent, heterogeneous, low-latency computing services placed on the path between data sources and the cloud center.

In this regard, I hold two views: first, edge computing is a powerful complement to cloud computing, and the two reinforce each other rather than one simply replacing the other; second, cloud-edge collaboration magnifies the value of both, and only by properly coordinating cloud and edge can their maximum value be unleashed.

-02-

Business development brings new challenges to edge computing cloud infrastructure

While the development of edge computing brings benefits, it also brings many challenges in terms of cloud infrastructure architecture.


The advantages of edge computing are as follows:

Low latency: Edge computing nodes are distributed across the country and cover all major carrier networks, giving users a low-latency experience.

High bandwidth: Edge computing processes and transmits data close to its source, so it can carry greater bandwidth.

Cost savings: Edge computing reduces the volume of data exchanged between clients and central nodes, helping customers save bandwidth costs.

Data security: Data is pre-processed and pre-aggregated at edge nodes rather than transmitted across the entire network, reducing the risk of theft during public-network transmission.


Edge computing mainly brings the following four challenges:

Resource limitations: Edge nodes are usually small, with machine counts ranging from a few to dozens; some nodes have only one server. We must therefore work out how to manage resources on such small nodes and, under limited resources, maximize the resource sell-through rate.

Distributed management: Hundreds of edge computing clusters are distributed across the country, bringing the problems of management over weak networks and of edge autonomy.

Diverse needs: Because customers' businesses are diverse, so are their demands on edge nodes. Customers need us to provide resource types such as cloud hosts, containers, and bare metal at the edge; at the network level they expect VPC, PIP, EIP, and other capabilities; at the storage level they expect cloud disks, local disks, file storage, object storage, and more.

Security management: We must achieve tenant isolation within very small nodes and secure the public-network transmission that coordinates the center and edge nodes.

-03-

Coping with challenges: Edge computing cloud infrastructure is gradually improving

In order to cope with the above challenges, edge computing cloud infrastructure is gradually improving.


As mentioned above, edge computing faces challenges such as miniaturization, distribution, and security isolation.

In this regard, the first thing we think of is cloud native technology, which has the following characteristics:

In terms of resource management, cloud native technology supports elastic scaling and on-demand resource allocation, making it possible to build elastic scalability on a small edge node.

In terms of technical architecture, cloud native technology is loosely coupled, pluggable, and highly extensible, making heterogeneous, on-demand deployment of edge nodes possible.

In terms of application deployment, cloud native technology provides standardized deployment, automated operation and maintenance, and observability, making simplified operations and automatic recovery feasible at the edge.

Cloud native is a design philosophy for cloud applications: it helps build systems that are elastic, reliable, loosely coupled, easy to manage, and observable.


The architectural evolution of edge computing is consistent with the evolution of business architecture and has gone through three stages:

Resource-oriented stage: In the early days, businesses ran directly on virtual machines or physical machines. The business faced resources directly, and problems such as how to orchestrate applications, how to deploy quickly, how to operate and maintain, and how to observe were left for the application to solve itself.

Application-oriented stage: With the rise of container technology, Kubernetes appeared in 2014 and the concept of Cloud Native in 2018; the edge likewise evolved into a period where cloud native is the mainstream architecture.

However, cloud native does not solve every edge problem. Edge scenarios have their own characteristics:

At the resource level, the edge has very broad node coverage while individual node resources are very limited, placing very high demands on massive node management and control and on single-node resource optimization.

At the network level, the cloud-edge network environment is weak, which creates the requirement of edge autonomy.

Thus the third stage of edge cloud technology architecture arrived: combining cloud native with edge characteristics into a distinctive edge technology solution, namely edge native.

Next, I will introduce the evolution of edge computing architecture to you in stages.

The first stage is traditional virtualization. This stage combines virtualization technology with the edge, providing the ability to split large-granularity resources into small-granularity ones and to isolate them. It is mainly resource-oriented: customers must solve deployment, operation and maintenance, monitoring, and a series of other problems themselves. This model places extremely high requirements on customers' basic operations capabilities and demands a very professional operations and management system.


As container technology and cloud native technology matured, cloud native applications multiplied. At this stage, containers are deployed inside virtual machines, nesting the two. In this solution virtualization is still the main technology and containers are auxiliary; it is a "transitional" solution for traditional hyper-convergence to cope with the cloud native trend. Although it solves some orchestration problems, the container's elasticity is limited by the virtual machine's elasticity.


Based on the characteristics of edge computing, the architecture eventually evolved into cloud-native hyper-convergence, which manages, controls, and deploys virtual machines, containers, and bare metal on the same resource pool. This has two advantages:


First, resource pooling. The three resource forms share one resource pool, which can be flexibly allocated among them, improving the overall resource sell-through rate.

Second, it serves more business forms: containers serve cloud native applications; virtual machines serve customers who have their own basic operations capabilities and also solve the Windows ecosystem problem; and bare metal provides higher-performance resources for high-traffic edge scenarios.

Edge native combines the characteristics and advantages of the edge and of cloud native technology. It retains the portability, observability, manageability, and unified orchestration of cloud native applications and services, while adding cloud-edge collaboration, edge-edge collaboration, central control, and edge autonomy. For scheduling, it offers global resource scheduling with local resource optimization, plus heterogeneous capabilities at edge nodes. Combining cloud native with edge characteristics lets applications and services fully exploit the edge.


-04-

Unified internal and external, edge-native cloud infrastructure architecture

Next, I will introduce how Volcano Engine builds edge-native cloud infrastructure.


The figure shows the overall technical solution, introduced from the bottom up:

Volcano Engine edge computing nodes are distributed across provinces and cities throughout the country, span all major carriers, and have high-quality network lines. They combine rich edge hardware: customized X86 servers, ARM servers, GPU-heterogeneous servers, high-performance NVMe storage, and 100G-bandwidth smart NICs.

On top of this high-quality infrastructure, we designed the edge cloud native operating system, covering edge autonomous management, system component management, and edge-oriented image services. Autonomous management includes cluster management and application lifecycle management; system components include networking, service discovery, and message queues; image services include public images, custom images, image preheating, and image acceleration.

Cloud-edge management provides subsystems such as cloud-edge channels, cluster management, and intelligent scheduling, optimizing cloud-edge collaboration.

Data management provides data collection, monitoring and alerting, data dashboards, and data warehouses. Edge data is preprocessed and then sent to the center for analysis and alerting.

Finally, at the product level we provide customers with edge computing services in forms including edge virtual machines, bare metal, and containers, along with edge network, edge storage, and other cloud service capabilities consistent with the cloud. We also build higher-level edge services such as FaaS and SaaS.

At the application level, this supports the needs of business scenarios such as CDN, live video, real-time audio and video, cloud gaming, dynamic acceleration, and edge intelligence.

The overall concept of the architecture design is cloud-edge collaboration, edge autonomy, and hierarchical governance.


The edge-native operating system combines cloud-native and edge features and provides the following four key capabilities:

① Unified orchestration: Through the cloud native operating system, computing resources, storage resources, network resources, and the platform's own cloud service resources are orchestrated uniformly.

② Collaborative management and control: Supports collaborative management and control between the center and the edge, achieving efficient center-edge integration.

③ On-demand deployment: Through hybrid deployment of computing power and services plus pluggable components, heterogeneous computing power and heterogeneous product capabilities can be provided in different resource scenarios.

④ Cloud-edge collaboration: Provides capabilities such as cloud-edge channels and edge-edge collaboration.


The requirements of edge nodes for resource orchestration can be summarized as miniaturization and diversification:

Miniaturization: Node scale is usually small, often just a few machines; some nodes have only one.

Computing requirements: Diverse business demands mean an edge node must support multiple product forms at once: virtual machines, containers, and bare metal.

Storage level: Block storage, file storage, and object storage capabilities are required.

Network: VPC networking, load balancing, elastic public IPs, and other capabilities must be provided.

In this regard, the solution we adopt is unified resource orchestration.

The bottom layer is Kubernetes. On top of it, abstractions are unified through CRDs: for example, if a virtual machine is needed, we define a VirtualMachine CRD and implement controller logic around it to manage and control the resource. Ecologically, the existing network, storage, GPU, and other resource types on Kubernetes can be reused directly, unifying storage and network resources across containers and virtual machines.
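To make this concrete, below is a minimal sketch of what such a CRD type could look like in Go, following common kubebuilder/controller-runtime conventions. The VirtualMachine kind, its fields, and the API group are illustrative assumptions for this article, not Volcano Engine's actual API.

```go
// Hypothetical VirtualMachine CRD types in kubebuilder style; a controller
// would watch these objects and reconcile them into running VMs on edge nodes.
package v1alpha1

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// VirtualMachineSpec is the desired state of an edge virtual machine.
type VirtualMachineSpec struct {
	CPU      int32  `json:"cpu"`                // number of vCPUs
	MemoryMi int64  `json:"memoryMi"`           // memory in MiB
	Image    string `json:"image"`              // base image reference
	NodeName string `json:"nodeName,omitempty"` // edge node chosen by the scheduler
}

// VirtualMachineStatus is the observed state reported by the edge controller.
type VirtualMachineStatus struct {
	Phase string `json:"phase,omitempty"` // e.g. Pending, Running, Failed
}

// VirtualMachine is the custom resource; creating one declares "I need a VM",
// and the controller turns that declaration into an actual instance.
type VirtualMachine struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	Spec   VirtualMachineSpec   `json:"spec,omitempty"`
	Status VirtualMachineStatus `json:"status,omitempty"`
}
```

Because such an object lives in the same API machinery as Pods, the VM can reuse Kubernetes-native storage (PVCs) and network (CNI) resources, which is exactly what unifies container and virtual machine orchestration.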


The requirement of unified service orchestration is unified component management, which involves two demands:

Lightweight: Edge clusters are usually small, so management and control services must be lightweight.

Service runtime dependencies: Services vary widely, so the underlying component libraries they depend on are also diverse, and some services have specific OS requirements.

The solution is unified service orchestration: design every component as a microservice and package and publish it uniformly as a container, so that a component's runtime does not depend on the OS or library versions of any particular host.

In the figure on the right, the bottom layer is the engine layer, which reuses Kubernetes' basic management capabilities and directly consumes the network, storage, and other primitives Kubernetes provides. On top of the engine layer, we built our own logging, monitoring, and alerting, and adopted and strengthened cloud native scaling, health checking, fault migration, and automatic recovery. Above that, virtual machines, container instances, bare metal, and other capabilities are provided externally in a unified way.
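As a hedged sketch of "every component is a containerized microservice", the following snippet publishes a hypothetical edge control-plane component as a Kubernetes Deployment via client-go. The component name, namespace, image, and kubeconfig path are all placeholders, not the real system's values.

```go
// Publish a control-plane component as a container Deployment so it carries
// its own library dependencies and does not depend on the host OS.
package main

import (
	"context"

	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func int32Ptr(i int32) *int32 { return &i }

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", "/etc/kubernetes/admin.conf") // placeholder path
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	dep := &appsv1.Deployment{
		ObjectMeta: metav1.ObjectMeta{Name: "edge-network-agent", Namespace: "edge-system"},
		Spec: appsv1.DeploymentSpec{
			Replicas: int32Ptr(1), // lightweight: one replica on a small edge cluster
			Selector: &metav1.LabelSelector{MatchLabels: map[string]string{"app": "edge-network-agent"}},
			Template: corev1.PodTemplateSpec{
				ObjectMeta: metav1.ObjectMeta{Labels: map[string]string{"app": "edge-network-agent"}},
				Spec: corev1.PodSpec{
					Containers: []corev1.Container{{
						Name:  "agent",
						Image: "registry.example.com/edge/network-agent:v1.0.0", // hypothetical image
					}},
				},
			},
		},
	}
	if _, err := client.AppsV1().Deployments("edge-system").Create(context.TODO(), dep, metav1.CreateOptions{}); err != nil {
		panic(err)
	}
}
```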


The requirement of collaborative management and control is unified management, control, and scheduling, covering cloud-edge linked management and unified resource scheduling. The solution is a self-developed cloud-edge collaborative management and control system with three key points:

Global perception: A Watch-based mechanism in the center perceives edge resources in real time, so resource and inventory changes are noticed faster.

Edge autonomy: A multi-Master mechanism ensures edge availability; even if contact with the center is lost, the edge continues to work independently.

Unified scheduling: Unified inventory management of virtual machines and containers.

The figure on the right shows the scheduling flow for creating a virtual machine (sketched in code below). The user initiates a request to create a VM instance; the VM control plane receives it and queries the inventory service; the scheduling system returns a result according to a globally optimal scheduling strategy; the management and control system then dispatches the request to the chosen edge node, where edge management and the edge scheduler perform lightweight scheduling; finally, the instance runs on a specific machine.
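The simplified sketch below mimics that two-level flow: a global scheduler in the center picks a node from the centrally cached inventory, and the final fine-grained placement is left to the edge. All types, numbers, and node names are illustrative.

```go
// Two-level (global + edge) scheduling, heavily simplified.
package main

import (
	"errors"
	"fmt"
)

type NodeInventory struct {
	NodeID           string
	FreeCPU, FreeMem int64 // inventory snapshot kept in the center via the Watch mechanism
	RTTMillis        int   // network distance to the requesting user
}

type VMRequest struct {
	CPU, Mem int64
}

// globalSchedule runs in the center: pick the "globally optimal" node,
// here simply the closest node with enough free capacity.
func globalSchedule(req VMRequest, nodes []NodeInventory) (string, error) {
	best, bestRTT := "", 1<<30
	for _, n := range nodes {
		if n.FreeCPU >= req.CPU && n.FreeMem >= req.Mem && n.RTTMillis < bestRTT {
			best, bestRTT = n.NodeID, n.RTTMillis
		}
	}
	if best == "" {
		return "", errors.New("no edge node has enough inventory")
	}
	return best, nil
}

func main() {
	nodes := []NodeInventory{
		{NodeID: "edge-shanghai-01", FreeCPU: 8, FreeMem: 16 << 10, RTTMillis: 8},
		{NodeID: "edge-beijing-02", FreeCPU: 32, FreeMem: 64 << 10, RTTMillis: 35},
	}
	node, err := globalSchedule(VMRequest{CPU: 4, Mem: 8 << 10}, nodes)
	if err != nil {
		panic(err)
	}
	// The request then travels down the cloud-edge channel; the edge scheduler
	// performs a second, lightweight placement onto a specific physical machine.
	fmt.Println("dispatch to", node)
}
```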


The requirement for on-demand deployment is capability diversity, which mainly includes the following points:

Heterogeneous scale: Some nodes are smaller, some larger.

Heterogeneous resources: Server types at different nodes include X86, ARM, and GPU.

Storage resources: Storage capabilities at different nodes include cloud disks, local disks, file storage, and more.

Product capabilities: Different nodes may provide X86 or ARM virtual machines.

The answer to this is component standardization and on-demand deployment.

The first step is to standardize node specifications: we standardize both node types and components. Node types are divided into small nodes, general-purpose nodes, large nodes, and so on; components into virtual machines, containers, networking, and so on.

At the same time, the deployment plan fixes an arrangement for each node type and set of product requirements; during node construction, the plan is selected based on the node type and the products needed. As the figure on the right shows, small nodes give users standard virtual machine, container, and LB capabilities; general-purpose nodes additionally provide bare metal, which only requires deploying the bare metal plug-in; and large nodes provide further capabilities such as GPU and file storage by deploying the corresponding plug-ins on top. A simple sketch of such plans follows.
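A minimal sketch of the idea, assuming three standardized node types and purely illustrative component names: each node type maps to a fixed deployment plan, and the node build process simply installs that plan's plug-ins.

```go
// Standardized node specs + on-demand component plans.
package main

import "fmt"

// deployPlans maps a node type to the components installed when the node is built.
var deployPlans = map[string][]string{
	"small":   {"virtual-machine", "container", "load-balancer"},
	"general": {"virtual-machine", "container", "load-balancer", "bare-metal"},
	"large":   {"virtual-machine", "container", "load-balancer", "bare-metal", "gpu", "file-storage"},
}

func planFor(nodeType string) []string {
	return deployPlans[nodeType]
}

func main() {
	fmt.Println(planFor("general")) // plug-ins to deploy on a general-purpose node
}
```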


Cloud-edge collaboration addresses the weak cloud-edge network, in both its network and security aspects. The former includes network packet loss, link instability, and link interruption; the latter is mainly the security of transmission over public-network links.

The corresponding solution is a self-developed cloud-edge channel.

First, long-lived links are established between the edge and the center and then reused, and each edge node's data is cached in the center. This lets the center sense edge changes faster and accelerates read requests when central components operate on the edge.

Second, for security protection, identity authentication and two-way certificates ensure mutual authentication between client and server; full-link SSL encryption and decryption protect transmitted data; and ACL access control ensures that only whitelisted edge nodes can register with the center, strengthening the security of cloud-edge communication.

Finally, for network disaster recovery, technologies such as multiple machine rooms, multiple replicas, load balancing, and automatic fault migration ensure the high availability of the cloud-edge channel.
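The snippet below sketches that security posture: an edge node establishing one long-lived, mutually authenticated TLS connection to the center. File paths and the center address are placeholders; certificate issuance and the message protocol on top are omitted.

```go
// Mutually authenticated, long-lived TLS link from an edge node to the center.
package main

import (
	"crypto/tls"
	"crypto/x509"
	"log"
	"os"
)

func main() {
	// Edge node presents its own certificate (client side of mutual TLS).
	cert, err := tls.LoadX509KeyPair("/etc/edge/node.crt", "/etc/edge/node.key")
	if err != nil {
		log.Fatal(err)
	}
	// Trust only the center's private CA.
	caPEM, err := os.ReadFile("/etc/edge/center-ca.crt")
	if err != nil {
		log.Fatal(err)
	}
	pool := x509.NewCertPool()
	pool.AppendCertsFromPEM(caPEM)

	conn, err := tls.Dial("tcp", "center.example.com:443", &tls.Config{
		Certificates: []tls.Certificate{cert},
		RootCAs:      pool,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()
	// The connection stays open and is multiplexed for control messages, so the
	// center senses edge changes quickly over one authenticated link.
	log.Println("cloud-edge channel established:", conn.RemoteAddr())
}
```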

Next are a few technical best practices on edge nodes.


The first is accelerating instance creation. The problem is that edge nodes create instances slowly, for two reasons. First, image downloads are slow: an edge node pulls images from the center over the public network, so download time is uncontrollable. Second, instance creation requires a complete copy of the base image; the larger the image, the longer the copy takes.

The solution adopted is pre-warming plus snapshots.

First, virtual machine images and user-defined images are pre-warmed onto edge nodes. Then a snapshot of the edge image is pre-created; when a virtual machine needs to be created, it is created directly from that snapshot, and the virtual machines share the same underlying snapshot layer. The snapshot uses a copy-on-write mechanism: no image data is copied when the virtual machine is created, and data is copied only when it is actually written. With this snapshot mechanism, virtual machines can be created in seconds.
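The copy-on-write step can be illustrated with qemu-img, a standard tool for QCOW2 images (the sketch assumes QCOW2 as the image format; paths are illustrative): creating an instance disk is just creating a thin overlay that points at the pre-warmed base image.

```go
// Create a thin QCOW2 overlay backed by a pre-warmed base image.
package main

import (
	"log"
	"os/exec"
)

func createOverlay(base, overlay string) error {
	// -b: backing file; -F: backing file format. Data is copied only when the
	// guest actually writes (copy-on-write), so creation is near-instant.
	cmd := exec.Command("qemu-img", "create",
		"-f", "qcow2", "-b", base, "-F", "qcow2", overlay)
	out, err := cmd.CombinedOutput()
	if err != nil {
		log.Printf("qemu-img: %s", out)
	}
	return err
}

func main() {
	if err := createOverlay("/var/lib/images/base-ubuntu.qcow2",
		"/var/lib/instances/vm-001.qcow2"); err != nil {
		log.Fatal(err)
	}
}
```

Because the overlay starts empty, creation completes quickly regardless of base image size; blocks are copied from the backing file only when the guest writes them.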


We have also optimized virtualization performance. A virtual machine is, as the name suggests, virtualized by software, so it inevitably incurs some performance loss, reflected in three points:

•First, vCPUs are scheduled as ordinary user-mode processes on the host operating system, so vCPUs may compete with each other for CPU time.

•Second, the virtual machine splits large-granularity memory into small-granularity memory, which incurs memory-translation overhead.

•Third, VM Exits may affect CPU performance.

In order to have a deeper understanding of the above issues, let me introduce the basic principles of virtual machines:

The CPU has four privilege levels, Ring 0 through Ring 3; Linux uses only Ring 0 and Ring 3, representing kernel mode and user mode respectively.

A virtual machine mainly consists of the VMM (hypervisor) and the Guest. To support virtualization, X86 servers provide two operating modes, root mode and non-root mode; running a virtual machine is essentially the CPU switching, under control, between root and non-root modes.

Switching between the VMM and the Guest goes in two directions. Suppose the code currently running is in the VMM layer: to run customer code, we must enter the Guest layer, which is done by calling the VMLAUNCH or VMRESUME instruction to switch execution to the guest side; this process is called VM Entry. When the guest needs to respond to an external interrupt or a page fault, CPU execution switches back to the VMM; this process is called VM Exit. The run loop below sketches this cycle.
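For intuition, here is a heavily simplified, schematic run loop modeled on the Linux KVM API, where the KVM_RUN ioctl performs VM Entry and its return is a VM Exit to be dispatched on. Device, VM, and vCPU setup are omitted, so this illustrates the cycle rather than being a working VMM.

```go
// Schematic vCPU run loop over the Linux KVM API (Linux-only).
package main

import (
	"fmt"
	"syscall"
)

const (
	kvmRun     = 0xAE80 // KVM_RUN ioctl number from <linux/kvm.h>
	kvmExitIO  = 2      // KVM_EXIT_IO: guest executed a port I/O instruction
	kvmExitHLT = 5      // KVM_EXIT_HLT: guest executed HLT
)

// runVCPU drives one vCPU. vcpuFD is the vCPU file descriptor; exitReason
// would normally be read from the mmap'ed kvm_run structure.
func runVCPU(vcpuFD int, exitReason func() uint32) error {
	for {
		// VM Entry: the CPU switches to non-root mode and runs guest code.
		_, _, errno := syscall.Syscall(syscall.SYS_IOCTL, uintptr(vcpuFD), kvmRun, 0)
		if errno != 0 {
			return errno
		}
		// VM Exit: control is back in the VMM; dispatch on the reason.
		switch exitReason() {
		case kvmExitIO:
			// emulate the I/O access, then loop back (next VM Entry)
		case kvmExitHLT:
			fmt.Println("guest halted")
			return nil
		}
	}
}

func main() {
	// Real use requires opening /dev/kvm and creating a VM and vCPU first;
	// with an invalid fd, KVM_RUN simply fails with EBADF.
	if err := runVCPU(-1, func() uint32 { return kvmExitHLT }); err != nil {
		fmt.Println("expected without real KVM setup:", err)
	}
}
```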

In order to reduce the performance loss of virtual machines, we have done the following things:

•vCPU binding: Binding each vCPU one-to-one to a physical CPU reduces frequent CPU switching and thus context-switch overhead (see the sketch after this list);

•Hugepages: Large memory pages shrink the page tables and reduce TLB misses, improving the virtual machine's memory access performance;

•Exit optimization: Passing through exits such as timer and IPI eliminates most VM Exits, reducing virtualization loss to under 5%.
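As one concrete example of the first item, the sketch below pins the calling thread (which would host a vCPU) to a single physical CPU using sched_setaffinity. In practice this is usually configured through the hypervisor stack (for example libvirt's cputune) rather than hand-rolled, and the CPU index here is arbitrary.

```go
// Pin the calling thread to one physical CPU (Linux-only).
package main

import (
	"log"
	"runtime"

	"golang.org/x/sys/unix"
)

func pinToCPU(cpu int) error {
	// Lock this goroutine to its OS thread, then bind that thread
	// (pid 0 = calling thread) to exactly one physical CPU.
	runtime.LockOSThread()
	var set unix.CPUSet
	set.Zero()
	set.Set(cpu)
	return unix.SchedSetaffinity(0, &set)
}

func main() {
	if err := pinToCPU(2); err != nil {
		log.Fatal(err)
	}
	log.Println("vCPU thread pinned to physical CPU 2")
}
```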


Optimization at the I/O level mainly includes two points:

•Network I/O: ultra-high bandwidth, as in vCDN scenarios

•Storage: localized caching scenarios demand strong storage bandwidth and IOPS capabilities

The corresponding solution is to use hardware offloading, hardware pass-through, polled I/O and other methods.

•Hardware offloading: Offload network traffic to a dedicated NIC and let that device forward packets; this improves forwarding throughput and frees up CPU resources.

•Device passthrough: Pass the disk or NIC device directly through to the virtual machine, shortening the software forwarding path and improving overall I/O performance.

•Polled I/O: User-mode polling reduces dependence on notification mechanisms and senses data changes faster (a toy illustration follows).
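The toy program below contrasts polled I/O with notification-driven I/O: a consumer spins on a shared ring instead of blocking, the same idea (greatly simplified) behind DPDK/SPDK-style user-mode polling. The ring layout is illustrative.

```go
// Busy-polling a ring buffer: trade one core for lower, more predictable latency.
package main

import (
	"fmt"
	"sync/atomic"
)

type ring struct {
	buf        [1024]uint64
	head, tail atomic.Uint64 // producer advances head, consumer advances tail
}

func (r *ring) poll() (uint64, bool) {
	t, h := r.tail.Load(), r.head.Load()
	if t == h {
		return 0, false // nothing new; caller spins and retries
	}
	v := r.buf[t%uint64(len(r.buf))]
	r.tail.Store(t + 1)
	return v, true
}

func main() {
	var r ring
	go func() { // producer: emulate a device posting completions
		for i := uint64(1); i <= 3; i++ {
			h := r.head.Load()
			r.buf[h%uint64(len(r.buf))] = i
			r.head.Store(h + 1)
		}
	}()
	for got := 0; got < 3; { // consumer: busy-poll, no interrupts or notifications
		if v, ok := r.poll(); ok {
			fmt.Println("completion", v)
			got++
		}
	}
}
```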

-05-

Future outlook

In the future, edge computing will continue its growth trend, and its rise will bring more convenience. Finally, let me introduce the main directions of future work in edge computing.


The main directions are lightweighting, computing-network integration, and an open ecosystem.


Currently we provide standard virtualization capabilities and fairly complete functionality at the edge, but this virtualization stack is still heavyweight.

In the future we will optimize the hypervisor for lighter overhead, further reducing virtualization losses. At the management and control level, cloud-edge collaboration will unify some management capabilities in the center while the edge provides lightweight autonomy, making both the edge control plane and the hypervisor lightweight.


The second is deep computing-network integration. Today we rely mostly on the elasticity of a single node and the scheduling of a single node's computing resources, and applications must provide their own multi-machine-room disaster recovery. In the future we will deeply integrate the computing-power network, scheduling network resources and CPU computing resources in a unified way to achieve cross-node elastic scaling, so that some services can migrate freely between nodes and make good use of the resources of different nodes.


Finally, a more open ecosystem. We have built an edge-native operating system based on cloud native technology and uniformly provide public cloud services such as virtual machines, containers, and bare metal.

In the future we will open more cloud native capabilities to users and attract more partners into the cloud native ecosystem. Through a more open model, cloud native technology can serve not only ourselves but also let more customers enjoy the convenience of the cloud native ecosystem.


