Istio's new data plane model: Ambient Mesh technology analysis

Abstract: Ambient Mesh takes a form better suited to large-scale deployment. It overcomes most of the inherent defects of the sidecar model, so that users no longer need to be aware of mesh components and the mesh truly sinks into the infrastructure.

This article is shared from the Huawei Cloud Community post "Huawei Cloud Cloud Native Team: Istio Data Plane New Model Ambient Mesh Technology Analysis", author: The future of cloud containers.

If one had to name the most classic design pattern in the cloud-native world built on Kubernetes, the sidecar pattern would be among the strongest candidates. When an application needs auxiliary functions unrelated to its own logic, injecting a sidecar that provides them into the application Pod is the most Kubernetes-native approach, and Istio is the representative example of this pattern.

The vision of the Istio project is to solve the problems of connectivity, security and observability between services in microservice scenarios as transparently as possible. The main implementation method is to deploy a proxy next to the application: in Kubernetes, a sidecar is injected into the application Pod and application traffic is redirected to it. The sidecar processes that traffic according to user configuration obtained from the Istio control plane, implementing service governance in a way that is almost non-invasive to application code.

Although Istio is not limited to Kubernetes, its design has a natural affinity with the Kubernetes sidecar pattern. The sidecar pattern let Istio be developed, deployed and validated quickly on Kubernetes. At the functional level, Istio strips service governance out of application code and sinks it into the sidecar as infrastructure, abstracting a genuine cloud-native application network layer and greatly reducing the mental burden on application developers. This capability is exactly what the Kubernetes ecosystem had been missing. Because Istio complements the Kubernetes ecosystem so well, it quickly won user mindshare and market share as Kubernetes became widespread.

Although deploying the Istio data plane as sidecars seems like a natural, almost unavoidable choice, it must be emphasized that Istio's full functionality is not strongly bound to the sidecar mode, and various other options exist. In addition, as Istio usage deepens and deployments grow, deploying the Istio data plane as sidecars turns out to pose a number of challenges:

1. Intrusion: Istio achieves essentially zero intrusion into application code, but sidecar injection changes the Pod spec and redirects application traffic, so the Pod must be restarted when the application joins the mesh. Conflicts caused by the undetermined startup order of the application container and the sidecar container may also interrupt the application;

2. Life cycle binding: The sidecar is essentially infrastructure, and its life cycle often differs from that of the application. Upgrading the sidecar therefore also requires restarting the application Pod, which may interrupt the application. For Job workloads, the presence of the sidecar can prevent the Pod from being cleaned up promptly;

3. Low resource utilization: A sidecar is exclusive to a single application Pod, and application traffic has peaks and valleys. Sidecar memory usage is generally strongly related to cluster size (number of Services, number of Pods), so resources have to be reserved for the extreme case, leading to low resource utilization across the cluster. Moreover, since a sidecar is injected into every Pod, the total resources consumed by sidecars grow linearly as the cluster expands.

To address the shortcomings of the sidecar deployment model, Google and Solo.io jointly launched a new sidecar-less deployment model: Ambient Mesh.

Architecture introduction

The Ambient Mesh architecture is shown in the figure above. From a design point of view, it mainly has the following two characteristics:

1. Sidecar-less: To avoid the defects of the sidecar model described above, Ambient Mesh no longer injects a sidecar into any Pod, and instead sinks the implementation of mesh functions further into Istio's own components.

2. Layering of L4/L7 processing: Ambient Mesh introduces two components, ztunnel and waypoint, to replace the sidecar. Unlike a sidecar, which handles both L4 and L7 traffic, ztunnel is only responsible for L4 traffic, while L7 traffic is handed over to waypoints as needed.

Compared with Istio in the original Sidecar mode, the control plane of Ambient Mesh is basically unchanged. The component composition of the data plane and the functions of each component are as follows:

1. istio-cni: Required component, deployed as a DaemonSet. istio-cni is not new in Ambient Mesh; it already existed in sidecar mode, where it was mainly used to replace the istio-init Init Container for configuring traffic interception rules, avoiding the security issues raised by istio-init. Ambient Mesh extends it and makes it mandatory. It configures traffic forwarding rules, intercepts the application traffic of Pods on its node that have joined Ambient Mesh, and forwards that traffic to the ztunnel on the same node;

2. ztunnel: Required component, deployed as a DaemonSet. ztunnel proxies the traffic of the Pods on its node and is mainly responsible for L4 traffic processing, L4 telemetry, and management of mTLS (mutual TLS authentication) between services. ztunnel was originally implemented on Envoy, but given the intentionally constrained feature set of ztunnel and its security and resource-footprint requirements, the community has rebuilt the component from scratch in Rust;

3. waypoint: Configured on demand, deployed as a Deployment. waypoint handles HTTP, fault injection and other L7 functions, and is deployed at the granularity of a workload or a Namespace. In Kubernetes, one Service Account or one Namespace corresponds to one waypoint Deployment, which processes the L7 traffic sent to the corresponding workloads. The number of waypoint instances can be scaled dynamically with traffic.

The following walk-through of the Ambient Mesh data plane shows the specific role each of these components plays:

1. As in sidecar mode, services can be added to Ambient Mesh at the granularity of the whole mesh, a Namespace, or a Pod; the difference is that a newly added Pod needs neither a restart nor an injected sidecar;

2. istio-cni watches Pods on the node being added or deleted and joining or leaving the mesh, and adjusts the forwarding rules dynamically. Traffic sent by Pods in the mesh is transparently forwarded to the ztunnel on the node, bypassing kube-proxy processing;

3. ztunnel likewise watches Pods on the local node being added or deleted and joining or leaving the mesh, and obtains from the control plane, and manages, the certificates of the local Pods that the mesh has taken over;

4. The source ztunnel processes the intercepted traffic, finds the certificate of the sending Pod from the source IP of the traffic, and establishes mTLS with the peer;

5. If the target service has no waypoint or L7 processing policy configured, the source ztunnel connects directly to the destination ztunnel (the yellow line in the figure above). The destination ztunnel terminates mTLS, enforces the L4 security policy, and forwards the traffic to the target Pod;

6. If the target service has a waypoint (created through a specially configured Gateway object) and an L7 processing policy configured, the source ztunnel establishes mTLS with the corresponding waypoint. After terminating mTLS, the waypoint performs the L7 processing and then establishes mTLS with the ztunnel on the node where the target Pod runs. Finally, the destination ztunnel also terminates mTLS and sends the traffic to the target Pod.
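
Steps 4 to 6 can be sketched as a small decision function. This is an illustrative model only, not Istio's implementation: the `Pod` and `Mesh` classes, the `plan_path` method, the hop labels, and the sample IPs and names are all hypothetical, and certificates are reduced to identity strings in the SPIFFE format that Istio uses for workload identities.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Pod:
    name: str
    ip: str
    node: str
    namespace: str
    service_account: str


@dataclass
class Mesh:
    """Toy view of the state shared by ztunnels and the control plane (hypothetical)."""
    pods_by_ip: dict = field(default_factory=dict)
    # (namespace, service account) pairs that have a waypoint and L7 policy.
    waypoints: set = field(default_factory=set)

    def add_pod(self, pod: Pod) -> None:
        self.pods_by_ip[pod.ip] = pod

    def identity_for(self, source_ip: str) -> str:
        # Step 4: the source ztunnel selects the workload identity
        # (certificate) from the source IP of the intercepted connection.
        pod = self.pods_by_ip[source_ip]
        return f"spiffe://cluster.local/ns/{pod.namespace}/sa/{pod.service_account}"

    def plan_path(self, source_ip: str, dest_ip: str) -> list:
        src = self.pods_by_ip[source_ip]
        dst = self.pods_by_ip[dest_ip]
        hops = [f"ztunnel@{src.node}"]
        if (dst.namespace, dst.service_account) in self.waypoints:
            # Step 6: L7 policy exists, so traffic detours through the waypoint.
            hops.append(f"waypoint:{dst.namespace}/{dst.service_account}")
        # Steps 5 and 6 both finish at the ztunnel on the destination node.
        hops.append(f"ztunnel@{dst.node}")
        hops.append(dst.name)
        return hops


mesh = Mesh()
mesh.add_pod(Pod("client", "10.0.1.5", "node-a", "default", "sleep"))
mesh.add_pod(Pod("reviews", "10.0.2.7", "node-b", "default", "reviews"))

l4_path = mesh.plan_path("10.0.1.5", "10.0.2.7")   # no waypoint configured yet
mesh.waypoints.add(("default", "reviews"))
l7_path = mesh.plan_path("10.0.1.5", "10.0.2.7")   # waypoint now in the path
print(l4_path)
print(l7_path)
```

In this model, enabling L7 processing for a destination changes only the planned path; neither Pod is restarted or modified, which mirrors step 1 above.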

Value Analysis

Although Ambient Mesh differs greatly from the original sidecar mode in its underlying implementation, from the user's perspective the usage and effects of the core Istio APIs (VirtualService, DestinationRule, etc.) are consistent, which ensures basically the same user experience. Ambient Mesh is the second data plane mode supported by the Istio community, alongside sidecar mode, so in terms of the value that mesh technology itself brings to users, it is no different from sidecar mode. Therefore, only the value of Ambient Mesh relative to sidecar mode is analyzed here; the value of the mesh itself is not repeated.

Ambient Mesh mainly adjusts Istio's data plane architecture to overcome the shortcomings of the existing sidecar mode, so its value follows from its architectural characteristics. As mentioned earlier, these are mainly "sidecar-less" and "layering of L4/L7 processing". The value analysis is based on these two points:

1. The advantages of being sidecar-less are essentially the mirror image of the defects of the sidecar model:

  1. Transparency: Mesh functions sink into the infrastructure. This not only means zero intrusion into application code but also fully decouples the mesh from the application life cycle, making the mesh truly transparent to the application and allowing the two to evolve independently;
  2. Optimized resource usage: The CPU, memory and other resources consumed by the data plane no longer grow linearly with the number of application instances. With fewer data plane instances, there are also fewer connections to the control plane, which greatly reduces the control plane's resource usage and processing pressure.

2. To see why L4 and L7 should be layered, first consider how they differ. Compared with L4, L7 processing is more complex and needs more CPU and memory, and resource usage varies widely between different types of operation; moreover, the more complex the operation, the larger the attack surface exposed. In addition, Envoy currently does not support strong isolation between the traffic of different tenants, so the "noisy neighbor" problem is unavoidable. The advantages of the layered architecture of Ambient Mesh are therefore as follows:

  1. High resource utilization: ztunnel handles only L4 processing, which is relatively simple and has a relatively fixed resource footprint. It is therefore easier to plan resources for ztunnel without excessive reservation, leaving more node resources for users. waypoints can scale out and in dynamically with L7 load, making full use of fragmented resources in the cluster;
  2. Tenant isolation: L7 processing, which is complex and carries higher security risk, is handled by the waypoint of each tenant (Service Account). This avoids resource contention between tenants and also limits the blast radius of security issues;
  3. Smooth adoption: Users can join the mesh gradually. When only the L4 capabilities of the mesh are needed, there is no need to consider L7 resource usage or its potential negative effects (for example, a misconfiguration sends the application's traffic into L7 processing, the application does not fully comply with the L7 protocol, and the service is interrupted); the relevant functions can be enabled on demand later, at an appropriate time.
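
The resource argument can be made concrete with back-of-the-envelope arithmetic. The sketch below is a simplified model under stated assumptions: the function names are hypothetical, and every figure (memory per proxy, Pod, node and waypoint counts) is a made-up input for illustration, not a measurement of Istio.

```python
def sidecar_memory_mib(pods: int, per_sidecar_mib: int) -> int:
    # One sidecar per Pod: the total grows linearly with the number of Pods.
    return pods * per_sidecar_mib


def ambient_memory_mib(nodes: int, per_ztunnel_mib: int,
                       waypoint_replicas: int, per_waypoint_mib: int) -> int:
    # One ztunnel per node, plus waypoints only where L7 processing is needed.
    return nodes * per_ztunnel_mib + waypoint_replicas * per_waypoint_mib


# Assumed inputs: 1000 Pods on 20 nodes, 5 waypoint replicas in total.
sidecar_total = sidecar_memory_mib(pods=1000, per_sidecar_mib=50)
ambient_total = ambient_memory_mib(nodes=20, per_ztunnel_mib=100,
                                   waypoint_replicas=5, per_waypoint_mib=200)
print(f"sidecar mode: {sidecar_total} MiB, ambient mode: {ambient_total} MiB")
```

In this model, doubling the Pod count doubles the sidecar total but leaves the ambient total unchanged unless nodes or waypoint replicas are added. Real footprints depend on configuration size and traffic, so the numbers only show the shape of the growth.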

Of course, as a new Istio data plane architecture, Ambient Mesh is still an experimental feature in the community, and many problems remain to be solved, such as:

1. Performance: Especially for L7 processing, Ambient Mesh traffic passes through two ztunnels and one waypoint, visibly one hop more than in sidecar mode, so complete L7 processing requires three extra hops in total. Although the community claims that this has little impact on performance, further observation and comparison are needed once the feature stabilizes;

2. Container network adaptation: Although Ambient Mesh is almost completely decoupled from the application, it increases the coupling between the mesh and the underlying infrastructure. Sidecar mode only needs to intercept traffic inside the Pod's network namespace, whereas Ambient Mesh intercepts traffic in the host network, so adapting to the underlying container network clearly needs more consideration;

3. Complex configuration: Envoy's complex configuration is already widely criticized, yet in Ambient Mesh a single ztunnel must act as the proxy for all Pods on a node, which raises configuration complexity by an order of magnitude. Complex configuration also means more processing steps, which affects both data plane debugging and overall performance;

4. Others: How is high availability of ztunnel ensured? And since waypoint turns the original double-ended L7 processing into single-ended processing, how does this affect the correctness of L7 monitoring metrics?
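
The hop counts behind the performance concern in item 1, and the single-ended L7 processing raised in item 4, can be tallied in a few lines. This is only a counting aid built from the paths described in this article; the `PATHS` table and the `summarize` function are hypothetical, and the sketch says nothing about actual latency.

```python
# Proxies traversed between the source and destination applications,
# each tagged with the highest layer it processes.
PATHS = {
    "sidecar": [("source sidecar", "L7"), ("destination sidecar", "L7")],
    "ambient, L4 only": [("source ztunnel", "L4"), ("destination ztunnel", "L4")],
    "ambient, with waypoint": [
        ("source ztunnel", "L4"),
        ("waypoint", "L7"),
        ("destination ztunnel", "L4"),
    ],
}


def summarize(path):
    """Return (number of proxy hops, number of hops doing L7 processing)."""
    hops = len(path)
    l7_stages = sum(1 for _, layer in path if layer == "L7")
    return hops, l7_stages


summary = {name: summarize(path) for name, path in PATHS.items()}
for name, (hops, l7_stages) in summary.items():
    print(f"{name}: {hops} proxy hops, {l7_stages} with L7 processing")
```

The tally shows the trade-off in question: the waypoint path has one more hop than sidecar mode but only one stage of L7 processing instead of two. Whether that balance helps or hurts end-to-end latency is exactly what still has to be measured.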

Future outlook

In terms of releases, Ambient Mesh has lived on a separate branch as an experimental feature since it was announced in September 2022. The next steps are to merge it into the main branch (done in February 2023) and release it as an Alpha feature, with the goal of reaching Stable by the end of 2023 so that it can be used in production.

In terms of APIs, the ideal would be for both architectures to share the same set of APIs. This is unrealistic, because some existing Istio APIs were designed on the premise of sidecar deployment. The most typical is the Sidecar CRD, which customizes the configuration delivered to individual sidecars and so reduces their unnecessary resource usage. Such sidecar-only APIs are obviously meaningless under Ambient Mesh. At the same time, Ambient Mesh introduces two components of its own, ztunnel and waypoint, so it also needs new APIs to manage them and to implement some Ambient-Mesh-only functions. In the end, Ambient Mesh will implement the existing core Istio APIs (VirtualService, DestinationRule, etc.) and add some APIs of its own. What matters is unifying the use and interaction of the three kinds of API: sidecar-only, Ambient-Mesh-only, and shared.

So, does Ambient Mesh fully cover the use cases of sidecar mode, so that sidecar mode can exit the stage of history? The answer is naturally no. As with the various dedicated-versus-shared debates in the industry, sidecar mode essentially gives the application Pod exclusive use of a proxy. A dedicated proxy can often guarantee better resource availability, avoid interference from other applications as far as possible, and keep high-priority applications running normally. It is foreseeable that the more desirable end state is a mixed deployment of the two modes in which each application selects its proxy mode on demand. Building such a hybrid deployment mode, with good compatibility and a unified experience across the two modes, will therefore also be a focus of follow-up work.

Summary

Sidecar mode was like a prototype for Istio: it quickly demonstrated the value of mesh technology in the most Kubernetes-native way and won user mindshare and market share. However, as Istio adoption enters deep water and large-scale production deployment begins, the sidecar mode increasingly falls short. At this point Ambient Mesh arrives in a form better suited to large-scale deployment, overcoming most of the inherent defects of the sidecar model, so that users no longer need to be aware of mesh components and the mesh truly sinks into the infrastructure.

But Ambient Mesh is clearly not the end of the evolution of mesh data plane architecture. At present, no mesh data plane solution is ideal in intrusiveness, performance and resource usage all at once. Ambient Mesh achieves essentially zero intrusion into the application, but the performance cost of three-hop L7 processing and the resource usage of resident processes such as ztunnel cannot be ignored. RPC libraries such as gRPC have built-in xDS support and connect directly to the Istio control plane; folding the mesh into the SDK in this way does deliver good performance and resource usage, but it inevitably pays the inherent price of tight coupling with the application and the high complexity of multi-language support. Sinking the full set of mesh data plane functions directly into the kernel with eBPF, alongside the TCP/IP protocol stack, looks like an ideal final solution, but for reasons of kernel safety and the complexity of interacting with the kernel, the eBPF execution environment is actually very restricted. For example, an eBPF program must pass the verifier before it can be loaded, its execution paths must be fully known, and it cannot execute arbitrary loops. Complex L7 processing such as HTTP/2 and gRPC would therefore be difficult to develop and maintain on eBPF.

Consider the extreme performance and resource demands placed on infrastructure, and how related technologies have evolved in the past. In basic networking, for example, most applications can share the kernel protocol stack, while some special applications use dedicated technologies such as DPDK and RDMA for acceleration. Similarly, for the mesh data plane, combining multiple technologies, each optimized for its own scenario, may be the more feasible approach. It can be foreseen that such a solution would take a node-level proxy like Ambient Mesh as its main body. As mesh and eBPF technology develop, as many mesh data plane functions as possible would be pushed down into eBPF (the fast path); a small number of advanced functions would be implemented by a user-space proxy (the slow path); and applications with high performance and isolation requirements would get dedicated sidecars. In this way, the requirements for intrusiveness, performance and resource usage could be met in most scenarios.
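
The tiered design outlined above could be expressed as a simple placement rule. This is a speculative sketch of the article's outlook, not an existing Istio feature: the `Workload` fields, the `DataPath` tiers and the `choose_data_path` function are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class DataPath(Enum):
    EBPF_FAST_PATH = "eBPF fast path (kernel)"
    PROXY_SLOW_PATH = "node-level proxy slow path (user space)"
    DEDICATED_SIDECAR = "dedicated sidecar"


@dataclass(frozen=True)
class Workload:
    name: str
    # Assumed flag: needs functions too complex to push down into eBPF.
    needs_advanced_functions: bool = False
    # Assumed flag: has high performance and isolation requirements.
    needs_strict_isolation: bool = False


def choose_data_path(workload: Workload) -> DataPath:
    # Isolation requirements win: such applications get their own proxy.
    if workload.needs_strict_isolation:
        return DataPath.DEDICATED_SIDECAR
    # Advanced functions fall back to the shared user-space proxy.
    if workload.needs_advanced_functions:
        return DataPath.PROXY_SLOW_PATH
    # Everything else stays on the in-kernel fast path.
    return DataPath.EBPF_FAST_PATH


for w in (
    Workload("batch-job"),
    Workload("web-frontend", needs_advanced_functions=True),
    Workload("payments", needs_strict_isolation=True),
):
    print(f"{w.name}: {choose_data_path(w).value}")
```

Checking isolation first is a design choice of this sketch: it assumes a dedicated sidecar can also provide the advanced functions, so an application that needs both is not sent to the shared proxy.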

In summary, whether a single data plane solution will eventually dominate, or several solutions will be deployed together with each doing its own job, remains to be seen. The related technologies still need to be explored and evolved, then tested in practice, and in the end time will give us the answer.


 

 
