Entering 2022, how should we think about service meshes?

Author: Luo Guangming, ByteDance Infrastructure Service Framework Team Architect

Background

Readers already familiar with service mesh and Istio concepts can skip this section and jump to the next one.
The term Service Mesh was coined by Buoyant, the company behind Linkerd, and was used publicly for the first time on September 29, 2016; it was later translated into Chinese as "服务网格" and gradually spread in China. William Morgan, CEO of Buoyant, defines the concept as follows:
A service mesh is a dedicated infrastructure layer for handling service-to-service communication. It’s responsible for the reliable delivery of requests through the complex topology of services that comprise a modern, cloud native application. In practice, the service mesh is typically implemented as an array of lightweight network proxies that are deployed alongside application code, without the application needing to be aware.
Istio is an open-source service mesh implementation that attracted enormous attention from the moment it launched and quickly became a focus for major vendors and developers alike. The official Istio documentation defines it like this:
Istio is a fully open-source service mesh that layers transparently onto existing distributed applications. It is also a platform, with APIs that let it integrate with any logging, telemetry, or policy system. Istio's diverse feature set lets you run a distributed microservice architecture successfully and efficiently, and provides a uniform way to secure, connect, and monitor microservices.
From the official definition we can see that Istio provides a complete solution for managing and monitoring microservice applications in a unified way. It can manage traffic, enforce access policies, and collect telemetry, all transparently to the application and with almost no changes to business code. With Istio you hardly need another microservice framework, and you do not need to implement service governance yourself: delegate the network layer to Istio and it handles this whole series of functions for you. Simply put, Istio is a service mesh that provides service governance capabilities.

Is Istio the standard for service meshes?

With the background covered, we can get to the point.
Will Istio become the de facto standard in the service mesh space? I think I can give an answer today: no.
Istio released its first version, v0.1, in May 2017, and has drawn large-scale attention since the release of v1.0. The v1.5 release made a major adjustment to the control-plane architecture, and iteration has continued since, with a new release roughly every three months; as of this writing, v1.12 is out. Istio supports rich traffic-governance policies, integrates well with observability tooling, actively embraces change, keeps moving toward architectural simplicity and better usability, and has added support for virtual machines.
Istio is an excellent piece of open-source software, with a very active community, a strong ecosystem, and a fairly good architectural design; there is no doubt about any of that. But it has one flaw, and it is a fatal one: Istio was not open-sourced out of a large-scale production deployment inside an enterprise. It has been open source from birth, and it is an idealistic kind of open-source software (for example, it insists on non-invasive traffic hijacking).
There is no silver bullet in software architecture; there are only trade-offs. From the beginning, Istio's design prioritized generality and feature completeness, covering traffic management, security, observability, and more, and these features have only grown with the project. The cost is performance: enormous CPU and memory consumption in scenarios with massive numbers of instances.

Traffic hijacking problem

We can see that Istio can be rolled out and promoted step by step in companies or business teams with relatively small instance counts, but once the scale grows, the problems surface. Leaving aside the performance problems caused by the early Mixer component (it has been abandoned, after all), the iptables traffic-hijacking mechanism is, to some extent, a stumbling block for large companies. Istio currently uses iptables for transparent hijacking, which has three main problems:
  • It relies on the conntrack module for connection tracking. With a huge number of connections this causes significant overhead and can fill the conntrack table; to avoid this, some in the industry disable conntrack outright.
  • iptables is a shared, global mechanism. Its rules take effect system-wide, conflicting modifications cannot be explicitly prevented, and controllability is poor.
  • In essence, iptables redirects traffic through the loopback device to exchange data, so outbound traffic traverses the protocol stack twice, which costs forwarding performance under high concurrency.
Judging from public material such as conference talks and blog posts, the companies that have deployed service meshes at very large scale, such as Ant Group, Baidu, and ByteDance, have essentially all abandoned iptables hijacking in favor of a convention-based traffic takeover mechanism, which improves the latency and performance of both hijacking and inter-service communication at scale. Such schemes are usually combined with the company's internal naming service and service registration/discovery mechanism, rather than blindly pursuing a zero-intrusion design.
Taking ByteDance as an example, the service framework and the Mesh Proxy agree on ports through which services join the mesh governance system:
  • Inbound traffic: the Mesh Proxy listens on MESH_INGRESS_PORT to take over inbound traffic.
  • Outbound traffic: instead of the business process calling the registry API for service discovery, it sends requests directly to MESH_EGRESS_PORT on localhost and names the target service in a header (see the sketch below). HTTP/1.1, HTTP/2, and gRPC are supported. This requires support from the various frameworks; the service frameworks inside ByteDance already support it.
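A minimal Go sketch of this egress convention. The port number and header name here are hypothetical; the real contract between framework and Mesh Proxy is internal to ByteDance and not public:

```go
package main

import (
	"fmt"
	"io"
	"log"
	"net/http"
)

// meshEgressPort stands in for MESH_EGRESS_PORT; the actual value is part of
// the private contract between the service framework and the Mesh Proxy.
const meshEgressPort = 2048

func main() {
	// All outbound calls go to the sidecar on localhost instead of a
	// registry-resolved remote address.
	url := fmt.Sprintf("http://127.0.0.1:%d/api/v1/items", meshEgressPort)
	req, err := http.NewRequest(http.MethodGet, url, nil)
	if err != nil {
		log.Fatal(err)
	}
	// The caller names the logical destination in a header (header name is
	// hypothetical) and lets the proxy do service discovery and routing.
	req.Header.Set("x-mesh-destination-service", "item.store.service")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()
	body, _ := io.ReadAll(resp.Body)
	log.Printf("status=%s, %d bytes", resp.Status, len(body))
}
```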
In addition, the community has explored using eBPF for traffic hijacking. eBPF (extended Berkeley Packet Filter) is a technology for running user-written programs inside the Linux kernel without modifying kernel code or loading kernel modules; it is now widely used in networking, security, monitoring, and other fields. The earliest and most influential eBPF-based project in the Kubernetes community is Cilium, which replaces iptables with eBPF to optimize Kubernetes Service performance.
For inbound traffic, the iptables scheme requires conntrack processing for every packet, while the eBPF program runs only once, when the application calls bind, and never again, cutting the overhead. For outbound traffic over TCP or connected UDP, the iptables scheme again puts every packet through conntrack, while the eBPF scheme pays a one-time cost when the socket is established.
Overall, replacing iptables with eBPF reduces request latency and resource overhead to a degree, but it is constrained by the kernel version, so large-scale rollout is difficult in the short term.

Configuration delivery problem

Now let's look at xDS, the communication protocol between the data plane (Envoy) and the control plane (istiod). xDS is really a family of protocols: LDS for listeners, CDS for services (clusters) and their versions, EDS for the instances behind each service version and their attributes, and RDS for routing. xDS can be loosely understood as the sum of all service-discovery data and governance rules inside the mesh, and its volume grows with the size of the mesh. A sketch of how a client subscribes to these resources follows.
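A minimal sketch of an xDS subscription over the Aggregated Discovery Service (ADS) stream, using the go-control-plane types; the istiod address, port, and node ID below are illustrative:

```go
package main

import (
	"context"
	"log"

	corev3 "github.com/envoyproxy/go-control-plane/envoy/config/core/v3"
	discoveryv3 "github.com/envoyproxy/go-control-plane/envoy/service/discovery/v3"
	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

func main() {
	// 15010 is istiod's plaintext xDS port; a real sidecar gets this address
	// from its bootstrap configuration.
	conn, err := grpc.Dial("istiod.istio-system.svc:15010",
		grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	// All xDS resource types (LDS/CDS/EDS/RDS) are multiplexed over a single
	// Aggregated Discovery Service stream.
	stream, err := discoveryv3.NewAggregatedDiscoveryServiceClient(conn).
		StreamAggregatedResources(context.Background())
	if err != nil {
		log.Fatal(err)
	}

	// Subscribe to listeners (LDS); the other protocols use the same request
	// shape with their own type URLs.
	if err := stream.Send(&discoveryv3.DiscoveryRequest{
		Node:    &corev3.Node{Id: "sidecar~10.0.0.1~demo.default~default.svc.cluster.local"},
		TypeUrl: "type.googleapis.com/envoy.config.listener.v3.Listener",
	}); err != nil {
		log.Fatal(err)
	}

	resp, err := stream.Recv()
	if err != nil {
		log.Fatal(err)
	}
	log.Printf("received %d listener resources (version %s)",
		len(resp.Resources), resp.VersionInfo)
}
```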
Istio delivers xDS with a full-push strategy: every sidecar in the mesh holds in memory the service-discovery data of the entire mesh. The rationale is that it is hard for users to enumerate the dependencies between services and hand them to Istio. Under this model, each sidecar's memory grows with the size of the mesh. A performance test by a community team showed that once a mesh exceeds 10,000 instances, a single Envoy's memory exceeds 250 MB, and the mesh-wide cost multiplies that by the instance count: roughly 10,000 × 250 MB ≈ 2,500 GB, i.e. 2.5 TB. A staggering amount!
Of course, the community offers some mitigations. For example, the Sidecar CRD lets you declare the dependencies between services explicitly, but it requires users to configure and maintain every call relationship by hand before services can see each other, which is also limiting at large scale; a minimal example follows. Some community members have open-sourced automated scoping solutions, but these bring other problems to varying degrees: single points of failure, extra system and operational complexity, pressure at peak load, and so on.
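A minimal sketch of the Sidecar resource built with Istio's client-go types; the namespace and host list are illustrative. Restricting egress hosts to the workload's own namespace plus istio-system means istiod pushes only that slice of xDS to the proxy:

```go
package main

import (
	"fmt"

	networkingv1beta1 "istio.io/api/networking/v1beta1"
	clientv1beta1 "istio.io/client-go/pkg/apis/networking/v1beta1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	// Limit the "order" namespace's sidecars to seeing services in their own
	// namespace ("./*") and in istio-system.
	sc := &clientv1beta1.Sidecar{
		ObjectMeta: metav1.ObjectMeta{Name: "default", Namespace: "order"},
		Spec: networkingv1beta1.Sidecar{
			Egress: []*networkingv1beta1.IstioEgressListener{
				{Hosts: []string{"./*", "istio-system/*"}},
			},
		},
	}
	// In practice this object would be created through the Istio clientset or
	// applied as the equivalent YAML; here we just show its shape.
	fmt.Printf("Sidecar %s/%s limits egress to %v\n",
		sc.Namespace, sc.Name, sc.Spec.Egress[0].Hosts)
}
```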

Istio's competitors and challengers

Linkerd

Domestic service mesh practitioners are most familiar with Istio, and this has a lot to do with the early service mesh evangelists, who favored Istio and did a great deal of advocacy for it: conference talks, localization of the official English documentation, translation of English blog posts, and the founding of Chinese-language communities (such as ServiceMesher) all played a crucial role in Istio's widespread popularity in China. Linkerd, despite being the first service mesh open-source project on the market, never became as popular in China; the evangelists did not pick it. In fact, according to CNCF's 2020 global survey, 69% of respondents were evaluating Istio and 64% were evaluating Linkerd; seen internationally, the gap between the two is not that large.
Linkerd is an open-source, ultra-lightweight service mesh designed by Buoyant for Kubernetes. Its data plane was completely rewritten in Rust to make it extremely light and fast, and it provides runtime debugging, observability, reliability, and security without requiring any code changes in distributed applications.
From a user-experience standpoint there is not much difference between the two. One may support a feature or integrate with an external component that the other does not, but such gaps are secondary: anything that most users need will be addressed in time. In other words, given enough time both projects, Istio and Linkerd alike, will support more and more.
Many in the community avoid arguing over which of Istio and Linkerd is better, because without the technical background and the deployment scenario, no general answer exists. Purely on the technical level, Linkerd's performance stands out: according to third-party benchmarks from 2019, it is 3-5x faster than Istio and uses nearly one-fifth of the memory, with comparable CPU consumption. In routing and service resilience, Linkerd does not chase Istio's completeness and offers far fewer fancy fine-grained features. Linkerd can be summed up in two words: lightweight and minimalist. Seen from this angle, Linkerd looks better suited to production deployment at large scale.
Linkerd's official website states its design principles:
  1. Keep it simple
  2. Minimize resource requirements
  3. Just work
Linkerd is a CNCF graduated project, with adopters such as Expedia, HP, and Cisco Webex. Although its user list is less glamorous than Istio's, the number of production adopters is considerable, and its influence is solid.
Looking back, however, the iptables hijacking mechanism discussed above and the full push of service-discovery data and governance rules exist in Linkerd as well; they remain unresolved, and no better solution has been proposed. Architecturally there is no essential difference between Linkerd and Istio. So in terms of performance and CPU & memory consumption, neither Istio nor Linkerd is an optimal solution, and in large-scale deployment scenarios Linkerd is no cure-all either.

Other projects or ideas

Both Istio and Linkerd have built real community influence and, to some extent, community standards. Generally speaking, the two projects are long-lived and reliable; startups and small service teams can adopt either with confidence. For that reason the author does not recommend other open-source projects here. Not recommending them does not mean they are not good enough; they surely have their own deployment scenarios, but the author has not taken the time to study them and so cannot comment.
Here is a brief look at how some large companies approach service mesh adoption. Big Internet vendors currently follow one of two approaches: the first is to take Istio, modify it heavily, and wrap a layer of productization on top; the second is to start from their own requirements and design the architecture and product from scratch.
Consider the first approach. A company doing technology selection decides to embrace open source, brings in Istio, tests it, finds problems, and forks a branch into the internal codebase for modification, deployment, and launch. As time goes by, the internal fork drifts further and further from upstream: rebasing onto upstream code becomes impossible, and porting internal features onto each new release becomes very expensive. In the end the fork and open-source Istio are effectively two different projects. These Istio modifications mainly target the two problems discussed above:
  • Find an alternative to iptables traffic hijacking. Usually this means integrating with the internal naming service, or with the service framework/SDK, and adopting a convention-based hijacking scheme: outbound business traffic is sent to an agreed port and forwarded to the data-plane proxy for service discovery and routing, while inbound hijacking is usually tied to the service-registration mechanism and varies from company to company.
  • Solve the full push of service discovery and governance configuration. Some companies use the community's Sidecar CRD directly, exposing in a product UI the upstream and downstream links reachable from each service, with one link mapping to one Sidecar egress entry; this achieves precise delivery of discovery and governance configuration and reduces the data-plane Envoy's CPU & memory consumption. Others explore automated solutions, adding a layer of gateways and a service-dependency analyzer to achieve non-invasive, on-demand loading of xDS.
The second approach abandons Istio's architectural ideas entirely and designs a new architecture from scratch around actual business needs and large-scale production availability. To achieve on-demand configuration loading without adding system complexity, the xDS push protocol can be dropped altogether in favor of a brand-new protocol with multi-protocol configuration support, handling service discovery and routing under massive instance counts and massive volumes of governance configuration.
ByteDance took the second approach. In 2018 the ByteDance infrastructure team developed a new service mesh architecture. Besides solving the performance problems of traffic hijacking and full configuration push, the architecture makes trade-offs driven by the actual needs of online services. For example, the ByteDance solution tends to move part of the computation and complexity from the data plane to the control plane: complex traffic-management logic is computed at central nodes such as the control plane, so that the data plane receives relatively precise configuration, which cuts unnecessary CPU and memory consumption on the distributed data plane as well as the performance problems caused by frequent, large-scale computation. The ByteDance solution also does not pursue complete non-invasiveness: at a certain level the mesh is tightly integrated with the service framework, and while governance capabilities sink to the lower layer, the two kinds of microservice middleware each play their part, jointly delivering communication, governance, security, observability, chaos engineering, and other capabilities for the microservice system. After about three years of development, ByteDance's service mesh now serves 30,000+ online services and 3,000,000+ online instances.

Proxy vs Proxyless

Recently the community has been discussing a new architecture. It started with gRPC implementing the xDS protocol: since version 1.11, Istio supports adding gRPC services to the mesh directly, with no Envoy proxy injected into the Pod. This architecture is called Proxyless Service Mesh, a service mesh without sidecar proxies, in contrast to the Proxy Service Mesh (the Istio + Envoy architecture).
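A minimal sketch of a proxyless gRPC client in Go, assuming a plaintext setup and an illustrative target name. The client pulls routing and endpoint configuration from the control plane itself, via the bootstrap file named by the GRPC_XDS_BOOTSTRAP environment variable:

```go
package main

import (
	"context"
	"log"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	_ "google.golang.org/grpc/xds" // registers the xds:/// resolver and balancer
)

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	// The xds scheme makes gRPC itself speak xDS to the control plane: load
	// balancing and routing happen in-process, with no sidecar involved.
	conn, err := grpc.DialContext(ctx,
		"xds:///greeter.default.svc.cluster.local:50051",
		grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()
	// ...create service stubs on conn and issue RPCs as usual.
	log.Println("connected through proxyless xDS")
}
```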
Coming from two such high-profile open-source projects, this move startled onlookers and was read as a weather vane. Some asked: are we going back to the era of bare service frameworks? Is the Proxy Service Mesh model about to be abandoned?
The author believes we should view it rationally. Looking at the broader environment, we should fully understand the diversity of requirements and accommodate it, but the appearance of one requirement and one implementation does not make it the mainstream. gRPC's Proxyless Service Mesh does, to a degree, fix defects of the traditional service mesh architecture: there is no out-of-process traffic hijacking to worry about, and service discovery and routing are handled on the framework side. But it also carries the shortcomings of the traditional microservice architecture, including coupling, version fragmentation, and multi-language consistency problems.
In essence, Proxyless Service Mesh solves the standardization of service governance: both modes can connect to the same control plane and the same governance system. It remains quite valuable for scenarios such as cloud-native migration or gateways, and it also suits organizations with a unified internal language and strong framework version management. But it is not the mainstream: it cannot solve the common problems of microservices, and it does not match the direction in which microservice architecture is evolving.

Trends and Thoughts

From William Morgan's 2016 definition of the service mesh to the release of Istio 1.12 at the end of 2021, a great deal has changed. The early consensus positioned the service mesh as the infrastructure layer for inter-service communication. But as the sidecar model has matured and spread, the mesh's positioning and mechanics have shifted too, gradually generalizing to more layers and domains.

Multiple runtimes

At the beginning of 2020, Bilgin Ibryam (author of Kubernetes Patterns, product manager at Red Hat) proposed a new idea for microservice architecture, Multiple Runtime, and summarized the requirements of distributed applications into four categories: lifecycle, networking, state, and binding. It is likely that in the future we will use multiple runtimes to implement distributed systems. Multiple runtimes do not mean multiple microservices; rather, each microservice will consist of multiple runtimes, most likely two: a custom business-logic runtime and a distributed-primitives runtime.
The essence of the service mesh is an abstraction layer for inter-service communication; the essence of multi-runtime is an abstraction over all kinds of distributed capabilities, including but not limited to inter-service communication. From this point of view, multi-runtime covers a superset of the service mesh: the mesh addresses only part of an application's needs (inter-service communication), leaving many more distributed capabilities and primitives to be covered.
One view holds that as cloud native advances, the sinking of distributed capabilities (traditionally embodied in middleware) into the infrastructure is the general trend, and the scope of mesh-ification will inevitably keep expanding, which means the multi-runtime form is getting closer and closer.


More Mesh Forms - Generic Sidecar

Beyond the service mesh, the sidecar model can be extended to other middleware domains, yielding a message mesh, a database mesh, and other mesh forms. Public articles show that Ant Group has already experimented with and landed both a DB Mesh and a message Mesh.
Looking back, the service mesh arose because the technical community hoped to use it to solve problems such as multi-language support, operations, and iteration speed. Can the same technology be applied to more scenarios? Undoubtedly: any middleware sidecar belongs to this space, for example databases, messaging, API gateways, risk-control components, login components, and so on.
Since the service mesh can distribute the traffic-forwarding proxy sidecar to any container in production and hot-upgrade it safely, the same approach can serve middleware sidecars. The impact on the business side is similar, and the same questions of performance, observability, and stability must be answered.
Based on these requirements, ByteDance has built a standard, general-purpose sidecar framework, which has already been deployed at scale for bypass sidecars such as the API gateway, login components, and risk-control components. We will share more about it in the future.

Summary

Entering 2022, Proxy Service Mesh remains the mainstream form of service mesh, and the community ecosystem keeps improving, but Istio faces ever more challengers. Compared with underlying PaaS capabilities, the microservice domain is the architectural layer facing users and closest to the business; its scenarios are complex and its requirements changeable, so a so-called "de facto standard" is unlikely to emerge. We expect the service mesh space to remain one where a hundred flowers bloom and many schools contend, continuing to bring greater value to users, to free up productivity, and to help the whole industry fully realize cloud native.

References

  1. https://www.servicemesher.com/istio-handbook/concepts/basic.html
  2. https://www.servicemesher.com/istio-handbook/concepts/istio.html
  3. https://mosn.io/docs/concept/traffic-hijack/#%E9%80%8F%E6%98%8E%E5%8A%AB%E6%8C%81
  4. https://mp.weixin.qq.com/s/LbeQAeADllUbrxaeTTbvyg
  5. https://cloudnative.to/blog/service-mesh-comparison-istio-vs-linkerd/
  6. https://cloudnative.to/blog/grpc-proxyless-service-mesh/
  7. https://skyao.io/talk/202103-dapr-from-servicemesh-to-cloudnative/