Service Mesh: Crime and Punishment

0. Introduction

There are already many articles introducing the concept, architecture, methodology, and standardization of Service Mesh. But for Service Mesh to land in production in a truly effective and reliable way, we face a number of difficult choices that are rarely discussed. This article approaches the topic from that perspective, combining some experience and the pits I have stepped into in production environments, to explore how a system can best evolve toward a Service Mesh architecture. It is intended for readers who already have some understanding of Service Mesh, or who want to explore the area further.

1. The evolution of service governance at a glance

Before talking about Service Mesh, let's take a look at the history of service governance. We all know the path: monolithic applications, layered and tiered services, vertical splitting, and then the rise of the microservice idea, with methodologies such as the "two-week rewrite" rule and the "two-pizza team" filling our eyes and ears. These are common talking points, but they are not the focus of this chapter. What this chapter cares about is a core problem that the evolution of service governance inevitably has to face: after services are split, how do they connect with one another? In other words:

How do service nodes form a complex service network?

Due to space limitations, we briefly review several important schools of thought around this question.

1.1 Server Proxy


I call this the centralized-proxy stage. It is the simplest solution: a centralized, single-point service cluster takes on most of the service-governance functions. Centrally deployed LVS, Nginx, HAProxy, Tengine, Mycat, Atlas, Codis, and so on are the most common examples. The advantage is that functions are easy to maintain in one place, it is language independent, and the entry cost is low, so it actually has a large audience among small and medium companies, especially startups. But the problems it brings are just as obvious. Take the most popular combination, domain name + HTTP REST + Nginx: a call has to pass through DNS (and DNS caches), Keepalived (LVS), and Nginx, layer after layer. The long call chain and the multiple single points lead to stability and performance problems as the scale of services grows. Meituan and Dingding Zufang both ran into this kind of multi-single-point failure: Dingding Zufang suffered wide-impact outages caused by Nginx, DNS, and Codis-Proxy failures, and Meituan hit similar problems with DNS, MGW, and others, which pushed it to launch a project converting internal HTTP calls into direct Thrift connections. Overall, the approach still has its uses: it suits startups, or scenarios that put special emphasis on crossing languages, and the Server Proxy can serve as a quick way to get started. But while using it, we need a clear view of its defects in order to make the right decisions in the subsequent evolution.

1.2 Smart Client


The simple and straightforward Server Proxy brings us convenience, but it also brings plenty of problems, and to cope with them the Smart Client rose rapidly and burst out with strong vitality. Let's call this the framework stage. To eliminate the long, single-point call chain of the Server Proxy, this stage instead builds direct connections between every pair of nodes. That solves the single-point and long-chain problem in one step, and on top of direct connection it allows further performance optimization and stability improvements. Representative open-source products (service frameworks or RPC frameworks) include Uber's Ringpop, Alibaba's Dubbo, Ant's Sofa-Rpc, Dangdang's Dubbox, Weibo's Motan, Dianping's Pigeon / Zebra, Baidu's brpc / Stargate, gRPC, Thrift, and so on. The approach boils down to two big ideas:

1. If hard doesn't work, go soft: when hardware-based service governance doesn't cut it, move the governance into software.

2. If far doesn't work, go near: when centralized remote deployment doesn't cut it, pull the governance into the application process itself.

As a result, the many ills of the Server Proxy approach were resolved one by one.
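To make the "pull governance into the process" idea concrete, here is a minimal, hedged Go sketch of a smart client: the application keeps its own in-process view of service instances and load-balances across them before dialing the chosen instance directly. The `Registry` type, the service name, and the addresses are assumptions for illustration only; real frameworks such as Dubbo or Motan layer serialization, retries, and circuit breaking on top of this same shape.

```go
package main

import (
	"fmt"
	"math/rand"
	"net/http"
	"sync"
)

// Registry is a hypothetical in-process view of service instances,
// kept fresh by a subscription to a registry (ZooKeeper, etcd, ...).
type Registry struct {
	mu        sync.RWMutex
	instances map[string][]string // service name -> host:port list
}

func (r *Registry) Update(service string, addrs []string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.instances[service] = addrs
}

// Pick does client-side load balancing: choose one instance at random.
func (r *Registry) Pick(service string) (string, error) {
	r.mu.RLock()
	defer r.mu.RUnlock()
	addrs := r.instances[service]
	if len(addrs) == 0 {
		return "", fmt.Errorf("no instance for %s", service)
	}
	return addrs[rand.Intn(len(addrs))], nil
}

func main() {
	reg := &Registry{instances: map[string][]string{}}
	reg.Update("order-service", []string{"10.0.0.1:8080", "10.0.0.2:8080"})

	// The application calls the chosen instance directly: no central proxy hop.
	addr, err := reg.Pick("order-service")
	if err != nil {
		panic(err)
	}
	resp, err := http.Get("http://" + addr + "/orders/42")
	if err == nil {
		defer resp.Body.Close()
	}
	fmt.Println("called", addr)
}
```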

Having sung the praises of the Smart Client for so long, is it problem-free? Inevitably not, and its problems are also obvious and prominent.

  1. First, we know that the temperament of the microservice world leans toward team autonomy and polyglot freedom. The Smart Client, also called a fat client, severely limits our choice of technology stack: for each language you have to re-implement the full set of basic governance capabilities, and then carry the maintenance and upgrade costs of several language-specific versions, which is an unbearable burden for the vast majority of companies. Uber had to ship Node and Go versions of Ringpop, Weibo's Motan was released in several languages, and gRPC and Thrift could not escape this fate either. Since most companies today mix two or more languages in day-to-day development, this problem has to be addressed.

  2. Second, the Smart Client SDK is a heavy component embedded in the application process, which completely mixes the operation of business applications with the operation of service governance. In real work scenarios this raises the difficulty of operating the governance layer by several orders of magnitude: the governance team is forced to face many coexisting versions of the SDK in production, and rolling out a single new version can become a nightmare that takes more than half a year.

Of course, despite these problems, and because there is no particularly mature alternative, the Smart Client still occupies the mainstream position in highly concurrent, high-volume business scenarios. It remains usable where the set of languages can be converged: for example, use the Java version of the Smart Client, and for low-traffic Node.js services fall back to the compromise of HTTP REST plus a domain name for service governance.

1.3 Local Proxy

Both the Smart Client and the Server Proxy have problems that cannot be avoided, so is there another solution? In response, the Local Proxy came into being: since centralized deployment has single-point problems and the rich client has coupling problems, why not take a compromise between the two? The idea becomes:

Govern nearby, at the process level. Use a local agent next to the application, which avoids both the single point of centralized deployment and the problems of language and application coupling.
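To make the "process-level local agent" idea concrete, here is a hedged Go sketch of the smallest possible local proxy: it listens on localhost and blindly forwards bytes to one upstream. The listening port and upstream address are placeholders; real local proxies (HAProxy in SmartStack, Envoy in a mesh) wrap discovery, health checking, and routing around this core.

```go
package main

import (
	"io"
	"log"
	"net"
)

// upstream is a placeholder; a real local proxy would resolve it
// dynamically from a registry or a control plane.
const upstream = "10.0.0.1:8080"

func main() {
	// The application talks to 127.0.0.1:15001 instead of the remote service.
	ln, err := net.Listen("tcp", "127.0.0.1:15001")
	if err != nil {
		log.Fatal(err)
	}
	for {
		conn, err := ln.Accept()
		if err != nil {
			continue
		}
		go func(in net.Conn) {
			defer in.Close()
			out, err := net.Dial("tcp", upstream)
			if err != nil {
				return
			}
			defer out.Close()
			// Shovel bytes in both directions; governance hooks would live here.
			go io.Copy(out, in)
			io.Copy(in, out)
		}(conn)
	}
}
```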

This idea gradually became popular, and many governance solutions have sprung up around it.

Airbnb's SmartStack uses a family of four processes to cover the core of service governance; it is a simple solution.


Ctrip's OSP handles things in a similar way; the main difference from Airbnb is that it merges the functions of Synapse and HAProxy into a single proxy.


In the cloud world, where cross-language support and operational efficiency are given even more weight, this idea holds a dominant position. The Mesos + Marathon cloud architecture has a similar scheme: HAProxy does the routing on each node, and control information refreshes the corresponding routes.


Google's own child, Kubernetes, also took the proxy's performance problem into account and made a compromise, injecting iptables rules to do the forwarding (of course, this approach has unavoidable problems of its own).


These approaches each have their own problems, but the biggest one is:

How to deal with the performance degradation they introduce. Whether governance and forwarding are done through a proxy process or through iptables, communication cannot get around one question: under high traffic and high concurrency, how much performance is lost? Compared with a rich client that already connects to the peer directly, how large is the gap? Some known products show no small loss in both QPS and RT under high traffic; some solutions lose as much as 20% of their performance, which is clearly unacceptable in many scenarios.

At this point the last killer weapon of the Local Proxy school, Service Mesh, was formally pushed to the forefront, and 2018 has even been called the first year of Service Mesh. I would summarize its ideas as follows:



1. Sacrifice some performance and resources in exchange for a higher overall degree of autonomy in service governance and operations;

2. Separate execution from control: split the data plane and the control plane;

3. Virtualize and productize, standardize, and define specifications.

Service Mesh liberates the Local Proxy idea from its cluttered, sprouting state and proposes a more systematic way of thinking. This article will not spend more space describing Service Mesh concepts; there are already plenty of similar articles online. With the precedent of Istio, a collaboration of several giants (originally intended to help applications land better on the cloud), other companies have built their own solutions based on, or with reference to, Istio:

  1. Alibaba rewrote Envoy in Go and built Sofa-Mosn / Pilot on that basis;
  2. The rate-limiting capability of the widely criticized Istio Mixer was sunk into the sidecar;
  3. Tencent adapted Envoy to integrate it with its internal service framework TSF;
  4. Weibo developed Motan-Mesh on top of Motan-Go and integrated it with its own service-governance system;
  5. Huawei's ServiceComb takes a similar approach, sinking Mixer completely;
  6. Conduit (from Buoyant, a company founded by ex-Twitter engineers) is written in Rust and likewise sinks the Mixer functionality;
  7. ……

However, Service Mesh still has several unresolved issues. Besides the performance problem that still cannot be worked around, more and more people building meshes are reflecting on one question:

What is the criterion for splitting the control plane and the data plane? Is it too idealistic?

That is a matter of opinion and will not be expanded here. Although Service Mesh is still in its infancy and many problems are still being felt out, judging from the development trend of microservices, the three ideas behind Service Mesh above coincide with it; it is the inevitable future trend.

1.4 Summary

This chapter reviewed the development of service governance through three major schools of thought. Going back to first principles, we can see that each school has scenarios where it fits and scenarios where it does not: there is no best solution, only the most appropriate one. Following the logic of the three stages, we can also see that service governance itself moves forward through a process of trial and error, entanglement, and spiral ascent.

2. Have you considered the loss in resources?

The mesh essentially lives as a parasite on the business machine and uses the business machine's resources. In practice our tests show that for a mesh implemented in C++ or Go, memory consumption is fairly controllable: it takes only a few MB by default and generally rises to a few tens of MB under high concurrency. On a typical application machine with 8G/16G of memory this is basically negligible, so the extra memory footprint can largely be ignored. CPU is different: the mesh's CPU consumption generally approaches the business's own normal usage. That means that after the mesh is added, the business may only have half of the CPU it had before. This is the bigger problem.

On this issue, the mainstream view in the industry is that normal business machines run at less than 10% resource utilization, so this extra consumption has no substantive impact on the business in practice; on the contrary, it lets us make better use of idle resources and avoid waste, a mutual benefit for the business and the mesh.

This logic will certainly hold for a long time to come. However, I think that building on it raises two new issues:

  1. Resources will not stay idle forever. As already noted when it comes to cost allocation, more and more business teams pay increasing attention to their resource usage, and the cloud-native trend also aims to raise the utilization of machines. Under this pressure, once the resource-utilization problem is one day properly solved, the CPU taken by the mesh will stand out; how do we solve it then? Binding dedicated CPU cores to the mesh, or splitting it into a separate pod so that the business and the mesh each get their own resource quota, inevitably brings no small amount of waste as well.
  2. Beyond average resource usage, there is the problem of business peaks. Every business has peaks and troughs: food delivery peaks at mealtimes, hotels peak on every holiday, movie tickets peak around Spring Festival. Where there are peaks, there must be redundant resources. So utilization may look low most of the time, but at the real peak the business can drive CPU close to full. With the mesh in the picture, the business side's direct experience at that moment is: processing capacity at peak drops by half. I believe that after hearing this, the business side will say the conclusion is unacceptable. How do we handle this, other than doubling the business machines? Is there another solution?

This looks like a proposition without a solution, because that is simply what the Service Mesh architecture is like: the extra resources will not appear out of thin air. But let me ask: can we not break out of the Service Mesh architecture, or rather, optimize it?

Recall the three important schools of thought in the development of service governance that we reviewed earlier:

Server Proxy

Smart Client

Local Proxy

Service Mesh belongs to the Local Proxy school and can solve strong coupling with the business, strong language dependence, single points, and so on. But are the other schools good for nothing? Obviously not; they still have strong vitality and value. Our solution is to use the Server Proxy as a fallback plan: when idle resources are not enough, a logical Central Mesh takes over to solve the problems above:

  1. The Sidecar probes the idle resources of its host.
  2. When it finds that idle resources are about to run out, it tells the SDK to switch traffic to the Central Mesh.
  3. The Central Mesh does all the work the Sidecar would have done.

The Central Mesh is loaded with all the information needed in its area and can take on all the capabilities of the Sidecars. In other words, the Central Mesh also acts as a backup for the Sidecars in its area: when a Sidecar fails to operate normally, or its host's idle resources are insufficient, traffic is automatically switched away from the Sidecar.
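A hedged sketch of the switching logic described above: the Sidecar periodically probes how much CPU headroom the host has and, below a threshold, flips the route that the SDK consults from the local Sidecar to the (logical) Central Mesh. The `cpuIdlePercent` probe and the address constants are assumptions for illustration; a real implementation would read /proc or cgroup statistics and push the decision to the SDK over its control channel.

```go
package main

import (
	"fmt"
	"sync/atomic"
	"time"
)

const (
	sidecarAddr     = "127.0.0.1:15001"                    // local Sidecar (assumed port)
	centralMeshAddr = "central-mesh.zone-a.internal:15001" // hypothetical nearby Central Mesh
	idleThreshold   = 20.0                                 // switch away from the Sidecar below 20% idle CPU
)

// route holds the address the SDK should currently send traffic to.
var route atomic.Value

// cpuIdlePercent is a stand-in; a real probe would read /proc/stat or cgroup stats.
func cpuIdlePercent() float64 { return 15.0 }

func probeLoop() {
	for {
		if cpuIdlePercent() < idleThreshold {
			route.Store(centralMeshAddr) // host is busy: hand work to the Central Mesh
		} else {
			route.Store(sidecarAddr) // enough headroom: keep traffic local
		}
		time.Sleep(5 * time.Second)
	}
}

func main() {
	route.Store(sidecarAddr)
	go probeLoop()
	time.Sleep(6 * time.Second)
	fmt.Println("SDK should currently route via:", route.Load())
}
```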

The Central Mesh is called "logical" because it does not have to sit in one central cluster; it can be deployed dispersed and close by, to minimize the network latency and the extra single-point risk. For example, it can be deployed per machine room, per region, or even as a gateway on the nearest hosts.


3. Have you considered the loss in performance?

The loss in performance is a problem that cannot be avoided. Since there is one extra hop for forwarding and service governance, performance is naturally worse than a direct RPC connection. Based on our actual measurements, compared with direct connection, the mesh degrades performance by roughly 20-50%, and this test did not even use iptables, which would lose more. Of course, the added latency is on the order of milliseconds, which for most business requirements is actually acceptable and has minimal impact on the business.

However, we still need to consider the following potential problems:

  1. Problems for business applications. Some highly concurrent business scenarios already have very low latency (single-digit milliseconds) and are latency sensitive, and a single call chain may contain seven, eight, or even more than ten RPC calls. Putting such services through the mesh may cause serious performance degradation, and may even lead to timeouts, exhausted thread pools, and similar issues.
  2. Problems for infrastructure traffic. If the future of Service Mesh is for all traffic to pass through the mesh, not just business-application traffic, then we also have to consider storage traffic that is extremely latency sensitive, such as Redis; our tolerance for the extra latency the mesh brings will shrink further there. Redis itself is an ultra-high-concurrency, extremely low-latency, highly latency-sensitive scenario, and even one more millisecond of delay can cause an exponential drop in Redis availability or even business failures.

So for the problems above, on the one hand we need realistic expectations of the performance degradation; on the other hand we should push the performance of Service Mesh to its limits and do everything we can, rather than give up on performance just because we chose the mesh. Think of Netty's famous obsession with squeezing performance, all the way down to "picking the right event loop".

For optimizing the communication performance of the mesh, there are several points to consider:

  1. Optimize local inter-process communication. Since the mesh and the business process are on the same machine, use local IPC to accelerate communication. There are many forms of local IPC, such as mmap, Unix domain sockets, pipes, and signals. Among them mmap stands out for performance: traffic-shm is a lock-free asynchronous IPC framework that can easily sustain a million TPS, and it is built on mmap. In our tests, mmap combined with a suitable event-notification mechanism improved performance by more than 30% compared with TCP in some high-concurrency scenarios. (See the Unix domain socket sketch after this list.)
  2. Threading model. High-performance infrastructure generally implements its threading model with the Reactor pattern. Combined with thread pools or coroutine pools and single or multi-level Reactors, many variants are possible: Nginx uses a master process plus multiple single-Reactor worker processes (newer versions add a thread-pool mechanism), Envoy and evio use a "single Reactor plus coroutine pool (thread pool)" style, and Netty uses multi-level, multi-Reactor plus thread pools. Design the mesh so that blocking never appears on the hot path.
  3. Byte reuse. We are used to allocating new buffers for every request to hold its data, but once concurrency rises this causes heavy allocation and reclamation pressure. Managing byte buffers with a buddy system, a slab allocator, or similar techniques therefore pays off considerably. For example, Netty allocates heap and direct memory with a buddy-style algorithm, Nginx uses a slab mechanism, and Mosn adds a multi-level, capacity-tiered caching mechanism on top of Go's sync.Pool to optimize allocation. (See the sync.Pool sketch after this list.)
  4. Memory alignment. The operating system manages memory in pages. If you transfer data by manipulating memory addresses directly (for example with mmap), then without memory alignment you will pull in memory you do not need and pay for shuffling and stitching it together, which directly degrades performance. The high-performance in-memory queue Disruptor also uses alignment (cache-line padding) as an optimization.
  5. Lock-free design. The first reaction to concurrency is usually to protect shared state with locks. Instead, consider replacing conventional locks with hardware-level CAS operations, adopting Redis-style single-threaded processing, or doing what Envoy does: it uses a pool of worker threads, but each connection is bound to a single thread to avoid concurrency issues.
  6. Pooling. Threads are expensive; you cannot simply spawn tens of thousands of them, so thread pools are the default standard. What is worth adding is coroutines: although the coroutine is glorified as a lightweight thread with very high performance, and you can spin up tens of thousands without blinking, unrestrained creation still seriously hurts actual processing performance. Moreover, because of Go's own allocation principles, some metadata associated with a goroutine is not reclaimed after use; the Go developers' philosophy is roughly "if such a flood of traffic arrived once, the system is likely to face a similar peak again, so be prepared in advance". So even with coroutines we still need to consider pooling. The Go versions of Motan-Go and Thrift currently do not take this into account, while Sofa-Mosn has done the corresponding pooling work.
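For point 1 above, here is a hedged Go sketch of local IPC over a Unix domain socket between the application and the sidecar on the same host; the socket path is an assumption. mmap-based schemes such as traffic-shm go further, but the domain socket already removes the TCP/IP stack from the local hop.

```go
package main

import (
	"bufio"
	"fmt"
	"log"
	"net"
	"os"
)

const sockPath = "/tmp/mesh.sock" // assumed path shared by the app and the sidecar

func main() {
	os.Remove(sockPath)
	ln, err := net.Listen("unix", sockPath) // "sidecar" side
	if err != nil {
		log.Fatal(err)
	}
	defer ln.Close()

	go func() {
		conn, err := ln.Accept()
		if err != nil {
			return
		}
		defer conn.Close()
		line, _ := bufio.NewReader(conn).ReadString('\n')
		conn.Write([]byte("echo: " + line)) // pretend to proxy the request
	}()

	// "application" side: same dial API as TCP, but no network stack in the path.
	conn, err := net.Dial("unix", sockPath)
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()
	fmt.Fprintln(conn, "GET /orders/42")
	reply, _ := bufio.NewReader(conn).ReadString('\n')
	fmt.Print(reply)
}
```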
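For point 3, here is a hedged sketch of byte reuse with Go's sync.Pool, the primitive that Mosn reportedly builds its multi-level buffer cache on. The 4 KB buffer size is an arbitrary assumption; the point is that buffers are recycled instead of being allocated per request.

```go
package main

import (
	"fmt"
	"sync"
)

// bufPool hands out reusable 4 KB buffers instead of allocating per request.
var bufPool = sync.Pool{
	New: func() interface{} {
		b := make([]byte, 0, 4096)
		return &b
	},
}

func handleRequest(payload []byte) int {
	bp := bufPool.Get().(*[]byte)
	buf := (*bp)[:0]              // reuse the backing array, reset the length
	buf = append(buf, payload...) // pretend to frame/encode the request here
	n := len(buf)
	*bp = buf
	bufPool.Put(bp) // return the buffer for the next request
	return n
}

func main() {
	var wg sync.WaitGroup
	for i := 0; i < 100; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			handleRequest([]byte(fmt.Sprintf("request-%d", i)))
		}(i)
	}
	wg.Wait()
	fmt.Println("done")
}
```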

There are many other performance-optimization techniques, which I will not list here.

4. How do the functions inside the Sidecar interact?

We set out to do Service Mesh with a grand ambition: to break the status quo of strong coupling with the business by sinking service-governance capabilities out of the application. But once they sink, we find that service governance itself covers a lot: dynamic configuration, flow control, circuit breaking, failure drills, load balancing, routing, communication, service registration and discovery, centralized logging, distributed tracing, monitoring and metrics, and so on. All of these are crammed headlong into one thin Sidecar. Having done that, shouldn't we also ask whether these many features, just like service governance at its beginning, will interfere with, depend on, affect, and even conflict with one another at both the organizational level and the technical level? Yes, this is the problem that extends from it.

  • For example, how do we ensure that massive but unimportant log-collection traffic does not affect core business traffic?
  • For example, how do we ensure that upgrading one function does not affect the communication of the core business?
  • For example, how do several teams jointly maintain one Sidecar?
  • ……

These are all potential problems. Of course, you can build bulkheads, you can use hot deployment, you can find a way to split the code repository. But when everything has been sunk into it and seven or eight teams have to maintain it together, can you really still solve the problems above?

What we propose is: at this point, split the Sidecar apart. According to certain rules, and taking into account the stage of development your mesh is in, break your Sidecar into pieces; perhaps once it is split, everything is solved. While this article was being written, we saw Ant Financial split a separate dbmesh out of Sofa-mosn, for example.

The caveat is not to split the Sidecar too finely, otherwise Sidecars will proliferate and the cost of upgrades and operations will soar. So you see, it is somewhat like services themselves: in the process of "split, split, split" we simplify the problem while introducing new ones. That is the beauty of the work we do, because you can always find similarities in seemingly unrelated areas.


5. Only responsible for service subscription, not for service registration?

Pilot can subscribe to services and bridge to registries through the xDS interfaces. But why does it not take on service registration as well? My guess is that this is because Service Mesh evolved from the Local Proxy against a cloud-native background. In cloud-native Local Proxy solutions, the proxy is basically not responsible for registration, because registration is handled by the cloud platform (Mesos, Marathon, K8s all have ready-made mechanisms), or a separate agent combined with Consul / etcd / ZooKeeper completes it. The Local Proxy can then stay simple and focus only on the reverse-proxy work.

For our real production environments, however, this becomes less friendly. A business that has developed to a certain stage generally already has its own service-governance framework, with its own service registration and subscription. For the sake of the mesh they are unlikely to migrate the entire publish/subscribe system onto a cloud-native one; that would rather put the cart before the horse. So we will certainly choose to adapt. In the process of adapting, in order to keep using the existing publish/subscribe, people end up deeply modifying the Sidecar, adding service-registration capability and making a third-party Sidecar talk to their registry. This is cumbersome and complex work, and it also breaks Service Mesh's original intention of using the control plane to shield differences in basic infrastructure.

So I believe Service Mesh needs to completely hide the existence of the concrete registry. Both publish and subscribe should go through Pilot, which provides a unified facade for consumers. No matter how the registry is switched afterwards, there is no need to deeply and invasively modify the Sidecar.
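One hedged way to express this "completely hide the concrete registry" point is an interface on the control-plane side: the Sidecar programs against Register/Subscribe only, and swapping Consul for etcd or ZooKeeper means adding an adapter, not modifying the Sidecar. The names below are illustrative, not any existing Pilot API.

```go
package registry

import "context"

// Instance describes one endpoint of a service.
type Instance struct {
	Service string
	Addr    string
	Meta    map[string]string
}

// Registry is the facade the Sidecar / control plane depends on.
// Concrete backends (Consul, etcd, ZooKeeper, a cloud API) implement it.
type Registry interface {
	// Register publishes this instance; the mesh, not the application, calls it.
	Register(ctx context.Context, ins Instance) error
	// Deregister removes the instance on shutdown.
	Deregister(ctx context.Context, ins Instance) error
	// Subscribe streams the current instance list of a service and its updates.
	Subscribe(ctx context.Context, service string) (<-chan []Instance, error)
}
```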

                                                                  

6. How should the control plane and the data plane be split?

This is a long-standing problem. The control plane's Mixer is now basically in a "when the wall is falling, everyone pushes" state. Istio rather stubbornly centralized rate limiting, telemetry, and the like in a single Mixer: the former brings a significant performance bottleneck (even the caching Istio later added in Envoy cannot help much), and the latter doubles the traffic. Many Service Mesh implementations therefore abandon or simplify Mixer. There are plenty of articles about this online, so I will not go into detail.

Admittedly, Istio's design is rather idealistic: it hopes to shield infrastructure differences this way, to back the Sidecar with effectively unlimited memory and capacity, and to keep complex logic out of the Sidecar as much as possible so that the Sidecar stays as stable and reliable as it can be. But reality is cruel: wherever there is communication there will be problems, and the most complex failures in a distributed environment are the ones caused by the network.

And if Mixer's responsibilities are sunk into the Sidecar, then we have to face the question of how a Sidecar carrying all that complex logic keeps itself sufficiently stable and reliable, consumes few enough resources, introduces as few dependencies as possible, and stays small but beautiful.

Although how to split the control plane and the data plane is a very difficult proposition, and Istio has drawn its share of complaints, from the pioneers' point of view Istio successfully integrated the long-evolving Local Proxy schemes and raised them to the height of a methodology, successfully leading the industry to think systematically about the data plane and the control plane. That is Istio's greatest contribution: a shift from tactics to strategy, from technique to principle.

7. Conclusion

We have analyzed, from various angles, the problems that may exist in the development of Service Mesh so far, and offered a number of solutions summarized from actual production experience, in the hope that they can be of some help. Of course, every choice is difficult and there is no standard answer; how to weigh the trade-offs against each company's actual situation is where ability and value show themselves. Although Service Mesh still has problems, it is bound to be the future direction of development, and it can bring us great room for imagination and the liberation of manpower.



Origin: juejin.im/post/5d395bf7f265da1bb9702221