Weibo Service Mesh: high-availability architecture in practice

Service Mesh is a microservice approach that has become popular over the past two years, and it has produced a large number of implementations, of which Istio is the best known.

Driven by real business needs, Weibo built and open-sourced its own Weibo Mesh, which already runs at scale in key internal businesses.

This article explains Weibo Mesh in detail in the following parts, in the hope of offering some inspiration for your own service-orientation work:

Weibo Service Challenge

New Thoughts on Weibo Service

Weibo Mesh solution introduction

Production Practice

Summary

Weibo Service Challenge

image.png

First, let me introduce the challenges Weibo faces in service orientation. Weibo's traffic pattern is unusual: besides the regular peaks at noon and in the evening, sudden hot events are far more damaging.

When a hot event hits, traffic explodes within a very short time, and such events often break out without any warning. This poses a great challenge to the service orientation and stability of Weibo.

If one link in the call chain goes down and the failure is not detected and handled in time, it can easily trigger an avalanche of failures that takes down the entire site.

So how do we solve this problem? The first thing that comes to mind is automatic scale-out and scale-in, but what should the scaling decisions be based on? How do we meet the system's observability requirements, and how do we evaluate its availability and redundancy?

In essence, building service governance for the system is critical, because it directly affects service availability. However, the diversity of Weibo's technology stacks makes both microservices and service governance difficult.

From a technical perspective, a typical Weibo service call looks roughly like the figure above. The business system calls multiple interfaces of the platform system, for example to obtain Weibo content through a platform interface.

The platform system is mainly built on the Java technology stack, for which many microservice solutions exist. Because the platform also adopted service orientation early, its microservice system is relatively complete, and it has produced some excellent open source frameworks, such as the Motan microservice framework.

Language stacks on the business side: they are diverse and cover basically all mainstream languages, with PHP systems accounting for a relatively large share of traffic.

Call link: normally, the business side and the platform interact through RESTful APIs. A request has to pass through layer-4 and layer-7 scheduling, service stability often suffers from network jitter and DNS instability, and the overhead of the intermediate layers cannot be ignored.

In addition, the wide variety of languages in the business departments indirectly leads to uneven maturity of the microservice systems across departments.

Handling peak traffic requires all business modules in all departments to cooperate, so it tests the strength of the entire site. Every module needs a high level of service governance capability, which is why we urgently need to solve the problem of cross-language microservices.

image.png

The figure above is the micro-service system support diagram inside the Weibo platform. The platform implements a micro-service governance system with the Motan framework as the core.

In addition, there are the self-developed Vintage registry, the Open DCP intelligent elastic scheduling platform, and the Graphite real-time monitoring platform. It can be seen that the platform's microservice architecture is fully supported by DevOps.

The general trend in the industry is cloud native. As an important part of cloud native, microservices are a hurdle we must get past.

New Thoughts on Weibo Service

Cross-language service governance attempts

To explain how we approached cross-language service governance, I will briefly introduce the solutions we tried.

There is an important piece of background here. Motan is an excellent framework that has been used inside Weibo for a long time, has been tested by major events, and has been open-sourced. It has accumulated a great deal of service governance experience, so any service transformation must take the existing Motan into full account.

We tried adapting Motan to PHP RPC protocols such as Yar. With this, PHP could communicate with the Java server side, but PHP could not perform service discovery.

So we added a daemon next to PHP, and also considered using Nginx for service discovery.

The problems were obvious. Such a transformation is highly intrusive to the business, costly, and hard to extend. Moreover, it does not solve service governance when PHP acts as the server.

We also tried gRPC. It does solve cross-language calls, but there are two problems: one is how to perform service governance, and the other is PB (Protocol Buffers) serialization.

In Weibo's scenarios the content structures are very large, and PB is not more efficient than JSON for them. Business changes also require changes to the PB files, which makes upgrade and maintenance costs unacceptable. In addition, when something goes wrong with serialized data, debugging becomes difficult.

The diversity of the technology stack causes a further series of problems. Even if we solve calls from PHP to Java, the same governance functions cannot realistically be re-implemented in every language.

The service governance experience accumulated in the Motan framework is something we need to inherit and carry forward. So how do we balance these problems and solutions?

The nature of cross-language service

I think the essence of cross-language service can be summarized in two points:

Data interaction

Service governance

Data interaction design should consider cross-language and protocol neutrality, and service governance design should be flexible, comprehensive and scalable.

image.png

The figure above lists the advantages and disadvantages of the approaches to cross-language service:

Traditional HTTP proxy. A traditional HTTP gateway can handle calls between different services and is relatively easy to implement, but because every internal call has to pass through the gateway, it lengthens the call link and scales poorly.

RPC module or Agent. The industry has many RPC frameworks. Most of those on the Java stack are fully featured, but maintaining them across languages is very expensive.

Agent is a new idea. Its research and development cost, maintenance cost, and usage cost are all a reasonable compromise. That is, an independent Agent solves the problems of cross-language service.

This not only relieves the business side of the traditional service governance burden, but also inherits the essence of the Motan framework without implementing service governance logic in multiple languages, and lets the business and the Agent evolve independently of each other.

This idea eventually evolved into today's Weibo Mesh:

image.png

This is how Weibo moved to Service Mesh. The move was not a matter of blindly chasing the latest technology trend.

We started from the status quo and from solving real business problems, and after exploring step by step we found that the final solution coincides with the idea of Service Mesh. Indirectly, this also confirms that Service Mesh is a sound and forward-looking way to solve service-orientation problems.

image.png

In the above figure, we can use the orthogonal decomposition method to understand Service Mesh. It can be seen that the original microservice is split into a business logic layer and a service interaction management layer. The service interaction management layer is abstracted as Service Mesh.

Service Mesh decouples the interaction and governance logic between services from the business, abstracting it independently into a special processing module.

The Mesh Agent is usually deployed as a Sidecar on the same host as the service. The Agent (hereinafter the Mesh Agent is referred to as Mesh or Agent) can be understood as the infrastructure layer of the service. The business does not need to care about the details of interaction and governance between services; the Agent handles all of them.

Service Mesh changes how we think about microservice architecture and brings many architectural advantages to business development. The Agent and the business can evolve independently: usually the Agent is handed to operations, while business development stays with the business line, so that overall both can be developed and integrated continuously.

The Mesh idea solves not only cross-language service but also the servicization of resources, and these transformations are basically transparent to the business side.

Weibo Mesh solution introduction

The following describes the implementation of Weibo Mesh in detail. The picture above is the Weibo Mesh architecture diagram. Besides the necessary Mesh Agent, a Client is added between the business code and the Mesh to meet business migration and real-world rollout needs.

This Client is very light, and its core function is to encapsulate Mesh requests, facilitating the business to make Mesh calls and minimizing the cost of business migration.

In fact, we also implemented some other business-friendly functions in the Client, which at the same time enhance Mesh calls.

For example, the Client can perform cross-language serialization, Mesh failover, multiple requests, and timeout control.

Of course, different business parties can customize these functions. The Client is written in the same language as the business, and its core purpose is to make business migration smooth.

The Mesh layer implements the core functions of Service Mesh, including discovery, interaction, routing, and governance.

Weibo Mesh is implemented in Go, as are the Mesh data planes of most major Chinese companies.

Go offers excellent performance and ease of use, and it is a well-regarded language in the cloud era. In the future, the Mesh layer is very likely to be combined with containers and become part of the container infrastructure layer.

Weibo Mesh data plane

A Service Mesh is generally analyzed in terms of its data plane and its control plane. Let us first look at the data plane of Weibo Mesh, which contains five core modules:

Cluster (cluster management): an abstraction that manages the list of nodes discovered for a service under a group.

HA (high-availability strategy) and LB (load balancing).

Endpoint (abstraction of a service node): essentially an IP and port, but from the code's perspective an abstraction of a service node. Calls can be made directly through an Endpoint, so it can be understood as the unit of a call.

Protocol (the Motan2 transmission protocol plus the Simple serialization protocol).

These modules are introduced one by one below.

①Cluster module

A caller's request first passes through the local Mesh. Inside the Cluster module, it goes through a cluster-level Filter Chain, which includes functions such as cluster metrics, circuit breaking, interception, authentication, and group switching. The filters are organized and invoked as a chain, and arbitrary filters can be added.

Then the high-availability strategy and the load-balancing strategy select an available Endpoint. At the Endpoint, a request-level filter chain runs (per-node logging, metrics, and so on), the request is serialized and assembled according to the transmission protocol, and finally the request is sent through the Endpoint to the Mesh at the opposite end.
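The chain structure described above can be sketched in a few lines. This is only an illustration in Python, not Weibo Mesh's actual (Go) code: the names `build_chain`, `metric_filter`, `auth_filter`, and `endpoint_call`, and the dictionary-shaped request, are all assumptions made for the example.

```python
from typing import Any, Callable, Dict, List

Request = Dict[str, Any]
Handler = Callable[[Request], str]
Filter = Callable[[Request, Handler], str]


def build_chain(filters: List[Filter], last: Handler) -> Handler:
    """Wrap `last` so that filters run in list order, each one calling the next."""
    handler = last
    for flt in reversed(filters):
        # Bind flt and the current handler now, so each layer keeps its own pair.
        handler = (lambda f, nxt: lambda req: f(req, nxt))(flt, handler)
    return handler


def metric_filter(req: Request, nxt: Handler) -> str:
    req.setdefault("trace", []).append("metric")   # stand-in for recording a metric
    return nxt(req)


def auth_filter(req: Request, nxt: Handler) -> str:
    req.setdefault("trace", []).append("auth")
    if not req.get("token"):
        return "rejected"   # short-circuit: later filters and the endpoint never run
    return nxt(req)


def endpoint_call(req: Request) -> str:
    req.setdefault("trace", []).append("endpoint")  # stand-in for the real remote call
    return "ok"


chain = build_chain([metric_filter, auth_filter], endpoint_call)
```

Because every filter has the same signature, adding a new function to the chain means adding one more entry to the list, and a filter can stop the request early by simply not calling the next handler.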

②High availability strategy

The high-availability strategies in Weibo Mesh include common ones such as Failfast and Failover, and load balancing supports common strategies such as weighted round-robin and random. Of course, you can also plug in your own HA/LB strategy.
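The article does not say how its weighted round-robin is implemented. As one possible illustration, the sketch below uses the smooth weighted round-robin scheme (the variant popularized by Nginx); the class name and the node names are made up for the example.

```python
from typing import Dict


class SmoothWeightedRoundRobin:
    """Picks nodes in proportion to their weights while spreading picks evenly."""

    def __init__(self, weights: Dict[str, int]):
        self.weights = dict(weights)
        self.current = {name: 0 for name in self.weights}

    def pick(self) -> str:
        total = sum(self.weights.values())
        # Every node earns its weight on each round ...
        for name, weight in self.weights.items():
            self.current[name] += weight
        # ... the node with the highest running score is chosen ...
        best = max(self.current, key=lambda name: self.current[name])
        # ... and pays back the total, so it has to wait before it wins again.
        self.current[best] -= total
        return best
```

With weights such as `{"a": 5, "b": 1, "c": 1}`, node `a` receives five of every seven picks, but those picks are interleaved with `b` and `c` instead of arriving as one burst.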

The recommended high-availability strategy in Weibo Mesh is Backup Request, also called dual sending. Dual sending is inherited from the Motan framework and is an efficient and reliable mechanism that we arrived at through exploration. It effectively solves the long-tail problem and at the same time improves system throughput.

The traditional way to handle interface timeouts is retrying: after sending a request, wait for the configured timeout, and if no response has arrived, send the request again. In the worst case this takes twice the timeout.

Dual sending works differently. After sending a request, it waits only for the P90 latency (if 90% of requests return within T1, then P90 = T1; the system's P90 is usually much smaller than the timeout configured in the program).

If the request has not returned by then, a second request is sent immediately. Within the timeout period, whichever of the two requests returns first is used.

Of course, there is also an anti-avalanche mechanism. If more than a certain proportion of requests (for example 15%) are being sent twice, the service as a whole is considered to be in trouble, and dual sending stops automatically. Practice has shown that dual sending removes the long tail very effectively.
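A minimal sketch of the backup-request idea, written with Python's `asyncio` rather than taken from Weibo Mesh or Motan. The class name, its parameters, and the `demo` backend are assumptions for illustration, and the anti-avalanche check is simplified to a cumulative ratio instead of a sliding window.

```python
import asyncio
from typing import Awaitable, Callable


class BackupRequest:
    """Sends a second copy of a request once the first has been pending for p90 seconds."""

    def __init__(self, p90: float, timeout: float, max_backup_ratio: float = 0.15):
        self.p90 = p90
        self.timeout = timeout
        self.max_backup_ratio = max_backup_ratio
        self.total = 0      # all requests seen so far
        self.backups = 0    # requests that needed a second copy

    def _backup_allowed(self) -> bool:
        # Anti-avalanche guard: stop dual sending once too many requests need it.
        return self.backups < self.max_backup_ratio * self.total

    async def call(self, send: Callable[[], Awaitable[str]]) -> str:
        self.total += 1
        loop = asyncio.get_running_loop()
        deadline = loop.time() + self.timeout
        pending = {asyncio.ensure_future(send())}
        try:
            # Step 1: give the first copy up to p90 seconds.
            done, pending = await asyncio.wait(pending, timeout=self.p90)
            if not done:
                # Step 2: still pending, so send a second copy if the guard allows it.
                if self._backup_allowed():
                    self.backups += 1
                    pending.add(asyncio.ensure_future(send()))
                # Step 3: take whichever copy answers first before the deadline.
                remaining = max(0.0, deadline - loop.time())
                done, pending = await asyncio.wait(
                    pending, timeout=remaining, return_when=asyncio.FIRST_COMPLETED
                )
            if not done:
                raise asyncio.TimeoutError("no response within the timeout")
            return next(iter(done)).result()
        finally:
            for task in pending:   # cancel whichever copy is still in flight
                task.cancel()


async def demo():
    attempts = []

    async def send() -> str:
        attempt = len(attempts) + 1
        attempts.append(attempt)
        # Hypothetical backend: the first copy hits the long tail, the second is fast.
        await asyncio.sleep(0.5 if attempt == 1 else 0.01)
        return f"reply-{attempt}"

    caller = BackupRequest(p90=0.05, timeout=1.0)
    reply = await caller.call(send)
    return reply, len(attempts), caller.backups
```

In `demo`, the caller gets the second copy's reply after roughly p90 plus the fast latency, instead of waiting for the slow first copy or for a full timeout followed by a retry.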

③Node abstraction

An Endpoint is the unit of a call from the caller's Mesh to the peer Mesh. When Weibo Mesh starts, it initializes the Endpoints along with the Cluster, binds the Filter Chain, and maintains a certain number of long-lived connections for each Endpoint to choose from.

Of course, there are some further details. If calls to a node fail and the failure count exceeds a certain threshold, the node is automatically removed. It is then probed periodically, and once it is available again it is added back to the list of available nodes.
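The removal and re-admission logic can be sketched as follows. This is an assumed model, not the actual implementation: the article does not say whether failures are counted consecutively or cumulatively (the sketch counts consecutive failures), and the names `Endpoint.record`, `Endpoint.probe`, and `failure_threshold` are hypothetical.

```python
from typing import Callable, List


class Endpoint:
    def __init__(self, address: str, failure_threshold: int = 3):
        self.address = address
        self.failure_threshold = failure_threshold
        self.failures = 0        # consecutive failed calls (an assumption)
        self.available = True

    def record(self, success: bool) -> None:
        """Update the failure count after each call and remove the node if needed."""
        if success:
            self.failures = 0
            return
        self.failures += 1
        if self.failures > self.failure_threshold:
            self.available = False   # removed from the available list

    def probe(self, is_healthy: Callable[[str], bool]) -> None:
        """Periodic health check: re-admit a removed node once it responds again."""
        if not self.available and is_healthy(self.address):
            self.available = True
            self.failures = 0


def available_endpoints(endpoints: List[Endpoint]) -> List[str]:
    return [ep.address for ep in endpoints if ep.available]
```

A load balancer built on top of this would choose only from `available_endpoints(...)`, while a background timer calls `probe` on the removed nodes.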

④Motan2 transmission protocol

Weibo Mesh adopts Motan's protocol design as a whole and upgrades it.

Motan originally supported Java serialization, because at the time only communication between Java services was considered. To meet the needs of cross-language communication and future extension, we split the protocol design into a serialization protocol and a transmission protocol. The transmission protocol is responsible for carrying the serialized data, and the serialization protocol is the key to cross-language support.

The Motan transmission protocol has a typical three-part structure:

Header

Metadata

Body

The Header marks the serialization type and the message type (heartbeat or normal request). The serialization can be your own choice, such as PB or the self-developed Simple serialization.

The Metadata carries items such as method names, attribute names, and user parameters. The Body stores the serialized request or response body.
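To make the Header / Metadata / Body split concrete, here is a small framing sketch. It is not the real Motan2 wire format: the magic number, field widths, constants, and the newline-separated metadata encoding are all assumptions chosen to keep the example short.

```python
import struct
from typing import Dict, Tuple

# Hypothetical header layout: magic (2 bytes), message type (1),
# serialization id (1), request id (8), all big-endian.
HEADER = struct.Struct(">HBBQ")
MAGIC = 0xABCD
MSG_REQUEST, MSG_HEARTBEAT = 0, 1
SER_SIMPLE, SER_PB = 1, 2


def encode_frame(msg_type: int, serialization: int, request_id: int,
                 metadata: Dict[str, str], body: bytes) -> bytes:
    # Metadata is flattened to "key\nvalue\nkey\nvalue"; keys and values
    # must not contain newlines in this simplified encoding.
    meta = "\n".join(f"{k}\n{v}" for k, v in metadata.items()).encode("utf-8")
    return (HEADER.pack(MAGIC, msg_type, serialization, request_id)
            + struct.pack(">I", len(meta)) + meta
            + struct.pack(">I", len(body)) + body)


def decode_frame(data: bytes) -> Tuple[int, int, int, Dict[str, str], bytes]:
    magic, msg_type, serialization, request_id = HEADER.unpack_from(data, 0)
    if magic != MAGIC:
        raise ValueError("not a frame of this (hypothetical) protocol")
    pos = HEADER.size
    (meta_len,) = struct.unpack_from(">I", data, pos)
    pos += 4
    parts = data[pos:pos + meta_len].decode("utf-8").split("\n") if meta_len else []
    metadata = dict(zip(parts[0::2], parts[1::2]))
    pos += meta_len
    (body_len,) = struct.unpack_from(">I", data, pos)
    pos += 4
    return msg_type, serialization, request_id, metadata, data[pos:pos + body_len]
```

The point of the split is visible in `decode_frame`: a Mesh can read the header and metadata (message type, method name, routing attributes) without understanding how the body was serialized.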

Simple serialization: the design of Simple is simple and practical. It currently supports basic types, including Bool, String, Int, and Float, and also composite types such as Map and Array built from types like String and Bool.

In the encoding, the type comes first as a single byte (for example Bool or String), followed by the byte length and then the UTF-8 byte stream. Content can be nested.
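The type-byte, length, payload layout and its nesting can be illustrated with a toy encoder. The type tag values, the 4-byte length, and the 8-byte integer below are assumptions for the sketch and are not the real Simple format; Float is left out for brevity.

```python
import struct

# Hypothetical one-byte type tags.
T_BOOL, T_STRING, T_INT, T_ARRAY, T_MAP = 1, 2, 3, 4, 5


def encode(value) -> bytes:
    if isinstance(value, bool):               # check bool before int: bool is an int subclass
        return bytes([T_BOOL, 1 if value else 0])
    if isinstance(value, int):
        return bytes([T_INT]) + struct.pack(">q", value)
    if isinstance(value, str):
        raw = value.encode("utf-8")
        return bytes([T_STRING]) + struct.pack(">I", len(raw)) + raw
    if isinstance(value, list):                # nested: payload is a run of encoded items
        payload = b"".join(encode(v) for v in value)
        return bytes([T_ARRAY]) + struct.pack(">I", len(payload)) + payload
    if isinstance(value, dict):                # nested: payload is key, value, key, value ...
        payload = b"".join(encode(k) + encode(v) for k, v in value.items())
        return bytes([T_MAP]) + struct.pack(">I", len(payload)) + payload
    raise TypeError(f"unsupported type: {type(value).__name__}")


def _decode(data: bytes, pos: int):
    tag = data[pos]
    pos += 1
    if tag == T_BOOL:
        return data[pos] != 0, pos + 1
    if tag == T_INT:
        return struct.unpack_from(">q", data, pos)[0], pos + 8
    if tag not in (T_STRING, T_ARRAY, T_MAP):
        raise ValueError(f"unknown type tag: {tag}")
    (length,) = struct.unpack_from(">I", data, pos)
    pos += 4
    end = pos + length
    if tag == T_STRING:
        return data[pos:end].decode("utf-8"), end
    if tag == T_ARRAY:
        items = []
        while pos < end:
            item, pos = _decode(data, pos)
            items.append(item)
        return items, end
    result = {}
    while pos < end:
        key, pos = _decode(data, pos)
        result[key], pos = _decode(data, pos)
    return result, end


def decode(data: bytes):
    value, _ = _decode(data, 0)
    return value
```

For instance, `encode("hi")` yields the tag byte, a length of 2, and the two UTF-8 bytes, and a Map containing an Array simply embeds the Array's encoded bytes inside the Map's payload.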

Protocol conversion process: at the protocol level, a Weibo Mesh request flows as follows. The caller invokes the Client through a function call; the request is encoded with the Motan2 transmission protocol and Simple serialization, handed to the local Mesh, and then forwarded to the Mesh at the opposite end.

The service behind the peer Mesh may take any form, including a non-RPC service. For this we have a Provider module that can proxy services that are not standard Service Mesh services, such as HTTP/CGI, and export them as Motan-protocol RPC services.

The Provider hides the real protocol of the server-side service. Outside the Provider there is a standard Motan-protocol service, and inside it the service keeps its original protocol, so for the server side the cost of migrating to Weibo Mesh is extremely low.

Weibo Mesh control plane

The control plane is mainly divided into two aspects:

①Strategy expansion

Cluster and Endpoint each have their own Filter Chain, and these implement call control strategies of different dimensions or granularities.

The Filter Chain covers access logging, metrics, circuit breaking, rate limiting, degradation, and so on. As a compromise between call efficiency and coupling, these all live inside Weibo Mesh as plug-ins, and both the set of Filters and their call order can be customized freely.

②Traffic scheduling

Weibo Mesh's traffic scheduling is based on the registry. The registry provides the Mesh not only with service registration and discovery but also with configuration distribution. When the Mesh subscribes to services in the registry, it also subscribes to the related configuration items.

For example, to route all traffic from data center A to data center B, the Mesh only needs to support the corresponding command.
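As a rough picture of such a command, the sketch below lets a Mesh apply a redirect rule pushed by the registry. The command shape (`{"redirects": [{"from": ..., "to": ...}]}`), the class `MeshRouter`, and the group names are all invented for illustration; the article does not describe the real command format.

```python
from typing import Dict, List


class MeshRouter:
    def __init__(self, local_group: str, nodes_by_group: Dict[str, List[str]]):
        self.local_group = local_group
        self.nodes_by_group = nodes_by_group
        self.redirects: Dict[str, str] = {}

    def on_command(self, command: dict) -> None:
        """Called when the registry pushes a new version of the subscribed configuration."""
        self.redirects = {r["from"]: r["to"] for r in command.get("redirects", [])}

    def candidates(self) -> List[str]:
        """Nodes this Mesh may call: its own group unless a redirect says otherwise."""
        group = self.redirects.get(self.local_group, self.local_group)
        return list(self.nodes_by_group.get(group, []))
```

Since the rule arrives through the same subscription channel as service discovery, switching traffic between data centers requires no change or restart on the business side; pushing an empty command restores the default routing.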

Production Practice

Typical scenario

The picture above shows how the gateway and the Mesh are distributed in the microservice architecture. In general, the gateway sits at the edge of the service, where the edge nodes mainly handle macro-level traffic scheduling and control.

Internally, Weibo Mesh is deployed between microservices to improve the quality of communication between services and to meet observability requirements.

Migration cost considerations:

When actually introducing Service Mesh into business scenarios, several issues need to be considered. For example, is the business deployed in a non-cloud, hybrid cloud, or cloud-native model? Weibo runs on a hybrid cloud; different scenarios call for different architectures.

Weibo Mesh needs to be adapted to the registry.

Each language needs its own Client. Currently, mainstream languages such as PHP/C++/C/Python/Lua are supported, and Java/Go is natively supported.

The corresponding DevOps tooling must be adapted as well. Platform support such as monitoring and statistics is required, and any architecture transformation needs sufficient DevOps support.

Next, I will introduce Weibo Mesh forward and reverse proxy practice.

Forward Mesh

The picture above shows the forward proxy flow in the Weibo Mesh scenario. The server-side service registers itself, the caller subscribes to it, and the caller's request passes through the Client and then the local Mesh before finally reaching the Mesh at the opposite end. Note that the dotted line is the failover path.

If the local Mesh Agent goes down, the Client uses the node snapshot returned by service discovery to pick an available node and call it directly, which achieves failover.
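The failover path can be sketched like this. It is a simplified model under assumptions: the local Agent and the direct call are injected as plain callables, failures surface as `ConnectionError`, and the names `MeshClient`, `local_agent`, and `direct_call` are hypothetical.

```python
from typing import Callable, List


class MeshClient:
    def __init__(self,
                 local_agent: Callable[[str], str],
                 snapshot: List[str],
                 direct_call: Callable[[str, str], str]):
        self.local_agent = local_agent    # normal path: hand the request to the local Mesh
        self.snapshot = snapshot          # node list last returned by service discovery
        self.direct_call = direct_call    # fallback path: call a remote node directly

    def call(self, request: str) -> str:
        try:
            return self.local_agent(request)
        except ConnectionError:
            pass   # local Agent is down: fall through to the snapshot nodes
        for node in self.snapshot:
            try:
                return self.direct_call(node, request)
            except ConnectionError:
                continue   # this node is unreachable too, try the next one
        raise ConnectionError("local agent and all snapshot nodes are unavailable")
```

The important design point is that the snapshot is kept on the Client side, so the business can still reach the service while the local Agent is being restarted or upgraded.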

Reverse Mesh

The picture above shows the reverse proxy flow in the Weibo Mesh scenario. Our services are generally HTTP/CGI or use other proprietary protocols.

The highlight of Reverse Mesh is that it requires no architectural change on the server side; you simply deploy Weibo Mesh in front of it.

Through the Provider, the Mesh Agent exports the original-protocol service as a Motan2-protocol service, and the exported service only needs to be registered with the registry to start serving.

This does not affect how the original service is provided. To export a private protocol as a Motan2-protocol service, you can develop your own extension. Exporting HTTP/php-cgi services is supported by default.
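What a Provider does for an HTTP service can be pictured with the sketch below. The mapping from RPC method name to URL path, the request and response dictionaries, and the class `HttpProvider` are assumptions for illustration; the HTTP client is injected as a callable so the example does not depend on a running server.

```python
from typing import Callable, Dict, Tuple
from urllib.parse import urlencode


class HttpProvider:
    """Turns an incoming RPC request into an HTTP call against the original service."""

    def __init__(self, upstream: str, http_get: Callable[[str], Tuple[int, str]]):
        self.upstream = upstream      # address of the unchanged HTTP service
        self.http_get = http_get      # injected HTTP client: url -> (status, body)

    def handle(self, rpc_request: Dict) -> Dict:
        # Assumed convention: the RPC method name is used as the URL path.
        path = "/" + rpc_request["method"].lstrip("/")
        query = urlencode(rpc_request.get("args", {}))
        url = f"http://{self.upstream}{path}" + (f"?{query}" if query else "")
        status, body = self.http_get(url)
        response = {"request_id": rpc_request["request_id"]}
        if status == 200:
            response["value"] = body
        else:
            response["error"] = f"upstream returned HTTP {status}"
        return response
```

The original HTTP service keeps answering plain HTTP requests as before; the Provider only adds a second, RPC-shaped entrance in front of it.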

Features of Reverse Mesh:

Provides HTTP/CGI Providers, with customizable extensions.

HTTP services are automatically converted to RPC, so the business does not need to build on a new RPC framework.

The Mesh requires no intrusive changes on the server side.

Summary

Differences in governance models

Traditional service calls may go through a gateway or RPC in the middle. Service governance can exist on only one end, and it is generally performed on the Server side.

With a Mesh, the locally deployed Agent encapsulates service governance, so governance can be applied on both the Client side and the Server side. This two-way governance is a major feature of the Mesh approach.

Advantages of Weibo Mesh

The results in production are as follows:

fd9d6be25a78429b3dd030351921a98a.png

As the figure above shows, the client-side RT (response time) curve of the Mesh service is close to the server-side RT. This indicates that point-to-point Mesh calls lose nothing to an intermediate layer, and with appropriate service governance the performance seen at the two ends is close. The HTTP service loses more in the intermediate layers.

In the figure on the right, the p999 curve with dual sending is relatively flat, which shows that dual sending effectively cuts the long tail and thereby indirectly improves system performance.

Weibo Mesh cluster

At present, calls between several core Weibo businesses have been moved onto the Mesh. They have withstood major events and the Spring Festival Gala, and the traffic they carry is quite large.

Differences from Istio

At the control level, Weibo Mesh puts functions similar to Istio's Mixer and CA into Filters as plug-ins.

Istio services are discovered through Pilot, while Weibo Mesh goes directly through the registry. Istio uses Envoy as its Sidecar, while we built a brand-new Agent based on Motan.

A characteristic of Istio is that each of the modules above can be regarded as a microservice that can be split out and deployed independently.

Weibo Mesh, for efficiency and internal convenience, couples more functions in as plug-ins, and as a result we expect its overall performance to be better than Istio's.

As the figure above shows, Istio performs service discovery by adapting to the APIs of various cloud platforms, whereas Weibo Mesh adapts to the registry.

Istio puts more emphasis on discovery at the container level (going straight to cloud native), while Weibo Mesh supports common registries such as Consul and ZK.

Weibo Mesh intercepts Mesh requests through the Client and couples some functions in as modules, so when it is heavily customized it cannot be completely transparent to the service. Istio, by contrast, makes the business completely unaware of the mesh by intercepting traffic with iptables.

Weibo Mesh in progress

The Mesh Agent does not care what kind of service it forwards for, which opens a new direction: resource servicization, in which the resource storage layer is also turned into services.

Weibo Mesh's approach to cross-language service also applies to calls between services and resources. Weibo services depend on many resources, such as MC, Redis, MySQL, and MCQ.

By placing an Agent at the resource layer, we can offer resources as a service, which we call pan-servicization. We already have resource servicization scenarios in use based on Weibo Mesh.

Future development directions of Weibo Mesh

Weibo Mesh has two main directions for the future. One is to keep moving toward cloud native, and the other is to keep polishing ease of use.

To promote cloud native, Weibo Mesh connects the cloud platform with the registry and works together with container orchestration so that the two complement each other in service governance.

In addition, we will work actively to integrate the Mesh at the L5 layer. We will keep exploring ways to make access easier for the business side, such as more convenient traffic interception methods and wider language support.

Weibo Mesh has always advocated simple implementation and efficient, reliable functionality. As the Mesh is rolled out at larger scale, the scenarios are becoming more extreme and the performance requirements higher, and we will keep polishing it in this respect. Everyone is welcome to join Weibo Mesh!

Origin blog.51cto.com/14308898/2551166