Cloud Native: Practice and Prospects of Ant's Large-Scale Service Mesh

The concept of cloud native is in full swing, but only a handful of companies have truly implemented it at scale. Ant, one of the earliest adopters in China, has accumulated a set of practical solutions over more than two years of exploration, and those solutions have now passed the test of Double Eleven.

1. Why do we need Service Mesh?


Why do we need Service Mesh, and where does its value to the business lie? We have summarized three points:

 

1. Decoupling microservice governance and business logic.

2. Unified governance of heterogeneous systems.

3. Financial-level network security.

 

Explained separately below.


1. Decoupling microservice governance and business logic

Before Service Mesh, the traditional microservice system relied on the middleware team providing an SDK to business applications. The SDK bundled various service governance capabilities, such as service discovery, load balancing, rate limiting, and service routing.

At runtime, the SDK and the business application code are mixed together in the same process. This tight coupling brings a series of problems:

  • High upgrade cost. Every upgrade requires business applications to bump the SDK version and re-release. Taking Ant as an example, we used to spend thousands of person-days every year on middleware version upgrades.

  • Serious version fragmentation. Because upgrading is so costly while the middleware keeps moving forward, the SDK versions and capabilities online drift apart over time, making unified management difficult.

  • Difficult middleware evolution. Because of the severe version fragmentation, the middleware has to stay compatible with all kinds of legacy logic as it evolves, like moving forward in shackles, and cannot iterate rapidly.

With Service Mesh, we can strip most of these capabilities out of the SDK and move them into an independent process that runs alongside the application in Sidecar mode. By sinking service governance capabilities into the infrastructure, business teams can focus on business logic, while the middleware team can focus on building general-purpose capabilities, achieving truly independent evolution, transparent upgrades, and higher overall efficiency.


2. Unified governance of heterogeneous systems

With the development of new technologies, applications and services written in different languages and frameworks often coexist within the same company. Take Ant as an example: internal businesses are blossoming in all directions, including frontend, search and recommendation, artificial intelligence, security, and more, and their technology stacks are just as diverse; besides Java there are NodeJS, Golang, Python, C++, and so on. To manage and control all of these services uniformly, we would have to re-develop a complete SDK for every language and every framework. The maintenance cost is very high, and it also poses a great challenge to our team structure.


With Service Mesh, by sinking the bulk of service governance capabilities into the infrastructure, multi-language support becomes much easier. Only a very lightweight SDK is needed, and in many cases no separate SDK at all, to achieve unified traffic control, monitoring, and other governance requirements across languages and protocols.
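To make this concrete, below is a minimal Go sketch of what such a "nearly SDK-free" call can look like: the application speaks plain HTTP to its local Sidecar and names the target service in a request header. The header name, port, and service name are illustrative assumptions, not MOSN's actual conventions.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
)

func main() {
	// No heavyweight per-language SDK: the app just sends plain HTTP to the
	// local Sidecar. Header name, port, and service name are illustrative.
	req, err := http.NewRequest("GET", "http://127.0.0.1:20882/sayHello", nil)
	if err != nil {
		panic(err)
	}
	req.Header.Set("X-Target-Service", "com.example.HelloService")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err) // fails if no Sidecar is running locally
	}
	defer resp.Body.Close()

	body, _ := io.ReadAll(resp.Body)
	fmt.Println(string(body))
}
```

Any language that can issue an HTTP request gets the same governance for free, which is exactly the point of unified heterogeneous governance.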


3. Financial-level network security

At present, many companies build their microservice systems on the assumption of a "trusted intranet". However, this assumption may no longer hold in the current wave of large-scale cloud migration, especially in financial scenarios.

Through Service Mesh, we can more easily implement service identification and access control, and, with the help of data encryption, achieve full-link trust, so that services can run in a zero-trust network and the overall security level is raised.
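As an illustration of the basic building block behind this, here is a minimal Go sketch of mutual TLS: the server presents its own identity and requires every caller to present a verifiable one. The certificate file paths are placeholders, not Ant's actual setup.

```go
package main

import (
	"crypto/tls"
	"crypto/x509"
	"log"
	"net/http"
	"os"
)

func main() {
	// Load this workload's certificate (its identity) and the CA used to
	// verify peers. Paths are placeholders for illustration.
	cert, err := tls.LoadX509KeyPair("server.crt", "server.key")
	if err != nil {
		log.Fatal(err)
	}
	caPEM, err := os.ReadFile("ca.crt")
	if err != nil {
		log.Fatal(err)
	}
	caPool := x509.NewCertPool()
	caPool.AppendCertsFromPEM(caPEM)

	srv := &http.Server{
		Addr: ":8443",
		TLSConfig: &tls.Config{
			Certificates: []tls.Certificate{cert},
			// Require and verify a client certificate: every caller must
			// present a valid identity, which is the essence of zero trust.
			ClientAuth: tls.RequireAndVerifyClientCert,
			ClientCAs:  caPool,
		},
	}
	log.Fatal(srv.ListenAndServeTLS("", ""))
}
```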


2. Large-scale implementation of Ant Service Mesh


It is precisely because Service Mesh brings the above benefits that we began technical exploration and small-scale pilots at the beginning of 2018. However, when we first promoted it to business teams, we faced some soul-searching questions:

1. Does the business need to change its code? Business teams face very heavy daily pressure and have little spare energy for technical transformation.

2. The upgrade must not affect my business. For the company's business, stability always comes first; the new architecture cannot be allowed to impact running services.

3. Anything else is up to you. The implication: as long as we keep the transformation cost low enough and the stability good enough, business teams are willing to cooperate with us to implement Service Mesh.


This reminds me of the famous product value formula:

Product value = (new experience − old experience) − migration cost

From this formula:

"New experience − old experience" is the set of benefits Service Mesh brings, as described earlier; this part of the value needs to be maximized.

"Migration cost" refers to the costs the business incurs while migrating to the new Service Mesh architecture; this part needs to be minimized. It mainly includes:

  • Access cost: how do existing systems connect to Service Mesh? Is any change to business code required?

  • Smooth migration: many business systems are already running in production; can they migrate to the Service Mesh architecture smoothly?

  • Stability: Service Mesh is a brand-new architecture; how do we ensure stability after the business migrates?

Next, let's take a look at how Ant does it.


Access cost

Since Ant's services uniformly use the SOFA framework, to minimize the business access cost our solution was to modify the SOFA SDK so that it automatically identifies its operating mode. When the SDK detects that Service Mesh is enabled in the runtime environment, it automatically docks with the Sidecar; if Service Mesh is not enabled, it continues to run in the non-Service Mesh mode. For the business side, a single SDK upgrade completes the access, with no change to business code.
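The idea can be sketched as follows; the environment variable and the addresses are hypothetical, standing in for whatever mechanism the real SOFA SDK uses to detect its runtime environment.

```go
package main

import (
	"fmt"
	"os"
)

// detectMeshMode decides at startup whether the SDK should talk to the local
// Sidecar or go directly to the registry. The environment variable and the
// addresses are hypothetical; the real SOFA SDK has its own detection logic.
func detectMeshMode() string {
	if os.Getenv("SOFA_MESH_ENABLED") == "true" {
		// Service Mesh on: register and subscribe via the local Sidecar.
		return "127.0.0.1:20881"
	}
	// Service Mesh off: fall back to the original direct-to-registry mode.
	return "registry.example.com:9600"
}

func main() {
	fmt.Println("control endpoint:", detectMeshMode())
}
```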


Let's take a look at how the SDK docks with the Sidecar, starting with the service discovery process (a sketch of the address rewriting follows the list):

1. Suppose the server runs on machine 1.2.3.4 and listens on port 20880. The server first sends a service registration request to its Sidecar, telling it the service to register and the IP + port (1.2.3.4:20880).

2. The server-side Sidecar then sends a service registration request to the registry with the service name and an IP + port. Note, however, that the port it registers is not the business application's port (20880) but the port the Sidecar itself listens on (for example, 20881).

3. The caller sends a service subscription request to its own Sidecar, telling it which service to subscribe to.

4. The calling-side Sidecar pushes a service address to the caller. Note that the IP pushed is the local machine and the port is the one the calling-side Sidecar listens on (for example, 20882).

5. The calling-side Sidecar sends a service subscription request to the registry with the service to subscribe to.

6. The registry pushes the service address to the calling-side Sidecar (1.2.3.4:20881).
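The six steps boil down to two address rewrites, sketched below in Go with the ports from the example hard-coded purely for clarity; a real Sidecar of course discovers these dynamically.

```go
package main

import "fmt"

type endpoint struct {
	IP   string
	Port int
}

// Server side: the app registers 1.2.3.4:20880 with its Sidecar, but the
// Sidecar registers its own listening port to the registry.
func registeredToRegistry(app endpoint, sidecarPort int) endpoint {
	return endpoint{IP: app.IP, Port: sidecarPort} // 1.2.3.4:20881
}

// Calling side: whatever the registry pushes, the address handed to the
// local app is always localhost plus this Sidecar's own port.
func pushedToCaller(sidecarPort int) endpoint {
	return endpoint{IP: "127.0.0.1", Port: sidecarPort} // 127.0.0.1:20882
}

func main() {
	app := endpoint{IP: "1.2.3.4", Port: 20880}
	fmt.Println("registry sees:", registeredToRegistry(app, 20881))
	fmt.Println("caller sees:  ", pushedToCaller(20882))
}
```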


Next, the service communication process (a simplified forwarding sketch follows the list):

1. The "server" address the caller obtained is 127.0.0.1:20882, so it initiates the service call to this address.

2. When the calling-side Sidecar receives the request, it parses the request header to learn which service is being called, looks up the address previously pushed by the registry, and initiates the real call (to 1.2.3.4:20881).

3. When the server-side Sidecar receives the request, it performs a series of processing steps and finally forwards the request to the server (127.0.0.1:20880).
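A drastically simplified calling-side Sidecar can be sketched in Go as a reverse proxy that routes by a request header; the header name and the route table are illustrative assumptions, not MOSN's real implementation.

```go
package main

import (
	"log"
	"net/http"
	"net/http/httputil"
)

// Addresses previously pushed by the registry (see the discovery steps):
// the value is the server-side Sidecar, not the business application.
var routeTable = map[string]string{
	"com.example.HelloService": "1.2.3.4:20881",
}

func main() {
	proxy := &httputil.ReverseProxy{
		Director: func(req *http.Request) {
			// Learn the target service from the request header, then
			// rewrite the request to the real remote address.
			if addr, ok := routeTable[req.Header.Get("X-Target-Service")]; ok {
				req.URL.Scheme = "http"
				req.URL.Host = addr
			}
		},
	}
	// The caller always dials 127.0.0.1:20882, whatever the real target is.
	log.Fatal(http.ListenAndServe("127.0.0.1:20882", proxy))
}
```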


Through the above process, the SDK and the Sidecar are docked. Some may ask: why not adopt the iptables solution? On the one hand, iptables performance degrades severely when many rules are configured; on the other hand, and more importantly, its manageability and observability are poor, which makes problems hard to troubleshoot.


Smooth migration

Ant's production environment runs a large number of business systems with complex upstream and downstream dependencies; some are core applications where even a slight jitter can cause a failure. For an architectural transformation as large as Service Mesh, smooth migration is therefore mandatory, and grayscale release and rollback must also be supported.

Thanks to the registry we retained in the architecture, the smooth migration solution is relatively simple and direct:

1. Initial state

Initially, there is one service provider and one service caller.


2. Transparently migrate the caller

Our solution imposes no order on whether the caller or the service provider migrates first. Suppose the caller migrates to Service Mesh first: once Sidecar injection is enabled on the caller, the SDK automatically recognizes that Service Mesh is enabled and subscribes and communicates via the Sidecar; the Sidecar in turn subscribes to the service and communicates with the real service provider, which remains completely unaware of whether the caller has migrated. Callers can therefore enable Sidecars one by one in grayscale, and simply roll back if anything goes wrong.


3. Transparently migrate the service provider

Suppose instead the service provider migrates to Service Mesh first: once Sidecar injection is enabled on the server side, the SDK automatically recognizes that Service Mesh is enabled, registers and communicates via the Sidecar, and the Sidecar registers itself to the registry as the service provider. The caller still subscribes to the service from the registry and remains completely unaware of whether the provider has migrated. Providers can therefore enable Sidecars one by one in grayscale, and simply roll back if anything goes wrong.


4. Final state

Finally, the end state is reached: both the caller and the server have smoothly migrated to the Service Mesh.


Stability

By introducing the Service Mesh architecture, we have initially decoupled applications from the infrastructure, greatly speeding up infrastructure iteration. But what does this mean for stability?

In the SDK model, after the middleware team releases an SDK, business applications upgrade gradually, rolling through development, testing, pre-release, grayscale, and production environments with functional verification at each step. To some extent, a large number of business developers are helping to test the middleware products, and the small-step rollout keeps the risk very low.

Under the Service Mesh architecture, however, business applications and infrastructure are decoupled. This greatly accelerates iteration, but it also means we can no longer rely on the previous model to ensure stability: we must guarantee product quality in the R&D phase and also control risk during online changes.

Given the scale of Ant's clusters, an online change often involves hundreds of thousands of containers. How do we ensure the stability of an upgrade at such scale? Our answer: unattended changes.

Before looking at unattended changes, consider autonomous driving, whose maturity is commonly graded from L0 to L5. L0 corresponds to how most of us drive today: the car has no automation and is fully controlled by the driver. L5 is the highest level, true driverless operation. A Tesla, as we know it, sits between L2 and L3, able to drive autonomously in certain scenarios.

We defined the levels of unattended changes by analogy with this system, as follows:


L0: purely manual changes, black-screen operations, with no tool assistance.

L1: some tools exist, but they are not connected into a system; a person must orchestrate different tools to complete a change and enforce grayscale manually.

L2: preliminary automation; the system can orchestrate the entire change process by itself and enforce grayscale. Compared with L1, human hands are freed: one change needs only one ticket.

L3: the system can observe; if an anomaly appears during the change, it notifies the user and blocks the change. Compared with L2, human eyes are freed as well: we no longer have to watch the whole process, but we must stay on call and respond promptly when something goes wrong.

L4: a step further; the system can make decisions. When a change goes wrong, the system handles it automatically and self-heals. Compared with L3, the human brain is freed: changes can run in the middle of the night, the system handles problems according to predefined plans, and people are called in only when it cannot resolve them.

L5: the ultimate state; after submitting a change, people can walk away, and the system executes it and guarantees the outcome.

At present, we assess ourselves at the L3 level, which is mainly reflected in two capabilities:

 

1. The system automatically orchestrates batching strategies, enforcing grayscale.

2. Change defense has been introduced, with pre- and post-checks that block a change promptly when problems occur.

The change defense process works as follows (a simplified sketch in Go follows the list):

 

  • After a change order is submitted, the system splits the change into batches, organized by data center, application, and deployment unit.

  • Before each batch starts, a pre-check runs: for example, whether the current time is a business peak or an incident window, and whether system capacity is sufficient.

  • If the pre-check fails, the change is terminated and the change owner is notified; if it passes, the Mosn upgrade or access process begins.

  • After the batch completes, a post-check runs: business monitoring (for example, whether transaction and payment success rates have dropped), service health (such as RT and error rate) for the application and its upstream and downstream systems, plus correlation with alarms to see whether any fault occurred during the change.

  • If the post-check fails, the change is terminated and the change owner is notified; if it passes, the next batch starts.
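In simplified Go, the defense loop looks roughly like this; the check functions are placeholders for the real monitoring, capacity, and alarm integrations.

```go
package main

import "fmt"

// Placeholder checks; the real system inspects business peaks, incident
// windows, capacity, success rates, RT, error rates, and active alarms.
func preCheck(batch string) error  { return nil }
func postCheck(batch string) error { return nil }

func rollout(batches []string) error {
	for _, b := range batches {
		if err := preCheck(b); err != nil {
			// Terminate the change and notify the owner.
			return fmt.Errorf("batch %s blocked by pre-check: %w", b, err)
		}
		fmt.Println("upgrading Mosn in batch:", b)
		if err := postCheck(b); err != nil {
			// Block before the problem spreads to the next batch.
			return fmt.Errorf("batch %s blocked by post-check: %w", b, err)
		}
	}
	return nil
}

func main() {
	// Batches are split by data center, application, and deployment unit.
	if err := rollout([]string{"zone-A", "zone-B", "zone-C"}); err != nil {
		fmt.Println("change halted:", err)
		return
	}
	fmt.Println("change completed")
}
```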


Overall architecture

Let's look at the overall architecture of Ant SOFAMesh. "Dual-mode microservices" here refers to the coexistence of traditional SDK-based microservices and Service Mesh microservices, which requires:

  • Interoperability: applications in the two systems can call each other.

  • Smooth migration: applications can migrate smoothly between the two systems, transparently to their upstream and downstream dependencies.

  • Flexible evolution: once interoperability and smooth migration are in place, application transformation and architecture evolution can proceed flexibly, according to the actual situation.

On the control plane, we introduced Pilot for configuration distribution (such as service routing rules), while still retaining an independent registry for service discovery, to enable smooth migration and large-scale landing.

On the data plane, we use the self-developed Mosn, which supports not only SOFA applications but also Dubbo and Spring Cloud applications.

In terms of deployment mode, we support not only containers/Kubernetes but also virtual machine scenarios.


Landing scale and business value

At present, Service Mesh covers thousands of Ant applications, with full coverage of core links and hundreds of thousands of Pods in production. On Double Eleven it handled tens of millions of QPS with an average processing time under 0.2 ms, a good technical result.


In terms of business value, through the Service Mesh architecture we have initially decoupled infrastructure from business applications:

  • Iteration and cost: the infrastructure upgrade cadence rose from 1–2 times per year to 1–2 times per month, and transparent upgrades save thousands of person-days across the company every year.

  • Resource efficiency: with Mosn's traffic adjustment we implemented time-shared scheduling, switching more than 20,000 containers in just 3 minutes 40 seconds and saving more than 36,000 physical cores, getting through the Double Eleven promotion without adding machines.

  • Security and trust: identity authentication, service authorization, and communication encryption let services run in a zero-trust network, raising the overall security level.

  • Service governance: capabilities such as adaptive rate limiting, global rate limiting, single-machine stress testing, and business unit isolation went online quickly, greatly improving the refinement of service governance and bringing great value to the business.


3. Looking to the future


At present, we can see very clearly that the entire industry is moving from Cloud Hosted to Cloud Ready to Cloud Native.

But the point I want to emphasize here is that we do not pursue technology for its own sake. Technology ultimately serves the business, and the same goes for cloud native: its essence is to improve efficiency and reduce cost. Cloud native is not an end in itself but a means.


Through the large-scale implementation of Service Mesh, we have taken a solid step toward cloud native and verified its feasibility. We have also seen first-hand that, with the infrastructure sunk, both the business teams and the infrastructure team have gained in R&D and operational efficiency.

At present, Mosn mainly provides RPC and MQ capabilities, while a lot of other infrastructure logic still lives inside business systems as SDKs, so there is still a long way to go to fully decouple infrastructure from business. In the future we will sink more capabilities into Mosn (such as transactions, caching, configuration, and task scheduling), evolving from Service Mesh to a broader Mesh. Business applications will then interact with Mosn through standardized interfaces, without importing all kinds of heavyweight SDKs, and Mosn will evolve from a pure traffic proxy into a next-generation middleware runtime (a sketch of this interaction style follows).
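For a flavor of what "standardized interfaces" could look like, here is a hypothetical Go sketch in which the application fetches configuration from its local Mosn over plain HTTP instead of embedding a configuration SDK. The port and path are invented for illustration and do not describe Mosn's actual API.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
)

// getConfig asks the local sidecar runtime for a configuration value over a
// plain, standardized API instead of a heavyweight in-process config SDK.
// The port and path are hypothetical.
func getConfig(key string) (string, error) {
	resp, err := http.Get("http://127.0.0.1:20883/v1/config/" + key)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	b, err := io.ReadAll(resp.Body)
	return string(b), err
}

func main() {
	v, err := getConfig("app.switch.gray")
	if err != nil {
		fmt.Println("sidecar not available:", err)
		return
	}
	fmt.Println("config:", v)
}
```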


In this way, the coupling between business applications and infrastructure can be reduced further, making business applications lighter. The evolution from the earliest monolithic applications to microservices decoupled business teams from each other, but it did not decouple business teams from the infrastructure team. The future direction is for business applications to converge on pure business logic (Micrologic), sinking all non-business logic into the Sidecar; only then can business and infrastructure truly evolve independently and overall efficiency improve.


Another trend is Serverless. Currently, limited by factors such as application size and startup speed, its main application scenario is still Functions.

However, we have always believed that Serverless is not limited to the Function scenario. Its elasticity, freedom from operations, and pay-per-use model are clearly of even greater value to ordinary business applications.


Therefore, once business applications evolve to Micrologic + Sidecar, the applications themselves become smaller and start faster, while the infrastructure can optimize further (for example, establishing database connections and warming caches in advance). Ordinary business applications can then join the Serverless system as well, and truly enjoy the efficiency and cost benefits Serverless brings.

 
