GIAC 2020 Global Internet Architecture Conference Speech Record: Practice of Microservice Technology Architecture Based on TarsGo

The GIAC 2020 Global Internet Architecture Conference was held in Shenzhen on August 14 and 15, 2020.

GIAC (Global Internet Architecture Conference) is an annual technology architecture conference for architects, technical leaders and senior technical practitioners. It was launched by the High Availability Architecture technology community and msup, which have long focused on Internet technology and architecture, and it is one of the largest technology conferences in China.

The 6th GIAC shares typical examples of technological innovation and R&D practice from fields related to Internet architecture, including the most popular frontier technologies, technology management, system architecture, big data and artificial intelligence, and mobile development and languages.

On the topic of Go language, Tencent senior engineer Li Kaiyuan delivered a keynote speech entitled "Practice of TarsGo-based Microservice Technology Architecture".

The following is the speech record:

The topic I am sharing today is the practice of microservice technology architecture based on TarsGo. First, let me introduce myself. My work experience is very simple: I have been working at Tencent since graduation, and I have worked in three directions over the past 5 years.

The first was in 2015. At that time Docker was fairly popular and many people knew about K8S, but everyone felt it was still a toy and not suitable for production environments. My team built its own container cloud, mainly running Tars and other services in Docker containers, and implemented many automatic scheduling capabilities of the kind K8S provides. A few years ago we wanted to open source this solution, but found that K8S had already become the industry standard. Fortunately, there was no industry standard for microservices yet, so we open sourced Tars.

The second was around 2017. At that time Go was also starting to take off, and I implemented some of the functions of the Go language version of Tars.

The third was in 2019. Many people started talking about serverless and cloud native, and I moved to Tencent Cloud to work on its cloud development product. Our back-end team uses TarsGo to develop it.

You may have noticed that the directions I have worked in all seem to chase industry trends, but what they have in common is that they are all related to Tars: from the infrastructure, to the application platform, to being a user of the platform. Because of these experiences, I may look at Tars from different angles. When building a platform or a technical framework, you may pursue cool technologies, but once you become a user you don't care how impressive the technology is. You care whether a component or platform is easy to use, whether it is stable, and whether it can solve your business problems.

Today’s sharing session will be divided into three parts:

1) The core functions of Tars, 2) The features and performance optimization practices of TarsGo, 3) The combination of Tars and K8S.

Before the introduction, let's first analyze the general technical challenges caused by the growth of business scale that basically every software development team will encounter.

If you are just running a personal blog with a few hundred UVs (unique visitors) per day, the challenge is small: just find a server to host it. But if daily traffic grows to several million or several billion, you meet the first challenge: when user traffic increases, the original monolithic architecture can no longer cope. At that point you need to consider a distributed architecture, with read/write separation and a high-performance cache.

The second challenge is the growth of the team: more software developers are recruited to meet more business needs, and a larger headcount brings new problems. For the boss, besides paying more wages, it is necessary to find ways to maximize the value that employees produce. For us in R&D, we of course also hope for better production tools so that we can meet the boss's needs. In project development specifically, this calls for a common interface protocol to communicate with, and a framework that supports multiple development languages, so that team members who use different languages can collaborate on the same project, and colleagues proficient in PHP and colleagues proficient in Golang don't spend time arguing about which is the best language in the world.

The third challenge is the growth of the number of servers. Managing a large number of machines is not easy. If you have 10,000 machines, every day a few machines may crash, a few disks may fail, or an entire machine room may go down, so you need a solution that recovers automatically from machine failures and machine room failures.

The fourth is the increase in the number of business services, which corresponds to the increase in the complexity of business logic. Managing a large number of business services is also very difficult. Development feels great and a lot of microservices get written, but the long operation and maintenance phase that follows has to deal with scaling out and in, monitoring, and troubleshooting. This requires a fully featured microservice governance framework to improve the efficiency of operation and maintenance.

Many of Tencent's business products have run into the challenges listed above, and years of practice have built up a good deal of experience. Among the results, Tars, the open source project introduced today, is the most representative.

So what are the capabilities of Tars? I think there are two parts. One part is related to development. Tars provides a development framework based on an IDL and automatic code generation, so developers only need to care about business logic. You don't need to care whether the protocol is TCP or UDP, and you don't need to know the IP or port of the services you depend on. Tars currently supports five languages (C++, Golang, PHP, Node.js and Java), and support for more languages is in progress.

The other part is related to operation and maintenance. Tars supports visual management of services and lossless changes. Configuration management offers version management and configuration references. Logs, monitoring and call chains are supported by default. The SET model of Tars gives operations staff convenient multi-cluster management, and there are also overload protection and disaster tolerance. Previously, the open source version of TARS could not scale out and in automatically. Recently we open sourced an integration of TARS and K8S to fill this gap, which is the solution I will introduce later.

There are many microservice frameworks similar to TARS. Here are a few of the more influential RPC frameworks for comparison. gRPC is the most popular RPC framework, but it does not have service governance capabilities. Spring Cloud and Dubbo, on the other hand, have relatively complete service governance capabilities. A Golang version of Dubbo has recently been open sourced as well, but my impression is that Java is still their main language. Service mesh is very popular now and is called the next generation of microservice architecture. It separates business logic from the underlying communication and is language independent, which means it supports multiple languages, but it has not been used in industry for long and is not particularly mature.

TARS shares many ideas with service mesh. For example, general functions such as load balancing should be implemented not at the business layer but by a common infrastructure layer. Whereas service mesh was only proposed in recent years, TARS has been used in many of Tencent's core businesses since 2008. It is a microservice framework that supports multiple languages and has complete service governance capabilities.

Speaking of adoption, we have also been in touch with many companies since TARS was open sourced. Due to time constraints, I will pick only a few TARS capabilities to introduce in more detail.

Let's start with a simple scenario. Service A needs to call service B, and service B has multiple nodes. How does A choose which node to call? TARS has several load balancing methods. The default is round robin. If you want hash-based routing instead, you only need to configure it, with no additional development.
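
As a rough illustration of the two strategies just mentioned, here is a minimal Go sketch of round-robin and hash-based node selection. The Balancer type and its methods are hypothetical names made up for this example. They are not the TarsGo API, where the strategy is chosen through configuration rather than written by hand.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sync/atomic"
)

// Balancer picks one node of the callee for each request.
// All names here are hypothetical; this is not the TarsGo API.
type Balancer struct {
	next  uint64 // round-robin cursor, kept first so it stays 64-bit aligned
	nodes []string
}

// RoundRobin returns the nodes in turn.
func (b *Balancer) RoundRobin() string {
	n := atomic.AddUint64(&b.next, 1) - 1
	return b.nodes[n%uint64(len(b.nodes))]
}

// ByHash maps the same key to the same node as long as the node list
// does not change.
func (b *Balancer) ByHash(key string) string {
	h := fnv.New32a()
	h.Write([]byte(key))
	return b.nodes[h.Sum32()%uint32(len(b.nodes))]
}

func main() {
	b := &Balancer{nodes: []string{"10.0.0.1:9000", "10.0.0.2:9000", "10.0.0.3:9000"}}
	for i := 0; i < 4; i++ {
		fmt.Println("round robin:", b.RoundRobin())
	}
	fmt.Println("hash(user42):", b.ByHash("user42"))
}
```

Round robin spreads requests evenly, while hashing on a key such as a user ID keeps requests for the same key on the same node, which helps when the callee holds per-key caches.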

In addition, when a node of service B becomes abnormal, TARS applies a circuit-breaking algorithm by default. It automatically shields the abnormal node to minimize the number of requests that fail because of it, and then retries the node from time to time. If the node recovers, it is added back.

Another abnormal scenario is when the request volume is too large and the service is overloaded. Tars lets you configure the maximum number of concurrently processed requests to provide overload protection and avoid service avalanches.
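
One common way to cap concurrent processing in Go is a buffered channel used as a counting semaphore. The sketch below shows the idea. The Limiter type and ErrOverloaded are names invented for this example and say nothing about how TARS implements its own limit.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrOverloaded is returned when all processing slots are busy.
var ErrOverloaded = errors.New("server overloaded")

// Limiter caps the number of requests processed at the same time.
type Limiter struct {
	slots chan struct{}
}

func NewLimiter(maxConcurrent int) *Limiter {
	return &Limiter{slots: make(chan struct{}, maxConcurrent)}
}

// Do runs handler if a slot is free and rejects the request immediately
// otherwise, instead of letting requests queue up without bound.
func (l *Limiter) Do(handler func() error) error {
	select {
	case l.slots <- struct{}{}:
		defer func() { <-l.slots }()
		return handler()
	default:
		return ErrOverloaded
	}
}

func main() {
	l := NewLimiter(1)
	err := l.Do(func() error {
		// While this handler holds the only slot, a second request is rejected.
		return l.Do(func() error { return nil })
	})
	fmt.Println(err)                               // server overloaded
	fmt.Println(l.Do(func() error { return nil })) // <nil>
}
```

Rejecting early keeps the latency of accepted requests bounded, which is what stops one overloaded service from dragging down its callers.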

In another scenario, service A wants to call B, so A needs to know the address of service B. The simplest way is to hard-code B's IP in service A, but the IP of service B may change when it scales out or in, and then every change to service B requires modifying A's code, which is obviously unmaintainable. An improvement is to put the IP in a configuration file, but with this second approach A's configuration still has to be modified whenever service B changes. So basically everyone uses the third approach, a name service: service addresses are registered automatically, and the client does not need to change when the called service changes.

However, a name service still has shortcomings in multi-cluster scenarios, which are very common. For example, a consumer-facing (to-C) product with a large number of users can be split into clusters for the Shanghai and Beijing regions, each serving users of its own region. This isolates faults better: when there is a problem in Shanghai, users in Beijing are not affected.

There are two traditional ways to run multiple clusters. One is for each cluster to have its own name service and common services, so every new cluster adds a certain operation and maintenance cost. The other is for multiple clusters to share one name service, in which case the same service needs a different name in each cluster, and the names of every cluster's services have to be managed, which also makes operation and maintenance very complex.

In response to this dilemma, TARS provides SET capabilities.

I think the SET model is one of the most essential capabilities in TARS. It logically isolates service traffic. If a service has SET enabled, it can only call services in the same SET; if a service does not have SET enabled, it can call services in all SETs. In addition, wildcards can be used to let one SET call several nearby SETs. No code or business configuration has to change: operations staff can flexibly configure SETs to manage the call relationships between services.
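
The SET rules described above amount to a filter over the node list returned by the name service. The Go sketch below is a simplified illustration. The "name.area.group" naming format, the "*" group wildcard and every identifier in the code are assumptions made for this example, and the real TARS matching rules may differ in detail.

```go
package main

import (
	"fmt"
	"strings"
)

// Node is one registered instance of a service. Set is empty when the
// node has SET disabled. The "name.area.group" format and the "*" group
// wildcard are assumptions made for this sketch.
type Node struct {
	Addr string
	Set  string
}

// sameSet reports whether a caller in callerSet may call a node in nodeSet.
func sameSet(callerSet, nodeSet string) bool {
	if callerSet == "" {
		return true // a caller without SET may call nodes of every SET
	}
	c := strings.Split(callerSet, ".")
	n := strings.Split(nodeSet, ".")
	if len(c) != 3 || len(n) != 3 {
		return false
	}
	groupOK := c[2] == n[2] || c[2] == "*" || n[2] == "*"
	return c[0] == n[0] && c[1] == n[1] && groupOK
}

// Visible returns the nodes a caller in callerSet is allowed to call.
func Visible(nodes []Node, callerSet string) []Node {
	var out []Node
	for _, n := range nodes {
		if sameSet(callerSet, n.Set) {
			out = append(out, n)
		}
	}
	return out
}

func main() {
	nodes := []Node{
		{"10.0.1.1:9000", "shop.sh.1"},
		{"10.0.1.2:9000", "shop.sh.2"},
		{"10.0.2.1:9000", "shop.bj.1"},
	}
	fmt.Println(Visible(nodes, "shop.sh.1")) // only the shop.sh.1 node
	fmt.Println(Visible(nodes, "shop.sh.*")) // both Shanghai nodes
	fmt.Println(len(Visible(nodes, "")))     // 3: a caller without SET sees every node
}
```

Because the filter runs in the name service lookup rather than in business code, changing which SET a service belongs to is purely an operations decision.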

Machine room disaster recovery is also a very common business requirement. In my personal experience at Tencent, there are basically one or two machine room failures every year. A few years ago, Tencent's machine room in Shanwei might have had to be powered off because of flooding. Of course, the most common cause is a cut fiber. When an entire machine room becomes unavailable and a product's services are deployed only in that machine room, the whole product becomes unavailable, which the boss will definitely not accept. Therefore, for disaster recovery, the same service must be deployed in multiple machine rooms, so that service can still be provided normally when one machine room is unavailable.

This brings a new problem: services now have to be called across machine rooms, which increases latency.

What if you want machine room disaster recovery but don't want higher latency? TARS provides automatic area awareness. When multiple machine rooms are available, the client only calls server nodes in its local machine room, and it calls across machine rooms only if no node is available locally. This function only needs to be enabled on the control plane, without modifying business code. It provides disaster tolerance without adding latency, and it makes operation and maintenance easier.
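
The selection rule of area awareness can be written down in a few lines. This Go sketch assumes each node carries an IDC label and a liveness flag; the Node type and the Candidates function are hypothetical and only illustrate the "local first, remote as fallback" rule.

```go
package main

import "fmt"

// Node is one instance of the callee. IDC names the machine room it
// runs in. The type is hypothetical and exists only for this sketch.
type Node struct {
	Addr  string
	IDC   string
	Alive bool
}

// Candidates returns the live nodes in the caller's own machine room,
// and falls back to live nodes elsewhere only when none are available locally.
func Candidates(nodes []Node, localIDC string) []Node {
	var local, remote []Node
	for _, n := range nodes {
		if !n.Alive {
			continue
		}
		if n.IDC == localIDC {
			local = append(local, n)
		} else {
			remote = append(remote, n)
		}
	}
	if len(local) > 0 {
		return local
	}
	return remote
}

func main() {
	nodes := []Node{
		{"10.1.0.1:9000", "sz", true},
		{"10.1.0.2:9000", "sz", true},
		{"10.2.0.1:9000", "sh", true},
	}
	fmt.Println(len(Candidates(nodes, "sz"))) // 2: local nodes only
	nodes[0].Alive, nodes[1].Alive = false, false
	fmt.Println(Candidates(nodes, "sz")[0].IDC) // sh: cross-room fallback
}
```

The load balancer then picks among the returned candidates, so cross-room latency is only paid while the local machine room has no healthy node.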

For microservices, monitoring, logging and call chains are standard equipment. Given the nature of microservices, logging in to machines to troubleshoot after a problem occurs is inefficient and sometimes not realistic. Generally, you observe the running status of services through monitoring and get an alarm when the status is abnormal. You use the call chain to understand the call relationships and service architecture of a complex system, so that you can quickly locate the problematic module when a timeout occurs, and then you use the logs to analyze the cause of the exception. Monitoring, logs and call chains overlap somewhat in use. For example, monitoring metrics can also be produced by aggregating logs, but this is less efficient when the amount of data is large, so a mature microservice system still needs all three capabilities.

TARS will provide basic services such as tarslog/tarsstat/tarsproperty by default, and can also integrate more mature open source solutions, such as Prometheus for monitoring, ELK for logs, and zipkin and jaeger for call chains.

Next is the second part, TarsGo, the Go language version of the TARS RPC framework. I will mainly share experience related to performance optimization.

TarsGo has three advantages: the service governance capabilities of TARS, the advantages of Go, and TarsGo's performance optimization. TarsGo is an RPC framework implemented on the Tars protocol, and by default it inherits the service governance capabilities of TARS, such as load balancing, disaster tolerance and fault tolerance, and the TARS name service.

The second is the advantage of the Go language. Golang has a rich standard library, so in many cases there is no need to reinvent the wheel, and beginners get up to speed quickly. Compared with Python, C or other languages, it strikes a better balance between development efficiency and runtime efficiency. It supports both static compilation and cross-platform compilation, which is very convenient for R&D: for example, you can develop and compile on a Mac and then upload the binary to a Linux server to run. I think the biggest advantage of Go is concurrent and asynchronous programming. Goroutines and channels greatly reduce the complexity of concurrent programming.

The third advantage of TarsGo is performance, which will be introduced in detail later.

Performance is very important for microservices, because one user request may pass through a dozen or more services. A high-performance RPC framework reduces the extra latency that microservices add, and TARS does well here.

Previously, we ran performance stress tests comparing commonly used RPC frameworks across different development languages. In the test results, TARS C++ and TarsGo have the best overall performance, and in the small-packet scenario they reach more than 5 times the performance of gRPC.

To achieve this, TarsGo has made many optimizations. This experience should also be of some reference value for anyone developing in Go.

The first is the optimization of timers. Every RPC call has a timeout, and a timer is needed to enforce it. With high concurrency, a lot of timers are needed. Flame graphs generated with Golang's pprof tool revealed that creating timers was a performance problem. So we implemented a time wheel algorithm, which reduces the time complexity of creating a timer from O(log n) to O(1). The disadvantage is
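
To make the time wheel idea concrete, here is a minimal single-level wheel in Go. It is a sketch under assumptions, not the TarsGo implementation: the Wheel type, the slot layout and the manual Tick method are all invented for this example, and the code is not safe for concurrent use.

```go
package main

import "fmt"

// timer is one pending timeout. rounds counts how many full turns of
// the wheel remain before it fires.
type timer struct {
	rounds int
	fire   func()
}

// Wheel is a single-level time wheel in which each slot covers one tick.
// Timeouts are therefore only as precise as one tick.
type Wheel struct {
	slots [][]*timer
	pos   int // slot of the current tick
}

func NewWheel(size int) *Wheel {
	return &Wheel{slots: make([][]*timer, size)}
}

// After schedules fire to run after the given number of ticks (ticks >= 1).
// Insertion is an append to one slot, so its cost does not depend on how
// many timers are already pending.
func (w *Wheel) After(ticks int, fire func()) {
	n := len(w.slots)
	idx := (w.pos + ticks) % n
	w.slots[idx] = append(w.slots[idx], &timer{rounds: (ticks - 1) / n, fire: fire})
}

// Tick advances the wheel by one slot and fires the timers that are due.
func (w *Wheel) Tick() {
	w.pos = (w.pos + 1) % len(w.slots)
	due := w.slots[w.pos]
	w.slots[w.pos] = nil
	for _, t := range due {
		if t.rounds == 0 {
			t.fire()
			continue
		}
		t.rounds--
		w.slots[w.pos] = append(w.slots[w.pos], t)
	}
}

func main() {
	w := NewWheel(8)
	w.After(3, func() { fmt.Println("request 1 timed out") })
	w.After(10, func() { fmt.Println("request 2 timed out") })
	for tick := 1; tick <= 10; tick++ {
		fmt.Println("tick", tick)
		w.Tick()
	}
}
```

In a real client, one goroutine driven by a time.Ticker would call Tick, so a single runtime timer serves every pending RPC timeout. By contrast, a heap-based timer pays an O(log n) insertion for each call.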

In addition, there are other optimization points, such as optimizing how deadlines are set on net package connections, using sync.Pool to reuse the bytes.Buffer objects that were being allocated frequently, and using a goroutine pool to avoid creating a large number of goroutines. After the above optimizations, the performance described earlier was achieved.

Next, I will introduce the integration plan of TARS and K8S.

Currently, TARS provides comprehensive service governance capabilities such as name service, load balancing, overload protection and area awareness. However, the open source version of TARS has no automatic scheduling, and operations such as bringing a service online or expanding capacity require the IP and port to be filled in by hand.

This will bring 2 problems:

The first is service capacity management. For a service that cannot scale out and in automatically, reserving too few resources risks overload during a traffic burst, while reserving too many wastes resources.

The second is mixed deployment of services. If multiple services are deployed on the same machines, they may affect each other. If they are not, the fine granularity of microservices means machine resources are wasted.

The most natural idea for solving these problems is to use the scheduling capabilities of K8S to scale TARS services out and in automatically. But then a new problem arises: many TARS capabilities (such as the name service) overlap with K8S, and integrating the two is not easy.

Under the K8S ecosystem, Istio is the most popular microservice solution. Should we abandon TARS and change to Istio? Here is an analysis of the problems that will be encountered.

The first is that service code has to be modified. This is a very real problem, but it is not necessarily a good reason not to use Istio.

The second problem with using Istio is that for long-lived TCP connections it can only load balance at the connection level, while TARS balances at the request level, which is more capable.

The third is that the TARS client has circuit-breaking capability, which makes services more robust.

The fourth is that TARS has a name service with SET/IDC grouping, which is easier to operate and maintain when deploying across multiple clusters and multiple machine rooms.

The fifth is support for automatic port management. Business code does not need to care about ports, which avoids the trouble of managing ports when there are many services.

The sixth is better support for logging, monitoring and call chains. For example, although Istio also supports call chains, the business code still has to pass the call chain context through. A common piece of logic then has to be implemented in two places, and intuition tells me that this is prone to problems.

The seventh is maturity. Istio is still iterating rapidly and is costly to put into production, whereas TARS has been verified by a large number of business services and is relatively mature and stable.

Therefore, it is necessary to continue to use TARS and transform TARS to support automatic scheduling in K8S.

The main difficulty in implementation is integrating the overlapping functions. The capabilities that TARS lacks are automatic resource scheduling, Docker image management and network virtualization. The capabilities that TARS has but K8S lacks are a high-performance RPC framework and logging, monitoring and call chain services. Service deployment, version release and management, name service and configuration management are functions that both TARS and K8S have.

Therefore, we have to choose between the overlapping functions. Because K8S supports automatic service scheduling and is the mainstream industry standard, service deployment can only go to K8S. K8S also uses Docker images that include the basic environment dependencies, which is a better version management solution. For the name service, TARS supports traffic control, SET/IDC grouping, client fault tolerance and other functions that K8S lacks, so the TARS name service has to be retained. TARS configuration management supports version management as well as referencing and merging of application-level, SET-level and service-level configuration, which makes it more suitable for production use than the K8S ConfigMap solution.

After analyzing the overlapping functions of K8S and TARS, we recently implemented and open sourced such a solution. K8S is not modified and only native K8S capabilities are used. Several interfaces are added to TarsRegistry for automatic service registration, heartbeat reporting and node offlining. For crash scenarios, nodes that have not reported a heartbeat for a long time are deleted automatically. In addition, a new command line tool was implemented to manage the service inside the container, providing functions such as generating the startup configuration, service registration and heartbeat reporting. The advantage of this solution is that it retains the native TARS development framework and service governance capabilities, while K8S is used for service deployment and scheduling.

The life cycles of the K8S pod and the TARS service are closely tied. After the pod starts, the tarscli supervisor process is started by default. It generates the service startup configuration, launches the service and monitors its status, and registers the service with the name service. The hzcheck command is used to check the pod and service status. When the pod exits, the node is automatically taken offline through the prestop subcommand.

How do you deploy a TARS service to K8S? First, deploy the basic TARS services, including the database, the name service, the web management page and the basic services related to service governance. Our open source project provides YAML files that deploy these services directly on K8S. Business services can then also be deployed through YAML files. Here are two examples. For specific usage, please refer to the documentation on the project's home page. After deployment is complete, you can see the list of service nodes on the tarsweb page.

On the management page there is a service monitoring curve by default, where you can see the service's call volume, latency, error rate and timeout rate at a glance. In addition, TARS supports lossless changes: the monitoring shows that the business curve does not fluctuate while the image is being changed.

Finally, talk about TARS and cloud native:

I think the boss will like cloud native, because instead of hiring many people to develop and maintain infrastructure, you can use open source solutions directly and host the services with a cloud vendor.

TARS itself has rich service governance capabilities, which improve development efficiency and reduce operation and maintenance costs. With the combination of TARS and K8S, services can be deployed to a suitable cloud vendor without changing business code, or at a small cost. This reduces the cost of operating the infrastructure and also avoids lock-in to a cloud vendor. Therefore, I recommend the TARS+K8S cloud native solution. It is currently open source on GitHub, and everyone is welcome to follow it. You can find it by searching for k8stars on GitHub.

This is my sharing, thank you!

Origin blog.csdn.net/Tencent_TEG/article/details/108211574