When we are talking about "service governance", what are we talking about?

High concurrency and high availability architecture series

Building large-scale distributed services, step by step

By reading this article (and the rest of the series), you will gain:

  1. A comprehensive, systematic understanding of service governance in large-scale distributed systems
  2. How top-tier Internet companies handle high-concurrency, high-traffic scenarios, and a look inside their stability assurance systems (essential for high concurrency and high availability)
  3. How common rate-limiting algorithms are implemented, plus the design principles and practical experience behind Sentinel, the rate-limiting and circuit-breaking tool Alibaba uses for Double Eleven (essential for high concurrency and high availability)
  4. The essence, architectural design ideas, principles, and practical experience of a high-performance, highly available configuration center (essential for microservice architectures)
  5. The essence, architectural design ideas, principles, and practical experience of a high-performance, highly available service registry (essential for microservice architectures)
  6. Common pain points in Internet companies' technical architecture, with the architectural vision and solutions that address them (technical breadth and thinking ability)

When we are talking about "service governance", what are we talking about?

At the start of my career, I worked with a number of cross-language distributed systems built on Web Services, Hessian, and similar technologies. SOA architecture and its concepts were very popular then, and I often heard senior colleagues talk about "SOA governance" and the like. I did not understand what "governance" meant at the time, and even wondered: why not just call it "management"? Until then, the only place I had met the word was "sewage treatment" in primary school textbooks (in Chinese, "treatment" and "governance" are the same word). In recent years, as Internet companies moved to services at scale and open source service frameworks such as Dubbo and Spring Cloud caught on, "service governance" became a popular term again.
So what exactly is "governance"? Large Internet companies often run tens of thousands of applications, and even medium-sized companies have at least several hundred. With the rise of microservices, the number of services grows by the day, and governance becomes urgent. The root cause is high complexity, which has to be sorted out, standardized, and optimized. The essence of architecture is to manage complexity and meet the demands of different stakeholders. Below I briefly summarize the main aspects of the "service governance" field (the list is not exhaustive), so that you can build a comprehensive and systematic understanding as a basis for deeper study and practice.

Service definition and management

How should a service be defined and exposed? How should upstream callers invoke it? How fine-grained should the split be? All of these need unified specifications and constraints; without them, organization and collaboration quickly become a mess. I once heard an architect say that "microservices means one interface per service...", and I have taken over systems built that way. This is the typical "armchair theorist" who never grasps the essence of a technology or methodology.

As another example, when we develop a business feature we often find that we depend on another domain. We then need to know which services a given subdomain or application in the company provides, whether they offer the business capability we want, and what the service contract looks like so that we can integrate quickly. That requires a good management mechanism and platform support. A start-up with only a few dozen applications or services and 20 engineers does not face this kind of complexity; people can just talk face to face.

However, the most basic specifications and constraints are necessary for teams of any size. Here are a few rules that come from experience:

  1. Do not use Map types for parameters in interface definitions. An overly flexible structure makes the contract unclear and hard to understand, and the provider's code fills up with conditional checks and special cases.
  2. Do not use enumerations in the DTO that carries the response. If the provider adds an enum value and the consumer has not upgraded the provider's client library, deserialization is very likely to fail. I witnessed production incidents of this kind during my time at Alibaba.
  3. Only add new interfaces; do not modify the definition of an existing one unless you can guarantee that every upstream consumer upgrades at the same time, which most companies clearly cannot.

These problems may look elementary, but many companies handle them poorly. Otherwise the tech community would not have so many stories about "predecessors digging pits for successors to fill".
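The warning about enumerations in response DTOs can be illustrated with a small sketch. It is written in Python rather than with a real RPC serializer, and every name in it (OrderStatusV1, the two DTO classes, the payload) is a hypothetical stand-in: the consumer knows only an older version of the enum, while the provider has already added a new value.

```python
from dataclasses import dataclass
from enum import Enum


class OrderStatusV1(Enum):
    """The enum as known to a consumer that has NOT upgraded the client library."""
    CREATED = "CREATED"
    PAID = "PAID"


@dataclass
class OrderDtoWithEnum:
    order_id: str
    status: OrderStatusV1


@dataclass
class OrderDtoWithCode:
    order_id: str
    status: str  # plain status code; unknown values are carried through untouched


def decode_with_enum(payload: dict) -> OrderDtoWithEnum:
    # Raises ValueError when the provider sends a value this consumer does not know.
    return OrderDtoWithEnum(payload["order_id"], OrderStatusV1(payload["status"]))


def decode_with_code(payload: dict) -> OrderDtoWithCode:
    # Never fails on a new value; the caller decides how to treat unknown codes.
    return OrderDtoWithCode(payload["order_id"], str(payload["status"]))


# The provider has since added a new status value.
new_payload = {"order_id": "42", "status": "REFUNDED"}
```

With the enum-typed DTO the old consumer fails on the whole response as soon as "REFUNDED" appears, while the string-typed DTO decodes it and leaves the handling of unknown codes to business logic.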

Service registration and discovery

The core problems in this area are how providers register their services and how consumers quickly discover and select them. Which open source registry should you choose? Is the registry's capacity sufficient? When a provider goes offline, can the registry detect it and notify consumers promptly? Should you adopt open source or build in-house? These are all questions an architect has to consider. Service registration and discovery touches on network communication, load balancing, health checks, data storage, distributed election algorithms and protocols, and more. I will cover them in a separate chapter later.
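To make the register/heartbeat/discover cycle concrete, here is a minimal in-memory sketch. It is not how any particular registry works; the class name, the TTL-based health check, and the injectable clock are all assumptions made for illustration. Real registries add persistence, replication, and push notification on top of this idea.

```python
import time
from typing import Callable, Dict, List


class InMemoryRegistry:
    """Toy registry: an instance is alive while its last heartbeat is within the TTL."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        # service name -> {instance address -> time of last heartbeat}
        self._instances: Dict[str, Dict[str, float]] = {}

    def register(self, service: str, address: str) -> None:
        self._instances.setdefault(service, {})[address] = self._clock()

    def heartbeat(self, service: str, address: str) -> None:
        # A heartbeat simply refreshes the timestamp of the instance.
        self.register(service, address)

    def deregister(self, service: str, address: str) -> None:
        self._instances.get(service, {}).pop(address, None)

    def discover(self, service: str) -> List[str]:
        now = self._clock()
        alive = {
            address: seen
            for address, seen in self._instances.get(service, {}).items()
            if now - seen <= self._ttl
        }
        self._instances[service] = alive  # evict instances whose heartbeat expired
        return sorted(alive)
```

A provider calls register once and heartbeat periodically; a consumer calls discover and then applies its own load-balancing policy to the returned addresses. An instance that stops sending heartbeats disappears from the result once the TTL has passed.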


Service call, routing, fault tolerance

For service calls, the communication protocol can be TCP or UDP at the transport layer, or HTTP at the application layer. In terms of style, the main choices are binary RPC and HTTP+JSON. (Let me correct a common misunderstanding here: the so-called "RESTful" services at many companies are really just HTTP interfaces, in the same way that what many people call "H5" is just a small page adapted to a phone's screen size, with none of the new features of HTML5.) Taking an RPC framework as an example, there is a custom private protocol at the codec level, and above it, at the application layer, serialization and deserialization. Whether to use protobuf or Hessian2 for serialization has to be weighed across performance, compatibility, stability, cross-language support, readability, testability, and so on.

Service routing is also easy to understand. The group management offered by the open source Dubbo framework can be seen as a kind of routing, and the "same data center first" capability provided by HSF is a routing strategy too. We sometimes need "blacklist/whitelist" filtering to protect a service, which is a form of conditional routing. For things like "grayscale release, traffic coloring, and environment isolation", we attach special tags, pass them through transparently along the call chain, and route by rule when the service is called. More importantly, the registry's model must be flexible enough to support this kind of tagging. For externally facing services, we route through gateways, Nginx, and the like (which is essentially request forwarding).
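Tag-based routing with a blacklist can be sketched in a few lines. This is an illustrative model only, not Dubbo's or HSF's routing API: the Instance class, the "env" tag key, and the fallback-to-baseline rule are assumptions chosen to show the idea of traffic coloring.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Optional


@dataclass
class Instance:
    address: str
    tags: Dict[str, str] = field(default_factory=dict)  # e.g. {"env": "gray"}


def route(instances: Iterable[Instance],
          request_env: Optional[str] = None,
          blacklist: Iterable[str] = ()) -> List[Instance]:
    """Pick candidate instances for one request.

    1. Drop blacklisted addresses (conditional routing).
    2. If the request carries an env tag and matching instances exist, use them.
    3. Otherwise fall back to baseline instances, i.e. those with no env tag.
    """
    blocked = set(blacklist)
    candidates = [i for i in instances if i.address not in blocked]
    if request_env is not None:
        tagged = [i for i in candidates if i.tags.get("env") == request_env]
        if tagged:
            return tagged
    return [i for i in candidates if "env" not in i.tags]
```

Untagged traffic never reaches the gray instances, and tagged traffic degrades to the baseline pool when no matching instance is registered. That fallback is one possible policy; some teams prefer to fail the request instead.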

Service security: almost every system needs identity authentication, permission checks, and so on. In a distributed microservice architecture, authentication and other cross-cutting concerns are usually placed in the gateway, and external applications must go through the gateway to reach a service. A common approach is to use the OAuth 2.0 protocol, JWT, and similar components. Calls between internal services all stay on the internal network, behind firewalls and other security measures, so for performance and workload reasons many companies do not authenticate them separately. The necessary permission checks still have to be performed, of course. In addition, a WAF (web application firewall) is usually placed in front of the traffic entry point to filter malicious requests. Common security issues include XSS, SQL injection, DDoS, and horizontal privilege escalation; I will not go into them here.

There are many service fault-tolerance patterns. The common ones are roughly as follows:
failfast (fail fast): report the failure immediately. In the Java collections framework, for example, structurally modifying a container such as ArrayList while it is being iterated makes the iterator throw ConcurrentModificationException, which is a typical failfast mechanism. In distributed service calls, the most common failfast mechanism is the timeout; both the service provider and the service caller need to set one;
failover (switch on failure): distributed service calls occasionally hit network jitter, and we usually pick another machine and retry, which is typical failover. Databases and message middleware also often use a master/slave architecture, where a master-slave switch happens automatically on failure to keep the cluster highly available;
failback (automatic recovery from failure): when a request fails or hits an exception, the context should be preserved so that the request can be retried automatically by some mechanism. A typical implementation catches the exception, writes a log entry or a row in an "exception recovery table", and lets a scheduled background task scan and compensate. It usually suits scenarios that are not sensitive to data timeliness and consistency;
failsafe (fail safe): when a request fails or hits an exception, just log it, ignore it, and carry on with the main flow. It suits requests that are off the main path and only weakly depended on, such as reporting operation logs to a big data platform.
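The four patterns differ mainly in what happens after an error, which a short sketch makes visible. The helpers below are hypothetical, not taken from any framework: invoke stands for "call this endpoint and raise on failure", and the in-memory pending queue stands in for the "exception recovery table" that a background task would scan. Timeouts are left out to keep the sketch small.

```python
import logging
from collections import deque
from typing import Any, Callable, Deque, Sequence

log = logging.getLogger("fault_tolerance_sketch")


def fail_fast(endpoints: Sequence[str], invoke: Callable[[str], Any]) -> Any:
    """Try one endpoint only; any error surfaces to the caller immediately."""
    return invoke(endpoints[0])


def fail_over(endpoints: Sequence[str], invoke: Callable[[str], Any],
              max_attempts: int = 3) -> Any:
    """On error, retry on the next endpoint, up to max_attempts endpoints."""
    last_error = None
    for endpoint in endpoints[:max_attempts]:
        try:
            return invoke(endpoint)
        except Exception as exc:
            last_error = exc
    raise RuntimeError("all attempted endpoints failed") from last_error


def fail_safe(endpoints: Sequence[str], invoke: Callable[[str], Any],
              default: Any = None) -> Any:
    """Log the error, ignore it, and let the main flow continue."""
    try:
        return invoke(endpoints[0])
    except Exception as exc:
        log.warning("ignored failure: %s", exc)
        return default


class FailBack:
    """Record failed calls and retry them later from a compensation task."""

    def __init__(self, invoke: Callable[[str], Any]):
        self._invoke = invoke
        self.pending: Deque[str] = deque()  # stand-in for a recovery table

    def call(self, endpoint: str) -> Any:
        try:
            return self._invoke(endpoint)
        except Exception:
            self.pending.append(endpoint)  # keep the context for a later retry
            return None

    def compensate(self) -> int:
        """Retry every recorded call once; return how many succeeded."""
        recovered = 0
        for _ in range(len(self.pending)):
            endpoint = self.pending.popleft()
            try:
                self._invoke(endpoint)
                recovered += 1
            except Exception:
                self.pending.append(endpoint)
        return recovered
```

In practice compensate would run from a scheduled job, and the retried operation must be idempotent, because a request that timed out may in fact have succeeded on the provider side.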

Dependency governance

Start with some soul-searching questions. Which applications depend on my service? Will the volume of upstream calls drag me down? Will my failure cause upstream failures? Which services do I depend on myself? If they go down, will they take me with them? Which dependencies are strong and which are weak? You need to know the answers by heart. The essence of dependency governance is to manage complexity and avoid or reduce risk.
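Questions such as "who depends on me?" and "what do I depend on?" reduce to graph traversal once the call relationships are known. The sketch below assumes a hypothetical call topology (CALLS) written by hand; in a real system that data would come from the registry or from tracing.

```python
from collections import deque
from typing import Dict, Set

# caller -> services it calls directly (hypothetical topology)
CALLS: Dict[str, Set[str]] = {
    "web": {"order", "search"},
    "order": {"inventory", "payment"},
    "search": {"inventory"},
    "inventory": set(),
    "payment": set(),
}


def downstream(calls: Dict[str, Set[str]], service: str) -> Set[str]:
    """Everything `service` depends on, directly or transitively."""
    seen: Set[str] = set()
    queue = deque(calls.get(service, ()))
    while queue:
        current = queue.popleft()
        if current not in seen:
            seen.add(current)
            queue.extend(calls.get(current, ()))
    return seen


def upstream(calls: Dict[str, Set[str]], service: str) -> Set[str]:
    """Everything that depends on `service`, directly or transitively."""
    reverse: Dict[str, Set[str]] = {}
    for caller, callees in calls.items():
        for callee in callees:
            reverse.setdefault(callee, set()).add(caller)
    return downstream(reverse, service)
```

Here upstream(CALLS, "inventory") gives the blast radius if inventory fails, and downstream(CALLS, "web") lists everything whose failure could affect web. Marking each edge as strong or weak would be the natural next step.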

Service monitoring and emergency measures

Service monitoring mainly covers logging, call chain tracing, and metrics, each of which deserves in-depth study. In the cloud-native era, we refer to them collectively as "observability".

A large-scale distributed system usually consists of hundreds or thousands of applications running on thousands of machines. It is no longer possible to SSH into a server and run tail or less, as we did long ago. We first need a unified log format and a unified log path, and then we need to collect, ship, quickly analyze, display, and alert on the logs. The best-known open source stack in this area is ELK.

In the discussion of "dependency governance" we already saw how complex the dependencies in a distributed system are. Call chains are intricate, and personal experience alone is not enough. When a failure or performance problem occurs, we need to see the whole chain at a glance and "follow the vine to the melon" to locate the fault or bottleneck quickly. Representative tools in this area include Pinpoint, CAT, SkyWalking, Taobao's EagleEye, and some commercial APM products.

From the application's perspective, we need to know indicators such as call volume, success rate/error rate, and response time, and we also watch thread pools, slow queries, connection counts, and so on. From the business side's perspective, we care about business indicators such as the current number of orders and the total order amount. All of these indicators are tied to the time dimension, so we store them in a time series database, aggregate and rank them, and then display them or raise alerts.
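As a rough illustration of why these indicators are tied to the time dimension, the sketch below groups raw call records into fixed time buckets and computes call volume, success rate, and latency per bucket. The Call record and the 60-second bucket are assumptions; a time series database performs this kind of aggregation at much larger scale.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass
class Call:
    timestamp: float   # seconds since some epoch
    latency_ms: float
    ok: bool


def aggregate(calls: Iterable[Call], bucket_seconds: int = 60) -> Dict[int, dict]:
    """Group calls by time bucket and compute per-bucket indicators."""
    buckets: Dict[int, List[Call]] = defaultdict(list)
    for call in calls:
        start = int(call.timestamp // bucket_seconds) * bucket_seconds
        buckets[start].append(call)

    result: Dict[int, dict] = {}
    for start in sorted(buckets):
        items = buckets[start]
        latencies = [c.latency_ms for c in items]
        result[start] = {
            "count": len(items),
            "success_rate": sum(1 for c in items if c.ok) / len(items),
            "avg_ms": sum(latencies) / len(latencies),
            "max_ms": max(latencies),
        }
    return result
```

An alerting rule then becomes a simple predicate over these buckets, for example "success_rate below some threshold for several consecutive buckets".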

Rate limiting includes page-level, interface-level, per-source or per-IP, single-machine, cluster, gateway, hot-parameter, and custom rate limiting, among others. The crowd control at Beijing subway entrances during the morning and evening rush hours is a typical example of rate limiting. (Excess traffic is usually handled in one of two ways: "uniform-rate queuing" or "fail fast".)
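A single-machine limiter with the "fail fast" behavior can be sketched with the well-known token bucket algorithm. This is a generic illustration, not Sentinel's implementation; the class name and the injectable clock are assumptions, and the sketch is not thread-safe.

```python
import time
from typing import Callable


class TokenBucket:
    """Token bucket: tokens refill at `rate` per second up to `capacity`."""

    def __init__(self, rate: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic):
        self._rate = rate
        self._capacity = capacity
        self._clock = clock
        self._tokens = float(capacity)  # start full so a short burst is allowed
        self._last = clock()

    def try_acquire(self, permits: float = 1) -> bool:
        """Return True if the request may pass, False if it is rejected at once."""
        now = self._clock()
        # Refill according to the time elapsed since the previous call.
        self._tokens = min(self._capacity,
                           self._tokens + (now - self._last) * self._rate)
        self._last = now
        if self._tokens >= permits:
            self._tokens -= permits
            return True
        return False
```

A caller wraps the protected code in `if bucket.try_acquire(): ... else: reject`. The "uniform-rate queuing" variant would make the caller wait for the next token instead of returning False, which smooths bursts at the cost of added latency.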

Degradation can be classified by trigger into manual and automatic degradation, and by scenario into consistency degradation, completeness degradation, user experience degradation, read/write degradation, and so on. When resources are tight during peak periods, an e-commerce site shuts down its product review and recommendation services; this is a typical "give up the chariot to protect the general" form of degradation. Automatic circuit breaking of an interface is another typical degradation technique.
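Automatic circuit breaking can be sketched as a small state machine with closed, open, and half-open states. This is a simplified, single-threaded illustration under assumed names and thresholds; production tools typically use sliding-window statistics instead of a plain consecutive-failure counter.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitBreaker:
    """Open after N consecutive failures; allow a trial call after a cool-down."""

    def __init__(self, failure_threshold: int, reset_timeout: float,
                 clock: Callable[[], float] = time.monotonic):
        self._threshold = failure_threshold
        self._reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    @property
    def state(self) -> str:
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self._reset_timeout:
            return "half-open"  # cool-down elapsed: one trial call is allowed
        return "open"

    def call(self, func: Callable[[], T], fallback: Callable[[], T]) -> T:
        if self.state == "open":
            return fallback()  # degrade without touching the dependency
        try:
            result = func()
        except Exception:
            self._failures += 1
            if self._failures >= self._threshold:
                self._opened_at = self._clock()  # open, or re-open after a failed trial
            return fallback()
        # Success closes the breaker and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

The fallback is where the degradation strategy lives: return cached data, a default value, or a friendly error. While the breaker is open the failing dependency receives no traffic at all, which gives it room to recover.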

I will cover dependency governance, capacity planning, rate limiting, degradation, and related topics in detail in the later chapter on the stability assurance system.

Service testing

How do you know a service works after it is published? If there is an operations console where I can pick the service, enter parameters, and verify with one click that it responds correctly, isn't that nice? And if, during joint debugging or unit testing, the other side has not implemented the service yet and the two parties have only agreed on the contract, we need to be able to mock the data easily. Traditional testing is usually divided into unit testing, integration testing, component testing, end-to-end testing, and so on, forming the classic "test pyramid" model.


In the microservice era, calls between services have become the focus of "integration testing", so "contract testing" was introduced to ensure that both providers and consumers conform to the agreed specification.
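The core of a contract test can be shown with a very small sketch: the consumer writes down the fields and types it relies on, and the provider's response (real or mocked) is checked against them. The contract format and field names below are hypothetical, and dedicated contract-testing tools offer far richer matching; this only illustrates the principle.

```python
from typing import Dict, List

# What the consumer relies on: field name -> expected Python type (hypothetical contract)
ORDER_CONTRACT: Dict[str, type] = {
    "order_id": str,
    "amount_cents": int,
    "status": str,
}


def contract_violations(contract: Dict[str, type], response: dict) -> List[str]:
    """Return a list of problems; an empty list means the response honors the contract.

    Extra fields in the response are tolerated, so the provider can add
    new fields without breaking existing consumers.
    """
    problems: List[str] = []
    for name, expected in contract.items():
        if name not in response:
            problems.append(f"missing field: {name}")
        elif not isinstance(response[name], expected):
            actual = type(response[name]).__name__
            problems.append(f"{name}: expected {expected.__name__}, got {actual}")
    return problems
```

Run in the provider's build pipeline, a check like this catches a removed or retyped field before release, while the same contract lets the consumer generate mock responses when the provider is not implemented yet.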

Summary

Looking at service governance from an architect's perspective, we need to pay attention to the demands of the stakeholders: development, testing, operations, and the business side. In technology selection and architecture design we have to weigh the trade-offs and meet those demands as far as possible.

Subsequent chapters

Common problems and stability guarantee system in high concurrency scenarios

The design principles and practical experience of Sentinel

Principles and practical experience of high-performance configuration center

Principles and practical experience of service registry


Origin blog.csdn.net/dinglang_2009/article/details/103845408