ICBC's practice of building a financial microservice architecture on Dubbo: service discovery


Author | Zhang Yuanzheng
Source | Alibaba Cloud Native Official Account

Introduction: Dubbo is a distributed microservice framework, and many companies have built distributed system architectures on it in practice. Since its open source restart, we have seen not only the release of Dubbo 3.0's latest roadmap, but also Alibaba beginning to merge Dubbo with its internal HSF framework in its own e-commerce systems and to use Dubbo 3.0 for Double 11. This article shares the financial microservice architecture ICBC built on Dubbo, focusing on its response strategies and results for service discovery. Follow-up articles will cover ICBC's practice of large-scale service monitoring and governance, and how it re-develops Dubbo from an enterprise perspective. Welcome to follow.

Background and overview

ICBC's traditional business systems were generally based on the JEE monolithic architecture. Facing the trend toward online and diversified financial business, the traditional architecture could no longer meet business needs. Starting in 2014, ICBC therefore chose one business system for a service-oriented pilot and verified, evaluated, and compared several distributed service frameworks available at the time, finally selecting Dubbo, which was relatively complete and already used by many domestic companies. ICBC also customized Dubbo to help this business system complete its service-oriented transformation, which achieved very good results after going live.

In 2015, ICBC began to expand the scope of its service architecture. On the one hand, it helped traditional business systems transform their architecture; on the other, it gradually built up very large, middle-office-style service groups to support rapid service composition and reuse across business systems. As experience accumulated, ICBC kept iterating on and customizing Dubbo, and gradually built a comprehensive service ecosystem around it.

In 2019, ICBC's microservice system was officially promoted to one of the key capabilities of ICBC's open-platform core banking system, helping ICBC's IT architecture achieve a truly distributed transformation.

The composition of ICBC's microservice system is shown in the figure below:

p1.png

  • In terms of infrastructure, both the service nodes of business systems and the working nodes of the microservice platform are deployed on the ICBC cloud platform.

  • In terms of service registration and discovery, besides the conventional service registry, a metadata center is deployed to implement registration and discovery by node.

  • In terms of service configuration, an external distributed configuration center enables unified management and distribution of all kinds of dynamic parameters.

  • In terms of service monitoring, service metrics are collected and stored in a unified way and fed into the enterprise monitoring platform.

  • In terms of service tracing, the full call chain of a service is tracked in real time, helping business systems quickly locate failure points and accurately assess the scope of a failure.

  • The service gateway serves the service needs of traditional business systems. On top of Dubbo's service subscription and RPC capabilities, it provides automatic discovery and automatic subscription of new services and new versions, plus protocol conversion (HTTP to RPC), achieving 7×24-hour uninterrupted operation.

  • The service management platform gives operations staff and development testers a one-stop platform for management, monitoring, and queries, improving the efficiency of daily service management.

The biggest challenges

After years of practice at ICBC, this article summarizes the biggest challenges in two aspects:

  • In terms of performance and capacity, the number of online services (that is, the number of service interfaces in Dubbo terminology) has exceeded 20,000, and the number of provider entries on each registry (the cumulative number of providers across all services) has exceeded 700,000. Based on assessments, the platform will need to support 100,000-level services and 5 million provider entries per registry in the future.

  • In terms of high availability, ICBC's goal is that the failure of any node on the microservice platform must not affect online transactions. The bank's business systems run 7×24 hours; even within version-release windows, the release times of different business systems are staggered, so when the platform's own nodes need to be upgraded, any impact on online transactions must be avoided, especially during version upgrades of the registry itself.

This article will first share ICBC’s response strategies and effectiveness from the perspective of service discovery.

Service discovery difficulties and optimization

1. Getting Started

p2.png

In Dubbo, service registration, subscription, and invocation follow a standard paradigm. The service provider registers its services at initialization; the service consumer subscribes at initialization and obtains the full provider list. At runtime, whenever providers change, consumers receive the latest provider list. RPC calls between consumers and providers are point-to-point and do not go through the registry.
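As a minimal sketch of this paradigm (not Dubbo's actual implementation; `InMemoryRegistry` and all names below are illustrative), an in-memory registry that pushes the full provider list to subscribers on every change can be written as:

```java
import java.util.*;
import java.util.concurrent.*;
import java.util.function.Consumer;

// Illustrative sketch of the register/subscribe/notify paradigm described above.
class InMemoryRegistry {
    private final Map<String, Set<String>> providers = new ConcurrentHashMap<>();
    private final Map<String, List<Consumer<Set<String>>>> listeners = new ConcurrentHashMap<>();

    // Provider side: register an address for a service interface, notifying subscribers.
    void register(String service, String address) {
        providers.computeIfAbsent(service, s -> ConcurrentHashMap.newKeySet()).add(address);
        notifyListeners(service);
    }

    // Consumer side: subscribe and receive the full provider list, now and on every change.
    void subscribe(String service, Consumer<Set<String>> listener) {
        listeners.computeIfAbsent(service, s -> new CopyOnWriteArrayList<>()).add(listener);
        listener.accept(lookup(service));
    }

    Set<String> lookup(String service) {
        return providers.getOrDefault(service, Collections.emptySet());
    }

    private void notifyListeners(String service) {
        Set<String> current = lookup(service);
        listeners.getOrDefault(service, Collections.emptyList())
                 .forEach(l -> l.accept(current));
    }
}
```

In the real setup, the notification arrives from the registry over the network, after which the consumer makes point-to-point RPC calls to an address chosen from its locally cached list.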

For the choice of registry, ICBC selected Zookeeper in 2014. Zookeeper is widely used at scale across the industry, supports clustered deployment, and guarantees data consistency between nodes through its CP model.

p3.png

Inside Zookeeper, Dubbo creates a node per service. Each service node has four types of child nodes: providers, consumers, configurators, and routers:

  • providers (ephemeral nodes): records the list of service providers. When a provider goes offline, its child node is deleted automatically, and through Zookeeper's watch mechanism consumers immediately learn that the provider list has changed.

  • consumers (ephemeral nodes): records the list of consumers, mainly used to look up consumers during service governance.

  • configurators (persistent nodes): stores service parameters that need to be adjusted during service governance.

  • routers (persistent child nodes): used to configure dynamic routing policies for the service.
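Putting the four node types together, a hypothetical interface com.demo.AccountService produces a layout under Dubbo's default /dubbo root roughly like this (URLs abbreviated):

```
/dubbo
└── com.demo.AccountService
    ├── providers      (ephemeral)   dubbo://10.0.0.1:20880/com.demo.AccountService?...
    ├── consumers      (ephemeral)   consumer://10.0.0.2/com.demo.AccountService?...
    ├── configurators  (persistent)  override rules
    └── routers        (persistent)  routing rules
```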

p4.png

In the online production environment, Zookeeper is deployed as multiple clusters in the data center, each configured with five voting nodes and several Observer nodes. The Observer is a node type introduced in Zookeeper 3.3: it does not take part in elections and only listens to the voting results; in other respects it behaves like a follower. Observer nodes bring the following benefits:

p7 is discussed below; the benefits are:

  • Offloading network pressure: as the number of service nodes grows, if clients all connect to voting nodes, handling network connections and requests consumes a great deal of the voting nodes' CPU. Yet voting nodes cannot be scaled out at will: the more voting nodes, the longer the transaction voting process takes, which hurts high-concurrency write performance.

  • Reducing cross-city, cross-DC registration and subscription traffic: when 100 consumers need to subscribe to the same service across cities, an Observer can handle that cross-city traffic in a unified way, avoiding pressure on inter-city network bandwidth.

  • Client isolation: several Observer nodes can be dedicated to a key application, guaranteeing isolation of its network traffic.
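For reference, Observers are declared in zoo.cfg by tagging their server lines with :observer, and the Observer node itself additionally sets peerType=observer. The hostnames and ports below are illustrative:

```
# zoo.cfg on every node: the cluster definition, with Observers tagged explicitly
server.1=zk-vote-1:2888:3888
server.2=zk-vote-2:2888:3888
server.3=zk-vote-3:2888:3888
server.4=zk-vote-4:2888:3888
server.5=zk-vote-5:2888:3888
server.6=zk-obs-1:2888:3888:observer
server.7=zk-obs-2:2888:3888:observer

# additionally, in zoo.cfg on the Observer nodes only:
peerType=observer
```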

2. Problem analysis

Drawing on several hard years of running Zookeeper online, ICBC has summarized the problems Zookeeper faces when serving as a service registry:

p5.png

  • As the number of services and provider nodes grows, the volume of data pushed explodes. Suppose a service has 100 providers that start up one by one. Because of Zookeeper's one-shot watch mechanism, every time a provider comes online each consumer receives an event notification, re-reads the full current provider list from Zookeeper, and refreshes its local cache. In this scenario each consumer theoretically receives 100 event notifications and reads the provider list 100 times: 1 + 2 + 3 + ... + 100, i.e. 5,050 provider entries in total. The problem is especially prominent during production peaks, when it can saturate the Zookeeper cluster's network, making service subscription extremely slow and in turn hurting service registration performance.

  • As the number of nodes written to Zookeeper grows, Zookeeper's snapshot files keep growing too. Every snapshot written to disk causes a burst of disk I/O. During production peaks, the high transaction volume also makes snapshots frequent, which puts greater risk on the infrastructure. And the larger the snapshot file, the longer a failed Zookeeper node takes to recover.

  • When the Zookeeper leader is re-elected, Observer nodes must synchronize the full transaction set from the new leader. If this phase takes too long, client sessions connected to the Observers easily time out, so the ephemeral nodes under providers are all deleted. From the registry's point of view these services are offline, and consumers report "no provider" errors. The providers then reconnect to Zookeeper and re-register their services. Such large-scale registration flapping within a short period often causes even more severe registration-push performance problems.
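The push amplification described in the first problem above is easy to quantify. A small sketch (assuming, as in the text, that the full child list is re-read on every online event) shows where the 5,050 figure comes from:

```java
// Sketch: how many provider entries one consumer reads while N providers of the
// same service come online one by one, if each event triggers a full re-read.
class PushAmplification {
    static long entriesRead(int providerCount) {
        long total = 0;
        for (int online = 1; online <= providerCount; online++) {
            total += online; // the i-th online event triggers a read of i entries
        }
        return total; // equals providerCount * (providerCount + 1) / 2
    }
}
```

With 1,000 providers per service the same formula already yields 500,500 entries per consumer, which is why the effect dominates at scale.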

In summary, Zookeeper is generally competent as a registry, but at larger service scales it needs further optimization.

3. Optimization plan

ICBC's main optimization measures include the following: delayed subscription updates, the multiple-registry mode, and an upgrade to registration by node.

1) Delayed subscription update

p6.png

ICBC optimized the Zookeeper client component zkclient, adding a small delay between a consumer receiving an event notification and fetching the provider list.

Zookeeper watches are one-shot. When zkclient receives a childChange event, EventThread calls installWatch() to re-register the watch on the node, and at the same time calls getChildren() to read all child nodes under it, obtaining the provider list and refreshing the local provider cache. This is the source of the "5,050 entries" problem mentioned earlier.

In ICBC's optimization, after zkclient receives a childChange event, it waits for a short delay before letting installWatch() do its work. If providers change during this wait, no further childChange events are generated, so a burst of changes collapses into a single read.

Some may ask whether this violates Zookeeper's CP model. It does not: the data on the Zookeeper server remains strongly consistent, and the consumer still receives the event notification; it merely reads the provider list later. When getChildren() executes, it reads the latest data on Zookeeper, so nothing is lost.

Internal stress tests show that during a large-scale provider rollout, each consumer previously received a total of 4.22 million provider entries; with a one-second delay, that dropped to 260,000, and both the number of childChange events and the network traffic fell to roughly 5% of the original. After this optimization, large-scale service churn during production peaks can be handled calmly.
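The coalescing idea can be sketched with an illustrative DelayedSubscriber class (this is not ICBC's zkclient code; the names are hypothetical). Because the watch is not re-armed until after the delay, a burst of provider changes results in a single full fetch:

```java
import java.util.concurrent.*;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch of the delayed-fetch idea: the Zookeeper watch is one-shot, so by
// waiting before re-reading and re-watching, a burst of changes is absorbed
// into one getChildren()-style fetch.
class DelayedSubscriber {
    private final ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();
    private final AtomicInteger fetches = new AtomicInteger();
    private boolean pending = false;
    private final long delayMs;

    DelayedSubscriber(long delayMs) { this.delayMs = delayMs; }

    // Called when a childChange event arrives. While a fetch is pending, the
    // watch has not been re-registered, so further changes produce no events.
    synchronized void onChildChange() {
        if (pending) return;        // burst absorbed: watch not re-armed yet
        pending = true;
        timer.schedule(this::fetchAndRearm, delayMs, TimeUnit.MILLISECONDS);
    }

    private synchronized void fetchAndRearm() {
        fetches.incrementAndGet();  // one read of the latest full provider list
        pending = false;            // stands in for re-registering the watch
    }

    int fetchCount() { return fetches.get(); }
    void shutdown() { timer.shutdown(); }
}
```

Running ten rapid-fire events through this sketch yields exactly one fetch once the delay elapses, mirroring the event and traffic reduction measured above.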

2) Multiple mode

p7.png

ICBC adopted and optimized the registry-multiple SPI implementation from newer versions of Dubbo to improve service subscription in multi-registry scenarios.

Dubbo's original consumer-side logic is as follows: with multiple registries, the consumer selects providers from the invoker cache of the first registry, and only if nothing is found there does it fall back to the cache of the second registry. If the first registry has an availability problem at that moment and pushes missing or even empty data to the consumer, the consumer's provider selection is affected, for example "no provider" exceptions or unbalanced call load.

The multiple-registry implementation instead merges the data pushed by all registries before updating the cache. Even if a single registry fails and pushes incomplete or empty data, as long as the data from any other registry is complete, the final merged result is unaffected.
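A sketch of the merge idea (illustrative names, not Dubbo's registry-multiple code; real Dubbo merges invoker objects rather than address strings):

```java
import java.util.*;

// Sketch: provider lists pushed by several registries are merged into one view,
// so an empty or partial push from a failed registry cannot wipe out providers
// still reported by a healthy one.
class MultipleRegistryCache {
    private final Map<String, Set<String>> pushedByRegistry = new LinkedHashMap<>();

    // Record the latest push from one registry (may be empty if it is unhealthy).
    void onPush(String registryName, Collection<String> providers) {
        pushedByRegistry.put(registryName, new LinkedHashSet<>(providers));
    }

    // The consumer always selects providers from the merged view.
    Set<String> mergedProviders() {
        Set<String> merged = new LinkedHashSet<>();
        pushedByRegistry.values().forEach(merged::addAll);
        return merged;
    }
}
```

Contrast this with the original fall-through logic: there, an empty push from the first registry directly shrinks the consumer's candidate set; here, it only removes entries that no other registry still reports.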

In addition, the multiple-registry mechanism is useful with heterogeneous registries: a problematic registry can be taken offline at any time, completely transparently to the service calls on service nodes, which makes it well suited to grayscale pilots and emergency switchover.

There is a further benefit: consumer-side invoker objects occupy JVM memory, and because the multiple mode keeps a single merged cache, it saves consumers roughly half the cost of invoker objects. Multi-registry scenarios are therefore strongly recommended to use the multiple mode.

3) Register by node

p8.png

ICBC backported the service discovery logic of Dubbo 2.7 and Dubbo 3.0, adopting the "register by node" service registration-discovery model. This rests on the iron triangle of configuration center, metadata center, and registry:

  • Configuration center: stores node-level dynamic parameters, as well as the persistent-node data previously written to Zookeeper, such as the service's configurators and routers.

  • Metadata center: stores node metadata, that is, the mapping between each service node name (application-name) and the services it provides, together with the class definition of each service, such as the input and output parameters of each method.

  • Registry: now only needs to store the mapping between the service-provider node name and its actual ip:port.

This change of model has no impact on consumers' service calls: from the "node name to services" mapping in the metadata center and the "node name to ip:port" mapping in the registry, the consumer side rebuilds a provider invoker cache compatible with the legacy interface-level mode.
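The join performed on the consumer side can be sketched as follows (illustrative names; real Dubbo stores URL metadata, not bare strings):

```java
import java.util.*;

// Sketch of application-level ("register by node") discovery: the registry maps
// application name -> addresses, the metadata center maps application name ->
// services; the consumer joins the two to rebuild the per-service provider cache.
class NodeLevelDiscovery {
    final Map<String, Set<String>> registry = new HashMap<>();  // app -> ip:port set
    final Map<String, Set<String>> metadata = new HashMap<>();  // app -> service set

    // Rebuild the interface-level view: service -> provider addresses.
    Map<String, Set<String>> serviceInvokerCache() {
        Map<String, Set<String>> cache = new HashMap<>();
        metadata.forEach((app, services) -> {
            Set<String> addresses = registry.getOrDefault(app, Collections.emptySet());
            for (String service : services) {
                cache.computeIfAbsent(service, s -> new HashSet<>()).addAll(addresses);
            }
        });
        return cache;
    }
}
```

The data reduction follows directly: an application exposing 50 services from 100 nodes writes 100 registry entries instead of 5,000 interface-level ones, since the per-service fan-out moves into the (much smaller, mostly static) metadata center.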

Stress tests show that registering by node reduces the data volume on the registry to 1.68% of the original. This volume puts no pressure on the online Zookeeper: 100,000 services and 100,000 nodes can be supported with ease.

Future plan

In the future, ICBC hopes to step out and participate deeply in the community, contributing its own features to Dubbo, the Zookeeper server, and zkclient. Beyond the optimizations above, ICBC has also built capabilities on Dubbo such as refined identification of RPC results, PaaS adaptation, multiple protocols on the same port, and self-isolation, and has added a registration circuit-breaker mechanism to Zookeeper. It is also studying the Observer synchronization mechanism to avoid the series of problems caused by full data synchronization.

In addition, from the perspective of microservice evolution, service mesh is one of today's hot topics. ICBC's main pain point is SDK version upgrades for services: Istio's support here is not yet mature, and the fate of MCP remains uncertain. ICBC has made a preliminary plan for smoothly migrating its existing Dubbo services to a mesh architecture, but many technical difficulties remain to be overcome.

Dubbo practitioners are welcome to discuss the problems and experience of large-scale scenarios together, to make Dubbo's enterprise adoption even better!


Origin blog.51cto.com/13778063/2561347