Practical experience implementing microservices in low-fault-tolerance business scenarios

"Health examination is a low-fault-tolerant scenario. When a user goes to a hospital for a physical examination and cannot complete the scheduled project due to IT reasons, it will have a great impact on the user experience."

——Deng Zhihao, CTO of Helian Health

Founded in 2014, Helian Health is a health management service company that started from the medical examination scenario. For hospitals, Helian provides a set of SaaS services covering pre-examination, examination, and post-examination; for enterprises, it provides group medical examinations and health management, with Lee Kum Kee and PwC among its customers; for families, it provides a health management app. Helian currently covers more than 200 cities and more than 2,000 hospitals across the country.

What stages of technological development has Helian Health experienced?

The first stage: monolithic application. Going from 0 to 1, iteration was very fast, and failures were frequent. The business needed Helian to iterate and validate as quickly as possible. At that time, Helian also used a container management service provided by Alibaba Cloud Jushita (Jushi Tower), which can be regarded as an early form of containerization. In summary, the focus was on speed, at the price of technical debt, frequent failures, and unmet business expectations.

The second stage: microservices. As Helian connected more and more hospitals, failures increased and customers complained a lot. The development team spent its days "firefighting". Helian then began modular decoupling and service splitting, and introduced Dubbo and Nacos. However, the team's understanding of the business was not yet deep enough, and the services were split badly. This led to a lot of cross-calls between services and to a "super service" that almost every interface called, which hurt stability. In summary, splitting microservices without a deep understanding of the business treats the symptoms, not the root cause.

The third stage: microservice reconstruction. Focusing on horizontal capabilities such as orders, order placement, and data synchronization, Helian rearranged its modules and services, moved the deployment architecture to K8s, and replaced some of the middleware used for service governance with cloud services such as Alibaba Cloud Microservice Engine MSE [1]. At this point, the entire system is relatively stable. In summary, building microservices around the business and combining them with the advantages of the cloud improves development and operations efficiency as well as online stability.

What are the different technical challenges in the low-fault-tolerant medical examination business?

Low fault tolerance is the defining characteristic of Helian's business. For example, if a user goes to a hospital for a physical examination and a scheduled item cannot be completed because of an IT failure, the user experience suffers badly. This applies not only to physical examinations: the entire medical industry has low fault tolerance. In addition, most people have a physical examination only once or twice a year, so it is a very low-frequency scenario and traffic is correspondingly low. The problem with low traffic is that grayscale release is almost ineffective. Even a full release may not surface bugs, and some bugs are not discovered until a year after the code was released.

Therefore, Helian must first solve the problem of complex logic, and must do modularization and decoupling.

If business decoupling were the only goal, modularization would be enough. In Java, for example, you can divide the code into modules packaged as JARs and use Maven to manage their dependencies. In the early days, however, many architectures served different businesses from a single package that contained many business modules with no isolation between them. Without a microservice split, a problem in the enterprise-facing business code could bring down the low-fault-tolerance hospital business, which is unacceptable.

Therefore, Helian went straight to a service-oriented architecture: services are separated, shared basic services are available to call, and different businesses do not affect one another. Servitization achieves not only business decoupling but also service layering, which protects the core fulfilment services. For example, for a business with very low fault tolerance, a dedicated guarantee service can be built for failure scenarios. In addition, each service can go through independent quality inspection, which is not possible when everything is packaged together.
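The idea of a guarantee service behind a core fulfilment call can be sketched roughly as follows. This is a minimal illustration rather than Helian's implementation: the class name GuaranteedFulfilment, the "AUTO:"/"MANUAL:" result strings, and the in-memory manual queue are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

/** Hypothetical sketch: a core fulfilment call that degrades to a guarantee path. */
class GuaranteedFulfilment {
    /** Orders handed over to a (hypothetical) manual fulfilment service. */
    private final List<String> manualQueue = new ArrayList<>();
    /** The automated path, for example a call to a hospital-facing service. */
    private final Function<String, String> primary;

    GuaranteedFulfilment(Function<String, String> primary) {
        this.primary = primary;
    }

    /** Try the automated path first; on a runtime failure, queue the order for manual handling. */
    String fulfil(String orderId) {
        try {
            return primary.apply(orderId);
        } catch (RuntimeException e) {
            manualQueue.add(orderId);
            return "MANUAL:" + orderId;
        }
    }

    List<String> manualQueue() {
        return manualQueue;
    }
}
```

The point of the design is that the caller always gets an answer: either the automated result or a record that the order now sits with the guarantee path, so a failure in one layer does not leave the user without a fulfilment route.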

There are two main ways to split services: by business and by capability, and different businesses can call each other. Helian's resulting structure is shown in the figure above. The split is based mainly on capability and supplemented by business. For example, the front end is the web service, the blue blocks are the core business services that iterate frequently, and the bottom layer is split by capability into three services: order, payment, and message. The next level down is further from the business, such as the hospital data synchronization service and the manual contract fulfillment service, which are self-built independent services.

The most frequently iterated business services are separated from the relatively stable services, and the two sides are connected over HTTP. Within the business cluster, Dubbo is used for RPC, Nacos as the registry and configuration center, and RocketMQ for asynchronous messaging.

Practical experience in the process of microservice evolution

For microservices, Helian uses the technology stack of Dubbo + Nacos.

Dubbo is an RPC framework based on Java interfaces. For Java programmers, a class becomes a microservice with a few simple annotations, so it was easy to roll out in the team. Calls are also almost non-invasive: a service can be injected by changing @Autowired to @DubboReference. Dubbo's integration with Nacos is very complete. It works with just a few lines of configuration, and the control panel is simple and easy to use. Like Dubbo, Nacos comes from the Chinese open-source community, which lowers the barrier for the team's programmers.

In the early days, Helian ran a self-built community edition of Nacos and hit a serious performance bottleneck. The Dubbo 2 service model was interface-based: every interface was registered as its own service, so the traffic to the registry was very large. Alibaba Cloud's microservice engine MSE helped Helian withstand the pressure from Dubbo. It has good compatibility, and Helian followed the community in upgrading to Dubbo 3, which resolved the problem with the Dubbo 2 service model. In addition, MSE is well tuned in terms of memory, which improved business performance by 4 times and reduced resource costs.

Helian serves a large number of hospitals whose needs are uncertain and differ from one another, so there are a large number of feature switches. Operating such switches is risky and is generally done by developers, and MSE addresses this pain point well. Feature switches in MSE can be configured dynamically without restarting the application. They can also be combined with Alibaba Cloud Key Management Service (KMS) to encrypt the stored data transparently to the user.
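To make the idea of per-hospital switches that change without a restart concrete, here is a minimal in-memory sketch. It does not use the MSE or Nacos client API: the class FeatureSwitches, the method onConfigChanged, and the key format are assumptions for illustration, standing in for whatever listener callback the configuration center would invoke.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical per-hospital feature switches that can change at runtime. */
class FeatureSwitches {
    /** Thread-safe map so request threads can read while a listener thread writes. */
    private final Map<String, Boolean> switches = new ConcurrentHashMap<>();

    /** Would be called from a config-change listener when a new value is pushed. */
    void onConfigChanged(String hospitalId, String feature, boolean enabled) {
        switches.put(key(hospitalId, feature), enabled);
    }

    /** Unknown switches default to off, the conservative choice for a low-fault-tolerance flow. */
    boolean isEnabled(String hospitalId, String feature) {
        return switches.getOrDefault(key(hospitalId, feature), false);
    }

    private static String key(String hospitalId, String feature) {
        return hospitalId + ":" + feature;
    }
}
```

Business code would check isEnabled on each request, so a pushed change takes effect on the next call without redeploying the service.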

The HTTP gateway mainly solves the problem of protocol conversion. The business logic in Helian's app front end is heavy, so the back end does not need to wrap results in any way and only has to expose service capabilities. Helian therefore modified the open-source Apache ShenYu gateway to convert HTTP to Dubbo, support both POST and GET, and handle authentication and authorization in the gateway.
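The shape of such a gateway can be sketched as a route table that checks the request first and then maps an HTTP path to an RPC target. This is not Apache ShenYu's API or Helian's modification: the class HttpToRpcRoutes, the "interface#method" target format, and the simple non-empty token check are assumptions made for the example, and a real gateway would go on to invoke the target over Dubbo.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

/** Hypothetical route table: maps an HTTP path to an RPC interface and method. */
class HttpToRpcRoutes {
    /** path -> "interfaceName#method"; the target format is illustrative only. */
    private final Map<String, String> routes = new HashMap<>();

    void register(String path, String interfaceName, String method) {
        routes.put(path, interfaceName + "#" + method);
    }

    /**
     * Resolve a request to its RPC target. Only GET and POST are accepted, and the
     * token check happens here, in the gateway, before any backend is resolved.
     */
    Optional<String> resolve(String httpMethod, String path, String token) {
        boolean supported = "GET".equals(httpMethod) || "POST".equals(httpMethod);
        if (!supported || token == null || token.isEmpty()) {
            return Optional.empty();
        }
        return Optional.ofNullable(routes.get(path));
    }
}
```

Keeping the authentication check in front of route resolution means backend services only see requests that have already passed the gateway.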

In terms of DevOps, Helian uses ACK [2] for K8s and image-based release and rollback, and Yunxiao (Alibaba Cloud DevOps) for continuous integration, which together have made releases extremely efficient. Helian releases up to 20-30 times a week, and the time for a single release has dropped from the original 2-3 hours to under 8 minutes. In addition, Helian has implemented service isolation based on Dubbo. For example, two copies of the same service can be deployed with the same code and usage but as different instances. Each has its own memory, so when one fails, the other is not affected. This capability is still weak, however, and strengthening the control plane is the direction for future development.

Future Planning of Microservices

In the future, Helian hopes to implement a Service Mesh control plane.

As shown in the figure above, when a service request arrives and it is req*, it should be routed to the special version ServiceA*. A message that the request then sends through MQ should likewise be consumed by Service* rather than by the regular Service, so that routing holds across the whole link. The managed Istio offered by Alibaba Cloud Service Mesh (ASM) already has these capabilities and also provides basic Dubbo governance capabilities [3]. Helian will explore how to integrate with and evolve on ASM in the future.

The purpose of implementing Service Mesh is to reduce the cost of test environments. Helian's large cluster currently holds 7-8 sets of test environments, one per business team, so that teams do not interfere with each other, but the cost is too high. If whole-link routing can be achieved, each development team only needs to deploy the services it is testing and can use tagged traffic to reach them.

Following current industry practice, whole-link grayscale routing identifies and tags traffic at the gateway, with a separate label for each test environment. Every hop in a service call passes the traffic label along, and at each hop the route is chosen by matching the traffic label against the labels of the candidate instances. In the end, each environment only needs to deploy the services it has modified and can reuse the baseline environment's services for everything else, which reduces the overall cost.
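The matching rule described above, prefer an instance whose label equals the traffic label and otherwise fall back to the baseline environment, can be sketched as follows. This is a simplified illustration and not how Istio, ASM, or Dubbo implement routing: the class LabelRouter, the "baseline" label name, and the addresses are assumptions made for the example.

```java
import java.util.List;
import java.util.Map;

/** Hypothetical label-based instance selection with fallback to the baseline environment. */
class LabelRouter {
    static final String BASELINE = "baseline";

    /**
     * instances maps an environment label to the addresses deployed under that label.
     * Returns the first instance matching the traffic label, otherwise a baseline instance.
     */
    static String pick(Map<String, List<String>> instances, String trafficLabel) {
        // Untagged traffic (null label) goes straight to the baseline lookup.
        List<String> matched = trafficLabel == null ? null : instances.get(trafficLabel);
        if (matched == null || matched.isEmpty()) {
            matched = instances.get(BASELINE);
        }
        if (matched == null || matched.isEmpty()) {
            throw new IllegalStateException("no instance available");
        }
        return matched.get(0);
    }
}
```

With this rule, a team that has modified only one service deploys just that service under its own label. Calls to that service carry the label and reach the modified copy, while calls to every other service fall through to the shared baseline instances.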

In addition, Helian will implement a full HTTP gateway. The trend is for front ends to become heavier and heavier, so the back end no longer needs its own web layer and back-end services can be exposed directly to the front end. Helian is therefore considering replacing all web layers with BFF gateways, and looks forward to keeping pace with the community and growing together with the cloud-native community.

Reference link:

[1] https://www.aliyun.com/product/aliware/mse

[2] https://www.aliyun.com/product/kubernetes

[3] https://help.aliyun.com/document_detail/214749.html

This article is the original content of Alibaba Cloud and may not be reproduced without permission.