Best Practices for Intra-City Disaster Recovery of EDAS Microservice Applications

Author: God fish, Alibaba Cloud solutions architect

Preface

 

Moving to the cloud is now the IT infrastructure choice for most companies, but the cloud still carries risks (data center hardware failures, network failures, network or power outages, human error). As a result, the major cloud vendors see failures in one data center or another every year, so business applications need to be built with disaster recovery capabilities. Disaster recovery solutions on the public cloud cover scenarios such as intra-city active-active, cross-region disaster recovery, and multi-site active-active. What most mid-to-long-tail customers on the public cloud need is a disaster recovery solution that is minimally intrusive, or even fully transparent, to the application while still guaranteeing high availability. Intra-city active-active is the preferred option here: once an application is active-active within a city, it can ride out most data center outages.

 

This practice is intended to help you give your business applications intra-city active-active disaster recovery, efficiently and at low cost, based on EDAS. Where other Alibaba Cloud products are required to cover these scenarios, the corresponding solutions are introduced as well. You can refer to the following architecture diagram:

 

image.png

 

Most mainstream architectures that need disaster recovery have already been split into microservices, and a microservice architecture is itself better suited to disaster recovery and high availability. It is generally composed of a gateway (unified access layer), an RPC framework (Dubbo, Spring Cloud), messaging (MQ), a distributed database, a cache, and other core software. With EDAS you can efficiently switch ingress traffic and obtain RPC routing for disaster tolerance, multi-zone deployment, and other capabilities. Refer to the following figure:

 

image.png

 

Introduction to the main products in the solution

 

EDAS

Enterprise Distributed Application Service (EDAS) is a one-stop PaaS platform for application lifecycle management and monitoring. It supports deployment on Kubernetes and ECS, and supports non-intrusive release, operation, and service governance for applications written in Java, Go, Python, PHP, and .NET Core. For Java, it supports all versions of Spring Cloud and Apache Dubbo released in the past five years; for multi-language applications, Service Mesh can be enabled with one click.

Cloud DNS

Alibaba Cloud DNS is a secure, fast, stable, and reliable authoritative DNS (Domain Name System) resolution and management service. For enterprises and developers, it translates easy-to-manage, recognizable domain names into the numeric IP addresses that computers use to communicate with each other, thereby routing user requests to the corresponding websites or application servers.

Load balancing SLB

Server Load Balancer (SLB) is a service that distributes traffic on demand. By distributing traffic across different back-end servers, it increases the throughput of the application system, eliminates single points of failure, and improves the system's availability.

ApsaraDB for RDS

Alibaba Cloud Relational Database Service (RDS) is a stable, reliable, and elastically scalable online database service. Built on Alibaba Cloud's distributed file system and high-performance SSD storage, RDS supports MySQL, SQL Server, PostgreSQL, and other engines, and provides a complete set of solutions for disaster tolerance, backup, recovery, monitoring, and migration, relieving you of routine database operation and maintenance.

 

Disaster recovery solutions in the same city

 

Multi-zone deployment of applications

With EDAS you can quickly deploy application nodes to different availability zones. The following covers the two ways of hosting resources: ECS and K8s.

ECS cluster deployment

Import ECS instances from different availability zones into EDAS, put them in the same cluster, and choose to create an application from the application list:

 

image.png

 

Click Next and select ECS nodes in different availability zones within the cluster to finish creating the application. The application's nodes are then deployed across different availability zones.

 

image.png

 

K8s cluster deployment

Import the K8s cluster you created (with nodes in multiple availability zones) into EDAS. When creating an application, select multi-zone deployment in the application's advanced settings and finish creating it. The application's nodes are then deployed across different availability zones.

 

image.png
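
To make the intent of multi-zone deployment concrete, here is a minimal Python sketch that spreads a number of application instances evenly across availability zones. It is an illustration only, not how EDAS schedules nodes internally; the function name `spread_across_zones` and the zone names are hypothetical.

```python
from collections import Counter
from itertools import cycle


def spread_across_zones(instance_count, zones):
    """Assign instances to zones round-robin so per-zone counts differ by at most one."""
    if not zones:
        raise ValueError("at least one availability zone is required")
    zone_cycle = cycle(zones)
    return [next(zone_cycle) for _ in range(instance_count)]


# Hypothetical example: five instances over two zones.
placement = spread_across_zones(5, ["zone-a", "zone-b"])
per_zone = Counter(placement)
print(per_zone)
```

With an even spread, losing one zone still leaves roughly half of the instances serving traffic, which is the property the rest of the solution builds on.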

 

Highly available traffic access layer

An application deployed by EDAS can mount multiple SLB instances directly to meet disaster recovery needs, without relying on SLB's own disaster recovery mechanism. (SLB's switchover logic only takes effect when the primary availability zone is unavailable as a whole, for example when the entire data center loses power or its outbound optical cable is cut; only then does the load balancer switch to the standby availability zone.) This puts you in control: deploy load balancer instances and EDAS application nodes in multiple availability zones of one region, or in multiple regions, and then use Alibaba Cloud DNS to schedule access across them:

 

image.png

 

In the application list, click the application you deployed across multiple zones in the first step to open its overview page, and use the access method settings to configure multiple SLB instances for the ingress application (the gateway).

 

image.png

 

Use global traffic management to build a flexible DNS resolution scheme: add the SLB instances created above to the global traffic address pool, and configure DNS disaster recovery traffic switching based on health check results. When an availability zone becomes unavailable, resolution automatically switches to the SLB in another, healthy availability zone, giving you intelligent disaster tolerance for access-layer traffic.
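
The health-check-driven switching described above can be sketched roughly as follows. This is a simplified model, not the actual behavior or API of Alibaba Cloud DNS: the `SlbEndpoint` class, the `resolve` function, and the IP addresses (taken from a documentation-only range) are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SlbEndpoint:
    zone: str       # availability zone the SLB instance lives in
    address: str    # public IP of the SLB instance
    healthy: bool   # result of the most recent health check


def resolve(pool: List[SlbEndpoint]) -> List[str]:
    """Return only the addresses whose health check currently passes."""
    return [endpoint.address for endpoint in pool if endpoint.healthy]


pool = [
    SlbEndpoint("zone-a", "203.0.113.10", True),
    SlbEndpoint("zone-b", "203.0.113.20", True),
]
normal_answer = resolve(pool)      # both zones receive traffic

pool[0].healthy = False            # simulate zone-a becoming unavailable
failover_answer = resolve(pool)    # only the zone-b SLB is returned
```

In practice the speed of the switch also depends on the DNS record TTL, since clients keep using cached answers until they expire.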

High availability at the RPC level

EDAS supports several microservice RPC frameworks, such as Dubbo and Spring Cloud. With these frameworks, when part of a multi-zone application becomes unavailable, the outlier removal capability of EDAS microservice governance automatically takes the nodes in the failed availability zone offline, and automatically adds them back to the application cluster once the network or other fault in that zone has been fixed, giving you intelligent fault handling.

 

image.png

 

In the figure above, applications B, C, and D, which are called by application A, are all governed by the policy. If the error rate of an instance that A calls reaches the configured lower limit, that abnormal instance is removed and A stops calling it. Once detection shows that it has recovered, it is added back and A calls it again.

 

First open microservice governance and select the relevant RPC framework (Spring Cloud in this example), then select the outlier removal menu and configure it as follows:

 

image.png

 

image.png

 

Here, the QPS lower limit is set according to the application's usual QPS, as observed through EDAS application monitoring. For disaster recovery scenarios, set the error rate lower limit somewhere between 10% and 50%, and keep the proportion of removed instances below 50% so that the cluster stays available and the removal does not cause an avalanche upstream or downstream. The recovery time and the cumulative detection count can both be left at their default values, which ensures that nodes are restored automatically once the availability zone recovers.
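
How these parameters interact can be sketched in a few lines of Python. This is a simplified model of outlier removal under stated assumptions, not the EDAS implementation: the names `InstanceStats`, `select_outliers`, `min_requests`, `error_rate_threshold`, and `max_eject_ratio` are hypothetical stand-ins for the QPS lower limit, the error rate lower limit, and the cap on removed instances.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class InstanceStats:
    name: str
    requests: int   # requests seen in the current statistics window
    errors: int     # failed requests in the same window


def error_rate(stats: InstanceStats) -> float:
    return stats.errors / stats.requests if stats.requests else 0.0


def select_outliers(instances: List[InstanceStats], min_requests: int,
                    error_rate_threshold: float, max_eject_ratio: float) -> List[str]:
    """Pick instances to remove, worst first, never exceeding the removal cap."""
    candidates = [
        s for s in instances
        # Ignore instances with too little traffic to judge reliably.
        if s.requests >= min_requests and error_rate(s) >= error_rate_threshold
    ]
    candidates.sort(key=error_rate, reverse=True)
    max_ejected = math.floor(len(instances) * max_eject_ratio)
    return [s.name for s in candidates[:max_ejected]]


instances = [
    InstanceStats("a1", requests=200, errors=2),
    InstanceStats("a2", requests=5, errors=5),     # below the traffic lower limit
    InstanceStats("b1", requests=100, errors=60),
    InstanceStats("b2", requests=100, errors=30),
]
ejected = select_outliers(instances, min_requests=10,
                          error_rate_threshold=0.2, max_eject_ratio=0.4)
print(ejected)
```

Although both `b1` and `b2` exceed the error rate threshold, the 40% cap on four instances allows only one removal, so the worst instance goes first and the cluster keeps enough capacity to avoid an avalanche.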

 

In addition to outlier removal, EDAS lets you enable same-data-center priority calls for deployed Provider applications. With this enabled, calls stay within the same data center, so no cross-data-center calls occur. In the event of an availability zone failure, RPC-level traffic then does not need outlier removal or similar capabilities to reroute or isolate node traffic, and the business is unaffected at the moment the failure occurs.
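
The routing preference can be illustrated with the following sketch. It is a generic example of zone-aware provider selection, not the EDAS routing code; `pick_provider` and the addresses are hypothetical, and falling back to other zones when no local provider exists is an assumption made here so that the example always returns a result.

```python
import random
from typing import List, Tuple

Provider = Tuple[str, str]  # (address, availability zone)


def pick_provider(providers: List[Provider], consumer_zone: str,
                  rng: random.Random) -> Provider:
    """Prefer providers in the consumer's own zone; otherwise use any provider."""
    if not providers:
        raise LookupError("no provider instances registered")
    local = [p for p in providers if p[1] == consumer_zone]
    # Assumption: fall back to other zones only when the local zone has none.
    return rng.choice(local or providers)


rng = random.Random(0)
providers = [
    ("10.0.1.10:8080", "zone-a"),
    ("10.0.1.11:8080", "zone-a"),
    ("10.0.2.10:8080", "zone-b"),
]
# A consumer in zone-a only ever reaches zone-a providers.
chosen_zones = {pick_provider(providers, "zone-a", rng)[1] for _ in range(100)}

# A consumer in a zone without providers falls back to whatever is registered.
fallback = pick_provider([("10.0.2.10:8080", "zone-b")], "zone-a", rng)
```

Because a consumer in a healthy zone never depended on providers in the failed zone, its calls are not affected when that zone goes down.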

High availability of microservice infrastructure

When EDAS deploys applications, it provides the corresponding microservice infrastructure by default, such as the registry and the configuration center. These components, which are not exposed to customers, already have intra-city disaster recovery. If an availability zone becomes unavailable, they continue to serve, which greatly reduces the complexity of operating and maintaining disaster recovery components yourself.

High availability at the database level

With the application deployment structure and RPC-level traffic now active-active within the city, the next concern is data reliability. RDS MySQL offers high-availability version instances, which use a hot-standby architecture with one primary and one standby node and suit more than 80% of user scenarios. When the primary node fails, the primary and standby nodes switch over within seconds, and the whole process is transparent to the application; when the standby node fails, RDS automatically creates a new standby node to preserve high availability. When creating the instance, select the high-availability version and choose multi-zone deployment as the deployment plan:

 

image.png

 

Note: If an existing high-availability version instance is deployed in a single availability zone, you can refer to Migrating Availability Zone to move it from a single availability zone to multiple availability zones.

 

For business scenarios with stricter data reliability requirements, RDS provides remote disaster recovery instances. This solution relies on Data Transmission Service (DTS) for real-time synchronization between the primary instance and the remote disaster recovery instance. It requires purchasing a new disaster recovery instance, so it comes at extra cost. For the specific steps, refer to the remote disaster recovery instance documentation:

 

image.png

 

Both the primary instance and the disaster recovery instance use a primary/standby high-availability architecture. If a sudden natural disaster hits the region where the primary instance is located and neither its primary node (Master) nor its standby node (Slave) can be reached, the remote disaster recovery instance can be promoted to primary. You can then use the application configuration management product to push the new database connection address to the applications, and restart the affected applications through EDAS to quickly restore business access.

High availability at the cache level

For the cache layer, this solution focuses on the cloud database Redis, which has the broadest range of application scenarios. The cloud database Redis product provides an intra-city disaster recovery architecture spanning two data centers out of the box. When creating a cloud database Redis instance, choose an availability zone that supports intra-city disaster recovery, as shown below:

 

image.png

 

When you create a multi-zone instance, a Replica instance with the same specifications as the one in the primary data center is created in the standby data center, and data is synchronized between the two through a dedicated replication channel. When the primary data center has a power or network problem, the Replica instance is promoted to Master and the underlying system automatically routes requests to the standby data center, completing the failover.

Concluding remarks

With the solutions above, you can use Alibaba Cloud EDAS and related products to build an intra-city active-active, disaster-tolerant business application quickly and at low cost. Online services switch over quickly when an availability zone becomes unavailable, which keeps the business running. This solution meets the disaster recovery needs of more than 90% of public cloud users.

 

In addition to intra-city active-active disaster recovery, Alibaba Cloud also provides a multi-active disaster recovery architecture that evolved from Alibaba's e-commerce environment. Built on flexible rule-based scheduling, cross-region and cross-cloud management and control, data protection, and other capabilities, it ensures that the business recovers quickly in failure scenarios, for customers with stricter high-availability and stability requirements. Customers with this need can refer to the official document Multi-Activity Disaster Recovery Introduction.

Origin blog.csdn.net/weixin_39860915/article/details/113697602