From traditional IT disaster recovery to "full-stack cloud disaster recovery"｜What is a cloud that is more suitable for government and enterprises

At 3 a.m., in front of a self-service payment machine in a hospital, a patient's family member frowned. He had swiped the medical insurance card countless times, but each time the machine reported that the payment had failed, and his loved one's operation was imminent...

At 8 a.m., the peak time when commuters open news apps on their way to work, the editors in a newsroom are struggling: the publishing interface for the day's top stories shows "release failure" over and over again...

For enterprise IT managers, these scenes are "disaster movies". The cause may be a cabinet in the enterprise data center losing power, a typhoon taking down a computer room, or an IT administrator accidentally deleting a database...

Natural disasters and man-made disasters may be inevitable, but the above scenarios can be avoided and prevented through IT architecture design. In the era of cloud computing, in the face of black swan events, how can IT personnel use disaster recovery solutions to ensure business continuity? What are the differences between cloud platform disaster recovery and traditional IT disaster recovery? What factors affect the disaster recovery design of the government-enterprise cloud platform? What kind of solution does Alibaba Cloud have? This article will give answers one by one.

A double-edged sword in the era of digital intelligence: the spread of cloud computing makes disaster recovery more urgent

As the digital and intelligent transformation of every industry deepens, cloud-native applications have become the recognized paradigm for digital transformation, and the base that carries them, the full-stack cloud computing platform, has become a solid foundation for the digital and intelligent transformation of government and enterprises.

The cloud computing model of "intensive construction, a unified large resource pool, and unified service supply" naturally concentrates a large number of applications on the cloud platform. On the one hand, this releases the advantages of flexible supply and agile deployment of platform resources. On the other hand, it means that once the platform fails, the scope of impact is larger. To ensure continuity at the business level, the high availability of the cloud platform has become a top priority for government and enterprise IT leaders.

Although a cloud platform is designed with basic high-availability capabilities from the start, such as multiple replicas of components, data distributed across server cabinets, and network equipment spread across racks, these only achieve "high availability within a single computer room". Industries such as finance, taxation, medical insurance, and energy have higher requirements for business continuity. For example, the financial industry has explicit policy requirements for cross-computer room disaster recovery, and a core business system failure lasting 30 minutes must be reported to the supervising authority; national and provincial medical insurance information systems must adopt the intra-city disaster recovery mode to meet business continuity requirements. Therefore, cross-computer room disaster recovery based on full-stack cloud products has become a strong demand of some government and enterprise customers.

Why can't new bottles hold old wine? Difficulties faced by traditional IT disaster recovery technology in the cloud era

After years of maturation, traditional IT disaster recovery currently follows two common technical routes:

Storage-level disaster recovery

This technology is mainly based on traditional storage arrays. Arrays of the same model are placed in two computer rooms, and data is kept in sync between the two centers through "synchronous replication" or "asynchronous replication" between the arrays.
Typical storage-level disaster recovery diagram
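
To make the difference between the two replication modes concrete, here is a minimal toy model in Python. It is only a sketch: the `ArrayPair` class and its method names are hypothetical and do not correspond to any storage vendor's API. It shows that a write acknowledged under asynchronous replication can be missing from the backup array when the primary fails, while synchronous replication leaves nothing behind.

```python
from dataclasses import dataclass, field


@dataclass
class ArrayPair:
    """Toy model (hypothetical) of a primary array replicating to a backup array."""
    synchronous: bool
    primary: list = field(default_factory=list)
    backup: list = field(default_factory=list)
    pending: list = field(default_factory=list)  # queued writes (async mode only)

    def write(self, block: str) -> None:
        self.primary.append(block)
        if self.synchronous:
            # Synchronous: the write is acknowledged only after the backup has it.
            self.backup.append(block)
        else:
            # Asynchronous: acknowledge immediately, ship to the backup later.
            self.pending.append(block)

    def ship(self) -> None:
        """One background replication cycle: flush queued writes to the backup."""
        self.backup.extend(self.pending)
        self.pending.clear()

    def lost_on_primary_failure(self) -> list:
        """Writes acknowledged to the application but absent from the backup."""
        return [b for b in self.primary if b not in self.backup]


sync_pair = ArrayPair(synchronous=True)
async_pair = ArrayPair(synchronous=False)
for pair in (sync_pair, async_pair):
    pair.write("w1")
    pair.write("w2")
async_pair.ship()  # the async pair catches up once
for pair in (sync_pair, async_pair):
    pair.write("w3")  # the primary fails right after this write

print(sync_pair.lost_on_primary_failure())   # []
print(async_pair.lost_on_primary_failure())  # ['w3']
```

The trade-off is that the synchronous path adds the inter-site round trip to every write, which is why asynchronous replication is usually chosen over longer distances.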

In this mode, to avoid double-writing of data, the computing nodes and applications in the backup center are normally kept shut down, that is, in "cold standby". When a data center fails, operators must first switch over to the IT facilities in the backup center and then start its computing nodes and applications one by one, which inevitably leads to a longer RTO. In addition, an application may fail to start normally in this mode, prolonging the RTO further.

With the development and adoption of cloud native, business applications are generally distributed across hundreds or even thousands of nodes. Restarting nodes and applications at this scale inevitably lengthens the RTO significantly, so even the most basic recovery time requirement cannot be met. In addition, traditional arrays do not meet the basic technical architecture requirements of cloud computing in terms of scalability and cost.
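
A back-of-the-envelope calculation illustrates why cold standby scales badly. The function below is a rough sketch, and every number in it (batch size, minutes per batch, facility switch time) is a made-up assumption for illustration, not a measurement from any real platform.

```python
import math


def cold_standby_rto_minutes(nodes: int, batch_size: int,
                             minutes_per_batch: float,
                             facility_switch_minutes: float) -> float:
    """Estimate RTO when standby nodes must be started in sequential batches.

    All parameters are hypothetical inputs chosen by the reader.
    """
    batches = math.ceil(nodes / batch_size)
    return facility_switch_minutes + batches * minutes_per_batch


# A small traditional system: 20 nodes, started 10 at a time.
print(cold_standby_rto_minutes(20, 10, 5, 15))    # 25
# A cloud-native system: 2000 nodes under the same assumptions.
print(cold_standby_rto_minutes(2000, 10, 5, 15))  # 1015
```

Under these assumed figures the estimate grows from under half an hour to roughly 17 hours, before counting any application that fails to start.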

Product-level disaster recovery

The feature of this technology is that the product itself implements "the transfer of working nodes across computer rooms and the replication of data across computer rooms" without relying on the underlying storage. At the service level, active-standby, active-active, and similar modes are generally adopted. At the data level, the product replicates data across computer rooms through its own mechanism, such as MySQL binlog replication.

Typical database disaster recovery replication architecture

Because the product nodes in the backup computer room are also normal working nodes, which simply play the standby role day to day and do not accept traffic, they are available immediately after a switchover from the main computer room. There is therefore no situation where an instance is unavailable after switching to the backup center, and the RTO of the business is generally lower than with the storage-level disaster recovery architecture.

Viewed across the whole business, this model is more controllable and has a better RTO than storage-level disaster recovery. However, it covers only one layer of the application technology stack, such as the database, and lacks business-wide disaster recovery capabilities.

Under cloud-native conditions, applications are built on full-stack cloud products such as IaaS, middleware, databases, and big data, and data is scattered across a large number of different products, so the disaster recovery approach needs to be redesigned.

An exam question for those at the helm of the cloud: a formula for weighing full-stack cloud disaster recovery

Based on the above analysis, the traditional IT technology architecture struggles to support the cloud-native business model, and a full-stack cloud disaster recovery solution is required. For an IT manager, full-stack cloud disaster recovery is a new and complex proposition. What issues need to be considered? The following formula can help IT leaders make that evaluation:

Full-stack cloud disaster recovery complexity = (number of products × product dependencies × switching scenarios × disaster recovery indicators) / disaster recovery management experience
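
The formula above is a qualitative rule of thumb, but writing it as a function makes its shape easier to see: the four factors multiply, and management experience is the only term that divides. The sketch below assumes each factor is a relative score chosen by the reader; the function name, parameter names, and sample scores are all hypothetical.

```python
def dr_complexity(products: float, dependencies: float, scenarios: float,
                  indicator_strictness: float,
                  management_experience: float) -> float:
    """Relative complexity score following the article's formula.

    Every argument is a unitless, reader-chosen score (an assumption of
    this sketch); only ratios between results are meaningful.
    """
    if management_experience <= 0:
        raise ValueError("management_experience must be positive")
    numerator = products * dependencies * scenarios * indicator_strictness
    return numerator / management_experience


manual = dr_complexity(10, 2, 3, 2, management_experience=1)
tooled = dr_complexity(10, 2, 3, 2, management_experience=4)
print(manual)  # 120.0
print(tooled)  # 30.0
```

Because the numerator is a product, doubling any single factor doubles the score, which is why better management tooling is the main lever left once the product set and requirements are fixed.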

Many products

A business system may use a dozen or even dozens of cloud products, and every cloud product and supporting product involved in the business needs disaster recovery switching capability. At the same time, the types of data storage have multiplied compared with traditional IT: block storage, object storage, OLTP data storage, OLAP data storage, offline big data storage, log storage, and so on. To achieve cross-computer room disaster recovery, IT managers choosing a cloud platform need to ensure that these products offer both "cross-computer room data synchronization" and "cross-computer room high availability".
Statistics of major cloud products used by an Alibaba Cloud customer

Many product dependencies

To achieve high availability of cloud products and reduce duplicated effort across products, a cloud platform is generally designed so that shared dependencies such as DNS, NTP, the metadata database, and distributed coordination services are split out as base components that provide unified services to the cloud products above them. Disaster recovery switching therefore has to take the base and product dependencies into account, so that a product does not report errors or become unusable after switching because a dependency is missing.
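
One common way to respect such dependencies is to derive the switching order from a dependency graph. The sketch below uses Python's standard `graphlib.TopologicalSorter` (Python 3.9+) to group components into waves, where each wave can be switched only after the previous one is done. The component names and the dependency map are hypothetical examples, not the actual dependency structure of any cloud platform.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: each component lists what it depends on.
depends_on = {
    "dns": set(),
    "ntp": set(),
    "metadata_db": {"dns", "ntp"},
    "coordination": {"dns", "ntp"},
    "rds": {"metadata_db", "coordination"},
    "object_storage": {"metadata_db"},
    "app_middleware": {"rds", "object_storage"},
}

sorter = TopologicalSorter(depends_on)
sorter.prepare()  # raises graphlib.CycleError if the dependencies are circular

waves = []
while sorter.is_active():
    ready = sorted(sorter.get_ready())  # components whose dependencies are all done
    waves.append(ready)
    sorter.done(*ready)                 # mark the whole wave as switched

for number, wave in enumerate(waves, start=1):
    print(number, wave)
# 1 ['dns', 'ntp']
# 2 ['coordination', 'metadata_db']
# 3 ['object_storage', 'rds']
# 4 ['app_middleware']
```

Components within a wave have no dependencies on each other, so they could be switched in parallel to shorten the overall RTO.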

Many disaster recovery scenarios

There are many types of cross-computer room failure scenarios. Each product needs data replication strategies and switching plans for scenarios such as "power failure in the computer room, split brain, network interruption, and failover", so that the business recovers as quickly as possible and data stays safe.

High disaster recovery requirements

Business failures in the cloud era have a greater impact, so disaster recovery must meet stricter RTO and RPO requirements than in traditional IT architectures. For example, the "Cloud Computing Technology Financial Application Specification Disaster Recovery" issued by the People's Bank of China sets out specific RTO and RPO requirements.
[Figure: RTO and RPO requirements in the specification]

Disaster recovery management experience

Given the "three mores and one high" problem above (many products, many dependencies, many scenarios, and high requirements), disaster recovery management of the full-stack cloud has itself become difficult. Ideally it should have the following capabilities:

Simple switchover: A disaster recovery switchover may involve dozens of products working together at the same time, so switching products one by one through traditional manual methods is no longer feasible. The cloud platform must therefore offer efficient drill and switchover capabilities to reduce RTO.
Full-scenario coverage: Disaster recovery design needs to cover multiple scenarios such as intra-city, remote, and two-site three-center, and must be able to keep iterating in each scenario as the government-enterprise disaster recovery architecture evolves.
Tenant isolation: In a multi-tenant scenario (where the cloud platform provides operations and services to external customers), each tenant needs self-service disaster recovery. Different systems of a single customer can be switched as needed, and a disaster recovery switch must have no impact on the business of other customers.
Controllable disaster recovery: The cloud platform needs a complete disaster recovery monitoring system, so that users can keep track of the current disaster recovery status and combine it with their internal disaster recovery plans. This keeps the system in a "controllable and predictable" state and avoids the data security risks caused by "unexpected switching".

Stronger and more confident: Alibaba Cloud is the pioneer of full-stack proprietary cloud disaster recovery

As the characteristics and requirements above show, full-stack cloud disaster recovery tests a cloud vendor's control over its full-stack products: it requires the ability to modify architecture and iterate functions at the code level for every product, as well as a complete supporting toolchain. Only then can a vendor provide mature, stable, and continuously iterated disaster recovery services. This is exactly the advantage of Alibaba Cloud's full-stack self-development.

Alibaba Cloud launched Feitian Enterprise Edition in 2015. It adopts the same technical architecture as the public cloud and provides full-stack product services for government and enterprise customers. After helping customers "build the cloud" and "move to the cloud", and in response to their generally high business continuity requirements, Alibaba Cloud was the first in the industry to research and develop cross-computer room disaster recovery for proprietary cloud. After extensive research into user needs, Alibaba Cloud "adopted the idea of application-level disaster recovery, took the perspective of full-stack products, and started from end-to-end application recovery", and officially launched the Feitian Enterprise Edition disaster recovery solution in 2017, creating a new paradigm for full-stack proprietary cloud disaster recovery in the industry.

After years of technical iteration, the capabilities of the Feitian Enterprise Edition disaster recovery solution have been continuously strengthened:

In 2017, it supported intra-city dual-AZ disaster recovery, covering 20+ cloud products.
In 2018, it completed the delivery of intra-city disaster recovery projects for multiple customers in finance, government affairs, and other sectors, and reached production-level disaster recovery capability.
In 2019, it supported remote cross-cloud disaster recovery and remote multi-active disaster recovery, and was delivered to multiple government customers.
In 2020, it supported intra-city 3AZ disaster recovery, was the first in the industry to achieve database RPO=0 under cloud-native conditions, and several bank customers entered the 3AZ disaster recovery stage; it also supported many-to-one remote disaster recovery, underpinning a provincial medical insurance construction model of "intra-city disaster recovery at the provincial level, plus many-to-one remote disaster recovery across provinces".
In 2021, it supported full-stack product-level disaster recovery in two places and three centers, meeting the policy requirements of finance and other industries to have both intra-city and remote disaster recovery.
In 2022, it supported disaster recovery on domestically produced chips, and its disaster recovery capabilities in various scenarios improved greatly, meeting the disaster recovery needs of government and financial customers on a "one cloud, multiple chips" basis.

Based on the requirements of full-stack cloud disaster recovery, the Alibaba Cloud Feitian Enterprise Edition disaster recovery solution has built all-round, "polygon warrior" capabilities:

The most supported products

Feitian Enterprise Edition has completed the disaster recovery architecture design of 60+ full-stack products such as IaaS, middleware, database, big data, and base in different scenarios, which can meet the end-to-end disaster recovery requirements of customers in different industries.

The most complete scenario coverage

To meet customers' different disaster recovery models, Feitian Enterprise Edition supports a range of atomic disaster recovery scenarios: intra-city dual-AZ, intra-city 3AZ, remote cross-cloud, remote cross-region, remote multi-active, remote many-to-one, and two-place three-center disaster recovery. These can be arranged and combined according to business characteristics into more complex scenarios, such as "intra-city disaster recovery + remote multi-active" and "intra-city disaster recovery + remote many-to-one disaster recovery", giving the platform "full-scenario disaster recovery" capability.

Simple disaster recovery management

To address the disaster recovery management problems of the full-stack cloud, Alibaba Cloud pioneered the business continuity management platform ASR (Apsara Stack Resilience). Adapted to multiple scenarios, ASR provides disaster recovery status monitoring, fault injection and drills, disaster recovery switchover and switchback, tenant isolation, and other capabilities through a visual interface. It orchestrates and encapsulates the complex internal logic of "product switching, inter-product dependencies, and computer room-level switching", so that operations staff need not care about that logic and can complete a drill or switchover of the full-stack products with "one click". ASR also greatly lowers the difficulty of full-stack cloud disaster recovery drills, so users can run regular drills as needed and "dare to switch at the moment of failure".

Application friendly, lower RTO

Tenants access cloud products through domain names or VIPs. During a disaster recovery switchover, the access address of a cloud product's disaster recovery instance stays unchanged, so the switchover is transparent to applications, which greatly shortens application recovery time.
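
The idea of a stable access address can be sketched in a few lines. In the toy model below, the application only ever holds a domain name, while the platform decides which backend that name currently points to. The `ProductEndpoint` class, the domain, and the IP addresses are hypothetical and only illustrate the principle; they are not how any particular cloud product is implemented.

```python
class ProductEndpoint:
    """Toy model (hypothetical): a fixed domain that fronts two backends."""

    def __init__(self, domain: str, primary_backend: str, standby_backend: str):
        self.domain = domain              # what the application is configured with
        self._active = primary_backend    # where traffic currently goes
        self._standby = standby_backend

    def resolve(self) -> str:
        """Return the backend currently serving the domain."""
        return self._active

    def switch_over(self) -> None:
        """Disaster recovery switch: swap roles, keep the domain unchanged."""
        self._active, self._standby = self._standby, self._active


endpoint = ProductEndpoint("db.example.internal", "10.0.1.10", "10.0.2.10")
app_config = {"database_host": endpoint.domain}  # the application's only reference

before = endpoint.resolve()
endpoint.switch_over()
after = endpoint.resolve()

print(app_config["database_host"])  # db.example.internal
print(before, after)                # 10.0.1.10 10.0.2.10
```

Because the application configuration never changes, recovery on the application side reduces to reconnecting, rather than redeploying with new addresses.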

RPO=0, meeting the high-level disaster recovery requirements

Industries that require high data reliability, such as finance, often require RPO=0. Alibaba Cloud was the first to launch the intra-city 3AZ disaster recovery model based on the distributed technology system of cloud computing. By deploying data replicas in multiple computer rooms, it can achieve RPO=0 for a single computer room failure under any conditions, and it meets the highest-level requirements of "GB/T 20988-2007 Information Security Technology Information System Disaster Recovery Specification" and "JR/T 0168-2020 Cloud Computing Technology Financial Application Specification - Disaster Recovery".
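
The article does not describe the replication protocol behind this, so the following is only a generic illustration of why three replicas can give RPO=0 for a single-site failure: if a write is committed only after a majority of AZs have persisted it, then losing any one AZ still leaves at least one copy of every committed write. The AZ names, the `commit` function, and the majority rule are assumptions of this sketch.

```python
from itertools import combinations

AZS = ("az-a", "az-b", "az-c")  # hypothetical availability zone names


def commit(replicas: dict, acked_by: tuple, record: str) -> bool:
    """Commit a record only if a strict majority of AZs persisted it."""
    if len(acked_by) * 2 <= len(replicas):
        return False  # no majority: reject, the client never sees success
    for az in acked_by:
        replicas[az].append(record)
    return True


def survives_single_az_failure(replicas: dict, committed: list) -> bool:
    """Check that every committed record outlives the loss of any one AZ."""
    for failed in replicas:
        surviving = set()
        for az, log in replicas.items():
            if az != failed:
                surviving.update(log)
        if not set(committed) <= surviving:
            return False
    return True


replicas = {az: [] for az in AZS}
committed = []
# Try every possible pair of acknowledging AZs.
for i, acked in enumerate(combinations(AZS, 2)):
    record = f"txn-{i}"
    if commit(replicas, acked, record):
        committed.append(record)

rejected = commit(replicas, ("az-a",), "txn-single")  # only one AZ acknowledged

print(committed)                                       # ['txn-0', 'txn-1', 'txn-2']
print(rejected)                                        # False
print(survives_single_az_failure(replicas, committed)) # True
```

With only two sites, the same rule would require both sites to acknowledge every write, so a single site failure would block writes; the third AZ is what allows zero data loss and continued availability together.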

Make progress while maintaining stability: full-stack cloud disaster recovery as a stable chassis for digital intelligence innovation

With its maturity in product coverage, functionality, scenario coverage, ease of use, and security isolation, Alibaba Cloud Feitian Enterprise Edition has provided full-stack cloud platform disaster recovery products and services to hundreds of customers in industries such as finance, government affairs, energy, electric power, transportation, manufacturing, and medical care.

The evolution of IT architecture is unstoppable. As governments and enterprises continue to migrate and build innovative and core applications on cloud platforms, the move from traditional IT disaster recovery to full-stack cloud disaster recovery is becoming more and more urgent. With the Feitian Enterprise Edition disaster recovery solution, Alibaba Cloud provides a solid cloud base for the digital and intelligent transformation of various industries, turning "stability" from a one-time choice into a continuous commitment.