Thinking about migration and disaster recovery under the cloud-native trend

Source | Alibaba Cloud Native Official Account

Author | Sun Qi

Introduction: Will traditional disaster recovery be the next field that cloud native disrupts? How should migration and disaster recovery solutions for application systems be built in the cloud-native era?

Trends

1. Cloud Native Development Trend

Cloud native has been a very hot topic in recent years. The "Cloud Native Development White Paper (2020)" released by the China Academy of Information and Communications Technology in July 2020 pointed out clearly that the turning point of cloud computing has arrived and that cloud native has become an important engine driving business growth. It is not hard to see that cloud native is reshuffling the IT industry: from the application development process to the skills required of IT practitioners, it is a disruptive revolution. On this foundation, the Open Application Model (OAM) appeared, a further abstraction built on the cloud-native platform that focuses on the application rather than the infrastructure. At the same time, more and more public clouds are supporting serverless services, which further illustrates the direction of future development: the application is the core, and the infrastructure layer plays an ever lighter role in system construction. But whatever changes, the overall direction of IT must evolve toward faster business iteration and better satisfaction of business needs.

In September 2020, Snowflake went public at US$120 per share, creating the largest IPO of the year and the largest software IPO in history. Snowflake rebuilt the data warehouse in a cloud-native way and successfully upended the industry's competitive landscape. This is the market's strongest endorsement of the cloud-native trend. So will traditional disaster recovery be the next field that cloud native disrupts?

2. Why does the cloud need a new approach to migration and disaster recovery?

1) Limitations of traditional solutions

Against this trend, traditional migration and disaster recovery still stop at the level of moving data, ignoring the features of the cloud and the chance to rethink and rebuild the user's business. The vision of cloud computing is that cloud resources can be consumed on demand, like water and electricity, so migration and disaster recovery on the cloud should follow the same historical trend. Snowflake, too, succeeded by breaking the old competitive landscape through exactly this kind of business-model innovation.

Why can't traditional disaster recovery methods meet cloud-native needs? Simply put, the two focus on different cores. Traditional disaster recovery centers on storage and assumes absolute control over it; in the physical era there were no effective means of scheduling compute, storage, and network at the infrastructure layer, so highly automated orchestration was impossible. For applications built on cloud native, the core becomes the cloud services themselves. Once a business system is fully deployed on the cloud, the user no longer has absolute control over the underlying storage, and the traditional disaster recovery playbook no longer applies.

[Figure 1]

I believe a cloud-native disaster recovery solution must be designed with the business at its core, using the orchestration capabilities of cloud-native services to achieve continuity of the business system.

2) Data security

AWS CTO Werner Vogels once said: "Everything fails, all the time." From AWS's shared responsibility model it is easy to see that the cloud provider is responsible for the underlying infrastructure, while users remain responsible for their own data security and business continuity.

[Figure 2]

I believe that under the cloud-native trend, the user's most immediate need is data security, that is, backup. Migration, recovery, and high reliability are all business forms built on top of backup. The backup capability may be provided by cloud-native services or by third parties, but the final business form is produced by orchestration.

Going to the cloud does not mean users can sit back and relax. On the contrary, users must learn the right way to use the cloud in order to ensure business continuity to the greatest extent. Although the cloud is designed for high reliability at the bottom layer, it still cannot avoid external forces: an availability zone taken offline by a severed fiber-optic cable, a power failure, or human error, hence the old joke that "the excavator decides the stability of China's cloud computing." I believe that from the moment a user decides to move a business to the cloud, backup, migration, recovery, and high reliability become one continuous process: making reasonable use of the characteristics of cloud-native services to achieve business continuity while optimizing expenses and reducing total cost of ownership (TCO).

3) Prevent vendor lock-in

In a sense, cloud native is heading toward a new round of vendor lock-in, much like the IOE architecture (IBM, Oracle, EMC) that prevailed in its day, except that now cloud vendors are the base that carries the applications. In the IOE era it was hard for users to find a perfect substitute; in the cloud era the differences are not so pronounced. As a result, most customers choose hybrid cloud as their cloud construction strategy. For applications to move smoothly between different clouds, migration based on disaster recovery technology must exist as a normalized capability. Gartner likewise treats migration and DR as a distinct capability in its definition of a multi-cloud management platform, which fully illustrates that migration and disaster recovery are becoming the norm in multi-cloud environments.

[Figure 3]

The relationship between cloud migration and cloud disaster recovery

1. The emergence of cloud migration requirements

In traditional environments the need for migration was not prominent; it only came up during a data center relocation or a hardware upgrade, and even then migration was more like hauling iron, with little need for tooling or automation. When VMware appeared, the demand for migrating from physical environments to virtualization was amplified, but because it was a single virtualization platform, the virtualization vendor's own tools were basically sufficient. On a virtualization platform, everyone suddenly found that what used to require manual operations in the physical world had become much lighter: the traditional server had changed from a pile of iron into a file, and that file could be moved and copied at will. Later, in the cloud era, cloud platforms of every kind flourished, the domestic cloud computing market became a battleground, and going to the cloud became a hard requirement. Over time, driven by factors such as cost and vendor lock-in, migration back and forth between different clouds will become a normalized demand.

2. The underlying technology is consistent

The cloud migration and disaster recovery discussed here are not manual relocation services but a highly automated approach whose goal is to ensure business continuity during migration, minimizing or even eliminating downtime. Storage-level synchronization technology from the disaster recovery world is used to achieve "hot migration" across heterogeneous environments. Existing solutions include both migration software from the era of physical machine relocation and tools built cloud-natively. Whatever the form, each has solved users' basic cloud-migration needs to varying degrees; the biggest difference lies in the human-efficiency ratio, which translates directly into cost.

From another perspective, it is not hard to see that so-called migration is essentially an intermediate stage of disaster recovery before the official cutover. And after the business system has been migrated to the cloud platform, disaster recovery remains a continuous activity that includes not only traditional backup and disaster recovery but also the concept of high reliability on the cloud. In this way the user's business system can shed the burden of traditional infrastructure, achieve "zero operations", and truly enjoy the dividends of the cloud. Therefore I believe that in the cloud-native world, cloud migration, cloud disaster recovery, and cloud backup are essentially one business form, and the underlying technical means can be completely identical.

3. Development direction

Given the pain points and trends above, a brand-new platform is bound to appear to help customers solve data security and business continuity. Today we will analyze, from this perspective, how to build migration and disaster recovery solutions for application systems under the cloud-native trend.

Cloud migration development trend

1. Cloud migration method

Migration is a consulting-heavy business. Every cloud provider and MSP has its own methodology, and in practice they do not differ much; many people have already shared on this topic, so I won't repeat it here. Instead, let's focus on which tools and methods are most efficient in actual implementation. A cloud migration tool moves the source environment to the target and ensures the source runs correctly on the target. Common paths include physical machine to virtualization, virtualization to virtualization, physical machine to cloud platform, and virtualization to cloud platform.

[Figure 4]

This is the classic 6R migration theory (since upgraded to 7R, with VMware joining the fray). In this picture, only four Rs involve real migration: Rehosting, Replatforming, Repurchasing, and Refactoring. Of these four, Refactoring is clearly a long-term iterative process that requires the participation of users and software developers, and Repurchasing is basically no different from manual redeployment. That leaves only Rehosting and Replatforming as work a user or MSP can complete in a short period of time.

Compared with the classic migration theory above, I prefer the picture below, which better reflects the whole journey of a traditional application growing into cloud native. Similar to the conclusion above, when we truly embrace the cloud we basically follow three paths:

  • Lift & Shift is another name for the Rehost method. In the figure this road is the widest, meaning it is the shortest path to the cloud: the application goes up directly without any modification.

  • Evolve and Go Native are both narrow paths, meaning that compared with Rehost they take longer and are more difficult.

  • On the far right of the figure, the three forms can convert into one another and eventually evolve into full cloud native. The implication is that migration is not accomplished overnight; it has to be completed step by step.

[Figure 5]

2. Rehost method

Commonly used Rehost methods are cold migration and hot migration. Cold migration involves cumbersome steps, requires a lot of manpower, is error-prone and inefficient, and has a large impact on business continuity, so it is not suitable for migrating production systems. Hot migration solutions are basically commercial, divided into block level and file level, and further subdivided into traditional solutions and cloud-native solutions.

1) Cold migration

Let's first look at the manual cold migration solution, taking VMware to OpenStack as an example. The easiest way is to convert the VMware virtual machine disk file (VMDK) with the qemu-img tool into QCOW2 or RAW format, upload it to the OpenStack Glance service, and then boot it on the cloud platform. Of course, the virtio drivers must be injected first, or the host will not start normally on the cloud platform. The most time-consuming step is uploading the image to Glance: in our earliest practice, one host took a full 24 hours from start to boot. Meanwhile, data keeps accruing on the source during the migration; unless you shut down the source and wait for the migration to finish, you have to repeat the steps above. So this method is really not suitable for migrating production systems that require business continuity.
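For concreteness, here is a minimal sketch of those manual steps, assuming the qemu-img CLI and the openstacksdk Python library with a clouds.yaml entry named "mycloud"; the paths, image name, flavor, and network are all illustrative:

```python
import subprocess

import openstack  # openstacksdk; assumes a clouds.yaml entry named "mycloud"

SRC_VMDK = "/data/export/web01.vmdk"     # illustrative path to the exported VMware disk
DST_QCOW2 = "/data/convert/web01.qcow2"  # illustrative output path

# Step 1: convert the VMDK to QCOW2 with qemu-img (-p shows progress).
subprocess.run(
    ["qemu-img", "convert", "-p", "-f", "vmdk", "-O", "qcow2", SRC_VMDK, DST_QCOW2],
    check=True,
)

# Step 2: upload the converted image to the Glance service.
# (virtio drivers must already be present in the guest, or it will not boot.)
conn = openstack.connect(cloud="mycloud")
image = conn.image.create_image(
    name="web01-migrated",
    filename=DST_QCOW2,
    disk_format="qcow2",
    container_format="bare",
)
print("Uploaded image:", image.id)

# Step 3: boot a server from the uploaded image.
server = conn.compute.create_server(
    name="web01",
    image_id=image.id,
    flavor_id=conn.compute.find_flavor("m1.medium").id,
    networks=[{"uuid": conn.network.find_network("private").id}],
)
```

Even scripted like this, the upload in step 2 still dominates the total time, which is exactly the bottleneck described above.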

What about a cold migration solution for physical machines? After much practice, our answer is the veteran backup tool Clonezilla (its Chinese name translates literally as "Regenerative Dragon"). It is a very old-school backup program, often used for whole-machine backup and recovery, and quite similar in principle to the familiar Norton Ghost. Clonezilla copies at the underlying block level, can back up an entire disk, and supports multiple destinations; for example, we can save the disk to a portable hard drive, where the actual format is RAW, and then simply repeat the steps above to complete the migration. However, Clonezilla must be booted from a Live CD, so it too means a long interruption of the business system, which is why cold migration, as noted above, is unsuitable for production environments.

[Figure 6]

[Figure 7]

2) Traditional hot migration solutions

Traditional hot migration solutions are basically divided into block level and file level. What the two have in common is that both rely on differential synchronization technology, that is, interleaved full and incremental synchronization.

File-level hot migration solutions are often more limited and cannot be regarded as a true Rehost method, because a matching operating system must be prepared on the target end in advance and the machine cannot be relocated as a whole; neither its operational complexity nor its migration stability scores particularly well. Rsync, which we commonly use on Linux, can actually serve as a file-level hot migration solution, as sketched below.
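A minimal sketch of that idea, driving rsync from Python; the target host name, paths, and service name are illustrative, and real products add consistency snapshots and monitoring on top:

```python
import subprocess
import time

SRC = "/var/www/"                    # source directory (trailing slash: sync contents)
DST = "root@target-host:/var/www/"   # illustrative target; requires SSH access

def sync_pass():
    # -a preserves permissions/ownership/timestamps, --delete mirrors removals;
    # rsync's delta algorithm transfers only the changed parts of changed files.
    subprocess.run(["rsync", "-a", "--delete", SRC, DST], check=True)

# Full pass first, then repeated incremental passes while the source keeps running.
sync_pass()
for _ in range(3):
    time.sleep(60)
    sync_pass()

# Final cutover: stop writers on the source, run one last (now very fast) pass,
# then switch traffic to the target -- downtime is just this short window.
# subprocess.run(["systemctl", "stop", "apache2"], check=True)  # illustrative
sync_pass()
```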

To truly achieve hot migration, block-level synchronization must be used, reducing dependence on the underlying operating system and relocating the machine as a whole. Traditional block-level hot migration solutions are basically variants of traditional disaster recovery solutions, implemented with an in-memory operating system such as WinPE or another Live CD. The basic principle and process are shown in the figure below. It is not hard to see from the process that, although this approach achieves the migration goal to a certain extent, it still has the following shortcomings as a normalized migration method for future hybrid clouds:

  • Because traditional hot migration solutions were built for physical environments, the whole process involves a great deal of manual intervention and demands fairly high skills from the user.

  • They cannot meet the multi-tenancy and self-service needs of the cloud-native era.

  • Installing an agent remains a perennial sore point for users.

  • The one-to-one synchronization model is uneconomical in terms of cost.

  • The best migration verification is to fully restore the business system cluster in the cloud, but manual verification drives the labor cost of migration up yet again.

[Figure 8]
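To make the block-level principle concrete before moving on, here is a simplified illustration of differential synchronization: read the source disk in fixed-size chunks, hash each chunk, and resend only the chunks whose hashes changed since the last pass. This is purely didactic; real solutions capture changed blocks with a driver instead of re-reading the whole disk, and the transport function here is hypothetical:

```python
import hashlib

CHUNK = 4 * 1024 * 1024  # 4 MiB blocks

def changed_blocks(device_path, previous_hashes):
    """Yield (offset, data) for each block whose hash changed since the last pass."""
    hashes = {}
    with open(device_path, "rb") as dev:
        offset = 0
        while True:
            data = dev.read(CHUNK)
            if not data:
                break
            digest = hashlib.sha256(data).hexdigest()
            hashes[offset] = digest
            if previous_hashes.get(offset) != digest:
                yield offset, data
            offset += len(data)
    # Remember this pass's hashes so the next pass sends only the delta.
    previous_hashes.clear()
    previous_hashes.update(hashes)

# First pass sends every block (a full sync); later passes send only changes.
# seen = {}
# for offset, data in changed_blocks("/dev/sdb", seen):
#     send_to_target(offset, data)   # hypothetical transport function
```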

3) Cloud native hot migration solution

It is precisely the shortcomings of traditional migration solutions that gave rise to cloud-native hot migration. The representative vendor here is CloudEndure, an Israeli cloud-native disaster recovery and migration company that AWS acquired for US$250 million in 2019, beating out Google Cloud.

A cloud-native hot migration solution uses block-level differential synchronization combined with cloud-native API interfaces and resources to achieve a highly automated migration, while providing multi-tenancy and APIs to satisfy the self-service needs of hybrid-cloud tenants. Let's first analyze, from first principles, why the cloud-native approach can deliver a highly automated, self-service user experience where traditional solutions cannot. Comparing the two, several advantages of the cloud-native approach stand out:

  • It uses cloud-native APIs and resources, is easy to operate, completely replaces the large number of cumbersome manual steps in traditional solutions, lowers the technical bar for users, and greatly flattens the learning curve.

  • Because operation is simple and migration efficiency improves, the human-efficiency ratio of a migration project improves markedly.

  • The one-to-many synchronization model greatly reduces the consumption of computing resources, which are used only during verification and the final cutover.

  • It can meet multi-tenancy and self-service requirements.

  • The source end can also run agentless, dispelling user concerns and suiting large-scale batch migration.

  • Verification is highly automated and can be repeated as often as needed before the final cutover.

[Figure 9]

This is the architecture diagram of CloudEndure. Of course, you can also use CloudEndure to achieve cross-region disaster recovery.

[Figure 10]

Unfortunately, after the AWS acquisition, CloudEndure currently only supports migration to AWS and cannot meet the needs of migrating to the various domestic clouds. So here I recommend a fully localized migration platform: Wanbo Zhiyun's HyperMotion, which is very similar in principle to CloudEndure. It also supports agentless migration from VMware and OpenStack and, more importantly, covers the mainstream public clouds, proprietary clouds, and private clouds.

[Figure 11]

3. Replatforming method

As cloud native provides more and more services, the complexity of application architecture decreases, allowing enterprises to focus more on their own business development. However, the reduced workload on the R&D side means that cost has shifted to deployment and operations, so DevOps has become an indispensable part of cloud-native applications, allowing enterprises to respond more agilely to complex business changes.

As mentioned above, users can adopt some cloud-native services through a small amount of transformation; this migration method is called replatforming. At present, replatforming migrations mainly revolve around the services that hold user data. Common examples include relational database services (RDS), object storage, message queues, and container services. Introducing these cloud-native services reduces users' operations costs. However, because a cloud-native service is tightly encapsulated and its underlying infrastructure is completely invisible to users, the Rehost method above cannot be used for migration; other auxiliary means are needed.

Taking relational databases as an example, almost every cloud provides migration tools, such as AWS DMS, Alibaba Cloud's DTS, and Tencent Cloud's data transmission service DTS. These cloud-native tools support migrating relational databases such as MySQL, MariaDB, and PostgreSQL, as well as NoSQL databases such as Redis and MongoDB. Taking MySQL as an example, these services cleverly use binlog replication to achieve online database migration.
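A rough sketch of the binlog-replication idea these services rely on, using the third-party python-mysql-replication library; the connection settings are illustrative, and a real migration service also handles the initial full copy, DDL statements, and conflict resolution:

```python
from pymysqlreplication import BinLogStreamReader
from pymysqlreplication.row_event import (
    DeleteRowsEvent,
    UpdateRowsEvent,
    WriteRowsEvent,
)

SOURCE = {"host": "10.0.0.5", "port": 3306, "user": "repl", "passwd": "secret"}  # illustrative

# server_id must be unique among replicas; resume_stream continues from where
# the initial full copy of the database left off.
stream = BinLogStreamReader(
    connection_settings=SOURCE,
    server_id=4100,
    blocking=True,
    resume_stream=True,
    only_events=[WriteRowsEvent, UpdateRowsEvent, DeleteRowsEvent],
)

for event in stream:
    for row in event.rows:
        # Translate each row change into an INSERT/UPDATE/DELETE on the target.
        print(event.schema, event.table, row)  # apply_to_target(event, row) in practice
stream.close()
```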

Take object storage as another example: almost every cloud provides its own migration tool, such as Alibaba Cloud's ossimport and Tencent Cloud's COS Migration, which can incrementally migrate local data to cloud object storage. In an actual migration, cost must also be considered. Storing data in public-cloud object storage is relatively cheap, but reading it back out is charged by network traffic and request count, so the migration plan must take cost fully into account. If the data volume is very large, offline devices such as AWS's Snowball or Alibaba Cloud's Lightning Cube can also be considered. I won't cover that here; I'll introduce it separately in the future.
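The incremental idea behind tools like ossimport can be sketched with Alibaba Cloud's oss2 Python SDK: walk the local tree and upload only objects that are missing or whose size changed (a crude change check, chosen for brevity); the credentials, endpoint, and bucket name are illustrative:

```python
import os

import oss2  # Alibaba Cloud OSS Python SDK

auth = oss2.Auth("<access-key-id>", "<access-key-secret>")  # illustrative credentials
bucket = oss2.Bucket(auth, "https://oss-cn-hangzhou.aliyuncs.com", "my-dr-bucket")

LOCAL_ROOT = "/data/uploads"  # illustrative local directory to migrate

for dirpath, _, filenames in os.walk(LOCAL_ROOT):
    for name in filenames:
        path = os.path.join(dirpath, name)
        key = os.path.relpath(path, LOCAL_ROOT)
        if bucket.object_exists(key):
            meta = bucket.head_object(key)
            # Skip objects already uploaded with the same size; real tools
            # compare checksums or modification times instead.
            if meta.content_length == os.path.getsize(path):
                continue
        bucket.put_object_from_file(key, path)
        print("uploaded", key)
```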

[Figure 12]

If you choose replatforming to go to the cloud, then besides the necessary application transformation, you also need to choose migration tools that suit you, so that data migrates to the cloud smoothly. Combined with the Rehost-mode migration above, the business system as a whole can be moved to the cloud. Since many services are involved, here is a summary table of migration tools for reference.

[Figure 13]

Disaster recovery development trend under cloud native

So far, no platform fully meets the requirements of unified disaster recovery in a cloud-native world. Let's analyze, through the following scenarios, how a unified disaster recovery platform could be built to meet cloud-native needs.

1. Traditional Architecture

Let's take a simple WordPress + MySQL environment as an example. A traditional deployment generally looks like this:

[Figure 14]

To design a disaster recovery solution for this application architecture, you can proceed as follows:

1) Load balancing node disaster recovery

Load balancing comes in hardware and software forms. Hardware load balancers usually achieve high reliability and disaster tolerance through their own solutions. Software load balancers are installed on a general-purpose operating system: same-city disaster recovery can be achieved with high-availability software, while remote disaster recovery is usually done by establishing peer nodes in advance, or simply by using disaster recovery software for block- or file-level replication. This is a very important part of failover.

2) Disaster recovery of Web Server

WordPress runs on little more than Apache + PHP. Because the file system storing user uploads has been separated out, the node is almost stateless: high reliability can be achieved by adding nodes, and remote disaster recovery is relatively simple. Traditional block-level and file-level methods can both meet the need.

3) Disaster recovery of shared file system

The figure uses the Gluster file system. Since a distributed system usually maintains consistency internally, block-level replication alone can hardly guarantee cross-node consistency, so file-level disaster recovery is the more accurate choice.

4) Disaster tolerance of the database

Storage-level replication alone cannot fundamentally guarantee zero data loss for a database, so disaster recovery is generally implemented at the database level. To reduce costs, it can be as simple as periodically dumping the database; if the reliability requirements are higher, CDP (continuous data protection) can be used instead.
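As an illustration of the low-cost periodic-dump option, here is a minimal sketch that could run from cron; the database name and credentials are illustrative, and the dump should be shipped off-site afterwards for real disaster tolerance:

```python
import gzip
import subprocess
from datetime import datetime

DB = "wordpress"  # illustrative database name
OUT = f"/backup/{DB}-{datetime.now():%Y%m%d-%H%M%S}.sql.gz"

# --single-transaction takes a consistent InnoDB snapshot without locking tables.
dump = subprocess.run(
    ["mysqldump", "--single-transaction", "-u", "backup", "-psecret", DB],
    check=True,
    capture_output=True,
)
with gzip.open(OUT, "wb") as f:
    f.write(dump.stdout)

# Ship OUT to object storage or a remote site for real disaster tolerance;
# a cron entry such as "0 2 * * *" would run this nightly.
```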

From the case analysis above it is easy to see that disaster recovery on traditional infrastructure takes storage as its core, whether through disk-array storage mirroring or I/O block- and byte-level capture, combined with network-, database-, and cluster-level techniques to build a highly reliable disaster recovery system. The main participants in the whole process are hosts, storage, network, and application software, a relatively small set. In traditional solutions, therefore, correctly solving storage disaster recovery is the key to the whole problem.

2. Hybrid cloud disaster recovery

This should be the most common hybrid cloud solution today, and the one major disaster recovery vendors recommend. Here we essentially treat the cloud platform as just another virtualization platform and use almost none of its features. During recovery, a great deal of manual intervention is needed to bring the business system back to a usable state. This architecture does not follow best practices on the cloud, but it is a true portrait of many business systems after they are backed up or migrated to the cloud.

[Figure 15]

Such an architecture can indeed solve the disaster recovery problem, but at a high cost. Let's optimize it with object storage and a database service: the original storage goes into object storage, the database is replicated in real time by a data transmission service, and the cloud hosts are still synchronized at the traditional block level. Once a failure occurs, automated orchestration restores the backups and brings the system back according to a preset plan in the shortest possible time, completing disaster recovery.

[Figure 16]

3. Same-city cloud disaster recovery architecture

The backup method above is essentially a replatforming migration. Since migration techniques are already being used for backup, the architecture can be modified as follows to form a same-city disaster recovery architecture. Following the cloud platform's best practices, we make these adjustments:

[Figure 17]

This architecture not only achieves application-level high reliability but also supports a certain degree of high concurrency, giving users same-city active-active at minimal transformation cost. Let's count the cloud-native services used on the cloud:

  • Domain name resolution service
  • VPC service
  • Load balancing service
  • Auto-scaling service
  • Cloud hosting service
  • Object storage service
  • Relational database RDS service

Except for the cloud hosts, these services natively support high availability across availability zones. For the cloud hosts, we can bake images and let the auto-scaling service manage instance state. Since availability zones embody the concept of same-city disaster recovery, this gives us same-city disaster recovery for the business system.

The adjusted architecture meets business continuity requirements to a certain extent, but still lacks a guarantee of data security. In recent years ransomware has run rampant and many companies have suffered huge losses, so data backup must still be implemented after going to the cloud. Cloud-native services provide their own backup options, such as periodic snapshots of cloud hosts, but these are scattered across services and hard to manage in a unified way. During restoration, each service can only be restored individually; for a large business system this adds considerable restoration cost. Cloud-native services solve their own backup problems, but reassembling those backups into a working application requires automated orchestration.
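As one way to unify those scattered snapshot backups, here is a sketch using AWS's boto3 as a stand-in for any cloud's SDK: it snapshots every volume attached to instances that opt in via a tag. The region and the dr:backup tag convention are illustrative assumptions:

```python
import boto3

ec2 = boto3.client("ec2", region_name="ap-southeast-1")  # illustrative region

# Find instances opted into backup via a tag (key/value are illustrative).
reservations = ec2.describe_instances(
    Filters=[{"Name": "tag:dr:backup", "Values": ["true"]}]
)["Reservations"]

for res in reservations:
    for inst in res["Instances"]:
        for mapping in inst.get("BlockDeviceMappings", []):
            vol_id = mapping["Ebs"]["VolumeId"]
            snap = ec2.create_snapshot(
                VolumeId=vol_id,
                Description=f"dr backup of {inst['InstanceId']}",
            )
            print("snapshot", snap["SnapshotId"], "for", vol_id)
```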

4. Same-cloud cross-region disaster recovery architecture

Most cloud-native services provide high reliability within an availability zone, while cross-region capability is usually offered in the form of backup. For example, a cloud host can be turned into an image and the image copied to another region; relational databases and object storage also have cross-region backup capabilities. Using these components' backup capabilities plus the cloud's own resource orchestration, we can restore the system to a usable state in the disaster recovery region. But how is the switchover triggered?

Here, based on the characteristics of the business system, we define custom alarms in the cloud-native monitoring service and use the alarm platform's trigger capability to invoke function compute, which completes the cross-region switchover of the business system and achieves the effect of remote disaster recovery.

[Figure 18]
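As a sketch of what the alarm-triggered switchover function might look like, here AWS Lambda and Route 53 stand in for "function compute" and "domain name resolution"; the hosted zone ID, record name, and DR-region IP are all illustrative:

```python
import boto3

route53 = boto3.client("route53")

HOSTED_ZONE_ID = "Z0000000EXAMPLE"   # illustrative hosted zone
RECORD_NAME = "www.example.com."     # illustrative business domain
DR_REGION_IP = "203.0.113.10"        # entry point already restored in the DR region

def handler(event, context):
    """Invoked by the monitoring alarm; points the business domain at the DR region."""
    route53.change_resource_record_sets(
        HostedZoneId=HOSTED_ZONE_ID,
        ChangeBatch={
            "Comment": "failover to DR region",
            "Changes": [{
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": RECORD_NAME,
                    "Type": "A",
                    "TTL": 60,
                    "ResourceRecords": [{"Value": DR_REGION_IP}],
                },
            }],
        },
    )
    return {"status": "switched"}
```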

5. Cross-cloud disaster recovery

Cross-cloud disaster recovery, however, is unlike same-cloud disaster recovery, where services are at least consistent across availability zones. Here the same-cloud methods basically fail, and the capabilities of the target cloud platform, or a neutral third-party solution, are required. Besides data backup, service configurations must also be mapped between clouds to fully meet cross-cloud recovery needs. Cost is another consideration: object storage, for example, is a typical case of "easy to move data in, costly to move it out". How to use the characteristics of cloud-native resources to design a reasonable disaster recovery solution is therefore a real test of cost control.

[Figure 19]

Summary

Cloud-native disaster recovery is still in its early stages; no complete platform yet supports the disaster recovery requirements of all the scenarios above, and the topic deserves continuous exploration. Cloud-native disaster recovery takes backup as its core and migration, recovery, and high reliability as its business scenarios, enabling free flow between clouds and ultimately meeting users' business needs.

Therefore, a cloud-native disaster recovery platform must address three capabilities:

  • Take data as the core and let it flow among clouds. Data is the user's core value, so however the underlying infrastructure changes, data backup remains a hard requirement. Solving data backup for the various cloud-native services is the necessary foundation for data flow.

  • Use cloud-native orchestration to achieve a high degree of automation and build business scenarios around the data. Automated orchestration enables more data-driven applications and helps users complete more business innovation.

  • Make flexible use of the characteristics of cloud-native resources to reduce total cost of ownership, solving the huge upfront investment of traditional disaster recovery so that users can truly pay on demand, like water and electricity.


Origin: blog.51cto.com/13778063/2553560