Disaster Recovery Basics

Disaster recovery: "Disaster recovery" here is short for disaster recovery and backup. It uses technical means and methods to establish, in advance, a systematic emergency response for data so that an organization can cope with disasters. It covers data backup, system backup, business continuity planning, staffing, communication guarantees, crisis public relations, disaster recovery planning and plans, business recovery plans, and so on.

Disaster recovery, in the narrower sense, refers to establishing two or more IT systems with the same functions at sites that are far apart (in the same city or in different cities). The systems monitor each other's health and can switch functions between each other. When one site stops working, the entire application system can be switched to another site so that it continues to work normally. The focus is on data synchronization and continuous availability of the system.

Backup refers to making one or more copies of the important data generated by the application system (or of the original important data) to improve data security. The focus is on copying and preserving data.

1. Disaster recovery implementation

Backup: To cope with unexpected situations such as the loss or corruption of files and data, copy the data on the computer's storage device to a large-capacity storage device such as a disk.

Verification: Check whether the backup data is consistent with the source data, whether it is intact, and whether it can actually be used (consistency and availability).
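The consistency part of verification can be sketched with a checksum comparison. This is only an illustration: the function names are hypothetical, and a matching checksum shows that the copy is intact, not that it can be restored, so a restore test is still needed for the availability part.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def backup_matches_source(source: Path, backup: Path) -> bool:
    """True when the backup exists and is byte-for-byte identical to the source."""
    return backup.is_file() and sha256_of(source) == sha256_of(backup)
```

In practice the source may keep changing after the backup is taken, so the source checksum would normally be recorded at backup time and compared later.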

Drill: Simulate a disaster to test whether the entire organization can respond when a sudden disaster occurs.

Disaster recovery (emergency): When an actual disaster occurs, the organization must be able to respond by switching the entire application system to another location so that the system continues to work normally.

Recovery (switchback): After the disaster, restore the main production system to normal operation and switch back to it.

2. Key technical indicators of disaster recovery

1. RTO

RTO (Recovery Time Objective) determines how long the business is interrupted. It covers the period after a disaster from the moment the IT system goes down and the business stops to the moment the IT system is restored, can again support the work of each department, and the business resumes operation.

Common techniques for improving RTO include tape recovery, manual migration, and remote switchover of the application system.

Disaster recovery technology                  Typical RTO
Tape recovery                                 Days
Manual migration                              Hours
Remote switchover of application system       Seconds
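As a small illustration of the definition above, the recovery time of a drill or a real incident can be measured from two timestamps and compared with the target. The function names and the one-hour target in the usage example are assumptions made for this sketch.

```python
from datetime import datetime, timedelta


def measured_recovery_time(outage_start: datetime, service_restored: datetime) -> timedelta:
    """Time from the moment the business stops to the moment it resumes."""
    if service_restored < outage_start:
        raise ValueError("service cannot be restored before the outage starts")
    return service_restored - outage_start


def meets_rto(outage_start: datetime, service_restored: datetime, rto_target: timedelta) -> bool:
    """True when the measured recovery time is within the RTO target."""
    return measured_recovery_time(outage_start, service_restored) <= rto_target


# Usage: an outage from 02:00 to 02:45 checked against an assumed 1-hour target.
start = datetime(2024, 1, 1, 2, 0)
within_target = meets_rto(start, start + timedelta(minutes=45), timedelta(hours=1))
```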

2. RPO

RPO (Recovery Point Objective) determines how much data is lost. After a disaster occurs, the disaster recovery system restores the data, and the point in time to which the recovered data corresponds is called the RPO.

RPO is a metric that reflects the integrity of the restored data. In synchronous data replication mode, the RPO equals the data transmission delay. In asynchronous data replication mode, the RPO is essentially the time that data spends waiting in the asynchronous transmission queue.
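The difference between the two replication modes can be shown with a toy model. This is a simplified sketch, not a real replication protocol: the class name and methods are hypothetical, and network delay is ignored, so the synchronous case loses nothing at all here.

```python
from collections import deque
from typing import Optional


class ReplicatedLog:
    """Toy primary/replica pair that replicates synchronously or asynchronously."""

    def __init__(self, synchronous: bool) -> None:
        self.synchronous = synchronous
        self.primary: list = []
        self.replica: list = []
        self._queue: deque = deque()  # records waiting for asynchronous transfer

    def write(self, record: str) -> None:
        self.primary.append(record)
        if self.synchronous:
            # Synchronous mode: the write completes only once the replica has it.
            self.replica.append(record)
        else:
            # Asynchronous mode: the write completes now and is shipped later.
            self._queue.append(record)

    def ship(self, max_records: Optional[int] = None) -> None:
        """Transfer queued records to the replica (asynchronous mode only)."""
        count = len(self._queue) if max_records is None else min(max_records, len(self._queue))
        for _ in range(count):
            self.replica.append(self._queue.popleft())

    def lost_if_primary_fails(self) -> list:
        """Records that exist only on the primary at this moment."""
        return list(self._queue)
```

The records still waiting in the queue are exactly the data that would be lost if the primary failed now, which is why the asynchronous RPO tracks the queuing time.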

Common techniques for improving RPO include tape backup, periodic data replication, asynchronous data replication, and synchronous data replication.

Disaster recovery technology        Typical RPO
Tape backup                         Days
Periodic data replication           Hours
Asynchronous data replication       Minutes
Synchronous data replication        Seconds
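To make the recovery point concrete, the following sketch takes the timestamps of the available copies, finds the latest one that precedes the failure, and reports how much data would be lost. The function names and the sample timestamps in the usage note are illustrative assumptions.

```python
from datetime import datetime, timedelta
from typing import Iterable, Optional


def latest_recovery_point(copies: Iterable[datetime], failure_time: datetime) -> Optional[datetime]:
    """Most recent copy taken at or before the failure, or None if there is none."""
    usable = [taken for taken in copies if taken <= failure_time]
    return max(usable) if usable else None


def data_loss_window(copies: Iterable[datetime], failure_time: datetime) -> Optional[timedelta]:
    """Span of time whose changes are lost when restoring from the latest copy."""
    point = latest_recovery_point(copies, failure_time)
    return None if point is None else failure_time - point


def meets_rpo(copies: Iterable[datetime], failure_time: datetime, rpo_target: timedelta) -> bool:
    """True when a usable copy exists and the loss window is within the target."""
    window = data_loss_window(copies, failure_time)
    return window is not None and window <= rpo_target
```

With hourly copies at 00:00, 01:00 and 02:00 and a failure at 01:40, the recovery point is 01:00 and the loss window is 40 minutes.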

3. The relationship between RTO and RPO

RTO and RPO are not isolated indicators; they reflect disaster recovery capability from different perspectives. RPO looks back from the moment of failure (how much data from before the fault is lost), while RTO looks forward from it (how long recovery takes after the fault). The smaller both values are, the shorter the gap between normal operation and the end of the transition period.

When a disaster occurs, the ideal outcome is that the system recovers immediately with no data loss at all. At present, RTO can be equal to 0, and RPO can come arbitrarily close to 0. However, a disaster recovery design should not chase the smallest possible RPO and RTO, because the smaller they are, the greater the investment, and the higher the overall cost, the lower the return on investment. From an economic point of view, the technically best disaster recovery solution is not necessarily the most suitable one, because the total cost of ownership (TCO) and the return on investment (ROI) of the disaster recovery system are very important design indicators for many users.
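One way to read this trade-off is to pick the slowest technique that still meets the target instead of the fastest one available. The sketch below does that for RPO. The tier names come from the RPO table earlier in this section, while the representative durations (1 day, 1 hour, 1 minute, 1 second) and the rule that a slower tier costs less are simplifying assumptions for illustration only.

```python
from datetime import timedelta
from typing import Optional, Sequence, Tuple

# Order-of-magnitude tiers, slowest (assumed cheapest) first.
RPO_TIERS: Sequence[Tuple[str, timedelta]] = (
    ("tape backup", timedelta(days=1)),
    ("periodic data replication", timedelta(hours=1)),
    ("asynchronous data replication", timedelta(minutes=1)),
    ("synchronous data replication", timedelta(seconds=1)),
)


def slowest_tier_meeting(target: timedelta,
                         tiers: Sequence[Tuple[str, timedelta]] = RPO_TIERS) -> Optional[str]:
    """Return the first (slowest) tier whose typical value is within the target."""
    for name, typical in tiers:
        if typical <= target:
            return name
    return None  # no listed technique is fast enough for this target
```

For a target of four hours this selects periodic data replication, because paying for synchronous replication would buy a smaller RPO than the business asked for.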

3. Disaster recovery level

Disaster recovery is an important technology in an enterprise and plays a major role in enterprise data security. Generally speaking, disaster recovery can be divided into three levels: data level, application level, and business level.

1. Data-level disaster recovery

Data-level disaster recovery refers to backing up data remotely by establishing a remote disaster recovery center, so that the original data is not lost or destroyed after a disaster occurs. For example, in the early days backups were written to tape and shipped to a remote location; later, asynchronous or synchronous data transmission between the disaster recovery center and the production center was implemented over the network. At the data level, however, applications are still interrupted when a disaster occurs.

Data in the data center is copied from the application host or storage device to other media to prevent data loss and destruction.

  • The copy may cover some or all of the data;
  • It can be kept within one center or across centers;
  • Multiple copies from different historical points in time can be retained;
  • It usually needs to be scheduled and supported by a backup management service;
  • Cross-center backup is the basis of disaster recovery.
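Keeping several historical copies, as the list above describes, can be sketched as a timestamped copy followed by pruning. The function name take_backup, the file naming scheme, and the keep parameter are hypothetical, and a real deployment would normally leave scheduling and cross-center transfer to a backup management service.

```python
import shutil
from datetime import datetime, timezone
from pathlib import Path
from typing import Optional


def take_backup(source: Path, backup_dir: Path, keep: int = 7,
                now: Optional[datetime] = None) -> Path:
    """Copy source into backup_dir under a timestamped name and keep only the newest copies."""
    if keep < 1:
        raise ValueError("keep must be at least 1")
    backup_dir.mkdir(parents=True, exist_ok=True)
    # The timestamp format sorts lexicographically in chronological order.
    stamp = (now or datetime.now(timezone.utc)).strftime("%Y%m%dT%H%M%S%fZ")
    target = backup_dir / f"{source.name}.{stamp}"
    shutil.copy2(source, target)
    # Assumes source.name contains no glob metacharacters such as "[" or "*".
    copies = sorted(backup_dir.glob(f"{source.name}.*"))
    for stale in copies[:-keep]:
        stale.unlink()  # drop the oldest copies beyond the retention count
    return target
```

Pointing backup_dir at storage in another center is what turns this local copy into the cross-center backup that disaster recovery builds on.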

2. Application-level disaster recovery

Application-level disaster recovery builds on data-level disaster recovery: an identical application system is also set up at the backup site. Using synchronous or asynchronous replication, it ensures that key applications resume operation within the allowed time, reduces the losses caused by a disaster as much as possible, keeps users largely unaware that a disaster has occurred, and keeps the services provided by the system complete, reliable, and safe. The supporting systems include the data backup system, the backup application system, and the backup network.

Data transmission between the production center and the remote disaster recovery center uses heterogeneous WAN links. Application-level disaster recovery also relies on more software, so that applications can be switched over quickly when a disaster occurs, ensuring business continuity.

Two or more IT systems with the same functions are set up at sites far apart. When one system stops working unexpectedly, the entire application system can be switched to another so that it continues to work normally.

  • The centers can monitor each other's health and switch functions among themselves;
  • It is an integral part of the system's high-availability technology;
  • It provides node-level system recovery;
  • More emphasis is placed on the impact of the external environment on the information system, especially the impact of catastrophic events on an entire IT node.
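The health monitoring and switching described above can be sketched as a selection rule: stay on the current site while it is healthy, otherwise move to the first healthy site in priority order. The Site class and its is_healthy probe are hypothetical stand-ins for whatever monitoring a real deployment uses.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class Site:
    name: str
    is_healthy: Callable[[], bool]  # hypothetical health probe supplied by the caller


def choose_active_site(sites: Sequence[Site], current: Optional[str] = None) -> Optional[str]:
    """Keep the current site while it is healthy; otherwise fail over in list order."""
    by_name = {site.name: site for site in sites}
    if current in by_name and by_name[current].is_healthy():
        return current
    for site in sites:
        if site.is_healthy():
            return site.name
    return None  # every site is down
```

The function deliberately does not switch back to the original site once it recovers, since switchback is treated as a separate step in the implementation stages described earlier.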

3. Business-level disaster recovery

Business-level disaster recovery is the highest level of disaster recovery. In addition to the necessary IT technologies, it requires the full supporting infrastructure, and most of its content is non-IT (such as telephones and office locations). When a catastrophe destroys the original office, data and application recovery are not enough: a backup workplace is also needed to carry on business normally, for example backup office space for business users and backup business staff.

The same business is provided by multiple centers at the same time:

  • Multiple data centers share the business load, which can be distributed in proportion;
  • When one center stops serving, business traffic can be switched automatically to another center so that service to the outside world continues;
  • Automatic switching is transparent to the access terminal, which is completely unaware of it;
  • Resource utilization is effectively improved.
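Proportional sharing with automatic switching can be sketched as a weight table that is renormalized over the centers that are currently serving. The function name and the example weights are assumptions for illustration.

```python
from typing import Dict, Set


def traffic_shares(weights: Dict[str, float], healthy: Set[str]) -> Dict[str, float]:
    """Split traffic among healthy centers in proportion to their configured weights."""
    live = {name: weight for name, weight in weights.items()
            if name in healthy and weight > 0}
    total = sum(live.values())
    if total == 0:
        return {}  # no center can take traffic
    return {name: weight / total for name, weight in live.items()}


# Usage: two centers weighted 60/40; if center "A" stops serving,
# all traffic moves to "B" without any change on the client side.
normal = traffic_shares({"A": 60, "B": 40}, healthy={"A", "B"})
degraded = traffic_shares({"A": 60, "B": 40}, healthy={"B"})
```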

4. Data type

From the perspective of data usage, the data that needs to be backed up can be divided into system data, basic data, application data, and temporary data. According to how the data is stored and managed, it can also be divided into database data, non-database data, isolated data, and lost data.

  • System data: mainly the operating system, the software packages installed in the application system, and the executable programs of the application system. System data basically does not change after installation; it changes only when the operating system or application system is upgraded or an application is adjusted.

  • Basic data: mainly the system directory, user directory, system configuration files, network configuration files, application configuration files, access control settings, and so on, which keep the business system running normally. Basic data changes with the operating environment of the business system and is generally saved as system files.

  • Application data: mainly all business data of the business system. It has high requirements for security, accuracy, and integrity, and it changes frequently.

  • Temporary data: mainly the operation records generated by operating systems, application systems, and databases, database logical logs, and the temporary files for printing and transmission produced while applications run. It changes with system operations and business activity. Temporary data has little impact on the integrity of business data and needs to be cleaned up regularly as it grows.

5. Business type

An enterprise has different business scenarios. Business systems can be divided into key business systems, important business systems, general business systems, and so on.

  • Key business systems: business data is relatively centralized and core, many server nodes connect to the system, and it is essential to the normal operation of the entire enterprise. Once the business is interrupted, the services the enterprise provides and its normal operations are severely affected immediately, which directly causes economic losses or damages the enterprise's reputation and, in serious cases, may lead to legal liability. Examples include online platforms such as Ctrip, Taobao, and Jingdong.

  • Important business systems: a business interruption has a serious impact on the normal and effective operation of the enterprise. Once the business is interrupted, part of the services the enterprise provides and part of its business are affected or interrupted, but the overall situation is not. Examples include the internal corporate website, the mail system, and the business operation system.

  • General business systems: a business interruption does not immediately have a serious impact on the normal operation of the enterprise. A short interruption can be tolerated, and the system can be restored within a few days or weeks. Examples include the personnel file system, the attendance system, and the project budget and final accounts system.

6. Disaster recovery technology

Data center disaster recovery technologies can be roughly divided into five types: cold backup, warm backup, hot backup, active-active, and multi-active.

1. Cold standby

Cold standby means cold backup, also known as offline backup: a complete backup of the database taken while the database is shut down and cannot be updated.

With cold backup, only the main data center carries business, and the backup data center does not back up the main data center in real time. When the main data center goes down, the business is interrupted. This technology cannot anticipate failures or take over when they happen, and recovery takes too long to meet the high requirements of data center disaster recovery.

2. Warm standby

Warm backup sits between cold backup and hot backup. It mainly achieves a complete backup of the entire system through remote disk mirroring, database replication, and a disaster backup center.

3. Hot standby

Hot standby means dual-machine hot backup, that is, hot backup based on two servers in a high-availability system. Although the standby data center only backs up the primary data center in real time and does not carry business itself, when the primary data center fails and the business becomes unavailable, the standby data center can automatically take over its business, and the business is restored in the shortest possible time.

4. Active-active

Active-active means that both data centers are in operation and carry business at the same time, which improves the overall service capability and resource utilization of the data centers. The two data centers back each other up. When one data center fails, the business is automatically switched to the other, with the aim of zero data loss and zero business interruption.

The active-active data center solution implements active-active at the storage layer, application layer, and network layer, eliminating single points of failure and ensuring business continuity.

5. Multi-active

Multi-active means remote multi-active: independent data centers are generally established in different cities. "Active" is used in contrast to cold backup. A cold backup site holds a full copy of the data but normally does not serve business; traffic is switched to the backup site only when the main site fails. In a multi-active setup, every site also takes traffic in daily operation and supports the business.

7. Disaster Recovery Architecture

1. Use the cloud as a remote disaster recovery center: the local physical data center is the main data center, and only the data is backed up to the cloud.

2. Intra-city disaster recovery based on public cloud: migrate all systems to the cloud and deploy them in two different availability zones in the same region to achieve intra-city disaster recovery.

3. Remote disaster recovery based on public cloud: migrate all systems to the cloud and deploy them in two different regions to achieve cross-region disaster recovery.

4. Combined intra-city and remote disaster recovery on public cloud: for example, three centers in two places or five centers in three places.

8. Cloud Disaster Recovery

Cloud disaster recovery is a service model built on the cloud platform. It is a cloud computing service that covers data scenarios such as business disaster recovery, data backup, and the use of data copies for enterprises, that is, disaster recovery as a service (DRaaS, DR as a Service).

1. Advantages of cloud disaster recovery

Cloud disaster recovery draws on the computing, storage, and bandwidth of the cloud platform and has many advantages over traditional disaster recovery:

  • Reduced infrastructure

Instead of purchasing traditional disaster recovery servers, customers rely on the computing and storage platforms of cloud providers or directly adopt DRaaS application services. A cloud disaster recovery solution can effectively reduce maintenance requirements and costs. Customers save physical space and IT resources, and maintenance staff are freed up for other work.

  • Reduced IT costs

More economical and flexible cloud storage is used for backup according to specific needs. This removes the hardware purchase and maintenance costs of a self-built data center, removes the trouble of maintaining various hardware, and enables fine-grained management of resource allocation, which cuts most disaster recovery expenses.

  • Pay as you go

Cloud disaster recovery can use cloud infrastructure or the DRaaS model, allowing users to choose freely which important systems and data to protect. Whether for a business takeover or a drill, customers pay only for the resources actually used, which greatly reduces wasted resources and improves efficiency.

  • High flexibility

Cloud disaster recovery makes it easier to assess business needs. Users can estimate more accurately which system, or even which subsystem, needs to be protected, and can select key data at a finer granularity to optimize their backup plan instead of backing up the entire system. This allows the RPO, the maximum amount of data loss that can be tolerated, to be set more precisely. A high-availability, fault-tolerant architecture built in the cloud can improve RTO and RPO. On a public cloud platform or with open source private cloud technology, disaster recovery nodes can be built easily, quickly, and flexibly, and data can be migrated or copied to the cloud to speed up disaster recovery.

  • Quick recovery

Even with a traditional, customized remote backup, restoring data and restarting the business takes time, and that time depends on the distance to the remote backup site and the performance of the remote servers. Cloud disaster recovery can make full use of cloud capabilities, work around these physical limitations, and start services directly in the cloud.

The high performance, high reliability, high scalability, easy maintenance, low liability risk, and cost-effectiveness of cloud disaster recovery help users build a highly available, flexible, pay-as-you-go cloud disaster recovery platform at low cost.

For many users with limited IT resources, cloud-based disaster recovery is a good choice. Cloud services are pay-as-you-go, whereas self-built disaster recovery facilities sit idle on standby most of the time, so the cloud suits small and medium-sized businesses well. Once an enterprise uses cloud services to set up a disaster recovery site, its dependence on data center space, IT infrastructure, and IT resources drops sharply, which in turn significantly reduces operating costs. With the help of the cloud, small businesses can also implement disaster recovery systems, which was previously possible only for large enterprises.

2. Cloud disaster recovery level

Cloud disaster recovery levels follow the traditional classification. Because the cloud disaster recovery infrastructure runs on the cloud platform, there is little difference between the application level and the business level, so cloud disaster recovery is divided into two levels: data-level and business-level.

Data-level cloud disaster recovery: data is backed up remotely through the cloud platform so that the original data is not lost or destroyed after a disaster occurs.

Business-level cloud disaster recovery: data is backed up and restored remotely through the cloud platform so that key applications resume operation within the allowed time, losses caused by disasters are minimized, and a defined RPO and RTO are met.

With the gradual cloudification of IT infrastructure, disaster recovery is also facing cloud transformation, and more cloud disaster recovery products and solutions are emerging.

9. Three Centers in Two Places

The two-site three-center architecture is a distributed system architecture pattern used to ensure high availability and fault tolerance. It spreads the system across three data centers: two in the same city and one in a remote location. The two data centers in the same city act as primary and standby, and the remote data center acts as an additional backup.

In this architecture, the two data centers in the same city synchronize data over a high-speed network, which enables primary/standby switchover and fault recovery. When the primary data center fails, the standby data center automatically takes over the service to keep the system continuous and available. The remote data center serves as a backup that provides services when both the primary and standby data centers fail.
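The takeover order described above can be written as a fixed priority list. The center names below are hypothetical labels, and the sketch only decides which center should serve; data synchronization and the actual switchover are out of scope.

```python
from typing import Optional, Sequence, Set

# Priority described above: same-city primary, then same-city standby, then remote backup.
FAILOVER_ORDER: Sequence[str] = ("city-a-primary", "city-a-standby", "city-b-remote")


def serving_center(failed: Set[str], order: Sequence[str] = FAILOVER_ORDER) -> Optional[str]:
    """Return the highest-priority center that has not failed, or None if all have."""
    for center in order:
        if center not in failed:
            return center
    return None
```

For example, if a city-wide event takes out both centers in the first city, serving_center({"city-a-primary", "city-a-standby"}) falls through to the remote center.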

Origin blog.csdn.net/weixin_46706771/article/details/131894473