ITIL 4 explanation: IT service continuity management (disaster recovery)

Introduction to IT Service Continuity

The goal of IT service continuity management is to ensure that the availability and performance of IT services are maintained at a sufficient level when a disaster occurs (for an introduction to availability, see "DevOps Operation and Maintenance Series: Availability Management"). In practice, continuity management is what is now usually called disaster recovery management, seen from the IT side.

Definition: Disaster (ITIL 4)

A sudden, unplanned event that causes great damage or serious loss to an organization. To be classified as a disaster, an event must meet certain business impact criteria predefined by the organization.

Many companies now practice continuity management, especially in the financial industry: banking, securities, insurance, consumer finance, and so on. First, companies must respond to regulatory requirements. Both international and domestic standards bodies and regulators have issued a series of management standards for continuity. A partial list:

International standard: ISO 22301:2012 Business Continuity Management
National standard: GB/T 30146-2013 Public safety business continuity management system requirements
National standard: GB/T 20988-2007 Information Security Technology Information System Disaster Recovery Specification
Industry standard: Guidelines for Disaster Recovery Management of Insurance Information System (Bao Jian Fa [2008] No. 20)
Industry standard: Standards for emergency management of important information systems in the banking industry
Industry standard: JR/T 0044-2008 Banking Industry Information System Disaster Recovery Management Specification

Second, ensuring service continuity is becoming both more important and more difficult. In the context of digital transformation in particular, many companies' businesses rely heavily on IT systems, so a severe service interruption can have catastrophic effects on the enterprise.

In ITIL v2, service continuity management is one of the Service Delivery processes. In ITIL v3, continuity management is a process in Service Design. Compared with ITIL v2, the process activities did not change much, but the risk management method is explained in more detail. So what is new in IT service continuity management in ITIL 4?


Terminology of IT service continuity

Definition: Service Continuity

After a disaster or disruptive event occurs, the service provider's ability to continue service operation at an acceptable predefined level.

This definition limits the scope of continuity management to disasters: continuity management covers the planning for, and the response to, catastrophic events. The management of non-catastrophic events is generally not part of the IT service continuity management practice. Examples of what is excluded:

●Minor incidents. Depending on the business impact, an incident is classified as minor or major. Factors to consider include the business activities affected, the scale of the failure, and the time of the failure.

●Strategic, political, market or industry events

Definition: Service Continuity Plan

The service continuity plan guides service providers to respond, recover, and return to normal levels after service interruptions.

The service continuity plan usually includes:

●Response plan: how the service provider initially responds to a disruptive event, such as a fire or a cyberattack, to limit the damage.

●Recovery plan: how the service provider recovers the service so that the RTO and RPO are met.

●Return-to-normal plan: how the service provider resumes normal operation after recovery.

Indicators: RTO and RPO

Definition: RTO Recovery Time Objective

The maximum acceptable period of time that can elapse after a service disruption before the lack of business functionality severely impacts the organization. This represents the maximum agreed time within which a product or activity must be resumed, or resources must be recovered.

Definition: RPO Recovery Point Objective

The point in time to which the information used by an activity must be restored so that the activity can operate when it resumes.

The RTO specifies how long the business may be interrupted. The RPO specifies how much data, measured as a period of time before the interruption, may be lost. RTO and RPO are generally used as the metrics for continuity management and are written into the SLA.
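
As a minimal sketch of how the two metrics are checked after an incident, assuming we know when the outage began, when service was restored, and when the last usable backup was taken (the function name, parameter names, and sample values are all hypothetical):

```python
from datetime import datetime, timedelta


def meets_objectives(outage_start, service_restored, last_good_backup, rto, rpo):
    """Return (rto_met, rpo_met) for a single recovery."""
    # Downtime is compared against the RTO.
    downtime = service_restored - outage_start
    # Data written after the last good backup is lost; that window is
    # compared against the RPO.
    data_loss_window = outage_start - last_good_backup
    return downtime <= rto, data_loss_window <= rpo


# Hypothetical incident: 3 hours of downtime, last backup 90 minutes old.
outage = datetime(2024, 1, 1, 10, 0)
result = meets_objectives(
    outage_start=outage,
    service_restored=outage + timedelta(hours=3),
    last_good_backup=outage - timedelta(minutes=90),
    rto=timedelta(hours=4),
    rpo=timedelta(hours=1),
)
print(result)  # (True, False)
```

In this example the recovery is fast enough for a 4-hour RTO, but 90 minutes of lost data exceeds a 1-hour RPO, so the two objectives have to be checked separately.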


Service continuity management process

Service continuity management activities are divided into the following five processes:

●Governance of service continuity management

●Business Impact Analysis

●Develop and maintain service continuity plan

●Test service continuity plan

●Response and recovery.

1. Governance of service continuity management

Service continuity governance mainly includes three activities: defining the scope, selecting a strategy, and developing awareness and exercise plans. An enterprise's core business is usually large, and its IT systems are even more complex and interdependent. For cost reasons, an enterprise cannot back up every application and infrastructure component. Therefore, first identify the critical business functions and components through business impact analysis (BIA), then choose a disaster recovery method and drill plan for each level of criticality.

2. Business Impact Analysis BIA

Business impact analysis includes the following activities:

●Identifying vital business functions (VBFs)

●Analyzing the consequences of an interruption

●Identifying interdependencies between VBFs

●Determine service continuity requirements

ITIL 4 does not provide specific implementation methods for these activities. I will write a separate article on how to carry out a BIA. The difficulty of a BIA lies at the technical implementation level: system architects must be involved, and the risk assessment also requires technical staff.
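
One part of the interdependence step can be sketched in code. This is only an illustration under my own assumptions, not a method from ITIL 4: each VBF is given an RTO, and every component it depends on, directly or indirectly, inherits the strictest RTO of the VBFs that need it. The function name, the VBF names, and the dependency map are hypothetical.

```python
from datetime import timedelta


def propagate_rto(vbf_rto, depends_on):
    """Give each component the smallest RTO of any VBF that depends on it."""
    required = dict(vbf_rto)
    stack = list(vbf_rto)
    while stack:
        node = stack.pop()
        for dep in depends_on.get(node, ()):
            # A dependency must recover at least as fast as what relies on it.
            if dep not in required or required[node] < required[dep]:
                required[dep] = required[node]
                stack.append(dep)  # re-check the dependency's own dependencies
    return required


# Hypothetical VBFs and the components they rely on.
vbf_rto = {"payments": timedelta(minutes=30), "reporting": timedelta(hours=24)}
depends_on = {
    "payments": ["core-db", "auth"],
    "reporting": ["core-db", "warehouse"],
    "auth": ["ldap"],
}
required = propagate_rto(vbf_rto, depends_on)
for name in sorted(required):
    print(name, required[name])
```

Here the shared database ends up with the 30-minute requirement because payments depends on it, even though reporting alone would tolerate 24 hours. This is the kind of result that needs the system architects' view of the topology.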

3. Develop and maintain a service continuity plan

The steps involved in this process are:

●Service continuity strategy formulation

●Formulation of service continuity plan

●Initial testing of the service continuity plan

The service continuity strategy can include the continuity level, the corresponding RTO and RPO targets, the availability target, and the drill level. For example:

Level requirements for disaster tolerance of cloud computing platforms in the financial sector

| Sphere of influence / Level of danger | Minor impact | General impact | Serious impact |
|---|---|---|---|
| Internal auxiliary management | Level 1 | Level 2 | Level 3 |
| Internal operation management | Level 2 | Level 3 | Level 4 |
| Financial rights of citizens, legal persons and other organizations | Level 3 | Level 4 | Level 5 |
| National financial stability and financial order | Level 4 | Level 5 | Level 6 |


Key indicators:

| Disaster tolerance level | RTO | RPO | Availability |
|---|---|---|---|
| Level 3 | <= 24 hours | <= 24 hours | |
| Level 4 | <= 4 hours | <= 1 hour | |
| Level 5 | <= 30 minutes | Approximately equal to 0 | |
| Level 6 | <= 2 minutes | 0 | |
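
The two tables above can be read as a pair of lookups: first from the sphere of influence and the level of danger to a disaster tolerance level, then from that level to its RTO and RPO targets. The sketch below encodes only the values given in the tables. The shortened keys and function names are my own, and the "approximately equal to 0" RPO of Level 5 is treated as zero, which is an assumption.

```python
from datetime import timedelta

# Rows: sphere of influence. Columns: minor, general, serious impact.
LEVELS = {
    "internal auxiliary management": (1, 2, 3),
    "internal operation management": (2, 3, 4),
    "financial rights": (3, 4, 5),
    "national financial stability": (4, 5, 6),
}
IMPACT = ("minor", "general", "serious")

# (RTO, RPO) per level; the table lists targets only for Levels 3 to 6.
TARGETS = {
    3: (timedelta(hours=24), timedelta(hours=24)),
    4: (timedelta(hours=4), timedelta(hours=1)),
    5: (timedelta(minutes=30), timedelta(0)),  # "approximately 0", assumed 0
    6: (timedelta(minutes=2), timedelta(0)),
}


def required_level(sphere, impact):
    """Look up the disaster tolerance level for one sphere and impact."""
    return LEVELS[sphere][IMPACT.index(impact)]


def targets_for(level):
    """Return (RTO, RPO) for a level, or None if the table lists none."""
    return TARGETS.get(level)


level = required_level("internal operation management", "serious")
print(level, targets_for(level))
```

Keeping the matrix as data rather than as a formula makes it easy to compare against the published table when the standard is revised.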


The drill levels are defined in the "Insurance Industry Information System Disaster Recovery Management Guidelines (Bao Jian Fa [2008] No. 20)" as: tabletop drill, simulation drill, live drill, partial drill and full drill.

4. Test Continuity Plan

This process includes two activities: running the exercises and ongoing review.

5. Response and Recovery

The response includes invoking the service continuity plan, including the corresponding plans of the suppliers involved.


The correlation and difference between IT service continuity and availability

Let me start with the differences.

In terms of objectives, service continuity management does not cover minor or short-term failures that have no serious impact. It focuses on the risks associated with major damage, regardless of how likely they are to occur. These are usually emergencies: fire, flood, power outage, data center failure, and so on. Availability management also cares about the negative impact of failures on service providers and users, but it additionally covers minor interruptions of individual components.

In terms of metrics, service continuity uses RTO and RPO, while availability uses MTTR, MTBF, and availability %.
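
For reference, the availability metrics are related by a common steady-state formula, availability = MTBF / (MTBF + MTTR). A minimal sketch, with a hypothetical function name and sample figures:

```python
def availability_percent(mtbf_hours, mttr_hours):
    """Steady-state availability from mean time between failures and
    mean time to restore, both in hours."""
    return 100.0 * mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical service: fails on average every 999 hours, restored in 1 hour.
print(round(availability_percent(999.0, 1.0), 2))  # 99.9
```

Note that this figure describes routine failures over a long period. It says nothing about how long recovery from a disaster may take, which is what the RTO constrains.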

Now for the connections.

Both service continuity management and availability management rely on VBFs and risk assessment, and both require a BIA of service failures. The documents and outputs produced during implementation can therefore be shared between the two practices. This shows how important the system topology, VBFs, and risk assessment are to IT service operations management: they are foundational information, and beyond these two practices they also feed configuration management. Unfortunately, many companies lack this foundation. Plenty of people talk about high-level methodologies and technologies, but without the basics in place, operations management ends up fragmented, like loose sand, with no evidence to base decisions on.


Summary

For disaster recovery, looking only at IT service continuity management as explained by ITIL is far from enough. We also need to combine it with industry standards and management specifications in order to interpret the regulatory requirements. The main reason is that ITIL explains the IT service continuity management practice mainly at the IT level, whereas from the perspective of enterprise operations what should be implemented is "business continuity management". ITIL 4 does say something about that level, but unfortunately not comprehensively enough. I will write a separate article on the regulatory interpretation of business continuity management.

Continuity management is very complex in both methodology and technical implementation, especially now that many companies are adopting cloud architectures and microservices. How to do disaster recovery for these architectures, and with which technical solutions, is still under discussion. Many vendors specializing in service continuity management have appeared on the market, and companies can choose to outsource continuity management, but how well that works remains to be seen.



Origin blog.51cto.com/yazi0127/2551977