Digital Transformation|Model of Digital Transformation of Banking Data Center 03

Lead:

The digital transformation of a banking data center is a systematic project. At the management level it covers the digital transformation strategy, the transformation of infrastructure and technology architecture, technological innovation, and the transformation of knowledge systems. At the execution level it covers personnel management (P), process management (P), technology management (T), resource management (R), and more. A project of this scale must be planned, broken into steps, and carried out in stages on the basis of a model or a standardized management system if it is to succeed. This article analyzes and interprets the digital transformation model and shares it with readers.

Parts 01 and 02 of this series on the digital transformation model of the banking data center covered the ITIL 4 management system and its reference value for data center transformation, along with the innovative ideas behind ITIL 4's models and guiding principles. This article discusses SRE, a reference framework that is currently receiving much attention, and how it can help the digital transformation of data centers.


The Origins of SRE

SRE is not only a reference framework for operation and maintenance (O&M) management and for the transition from O&M to operations, but also an important reference for the digital transformation of data centers. SRE draws its inspiration from Margaret Hamilton, who is credited with coining the term "software engineering" and who brought software engineering methodology to NASA's navigation and control software, built on the aerospace principle of zero tolerance for failure. SRE has since attracted growing attention and favor among data center managers.

Innovations in the SRE approach to operation and maintenance

1. Small changes, big improvements

Developers and O&M personnel have different goals. Developers aim to respond to the needs of business departments and launch business functions as quickly as possible. O&M personnel, who discover production problems and are directly responsible for them, aim to keep the production environment reliable. Hence the common saying that a developer's instinct is to "step on the gas" while an O&M engineer's instinct is to "step on the brakes". How can the two goals be brought into line so that both groups are willing to work toward a target they agree on? SRE has an answer.

The formula for calculating availability under traditional O&M is:

Availability = (Total Run Time - Unplanned Downtime) / Total Run Time

As the formula above shows, planned downtime is not counted against availability. As long as developers submit a release request in advance and it passes process approval, they can issue a "friendly" maintenance notice, so that customers who log in see a "system is under maintenance..." window. Because the planned downtime was scheduled in advance, developers tend to take release downtime for granted and rarely think about its impact on availability.

So, how is availability calculated under the SRE framework?

SRE introduces the concept of an "error budget" and applies it to new feature launches, routine system maintenance, and troubleshooting alike. It no longer distinguishes between planned and unplanned downtime: whenever the business is stopped, error budget is consumed. Under SRE, the formula for calculating availability therefore becomes:

Availability = (Total Run Time - Error Budget) / Total Run Time
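To make the difference concrete, the two formulas can be compared on the same month of operation. The sketch below is illustrative only: the function names and the downtime figures (120 minutes of planned release windows, 20 minutes of incidents) are assumptions, not data from any real system.

```python
def traditional_availability(total_minutes, unplanned_minutes):
    """Traditional formula: only unplanned downtime counts against availability."""
    return (total_minutes - unplanned_minutes) / total_minutes


def sre_availability(total_minutes, planned_minutes, unplanned_minutes):
    """SRE-style formula: every minute the business is stopped counts."""
    downtime = planned_minutes + unplanned_minutes
    return (total_minutes - downtime) / total_minutes


# Hypothetical 30-day month, expressed in minutes.
TOTAL = 30 * 24 * 60
trad = traditional_availability(TOTAL, unplanned_minutes=20)
sre = sre_availability(TOTAL, planned_minutes=120, unplanned_minutes=20)

print(f"traditional: {trad:.4%}")
print(f"SRE-style:   {sre:.4%}")
```

With these assumed numbers the same month looks noticeably less available under the SRE-style calculation, because the planned release windows are no longer hidden.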

Introducing the error budget looks like a small change, but it is precisely this change that gradually aligns the goals of developers and O&M personnel. Once a release with downtime counts against availability, availability becomes a goal that developers and O&M personnel must guarantee together. The error budget pushes both groups to bring business functions online without downtime wherever possible, using methods such as application high availability and gray release, and to work together to keep error budget consumption low.

The error budget also reflects a business operations mindset: less downtime means more operating income, so ensuring operational reliability becomes a common goal of development and O&M.
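One way to put the error budget into practice is to derive it from an availability target and subtract every outage from it. The following is a minimal sketch under stated assumptions: the 99.9% target, the 30-day window, the event list, and the rule of approving a release only when its expected downtime fits the remaining budget are all hypothetical choices, not rules given in this article.

```python
def error_budget_minutes(target, window_minutes):
    """Downtime allowed in the window for a given availability target."""
    return (1.0 - target) * window_minutes


def remaining_budget(target, window_minutes, downtime_events):
    """Budget left after subtracting all downtime, planned or unplanned."""
    used = sum(minutes for _reason, minutes in downtime_events)
    return error_budget_minutes(target, window_minutes) - used


def may_release(target, window_minutes, downtime_events, release_downtime):
    """Assumed policy: approve only if the release fits the remaining budget."""
    return remaining_budget(target, window_minutes, downtime_events) >= release_downtime


WINDOW = 30 * 24 * 60  # hypothetical 30-day window, in minutes
events = [("planned release", 30), ("incident", 10)]  # made-up history

print(round(error_budget_minutes(0.999, WINDOW), 1))      # budget for the window
print(round(remaining_budget(0.999, WINDOW, events), 1))  # what is left
print(may_release(0.999, WINDOW, events, release_downtime=15))
print(may_release(0.999, WINDOW, events, release_downtime=0))
```

Under such a policy a zero-downtime release (for example a gray release) is always admissible, while a release that needs a maintenance window competes with incidents for the same budget.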

2. Development is an essential O&M skill

SRE advocates that routine O&M work take up no more than 50% of total working hours, with the rest spent on O&M-side R&D. That R&D improves O&M efficiency and safeguards system availability.

Under the SRE framework, a series of measures leads the development team and the O&M team, especially O&M management, to agree that O&M personnel will devote 50% of their effort to developing O&M support tools.

■ First, O&M management regularly measures how team members spend their time, keeps the time spent on routine O&M work within 50%, and so protects R&D hours through work allocation.

■ Second, a rotation mechanism is established between developers and O&M personnel. Rotation lets developers see first-hand how downtime releases and system instability affect availability, and lets them contribute ideas for O&M automation. Because developers understand software architecture better and are more used to solving problems with software thinking, the two groups can improve the O&M model together.

■ Third, alarms and incidents are analyzed regularly, and this analysis is mandatory. SRE's principle is to address the issue, not the person (a blameless approach). The analysis produces strategies for alarm optimization and incident handling, as well as R&D plans for O&M support tools.

O&M-side R&D is the only way to change the O&M model, and building that R&D capability is fundamental to improving O&M efficiency. Ensuring that 50% of O&M personnel's effort goes to O&M-side R&D is a highly innovative idea and one worth promoting.
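The first measure, regularly measuring how time is allocated, lends itself to a small calculation. The sketch below assumes a hypothetical timesheet of `(engineer, category, hours)` records in which the category is either `"ops"` (routine O&M work) or `"dev"` (O&M-side R&D); the record format, the names, and the hours are invented for illustration.

```python
from collections import defaultdict


def ops_share(entries):
    """Return {engineer: fraction of logged hours spent on routine O&M work}."""
    ops = defaultdict(float)
    total = defaultdict(float)
    for engineer, category, hours in entries:
        total[engineer] += hours
        if category == "ops":
            ops[engineer] += hours
    return {e: ops[e] / total[e] for e in total if total[e] > 0}


def over_limit(entries, limit=0.5):
    """Engineers whose routine O&M share exceeds the limit (50% by default)."""
    return sorted(e for e, share in ops_share(entries).items() if share > limit)


# Hypothetical one-week timesheet.
timesheet = [
    ("li", "ops", 30), ("li", "dev", 10),
    ("wang", "ops", 18), ("wang", "dev", 22),
]

print(ops_share(timesheet))
print(over_limit(timesheet))
```

A manager could run such a report per week or per month and rebalance work for anyone who appears in the second list.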

3. The goal of O&M management is to eliminate toil

Most O&M personnel experience their days as an endless stream of chores and ad hoc tasks. For now, let us treat these as the "toil" that SRE describes. Eliminating toil is the pursuit and dream of every O&M manager, but in reality things often backfire and an effective method is hard to find.

SRE's definition of toil (work that has some of the following characteristics counts as toil):

■ Manual: tasks that require manual execution, including manual execution of scripts

■ Repetitive: work done repeatedly

■ Automatable: tasks that a computer could do, or that could be eliminated altogether by some means

■ Tactical: interrupt-driven, reactive, ad hoc work, as opposed to strategy-driven, proactive work

■ No lasting value: the state and efficiency of the service do not improve once the task is done

■ Grows linearly with the service: tasks whose volume grows in step with traffic or the number of users
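As a rough sketch, the six characteristics above can be turned into a checklist score for what SRE calls toil. The trait identifiers, the `Task` type, and especially the threshold of three traits are assumptions made for illustration; the definition above only says that "some" of the characteristics must be present.

```python
from dataclasses import dataclass

# One identifier per characteristic in the list above (names are illustrative).
TOIL_TRAITS = (
    "manual", "repetitive", "automatable",
    "tactical", "no_lasting_value", "scales_with_service",
)


@dataclass(frozen=True)
class Task:
    name: str
    traits: frozenset


def toil_score(task):
    """Number of toil characteristics the task exhibits (0 to 6)."""
    return sum(1 for trait in TOIL_TRAITS if trait in task.traits)


def is_toil(task, threshold=3):
    """Assumed rule: a task counts as toil once it shows `threshold` traits."""
    return toil_score(task) >= threshold


restart = Task(
    "restart hung batch job by hand",
    frozenset({"manual", "repetitive", "automatable", "tactical", "no_lasting_value"}),
)
tooling = Task("design alert deduplication tool", frozenset())

for task in (restart, tooling):
    print(task.name, toil_score(task), is_toil(task))
```

Scoring a backlog this way gives a simple ranking of which tasks to automate first.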

SRE's publicly stated goal is for every engineer to spend at least 50% of their time on O&M-side development. That goal clearly cannot be met if toil, the trivial work described above, is not eliminated or keeps growing. Left unchecked, toil expands until it takes up 100% of O&M personnel's time. Adding O&M staff would solve the problem, but usually headcount cannot be increased, and leadership often wants to shrink the O&M team instead. SRE's idea of eliminating toil is therefore sound and well worth adopting.
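The claim that unchecked trivial work ends up consuming all available time follows directly from its linear growth with the service. The sketch below uses a deliberately simple model with invented numbers: manual work of 4 hours per 1,000 users per month, a fixed team of 5 engineers at 160 hours each, and a user base that grows by 20,000 per month.

```python
def toil_fraction(users, hours_per_1000_users, engineers, hours_per_engineer=160.0):
    """Share of the team's monthly hours consumed by work that scales with users."""
    toil_hours = users / 1000 * hours_per_1000_users
    return toil_hours / (engineers * hours_per_engineer)


def first_month_over(limit, start_users, monthly_growth, months,
                     hours_per_1000_users, engineers):
    """First month (0-based) in which the toil share exceeds `limit`, else None."""
    for month in range(months + 1):
        users = start_users + month * monthly_growth
        if toil_fraction(users, hours_per_1000_users, engineers) > limit:
            return month
    return None


# Hypothetical service and team.
params = dict(start_users=100_000, monthly_growth=20_000, months=24,
              hours_per_1000_users=4.0, engineers=5)

print(first_month_over(0.5, **params))  # when the 50% goal is first missed
print(first_month_over(1.0, **params))  # when the team is fully saturated
```

With a fixed headcount, only lowering the hours needed per user, that is, automating the work away, moves either date.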

SRE practice results for reference

SRE is not only highly innovative in its approach to O&M; it has also produced a large body of practical results that the digital transformation of data centers can draw on.

1. Capability transformation under the framework of SRE

SRE is committed to replacing so-called "trivial" O&M work with automated and intelligent tools, so the knowledge structure of the O&M organization changes. First, the requirements for talent are very high: engineers must understand basic software and hardware systems and must also understand development, so recruitment is difficult and per-person cost is high. Second, junior engineers mainly perform pipeline-style operations and handle the tasks that machines cannot take over, such as intervening in the operation of support tools. Third, the former second-line engineers for system O&M and application O&M become designers and developers of software that improves O&M efficiency.

An organization running traditional O&M that plans to move to SRE needs to set up an independent SRE team first and then gradually transform its existing O&M personnel. Running both in parallel is costly in the early stage, and managers need the courage to innovate.

O&M in the SRE model needs relatively fewer people at a relatively higher cost per person, and delivers better system reliability and O&M efficiency. The choice of SRE should therefore not be driven by cost savings alone. It should be driven by the need to guarantee the reliability of a growing volume of business and by the shift from O&M to operations. As the digitalization of the banking industry deepens, the shift from O&M to operations is inevitable, and integrating SRE concepts is the general trend. The era of software-defined O&M has arrived.

2. Solve the capacity planning problem

Moving resources to the cloud solves two problems of traditional application construction: hardware had to be purchased along with each application, and hardware resources were used inefficiently. How to plan capacity for the cloud resource pool, however, has become a new problem. SRE offers fresh thinking on capacity planning, and its concepts and methods are worth learning from.

SRE capacity planning concerns:

■ There must be an accurate demand forecasting model for natural growth, also called organic growth (resource usage rising as user activity rises). The forecast horizon must cover the lead time for acquiring resources, so that natural growth during forecasting and procurement is accounted for.

■ Non-natural (inorganic) growth must be considered as well: traffic from new product launches prompted by changes in the environment, commercial promotions, competition within the industry, and so on. Accurate sources and figures for non-natural demand must be included in planning.

■ Stress tests must be run periodically to check whether system performance meets requirements and to assess how much capacity expansion is needed. Periodic stress testing is especially important after hardware changes and upgrades (such as platform downsizing or Xinchuang transformation).
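The three concerns above can be combined into a small planning calculation: project natural (organic) growth, add known non-natural (inorganic) events, and compare the result with the capacity confirmed by the latest stress test. This is a minimal sketch under stated assumptions. The linear growth model, the 25% safety headroom, the vCPU figures, and the function names are all hypothetical.

```python
def forecast_demand(current, organic_per_month, inorganic_events, horizon_months):
    """Projected demand for months 1..horizon.

    inorganic_events maps a month number to extra demand (for example a
    promotion or product launch) that persists from that month onward.
    """
    demand = []
    extra = 0.0
    for month in range(1, horizon_months + 1):
        extra += inorganic_events.get(month, 0.0)
        demand.append(current + organic_per_month * month + extra)
    return demand


def month_capacity_exceeded(tested_capacity, demand, headroom=0.25):
    """First month (1-based) in which demand exceeds usable capacity, else None."""
    usable = tested_capacity * (1 - headroom)
    for month, value in enumerate(demand, start=1):
        if value > usable:
            return month
    return None


def must_order_now(tested_capacity, demand, lead_time_months, headroom=0.25):
    """True if capacity runs out within the procurement lead time."""
    month = month_capacity_exceeded(tested_capacity, demand, headroom)
    return month is not None and month <= lead_time_months


# Hypothetical pool: 600 vCPUs in use, +20 per month, a promotion in month 3.
demand = forecast_demand(600, 20, {3: 100}, horizon_months=6)
print(demand)
print(month_capacity_exceeded(1000, demand))
print(must_order_now(1000, demand, lead_time_months=3))
```

The forecast horizon is chosen to be at least as long as the lead time, so that growth occurring while resources are being acquired is already included.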

3. Make operation and maintenance simple

As Gall's Law describes, any complex system that works has evolved gradually from a simple system. The same holds for operating a cloud platform or an application system: over time it grows more and more complex, in both structure and relationships, until few individual O&M engineers can describe it fully and clearly. To ensure O&M quality and improve efficiency, O&M work must therefore return to simplicity.

How to make operation and maintenance simple?

SRE offers a good deal of inspiration and reference on how to return O&M work to simplicity:

■ O&M work must start with good monitoring. SRE attaches great importance to monitoring and shares its experience openly. Typically, O&M personnel receive a monitoring alarm and then track down and troubleshoot the problem from there. SRE holds that the monitoring system should alert O&M personnel only when an event requires manual intervention, and that O&M personnel should not have to troubleshoot further after receiving the alarm: the monitoring system itself should complete the root cause analysis and identify the exact cause of the event. This places very high demands on the monitoring system and is a goal of O&M-side R&D.

■ O&M personnel need to decouple complex architectures and complex O&M work in order to make the work simple. Today, O&M relies heavily on cloud vendors for solutions. What needs to be done in that case is to return the cloud management platform to a fully open, loosely coupled architecture, which simplifies the handling of O&M issues and makes O&M simple.

■ At the management level, SRE managers need to reserve enough time for O&M-side R&D, use that R&D to simplify O&M work and return it to simplicity, and so achieve the availability goal. (to be continued)
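The monitoring idea in the first point above, where an alarm arrives with its cause already identified, can be illustrated with a toy dependency graph. This is only a sketch of one simple heuristic (report the failing components that do not depend on any other failing component); real root cause analysis is far harder, and the component names and data structures here are invented.

```python
def root_causes(dependencies, failing):
    """Failing components none of whose own dependencies are failing.

    dependencies maps a component to the components it depends on.
    """
    failing = set(failing)
    return sorted(
        component for component in failing
        if not any(dep in failing for dep in dependencies.get(component, ()))
    )


def build_alert(dependencies, failing):
    """One alert naming likely causes and the symptoms they explain, or None."""
    causes = root_causes(dependencies, failing)
    if not causes:
        return None
    symptoms = sorted(set(failing) - set(causes))
    return {"root_causes": causes, "symptoms": symptoms}


# Hypothetical service topology.
deps = {
    "mobile-banking": ["api-gateway"],
    "api-gateway": ["core-db"],
    "core-db": [],
}

print(build_alert(deps, ["mobile-banking", "api-gateway", "core-db"]))
print(build_alert(deps, ["mobile-banking"]))
```

Instead of three separate alarms, the engineer on duty receives one alert that points at the database and lists the other two services as symptoms.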
--------------------------------------
©Copyright belongs to the author. This is an original work from a 51CTO blogger, CLP Jinxin talents. Please contact the author to obtain reprint authorization; otherwise legal responsibility will be pursued.


Origin blog.csdn.net/zhongdianjinxin/article/details/131447003