An Alibaba technical expert's understanding of SRE and stability assurance

Author | Wu Peng
Source | Alibaba Cloud Native Official Account

Preface

In technical work, the two roles of product/basic technology R&D and SRE are usually distinguished by whether the work centers on coding. When product R&D engineers consider moving to SRE, a common concern is whether they would be "leaving coding work" or "drifting away from advances in product/basic technology."

Drawing on my past experience in technology R&D and stability assurance, I will share my personal understanding of SRE and explore how the two roles of "product/basic technology R&D" and "stability assurance" can collaborate to better serve the business.

SRE overview

The earliest discussion of SRE comes from Google's book "Site Reliability Engineering: How Google Runs Production Systems". Key members of Google's SRE team share how they focus on the entire life cycle of software, and why this helps Google successfully build, deploy, monitor, and operate some of the largest software systems in the world.

Douban link of the book: https://book.douban.com/subject/26875239/

Site reliability engineering (SRE) is a discipline that incorporates aspects of software engineering and applies them to infrastructure and operations problems. The main goals are to create scalable and highly reliable software systems.

There is a sentence describing the work of SRE:

SRE is “what happens when a software engineer is tasked with what used to be called operations.”

That is, the goal of SRE is to build a scalable and highly available software system, and solve infrastructure and operation-related problems through software engineering.

The Google SRE book gives a precise description of how an SRE's working time is split: at most 50% of time and energy goes to operations-related work, and the remainder, at least 50%, goes to software engineering that ensures the stability and scalability of the infrastructure.

Based on the above description, my understanding of SRE is:

  • Responsibilities: Ensure the stability and scalability of the infrastructure.
  • Core: Solving problems.
  • Method: Accumulate experience with problems through operations work, and improve problem-solving efficiency through coding and other means.

Software life cycle

In the Google SRE book, there is a vivid description of software engineering from a life cycle perspective:

Software engineering is sometimes similar to raising a child: although giving birth is painful and difficult, raising the child to adulthood is where most of the effort is actually spent.

40% to 90% of the total cost of a software system is actually incurred in ongoing maintenance after development and construction are complete.

Over a project's life cycle, the time and energy spent on designing and building a software system is usually less than what is spent on maintaining and managing it after it goes online. To keep the system running reliably, two types of roles need to be considered:

  • Focus on designing and building software systems.
  • Focus on managing the life cycle of the entire software system, from design and deployment, through continuous improvement, to its eventual smooth decommissioning.

The first type of role corresponds to product/basic technology research and development, and the second type of role corresponds to SRE. The common goal of the two is to achieve project goals and collaborate to serve the business.

The value of stability assurance

Engineers who directly handle customer issues feel the impact of stability most keenly:

  • From the severity and urgency that customers report directly when a problem occurs, we can feel the anxiety that instability causes them.
  • From customers' feedback after the problem is resolved, we can feel their gratitude, or their anger, toward stability assurance.
  • From subsequent changes in revenue and customer base, we can feel the impact of stability on business revenue.
  • From delays in the product plan, we can feel the impact of stability on product iteration.

This highlights the value of stability assurance:

  • Protect the customer's product experience and satisfy the customer's demand for reliability.
  • Accelerate business iteration and meet the business's requirements for stability, so that the business can focus on shipping features that meet customer needs more quickly.

How SRE ensures stability

Stability issues usually have these characteristics:

  • Human-induced, and dependent on expert experience
  • Caused by a chain of factors
  • Inevitable
  • 100% stability is not necessary

A high proportion of online stability problems are caused by improper human operations, concentrated in two activities: releases and online operation and maintenance, both of which are performed frequently. For complex systems, both rely heavily on expert experience.

Stability problems are usually systemic: they are caused not by a defect in a single functional component but by a chain of factors. For example, missing monitoring and alerting means problems are not detected in time; missing logs make it hard to locate the problem quickly; the lack of a good troubleshooting process makes resolution depend on individual ability; and poor coordination and communication lengthen handling time and widen the impact on customers.

Problems are inevitable. Sudden traffic spikes, server/network/storage failures, unhandled inputs, and so on can all trigger them.

The business has external SLAs that promise customers a certain level of stability; if that level is not met, compensation is paid according to the agreement. At the same time, problems are inevitable. Once the internal SLO is being met, continuing to raise stability brings higher implementation costs and smaller gains in business revenue.
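
The trade-off between a stability target and its cost can be made concrete with a little arithmetic. The sketch below is illustrative only: the 30-day window and the availability targets are assumed values, not figures from this article, and `allowed_downtime_minutes` is a hypothetical helper name.

```python
def allowed_downtime_minutes(slo: float, window_days: int = 30) -> float:
    """Return the downtime, in minutes, that an availability target tolerates.

    slo is a fraction such as 0.999; window_days is an assumed measurement window.
    """
    if not 0.0 < slo <= 1.0:
        raise ValueError("slo must be in (0, 1]")
    total_minutes = window_days * 24 * 60
    return total_minutes * (1.0 - slo)


if __name__ == "__main__":
    # Assumed example targets: two, three and four nines.
    for target in (0.99, 0.999, 0.9999):
        minutes = allowed_downtime_minutes(target)
        print(f"{target:.2%} availability tolerates {minutes:.1f} min per 30 days")
```

Under these assumptions, each additional "nine" shrinks the tolerated downtime tenfold (from about 432 minutes, to about 43, to about 4), which is one way to see why pushing stability well beyond the agreed target gets expensive quickly.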

SRE needs to understand these characteristics in depth, design and implement solutions systematically, and focus on solving the main problems of each period. An overall solution for reference is as follows:

1.png

During implementation, the following three areas can be tackled first:

  • Controllability
  • Observability
  • Best practices for stability assurance

Controllability includes the following three main dimensions:

  • Release management

    • Focus on solving human-caused stability problems introduced by releases.
    • Includes reviewing important changes before release and managing change actions during release.
  • Operation management

    • Focus on solving human-caused stability problems introduced by "black screen" (direct command-line) operations.
    • Includes a unified entry point for cluster operations, cluster operation permission management, cluster operation auditing, etc.
  • Design review
    • Focus on applying stability assurance best practices during the software system design phase.
    • Includes cluster plan review, design review of important functions, etc.
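
The operation management ideas above (a unified entry point, permission management, and auditing) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a description of any real Alibaba tool: `OperationGateway`, the user name, and the operation name are all hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Callable, Dict, List, Set


@dataclass
class OperationGateway:
    """Hypothetical single entry point through which all cluster operations pass."""

    permissions: Dict[str, Set[str]]  # user -> names of operations the user may run
    audit_log: List[dict] = field(default_factory=list)

    def run(self, user: str, operation: str, action: Callable[[], str]) -> str:
        allowed = operation in self.permissions.get(user, set())
        # Record every attempt, including rejected ones, so the audit trail is complete.
        self.audit_log.append({
            "time": datetime.now(timezone.utc).isoformat(),
            "user": user,
            "operation": operation,
            "allowed": allowed,
        })
        if not allowed:
            raise PermissionError(f"{user} may not run {operation}")
        return action()


if __name__ == "__main__":
    gateway = OperationGateway(permissions={"alice": {"restart-node"}})
    print(gateway.run("alice", "restart-node", lambda: "node restarted"))
    print(len(gateway.audit_log), "audit entries recorded")
```

Routing every operation through one function is the design point here: permission checks and audit records cannot be skipped, because there is no other path to the cluster.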

Observability includes the following main dimensions:

  • Monitoring

    • Focus on awareness of the software system's running state.
    • Includes building and maintaining monitoring collection and visualization systems.
  • Logging

    • Focus on the ability to troubleshoot software system problems.
    • Including the construction and maintenance of log collection/storage/query/analysis systems.
  • Inspection

    • Focus on actively probing whether the software system is functioning normally.
    • Including the construction of inspection services, the development and maintenance of general inspection logic, etc.
  • Alert
    • Focus on making sure anomalies reach the right people in time.
    • Including alarm system construction, alarm configuration management, alarm path management, alarm analysis, etc.
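
To show how inspection and alerting can fit together, here is a minimal sketch in which periodic probes feed a simple alert rule. The rule (alert after a number of consecutive failures) is an assumed example policy rather than one prescribed by this article, and `ConsecutiveFailureAlert` and `run_inspection` are hypothetical names.

```python
from typing import Callable, Dict, List


class ConsecutiveFailureAlert:
    """Fire an alert once a check has failed `threshold` times in a row."""

    def __init__(self, threshold: int = 3) -> None:
        if threshold < 1:
            raise ValueError("threshold must be at least 1")
        self.threshold = threshold
        self.failures: Dict[str, int] = {}

    def observe(self, check_name: str, healthy: bool) -> bool:
        """Record one inspection result; return True when an alert should fire."""
        if healthy:
            self.failures[check_name] = 0
            return False
        self.failures[check_name] = self.failures.get(check_name, 0) + 1
        # Fire exactly once, when the failure streak first reaches the threshold.
        return self.failures[check_name] == self.threshold


def run_inspection(
    checks: Dict[str, Callable[[], bool]],
    rule: ConsecutiveFailureAlert,
) -> List[str]:
    """Run every probe once and return the names of checks that should alert."""
    alerts = []
    for name, probe in checks.items():
        try:
            healthy = probe()
        except Exception:
            healthy = False  # a probe that crashes is treated as a failed check
        if rule.observe(name, healthy):
            alerts.append(name)
    return alerts
```

Requiring several consecutive failures before alerting is one common way to reduce noise from transient errors; the right threshold depends on how often the inspection runs and how quickly the problem must be noticed.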

Best practices for stability assurance are the awareness, processes, specifications, and tools abstracted from historical issues and industry practice. They should be built in from the start of system design and applied throughout the system's life cycle, for example by codifying them in templates:

  • Project quality acceptance criteria
  • Project safety production standards
  • Checklist before project release
  • Project TechReview template
  • Project Kick-off template
  • Project Management Specification
  • etc.

An example:

2.png

To make them easier to understand, the check items can be further classified, which helps communication and project stability assessment:

3.png

Once best practices are standardized in documentation, tools or services can be provided to apply them at low cost, turning stability assurance best practices into infrastructure. SRE needs to keep iterating on stability-related methodology and practice, designing top-down and collecting feedback bottom-up, to assure stability in a reasonable and reliable way.
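
As one way to picture a documented checklist becoming a tool, the sketch below represents check items as data, groups failures by category, and gates a release on the result. The figures in this article are not reproduced here, so the item names are invented for illustration; the categories reuse the controllability and observability areas described earlier.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class CheckItem:
    category: str  # e.g. "controllability" or "observability"
    name: str
    passed: bool


def failed_by_category(items: List[CheckItem]) -> Dict[str, List[str]]:
    """Group the names of failed check items by category."""
    failed: Dict[str, List[str]] = {}
    for item in items:
        if not item.passed:
            failed.setdefault(item.category, []).append(item.name)
    return failed


def ready_for_release(items: List[CheckItem]) -> bool:
    """A release is ready only when every check item passes."""
    return all(item.passed for item in items)


if __name__ == "__main__":
    # Hypothetical check items, made up for this example.
    checklist = [
        CheckItem("controllability", "important changes reviewed", True),
        CheckItem("observability", "alerts configured", False),
        CheckItem("observability", "logs collected", True),
    ]
    print(ready_for_release(checklist))
    print(failed_by_category(checklist))
```

Grouping failures by category mirrors the classification idea above: a reviewer can see at a glance which area of stability assurance a project is weakest in.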

Win-win: serving the business together

  • Product/basic technology R&D: Focus on designing and building software systems.
  • SRE: Focus on managing the life cycle of the entire software system, from design and deployment, through continuous improvement, to its eventual smooth decommissioning.

The two roles cooperate with and serve each other, and share a common goal: to meet business needs and better serve the business.

SRE usually supports multiple projects horizontally, and so has a more comprehensive view of the types of online problems and of the practices for solving them. On this basis, SRE develops best-practice theory, tools, and services that support R&D. It is also possible to productize stability assurance solutions on this basis, serving more customers and creating greater value. Product/basic technology R&D has a deeper understanding of business requirements and of functional and technical details. On the one hand, it delivers business value directly; on the other hand, its practice surfaces real requirements for stability assurance, so that it can further ensure stability together with SRE.

The two roles need to work side by side toward a common goal, grow together with the business, and achieve a win-win outcome.

Summary

Due to the nature of the work, SRE serves many businesses horizontally and, through practice, builds a deep understanding both of the stability assurance problem domain and of its importance. Vertically, SRE maximizes stability assurance through technical means, and distills and applies best practices. At the same time, SRE looks ahead together with R&D and the business, combining technology and management to create value.

The above is my personal understanding of SRE and stability assurance, with a focus on solving problems and creating greater value.

Origin blog.51cto.com/13778063/2608526