The Way of SRE: Creating Software Systems to Keep Systems Running

Introduction: Ben Treynor Sloss, the senior vice president who oversees operations at Google and who coined the term SRE, gives his own definition of SRE here.
This article is selected from "SRE: Google Operation and Maintenance Decryption".

  Computer software systems generally cannot run on their own without human intervention. So how should an increasingly complex, large-scale distributed computing system be operated and maintained? Hiring system administrators (sysadmins) to run complex computer systems has long been common practice in the industry. Google's answer is SRE (Site Reliability Engineering).
  Instead of relying on manual operations as in the traditional model, an SRE team employs software engineers who build software systems to keep the service running.
  How exactly did SRE originate at Google? In fact, my answer is very simple: SRE is the result of having software engineers design a new type of operations team. When I joined Google in 2003, my assignment was to lead a "production environment maintenance group" of seven software engineers. At the time, my entire career had been devoted to software engineering, so it was natural that I built the team around the way I was most comfortable working and managing. 
  Over time, that seven-person team has grown into an SRE organization of more than 1,000 people, but its guiding philosophy and working methods have stayed largely true to my original ideas.
  A core building block of the SRE methodology is the composition of the SRE team. Every SRE team is made up of basically two types of engineers.
  The first type, 50% to 60% of the team, are standard software engineers: specifically, people hired through Google's normal software engineer recruitment process. The second type, the remaining 40% to 50%, are engineers who come very close to the Google software engineer bar (with 85% to 99% of the required skills) and who also bring additional technical skills. At present, UNIX system internals and networking knowledge (Layer 1 to Layer 3) are the two kinds of additional expertise that Google values most.
  In addition, every SRE team member must be willing to use, and convinced of the value of, software engineering methods for solving complex operational problems. Google has closely tracked how both types of hires perform on SRE teams, and so far has found no significant difference in their work or performance ratings. In fact, because the two types of engineers have complementary technical backgrounds, SRE teams often find new and efficient ways to solve problems.
  After recruiting and managing SRE teams according to this standard, we quickly found that SRE team members share the following characteristics:
  (a) They have a natural aversion to repetitive, manual operations.
  (b) They have the technical skills to quickly develop software systems that replace manual operations.
  At the same time, SRE team members closely resemble their counterparts in the product R&D department in academic and professional background. In essence, then, SRE applies software engineering thinking and methods to tasks that teams of system administrators used to perform by hand. SREs tend to replace manual operations by designing and building automated tools.
  The key to the success of the SRE model is its focus on engineering. Without continuous engineering work, the operational load grows and the team needs more people just to keep up. A traditional Ops team grows roughly linearly with the load of the product it serves: if a product is very successful and user traffic grows, more team members are needed to do the same tasks over and over.
  To avoid this, the team responsible for operating the service must have enough time to write code, or it will be swamped by operational work. For this reason, Google places a 50% cap on the total time an SRE team spends on traditional operations work, such as ticket handling and manual operations. The cap ensures that the SRE team has enough time to improve the services it maintains, making them more stable and easier to operate. The cap is an upper limit, not a target. Over time, an SRE team should eliminate basic operations work and concentrate on R&D tasks, because the system as a whole should be able to run on its own and repair problems automatically. Our ultimate goal is to drive the entire system toward unattended operation, not merely to automate individual manual procedures. Of course, in practice the continuous growth in service scale and the launch of new features keep SREs busy enough!
  Google's rule of thumb is that an SRE team must spend at least 50% of its time on actual development work. So how do we make sure every team does this? First, we continuously measure how each team allocates its time. Based on this data, SRE management adjusts teams that are not investing enough time in development. Usually, management asks the team to hand some routine operations work back to the product development department, or brings engineers from the product development department into the team's on-call rotation. Management can also stop assigning any new operations work to the SRE team. Only when management actively maintains this balance for each SRE team can the team have enough time and energy for truly creative, independent development work. At the same time, the balance ensures that the SRE team retains enough operations experience to design systems that actually solve problems.
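  The measurement step described above can be pictured with a minimal sketch. It assumes each team logs its hours in two buckets, operations and development; the `TeamHours` schema, the team names, and the numbers are hypothetical and only illustrate the 50% cap, not how Google actually tracks time.

```python
from dataclasses import dataclass

OPS_CAP = 0.50  # the 50% ceiling on operations work described in the text


@dataclass
class TeamHours:
    """Hours logged by one team in a reporting period (hypothetical schema)."""
    name: str
    ops_hours: float  # tickets, on-call, manual operations
    dev_hours: float  # engineering and project work

    def ops_fraction(self) -> float:
        """Share of the team's logged time spent on operations work."""
        total = self.ops_hours + self.dev_hours
        return self.ops_hours / total if total > 0 else 0.0


def teams_over_cap(teams, cap=OPS_CAP):
    """Return the names of teams whose operations share exceeds the cap."""
    return [t.name for t in teams if t.ops_fraction() > cap]


# Made-up data for two imaginary teams.
teams = [
    TeamHours("storage-sre", ops_hours=320, dev_hours=480),
    TeamHours("ads-sre", ops_hours=520, dev_hours=280),
]

for t in teams:
    print(f"{t.name}: {t.ops_fraction():.0%} of time on operations")
print("teams needing rebalancing:", teams_over_cap(teams))
```

  A team flagged by such a check is a candidate for the adjustments described above: handing work back to the development department or pausing new operational responsibilities.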
  We have found that the Google SRE model has many advantages when operating large-scale, complex systems. Because SREs routinely take part in developing and modifying code as they tune Google's systems, SRE culture stands for speed, innovation, and embracing change within the company. Experience has shown that the number of SREs needed to operate, maintain, and improve a complex system grows sublinearly with the scale of the deployment; operating the same system under the traditional system administrator model requires more people. Finally, the SRE model not only removes the conflict of focus between the R&D team and the operations team found in the traditional model, it also raises the level of the entire product department. Because people can move freely between the SRE team and the R&D team, everyone in the product group has the opportunity to learn about and take part in large-scale operations and deployment, gaining valuable knowledge that is otherwise hard to obtain. How many opportunities does an ordinary developer have to run a program on a distributed system with 1 million CPUs at the same time?
  While the SRE model brings these advantages, it also has some problems. One persistent challenge for Google is recruiting the right SREs. First, SRE has to compete with the product development departments for the same software engineering candidates.
  Second, because SREs need several skill sets at once, few people on the market have the relevant background and experience. And since the SRE model is still relatively new, the industry offers little information on how to build and maintain an SRE team. Finally, once an SRE team is established, it needs strong management support, because the SRE model adopts some practices that run counter to convention in order to improve reliability. For example, a decision to stop releasing new features because the error budget for the quarter has been used up may need management backing before the product development department takes it seriously.
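  The error budget mentioned in the example can be illustrated with a small sketch. The article does not define the term, so this assumes the common reading: the budget is the share of requests allowed to fail under an availability target (1 minus the target) over the quarter. The function names, the 99.9% target, and the request counts are all hypothetical.

```python
def error_budget_remaining(slo_target, total_requests, failed_requests):
    """Fraction of the quarter's error budget still unspent.

    Assumes the budget is (1 - slo_target) * total_requests failures.
    The result is negative once the budget has been overspent.
    """
    allowed_failures = (1.0 - slo_target) * total_requests
    if allowed_failures <= 0:
        return 0.0  # a 100% target leaves no budget at all
    return 1.0 - failed_requests / allowed_failures


def releases_allowed(slo_target, total_requests, failed_requests):
    """New feature releases continue only while some budget remains."""
    return error_budget_remaining(slo_target, total_requests, failed_requests) > 0.0


# Hypothetical quarter: 99.9% target, 10 million requests,
# so roughly 10,000 failures are allowed.
for failed in (4_000, 12_000):
    left = error_budget_remaining(0.999, 10_000_000, failed)
    print(f"{failed} failures: {left:.0%} of budget left, "
          f"release allowed: {releases_allowed(0.999, 10_000_000, failed)}")
```

  Under these assumptions the second case overspends the budget, which is the situation in which the text says management backing may be needed to hold back a release.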