Building technical team culture: a view from the State of DevOps Report

This article grew out of an internal sharing session, which gave me the chance to reread the State of DevOps Reports from over the years. Most of the time, our understanding of DevOps stays at the level of visible, tangible things: processes, tools, and practices. But as the thoughts at the end of this article put it, we have always believed that technology can change the world, yet very often you have to change people before you can change the world, and changing people is the hardest thing of all. Looking at the familiar subject of DevOps from a cultural perspective therefore gave me some new thoughts and insights.

1. Let's start with a couple of anecdotes

On January 31, 2017, GitLab, one of the world's largest code hosting and collaboration platforms, suffered an 18-hour outage. The cause was that an engineer had deleted the data in the production database.

Hard as it is to accept, this kind of thing is actually not that uncommon. What made it worse was that when GitLab tried to restore the data, they found that none of their carefully designed backup mechanisms could recover what had been deleted. Most striking of all, it was only at this moment that they discovered the scheduled database backups had been failing ever since an upgrade, because of a tool version mismatch. They had assumed that an email alert would flag such a failure, but as it happened, the alerting for the automated job had not taken effect either.

With things at this point, you can either conceal the facts and give the outside world a harmless explanation, or disclose every detail. Which would you choose?

GitLab chose the latter. They took the system offline as quickly as possible and recorded every detail of the incident and its analysis in a public Google document. They also live-streamed the entire recovery process on YouTube, the world's largest video site, and, for users who were not watching YouTube, posted status updates on Twitter at the same time. The incident quickly became a hot topic: more than 5,000 users watched the live stream simultaneously, and it even reached second place on the trending list. For the sake of transparency, they went about as far as anyone could.

A few days later, the company's CEO personally published a 4,000-word postmortem covering the background, the timeline, the root cause analysis, a description of each backup mechanism, and nearly 20 follow-up improvement items, in order to win back users' trust and recognition. On this evidence, their commitment to being transparent, open, and honest is genuine.

https://about.gitlab.com/2017/02/10/postmortem-of-database-outage-of-january-31/

As for the unlucky engineer, his punishment was to be made to watch the Nyan Cat animation for dozens of minutes.

Etsy is an American e-commerce platform for handmade goods. Since going public in 2015, its market value has come close to US$8 billion. Beyond its rapidly growing market value, what is most widely praised is its DevOps capability. Why a relatively little-known company can reach this level can be seen from one small detail.

What you may not know is that for an online e-commerce company, the single most viewed page on any given day is actually the site-unavailable page, commonly the 502 page. When Etsy's website was down, what you saw was a little girl knitting a sweater with three sleeves.

The three-sleeved sweater represents Etsy's attitude towards mistakes. At the company's year-end meeting each year, various awards are handed out, and the prize for one of them is a sweater with three sleeves. It goes to the person who introduced the company's biggest problem of the year because, in their view, making mistakes is not a big deal. **A mistake is not an individual's problem but a problem with the company's systems and mechanisms. It is precisely such mistakes that give the company room to improve and grow, so in a sense the mistake is also a contribution.** I had always assumed this was just a joke, but after searching Google over the past couple of days, I found that they were still handing out the "Three Sleeves Sweater" award in 2023. Not only that, the winners are listed under their real names, and they look quite happy about it. Cultural practices like this may start out looking like a joke, but if you keep doing them, people come to believe that you are serious and gradually accept or even embrace the culture. How long we have to wait for that change to happen is another matter.

2. Back to DevOps and R&D efficiency

Behind these "odd" anecdotes lies a shared set of values that originates in DevOps culture. To understand a technical team, you first need to understand how its members work, how they think, and what values their culture promotes.

What is DevOps?

DevOps, a combination of Development and Operations, derives from a software delivery model proposed in 2007. It is a collective term for a set of processes, methods, and systems used to promote communication, collaboration, and integration between development (application/software engineering), technical operations, and quality assurance (QA) departments. It emerged as the software industry increasingly recognized that, in order to deliver software products and services on time, development and operations must work closely together.

--Wikipedia

DevOps represents a renewal of the software delivery model. At its core, it is a response to the need for rapid innovation under the high uncertainty of the VUCA era: by taking small steps and iterating quickly, teams deliver value to users fast and experiment at low cost.

What is the State of DevOps Report?

In short, the State of DevOps Report is the most authoritative survey report in the industry. It is released once a year and can fairly be called a bellwether for software practitioners.

In 2014, Puppet released a State of DevOps Report jointly with IT Revolution and Thoughtworks. In 2016, the three main authors founded DORA (DevOps Research and Assessment) and proposed the core outcome metrics of DevOps for the first time, which attracted widespread attention in the industry.

In 2018, Google Cloud acquired DORA, and the three key authors joined Google as well. From then on the State of DevOps Report formally split, with two reports (the Puppet version and the DORA version) issued every year. With the key people now at Google, however, the Puppet version gradually declined, and the DORA version is the one mainly followed in China.

The 2019 report was the last one written primarily by the original team. The report was then postponed once in 2020 because of the pandemic, and since it resumed in 2021 its content has largely been produced by Google researchers.

The report proposes four core outcome metrics, in two groups covering efficiency and quality, to measure a company's software delivery performance. Based on a large number of industry survey samples, it defines four performance levels (elite, high, medium, and low) that can serve as industry benchmarks. (By this standard, JD.com sits roughly between medium and high performance.)
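To make the four metrics concrete: they are commonly given as deployment frequency and lead time for changes (efficiency), and change failure rate and time to restore service (quality). The following is a minimal sketch of how a team might compute them from its own deployment history. The `Deployment` record and its field names are hypothetical, invented for this illustration, and the sketch does not reproduce the report's survey methodology or its thresholds for the four performance levels.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import Optional


@dataclass
class Deployment:
    """Hypothetical record of one production deployment (illustrative only)."""
    committed_at: datetime                   # when the change was committed
    deployed_at: datetime                    # when it reached production
    failed: bool = False                     # did it cause a production failure?
    restored_at: Optional[datetime] = None   # when service was restored, if it failed


def four_key_metrics(deployments: list[Deployment], period_days: int) -> dict:
    """Compute the four metrics over a reporting window of `period_days` days."""
    if not deployments or period_days <= 0:
        raise ValueError("need at least one deployment and a positive period")

    failures = [d for d in deployments if d.failed]
    # Only failures that have already been resolved contribute a restore time.
    restore_times = [
        d.restored_at - d.deployed_at for d in failures if d.restored_at is not None
    ]

    return {
        # Efficiency: how often we ship and how long a change takes to ship.
        "deployments_per_day": len(deployments) / period_days,
        "median_lead_time": median(d.deployed_at - d.committed_at for d in deployments),
        # Quality: how often a change breaks production and how fast we recover.
        "change_failure_rate": len(failures) / len(deployments),
        "median_time_to_restore": median(restore_times) if restore_times else None,
    }
```

Medians are used here rather than means so that a single unusually slow change or long outage does not dominate the figure; that is a design choice of this sketch, not something prescribed by the report.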

3. Building organizational culture: a view from the State of DevOps Report

DevOps represents an organizational culture with many facets. To describe this culture objectively, the very first report (2014) already cited the Westrum model to illustrate the impact of organizational culture on organizational performance.

1. Collaboration (Transparency, Communication, Trust)

Collaboration is the foundation of DevOps culture. It aims to break down departmental walls and silos and to achieve rapid delivery through shared responsibility and team autonomy. The most typical example is the "You build it, you run it" culture advocated by Amazon: development teams operate their own services, which breaks the dependence on a separate operations team, while the operations side focuses on building efficient platform service capabilities (you can rely on my platform, but you cannot rely on my people). Similarly, Google's SRE working model advocates quality-based hand-off and hand-back mechanisms. The core idea is to avoid making the transfer of responsibility the goal of collaboration. Put more plainly, collaboration is not about passing the blame or tossing a bomb to someone else, but about allocating resources and dividing work sensibly from the perspective of optimizing the whole process.

Transformational leadership

The 2017 report introduced the concept of transformational leadership: leaders inspire and motivate their followers to perform better, and drive broad organizational change, by appealing to employees' values and sense of mission. For technical managers, who are involved in both technical and management practices, this means supporting front-line employees with a servant mindset while also encouraging them to think about purpose, encouraging innovation, and creating an environment that tolerates failure.

High performance team model

The book "High-Performance Team Model" (the Chinese edition of "Team Topologies"), cited many times in the report, describes the organizational structure advocated by DevOps, which cuts unnecessary communication and collaboration between teams through clearly defined team types and interaction modes. Amazon's culture likewise advocates the two-pizza team, meaning that a team should be kept small enough to be fed by two pizzas, which reduces the cost of communication across the team's network of people.
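One common way to reason about why small teams communicate more cheaply is to count the possible person-to-person channels: a team of n people has n(n-1)/2 distinct pairs, so the count grows roughly with the square of the team size. The sketch below is only an illustration of that arithmetic and is not taken from the report; the function name and the team sizes are arbitrary.

```python
def communication_channels(team_size: int) -> int:
    """Return the number of distinct person-to-person pairs in a team."""
    if team_size < 0:
        raise ValueError("team_size must be non-negative")
    # Each of the n people can pair with n - 1 others; halve to avoid double counting.
    return team_size * (team_size - 1) // 2


if __name__ == "__main__":
    for size in (5, 8, 16, 32):
        print(f"{size:>2} people -> {communication_channels(size)} channels")
```

Under this simple model, doubling a team from 8 to 16 people raises the number of channels from 28 to 120, which is the intuition behind keeping teams small.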

2. Blamelessness and psychological safety (Blameless)

Blameless culture is the most widely known part of DevOps culture. The GitLab and Etsy cases at the beginning show vividly how much blamelessness matters for building employees' psychological safety.

In a two-year study, Google found that high-performing organizations rely heavily on a culture of mutual trust and psychological safety. In particular, a culture of psychological safety can be expected to improve software delivery performance.

Original text of Google research report: https://rework.withgoogle.com/blog/five-keys-to-a-successful-google

In addition, the 2019 report offers a model for improving employee productivity. Productivity here means the ability to complete complex, time-consuming tasks with minimal distraction and interruption; it can also be understood as a smooth working flow or rhythm.

Among the factors in this model, useful and easy-to-use tools are a crucial one. The survey found that elite organizations have the highest share of toolchains that mix in-house and commercial tools, because this makes the most of mature tools while still allowing customization to fit the organization. Conversely, across all performance levels they have the lowest share of fully outsourced tooling, that is, tooling bought entirely from outside, because that approach does not meet the needs of a high-performing organization.

In the article "Software Development Speed", Michael Dubakov identifies wasteful activities (such as inefficient meetings), communication and collaboration (cross-team communication), rework (repeated work caused by requirement changes or quality defects), system complexity, and the number of parallel tasks as the core factors in improving software delivery speed.

A quick plug here: the Xingyun platform now provides a plan workload feature, and team scheduling and other features will continue to be improved to make plans more reasonable and accurate.

http://jagile.jd.com/myzone/department

Over the past year or two, companies in China have gradually begun publishing incident review reports externally, which can be seen as an effective way to build a culture of psychological safety. Openness and transparency do not damage users' trust; they earn users' understanding and recognition.

3. Fast feedback, continuous learning and improvement

The importance of feedback cannot be overstated, for personal growth and for organizational development alike. Feedback can be built into every aspect of work: system-level monitoring and improvement, the growth of individuals and managers, team development, and so on. Continuous learning and improvement is also the final link in DevOps and in many other methodologies, which encourage learning internally from failures.

Many conscientious people spot problems in their work and even fix them locally. What is often missing is a mechanism for spreading such local improvements across the whole organization. Such a mechanism reduces repeated pitfalls and duplicated effort on the one hand, and maximizes the effect of each improvement on the other, so that everyone can work standing on the shoulders of giants.

The awareness, methods, and values in Tencent's approach to massive-scale operations came from front-line innovation and were gradually elevated into a technical culture shared across the group. These distilled principles are simple, easy to understand, and widely circulated: they come from the front line and go back to the front line. Open source culture and code review culture belong here as well.

4. Automation

Gartner has included Platform Engineering in its annual top 10 strategic technology trends for two consecutive years, alongside AI and cloud, which shows that platform tooling is becoming increasingly important.

Everyone loves automation because it eliminates manual labor. Put differently, good, easy-to-use tools are the basis of efficiency, and that holds for efficiency improvement in any field. The automation DevOps advocates is really the most basic quality a software practitioner should have: a developer with no desire to automate tedious, repetitive work is not really qualified. In retail, and indeed across the group, investment in building efficiency tools has been continuous and firm, and this is why the Xingyun platform exists. Our team plays a very important role in this process, as the builder of the platform and also as its designer and advocate.

Everything has two sides, and I have previously thought about the other side of automation. For an efficiency platform, the job is largely to pursue the best solution through balance and compromise.

We are all believers in automation, and we equate automation with efficiency. Our approach in the past was to survey the front line in the field, summarize the offline work processes, move those processes online, and then use automation to make them run more efficiently. There is nothing wrong with this, but it presupposes that the team has already developed a practical, time-tested, and reliable methodology during its offline work. On reflection, there are several problems here.

First, each team adapts its process to its own habits and resources. Without top-level design and strong promotion, the processes and working habits of different teams may all differ. It is hard for a tool to unify these habits, and supporting all of them is costly.

Second, a team's current process is not necessarily optimal; it is the solution reached under existing constraints. These constraints include the team's organizational structure, its division of responsibilities, its available people and their skill levels, and habits carried over from the past. All of this is understandable, but not necessarily right. If a tool merely translates the existing process, the value it produces is very limited, because value often comes from innovation.

Third, from a global perspective, local efficiency gains and global efficiency gains can conflict. When we design a process for the whole, it is likely to harm the interests or metrics of some teams, which introduces many uncertainties into rolling the process out.

Fourth, even if all of the above are solved, automation means people pay less attention. Automation is meant only to improve efficiency, not to shift workload, yet as attention falls and the efficiency delivered by automation rises, the process becomes more rigid and nobody has the motivation to optimize it.

Simply pursuing an ever-higher automation rate is therefore not an ultimate goal, only an attempt along the way. It is easy, however, to equate the value of the process with the value of the result: once we have obtained the result, the process is naturally considered closed. Yet optimizing processes and efficiency never ends, which means we will always be on the road. This test of human nature is the biggest challenge for people who work on efficiency improvement.

4. Final thoughts

• Organizational culture cannot be changed directly. Only by changing behavior first, and letting that behavior become habit, can the culture slowly change.

• Encouraging a behavior requires first establishing a mechanism. A mechanism is whatever makes people willing to act on their own initiative in beneficial ways.

• We thought technology could change the world, but we must first change people before we can change the world.

Author: JD Retail Shi Xuefeng

Source: JD Cloud Developer Community. Please indicate the source when reprinting.

Origin my.oschina.net/u/4090830/blog/10326299