"DevOps Practice Guide" - Reading Notes (7)

Part 5 The Third Way: Technical Practices of Continual Learning and Experimentation

In Part 3, we discussed the technical practices required to establish fast flow within the value stream. In Part 4, our goal was to create as much feedback as possible, from as many areas of the working system as possible, and to get it sooner, faster, and cheaper.

In Part 5, we present practices that create learning opportunities as quickly, frequently, and cheaply as possible. This includes learning from accidents and failures, which are inevitable when we work in complex systems, and organizing and designing our systems of work so that we can continually experiment and learn, making them ever safer. The desired outcomes include greater resilience and an ever-richer collective knowledge of how our systems actually behave, which lets us better achieve our goals. In the next few chapters, we will create mechanisms for safety, continuous improvement, and learning by doing:

  • Build a just culture where people feel safe;
  • Enhance the reliability of the production environment through fault injection;
  • Transform locally discovered empirical knowledge into global improvements;
  • Set aside dedicated time slots for organizational improvement and learning activities.

19. Integrate learning into daily work

When working in complex systems, it is impossible to predict all possible outcomes. Even with static preventive tools such as checklists and standardized operating manuals, accidents still happen, sometimes catastrophically. These tools simply document our current understanding of the system.

To work safely in complex systems, organizations must be able to better self-diagnose and improve themselves. They must become adept at identifying and solving problems and amplifying the effects by disseminating solutions throughout the organization. This approach creates a dynamic learning system that allows us to understand errors and translate that understanding into actions to prevent the recurrence of those errors.

This is what Dr. Steven Spear calls a resilient organization, one that is "skilled at detecting problems, solving them, and multiplying the effect by making the solutions available throughout the organization." Such organizations can heal themselves. "For such an organization, responding to a crisis is not special work; it is something that is done all the time. This responsiveness is the source of their reliability."

The Netflix team achieved this resilience goal by running Chaos Monkey, which continually injects failures into pre-production and production environments. As you might expect, when Chaos Monkey was first run in production, services failed in ways no one had imagined. By continually finding and fixing these issues during normal business hours, Netflix engineers quickly and iteratively built a more resilient service, while creating organizational learning (during regular business hours!) that allowed them to develop a system that outperforms all competitors.

Chaos Monkey is just one example of integrating learning into daily work. The story also shows how learning organizations think about breakdowns, accidents, and mistakes: as opportunities for learning rather than punishment. This chapter explores how to create a learning system, how to establish a just culture, and how to accelerate learning through regular rehearsals and deliberately simulated failures.

19.1 Establish a just, learning-based culture

One prerequisite of a learning culture is that when incidents occur (and they undoubtedly will), the response to them must be seen as "just". Dr. Sidney Dekker, who codified some of the key elements of safety culture, uses the term just culture. He writes: "If responses to incidents and accidents are seen as unjust, they can impede safety investigations, promote fear rather than mindfulness in people who do safety-critical work, make organizations more bureaucratic rather than more careful, and cultivate professional secrecy, evasion, and self-protection."

Dr. Sidney Dekker calls this notion of eliminating error by eliminating the people who caused it the bad apple theory. He asserts that it is invalid, because "human error is not the cause of the problem; rather, it is the result of the design of the tools we provided."

Bad apple theory: if you leave one bad apple in a basket of good apples, you end up with a basket of rotten apples.

If accidents are not caused by "bad apples" but by inevitable design problems in the complex systems we build, then there should be no "naming, blaming, and shaming" of the people who cause failures. Our goal should always be to maximize opportunities for organizational learning, continually emphasizing how much we value surfacing and communicating problems widely in our daily work. This improves the quality and safety of the systems we work in and strengthens the relationships between everyone who works in them.

When engineers make mistakes and feel safe giving full details, they are not only willing to be held accountable, they are also enthusiastic about helping others avoid the same mistake. This is what creates organizational learning. If instead we punish the engineer, everyone loses the motivation to provide the necessary details, making it impossible to understand the mechanism, context, and operation of the failure, which all but guarantees that the failure will recur.

Two effective practices help create a just, learning-based culture: blameless post-mortems, and the controlled injection of failures into production, which creates opportunities to practice for the inevitable problems of complex systems.

19.2 Hold blameless post-mortem meetings

To help build a just culture, when accidents and significant incidents occur (e.g., a failed deployment, a production issue that affects customers), we should hold a blameless post-mortem once the problem has been resolved, examining "not just the incident itself, but also the mechanisms and circumstances in which it occurred, and the decision-making processes of the people involved."

This practice is also known as post-incident analysis or post-mortem review; a similar routine review (the retrospective) is already part of many iterative and agile development practices.

To do this, we schedule the post-mortem meeting as soon as possible after the incident, before memories fade, cause and effect become blurred, and circumstances change. (Of course, we wait until the problem has been resolved so as not to distract the people still actively working on it.) In a blameless post-mortem meeting, we do the following:

  • Construct a timeline and gather details about the failure from multiple perspectives, ensuring that people who made mistakes are not punished;
  • Empower all engineers to improve safety by asking them to give detailed accounts of how they contributed to the failure;
  • Enable and encourage those who make mistakes to become the experts who educate others on how not to make them in the future;
  • Accept that there is always a discretionary space in which people can decide whether or not to act, and that judgment of those decisions is made only in hindsight;
  • Develop countermeasures to prevent similar incidents, and be sure to record these countermeasures, their target dates, and the people responsible, so they can be tracked.

In order to gain adequate understanding, the following stakeholders need to be present at the meeting:

  • People involved in the decisions that contributed to the problem;
  • People who identified the problem;
  • People who responded to the problem;
  • People who diagnosed the problem;
  • People affected by the problem;
  • Anyone else who is interested in attending the meeting.

We must pay attention to detailed documentation and reinforce a culture in which information can be shared without fear of punishment or retaliation. Because of this, it can be helpful, especially for the first few post-mortems, to have a trained facilitator who was not involved in the incident organize and lead the meeting.

During the meeting and in the written record, the use of phrases such as "should have" or "could have" must be explicitly prohibited, because they are counterfactual statements that stem from the human tendency to invent alternative versions of events that have already happened. Counterfactual statements such as "I could have..." or "If I had known, I should have..." frame the problem in terms of an imagined world rather than the facts; we need to confine ourselves to the factual context of what actually happened.

The meeting must allow enough time for brainstorming and for deciding on countermeasures. Once countermeasures have been identified, the work must be prioritized, an owner assigned, and a timetable set for implementation. Doing so is further evidence that we value improvement of daily work even more than daily work itself.

Examples of such countermeasures include adding automated tests to the deployment pipeline that detect anomalies, adding deeper production telemetry, identifying the types of changes that require additional peer review, and holding regular game days to rehearse for this category of failure.
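
To make the first of these countermeasures concrete, the sketch below shows what a post-deployment check in the pipeline might look like: it waits briefly, reads an error-rate metric, and fails the pipeline step if the rate exceeds a threshold. The metrics URL, service name, and threshold are assumptions invented for this example, not details from the book.

```python
"""Minimal post-deployment smoke check (illustrative sketch).

Assumes a hypothetical metrics API exposing an error-rate value; the URL,
service name, and threshold are invented for this example.
"""
import json
import sys
import time
import urllib.request

METRICS_URL = "https://metrics.example.internal/api/error_rate?service=frontend"
MAX_ERROR_RATE = 0.02   # fail the deploy if more than 2% of requests error
SETTLE_SECONDS = 60     # let traffic reach the new version first


def current_error_rate() -> float:
    with urllib.request.urlopen(METRICS_URL, timeout=10) as resp:
        return float(json.load(resp)["error_rate"])


def main() -> int:
    time.sleep(SETTLE_SECONDS)
    rate = current_error_rate()
    print(f"post-deploy error rate: {rate:.2%}")
    if rate > MAX_ERROR_RATE:
        print("error rate above threshold - failing the pipeline step")
        return 1          # non-zero exit marks the deployment step as failed
    return 0


if __name__ == "__main__":
    sys.exit(main())
```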

19.3 Make the results of post-mortem meetings as widely available as possible

After a blameless post-mortem, the meeting record and all related documentation (e.g., timelines, IRC chat logs, external communications) should be made as widely available as possible; ideally they live in a centralized location that everyone across the organization can access to learn from past incidents. Post-mortems are so important that we can even treat the completion of the post-mortem meeting as the formal end of the incident-handling process. Doing so helps translate learning and improvement within one team into learning and improvement across the organization.

19.4 Lower accident tolerances and look for ever-weaker failure signals

As organizations learn how to see and solve problems effectively, they need to lower the threshold at which something counts as a failure so that learning can go deeper. To do this, we seek to amplify weak failure signals.

NASA's handling of failure signals during the Space Shuttle era illustrates this. In 2003, the Space Shuttle Columbia disintegrated on re-entry into the Earth's atmosphere on the sixteenth day of its mission. We now know that a piece of insulating foam had broken off the external fuel tank during launch. Some mid-level NASA engineers had reported the event before Columbia's return, but their views were not taken seriously. They had spotted the foam strike on video monitors during a post-launch review and immediately notified NASA's managers, but were told the foam issue was nothing new: foam strikes had damaged shuttles on previous launches without ever causing a major accident. NASA classified it as a maintenance issue and took no action until it was too late.

The researchers observe: "Companies get into trouble when they apply the wrong mindset to ambiguous threats (in this book's terms, weak failure signals)... By the 1970s, NASA had created a rigid culture of standardization, having promoted the shuttle to Congress as a reusable, inexpensive spacecraft." NASA held strictly to process compliance rather than an experimental model, evaluating each piece of information only to confirm that nothing had deviated. The consequences of this lack of continuous learning and experimentation were dire. The authors conclude that culture and mindset are crucial, and that "vigilance" alone is not enough: "Vigilance alone cannot prevent ambiguous threats (weak failure signals) from turning into costly accidents (and sometimes tragedies)," they write.

Our work in the technology value stream, like space travel, should be approached as a fundamentally experimental endeavor and managed that way. Everything we do is a potentially important hypothesis and a source of data, not a routine repetition or a validation of past practice. Rather than treating technical work as fully standardized and striving only for process compliance, we must continually look for ever-weaker failure signals so that we can better understand and manage the systems we operate.

19.5 Redefine failure and encourage calculated risk-taking

Whether intentionally or not, organizational leaders take actions that reinforce organizational culture and values. Audit, accounting, and ethics experts have long observed that the "tone at the top" predicts the likelihood of fraud and other unethical practices. To reinforce a culture of learning and calculated risk-taking, leaders need to keep emphasizing that everyone should be comfortable with failure, take ownership of it, and learn from it.

"I was talking to a colleague about a massive outage that Netflix just had - which, frankly, was caused by a simple bug. In fact, the engineer who caused this incident had brought Netflix down within the past 18 months. Machine twice. However, he is someone we will never fire. In the past 18 months, this engineer has greatly improved the state of operations and automation, and progress is not measured in 'kilometres' but in 'light years' Measured, the results are outstanding. The results of his work enable us to deploy securely every day, and he personally performs a large number of production deployments."

He concluded: "DevOps must allow for this kind of innovation and accept the risks that come with it. Yes, there will be more failures in production. But this is a good thing and should not be punished."

19.6 Inject failures into production to enable recovery and learning

As described at the beginning of this chapter, injecting faults into the production environment (for example, with Chaos Monkey) is one way to improve resilience. This section describes the process of rehearsing and injecting failures into the system to confirm that it has been designed and built correctly, so that failures happen in specific, controlled ways. We do this by running tests regularly (or even continuously) to make sure the system fails gracefully.

Resilience requires that we first define our failure modes and then test to make sure those failure modes behave as designed. One approach is to inject faults into production and rehearse large-scale failures; this gives us confidence that the system can recover on its own when an incident occurs, ideally without even affecting customers.
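
A minimal sketch of the idea behind Chaos Monkey, purely for illustration: pick a random instance from the fleet and terminate it, but only during business hours so engineers are present to observe and respond. The instance list and the terminate() call are placeholders; Netflix's real tool operates against cloud APIs and enforces schedules and blast-radius limits.

```python
"""Toy fault injector in the spirit of Chaos Monkey (sketch, not Netflix's tool).

The instance inventory and terminate() implementation are placeholders; a real
injector would call the cloud provider's API and honour opt-outs and limits.
"""
import datetime
import logging
import random

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("chaos")

INSTANCES = ["web-01", "web-02", "web-03", "worker-01"]  # hypothetical fleet


def business_hours(now: datetime.datetime) -> bool:
    """Only inject failures when engineers are around to observe and learn."""
    return now.weekday() < 5 and 9 <= now.hour < 17


def terminate(instance: str) -> None:
    """Placeholder: in reality this would call the cloud provider's API."""
    log.info("terminating %s", instance)


def run_once() -> None:
    now = datetime.datetime.now()
    if not business_hours(now):
        log.info("outside business hours - skipping injection")
        return
    victim = random.choice(INSTANCES)
    terminate(victim)
    log.info("failure injected; verify that the service degraded gracefully")


if __name__ == "__main__":
    run_once()
```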

A proactive focus on recoverability means companies can handle events that would trigger a crisis in most organizations in a routine, mundane way.

The specific architectural patterns they implemented include fail fast (setting aggressive timeouts so that a failed component cannot bring down the whole system), fallback (designing each feature to degrade or fall back to a lower-quality representation), and feature removal (removing non-critical features from any page that is running slow so that the user experience is not affected). Beyond maintaining business continuity during the AWS outage, the Netflix team produced an astonishing example of resilience: they declared a Level 1 (Sev 1) incident only six hours after the AWS outage began, assuming that AWS service would eventually come back ("AWS will come back... it usually does, right?"), and only then did they start their business-continuity procedures.
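
A small sketch can make the three patterns concrete: an aggressive timeout (fail fast), a precomputed lower-quality response when the call fails (fallback), and a page that simply ships without the slow, non-critical section (feature removal). The function names and timeout values are invented for illustration; this is not Netflix's code.

```python
"""Illustrative sketch of fail-fast, fallback, and feature removal.

get_recommendations() and its timeout are hypothetical; the point is the
shape of the patterns, not a specific API.
"""
from concurrent.futures import ThreadPoolExecutor, TimeoutError

FALLBACK_RECOMMENDATIONS = ["popular-title-1", "popular-title-2"]  # precomputed, lower quality
_executor = ThreadPoolExecutor(max_workers=4)


def get_recommendations(user_id: str) -> list[str]:
    """Stand-in for a remote call that may be slow or failing."""
    raise RuntimeError("recommendation service unavailable")


def recommendations_with_fallback(user_id: str, timeout_s: float = 0.2) -> list[str]:
    future = _executor.submit(get_recommendations, user_id)
    try:
        # Fail fast: don't let a slow dependency hold the whole page hostage.
        return future.result(timeout=timeout_s)
    except (TimeoutError, RuntimeError):
        # Fallback: degrade to a precomputed, lower-quality result.
        return FALLBACK_RECOMMENDATIONS


def render_page(user_id: str) -> dict:
    page = {"title": "Home"}
    recs = recommendations_with_fallback(user_id)
    if recs:
        page["recommendations"] = recs
    # Feature removal: if even the fallback were empty, the page would simply
    # ship without the recommendations section rather than erroring.
    return page


if __name__ == "__main__":
    print(render_page("user-42"))
```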

19.7 Create a failure drill day

This section describes a type of disaster-recovery rehearsal called a game day. The concept of the game day comes from the discipline of resilience engineering, which is defined as "an exercise designed to improve resilience by injecting large-scale failures into critical systems."

The goal of a game day is to help teams simulate and rehearse incidents so that they gain the ability to respond in practice. First, a catastrophic event is planned, such as the simulated destruction of an entire data center at some point in the future. Teams are then given time to prepare: eliminating single points of failure and creating the necessary monitoring procedures, failover procedures, and so on.

By doing this, we begin to expose latent defects in the system; it is the injection of faults that brings these problems to the surface. "You may find that certain monitoring or management systems that are critical to the recovery process end up being turned off as part of the orchestrated failure," Robbins explains. "[Or] you may find some unknown single points of failure." The exercises are then repeated in increasingly deep and complex ways, with the goal of making them feel like a routine part of everyday work.

Things learned during these disasters include:

  • When network connectivity was lost, engineers' workstations could not be used for failover;
  • Engineers did not know how to access the conference call bridge, or the bridge could only hold 50 people, or a new conference-call provider was needed so that the engineer who had put the whole call on hold could be kicked off;
  • When the data center's backup generators ran out of diesel, no one knew the procedure for an emergency purchase through the supplier, so someone ended up buying $50,000 of diesel on a personal credit card.

By introducing failures in a controlled way, we can practice and create the runbooks we need. Another output of game days is that people actually know who to call and who to talk to; this builds relationships with people in other departments so they can work together during an incident, turning deliberate actions into automatic ones and, eventually, into routine.

19.8 Summary

To create a just culture that enables organizational learning, we must reframe what we call failure. Handled correctly, the errors inherent in complex systems can create a dynamic learning environment in which all stakeholders feel safe enough to offer ideas and observations, and teams recover more readily from their inevitable failures and keep work on track.

Blameless post-mortems and injecting failures into production reinforce a culture in which everyone is comfortable with failure, takes responsibility for it, and learns from it. Indeed, as the number of incidents drops significantly, our tolerance for them also drops, which lets us keep learning. As Peter Senge put it: "The only sustainable competitive advantage is an organization's ability to learn faster than the competition."

20. Transform local experience into global improvement

This chapter establishes mechanisms for sharing newly acquired local experience and optimizations across the entire organization, greatly multiplying the organization's global knowledge and the effect of its improvements. This raises the state of the practice across the whole organization, so that everyone at work benefits from its accumulated experience.

20.1 Automated accumulation of organizational knowledge using chat rooms and chatbots

Many organizations have set up chat rooms to facilitate fast internal communication. But chat rooms can also be used to trigger automated operations.

GitHub built a chat bot called Hubot that interacts with the operations team in their chat rooms. It can be instructed to perform operational tasks by sending it commands (for example, "@hubot deploy owl to production"), and it posts the results back into the chat room. Similarly, when code is committed to the source repository or a command triggers a production deployment, a message is posted to the chat room.
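
Hubot itself is written in CoffeeScript, but the ChatOps pattern is easy to sketch in any language: parse a command posted in the chat room, run the corresponding automation, and post the result back where everyone can see it. The toy dispatcher below is a hypothetical stand-in, not GitHub's implementation; the deploy and status handlers are stubs.

```python
"""Minimal ChatOps-style command dispatcher (hypothetical sketch, not Hubot).

The deploy/status functions are stubs; a real bot would call the actual
deployment tooling and chat platform APIs.
"""
import re
from typing import Callable, Optional

HANDLERS: list = []  # (compiled pattern, handler) pairs


def command(pattern: str):
    """Register a chat command, e.g. '@bot deploy owl to production'."""
    def register(fn: Callable[..., str]) -> Callable[..., str]:
        HANDLERS.append((re.compile(pattern), fn))
        return fn
    return register


@command(r"@bot deploy (?P<app>\S+) to (?P<env>\S+)")
def deploy(app: str, env: str) -> str:
    # Stub: call the real deployment pipeline here.
    return f"deploying {app} to {env}... done"


@command(r"@bot status (?P<app>\S+)")
def status(app: str) -> str:
    # Stub: query real telemetry here.
    return f"{app}: healthy"


def on_message(text: str) -> Optional[str]:
    """Dispatch a chat message; the reply goes back into the room for all to see."""
    for pattern, handler in HANDLERS:
        match = pattern.match(text)
        if match:
            return handler(**match.groupdict())
    return None


if __name__ == "__main__":
    print(on_message("@bot deploy owl to production"))
    print(on_message("@bot status owl"))
```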

Automating actions in a chat room has many advantages (compared to running automation scripts from the command line):

  • Everyone can see everything that happens;
  • New engineers can also see the team’s day-to-day work and how it performs;
  • People are also more likely to ask for help when they see others helping each other;
  • Organizational learning is established and knowledge accumulates rapidly.

Furthermore, in addition to the widely proven benefits above, chat rooms inherently record and expose all communications. In contrast, email communications are private by default, and the information within them cannot be easily disseminated within an organization.

Actions performed through Hubot included checking the health of a service, doing Puppet pushes or deploying code to production, and muting monitoring alerts when a service entered maintenance mode. Repetitive operations were also handled through Hubot, such as pulling up smoke-test logs when a deployment failed, taking production servers out of the service pool, rolling back the production front-end service to the previous version, and even apologizing to the engineers on call.

GitHub creates a collaborative environment that transforms knowledge learned locally into experiences across the entire organization. We then explore how to create and accelerate the spread of organizational learning.

20.2 Automate standardized processes in software for reuse

We often write down our standards and processes for architecture, testing, deployment, and infrastructure management, store them as Word documents, and upload them to a server somewhere. The problem is that engineers building new applications or environments often don't know these documents exist, or don't have the time to implement the standards they contain. So they create their own tools and processes, with all the disappointing outcomes we would expect: applications and environments that are brittle, insecure, unmaintainable, and expensive to run, maintain, and update.

Rather than writing this expertise into Word documents, we can transform the standards and processes that embody the organization's learning and knowledge into an executable form, making them easier to reuse. A good way to achieve this reuse is to keep them in a centralized source code repository, making them a searchable, usable tool for everyone.

Arbuckle's team created a mechanism called ArchOps that "lets our engineers be builders, not bricklayers. By putting our design standards into automated blueprints that anyone can use easily, we achieved consistency as a by-product."

By converting manual processes into code that can be automated and executed, the process becomes widely adopted and provides value to anyone who uses it. Arbuckle concluded: "An organization's actual compliance is in direct proportion to the degree to which its policies are expressed as code."
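
As an illustration of "policy expressed as code", the sketch below turns a hypothetical organizational standard (every service config must name an owner, enable TLS, and define a backup schedule) into an executable check that can run in the deployment pipeline instead of living in a Word document. The configuration format and the rules are assumptions for this example.

```python
"""Policy-as-code sketch: an organizational standard expressed as an executable check.

The configuration format and the specific rules are invented for illustration;
the point is that the standard lives in the pipeline, not in a document.
"""
import json
import pathlib
import sys

REQUIRED_KEYS = {"owner", "tls", "backup_schedule"}   # hypothetical org standard


def violations(config: dict) -> list:
    problems = []
    missing = REQUIRED_KEYS - config.keys()
    if missing:
        problems.append(f"missing required keys: {sorted(missing)}")
    if config.get("tls") is not True:
        problems.append("TLS must be enabled for every service")
    return problems


def main(config_dir: str = "services") -> int:
    failed = False
    for path in pathlib.Path(config_dir).glob("*.json"):
        for problem in violations(json.loads(path.read_text())):
            print(f"{path.name}: {problem}")
            failed = True
    return 1 if failed else 0


if __name__ == "__main__":
    sys.exit(main())
```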

Making the automated process the easiest way to achieve the goal leads to its widespread adoption; we might even consider turning it into a shared service supported by the organization.

20.3 Create a single source code repository shared across the organization

Establishing a shared source code repository across the whole enterprise is a powerful mechanism for integrating all of the organization's local discoveries. When anything in the repository is updated (say, a shared library), it automatically and quickly propagates to every service that uses it and is integrated through each team's deployment pipeline.

We store not only the source code in the shared source code repository, but also artifacts containing other learning experiences and knowledge:

  • Configuration standards for libraries, infrastructure, and environments (Chef recipes, Puppet classes, etc.);
  • Deployment tools;
  • Testing standards and tools, including security;
  • Deployment pipeline tools;
  • Monitoring and analysis tools;
  • Tutorials and standards.

"The most powerful mechanism to prevent Google from breaking down is a single code base. Every time someone commits an update to the code base, a new build is triggered, and all the dependencies it uses are up to date. Everything is generated from source code built, rather than dynamically linked at runtime - only a single version of the library is always available, the one currently being used, and it is statically linked during the build process."

If we cannot build a single source tree (a single repository), we must find another way to manage libraries so that every dependency on a library is a known, vetted version. For example, we might set up an organization-wide artifact repository such as Nexus or Artifactory, or a Debian or RPM repository, and then update both those repositories and production systems whenever known security vulnerabilities are found.
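
Even without a monorepo, the "single version in use" property can at least be checked automatically. The sketch below scans per-service pip-style requirements files and reports any library pinned at more than one version; the directory layout and file format are assumptions for illustration.

```python
"""Sketch: detect libraries pinned at more than one version across services.

Assumes each service keeps a pip-style requirements.txt under services/<name>/;
the layout and format are assumptions for this example.
"""
import collections
import pathlib


def collect_pins(root: str = "services") -> dict:
    """Map each library name to the set of versions pinned across all services."""
    versions = collections.defaultdict(set)
    for req in pathlib.Path(root).glob("*/requirements.txt"):
        for line in req.read_text().splitlines():
            line = line.strip()
            if "==" in line and not line.startswith("#"):
                name, version = line.split("==", 1)
                versions[name.lower()].add(version)
    return versions


def report(versions: dict) -> int:
    drift = {name: vs for name, vs in versions.items() if len(vs) > 1}
    for name, vs in sorted(drift.items()):
        print(f"{name} is used at multiple versions: {sorted(vs)}")
    return 1 if drift else 0


if __name__ == "__main__":
    raise SystemExit(report(collect_pins()))
```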

20.4 Spread knowledge using automated tests as documentation and communities of practice

When shared libraries are used throughout an organization, we should be able to spread expertise and improvements quickly. Ensuring that these libraries have plenty of automated tests means they are automatically documented: the tests show other engineers how to use them.

If we adopt the practice of test-driven development (TDD), writing the automated tests before writing the code, we get this benefit almost for free: the test suite becomes a living, up-to-date specification of the system. Any engineer who wants to know how to use a system can look at its test suite for working examples of how to call its API.
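
For example, a new engineer wondering how a shared retry helper behaves could read its tests rather than a wiki page. The helper and its tests below are hypothetical, but written test-first they double as an always-current specification of the API.

```python
"""Tests as living documentation (sketch).

`retry` is a hypothetical shared-library helper; the tests below both verify
it and show new users exactly how it is meant to be called.
"""
import pytest  # assumes pytest is the organization's test runner


def retry(fn, attempts: int = 3):
    """Call fn until it succeeds or the attempts are exhausted."""
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == attempts:
                raise


def test_retry_returns_first_successful_result():
    calls = []

    def flaky():
        calls.append(1)
        if len(calls) < 3:
            raise ValueError("transient failure")
        return "ok"

    assert retry(flaky, attempts=5) == "ok"
    assert len(calls) == 3           # documents that retry stops once it succeeds


def test_retry_reraises_after_exhausting_attempts():
    def always_fails():
        raise ValueError("permanent failure")

    with pytest.raises(ValueError):  # documents the failure contract
        retry(always_fails, attempts=2)
```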

Ideally, each library has an owner or support team who has relevant knowledge and expertise. Additionally, we should (preferably) only allow one version to be used in production, ensuring that everything produced uses the best collective knowledge of the organization.

In this model, the owner of the library also needs to help each team that relies on it safely migrate from one version to the next. This in turn requires comprehensive automated testing and continuous integration of all systems that rely on it to quickly uncover regression errors.

To spread knowledge faster, discussion groups or chat rooms can also be established for each library or service. Anyone with questions can get feedback from other users here, often faster than the developers.

Rather than leaving expertise scattered across the organization, these communication tools facilitate the exchange of knowledge and experience, ensuring that employees help each other with problems and new patterns.

20.5 Design for operations by codifying non-functional requirements

Implementing the right non-functional requirements makes our production services easier to deploy and keep running, lets problems be detected and fixed quickly, and ensures graceful degradation when service components fail. Examples of non-functional requirements that should be in place include:

  • Adequate telemetry across all applications and environments;
  • The ability to accurately track dependencies;
  • Services that are resilient and degrade gracefully;
  • Forward and backward compatibility between versions;
  • The ability to archive data to keep production data sets manageable;
  • The ability to easily search and understand log messages across services;
  • The ability to trace user requests through multiple services;
  • Simple, centralized runtime configuration using feature flags or similar mechanisms (a minimal sketch follows this list).
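
As a sketch of that last item, a feature flag is just a centrally managed switch consulted at runtime, with a safe default when the flag is unknown. The in-memory flag store below stands in for whatever central configuration service an organization actually uses.

```python
"""Feature-flag sketch for centralized runtime configuration.

The flag store here is an in-memory dict standing in for a real central
configuration service.
"""
_FLAG_STORE = {                        # stand-in for a central config service
    "new_checkout_flow": False,
    "recommendations_panel": True,
}


def flag_enabled(name: str, default: bool = False) -> bool:
    """Look up a flag, falling back to a safe default if it is unknown."""
    return _FLAG_STORE.get(name, default)


def render_checkout(cart: list) -> str:
    if flag_enabled("new_checkout_flow"):
        return f"new checkout for {len(cart)} items"
    # The old path stays available, so the new code can be switched off
    # at runtime without a deployment if it misbehaves.
    return f"classic checkout for {len(cart)} items"


if __name__ == "__main__":
    print(render_checkout(["book", "mug"]))
    _FLAG_STORE["new_checkout_flow"] = True   # central change, no redeploy
    print(render_checkout(["book", "mug"]))
```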

By identifying these non-functional requirements, collective knowledge and experience can be more easily applied to new and existing services. These responsibilities lie with the team building the service.

20.6 Incorporate reusable operational user stories into development

If some operations work cannot be fully automated or made self-service, our goal is to make this recurring work as repeatable and deterministic as possible. We do this by standardizing the work required, automating as much of it as possible, and documenting what we do, so that product teams can best plan for and contribute to this activity.

Instead of manually setting up a server, checking it against a checklist item by item, and putting it into production, we should automate as much of this work as possible. If certain steps cannot be automated (for example, racking servers and having the network team cable them), the handoffs should be defined as clearly as possible to reduce lead time and errors. This also makes it easier to plan and schedule those steps in the future.

Just as we create user stories during development, put them in the backlog, and then implement them in code, we can create well-defined "operations user stories" that describe work that recurs across all projects (e.g., deployment, capacity planning, security hardening) and can be reused. By creating these clearly defined operations user stories, we surface repeatable IT operations work alongside the related development work, producing better plans and more repeatable outcomes.

User story: a complete narrative description of the specific functions, paths, and services that the development team designs to satisfy a particular user's needs, behaviors, and tasks in the application. Broadly, user stories are a user-oriented, layered analysis of the software, consisting of core user stories, derived stories, and product attributes. A core user story describes the system the user is using, its main functionality, and its usability; derived stories are parts of a core story, and implementing them improves the quality of the underlying story; product attributes define categories of complexity. These are drawn from both user and manager suggestions, which are combined and expressed in story form so that the required software behavior is specified more concretely.

20.7 Ensure technology selection helps achieve organizational goals

When we use service-oriented architecture and one of the goals is to maximize developer productivity, small service teams can build and run services using the language or framework that best suits their specific needs. In some cases, this is the best way for us to achieve organizational goals.

If there is not already a list of technologies jointly agreed by development and operations and supported by the organization, we should systematically survey the production infrastructure and services, along with the technologies they currently depend on, to find those that cause unnecessary failures and unplanned work. We want to identify technologies that:

  • Block or slow down workflow;
  • Cause a disproportionately large amount of unplanned work;
  • Generate a disproportionately large number of support requests;
  • Are inconsistent with our desired architectural outcomes, such as throughput, stability, security, reliability, and business continuity.

By excluding these problematic infrastructures and platforms from the technologies supported by the operations team, the operations team can focus on the infrastructure that best helps achieve the organization's overall goals.

20.8 Summary

The techniques described in this chapter can incorporate bits and pieces of new learning into an organization's collective knowledge and multiply its impact. We achieve this goal by proactively and widely disseminating new knowledge, using methods such as chat rooms, architecture as code, shared source code libraries, and technical standardization. Doing so not only improves the development and operations teams, but also the entire organization, so everyone in the organization contributes to the collective experience.

21. Set aside time for organizational learning and improvement

There is a practice called the improvement blitz (or kaizen blitz), an important part of the Toyota Production System, in which a dedicated, concentrated period, usually a few days, is devoted to solving a specific problem. Dr. Steven Spear explains: "Kaizen blitzes usually take the form of a small group gathering to focus on a process with problems... The blitz typically lasts a few days; the goal is process improvement, and the means is the concentrated use of people from outside the process to advise those normally inside it."

The chapter then describes methods for setting aside time for organizational learning and improvement and for further institutionalizing the practice of investing time in improving day-to-day work.

21.1 Institutionalized practices for repaying technical debt

This section describes common rituals that help reinforce the practice of reserving development and operations time for improvement work, such as non-functional requirements and automation. One of the easiest ways to do this is to schedule and run improvement blitzes over a few days or weeks, during which everyone on the team (or in the entire organization) self-organizes to fix problems they care about, with no feature work allowed. The focus can be a problem spot in the code, the environment, the architecture, the tooling, and so on. These teams often span the whole value stream, bringing together development, operations, and information security engineers; teams that don't normally work together combine their skills and effort to improve a chosen area and then demonstrate the results to the rest of the company.

Our goal during these blitzes is not to experiment and innovate simply for the sake of trying new technology, but to improve our daily work, for instance by rooting out the workarounds in it. While experimentation also produces improvements, the focus of an improvement blitz is solving specific problems encountered in daily work.

21.2 Let everyone learn from each other through teaching

Whether through traditional didactic methods (e.g., classes and training) or more experiential and open-ended ones (e.g., conferences, workshops, mentoring), a dynamic learning culture creates not only the conditions for everyone to learn, but also opportunities to teach. We can dedicate organizational time to facilitating this teaching and learning.

It is clear throughout this book that certain skills are increasingly needed by all engineers, not just developers. For example, it is increasingly important for all operations and test engineers to be familiar with development techniques, practices, and skills such as version control, automated testing, deployment pipelines, configuration management, and automation. Familiarity with development techniques helps operations engineers stay relevant as more and more technology value streams adopt DevOps principles and patterns.

Development and operations teams can further teach each other skills in their daily work by doing code reviews together (learning by doing) and by solving small problems together. For example, developers can show operations staff how the application authenticates users, then log into the application together and write automated tests for individual components to confirm that critical components work properly (e.g., the application's core functionality, database transactions, message queues). The new automated tests are then integrated into the deployment pipeline and run regularly, with results sent to the monitoring and alerting systems so that failures of critical components are detected early.
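
One way to picture this is an automated smoke test that exercises the critical components (say, a database round-trip and a message-queue canary) and pushes a pass/fail result to the monitoring system. Every name below (the check functions and the metric push) is a placeholder, not a specific tool from the book.

```python
"""Sketch: smoke tests for critical components whose results feed monitoring.

The component checks and the metric push are placeholders; a real version
would hit the actual database, queue, and telemetry system.
"""
import time
from typing import Callable


def check_database() -> None:
    """Placeholder: run a trivial transaction against the real database."""


def check_message_queue() -> None:
    """Placeholder: publish and consume a canary message on the real queue."""


CHECKS = {
    "database": check_database,
    "message_queue": check_message_queue,
}


def push_metric(name: str, ok: bool, duration_s: float) -> None:
    """Placeholder: send the result to the monitoring/alerting system."""
    print(f"smoke_check name={name} ok={ok} duration={duration_s:.3f}s")


def run_smoke_tests() -> bool:
    all_ok = True
    for name, check in CHECKS.items():
        start = time.monotonic()
        try:
            check()
            ok = True
        except Exception:
            ok = False
            all_ok = False
        push_metric(name, ok, time.monotonic() - start)
    return all_ok


if __name__ == "__main__":
    raise SystemExit(0 if run_smoke_tests() else 1)
```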

21.3 Share experiences from DevOps conferences

In many cost-conscious organizations, engineers are often discouraged from attending conferences and learning from their peers. To build a learning organization, we should encourage engineers (from both development and operations) to attend conferences and, where appropriate, to create and organize internal or external conferences themselves.

21.4 Internal consultants and coaches to spread practices

Forming an in-house coaching and consulting organization is a common way to spread expertise within an organization, and it can take many forms. At Capital One, all subject-matter experts hold designated office hours during which anyone can come to consult them and ask questions.

Earlier in this book, we told how Google's Testing Grouplet began to build a world-class automated testing culture within Google in 2005. Their story continues to evolve—to comprehensively improve automated testing practices within Google, they used kaizen blitzes, internal coaching, and even an internal certification program.

21.5 Summary

This chapter describes how to establish a set of rituals that help reinforce lifelong learning and a culture that values continual improvement of daily work. This can be accomplished by setting aside time to pay down technical debt; by creating forums where people can learn from and mentor each other, inside and outside the organization; and by having experts help internal teams through coaching, consulting, or simply scheduled face-to-face time.

By having everyone learn from each other in their daily work, we learn more than the competition and win in the market. At the same time, we are helping each other unleash our full human potential.

21.6 Summary of Part 5

In Part Five, we explore practices for creating a culture of learning and experimentation in organizations. When we work in complex systems, learning from incidents, creating shared code bases, and shared knowledge are essential, helping work cultures to be more just and systems to be safer and more resilient.

In Part 6, we explore how to scale flow, feedback, learning, and experimentation to simultaneously help us achieve our information security goals.
