Read "DevOps Practice Guide" Note 1

Part 1 Introduction to DevOps

Chapter 1 Agile, Continuous Delivery, and the Three-Step Method 4

1.1 Manufacturing Value Stream 4

1.2 Technology Value Stream 4

We typically define technology value streams as "the processes required to transform business ideas into technology-enabled services that deliver value to customers."

1.2.1 Focus on deployment lead time 5

The value stream begins when an "engineer" (including development, QA, IT operations, and information security personnel) submits a change to the version control system and ends with the change successfully running in production, providing value to customers, and generating effective Feedback and monitoring information.

We do not advocate completing large amounts of design and development work serially and only then handing it off to testing and operations (for example, using a large-scale, waterfall-style development process or working on long-lived feature branches). On the contrary, our goal is a pattern in which testing and operations happen in step with design and development, which yields a faster value stream and higher quality.

Lead time and processing time (sometimes called touch time or task time) are two commonly used metrics for measuring value stream performance.

(Figure 1-1: lead time versus processing time)

Lead time starts when the request is created and ends when the work is completed; processing time starts only when the work is actually being worked on, and it excludes the time the work spends waiting in a queue (see Figure 1-1).

Because lead time is what the customer experiences, we focus on reducing lead time rather than processing time.
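As a rough illustration (my own, not from the book), both metrics can be derived from simple timestamps recorded on a work item; the timestamps below are made up.

```python
from datetime import datetime

# Hypothetical timestamps for a single work item.
ticket_created = datetime(2021, 9, 1, 9, 0)    # request enters the queue
work_started   = datetime(2021, 9, 3, 14, 0)   # work actually begins
work_finished  = datetime(2021, 9, 3, 16, 30)  # change running in production

lead_time = work_finished - ticket_created      # what the customer experiences
processing_time = work_finished - work_started  # excludes time spent waiting in the queue

print(f"lead time:       {lead_time}")          # 2 days, 7:30:00
print(f"processing time: {processing_time}")    # 2:30:00
```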

In the DevOps ideal, developers get fast, continuous feedback on their work, so they can quickly and independently develop, integrate, and validate code and deploy it to production.

This goal can be achieved by continuously submitting small batches of code changes to the version control system, and performing automated and exploratory tests on the code before deploying it to the production environment.

To make these goals easier to achieve, the architecture also needs to be optimized for modularity, high cohesion, and low coupling.

1.2.2 Pay attention to the rework metric: %C/A 7

In addition to lead time and processing time, the third key metric in the technology value stream is percent complete and accurate (%C/A). This metric reflects the quality of the output of each step in the value stream.
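%C/A is obtained by asking each downstream work center what percentage of the time it receives work that it can use as is, i.e. work that needs no correction, no missing information, and no clarification. A toy calculation with made-up numbers:

```python
# Hypothetical figures reported by one downstream work center over a month.
items_received = 40
items_usable_as_is = 34   # needed no correction, clarification, or extra information

percent_complete_accurate = items_usable_as_is / items_received * 100
print(f"%C/A = {percent_complete_accurate:.0f}%")   # %C/A = 85%
```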

1.3 Three-step work method: the basic principles of DevOps 7

The first step is to enable the fast flow of work from left to right, from development to operations.
The second step is to apply continuous, fast feedback at every stage, flowing from right to left.
The third step is to build a creative, high-trust corporate culture that supports dynamic, rigorous, and scientific experimentation.

1.4 Summary 8

Chapter 2 Step 1: The Flow Principle 9

2.1 Making work visible 9

To identify where work is flowing, queued, or stalled, work needs to be made as visible as possible. Visual work boards are a good way to do this, for example using paper or electronic cards to represent each piece of work on a kanban board or sprint planning board. Work usually enters on the left (pulled from the backlog), is pulled from one work center to the next, and finally reaches the far right of the board, a column usually labeled "Done" or "In production". In this way, work is not only visible, it can also be managed effectively, and its flow from left to right can be accelerated.

(Figure 2-1: a kanban board spanning the entire value stream)

Ideally, the kanban board should span the entire value stream, and work is considered complete only when it reaches the far-right column (see Figure 2-1). A feature is not "done" when development finishes; it is "done" only when the application runs successfully in production and begins delivering value to customers.

By putting all the work of each work center into the queue and displaying it visually, it is easier for stakeholders to determine the priority of each work based on the overall goal.

2.2 Limit work in process (WIP) 10

When an engineer is assigned to multiple projects at the same time, they have to switch back and forth between tasks, each with its own cognitive rules and goals, and pay the cost of re-immersing themselves in each context. Multitasking makes processing times longer.

When using a kanban board to manage work, multitasking can be limited by setting a WIP limit for each column or work center and marking the maximum number of cards allowed at the top of each column.

For example, suppose the WIP limit for the testing column is set to 3. When there are already 3 cards in the testing queue, no new card can be pulled in until one of them is completed or returned to the previous queue.
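A minimal sketch (not from the book; the class and card names are invented) of how a column with a WIP limit of 3 behaves: once the limit is reached, new work is refused until a card leaves the column.

```python
class KanbanColumn:
    """A kanban column that refuses new cards once its WIP limit is reached."""

    def __init__(self, name, wip_limit):
        self.name = name
        self.wip_limit = wip_limit
        self.cards = []

    def pull(self, card):
        if len(self.cards) >= self.wip_limit:
            raise RuntimeError(
                f"WIP limit {self.wip_limit} reached in '{self.name}': "
                "finish or return a card before pulling new work")
        self.cards.append(card)

    def complete(self, card):
        self.cards.remove(card)


test_column = KanbanColumn("Test", wip_limit=3)
for card in ["feature-101", "feature-102", "bugfix-7"]:
    test_column.pull(card)

try:
    test_column.pull("feature-103")
except RuntimeError as exc:
    print(exc)   # the fourth card is rejected until one of the three is done
```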

Controlling queue length (i.e., the amount of WIP) is a very powerful management tool, because queue length is one of the most important factors affecting lead time: for most work items, we cannot predict exactly how long they will take until they are actually completed.

2.3 Reducing the batch size 11

Large batches cause WIP to skyrocket, which in turn leads to long lead times and poor quality: if a single body panel is found to be defective, the entire batch may have to be scrapped.

To shorten lead times and improve the quality of deliverables, we should keep moving toward smaller batches. In theory, the smallest batch size is single-piece flow, in which each operation processes only one unit at a time.

The difference between small and large batches is illustrated by the classic simulation of mailing newsletters. Suppose 10 brochures need to be mailed, and before mailing each brochure must go through four steps: fold the paper, insert it into an envelope, seal the envelope, and stamp it.

With the large-batch strategy (i.e., "mass production"), each step is performed on all the brochures before moving on to the next step: first fold all 10 sheets, then insert each of them into an envelope, then seal all 10 envelopes, and finally stamp all of them.

The other approach is the small-batch strategy (i.e., "single-piece flow"), in which all the required steps are completed on one brochure before the next one is started: fold one sheet, insert it into an envelope, seal the envelope, and stamp it; then pick up the next sheet and repeat.

(Figure 2-2: simulation of mailing brochures, large batches versus single-piece flow)

The difference between the two strategies is dramatic (see Figure 2-2). Suppose each of the four steps takes 10 seconds per brochure. With the large-batch strategy, the first stamped envelope is not completed until 310 seconds have elapsed; with single-piece flow, the first finished envelope is ready after only 40 seconds.

Worse, suppose we discover during the sealing step that the folding in the first step was done incorrectly. With the large batch, the earliest we can discover the error is after 200 seconds, and we then have to refold and re-stuff all 10 brochures in the batch.
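The arithmetic above is easy to verify with a short calculation (illustrative only): four steps of 10 seconds each, ten brochures.

```python
STEP_SECONDS = 10   # each of the four steps takes 10 seconds per brochure
STEPS = 4
BROCHURES = 10

# Large batch: each step is performed on all 10 brochures before the next step starts,
# so the first finished envelope appears only after three full passes plus one stamping.
first_done_large_batch = (STEPS - 1) * BROCHURES * STEP_SECONDS + STEP_SECONDS   # 310 s

# Single-piece flow: all four steps are completed on one brochure before the next begins.
first_done_single_piece = STEPS * STEP_SECONDS                                   # 40 s

# A folding error noticed while sealing: with the large batch it cannot surface before
# all 10 brochures have been folded and inserted.
error_detected_large_batch = 2 * BROCHURES * STEP_SECONDS                        # 200 s

print(first_done_large_batch, first_done_single_piece, error_detected_large_batch)
```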

In the technology value stream, large batches have the same side effects as in manufacturing. A big-bang release creates a sudden surge of work in process that floods every downstream work center, and flow and quality both suffer. In other words, the larger the change made to the production environment, the harder it is to locate and fix problems, and the longer the fix takes.

2.4 Reduce the number of handoffs 13

As code moves through the value stream, many teams have to cooperate to complete the related work, including functional testing, integration testing, environment creation, server configuration, storage administration, networking, load-balancer setup, and information security hardening.

When a job is handed off between teams, it requires a lot of communication - requesting, delegating, notifying, coordinating, and often prioritizing, scheduling, deconflicting, testing, and validating.

Each of these steps has its own queue, and whenever work depends on resources shared by different value streams, it has to wait. The lead times for such requests are often so long that work that should have finished on schedule is repeatedly delayed. Even in the best case, some information or knowledge is inevitably lost with each handoff.

To reduce these types of problems, either strive to reduce the number of handoffs, automate most operations, or restructure the organization so that teams can independently deliver value to customers without relying on others.

2.5 Continuously identify and improve constraint points 14

During a DevOps transformation, shortening lead times from months or quarters down to minutes generally requires addressing the following constraints in turn.
 Environment creation: the solution is fully self-service, on-demand environments that can be created automatically (a sketch follows this list).
 Code deployment: the solution is to automate the deployment process as much as possible, so that any developer can deploy on demand, safely.
 Test setup and execution: the solution is to automate testing and run tests in parallel, so that testing can keep pace with code development.
 Tightly coupled architecture: the solution is to move to a loosely coupled architecture, so that developers can make changes safely and autonomously, increasing productivity.
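As a hedged sketch of what "self-service, automated environment creation" can look like: a single function that any developer can call, wrapping a hypothetical infrastructure-as-code command. The tool name ('iac-tool'), template path, and naming convention are all placeholders, not something the book prescribes.

```python
import subprocess
import uuid


def create_environment(team: str, template_dir: str = "envs/standard") -> str:
    """Create a production-like environment on demand, with no ticket to operations.

    Shells out to a placeholder infrastructure-as-code CLI ('iac-tool'); in practice
    this could be Terraform, Pulumi, or an internal provisioning service.
    """
    env_name = f"{team}-{uuid.uuid4().hex[:8]}"
    subprocess.run(
        ["iac-tool", "apply", "--template", template_dir, "--name", env_name],
        check=True,  # fail loudly if provisioning does not succeed
    )
    return env_name


if __name__ == "__main__":
    print(create_environment("payments"))
```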

Our goal is to enable small development teams to develop, test, and deploy independently, quickly, and reliably, and to keep delivering value to customers.

2.6 Eliminate hardship and waste in the value stream 15

2.7 Summary 16

Improving flow through the technology value stream is essential for achieving DevOps outcomes. We do this by making work visible, limiting work in process, reducing batch sizes and the number of handoffs, continually identifying and improving constraints, and eliminating hardship and waste in daily work.

Chapter 3 Step 2: Feedback Principles 17

3.1 Working safely in complex systems 17

The principles behind the first way enable work to flow rapidly from left to right through the value stream. The principles behind the second way enable fast, continuous feedback from right to left at every stage of the value stream, so that problems are found and corrected while they are still small and cheap to fix.

3.2 Timely detection of problems 18

The goal is to create fast feedback and feedforward loops at every stage of the technology value stream and in everything we do, including product management, development, QA, information security, and operations. This includes creating automated build, integration, and test processes so that code changes likely to cause defects are detected as early as possible.
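A minimal sketch (mine, not the book's) of such a fail-fast feedback gate: every change is built and tested in order, and the first failing stage stops the pipeline and reports back immediately. It assumes a Python project laid out with src/ and tests/ directories and tested with pytest.

```python
import subprocess
import sys

# Ordered, fail-fast stages run against every code change.
STAGES = [
    ("build",             ["python", "-m", "compileall", "-q", "src"]),
    ("unit tests",        ["python", "-m", "pytest", "-q", "tests/unit"]),
    ("integration tests", ["python", "-m", "pytest", "-q", "tests/integration"]),
]

for name, command in STAGES:
    result = subprocess.run(command)
    if result.returncode != 0:
        print(f"FAILED at stage: {name}")   # fast feedback to whoever committed the change
        sys.exit(result.returncode)

print("all stages passed; the change is a candidate for deployment")
```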

It is also necessary to build comprehensive telemetry that monitors the state of services and components running in production, so that service anomalies are detected quickly.
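A toy version of such production telemetry (the service names and /health endpoints are invented): poll each component's health endpoint and flag anything that does not answer.

```python
import urllib.error
import urllib.request

# Hypothetical health endpoints for service components running in production.
SERVICES = {
    "orders":   "https://orders.internal.example.com/health",
    "payments": "https://payments.internal.example.com/health",
}


def is_healthy(url: str) -> bool:
    """Return True if the service answers its health check with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=2) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        return False


for name, url in SERVICES.items():
    print(f"{name}: {'OK' if is_healthy(url) else 'ALERT: unhealthy'}")
```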

3.3 Brainstorm and work together to overcome problems and acquire new knowledge 19

3.4 Guarantee quality at the source 21

Have peers review proposed changes to gain assurance that they will work as designed. Automate as many as possible of the quality checks normally performed manually by QA and information security staff, and let developers run automated tests on demand rather than asking a separate test team to schedule them. In this way, developers can test their code quickly and even deploy their own changes to production.

3.5 Optimizing for downstream work centers 22

3.6 Summary 22

Chapter 4 Step Three: Principles of Continuous Learning and Experimentation 23

4.1 Building a Learning Organization and Safety Culture 23

The first step is to establish a workflow from left to right, the second step is to establish rapid and continuous feedback from right to left, and the third step is to establish a culture of continuous learning and experimentation.

A blameless post-mortem can be held after every incident to build an objective account of why and how it happened and to agree on the best actions for improving the system. Ideally, this not only prevents the problem from recurring but also makes future detection and recovery faster. Removing blame is what makes a learning organization possible.

4.2 Institutionalize day-to-day improvement 25

If a team lacks the ability or willingness to improve its existing processes, it will keep suffering from the same problems, and the pain will only grow over time.

Relying on ad-hoc workarounds to solve problems also lets problems and technical debt accumulate. The continual improvement of daily work is even more important than the daily work itself.

Improve day-to-day work by explicitly setting aside time to pay off technical debt, fix bugs, refactor, and optimize code and environments. You can set aside periods of time between each development cycle, or schedule improvement blitz sessions where engineers self-organize as teams to solve problems they are interested in.

4.3 Transform local discovery into global optimization 26

Once improvements are achieved locally, they should be shared with the rest of the organization so that more people can benefit from them.

Failure reports should be written for any process or operational deviations in order to accumulate experience. Regardless of the strength of the failure signal, or the magnitude of the risk, processes and system designs are continuously updated based on these experiences.

4.4 Injecting resilience patterns into everyday work 27

Drills can also be used to rehearse large-scale failures, such as shutting down an entire data center. Faults can even be injected into the production environment (for example, Netflix's famous Chaos Monkey, which randomly kills processes and servers in production) to verify that the system's resilience meets expectations.
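In the same spirit as Chaos Monkey (a deliberately simplified sketch, not Netflix's actual tool), fault injection can be as simple as terminating one randomly chosen instance and then checking that alerts fire and the rest of the fleet absorbs the load. The instance names and the 'fleetctl-placeholder' command are invented.

```python
import random
import subprocess

# Hypothetical identifiers of redundant service instances.
INSTANCES = ["web-1", "web-2", "web-3", "web-4"]


def kill_random_instance(dry_run: bool = True) -> str:
    """Pick one instance at random and terminate it via a placeholder CLI."""
    victim = random.choice(INSTANCES)
    command = ["fleetctl-placeholder", "terminate", victim]
    if dry_run:
        print("would run:", " ".join(command))   # rehearse first
    else:
        subprocess.run(command, check=True)
    return victim


# The experiment only passes if monitoring detects the loss and customers see no impact.
print("terminated:", kill_random_instance())
```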

4.5 Leadership reinforces the learning culture 27

4.6 Summary 29

4.7 Summary of the first part 29

Part 1 discussed the three-step work method behind a DevOps transformation: the principles of flow, the principles of feedback, and the principles of continual learning and experimentation.

Part 2 Where to Start

Chapter 5 Choosing the Right Value Stream as an Entry Point 32

5.1 Greenfield Projects and Brownfield Projects 34

In technology, a greenfield project is a brand-new software project. Because it starts from scratch, there are few constraints from an existing code base, existing processes, or an existing team.

DevOps greenfield projects are often pilot projects that demonstrate the feasibility of a public or private cloud solution, or that trial automated deployment and related tools.

DevOps brownfield projects are those products or services that have served customers for years or even decades. This kind of project usually bears a lot of technical debt, such as no automated testing, running on an unmaintained platform, etc.

5.2 Balancing systems of record and systems of engagement 35

5.3 Start with the most innovative teams 36

Our goal is to find teams that believe in DevOps principles and practices, and have the will and ability to innovate and improve existing processes. Ideally, these groups will be champions of DevOps transformation.

We don't spend much time changing conservative groups, especially in the early stages. Instead, focus on teams that create success and are willing to take risks, and slowly expand from there.

Even with top-level buy-in, avoid a "big bang" approach (starting everywhere at once); instead, focus on a few pilot areas, make sure they succeed, and then expand incrementally.

5.4 Expanding the scope of DevOps 37

No matter how the entry point is chosen, the results must be demonstrated as early as possible and actively publicized. Break down big improvement goals into small, incremental steps. Doing so not only increases the rate of improvement, it also allows for early detection of wrongly selected value streams.

Build on the successes already achieved and gradually expand the scope of application of the DevOps program. Follow a low-stakes sequence to methodically build credibility, influence, and buy-in.

The following describes how to expand influence based on the support already gained.
(1) Identify innovators and early adopters: ideally these colleagues are respected and have a lot of influence over others in the organization, and their support lends credibility to the initiative.
(2) Win over the silent majority: in the next phase, extend DevOps practices to more teams and value streams, with the goal of building a broader and more solid base of support.
(3) Identify the holdouts: the most resistant detractors should be dealt with only after the majority has been won over and the base of support is solid enough.

5.5 Summary 38

Chapter 6 Understanding, Visualizing, and Using Value Streams 39

6.1 Identify the teams needed to create customer value 40

After selecting the DevOps pilot application or service, everyone in its value stream must be identified; in general, this includes the following members.
 Product owner: the voice of the business, who defines the functionality the system must provide.
 Development team: responsible for building that functionality.
 QA team: provides feedback to the development team and ensures the system works as required.
 Operations team: responsible for maintaining the production environment and ensuring the system runs well.
 Information security team: responsible for the security of the system and its data.
 Release manager: responsible for managing and coordinating deployment and release into production.
 Technical lead or value stream manager: in Lean terminology, the person responsible for "ensuring that the value stream meets or exceeds customer (and organizational) requirements from start to finish."

6.2 Mapping value streams for team work 40

After identifying the relevant members of the value stream, the next step is to gain a better understanding of how things work and document them with a value stream map.

The goal of value stream mapping is not to document every step and detail, but to identify the parts of the value stream that obstruct its fast flow.

Based on the information provided by the teams participating in the value stream, the analysis and optimization should focus on two kinds of work:
 work that has to wait for weeks or even months, such as preparing production-like environments, change approval processes, or security review processes;
 work that generates or requires significant rework.
The first version of the value stream map should contain only the major process blocks.

(Figure 6-1: an example value stream map showing lead time, processing time, and %C/A for each process block)

At a minimum, each process block should show the lead time and processing time of a work item (labeled LT and VA, respectively, in Figure 6-1), as well as the %C/A measured by its downstream consumers. The metrics in the value stream map are then used to guide the improvement effort.

Once you have identified the metrics you want to improve, draw an idealized value stream map and use this as the improvement goal for the next stage.
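A small sketch (with invented numbers) of how the per-block metrics roll up into totals that can guide improvement; the end-to-end roll-up of %C/A shown here is just one illustrative way to combine the figures.

```python
# Each entry: (process block, lead time in hours, process/value-added time in hours, %C/A)
VALUE_STREAM = [
    ("design and analysis",  40.0,  8.0, 0.80),
    ("development",          80.0, 30.0, 0.75),
    ("test and QA",          60.0, 10.0, 0.90),
    ("deploy to production", 24.0,  1.0, 0.95),
]

total_lt = sum(lt for _, lt, _, _ in VALUE_STREAM)
total_va = sum(va for _, _, va, _ in VALUE_STREAM)
flow_efficiency = total_va / total_lt   # share of elapsed time spent doing real work

end_to_end_ca = 1.0
for _, _, _, ca in VALUE_STREAM:
    end_to_end_ca *= ca                 # chance that work passes every step unchanged

print(f"total lead time:    {total_lt:.0f} h")
print(f"total process time: {total_va:.0f} h")
print(f"flow efficiency:    {flow_efficiency:.0%}")   # 24%
print(f"end-to-end %C/A:    {end_to_end_ca:.0%}")     # 51%
```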

6.3 Form a dedicated transformation team 42

A dedicated transformation team should be formed, separate from the departments responsible for day-to-day operations. Most importantly, this team should be held accountable for a clearly defined, measurable, system-level goal (for example, "reduce the lead time from code committed to version control to running successfully in production by 50%"). To make this possible, take the following measures:
 assign team members to work on the DevOps transformation full time (rather than keeping their previous jobs and spending "20% of their time" on the transformation);
 choose generalists who are familiar with multiple domains as team members;
 choose people who have long-standing, good relationships with other departments;
 if possible, give the team a separate, shared physical space, so that communication within the team is maximized while keeping an appropriate distance from the rest of the organization.

If possible, free the transformation team from many existing rules and regulations.

6.3.1 Having a common goal 43

6.3.2 Keep improvement planning horizons short 44

In any DevOps transformation project, you need to keep the improvement plan small, and you should strive to obtain measurable improvement results or usable data within weeks (or at worst, months).

6.3.3 Reserve 20% of the development time for non-functional requirements and reduce technical debt 44

To actively manage technical debt, ensure that at least 20% of development and operations time is devoted to refactoring, automation efforts, architectural optimization, and non-functional requirements (sometimes called "quality attributes"), such as maintainability, manageability, scalability, reliability, testability, deployability, and security.

Through this 20% investment, developers and operators can provide long-term solutions to the problems encountered in daily work, and ensure that technical debt will not hinder rapid and safe development and operation.

6.3.4 Improving the visibility of work 47

6.4 Using Tools to Reinforce Desired Behavior 47

Developers and operations engineers should not only share common goals but also share a common backlog of work, ideally stored in a common work system.

A shared work queue can be created instead of each silo using its own tool (e.g., JIRA for developers and ServiceNow for operations).

When production incidents are visible in the same work system that developers use, everyone can see when an incident is disrupting other work; this is especially obvious on a kanban board.

Another benefit of the shared queue is that the task list is unified. Everyone can think about the highest priority from a global perspective. When technical debt is found, it can be added to the task list if it cannot be resolved immediately. For outstanding issues, 20% of the time reserved for non-functional requirements can be used to fix them.

6.5 Summary 48

Chapter 7 Designing Organizational Structures Using Conway's Law 49

7.1 Organizational archetypes 51

7.2 Dangers of excessive functional orientation (“cost optimization”) 51

Traditional IT operations organizations often adopt a functional structure, grouping teams by specialty: database administrators in one group, network administrators in another, server administrators in a third, and so on. One obvious consequence is long lead times, especially for complex activities such as large-scale deployments, where tickets must be opened with multiple teams and handoffs coordinated, so the work waits in a queue at every step.

In addition to causing long waits and extended lead times, this situation can lead to poor handoffs, extensive rework, reduced delivery quality, bottlenecks and delays.

7.3 Building market-oriented teams (“optimized for speed”) 52

In the extreme, market-oriented teams are responsible not only for feature development but also for testing, security, deployment, and production operation across the entire application life cycle. These cross-functional teams can operate completely independently.

To achieve market orientation, we can embed the necessary engineers and skills (such as operations, QA, and information security) into each service team, or provide teams with self-service platforms that, for example, configure production-like environments, run automated tests, or perform deployments.

This enables each service team to independently deliver value to customers without having to submit tickets to other departments such as IT operations, QA or information security.

7.4 Making Functional Orientation Effective 53

A centralized, functionally oriented operations organization can also work well, as long as service teams can get what they need from it reliably and quickly (ideally on demand), and vice versa. Many well-known DevOps organizations, including Google, Etsy, and GitHub, have retained functionally oriented operations teams.

As long as everyone in the value stream understands the customer's and the organization's goals, regardless of where they sit in the organization, the desired DevOps outcomes can be achieved even with a functional orientation.

What these organizations have in common is a culture of high trust, in which all departments collaborate effectively, all work is prioritized transparently, and enough slack capacity is reserved in the system so that high-priority work can be completed quickly. This is enabled in part by automated self-service platforms that build quality into the products.

7.5 Integrate testing, operations, and information security into daily work 54

7.6 Make team members generalists 54

(Table 7-1: specialists versus generalists)

One countermeasure is to make every team member a generalist (see Table 7-1). Give engineers the opportunity to learn the skills necessary to build and run the systems they are responsible for, and rotate them between roles on a regular basis.

Through cross-training and skill development, generalists can do far more than specialists, and they improve overall flow by reducing the amount of work waiting in queues and the associated wait times.

Cross-training in multiple skills is also good for everyone's career development and makes the work more interesting for all employees.

7.7 Invest in services and products, not projects 56

7.8 Setting Team Boundaries According to Conway's Law 56

When teams are on different floors, different office buildings, or even different time zones, it becomes more difficult to establish and maintain common goals and mutual trust, which is detrimental to effective collaboration. It is difficult for teams to collaborate effectively when the primary communication mechanism is a ticket or change request.

Organizing teams poorly can have serious adverse consequences. Common anti-patterns include splitting teams by function (for example, putting developers and testers in different office locations, or outsourcing testing entirely) and splitting teams by architectural layer (application layer, database layer, and so on).

Such structures demand a great deal of cross-team communication and coordination, yet still produce extensive rework, disagreements over requirements, inefficient handoffs, and people sitting idle while they wait for upstream work to finish.

Ideally, the software architecture should ensure that small teams can operate independently and be fully decoupled from each other, thereby avoiding excessive unnecessary communication and coordination.

7.9 Creating Loosely Coupled Architectures to Improve Productivity and Security 57

An architecture that allows small teams to develop, test, and deploy independently, safely, and rapidly both raises and sustains developer productivity and improves deployment quality. Service-oriented architecture (SOA) has this property: it is an architectural approach that supports independent testing and deployment of services and is typically characterized by loosely coupled services with bounded contexts.

A loosely coupled architecture means that one service can be updated independently in a production environment without updating other services.

With a bounded context, a developer should be able to understand and update a service's code without knowing anything about the internals of its peer services. Services interact strictly through APIs, so they do not share data structures, database schemas, or other internal representations of objects. Bounded contexts divide services into independent parts with well-defined interfaces, which also makes testing easier.
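A minimal sketch of the bounded-context idea (the service names, endpoint, and fields are invented): the order service talks to the payment service only through a narrow API client, never through the payment service's database schema or internal objects.

```python
import json
import urllib.request


class PaymentClient:
    """Everything the order service knows about payments: one narrow HTTP API."""

    def __init__(self, base_url: str = "https://payments.internal.example.com"):
        self.base_url = base_url

    def charge(self, order_id: str, amount_cents: int) -> bool:
        payload = json.dumps({"order_id": order_id, "amount_cents": amount_cents}).encode()
        request = urllib.request.Request(
            f"{self.base_url}/api/v1/charges",
            data=payload,
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(request, timeout=5) as response:
            return json.loads(response.read())["status"] == "succeeded"


def place_order(order_id: str, amount_cents: int, payments: PaymentClient) -> str:
    # The order service's own logic: it can be unit-tested with a fake PaymentClient,
    # and the payment service may change its internals or schema freely, as long as
    # the /api/v1/charges contract is preserved.
    return "confirmed" if payments.charge(order_id, amount_cents) else "payment_failed"
```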

7.10 Summary 60

Chapter 8 Integrating operations into daily development work 61

There are three general strategies by which a centralized operations team can achieve market-oriented outcomes:
 build self-service capabilities that help developers be more productive;
 embed operations engineers into the service teams;
 if operations engineers are in short supply, assign an operations liaison to each service team.

8.1 Create shared services to improve development productivity 62

One way for operations to achieve market-oriented outcomes is to create a centralized set of platforms and tool services that any development team can use to become more productive, so that teams spend their time and energy building features rather than obtaining the infrastructure needed to deliver and support those features in production.

Ideally, every platform and service that operations provides should be fully automated and available on demand, without developers having to file a ticket and wait for someone in operations to process it manually. This ensures that operations does not become a bottleneck for its customers.

8.2 Embed operations engineers into the service teams 63

Another way to achieve market-oriented outcomes is to embed operations engineers into product teams, making those teams more self-sufficient and reducing their reliance on centralized operations.

By integrating ops engineers into the development team, their work priorities are driven almost entirely by the goals of their product teams.

However, interviews and hiring decisions may still be made by centralized operations teams to ensure consistency and quality of employees.

For large new development projects, operations engineers can be embedded from the start-up phase. They take part in the development team's discussions, such as planning meetings, daily stand-ups, and demos of new features. As the team's need for operations knowledge and capability decreases, the operations engineers can move on to other projects or engagements.

8.3 Assign an operations liaison to each service team 64

For various reasons (such as cost or limited headcount), it may not be possible to embed an operations engineer in every product team, but an operations liaison can be assigned to each product team instead.

A centralized operations team still manages all the environments and is responsible for keeping them consistent. The assigned operations liaison is responsible for understanding the following:
 what the new product does and why it is being built;
 how it works, and what its operability, scalability, and monitoring capabilities are;
 how it is monitored and how metrics are collected, and how to confirm that the application is working properly;
 whether the architecture and patterns differ from what has been done in the past, and why;
 any additional infrastructure capacity that will be needed;
 the feature release schedule.

The operations liaison also attends the team's stand-ups, folds the team's needs into the overall operations plan, and performs related tasks when necessary. When resources are contended or priorities conflict, the team relies on the liaison to escalate and resolve the issue.

Compared with embedding operations engineers, the liaison model lets each operations engineer support more product teams. The goal remains that operations must not become a bottleneck for the product teams: if teams start missing their goals because their liaison is overloaded, reduce the number of teams each liaison supports, or temporarily embed operations engineers into some of the teams.

8.4 Invite operations engineers to development team meetings 65

8.4.1 Invite operations engineers to the daily stand-up 65

The daily stand-up is a practice popularized by Scrum: a short meeting at which everyone makes three things clear to the rest of the team: What did I do yesterday? What will I do today? What is blocking me?

When operations engineers attend the stand-up, operations gains visibility into the development team's activities and can plan and prepare accordingly. For example, if the product team plans to launch a major feature in two weeks, operations can make sure the people and resources needed to deploy and release it are in place ahead of time, or can start work early in areas that need extra communication and preparation (such as adding monitoring points and automation scripts, tuning the database, or building additional environments for integration and performance testing).

8.4.2 Invite operations engineers to the retrospective 66

At the end of each development cycle, the team holds a retrospective to discuss: What went well? What could be improved? And how do we carry the successes and the improvement ideas into the next iteration or project?

When a deployment or release happens to occur during the time period under review, the Ops engineer should report the results to everyone and provide feedback to the product team. Doing so can improve the way future work is planned and executed, increasing the quality of work.

It bears repeating that improving daily work is even more important than the daily work itself, and every team must set aside time for it (for example, allocating 20% of each cycle to improvement work, or scheduling one improvement day per week or per month). Otherwise, the team's productivity will inevitably collapse under the enormous pressure of repaying technical debt.

8.4.3 Use kanban boards to make operations work visible 66

Operations work is part of the value stream, so it should be represented on the Kanban board along with other work related to product delivery.

8.5 Summary 67

8.6 Summary of Part II 67


Origin blog.csdn.net/lihuayong/article/details/120245567