"DevOps Practice Guide" - Reading Notes (3)

Part 3 Step One: Flow Technology Practice

The goal of Part III is to create the technical practices and architecture necessary to enable a steady, rapid flow of work from development to operations without disrupting the production environment or customer service. This means reducing the risk of deploying and releasing changes into production, which is achieved through a set of technical practices known as continuous delivery.

Continuous delivery includes laying the foundation for an automated deployment pipeline, ensuring that the team can use automated testing to continuously verify whether the code is in a deployable state, ensuring that developers submit code to the trunk every day, and building an environment and code that is conducive to low-risk releases. The next few chapters will focus on the following:

  • Lay the foundation for the deployment pipeline;
  • Achieve fast and reliable automated testing;
  • Implement and practice continuous integration and continuous testing;
  • Achieve low-risk releases through automation, architectural decoupling, etc.

These practices can effectively shorten the lead time of creating a production-like environment. At the same time, continuous testing can quickly provide feedback to each team member, allowing small teams to develop, test and deploy code to production safely and independently, making deployment and release to production a part of daily work.

9. Lay the foundation for the deployment pipeline

In order for work to flow from development to operations quickly and reliably, a production-like environment should be used at each stage of the value stream. In addition, these environments must be set up in an automated manner. Ideally, this should be built on demand using scripts and configuration information stored in a version control system, without relying on the operations team for manual operations. The goal of the deployment pipeline is to repeatedly build an entire production environment based on the information in the version control system.

9.1 Build development environment, test environment and production environment on demand

We no longer need the operations team to manually build and configure environments. Instead, we can use automated methods to do any of the following:

  • Copy a virtualized environment (e.g., a VMware virtual machine image, a Vagrant script, or an Amazon EC2 machine image);
  • Build an automated environment-construction process for "bare-metal" physical machines (for example, PXE installation from a baseline image);
  • Use "infrastructure as code" configuration management tools (such as Puppet, Chef, Ansible, SaltStack, CFEngine, etc.);
  • Use automated operating system configuration tools (such as Solaris Jumpstart, Red Hat Kickstart, and Debian preseed);
  • Assemble an environment from a set of virtual images or containers (such as Vagrant and Docker);
  • Create new environments in a public cloud (such as Amazon Web Services, Google App Engine, and Microsoft Azure), a private cloud, or another PaaS (Platform as a Service, such as OpenShift, Cloud Foundry, etc.).
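The common thread in the options above is that the environment is derived entirely from configuration held in version control, not from manual steps. A minimal sketch of that idea, assuming a hypothetical config file and registry (the image name, ports, and variables are illustrative, not from the source):

```python
# Hypothetical sketch: derive an environment-creation command purely from
# configuration kept in version control, so no manual ops steps are needed.
# The registry, image name, ports, and env vars are illustrative assumptions.

def docker_run_command(config: dict) -> str:
    """Render a `docker run` command from a version-controlled config dict."""
    parts = ["docker", "run", "-d", "--name", config["name"]]
    for host, container in config.get("ports", {}).items():
        parts += ["-p", f"{host}:{container}"]
    for key, value in config.get("env", {}).items():
        parts += ["-e", f"{key}={value}"]
    parts.append(config["image"])
    return " ".join(parts)

# The same config checked into version control yields the same
# environment every time it is built.
config = {
    "name": "app-test",
    "image": "registry.example.com/app:1.4.2",
    "ports": {8080: 80},
    "env": {"APP_ENV": "test"},
}
print(docker_run_command(config))
```

Because the command is a pure function of the config, rebuilding the environment is as repeatable as rebuilding the code.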

With a fully controllable environment, developers can quickly reproduce, locate, and fix defects while safely isolating them from production services and other shared resources. At the same time, developers can also try to change the environment and optimize the infrastructure code that creates the environment (such as configuration management scripts), further sharing information between development and operations.

9.2 Use a single shared code repository

Through the work in the previous stage, we can now create development, test, and production environments on demand. Next, we must ensure that all parts of the software system work properly.

A version control system records changes made to a file or set of files. These files can be source code, resource files, or other documents of a software development project. A set of changes constitutes a commit, also called a revision. Each revision, together with its metadata (such as who made the change and when), is stored in the system, allowing us to commit, compare, merge, and restore objects from previous revisions in the repository. Version control also reduces risk by letting us roll objects in production back to previous versions.

To ensure that a production environment can be restored reproducibly and accurately (and ideally quickly) even in the event of a catastrophic incident, the following resources must also be included in the version control system:

  • All code and dependencies of the application (e.g. libraries, static content, etc.);
  • Any scripts used to create the database schema, application reference data, etc.;
  • All tools and artifacts used to build the environment described in the previous section (such as VMware or AMI virtual machine templates, Puppet or Chef configuration modules, etc.);
  • Any files used to build containers (such as Docker or Rocket definition files and compose files, etc.);
  • All scripts that support automated and manual testing;
  • Any script that supports code packaging, deployment, database migration and environment provisioning;
  • All project artifacts (e.g. requirements documents, deployment processes, release notes, etc.);
  • All cloud platform configuration files (such as AWS CloudFormation templates, Microsoft Azure Stack DSC files, and OpenStack HEAT template files);
  • Create any additional scripts or configuration information required to support various infrastructure services such as enterprise service buses, database management systems, DNS zone files, firewall configuration rules, and other network devices.

It's not enough to be able to re-create any previous state of the production environment; you must also be able to re-create the entire pre-production environment and the build process. Everything the build process depends on therefore needs to go into version control as well, including the tools used (such as compilers and test tools) and the environments they depend on.

In fact, in almost all cases the environment has orders of magnitude more configurable parameters than the code, so it is the environment that needs version control the most.

9.3 Make infrastructure reconstruction easier

When we can quickly rebuild applications and environments on demand, then when something goes wrong we can rebuild quickly rather than spend time repairing it.

By being able to create environments repeatably, we can easily add more servers to the resource pool and thereby easily increase capacity (i.e., horizontal scaling). It also avoids the pain of restoring service after a catastrophic failure of non-reproducible infrastructure, failures that are often the result of years of undocumented manual changes.

In order to ensure the consistency of the environment, all changes to the production environment (configuration changes, patches, upgrades, etc.) need to be copied to all pre-production environments and newly built environments.

You can rely on automated configuration management systems to ensure consistency (such as Puppet, Chef, Ansible, Salt, Bosh, etc.), or you can create new virtual machines or containers through an automated build mechanism, deploy them to production, and then destroy or remove the old resources.

The latter model is called immutable infrastructure, where no manual operations are allowed anymore in the production environment. The only way to make changes to a production environment is to check the changes into version control and then rebuild the code and environment from scratch. Doing so eliminates the possibility of differences creeping into the production environment.
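The replace-rather-than-patch pattern can be sketched in a few lines. This is a toy model of the idea, not any particular tool; the instance fields are illustrative assumptions:

```python
# Illustrative sketch of immutable infrastructure: a change means building
# fresh instances from the versioned definition and swapping them in,
# never editing a running server in place.

def deploy_immutably(pool: list, build_instance, new_version: str) -> list:
    """Return a new pool: fresh instances replace old ones; nothing is mutated."""
    new_pool = [build_instance(new_version) for _ in pool]
    # Old instances are discarded wholesale instead of being patched.
    return new_pool

def build(version: str) -> dict:
    # A fresh instance built from version control carries no manual edits.
    return {"version": version, "manual_changes": 0}

pool = [{"version": "1.0", "manual_changes": 3}]  # drifted, hand-edited server
pool = deploy_immutably(pool, build, "1.1")
print(pool)  # every instance now matches the versioned definition exactly
```

Because old instances are never mutated, any drift they accumulated disappears with them.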

To eliminate uncontrolled configuration differences, you can disable remote logins to production servers, or periodically delete and replace instances in production to ensure that manual changes are removed. This forces everyone to make changes the right way, through the version control system. These measures systematically reduce the likelihood that the infrastructure deviates from a known good state (e.g., configuration drift, fragile artifacts, works of art, snowflake servers).

In addition, it is important to ensure that the pre-production environment is up to date, especially for developers to use the latest environment.

9.4 "Done" only when running in a production-like environment

Now we can set up environments on demand and everything is under version control. The next goal is to ensure that development teams use these environments in their daily work. You need to confirm that the application can run normally in a production-like environment long before the end of the project, or before deploying to the production environment for the first time.

Our goal is to ensure that development and QA regularly integrate code with production-like environments throughout the project, with increasing frequency. We do this by expanding the definition of "done." "Done" means that not only functionally correct code has been implemented, but that working and shippable code has been integrated and tested in a production-like environment at the end of each iteration cycle.

When development and operations teams jointly understand how the code interacts with the environment, and code is deployed early and often, deployment risk in production drops significantly. It also prevents architectural problems from being discovered only at the last minute of the project.

9.5 Summary

Building a fast flow from development to operations requires that anyone can get a production-like environment on demand. By having developers use production-like environments from the earliest stages of a software project, the risk of problems in production is significantly reduced. This is also one of many practices showing that operations can improve developer productivity. By expanding the definition of "done", developers are required to run their code in a production-like environment.

Additionally, by putting all production artifacts into the version control system, we gain a "single source of truth" that lets us rebuild the entire production environment in a fast, repeatable, documented way, using practices consistent with development work. By making infrastructure easier to rebuild than to repair, we can resolve problems more easily and quickly and make teams more productive.

10. Achieve fast and reliable automated testing

On a day-to-day basis, developers and QA staff use production-like environments to run applications. For each feature, the code has been integrated and run in a production-like environment, and all changes have been committed to the version control system. However, if you wait until all development work is complete and then have a separate QA department find and fix bugs through a dedicated testing phase, the results are often not ideal. And if testing can only be done a few times a year, developers will only learn about the mistakes they made months after the change was introduced. At that point, it's difficult to pinpoint the cause of the problem, and developers are forced to rush to fix it, greatly reducing their ability to learn from their mistakes.

Automated testing solves an important and troubling problem. As Gary Gruver put it: "Without automated testing, the more code we write, the more time and money it takes to test it; in most cases, this is a totally unscalable business model for any technology organization."

10.1 Continuously build, test and integrate code and environment

Our goal is to enable developers to create automated test suites in their daily work and ensure product quality early in development. Doing so facilitates a rapid feedback loop, helping developers identify problems early and resolve them quickly when constraints (such as time and resources) are minimal.

The purpose of creating an automated test suite is to increase the frequency of integration and evolve testing from a phased activity to a continuous activity. By building a deployment pipeline (see Figure 10-1), when new changes enter the version control system, a series of automated tests will be triggered.
(Figure 10-1: the deployment pipeline)
The deployment pipeline ensures that all code checked into the version control system is automatically built and tested in a production-like environment. This allows developers to get immediate feedback on build, test, or integration errors when they submit code changes, allowing them to fix them immediately. Proper continuous integration practices always ensure that code is in a deployable and shippable state.
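The pipeline's ordered-stages, fail-fast behavior can be sketched as follows. This is a toy model of the control flow only; the stage names are illustrative, and real pipelines (Jenkins, GitLab CI, etc.) add much more:

```python
# Minimal sketch of the feedback loop a deployment pipeline provides: stages
# run in order on every commit, and the first failure stops the run so the
# committer gets an immediate, specific signal.

def run_pipeline(stages) -> dict:
    """Run (name, check) stages in order; stop at the first failure."""
    for name, check in stages:
        if not check():
            return {"status": "red", "failed_stage": name}
    return {"status": "green", "failed_stage": None}

stages = [
    ("build", lambda: True),
    ("unit-tests", lambda: True),
    ("acceptance-tests", lambda: False),  # simulate a failing acceptance test
    ("integration-tests", lambda: True),  # never reached
]
print(run_pipeline(stages))
```

Stopping at the first red stage is what makes the feedback specific: the committer knows exactly which verification their change broke.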

In order to achieve this, automated build and test processes must be created in a dedicated environment. It's crucial to do this, and here's why.

  • Build and test processes can run at any time, regardless of an engineer's personal work habits.
  • Separate build and test processes ensure that engineers understand all the dependencies required to build, package, run, and test the code (i.e., eliminating the “application works on the developer’s laptop, but not in production” problem).
  • Package the application's executables and configuration so they can be installed repeatedly in an environment (e.g., RPM, yum, and npm on Linux, OneGet on Windows, or development-framework-specific formats such as Java EAR/WAR files or Ruby gems);
  • Package applications into deployable containers (such as Docker, Rkt, LXD, and AMI) rather than shipping bare program code;
  • Configure production-like environments in a consistent and repeatable manner (e.g., remove compilers from the environment, turn off debug flags, etc.).

The purpose of the deployment pipeline is to give everyone in the value stream, especially developers, feedback as fast as possible, helping them promptly identify any change that takes the code out of a deployable state, whether that change is to the code, the environment, the automated tests, or even the deployment pipeline infrastructure itself (such as the Jenkins configuration).

Once the deployment pipeline infrastructure is in place, continuous integration practices must also be adopted, which requires three things working together:

  • Comprehensive and reliable automated test suite to verify deployable status;
  • A culture that can “stop the entire production line” when verification tests fail;
  • Developers work on trunk and commit changes in small batches, rather than on long-lived feature branches.

10.2 Build a fast and reliable automated test suite

Whenever new changes are checked into the version control system, quick automated tests need to be run in the build and test environment. This way, all integration issues can be discovered and resolved immediately, just like Google's GWS team. This keeps the amount of code integration small and ensures that the code is always deployable.

Generally, automated tests fall into the following categories, from fastest to slowest:

  • Unit tests: typically test each method, class, or function in isolation. Their purpose is to ensure the code behaves as the developer designed. For many reasons (such as the need for fast, stateless tests), databases and other external dependencies are usually stubbed out (for example, the function is modified to return a static predefined value instead of calling the real database).
  • Acceptance tests: typically test the application as a whole to ensure that each functional module works as designed (for example, meeting the business acceptance criteria of a user story, or that an API can be called correctly) and that no regressions have been introduced (i.e., previously working functionality has not been broken). Jez Humble and David Farley distinguish the two this way: "The aim of a unit test is to show that a single part of the application does what the programmer intends it to... The objective of acceptance tests is to prove that our application does what the customer meant it to, not that it works the way its programmers think it should." After a build passes its unit tests, the deployment pipeline runs acceptance tests against it. Any build that passes acceptance testing is generally available for manual testing (e.g., exploratory testing, user interface testing) and integration testing.
  • Integration tests: ensure that the application interacts correctly with other applications and services in the production environment, rather than with stubbed interfaces. Jez Humble and David Farley write: "Much of the work involved in system integration testing consists of deploying new versions of each of the applications until they all cooperate. In this situation the smoke test is usually a fully fledged set of acceptance tests that run against the whole application." Only builds that have passed unit and acceptance tests run integration tests. Because integration tests are often brittle, the number of integration tests should be minimized, with as many defects as possible found during unit and acceptance testing. The ability to use virtual or simulated versions of remote services when running acceptance tests is a critical architectural requirement.
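The stubbing idea from the unit-testing bullet can be shown concretely. In this sketch the database call is replaced by a predefined return value, so the test is fast and stateless; `greeting_for` and its repository are hypothetical names, not from the source:

```python
# Unit test with the database stubbed out: no real connection is made,
# the stub returns a fixed static value instead.
from unittest.mock import Mock

def greeting_for(user_id: int, repo) -> str:
    """Business logic under test; `repo` abstracts the database."""
    user = repo.fetch_user(user_id)
    return f"Hello, {user['name']}!"

repo = Mock()
repo.fetch_user.return_value = {"id": 42, "name": "Ada"}

assert greeting_for(42, repo) == "Hello, Ada!"
repo.fetch_user.assert_called_once_with(42)
```

Because the stub is in memory, this test runs in microseconds and can execute anywhere, which is exactly what makes unit tests the fastest layer of the suite.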

When facing deadline pressure, developers may stop writing unit tests in their daily work, regardless of how "done" is defined. To detect and counter this, measure test coverage (by number of classes, lines of code, permutations, etc.), visualize the results, and even fail the test suite's verification when coverage drops below a certain threshold (for example, when unit test coverage of classes is below 80%).
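The coverage-threshold gate described above amounts to a simple check. Real tools (coverage.py, JaCoCo, etc.) supply the measured numbers; this sketch only illustrates the gate logic, with the 80% floor taken from the text:

```python
# Fail the suite's verification when measured coverage drops below a floor.

def coverage_gate(covered_lines: int, total_lines: int, floor: float = 0.80) -> bool:
    """Return True (pass) only when coverage meets the configured floor."""
    if total_lines == 0:
        return False  # nothing measured counts as a failure, not a pass
    return covered_lines / total_lines >= floor

assert coverage_gate(85, 100) is True
assert coverage_gate(79, 100) is False  # below 80%: the suite reports failure
```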

10.2.1 Find errors early in automated testing

One of the design goals of an automated test suite is to find errors as early in testing as possible. Therefore, faster automated tests (such as unit tests) run before slower automated tests (such as acceptance and integration tests), and all automated tests run before manual testing.

Therefore, whenever an acceptance or integration test finds a bug, a corresponding unit test should be written so that the same bug can be found faster, earlier, and more cheaply next time. Martin Fowler describes the concept of an "ideal testing pyramid" in which unit tests catch most errors, as shown in Figure 10-2. Many testing projects do just the opposite, with people spending most of their time and effort on manual and integration testing.
(Figure 10-2: the ideal testing pyramid)
If unit or acceptance tests are difficult and expensive to write and maintain, the architecture may be too tightly coupled, meaning there are no longer (or never were) clear boundaries between modules. In that case, build a more loosely coupled system so that modules can be tested independently without an integration environment. Even for the most complex applications, acceptance test suites can run in just a few minutes.

10.2.2 Execute tests as quickly as possible in parallel

We want tests to execute quickly, so tests need to be designed to run in parallel, possibly across many servers. We also want to run different types of tests in parallel. For example, once a build passes acceptance tests, security tests and performance tests can run in parallel, as shown in Figure 10-3. Manual exploratory testing may or may not wait until the build has passed all automated tests: starting earlier speeds up feedback, but risks spending manual effort on a build that will eventually fail.

(Figure 10-3: running automated and manual tests in parallel)

Any build that passes all automated tests can be used for exploratory testing as well as other forms of manual or resource-intensive testing (such as performance testing). All of these tests should be performed as frequently and comprehensively as possible, either continuously or periodically.
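The fan-out of independent test groups can be sketched with standard-library concurrency. The "suites" here are stand-ins; real runners (e.g., parallel CI jobs or pytest-xdist) do this for you:

```python
# Sketch of running independent test groups in parallel after a build
# passes acceptance tests.
from concurrent.futures import ThreadPoolExecutor

def run_suite(name: str) -> tuple:
    # Placeholder for a real test group (security, performance, ...).
    return (name, "passed")

suites = ["security", "performance", "acceptance"]
with ThreadPoolExecutor(max_workers=len(suites)) as pool:
    results = dict(pool.map(run_suite, suites))

print(results)  # all three suites ran concurrently
```

The total wall-clock time approaches that of the slowest suite instead of the sum of all suites, which is the whole point of parallelizing.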

10.2.3 Write automated tests first

To ensure reliable automated testing, one of the most effective ways is to write the automated tests as part of daily work, using techniques such as Test-Driven Development (TDD) and Acceptance Test-Driven Development (ATDD).

Kent Beck introduced TDD as part of Extreme Programming in the late 1990s. The technique has the following three steps:

  1. Make sure the test fails, "write test cases for the functionality you want to add", and check in the test cases;
  2. Ensure that the test passes, "write the code to implement the function until the test passes", and check in the code;
  3. "Refactor the old and new code, optimize the structure", ensure that all tests pass, and check in the code again.
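A toy red-green pass over the three steps above might look like this. The `slugify` function is a hypothetical example, not from the source:

```python
# Step 1: write the test first. Running test_slugify() at this point would
# raise NameError, because slugify does not exist yet -- the test fails first.
def test_slugify():
    assert slugify("DevOps Practice Guide") == "devops-practice-guide"

# Step 2: write just enough implementation to make the test pass.
def slugify(title: str) -> str:
    return "-".join(title.lower().split())

# Step 3: refactor freely; this call must keep passing after every change.
test_slugify()
```

Checking in the test before (or with) the code is what turns the suite into the living specification described below.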

The automated test suite is checked into the version control system along with the program code to provide a usable and up-to-date set of system specifications. If developers want to understand how to use the system, they can look at the test suite to find examples that demonstrate how to call the system API.

10.2.4 Automate manual testing as much as possible

The purpose of automated testing is to find as many code errors as possible and reduce reliance on manual testing. As the saying goes: "While testing can be automated, the creation of quality cannot. Having humans perform tests that should be automated is a waste of human potential."

By automating tests, all testers (including, of course, developers) are freed up for high-value activities that cannot be automated, such as exploratory testing or improving the testing process itself. However, simply automating all manual tests may have undesirable consequences: no one wants automated tests that are unreliable or produce false positives (tests that should pass because the code is functionally correct, but fail because of poor performance, timeouts, an uncontrolled starting state, or unintended state created by database stubbing or a shared test environment).

Instead, start with a small number of reliable automated tests and add to them over time, steadily raising the assurance level of the system and quickly detecting any change that takes the code out of a deployable state.

10.2.5 Integrating performance tests into test suites

We often find poor application performance during integration testing, or only after the application has been deployed to production. Performance problems are hard to detect: performance may degrade slowly over time and go unnoticed until it is too late (for example, a database query without an index). And many problems are hard to solve, especially when they stem from earlier architectural decisions or previously undiscovered limits of the network, database, storage, or other systems.

The goal of writing and executing automated performance tests is to verify the performance of the entire application stack (code, database, storage, network, virtualization, etc.) as part of the deployment pipeline, so that problems are discovered as early as possible, when they are cheapest and fastest to fix.

If you can understand how your applications and environments perform under production-like loads, you can make better capacity planning and detect conditions such as:

  • Non-linear increase in database query time (for example, forgetting to create an index for the database, causing page load time to increase from 100 milliseconds to 30 seconds);
  • Code changes cause the number of database calls, storage space usage, or network traffic to increase several times.
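The missing-index example above can even be guarded by a deterministic automated check. This sketch uses the standard-library sqlite3 module and inspects the query plan rather than timing the query, so it runs reliably in a pipeline; the table and index names are illustrative:

```python
# Performance guard for the "forgotten index" case: fail if the lookup
# would do a full table scan instead of using an index.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, email TEXT)")
conn.execute("CREATE INDEX idx_users_email ON users(email)")

plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM users WHERE email = ?", ("a@b.c",)
).fetchall()
plan_text = " ".join(row[-1] for row in plan)

# With the index present, SQLite reports an index search, not a full SCAN.
assert "idx_users_email" in plan_text
print(plan_text)
```

Dropping the `CREATE INDEX` line makes the assertion fail, which is exactly the early signal the deployment pipeline should give before page loads balloon from milliseconds to seconds.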

10.2.6 Integrating non-functional requirements tests into test suites

In addition to testing the code and verifying that it meets expectations and operates properly under production-like loads, other quality attributes of the system need to be verified. These quality attributes are often called non-functional requirements and include availability, scalability, capacity, and security.

Many non-functional requirements are met by configuring the environment correctly, so automated tests must be written to verify that the environment is set up and configured properly. For example, the consistency and correctness of the following items should be ensured, since many non-functional requirements (such as security, performance, and availability) depend on them:

  • Applications, databases and software libraries used;
  • Programming language interpreters and compilers, etc.;
  • Operating system (e.g. enable audit logging, etc.);
  • All dependencies.

When using an infrastructure-as-code configuration management tool (such as Puppet, Chef, Ansible, SaltStack, or Bosh), you can use the same testing frameworks you use for code to verify that the environment is configured correctly and operating properly (for example, writing environment tests in Cucumber or Gherkin).
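Testing environment configuration like code reduces to asserting the properties that non-functional requirements depend on. A minimal sketch, assuming illustrative policy values (audit logging on, debug off, TLS >= 1.2) that are not from the source:

```python
# Treat environment configuration as testable data: return violations of
# the expected production-like policy.

def check_environment(config: dict) -> list:
    """Return a list of violations; an empty list means the env is compliant."""
    violations = []
    if not config.get("audit_logging"):
        violations.append("audit logging must be enabled")
    if config.get("debug"):
        violations.append("debug mode must be off in production-like envs")
    if config.get("tls_min_version", 0) < 1.2:
        violations.append("TLS minimum version must be >= 1.2")
    return violations

prod_like = {"audit_logging": True, "debug": False, "tls_min_version": 1.2}
assert check_environment(prod_like) == []

drifted = {"audit_logging": False, "debug": True, "tls_min_version": 1.0}
print(check_environment(drifted))  # three violations detected
```

Run as part of the pipeline, such checks catch configuration drift the same way unit tests catch code regressions.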

In addition, just as we run code analysis on applications in the deployment pipeline (such as static analysis and test coverage analysis), tools such as Chef's Foodcritic or puppet-lint should be used to analyze the code that builds the environment. All security hardening checks should also run as part of automated testing, to ensure all relevant configuration is correct (e.g., server hardening specifications).

At any time, automated testing can verify that the code is in a deployable state. Be sure to set up an andon cord mechanism so that when the deployment pipeline fails, any necessary steps are taken immediately to restore the build to a green state.

10.3 Pull the Andon cord when the deployment pipeline fails

When a build is in green status in the deployment pipeline, we can confidently deploy code changes to production.

To keep the deployment pipeline green, create a virtual andon cord, similar to the physical device in the Toyota Production System. When a developer submits a change that breaks the build or the automated tests, no new changes are allowed in until the issue is resolved. If someone needs help solving the problem, they can bring in whatever resources they need, as in the Google example described at the beginning of this chapter.

When a deployment pipeline fails, at least inform the entire team. Everyone can either work together to fix the problem, roll back the code, or even configure the version control system to reject subsequent code commits until the first stages of the deployment pipeline (i.e., builds and unit tests) return to a green state. If the problem stems from false positives generated by an automated test, then the test should be rewritten or deleted. All members of the team should have access to rollback operations (I'm skeptical) in order to return the deployment pipeline to a green state.

When a later stage of the deployment pipeline fails (such as acceptance testing or performance testing), instead of stopping all new work, you should have a subset of developers and testers on standby who are responsible for fixing the problem as soon as it occurs. These developers and testers should also execute new tests early in the deployment pipeline to catch regression errors introduced by these issues. For example, if a defect is discovered during acceptance testing, a unit test should be written to capture the problem. Likewise, if a defect is discovered during exploratory testing, corresponding unit tests or acceptance tests should be written.

In some ways, this step is more challenging than building and testing the server—those are purely technical activities, whereas this step requires changing human behavior and providing incentives.

Why Pull the Andon Cord
Failing to pull the andon cord and immediately resolve deployment pipeline issues makes it ever harder to return applications and environments to a deployable state. Consider the following situations:

  • Someone submitted code that caused the build or automated tests to fail, but no one fixed it.
  • Someone else submits a change on top of the broken build. The new change also fails the automated tests, but because the build was already red, no one sees the results that would have revealed the new defect, let alone fixes it.
  • The existing tests no longer run reliably, so no one bothers writing new test cases. (Why bother? The current tests can't even pass.)
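The commit-gating half of the andon cord (configuring version control to reject commits while the pipeline is red) can be modeled simply. This is purely illustrative; real setups enforce it through VCS hooks or the CI server:

```python
# Sketch of the virtual andon cord: while the pipeline is red, the gate
# rejects new commits until the build is fixed or rolled back.

class CommitGate:
    def __init__(self):
        self.pipeline_green = True

    def submit(self, change: str) -> str:
        if not self.pipeline_green:
            return f"rejected: fix the build before submitting {change!r}"
        return f"accepted: {change!r}"

    def report_failure(self):
        self.pipeline_green = False  # someone pulled the andon cord

    def report_fixed(self):
        self.pipeline_green = True   # build restored to green

gate = CommitGate()
assert gate.submit("feature-a").startswith("accepted")
gate.report_failure()
assert gate.submit("feature-b").startswith("rejected")
gate.report_fixed()
assert gate.submit("feature-b").startswith("accepted")
```

Rejecting new work while red is what prevents the pile-up of unseen defects described in the bullets above.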

10.4 Summary

In this chapter, we created a comprehensive set of automated tests to ensure that the build is always in a green, deployable state. We organized the test suites and testing activities into the deployment pipeline, and established the norm that no matter whose change breaks the automated tests, everyone does their utmost to restore the system to a green state.

This approach lays the foundation for continuous integration, allowing many small teams to independently and securely develop, test, and deploy code to deliver value to customers.

Origin blog.csdn.net/u010230019/article/details/132729753