"DevOps Practice Guide" - Reading Notes (5)

Part 4 The Second Way: The Technical Practices of Feedback

In Part 3, we described how to establish a fast flow of work from development to operations, along with the architecture and technical practices it requires. In Part 4, we describe the technical practices needed to implement the Second Way: creating fast and continuous feedback from operations back to development.

These practices speed up and amplify the feedback loop, allowing us to see problems as they occur and communicate that information to everyone in the value stream. Problems can then be found and fixed early in the software development lifecycle, ideally before they grow into a catastrophic incident.

We'll also explain how to comprehensively monitor user behavior, production failures, service outages, compliance issues, and security vulnerabilities throughout the build, test, and deployment process. The following work needs to be explored and implemented:

  • Establish a telemetry system that can locate and resolve faults;
  • Use monitoring to better predict failures and gain insight into whether business goals are being achieved;
  • Incorporate user research and feedback into the work of the R&D team;
  • Provide feedback to development and operations so they can deploy applications safely;
  • Use peer review and pair programming feedback to improve the quality of work.

14. Establish a telemetry system that can identify and resolve problems

One problem operations teams cannot avoid is failure: a small change can have many unintended consequences, including service interruptions or even a global outage that affects every customer. The reality is that operations teams run complex systems that no single person can fully understand, let alone explain how every part fits together.

To be disciplined in solving problems, we need to design systems that continuously produce telemetry. Telemetry is broadly defined as "an automated communications process by which measurements and other data are collected at remote points and then transmitted to receiving equipment for monitoring". The goal is to establish telemetry across the application and its environments, including production, pre-production, and the deployment pipeline.

The goal of this chapter is to ensure that the service is running properly in the production environment by establishing a comprehensive monitoring system. When problems arise, we can quickly locate them and make the right decisions to solve them, preferably long before customers are affected. Furthermore, monitoring allows us to assemble our best understanding of reality and detect when it is incorrect.
Incident response metrics:

  • MTTR: Mean Time to Resolution, the time from when a problem is first reported to when it is finally resolved by technicians. In cybersecurity, MTTR denotes the span between when a security incident is first discovered and when security operations staff finally resolve it.
  • MTTD: Mean Time to Detect, the time from when an attacker first gains a foothold on the target network until the intrusion is detected by network or endpoint security tooling.
  • MTTA: Mean Time to Acknowledge, the average time from when the system raises an alert to when someone starts to look at and handle it.
  • MTTI: Mean Time to Investigate, the average time from acknowledging a security incident to beginning to investigate its cause and resolution.
  • MTTC: Mean Time to Contain, the time it takes security teams to find a threat actor and stop them from moving further into systems and the network.
  • MTTF: Mean Time to Failure, the average time a system or component operates before it fails.

14.1 Build a centralized monitoring architecture

Operational monitoring and log management are definitely not new; for many years operations engineers have used and customized monitoring frameworks (such as HP OpenView, IBM Tivoli, and BMC Patrol/BladeLogic) to ensure the normal operation of production systems. Monitoring data is typically collected using an agent running on the server side or through agentless monitoring such as SNMP traps or poll-based monitoring. The front-end is typically presented using a graphical user interface, while the back-end is often supplemented using tools such as Crystal Reports.

A modern monitoring architecture has the following components:

  • Collect data at the business logic, application, and environment layers : Within each layer, establish telemetry for events, logs, and metrics. Logs can be stored in dedicated files on each server (such as /var/log/httpd-error.log), but it is better to send all logs to a common log service, which makes centralizing, rotating, and purging them easier. Most operating systems provide this capability, such as syslog on Linux and the event log on Windows. In addition, collecting metrics at every level of the application stack gives a better understanding of system behavior. At the operating system level, tools such as collectd and Ganglia can collect status metrics like CPU, memory, disk, or network usage, while tools such as AppDynamics, New Relic, and Pingdom collect performance metrics (see the sketch after this list).
  • An event router responsible for storing and forwarding events and metrics : This capability supports visualization, trend analysis, alerting, anomaly detection, and more. By collecting, storing, and aggregating all monitoring information, deeper analysis and health checks become possible. This is also where configuration information about the services (and the applications and environments they support) is stored, enabling threshold-based alerting and health checks.
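As a minimal sketch of the collection layer, the snippet below ships application logs to a common log service using Python's standard logging module; the syslog address and logger name are illustrative assumptions, not part of the book's example.

```python
import logging
import logging.handlers

# Send application logs to a central syslog service instead of a local file,
# so they can be centralized, rotated, and purged in one place.
# Assumes a syslog daemon listening on UDP 514; substitute your log service address.
handler = logging.handlers.SysLogHandler(address=("localhost", 514))
handler.setFormatter(logging.Formatter("%(name)s: %(levelname)s %(message)s"))

logger = logging.getLogger("orders-service")   # illustrative service name
logger.setLevel(logging.INFO)
logger.addHandler(handler)

logger.info("checkout started for cart_id=42")
```

With logs flowing to one place, rotation, purging, and centralized search become a property of the log service rather than of each individual server.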

"Monitoring is so important that the monitoring system needs to be more available and scalable than the system being monitored."

From now on, the term telemetry is used interchangeably with metrics, covering the event logs and metrics generated by services at every level of the application stack and across all production environments, staging environments, and deployment pipelines.

14.2 Setting up application log telemetry for production environments

With a centralized telemetry infrastructure, we must ensure adequate telemetry is established for all applications being built and operated. To achieve this, development and operations engineers must build production telemetry as part of their daily work, not just for new services, but also for existing services.

"Every time NASA launches a rocket, it uses millions of automated sensors to report the status of every component of this valuable asset. However, we don't typically do the same with software.
We Telemetry for facilities is one of the things with the highest return on investment.”

Logs should have different severity levels, some of which may also trigger alerts; for example (a minimal logging sketch follows the list):

  • Debug level : This level of information is related to all events that occurred in the application and is most commonly used when debugging. Typically, debug logging is disabled in production but temporarily enabled during troubleshooting.
  • Information level : Information at this level includes user-triggered or system-specific actions (such as "Start credit card transaction").
  • Warning level : This level of information tells us about possible error conditions (for example, a call to the database takes longer than a certain amount of time). Alerts and troubleshooting procedures may be triggered, and other log messages may help to better understand the cause of the event.
  • Error level : This level of information focuses on error conditions (for example, API call failed, internal error).
  • Fatal level : This level of information tells us when an outage condition occurs (for example, the network daemon cannot bind a network socket).
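As a minimal sketch of how these levels map onto a typical logging library (Python's standard logging is shown; the logger name and messages are purely illustrative):

```python
import logging

logging.basicConfig(
    level=logging.INFO,  # DEBUG is normally disabled in production
    format="%(asctime)s %(levelname)s %(name)s %(message)s",
)
logger = logging.getLogger("payments")

logger.debug("raw gateway response: %s", {"code": "00"})            # Debug: suppressed at INFO
logger.info("start credit card transaction id=%s", "tx-123")        # Information
logger.warning("database call took %.1f s (threshold 1.0 s)", 2.7)  # Warning: may trigger an alert
logger.error("API call to card gateway failed")                     # Error
logger.critical("cannot bind network socket on port 8443")          # Fatal/critical: outage condition

# During troubleshooting, debug logging can be enabled temporarily:
logger.setLevel(logging.DEBUG)
```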

To help ensure that we have information relevant to service reliability and secure operation, we should ensure that all potentially significant application events generate log entries, including:

  • Results of authentication/authorization decisions (including logoff);
  • Access to systems and data;
  • System and application changes (especially privileged changes);
  • Changes to data, such as adding, modifying, or deleting data;
  • Invalid input (possible malicious injection, threats, etc.);
  • Resources (memory, disk, CPU, bandwidth, or any other resource with hard or soft limits);
  • Health and availability;
  • Startup and shutdown;
  • Faults and errors;
  • Circuit breaker trips;
  • Latency;
  • Backup success/failure.

To make it easier to interpret and give meaning to all log entries, we should (ideally) rank and categorize log records, for example by non-functional attributes (e.g., performance, security) and by functional attributes (e.g., search, ranking).
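One hedged way to do this is structured logging, where each entry carries explicit functional and non-functional tags; the helper and field names below are illustrative, not a standard:

```python
import json
import logging
import time

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("search-service")

def log_event(message, functional_area, nonfunctional_attrs, level=logging.INFO, **fields):
    """Emit a JSON log entry tagged with functional and non-functional
    attributes so entries can later be filtered, ranked, and aggregated."""
    entry = {
        "ts": time.time(),
        "message": message,
        "functional": functional_area,         # e.g. "search", "ranking"
        "nonfunctional": nonfunctional_attrs,  # e.g. ["performance", "security"]
        **fields,
    }
    logger.log(level, json.dumps(entry))

log_event("query latency above threshold", "search", ["performance"], latency_ms=870)
log_event("failed login attempt", "authentication", ["security"], user="anonymous")
```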

14.3 Using telemetry to guide problem resolution

As mentioned at the beginning of this chapter, high performers use a disciplined approach to problem solving. This contrasts with the more common practice of relying on rumor and hearsay, which leads to the unfortunate metric of "mean time until declared innocent": how long it takes to convince everyone else that the outage wasn't our fault.

Telemetry gives us a scientific approach, letting us formulate hypotheses about what is causing a particular problem and how to solve it. While resolving a problem, we can answer questions such as:

  • What evidence is there in the monitoring system that the problem is actually occurring?
  • What are the possible related events and changes in the application and environment that caused the problem?
  • What hypotheses can be made to substantiate the proposed link between cause and effect?
  • How can we verify which hypotheses are correct and will successfully resolve the problem?

14.4 Integrate building production telemetry into your daily work

To enable everyone to identify and solve problems in their daily work, we need to make it easy for them to create, display, and analyze metrics in their daily work. Therefore, the necessary infrastructure and libraries must be established to make it easy for any developer or operations staff to create telemetry for any function. Ideally, creating a new metric, displaying it through a dashboard, and making it visible to everyone in the value stream should be as easy as writing a line of code.

Figure 14-3 shows a user login event created with one line of code (in this example, that line of PHP code is: StatsD::increment("login.successes")). The figure shows the number of successful and failed logins per minute, and the vertical line on the figure represents a production deployment.
[Figure 14-3: Successful and failed user logins per minute, with a vertical line marking a production deployment]
By making the creation of production metrics part of daily work, we not only catch problems promptly but also surface them during design and operations discussions, so that more and more metrics end up being tracked.
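The book's one-liner is PHP; a rough Python equivalent is sketched below, writing the StatsD plain-text counter format ("name:1|c") directly over UDP so that no particular client library is assumed (the collector address is illustrative).

```python
import socket

_sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

def increment(metric, host="127.0.0.1", port=8125):
    # StatsD plain-text protocol: "<name>:<value>|c" increments a counter.
    _sock.sendto(f"{metric}:1|c".encode("utf-8"), (host, port))

# The single line a developer adds in the login handler:
increment("login.successes")   # and increment("login.failures") on the error path
```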

14.5 Establishing self-service telemetry and information radiators

We want production monitoring metrics to have a high degree of visibility, which means placing them at the center of the development and operations staff's work areas so that anyone interested can see the current status of the service. This should include at least everyone in the value stream, such as development, operations, product management, and information security.

This is often called an information radiator, which the Agile Alliance defines as: "the generic term for any of a number of handwritten, drawn, printed, or electronic displays which a team places in a highly visible location, so that all team members, as well as passers-by, can see the latest information at a glance: count of automated tests, velocity, incident reports, continuous integration status, and so on. The idea originated as part of the Toyota Production System."

By placing information radiators in highly visible places, we spread a sense of responsibility among our team members while also actively demonstrating the following values:

  • The team hides nothing from observers (customers, stakeholders, etc.);
  • The team has nothing to hide from themselves, they acknowledge problems and face them head-on.

Now, we can radiate the production telemetry information generated by our infrastructure throughout the organization, and we can also disseminate this information to internal customers and even external customers. For example, create a publicly browsable service status page so customers can understand the status of the services they rely on.

14.6 Discovering and filling telemetry blind spots

We need to build adequate and complete telemetry in all environments, at every level of the application stack, and in the deployment pipelines that support them. We need metrics at the following levels:

  • Business level: Examples include number of trading orders, turnover generated, number of user registrations, churn rate, results of A/B tests, etc.
  • Application level: Examples include transaction processing time, user response time, application failures, etc.
  • Infrastructure level (e.g. database, operating system, network, storage): Examples include web server throughput, CPU load, disk usage, etc.
  • Client software level (e.g. JavaScript on client browser, mobile application): Examples include application errors and crashes, user-side transaction processing times, etc.
  • Deployment pipeline levels: Examples include build pipeline status (for example: red or green status of various automated test suites), change deployment lead time, deployment frequency, test environment go-live status, and environment status.

By fully covering the above areas with telemetry, we can see the health of everything the service depends on, and speak to the problem with data and facts instead of listening to rumors, pointing fingers, and blaming each other.

By finding and fixing issues early, we can fix them while they are small and easy to fix, reducing the impact on our customers. In addition, after every production failure, we should identify whether there are telemetry blind spots so that filling them will allow faster failure detection and restoration of service. Even better, these blind spots can be identified during the peer review process during feature development.

14.6.1 Application and Business Metrics

At the application level, our goal is to ensure that the telemetry we produce not only reflects the health of the application (e.g., memory usage, transaction counts, etc.), but also measures how well the organization's business goals are being achieved (e.g., user growth, number of user logins, user session length, proportion of active users, frequency of use of certain features, etc.).

Business metrics will be part of the customer acquisition funnel, the theoretical steps a potential customer takes before making a purchase. For example, on an e-commerce website, measurable funnel events include how long a user stays on the site, the number of clicks on product links, the number of items in the shopping cart, and completed orders.
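As a small illustration of funnel metrics (the event names and counts below are made up, not measured data), step-to-step conversion rates can be computed directly from event counts:

```python
# Hypothetical daily event counts for an e-commerce acquisition funnel.
funnel = [
    ("visited_site",    120_000),
    ("clicked_product",  34_000),
    ("added_to_cart",     9_500),
    ("completed_order",   2_100),
]

# Conversion rate of each step relative to the previous one, plus end-to-end.
for (prev_name, prev_count), (name, count) in zip(funnel, funnel[1:]):
    print(f"{prev_name} -> {name}: {count / prev_count:.1%}")
print(f"overall conversion: {funnel[-1][1] / funnel[0][1]:.2%}")
```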

"Typically, feature teams will define goals on an acquisition funnel, including the number of times each customer uses the feature in a day. Sometimes, users are informally referred to as 'tire kickers' during different stages of monitoring. 'Active users', 'member users' and 'hardcore users', etc."

14.6.2 Infrastructure Metrics

Similar to what we did for application metrics, our goal for both production and non-production infrastructure is to ensure comprehensive telemetry is in place, so that if an issue arises in any environment we can quickly determine whether it is caused by the underlying infrastructure. In addition, we must be able to pinpoint exactly which infrastructure component is at fault (e.g., database, operating system, storage, network, etc.).

We want to expose as much infrastructure monitoring information as possible to all technical stakeholders, ideally organized by service or application logic. In other words, when something goes wrong in our environment, we need to know exactly how our applications and services may be or are being affected.

No matter how simple or complex a service is, displaying business metrics together with application and infrastructure metrics can help identify failures. For example, you might see new customer registrations drop to 20% of the daily average, and then immediately discover that all database queries are taking 5 times longer than normal, allowing us to focus our efforts on solving the problem.

“Instead of measuring operations based on downtime, it’s better to measure development and operations teams based on the actual business consequences of downtime: What are the business benefits we should be getting but are not getting.”

14.6.3 Displaying overlaid metric combinations

To improve visibility of changes, all production deployment activity can be overlaid on the monitoring graphs. For example, for a service that handles a large volume of inbound transactions, a production change may be followed by a noticeable settling period during which performance plummets because all cached queries have been invalidated.

To better understand and maintain quality of service, we need to know when performance recovers and, if necessary, take performance improvement measures. Similarly, we also want to overlay other meaningful operational activities, for example showing when a service is under maintenance or being backed up, so that alerts can be displayed or suppressed where appropriate.
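A hedged sketch of overlaying a deployment event on a metric graph, using matplotlib and fabricated data purely for illustration:

```python
import matplotlib.pyplot as plt
import numpy as np

# Fabricated transactions-per-second telemetry for illustration only.
minutes = np.arange(180)
tps = 100 + 10 * np.sin(minutes / 15) + np.random.normal(0, 3, minutes.size)
deploy_minute = 120
tps[deploy_minute:] *= 0.6   # performance settles after a (hypothetical) deployment

plt.plot(minutes, tps, label="transactions per second")
plt.axvline(deploy_minute, linestyle="--", label="production deployment")  # overlay the change
plt.xlabel("minutes")
plt.ylabel("TPS")
plt.legend()
plt.show()
```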

14.7 Summary

It is important to detect problems as they occur so that their cause can be identified promptly and remediated quickly. Whether in an application, a database, or the environment, having every component of a service emit telemetry for analysis means problems can be discovered and resolved long before they turn into a disaster, and ideally before customers even notice. Not only does this make for happier customers; by reducing the number of firefights and crises, it also lowers stress and burnout and makes work more enjoyable and productive.

15. Analyze telemetry data to better predict failures and achieve goals

In this chapter, we will create tools to discover variances and weak failure signals hidden in production telemetry, so that catastrophic failures can be averted. The chapter introduces a number of statistical techniques and illustrates their use through case studies.

This chapter explores many statistical and visualization techniques, including anomaly detection, that we can use to analyze telemetry data to better predict problems. This way, we can resolve issues faster, at lower cost, and before the customer or anyone in the organization is impacted. In addition, we will create more data usage scenarios to help us make better decisions and achieve organizational goals.

15.1 Use mean and standard deviation to identify potential problems

One of the simplest statistical techniques for analyzing production metrics is to calculate the mean (average) and standard deviation. This lets us create a filter that detects when a metric differs significantly from its normal value, and even configure alerts that trigger remediation (for example, paging the on-call engineer at 2 a.m. when database queries are running significantly slower than average).

A common use of standard deviation is to periodically inspect a metric's data set and alert when it differs significantly from the mean, for example alerting when the number of unauthorized logins in a day is more than three standard deviations above the mean. As long as the data set has a Gaussian distribution, only about 0.3% of data points fall more than three standard deviations from the mean, so very few alerts should fire.
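A minimal sketch of this rule, assuming a hypothetical 30-day baseline of unauthorized login counts:

```python
import numpy as np

# Hypothetical baseline: unauthorized login attempts per day over the last 30 days.
baseline = np.array([12, 9, 14, 11, 10, 13, 12, 8, 11, 10, 12, 9, 13, 11, 10,
                     12, 9, 14, 11, 10, 13, 12, 8, 11, 10, 12, 9, 13, 11, 10])
today = 47

mean, std = baseline.mean(), baseline.std()
threshold = mean + 3 * std   # alert when today's count is > 3 sigma above the mean
if today > threshold:
    print(f"ALERT: {today} unauthorized logins "
          f"(mean {mean:.1f}, std {std:.1f}, threshold {threshold:.1f})")
```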

Even a simple statistical analysis can be valuable because you no longer have to set static thresholds—which is not feasible if you are tracking thousands or hundreds of thousands of production metrics.

Throughout the remainder of this book, we will use the terms telemetry, metrics, and data sets interchangeably. In other words, a metric (e.g., "page load time") maps to a data set (e.g., 2 ms, 8 ms, 11 ms, etc.), the term statisticians use for a matrix of data points in which each column represents a variable on which statistical operations can be performed.

15.2 Anomaly handling and alerting

At this stage, we can replicate those outcomes. One of the simplest ways is to analyze the most severe incidents of the recent past (for example, the last 30 days) and build a list of telemetry that could have enabled earlier and faster detection and diagnosis of the problem, as well as easier and faster confirmation that an effective fix was in place.

For example, if an NGINX web server stops responding to requests, we can look at the key indicators that could have warned us earlier that we were beginning to deviate from standard operations, such as:

  • Application level – web page load times are increasing, etc.;
  • Operating system level - insufficient server idle memory, insufficient disk space, etc.;
  • Database level - database transaction processing time exceeds normal values, etc.;
  • Network level – the number of servers running behind the load balancer drops, etc.

All of the above indicators are potential precursors of a production incident. For each of them we set an alert so that, when they deviate far enough from the mean, the relevant people are notified and can take corrective action.

By repeating the above process for all weaker failure signals, we can detect problems earlier in the software's life cycle, thereby reducing the number of customer-impacting incidents. In other words, we must not only proactively prevent problems, but also detect and fix them more quickly.

15.3 Problems with non-Gaussian distributed telemetry data

Using the mean and standard deviation to detect anomalies is very practical. However, many of the telemetry data sets we work with in operations do not behave this way, and applying these techniques to them will not necessarily produce the expected results.

In other words, when the data in a data set does not follow the Gaussian (bell curve) distribution described above, properties based on the standard deviation do not apply. Suppose, for example, we are monitoring the number of file downloads per minute on a website and want to detect periods of unusually heavy downloading, perhaps so we can proactively add bandwidth when the download rate is more than three standard deviations above the mean.

Dr. Nicole Forsgren explains: "Many data sets in operations have what we call a 'chi-square' distribution. Using standard deviation on such data not only leads to over- or under-alerting, it also produces nonsensical results. When concurrent downloads fall three standard deviations below the mean, the result is a negative number, which obviously makes no sense."
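The point is easy to demonstrate with simulated right-skewed data (the numbers are synthetic, chosen only to mimic a chi-square-like shape, not real download telemetry):

```python
import numpy as np

rng = np.random.default_rng(0)
# Simulated downloads per minute: heavily right-skewed and never negative.
downloads = rng.chisquare(df=2, size=10_000) * 10

mean, std = downloads.mean(), downloads.std()
lower_bound = mean - 3 * std
print(f"mean={mean:.1f} std={std:.1f} lower 3-sigma bound={lower_bound:.1f}")
# The lower bound comes out negative, an impossible download count, which is why
# rule-of-thumb standard-deviation alerting misbehaves on data shaped like this.
```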

Netflix takes advantage of the fact that its consumers' browsing patterns are surprisingly consistent and predictable, although not Gaussian-distributed. Figure 15-4 reflects the number of customer requests per second during the entire work week, showing that customer browsing patterns are regular and consistent from Monday to Friday.
[Figure 15-4: Netflix customer requests per second across a work week]

15.4 Applying anomaly detection techniques

In situations where monitoring data does not have a Gaussian distribution, we can still use various methods to find differences worth noting. These techniques are broadly classified as anomaly detection, which is generally defined as "searching for data entries or events that do not fit expected patterns." Some of these features can be found within monitoring tools, while others may require the help of a statistics expert.

We employ a statistical technique called smoothing , which works particularly well with time series data, meaning that each data point has a timestamp (e.g., download events, completed transaction events, etc.). Smoothing techniques typically involve using a moving average (or rolling average), which transforms data by averaging each point against all other data in a sliding window. Doing so helps dampen short-term fluctuations and highlight long-term trends and cycles.
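A minimal sketch of a moving average using only NumPy (the window size and the synthetic series are illustrative):

```python
import numpy as np

def moving_average(series, window=30):
    """Sliding-window (moving) average: each output point is the mean of
    `window` consecutive observations, which damps short-term fluctuation
    and makes longer-term trends and cycles easier to see."""
    kernel = np.ones(window) / window
    return np.convolve(np.asarray(series, dtype=float), kernel, mode="valid")

# Hypothetical daily download counts with noise and a slow upward trend.
raw = np.random.default_rng(1).poisson(lam=200, size=120) + np.linspace(0, 50, 120)
smoothed = moving_average(raw, window=30)   # analogous to the 30-day line in Figure 15-6
```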

The effect of this smoothing technique is illustrated in Figure 15-6. The black line represents the raw data, while the gray line represents the 30-day moving average (i.e., the trailing 30-day average).
[Figure 15-6: Raw data (black) overlaid with its 30-day moving average (gray)]
We can expect much user-related telemetry to show cyclical or seasonal similarities: web traffic, retail transactions, movie viewing, and many other user behaviors have strikingly regular and predictable daily, weekly, and yearly patterns. This allows us to detect deviations from historical norms, such as a Tuesday afternoon order transaction rate dropping to 50% of the weekly average.
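One simple way to use this regularity, sketched under the assumption that we keep the values for the same weekly time slot from previous weeks, is to compare the current value with their mean and flag a large drop (the helper name and sample numbers are hypothetical):

```python
import numpy as np

def seasonal_deviation(current, history_same_slot, floor=0.5):
    """Compare the current value (e.g. this Tuesday 2 p.m. order rate) with the
    mean of the same weekly slot in prior weeks; flag it if below `floor` (50%)."""
    baseline = np.mean(history_same_slot)
    ratio = current / baseline
    return ratio, ratio < floor

# Hypothetical orders/minute on the last four Tuesdays at 2 p.m., and today's value.
ratio, alarm = seasonal_deviation(41, [92, 88, 95, 90])
print(f"today is running at {ratio:.0%} of the usual rate; alert={alarm}")
```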

Figure 15-7 shows the number of transactions per minute for an e-commerce website. Notice that transaction volume drops over the weekend each week. Visually, something unusual happened in the fourth week: Monday's transaction volume did not return to its normal level, which suggests an event worth investigating.
[Figure 15-7: Transactions per minute on an e-commerce site over several weeks]

15.5 Summary

This chapter explored several statistical techniques for analyzing production telemetry so that problems can be identified and resolved earlier, while they are still small, before they have catastrophic consequences. These techniques let us spot weak failure signals and act in time, resulting in a safer system of work and an improved ability to achieve our goals.

16. Apply feedback to enable safe deployment

We must integrate monitoring of production telemetry into deployment efforts, while also establishing a cultural norm that everyone has the same responsibility for the health of the entire value stream.

This chapter establishes feedback mechanisms that allow us to continuously improve the health of the value stream at every stage of the service life cycle, from product design through development and deployment, to operations and eventual retirement. This way we can guarantee that services are "production ready" even at the earliest stages of a project, while also learning from every release and every production problem and building that experience into future work, improving both safety and everyone's productivity.

16.1 Making deployments safer with telemetry

At this stage, we ensure that production metrics are actively monitored whenever anyone performs a production deployment. This allows whoever is deploying, whether a developer or an operations engineer, to quickly confirm that features are working as expected once the new release is running in production. After all, no code deployment or production change should be considered complete until the new version is running in production as intended.

As mentioned in Part III, our goal is to catch errors in the deployment pipeline before the software reaches production. Some errors will slip through, however, and for these we rely on production telemetry to restore service quickly. We can use a feature toggle to turn off the offending functionality (often the easiest and lowest-risk option, since it involves no production deployment), fix forward (i.e., change the code to repair the defect and push that change to production through the deployment pipeline), or roll back (for example, switch back to the previous version using feature toggles, or take the faulty servers out of rotation using blue-green deployment, canary releases, and so on).
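A minimal feature-toggle sketch; the in-memory flag store and function names are purely illustrative, since real systems usually read flags from a central store so they can be flipped without redeploying:

```python
# In-memory flag store for illustration; production systems usually read flags
# from a database or configuration service so they can change without a deploy.
FEATURE_FLAGS = {"new_checkout_flow": True}   # flip to False to disable the feature

def is_enabled(flag: str) -> bool:
    return FEATURE_FLAGS.get(flag, False)

def legacy_checkout(cart):
    return {"status": "ok", "path": "legacy"}   # known-good fallback

def new_checkout(cart):
    return {"status": "ok", "path": "new"}      # the newly released code path

def checkout(cart):
    # If the new path misbehaves in production, turning the flag off routes
    # traffic back to the legacy path without a production deployment.
    return new_checkout(cart) if is_enabled("new_checkout_flow") else legacy_checkout(cart)
```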

While fixing forward can be dangerous, it is actually very safe when we have automated testing, a fast deployment pipeline, and comprehensive telemetry that lets us quickly confirm that everything in production is working properly.

  • Forward recovery: In reliability design, a series of detection points is placed on processes that may fail, with each detection point providing as much fault information as possible, and one or more recovery points are set up ahead of the process in advance, so that after a fault the system can resume work from a predetermined state and still produce an acceptable output sequence. Forward recovery suits systems with tight time and space constraints but strict real-time requirements; because the computation is not fully redone, the resulting service is often functionally degraded. When multiple recovery points exist, the one with the smallest service loss can be chosen based on the information obtained from fault detection. Forward recovery therefore requires estimating the likely impact of the failure, choosing the point from which to continue running, and estimating the deviation in the final results and the degradation caused by the lost portion of service.
  • Backward recovery: In reliability design, a series of detection points is likewise placed on processes that may fail, with each detection point providing as much fault information as possible, and one or more recovery points are set up behind the process at each detection point. When a detection point detects a fault, the system resets to a recovery point the process passed before the fault occurred. The recovery point preserves the earlier state (data) of the process, so that once the fault is cleared, execution from the recovery point to the detection point can restart from that state and produce correct results. The advantage of this strategy is that recovery causes no functional degradation; the disadvantage is that it increases execution time and space.

16.2 Development and operations share on-call duties

Just because a product feature is marked "done" does not mean the business goal has been achieved. A feature is only truly "done" when it runs as designed in production, without causing major failures or unplanned operations or development work.

ITIL defines "guarantee" as the ability of a service to run reliably in a production environment for a predetermined period of time (e.g., two weeks) without intervention. Ideally, this definition of "guarantee" should be incorporated into the definition of "done."

This approach works for a variety of teams, including market-facing teams, teams responsible for developing and running features, and functionally oriented teams.

Regardless of how the team is organized, the basic principle remains the same: when developers get feedback on how their applications behave in production, including how defects get fixed, they move closer to the customer, and everyone in the value stream benefits.

16.3 Let developers track the downstream impact of their work

One of the most powerful techniques in interaction and UX design is the contextual interview: the product development team observes customers using the application in their natural environment, often at their desk. Such observation routinely uncovers the struggles customers face when using the product, for example needing many clicks to perform a simple everyday task, copying and pasting text between multiple screens, or writing information down on paper. These are all compensatory behaviors caused by the application's poor usability.

Our goal is to use this technique to observe the impact of our work on internal customers. Developers should track their work so they can see how downstream work centers interact with the products they develop in a production environment. This way we get feedback on the non-functional aspects of the code (including all elements not related to customer-facing functionality) and can find ways to improve the deployability, manageability, maintainability, etc. of the application .

Through user experience observation, we build quality in at the source and develop empathy for other team members in the value stream. Ideally, these observations help us define the application's non-functional requirements and put them into a shared backlog, so that eventually we proactively build them into every service we create. Doing so is an important part of building a DevOps culture.

16.4 Let developers manage production services themselves

Even when developers normally write and run code in production-like environments, a release can still turn into a catastrophic incident for the operations team, because it is the first time anyone sees how the code behaves under real production conditions. The reason is that operational learning happens too late in the software life cycle.

One of Google's practices is worth borrowing to solve this problem: development teams first manage the services they build in production, before those services become eligible for management by a centralized operations team. Having developers take responsibility for deployment and support in production makes it far more likely that the products they create will transition smoothly to the operations team.

In order to prevent problematic self-managed services from entering the production environment and bringing organizational risks, we can define the release requirements of the service. Only by meeting these requirements can the service interact with real customers and be exposed to production traffic. Additionally, to help product teams, operations engineers should serve as consultants to help them prepare services for deployment into production.

Establishing service launch guidelines helps ensure that the collective wisdom of the whole organization, especially the accumulated experience of the operations team, benefits every product development team. Launch guidance and requirements may include the following.

  • Defect count and severity: Does the application work as designed?
  • Type/frequency of alerts: Does the application generate too many alerts to be supported in a production environment?
  • Monitoring coverage: Is the scope of monitoring coverage large enough to provide enough information to restore failed services?
  • System architecture: Are services loosely coupled enough to support the high frequency of changes and deployments in a production environment?
  • Deployment process: Is the process of deploying code in production predictable, deterministic, and sufficiently automated?
  • Production hygiene: Is there evidence of good enough production habits that anyone else could provide production support for the service?

At this point, we may also want to determine whether the service is subject to any regulatory objectives now or in the future:

  • Does the service generate substantial revenue? (For example, if its revenue is more than 5% of the total revenue of a U.S. public company, then it is a "material account" and must comply with Section 404 of the Sarbanes-Oxley Act of 2002.)
  • Is the user traffic or downtime/damage cost to the service high? (i.e. Do operational issues result in availability or reputational risks?)
  • Does the Service store payment card holder information (such as credit card numbers) or personally identifiable information (such as Social Security numbers or patient care records)? Are there other security issues that could create regulatory, contractual obligation, privacy or reputational risks?
  • Are there any other regulatory or contractual requirements for the services, such as U.S. export regulations, PCI-DSS, HIPAA, etc.?

The above information will ensure that we effectively manage technical risks related to the service, as well as potential security and compliance risks. It also provides important input into the control and design of the production environment.

  1. By incorporating operability requirements from the earliest stages of development and letting developers initially manage their own applications and services, the process of releasing new services into production becomes smoother, easier, and more predictable. For services already in production, however, we need another mechanism to ensure that operations never gets stuck with an unsupportable service. This is especially important in functionally oriented operations organizations.
  2. At this stage, we can establish a service hand-back mechanism. In other words, when a production service becomes sufficiently fragile, operations can hand responsibility for supporting it back to development.
  3. When a service becomes managed by developers, the role of operations changes from providing production support to being a consultant to the development department, helping the development team make the service production-ready again.
  4. This mechanism acts like a pressure relief valve for operations personnel, ensuring that operations teams don't end up in a situation where they have to manage fragile services while technical debt accumulates and local problems become global problems. This mechanism also helps ensure that the operations department has sufficient capabilities to carry out improvement work and preventive projects.

Case:
Google established two sets of safety checks for two critical stages of releasing new services: the Launch Readiness Review (LRR) and the Hand-Off Readiness Review (HRR).

The LRR must be performed and signed off before any new service is exposed to customers and receives production traffic, while the HRR is performed when the service transitions to an operations-managed state, usually several months after the LRR. The LRR and HRR checklists are similar, but the HRR is more stringent and has higher acceptance standards, whereas the LRR is performed and self-reported by the product teams.

LRR and HRR review checklists are one way to build organizational memory.

16.5 Summary

This chapter discussed the feedback mechanisms that allow us to improve services at every stage of daily work, whether by deploying changes into production and paging engineers to fix code when problems arise, having developers follow their work downstream, establishing non-functional requirements that help development teams write more production-ready code, or handing problematic services back to the development team to self-manage.

By creating these feedback loops, we make production deployments safer, improve the production readiness of the code we build, and establish better working relationships between development and operations by reinforcing shared goals, shared responsibility, and empathy.


Origin blog.csdn.net/u010230019/article/details/132800537