Architect's Diary - Software High Availability in Practice

Author: JD Retail Liu Huiqing

1. Introduction

High availability is a perennial topic in software. The term "high availability" (HA) usually describes a system that has been specially designed to reduce downtime and keep its services continuously available. It is calculated as: availability = (total time - unavailable time) / total time.
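As a minimal sketch of the formula above, the following Python function computes availability from a total period and the downtime within it. The function name, the choice of minutes as the unit, and the sample figures are illustrative assumptions, not part of the original text.

```python
def availability(total_minutes: float, unavailable_minutes: float) -> float:
    """Return availability as a fraction between 0 and 1."""
    if total_minutes <= 0:
        raise ValueError("total_minutes must be positive")
    if not 0 <= unavailable_minutes <= total_minutes:
        raise ValueError("unavailable_minutes must be between 0 and total_minutes")
    return (total_minutes - unavailable_minutes) / total_minutes


# Hypothetical example: a 30-day month with 43.2 minutes of downtime
month_minutes = 30 * 24 * 60
rate = availability(month_minutes, 43.2)
print(f"{rate:.3%}")
```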

This article takes implementation practice as its starting point and walks through the steps and details of achieving high availability from three angles: collaboration efficiency, technical implementation, and operating standards. To make the discussion easier to follow, let's first agree on vocabulary by looking at the stages of the software delivery process, as shown in the following figure:

Why is high availability of software so challenging?

◦From the perspective of the requirement delivery chain, delivering on target requires close cooperation among many stakeholders: product, R&D, testing, operations and maintenance, and business operations. Some projects involve hundreds of collaborators. Everyone has different responsibilities, yet they cooperate with and depend on one another, so a mistake at any step can affect availability;

◦From the perspective of time, an annual availability of 99.99% means the allowable downtime in a year is 365*24*60*(100%-99.99%) ≈ 52.6 minutes. At five nines (99.999%), the allowable downtime is only about 5.3 minutes, which is roughly the time it takes to restart the application after we discover a problem;

◦From the perspective of iteration efficiency, if nothing is iterated and nothing is released, problems are far less likely. Iteration speed and availability are negatively correlated, and balancing the two is a considerable challenge.
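The downtime arithmetic in the list above can be reproduced with a few lines of Python. This is only a sketch: the function name is made up for illustration, and a year is taken as 365 days, as in the formula in the text.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def downtime_budget_minutes(target: float, period_minutes: float = MINUTES_PER_YEAR) -> float:
    """Return the downtime allowed in the period for a target availability (0 < target <= 1)."""
    if not 0 < target <= 1:
        raise ValueError("target must be in the range (0, 1]")
    return period_minutes * (1 - target)


for label, target in [("99.9%", 0.999), ("99.99%", 0.9999), ("99.999%", 0.99999)]:
    print(f"{label}: {downtime_budget_minutes(target):.2f} minutes per year")
```

Each additional nine cuts the budget by a factor of ten, which is why five nines leaves barely enough time for a single restart.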

To sum up, the specific problems we face are as follows:

◦How do we cope with the many collaborators and the long chain involved in requirement delivery?

◦How do we deal with the low tolerance for downtime?

◦How do we keep availability from taking a major hit while requirements iterate frequently?

2. Collaboration Efficiency Guarantee

A common misconception

Looking at the whole requirement delivery chain, we find that as the chain grows, the information path branches more and the transmission hierarchy gets deeper. This causes two problems:

1. The efficiency of information transmission is reduced;

2. Information accuracy becomes worse.

The final result of these two problems is the reduction of collaboration efficiency.

People without practical experience often assume that adding more people will speed up requirement delivery. This is not entirely correct. For the actual relationship, refer to the following figure:

It is like constructing a building. If one person working step by step needs 100 days to finish, can 100 people finish it in 1 day? The answer is no.

There are collaboration costs, such as getting the team to work together (designers, bricklayers, masons, plumbers), matching people to jobs, and controlling risk;

There are process dependencies: for example, construction depends on design, and soft furnishing always comes after hard decoration;

There are budget constraints, such as the talent ladder and the scale of the whole organization (contractors, agents);

None of these can be solved simply by piling on manpower.

Process specification

The underlying logic of improving collaboration efficiency is to reduce the number of delivery layers and shorten the information path, thereby preserving both the accuracy and the speed of information transfer. (Organizational design is not covered here.)

This requires the discipline of finishing today's work today. At the organizational level this is called process specification; at the personal level it is called working method and sense of responsibility.

Avoid pushing unfinished work to the next stage. Otherwise it disrupts the scheduling and delivery of later stages and, in extreme cases, forces rework. In short, think things through and do not leave traps for others. Whether it is product requirements handed to R&D, R&D designs handed to testing, or test cases handed back to product, the deliverable at every handoff must be reliable.

3. Technical Implementation Guarantee

Within the requirement response cycle, high-quality execution of each production stage (architecture design, coding implementation, safe release, deployment and operation) is the premise and foundation of highly available software.

Architecture design

Architecture design largely determines the system's up-front implementation cost (ROI) and the difficulty of later operation and maintenance. It is the top-level design of the software and covers both the macro-level solution and the paradigm constraints on implementation details.

• Process guarantee

Involve architects: having architects take part in core transaction nodes and major requirement changes is the most direct and effective way to avoid pitfalls;

Take design documents seriously: a clearly described solution that the relevant stakeholders have approved is the prerequisite for staying on the right path.

• Design Guarantee

Disaster recovery design: leave yourself a way out by thinking it through in advance. The system should support rollback, circuit breaking, retry, and degradation.

Robust design: stateless design, duplicate-request prevention, idempotent design, and data consistency design.
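To make duplicate prevention and idempotent design more concrete, here is a minimal in-memory sketch. It assumes each request carries a unique request ID; the class name, the ID format, and the in-process dictionary are all illustrative assumptions. A real service would more likely rely on a database unique key or a shared cache.

```python
import threading
from typing import Callable, Dict


class IdempotentExecutor:
    """Runs an operation at most once per request ID and replays the stored result."""

    def __init__(self) -> None:
        self._results: Dict[str, object] = {}
        self._lock = threading.Lock()

    def execute(self, request_id: str, operation: Callable[[], object]) -> object:
        # Holding the lock while the operation runs keeps the sketch simple,
        # at the cost of serializing all requests.
        with self._lock:
            if request_id in self._results:
                return self._results[request_id]  # duplicate: replay, do not re-run
            result = operation()  # if this raises, nothing is stored and a retry can run it again
            self._results[request_id] = result
            return result


balance = {"amount": 100}


def deduct_10() -> int:
    balance["amount"] -= 10
    return balance["amount"]


executor = IdempotentExecutor()
first = executor.execute("order-1001", deduct_10)
retry = executor.execute("order-1001", deduct_10)  # same request delivered twice
print(first, retry, balance["amount"])
```

The retried request returns the stored result, so the balance is deducted only once.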

Coding implementation

If architecture design is the skeleton, coding implementation is the nerves, blood vessels, and muscles. The former determines how steadily and how long you can walk, while the latter determines how fast and how far you can go. At the coding level, this shows up as how much the code ages and decays.

• Process specification

Code review mechanism: code review is more than finding problems in the system. It is a long-term practice and a vehicle for putting organizational culture into practice and passing it on. During review, the team clarifies the boundaries of business responsibilities, builds consensus on design and coding, and aligns on a high standard for R&D work. It amounts to giving specific guidance through concrete cases, and these are the cornerstones of a team's effectiveness.

Many problems in the R&D process can be discovered and resolved through the code review mechanism, such as:

◦ How should the design and implementation of ad hoc requirements be handled?

◦What do you make of the N different ways of writing "Hello World!"?

◦ How to understand design patterns and the boundaries of over-engineering?

◦How to evaluate the deliverables of the current stage?

◦ Is it necessary to introduce unit tests?

• Coding Standards

◦Is there any error handling? For external services called, are return values checked or exceptions handled?

◦ Does the design follow a known design pattern or a pattern commonly used in the project?

◦Can the new code written by the developer be implemented with the functions in the existing SDK/Framework? Is there a similar function in this project that I can call without reimplementing it all?

◦Does the project pull in unused or duplicated functions, or multiple versions of the same jar dependency? (json libraries, various utils)

◦Is there any useless code that can be removed?

◦ How readable is the code? Are there enough comments?

◦Is there any error in parameter passing, and is there any assertion (Assert) or judgment to ensure that the conditions we think are invariant are really met?

◦ How are boundary conditions handled? How is the default branch of the switch statement handled? Is it possible for the loop to have an infinite loop?

◦Where are resources acquired and released? Are there possible resource leaks (including timeouts, memory, files, object references, large objects, number of threads, etc.)? Is there room for optimization?

◦ How efficient is the code? What is the worst case?

◦In the code, especially in the loop, is there any obvious part that can be optimized (can string operations be optimized with StringBuilder)?

◦ Will calls to the system and network time out? How to deal with it?

◦Is the code easy to test (the number of method lines, cyclomatic complexity, and whether the definition of input and output parameters is reasonable)?

◦Does the change affect the old version, historical data, and upstream compatibility?

◦Does the interface design consider idempotence, concurrency, unauthorized access, and degradation?

◦Are there caching, database performance issues, and data consistency issues from multiple data sources?

◦Does the release plan account for gray release and for inconsistent data states during the rollout?
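Several items on the checklist above (error handling for external calls, checking return values, handling timeouts) can be illustrated together. The sketch below is one possible shape, not a prescribed implementation: the function name, the `{"code": 0, ...}` response convention, and the fallback value are assumptions made for the example.

```python
import concurrent.futures
import logging
import time
from typing import Callable, Optional

logger = logging.getLogger(__name__)
_pool = concurrent.futures.ThreadPoolExecutor(max_workers=4)


def call_external(fetch: Callable[[], Optional[dict]], timeout_s: float, fallback: dict) -> dict:
    """Call an external service with a timeout, check its return value, and degrade on failure."""
    future = _pool.submit(fetch)
    try:
        response = future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        logger.warning("external call timed out after %.2fs", timeout_s)
        future.cancel()  # only stops work that has not started yet
        return fallback
    except Exception:
        logger.exception("external call failed")
        return fallback
    # Hypothetical convention: a dict whose "code" field is 0 means success.
    if not isinstance(response, dict) or response.get("code") != 0:
        logger.warning("unexpected response: %r", response)
        return fallback
    return response


FALLBACK = {"code": 0, "data": [], "degraded": True}


def slow_service() -> dict:
    time.sleep(0.3)  # simulates a dependency that answers too late
    return {"code": 0, "data": [1, 2, 3]}


print(call_external(slow_service, timeout_s=0.05, fallback=FALLBACK))
```

Note that a thread-based timeout only stops the caller from waiting. The slow call itself keeps running in the background, so the underlying client should also have its own timeout.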

Safe release

70% of production incidents are triggered by some kind of change, and a considerable share of those are caused by non-standard release practices. Releasing safely is therefore very important.

• Process specification

◦No frequent releases: for example, no more than 2 releases per week;

◦No releases during peak hours: this limits the scope of any problem;

◦No unauthorized releases: every change must pass test verification and product acceptance sign-off;

• Operating procedure

◦Remove traffic: select the first batch of machines and take them out of rotation via jsf offline/np (selected as cold standby);

◦Check the logs: confirm from the logs that the removed machines receive no traffic;

◦Warm up the service: confirm that the machine has started successfully; core business interfaces need to be warmed up;

◦Restore traffic: put the released machines back into rotation;

◦Check the metrics: observe whether the mdc metrics of the released machines (cpu, memory, load) are abnormal and whether the logs show errors.
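The release steps above can be sketched as a small script. Everything here is a stand-in: the `Machine` class and its methods are hypothetical in-memory placeholders for whatever the real platform exposes, and the explicit deploy step is an assumption, since the list implies it without naming it.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Machine:
    """In-memory stand-in for a real host; every method here is hypothetical."""
    name: str
    in_rotation: bool = True
    version: str = "v1"
    warmed_up: bool = False
    log: List[str] = field(default_factory=list)

    def remove_traffic(self) -> None:
        self.in_rotation = False
        self.log.append("remove_traffic")

    def has_traffic(self) -> bool:
        return self.in_rotation

    def deploy(self, version: str) -> None:
        self.version = version
        self.warmed_up = False
        self.log.append(f"deploy {version}")

    def warm_up(self) -> None:
        self.warmed_up = True
        self.log.append("warm_up")

    def restore_traffic(self) -> None:
        self.in_rotation = True
        self.log.append("restore_traffic")

    def metrics_healthy(self) -> bool:
        return True  # a real check would inspect cpu, memory, load and logs


def release_batch(machines: List[Machine], version: str) -> None:
    """Release one batch, stopping at the first machine that fails a check."""
    for machine in machines:
        machine.remove_traffic()            # step 1: remove traffic
        if machine.has_traffic():           # step 2: confirm no traffic remains
            raise RuntimeError(f"{machine.name} still receives traffic")
        machine.deploy(version)             # assumed step: install the new version
        machine.warm_up()                   # step 3: warm up core interfaces
        machine.restore_traffic()           # step 4: restore traffic
        if not machine.metrics_healthy():   # step 5: check metrics and logs
            machine.remove_traffic()
            raise RuntimeError(f"{machine.name} is unhealthy after release")


batch = [Machine("host-a"), Machine("host-b")]
release_batch(batch, "v2")
print([(m.name, m.version, m.in_rotation) for m in batch])
```

Stopping at the first failed check keeps a bad release confined to a single machine in the first batch.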

Deployment and operation

Capacity redundancy is one of the most important means of achieving high availability. The directions and ideas are outlined below; the specific implementation details and strategies can be extended to fit each situation.

• Network

◦Carrier level: China Unicom, China Telecom, China Mobile, etc.;

◦Link nodes: VIP, CDN, router/switch, reverse proxy, client, browser, etc.;

• Storage

◦Whether it is a database master-slave architecture or the replica architecture of ES, both are means of achieving highly available storage, and important data must make good use of these features;

◦When designing the data structure, it is also necessary to plan the traffic-splitting strategy, capacity, and data splitting or heterogeneous storage. For example, avoid cache hot keys, throughput bottlenecks on a single database table, database connection limits, and other issues that undermine high availability.
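Two of the storage ideas above, splitting data across shards and spreading a hot cache key, can be sketched in a few lines. The function names, the key formats, and the `#n` suffix scheme are illustrative assumptions. Real systems often use consistent hashing or the routing built into their storage middleware instead.

```python
import hashlib


def shard_for(key: str, shard_count: int) -> int:
    """Map a key to a shard index using a hash that is stable across processes."""
    if shard_count <= 0:
        raise ValueError("shard_count must be positive")
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


def spread_hot_key(key: str, copies: int, request_id: int) -> str:
    """Spread reads of one hot cache key over several suffixed copies."""
    if copies <= 0:
        raise ValueError("copies must be positive")
    return f"{key}#{request_id % copies}"


print(shard_for("order:1001", 8))
print(spread_hot_key("sku:42", copies=4, request_id=10))
```

Python's built-in `hash()` is randomized per process for strings, which is why the sketch uses `hashlib` for routing decisions that must stay stable.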

• Service

◦Horizontal scaling: it is very important that a service can increase its capacity simply by adding resources;

◦Service grouping: services are isolated at different granularities by business party or usage scenario, so that extreme situations in one group do not affect the others;

◦Extreme-case strategy: this is mainly a defense for extreme abnormal situations. The purpose is to keep the service as reliable as possible after an incident occurs. Examples: rate limiting, circuit breaking, retry, fail-fast, etc.;

◦Gray release strategy: problems are most likely to occur when new features are launched. Mature gray-scale traffic control is the key to limiting the scope of problems;
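As one illustration of the extreme-case strategies listed above, here is a minimal circuit breaker that fails fast and degrades to a fallback. It is a single-threaded sketch under simplifying assumptions: the class name, thresholds, and injectable clock are choices made for the example, and production code would normally use an established library.

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    """Opens after consecutive failures, fails fast while open, retries after a cool-down."""

    def __init__(self, failure_threshold: int = 3, reset_timeout_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None  # None means the breaker is closed

    def call(self, operation: Callable[[], object], fallback: Callable[[], object]) -> object:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_timeout_s:
                return fallback()  # open: fail fast without touching the dependency
            # cool-down elapsed: half-open, let this one trial call through
        try:
            result = operation()
        except Exception:
            self._failures += 1
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()  # open, or re-open after a failed trial
            return fallback()
        self._failures = 0
        self._opened_at = None  # success closes the breaker
        return result
```

A caller wraps each dependency call, for example `breaker.call(query_inventory, lambda: cached_inventory)`, where both names are hypothetical. While the breaker is open, the dependency gets time to recover and callers receive the degraded result immediately.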

4. Operational Standards Guarantee

Operating Specifications

1. Monitorable: the running status of the system can be observed;

2. Alertable: abnormal situations trigger notifications to the people responsible for the system;

3. Locatable: after a problem occurs, its cause can be located quickly;

4. Repairable: when an abnormal situation occurs, the problem can be fixed right away.
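The first two requirements, monitoring and alerting, can be illustrated with a simple threshold check. The metric names and limits below are hypothetical. A real setup would read them from the monitoring platform and route alerts to an on-call channel.

```python
from typing import Dict, List

# Hypothetical thresholds; real limits depend on the service and its hardware.
THRESHOLDS: Dict[str, float] = {
    "cpu_percent": 80.0,
    "memory_percent": 85.0,
    "load_1m": 8.0,
}


def check_metrics(metrics: Dict[str, float],
                  thresholds: Dict[str, float] = THRESHOLDS) -> List[str]:
    """Return one alert message per metric that is missing or above its limit."""
    alerts: List[str] = []
    for name, limit in thresholds.items():
        value = metrics.get(name)
        if value is None:
            alerts.append(f"{name}: no data")  # missing data is itself abnormal
        elif value > limit:
            alerts.append(f"{name}: {value} exceeds limit {limit}")
    return alerts


sample = {"cpu_percent": 95.0, "memory_percent": 60.0}
for alert in check_metrics(sample):
    print("ALERT", alert)
```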

Emergency plan

High availability means low tolerance for downtime: there is no time for leisurely troubleshooting and repair, and no time to open the code and hunt for defects. This requires a complete set of emergency plans that can handle most foreseeable failures.

• Process specification

◦ Restore production first;

◦Troubleshoot second;

For the detailed accident emergency handling manual, please refer to the following figure:

• Process specification

◦Develop plans along three dimensions (network, service, and storage) and record them in an emergency plan list (file name: checklist) kept in your own code repository, so that the content is handed down and kept up to date;

◦Predictability: the scenario that triggers the problem must be spelled out. Example: at the current growth rate (10,000/day), as the data volume grows, slow queries are expected to appear on the database table (xxx table name) after about 10 months;

◦Actionability: the plan must contain a solution that actually eliminates the problem. Example: start the historical data archiving task (xxxWorker) and move the historical data to the archive database;
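The predictability example above is a simple capacity projection, which can be written down so the estimate is reproducible. Only the growth rate of 10,000 per day comes from the text; the current row count and the slow-query threshold below are hypothetical values chosen for illustration.

```python
import math


def days_until_threshold(current_rows: int, rows_per_day: int, threshold_rows: int) -> int:
    """Return the number of days until a table reaches the given row threshold."""
    if rows_per_day <= 0:
        raise ValueError("rows_per_day must be positive")
    remaining = threshold_rows - current_rows
    if remaining <= 0:
        return 0  # the threshold has already been reached
    return math.ceil(remaining / rows_per_day)


# Hypothetical table: 2 million rows today, slow queries expected around 5 million.
days = days_until_threshold(current_rows=2_000_000, rows_per_day=10_000,
                            threshold_rows=5_000_000)
print(days, "days, roughly", round(days / 30), "months")
```

Recording the inputs next to the plan in the checklist makes it easy to rerun the estimate when the growth rate changes.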

Standard compliance

However good the processes and standards are, they need a mechanism that puts them into practice. Otherwise they are like flowers in a mirror or the moon in water: beautiful to look at but useless. Being executable and measurable is the prerequisite for improving toward the goal. A tool called the "High Availability and Compliance Periodic Self-inspection Table" is therefore used to help implement the standards.

5. Summary

This article discussed why high availability is so challenging, emphasized the importance of collaboration efficiency in requirement delivery, and explained why the principle of "finish today's work today" matters. It then described the guidelines and implementation details of technical implementation guarantees across architecture design, coding implementation, safe release, and deployment and operation. Finally, from the perspective of post-release operation, it presented practical tools such as emergency plans and a periodic self-inspection table. I hope it helps readers.
