I. Introduction

This article reviews and summarizes how Jingdong Daojia built its automated testing system. The chapters on system design and practice have been removed, and the material on organization and culture has been kept and compiled into this document.

Let's discuss the problems and challenges often faced in the work from the perspective of QA (Quality Assurance).

When it comes to software quality, you may have run into puzzles like the following:

The Western-medicine habit of "treating the head when the head aches and the foot when the foot hurts" often fails in an R&D team, while the holistic diagnosis and treatment of Chinese medicine is often the better way to solve problems. The root of the difference lies in the dimension of thinking and the angle of observation. For example, human travel was not transformed by breeding better and stronger horses, but by the invention of the bicycle and the automobile. Another example, often told as a joke, is that what took market share from instant noodles was not a rival brand of instant noodles but the rise of takeout. Both examples tell us that problems are often easier to locate and solve when viewed from a higher-dimensional perspective. Let's first look at the delivery process for R&D requirements and the people and interaction stages it involves, as shown in the following figure:

Take, for example, a high rate of defects escaping testing, which leads to frequent production incidents. Viewed across the entire requirement delivery process, the rhythm of a product iteration is as follows:

So what causes the high escape rate?

The cause may lie in product design, in R&D implementation, in test case verification, in a process that leaves information asymmetric, in the rhythm of teamwork, or in technical reserves. These causes fall into three dimensions: organizational support, technical practice, and cultural consensus. Below we focus on two of them, organizational support and cultural consensus.

II. Conway's Law

2.1 Law Interpretation

Amazon founder Jeff Bezos has his own answer to the problem of meeting efficiency. He calls it the "two-pizza rule": a meeting should never have so many people that two pizzas cannot feed them.

Any discussion of organizational structure cannot avoid what is often called the first law of software architecture design, Conway's law: "Organizations which design systems are constrained to produce designs which are copies of the communication structures of these organizations."

Both reflect the same reality: horizontal communication is very expensive. The same holds at the technical level, because the interfaces between the modules of a system mirror the information flow and cooperation between the teams that own them.

For an intuitive picture, we borrow a widely circulated diagram of organizational structures:

In fact, readers of Conway's original paper have distilled it into the following four laws:

The first law: Organizational communication is expressed through system design.

Organizational communication and system design are closely linked, especially for complex systems. Only by solving the communication between people can we have a better system design.

The second law: There is never enough time to do something right, but there is always enough time to do it over.

The agile development model embodies this law well: continuous iteration, continuous delivery, rapid verification and feedback, and continuous improvement. In a word: done is better than perfect!

Before a system is actually put into production, its architecture is only a hypothesis, however good it looks. The later a product reaches users, the higher the cost and risk of failure. Moving in small steps, running quick experiments with an MVP (minimum viable product), gathering customer feedback, and evolving the product iteratively all reduce that cost and risk effectively and help avoid over-design.

The third law: There is a homomorphism from the linear graph of a system to the linear graph of its design organization.

The system design you want determines the team you need to build, so flatten the organization where you can. It is best to divide teams along business lines, so that each team is naturally autonomous and cohesive. A clear business boundary reduces the cost of communicating with the outside world, and each small team is responsible for the entire life cycle of its own module. This is the first law made concrete.

There is a mapping between the organization and the system architecture (one-to-one). If the two are not aligned, all kinds of problems arise. On the one hand, if your organizational structure and culture (democratic and cooperative, centralized, law of the jungle, talent density) do not support it, you cannot build an efficient system architecture. For example, a centralized enterprise with strict functional divisions (business, development, testing, deployment, operations) will find it hard to adopt microservices and DevOps, and hard to roll out a Docker/PaaS platform. Such functional organizations tend toward local optimization and cannot form effective cooperation or closed loops.

The reverse is also true: if your system design or architecture does not support it, you will not be able to build an effective organization.

The fourth law: The structures of large systems tend to disintegrate during development, qualitatively more so than with small systems.

The more complex the system, the more people it needs, and the more people there are, the faster communication cost grows: n people have n(n-1)/2 possible pairwise communication paths, so the cost rises far faster than headcount. Divide and conquer is the solution most companies choose: split the work across levels and small teams, let each team govern itself internally, and then communicate with the outside.
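The effect of dividing a large group into small teams can be sketched with a few lines of Python. This is only a rough model: it counts possible pairwise communication paths, and it assumes (as a simplification that is not from the article) that each small team talks to the other teams through a single point of contact. The function name `communication_paths` is made up for this illustration.

```python
def communication_paths(people: int) -> int:
    """Number of possible pairwise communication paths among `people`."""
    if people < 0:
        raise ValueError("people must not be negative")
    return people * (people - 1) // 2


# One flat group of 24 people.
flat = communication_paths(24)

# The same 24 people split into 4 teams of 6. Paths inside each team,
# plus paths between the 4 team contacts (simplifying assumption).
split = 4 * communication_paths(6) + communication_paths(4)

print(f"flat group of 24: {flat} paths")        # 276
print(f"4 teams of 6:     {split} paths")       # 66
```

Under these assumptions the split organization has roughly a quarter of the communication paths of the flat one, which is the intuition behind self-governing small teams.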

At its core, Conway's law is about how to improve the efficiency of organizational collaboration. For a more concrete picture, we can borrow from nature. The most typical example is the structure of a tree: root -> trunk -> branch -> leaf. The roots send water and nutrients up to the leaves through the wood of the trunk, while the carbohydrates produced by photosynthesis in the leaves travel downward through the inner bark. The division of labor and cooperation in this living organism is in line with Conway's law. As a data structure, the tree also tends to offer well-balanced performance. Making good use of tree structures therefore benefits organizational communication and collaboration as well as system architecture. It is worth re-examining our own organizational structure, since it may be one of the culprits that make problems more complicated than they need to be.

2.2 Practical cases

The Jingdong Daojia coupon system once went through a major redesign and refactoring. The business background was the O2O model. After a period of exploration, we settled on supermarket and fresh-food retail as the business focus, and promotions for that business followed one after another: new users of the platform, new customers of a merchant, first-order customers, and other dimensions, combined with restrictions by channel, city, merchant, store, brand, product, and so on. The promotion mechanics were flexible and ever-changing, and a large number of promotion requirements piled up in the demand pool.

At that time, three people on the team supported business iteration on the coupon system, and the way of working can be described as silo-style ("chimney") development, with the architecture out of control. The logic ran like this: to support business needs quickly while limiting the impact on existing service logic, each new requirement was designed from scratch. The more this happened, the less cohesion and convergence could be achieved in the number and structure of components, from the underlying data structures through the intermediate basic services to the business aggregation layer on top, and some business scenarios became hard to support. So we refactored the system. Throughout, a single tester supported the testing of the entire refactoring effort. The organizational structure at that time was as follows:

As the business grew rapidly, demands for new promotions kept arriving. Above all, promotion mechanics kept changing, and each business team had its own requirements. One business team, for example, asked for order rebates, shareable coupons, and bargaining coupons, while the user growth team wanted targeted push coupons, scan-to-claim coupons for offline promotion, and newcomer red envelopes. The business side was multi-line, but our R&D team and system were single-line, which made project communication and coordination very costly. Because R&D resources were limited, many requirements sat in the demand pool for a long time without a timely response. Even doubling the number of developers could not fundamentally solve the problem, and the efficiency of requirement delivery became a business pain point.

Next, we made substantial adjustments to the organizational structure. First, at the business level, we closed the loop between business and R&D inside the user growth department and the user retention department respectively: each of the two business departments set up its own R&D team, forming a closed loop within its domain. Second, at the level of technical architecture, we split the existing coupon service systematically. Functions inside the core coupon life cycle were centralized and offered to all departments as basic services. Functions outside the core life cycle, such as coupon issuing thresholds, delivery channels, and other promotion mechanics, were left for each business team to handle internally. This eliminated the many-to-one collaboration model between business and technology, and brought clear gains in team communication and collaboration efficiency.

From business through to R&D, the organization and the system were now split along matching lines, but we did not adjust the test team's staffing at the same time. As a result, test quality declined and test efficiency dropped noticeably: with the reshaping of the organization and system architecture, business and R&D had achieved an internal closed loop of communication and collaboration, but the testers had not. Our next step was to realign the test team with the organization of the R&D teams, completing the structural adjustment of the entire requirement delivery chain.

III. Organizational Culture

3.1 Team cognition

It will take some time for members to overcome individual differences, cooperate tacitly, trust each other, and form a truly cohesive team. It may take 6 months or even 1 year. Once the cohesion is really formed, the members of the team will make plans together, face problems together, and get everything done together. Once a team has cohesion, it is ridiculous to disband such a team because the project is over. The best thing to do is not break up the team and keep them working together, just keep assigning them new projects.

Some newly established software outsourcing companies try to build teams around projects. This is an unwise approach. With this approach, the team can never form a cohesive force. Everyone is only on the project for a short period of time, working on it only part of the time, so they never learn how to play well together.

Professional development organizations assign projects to cohesive teams rather than building teams around projects. A cohesive team that can undertake multiple projects at the same time, distribute work according to the individual wishes, skills and abilities of the members, will successfully complete the project.

Teams are harder to build than projects. It is therefore good practice to form a stable team and let it move and work together as a whole from project to project; the team can also take on several projects at once. When forming a team, give it enough time to gel and keep its members working together, so that it becomes a powerful engine for the continuous delivery of projects.

How do we build a team's cohesion? An appropriate value orientation must be set in light of the business scenario. For an ordinary R&D team, the following areas may need investment:

1. Identify team bottlenecks, shore up the weakest link (the shortest stave of the barrel), and improve resource utilization;

2. Shorten delivery cycle and increase throughput;

3. Accurate cycle estimation and precise rhythm control;

If the team's output falls short of expectations, we must identify the cause: are the goals misaligned, is the process not standardized, are the technical reserves thin, or is the infrastructure weak? What matters most is good insight and execution. Finding the problem is half of solving it.

1. Misaligned goals: make information transparent and define clear metrics;

2. Non-standard process: manage the process, for example by adopting an agile development model;

3. Thin technical reserves: deconstruct -> observe -> benchmark -> learn -> reconstruct;

4. Weak infrastructure: make good use of existing tools (CI/CD) or build your own.

The hardest part is decision-making and execution. Implementation must also take the actual environment and team culture into account, since these are the yardstick and basis for rolling a system out quickly from the top down. For example, how should we approve a requirement as a project, and how should we carry the project out?

This requires identifying the basis for our decisions and the method of execution:

1. Do the right thing (value-driven, the basis for decisions): focus on ROI and priority;

2. Do things right (rule-driven, the method of execution): focus on rules, methods, and building a quality and efficiency system.

3.2 Problem cognition

To improve the quality of software delivery, we need to grasp the essence of the problem. How do we find it? The core is to keep asking "why" a few more times. This borrows from the Six Degrees of Separation theory: "There will be no more than six people between you and any stranger; that is, you can reach any stranger through at most six people." In the same way, a handful of successive questions is usually enough to get from a vague goal to something concrete.

What does high-quality software delivery look like?

The answer might be: fewer defects and more efficient delivery.

Breaking it down further: how are "fewer defects" and "delivery efficiency" measured?

By the number of bugs per thousand lines of code and the number of story points shipped per iteration.

How do we count bugs per thousand lines and story points shipped per iteration?

You can use issue-tracking software such as Jira.

What if the third-party software cannot meet my needs and I want more?

You can build your own CI/CD tooling...

So where does the questioning end?

Break the problem down until it reaches at least a measurable granularity. For software delivery quality, the problem may be broken down into the following dimensions:

Combined with the PDCA (Plan-Do-Check-Act) cycle, we can abstract the entire measurement system into the following process:

After the breakdown, the actionable tasks are as follows:

1. Improve code quality: metrics (defects per thousand lines of code, cyclomatic complexity) / tool support (static scanning) / service splitting / process assurance (technical review);

2. Strengthen process control: test hand-off process / release process (quality gates, gray release, etc.) / metrics (test pass rate);

3. Improve test coverage: metrics (interface coverage / code coverage / automation coverage / defect analysis), etc.
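Several of the metrics named above are simple ratios once the raw counts are available. The following Python sketch shows one plausible way to compute three of them. The function names and the sample numbers are invented for illustration and are not measurements or tooling from the article.

```python
def defects_per_kloc(defects: int, lines_of_code: int) -> float:
    """Defect density: defects found per thousand lines of code."""
    if lines_of_code <= 0:
        raise ValueError("lines_of_code must be positive")
    return defects / (lines_of_code / 1000)


def story_points_per_iteration(points_by_iteration: list) -> float:
    """Throughput: average story points shipped per iteration."""
    if not points_by_iteration:
        raise ValueError("need at least one iteration")
    return sum(points_by_iteration) / len(points_by_iteration)


def pass_rate(passed: int, executed: int) -> float:
    """Share of executed test cases that passed."""
    if executed <= 0:
        raise ValueError("executed must be positive")
    return passed / executed


# Illustrative numbers only.
density = defects_per_kloc(defects=18, lines_of_code=12_000)   # 1.5
throughput = story_points_per_iteration([21, 34, 29])          # 28.0
rate = pass_rate(passed=190, executed=200)                     # 0.95
```

Tracking the same ratios iteration after iteration is what turns them into the "Check" step of a PDCA loop; a single snapshot says little on its own.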

Finally, following the software delivery process, the other core stages (product, development, testing, and operations) are broken down in the same way, and together they form a systematic solution, as shown in the figure below:

3.3 Knowledge Empowerment

In a team that can really deliver, tacit understanding and consensus are the foundation. To reach consensus, the team needs a good mechanism for feeding back and resolving problems, and should use those occasions to develop its people. Two common questions, how to do unit testing and how to estimate a schedule, are good probes of a team's consensus.

3.3.1 How to unit test?

Software testability is closely tied to the development process, yet in reality developers are reluctant to write unit tests, for many reasons: there are too many methods and branches, so the unit tests take even more code than the business logic and there is not enough time; many methods depend on context that has to be mocked before they can run; and unit testing as a whole seems very hard to put into practice.

In many companies, verifying the correctness of the code is left to software testers. Handing work over inevitably adds communication and collaboration cost. If the software is not iterated frequently, the problem is small. But if the software is released in regular cycles and each cycle needs full testing, the test team has to be very large and the problem becomes prominent. The testing phase is then likely to become the bottleneck of requirement delivery, delaying releases and hurting business growth.

To improve testing efficiency, many kinds of testing can be automated, such as regression testing and performance testing. Another way is to return unit testing to the developers' scope of responsibility; unit testing is, of course, itself a form of automated testing.

Let's take a scenario we often meet in everyday web development: obtaining the data that matches a given set of filter conditions:

The method in this example depends on the container through HttpServletRequest, which must be mocked before the method can be tested; otherwise the code can only run after the server has started, and that is an integration test rather than a unit test. Mocking the container is not enough either: the underlying database must also be mocked, so the mocking grows more and more complicated. The code under test is clearly not complicated, yet the cost of unit testing it is very high. How do we solve this?

What we really want to test is the logic of the code. The environment the code runs in is not the target of unit testing; it only makes testing harder. So we need to write code that can be unit tested, that is, code whose inputs developers can construct freely without depending on the environment. For example, let's split and rework the example above:

The sub-methods produced by the split no longer depend on the container environment, so unit testing can proceed smoothly. Some may still ask: after the split, does the code left in the original method not need to be tested? It does not, because that code simply runs its steps in sequence and contains no logic, and code without logic does not need a unit test. The same goes for the underlying database operations: unit tests do not need to perform real database operations, which add little to verifying the business logic itself.
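The split described above can be sketched as follows. This is a Python analogue rather than the servlet code the text discusses, and every name in it (`OrderFilter`, `build_filter`, `apply_filter`, `handle_query`, the `request.params` attribute, the `load_orders` callable) is hypothetical. The point is only the shape: pure functions hold the logic, and a thin piece of glue touches the container and the database.

```python
from dataclasses import dataclass
from typing import Iterable, Mapping, Optional


@dataclass(frozen=True)
class OrderFilter:
    """Plain value object holding the filter conditions (hypothetical)."""
    city: Optional[str]
    min_amount: float


def build_filter(params: Mapping[str, str]) -> OrderFilter:
    """Pure logic: turn raw string parameters into a validated filter."""
    city = params.get("city") or None
    min_amount = float(params.get("min_amount", "0"))
    if min_amount < 0:
        raise ValueError("min_amount must not be negative")
    return OrderFilter(city=city, min_amount=min_amount)


def apply_filter(orders: Iterable[dict], flt: OrderFilter) -> list:
    """Pure logic: keep the orders that match the filter."""
    return [
        o for o in orders
        if (flt.city is None or o["city"] == flt.city)
        and o["amount"] >= flt.min_amount
    ]


def handle_query(request, load_orders):
    """Thin glue with no branching. It is the only part that touches the
    container (request) and the database (load_orders); both interfaces
    are assumptions made for this sketch."""
    flt = build_filter(request.params)
    return apply_filter(load_orders(), flt)


# The logic can be exercised with plain dicts: no server, no database.
sample = [
    {"id": 1, "city": "Beijing", "amount": 30.0},
    {"id": 2, "city": "Shanghai", "amount": 80.0},
    {"id": 3, "city": "Beijing", "amount": 120.0},
]
flt = build_filter({"city": "Beijing", "min_amount": "50"})
matched_ids = [o["id"] for o in apply_filter(sample, flt)]   # [3]
```

Here `handle_query` plays the role of the original method: it only strings the steps together, so the unit tests concentrate on `build_filter` and `apply_filter`.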

Having developers write unit tests is not the problem. The problem is how developers understand unit testing, and the testability of the code they write. Test-driven development is a practical way to discipline the writing of unit tests.

3.3.2 How to estimate the schedule?

Developers should know how to give business stakeholders credible estimates for planning. Once a commitment is made, it must come with a definite number and be met on time. In most cases, though, a precise number is hard to give. The professional practice is to provide a probabilistic estimate that describes the expected completion time and the possible variation. Developers should also consult others on the team so that the estimate reflects a consensus.

In 1957, PERT (Program Evaluation and Review Technique) was created to support the U.S. Navy's Polaris submarine program. One part of PERT is its method of calculating estimates, a simple and effective way of turning estimates into probability distributions that managers can understand.

A task can be estimated from three numbers. This is known as three-point (trivariate) analysis.

O: Optimistic estimate. This is a very optimistic figure: if everything goes exceptionally well, you can finish within this time. For an optimistic estimate to be meaningful, it should correspond to a probability of less than 1%. Example: 1 day.

N: Nominal estimate. This is the most probable figure. If you drew a histogram, the nominal estimate would be the tallest bar. Example: 3 days.

P: Pessimistic estimate. This is the worst-case figure. It should take into account all kinds of contingencies, such as hurricanes, nuclear wars, black holes, and other disasters. For a pessimistic estimate to be meaningful, it should also correspond to a probability of less than 1%. Example: 12 days.

With these three estimates, we can describe the probability distribution by its expected duration: μ = (O + 4N + P) / 6

For the example, the expected duration is (1 + 4×3 + 12) / 6 ≈ 4.2 days.

This number usually leans toward the pessimistic side, because the right-hand tail of the distribution is longer than the left-hand tail. We measure the uncertainty by the standard deviation of the task's probability distribution: σ = (P - O) / 6

The larger σ is, the more uncertain the estimate. For the example above, σ = (12 - 1) / 6 ≈ 1.8 days.

So the estimate is 4.2 days with a standard deviation of 1.8 days. The message this conveys is that the task will probably be done in about 5 days, but it may also take 6 days or even 9.
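The calculation above is easy to script. The sketch below reproduces the worked example with the two PERT formulas from the text; the function name `pert_estimate` is simply chosen for this illustration.

```python
def pert_estimate(optimistic: float, nominal: float, pessimistic: float):
    """Return (expected duration, standard deviation) for one task."""
    mu = (optimistic + 4 * nominal + pessimistic) / 6
    sigma = (pessimistic - optimistic) / 6
    return mu, sigma


# The example from the text: O = 1 day, N = 3 days, P = 12 days.
mu, sigma = pert_estimate(1, 3, 12)
print(f"expected {mu:.1f} days, standard deviation {sigma:.1f} days")
# prints: expected 4.2 days, standard deviation 1.8 days
```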

The actual situation may be more complicated. A piece of work often consists of several tasks done in sequence, and the estimation model extends accordingly: the expected duration of the whole sequence is the sum of the task means, μseq = ∑ μtask, and its standard deviation is the square root of the sum of the squares of the task standard deviations, σseq = √(∑ σtask²).
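As a rough illustration of combining estimates for a sequence of tasks, the sketch below takes per-task (μ, σ) pairs and applies the two rules just stated. The first pair is the worked example from the text; the other two pairs are invented for the illustration, and `combine_sequence` is a hypothetical helper name.

```python
import math


def combine_sequence(estimates):
    """Combine (mu, sigma) pairs of tasks performed one after another.

    The means add up; the standard deviations combine as the square
    root of the sum of their squares.
    """
    estimates = list(estimates)
    mu_seq = sum(mu for mu, _ in estimates)
    sigma_seq = math.sqrt(sum(sigma ** 2 for _, sigma in estimates))
    return mu_seq, sigma_seq


# (mu, sigma) in days; only the first task comes from the text.
tasks = [(4.2, 1.8), (3.5, 2.2), (6.5, 1.3)]
mu_seq, sigma_seq = combine_sequence(tasks)
print(f"expected {mu_seq:.1f} days, standard deviation {sigma_seq:.1f} days")
# prints: expected 14.2 days, standard deviation 3.1 days
```

Note that the combined standard deviation (about 3.1 days) is smaller than the plain sum of the three (5.3 days), which is why estimating at the task level and then combining tends to give a tighter range than padding every task separately.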

A basic grasp of time estimation models is essential for project planning and for controlling the rhythm of delivery. After all, schedules based only on experience, or even on guesswork, are often unsatisfactory and unconvincing.

IV. Summary

This article introduced Conway's Law, which describes the relationship between software and organizational structure, and illustrated it with a real case. From the perspective of organizational culture, it described how team stability and cohesion affect software delivery, and argued that projects should be assigned to teams rather than teams being formed around projects. Then, borrowing from the Six Degrees of Separation theory, it offered a way of analyzing and solving problems. Finally, it argued that a team should grow together, with sound guidance and training on common problems at work (such as how to estimate a schedule), so as to build a "special force" that can fight and win.

Note: Some pictures in this article are from the Internet

Author: JD Retail Liu Huiqing

Source: JD Cloud developer community. Please credit the source when reprinting.

Architect Diary - Organizational Culture in Software Engineering | JD Cloud Technical Team