Talking about the Quality Management of Massive Platforms

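About the speaker: More than 10 years of quality assurance and technical assurance work at Internet e-commerce companies. Previously quality director and senior technical director at Yihaodian (1号店), responsible for enterprise information platform R&D, automated operations development, quality assurance, engineering efficiency, CI/CD, agile transformation, middleware R&D, and related work. Deeply involved over many years in the business growth, architecture evolution, agile transformation, engineering efficiency building, process improvement and measurement, and software testing of a thousand-person R&D organization, through the changes and innovations of going from 0 to 1 and from 1 to N. Served for many years as overall lead and commander-in-chief of technical assurance for major promotions such as the company anniversary sale and Double 11.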

Today's topic is inseparable from DevOps and agile. From the perspective of quality, let's look at how quality work is done in the DevOps ecosystem. Because time is limited, I will share my thinking and real cases from an overall perspective.

1. Understanding quality

First, let's look at what quality is by looking back at some quality incidents.

  • The first incident involved a Japanese satellite: a single line of wrong code caused a loss of 1.8 billion.

  • The second is an example from my own past. One of our developers had to switch a configuration at two or three in the morning and set an alarm clock for it, but the alarm did not ring. The company lost 20,000 yuan in orders.

  • The third also happened to one of our teams. A release went live without any code change, yet orders dropped by a few percentage points and the company lost 40,000 yuan in orders.

Except for the first, these examples seem to have nothing to do with code, yet they are all quality incidents. So what exactly is quality? Is it only the quality department's business?

Besides functionality, what other attributes does quality have? (Live interaction)

The attributes most often used in software development are functionality, performance, security, usability, and reliability. Within the R&D department each role has a different focus: R&D engineers pay more attention to functionality and performance, the security department to security, and the operations department to reliability and performance.

The quality engineer is the role that pays attention to every aspect of product quality, and also the role that pushes the other roles to look at quality as a whole.

In the past three years, my work has been more about comprehensively controlling technical risk across business, development, architecture, operations, and security. Technical risk control answers for the company's business; it is not enough that the functions have no problems. I gave a more complete summary in a 2015 talk on technical assurance for e-commerce promotions. Today I will mainly talk about the quality-related part of that work.

So what does quality management involve? Leaving academic theory aside for now, my conclusion from ten years of Internet quality management is that quality management for Internet applications mainly covers release management, change management, risk management, defect management, and configuration management. The quality assurance panorama discussed later reflects the same view.

2. Thinking, Challenges, Trends


Let's first look at the challenges that the quality of Internet systems faces.

First, from a system perspective, there is the test of massive business traffic.

The importance of the core business goes without saying: once an error occurs, even for a few minutes, the direct financial loss can easily run to several million. At the same time, there are businesses that do not seem to be on the main path, such as modules that express the product's own character and differentiate its user experience, modules that increase user stickiness, or modules that drive conversion and growth of the main business.

Suppose these modules have errors. Viewed against the company's whole business, the error rate is only 1%, but in practice that affects tens of thousands to hundreds of thousands of daily active users (DAU). Add the amplification of peak traffic and the number of affected users becomes very large.

I call these long-tail business assurance points. Whether it is the core business or the long-tail business, each affects how some part of the company's business and products performs.

For the R&D team, and especially the quality team, a 1% probability cannot be ignored. This is where the traditional 80/20 rule can no longer guide us. This is the pressure and challenge that massive-scale services bring to quality assurance, and it is also an inevitable trend for the software industry.
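
To make the 1% arithmetic concrete, here is a minimal sketch. The DAU figures and the peak factor are hypothetical values chosen only to reproduce the "tens of thousands to hundreds of thousands" range mentioned above; they are not numbers from the talk.

```python
def affected_users(dau: int, error_rate: float, peak_factor: float = 1.0) -> int:
    """Estimate how many daily active users run into an error.

    dau         -- daily active users of the product (hypothetical input)
    error_rate  -- fraction of users hitting the faulty module, e.g. 0.01 for 1%
    peak_factor -- traffic multiplier during a promotion peak (1.0 = normal day)
    """
    if dau < 0 or not 0.0 <= error_rate <= 1.0 or peak_factor < 1.0:
        raise ValueError("invalid input")
    return round(dau * error_rate * peak_factor)


# Hypothetical scenarios: normal day vs. a promotion peak with 5x traffic.
for dau in (2_000_000, 20_000_000):
    normal = affected_users(dau, 0.01)
    peak = affected_users(dau, 0.01, peak_factor=5)
    print(f"DAU {dau:>10,}: {normal:>9,} affected normally, {peak:>9,} at peak")
```

Even under these assumed inputs, a "small" 1% problem lands on more users than many products have in total, which is why the long tail cannot be triaged away.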

Second, from a business perspective. Last year, JD.com proposed the "***retail" strategy. A new business model is also a new challenge for the R&D team.

  • First, business requirements come from many kinds of users, including internal departments, external merchants, and online customers...

  • Second, on top of an already huge volume, the business keeps growing rapidly;

  • Third, in terms of technical complexity, the terminals that carry the services are not only the main APP but also WeChat, QQ, mini programs, and other extended APP products. The rise of the intelligent Internet of Things has carried software products beyond the phone to other terminals such as refrigerators, speakers, and TVs.

  • At the same time, in terms of engineering efficiency, agile implementation faces higher demands: business response and delivery must be fast, while quality and experience must still be guaranteed.

  • In addition, an R&D organization of thousands of people means a large number of departments working together, so the demands on collaboration efficiency are also high.

As you can see, there are many conflicts here. Sometimes, to meet the schedule and go live, a little quality has to be sacrificed; sometimes cross-department communication is inefficient while the schedule is very tight. So this is a very difficult thing.

Third, the challenge of agile transformation.

This is a challenge I experienced personally when leading an agile transformation in 2013. As mentioned above, the company's business was growing rapidly, the R&D team was close to a thousand people, system complexity was rising, and cross-team collaboration efficiency was under strain... So to improve delivery capability, you cannot expand headcount indefinitely; you have to solve the problem of improving efficiency.

During the agile testing transformation, both the technical means and the division of roles change, so the number of people changes too.

The figure below shows how the headcount of the agile test team changed. The numbers may be sad for the test team. The boss, of course, likes them best, because the company's labor cost is falling, or rather the business is growing while labor cost stays under control.

Looking at the whole picture, we need to recognize that this is a trend, and that it is the right goal for agile transformation and efficiency building.

3. Building a quality system for a massive platform

3.1 How to build an assurance system

Faced with such challenges, how do we build a quality assurance system?

  1. First, on the management side, the organizational structure has to change. The quality management organization needs these groups: business test, acceptance support, framework and platform, configuration management, process audit, process planning, and technical risk. The teams I have led covered all of them. Depending on the situation and stage of the R&D department, each group shown in the figure can be either a virtual role or a physical organization.

  2. If the business served by R&D spans multiple platforms, a test team may cover only one vertical platform, and it then needs support from a horizontal team outside that vertical. This is the purpose of the acceptance support group.

  3. Process audit audits the R&D process. Process planning plans the management platforms and process improvement. The risk group is missing in many teams; the bar for it is relatively high because it needs all-round capability.

  4. The other groups are self-explanatory, so I won't say more.

3.2 JD Quality Assurance System

The front-end business of JD Mall involves a very rich set of product lines, as I mentioned when talking about challenges. The objects under test that carry the business range across the APP, Dongdong, mini programs, and more. This diagram depicts that complexity. What does quality assurance cover in this situation?

Here I try to summarize a system that the Internet testing industry can use as a reference.

The picture below is a panorama of the quality assurance system that I have thought through and organized. Let me explain it briefly.

It has two sides: the left is technology and the right is management, and they complement each other. The test strategy is in the upper left corner. Each company formulates its own test strategy from its experience, lessons, and business characteristics. I hope the strategy model summarized here can serve as a reference for peers.

It runs from white-box testing through to specialized testing of App functionality, performance, APIs and microservices, user experience, and customization. This is the core work of the test team. These definitions answer whether the strategy is complete. Whether it can be implemented, and implemented effectively, is what the quality platform and the process management on the right side help to solve.

The quality platform includes a quality management platform, a test execution platform, and a test monitoring platform. This monitoring differs a little from operations monitoring: one part targets code and the other targets service quality. Process management and improvement includes process standards and specifications, incident and problem management, process measurement and improvement, drill management, compliance audit, and implementation of agile practices. Quantitative management includes a capability evaluation model, an agile measurement model, a platform product measurement model, and a quality measurement platform.

As for platform products: we have a large number of platforms, a set of products that supports the whole R&D process, so how do we measure the quality of these products? That also has to be considered.

3.3 JD’s "Four Modernizations" Construction

I sum up engineering efficiency construction as the "Four Modernizations": standardization, automation, building blocks, and intelligence.

3.3.1 Standardization construction

Standardization covers systems, applications, configurations, people, and roles, as well as organization, teams, and performance. Most standards and norms are established under the lead of a process-defining organization such as the QA team. The hardest part is organization, teams, and performance. That can only be solved effectively from the R&D management decision-making level, and it requires innovation in management.

Let me give an example from Yihaodian. Daily work often involves requests to launch an application, to expand capacity, or to get access to a code repository, and these steps need review and approval. When we built platforms such as the operations release platform and the code repository management platform, the question of who should approve these steps came down to organizational structure.

A single department may have dozens or even hundreds of people, some in Beijing and some in Shanghai, all under the same department head. The personnel structure is fairly flat: there are at most three levels of departments, and no fourth level.

There is no need to involve the department head in these approvals; confirmation by the team leader in Shanghai or Beijing is enough. In other cases there is a cross-team project that spans multiple business lines and runs for a long time, staffed by people seconded from several departments.

Such a project has a person in charge. Administratively he may be the superior of only some of the project members, while the others do not belong to his team, yet the reviews and approvals for the project need him to be responsible.

So the approval chain cannot be derived from the personnel data of the administrative organization. What can be done? On reflection, this situation is not only about approvals; it touches communication, collaboration, decision-making, and daily team management throughout the R&D process.

The most thorough solution is therefore to be found in the organizational structure. We created a new entity organization called a Domain. It builds on the existing personnel structure to derive a new physical team structure. The size and boundaries of a Domain are decided by managers according to defined criteria. Domain data is not maintained by the personnel department; a team within the R&D department maintains and manages it.
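
As an illustration only, the Domain idea can be sketched as a small data model in which the approver of a request is looked up through the Domain that owns the application, not through the HR reporting tree. The class names, fields, people, and the one-application-one-Domain rule below are assumptions made for this sketch, not the actual Yihaodian design.

```python
from dataclasses import dataclass, field


@dataclass
class Domain:
    """A delivery team defined by R&D, independent of the HR reporting tree."""
    name: str
    owner: str                                   # approves requests for this Domain
    members: set = field(default_factory=set)    # may come from several departments
    apps: set = field(default_factory=set)       # applications the Domain is responsible for


class DomainRegistry:
    """Maps each application to exactly one Domain (an assumption of this sketch)."""

    def __init__(self):
        self._domain_by_app = {}

    def register(self, domain):
        # Validate first so that a failed registration changes nothing.
        for app in domain.apps:
            if app in self._domain_by_app:
                owner_domain = self._domain_by_app[app].name
                raise ValueError(f"{app} already belongs to {owner_domain}")
        for app in domain.apps:
            self._domain_by_app[app] = domain

    def approver_for(self, app):
        """Return who approves a release, scaling, or repo-access request on `app`."""
        try:
            return self._domain_by_app[app].owner
        except KeyError:
            raise LookupError(f"no Domain registered for {app}") from None


# Hypothetical cross-department project: members report to different managers,
# but approvals go to the Domain owner.
registry = DomainRegistry()
registry.register(Domain(
    name="checkout-project",
    owner="li.lei",
    members={"li.lei", "han.mei", "zhang.wei"},
    apps={"cart-service", "order-service"},
))
print(registry.approver_for("order-service"))
```

The point of the design is that release and repository platforms query this registry, so the approval chain follows the real delivery team even when its members sit in different administrative departments.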

This is an example of how we promoted system decoupling during the agile rollout in 2014. We often hear that a certain system is messy and full of problems, but where exactly are the problems? Only a few frontline engineers in the team who know the system well know where the pitfalls are. Many applications are involved, and it is unrealistic to ask those engineers to spend a lot of time sorting them out.

For such a team, agile practices such as two-week delivery and story splitting are almost impossible. I had configuration management standardize the code projects and the compile-and-build process. With the resulting picture, everyone on the team can see very clearly what every application depends on. The boss can see it, and so can a newly joined junior programmer. It also becomes clear that the time for decoupling and refactoring the architecture has come, and it becomes easier to get all parties moving.

In the end, this team became a model of successful agile transformation, and its R&D delivery efficiency was among the best.
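
Once builds are standardized, the declared dependencies of each application can be read mechanically, and a dependency picture like the one described above falls out of a few lines of graph code. The sketch below assumes the dependencies have already been extracted into a dictionary; the application names are made up, and the extraction step itself (parsing build files) is not shown.

```python
from collections import defaultdict


def reverse_dependencies(deps):
    """For each application, list the applications that depend on it."""
    dependents = defaultdict(set)
    for app, targets in deps.items():
        for target in targets:
            dependents[target].add(app)
    return dict(dependents)


def find_cycle(deps):
    """Return one dependency cycle as a list of names, or None if the graph is acyclic."""
    in_progress, done = set(), set()
    path = []

    def visit(node):
        in_progress.add(node)
        path.append(node)
        for nxt in deps.get(node, []):
            if nxt in in_progress:                 # back edge: we closed a loop
                return path[path.index(nxt):] + [nxt]
            if nxt not in done:
                cycle = visit(nxt)
                if cycle:
                    return cycle
        path.pop()
        in_progress.discard(node)
        done.add(node)
        return None

    for node in deps:
        if node not in done:
            cycle = visit(node)
            if cycle:
                return cycle
    return None


# Hypothetical dependency data, as it might be extracted from standardized builds.
deps = {
    "order": ["inventory", "pricing"],
    "inventory": ["pricing"],
    "pricing": ["order"],
}
print(reverse_dependencies(deps))
print(find_cycle(deps))
```

A cycle such as order, inventory, pricing, order is exactly the kind of coupling that blocks independent two-week delivery, and making it visible gives the refactoring discussion a concrete starting point.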

The table in the figure below has been shared by peers at conferences over the past two years as a DevOps practice. It was originally created by our quality team in 2013. We have many business lines with different characteristics, some financial and some front-end, so a single unified code quality standard is not feasible. I therefore asked the engineering efficiency team to support different agreements for different characteristics from the initial design, with the QA team drawing up the corresponding specifications. Given the time, I will not go into the implementation details; offline discussion is welcome.
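
The original table is not reproduced here, but the idea of per-business-line agreements can be sketched as a set of quality gates keyed by business line. The metric names and threshold values below are invented for illustration and are not the figures from the 2013 table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QualityGate:
    min_unit_coverage: float   # minimum unit test line coverage, 0.0 to 1.0
    max_blocker_issues: int    # maximum open blocker-level scan issues


# Hypothetical agreements: stricter for finance, looser for fast-moving front-end work.
GATES = {
    "finance": QualityGate(min_unit_coverage=0.80, max_blocker_issues=0),
    "front_office": QualityGate(min_unit_coverage=0.50, max_blocker_issues=2),
}


def check_gate(business_line, coverage, blocker_issues):
    """Return a list of human-readable gate failures (empty means the build passes)."""
    gate = GATES[business_line]
    failures = []
    if coverage < gate.min_unit_coverage:
        failures.append(
            f"coverage {coverage:.0%} is below the required {gate.min_unit_coverage:.0%}"
        )
    if blocker_issues > gate.max_blocker_issues:
        failures.append(
            f"{blocker_issues} blocker issues exceed the allowed {gate.max_blocker_issues}"
        )
    return failures


# The same build result passes one gate and fails the other.
print(check_gate("front_office", coverage=0.60, blocker_issues=1))
print(check_gate("finance", coverage=0.60, blocker_issues=1))
```

Keeping the thresholds as data lets the QA team own the specification while the engineering efficiency platform only enforces whatever is configured.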

3.3.2 Automation construction

Automation construction has two aspects. The test automation strategy is layered design and coordinated execution. This is now a general consensus in the industry, so I won't explain it. Let's look at a few examples.

For example, JD.com's code scanning tool is called Anteater. On average it inspects more than 240 services and finds 40 issues per day, and its inspection rules can be configured flexibly.
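
The internals of that scanner are not described in the talk. As a generic illustration of what "flexibly configurable rules" can mean, the sketch below treats each rule as data (an id, a regular expression, a message, and an on/off switch) and applies the enabled rules line by line. The rule ids and patterns are hypothetical.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    rule_id: str
    pattern: str        # regular expression matched against each source line
    message: str
    enabled: bool = True


def scan(source, rules):
    """Return (line_number, rule_id, message) for every enabled rule that matches."""
    compiled = [(rule, re.compile(rule.pattern)) for rule in rules if rule.enabled]
    findings = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for rule, regex in compiled:
            if regex.search(line):
                findings.append((lineno, rule.rule_id, rule.message))
    return findings


# Hypothetical rule set; a team can switch rules on or off without code changes.
RULES = [
    Rule("no-print-stacktrace", r"\.printStackTrace\(\)", "log the exception instead"),
    Rule("no-todo", r"\bTODO\b", "resolve TODO before release", enabled=False),
]

SOURCE = """try { run(); } catch (Exception e) {
    e.printStackTrace(); // TODO log
}"""

for finding in scan(SOURCE, RULES):
    print(finding)
```

A production scanner would work on syntax trees instead of raw lines; the regular-expression approach here is only the simplest way to show rules living in configuration.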

The quality monitoring in the picture below comes from the fact that e-commerce websites run many activities aimed at different consumers. Inspecting these activities manually takes 20 people a week. The operations monitoring platform focuses on the system and platform level, while this monitoring focuses on quality issues at the level of a specific business.

We have many platforms. Suppose an activity is meant for the APP channel and not for WeChat, but unfortunately it gets released on the WeChat channel. To catch this, we also run platform adaptation checks.
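
A platform adaptation check of this kind boils down to comparing the channels an activity supports with the channels it was actually published on. The following is a minimal sketch; the `Activity` fields, channel names, and sample data are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Activity:
    name: str
    supported_channels: frozenset   # channels the activity page was built for
    published_channels: frozenset   # channels where it is currently live


def adaptation_violations(activities):
    """Map activity name -> sorted channels it is published on but does not support."""
    violations = {}
    for activity in activities:
        extra = activity.published_channels - activity.supported_channels
        if extra:
            violations[activity.name] = sorted(extra)
    return violations


# Hypothetical data: the first activity was built for the APP only but also
# went out on WeChat; the second is published on a subset of what it supports.
activities = [
    Activity("spring-sale", frozenset({"app"}), frozenset({"app", "wechat"})),
    Activity("new-user-gift", frozenset({"app", "wechat"}), frozenset({"wechat"})),
]
print(adaptation_violations(activities))
```

Run on a schedule against the live activity list, a check like this replaces part of the manual inspection effort described above.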

3.3.3 Building block construction

In my understanding, building block construction is the law by which platform architecture evolves. For an enterprise, building blocks are the evolution of business capabilities; for an R&D team, they are the evolution of technical capabilities.

Judging from my experience leading the construction of engineering efficiency platforms, automated operations platforms, middleware platforms, and quality platforms, it always starts as a collection of small tools, framework prototypes, and scattered scripts. Through modularization, servitization, and visualization, these gradually form a new ecosystem that is flexible, pluggable, and composable, and that provides services to the outside. This is how we evolved.

This process also reaches the ultimate goal of building blocks: empowerment. When a new business or new application needs a solution, we only have to pick the pieces we need, do some simple adaptation and configuration, and it is ready to serve.

In this example, the existing systems each have their own modules already at work. How do they collaborate? That is what this kind of building block construction is about.

I built the YHD engineering efficiency cloud service at Yihaodian. The originally closed systems at each stage were transformed through servitization of functions, standardization of data, and automation of processes, finally forming a complete, organic set of R&D management services.

There is a lot of content; given the time, in short there are two points.

First, people. From the moment engineers join the company, everything, including their code repository permissions, the applications they can see, and the releases they can make, is controlled automatically.

Second, the product. When an R&D engineer produces a line of code, we track where the code is stored, where it is packaged, how it ships, what was changed, which tests it passed, the result of the launch, monitoring feedback, R&D and testing costs and benefits, process quality and efficiency, and so on.

3.3.4 Intelligent construction

We are still on the road with intelligent construction.

There are not many cases to share yet. One is user feedback: once collected, different questions need to go to different departments. The feedback submitted by users online is analyzed with semantic matching and cluster analysis, so that certain types of issues are routed to the operations department, others to R&D, and so on.
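
The talk does not describe the matching or clustering models. As a deliberately simplified stand-in, the sketch below routes feedback by keyword overlap, which shows the shape of the routing step without any real semantic model. The team names and keyword lists are hypothetical.

```python
import re

# Hypothetical keyword sets per receiving team; a real system would replace
# this lookup with semantic matching and clustering as described above.
ROUTES = {
    "operations": {"refund", "delivery", "coupon", "price"},
    "rnd": {"crash", "error", "slow", "login", "blank"},
}


def route(feedback, routes=ROUTES, default="triage"):
    """Return the team whose keywords overlap most with the feedback text."""
    words = set(re.findall(r"[a-z]+", feedback.lower()))
    scores = {team: len(words & keywords) for team, keywords in routes.items()}
    best = max(scores, key=scores.get)   # ties go to the team listed first
    return best if scores[best] > 0 else default


for text in (
    "Where is my refund for the late delivery?",
    "Login page shows an error and then goes blank",
    "Nice colors",
):
    print(f"{route(text):<10} <- {text}")
```

Feedback that matches no keyword falls through to a manual triage queue, which is also where a learned model would send low-confidence items.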

4. Summary and Outlook

From front to back, this is one complete perspective. DevOps breaks down the walls and forms an integrated whole, but the contradictions inside still need to be balanced, and the ultimate goal is to pursue excellence in quality management.

This is a reflection from a quality manager before me. I think it resonates, so I share it with everyone at the end. Many companies have teams that build platforms, and so does the quality department.

Now that DevOps is popular, everyone seems keen on it while overlooking the engineers who actually struggle for quality assurance on the front line of the business. Don't do work for the sake of tools; we build tools to solve problems. However sophisticated the technology, a fancy tool that cannot be put into practice is built in vain. In emphasizing quality, our original intention is to realize the value that products bring to users.

Origin blog.51cto.com/15127503/2657793