Exploration of Best Practices for R&D Efficiency Improvement

GIAC (Global Internet Architecture Conference) is an annual technology architecture conference for architects, technical leaders, and senior technical practitioners, jointly launched by the High-Availability Architecture technology community and msup, with a long-standing focus on Internet technology and architecture. It is one of the largest technology conferences in China.

The 6th GIAC shared typical cases of technological innovation and R&D practice from the hottest frontier areas of Internet architecture: technology management, system architecture, big data and artificial intelligence, mobile development and programming languages, and other architecture-related fields.

On the topic of team collaboration, Ru Bingsheng, a senior expert on R&D efficiency at Tencent, delivered a keynote speech entitled "Exploration of Best Practices for Improving R&D Efficiency".

Ru Bingsheng is an expert engineer at the R&D Efficiency Center of Tencent's TEG Infrastructure Department, a well-known practitioner in software quality and R&D engineering efficiency, a Tencent Cloud Most Valuable Professional (TVP), and a think-tank expert of the Internet Application Technology Committee of the China General Chamber of Commerce. He is the author of the best-selling books "Full-Stack Technology Advancement and Practice for Test Engineers" and "Efficient Automated Test Platforms: Design and Development in Practice", and of the InfoQ Geek Time column "Software Testing: 52 Lectures, Practical Methods from Beginner to Expert". He also participated in the design of international certification courses for DevOps enterprise coaches.

The following is the speech record:

I was honored to be invited by the GIAC Global Internet Architecture Conference to serve as producer of this conference's "Testing Frontier Technology" track, and to give a keynote on improving R&D efficiency. Here I have put together a brief review of the essence of that talk, hoping it inspires and helps those of you who care about R&D efficiency improvement. The following is a review of the main content. Welcome, everyone.

The modern software industry is no longer an era of "big fish eating small fish"; it has become an era of "fast fish eating slow fish". For many large traditional software companies, being "big" used to be their advantage, but now they find themselves in the awkward position of a big ship that is hard to turn. For the many small-but-beautiful Internet software projects, once the idea and the market niche are settled, the competition among rivals comes down to R&D capability, specifically the ability to turn requirements into software or services, and the level of R&D efficiency plays a crucial role in that conversion rate. At the same time, effectively reducing R&D and operations costs is another important topic within R&D efficiency, especially for large-scale Internet projects: even a small optimization in a single link, amplified by scale effects (cluster size, user traffic, and so on), can yield considerable cost savings.

Like "agile", "R&D efficiency" is hard to define precisely. In fact, many complex concepts are not defined first and then practiced; they evolve gradually, with the phenomenon appearing first and the fitting name coming later. The best way to understand such a concept is to trace the thread of its development: go back to its origins and walk through its history to grasp its essence. Since time is limited, however, I will not take you through that history in this talk; instead I will use a few cases to give you an intuitive feel for the "beauty of R&D efficiency improvement".

Let's look at the first example. When doing product prototype design, we often need prototyping tools to design the product's GUI, and we use the result as a basis for communication in follow-up work. But you will find that even with prototyping tools like Axure and Modao, the cost of "drawing interfaces" is still high. There is a one-click generation solution that converts graphic GUI design drafts, even hand-drawn ones, into code for the target platform.

In the example above, a hand-drawn GUI design is converted directly into target-platform code by Sketch2Code. If the target platform is the Web, it generates HTML directly; if the target platform is iOS, it generates an Xcode project that can be compiled, packaged, and run directly on an iPhone. This approach greatly improves the efficiency of prototype construction.

Now the second example. In API testing, improperly handled boundary values of input parameters are very common. For instance, a parameter is of type String, but the code does not consider extreme values of that String. Such problems are usually discovered later, during API integration testing or joint debugging, when the cost of fixing is higher and the cost of regression testing must be paid on top. So we can introduce a mechanism that actively scans the types of an API's input parameters, generates error-prone values based on those types, and automatically calls the API with them; if a 500 error occurs or an exception is thrown, a problem has been found. (For a String parameter, for example, we can generate NULL, very long strings, strings containing non-English characters, SQL-injection strings, and so on.) We can go further and integrate this with the CI pipeline, running such tests proactively during CI to expose problems earlier.
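
The scan-and-probe mechanism described above can be sketched in a few lines. The type names, the parameter schema, and the `call_api` callable below are illustrative stand-ins rather than any specific framework's API; a real implementation would read the schema from the service's API definition and make actual HTTP calls.

```python
def error_prone_values(param_type: str) -> list:
    """Return values that commonly expose unhandled edge cases."""
    if param_type == "string":
        return [
            None,            # missing value
            "",              # empty string
            "x" * 10_000,    # oversized input
            "名字\u0000",     # non-English characters plus a NUL byte
            "' OR '1'='1",   # SQL-injection style payload
        ]
    if param_type == "integer":
        return [None, 0, -1, 2**31, -(2**31) - 1]
    return [None]

def fuzz_endpoint(call_api, schema: dict) -> list:
    """Probe an API with error-prone values and collect failures.

    `schema` maps parameter name -> type, e.g. {"user_name": "string"}.
    `call_api(params)` is expected to return an HTTP status code; a call
    counts as a failure if it raises or returns a status >= 500.
    """
    failures = []
    for name, ptype in schema.items():
        for bad in error_prone_values(ptype):
            try:
                status = call_api({name: bad})
            except Exception as exc:
                failures.append((name, bad, repr(exc)))
                continue
            if status >= 500:
                failures.append((name, bad, f"HTTP {status}"))
    return failures
```

Wiring the returned failure list into the CI pipeline as a gating step is what moves the discovery of these defects from the joint-debugging stage up to the commit stage.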

Both examples above are technology-driven. The next example has nothing to do with technology; it is process-driven. In the picture above, a chef is making sandwiches. On the left, because the ingredients are laid out in no particular order, the chef has to walk back and forth to finish the job. On the right, the ingredients are laid out in order of use, so the chef can finish each sandwich standing in place, saving the unnecessary walking time and greatly improving efficiency. This shows that efficiency gains can be driven by technology or by process.

Having seen these examples, I think you now have a fairly intuitive sense of R&D efficiency improvement. Next, let's look at the nature of R&D efficiency. If I had to summarize it in one sentence, it would be: "continuously delivering effective value, smoothly and with high quality, in a closed loop." Let me explain the key concepts:

  • Smooth: the flow of value must be unobstructed and free-flowing

  • High quality: if quality is poor, the faster it flows, the faster it dies

  • Continuous: no stop-and-go; running in small, steady steps is the right way, rather than saving up for one big move

  • Effective value: this is at the requirements level: does your deliverable really solve the user's essential problem? (One more word on "essential": for girls, losing weight is not the essential need; loving beauty is. Think it over yourself.)

  • Closed loop: emphasizes the importance of fast feedback

Under the guidance of this concept, I believe the "five continuouses" (continuous development, continuous integration, continuous testing, continuous delivery, and continuous operations) are the necessary practices for putting it into effect. At the same time, we need to measure R&D efficiency effectively along four dimensions: flow velocity, long-term quality, customer value, and data-driven decision making.

The above describes R&D efficiency at a conceptual level. Does it feel a bit dogmatic? Honestly, I think so too. So next I will use some popular examples to explain my understanding of R&D efficiency.

The picture on the left shows the so-called "square wheel" effect. The boss pulls the cart hard at the front while the employees push hard at the back. The boss focuses on the big trend and direction and looks forward, so it is hard for him to notice that the cart's wheels are square. The employees pushing the cart may well have seen the square wheels, but since the boss is pulling hard in front, they dare not stop at all and can only grit their teeth and keep pushing. The one who suggested switching to round wheels was ruthlessly ignored. Changing to round wheels does require an extra pause, but as everyone knows, it is what makes the cart run faster and longer. (Aside: a hotel slogan suddenly comes to mind, "know when to stop".) As you may have guessed, the round wheels here are engineering efficiency.

The diagram on the right divides matters into four quadrants by importance and urgency. We only discuss quadrants A and B here. Quadrant A is important but not urgent: usually foundational, long-term matters such as new product planning to seize the market, infrastructure construction, process optimization, and talent training. I like to call this the "prevention quadrant". Quadrant B is both important and urgent: things that must be dealt with immediately, such as system failures and fixing online defects. I often call this the "firefighting quadrant".

Ideally, most of your time should go to the "prevention quadrant" and only a small amount to the "firefighting quadrant".

When the "Quadrant for Precautions" is done, the probability of events in the "Quadrant for Fire Fighting" will decrease. If a company spends most of its time fighting fires, it usually means that the time allocation in these two quadrants is out of balance or upside down, and needs to focus on investing in long-term important but not urgent things.

For software development, the most important item in the prevention quadrant is R&D efficiency.

Finally, a more vivid metaphor. I believe everyone has heard the story of the goose that lays golden eggs. Is a goose that lays more golden eggs necessarily more efficient? Not really. Blindly pushing the goose to lay eggs even through its lunch break will sooner or later exhaust it; that is not a sustainable long-term strategy. Real efficiency is letting geese breed geese, and those geese breed more geese, so that more geese lay golden eggs together.

Tencent TEG's rapidly developing intelligent R&D platform, Alibaba's productized cloud-efficiency platform, Baidu's Efficiency Cloud built on its engineering efficiency white paper: these are all benchmarks in the field of R&D efficiency. But have you ever wondered why, in recent years, the leading companies in major industries have all begun pushing into R&D efficiency, and at such a consistent pace? I think there are three reasons behind it:

  1. Just like the concept of the "middle platform": many large companies now have very broad product lines, with a great deal of wheel-reinventing among them. If we focus on the duplicated wheels in the business, we get the business middle platform; if we focus on the duplicated wheels in data construction, we get the data middle platform; and if we focus on the duplicated wheels in R&D efficiency construction, we get the R&D efficiency platform. To some extent the R&D efficiency platform can be called the "R&D efficiency middle platform": its goal is to reuse R&D capabilities across the enterprise's products and projects, so that each product line no longer has to redo the "0 to 1" of R&D efficiency on its own, leaving no one with the energy for the more valuable "1 to n". Today's R&D efficiency platforms are moving toward unified, organization-level platforms of best practices for general R&D capabilities.

  2. From a business perspective, to-C products are becoming saturated. The era of dividends, when plenty of slack could be tolerated, is gone for good. In the past the business grew extremely fast, so burning money (extensive R&D, human-wave tactics) in exchange for faster market share, aiming at winner-takes-all, was the best choice. The focus then was on shipping the product, and inefficient R&D could be papered over with money. Now to-C has drifted into a red ocean and the scale of R&D is larger than ever, so it is time to tighten the belt. When increasing revenue (the "broaden sources" half of the Chinese idiom "broaden sources, reduce outflow") hits a bottleneck, reducing expenditure must play its part, and that reduction is exactly R&D efficiency improvement: more output from the same resources in the same amount of time.

  3. From the perspective of organizational structure, many companies face the "silo dilemma". As the second picture above shows, each R&D link may already be well optimized internally, yet cross-link collaboration can carry heavy hand-off and communication costs, dragging down overall efficiency. Breaking the invisible walls between links through process optimization, removing unnecessary waiting, and improving the flow of value is a large category of problems that R&D efficiency improvement tries to solve.

The slide on this page lists, from the perspective of software development, testing, and release, the issues that deserve attention at each stage of R&D efficiency improvement; the main line is a series of practices around CI/CD. For reasons of space I will not expand on them one by one here, but will cite a few examples to give you an intuitive sense.

  1. An all-in-one development environment reduces each developer's environment setup time while ensuring environment consistency. A more advanced play is a cloud IDE, which lets you edit code with nothing but a browser.

  2. AI-based code completion plug-ins can greatly improve coding efficiency in the IDE. For the same piece of code, you might need 200 keystrokes without the plug-in, but perhaps only 50 with it enabled.

  3. Static code inspection need not wait until the code is committed and the Sonar process runs in CI; by then it is too late to fix problems cheaply. With the SonarLint plug-in integrated into the IDE, local inspection runs in real time: problems are flagged directly in the IDE and fixed on the spot.

  4. Writing unit tests is time-consuming; tools such as EvoSuite can reduce the development workload of unit testing.

  5. For larger projects, every modification means a long compile. Incremental compilation, or even distributed compilation (distcc and ccache), can improve efficiency.

  6. Front-end development can use tools such as JRebel and Nodemon to make the development preview experience smoother.

  7. Choosing a code branching strategy that suits the project also helps efficiency a great deal.

  8. Building highly automated CI and CD pipelines greatly increases the rate at which value flows.

  9. Choosing an appropriate release strategy also strikes a positive balance between efficiency and risk. For example, if the architecture is relatively simple but the cluster is large, canary releases are preferable; if the architecture is more complex but the cluster is not too large, blue-green releases may have the edge.
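
As an illustration of the canary strategy in item 9, here is a minimal sketch of a progressive rollout loop. The `deploy` and `healthy` callables and the batch sizes are hypothetical stand-ins for a real deployment platform's primitives, not any actual API.

```python
def canary_rollout(nodes, deploy, healthy, batch_sizes=(1, 10, 50)):
    """Deploy to progressively larger batches, aborting on a bad health check.

    Returns the list of nodes actually updated. On failure the remaining
    nodes keep the old version, which is exactly the blast-radius limit
    a canary release buys you on a large cluster.
    """
    updated = []
    remaining = list(nodes)
    for size in batch_sizes:
        batch, remaining = remaining[:size], remaining[size:]
        if not batch:
            break
        for node in batch:
            deploy(node)
            updated.append(node)
        if not healthy(batch):   # bake period / metrics check per batch
            return updated       # stop here; the rest stays on the old version
    # canary batches look good: roll out to whatever is left
    for node in remaining:
        deploy(node)
        updated.append(node)
    return updated
```

A blue-green release would instead switch all traffic between two full environments at once, which is why it suits complex architectures on smaller clusters better.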

From the above we can see that R&D efficiency improvement covers a wide range of aspects, both technical and process-related. In actual engineering practice, how do we land it? My position is: "use the MVP (Minimum Viable Product) mindset to improve R&D efficiency."

The concept of the MVP comes from Eric Ries's book "The Lean Startup". The core idea is a product strategy that demonstrates the core concept at the lowest possible cost: build a usable product prototype in the fastest, most concise way, one that expresses the effect you ultimately want, and then iterate to refine the details.

This mindset is especially suitable for building R&D efficiency platforms. After identifying the efficiency problem to solve, first deliver the simplest solution, then keep optimizing and iterating in subsequent practice. If you try to build the platform behind closed doors, expecting to perfect every function before pushing it to the business-line teams, it is certain to be a dead end.

It is also necessary to point out a common misunderstanding of the MVP. A function that is implemented but has no actual value to the customer yet, and becomes useful only after later functions arrive, is not an MVP. What the MVP pursues is "small but complete": the implemented function points can be small and simple, but they must be valuable to the customer. In terms of the two triangles in the picture above, the "horizontal slice" is not the MVP; the "vertical slice" is.

So in the field of R&D efficiency, we must ensure that the tools we build solve real pain points, even if the initial approach is relatively simple. From a product perspective, an R&D efficiency platform is not essentially different from a general software product: it too requires continuous iteration and continuous improvement.

Here comes the most substantial part of this talk, which is also a staged summary of how to promote R&D efficiency improvement.

People often ask me: the efficiency-improvement projects you led have succeeded, so if you came to us, how many years would it take? This is actually an unanswerable question. To some extent, more investment shortens the cycle, but the cycle will not shrink without limit just because investment grows without limit. We can avoid many of the pits we have already stepped in and keep detours to a minimum, but each company must still walk the path that suits itself; pulling up seedlings to help them grow only harms long-term interests. Does buying a sports car make you a racing driver? With this in mind, I have distilled the tuition I paid and the pits I stepped in into the 8 suggestions in the picture above, for your reference. Let me explain them one by one.

The first is "start from the pain points". Often, when we hold a hammer, everything looks like a nail. R&D efficiency improvement is exactly the reverse: first find out which nails are the most conspicuous, then use a systematic methodology to build the right hammer.

Therefore, in the early stage of landing R&D efficiency improvement we usually adopt a bottom-up strategy: start from the actual pain points (the nails) in each engineering practice and create highlights of efficiency improvement from a problem-solving perspective. At this stage we pursue the principle of "short, flat, and fast", solving the problems one by one. For example, the following scenarios:

  • Local compilation takes too long: provide incremental compilation and distributed compilation capabilities

  • Local testing is difficult, and test environment preparation is complex and time-consuming: provide one-click test environment setup with Kubernetes-based Pods

  • The number of automated test cases is large and regression runs take too long: adopt a concurrent execution mechanism, using hundreds or thousands of test executor machines to run cases in parallel, trading hardware resources for time

  • Automated test cases are costly to maintain: adopt a modular, layered case design to achieve low-cost maintenance

  • Test data is hard to prepare: introduce a unified Test Data Service capability

  • Code submissions pile up late in the R&D cycle and defects erupt all at once: shift testing left, encourage developer self-testing, and practice "whoever develops it tests it; whoever ships it goes on call for it"

  • Performance defects are found late in development and the cost of fixing and retesting stays high: move from performance testing to performance engineering, integrating performance into every stage of software development rather than leaving it as the last step

  • Frequent security issues: build security testing into the full R&D life cycle, achieving DevSecOps rather than the earlier SDL

  • The cluster is huge and the release process takes too long: provide concurrent deployment at every level, such as concurrency across nodes within a cluster and concurrency across clusters

  • Project process data is back-filled late and loses its value for measurement: have tools fill in process data automatically instead of relying on engineers' manual input. For example, development-completion time no longer depends on developers filling it in by hand; it is recorded automatically when the Jenkins build completes, ensuring the authenticity of all process data and providing reliable input for later measurement and improvement.
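
The "trade hardware resources for time" idea in the list above can be sketched with a simple parallel runner. Real platforms distribute cases across many executor machines; the single-machine thread pool here is only an illustrative stand-in that shows the same shape for I/O-bound cases, and assumes cases are independent, which is precisely what the modular, layered case design is meant to guarantee.

```python
from concurrent.futures import ThreadPoolExecutor

def run_suite(test_cases, run_one, workers=8):
    """Execute `run_one(case)` for every case concurrently.

    Returns a {case: result} mapping. Wall-clock time approaches
    (slowest case) instead of (sum of all cases) as workers grow,
    which is the whole point of parallel regression runs.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order, so zip pairs cases correctly
        return dict(zip(test_cases, pool.map(run_one, test_cases)))
```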

The second is "cut in from the global view". We often try to optimize one specific link while ignoring the possibility of global optimization.

For example, when we go to a hospital, we usually queue half an hour for registration, though registering itself may take only two minutes. Then comes a long queue to see the doctor, and once we finally get in, within five minutes we may be sent off for a blood test... The share of actually effective time in the whole process is tiny. If we keep trying to optimize registration itself while ignoring the waiting between links, we are clearly optimizing in the wrong direction: efficiency improvement should focus not only on single-step optimization but on reducing the useless waiting between steps. On this point, health check-up centers do much better than public hospitals. You rarely see long queues at each department of a check-up center, because its economic interest makes it care about throughput, and it achieves higher profitability through global queue scheduling and optimization.

Back in the field of software development, you will find unreasonable queuing like the hospital's everywhere: the circulation of software defects, the implementation and delivery of requirements, the wait for product packages to be released. These, too, are areas R&D efficiency improvement must focus on: sort out the entire process from a global perspective, identify the wasted waiting time, and achieve overall efficiency gains through process re-engineering and optimization.

The third is "let users benefit". For R&D efficiency improvement, always remember: the criterion of success is not the success of the efficiency platform but the success of the customer. Customer benefit is the only criterion for judging whether the efficiency work has succeeded. Here I want to make three points:

Pseudo-requirements: pseudo-requirements are the efficiency team's own conjectures, the classic case of "holding a hammer, everything looks like a nail". So how do we identify them? The standard is actually very simple: see whether the customer is willing to share the cost with you. If a business line has already started building something itself, or is about to, that is a genuine need of the business line, and if the efficiency platform can provide the solution, adoption follows as a matter of course. I have seen many examples of such genuine needs, such as building integrated test environments under a microservice architecture.

Structural problems: Liu Run, a well-known business consultant, once said, "If the structure is wrong, nothing is right." You have surely heard the story of the two monks dividing porridge. A bowl of porridge must be divided between two monks, but the monk who divides it always wants more for himself. How can it be fair without supervision? Lecture the dividing monk to "take less as a virtue"? Obviously, once unsupervised he will give himself more. The best solution is to let one monk divide the porridge and the other choose first; the system itself then guarantees an even split. So a good mechanism admits that everyone is self-interested, yet maximizes overall benefit on that basis. If maximizing overall benefit requires everyone to be selfless, the design is a failure, because it will inevitably fail. Back to R&D efficiency: we must position ourselves not by "how good our platform is" but by "how much better the business lines will become". That is how structural success is won.

Service mindset: once the point above is understood, the service mindset follows naturally. While rolling out the efficiency platform, the two sides should help each other toward a win-win: the business line gains ready-to-use solutions, and the platform gains the accumulation of best practices, which provides the technical basis for replicating success at scale later on.

One more bit of experience to share: sometimes, to get the platform landed on a business line as soon as possible and make that business line willing to be our proving ground, we must be ready to take the blame proactively. If you are interested, we can discuss it privately.

Continuous improvement is the only road for the growth of an efficiency platform. When first solving a problem, we focus on solving it quickly and simply; once the scale grows, we must pay more attention to the solution's generality and versatility. If you try to find the perfect solution from the very beginning, you are bound to lose out.

A concrete example. Suppose we need Jenkins hooks to trigger certain operations (static code scanning, unit testing, and so on). The easiest way is to implement the operations directly inside the hook. This is very efficient at first and very easy to build, but it is not optimal: the logic in a hook is usable only there, and as hooks multiply, the various implementations scatter everywhere and become hard to maintain. Every new need (say, adding slow-SQL scanning) means changing the hook implementation, and the approach also violates the IaC (Infrastructure as Code) principle.

A better approach is to introduce an R&D efficiency message center (MQ) and achieve future extensibility through a subscription model for downstream operations. But if you build the MQ from day one, the difficulty and cost of implementation rise sharply, the business line may not be able to wait for your plan, and the efficiency improvement fails to land on schedule. So my view is that landing efficiency work can follow the strategy of "stake out the ground first, refine later".
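
The contrast between the two approaches can be sketched as follows: the hook publishes a single event to a message center, and each downstream operation is a subscriber. The in-process `EventBus` below is an illustrative stand-in for a real MQ (Kafka, RabbitMQ, and the like); the topic and handler names are hypothetical.

```python
from collections import defaultdict

class EventBus:
    """In-process stand-in for a real message queue."""
    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, handler):
        self._subscribers[topic].append(handler)

    def publish(self, topic, payload):
        for handler in self._subscribers[topic]:
            handler(payload)

bus = EventBus()
triggered = []

# Each downstream action is a subscriber; adding a new one later
# (e.g. a slow-SQL scan) never touches the hook itself.
bus.subscribe("code.pushed", lambda e: triggered.append(("static-scan", e["commit"])))
bus.subscribe("code.pushed", lambda e: triggered.append(("unit-tests", e["commit"])))

def jenkins_hook(commit_id):
    """The hook does one thing only: publish the event."""
    bus.publish("code.pushed", {"commit": commit_id})
```

The "stake out the ground first, refine later" path would start with logic in the hook, then migrate to this publish-subscribe shape once the number of downstream operations justifies the cost.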

The slide on this page is about how to land R&D efficiency improvement. Neither bottom-up alone nor top-down alone is feasible; a two-pronged approach, squeezing from both ends toward the middle, is the practical solution. For reasons of space I will not expand on it in detail here.

Here I propose two concepts: "singing the opera" and "building the stage". When we first did R&D efficiency work, we were both the stage and the performer: on top of the efficiency platform (the stage) we provided solutions (sang the opera) for each business line. But as business-line adoption kept expanding, diversified needs in the various vertical domains multiplied, and it became hard for us to answer each team's individualized, non-universal needs (every family wants a different opera sung). At that point the openness of the platform became the key. The platform must be able to cope with this diversity: its responsibility shifts to being a standardized stage on which the business lines realize their individualized needs themselves, so the platform's technical architecture must be designed for extensibility and flexibility.

For example, the stage is Jenkins, and the flexibility comes from the many plug-ins built on top of it.

Self-deception is a mistake we often make while landing efficiency work. The figure above shows the "worst practices" of R&D efficiency; count in your heart how many hit home.

Another form of self-deception is the widespread use of vanity metrics to measure efficiency. What exactly is a vanity metric? A vanity metric is one that cannot directly guide follow-up action. What we need are actionable metrics that can guide what we do. That is still rather abstract, so here are a few examples to make it easy to understand.

  • "Number of projects connected to Sonar" is a vanity metric; the corresponding actionable metrics are "the trend of Sonar issues over time" and "the time to fix Sonar issues";

  • "Number of system users" is a vanity metric; the corresponding actionable metrics are DAU (daily active users) and MAU (monthly active users);

  • "Number of projects connected to the efficiency platform" is a vanity metric; the corresponding actionable metric is "the percentage of projects that complete their development, testing, and release process on the platform".

What we need is charcoal in the snow, not icing on the cake.

The last point is easy to understand: "be the first user of your own efficiency platform". The platform team itself must go through its own platform, so that it can view its solutions from the user's standpoint and empathize with business-line users.

Finally, an outlook on the future direction of the R&D efficiency industry. For reasons of space I will not elaborate here; I plan to write a follow-up article on this topic, so stay tuned!

Some further reading is recommended. "Full-Stack Technology Advancement and Practice for Test Engineers" and "Efficient Automated Test Platforms: Design and Development in Practice" are books I wrote in 2019 and 2020 respectively, focused mainly on improving software testing efficiency. Professor Zhu Shaomin's "Complete Software Testing (3rd Edition)" is also recommended.

A preview: "The Beauty of Software R&D Efficiency Improvement", which Wu Junlong and I are writing. If all goes well, it will meet you in 2021.

Reply with the keyword [GIAC] in the backend to get the speakers' slides from this conference.


Origin blog.csdn.net/Tencent_TEG/article/details/108250398