The war against software complexity

1. What is R&D efficiency?

When we talk about R&D effectiveness, what exactly do we mean? The topic keeps coming up because there is a real problem: actual R&D efficiency is far lower than expected. At a startup, one person can take an idea from conception to launch in two hours. But once the company grows to thousands of people, a similar task often requires multiple teams and several weeks. The contrast is stark, and for people without a deep understanding of software engineering it is hard to comprehend, yet there often seems to be nothing they can do about it.

Careful readers will notice that the text above uses both "effectiveness" and "efficiency". The distinction is deliberate: effectiveness is usually used to measure the economic performance of a product, while efficiency refers only to improving business responsiveness, increasing throughput, and reducing cost.

The definitions here are quoted from Qiao Liang's course material "How to Build a High-Performance R&D Team". Since this article does not discuss product development methods, the discussion below focuses on "efficiency".

In the 1990s, early Internet practitioners building simple websites only needed to learn Linux, Apache, MySQL, and PHP (or Perl), a stack with an easy-to-remember name: LAMP. Today, a developer at a large Internet company must understand a technology stack an order of magnitude larger: distributed systems, microservices, web development frameworks, DevOps pipelines, containers and other cloud-native technologies, and so on. If that were the whole story, it would be manageable; these are industry-standard technologies, and capable developers can master them quickly. The truly daunting complexity is that a large Internet company owns one or more software systems whose scale often exceeds a million lines of code, whose quality varies (and is mostly bad), and on top of which developers must do their work. They carry a very high cognitive load, every modification risks breaking existing functionality, and rising risk inevitably means falling speed.

Therefore, one of the core causes of the substantial decline in R&D efficiency is the exponential growth of software complexity.

2. Essential complexity and accidental complexity

Fred Brooks gives a wonderful discussion of software complexity in "No Silver Bullet", collected in the classic "The Mythical Man-Month". He divides software complexity into essential complexity and accidental complexity. The words "essential" and "accidental" come from Aristotle's "Metaphysics": in Aristotle's view, essential attributes are those an object must have, while accidental attributes are those an object may have (or may not). For example, e-commerce software inevitably includes business complexity such as transactions and commodities, so we call that complexity essential; the same e-commerce software may be implemented with container technology (or not), or written in Java (or not), so we call the complexity introduced by containers or Java accidental.

The essential complexity Brooks describes comes from the problem domain itself; unless the scope of the problem domain shrinks, it cannot be eliminated. Accidental complexity is introduced by the solution: choosing Java, choosing containers, choosing a middle platform, and so on.

We can also understand the two kinds of complexity through the notions of problem space (Problem Space) and solution space (Solution Space). The problem space is reality's initial state and desired state, plus a set of constraint rules (what we usually call "the business"); the solution space is the series of steps, designed and implemented by engineers, that lead from the initial state to the desired state. Inexperienced engineers often rush to write code before fully understanding the problem, which reflects a poor grasp of both spaces. The core reason domain-driven design has been so praised by engineers in recent years is that it directs everyone's attention to the problem space and to facing essential complexity head on. The subtitle of Eric Evans' 2003 book "Domain-Driven Design" is "Tackling Complexity in the Heart of Software"; I don't think that is an accident.

"The Mythical Man-Moon" was written in 1975, 47 years ago. Brooks believes that the essential complexity of software cannot be essentially reduced. He also believes that with the evolution of high-level programming languages ​​and the development of development environments, Occasionally the complexity will be substantially reduced. The first half of his conclusion is right, but the second half is wrong. Today we do have more advanced programming languages, more feature-rich frameworks, and more powerful IDEs, but everyone gradually finds that learning these tools has become an indispensable task. A small burden.

3. Explosion of complexity

As long as software is alive, with users using it and developers maintaining it, its complexity will almost certainly keep rising. Survival and growth mean commercial success: over time more people use the software, more functions are added, its value grows, and it brings the enterprise a steady stream of income. As explained earlier, the essential complexity of software comes from the problem space (the business), so the more functions are added, the more essential complexity the software inevitably contains. Moreover, every problem solved corresponds to a solution, and implementing that solution inevitably introduces new accidental complexity; for example, supporting payment may require implementing a new protocol to connect to a third-party payment system. Software complexity is a happy sort of trouble that every commercially successful company must face.

What differs from Brooks' era is that today's software has penetrated every aspect of human life, and Internet software of any scale serves millions or tens of millions of users. Alibaba's Double 11 peaked at 583,000 transactions per second in 2020; Netflix had 220 million subscribers in Q4 2021; TikTok announced in September 2021 that its monthly active users had passed 1 billion. Behind these amazing commercial successes stand indispensable, complex software systems, all of which face huge scalability challenges. A system serving one person and a system serving 100 million people differ enormously in complexity.

Essential complexity is one side of this; after all, more users means more functional features. But we cannot ignore the accidental complexity, the most typical of which is introduced by distributed systems. To support such a large user base, a system must be able to manage tens of thousands of machines (scheduling), manage user traffic (load balancing), manage communication between computing units (service discovery, RPC, messaging), and keep the service stable (high availability). Each of these topics could fill several books, and only after developers have initially mastered this knowledge can they design and implement a system scalable enough to serve users at this scale.

Compared with the complexity introduced by distributed systems, the expansion of the team is even more likely to cause a surge in complexity. The development teams of successful products often number in the hundreds, some reaching one or two thousand people. If the company lacks strict and clear hiring standards, and new hires receive no rigorous training in technical norms, then as everyone commits code to the repository in different styles and with different long- and short-term goals, the software's complexity rises sharply.

For example, a team member adds a Node.js component, out of personal preference, to a system that is otherwise entirely Java. When that member leaves, the component is pure accidental complexity for the remaining members who do not know Node.js.

For example, a new team member unfamiliar with the system, eager to ship a feature without disturbing the rest of the system, will naturally add a flag somewhere and sprinkle if-checks everywhere the behavior must change, instead of adapting the system design to fit the new problem space, a pattern sketched below.
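
To make that pattern concrete, here is a minimal JavaScript sketch (the flag name and functions are hypothetical, invented purely for illustration): one boolean leaks into every code path it touches, instead of the design absorbing the new requirement.

var newFeatureEnabled = true; // hypothetical global flag

function renderCart(cart) {
  if (newFeatureEnabled) { /* new rendering path */ } else { /* old path */ }
}

function priceOrder(order) {
  if (newFeatureEnabled) { /* new pricing rule */ } else { /* old rule */ }
}

function notifyUser(user) {
  if (newFeatureEnabled) { /* new notification */ } else { /* old notification */ }
}
// Each such flag doubles the number of execution paths a reader must understand and test.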

For example, different people use different names for the same domain concept in different modules of the system. The core meaning is identical, but each adds its own attributes, piling up a great deal of comprehension cost.

Complexity of this kind is not intrinsic to the software, yet it accumulates over time and imposes a huge cognitive load on developers. For long-lived software we must consider not only the size of the current team but everyone who has ever contributed code. No wonder programmers smile knowingly at the joke "ancestral code, do not touch!"

I enjoy learning all kinds of powerful programming languages, such as Ruby and Scala with their metaprogramming capabilities, and using those capabilities to indulge my interests and creativity. But I have reservations about using such languages in a production environment once a team reaches a certain size, because unless great effort goes into reviewing and policing code style, code written by ten people will very likely come in ten styles, and that growth in complexity is a disaster. Conversely, with a less flexible language like Java, it is harder for everyone to write wildly inconsistent code.
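
Even without full metaprogramming, a flexible language offers many ways to say the same thing. As a contrived JavaScript sketch of the "ten people, ten styles" problem (the code is hypothetical, for illustration only), here are three teammates summing order totals three different ways:

var orders = [{ amount: 10 }, { amount: 32 }];

// Style 1: imperative loop
function totalA(orders) {
  var sum = 0;
  for (var i = 0; i < orders.length; i++) sum += orders[i].amount;
  return sum;
}

// Style 2: functional one-liner
const totalB = (orders) => orders.reduce((s, o) => s + o.amount, 0);

// Style 3: patching a built-in prototype, the metaprogramming habit
Array.prototype.total = function () {
  return this.reduce((s, o) => s + o.amount, 0);
};

// All three are "correct", but a reader now carries three idioms in their head,
// and the prototype patch silently changes every array in the program.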

Team expansion brings another problem as well: in a large team, the goals of key stakeholders are themselves a key factor in software complexity. I have personally seen many cases where a simple solution clearly existed in the solution space, yet for exactly these reasons the parties had to choose a complex one. For example:

  • The plan only required changing System A directly, but because the team owning System A had no motivation to solve the problem, others had to take a detour and modify Systems B, C, and D instead.
  • The plan only required changing System A directly, but under pressure from System B's owner or their superior, the plan evolved into changing A and B together, and even introducing C.

Worse, for various reasons, entirely hypothetical problems are raised (that is, essential complexity that does not actually exist) and the software system is bent toward them for a while. In the end the individual's or team's goal is achieved, the software delivers no incremental value, and the complexity keeps growing all the same.

Therefore, as long as software is valuable, has users, and is maintained by developers, functions will keep being added; and commercially successful software is inevitably accompanied by growth in users and in the R&D team. These three factors keep pushing software complexity upward until it explodes, and R&D efficiency naturally sinks lower and lower. A core proposition software engineering must solve is how to control complexity so that efficiency does not fall too sharply. This is the war against software complexity.

4. Wrong ways to respond

Faced with continuously declining efficiency, R&D managers must do something. Unfortunately, many do not understand that the decline is caused by growing software complexity, and do not calmly trace the root causes of complexity spreading until it explodes. As a result we see many managers adopt superficial countermeasures that achieve little and may even backfire.

The most common mistake is to set an immovable deadline to force the development team to deliver features. But countless experiences tell us that software development is a trade-off within the triangle of quality, scope, and time. The team can buy some time in the short term through overtime and sacrificed vacations (long-term overtime is actively harmful), but if the deadline is too harsh, the scope of requirements or the software's quality must give. When the scope of requirements cannot be cut, the only thing left to sacrifice is quality, which in practice means pouring a large amount of accidental complexity into the system in a short period.

Another approach is to replace the existing system with "more advanced" technology: replacing a PHP + Golang stack with a Java microservices stack; replacing the existing microservices stack with middle-platform technology that has supported successful commercial products; or, more simply, replacing self-built open-source services with cloud products. The logic behind these moves is that "more advanced" technology has been proven in successful business scenarios and can therefore be applied directly to the problems at hand.

But in reality, decision makers often neglect to ask whether the current problem is one that "more advanced" technology can solve. If the system's user base is growing rapidly and scalability is in serious trouble, the answer is yes; if the system's stability is worrying, with frequent outages that seriously hurt the user experience, the answer is also yes. But if the problem is declining R&D efficiency, "more advanced" technology will not only help little, it will add accidental complexity to the system through the switch from old technology to new.

5. The correct technology strategy

Earlier I described several core drivers of complexity growth: growing business complexity, growing distributed-system scale, growing team size, and the goals of key stakeholders. Among these, the accidental complexity introduced by distributed systems is the easiest to eliminate. To explain why, let me briefly introduce the Wardley Map.

A Wardley Map is a tool for analyzing technology strategy, presented in the form of a map. Each component on the map can be understood as a software module. The vertical axis is value: the higher a component sits, the closer it is to user value. The horizontal axis is evolution: the further right, the closer to a mature commercial product.

For example, on such a map for a clothing business, Compute (computing resources) is provided by many mature cloud computing companies today, yet sits very far from the user value of the business context. Virtual Fitting sits very close to user value, because it makes users more confident that they have bought the right clothes; but the technology is clearly not a mature product, only a self-developed module, far from the stage of open commercialization.

Designing and developing a distributed system that supports millions or tens of millions of users is very challenging and introduces a great deal of complexity into the system; managing that complexity is itself a huge challenge. Fortunately, today's cloud vendors, including Alibaba, Amazon, Google, and Microsoft, have rich experience in this area and, through years of accumulation, have delivered that experience to the market as commercial products.

Analyzed through the Wardley Map, almost every business finds that the upper-left corner (close to direct user value, immature) must be self-developed, with its complexity borne in-house, while with the right software architecture the lower-right corner (far from direct user value, with ready-made commercial products) can be carved out and simply purchased. So today a qualified architect, unless working for a cloud vendor, should never invest in building databases, scheduling systems, message queues, distributed caches, and similar software. By buying them, the R&D team avoids that complexity entirely and can easily support user growth.

6. Complexity control at the micro level

The right technology strategy helps control system complexity at the macro level, but at the micro level we need a completely different approach. Before discussing it, let me quote a simple example from the book "Grokking Simplicity". (Interestingly, its subtitle, "Taming complex software with functional thinking", also declares war on complexity.)

Let's look at two functions (JavaScript):

function emailsForCustomers(customers, goods, bests) {
  var emails = [];
  for(var i = 0; i < customers.length; i++) {
    var customer = customers[i];
    var email = emailForCustomer(customer, goods, bests);
    emails.push(email);
  }
  return emails;
}

function biggestPurchasePerCustomer(customers) {
  var purchases = [];
  for(var i = 0; i < customers.length; i++) {
    var customer = customers[i];
    var purchase = biggestPurchase(customer);
    purchases.push(purchase);
  }
  return purchases;
}

At first glance there is nothing wrong with these two functions. Both prepare a return value, write a loop, and extract the needed data according to the specific business logic; the only difference is that the first function's business logic fetches each customer's email, while the second's fetches the largest purchase the customer has ever made. Yet even in code this simple there is complexity to be removed: to read either function, you must re-parse the for loop every time. Can this complexity be reduced further? Yes.

This ubiquitous pattern (traverse each element of a collection, process each one, and put the results into a new collection) can be abstracted into a map function. Assuming a map function is available in JavaScript, the code above can be written as:

function emailsForCustomers(customers, goods, bests) {
  return map(customers, function(customer) {
    return emailForCustomer(customer, goods, bests);
  });
}

function biggestPurchasePerCustomer(customers) {
  return map(customers, function(customer) {
    return biggestPurchase(customer);
  });
}
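
For completeness, here is a minimal sketch of the standalone map helper assumed above, in the spirit of the book's definition (modern JavaScript would simply use the built-in customers.map(...)):

function map(array, f) {
  var newArray = [];
  for (var i = 0; i < array.length; i++) {
    newArray.push(f(array[i])); // apply f to each element, collect the results
  }
  return newArray;
}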

Setting language syntax aside, apart from the map call, what remains of this code is the function names. Well-chosen names carry exactly the essential complexity, the business logic. Industry elders such as Martin Fowler, Kent Beck, and Robert C. Martin all emphasize the importance of naming in their books. They all want code to communicate intent clearly, and the core intent here should match the problem domain.

The code in this example is so simple that any programmer can understand it, yet even here there was room to reduce complexity; imagine how much could be eliminated from code accumulated over years. It reminds me of the teaching of an older colleague many years ago, who said excellent code should be:

  • It works
  • It is easy to understand
  • It is safe to change

In fact, achieving the second point is already a very high bar: it requires software engineers to design carefully, communicate requirements clearly, think about integration with legacy systems, and restrain the impulse to reach for new technology (new languages, new paradigms). The third point, in practice, teaches us to write unit tests seriously.

I wonder if you have experienced this feeling: after the requirements were discussed and settled, I wrote the corresponding code and unit tests. Concretely, I added a few hundred lines to a code base of tens of thousands of lines and three to five unit tests to a suite of about a thousand, then ran mvn clean test locally. The whole run took only a few minutes, and all the tests passed. At that point I submitted the code with full confidence, knowing the probability of it misbehaving in production was extremely low.
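
That feedback loop works at any scale, down to a single function. Here is a minimal sketch using Node.js's built-in assert module to test the emailsForCustomers function from earlier (emailForCustomer is stubbed so the example is self-contained; the stub is hypothetical):

var assert = require('assert');

// Stand-ins so the sketch runs on its own; in a real code base these
// come from the modules under test.
function map(array, f) {
  var newArray = [];
  for (var i = 0; i < array.length; i++) newArray.push(f(array[i]));
  return newArray;
}
function emailForCustomer(customer, goods, bests) {
  return { to: customer.email, subject: 'Your weekly deals' }; // hypothetical stub
}
function emailsForCustomers(customers, goods, bests) {
  return map(customers, function(customer) {
    return emailForCustomer(customer, goods, bests);
  });
}

// Real assertions, not print statements: they fail loudly, with no human eyeballing needed.
var emails = emailsForCustomers([{ email: 'a@example.com' }], [], []);
assert.strictEqual(emails.length, 1);
assert.strictEqual(emails[0].to, 'a@example.com');
console.log('all tests passed');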

At the core of that feeling is quality feedback: the shorter the feedback loop, the higher the efficiency; the longer the loop, the lower. Besides controlling complexity, software engineers must understand the importance of timely quality feedback. If a line of code is written and it takes hours or even days to learn there is a quality problem, the inefficiency is easy to imagine. So when organizations advocate unit testing from the top down, I am often puzzled by the strange phenomena that appear in practice, including:

  • Low-quality unit tests: tests with no asserts and print statements everywhere, requiring a human to eyeball the output.
  • Unstable unit tests: the code is fine but the tests fail, so the test suite cannot be trusted.
  • Very slow unit tests: a run takes dozens of minutes or even hours.
  • Unit tests generated from the code: sorry, I think these are pointless except for inflating coverage vanity metrics.

7. Software R&D ethics

Controlling software complexity at the micro level and writing unit tests carefully to ensure quality feedback are crucial to R&D efficiency, but they are also time-consuming and labor-intensive. And because the value of this investment takes a long time to show up in the business, R&D executives easily overlook it.

Developers produce intermediate software products: code, documents, API services. These intermediate products are gradually assembled into products that generate commercial value, so their quality is crucial to the overall efficiency of the R&D organization, and code and systems whose complexity is well controlled are high-quality intermediate products. Good software R&D ethics, sometimes called a good engineering culture, means a shared consensus in which everyone is proud to deliver high-quality intermediate products and ashamed to deliver low-quality ones.

One core responsibility in software R&D is to pay attention to software complexity: to make complexity information transparent through open code, documentation, and code review; to make every behavior that increases or reduces complexity visible; and to keep rewarding the behaviors that eliminate complexity. Only then can micro-level methods of controlling complexity actually take root.

8. The impact of system architecture on complexity

Between macro technology strategy and micro engineering culture lies an important decision-making area that also has a key impact on software complexity; I call it system architecture. Faced with a requirement, inexperienced engineers immediately think about solving it inside the modules they know, while experienced engineers first think about the system context. The excellent technical writing guide "Design Docs at Google" highlights that a design document should include a clear system-context-diagram. Why is that?

I recently sorted out and analyzed the dependency links of a legacy system. The system is responsible for managing all kinds of resources in the production environment, including resource specifications, versions, dependencies, and so on. When the sorting was done, the overall structure took me aback. Roughly, it looked like this:

The control and execution subsystems (System X, Y, Z), which handle things like container scheduling and image-change execution, were relatively clear. Not so the rest (A1, A2, A3, C1, C2, S, E): each of them manages the running version of some resource, including image version, container specification, whether there is a GPU, container count, associated network resources, and so on, yet they evolved into seven separate subsystems. That is very high accidental complexity. When the concepts of one domain are scattered across so many subsystems, a series of problems follows:

  • Different subsystems give the same concept different names, so every interaction involves translation (a sketch of this follows the list).
  • Different subsystems each make their own assumptions about the same entity, so changes require large-scale, error-prone modification.
  • Operation and maintenance costs rise.
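
A small JavaScript sketch of that translation tax (the subsystem and field names are hypothetical, invented for illustration): one "container spec" concept wears three names, so glue code must convert it at every boundary.

// Subsystem A1 calls the concept "instanceSpec"
var fromA1 = { instanceSpec: { cpuCores: 4, memGiB: 8 } };

// Subsystem C1 calls it "containerProfile", with different field names and units
function a1ToC1(res) {
  return {
    containerProfile: {
      cpu: res.instanceSpec.cpuCores,
      memoryMb: res.instanceSpec.memGiB * 1024
    }
  };
}

// Subsystem S calls it "resourceClaim", so yet another translator is needed
function c1ToS(res) {
  return {
    resourceClaim: {
      cores: res.containerProfile.cpu,
      memory: res.containerProfile.memoryMb
    }
  };
}
// Every translator is pure accidental complexity; a mistranslation in any of
// them surfaces as an execution error somewhere else.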

After carefully analyzing the causes of this complexity, I found it was neither a failure of technology strategy nor low-quality code from engineers at the micro level, but something deeper. The core factor is that these subsystems belonged to different teams at different times, some even to different departments. Specifically, when each team's goals diverge and the system has the misfortune of being split across those teams, no one is responsible for controlling its overall complexity. One team commercializes the system for external customers, another evolves it from virtual-machine mode to container mode, another controls resource costs, another thinks about high-availability architecture; but without a global architect guarding the concepts and boundaries from an overall perspective, the system naturally degenerates into this state.

When a problem domain has no system architecture, or its architecture is wrong, you will find different people inventing different languages, like two villages dozens of kilometers apart that use different words or pronunciations for the same concept. Imprecise language is not a problem in daily life, because everyday communication is full of context (expressions, atmosphere, environment), but in the computer world imprecise language means translation code must be written, and once a translation is wrong, the software fails with an execution error. This is why domain-driven design emphasizes ubiquitous language and bounded contexts. But domain-driven design is a methodology; knowing the method cannot substitute for the missing role of system architect.

This complex system is an excellent example of Conway's Law, which says: "Any organization that designs a system will produce a design whose structure is a copy of the organization's communication structure." That is somewhat abstract; a more concrete gloss is:

"Conway's law... is a reasonable sociological observation.... Unless the designers and implementers of module A and module B can communicate effectively, the two software modules cannot be connected correctly. Therefore, the interface structure of the software system must will correspond to the social structures and organizations that produce software systems.”

What Conway's Law reveals is that software architecture is largely determined by the organization's structure and collaboration model. That is no longer a software technology issue but an organizational management issue. So to solve complexity problems at the system architecture level, we must face the challenges of organizational management. Does each key problem domain have a single owner? When different teams build redundant systems in the same problem domain, how do we merge the teams? When an existing team, for its own survival, keeps exaggerating the importance and uniqueness of the system it owns, how do we recognize that? How can the organization give everyone enough psychological safety that engineers are willing, for the sake of architectural sanity, to give up the modules they labored over?

Discussing management may seem beyond the scope of an article on software complexity, but many engineers either vaguely sense, or eventually think their way to, the realization that this is the fundamental factor deciding whether our software systems end up elegant and robust or riddled with holes.

Summary

My former boss Guo Dongbai once discussed the traits of excellent architects in a QCon talk. Besides the well-understood ones, such as vision, clear thinking, and the ability to inspire, he particularly emphasized "having a conscience". He said:

Having a conscience is the most important quality an architect acquires over time. What does it mean to have a conscience? Being honest and choosing to do the right thing. Many people are very smart, understand the business deeply, and have rich technical practice, yet do not necessarily do the right thing for the company or the organization. Having a conscience matters enormously: an architect who lacks this quality can cause a company heavy losses.

Software complexity arises from human behavior. Whether it is attending to quality and engineering culture at the micro level, aligning organizational structure and communication with the objective problem domain at the system architecture level, or making decisions in the company's interest at the technology strategy level, there are objective laws at work that cannot be wished away. Recognizing those laws, making decisions within what can be changed and influenced, striving to create value for the company, and striving to earn every engineer respect: that is the basic attitude every engineer, architect, and technical manager should hold. This article's purpose in discussing software complexity is to try to reveal the objective laws behind it, in the hope of helping everyone see reality clearly, think and decide with a more pragmatic attitude, and build more valuable, more satisfying software systems.

Reference reading:

  1. Why choose Domain-Driven Design? - an article that clearly explains the relationship between essential complexity and domain-driven design.
  2. The Mythical Man-Month - "No Silver Bullet" explains the concepts of essential complexity and accidental complexity.
  3. The Lean Product Playbook - Chapter 2 clearly explains problem space and solution space.
  4. Wardley Map - an excellent tool for analyzing technology strategy; sensible adoption of commercial products can help reduce system complexity.
  5. Grokking Simplicity - reducing software complexity at the micro level with functional thinking.
  6. Design Docs at Google
  7. Conway's Law
  8. Beware of the Complexity Dilemma: Thoughts on Software Complexity

Author|Xiaobin

This article is original content from Alibaba Cloud and may not be reproduced without permission.
