Refactoring or rewriting? (2020 version)

Joel Spolsky (author of Joel on Software) once wrote a famous article, Things You Should Never Do (1), in which he argued that you should never rewrite a code base from scratch. He cited the example of Netscape, which spent several years rewriting its software and eventually died in the process. A year ago I re-read that article, and I still chose to rewrite our application from scratch. Yes, all of it. This post describes why we did it, how we succeeded, and some heuristics for deciding whether you should do it too.

The story starts in January 2019. At the time, Remesh was a much smaller company than it is now. We had hired some engineers: 5 focused on product development, and a small group was responsible for machine learning (ML) or DevOps. Despite this, development was very slow. Simple features took a long time to complete, the product had many known bugs that had not been fixed, and the product as a whole seemed not to have changed significantly for a long time.

It is important to understand why these problems occurred. The problem was not the people: we had excellent engineers, as the success of the new version later confirmed. The problem lay mainly in the code base and the process. The legacy code base did not match the team's skills or our business scenario. The process at the time also encouraged, and depended on, engineers who each knew only their own vertical slice; there were no "full stack" engineers.

State of the code base in January 2019

The old version of the application was designed for something completely different from the current version. Initially, Remesh let users hold a two-way conversation between entire groups, or between one person and a group. For example, you could let Democrats and Republicans talk to each other, get to know each other, and find common ground. Or you could let the mayor of a town talk to the town's citizens to better understand what they need, believe, and want. However, as we found product-market fit, the use cases changed. We converged on a single host talking to a group of people.

Because the requirements had changed, some of the old design no longer made sense, and the database schema needed significant changes. Beyond the database, the code base itself was hard to understand, because features had been bolted together by developers before any major refactoring took place. Test coverage was poorest exactly where refactoring was most needed, because that code was the oldest and had been written before good testing practices were established.

In addition, the languages and frameworks did not suit our team. The back-end code base was written in Elixir, and few developers were familiar with it. One of the front-end code bases used a very old version of Angular (I don't even want to know which version; the past is unbearable), and we had two more front ends written in React. Hardly any engineer was comfortable with even one of these technologies, let alone all three. The mismatch between the stack, the team, and our scenario made development very slow.

What are the options?

Needless to say, our code base needed a major change. When you are facing a pile of code that is hard to move forward, there are roughly three options:

  • Refactor it until all the problems are fixed.
  • Rewrite it all in one go.
  • Rewrite it gradually, one small area at a time.

For the front end, refactoring was not a suitable choice. The Angular version was so old that there was no clear upgrade path to a modern version of Angular (and, to be honest, no version of Angular held much appeal for us). Since we also expected major changes to the UI and the API, refactoring was not feasible. On the front end, then, the only real choices were a one-shot rewrite or a gradual rewrite in small pieces.

The back end had problems of its own: the existing patterns, language, and code base did not suit our scenario. We had chosen Elixir for its strong concurrency support, but in the end we did not need that feature, and it trapped us. The way concurrency is handled in the Erlang virtual machine makes profiling very difficult: you can see what computation is running, but not where it was called from. Good luck with performance tuning.

The Elixir code base also limited how much the machine learning engineers could contribute to the back end: they work in Python every day and do not have time to learn Elixir in depth. To make a long story short, we wanted to drop Elixir and switch to Python, so that the entire team could contribute back-end code. Python meets our needs and makes the code easier to profile and analyze.

We also had some "product debt". The old version introduced users to some new concepts. Once they got used to these concepts they grew attached to them, but the end result was not ideal. These were local maxima. To get past a local maximum and build something better, we had to make a major revision. Along the way, smaller iterations would likely keep meeting resistance from users, and removing those features would require changing many things at once.

In the end, the reasons for rewriting boiled down to the following factors:

  • We wanted every member of the team to be able to contribute to the back-end code base. Python is easy to learn and widely known on the team, so it suited us.
  • The old code base was very fragile and had few tests, which made refactoring it a difficult process.
  • We could improve efficiency by moving to a powerful framework like Django, and save time with the many things it provides out of the box (such as Django Admin).
  • We had an opportunity to build a brand-new version based on what we had learned from users, and then move customers to it in one step, instead of spending time explaining each small change to customers over a drawn-out 12 months. This also meant that training for our customer service and sales teams could be done as a single batch, rather than as a constant stream of new concepts.

We did quite extensive planning to reach this decision. For all our daily talk of agile and lean, this project was in practice a waterfall effort. That was not because we set out to follow a waterfall plan, but because we found that, although rewriting the application would take a lot of time, refactoring or piecemeal rewriting would take longer and carry much more uncertainty. Taking the refactoring route would have meant taking the greater risk.

In the end, we were confident in our decision, and every level of the company supported us. We decided to rewrite, fixing the mistakes of the past few years while moving the product forward.

Let the rewriting begin.

The progress

We started the rewrite in February 2019, once we had planned out the feature scope. As part of our due diligence, we made a very solid plan around the features we wanted to build. This goes against agile dogma, but having a plan that could be adjusted helped guide us along the way and showed us when we were going off track. Once we reached the testing phase with users (internal users and some external customers), we did end up deviating from the plan quite a lot; more on that later.

After some initial ups and downs, the actual process of building the new version was fairly smooth. Switching to a new technology stack is painful for engineers. Although we chose Python to keep the barrier to entry as low as possible, some people still had to learn it. Our back-end engineers had never touched Django (although our lead front-end engineer knew it well). Similarly, on the front end, many people knew React but few had in-depth experience with TypeScript. We chose TypeScript anyway (there is a story there, told later). After an initial learning period, we all became productive quickly.

This was the first lesson we validated: even with less experience in the new technology stack, we could build features faster. It took longer to establish that the productivity gain came from the new stack and the new code base rather than simply from working on a greenfield project, but we got there in the end.

The first thing we did was get everyone involved in the database. Since one of our goals was to break down information silos and have engineers understand as much of the stack as possible, we coached some front-end developers with no database design experience through thinking about and designing the initial version of the data model, and then iterated on it with the entire team. This gave them the ability to take part in database discussions. Although they had not been doing this kind of work for long, they showed real ability and asked some genuinely challenging questions.

After that, we moved quickly for a few months, rebuilding the familiar and interesting parts of the old version and continually refining them to make them more usable. We delivered a very good project in a reasonable time. The initial timetable was very optimistic, and we stayed on plan until around June. After that, we added and changed some features, because we knew the new version would not succeed without them. This slowed the project down, but real feedback from internal researchers, the customer service team, and some trusted users was necessary for the project's success.

Along the way, we achieved several things that I am proud of, not all of them technical.

  • The team grew dramatically. We started with 4 product development engineers and now have nine. That does not count hiring a complete QA/SDET team, expanding the machine learning engineering team, or hiring DevOps engineers. During this rapid growth, the project delays that usually come with adding people did not occur. On the contrary, we sped up (I think mainly because this was a brand-new project).
  • The rest of the company came to see the engineering team differently. At first we were a bit slow in delivering new features, but we could at least rewrite existing features quickly, and people could see new features being added quickly too. Once, we gave a cool demo in which we live-coded against Django's Admin to prove that we could do things much faster than before. It was only a small demonstration, but it was very effective.
  • We moved from a service-oriented architecture with multiple services to a monolithic architecture that relies on a single service, and we designed in fault tolerance and horizontal scalability from the beginning. This had been a big pain point before.
  • Iteration speed improved greatly, largely because we have a new architecture that fits our scenario, built on a technology stack that everyone is (now) willing to work in. The icing on the cake is that the machine learning team now can, and occasionally does, submit code to the production back end.

Key lessons

We believe we succeeded, although we certainly made some considerable mistakes along the way.

One reason we succeeded is that we had a clear vision from the beginning of what we wanted to build: a true MVP. We knew the old product was "viable", so we aimed to build that much or less, and we cut scope as needed to keep the goal clear. Although we did not "deliver on time", we also did not go the way of Netscape. The total project duration was less than twice the estimate (the estimate being the time to fully replicate the old product's features), and we ended up with a better product that includes some new features, such as uploading and sending videos and downloading automatically generated PowerPoint reports.

Another key to success was getting feedback early and often. During the rewrite, we frequently used the product internally and found critical bugs and performance issues. We also held regular company-wide demos to get quick feedback from customer success, sales, research, and early trial users who could tolerate a range of problems.

What did we do wrong? We introduced two technologies that we were not familiar with. We had used TypeScript in a prototype, but we had no deep expertise in it. Progress has been adequate, but we are not yet convinced that it has made us more productive or lowered our defect rate; time will tell whether statically typed languages are better (if anyone has a rigorous study on this, I would be happy to receive it).

The other mistake was using GraphQL. We have considerable experience with REST and Redux, but we had only used GraphQL in a prototype. Looking back, GraphQL made the initial prototype much faster to build, but the long-term price is that we disagree with some key design decisions in Apollo (for example, the front end does not expose a way to detect disconnection and reconnection in subscriptions), and performance tuning on the back end was a miserable experience... It was a very difficult month or two of my life, and I never want to go back. We are now migrating away from GraphQL. We are migrating performance-critical calls quickly, and will then slowly migrate the calls that are more tolerant of request latency.

The last thing to note is that a rewrite affects your team and its morale, and you have to respond actively. Starting a new project is exciting at first, but what follows is rebuilding existing features and fixing bugs, and after a while people get tired. I was very pleased to see my team move from rebuilding our existing features to developing new ones, and I also came to appreciate how exhausting a rewrite really is.

Part of the reason we completed the rebuild successfully is that we balanced developing new features against migrating old code. That said, I wish we had struck that balance better. Next time, I will make sure we have an early alpha testing plan, testing with a few trusted users to get regular feedback and encouragement and to keep everyone excited about the rebuild. I will also make sure we add plenty of new features early, rather than waiting until everyone is already a bit tired before introducing them. Some monotony is inevitable, but you can mitigate it.

Should you do this?

In my experience, you probably shouldn't do what I did, especially if those articles have convinced you that a rewrite is never the right decision. In any case, you should default to the "no rewrite" position, and then work very hard to show that not rewriting is the correct choice.

But there are several situations in which a rewrite may be reasonable:

  • Your architecture or data model is seriously out of step with your needs, and there is no clear migration path, so updating it incrementally would be very difficult.
  • These problems are seriously dragging down your team.
  • Your current technology stack limits how many engineers can contribute code, and training them on the stack is not feasible.

Even if all of these apply to you, you still have to look at the actual situation of your company and consider whether a rewrite makes sense for your company and your team.
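The checklist above can be written down as a small sketch. This is only an illustration of the heuristic, not a tool the team used. The class and field names are hypothetical, and requiring every signal to hold is an assumption: it is the strictest reading of "default to no rewrite".

```python
from dataclasses import dataclass


@dataclass
class RewriteAssessment:
    """Hypothetical checklist; the field names are illustrative only."""
    architecture_mismatch: bool      # architecture/model badly out of step with needs
    no_incremental_path: bool        # no clear path to migrate incrementally
    team_seriously_slowed: bool      # these problems seriously drag down the team
    stack_blocks_contributors: bool  # stack keeps many engineers from contributing
    training_infeasible: bool        # training people on the stack is not practical
    fits_company_context: bool       # a rewrite makes sense for company and team


def rewrite_worth_considering(a: RewriteAssessment) -> bool:
    """Return True only when every signal holds; the default answer is 'no rewrite'."""
    return all([
        a.architecture_mismatch and a.no_incremental_path,
        a.team_seriously_slowed,
        a.stack_blocks_contributors and a.training_infeasible,
        a.fits_company_context,
    ])
```

Even a True result here only means the rewrite deserves a serious plan and estimate; it does not make the decision for you.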

There may well be other cases where a rewrite makes sense. It is a hard position to defend, but it can be the right path, and it can be done successfully.

(1) https://www.joelonsoftware.com/2000/04/06/things-you-should-never-do-part-i/

Original English:
https://remesh.blog/refactor-vs-rewrite-7b260e80277a

Origin blog.51cto.com/14977574/2546122