The one line of code that gave us a 3000% performance boost

Author | Itamar Lechowicer 

Source | InfoQ    Translator | Xu Xuewen

Planning | Tina    Review | Wang Qiang

This article was originally published on Itamar Lechowicer's blog and is translated and shared by the InfoQ Chinese site with the original author's authorization.

Overview

Our company operates and maintains 15 web applications, and our main job is to deliver these data-driven applications on demand to support real-time decision-making.

These applications are expected to remain highly available under heavy load. The main web application is a large legacy multi-service system; most of its services are over 15 years old and have been refactored across several generations. Imagine: the people who originally wrote that code may have long since left or moved to other roles.

Over the past few years, our team's main goal has been to optimize the performance of these services. In this article I will share some of the key lessons we learned along the way and the reasoning behind the decisions we made at the time.

The moment that changed our thinking

During one incident, users ramped up their usage of our application, which significantly increased its traffic. They complained that performance was so poor they could not complete a full business process in the application. We turned to our monitoring tools to analyze the performance bottlenecks and found that the service was spending 90% of its response time just acquiring DB connections.

The DB itself looked fine, so we moved on to analyzing the application's DB connection pool. The analysis showed that every pod was using all of the available connections in its pool, so we suspected the service had a problem closing connections. We spent hours reviewing the code, trying to find where a connection was not being released. Eventually, one of our team leads discovered that the pod's liveness probe did not release its DB connection after performing a simple DB heartbeat request. We immediately added one line of code to the liveness probe handler to release that connection. The effect was dramatic: in the blink of an eye, the application's performance stabilized and users resumed normal work.
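To make the bug concrete, here is a minimal, hypothetical sketch of what such a leak can look like, assuming a Flask health endpoint backed by a psycopg2 connection pool; the names and endpoint are illustrative, not our actual code.

```python
# Hypothetical sketch: a liveness endpoint that borrows a connection from the
# pool for a heartbeat query. Before the fix, the connection was never returned,
# so every probe slowly drained the pool.
from flask import Flask
import psycopg2.pool

app = Flask(__name__)
pool = psycopg2.pool.SimpleConnectionPool(minconn=1, maxconn=20, dsn="dbname=app")

@app.route("/health")
def health():
    conn = pool.getconn()
    cur = conn.cursor()
    cur.execute("SELECT 1")   # simple DB heartbeat
    cur.close()
    pool.putconn(conn)        # <-- the missing "one line": return the connection to the pool
    return "OK", 200
```

With the liveness probe firing every few seconds on every pod, a leak like this exhausts the pool quickly even when real user traffic is modest.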

Just one day before this incident, we had run a load test to make sure the application could withstand the expected increase in usage, and the results indicated that performance was within the normal range. That conclusion turned out to be wrong, and it misled us into thinking there was nothing that needed fixing. We recognized we had to do better. The following are the main lessons and conclusions we drew from this event.

Conclusion 1: Don't use average wait time as a measure of service load - check your application's "tail" values

When users complained that the app was slow to respond, we found that the average wait-time metric had not changed significantly. Reviewing our metrics, we noticed something interesting: until then we had used the average request time as our primary measure of service waits.

So this time we graphed the 90th-percentile request wait time to see whether it would tell us more. Sure enough, the graph showed a dramatic increase in wait times at exactly the moments users were complaining that the app was slow.

The average wait time had not changed significantly because the many fast requests were pulling the average down. My suggestion is therefore not to rely on the average wait time, but to use the 50th, 90th, 95th, and 99th percentiles as your indicators of service response. It is important to check for "tail" values that are far outside the normal range, as illustrated below.
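As a quick illustration (with made-up numbers, not our production data), here is how a small fraction of slow requests can leave the mean looking healthy while the high percentiles explode:

```python
# Illustration: the mean hides tail latency. All values are synthetic.
import numpy as np

latencies_ms = np.concatenate([
    np.random.normal(120, 20, 9_500),    # most requests are fast (~120 ms)
    np.random.normal(4_000, 800, 500),   # a small fraction are very slow (~4 s)
])

print(f"mean: {latencies_ms.mean():.0f} ms")
for p in (50, 90, 95, 99):
    print(f"p{p}:  {np.percentile(latencies_ms, p):.0f} ms")
```

The mean lands around 300 ms, which looks acceptable, while p99 sits in the multi-second range - which is exactly what the complaining users are experiencing.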

Conclusion 2: Invest time, tools, and people in performance optimization

To maintain high application performance, we must have the following in place:

  1. Load tests and load scenarios - it is important to have realistic load tests and load scenarios ready to run (see the sketch after this list).

  2. Application performance monitoring (APM) tools - tools such as Dynatrace, AppDynamics, and Epsagon. An APM can save a great deal of time when investigating how a service behaves, so it is well worth installing at least one in the production environment.

  3. Effective logging - clear, useful logs are essential when investigating production outages and performance problems, so make sure your application's logs actually carry the information you will need.

  4. Log analysis tools - you cannot realistically read and search logs spread across many files, and this only gets harder when your service runs as a cluster. Take the time to put a log collection and analysis stack such as ELK, Grafana, or Splunk into production.

  5. Skilled people - none of the knowledge or tools above will help if your team has no one who knows how to use them.
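As an example of item 1, here is a minimal load-scenario sketch using Locust; the choice of tool, the endpoints, and the user behaviour are assumptions for illustration, not a description of our actual test suite.

```python
# Hypothetical load scenario: a user who mostly views dashboards and
# occasionally runs a report. Endpoints are illustrative only.
from locust import HttpUser, task, between

class ReportUser(HttpUser):
    wait_time = between(1, 3)   # think time between actions, in seconds

    @task(3)
    def view_dashboard(self):
        self.client.get("/dashboard")

    @task(1)
    def run_report(self):
        self.client.post("/reports", json={"range": "last_24h"})
```

You would run it with something like `locust -f loadtest.py --host https://staging.example.com --users 200 --spawn-rate 20` (the host and numbers are placeholders) and, per Conclusion 1, watch the percentile columns rather than the average.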

For complex systems, I therefore recommend dedicating people and time to this work (an SRE team, for example, is well suited to the job).

Conclusion 3: Old systems will die (unless we keep them alive)

As human beings, we all have the urge to create new things and to feel ownership of the products we create. In software, this leads to a familiar tension: on one hand there is an old system we need to maintain; on the other, a cool new system we want to build.

We then have to decide where to invest our time. When facing this trade-off, remember that if we stop developing and adding features to the old system, knowledge of that system will fade away over time.

Then, when a failure occurs or a new customer requirement arrives, we will struggle to deliver because we no longer understand the old system well enough. In other words, the system's MTTR (Mean Time To Repair) goes up as we lose knowledge of it.

My advice is to resist the urge to build something new and cool, and to invest that time in staying familiar with the old system and improving your ability to solve problems in it. The best way to stay familiar with an old system is to keep adding code to it.

Conclusion 4: Every line of code matters

Sometimes, while writing code, we forget that it will eventually run in production and support a real user's real work. In the incident described above, a programmer forgetting to release a DB connection - a single line of code - was enough to disrupt users' normal work (and users whose work is disrupted are probably reluctant to keep paying us).

My suggestions are:

Imagine (hard as it is) that somewhere on the other side of the world a user's work depends entirely on the code you write, and that every line of it affects their experience of the app.

Perform load testing as part of CI/CD. If you want to be sure your code stays highly available, load test every PR or release that is about to go into production.
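One way to wire this into a pipeline (a hedged sketch; the results-file format, file name, and thresholds are assumptions, not part of the original article) is a small gate script that fails the build when tail latency exceeds a budget:

```python
# Hypothetical CI gate: read a summary produced by the load-test step and
# fail the pipeline if tail latency exceeds the agreed budget.
import json
import sys

BUDGET_MS = {"p95": 800, "p99": 1500}   # illustrative budgets, not real SLOs

with open("loadtest_results.json") as f:  # assumed output of the load-test step
    results = json.load(f)

violations = {k: results[k] for k, limit in BUDGET_MS.items() if results[k] > limit}
if violations:
    print(f"Latency budget exceeded: {violations} (limits: {BUDGET_MS})")
    sys.exit(1)                           # non-zero exit fails the CI job
print("Latency within budget")
```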

When you find performance issues, be suspicious of every line of code; in our experience, any single character in your code can turn out to be the bottleneck.

Summary

This article covers the lessons we have learned about system performance optimization. I hope it helps you appreciate the risks hidden in performance defects.

In my opinion, application performance should be treated as the highest priority: a beautiful UI and a cool product mean nothing if end users cannot use the system.

These conclusions come from my day-to-day performance optimization work, and in my view they are the cornerstone of every successful performance effort. I hope you find them useful too.

Original: https://medium.com/@ilechowicer/how-every-code-line-matters-we-improved-performance-by-3000-c9ce858c39a8
