01 - How to formulate performance tuning standards?

1. Why performance tuning?

An online product that has never been through a performance test is like a time bomb: you don't know when it will fail, and you don't know how much load it can withstand.

Some performance problems accumulate slowly over time and naturally erupt after a certain point. More are triggered by fluctuations in traffic, for example during promotional events or when the product's user base grows. Of course, a product may also languish after launch and never receive heavy traffic, in which case the time bomb simply has not gone off yet.

Now suppose your system is about to run a promotion, and the product manager or your boss tells you that hundreds of thousands of user visits are expected and asks whether the system can withstand the pressure. If you don't know your system's performance, you can only answer with trepidation: "There's probably no problem."

So whether to do performance tuning is actually an easy question to answer. Every system, once developed, has performance problems to some degree. The first thing we need to do is expose them, for example through stress testing and by simulating likely operating scenarios, and then solve them through performance tuning.

For example, you use an app to query a piece of information and have to wait more than ten seconds, or during a flash-sale event you cannot even open the activity page. As you can see, system response is the most direct indicator of system performance.

So if the system shows no response problems in production, does that mean we don't need performance optimization? Let me tell you another story.

My former employer's R&D department once hired an engineer everyone called "the master." Why the title? Because in his first year at the company he did exactly one thing: he cut the number of servers in half, and the system's performance indicators actually improved.

Good performance tuning not only improves a system's performance but also saves the company resources. That is the most direct purpose of tuning.

2. When to intervene in tuning?

Having answered why performance optimization is needed, a new question arises: if we are going to monitor and optimize a system's performance comprehensively, when should tuning begin? Is sooner always better?

In fact, in the early stages of a project we don't need to pay much attention to performance optimization. Doing so would wear us out: premature optimization not only fails to improve performance but also slows development and can even introduce new problems into the system.

At the code level we only need to ensure effective coding: reducing disk I/O operations, reducing the use of contended locks, using efficient algorithms, and so on. For more complex business logic, we can make full use of design patterns. For example, product pricing often involves many discount and red-envelope promotions; we can use the decorator pattern to design this business logic.
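As a minimal sketch of the decorator idea for pricing (all class names here are illustrative, not from the original text), each promotion wraps an existing price and adjusts the result, so promotions can be stacked freely:

```java
// Decorator pattern sketch for product pricing (hypothetical class names).
interface Price {
    double calculate();
}

class BasePrice implements Price {
    private final double amount;
    BasePrice(double amount) { this.amount = amount; }
    public double calculate() { return amount; }
}

// Each promotion wraps an existing Price and adjusts its result.
abstract class PriceDecorator implements Price {
    protected final Price inner;
    PriceDecorator(Price inner) { this.inner = inner; }
}

class DiscountDecorator extends PriceDecorator {
    private final double rate;      // e.g. 0.9 means 10% off
    DiscountDecorator(Price inner, double rate) { super(inner); this.rate = rate; }
    public double calculate() { return inner.calculate() * rate; }
}

class RedEnvelopeDecorator extends PriceDecorator {
    private final double deduction; // flat deduction from a red envelope
    RedEnvelopeDecorator(Price inner, double deduction) { super(inner); this.deduction = deduction; }
    public double calculate() { return Math.max(0, inner.calculate() - deduction); }
}

public class PricingDemo {
    public static void main(String[] args) {
        // A 100-yuan item with a 10% discount, then a 5-yuan red envelope.
        Price price = new RedEnvelopeDecorator(
                new DiscountDecorator(new BasePrice(100.0), 0.9), 5.0);
        System.out.println(price.calculate()); // prints 85.0
    }
}
```

The benefit for later tuning is that each promotion stays an independent, composable unit, so adding or removing one does not ripple through the pricing code.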

Once coding is complete, we can run performance tests against the system. At this stage the product manager will generally provide the expected production numbers; we stress-test on the agreed reference environment and use performance analysis and statistics tools to collect the various performance indicators, checking whether they fall within the expected range.

After the project goes live, we still need to watch for performance problems under real conditions, using log monitoring and performance-statistics logs. Once a problem appears, we analyze the logs and fix it promptly.

3. What reference factors can reflect the performance of the system?

Above we discussed how performance tuning fits into each stage of project development, and performance indicators have come up several times. So what exactly are these indicators?

Before looking at performance metrics, let's first understand which computer resources can become a system's performance bottleneck.

CPU: Some applications do heavy computation and occupy the CPU continuously for long stretches, so other tasks cannot get CPU time and respond slowly, causing system-wide performance problems. For example, infinite loops caused by unbounded recursion, catastrophic backtracking in regular expressions, frequent JVM full GCs, and the heavy context switching produced by multi-threaded code can all keep the CPU busy.
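To illustrate the regex-backtracking case, here is a small hypothetical sketch; the pattern `(a+)+b` and the input length are chosen only to make the exponential blow-up visible at a safe scale:

```java
import java.util.regex.Pattern;

public class BacktrackDemo {
    // Nested quantifiers make this pattern prone to catastrophic backtracking.
    static final Pattern BAD = Pattern.compile("(a+)+b");

    static boolean matches(String input) {
        return BAD.matcher(input).matches();
    }

    public static void main(String[] args) {
        // 18 'a's then a non-matching tail: before giving up, the engine tries
        // roughly 2^18 ways to split the 'a's between the two quantifiers.
        String input = "a".repeat(18) + "X";
        long start = System.nanoTime();
        boolean ok = matches(input);
        long micros = (System.nanoTime() - start) / 1_000;
        System.out.println("matched=" + ok + ", took " + micros + " us");
        // Each extra 'a' roughly doubles the time; at 30+ characters this
        // pins a CPU core for minutes, which is how one bad request can
        // exhaust CPU under load.
    }
}
```

A matching input such as `"aaab"` returns immediately; only the near-miss inputs trigger the blow-up, which is why this class of bug often survives testing.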

Memory: Java programs generally allocate and manage memory through the JVM, mainly storing the objects a program creates in the JVM heap. Heap reads and writes are very fast, so memory itself is rarely a read/write bottleneck. However, because memory costs more per byte than disk, its capacity is very limited by comparison. When the heap fills up and objects cannot be reclaimed, problems such as OutOfMemoryError and memory leaks appear.
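A quick way to watch heap pressure from inside the program is the standard `Runtime` API; this minimal sketch only reads the numbers, it does not diagnose anything by itself:

```java
public class HeapStats {
    // Snapshot of current JVM heap usage via the standard Runtime API.
    static long usedBytes() {
        Runtime rt = Runtime.getRuntime();
        return rt.totalMemory() - rt.freeMemory();
    }

    static long maxBytes() {
        return Runtime.getRuntime().maxMemory();
    }

    public static void main(String[] args) {
        System.out.printf("heap used: %d MB of max %d MB%n",
                usedBytes() / (1024 * 1024), maxBytes() / (1024 * 1024));
        // If "used" keeps climbing toward "max" across GC cycles, objects are
        // not being reclaimed: the classic symptom of a leak that ends in
        // OutOfMemoryError.
    }
}
```

In production you would get the same numbers (plus generation breakdowns) from JMX or tools such as `jstat`, but the trend to watch is the same.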

Disk I/O: Compared with memory, disk offers far more storage, but disk reads and writes are much slower. Even today's SSDs, for all their improvements, still cannot match memory's read/write speed.

Network: The network also plays a vital role in system performance. If you have ever bought cloud services, you have faced the step of choosing a network bandwidth tier. When bandwidth is too low, the network easily becomes the bottleneck for systems that transmit large volumes of data or handle high concurrency.

Exception: In Java applications, throwing an exception requires building an exception stack so the exception can be caught and handled, and this process is expensive. If exceptions keep being thrown and handled under high concurrency, system performance suffers noticeably.
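The cost is easy to see with a rough micro-benchmark; this is an illustrative sketch, not a rigorous JMH benchmark, so treat the numbers as order-of-magnitude only:

```java
public class ExceptionCost {
    // Time n exception throws versus n plain error-code updates.
    static long timeThrows(int n) {
        long start = System.nanoTime();
        for (int i = 0; i < n; i++) {
            try {
                throw new IllegalStateException("failed"); // captures a stack trace each time
            } catch (IllegalStateException ignored) { }
        }
        return System.nanoTime() - start;
    }

    static long timeReturns(int n) {
        long start = System.nanoTime();
        int errors = 0;
        for (int i = 0; i < n; i++) {
            errors += 1; // plain control flow, no stack capture
        }
        if (errors != n) throw new AssertionError(); // keep the loop live
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        int n = 100_000;
        System.out.println("throws:  " + timeThrows(n) / 1_000_000 + " ms");
        System.out.println("returns: " + timeReturns(n) / 1_000_000 + " ms");
        // The gap comes almost entirely from fillInStackTrace(): exceptions
        // are fine for exceptional cases but ruinous as routine control flow.
    }
}
```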

Database: Most systems use a database, and database operations often involve disk I/O. A large volume of database reads and writes can create a disk I/O bottleneck, which in turn delays database operations. For systems with heavy database traffic, database optimization is the core of tuning the whole system.

Lock contention: In concurrent programming we often have multiple threads reading and writing a shared resource. To keep such access atomic (that is, to stop another thread from observing a half-finished modification), we use locks. Using locks can introduce context switches, which impose overhead on the system. Since JDK 1.6, to reduce the context switching caused by lock contention, Java has repeatedly optimized the JVM's intrinsic locks, adding biased locking, spin locks, lightweight locks, lock coarsening, and lock elision. Using and optimizing lock resources well requires solid operating-system knowledge, a good grounding in Java multi-threaded programming, accumulated project experience, and the ability to apply all of that to real scenarios.
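As a minimal sketch of lock contention (class names and counts here are illustrative), several threads increment one counter guarded by `synchronized`; every blocked acquisition can cost a context switch:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class CounterDemo {
    // A shared counter guarded by an intrinsic lock (synchronized).
    static class SyncCounter {
        private long value;
        synchronized void inc() { value++; }
        synchronized long get() { return value; }
    }

    static long run(int threads, int incrementsPerThread) throws Exception {
        SyncCounter counter = new SyncCounter();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int t = 0; t < threads; t++) {
            pool.submit(() -> {
                for (int i = 0; i < incrementsPerThread; i++) counter.inc();
            });
        }
        pool.shutdown();
        pool.awaitTermination(30, TimeUnit.SECONDS);
        return counter.get();
    }

    public static void main(String[] args) throws Exception {
        // Four threads contend for one lock. The JVM's biased/lightweight
        // lock optimizations help mainly while contention stays low; under
        // heavy contention, threads park and context-switch.
        System.out.println(run(4, 100_000)); // prints 400000
    }
}
```

For a simple counter like this, `java.util.concurrent.atomic.AtomicLong` would avoid the lock entirely; the point of the sketch is only to show where contention overhead comes from.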

With this groundwork in place, we can derive the following indicators for measuring the performance of a typical system.

3.1. Response time

Response time is one of the most important indicators of system performance: the shorter, the better. An interface's response time is usually at the millisecond level. Within a system, we can break response time down, from bottom to top, into the following categories:

  • Database response time: the time consumed by database operations, often the largest share of the whole request chain;
  • Server response time: the time spent distributing the request (e.g. by Nginx) plus the time the server-side program takes to execute;
  • Network response time: the time network hardware spends on operations such as parsing the transmitted data during transport;
  • Client response time: for ordinary web and app clients this is negligible, but if your client embeds a lot of logic, the time it consumes can grow and become a system bottleneck.
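When measuring any of these layers, percentiles are more honest than averages for user-facing response time. A minimal sketch (the handler here is a stand-in, not a real service call):

```java
import java.util.Arrays;

public class ResponseTime {
    // Report a latency percentile from a sorted array of per-request timings.
    static long percentileMicros(long[] sortedMicros, double p) {
        int idx = (int) Math.ceil(p / 100.0 * sortedMicros.length) - 1;
        return sortedMicros[Math.max(0, idx)];
    }

    public static void main(String[] args) {
        int requests = 50;
        long[] latencies = new long[requests];
        for (int i = 0; i < requests; i++) {
            long start = System.nanoTime();
            simulatedHandler();                          // stand-in for a real call
            latencies[i] = (System.nanoTime() - start) / 1_000;
        }
        Arrays.sort(latencies);
        System.out.println("p50 = " + percentileMicros(latencies, 50) + " us");
        System.out.println("p99 = " + percentileMicros(latencies, 99) + " us");
    }

    static void simulatedHandler() {
        // Placeholder work; a real test would invoke the service under test.
        Math.sqrt(System.nanoTime());
    }
}
```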

3.2. Throughput

In testing we tend to focus on the TPS (transactions per second) of a system's interfaces, because TPS reflects interface performance: the higher the TPS, the better. Within the system, we can also divide throughput, from bottom to top, into two kinds: disk throughput and network throughput.
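The TPS arithmetic itself is trivial but worth pinning down; the numbers below are made up for illustration:

```java
public class TpsDemo {
    // TPS = completed transactions / elapsed seconds.
    static double tps(long completed, long elapsedMillis) {
        return completed / (elapsedMillis / 1000.0);
    }

    public static void main(String[] args) {
        // e.g. a stress test completes 12,000 requests in 30 seconds
        System.out.println(tps(12_000, 30_000)); // prints 400.0
    }
}
```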

Let's look at disk throughput first. Disk performance has two key metrics.

One is IOPS (Input/Output Operations Per Second), the number of I/O requests (reads or writes) the system can handle per unit of time. IOPS emphasizes random read/write performance and matters most for workloads with frequent small random accesses, such as small-file storage (images), OLTP databases, and mail servers.

The other is data throughput, the amount of data that can be successfully transferred per unit of time. For applications dominated by large sequential reads and writes of continuous data, such as video editing or VOD (Video On Demand) in broadcasting, data throughput is the key metric.

Next, network throughput: the maximum data rate a device can sustain during transmission without dropping frames. It depends not only on bandwidth but also on the processing power of the CPU, network card, firewall, external interfaces, and I/O; in practice it is determined mainly by the NIC's processing capability, the internal program's algorithms, and the available bandwidth.

3.3. Resource utilization

Resource usage is usually expressed as CPU usage, memory usage, disk I/O, and network I/O. These parameters are like the staves of a wooden barrel: if any one of them is the short board, or any is allocated unreasonably, the impact on overall system performance can be devastating.

3.4. Load-bearing capacity

As pressure on the system rises, observe whether the response-time curve climbs smoothly. This indicator gives you direct feedback on the maximum load the system can bear. For example, in a stress test, response time grows as concurrency grows, until the system can no longer handle the requests and starts throwing large numbers of errors; at that point it has reached its limit.
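A stepped load test can be sketched as follows; the workload here is a placeholder (a real test would call the service under test), and the concurrency steps are illustrative:

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

public class LoadStep {
    // Fire `concurrency` workers at a handler and report average latency,
    // so you can watch how response time curves as concurrency rises.
    static double avgLatencyMicros(int concurrency, int requestsPerWorker) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(concurrency);
        AtomicLong totalMicros = new AtomicLong();
        CountDownLatch done = new CountDownLatch(concurrency);
        for (int w = 0; w < concurrency; w++) {
            pool.submit(() -> {
                for (int i = 0; i < requestsPerWorker; i++) {
                    long start = System.nanoTime();
                    Math.log(System.nanoTime());     // stand-in for the real request
                    totalMicros.addAndGet((System.nanoTime() - start) / 1_000);
                }
                done.countDown();
            });
        }
        done.await(30, TimeUnit.SECONDS);
        pool.shutdown();
        return totalMicros.get() / (double) (concurrency * requestsPerWorker);
    }

    public static void main(String[] args) throws Exception {
        // Step up concurrency and watch where the latency curve bends:
        // a smooth climb is healthy, a sharp knee marks the limit.
        for (int c : new int[]{1, 2, 4, 8}) {
            System.out.printf("concurrency=%d avg=%.1f us%n", c, avgLatencyMicros(c, 1_000));
        }
    }
}
```

Real tools (JMeter, wrk, Gatling) do this properly, with warm-up, ramp schedules, and error accounting; the sketch only shows the shape of the measurement.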

4. Summary

Through today's study we know that performance tuning makes a system more stable and the user experience better, and in relatively large systems it can even save the company real resources.

But at the start of a project we should not intervene in performance optimization prematurely; during coding we only need to ensure that the program is well designed, efficient, and cleanly written.

After the project is complete, we can test the system, using performance indicators such as response time, throughput, resource utilization, and load-bearing capacity as the standards for tuning.

Looking back on my own projects (e-commerce systems, payment systems, and game recharge-and-billing systems), they served tens of millions of users and had to withstand various large flash-sale events, so I hold system performance to very strict requirements. Besides judging performance by the indicators above, it is also essential to guarantee system stability across update iterations.

Here I will pass along one more method: use the performance indicators of the previous version as the reference standard and, through automated performance testing, verify whether performance after an iterative release is abnormal. This is not just a comparison of direct indicators such as throughput, response time, and load capacity; you should also compare changes in indirect indicators such as CPU usage, memory usage, disk I/O, and network I/O.
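That baseline comparison can be sketched as a simple tolerance check; the metric names and the 10% tolerance below are made-up examples, and real pipelines would also handle "higher is better" metrics like throughput by inverting the comparison:

```java
import java.util.Map;

public class PerfRegression {
    // Flag any metric that regressed past `tolerance` (0.10 = 10%) versus
    // the previous version's baseline. Here "higher is worse" (latency,
    // CPU%); throughput-style metrics would need the inverted comparison.
    static boolean withinTolerance(Map<String, Double> baseline,
                                   Map<String, Double> current,
                                   double tolerance) {
        for (Map.Entry<String, Double> e : baseline.entrySet()) {
            double base = e.getValue();
            double cur = current.getOrDefault(e.getKey(), Double.MAX_VALUE);
            if (cur > base * (1 + tolerance)) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        Map<String, Double> baseline = Map.of("p99_ms", 120.0, "cpu_pct", 55.0);
        Map<String, Double> current  = Map.of("p99_ms", 125.0, "cpu_pct", 70.0);
        // cpu_pct rose about 27%, past the 10% tolerance: flag the release.
        System.out.println(withinTolerance(baseline, current, 0.10)); // prints false
    }
}
```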

Origin blog.csdn.net/qq_34272760/article/details/131770811