Performance testing: ideas for optimizing system architecture performance

Today, let's talk about analyzing, diagnosing, and optimizing performance problems in business systems. This article focuses on the key points of diagnosis and optimization when performance problems appear after a business system has gone live.

System performance problem analysis process

Suppose a business system showed no performance problems before going live but develops serious ones afterward. The likely causes usually fall into the following categories:

  • High concurrent access to the business creates performance bottlenecks

  • Data in the system database accumulates day by day after go-live, and bottlenecks appear once the data volume grows

  • Other key environmental changes occur, such as the network bandwidth constraints we often talk about

For this reason, when we find a performance problem, we first need to determine whether it occurs for a single user in a non-concurrent state or only under concurrency. Single-user performance issues are usually easier to reproduce and verify. For concurrent performance issues, we can run stress tests in the test environment to evaluate behavior under concurrency.
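To check whether a slowdown already exists for a single user or only appears under concurrency, one option is to time the same operation at concurrency 1 and at a higher level and compare the medians. The Python sketch below does this with the standard library only; `handle_request` and its lock are hypothetical stand-ins for a real call and a contended resource (for example a hot database row or an undersized pool), not part of any real system.

```python
import statistics
import threading
import time
from concurrent.futures import ThreadPoolExecutor

# Hypothetical contended resource (e.g. a hot row or an undersized pool):
# only one request can hold it at a time.
_shared_resource = threading.Lock()


def handle_request() -> None:
    """Hypothetical request: 5 ms of work while holding the shared resource."""
    with _shared_resource:
        time.sleep(0.005)


def timed_call(fn) -> float:
    """Wall-clock latency of a single call, in seconds."""
    start = time.perf_counter()
    fn()
    return time.perf_counter() - start


def measure(fn, concurrency: int, requests: int) -> list[float]:
    """Run `requests` calls on `concurrency` worker threads; return each latency."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        return list(pool.map(lambda _: timed_call(fn), range(requests)))


if __name__ == "__main__":
    single = measure(handle_request, concurrency=1, requests=20)
    loaded = measure(handle_request, concurrency=8, requests=20)
    print(f"single user:  median {statistics.median(single) * 1000:.1f} ms")
    print(f"8 concurrent: median {statistics.median(loaded) * 1000:.1f} ms")
```

If the single-user median is already slow, the problem is on the code or SQL path; if only the loaded median grows, contention in the database or middleware is the more likely suspect.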

If a single user already experiences performance problems, the cause usually lies in program code and SQL that need further optimization. If the problem appears only under concurrency, we need to further analyze the state of the database and the middleware to see whether the middleware needs performance tuning.

During the stress test, we also need to monitor CPU, memory, and the JVM to check for situations such as memory leaks where memory is never released. In other words, performance problems under concurrency can also stem from defects in the code itself.
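One common way to tell a leak from ordinary memory pressure is to look at old-generation occupancy measured right after each Full GC (for example from GC logs or `jstat -gc` output). Under steady load this baseline levels off, while a leak keeps pushing it up. The function below is a hypothetical heuristic along those lines; the 20% growth threshold and the three-sample minimum are illustrative assumptions, not standard values.

```python
def looks_like_leak(old_gen_after_full_gc_mb: list[float],
                    min_growth: float = 0.2) -> bool:
    """Heuristic: does post-Full-GC old-generation usage keep climbing?

    Flags a possible leak when there are at least 3 samples, no sample is
    lower than the one before it, and the last sample exceeds the first by
    more than `min_growth` (20% by default). Both thresholds are assumptions.
    """
    samples = old_gen_after_full_gc_mb
    if len(samples) < 3 or samples[0] <= 0:
        return False  # too little data to call it a trend
    never_drops = all(later >= earlier for earlier, later in zip(samples, samples[1:]))
    growth = (samples[-1] - samples[0]) / samples[0]
    return never_drops and growth > min_growth


if __name__ == "__main__":
    # Hypothetical samples in MB, one per Full GC during a stress test.
    print(looks_like_leak([500, 620, 750, 910, 1100]))  # True: baseline keeps rising
    print(looks_like_leak([500, 520, 490, 510, 505]))   # False: baseline is stable
```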

Analysis of factors affecting performance problems

In simple terms, the factors affecting performance fall into three areas: the hardware environment, the software operating environment, and the software program itself. Let's look at each in turn.

Hardware environment

The hardware environment is what we usually call compute, storage, and network resources.

For server computing power, vendors generally publish tpmC (TPC-C transactions per minute) figures as reference data, but in practice we see that x86 servers with the same tpmC rating still deliver less capability than minicomputers.

Besides the server's computing power, another key factor is the storage device, and what matters most there is I/O read/write performance. Sometimes monitoring shows CPU and memory staying high, but analysis reveals that the real bottleneck is I/O: because read/write throughput cannot keep up, large volumes of data cannot be persisted quickly, so the memory holding them cannot be released.
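The I/O case above can be checked directly on Linux. The aggregate `cpu` line of `/proc/stat` holds cumulative tick counters (user, nice, system, idle, iowait, irq, softirq, steal, ...), and a high share of `iowait` between two samples means the CPUs are idle waiting on disk rather than computing. This minimal Python sketch performs that calculation, similar in spirit to the `%iowait` column that `iostat` and `sar` report; the function names are our own.

```python
import os
import time


def parse_cpu_line(line: str) -> list[int]:
    """Parse the aggregate 'cpu' line of /proc/stat into cumulative tick counters."""
    fields = line.split()
    if not fields or fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line of /proc/stat")
    # First 8 counters: user nice system idle iowait irq softirq steal.
    # guest/guest_nice are already included in user/nice, so they are skipped.
    return [int(x) for x in fields[1:9]]


def cpu_breakdown(before: str, after: str) -> dict[str, float]:
    """Share of CPU time (percent) that was busy, waiting on I/O, or idle."""
    delta = [a - b for a, b in zip(parse_cpu_line(after), parse_cpu_line(before))]
    total = sum(delta)
    if total <= 0:
        raise ValueError("no CPU time elapsed between the two samples")
    idle, iowait = delta[3], delta[4]
    return {
        "busy": 100.0 * (total - idle - iowait) / total,
        "iowait": 100.0 * iowait / total,
        "idle": 100.0 * idle / total,
    }


def read_cpu_line(path: str = "/proc/stat") -> str:
    with open(path) as f:
        return f.readline()  # the first line is the aggregate 'cpu' line


if __name__ == "__main__" and os.path.exists("/proc/stat"):
    first = read_cpu_line()
    time.sleep(1)
    print(cpu_breakdown(first, read_cpu_line()))
```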

Linux also provides performance monitoring tools to support this analysis, such as the commonly used iostat, ps, sar, top, and vmstat. With these tools we can monitor CPU, memory, disk I/O, and per-process resource usage (including that of JVM processes) to find out where the real performance problem lies.

For example, when memory usage keeps triggering alarms, you must determine whether the cause is high concurrent load, a JVM memory leak, or a disk I/O bottleneck.

For an approach to monitoring and analyzing CPU, memory, and disk I/O performance, you can refer to:

Operating environment - database and application middleware

Database and application middleware performance tuning is another place where performance problems often arise.

Database Performance Tuning

Taking the Oracle database as an example, the factors that affect database performance include the system, the database, and the network. Database optimization covers disk I/O, rollback segments, redo logs, the System Global Area (SGA), and database objects.

Before tuning, you first need to monitor database performance.

We can set TIMED_STATISTICS=TRUE in the init.ora parameter file, or at the session level with ALTER SESSION SET TIMED_STATISTICS = TRUE. Start svrmgrl and log in with connect internal. During normal activity of the application system, run utlbstat.sql to start collecting statistics on system activity; after a period of time, run utlestat.sql to stop collecting. The results are written to the report.txt file. (This is the legacy procedure for older Oracle releases: svrmgrl and connect internal were removed in Oracle 9i, and newer versions provide Statspack and AWR reports for the same purpose.)
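Conceptually, the utlbstat.sql/utlestat.sql pair records Oracle's cumulative statistics at the start of the window, records them again at the end, and reports the difference; Statspack and AWR work on the same begin/end snapshot principle. The sketch below shows only that diffing step in Python. The counter names mirror statistics from Oracle's `V$SYSSTAT` view, but the values are made up, and fetching them from a real database is deliberately left out.

```python
def snapshot_delta(begin: dict[str, int], end: dict[str, int],
                   elapsed_s: float) -> dict[str, tuple[int, float]]:
    """Diff two snapshots of cumulative counters: (total delta, delta per second)."""
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    report = {}
    for name, end_value in end.items():
        delta = end_value - begin.get(name, 0)  # counters only grow between snapshots
        report[name] = (delta, delta / elapsed_s)
    return report


if __name__ == "__main__":
    # Hypothetical values for a few V$SYSSTAT statistics, taken 10 minutes apart.
    begin = {"user commits": 120_000, "redo size": 9_000_000_000, "physical reads": 4_000_000}
    end = {"user commits": 150_000, "redo size": 9_600_000_000, "physical reads": 4_900_000}
    for name, (delta, rate) in snapshot_delta(begin, end, elapsed_s=600).items():
        print(f"{name:>15}: {delta:>14,} total {rate:>14,.1f}/s")
```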

Database performance optimization should be continuous work. On one side, it means regularly checking the database's own performance and parameters; on the other, the DBA regularly extracts the most memory-consuming, inefficient SQL statements for developers to analyze further. Problems are also frequently surfaced through alarms on KPI indicators, as in the following example.

For example, we may find that the Oracle database raises a high memory usage alarm, and inspection shows that it is caused by a large volume of redo log generation. We then need to analyze further why the program produces so much redo, for example through heavy DML or frequent rollbacks.
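Extracting the most expensive statements, as the DBA does in the routine described above, usually comes down to ranking SQL by a cost metric such as logical reads per execution. Below is a Python sketch that assumes rows shaped like the `SQL_ID`, `BUFFER_GETS`, and `EXECUTIONS` columns of Oracle's `V$SQL` view; the sample rows and the choice of ranking metric are illustrative.

```python
from dataclasses import dataclass


@dataclass
class SqlStat:
    """One row shaped like V$SQL: SQL_ID, BUFFER_GETS, EXECUTIONS."""
    sql_id: str
    buffer_gets: int  # cumulative logical reads
    executions: int

    @property
    def gets_per_exec(self) -> float:
        # Guard against statements that were parsed but never executed.
        return self.buffer_gets / max(self.executions, 1)


def top_sql(stats: list[SqlStat], n: int = 3) -> list[SqlStat]:
    """The n statements with the most logical reads per execution."""
    return sorted(stats, key=lambda s: s.gets_per_exec, reverse=True)[:n]


if __name__ == "__main__":
    # Hypothetical sample rows.
    rows = [
        SqlStat("a1", buffer_gets=1_000, executions=10),
        SqlStat("b2", buffer_gets=90_000, executions=3),
        SqlStat("c3", buffer_gets=5_000, executions=0),
        SqlStat("d4", buffer_gets=40_000, executions=100),
    ]
    for s in top_sql(rows, n=2):
        print(s.sql_id, round(s.gets_per_exec))
```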

Application middleware performance analysis and tuning

The application middleware container is what we usually call WebLogic, Tomcat, and other application servers or web containers. Middleware tuning has two aspects: optimizing the container's own configuration parameters, and tuning the JVM memory startup parameters.

The container's own parameters mainly include JVM startup parameters, thread pool settings, and minimum and maximum connection counts. In a clustered environment, cluster-related configuration also needs tuning.

Tuning JVM startup parameters is often the key part of middleware tuning, but JVM parameters are generally analyzed together with the application.

Take the common JVM heap memory overflow (java.lang.OutOfMemoryError: Java heap space) as an example: if the program code has no memory leak, we need to consider increasing the heap size set when the JVM starts. On a 32-bit operating system the heap is capped at 4 GB at most (the 32-bit address-space limit; in practice the usable maximum is lower), while on a 64-bit operating system it can be set to 8 GB or even larger.

The main JVM startup parameters are described below:

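-Xmx  # maximum heap size
-Xms  # initial (minimum) heap size
-XX:MaxNewSize # maximum young generation size
-XX:NewSize    # initial (minimum) young generation size
-XX:MaxPermSize  # maximum permanent generation size (note: removed in JDK 8 and replaced by Metaspace; see -XX:MaxMetaspaceSize)
-XX:PermSize     # initial permanent generation size (note: removed in JDK 8 and replaced by Metaspace; closest equivalent is -XX:MetaspaceSize)
-Xss   # stack size of each thread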

For the overall Java heap size, set -Xmx and -Xms to 3-4 times the live objects in the old generation, that is, 3-4 times the old generation's occupancy after a Full GC. Set the permanent generation's PermSize and MaxPermSize to 1.2-1.5 times the permanent generation's own occupancy after a Full GC.

Set the young generation size (-Xmn) to 1-1.5 times the live objects in the old generation.

Set the old generation size to 2-3 times the live objects in the old generation.

Note that the newer JVM memory model (JDK 8 and later) has no permanent generation and uses Metaspace instead, so you need to consider the ratio between heap and Metaspace size as well as which garbage collector is in use.
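The sizing ratios above can be turned into a small calculator. This Python sketch takes the old-generation occupancy measured after a Full GC and the permanent generation (or Metaspace) occupancy after a Full GC, and returns low/high ranges in MB. It assumes the common formulation of this rule of thumb in which the permanent-generation range is based on its own post-Full-GC occupancy; treat the outputs as starting points for load testing, not final settings.

```python
def heap_sizing_mb(old_gen_live_mb: float, perm_live_mb: float) -> dict[str, tuple[int, int]]:
    """Rule-of-thumb (low, high) sizes in MB from occupancy measured after a Full GC.

    old_gen_live_mb: old-generation usage right after a Full GC.
    perm_live_mb:    permanent generation (or Metaspace) usage after a Full GC.
    """
    def span(base: float, low: float, high: float) -> tuple[int, int]:
        return round(base * low), round(base * high)

    return {
        "heap": span(old_gen_live_mb, 3, 4),     # -Xms / -Xmx
        "young": span(old_gen_live_mb, 1, 1.5),  # -Xmn
        "old": span(old_gen_live_mb, 2, 3),      # the remainder of the heap
        "perm": span(perm_live_mb, 1.2, 1.5),    # PermSize / MaxPermSize (or Metaspace)
    }


if __name__ == "__main__":
    for area, (low, high) in heap_sizing_mb(1024, 200).items():
        print(f"{area:>5}: {low}-{high} MB")
```

For example, 1024 MB of old-generation live data gives a heap range of 3072-4096 MB; at the top of that range this might become `-Xms4g -Xmx4g`, with the young generation somewhere between 1024 and 1536 MB.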

For JVM memory overflow problems, I wrote a dedicated analysis article earlier that you can refer to:

From symptom to root cause: the whole process of analyzing and solving a JVM memory overflow problem in a software system

Software program performance problem analysis

The first point to emphasize is that when we find a performance problem, our first instinct is often to add resources. However, most performance problems are caused not by insufficient resources but by obvious defects in how the program is implemented.

For example, we often see code that creates connections inside large loops, resources that are not released after use, and inefficient SQL statements.
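The first two anti-patterns are easy to see side by side. In this Python sketch, `FakeConnection` is a hypothetical stand-in for an expensive resource such as a database connection, and `ConnectionPool` is a deliberately minimal fixed-size pool built on `queue.Queue`; production code would rely on the pooling provided by the middleware or the driver library instead.

```python
import queue
from contextlib import contextmanager


class FakeConnection:
    """Hypothetical stand-in for an expensive resource such as a DB connection."""
    opened = 0  # class-level counter of how many connections were ever created

    def __init__(self) -> None:
        FakeConnection.opened += 1

    def execute(self, sql: str) -> str:
        return f"ok: {sql}"


class ConnectionPool:
    """Minimal fixed-size pool: connections are created once, then reused."""

    def __init__(self, size: int) -> None:
        self._idle: queue.Queue = queue.Queue()
        for _ in range(size):
            self._idle.put(FakeConnection())

    @contextmanager
    def connection(self):
        conn = self._idle.get()  # blocks while every connection is in use
        try:
            yield conn
        finally:
            self._idle.put(conn)  # always handed back, even if the caller raises


def per_iteration_connections(n: int) -> None:
    """Anti-pattern: a new connection per loop iteration, never released."""
    for i in range(n):
        FakeConnection().execute(f"select {i}")


def pooled_connections(pool: ConnectionPool, n: int) -> None:
    """Borrow from the pool and return each connection after use."""
    for i in range(n):
        with pool.connection() as conn:
            conn.execute(f"select {i}")


if __name__ == "__main__":
    per_iteration_connections(1000)
    print("per-iteration:", FakeConnection.opened)  # 1000 connections created
    FakeConnection.opened = 0
    pooled_connections(ConnectionPool(size=5), 1000)
    print("pooled:", FakeConnection.opened)  # 5 connections created
```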

The best way to address these performance problems is still to control them in advance, through static code inspection tools applied early and through code reviews by the development team to find performance problems.

Every known problem must be written into the development team's coding standards to avoid repeating it.

Further thoughts on business system performance issues

For the performance optimization of business systems, beyond the standard analysis process and factors described above, let's discuss some other key considerations that performance issues raise.

Is performance testing before going live useful?

Sometimes you may wonder why a system that was performance-tested before launch still has performance problems after going live. The answer lies in the ways pre-launch performance testing may fail to truly simulate the production environment, specifically:

  • Can the hardware fully simulate the real environment? The best performance tests are often run directly in the production environment once it has been built.

  • Can the data volume simulate the actual scenario? In reality, multiple business tables usually already hold large accumulated volumes of data rather than being empty.

  • Can the concurrency simulate real scenarios? This requires both recording composite business scenarios and using multiple load-generating machines.
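A composite scenario is usually expressed as a transaction mix: each business operation gets a weight, ideally taken from production access logs, and the load generator draws requests in those proportions. The Python sketch below builds such a reproducible workload; the operation names and weights in `MIX` are hypothetical.

```python
import random
from collections import Counter

# Hypothetical transaction mix, ideally derived from production access logs.
MIX = {"browse_catalog": 0.60, "search": 0.25, "submit_order": 0.10, "run_report": 0.05}


def build_workload(mix: dict[str, float], total: int, seed: int = 42) -> list[str]:
    """Draw `total` operations in proportion to their weights (reproducible via `seed`)."""
    rng = random.Random(seed)
    names = list(mix)
    return rng.choices(names, weights=[mix[name] for name in names], k=total)


if __name__ == "__main__":
    print(Counter(build_workload(MIX, total=10_000)))
```

A fixed seed keeps runs comparable across test rounds; spreading the resulting request list over several load-generating machines is left to the load-testing tool.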

In practice, it is difficult to achieve all of the above when doing performance testing, so completely simulating the real production environment is hard. This is why many performance problems are only discovered after the actual launch.

Does elastic horizontal scaling of the system completely solve performance problems?

The second point is one we often discuss at length. When designing a business system's architecture, especially when addressing non-functional requirements, we often say that because the database and middleware use clustering, the system can scale elastically and horizontally. But does this elastic horizontal scaling really solve performance problems?

In fact, true unlimited elastic horizontal scaling is often hard to achieve for databases: even an Oracle RAC cluster is typically scaled to only 2 to 3 times the performance of a single node. Application clusters, by contrast, can usually scale elastically and horizontally, and the technology is relatively mature.

Even when the middleware can scale fully and elastically, performance problems may remain, because business data keeps accumulating as the system runs. In such cases, single-user access in a non-concurrent state is already slow; it does not become slow only once concurrency rises. This is why we often distinguish two cases:

  • When single-node access performance is normal, scale out the cluster to handle simultaneous access under high concurrency

  • When single-node access performance itself is poor, optimize single-node performance first
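Little's law (throughput = concurrency / latency) is a handy way to see why the two cases differ: adding nodes raises the throughput ceiling, but it does not shorten an individual request. The Python sketch below applies it; the `scaling_efficiency` parameter is an illustrative assumption reflecting that scale-out is rarely linear, as with the database clusters mentioned above.

```python
def cluster_capacity_rps(nodes: int, concurrency_per_node: int, latency_s: float,
                         scaling_efficiency: float = 1.0) -> float:
    """Throughput ceiling via Little's law: throughput = concurrency / latency.

    scaling_efficiency (0-1, an assumption) discounts every node beyond the
    first for coordination overhead; 1.0 means perfectly linear scale-out.
    Note that latency_s is an input: adding nodes never reduces it.
    """
    effective_nodes = 1 + (nodes - 1) * scaling_efficiency
    return effective_nodes * concurrency_per_node / latency_s


if __name__ == "__main__":
    print(cluster_capacity_rps(4, 50, latency_s=5.0))  # 40.0 requests/s, each still 5 s
    print(cluster_capacity_rps(4, 50, latency_s=0.5))  # 400.0 after fixing the slow path
```

With 4 nodes of 50 worker threads each and a 5 s request, the cluster tops out at 40 requests/s while every user still waits 5 s; cutting the single-request time to 0.5 s lifts the same cluster to 400 requests/s.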

Classification of business system performance diagnosis

For business system performance diagnosis, from a static point of view, we can classify problems into the following three layers:

  • Operating system and storage layer

  • Middleware layer (including the database and application server middleware)

  • Software layer (including SQL and stored procedures, the logic layer, the front-end presentation layer, etc.)

When a specific application function in a business system has a problem, we can also take a dynamic view: follow the code and hardware infrastructure that an actual request passes through from the moment it is called, and use a segmentation method to locate the problem step by step.

For example, a common case is a slow query function. The first step is to check whether the corresponding SQL statement is slow when executed directly against the database. If the SQL itself is slow, the SQL statement needs to be optimized. If the SQL is fast but the query function is still slow, check whether the problem lies in the front end or in the cluster.
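The segmentation method can be as simple as timing each stage of one request and comparing the results. Below is a minimal Python sketch; `SegmentTimer` and the stage names (`validate`, `sql`, `render`) are hypothetical, and in a real system the same measurements usually come from logs or an APM tool.

```python
import time
from contextlib import contextmanager


class SegmentTimer:
    """Accumulates wall-clock time per named stage of a single request."""

    def __init__(self) -> None:
        self.durations: dict[str, float] = {}

    @contextmanager
    def segment(self, name: str):
        start = time.perf_counter()
        try:
            yield
        finally:
            elapsed = time.perf_counter() - start
            self.durations[name] = self.durations.get(name, 0.0) + elapsed

    def slowest(self) -> str:
        """Name of the stage that consumed the most time."""
        return max(self.durations, key=self.durations.get)


if __name__ == "__main__":
    timer = SegmentTimer()
    with timer.segment("validate"):
        time.sleep(0.001)  # hypothetical input checks
    with timer.segment("sql"):
        time.sleep(0.050)  # hypothetical slow query
    with timer.segment("render"):
        time.sleep(0.005)  # hypothetical page rendering
    print(timer.slowest(), {k: round(v * 1000, 1) for k, v in timer.durations.items()})
```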

Software code problems are a performance factor that cannot be ignored

For business system performance issues, our first thought is often to expand hardware capacity, such as adding CPU and memory to the database or expanding clusters. In practice, however, the performance problems of many applications are caused not by hardware but by the software code. I have covered common code-level performance problems in previous blog posts; typical ones include:

  • Initializing large objects, database connections, etc. inside loops

  • Memory leaks caused by not releasing resources

  • Not using caching to improve performance where the scenario calls for it

  • Long-running transactions that tie up resources

  • Not choosing the optimal data structure or algorithm for a given business scenario or problem
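Two of the items above, missing caching and a poor data-structure choice, are easy to illustrate in Python. `functools.lru_cache` memoizes a hypothetical expensive lookup, and the two `blocked_orders_*` functions show how replacing a list scan with a set lookup changes the complexity; the function names and data are made up for illustration. Caching is only safe when the cached data changes rarely or can tolerate some staleness.

```python
from functools import lru_cache


@lru_cache(maxsize=1024)
def product_category(product_id: int) -> str:
    """Hypothetical expensive lookup (e.g. a DB query) whose result rarely changes."""
    return f"category-{product_id % 7}"


def blocked_orders_slow(order_ids: list[int], blocked: list[int]) -> list[int]:
    # `in` on a list scans it: O(len(order_ids) * len(blocked)) overall.
    return [o for o in order_ids if o in blocked]


def blocked_orders_fast(order_ids: list[int], blocked: list[int]) -> list[int]:
    blocked_set = set(blocked)  # built once: O(len(blocked))
    # `in` on a set is a hash lookup: O(1) on average, so O(len(order_ids)) overall.
    return [o for o in order_ids if o in blocked_set]


if __name__ == "__main__":
    for pid in [1, 2, 1, 1, 2, 3]:
        product_category(pid)
    print(product_category.cache_info())  # hits=3, misses=3
    orders, blocked = list(range(10_000)), list(range(0, 10_000, 10))
    print(blocked_orders_fast(orders, blocked) == blocked_orders_slow(orders, blocked))
```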

These are some common software code performance problems, and they often need to be found through code review. Therefore, a comprehensive performance optimization effort must include troubleshooting performance problems in the software code.

Identify performance issues through IT resource monitoring or APM tools

There are generally two ways to discover performance problems: proactively, through monitoring of IT resources plus APM performance monitoring and early warning, or reactively, through feedback from business users during use.

APM (application performance management) mainly refers to monitoring and optimizing an enterprise's key business applications to improve their reliability and quality, ensure that users receive good service, and reduce IT total cost of ownership (TCO).

Resource pool -> application layer -> business layer

This can be understood as a key point of APM. Traditional network management and monitoring software works mostly at the resource and operating system level, covering the usage and utilization of compute and storage resources and the performance of the network itself. However, it is difficult for such tools to map resource-layer issues to specific applications and business functions.

In the traditional model, when CPU or memory is fully loaded, it is often hard to determine which application, process, business function, or SQL statement caused it. In real performance optimization work, a great deal of log analysis and problem location is usually needed before the root cause is finally found.

For example, in a recent project, combining APM with service call-chain monitoring let us quickly find which service call had a performance problem, or quickly pinpoint which SQL statement was slow. This helps us analyze and diagnose performance problems quickly.

Resources carry the applications; the application layer includes the database, the application middleware container, and the front end; and on top of the applications sit specific business functions. Therefore, a core idea of APM is to integrate, analyze, and connect resources -> applications -> functions.

With the advance of DevOps and automated operations, we want to discover performance problems proactively through tools such as APM. The biggest advantage of APM tools is that they can analyze performance across the entire service call chain, so we can find exactly where a problem occurs. For example, if submitting a form is very slow, APM analysis makes it easy to find which business service call is slow or which SQL statement takes too long to process. This greatly improves the efficiency of performance analysis and diagnosis.
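Call-chain analysis of this kind boils down to a tree of timed spans (the model used by distributed tracing systems), where each span records its parent. The slow step is the span with the largest self time, meaning its own duration minus that of its children. Below is a minimal Python sketch with a hypothetical trace of a slow form submission; it assumes child spans run sequentially, since overlapping parallel children would make the subtraction misleading.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    span_id: str
    parent_id: Optional[str]  # None for the root span
    name: str
    duration_ms: float


def self_times(spans: list[Span]) -> dict[str, float]:
    """Per span_id: own duration minus the total duration of its direct children."""
    child_total: dict[str, float] = {}
    for s in spans:
        if s.parent_id is not None:
            child_total[s.parent_id] = child_total.get(s.parent_id, 0.0) + s.duration_ms
    return {s.span_id: s.duration_ms - child_total.get(s.span_id, 0.0) for s in spans}


def hotspot(spans: list[Span]) -> str:
    """Name of the span where the most time was spent in the span itself."""
    own = self_times(spans)
    return max(spans, key=lambda s: own[s.span_id]).name


# Hypothetical trace of a slow form submission.
TRACE = [
    Span("1", None, "POST /form/submit", 1200.0),
    Span("2", "1", "order-service.create", 1100.0),
    Span("3", "2", "SQL: insert into orders", 950.0),
    Span("4", "2", "inventory-service.reserve", 100.0),
]

if __name__ == "__main__":
    print(hotspot(TRACE))  # SQL: insert into orders
```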


Origin blog.csdn.net/jiangjunsss/article/details/131456445