Performance Bottleneck Analysis with Java Thread Stacks

Improved performance means doing more with less. To take advantage of concurrency, we must use the existing processor resources more efficiently, which means keeping the CPU as busy as possible (busy doing useful work, that is, not burning cycles on useless computation). If the program is limited by the CPU's computing power, overall performance can be raised by adding processors or by clustering. In general, improving performance only requires addressing the currently constrained resource, which may be:

  • CPU: if CPU utilization is already close to 100% and the business logic cannot be simplified further, the system has reached its performance ceiling, and performance can only be improved by adding processors
  • Other resources: for example, the number of connections. In this case you can often modify the code to drive the CPU harder and get a large performance improvement

If your system has the following characteristics, it means that the system has a performance bottleneck:

  • As pressure on the system gradually increases, CPU usage cannot approach 100%

  • The application runs slowly, continuously or intermittently, and overall response time cannot be improved by changing environmental factors (load, number of connections, etc.)

  • System performance degrades gradually over time: under a stable load, the longer the system runs, the slower it gets. Once some threshold is exceeded, the system may start failing frequently, ending in deadlock or crash
  • System performance decreases gradually with increasing load.

A good program should be able to make full use of the CPU. If, on a single-CPU machine, no amount of pressure can drive CPU usage close to 100%, the program's design has a problem. The performance bottleneck analysis of a system roughly proceeds as follows:

  1. First analyze the performance bottlenecks of a single process, and optimize that process in isolation.
  2. Then perform a whole-system bottleneck analysis: even if each individual process is optimal, the system as a whole may not be. In multi-threaded situations, lock contention can also degrade performance.

High performance has different meanings in different applications:

  1. In some cases, high performance means the speed the user experiences, e.g. the responsiveness of interface operations
  2. In some cases, high throughput means high performance. For SMS or MMS, for example, the system cares about throughput and is not sensitive to the processing time of each individual message
  3. In some cases, it is a combination of the two

The ultimate goal of performance tuning is that the CPU utilization of the system is close to 100%. If the CPU is not fully utilized, there are the following possibilities:

  1. Insufficient pressure applied
  2. There is a bottleneck in the system

1 Common performance bottlenecks

1.1 Resource contention due to improper synchronization

1.1.1 Two unrelated functions share a lock, or different shared variables share the same lock, creating unnecessary resource contention

The following is a common mistake:

class MyClass {
    Object sharedObj;
    synchronized void fun1() {...} // accesses shared variable sharedObj
    synchronized void fun2() {...} // accesses shared variable sharedObj
    synchronized void fun3() {...} // does NOT access sharedObj
    synchronized void fun4() {...} // does NOT access sharedObj
    synchronized void fun5() {...} // does NOT access sharedObj
}

The code above marks every method of the class synchronized, violating the principle that a lock should protect specific shared state. Methods that touch no shared resource end up using the same lock, artificially causing unnecessary waits. Because Java provides the implicit `this` lock by default, many people like to put synchronized directly on methods; in many cases this is inappropriate, and done without careful thought it easily makes the lock granularity too coarse:

  • Two unrelated methods (ones not using the same shared variable) share the `this` lock, causing artificial resource contention
  • Even inside a method, not all code needs lock protection. Making the entire method synchronized is likely to expand the synchronized scope artificially; locking at the method level is a coarse habit of lock usage.

The code above should instead be:

class MyClass {
    Object sharedObj;
    synchronized void fun1() {...} // accesses shared variable sharedObj
    synchronized void fun2() {...} // accesses shared variable sharedObj
    void fun3() {...} // does not access sharedObj
    void fun4() {...} // does not access sharedObj
    void fun5() {...} // does not access sharedObj
}
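The same principle can be taken one step further: instead of relying on the implicit `this` lock, give each independent piece of shared state its own lock object, so that unrelated operations never contend. A minimal sketch (class and method names are illustrative, not from the original code):

```java
// Sketch: one private lock per independent piece of shared state.
class FineGrainedLocks {
    private final Object sharedObjLock = new Object();
    private final Object counterLock = new Object();

    private Object sharedObj;
    private long counter;

    void touchShared() {
        synchronized (sharedObjLock) {   // protects sharedObj only
            sharedObj = new Object();
        }
    }

    long increment() {
        synchronized (counterLock) {     // protects counter only; never blocks touchShared()
            return ++counter;
        }
    }
}
```

With this split, a thread calling `increment()` is never delayed by a thread inside `touchShared()`, which is exactly the contention the all-synchronized version created for no reason.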

1.1.2 The lock granularity is too large: after the access to the shared resource is finished, the subsequent code is not moved outside the synchronized block

This makes the current thread hold the lock for too long, forcing other threads that need the lock to wait, which ultimately hurts performance badly.

void fun1()
{
    synchronized(lock) {
        ...... // access the shared resource
        ...... // other time-consuming operations, unrelated to the shared resource
    }
}

The code above makes one thread hold the lock for a long time while other threads can only wait. How much room for improvement there is depends on the situation:

  • On a single CPU, taking the time-consuming operation out of the synchronized block helps in some cases but not in others:
    • If the time-consuming code in the block is CPU-intensive (pure computation, with no low-CPU work such as disk or network I/O), the CPU is already at 100% while executing it, so shrinking the synchronized block brings no performance gain. It brings no penalty either.
    • If the time-consuming code is low-CPU work such as disk or network I/O, the CPU is idle while the current thread waits on it. Keeping the CPU busy during that time raises overall throughput, so in this scenario moving the time-consuming code outside the block definitely improves overall performance.
  • On multiple CPUs, taking time-consuming operations out of the synchronized block always improves performance:
    • If the code is CPU-intensive, another CPU may be idle; shrinking the synchronized block lets other threads execute that code immediately on the idle CPU, improving performance.
    • If the code is low-CPU work such as disk or network I/O, there is always idle CPU time to reclaim, so moving it outside the block definitely improves overall performance.

In any case, reducing the synchronization scope has no adverse effect on the system and in most cases improves performance, so the scope should always be minimized. The code above should become:

void fun1()
{
    synchronized(lock) {
        ...... // access the shared resource
    }
    ...... // other time-consuming operations, unrelated to the shared resource
}
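Concretely, the pattern might look like the following minimal sketch, where an expensive formatting step (a stand-in for "time-consuming work unrelated to the shared resource"; the `LogWriter` class is hypothetical) runs outside the lock and only the shared-buffer update stays inside it:

```java
// Sketch: keep only the shared-state access inside the synchronized block.
class LogWriter {
    private final Object lock = new Object();
    private final StringBuilder buffer = new StringBuilder();  // shared state

    int append(String event) {
        // Expensive, lock-independent work: done outside the lock,
        // so other threads can append concurrently up to this point.
        String line = event.toUpperCase() + "\n";

        synchronized (lock) {
            buffer.append(line);       // only the shared buffer needs protection
            return buffer.length();
        }
    }
}
```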

1.1.3 Other issues

  • Abuse of sleep(), especially sleeping inside a polling loop, makes the delay obvious to users; this can be changed to wait()/notify()
  • Abuse of String concatenation with +: each + creates a temporary object and copies the data
  • Inappropriate threading model
  • Inefficient SQL statements or improper database design
  • Poor performance caused by inappropriate GC parameter settings
  • Insufficient number of threads
  • Frequent GCs due to memory leaks
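The first point above, replacing a sleep-based polling loop with wait()/notify(), can be sketched as follows. The `Mailbox` class is an illustrative example, not from the source; the key points are the `while` guard against spurious wakeups and `notifyAll()` waking waiters immediately instead of after a fixed sleep interval:

```java
// Sketch: a consumer blocks in wait() and is woken the moment data arrives.
class Mailbox {
    private final Object lock = new Object();
    private String message;                 // shared state guarded by lock

    void put(String msg) {
        synchronized (lock) {
            message = msg;
            lock.notifyAll();               // wake any thread blocked in take()
        }
    }

    String take() throws InterruptedException {
        synchronized (lock) {
            while (message == null) {       // loop guards against spurious wakeups
                lock.wait();                // releases the lock while waiting
            }
            String msg = message;
            message = null;
            return msg;
        }
    }
}
```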

2 Means and tools for performance bottleneck analysis

All of the performance bottlenecks caused by the reasons above can be traced to their root cause through thread stack analysis.

2.1 How to simulate and find performance bottlenecks

Several characteristics of performance bottlenecks:

  • There is only one performance bottleneck at a time, and only after it is resolved does the next one surface; if the current bottleneck is not solved, the next one will not appear. For example, if the second stage of a pipeline is the bottleneck, then once it is fixed the first stage may become the new bottleneck. Bottlenecks are therefore found and fixed iteratively, one after another.

  • Performance bottlenecks are dynamic: a spot that is not a bottleneck under low load may become one under high load. Because profilers such as JProfiler attach to the JVM and add overhead, the system may never reach the load at which the bottleneck appears. Thread stack analysis is the truly effective method in this scenario.

Given these characteristics, when simulating a performance bottleneck, be sure to apply slightly more pressure than the current system can handle; otherwise the bottleneck will not appear.

2.2 How to identify performance bottlenecks through thread stacks

Thread stacks make it easy to identify the bottlenecks that appear under high load in multi-threaded systems. Once a system has a performance bottleneck, the most important thing is to identify it, and then modify the code accordingly. In a typical multi-threaded system, first classify (group) threads by function, treating threads that execute the same function code as one group, and perform statistical analysis on that group's stacks. If one thread pool serves several different functional codes, the whole pool's threads can be analyzed as one group.
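The grouping step described above can be partly automated. The sketch below counts the threads in a jstack-style dump by their topmost stack frame; a dump where dozens of threads share the same top frame usually points straight at the bottleneck. The parsing is deliberately simplified and the class name is made up:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch: tally jstack threads by the first "at ..." frame under each
// thread header line (lines starting with a double quote).
class StackGrouper {
    static Map<String, Integer> groupByTopFrame(String dump) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        String topFrame = null;
        for (String line : dump.split("\n")) {
            String t = line.trim();
            if (t.startsWith("\"")) {
                topFrame = null;                  // new thread header
            } else if (t.startsWith("at ") && topFrame == null) {
                topFrame = t.substring(3);        // first frame of this thread
                counts.merge(topFrame, 1, Integer::sum);
            }
        }
        return counts;
    }
}
```

Running this over a real dump and sorting by count gives the "vast majority of threads in the same calling context" signal discussed below at a glance.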

Generally, once a system has a performance bottleneck, there are three most typical stack characteristics as follows:


  1. The stacks of the vast majority of threads show the same calling context, with very few idle threads left. Possible causes:
    • Too few threads
    • Lock contention caused by overly coarse lock granularity
    • Resource contention
    • Many time-consuming operations inside the lock scope
    • The remote peer being called is slow to respond
  2. Most threads are in a waiting state with only a few working threads, and overall performance will not go up. The likely cause is a critical path in the system that has reached its bottleneck.
  3. The total number of threads is very small (some thread pool implementations create threads on demand, so the expected threads may simply not have been created yet).

An example:

"Thread-243" prio=1 tid=0xa58f2048 nid=0x7ac2 runnable
   [0xaeedb000..0xaeedc480]
    at java.net.SocketInputStream.socketRead0(Native Method)
    at java.net.SocketInputStream.read(SocketInputStream.java:129)
    at oracle.net.ns.Packet.receive(Unknown Source)
    ... ...
    at oracle.jdbc.driver.LongRawAccessor.getBytes()
    at oracle.jdbc.driver.OracleResultSetImpl.getBytes()
    - locked <0x9350b0d8> (a oracle.jdbc.driver.OracleResultSetImpl)
    at oracle.jdbc.driver.OracleResultSet.getBytes(O)
    ... ...
    at org.hibernate.loader.hql.QueryLoader.list()
    at org.hibernate.hql.ast.QueryTranslatorImpl.list()
    ... ...
    at com.wes.NodeTimerOut.execute(NodeTimerOut.java:175)
    at com.wes.timer.TimerTaskImpl.executeAll(TimerTaskImpl.java:707)
    at com.wes.timer.TimerTaskImpl.execute(TimerTaskImpl.java:627)
    - locked <0x80df8ce8> (a com.wes.timer.TimerTaskImpl)
    at com.wes.threadpool.RunnableWrapper.run(RunnableWrapper.java:209)
    at com.wes.threadpool.PooledExecutorEx$Worker.run()
    at java.lang.Thread.run(Thread.java:595)

"Thread-248" prio=1 tid=0xa58f2048 nid=0x7ac2 runnable
   [0xaeedb000..0xaeedc480]
    at java.net.SocketInputStream.socketRead0(Native Method)
    at java.net.SocketInputStream.read(SocketInputStream.java:129)
    at oracle.net.ns.Packet.receive(Unknown Source)
    ... ...
    at oracle.jdbc.driver.LongRawAccessor.getBytes()
    at oracle.jdbc.driver.OracleResultSetImpl.getBytes()
    - locked <0x9350b0d8> (a oracle.jdbc.driver.OracleResultSetImpl)
    at oracle.jdbc.driver.OracleResultSet.getBytes(O)
    ... ...
    at org.hibernate.loader.hql.QueryLoader.list()
    at org.hibernate.hql.ast.QueryTranslatorImpl.list()
    ... ...
    at com.wes.NodeTimerOut.execute(NodeTimerOut.java:175)
    at com.wes.timer.TimerTaskImpl.executeAll(TimerTaskImpl.java:707)
    at com.wes.timer.TimerTaskImpl.execute(TimerTaskImpl.java:627)
    - locked <0x80df8ce8> (a com.wes.timer.TimerTaskImpl)
    at com.wes.threadpool.RunnableWrapper.run(RunnableWrapper.java:209)
    at com.wes.threadpool.PooledExecutorEx$Worker.run()
    at java.lang.Thread.run(Thread.java:595)
    ... ...

"Thread-238" prio=1 tid=0xa4a84a58 nid=0x7abd in Object.wait()
   [0xaec56000..0xaec57700]
    at java.lang.Object.wait(Native Method)
    at com.wes.collection.SimpleLinkedList.poll(SimpleLinkedList.java:104)
    - locked <0x6ae67be0> (a com.wes.collection.SimpleLinkedList)
    at com.wes.XADataSourceImpl.getConnection_internal(XADataSourceImpl.java:1642)
    ... ...
    at org.hibernate.impl.SessionImpl.list()
    at org.hibernate.impl.SessionImpl.find()
    at com.wes.DBSessionMediatorImpl.find()
    at com.wes.ResourceDBInteractorImpl.getCallBackObj()
    at com.wes.NodeTimerOut.execute(NodeTimerOut.java:152)
    at com.wes.timer.TimerTaskImpl.executeAll()
    at com.wes.timer.TimerTaskImpl.execute(TimerTaskImpl.java:627)
    - locked <0x80e08c00> (a com.facilities.timer.TimerTaskImpl)
    at com.wes.threadpool.RunnableWrapper.run(RunnableWrapper.java:209)
    at com.wes.threadpool.PooledExecutorEx$Worker.run()
    at java.lang.Thread.run(Thread.java:595)

"Thread-233" prio=1 tid=0xa4a84a58 nid=0x7abd in Object.wait()
   [0xaec56000..0xaec57700]
    at java.lang.Object.wait(Native Method)
    at com.wes.collection.SimpleLinkedList.poll(SimpleLinkedList.java:104)
    - locked <0x6ae67be0> (a com.wes.collection.SimpleLinkedList)
    at com.wes.XADataSourceImpl.getConnection_internal(XADataSourceImpl.java:1642)
    ... ...
    at org.hibernate.impl.SessionImpl.list()
    at org.hibernate.impl.SessionImpl.find()
    at com.wes.DBSessionMediatorImpl.find()
    at com.wes.ResourceDBInteractorImpl.getCallBackObj()
    at com.wes.NodeTimerOut.execute(NodeTimerOut.java:152)
    at com.wes.timer.TimerTaskImpl.executeAll()
    at com.wes.timer.TimerTaskImpl.execute(TimerTaskImpl.java:627)
    - locked <0x80e08c00> (a com.facilities.timer.TimerTaskImpl)
    at com.wes.threadpool.RunnableWrapper.run(RunnableWrapper.java:209)
    at com.wes.threadpool.PooledExecutorEx$Worker.run()
    at java.lang.Thread.run(Thread.java:595)
    ... ...

From the stack dump, 51 threads are performing socket access, of which 50 are JDBC database accesses; the remaining threads are blocked in java.lang.Object.wait().


2.3 Other ways to improve performance

Reduce the granularity of locks. For example, the (pre-Java 8) implementation of ConcurrentHashMap uses an array of 16 locks by default, each guarding a segment of the map. This has a side effect: an operation that must lock the entire container becomes laborious, since it has to acquire all the segment locks (a global lock can be added for that case).
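The lock-striping idea behind that segmented design can be illustrated with a minimal striped counter. This is a sketch, not the actual ConcurrentHashMap code; the class name and the stripe count of 16 are illustrative:

```java
// Sketch of lock striping: N locks guard N slices of the data, so threads
// hitting different slices never contend with each other.
class StripedCounter {
    private static final int STRIPES = 16;
    private final Object[] locks = new Object[STRIPES];
    private final long[] counts = new long[STRIPES];

    StripedCounter() {
        for (int i = 0; i < STRIPES; i++) locks[i] = new Object();
    }

    void add(Object key, long delta) {
        int stripe = (key.hashCode() & 0x7fffffff) % STRIPES;
        synchronized (locks[stripe]) {   // contend only within one stripe
            counts[stripe] += delta;
        }
    }

    // The side effect mentioned above: a whole-container operation must
    // take every stripe lock, which is comparatively expensive.
    long total() {
        long sum = 0;
        for (int i = 0; i < STRIPES; i++) {
            synchronized (locks[i]) { sum += counts[i]; }
        }
        return sum;
    }
}
```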

2.4 Ending conditions for performance tuning

Performance tuning needs a termination condition. Tuning can stop once the system meets both of the following:

  1. Algorithms are optimized enough
  2. No underutilization of CPU due to improper use of threads/resources

original:

http://www.klion26.com/2018/03/14/Java-%E5%86%85%E5%AD%98%E6%B3%84%E6%BC%8F%E5%88%86%E6%9E%90%E5%92%8C%E5%AF%B9%E5%86%85%E5%AD%98%E8%AE%BE%E7%BD%AE/
