Diagnosing Common Database Performance Hotspots in Your Java Code

When I help developers or architects analyze and optimize the performance of a Java application, the goal is rarely to fine-tune individual methods to shave a microsecond or two off execution time. While microsecond optimizations matter for certain kinds of software, that is usually not where the effort pays off. Having analyzed hundreds of applications over the course of 2015, I found that most performance and scalability problems stem from poor architectural decisions, misconfigured frameworks, bad database access patterns, excessive logging, and excessive memory consumption with the garbage-collection impact that follows.

In my opinion, the essence of performance engineering is correlating key architectural, scalability, and performance metrics across a large number of observations, and finding regressions or bottlenecks by analyzing the results of every build and the behavior under different load conditions. The dashboard in the following image is an example:


By correlating metrics such as system load, response time, and the number of SQL statement executions, the root cause of many performance issues can be derived.

The top graph is called the "Tier Breakdown" graph: it shows the overall execution time of the various logical components of the application (web services, database access, business logic, web server, and so on). The red block represents the time spent in a backend web service, and it is obvious that this component is a hotspot. At the same time, we can see that the web service is not under unusual load, because the second graph shows that the number of requests processed by the application at that time was fairly stable. Normally, most of the overall response time is spent in the data tier, but that does not automatically mean the database itself is slow! Knowing that inefficient database access is often the main cause of poor performance, I always correlate the response time with the number of SQL statement executions. In this example, the correlation with most of the response-time peaks is already clearly visible.

The most common problem patterns I observe are bad database access patterns, followed by too fine-grained service calls, poor sharing of data between services, excessive logging, and garbage-collection impact or outright crashes caused by memory leaks and massive object creation.

Choosing a diagnostic tool

In this article I'll focus on the database side of things, because I'm pretty sure all of your applications suffer from at least one of these access patterns! You can choose among the many performance diagnostics, tracing, and APM tools available on the market; my own choice is the free Dynatrace Personal License. Java itself also ships with great tools, such as Java Mission Control. Many frameworks that provide data access also offer diagnostic options through their log output, Hibernate and Spring among them.

These tracing tools usually require no code changes, because they use the JVMTI (JVM Tool Interface) to capture code-level information and can even trace calls across remote tiers, which is very useful for distributed, (micro)service-oriented applications. All you have to do is modify your JVM startup options to load the tool. Some tool vendors also offer IDE integration, so you can essentially say "turn on XYZ performance diagnostics at runtime". I made a short video tutorial on YouTube that demonstrates how to trace an application launched from Eclipse.

Identify database performance hotspots

Even when you have established that the database is the main contributor to overall application response time, don't rush to blame the database or the DBA! What keeps the database busy may include the following:

  • Inefficient use of the database: bad query design, poor application logic, incorrect configuration of the data-access framework
  • Bad database design and data structures: table relations, slow-performing views, missing or poor indexes, outdated table statistics
  • Improper database configuration: memory, disk, tablespaces, connection pools, and so on

In this article, I will focus on what you can do on the application side to minimize the time spent accessing the database.

Diagnosing bad database access patterns

When diagnosing an application, I always check several database access patterns first. I analyze the application's requests one by one and assign the problems I find to the following "DB problem patterns" (a rough detection sketch follows the list):

  • Excessive SQLs: a single request executes a large number (more than 500) of different SQL statements
  • N+1 Query: a single request executes the same SQL statement many times (more than 20)
  • Slow Single SQL: a single SQL statement accounts for more than 80% of the response time
  • Data-Driven: the same request executes different SQL statements depending on input parameters
  • Database Heavy: overall database execution time accounts for more than 60% of the overall response time
  • Unprepared Statements: the same SQL is executed without being prepared
  • Pool Exhaustion: connection acquisition takes too long (getConnection time exceeds executeStatement time)
  • Inefficient Pool Access: too many accesses to the connection pool (calls to getConnection exceed 50% of calls to executeStatement)
  • Overloaded Database Server: too many requests from all the applications sharing it overload the database server
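
To make these thresholds concrete, here is a minimal sketch of how such counters could be collected from a wrapped JDBC layer. All class names, method names, and thresholds below are my own illustration, not part of any tool mentioned in this article:

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

/** Per-request counters; names and thresholds are illustrative only. */
public class DbAccessStats {
    private final AtomicLong getConnectionCalls = new AtomicLong();
    private final AtomicLong executeCalls = new AtomicLong();
    private final Map<String, AtomicLong> executionsPerSql = new ConcurrentHashMap<>();

    // Call these from a wrapping DataSource / Statement implementation.
    public void onGetConnection() { getConnectionCalls.incrementAndGet(); }

    public void onExecute(String sql) {
        executeCalls.incrementAndGet();
        executionsPerSql.computeIfAbsent(sql, s -> new AtomicLong()).incrementAndGet();
    }

    public long totalExecutions() { return executeCalls.get(); }

    // "Excessive SQLs": more than 500 distinct statements in one request
    public boolean excessiveSql() { return executionsPerSql.size() > 500; }

    // "N+1 Query": the same statement executed more than 20 times
    public boolean nPlusOneSuspected() {
        return executionsPerSql.values().stream().anyMatch(c -> c.get() > 20);
    }

    // "Inefficient Pool Access": getConnection calls exceed 50% of executions
    public boolean inefficientPoolAccess() {
        return getConnectionCalls.get() * 2 > executeCalls.get();
    }
}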

Example 1: Custom O/R mapper producing excessive SQL executions

My first example is a web application that provides information about meeting rooms in a building. The meeting-room data lives in a database, and every time a user generates a meeting-room report, a custom data-access layer is called to query that database.

When analyzing individual requests, I always start with the so-called transaction flow, a visualization of how the application processes a request. For the meeting-room report request, you can see that it first enters the web server tier (left), then the application service tier (middle), and then makes calls into the data tier (right). The "links" between these tiers show the number of interactions between them, for example how many SQL queries this single request executed.

From this view we can immediately spot the first two problem patterns, Excessive SQLs and Database Heavy. Let's analyze it:


It's easy to see that this request caused an enormous number of SQL executions and kept the database busy: it executed a total of 24,889 SQL statements, which took 40.27 seconds, 66.51% of the entire request time!

If we analyze the individual SQL statements, we find that this request exhibits two further problems, the N+1 query pattern and inefficient connection pool access (discussed in detail below):


An access pattern this bad cannot be fixed by tuning database indexes.

I've seen this problem countless times. The application logic needs to iterate over a list of objects, but instead of eager loading it uses lazy loading. The choice may come from an O/R mapping framework such as Hibernate or Spring, or from a homegrown framework, as in this example. Here, the custom implementation loads each meeting-room object and then fetches all the properties of each room through a separate SQL query. Each SQL query is executed on a JDBC connection obtained from the connection pool and returned to the pool after the query completes. That also explains why the request generates 12,444 set clientname executions: the Sybase JDBC driver issues that statement every time a connection is requested from the pool. And that is exactly where the problem lies! Other JDBC drivers may not issue set clientname; in that case, watch the number of getConnection calls, which reveals the same problem.

The N+1 query problem itself is easy to avoid with a join query. For this example of rooms and their properties, the following join can be used:

select r.*, p.*
from meeting_rooms as r
inner join room_properties as p on p.room_id = r.room_id

As a result, the whole operation executes a single query instead of more than 12,000, and it also saves 12,000 connection acquisitions and "set clientname" round-trips!
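
On the application side, the fix could look like the following sketch. Only the join itself comes from the query above; the explicit column names (name, prop_name, prop_value) and the MeetingRoom class are my own assumptions for illustration:

import java.sql.*;
import java.util.*;
import javax.sql.DataSource;

public class RoomReportDao {
    private final DataSource dataSource;
    public RoomReportDao(DataSource dataSource) { this.dataSource = dataSource; }

    /** Loads all rooms and their properties in a single round-trip. */
    public Collection<MeetingRoom> loadAllRooms() throws SQLException {
        String sql = "select r.room_id, r.name, p.prop_name, p.prop_value "
                   + "from meeting_rooms r "
                   + "inner join room_properties p on p.room_id = r.room_id";
        Map<Long, MeetingRoom> rooms = new LinkedHashMap<>();
        try (Connection con = dataSource.getConnection();       // one connection
             PreparedStatement ps = con.prepareStatement(sql);
             ResultSet rs = ps.executeQuery()) {                // one query
            while (rs.next()) {
                long id = rs.getLong("room_id");
                MeetingRoom room = rooms.get(id);
                if (room == null) {
                    room = new MeetingRoom(id, rs.getString("name"));
                    rooms.put(id, room);
                }
                room.addProperty(rs.getString("prop_name"), rs.getString("prop_value"));
            }
        }
        return rooms.values();
    }
}

// Minimal domain class assumed for this sketch.
class MeetingRoom {
    final long id; final String name;
    private final Map<String, String> properties = new HashMap<>();
    MeetingRoom(long id, String name) { this.id = id; this.name = name; }
    void addProperty(String key, String value) { properties.put(key, value); }
}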

Example 2: Incorrect Hibernate configuration causing excessive SQL execution

I know there are many users of Hibernate and other O/R mappers out there. Keep in mind that the lazy and eager loading options these mappers provide, along with their various caching layers, all exist for a reason. Make sure you use these features and options correctly for your specific use case.

In the example below, lazy loading was a bad choice, because loading 2,000 objects and their properties resulted in more than 4,000 SQL queries. Since we always need all of the objects, the better approach is to load them eagerly and then consider caching them, provided they do not change too frequently:


When using O/R mappers such as Hibernate or Spring, you need to choose the correct loading and caching options, and for that you need to understand how they work behind the scenes.
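
To make this concrete, here is a minimal JPA/Hibernate-style sketch. The entities and field names are invented for illustration, and the right fetch strategy always depends on your use case:

import javax.persistence.*;
import java.util.List;

@Entity
public class MeetingRoom {
    @Id private Long id;

    // FetchType.LAZY triggers one extra query per room as soon as the
    // properties are touched (the N+1 pattern). If every use case needs the
    // properties, EAGER, or better an explicit fetch join, avoids the extra
    // round-trips.
    @OneToMany(mappedBy = "room", fetch = FetchType.EAGER)
    private List<RoomProperty> properties;
}

@Entity
class RoomProperty {
    @Id private Long id;
    @ManyToOne private MeetingRoom room;
    private String name;
    private String value;
}

// Alternative: keep the mapping LAZY and fetch explicitly where needed:
// List<MeetingRoom> rooms = entityManager.createQuery(
//         "select distinct r from MeetingRoom r join fetch r.properties",
//         MeetingRoom.class).getResultList();   // one SQL join, not 1 + N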

Most O/R mappers provide excellent diagnostic options through logging, and the online community offers plenty of best practices. I recommend the series of blog posts by Alois Reitbauer, who dove very deeply into Hibernate in its early years and put special emphasis on how to use caching and loading options effectively.

Example 3: Custom DB access code that does not prepare its statements

When a database engine parses a SQL statement and creates an execution plan for it, the result is stored in a cache inside the database so that it can be reused without re-parsing the statement (parsing is among the most CPU-intensive operations). The key used to look up a query in that cache is the full text of the statement. This means that if you call the same statement 1,000 times with 1,000 different parameter values (for example in the where clause), you end up with 1,000 different cache entries, and the 1,001st call with yet another parameter value must be parsed again. This is very inefficient, and it is why prepared statements exist: a statement is prepared once, parsed, and stored in the cache with placeholders standing in for the variables. At execution time the placeholders are replaced with actual values, and the execution plan is found directly in the cache without parsing the statement again.

Database access frameworks usually do a good job of preparing their queries, but in custom code I find that developers often overlook this. In the following example, only a small fraction of the SQL executions were prepared:


The problem of unprepared database access was found by comparing the number of SQL executions with the number of prepared SQL executions

If you write your own database access code, double-check that you are calling prepareStatement correctly; for example, if a query is executed more than once, it usually pays to use a PreparedStatement. If you access data through frameworks, also double-check their behavior and which configuration options they offer for optimizing the generated SQL. The easiest way to verify all this is to monitor the number of executeStatement and prepareStatement calls; if you do the same per SQL query, the optimization hotspots are easy to find.
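
As a minimal plain-JDBC sketch of the difference (the table and column names are illustrative):

import java.sql.*;

public class RoomQueries {

    // Concatenating the value makes every execution a distinct statement text,
    // so the database parses it again each time (and it invites SQL injection).
    static void slow(Connection con, long roomId) throws SQLException {
        try (Statement s = con.createStatement();
             ResultSet rs = s.executeQuery(
                     "select * from meeting_rooms where room_id = " + roomId)) {
            while (rs.next()) { /* ... */ }
        }
    }

    // The placeholder keeps the statement text (the plan-cache key) stable,
    // so it is parsed once and the cached plan is reused for every value.
    static void fast(Connection con, long roomId) throws SQLException {
        try (PreparedStatement ps = con.prepareStatement(
                "select * from meeting_rooms where room_id = ?")) {
            ps.setLong(1, roomId);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) { /* ... */ }
            }
        }
    }
}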

Example 4: Time-consuming backend report SQL exhausting an undersized connection pool

I often find applications running with a default connection pool size, such as 10 or 20 connections per pool. Developers tend to skip connection pool sizing because they don't do the necessary large-scale load testing, don't know how many users the new features will attract, and don't anticipate what parallel database access that implies. It can also happen that the pool configuration gets "lost" when deploying from staging to production, leaving production on the application server's defaults.

Connection pool usage is easy to monitor through JMX metrics. Every application server (Tomcat, JBoss, WebSphere, and so on) exposes these metrics, although some require you to enable them explicitly. The following figure shows the connection pool usage of the WebLogic servers in a cluster. You can see that the "Number of Active DB Connections" reached its maximum on all three application servers.


Make sure you size your connection pools appropriately and don't run with defaults that don't match your expected load
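
If you want to read such pool metrics programmatically rather than in a console, a small sketch using the standard JMX API could look like this. The MBean object name and attribute names below follow Tomcat's DBCP2-style pools purely as an example; every server and pool implementation names them differently, so check your server's documentation:

import java.lang.management.ManagementFactory;
import javax.management.MBeanServer;
import javax.management.ObjectName;

public class PoolMonitor {
    public static void main(String[] args) throws Exception {
        MBeanServer mbs = ManagementFactory.getPlatformMBeanServer();
        // Illustrative object name; the real one depends on server and pool.
        ObjectName pool = new ObjectName(
                "Catalina:type=DataSource,context=/myapp,"
              + "class=javax.sql.DataSource,name=\"jdbc/MyDS\"");
        Object active = mbs.getAttribute(pool, "numActive");
        Object max    = mbs.getAttribute(pool, "maxTotal");
        System.out.println("active connections: " + active + " / " + max);
    }
}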

The root cause of this problem was not a traffic spike. On the "System Load / Response Time / Database Executions" dashboard described at the beginning of this article, the application showed no unusual traffic. It turned out that a report job was scheduled shortly after 2:00 pm every day, and it executed several long-running UPDATE statements, each on a separate connection. This blocked the other connections for several minutes, causing performance problems under "normal" traffic because user requests could not obtain a database connection:


Individual SQL executions blocked other connections for several minutes, exhausting the connection pool

If you know that certain requests hold connections for an extended period, you have several options:

  • Route these requests to a separate server so they don't affect other users
  • Reschedule them to run at a time when they won't disturb anyone
  • Increase the connection pool size so that enough connections remain available for normal traffic

First, though, make sure those queries are optimized. Find the most time-consuming operations by analyzing the SQL execution plan. Most APM tools today let you retrieve the execution plan of a SQL statement in some way; if no tool is available, the easiest options are the database's command-line tools, or asking a DBA to generate the plan for you.


Optimize your SQL statements by studying their execution plans

The execution plan shows how the database engine processes a SQL statement. Slow SQL has many possible causes, not just missing or misused indexes; in many cases it comes down to the design and structure of the query or its joins. If you are not a SQL expert, ask a DBA or someone who is.
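
As one concrete way to get a plan without extra tooling, here is a hedged sketch for Oracle; the EXPLAIN PLAN statement and the dbms_xplan call are Oracle-specific, while MySQL and PostgreSQL use EXPLAIN / EXPLAIN ANALYZE instead:

import java.sql.*;

public class ShowPlan {
    // Oracle-specific sketch: populate PLAN_TABLE, then print the formatted plan.
    static void printPlan(Connection con, String sql) throws SQLException {
        try (Statement s = con.createStatement()) {
            s.execute("explain plan for " + sql);
            try (ResultSet rs = s.executeQuery(
                    "select plan_table_output from table(dbms_xplan.display())")) {
                while (rs.next()) {
                    System.out.println(rs.getString(1));
                }
            }
        }
    }
}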

Tips and tricks for load testing and monitoring in production

Beyond analyzing individual requests for these problem patterns, I also look at long-term trends while an application is under load. In addition to the dashboard I showed at the beginning of this article, I watch for changes in data-driven behavior and verify that data caching works as intended.

Checkpoint 1: With data caching in place, database access should gradually decrease

The chart below shows the average number of SQL statement executions (green) and the total number of SQL executions (blue). We ran a two-hour performance test against the application at a consistently high load. What I would expect is for the average to decrease gradually while the total levels off, since by assumption most of the data fetched from the database is static or gets cached in some layer.


If your app doesn't behave this way, you may have a data-driven performance problem or a caching problem

Suppose your application has the common N+1 query problem shown earlier. Then, as end users generate more and more data in the database, the average number of SQL statements per request keeps growing, because the queries keep returning more data! So keep an eye on these numbers!

Checkpoint 2: Track SQL access patterns by statement type

Similar to example 4, where a background reporting job at 2pm caused the problem, I also like to watch SQL access patterns over time. I look not only at total execution time but also at the number of SELECT, INSERT, UPDATE, and DELETE executions. That way I can spot special activity in a given period, such as a background job updating a large batch of data.


Understand the application's database access behavior by observing the total execution time and the number of SELECT, INSERT, UPDATE, and DELETE executions

A batch job with many update operations can take a while to complete, especially on tables with many rows. If it locks the entire table, other requests that update the table, or even just some of its rows, must wait until the lock is released. Consider running such jobs when no other users are online, or use different locking logic that locks, updates, and releases individual rows.
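
One common mitigation, sketched below, is to break a mass update into smaller transactions so that row locks are released between chunks. The table, column, and chunk size are illustrative, and the rownum predicate is Oracle syntax (MySQL would use "limit 1000" instead):

import java.sql.*;

public class ChunkedUpdate {
    // Commit after each bounded slice so locks are released in between.
    static void archiveInChunks(Connection con) throws SQLException {
        con.setAutoCommit(false);
        int updated;
        do {
            try (PreparedStatement ps = con.prepareStatement(
                    "update bookings set status = 'ARCHIVED' "
                  + "where status = 'DONE' and rownum <= 1000")) {
                updated = ps.executeUpdate();
            }
            con.commit();   // release row locks before the next chunk
        } while (updated > 0);
    }
}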

Checkpoint 3: The health of the database instance

Most of the database performance issues I focus on in this article are not caused by a slow database server. They come from application code with bad database access patterns (N+1 queries, unprepared statements, and so on) or from misconfiguration (inefficient pool access, data-driven issues).

However, it would be unwise to ignore the database itself entirely, so I always check key database performance metrics. Most databases expose rich performance data through special system tables. Oracle, for example, provides v$ tables and views with access to key indicators (sessions, wait times, parse times, execution times, and so on) as well as information such as table locks and slow-running SQL from any of the applications sharing the database instance.
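
As an illustration, a quick health query against one of Oracle's v$ views could look like the following sketch. It assumes SELECT privileges on v$session, and other databases expose equivalent data under different names:

import java.sql.*;

public class DbHealthCheck {
    // Oracle-specific: count active sessions grouped by their wait class.
    static void printSessionsByWaitClass(Connection con) throws SQLException {
        try (Statement s = con.createStatement();
             ResultSet rs = s.executeQuery(
                     "select wait_class, count(*) as sessions "
                   + "from v$session group by wait_class order by 2 desc")) {
            while (rs.next()) {
                System.out.printf("%-20s %d%n",
                        rs.getString("wait_class"), rs.getInt("sessions"));
            }
        }
    }
}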

When doing database health checks, I usually look at two dashboards that surface metrics from these performance tables:


Check whether the database is healthy or suffering from excessive load generated by the applications sharing the instance


Use table locks and similar information to determine whether some executing SQL statement is negatively impacting the server, and thereby your application

Automatically detecting database metrics in your continuous integration process

Before moving on to new ideas for analyzing key database metrics and use cases, I want to cover a topic we should all be thinking about: automation!

Rather than performing these checks manually, I recommend capturing these metrics through your continuous integration tooling, alongside your unit tests, integration tests, REST API tests, and other functional tests. If you already have test suites that exercise your REST APIs or new features, why not capture these metrics during every build's test run? This approach brings the following benefits (a small test sketch follows the list):

  1. Let code reviews focus on these metrics instead of re-reading every line of code
  2. Get notified when a code check-in introduces one of these problems
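
As a minimal sketch of what such a check could look like, here is a JUnit-style test that fails the build when a use case exceeds its query budget. It assumes a counter such as the DbAccessStats sketch shown earlier; the scenario driver and the threshold of 10 are invented for illustration:

import static org.junit.jupiter.api.Assertions.assertTrue;
import org.junit.jupiter.api.Test;

public class RoomReportDbRegressionTest {

    @Test
    void roomReportStaysWithinQueryBudget() {
        DbAccessStats stats = new DbAccessStats();   // counter sketch from above
        runRoomReportScenario(stats);                // hypothetical test driver

        // Fail the build as soon as a check-in introduces an N+1 regression.
        assertTrue(stats.totalExecutions() <= 10,
                "room report executed " + stats.totalExecutions()
                        + " SQL statements, expected at most 10");
    }

    private void runRoomReportScenario(DbAccessStats stats) {
        // Exercise the functional use case with SQL counting enabled.
        // Omitted here; this whole test is an illustrative sketch.
    }
}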

The screenshot below shows how these metrics are tracked per build and per test, and how you are warned when they deviate. Integrate them into your build pipeline, learn about the impact of a code change immediately, and fix it right away instead of letting it crash the system once the code reaches production.


Add these metrics to your continuous integration process and watch for changes to automatically spot all kinds of bad database access patterns!

Performance issues go far beyond databases

This article focused on database hotspots, but in my work I find similar problem patterns in many other areas. In 2015 I was involved in a project migrating a monolithic application to (micro)services and saw a huge spike in service calls. The problem resembles the patterns analyzed here, such as the N+1 query problem: one use case suddenly called a backend service hundreds of times. Most of the time this comes from poorly designed interfaces and from not considering what happens when a method that used to be called locally suddenly runs in a Docker container or a cloud environment. Network issues appear, with data crossing the wire and new connection pools to manage (meaning you need to think about threads versus sockets), and you have to deal with all of it. But that is beyond the scope of this article, so stay tuned for follow-ups. "May the metrics be with you" :-)

About the author

Andreas Grabner (@grabnerandi) is a performance engineer who has worked in this field for the past fifteen years. Andreas helps organizations identify the real problems in their applications and shares the lessons learned as engineering best practices so that others can avoid those problems.

Harald Zeitlhofer (@HZeitlhofer) has over fifteen years of experience with databases and applications for small startups and large enterprises around the world. Performance monitoring and optimization have always been key success factors in the environments he works on.

Original article: Diagnosing Common Database Performance Hotspots in your Java Code