"High Performance MySQL" - Server Performance Analysis (Notes)

3. Analysis of server performance

  • Whether the server is already operating at its best possible performance
  • Why a certain statement does not execute fast enough
  • How to diagnose intermittent problems

To answer these questions, the simplest approach is to focus on measuring where the server spends its time.

3.1 Introduction to performance optimization

We define performance as the time required to complete a task, i.e., response time.

Measure performance by tasks and time rather than resources.

The purpose of a database server is to execute SQL statements, so the tasks it focuses on are queries or statements such as SELECT, UPDATE, DELETE, and so on. The performance of a database server is therefore measured by query response time: the time each query takes.

For optimization, we assume that performance optimization is to reduce the response time as much as possible under a certain workload.

  1. If you think performance optimization means reducing CPU utilization, then you are really aiming to reduce resource usage. But this is a trap: resources exist to be consumed and put to work, so sometimes consuming more resources can actually speed up a query.

    • In many cases, after upgrading a server from an old version of the InnoDB engine to a new one, CPU utilization rises sharply. This does not mean there is a performance problem; it means the new version of InnoDB makes better use of the available resources.

    • The query response time better reflects whether performance has improved after the upgrade. Version upgrades sometimes introduce bugs, such as the inability to use certain indexes, which leads to higher CPU utilization. CPU utilization is a symptom, not a good measurable goal.

  2. If you think of performance optimization as just improving the query volume per second, it is actually just throughput optimization.

    • The increase in throughput can be seen as a by-product of performance optimization.
    • Optimizing queries allows the server to execute more queries per second because each query takes less time to execute (throughput is defined as the number of queries per unit of time, which happens to be the inverse of our definition of performance).

So if the goal is to reduce response time, you need to understand why the server takes so much time to execute the query, and then reduce or eliminate the unnecessary work required to obtain the result.

In other words, first figure out where your time is being spent. This leads to the second principle of optimization: you can't optimize effectively if you can't measure it.

So the first step should be to measure where time is spent .

We will spend a lot of time, perhaps even 90% of it, measuring where the response time is spent.

If you don't find the answer by measuring, you either measured it in the wrong way, or it was not measured completely enough. If the complete and correct data in the system is measured, performance problems can usually be exposed, and the solution to the problem will be more obvious.

Measuring is challenging, and analyzing the results is equally challenging; measuring where time is being spent and knowing why it is being spent there are two different things.

It was mentioned earlier that an appropriate measurement range is required. What does this mean? An appropriate measurement range means that only activities that need to be optimized are measured. There are two relatively common situations that lead to inappropriate measurements:

  • Start and stop measurements at the wrong time.
  • Measuring aggregated information rather than the target activity itself.

For example, a common mistake is to look at slow queries and then measure the entire server to determine where the problem is. If you have confirmed that there are slow queries, you should measure the slow queries themselves, not the whole server. And what should be measured is the time from the start to the end of the slow query, not the time before or after it.


The time required to complete a task can be divided into two parts:

  • execution time
  • waiting time.

If you want to optimize a task's execution time, the best way is to measure where the time goes and locate the subtasks responsible, and then optimize: remove some subtasks, reduce how often subtasks execute, or make subtasks more efficient.

The waiting time of optimization tasks is relatively more complicated, because the waiting may be caused by indirect effects of other systems, and tasks may also affect each other due to competition for disk or CPU resources. Diagnosis requires different tools and techniques depending on whether time is spent executing or waiting.

3.1.1 Optimization through performance profiling

Once you master and practice response time-oriented optimization methods, you will find that you need to continuously analyze the performance of the system (profiling) .

Profiling is the primary way to measure and analyze where time is being spent. Performance profiling generally has two steps:

  1. Measure time spent on tasks
  2. The results are then counted and sorted to prioritize important tasks.

Start the timer at the beginning of the task, stop the timer at the end of the task, and then subtract the start time from the end time to get the response time.

The resulting data can be used to draw call graphs, but more importantly for our purposes, similar tasks can be grouped and aggregated.

Grouping similar tasks and summarizing them can help with more complex statistical analysis of those tasks grouped together, but at least you need to know how many tasks are in each group and calculate the total response time.

The required results can be obtained through a performance analysis report (profile report). A profile report lists all the tasks, one per row, including the task name, the number of executions, the total elapsed time, the average time per execution, and the task's percentage of the total time. The report is sorted in descending order by the total elapsed time of each task.
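Purely as an illustration (not from the book), assuming a file task-times.txt with two columns (task name and response time in seconds), a short awk pipeline can produce a rudimentary profile of this kind, printing task name, call count, total time, and average time, sorted by total time:

$ awk '{ calls[$1]++; total[$1] += $2 }
       END { for (t in total) printf "%s %d %.6f %.6f\n", t, calls[t], total[t], total[t]/calls[t] }' task-times.txt \
  | sort -k3,3nr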

For better illustration, here is an example of performance profiling of the entire database server workload. The main output is various types of queries and the time to execute them.

This analyzes response time from an overall perspective. The output below is the result of analysis with pt-query-digest from Percona Toolkit; for readability, the results have been slightly adjusted and only the first few lines are shown:

[pt-query-digest profile output omitted]
The above are just the first few rows of the profile, ranked by total response time, and include only the minimum set of columns needed for profiling. Each row shows the query's total response time and its percentage of the overall time, the number of times the query executed, the average response time per execution, and an abstract of the query. This profile makes it clear how expensive each query is relative to the others and to the total.

In this example, the task refers to the query. In fact, when analyzing MySQL, it often refers to the query.

We will practically discuss two types of profiling:

  • Execution Time Based Analysis: Study what tasks take the longest to execute
  • Wait-based analysis: determine where tasks are blocked for the longest time

If the task execution time is long because too many resources are consumed and most of the time is spent on execution, and the waiting time is not much, in this case, the analysis based on waiting is not very useful.

And vice versa, if the task has been waiting, consuming no resources, it will be fruitless to analyze the execution time. If you can't confirm whether the problem lies in execution or waiting, then you need to try both methods.

In fact, when analysis based on execution time finds that a task takes too much time, it should be analyzed in depth, and it may be found that some of the "execution time" is actually waiting.


Before analyzing the performance of the system, it must be able to measure, which requires the support of system measurability. Measurable systems typically have multiple measurement points where data can be captured and collected, but practical systems are rarely measurable.

Most systems don't have many measurable points, and even if they do, they only provide counts of some activities, but no statistics on the time spent in activities.

MySQL is a typical example. It was not until version 5.5 that it first provided a Performance Schema, which had some time-based measurement points, while versions 5.1 and earlier did not have any time-based measurement points.

Related reading: MySQL's default databases (information_schema, mysql, performance_schema); MySQL Performance Schema explained; MySQL performance_schema tips.

Most of the server operation data that can be collected from MySQL is in the form of show status counters, which count the number of times a certain activity occurs.

This is the main reason why we finally decided to create Percona Server, which provides many more detailed query-level measurement points since version 5.0.

By the way, the difference between a "schema" and a "database" comes down to how different vendors name the same concept; in MySQL they refer to the same thing. In Oracle, by contrast, there are database objects, schema objects, and non-schema objects.

Related reading: Percona Server introduction, comparison with MySQL, and installation.

While the ideal performance optimization technique relies on more measurement points, fortunately, even if the system does not provide measurement points, there are other ways to carry out optimization work.

Because the system can also be measured from the outside, if the measurement fails, some reliable guesses can be made based on the understanding of the system. But when doing so, it is important to remember that no data, whether externally measured or guessed, is 100 percent accurate, and this is the risk of an opaque system.

For example, in Percona Server 5.0, the slow query log revealed causes of poor performance such as disk I/O waits or row-level lock waits. If the log shows that a query takes 10 seconds, 9.6 seconds of which are waiting for disk I/O, then it is meaningless to investigate where the other 4% of the time is spent. Disk I/O is the most important reason.

3.1.2 Understanding Performance Profiling

MySQL's performance profile (profile) shows the most important tasks first, but sometimes the information that is not displayed is also very important.

You can refer to the example of performance analysis mentioned above. Unfortunately, although profiling outputs ranks, totals, and averages, a lot of needed information is missing, as shown below.

Queries worth optimizing

Profiling won't automatically tell you which queries are worth the time to optimize. Here we want to emphasize two points again:

  1. Queries that account for only a small share of the total response time are not worth optimizing. According to Amdahl's Law, optimizing a query that contributes no more than 5% of the total response time yields at most a 5% benefit, no matter how hard you work.

  2. If it costs $1,000 to optimize a task, but the revenue of the business does not increase, it can be said that the business has been de-optimized by $1,000 instead. If the cost of optimization is greater than the benefit, the optimization should be stopped.

Anomalies

Certain tasks require optimization even if they do not appear first in the profiling output.

For example, some tasks are executed very few times, but each execution is very slow, which seriously affects the user experience. Because of its low execution frequency, the proportion of the total response time is not prominent.

Unknown unknowns

A good profiling tool will show possible "lost time". Lost time is the difference between the total time for the task and the actual measured time.

For example, if the processor's CPU time is 10 seconds, and the total task time profiled is 9.7 seconds, then there is 300 milliseconds of lost time. This may be because some tasks are not measured, or it may be due to measurement errors and precision problems. If the tool finds this kind of problem, it should be taken seriously, because it is possible to miss something important.

Even if profiling does not find lost time, you need to pay attention to the possibility of such problems, so as not to miss important information. Lost time is not shown in our example, which is a limitation of the tools we use.

Hidden details

Profiling cannot show the distribution of all response times. Believing only in averages is very dangerous, it hides a lot of information and doesn't tell the whole story.

To make the best decisions, you need more information about the 12,773 queries aggregated into that single line of the profile, especially more response-time information such as histograms, percentiles, standard deviation, and the index of dispersion.

3.2 Profiling the application

Profiling can be done for any time-consuming task, including applications. In fact, profiling an application is generally easier and more rewarding than profiling a database server.

Although the previous demonstration examples are all for the analysis of the MySQL server, it is recommended to analyze the performance of the system from top to bottom, so that the entire process from the user's initiation to the server's response can be traced.

Although performance problems are mostly database-related, there are also many performance problems caused by applications.

Performance bottlenecks can have many contributing factors:

  • External resources, such as calls to external Web services or search engines.
  • The application needs to process a large amount of data, such as analyzing a very large XML file.
  • Perform expensive operations in a loop, like abusing regular expressions.
  • An inefficient algorithm is used, such as a naive search algorithm, to find items in the list.

Fortunately, determining whether MySQL is the problem is not that complicated; all it takes is an application profiling tool (and as a bonus, once you have such a tool, it helps you write efficient code from the start).

It is recommended that all new projects consider including profiling code. While adding profiling code to an existing project can be difficult, new projects are easier.

Does profiling itself cause the server to slow down?

  • Say "yes" because profiling does make your application a little slower.
  • Say "no" because profiling can help your application run faster.

Both profiling and periodic instrumentation introduce additional overhead. The question is how much this part of the cost, and whether the resulting benefits can offset these costs.

Most people who have designed and built high-performance applications believe that everything that can be measured should be measured as much as possible, and that the additional overhead of these measurements should be accepted as part of the application. Also, most applications don't need to run detailed performance measurements every day. Even if you disagree with this point of view, it makes sense to build some lightweight profiling for your application that you can use forever. If the system does not have performance statistics that change every day, it is a headache to encounter performance bottlenecks that cannot be predicted in advance. When a problem is discovered, if there is historical data, the value of these historical data is unlimited. And performance data can help plan hardware purchases, resource allocation, and predict periodic performance spikes.

So what is "lightweight" performance analysis?

For example, all SQL statements can be timed, plus the total time statistics of the script, which is not expensive, and does not need to be executed every time the page is viewed (page view) .

If the traffic trend is relatively stable, random sampling is also possible, and random sampling can be achieved by setting in the application:

<?php
$profiling_enabled = rand(0, 100) > 99;
?>

This way only 1% of the sessions will perform performance sampling to help locate some serious problems. This strategy is especially useful in a production environment, where it can uncover problems that would otherwise go unnoticed.

(measurement php application omitted)

3.3 Analyzing MySQL queries

There are two ways to profile a query, each with its own problems.

An entire database server can be profiled so that you can figure out which queries are the main stressors (if you've done profiling at the top application layer, you probably already know which queries require special attention).

After locating the specific queries that need to be optimized, you can also drill down to analyze these queries separately, and analyze which subtasks are the main consumers of response time.

3.3.1 Profiling server load

Server-side profiling is valuable because inefficient queries can be effectively audited on the server side.

Locating and optimizing "bad" queries can significantly improve application performance and solve specific problems. It also reduces the overall load on the server, so that all queries benefit from reduced contention for shared resources (an "indirect benefit"). Reducing server load can also delay or avoid the need to upgrade to more expensive hardware, and profiling helps detect and locate poor user experience, such as certain corner cases.

Every new version of MySQL adds more measurement points. If the current trend continues, measurement of the performance-critical activities will soon be well supported across the board. But if you just need to profile and find the most expensive queries, it does not need to be that sophisticated.

A tool that has been able to help for a long time is the slow query log.

Capture MySQL queries into log files

In MySQL, the slow query log initially only captures relatively "slow" queries, but performance analysis needs to target all queries. Moreover, in MySQL 5.0 and earlier versions, the unit of response time for slow query logs is seconds, and the granularity is too coarse. Fortunately, these restrictions are history.

In MySQL 5.1 and newer versions, the slow query log has been enhanced: all queries can be captured by setting long_query_time to 0, and the response time of each query is recorded with microsecond precision.
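For example, a minimal sketch of turning on full query capture at runtime (the log file path is an assumption, and note that changing the global long_query_time only affects connections opened afterwards):

mysql> SET GLOBAL slow_query_log_file = '/var/lib/mysql/slow-query.log';
mysql> SET GLOBAL long_query_time = 0;
mysql> SET GLOBAL slow_query_log = ON;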

If you are using Percona Server, version 5.0 has these features, and Percona Server provides more control over log content and query capture.

In the current version of MySQL, the Slow Query Log is the cheapest, most accurate tool for measuring query times.

If you are worried about the additional I/O overhead of enabling the slow query log, you can rest assured: it is negligible. The bigger concern is that the log can consume a lot of disk space.

  • If the slow query log is enabled for a long time, be sure to deploy a log rotation tool (a logrotate sketch follows this list).

Related reading: Use logrotate for automatic log splitting and rotation.

  • Or don't enable the slow query log for a long time, only enable it when you need to collect load samples.
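For example, a minimal logrotate sketch (the log path is an assumption, and mysqladmin is assumed to be able to authenticate, e.g. via ~/.my.cnf):

/var/lib/mysql/slow-query.log {
    daily
    rotate 7
    compress
    missingok
    notifempty
    postrotate
        # ask MySQL to close and reopen its log files after the rename
        /usr/bin/mysqladmin flush-logs
    endscript
}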

MySQL also has another kind of query log, which is called "general log" , but it is rarely used for analysis and profiling of server performance. The general log records when the query request is sent to the server, so it does not contain important information such as response time and execution plan.

Since MySQL 5.1, logs can also be written to database tables, but in most cases this is unnecessary. It not only has a significant performance impact, but while MySQL 5.1 records microsecond-level timings when logging slow queries to a file, logging them to a table degrades the granularity to whole seconds, and a second-level slow query log is not very useful.

The slow query log of Percona Server records more detailed and valuable information than the official version of MySQL, such as query execution plan, lock, I/O activity, etc. In addition, the manageability has also been enhanced. For example, globally modify the long_query_time threshold for each connection, so that when the application uses a connection pool or persistent connection, it can start or stop the query log of the connection without resetting session-level variables.

In general, the slow query log is a lightweight and comprehensive performance analysis tool, and it is a powerful tool for optimizing server queries.


Sometimes queries cannot be logged on the server, for example because of insufficient permissions.

We often encounter such limitations, so we developed two alternative technologies, which are integrated into pt-query-digest in Percona Toolkit.

The first is to watch the output of SHOW FULL PROCESSLIST repeatedly via the --processlist option, recording when a query first appears and when it disappears. This level of precision is good enough to spot problems in some cases, but it cannot capture all queries: queries that execute quickly may start and finish in the gap between two samples and never be captured.

The second technique is to capture TCP network packets and analyze them according to the MySQL client/server protocol. You can first save the packets to disk with tcpdump, and then parse and analyze the queries with the --type=tcpdump option of pt-query-digest. This method is more accurate and captures all queries; it can also parse more advanced protocol features, such as the binary protocol used to create and execute server-side prepared statements, and the compressed protocol.
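A sketch of that workflow (interface, port, packet count, and file names are assumptions to adapt to your environment):

# capture traffic on the MySQL port and save the decoded packets
$ tcpdump -s 65535 -x -nn -q -tttt -i any -c 100000 port 3306 > mysql.tcp.txt
# analyze the captured queries
$ pt-query-digest --type tcpdump mysql.tcp.txt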

Another method is to record all queries through a script at the MySQL Proxy layer, but we rarely do this in practice.

Analyzing query logs

Don't just open the entire slow query log and start reading it; that is a waste of time and money.

A profiling report should be generated first, and if necessary, you can then look at the parts of the log that require special attention.

Top-down is a better way, otherwise it may lead to de-optimization of the business as mentioned above.

Generating analysis reports from slow query logs requires a good tool. It is recommended to use pt-query-digest, which is undoubtedly the most powerful tool for analyzing MySQL query logs.

Related reading: Using pt-query-digest to analyze slow logs.

The tool is powerful, including the ability to save query reports to a database and track workload changes over time.

In general, you only need to pass the slow query log file as a parameter to pt-query-digest and it will just work. It prints a profile report of the queries in the log and can, on request, print more detailed information item by item for the "important" queries.
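For example (file names are assumptions):

# generate a profile report from the slow query log
$ pt-query-digest slow-query.log > slow-report.txt
# or show more queries than the default top 10
$ pt-query-digest --limit 20 slow-query.log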

The tool is under continuous development, so please read the latest version of the documentation for the latest features.

pt-query-digest

Here is an example of a report output by pt-query-digest as a starting point for profiling. This is the unmodified profile of the workload mentioned earlier:

[pt-query-digest profile output omitted]
First, each query has an ID, which is a hash fingerprint computed from the normalized query statement: literal values in the query are removed, whitespace is collapsed, and everything is converted to lowercase.

Under the Item column, the question mark after a table name (such as InvitesNew) means that this is a sharded table: the shard identifier after the table name is replaced by a question mark, so that the whole group of shard tables can be aggregated and reported as one.

The V/M column in the report provides detailed data on the variance-to-mean ratio , which is also known as the index of dispersion . Queries with a high dispersion index correspond to large changes in execution time, and such queries are usually worth optimizing.

If pt-query-digest is given the --explain option, a column briefly describing the query's execution plan (the query's "geek code") is added to the output. Looking at the execution plan column together with the V/M column makes it easier to identify low-performance queries that need optimizing.

Finally, a line of output is also added at the end, showing statistics for the other 17 queries that are too small to be worth displaying alone. Instead of summarizing some unimportant queries in the last line, you can specify the tool to display more query details via the --limit and --outliers options.

By default, only the top 10 queries are printed, along with queries whose execution time exceeds the 1-second outlier threshold; both limits are configurable.

A detailed report for each query is included following the profile report. You can match the previous profiling statistics and query detailed reports by query ID or rank.

3.3.2 Analyzing a single query

In practical applications, three more methods are used:

  • SHOW STATUS
  • SHOW PROFILE
  • Check the entries in the slow query log (this also requires Percona Server, the slow query log of the official MySQL version lacks a lot of additional information)

Use SHOW PROFILE

The SHOW PROFILE command was introduced in MySQL 5.1, based on a contribution from Jeremy Cole in the open source community. At the time of writing, it is the only real query profiling tool included in a GA release.

It is disabled by default, but can be changed dynamically at the session (connection) level via server variables.

 mysql> SET profiling = 1;

Then, all statements executed on the server will measure their elapsed time and some other data related to the state change of query execution.

When a query is submitted to the server, the tool records the profiling information to a temporary table and assigns the query an integer identifier starting from 1.

Below is the result of profiling a view in the Sakila sample database:

mysql> SELECT * FROM sakila.nicer_but_slower_film_list;
[query results omitted]
997 rows in set (0.17 sec)

The query returned 997 rows and took about 1/6 of a second.
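The screenshot that followed here presumably showed the list of profiled statements. It can be reproduced with SHOW PROFILES, which prints one row per profiled statement with its Query_ID, Duration, and Query text:

mysql> SHOW PROFILES;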

[screenshot of the profiled statement list omitted]
To see the details of where the time went within the query:

mysql> SHOW PROFILE FOR QUERY 1;

[SHOW PROFILE output omitted]
As you can see, there is quite a lot of information, with the time recorded for each execution step. The same data is also stored in a table, so we can query it directly:

mysql> SET @query_id = 1;
Query OK, 0 rows affected (0.00 sec)
mysql> SELECT STATE, SUM(DURATION) AS Total_R,
    ->   ROUND(
    ->     100 * SUM(DURATION) /
    ->       (SELECT SUM(DURATION)
    ->        FROM INFORMATION_SCHEMA.PROFILING
    ->        WHERE QUERY_ID = @query_id
    ->       ), 2) AS Pct_R,
    ->   COUNT(*) AS Calls,
    ->   SUM(DURATION) / COUNT(*) AS "R/Call"
    -> FROM INFORMATION_SCHEMA.PROFILING
    -> WHERE QUERY_ID = @query_id
    -> GROUP BY STATE
    -> ORDER BY Total_R DESC;

[profiling summary by state omitted]
Through the above sorted results, we can easily find:

  1. The use of temporary tables should be reduced, since it dominates the time.
  2. The "Sending data" state also looks worth optimizing, but it can cover many different server activities, including searching for matching rows during a join, so it is harder to pin down.
  3. The result sorting does not need to be optimized, because its share of the total time (Total_R) is relatively small.

Use SHOW STATUS

MySQL's SHOW STATUS command returns some counters. There are both global server-level counters and session-level counters based on a connection.

For example, Queries is 0 at the beginning of the session, and increases by 1 for each query submitted.

If you execute SHOW GLOBAL STATUS (note the added GLOBAL keyword), you get the server-level statistics counted since the server started.

The visibility range of different counters is different, but global counters will also appear in the results of SHOW STATUS, which can easily be mistaken for session-level, so don't get confused.

If you try to optimize what you observe from some specific connection, but measure the data at the global level, it will lead to confusion. The official MySQL manual explains in detail whether all variables are session-level or global-level.
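For example, the Queries counter exists in both scopes: the session value counts only the statements issued on the current connection, while the global value counts everything since the server started. A quick way to see the difference:

mysql> SHOW SESSION STATUS LIKE 'Queries';
mysql> SHOW GLOBAL STATUS LIKE 'Queries';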

SHOW STATUS is a useful tool, but not a profiling tool .

Most of the results of SHOW STATUS are just a counter , which can show how frequently certain activities such as index reading are, but cannot tell how much time is consumed.

Only one value in the SHOW STATUS output refers to an operation's duration (Innodb_row_lock_time), and it is only available at the global level, so it still cannot be used to measure session-level work.

Although SHOW STATUS does not provide time-based statistics, it is useful for observing the values ​​of certain counters after a query has executed.

Sometimes it is possible to guess from these counters which operations were expensive or time-consuming. The most useful counters include the handler counters and the temporary file and table counters.

The following example demonstrates how to reset a session-level counter to 0 , then query the aforementioned view, and check the result of the counter:

mysql> FLUSH STATUS;
mysql> SELECT * FROM sakila.nicer_but_slower_film_list;
mysql> SHOW STATUS WHERE Variable_name LIKE 'Handler%' OR Variable_name LIKE 'Created%';

[SHOW STATUS counter output omitted]

From the results, we can see that the query used three temporary tables, two of which were on-disk temporary tables, and that there were many read operations that did not use an index (Handler_read_rnd_next).

Assuming we do not know the view's definition, from these results alone we could speculate that the query performs a multi-table join without a suitable index, possibly because a subquery created a temporary table that was then joined to other tables. The temporary table holding the subquery results has no index, which would roughly explain these numbers.

When using this technique, note that SHOW STATUS itself creates a temporary table and accesses it through handler operations, which affects the corresponding numbers in the output, and the effect may differ between versions. Compared with the execution plan of the query obtained earlier via SHOW PROFILE, at least the temporary table counter should be expected to be higher by 1.

You may notice that most of the same information can be obtained by viewing the execution plan of the query through EXPLAIN, but EXPLAIN is the result obtained by estimation , while the counter is the actual measurement result. For example, EXPLAIN cannot tell you whether the temporary table is a disk table, which is very different from the performance of the memory temporary table.
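For example, a quick check on the same view (the exact plan depends on the MySQL version and the data) will typically show "Using temporary; Using filesort" in the Extra column, but it cannot say whether the temporary table spilled to disk:

mysql> EXPLAIN SELECT * FROM sakila.nicer_but_slower_film_list\G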

Use slow query log

Percona Server has made some modifications to the official slow query log.

Here are the results fetched after executing the same query:
[Percona Server slow query log entry omitted]

It can be seen from here that the query does create a total of three temporary tables, two of which are disk temporary tables.

The results shown here are simplified for readability. The data from running SHOW PROFILE on the query is ultimately also written to the log, so Percona Server can even record the detailed SHOW PROFILE information.

It can also be seen that the detailed entries in the slow query log contain everything in the output of SHOW PROFILE and SHOW STATUS, and more. So after finding a "bad" query with pt-query-digest, the slow query log provides enough useful information.

When viewing the report of pt-query-digest, the title part will generally have the following output:

[pt-query-digest query detail header omitted]
Using the byte offset (3214 here), we can look at the corresponding entry in the log.
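One way to jump straight to that byte offset in the log file (the log path is an assumption):

$ tail -c +3214 /var/lib/mysql/slow-query.log | head -n 100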
[slow query log entry at that offset omitted]

Using Performance Schema

The following query shows the main reasons for waiting in the system:

mysql> SELECT event_name, count_star, sum_timer_wait
    -> FROM events_waits_summary_global_by_event_name
    -> ORDER BY sum_timer_wait DESC LIMIT 5;

[Performance Schema wait summary output omitted]
Getting useful results directly from the raw Performance Schema data is relatively complex and low-level for most users. The features implemented so far are mainly aimed at measurements needed when modifying MySQL source code to improve server performance, such as waits and mutexes.

3.3.3 Performance Analysis

How do I use it once I get a profiling report for a server or query?

A good profiling report can reveal potential problems, but the final solution is up to the user to decide (although the report may give suggestions).

When optimizing a query, the user needs to have a solid understanding of how the server executes the query. The analysis report can collect as much information as possible, give the correct direction to diagnose the problem, and provide basic information for other tools such as EXPLAIN.

Existing systems often do not have perfect measurement support, although a profiling report with complete measurement information can make things easier.

3.4 Diagnosing intermittent problems

Intermittent problems (such as occasional system pauses or slow queries) are difficult to diagnose.

Diagnosing phantom problems that only seem to happen when you are not watching, and that cannot be reproduced on demand, can take a lot of time, sometimes even months.

Along the way, some people try to diagnose by trial and error, sometimes even trying to get away with finding the problem by randomly changing some server settings.

Try not to use trial and error to solve problems. This approach carries great risk as the outcome could be worse. It's also a frustrating and inefficient way to do it.

If the problem cannot be located for a while, it may be that the measurement method is incorrect, or the measurement point is selected incorrectly, or the tool used is inappropriate.

To demonstrate why trial-and-error diagnosis should be avoided as much as possible, here are some real-world examples of intermittent database performance problems that we diagnosed and resolved:

  • The application uses curl to obtain exchange rate quote data from an external service that runs very slowly.

curl is a common command-line tool used to request web servers. Its name means client URL tool

  • Some important entries in the memcached cache expired, causing a large number of requests to MySQL to regenerate the cache entries.

  • DNS queries occasionally time out.

  • It may be due to mutex contention, or the inefficiency of the internal query cache deletion algorithm, MySQL's query cache sometimes causes a short pause in the service.

  • When the concurrency exceeds a certain threshold, the scalability limitation of InnoDB causes the optimization of the query plan to take a long time.

As you can see from the above, some problems are indeed the cause of the database, and some are not. Only by observing resource usage where the problem occurs and measuring the data as much as possible can we avoid wasting energy where there is no problem.

3.4.1 Single-query problem or server-wide problem?

Have you found any clues of the problem? If so, you must first confirm whether it is a problem with a single query or a problem with the server.

  • If all the programs on the server suddenly slow down, and then suddenly everything gets better, and every query slows down, then the slow query may not necessarily be the cause, but the result of other problems.

  • Conversely, if the server as a whole is running fine and only a certain query is occasionally slow, you need to focus on that specific query.

Server issues are very common

In the past few years, hardware capabilities have become stronger and stronger, and servers with 16 cores or more CPUs have become standard configurations, and the scalability limitations of MySQL on SMP-based machines have become increasingly apparent.

Especially for older versions, the problem is more serious, and there are still many old versions in the production environment.

The new version of MySQL still has some scalability limitations, but compared to the old version, it is not so serious, and the frequency of occurrence is relatively small, only occasionally encountered.

Here's the good news, and the bad news:

  • The good news is that you rarely run into this problem
  • The bad news is that once you encounter it, you need to know more about MySQL internals to diagnose it.

Of course, this also means that many problems can be solved by upgrading to a new version of MySQL.

How to judge whether it is a single query problem or a server problem

If the problem recurs frequently, you can observe it as it happens; otherwise you can run a script overnight to collect data and analyze the results the next day.

Most cases can be resolved by three techniques.

1. Use SHOW GLOBAL STATUS

This method simply captures the output of SHOW GLOBAL STATUS at a relatively high frequency, such as once a second. When a problem occurs, it can often be spotted as a spike or dip in some of the counters.

This method is relatively simple, can be used by everyone (no special permissions required), and has little impact on the server, so it is a good way to understand the problem well without spending a lot of time.

The following are example commands and their output:

$ mysqladmin ext -i1 | awk '
/Queries/{q=$4-qp;qp=$4}
/Threads_connected/{tc=$4}
/Threads_running/{printf "%5d %5d %5d\n", q, tc, $4}'

The number of queries per second, the number of connections, and the total number of threads currently executing queries
[per-second output omitted]
This command captures the output of SHOW GLOBAL STATUS once per second and pipes it to awk, which computes and prints the number of queries per second, Threads_connected, and Threads_running (the number of threads currently executing queries).

The trends in these three figures are highly sensitive to occasional pauses at the server level.


Generally, when this kind of problem occurs, depending on the cause and the way the application connects to the database, the number of queries per second will generally drop, while at least one of the other two will have a spike.

In this example, the application uses connection pooling, so Threads_connected does not change.

However, the number of threads executing queries has increased significantly, and the number of queries per second has dropped significantly compared to normal data.


Guessing carries some risk. In practice, however, two causes are the most common.

  • One is some kind of internal bottleneck in the server, causing new queries to pile up before they can start executing because they need locks that older queries are waiting on or holding. This kind of lock contention generally also creates back-pressure on the application servers, which then develop queueing problems of their own.

  • Another common cause is that the server suddenly receives a storm of query requests, for example when a front-end memcached node fails and the resulting cache misses flood MySQL.

This command outputs one line of data per second, can run for hours or days, and then plots the results as a graph, so that you can easily find out if there is a sudden change in trend.

If the problem is indeed intermittent and occurs infrequently, you can run this command for as long as necessary until you find the problem and then look back at the output.

In most cases, the output can more clearly locate the problem.

2. Use SHOW PROCESSLIST

This method is to continuously capture the output of SHOW PROCESSLIST to observe whether a large number of threads are in an abnormal state or have other abnormal characteristics.

For example, queries rarely stay in the "statistics" state for long. This state covers the phase in which the server determines the join order during query optimization, which is usually very fast.

It is also rare to see a large number of threads whose user is reported as "unauthenticated user". This state occurs only in the middle of the connection handshake and should not normally linger.

When using the SHOW PROCESSLIST command, adding \G at the end prints the results vertically, which is very useful: each column of each row is output on its own line, making it convenient to use a sort | uniq | sort pipeline to count how many times each value of a column appears:

$ mysql -e 'SHOW PROCESSLIST\G' | grep State: | sort | uniq -c | sort -rn

[thread state counts omitted]
If you want to view different columns, you only need to modify the pattern of grep. The State column is useful in most cases.

As you can see from the output of this example, there are many threads in states that are at the end of the query execution, including "freeing items", "end", "cleaning up", and "logging slow query". In fact, on the server in this case, the same pattern or similar output samples occurred many times. A large number of threads in the "freeing items" state is a clear sign and indicator of a large number of problematic queries.

The example demonstrated above was due to contention and flushing of dirty blocks within InnoDB, but sometimes the cause can be much simpler than that.

A classic example is that many queries are in the "Locked" state. This is a typical problem of MyISAM. Its table-level locking may quickly lead to server-level thread accumulation when there are many write requests.

3. Using query logs

If you want to find problems through query logs, you need to enable slow query logs and set long_query_time to 0 at the global level, and confirm that all connections use the new settings.

This may require resetting all connections for the new global settings to take effect

Or use a feature of Percona Server that dynamically enforces settings without disconnecting existing connections.

If for some reason, you cannot set the slow query log to record all queries, you can use tcpdump and pt-query-digest tools to simulate instead.


Be careful to look for periods in the log where throughput drops suddenly. Queries are written to the slow query log only when they complete, so a pileup typically shows up as a sudden burst of completions once the resource that was blocking the other queries is released.

One useful side effect of this behavior is that when you see a sudden drop in throughput, you can blame the first query that completes after the drop (though not always: while some queries are blocked, others may continue to run unaffected, so this rule of thumb cannot be relied on entirely).

Once again, good tools can help diagnose such problems, otherwise you have to manually search for the cause in hundreds of GB of query logs.

The following example is only one line of code, but it counts the number of queries per second by exploiting the fact that MySQL writes a "# Time:" line to the slow log whenever the current second changes:

$ awk '/^# Time:/{print $3, $4, c;c=0}/^# User/{c++}' slow-query.log

[per-second query counts omitted]
From the output, you can see a sudden drop in throughput, preceded by a sudden spike. It is hard to say what happened from this output alone, without looking at the detailed queries from that period, but the spike and the subsequent dip are almost certainly related.

In any case, this behavior is strange and worth digging into by examining the detailed log entries for that period (in fact, the details show that many connections were dropped during the spike, probably because an application server was restarted, so not every problem is a MySQL problem).

3.4.2 Capturing diagnostic data

When intermittent problems occur, it is important to collect as much data as possible, not only at the moment the problem appears. Although this means collecting a lot of diagnostic data that may turn out to be irrelevant, it is better than missing the data that could actually diagnose the problem.

Before we start, two things need to be clear:

  1. A reliable and real-time "trigger", that is, a way to distinguish when problems arise.
  2. A tool for collecting diagnostic data.

diagnostic trigger

Triggers are very important. This is fundamental to being able to capture data when problems arise. There are two common problems that can cause undesired results:

  • False positive: collecting a lot of diagnostic data during a period when the problem never occurred, which wastes time and effort
  • False negative: failing to capture data at the moment the problem did occur, missing the opportunity and also wasting time

So it pays to spend a little extra time confirming that the triggers actually identify the problem before starting to collect data.

Typically this is a count, such as the number of running threads, the number of threads in the "freeing items" state, etc. The -c option to grep is useful when counting the number of threads in a certain state:

$ mysql -e 'SHOW PROCESSLIST\G' | grep -c "State: freeing items"
36

It is important to choose an appropriate threshold, high enough to ensure that it will not be triggered when it is normal; but not too high to ensure that problems will not be missed when they occur.

Another thing to note is that you can't set the threshold too high if you want to capture data when the problem starts.

Problems tend to escalate, and if you only start capturing data when the problem has already brought the system to its knees, it is hard to diagnose the original root cause.

We need a tool to monitor the server and collect data when trigger conditions are met. Of course, you can write your own scripts to achieve this, but it doesn't have to be so troublesome. pt-stalk in Percona Toolkit is designed for this situation.
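For example, a hedged sketch of a pt-stalk invocation that watches Threads_running and triggers a collection when it stays above a threshold for several consecutive samples (the threshold value is an assumption, and connection options may need to be added; check the documentation of your pt-stalk version):

$ pt-stalk --function status --variable Threads_running --threshold 20 --cycles 5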

What data needs to be collected

Now that the diagnostic triggers have been identified, it's time to start some processes to collect data.

Collect all the data you can, but only for as long as you need it.

When an unknown problem occurs, there are generally two possibilities:

  • The server needs to do a lot of work, causing a lot of CPU consumption
  • or waiting for some resource to be freed

Therefore, it is necessary to collect diagnostic data in different ways to confirm the cause:

  • Profiling report to confirm if there is too much work
  • Wait analysis is used to confirm whether there are large waits.

When the problem is unknown, all you can do is collect both kinds of data as thoroughly as possible.

On the GNU/Linux platform, an important tool that can be used for server internal diagnostics is oprofile. You can also use strace to analyze the system calls of the server, but there are certain risks in using it in a production environment.

If you want to analyze the query, you can use tcpdump. Most MySQL versions cannot easily turn on and off the slow query log. At this time, you can simulate it by listening to TCP traffic. Also, network traffic can be very useful in several other analyses.

For wait analysis, a common method is capturing stack traces with GDB. If threads inside MySQL are stuck in a particular place for a long time, they tend to show the same stack trace. The procedure is to start gdb, attach it to the mysqld process, and dump the stacks of all threads; then some short scripts, plus the "magic" of sort | uniq | sort, can aggregate similar stack traces and rank them by how often they occur.
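A hedged sketch of the stack-dump step (attaching gdb briefly stalls mysqld, so treat it as a last resort on a busy production server; the output file name is an assumption):

# dump the stacks of all mysqld threads in one non-interactive pass
$ gdb -ex "set pagination 0" -ex "thread apply all bt" --batch \
      -p $(pidof mysqld) > mysqld-stacktraces.txt

Percona Toolkit also includes pt-pmp, which automates this capture-and-aggregate procedure.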

You can also use the snapshot information of SHOW PROCESSLIST and SHOW INNODB STATUS to observe the status of threads and transactions for waiting analysis.

None of these methods are perfect, but they have proven to be very helpful.

Gathering all that data sounds like a lot of work. Perhaps the reader has done something similar before, but the tools we provide can help.

This tool is called pt-collect and is also part of Percona Toolkit. pt-collect is normally invoked by pt-stalk. Because it collects a lot of important data, it needs to run with root privileges. By default it collects data for 30 seconds after being triggered and then exits; that is enough for diagnosing most problems, but if there are false positives it may not gather enough useful information.

The tool is easy to download and requires no separate configuration; everything is configured through pt-stalk. It is best to have gdb and oprofile installed on the system and enabled in the pt-stalk configuration, and mysqld needs to have debugging symbols. When the trigger condition is met, pt-collect does a good job of collecting complete data, and it creates timestamped files in its output directory.

Interpret result data

If you have set the trigger conditions correctly and run pt-stalk for a long time, you only need to wait long enough to capture a few problems, and you will be able to get a lot of data to filter.

We recommend first checking two things:

  • Whether the problem actually occurred during the capture
  • Whether there is an obvious abrupt change in any of the collected metrics

It is often most rewarding to look at unusual query or transaction behavior, as well as unusual server internal behavior.

The behavior of queries or transactions can indicate whether the problem is due to the way the server is being used:

  • Low-performance SQL queries, improper use of indexes, poorly designed database logical architecture, etc.

By grabbing TCP traffic or SHOW PROCESSLIST output, you can get where queries and transactions occur, so as to know what operations users have performed on the database.


Through the internal behavior of the server, you can know whether the server has bugs, or whether there are problems with internal performance and scalability . This information can be seen in similar places, including in the output of oprofile or gdb, but it takes more experience to understand.

If you encounter unexplained errors, it is a good idea to package up all the collected data and submit it to technical support for analysis.

MySQL technical support experts should be able to analyze the reasons from the data, detailed data is very important for support staff. In addition, you can also package the output results of the other two tools in Percona Toolkit, pt-mysql-summary and pt-summary. These two tools will output MySQL status and configuration information, as well as operating system and hardware information.
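For example (output file names are assumptions; both tools may need MySQL connection options depending on your setup):

$ pt-summary > system-summary.txt
$ pt-mysql-summary > mysql-summary.txt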

Percona Toolkit also provides a tool for quickly inspecting the collected sample data: pt-sift.

The tool navigates through all sample data in turn, getting summary information for each sample. You can also drill down to details if required. Using this tool can at least save a lot of typing and knocking on the keyboard many times.

Below is the oprofile output on a problem server, can you find the problem?

[oprofile output omitted]

An important performance bottleneck analysis tool for wait analysis is gdb's stack trace.

Here is the output of a stack trace for a thread, formatted a bit for printing:
[gdb stack trace omitted]
The stack needs to be viewed from the bottom up.

In other words, the thread is currently executing the pthread_cond_wait function, which is called by os_event_wait_low.

Going on, it seems that the thread is trying to enter the InnoDB kernel (srv_conc_enter_innodb), but it is put into an internal queue (os_event_wait_low). The reason should be that the number of threads in the kernel has exceeded the limit of innodb_thread_concurrency.

Of course, to truly exploit the value of a stack trace requires a lot of information to be aggregated.

appendix

"High Performance MySQL"
by Baron Schwartz, Peter Zaitsev, Vadim Tkachenko, written by
Ning Haiyuan, Zhou Zhenxing, Peng Lixun, Zhai Weixiang, and Liu Hui


Origin blog.csdn.net/weixin_46949627/article/details/129251964