Why did LinkedIn give up MySQL slowlog and switch to a slow query analyzer based on the network layer?

Editor's note: Internet backends must cope with massive request volumes. Quickly locating the requests that put the greatest load on the database is a long-standing industry problem, and analyzing production queries scientifically without degrading performance is itself a demanding technical task. This article is LinkedIn's answer to both questions.

Introduction

LinkedIn uses MySQL extensively, and more than 500 services within the company rely on MySQL. In order to facilitate management and improve resource utilization, we use a multi-tenant architecture model. However, a major disadvantage of this model is that queries from one application may affect other applications.

Although we have tuned InnoDB, the operating system, and the MySQL server configuration, we cannot control the schemas and queries that applications issue. We hoped to address this by analyzing and optimizing the queries themselves; to do that, we needed complete information about every query running against our databases.

Why do we need a query analyzer?

In order to better understand the application dynamics at runtime, we need to deeply study the SQL queries invoked by hundreds of applications, understand their performance characteristics, and then further optimize and adjust them.

For performance reasons, we did not use the slow query log. With the slow query log, you set a time threshold, and every query that exceeds it is written to a file for later analysis. The drawback is that this cannot capture all queries. Setting the threshold to 0 would capture everything, but it does not work in practice: writing millions of queries to a file generates massive IO and drastically reduces system throughput. So the slow query log was not an option.

The next option we considered was MySQL Performance Schema (available since MySQL 5.5.3), which monitors the MySQL server at a low level and provides a way to inspect the server's internal execution at runtime. Its main disadvantage is that enabling or disabling performance_schema requires a server restart. Enabling Performance Schema with all consumers turned off adds roughly 8% overhead; enabling all consumers adds roughly 20-25%. Analyzing the Performance Schema output is also quite involved. To mitigate this, MySQL introduced the sys schema in version 5.7.7, but to view historical data we would still need to dump the Performance Schema data to another server.

Because neither of these two methods can meet all our needs, we built a query analyzer running on the network layer to minimize overhead and effectively measure all queries.

How does the query analyzer work?

The query analyzer has three components:

1) An agent running on each database server.
2) A central server that stores the query information.
3) A UI on the central server that displays the SQL analysis results.

High-level architecture of the query analyzer

Agent

The agent is a service running on each MySQL server node. It uses a raw socket to capture TCP packets, then decodes the MySQL protocol from the packet stream to reconstruct queries. The agent computes query response time by recording the time the query arrives at the port and the time the first data packet is sent back once the database responds.

The query response time is the difference between the arrival time of the first request packet and the send time of the first response packet. The query is then handed to a goroutine that computes its fingerprint (we use the Percona Go package [1]). The fingerprint is the query with its literal data stripped out. We use the fingerprint's hash value as the key for the query, so each distinct query is uniquely identified by its hash.

The agent stores the query hash, total response time, count, user, and database name in a hash table. When another query produces the same hash, the agent simply increments the count and adds the query time to the total response time. In addition, the agent maintains metadata in a second hashmap: the query hash and fingerprint, maximum time, minimum time, and so on.

The agent collects query information for an interval, sends it (query hash, sum_query_time, count, etc.) to the central host, and then resets the counters. Metadata is sent only when it changes, for example when a new query appears or a query hits a new minimum or maximum time. The agent needs only a few MB of memory for these data structures, and the network bandwidth used to ship the query information is negligible.
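As a sketch of the bookkeeping described above (the type names, fields, and units here are ours for illustration, not LinkedIn's actual code), the two hashmaps and the flush cycle might look like this in Go:

```go
package main

import "fmt"

// QueryStats is what the agent accumulates per query hash between flushes.
type QueryStats struct {
	SumQueryTimeMs float64
	Count          int64
	User, DB       string
}

// QueryMeta is per-hash metadata, re-sent only when it changes
// (a new query, or a new min/max response time).
type QueryMeta struct {
	Fingerprint  string
	MinMs, MaxMs float64
	Dirty        bool // true when this entry must be (re)sent to the central host
}

// Agent holds the two hash maps described in the text.
type Agent struct {
	stats map[string]*QueryStats
	meta  map[string]*QueryMeta
}

func NewAgent() *Agent {
	return &Agent{stats: map[string]*QueryStats{}, meta: map[string]*QueryMeta{}}
}

// Record folds one observed query into both maps.
func (a *Agent) Record(hash, fingerprint, user, db string, ms float64) {
	s, ok := a.stats[hash]
	if !ok {
		s = &QueryStats{User: user, DB: db}
		a.stats[hash] = s
	}
	s.SumQueryTimeMs += ms
	s.Count++

	m, ok := a.meta[hash]
	if !ok {
		a.meta[hash] = &QueryMeta{Fingerprint: fingerprint, MinMs: ms, MaxMs: ms, Dirty: true}
		return
	}
	if ms < m.MinMs {
		m.MinMs, m.Dirty = ms, true
	}
	if ms > m.MaxMs {
		m.MaxMs, m.Dirty = ms, true
	}
}

// Flush hands the accumulated stats to the caller (for shipping to the
// central host) and resets the counters, as the agent does each interval.
func (a *Agent) Flush() map[string]*QueryStats {
	out := a.stats
	a.stats = map[string]*QueryStats{}
	return out
}

func main() {
	ag := NewAgent()
	for _, ms := range []float64{1, 2, 3} {
		ag.Record("3C074D8459FDDCE3", "SELECT * FROM T1 WHERE a > ?", "APP1", "DB1", ms)
	}
	s := ag.Flush()["3C074D8459FDDCE3"]
	fmt.Printf("sum=%vms count=%v\n", s.SumQueryTimeMs, s.Count) // sum=6ms count=3
}
```

Note that only additions and comparisons happen on the hot path; the expensive work (fingerprinting) is already done by the time `Record` is called.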

Table 1: Query fingerprint example

Query      SQL                                                         Fingerprint
Query A    SELECT * FROM table WHERE value1 = 'abc'                    SELECT * FROM table WHERE value1 = '?'
Query B    SELECT * FROM table WHERE value1 = 'abc' AND value2 = 430   SELECT * FROM table WHERE value1 = '?' AND value2 = ?
Query C    SELECT * FROM table WHERE value1 = 'xyz' AND value2 = 123   SELECT * FROM table WHERE value1 = '?' AND value2 = ?
Query D    SELECT * FROM table WHERE VALUES IN (1,2,3)                 SELECT * FROM table WHERE VALUES IN (?+)

Please note that the fingerprints of A and B are different, but the fingerprints of B and C are the same.
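A minimal illustration of this normalization in Go, using naive regular expressions in place of the Percona go-mysql/query package the agent actually uses (the function names and regexes here are our own simplification, and literals are collapsed to a bare `?`):

```go
package main

import (
	"crypto/md5"
	"fmt"
	"regexp"
	"strings"
)

// Simplified fingerprinting: replace string and numeric literals with "?"
// and collapse IN-lists to "(?+)".
var (
	reString = regexp.MustCompile(`'[^']*'`)
	reNumber = regexp.MustCompile(`\b\d+\b`)
	reInList = regexp.MustCompile(`\(\s*\?(\s*,\s*\?)+\s*\)`)
)

// Fingerprint strips the literal data out of a query.
func Fingerprint(q string) string {
	f := reString.ReplaceAllString(q, "?")
	f = reNumber.ReplaceAllString(f, "?")
	f = reInList.ReplaceAllString(f, "(?+)")
	return strings.Join(strings.Fields(f), " ") // normalize whitespace
}

// Hash turns a fingerprint into the key used in the agent's hashmaps.
func Hash(fingerprint string) string {
	return fmt.Sprintf("%X", md5.Sum([]byte(fingerprint)))
}

func main() {
	a := Fingerprint("SELECT * FROM table WHERE value1 = 'abc'")
	b := Fingerprint("SELECT * FROM table WHERE value1 = 'abc' AND value2 = 430")
	c := Fingerprint("SELECT * FROM table WHERE value1 = 'xyz' AND value2 = 123")
	fmt.Println(a)                  // SELECT * FROM table WHERE value1 = ?
	fmt.Println(b == c)             // true: B and C share a fingerprint...
	fmt.Println(Hash(b) == Hash(c)) // true: ...and therefore the same hash key
}
```

Queries B and C collapse to the same fingerprint because only their literal values differ, which is exactly why they share one row in the agent's hashmap.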

Table 2: Hashmap example

Query hash (key)    Query time              Count   User   DB
3C074D8459FDDCE3    6ms (1ms + 2ms + 3ms)   3       APP1   DB1
B414D9DF79E10545    9s (1s + 3s + 4s + 1s)  4       APP2   DB2
791C5370A1021F19    12ms (5ms + 7ms)        2       APP3   DB3

Table 3: Metadata hashmap example

Query hash          Fingerprint                   First seen    Query with max time                  Min time   Max time
3C074D8459FDDCE3    SELECT * FROM T1 WHERE a > ?  1 month ago   SELECT * FROM T1 WHERE a > 0         1ms        3ms
B414D9DF79E10545    SELECT * FROM T2 WHERE b = ?  1 day ago     SELECT * FROM T2 WHERE b = 430       1s         5s
791C5370A1021F19    SELECT * FROM T3 WHERE c < ?  1 hour ago    SELECT * FROM T3 WHERE c < 1000000   5ms        7ms

UI

The UI for displaying the analysis runs on the central server. Users select a host name and time range to see statistics for every query that ran in that window, and can click any query to view its trend graph.

One interesting metric is the query load percentage: the share of the total load on the server that a given query is responsible for over the period, where a query's load is its average execution time multiplied by its execution count. For example, suppose there are three queries.

  • Query #1 takes 2 seconds each time and is executed 100 times. The load it causes is 2 * 100 = 200.
  • Query #2 takes 0.1 milliseconds each time and is executed 10M times. The resulting load is 0.0001 * 10,000,000 = 1000.
  • Query #3 takes 10 milliseconds each time and is executed 1M times. The load it causes is 0.01 * 1,000,000 = 10000.

Therefore, the total load on the server during this interval is 200 + 1000 + 10000 = 11200. The load percentage of each query is as follows.

  • Query #1 is 200/11200 * 100 = 1.78%
  • Query #2 is 1000/11200 * 100 = 8.93%
  • Query #3 is 10000/11200 * 100 = 89.29%

Note that the query to focus on is Query #3: it accounts for 89.29% of the load, even though each execution takes only 10 milliseconds.
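The arithmetic above can be reproduced in a few lines of Go (`queryLoad` is our name for the average-time-times-count product, not something from the article's codebase):

```go
package main

import "fmt"

// queryLoad is a query's contribution to server load over an interval:
// average execution time (in seconds) multiplied by execution count.
func queryLoad(avgSeconds, count float64) float64 {
	return avgSeconds * count
}

func main() {
	loads := []float64{
		queryLoad(2, 100),             // Query #1: 2s * 100    = 200
		queryLoad(0.0001, 10_000_000), // Query #2: 0.1ms * 10M = 1000
		queryLoad(0.01, 1_000_000),    // Query #3: 10ms * 1M   = 10000
	}
	var total float64
	for _, l := range loads {
		total += l // total load over the interval: 11200
	}
	for i, l := range loads {
		fmt.Printf("Query #%d: %.2f%%\n", i+1, l/total*100)
	}
}
```

This prints load shares of roughly 1.79%, 8.93%, and 89.29%, matching the figures above (the 1.78% in the text is the same value truncated rather than rounded).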

The UI is shown in the figure below. Host names and table names are redacted for security reasons.

Query Analyzer UI shows all the different queries

Click any query to display the query trend and more information.

Graph showing query trends

Performance

To measure the impact on throughput (transactions per second), we ran Percona Server for MySQL 5.6.29-76.2-log on a machine with an Intel(R) Xeon(R) E5-2620 0 @ 2.00GHz 12-core CPU.

We ran sysbench benchmarks, steadily increasing the thread count, and measured throughput. Up to 128 concurrent threads, the query analyzer had no effect on throughput. At 256 threads we observed a 5% drop in transactions per second, which is still better than Performance Schema (a 10% drop).

In our tests the query analyzer used less than 1% of CPU, peaking at 5% with more than 128 threads running, which is still negligible. Note that the thread count here means the number of concurrent queries in MySQL, not counting sleeping connections.

Throughput benchmarks with various tools

CPU utilization benchmarks with various tools

Metrics collection

For the original version of the query analyzer, we used MySQL to store the data (which is essentially time-series data) in two tables: query_history and query_info.

query_history holds the data from the query hashmap. Its columns are hostname, checksum, timestamp, count, query time, user, and db. The primary key is (hostname, checksum, timestamp), with range partitioning by timestamp and key subpartitioning on hostname. There are indexes on (hostname, timestamp, querytime, count) and on checksum.

The query_info table stores the query metadata. Its columns are hostname, checksum, fingerprint, sample, first_seen, mintime, mintime_at, maxtime, maxtime_at, is_reviewed, reviewed_by, reviewed_on, and comments. (hostname, checksum) is the primary key, and there is an index on checksum.

So far this setup has served us well, apart from occasional lag when drawing long-range query trend graphs. To address that, we plan to move the data from MySQL into our internal monitoring tool, inGraphs [2].

Security

The agent needs elevated privileges to open raw sockets, so it would normally run under sudo. To mitigate the security implications, you can instead grant the binary the cap_net_raw capability (for example, `setcap cap_net_raw+ep /path/to/agent`) and restrict its execute permission to a specific user (chmod 100 or 500), so the agent runs as that user without sudo. See https://linux.die.net/man/7/capabilities for details.

Summary

The query analyzer has many advantages. It lets our database engineers identify problematic queries at a glance, compare queries week over week, and eliminate database slowdowns quickly and efficiently. Developers and business analysts can visualize query trends, check query load in staging before changes reach production, and obtain per-table and per-database metrics, such as the number of inserts, updates, and deletes, for business analysis. From a security perspective, the query analyzer alerts us when new queries start hitting a database, and lets us audit queries that access sensitive information. Finally, analyzing query load lets us make sure queries are evenly distributed across servers, making better use of our hardware and enabling more accurate capacity planning.

Although a timeline has not been defined, we plan to eventually open source the query analyzer and hope it will be useful to everyone else.

Thanks

Thanks to the LinkedIn MySQL team: Basavaiah Thambara and Alex Lurthu for design review, Kishore Govindaluri for UI development, and Naresh Kumar Vudutha for code review.

Related Links

1.https://github.com/percona/go-mysql/tree/master/query
2.https://engineering.linkedin.com/blog/2017/08/ingraphs--monitoring-and-unexpected-artwork

This article was written by Karthik Appigatla and translated by Jesse. Please credit the source when republishing this translation.



Origin: blog.51cto.com/14977574/2546993