MySQL in Practice 29 | How do you tell whether a database has a problem?

In articles 25 and 27, I introduced the primary/standby switchover process. From those articles you should be clear on this: in a one-primary-one-standby double-M architecture, a switchover only needs to move client traffic to the standby; in a one-primary-multiple-replica architecture, a switchover must not only move client traffic to the standby but also re-point the replicas at the new primary.

There are two switchover scenarios: active switchover and passive switchover. A passive switchover is usually initiated by the HA system because the primary has a problem.

This raises the question we are discussing today: how do you determine that a primary has a problem?

You might say this is simple: connect to MySQL and run select 1. But if select 1 returns successfully, does that mean the primary is fine?

Analyzing select 1

In fact, a successful select 1 only shows that the server process is still alive; it does not show that the primary is healthy. Consider the following scenario.

set global innodb_thread_concurrency=3;

CREATE TABLE `t` (
  `id` int(11) NOT NULL,
  `c` int(11) DEFAULT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB;

insert into t values(1,1);


Figure 1: Query blocked

We set innodb_thread_concurrency to cap the number of threads running concurrently inside InnoDB. Once the number of concurrent threads reaches this value, any new request that InnoDB receives waits until a running thread exits.

Here I set innodb_thread_concurrency to 3, which means InnoDB allows only three threads to execute in parallel. In our example, the first three sessions each run select sleep(100) from t, so all three statements stay in the "executing" state, simulating large queries.

As you can see, in session D select 1 succeeds, but the statement that queries table t is blocked. In other words, if we used select 1 to check whether the instance is healthy at this moment, we would not detect the problem.
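The behavior in Figure 1 can be imitated with a toy model. This is only a sketch: FakeInnoDB and its methods are hypothetical names, and a semaphore stands in for the slots that innodb_thread_concurrency limits. It shows why a statement that never enters InnoDB returns while a table read waits.

```python
import threading


class FakeInnoDB:
    """Toy model: only statements that touch a table need an InnoDB slot."""

    def __init__(self, thread_concurrency):
        # One semaphore permit per allowed concurrent thread.
        self._slots = threading.BoundedSemaphore(thread_concurrency)

    def occupy_slot(self):
        """Simulate a long query such as select sleep(100) from t."""
        return self._slots.acquire(blocking=False)

    def select_1(self):
        """select 1 does not enter InnoDB, so no slot is needed."""
        return 1

    def select_from_t(self, timeout):
        """A table read must enter InnoDB, so it waits for a free slot."""
        if not self._slots.acquire(timeout=timeout):
            return None  # still blocked when the caller gave up
        try:
            return [(1, 1)]
        finally:
            self._slots.release()


engine = FakeInnoDB(thread_concurrency=3)
for _ in range(3):  # the first three sessions fill all three slots
    engine.occupy_slot()

print(engine.select_1())                  # 1
print(engine.select_from_t(timeout=0.1))  # None
```

In the model, as in the real scenario, the cheap probe says nothing about whether table access is possible.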

The default value of innodb_thread_concurrency is 0, which means the number of concurrent threads is unlimited. But leaving it unlimited is not a good idea: a machine has a limited number of CPU cores, and if all the threads rush in at once, the cost of context switching becomes too high.

So in general we recommend setting innodb_thread_concurrency to a value between 64 and 128. At this point you may wonder: what use is a limit of 128 concurrent threads when production systems routinely have thousands of concurrent connections?

That doubt comes from confusing concurrent connections with concurrent queries.

Concurrent connections and concurrent queries are different concepts. The thousands of connections you see in the output of show processlist are concurrent connections. Statements that are currently executing are what we call concurrent queries.

Thousands of concurrent connections have little impact beyond using some extra memory. What we should care about is concurrent queries, because a high number of concurrent queries is what kills the CPU. That is why we need to set innodb_thread_concurrency.
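To make the distinction concrete, here is a small sketch that counts both numbers from processlist-style rows. The rows are made-up sample data and summarize_processlist is a hypothetical helper. It treats a Command value of Query as "currently executing", which is only an approximation, since a statement that is waiting for a lock also shows up this way.

```python
def summarize_processlist(rows):
    """Return (concurrent connections, statements currently executing)."""
    connections = len(rows)
    running = sum(1 for row in rows if row["Command"] == "Query")
    return connections, running


# Made-up rows in the shape of show processlist output.
rows = [
    {"Id": 1, "Command": "Sleep", "Time": 120},
    {"Id": 2, "Command": "Sleep", "Time": 45},
    {"Id": 3, "Command": "Query", "Time": 3},
    {"Id": 4, "Command": "Sleep", "Time": 900},
]

print(summarize_processlist(rows))  # (4, 1)
```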

You may also recall the hot-row update and deadlock detection we discussed in article 7. If innodb_thread_concurrency is set to 128 and many threads update the same hot row, won't the 128 slots be used up quickly and the whole system hang?

In fact, once a thread enters a lock wait, the concurrent-thread count is decremented by one. That is, threads waiting for row locks (including gap locks) are not counted among the 128.

This design makes good sense. A thread waiting for a lock no longer consumes CPU; more importantly, the design is necessary to keep the whole system from locking up.

Why? Suppose threads in lock wait also counted toward the concurrent-thread total. Then imagine this scenario:

  1. Thread 1 executes begin; update t set c=c+1 where id=1, starting transaction trx1, and then stays in that state. The thread is now idle and is not counted as a concurrent thread.
  2. Thread 2 through thread 129 each execute update t set c=c+1 where id=1; and enter a wait state because of the row lock. So 128 threads are now waiting;
  3. If threads in lock wait did not reduce the count, InnoDB would consider the thread count full and stop any other statement from entering the engine, so thread 1 could not commit its transaction. The other 128 threads would remain in lock wait, and the whole system would be blocked.

Figure 2 shows this situation.


Figure 2: System lockup (assuming threads waiting for row locks count toward concurrency)

At this point InnoDB cannot respond to any request and the whole system is locked up. Moreover, since every thread is waiting, CPU usage is 0, which is clearly unreasonable. So InnoDB's design of decrementing the concurrent-thread count when a thread enters a lock wait is both reasonable and necessary.
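The two counting policies can be compared with a few lines of arithmetic. This is a simplified sketch, not InnoDB's actual bookkeeping; can_enter_innodb and its parameters are hypothetical names chosen for illustration.

```python
def slots_in_use(running_queries, lock_waiters, count_lock_waiters):
    """Threads charged against the limit under the chosen policy."""
    charged_waiters = lock_waiters if count_lock_waiters else 0
    return running_queries + charged_waiters


def can_enter_innodb(limit, running_queries, lock_waiters, count_lock_waiters):
    """True if one more statement (thread 1's commit) can enter the engine."""
    used = slots_in_use(running_queries, lock_waiters, count_lock_waiters)
    return used < limit


# 128 threads wait on the row lock held by thread 1; the limit is 128.
print(can_enter_innodb(128, 0, 128, count_lock_waiters=True))   # False
print(can_enter_innodb(128, 0, 128, count_lock_waiters=False))  # True
```

Under the first policy thread 1 can never commit, so the waiters are never released; under the second, the commit gets in and the lock queue drains.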

Although threads waiting for locks are not counted, a thread that is actually executing a query, such as select sleep(100) from t in the first three sessions of the example above, still counts toward the concurrent-thread total.

In this situation, once the number of statements executing at the same time exceeds innodb_thread_concurrency, the system is effectively unusable, yet a select 1 check still reports it as normal.

So the select 1 check logic has to change.

Checking with a table query

To detect unavailability caused by too many concurrent threads in InnoDB, we need a check that actually accesses InnoDB. A common practice is to create a table in the system database (the mysql database), say health_check, containing a single row, and then periodically run:

mysql> select * from mysql.health_check; 

This method detects the case where the database is unavailable because of too many concurrent threads.

But we soon run into the next problem: when disk space is full, this method stops working.

We know that update transactions must write the binlog. Once the disk holding the binlog is 100% full, all update statements and all transaction commit statements are blocked. Yet the system can still read data normally, so the query check passes.

So let's improve the monitoring statement again. Next, let's see what happens when we change the query into an update.

Checking with an update

An update needs a meaningful field to write. A common practice is a timestamp field that records when the last check ran. The update statement looks like this:

mysql> update mysql.health_check set t_modified=now();

Availability checks should cover both the primary and the standby. If we check the primary with an update, the standby should be checked with an update as well.

But a check run on the standby also writes to the binlog. Because primary A and standby B are usually set up as a double-M structure, the check command executed on standby B is also sent back to primary A.

If primary A and standby B both update the same row, a row conflict may occur, which can stop primary/standby replication. So the mysql.health_check table cannot have just one row.

To keep updates on the primary and standby from conflicting, we can store multiple rows in mysql.health_check, using the server_id of A and B as the primary key.

mysql> CREATE TABLE `health_check` (
  `id` int(11) NOT NULL,
  `t_modified` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB;

/* check command */
insert into mysql.health_check(id, t_modified) values (@@server_id, now()) on duplicate key update t_modified=now();

Because MySQL requires the primary and standby to have different server_id values (otherwise creating the replication relationship fails), the check commands on the primary and standby are guaranteed not to conflict.

The update check is a fairly common approach, but it still has problems. One of them, "slow detection", has long been a headache for DBAs.

You may wonder: if the update statement fails or times out, we can start a switchover, so how can detection be slow?

The answer involves how the server allocates IO resources.

First, every detection logic needs a timeout N. If an update statement has not returned after N seconds, the system is considered unavailable.
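The timeout rule can be sketched as a small wrapper around an arbitrary probe. This is one possible shape, not how any particular HA system implements it: probe_with_timeout is a hypothetical name, and the probe callables below stand in for running the update statement through a MySQL client.

```python
import concurrent.futures
import time


def probe_with_timeout(probe, timeout_seconds):
    """Return True only if probe() finishes without error within the timeout."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(probe)
    try:
        future.result(timeout=timeout_seconds)
        return True
    except concurrent.futures.TimeoutError:
        return False  # no answer within N seconds: treat as unavailable
    except Exception:
        return False  # the statement itself failed
    finally:
        pool.shutdown(wait=False)  # do not block on a probe that is stuck


def fast_probe():
    return "ok"


def slow_probe():
    time.sleep(0.5)  # stands in for an update that hangs


print(probe_with_timeout(fast_probe, timeout_seconds=1.0))  # True
print(probe_with_timeout(slow_probe, timeout_seconds=0.1))  # False
```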

Imagine the log disk's IO utilization is already at 100%. The system responds very slowly and a switchover is needed.

But 100% IO utilization means the system's IO is still working, and each request still has a chance to obtain IO resources and do its work. The update command we use for detection needs very few resources, so it may well obtain IO resources, commit successfully, and return to the detection system before the N-second timeout is reached.

The detection system sees that the update did not time out and concludes that the system is normal.

In other words, normal business SQL is already running very slowly, yet when the DBA looks, the HA system is still operating as usual and considers the primary available.

The root cause is that all the methods above are external detection. External detection has an inherent problem: randomness.

Because external detection relies on periodic polling, the system may already have a problem, but we can only discover it when the next check statement is executed. And if you are unlucky, the first poll may not catch it either, which makes the switchover slow.
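A rough calculation shows how polling adds to detection time. The model is an assumption made for illustration: probes start at fixed intervals, a probe that hits the fault is declared failed after the timeout, and missed_polls counts probes that slip through despite the fault. The function name and the numbers are hypothetical.

```python
import math


def detection_delay(fault_time, poll_interval, timeout, missed_polls=0):
    """Seconds between a fault and the moment an external poller reports it.

    Probes start at 0, poll_interval, 2 * poll_interval, and so on.
    """
    first_probe = math.ceil(fault_time / poll_interval) * poll_interval
    failing_probe = first_probe + missed_polls * poll_interval
    return failing_probe + timeout - fault_time


# Fault 1 second after a probe started, polling every 5 s, timeout N = 3 s.
print(detection_delay(fault_time=1, poll_interval=5, timeout=3))                  # 7
print(detection_delay(fault_time=1, poll_interval=5, timeout=3, missed_polls=1))  # 12
```

Each probe that passes despite the fault adds a full polling interval to the delay.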

So next I'd like to introduce a way to find database problems from inside MySQL.

Internal statistics

For the disk utilization problem, if MySQL could tell us the time taken by each internal IO request, our method of judging whether the database has a problem would be much more reliable.

In fact, since MySQL 5.6 the performance_schema database provides the file_summary_by_event_name table, which records the time of each IO request.

The file_summary_by_event_name table has many rows. Let's look at the row where event_name='wait/io/file/innodb/innodb_log_file'.


Figure 3: One row of performance_schema.file_summary_by_event_name

This row holds the timing statistics for redo log writes. The first column, EVENT_NAME, gives the type being measured.

The next three groups of columns show timing statistics for redo log operations.

The first group has five columns and covers all IO types. COUNT_STAR is the total number of IO operations; the next four columns are timing statistics in picoseconds, with the prefixes SUM, MIN, AVG and MAX meaning, as the names suggest, the sum, minimum, average and maximum.

The second group has six columns and covers reads. The last one, SUM_NUMBER_OF_BYTES_READ, is the total number of bytes read from the redo log.

The third group has six columns and covers writes.

The fourth and final group covers other types of operations. For the redo log, you can think of these as the statistics for fsync.


In the file_summary_by_event_name table of performance_schema, the binlog corresponds to the row where event_name='wait/io/file/sql/binlog'. The meaning of each field is identical to the redo log row, so I won't repeat the details here.

Because performance_schema has to record this extra information on every database operation, enabling these statistics costs some performance.

In my tests, enabling all performance_schema items reduced performance by about 10%. So I suggest enabling only the items you need. You can turn the statistics for a specific item on or off as follows.

To enable timing for redo log monitoring, run this statement:

mysql> update setup_instruments set ENABLED='YES', Timed='YES' where name like '%wait/io/file/innodb/innodb_log_file%';

Suppose you have now enabled statistics for both the redo log and the binlog. How do you use this information to diagnose the state of the instance?

It's simple: you can judge whether the database has a problem from the value of MAX_TIMER_WAIT. For example, you can set a threshold such that a single IO request taking more than 200 milliseconds counts as abnormal, and then use a statement like the following as the detection logic.

mysql> select event_name,MAX_TIMER_WAIT  FROM performance_schema.file_summary_by_event_name where event_name in ('wait/io/file/innodb/innodb_log_file','wait/io/file/sql/binlog') and MAX_TIMER_WAIT>200*1000000000;
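The factor 1000000000 in the statement converts milliseconds to picoseconds, the unit of MAX_TIMER_WAIT. The sketch below spells out that conversion and applies the same filter to rows held in memory. The sample values are made up, and ms_to_ps and slow_io_events are hypothetical helper names.

```python
PICOSECONDS_PER_MILLISECOND = 10**9


def ms_to_ps(milliseconds):
    """Convert milliseconds to picoseconds."""
    return milliseconds * PICOSECONDS_PER_MILLISECOND


def slow_io_events(rows, threshold_ms=200):
    """Return event names whose MAX_TIMER_WAIT exceeds the threshold."""
    limit = ms_to_ps(threshold_ms)
    return [name for name, max_timer_wait in rows if max_timer_wait > limit]


# Made-up (event_name, MAX_TIMER_WAIT) pairs.
rows = [
    ("wait/io/file/innodb/innodb_log_file", ms_to_ps(350)),
    ("wait/io/file/sql/binlog", ms_to_ps(12)),
]

print(ms_to_ps(200))         # 200000000000
print(slow_io_events(rows))  # ['wait/io/file/innodb/innodb_log_file']
```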

After detecting an anomaly, collect the information you need, then clear the previous statistics with the following statement:

mysql> truncate table performance_schema.file_summary_by_event_name;

That way, if the anomaly appears again in later monitoring, it can be added to the monitor's cumulative count.
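The check-then-reset cycle might look like the following sketch. IoAnomalyMonitor is a hypothetical class: fetch_max_wait_ps stands in for reading MAX_TIMER_WAIT, reset_stats stands in for the truncate statement, and an in-memory dictionary plays the role of the statistics table.

```python
class IoAnomalyMonitor:
    """Counts anomalies and clears the statistics after each one."""

    def __init__(self, fetch_max_wait_ps, reset_stats, threshold_ps):
        self._fetch = fetch_max_wait_ps
        self._reset = reset_stats
        self._threshold = threshold_ps
        self.anomaly_count = 0  # cumulative value across checks

    def check_once(self):
        """Return True if this check found an anomaly."""
        if self._fetch() <= self._threshold:
            return False
        self.anomaly_count += 1
        self._reset()  # so the next anomaly is measured from a clean slate
        return True


stats = {"max_wait_ps": 0}  # in-memory stand-in for the statistics table
monitor = IoAnomalyMonitor(
    fetch_max_wait_ps=lambda: stats["max_wait_ps"],
    reset_stats=lambda: stats.update(max_wait_ps=0),
    threshold_ps=200 * 10**9,
)

stats["max_wait_ps"] = 350 * 10**9
first = monitor.check_once()
second = monitor.check_once()  # statistics were cleared by the first check
stats["max_wait_ps"] = 500 * 10**9
third = monitor.check_once()

print(first, second, third, monitor.anomaly_count)  # True False True 2
```

Without the reset, the maximum would stay above the threshold forever and every later check would report the same old anomaly.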

Summary

Today I introduced several methods for checking the health of a MySQL instance, along with the problems of each method and the reasoning behind their evolution.

After reading this you may feel that select 1 should have been retired long ago. In fact, the widely used MHA (Master High Availability) uses this method by default.

Another option in MHA is to check only the connection, that is, "if the connection succeeds, the primary is considered fine." As far as I know, very few people choose this method.

In fact, every improvement comes with extra cost, so these schemes cannot be judged simply as "right" or "wrong"; you need to weigh them against the actual needs of the business.

My personal preference is to make updating the system table the primary check, and then supplement it with the information from performance_schema.

Finally, it's question time.

My question for you today: business systems generally have high-availability requirements too. For the services you have developed and maintained, how do you judge whether a service has a problem?

You can write your methods and analysis in the comments, and I will pick interesting ones to share and analyze in the next article. Thank you for listening, and feel free to share this article with more friends.

Last article's question

Last time's question was: if you use the wait-for-GTID approach for read/write splitting, what happens when you run DDL on a large table?

Suppose the statement takes 10 minutes to execute on the primary; after it commits, it takes another 10 minutes to reach the standby (a typical large transaction). So for transactions committed on the primary after the DDL, a query on the standby that waits for their GTID has to wait 10 minutes before the GTID appears.

As a result, during those 10 minutes the read/write splitting mechanism keeps timing out and then falls back to the primary.

For a planned operation like this, you should work during an off-peak period: make sure the primary can handle all business queries, switch all read requests to the primary, and then run the DDL on the primary. After the standby catches up on the delay, switch the read requests back to the standby.

With this question, what I mainly want you to notice is the impact of large transactions on the wait-for-position approach.

Of course, using gh-ost to handle this is also a good choice.

Reproduced from: https://juejin.im/post/5d05bdb7f265da1b7004a785

Origin blog.csdn.net/weixin_34132768/article/details/93183416