Sharding middleware in practice: high availability

Foreword

After more than a year of hardening, our sharding middleware has basically solved its usability and performance problems (only "basically"; there are surely still hidden pits to fill), so attention naturally turns to high availability. This article describes some of the work we have done in this area.

Where the availability problem shows up

As stateless middleware, achieving high availability is not that hard. Still, minimizing traffic loss while a node is unavailable takes some work. The traffic loss mainly comes from two situations:

(1) The physical machine hosting a middleware node suddenly goes down.

(2) Middleware upgrades and releases.

Our middleware is exposed to applications as a database proxy, i.e. the application treats the middleware as its database, as shown below:

(figure: the application connects to the middleware as if it were its database)
For the problems above, it is therefore hard to shield the application from the impact just by retrying. We inevitably have to do some work at the lower layer, so that the client can automatically sense the middleware's state and effectively prevent traffic loss.

When the physical machine hosting the middleware goes down

Physical machine downtime is actually a common occurrence. At that moment the application sees no response, and the SQL running on that node at that instant certainly fails (its exact status is unknown; unless the backend database is queried again, the application cannot know the outcome). That part of the traffic cannot be saved. What we can do is quickly detect the downed middleware node on the client side (the Druid data source) and remove it from rotation.

Finding and removing the unavailable node

Discovering the unavailable node by heartbeat

Naturally, we probe whether the backend middleware is alive with a heartbeat. Our heartbeat periodically creates a brand-new connection, runs a MySQL ping, and then closes the connection immediately (this makes it easy to tell heartbeat traffic apart from normal traffic; if instead we kept a long-lived connection and periodically sent something like select 'a' over it, distinguishing it from business traffic would be a bit more troublesome).
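A minimal sketch of such a probe over plain JDBC (the method below is only illustrative, not our actual data-source code; it assumes the MySQL driver is available):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

// Heartbeat probe: open a brand-new connection to the middleware, ping it,
// then close the connection immediately. Opening a fresh connection each time
// keeps heartbeat traffic clearly separated from normal pooled traffic.
static boolean probe(String jdbcUrl, String user, String password) {
    try (Connection conn = DriverManager.getConnection(jdbcUrl, user, password)) {
        // with MySQL Connector/J, isValid() sends a lightweight ping to the server
        return conn.isValid(1 /* timeout in seconds */);
    } catch (SQLException e) {
        return false;   // failing to connect or to ping counts as one failed probe
    }
}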

To avoid false positives caused by transient network jitter, we only judge a middleware node unavailable after three consecutive failed connects. Three probes, however, also stretch out how long it takes to notice a dead node, so the intervals between the three connects decay exponentially.

Why not just fire off two more connects immediately after the first failure? Because network jitter can last for a window of time: if all three probes landed inside that window, they would all fail even though the network is fine again once the window closes, and we would wrongly conclude that the backend node is unavailable. Spacing the probes with exponentially decaying intervals is the compromise we made.
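A sketch of that decision logic; the concrete interval values below are only illustrative, the point being that the gaps decay exponentially:

import java.util.function.BooleanSupplier;

// A node is declared unavailable only after three consecutive failed probes,
// with exponentially decaying gaps between them (illustrative values below).
static boolean isNodeDown(BooleanSupplier probe) throws InterruptedException {
    long[] delaysMs = {0, 4_000, 2_000};   // probe now, then after 4s, then after 2s more
    for (long delay : delaysMs) {
        Thread.sleep(delay);
        if (probe.getAsBoolean()) {
            return false;                  // any successful probe clears the suspicion
        }
    }
    return true;                           // three consecutive failures: mark the node down
}

Any successful probe in the sequence clears the suspicion, so a brief jitter window only delays detection rather than producing a false positive.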

Discovering the unavailable node by error counting

The heartbeat detection above always leaves a window, and under heavy traffic many requests that use the unavailable node inside that window will fail. So we can also use error counting to help detect the unavailable node (admittedly, this mechanism is still only on our plan).

One thing to note here: only exceptions thrown while creating a connection are counted; read timeouts and the like must not be. A read timeout may simply mean a slow SQL or a problem in the backend database, and a "connection closed" error may only mean that the backend closed that particular connection, not that the whole node is down. Only a connection-creation failure reliably points to a problem with the middleware itself.
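A rough sketch of this counting rule, with made-up hook names and an arbitrary threshold (the real data source wires this in differently):

import java.sql.SQLException;
import java.util.concurrent.atomic.AtomicInteger;

// Error counting: only failures while CREATING a connection to the middleware
// count toward "node unavailable"; statement-level errors such as read
// timeouts are ignored, since they may just mean slow SQL or backend trouble.
class MiddlewareErrorCounter {
    private static final int CREATE_FAILURE_THRESHOLD = 3;   // illustrative value
    private final AtomicInteger createFailures = new AtomicInteger();

    void onConnectionCreateFailure(SQLException e) {
        if (createFailures.incrementAndGet() >= CREATE_FAILURE_THRESHOLD) {
            markNodeSuspected();              // take the node out of rotation early
        }
    }

    void onConnectionCreateSuccess() {
        createFailures.set(0);                // a successful connect resets the count
    }

    void onStatementFailure(SQLException e) {
        // deliberately not counted: slow SQL, backend DB issues, closed connections, ...
    }

    private void markNodeSuspected() { /* hand over to the heartbeat / removal logic */ }
}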

Problems caused by one request using multiple connections

Because we want transactions to be as small as possible, the multiple SQL statements inside one request do not share a single connection. In the non-transactional (auto-commit) case, however many SQL statements a request runs, that many connections are taken from the pool and returned. Keeping transactions small is very important, but it causes trouble when a middleware node goes down, as shown below:

(figure: during the fault-detection window, each SQL in a request picks a random middleware connection, so some statements may land on the downed node)
As the figure shows, during the fault-detection window (i.e., before one of the middleware nodes has been judged unavailable), the data source picks a connection at random, so each SQL has a 1/N probability (N being the number of middleware nodes) of hitting the unavailable node, and a single failed SQL fails the whole request. A quick calculation:

Suppose N is 8 and a request contains 20 SQL statements.
The probability that such a request fails during this window is 1 - (7/8)^20 ≈ 0.93,
i.e., a 93% chance of failure!

Even worse, the whole application cluster is in this state: every application instance sees roughly a 93% failure rate during the window. One downed middleware node means that, for the ten-odd seconds of detection, essentially every request of the whole service fails, which is intolerable.
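The same arithmetic in code, generalized to N middleware nodes and S SQL statements per request:

// Probability that a request fails inside the fault-detection window: each of
// its S statements independently picks one of N nodes, and the request fails
// as soon as any statement lands on the single dead node.
static double requestFailureProbability(int n, int s) {
    return 1 - Math.pow((n - 1) / (double) n, s);
}
// requestFailureProbability(8, 20) ≈ 0.93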

Using a sticky data source to solve the problem

Since we cannot instantly detect and confirm that a middleware node is unavailable, a fault-detection window is unavoidable (error counting will, of course, shorten it considerably). Ideally, though, one node going down should cost only about 1/N of the traffic. We use a sticky data source to achieve this, so that the traffic loss is generally only around 1/N, as shown below:

(figure: each request sticks to one middleware node; only requests stuck to the downed middleware 2 fail)

Combined with error counting, the total traffic loss becomes even smaller, because the fault window itself is short.
As shown above, only requests that happened to stick to middleware 2 (the unavailable one) during the window fail. Now look at the whole application cluster:

(figure: across the application cluster, only the roughly 1/N of traffic stuck to the downed node is lost)

Only the requests stuck to middleware 2 lose traffic; since the sticking choice is random, each application loses about 1/N of its traffic.
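A minimal sketch of the sticky idea (not Druid's or our middleware's actual implementation; it sticks per thread purely for brevity):

import javax.sql.DataSource;
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;

// Sticky data source: pick one middleware node per request and reuse it for
// every SQL in that request, rather than picking a random node per statement.
// A downed node then hurts only the roughly 1/N of requests stuck to it.
class StickyDataSourceSelector {
    private final List<DataSource> middlewareNodes;
    private final ThreadLocal<DataSource> sticky = new ThreadLocal<>();

    StickyDataSourceSelector(List<DataSource> middlewareNodes) {
        this.middlewareNodes = middlewareNodes;
    }

    DataSource select() {
        DataSource ds = sticky.get();
        if (ds == null) {
            ds = middlewareNodes.get(ThreadLocalRandom.current().nextInt(middlewareNodes.size()));
            sticky.set(ds);    // every later SQL in this request reuses the same node
        }
        return ds;
    }

    void reset() {             // call when the request ends, or when the node is marked down
        sticky.remove();
    }
}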

High availability during middleware upgrades and releases

Upgrades and releases of the sharding middleware are unavoidable: bug fixes and new features require restarts, and a restart means a period of unavailability. Compared with a physical machine crash, though, the point in time of the unavailability is known in advance and the restart itself is controlled, so we can use this information to drain traffic smoothly and without loss.

Letting the client sense that a node is about to go offline

In many of the setups I know of, letting the client sense that a node is going offline is done by introducing a third-party coordinator (e.g. ZooKeeper/etcd). We did not want to introduce a third-party component for this, because it would bring in ZooKeeper's own high-availability problems and make the client-side configuration more complicated. The overall idea of the smooth, lossless drain (a state machine) is shown below:

(figure: state machine of the graceful offline process)
Let heartbeat traffic sense the offline while normal traffic keeps flowing

We can reuse the client-side unavailability detection described earlier: make new heartbeat connections fail while new connections for normal requests still succeed. The client then judges this server unavailable and stops routing traffic to it. Since the unavailability is only simulated, connections that are already established, as well as new normal (non-heartbeat) connections, keep working as usual, as shown below:

(figure: during the drain, heartbeat connections are rejected while normal connections continue to be served)

On the server side, a heartbeat connection is recognized because the first statement executed after the connection is created is a MySQL ping, whereas a normal connection's first statement is ordinary SQL (in practice the Druid connection pool we use also pings right after a new connection succeeds, so we distinguish the two in another way; the details are not covered here).
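A sketch of the server-side drain behaviour built on that rule; the types are illustrative stand-ins, not the middleware's real classes:

// Drain mode: once the middleware knows it is about to go offline, it fails
// connections it identifies as heartbeats, while normal connections (both
// newly created and already established) keep being served.
class DrainAwareFrontend {
    private volatile boolean draining = false;

    void beginDrain() { draining = true; }

    // invoked with the first command received on a freshly accepted connection
    void onFirstCommand(ClientConnection conn, Command cmd) {
        if (draining && cmd.isPing()) {
            conn.close();          // the heartbeat fails, so clients mark this node unavailable
        } else {
            dispatch(conn, cmd);   // normal traffic still flows during the drain
        }
    }

    private void dispatch(ClientConnection conn, Command cmd) { /* normal handling */ }

    interface ClientConnection { void close(); }
    interface Command { boolean isPing(); }
}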

After three heartbeat failures, the client judges Server1 to have failed and needs to destroy the connections to Server1. The idea is that when the business layer finishes with a connection and returns it, the connection is closed directly instead of going back into the pool (this is a simplified description; the actual handling inside the Druid data source differs in subtle ways).

(figure: connections to Server1 are closed as they are returned, until none remain)

Because a maximum hold time is configured for every connection, the number of connections to Server1 is guaranteed to drop to zero after that time at the latest; and since online traffic is not low, in practice this convergence is fairly fast (a further refinement would be to destroy the connections proactively, but we have not done that yet).
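A sketch of the close-on-return idea; Druid's actual recycle path differs in its details, and the types below are invented purely for illustration:

import java.util.Set;

// Illustrative stand-in for a pooled connection (not Druid's API).
interface PooledConn {
    String node();           // which middleware node the connection points at
    void closePhysical();    // destroy the underlying socket
    void returnToPool();     // normal recycle
}

// When a connection comes back after its SQL has finished and its node has
// been judged unavailable, destroy it instead of recycling it, so the number
// of connections to the dead node steadily converges to zero.
static void recycle(PooledConn conn, Set<String> downNodes) {
    if (downNodes.contains(conn.node())) {
        conn.closePhysical();
    } else {
        conn.returnToPool();
    }
}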

How to determine that the offline server no longer receives traffic

With the careful handling above, the process of taking Server1 offline loses no traffic. But we still need to decide when the server has stopped receiving new traffic, and the criterion is that no client holds any connection to Server1. That is exactly why we destroy each connection once its SQL has finished, so that the connection count can fall to zero, as shown below:

(figure: the connection count to Server1 drops to zero once all in-flight connections have been destroyed)

Once the connection count reaches 0, we can re-release Server1 (the sharding middleware). For this we wrote a script; its pseudo-code is roughly as follows:

# PORT is the middleware's listening port; SERVER_PID and publish_server are
# placeholders standing in for the real stop and release steps.
while true; do
    count=$(netstat -anp | grep ":${PORT}" | grep ESTABLISHED | wc -l)
    if [ "$count" -eq 0 ]; then
        # traffic has dropped to zero: stop the old server
        kill "$SERVER_PID"
        # publish and start the upgraded server
        publish_server
        break
    else
        sleep 30
    fi
done

This script is hooked into our release platform so that nodes can be taken offline and upgraded in a rolling fashion.
I can now also explain why recover_time needs to be relatively long: newly established heartbeat connections would inflate the connection count computed by the script, so we need a window of time during which no heartbeat connections are established, letting the script run through cleanly.

recover_time is actually not necessary

If we used a separate port for heartbeats and another for normal traffic, recover_time would not be needed at all, as shown below:

(figure: heartbeat traffic on a dedicated port, separate from the port used by normal traffic)

This scheme would greatly reduce the complexity of our client-side code. But it would add a new configuration item on the client side, put an extra burden on the people using it, and require one more firewall opening on the network side, so we went with the recover_time scheme instead.

The middleware start-up order issue

The process above is a graceful offline; we found, however, that bringing the middleware online was in some cases not so graceful. After the middleware starts, if a newly established connection to a backend database gets dropped for some reason, the middleware's reactor thread can be stuck for about a minute and cannot serve anything during that time, causing traffic loss. So we changed the start-up order: the reactor's accept thread only starts receiving new traffic after every connection to the backend databases has been created successfully, which solves the problem.
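A sketch of the adjusted start-up order; the class and interface names are illustrative, not the middleware's real ones:

import java.util.List;

// Start-up order: establish every backend database connection first, and only
// then start the reactor's accept thread. A backend connection breaking during
// start-up can then no longer stall the reactor while client traffic is
// already arriving.
class MiddlewareBootstrap {
    private final List<BackendDatabase> backendDatabases;
    private final Reactor reactor;

    MiddlewareBootstrap(List<BackendDatabase> backendDatabases, Reactor reactor) {
        this.backendDatabases = backendDatabases;
        this.reactor = reactor;
    }

    void start() throws Exception {
        for (BackendDatabase db : backendDatabases) {
            db.initConnections();        // fail fast here, before any client traffic exists
        }
        reactor.startAcceptThread();     // only now begin accepting client connections
    }

    interface BackendDatabase { void initConnections() throws Exception; }
    interface Reactor { void startAcceptThread(); }
}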

Summary

Personally I feel that high availability is more complex than high performance. High performance can be attacked with repeated load tests before going online, using the measurement data to analyze bottlenecks and improve; high availability requires coping with all manner of phenomena that show up in production. The high-availability schemes in this post are only a small part of our work; a great deal of effort also goes into problems inside the middleware itself. But as long as we do not let any point slip, and analyze and resolve every problem clearly, the system will only get better.

https://mp.weixin.qq.com/s/vkvYJnKfQyuUeD_BDQy_1g

