Building a distributed database load balancing architecture based on ShardingSphere

This article is divided into three parts:

  • Key points of building a load balancing architecture for a ShardingSphere-based distributed database
  • A real user case illustrating the impact of introducing load balancing
  • An introduction to and demonstration of ShardingSphere's one-stop solution for distributed databases on the cloud

Key points of the ShardingSphere load balancing architecture

Apache ShardingSphere is a distributed database ecosystem that can transform any database into a distributed database and enhance it with capabilities such as data sharding, elastic scaling, and encryption. It consists of two products, ShardingSphere-JDBC and ShardingSphere-Proxy, which can be deployed independently or together in a mixed deployment. The hybrid deployment architecture is shown below:
[Figure: ShardingSphere-JDBC and ShardingSphere-Proxy hybrid deployment architecture]

ShardingSphere-JDBC load balancing solution

ShardingSphere-JDBC is positioned as a lightweight Java framework that provides additional services at the JDBC layer. It only adds computing operations before the application performs database operations; the application process still connects directly to the database through the database driver.
Therefore, users do not need to consider load balancing for ShardingSphere-JDBC separately; they only need to consider how the application itself is load balanced.
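As a minimal sketch (assuming the ShardingSphereDriver JDBC URL prefix available in recent 5.x releases, with a hypothetical rule configuration file config.yaml on the classpath), the application uses ShardingSphere-JDBC like any other JDBC driver, so there is no extra network hop between the application and a proxy that would need load balancing:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class ShardingSphereJdbcExample {

    public static void main(String[] args) throws SQLException {
        // The URL points at a local rule configuration; the driver connects to the
        // underlying databases directly from the application process.
        String url = "jdbc:shardingsphere:classpath:config.yaml";
        try (Connection connection = DriverManager.getConnection(url);
             Statement statement = connection.createStatement();
             ResultSet resultSet = statement.executeQuery("SELECT 1")) {
            while (resultSet.next()) {
                System.out.println(resultSet.getLong(1));
            }
        }
    }
}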

ShardingSphere-Proxy load balancing solution

Deployment architecture

ShardingSphere-Proxy is positioned as a transparent database proxy that serves database clients over database protocols. ShardingSphere-Proxy is an independently deployed process; the reference architecture for placing load balancing in front of it is as follows:
[Figure: load balancing deployed in front of a ShardingSphere-Proxy cluster]

Key points of load balancing scheme

Community users have discussed in detail how to build a ShardingSphere-Proxy cluster, and others have asked why the behavior of ShardingSphere-Proxy behind a load balancer differs from what they expected.

The key point of ShardingSphere-Proxy cluster load balancing: the database protocol itself is designed to be stateful, covering, for example, connection authentication state, transaction state, and prepared statements.

If the load balancer in front of ShardingSphere-Proxy cannot understand the database protocol, the only option is Layer 4 load balancing to proxy the ShardingSphere-Proxy cluster. In that case, the state of a database connection between the client and ShardingSphere-Proxy is maintained by a specific Proxy instance.
Because the connection state is held on a specific Proxy instance, Layer 4 load balancing can only achieve connection-level load balancing. Multiple requests on the same database connection cannot be distributed across multiple Proxy instances, so request-level load balancing is not possible.
The details of Layer 4 versus Layer 7 load balancing are beyond the scope of this article.

Recommendations for the application layer

In theory, there is no functional difference between a client connecting directly to a single ShardingSphere-Proxy and connecting to a ShardingSphere-Proxy cluster through a load balancer. However, different load balancing technologies differ in implementation and configuration. For example, a direct connection to ShardingSphere-Proxy places no limit on how long a database connection session is kept, while some ELB products keep a Layer 4 session for at most 60 minutes. If an idle database connection is closed by the load balancer's timeout, the client has no awareness that the TCP connection was passively closed, which may cause the application to report an error.
Therefore, in addition to measures at the load balancing level, the client itself can also take measures to avoid the impact of introducing load balancing.

For scenarios with long execution intervals, consider creating connections on demand

For example, if a scheduled job runs at one-hour intervals and each execution is short, a single connection that is kept open will be idle most of the time. If the client cannot perceive changes in the connection state, a long idle period increases the uncertainty of that state.
For scenarios with long execution intervals, consider creating connections on demand and releasing them after use, as in the sketch below.
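A minimal sketch (host, credentials, and job logic are illustrative): the job opens a connection each time it fires and closes it as soon as the work is done, so no connection stays idle between runs.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class OnDemandConnectionJob {

    public static void main(String[] args) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        // Run once an hour; each run creates its own connection and releases it afterwards.
        scheduler.scheduleAtFixedRate(OnDemandConnectionJob::runJob, 0, 1, TimeUnit.HOURS);
    }

    private static void runJob() {
        try (Connection connection = DriverManager.getConnection(
                "jdbc:mysql://lb-host:3306?useSSL=false", "root", "root");
             Statement statement = connection.createStatement()) {
            // Job logic with database operations goes here.
            statement.execute("SELECT 1");
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}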

Consider managing database connections through a connection pool

Most database connection pools can keep valid connections alive and evict invalid ones. Managing database connections through a connection pool reduces the cost of maintaining connections yourself.
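As one example, a minimal HikariCP setup (values are illustrative and should be aligned with the load balancer's idle timeout) retires and validates connections before the load balancer can silently close them:

import java.util.concurrent.TimeUnit;

import javax.sql.DataSource;

import com.zaxxer.hikari.HikariConfig;
import com.zaxxer.hikari.HikariDataSource;

public class PooledDataSourceExample {

    public static DataSource createDataSource() {
        HikariConfig config = new HikariConfig();
        config.setJdbcUrl("jdbc:mysql://lb-host:3306?useSSL=false");
        config.setUsername("root");
        config.setPassword("root");
        // Retire connections well before a 40-minute load balancer idle timeout could close them.
        config.setMaxLifetime(TimeUnit.MINUTES.toMillis(30));
        // Evict connections that have been idle for too long.
        config.setIdleTimeout(TimeUnit.MINUTES.toMillis(10));
        // Periodically validate idle connections so dead ones are replaced (HikariCP 4.x+).
        config.setKeepaliveTime(TimeUnit.MINUTES.toMillis(5));
        return new HikariDataSource(config);
    }
}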

Consider enabling TCP KeepAlive on the client

Most clients support configuring TCP KeepAlive, for example (sample JDBC URLs follow the list):

  • MySQL Connector/J supports configuring autoReconnect or tcpKeepAlive, which are not enabled by default;
  • PostgreSQL JDBC Driver supports configuring tcpKeepAlive, which is disabled by default.
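For instance (host, port, and database name are illustrative), the option can be passed through the JDBC URL:

jdbc:mysql://lb-host:3306/sharding_db?tcpKeepAlive=true
jdbc:postgresql://lb-host:5432/sharding_db?tcpKeepAlive=true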

However, there are some limitations to enabling TCP KeepAlive:

  • The client may not support configuring TCP KeepAlive or automatic reconnection;
  • The client may not be willing to make any code or configuration changes;
  • TCP KeepAlive behavior depends on the operating system's implementation and configuration.

User case: connection interrupted by an improper load balancing configuration

Some time ago, a user reported that their ShardingSphere-Proxy cluster, which provides services through an upper-layer load balancer, had stability problems in the connection between the application and ShardingSphere-Proxy.

Problem Description

The user's production environment uses a 3-node ShardingSphere-Proxy cluster, which serves applications through a cloud vendor's ELB.

[Figure: applications accessing the 3-node ShardingSphere-Proxy cluster through the cloud vendor's ELB]
One of the applications is a resident process that executes scheduled jobs. The jobs run once an hour and contain database operations. The user reported that every time a scheduled job is triggered, the following error appears in the application log:

send of 115 bytes failed with errno=104 Connection reset by peer

The ShardingSphere-Proxy logs contain no abnormal information.
The problem only occurs for the scheduled job that runs once an hour; other applications access ShardingSphere-Proxy normally.
Since the job logic has a retry mechanism, the job succeeds after each retry and the original business is not affected.

Problem analysis

The reason for the application error is clear: the client sent data over a TCP connection that had already been closed.
The goal of troubleshooting was therefore to determine exactly why that TCP connection was closed.

For the following reasons, we recommended that the user capture packets simultaneously on both the application side and the ShardingSphere-Proxy side, for a few minutes before and after the time the problem recurs:

  • The problem reproduces on schedule every hour;
  • The problem is network related;
  • The problem does not affect the user's real-time services.

Packet capture observation 1

Every 15 seconds, ShardingSphere-Proxy receives a TCP connection establishment request initiated by the client. After the three-way handshake completes, the client immediately sends an RST to the Proxy. In the MySQL protocol, connection establishment starts with the server sending a Greeting to the client. The packet capture shows the client sending an RST to the Proxy without any response after receiving the Server Greeting, and sometimes even before the Proxy sends the Server Greeting.

[Figure: packet capture showing connections established and immediately reset every 15 seconds]
However, no traffic matching this behavior was found in the packet capture on the application side.
According to the documentation of the ELB in use, this interaction is how the ELB implements its Layer 4 health check, so the phenomenon is unrelated to the problem in this case.

[Figure: description of the ELB's Layer 4 health check mechanism]

Packet capture observation 2

On the MySQL connections established between the client and ShardingSphere-Proxy, the client sends an RST to the Proxy during TCP connection teardown.
[Figure: packet capture showing the client sending COM_QUIT followed by an RST]
The packet capture shows that the client first actively sent the COM_QUIT command to ShardingSphere-Proxy, that is, the MySQL connection was closed by the client. Possible situations include, but are not limited to:

  • The application finished using the MySQL connection and closed it normally;
  • The database connections between the application and ShardingSphere-Proxy are managed by a connection pool, which released idle connections that had timed out or exceeded their maximum lifetime.

Since the connection was actively closed by the application side, even if there were a problem in the application's own logic, in theory it would not affect other business operations.

After multiple rounds of packet capture analysis, no case was found of ShardingSphere-Proxy sending an RST to the client within a few minutes before and after the problem recurred. Based on the available information, we speculated that the connection between the client and ShardingSphere-Proxy may have been disconnected earlier, but the limited capture window did not cover the moment of disconnection.
Since ShardingSphere-Proxy itself has no logic that actively disconnects clients, the investigation moved on to the client and ELB layers.

Client application and ELB configuration check

According to user feedback:

  • The application's scheduled job runs once an hour. The application does not use a database connection pool; it manually maintains a single database connection for continued use by the scheduled job;
  • The ELB is configured with Layer 4 session persistence and a session idle timeout of 40 minutes.

Considering the execution frequency of the scheduled job, we recommended that the user set the ELB session idle timeout to a value greater than the job's execution interval.
After the user changed the ELB timeout to 66 minutes, the Connection reset problem no longer occurred.

Had packet capture continued during troubleshooting, traffic showing the ELB disconnecting the TCP connection would very likely have been captured at the 40th minute of every hour.

Conclusion

The root cause of the client error Connection reset by peer:
The ELB idle timeout was shorter than the scheduled job's execution interval, so the client's idle time exceeded the ELB session persistence timeout and the ELB disconnected the connection between the client and ShardingSphere-Proxy.
The client then sent data over a TCP connection that had already been closed by the ELB, resulting in the error Connection reset by peer.

Timeout simulation experiment

This section runs a simple experiment to verify the client's behavior after the load balancer's session times out, capturing packets during the experiment to analyze the network traffic and observe the load balancer's behavior.

Build a load-balanced ShardingSphere-Proxy cluster environment

In theory, any Layer 4 load balancing implementation could be used for this experiment; here, nginx is used as the Layer 4 load balancer.

Configure nginx stream

The idle timeout is set to 1 minute; that is, a TCP session is kept for at most 1 minute.

user  nginx;
worker_processes  auto;

error_log  /var/log/nginx/error.log notice;
pid        /var/run/nginx.pid;

events {
    worker_connections  1024;
}

stream {
    upstream shardingsphere {
        hash $remote_addr consistent;

        server proxy0:3307;
        server proxy1:3307;
    }

    server {
        listen 3306;
        proxy_timeout 1m;
        proxy_pass shardingsphere;
    }
}

Docker Compose file

version: "3.9"
services:

  nginx:
    image: nginx:1.22.0
    ports:
      - 3306:3306
    volumes:
      - /path/to/nginx.conf:/etc/nginx/nginx.conf

  proxy0:
    image: apache/shardingsphere-proxy:5.3.0
    hostname: proxy0
    ports:
      - 3307

  proxy1:
    image: apache/shardingsphere-proxy:5.3.0
    hostname: proxy1
    ports:
      - 3307

Start the environment

 $ docker compose up -d 
[+] Running 4/4
 ⠿ Network lb_default     Created                                                                                                      0.0s
 ⠿ Container lb-proxy1-1  Started                                                                                                      0.5s
 ⠿ Container lb-proxy0-1  Started                                                                                                      0.6s
 ⠿ Container lb-nginx-1   Started                                                                                                      0.6s

Simulate a scheduled-job client that reuses a single connection

Construct a client that executes SQL with delays

Here, ShardingSphere-Proxy is accessed through Java and MySQL Connector/J.
The logic is roughly as follows:

  1. Establish a connection with ShardingSphere-Proxy and execute a query against the Proxy;
  2. Wait 55 seconds, then execute another query against the Proxy;
  3. Wait another 65 seconds, then execute another query against the Proxy.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.concurrent.TimeUnit;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class ConnectToLBProxy {

    private static final Logger log = LoggerFactory.getLogger(ConnectToLBProxy.class);

    public static void main(String[] args) {
        try (Connection connection = DriverManager.getConnection("jdbc:mysql://127.0.0.1:3306?useSSL=false", "root", "root");
             Statement statement = connection.createStatement()) {
            // First query right after the connection is established.
            log.info(getProxyVersion(statement));
            // Second query within the 1-minute idle timeout.
            TimeUnit.SECONDS.sleep(55);
            log.info(getProxyVersion(statement));
            // Third query after the idle timeout has expired.
            TimeUnit.SECONDS.sleep(65);
            log.info(getProxyVersion(statement));
        } catch (Exception e) {
            log.error(e.getMessage(), e);
        }
    }

    private static String getProxyVersion(Statement statement) throws SQLException {
        try (ResultSet resultSet = statement.executeQuery("select version()")) {
            if (resultSet.next()) {
                return resultSet.getString(1);
            }
        }
        throw new UnsupportedOperationException();
    }
}

Expected results and actual client output

Expected outcome:

  1. The connection between the client and ShardingSphere-Proxy is established and the first query succeeds;
  2. The client's second query succeeds;
  3. Since the nginx idle timeout is set to 1 minute, the client's third query reports an error because the TCP connection has been disconnected.

The execution results match expectations. Because of differences in programming language and database driver, the error message differs from the one in the user case, but the root cause is the same: the TCP connection had been disconnected.
The log looks like this:

15:29:12.734 [main] INFO icu.wwj.hello.jdbc.ConnectToLBProxy - 5.7.22-ShardingSphere-Proxy 5.1.1
15:30:07.745 [main] INFO icu.wwj.hello.jdbc.ConnectToLBProxy - 5.7.22-ShardingSphere-Proxy 5.1.1
15:31:12.764 [main] ERROR icu.wwj.hello.jdbc.ConnectToLBProxy - Communications link failure
The last packet successfully received from the server was 65,016 milliseconds ago. The last packet sent successfully to the server was 65,024 milliseconds ago.
        at com.mysql.cj.jdbc.exceptions.SQLError.createCommunicationsException(SQLError.java:174)
        at com.mysql.cj.jdbc.exceptions.SQLExceptionsMapping.translateException(SQLExceptionsMapping.java:64)
        at com.mysql.cj.jdbc.StatementImpl.executeQuery(StatementImpl.java:1201)
        at icu.wwj.hello.jdbc.ConnectToLBProxy.getProxyVersion(ConnectToLBProxy.java:28)
        at icu.wwj.hello.jdbc.ConnectToLBProxy.main(ConnectToLBProxy.java:21)
Caused by: com.mysql.cj.exceptions.CJCommunicationsException: Communications link failure

The last packet successfully received from the server was 65,016 milliseconds ago. The last packet sent successfully to the server was 65,024 milliseconds ago.
        at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:77)
        at java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.base/java.lang.reflect.Constructor.newInstanceWithCaller(Constructor.java:499)
        at java.base/java.lang.reflect.Constructor.newInstance(Constructor.java:480)
        at com.mysql.cj.exceptions.ExceptionFactory.createException(ExceptionFactory.java:61)
        at com.mysql.cj.exceptions.ExceptionFactory.createException(ExceptionFactory.java:105)
        at com.mysql.cj.exceptions.ExceptionFactory.createException(ExceptionFactory.java:151)
        at com.mysql.cj.exceptions.ExceptionFactory.createCommunicationsException(ExceptionFactory.java:167)
        at com.mysql.cj.protocol.a.NativeProtocol.readMessage(NativeProtocol.java:581)
        at com.mysql.cj.protocol.a.NativeProtocol.checkErrorMessage(NativeProtocol.java:761)
        at com.mysql.cj.protocol.a.NativeProtocol.sendCommand(NativeProtocol.java:700)
        at com.mysql.cj.protocol.a.NativeProtocol.sendQueryPacket(NativeProtocol.java:1051)
        at com.mysql.cj.protocol.a.NativeProtocol.sendQueryString(NativeProtocol.java:997)
        at com.mysql.cj.NativeSession.execSQL(NativeSession.java:663)
        at com.mysql.cj.jdbc.StatementImpl.executeQuery(StatementImpl.java:1169)
        ... 2 common frames omitted
Caused by: java.io.EOFException: Can not read response from server. Expected to read 4 bytes, read 0 bytes before connection was unexpectedly lost.
        at com.mysql.cj.protocol.FullReadInputStream.readFully(FullReadInputStream.java:67)
        at com.mysql.cj.protocol.a.SimplePacketReader.readHeaderLocal(SimplePacketReader.java:81)
        at com.mysql.cj.protocol.a.SimplePacketReader.readHeader(SimplePacketReader.java:63)
        at com.mysql.cj.protocol.a.SimplePacketReader.readHeader(SimplePacketReader.java:45)
        at com.mysql.cj.protocol.a.TimeTrackingPacketReader.readHeader(TimeTrackingPacketReader.java:52)
        at com.mysql.cj.protocol.a.TimeTrackingPacketReader.readHeader(TimeTrackingPacketReader.java:41)
        at com.mysql.cj.protocol.a.MultiPacketReader.readHeader(MultiPacketReader.java:54)
        at com.mysql.cj.protocol.a.MultiPacketReader.readHeader(MultiPacketReader.java:44)
        at com.mysql.cj.protocol.a.NativeProtocol.readMessage(NativeProtocol.java:575)
        ... 8 common frames omitted

Analysis of packet capture results

The packet capture results show that after the connection idle timeout, nginx disconnected the TCP connections to both the client and the Proxy at the same time. However, the client was unaware of this, so when it later sent a command, nginx returned an RST.
On the Proxy side, the TCP teardown completed normally after nginx's idle timeout, so the Proxy was entirely unaware when the client used the disconnected connection to send a request.
In the packet capture results below:

  • Numbers 1~44: the client and ShardingSphere-Proxy establish the MySQL connection;
  • Numbers 45~50: the first query executed by the client;
  • Numbers 55~60: the second query, executed 55 seconds after the first;
  • Numbers 73~77: after the session times out, nginx simultaneously initiates TCP teardown toward both the client and ShardingSphere-Proxy;
  • Numbers 78~79: 65 seconds after the second query, the client executes the third query and a Connection reset occurs.

[Figure: packet capture of the full interaction between the client, nginx, and ShardingSphere-Proxy]

ShardingSphere on Cloud one-stop solution

Manually deploying, operating, and maintaining ShardingSphere-Proxy clusters and load balancing inevitably costs manpower and time. To address this, Apache ShardingSphere has launched a collection of cloud solutions: ShardingSphere on Cloud.

ShardingSphere on Cloud includes automated deployment scripts for virtual machines in AWS, GCP, Alibaba Cloud, and other cloud environments (such as CloudFormation stack templates and one-click Terraform deployment scripts), Helm Charts, an Operator, and automatic horizontal scaling tools for Kubernetes cloud-native environments, as well as practical content on high availability, observability, security compliance, and more.
ShardingSphere on Cloud provides the following capabilities:

  • One-click deployment of ShardingSphere-Proxy in Kubernetes based on Helm Charts;
  • Operator-based one-click deployment and automated operation and maintenance of ShardingSphere-Proxy in Kubernetes;
  • Rapid deployment of ShardingSphere-Proxy on AWS based on CloudFormation;
  • Rapid deployment of ShardingSphere-Proxy in an AWS environment based on Terraform.

This article briefly demonstrates one of the basic capabilities of ShardingSphere on Cloud: one-click deployment of a ShardingSphere-Proxy cluster in Kubernetes using Helm Charts.

  1. Use the following three commands to create a 3-node ShardingSphere-Proxy cluster with the default configuration in the Kubernetes cluster and expose it through a Service.
    helm repo add shardingsphere https://apache.github.io/shardingsphere-on-cloud
    helm repo update
    helm install shardingsphere-proxy shardingsphere/apache-shardingsphere-proxy-charts -n shardingsphere
    
    [Figure: helm install output]
  2. The application can access the ShardingSphere-Proxy cluster through the Service domain name.
kubectl run mysql-client --image=mysql:5.7.36 --image-pull-policy=IfNotPresent -- sleep 300
kubectl exec -i -t mysql-client -- mysql -h shardingsphere-proxy-apache-shardingsphere-proxy.shardingsphere.svc.cluster.local -P3307 -uroot -proot

[Figure: mysql client connected to the ShardingSphere-Proxy cluster]
The above demonstrates only one of the basic capabilities of ShardingSphere on Cloud. For more advanced, production-ready features, see the official ShardingSphere on Cloud documentation:
https://shardingsphere.apache.org/oncloud/current/cn/overview/
