Analysis of Spark2x Principle (2)

1. Overview

The high-availability (HA) solution builds on the existing community JDBCServer and uses a multi-master instance mode. Multiple JDBCServer services run in the cluster at the same time, and a client can connect to any one of them at random to carry out its work. If one or more JDBCServer services in the cluster stop working, users can still connect to the remaining healthy JDBCServer services through the same client interface.
Compared with an active-standby HA solution, the multi-master instance mode improves on the following two limitations of active-standby mode.

  • During an active-standby switchover, the service is unavailable for a period of time. JDBCServer cannot control the length of this period, which depends on the resources available in the YARN service.
  • Spark provides its service through a Thrift JDBC interface similar to HiveServer2, and users access it through Beeline or the JDBC interface. In active-standby mode, the processing capacity of the JDBCServer cluster is therefore limited to that of the single active server, so the solution does not scale well.

The HA solution in multi-master instance mode avoids the service interruption caused by an active-standby switchover, so the service is interrupted little or not at all, and it also increases concurrency by scaling the cluster out horizontally.

2. Implementation plan

The principle of the HA solution in the multi-master instance mode is shown in the figure below.
(Figure: principle of the HA solution in multi-master instance mode)

  1. When JDBCServer starts, it registers itself with ZooKeeper by writing a node in the specified directory. The node contains the IP address, port, version number, and sequence number of the instance (the information for multiple instances is separated by commas).
    An example is as follows:
[serverUri=192.168.169.84:22550;version=8.2.0;sequence=0000001244,serverUri=192.168.195.232:22550;version=8.2.0;sequence=0000001242,serverUri=192.168.81.37:22550;version=8.2.0;sequence=0000001243]
  2. When the client connects to JDBCServer, it must specify the Namespace, that is, the ZooKeeper directory in which to look up JDBCServer instances. On connection, one instance under the Namespace is selected at random. For the detailed URL, see section 3, "Introduction to URL connection".
  3. After the client has connected to the JDBCServer service, it sends an SQL statement to the service.
  4. The JDBCServer service executes the SQL statement sent by the client and returns the result to the client.
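
The registration record and the random selection described in the steps above can be sketched in Python. This is only an illustration, not the actual client code: the function names `parse_instances` and `pick_instance` are hypothetical, and the input is assumed to be the bracketed, comma-separated string from the example.

```python
import random

def parse_instances(node_data):
    """Parse a registration string into one dict per JDBCServer instance."""
    instances = []
    # Strip the surrounding brackets, then split the instances on commas.
    for entry in node_data.strip().strip("[]").split(","):
        fields = {}
        # Each instance is a ';'-separated list of key=value pairs.
        for pair in entry.split(";"):
            key, sep, value = pair.strip().partition("=")
            if sep:
                fields[key] = value.strip()
        if fields:
            instances.append(fields)
    return instances

def pick_instance(instances, rng=random):
    """Pick one instance uniformly at random (assumed selection strategy)."""
    return rng.choice(instances)

node_data = (
    "[serverUri=192.168.169.84:22550;version=8.2.0;sequence=0000001244,"
    "serverUri=192.168.195.232:22550;version=8.2.0;sequence=0000001242,"
    "serverUri=192.168.81.37:22550;version=8.2.0;sequence=0000001243]"
)
instances = parse_instances(node_data)
chosen = pick_instance(instances)
print(chosen["serverUri"])  # one of the three registered addresses
```

A real client would read this data from ZooKeeper rather than from a literal string; the sketch only shows how the record is structured and how a random choice could be made from it.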

In the HA solution, every JDBCServer service (that is, every instance) is independent and equivalent. While one instance is being upgraded or its service is interrupted, the other instances can still accept client connection requests.

The multi-master instance solution follows these rules:

  • When an instance exits abnormally, the other instances take over neither the sessions on that instance nor the services running on it.
  • When a JDBCServer process stops, its node on ZooKeeper is deleted.
  • Because the client selects a server at random, sessions may be distributed unevenly, which can cause load imbalance among instances.
  • After an instance enters maintenance mode (in which it accepts no new client connections), services still running on the instance may fail once the decommissioning timeout is reached.
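
The load-imbalance rule can be illustrated with a small simulation. The sketch below assumes uniform random selection and uses a hypothetical helper, `simulate_sessions`, and an arbitrary session count. It is not taken from the JDBCServer client.

```python
import random
from collections import Counter

def simulate_sessions(servers, n_sessions, seed=0):
    """Assign each session to a uniformly random server and count the results."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    return Counter(rng.choice(servers) for _ in range(n_sessions))

servers = [
    "192.168.169.84:22550",
    "192.168.195.232:22550",
    "192.168.81.37:22550",
]
counts = simulate_sessions(servers, n_sessions=30)
for server in servers:
    # With few sessions the per-server counts usually differ from the ideal 10.
    print(server, counts[server])
```

With a small number of sessions the counts usually deviate noticeably from an even split, and they even out only as the number of sessions grows.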

3. Introduction to URL connection

Multi-master instance mode

In multi-master instance mode, the client reads the content of the ZooKeeper node and connects to the corresponding JDBCServer service. The connection strings are as follows:

  • In security mode:
    The JDBC URL in Kinit authentication mode is as follows:
jdbc:hive2://<zkNode1_IP>:<zkNode1_Port>,<zkNode2_IP>:<zkNode2_Port>,<zkNode3_IP>:<zkNode3_Port>/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=sparkthriftserver2x;saslQop=auth-conf;auth=KERBEROS;principal=spark2x/hadoop.<system_domain_name>@<system_domain_name>;

Note:

  • "<zkNode_IP>:<zkNode_Port>" is the URL of a ZooKeeper node. Multiple URLs are separated by commas.
    For example: "192.168.81.37:24002,192.168.195.232:24002,192.168.169.84:24002".
  • "sparkthriftserver2x" is the directory on ZooKeeper from which the client randomly selects a JDBCServer instance to connect to.

Example: Run the following command to connect through the Beeline client in security mode:

sh CLIENT_HOME/spark/bin/beeline -u "jdbc:hive2://<zkNode1_IP>:<zkNode1_Port>,<zkNode2_IP>:<zkNode2_Port>,<zkNode3_IP>:<zkNode3_Port>/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=sparkthriftserver2x;saslQop=auth-conf;auth=KERBEROS;principal=spark2x/hadoop.<system_domain_name>@<system_domain_name>;"

    The JDBC URL in Keytab authentication mode is as follows:
jdbc:hive2://<zkNode1_IP>:<zkNode1_Port>,<zkNode2_IP>:<zkNode2_Port>,<zkNode3_IP>:<zkNode3_Port>/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=sparkthriftserver2x;saslQop=auth-conf;auth=KERBEROS;principal=spark2x/hadoop.<system_domain_name>@<system_domain_name>;user.principal=<principal_name>;user.keytab=<path_to_keytab>

  • In normal mode:
jdbc:hive2://<zkNode1_IP>:<zkNode1_Port>,<zkNode2_IP>:<zkNode2_Port>,<zkNode3_IP>:<zkNode3_Port>/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=sparkthriftserver2x;

Example: Execute the following command when connecting through the Beeline client in normal mode:

sh CLIENT_HOME/spark/bin/beeline -u "jdbc:hive2://<zkNode1_IP>:<zkNode1_Port>,<zkNode2_IP>:<zkNode2_Port>,<zkNode3_IP>:<zkNode3_Port>/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=sparkthriftserver2x;"
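
Because the multi-master connection strings differ only in their trailing parameters, they can be assembled programmatically. The following Python sketch mirrors the three URL forms listed in this section. The helper name `build_jdbc_url`, the realm `EXAMPLE.COM`, the user `alice`, and the keytab path are assumptions made for illustration and are not part of any Spark or Hive API.

```python
def build_jdbc_url(zk_hosts, namespace="sparkthriftserver2x",
                   server_principal=None, user_principal=None, user_keytab=None):
    """Assemble a multi-master JDBC URL from ZooKeeper 'host:port' strings."""
    parts = [
        "jdbc:hive2://" + ",".join(zk_hosts) + "/",
        "serviceDiscoveryMode=zooKeeper",
        "zooKeeperNamespace=" + namespace,
    ]
    if server_principal:
        # Kerberos parameters, as in the Kinit and Keytab forms.
        parts += ["saslQop=auth-conf", "auth=KERBEROS",
                  "principal=" + server_principal]
    if user_principal and user_keytab:
        # Keytab form: the listed URL has no trailing semicolon.
        parts += ["user.principal=" + user_principal,
                  "user.keytab=" + user_keytab]
        return ";".join(parts)
    return ";".join(parts) + ";"

zk = ["192.168.81.37:24002", "192.168.195.232:24002", "192.168.169.84:24002"]
normal_url = build_jdbc_url(zk)
kinit_url = build_jdbc_url(
    zk, server_principal="spark2x/hadoop.example.com@EXAMPLE.COM")
keytab_url = build_jdbc_url(
    zk, server_principal="spark2x/hadoop.example.com@EXAMPLE.COM",
    user_principal="alice", user_keytab="/opt/client/user.keytab")
print(normal_url)
```

The resulting string can be passed to Beeline with the -u option in the same way as the hand-written URLs above.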

Non-multi-master instance mode

Clients in non-multi-master instance mode connect to a specified JDBCServer node. Compared with multi-master instance mode, the connection string omits the ZooKeeper-related parameters "serviceDiscoveryMode" and "zooKeeperNamespace".

Example: Run the following command to connect in non-multi-master instance mode through the Beeline client in security mode:

sh CLIENT_HOME/spark/bin/beeline -u "jdbc:hive2://<server_IP>:<server_Port>/;user.principal=spark2x/hadoop.<system_domain_name>@<system_domain_name>;saslQop=auth-conf;auth=KERBEROS;principal=spark2x/hadoop.<system_domain_name>@<system_domain_name>;"

Note:

  • "<server_IP>:<server_Port>" is the URL of the specified JDBCServer node.
  • "CLIENT_HOME" is the path of the client installation.
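
The difference between the two modes is visible in the URL parameters alone. As a rough sketch, the Python functions below split a connection string into its host list and parameters and check for the two ZooKeeper-related items. The names `parse_jdbc_url` and `is_multi_master` are hypothetical, and the realm `EXAMPLE.COM` stands in for the system domain name.

```python
def parse_jdbc_url(url):
    """Split a jdbc:hive2 URL into its host list and ';'-separated parameters."""
    prefix = "jdbc:hive2://"
    if not url.startswith(prefix):
        raise ValueError("not a jdbc:hive2 URL")
    # Everything before the first '/' is the comma-separated host list.
    authority, _, rest = url[len(prefix):].partition("/")
    params = {}
    for pair in rest.split(";"):
        key, sep, value = pair.partition("=")
        if sep:
            params[key] = value
    return authority.split(","), params

def is_multi_master(url):
    """True if the URL carries both ZooKeeper service-discovery parameters."""
    _, params = parse_jdbc_url(url)
    return (params.get("serviceDiscoveryMode") == "zooKeeper"
            and "zooKeeperNamespace" in params)

multi = ("jdbc:hive2://192.168.81.37:24002,192.168.195.232:24002/;"
         "serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=sparkthriftserver2x;")
direct = ("jdbc:hive2://192.168.169.84:22550/;saslQop=auth-conf;auth=KERBEROS;"
          "principal=spark2x/hadoop.example.com@EXAMPLE.COM;")
print(is_multi_master(multi), is_multi_master(direct))  # True False
```

In the first URL the host list names ZooKeeper nodes, while in the second it names the JDBCServer node itself.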

Apart from the connection method, the JDBCServer interface is used in the same way in multi-master and non-multi-master instance modes. Because Spark JDBCServer is another implementation of Hive's HiveServer2, see the official Hive documentation for usage details: https://cwiki.apache.org/confluence/display/Hive/HiveServer2+Clients .

Origin blog.csdn.net/weixin_43114209/article/details/132684287