Data Warehouse Hive HA Introduction and Practical Operation

1. Overview

In data warehousing, Hive HA (High Availability) refers to the architecture and solutions that keep the warehouse's query and analysis tooling highly available. Apache Hive is a data warehouse solution built on top of the Hadoop ecosystem for querying and analyzing large-scale data. To ensure the continuity and availability of Hive services, especially in the event of hardware failures, software issues, or other disruptions, it is important to implement a high-availability solution for Hive.

Hive HA usually involves the following aspects:

  • High availability of metadata storage : Hive metadata is stored in the Hive Metastore, including table structures, partition information, table locations, and so on. To keep this metadata highly available, database replication together with backup and recovery strategies can be used. Common database choices include MySQL, PostgreSQL, and others.

  • High availability of the query engine : Hive's query layer can achieve high availability in a variety of ways, such as using Hadoop's YARN resource manager to manage query jobs, or deploying multiple HiveServer2 instances for load balancing and failover.

  • Redundant backup of data storage : Data stored in Hadoop HDFS is protected by redundancy to ensure reliability and high availability. HDFS uses a replication mechanism that keeps multiple copies of each block, so a single node failure does not cause data loss (a quick way to inspect this is shown in the sketch after this list).

  • Automatic failover : A Hive HA solution should be able to detect failures automatically and fail over when needed. When a node or service fails, the system quickly routes requests to an available node or service, reducing downtime.

  • Monitoring and alerting : To achieve high availability, a monitoring and alerting system is essential for detecting and handling faults in time. Such a system monitors the running status of Hive services, raises alerts promptly, and enables operators to take the necessary measures against potential problems.
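As a concrete illustration of the storage-redundancy point above, here is a minimal sketch, assuming an HDFS client and the warehouse path used later in this article (/user/hive_remote/warehouse), of how to inspect the replication that protects warehouse data:

# fsck reports block counts, replica counts, and any under-replicated blocks.
hdfs fsck /user/hive_remote/warehouse -files -blocks

# The second column of a listing shows each file's replication factor.
hdfs dfs -ls -R /user/hive_remote/warehouse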

In general, Hive HA aims to ensure that the normal operation of Hive services can be maintained under various circumstances through redundancy, backup, automatic failover, and monitoring systems, thereby providing continuous data query and analysis capabilities. The exact implementation may vary depending on the organization's needs and technology stack.


2. Introduction and configuration of Hive MetaStore HA

Hive MetaStore HA (High Availability) is a set of measures and configurations taken to keep Hive's metadata store highly available. Hive metadata lives in the MetaStore and includes table definitions, partitions, and table attributes. Ensuring the high availability of the Hive MetaStore is an important step toward the reliability and stability of the entire Hive system.

(Figure: general connection principle)

(Figure: high availability principle)

Here is an example that lists multiple MetaStore addresses in hive.metastore.uris:

<configuration>
  <property>
    <name>hive.server2.thrift.bind.host</name>
    <value>metastore1_host</value>
  </property>

  <!-- Multiple MetaStore URIs for failover -->
  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://metastore1_host:9083,thrift://metastore2_host:9083</value>
  </property>
  <!-- Other configuration properties -->
</configuration>

In this example, replace metastore1_host and metastore2_host with the host addresses of your Hive MetaStore instances; additional instances can be appended the same way. Multiple addresses are separated by commas. When a connection to one instance fails, Hive tries the next address in the list, providing failover and redundancy.
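A quick smoke test can confirm that failover has something to fail over to. The sketch below, using the placeholder hostnames from the example above, checks that each MetaStore thrift port is reachable and then runs a Hive client against the full URI list:

# Reachability check for each MetaStore instance (placeholder hosts).
for h in metastore1_host metastore2_host; do
  nc -z -w 3 "$h" 9083 && echo "$h:9083 reachable" || echo "$h:9083 DOWN"
done

# Point a Hive client at the full list; it falls back to the next URI on failure.
hive --hiveconf hive.metastore.uris=thrift://metastore1_host:9083,thrift://metastore2_host:9083 \
  -e "show databases;"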

3. Introduction and configuration of Hive HiveServer2 HA

HiveServer2 HA (High Availability) is a set of measures and configurations taken to keep Apache Hive's query service, HiveServer2, highly available. HiveServer2 is the service through which users submit and execute Hive queries over various interfaces (such as JDBC and ODBC). By configuring HiveServer2 for high availability, you can ensure that query service continues even in the event of hardware failures, software problems, or other interruptions.


The following is a sample HiveServer2 high availability configuration using Apache ZooKeeper for failover. Note that this is just a simplified example, and actual configurations may vary depending on your environment and needs.

  1. Install and configure ZooKeeper : Make sure you have installed and configured a ZooKeeper cluster. You need to know the hostname or IP address and port number of each ZooKeeper server.

  2. Edit the Hive site configuration : Open the Hive configuration file hive-site.xml and add the following properties to configure HiveServer2 high availability and its integration with ZooKeeper:

<configuration>
  <!-- Enable ZooKeeper for HA -->
  <property>
    <name>hive.server2.zookeeper.namespace</name>
    <value>hiveserver2</value>
  </property>
  <property>
    <name>hive.zookeeper.client.port</name>
    <value>2181</value>
  </property>
  <property>
    <name>hive.zookeeper.quorum</name>
    <value>zk1_host:2181,zk2_host:2181,zk3_host:2181</value>
  </property>
  <property>
    <name>hive.server2.support.dynamic.service.discovery</name>
    <value>true</value>
  </property>
  <!-- Other configuration properties -->
</configuration>

Replace zk1_host, zk2_host, and zk3_host with your ZooKeeper host addresses and port numbers.
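Once this is in place, each HiveServer2 instance registers an ephemeral znode under the configured namespace when it starts, and clients discover live instances through ZooKeeper. A quick way to verify registration, assuming the ZooKeeper hosts above and a standard ZooKeeper installation, is with zkCli.sh:

# List the znodes that live HiveServer2 instances have registered.
# One entry should appear per running instance; entries disappear when an
# instance dies, which is what drives client-side failover.
zkCli.sh -server zk1_host:2181 ls /hiveserver2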

4. Environment deployment

Here, to deploy the environment quickly, Kubernetes is used to deploy Hadoop. For the hadoop-on-k8s tutorial, you can refer to my article: Hadoop HA on k8s Arrangement and Deployment Advanced.

The complete hive-site.xml configuration is as follows:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
	<!-- Configure the HDFS warehouse directory -->
	<property>
			<name>hive.metastore.warehouse.dir</name>
			<value>/user/hive_remote/warehouse</value>
	</property>

	<property>
			<name>hive.metastore.local</name>
			<value>false</value>
	</property>

	<!-- Address of the MySQL database; hive_metastore is the database name, which is created automatically (createDatabaseIfNotExist) and can be customized -->
	<property>
			<name>javax.jdo.option.ConnectionURL</name>
			<value>jdbc:mysql://192.168.182.110:13306/hive_metastore?createDatabaseIfNotExist=true&amp;useSSL=false&amp;serverTimezone=Asia/Shanghai</value>
	</property>

	<!-- MySQL driver -->
	<property>
			<name>javax.jdo.option.ConnectionDriverName</name>
			<!--<value>com.mysql.cj.jdbc.Driver</value>-->
			<value>com.mysql.jdbc.Driver</value>
	</property>

	<!-- MySQL connection user -->
	<property>
			<name>javax.jdo.option.ConnectionUserName</name>
			<value>root</value>
	</property>

	<!-- MySQL connection password -->
	<property>
			<name>javax.jdo.option.ConnectionPassword</name>
			<value>123456</value>
	</property>

	<!-- Whether to verify the metastore schema -->
	<property>
			<name>hive.metastore.schema.verification</name>
			<value>false</value>
	</property>

	<property>
			<name>system:user.name</name>
			<value>root</value>
			<description>user name</description>
	</property>

	<property>
			<name>hive.metastore.uris</name>
			<value>thrift://{{ include "hadoop.fullname" . }}-hive-metastore-0.{{ include "hadoop.fullname" . }}-hive-metastore:{{ .Values.service.hive.metastore.port }},{{ include "hadoop.fullname" . }}-hive-metastore-1.{{ include "hadoop.fullname" . }}-hive-metastore:{{ .Values.service.hive.metastore.port }}</value>
	</property>

	<!-- host -->
	<property>
			<name>hive.server2.thrift.bind.host</name>
			<value>0.0.0.0</value>
			<description>Bind host on which to run the HiveServer2 Thrift service.</description>
	</property>

	<!-- HiveServer2 thrift port, default is 10000 -->
	<property>
			<name>hive.server2.thrift.port</name>
			<value>{{ .Values.service.hive.hiveserver2.port }}</value>
	</property>
	
	<!-- Enable ZooKeeper for HA -->
	<!-- Set the HiveServer2 namespace in ZooKeeper -->
	<property>
			<name>hive.server2.zookeeper.namespace</name>
			<value>hiveserver2</value>
	</property>

	<!-- Specify the ZooKeeper client port; this could arguably be omitted, since the ports are already given in hive.zookeeper.quorum -->
	<property>
			<name>hive.zookeeper.client.port</name>
			<value>2181</value>
	</property>

	<!-- Set the client addresses of the ZooKeeper cluster -->
	<property>
			<name>hive.zookeeper.quorum</name>
			<value>{{ include "hadoop.fullname" . }}-zookeeper-0.{{ include "hadoop.fullname" . }}-zookeeper.{{ .Release.Namespace }}.svc.cluster.local:2181,{{ include "hadoop.fullname" . }}-zookeeper-1.{{ include "hadoop.fullname" . }}-zookeeper.{{ .Release.Namespace }}.svc.cluster.local:2181,{{ include "hadoop.fullname" . }}-zookeeper-2.{{ include "hadoop.fullname" . }}-zookeeper.{{ .Release.Namespace }}.svc.cluster.local:2181</value>
	</property>

	<!-- Enable or disable HiveServer2 dynamic service discovery. -->
	<property>
			<name>hive.server2.support.dynamic.service.discovery</name>
			<value>true</value>
	</property>

</configuration>

[Reminder] If you are not deploying with hadoop on k8s, remember to modify the values of javax.jdo.option.ConnectionURL, hive.metastore.uris, and hive.zookeeper.quorum accordingly.

Start the deployment:

cd hadoop-ha-on-kubernetes
#mkdir -p /opt/bigdata/servers/hadoop/{nn,jn,dn,zk}/data/data{1..3}
#chmod 777 -R /opt/bigdata/servers/hadoop/
# Install
helm install hadoop-ha ./ -n hadoop-ha --create-namespace

# Check
kubectl get pods,svc -n hadoop-ha -owide

# Upgrade
# helm upgrade hadoop-ha ./ -n hadoop-ha

# Uninstall
# helm uninstall hadoop-ha -n hadoop-ha
#rm -fr /opt/bigdata/servers/hadoop/*
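Before moving on to testing, it is worth waiting until the Hive pods are actually ready. A small sketch, assuming the pod names produced by the chart above (they also appear in the test commands below):

# Confirm the MetaStore and HiveServer2 pods are Running and Ready.
kubectl get pods -n hadoop-ha | grep -E 'metastore|hiveserver2'

# Tail a MetaStore log to confirm its thrift server started cleanly.
kubectl logs -n hadoop-ha hadoop-ha-hadoop-hive-metastore-0 --tail=20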


5. Test verification

1) Hive MetaStore test verification

hive_pod_name=`kubectl get pods -n hadoop-ha|grep 'hiveserver2'|head -1 |awk '{print $1}'`

# Log in to the pod
kubectl exec -it $hive_pod_name -n hadoop-ha -- bash

# Start the Hive CLI
hive

create database test2023;
create table test2023.person_local_1(id int,name string,age int) row format delimited fields terminated by ',';
# Show the table definition
show create table test2023.person_local_1;

drop table test2023.person_local_1;
drop database test2023;

# Target a specific metastore; if none is specified, an available metastore service is used
# Interactive
SET hive.metastore.uris=thrift://hadoop-ha-hadoop-hive-metastore-0.hadoop-ha-hadoop-hive-metastore:9083;

# Non-interactive
hive --hiveconf hive.metastore.uris=thrift://hadoop-ha-hadoop-hive-metastore-0.hadoop-ha-hadoop-hive-metastore:9083 -e "show databases;"
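To exercise MetaStore failover itself, one simple drill (a sketch, using the two MetaStore replicas from the configuration above) is to kill one instance and confirm that queries still succeed through the other:

# Delete one MetaStore pod; the workload controller recreates it shortly.
kubectl delete pod hadoop-ha-hadoop-hive-metastore-0 -n hadoop-ha

# While it is down, a client should fail over to hive-metastore-1,
# because hive.metastore.uris lists both instances.
hive -e "show databases;"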

2) HiveServer2 test verification

hive_pod_name=`kubectl get pods -n hadoop-ha|grep 'hiveserver2'|head -1 |awk '{print $1}'`

# Log in to the pod
kubectl exec -it $hive_pod_name -n hadoop-ha -- bash

# Non-interactive; here I connect through the Kubernetes Service (svc), but you can also spell out a specific pod or IP
beeline -u "jdbc:hive2://hadoop-ha-hadoop-zookeeper.hadoop-ha:2181/;serviceDiscoveryMode=zookeeper;zookeeperNamespace=hiveserver2/default" -n hadoop -e "select version();"

# Interactive
beeline -u "jdbc:hive2://hadoop-ha-hadoop-zookeeper.hadoop-ha:2181/;serviceDiscoveryMode=zookeeper;zookeeperNamespace=hiveserver2/default" -n hadoop

--- 1. Create tables
create table person_local_1(id int,name string,age int) row format delimited fields terminated by ',';
create table person_hdfs_1(id int,name string,age int) row format delimited fields terminated by ',';
show tables;

--- 2. Load data from local; here "local" means the local Linux filesystem of the machine running the HS2 service
load data local inpath '/opt/bigdata/hadoop/data/hive-data' into table person_local_1;

--- 3. Query
select * from person_local_1;

--- 4. Load data from HDFS; this is a move: the file on HDFS is moved into the corresponding Hive table directory
load data inpath '/person_hdfs.txt'  into table person_hdfs_1;

--- 5. Query
select * from person_hdfs_1;
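Finally, the HiveServer2 side of failover can be exercised the same way. A sketch, reusing the connection string above: delete one HiveServer2 pod, and because its ephemeral znode disappears from ZooKeeper, new connections are routed to a surviving instance.

# From the host: delete the HiveServer2 pod found earlier; its znode is removed.
kubectl delete pod $hive_pod_name -n hadoop-ha

# From any remaining client pod, a fresh connection should reach a surviving instance.
beeline -u "jdbc:hive2://hadoop-ha-hadoop-zookeeper.hadoop-ha:2181/;serviceDiscoveryMode=zookeeper;zookeeperNamespace=hiveserver2/default" -n hadoop -e "select version();"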

This concludes the introduction and hands-on walkthrough of data warehouse Hive HA. If you have any questions, follow my official account, 大数据与云原生技术分享, for technical exchanges.

Origin juejin.im/post/7264078722758869047