Apache Hive for big data development
1. Overview
Apache Hive is an open-source data warehouse system built on the Hadoop ecosystem. It maps structured and semi-structured data files stored in Hadoop to database tables, and provides a SQL-like query language over those tables called Hive Query Language (HQL for short).
At its core, Hive converts HQL statements into MapReduce programs and submits them to the Hadoop cluster for execution.
Why use Hive
Writing Hadoop MapReduce programs directly carries a learning cost: it requires a Java foundation, and hand-coding complex query logic such as joins, sorting, and deduplication is cumbersome.
Hive provides a SQL-like interface instead, so developers can get started quickly and write SQL to analyze and process massive data sets.
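For instance, a grouped, deduplicated count that would need a custom mapper and reducer in hand-written MapReduce is a single HQL statement. A sketch for illustration only; the `user_logs` table and its columns are hypothetical:

```sql
-- Count distinct users per day from a hypothetical user_logs table;
-- Hive compiles this statement into MapReduce job(s) automatically.
SELECT log_date,
       COUNT(DISTINCT user_id) AS daily_users
FROM user_logs
GROUP BY log_date
ORDER BY log_date;
```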
Relationship between Hive and Hadoop
As data warehouse software, Hive must at least provide two capabilities: data storage and data analysis.
Hive implements both on top of Hadoop:
- Data storage: Hive stores its data in Hadoop's HDFS
- Data analysis: Hive uses Hadoop's MapReduce to analyze data
In other words, Hive itself really does only one thing: parse HQL into MapReduce jobs. The storage and analysis capabilities come from the underlying Hadoop components.
2. Hive architecture
- UI – The user interface through which users submit queries and other operations to the system. As of 2011 the system had a command-line interface, and a web-based GUI was under development.
- Driver – The component that receives queries. It implements the notion of session handles and provides execute and fetch APIs modeled on JDBC/ODBC interfaces.
- Compiler – Includes the parser, plan compiler, optimizer, and executor. It parses the query, performs semantic analysis on the different query blocks and query expressions, and eventually generates an execution plan with the help of table and partition metadata looked up from the metastore.
- Metastore – A component that stores all the structural information for the various tables and partitions in the warehouse, including column and column type information, serializers and deserializers needed to read and write data, and the corresponding HDFS files where the data is stored.
- Execution Engine – The component that executes the plan created by the compiler. The plan is a DAG of stages; the execution engine manages the dependencies between these stages and runs each stage on the appropriate system component. Hive itself does not process data files directly; that work is delegated to the execution engine. Currently Hive supports three execution engines: MapReduce, Tez, and Spark.
3. Hive Metastore
The Hive Metastore is the component (or service) that manages Hive's metadata.
Metadata is data that describes data; for example, the field definitions of a database table are metadata.
Hive stores its data as files in Hadoop, so which files correspond to which HQL tables? That mapping is recorded by the Hive Metastore. Hive metadata covers databases, tables, field attributes, field order, and the HDFS locations where table data is stored.
There are two ways to store this metadata: the Derby database that ships with Hive, or a third-party database such as MySQL.
The metastore runs as a service that manages Hive metadata and exposes a service address; clients connect to the metastore service rather than to the metadata database directly, which protects the metadata to some extent.
The metastore service can be configured in three modes: embedded mode, local mode, and remote mode.
| | embedded mode | local mode | remote mode |
|---|---|---|---|
| Metastore deployed and started separately | no | no | yes |
| Metadata storage | built-in Derby | third-party MySQL | third-party MySQL |
4. Hive deployment
Preparation
Make sure the Hadoop cluster works normally and that cluster time synchronization, firewall settings, host names, passwordless login, the JDK, and environment variables are all set up.
4.1 Integrating Hadoop with Hive
Hive stores its data on HDFS, so relevant configuration properties must be added in Hadoop for Hive to run on top of it.
Modify core-site.xml in Hadoop, synchronize the configuration file across the cluster, and restart Hadoop for it to take effect:
<!-- Hive integration: user proxy (proxyuser) settings -->
<property>
<name>hadoop.proxyuser.root.hosts</name>
<value>*</value>
</property>
<property>
<name>hadoop.proxyuser.root.groups</name>
<value>*</value>
</property>
4.2 MySQL installation
(omitted)
4.4 Installation package
Install Hive on the master node
tar zxvf apache-hive-3.1.2-bin.tar.gz
# Resolve the guava version conflict between Hive and Hadoop
cd /soft/server/apache-hive-3.1.2-bin/
rm -rf lib/guava-19.0.jar
cp /soft/server/hadoop-3.3.0/share/hadoop/common/lib/guava-27.0-jre.jar ./lib/
Modify hive-env.sh
cd /soft/server/apache-hive-3.1.2-bin/conf
mv hive-env.sh.template hive-env.sh
vim hive-env.sh
export HADOOP_HOME=/soft/server/hadoop-3.3.0
export HIVE_CONF_DIR=/soft/server/apache-hive-3.1.2-bin/conf
export HIVE_AUX_JARS_PATH=/soft/server/apache-hive-3.1.2-bin/lib
Add hive-site.xml
<configuration>
<!-- MySQL configuration for metadata storage -->
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://192.168.141.155:3306/hive3?createDatabaseIfNotExist=true&amp;useSSL=false&amp;useUnicode=true&amp;characterEncoding=UTF-8</value>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.jdbc.Driver</value>
</property>
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>root</value>
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>hadoop</value>
</property>
<!-- Host to bind HiveServer2 (HS2) to -->
<property>
<name>hive.server2.thrift.bind.host</name>
<value>node1</value>
</property>
<!-- Metastore address for remote-mode deployment -->
<property>
<name>hive.metastore.uris</name>
<value>thrift://node1:9083</value>
</property>
<!-- Disable metastore event notification API authorization -->
<property>
<name>hive.metastore.event.db.notification.api.auth</name>
<value>false</value>
</property>
</configuration>
4.5 Initialization
Upload the MySQL JDBC driver (mysql-connector-java-5.1.32.jar) to the lib directory of the Hive installation, then initialize Hive's metadata:
cd /soft/server/apache-hive-3.1.2-bin/
bin/schematool -initSchema -dbType mysql -verbose
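To confirm the initialization succeeded, schematool can report the schema version it finds in MySQL. A minimal verification sketch, assuming the same working directory and a reachable MySQL instance (this command only reads metadata):

```shell
# Print the metastore connection URL and schema version recorded in MySQL;
# after a successful -initSchema run, a version should be reported here.
bin/schematool -info -dbType mysql
```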
4.6 Start the service
# start in the foreground
/soft/server/apache-hive-3.1.2-bin/bin/hive --service metastore
# start in the foreground with debug logging
/soft/server/apache-hive-3.1.2-bin/bin/hive --service metastore --hiveconf hive.root.logger=DEBUG,console
# start in the background
nohup /soft/server/apache-hive-3.1.2-bin/bin/hive --service metastore &
4.7 Hive clients
Hive ships two command-line clients, the newer beeline and the original hive client, corresponding to the beeline and hive scripts under the bin directory:
[root@node1 bin]# pwd
/soft/server/apache-hive-3.1.2-bin/bin
[root@node1 bin]# ll
total 44
-rwxr-xr-x. 1 root root   881 Aug 23  2019 beeline
drwxr-xr-x. 3 root root  4096 Jul 12 21:45 ext
-rwxr-xr-x. 1 root root 10158 Aug 23  2019 hive
....
Start the metastore service first, then the hiveserver2 service:
nohup /soft/server/apache-hive-3.1.2-bin/bin/hive --service metastore &
nohup /soft/server/apache-hive-3.1.2-bin/bin/hive --service hiveserver2 &
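Once both services are up, you can connect with the beeline client. A usage sketch, assuming HiveServer2 is bound to node1 as configured in hive-site.xml and listening on its default port 10000 (adjust the host and user for your cluster):

```shell
# Connect to HiveServer2 over JDBC using the beeline client
/soft/server/apache-hive-3.1.2-bin/bin/beeline -u jdbc:hive2://node1:10000 -n root
```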
5. Using Hive
For HQL (DDL) syntax, see the Hive language manual:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL