Hive Configuration, Installation and Deployment

I have uploaded all the packages used to install Hive to Baidu Netdisk; feel free to grab them if you need them.
Link: https://pan.baidu.com/s/1fqHXZ0ijP4i5FAFudm6oTQ
Extraction code: 1zkf

1. Basic concepts of Hive

Hive is an open-source data statistics tool from Facebook for analyzing massive amounts of structured log data. It is a data warehouse tool based on Hadoop that maps structured data files to tables and provides SQL-like query functionality.
In essence, it converts HQL into MapReduce programs for execution.

The data Hive processes is stored on HDFS.
The underlying implementation of Hive's data analysis is MapReduce.
The executor runs on YARN.
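
For example, a simple aggregation such as the following (emp is a hypothetical table) is compiled into a MapReduce job and submitted to YARN:

hive> select deptno, count(*) from emp group by deptno;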

1.1. Advantages and disadvantages of Hive

1. Advantages:
(1) The operation interface uses SQL-like syntax, making it simple and easy to use.
(2) It removes the need to write MapReduce code, reducing the learning cost.
(3) It supports user-defined functions.
2. Disadvantages:
HQL has limited expressive power:
(1) Iterative algorithms cannot be expressed.
(2) It is not good at data mining; due to the limitations of the MapReduce processing model, more efficient algorithms cannot be implemented.
Efficiency is relatively low:
(3) The MapReduce jobs Hive generates automatically are not always well optimized.
(4) Hive tuning is difficult and coarse-grained.

1.2. Hive architecture principle

(Figure: Hive architecture diagram)
1) User interface: Client
CLI (command-line interface), JDBC/ODBC (JDBC access to Hive), WebUI (browser access to Hive)
2) Metadata: Metastore
Metadata includes: table names, the database each table belongs to (the default is default), table owners, column/partition fields, table types (e.g. whether a table is external), the directory where the table data is located, and so on.
3) Hadoop
Uses HDFS for storage and MapReduce for computation.
4) Driver: Driver
(1) Parser (SQL Parser): converts the SQL string into an abstract syntax tree (AST). This step is generally done with a third-party tool library such as ANTLR. It then performs semantic analysis on the AST, e.g. whether the table exists, whether the fields exist, and whether the SQL semantics are correct.
(2) Compiler (Physical Plan): Compile the AST to generate a logical execution plan.
(3) Optimizer (Query Optimizer): Optimizes the logical execution plan.
(4) Execution: converts the logical execution plan into a physical plan that can be run; for Hive, this means MapReduce (or Spark) jobs.
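
To see the plan the Driver produces, you can use Hive's EXPLAIN statement; for example (the query is illustrative):

hive> explain select id, count(*) from test group by id;

The output lists the stages of the generated plan, e.g. the map/reduce stages.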

1.3. Hive operating mechanism

(Figure: Hive operating mechanism)
Hive receives the user's SQL statements through the interactive interfaces it provides, uses its own Driver and metadata (MetaStore) to translate these statements into MapReduce jobs, submits them to Hadoop for execution, and returns the results to the user's interactive interface.

1.4 Comparison between Hive and database

Since Hive uses the SQL-like query language HQL (Hive Query Language), it is easy to mistake Hive for a database. In fact, from a structural point of view, apart from having similar query languages, Hive and databases have nothing in common.
1.4.1 Query language
Because SQL is widely used in data warehousing, the SQL-like query language HQL was designed around Hive's characteristics. Developers familiar with SQL can easily use Hive for development.
1.4.2 Data update
Since Hive is designed for data warehouse applications, and data warehouse workloads involve much more reading than writing, rewriting data in Hive is not recommended; all data is determined at load time.
1.4.3 Execution delay
When Hive queries data, since there are no indexes, the entire table must be scanned, so latency is high. Another factor behind Hive's high execution latency is the MapReduce framework: since MapReduce itself has high latency, Hive queries executed via MapReduce inherit it. By contrast, database execution latency is low. Of course, this holds only when the data scale is small. When the data scale grows large enough to exceed a database's processing capability, Hive's parallel computing clearly shows its advantage.
1.4.4 Data scale
Since Hive is built on a cluster and can use MapReduce for parallel computing, it can support very large-scale data; by contrast, databases support data at a much smaller scale.

2. Hive installation and deployment

2.1. Hive installation address

1) Hive official website address
http://hive.apache.org/
2) Document viewing address
https://cwiki.apache.org/confluence/display/Hive/GettingStarted
3) Download address
http://archive.apache.org/dist/hive/

2.2 Hive installation and deployment

2.2.1 Install Hive

1) Upload apache-hive-3.1.2-bin.tar.gz to the /opt/software directory of Linux

2) Unzip apache-hive-3.1.2-bin.tar.gz to the /opt/module/ directory

tar -zxvf /opt/software/apache-hive-3.1.2-bin.tar.gz -C /opt/module/

3) Rename the extracted apache-hive-3.1.2-bin/ directory to hive

mv /opt/module/apache-hive-3.1.2-bin/ /opt/module/hive

4) Modify /etc/profile.d/my_env.sh and add environment variables

sudo vim /etc/profile.d/my_env.sh

5) Add the following content

#HIVE_HOME
export HIVE_HOME=/opt/module/hive
export PATH=$PATH:$HIVE_HOME/bin
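
After saving, reload the profile so the new variables take effect in the current shell:

[hadoop@hadoop01 ~]$ source /etc/profile.d/my_env.sh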

6) Initialize metadata database

[hadoop@hadoop01 hive]$ bin/schematool -dbType derby -initSchema

2.2.2 Start and use Hive

1) Start Hive

[hadoop@hadoop01 hive]$ bin/hive

2) Using Hive

hive> show databases;
hive> show tables;
hive> create table test(id int);
hive> insert into test values(1);
hive> select * from test;

If it can be used normally, you can go to the next step.
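
Note: the embedded Derby metastore initialized above only supports a single client session at a time, which is why the following steps migrate the metastore to MySQL.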

2.3 MySQL installation

1) Check whether MySQL is installed on the current system

[hadoop@hadoop01 ~]$ rpm -qa|grep mariadb
mariadb-libs-5.5.56-2.el7.x86_64 

//If it exists, uninstall it through the following command

[hadoop@hadoop01 ~]$ sudo rpm -e --nodeps mariadb-libs

2) Copy the MySQL installation package to the /opt/software directory
3) Unzip the MySQL installation package

[hadoop@hadoop01 software]$ tar -xf mysql-5.7.28-1.el7.x86_64.rpm-bundle.tar

4) Execute rpm installation in the installation directory

[hadoop@hadoop01 software]$ sudo yum install -y libaio
[hadoop@hadoop01 software]$ sudo rpm -ivh mysql-community-common-5.7.28-1.el7.x86_64.rpm
[hadoop@hadoop01 software]$ sudo rpm -ivh mysql-community-libs-5.7.28-1.el7.x86_64.rpm
[hadoop@hadoop01 software]$ sudo rpm -ivh mysql-community-libs-compat-5.7.28-1.el7.x86_64.rpm
[hadoop@hadoop01 software]$ sudo rpm -ivh mysql-community-client-5.7.28-1.el7.x86_64.rpm
[hadoop@hadoop01 software]$ sudo rpm -ivh mysql-community-server-5.7.28-1.el7.x86_64.rpm
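
If the installation succeeded, all five packages should be listed (a quick sanity check):

[hadoop@hadoop01 software]$ rpm -qa | grep mysql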

5) If the directory pointed to by datadir in /etc/my.cnf contains any content, delete all of it:

Check the value of datadir:
[mysqld]
datadir=/var/lib/mysql
Delete all contents under the /var/lib/mysql directory:
[hadoop@hadoop01 mysql]$ cd /var/lib/mysql
[hadoop@hadoop01 mysql]$ sudo rm -rf ./*    // be careful about the directory you run this in

6) Initialize the database

[hadoop@hadoop01 opt]$ sudo mysqld --initialize --user=mysql

7) View the temporarily generated root user password

[hadoop@hadoop01 opt]$ sudo cat /var/log/mysqld.log 
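
The password appears on the log line containing "A temporary password is generated"; a grep narrows it down, for example:

[hadoop@hadoop01 opt]$ sudo grep 'temporary password' /var/log/mysqld.log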

8) Start the MySQL service

[hadoop@hadoop01 opt]$ sudo systemctl start mysqld

9) Log in to the MySQL database

[hadoop@hadoop01 opt]$ mysql -uroot -p
Enter password: (enter the temporarily generated password)
Login succeeded.

10) The root user's password must be changed first; otherwise, an error will be reported when performing other operations.

mysql> set password = password("new password");

11) Modify the root user in the user table under the mysql library to allow any IP connection

mysql> update mysql.user set host='%' where user='root';
mysql> flush privileges;
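
You can confirm the change with a quick query; the root row should now show % in the host column:

mysql> select user, host from mysql.user;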

2.4 Configuring Hive metadata to MySQL

2.4.1 Copy driver

Copy MySQL's JDBC driver to Hive's lib directory

[hadoop@hadoop01 software]$ cp /opt/software/mysql-connector-java-5.1.37.jar $HIVE_HOME/lib

1) Create a new hive-site.xml file in the $HIVE_HOME/conf directory

[hadoop@hadoop01 software]$ vim $HIVE_HOME/conf/hive-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
    <!-- JDBC connection URL -->
    <property>
        <name>javax.jdo.option.ConnectionURL</name>
        <value>jdbc:mysql://hostname:3306/metastore?useSSL=false</value>
    </property>
    <!-- JDBC connection driver -->
    <property>
        <name>javax.jdo.option.ConnectionDriverName</name>
        <value>com.mysql.jdbc.Driver</value>
    </property>
    <!-- JDBC connection username -->
    <property>
        <name>javax.jdo.option.ConnectionUserName</name>
        <value>MySQL username</value>
    </property>
    <!-- JDBC connection password -->
    <property>
        <name>javax.jdo.option.ConnectionPassword</name>
        <value>MySQL password</value>
    </property>
    <!-- Hive metastore schema version verification -->
    <property>
        <name>hive.metastore.schema.verification</name>
        <value>false</value>
    </property>
    <!-- Metastore event notification API authorization -->
    <property>
        <name>hive.metastore.event.db.notification.api.auth</name>
        <value>false</value>
    </property>
    <!-- Hive's default working directory on HDFS -->
    <property>
        <name>hive.metastore.warehouse.dir</name>
        <value>/user/hive/warehouse</value>
    </property>
</configuration>

2) Log in to MySQL

[hadoop@hadoop01 software]$ mysql -uroot -p000000

3) Create a new Hive metadata database

mysql> create database metastore;
mysql> quit;

4) Initialize the Hive metastore database

[hadoop@hadoop01 software]$ schematool -initSchema -dbType mysql -verbose
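
If initialization succeeds, the metastore database is populated with Hive's schema tables; an optional check from the MySQL side:

mysql> use metastore;
mysql> show tables;

You should see tables such as DBS, TBLS, and VERSION.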

5) Hive startup script

[hadoop@hadoop01 hive]$ vim $HIVE_HOME/bin/hiveservices.sh


#!/bin/bash
HIVE_LOG_DIR=$HIVE_HOME/logs
if [ ! -d $HIVE_LOG_DIR ]
then
    mkdir -p $HIVE_LOG_DIR
fi
# Check whether a process is running normally; argument 1 is the process name, argument 2 is its port
function check_process()
{
    pid=$(ps -ef 2>/dev/null | grep -v grep | grep -i $1 | awk '{print $2}')
    ppid=$(netstat -nltp 2>/dev/null | grep $2 | awk '{print $7}' | cut -d '/' -f 1)
    echo $pid
    [[ "$pid" =~ "$ppid" ]] && [ "$ppid" ] && return 0 || return 1
}
function hive_start()
{
    metapid=$(check_process HiveMetastore 9083)
    cmd="nohup hive --service metastore >$HIVE_LOG_DIR/metastore.log 2>&1 &"
    [ -z "$metapid" ] && eval $cmd || echo "Metastore service is already running"
    server2pid=$(check_process HiveServer2 10000)
    cmd="nohup hiveserver2 >$HIVE_LOG_DIR/hiveServer2.log 2>&1 &"
    [ -z "$server2pid" ] && eval $cmd || echo "HiveServer2 service is already running"
}
function hive_stop()
{
    metapid=$(check_process HiveMetastore 9083)
    [ "$metapid" ] && kill $metapid || echo "Metastore service is not running"
    server2pid=$(check_process HiveServer2 10000)
    [ "$server2pid" ] && kill $server2pid || echo "HiveServer2 service is not running"
}
case $1 in
"start")
    hive_start
    ;;
"stop")
    hive_stop
    ;;
"restart")
    hive_stop
    sleep 2
    hive_start
    ;;
"status")
    check_process HiveMetastore 9083 >/dev/null && echo "Metastore service is running normally" || echo "Metastore service is not running normally"
    check_process HiveServer2 10000 >/dev/null && echo "HiveServer2 service is running normally" || echo "HiveServer2 service is not running normally"
    ;;
*)
    echo "Invalid Args!"
    echo 'Usage: '$(basename $0)' start|stop|restart|status'
    ;;
esac

6) Add execute permission to the script

[hadoop@hadoop01 hive]$ chmod +x $HIVE_HOME/bin/hiveservices.sh

7) Start the Hive background services

[hadoop@hadoop01 hive]$ hiveservices.sh start
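
You can then check that both services are up and, once HiveServer2 finishes starting (it can take a minute or two), connect through it with Beeline (hadoop01 is the hostname used throughout this article):

[hadoop@hadoop01 hive]$ hiveservices.sh status
[hadoop@hadoop01 hive]$ bin/beeline -u jdbc:hive2://hadoop01:10000 -n hadoop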

2.9 Hive common attribute configuration

2.9.1 Hive running log information configuration

1) By default, Hive's log is stored in /tmp/<username>/hive.log (under the current user's name)
2) To change the log location so logs are stored in /opt/module/hive/logs:
(1) Rename /opt/module/hive/conf/hive-log4j2.properties.template to hive-log4j2.properties

[hadoop@hadoop01 conf]$ pwd
/opt/module/hive/conf
[hadoop@hadoop01 conf]$ mv hive-log4j2.properties.template hive-log4j2.properties


(2) In the hive-log4j2.properties file, change the log location: hive.log.dir=/opt/module/hive/logs

2.9.2 Print the current database and query headers
Add the following two configurations to hive-site.xml:

	 <property>
		 <name>hive.cli.print.header</name>
		 <value>true</value>
	 </property>
	 <property>
		 <name>hive.cli.print.current.db</name>
		 <value>true</value>
	 </property>
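
With these two options enabled, the CLI prompt shows the current database and query results include column headers; for example (illustrative output, using the test table created earlier):

hive (default)> select * from test;
test.id
1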

2.9.3 Parameter configuration method
1) View all current configuration information

hive> set;

2) Three ways to configure parameters
(1) Configuration file method
Default configuration file: hive-default.xml
User-defined configuration file: hive-site.xml
Note: user-defined configuration overrides the default configuration. In addition, Hive also reads the Hadoop configuration, since Hive is started as a Hadoop client; the Hive configuration overrides the Hadoop configuration. Settings in the configuration files take effect for all Hive processes started on this machine.
(2) Command line parameter method
When starting Hive, you can add -hiveconf param=value to the command line to set parameters.
For example:

[hadoop@hadoop01 hive]$ bin/hive -hiveconf mapred.reduce.tasks=10

Note: This is only valid for this hive startup.
View parameter settings:

hive (default)> set mapred.reduce.tasks;

(3) Parameter declaration method
You can use the SET keyword in HQL to set parameters. For example:

hive (default)> set mapred.reduce.tasks=100;

Note: This is only valid for this hive startup.
View parameter settings

hive (default)> set mapred.reduce.tasks;

The priority of the above three methods increases in order: configuration file < command line parameters < parameter declaration. Note that some system-level parameters, such as log4j-related settings, must be set using the first two methods, because those parameters are read before the session is established.
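
As a quick illustration of the priority order (values are arbitrary):

[hadoop@hadoop01 hive]$ bin/hive -hiveconf mapred.reduce.tasks=10
hive (default)> set mapred.reduce.tasks;   -- returns 10: the command line overrides hive-site.xml
hive (default)> set mapred.reduce.tasks=100;
hive (default)> set mapred.reduce.tasks;   -- returns 100: SET overrides the command line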

The configuration is now complete.
