Apache Hive installation and deployment: a detailed illustrated tutorial

Table of contents

1. Apache Hive metadata

1.1 Hive Metadata

1.2 Hive Metastore 

2. Three Metastore configuration methods 

2.1 Embedded mode

2.2 Local mode 

2.3 Remote mode 

3. Hive deployment practice

3.1 Preparation before installation  

3.2 Integration of Hadoop and Hive

3.3 Remote mode installation 

3.3.1 Install MySQL

3.3.2 Hive installation 

3.4 Start Hive service 

3.4.1 Foreground startup (not recommended)

3.4.2 Background startup (recommended)

4. Use of Apache Hive client 

4.1 bin/hive, bin/beeline 

4.2 HiveServer and HiveServer2 services 

4.3 How the clients and services relate 

4.4 bin/hive client use 

4.5 bin/beeline client use

5. First experience with Apache Hive 

5.1 Experience 1: Is Hive similar to MySQL? 

5.1.1 Background 

5.1.2 Process 

5.1.3 Verification 

5.1.4 Conclusion 

5.2 Experience 2: How can Hive map structured data into tables? 

5.2.1 Background 

5.2.2 Process 

5.2.3 Verification 

5.2.4 Conclusion 

5.3 Experience 3: Is Hive a good fit for small data analysis?

5.3.1 Background 

5.3.2 Process 

5.3.3 Verification 

5.3.4 Conclusion 


 

1. Apache Hive metadata

1.1 Hive Metadata

        Metadata, also called data about data, is data that describes other data. It mainly records data attributes (properties) and supports functions such as indicating storage location, tracking historical data, resource search, and file recording.

        Hive metadata is the metadata of Hive. It covers the databases, tables, table locations, types, properties, field order and field types, and other meta-information created with Hive. The metadata is stored in a relational database, either Hive's built-in Derby or a third-party RDBMS such as MySQL.

1.2 Hive Metastore 

        Metastore is the metadata service. The Metastore service manages the metadata and exposes a service address to the outside world, so that clients access metadata by connecting to the metastore service, and the metastore in turn connects to the MySQL database.

        With the metastore service, multiple clients can connect at the same time, and these clients do not need to know the username and password of the MySQL database; they only need to connect to the metastore service. To some extent this also protects the security of the Hive metadata.

2. Three Metastore configuration methods 

        There are three modes for configuring the metastore service: embedded mode, local mode, and remote mode. The key to distinguishing them is to answer two questions:

  1. Does the Metastore service need to be configured and started separately?

  2. Is the metadata stored in the built-in Derby database, or in a third-party RDBMS such as MySQL?

This article deploys Hive using the mode recommended for enterprise production: remote mode.

2.1 Embedded mode

        Embedded mode (Embedded Metastore) is the default metastore deployment mode. In this mode, metadata is stored in the built-in Derby database, and both the Derby database and the metastore service are embedded in the main HiveServer process: starting HiveServer starts them as well, so no separate Metastore service is needed. However, only one active session is supported at a time, which makes this mode suitable for testing and first experiments, not for production.
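For reference, embedded mode needs no hive-site.xml changes at all; it relies on Hive's default connection settings, which point at a local Derby database and look roughly like this:

<property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:derby:;databaseName=metastore_db;create=true</value>
</property>
<property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>org.apache.derby.jdbc.EmbeddedDriver</value>
</property>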

2.2 Local mode 

        In local mode (Local Metastore), the Metastore service runs in the same process as the main HiveServer process, but the database that stores the metadata runs as a separate process, possibly on a separate host. The metastore service communicates with the metastore database over JDBC.

        Local mode stores metadata in an external database; MySQL is recommended. Hive determines the mode from the hive.metastore.uris parameter: if it is empty, the mode is local. The drawback is that every Hive service that starts embeds its own metastore.
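A minimal local-mode hive-site.xml sketch (the host and credentials are placeholders): only the JDBC connection to the external database is configured, and hive.metastore.uris is left unset, so each Hive process embeds its own metastore:

<!-- local mode: external MySQL for metadata, hive.metastore.uris left empty -->
<property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://hadoop01:3306/hive?createDatabaseIfNotExist=true</value>
</property>
<property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
</property>
<property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>root</value>
</property>
<property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>qwe123456</value>
</property>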

2.3 Remote mode 

        In remote mode (Remote Metastore), the Metastore service runs in its own separate JVM, not in the HiveServer JVM. Other processes that want to talk to the Metastore server communicate with it through the Thrift network API.

        In remote mode, you must set the hive.metastore.uris parameter to the IP and port of the machine where the metastore service runs, and you must start the metastore service manually as a separate process. Metadata is again stored in an external database; MySQL is recommended.

        In production, it is recommended to configure Hive Metastore in remote mode. Other software that depends on Hive can then access Hive through the Metastore, and because the database layer can be completely shielded behind it, manageability and security are better.

3. Hive deployment practice

3.1 Preparation before installation  

        Since Apache Hive is data warehouse software built on Hadoop, it is usually deployed and run on Linux. Therefore, no matter which Metastore configuration method you choose, first make sure the basic server environment is in order and the Hadoop cluster is healthy and available.

Basic server environment: cluster time synchronization, firewall shut down, hostname mappings, passwordless SSH login, JDK installed, and so on.

Healthy, available Hadoop cluster: the Hadoop cluster must be started before Hive. In particular, note that HDFS must have left safe mode before you start Hive.
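To check whether HDFS is still in safe mode, or to block until it exits, you can use the dfsadmin subcommands:

# check the current safe mode state
hdfs dfsadmin -safemode get

# block until safe mode is off (handy in startup scripts)
hdfs dfsadmin -safemode wait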

        Hive itself is not software that installs and runs in a distributed manner; its distributed capabilities, both distributed storage and distributed computing, are provided by Hadoop.

This time, Hive is deployed on the Hadoop cluster: Hadoop YARN HA cluster installation and deployment detailed graphic tutorial_Stars.Sky's blog-CSDN blog

3.2 Integration of Hadoop and Hive 

        Because Hive stores its data on HDFS and uses MapReduce as the execution engine to process it, Hadoop needs a few extra configuration properties for Hive to run on top of it. Modify core-site.xml in Hadoop, synchronize the file to every node in the cluster, and restart the cluster for it to take effect.

(base) [root@hadoop01 ~]# cd /bigdata/hadoop/server/hadoop-3.2.4/etc/hadoop/
(base) [root@hadoop01 /bigdata/hadoop/server/hadoop-3.2.4/etc/hadoop]# vim core-site.xml
<!-- integrate with hive -->
<property>
    <name>hadoop.proxyuser.root.hosts</name>
    <value>*</value>
</property>
<property>
    <name>hadoop.proxyuser.root.groups</name>
    <value>*</value>
</property>
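A sketch of distributing the change (the node names hadoop02/hadoop03 are assumptions based on this cluster): copy core-site.xml to the other nodes, then either restart the cluster or, since the change only affects proxyuser settings, try refreshing them in place:

(base) [root@hadoop01 /bigdata/hadoop/server/hadoop-3.2.4/etc/hadoop]# scp core-site.xml hadoop02:$PWD
(base) [root@hadoop01 /bigdata/hadoop/server/hadoop-3.2.4/etc/hadoop]# scp core-site.xml hadoop03:$PWD

# refresh proxyuser settings without a full restart
hdfs dfsadmin -refreshSuperUserGroupsConfiguration
yarn rmadmin -refreshSuperUserGroupsConfiguration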

3.3 Remote mode installation 

Remote mode has two defining requirements:

  1. MySQL needs to be installed to store Hive metadata;

  2. The Metastore service must be configured and started manually as a separate process.

3.3.1 Install MySQL

This deployment uses MySQL 5.7; see: Linux Deployment JDK+MySQL+Tomcat Detailed Process_Tomcat on Linux Connecting to mysql_Stars.Sky's Blog-CSDN Blog
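Hive will connect to MySQL with the account configured in hive-site.xml below (root / qwe123456 in this tutorial). A hedged MySQL 5.7 sketch to make sure that account can connect from the cluster hosts; the hive database itself is created automatically by createDatabaseIfNotExist=true in the JDBC URL:

-- run in the mysql client; adjust the host/user/password to your environment
GRANT ALL PRIVILEGES ON *.* TO 'root'@'%' IDENTIFIED BY 'qwe123456';
FLUSH PRIVILEGES;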

3.3.2 Hive installation 

Hive official download address: Index of /hive/hive-3.1.2 

mysql-connector-java-5.1.32.jar Official download address: MySQL :: Download MySQL Connector/J (Archived Versions) 

Third-party download address: mysql-connector-java-5.1.32.jar download and Maven, Gradle introduction code, pom file and class in the package - Times Java

#1. Extract the installation package
(base) [root@hadoop01 ~]# tar -zxvf apache-hive-3.1.2-bin.tar.gz -C /bigdata/
(base) [root@hadoop01 ~]# mv /bigdata/apache-hive-3.1.2-bin/ /bigdata/apache-hive-3.1.2

#2. Resolve the guava version mismatch between hadoop and hive
(base) [root@hadoop01 ~]# cd /bigdata/apache-hive-3.1.2/
(base) [root@hadoop01 /bigdata/apache-hive-3.1.2]# rm -rf lib/guava-19.0.jar 
(base) [root@hadoop01 /bigdata/apache-hive-3.1.2]# cp /bigdata/hadoop/server/hadoop-3.2.4/share/hadoop/common/lib/guava-27.0-jre.jar ./lib/

#3. Add the mysql jdbc driver to the hive lib/ directory
(base) [root@hadoop01 /bigdata/apache-hive-3.1.2/lib]# pwd
/bigdata/apache-hive-3.1.2/lib
(base) [root@hadoop01 /bigdata/apache-hive-3.1.2/lib]# mv mysql-connector-java-5.1.32-bin.jar mysql-connector-java-5.1.32.jar 
(base) [root@hadoop01 /bigdata/apache-hive-3.1.2/lib]# ls mysql-connector-java-5.1.32.jar 
mysql-connector-java-5.1.32.jar

#4. Edit the hive environment file and add HADOOP_HOME
(base) [root@hadoop01 ~]# cd /bigdata/apache-hive-3.1.2/conf/
(base) [root@hadoop01 /bigdata/apache-hive-3.1.2/conf]# mv hive-env.sh.template hive-env.sh
(base) [root@hadoop01 /bigdata/apache-hive-3.1.2/conf]# vim hive-env.sh
export HADOOP_HOME=/bigdata/hadoop/server/hadoop-3.2.4
export HIVE_CONF_DIR=/bigdata/apache-hive-3.1.2/conf
export HIVE_AUX_JARS_PATH=/bigdata/apache-hive-3.1.2/lib

#5. Create hive-site.xml with mysql and other settings
(base) [root@hadoop01 /bigdata/apache-hive-3.1.2/conf]# vim hive-site.xml
<configuration>
    <!-- MySQL settings for metadata storage -->
    <property>
        <name>javax.jdo.option.ConnectionURL</name>
        <value>jdbc:mysql://hadoop01:3306/hive?createDatabaseIfNotExist=true&amp;useSSL=false&amp;useUnicode=true&amp;characterEncoding=UTF-8</value>
    </property>

    <property>
        <name>javax.jdo.option.ConnectionDriverName</name>
        <value>com.mysql.jdbc.Driver</value>
    </property>

    <!-- MySQL username -->
    <property>
        <name>javax.jdo.option.ConnectionUserName</name>
        <value>root</value>
    </property>

    <!-- MySQL password -->
    <property>
        <name>javax.jdo.option.ConnectionPassword</name>
        <value>qwe123456</value>
    </property>

    <!-- host that HiveServer2 binds to -->
    <property>
        <name>hive.server2.thrift.bind.host</name>
        <value>hadoop01</value>
    </property>

    <!-- metastore service address for remote mode -->
    <property>
        <name>hive.metastore.uris</name>
        <value>thrift://hadoop01:9083</value>
    </property>

    <!-- disable metastore event notification API authorization -->
    <property>
        <name>hive.metastore.event.db.notification.api.auth</name>
        <value>false</value>
    </property>

    <!-- disable metastore schema version verification -->
    <property>
        <name>hive.metastore.schema.verification</name>
        <value>false</value>
    </property>
</configuration>

#6. Initialize the metadata; on success, 74 tables are created in mysql
(base) [root@hadoop01 ~]# cd /bigdata/apache-hive-3.1.2/
(base) [root@hadoop01 /bigdata/apache-hive-3.1.2]# bin/schematool -initSchema -dbType mysql -verbose
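To verify that initialization succeeded (a quick check; adjust credentials as needed), count the tables schematool created in the hive database:

(base) [root@hadoop01 ~]# mysql -uroot -p -e "SELECT COUNT(*) FROM information_schema.tables WHERE table_schema='hive';"
# expected result: 74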

3.4 Start Hive service 

Start the Hadoop cluster first!

3.4.1 Foreground startup (not recommended)

In foreground mode the service occupies the terminal; stop it with Ctrl+C.

(base) [root@hadoop01 ~]# /bigdata/apache-hive-3.1.2/bin/hive --service metastore
 
# foreground startup with debug logging enabled
/bigdata/apache-hive-3.1.2/bin/hive --service metastore --hiveconf hive.root.logger=DEBUG,console

3.4.2 Background startup (recommended)

To stop it, find the PID with jps and kill it.

(base) [root@hadoop01 /bigdata/apache-hive-3.1.2]# nohup /bigdata/apache-hive-3.1.2/bin/hive --service metastore &

(base) [root@hadoop01 /bigdata/apache-hive-3.1.2]# jps
3347 QuorumPeerMain
3830 DataNode
5018 NodeManager
3643 NameNode
4827 ResourceManager
4412 DFSZKFailoverController
5676 Jps
4157 JournalNode
5549 RunJar
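The metastore shows up in jps as a RunJar process (PID 5549 above). To stop the background service, find the PID and kill it:

# find the metastore PID, then stop it (replace 5549 with your actual PID)
(base) [root@hadoop01 ~]# jps | grep RunJar
5549 RunJar
(base) [root@hadoop01 ~]# kill 5549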

4. Use of Apache Hive client 

4.1 bin/hive, bin/beeline 

Over the course of its development, Hive has shipped two generations of client tools.

        First-generation client (deprecated): $HIVE_HOME/bin/hive, a shell utility. Its main functions: first, running Hive queries in interactive or batch mode; second, starting Hive-related services, such as the metastore service.

        Second-generation client (recommended): $HIVE_HOME/bin/beeline, a JDBC client and the command-line tool officially recommended for Hive. Compared with the first-generation client, it offers better performance and improved security.

        Beeline works in both embedded and remote modes. In embedded mode it runs an embedded Hive (similar to the Hive client); in remote mode, beeline connects to a separate HiveServer2 service over Thrift, which is the officially recommended setup for production. So the question arises: what is HiveServer2, and where did HiveServer1 go?

4.2 HiveServer and HiveServer2 services 

        HiveServer and HiveServer2 are two services that ship with Hive. They allow clients to operate on data in Hive without starting the CLI (command line), and both let remote clients submit requests to Hive and retrieve results from multiple programming languages such as Java and Python.

        However, HiveServer could not handle concurrent requests from more than one client. The HiveServer code was therefore rewritten in Hive 0.11.0 as HiveServer2, which solves this problem; HiveServer has since been deprecated. HiveServer2 supports multi-client concurrency and authentication, and aims to provide better support for open-API clients such as JDBC and ODBC.

4.3 How the clients and services relate 

        HiveServer2 reads and writes metadata through the Metastore service. Therefore, in remote mode, the metastore service must be started before HiveServer2.

        Special note: in remote mode, the beeline client can only access Hive through the HiveServer2 service, while bin/hive goes through the Metastore service. The chain is: beeline → HiveServer2 → Metastore → MySQL, and bin/hive → Metastore → MySQL.

4.4 bin/hive client use 

        The bin directory of the Hive installation contains bin/hive, the first-generation client provided by Hive. This client accesses Hive's metastore service and thereby operates Hive.

        Friendly reminder: with a remote-mode deployment, start the metastore service manually first. In embedded or local mode, just run bin/hive and a metastore is started inline. On the machine where the Hive metastore service runs, you can use the bin/hive client directly with no extra configuration.

        To access the Hive metastore service through bin/hive from other machines, simply scp the Hive installation directory from hadoop01 to the other machine and set the metastore service address in that machine's hive-site.xml:

(base) [root@hadoop01 /bigdata]# scp -r apache-hive-3.1.2 hadoop02:$PWD

# on hadoop02
(base) [root@hadoop02 ~]# vim /bigdata/apache-hive-3.1.2/conf/hive-site.xml 
<configuration>
<property>
    <name>hive.metastore.uris</name>
    <value>thrift://hadoop01:9083</value>
</property>
</configuration>

# start the metastore service (on hadoop01)
(base) [root@hadoop01 /bigdata/apache-hive-3.1.2]# nohup /bigdata/apache-hive-3.1.2/bin/hive --service metastore &

# connect with the client on hadoop02
(base) [root@hadoop02 ~]# /bigdata/apache-hive-3.1.2/bin/hive
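Once the hive> prompt appears on hadoop02, a quick smoke test (any metadata query will do) confirms that the metastore on hadoop01 is reachable:

-- run inside the hive CLI on hadoop02
show databases;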

4.5 bin/beeline client use

        As Hive developed, it introduced the second-generation client, beeline. The beeline client does not access the metastore service directly; instead it requires the hiveserver2 service, which must be started separately. On the server where Hive is installed, start the metastore service first, then start the hiveserver2 service.

# start the metastore service first, then the hiveserver2 service
nohup /bigdata/apache-hive-3.1.2/bin/hive --service metastore &
nohup /bigdata/apache-hive-3.1.2/bin/hive --service hiveserver2 &

        Use the beeline client on hadoop02 to connect. Note that after the hiveserver2 service starts, it takes a while before it can serve requests.
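A simple readiness check (the ports come from the configuration above): on hadoop01, wait until the metastore (9083) and hiveserver2 (10000) ports are listening:

(base) [root@hadoop01 ~]# ss -lnt | grep -E ':9083|:10000'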

        Beeline is a JDBC client that communicates with the HiveServer2 service over the JDBC protocol. The connection address is: jdbc:hive2://hadoop01:10000

(base) [root@hadoop02 ~]# /bigdata/apache-hive-3.1.2/bin/beeline 
beeline> !connect jdbc:hive2://hadoop01:10000
Connecting to jdbc:hive2://hadoop01:10000
Enter username for jdbc:hive2://hadoop01:10000: root
Enter password for jdbc:hive2://hadoop01:10000: 
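Equivalently, the connection can be made in one line, where -u gives the JDBC URL and -n the username:

(base) [root@hadoop02 ~]# /bigdata/apache-hive-3.1.2/bin/beeline -u jdbc:hive2://hadoop01:10000 -n root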

5. First experience with Apache Hive  

5.1 Experience 1: Is Hive similar to MySQL? 

5.1.1 Background 

        For newcomers to Apache Hive, the biggest source of confusion is that, judged by its data model, Hive looks similar to the relational database MySQL, and Hive SQL is a SQL-like language. So how does it actually behave?

5.1.2 Process 

        Experience steps: following MySQL habits, create and switch databases in Hive, create a table, insert data, and finally query to check that the insert succeeded.

-- create a database
create database test;
-- list all databases
show databases;
-- switch to the database
use test;

-- create a table
create table t_student(id int,name varchar(255));
-- insert one row
insert into table t_student values(1,"allen");
-- query the table
select * from t_student;

When inserting the row, the insert turned out to be extremely slow and the SQL took a very long time to execute. Why?

The single row finally went in after 98 seconds. Querying the table shows the data was inserted successfully:

5.1.3 Verification 

First, log in to Hadoop YARN and check whether there are traces of MapReduce task execution. YARN Web UI: http://hadoop01:8088/cluster

        Then log in to Hadoop HDFS and browse the file system. According to Hive's data model, table data is ultimately stored in HDFS, in the folder corresponding to the table.

HDFS Web UI: http://hadoop01:9870/

5.1.4 Conclusion 

  • Hive SQL syntax closely resembles standard SQL, which greatly lowers the learning curve.
  • Under the hood, Hive executes the insert through MapReduce, which is why it is slow.
  • Inserting a large data set row by row is completely unrealistic; the cost would be prohibitive.
  • Hive must have its own idiomatic way of loading data into tables: mapping structured files to tables.

5.2 Experience 2: How can Hive map structured data into tables? 

5.2.1 Background 

        In Hive, inserting data with insert+values statements runs through MapReduce underneath, which is very inefficient. This brings us back to the essence of Hive: it maps structured data files to tables and provides table-based SQL query analysis.

        Given a structured data file, how can it be mapped successfully, and what needs attention along the way? The file's storage path? Field types? Field order? The separator between fields?

5.2.2 Process 

Create a structured data file user.txt with the following contents and upload it to the HDFS root directory:

(base) [root@hadoop01 /bigdata/apache-hive-3.1.2]# vim user.txt
1,zhangsan,18,beijing
2,lisi,25,shanghai
3,allen,30,shanghai
4,woon,15,nanjing
5,james,45,hangzhou
6,tony,26,beijing

(base) [root@hadoop01 /bigdata/apache-hive-3.1.2]# hadoop fs -put user.txt /

Create a table t_user in Hive. Note: the field order and field types must be consistent with the fields in the file.

create table t_user(id int,name varchar(255),age int,city varchar(255));
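To see which HDFS directory Hive associates with the new table (used in the verification below), you can inspect the table's Location line:

-- prints table metadata, including a Location line such as /user/hive/warehouse/test.db/t_user
describe formatted t_user;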

5.2.3 Verification 

        Running a query shows no data in the table. Conjecture 1: does the data file need to be placed in the HDFS path corresponding to the table? Move it there with an HDFS command.

(base) [root@hadoop01 /bigdata/apache-hive-3.1.2]# hadoop fs -mv /user.txt /user/hive/warehouse/test.db/t_user
(base) [root@hadoop01 /bigdata/apache-hive-3.1.2]# hadoop fs -ls /user/hive/warehouse/test.db/t_user
Found 1 items
-rw-r--r--   3 root supergroup        117 2023-09-20 15:28 /user/hive/warehouse/test.db/t_user/user.txt

Run the query again; now every column displays as null:

        So the table is aware of the structured file but does not parse its contents correctly. Conjecture 2: do we also need to specify the separator between the fields in the file? Create a new table and specify the delimiter.

-- create-table statement with a delimiter clause
create table t_user_1(id int,name varchar(255),age int,city varchar(255))
row format delimited
fields terminated by ',';

# upload user.txt from the local file system to hdfs
hadoop fs -put user.txt /user/hive/warehouse/test.db/t_user_1/

-- run the query
select * from t_user_1;

Now create one more table, keeping the delimiter clause, but deliberately declare field types inconsistent with the file.

-- create-table statement with a delimiter clause
create table t_user_2(id int,name int,age varchar(255),city varchar(255))
row format delimited
fields terminated by ',';

# upload user.txt from the local file system to hdfs
hadoop fs -put user.txt /user/hive/warehouse/test.db/t_user_2/

-- run the query
select * from t_user_2;

        This time, some columns display null while others display normally. The name field in the file is a string, but the table declares it as int, so the type conversion fails; age in the file is numeric, and declaring it as a string still converts successfully. Note that Hive has built-in type conversion, but the conversion is not guaranteed to succeed.

5.2.4 Conclusion 

To successfully create a table in Hive that maps a structured file, pay attention to the following:

  • When creating the table, the field order and field types must be consistent with the file.

  • If the types are inconsistent, Hive tries to convert them, but success is not guaranteed; failed conversions display as null.

  • The file apparently has to be placed in the HDFS directory corresponding to the Hive table. Can other paths work? Worth exploring (see the sketch after this list).

  • When creating the table, the delimiter apparently must be specified to match the file content. Is it okay not to specify one? Worth exploring.
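On the path question: rather than moving files with hadoop fs -mv, the idiomatic way to bring a file into a table's directory is Hive's LOAD DATA statement. A minimal sketch against the t_user_1 table above (the local path is a placeholder):

-- copy a file from the local file system into the table's HDFS directory
load data local inpath '/root/user.txt' into table t_user_1;
-- or move a file that is already on HDFS
load data inpath '/user.txt' into table t_user_1;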

5.3 Experience 3: Is Hive a good fit for small data analysis?

5.3.1 Background 

        Because Hive stores its files on HDFS, it can in theory support data storage at very large scale and is naturally suited to big data analysis. But if the data in Hive is small, how efficient is analysis with Hive?

5.3.2 Process 

We created the table t_user_1 earlier; now use Hive SQL to find out how many of the people in it are older than 20.

5.3.3 Verification 

-- run the query
select count(*) from t_user_1 where age > 20;

Once again, the query is executed as a MapReduce program, even for this tiny data set.

5.3.4 Conclusion 

  • Under the hood, Hive really does process data through the MapReduce execution engine.

  • Executing a MapReduce program has a long startup latency.

  • For a small data set, analyzing it with Hive is not worth the overhead; the latency is high.

  • For a large data set, Hive and its underlying distributed MapReduce computation are a great fit.

Previous article: Getting Started with Apache Hive_Stars.Sky’s Blog-CSDN Blog 

Origin blog.csdn.net/weixin_46560589/article/details/133036234