Table of contents
1. Apache Hive metadata
1.1 Hive Metadata
1.2 Hive Metastore
2. Three configuration methods of Metastore
2.1 Embedded mode
2.2 Local mode
2.3 Remote mode
3. Hive deployment in practice
3.1 Preparation before installation
3.2 Integration of Hadoop and Hive
3.3 Remote mode installation
3.3.1 Install MySQL
3.3.2 Hive installation
3.4 Start Hive service
3.4.1 Foreground startup (not recommended)
3.4.2 Background startup (recommended)
4. Use of Apache Hive clients
4.1 bin/hive, bin/beeline
4.2 HiveServer and HiveServer2 services
4.3 Sorting out the relationships
4.4 bin/hive client usage
4.5 bin/beeline client usage
5. First experience with Apache Hive
5.1 Experience 1: Is Hive similar to MySQL?
5.2 Experience 2: How can Hive map structured data into tables?
5.3 Experience 3: How about using Hive for small data analysis?
1. Apache Hive metadata
1.1 Hive Metadata
Metadata, also known as intermediary or relay data, is data that describes other data (data about data). It mainly describes the properties of data and is used to support functions such as indicating storage location, recording history, searching resources, and keeping file records.
Hive metadata is the metadata of Hive itself. It includes the databases, tables, table locations, types, properties, field order and types, and other meta-information created with Hive. This metadata is stored in a relational database, such as Hive's built-in Derby or a third-party RDBMS such as MySQL.
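To make this concrete, the sketch below models in plain Python the kind of information the metastore keeps for one table. It is purely illustrative: the class and field names are our own simplification, not the metastore's actual schema, and the HDFS path is hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class ColumnMeta:
    name: str      # column name, e.g. "id"
    col_type: str  # Hive type, e.g. "int", "varchar(255)"

@dataclass
class TableMeta:
    database: str   # database the table belongs to
    table: str      # table name
    location: str   # where the table's data files live on HDFS
    columns: list = field(default_factory=list)  # ordered column list
    field_delim: str = "\001"  # Hive's default field separator

# Illustrative entry for the t_user table used later in this article
t_user = TableMeta(
    database="test",
    table="t_user",
    location="hdfs://hadoop01:8020/user/hive/warehouse/test.db/t_user",
    columns=[ColumnMeta("id", "int"), ColumnMeta("name", "varchar(255)")],
)
print(t_user.location)
```

When a query touches `t_user`, Hive asks the metastore for exactly this kind of record to learn where the files are and how to interpret their columns.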
1.2 Hive Metastore
Metastore is the metadata service. Its job is to manage the metadata and expose a service address, so that various clients access the metadata by connecting to the Metastore service, while the Metastore in turn connects to the MySQL database.
With the metastore service, multiple clients can connect at the same time, and these clients do not need to know the username and password of the MySQL database; they only need the address of the metastore service. To some extent, this also protects the security of Hive's metadata.
2. Three configuration methods of Metastore
The metastore service can be configured in three modes: embedded mode, local mode, and remote mode. The key to telling the three apart is to answer two questions:
- Does the Metastore service need to be configured and started separately?
- Is the metadata stored in the built-in Derby or in a third-party RDBMS such as MySQL?
This article uses the mode recommended for enterprise use: remote mode deployment.
2.1 Embedded mode
Embedded mode (Embedded Metastore) is the default deployment mode of the metastore. In this mode, metadata is stored in the built-in Derby database, and both Derby and the metastore service are embedded in the main HiveServer process: when the HiveServer process starts, Derby and the metastore start with it, and no separate Metastore service is required. However, only one active user is supported at a time, so this mode is suitable for trying Hive out, not for production.
2.2 Local mode
In local mode (Local Metastore), the Metastore service runs in the same process as the main HiveServer process, but the database that stores the metadata runs in a separate process, possibly on a separate host. The metastore communicates with the metastore database over JDBC.
Local mode stores metadata in an external database; MySQL is recommended. Hive decides the mode from the hive.metastore.uris parameter: if it is empty, local mode is used. The drawback is that every time a Hive service starts, it embeds its own metastore.
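The rule just described — empty hive.metastore.uris means an embedded metastore, non-empty means connecting to a remote one — can be sketched as a tiny Python function. This is our own simplification for illustration, not Hive's actual source code:

```python
def metastore_mode(conf: dict) -> str:
    """Simplified sketch of how Hive picks the metastore mode
    from hive.metastore.uris (not Hive's real implementation)."""
    uris = conf.get("hive.metastore.uris", "").strip()
    if uris:
        # Non-empty: remote mode, connect to the separately started service
        return f"remote -> {uris}"
    # Empty or unset: embed a metastore in the current process
    return "embedded metastore in current process"

print(metastore_mode({"hive.metastore.uris": "thrift://hadoop01:9083"}))
print(metastore_mode({}))
```

The first call corresponds to the remote deployment configured later in this article; the second corresponds to embedded/local mode.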
2.3 Remote mode
In remote mode (Remote Metastore), the Metastore service runs in its own separate JVM, not inside the HiveServer JVM. Other processes that wish to talk to the Metastore server communicate with it through the Thrift network API.
In remote mode, you must configure the hive.metastore.uris parameter to point at the IP and port where the metastore service runs, and you must start the metastore service manually and separately. Metadata is again stored in an external database; MySQL is recommended.
In a production environment, it is recommended to configure the Hive Metastore in remote mode. Other software that depends on Hive can then access Hive through the Metastore. This also gives better manageability and security, since the database layer can be completely shielded.
3. Hive deployment in practice
3.1 Preparation before installation
Since Apache Hive is data-warehouse software built on top of Hadoop, it is usually deployed and run on Linux. Therefore, whichever way you configure the Hive Metastore, you must first make sure the basic server environment is in order and the Hadoop cluster is healthy and available.
Basic server environment: cluster time synchronization, firewall disabled, host mapping, passwordless login, JDK installed, and so on.
Healthy Hadoop cluster: Hadoop must be started before Hive. In particular, note that you need to wait until HDFS leaves safe mode before starting Hive.
Hive itself is not installed or run in a distributed manner; its distributed capabilities, including distributed storage and distributed computing, are provided by Hadoop.
This time, Hive is deployed on the Hadoop cluster: Hadoop YARN HA cluster installation and deployment detailed graphic tutorial_Stars.Sky's blog-CSDN blog
3.2 Integration of Hadoop and Hive
Because Hive stores its data on HDFS and processes it with MapReduce as the execution engine, the relevant proxy-user properties must be added to Hadoop so that Hive can run on top of it. Modify core-site.xml in Hadoop, synchronize the configuration file across the Hadoop cluster, and restart the cluster for it to take effect.
(base) [root@hadoop01 ~]# cd /bigdata/hadoop/server/hadoop-3.2.4/etc/hadoop/
(base) [root@hadoop01 /bigdata/hadoop/server/hadoop-3.2.4/etc/hadoop]# vim core-site.xml
<!-- Integrate Hive -->
<property>
<name>hadoop.proxyuser.root.hosts</name>
<value>*</value>
</property>
<property>
<name>hadoop.proxyuser.root.groups</name>
<value>*</value>
</property>
3.3 Remote mode installation
Remote mode has two defining characteristics:
- MySQL must be installed to store the Hive metadata;
- The Metastore service must be configured and started separately by hand.
3.3.1 Install MySQL
This time I installed MySQL 5.7 version: Linux Deployment JDK+MySQL+Tomcat Detailed Process_Tomcat on Linux Connecting to mysql_Stars.Sky's Blog-CSDN Blog
3.3.2 Hive installation
Hive official download address: Index of /hive/hive-3.1.2
mysql-connector-java-5.1.32.jar Official download address: MySQL :: Download MySQL Connector/J (Archived Versions)
Third-party download address: mysql-connector-java-5.1.32.jar download and Maven, Gradle introduction code, pom file and class in the package - Times Java
#1. Extract the installation package
(base) [root@hadoop01 ~]# tar -zxvf apache-hive-3.1.2-bin.tar.gz -C /bigdata/
(base) [root@hadoop01 ~]# mv /bigdata/apache-hive-3.1.2-bin/ /bigdata/apache-hive-3.1.2
#2. Resolve the Guava version conflict between Hadoop and Hive
(base) [root@hadoop01 ~]# cd /bigdata/apache-hive-3.1.2/
(base) [root@hadoop01 /bigdata/apache-hive-3.1.2]# rm -rf lib/guava-19.0.jar
(base) [root@hadoop01 /bigdata/apache-hive-3.1.2]# cp /bigdata/hadoop/server/hadoop-3.2.4/share/hadoop/common/lib/guava-27.0-jre.jar ./lib/
#3. Add the MySQL JDBC driver to the lib/ directory of the Hive installation
(base) [root@hadoop01 /bigdata/apache-hive-3.1.2/lib]# pwd
/bigdata/apache-hive-3.1.2/lib
(base) [root@hadoop01 /bigdata/apache-hive-3.1.2/lib]# mv mysql-connector-java-5.1.32-bin.jar mysql-connector-java-5.1.32.jar
(base) [root@hadoop01 /bigdata/apache-hive-3.1.2/lib]# ls mysql-connector-java-5.1.32.jar
mysql-connector-java-5.1.32.jar
#4. Edit the Hive environment file and add HADOOP_HOME
(base) [root@hadoop01 ~]# cd /bigdata/apache-hive-3.1.2/conf/
(base) [root@hadoop01 /bigdata/apache-hive-3.1.2/conf]# mv hive-env.sh.template hive-env.sh
(base) [root@hadoop01 /bigdata/apache-hive-3.1.2/conf]# vim hive-env.sh
export HADOOP_HOME=/bigdata/hadoop/server/hadoop-3.2.4
export HIVE_CONF_DIR=/bigdata/apache-hive-3.1.2/conf
export HIVE_AUX_JARS_PATH=/bigdata/apache-hive-3.1.2/lib
#5. Create hive-site.xml and configure MySQL and other settings
(base) [root@hadoop01 /bigdata/apache-hive-3.1.2/conf]# vim hive-site.xml
<configuration>
<!-- MySQL configuration for metadata storage -->
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://hadoop01:3306/hive?createDatabaseIfNotExist=true&amp;useSSL=false&amp;useUnicode=true&amp;characterEncoding=UTF-8</value>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.jdbc.Driver</value>
</property>
<!-- MySQL user -->
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>root</value>
</property>
<!-- MySQL password -->
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>qwe123456</value>
</property>
<!-- Host that HiveServer2 binds to -->
<property>
<name>hive.server2.thrift.bind.host</name>
<value>hadoop01</value>
</property>
<!-- Metastore service address for remote mode -->
<property>
<name>hive.metastore.uris</name>
<value>thrift://hadoop01:9083</value>
</property>
<!-- Disable metastore event DB notification API authentication -->
<property>
<name>hive.metastore.event.db.notification.api.auth</name>
<value>false</value>
</property>
<!-- Disable metastore schema version verification -->
<property>
<name>hive.metastore.schema.verification</name>
<value>false</value>
</property>
</configuration>
#6. Initialize the metadata; on success, 74 tables are created in MySQL
(base) [root@hadoop01 ~]# cd /bigdata/apache-hive-3.1.2/
(base) [root@hadoop01 /bigdata/apache-hive-3.1.2]# bin/schematool -initSchema -dbType mysql -verbose
3.4 Start Hive service
Start the Hadoop cluster first!
3.4.1 Foreground startup (not recommended)
Stop with Ctrl+C.
(base) [root@hadoop01 ~]# /bigdata/apache-hive-3.1.2/bin/hive --service metastore
# Foreground startup with debug logging enabled
/bigdata/apache-hive-3.1.2/bin/hive --service metastore --hiveconf hive.root.logger=DEBUG,console
3.4.2 Background startup (recommended)
Stop by finding the process ID with jps and terminating it with kill.
(base) [root@hadoop01 /bigdata/apache-hive-3.1.2]# nohup /bigdata/apache-hive-3.1.2/bin/hive --service metastore &
(base) [root@hadoop01 /bigdata/apache-hive-3.1.2]# jps
3347 QuorumPeerMain
3830 DataNode
5018 NodeManager
3643 NameNode
4827 ResourceManager
4412 DFSZKFailoverController
5676 Jps
4157 JournalNode
5549 RunJar
4. Use of Apache Hive clients
4.1 bin/hive、bin/beeline
Over its development, Hive has gone through two generations of client tools.
First-generation client (deprecated): $HIVE_HOME/bin/hive, a shell utility. Its main functions are, first, to run Hive queries in interactive or batch mode, and second, to start Hive-related services such as the metastore service.
Second-generation client (recommended): $HIVE_HOME/bin/beeline, a JDBC client and the Hive command-line tool strongly recommended by the project. Compared with the first-generation client, it offers better performance and improved security.
Beeline works in both embedded and remote modes. In embedded mode, it runs an embedded Hive (similar to the Hive client); in remote mode, beeline connects through Thrift to a separate HiveServer2 service, which is also the officially recommended way to use it in production. So the question is: what is HiveServer2, and where did HiveServer1 go?
4.2 HiveServer and HiveServer2 services
HiveServer and HiveServer2 are two services that ship with Hive. They allow clients to operate on data in Hive without starting the CLI (command line), and both let remote clients submit requests to Hive and retrieve results from multiple programming languages such as Java and Python.
However, HiveServer cannot handle concurrent requests from more than one client. The HiveServer code was therefore rewritten in Hive 0.11.0 as HiveServer2, which solves this problem; HiveServer has since been deprecated. HiveServer2 supports multi-client concurrency and authentication, and aims to provide better support for open-API clients such as JDBC and ODBC.
4.3 Sorting out the relationships
HiveServer2 reads and writes metadata through the Metastore service. Therefore, in remote mode, the metastore service must be started before HiveServer2.
Special note: in remote mode, the Beeline client can only access Hive through the HiveServer2 service, while bin/hive accesses Hive directly through the Metastore service. The relationship is as follows:
4.4 Bin/hive client usage
The bin directory of the Hive installation contains bin/hive, the first-generation client provided by Hive. This client accesses Hive's metastore service in order to operate on Hive.
Friendly reminder: with a remote-mode deployment, you must start the metastore service manually. In embedded or local mode, just run bin/hive and the metastore service is started inline. On the machine where the Hive metastore service runs, you can use the bin/hive client directly with no extra configuration.
To access the Hive metastore service through bin/hive from another machine, simply scp the Hive installation directory from hadoop01 to that machine and add the metastore service address to its hive-site.xml configuration.
(base) [root@hadoop01 /bigdata]# scp -r apache-hive-3.1.2 hadoop02:$PWD
# Run the following on hadoop02
(base) [root@hadoop02 ~]# vim /bigdata/apache-hive-3.1.2/conf/hive-site.xml
<configuration>
<property>
<name>hive.metastore.uris</name>
<value>thrift://hadoop01:9083</value>
</property>
</configuration>
# Start the metastore service
(base) [root@hadoop01 /bigdata/apache-hive-3.1.2]# nohup /bigdata/apache-hive-3.1.2/bin/hive --service metastore &
# Connect with the client on hadoop02
(base) [root@hadoop02 ~]# /bigdata/apache-hive-3.1.2/bin/hive
4.5 bin/beeline client use
As Hive developed, it introduced the second-generation client, beeline. The beeline client does not access the metastore service directly; instead, it requires the hiveserver2 service to be started separately. On the server where Hive is installed, start the metastore service first, and then start the hiveserver2 service.
# Start the metastore service first, then the hiveserver2 service
nohup /bigdata/apache-hive-3.1.2/bin/hive --service metastore &
nohup /bigdata/apache-hive-3.1.2/bin/hive --service hiveserver2 &
Use the beeline client on hadoop02 to connect. Note that after the hiveserver2 service starts, it takes a while before it can serve requests.
Beeline is a JDBC client that talks to the HiveServer2 service over the JDBC protocol. The connection URL is: jdbc:hive2://hadoop01:10000
(base) [root@hadoop02 ~]# /bigdata/apache-hive-3.1.2/bin/beeline
beeline> !connect jdbc:hive2://hadoop01:10000
Connecting to jdbc:hive2://hadoop01:10000
Enter username for jdbc:hive2://hadoop01:10000: root
Enter password for jdbc:hive2://hadoop01:10000:
5. First experience with Apache Hive
5.1 Experience 1: Is Hive similar to MySQL?
5.1.1 Background
For someone new to Apache Hive, the biggest source of confusion is that, in terms of data model, Hive looks similar to the relational database MySQL, and Hive SQL is a SQL-like language as well. So how does it actually behave?
5.1.2 Process
Experiment steps: following MySQL habits, create and switch databases in Hive, create a table, insert data, and finally query whether the insertion succeeded.
-- Create a database
create database test;
-- List all databases
show databases;
-- Switch database
use test;
-- Create a table
create table t_student(id int,name varchar(255));
-- Insert one row
insert into table t_student values(1,"allen");
-- Query the table
select * from t_student;
When inserting the data, we found that the insertion was extremely slow and the SQL took a very long time to execute. Why?
In the end, one row was inserted, taking 98 seconds. Querying the table shows that the row was inserted successfully.
5.1.3 Verification
First, log in to Hadoop YARN and observe whether there are traces of MapReduce task execution. YARN Web UI: http://hadoop01:8088/cluster
Then log in to Hadoop HDFS to browse the file system. According to Hive's data model, the table data is ultimately stored in HDFS and in the folder corresponding to the table.
HDFS Web UI: http://hadoop01:9870/
5.1.4 Conclusion
- Hive SQL syntax is very similar to standard SQL, which greatly reduces the learning cost.
- Under the hood, Hive performs the insert through MapReduce, which is why it is slow.
- Inserting a large dataset row by row is therefore unrealistic; the cost would be prohibitive.
- Hive should have its own distinctive way of getting data into tables: mapping structured files to tables.
5.2 Experience 2: How can Hive map structured data into tables?
5.2.1 Background
In Hive, inserting data with insert+values statements runs through MapReduce underneath and is very inefficient. At this point we return to the essence of Hive: it maps structured data files to tables and provides SQL-based query analysis on those tables.
Given a structured data file, how can it be mapped successfully? What must we pay attention to for the mapping to succeed: the file storage path? Field types? Field order? The separator between fields?
5.2.2 Process
Create a structured data file user.txt with the following contents and upload it to the HDFS root directory:
(base) [root@hadoop01 /bigdata/apache-hive-3.1.2]# vim user.txt
1,zhangsan,18,beijing
2,lisi,25,shanghai
3,allen,30,shanghai
4,woon,15,nanjing
5,james,45,hangzhou
6,tony,26,beijing
(base) [root@hadoop01 /bigdata/apache-hive-3.1.2]# hadoop fs -put user.txt /
Create a table t_user in Hive. Note: the field order and types must match the fields in the file.
create table t_user(id int,name varchar(255),age int,city varchar(255));
5.2.3 Verification
Running a query shows no data in the table. Guess 1: does the data file need to be placed in the HDFS path that corresponds to the table? Move the data into the table's path with HDFS commands.
(base) [root@hadoop01 /bigdata/apache-hive-3.1.2]# hadoop fs -mv /user.txt /user/hive/warehouse/test.db/t_user
(base) [root@hadoop01 /bigdata/apache-hive-3.1.2]# hadoop fs -ls /user/hive/warehouse/test.db/t_user
Found 1 items
-rw-r--r-- 3 root supergroup 117 2023-09-20 15:28 /user/hive/warehouse/test.db/t_user/user.txt
Running the query again shows the following: every value is NULL.
It seems the table is aware of the structured file but does not correctly parse the data in it. Guess 2: do we also need to specify the separator between the fields in the file? Create a new table and specify the delimiter.
-- Create-table statement, with a delimiter clause added
create table t_user_1(id int,name varchar(255),age int,city varchar(255))
row format delimited
fields terminated by ',';
# Upload user.txt from the local filesystem to HDFS
hadoop fs -put user.txt /user/hive/warehouse/test.db/t_user_1/
-- Run the query
select * from t_user_1;
Now create yet another table, keeping the delimiter clause but deliberately making the field types inconsistent with the file.
-- Create-table statement, with a delimiter clause added
create table t_user_2(id int,name int,age varchar(255),city varchar(255))
row format delimited
fields terminated by ',';
# Upload user.txt from the local filesystem to HDFS
hadoop fs -put user.txt /user/hive/warehouse/test.db/t_user_2/
-- Run the query
select * from t_user_2;
This time, some columns show NULL while others display normally. The name field is a string; since the table declared it as int, the type conversion fails. The age field is numeric; since the table declared it as a string, the conversion succeeds. Note that Hive has built-in type conversion, but the conversion is not guaranteed to succeed.
5.2.4 Conclusion
To successfully create a table in Hive and map a structured file to it, pay attention to the following:
- When creating the table, the field order and field types must be consistent with the file.
- If the types are inconsistent, Hive tries to convert, but the conversion is not guaranteed to succeed; on failure, NULL is displayed.
- The file apparently has to be placed in the HDFS directory corresponding to the Hive table. Can other paths be used? Worth exploring.
- When creating the table, the delimiter apparently must match the file content. Is it okay not to specify it? Worth exploring.
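The two failure modes observed above — a non-matching delimiter yielding all NULLs, and an incompatible type yielding NULL in that column — can be reproduced with a small Python sketch of SerDe-style row parsing. This is our own simplification for illustration, not Hive's actual LazySimpleSerDe logic:

```python
def cast(value, col_type):
    """Lenient cast like Hive's: return None (shown as NULL) on failure."""
    try:
        if col_type == "int":
            return int(value)
        return value  # treat every other type as a string here
    except ValueError:
        return None

def map_row(line, schema, delim):
    """Split a line by the delimiter and cast fields in schema order;
    fields that are missing after the split become None (NULL)."""
    fields = line.rstrip("\n").split(delim)
    return [cast(fields[i], t) if i < len(fields) else None
            for i, (_, t) in enumerate(schema)]

line = "1,zhangsan,18,beijing"
schema_ok  = [("id", "int"), ("name", "str"), ("age", "int"), ("city", "str")]
schema_bad = [("id", "int"), ("name", "int"), ("age", "str"), ("city", "str")]

print(map_row(line, schema_ok, ","))     # correct schema and delimiter
print(map_row(line, schema_bad, ","))    # name as int fails -> None
print(map_row(line, schema_ok, "\001"))  # default delimiter, no match -> all None
```

The second call mirrors t_user_2 (name cast to int fails, age happily becomes a string), and the third mirrors the original t_user, where the default '\001' delimiter never matches a comma-separated line.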
5.3 Experience 3: How about using Hive for small data analysis?
5.3.1 Background
Because Hive stores its files on HDFS, it can in theory support very large-scale data storage and is naturally suited to big-data analysis. But if the data in Hive is small, how efficient is Hive for analysis?
5.3.2 Process
We created the table t_user_1 earlier; now use Hive SQL to find out how many users are older than 20.
5.3.3 Verification
-- Run the query
select count(*) from t_user_1 where age > 20;
We find that the query is again executed through a MapReduce program.
5.3.4 Conclusion
- Under the hood, Hive does process data through the MapReduce execution engine.
- Executing a MapReduce program takes a comparatively long time.
- For small datasets, analyzing with Hive is not worth it; the latency is high.
- For large datasets, Hive and its underlying distributed MapReduce computation work very well.
Previous article: Getting Started with Apache Hive_Stars.Sky’s Blog-CSDN Blog