1. Introduction to Hive Basics
1. Basic description
Hive is a data warehouse tool built on Hadoop, used for data extraction, transformation, and loading (ETL). It can query, analyze, and manage large-scale data stored in Hadoop. Hive maps structured data files to database tables and provides an SQL query capability: SQL statements are converted into MapReduce jobs for execution. The cost of use is low, since fast MapReduce statistics can be produced with SQL-like statements and no dedicated MapReduce application needs to be written. Hive is therefore well suited to statistical analysis in a data warehouse.
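The "file mapped to a table" idea can be shown with a short HiveQL sketch; the table name, HDFS path, and delimiter below are illustrative assumptions, not taken from this article:

```sql
-- Map an existing comma-delimited file on HDFS to a queryable table.
-- Table name, schema, and LOCATION are hypothetical examples.
CREATE EXTERNAL TABLE user_log (
  id INT,
  name STRING,
  age INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/user_log';

-- Hive compiles this SQL-like statement into a MapReduce job.
SELECT age, count(*) FROM user_log GROUP BY age;
```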
2. Composition and structure
User interface: the client, including the CLI, JDBC access to Hive, and the WebUI for browser access.
Metadata: Hive stores its metadata in a relational database such as MySQL or Derby. The metadata includes table names, the columns and partitions of each table and their attributes, table properties (such as whether a table is external), and the directory where the table's data resides.
Driver: made up of an interpreter, a compiler, and an optimizer, it takes an HQL statement through lexical analysis, syntax analysis, compilation, optimization, and query plan generation.
Execution engine: converts the logical execution plan into a physical plan that can actually run.
Hadoop layer: storage on HDFS, computation with MapReduce, and scheduling via YARN.
Hive receives an interactive request from the client, takes the operation instruction (SQL), translates it into MapReduce jobs, submits them to Hadoop for execution, and finally returns the execution result to the client.
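The translate-to-MapReduce step can be pictured with a small, self-contained Python sketch (illustrative only, not Hive's actual code): a GROUP BY count becomes a map phase that emits (key, 1) pairs, a shuffle that groups them by key, and a reduce phase that sums per key.

```python
# Minimal sketch (not Hive code): what "SELECT name, count(*) GROUP BY name"
# compiles to conceptually - a map phase emitting (key, 1) pairs, a shuffle
# grouping by key, and a reduce phase summing the counts.
from collections import defaultdict

rows = [("test-user", 23), ("dev-user", 25), ("test-user", 31)]

# Map: emit (name, 1) for each row
mapped = [(name, 1) for name, _age in rows]

# Shuffle: group emitted values by key
groups = defaultdict(list)
for key, value in mapped:
    groups[key].append(value)

# Reduce: sum per key
result = {key: sum(values) for key, values in groups.items()}
print(result)  # {'test-user': 2, 'dev-user': 1}
```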
2. Hive environment installation
1. Prepare the installation package
hive-1.2, which depends on the Hadoop cluster environment and is installed on the hop01 server.
2. Unzip and rename
tar -zxvf apache-hive-1.2.1-bin.tar.gz
mv apache-hive-1.2.1-bin/ hive1.2
3. Modify the configuration file
Create configuration file
[root@hop01 conf]# pwd
/opt/hive1.2/conf
[root@hop01 conf]# mv hive-env.sh.template hive-env.sh
Add content
[root@hop01 conf]# vim hive-env.sh
export HADOOP_HOME=/opt/hadoop2.7
export HIVE_CONF_DIR=/opt/hive1.2/conf
The two entries configure the Hadoop installation path and the Hive configuration file path.
4. Hadoop configuration
First start HDFS and YARN; then create the /tmp and /user/hive/warehouse directories on HDFS and grant group write permission.
bin/hadoop fs -mkdir /tmp
bin/hadoop fs -mkdir -p /user/hive/warehouse
bin/hadoop fs -chmod g+w /tmp
bin/hadoop fs -chmod g+w /user/hive/warehouse
5. Start Hive
[root@hop01 hive1.2]# bin/hive
6. Basic operations
View database
hive> show databases;
Select database
hive> use default;
View tables
hive> show tables;
Create and use a database
hive> create database mytestdb;
hive> show databases;
default
mytestdb
hive> use mytestdb;
Create table
hive> create table hv_user (id int, name string, age int);
View table structure
hive> desc hv_user;
id int
name string
age int
Add table data
hive> insert into hv_user values (1, "test-user", 23);
Query table data
hive> select * from hv_user;
Note: watching the query log makes the steps Hive executes clearly visible.
Delete table
hive> drop table hv_user;
Exit Hive
hive> quit;
View the Hadoop directory
# hadoop fs -ls /user/hive/warehouse
/user/hive/warehouse/mytestdb.db
The database and data created by Hive are stored on HDFS.
3. Integrate the MySQL 5.7 environment
Here MySQL 5.7 is assumed to be installed, the relevant login account configured, and the Host of the root user set to % (allowing remote access).
1. Upload the MySQL driver package
Upload the MySQL driver dependency package to the lib directory of the hive installation directory.
[root@hop01 lib]# pwd
/opt/hive1.2/lib
[root@hop01 lib]# ll
mysql-connector-java-5.1.27-bin.jar
2. Create hive-site configuration
[root@hop01 conf]# pwd
/opt/hive1.2/conf
[root@hop01 conf]# touch hive-site.xml
[root@hop01 conf]# vim hive-site.xml
3. Configure MySQL storage
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://hop01:3306/metastore?createDatabaseIfNotExist=true</value>
<description>JDBC connect string for a JDBC metastore</description>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.jdbc.Driver</value>
<description>Driver class name for a JDBC metastore</description>
</property>
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>root</value>
<description>username to use against metastore database</description>
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>123456</value>
<description>password to use against metastore database</description>
</property>
</configuration>
After the configuration is complete, restart MySQL, Hadoop, and Hive in turn, then check the MySQL instance: a new metastore database and its related tables now exist.
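Optionally, instead of relying on createDatabaseIfNotExist to build the schema on first use, the metastore tables can be created explicitly with the schematool utility that ships with Hive. A sketch, run from the Hive home directory and assuming the hive-site.xml above is in place:

```shell
# Initialize the metastore schema in MySQL explicitly
bin/schematool -dbType mysql -initSchema
# Verify the recorded schema version afterwards
bin/schematool -dbType mysql -info
```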
4. Start hiveserver2 in the background
[root@hop01 hive1.2]# bin/hiveserver2 &
5. JDBC connection test
[root@hop01 hive1.2]# bin/beeline
Beeline version 1.2.1 by Apache Hive
beeline> !connect jdbc:hive2://hop01:10000
Connecting to jdbc:hive2://hop01:10000
Enter username for jdbc:hive2://hop01:10000: root (type the account, then press Enter)
Enter password for jdbc:hive2://hop01:10000: ****** (type the password 123456, then press Enter)
Connected to: Apache Hive (version 1.2.1)
Driver: Hive JDBC (version 1.2.1)
0: jdbc:hive2://hop01:10000> show databases;
+----------------+--+
| database_name |
+----------------+--+
| default |
+----------------+--+
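For scripting, the same connection can be made without the interactive prompts; the following one-liner is a sketch using Beeline's standard -u/-n/-p/-e options with the account from hive-site.xml:

```shell
# Connect, run one statement, and exit
bin/beeline -u jdbc:hive2://hop01:10000 -n root -p 123456 -e "show databases;"
```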
4. Advanced query syntax
1. Basic aggregate functions
select count(*) count_user from hv_user;
select sum(age) sum_age from hv_user;
select min(age) min_age,max(age) max_age from hv_user;
+----------+----------+--+
| min_age | max_age |
+----------+----------+--+
| 23 | 25 |
+----------+----------+--+
2. Conditional queries
select * from hv_user where name='test-user' limit 1;
+-------------+---------------+--------------+--+
| hv_user.id | hv_user.name | hv_user.age |
+-------------+---------------+--------------+--+
| 1 | test-user | 23 |
+-------------+---------------+--------------+--+
select * from hv_user where id>1 AND name like 'dev%';
+-------------+---------------+--------------+--+
| hv_user.id | hv_user.name | hv_user.age |
+-------------+---------------+--------------+--+
| 2 | dev-user | 25 |
+-------------+---------------+--------------+--+
select count(*) count_name,name from hv_user group by name;
+-------------+------------+--+
| count_name | name |
+-------------+------------+--+
| 1 | dev-user |
| 1 | test-user |
+-------------+------------+--+
3. Join queries
select t1.*,t2.* from hv_user t1 join hv_dept t2 on t1.id=t2.dp_id;
+--------+------------+---------+-----------+-------------+--+
| t1.id | t1.name | t1.age | t2.dp_id | t2.dp_name |
+--------+------------+---------+-----------+-------------+--+
| 1      | test-user  | 23      | 1         | Tech Dept   |
+--------+------------+---------+-----------+-------------+--+
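Conceptually, the inner join above behaves like a hash join: build a hash table on one side's join key, then probe it with the rows of the other side. A minimal Python illustration with assumed sample data (not Hive's actual engine code):

```python
# Illustration (not Hive code): the inner join hv_user JOIN hv_dept
# ON id = dp_id, expressed as a build-and-probe hash join.
users = [(1, "test-user", 23), (2, "dev-user", 25)]  # (id, name, age)
depts = [(1, "Tech Dept")]                            # (dp_id, dp_name)

# Build phase: index the smaller table by its join key
dept_by_id = {dp_id: dp_name for dp_id, dp_name in depts}

# Probe phase: keep only user rows whose id matches some dp_id
joined = [
    (uid, name, age, uid, dept_by_id[uid])
    for uid, name, age in users
    if uid in dept_by_id
]
print(joined)  # [(1, 'test-user', 23, 1, 'Tech Dept')]
```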
5. Source code address
GitHub address
https://github.com/cicadasmile/big-data-parent
Gitee address
https://gitee.com/cicadasmile/big-data-parent