Preface
Big data platform construction | Hadoop cluster construction (1)
1. Introduction
- Based on Hive 3.1.2 version
- Hive download address
- Running Hive depends on Hadoop 3.X
- Depends on a JDK 1.8 environment
2. Architecture
- In essence, Hive stores the mapping relationship (metadata) between HDFS files and tables/databases, and then provides a SQL interface so that file data can be accessed as if it were structured table data. It translates the SQL into jobs for a compute engine, which computes the query results.
Metadata (MetaStore):
The mapping relationship data between HDFS files and tables/databases. By default it is stored in the built-in Derby database; in practice it is usually configured to be stored in MySQL.
Driver:
- SQL Parser: converts the SQL string into an abstract syntax tree (AST), then parses the AST
- Compiler: compiles the AST into a logical execution plan
- Query Optimizer: optimizes the logical execution plan
- Executor: converts the logical execution plan into a runnable physical plan (such as MapReduce or Spark jobs)
Client:
Provides various ways to access Hive, such as the CLI (hive shell), JDBC/ODBC (Java access to Hive), and beeline
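To make the MetaStore idea above concrete, here is a toy Python sketch (purely illustrative, not Hive code; all names and paths are made up) modeling what the metadata amounts to: for each table, the HDFS location and column schema that let SQL be run over plain files.

```python
# Toy model of what the MetaStore holds: a mapping from
# database.table -> HDFS location + column schema.
# Illustration only; real Hive keeps this in Derby or MySQL.

metastore = {
    "default.student": {
        "location": "hdfs://hadoop300:8020/warehouse/student",
        "columns": [("id", "int"), ("name", "string")],
    },
}

def resolve_table(qualified_name):
    """Look up where a table's data lives, as the metastore service would."""
    meta = metastore.get(qualified_name)
    if meta is None:
        raise KeyError(f"Table not found: {qualified_name}")
    return meta["location"], meta["columns"]

location, columns = resolve_table("default.student")
print(location)                 # the HDFS directory a query engine would scan
print([name for name, _ in columns])
```

With this mapping in hand, a SQL engine can translate `select name from student` into a scan of the files under that HDFS directory.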
3. Server planning
- The Hive client can be deployed on whichever servers need it, and each deployment can act as a server or as a client. As a client, it does not start the metastore service (or hiveServer2 service) itself but connects to those services on another server. For example, if hadoop300 starts the metastore service, then hadoop301 and hadoop302 only need to configure the address of that metastore service (e.g. thrift://hadoop300:9083) to access Hive.

| | Hadoop300 | Hadoop301 | Hadoop302 |
|---|---|---|---|
| hive | V | | |
4. How to access Hive
- Accessing Hive essentially means accessing the metadata stored in MySQL
# The three access flows
1. Hive client ----> MySQL (metadata)
2. Hive client ----> metastore service -----> MySQL (metadata)
3. Hive client ----> hiveServer2 service -----> metastore service -----> MySQL (metadata)
1. MySQL direct connection
- For a direct connection, you only need to configure the Hive client with the information of the MySQL instance where the metadata is located.
- If neither the metastore service nor the hiveServer2 service is configured, the direct connection is used by default: Hive's shell client can access the metadata directly, without starting the metastore or hiveServer2 services.
- This method is suitable for local access: no extra services need to be started, though each client must hold the MySQL connection information.
2. Metastore (metadata service) method
- The metastore is a Thrift service; you must start it manually, and clients then connect to it to access Hive.
- By running a metastore service in front of MySQL (the metadata store), the MySQL connection details are hidden: clients first connect to the metastore service, which in turn connects to MySQL to fetch the metadata.
- If the hive.metastore.uris parameter is configured, this method is used.
- It is mainly responsible for access to the metadata, i.e. table structure and database information.
3. hiveServer2 service method
- hiveServer2 is a Thrift service; you must start it manually, and clients then connect to it to access Hive.
- It is another service started on top of the metastore service.
- It is mainly responsible for access to the actual table data in Hive, e.g. remote access from Python or Java; the beeline client also accesses data through hiveServer2.
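The three access paths above can be summarized in a small Python sketch (purely illustrative; the parameter name hive.metastore.uris is real, but the helper function is hypothetical). It mirrors the rules described: with hive.metastore.uris configured the client routes through the metastore service, and JDBC/beeline clients additionally go through hiveServer2 first.

```python
def access_chain(conf, via_hiveserver2=False):
    """Return the hops a client takes to reach the metadata in MySQL.

    conf: a dict standing in for the client's Hive configuration.
    via_hiveserver2: True for beeline/JDBC-style clients.
    """
    chain = ["Hive client"]
    if via_hiveserver2:
        chain.append("hiveServer2 service")
    if conf.get("hive.metastore.uris"):
        chain.append("metastore service")
    chain.append("MySQL (metadata)")
    return chain

# 1. Direct connection: no metastore URI configured
print(access_chain({}))
# 2. Metastore method
print(access_chain({"hive.metastore.uris": "thrift://hadoop300:9083"}))
# 3. hiveServer2 method (e.g. beeline)
print(access_chain({"hive.metastore.uris": "thrift://hadoop300:9083"},
                   via_hiveserver2=True))
```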
5. Installation
Download and unzip
[hadoop@hadoop300 app]$ pwd
/home/hadoop/app
drwxrwxr-x. 12 hadoop hadoop 166 2月 22 00:08 manager
lrwxrwxrwx 1 hadoop hadoop 47 2月 21 12:33 hadoop -> /home/hadoop/app/manager/hadoop_mg/hadoop-3.1.3
lrwxrwxrwx 1 hadoop hadoop 54 2月 22 00:04 hive -> /home/hadoop/app/manager/hive_mg/apache-hive-3.1.2-bin
Add hive environment variables
- Edit the ~/.bash_profile file (vim ~/.bash_profile) and add:
# ================== Hive ==============
export HIVE_HOME=/home/hadoop/app/hive
export PATH=$PATH:$HIVE_HOME/bin
hive configuration
1. Modify the ${HIVE_HOME}/conf/hive-site.xml file
- If this file does not exist, create it by copying ${HIVE_HOME}/conf/hive-default.xml.template
- This mainly configures how and where Hive's metadata is stored. The default is the built-in Derby database; here it is stored in MySQL instead, so the MySQL connection properties must be configured, along with the metastore and hiveServer2 services.
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<!-- 1. Startup address of the metastore service (multiple addresses can be configured, comma-separated) -->
<property>
<name>hive.metastore.uris</name>
<value>thrift://hadoop300:9083</value>
</property>
<!-- 2. MySQL URL for metadata storage; the metadata is kept in the Hadoops_Hive database.
Note that "&" must be escaped as "&amp;" inside XML. -->
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://www.burukeyou.com:3306/Hadoops_Hive?createDatabaseIfNotExist=true&amp;useUnicode=true&amp;characterEncoding=UTF-8&amp;useSSL=false</value>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.jdbc.Driver</value>
</property>
<!-- JDBC connection username -->
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>root</value>
</property>
<!-- JDBC connection password -->
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>123456</value>
</property>
<!-- 3. Hive's working directory on HDFS;
the default warehouse path is /user/hive/warehouse -->
<property>
<name>hive.metastore.warehouse.dir</name>
<value>/warehouse</value>
</property>
<!-- 4. hiveServer2 listen port -->
<property>
<name>hive.server2.thrift.port</name>
<value>10000</value>
</property>
<!-- hiveServer2 bind host -->
<property>
<name>hive.server2.thrift.bind.host</name>
<value>hadoop300</value>
</property>
<!-- 5. Disable verification of the metastore schema version -->
<property>
<name>hive.metastore.schema.verification</name>
<value>false</value>
</property>
<!-- 6. Show column headers for select results in the Hive CLI -->
<property>
<name>hive.cli.print.header</name>
<value>true</value>
</property>
<!-- 7. Show the current database in the Hive CLI prompt -->
<property>
<name>hive.cli.print.current.db</name>
<value>true</value>
</property>
</configuration>
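An unescaped `&` in the JDBC URL is a common cause of hive-site.xml parse failures, since XML requires it to be written as `&amp;`. A quick Python check like the following (a hypothetical helper, not part of Hive; the fragment below is a cut-down version of the config above) confirms the file is well-formed and shows the configured connection URL:

```python
import xml.etree.ElementTree as ET

# Minimal hive-site.xml fragment. The & characters in the JDBC URL
# are escaped as &amp; -- an unescaped & would make parsing fail.
HIVE_SITE = """<?xml version="1.0" encoding="UTF-8"?>
<configuration>
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://www.burukeyou.com:3306/Hadoops_Hive?createDatabaseIfNotExist=true&amp;useSSL=false</value>
</property>
</configuration>
"""

def read_properties(xml_text):
    """Parse a Hadoop-style configuration file into a name -> value dict."""
    root = ET.fromstring(xml_text)  # raises ParseError on malformed XML
    return {p.findtext("name"): p.findtext("value") for p in root.findall("property")}

props = read_properties(HIVE_SITE)
print(props["javax.jdo.option.ConnectionURL"])
```

After parsing, the entity is decoded back to a plain `&`, so the URL handed to the JDBC driver is the one you intended. To check a real file, replace the string with `open("hive-site.xml").read()`.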
2. After modifying the configuration, copy the MySQL driver jar into the ${HIVE_HOME}/lib directory
[hadoop@hadoop300 ~]$ cp mysql-connector-java-5.1.46.jar /home/hadoop/app/hive/lib/
3. Initialize the table information and data of the metadata in the database
[hadoop@hadoop300 ~]$schematool -initSchema -dbType mysql
4. You can now see the generated metadata tables in the Hadoops_Hive database
6. Hive client usage
6.1 hive CLI (interactive client)
- Because the hive.metastore.uris parameter is configured, the Hive client accesses the metadata through the metastore service, which therefore must be started first. If hive.metastore.uris were not configured, the metastore service would not need to be started.
[hadoop@hadoop300 ~]$ hive --service metastore
- Execute the hive command to enter the interactive command line
[hadoop@hadoop300 conf]$ hive
hive (default)> show tables;
OK
tab_name
student
user
6.2 beeline
- Beeline accesses Hive through hiveServer2, so you need to start hiveServer2 first
[hadoop@hadoop300 shell]$hive --service hiveserver2
After hiveServer2 starts, you can access its WebUI at http://hadoop300:10002
Start the beeline client and connect to hiveServer2:
beeline -u jdbc:hive2://hadoop300:10000 -n hadoop
[hadoop@hadoop300 shell]$ beeline -u jdbc:hive2://hadoop300:10000 -n hadoop
Connecting to jdbc:hive2://hadoop300:10000
Connected to: Apache Hive (version 3.1.2)
Driver: Hive JDBC (version 3.1.2)
Transaction isolation: TRANSACTION_REPEATABLE_READ
Beeline version 3.1.2 by Apache Hive
# List all tables
0: jdbc:hive2://hadoop300:10000> show tables;
INFO : Compiling command(queryId=hadoop_20210227161152_17b6a6dd-bcd2-4ab0-8bd4-be600ae07069): show tables
INFO : Concurrency mode is disabled, not creating a lock manager
INFO : Semantic Analysis Completed (retrial = false)
INFO : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:tab_name, type:string, comment:from deserializer)], properties:null)
INFO : Completed compiling command(queryId=hadoop_20210227161152_17b6a6dd-bcd2-4ab0-8bd4-be600ae07069); Time taken: 1.044 seconds
INFO : Concurrency mode is disabled, not creating a lock manager
INFO : Executing command(queryId=hadoop_20210227161152_17b6a6dd-bcd2-4ab0-8bd4-be600ae07069): show tables
INFO : Starting task [Stage-0:DDL] in serial mode
INFO : Completed executing command(queryId=hadoop_20210227161152_17b6a6dd-bcd2-4ab0-8bd4-be600ae07069); Time taken: 0.06 seconds
INFO : OK
INFO : Concurrency mode is disabled, not creating a lock manager
+-----------+
| tab_name |
+-----------+
| student |
| user |
+-----------+
3 rows selected (1.554 seconds)
Test
1. Create an external partitioned teacher table
create external table if not exists teacher (
`id` int,
`name` string,
`age` int COMMENT 'age'
) COMMENT 'teacher table'
partitioned by (`date` string COMMENT 'partition date')
row format delimited fields terminated by '\t'
stored as parquet
location '/warehouse/demo/teacher'
tblproperties ("parquet.compress"="SNAPPY");
2. Insert data
insert overwrite table teacher partition(`date`='2021-02-29')
select 3, "jayChou",49;
3. View HDFS
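On HDFS, the insert above creates one `key=value` subdirectory per partition value under the table's location, i.e. `/warehouse/demo/teacher/date=2021-02-29/` containing the parquet data files. The following Python sketch simulates that layout locally (in a temp directory; the file name `000000_0` is a typical but not guaranteed Hive output name):

```python
import os
import tempfile

# Simulate the directory layout Hive creates for the partitioned
# teacher table: each partition value becomes a key=value
# subdirectory under the table location.
root = tempfile.mkdtemp()
partition_dir = os.path.join(root, "warehouse", "demo", "teacher", "date=2021-02-29")
os.makedirs(partition_dir)

# Placeholder for the parquet data file Hive would write there.
open(os.path.join(partition_dir, "000000_0"), "w").close()

# Walk the tree, mimicking `hdfs dfs -ls -R /warehouse/demo/teacher`
for dirpath, _, filenames in os.walk(root):
    for f in filenames:
        print(os.path.relpath(os.path.join(dirpath, f), root))
```

Because the partition column is encoded in the directory name rather than in the files, a query filtering on `date` can prune whole directories without reading any data.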