Big data platform construction | Hive

Preface

Big data platform construction | Hadoop cluster construction (1)

1. Introduction

  • Based on Hive version 3.1.2
  • Hive download address
  • Hive requires Hadoop 3.x to run
  • Requires a JDK 1.8 environment

2. Architecture

  • In essence, Hive stores the mapping relationships (metadata) between HDFS files and tables/databases, then provides a SQL interface for accessing the file data as if it were structured table data. Hive translates the SQL, and a computation engine computes the query results
    [figure: Hive architecture]
  • Metadata (MetaStore): the mapping data between HDFS files and tables/databases. By default it is stored in the built-in Derby database; in practice it is usually configured to be stored in MySQL
  • Driver (the EXPLAIN example after this list shows these stages at work):
    • SQL Parser: converts the SQL string into an abstract syntax tree (AST), then validates the AST
    • Compiler (Physical Plan): compiles the AST into a logical execution plan
    • Query Optimizer: optimizes the logical execution plan
    • Executor: converts the logical execution plan into a runnable physical plan (such as MapReduce or Spark jobs) and runs it
  • Client: provides various ways to access Hive, such as the CLI (hive shell), JDBC/ODBC (Java access to Hive), and Beeline
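
To watch these stages at work, you can ask Hive to print the plan the Driver compiles for a statement. A minimal sketch (it uses the teacher table created in the test section at the end of this article):

-- explain prints the stages produced by the parser -> compiler ->
-- optimizer pipeline (e.g. a map/reduce stage and a fetch stage)
-- without actually running the query
explain
select `date`, count(*) from teacher group by `date`;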

3. Server planning

  • Deploy the Hive client on whichever servers need it; each deployment can act as a server or as a client. A client does not start the metastore service (or the hiveServer2 service) itself, it connects to the services on another server. For example, if hadoop300 starts the metastore service, then hadoop301 and hadoop302 only need to configure the address of that metastore service to access Hive (e.g. thrift://hadoop300:9083), as in the sketch after the table below
        hadoop300   hadoop301   hadoop302
hive    √
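
For example, a client-only deployment on hadoop301 or hadoop302 needs just the metastore address in its hive-site.xml. A minimal sketch (the full server-side configuration appears in section 5):

<configuration>
  <!-- point this client at the metastore service on hadoop300 -->
  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://hadoop300:9083</value>
  </property>
</configuration>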

4. How to access Hive

  • Accessing Hive is essentially accessing the metadata stored in MySQL
# The three access flows
1  Hive client ----> mysql (metadata)
2  Hive client ----> metastore service ----> mysql (metadata)
3  Hive client ----> hiveServer2 service ----> metastore service ----> mysql (metadata)

1. MySQL direct connection

  • For a direct connection, you only need to configure the Hive client with the connection information of the MySQL instance where the metadata lives; see the minimal hive-site.xml sketch after this list.
  • If neither the metastore service nor the hiveServer2 service is configured, direct connection is the default: Hive's shell client can access the metadata without starting either service.
  • This method suits local access where exposing the MySQL connection details is acceptable, and it needs no additional services.
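
A minimal sketch of a direct-connection hive-site.xml: only the MySQL JDBC properties, no hive.metastore.uris (the values match the installation section below; the URL's extra query parameters are trimmed here):

<configuration>
  <!-- direct connection: this client talks to MySQL itself -->
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://www.burukeyou.com:3306/Hadoops_Hive?useSSL=false</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>root</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>123456</value>
  </property>
</configuration>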

2. The metastore (metadata service) method

  • The metastore is a Thrift service; you must start it manually and then connect to it to access Hive (a background-start sketch follows this list)
  • It sits in front of MySQL (the metadata) and hides the MySQL connection details: clients connect to the metastore service, and the metastore service connects to MySQL to fetch the metadata
  • This method is used whenever the hive.metastore.uris parameter is configured
  • It is mainly responsible for metadata access, i.e. table structures and database information
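
Starting it uses the same command as section 6.1; in practice you would usually run it in the background, e.g. (the log path is just an example):

# run the metastore service in the background;
# it listens on the port from hive.metastore.uris (9083 here)
nohup hive --service metastore > ~/metastore.log 2>&1 &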

3. The hiveServer2 service method

  • hiveServer2 is also a Thrift service; you must start it manually and then connect to it to access Hive
  • It is a second service layered on top of the metastore service
  • It is mainly responsible for access to the actual table data in Hive: for example, Python and Java programs access Hive data remotely through it, and the Beeline client also goes through HiveServer2 (see the JDBC sketch after this list)
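
As a sketch of such remote access, a minimal Java JDBC client could look like this (assuming the hive-jdbc driver is on the classpath; host, port, and user match the configuration in section 5):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcDemo {
    public static void main(String[] args) throws Exception {
        // hiveServer2 address as configured in hive-site.xml (section 5)
        String url = "jdbc:hive2://hadoop300:10000/default";
        // empty password: assumes HiveServer2's default (no) authentication
        try (Connection conn = DriverManager.getConnection(url, "hadoop", "");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("show tables")) {
            while (rs.next()) {
                System.out.println(rs.getString(1)); // the tab_name column
            }
        }
    }
}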

5. Installation

Download and unzip

[hadoop@hadoop300 app]$ pwd
/home/hadoop/app
drwxrwxr-x. 12 hadoop hadoop 166 Feb 22 00:08 manager
lrwxrwxrwx   1 hadoop hadoop  47 Feb 21 12:33 hadoop -> /home/hadoop/app/manager/hadoop_mg/hadoop-3.1.3
lrwxrwxrwx   1 hadoop hadoop  54 Feb 22 00:04 hive -> /home/hadoop/app/manager/hive_mg/apache-hive-3.1.2-bin
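
A sketch of the commands that produce this layout (the mirror URL and the manager/hive_mg directory come from the listing above; adjust to your environment):

# download Hive 3.1.2, unpack it, and symlink it as ~/app/hive
wget https://archive.apache.org/dist/hive/hive-3.1.2/apache-hive-3.1.2-bin.tar.gz
tar -zxvf apache-hive-3.1.2-bin.tar.gz -C /home/hadoop/app/manager/hive_mg/
ln -s /home/hadoop/app/manager/hive_mg/apache-hive-3.1.2-bin /home/hadoop/app/hive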

Add hive environment variables

  • Modify the ~/.bash_profile file (vim ~/.bash_profile)
# ================== Hive ==============
export HIVE_HOME=/home/hadoop/app/hive
export PATH=$PATH:$HIVE_HOME/bin
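
Then reload the profile so the hive command is picked up:

[hadoop@hadoop300 ~]$ source ~/.bash_profile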

hive configuration

1. Modify the ${HIVE_HOME}/conf/hive-site.xml file

  • If this file does not exist, create it by copying ${HIVE_HOME}/conf/hive-default.xml.template
  • It mainly configures how and where Hive's metadata is stored. The default is Derby; here the metadata is stored in MySQL, so the MySQL connection properties must be configured, along with the metastore and hiveServer2 services
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
    <!-- 1. Address the metastore service listens on (several may be configured, comma-separated) -->
    <property>
        <name>hive.metastore.uris</name>
        <value>thrift://hadoop300:9083</value>
    </property>

    <!-- 2. MySQL connection for the metadata; the metadata is stored in the Hadoops_Hive database -->
    <property>
        <name>javax.jdo.option.ConnectionURL</name>
        <value>jdbc:mysql://www.burukeyou.com:3306/Hadoops_Hive?createDatabaseIfNotExist=true&amp;useUnicode=true&amp;characterEncoding=UTF-8&amp;useSSL=false</value>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionDriverName</name>
        <value>com.mysql.jdbc.Driver</value>
    </property>
    <!-- JDBC connection username -->
    <property>
        <name>javax.jdo.option.ConnectionUserName</name>
        <value>root</value>
    </property>
    <!-- JDBC connection password -->
    <property>
        <name>javax.jdo.option.ConnectionPassword</name>
        <value>123456</value>
    </property>

    <!-- 3. Hive's working directory on HDFS;
         the default warehouse path is /user/hive/warehouse -->
    <property>
        <name>hive.metastore.warehouse.dir</name>
        <value>/warehouse</value>
    </property>

    <!-- 4. hiveServer2 port -->
    <property>
        <name>hive.server2.thrift.port</name>
        <value>10000</value>
    </property>
    <!-- hiveServer2 bind host -->
    <property>
        <name>hive.server2.thrift.bind.host</name>
        <value>hadoop300</value>
    </property>

    <!-- 5. Disable verification of the metastore schema version -->
    <property>
        <name>hive.metastore.schema.verification</name>
        <value>false</value>
    </property>

    <!-- 6. Show column headers after select in the Hive CLI -->
    <property>
        <name>hive.cli.print.header</name>
        <value>true</value>
    </property>

    <!-- 7. Show the current database in the Hive CLI prompt -->
    <property>
        <name>hive.cli.print.current.db</name>
        <value>true</value>
    </property>

</configuration>

2. After modifying the configuration, copy the MySQL driver package into the ${HIVE_HOME}/lib directory

[hadoop@hadoop300 ~]$  cp mysql-connector-java-5.1.46.jar /home/hadoop/app/hive/lib/

3. Initialize the metastore tables and data in the database

[hadoop@hadoop300 ~]$ schematool -initSchema -dbType mysql

4. The generated metastore tables can now be seen in the Hadoops_Hive database
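
One way to list them from the shell (assuming the mysql client is installed; metastore tables such as DBS, TBLS, and COLUMNS_V2 should appear):

[hadoop@hadoop300 ~]$ mysql -h www.burukeyou.com -uroot -p123456 -e "show tables" Hadoops_Hive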

6. Hive client usage

6.1 hive CLI (interactive client)

  • Because the hive.metastore.uris parameter is configured, the CLI goes through the metastore service, so that service must be started first; Hive clients then access the metadata through it. If hive.metastore.uris were not configured, the metastore service would not need to be started
[hadoop@hadoop300 ~]$ hive --service metastore
  • Execute the hive command to enter the interactive command line
[hadoop@hadoop300 conf]$ hive
hive (default)> show tables;
OK
tab_name
student
user

6.2 beeline

  • Beeline accesses Hive through hiveServer2, so hiveServer2 must be started first
[hadoop@hadoop300 shell]$ hive --service hiveserver2

After hiveserver2 starts, its WebUI can be accessed at http://hadoop300:10002


Start the beeline client and connect to hiveServer2

  • Connect with beeline -u jdbc:hive2://hadoop300:10000 -n hadoop (-n specifies the username)
[hadoop@hadoop300 shell]$ beeline -u jdbc:hive2://hadoop300:10000 -n hadoop
Connecting to jdbc:hive2://hadoop300:10000
Connected to: Apache Hive (version 3.1.2)
Driver: Hive JDBC (version 3.1.2)
Transaction isolation: TRANSACTION_REPEATABLE_READ
Beeline version 3.1.2 by Apache Hive

# list all tables
0: jdbc:hive2://hadoop300:10000> show tables;
INFO  : Compiling command(queryId=hadoop_20210227161152_17b6a6dd-bcd2-4ab0-8bd4-be600ae07069): show tables
INFO  : Concurrency mode is disabled, not creating a lock manager
INFO  : Semantic Analysis Completed (retrial = false)
INFO  : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:tab_name, type:string, comment:from deserializer)], properties:null)
INFO  : Completed compiling command(queryId=hadoop_20210227161152_17b6a6dd-bcd2-4ab0-8bd4-be600ae07069); Time taken: 1.044 seconds
INFO  : Concurrency mode is disabled, not creating a lock manager
INFO  : Executing command(queryId=hadoop_20210227161152_17b6a6dd-bcd2-4ab0-8bd4-be600ae07069): show tables
INFO  : Starting task [Stage-0:DDL] in serial mode
INFO  : Completed executing command(queryId=hadoop_20210227161152_17b6a6dd-bcd2-4ab0-8bd4-be600ae07069); Time taken: 0.06 seconds
INFO  : OK
INFO  : Concurrency mode is disabled, not creating a lock manager
+-----------+
| tab_name  |
+-----------+
| student   |
| user      |
+-----------+
3 rows selected (1.554 seconds)

Test

1. Create an external partition teacher table

create external table if not exists teacher (
  `id` int,
  `name` string,
  `age` int COMMENT 'age'
) COMMENT 'teacher table'
partitioned by (`date` string COMMENT 'partition date')
row format delimited fields terminated by '\t'
stored as parquet
location '/warehouse/demo/teacher'
tblproperties ("parquet.compression"="SNAPPY");

2. Insert data

insert overwrite table teacher partition(`date`='2021-02-29')
select 3, "jayChou",49;
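
To verify the insert, query the partition just written (a quick check; the predicate on `date` prunes to that one partition):

select * from teacher where `date`='2021-02-29';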

3. View HDFS: the table's location directory now contains one subdirectory per partition, as sketched below
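
A way to check from the shell (each partition value becomes a date=... subdirectory under the table's location):

[hadoop@hadoop300 ~]$ hdfs dfs -ls /warehouse/demo/teacher
[hadoop@hadoop300 ~]$ hdfs dfs -ls /warehouse/demo/teacher/date=2021-02-29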


Origin blog.csdn.net/weixin_41347419/article/details/114157349