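Introduction to Hive and Its Usage

Data warehouse basics

Basic concept of a data warehouse:

A data warehouse (DW) mainly stores data for analytical reporting and decision support. It neither produces data itself nor consumes it: data comes in from external systems, and the results are made available to external applications.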

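Main characteristics of a data warehouse:

Subject-oriented: the data is organized around a definite analysis goal

Integrated: all related data is brought into the warehouse to make later analysis easier

Non-volatile: once data enters the warehouse it is not easily changed

Time-variant: the warehouse keeps historical data and is refreshed periodically, so the analysis dimensions can change as requirements change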

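Differences between a database and a data warehouse:

Database: OLTP, mainly for online transaction processing; it handles inserts, deletes, updates, and queries in the business database

Data warehouse: OLAP, mainly for online analytical processing; it runs analytical queries over historical data, which is normally neither inserted nor modified by those queries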

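Layers of a data warehouse:

There are three layers: the source data layer, the data warehouse layer, and the data application layer

Source data layer: the ODS layer (operational data store), mainly used to acquire source data

Data warehouse layer: the DW layer, mainly used to analyze the data from the ODS layer and derive the desired results

Data application layer: the APP layer, mainly used to present the results produced by the warehouse layer

The flow of data between these layers is called the ETL process (Extract, Transform, Load)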

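Metadata management in a data warehouse

Metadata mainly records the relationships between tables, the meaning of table columns, data processing rules, the data loading schedule, the data export schedule, and so on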

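Hive basics

Hive is a data warehouse tool built on Hadoop that maps structured data files on HDFS to tables. Hive stores its data in HDFS and runs its computations as MapReduce jobs, so Hive can be thought of as a MapReduce client: the HQL statements you write are translated into MapReduce programs and executed.

Data structure: structured data has a fixed number of fields and a fixed field delimiter; semi-structured data covers formats such as XML and JSON; unstructured data has no regular format at all.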
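To make "a fixed number of fields and a fixed delimiter" concrete, here is a minimal Java sketch of applying a two-column schema to one line of tab-delimited text. The class and method names are hypothetical, and the handling of missing or extra fields (null for missing, extras dropped) is an assumption meant to resemble how delimited text is usually read, not Hive's actual code.

```java
import java.util.regex.Pattern;

/** Hypothetical sketch: applies a fixed column count to one line of delimited text. */
final class DelimitedRowParser {

    /** Splits a line into exactly columnCount fields; missing fields become null, extras are dropped. */
    static String[] parseLine(String line, char delimiter, int columnCount) {
        // Quote the delimiter so it is treated literally; limit -1 keeps trailing empty fields.
        String[] raw = line.split(Pattern.quote(String.valueOf(delimiter)), -1);
        String[] row = new String[columnCount];
        for (int i = 0; i < columnCount; i++) {
            row[i] = i < raw.length ? raw[i] : null;
        }
        return row;
    }

    public static void main(String[] args) {
        // A line with two tab-separated fields, read with a two-column schema.
        String[] row = parseLine("1\tzhangsan", '\t', 2);
        System.out.println(row[0] + " | " + row[1]);
    }
}
```

The file itself stays plain text; the "table" is only the schema applied when the line is read.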

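Hive features:

Scalability: it scales with the Hadoop cluster

Extensibility: it supports user-defined functions

Fault tolerance: it tolerates failures well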

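Hive architecture:

User interface: where SQL statements are written and submitted to Hive

Parser / compiler: compiles the SQL statement into a MapReduce program

Optimizer: optimizes the SQL statement

Executor: submits the MapReduce job and runs it

Metastore database: Hive's metadata holds the mapping between tables and the data on HDFS; the default metastore database is Derby, which is usually replaced with MySQL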

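Installing Hive:

Use MySQL as the metastore database (installed from the yum repository)

# unpack the Hive tarball
cd /export/softwares
tar -zxvf hive-1.1.0-cdh5.14.0.tar.gz -C ../servers/
# install the MySQL packages online
yum install mysql mysql-server mysql-devel
# start the MySQL service
/etc/init.d/mysqld start
# run the setup script that ships with MySQL
/usr/bin/mysql_secure_installation
# 1. no root password yet: press Enter  2. set the root password  3. remove anonymous users: y
# 4. disallow remote root login: n  5. remove the test database: y  6. reload the privilege tables: y
# open the MySQL client and grant privileges
mysql -uroot -p
grant all privileges on *.* to 'root'@'%' identified by '123456' with grant option;
# refresh the privilege tables
flush privileges;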

Edit Hive's configuration files

Edit hive-env.sh

cd /export/servers/hive-1.1.0-cdh5.14.0/conf
vim hive-env.sh

HADOOP_HOME=/export/servers/hadoop-2.6.0-cdh5.14.0
# Hive Configuration Directory can be controlled by:
export HIVE_CONF_DIR=/export/servers/hive-1.1.0-cdh5.14.0/conf

Edit hive-site.xml

vim hive-site.xml

<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
        <property>
                <name>javax.jdo.option.ConnectionURL</name>
                <value>jdbc:mysql://node03.hadoop.com:3306/hive?createDatabaseIfNotExist=true</value>
        </property>

        <property>
                <name>javax.jdo.option.ConnectionDriverName</name>
                <value>com.mysql.jdbc.Driver</value>
        </property>
        <property>
                <name>javax.jdo.option.ConnectionUserName</name>
                <value>root</value>
        </property>
        <property>
                <name>javax.jdo.option.ConnectionPassword</name>
                <value>123456</value>
        </property>
        <property>
                <name>hive.cli.print.current.db</name>
                <value>true</value>
        </property>
        <property>
                <name>hive.cli.print.header</name>
                <value>true</value>
        </property>
        <property>
                <name>hive.server2.thrift.bind.host</name>
                <value>node03.hadoop.com</value>
        </property>
<!--
        <property>
                <name>hive.metastore.uris</name>
                <value>thrift://node03.hadoop.com:9083</value>
        </property>
-->
</configuration>

Upload the MySQL driver JAR

Upload the MySQL JDBC driver JAR to Hive's lib directory:
cd /export/servers/hive-1.1.0-cdh5.14.0/lib
Upload mysql-connector-java-5.1.38.jar into this directory

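Ways to interact with Hive:

Option 1: the Hive interactive shell

bin/hive

Option 2: the Hive JDBC service

Start the hiveserver2 service

Run in the foreground

bin/hive --service hiveserver2

Run in the background

nohup bin/hive --service hiveserver2  &

Connect to hiveserver2 with beeline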

bin/beeline
beeline> !connect jdbc:hive2://node03.hadoop.com:10000
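Beeline is itself a JDBC client, so the same connection can be opened from Java. The following is a rough sketch only: it assumes the hive-jdbc driver JAR is on the classpath, that hiveserver2 is listening on node03.hadoop.com:10000, and that a database myhive with a table test exists. The class name and the jdbcUrl helper are hypothetical. If the driver is not registered automatically, loading org.apache.hive.jdbc.HiveDriver with Class.forName beforehand may be necessary.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

/** Hypothetical sketch of a JDBC client for hiveserver2. */
final class HiveJdbcSketch {

    /** Builds a hiveserver2 JDBC URL of the form jdbc:hive2://host:port/database. */
    static String jdbcUrl(String host, int port, String database) {
        return "jdbc:hive2://" + host + ":" + port + "/" + database;
    }

    public static void main(String[] args) throws SQLException {
        if (args.length < 2) {
            System.out.println("usage: HiveJdbcSketch <user> <password>");
            return;
        }
        // Host, port, database, and table are assumptions taken from this walkthrough.
        String url = jdbcUrl("node03.hadoop.com", 10000, "myhive");
        try (Connection conn = DriverManager.getConnection(url, args[0], args[1]);
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("select * from test")) {
            while (rs.next()) {
                // Print the first column of every row.
                System.out.println(rs.getString(1));
            }
        }
    }
}
```

The try-with-resources block closes the result set, statement, and connection in reverse order even if the query fails.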

Option 3: the hive command

Use the -e option to run an HQL statement directly

bin/hive -e "use myhive;select * from test;"

Use the -f option to run the HQL statements stored in a text file

vim hive.sql
use myhive;select * from test;

bin/hive -f hive.sql

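Basic Hive operations

Database operations

Create a database

create database if not exists myhive;
use myhive;

Where Hive stores its databases and tables is determined by a property in hive-site.xml

<name>hive.metastore.warehouse.dir</name>
<value>/user/hive/warehouse</value>

Create a database and specify its storage location on HDFS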

 create database myhive2 location '/myhive2';

Alter a database

The alter database command can change some properties of a database, but the database's metadata, including its name and its location, cannot be changed

alter database myhive2 set dbproperties('createtime'='201812');

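Show basic information about a database

desc  database  myhive2;

Show more detailed information about a database

desc database extended myhive2;

Drop a database

This drops an empty database. If the database still contains tables, the statement fails with an error (appending cascade drops the tables as well). When the drop succeeds, the corresponding files on HDFS are removed too

drop database myhive2;

Syntax for creating a table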

CREATE [EXTERNAL] TABLE [IF NOT EXISTS] table_name 
   [(col_name data_type [COMMENT col_comment], ...)] 
   [COMMENT table_comment] 
   [PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)] 
   [CLUSTERED BY (col_name, col_name, ...) 
   [SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS] 
   [ROW FORMAT row_format] 
   [STORED AS file_format] 
   [LOCATION hdfs_path]

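Explanation of each line:

1. The keywords that create the table (EXTERNAL and IF NOT EXISTS are optional)

2. Defines the column names and types

3. A comment; use only English or pinyin

4. Partitioning: Hive partitions split the table into directories

5. Bucketing: splits the data into files by column

6. The optional sort order within a bucket, and how many buckets to use

7. Specifies the delimiter between fields

8. Specifies the storage format of the data

9. Specifies where on HDFS the table is stored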

Table models in Hive

Column types available when creating a Hive table

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Types

(1) Managed tables (internal tables)

Create a table:

use myhive;
create table stu(id int,name string);
insert into stu values (1,"zhangsan");
select * from stu;

Create a table and specify the field delimiter, the file storage format, and the storage location on HDFS

create table if not exists stu2(id int,name string) row format delimited fields terminated by '\t' stored as textfile location '/user/stu2';

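Create a table from a query result: this syntax copies both the data and the table structure of stu2 into stu3

create table stu3 as select * from stu2;

Create a table from an existing table's structure: only the structure is copied, not the data

create table stu4 like stu2;

Check the type of a table

desc formatted stu2;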

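(2) External tables

Note: the data of an external table is loaded from a specified HDFS file path. Hive does not treat an external table as the sole owner of its data, so dropping the table removes only the metadata and leaves the data on HDFS. This is the opposite of a managed table, whose data is deleted together with the table

Create the teacher and student tables and load data into them

-- create the teacher table
create external table techer (t_id string,t_name string) row format delimited fields terminated by '\t';
-- create the student table
create external table student (s_id string,s_name string,s_birth string , s_sex string ) row format delimited fields terminated by '\t';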

Load data into a table from the local file system

load data local inpath '/export/servers/hivedatas/student.csv' into table student;

Load data and overwrite the existing data

load data local inpath '/export/servers/hivedatas/student.csv' overwrite into table student;

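Load data into a table from HDFS (the data must be uploaded to HDFS beforehand; the load is effectively a file move)

# shell: upload the file to HDFS
cd /export/servers/hivedatas
hdfs dfs -mkdir -p /hivedatas
hdfs dfs -put techer.csv /hivedatas/
-- Hive: load the data from HDFS
load data inpath '/hivedatas/techer.csv' into table techer;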

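If an external table such as techer is dropped, its data still exists on HDFS, and once the table is created again the data is immediately there. Because the table is external, dropping it leaves the data on HDFS

-- drop the teacher table
drop table techer;
-- list the tables
show tables;
-- in the data directory /user/hive/warehouse/myhive.db/techer, the techer.csv file is still present
-- create the teacher table again
create external table techer (t_id string,t_name string) row format delimited fields terminated by '\t';
-- query the table directly: the data is already available
select * from techer;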

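(3) Partitioned tables:

Partitioning splits a table into directories. Directories can be created by time or by other criteria; the keyword is partitioned by

Create a table with a single partition column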

create table score (s_id string,c_id string,s_score int) partitioned by (month string) row format delimited fields terminated by '\t';

Create a table with multiple partition columns

create table score2(s_id string,c_id string,s_score int) partitioned by (year string,month string,day string) row format delimited fields terminated by '\t';

Load data into a partition

load data local inpath '/export/servers/hivedatas/score.csv' into table score partition (month = '201806');

Load data into a table with multiple partition columns

load data local inpath '/export/servers/hivedatas/score.csv' into table score2 partition(year='2018',month='06',day='01');

Show a table's partitions

show partitions score;

Add a partition

alter table score add partition(month='201803');

Add several partitions at once

alter table score add partition(month='201804') partition(month = '201805');

After a partition is added, a new directory appears under the table's directory on HDFS
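Because each partition is a directory, the directory for a given partition can be predicted from the partition values. A minimal Java sketch of that key=value naming follows; the class and method names are hypothetical, and the table directory assumes the default warehouse location /user/hive/warehouse and the myhive database.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch: derives the directory of a partition from its column values. */
final class PartitionPathSketch {

    /** Appends one key=value directory level per partition column, in declaration order. */
    static String partitionPath(String tableDir, Map<String, String> partitionSpec) {
        StringBuilder path = new StringBuilder(tableDir);
        for (Map.Entry<String, String> entry : partitionSpec.entrySet()) {
            path.append('/').append(entry.getKey()).append('=').append(entry.getValue());
        }
        return path.toString();
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps the partition columns in the order year, month, day.
        Map<String, String> spec = new LinkedHashMap<>();
        spec.put("year", "2018");
        spec.put("month", "06");
        spec.put("day", "01");
        // Assumed table directory: default warehouse dir plus myhive.db/score2.
        System.out.println(partitionPath("/user/hive/warehouse/myhive.db/score2", spec));
        // prints /user/hive/warehouse/myhive.db/score2/year=2018/month=06/day=01
    }
}
```

A query that filters on the partition columns only needs to read the matching directories, which is the practical benefit of partitioning.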

Drop a partition

alter table score drop partition(month = '201803');

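(4) Bucketed tables

Bucketing distributes rows into several buckets by a specified column; in other words, the data is split by that column into multiple files

Bucketing must be enabled in Hive first; it is off by default

set hive.enforce.bucketing = true;
-- set the number of reducers
set mapreduce.job.reduces=3;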

Create a bucketed table

Keywords for creating a bucketed table: clustered by (col_name) into xx buckets

create table course (c_id string,c_name string,t_id string) clustered by (c_id) into 3 buckets row format delimited fields terminated by '\t';

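Data is loaded into a bucketed table through insert overwrite, because load data only moves files and does not distribute the rows into buckets

Create a regular table, then use insert overwrite with a query to load the regular table's data into the bucketed table

-- create the regular table
create table course_common (c_id string,c_name string,t_id string) row format delimited fields terminated by '\t';
-- load data into the regular table
load data local inpath '/export/servers/hivedatas/course.csv' into table course_common;
-- load the bucketed table with insert overwrite
insert overwrite table course select * from course_common cluster by (c_id);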
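To illustrate how a row ends up in one of the 3 buckets, here is a small Java sketch that hashes the bucketing column and takes the remainder modulo the bucket count. It is an approximation under stated assumptions: it uses Java's String.hashCode as a stand-in for Hive's own hash of a string column, and the class name, method name, and sample ids are hypothetical.

```java
/** Hypothetical sketch: assigns a bucket number to a value of the bucketing column. */
final class BucketSketch {

    /** Masks off the sign bit so the hash is non-negative, then takes it modulo the bucket count. */
    static int bucketFor(String key, int numBuckets) {
        return (key.hashCode() & Integer.MAX_VALUE) % numBuckets;
    }

    public static void main(String[] args) {
        // Sample c_id values (assumed for illustration), spread over 3 buckets.
        String[] courseIds = {"01", "02", "03", "04"};
        for (String id : courseIds) {
            System.out.println(id + " -> bucket " + bucketFor(id, 3));
        }
    }
}
```

Rows with the same c_id always land in the same bucket, which is what makes bucketed tables useful for sampling and for joins on the bucketing column.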

Alter a table

Rename a table

alter table score4 rename to score5;

Add / change / delete column information

-- show the table structure
desc score5;
-- add columns
alter table score5 add columns (mycol string,mysco string);
-- change a column
alter table score5 change column mysco mysconew int;
-- drop the table
drop table score5;

Loading data into Hive tables:

load data: load data from a file

 load data local inpath '/export/servers/hivedatas/score.csv' overwrite into table score partition(month='201806');

insert overwrite select: query one table and insert the result into another table

insert overwrite table score4 partition(month='201802') select s_id ,c_id ,s_score from score;

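Ways to configure Hive parameters

Full list of Hive parameters:
https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties

General parameters can be set in three ways

Configuration files:

User-defined configuration file: hive-site.xml

Default configuration file: hive-default.xml

User-defined settings override the defaults. Hive also reads Hadoop's configuration, because Hive starts as a Hadoop client, and Hive's settings override Hadoop's. Settings in the configuration files apply to every Hive process started on the machine

Command-line parameters:

When starting Hive, add -hiveconf param=value on the command line to set a parameter

Parameter declaration:

Use the set keyword in HQL to set a parameter

The three methods have increasing priority: parameter declaration > command-line parameters > configuration file parameters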
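The precedence rule can be pictured as layered maps in which a later layer overwrites an earlier one. The Java sketch below only models that rule; it is not how Hive implements its configuration, and the class name, method name, and sample values are hypothetical.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch: resolves a setting across layers of increasing priority. */
final class ConfigPrecedenceSketch {

    /** Merges the layers from lowest to highest priority; later layers overwrite earlier ones. */
    @SafeVarargs
    static Map<String, String> resolve(Map<String, String>... layersLowToHigh) {
        Map<String, String> effective = new LinkedHashMap<>();
        for (Map<String, String> layer : layersLowToHigh) {
            effective.putAll(layer);
        }
        return effective;
    }

    public static void main(String[] args) {
        // Lowest priority: values from the configuration file (sample values).
        Map<String, String> siteFile = new HashMap<>();
        siteFile.put("mapreduce.job.reduces", "1");
        siteFile.put("hive.cli.print.header", "true");
        // Middle priority: a value passed with -hiveconf on the command line.
        Map<String, String> commandLine = Collections.singletonMap("mapreduce.job.reduces", "2");
        // Highest priority: a value set with the set keyword in the session.
        Map<String, String> setStatement = Collections.singletonMap("mapreduce.job.reduces", "3");

        Map<String, String> effective = resolve(siteFile, commandLine, setStatement);
        System.out.println(effective.get("mapreduce.job.reduces")); // prints 3
        System.out.println(effective.get("hive.cli.print.header")); // prints true
    }
}
```

A key that appears only in a lower layer, such as hive.cli.print.header here, keeps its value because nothing above it overrides it.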

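Hive functions

Hive ships with a set of built-in functions; when they do not meet a requirement, we need to write our own user-defined function (UDF)

Official documentation:

https://cwiki.apache.org/confluence/display/Hive/HivePlugins

Programming steps:

1. Extend org.apache.hadoop.hive.ql.exec.UDF

2. Implement the evaluate method; evaluate can be overloaded

Points to note:

1. A UDF must have a return type; it may return null, but the return type cannot be void

2. UDFs commonly use Hadoop types such as Text; plain Java types are not recommended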
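The notes about overloading and null handling can be tried out without any Hive dependency. The sketch below is a stand-in only: it uses java.lang.String and static methods so that it compiles with the JDK alone, whereas a real UDF would extend the UDF class, use instance methods, and prefer Text. The class name and the two-argument overload are hypothetical.

```java
import java.util.Locale;

/** Hypothetical stand-in for a UDF's evaluate methods, using String so it needs no Hadoop classes. */
final class UpperCaseLogicSketch {

    /** Returns the upper-case form of s, or null when the input is null. */
    static String evaluate(String s) {
        return s == null ? null : s.toUpperCase(Locale.ROOT);
    }

    /** Overload: upper-cases s and truncates the result to at most maxLength characters. */
    static String evaluate(String s, int maxLength) {
        String upper = evaluate(s);
        if (upper == null || upper.length() <= maxLength) {
            return upper;
        }
        return upper.substring(0, maxLength);
    }

    public static void main(String[] args) {
        System.out.println(evaluate("abc"));        // prints ABC
        System.out.println(evaluate("abcdef", 3));  // prints ABC
        System.out.println(evaluate(null));         // prints null
    }
}
```

Both overloads return a value for every input, including null, which matches the rule that evaluate may return null but may not be declared void.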

UDF development example

Convert an alphabetic string entirely to upper case

1. Create a Maven project and add the dependencies

<repositories>
    <repository>
        <id>cloudera</id>
 <url>https://repository.cloudera.com/artifactory/cloudera-repos/</url>
    </repository>
</repositories>
<dependencies>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-common</artifactId>
        <version>2.6.0-cdh5.14.0</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hive</groupId>
        <artifactId>hive-exec</artifactId>
        <version>1.1.0-cdh5.14.0</version>
    </dependency>
</dependencies>
<build>
<plugins>
    <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-compiler-plugin</artifactId>
        <version>3.0</version>
        <configuration>
            <source>1.8</source>
            <target>1.8</target>
            <encoding>UTF-8</encoding>
        </configuration>
    </plugin>
     <plugin>
         <groupId>org.apache.maven.plugins</groupId>
         <artifactId>maven-shade-plugin</artifactId>
         <version>2.2</version>
         <executions>
             <execution>
                 <phase>package</phase>
                 <goals>
                     <goal>shade</goal>
                 </goals>
                 <configuration>
                     <filters>
                         <filter>
                             <artifact>*:*</artifact>
                             <excludes>
                                 <exclude>META-INF/*.SF</exclude>
                                 <exclude>META-INF/*.DSA</exclude>
                                 <exclude>META-INF/*.RSA</exclude>
                             </excludes>
                         </filter>
                     </filters>
                 </configuration>
             </execution>
         </executions>
     </plugin>
</plugins>
</build>

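2. Write a Java class that extends UDF and implements the evaluate method

package cn.lsy.udf;

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

public class ItcastUDF extends UDF {
    // Hive calls evaluate once per row; a null input yields a null result
    public Text evaluate(final Text s) {
        if (null == s) {
            return null;
        }
        // return the upper-case form of the input
        return new Text(s.toString().toUpperCase());
    }
}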

3. Package the project and upload the JAR to Hive's lib directory

cd /export/servers/hive-1.1.0-cdh5.14.0/lib
mv original-day_06_hive_udf-1.0-SNAPSHOT.jar udf.jar

4. Add the JAR in the Hive client

add jar /export/servers/hive-1.1.0-cdh5.14.0/lib/udf.jar;

5. Associate a function name with the custom class

create temporary function touppercase as 'cn.lsy.udf.ItcastUDF';

6. Use the custom function

select touppercase('abc');
