样例数据：

1,小明1,lol-book-movie,beijing:shangxuetang-shanghai:pudong
2,小明2,lol-book-movie,beijing:shangxuetang-shanghai:pudong
3,小明3,lol-book-movie,beijing:shangxuetang-shanghai:pudong
4,小明4,lol-book-movie,beijing:shangxuetang-shanghai:pudong
5,小明5,lol-movie,beijing:shangxuetang-shanghai:pudong
6,小明6,lol-book-movie,beijing:shangxuetang-shanghai:pudong
7,小明7,lol-book,beijing:shangxuetang-shanghai:pudong
8,小明8,lol-book,beijing:shangxuetang-shanghai:pudong
9,小明9,lol-book-movie,beijing:shangxuetang-shanghai:pudong

Hive完整的DDL建表语法规则

create [temporary] [external] table [if not exists] [db_name.]table_name    -- (note: temporary available in hive 0.14.0 and later)
  [(col_name data_type [comment col_comment], ... [constraint_specification])]
  [comment table_comment]
  [partitioned by (col_name data_type [comment col_comment], ...)]
  [clustered by (col_name, col_name, ...) [sorted by (col_name [asc|desc], ...)] into num_buckets buckets]
  [skewed by (col_name, col_name, ...)                  -- (note: available in hive 0.10.0 and later)]
     on ((col_value, col_value, ...), (col_value, col_value, ...), ...)
     [stored as directories]
  [
   [row format row_format] 
   [stored as file_format]
     | stored by 'storage.handler.class.name' [with serdeproperties (...)]  -- (note: available in hive 0.6.0 and later)
  ]
  [location hdfs_path]
  [tblproperties (property_name=property_value, ...)]   -- (note: available in hive 0.6.0 and later)
  [as select_statement];   -- (note: available in hive 0.5.0 and later; not supported for external tables)

Hive 建表（默认内部表）

-- 内部表
create table person(
id int,
name string,
likes array<string>,
address map<string,string>
)
row format delimited
fields terminated by ','
collection items terminated by '-'
map keys terminated by ':'
lines terminated by '\n';

-- 外部表
create table person(
id int,
name string,
likes array<string>,
address map<string,string>
)
row format delimited
fields terminated by ','
collection items terminated by '-'
map keys terminated by ':'
lines terminated by '\n'
location '/usr';

查看表描述

语法：describe [extended|formatted] table_name

describe formatted person;

内部表和外部表区别

hive 内部表
create table [if not exists] table_name
删除表时，元数据与数据都会被删除
hive 外部表
create external table [if not exists] table_name location hdfs_path
删除外部表只删除metastore的元数据，不删除hdfs中的表数据

三种建表方式

1、create table table_name 常规见表语句

2、create table ...as select..（CTAS）懒人模式

3、半自动化模式create table like

详细地址：三种建表方式https://blog.csdn.net/qq_26442553/article/details/85621767

CTAS建表需要注意的几点：https://blog.csdn.net/qq_26442553/article/details/79593504

分区表的意义：在于优化查询。查询时尽量利用分区字段。如果不使用分区字段，就会全部扫描。

静态分区

1、静态分区-表操作

内部表中、对应分区的元数据和数据将被一并删除。

-- 创建分区表
create table p_person (
id int,
name string,
likes array<string>,
address map<string,string>
) 
partitioned by (sex string,age int)
row format delimited
fields terminated by ','
collection items terminated by '-'
map keys terminated by ':'
lines terminated by '\n';

-- 添加分区:注意要讲所有的分区字段都写上
alter table p_person add partition (sex='man',age=20);

-- 删除分区:删除的时候只写需要删除的分区即可
alter table p_person drop partition (age=20);

2、hive查询表的分区信息语法：

show partitions day_hour_table;

动态分区

1、创建动态分区表

-- 要开启支持动态分区参数设置
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nostrict;

-- 首先创建数据表
create table person(
id int ,
name string,
age int,
sex string,
likes array<string>,
address map<string,string>
)
row format delimited
fields terminated by ','
collection items terminated by '-'
map keys terminated by ':'
lines terminated by '\n';

-- 然后创建结构分区表-分区表
create table psn_partitioned_dongtai(
id int ,
name string,
likes array<string>,
address map<string,string>
)
partitioned by (age int,sex string)
row format delimited
fields terminated by ','
collection items terminated by '-'
map keys terminated by ':'
lines terminated by '\n';

-- 向数据表中加载数据
load data local inpath '/root/hivedata/psn1' into table person;
--向结构分区表中加载数据
from person
insert overwrite table psn_partitioned_dongtai partition(age,sex)
select id,name,likes,address,age,sex distribute by age,sex;

2、参数

开启支持动态分区
set hive.exec.dynamic.partition=true;
默认：false
set hive.exec.dynamic.partition.mode=nostrict;
默认：strict（至少有一个分区列是静态分区）


set hive.exec.max.dynamic.partitions.pernode;
每一个执行mr节点上，允许创建的动态分区的最大数量(100)
set hive.exec.max.dynamic.partitions;
所有执行mr节点上，允许创建的所有动态分区的最大数量(1000)
set hive.exec.max.created.files;
所有的mr job允许创建的文件的最大数量(100000)

分桶

1、分桶描述

分桶是以文件的形式存在。分桶表是对分桶列取哈希值然后取模的方式，将不同的数据放到不同的问题件中，每个文件标识一个桶
适用场景：数据抽样（sampling）、map-join

2、创建分桶表

-- 开启支持分桶
set hive.enforce.bucketing=true;
注意，设置该参数后，mr运行时，bucket个数与reduce task个数会保持一致
默认：false；设置为true之后，mr运行时会根据bucket的个数自动分配reduce task个数。（用户也可以通过mapred.reduce.tasks自己设置reduce任务个数，但分桶时不推荐使用）
注意：一次作业产生的桶（文件数量）和reduce task个数一致


-- 创建数据表
create table psn_fentong(
id int,
name string,
age int	
)
row format delimited 
fields terminated by ',';

-- 加载数据
load data local inpath '/root/hivedata/psn_fentong' into table psn_fentong;

-- 创建分通表
create table psn_fentong2(
id int,
name string,
age int
)
clustered by (age) into 4 buckets 
row format delimited fields terminated by ',';

-- 加载数据
insert into table psn_fentong2 select id ,name ,age from psn_fentong;

-- 数据抽样
select * from psn_fentong2 (bucket 2 out of 4 on age);

tablesample语法：
tablesample(bucket x out of y)
x：表示从哪个bucket开始抽取数据
y：必须为该表总bucket数的倍数或因子

例：
当表总bucket数为32时
TABLESAMPLE(BUCKET 3 OUT OF 8)，抽取哪些数据？
共抽取2（32/16）个bucket的数据，抽取第2、第18（16+2）个bucket的数据

3、from insert 加载数据

from psn21
insert overwrite table psn22 partition(age, sex)  
select id, name, age, sex, likes, address distribute by age, sex;

hive Lateral View 虚拟表

1、描述，作用

使用：Lateral View用于和UDTF函数（explode、split）结合来使用。

首先通过UDTF函数拆分成多行，再将多行结果组合成一个支持别名的虚拟表。

主要解决：在select使用UDTF做查询过程中，查询只能包含单个UDTF，不能包含其他字段、以及多个UDTF的问题

2、案例

统计人员表中共有多少种爱好、多少个城市?
-- 查看表描述
describe formatted psn2;
select count(distinct(myCol1)), count(distinct(myCol2)) from psn2 
LATERAL VIEW explode(likes) myTable1 AS myCol1 
LATERAL VIEW explode(address) myTable2 AS myCol2, myCol3;

hive View视图

1、描述

和关系型数据库中的普通视图一样，hive也支持视图

特点：

不支持物化视图

只能查询，不能做加载数据操作

视图的创建，只是保存一份元数据，查询视图时才执行对应的子查询

view定义中若包含了ORDER BY/LIMIT语句，当查询视图时也进行ORDER BY/LIMIT语句操作，view当中定义的优先级更高

view支持迭代视图

2、SQL
-- 语法
create view [if not exists] [db_name.]view_name 
  [(column_name [comment column_comment], ...) ]
  [comment view_comment]
  [tblproperties (property_name = property_value, ...)]
  as select ... ;

-- 创建视图
create view v_psn as select * from psn;

-- 查询视图
select colums from view;

-- 删除视图
drop view v_psn;

Hive 索引

1、目的：优化查询以及检索性能

2、SQL

-- 创建索引
create index t1_index on table psn2(name) 
as 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler' with deferred rebuild 
in table t1_index_table;
as：指定索引器；
in table：指定索引表，若不指定默认生成在default__psn2_t1_index__表中

-- 查询索引
show index on psn2;

-- 重建索引（建立索引之后必须重建索引才能生效）
alter index t1_index on psn2 rebuild;

-- 删除索引
删除索引是会连带删除维护该索引的索引表t1_index_table.
注意：删除索引不要手动删除索引表，这个索引表是系统维护的。如果需要删除索引，只删除相应的索引就可以了，其他的不用管
drop index t1_index on psn2;

Hive 脚本运行方式

1、命令行方式cli：控制台模式
       --hive语法操作
       --与hdfs交互
           例如： hive> dfs -ls /;
       --与linux系统交互,以！开头
           例如：hive> !pwd

2、脚本运行方式（实际生产环境中用最多）
       --hive -e 'select * from psn2' 查询结果会输出到控制台
       --hive -e 'select * from psn2' >aaa 将查询结果输出到aaa文件中
       --hive -S -e 'select * from psn2' > aaa -S表示为静默输出，用的比较少
       --hive -f file（此方法使用的最多）注：hive操作命令语法写到file中，使用-f 来调用file文件，执行其中的命令语句
       --hive -i file 注：跟-f的不用是当执行完后不退出hive控制台，而是留在hive控制台。以上的几个语法均是执行完成之后退出hive控制台
       --source是在hive控制台中获取到linux系统中的文件内容然后执行，file中存放的是hive操作语句
           hive>source /root/file
3、JDBC方式

hiveserver2方式启动hive服务
4、web

GUI接口（hwi、hue等，一般使用hue，hwi根本不用太辣鸡）

Hive SerDe

1、建表
create table logdata (
    host string,
    identity string,
    t_user string,
    time string,
    request string,
    referer string,
    agent string)
  row format serde 'org.apache.hadoop.hive.serde2.RegexSerDe'
  with serdeproperties (
    "input.regex" = "([^ ]*) ([^ ]*) ([^ ]*) \\[(.*)\\] \"(.*)\" (-|[0-9]*) (-|[0-9]*)"
  )
  stored as textfile;
Serializer and Deserializer

SerDe 用于做序列化和反序列化。

构建在数据存储和执行引擎之间，对两者实现解耦。

Hive通过row format delimited以及serde进行内容的读写。

serde 序列化和反序列化（行转换器） 'org.apache.hadoop.hive.serde2.RegexSerDe'正则转换器，
       也可以是其他的转换器，也可以自定义实现serde接口定义自己的转换器
   row format serde 'org.apache.hadoop.hive.serde2.RegexSerDe'
   with serdeproperties (
       "input.regex" = "针对每行进行解析的正则表达式"
   )

加载数据

1、load data方式

当数据被加载至表中时，不会对数据进行任何转换。Load操作只是将数据复制至Hive表对应的位置。数据加载时在表下自动创建一个目录

-- 普通表加载数据
load data local inpath '/root/hivedata/person' into table person;

-- 分区表加载数据（静态分区）
load data local inpath '/root/hivedata/person' into table p_person partition (sex='man',age=10);

2、from insert方式

语法

from from_statement 
insert overwrite table tablename1 [partition (partcol1=val1, partcol2=val2 ...) [if not exists]] select_statement1 
[insert overwrite table tablename2 [partition ... [if not exists]] elect_statement2] 
[insert into table tablename2 [partition ...] select_statement2] ...;

案例

-- 静态分区表
from person 
insert overwrite table p_person partition (sex='woman', age=60) 
select id,name,likes,address;

自定义函数

Hive  自定义函数
Hive的UDF开发只需要重构UDF类的evaluate函数即可。例：

package com.hrj.hive.udf;
import org.apache.hadoop.hive.ql.exec.UDF;
public class helloUDF extends UDF {
    public String evaluate(String str) {
        try {
            return "HelloWorld " + str;
        } catch (Exception e) {
            return null;
        }
    }
} 

Hive  自定义函数调用
将该java文件编译成helloudf.jar
hive> add jar helloudf.jar;
hive> create temporary function helloworld as 'com.hrj.hive.udf.helloUDF';
hive> select helloworld(t.col1) from t limit 10;
hive> drop temporary function helloworld;

注意 
1.helloworld为临时的函数，所以每次进入hive都需要add jar以及create temporary操作
2.UDF只能实现一进一出的操作，如果需要实现多进一出，则需要实现UDAF

hive 参数

命名空间	读写权限	含义
hiveconf	可读写	hive-site.xml当中的各配置变量例：hive --hiveconf hive.cli.print.header=true
system	可读写	系统变量，包含JVM运行参数等例：system:user.name=root
env	只读	环境变量例：env:JAVA_HOME
hivevar	可读写	例：hive -d val=key

1、通过${}方式进行引用，其中system、env下的变量必须以前缀开头

2、hive 参数设置方式

（1）、修改配置文件 ${HIVE_HOME}/conf/hive-site.xml

（2）、启动hive cli时，通过--hiveconf key=value的方式进行设置

例：hive --hiveconf hive.cli.print.header=true

（3）、进入cli之后，通过使用set命令设置

3、hive set命令

在hive CLI控制台可以通过set对hive中的参数进行查询、设置

set设置：

set hive.cli.print.header=true;

set查看：

set hive.cli.print.header

hive参数初始化配置：

当前用户家目录下的.hiverc文件

如: ~/.hiverc

如果没有，可直接创建该文件，将需要设置的参数写到该文件中，hive启动运行时，会加载改文件中的配置。

hive历史操作命令集：

~/.hivehistory

毛豆有毛没豆

发布了92 篇原创文章 · 获赞 3 · 访问量 5104

私信关注

Hive SQL、参数、自定义函数