Data Types

Primitive Data Types

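Hive data type	Java data type	Size	Example
TINYINT	byte	1-byte signed integer	20
SMALLINT	short	2-byte signed integer	20
INT	int	4-byte signed integer	20
BIGINT	long	8-byte signed integer	20
BOOLEAN	boolean	Boolean, true or false	TRUE, FALSE
FLOAT	float	Single-precision floating point	3.14159
DOUBLE	double	Double-precision floating point	3.14159
STRING	String	Character string; a character set can be specified; single or double quotes may be used	'Good morning', "Good afternoon"
TIMESTAMP		Timestamp (date and time)
BINARY		Byte array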

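Hive's STRING type is comparable to VARCHAR in a relational database: it is a variable-length string. Unlike VARCHAR, however, you cannot declare a maximum number of characters for it; in theory a STRING can hold up to 2 GB of data.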

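Collection Data Types

Data type	Description
STRUCT	Similar to a struct in C: fields are accessed with dot notation. For example, if a column has type STRUCT<first:STRING, last:STRING>, the first field is referenced as column_name.first
MAP	A collection of key-value pairs whose values are accessed with array notation on the key. For example, if a column has type MAP with the key-value pairs 'first'->'John' and 'last'->'Doe', the value 'Doe' is retrieved with column_name['last']
ARRAY	An ordered collection of elements of the same type. Each element has an index, starting from zero. For example, for the array value ['John', 'Doe'], the second element is referenced as column_name[1]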

-- ',' separates columns; '_' separates the items of an ARRAY, MAP, or STRUCT
-- ':' separates a MAP key from its value; '\n' separates rows
create table test(
	name string,
	friends array<string>,
	children map<string, int>,
	address struct<street:string, city:string>
)
row format delimited
fields terminated by ','
collection items terminated by '_'
map keys terminated by ':'
lines terminated by '\n';
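
To make the three access notations concrete, here is a small sketch against the test table above. It assumes a hypothetical local file /opt/test_complex.txt that contains the single row shown in the first comment; the file name and the sample values are invented for illustration.

```sql
-- Hypothetical content of /opt/test_complex.txt (one row):
-- songsong,bingbing_lili,xiao song:18_xiaoxiao song:19,hui long guan_beijing
load data local inpath '/opt/test_complex.txt' into table test;

-- friends[1] is the second array element, children['xiao song'] looks up a map key,
-- and address.city reads a struct field.
-- For the row above the three columns should be: lili, 18, beijing
select
	friends[1] as second_friend,
	children['xiao song'] as child_age,
	address.city as city
from test
where name = 'songsong';
```

Array indexes are zero-based, so friends[0] would return the first friend instead.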

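Type Conversion

Hive's primitive types can be converted implicitly, much as in Java: for example, a TINYINT is automatically converted to INT. Hive does not convert implicitly in the opposite direction, so an INT is not automatically converted to TINYINT.
The implicit conversion rules are as follows:

Any integer type can be implicitly converted to a wider type: TINYINT can be converted to INT, and INT to BIGINT.
All integer types, FLOAT, and STRING can be implicitly converted to DOUBLE.
TINYINT, SMALLINT, and INT can all be converted to FLOAT.
BOOLEAN cannot be converted to any other type.

Use CAST to convert types explicitly. For example, CAST('1' AS INT) converts the string '1' to the integer 1; if the cast fails, as in CAST('X' AS INT), the expression returns NULL.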
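
The rules above can be tried out with a few one-line queries. This is only a sketch: it assumes a Hive version that accepts SELECT without a FROM clause (0.13 or later); on older versions, run the same expressions against any existing table with LIMIT 1.

```sql
-- Implicit widening: the TINYINT operand is promoted to INT before the addition.
select cast(1 as tinyint) + cast(100 as int);

-- A STRING operand in arithmetic is implicitly converted to DOUBLE, so this should return 3.0.
select '1' + 2;

-- Explicit conversion with CAST: the first column is 1, the second is NULL because the cast fails.
select cast('1' as int), cast('X' as int);
```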

DDL (Data Definition Language)

Creating a Database

create database if not exists hive;

Create a database at a specified HDFS location; in this case no *.db directory is created

create database if not exists hive location '/hive';

Querying Databases

List databases

show databases;

Show database information

desc database hive;

Show detailed database information

desc database extended hive;

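Altering and Dropping a Database

ALTER DATABASE can set key-value pairs in a database's DBPROPERTIES to describe the database. The rest of the database metadata, including the database name and the directory where the database is stored, cannot be changed.

alter database hive set dbproperties('createtime'='20200528');

Drop an empty database

drop database if exists hive;

Use cascade to drop a database that is not empty, together with its tables

drop database if exists hive cascade;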

Creating Tables

CREATE TABLE syntax

CREATE [EXTERNAL] TABLE [IF NOT EXISTS] table_name
[(col_name data_type [COMMENT col_comment], ...)]
[COMMENT table_comment]
[PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)]
[CLUSTERED BY (col_name, col_name, ...)
[SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS]
[ROW FORMAT row_format]
[STORED AS file_format]
[LOCATION hdfs_path]
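
As an illustration of how the optional clauses combine, the statement below uses most of them at once. The table page_view, its columns, and its location are hypothetical and serve only to show the clause order.

```sql
-- Hypothetical table that exercises most clauses of the syntax above.
create external table if not exists page_view(
	user_id bigint comment 'visitor id',
	url string comment 'requested page'
)
comment 'example table for the CREATE TABLE syntax'
partitioned by (dt string comment 'visit date')
clustered by (user_id) sorted by (user_id asc) into 4 buckets
row format delimited fields terminated by '\t'
stored as textfile
location '/hive/warehouse/page_view';
```

Note that the partition column dt is declared only in PARTITIONED BY, not in the regular column list.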

Create a table directly

create table if not exists student(
	id int, name string
)
row format delimited fields terminated by '\t'
stored as textfile
location '/hive/warehouse/student';

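Create a table from a query result (the rows returned by the query are loaded into the new table)

create table if not exists student as select id,name from person;

Create a table with the same structure as an existing table

create table if not exists student like person;

Check the table type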

desc formatted student;

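Managed (Internal) Tables and External Tables

    Tables are created as managed tables by default; these are sometimes also called internal tables. When a managed table is dropped, Hive also deletes the data in that table.
    The EXTERNAL keyword creates an external table, and a path to the actual data (LOCATION) is specified when the table is created. When Hive creates a managed table, it moves the data into the path of the data warehouse; when it creates an external table, it only records where the data is and does not move it. When a table is dropped, a managed table loses both its metadata and its data, whereas an external table loses only its metadata and the data is kept.
    Typical scenario: website logs collected each day are periodically written into HDFS text files. Extensive statistical analysis is done on top of an external table (the raw log table), while the intermediate and result tables are stored as managed tables and are populated with SELECT + INSERT.
    Managed and external tables can be converted into each other. Note: ('EXTERNAL'='TRUE') and ('EXTERNAL'='FALSE') must be written in uppercase.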

alter table student set tblproperties('EXTERNAL'='TRUE');
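
A short sketch of the scenario described above and of the reverse conversion. The raw_log table and its location are hypothetical; student is the table used in the earlier examples.

```sql
-- Hypothetical external table over raw log files that already sit in HDFS.
-- Dropping it later removes only the metadata, not the files.
create external table if not exists raw_log(
	line string
)
location '/hive/external/raw_log';

-- The Table Type field of the output shows MANAGED_TABLE or EXTERNAL_TABLE.
desc formatted student;

-- Convert student back into a managed table.
alter table student set tblproperties('EXTERNAL'='FALSE');
```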

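Partitioned Tables

Each partition of a partitioned table corresponds to a separate directory on the HDFS file system, and that directory holds all the data files of the partition. Partitioning in Hive is simply splitting by directory: a large data set is divided into smaller ones according to business needs. A query can then use expressions in its WHERE clause to select only the partitions it needs, which is much more efficient than scanning the whole table. A partitioned table looks like this: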

/hive/warehouse/log_partition/20170702/20170702.log
/hive/warehouse/log_partition/20170703/20170703.log
/hive/warehouse/log_partition/20170704/20170704.log

Create a partitioned table

create table log_partition(
	dname string, loc string
)
partitioned by (month string)
row format delimited fields terminated by '\t';

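Load data into a partition

load data local inpath '/opt/test.txt' into table log_partition partition(month='202005');

Query the data in a partition

select * from log_partition where month='202005';

Add a partition

alter table log_partition add partition(month='202005');

Add multiple partitions at once, separated by spaces

alter table log_partition add partition(month='202005') partition(month='202006');

Drop a partition

alter table log_partition drop partition(month='202005');

Drop multiple partitions at once, separated by commas

alter table log_partition drop partition(month='202005'),partition(month='202006');

List the partitions of a table

show partitions log_partition;

View the structure of a partitioned table

desc formatted log_partition;

Create a table with two-level partitions (this reuses the name log_partition, so drop the single-level table first)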

create table log_partition(
	dname string, loc string
)
partitioned by (month string,day string)
row format delimited fields terminated by '\t';

Load data into a two-level partition

load data local inpath '/opt/test.txt' into table log_partition partition(month='202005',day='28');
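
Querying a two-level partition works the same way as before, with one condition per partition column. A minimal sketch, assuming the load above succeeded:

```sql
-- Restricting both partition columns lets Hive read only the month=202005/day=28 directory.
select * from log_partition where month='202005' and day='28';

-- Confirm which partitions the metastore knows about.
show partitions log_partition;
```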

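Data and Metadata

When Hive runs a query, it first consults the metadata and then uses it to locate the actual data. A query therefore returns data only if two conditions hold: first, the metadata exists; second, the actual data exists. It does not matter which of the two was created first: as long as both are present the data can be queried, and if either one is missing the query returns nothing.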

hadoop fs -mkdir -p /hive/warehouse/test
hadoop fs -put /opt/test.txt /hive/warehouse/test
msck repair table test;
select * from test;
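
The same idea applies to partitions. The sketch below assumes the two-level log_partition table from the previous section and assumes it lives under the default warehouse directory /user/hive/warehouse; adjust the path to your own setup. The dfs command runs HDFS operations from inside the Hive CLI.

```sql
-- Create a partition directory by hand and upload a file into it.
dfs -mkdir -p /user/hive/warehouse/log_partition/month=202006/day=01;
dfs -put /opt/test.txt /user/hive/warehouse/log_partition/month=202006/day=01;

-- The data exists, but the metastore has no record of this partition yet,
-- so this query is expected to return no rows.
select * from log_partition where month='202006' and day='01';

-- Option 1: register the partition explicitly.
alter table log_partition add partition(month='202006', day='01');

-- Option 2: let Hive scan the table directory and add any partitions missing from the metastore.
msck repair table log_partition;
```

Either option is enough on its own; once the partition is registered, the select returns the uploaded rows.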

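Altering Tables

Rename a table

alter table student rename to pupil;

Add a column

alter table pupil add columns(grade int);

Change a column; note that the type must be specified

alter table pupil change column name last_name string;

Replace columns; REPLACE replaces all the columns of the table.

alter table pupil replace columns(id int,name string);

Dropping a Table

drop table pupil;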

DML (Data Manipulation)

Importing Data

Loading data into a table (Load)

Load a local file into the table hive

load data local inpath '/opt/test.txt' into table hive;

Load an HDFS file into the table hive

load data inpath '/test.txt' into table hive;

Load data and overwrite the data already in the table

load data inpath '/test.txt' overwrite into table hive;

Inserting data into a table (Insert)

Basic inserts

insert into table student partition(month='202005') values(1,'abc');

insert into table student partition(month='202005') select id, name from person;

Creating a table and loading data (As Select)

create table if not exists student as select id, name from person;

Specify the data path with Location when creating a table

create table if not exists student(
	id int, name string
)
row format delimited fields terminated by '\t'
location '/hive/warehouse/student';

hadoop fs -put /opt/test.txt /hive/warehouse/student
select * from student;

Import data into a specified Hive table. Note: export the data first and then import it, because exporting a table also writes a metadata file, and import needs that metadata.

export table temptable to '/hive/warehouse/export/student';
import table student partition(month='202005') from '/hive/warehouse/export/student';

Exporting Data

Export query results to the local file system with insert

insert overwrite local directory '/opt/test' select * from student;

Export formatted query results to the local file system

insert overwrite local directory '/opt/test' row format delimited fields terminated by '\t' select * from student;

Export query results to HDFS (without local)

insert overwrite directory '/opt/test' select * from student;

Export to the local file system with a Hadoop command

hadoop fs -get /hive/warehouse/test.txt /opt/test.txt

Export with a Hive shell command

hive -e 'select * from student;' > /opt/student.txt;

Export to HDFS with Export

export table student to '/hive/warehouse/student';

Clearing Table Data (truncate)

Truncate can only clear the data of managed tables; it cannot delete data in external tables

truncate table student;
