Hive Learning (3): Hive Partitions

1. Hive partitions (partition)

Partition columns must be declared when the table is defined.
a. Single-partition table:
create table day_table (id int, content string) partitioned by (dt string);
A single-partition table, partitioned by day. The table exposes three columns: id, content, and dt.
On HDFS the data is separated into directories by dt.
b. Double-partition table:
create table day_hour_table (id int, content string) partitioned by (dt string, hour string);
A double-partition table, partitioned by day and hour; dt and hour appear as two additional columns.
Data is first grouped into directories by dt, then into subdirectories by hour.
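
Loading into day_table works the same way as the psn2 load later in this post; a minimal sketch, assuming a hypothetical local file /root/day_data:

load data local inpath '/root/day_data' into table day_table partition (dt='2019-01-10');
-- The file is placed under the table's dt=2019-01-10 directory on HDFS.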
 

2. Create two tables, psn2 and psn3 (psn2 first)

create table psn2 (
id int,
name string,
hobby array<string>,
address map<string,string>
)
partitioned by (age int)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
COLLECTION ITEMS TERMINATED BY '-'
MAP KEYS TERMINATED BY ':'
LINES TERMINATED BY '\n';

3. Load data

load data local inpath '/root/data' into table psn2 partition (age=10);

4. Query psn2

hive> select * from psn2;
OK
1    小明1    ["lol","book","movie"]    {"beijing":"changping","shanghai":"pudong"}    10
2    小明2    ["lol","book","movie"]    {"beijing":"changping","shanghai":"pudong"}    10
3    小明3    ["lol","book","movie"]    {"beijing":"changping","shanghai":"pudong"}    10
4    小明4    ["lol","book","movie"]    {"beijing":"changping","shanghai":"pudong"}    10
5    小明5    ["lol","movie"]    {"beijing":"changping","shanghai":"pudong"}    10
6    小明6    ["lol","book","movie"]    {"beijing":"changping","shanghai":"pudong"}    10
7    小明7    ["lol","book"]    {"beijing":"changping","shanghai":"pudong"}    10
8    小明8    ["lol","book"]    {"beijing":"changping","shanghai":"pudong"}    10
9    小明9    ["lol","book","movie"]    {"beijing":"changping","shanghai":"pudong"}    10
Time taken: 7.93 seconds, Fetched: 9 row(s)
In theory a table can have any number of partitions, but in practice partitioning should be driven by how the data is queried.
For example: partition historical data by day.
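
One payoff of partitioning is partition pruning: filtering on the partition column lets Hive read only the matching directory. A minimal sketch against the psn2 table above:

-- Only the age=10 directory is scanned; other partitions are skipped entirely.
select id, name from psn2 where age = 10;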
 

5. A failing example: psn3

create table psn3 (
id int,
name string,
age int,
hobby array<string>,
address map<string,string>
)
partitioned by (age int)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
COLLECTION ITEMS TERMINATED BY '-'
MAP KEYS TERMINATED BY ':'
LINES TERMINATED BY '\n';
Error: FAILED: SemanticException [Error 10035]: Column repeated in partitioning columns
Cause: a partition column must not also appear in the table's regular column list.
 

6. Create a table with two partition columns

create table psn3 (
id int,
name string,
hobby array<string>,
address map<string,string>
)
partitioned by (age int,sex string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
COLLECTION ITEMS TERMINATED BY '-'
MAP KEYS TERMINATED BY ':'
LINES TERMINATED BY '\n';
 
Note: with two partition columns, a load must specify values for both; specifying only one raises an error (demonstrated in the next section).

7. Load data into the double-partitioned table

hive> load data local inpath '/root/data' into table psn3 partition (age=10);
FAILED: SemanticException [Error 10006]: Line 1:63 Partition not found '10'
hive> load data local inpath '/root/data' into table psn3 partition (age=10,sex='boy');
Loading data to table default.psn3 partition (age=10, sex=boy)
OK
Time taken: 3.115 seconds

8. Drop a partition

hive> alter table psn3 drop partition (sex='boy');
Dropped the partition age=10/sex=boy
OK
Time taken: 0.195 seconds

9. Conclusions:

When adding a partition, the spec must be built on all of the table's partition columns.
When dropping with a partial spec, every existing partition that matches it is dropped.
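
To remove a single partition rather than all matching ones, spell out the full spec; a minimal sketch:

-- Drops only age=10/sex=boy; a partition such as age=20/sex=boy is left intact.
alter table psn3 drop partition (age=10, sex='boy');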
 

10. Add a partition

When adding, the spec must name every partition column (age=10 or age=20, together with sex); otherwise the statement fails.
hive> alter table psn3 add partition(sex='man');
FAILED: ValidationFailureSemanticException default.psn3: partition spec {sex=man} doesn't contain all (2) partition columns
hive> alter table psn3 add partition(age=10,sex='man');
OK
Time taken: 0.418 seconds
Before the drop below, age=10 and age=20 each contain both a boy and a man subdirectory.

11. Perform the drop

hive> alter table psn3 drop partition(sex='man');
Dropped the partition age=10/sex=man
OK
Time taken: 0.389 seconds
Loading data with LOAD DATA is really just uploading a file; each partition corresponds to an HDFS directory.
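
You can list the partition directories Hive knows about directly from the CLI; a minimal sketch (the exact output depends on which partitions were actually created):

show partitions psn3;
-- Each output line, e.g. age=10/sex=boy, maps to one HDFS subdirectory of the table.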
 

12. Insert one table's query results into another table

create table psn4 (
id int,
name string,
hobby array<string>
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
COLLECTION ITEMS TERMINATED BY '-'
LINES TERMINATED BY '\n';
This kind of insert is compiled into a MapReduce job.
from psn3 insert overwrite table psn4 select id,name,hobby;
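
The FROM-first statement above is equivalent to the more common INSERT-first form; a minimal sketch:

insert overwrite table psn4
select id, name, hobby from psn3;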

13. What this operation is useful for:

1. Copying a table.
2. Building intermediate (staging) tables.
 
 


Reposted from www.cnblogs.com/benjamin77/p/10248022.html