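Preface
When a table holds a large amount of data, a query has to read all of it, which is slow and wasteful when only a subset of the rows is needed. Partitioning addresses this case.
Partitions
A partitioned table is created with the PARTITIONED BY clause. A table can have one or more partitions, and each partition is stored as its own directory under the table's directory.
A partition column appears in the table schema like any other column and is listed by the describe command, but its value is not stored in the data files; it only identifies the partition.
Partitioned tables come in two forms. With single-level partitioning, the table directory contains only one level of subdirectories. With multi-level partitioning, the table directory contains nested subdirectories.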
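As a rough sketch of the multi-level case, the statements below declare two partition columns and register one partition. The table name `partition_table_multi`, the `grade` column, and the value `g1` are hypothetical and chosen only for illustration; with a definition like this, Hive would be expected to lay data out under nested directories of the form `grade=.../gender=...`.

```sql
-- Hypothetical two-level partitioned table (names are illustrative only).
create table partition_table_multi(
  id int, name string)
partitioned by (grade string, gender string)
row format delimited fields terminated by ',';

-- Registering a partition requires a value for every partition column;
-- this creates the nested directory grade=g1/gender=N under the table directory.
alter table partition_table_multi add partition (grade='g1', gender='N');
```

The order of the columns in PARTITIONED BY determines the nesting order of the directories, so the column that queries filter on most often is usually placed first.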
Partitioned table walkthrough
Create the table
hive> create table stu(
> id int,name string,gender string,math int,english int)
> row format delimited fields terminated by ',';
OK
Time taken: 0.341 seconds
Inspect the table schema
hive> desc stu;
OK
id int
name string
gender string
math int
english int
Time taken: 0.27 seconds, Fetched: 5 row(s)
Insert data
hive> insert into stu values(1,"shangguan","N",87,78);
Query ID = root_20200416153003_41bf7694-8f28-4016-8b13-2baf8fc1d89a
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1587021132741_0001, Tracking URL = http://hadoop01:8088/proxy/application_1587021132741_0001/
Kill Command = /opt/app/hadoop/bin/hadoop job -kill job_1587021132741_0001
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2020-04-16 15:30:13,909 Stage-1 map = 0%, reduce = 0%
2020-04-16 15:30:21,291 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.5 sec
MapReduce Total cumulative CPU time: 1 seconds 500 msec
Ended Job = job_1587021132741_0001
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to: hdfs://hadoop01:9000/user/hive/warehouse/student.db/stu/.hive-staging_hive_2020-04-16_15-30-03_216_5835972568713550258-1/-ext-10000
Loading data to table student.stu
Table student.stu stats: [numFiles=1, numRows=1, totalSize=20, rawDataSize=19]
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Cumulative CPU: 1.5 sec HDFS Read: 4206 HDFS Write: 87 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 500 msec
OK
Time taken: 19.593 seconds
hive> insert into stu values(1,"guan","m",87,78);
Query ID = root_20200416153050_93f07263-68ce-4fcd-b5e1-b9dc7b9fdb01
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1587021132741_0002, Tracking URL = http://hadoop01:8088/proxy/application_1587021132741_0002/
Kill Command = /opt/app/hadoop/bin/hadoop job -kill job_1587021132741_0002
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2020-04-16 15:30:56,954 Stage-1 map = 0%, reduce = 0%
2020-04-16 15:31:04,368 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.55 sec
MapReduce Total cumulative CPU time: 1 seconds 550 msec
Ended Job = job_1587021132741_0002
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to: hdfs://hadoop01:9000/user/hive/warehouse/student.db/stu/.hive-staging_hive_2020-04-16_15-30-50_562_7187561534696972662-1/-ext-10000
Loading data to table student.stu
Table student.stu stats: [numFiles=2, numRows=2, totalSize=35, rawDataSize=33]
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Cumulative CPU: 1.55 sec HDFS Read: 4296 HDFS Write: 82 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 550 msec
OK
Time taken: 15.121 seconds
Query the table
hive> select * from stu;
OK
1 shangguan N 87 78
1 guan m 87 78
Create the partitioned table
hive> create table partition_table(
> id int,name string)
> partitioned by(gender string)
> row format delimited fields terminated by ',';
OK
Time taken: 0.17 seconds
Inspect the partitioned table schema
hive> desc partition_table
> ;
OK
id int
name string
gender string
# Partition Information
# col_name data_type comment
gender string
Time taken: 0.068 seconds, Fetched: 8 row(s)
Insert rows into the matching partitions
hive> insert into table partition_table partition(gender='N') select id,name from stu where gender='N';
Query ID = root_20200416154126_0bbfae98-f01a-4147-afba-5ab27362f57a
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1587021132741_0003, Tracking URL = http://hadoop01:8088/proxy/application_1587021132741_0003/
Kill Command = /opt/app/hadoop/bin/hadoop job -kill job_1587021132741_0003
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2020-04-16 15:41:32,346 Stage-1 map = 0%, reduce = 0%
2020-04-16 15:41:39,666 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.6 sec
MapReduce Total cumulative CPU time: 1 seconds 600 msec
Ended Job = job_1587021132741_0003
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to: hdfs://hadoop01:9000/user/hive/warehouse/student.db/partition_table/gender=N/.hive-staging_hive_2020-04-16_15-41-26_261_148002425019543036-1/-ext-10000
Loading data to table student.partition_table partition (gender=N)
Partition student.partition_table{gender=N} stats: [numFiles=1, numRows=1, totalSize=7, rawDataSize=6]
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Cumulative CPU: 1.6 sec HDFS Read: 4110 HDFS Write: 95 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 600 msec
OK
Time taken: 14.945 seconds
hive> insert into table partition_table partition(gender='m') select id,name from stu where gender='m';
Query ID = root_20200416154229_667496ac-4947-44b7-8785-16e2c25c37cc
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1587021132741_0004, Tracking URL = http://hadoop01:8088/proxy/application_1587021132741_0004/
Kill Command = /opt/app/hadoop/bin/hadoop job -kill job_1587021132741_0004
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2020-04-16 15:42:36,048 Stage-1 map = 0%, reduce = 0%
2020-04-16 15:42:43,461 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.63 sec
MapReduce Total cumulative CPU time: 1 seconds 630 msec
Ended Job = job_1587021132741_0004
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to: hdfs://hadoop01:9000/user/hive/warehouse/student.db/partition_table/gender=m/.hive-staging_hive_2020-04-16_15-42-29_979_9199116166636330816-1/-ext-10000
Loading data to table student.partition_table partition (gender=m)
Partition student.partition_table{gender=m} stats: [numFiles=1, numRows=1, totalSize=7, rawDataSize=6]
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Cumulative CPU: 1.63 sec HDFS Read: 4190 HDFS Write: 95 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 630 msec
OK
Time taken: 14.795 seconds
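The two inserts above name each partition value by hand. As an alternative sketch, Hive's dynamic partitioning can derive the partition from the last column of the select list. This is not what the walkthrough ran, and the settings shown assume a Hive setup where dynamic partitioning and nonstrict mode have to be enabled explicitly.

```sql
-- Allow Hive to create partitions from query results.
set hive.exec.dynamic.partition=true;
-- nonstrict mode permits every partition column to be dynamic.
set hive.exec.dynamic.partition.mode=nonstrict;

-- The partition column (gender) must come last in the select list.
-- Note: insert into appends, so running this after the static inserts
-- above would add the same rows a second time.
insert into table partition_table partition(gender)
select id, name, gender from stu;
```

Static inserts keep the target partition explicit, which suits a small, fixed set of values. The dynamic form is more convenient when the source table contains many distinct values.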
Inspect HDFS
[root@hadoop01 ~]# hdfs dfs -ls /user/hive/warehouse/student.db/stu
Found 2 items
-rwxrwxr-x 2 root supergroup 20 2020-04-16 15:30 /user/hive/warehouse/student.db/stu/000000_0
-rwxrwxr-x 2 root supergroup 15 2020-04-16 15:31 /user/hive/warehouse/student.db/stu/000000_0_copy_1
[root@hadoop01 ~]# hdfs dfs -text /user/hive/warehouse/student.db/stu/000000_0
1,shangguan,N,87,78
[root@hadoop01 ~]# hdfs dfs -text /user/hive/warehouse/student.db/stu/000000_0_copy_1
1,guan,m,87,78
[root@hadoop01 ~]#
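The listing above covers the unpartitioned stu table. To check the partitioned table in the same way, something like the following could be run from the Hive CLI. The warehouse path is taken from the "Moving data to" lines in the insert logs, the statements assume the current database is student, and the dfs command assumes the Hive CLI (other clients may restrict it).

```sql
-- List the partitions registered in the metastore for this table.
show partitions partition_table;

-- Filtering on the partition column lets Hive read only the
-- gender=N directory instead of the whole table.
select id, name, gender from partition_table where gender='N';

-- Each partition should appear as a subdirectory of the table directory.
dfs -ls /user/hive/warehouse/student.db/partition_table;
```

The select returns gender as a column even though the data files in each partition directory hold only id and name; Hive fills in the value from the directory name.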