Table of Contents
0 Partition table
1 Basic operation of partition table
2 Secondary partition
3 Dynamic partition adjustment
0 Partition table
A partition table corresponds to an independent folder on the HDFS file system, and all of the partition's data files live under that folder. A partition in Hive is a subdirectory, which divides a large data set into smaller data sets according to business needs. At query time, the required partitions are selected through expressions in the WHERE clause, which improves query efficiency considerably.
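For intuition, a partitioned table's layout on HDFS looks roughly like the sketch below (assuming the default warehouse path /user/hive/warehouse; the dept_partition table and day values come from the example that follows):
/user/hive/warehouse/dept_partition/day=20200401/dept_20200401.log
/user/hive/warehouse/dept_partition/day=20200402/dept_20200402.log
/user/hive/warehouse/dept_partition/day=20200403/dept_20200403.log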
1 Basic operation of partition table
1 ) Introducing the partition table (logs need to be managed by date; department information is used to simulate them here)
dept_20200401.log
dept_20200402.log
dept_20200403.log
2 ) Create partition table syntax
hive (default)> create table dept_partition(
deptno int, dname string, loc string
)
partitioned by (day string)
row format delimited fields terminated by '\t';
Note: The partition field cannot be a column that already exists in the table; it can be regarded as a pseudo-column of the table.
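As a quick illustration of the pseudo-column, after data is loaded (step 3 below) the partition field can be selected like an ordinary column even though it is not stored in the data files (a sketch using the table above):
hive (default)> select deptno, dname, loc, day from dept_partition;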
3 ) Load data into the partition table
- Data preparation
dept_20200401.log
10 ACCOUNTING 1700
20 RESEARCH 1800
dept_20200402.log
30 SALES 1900
40 OPERATIONS 1700
dept_20200403.log
50 TEST 2000
60 DEV 1900
- Load the data
hive (default)> load data local inpath '/opt/module/hive/datas/dept_20200401.log' into table dept_partition partition(day='20200401');
hive (default)> load data local inpath '/opt/module/hive/datas/dept_20200402.log' into table dept_partition partition(day='20200402');
hive (default)> load data local inpath '/opt/module/hive/datas/dept_20200403.log' into table dept_partition partition(day='20200403');
Note: When loading data into a partition table, the partition must be specified.
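To confirm the layout, the partition directories can be listed from the Hive CLI after loading (a sketch, assuming the table lives under the default warehouse path):
hive (default)> dfs -ls /user/hive/warehouse/dept_partition;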
4 ) Query the data in the partition table
Single partition query
hive (default)> select * from dept_partition where day='20200401';
Multi-partition joint query
hive (default)> select * from dept_partition where day='20200401'
union
select * from dept_partition where day='20200402'
union
select * from dept_partition where day='20200403';
hive (default)> select * from dept_partition where day='20200401' or
day='20200402' or day='20200403';
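Both forms read only the three partition directories thanks to partition pruning. If you want to verify this, explain shows which partitions are scanned (a sketch; the exact plan output varies by Hive version):
hive (default)> explain select * from dept_partition where day='20200401' or day='20200402' or day='20200403';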
5 ) Add partitions
Create a single partition
hive (default)> alter table dept_partition add partition(day='20200404');
Create multiple partitions at the same time
hive (default)> alter table dept_partition add partition(day='20200405') partition(day='20200406');
6 ) Drop partitions
Drop a single partition
hive (default)> alter table dept_partition drop partition (day='20200406');
Drop multiple partitions at the same time
hive (default)> alter table dept_partition drop partition (day='20200404'), partition(day='20200405');
Note: when adding multiple partitions the partition specs are separated by spaces, while when dropping they are separated by commas.
7 ) View how many partitions the partition table has
hive> show partitions dept_partition;
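Sample output (assuming only the three loaded partitions remain at this point):
day=20200401
day=20200402
day=20200403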
8 ) View the partition table structure
hive> desc formatted dept_partition;
# Partition Information
# col_name data_type comment
day string
2 Secondary partition
Thinking: How should the log data within a single day be further split?
1 ) Create a secondary partition table
hive (default)> create table dept_partition2(
deptno int, dname string, loc string
)
partitioned by (day string, hour string)
row format delimited fields terminated by '\t';
2 ) Load data normally
(1) Load data into the secondary partition table
hive (default)> load data local inpath '/opt/module/hive/datas/dept_20200401.log' into table dept_partition2 partition(day='20200401', hour='12');
(2) Query partition data
hive (default)> select * from dept_partition2 where day='20200401' and hour='12';
3 ) Uploading data directly to a partition directory: three ways to associate the data with the partition table
(1) Method 1: Repair the table after uploading the data
Upload the data
hive (default)> dfs -mkdir -p /user/hive/warehouse/mydb.db/dept_partition2/day=20200401/hour=13;
hive (default)> dfs -put /opt/module/datas/dept_20200401.log /user/hive/warehouse/mydb.db/dept_partition2/day=20200401/hour=13;
Query the data (the newly uploaded data cannot be queried yet, because the metastore has no record of this partition)
hive (default)> select * from dept_partition2 where day='20200401' and hour='13';
Execute the repair command (msck repair scans the table's directory on HDFS and adds missing partitions to the metastore)
hive> msck repair table dept_partition2;
Query data again
hive (default)> select * from dept_partition2 where day='20200401' and hour='13';
(2) Method 2: Add the partition after uploading the data
Upload the data
hive (default)> dfs -mkdir -p /user/hive/warehouse/mydb.db/dept_partition2/day=20200401/hour=14;
hive (default)> dfs -put /opt/module/hive/datas/dept_20200401.log /user/hive/warehouse/mydb.db/dept_partition2/day=20200401/hour=14;
Execute alter table to add the partition
hive (default)> alter table dept_partition2 add partition(day='20200401',hour='14');
Query data
hive (default)> select * from dept_partition2 where day='20200401' and hour='14';
(3) Method 3: Load the data into the partition after creating the directory (load itself registers the partition in the metastore, so no repair step is needed)
Create the directory
hive (default)> dfs -mkdir -p /user/hive/warehouse/mydb.db/dept_partition2/day=20200401/hour=15;
Upload the data
hive (default)> load data local inpath '/opt/module/hive/datas/dept_20200401.log' into table dept_partition2 partition(day='20200401',hour='15');
Query data
hive (default)> select * from dept_partition2 where day='20200401' and hour='15';
3 Dynamic partition adjustment
In a relational database, when data is inserted into a partitioned table, the database automatically routes each row to the corresponding partition based on the value of the partition field. Hive provides a similar mechanism, called dynamic partitioning (Dynamic Partition); however, to use Hive's dynamic partitioning, the corresponding parameters must be configured.
1 ) Dynamic partition parameter settings
(1) Enable the dynamic partition function (default true, enabled)
hive.exec.dynamic.partition=true
(2) Set non-strict mode (the dynamic partition mode defaults to strict, meaning at least one partition must be specified as a static partition; nonstrict mode allows all partition fields to use dynamic partitioning.)
hive.exec.dynamic.partition.mode=nonstrict
(3) The maximum total number of dynamic partitions that can be created across all nodes executing MR. Default 1000
hive.exec.max.dynamic.partitions=1000
(4) The maximum number of dynamic partitions that can be created on each node executing MR. This parameter should be set according to the actual data. For example, if the source data contains one year of data, i.e. the day field has 365 distinct values, this parameter must be set greater than 365; with the default value of 100, an error will be reported.
hive.exec.max.dynamic.partitions.pernode=100
(5) The maximum number of HDFS files that can be created in the entire MR Job. Default 100000
hive.exec.max.created.files=100000
(6) Whether to throw an exception when an empty partition is generated. Usually no setting is required. Default false
hive.error.on.empty.partition=false
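As a sketch, these parameters can be set per session from the Hive CLI before running a dynamic-partition insert; the pernode value of 366 below is an assumption for the one-year example in (4), following its "greater than 365" rule:
hive (default)> set hive.exec.dynamic.partition=true;
hive (default)> set hive.exec.dynamic.partition.mode=nonstrict;
hive (default)> set hive.exec.max.dynamic.partitions=1000;
hive (default)> set hive.exec.max.dynamic.partitions.pernode=366;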
2 ) Case practice
Requirement: Insert the data of the dept table into the corresponding partitions of the target table dept_partition_dy according to region (the loc field).
(1) Create the target partition table
hive (default)> create table dept_partition_dy(id int, name string) partitioned by (loc int) row format delimited fields terminated by '\t';
(2) Set non-strict dynamic partition mode and insert the data
hive (default)> set hive.exec.dynamic.partition.mode = nonstrict;
hive (default)> insert into table dept_partition_dy partition(loc) select deptno, dname, loc from dept;
(3) View the partitions of the target partition table
hive (default)> show partitions dept_partition_dy;
Thinking: How does the target partition table match the partition field?
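Hint: Hive maps the trailing columns of the SELECT to the dynamic partition columns by position, not by name. In the insert above, the last select column (loc) fills the partition field loc; renaming it would not change the mapping (a sketch):
hive (default)> insert into table dept_partition_dy partition(loc) select deptno, dname, loc as anything from dept;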