Partition Table
In Hive, each partition of a table corresponds to an independent directory on the HDFS file system, and that directory holds all the data files of the partition. Partitions are thus subdirectories that divide a large data set into smaller ones according to business needs. At query time, a predicate in the WHERE clause selects only the partitions required, which can greatly improve query efficiency because Hive scans only those directories.
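As an illustration, the HDFS layout of a table partitioned by day might look like this (the warehouse path and file names here are hypothetical):

```
/hive/warehouse/user_partition/
├── day=20230312/
│   └── 20230312.log
├── day=20230313/
│   └── 20230313.log
└── day=20230314/
    └── 20230314.log
```

Each `day=...` directory is one partition; a query with `where day='20230312'` reads only the first directory.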
Basic operations on partition tables
Create partition table syntax
Note: the partition field cannot be a column that already exists in the table; it can be regarded as a pseudo-column of the table.
create table user_partition(
no int,
name string
)
partitioned by (day string)
row format delimited fields terminated by '\t';
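Although the partition field day is not stored in the data files, it can be selected and filtered like an ordinary column; a minimal sketch:

```sql
-- day is a pseudo-column materialized from the partition directory name
select day, count(*) from user_partition group by day;
```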
Load data into partition table
Prepare data
20230312.log
20230313.log
20230314.log
Load data
Note: when loading data into a partitioned table, you must specify the partition.
load data local inpath '/data/20230312.log' into table user_partition partition(day='20230312');
load data local inpath '/data/20230313.log' into table user_partition partition(day='20230313');
load data local inpath '/data/20230314.log' into table user_partition partition(day='20230314');
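After loading, a WHERE predicate on the partition column prunes the scan to a single directory; for example:

```sql
-- only the day=20230312 directory is scanned
select * from user_partition where day='20230312';
```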
Add partition
Create a single partition
alter table user_partition add partition(day='20230311');
Create multiple partitions at the same time
alter table user_partition add partition(day='20230309') partition(day='20230310');
Delete partition
Delete a single partition
alter table user_partition drop partition (day='20230309');
Delete multiple partitions at the same time
alter table user_partition drop partition (day='20230311'), partition(day='20230310');
Check how many partitions the partition table has
show partitions user_partition;
View partition table structure
desc formatted user_partition;
Secondary partition
Create a secondary partition table
create table access_log(
id int,
name string,
loc string
)
partitioned by (day string, hour string);
Load data into the secondary partition table
load data local inpath '/data/access_20230312.log' into table access_log partition(day='202303', hour='12');
Query partition data
select * from access_log where day='202303' and hour='12';
Three ways to upload data directly to a partition directory and associate it with the partition table
- Method 1: Upload the data, then repair the table
hive (default)> dfs -mkdir -p /hive/warehouse/op_log.db/access_log/day=202303/hour=12;
hive (default)> dfs -put /data/access_20230312.log /hive/warehouse/op_log.db/access_log/day=202303/hour=12;
Query the data (the newly uploaded data cannot be queried yet, because the metastore has no record of the partition)
select * from access_log where day='202303' and hour='12';
Execute the repair command
msck repair table access_log;
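msck repair scans the table's directory tree and registers any missing partitions in the metastore. On Hive 3.0 and later it can also drop metadata for partitions whose directories no longer exist; a sketch:

```sql
-- add new partition directories and drop stale ones (Hive 3.0+)
msck repair table access_log sync partitions;
```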
Execute the query again; the data is now visible
select * from access_log where day='202303' and hour='12';
- Method 2: Add partitions after uploading data
hive (default)> dfs -mkdir -p
/hive/warehouse/op_log.db/access_log/day=202304/hour=14;
hive (default)> dfs -put /data/access_20230414.log
/hive/warehouse/op_log.db/access_log/day=202304/hour=14;
Execute add partition
hive (default)> alter table access_log add partition(day='202304',hour='14');
Execute the query
select * from access_log where day='202304' and hour='14';
- Method 3: Create the directory, then load data into the partition with load data
Create a directory
hive (default)> dfs -mkdir -p
/hive/warehouse/op_log.db/access_log/day=202303/hour=15;
Load data
hive (default)> load data local inpath '/data/access_20230315.log' into table access_log partition(day='202303',hour='15');
Execute the query
select * from access_log where day='202303' and hour='15';
Dynamic Partition Adjustment
In a relational database, when inserting into a partitioned table, the database automatically routes each row to the corresponding partition according to the value of the partition field. Hive provides a similar mechanism, called dynamic partitioning, but using it requires the corresponding configuration.
Enable dynamic partition parameter settings
Related configuration items
Enable the dynamic partition function (default true, enabled)
hive.exec.dynamic.partition=true
Set the dynamic partition mode to non-strict (the default is strict, which requires at least one partition field to be specified statically; nonstrict allows all partition fields to be dynamic)
hive.exec.dynamic.partition.mode=nonstrict
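Under the default strict mode, at least one leading partition field must be given a static value; nonstrict allows all of them to be dynamic. A sketch against the access_log table above (the src source table is hypothetical):

```sql
-- strict mode is satisfied: day is static, hour is dynamic
insert into table access_log partition(day='202303', hour)
select id, name, loc, hour from src;

-- requires nonstrict mode: both day and hour come from the select
insert into table access_log partition(day, hour)
select id, name, loc, day, hour from src;
```

Dynamic partition values are taken from the trailing columns of the select, in partition-field order.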
The maximum number of dynamic partitions that can be created on all nodes that execute MR. Default 1000
hive.exec.max.dynamic.partitions=1000
On each node that executes MR, the maximum number of dynamic partitions that can be created. This parameter must be set according to the actual data. For example, if the source data contains one year of data, the day field has 365 distinct values, so the parameter must be set greater than 365; with the default value of 100, an error will be reported.
hive.exec.max.dynamic.partitions.pernode=100
The maximum number of HDFS files that can be created in the entire MR Job. Default 100000
hive.exec.max.created.files=100000
Whether to throw an exception when an empty partition is generated. Generally no setting is required. default false
hive.error.on.empty.partition=false
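These parameters can be set per session before running a dynamic-partition insert; the values below are illustrative:

```sql
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.max.dynamic.partitions=1000;
-- e.g. > 365 when partitioning a year of data by day
set hive.exec.max.dynamic.partitions.pernode=400;
set hive.exec.max.created.files=100000;
```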
Case
Requirement: insert the data in the user table into the corresponding partition of the target table person according to the region (loc field).
- Create the target partition table
hive (default)> create table person(id int, name string) partitioned by (loc int) row format delimited fields terminated by '\t';
- Set up dynamic partitioning and insert
set hive.exec.dynamic.partition.mode=nonstrict;
hive (default)> insert into table person partition(loc) select id, name, loc from user;
- View the partitions of the target table
hive (default)> show partitions person;
I hope this article was helpful. Remember to follow, comment, and favorite. Thank you!