Table of Contents
0 Partition table
1 Basic operation of partition table
2 Secondary partition
3 Dynamic partition adjustment
0 Partition table
A partition table corresponds to an independent folder on the HDFS file system, and all of the partition's data files live under that folder. A partition in Hive is a subdirectory, which divides a large data set into smaller data sets according to business needs. At query time, the required partitions are selected through expressions in the WHERE clause, which improves query efficiency considerably.
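For intuition, a partitioned table's layout on HDFS looks roughly like the sketch below (assuming the default warehouse path /user/hive/warehouse; the dept_partition table and day values come from the example that follows):
/user/hive/warehouse/dept_partition/day=20200401/dept_20200401.log
/user/hive/warehouse/dept_partition/day=20200402/dept_20200402.log
/user/hive/warehouse/dept_partition/day=20200403/dept_20200403.log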
1 Basic operation of partition table
1 ) Introducing the partition table (logs need to be managed by date; department information is used to simulate them here)
dept_20200401.log
dept_20200402.log
dept_20200403.log
2 ) Create partition table syntax
hive (default)> create table dept_partition(
deptno int, dname string, loc string
)
partitioned by (day string)
row format delimited fields terminated by '\t';
Note: The partition field cannot be a column that already exists in the table; it can be regarded as a pseudo-column of the table.
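As a quick illustration of the pseudo-column, after data is loaded (step 3 below) the partition field can be selected like an ordinary column even though it is not stored in the data files (a sketch using the table above):
hive (default)> select deptno, dname, loc, day from dept_partition;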
3 ) Load data into the partition table
- Data preparation
dept_20200401.log
10 ACCOUNTING 1700
20 RESEARCH 1800
dept_20200402.log
30 SALES 1900
40 OPERATIONS 1700
dept_20200403.log
50 TEST 2000
60 DEV 1900
- Load the data
hive (default)> load data local inpath '/opt/module/hive/datas/dept_20200401.log' into table dept_partition partition(day='20200401');
hive (default)> load data local inpath '/opt/module/hive/datas/dept_20200402.log' into table dept_partition partition(day='20200402');
hive (default)> load data local inpath '/opt/module/hive/datas/dept_20200403.log' into table dept_partition partition(day='20200403');
Note: When loading data into a partition table, the partition must be specified.
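To confirm the layout, the partition directories can be listed from the Hive CLI after loading (a sketch, assuming the table lives under the default warehouse path):
hive (default)> dfs -ls /user/hive/warehouse/dept_partition;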
4 ) Query the data in the partition table
Single partition query
hive (default)> select * from dept_partition where day='20200401';
Multi-partition joint query
hive (default)> select * from dept_partition where day='20200401'
union
select * from dept_partition where day='20200402'
union
select * from dept_partition where day='20200403';
hive (default)> select * from dept_partition where day='20200401' or
day='20200402' or day='20200403';
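Both forms read only the three partition directories thanks to partition pruning. If you want to verify this, explain shows which partitions are scanned (a sketch; the exact plan output varies by Hive version):
hive (default)> explain select * from dept_partition where day='20200401' or day='20200402' or day='20200403';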
5 ) Add partitions
Create a single partition
hive (default)> alter table dept_partition add partition(day='20200404');
Create multiple partitions at the same time
hive (default)> alter table dept_partition add partition(day='20200405') partition(day='20200406');
6 ) Drop partitions
Drop a single partition
hive (default)> alter table dept_partition drop partition (day='20200406');
Drop multiple partitions at the same time
hive (default)> alter table dept_partition drop partition (day='20200404'), partition(day='20200405');
Note: when adding multiple partitions the partition specs are separated by spaces, while when dropping they are separated by commas.
7 ) View how many partitions the partition table has
hive> show partitions dept_partition;
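Sample output (assuming only the three loaded partitions remain at this point):
day=20200401
day=20200402
day=20200403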
8 ) View the partition table structure
hive> desc formatted dept_partition;
# Partition Information
# col_name data_type comment
day string
2 Secondary partition
Thinking: How should the log data within a single day be further split?
1 ) Create a secondary partition table
hive (default)> create table dept_partition2(
deptno int, dname string, loc string
)
partitioned by (day string, hour string)
row format delimited fields terminated by '\t';
2 ) Load data normally
(1) Load data into the secondary partition table
hive (default)> load data local inpath '/opt/module/hive/datas/dept_20200401.log' into table dept_partition2 partition(day='20200401', hour='12');
(2) Query partition data
hive (default)> select * from dept_partition2 where day='20200401' and hour='12';
3 ) Uploading data directly to a partition directory: three ways to associate the data with the partition table
(1) Method 1: Repair the table after uploading the data
Upload the data
hive (default)> dfs -mkdir -p /user/hive/warehouse/mydb.db/dept_partition2/day=20200401/hour=13;
hive (default)> dfs -put /opt/module/datas/dept_20200401.log /user/hive/warehouse/mydb.db/dept_partition2/day=20200401/hour=13;
Query the data (the newly uploaded data cannot be queried yet, because the metastore has no record of this partition)
hive (default)> select * from dept_partition2 where day='20200401' and hour='13';
Execute the repair command (msck repair scans the table's directory on HDFS and adds missing partitions to the metastore)
hive> msck repair table dept_partition2;
Query data again
hive (default)> select * from dept_partition2 where day='20200401' and hour='13';
(2) Method 2: Add the partition after uploading the data
Upload the data
hive (default)> dfs -mkdir -p /user/hive/warehouse/mydb.db/dept_partition2/day=20200401/hour=14;
hive (default)> dfs -put /opt/module/hive/datas/dept_20200401.log /user/hive/warehouse/mydb.db/dept_partition2/day=20200401/hour=14;
Execute alter table to add the partition
hive (default)> alter table dept_partition2 add partition(day='20200401',hour='14');
Query data
hive (default)> select * from dept_partition2 where day='20200401' and hour='14';
(3) Method 3: Load the data into the partition after creating the directory (load itself registers the partition in the metastore, so no repair step is needed)
Create the directory
hive (default)> dfs -mkdir -p /user/hive/warehouse/mydb.db/dept_partition2/day=20200401/hour=15;
Upload the data
hive (default)> load data local inpath '/opt/module/hive/datas/dept_20200401.log' into table dept_partition2 partition(day='20200401',hour='15');
Query data
hive (default)> select * from dept_partition2 where day='20200401' and hour='15';
3 Dynamic partition adjustment
In a relational database, when data is inserted into a partitioned table, the database automatically routes each row to the corresponding partition based on the value of the partition field. Hive provides a similar mechanism, called dynamic partitioning (Dynamic Partition); however, to use Hive's dynamic partitioning, the corresponding parameters must be configured.
1 ) Dynamic partition parameter settings
(1) Enable the dynamic partition function (default true, enabled)
hive.exec.dynamic.partition=true
(2) Set non-strict mode (the dynamic partition mode defaults to strict, meaning at least one partition must be specified as a static partition; nonstrict mode allows all partition fields to use dynamic partitioning.)
hive.exec.dynamic.partition.mode=nonstrict
(3) The maximum total number of dynamic partitions that can be created across all nodes executing MR. Default 1000
hive.exec.max.dynamic.partitions=1000
(4) The maximum number of dynamic partitions that can be created on each node executing MR. This parameter should be set according to the actual data. For example, if the source data contains one year of data, i.e. the day field has 365 distinct values, this parameter must be set greater than 365; with the default value of 100, an error will be reported.
hive.exec.max.dynamic.partitions.pernode=100
(5) The maximum number of HDFS files that can be created in the entire MR Job. Default 100000
hive.exec.max.created.files=100000
(6) Whether to throw an exception when an empty partition is generated. Usually no setting is required. Default false
hive.error.on.empty.partition=false
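As a sketch, these parameters can be set per session from the Hive CLI before running a dynamic-partition insert; the pernode value of 366 below is an assumption for the one-year example in (4), following its "greater than 365" rule:
hive (default)> set hive.exec.dynamic.partition=true;
hive (default)> set hive.exec.dynamic.partition.mode=nonstrict;
hive (default)> set hive.exec.max.dynamic.partitions=1000;
hive (default)> set hive.exec.max.dynamic.partitions.pernode=366;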
2 ) Case practice
Requirement: Insert the data of the dept table into the corresponding partitions of the target table dept_partition_dy according to region (the loc field).
(1) Create the target partition table
hive (default)> create table dept_partition_dy(id int, name string) partitioned by (loc int) row format delimited fields terminated by '\t';
(2) Set non-strict dynamic partition mode and insert the data
hive (default)> set hive.exec.dynamic.partition.mode = nonstrict;
hive (default)> insert into table dept_partition_dy partition(loc) select deptno, dname, loc from dept;
(3) View the partitions of the target partition table
hive (default)> show partitions dept_partition_dy;
Thinking: How does the target partition table match the partition field?
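Hint: Hive maps the trailing columns of the SELECT to the dynamic partition columns by position, not by name. In the insert above, the last select column (loc) fills the partition field loc; renaming it would not change the mapping (a sketch):
hive (default)> insert into table dept_partition_dy partition(loc) select deptno, dname, loc as anything from dept;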