Hive partition tables in practice

Table of Contents

0 Partition tables

1 Basic operations on partition tables

2 Secondary partitions

3 Dynamic partitioning


0 Partition tables

A partition table corresponds to a separate directory on the HDFS file system, and all of the partition's data files sit under that directory. A partition in Hive is effectively a sub-directory: it splits a large data set into smaller data sets according to business needs. At query time, the required partitions are selected through expressions in the WHERE clause, which improves query efficiency considerably.
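For example, once the dept_partition table defined below has been loaded, each partition appears as one sub-directory of the table's directory, holding the files loaded into it. A rough sketch of the expected layout (assuming the table lives in the mydb database under the default warehouse path, as in the upload examples later in this article):

/user/hive/warehouse/mydb.db/dept_partition/day=20200401/dept_20200401.log
/user/hive/warehouse/mydb.db/dept_partition/day=20200402/dept_20200402.log
/user/hive/warehouse/mydb.db/dept_partition/day=20200403/dept_20200403.log

The layout can be checked from the Hive CLI with: dfs -ls /user/hive/warehouse/mydb.db/dept_partition;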

1 Basic operations on partition tables

1 ) A scenario for partition tables (logs need to be managed by date; department data is used here to simulate them)

dept_20200401.log

dept_20200402.log

dept_20200403.log

2 ) Syntax for creating a partition table

hive (default)> create table dept_partition(
                deptno int, dname string, loc string
                )
                partitioned by (day string)
                row format delimited fields terminated by '\t';

Note: The partition field cannot be a column that already exists in the table; it can be regarded as a pseudo column of the table.
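Because it is a pseudo column, the partition field can be selected and filtered just like an ordinary column even though it is not stored inside the data files. An illustrative query (runnable once the table has been loaded):

hive (default)> select deptno, dname, loc, day from dept_partition;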

3 ) Load data into the partition table

  • Data preparation

dept_20200401.log

10  ACCOUNTING  1700

20  RESEARCH    1800

dept_20200402.log

30  SALES   1900

40  OPERATIONS  1700

dept_20200403.log

50  TEST    2000

60  DEV 1900

  • Load the data
hive (default)> load data local inpath '/opt/module/hive/datas/dept_20200401.log' into table dept_partition partition(day='20200401');

hive (default)> load data local inpath '/opt/module/hive/datas/dept_20200402.log' into table dept_partition partition(day='20200402');

hive (default)> load data local inpath '/opt/module/hive/datas/dept_20200403.log' into table dept_partition partition(day='20200403');

Note: When loading data into a partition table, you must specify the partition.
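Also note that load data ... into table appends to a partition that already exists; to replace the partition's contents instead, the overwrite keyword can be added, as in this illustrative variant of the commands above:

hive (default)> load data local inpath '/opt/module/hive/datas/dept_20200401.log' overwrite into table dept_partition partition(day='20200401');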

4 ) Query the data in the partition table

Single partition query

hive (default)> select * from dept_partition where day='20200401';

Multi-partition joint query

hive (default)> select * from dept_partition where day='20200401'

              union

              select * from dept_partition where day='20200402'

              union

              select * from dept_partition where day='20200403';

hive (default)> select * from dept_partition where day='20200401' or

                day='20200402' or day='20200403'; 
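An equivalent, more compact form lists the partition values with IN; Hive still prunes the scan to just those three partitions (illustrative):

hive (default)> select * from dept_partition where day in ('20200401','20200402','20200403');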

5 ) Add partitions

Add a single partition

hive (default)> alter table dept_partition add partition(day='20200404');

Add multiple partitions at once

hive (default)> alter table dept_partition add partition(day='20200405') partition(day='20200406');

6 ) Drop partitions

Drop a single partition

hive (default)> alter table dept_partition drop partition (day='20200406');

Drop multiple partitions at once

hive (default)> alter table dept_partition drop partition (day='20200404'), partition(day='20200405');

7 ) View the partitions of a partition table

hive> show partitions dept_partition;

8 ) View the partition table structure

hive> desc formatted dept_partition;

# Partition Information         

# col_name              data_type               comment            

day                     string    

2 Secondary partitions

Question: what if the log data within a single day also needs to be split further, for example by hour?

1 ) Create a secondary partition table

hive (default)> create table dept_partition2(

               deptno int, dname string, loc string

               )

               partitioned by (day string, hour string)

               row format delimited fields terminated by '\t';

 

2 ) Loading data the normal way

(1) Load data into the secondary partition table

hive (default)> load data local inpath '/opt/module/hive/datas/dept_20200401.log' into table

dept_partition2 partition(day='20200401', hour='12');

(2) Query partition data

hive (default)> select * from dept_partition2 where day='20200401' and hour='12';
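Filtering on only the first-level partition column also works; Hive then reads every hour sub-directory of that day (illustrative):

hive (default)> select * from dept_partition2 where day='20200401';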

3 ) Three ways to associate data that is uploaded directly to a partition directory with the partition table

(1) Method 1: Repair after uploading data

upload data

hive (default)> dfs -mkdir -p

 /user/hive/warehouse/mydb.db/dept_partition2/day=20200401/hour=13;

hive (default)> dfs -put /opt/module/datas/dept_20200401.log  /user/hive/warehouse/mydb.db/dept_partition2/day=20200401/hour=13;

Query the data (the data just uploaded cannot be queried yet)

hive (default)> select * from dept_partition2 where day='20200401' and hour='13';

Execute repair command

hive> msck repair table dept_partition2;

Query data again

hive (default)> select * from dept_partition2 where day='20200401' and hour='13';

(2) Method 2: Add partition after uploading data

upload data

hive (default)> dfs -mkdir -p

 /user/hive/warehouse/mydb.db/dept_partition2/day=20200401/hour=14;

hive (default)> dfs -put /opt/module/hive/datas/dept_20200401.log  /user/hive/warehouse/mydb.db/dept_partition2/day=20200401/hour=14;

Execute add partition

  

hive (default)> alter table dept_partition2 add partition(day='20200401',hour='14');

Query data

hive (default)> select * from dept_partition2 where day='20200401' and hour='14';
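As a variation of Method 2, the partition can be added and pointed at its directory in one statement, which also works when the data sits outside the default partition location (a sketch reusing the path above):

hive (default)> alter table dept_partition2 add partition(day='20200401', hour='14') location '/user/hive/warehouse/mydb.db/dept_partition2/day=20200401/hour=14';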

(3) Method 3: Load the data to the partition after creating the folder

Create a directory

hive (default)> dfs -mkdir -p

 /user/hive/warehouse/mydb.db/dept_partition2/day=20200401/hour=15;

upload data

hive (default)> load data local inpath '/opt/module/hive/datas/dept_20200401.log' into table

 dept_partition2 partition(day='20200401',hour='15');

Query data

hive (default)> select * from dept_partition2 where day='20200401' and hour='15';

3 Dynamic partitioning

In a relational database, when inserting into a partitioned table, the database automatically routes each row to the corresponding partition based on the value of the partition field. Hive provides a similar mechanism, called dynamic partitioning, but it must be configured before it can be used.

1 ) Enable the dynamic partition parameters (a consolidated session-level example follows this list)

(1) Enable the dynamic partition feature (default true, enabled)

hive.exec.dynamic.partition=true

(2) Set non-strict mode (the dynamic partition mode defaults to strict, which requires at least one partition column to be specified as a static partition; nonstrict allows all partition columns to be dynamic.)

hive.exec.dynamic.partition.mode=nonstrict

(3) The maximum number of dynamic partitions that can be created across all nodes executing MR. Default: 1000.

hive.exec.max.dynamic.partitions=1000

(4) The maximum number of dynamic partitions that can be created on each node executing MR. This parameter must be set according to the actual data. For example, if the source data contains a full year, i.e. the day field has 365 distinct values, this parameter must be set to a value greater than 365; with the default of 100 the job will fail.

hive.exec.max.dynamic.partitions.pernode=100

(5) The maximum number of HDFS files that can be created by the entire MR job. Default: 100000.

hive.exec.max.created.files=100000

(6) Whether to throw an exception when an empty partition is generated. Usually does not need to be set. Default: false.

hive.error.on.empty.partition=false
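For reference, these parameters are usually set at the session level before running the insert. A sketch with illustrative values (adjust them to the actual data volume):

hive (default)> set hive.exec.dynamic.partition=true;
hive (default)> set hive.exec.dynamic.partition.mode=nonstrict;
hive (default)> set hive.exec.max.dynamic.partitions=1000;
hive (default)> set hive.exec.max.dynamic.partitions.pernode=400;
hive (default)> set hive.exec.max.created.files=100000;
hive (default)> set hive.error.on.empty.partition=false;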

2 ) A practical case

Requirement: insert the data from the dept table into the corresponding partitions of the target table dept_partition_dy according to the region (loc field).

(1) Create the target partition table

hive (default)> create table dept_partition_dy(id int, name string) partitioned by (loc int) row format delimited fields terminated by '\t';

(2) Enable dynamic partitioning and insert the data

set hive.exec.dynamic.partition.mode = nonstrict;

hive (default)> insert into table dept_partition_dy partition(loc) select deptno, dname, loc from dept;

(3) View the partition situation of the target partition table

hive (default)> show partitions dept_partition_dy;

Question: how does the target partition table know which SELECT column supplies the partition value?
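Answer: Hive matches dynamic partition columns by position, not by name. The trailing column(s) of the SELECT list supply the values for the column(s) named in the PARTITION clause, in order; any alias is ignored. An illustrative restatement of the insert above, emphasizing that only the position of loc matters:

hive (default)> insert into table dept_partition_dy partition(loc) select deptno, dname, loc as anything from dept;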


Source: blog.csdn.net/godlovedaniel/article/details/110775274