hive partition

https://www.cnblogs.com/yongjian/p/6640951.html

Hive's concept of partitioning differs from partitioning in a traditional relational database.

Traditional database partitioning: in Oracle, for example, each partition exists independently as a segment that stores the actual data, and rows are assigned to a partition automatically when they are inserted.

Hive partitioning: because Hive is an abstraction over data stored in HDFS, a partition name corresponds to a directory name and a sub-partition name to a subdirectory name. The partition is not a field stored in the data files.

 

In other words, specifying a partition when inserting data either creates a new directory or subdirectory, or adds a data file to an existing directory.

 

Creating partitions in Hive

 

Hive partitions are defined with the PARTITIONED BY keyword. Note that the columns listed in the PARTITIONED BY clause are real columns of the table, but the data files do not contain them, because their values are carried by the directory names.

 

Static partition

Create a static partition table par_tab with a single partition column

create table par_tab (name string,nation string) partitioned by (sex string) row format delimited fields terminated by ',';

Viewing the table structure with desc now gives the following

hive> desc par_tab;
OK
name                    string                                      
nation                  string                                      
sex                     string                                      
          
# Partition Information          
# col_name                data_type               comment             
          
sex                     string                                      
Time taken: 0.038 seconds, Fetched: 8 row(s)

Prepare a local data file par_tab.txt whose lines contain "name,nation"; gender (sex) will serve as the partition

jan,china
mary,america
lilei,china
heyong,china
yiku,japan
emoji,japan

Load the data into the table (the load operation essentially just places the file in the table's directory on HDFS; with LOCAL, the file is copied there)

load data local inpath '/home/hadoop/files/par_tab.txt' into table par_tab partition (sex='man');

Querying par_tab in Hive now returns three columns; note the added sex column.

hive> select * from par_tab;
OK
jan    china    man
mary    america    man
lilei    china    man
heyong    china    man
yiku    japan    man
emoji    japan    man
Time taken: 0.076 seconds, Fetched: 6 row(s)

View the par_tab directory structure

[hadoop@hadoop001 files]$ hadoop dfs -lsr /user/hive/warehouse/par_tab

drwxr-xr-x   - hadoop supergroup          0 2017-03-29 08:25 /user/hive/warehouse/par_tab/sex=man
-rwxr-xr-x   1 hadoop supergroup         71 2017-03-29 08:25 /user/hive/warehouse/par_tab/sex=man/par_tab.txt

As shown, when the partition is created, Hive creates a directory named after the table under the default warehouse path /user/hive/warehouse/, then a subdirectory sex=man named after the partition, and finally stores the actual data file inside that partition directory.

Now load a second data file, par_tab_wm.txt, with the following contents
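The mapping from a partition specification to a directory can be sketched in a few lines of Python. This is only an illustration of the naming scheme visible in the listing above, not Hive's own code: the function name partition_path is made up here, and the sketch ignores the escaping Hive applies to special characters in partition values.

```python
from pathlib import PurePosixPath


def partition_path(table_dir, spec):
    """Return the directory for a partition spec.

    table_dir: the table's directory under the warehouse.
    spec: ordered (column, value) pairs, in PARTITIONED BY order.
    """
    path = PurePosixPath(table_dir)
    for column, value in spec:
        # Each partition column contributes one "column=value" level.
        path = path / f"{column}={value}"
    return str(path)


print(partition_path("/user/hive/warehouse/par_tab", [("sex", "man")]))
```

Passing more than one pair produces nested levels in the same order, which is why the order of the partition columns matters for multi-partition tables.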

lily,china
nancy,china
hanmeimei,america

Load the data

load data local inpath '/home/hadoop/files/par_tab_wm.txt' into table par_tab partition (sex='woman');

View par_tab table directory structure

[hadoop@hadoop001 files]$ hadoop dfs -lsr /user/hive/warehouse/par_tab
drwxr-xr-x   - hadoop supergroup          0 2017-03-29 08:25 /user/hive/warehouse/par_tab/sex=man
-rwxr-xr-x   1 hadoop supergroup         71 2017-03-29 08:25 /user/hive/warehouse/par_tab/sex=man/par_tab.txt
drwxr-xr-x   - hadoop supergroup          0 2017-03-29 08:35 /user/hive/warehouse/par_tab/sex=woman
-rwxr-xr-x   1 hadoop supergroup         41 2017-03-29 08:35 /user/hive/warehouse/par_tab/sex=woman/par_tab_wm.txt

Query the table again to see the results of both loads, covering both man and woman

hive> select * from par_tab;
OK
jan    china    man
mary    america    man
lilei    china    man
heyong    china    man
yiku    japan    man
emoji    japan    man
lily    china    woman
nancy    china    woman
hanmeimei    america    woman
Time taken: 0.136 seconds, Fetched: 9 row(s)

Because the partition column is a real column of the table, it can be used to query a partition's data

hive> select * from par_tab where sex='woman';
OK
lily    china    woman
nancy    china    woman
hanmeimei    america    woman
Time taken: 0.515 seconds, Fetched: 3 row(s)

 

Next, create a static partition table par_tab_muilt with multiple partition columns (sex + date)

hive> create table par_tab_muilt (name string, nation string) partitioned by (sex string,dt string) row format delimited fields terminated by ',' ;
hive> load data local inpath '/home/hadoop/files/par_tab.txt' into table par_tab_muilt partition (sex='man',dt='2017-03-29');


[hadoop@hadoop001 files]$ hadoop dfs -lsr /user/hive/warehouse/par_tab_muilt
drwxr-xr-x   - hadoop supergroup          0 2017-03-29 08:45 /user/hive/warehouse/par_tab_muilt/sex=man
drwxr-xr-x   - hadoop supergroup          0 2017-03-29 08:45 /user/hive/warehouse/par_tab_muilt/sex=man/dt=2017-03-29
-rwxr-xr-x   1 hadoop supergroup         71 2017-03-29 08:45 /user/hive/warehouse/par_tab_muilt/sex=man/dt=2017-03-29/par_tab.txt

As shown, the order in which the partition columns are declared when the table is created determines the directory hierarchy (which directory is the parent and which is the child). Because of this hierarchy, a query for all man rows returns the data under every date beneath sex=man. If a query filters only on the date partition, and both parent directories sex=man and sex=woman hold data for that date, Hive prunes the input paths so that only the matching date partitions are scanned. The sex partition is not filtered, so the result includes all genders.
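The pruning behaviour described above can be imitated with a small Python sketch. It is a simplified model rather than Hive's planner: the function prune and the three example directories are hypothetical, chosen only to show that a filter on dt keeps matching directories under every sex value.

```python
def prune(partition_dirs, **filters):
    """Keep the partition directories that match every column=value filter."""
    kept = []
    for directory in partition_dirs:
        # Turn "sex=man/dt=2017-03-29" into {"sex": "man", "dt": "2017-03-29"}.
        parts = dict(level.split("=", 1) for level in directory.split("/"))
        if all(parts.get(col) == val for col, val in filters.items()):
            kept.append(directory)
    return kept


# Hypothetical partitions of a table partitioned by (sex, dt).
dirs = [
    "sex=man/dt=2017-03-29",
    "sex=man/dt=2017-03-30",
    "sex=woman/dt=2017-03-29",
]

print(prune(dirs, dt="2017-03-29"))  # date filter only: both genders remain
print(prune(dirs, sex="man"))        # gender filter only: all dates remain
```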

 

 

Dynamic Partitioning

With static partitions as above, you must know the partition value at insert time and write a separate load data statement for every partition, which is tedious. Dynamic partitioning solves this problem: rows are assigned to partitions dynamically, based on the data returned by the query. The real difference between dynamic and static partitioning is that with dynamic partitioning you do not specify the partition directory; the system chooses it.

First, enable the dynamic partitioning feature

hive> set hive.exec.dynamic.partition=true;

Suppose a table par_tab already exists whose first two columns are name and nation and whose last two columns are the partition columns sex and dt, with the following data

hive> select * from par_tab;
OK
lily    china    man    2013-03-28
nancy    china    man    2013-03-28
hanmeimei    america    man    2013-03-28
jan    china    man    2013-03-29
mary    america    man    2013-03-29
lilei    china    man    2013-03-29
heyong    china    man    2013-03-29
yiku    japan    man    2013-03-29
emoji    japan    man    2013-03-29
Time taken: 1.141 seconds, Fetched: 9 row(s)

Now insert the contents of this table directly into another table par_dnm, with sex as a static partition and dt as a dynamic partition (the date is not specified; the system decides the assignment itself)

hive> insert overwrite table par_dnm partition(sex='man',dt)
    > select name, nation, dt from par_tab;

Look at the directory structure after the insert

drwxr-xr-x   - hadoop supergroup          0 2017-03-29 10:32 /user/hive/warehouse/par_dnm/sex=man
drwxr-xr-x   - hadoop supergroup          0 2017-03-29 10:32 /user/hive/warehouse/par_dnm/sex=man/dt=2013-03-28
-rwxr-xr-x   1 hadoop supergroup         41 2017-03-29 10:32 /user/hive/warehouse/par_dnm/sex=man/dt=2013-03-28/000000_0
drwxr-xr-x   - hadoop supergroup          0 2017-03-29 10:32 /user/hive/warehouse/par_dnm/sex=man/dt=2013-03-29
-rwxr-xr-x   1 hadoop supergroup         71 2017-03-29 10:32 /user/hive/warehouse/par_dnm/sex=man/dt=2013-03-29/000000_0

View the partitions

hive> show partitions par_dnm;
OK
sex=man/dt=2013-03-28
sex=man/dt=2013-03-29
Time taken: 0.065 seconds, Fetched: 2 row(s)

The dynamic partitioning succeeded.
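What the insert above does can be approximated in Python. This is a rough model, not Hive's implementation: the function dynamic_insert is hypothetical, and it assumes comma-delimited fields with a trailing newline per row, as in the text files used earlier.

```python
from collections import defaultdict

# Rows as (name, nation, sex, dt), copied from the par_tab listing above.
ROWS = [
    ("lily", "china", "man", "2013-03-28"),
    ("nancy", "china", "man", "2013-03-28"),
    ("hanmeimei", "america", "man", "2013-03-28"),
    ("jan", "china", "man", "2013-03-29"),
    ("mary", "america", "man", "2013-03-29"),
    ("lilei", "china", "man", "2013-03-29"),
    ("heyong", "china", "man", "2013-03-29"),
    ("yiku", "japan", "man", "2013-03-29"),
    ("emoji", "japan", "man", "2013-03-29"),
]


def dynamic_insert(rows, static_sex):
    """Model: partition(sex=<static_sex>, dt) select name, nation, dt."""
    files = defaultdict(list)
    for name, nation, _sex, dt in rows:
        # The last selected column (dt) chooses the partition directory;
        # only name and nation are written into the data file.
        files[f"sex={static_sex}/dt={dt}"].append(f"{name},{nation}\n")
    return {directory: "".join(lines) for directory, lines in files.items()}


parts = dynamic_insert(ROWS, "man")
for directory in sorted(parts):
    print(directory, len(parts[directory]), "bytes")
```

Under those assumptions the two files come out at 41 and 71 bytes, the sizes shown in the directory listing, which is consistent with the partition columns not being stored in the data files.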

 

Note that dynamic partitioning does not allow the primary partition column to be dynamic while a sub-partition column is static, because that would force the statically named sub-partition to be created under every primary partition.

Dynamic partitioning does allow all partition columns to be dynamic, but the parameter hive.exec.dynamic.partition.mode must be set first:

hive> set hive.exec.dynamic.partition.mode;
hive.exec.dynamic.partition.mode=strict

Its default value is strict, which does not allow every partition column to be dynamic. This guards against the case where the user only intends to use dynamic partitioning for the sub-partition but forgets to specify a value for the primary partition column, so that a single DML statement creates a large number of new partitions (and the corresponding new directories) in a short time and hurts system performance.
To make all partition columns dynamic, set:

hive> set hive.exec.dynamic.partition.mode=nonstrict;
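The two rules above (static columns must come before dynamic ones, and strict mode needs at least one static column) can be summarised as a small check. This is a sketch of the rules as described in this post, with a hypothetical function name and error messages; it does not reproduce Hive's actual validation code or wording.

```python
def check_partition_spec(spec, mode="strict"):
    """Validate a partition clause.

    spec: ordered (column, value) pairs; value None marks a dynamic column.
    mode: "strict" or "nonstrict", as in hive.exec.dynamic.partition.mode.
    """
    seen_dynamic = False
    for column, value in spec:
        if value is None:
            seen_dynamic = True
        elif seen_dynamic:
            # e.g. partition(sex, dt='2013-03-29') is rejected.
            raise ValueError(f"static column {column!r} follows a dynamic column")
    if mode == "strict" and spec and all(value is None for _, value in spec):
        raise ValueError("strict mode requires at least one static partition column")
    return True


# partition(sex='man', dt): accepted in strict mode.
print(check_partition_spec([("sex", "man"), ("dt", None)]))
# partition(sex, dt): accepted only in nonstrict mode.
print(check_partition_spec([("sex", None), ("dt", None)], mode="nonstrict"))
```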

 

 

If you find any errors, corrections are welcome.


Origin www.cnblogs.com/zourui4271/p/12198466.html