[Hive big data] Detailed explanation of the use of Hive partition table and bucket table

Table of contents

1. The background of the partition concept

2. Characteristics of the partition table

3. Partition table types

3.1 Single Partition

3.2 Multi-partition

4. Dynamic and static partitions

4.1 Static partition [static loading]

4.1.1 Operation Demonstration

4.2 Multiple Partitions

4.2.1 Operation Demonstration

4.3 Dynamic loading of partition data

4.3.1 Partition table data loading -- dynamic partition

4.3.2 Operation Demonstration

5. Bucket table

5.1 Bucket table concept

5.2 Description of bucketing rules

5.2.1 Basic rules of bucketing

5.3 Bucket complete syntax tree

5.4 Operation demonstration of bucket table

5.4.1 Create table

5.4.2 Benefits of using bucket table


1. The background of the partition concept

When using Hive to query a table, for example select * from t_user where name = 'lihua', Hive generally scans the entire table when executing this sql. We know that a full table scan is very inefficient, especially when the final data has to be read from files on HDFS.

In fact, in many cases the query does not need a full table scan at all: the business can predict in advance that the data it wants lies in a certain region of the table. Based on this premise, Hive introduces the concept of partitions. The position of the partition clause in the table creation syntax tree is as follows:
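The relevant excerpt from the CREATE TABLE syntax tree (the full syntax is listed in section 5.3):

CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name
[(col_name data_type [COMMENT col_comment], ...)]
[PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)]
...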

2. Characteristics of the partition table

A partitioned table is one whose partition space is specified when the table is created. To create a partitioned table, use the optional partitioned by clause in the create statement;

  • A table can have one or more partitions, and each partition exists in the form of a folder under the directory of the table folder;

  • Table and column names are case insensitive;

  • Partition fields appear in the table structure like ordinary fields. You can see them with the desc formatted command, but they do not store actual data content; they only represent the partition;

3. Partition table types

Partition tables are divided into single-partition tables and multi-partition tables according to the number of partition fields when the table is created, as follows:

3.1 Single Partition

A single-partition table has only one level of subdirectories under the table directory. When creating the table, PARTITIONED BY contains only one field; the following example creates a single partition by province:

create table t_user_province (
    id int, 
    name string,
    age int
) partitioned by (province string);
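As a quick check, describing the table shows the partition field listed separately from the data columns; a sketch of the output (the exact layout varies by Hive version):

describe t_user_province;
-- id        int
-- name      string
-- age       int
-- province  string
-- # Partition Information
-- province  string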

3.2 Multi-partition

With a multi-partition table, nested subfolders appear under the table directory. When creating the table, you can specify multiple partition fields according to business needs. The following is a three-partition table, partitioned by province, city, and county:

create table t_user_province_city_county (
    id int, 
    name string,
    age int
) partitioned by (province string, city string,county string);
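On HDFS this produces nested directories, one level per partition field; an illustrative layout (the warehouse path and partition values are made-up examples):

/user/hive/warehouse/t_user_province_city_county/
    province=zhejiang/
        city=hangzhou/
            county=xihu/        <- data files for this partition live here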

4. Dynamic and static partitions

4.1 Static partition [static loading]

With a static partition, the partition value is specified manually by the user when loading the data. The syntax is:

load data [local] inpath 'filepath' into table tablename partition(partition field='partition value'...);

Note:

The local keyword specifies whether the file to be loaded is on the local file system or on HDFS;
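For illustration, the two forms side by side (the paths, table name, and partition values are placeholders; loading a local file copies it into the table directory, while loading a file already on HDFS moves it there):

-- file on the local file system (copied into the table's directory)
load data local inpath '/local/path/data.txt' into table tablename partition(part_field='part_value');
-- file already on HDFS (moved into the table's directory)
load data inpath '/hdfs/path/data.txt' into table tablename partition(part_field='part_value');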

4.1.1 Operation Demonstration

Create the partition table

Create a partition table t_all_hero_part with role as the partition field;

create table t_all_hero_part(
   id int,
   name string,
   hp_max int,
   mp_max int,
   attack_max int,
   defense_max int,
   attack_range string,
   role_main string,
   role_assist string
) partitioned by (role string)
row format delimited
fields terminated by "\t";

Execute the above sql to create the partition table;

Describing the table shows that it has an extra partition field, role;

Upload local data to the server

Upload the local test data files (archer.txt, assassin.txt, mage.txt, support.txt, tank.txt, warrior.txt) to the specified directory;

Load data into Hive

Use the following commands to load the local data files into the Hive table:

load data local inpath '/usr/local/soft/hivedata/archer.txt' into table t_all_hero_part partition(role='sheshou');
load data local inpath '/usr/local/soft/hivedata/assassin.txt' into table t_all_hero_part partition(role='cike');
load data local inpath '/usr/local/soft/hivedata/mage.txt' into table t_all_hero_part partition(role='fashi');
load data local inpath '/usr/local/soft/hivedata/support.txt' into table t_all_hero_part partition(role='fuzhu');
load data local inpath '/usr/local/soft/hivedata/tank.txt' into table t_all_hero_part partition(role='tanke');
load data local inpath '/usr/local/soft/hivedata/warrior.txt' into table t_all_hero_part partition(role='zhanshi');

Execution result

Check the data: it has been successfully mapped into the partition table; note that the last column is the partition field;

At the same time, the data in the HDFS directory shows the following structure
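An illustrative listing (the test.db warehouse path matches the one used later in this article):

/user/hive/warehouse/test.db/t_all_hero_part/role=sheshou/archer.txt
/user/hive/warehouse/test.db/t_all_hero_part/role=cike/assassin.txt
/user/hive/warehouse/test.db/t_all_hero_part/role=fashi/mage.txt
...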

The data storage of the static partition table is very regular. The outer layer uses the partition name as the directory, and the inner layer is the specific data file of the current partition;

The difference from an ordinary non-partitioned table is clear at a glance, so there is no need to go into detail. From this intuitive view, the concept of a partitioned table can be summarized as follows:

  • The concept of partitioning provides a way to separate Hive table data into multiple files/directories;

  • Different partitions correspond to different folders, and the data of the same partition is stored in the same folder;

  • When querying with a filter, you only need to find the folder corresponding to the partition value and scan the files inside it, avoiding a full table scan;

  • This way of specifying partition queries is called partition pruning;

Now a query like the following first locates the partition and then reads only the data files under that partition; the whole table does not need to be scanned, and efficiency improves greatly:

select * from t_all_hero_part where role="sheshou" and hp_max >6000;

4.2 Multiple Partitions

It can be found from the relevant syntax about partitions in the table creation statement that Hive supports multiple partition fields:

PARTITIONED BY (partition1 data_type, partition2 data_type, ...)

With multiple partitions there is a hierarchical relationship between the partition fields: each level further partitions the previous one. From the HDFS point of view, subfolders are nested inside folders. For example, national population data can be partitioned first by province, then by city, and, if needed, further by district or county, at which point it is a three-partition table.

4.2.1 Operation Demonstration

Create two partition tables

Create a two-partition table, partitioned by province and city:

create table t_user_province_city (id int, name string,age int) partitioned by (province string, city string);

Create a three-partition table, partitioned by province, city, and county:

create table t_user_province_city_county (id int, name string,age int) partitioned by (province string, city string,county string);
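Static loading into a multi-partition table then simply supplies a value for every partition field; an illustrative example (the file path and partition values are made up):

load data local inpath '/usr/local/soft/hivedata/user.txt' into table t_user_province_city_county partition(province='zhejiang', city='hangzhou', county='xihu');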

4.3 Dynamic loading of partition data

The above approach of specifying partition values manually with load is called static partitioning (static loading). In practice, if many partitions are created, the load command has to be run many times, which is bound to be inefficient. In that case, consider Hive's dynamic partitioning;

4.3.1 Partition table data loading -- dynamic partition

  • Dynamic partitioning means the partition field values are inferred automatically from the query result (by column position). The core syntax is insert + select;

To enable hive dynamic partitioning, two parameters need to be set in the hive session:

-- whether to enable the dynamic partition function
set hive.exec.dynamic.partition=true;
-- specify the dynamic partition mode: nonstrict (non-strict) or strict;
-- strict mode requires at least one partition to be a static partition
set hive.exec.dynamic.partition.mode=nonstrict;

4.3.2 Operation Demonstration

Create a new partition table for dynamic partition insertion. The table creation sql is as follows:

create table t_all_hero_part_dynamic(
    id int,
    name string,
    hp_max int,
    mp_max int,
    attack_max int,
    defense_max int,
    attack_range string,
    role_main string,
    role_assist string
) partitioned by (role string)
row format delimited
fields terminated by "\t";

Before executing the above sql, you need to set the following parameters in the current session:

set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

After the table is created, execute the following sql to import the data of the previous t_all_hero table; the last column of the select (tmp.role_main) is matched by position to the dynamic partition field role:

insert into table t_all_hero_part_dynamic partition(role) select tmp.*,tmp.role_main from t_all_hero tmp;

This may take a while to execute, since it launches a MapReduce job;

View the dynamic table's directory on HDFS; you can see that the data files are indeed divided according to the expected partition field;
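You can also list the partitions directly to confirm (the actual partition values depend on the distinct role_main values in t_all_hero):

show partitions t_all_hero_part_dynamic;
-- prints one line per partition, e.g. role=<role_main value>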

Note on partition tables:

  • A partition is not a required part of the table creation syntax; it is an optional optimization;

  • The partition field cannot be an existing field in the table and cannot be repeated;

  • The partition field is a virtual field, and its data is not stored in the underlying file;

  • Partition field values come either from manual specification by the user (static partition) or from automatic inference based on the position of the query result columns (dynamic partition);

  • Hive supports multiple levels of partitioning, that is, partitioning on top of partitions, giving finer-grained partitions;

5. Bucket table

5.1 Bucket table concept

A bucketed table is also called a bucket table; the name comes from the bucket keyword in the table creation syntax. It is a table type designed to optimize queries. The data files of a bucketed table are decomposed at the storage layer into several parts, in other words split into several independent small files. When bucketing, you must specify the field by which the data is divided and the number of buckets (parts);

5.2 Description of bucketing rules

5.2.1 Basic rules of bucketing

Data with the same bucket number will be allocated to the same bucket;

Bucket number = hash_function(bucketing_column) mod num_buckets

that is, the hash of the bucketing field, modulo the number of buckets.

hash_function depends on the type of bucketing_column:

  • If it is an int type, hash_function(int) == int;

  • For other types such as bigint, string, or complex data types, hash_function is subtler: it is a number derived from the value, such as its hashcode;
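A quick worked example of the int rule (the values are illustrative; fips is the int county code used in the demo below):

-- hash_function(int) == int, so a row with fips = 36061 in a 5-bucket table
-- would land in bucket 36061 mod 5 = 1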

5.3 Bucket complete syntax tree

CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name
[(col_name data_type [COMMENT col_comment], ...)]
[COMMENT table_comment] 
[PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)]
[CLUSTERED BY (col_name, col_name, ...) [SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS]
[ROW FORMAT DELIMITED|SERDE serde_name WITH SERDEPROPERTIES (property_name=property_value,...)]
[STORED AS file_format]
[LOCATION hdfs_path]
[TBLPROPERTIES (property_name=property_value, ...)];

Bucket key parameter description:

  • CLUSTERED BY (col_name) specifies which field to bucket by;

  • INTO N BUCKETS specifies how many buckets (parts) the data is divided into;

  • It should be noted that the fields for bucketing must be fields that already exist in the table;

The simplest bucket SQL

CREATE [EXTERNAL] TABLE [db_name.]table_name
[(col_name data_type, ...)]
CLUSTERED BY (col_name)
INTO N BUCKETS;

5.4 Operation demonstration of bucket table

Requirement: we have the following data file, us-covid19-counties.dat;

Interpretation of the file content:

  • The content is the cumulative COVID-19 case data for each US county as of January 28, 2021, including confirmed cases and deaths;

  • Field meanings: count_date (statistics date), county (county), state (state), fips (county FIPS code), cases (cumulative confirmed cases), deaths (cumulative deaths);

5.4.1 Create table

The bucketing field must be a field that already exists in the table

Bucket table without field sorting

According to the state field, the data is divided into 5 buckets, and the table creation statement is as follows:

CREATE TABLE t_usa_covid19_bucket(
      count_date string,
      county string,
      state string,
      fips int,
      cases int,
      deaths int)
CLUSTERED BY(state) INTO 5 BUCKETS; 

Bucket table with field sorting

Divided into 5 buckets by the state field, with each bucket sorted in descending order by the number of confirmed cases (cases):

CREATE TABLE t_usa_covid19_bucket_sort(
     count_date string,
     county string,
     state string,
     fips int,
     cases int,
     deaths int)
CLUSTERED BY(state)
SORTED BY (cases desc) INTO 5 BUCKETS;

After the creation is successful, check hdfs, and you can see the relevant data directory of the table;

Note: data can no longer be imported into a bucketed table by placing data files directly under the table directory with hdfs commands; you must use insert + select;

To import data into the bucket table created above, first create an ordinary staging table with the following sql:

CREATE TABLE t_usa_covid19(
       count_date string,
       county string,
       state string,
       fips int,
       cases int,
       deaths int)
row format delimited fields terminated by ",";

Upload the data file to the table directory:

hdfs dfs -put ./us-covid19-counties.dat /user/hive/warehouse/test.db/t_usa_covid19

After the execution is successful, check the table data, and you can see that the data is loaded successfully;

Use the insert+select syntax to load data into the bucket table
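Depending on the Hive version, bucketing may need to be enforced explicitly before running the insert (this setting is required on versions before Hive 2.0; later versions always enforce bucketing):

set hive.enforce.bucketing = true;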

insert into t_usa_covid19_bucket select * from t_usa_covid19;

After seeing that the map-reduce task is completed, check the bucket table data, and the data will be loaded in at this time;

Looking at the underlying storage of t_usa_covid19_bucket on HDFS, we can see that the data is divided into 5 parts, and rows with the same value of the bucketing field always end up in the same bucket;
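A sketch of the layout (Hive names the output files 000000_0 through 000004_0 for 5 buckets):

hdfs dfs -ls /user/hive/warehouse/test.db/t_usa_covid19_bucket
/user/hive/warehouse/test.db/t_usa_covid19_bucket/000000_0
/user/hive/warehouse/test.db/t_usa_covid19_bucket/000001_0
/user/hive/warehouse/test.db/t_usa_covid19_bucket/000002_0
/user/hive/warehouse/test.db/t_usa_covid19_bucket/000003_0
/user/hive/warehouse/test.db/t_usa_covid19_bucket/000004_0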

Next, a test: query the data for New York state by the bucketing field state. Since the data is bucketed by state, the query no longer needs to scan and filter the whole table. Under the hood, Hive computes the bucket number with the bucketing rule hash_function('New York') mod 5 and reads only the data in that bucket, so the query is a bucket scan instead of a full table scan:

select * from t_usa_covid19_bucket where state="New York";

In terms of query speed, it is still very fast;

5.4.2 Benefits of using bucket table

Reduce full table scans when querying based on bucketed fields

After bucketing, if you filter according to the bucketing field, the amount of data is significantly reduced, and you can avoid scanning the entire table and improve query efficiency;

JOINs become more efficient MR programs, with fewer Cartesian-product comparisons

In an ordinary two-table join such as select a.* from a join b on a.id = b.id, without bucketed tables the underlying scan must compare rows in Cartesian-product fashion. If both tables are bucketed and the field after on happens to be the bucketing field, matching rows are confined to corresponding buckets; the amount of data compared shrinks, so the number of Cartesian-product comparisons is also greatly reduced;
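A sketch of how this is commonly exploited, assuming both tables are bucketed on the join key and their bucket counts are multiples of each other (hive.optimize.bucketmapjoin enables Hive's bucket map join):

set hive.optimize.bucketmapjoin = true;
select a.* from a join b on a.id = b.id;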

Efficient sampling of data using bucketed tables

When the amount of data is particularly large and it is difficult to process all the data, sampling becomes particularly important. Sampling can estimate and infer the characteristics of the whole from the sampled data, and it is an economical and effective work and research method commonly used in scientific experiments, quality inspections, and social surveys.
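A minimal sketch using Hive's tablesample clause against the demo table, reading roughly 1/5 of the data by scanning a single bucket:

select * from t_usa_covid19_bucket tablesample(bucket 1 out of 5 on state);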
