Hive internal tables, external tables, partitions, buckets, and the SQL execution order

Hive has four data models: internal tables, external tables, partitioned tables, and bucketed tables. The following introduces each of the four models in turn.

Hive internal and external tables

The most intuitive distinction between internal and external tables is whether the table is created with the external keyword: an external table is, an internal table is not. There are three main differences between the two:

1) Whether the table is declared with the external keyword.

2) When an external table is dropped, only its metadata is deleted; the actual data is not deleted and remains in its original location. When an internal table is dropped, both the metadata and the data are deleted.

3) When data is imported into an external table, it is not moved into the data warehouse directory (/user/hive/warehouse); the data is managed by HDFS. The data of an internal table is stored under /user/hive/warehouse and is managed by Hive itself.

The usage scenarios of the two are also different.

External table usage scenario - importing data that already lives on HDFS, such as log information that must not be deleted when the table is dropped.

Internal table usage scenario - storing the intermediate and result tables produced by Hive processing: intermediate tables generated during multi-step logic, or temporary tables that can be dropped directly after use.

In practical applications, external and internal tables are usually used together. For example, daily log data is imported into HDFS, one directory per day; Hive builds an external table on top of it, mapping each day's raw log on HDFS to a daily partition of the external table. Statistics and analysis are then done on top of the external table, with internal tables storing the intermediate and result tables, populated via SELECT from the external table plus INSERT into the internal table.
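A minimal sketch of that pattern (the table and column names here are illustrative, not from the case below):

-- aggregate the raw external log table into an internal result table
insert overwrite table log_result_internal
select dt, count(*) as pv
from log_external
group by dt;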

There are three ways to create tables in Hive: creating a table directly, creating a table from a query (as), and creating a table like another one (like). The following cases illustrate each method.

  • Create table directly

When creating a table directly, you can define the field types, field comments, and the data storage format yourself.

-- create an internal table
create table article_internal(
    sentence string
)
row format delimited fields terminated by '\t'  -- delimiter between fields
lines terminated by '\n'; -- delimiter between lines

-- create an external table
create external table article_external(
    sentence string
)
row format delimited fields terminated by '\t' -- delimiter between fields
lines terminated by '\n'; -- delimiter between lines

-- import local data into the specified tables
load data local inpath '/usr/local/src/code/hive_data/The_Man_of_Property.txt'
overwrite into table article_external;

load data local inpath '/usr/local/src/code/hive_data/The_Man_of_Property.txt'
overwrite into table article_internal;

The code above shows the creation statements for the internal and external tables, together with statements that import data directly into the specified tables. Here the data is local. To import data that is already on HDFS, first upload the file to HDFS, then run the same load data statement without the local keyword, and the data on HDFS will be imported into the specified table.
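For example, assuming the file has already been uploaded to an HDFS directory such as /tmp/hive_data/ (the path here is illustrative):

-- load from HDFS: the same statement without the local keyword;
-- note that this moves the file into the table's directory
load data inpath '/tmp/hive_data/The_Man_of_Property.txt'
overwrite into table article_internal;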

We can also use this case to see the difference between internal and external tables. First, view the table structure (show create table xxx). The biggest difference is that the external table carries the extra external keyword; otherwise the two look similar.

Second, verify what happens when internal and external tables are deleted. After importing the data, you will find that the data is uploaded to HDFS whether the table is internal or external. Deletion in Hive splits into deleting the data (keeping the table) and deleting the table. To delete the data, use truncate table <table name>; this only works on internal tables, because the data of an external table is not managed by Hive. To delete the table, use drop table <table name>. Let's use drop here to see the difference between internal and external tables.
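In statement form (using the article tables from the case above):

-- delete the data but keep the table (works on internal tables only)
truncate table article_internal;

-- delete the tables themselves
drop table article_internal;
drop table article_external;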

Before the drop, the table data is stored on HDFS.

After the internal table is dropped, you will see that neither the table structure nor the data on HDFS exists anymore.

After the external table is dropped, the metadata is deleted, but the data on HDFS still exists.

  • Create a table from a query (as)

Creating a table with as is common in intermediate logic processing: the new table directly copies both the data and the structure of the source table.

create table article_as as select * from article;
  • Create a table like another (like)

Creating a table with like is suitable when you only care about the table structure and do not need the data.

create table article_like like article;


Partition

Why introduce the concept of partitions? As the amount of data in a single table grows, a Hive select query generally scans the entire table, spending a lot of time on unnecessary work when often only part of the data is of interest. Partitions are therefore introduced at table-creation time to reduce the amount of data scanned and improve query efficiency. You might ask: why not use an index instead? Hive does support indexes, but they differ considerably from partitions. An index does not split the data, while a partition does: an index trades additional storage space for query time, whereas partitioning physically splits the whole data set into smaller parts according to the partition columns.

How should we partition, then? Partitions are generally designed around business needs and depend entirely on the scenario. Common choices are year, month, day, gender, age group, or any attribute that divides the data evenly into different files. Choosing a suitable partition key is very important, because a bad one directly leads to slow queries. At work, the day dimension is often used for partitioning, represented by a d or dt field; if today is regarded as t, the partition generally holds the data of t-1.

For partitions, there are five details worth noting:

1) A table can have one or more partitions, and each partition exists as a subfolder under the table's directory.

2) Table and column names are case insensitive.

3) A partition exists in the table structure as a field, which you can see with the desc table command (it can be regarded as a pseudo-column). The field is not stored in the actual data content; it only represents the partition.

4) Partitions can be one or two levels deep; a single level is the usual setting.

5) Partitions are also divided into dynamic partitions and static partitions.

The first four points are easy to understand; the fifth is more confusing. What is dynamic and what is static? The following two insert statements show the difference.

-- insert data into a static partition
insert overwrite table udata_partition partition(dt='2020-12-19')
select user_id, item_id, rating from udata where user_id='305';

-- insert data into a dynamic partition
insert overwrite table udata_partition partition(dt)
select user_id, item_id, rating,
       to_date(from_unixtime(cast(`timestamp` as bigint), 'yyyy-MM-dd HH:mm:ss')) res
from udata where user_id='244';

Clearly, a static partition in Hive means the partition value is specified manually as a fixed value, which is friendly for inserting small numbers of partitions but is powerless when many partitions must be imported at once. With dynamic partitioning, the partition value is computed dynamically: when you do not hard-code the value, Hive takes the last column(s) of the query result as the partition value(s) by default, so you control the partitioning by placing the desired expression at the end of the select list. For example:

insert overwrite table udata_partition partition(dt)
select user_id, item_id, rating,
       to_date(from_unixtime(cast(`timestamp` as bigint), 'yyyy-MM-dd HH:mm:ss')) res
from udata;

The number of partitions must be planned strictly according to business needs, because partitions consume IO resources: the more partitions, the greater the IO consumption, and query time and performance suffer. In addition, dynamic partitioning is disabled by default, so it must be enabled manually before use. In actual work, dynamic partitioning is the more common choice, and for front-end log data the day+hour form is generally used as the partition (dt='2020-12-19'/hour='01').

-- enable dynamic partitioning
set hive.exec.dynamic.partition = true;
-- set the partition mode to nonstrict
set hive.exec.dynamic.partition.mode = nonstrict;
-- the maximum total number of dynamic partitions that a single SQL statement
-- with dynamic partitioning may create; exceeding it raises an error
set hive.exec.max.dynamic.partitions = 10;
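As a sketch, a table using the day+hour two-level partition mentioned above might be declared like this (the table and column names are illustrative):

-- front-end log table partitioned by day and hour
create table log_info(
    content string
)
partitioned by (dt string, hour string)
row format delimited fields terminated by '\t';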

The following are some basic partition operations: creating a partitioned table, then viewing the table structure and the partition data.
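For example, the udata_partition table used in the insert statements earlier could be created like this (the column types here are assumed):

-- create a table partitioned by day
create table udata_partition(
    user_id string,
    item_id string,
    rating int
)
partitioned by (dt string)
row format delimited fields terminated by '\t';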

The following commands for viewing partitions are commonly used:

-- view the table structure
show create table udata_partition;

desc udata_partition; -- the partition information is also visible here

-- view the partitions of the partitioned table
show partitions udata_partition;

In addition, when there is data in the table, a corresponding partition directory is generated on HDFS, as sketched below.
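Assuming the default warehouse location, each partition appears as its own subdirectory, roughly:

/user/hive/warehouse/udata_partition/dt=2020-12-19/000000_0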

Bucket

To divide the data in Hive at different granularities, so that queries can run efficiently over a small range of data, Hive offers two strategies: partitioning and bucketing. Partitioning is a coarse-grained division; bucketing divides further at a fine granularity. Hive simply hashes the value of the bucketing column and takes the result modulo the number of buckets to decide which bucket a record is placed in. It can be understood as the HashPartitioner in MapReduce: both divide data into buckets based on a hash.

The following is a calculation process for bucketing.

Bucket calculation: Hive computes the hash of the bucket column's value and takes it modulo the number of buckets, which determines the bucket a record belongs to.

Number of buckets: 3 (numbered 0, 1, 2)

Input data:

    user_id    order_id    gender

    196        12010200    1

    186        19201000    0

    22         12891999    1

    244        19192908    0

Calculation:

    196 % 3 = 1, so the record goes into bucket 1

    186 % 3 = 0, so the record goes into bucket 0

    22 % 3 = 1, so the record goes into bucket 1

    244 % 3 = 1, so the record goes into bucket 1

Common application scenarios for bucketing include data sampling and map-side joins.

Map-Side Join: bucketing can deliver higher query processing efficiency. Buckets add extra structure to a table (built from existing fields), and Hive can exploit that structure for some queries. Specifically, joining two tables that are bucketed on the same column (the join column) can be implemented efficiently as a map-side join: since rows with the same column value land in corresponding buckets, the JOIN only has to match buckets holding the same column values against each other, which greatly reduces the amount of data the JOIN touches.
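A sketch of such a join (the two table names are illustrative; both are assumed to be clustered into the same number of buckets on user_id):

-- enable bucket map join so Hive can join bucket by bucket on the map side
set hive.optimize.bucketmapjoin = true;

select a.user_id, a.order_id, b.gender
from bucket_orders a
join bucket_users b on a.user_id = b.user_id;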

Data sampling: when dealing with large-scale data sets, especially in the data mining stage, you can use a slice of the data to verify that code runs successfully, run local tests, or do statistical analysis on a representative sample. For how to sample data, see the following SQL.

select * from student tablesample(bucket 1 out of 32 on id);

tablesample is the sampling clause. The syntax is: tablesample(bucket x out of y on col).

x indicates which bucket to start sampling from, and y determines the stride: starting from bucket x, one bucket is taken out of every y. y must be a multiple or a factor of the table's total number of buckets, and Hive determines the sampling ratio from y. For example, if the table is divided into 64 buckets in total: when y=32, 64/32 = 2 buckets of data are extracted; when y=128, 64/128 = 0.5 of a bucket is extracted. When the total number of buckets is 32, tablesample(bucket 1 out of 16 on id) extracts 2 buckets in total: bucket 1 and bucket 17 (1+16).
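Applying the same arithmetic to the student table above (still assuming it has 32 buckets in total):

-- 32 / 16 = 2 buckets are sampled: bucket 3 and bucket 19 (3 + 16)
select * from student tablesample(bucket 3 out of 16 on id);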

Bucketing, like dynamic partitioning, is disabled by default, so it must be enabled manually with set hive.enforce.bucketing = true.

The following are the basic operations on bucketing:

-- enable bucketing before inserting (see above)
set hive.enforce.bucketing = true;

-- create a bucketed table
create table bucket_user(
    id int
)clustered by(id) into 4 buckets;

-- load the data
insert overwrite table bucket_user select cast(user_id as int) from udata;

Given the concepts of partitioning and bucketing, when should each be used? Use partitions when the data volume is large, for fast queries; use buckets for finer-grained queries, data sampling, or handling data skew.

The concept of Hive buckets is exactly the same as bucketing in MapReduce. Physically, each bucket is a file in the table's directory, and the number of buckets (output files) produced by a job equals the number of reduce tasks. A partitioned table is a different idea: a partition represents a storage location for the data, that is, a folder. Different data files can be placed under each folder, and queries can target the files stored in a folder, but the folder itself has nothing to do with the data content. Buckets, by contrast, split data according to a value in the data content, hashing one large file into several small files. These small files can be sorted individually, and if another table is split into small files by the same rules, then joining the two tables no longer requires a full scan: only the data in corresponding buckets needs to be matched, which greatly improves efficiency. Likewise, sampling no longer requires scanning the whole file; it suffices to extract part of the data from each bucket according to the same rules.

Execution order of Hive SQL

The general written form of a SQL statement executed in Hive is: select ... from ... where ... group by ... having ... order by ...

Its execution order is: from ... where ... select ... group by ... having ... order by ...

This execution sequence splits into two parts that correspond exactly to the two stages of MapReduce. The map stage covers from ... where ... select; the rest is executed on the reduce side.

  • Map stage:

① Execute from: locate and load the table.

② Execute where: filter and screen rows by condition.

③ Execute select: project the output columns.

④ Map-side file merging: the map side merges its local spill files, so each map ultimately produces one temporary file, whose records are then distributed to the corresponding reducers by key.

  • Reduce stage:

① group by: group and aggregate the data sent from the map side.

② having: the final filter over the grouped results before output.

③ order by: sort the results, then write them out to HDFS files.
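As an illustration, here is how the clauses of a query against the udata table from earlier map onto the two stages (the query itself is made up for this example):

select user_id, count(*) as cnt   -- 3. select: project the output (map side)
from udata                        -- 1. from: load the table (map side)
where rating > 3                  -- 2. where: filter rows (map side)
group by user_id                  -- 4. group by (reduce side)
having count(*) > 10              -- 5. having (reduce side)
order by cnt desc;                -- 6. order by (reduce side)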



Origin blog.csdn.net/qq_35363507/article/details/117284292