Getting started with Hive: bucket tables

Partitioning provides a convenient way to isolate data and optimize queries. However, not every data set can be split into reasonable partitions. A table or a partition can be further organized into buckets, which divide the data at a finer granularity.
Bucketing is another technique for breaking a data set down into more manageable parts.
Partitioning splits data across storage directories; bucketing splits data across files.

Steps to create a bucket table:

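1. Enforce bucketing

If the data file is very small, Hive may write it out as a single file even though the table is declared with buckets, so bucketing has to be enforced explicitly. The reducer count is also set to -1, its default, so that Hive chooses the number of reducers itself to match the number of buckets. On Hive 2.0 and later the hive.enforce.bucketing property has been removed and bucketing is always enforced, so the first setting is only needed on older versions.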

set hive.enforce.bucketing=true;
set mapreduce.job.reduces=-1;

2. Create a bucket table

create table stu_buck(id int, name string)
clustered by(id) 
into 4 buckets
row format delimited fields terminated by '\t';

The statement above buckets the table on the id field, with 4 buckets.
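
To confirm that the bucketing definition was recorded, the table metadata can be inspected. This is a minimal sketch that only assumes the stu_buck table created above; the output should include entries such as Num Buckets and Bucket Columns.

```sql
-- Show the full table metadata, including the bucket count and bucket column
describe formatted stu_buck;
```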

3. Add data to the bucket table

Rows are assigned to buckets by hashing the field named in clustered by when the table was created, so adding data has to go through a MapReduce job. A plain load only copies the file into the table directory and does not bucket it.
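
The hashing rule can be sketched with Hive's built-in hash() and pmod() functions. This is an illustration under assumptions, not the exact internal code path: the ids 1001 to 1004 are made-up sample values, a select without a from clause needs Hive 0.13 or later, and newer Hive versions may use a different hash function for the actual bucket files, in which case the real assignment can differ from what this query prints.

```sql
-- Sketch: bucket index = hash of the bucketing column modulo the bucket count
-- (4 for stu_buck). The ids below are hypothetical sample values.
select pmod(hash(1001), 4) as bucket_of_1001,
       pmod(hash(1002), 4) as bucket_of_1002,
       pmod(hash(1003), 4) as bucket_of_1003,
       pmod(hash(1004), 4) as bucket_of_1004;
```

pmod is used rather than % so that the result stays non-negative even when the hash is negative.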

  • Create a staging table with the same columns but without buckets:
create table stu(id int, name string)
row format delimited fields terminated by '\t';


  • Load data into the stu table:
load data local inpath '/home/hive/student.txt' into table stu;


  • Insert the data into the bucket table with a select from stu:
insert into table stu_buck select id, name from stu;
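
After the insert finishes, the bucket files can be listed from the Hive CLI to check the result. The path below is an assumption (the default warehouse directory and the default database); with 4 buckets the listing would be expected to show four files, typically named 000000_0 through 000003_0.

```sql
-- Assumption: default warehouse location and default database;
-- adjust the path to wherever stu_buck is stored on your cluster.
dfs -ls /user/hive/warehouse/stu_buck;
```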


4. Bucket sampling query

For very large data sets, users sometimes want a representative sample of the results rather than all of them. Hive supports this by sampling the table.
For example, to sample the data in the table stu_buck:

select * from stu_buck tablesample(bucket 1 out of 4 on id);

Note: tablesample is the sampling clause. Syntax: TABLESAMPLE(BUCKET x OUT OF y [ON colname]).
y should be a multiple or a factor of the table's total number of buckets. Hive derives the sampling ratio from y. For example, if the table is divided into 4 buckets, y = 2 extracts 4/2 = 2 buckets of data, and y = 8 extracts 4/8 = 1/2 of one bucket.
x is the bucket from which extraction starts. If more than one bucket is taken, each subsequent bucket number is the previous bucket number plus y. For example, with 4 buckets in total, tablesample(bucket 1 out of 2) extracts 4/2 = 2 buckets of data: the 1st bucket (x) and the 3rd bucket (x + y).
Note: x must be less than or equal to y; otherwise Hive reports:
FAILED: SemanticException [Error 10061]: Numerator should not be bigger than denominator in sample clause for table stu_buck
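
The rules above can be tried directly on stu_buck. The queries below are a sketch that assumes the 4-bucket table from the earlier steps; the x and y values are chosen only to illustrate each case.

```sql
-- y = 4 (equal to the bucket count): reads one bucket, the 1st
select * from stu_buck tablesample(bucket 1 out of 4 on id);

-- y = 2 (a factor of 4): reads 4/2 = 2 buckets, the 1st and the 3rd
select * from stu_buck tablesample(bucket 1 out of 2 on id);

-- y = 8 (a multiple of 4): reads 4/8 = 1/2 of the 1st bucket
select * from stu_buck tablesample(bucket 1 out of 8 on id);

-- x > y is rejected with SemanticException [Error 10061];
-- uncomment to reproduce the error.
-- select * from stu_buck tablesample(bucket 5 out of 4 on id);
```

Comparing select count(*) over each sample with the count over the whole table is a quick way to see how y changes the sampled fraction.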

Origin blog.csdn.net/thetimelyrain/article/details/104170307