[Big Data] Hive Series - Hive-Bucket Table

Bucket table

Partitions provide a convenient way to segregate data and optimize queries, but not every dataset partitions cleanly. For finer-grained control, Hive can organize a table or a partition further into buckets, a more fine-grained division of the data range.
Bucketing is another technique for breaking a dataset into more manageable parts. The difference: partitioning splits the data storage path (each partition is a directory), while bucketing splits the data files themselves (each bucket is a file).
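The difference is visible directly in the HDFS layout: each partition is a subdirectory under the table directory, while each bucket is a separate data file inside the table (or partition) directory. An illustrative layout, assuming the default warehouse location and a hypothetical partitioned table for contrast:

/user/hive/warehouse/car_partitioned/month=202301/   <- one directory per partition
/user/hive/warehouse/car_bucket/000000_0             <- one data file per bucket
/user/hive/warehouse/car_bucket/000001_0
/user/hive/warehouse/car_bucket/000002_0
/user/hive/warehouse/car_bucket/000003_0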

First, create the bucket table.

Data preparation

100	ss1
1	ss2
100	ss3
2	ss4
100	ss5
3	ss6
100	ss7
4	ss8
100	ss9
5	ss10
100	ss1
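Save the rows above as a local file, for example car.txt, with tab-separated fields to match the table definition below, and upload it to HDFS so the load statement can find it. A minimal sketch from the Hive CLI (the local path /opt/data/car.txt is illustrative):

hive (default)> dfs -put /opt/data/car.txt /car.txt;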

Create a bucket table

hive (default)> create table car_bucket(id int, name string) clustered by(id)
into 4 buckets
row format delimited fields terminated by '\t';

View the table structure

hive (default)> desc formatted car_bucket;
Num Buckets:	4

Import data into the bucket table using load

hive (default)> load data inpath '/car.txt' into table car_bucket; 
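After the load, the table directory should contain one data file per bucket; with 4 buckets, Hive typically names them 000000_0 through 000003_0. You can check from the Hive CLI (the path below assumes the default warehouse location):

hive (default)> dfs -ls /user/hive/warehouse/car_bucket;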

Bucketing rule

From the results it can be seen that Hive decides which bucket a record goes into by hashing the value of the bucketing column and taking the remainder after dividing by the number of buckets, i.e. bucket = hash(column_value) % num_buckets.
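As a quick sanity check, assuming the usual Hive behavior that hash() of an int is the value itself (and ignoring the sign handling Hive applies to negative hash values), the sample ids land as follows: 100 and 4 go to bucket 0 (100 % 4 = 0, 4 % 4 = 0), 1 and 5 go to bucket 1, 2 to bucket 2, and 3 to bucket 3. The built-in hash function lets you inspect this directly:

hive (default)> select id, name, hash(id) % 4 as bucket_no from car_bucket;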

Things to note when working with bucket tables

  • Set the number of reducers to -1, letting the job decide how many reducers it needs, or set it to a value greater than or equal to the number of buckets in the table (see the settings sketch after this list)
  • Load data into the bucket table from HDFS, to avoid the problem of local files not being found
  • Do not use local mode
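A minimal sketch of those settings in the Hive CLI: the first lets the job decide the reducer count on its own, the second keeps Hive from silently switching to local mode. On Hive versions before 2.0 you would also need set hive.enforce.bucketing=true; for inserts to respect the bucket definition (the property was removed in Hive 2.0, where it is always on).

hive (default)> set mapreduce.job.reduces=-1;
hive (default)> set hive.exec.mode.local.auto=false;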

Import data into the bucket table using insert

hive(default)>insert into table car_bucket select * from car_tmp;
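Here car_tmp is assumed to be a plain, non-bucketed staging table with the same columns as car_bucket; a minimal sketch of how it might be prepared (the table name and local path are illustrative):

hive (default)> create table car_tmp(id int, name string)
row format delimited fields terminated by '\t';
hive (default)> load data local inpath '/opt/data/car.txt' into table car_tmp;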
