Bucket table
Partitions provide a convenient way to segregate data and optimize queries. However, not every dataset can be partitioned sensibly. Within a table or partition, Hive can further organize the data into buckets, which divide the data at a finer granularity.
Bucketing is another technique for breaking a dataset into more manageable parts. Partitioning acts on the data storage path; bucketing acts on the data files.
First, create the bucket table.
Data preparation
100 ss1
1 ss2
100 ss3
2 ss4
100 ss5
3 ss6
100 ss7
4 ss8
100 ss9
5 ss10
100 ss1
Create a bucket table
hive (default)> create table car_bucket(id int, name string) clustered by(id)
into 4 buckets
row format delimited fields terminated by '\t';
View table structure
hive (default)> desc formatted car_bucket;
Num Buckets: 4
Import data into the bucket table using the load method
hive (default)> load data inpath '/car.txt' into table car_bucket;
Bucket rule
The results show how Hive assigns records to buckets: it hashes the value of the bucketing column, divides the hash by the number of buckets, and uses the remainder to decide which bucket stores the record.
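As a rough illustration of this rule, the query below computes a bucket index for each row with Hive's built-in hash and pmod functions. It is a sketch under assumptions: the alias bucket_idx is a made-up name, and the result only matches the physical bucket files if the table uses the legacy bucketing hash, where the hash of an int is the value itself. Newer Hive versions may bucket with a different hash function.

```sql
-- Illustrative only: hash the bucketing column and take the remainder
-- modulo 4, the bucket count declared when car_bucket was created.
-- bucket_idx is a hypothetical alias, not a Hive column.
select id, name, pmod(hash(id), 4) as bucket_idx
from car_bucket
order by bucket_idx, id;

-- Read back only the rows Hive assigns to the first of the 4 buckets.
select * from car_bucket tablesample(bucket 1 out of 4 on id);
```

Under the legacy-hash assumption, the ids 100 and 4 both give remainder 0, so they would share a bucket, while 1 and 5 give remainder 1.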
Matters needing attention in bucket table operation
- Set the number of reducers to -1 so that the job decides how many reducers it needs, or set it to a value greater than or equal to the number of buckets in the bucket table
- Load data from HDFS into the bucket table, to avoid the problem of local files not being found
- Do not use local mode
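The first and third points above can be sketched as session settings. The source does not name the properties; the two below are the standard Hadoop and Hive settings for the reducer count and automatic local mode, and whether they need changing depends on the cluster's defaults.

```sql
-- Let the job pick the reducer count itself
-- (alternatively, set a value >= 4, the number of buckets in car_bucket).
set mapreduce.job.reduces=-1;

-- Assumption: automatic local mode may have been enabled earlier in the
-- session; turn it off so the job runs on the cluster.
set hive.exec.mode.local.auto=false;
```

Run these in the same Hive session before importing data into the bucket table.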
Import data into the bucket table using the insert method
hive (default)> insert into table car_bucket select * from car_tmp;
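The insert statement reads from car_tmp, which is not defined in this section. A minimal sketch of such a staging table follows, assuming it has the same columns and tab delimiter as car_bucket but no buckets, and that the sample file is available at /car.txt on HDFS.

```sql
-- Hypothetical staging table: same columns as car_bucket, not bucketed.
create table car_tmp(id int, name string)
row format delimited fields terminated by '\t';

-- load data inpath moves the HDFS file into the table's directory, so
-- /car.txt must be uploaded again if an earlier load already consumed it.
load data inpath '/car.txt' into table car_tmp;
```

With car_tmp populated, the insert statement runs as a job that distributes the rows into buckets by the hash of id, instead of copying the file as it is.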