Hive bucket

Programming languages, 2018-04-25 17:41:45

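create table buck(id string, name string)

clustered by (id)

sorted by (id)

into 4 buckets

row format delimited fields terminated by ',';

The "-ed" forms (clustered by, sorted by) are passive: they declare, at table-creation time, that the table is bucketed and sorted. This is only a declaration of the table's format; it does not move or reorder any data by itself.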

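If you fill the table with load data local, the data still sits on HDFS as one whole file.

truncate table xx clears the table's data.

A bucketed table does not bucket loaded data automatically.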

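# Enable bucketing (needed on Hive 1.x; the property was removed in Hive 2.0, where bucketing is always enforced)

set hive.enforce.bucketing=true;
# Set the number of reducers equal to the number of buckets

set mapreduce.job.reduces=4;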

hive> insert into table buck
> select id,name from p distribute by (id)
> sort by (id) ;

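distribute by names the column that rows are hash-partitioned on, i.e. which reducer each row goes to; the reducer count was set to 4 above, so there are 4 partitions.

sort by sets the order of the rows within each partition.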
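The effect of distribute by plus sort by can be imitated in a few lines of Python. This is only a sketch: it assumes integer ids and assumes an int key hashes to itself, with the non-negative hash taken modulo the reducer count. Newer Hive versions may use a different hash function, so the exact bucket numbers here are illustrative.

```python
NUM_REDUCERS = 4  # mirrors mapreduce.job.reduces=4


def bucket_of(key, num_buckets=NUM_REDUCERS):
    # Assumption: an int key hashes to itself; mask to keep it non-negative.
    return (key & 0x7FFFFFFF) % num_buckets


def distribute_and_sort(rows):
    """rows: list of (id, name). Returns one id-sorted list per reducer."""
    reducers = [[] for _ in range(NUM_REDUCERS)]
    for row in rows:  # distribute by (id)
        reducers[bucket_of(row[0])].append(row)
    # sort by (id): each reducer sorts only its own rows
    return [sorted(part, key=lambda r: r[0]) for part in reducers]


rows = [(7, "g"), (2, "b"), (4, "d"), (1, "a"),
        (6, "f"), (5, "e"), (3, "c"), (8, "h")]
for i, part in enumerate(distribute_and_sort(rows)):
    print(i, part)
```

Each of the four lists corresponds to one reducer's output, which is what ends up as one bucket file of the table.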

Alternatively:

insert into table buck

select id,name from p cluster by (id);

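That is, cluster by is equivalent to distribute by + sort by on the same column.

The latter is more flexible, though, because the two clauses can name different columns. With several reducers, sort by orders the data inside each reducer only; the result is not globally sorted.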
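A small Python sketch of the difference, under the same simplifying assumption that an integer id is hashed by taking it modulo the reducer count (two reducers here, to keep the output short). The descending sort on name is a made-up variation, shown only to illustrate what distribute by + sort by allows and cluster by does not.

```python
def hash_partition(rows, n):
    # Assumption: int ids, partition = id modulo the reducer count.
    parts = [[] for _ in range(n)]
    for row in rows:
        parts[row[0] % n].append(row)
    return parts


rows = [(5, "e"), (2, "b"), (4, "d"), (1, "a"), (3, "c"), (6, "f")]

# cluster by (id): distribute on id and sort on id
clustered = [sorted(p, key=lambda r: r[0]) for p in hash_partition(rows, 2)]

# distribute by (id) sort by (name) desc: same distribution, different order
flexible = [sorted(p, key=lambda r: r[1], reverse=True)
            for p in hash_partition(rows, 2)]

# Reading the reducer outputs one after another is not a global sort.
flat = [r[0] for part in clustered for r in part]

print(clustered)
print(flexible)
print(flat, "globally sorted:", flat == sorted(flat))
```

If a global order is needed, order by is the clause for that, at the cost of funnelling everything through a single reducer.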

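Note that Hive's partition and MapReduce's partitioner are not the same thing.

A Hive partition only places loaded files into separate, designated directories.

Bucketing (clustered by), on the other hand, splits rows by hash partitioning, i.e. with the MR partitioner.

Loading a file directly with load does not bucket it.

Instead, query the data from another table and insert the result, so that the rows are hashed into the different buckets. Buckets are a finer-grained division than partitions.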
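To make the contrast concrete, here is a rough Python sketch of where a row would end up in each scheme. The partition column dt and its value are hypothetical (the tables in this post are not partitioned), the bucket file name pattern 000000_0 is the one Hive typically writes, and the hash is again assumed to be "int id modulo bucket count".

```python
def partition_dir(table, column, value):
    # A partition is a directory chosen by the column's VALUE.
    return f"{table}/{column}={value}"


def bucket_file(key, num_buckets):
    # A bucket is a file chosen by the HASH of the column.
    # Assumption: an int key hashes to itself.
    index = (key & 0x7FFFFFFF) % num_buckets
    return f"{index:06d}_0"


# Hypothetical partition column "dt"; bucket column "id" as in table buck.
print(partition_dir("buck", "dt", "2018-04-25"))
print(bucket_file(6, 4))
print(bucket_file(9, 4))
```

The partition directory can be read straight off the data value, while the bucket file can only be found by hashing, which is why a plain load cannot produce buckets.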

1. create table ordinary (id int, name string) row format delimited fields terminated by ',';

Create an ordinary table.

load data local inpath 'xxx' [overwrite] into table ordinary;

Load a local file into the ordinary table (overwrite is optional and replaces the existing data).

2. create table buck (id int, name string) clustered by (id) sorted by (id) into 4 buckets row format delimited fields terminated by ',';

Create the bucketed table.

insert into table buck select id, name from ordinary cluster by (id);

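The point of bucketing is to make joins cheaper: cluster by / distribute by place rows by hash, so rows with the same id are guaranteed to land in the same bucket, which improves efficiency. The precondition is that both tables are bucketed tables.

For example: select a.id, a.name, b.id, b.name from a join b on a.id = b.id;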
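Why co-bucketed tables join more cheaply can be sketched in Python: bucket i of a only ever needs to meet bucket i of b. This is a toy model, not Hive's actual join implementation; it assumes both tables use the same bucket count and the same "int id modulo bucket count" hash, and the sample rows are made up.

```python
def to_buckets(rows, n):
    # Assumption: int ids, bucket = non-negative id modulo n.
    buckets = [[] for _ in range(n)]
    for row in rows:
        buckets[(row[0] & 0x7FFFFFFF) % n].append(row)
    return buckets


def bucket_join(a_rows, b_rows, n=4):
    """Inner join on id, comparing only buckets with the same index."""
    a_buckets, b_buckets = to_buckets(a_rows, n), to_buckets(b_rows, n)
    out = []
    for i in range(n):
        # Build a lookup for bucket i of b only.
        names_by_id = {}
        for id_, name in b_buckets[i]:
            names_by_id.setdefault(id_, []).append(name)
        # Probe it with bucket i of a.
        for id_, name in a_buckets[i]:
            for b_name in names_by_id.get(id_, []):
                out.append((id_, name, id_, b_name))
    return out


a = [(1, "a1"), (2, "a2"), (5, "a5")]
b = [(5, "b5"), (2, "b2"), (9, "b9")]
print(sorted(bucket_join(a, b)))
```

Each pair of buckets is small and independent of the others, so the pairs can be processed separately instead of shuffling both tables in full.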

Partitioning, by contrast, splits the data into directories to make queries more convenient.

Reposted from blog.csdn.net/qq_38250124/article/details/80079419
