Slow Hive map stage

Category: Other. Posted 2019-03-02 00:21:21

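Data skew caused by joining columns of different data types

Scenario: the user_id column in the users table is an int, while the user_id column in the logs table holds both string-style and int-style values. When the two tables are joined on user_id, the default hash partitioning works on the int-typed id, so every record whose id is a string is sent to the same reducer.

Fix: cast the numeric column to string.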

select * from users a
left outer join logs b
on cast(a.user_id as string) = b.user_id
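Before changing the join, it can help to measure how many rows are actually affected. The queries below are a minimal sketch, assuming logs.user_id is declared as string and reusing the table names from the example above; the aliases non_numeric_ids and cnt are made up for illustration.

```sql
-- Count log rows whose user_id is present but does not parse as an integer.
-- In Hive, casting a non-numeric string to int evaluates to NULL.
SELECT count(*) AS non_numeric_ids
FROM logs
WHERE user_id IS NOT NULL
  AND cast(user_id AS int) IS NULL;

-- List the most frequent join keys to see whether a few values dominate.
SELECT user_id, count(*) AS cnt
FROM logs
GROUP BY user_id
ORDER BY cnt DESC
LIMIT 20;
```

If the first count is large, or a handful of keys account for most rows in the second result, the join is a likely candidate for skew.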

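Real-world case (map join is used by default):

In fin_ihotel_ceq_external_ctrip (the large table), countryid and cityid are int columns,

while country_id in dim_ihotel_country_ctrip (a small table) and city_id in dim_ihotel_city_ctrip (a small table) are both string columns. The original join was: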

from fin_ihotel_ceq_external_ctrip t1
left join dim_ihotel_country_ctrip t2
on t1.countryid=t2.country_id
left join dim_ihotel_city_ctrip t3
on t1.cityid=t3.city_id and t1.countryid=t2.country_id

It was changed to:

from fin_ihotel_ceq_external_ctrip t1
left join dim_ihotel_country_ctrip t2
on cast(t1.countryid as string)=t2.country_id
left join dim_ihotel_city_ctrip t3
on cast(t1.cityid as string)=t3.city_id and cast(t1.countryid as string)=t2.country_id
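To confirm the type mismatch and check how Hive plans the rewritten join, something like the following could be run. This is only a sketch: the source shows just the from clause, so the select list here is a placeholder, and the second join condition is shortened to the city key for readability.

```sql
-- Show the declared column types on each side of the join.
DESCRIBE fin_ihotel_ceq_external_ctrip;
DESCRIBE dim_ihotel_country_ctrip;
DESCRIBE dim_ihotel_city_ctrip;

-- Print the execution plan without running the query.
-- The select list is a placeholder, not the author's original one.
EXPLAIN
SELECT t1.countryid, t2.country_id, t3.city_id
FROM fin_ihotel_ceq_external_ctrip t1
LEFT JOIN dim_ihotel_country_ctrip t2
  ON cast(t1.countryid AS string) = t2.country_id
LEFT JOIN dim_ihotel_city_ctrip t3
  ON cast(t1.cityid AS string) = t3.city_id;
```

Comparing the plan with and without the explicit casts shows whether Hive inserts its own conversions on the join keys and whether the small tables are still joined on the map side.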

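Run time dropped from an hour and a half to 3 minutes.

Going forward, columns that will be used as join keys are best declared as string.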
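When the existing table cannot be redefined, one possible way to follow this advice is to expose string versions of the keys through a view. A minimal sketch, assuming the table and column names from the example; the view name and the _str column names are hypothetical.

```sql
-- Hypothetical view that adds string copies of the int join keys.
CREATE VIEW IF NOT EXISTS fin_ihotel_ceq_external_ctrip_str AS
SELECT
  t.*,
  cast(t.countryid AS string) AS countryid_str,  -- string copy of the int country key
  cast(t.cityid AS string)    AS cityid_str      -- string copy of the int city key
FROM fin_ihotel_ceq_external_ctrip t;
```

Queries can then join on countryid_str and cityid_str directly, so the cast lives in one place instead of being repeated in every join condition.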

Further reference:

Slow Hive map stage: a detailed analysis of the optimization process

http://bigdata.51cto.com/art/201703/535606.htm


Reposted from blog.csdn.net/hellojoy/article/details/86625859
