大数据面试题知识点分析（七）

本篇博客继续HIVE，将所有HIVE优化相关的内容深入清楚：

hive 优化：

1）Map的优化

   • 增加map的个数：
        set mapred.map.tasks=10;
    • 减少map的个数（合并小文件）：
        set mapred.max.split.size=100000000;
        set mapred.min.split.size.per.node=100000000;
        set mapred.min.split.size.per.rack=100000000;
        set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
• Map端聚合（combiner）：
hive.map.aggr=true;

2）Reduce的优化

    • 设置reduce的个数：
        set mapred.reduce.tasks=10;
    • reduce任务处理的数据量
        set hive.exec.reducers.bytes.per.reducer=100000;

    • 避免使用可能启动mapreduce的查询语句

        1)group by
        2)order by(改用distribute by和sort by)

3）Join的优化

• Join on的条件：

    SELECT a.val, b.val, c.val
    FROM a
    JOIN b ON (a.key = b.key1)
    JOIN c ON (a.key = c.key1)

• Join的顺序：

    /*+ STREAMTABLE(a) */ ：a被视为大表
    /*+ MAPJOIN(b) */：b被视为小表 

    SELECT /*+ STREAMTABLE(a) */ a.val, b.val, c.val
    FROM a
    JOIN b ON (a.key = b.key1)
    JOIN c ON (c.key = b.key1);

4）数据倾斜的优化

    • 万能方法：
        hive.groupby.skewindata=true
    • 大小表关联：
        Small_table join big_table

• 数据中有大量0或NULL：

        on case when (x.uid = '-' or x.uid = '0‘ or x.uid is null)
        then concat('dp_hive_search',rand()) else x.uid
        end = f.user_id;

• 大大表关联：

        Select/*+MAPJOIN(t12)*/ *
        from dw_log t11
        join (
            select/*+MAPJOIN(t)*/ t1.*
            from (
                select user_id from dw_log group by user_id
                ) t
                join dw_user t1
                on t.user_id=t1.user_id
        ) t12
        on t11.user_id=t12.user_id

• count distinct时存在大量特殊值：

        select cast(count(distinct user_id)+1 as bigint) as user_cnt
        from tab_a
        where user_id is not null and user_id <> ''

• 空间换时间：

        select day,
        count(case when type='session' then 1 else null end) as session_cnt,
        count(case when type='user' then 1 else null end) as user_cnt
        from (
            select day,session_id,type
                from (
                    select day,session_id,'session' as type
                    from log
                    union all
                    select day user_id,'user' as type
                    from log
                )
                group by day,session_id,type
            ) t1
        group by day

5）其他的优化

    • 分区裁剪（partition）：

        Where中的分区条件，会提前生效，不必特意做子查询，直接Join和GroupBy

    • 笛卡尔积：

        Join的时候不加on条件或者无效的on条件，Hive只能使用1个reducer来完成笛卡尔积

    • Union all：

    先做union all再做join或group by等操作可以有效减少MR过程，多个Select，也只需一个MR

    • Multi-insert & multi-group by：

    从一份基础表中按照不同的维度，一次组合出不同的数据
            FROM from_statement
            INSERT OVERWRITE TABLE table1 [PARTITION (partcol1=val1)] select_statement1 group by key1
            INSERT OVERWRITE TABLE table2 [PARTITION(partcol2=val2 )] select_statement2 group by key2
    • Automatic merge：

    当文件大小比阈值小时，hive会启动一个mr进行合并

            hive.merge.mapfiles = true 是否和并 Map 输出文件，默认为 True
            hive.merge.mapredfiles = false 是否合并 Reduce 输出文件，默认为 False
            hive.merge.size.per.task = 256*1000*1000 合并文件的大小

    • Multi-Count Distinct：

    一份表中count多个参数（必须设置参数：set hive.groupby.skewindata=true;）
            select dt, count(distinct uniq_id), count(distinct ip)
            from ods_log where dt=20170301 group by dt
• 并行实行：

hive执行开启：set hive.exec.parallel=true

大数据面试题知识点分析（七）

本篇博客继续HIVE，将所有HIVE优化相关的内容深入清楚：

hive 优化：

1）Map的优化

2）Reduce的优化

3）Join的优化

4）数据倾斜的优化

5）其他的优化

猜你喜欢