Reading Notes on Programming Hive

1. In a MapReduce job, the map output is sorted before it is passed to the reduce phase. In essence, MapReduce transforms one set of records into another by merging and filtering.
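To make the map, sort, reduce order concrete, here is a minimal in-memory sketch in Python. It is only an illustration of the idea, not how Hadoop is implemented; the function names (map_phase, reduce_phase, word_count) are made up for this example.

```python
from itertools import groupby
from operator import itemgetter


def map_phase(lines):
    """Emit a (word, 1) pair for every word in every input line."""
    for line in lines:
        for word in line.split():
            yield (word, 1)


def reduce_phase(sorted_pairs):
    """Sum the counts per key; this relies on the input being sorted by key."""
    for key, group in groupby(sorted_pairs, key=itemgetter(0)):
        yield (key, sum(count for _, count in group))


def word_count(lines):
    mapped = list(map_phase(lines))
    # The sort step that sits between map and reduce.
    shuffled = sorted(mapped, key=itemgetter(0))
    return dict(reduce_phase(shuffled))


print(word_count(["hive on hadoop", "hive on spark"]))
```

Without the sort, groupby would see the same word in several separate runs and the reducer would emit partial counts, which is why the sort has to come first.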

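2. HBase fits cases where a query needs only a subset of the columns; such queries are fast. It provides row-level updates and fast lookups (on datasets with hundreds of millions of rows).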

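3. Hive can be extended with user-defined functions (UDFs) written in Java.

1. Download Hive and locate hive-exec-1.1.0.jar; it is needed on the classpath to compile the Java code.

2. Write a Java class that extends UDF and defines an evaluate method, then package it as HelloUDF.jar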

------------------------------

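package com.webank.udf;

import org.apache.hadoop.hive.ql.exec.UDF;

// Simple UDF that prefixes its argument with "udf:".
public class HelloUDF extends UDF {

    // Hive locates evaluate() by reflection and calls it for each row.
    public String evaluate(String str) {
        return "udf:" + str;
    }
}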

-------------------------------

3. Copy the jar to /tmp/HelloUDF.jar

4. Run in Hive:

add jar /tmp/HelloUDF.jar;

create temporary function helloudf as 'com.webank.udf.HelloUDF';

5. Use the UDF

select HelloUDF('778') from dual;

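4. Spark + Hive = Shark: a tool that runs Hive on top of Spark. Spark is a distributed computing framework that improves on MapReduce, with a Scala API built around distributed datasets.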

5. Hive can be accessed through JDBC, so it can be treated as an abstract database. This goes through the Thrift module.

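6. By default, Hive stores its metadata in an embedded Derby database. On production machines, use MySQL for the metastore instead, so that multiple sessions can query concurrently.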

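7. Hive CLI options: -S is silent mode and prints only the query results; -e runs the query string given on the command line; -v is verbose and echoes each statement as it is executed. Inside the CLI, the source command runs a script file.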

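8. The .hiverc file holds initialization commands that are run when the Hive environment starts, such as setting the user or adding the jar that defines a UDF.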

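9. Checking the size of a directory or table. A partitioned table is stored as one subdirectory per partition (for example ds=2015-05-09) with the data files inside it; an unpartitioned table keeps its data files directly under the table directory.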

hadoop fs -count -R /user/hive/warehouse/hduser02crs/hduser02crsdb.db/

hadoop fs -ls /user/hive/warehouse/hduser02crs/hduser02crsdb.db/rrs_grzx_bf_base_info_loan_single_mid/

Data file:

-rwxr-xr-x 3 hduser02crs hadoop 49828 2015-04-27 19:22 /user/hive/warehouse/hduser02crs/hduser02crsdb.db/rrs_grzx_bf_base_info_loan_single_mid/ds=2015-05-09/000000_0

10. Create an external table with create external table, specifying location '/data/stock'

drop table ext_rrs_grzx_bf_base_info_loan_single_mid;

Create an intermediate table in text format (not RCFILE); its data is loaded into the RCFILE table afterwards

CREATE TABLE ext_rrs_grzx_bf_base_info_loan_single_mid row format delimited fields terminated by '|' STORED AS TEXTFile AS select * from rrs_grzx_bf_base_info_loan_single_mid limit 1

drop table ext_rrs_grzx_bf_base_info_loan_single_mid;

create table if not exists ext_rrs_grzx_bf_base_info_loan_single_mid

like rrs_grzx_bf_base_info_loan_single_mid
row format delimited fields terminated by '|'

stored as textfile;

11. For performance-safe queries, set the query mode to strict; queries that scan all partitions are then rejected.

set hive.mapred.mode=strict;

12. Load data from a file into a partition or into a table (suited to text-format tables)

hive -e"load data local inpath '/tmp/loan.mid' overwrite into table rrs_grzx_bf_base_info_loan partition(ds='2013-05-09')"

hive -e"load data local inpath '/tmp/single.mid' overwrite into table rrs_grzx_bf_base_info_loan_single_mid partition(ds='2013-05-09')"

13. Export data from a table to local files

hive -e"insert overwrite local directory '/tmp/loan.mid' select * from rrs_grzx_bf_base_info_loan where ds='2015-05-09'"

hive -e"insert overwrite local directory '/tmp/single.mid' select * from rrs_grzx_bf_base_info_loan_single_mid where ds='2015-05-09'"

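14. Hive execution plans:

Hive launches the MapReduce jobs for a chain of joins from left to right.

If every join in the query uses the same join key (for example a.lending_ref), the joins are optimized into a single MapReduce job.

Each join buffers the rows of the tables seen so far and streams the last table through them, so place the largest table last (rightmost). Alternatively, mark it explicitly with select /*+STREAMTABLE(s)*/, which declares s to be the largest table. (Hive has since deprecated this hint approach and should be able to decide dynamically.)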
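Why the largest table should be the streamed one can be sketched with a toy hash join in Python. This is a simplified model under my own assumptions (one in-memory buffer, dict rows), not Hive's actual reduce-side join; build_buffer and stream_join are hypothetical names.

```python
from collections import defaultdict


def build_buffer(rows, key):
    """Hold a (small) table fully in memory, indexed by the join key."""
    buffer = defaultdict(list)
    for row in rows:
        buffer[row[key]].append(row)
    return buffer


def stream_join(buffered_rows, streamed_rows, key):
    """Inner join: buffer one table, stream the other one row at a time."""
    buffer = build_buffer(buffered_rows, key)
    for row in streamed_rows:  # never materialized as a whole
        for match in buffer.get(row[key], []):
            yield {**match, **row}


small = [{"lending_ref": 1, "org": "A"}, {"lending_ref": 2, "org": "B"}]
# A generator stands in for the large table: rows are produced on demand.
large = ({"lending_ref": i % 3, "amount": i * 10} for i in range(6))

joined = list(stream_join(small, large, "lending_ref"))
for row in joined:
    print(row)
```

Memory use is driven by the buffered side only, so buffering the small tables and streaming the big one keeps the footprint low.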

15. Nested selects are the most efficient way to write a join: filter by partition in a subquery before the join is executed.

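16. Do not put filter conditions in the on clause. In a left join they do not filter out rows of the left table (they only decide which rows match); they act as filters only in an inner join.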
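The difference is easy to see in a small simulation. The sketch below models the join semantics in plain Python with made-up tables (loans, users); it says nothing about how Hive executes the join.

```python
def left_join(left, right, on):
    """LEFT JOIN: every left row is kept; `on` only decides which rows match."""
    for left_row in left:
        matches = [right_row for right_row in right if on(left_row, right_row)]
        if matches:
            for right_row in matches:
                yield (left_row, right_row)
        else:
            yield (left_row, None)  # unmatched left row, right side is NULL


def inner_join(left, right, on):
    """INNER JOIN: only matching pairs survive, so `on` also filters."""
    for left_row in left:
        for right_row in right:
            if on(left_row, right_row):
                yield (left_row, right_row)


loans = [{"id": 1, "ds": "2015-05-09"}, {"id": 2, "ds": "2015-05-08"}]
users = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]


def on_with_filter(loan, user):
    """Join condition with a 'filter' on the left table mixed in."""
    return loan["id"] == user["id"] and loan["ds"] == "2015-05-09"


left_result = list(left_join(loans, users, on_with_filter))
inner_result = list(inner_join(loans, users, on_with_filter))
print(len(left_result), len(inner_result))
```

The loan with ds 2015-05-08 still appears in the left join result, just with an empty right side, so the intended filter belongs in a where clause or a subquery.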

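17. Order by, Sort by, Distribute by, Cluster by

Order by is generally unsuitable for large datasets, because all rows are sorted in a single reducer, which is itself a form of data skew.

Sort by sorts in parallel across multiple reducers, but only within each reducer; the outputs of different reducers overlap, so the overall result is not globally ordered.

Distribute by sends rows with the same value of the given column to the same reducer. Distribute by + Sort by therefore guarantees that all rows for a key sit together and are sorted within their reducer; it does not by itself produce a global order.

Cluster by = Distribute by + Sort by on the same columns.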
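The four clauses can be compared with a small Python model in which each reducer is a list. This is a sketch under my own assumptions: round-robin stands in for the arbitrary row placement of Sort by, and Python's hash() stands in for Hive's partitioner.

```python
def order_by(rows, key):
    """One reducer sorts everything: globally ordered, but no parallelism."""
    return [sorted(rows, key=key)]


def sort_by(rows, key, reducers):
    """Rows are spread arbitrarily (round-robin here); each reducer sorts locally."""
    parts = [rows[i::reducers] for i in range(reducers)]
    return [sorted(part, key=key) for part in parts]


def distribute_sort_by(rows, dist_key, sort_key, reducers):
    """Rows with the same dist_key go to the same reducer, then sort locally."""
    parts = [[] for _ in range(reducers)]
    for row in rows:
        parts[hash(dist_key(row)) % reducers].append(row)
    return [sorted(part, key=sort_key) for part in parts]


def cluster_by(rows, key, reducers):
    """Shorthand for distributing and sorting on the same key."""
    return distribute_sort_by(rows, key, key, reducers)


rows = [(3, "c"), (1, "a"), (2, "b"), (1, "z"), (3, "a"), (2, "y")]

print(order_by(rows, key=lambda r: r))
print(sort_by(rows, key=lambda r: r, reducers=2))
print(distribute_sort_by(rows, lambda r: r[0], lambda r: r, reducers=2))
```

In the sort_by output, key 2 shows up in both reducers; in the distribute_sort_by output each key lives in exactly one reducer, but concatenating the reducers still does not give a global order.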

18. Check the Hive version

/data/bdp/Install/hive/lib/hive-hwi-0.14.0.jar

hive -e "set hive.hwi.war.file;" | grep hwi | cut -d'-' -f3

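19. Data sampling: take roughly 1/1000 of the data. The first form hashes the ds column into 1000 buckets and keeps bucket 1; the second samples 0.1 percent of the table.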

select * from rrs_cnc_tm_loan tablesample(bucket 1 out of 1000 on ds) s

select * from rrs_cnc_tm_loan tablesample(0.1 percent) s
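A rough Python model of bucket sampling helps show what the on column does. Assumptions are labeled in the code: CRC32 is only a stable stand-in for Hive's own hash function, and the rows and the id column are invented for the illustration.

```python
import zlib


def bucket_of(value, buckets):
    """Bucket number in 1..buckets. CRC32 is a stand-in for Hive's hash (assumption)."""
    return zlib.crc32(str(value).encode("utf-8")) % buckets + 1


def tablesample(rows, column, bucket, buckets):
    """Keep the rows whose `column` value hashes into the requested bucket."""
    return [row for row in rows if bucket_of(row[column], buckets) == bucket]


# Hypothetical table: 10000 rows, all in the same ds partition.
rows = [{"id": i, "ds": "2015-05-09"} for i in range(10000)]

by_id = tablesample(rows, "id", bucket=1, buckets=1000)
by_ds = tablesample(rows, "ds", bucket=1, buckets=1000)
print(len(by_id), len(by_ds))
```

Because every row of a partition shares the same ds value, bucketing on ds keeps a partition entirely or drops it entirely; bucketing on a column that varies per row (or on rand() in Hive) spreads the sample across rows.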

20. Hive views:

create view common_mid as

select * from rrs_grzx_common_merge_mid where ds='2015-05-09'

21. Adding a bitmap index can improve performance. It suits columns with many repeated values.

22. Loading table partitions dynamically

create table test_tm_loan like rrs_cnc_tm_loan;

insert overwrite table test_tm_loan partition(ds='2015-05-09',dcn)
select
org
....
,grace_date
,dcn -- dynamic partition column; ds is not selected because it is set statically above
from rrs_cnc_tm_loan where ds='2015-05-09';

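24. Data skew

During a MapReduce job, a single reducer may end up processing the vast majority of the data. Distributing the data evenly across multiple reducers is therefore the fundamental fix for data skew.

Skew commonly shows up in order by sorting.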
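One common way to spread a hot key over several reducers is to salt the key and aggregate in two stages. The sketch below is not from the book notes; it is a toy model with a hypothetical route function standing in for the partitioner and a rotating salt instead of a random one, so the numbers are reproducible.

```python
from collections import Counter

REDUCERS = 4


def route(key, salt=0):
    """Pick a reducer for a (key, salt) pair; a stand-in for the real partitioner."""
    return (hash(key) + salt) % REDUCERS


def load_per_reducer(keyed_rows):
    """Count how many rows each reducer receives."""
    load = [0] * REDUCERS
    for key, salt in keyed_rows:
        load[route(key, salt)] += 1
    return load


def unsalted(keys):
    return [(key, 0) for key in keys]


def salted(keys):
    """Attach a rotating salt so a hot key is split into REDUCERS sub-keys."""
    return [(key, i % REDUCERS) for i, key in enumerate(keys)]


def two_stage_count(keys):
    partial = Counter(salted(keys))           # stage 1: count per (key, salt)
    total = Counter()
    for (key, _salt), n in partial.items():   # stage 2: merge partial counts
        total[key] += n
    return total


keys = [0] * 97 + [1, 2, 3]  # key 0 is the hot key

print(load_per_reducer(unsalted(keys)))
print(load_per_reducer(salted(keys)))
print(two_stage_count(keys) == Counter(keys))
```

The final counts are unchanged, but the first stage no longer funnels almost every row into one reducer.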

25. Exporting the files of an RCFILE table (the same approach works for tables in other formats)

Inspect

hadoop fs -ls /user/hive/warehouse/hduser02crs/hduser02crsdb.db/rrs_grzx_bf_base_info_loan_single_mid/ds=2015-05-09

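Other hadoop fs commands: -mkdir creates a directory, -put uploads a file, -rm deletes a file, -rmr deletes a directory, plus -cat and -get. The start-all.sh and stop-all.sh scripts under $HADOOP_HOME start and stop the cluster.

Download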

hadoop fs -get /user/hive/warehouse/hduser02crs/hduser02crsdb.db/rrs_grzx_bf_base_info_loan_single_mid/ds=2015-05-09 /tmp/rc

Import

hive -e"load data local inpath '/tmp/rc/ds=2015-05-09' overwrite into table rrs_grzx_bf_base_info_loan_single_mid partition(ds='2013-05-09')"

Verify

hive -e"select * from rrs_grzx_bf_base_info_loan_single_mid where ds='2013-05-09'"

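26. The three restrictions of strict mode:

1. Queries on a partitioned table must filter on a partition; full scans of all partitions are rejected

2. order by must be accompanied by limit; otherwise all rows would be sorted in a single reducer

3. A join without an on clause, which would produce a Cartesian product, is rejected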

27. What Hive indexes are for

An index speeds up group by computations

28. Streaming development

create table streamingtest (col1 int, col2 int) row format delimited fields terminated by '\001';
insert into table streamingtest select 4,5 from dual;

insert into table streamingtest select 3,8 from dual;

Replace 4 with 10

select transform(col1) using '/bin/sed s/4/10/g' as newA from streamingtest;
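The transform script does not have to be sed; any program that reads rows as text lines and writes lines back will do. Below is a minimal Python sketch of the same replacement logic, tested on in-memory lines. The function name replace_four is made up for this example.

```python
def replace_four(lines):
    """Mimic `sed s/4/10/g`: replace every '4' with '10' in each input line."""
    for line in lines:
        yield line.rstrip("\n").replace("4", "10")


# Rows arrive as text, one row per line, columns separated by tabs.
sample_input = ["4\n", "3\n", "14\t4\n"]
for out in replace_four(sample_input):
    print(out)
```

To use this from transform, the script would loop over sys.stdin instead of sample_input and print each result to stdout, then be registered with add file before the query; the script file name is up to you.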

29. View user privileges

show grant user hduser02 on database hduser02db;

show grant user hduser02 on table hduser03db.tm_loan

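30. Deadlocks in Hive

insert overwrite locks the target table or partition. Concurrent threads should therefore each write to a different partition, kept fully separate from one another.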
