Hive Usage Notes

1. Hive data types
http://www.cnblogs.com/sharpxiajun/archive/2013/06/03/3114560.html
2. substr('abc',0,2) = 'ab'
In MySQL indexing starts at 1: select * from online_server where substr(count_time,1,10)='2011-12-11'
and
(server_name='iphone_onlines' or server_name='symbian_onlines' or server_name='gtalk_onlines' or server_name='wm_onlines');
PHP is 0-based.
Java is also 0-based.
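A quick sanity check of Hive's behavior (a sketch; run it against any existing table with limit 1, here one of the tables used later in these notes):
select substr('abcdef',1,3),  -- 'abc'
       substr('abcdef',0,3),  -- also 'abc': Hive substr is 1-based and treats 0 like 1
       substr('abcdef',2,3)   -- 'bcd'
from f_suda_log limit 1;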

3. hadoop job -list shows the currently running jobs.
4. The array produced by split(memberid,':')[...] is 0-indexed (whereas awk's split($1,arr,":") produces a 1-indexed array); array offsets in Hive SQL generally start at 0.
split(string str, string pat)  Split str around pat (pat is a regular expression) 
  Example:
  select dt,case when split(req_url,'[/?]')[3] is null then 'main' when split(req_url,'[/?]')[3] = '' then 'main' else split(req_url,'[/?]')[3] end,count(1) pv,count(distinct sinaglobal) uip
from f_suda_log where dt='2011-10-13' and channel='other' and req_url like 'http://china.nba.com%'
group by dt,case when split(req_url,'[/?]')[3] is null then 'main' when split(req_url,'[/?]')[3] = '' then 'main' else split(req_url,'[/?]')[3] end
order by pv desc limit 10;
split(req_url,'[/?]')[3] means the string is split on either / or ?, and every piece goes into the result array.
The split is done by regular expression, so special characters must be escaped with two backslashes.
5.not req_url like '%v.t.sina.com.cn/widget%' group by sinaglobal
6. Converting the delimiter of Hive query output: cat $tarpath/channel1UIP/* | awk '{gsub(/\001/,"\t",$0);gsub(/\\N/,"",$0);print $0}' >>


7. CASE WHEN ... THEN ... END usage examples
Summary: the CASE conditional expression can be used to search/classify data and, like subqueries, is very useful in complex queries and in row/column conversion. Does that mean CASE branches must not overlap? Experiments in both Hive and MySQL show:
when branches do overlap, a row only enters the first matching branch and never the second, so overlapping branches should be avoided!
1). Test in Hive
select
case
when channel='sports' then 'sports_flow'
when channel='sports' and channel2='sports_nba' then 'sports_nba_flow'
end,
count(1)
from f_suda_log
where dt='2011-12-11'
and channel='sports'
group by
case
when channel='sports' then 'sports_flow'
when channel='sports' and channel2='sports_nba' then 'sports_nba_flow'
end;
Result: sports_flow     51120650
This confirms that a row only enters the first matching branch; the second is never reached, so branches should not overlap.
Test in MySQL
SELECT CASE WHEN `score` !=0
THEN '有分数记录的分数'
WHEN `cost` != ''
THEN '有花钱记录的分数'
END , sum( `score` ) AS sum_socre
FROM `state_score`
GROUP BY CASE WHEN `score` !=0
THEN '有分数记录的分数'
WHEN `cost` !=0
THEN '有花钱记录的分数'
END HAVING sum_socre !=0


Result:

有分数记录的分数 1947
This also confirms the conclusion above.

Example 1:
select dt,
case
when req_url like 'http://food.weibo.com/s/2013593411%' then 'a'
when req_url like  'http://food.weibo.com/coupon/detail/%' then 'b'
when req_url like   'http://food.weibo.com/coupon/coupon/2013593411%' then 'c'
when req_url like  'http://food.weibo.com/foodmenu/menu/2013593411%' then 'd'
end,
count(1)
from f_suda_log
where
dt >='2011-10-11'
and (req_url like 'http://food.weibo.com/s/2013593411%' or
req_url like  'http://food.weibo.com/coupon/detail/%' or
req_url like   'http://food.weibo.com/coupon/coupon/2013593411%' or
req_url like  'http://food.weibo.com/foodmenu/menu/2013593411%')
group by dt,
case
when req_url like 'http://food.weibo.com/s/2013593411%' then 'a'
when req_url like  'http://food.weibo.com/coupon/detail/%' then 'b'
when req_url like   'http://food.weibo.com/coupon/coupon/2013593411%' then 'c'
when req_url like  'http://food.weibo.com/foodmenu/menu/2013593411%' then 'd'
end;

Revised version:
select dt,sum(
case
when result='succ' then 1 else 0 
end) 
as  'succNum'
,sum(
case
when result='succ' or result='fail' then 1 else 0 
end) 
as  'allNum'
from f_unipro_chksso
where
dt ='2011-10-24';

2).
CASE WHEN also has another form (the simple CASE expression), useful for pivoting rows into columns:

select convert(varchar(10),datestr,23)datestr,
max(case chname when 'NBA专栏PV' then cast(num as int)/1000.0 else 0 end) as 'NBA专栏PV',
max(case chname when 'NBA专栏UV' then cast(num as int)/1000.0 else 0 end) as 'NBA专栏UV',
max(case chname when '访问频次' then cast(num as float) else 0 end) as '访问频次',
max(case chname when '总访问次数' then cast(num as int)/1000.0 else 0 end) as '总访问次数',
max(case chname when '停留时间' then num else 0 end) as '停留时间',
max(case chname when '最高同时在线人数' then cast(num as int)/1000.0 else 0 end) as '最高在线人数',
max(case chname when '直播页面PV' then cast(num as int)/1000.0 else 0 end) as '直播页面PV',
max(case chname when '直播页面UV' then cast(num as int)/1000.0 else 0 end) as '直播页面UV',
max(case chname when '直播累计播放量' then cast(num as int)/1000.0 else 0 end) as '直播累计播放量',
max(case chname when '点播页面UV' then cast(num as int)/1000.0 else 0 end) as '点播页面UV',
max(case chname when '点播累计播放量' then cast(num as int)/1000.0 else 0 end) as '点播累计播放量',
max(case chname when '其他新浪频道流入' then cast(num as int)/1000.0 else 0 end) as '其他新浪频道流入',
max(case chname when '导航网站流入' then cast(num as int)/1000.0 else 0 end) as '导航网站流入',
max(case chname when '搜索引擎流入' then cast(num as int)/1000.0 else 0 end) as '搜索引擎流入',
max(case chname when '专题PV' then cast(num as int)/1000.0 else 0 end) as '专题PV',
max(case chname when '专题UV' then cast(num as int)/1000.0 else 0 end) as '专题UV',
max(case chname when '原创微博数' then cast(num as int) else 0 end) as '原创微博数',
max(case chname when '被转发次数' then cast(num as int) else 0 end) as '被转发次数',
max(case chname when '被评论次数' then cast(num as int) else 0 end) as '被评论次数'
from
(select datestr,chname,num
from t_nba_data a join t_nba_tpname b
on a.datatype = b.enname
) a
group by datestr

3). In Hive, a CASE WHEN expression cannot itself contain aggregate functions; for example, the following does not work:
case
when sum(case when substr(a.logstatus,1,1)<>5 then 1 else 0 end)=0 and a.time >0 then a.email
when sum(case when substr(a.logstatus,1,1)<>5 then 1 else 0 end)>0 and a.time >max(case when substr(c.logstatus,1,1)<>5 then a.time) then a.email
end email

8. LEFT JOIN examples

select dt,count(distinct client_ip) as allNum,b.succNum from f_unipro_chksso a
left join
(select dt,count(distinct client_ip) as succNum from f_unipro_chksso where dt ='2011-10-24' and result='succ') b
on a.dt=b.dt
where dt ='2011-10-24' and (result='succ' or result='fail')


select dt,sum(
case
when result='succ'
then 1
end) 
as  uip_all
,sum(
case
when result='succ' or result='fail'
then 1 
end) 
as  uip_succ
from f_unipro_chksso
where
dt ='2011-10-24'
group by dt;


9. Subqueries are used a great deal.

10. Efficiency differences between SQL statements
select count(distinct uid) from f_tblog_behavior where dt>='2011-09-20' and dt<='2011-12-20' and behavior='14000003'  -- this form is slow
select count(1) from ( select distinct uid from f_tblog_behavior where dt>='2011-06-20' and dt<='2011-12-20' and behavior='14000003') a;  -- this form is much faster

10. In Hive regexes, does . match any character except newline, as in PHP's preg_match? Worth testing: Hive regexes differ subtly from other regex flavors.
.* likewise matches any run of characters other than newline, and . matches a literal '.' even without escaping (since it matches any character).
select regexp_extract(parse_url(refurl,'HOST'),'([0-9a-zA-Z]+.sina*)') from f_suda_path_site_nba where dt='2012-01-04' and (parse_url(refurl,'HOST') rlike 'sina.com$' or parse_url(refurl,'HOST') rlike 'sina.com.cn$' or parse_url(refurl,'HOST') rlike 'sina.cn$') limit 20;



11. Usage of regexp_extract
Sometimes the result is not quite right; swapping the field position made it accurate.
select regexp_extract(parse_url(refurl,'HOST'),'([0-9a-zA-Z]+.sina\.)') from f_suda_path_site_nba where dt='2012-01-04' and (parse_url(refurl,'HOST') rlike 'sina.com$' or parse_url(refurl,'HOST') rlike 'sina.com.cn$' or parse_url(refurl,'HOST') rlike 'sina.cn$') limit 20;

12. Usage of find_in_set
select a.dt,count(1) from
(
select dt,regexp_extract(req_url, 'http://weibo.com/([0-9a-zA-Z]+)/?[^/.]*$', 1) domain,split(memberid,':')[2] memberid
from f_suda_log
where  dt='2011-12-05' and channel='tblog' and req_url rlike '^http://weibo.com/[0-9a-zA-Z]+/?[^/.]*$' 
and find_in_set( regexp_extract(req_url, 'http://weibo.com/([0-9a-zA-Z]+)/?[^/.]*$', 1),'app,fav,favsearch,messages,reportspam,tool,comments,atme,new,comment,k,at,pub,f,zt,systemnotice,sorry,message,settings,findfriends,verify,u')=0
) a join f_tblog_user_data b on a.memberid=b.uid and b.domain=a.domain group by a.dt

13. What are the differences and relationships among DISTRIBUTE BY, GROUP BY, ORDER BY, and SORT BY?
DISTRIBUTE BY pairs with SORT BY: rows are partitioned across reducers and sorted within each reducer.
GROUP BY pairs with ORDER BY: after all reducers finish and their outputs are combined, the result is grouped and sorted globally.
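A minimal comparison sketch (the table t and its columns are hypothetical):
-- global total order: a single reducer produces one fully sorted result
select memberid, pv from t order by pv desc;
-- partial order: rows with the same memberid go to the same reducer,
-- and each reducer sorts only its own output
select memberid, pv from t distribute by memberid sort by pv desc;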

14. Optimize Hive jobs by doing a UNION first and then GROUP BY; as I understand it, this is what multi-group-by amounts to.
15. Set io.sort.mb=256; in Hive to avoid map-side out-of-memory problems. http://www.searchtb.com/2010/12/hadoop-job-tuning.html
16. Do semicolons and commas need escaping inside Hive's split?
Just try it:
With rlike:
select zt from f_suda_log where dt='2012-03-11' and zt rlike '^[\\d]+\\,[\\d]+\\,[\\d]+' and zt rlike '\\;' limit 10;
Testing split:
select split(zt,'\\;')[0] from f_suda_log where dt='2012-03-11' and zt rlike '^[\\d]+\\,[\\d]+\\,[\\d]+' and zt rlike '\\;' limit 10;
select concat(split(split(zt,'\\\;')[0],',')[0],',',split(split(zt,'\\\;')[0],',')[1],',',split(split(zt,'\\\;')[0],',')[2],',') from f_suda_log where dt='2012-03-11' and zt rlike '^[\\d]+\\,[\\d]+\\,[\\d]+' and zt rlike '\\;' limit 10;
17. Study the Hive tuning parameters.

18. Inspect a table partition: describe extended f_tblog_enterprise_user partition(dt='2012-03-01');
19. An MPP database is a massively parallel processing database.
20. List table partitions: show partitions day_hour_table;
21. Using buckets
An important way to improve Hadoop efficiency is to run multiple maps and multiple reduces.
By specifying CLUSTERED BY columns, the data file is hashed into multiple smaller files (buckets).

create table sunwg_test11(id int,name string)
CLUSTERED BY(id) SORTED BY(name) INTO 32 BUCKETS
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t';

Before running the INSERT, don't forget:
set hive.enforce.bucketing = true;
This forces the output to use multiple reducers (one per bucket).

22. When adding multiple partitions, it is better to add them one at a time; adding several in one statement can cause problems:
alter table f_selfLog_hour add partition(dt='2013-05-02',hour='00') location '/input/selfLog/2013/05/02/00' partition(dt='2013-05-02',hour='01') location '/input/selfLog/2013/05/02/01' partition(dt='2013-05-02',hour='02')

23. LIKE can be used to copy a table's structure without copying its data.
24. In Hive CREATE TABLE, SORTED BY can list several columns whose values are kept sorted within each bucket.
25. LOAD DATA: if the table is partitioned, a value must be supplied for every partition column:
LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE]
INTO TABLE tablename
[PARTITION (partcol1=val1, partcol2=val2 ...)]
26. A SELECT statement can use a regular expression to choose columns; the statement below returns every column except ds and hr:
SELECT `(ds|hr)?+.+` FROM sales
select `(requestargv|requestip)?+.+` from f_selfLog_hour where dt='2013-05-21' and hour='12' limit 5;
27. REGEXP is functionally identical to RLIKE.
28. // Multi-table insert

from source

insert overwrite table records_by_year select year, count(1) group by year

insert overwrite table good_records_by_year select year, count(1) where quality ==0 group by year;

29. Add File /path/to/is_good_quality.py;   // loads the filter script into the distributed cache
30.select * from things left semi join sales on (sales.id = things.id);

// equivalent to an IN subquery: select * from things where things.id in (select id from sales);

// One restriction when writing a LEFT SEMI JOIN: the right-hand table may only appear in the ON clause.
31. Map join: if one of the joined tables is small enough to fit in memory, Hive can load the smaller table into every mapper's memory and perform the join there. To request a map join, use a C-style comment hint in the SQL.
The join completes in the map phase and no reduce is needed, provided the data required is accessible during the map:
select /*+ MAPJOIN(things) */ sales.*, things.* from sales join things on (sales.id = things.id);
This statement produces no reduce stage.
MAPJOIN(...) may list several tables, all loaded into memory together.
A map join can take advantage of bucketed tables, because a mapper working on one bucket only needs to load the corresponding bucket of the right-hand table. Enable the optimization with: set hive.optimize.bucketmapjoin = true;
32. Managed (internal) tables can also be partitioned. Example:
create table test_4
(
requestIp string
)
PARTITIONED BY (dt string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
;

from f_selfLog_hour
insert overwrite table test_4 partition (dt='2013-05-11')
select requestip  where dt='2013-05-11' group by requestip


33. Adding a column in Hive
   ALTER TABLE f_selfPcApacheLog_deserialize_hour ADD COLUMNS (url string);
34. Hive sometimes contains a kind of fake NULL: the literal string '\N'.
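This usually appears when NULLs written to text output come back as the two-character sequence \N. A hedged filtering sketch (t and col are hypothetical, and the exact escaping may need adjusting to the Hive version and data):
select * from t where col is not null and col != '\\N';
Alternatively strip it downstream with awk, as in item 6.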

35. Using an external script file
hive> add FILE weekday_mapper.py;

Split the data by weekday:
INSERT OVERWRITE TABLE u_data_new
SELECT
TRANSFORM (userid, movieid, rating, unixtime)
USING 'python weekday_mapper.py'
AS (userid, movieid, rating, weekday)
FROM u_data;

SELECT weekday, COUNT(1)
FROM u_data_new
GROUP BY weekday;

My own example:
create table f_jyzlog_urldecode_day
(oneline string)
PARTITIONED BY (dt string);
add file /download/jyzStat/bin/transtorm_urldecode.py;
INSERT OVERWRITE TABLE f_jyzlog_urldecode_day partition(dt='2013-06-30')
SELECT
TRANSFORM(oneline)
USING 'python transtorm_urldecode.py'
AS (oneline)
from f_jyzlog_initial
where dt='2013-06-30'
;
If the file is no longer in the distributed cache, an exception is thrown.

36. MapReduce compression best practices:
gz files cannot be split (a large file still becomes a single map task), so use LZO instead.
Use LzoCodec for the intermediate map output and LzopCodec for the reduce output.
   http://blog.sina.com.cn/s/blog_7673d4a50101b3yy.html
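A minimal sketch of the corresponding settings, assuming the hadoop-lzo codecs are installed (property names as in Hadoop 1.x):
set mapred.compress.map.output=true;
set mapred.map.output.compression.codec=com.hadoop.compression.lzo.LzoCodec;
set hive.exec.compress.output=true;
set mapred.output.compress=true;
set mapred.output.compression.codec=com.hadoop.compression.lzo.LzopCodec;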

37. Creating an index on a Hive table:
http://www.cnblogs.com/end/archive/2013/01/21/2870352.html
Create the index:
create index uuid_index on table kysttdlurl(uuid) AS  'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler' WITH DEFERRED REBUILD ;
Build (rebuild) the index:
alter index uuid_index on kysttdlurl rebuild;
Show a table's indexes: show index on index_test;
38.LEFT SEMI JOIN
http://www.360doc.com/content/11/0212/10/2459_92341146.shtml
Hive currently has no IN/EXISTS; LEFT SEMI JOIN can be used instead.
LEFT SEMI JOIN is a more efficient implementation of the IN/EXISTS subquery.
Since Hive does not currently implement IN/EXISTS subqueries, you can rewrite such subqueries with LEFT SEMI JOIN.
The restriction of LEFT SEMI JOIN is that the right-hand table may only be filtered in the ON clause;
filtering it in the WHERE clause, the SELECT list, or anywhere else does not work.
39. Learn to use dynamic partitions and buckets. http://blog.csdn.net/awayyao/article/details/7630000
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
The difference between static and dynamic partitioning is whether, when loading data, the partition name is typed in by hand or derived from the data itself.
Then use Hive's INSERT to load the data. Note that the dynamic partition columns must follow all the other columns at the end of the SELECT list.

INSERT OVERWRITE TABLE target PARTITION (dt)
SELECT id,user_id,app_id,time,ip,substr(time,0,10) FROM origin

Example:
set hive.exec.dynamic.partition=true;

INSERT OVERWRITE TABLE temp_app_devicemac PARTITION(dt,appid)
SELECT devicemac,dt,appid
FROM f_selfpcapachelog_deserialize_hour
where dt='2013-06-20'
distribute by
dt,appid

If the partition limit is hit, just raise the maximum number of dynamic partitions; the settings below apply to the current session.

SET hive.exec.max.dynamic.partitions=100000;
SET hive.exec.max.dynamic.partitions.pernode=100000;




40. Precision sometimes matters in Hive. For example rlike '^2.0' versus rlike '^2\.0' (run directly in Hive, two backslashes and three behave the same) can occasionally give wrong results,
as can sloppy filtering of NULL versus empty values. In short, be strict and precise.
41. Hive date functions
date_sub('2012-12-08',10) — the date 10 days earlier
http://dacoolbaby.iteye.com/blog/1826307
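A few other commonly used date functions (a quick sketch; expected results shown as comments, run against any table with limit 1):
select date_add('2012-12-08', 10),            -- '2012-12-18'
       datediff('2012-12-08', '2012-12-01'),  -- 7
       to_date('2012-12-08 10:03:01'),        -- '2012-12-08'
       from_unixtime(unix_timestamp(), 'yyyy-MM-dd')   -- today's date
from f_suda_log limit 1;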
42. Is a join a single job no matter how many tables are involved? The actual rule is:
if the join key is the same, any number of tables is merged into one MapReduce job;
if the join conditions differ, the number of MapReduce jobs corresponds to the number of join operations.

43. A principle for join queries: put the table/subquery with fewer rows on the left side of the join operator.
The reason is that in the reduce phase of the join, the contents of the table on the left of the join operator are buffered in memory; keeping the smaller table on the left
effectively reduces the chance of OOM errors (a hint-based alternative is sketched below).
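If reordering the tables in the query is inconvenient, Hive also accepts a hint naming the table to stream (rather than buffer) during the reduce-side join; a sketch with hypothetical tables big and small:
select /*+ STREAMTABLE(big) */ small.key, big.val
from small join big on (small.key = big.key);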
44. hive.map.aggr = true — whether to aggregate on the map side; the default is true.
45. Load balancing when the data is skewed:
   hive.groupby.skewindata = false
   When this option is set to true, the generated query plan contains two MR jobs. In the first job, the map output is distributed randomly across the reducers and each reducer performs a partial aggregation and emits its result; rows with the same GROUP BY key may therefore land on different reducers, which balances the load. The second MR job then distributes the pre-aggregated results to the reducers by the GROUP BY key
   (which guarantees that identical GROUP BY keys end up on the same reducer) and completes the final aggregation.
46.
Using map/reduce scripts in Hive:
FROM (
SELECT TRANSFORM(user_id, page_url, unix_time)
USING 'page_url_to_id.py'
AS (user_id, page_id, unix_time)
  FROM mylog
DISTRIBUTE BY user_id
SORT BY user_id, unix_time)
  mylog2
SELECT TRANSFORM(user_id, page_id, unix_time)
USING 'my_python_session_cutter.py' AS (user_id, session_info);

UDFs are a different mechanism from this:
add jar build/contrib/hive_contrib.jar;
  CREATE TEMPORARY FUNCTION example_add
AS 'org.apache.hadoop.hive.contrib.udf.example.UDFExampleAdd';

  SELECT example_add(1, 2) FROM src;

The MAP and REDUCE keywords can also be used in Hive,
with MAP taking the place of the TRANSFORM keyword:
from (
  from records2
  MAP year,temperature,quality
  USING 'is_good_quality.py'
  AS year,temperature) a
REDUCE year,temperature
USING 'max_temperature_reduce.py' AS year,temperature;
* Writing a UDF in Java is easy.
    * Hadoop's Writables/Text types give good performance.
    * UDFs can be overloaded.
    * Hive supports implicit type conversion.
    * UDFs support variable-length argument lists.
    * GenericUDF gives better performance (it avoids reflection).
    http://blog.csdn.net/cajeep2001/article/details/7824642
Using a UDF:
First download the net.sf.fjep.fatjar_0.0.31.jar plugin package.
http://blog.csdn.net/ckl_soft/article/details/8183304


add jar /home/hadoop/jar/kuaiyong_udf_test_TestUDF.jar;
add jar /home/hadoop/jar/kuaiyong_udf_test_ip2byte.jar;
add jar /home/hadoop/jar/kuaiyong_udf.jar;
CREATE TEMPORARY FUNCTION ip2byte AS 'udf.ip2byte';
SELECT test_ip2byte(requestip) FROM f_selfPcApacheLog_deserialize_hour where dt='2013-06-30' limit 10;

mkdir -p com/kuaiyong/hive/udf
vi com/kuaiyong/hive/udf/ip2byte.java
javac -classpath /usr/local/hadoop-1.1.2/hadoop-core-1.1.2.jar:/usr/local/hive/lib/hive-exec-0.10.0.jar /home/hadoop/com/kuaiyong/hive/udf/ip2byte.java

jar -cvf ip2byte.jar /home/hadoop/com/kuaiyong/hive/udf/ip2byte.class
echo $CLASSPATH
mkdir /usr/java/jdk1.7.0_17/lib/org
chown -R hadoop:hadoop /usr/java/jdk1.7.0_17/lib/org/
su - hadoop -c "chmod -R 755 /usr/java/jdk1.7.0_17/lib/org/"

Error encountered:
java.lang.IllegalArgumentException: Can not create a Path from an empty string
The cause was that add jar did not succeed.
UDAF
    * Writing a UDAF is similar to writing a UDF.
    * UDAFs can be overloaded.
    * UDAFs can return complex types.
    * Partial aggregation can be disabled when using a UDAF.
Prefer UDFs over TRANSFORM wherever possible...
Introduction to UDTFs (user-defined table-generating functions)
http://blog.linezing.com/2011/03/hive%E4%B8%ADudtf%E7%BC%96%E5%86%99%E5%92%8C%E4%BD%BF%E7%94%A8
UDTFs (User-Defined Table-Generating Functions) address the need to turn one input row into many output rows (one-to-many mapping).
In some scenarios a UDTF can greatly improve HQL performance, e.g. when JSON or XML data would otherwise have to be parsed repeatedly.
add jar /home/hadoop/jar/kuaiyong_UDTF_test.jar;
CREATE TEMPORARY FUNCTION test_UDTF AS 'com.kuaiyong.UDF.GetAllLinesUDTF';
select T.col1 from f_ip_dist_new_china lateral view serial(5) T as col1;

47. The multi-group-by feature groups the same data twice in succession with different GROUP BY keys; it saves one MapReduce pass.
48. Multi-distinct is another multi-* feature developed at Taobao; it allows several DISTINCTs in the same query/subquery,
which likewise saves extra MapReduce passes.
49.if(net is null,'0',net)
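A shorter equivalent, assuming the built-in coalesce function (it returns its first non-NULL argument): coalesce(net,'0'); in later Hive versions nvl(net,'0') behaves the same.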
50. mkfs.ext2 -c /dev/hda1 checks a disk for bad blocks.
51. Regular-expression replacement function: regexp_replace
Syntax: regexp_replace(string A, string B, string C)
Returns: string

Description: replaces the parts of string A that match the Java regular expression B with C. Note that escape characters are needed in some cases.

Example:

hive> select regexp_replace('foobar', 'oo|ar', '') from dual;

fb

52. Hive regexes always need two backslashes for escaping.
53. Building a highly available data-mining cluster with HAProxy + Hive
http://slaytanic.blog.51cto.com/2057708/803626
54. Performance comparison of the various join types
http://www.cnblogs.com/ggjucheng/archive/2013/01/03/2842821.html
55.from (

)
When followed by multiple INSERTs, this form sometimes has problems and is unreliable.
56. Connecting to Hive from PHP to run SQL
http://blog.csdn.net/jiedushi/article/details/6579285
http://blog.csdn.net/woailuoyaya/article/details/8508276
tar zxf thrift-0.9.0.tar.gz
cd thrift-0.9.0
./configure --without-ruby
make && make install
Run it as the hadoop user:
/usr/local/hive/bin/hive --service hiveserver 10001 >/dev/null  &
57. Hadoop java.net.NoRouteToHostException: No route to host
This happens because the firewall was not turned off.
58. Exception:
Failed to connect to :50010, add to deadNodes and continue java.io.IOException
http://blog.163.com/ly_89/blog/static/186902299201247105751976/
The following error appears when reading the Hadoop out_file:

This happens because all of the datanode's RPC connections are occupied, so the client request exceeds the wait timeout. Address it in two ways:
1. Increase the datanode RPC connection count; the default is 3 and we set it to 5 (our cluster has 30 servers and 300 clients). A value between 3 and 8 is recommended; too many affects the datanode's memory and CPU.
2. Set the client-side property "dfs.socket.timeout"; here the value is "180000".
59. A Hive table can be split into partitions; tables and partitions can be further divided into buckets with CLUSTERED BY,
and the data within a bucket can be ordered with SORT BY.
set hive.enforce.bucketing = true;
create external table f_appid_pcmac_day(appid string,pcmac string)
partitioned by(dt STRING)
clustered by(appid) into 50 BUCKETS
row format delimited fields terminated by '\t';
60. Using mapjoin to implement a non-equi join
http://my.oschina.net/leejun2005/blog/98244
61. Setting the priority of a Hadoop job

Hadoop currently divides jobs into five priority levels: VERY_HIGH, HIGH, NORMAL, LOW, and VERY_LOW;
users can set the priority at submission time with the "mapred.job.priority" parameter.
-jobconf | -D NAME=VALUE: sets a job parameter, where NAME is the parameter name and VALUE its value;
see hadoop-default.xml for the parameters that can be set. It is particularly recommended to use -jobconf mapred.job.name='My Job Name' to set the job name,
-jobconf mapred.job.priority=VERY_HIGH | HIGH | NORMAL | LOW | VERY_LOW to set the job priority,
-jobconf mapred.job.map.capacity=M to run at most M map tasks concurrently,
and -jobconf mapred.job.reduce.capacity=N to run at most N reduce tasks concurrently.
Test:
Setting the priority:
set mapred.job.priority=VERY_HIGH;
62. Converting a Hive managed (internal) table into an external table
ALTER TABLE f_selfpcapachelog_deserialize_hour SET TBLPROPERTIES ('EXTERNAL'='TRUE');

63. Using mapjoin to work around the non-equi join problem
set hive.smalltable.filesize=100M;
set hive.join.emit.interval=10000000;
set hive.mapjoin.check.memory.rows=1000000;
set hive.mapjoin.cache.numrows=1000000;
select
/*+ MAPJOIN(f_ip_dist_new) */
b.ret,b.lerr,b.requesttime,b.uuid,b.pcmac,b.devicemac,b.requestip,b.clientversion,b.channel,b.product,b.report,b.appversion,b.appid,b.error,b.result,b.imode,b.wififlag,b.duid,b.device,b.iosver,b.iosdbver,b.jailbreak,b.jip,b.connectstatus,b.serverip,b.connectstatus_2,b.length,b.beginlength,b.endlength,b.tick,b.btick,b.etick,b.downloadspeed,b.dlfrom,b.resourcelog,b.keyword,b.referer,b.requestua,b.j,b.sn,b.netret,b.tep,b.lasterror,b.url,b.dla,b.def,b.type,b.step,b.cmd,b.dlat,b.dlate,b.test,a.nettype,a.province,b.dt,b.hour
from f_ip_dist_new a
right outer join
f_selfPcApacheLog_deserialize_hour b
where b.dt='2013-06-30'
and ip2byte(b.requestip)>=a.fromint
and ip2byte(b.requestip)<=a.toint
;


64. A deeper look at Hive indexes
http://www.cnblogs.com/end/archive/2013/01/22/2871147.html
65. Studying Hive parameters
http://sishuok.com/forum/blogPost/list/6225.html
set hive.smalltable.filesize=100M;
set hive.join.emit.interval=10000000;
set hive.mapjoin.check.memory.rows=1000000;
set hive.mapjoin.cache.numrows=1000000;
How many rows
in the right-most join operand Hive should buffer before emitting the join result.
66. Merging small files
set hive.merge.mapfiles=true;
67. Table joins involve the notions of a main table and a driving table.
68. How to adjust the number of maps and reduces
http://blog.sina.com.cn/s/blog_9f48885501017dua.html
Adjust the value of the hive.exec.reducers.bytes.per.reducer parameter;
set hive.exec.reducers.bytes.per.reducer=500000000; (500M)
The default is 1 GB.
hive.exec.reducers.bytes.per.reducer (amount of data handled by each reduce task; default 1000^3 = 1 GB)
hive.exec.reducers.max (maximum number of reducers per job; default 999)
The formula for the number of reducers is simply N = min(hive.exec.reducers.max, total input size / bytes.per.reducer).
set hive.exec.reducers.bytes.per.reducer=100000000;
In other words, if the total reduce input (the map output) is no more than 1 GB, there will be only one reduce task.
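A worked example of the formula (an estimate, not a measurement): with roughly 10 GB of map output and hive.exec.reducers.bytes.per.reducer=500000000 (500 MB), N = min(999, 10 GB / 500 MB) ≈ 21 reducers; with the 1 GB default it would be about 10.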
69. Summary of Hive functions
http://www.cnblogs.com/end/archive/2012/06/18/2553682.html
array collect_set(column_1) returns an array of the distinct values.
The explode function expands an array's elements into multiple rows (see the sketch below).
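A small sketch of the two together, using the f_appid_pcmac_day table defined in item 59 (the date value is made up):
-- one array of distinct macs per appid
select appid, collect_set(pcmac) as macs
from f_appid_pcmac_day
where dt='2013-08-10'
group by appid;
-- expand the array back into one row per element (lateral view, see item 83)
select appid, mac
from (select appid, collect_set(pcmac) as macs
      from f_appid_pcmac_day where dt='2013-08-10' group by appid) a
lateral view explode(macs) e as mac;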
70. Fixing hiveserver ESTABLISHED / CLOSE_WAIT errors
http://www.tuicool.com/articles/MRFBRv

71.
hadoop jar /usr/local/hadoop-1.1.2/lib/hadoop-lzo-*.jar com.hadoop.compression.lzo.DistributedLzoIndexer /input/selfLog/2013/04/30/*lzo*

72. When one Hive query spawns multiple jobs,
putting set mapred.reduce.tasks=4; before it caps every job at 4 reducers.
73. Buckets can be used for mapjoin.
74. Hive supports CTAS (CREATE TABLE AS SELECT), and CTAS is atomic: if the SELECT fails for any reason, the new table is not created.
75. Use the LIKE keyword to create a new table with an identical structure: CREATE TABLE new_table LIKE exists_table;
76. DISTRIBUTE BY used together with SORT BY gives per-reducer aggregation and sorting.
If the DISTRIBUTE BY and SORT BY columns are the same, a single CLUSTER BY can replace them (see the sketch below).
Why does CLUSTER BY sometimes not seem to help? My guess is that only one reducer was running.
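A sketch of the equivalence (hypothetical table t; note that CLUSTER BY only sorts ascending):
select * from t distribute by id sort by id;
-- same effect as
select * from t cluster by id;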
77. Running a mapjoin over buckets requires the optimization flag:
set hive.optimize.bucketmapjoin=true;
78. Hive's support for subqueries is very limited: they may only appear in the FROM clause of a SELECT statement, whereas other databases allow subqueries almost anywhere an expression can appear.
Many uses of subqueries can be rewritten as joins.
79. Full joins, semi joins, inner joins, outer joins.

80.
Pitfalls in the ON clause of a LEFT OUTER JOIN.
All filters on the right-hand table, except NULL checks, should go into the ON clause or into a subquery; filters on the left-hand table must not go into ON, or the result will be wrong.
A LEFT OUTER JOIN sometimes only behaves correctly when the right side is pre-filtered in a subquery; filtering in a WHERE clause directly after the join is problematic.
select
/*+MAPJOIN(d_pchome_appid)*/
b.appid,b.appid
from d_pchome_appid b
left outer join
(select
appid
from
f_selfpcapachelog_deserialize_hour
where
dt='2013-08-06'
and dlfrom='webdl%23downloadpchomenet'
and report='dlurl'
)a
on(regexp_replace(b.appid,'[.-]','_')=a.appid)
81. When there is no GROUP BY, a single reducer is used.
82. row_number in Hive
add jar /home/hadoop/jar/kuaiyong_UDF_rowNumber.jar;
create temporary function row_number as 'com.kuaiyong.UDF.RowNumber';
select a.dt,a.appid,a.cnt
from
(
select dt,appid,count(1) cnt
from f_app_hour
where dt>='2013-08-10'
and dt<='2013-08-14'
and appid!=''
and report='install'
group by dt,appid 
order by dt asc,cnt desc
) a
where row_number(dt) <= 100;
83.lateral view
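A minimal lateral view sketch (hypothetical table with a comma-separated tags column):
select t.id, v.tag
from some_table t
lateral view explode(split(t.tags, ',')) v as tag;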
84. With LEFT OUTER JOIN in Hive,
be careful with the conditions in ON; otherwise rows get multiplied and the result is wrong.
For example:
from f_jyzlog_hour a
left outer join f_jyz_station_mac b
on (a.jyz_mac=b.station_mac)
where b.dt='${yesterday_formate}'
multiplies the rows,
and needs to be rewritten as:
from f_jyzlog_hour a
left outer join f_jyz_station_mac b
on a.jyz_mac=b.station_mac and b.dt='${yesterday_formate}'

So for the right-hand table of a LEFT OUTER JOIN, either query it cleanly in a subquery first, or constrain its rows properly in the ON clause.
85.show tables '*channel*';
86. The tables before and after a UNION ALL may use the same table alias.
87. Renaming a table
alter table f_appid_devicemac_day_active rename to f_appid_devicemac_day_active_pc;
88. Hive can order by an aggregated column; columns that have not been aggregated cannot be used for ordering. The following sorts correctly:
select
uuid,
sum(case when date_str>'2013-08-01' and date_str<='2013-08-25' then 1 else 0 end) f_cnt,
sum(case when date_str>='2013-07-05' and date_str<='2013-08-01' then 1 else 0 end) s_cnt
from tmp_chaobo_uuid_stat_20130902_new
where date_str>='2013-07-05'
and date_str<='2013-08-25'
and report='dlurl'
group by uuid

order by f_cnt desc
limit 500
;

89. Be very careful with the WHERE trap in Hive (a LEFT OUTER JOIN must be written the way Hive expects):
filter conditions for the left-hand table must be written inside the ON clause,
filter conditions for the right-hand table must not be written inside the ON clause,
and, whether left or right, it is safest to filter each table in a subquery first and then join.
insert overwrite local directory '/download/hadoop/tmp/20130911/uuid_stat_all_3'
select
c.uuid,c.userType,c.date_str,c.time,c.ip,c.ip_dist,c.report
from tmp_chaobo_uuid_stat_20130902_new c
left outer join
(select
uuid,
sum(case when date_str>'2013-08-01' and date_str<='2013-08-25' then 1 else 0 end) f_cnt,
sum(case when date_str>='2013-07-05' and date_str<='2013-08-01' then 1 else 0 end) s_cnt
from tmp_chaobo_uuid_stat_20130902_new
where date_str>='2013-07-05'
and date_str<='2013-08-25'
and report='dlurl'
group by uuid
)b
on c.uuid=b.uuid and b.f_cnt>1000 and b.s_cnt<100
and b.uuid is not null
;

90. add file did not work from /home/chaobo/python; after moving the file to /download/hadoop/resultDir it worked.
91. Hive optimization principles:
put the large table toward the right in joins;
prefer UDFs over TRANSFORM.
92. To run Hive as well as your own MapReduce programs, HADOOP_CLASSPATH must be configured:
(1) in hadoop-env.sh: export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:$HADOOP_HOME/lib
(2) in /etc/profile: export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:$HADOOP_HOME/lib
93. If a ClassNotFound error occurs when using the Hive interfaces, it is probably a CLASSPATH problem.
94. ORDER BY on the query result adds an extra job.
95. Hive 0.10 does not yet support select a.col1,b.col2 from a,b where a.col1=b.col1;
96. A Hive query has at least as many jobs as it has GROUP BYs.
Multi-group by
97. Grouping by the alias of a CASE WHEN column causes problems.
98. By default Hive only supports single-character field delimiters (the default is \001, Ctrl+A); this can be extended (see the sketch below).
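One way to go beyond a single character is a regex-based SerDe; a sketch, assuming the hive-contrib RegexSerDe is available and the fields are separated by '||' (the jar path is an assumption):
add jar /usr/local/hive/lib/hive-contrib-0.10.0.jar;
create table t_multi_delim (c1 string, c2 string)
row format serde 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
with serdeproperties ("input.regex" = "(.*?)\\|\\|(.*)")
stored as textfile;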
99. Creating a table with the same structure as an existing table:
create table test_1 like test_2;
100. Query-time optimizations in Hive
Use column pruning and partition pruning by setting hive.optimize.cp=true; hive.optimize.pruner=true;
Column pruning reads only the columns that are needed.
Partition pruning reads only the partitions that are needed.
101. To avoid data skew in Hive, the join key columns must have the same data type on both sides (see the sketch below).
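If the types differ (for example a string id joined to a bigint id), casting both sides explicitly avoids the skew caused by implicit conversion; a sketch with hypothetical tables a and b:
select a.*, b.*
from a join b
on (cast(a.user_id as string) = cast(b.user_id as string));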
102. Customizing Hive's InputFormat and OutputFormat
http://www.cnblogs.com/likai198981/archive/2013/05/09/3068586.html
Put the jar kuaiyong_util.jar under hive/lib.
create temporary function UDF_deDesCipher as 'com.kuaiyong.UDF.UDFDeDESCipher';
create temporary function UDF_urlDecode as 'com.kuaiyong.UDF.UDFURLDecode';
create external table f_gameCenter_user_login_out_day_test_2
(
sub_guid string,
guid string,
channel string,
report string,
app_joc_aid string,
time string,
requestip string
)
partitioned by (dt string)
stored as INPUTFORMAT 'com.kuaiyong.util.HiveInputFormatDoubleVertical'  
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
;


alter table f_gameCenter_user_login_out_day_test_2 add partition(dt='2013-12-11') location '/input/gameCenter/game_login_out_log/2013/12/11';


When querying you must also run add file /usr/local/hive/lib/kuaiyong_util.jar; otherwise a class-not-found error is reported.
103. Configuring a default jar path for Hive to load
Just set export HIVE_AUX_JARS_PATH=/usr/local/hive/lib in conf/hive-env.sh.
104. Renaming a Hive table
alter table f_360lite_hour rename to f_ky_crypt_hour;
105.
Reading the table once and writing to multiple outputs works well:
from f_selflog_hour
insert overwrite local directory '/download/hadoop/resultDir/tmp/20131220_dlm2_austate_not0'
select * where dt='2013-12-19'
and regexp_extract(requestargv,'report=([^ &]+)',1) = 'dlurl'
and regexp_extract(requestargv,'ret=([^ &]+)',1) = '0'
and regexp_extract(requestargv,'dlm=([0-9]+)',1)='2'
and regexp_extract(requestargv,'austate=([^ &]+)',1)!='0'
and regexp_extract(requestargv,'austate=([^ &]+)',1) != ''
insert overwrite local directory '/download/hadoop/resultDir/tmp/20131220_dlm104'
select * where dt='2013-12-19'
and regexp_extract(requestargv,'report=([^ &]+)',1) = 'dlurl'
and regexp_extract(requestargv,'ret=([^ &]+)',1) = '0'
and regexp_extract(requestargv,'dlm=([0-9]+)',1)='104'
and regexp_extract(requestargv,'austate=([^ &]+)',1)!=''
106. In Hive, != '' and IS NOT NULL are not the same thing.
For example:
select appname,appdigitalid from t_ipa_base_infor where dt='2013-12-29' and appname!='' and appdigitalid!='' limit 10;
select app_unique_id,ca_id from t_business_company_app where dt='2013-12-29' and ca_id is not null and app_unique_id is not null limit 10;
107. Merging small files.
108.
load data local inpath "/data1/result/tblog/hive/api/hqltest/test1.txt" overwrite into table f_test;
109. In Hive, UNION and joins are sometimes merged into a single job.
