Hive memo

1 Writing Hive query results as gzip-compressed output

    Before running the query, set the following parameters:

set mapred.output.compress=true;
set hive.exec.compress.output=true;
set mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
set io.compression.codecs=org.apache.hadoop.io.compress.GzipCodec;
INSERT OVERWRITE DIRECTORY 'hive_out' select * from tables limit 10000;

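2 Running a Hive left outer join on Cloudera CDH3 when both tables are partitioned:
Method 1: use subqueries.
Method 2: put the partition filter for the right-hand table in the ON clause and the one for the left-hand table in the WHERE clause. If b.dt were filtered in WHERE instead, rows of a without a match in b would be dropped, because b.dt is NULL for them.
select a.*, b.* from table_a a left outer join table_b b on (a.uid = b.uuid and b.dt = '2011-08-21') where a.dt = '2011-08-21';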
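
Method 1 is only named above, so here is a rough sketch of what the subquery form might look like. The names table_a and table_b are placeholders; uid, uuid and dt are the columns from the Method 2 query. Each side is reduced to a single partition before the join, so no partition filter is needed in ON or WHERE.

```sql
-- Sketch of Method 1: restrict each table to one partition in a subquery, then join.
-- table_a and table_b are placeholder table names.
SELECT a.*, b.*
FROM (SELECT * FROM table_a WHERE dt = '2011-08-21') a
LEFT OUTER JOIN
     (SELECT * FROM table_b WHERE dt = '2011-08-21') b
ON (a.uid = b.uuid);
```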

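3 Mind the data types when writing Hive SQL:
When uid is a STRING column:
select count(distinct uid) from newbehavior_table where dt = '2011-08-28' and type=2 and loginflag='3' and (uid<'23000000' or (uid>'50000000' and uid<'1500000000'));
select count(distinct uid) from newbehavior_table where dt='2011-08-28' and type=2 and (uid<23000000 or (uid<1500000000 and uid>50000000)) and loginflag='3';
The two queries return different results. The first compares uid with quoted literals, so the comparison is a lexicographic string comparison; the second compares uid with numeric literals, so Hive implicitly converts uid to a number and compares numerically.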
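
If the numeric ranges are what is intended, an explicit cast removes the dependence on how the literals are written. This is a sketch only, assuming every uid is a numeric string that fits in a BIGINT; a non-numeric uid casts to NULL and drops out of the count.

```sql
-- Compare uid numerically regardless of how the literals are quoted.
-- Assumption: uid holds numeric strings that fit in a BIGINT.
SELECT count(DISTINCT uid)
FROM newbehavior_table
WHERE dt = '2011-08-28'
  AND type = 2
  AND loginflag = '3'
  AND (CAST(uid AS BIGINT) < 23000000
       OR (CAST(uid AS BIGINT) > 50000000
           AND CAST(uid AS BIGINT) < 1500000000));
```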

4 Creating a Hive table to store Apache logs
add jar ../build/contrib/hive_contrib.jar;
 
CREATE TABLE apachelog (
  host STRING,
  identity STRING,
  user STRING,
  time STRING,
  request STRING,
  status STRING,
  size STRING,
  referer STRING,
  agent STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "([^ ]*) ([^ ]*) ([^ ]*) (-|\\[[^\\]]*\\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\"[^\"]*\") ([^ \"]*|\"[^\"]*\"))?",
  "output.format.string" = "%1$s %2$s %3$s %4$s %5$s %6$s %7$s %8$s %9$s"
)
STORED AS TEXTFILE;
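
Once the table exists, a log file can be loaded and queried like any other table. The sketch below assumes a hypothetical local file /tmp/access_log whose lines match the regex above. Lines that do not match usually come back with NULL columns, and the contrib jar generally has to be added in every session that reads the table, since the SerDe class must be on the classpath.

```sql
-- /tmp/access_log is a hypothetical path; replace it with a real log file.
LOAD DATA LOCAL INPATH '/tmp/access_log' INTO TABLE apachelog;

-- Count requests per HTTP status code (status is a STRING column in this table).
SELECT status, count(*) AS hits
FROM apachelog
GROUP BY status;
```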


Reposted from baiyunl.iteye.com/blog/1156881