hive 备忘录

set mapred.output.compress=true;

set hive.exec.compress.output=true;

set mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;

set io.compression.codecs=org.apache.hadoop.io.compress.GzipCodec;

INSERT OVERWRITE DIRECTORY 'hive_out' select * from tables limit 10000;

2 应用cloudera 的cdh3进行 hive left outer join，并且两个表都有分区的时候：

方法一：用子查询

方法二：select a.*,b.* from table a left outer join table b on(a.uid=b.uuid and b.dt='2011-08-21') where a.dt='2011-08-21'；

3 hive写sql的时候注意数据类型：

当uid是string的时候

select count(distinct uid) from table where dt = '2011-08-28' and type=2 and loginflag='3' and (uid<'23000000' or (uid>'50000000' and uid<'1500000000'))

select count(distinct uid) from newbehavior_table where dt='2011-08-28' and type=2 and (uid<23000000 or (uid<1500000000 and uid>50000000)) and loginflag='3';

两个sql的结果是不一样的。。。。。

4 在hive建立一个存储apache 日志的表

add jar ../build/contrib/hive_contrib.jar;

CREATE TABLE apachelog (

host STRING,

identity STRING,

user STRING,

time STRING,

request STRING,

status STRING,

size STRING,

referer STRING,

agent STRING)

ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'

WITH SERDEPROPERTIES (

"input.regex" = "([^ ]*) ([^ ]*) ([^ ]*) (-|\\[[^\\]]*\\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\"[^\"]*\") ([^ \"]*|\"[^\"]*\"))?",

"output.format.string" = "%1$s %2$s %3$s %4$s %5$s %6$s %7$s %8$s %9$s"

)

STORED AS TEXTFILE;

猜你喜欢