Learning to Use Hive

HQL Data Queries
Querying data is Hive's primary function.
Query syntax

### select..from statement

The basics work much the same as in MySQL.

select col1, col2 from table_name;
select col1 c1, col2 c2 from table_name;  -- c1 and c2 are column aliases

select l.name, r.course from (select id, name from student) l join (select id, course from student_info) r on l.id = r.id;

select * from student limit 100;

To transform a column's value inside a select statement, Hive supports case...when...then:

select id, enname, sex,
case
when sex='M' then 'Male'
when sex='F' then 'Female'
else 'Unknown'
end
from student;
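
As a small variation on the query above, Hive also accepts the simple form of case, which names the column once and then lists the values to compare against. The sketch below assumes the same student table; the alias sex_label is made up for illustration.

```sql
-- Simple CASE form: compare one expression against a list of values.
-- sex_label is an illustrative alias, not a column of the original table.
select id, enname,
       case sex
         when 'M' then 'Male'
         when 'F' then 'Female'
         else 'Unknown'
       end as sex_label
from student;
```

Giving the case expression an alias makes the output column easier to refer to in an outer query.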

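### where statement

To restrict which rows a query returns, use a where clause:

select * from student where age=18;

You can also combine multiple predicate expressions with or and and:

select * from student where age=18 and sex='F';

The other predicates work much the same as in MySQL.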
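
To make "the other predicates" a little more concrete, here is a short sketch of some commonly used ones. It assumes the student table from the examples above, that enname is a string column, and that c_id is numeric; the literal values are arbitrary.

```sql
-- Common predicates; all examples assume the student table used above.
select * from student where age between 18 and 22;     -- inclusive range
select * from student where enname like 'A%';          -- prefix match
select * from student where c_id in (1, 2, 3);         -- membership in a literal list
select * from student where sex is null or age <> 18;  -- null test and inequality
```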

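### group by and having statements
group by is usually used together with aggregate functions: the result is first grouped by one or more columns and then aggregated.

select count(*) from student group by age;
select avg(age) from student group by c_id;  -- a selected column that does not appear after group by must be wrapped in an aggregate function such as avg()
-- To filter on the grouped results, use a having clause:
select c_id, avg(age) from student where sex='F' group by c_id having avg(age)>18;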
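
The difference between where and having is easier to see when both appear in one query. The following is a minimal sketch on the same student table; the aliases student_cnt and avg_age and the threshold 10 are arbitrary choices for illustration.

```sql
-- Rows are filtered by where before grouping; groups are filtered by having afterwards.
-- student_cnt and avg_age are illustrative aliases.
select c_id,
       count(*) as student_cnt,
       avg(age) as avg_age
from student
where age is not null          -- row-level filter, applied before grouping
group by c_id
having count(*) >= 10;         -- group-level filter, applied after aggregation
```

Selecting the grouping column c_id alongside the aggregates shows which group each result row belongs to.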

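### join statement
Hive supports the usual join statements. Older versions only support equi-joins; non-equality conditions in the on clause are accepted starting with Hive 2.2.0.

1. inner join

This is the default join type.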

select t1.id,t2.id from student t1 join student_info t2 on t1.id=t2.id;

2. left/right outer join

Left and right outer joins.

select t1.id,t2.id from student t1 left outer join student_info t2 on t1.id=t2.id;

3. full outer join

Full outer join.

select t1.id,t2.id from student t1 full outer join student_info t2 on t1.id=t2.id;
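
What distinguishes the outer joins from the inner join is how unmatched rows are handled: the columns of the missing side come back as NULL. The sketch below reuses the student and student_info tables and the course column seen earlier.

```sql
-- Left outer join keeps every row of student; t2 columns are NULL when no match exists.
select t1.id, t2.course
from student t1
left outer join student_info t2 on t1.id = t2.id;

-- Anti-join pattern: students that have no row in student_info.
select t1.id
from student t1
left outer join student_info t2 on t1.id = t2.id
where t2.id is null;
```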

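4. left-semi join

Left semi join is Hive-specific syntax. It returns rows from the left table, provided they satisfy the on condition against the right table. It is used in place of the in operation of standard SQL:

select t1.id from student t1 left semi join student_info t2 on t1.id=t2.id;

This finds the rows of table 1 that also exist in student_info. In older Hive versions (before 0.13) this cannot be written with in as in standard SQL:

select t1.id from student t1 where t1.id in (select t2.id from student_info t2);  -- invalid syntax in those versions

An inner join can give the same result, but with a left semi join Hive stops scanning the right table as soon as it finds a matching row, which makes it more efficient.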
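
One restriction is worth showing in code: in a left semi join the right table may only be referenced in the on clause, not in select or where. The sketch below assumes the student and student_info tables from above; the course value 'math' is invented for the example.

```sql
-- Right-table filters go into the on clause; t2 must not appear in select or where.
select t1.id, t1.name
from student t1
left semi join student_info t2
  on t1.id = t2.id and t2.course = 'math';   -- 'math' is a made-up value

-- Equivalent form on Hive 0.13 and later, which accept in/exists subqueries in where.
select t1.id, t1.name
from student t1
where exists (select 1 from student_info t2
              where t2.id = t1.id and t2.course = 'math');
```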
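5. map-side join

The joins above are executed as MapReduce jobs, and it is easy to see that the Cartesian product of matching rows is computed on the Reduce side. Joins like these are known as reduce-side joins. A map-side join performs the join in the Map phase instead, typically by loading the smaller table into memory.

select t1.id, t2.id from student t1 join student_info t2 on t1.id=t2.id;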
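
Two common ways to ask Hive for a map-side join are sketched below. Which one applies depends on the Hive version and configuration; the assumption that student is the smaller table is made only for this example.

```sql
-- Option 1: let Hive convert the join automatically when one table is small enough.
set hive.auto.convert.join=true;
select t1.id, t2.id
from student t1 join student_info t2 on t1.id = t2.id;

-- Option 2: request a map join explicitly with a hint.
-- Assumes student (t1) is the small table; some versions ignore the hint
-- unless hive.ignore.mapjoin.hint is set to false.
set hive.ignore.mapjoin.hint=false;
select /*+ MAPJOIN(t1) */ t1.id, t2.id
from student t1 join student_info t2 on t1.id = t2.id;
```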

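6. Multi-table join

Hive supports joins over multiple tables, for example:

select * from student t1 join student_info t2 on t1.id=t2.id join student_class t3 on t1.id=t3.id;

In general, Hive starts one job per join: the first job joins table 1 with table 2, the second job joins the output of the first job with table 3, and so on. When every join clause uses the same column of a table as the key, as t1.id does here, Hive can merge the joins into a single MapReduce job. This is worth keeping in mind for queries that need several rounds of MapReduce.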
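
One way to check how many jobs a multi-table join needs is to look at its plan with explain. The sketch below contrasts a join on a shared key with one on different keys; student_class.c_id is an assumed column, and the actual number of stages can differ when automatic map-join conversion is enabled.

```sql
-- Same key column (t1.id) in both on clauses: Hive can use a single MapReduce job.
explain
select t1.id, t2.course
from student t1
join student_info t2 on t1.id = t2.id
join student_class t3 on t1.id = t3.id;

-- Different key columns of t1 (id, then c_id): expect two jobs.
-- student_class.c_id is an assumed column for this example.
explain
select t1.id, t2.course
from student t1
join student_info t2 on t1.id = t2.id
join student_class t3 on t1.c_id = t3.c_id;
```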

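### ORDER BY and SORT BY statements
ORDER BY in Hive has the same semantics as order by in SQL: it performs a global sort. This means a single Reducer has to do all the work, otherwise a global order cannot be guaranteed.

select * from student order by c_id desc, age asc;   -- desc and asc specify the sort direction

Hive has another way of sorting that only performs a local sort inside each Reducer: sort by.

select * from student sort by c_id desc, age asc;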

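order by and sort by are written in exactly the same way. With a single Reducer the two give identical results; with more than one Reducer, the sorted ranges output by sort by may overlap. Suppose the id column of the student table contains:
1, 7, 3, 9, 2, 12
The result of order by id is:

1, 2, 3, 7, 9, 12

With 2 Reducers, the result of sort by id might be, for example (the exact split depends on how rows are distributed to the Reducers):

1, 3, 7, 2, 9, 12

where the first 3 rows are output by the first Reducer and the last 3 rows by the second.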
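
To reproduce this comparison, the number of Reducers can be set before running the two queries. This is a minimal sketch on the student table; which rows each Reducer receives is not under the query's control here.

```sql
-- Force two reducers so the difference is visible.
-- mapred.reduce.tasks is the classic property name; newer Hadoop versions
-- call it mapreduce.job.reduces.
set mapred.reduce.tasks=2;

-- Global order: one sorted result.
select id from student order by id;

-- Local order: each reducer's output is sorted, the overall output is not.
select id from student sort by id;
```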

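### DISTRIBUTE BY statement
With distribute by, Hive lets you control how map output is divided among the Reducers, which is equivalent to the Partitioner in MapReduce. Partitioning this way sends related rows to the same Reducer. For example:

-- distribute by guarantees that rows with the same col1 go to the same Reducer,
-- where they are then sorted by col1 and col2. Within each col1 value, the rows
-- come out in the same order as they would with a single Reducer.
select col1, col2 from student distribute by col1 sort by col1, col2;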
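
Applied to the student table used elsewhere in these notes, the same pattern might look like the sketch below. It assumes two Reducers and uses c_id as the distribution key; both choices are only for illustration.

```sql
set mapred.reduce.tasks=2;

-- All rows of one class (c_id) land on the same reducer,
-- and each reducer sorts its rows by class and then by age, oldest first.
select c_id, id, age
from student
distribute by c_id
sort by c_id, age desc;
```

Unlike order by, this does not produce one globally sorted result; it only keeps each class together and ordered within its Reducer's output.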

### CLUSTER BY
When distribute by and sort by use exactly the same columns, cluster by can be used in their place.

### Bucketing and sampling
I have not worked through this part yet.

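### UNION ALL
union all is used very often in Hive, mainly to merge the rows of several tables. It requires that the columns selected from each table match exactly in type. Hive 0.12.0 and earlier do not support union all at the top level of a query, so it has to be wrapped in a subquery:

select s.cid, s.age
from (select si.cid, si.age from student_info si union all select si1.cid, si1.age from student_info si1) s;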

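### Hive functions
Hive has many built-in functions that are often needed in queries. You can list them with:

show functions;

Standard functions
Aggregate functions
Table-generating functions
User-defined functions (UDF)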
