Hive Study Notes (3): Hive DDL, DML, and Hive Optimization

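DDL: data definition
1) Creating a database
-> List databases
show databases;
-> Create a database
create database hive_db;
-> Create a database, standard form (no error if it already exists)
create database if not exists db_hive;
-> Create a database at a specific HDFS path
create database hive_db1 location '/hive_db';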
2) Altering a database
-> Show the database description
desc database hive_db;
-> Add descriptive properties
alter database hive_db set dbproperties('datais'='hunter');
-> Show the extended properties
desc database extended hive_db;
3) Querying databases
-> List databases
show databases;
-> Filter the database list by pattern
show databases like 'db*';
4) Dropping a database
-> Drop a database
drop database hive_db;
-> Drop a database, standard form (no error if it does not exist)
drop database if exists hive_db;
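The two forms above only succeed on an empty database. A minimal sketch of dropping a database that still contains tables, using Hive's `cascade` keyword on the `hive_db` database from this section:

```sql
-- Fails if hive_db still contains tables (restrict is the default behavior)
drop database if exists hive_db;

-- Drops the tables in hive_db first, then the database itself
drop database if exists hive_db cascade;
```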
5) Creating a table
-> Create a table
create table db_h(id int,name string)
row format
delimited fields
terminated by "\t";
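6) Managed tables (internal tables)
Not well suited to sharing data.
Dropping a managed table in Hive also deletes its data.
-> Load data
load data local inpath '/root/hunter.txt' into table emp;
-> Save a query result as a new table
create table if not exists emp2 as select * from emp where name = 'hunter';
-> Show the table structure
desc formatted emp;
Table Type: MANAGED_TABLE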
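7) External tables
Hive does not consider itself the owner of an external table's data: dropping the table leaves the data in place.
Well suited to sharing data.
-> Create an external table
create external table if not exists emptable(empno int,ename string,job string,mgr int,birthdate string,sal double,comm double,deptno int)
row format
delimited fields
terminated by '\t';
-> Load data
load data local inpath '/root/emp.txt' into table emptable;
-> Show the table structure
desc formatted emptable;
Table Type: EXTERNAL_TABLE
-> Drop the table
drop table emptable;
Tip: creating the same table again with the same columns automatically re-attaches the existing data.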
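As a sketch of the tip above: after `drop table emptable;` the files stay in the table's HDFS directory, so re-running the same DDL (assuming the same database and the default table location) should make the rows queryable again without another `load data`:

```sql
-- Re-create the external table with the same columns as before
create external table if not exists emptable(
  empno int, ename string, job string, mgr int,
  birthdate string, sal double, comm double, deptno int)
row format delimited fields terminated by '\t';

-- No load step: the files left behind by the dropped table are read directly
select * from emptable limit 5;
```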
8) Partitioned tables
-> Create a partitioned table
create table dept_partitions(depno int,dept string,loc string)
partitioned by(day string)
row format delimited fields
terminated by '\t';
-> Load data
load data local inpath '/root/dept.txt' into table dept_partitions;
Note: a direct load like this is not allowed; the target partition must be specified:
load data local inpath '/root/dept.txt' into table dept_partitions partition(day='1112');
-> Add a partition
alter table dept_partitions add partition(day='1113');
-> Query a single partition
select * from dept_partitions where day='1112';
-> Query all partitions
select * from dept_partitions;
-> Show the table structure
desc formatted dept_partitions;
-> Drop a single partition
alter table dept_partitions drop partition(day='1112');
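To check which partitions exist after the add and drop statements above, Hive's `show partitions` can be used. A small sketch against the `dept_partitions` table from this section; the partition values `1114` and `1115` are made up for illustration:

```sql
-- List every partition currently registered for the table
show partitions dept_partitions;

-- Add several partitions in one statement (specs separated by spaces)
alter table dept_partitions add partition(day='1114') partition(day='1115');

-- Drop several partitions in one statement (specs separated by commas)
alter table dept_partitions drop partition(day='1114'), partition(day='1115');
```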
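9) Altering a table
-> Rename a table
alter table emptable rename to new_table_name;
-> Add a column
alter table dept_partitions add columns(desc string);
-> Change a column (name and type)
alter table dept_partitions change column desc desccc int;
-> Replace columns (this replaces the table's entire column list)
alter table dept_partitions replace columns(desccc int);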
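DML: data manipulation
1) Loading data into a table
load data local inpath '/root/itstar.txt' into table hunter;
2) Loading data from HDFS
load data inpath '/hunter.txt' into table hunter;
Tip: this is a move (cut), not a copy: the file leaves its original HDFS path.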
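The move semantics can be checked from the Hive CLI with `dfs` commands. A sketch, assuming the file sits at `/hunter.txt` and the table `hunter` lives under the default warehouse path `/user/hive/warehouse/hunter` (that path is an assumption; it depends on `hive.metastore.warehouse.dir` and the current database):

```sql
-- Before the load: the file is at its original HDFS path
dfs -ls /hunter.txt;

load data inpath '/hunter.txt' into table hunter;

-- After the load: the source path should be gone and the file
-- should appear under the table directory (path assumed, see above)
dfs -ls /user/hive/warehouse/hunter;
```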
3) Overwriting existing data
load data local inpath '/root/itstar.txt' overwrite into table hunter;
4) Creating a partitioned table
create table hunter_partitions(id int,name string) partitioned by (month string) row format delimited fields terminated by '\t';
5) Inserting data into a partitioned table
insert into table hunter_partitions partition(month='201811') values(1,'tongliya');
6) Storing a filtered query result in a new table
create table if not exists hunter_ppp as select * from hunter_partitions where name='tongliya';
7) Loading data at table creation by specifying a location
create table db_h(id int,name string)
row format
delimited fields
terminated by "\t"
location '';
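8) Exporting query results to the local filesystem
From the Hive CLI, write the result into a local directory:
insert overwrite local directory '/root/datas/yangmi.txt' select * from hh where name='yangmi';
From the shell, redirect the output of hive -e into a file:
bin/hive -e 'select * from hunter' > /root/hunter.txt
From the Hive CLI, copy a file out of HDFS:
dfs -get /usr/hive/warehouse/00000 /root;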
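By default `insert overwrite local directory` separates columns with Hive's default delimiter (Ctrl-A), which is awkward to read. A sketch of the same export with an explicit tab delimiter (supported since Hive 0.11). The target `/root/datas/yangmi` is a hypothetical name; Hive treats the target as a directory and replaces its contents:

```sql
-- Export the matching rows as tab-separated text files in a local directory
insert overwrite local directory '/root/datas/yangmi'
row format delimited fields terminated by '\t'
select * from hh where name='yangmi';
```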
Queries
1) Showing column headers and the current database
In hive-site.xml:
<property>
<name>hive.cli.print.header</name>
<value>true</value>
</property>
<property>
<name>hive.cli.print.current.db</name>
<value>true</value>
</property>
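The same two properties can also be switched on for a single session without editing hive-site.xml. A minimal sketch from the Hive CLI:

```sql
-- Print column names above query results (current session only)
set hive.cli.print.header=true;

-- Show the current database in the CLI prompt (current session only)
set hive.cli.print.current.db=true;
```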
2) Basic queries
-> Select the whole table
select * from empt;
-> Select specific columns
select empt.empno,empt.ename from empt;
-> Column aliases
select ename name,empno from empt;
-> Arithmetic operators
Operator  Description
+         addition
-         subtraction
*         multiplication
/         division
%         remainder
&         bitwise AND
|         bitwise OR
^         bitwise XOR
~         bitwise NOT
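A short sketch of a few of these operators in a select list, using the `empt` table and its `ename`, `empno`, and `sal` columns from the other queries in these notes (the alias names are made up):

```sql
-- Salary plus one, and twice the salary
select ename, sal + 1 as sal_plus_one, sal * 2 as double_sal from empt;

-- Remainder and bitwise AND on an integer column: both give 1 for odd empno
select empno, empno % 2 as remainder_2, empno & 1 as low_bit from empt;
```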
-> Functions
(1) Row count: count
select count(*) from empt;
(2) Maximum: max
select max(empt.sal) sal_max from empt;
(3) Minimum: min
select min(empt.sal) sal_min from empt;
(4) Sum: sum
select sum(empt.sal) sal_sum from empt;
(5) Average: avg
select avg(empt.sal) sal_avg from empt;
(6) First two rows: limit
select * from empt limit 2;
-> The where clause
(1) Employees with a salary above 1700
select * from empt where empt.sal > 1700;
(2) Employees with a salary below 1800
select * from empt where empt.sal < 1800;
(3) Employees with a salary between 1500 and 1800
select * from empt where empt.sal between 1500 and 1800;
(4) Employees who have a bonus
select * from empt where empt.comm is not null;
(5) Employees who have no bonus
select * from empt where empt.comm is null;
(6) Employees whose salary is 1700 or 1900
select * from empt where empt.sal in(1700,1900);
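-> like
Use the like operator to select values that match a pattern.
The pattern can contain letters and digits.
(1) Employees whose salary has 6 as its second character
select * from empt where empt.sal like '_6%';
_ matches exactly one character
% matches zero or more characters
(2) Employees whose salary contains a 7
select * from empt where empt.sal like '%7%';
-> rlike (matches against a Java regular expression)
select * from empt where empt.sal rlike '[7]';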
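`rlike` is true when any substring of the value matches the regular expression, so the `'[7]'` pattern above selects the same rows as `like '%7%'`. A sketch of patterns that `like` cannot express, assuming for illustration that `ename` holds lowercase names:

```sql
-- Salaries that contain a 7 (same rows as like '%7%')
select * from empt where sal rlike '7';

-- Names starting with a or s: ^ anchors the pattern at the start
select * from empt where ename rlike '^[as]';

-- Names made up only of lowercase letters: ^...$ forces a full match
select * from empt where ename rlike '^[a-z]+$';
```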
-> Grouping
(1) The group by clause
Average salary of each department in the empt table
select avg(empt.sal) avg_sal,deptno from empt group by deptno;
(2) Highest salary in each department of empt
select max(empt.sal) max_sal,deptno from empt group by deptno;
(3) Departments whose average salary is above 1700
select deptno,avg(sal) avg_sal from empt group by deptno having avg_sal>1700;
Note: having is only used with group by aggregate queries.
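To make the difference between `where` and `having` concrete, here is a sketch that uses both in one query on the `empt` table. The threshold of 1000 is an arbitrary value chosen for illustration:

```sql
-- where filters rows before grouping: only salaries above 1000 enter the average
-- having filters groups after aggregation: only departments averaging above 1700 remain
select deptno, avg(sal) avg_sal
from empt
where sal > 1000
group by deptno
having avg(sal) > 1700;
```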
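-> Join operations
(1) Equi-join
Join the employee and department tables on equal department numbers to get employee number, employee name, and department name
select e.empno,e.ename,d.dept from empt e join dept d on e.deptno=d.deptno;
(2) Left outer join: left join
Employees with no matching department get null in the department columns.
select e.empno,e.ename,d.dept from empt e left join dept d on e.deptno=d.deptno;
(3) Right outer join: right join
select e.empno,e.ename,d.dept from dept d right join empt e on e.deptno=d.deptno;
(4) Multi-table join
Get employee name, department name, and employee location
select e.ename,d.dept,l.loc_name from empt e join dept d on e.deptno=d.deptno join location l on d.loc = l.loc_no;
(5) Cartesian product
To avoid accidental Cartesian products, switch to strict mode.
Show the current mode:
set hive.mapred.mode;
Enable strict mode:
set hive.mapred.mode=strict;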
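A sketch of what strict mode changes in practice, reusing the `empt` and `dept` tables from the join examples. With `hive.mapred.mode=strict`, Hive is expected to reject a join that has no join condition; the exact error text varies by version, so none is shown here:

```sql
set hive.mapred.mode=strict;

-- No on clause: this is a Cartesian product and should be rejected in strict mode
select e.ename, d.dept from empt e join dept d;

-- With a join condition the query is accepted
select e.ename, d.dept from empt e join dept d on e.deptno = d.deptno;
```

Strict mode also rejects `order by` without `limit` and queries on partitioned tables that lack a partition filter, so plain `order by` queries need a `limit` while it is on.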
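-> Sorting
(1) Global sort: order by
Employees sorted by salary in ascending order (asc is the default)
select * from empt order by sal asc;
Descending order
select * from empt order by sal desc;
(2) Employee number and salary, sorted by twice the salary
select empt.empno,empt.sal*2 two2sal from empt order by two2sal;
(3) Sorting within partitions: distribute by with sort by
select * from empt distribute by deptno sort by empno desc;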
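`distribute by` only has a visible effect when there is more than one reducer. A sketch that sets three reducers and writes the result to a local directory so the per-reducer files can be inspected; the reducer count and the output path `/root/datas/distribute_result` are illustrative assumptions:

```sql
-- Use several reducers so rows are actually spread out by deptno
set mapreduce.job.reduces=3;

-- All rows of one deptno land in the same output file;
-- inside each file the rows are sorted by empno descending
insert overwrite local directory '/root/datas/distribute_result'
select * from empt distribute by deptno sort by empno desc;

-- cluster by deptno is shorthand for distribute by deptno sort by deptno (ascending only)
select * from empt cluster by deptno;
```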
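-> Bucketing
Partitioning splits a table by storage path.
Bucketing splits the data files themselves.
(1) Create a bucketed table
create table emp_buck(id int,name string)
clustered by(id) into 4 buckets
row format
delimited fields
terminated by '\t';
(2) Set the property
set hive.enforce.bucketing=true;
(3) Load the data
insert into table emp_buck select * from emp_b;
Note: partitioning divides data into directories, bucketing divides it into files.
Sampling test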
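A sketch of sampling the bucketed table with Hive's `tablesample` clause, using the 4-bucket `emp_buck` table from above. `bucket x out of y on id` hashes `id` into `y` groups and returns group `x`:

```sql
-- About a quarter of the rows: exactly the first of the 4 buckets
select * from emp_buck tablesample(bucket 1 out of 4 on id);

-- About half of the rows: with 4 buckets and y=2 this reads buckets 1 and 3
select * from emp_buck tablesample(bucket 1 out of 2 on id);
```

Because the table is clustered on the same column, Hive can read only the matching bucket files instead of scanning the whole table.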
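-> User-defined functions
So far only Hive's built-in functions (sum/avg/max/min...) have been used.
There are three kinds of user-defined function:
UDF: one row in, one value out (User-Defined Function)
UDAF: many rows in, one value out (like count, max, min)
UDTF: one row in, many rows out
(1) Add the Hive dependency jars to the project
They are under hive/lib.
(2) Upload the jar to the server
alt+p
(3) Add the jar to Hive
add jar /root/lower.jar;
(4) Register the function
create temporary function my_lower as "com.itstaredu.com.Lower";
(5) Use it
select ename,my_lower(ename) lowername from empt;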
Hive optimization
-> Compression
(1) Enable compression of map-stage output
Enable compression of Hive's intermediate data:
set hive.exec.compress.intermediate=true;
Enable compression of map output:
set mapreduce.map.output.compress=true;
Set the compression codec:
set mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;
(2) Enable compression of reduce output
Enable compression of Hive's final output:
set hive.exec.compress.output=true;
Enable compression of the final MapReduce output:
set mapreduce.output.fileoutputformat.compress=true;
Set the compression codec:
set mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;
Use block compression:
set mapreduce.output.fileoutputformat.compress.type=BLOCK;
-> Storage
Hive storage formats: TextFile/SequenceFile/orc/Parquet
orc file layout: Index Data/Row Data/Stripe Footer
Compression ratio:
orc > parquet > textFile
Query speed:
orc (50s) is faster than textFile (54s)
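A sketch of how such a comparison can be set up: create an ORC copy of an existing text table with `create table ... as select`, since `load data` only moves files and does not convert formats. The table name `empt_orc` is hypothetical, and `orc.compress` is the ORC table property that selects the codec:

```sql
-- Copy the text-format table empt into a Snappy-compressed ORC table
create table empt_orc
stored as orc
tblproperties("orc.compress"="SNAPPY")
as select * from empt;

-- Inspect the new table; the table parameters usually include its size on disk
desc formatted empt_orc;

-- Run the same query against both tables to compare timings
select count(*) from empt;
select count(*) from empt_orc;
```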
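-> Group by optimization
Grouping runs as a MapReduce job: the map stage sends all rows with the same key to one reducer, so a key with a very large number of rows overloads that reducer.
Solutions:
Aggregate on the map side (combiner)
set hive.map.aggr=true;
Enable load balancing for skewed data
set hive.groupby.skewindata=true;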
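A sketch of the two settings applied to a grouped query on the `empt` table. With `hive.groupby.skewindata` on, Hive plans the aggregation as two jobs: the first spreads rows randomly across reducers and aggregates partially, the second groups the partial results by the real key:

```sql
set hive.map.aggr=true;
set hive.groupby.skewindata=true;

-- Runs as two jobs when skewindata is on (partial, then final aggregation)
select deptno, count(*) cnt from empt group by deptno;

-- The query plan should show the additional stage
explain select deptno, count(*) cnt from empt group by deptno;
```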
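-> Data skew
(1) Avoiding data skew
Set a sensible number of map tasks
Merge small files
set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
Set a sensible number of reduce tasks
(2) Handling data skew
Aggregate on the map side (combiner)
set hive.map.aggr=true;
Enable load balancing
set hive.groupby.skewindata=true;
(3) JVM reuse
In mapred-site.xml, set mapreduce.job.jvm.numtasks to a value between 10 and 20.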
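For the reduce task count mentioned under (1), Hive either estimates it from the input size or takes a fixed value. A sketch of the relevant session settings; the numbers are illustrative, not recommendations:

```sql
-- Target amount of input per reducer (256 MB here); Hive derives the reducer count from it
set hive.exec.reducers.bytes.per.reducer=256000000;

-- Upper bound on the number of reducers Hive will pick
set hive.exec.reducers.max=1009;

-- Or fix the reducer count for the following queries (-1 restores automatic estimation)
set mapreduce.job.reduces=10;
```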