Hive Summary

I. Startup

hive --service metastore    (check the port: ss -nal | grep 9083)
hiveserver2                 (check the port: 10000)
hive
Sample data:
11,zhangsan12,daqiu-kanshu-kandingyi,beijing:tiananmeng-shanghai:pudong-shengzheng:huaqingbei,nan,23
12,xiaoming13,pingshu-xiangsheng-moshu,beijing:jingnan-hebei:zhijiazhan-henan:zhengzhou,nv,34
11,zhangsan14,daqiu-kanshu-kandingyi,beijing:tiananmeng-shanghai:pudong-shengzheng:huaqingbei,nan,45
12,xiaoming15,pingshu-xiangsheng-moshu,beijing:jingnan-hebei:zhijiazhan-henan:zhengzhou,nv,23
11,zhangsan17,daqiu-kanshu-kandingyi,beijing:tiananmeng-shanghai:pudong-shengzheng:huaqingbei,nan,56
12,xiaoming34,pingshu-xiangsheng-moshu,beijing:jingnan-hebei:zhijiazhan-henan:zhengzhou,nv,34
II. Basic operations

Create a Hive table:
create table psn2(id int ,name string,likes array<string>,address map<string,string>
)Row format delimited fields terminated by ','
collection items terminated by '-'
map keys terminated by ':'
stored as textfile;
Create a table using an existing table as a template (schema only):
create table psn6 like psn2;
Create a table from a query:
create table psn7 as select * from psn2;
External table:
create external table psn3(id int ,name string,likes array<string>,address map<string,string>
)Row format delimited fields terminated by ','
collection items terminated by '-'
map keys terminated by ':'
location '/external_table/psn3';
Partitioned table:
create table psn4(id int ,name string,likes array<string>,address map<string,string>
)partitioned by (sex string)
Row format delimited fields terminated by ','
collection items terminated by '-'
map keys terminated by ':';
Table with multiple partition columns:
create table psn5(id int ,name string,likes array<string>,address map<string,string>
)partitioned by (sex string,age int)
Row format delimited fields terminated by ','
collection items terminated by '-'
map keys terminated by ':';
Load data into a partition:
load data local inpath '/app/hive/apache-hive-2.1.1-bin/Testdata/data1' into table psn5 partition (sex='nan',age=32);
Add a partition:
alter table psn5 add partition (sex='weizhi',age=12);
Drop a partition:
alter table psn5 drop partition (sex='weizhi');
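Dropping by the inner (nested) partition column also removes the matching partitions under every outer partition value: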
alter table psn5 drop partition (age=32);
Save the aggregated result of a query as a table:
create table psn8(sum int);
insert into table psn8 select count(*) from psn2;
The following SQL can be used instead:
create table psn9 as select count(*) from psn2;

Example:
Filtering Tomcat logs
CREATE TABLE logtbl (
host STRING,
identity STRING,
t_user STRING,
time STRING,
request STRING,
referer STRING,
agent STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = "([^ ]*) ([^ ]*) ([^ ]*) \\[(.*)\\] \"(.*)\" (-|[0-9]*) (-|[0-9]*)"
)
STORED AS TEXTFILE;

Load the data:
load data local inpath '/app/hive/apache-hive-2.1.1-bin/Testdata/log' into table logtbl;

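Hive does not validate the data when it is loaded; the data is checked against the schema only when it is read.
This is schema-on-read, as opposed to schema-on-write.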
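
To make the read-time check concrete, here is a minimal Python sketch (Python is used only for illustration, it is not Hive code). It applies the same pattern as "input.regex" to raw lines at read time. The sample log line is hypothetical, and the all-NULL result for a non-matching line is my understanding of how RegexSerDe behaves, so treat it as an assumption.

```python
import re

# Same pattern as "input.regex" in the logtbl DDL, with the SQL escaping removed.
LOG_PATTERN = re.compile(
    r'([^ ]*) ([^ ]*) ([^ ]*) \[(.*)\] "(.*)" (-|[0-9]*) (-|[0-9]*)'
)
COLUMNS = ["host", "identity", "t_user", "time", "request", "referer", "agent"]


def read_row(line):
    """Apply the table schema to one raw line at read time."""
    match = LOG_PATTERN.fullmatch(line)
    if match is None:
        # Assumption: a line that does not match yields NULL in every column.
        return dict.fromkeys(COLUMNS)
    # Capture groups are assigned to columns by position.
    return dict(zip(COLUMNS, match.groups()))


# Hypothetical input lines; loading would accept both without complaint.
good = '192.168.57.4 - - [29/Feb/2016:18:14:35 +0800] "GET /bg-upper.png HTTP/1.1" 304 -'
bad = "this line is not an access-log entry"

print(read_row(good)["request"])  # GET /bg-upper.png HTTP/1.1
print(read_row(bad)["host"])      # None
```

Both lines would be loaded into the table as they are; the difference only shows up when a query reads them.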

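Connecting with Beeline
Server:
hiveserver2
Client (two options):
1. [hadoop@worker1 conf]$ beeline -u jdbc:hive2://master:10000 -n hadoop
2. beeline
   !connect jdbc:hive2://master:10000 hadoop 123

Difference from the hive CLI: Beeline connects to HiveServer2 over JDBC, which acts as an intermediate service.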

Hive dynamic partitioning: each partition corresponds to one directory, which improves query efficiency.
set hive.exec.dynamic.partition = true;
set hive.exec.dynamic.partition.mode=nostrict;
Create a regular (non-partitioned) table:
create table psn11(
id int ,name string,likes array<string>,address map<string,string>,sex string,age int
)Row format delimited fields terminated by ','
collection items terminated by '-'
map keys terminated by ':'
stored as textfile;

Create the partitioned table:
create table psn12(id int ,name string,likes array<string>,address map<string,string>
)partitioned by (sex string,age int)
Row format delimited fields terminated by ','
collection items terminated by '-'
map keys terminated by ':';
Load the data from the regular table into the partitioned table, routing rows by sex and age:
from psn11
insert overwrite table psn12 partition(sex,age)
select id,name,likes,address,sex,age distribute by sex,age;

Result:
Loaded : 1/5 partitions.
Loaded : 2/5 partitions.
Loaded : 3/5 partitions.
Loaded : 4/5 partitions.
Loaded : 5/5 partitions.
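
The routing can be pictured with a small Python sketch (illustration only, not Hive code). It assumes psn11 holds the six sample rows from the top of these notes, which the notes do not state explicitly, and it reduces each row to id, name, sex and age. Under that assumption the rows fall into exactly five (sex, age) combinations, which agrees with the 5 partitions reported above.

```python
from collections import defaultdict

# Assumption: psn11 contains the six sample rows, reduced to (id, name, sex, age).
ROWS = [
    (11, "zhangsan12", "nan", 23),
    (12, "xiaoming13", "nv", 34),
    (11, "zhangsan14", "nan", 45),
    (12, "xiaoming15", "nv", 23),
    (11, "zhangsan17", "nan", 56),
    (12, "xiaoming34", "nv", 34),
]


def route_by_partition(rows):
    """Group rows by their (sex, age) values, one directory per combination."""
    partitions = defaultdict(list)
    for row_id, name, sex, age in rows:
        # Partition directories use the key=value naming convention.
        directory = f"sex={sex}/age={age}"
        partitions[directory].append((row_id, name))
    return dict(partitions)


partitions = route_by_partition(ROWS)
for directory in sorted(partitions):
    print(directory, partitions[directory])
print(len(partitions), "partitions")  # 5 partitions
```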

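Hive bucketing:
(File level) each bucket corresponds to one file.
Bucketing hashes a column value and stores rows in different files according to that hash.
Any table or partition in Hive can be further divided into buckets.
The bucket of a row is the hash of the column value modulo the number of buckets.

Use cases:
data sampling, map-join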

set hive.enforce.bucketing=true;

Create a bucketed table:
Data:
1,tom,11
2,cat,22
3,dog,33
4,hive,44
5,hbase,55
6,mr,66
7,alice,77
8,scala,88
Bucket by age into 4 buckets:
create table psnbucket(id int,name string,age int)
clustered by (age) into 4 buckets
row format delimited fields terminated by ',';
Create a regular table:
create table psn31 (id int,name string,age int)
row format delimited fields terminated by ',';

Load the data from the regular table into the bucketed table:
insert into table psnbucket select id,name,age from psn31;

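Sampling with tablesample(bucket x out of y on col): the table has 4 buckets, so out of 4 samples 4/4 = 1 bucket, and bucket 2 means sampling starts at the 2nd bucket. With out of 2, 4/2 = 2 buckets are sampled: the first is bucket 2 and the second is bucket 2+2 = 4.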
select id,name,age from psnbucket tablesample(bucket 2 out of 4 on age);
3 dog 33
7 alice 77
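
The bucket assignment and the sample above can be reproduced with a short Python sketch (illustration only, not Hive code). It assumes that an int column hashes to its own value and that the bucket is (hash & Integer.MAX_VALUE) % number of buckets; this is my understanding of Hive's default behaviour, and it is consistent with the two rows returned above.

```python
NUM_BUCKETS = 4
INT_MAX = 0x7FFFFFFF

# Rows of psn31 / psnbucket: (id, name, age).
ROWS = [
    (1, "tom", 11), (2, "cat", 22), (3, "dog", 33), (4, "hive", 44),
    (5, "hbase", 55), (6, "mr", 66), (7, "alice", 77), (8, "scala", 88),
]


def bucket_of(age, num_buckets=NUM_BUCKETS):
    """1-based bucket number; assumes an int column hashes to its own value."""
    return (age & INT_MAX) % num_buckets + 1


def tablesample(rows, x, y):
    """Rows selected by tablesample(bucket x out of y on age)."""
    return [row for row in rows if (row[2] & INT_MAX) % y == x - 1]


for row in ROWS:
    print(row, "-> bucket", bucket_of(row[2]))

# bucket 2 out of 4: only the 2nd bucket is read.
print(tablesample(ROWS, 2, 4))  # [(3, 'dog', 33), (7, 'alice', 77)]

# bucket 2 out of 2: the rows come from the 2nd and 4th buckets.
print(sorted({bucket_of(row[2]) for row in tablesample(ROWS, 2, 2)}))  # [2, 4]
```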

Hive Lateral View:
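   Used together with UDTFs such as explode (often applied to the array returned by split, which itself is an ordinary UDF)
   The UDTF first expands each input row into multiple rows; those rows are then combined into a virtual table that can be given an alias
   It mainly works around the restriction that a select using a UDTF may contain only that single UDTF: no other columns and no additional UDTFs.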

select explode(likes) from psn2;
select count(explode(likes)) from psn2;
FAILED: SemanticException [Error 10081]: UDTF's are not supported outside the SELECT clause, nor nested in expressions
select explode(likes),id from psn2;
FAILED: SemanticException 1:22 Only a single expression in the SELECT clause is supported with UDTF's. Error encountered near token 'id'

   Syntax:
       lateral view udtf(expression) tableAlias as columnAlias (',' columnAlias)*
   Example:
   Count how many distinct hobbies and how many distinct cities appear in the person table
select count(distinct(myCol1)),count(distinct(myCol2)) from psn2
lateral view explode(likes) myTable1 as myCol1
lateral view explode(address) myTable2 as myCol2,myCol3;
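
What the two lateral views do can be sketched in Python (illustration only, not Hive code): every input row is joined with each element of its exploded array and each key/value pair of its exploded map. The two rows below are the two distinct likes/address combinations from the sample data; using only these two is a simplification made here.

```python
# The two distinct sample rows: (id, likes array, address map).
ROWS = [
    (11, ["daqiu", "kanshu", "kandingyi"],
     {"beijing": "tiananmeng", "shanghai": "pudong", "shengzheng": "huaqingbei"}),
    (12, ["pingshu", "xiangsheng", "moshu"],
     {"beijing": "jingnan", "hebei": "zhijiazhan", "henan": "zhengzhou"}),
]


def lateral_view(rows):
    """Join each row with explode(likes) and explode(address)."""
    for row_id, likes, address in rows:
        for my_col1 in likes:                          # myTable1 as myCol1
            for my_col2, my_col3 in address.items():   # myTable2 as myCol2, myCol3
                yield row_id, my_col1, my_col2, my_col3


expanded = list(lateral_view(ROWS))
print(len(expanded))                      # 18 rows in the virtual table
print(len({row[1] for row in expanded}))  # 6 distinct hobbies (myCol1)
print(len({row[2] for row in expanded}))  # 5 distinct map keys (myCol2)
```

Because the expanded rows keep the original columns next to the exploded ones, other fields and several UDTFs can appear in the same select.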

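Hive views:
   Characteristics:
       Materialized views are not supported (in the Hive 2.1.1 used here)
       Views can only be queried; data cannot be loaded into them
       Creating a view only stores metadata; the underlying subquery runs only when the view is queried
       If the view definition contains order by or limit and the query on the view also contains order by or limit, the clauses in the view definition take precedence.
       Views can be nested (a view can be defined on top of another view)

   Syntax:
Create a view:
create view viewname as select ...;
Query a view:
select columns from viewname;
Drop a view:
drop view viewname;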

Hive indexes:
Create an index:
create index t1_index on table psn2(name)
as 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler'
with deferred rebuild
in table t1_index_table;

After creating an index it must be rebuilt before it takes effect:
alter index t1_index on psn2 rebuild;

create index t2_index on table psn3(name)
as 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler'
with deferred rebuild;
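If no index table is specified, the index is stored by default in default__psn3_t2_index__

After creating an index it must be rebuilt before it takes effect: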
alter index t2_index on psn3 rebuild;

Drop an index:
drop index t1_index on psn2;

Ways to run Hive

hive> dfs -ls /;
Found 5 items
drwxr-xr-x - hadoop supergroup 0 2018-07-16 20:46 /cnpc
drwxr-xr-x - hadoop supergroup 0 2018-07-23 15:41 /data
drwxr-xr-x - hadoop supergroup 0 2018-07-23 14:58 /external_table
drwx------ - hadoop supergroup 0 2018-07-23 11:30 /tmp
drwxr-xr-x - hadoop supergroup 0 2018-07-23 10:22 /user

hive -S -e "select * from psn2" >> psn2
hive -f sql1
hive -i sql1    (runs sql1 as an initialization file, then stays in the Hive command line)

vim sql.sh
#! /bin/bash
hive -f /app/hive/apache-hive-2.1.1-bin/sqlscripts/sql1

chmod u+x sql.sh
./sql.sh

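III. Hive GUI (HWI)

Download the source package:
Place the HWI war file under $HIVE_HOME/lib/
   How to build it: package all files under hwi/web/ into a war file
   cd apache-hive-2.1.1-src/hwi/web
   jar -cvf hive-hwi.war *
Copy tools.jar (from $JAVA_HOME/lib) into $HIVE_HOME/lib/
Edit hive-site.xml
Start the HWI service (port 9999)
   hive --service hwi
Open in a browser
   http://master:9999/hwi/
   If an error is reported, refresh a few times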

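IV. Hive authorization:
Three authorization models:
   1.Storage Based Authorization in the Metastore Server
   Storage-based authorization; protects the metadata, at table level
   2.SQL Standards Based Authorization in HiveServer2
   Hive authorization based on the SQL standard; recommended
   3.Default Hive Authorization(Legacy Mode)
   Hive's default authorization; it only guards against accidental misuse and cannot stop malicious users

   A role is a set of privileges; users are granted privileges through roles
   Default roles: public, admin

   A user-defined function can be registered as a permanent function by the admin role

Create a role (and other role commands):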
create role role_name;
drop role role_name;
set role admin;
show current roles;

Grant privileges:
select
insert
update
delete
all

grant insert on psn2 to user root with grant option;
Show privileges:
show grant

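V. Hive optimization:
Core idea: optimize Hive SQL as if it were a MapReduce program
SQL that is not converted into MR:
   select that only reads columns of the table itself
   where that only filters on columns of the table itself

Explain shows the execution plan
   explain extended query
   Example: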
explain select * from psn2;
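Optimization techniques:
1. Small inputs can run in local mode
           set hive.exec.mode.local.auto=true;
           Note: hive.exec.mode.local.auto.inputbytes.max=128M
           If the input is larger than this value, the job still runs on the cluster
2. Parallel execution
       set hive.exec.parallel=true;
       Note: hive.exec.parallel.thread.number
       is the maximum number of jobs allowed to run in parallel within one SQL statement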
select t1.ct1,t2.ct2 from
(select count(id) as ct1 from psn2) t1,
(select count(name) as ct2 from psn2) t2;
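3. Strict mode
   set hive.mapred.mode=strict;
   Query restrictions:
   1. A query on a partitioned table must filter on the partition column in its where clause
   2. An order by statement must include a limit (order by uses a single reduce task by default and performs a total sort of the result on the reduce side)
   3. Queries that produce a Cartesian product are rejected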

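   Hive sorting:
   Order by: total sort of the result in a single reduce task; in strict mode it must include a limit
   Sort by: sorts the data within each reducer (local sort)
   Distribute by: partitions rows across reducers by the given columns; often combined with sort by
   Cluster by: equivalent to distribute by + sort by on the same columns (asc/desc cannot be specified); to choose the direction, use distribute by column sort by column asc|desc ==> rows are distributed to reducers by the column and sorted within each reducer, so the output is ordered per reducer rather than globally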
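
The difference between the sorting clauses can be sketched in Python (illustration only, not Hive code). The rows, the two reducers and the use of age % 2 as the partitioner are all assumptions made for this example.

```python
NUM_REDUCERS = 2

# Hypothetical (age, name) rows; age is the distribute by / sort by column.
ROWS = [(33, "dog"), (11, "tom"), (88, "scala"), (22, "cat"), (77, "alice"), (44, "hive")]


def distribute_and_sort(rows, num_reducers=NUM_REDUCERS, descending=False):
    """distribute by age sort by age asc|desc, with age % num_reducers as partitioner."""
    reducers = [[] for _ in range(num_reducers)]
    for row in rows:
        reducers[row[0] % num_reducers].append(row)  # distribute by
    # sort by: each reducer sorts only the rows it received.
    return [sorted(part, key=lambda r: r[0], reverse=descending) for part in reducers]


def order_by(rows):
    """order by age: every row goes through one reducer, giving a total order."""
    return sorted(rows, key=lambda r: r[0])


parts = distribute_and_sort(ROWS)
print(parts[0])  # [(22, 'cat'), (44, 'hive'), (88, 'scala')]
print(parts[1])  # [(11, 'tom'), (33, 'dog'), (77, 'alice')]

flattened = [row for part in parts for row in part]
print(flattened == order_by(ROWS))  # False: sorted per reducer, not globally
```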

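   Hive join:
   In a join, put the small table (the driving table) first ==> the small table is loaded first and the large table is matched against it
   map join: the join is performed on the map side
       Two ways to enable it:
           1. In SQL: add a MapJoin hint to the statement
           Syntax: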
           select /*+mapjoin(smalltable)*/smalltable.key,bigtable.value from smalltable join bigtable on smalltable.key=bigtable.key;
           2. Enable automatic MapJoin conversion
           set hive.auto.convert.join=true;
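
The idea behind a map join can be sketched as an in-memory hash join in Python (illustration only, not Hive code; the table contents are hypothetical): the small table is loaded into a hash table, and each row of the large table is matched against it without a shuffle to the reduce side.

```python
def map_join(small_table, big_table):
    """Inner join on key: hash the small side, stream the big side."""
    # Build phase: the whole small table is held in memory.
    lookup = {}
    for key, value in small_table:
        lookup.setdefault(key, []).append(value)
    # Probe phase: each big-table row is matched where it is read.
    for key, value in big_table:
        for small_value in lookup.get(key, []):
            yield key, small_value, value


small = [(1, "beijing"), (2, "shanghai")]       # hypothetical smalltable rows
big = [(1, "a"), (2, "b"), (1, "c"), (3, "d")]  # hypothetical bigtable rows

joined = list(map_join(small, big))
print(joined)  # [(1, 'beijing', 'a'), (2, 'shanghai', 'b'), (1, 'beijing', 'c')]
```

This only works while the small side fits in memory, which is why the hint names the small table.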
   Map-side aggregation (e.g. count)
   Enabled with:
       set hive.map.aggr=true;

       ***hive.groupby.skewindata
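           Whether to optimize for data skew produced by group by; defaults to false
           How it addresses skew: two MR jobs are run. 1. The map output is distributed randomly to the reducers, which aggregate partially. 2. The output of the first job is distributed by the actual group-by key and aggregated again, which relieves the skew
       hive.map.aggr.hash.min.reduction:
       The minimum reduction that map-side aggregation must achieve. If rows after aggregation / input rows exceeds the threshold (default 0.5), for example 9000/10000 = 0.9 > 0.5, aggregation is no longer done on the map side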
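
The ratio check can be sketched in Python (illustration only, not Hive code). The function name and the sample key lists are made up for this example; the 0.5 threshold is the value used in the note above.

```python
from collections import Counter


def map_side_aggregate(keys, min_reduction=0.5):
    """Return (ratio, keep_aggregating) for the group-by keys seen by one mapper."""
    if not keys:
        return 0.0, True
    counts = Counter(keys)           # hash aggregation: one entry per distinct key
    ratio = len(counts) / len(keys)  # rows after aggregation / rows before
    return ratio, ratio <= min_reduction


# 10000 rows with 9000 distinct keys: aggregation barely shrinks the data.
mostly_unique = [i % 9000 for i in range(10000)]
# 10000 rows with only 10 distinct keys: aggregation shrinks the data a lot.
few_keys = [i % 10 for i in range(10000)]

print(map_side_aggregate(mostly_unique))  # (0.9, False)
print(map_side_aggregate(few_keys))       # (0.001, True)
```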
   Controlling the number of map and reduce tasks in Hive
       map side: mapred.max.split.size
               mapred.min.split.size.per.node
               mapred.min.split.size.per.rack
       reduce side:
           mapred.reduce.tasks=10
           hive.exec.reducers.bytes.per.reducer
           hive.exec.reducers.max
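
How the three reduce-side settings interact can be sketched in Python (illustration only, not Hive code). The formula reflects my understanding of Hive's estimate: input size divided by hive.exec.reducers.bytes.per.reducer, capped by hive.exec.reducers.max, unless mapred.reduce.tasks is set explicitly. The 256 MB and 1009 values are example inputs; check the defaults of the Hive version in use.

```python
import math

MB = 1024 * 1024


def estimate_reducers(input_bytes, bytes_per_reducer, reducers_max, reduce_tasks=-1):
    """Number of reducers for a job with the given input size."""
    if reduce_tasks > 0:
        # An explicit mapred.reduce.tasks overrides the estimate.
        return reduce_tasks
    estimated = math.ceil(input_bytes / bytes_per_reducer)
    # At least one reducer, at most hive.exec.reducers.max.
    return max(1, min(reducers_max, estimated))


# 10 GB of input at 256 MB per reducer.
print(estimate_reducers(10 * 1024 * MB, 256 * MB, 1009))                   # 40
# Same job with mapred.reduce.tasks=10.
print(estimate_reducers(10 * 1024 * MB, 256 * MB, 1009, reduce_tasks=10))  # 10
# 1 TB of input: the estimate of 4096 is capped by reducers_max.
print(estimate_reducers(1024 * 1024 * MB, 256 * MB, 1009))                 # 1009
```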
