Programming Hive: Notes, Part 2

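Note: some material from "Practical Hive: A Guide to Hadoop's Data Warehouse System" (referred to below as Practical Hive) is also mixed into these notes.
Chapter 7: Views
7.1
from ( ... ) a
select ... ;
-- Oracle has no such syntax; it is worth borrowing and is somewhat like WITH
-- create a view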
create view if not exists stock_basic_test_view(stock_id,stock_name)
comment 'stock view'
tblproperties('creator'='zhangyt')
as select stock_id,stock_name from stock_basic_test;
-- the column list of the view is optional

create view if not exists stock_basic_partition_view  
comment 'stock view'
tblproperties('creator'='zhangyt')
as 
select 
stock_id,
stock_name ,
count(*)  
from  stock_basic_partition 
group by  stock_id,stock_name ;
-- a computed column that is given no alias gets a name of the form _cN

describe  stock_basic_test_view ;

Create a view on top of another view (the statement below uses as select rather than like):
create  view   stock_basic_test_view2 as select  * from  stock_basic_test_view;

Drop a view
drop  view  stock_basic_test_view2;

Chapter 8: Indexes
8.1 Creating an index
Regular (compact) index:
create index stock_basic_index
on table stock_basic_test(stock_id)
as 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler'
with deferred rebuild
in table index_table
comment 'stock_basic_partition indexed by stock_id ';

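    Notes: as 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler' names the index handler, which is a Java class; the case must match exactly.
    in table is optional: the index data does not have to be kept in a separately named table.
    Caution: an index table can be used by only one index; reusing it for a second index fails with a "table already exists" error, so it is better to leave this clause out.
    partitioned by (stock_id) makes the index partitioned.
    Clauses such as stored as, row format and location can be added before comment.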

    Bitmap index:
    Only two things change: as 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler' becomes as 'bitmap', and in table names a different index table.
    create  index stock_basic_index_map 
    on table stock_basic_test(stock_id)
    as 'bitmap'
    with deferred rebuild 
    in  table  index_table2  
    comment 'stock_basic_partition indexed  by  stock_id ';
8.2-8.4 Rebuilding, showing and dropping indexes
    show  formatted index  on stock_basic_test;

    alter  index stock_basic_index_map on     stock_basic_test   rebuild;

    drop  index  stock_basic_index_map on  stock_basic_test;

    /**
    Partitioned index  -- fails with an error, to be investigated later
    create  index stock_basic_partition_index 
    on table stock_basic_partition(stock_id)
    as 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler'
    with deferred rebuild 
    in  table  index_table partitioned  by (stock_id)
    comment 'stock_basic_partition indexed  by  stock_id ';
    alter  index stock_basic_partition_index on  table  stock_basic_partition partition(stock_id=) rebuild;
    **/

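Chapter 9: Schema Design
9.2 About partitions
MapReduce turns a job into multiple tasks. By default each task runs in a new JVM instance, which costs time to start and tear down, and every small file gets its own task.
In some cases starting and tearing down the JVM can take longer than processing the data itself.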
9.4 Multiple passes over the same data:
This is the ability to insert from one table into several tables in a single scan.
from table_all
insert overwrite table table_a select * where table_all.column_name_id='1'
insert overwrite table table_b select * where table_all.column_name_id='2';
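9.6 Bucketing table data storage:
If the data is stored in a partitioned table with the date as the first-level partition and the user code as the second-level partition, there are far too many second-level partitions, which causes a flood of small files. Partition on the first level and bucket on the second instead.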
create table stock_basic_par_clust(
stock_name string comment 'stock name'
,stock_date string comment 'stock date'
,stock_start_price DECIMAL(15,3) comment 'opening price'
,stock_max_price DECIMAL(15,3) comment 'highest price'
,stock_min_price DECIMAL(15,3) comment 'lowest price'
,stock_end_price DECIMAL(15,3) comment 'closing price'
,stock_volume DECIMAL(15,3) comment 'trading volume'
,stock_amount DECIMAL(15,3) comment 'turnover amount'
) partitioned by (stock_id string ) clustered by(stock_date ) into 10 buckets;
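# Note: partitioning and bucketing differ here: the bucketing column stays in the column list of the create statement, the partition column does not.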

    from  stock_basic 
    insert overwrite  table stock_basic_par_clust partition (stock_id='000001') select  
     stock_name   
    ,stock_date   
    ,stock_start_price
    ,stock_max_price
    ,stock_min_price
    ,stock_end_price
    ,stock_volume   
    ,stock_amount   
    where  stock_id='000001';
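    Note: the book says set hive.enforce.bucketing=true is required before the insert, but it ran fine without it; the parameter has been removed (since Hive 2.0 bucketing is always enforced).
    Practical Hive: things to watch for with bucketed tables
    Pick a bucket key with many distinct values; this reduces the chance of skew.
    If the bucket key is skewed, create separate buckets for the skewed values, which can be done with list bucketing.
    Use a prime number as the bucket count.
    Bucketing is useful for fact tables that are frequently joined together.
    Bucketed tables that are joined must have the same number of buckets, or one count must be a factor of the other.
    And so on.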
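
The requirement that joined bucketed tables have equal bucket counts, or counts where one divides the other, follows from how rows land in buckets. A minimal sketch, assuming a row goes to bucket (non-negative hash of the key) mod (bucket count); toy_hash is a stand-in invented for this illustration and is not Hive's actual hash function:

```python
def toy_hash(key: str) -> int:
    """Deterministic stand-in hash (hypothetical; Hive uses its own hash)."""
    h = 0
    for byte in key.encode("utf-8"):
        h = (31 * h + byte) & 0x7FFFFFFF  # keep the value non-negative
    return h


def bucket_of(key: str, num_buckets: int) -> int:
    """Bucket index for a key under the assumed hash-mod-N scheme."""
    return toy_hash(key) % num_buckets


if __name__ == "__main__":
    dates = ["2020-04-%02d" % day for day in range(1, 11)]
    for d in dates:
        small, large = bucket_of(d, 4), bucket_of(d, 8)
        # 4 divides 8, so the bucket in the 8-bucket table
        # determines the bucket in the 4-bucket table.
        assert large % 4 == small
        print(d, "-> bucket", small, "of 4, bucket", large, "of 8")
```

With counts such as 4 and 6 this relationship no longer holds, so matching buckets could not be paired up in a join.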
9.8 Using columnar tables
    See 15.3.2 for details.

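Chapter 10: Tuning
10.1 Using EXPLAIN
A Hive job consists of multiple stages, and stages can depend on one another.
A stage can be a MapReduce job, a sampling stage, a merge stage, a limit stage, and so on.
Hive normally runs one stage at a time unless parallel execution is requested.
10.2 Using EXPLAIN EXTENDED
Note what explain extended prints beyond explain. At first glance the extra output is mostly file locations?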
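10.3 Limit tuning
limit is often used to avoid returning the whole result set in the CLI; these settings let Hive sample the source data for such queries.
set hive.limit.optimize.enable=true;
set hive.limit.row.max.size=100; # seemed to do nothing: the system default shown was 10000 and changing it did not change how many rows were displayed
set hive.limit.optimize.limit.file=10;
These settings made little practical difference here. (hive.limit.row.max.size is the per-row size assumed when sampling for limit, not a row count, so it is not expected to change how many rows are shown.)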
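10.4 Join optimization: map join is common knowledge by now.
10.5 Local mode
10.6 Parallel execution
A Hive job has multiple stages; stages that do not depend on each other can be run in parallel.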
set hive.exec.parallel=true;
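
To picture which stages may overlap, here is a small sketch that groups stages into waves based on their dependencies. It is only an illustration of the idea; the stage names and the plan are hypothetical and this is not how Hive's scheduler is implemented:

```python
def parallel_waves(deps):
    """Group stages into waves; every stage in a wave has all its dependencies done."""
    remaining = {stage: set(needs) for stage, needs in deps.items()}
    done, waves = set(), []
    while remaining:
        ready = sorted(s for s, needs in remaining.items() if needs <= done)
        if not ready:
            raise ValueError("cyclic or unknown stage dependencies")
        waves.append(ready)
        done.update(ready)
        for stage in ready:
            del remaining[stage]
    return waves


if __name__ == "__main__":
    # Hypothetical plan: Stage-1 and Stage-2 are independent, Stage-3 needs both.
    plan = {"Stage-1": [], "Stage-2": [], "Stage-3": ["Stage-1", "Stage-2"]}
    print(parallel_waves(plan))
```

Stages that end up in the same wave are the ones that hive.exec.parallel=true allows to run at the same time.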
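10.7 Strict mode
set hive.exec.dynamic.partition.mode=strict ;
set hive.exec.dynamic.partition.mode=nonstrict ;
What strict mode is supposed to enforce:
1. Queries on a partitioned table must filter on a partition.
2. order by must be accompanied by limit.
3. Cartesian product queries are forbidden.
What actually happened: with the setting above in strict, 1 and 2 still ran, i.e. queries without a partition filter and without limit executed,
but a Cartesian product did not.
Even after set hive.exec.dynamic.partition.mode=nonstrict ;
the Cartesian product was still rejected until the following parameter was set to false:
set hive.strict.checks.cartesian.product=false;
Likely explanation: hive.exec.dynamic.partition.mode only governs dynamic-partition inserts. The strict mode the book describes is hive.mapred.mode=strict, and in Hive 2.x the individual checks are controlled by hive.strict.checks.* properties such as the one above.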
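10.8 Tuning the number of mappers and reducers
Balance matters: too many mappers and reducers add overhead in startup, scheduling and running the job; too few
and Hadoop's parallelism goes unused.
The CLI output shows roughly how many reducers are needed; all of the following appeared: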
Number of reduce tasks determined at compile time: 10
Number of reduce tasks not specified. Estimated from input data size: 3
number of mappers: 3; number of reducers: 1
Hive decides the number of reducers from the size of the input data. The input size can be obtained with the following command:
hdfs dfs -count hdfs://master:9000/hive_dir/stock_basic/stock_info2020414.txt  Note: the value is 760035648 B
```
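The property hive.exec.reducers.bytes.per.reducer defaults to 1 GB (value in the book) or 256000000 B (value found on this system). Changing it adjusts the number of reducers.
(Not sure whether the book is imprecise or the behavior has changed: the reducer counts seen did not always look like input size divided by this value. That said, 760035648 / 256000000 is about 2.97, which rounds up to the estimate of 3 shown above.)
The number of slots (map and reduce) in a cluster is fixed. Setting hive.exec.reducers.max keeps one large job from using up every slot. Suggested value: (total reduce slots * 1.5) / (average number of queries running).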
```
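
The arithmetic behind these two settings can be sketched as follows. This assumes the reducer estimate is the input size divided by the bytes-per-reducer value, rounded up and capped by the maximum; that rule and the cap of 1009 are assumptions for illustration, as are the cluster numbers in the last call:

```python
import math


def estimate_reducers(input_bytes, bytes_per_reducer, max_reducers):
    """Assumed rule: ceil(input / bytes_per_reducer), clamped to [1, max_reducers]."""
    wanted = math.ceil(input_bytes / bytes_per_reducer)
    return max(1, min(max_reducers, wanted))


def suggested_max_reducers(total_reduce_slots, avg_running_queries):
    """Rule of thumb from the notes: slots * 1.5 / average concurrent queries."""
    return int(total_reduce_slots * 1.5 / avg_running_queries)


if __name__ == "__main__":
    input_bytes = 760035648  # size reported by hdfs dfs -count in the notes
    print(estimate_reducers(input_bytes, 256000000, 1009))   # value found on the system
    print(estimate_reducers(input_bytes, 1000000000, 1009))  # roughly the book's 1 GB
    print(suggested_max_reducers(20, 3))  # hypothetical: 20 reduce slots, 3 queries
```

Under these assumptions the first call gives 3, the same number as the "Estimated from input data size" line in the CLI output.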
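10.9 JVM reuse:
My understanding: with many small files a JVM has to be started for each task, which is expensive. JVM reuse lets one JVM be used n times within the same job; n is set by mapred.job.reuse.jvm.num.tasks.
The drawback: a reused JVM keeps holding its task slot until the job finishes.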
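10.10 Indexes: skipped
10.11 Dynamic partition tuning: skipped, see earlier notes
10.12 Speculative execution. My understanding: Hadoop detects tasks that run slowly, blacklists them, and launches duplicate tasks. The duplicated computation costs extra resources.
Controlled by mapred.map.tasks.speculative.execution and mapred.reduce.tasks.speculative.execution.
10.13 Multiple group by in a single MapReduce job: combines several group by clauses of a query into one MapReduce job (could not find the property, it has probably been changed):
set hive.multigroupby.singlemr=false;
10.14 Virtual columns: besides INPUT__FILE__NAME and BLOCK__OFFSET__INSIDE__FILE, Hive offers ROW__OFFSET__INSIDE__BLOCK, which must be enabled with the setting below (did not seem very useful):
set hive.exec.rowoffset=true;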
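Chapter 11: Other File Formats and Compression
11.1 Determining installed codecs
set io.compression.codecs; # no longer works here, nothing is returned
11.2 Choosing a compression codec
A higher compression ratio inevitably costs more CPU. Options: GZIP, BZip2, Snappy, LZO.
BZip2 has the highest compression ratio; GZIP is a good trade-off between compression ratio and decompression speed.
Snappy and LZO compress less than the other two but decompress faster.
Another factor is whether the file is splittable: MapReduce needs to split very large input files (usually at multiples of the 64 MB block size).
If a file cannot be split, a single task has to read the whole file.
GZIP and Snappy hide the record boundaries, so their files are not splittable; BZip2 and LZO compress at block level (LZO needs an index file before it can actually be split).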
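
The ratio side of this trade-off is easy to try locally. A rough sketch using only Python's standard library, which ships gzip and bz2 but not Snappy or LZO; zlib at level 1 stands in here as a "faster, lower ratio" setting, and the sample rows are made up:

```python
import bz2
import gzip
import zlib


def compressed_sizes(data: bytes) -> dict:
    """Return the compressed size in bytes for a few stdlib codecs."""
    return {
        "gzip": len(gzip.compress(data)),
        "bz2": len(bz2.compress(data)),
        "zlib-fast": len(zlib.compress(data, 1)),  # low-effort, faster setting
    }


if __name__ == "__main__":
    # Hypothetical sample rows, loosely shaped like the stock table.
    rows = ["%06d,stock_%d,2020-04-14,%d.5\n" % (i, i % 50, i % 97) for i in range(20000)]
    data = "".join(rows).encode("utf-8")
    print("raw", len(data))
    for name, size in compressed_sizes(data).items():
        print(name, size, "ratio %.2f" % (len(data) / size))
    # Round trip: decompression restores the original bytes.
    assert bz2.decompress(bz2.compress(data)) == data
```

The actual ratios depend heavily on the data, so numbers from a sample like this say little about a real table.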
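11.3 Enabling intermediate compression
set hive.exec.compress.intermediate=true;
The default codec is DefaultCodec, i.e. org.apache.hadoop.io.compress.DefaultCodec; it can be changed through mapred.map.output.compression.codec. SnappyCodec is recommended:
set mapred.map.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
11.4 Compressing the final output:
set hive.exec.compress.output=true;
The codec is set through mapred.output.compression.codec. GZIP is suggested, but remember that GZIP is not splittable.
set mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
11.5 Sequence files support three compression types, NONE, RECORD (default) and BLOCK, and remain splittable
create table xxxx(a int )stored as sequencefile;
The compression type can be set with: set mapred.output.compression.type=BLOCK;
11.6 Compression in practice: skipped. The compressed result was only 1/4 of the original size, a clear win.
11.7 Archiving partitions: see earlier notes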
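Chapter 12: Developing (skipped), not much use here
Chapter 13: Functions (skipped): custom UDFs require writing Java code, not going there for now
Chapter 14: Streaming
Like Hadoop streaming, this calls your own Python or Linux scripts to process the data and return the result. It is somewhat slower than a UDF, but it avoids writing Java.
14.1 Identity transformation (the parentheses after as are optional):
select transform(stock_id,stock_name) using '/bin/cat' as (a,b) from stock_basic_bak limit 100 ;
14.2 Changing types (fails with an error as written, probably because typed and untyped aliases are mixed after as):
select transform(stock_id,stock_name) using '/bin/cat' as a double , b from stock_basic_bak;
14.3 Projecting transformation:
select transform(stock_id,stock_name) using '/bin/cut -f1' as a , b from stock_basic_bak limit 100;
14.4 Manipulative transformations:
select transform(stock_id,stock_name,stock_max_price) using '/bin/sed s/0/tt/g' as a , b,c from stock_basic_bak limit 100;
select transform(stock_id,stock_name) using '/bin/sed s/0/tt/' as a,b from stock_basic_bak limit 100;
Aside: does the data type of a column change after the transformation? Conclusion: yes, the column types change too (transform output columns are strings unless a type is declared).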
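14.5-14.7 Using the distributed cache; producing multiple rows from a single row:
Adding a file to the distributed cache only means the transform task can use the script directly without knowing where to find it.
add file /tmp/python_hive_code/hive_test_1.py; # note: this is a local path
select transform(line) using 'python hive_test_1.py' as a,b,c from test_stream_split;
The following also runs directly:
select transform(line) using 'python /tmp/python_hive_code/hive_test_1.py ' as a,b,c from test_stream_split;
The book's examples have no python after using; perhaps a version or syntax difference. Here python is required.
Attached, the code of hive_test_1.py: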

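# -*- coding: utf-8 -*-

import sys
#sys.stdout.write('：\n')
# Read rows from stdin, split each on commas and emit it tab-separated.
for line in sys.stdin:
    # Drop the trailing newline so the last field does not carry it.
    out = line.rstrip('\n').split(',')
    print('\t'.join(out))
Open issue from the original run: the output contained blank lines. The likely cause is that the original script kept each line's trailing newline in the last field, so print emitted a second one; the rstrip above is meant to fix that.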
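
The section title also mentions producing multiple rows from one row, which the script above does not do. A minimal sketch of that case, assuming comma-separated input lines and one output row per non-empty field; the function names and the file name used below are hypothetical:

```python
import io
import sys


def explode(line, sep=","):
    """Turn one delimited input line into a list with one entry per non-empty field."""
    fields = line.rstrip("\n").split(sep)
    return [f for f in fields if f != ""]


def run(stdin, stdout):
    """Write every field of every input line as its own output row."""
    for line in stdin:
        for field in explode(line):
            stdout.write(field + "\n")


if __name__ == "__main__":
    # Demo with in-memory input; under Hive, call run(sys.stdin, sys.stdout) instead.
    run(io.StringIO("a,b,c\nd,e\n"), sys.stdout)
```

Saved as, say, explode_rows.py and added with add file, it could be called along the lines of: select transform(line) using 'python explode_rows.py' as (item) from test_stream_split;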
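14.7-14.8 Left for later study -- to be continued

Chapter 15: File and Record Formats
15.1-15.2 File and record formats: the owner of a Hive table is the Linux system user, not default; default is the database name.
15.3 File formats:
1. sequencefile: a binary file of key-value pairs.
2. rcfile: columnar storage with a fairly high compression ratio; columns that are empty need not be physically stored.
3. textfile
Hive lets you specify the input format and output format separately. As before, the class names are case-sensitive.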
create table tt(a int)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.columnar.LazyBinaryColumnarSerDe'
stored as
inputformat 'org.apache.hadoop.hive.ql.io.RCFileInputFormat'
outputformat 'org.apache.hadoop.hive.ql.io.RCFileOutputFormat';

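    Note: ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.columnar.LazyBinaryColumnarSerDe' is required; without it, inserting from a text table fails, even though creating the table raises no error. Odd.
    hdfs dfs -cat cannot display an rcfile, but hive --service rcfilecat can:
    hive --service  rcfilecat  hdfs://master:9000/hive_dir/tt/000000_0
    It works, but numeric values look different from the query results, possibly a serialization issue. Chinese text is unaffected.

    Appendix: custom input formats:
15.4 Record formats: SerDe is short for serializer/deserializer. For usage see 15.3.
15.5 CSV and TSV SerDes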
create  table  stock( stock_id string ,stock_name string) 
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
STORED AS TEXTFILE;
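# To verify: not sure what it means to specify both a CSV SerDe and STORED AS TEXTFILE. Presumably the SerDe defines how each row is parsed while STORED AS defines the file container, and a CSV file is plain text.
15.6-15.8 Skipped
15.9 XPath functions: look them up online
15.10 Verified that with serdeproperties had no effect here
     JSON SerDe classes:  org.apache.hive.hcatalog.data.JsonSerDe   org.apache.hadoop.hive.contrib.serde2.JsonSerde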
    drop  table test_json_t;
    create   table  test_json_t
    (
    id string,
    index string ,
    guid string ,
    balance string  
    ) ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
    with serdeproperties
    (
    "id"="s._id",
    "index"="s.index",
    "guid"= "s.guid",
    "balance"= "s.balance"
    )
    stored as textfile;

    load data local  inpath '/tmp/json_data/'  overwrite  into  table test_json_t ;
Practical Hive: extracting JSON data
    1. Query with a UDF:
        CREATE TABLE json_table (json string);
        load data local  inpath '/tmp/json_data/' into    table json_table;
        SELECT get_json_object(json_table.json, '$')   as json FROM json_table; 
        select get_json_object(json_table.json,'$.balance') as balance,
            get_json_object(json_table.json, '$.gender') as gender,
            get_json_object(json_table.json, '$.phone') as phone,
            get_json_object(json_table.json, '$.friends.name') as friendname
            from json_table;
    2. Query with a SerDe
        First add the jar, which ships with Hive:
        add  jar  /opt/tools/apache-hive-2.3.6-bin/hcatalog/share/hcatalog/hive-hcatalog-core-2.3.6.jar;

        drop table json_serde_table;
        CREATE TABLE json_serde_table (
        id string,
        index string,
        guid string,
        balance string 
        )
        ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
        WITH SERDEPROPERTIES ( "mapping._id" = "id" );
        load data local  inpath '/tmp/json_data/' into    table json_serde_table;
        Not yet sure how to handle keys that start with an underscore (such as _id).
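
To make the get_json_object calls from option 1 above more concrete, here is a rough Python analogue that resolves a simple '$.a.b' path against one JSON row. It is a simplified illustration only: it handles nested objects but not arrays or wildcards, and the sample record is made up:

```python
import json


def get_path(json_text, path):
    """Resolve a simple '$.a.b' path against a JSON object; None if absent or invalid."""
    if not path.startswith("$"):
        raise ValueError("path must start with '$'")
    try:
        node = json.loads(json_text)
    except ValueError:
        return None
    for key in [part for part in path[1:].split(".") if part]:
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node


if __name__ == "__main__":
    # Hypothetical record; the field names mirror the queries in the notes.
    row = '{"_id": "5e9", "balance": "$1,200.50", "gender": "male", "owner": {"name": "Ann"}}'
    print(get_path(row, "$.balance"))
    print(get_path(row, "$.owner.name"))
    print(get_path(row, "$.phone"))  # key is missing in this record
```

A missing key yields None here, which plays the role of the NULL a query would return for that row.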

Practical Hive: ORC files, also a columnar format
create table stock_test_orc_file stored as orc  as select  * from  stock_basic limit  200000;  --1.68M
create table stock_test_normal_file  as select  * from  stock_basic limit  200000;     --10.59M
create  table  stock_test_rc_file                                                     --9.28M; the gain is not obvious, better to use ORC
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.columnar.LazyBinaryColumnarSerDe'  
stored as 
inputformat 'org.apache.hadoop.hive.ql.io.RCFileInputFormat'
outputformat 'org.apache.hadoop.hive.ql.io.RCFileOutputFormat' 
as   select  * from  stock_basic limit  200000;

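The comparison shows ORC really does compress a lot.
Merging small files for ORC or RC tables: alter table  stock_test_orc_file concatenate;

hdfs dfs -cat   hdfs://master:9000/hive_dir/stock_test_normal_file/000000_0
hive --service  rcfilecat  hdfs://master:9000/hive_dir/stock_test_rc_file/000000_0
rcfilecat cannot read ORC files; look up how to view them (hive --orcfiledump is the usual tool).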

Chapter 16: Hive Thrift Service
16.1 Starting the Thrift server
hive --service hiveserver2 &
Check that the listener started successfully:
netstat -nl | grep 10000
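
The same check can be scripted. A small sketch that tries to open a TCP connection to the port; the host name localhost is an assumption, and a successful connection only shows that something is listening on the port, not that it is HiveServer2:

```python
import socket


def port_is_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # "localhost" is an assumption; 10000 is the port grepped for in the notes.
    print(port_is_open("localhost", 10000))
```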
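16.2 Setting up Groovy to use HiveServer: the JDK version was unsuitable, gave up.
Chapter 18: Security (did not work)
18.2 Authentication with Hive
Set the default umask for newly created files:
set hive.files.umask.value=0002; # property not found
Also set the following to true (default false); if the user lacks permission to delete the table's underlying data, Hive then prevents the user from dropping the table:
set hive.metastore.authorization.storage.checks= true;
18.3 Authorization in Hive
Authorization only takes effect once the following is set to true; the default is false:
set hive.security.authorization.enabled;
18.3.1 Users, groups and roles
set hive.security.authorization.enabled= true; # cannot be set at runtime; it can be set in hive-site.xml, but that had no effect
Show the user name: set system:user.name;
Did not work at all. To be revisited.
Chapter 19: Locking
19.1 Locking support with Hive and ZooKeeper
First install ZooKeeper following its standard installation procedure,
then add the following properties to hive-site.xml: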

    <property>
    <name>hive.zookeeper.quorum</name>
    <value>master,slaves1,slaves2</value>
    </property>

    <property>
    <name>hive.support.concurrency</name>
    <value>true</value>
    </property>

    Use show locks; in Hive to list the locks.
19.2 Explicit, exclusive locks
    lock table  stock_basic exclusive;

    select * from  stock_basic  limit 10 ;
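    This runs extremely slowly, even in the session that took the lock, so everything seems to be locked. Question: if even the locking session cannot work with the table, what is the point of locking it?
    Unlock the table; a query issued right after unlocking returns immediately.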
    unlock table stock_basic;

Practical Hive: this failed with an error
set hive.execution.engine; # shows the execution engine; default mr, options [mr, tez, spark]
select count(*) from stock_basic; -- 69.859 seconds
set hive.execution.engine=tez;
set hive.prewarm.enabled=true;
set hive.prewarm.numcontainers=10;

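Storage formats: ORC and Parquet, and vectorized queries
Parquet is another columnar storage format. It stores all the data of each column contiguously on disk, so it has performance advantages similar to ORC.
drop table stock_test_orc_file;
create table stock_test_orc_file stored as orc as select * from stock_basic;
select count(*) from stock_test_orc_file; --0.98s

Vectorized queries: Hive's default execution engine processes one row at a time, which requires several layers of virtual method calls inside nested loops; from the CPU's point of view this is very inefficient.
Vectorized query execution is a Hive feature that reads data in batches of 1024 rows and operates on the whole batch at once, removing those inefficiencies.
For typical query operations (scan, filter, aggregate and join), vectorized execution has been shown to be an order of magnitude faster.
However, vectorized queries require the data to be stored as ORC.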
set hive.vectorized.execution.enabled=true;
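
The difference between row-at-a-time and batch processing can be sketched in a few lines. This only illustrates the shape of the two approaches with a filter-and-sum over one hypothetical column; plain Python will not show the speedup a vectorized engine gets from tight loops over column batches:

```python
BATCH_SIZE = 1024  # batch size mentioned in the notes


def filter_sum_row_at_a_time(values, threshold):
    """One function call per row: the style described as inefficient."""
    def keep(v):
        return v > threshold
    total = 0
    for v in values:
        if keep(v):
            total += v
    return total


def filter_sum_batched(values, threshold, batch_size=BATCH_SIZE):
    """Process the column batch by batch, one tight pass per batch."""
    total = 0
    for start in range(0, len(values), batch_size):
        batch = values[start:start + batch_size]
        total += sum(v for v in batch if v > threshold)
    return total


if __name__ == "__main__":
    prices = [float(i % 100) for i in range(10000)]  # hypothetical column
    assert filter_sum_row_at_a_time(prices, 50.0) == filter_sum_batched(prices, 50.0)
    print(filter_sum_batched(prices, 50.0))
```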
