Hive common functions - table creation - JDBC

hive:
================
    A data warehouse used for analyzing massive data sets
    Runs on top of hadoop
    hive generally does not support transactions or row-level updates, though transactional behavior can be enabled when necessary

Use cases for data warehouses vs. relational databases:
    data warehouse: OLAP
    relational database: OLTP    //transactions


Complex type access (the column names below are placeholders named after their types):
Array:  select arr[0] from t;
Map:    select m['height'] from t;
Struct: select s.DB from t;
Union:  uniontype columns have only limited query support in hive; values are mostly passed through as-is
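
A minimal sketch of declaring and reading these types (the table and column names are illustrative, not from the original notes):

    create table complex_t(
        arr array<string>,
        m map<string,int>,
        s struct<DB:string, host:string>
    )
    row format delimited
    fields terminated by '\t'
    collection items terminated by ','
    map keys terminated by ':'
    stored as textfile;

    select arr[0], m['height'], s.DB from complex_t;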


hive workflow:
=======================
    The user enters an HQL (HiveQL) statement, the compiler translates it into MR jobs, and the execution engine submits them to hadoop


hive1    //MR is the recommended execution engine
hive2    //spark is the recommended execution engine
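
    The engine can be switched per session through the hive.execution.engine property (valid values include mr, tez, and spark; this assumes the chosen engine is installed and configured):

    set hive.execution.engine=spark;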


hive data layout:
==========================
    metadata: mysql
    actual data: HDFS

    table: a directory
    database: a directory
    data: HDFS files, plain text files by default


hive table creation (the default field delimiter is \001, i.e. Ctrl-A):
============================
    create table t1(id int, name string) row format delimited
    fields terminated by '\t'
    lines terminated by '\n'
    collection items terminated by ','
    map keys terminated by ':' 
    stored as textfile;
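
    To load a matching tab-delimited text file into the table (the path below is hypothetical):

    load data local inpath '/home/centos/t1.txt' into table t1;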


hive script options:
================================
hive
    --version    //print the hive version

    --service    //run a named hive service (cli is the default)

    --rcfilecat    //print the contents of RCFile files

    --orcfiledump    //dump the metadata of ORC files

    --llapdump    //LLAP-related dump tool

    --help    //print usage information

    --debug    //run the client with debug options


    Help for a second-level command:
    hive --service serviceName --help


hive services: hive --service <name>
====================================
    beeline        second-generation user interface
    
    cli        first-generation user interface
    
    help        
    
    hiveserver2    second-generation hive server
    
    version


beeline and hiveserver2    
===================================
    hiveserver2    //hive's JDBC service

    hive --service hiveserver2    //start hiveserver2
    hiveserver2            //start hiveserver2
        
     
    s101:10002    //hiveserver2 webUI
    s101:10000    //hiveserver2 RPC address


    beeline: the new hive2 client

Error: User: centos is not allowed to impersonate anonymous (state=08S01,code=0):
===========================================================================================
    Fix: add the following configuration to core-site.xml
    <property>     
        <name>hadoop.proxyuser.centos.hosts</name>     
        <value>*</value> 
    </property> 
    <property>     
        <name>hadoop.proxyuser.centos.groups</name>     
        <value>*</value> 
    </property>

    After adding it, distribute the file to the other machines and restart hadoop


    Problem 2: after the config change, table creation fails    //the anonymous user lacks write permission
        1. First option: edit hive-site.xml and change the thrift client user to centos
          <property>
            <name>hive.server2.thrift.client.user</name>
            <value>centos</value>
            <description>Username to use against thrift client</description>
          </property>

        2. Second option: specify the username in the connection string    //-u is the url, -n the username
           ~]$  beeline -u jdbc:hive2://localhost:10000 -n centos

        



Connecting to hive's JDBC service:
=======================================
    1. Start hiveserver2
        hive --service hiveserver2    //start hiveserver2
        hiveserver2            //start hiveserver2
    
    2. Start a client (JDBC from IDEA \ beeline)
        
        Code:
        <dependency>
            <groupId>org.apache.hive</groupId>
            <artifactId>hive-jdbc</artifactId>
            <version>2.1.1</version>
        </dependency>

        import java.sql.Connection;
        import java.sql.DriverManager;
        import java.sql.ResultSet;
        import java.sql.Statement;

        public class HiveJdbcDemo {
            public static void main(String[] args) throws Exception {
                String driver = "org.apache.hive.jdbc.HiveDriver";
                String url = "jdbc:hive2://192.168.23.101:10000/myhive";

                //register the driver
                Class.forName(driver);
                //obtain a connection through DriverManager
                Connection conn = DriverManager.getConnection(url);
                //create a statement from the connection
                Statement st = conn.createStatement();
                //use the statement for inserts/updates/deletes/queries;
                //queries use executeQuery, which returns a result set
                ResultSet rs = st.executeQuery("select * from tt");
                while (rs.next()) {
                    int id = rs.getInt(1);
                    String name = rs.getString(2);
                    int age = rs.getInt(3);
                    System.out.println(id + "/" + name + "/" + age);
                }
                rs.close();
                st.close();
                conn.close();
            }
        }

        beeline:
            1. type beeline at the linux command line
                ~]$ beeline
            2. connect from the beeline prompt
                beeline> !connect jdbc:hive2://localhost:10000
            
        
Managed (internal) tables and external tables:
==========================================
    1. By default a managed (internal) table is created; dropping the table also deletes its data
        create table xxx

    2. External table: dropping the table does not delete the data, only the metadata    //i.e. only the table's entry is removed from mysql
        create external table xxx
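
    A sketch of an external table pointing at existing HDFS data (the path and schema below are illustrative):

        create external table logs(line string)
        row format delimited fields terminated by '\t'
        stored as textfile
        location '/data/logs';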



hive functions:
==================================
     desc function [extended] substr;    //show how to use a function [extended form]
    
     select current_database();    //show the current database
     select current_date();        //show the current date
     select current_timestamp();    //show the full timestamp

     select explode(map or array) from xxx;    //explode a map or array, turning one row into many

     select substr(str,index,len)    //select substr('hello',2,3); ====> ell
                    //select substr('hello',-4); ====> ello
                    //select substr('hello',-4,3); ====> ell

     select split('hello world',' ')    //returns an array

     1. regular functions
     2. table-generating functions
     3. aggregate functions
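
     A table-generating function can also be used with lateral view (a minimal sketch, assuming the t5 table with an array<string> column line created in the next section):

     select word, count(1) from t5 lateral view explode(line) tmp as word group by word;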


wordcount with hive's explode function:
=============================================

    1. column: line =====> array type ====> collection items terminated by ' '

    create table t5(line ARRAY<string>) row format delimited
    fields terminated by '\t'
    collection items terminated by ' '
    lines terminated by '\n'
    stored as textfile;
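
    Load a text file into the table (the path is hypothetical):
    load data local inpath '/home/centos/words.txt' into table t5;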


    explode: select explode(line) as word from t5;
    group: select word, count(1) from (select explode(line) as word from t5) t6 group by word;
    group + sort: select word, count(1) as c from (select explode(line) as word from t5) t6 group by word order by c desc;


    2. column: line =====> string type ====> split(line,' ')
    create table t6(line string) row format delimited
    fields terminated by '\t'
    lines terminated by '\n'
    stored as textfile;

    split: select split(line,' ') from t6;
    explode: select explode(split(line,' ')) as word from t6;
    group: select word, count(1) from (select explode(split(line,' ')) as word from t6) t7 group by word;
    group + sort: select word, count(1) as c from (select explode(split(line,' ')) as word from t6) t7 group by word order by c desc;


    Note: do not put a select inside a function call; nested functions are written directly

hive's .hiverc file:
===================================
    Holds per-user hive startup settings; it lives in the user's home directory (~) and is automatically detected and loaded every time hive starts
    Example content:
        set hive.cli.print.header=true;
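
    Another setting commonly placed here (shows the current database in the prompt):
        set hive.cli.print.current.db=true;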

    

Computing the yearly maximum temperature with hive:
===================================
    ftp://ftp.ncdc.noaa.gov/pub/data/noaa/    //NCDC weather data

    
    Substring out the year and temperature ===> cast the temperature ====> group and aggregate (max)
    
    1. Create the table
        create table temp(line string) row format delimited
        fields terminated by '\t'                        
        lines terminated by '\n'                         
        stored as textfile;                              

    2. Load the data
        load data local inpath 'Temp/*' into table temp;

    3. Write the sql
        0029029070999991901010106004+64333+023450FM-12+000599999V0202701N015919999999N0000001N9-00781+99999102001ADDGF108991999999999999999999
    
        cast(col as int)        //cast(line as int) converts the line column to an int

        get the year: year: substr(line,16,4)
        get the temperature: temp: cast(substr(line,88,5) as int)   != 9999
        year and temperature together: select substr(line,16,4) as year, cast(substr(line,88,5) as int) as temp from temp;
        filter out invalid temperatures: select substr(line,16,4) as year, cast(substr(line,88,5) as int) as temp from temp where cast(substr(line,88,5) as int) != 9999;

        select a.year, max(a.temp) from 
        (select substr(line,16,4) as year, cast(substr(line,88,5) as int) as temp from temp where cast(substr(line,88,5) as int) != 9999) as a 
        group by a.year;
    


Table operations:
=========================================
    like        
        create table ep like employee;        //copy only the table structure
        
    as    
        create table ep2 as select * from employee;            //copy the table with data
        create table ep3 as select ep2.name, ep2.sex_age from ep2;    //copy the table, selected columns only

    truncate
        truncate table ep3;            //truncate (empty) the table

    column
        add columns
            alter table ep3 add columns(id int, age int);

        drop a column
            //hive cannot drop a single column directly; use replace columns (below) to rewrite the column list without it

        change a column
            alter table ep3 change id no string;        //rename column id to no and set its type to string

        replace columns
            alter table ep3 replace columns(name string, id int, age int);    //replace the full column list
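
        For example, dropping the id column by replacing the column list without it (a sketch against the ep3 table above):
            alter table ep3 replace columns(name string, age int);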

Partitioned tables: each partition is a directory
    
    Filtering on partition columns (select ... where ...) prunes the scan to the matching directories


sample record: 1,tom,20,henan,kaifeng    //the last two fields become the partition values

    
    Create a partitioned table:
        create table users(id int, name string, age int) partitioned by (province string, city string)
        row format delimited 
        fields terminated by '\t'                        
        lines terminated by '\n'                         
        stored as textfile;   

    
    Statically add partitions to a partitioned table:
        alter table users add partition(province='henan', city='luoyang') ;

        alter table users add 
        partition(province='beijing', city='beijing') 
        partition(province='hebei', city='xiongan') 
        partition(province='hebei', city='shijiazhuang') 
        ;

    Drop a partition: [IF EXISTS] drops it if present and raises no error if absent
        ALTER TABLE users DROP [IF EXISTS] PARTITION (province='beijing', city='beijing') ;

    List the partitions of a partitioned table:
        show partitions users;

    
    load data into a partition:
        load data local inpath '' overwrite into table users partition(province='beijing', city='beijing');
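
        A query that filters on the partition columns then scans only those directories (a sketch):
        select * from users where province='beijing' and city='beijing';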

        

insert:
===========================================
    Dynamically inserting partitions with the insert command:
        
        1. Turn off strict mode for dynamic partitioning:
            reason: to avoid performance problems from too many partitions, strict mode requires at least one static partition
            turn off strict mode:
            set hive.exec.dynamic.partition.mode=nonstrict
        
        non-partitioned table: id, name, age, province, city
        partitioned table:     id, name, age    partitions: province, city

        2. Insert dynamically; the partition columns are taken from the last n fields of the select list
            insert into table users partition(province, city) 
            select user0.id, user0.name, user0.age, user0.province, user0.city from user0;
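
        On some hive versions dynamic partitioning must also be enabled explicitly with set hive.exec.dynamic.partition=true; (an assumption about the local setup; recent versions enable it by default).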

        
    
    Exporting data with the insert command: note that exports can only overwrite; the overwrite keyword cannot be dropped
        
        insert overwrite local directory '/home/centos/users' row format delimited 
        fields terminated by '\t'
        select * from users;

        Insert the data into multiple directories at once (multi-insert: from comes first, one select per target, a single trailing semicolon)
        from users
        insert overwrite local directory '/home/centos/users' 
        row format delimited fields terminated by '\t'
        select *
        insert overwrite local directory '/home/centos/user2' 
        row format delimited fields terminated by '\t'
        select *;


    Inserting one table's data into another with the insert command:
        insert into t1 select t4.name from t4;
    

order by: total ordering
    uses a single reducer
    1. manually set the number of reducers in hive:
        set mapreduce.job.reduces=2

    2. group by:
        select province, count(*) from users group by province;

    3. order by (still sorted by a single reducer even with reduces=2):
        select province, count(*) as c from users group by province order by c;


sort by: partial (per-reducer) ordering
    1. manually set the number of reducers in hive:
        set mapreduce.job.reduces=2

    2. group by + sort by:
        select province, count(*) as c from users group by province sort by c;
    
    tianjin  1     //output from two reducers;
    hebei    2     //each reducer's output is sorted separately
    beijing  9 
    henan    2 



distribute by: partition the data by the specified column

    select province, age from users distribute by province sort by age;
    
    select province, count(*) as c from users group by province distribute by province sort by c;



Difference between distribute by and group by:
    1. group by is used together with aggregate functions, and the partition assignment is not specified
    2. distribute by can be used without aggregate functions, and the data is partitioned by the specified column



cluster by: equivalent to distribute by + sort by, i.e. per-partition sorting, but the partition column and sort column must be the same

    select province, count(*) as c from users group by province cluster by province;
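
    The same query spelled out with distribute by + sort by (same column in both):
    select province, count(*) as c from users group by province distribute by province sort by province;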

    



Correspondence between hive query clauses and the MR pipeline:
==========================================
    select     -> map
    from       -> input
    where      -> map
    having     -> reduce
    group by   -> partition
    order by   -> a second job
    limit      -> the second job's cleanup

    
limit: paged queries
==========================
    caps the number of output rows
    select * from users limit 10;    //first ten rows
    select * from users limit 10,10;    //skip the first ten rows, return the next ten


join:
==============================
    1. inner join
        select a.id, a.name,a.age,a.province,b.orderno, b.price from user0 a, orders b where a.id=b.uid;
        select a.id, a.name,a.age,a.province,b.orderno, b.price from user0 a inner join orders b on a.id=b.uid; 
    2. left outer join
        select a.id, a.name,a.age,a.province,b.orderno, b.price from user0 a left outer join orders b on a.id=b.uid; 
    3. right outer join
        select a.id, a.name,a.age,a.province,b.orderno, b.price from user0 a right outer join orders b on a.id=b.uid; 
    4. full outer join    //not supported by mysql
        select a.id, a.name,a.age,a.province,b.orderno, b.price from user0 a full outer join orders b on a.id=b.uid; 


user0 table:
100    yyy    100    beijing    beijing
7    tt    7    tianjin    tianjin
1    tom    20    hebei    langfang
2    tomas    30    henan    xinyang
3    tomson    45    hebei    shijiazhuang
4    tomason    6    beijing    beijing
5    tomasLee    7    henan    kaifeng


orders table:
1    order001    20    100
2    order002    30    5
3    order003    90    7
4    order004    30.6    6
5    order005    4.2    1
6    order006    5    2



Manually designating the large and small tables; reduce-side join implementation
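
    The small table can be designated manually with the MAPJOIN hint, which loads it into memory on the map side instead of joining in the reducer (a sketch using the tables above, assuming orders is the small table; newer hive versions usually do this conversion automatically via hive.auto.convert.join):

    select /*+ MAPJOIN(b) */ a.id, a.name, b.orderno, b.price
    from user0 a join orders b on a.id = b.uid;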

Reposted from www.cnblogs.com/zyde/p/9225214.html