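Hive Notes: Hive Configuration and Basic Operations

Hive notes
1. Concepts in Hive:
  1. Introduction to Hive:
   1. Hive is a data warehouse infrastructure built on top of Hadoop. It provides a set of tools for extracting, transforming, and loading data (ETL).
   2. It is a mechanism for storing, querying, and analyzing large-scale data kept in Hadoop. Hive defines a simple SQL-like query language called QL (HQL: Hive Query Language), which lets users familiar with SQL query the data.
   3. The language also lets developers familiar with MapReduce plug in custom mappers and reducers to handle complex analysis that the built-in mappers and reducers cannot do.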
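One common way to plug in a custom mapper is a streaming script: Hive feeds rows to the script as tab-separated text on standard input and reads tab-separated rows back from standard output. The sketch below is only an illustration under assumptions: the three input columns (uid, movieid, score), the "high"/"low" label, and the function names are all made up for this example.

```python
import io
import sys


def map_rows(lines):
    """Turn tab-separated (uid, movieid, score) rows into (uid, label) rows."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 3:
            continue  # skip malformed rows instead of failing the whole job
        uid, _movieid, score = fields
        label = "high" if float(score) > 2 else "low"
        yield f"{uid}\t{label}"


def main(stdin=sys.stdin, stdout=sys.stdout):
    """Entry point for streaming use: read rows from stdin, write rows to stdout."""
    for out in map_rows(stdin):
        stdout.write(out + "\n")


# Local dry run with made-up rows; a real streaming script would just call main().
sample = io.StringIO("1\t101\t3.5\n1\t102\t1.0\nbad row\n2\t101\t4.0\n")
result = io.StringIO()
main(sample, result)
print(result.getvalue(), end="")
```

Saved as a script (say, a hypothetical mapper.py that calls main()), it could be invoked from a query with Hive's transform ... using ... as clause after the file is added to the session.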
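  2. Introduction to data warehouse ETL
  3. Hive architecture diagram
  4. Core component of Hive: the metastore, which stores the metadata of Hive tables and can keep it in a traditional relational database.
   Metadata: table names, each table's columns and partitions with their properties, table properties (such as whether the table is external), the directory where the table's data is stored, and so on.
2. Pseudo-distributed installation:
  pre1: Install MySQL.
  pre2: Enable remote connections to MySQL: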
   GRANT ALL PRIVILEGES ON *.* TO 'root'@'%' IDENTIFIED BY 'root' WITH GRANT OPTION;
   FLUSH PRIVILEGES;
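  pre3: Upload the MySQL JDBC driver jar to Hive's lib directory (pay attention to the version of the driver jar).
  pre4: Install Hadoop and start its HDFS and MapReduce services.
  1. Configure the Hive and Hadoop environment variables (in ~/.bashrc):
   CLASSPATH=.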
   JAVA_HOME=/usr/java/latest
   HIVE_HOME=/usr/hive-2.1.1
   HADOOP_HOME=/usr/hadoop-2.4.0
   PATH=$JAVA_HOME/bin:$PATH:$HIVE_HOME/bin
   export JAVA_HOME
   export HIVE_HOME
   export HADOOP_HOME
   export PATH
   export CLASSPATH

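  2. In $HIVE_HOME/conf/, run mv hive-default.xml.template hive-site.xml to rename the file.
  3. In $HIVE_HOME/conf/, run mv hive-env.sh.template hive-env.sh to rename the file.

  4. Configure the Hive metastore
   1. Use Derby as the Hive metastore (drawback: only one client can connect to Hive at a time, so it is not advisable for real use).
      Derby is a relational database bundled with Hive for storing metadata.
    1. Add the following to hive-site.xml: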
     <configuration>
      <property>
       <name>hive.metastore.warehouse.dir</name>
       <value>/user/hive-2.1.1/warehouse</value> <!-- path in HDFS -->
      </property>
      <property>
        <name>hive.metastore.schema.verification</name>
        <value>false</value>
      </property>
     </configuration>
   2. Use MySQL as the Hive metastore
    1. Copy the MySQL JDBC driver jar into Hive's lib directory.
    2. Edit hive-site.xml as follows:
    <configuration>
     <property> 
      <name>hive.metastore.warehouse.dir</name>
      <value>/user/hive-2.1.1/warehouse</value>
     </property>
     <property>
       <name>hive.metastore.schema.verification</name>
       <value>false</value>
     </property>
     <property>
      <name>javax.jdo.option.ConnectionURL</name>
      <value>jdbc:mysql://192.168.1.3:3306/hive?createDatabaseIfNotExist=true</value>
     </property>
     <property>
      <name>javax.jdo.option.ConnectionDriverName</name>
      <value>com.mysql.jdbc.Driver</value>
     </property>
     <property>
      <name>javax.jdo.option.ConnectionUserName</name>
      <value>root</value>
     </property>
     <property>
      <name>javax.jdo.option.ConnectionPassword</name>
      <value>root</value>
     </property>
     </configuration>
    In hive-env.sh, set the Hadoop environment variable:
    # Set HADOOP_HOME to point to a specific hadoop install directory
    HADOOP_HOME=/usr/hadoop-2.4.0

  5. Initialize the metastore schema with the command that matches the chosen metastore:
   [root@hadoop01 hive-2.1.1]# ./bin/schematool -dbType derby -initSchema
   [root@hadoop01 hive-2.1.1]# ./bin/schematool -dbType mysql -initSchema
  6. Connect to Hive:
   [root@hadoop01 hive-2.1.1]# ./bin/hive
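3. Ways for clients to connect to Hive:
  1. Through the command-line client that Hive provides
  2. Through the hive-driver that Hive provides, using JDBC
  3. Through HWI (Hive web UI), which Hive provides
4. Basic operations in the Hive command-line client
  1. Connect to Hive from the client: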
   [root@hadoop06 hive-2.1.1]# ./bin/hive
  2. Initial settings:
   1. Show the current database name in the prompt after connecting:
    hive> set hive.cli.print.current.db=true;
   2. Show column headers in query results:
    hive (default)> set hive.cli.print.header=true;
  3. Database operations
   hive (db1)> show databases;
   hive (db1)> create database db1;
   hive (db1)> use db1;
   hive (db1)> describe database db1;
   hive (db1)> drop database db1;          // by default, fails if the database still contains tables
   hive (db1)> drop database db1 cascade; // cascade drops the database together with its tables
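  4. Table operations
   1. Table creation concepts:
    1. Hive's create table statement:
     See the figure hive建表语句说明图.jpg (diagram explaining the Hive create table statement).
    2. Kinds of Hive tables:
     Tables are either managed (internal) tables (MANAGED_TABLE) or external tables (EXTERNAL_TABLE). If no type is given at creation, the table is managed.
     Differences (the raw data is stored on HDFS):
      1. For a managed table, Hive moves the data into the path of its data warehouse. Dropping a managed table deletes both the metadata and the data.
      2. For an external table, Hive only records where the data is and leaves it in place. Dropping it deletes only the metadata, not the data.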
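The difference in drop behavior can be modeled with a toy example. This is not how Hive is implemented; ToyMetastore, its methods, and the directory names are hypothetical stand-ins, with local temporary directories playing the role of HDFS paths.

```python
import os
import shutil
import tempfile


class ToyMetastore:
    """Toy model: maps table name -> (data directory, is_external)."""

    def __init__(self):
        self.tables = {}

    def create_table(self, name, location, external=False):
        os.makedirs(location, exist_ok=True)
        self.tables[name] = (location, external)

    def drop_table(self, name):
        location, external = self.tables.pop(name)  # metadata is always removed
        if not external:
            shutil.rmtree(location)  # managed table: the data is removed too


root = tempfile.mkdtemp()
managed_dir = os.path.join(root, "warehouse", "t_managed")
external_dir = os.path.join(root, "data", "t_external")

ms = ToyMetastore()
ms.create_table("t_managed", managed_dir)
ms.create_table("t_external", external_dir, external=True)
ms.drop_table("t_managed")
ms.drop_table("t_external")

managed_exists = os.path.exists(managed_dir)
external_exists = os.path.exists(external_dir)
print(managed_exists, external_exists)  # False True

shutil.rmtree(root)  # clean up the temporary directories
```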
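    3. Column data types supported in Hive tables:
     1. Primitive types:
      tinyint,smallint,int,bigint,
      boolean,
      float,double,
      string
     2. Collection types (hold multiple values of the same type)
      array, e.g. a column persons: accessed by index, for example persons[0]
      map, e.g. a column city: stores key/value pairs, accessed by key, for example city['name']
     3. Struct type (holds multiple values of different types)
      struct, e.g. a column str: accessed with dot notation, for example str.key1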
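To make the three access styles concrete, here is a small sketch that parses one line of a text file into an array, a map, and a struct. It assumes a table whose fields are separated by tabs, whose collection items are separated by ',' and whose map keys are separated by ':'; the column names (persons, city, str) follow the examples above, while the struct's second member key2 and the sample values are invented.

```python
def parse_row(line):
    """Parse 'array<TAB>map<TAB>struct' text into Python equivalents."""
    persons_raw, city_raw, str_raw = line.split("\t")
    persons = persons_raw.split(",")  # array<string>
    city = dict(item.split(":", 1) for item in city_raw.split(","))  # map<string,string>
    key1, key2 = str_raw.split(",")  # struct<key1:string,key2:int>
    return {
        "persons": persons,
        "city": city,
        "str": {"key1": key1, "key2": int(key2)},
    }


row = parse_row("tom,jerry\tname:beijing,code:010\tabc,7")
print(row["persons"][0])    # like persons[0]
print(row["city"]["name"])  # like city['name']
print(row["str"]["key1"])   # like str.key1
```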
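    4. Storage formats of Hive tables in HDFS
     1. Built-in formats:
      1. textfile: stores data as plain text.
      2. sequencefile: stores key/value data in binary form. Data can be compressed to save HDFS space, at some cost in efficiency.
      3. rcfile: combines row and column storage. Data is first split into blocks so that a whole record stays within one block, which avoids reading several blocks to fetch one record. Within each block the data is stored column by column, which helps storage and makes column access fast.
     2. The file format can be extended: the default file reading method, a custom InputFormat, or a custom SerDe.
   2. Basic table operations
    1. General operations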
     hive (db1)> show tables;
     hive (db1)> show tables in db1;
     hive (db1)> show tables like 't*';
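     hive (db1)> show create table t_user; // show the statement that created the table
     hive (db1)> drop table t1;
     hive (db1)> desc formatted t_emp01; // show the table's detailed description
    2. Creating tables (after creating a table and inserting data, the table data can be viewed in Hadoop's HDFS)
     1. Plain create: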
      hive (db1)> create table t_user02(id int,name string);
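     2. Create a table with the structure of an existing table
      hive (db1)> create table t_emp02 like t_emp01;
     3. Create a table from an existing table's structure and copy data into it
      Syntax: create table table_name as select_statement;
      hive (db1)> create table t_emp03 as select name,salary from t_emp01;
    3. Altering tables
     1. Change the table structure: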
      hive (db1)> alter table t_emp rename to t_emp04;
      hive (db1)> alter table t_emp04 add columns(id int,sex string);
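      hive (db1)> alter table t_emp04 change id newId string; // renames column id to newId and sets its type to string
     2. Change the table type:
      alter table t_user01 set tblproperties ('EXTERNAL'='TRUE'); // convert a managed table to an external table
      alter table t_user01 set tblproperties ('EXTERNAL'='FALSE'); // convert an external table to a managed table
    4. Inserting data:
     1. Load data with load
      Append: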
       hive (db1)> load data local inpath '/usr/a.txt' into table t_emp;
      Overwrite:
       hive (db1)> load data local inpath '/usr/a.txt' overwrite into table t_emp;
      Load a file that is already in Hadoop's HDFS into Hive (omit local):
       hive (db1)> load data inpath '/hivedata/a.txt' into table t_user01;
     2. Insert data with insert:
      opt1:
       hive> insert into table t_user03 select id,name from t_user02;   (append)
       hive> insert overwrite table t_user03 select id,name from t_user02; (overwrite)
      opt2:
       hive> from t_user02 insert into table t_user03 select id,name;
       hive> from t_user02 insert overwrite table t_user03 select id,name;
       hive> from t_user02 insert into table t_user03 select id,name where id >3;
     3. Inserting into partitions
      1. With load:
       load data [local] inpath '/usr/a.data' [overwrite] into table t_user02 partition (userGroup='ug3');
      2. With insert:
       insert into table t_user02 partition(userGroup="ug3") select id,name from t_user02 where id=1;
       insert overwrite table t_user02 partition(userGroup="ug3") select id,name from t_user02 where id=1;
    5. Queries:
      SELECT [ALL | DISTINCT] select_expr, select_expr, ...
      FROM table_reference
      [WHERE where_condition]
      [GROUP BY col_list]
      [CLUSTER BY col_list
        | [DISTRIBUTE BY col_list] [SORT BY col_list]
      ]
      [LIMIT number]
     1. Basic queries:
      select * from t_user;
      select * from t_user where id >1;
      select * from t_user where id=1;
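     2. Sorting with order by. The sort runs on the reduce side, with a single reducer.
      hive> select * from t_us where uid=1 order by movieid [asc|desc],score;
      How it works: the sort column becomes the key and the other columns the value, and the pairs are handed to the reducer.
     3. Grouping with group by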
      hive> select uid,mid from t_us group by uid,mid;
      hive> select count(*) from t_us group by uid;
      hive> select count(*) from t_us group by score;
      hive> select count(*),score from t_us group by score having score >2;
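      ps: Set the number of reducers:
       set mapreduce.job.reduces=3;
      How it works: rows are grouped by col; the value of col becomes the key and the other columns become the values, and they are handed to the reducers.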
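The group by mechanism just described can be sketched as a tiny map, shuffle, and reduce pipeline. The rows below are made-up (uid, mid, score) tuples standing in for t_us, and the function names are only illustrative; the result corresponds to select uid, count(*) from t_us group by uid.

```python
from collections import defaultdict

# Hypothetical (uid, mid, score) rows standing in for t_us.
rows = [(1, 10, 3), (1, 11, 5), (2, 10, 3), (3, 12, 1), (2, 13, 5)]


def map_phase(rows, key_index):
    """Emit (group key, whole row) pairs, as a mapper would."""
    for row in rows:
        yield row[key_index], row


def shuffle(pairs):
    """Collect all values that share a key, as the shuffle step does."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_count(groups):
    """count(*) per key: each reducer call sees one key and its values."""
    return {key: len(values) for key, values in groups.items()}


counts = reduce_count(shuffle(map_phase(rows, key_index=0)))
print(sorted(counts.items()))  # [(1, 2), (2, 2), (3, 1)]
```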

     4. Common aggregate functions (group functions)
      hive> select max(uid) from t_us;
      hive> select count(*) from t_us;
      hive> select sum(uid) from t_us;
      hive> select avg(uid) from t_us;
      hive> select distinct uid from t_us;
      hive> select count(distinct uid) from t_us;
     5. Joins with join
      select * from t_user02 join t_us02 on t_user02.id = t_us02.uid;
      select * from t_user02 left outer join t_us02 on t_user02.id = t_us02.uid;
      select * from t_user02 right outer join t_us02 on t_user02.id = t_us02.uid;
      select * from t_user02 right outer join t_us02 on t_user02.id = t_us02.uid where t_us02.uid >12;
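The three join types differ only in what happens to rows without a match. A minimal sketch with invented data, where users stands in for t_user02 (id, name) and ratings for t_us02 (uid, score), and None plays the role of NULL:

```python
users = [(1, "ann"), (2, "bob"), (3, "cat")]  # (id, name)
ratings = [(1, 5), (1, 3), (3, 4), (9, 2)]    # (uid, score)


def inner_join(left, right):
    """Keep only pairs where id == uid."""
    return [l + r for l in left for r in right if l[0] == r[0]]


def left_outer_join(left, right):
    """Keep every left row; pad with None when nothing on the right matches."""
    out = []
    for l in left:
        matches = [r for r in right if r[0] == l[0]]
        if matches:
            out.extend(l + r for r in matches)
        else:
            out.append(l + (None, None))
    return out


def right_outer_join(left, right):
    """Keep every right row; pad with None when nothing on the left matches."""
    out = []
    for r in right:
        matches = [l for l in left if l[0] == r[0]]
        if matches:
            out.extend(l + r for l in matches)
        else:
            out.append((None, None) + r)
    return out


print(inner_join(users, ratings))
print(left_outer_join(users, ratings))
print(right_outer_join(users, ratings))
```

Here bob (id 2) appears only in the left outer join, and the rating with uid 9 appears only in the right outer join.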
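     6. distribute by: distribute by col sends rows to different reducers according to the col column
       hive> select * from t_us distribute by score sort by uid;
      sort by col [asc|desc]: sorts the rows by the col column within each reducer
      Combining distribute by with sort by guarantees that the data in each reducer is sorted.

      distribute by versus group by:
       Both partition the data by key and both use a reduce step.
       distribute by only spreads the data across reducers; group by also aggregates the rows that share a key.
      order by versus sort by:
       order by sorts globally.
       sort by only guarantees that each reducer's data is sorted. With a single reducer, sort by behaves the same as order by.
     7. cluster by: brings rows with the same value together and sorts them. Equivalent to distribute by col sort by col.
      select * from t_us cluster by score;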
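The relationship between these clauses can be sketched by simulating reducers as plain lists. This is a simplification under assumptions: rows are invented (uid, score) pairs, there are two reducers, and a row is routed by taking the key modulo the reducer count, whereas Hive routes by a hash of the key.

```python
rows = [(5, 3), (2, 1), (9, 3), (4, 2), (7, 1), (1, 2)]  # (uid, score)
NUM_REDUCERS = 2


def distribute_by(rows, key_index, num_reducers):
    """Route each row to a reducer by its key (modulo stands in for hashing)."""
    buckets = [[] for _ in range(num_reducers)]
    for row in rows:
        buckets[row[key_index] % num_reducers].append(row)
    return buckets


def sort_by(buckets, key_index):
    """Sort inside each reducer only; there is no ordering across reducers."""
    return [sorted(b, key=lambda r: r[key_index]) for b in buckets]


def order_by(rows, key_index):
    """Global sort: everything goes through a single reducer."""
    return sorted(rows, key=lambda r: r[key_index])


def cluster_by(rows, key_index, num_reducers):
    """Distribute and then sort within each reducer on the same column."""
    return sort_by(distribute_by(rows, key_index, num_reducers), key_index)


# distribute by score sort by uid
per_reducer = sort_by(distribute_by(rows, 1, NUM_REDUCERS), 0)
print(per_reducer)
# order by uid
print(order_by(rows, 0))
# cluster by score
print(cluster_by(rows, 1, NUM_REDUCERS))
```

Each inner list is sorted, but concatenating them does not give a globally sorted result, which is the gap between sort by and order by.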
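     8. Writing query results elsewhere:
      Write the query result to a local Linux directory:
       hive> insert overwrite local directory '/usr/cc' select * from t_user;
      Write the query result to Hadoop's HDFS:
       hive> insert overwrite directory '/usr/cc' select * from t_user;
      Write the query result to the Hive table t_temp03:
       hive> insert overwrite table t_temp03 select * from t_user;
   3. Functions in Hive:
    1. Built-in functions:
     1. List all functions available in the current session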
      hive> show functions;
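     2. Show a function's description
      hive> desc function max;
     3. Show a function's extended description
      hive> desc function extended max;
     4. Common aggregate functions
      sum() count() avg() min() max(), optionally with distinct, e.g. count(distinct uid)
     5. Window functions, mainly used for partitioned sorting, dynamic group by, top N, running totals, and hierarchical queries
      lead(),lag(),first_value(),last_value()
     6. Analytic functions:
      rank(),row_number(),dense_rank(),cume_dist(),percent_rank(),ntile()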
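The three ranking functions differ only in how they treat ties. A small sketch over one partition, ordered by score descending; the scores and the function name rank_rows are invented for illustration:

```python
def rank_rows(scores):
    """Return (score, row_number, rank, dense_rank) tuples, ordered by score descending."""
    ordered = sorted(scores, reverse=True)
    out = []
    rank = 0
    dense = 0
    prev = None
    for position, score in enumerate(ordered, start=1):
        if score != prev:
            rank = position  # rank jumps to the row's position, leaving gaps after ties
            dense += 1       # dense_rank increases by one per distinct value
            prev = score
        out.append((score, position, rank, dense))
    return out


for line in rank_rows([90, 80, 90, 70]):
    print(line)
```

The two rows with score 90 share rank 1 and dense_rank 1 but get row_number 1 and 2; the next score then gets rank 3 and dense_rank 2.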
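    2. User-defined functions:
     UDF (user-defined function): a function written by the user
  ================================================================================
  Integrating Hive with HBase:
   Hive provides storage handlers, which store data outside HDFS while still allowing inserts, queries, and other operations through Hive. Hive also ships hive-hbase-handler for HBase. This lets us save the cost of writing MapReduce code by using Hive while still getting HBase's fast response to random queries.

  Integration steps:
   1. Make sure Hadoop and HBase are running normally.
   2. Make sure Hive's lib directory contains the hive-hbase-handler-x.y.z.jar file.
   3. Start Hive with: ./hive -hiveconf hbase.zookeeper.quorum=hadoop01:2181
   4. Run the following statements to create the tables: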

create table t_user07_1(id int,age int,sex string,role string,uid int)
row format delimited fields terminated by '\|'
lines terminated by '\n';
load data local inpath '/usr/f.txt' into table t_user07_1;

// If the table already exists in HBase, it must be created as an external table. Data must be imported with insert.
create table t_user07(id int,age int,sex string,role string,uid int)
row format delimited fields terminated by '\|'
lines terminated by '\n'
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,info:age,info:sex,info:role,info:uid")
TBLPROPERTIES ("hbase.table.name" = "t_user07");

insert into t_user07 select * from t_user07_1;
