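Hive --------- Hive supplementary notes

Supplement:

Regular expressions, JSON parsing, and XML parsing are not covered here; you need to study and master these topics on your own.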

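Categories of user-defined functions

UDF (user-defined function): operates on a single input row and produces a single output row (for example, math functions and string functions).

UDAF (user-defined aggregate function): takes multiple input rows and produces a single output row (for example, count and max).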
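To make the row-wise versus aggregate distinction concrete, here is a minimal plain-Java analogy. It does not use Hive's UDF or UDAF API; the class name RowVsAggregate, the methods lowerOne and maxOf, and the sample values are all hypothetical and chosen only for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Plain-Java analogy only; none of this is Hive API.
class RowVsAggregate {

    // UDF-like: one input value in, one output value out.
    static String lowerOne(String s) {
        return s == null ? null : s.toLowerCase();
    }

    // UDAF-like: many input values in, a single output value out.
    static int maxOf(List<Integer> values) {
        int max = Integer.MIN_VALUE;
        for (int v : values) {
            max = Math.max(max, v);
        }
        return max;
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList("Tom", "JERRY");
        // Applied value by value: the output has as many entries as the input.
        List<String> lowered = names.stream()
                .map(RowVsAggregate::lowerOne)
                .collect(Collectors.toList());
        System.out.println(lowered);                        // [tom, jerry]
        // Applied to the whole group: three inputs collapse into one output.
        System.out.println(maxOf(Arrays.asList(3, 9, 4)));  // 9
    }
}
```

The first call keeps one output per input, which is how a string function such as lower behaves; the second reduces a group of inputs to one value, which is how max behaves.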

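UDF development example

0. First import the required jar files (they are in the day's course materials under software/lib-开发).

1. Write a Java class that extends UDF and overloads the evaluate method: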

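package com.qianfeng.udf;

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

// Simple UDF that converts a string to lower case.
public final class Lower extends UDF {

    // Called by Hive for each input row; a null input yields a null output.
    public Text evaluate(final Text s) {
        if (s == null) {
            return null;
        }
        return new Text(s.toString().toLowerCase());
    }
}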

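2. Package the class into a jar and upload it to the server.

3. Add the jar to Hive's classpath by running the following statement in the Hive command line: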

hive>add JAR /home/hadoop/udf.jar;

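4. Create a temporary function bound to the Java class you developed:

hive>create temporary function Lower as 'com.qianfeng.udf.Lower';

5. The custom function Lower can now be used in HQL:

select Lower(name),age from t_test;

6. If the function is definitely no longer needed, drop it:

drop temporary function Lower;

7. List the available functions: show functions;

The steps above are the first way to set up a custom function. There are other ways as well: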

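Using a custom function, approach two:

1. Write the custom UDF and package it as a jar.

2. Create a new directory under the Hive home directory:

mkdir   customUDF-lib

3. Add the following configuration to conf/hive-site.xml under the Hive home directory: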

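<property>
<!-- The property name is missing from these notes; hive.aux.jars.path (Hive's auxiliary-jar setting) is assumed. -->
<name>hive.aux.jars.path</name>
<value>/home/hadoop/develop_env/apache-hive-1.2.1-bin/customUDF-lib</value>
</property>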

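4. Put the custom function jar in the /home/hadoop/develop_env/apache-hive-1.2.1-bin/customUDF-lib directory.

5. Create a temporary function bound to the Java class you developed:

create temporary function Lower as 'com.qianfeng.udf.Lower';

6. Use the custom function:

select Lower(name),age from t_test;

With this approach, the jars under /home/hadoop/develop_env/apache-hive-1.2.1-bin/customUDF-lib are loaded automatically when the Hive command line starts.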

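Using a custom function, approach three:

1. Create an initialization file named init-hive in the /home/hadoop directory:

vi  /home/hadoop/init-hive    #add the following content

add jar /home/hadoop/udf.jar;

create temporary function myLower as "com.qianfeng.udf.Lower";

2. Start Hive with this command (run from /home/hadoop, where the file was created): hive -i ./init-hive

3. Check that the function was added:

show functions;

select myLower("aaaEEEDDD");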

Custom UDF exercises

(1) String reversal, e.g. www.baidu.com

select fun1("www.baidu.com");
-- output: moc.udiab.www

(2) Domain-name reversal, e.g. www.baidu.com

select fun1("www.baidu.com");
-- output: com.baidu.www
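The core logic of both exercises can be sketched in plain Java as follows. This is one possible approach, not a reference solution: the class name ReverseExamples and the methods reverseString and reverseDomain are hypothetical, and the code deliberately leaves out the Hive UDF wrapper so that it runs without the Hive jars. In a real UDF, an evaluate method like the one in Lower would convert the Text argument to a String and delegate to these methods.

```java
// Plain-Java sketch of the two exercises; no Hive classes are used.
class ReverseExamples {

    // Exercise (1): reverse the string character by character.
    static String reverseString(String s) {
        if (s == null) {
            return null;
        }
        return new StringBuilder(s).reverse().toString();
    }

    // Exercise (2): reverse the order of the dot-separated labels.
    static String reverseDomain(String s) {
        if (s == null) {
            return null;
        }
        String[] parts = s.split("\\.");
        StringBuilder out = new StringBuilder();
        for (int i = parts.length - 1; i >= 0; i--) {
            out.append(parts[i]);
            if (i > 0) {
                out.append('.');
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(reverseString("www.baidu.com")); // moc.udiab.www
        System.out.println(reverseDomain("www.baidu.com")); // com.baidu.www
    }
}
```

Both methods return null for a null input, mirroring the null check in the Lower example.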

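Hive data import and export

Data import:

1. From the local Linux filesystem into a Hive table:

load data local inpath '/home/hd'  overwrite  into table hd;

2. From HDFS into a Hive table:

load data inpath '/home/hd'  overwrite  into table hd;

3. From one Hive table into another (the where clause is optional):

insert into table hd
select * from hdtmp [where];

4. Manually copy a file into the Hive table's directory:

hadoop fs -put data.txt  /user/hive/warehouse/<db_name>.db/<table_name>

5. Import by pointing location at an HDFS directory: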

create table if not exists t_order2(

Id int,name string,rl string,price double

)

row format delimited fields terminated by ','

location 'hdfs://pdm:9000/aa';

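Characteristics of creating a table and importing its data at the same time in this way:

The table is still a managed (internal) table.
The data is imported at table-creation time through the location clause.
Hive does not create a t_order2 directory under /user/hive/warehouse/<db_name>.db. So which directory backs the table?

It is the directory named in the location clause (here, /aa).

When you run drop table t_order2, the /aa directory and the files inside it are deleted.
When using location, write the path as a full HDFS URI, that is, one that includes the "hdfs://pdm:9000" prefix.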

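6. like with location (cloning a table)

-- clone the table structure only
create table t_order5 like t_order2;

-- clone the table structure and load data
create table t_order6 like t_order2
location "hdfs://pdm:9000/aa";

7. CTAS (create table as select) into a Hive table

Build a new table, t_order3, from the result set of a query on another table:

-- a filter condition is allowed in the select
create table if not exists t_order3
as
select id,name,rl
from t_order
where id = 101;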

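Data export:

1. Export data from a Hive table to a local Linux directory

-- exports to the local directory /home/out/00; note that this is insert overwrite, not insert into
insert overwrite local directory '/home/out/00'
select * from t_order;

2. Export data from a Hive table to an HDFS directory

-- without local, the data is exported to the /home/out/00 directory on HDFS
insert overwrite directory '/home/out/00'
select * from t_order;

Both methods above write the fields with Hive's default delimiter (Ctrl-A, \001), which is awkward to read. Specify the delimiter explicitly to fix this:

-- fix the field delimiter of the exported data
insert overwrite local directory '/home/out/01'
row format delimited fields terminated by '\t'
select * from t_order;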
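To illustrate what the delimiter choice means for a program that reads the exported files, here is a small Java sketch. Hive's default field delimiter in such exports is Ctrl-A (\u0001); with the row format clause the fields are separated by tabs instead. The class name ExportLineSplit, the method splitFields, and the sample rows are hypothetical and do not come from a real export.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

// Splits one line of an exported file into its fields.
class ExportLineSplit {

    // Hive's default field delimiter, Ctrl-A.
    static final String HIVE_DEFAULT_DELIMITER = "\u0001";

    // Pattern.quote treats the delimiter literally; -1 keeps trailing empty fields.
    static String[] splitFields(String line, String delimiter) {
        return line.split(Pattern.quote(delimiter), -1);
    }

    public static void main(String[] args) {
        // Hypothetical rows: one as written with the default delimiter, one with tabs.
        String defaultLine = "101\u0001apple\u0001fruit";
        String tabLine = "101\tapple\tfruit";

        System.out.println(Arrays.toString(
                splitFields(defaultLine, HIVE_DEFAULT_DELIMITER))); // [101, apple, fruit]
        System.out.println(Arrays.toString(
                splitFields(tabLine, "\t")));                       // [101, apple, fruit]
    }
}
```

Either file can be parsed once the delimiter is known; the tab-delimited form is simply easier to inspect with ordinary text tools.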

3. Export data from a Hive table to a Linux file

#Run these commands in the Linux shell (not in the Hive shell)

hive -e  "use qf_db ; select * from t_order"  >>  /home/out/02;

#-S is silent mode: it suppresses output that is not part of the result set

hive -S  -e  "use qf_db ; select * from t_order" > /home/out/03;
