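Bulk Data Import in Hive

In an earlier post, [Big Data] Inserting Multiple Rows into a Hive Table, I briefly covered a few ways to insert data into a Hive table. Most of the time, though, we do not insert rows one at a time; we import data in bulk. This post gives a fairly complete overview of the ways to bulk-import data into Hive.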

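1. Loading data from the local file system (load)

load data [local] inpath 'path' [overwrite] into table table_name [partition (partition_column=value, ...)];

overwrite: replaces the data already in the table; without it, the new data is appended.
local: reads the file from the local file system. This kind of load copies the data, so the source file stays where it is.

(1) Create a table.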

hive (default)> create table student(id string, name string) 
                row format delimited fields terminated by '\t';

(2) Load a local file into Hive.

hive (default)> load data local inpath '/opt/module/datas/student.txt' 
                into table default.student;
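
As a hedged sketch of the optional clauses in the syntax above, the following loads the same file into one partition of a partitioned table and replaces whatever that partition already held. The table name student_part and the partition column month are hypothetical names introduced only for this illustration; the file path is the one used above.

```sql
-- Hypothetical partitioned table, used only for this illustration.
create table if not exists student_part(id string, name string)
partitioned by (month string)
row format delimited fields terminated by '\t';

-- local: read from the local file system (the file is copied, not moved).
-- overwrite: replace the existing contents of partition month='201801'.
load data local inpath '/opt/module/datas/student.txt'
overwrite into table student_part partition (month='201801');

-- Check that the partition exists and holds the rows.
show partitions student_part;
select * from student_part where month='201801';
```

Running the load a second time without overwrite would append a second copy of the rows to the same partition.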

2. Loading data from HDFS (load)

Loading data from HDFS into a table is really just a file move, so the data has to be uploaded to HDFS beforehand.

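(1) Upload the file to HDFS (copy the local Linux file /opt/module/datas/student.txt to the /user/victor/hive directory, creating the directory first if it does not exist).

hive (default)> dfs -mkdir -p /user/victor/hive;
hive (default)> dfs -put /opt/module/datas/student.txt /user/victor/hive;

(2) Load the data from HDFS into the table.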

hive (default)> load data inpath '/user/victor/hive/student.txt' 
                into table default.student;
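
One way to see that this load is a move rather than a copy is to list the source directory afterwards. The warehouse path below assumes the default hive.metastore.warehouse.dir of /user/hive/warehouse and a table in the default database; adjust it if your setup differs.

```sql
-- After the load, student.txt should no longer be listed here.
dfs -ls /user/victor/hive;

-- Assumed default warehouse location of default.student.
dfs -ls /user/hive/warehouse/student;

-- The loaded rows are now visible through the table.
select count(*) from student;
```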

3. Inserting data into a table with as select

hive (default)> create table if not exists student3 as select id, name from student;
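
A slightly extended sketch: create table ... as select can also filter rows and choose a storage format for the new table. The table name student_orc, the filter, and the ORC format are illustrative assumptions, not part of the original example.

```sql
-- The new table's columns and types are taken from the select list.
create table if not exists student_orc
stored as orc
as
select id, name
from student
where id is not null;

-- Inspect the resulting schema and storage format.
desc formatted student_orc;
```

Note that with if not exists, the statement is skipped entirely when the table already exists, so no new rows are written in that case.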

4. Inserting data into a table with insert into

insert into table test [partition(partcol1=val1, partcol2=val2 ...)] select id,name from student;

insert into: appends the rows to the table or partition; existing data is not deleted.

insert overwrite table test [partition(partcol1=val1, partcol2=val2 ...)] select id,name from student;

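insert overwrite: replaces the data already in the table or partition.

(1) Create a partitioned table (if the student table from Section 1 still exists, drop it first or use a different name).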

hive (default)> create table student(id string, name string) 
				partitioned by (month string) 
				row format delimited fields terminated by '\t';

(2) Basic insert.

hive (default)> insert into table student partition(month='201801') 
				values('1004','wangwu');

(3) Basic-mode insert (from the result of a query on a single table).

hive (default)> insert overwrite table student partition(month='201802') 
				select id, name from student where month='201801';

(4) Multi-insert mode (the source table is scanned only once to produce several disjoint outputs).

hive (default)> from student
             	insert overwrite table student partition(month='201803')
             	select id, name where month='201801'
             	insert overwrite table student partition(month='201804')
             	select id, name where month='201801';
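
The example above writes the same rows to both partitions. A multi-insert is more typically used to split a single scan across different filters; the sketch below assumes two hypothetical target tables, student_low and student_high, and an arbitrary split on id (compared as a string, since id is declared as string).

```sql
-- Hypothetical target tables with the same columns as the select lists.
create table if not exists student_low(id string, name string);
create table if not exists student_high(id string, name string);

-- One scan of student feeds both inserts; each insert applies its own where clause.
from student
insert overwrite table student_low
  select id, name where month='201801' and id < '1005'
insert overwrite table student_high
  select id, name where month='201801' and id >= '1005';

-- List the partitions written by the earlier examples.
show partitions student;
```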

5. Using location

Upload the data files directly to the HDFS directory specified by location.

(1) Create a table and specify its location on HDFS.

hive (default)> create external table student(id int, name string)
             	row format delimited fields terminated by '\t'
             	location '/user/hive/warehouse/student';

(2) Upload the data to HDFS.

hive (default)> dfs -mkdir -p /user/hive/warehouse/student;
hive (default)> dfs -put /opt/module/datas/student.txt /user/hive/warehouse/student;

(3) Query the data.

select * from student;
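
Because the table only points at the directory, any further file placed there becomes visible to the next query without a load or insert. The file student2.txt below is a hypothetical second data file in the same tab-separated layout.

```sql
-- Hypothetical second file with the same tab-separated layout.
dfs -put /opt/module/datas/student2.txt /user/hive/warehouse/student;

-- Rows from both files are now returned.
select * from student;

-- Dropping an external table removes only the metadata;
-- the files under the location directory remain.
drop table student;
dfs -ls /user/hive/warehouse/student;
```

For a partitioned table the situation differs: partition directories uploaded this way are not visible until they are registered, for example with alter table ... add partition or msck repair table.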
