[Big Data] Batch data import in Hive

In the blog [Big Data] Insert multiple pieces of data into Hive table, I briefly introduced several ways to insert data into a Hive table. In practice, however, we rarely insert rows one at a time; we usually import data in batches. This article gives a more complete overview of the methods for bulk-importing data into Hive.

1. Load data from the local file system

load data [local] inpath 'path' [overwrite] into table table_name [partition (partcol=val, ...)];
  • local: load from the local file system; without it, the path is interpreted as an HDFS path.
  • overwrite: replace the existing data in the table; without it, the data is appended.
  • Loading from the local file system copies the data; the source file is left in place.
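Before loading, you need a data file whose columns are separated by the delimiter declared in the table DDL (a tab here). A minimal sketch for generating such a file — the rows and the /tmp path are made up for illustration (the examples below use /opt/module/datas):

```shell
# Build a tab-delimited sample file matching the student(id, name) schema.
mkdir -p /tmp/datas
printf '1001\tzhangsan\n1002\tlisi\n1003\twangwu\n' > /tmp/datas/student.txt
# Each line is one row: id <TAB> name
cat /tmp/datas/student.txt
```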

(1) Create a table.

hive (default)> create table student(id string, name string) 
                row format delimited fields terminated by '\t';

(2) Load local files to Hive.

hive (default)> load data local inpath '/opt/module/datas/student.txt' 
                into table default.student;

2. Load data from the HDFS file system

Loading data from HDFS into a table is actually a file move, so the data must be uploaded to HDFS in advance.

(1) Upload the file to HDFS (from the Linux local path /opt/module/datas/student.txt to the HDFS directory /user/victor/hive).

hive (default)> dfs -put /opt/module/datas/student.txt /user/victor/hive;

(2) Load data into the table from the HDFS file system.

hive (default)> load data inpath '/user/victor/hive/student.txt' 
                into table default.student;

3. Create a table from query results (create table ... as select)

hive (default)> create table if not exists student3 as select id, name from student;

4. Insert data into the table through insert into

insert into table test [partition(partcol1=val1, partcol2=val2 ...)] select id,name from student;
  • insert into: appends the query result to the table or partition; existing data is kept.
insert overwrite table test [partition(partcol1=val1, partcol2=val2 ...)] select id,name from student;
  • insert overwrite: replaces the existing data in the table or partition.
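In file-system terms, insert into behaves like appending to the table's data files, while insert overwrite truncates them first. A rough local-file analogy — the paths and rows here are made up for illustration:

```shell
mkdir -p /tmp/hive_demo
printf '1001\tzhangsan\n' > /tmp/hive_demo/t.txt    # initial table data
printf '1004\twangwu\n'  >> /tmp/hive_demo/t.txt    # like insert into: old rows kept
printf '2001\tzhaoliu\n'  > /tmp/hive_demo/t.txt    # like insert overwrite: old rows gone
cat /tmp/hive_demo/t.txt
```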

(1) Create a partition table (the non-partitioned student table created earlier must be dropped first, since the name is reused).

hive (default)> drop table if exists student;
hive (default)> create table student(id string, name string) 
				partitioned by (month string) 
				row format delimited fields terminated by '\t';

(2) Basic data insertion.

hive (default)> insert into table student partition(month='201801') 
				values('1004','wangwu');

(3) Basic mode insertion (insert the result of a query over a single table).

hive (default)> insert overwrite table student partition(month='201802') 
				select id, name from student where month='201801';

(4) Multi-insert mode (the source table is scanned only once to produce multiple disjoint outputs).

hive (default)> from student
             	insert overwrite table student partition(month='201803')
             	select id, name where month='201801'
             	insert overwrite table student partition(month='201804')
             	select id, name where month='201801';

5. By location

Upload the data files directly into the HDFS directory specified by the table's location clause.

(1) Create a table and specify its location on HDFS.

hive (default)> create external table student(id int, name string)
             	row format delimited fields terminated by '\t'
             	location '/user/hive/warehouse/student';

(2) Upload data to HDFS.

hive (default)> dfs -mkdir -p /user/hive/warehouse/student;
hive (default)> dfs -put /opt/module/datas/student.txt /user/hive/warehouse/student;
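A table defined with location simply reads every file that ends up in that directory, so additional files can be dropped in at any time. A local-filesystem sketch of that behavior (paths and rows are illustrative, not the HDFS paths above):

```shell
# Any file placed in the table's location directory becomes table data.
mkdir -p /tmp/warehouse/student
printf '1001\tzhangsan\n' > /tmp/warehouse/student/part-0.txt
printf '1002\tlisi\n'     > /tmp/warehouse/student/part-1.txt
# A full-table query sees the union of all files in the directory:
cat /tmp/warehouse/student/*.txt
```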

(3) Query data.

hive (default)> select * from student;

Origin blog.csdn.net/be_racle/article/details/132462535