Four data import methods in Hive


Reference for Hive's four data import methods:
http://blog.csdn.net/lifuxiangcaohui/article/details/40588929

Several common ways of importing data into Hive are introduced here:
(1) importing data from the local file system into a Hive table;
(2) importing data from HDFS into a Hive table;
(3) querying data from another table and inserting it into a Hive table;
(4) creating a table and, at creation time, populating it with records queried from another table.

1. Importing data from the local file system into a Hive table
create table wyp (
  id int,
  name string,
  age int,
  tel string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;

Put the data file in place:
[root@node0 hadoop-2.6.4]# cp data.txt /usr/local/apache-hive-2.0.0-bin/load
Load the data (a Hive statement, issued from the hive shell):
hive> load data local inpath 'data.txt' into table wyp;
data.txt:
1 9 0 7
Y 7 8 9
hive> desc wyp;
Note: the fields in the data file must be separated by tab characters, matching the delimiter declared in the DDL. Also note that, unlike the relational databases we are familiar with, Hive does not support supplying a set of records as literal text in the insert statement; that is, Hive does not accept statements of the form INSERT INTO ... VALUES. (Hive 0.14 and later added limited INSERT ... VALUES support, but LOAD remains the standard bulk path.)
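Since INSERT INTO ... VALUES is unavailable here, an ad-hoc row can only arrive through a file or through a select against an existing table. A minimal sketch of the classic workaround, assuming wyp already holds at least one row (the literal values are arbitrary):

hive> insert into table wyp select 2, 'adhoc', 30, '10086' from wyp limit 1;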

2. Importing data from HDFS into a Hive table
When importing data from the local file system, Hive actually first copies the data to a temporary directory on HDFS and then moves it into the table's data directory. Since the data ends up on HDFS anyway, Hive naturally also supports moving data from an existing HDFS path (for example /usr/hive/warehouse/add.txt) directly into a table. The specific operations are as follows:

[root@node0 bin]# bin/hadoop fs -rm /usr/hive/warehouse/add.txt 
[root@node0 hadoop-2.6.4]# bin/hadoop fs -put /home/panqiong/Documents/add.txt /usr/hive/warehouse/
[root@node0 hadoop-2.6.4]# bin/hadoop fs -cat /usr/hive/warehouse/add.txt 
16/05/11 18:24:38 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
1 9 0 8 
8 7 8 9 
hive> load data inpath '/usr/hive/warehouse/add.txt' into table wyp;

From the output above we can see that the data was indeed imported into the wyp table. Note that load data inpath '/usr/hive/warehouse/add.txt' into table wyp; contains no local keyword; that is the difference from method 1.
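It is worth demonstrating that, without the local keyword, load data inpath moves the source file on HDFS instead of copying it (a sketch; the paths follow the example above):

[root@node0 hadoop-2.6.4]# bin/hadoop fs -ls /usr/hive/warehouse/add.txt
ls: `/usr/hive/warehouse/add.txt': No such file or directory
[root@node0 hadoop-2.6.4]# bin/hadoop fs -ls /usr/hive/warehouse/wyp/
(add.txt now sits inside the wyp table's data directory)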

3. Querying data from another table and inserting it into a Hive table

Suppose Hive has a table test1, created as follows:
create table test1 (
  id int,
  name string,
  tel string)
partitioned by (age int)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;

This is broadly similar to the wyp table's DDL, except that test1 uses age as its partition column. A word more on partitions:
Partitions: in Hive, each partition of a table corresponds to a directory under the table's directory, and all of a partition's data is stored in that directory. For example, if the wyp table had two partition columns dt and city, then the partition dt=20131218, city=BJ would correspond to the directory /user/hive/warehouse/wyp/dt=20131218/city=BJ, and all data belonging to that partition would be stored there.
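This directory layout is easy to verify (a sketch; the warehouse path follows the examples in this article, and show partitions is standard HiveQL):

hive> show partitions test1;
[root@node0 hadoop-2.6.4]# bin/hadoop fs -ls /usr/hive/warehouse/test1
(each partition, such as age=25, appears as its own subdirectory)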

The following statement inserts the query results from the wyp table into the test1 table:
insert into table test1
partition (age=25)
select id, name, tel
from wyp;
As we noted, the traditional database form insert into table values (field1, field2) is not supported; the insert ... select form above is the Hive way.

From the output above, we can see that the rows queried from the wyp table were successfully inserted into the test1 table. If the target table (test1) had no partition column, the partition (age=25) clause could simply be dropped. We can also let the select statement supply the partition value dynamically:
hive> set hive.exec.dynamic.partition.mode=nonstrict;
insert into table test1
partition (age)
select id, name, 
tel, age
from wyp;
hive> select * from test1;
OK
1 0 8 25
9 9 7 25
1 0 8 25
9 9 7 25
9 9 7 8
1 0 8 9
Time taken: 0.689 seconds, Fetched: 6 row(s)

This technique is called dynamic-partition insert. It is restricted by default in Hive, so before using it you must set hive.exec.dynamic.partition.mode to nonstrict. Hive also supports inserting data with insert overwrite. As the word suggests, overwrite replaces: when the statement finishes, the data in the affected partition directories has been overwritten, whereas insert into appends. Mind the difference between the two. For example:
hive> set hive.exec.dynamic.partition.mode=nonstrict;

insert overwrite table test1
PARTITION (age)
select id, name, tel, age
from wyp;

Even better, Hive also supports multi-table insert. What does that mean? In Hive, we can turn the insert statement around and put the from clause at the very front; the execution result is the same as putting it at the end. First, here is the test3 table that the multi-table insert shown below writes into:
hive> show create table test3;
OK
CREATE  TABLE test3(
  id int,
  name string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;
(The data fields are separated by tab characters.)

Time taken: 0.277 seconds, Fetched: 18 row(s)

CREATE TABLE test4 (
  id int,
  name string);

Hive's default field delimiter is the ASCII control character \001; creating a table without a fields terminated by clause, as with test4 above, is equivalent to declaring fields terminated by '\001'. To type this control character into a test data file, open the file in vi and press Ctrl+V followed by Ctrl+A to enter \001. By the same pattern, \002 is entered with Ctrl+V then Ctrl+B, and so on.
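To confirm what actually landed in such a file, cat -v renders \001 visibly as ^A (a sketch; data001.txt is a hypothetical file name and the rows are illustrative):

[root@node0 bin]# cat -v data001.txt
1^Awyp
2^Atest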


The multi-table insert itself looks like this:
from wyp
insert into table test1
partition (age)
select id, name, tel, age
insert into table test3
select id, name
where age > 25;

hive> select * from test3;
OK
8       wyp4
2       test
3       zs
Time taken: 4.308 seconds, Fetched: 3 row(s)

Multiple insert clauses can be used in the same query; the benefit is that the source table needs to be scanned only once to produce several disjoint outputs. Pretty cool!

[root@node0 bin]# hadoop fs -put /home/panqiong/file.txt /usr/hive/warehouse/
[root@node0 bin]# bin/hadoop fs -cat /usr/hive/warehouse/file.txt
hive> load data inpath '/usr/hive/warehouse/file.txt' into table add partition (age=30);
 
When loading into a partitioned table, the partition must be specified.

The problem of empty (NULL) data: fields in the file must be separated by tab characters (\t, the Tab key), not spaces; otherwise the loaded columns come out as NULL.
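A quick way to check a data file before loading it (a sketch using file.txt from above) is to make the tabs visible; cat -A prints every tab as ^I and every line end as $:

[root@node0 bin]# cat -A file.txt
1^I9^I0^I7$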

4. Creating a table and populating it with records queried from another table
In practice, a query's result set may be too large to display comfortably on the console. In that case it is very convenient to store Hive's query output directly in a new table. This is known as CTAS (create table .. as select):

create table test4
as
select id, name, tel
from wyp;

The data is thereby inserted into the test4 table. The CTAS operation is atomic: if the select fails for any reason, the new table is not created.
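CTAS can also pin down the new table's storage layout instead of inheriting the defaults (a sketch; test5 is a hypothetical table name):

create table test5
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
as
select id, name, tel
from wyp;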

5. Errors reported during inserts:
create table test4
    > as
    > select id, name, tel
    > from wyp;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. tez, spark) or using Hive 1.X releases.
Query ID = root_20160511030540_df6875cd-17b0-454e-bd39-5598c41bc592
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1462934737317_0010, Tracking URL = http://localhost:8088/proxy/application_1462934737317_0010/
Kill Command = /root/hadoop/hadoop-2.6.4/bin/hadoop job  -kill job_1462934737317_0010
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2016-05-11 19:26:56,145 Stage-1 map = 0%,  reduce = 0%
2016-05-11 19:27:09,513 Stage-1 map = 100%,  reduce = 0%
Ended Job = job_1462934737317_0010 with errors
Error during job, obtaining debugging information...
Examining task ID: task_1462934737317_0010_m_000000 (and more) from job job_1462934737317_0010

Task with the most failures(4): 
-----
Task ID:
  task_1462934737317_0010_m_000000


URL:
  http://0.0.0.0:8088/taskdetails.jsp?jobid=job_1462934737317_0010&tipid=task_1462934737317_0010_m_000000
-----
Diagnostic Messages for this Task:
Container launch failed for container_1462934737317_0010_01_000005 : org.apache.hadoop.yarn.exceptions.InvalidAuxServiceException: The auxService:mapreduce_shuffle does not exist
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
at org.apache.hadoop.yarn.api.records.impl.pb.SerializedExceptionPBImpl.instantiateException(SerializedExceptionPBImpl.java:168)
at org.apache.hadoop.yarn.api.records.impl.pb.SerializedExceptionPBImpl.deSerialize(SerializedExceptionPBImpl.java:106)
at org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl$Container.launch(ContainerLauncherImpl.java:155)
at org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl$EventProcessor.run(ContainerLauncherImpl.java:369)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1   HDFS Read: 0 HDFS Write: 0 FAIL
Total MapReduce CPU Time Spent: 0 msec
Solution: the message auxService:mapreduce_shuffle does not exist means that YARN's NodeManager is not running the MapReduce shuffle auxiliary service, so map tasks cannot launch. Fix the YARN configuration and restart the NodeManagers.
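The usual fix is the following snippet in yarn-site.xml (a sketch; these are the standard Hadoop 2.x property names, and on this machine the file would live under /root/hadoop/hadoop-2.6.4/etc/hadoop/):

<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle</value>
</property>
<property>
  <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
  <value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>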

hive> insert into table test1
    > partition (age=25)
    > select id, name, tel
    > from wyp;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. tez, spark) or using Hive 1.X releases.
Query ID = root_20160511201903_22a557b4-bc47-4e08-8081-627e9ce5be4d
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1462934737317_0012, Tracking URL = http://localhost:8088/proxy/application_1462934737317_0012/
Kill Command = /root/hadoop/hadoop-2.6.4/bin/hadoop job  -kill job_1462934737317_0012
Interrupting... Be patient, this might take some time.
Press Ctrl+C again to kill JVM
killing job with: job_1462934737317_0012
Hadoop job information for Stage-1: number of mappers: 0; number of reducers: 0
2016-05-11 20:27:19,225 Stage-1 map = 0%,  reduce = 0%
Ended Job = job_1462934737317_0012 with errors
Error during job, obtaining debugging information...
FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask
MapReduce Jobs Launched: 
Stage-Stage-1:  HDFS Read: 0 HDFS Write: 0 FAIL
Total MapReduce CPU Time Spent: 0 msec
To investigate failures like this one (here the job was interrupted from the console, as the Ctrl+C messages show), re-run the Hive CLI with debug logging sent to the console:
hive -hiveconf hive.root.logger=DEBUG,console 
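If YARN log aggregation is enabled, the failed application's logs can also be pulled directly with the application id from the output above (a sketch; yarn logs is a standard Hadoop 2.x command):

[root@node0 bin]# yarn logs -applicationId application_1462934737317_0012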

 

Original post: http://blog.itpub.net/29050044/viewspace-2098563/
