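Basic Operations of the Hive Data Warehouse

To reinforce my memory of working with Hive, and to make later study easier, this post lists Hive's basic operations.

I. Basic Hive operations

1. Create the /data/hive directory on the local Linux file system: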

mkdir -p /data/hive

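2. Change to /data/hive and use an FTP tool to upload the stu_group.txt file from the assignment attachments into that directory:

WinSCP is recommended for the upload.

3. Start Hadoop, then check that the Hadoop processes are running: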

jps
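
As a rough sketch, the jps output can be checked against the daemons one expects to see. The daemon list below (NameNode, DataNode, ResourceManager, NodeManager) assumes a single-node HDFS plus YARN setup, and missing_daemons is a hypothetical helper name; adjust both to your own cluster.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads `jps` output on stdin and prints every expected
# daemon that is not listed. The daemon list assumes a single-node setup.
missing_daemons() {
    local output daemon
    output="$(cat)"
    for daemon in NameNode DataNode ResourceManager NodeManager; do
        # jps prints "<pid> <ClassName>", so compare the name with the second field
        if ! printf '%s\n' "$output" | awk -v d="$daemon" '$2 == d { found = 1 } END { exit !found }'; then
            echo "$daemon"
        fi
    done
}

# Usage: check the live process list when jps is available
if command -v jps >/dev/null 2>&1; then
    jps | missing_daemons
fi
```

An empty result means every daemon in the list was found.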

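4. Start MySQL:

Start the service: service mysqld start
Log in: mysql -uroot -p

5. Check MySQL's running status (from the mysql prompt):

show status;

6. Configure the Hive environment variables:

Set the variables in the following form, adjusting the path to your own Hive installation directory: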

export HIVE_HOME=/opt/hive
export PATH=$PATH:$HIVE_HOME/bin

7. Start Hive:
hive
Then run show databases;
If an error appears, exit Hive and initialize the metastore schema:
schematool -dbType mysql -initSchema

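II. Working with Hive tables

Clause	Purpose
create table	Creates a managed (internal) table
create external table	Creates an external table
fields terminated by '\t'	Delimiter between fields
collection items terminated by ','	Delimiter between the elements of a map, array, or struct
map keys terminated by ':'	Delimiter between a key and its value in a map
location '/external'	Path where the external table's data is stored
row format delimited	Declares that rows are delimited text; the delimiter clauses above follow it

1. Create an internal table named stu with two fields, stu_id and stu_name, both of type string, with '\t' as the field delimiter: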

hive> use default;
hive> create table stu(stu_id string , stu_name string)
    > row format delimited
    > fields terminated by '\t';

2. Create an external table named stu_external with two fields, stu_id and stu_name, both of type string, with '\t' as the field delimiter:

hive> create external table stu_external(stu_id string , stu_name string)
    > row format delimited
    > fields terminated by '\t'
    > location '/external';

3. Create a table with the same structure as an existing one. The like keyword creates stu2 with the same structure as stu:

hive> create table stu2 like stu;

4. Create the order_array table with an array field food and an integer field price. The field delimiter is '\t' and the array elements are separated by ',':

hive> create table order_array(food array<string> , price int)
	> row format delimited
	> fields terminated by '\t'
	> collection items terminated by ',';

5. Create the info_map table with a string field name and a map field information. Map entries are separated by ',' and each key is separated from its value by ':':

hive> create table info_map(name string , information map<string,string>)
    > row format delimited
    > fields terminated by '\t'
    > collection items terminated by ','
    > map keys terminated by ':';
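
To make the three delimiter levels concrete, here is a small sketch that builds one row in the shape info_map expects and splits it the same way with awk. The sample values (Tom, age, city) and the helper name parse_info_row are made up for illustration.

```shell
# One row for info_map: name <TAB> key:value,key:value
# (the sample values are invented for illustration)
row="$(printf 'Tom\tage:20,city:Beijing')"

# Split on the same three delimiters the table declares:
# '\t' between fields, ',' between map entries, ':' between key and value.
parse_info_row() {
    awk -F'\t' '{
        print "name=" $1
        n = split($2, entries, ",")
        for (i = 1; i <= n; i++) {
            split(entries[i], kv, ":")
            print "information[" kv[1] "]=" kv[2]
        }
    }'
}

printf '%s\n' "$row" | parse_info_row
```

A file made of rows like this one could then be loaded into info_map with load data local inpath.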

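6. Create a partitioned table. The partition column is declared in the partitioned by clause and must not repeat a column from the table's column list; it can still be queried like a regular column. The example below partitions by class.

hive> create table stu_partition(name string , age int , grade int , sex string)
    > partitioned by (class int)
    > row format delimited
    > fields terminated by ',';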
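
Because class is not stored in the data file, its value is supplied when the data is loaded. The following is a sketch only: /data/hive/class1.txt is a hypothetical comma-delimited file with name,age,grade,sex rows for class 1, and the script runs Hive only if the hive command is on the PATH.

```shell
# Write the HiveQL to a script file. /data/hive/class1.txt is a hypothetical
# comma-delimited file holding name,age,grade,sex rows for class 1.
hql_file="$(mktemp)"
cat > "$hql_file" <<'EOF'
-- The partition value comes from the statement, not from the file contents
load data local inpath '/data/hive/class1.txt'
into table stu_partition partition (class = 1);

-- List the partitions that now exist for the table
show partitions stu_partition;
EOF

# Run the script only when the hive CLI is available
if command -v hive >/dev/null 2>&1; then
    hive -f "$hql_file"
fi
```

Each partition is stored in its own subdirectory of the table directory, named after the partition value (here class=1).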

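7. Create a bucketed table. The bucketing column is usually an integer: Hive hashes the column value, takes the remainder modulo the number of buckets, and assigns each row to the bucket with that index.

hive> create table stu_bucket(name string , age int , grade int , sex string)
    > clustered by (age) into 3 buckets
    > row format delimited
    > fields terminated by ',';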
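
The remainder step can be sketched outside Hive. This example assumes the classic behavior in which an int column hashes to its own value, so the bucket index is simply age mod 3; newer Hive versions may use a different hash function, in which case the actual assignment differs. The sample rows and the helper name assign_buckets are invented for illustration.

```shell
# Sample rows (name,age,grade,sex); the values are invented for illustration
sample_rows='Ann,18,90,F
Bob,19,85,M
Cid,20,70,M
Dee,21,88,F'

# Assumption: the integer hashes to itself, so bucket index = age % 3
assign_buckets() {
    awk -F',' '{ print $1 " -> bucket " ($2 % 3) }'
}

printf '%s\n' "$sample_rows" | assign_buckets
```

Under that assumption, rows with ages 18 and 21 share bucket 0, while ages 19 and 20 go to buckets 1 and 2.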

III. Importing and exporting data in Hive

1. Import data from the local Linux file system into a Hive table.

Load the stu_group.txt file from the local /data/hive directory into the stu table:

hive> load data local inpath '/data/hive/stu_group.txt' into table stu;

Run a select to confirm that the data was loaded into stu. Because the data set is large, use limit to restrict the output to 10 rows:

hive> select * from stu limit 10;
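
If the query shows NULL in stu_name, the file is probably not tab-delimited as the table expects, since Hive fills fields it cannot find with NULL. A quick check along these lines lists the offending rows; bad_rows is a hypothetical helper, and the expected field count of 2 matches the stu table.

```shell
# Hypothetical helper: print "<line number>: <line>" for every row of file $1
# that does not contain exactly $2 tab-separated fields.
bad_rows() {
    awk -F'\t' -v expected="$2" 'NF != expected { print NR ": " $0 }' "$1"
}

# Check the file from the steps above if it exists on this machine
data_file=/data/hive/stu_group.txt
if [ -f "$data_file" ]; then
    bad_rows "$data_file" 2
fi
```

No output means every row has the two fields that stu declares.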

2. Import data from HDFS into Hive.
Create the /myhive2 directory on HDFS:

hdfs dfs -mkdir /myhive2

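Upload stu_group.txt from the local /data/hive/ directory to /myhive2 on HDFS:

hdfs dfs -put /data/hive/stu_group.txt /myhive2   # note the space between source and destination

Load stu_group.txt from /myhive2 on HDFS into the stu_external table. Without the local keyword, Hive moves the file from its HDFS location into the table's directory:

hive> load data inpath '/myhive2/stu_group.txt' into table stu_external;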

3. Import data selected from another table.

Copy the data in stu into stu2 using either insert into table, which appends to the existing rows, or insert overwrite table, which replaces them:

hive> insert into table stu2
	> select * from stu;

hive> insert overwrite table stu2
	> select * from stu;

If the following error appears:

Error during job, obtaining debugging information...
FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask
MapReduce Jobs Launched: 
Stage-Stage-1:  HDFS Read: 0 HDFS Write: 0 FAIL
Total MapReduce CPU Time Spent: 0 msec

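A possible cause is insufficient memory on the NameNode. After starting Hive, try the following command:

set hive.exec.mode.local.auto=true;

This lets Hive automatically run sufficiently small jobs in local mode instead of submitting them to the cluster.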
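
To avoid retyping the setting in every session, it can be kept in a small settings file, as in the sketch below. The two threshold properties bound the input size and input file count for which Hive picks local mode; the values shown are illustrative, and the temporary file location is an arbitrary choice.

```shell
# Collect the session settings in a file; the values are illustrative
hive_settings="$(mktemp)"
cat > "$hive_settings" <<'EOF'
-- Let Hive decide per query whether to run locally
set hive.exec.mode.local.auto=true;
-- Upper bounds on input size (bytes) and input file count for local runs
set hive.exec.mode.local.auto.inputbytes.max=134217728;
set hive.exec.mode.local.auto.input.files.max=4;
EOF

# Show what was written
cat "$hive_settings"
```

Starting the CLI with hive -i followed by the file path runs these statements before the prompt appears.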

IV. Exporting data

1. Export to the local Linux file system.

First, create the /data/hive/out directory locally:

mkdir -p /data/hive/out

Then export the stu table from Hive to /data/hive/out on the local file system:

hive> insert overwrite local directory '/data/hive/out' 
    > row format delimited
    > fields terminated by '\t'
    > select * from stu;

After the export completes, change to /data/hive/out on the local machine and view the contents of the exported file:

cd /data/hive/out
cat 000000_0

2. Export data from Hive to HDFS.

Create the /myhive2/out directory on HDFS:

hdfs dfs -mkdir -p /myhive2/out

Then export the data in the stu table to the /myhive2/out directory on HDFS:

hive> insert overwrite directory '/myhive2/out'
    > row format delimited
    > fields terminated by '\t'
    > select * from stu;

After the export completes, view the result in the /myhive2/out directory on HDFS:

hdfs dfs -cat /myhive2/out/000000_0