DDL
Very similar to standard SQL statements.
Create a database:
CREATE (DATABASE|SCHEMA) [IF NOT EXISTS] database_name
[COMMENT database_comment]
[LOCATION hdfs_path]
[WITH DBPROPERTIES (property_name=property_value, ...)];
Create a table:
CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name -- (Note: TEMPORARY available in Hive 0.14.0 and later)
[(col_name data_type [COMMENT col_comment], ... [constraint_specification])]
[COMMENT table_comment]
[PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)]
[CLUSTERED BY (col_name, col_name, ...) [SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS]
[SKEWED BY (col_name, col_name, ...) -- (Note: Available in Hive 0.10.0 and later)
ON ((col_value, col_value, ...), (col_value, col_value, ...), ...)
[STORED AS DIRECTORIES]]
[
[ROW FORMAT row_format]
[STORED AS file_format]
| STORED BY 'storage.handler.class.name' [WITH SERDEPROPERTIES (...)] -- (Note: Available in Hive 0.6.0 and later)
]
[LOCATION hdfs_path]
[TBLPROPERTIES (property_name=property_value, ...)] -- (Note: Available in Hive 0.6.0 and later)
[AS select_statement]; -- (Note: Available in Hive 0.5.0 and later; not supported for external tables)
DML
Load data:
LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename [PARTITION (partcol1=val1, partcol2=val2 ...)]
LOCAL: filepath is on the local filesystem
without LOCAL: filepath is on HDFS
OVERWRITE: replace the existing data in the table/partition
without OVERWRITE: append to the existing data
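The OVERWRITE vs append behavior can be pictured with a toy model (this is a sketch of the semantics, not Hive itself; a table is reduced to its list of data files):

```python
# Toy model of LOAD DATA ... [OVERWRITE] INTO TABLE semantics:
# OVERWRITE replaces the table's existing files, a plain load appends a file.
def load_data(table_files, filepath, overwrite=False):
    """Return the table's file list after loading `filepath`."""
    if overwrite:
        return [filepath]            # OVERWRITE: existing data is removed first
    return table_files + [filepath]  # no OVERWRITE: the new file is appended

files = []
files = load_data(files, "part-0001.txt")                    # append
files = load_data(files, "part-0002.txt")                    # append
files = load_data(files, "full-reload.txt", overwrite=True)  # replace everything
print(files)  # ['full-reload.txt']
```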
Write query results to a directory:
INSERT OVERWRITE [LOCAL] DIRECTORY directory1
[ROW FORMAT row_format] [STORED AS file_format]
SELECT ... FROM ...
Example:
INSERT OVERWRITE LOCAL DIRECTORY '/home/hadoop/tmp/hivetmp'
ROW FORMAT DELIMITED FIELDS TERMINATED BY "\t"
select * from emp;
Multi-insert (scan the source once, write several outputs):
FROM from_statement
INSERT OVERWRITE [LOCAL] DIRECTORY directory1 select_statement1
[INSERT OVERWRITE [LOCAL] DIRECTORY directory2 select_statement2] ...
Example:
from emp
INSERT OVERWRITE LOCAL DIRECTORY '/home/hadoop/tmp/hivetmp1'
ROW FORMAT DELIMITED FIELDS TERMINATED BY "\t"
select empno, ename
INSERT OVERWRITE LOCAL DIRECTORY '/home/hadoop/tmp/hivetmp2'
ROW FORMAT DELIMITED FIELDS TERMINATED BY "\t"
select ename;
Partitioned tables:
Static partitions:
CREATE TABLE ruoze_order_partition (
order_number string,
event_time string
)
PARTITIONED BY (event_month string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY "\t";
LOAD DATA LOCAL INPATH "/home/hadoop/data/order_created.txt"
OVERWRITE INTO TABLE ruoze_order_partition
PARTITION (event_month='2014-05');
If you add a partition directory by hand on HDFS, e.g. event_month='2014-06' (with the correct directory layout), the metastore has no record of it and queries will not see the data. Two fixes:
1. Add the partition with SQL: ALTER TABLE table_name ADD PARTITION (...)
2. Repair the table: MSCK REPAIR TABLE table_name;
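Conceptually, MSCK REPAIR compares the partition directories on HDFS against the partitions the metastore knows about, and registers the missing ones. A minimal sketch of that diff (directory and partition names here are illustrative):

```python
# Sketch of the MSCK REPAIR idea: find partitions that exist as HDFS
# directories (key=value naming) but are absent from the metastore.
def missing_partitions(hdfs_dirs, metastore_partitions):
    """Partition values present on HDFS but missing from the metastore."""
    on_hdfs = {d.split("=", 1)[1] for d in hdfs_dirs if "=" in d}
    return sorted(on_hdfs - set(metastore_partitions))

hdfs_dirs = ["event_month=2014-05", "event_month=2014-06"]  # directories on HDFS
metastore = ["2014-05"]                                     # partitions in metastore
print(missing_partitions(hdfs_dirs, metastore))  # ['2014-06'] -> would be added
```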
Dynamic partitions:
Goal: group the emp table's data by department and load each group into the matching partition of ruoze_emp_partition.
One way:
insert into table ruoze_emp_partition partition(deptno=10)
select empno,ename ,job ,mgr ,hiredate ,salary ,comm from emp where deptno=10;
insert into table ruoze_emp_partition partition(deptno=20)
select empno,ename ,job ,mgr ,hiredate ,salary ,comm from emp where deptno=20;
insert into table ruoze_emp_partition partition(deptno=30)
select empno,ename ,job ,mgr ,hiredate ,salary ,comm from emp where deptno=30;
But this doesn't scale when there are many partitions or a lot of data, hence dynamic partitioning:
CREATE TABLE ruoze_emp_dynamic_partition (
empno int,
ename string,
job string,
mgr int,
hiredate string,
salary double,
comm double
)
PARTITIONED BY (deptno int)
ROW FORMAT DELIMITED FIELDS TERMINATED BY "\t";
insert into table ruoze_emp_dynamic_partition partition(deptno)
select empno,ename ,job ,mgr ,hiredate ,salary ,comm, deptno from emp;
deptno must be the last column in the SELECT list.
If the insert fails with an error saying dynamic partitioning requires at least one static partition column, set:
set hive.exec.dynamic.partition.mode=nonstrict;
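The reason deptno must come last: Hive takes the dynamic partition value from the trailing column(s) of each SELECT row. A small simulation of that routing (row data is illustrative):

```python
# Sketch of dynamic-partition routing: the LAST column of each row decides
# which partition the remaining columns land in.
from collections import defaultdict

def route_rows(rows):
    """Group rows into partitions keyed by the last column's value."""
    partitions = defaultdict(list)
    for row in rows:
        *data, deptno = row          # last column = dynamic partition value
        partitions[deptno].append(tuple(data))
    return dict(partitions)

rows = [(7369, "SMITH", 20), (7499, "ALLEN", 30), (7566, "JONES", 20)]
print(route_rows(rows))
# {20: [(7369, 'SMITH'), (7566, 'JONES')], 30: [(7499, 'ALLEN')]}
```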
Notes
When inserting data, make sure the columns line up (Hive matches SELECT columns to table columns by position, not by name).
MANAGED_TABLE vs EXTERNAL
MANAGED_TABLE: DROP deletes both the HDFS data and the metadata
EXTERNAL: DROP deletes only the metadata
So external tables are usually the safer choice.
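The difference can be modeled in a few lines (a toy model, not Hive; paths and table names are illustrative):

```python
# Toy model of DROP TABLE: the metastore entry always goes; the HDFS data
# is deleted only for a managed table, never for an external one.
def drop_table(name, external, hdfs, metastore):
    metastore.discard(name)                            # metadata always removed
    if not external:                                   # MANAGED_TABLE only:
        hdfs.discard("/user/hive/warehouse/" + name)   # data removed as well

hdfs = {"/user/hive/warehouse/emp", "/user/hive/warehouse/emp_ext"}
metastore = {"emp", "emp_ext"}
drop_table("emp", external=False, hdfs=hdfs, metastore=metastore)
drop_table("emp_ext", external=True, hdfs=hdfs, metastore=metastore)
print(hdfs)       # {'/user/hive/warehouse/emp_ext'}  (external data survives)
print(metastore)  # set()
```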
Another way to load data
INSERT OVERWRITE TABLE ruoze_emp2 select * from emp;
or
CREATE TABLE emp1 as select * from emp [where 1=0];
(With the optional where 1=0, only the schema is copied and emp1 is created empty.)
Running queries without the interactive Hive client
hive -e "select * from emp limit 5"
This combines well with shell scripts.
hive --help shows the other options.
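One way to drive hive -e from a script is to build the argv list and hand it to subprocess. This is a sketch: the query is illustrative, and actually running the command requires a working Hive installation, so the call itself is left commented out.

```python
# Sketch: wrap "hive -e <sql>" for use from a script.
import subprocess

def hive_e(sql):
    """argv for a non-interactive Hive query, suitable for subprocess.run."""
    return ["hive", "-e", sql]

cmd = hive_e("select * from emp limit 5")
print(cmd)  # ['hive', '-e', 'select * from emp limit 5']
# subprocess.run(cmd, check=True)  # uncomment on a machine with Hive installed
```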
Fixing garbled characters (metastore encoding)
Changing the MySQL setting alone does not affect tables that already exist; you also need to convert the existing tables' encoding:
alter database ruozedata_basic02 character set latin1;
use ruozedata_basic02;
alter table PARTITIONS convert to character set latin1;
alter table PARTITION_KEYS convert to character set latin1;
EXPORT/IMPORT
Export:
EXPORT TABLE tablename [PARTITION (part_column="value"[, ...])]
TO 'export_target_path' [ FOR replication('eventid') ]
Here export_target_path is a directory on HDFS; it can be written as /user/hive/warehouse/.. or as hdfs://192.168.137.201:9000/user/hive/warehouse/..
Import:
IMPORT [[EXTERNAL] TABLE new_or_original_tablename [PARTITION (part_column="value"[, ...])]]
FROM 'source_path'
[LOCATION 'import_target_path']
Here source_path is the export_target_path used above.
hiveserver2
In the bin directory under the Hive root:
Start hiveserver2: ./hiveserver2
Start beeline: ./beeline
Connect: !connect jdbc:hive2://localhost:10000 hadoop
What follows the port should be the username and password; the password seems to be arbitrary (or can be omitted).
The beeline -u way
Be careful not to connect to Spark's copy of beeline by mistake (environment variables may shadow Hive's):
./beeline -u jdbc:hive2://localhost:10000/default -n hadoop
hadoop here should be the OS username you are running as, and no password is needed.
The JDBC way
Follow the example on the official wiki at https://cwiki.apache.org/confluence/display/Hive/HiveClient#HiveClient-JDBC, but note: driverName should be "org.apache.hive.jdbc.HiveDriver", and DriverManager.getConnection should be called as ("jdbc:hive2://localhost:10000/default", "", ""). The official page is wrong in these two places.