DDL
Very similar to standard SQL statements.
Create a database:
CREATE (DATABASE|SCHEMA) [IF NOT EXISTS] database_name
[COMMENT database_comment]
[LOCATION hdfs_path]
[WITH DBPROPERTIES (property_name=property_value, ...)];
Create a table:
CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name -- (Note: TEMPORARY available in Hive 0.14.0 and later)
[(col_name data_type [COMMENT col_comment], ... [constraint_specification])]
[COMMENT table_comment]
[PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)]
[CLUSTERED BY (col_name, col_name, ...) [SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS]
[SKEWED BY (col_name, col_name, ...) -- (Note: Available in Hive 0.10.0 and later)
ON ((col_value, col_value, ...), (col_value, col_value, ...), ...)
[STORED AS DIRECTORIES]]
[
[ROW FORMAT row_format]
[STORED AS file_format]
| STORED BY 'storage.handler.class.name' [WITH SERDEPROPERTIES (...)] -- (Note: Available in Hive 0.6.0 and later)
]
[LOCATION hdfs_path]
[TBLPROPERTIES (property_name=property_value, ...)] -- (Note: Available in Hive 0.6.0 and later)
[AS select_statement]; -- (Note: Available in Hive 0.5.0 and later; not supported for external tables)
DML
Load data:
LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename [PARTITION (partcol1=val1, partcol2=val2 ...)]
LOCAL: filepath is on the local filesystem
without LOCAL: filepath is on HDFS
OVERWRITE: replace the existing data in the table/partition
without OVERWRITE: append to the existing data
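The OVERWRITE vs append behavior can be pictured with a toy model (this is a sketch of the semantics, not Hive itself; a table is reduced to its list of data files):

```python
# Toy model of LOAD DATA ... [OVERWRITE] INTO TABLE semantics:
# OVERWRITE replaces the table's existing files, a plain load appends a file.
def load_data(table_files, filepath, overwrite=False):
    """Return the table's file list after loading `filepath`."""
    if overwrite:
        return [filepath]            # OVERWRITE: existing data is removed first
    return table_files + [filepath]  # no OVERWRITE: the new file is appended

files = []
files = load_data(files, "part-0001.txt")                    # append
files = load_data(files, "part-0002.txt")                    # append
files = load_data(files, "full-reload.txt", overwrite=True)  # replace everything
print(files)  # ['full-reload.txt']
```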
Write query results to a directory:
INSERT OVERWRITE [LOCAL] DIRECTORY directory1
[ROW FORMAT row_format] [STORED AS file_format]
SELECT ... FROM ...
Example:
INSERT OVERWRITE LOCAL DIRECTORY '/home/hadoop/tmp/hivetmp'
ROW FORMAT DELIMITED FIELDS TERMINATED BY "\t"
select * from emp;
Multi-insert (scan the source once, write several outputs):
FROM from_statement
INSERT OVERWRITE [LOCAL] DIRECTORY directory1 select_statement1
[INSERT OVERWRITE [LOCAL] DIRECTORY directory2 select_statement2] ...
Example:
from emp
INSERT OVERWRITE LOCAL DIRECTORY '/home/hadoop/tmp/hivetmp1'
ROW FORMAT DELIMITED FIELDS TERMINATED BY "\t"
select empno, ename
INSERT OVERWRITE LOCAL DIRECTORY '/home/hadoop/tmp/hivetmp2'
ROW FORMAT DELIMITED FIELDS TERMINATED BY "\t"
select ename;
Partitioned tables:
Static partitions:
CREATE TABLE ruoze_order_partition (
order_number string,
event_time string
)
PARTITIONED BY (event_month string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY "\t";
LOAD DATA LOCAL INPATH "/home/hadoop/data/order_created.txt"
OVERWRITE INTO TABLE ruoze_order_partition
PARTITION (event_month='2014-05');
If you add a partition directory by hand on HDFS, e.g. event_month='2014-06' (with the correct directory layout), the metastore has no record of it and queries will not see the data. Two fixes:
1. Add the partition with SQL: ALTER TABLE table_name ADD PARTITION (...)
2. Repair the table: MSCK REPAIR TABLE table_name;
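Conceptually, MSCK REPAIR compares the partition directories on HDFS against the partitions the metastore knows about, and registers the missing ones. A minimal sketch of that diff (directory and partition names here are illustrative):

```python
# Sketch of the MSCK REPAIR idea: find partitions that exist as HDFS
# directories (key=value naming) but are absent from the metastore.
def missing_partitions(hdfs_dirs, metastore_partitions):
    """Partition values present on HDFS but missing from the metastore."""
    on_hdfs = {d.split("=", 1)[1] for d in hdfs_dirs if "=" in d}
    return sorted(on_hdfs - set(metastore_partitions))

hdfs_dirs = ["event_month=2014-05", "event_month=2014-06"]  # directories on HDFS
metastore = ["2014-05"]                                     # partitions in metastore
print(missing_partitions(hdfs_dirs, metastore))  # ['2014-06'] -> would be added
```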
Dynamic partitions:
Goal: group the emp table's data by department and load each group into the matching partition of ruoze_emp_partition.
One way:
insert into table ruoze_emp_partition partition(deptno=10)
select empno,ename ,job ,mgr ,hiredate ,salary ,comm from emp where deptno=10;
insert into table ruoze_emp_partition partition(deptno=20)
select empno,ename ,job ,mgr ,hiredate ,salary ,comm from emp where deptno=20;
insert into table ruoze_emp_partition partition(deptno=30)
select empno,ename ,job ,mgr ,hiredate ,salary ,comm from emp where deptno=30;
But this doesn't scale when there are many partitions or a lot of data, hence dynamic partitioning:
CREATE TABLE ruoze_emp_dynamic_partition (
empno int,
ename string,
job string,
mgr int,
hiredate string,
salary double,
comm double
)
PARTITIONED BY (deptno int)
ROW FORMAT DELIMITED FIELDS TERMINATED BY "\t";
insert into table ruoze_emp_dynamic_partition partition(deptno)
select empno,ename ,job ,mgr ,hiredate ,salary ,comm, deptno from emp;
deptno must be the last column in the SELECT list.
If the insert fails with an error saying dynamic partitioning requires at least one static partition column, set:
set hive.exec.dynamic.partition.mode=nonstrict;
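The reason deptno must come last: Hive takes the dynamic partition value from the trailing column(s) of each SELECT row. A small simulation of that routing (row data is illustrative):

```python
# Sketch of dynamic-partition routing: the LAST column of each row decides
# which partition the remaining columns land in.
from collections import defaultdict

def route_rows(rows):
    """Group rows into partitions keyed by the last column's value."""
    partitions = defaultdict(list)
    for row in rows:
        *data, deptno = row          # last column = dynamic partition value
        partitions[deptno].append(tuple(data))
    return dict(partitions)

rows = [(7369, "SMITH", 20), (7499, "ALLEN", 30), (7566, "JONES", 20)]
print(route_rows(rows))
# {20: [(7369, 'SMITH'), (7566, 'JONES')], 30: [(7499, 'ALLEN')]}
```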
Notes
When inserting data, make sure the columns line up (Hive matches SELECT columns to table columns by position, not by name).
MANAGED_TABLE vs EXTERNAL
MANAGED_TABLE: DROP deletes both the HDFS data and the metadata
EXTERNAL: DROP deletes only the metadata
So external tables are usually the safer choice.
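The difference can be modeled in a few lines (a toy model, not Hive; paths and table names are illustrative):

```python
# Toy model of DROP TABLE: the metastore entry always goes; the HDFS data
# is deleted only for a managed table, never for an external one.
def drop_table(name, external, hdfs, metastore):
    metastore.discard(name)                            # metadata always removed
    if not external:                                   # MANAGED_TABLE only:
        hdfs.discard("/user/hive/warehouse/" + name)   # data removed as well

hdfs = {"/user/hive/warehouse/emp", "/user/hive/warehouse/emp_ext"}
metastore = {"emp", "emp_ext"}
drop_table("emp", external=False, hdfs=hdfs, metastore=metastore)
drop_table("emp_ext", external=True, hdfs=hdfs, metastore=metastore)
print(hdfs)       # {'/user/hive/warehouse/emp_ext'}  (external data survives)
print(metastore)  # set()
```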
Another way to load data
INSERT OVERWRITE TABLE ruoze_emp2 select * from emp;
or
CREATE TABLE emp1 as select * from emp [where 1=0];
(With the optional where 1=0, only the schema is copied and emp1 is created empty.)
Running queries without the interactive Hive client
hive -e "select * from emp limit 5"
This combines well with shell scripts.
hive --help shows the other options.
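One way to drive hive -e from a script is to build the argv list and hand it to subprocess. This is a sketch: the query is illustrative, and actually running the command requires a working Hive installation, so the call itself is left commented out.

```python
# Sketch: wrap "hive -e <sql>" for use from a script.
import subprocess

def hive_e(sql):
    """argv for a non-interactive Hive query, suitable for subprocess.run."""
    return ["hive", "-e", sql]

cmd = hive_e("select * from emp limit 5")
print(cmd)  # ['hive', '-e', 'select * from emp limit 5']
# subprocess.run(cmd, check=True)  # uncomment on a machine with Hive installed
```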
Fixing garbled characters (metastore encoding)
Changing the MySQL setting alone does not affect tables that already exist; you also need to convert the existing tables' encoding:
alter database ruozedata_basic02 character set latin1;
use ruozedata_basic02;
alter table PARTITIONS convert to character set latin1;
alter table PARTITION_KEYS convert to character set latin1;
EXPORT/IMPORT
Export:
EXPORT TABLE tablename [PARTITION (part_column="value"[, ...])]
TO 'export_target_path' [ FOR replication('eventid') ]
Here export_target_path is a directory on HDFS; it can be written as /user/hive/warehouse/.. or as hdfs://192.168.137.201:9000/user/hive/warehouse/..
Import:
IMPORT [[EXTERNAL] TABLE new_or_original_tablename [PARTITION (part_column="value"[, ...])]]
FROM 'source_path'
[LOCATION 'import_target_path']
Here source_path is the export_target_path used above.
hiveserver2
In the bin directory under the Hive root:
Start hiveserver2: ./hiveserver2
Start beeline: ./beeline
Connect: !connect jdbc:hive2://localhost:10000 hadoop
What follows the port should be the username and password; the password seems to be arbitrary (or can be omitted).
The beeline -u way
Be careful not to connect to Spark's copy of beeline by mistake (environment variables may shadow Hive's):
./beeline -u jdbc:hive2://localhost:10000/default -n hadoop
hadoop here should be the OS username you are running as, and no password is needed.
The JDBC way
Follow the example on the official wiki at https://cwiki.apache.org/confluence/display/Hive/HiveClient#HiveClient-JDBC, but note: driverName should be "org.apache.hive.jdbc.HiveDriver", and DriverManager.getConnection should be called as ("jdbc:hive2://localhost:10000/default", "", ""). The official page is wrong in these two places.