Hive总结 --hive表的创建，删除和修改

一、创建表

官网创建HIVE表定义如下：

CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name    -- (Note: TEMPORARY available in Hive 0.14.0 and later)
  [(col_name data_type [column_constraint_specification] [COMMENT col_comment], ... [constraint_specification])]
  [COMMENT table_comment]
  [PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)]
  [CLUSTERED BY (col_name, col_name, ...) [SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS]
  [SKEWED BY (col_name, col_name, ...)                  -- (Note: Available in Hive 0.10.0 and later)]
     ON ((col_value, col_value, ...), (col_value, col_value, ...), ...)
     [STORED AS DIRECTORIES]
  [
   [ROW FORMAT row_format] 
   [STORED AS file_format]
     | STORED BY 'storage.handler.class.name' [WITH SERDEPROPERTIES (...)]  -- (Note: Available in Hive 0.6.0 and later)
  ]
  [LOCATION hdfs_path]
  [TBLPROPERTIES (property_name=property_value, ...)]   -- (Note: Available in Hive 0.6.0 and later)
  [AS select_statement];   -- (Note: Available in Hive 0.5.0 and later; not supported for external tables)`

默认创建内部表，内部表和外部表的区别在于：hive能删除内部表的数据，但是不能删除外部表的数据；
hive支持的一些特殊数据类型如下：

array_type
  : ARRAY < data_type >
 
map_type
  : MAP < primitive_type, data_type >
 
struct_type
  : STRUCT < col_name : data_type [COMMENT col_comment], ...>

对于行的格式定义如下：

row_format
  : DELIMITED [FIELDS TERMINATED BY char [ESCAPED BY char]] [COLLECTION ITEMS TERMINATED BY char]
        [MAP KEYS TERMINATED BY char] [LINES TERMINATED BY char]
        [NULL DEFINED AS char]   -- (Note: Available in Hive 0.13 and later)
  | SERDE serde_name [WITH SERDEPROPERTIES (property_name=property_value, property_name=property_value, ...)]

hive创建表的时候默认的文件格式是textfile，无压缩类型，可自行定义，文件格式如下：

file_format:
  : SEQUENCEFILE
  | TEXTFILE    -- (Default, depending on hive.default.fileformat configuration)
  | RCFILE      -- (Note: Available in Hive 0.6.0 and later)
  | ORC         -- (Note: Available in Hive 0.11.0 and later)
  | PARQUET     -- (Note: Available in Hive 0.13.0 and later)
  | AVRO        -- (Note: Available in Hive 0.14.0 and later)
  | JSONFILE    -- (Note: Available in Hive 4.0.0 and later)
  | INPUTFORMAT input_format_classname OUTPUTFORMAT output_format_classname

创建hive表时，还可手动设置数据存储路径。

location hdfs_path

创建表的实例如下：

create external table if not exists student(
  name string,
  grade string,
  teacher array<string>,
  scores map<string, int>,
  address struct<street:string, city:string>
)
-- 分区表里定义的字段不能是表中已存在的数据
partitioned by(day string,hour string)
clustered by(id) into 4 buckets
row format delimited fields terminated by ','
collection items terminated by '?'
map keys terminated by ':'
--默认行分隔符就是'\n'
lines terminated by '\n'
--默认存储格式为textfile
stored as orc 
location '/student';

以下官网中的实例也可进行参考：

CREATE TABLE new_key_value_store
 ROW FORMAT SERDE "org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe"
 STORED AS RCFile
 AS
SELECT (key % 1024) new_key, concat(key, value) key_value_pair
FROM key_value_store
SORT BY new_key, key_value_pair;

CREATE TABLE page_view(viewTime INT, userid BIGINT,
     page_url STRING, referrer_url STRING,
     ip STRING COMMENT 'IP Address of the User')
 COMMENT 'This is the page view table'
 PARTITIONED BY(dt STRING, country STRING)
 CLUSTERED BY(userid) SORTED BY(viewTime) INTO 32 BUCKETS
 ROW FORMAT DELIMITED
   FIELDS TERMINATED BY '\001'
   COLLECTION ITEMS TERMINATED BY '\002'
   MAP KEYS TERMINATED BY '\003'
 STORED AS SEQUENCEFILE;

二、删除表

使用drop删除

DROP TABLE [IF EXISTS] table_name [PURGE];     -- (Note: PURGE available in Hive 0.14.0 and later)

DROP TABLE删除该表的元数据和数据。如果配置了Trash(并且没有指定PURGE)，数据实际上会移动到.Trash/Current目录。元数据完全丢失。
需要注意的是：在hive 4.X以后，设置 external.table.purge=true,也可以删除外部表数据了。

When dropping an EXTERNAL table, data in the table will NOT be deleted from the file system. 
Starting Hive 4.0.0 ( HIVE-19981 - Managed tables converted to external tables by the 
HiveStrictManagedMigration utility should be set to delete data when the table is dropped
 RESOLVED ) setting table property external.table.purge=true, will also delete the data.

使用truncate清空表数据。

TRUNCATE [TABLE] table_name [PARTITION partition_spec];

partition_spec:
  : (partition_column = partition_col_value, partition_column = partition_col_value, ...)

使用该参数删除表数据，不会删除表结构，还可以通过设置分区值，删除指定分区内的数据。

三、修改表

修改表的语法如下：

--修改表名
ALTER TABLE table_name RENAME TO new_table_name;

--修改表特性
ALTER TABLE table_name SET TBLPROPERTIES table_properties;
 table_properties:
  : (property_name = property_value, property_name = property_value, ... )
  
--例如：
ALTER TABLE table_name SET TBLPROPERTIES ('comment' = new_comment);

--修改分桶
ALTER TABLE table_name CLUSTERED BY (col_name, col_name, ...) [SORTED BY (col_name, ...)]
  INTO num_buckets BUCKETS; 

--增加分区 hive0.8+
ALTER TABLE table_name ADD [IF NOT EXISTS] PARTITION partition_spec 
[LOCATION 'location'][, PARTITION partition_spec [LOCATION 'location'], ...];
 
partition_spec:
  : (partition_column = partition_col_value, partition_column = partition_col_value, ...)
  
--Example:
ALTER TABLE page_view ADD PARTITION (dt='2008-08-08', country='us') location '/part080808'
                          PARTITION (dt='2008-08-09', country='us') location '/part080809';
                          
--修改分区
ALTER TABLE table_name PARTITION partition_spec RENAME TO PARTITION partition_spec;

--可以使用如下语句进行添加，删除，更新分区，一般用于使用hdfs命令修改数据，没同步元数据情况
MSCK [REPAIR] TABLE table_name [ADD/DROP/SYNC PARTITIONS];

--删除分区
ALTER TABLE table_name DROP [IF EXISTS] PARTITION partition_spec[, PARTITION partition_spec, ...]
  [IGNORE PROTECTION] [PURGE];           
 -- (Note: PURGE available in Hive 1.2.0 and later, IGNORE PROTECTION not available 2.0.0 and later)
 
 --Example
 ALTER TABLE page_view DROP PARTITION (dt='2008-08-08', country='us');

--修改列
ALTER TABLE table_name [PARTITION partition_spec] CHANGE [COLUMN] col_old_name col_new_name column_type
  [COMMENT col_comment] [FIRST|AFTER column_name] [CASCADE|RESTRICT];
Example:
CREATE TABLE test_change (a int, b int, c int);
 
// First change column a's name to a1.
ALTER TABLE test_change CHANGE a a1 INT;
 
// Next change column a1's name to a2, its data type to string, and put it after column b.
ALTER TABLE test_change CHANGE a1 a2 STRING AFTER b;
// The new table's structure is:  b int, a2 string, c int.
  
// Then change column c's name to c1, and put it as the first column.
ALTER TABLE test_change CHANGE c c1 INT FIRST;
// The new table's structure is:  c1 int, b int, a2 string.
  
// Add a comment to column a1
ALTER TABLE test_change CHANGE a1 a1 INT COMMENT 'this is column a1';

--增加删除列
ALTER TABLE table_name 
  [PARTITION partition_spec]                 -- (Note: Hive 0.14.0 and later)
  ADD|REPLACE COLUMNS (col_name data_type [COMMENT col_comment], ...)
  [CASCADE|RESTRICT]                         -- (Note: Hive 1.1.0 and later)
--ADD COLUMNS 允许将新列添加到现有列的末尾，但在分区列之前。这也支持Avro备份表，Hive 0.14和更高版本

参考文档：hive官方文档–DDL

四、查询数据

查询语句的基本规则如下：

[WITH CommonTableExpression (, CommonTableExpression)*]    (Note: Only available starting with Hive 0.13.0)
SELECT [ALL | DISTINCT] select_expr, select_expr, ...
  FROM table_reference
  [WHERE where_condition]
  [GROUP BY col_list]
  [ORDER BY col_list]
  [CLUSTER BY col_list
    | [DISTRIBUTE BY col_list] [SORT BY col_list]
  ]
 [LIMIT [offset,] rows]

having语句一般和group by 语句一起使用，和where的区别主要是having可以使用聚合函数作为条件,如下所示：

SELECT col1 FROM t1 GROUP BY col1 HAVING SUM(col2) > 10;

limit语句可以使用limit n展示n条数据，也可以使用 limit start,n 表示从start开始，到start+n条数据，结果默认从第0行开始。

SELECT * FROM customers ORDER BY create_date LIMIT 5;
SELECT * FROM customers ORDER BY create_date LIMIT 2,5; --展示第3到7条

Order, Sort, Cluster, and Distribute By的区别

order by：是对查询出的结果进行全部排序，在严格模式下（hive.mapred.mode=strict），必须和limit一起使用，使用order by的字句最终肯定只有一个reducer，默认是升序（asc）；
在Hive 2.1.0及以后版本中，支持在“order by”子句中为null值排序。ASC默认是null在最前，而DESC默认排在最后。
sort by：sort by在reducer之前就会排序，默认的排序方式是按照数字大小和字典编撰顺序进行排序，在hive3.0.0及以后，在子查询和view中，使用没有limit的sort by会被优化器移除，要想使用可通过设置参数hive.remove.orderby.in.subquery=false进行使用。

In Hive 3.0.0 and later, sort by without limit in subqueries and views will be removed by the 
optimizer. To disable it, set hive.remove.orderby.in.subquery to false.

order by和sort by的区别：Hive支持SORT BY，按照每个reducer对数据进行排序。order by和sort by之间的区别是，前者保证输出中的总顺序，而后者只保证reducer中的行顺序。如果有多个reducer，sort by排序可能会给出部分有序的最终结果。

distribute by:distribute by 的分区规则是根据分区字段的 hash 码与 reduce 的个数进行模除后，
余数相同的分到一个区，Hive 要求 DISTRIBUTE BY 语句要写在 SORT BY 语句之前。

cluster by:当 distribute by 和 sorts by 字段相同时，可以使用 cluster by 方式。cluster by 除了具有 distribute by 的功能外还兼具 sort by 的功能。但是排序只能是升序排序，不能指定排序规则为 ASC 或者 DESC。

参考文档：hive官方文档–Sort by

============================= 未完，待续 ===================================