Hive Operations Summary: DDL

Copyright notice: this is an original article by the blogger, released under the CC 4.0 BY-SA license. Please include the original source link and this notice when reposting.
Original link: https://blog.csdn.net/weixin_42411818/article/details/100132642

No long preamble; these are working notes. You don't need to know what the original data looks like, since the operations are the same in every scenario; just generalize from the examples. Where an operation has no annotation, copy it, run it, and you will see what it does. It is all quite simple if you have a background in MySQL or another database. One caution: every piece of leading whitespace you see below is made of spaces, not tabs, because tab characters are not accepted in the CLI or Beeline and will cause errors.

基本操作

CREATE DATABASE

Syntax:

CREATE (DATABASE|SCHEMA) [IF NOT EXISTS] database_name
    [COMMENT database_comment]
    [LOCATION hdfs_path]
    [WITH DBPROPERTIES (property_name=property_value, ...)];

Notes:
-- DATABASE|SCHEMA : interchangeable; both mean database
-- [IF NOT EXISTS] : optional; without it, creating a database that already exists raises an error
-- [COMMENT] : optional; a description for the new DB, which must be enclosed in single quotes
-- [LOCATION] : optional; by default a DB is stored under hive.metastore.warehouse.dir configured in hive-site.xml (i.e. /user/hive/warehouse by default)
--   set this option to give the DB a custom storage path; the external tables below use the same idea

-- [WITH DBPROPERTIES] : optional; key/value properties to attach to the new DB

Examples:

CREATE DATABASE hive_learning;
CREATE DATABASE IF NOT EXISTS hive_learning;

-- Note: the leading indentation must be typed as spaces, not tabs, or the statement will fail

CREATE DATABASE IF NOT EXISTS hive_test
    COMMENT 'just a test db'
    LOCATION 'hdfs://master:8020/user/hadoop-jrq/hive-course/dbs'
    WITH DBPROPERTIES ('Created by' = 'jrq', 'Created on' = '2019-08-29');

SHOW DATABASES

List the existing databases:

SHOW DATABASES [LIKE identifier_with_wildcards];
SHOW DATABASES; -- list all databases
SHOW DATABASES LIKE 'hive*';  -- list databases whose names start with hive
show tables;  -- list the tables in the current database

DESCRIBE DATABASE

DESCRIBE DATABASE [EXTENDED] database_name;

-- [EXTENDED] : also display the DB's properties (DBPROPERTIES)

DESCRIBE DATABASE hive_test;
DESCRIBE DATABASE EXTENDED hive_test;

ALTER DATABASE

Modify a database's properties:

ALTER (DATABASE|SCHEMA) database_name SET DBPROPERTIES(property_name = property_value, ....);

-- change the database's Created by property to jeffy
ALTER DATABASE hive_test SET DBPROPERTIES('Created by' = 'jeffy');

DESCRIBE DATABASE EXTENDED hive_test;

USE DATABASE

Switch to a given database:

USE database_name;
-- the following two statements are the same (keywords are case-insensitive; note there is no DATABASE keyword here)
USE hive_test;
use hive_test;
-- a session starts in the default database; until you switch, new tables are created there
CREATE TABLE pokes(foo STRING, bar STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
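You can check what landed in default without switching databases (SHOW TABLES IN <db> takes a database name):

SHOW TABLES IN default;  -- the pokes table created above shows up here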

DROP DATABASE

Syntax:

DROP (DATABASE|SCHEMA) [IF EXISTS] database_name [RESTRICT|CASCADE];

-- [RESTRICT|CASCADE] : if the DB still contains tables, RESTRICT mode refuses to drop the DB;
--   in CASCADE mode, all tables under the DB are dropped first, and then the DB itself is dropped
--   Hive defaults to RESTRICT
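For example, to drop the hive_test database created earlier together with any tables it still contains:

DROP DATABASE IF EXISTS hive_test CASCADE;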

Advanced operations

Table row format (ROW FORMAT)

CREATE TABLE IF NOT EXISTS sensor_row_format(
  id STRING, 
  event_id STRING, 
  event_type STRING, 
  part_name STRING, 
  part_number STRING, 
  version STRING, 
  payload STRING) 
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

-- load the specified data into the table; the file's fields are separated by commas

LOAD DATA INPATH 'hdfs://master:8020/user/hadoop-jrq/hbase-course/20190829/omneo.csv' OVERWRITE INTO TABLE sensor_row_format;

[ROW FORMAT] : the format of each row in the table; it takes one of two forms, DELIMITED or SERDE.
ROW FORMAT DELIMITED --> each row is split into fields by a delimiter;
FIELDS TERMINATED BY ',' --> each field within a row is separated by a comma. The two clauses are used together.

ROW FORMAT SERDE --> a custom per-row format:
'org.apache.hadoop.hive.serde2.RegexSerDe' -- parse each row of data with a regular expression
'org.apache.hive.hcatalog.data.JsonSerDe' -- the data files are stored as JSON
'org.apache.hadoop.hive.serde2.OpenCSVSerde' -- the data files are stored as CSV

Regex

CREATE TABLE IF NOT EXISTS tomcat_access_log(
  ip STRING,
  userid STRING,
  username STRING,
  time STRING,
  url STRING,
  status STRING,
  file_size STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe' 
WITH SERDEPROPERTIES(
     'input.regex' = '([^ ]*) ([^ ]*) ([^ ]*) (-|\\[.*\\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)'
);

LOAD DATA LOCAL INPATH '/home/hadoop-jrq/hbase-course/apache-tomcat-9.0.8/logs/localhost_access_log.2019-08-07.txt' OVERWRITE INTO TABLE tomcat_access_log;

JSON

CREATE TABLE json_table(a string, b bigint) ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe';
LOAD DATA LOCAL INPATH '/home/hadoop-jrq/hive-course/json_table.json' OVERWRITE INTO TABLE json_table;
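JsonSerDe expects one JSON object per line, with keys matching the column names, so a line of json_table.json would look something like this (hypothetical sample):

{"a":"hello","b":1}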

CSV

CREATE TABLE my_table(a string, b string, ...)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
   "separatorChar" = "\t",  -- 表示csv的分隔符
   "quoteChar"     = "'", -- 表示将不需要切割的字段值
   "escapeChar"    = "\\"  -- 如果在quote标记的字段值中还含有quote,则用escape来避免(读写参数)
) 

If these parameters are unclear, see my other post: https://blog.csdn.net/weixin_42411818/article/details/98734464
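As a quick usage sketch (the table name csv_demo and the file path are hypothetical; also note that OpenCSVSerde reads every column back as STRING, whatever type is declared):

CREATE TABLE csv_demo(a string, b string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES ("separatorChar" = ",", "quoteChar" = "\"");
LOAD DATA LOCAL INPATH '/home/hadoop-jrq/hive-course/csv_demo.csv' OVERWRITE INTO TABLE csv_demo;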

Querying table structure

SHOW TABLES --> list the tables in a DB
-- the following two commands are equivalent

DESC tomcat_access_log;
DESCRIBE tomcat_access_log;

DESCRIBE EXTENDED sensor_row_format;   -- show the table's columns plus extended metadata
DESCRIBE FORMATTED sensor_row_format;  -- same, but formatted for readability

SHOW CREATE TABLE sensor_row_format;  -- show the statement that created this table
SHOW CREATE TABLE tomcat_access_log;

Storage Format

Note that the data you load must match the table's declared storage format, and must be split with the specified delimiter.
1. The default storage format is text

CREATE TABLE IF NOT EXISTS sensor_format_text(
  id STRING, 
  event_id STRING, 
  event_type STRING, 
  part_name STRING, 
  part_number STRING, 
  version STRING, 
  payload STRING) 
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;  -- TEXTFILE (plain text) is the default storage format

SHOW CREATE TABLE sensor_format_text;  -- view the CREATE statement

The statement above is equivalent to the one below (everything not specified falls back to a default):

CREATE TABLE `sensor_format_text`(
  `id` string, 
  `event_id` string, 
  `event_type` string, 
  `part_name` string, 
  `part_number` string, 
  `version` string, 
  `payload` string)
ROW FORMAT SERDE 
  'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' 
WITH SERDEPROPERTIES ( 
  'field.delim'=',', 
  'serialization.format'=',') 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.mapred.TextInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  'hdfs://master:8020/user/hive/warehouse/sensor.db/sensor_format_text'
TBLPROPERTIES (
  'transient_lastDdlTime'='1567058652');

2. Hive's query processing
SerDe is short for "Serializer and Deserializer".
Hive uses SerDe (and FileFormat) to read and write table rows.
Read path: HDFS files --> InputFileFormat --> <key, value> --> Deserializer --> Row object
Write path: Row object --> Serializer --> <key, value> --> OutputFileFormat --> HDFS files

3. Other file formats

parquet

CREATE TABLE IF NOT EXISTS sensor_parquet(
  id STRING, 
  event_id STRING, 
  event_type STRING, 
  part_name STRING, 
  part_number STRING, 
  version STRING, 
  payload STRING) 
STORED AS PARQUET;  -- store the data files as Parquet

SHOW CREATE TABLE sensor_parquet;
The result includes:

ROW FORMAT SERDE
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
  STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
  OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'

How to load data:
LOAD DATA INPATH '/user/hadoop-jrq/hive-course/parquet' OVERWRITE INTO TABLE sensor_parquet;
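Note that LOAD DATA only moves files into the table's directory, so the files under /user/hadoop-jrq/hive-course/parquet must already be Parquet. If the source is a text table, one common alternative (a sketch reusing sensor_row_format from earlier) is to convert it through a query:

INSERT OVERWRITE TABLE sensor_parquet
SELECT id, event_id, event_type, part_name, part_number, version, payload
FROM sensor_row_format;  -- Hive writes the rows back out as Parquet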

orc

CREATE TABLE IF NOT EXISTS sensor_orc(
  id STRING, 
  event_id STRING, 
  event_type STRING, 
  part_name STRING, 
  part_number STRING, 
  version STRING, 
  payload STRING) 
STORED AS ORC;

SHOW CREATE TABLE sensor_orc;  -- view the table's CREATE statement
-- the result includes:
ROW FORMAT SERDE
  'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
  STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
  OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
-- load the data
LOAD DATA INPATH "/user/hadoop-jrq/hive-course/orc" OVERWRITE INTO TABLE sensor_orc;   

avro

CREATE TABLE IF NOT EXISTS sensor_avro(
  id STRING, 
  event_id STRING, 
  event_type STRING, 
  part_name STRING, 
  part_number STRING, 
  version STRING, 
  payload STRING) 
STORED AS AVRO;

SHOW CREATE TABLE sensor_avro;

ROW FORMAT SERDE
  'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
  STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
  OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
  
LOAD DATA INPATH "/user/hadoop-jrq/hive-course/avro" OVERWRITE INTO TABLE sensor_avro;  

Sequence File

CREATE TABLE IF NOT EXISTS sensor_sequence(
  id STRING, 
  event_id STRING, 
  event_type STRING, 
  part_name STRING, 
  part_number STRING, 
  version STRING, 
  payload STRING) 
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS SEQUENCEFILE;

-- with SEQUENCEFILE you still define the row SerDe yourself (the ROW FORMAT DELIMITED clause above)
SHOW CREATE TABLE sensor_sequence;

STORED AS INPUTFORMAT
  'org.apache.hadoop.mapred.SequenceFileInputFormat'
  OUTPUTFORMAT
  'org.apache.hadoop.mapred.SequenceFileOutputFormat'
  
LOAD DATA INPATH "/user/hadoop-jrq/hive-course/sequence" OVERWRITE INTO TABLE sensor_sequence;

Table types

Managed (internal) tables

CREATE TABLE IF NOT EXISTS sensor.sensor_managed(
  id STRING, 
  event_id STRING, 
  event_type STRING, 
  part_name STRING, 
  part_number STRING, 
  version STRING, 
  payload STRING) 
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;


LOAD DATA LOCAL INPATH '/home/hadoop-jrq/hive-course/omneo.csv' OVERWRITE INTO TABLE sensor_managed;

TRUNCATE TABLE sensor_managed; -- remove all rows from the table

LOAD DATA INPATH 'hdfs://master:8020/user/hadoop-jrq/hive-course/omneo.csv' OVERWRITE INTO TABLE sensor_managed;  -- when the source is on HDFS, the file is moved (mv) into the table directory, so it disappears from its original path

DROP TABLE sensor_managed; -- dropping the table deletes the data too; that is the defining trait of a managed table

External tables

CREATE EXTERNAL TABLE IF NOT EXISTS sensor_external (  -- the EXTERNAL keyword is what makes this an external table
  id STRING, 
  event_id STRING, 
  event_type STRING, 
  part_name STRING, 
  part_number STRING, 
  version STRING, 
  payload STRING) 
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' 
STORED AS TEXTFILE
LOCATION 'hdfs://master:8020/user/hadoop-jrq/hive-course/omneo';  -- where the external table's data is stored

DROP TABLE sensor_external; -- dropping the table leaves the data intact

The biggest difference between managed and external tables: dropping a managed table deletes its data, while dropping an external table leaves the data in place.
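To check which kind a given table is, DESCRIBE FORMATTED prints a Table Type field:

DESCRIBE FORMATTED sensor_external;  -- look for Table Type: EXTERNAL_TABLE (or MANAGED_TABLE)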

Temporary tables
A temporary table exists only within the current session; when the session closes, the table is dropped. Typically used to stage transient data.

CREATE TEMPORARY EXTERNAL TABLE IF NOT EXISTS sensor (
  id STRING, 
  event_id STRING, 
  event_type STRING, 
  part_name STRING, 
  part_number STRING, 
  version STRING, 
  payload STRING) 
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' 
STORED AS TEXTFILE
LOCATION 'hdfs://master:8020/user/hadoop-jrq/hive-course/omneo';

Views
A view is purely logical; it does not store data.
If a query is very long, wrap it in a view and reuse it directly, e.g. select * from alert_sensor_view;

CREATE VIEW [IF NOT EXISTS] view_name [(column_name [COMMENT column_comment], ....)] AS SELECT ....

CREATE VIEW IF NOT EXISTS alert_sensor_view AS SELECT * FROM jrq.sensor_external WHERE event_type = 'ALERT';

DROP VIEW IF EXISTS alert_sensor_view;  -- drop the view

Data types

numeric data types

TINYINT  1-byte 50Y  -- the Y suffix makes the literal 50 a TINYINT; the suffixes below work the same way
SMALLINT 2-byte 50S
INT      4-byte 50
BIGINT   8-byte 50L
FLOAT    4-byte
DOUBLE   8-byte
DECIMAL  17-byte

CREATE TABLE customer(id BIGINT, age TINYINT);
CREATE TABLE `order` (id BIGINT, price DECIMAL(10, 2));  -- DECIMAL(10, 2) means precision 10, scale 2; order is a reserved word, so it needs backticks

string data types

STRING: a string enclosed in single or double quotes
VARCHAR: a variable-length string
CHAR: a fixed-length string; values shorter than the declared length are padded with spaces

CREATE TABLE customer(id BIGINT, name STRING, sex CHAR(6), role VARCHAR(64))

Date/Time data types

DATE represents a calendar date, valid from 0000-01-01 to 9999-12-31
TIMESTAMP represents a point in time, including year, month, day, hour, minute, and second

CREATE TABLE date_example(id INT, created_at DATE, updated_at TIMESTAMP)

boolean data types

BOOLEAN      true and false

complex data types

STRUCT syntax: STRUCT<col_name : data_type, ...>  -- a struct
MAP    -- a key/value map type
ARRAY  -- an array type


CREATE TABLE employee(
  name STRING, 
  salary FLOAT, 
  subordinates ARRAY<string>, 
  deductions MAP<string,FLOAT>, 
  address STRUCT<street:string,city:string,state:string,zip:int>) 
ROW FORMAT DELIMITED  -- custom delimiters
      FIELDS TERMINATED BY '\t'   -- what separates the fields of a row
      COLLECTION ITEMS TERMINATED BY ','   -- what separates the items of an array or struct
      MAP KEYS TERMINATED BY ':'  -- what separates map keys from values
STORED AS TEXTFILE;


LOAD DATA LOCAL INPATH '/home/hadoop-jrq/hive-course/employee.txt' OVERWRITE INTO TABLE employee;
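For reference, a single line of employee.txt consistent with the delimiters above might look like this (hypothetical sample data; \t stands for a literal tab):

John Doe\t100000.0\tMary,Todd\tFederal Taxes:0.2,State Taxes:0.05\t1 Main St,Chicago,IL,60601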

SELECT subordinates[0] FROM employee;   -- index into the array by position
SELECT deductions['Federal Taxes'] FROM employee;  -- look up a map value directly by key
SELECT size(deductions) FROM employee;  -- the number of entries in the map
SELECT address.state, address.street FROM employee;  -- read fields out of the struct

Partitioned tables

Create a partitioned managed table:

CREATE TABLE IF NOT EXISTS sensor_managed_partition(
  id STRING, 
  event_id STRING, 
  event_type STRING, 
  part_name STRING, 
  part_number STRING, 
  version STRING, 
  payload STRING) 
PARTITIONED BY(year INT, month INT, day INT)  -- partition by year, month, and day
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' 
STORED AS TEXTFILE;


LOAD DATA INPATH 'hdfs://master:8020/user/hadoop-jrq/hive-course/omneo/20190829' OVERWRITE INTO TABLE sensor_managed_partition PARTITION(year=2019, month=201908, day=20190829);

select * from sensor_managed_partition where day=20190829 limit 10;  -- only that day's partition is scanned
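To verify the pruning, EXPLAIN DEPENDENCY lists the tables and partitions a query would actually read:

EXPLAIN DEPENDENCY select * from sensor_managed_partition where day=20190829;  -- input_partitions should list only the day=20190829 partition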

Create a partitioned external table

CREATE EXTERNAL TABLE IF NOT EXISTS sensor_external_partition (
  id STRING, 
  event_id STRING, 
  event_type STRING, 
  part_name STRING, 
  part_number STRING, 
  version STRING, 
  payload STRING)  
PARTITIONED BY(year INT, month INT, day INT) 
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' 
STORED AS TEXTFILE
-- the directory layout must mirror the partition columns, e.g.:
-- year=2019/month=201908/day=20190829
LOCATION 'hdfs://master:8020/user/hadoop-jrq/hive-course/omneo'; 

select * from sensor_external_partition limit 10;  -- returns nothing yet, because no partition has been mounted. A partition only needs to be mounted once; as files are added under it afterwards, the data shows up in Hive automatically

-- mount the data:
-- add a partition with ALTER TABLE, optionally with an explicit LOCATION
ALTER TABLE sensor_external_partition ADD PARTITION (year=2019, month=201908, day=20190829) LOCATION '/user/hadoop-jrq/hive-course/omneo/year=2019/month=201908/day=20190829';

-- this form works too, provided the directory layout matches the partition spec
ALTER TABLE sensor_external_partition ADD PARTITION (year=2019, month=201908, day=20190829);


-- list the partitions of a table
SHOW PARTITIONS sensor_external_partition;

Difference between RENAME PARTITION on external and managed partitioned tables

-- for an external table the partition is renamed, but the path of the data files does not change
ALTER TABLE sensor_external_partition PARTITION (year=2019, month=201908, day=20190829) RENAME TO PARTITION (year=2019, month=201908, day=20190830);
SHOW PARTITIONS sensor_external_partition;


-- for a managed table the data files are moved to the new partition path
ALTER TABLE sensor_managed_partition PARTITION (year=2019, month=201908, day=20190829) RENAME TO PARTITION (year=2019, month=201908, day=20190830);
SHOW PARTITIONS sensor_managed_partition;

EXCHANGE PARTITION between external and managed tables

-- move the partition from sensor_external_partition to sensor_external_partition_like;
-- the partition disappears from sensor_external_partition itself
CREATE TABLE sensor_external_partition_like LIKE sensor_external_partition; -- create a table with the same schema

ALTER TABLE sensor_external_partition_like EXCHANGE PARTITION(year=2019, month=201908, day=20190829) WITH TABLE sensor_external_partition;

-- repair partitions that exist as directories on HDFS but have no matching partition in the metastore;
-- handy when a job wrote the files but failed to mount the partition
MSCK REPAIR TABLE sensor_external_partition;

DROP PARTITION on external and managed tables

ALTER TABLE table_name DROP [IF EXISTS] PARTITION partition_spec[, PARTITION partition_spec, ...] [PURGE]; 
-- dropping an external table's partition does not delete its data
ALTER TABLE sensor_external_partition DROP IF EXISTS PARTITION (year=2019, month=201908, day=20190829);
-- dropping a managed table's partition deletes the data (normally into the HDFS Trash)
ALTER TABLE sensor_managed_partition DROP IF EXISTS PARTITION (year=2019, month=201908, day=20190829) PURGE; 

[PURGE] : with PURGE, a managed table's data is not moved to the HDFS Trash; it is deleted permanently

-- delete a partition's data (managed tables only); the data is removed, but the partition itself remains
TRUNCATE TABLE table_name [PARTITION partition_spec];
TRUNCATE TABLE sensor_managed_partition PARTITION(year=2019, month=201908, day=20190829);

Altering tables and partitions

Renaming a table

ALTER TABLE sensor_managed_partition RENAME TO sensor_managed_partition_new;  -- renaming a managed table also moves its storage directory
ALTER TABLE sensor_managed_partition_new RENAME TO sensor_managed_partition;  -- renaming an external table leaves its directory where it was

Each partition can store its data files in a different format

ALTER TABLE sensor_external_partition SET LOCATION "/user/hadoop-jrq/hive-course/omneo-new";  -- the hdfs:// prefix is added automatically; this changes the table's data path. The existing partitions' data is still visible afterwards, because the partitions themselves keep their old locations

-- the data only disappears once you repoint the same partition at the new (empty) path
ALTER TABLE sensor_external_partition PARTITION (year=2019, month=201908, day=20190829) SET LOCATION "/user/hadoop-jrq/hive-course/omneo-new/year=2019/month=201908/day=20190829";

-- the data for 20190829, however, is stored as PARQUET files, so set that partition's format to match
ALTER TABLE sensor_external_partition PARTITION (year=2019, month=201908, day=20190829)  SET FILEFORMAT PARQUET;
ALTER TABLE sensor_external_partition PARTITION (year=2019, month=201908, day=20190829) SET LOCATION "/user/hadoop-jrq/hive-course/omneo-new/year=2019/month=201908/day=20190829";

-- change the file format of an entire table; don't use this lightly (the existing data files are not converted)
ALTER TABLE sensor_external_partition SET FILEFORMAT PARQUET;

Renaming and modifying columns:

CREATE TABLE test_change (a int comment "this", b int, c int);  -- first create a table to work with
 
-- rename column a to a1
ALTER TABLE test_change CHANGE a a1 INT;
 
-- rename a1 to a2, change its type to STRING, and place it after column b
ALTER TABLE test_change CHANGE a1 a2 STRING AFTER b;

-- add a comment to column a2
ALTER TABLE test_change CHANGE a2 a2 STRING COMMENT 'this is column a1';

-- add columns that do not exist yet
ALTER TABLE test_change ADD COLUMNS (a int, d int);
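To confirm the changes, inspect the resulting schema:

DESCRIBE test_change;  -- after the statements above, the columns are b, a2, c, a, d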
