3. Hive SQL Data Definition Language (DDL)

1. Data Definition Language Overview

1.1 Common development methods

(1) Hive CLI, Beeline CLI
The command-line clients that ship with Hive.
Advantages: no additional installation is required.
Disadvantages: the SQL-writing environment is poor, with no prompts or syntax highlighting, so mistakes are easy to make.

(2) Text editors
Editors such as Sublime, Emacs, and EditPlus.
Most cannot connect to the Hive service as clients, but they provide an SQL editing environment: write the SQL in the editor and copy it into the Hive CLI for execution. Some editors support plug-ins that let them connect directly to the Hive service as clients.

(3) Hive visualization tools
IntelliJ IDEA, DataGrip, DBeaver, SQuirreL SQL Client, etc.
Graphical tools on Windows and macOS that connect to HiveServer2 through JDBC.

Hive visualization tool: IntelliJ IDEA

  • Configure the Hive driver
    In any IDEA project, configure the Hive Driver in the Database tool window before creating the data source.

  • Configure the data source
    Configure the Hive data source and connect it to HiveServer2 (HS2).

  • Use it for development
    Write the code, select the statements to execute, and right-click to run them.

1.2 Overview of DDL

The role of DDL syntax in SQL
Data Definition Language (DDL) is the part of SQL used to create, delete, and modify the structure of objects inside a database. Its core statements are CREATE, ALTER, and DROP. DDL does not operate on the data stored inside tables.

The use of DDL syntax in Hive
Hive SQL syntax is very similar to standard SQL; for DDL they are largely the same.
Based on Hive's design and usage characteristics, the CREATE syntax (especially CREATE TABLE) is the most important part of Hive DDL to learn and master.

Focus: complete syntax tree
HIVE DDL CREATE TABLE

CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name
[(col_name data_type [COMMENT col_comment], ...)]
[COMMENT table_comment]
[PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)]
[CLUSTERED BY (col_name, col_name, ...)[SORTED BY (col_name[ASC|DESC], ...)] INTO num_buckets BUCKETS]
[ROW FORMAT DELIMITED|SERDE serde_name WITH SERDEPROPERTIES (property_name=property_value, ...)]
[STORED AS file_format]
[LOCATION hdfs_path]
[TBLPROPERTIES (property_name=property_value, ...)]

1.3 Detailed Explanation of Hive Data Types

Hive data types refer to the types of the columns in a table.
They fall into two categories: primitive (native) data types and complex data types.
Primitive data types: numeric types, date/time types, string types, and miscellaneous types.
Complex data types: array, map, struct, and union.
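
A minimal sketch of how these types look in a table definition (the table and column names here are made up for illustration):

create table t_type_demo(
    id int,                               -- primitive: numeric
    name string,                          -- primitive: string
    tags array<string>,                   -- complex: array
    scores map<string,int>,               -- complex: map
    address struct<city:string,zip:int>   -- complex: struct
);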

Precautions

  • In Hive SQL, data type names are not case-sensitive
  • In addition to the SQL data types, Java data types such as string are also supported
  • Complex data types usually need to be used together with the delimiter-specification syntax
  • If the defined data type does not match the file contents, Hive attempts an implicit conversion, but success is not guaranteed

Explicit type conversion uses the CAST function, for example CAST('100' AS INT); if the conversion fails, NULL is returned.
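
For example (the literals are arbitrary; run in any Hive session):

select cast('100' as int);    -- returns 100
select cast('hello' as int);  -- conversion fails, returns NULL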

1.4 Hive read and write file mechanism

SerDe

  • SerDe is short for Serializer and Deserializer; it is used for serialization and deserialization
  • Serialization is the process of converting an object into bytes; deserialization converts bytes back into an object
  • Hive uses SerDe (together with FileFormat) to read and write table rows. Note that the "key" part is ignored when reading and is always a constant when writing; the row object is essentially stored in the "value"
Read:
HDFS files -> InputFileFormat -> <key,value> -> Deserializer (deserialization) -> Row object

Write:
Row object -> Serializer (serialization) -> <key,value> -> OutputFileFormat -> HDFS files

SerDe-related syntax
The ROW FORMAT clause covers the syntax for reading/writing files and for the SerDe. It serves two purposes:
1. specifying which SerDe class is used for serialization; 2. specifying the delimiters

[ROW FORMAT DELIMITED|SERDE serde_name WITH SERDEPROPERTIES (property_name=property_value, ...)]

Choose one of DELIMITED or SERDE. With DELIMITED, the default LazySimpleSerDe class is used to process the data; if the data file format is special, use ROW FORMAT SERDE serde_name to specify another SerDe class, and user-defined SerDe classes are also supported.

LazySimpleSerDe delimiter specification
LazySimpleSerDe is Hive's default serialization class. It has four sub-clauses, which specify the delimiters between fields, between collection elements, between map key-value pairs, and between rows. Use them flexibly according to the characteristics of the data when creating a table.

ROW FORMAT DELIMITED
    [FIELDS TERMINATED BY char]            --> delimiter between fields
    [COLLECTION ITEMS TERMINATED BY char]  --> delimiter between collection elements
    [MAP KEYS TERMINATED BY char]          --> delimiter between map key-value pairs
    [LINES TERMINATED BY char]             --> delimiter between rows

Hive default delimiter: the default field delimiter is '\001', an ASCII control character that cannot be typed directly from the keyboard.
In the vim editor, press ctrl+v followed by ctrl+a to input '\001'; it is displayed as ^A.
In some text editors it is displayed as SOH.

Specify the storage path

  • When creating a table, Hive lets you use the LOCATION clause to change where the data is stored on HDFS, which makes loading data at table-creation time more flexible
  • Syntax: LOCATION '<hdfs_location>'
  • For data files that have already been generated, using LOCATION to point at their path is convenient
[LOCATION hdfs_path]

2. Hive SQL table creation basic syntax

2.1 Hive table creation syntax exercise

2.1.1 Use of native data types

The file archer.txt records information about the archer heroes in Honor of Kings, including HP, physical defense, etc. Fields are separated by a tab character (\t). The requirement is to map this file to a table successfully in Hive.

Sample data:
1   马可波罗    5584    200 362 remotely    archer

show databases;

-- switch database
use testdb;

-- create the table
create table t_archer(
    id int comment "ID",
    name string comment "英雄名称",
    hp_max int comment "最大生命",
    mp_max int comment "最大法力",
    attack_max int comment "最高物攻",
    defense_max int comment "最高物防",
    attack_range string comment "攻击范围",
    role_main string comment "主要定位",
    role_assist string comment "次要定位"
) comment "王者荣耀射手信息"
    row format delimited
        fields terminated by "\t";

-- drop the table
drop table t_archer;

2.1.2 Use of complex data types

The file hot_hero_skin_price.txt records skin price information for heroes of the popular mobile game Honor of Kings. The requirement is to create a table in Hive and map this file successfully.

Sample data. Fields: id, name (hero name), win_rate (win rate), skin_price (skins and their prices)
2,鲁班七号,54,木偶奇遇记:288-福禄兄弟:288-兄控梦想:0
3,铠,52,龙域领主:288-曙光守护者:1776

Analysis: the first three fields are primitive data types; the last field is the complex map type.

-- create a table with a complex data type
create table t_hot_hero_skin_price(
    id int,
    name string,
    win_rate int,
    skin_price map<string,int> -- note the complex map type
) row format delimited
fields terminated by ',' -- delimiter between fields
collection items terminated by '-' -- delimiter between collection elements
map keys terminated by ':'; -- delimiter between map key-value pairs

2.1.3 Default delimiter usage

The file team_ace_player.txt records the most popular ace player of each major team. Fields are separated by \001. The requirement is to map this file to a table successfully in Hive.

Sample data:
1^A成都AG超会玩^A一诺

Analysis: all fields are primitive data types and the field separator is \001, so the ROW FORMAT clause can be omitted when creating the table.

create table t_team_ace_player(
    id int,
    team_name string,
    ace_player_name string
); -- no row format clause is specified, so the default \001 is used as the field delimiter

2.1.4 Specify the data storage path

The file team_ace_player.txt records the most popular ace player of each major team, with \001 as the field separator. The requirement is to upload the file to an arbitrary HDFS path, without moving or copying it, and create a Hive table that successfully maps to it.

create table t_team_ace_player(
    id int,
    team_name string,
    ace_player_name string
) location '/date'; -- use the location keyword to specify this table's storage path on HDFS

2.2 Hive internal and external tables

Internal tables
Internal tables are also called managed tables; they are owned and managed by Hive.
By default, every table created is an internal table. Hive owns the table's structure and its files (it fully manages the table's life cycle), so dropping the table deletes both the data and the metadata.
You can use DESCRIBE FORMATTED tablename to see the table's metadata, including the table type.
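
For example, for the t_archer table created earlier, the Table Type field in the output indicates whether the table is managed (internal) or external:

describe formatted t_archer;
-- look for:  Table Type:  MANAGED_TABLE  (or EXTERNAL_TABLE)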

External tables
The data of an external table is not owned or managed by Hive; Hive only manages the life cycle of the table's metadata.
Use the EXTERNAL keyword to create an external table.
Dropping an external table deletes only the metadata, not the actual data.
Combined with the LOCATION clause to point at the data path, external tables keep the data safer.

-- creating an external table requires the external keyword
-- if the storage path is not specified, the default rule is the same as for internal tables
-- the location keyword can also be used to specify any HDFS path
create external table student_ext(
    num int,
    name string,
    sex string,
    age int,
    dept string
)row format delimited
fields terminated by ','
location '/stu';

How to choose between internal and external tables
When you need Hive to fully manage and control the entire life cycle of the table, use an internal table.
When the data is valuable and must be protected against accidental deletion, use an external table: even if the table is dropped, the files are kept.

The role of the location keyword

  • When creating an external table, you can use LOCATION to specify the storage path. What happens if you don't?
    • If LOCATION is not specified, the external table's data also ends up under the default path /user/hive/warehouse, controlled by the default parameter
  • When creating an internal table, can LOCATION be specified?
    • Yes, internal tables can also use LOCATION to specify their storage location
  • Does that mean Hive table data on HDFS does not have to live under /user/hive/warehouse?
    • Correct. By default the data of both internal and external tables is stored under /user/hive/warehouse, but the LOCATION keyword lets you place it anywhere on HDFS at table-creation time.

3. Partition table

3.1 Partition table generation background

There are six structured data files, each recording the hero information for one of the six roles in Honor of Kings. The requirement is to create a single table t_all_hero that maps and loads all six files at the same time.

Simply create the t_all_hero table and then copy the six files into the table's directory, as in the sketch below.
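
A minimal sketch of that approach (assuming the column list matches the partitioned table shown later; the HDFS path is the default warehouse path and is only illustrative):

create table t_all_hero(
    id int,
    name string,
    hp_max int,
    mp_max int,
    attack_max int,
    defense_max int,
    role_main string,
    role_assist string
) row format delimited
    fields terminated by "\t";

-- then upload the six data files into the table's directory, e.g.
-- hadoop fs -put archer.txt assassin.txt ... /user/hive/warehouse/testdb.db/t_all_hero/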

Now suppose we need to count the archers (role_main = "archer") whose maximum HP (hp_max) is greater than 6000. The SQL statement is as follows:

select count(*) from t_all_hero where role_main="archer" and hp_max>6000;

Existing problems:

  1. The WHERE clause requires a full-table scan to filter the results; Hive has to scan every file, and when there are many data files the scan is slow and unnecessary
  2. For this requirement, only the archer.txt file actually needs to be scanned
  3. Scanning only the relevant file is clearly more efficient than a full-table scan

Partition table concept
When a Hive table holds a large amount of data in many files, Hive supports partitioning the table by specified fields so that queries do not have to scan the entire table. The partition fields can be any fields with distinguishing meaning, such as date, region, or type.

-- note the syntax rules for creating a partitioned table
create table t_all_hero_part(
    id int,
    name string,
    hp_max int,
    mp_max int,
    attack_max int,
    defense_max int,
    role_main string,
    role_assist string
) partitioned by (role string) -- partition field
row format delimited
    fields terminated by "\t";

Note: a partition field cannot be an existing column of the table, because the partition field ultimately appears in the table structure as a virtual column.

3.2 Partition table data loading - static partition

Static partitioning
Static partitioning means the partition value is specified manually by the user when loading the data.
The syntax is as follows:

load data [local] inpath 'filepath' into table tablename partition(partition_field='partition_value', ...);

The LOCAL keyword specifies whether the file to load is on the local file system or on HDFS.

-- load data into the partitioned table using static partitioning
load data local inpath '/root/hivedata/archer.txt' into table t_all_hero_part partition(role='sheshou');
load data local inpath '/root/hivedata/assassin.txt' into table t_all_hero_part partition(role='cike');
...

Essence

  • Partitioning provides a way to separate a Hive table's data into multiple files/directories
  • Different partitions correspond to different folders; data of the same partition is stored in the same folder
  • When filtering a query, Hive only needs to find the folder matching the partition value and scan the files in that folder, avoiding a full-table scan
  • Specifying partitions in a query this way is called partition pruning (see the example after this list)
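
For example, the earlier count of high-HP archers can now filter on the partition column, so only the role='sheshou' folder is scanned (using the partition value loaded above):

select count(*) from t_all_hero_part where role="sheshou" and hp_max > 6000;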

3.3 Multiple Partition Tables

From the partition-related syntax in the CREATE TABLE statement, you can see that Hive supports multiple partition fields:
PARTITIONED BY (partition1 data_type, partition2 data_type, ...)
With multiple partitions there is a hierarchical relationship between them: each partition further subdivides the previous one.
From the HDFS point of view, this simply means creating subfolders inside the partition folders.

-- single-partition table, partitioned by province
create table t_user_province (id int, name string, age int) partitioned by (province string);
-- double-partition table, partitioned by province and city
-- partition fields form a hierarchy, so the order of the partition fields matters
create table t_user_province_city (id int, name string, age int) partitioned by (province string, city string);

-- loading data into the double-partition table with static partitions
load data local inpath '/root/hivedata/user.txt' into table t_user_province_city partition(province='zhejiang', city='hangzhou');
...

-- using the double-partition table: filter by partitions to avoid full-table scans and improve efficiency
select * from t_user_province_city where province='zhejiang' and city='hangzhou';

3.4 Partition table data loading - dynamic partition

Dynamic partitioning
Dynamic partitioning means the partition values are inferred automatically from the query result (by parameter position).
Core syntax: insert + select

To enable dynamic partitioning, two parameters need to be set in the Hive session:

# whether to enable the dynamic partitioning feature
set hive.exec.dynamic.partition=true;

# specify the dynamic partition mode: nonstrict or strict
# strict mode requires at least one partition to be a static partition
set hive.exec.dynamic.partition.mode=nonstrict;

-- perform a dynamic-partition insert from an existing table
insert into table t_all_hero_part_dynamic partition(role) -- note: the partition value is not hard-coded by hand
select tmp.*, tmp.role_main from t_all_hero tmp;

4. Bucket table

Concept
A bucketed table (bucket table) is a table type designed for query optimization.
The data file of a bucketed table is split at the storage level into several smaller files (buckets).
When bucketing, you specify which field to bucket by and how many buckets to split the data into.

Rules
The bucketing rules are as follows: data with the same bucket number will be allocated to the same bucket

Bucket number = hash_function(bucketing_column) mod num_buckets
(bucket number = hash of the bucketing column, modulo the number of buckets)

hash_function depends on the type of bucketing_column:

  1. If it is an int type, hash_function(int) == int
  2. For other types such as bigint, string, or complex data types, hash_function is more involved: it is a number derived from the value, such as its hashcode (see the small example below)
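
A rough illustration of the rule, using Hive's built-in hash() and pmod() functions (this is only an approximation of the idea, not necessarily Hive's exact internal hashing):

-- which of 5 buckets the string value 'New York' would map to
select pmod(hash('New York'), 5);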

Syntax

-- bucketed table creation statement
CREATE [EXTERNAL] TABLE [db_name.]table_name
[(col_name data_type,...)]
CLUSTERED BY (col_name)
INTO N BUCKETS;

CLUSTERED BY (col_name) specifies the field to bucket by;
INTO N BUCKETS specifies how many buckets to split the data into.
Note: the bucketing field must be an existing column of the table.

Creating a bucketed table
We have the cumulative US COVID-19 case data for each county as of 2021-01-28, including confirmed cases and deaths.

Sample data. Fields: count_date (statistics date), county, state, fips (county code), cases (cumulative confirmed cases), deaths (cumulative deaths)
2021-01-28,Jefferson,Alabama,01073,65992,1101
...

The data is divided into 5 buckets according to the state, and the table creation statement is as follows:

CREATE TABLE itheima.t_usa_covid19_bucket(
    count_date string,
    county string,
    state string,
    fips int,
    cases int,
    deaths int
) CLUSTERED BY(state) INTO 5 BUCKETS;

When creating a bucket table, you can also specify the data sorting rules in the bucket:

-- bucket by state into 5 buckets, each bucket sorted by cases (confirmed case count) in descending order
CREATE TABLE itheima.t_usa_covid19_bucket(
    count_date string,
    county string,
    state string,
    fips int,
    cases int,
    deaths int
) CLUSTERED BY(state) 
sorted by (cases desc) INTO 5 BUCKETS;

Bucket table data loading

-- step1: enable bucketing (no longer needs to be set since Hive 2.0)
set hive.enforce.bucketing=true;

-- step2: load the data into an ordinary hive table
drop table if exists t_usa_covid19;
create table t_usa_covid19(
    count_date string,
    county string,
    state string,
    fips int,
    cases int,
    deaths int
) row format delimited fields terminated by ",";

-- upload the source data to HDFS, under the directory of the t_usa_covid19 table
hadoop fs -put <source data> <target path>

-- step3: use insert+select to load the data into the bucketed table
insert into t_usa_covid19_bucket select * from t_usa_covid19;

Benefits

  1. Queries that filter on the bucketing field avoid full-table scans
-- query the data from New York based on the bucketing field state
-- no full-table scan and filter is needed
-- the bucket number is computed by the bucketing rule: hash_function('New York') mod 5
-- only the data in that bucket is scanned
select * from t_usa_covid19_bucket where state="New York";
  2. Joins can be more efficient: bucketing both tables on the join key improves the MR job and reduces the size of the Cartesian product

  3. Efficient sampling of bucketed-table data
    When the data volume is very large and processing all of it is difficult, sampling becomes important: the sampled data can be used to estimate and infer the characteristics of the whole data set (see the sketch after this list).
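
A small sketch of bucket-based sampling with the TABLESAMPLE clause, pulling roughly one fifth of the bucketed table created above:

-- sample 1 bucket out of the 5, bucketed on the state column
select * from t_usa_covid19_bucket tablesample(bucket 1 out of 5 on state);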

5. Transaction table

Limitations

  • BEGIN, COMMIT and ROLLBACK are not yet supported; all operations are auto-committed
  • Only the ORC file format is supported (STORED AS ORC)
  • Transactions are disabled by default and must be enabled through configuration parameters
  • The table must be a bucketed table to use the transaction feature
  • The table property transactional must be true
  • External tables cannot be ACID tables, and reading/writing ACID tables from a non-ACID session is not allowed

Creating and using a Hive transactional table
Create a table with the transaction feature in Hive, and try insert, update, and delete on it.

-- Hive transactional table
-- step1: create an ordinary table
drop table if exists student;
create table student(
    num int,
    name string,
    sex string,
    age int,
    dept string
) row format delimited
fields terminated by ',';

-- step2: load data into the ordinary table
load data local inpath '/root/hivedata/student.txt' into table student;
select * from student;

-- creating and using a transactional table in Hive
-- 1. enable the transaction configuration (set it for the current session, or configure it in hive-site.xml)
set hive.support.concurrency = true; -- whether Hive supports concurrency
set hive.enforce.bucketing = true; -- no longer needed since Hive 2.0; whether to enforce bucketing
set hive.exec.dynamic.partition.mode = nonstrict; -- dynamic partition mode: nonstrict
set hive.txn.manager = org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
set hive.compactor.initiator.on = true; -- whether to run the initiator and cleaner threads on this metastore instance
set hive.compactor.worker.threads = 1; -- how many compactor worker threads to run on this metastore instance

-- 2. create the Hive transactional table
create table trans_student(
    id int,
    name string,
    age int
) clustered by (id) into 2 buckets stored as orc TBLPROPERTIES('transactional'='true');
-- note the key elements of creating a transactional table: enabling the parameters, bucketing, ORC storage format, and the table property

-- 3. perform insert/update/delete operations on the transactional table
insert into trans_student values(1,"allen",18);
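
Since trans_student was created as a transactional table, update and delete statements are also allowed; a small illustration (the values are arbitrary):

update trans_student set age = 20 where id = 1;
delete from trans_student where id = 1;
select * from trans_student;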

6. View

Concept

  • A view in Hive is a virtual table: only its definition is saved, and no data is actually stored
  • Views are usually created from queries over real physical tables, and new views can also be created from existing views
  • A view's schema is frozen when the view is created; if the underlying tables are dropped or altered, the view becomes invalid
  • Views are used to simplify operations; they do not cache records and do not improve query performance
-- Hive view related syntax
-- there is a real base table t_usa_covid19 in Hive
select * from t_usa_covid19;

-- 1. create a view
create view v_usa_covid19 as select count_date, county, state, deaths from t_usa_covid19 limit 5;

-- can a view be created from an existing view? yes
create view v_usa_covid19_from_view as select * from v_usa_covid19 limit 2;

-- 2. show the existing views
show tables;
show views; -- supported since Hive v2.2.0

-- 3. query the view
select * from v_usa_covid19;
-- note: inserting data into a view is not supported

-- 4. show the view definition
show create table v_usa_covid19;

-- 5. drop a view
drop view v_usa_covid19_from_view;

-- 6. change view properties
alter view v_usa_covid19 set TBLPROPERTIES ('comment' = 'This is a view');

-- 7. change the view definition
alter view v_usa_covid19 as select county, deaths from t_usa_covid19 limit 2;

Benefits of Views

  • Expose only specific columns of the real table to users, protecting data privacy
-- restricting data access through a view keeps information from being queried arbitrarily
create table userinfo(firstname string, lastname string, ssn string, password string);
create view safer_user_info as select firstname, lastname from userinfo;

-- a where clause can also restrict access, e.g. an employee view that only exposes employees from a specific department
create table employee(firstname string, lastname string, ssn string, password string, department string);
create view techops_employee as select firstname, lastname, ssn from employee where department='java';
  • Reduce query complexity and optimize query statements
-- using a view to simplify a nested query
from(
    select * from people join cart
        on(cart.people_id = people.id) where firstname = 'join'
) a select a.lastname where a.id = 3;

-- turn the nested subquery into a view
create view shorter_join as
select * from people join cart
    on(cart.people_id = people.id) where firstname = 'join';

-- query based on the view
select lastname from shorter_join where id = 3;

7. New feature of Hive3.0: materialized view

Concept

  • A materialized view is a database object that contains the results of a query. It can be used to precompute and store the results of expensive operations such as joins or aggregations, so that queries can skip those operations and return results quickly
  • The purpose of materialized views is to improve query performance through precomputation, which of course costs some storage space
  • Hive 3.0 introduced materialized views together with an automatic query rewriting mechanism based on them (implemented on top of Apache Calcite)
  • Hive materialized views also offer a choice of storage: they can be stored inside Hive or in other systems through custom storage handlers
  • Hive introduces materialized views to optimize query efficiency, which amounts to optimizing data access through data preprocessing
  • Hive dropped index support starting from 3.0 and recommends using materialized views and columnar storage file formats to speed up queries

Differences between materialized views and views

  • Views are virtual and exist only logically; only the definition is stored, no data
  • Materialized views are real and physically stored; the data in them is computed and stored
  • Views aim to simplify queries and reduce their complexity, while materialized views aim to improve query performance

Syntax

-- materialized view creation syntax
CREATE MATERIALIZED VIEW [IF NOT EXISTS] [db_name.]materialized_view_name
[DISABLE REWRITE]
[COMMENT materialized_view_comment]
[PARTITIONED ON (col_name, ...)]
[CLUSTERED ON (col_name, ...) | DISTRIBUTED ON (col_name, ...) SORTED ON (col_name, ...)]
[
    [ROW FORMAT row_format]
    [STORED AS file_format]
    | STORED BY 'storage.handler.class.name' [WITH SERDEPROPERTIES (...)]
]
[LOCATION hdfs_path]
[TBLPROPERTIES(property_name=property_value, ...)]
AS SELECT ...;

(1) After a materialized view is created, the results of its SELECT query are automatically materialized. "Automatic" means the materialization happens as part of executing the CREATE statement; the materialized view is not visible to users until that execution completes, after which it becomes available.
(2) By default, a created materialized view can be used by the query optimizer to rewrite queries. This can be disabled at creation time with the DISABLE REWRITE option.
(3) The default SerDe and storage format are controlled by hive.materializedview.serde and hive.materializedview.fileformat.
(4) Materialized views support storing data in external systems (such as Druid); the syntax is as follows:

CREATE MATERIALIZED VIEW druid_wiki_mv
    STORED AS 'org.apache.hadoop.hive.druid.DruidStorageHandler'
AS
SELECT __time,page.user,c_added,c_removed
FROM src;

(5) The drop and show operations of materialized views are currently supported, and other operations will be added in the future

-- Drops a materialized view
DROP MATERIALIZED VIEW [db_name.]materialized_view_name;

-- Shows materialized views (with optional filters)
SHOW MATERIALIZED VIEWS [IN database_name];

-- Shows information about a specific materialized view
DESCRIBE [EXTENDED | FORMATTED] [db_name.]materialized_view_name;

(6) When the data source changes (new data inserted, existing data modified), the materialized view needs to be updated as well to keep the data consistent. Currently the user must actively trigger a rebuild:

ALTER MATERIALIZED VIEW [db_name.]materialized_view_name REBUILD;

Query rewriting based on materialized views

  • Once a materialized view is created, it can be used to accelerate related queries: if a submitted query can be rewritten to hit an existing materialized view, the result is returned directly from the materialized view's data, speeding up the query
  • Whether queries are rewritten to use materialized views is controlled by a global parameter, true by default: hive.materializedview.rewriting=true;
  • Users can selectively enable or disable the rewriting mechanism for a specific materialized view; the syntax is as follows
ALTER MATERIALIZED VIEW [db_name.]materialized_view_name ENABLE|DISABLE REWRITE;

Case: Query rewriting based on materialized views

-- 1. create a transactional table student_trans
set hive.support.concurrency = true; -- whether Hive supports concurrency
set hive.enforce.bucketing = true; -- no longer needed since Hive 2.0; whether to enforce bucketing
set hive.exec.dynamic.partition.mode = nonstrict; -- dynamic partition mode: nonstrict
set hive.txn.manager = org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
set hive.compactor.initiator.on = true; -- whether to run the initiator and cleaner threads on this metastore instance
set hive.compactor.worker.threads = 1; -- how many compactor worker threads to run on this metastore instance

drop table if exists student_trans;

CREATE TABLE student_trans(
    sno int,
    sname string,
    sdept string
) clustered by (sno) into 2 buckets stored as orc TBLPROPERTIES('transactional'='true');

-- 2. load the data into student_trans
insert overwrite table student_trans
select num,name,dept
from student;

select * 
from student_trans;

-- 3. create an aggregation materialized view on student_trans
CREATE MATERIALIZED VIEW student_trans_agg
AS SELECT sdept, count(*) as sdept_cnt from student_trans group by sdept;
-- note: executing CREATE MATERIALIZED VIEW launches an MR job to build the materialized view
-- you can see that the database now contains a materialized view
show tables;
show materialized views;

-- 4. query the original table student_trans
-- because the query hits the materialized view, it is rewritten against it and runs faster (no MR job, just an ordinary table scan)
SELECT sdept,count(*) as sdept_cnt from student_trans group by sdept;

-- 5. the query plan shows the query is automatically rewritten into a TableScan of the materialized view,
-- which improves query efficiency
explain SELECT sdept,count(*) as sdept_cnt from student_trans group by sdept;

-- verify: disable automatic rewriting for the materialized view
ALTER MATERIALIZED VIEW student_trans_agg DISABLE REWRITE;

-- drop the materialized view
drop materialized view student_trans_agg;

8. Hive Database|Schema (database) DDL operation

Overview

  • In Hive, the concept of a DATABASE is similar to that in an RDBMS; DATABASE and SCHEMA are interchangeable and either can be used
  • The default database is called default, and its data is stored under /user/hive/warehouse
  • The storage location of a user-created database is /user/hive/warehouse/database_name.db

create database
Creates a new database.
COMMENT: a comment describing the database
LOCATION: the storage location of the database on HDFS; the default is /user/hive/warehouse/dbname.db
WITH DBPROPERTIES: specifies additional property configuration for the database

CREATE (DATABASE|SCHEMA) [IF NOT EXISTS] database_name
[COMMENT database_comment]
[LOCATION hdfs_path]
[WITH DBPROPERTIES (property_name = property_value, ...)];

describe database
Shows the database's name, its comment (if set) and its location on the file system.
The EXTENDED keyword shows more information. The keyword describe can be abbreviated as desc.

Syntax:
DESCRIBE DATABASE|SCHEMA [EXTENDED] db_name;

use database
Selects a specific database, i.e. switches which database the current session uses.

drop database
Drops a database.
The default behavior is RESTRICT, which means the database is only dropped if it is empty.
To drop a database that still contains tables (a non-empty one), use CASCADE.
DROP (DATABASE|SCHEMA) [IF EXISTS] database_name [RESTRICT|CASCADE];

alter database
changes the metadata associated with a database in Hive

-- change database properties
ALTER (DATABASE|SCHEMA) database_name SET DBPROPERTIES (property_name = property_value, ...);

-- change the database owner
ALTER (DATABASE|SCHEMA) database_name SET OWNER USER user;

-- change the database location
ALTER (DATABASE|SCHEMA) database_name SET LOCATION hdfs_path;

-- create a database
create database if not exists test
comment "this is test db"
with dbproperties('createdBy'='Allen');

-- describe database information
describe database test;
describe database extended test;
desc database extended test;

-- switch databases
use default;
use test;

-- drop the database
-- note: use the CASCADE keyword with caution
drop database test;

9. Hive Table (table) DDL operation

Overview

  • Table DDL operations are the core of Hive DDL, covering creating, modifying, dropping, and describing tables
  • Among them the CREATE TABLE statement is the core of the core; see the Hive DDL create-table section for details
  • Whether a table is defined correctly directly determines whether the data maps successfully, which in turn determines whether Hive can be used for data analysis at all
  • Since Hive maps data to a table very quickly once the table is built, if a table turns out to be wrong in practice it can simply be dropped and recreated instead of modified

describe table
Displays the metadata of a table in Hive.
If the EXTENDED keyword is specified, all of the table's metadata is shown in Thrift-serialized form.
If the FORMATTED keyword is specified, the metadata is shown in a tabular format.

drop table
Deletes the table's metadata and data.
If the HDFS trash is enabled and PURGE is not specified, the table data is actually moved to the trash directory while the metadata is removed completely.
When an EXTERNAL table is dropped, its data is not deleted from the file system; only the metadata is removed.
If PURGE is specified, the table data is deleted directly, bypassing the HDFS trash, so the data cannot be recovered if the DROP was a mistake.

DROP TABLE [IF EXISTS] table_name [PURGE];

truncate table
Deletes all rows from a table.
It can be understood as emptying the table's data while keeping the table's metadata structure.
If the HDFS trash is enabled the data is moved to the trash, otherwise it is deleted.

TRUNCATE [TABLE] table_name;

alter table

-- 1. rename a table
ALTER TABLE table_name RENAME TO new_table_name;

-- 2. change table properties
ALTER TABLE table_name SET TBLPROPERTIES (property_name = property_value, ...);
-- change the table comment
ALTER TABLE student SET TBLPROPERTIES ('comment' = "new comment for student table");

-- 3. change SerDe properties
ALTER TABLE table_name SET SERDE serde_class_name [WITH SERDEPROPERTIES (property_name = property_value, ...)];
ALTER TABLE table_name [PARTITION partition_spec] SET SERDEPROPERTIES serde_properties;
ALTER TABLE table_name SET SERDEPROPERTIES ('field.delim'=',');
-- remove SerDe properties
ALTER TABLE table_name [PARTITION partition_spec] UNSET SERDEPROPERTIES (property_name, ...);

-- 4. change the table's file storage format; this only changes the table metadata, any conversion of existing data must be done outside Hive
ALTER TABLE table_name SET FILEFORMAT file_format;

-- 5. change the table's storage location path
ALTER TABLE table_name SET LOCATION "new location";

-- 6. change a column's name/type/position/comment
CREATE TABLE test_change (a int, b int, c int);
ALTER TABLE test_change CHANGE a a1 INT;
ALTER TABLE test_change CHANGE a1 a2 STRING AFTER b;
ALTER TABLE test_change CHANGE c c1 INT FIRST;
ALTER TABLE test_change CHANGE a1 a1 INT COMMENT 'this is column a1';

-- 7. add/replace columns
-- ADD COLUMNS adds new columns after the existing columns but before the partition columns
-- REPLACE COLUMNS removes all existing columns and adds the new set of columns
ALTER TABLE table_name ADD|REPLACE COLUMNS (col_name data_type,...);

10. Hive Partition (partition) DDL operation

add partition

  • ADD PARTITION changes the table metadata but does not load data. If there is no data under the partition's location, queries will not return results
  • Therefore make sure data already exists under the added partition's location path, or import the partition data after adding the partition
-- 1. add a partition
ALTER TABLE table_name ADD PARTITION (dt='20170101') location '/user/hadoop/warehouse/table_name/dt=20170101';
-- add one partition at a time

ALTER TABLE table_name ADD PARTITION (dt='2008-08-08', country='us') location '/path/to/us/part080808'
                           PARTITION (dt='2008-08-09', country='us') location '/path/to/us/part080809';
-- add multiple partitions at once

rename partition

-- 2. rename a partition
ALTER TABLE table_name PARTITION partition_spec RENAME TO PARTITION partition_spec;
ALTER TABLE table_name PARTITION (dt='2008-08-09') RENAME TO PARTITION (dt='20080809');

delete partition
Deletes a partition of the table; the partition's data and metadata are both removed.

-- 3. drop a partition
ALTER TABLE table_name DROP [IF EXISTS] PARTITION (dt='2008-08-08', country='us');
ALTER TABLE table_name DROP [IF EXISTS] PARTITION (dt='2008-08-08', country='us') PURGE; -- delete the data directly, bypassing the trash

alter partition

-- 4. alter a partition
-- change the partition's file storage format
ALTER TABLE table_name PARTITION (dt='2008-08-09') SET FILEFORMAT file_format;
-- change the partition's location
ALTER TABLE table_name PARTITION (dt='2008-08-09') SET LOCATION "new location";

MSCK partition
MSCK is short for metastore check: a metadata consistency check operation that can be used to repair metadata.

  • MSCK's default behavior is ADD PARTITIONS: it adds to the metastore all partitions that exist on HDFS but are missing from the metastore
  • The DROP PARTITIONS option removes from the metastore partition information for partitions that no longer exist on HDFS
  • The SYNC PARTITIONS option is equivalent to calling both ADD and DROP PARTITIONS
  • If there are a large number of untracked partitions, run MSCK REPAIR TABLE in batches to avoid out-of-memory errors (OOME)
-- partition repair
MSCK [REPAIR] TABLE table_name [ADD/DROP/SYNC PARTITIONS];
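
For example (assuming the partitioned table t_all_hero_part created earlier, with partition directories written to HDFS outside of Hive):

-- sync the metastore with the partition directories that actually exist on HDFS
MSCK REPAIR TABLE t_all_hero_part SYNC PARTITIONS;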

11. Hive show syntax

-- 1. show all databases; SCHEMAS and DATABASES are interchangeable and behave the same
show databases;
show schemas;

-- 2. show all tables/views/materialized views/partitions/indexes in the current database
show tables;
SHOW TABLES [IN database_name]; -- specify a particular database

-- 3. show all views in the current database
show views;
show views 'test_*';
show views from test1;
SHOW VIEWS [IN/FROM database_name];

-- 4. show all materialized views in the current database
SHOW MATERIALIZED VIEWS [IN/FROM database_name];

-- 5. show partition information; partitions are listed alphabetically; running this on a non-partitioned table raises an error
show partitions table_name;

-- 6. show extended information about a table/partition
SHOW TABLE EXTENDED [IN|FROM database_name] LIKE table_name;
show table extended like student;

-- 7. show a table's properties
SHOW TBLPROPERTIES table_name;
show tblproperties student;

-- 8. show the creation statement of a table or view
SHOW CREATE TABLE ([db_name.]table_name|view_name);
show create table student;

-- 9. show all columns of a table, including partition columns
SHOW COLUMNS (FROM|IN) table_name [(FROM|IN) db_name];
show columns in student;

-- 10. show all user-defined and built-in functions currently supported
show functions;

-- 11. describe (desc)
-- show table information
desc extended table_name;
-- show table information (nicely formatted)
desc formatted table_name;
-- show database information
describe database database_name;
