[Big Data Hive] A detailed guide to using Hive views and materialized views

Table of contents

1. The view in hive

2. hive view syntax and operation

2.1 Data preparation

2.2 Create a view

2.2.1 Create a common view

2.2.2 Create a view based on a view

2.3 Viewing the view definition

2.4 Using Views

2.5 Delete view

2.6 Changing View Properties

2.7 Changing the view definition

3. The benefits of using views

3.1 Only provide specific column data in the real table to users, and protect data implicitly

3.1.1 Create a table

3.1.2 Create a view based on this table

3.2 Reduce query complexity and optimize query statements

4. hive materialized view

4.1 Hive materialized view concept

4.1.1 Features of hive materialized view

4.2 The difference between materialized view and view

4.3 Materialized view syntax

4.4 Query rewriting based on materialized view

4.5 Operation Demonstration

4.5.1 Create a new transaction table student_trans

4.5.2 Import data into student_trans

4.5.3 Create an aggregate materialized view based on student_trans

4.5.4 Aggregate Query Original Table and Query Materialized View Comparison

4.5.5 Other Common Commands of Materialized View


1. The view in hive

Readers who have used views in MySQL will already be familiar with the concept: a view is a virtual table defined by a query, and it does not store data itself. Hive also provides views, with the following characteristics:

  • A view in Hive is a virtual table: it saves only the query definition and does not store any data;
  • Views are usually created from queries on physical tables, and new views can also be created from existing views;
  • A view's schema is frozen when the view is created. Later changes to the underlying tables (for example, added columns) are not reflected in the view, and if an underlying table is dropped or changed incompatibly, queries against the view fail.

Views are used to simplify operations. They do not cache records and do not improve query performance.

2. hive view syntax and operation

2.1 Data preparation

The test database contains several tables. The examples below use the t_usa_covid19 table, with some data inserted into it.
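
For readers who want to reproduce the examples, a minimal sketch of such a table might look like the following. Only the four column names come from the queries in this article. The column types, the sample rows, and the choice of a plain managed table are assumptions made for illustration.

```sql
-- Hypothetical definition: the column names are taken from the queries in
-- this article, the column types are assumed.
create table if not exists t_usa_covid19 (
    count_date string,
    county     string,
    state      string,
    deaths     int
);

-- Made-up placeholder rows, for testing only (not real statistics).
insert into t_usa_covid19 values
    ('2021-01-01', 'County A', 'State X', 1),
    ('2021-01-01', 'County B', 'State X', 2);
```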

2.2 Create a view

2.2.1 Create a common view

create view v_usa_covid19 as select count_date, county,state,deaths from t_usa_covid19 limit 10;

After the view is created, run show views to list the views in the current database.

2.2.2 Create a view based on a view

Next, create a second view on top of the view created above:

create view v_usa_covid19_from_view as select * from v_usa_covid19 limit 5;

2.3 Viewing the view definition

show create table view_name;
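
As a concrete illustration, the statements below inspect the v_usa_covid19 view created in section 2.2.1. This is a sketch that assumes the view still exists in the current database.

```sql
-- Print the CREATE VIEW statement stored for the view.
show create table v_usa_covid19;

-- Show the view's metadata, including its original and expanded query text.
describe formatted v_usa_covid19;
```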

2.4 Using Views

After a view is created, it can be queried like a table. For example, query the first view:

select * from v_usa_covid19;

The query returns data as expected.

Note

A view is virtual and read-only: it can be used only in queries, and data cannot be inserted or loaded into it.

2.5 Delete view

drop view [if exists] view_name;
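
For example, the view-on-view from section 2.2.2 could be removed as sketched below. Dropping dependent views first is a cautious choice here, because Hive gives no warning when a dropped view is still referenced by other views, and those views then become invalid.

```sql
-- "if exists" avoids an error when the view has already been dropped.
drop view if exists v_usa_covid19_from_view;
```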

2.6 Changing View Properties

Add a comment to the view:

alter view v_usa_covid19 set TBLPROPERTIES ('comment' = 'This is my view');
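
One way to check that the property was stored is sketched below. It assumes that show tblproperties accepts a view name in the same way as a table name. If it does not in your Hive version, describe formatted also lists the properties.

```sql
-- List all properties of the view.
show tblproperties v_usa_covid19;

-- Show only the 'comment' property set above.
show tblproperties v_usa_covid19('comment');
```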

2.7 Changing the view definition

Redefine the view so that it selects only the specified columns:

alter view v_usa_covid19 as select county, deaths from t_usa_covid19 limit 5;

3. The benefits of using views

3.1 Only provide specific column data in the real table to users, and protect data implicitly

The following example uses a view to restrict data access, so that sensitive columns are not exposed to ad hoc queries.

3.1.1 Create a table

create table userinfo(firstname string, lastname string, ssn string, password string);

3.1.2 Create a view based on this table

This view returns only some of the columns. Sensitive columns such as ssn and password are not exposed.

create view safer_user_info as select firstname, lastname from userinfo;
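
A short sketch of how the view behaves for a consumer is given below. Note that the protection only holds if users are given access to the view and not to the userinfo table itself, which depends on how authorization is configured in your environment.

```sql
-- Users of the view see only the two exposed columns.
select firstname, lastname from safer_user_info;

-- The next statement is expected to fail, because ssn is not a column of the view:
-- select ssn from safer_user_info;
```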

3.2 Reduce query complexity and optimize query statements

The following statement contains a nested subquery:

from (
    select * from people join cart
    on (cart.people_id = people.id) where firstname = 'join'
) a select a.lastname where a.id = 3;

With a view, the inner join query can be encapsulated:

create view shorter_join as select * from people join cart on (cart.people_id = people.id) where firstname = 'join';

Once you have a view, you can simplify the query statement based on the view

select lastname from shorter_join where id = 3;

4. hive materialized view

4.1 Hive materialized view concept

A materialized view is a database object that contains the results of a query. It can be used to precompute and store the results of expensive operations such as table joins and aggregations, so that queries can skip those operations and return results quickly.

The purpose of a materialized view is to improve query performance through precomputation, at the cost of some additional storage space.

4.1.1 Features of hive materialized view

  • Hive 3.0 introduced materialized views together with an automatic query rewriting mechanism based on Apache Calcite;
  • Materialized views can be stored natively in Hive or in other systems (such as Druid) through custom storage handlers;
  • Hive introduced materialized views to make data access more efficient by preprocessing the data;
  • Hive dropped support for the index syntax in 3.0 and recommends materialized views and columnar file formats to speed up queries instead.

4.2 The difference between materialized view and view

  • A view is virtual and exists only logically: it has a definition but stores no data;
  • A materialized view is real and exists physically: it stores precomputed data;
  • A view is meant to reduce query complexity, while a materialized view is meant to improve query performance.

A materialized view caches data: the data is computed and stored when the materialized view is created, and Hive treats the materialized view as a table. A view only defines a virtual table with a structure and no data. At query time, the SQL is rewritten to access the underlying tables.

4.3 Materialized view syntax

The complete syntax is as follows:


CREATE MATERIALIZED VIEW [IF NOT EXISTS] [db_name.]materialized_view_name
  [DISABLE REWRITE]
  [COMMENT materialized_view_comment]
  [PARTITIONED ON (col_name, ...)]
  [CLUSTERED ON (col_name, ...) | DISTRIBUTED ON (col_name, ...) SORTED ON (col_name, ...)]
  [
    [ROW FORMAT row_format]
    [STORED AS file_format]
      | STORED BY 'storage.handler.class.name' [WITH SERDEPROPERTIES (...)]
  ]
  [LOCATION hdfs_path]
  [TBLPROPERTIES (property_name=property_value, ...)]
AS SELECT ...;

Supplementary note:

  • When a materialized view is created, it is automatically populated with the results of its select query. The creation statement is atomic: the materialized view is not visible to other users until all query results have been written;
  • By default, a new materialized view can be used by the query optimizer for query rewriting. This can be turned off with the DISABLE REWRITE clause when the materialized view is created;
  • The default SerDe and storage format are taken from the configuration properties hive.materializedview.serde and hive.materializedview.fileformat.
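
To make the optional clauses more concrete, here is a minimal sketch that combines a few of them. The name student_dept_mv and the comment text are hypothetical. student_trans is the transactional table created later in section 4.5.1, and the sketch assumes it already exists.

```sql
-- student_dept_mv is a hypothetical name used only for this illustration.
create materialized view if not exists student_dept_mv
disable rewrite                                -- do not use this view for query rewriting
comment 'Number of students per department'
stored as orc                                  -- overrides hive.materializedview.fileformat
as
select sdept, count(*) as sdept_cnt
from student_trans
group by sdept;
```

Because of DISABLE REWRITE, the optimizer is not expected to use this view automatically. It can still be queried directly by name.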

Materialized views can also store their data in external systems (such as Druid), as in the following example:

CREATE MATERIALIZED VIEW druid_wiki_mv
STORED BY 'org.apache.hadoop.hive.druid.DruidStorageHandler'
AS
SELECT __time, page, user, c_added, c_removed
FROM src;

Currently, the management operations supported for materialized views include DROP, SHOW, and DESCRIBE. More operations will be added in the future.

DROP MATERIALIZED VIEW [db_name.]materialized_view_name;

DESCRIBE [EXTENDED | FORMATTED] [db_name.]materialized_view_name;

When the source data changes (new rows are inserted or existing rows are modified), the materialized view must be updated to stay consistent. Currently, the user has to trigger the rebuild manually:

ALTER MATERIALIZED VIEW [db_name.]materialized_view_name REBUILD;
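
A sketch of a typical refresh cycle follows, using the student_trans table and the student_trans_agg materialized view from section 4.5. The inserted row is made up for illustration.

```sql
-- Made-up row, for illustration only.
insert into student_trans values (99999, 'test_student', 'CS');

-- The materialized view no longer reflects the base table; refresh it manually.
alter materialized view student_trans_agg rebuild;
```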

4.4 Query rewriting based on materialized view

As described above, once a materialized view exists it can be used to accelerate related queries: when a user submits a query that, after rewriting, can be answered from an existing materialized view, the result is read directly from the materialized view.

Whether queries are rewritten to use materialized views is controlled by a global parameter, which defaults to true:

set hive.materializedview.rewriting=true;

Rewriting can also be enabled or disabled for an individual materialized view:

ALTER MATERIALIZED VIEW [db_name.]materialized_view_name ENABLE|DISABLE REWRITE;

Query rewrite process for materialized views

  • The user submits a query;
  • The optimizer checks whether the query, after rewriting, can be answered from an existing materialized view;
  • If it can, the data is read directly from the materialized view and returned, which accelerates the query.

4.5 Operation Demonstration

Before starting, apply the following settings in the current session:

set hive.support.concurrency = true; -- whether Hive supports concurrency
set hive.enforce.bucketing = true; -- whether to enable bucketing (no longer needed from Hive 2.0)
set hive.exec.dynamic.partition.mode = nonstrict; -- dynamic partition mode: nonstrict
set hive.txn.manager = org.apache.hadoop.hive.ql.lockmgr.DbTxnManager; -- transaction manager
set hive.compactor.initiator.on = true; -- whether to run the initiator and cleaner threads on this Metastore instance
set hive.compactor.worker.threads = 1; -- number of compactor worker threads to run on this Metastore instance

4.5.1 Create a new transaction table student_trans

CREATE TABLE student_trans (
      sno int,
      sname string,
      sdept string)
clustered by (sno) into 2 buckets stored as orc TBLPROPERTIES('transactional'='true');

4.5.2 Import data into student_trans

insert overwrite table student_trans select num,name,dept from student;

Check the data of the table after the execution is complete

4.5.3 Create an aggregate materialized view based on student_trans

CREATE MATERIALIZED VIEW student_trans_agg AS SELECT sdept, count(*) as sdept_cnt from student_trans group by sdept;

The execution log shows that CREATE MATERIALIZED VIEW starts a MapReduce job to build the materialized view. After it finishes, the materialized view appears in the current database.

4.5.4 Aggregate Query Original Table and Query Materialized View Comparison

First delete the above materialized view

Query the original table student_trans

Query again after recreating the materialized view
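
The three steps above can be sketched as follows. The exact query used in the original run is not shown, so this sketch assumes it is the same aggregation as the materialized view definition from section 4.5.3.

```sql
-- Step 1: drop the materialized view so that the query must read the base table.
drop materialized view student_trans_agg;

-- Step 2: aggregate over the original table.
select sdept, count(*) as sdept_cnt from student_trans group by sdept;

-- Step 3: recreate the materialized view.
create materialized view student_trans_agg
as select sdept, count(*) as sdept_cnt from student_trans group by sdept;

-- Step 4: run the same query again; it can now be rewritten to read the view.
select sdept, count(*) as sdept_cnt from student_trans group by sdept;
```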

Comparing the two runs, the first query executes a MapReduce job and takes more than 2 seconds, while the second query does not. The second query hits the materialized view and is rewritten to read from it, so no MapReduce job is started and only an ordinary table scan is performed. The query is more than twice as fast, and the gain becomes much larger when the amount of data is very large.

In order to further verify the above statement, you can use explain to view the execution plan
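
A minimal sketch of that check, again assuming the aggregation query matches the materialized view definition:

```sql
-- If rewriting applies, the plan is expected to scan student_trans_agg
-- instead of student_trans.
explain
select sdept, count(*) as sdept_cnt from student_trans group by sdept;
```

Running the same explain after ALTER MATERIALIZED VIEW student_trans_agg DISABLE REWRITE should show a plan that reads student_trans again, which makes the effect of rewriting easy to compare.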

Check the hdfs file directory, and you can find that a data directory for a materialized view table has also been created 
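
If the directory path is not known in advance, one way to find it from within Hive is sketched below. It relies on the DESCRIBE FORMATTED form listed in section 4.3, whose output is expected to include a Location row.

```sql
-- The "Location" row of the output shows where the materialized view's
-- data files are stored.
describe formatted student_trans_agg;
```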

4.5.5 Other Common Commands of Materialized View

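-- Disable automatic query rewriting for the materialized view (to verify its effect)
ALTER MATERIALIZED VIEW student_trans_agg DISABLE REWRITE;

-- List materialized views
show materialized views;

-- Drop the materialized view
drop materialized view student_trans_agg;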

Origin blog.csdn.net/congge_study/article/details/128840818