Impala: INVALIDATE METADATA and REFRESH

First look: how Impala integrates with the Hadoop ecosystem

Impala makes use of many familiar components of the Hadoop ecosystem. Impala can act as both a consumer and a producer, exchanging data with other Hadoop components, so it can fit into your ETL and ELT pipelines in a flexible way.


How Impala Works with Hive
A major goal of Impala is to make SQL-on-Hadoop operations fast and efficient enough to appeal to new categories of users and to open up Hadoop to new types of use cases. Where practical, it makes use of existing Apache Hive infrastructure, which many Hadoop users already have in place for running long-running, batch-oriented SQL queries.
In particular, Impala keeps its table definitions in the same conventional MySQL or PostgreSQL database (known as the metastore) in which Hive stores this data. Thus, Impala can access tables defined or loaded by Hive, as long as all the columns use Impala-supported data types, file formats, and compression codecs.
The initial focus on query features and performance means that Impala can read more types of data with the SELECT statement than it can write with the INSERT statement. To query data in the Avro, RCFile, or SequenceFile file formats, you load the data through Hive.
The Impala query optimizer can also make use of table statistics and column statistics. Originally, you gathered this information with the ANALYZE TABLE statement in Hive; in Impala 1.2.2 and later, use the Impala COMPUTE STATS statement instead. COMPUTE STATS requires less setup, is more reliable, and does not require switching back and forth between impala-shell and the Hive shell.
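As a sketch of the two approaches (the table name `sales` is hypothetical), the older Hive workflow and its newer Impala equivalent look like this:

```sql
-- In the Hive shell (pre-Impala-1.2.2 approach): two statements,
-- one for table-level and one for column-level statistics.
ANALYZE TABLE sales COMPUTE STATISTICS;
ANALYZE TABLE sales COMPUTE STATISTICS FOR COLUMNS;

-- In impala-shell (Impala 1.2.2 and later), a single statement
-- gathers both table and column statistics:
COMPUTE STATS sales;

-- Inspect what was collected:
SHOW TABLE STATS sales;
SHOW COLUMN STATS sales;
```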

Overview of Impala Metadata and the Metastore
As discussed in "How Impala Works with Hive", Impala maintains information about table definitions in a central database known as the metastore. Impala also tracks other metadata for the low-level characteristics of data files: the physical locations of blocks within HDFS. For tables with a large volume of data and/or many partitions, retrieving all the metadata for a table can be time-consuming, taking minutes in some cases. Thus, each Impala node caches all of this metadata to reuse for future queries against the same table.
If the table definition or the data in the table is updated, all other Impala daemons in the cluster must receive the latest metadata, replacing the obsolete cached metadata, before issuing a query against that table.

In Impala 1.2 and later, metadata updates are automatic for all DDL and DML statements issued through Impala, coordinated through the catalogd daemon.
After issuing DDL or DML through Hive, or changing files in HDFS manually, you still need the REFRESH statement (when new data files are added to an existing table) or the INVALIDATE METADATA statement (after creating or dropping a table, performing an HDFS rebalance operation, or deleting data files). Issuing INVALIDATE METADATA by itself retrieves metadata for all the tables tracked by the metastore. If you know that only specific tables were changed outside of Impala, you can issue REFRESH table_name for each affected table to retrieve only the latest metadata for those tables.
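For example (the table name `logs` is hypothetical), after making changes through Hive you would run one of the following in impala-shell:

```sql
-- New data files were added to an existing table through Hive
-- (e.g. LOAD DATA or INSERT in the Hive shell):
REFRESH logs;

-- A table was created or dropped through Hive, so Impala's cached
-- metadata for it is missing or stale; mark it for lazy reload:
INVALIDATE METADATA logs;
```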

How Impala Uses HDFS

Impala uses the HDFS distributed file system as its primary data storage medium. Impala relies on the redundancy that HDFS provides to guard against hardware or network outages on individual nodes. Impala table data is physically represented as data files in HDFS, using the familiar HDFS file formats and compression codecs. When data files appear in the directory for a new table, Impala reads them all, regardless of file name. New data is added into files whose names are controlled by Impala.

 

INVALIDATE METADATA Statement

Marks the metadata for one or all tables as stale. This statement is required after you create a table through the Hive shell, before the table is available for Impala queries. The next time the current Impala node performs a query against a table whose metadata is invalidated, Impala reloads the associated metadata before the query proceeds. Compared with the incremental metadata update performed by the REFRESH statement, this is a relatively expensive operation, so in the common scenario of adding new data files to an existing table, prefer REFRESH over INVALIDATE METADATA.

Syntax: INVALIDATE METADATA [[db_name.]table_name]

By default, the cached metadata for all tables is flushed. If you specify a table name, only the metadata for that table is flushed. Even for a single table, INVALIDATE METADATA is more expensive than REFRESH, so in the common case of adding new data files to an existing table, choose REFRESH.
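A minimal illustration of the two forms (the database and table names are hypothetical):

```sql
-- Flush cached metadata for every table known to the metastore
-- (expensive; avoid on clusters with many tables):
INVALIDATE METADATA;

-- Flush cached metadata for a single table, optionally qualified
-- by database name:
INVALIDATE METADATA sales_db.orders;
```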

 

INVALIDATE METADATA and REFRESH are counterparts: INVALIDATE METADATA waits to reload the metadata until a subsequent query requires it, but then reloads all the metadata for the table, which can be an expensive operation, especially for large tables with many partitions. REFRESH reloads the metadata immediately, but only loads the block location data for newly added data files, making it a less expensive operation overall. If data was altered in a more extensive way, such as being reorganized by the HDFS balancer, use INVALIDATE METADATA to avoid a performance penalty from reduced local reads. Impala 1.1 optimized REFRESH for the common use case of adding new data files to an existing table, so it now requires a table name parameter.

Caution:
INVALIDATE METADATA refreshes the metadata for an entire database or a single table, including both the table metadata and the metadata for the data files in the table. It first clears the table's cache, then reloads everything from the metastore, so it is a relatively heavyweight operation. It is mainly used when table metadata has been modified in Hive and needs to be synchronized to impalad, e.g. CREATE TABLE / DROP TABLE / ALTER TABLE ADD COLUMNS and the like.


Syntax: REFRESH [table] PARTITION [partition]
REFRESH refreshes the data of a table or a partition. It reuses the table's previous metadata and performs the refresh only at the file level, so it can detect partitions being added to or removed from the table. It is mainly used when the table metadata is unchanged but the data has been modified, such as by INSERT INTO, LOAD DATA, ALTER TABLE ADD PARTITION, ALTER TABLE DROP PARTITION, and the like. If the table's files in HDFS are modified directly (added, deleted, or renamed), you also need to issue REFRESH to refresh the metadata.
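A sketch of the two forms above (the table name and partition spec are hypothetical):

```sql
-- Data changed but the schema did not: refresh the whole table.
REFRESH web_logs;

-- Only one partition's data files changed (e.g. files were added
-- directly in HDFS under that partition's directory):
REFRESH web_logs PARTITION (year=2019, month=8);
```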

Usage guidelines

If metadata or data changes during use, one of these two operations is required to synchronize Impala. Choose between them based on the following principles:

1) INVALIDATE METADATA is a heavier-weight operation than REFRESH.
2) If the change involves a table's schema, use INVALIDATE METADATA [table].
3) If the change involves only a table's data, use REFRESH [table].
4) If the change involves only the data in one partition of a table, use REFRESH [table] PARTITION [partition].
5) Avoid issuing a bare INVALIDATE METADATA with nothing after it; prefer restarting catalogd instead. (Note: this situation can come up in real projects.)
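Putting the guidelines together (the database `db`, table `t`, and partition key `dt` are hypothetical):

```sql
-- 2) Schema changed in Hive (e.g. ALTER TABLE ... ADD COLUMNS):
INVALIDATE METADATA db.t;

-- 3) Only the table's data changed (e.g. Hive INSERT INTO):
REFRESH db.t;

-- 4) Only one partition's data changed:
REFRESH db.t PARTITION (dt='2019-08-26');
```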

Summary

REFRESH and INVALIDATE METADATA are two important Impala operations, used to handle changes to data and to metadata respectively; REFRESH operates synchronously, while INVALIDATE METADATA is asynchronous. This article detailed the scenarios where each applies, the statements to execute, and their possible impact. Most importantly, keep in mind the usage scenarios for these two statements.
 

Reference document: https://www.cloudera.com/documentation/enterprise/5-11-x/topics/impala_langref_sql.html#langref_sql

 

Origin www.cnblogs.com/hello-wei/p/11414207.html