Several Ways to Refresh a Table in Impala

Copyright notice: This is the author's original article; reproduction without the author's permission is prohibited. https://blog.csdn.net/zpf336/article/details/78920919

Abstract:

When working with Impala, you will notice that some people refresh a table with REFRESH table, others with INVALIDATE METADATA table, and still others with ALTER TABLE tablename RECOVER PARTITIONS. If you are not familiar with the differences, it is easy to pick the wrong one, or to run all three together.

Chinese-language material on Impala is still fairly scarce online, so this post takes a quick look at these refresh methods and the differences between them.


1. REFRESH


        This refresh method is simple and relatively limited. You may find that even after running REFRESH, a SELECT still does not return the new data. The reason is explained in the official documentation:

https://impala.apache.org/docs/build/html/topics/impala_refresh.html

Because REFRESH table_name only works for tables that the current Impala node is already aware of, when you create a new table in the Hive shell, enter INVALIDATE METADATA new_table before you can see the new table in impala-shell. Once the table is known by Impala, you can issue REFRESH table_name after you add data files for that table.


In Impala 2.3 and higher, the syntax ALTER TABLE table_name RECOVER PARTITIONS is a faster alternative to REFRESH when the only change to the table data is the addition of new partition directories through Hive or manual HDFS operations.

You only need to issue the REFRESH statement on the node to which you connect to issue queries. The coordinator node divides the work among all the Impala nodes in a cluster, and sends read requests for the correct HDFS blocks without relying on the metadata on the other nodes.

REFRESH reloads the metadata for the table from the metastore database, and does an incremental reload of the low-level block location data to account for any new data files added to the HDFS data directory for the table. It is a low-overhead, single-table operation, specifically tuned for the common scenario where new data files are added to HDFS.

Only the metadata for the specified table is flushed. The table must already exist and be known to Impala, either because the CREATE TABLE statement was run in Impala rather than Hive, or because a previous INVALIDATE METADATA statement caused Impala to reload its entire metadata catalog.
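Putting the notes above together, a typical REFRESH sequence in impala-shell might look like the following sketch (the table name sales_data is hypothetical):

```sql
-- New data files were added to the table's HDFS directory,
-- e.g. by a Hive INSERT or a manual hdfs dfs -put.
-- REFRESH is a low-overhead, single-table, incremental reload:
REFRESH sales_data;

-- The new files are now visible to queries from this coordinator:
SELECT COUNT(*) FROM sales_data;
```

Note that this only works because Impala already knows about sales_data; for a table freshly created in Hive, REFRESH is not enough, which is where INVALIDATE METADATA comes in.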


2. INVALIDATE METADATA


        This refresh method is more powerful than REFRESH. Its characteristics:

  • 1. Without an argument, it refreshes the metadata of all tables (INVALIDATE METADATA)
  • 2. With an argument, it refreshes the metadata of a single table (INVALIDATE METADATA table_name)
  • 3. It works regardless of whether the change was made on the current Impala node

        Details from the official documentation:

https://impala.apache.org/docs/build/html/topics/impala_invalidate_metadata.html

The INVALIDATE METADATA statement is new in Impala 1.1 and higher, and takes over some of the use cases of the Impala 1.0 REFRESH statement. Because REFRESH now requires a table name parameter, to flush the metadata for all tables at once, use the INVALIDATE METADATA statement.

Because REFRESH table_name only works for tables that the current Impala node is already aware of, when you create a new table in the Hive shell, enter INVALIDATE METADATA new_table before you can see the new table in impala-shell. Once the table is known by Impala, you can issue REFRESH table_name after you add data files for that table.

Usage notes:

A metadata update for an impalad instance is required if:

  • A metadata change occurs.
  • and the change is made from another impalad instance in your cluster, or through Hive.
  • and the change is made to a metastore database to which clients such as the Impala shell or ODBC directly connect.

A metadata update for an Impala node is not required when you issue queries from the same Impala node where you ran ALTER TABLE, INSERT, or other table-modifying statement.
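The two forms described in the bullets above can be sketched as follows (new_table is a hypothetical table just created in the Hive shell):

```sql
-- Form 1: no argument. Flushes the metadata for ALL tables,
-- which can be expensive on a large catalog:
INVALIDATE METADATA;

-- Form 2: with a table name. Flushes metadata for one table only,
-- e.g. a table Impala has never seen because it was created in Hive:
INVALIDATE METADATA new_table;

-- The table is now visible in impala-shell:
SELECT * FROM new_table LIMIT 10;
```

Because the no-argument form rebuilds the entire catalog, the single-table form is generally preferable whenever you know which table changed.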



3. ALTER TABLE tableName RECOVER PARTITIONS


        This refresh method tells Impala to scan the table's data directory to discover new partitions. It applies in two scenarios:

  • 1. Partitions added with the Hive command alter table xxx add partition ()
  • 2. Directories newly created or copied in with HDFS commands

        After either of these operations, Impala is unaware that the files have changed; a plain SELECT will not return the newly added files or directories. You must first run this statement to scan the table, after which SELECT works normally and returns correct results.

        Usage is shown below:

        https://impala.apache.org/docs/build/html/topics/impala_alter_table.html

In Impala 2.3 and higher, the RECOVER PARTITIONS clause scans a partitioned table to detect if any new partition directories were added outside of Impala, such as by Hive ALTER TABLE statements or by hdfs dfs or hadoop fs commands. The RECOVER PARTITIONS clause automatically recognizes any data files present in these new directories, the same as the REFRESH statement does.

For example, here is a sequence of examples showing how you might create a partitioned table in Impala, create new partitions through Hive, copy data files into the new partitions with the hdfs command, and have Impala recognize the new partitions and new data:

In Impala, create the table, and a single partition for demonstration purposes:


create database recover_partitions;
use recover_partitions;
create table t1 (s string) partitioned by (yy int, mm int);
insert into t1 partition (yy = 2016, mm = 1) values ('Partition exists');
show files in t1;
+---------------------------------------------------------------------+------+--------------+
| Path                                                                | Size | Partition    |
+---------------------------------------------------------------------+------+--------------+
| /user/hive/warehouse/recover_partitions.db/t1/yy=2016/mm=1/data.txt | 17B  | yy=2016/mm=1 |
+---------------------------------------------------------------------+------+--------------+
quit;

In Hive, create some new partitions. In a real use case, you might create the partitions and populate them with data as the final stages of an ETL pipeline.



hive> use recover_partitions;
OK
hive> alter table t1 add partition (yy = 2016, mm = 2);
OK
hive> alter table t1 add partition (yy = 2016, mm = 3);
OK
hive> quit;

For demonstration purposes, manually copy data (a single row) into these new partitions, using manual HDFS operations:



$ hdfs dfs -ls /user/hive/warehouse/recover_partitions.db/t1/yy=2016/
Found 3 items
drwxr-xr-x - impala   hive 0 2016-05-09 16:06 /user/hive/warehouse/recover_partitions.db/t1/yy=2016/mm=1
drwxr-xr-x - jrussell hive 0 2016-05-09 16:14 /user/hive/warehouse/recover_partitions.db/t1/yy=2016/mm=2
drwxr-xr-x - jrussell hive 0 2016-05-09 16:13 /user/hive/warehouse/recover_partitions.db/t1/yy=2016/mm=3

$ hdfs dfs -cp /user/hive/warehouse/recover_partitions.db/t1/yy=2016/mm=1/data.txt \
  /user/hive/warehouse/recover_partitions.db/t1/yy=2016/mm=2/data.txt
$ hdfs dfs -cp /user/hive/warehouse/recover_partitions.db/t1/yy=2016/mm=1/data.txt \
  /user/hive/warehouse/recover_partitions.db/t1/yy=2016/mm=3/data.txt



hive> select * from t1;
OK
Partition exists  2016  1
Partition exists  2016  2
Partition exists  2016  3
hive> quit;

In Impala, initially the partitions and data are not visible. Running ALTER TABLE with the RECOVER PARTITIONS clause scans the table data directory to find any new partition directories, and the data files inside them:



select * from t1;
+------------------+------+----+
| s                | yy   | mm |
+------------------+------+----+
| Partition exists | 2016 | 1  |
+------------------+------+----+

alter table t1 recover partitions;
select * from t1;
+------------------+------+----+
| s                | yy   | mm |
+------------------+------+----+
| Partition exists | 2016 | 1  |
| Partition exists | 2016 | 3  |
| Partition exists | 2016 | 2  |
+------------------+------+----+

