Summary:
When working with Impala you will notice that some people refresh a table with REFRESH table, others with INVALIDATE METADATA table, and still others with ALTER TABLE tablename RECOVER PARTITIONS. If you are not familiar with them, it is easy to pick the wrong one, or to run all three together.
There is still relatively little material about Impala available in Chinese, so this post takes a quick look at these refresh statements and the differences between them.
1. REFRESH
This refresh statement is the simplest and also the most limited. You may find that even after running REFRESH, a SELECT still does not return the new table or data. Why? The official documentation explains:
https://impala.apache.org/docs/build/html/topics/impala_refresh.html
Because REFRESH table_name only works for tables that the current Impala node is already aware of, when you create a new table in the Hive shell, enter INVALIDATE METADATA new_table before you can see the new table in impala-shell. Once the table is known by Impala, you can issue REFRESH table_name after you add data files for that table.
In Impala 2.3 and higher, the syntax ALTER TABLE table_name RECOVER PARTITIONS is a faster alternative to REFRESH when the only change to the table data is the addition of new partition directories through Hive or manual HDFS operations.
You only need to issue the REFRESH statement on the node to which you connect to issue queries. The coordinator node divides the work among all the Impala nodes in a cluster, and sends read requests for the correct HDFS blocks without relying on the metadata on the other nodes.
REFRESH reloads the metadata for the table from the metastore database, and does an incremental reload of the low-level block location data to account for any new data files added to the HDFS data directory for the table. It is a low-overhead, single-table operation, specifically tuned for the common scenario where new data files are added to HDFS.
Only the metadata for the specified table is flushed. The table must already exist and be known to Impala, either because the CREATE TABLE statement was run in Impala rather than Hive, or because a previous INVALIDATE METADATA statement caused Impala to reload its entire metadata catalog.
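Putting the documentation above together, the typical sequence looks like the following sketch (the table name new_table and the shell prompts are illustrative):

```sql
-- In the Hive shell: create a table; Impala does not yet know it exists
hive> CREATE TABLE new_table (id INT, val STRING);

-- In impala-shell: make the new table visible to Impala's catalog
[impalad:21000] > INVALIDATE METADATA new_table;

-- Later, after new data files are added to the table's HDFS directory:
[impalad:21000] > REFRESH new_table;  -- low-overhead, incremental reload
```

Note that REFRESH alone would not have been enough for the first step, because the table was created outside Impala.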
2. INVALIDATE METADATA
This statement is more powerful than REFRESH. Its characteristics are:
- 1. With no argument, it refreshes the metadata of all tables (INVALIDATE METADATA).
- 2. With a table name, it refreshes the metadata of that single table (INVALIDATE METADATA table_name).
- 3. It works whether or not the table is already known to the current Impala node.
Details:
https://impala.apache.org/docs/build/html/topics/impala_invalidate_metadata.html
The INVALIDATE METADATA statement is new in Impala 1.1 and higher, and takes over some of the use cases of the Impala 1.0 REFRESH statement. Because REFRESH now requires a table name parameter, to flush the metadata for all tables at once, use the INVALIDATE METADATA statement.
Because REFRESH table_name only works for tables that the current Impala node is already aware of, when you create a new table in the Hive shell, enter INVALIDATE METADATA new_table before you can see the new table in impala-shell. Once the table is known by Impala, you can issue REFRESH table_name after you add data files for that table.
Usage notes:
A metadata update for an impalad instance is required if:
- A metadata change occurs,
- and the change is made from another impalad instance in your cluster, or through Hive,
- and the change is made to a metastore database to which clients such as the Impala shell or ODBC directly connect.
A metadata update for an Impala node is not required when you issue queries from the same Impala node where you ran ALTER TABLE, INSERT, or other table-modifying statement.
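As a quick illustration of the two forms (statement sketches only; the actual cost depends on the size of your catalog, and db_name.table_name is a placeholder):

```sql
-- No argument: flush the cached metadata for ALL tables.
-- Expensive on a large catalog, so prefer the per-table form when possible.
INVALIDATE METADATA;

-- With a table name: flush the cached metadata for one table only,
-- e.g. after it was created or restructured through Hive.
INVALIDATE METADATA db_name.table_name;
```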
3. ALTER TABLE tableName RECOVER PARTITIONS
This statement tells Impala to scan the table's directories on HDFS to discover new partitions. It is needed in two typical scenarios:
- 1. Partitions were added with the Hive command ALTER TABLE xxx ADD PARTITION (...).
- 2. Partition directories were created, or copied in, directly with HDFS commands.
After either operation, Impala does not know that the files have changed, so a plain SELECT will not return the newly added files or directories. You must first run this statement to scan for them; after that, SELECT returns the correct results.
Usage details:
https://impala.apache.org/docs/build/html/topics/impala_alter_table.html
In Impala 2.3 and higher, the RECOVER PARTITIONS clause scans a partitioned table to detect if any new partition directories were added outside of Impala, such as by Hive ALTER TABLE statements or by hdfs dfs or hadoop fs commands. The RECOVER PARTITIONS clause automatically recognizes any data files present in these new directories, the same as the REFRESH statement does.
For example, here is a sequence of examples showing how you might create a partitioned table in Impala, create new partitions through Hive, copy data files into the new partitions with the hdfs command, and have Impala recognize the new partitions and new data:
In Impala, create the table, and a single partition for demonstration purposes:
create database recover_partitions;
use recover_partitions;
create table t1 (s string) partitioned by (yy int, mm int);
insert into t1 partition (yy = 2016, mm = 1) values ('Partition exists');
show files in t1;
+---------------------------------------------------------------------+------+--------------+
| Path | Size | Partition |
+---------------------------------------------------------------------+------+--------------+
| /user/hive/warehouse/recover_partitions.db/t1/yy=2016/mm=1/data.txt | 17B | yy=2016/mm=1 |
+---------------------------------------------------------------------+------+--------------+
quit;
In Hive, create some new partitions. In a real use case, you might create the partitions and populate them with data as the final stages of an ETL pipeline.
hive> use recover_partitions;
OK
hive> alter table t1 add partition (yy = 2016, mm = 2);
OK
hive> alter table t1 add partition (yy = 2016, mm = 3);
OK
hive> quit;
For demonstration purposes, manually copy data (a single row) into these new partitions, using manual HDFS operations:
$ hdfs dfs -ls /user/hive/warehouse/recover_partitions.db/t1/yy=2016/
Found 3 items
drwxr-xr-x - impala hive 0 2016-05-09 16:06 /user/hive/warehouse/recover_partitions.db/t1/yy=2016/mm=1
drwxr-xr-x - jrussell hive 0 2016-05-09 16:14 /user/hive/warehouse/recover_partitions.db/t1/yy=2016/mm=2
drwxr-xr-x - jrussell hive 0 2016-05-09 16:13 /user/hive/warehouse/recover_partitions.db/t1/yy=2016/mm=3
$ hdfs dfs -cp /user/hive/warehouse/recover_partitions.db/t1/yy=2016/mm=1/data.txt \
/user/hive/warehouse/recover_partitions.db/t1/yy=2016/mm=2/data.txt
$ hdfs dfs -cp /user/hive/warehouse/recover_partitions.db/t1/yy=2016/mm=1/data.txt \
/user/hive/warehouse/recover_partitions.db/t1/yy=2016/mm=3/data.txt
hive> select * from t1;
OK
Partition exists 2016 1
Partition exists 2016 2
Partition exists 2016 3
hive> quit;
In Impala, initially the partitions and data are not visible. Running ALTER TABLE with the RECOVER PARTITIONS clause scans the table data directory to find any new partition directories, and the data files inside them:
select * from t1;
+------------------+------+----+
| s | yy | mm |
+------------------+------+----+
| Partition exists | 2016 | 1 |
+------------------+------+----+
alter table t1 recover partitions;
select * from t1;
+------------------+------+----+
| s | yy | mm |
+------------------+------+----+
| Partition exists | 2016 | 1 |
| Partition exists | 2016 | 3 |
| Partition exists | 2016 | 2 |
+------------------+------+----+
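To wrap up, here is a rule-of-thumb summary of when each statement applies (a sketch based on the documentation quoted above; the table names are placeholders):

```sql
-- Table was created outside Impala (e.g. in Hive): Impala must learn it exists
INVALIDATE METADATA new_table;

-- Table is already known; new data files were appended to existing directories
REFRESH known_table;

-- Partitioned table; new partition directories were added via Hive or HDFS
ALTER TABLE partitioned_table RECOVER PARTITIONS;
```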