Impala: architecture and components

Impala overview

Impala is a real-time SQL query engine for Hadoop. Its main goal is to make SQL-on-Hadoop operations fast and efficient, improving the performance of SQL queries over big data. Impala supplements the existing big data query tools; it does not replace batch-processing frameworks built on MapReduce, such as Hive.

Impala directly reads data stored in HDFS, HBase, or Amazon Simple Storage Service (S3). In addition to using the same storage platforms as Hive, Impala also uses the same metadata, SQL syntax (Hive SQL), ODBC driver, and user interface (the Impala query UI in Hue) as Hive.

Hive, built on MapReduce, and other frameworks such as Spark are suited to long-running batch jobs, such as those involving extraction, transformation, and loading (ETL). Impala is better suited to interactive queries.

To avoid that latency, Impala bypasses MapReduce and accesses data directly through a dedicated distributed query engine, very similar to the query engines found in commercial distributed RDBMSs. The result is performance that is an order of magnitude faster than Hive, depending on the type of query and the configuration.

How Impala works on Apache Hadoop

How Impala fits into the Hadoop ecosystem

Impala makes use of many familiar components in the Hadoop ecosystem. It can exchange data with other Hadoop components, as both consumer and producer, so it fits flexibly into ETL and ELT pipelines.

How Impala works with Hive

The main goal of Impala is to make SQL-on-Hadoop operations fast and efficient. In practice, it makes use of the existing Hive infrastructure and data that many Hadoop users already have in place for long-running, batch-oriented SQL queries.

Impala can use the table metadata that Hive stores in MySQL or PostgreSQL, and it stores some of its own metadata in the same database. This MySQL or PostgreSQL database is called the metastore. As long as all columns use data types, file formats, and compression codecs supported by Impala, Impala can access tables defined or loaded by Hive.
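
As a minimal sketch of this sharing (table and column names are hypothetical), a table created through Hive becomes queryable from Impala once Impala picks up the new metadata:

    -- In Hive: create a table using types and a file format Impala supports.
    CREATE TABLE web_logs (
      log_time TIMESTAMP,
      url      STRING,
      status   INT
    )
    STORED AS PARQUET;

    -- In Impala: pick up the table created outside of Impala, then query it.
    INVALIDATE METADATA web_logs;
    SELECT status, COUNT(*) FROM web_logs GROUP BY status;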

Overview of Impala's Metadata and Metastore

As mentioned above, Impala maintains table metadata in the database called the metastore. It also tracks lower-level metadata, such as the physical locations of blocks within HDFS.

For tables with large amounts of data or many partitions, retrieving all of a table's metadata can be time-consuming, in some cases taking several minutes. Each Impala node therefore caches all of this metadata for future queries against the same table.

If a table's metadata, or the data within the table, is updated, all other impalad processes in the cluster must receive the latest metadata and replace their outdated cached metadata before issuing new queries against that table. In Impala 1.2 and later, metadata updates are automatic and are coordinated through the Catalog process.

For changes made to tables through Hive, or manual changes to files in HDFS, you still need to use the REFRESH statement (when new data files are added to an existing table) or the INVALIDATE METADATA statement (for a brand-new table, or after dropping a table, performing an HDFS rebalance operation, or deleting data files). Executing INVALIDATE METADATA on its own retrieves metadata for all the tables tracked by the metastore. If you know that only specific tables were changed by programs outside of Impala (such as Hive), you can execute REFRESH table_name for each affected table to retrieve only the latest metadata for those tables.
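
For example (the table name is hypothetical), each situation above maps to a different statement:

    -- New data files were added to an existing table from outside Impala
    -- (for example, by Hive or a direct HDFS copy): refresh just that table.
    REFRESH sales;

    -- A brand-new table was created in Hive, or a table or its files were
    -- dropped: invalidate the cached metadata for that one table...
    INVALIDATE METADATA sales;

    -- ...or, far more expensively, for every table tracked by the metastore.
    INVALIDATE METADATA;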

How Impala uses HDFS

Impala uses the distributed file system HDFS as its main data storage medium. Impala relies on the redundancy provided by HDFS to guard against hardware or network outages on individual nodes. Impala table data is physically represented as data files in HDFS, using the familiar HDFS file formats and compression codecs. When data files appear in the directory of a new table, Impala reads them all, regardless of their file names. New data is added in files whose names are controlled by Impala.

Impala can be deployed on the same nodes as the DataNodes, so that it can read files locally, greatly reducing the resources consumed by network transmission.
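
As a hedged sketch of the usual flow (paths and names are hypothetical): table data lives as ordinary files in HDFS, and LOAD DATA simply moves existing HDFS files into the table's directory:

    -- Create a table whose data is stored as Parquet files in HDFS.
    CREATE TABLE page_views (
      view_time TIMESTAMP,
      user_id   BIGINT,
      url       STRING
    )
    STORED AS PARQUET;

    -- Move already-written HDFS files into the table's directory;
    -- Impala reads every file there regardless of its original name.
    LOAD DATA INPATH '/staging/page_views/' INTO TABLE page_views;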

How Impala uses HBase

HBase is an alternative to HDFS as a storage medium for Impala data. It is a database storage system built on top of HDFS, without built-in SQL support. Many Hadoop users have already configured it and store large (usually sparse) data sets in it. By defining tables in Impala and mapping them to equivalent tables in HBase, you can query the contents of HBase tables through Impala, and even run join queries that include both Impala tables and HBase tables.
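
A sketch of such a mapping (all table, column, and column-family names are hypothetical); the mapping is defined through Hive's HBase storage handler, after which Impala can query and join the table like any other:

    -- In Hive: map an existing HBase table into the metastore.
    CREATE EXTERNAL TABLE hbase_users (
      id   STRING,
      name STRING,
      age  INT
    )
    STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
    WITH SERDEPROPERTIES (
      "hbase.columns.mapping" = ":key,info:name,info:age"
    )
    TBLPROPERTIES ("hbase.table.name" = "users");

    -- In Impala: pick up the new definition, then join the HBase-backed
    -- table with an HDFS-backed Impala table.
    INVALIDATE METADATA hbase_users;
    SELECT o.order_id, u.name
    FROM orders o JOIN hbase_users u ON o.user_id = u.id;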

How Impala completes a query

  1. Through a standard SQL interface such as an ODBC or JDBC client, the user sends a SQL statement to Impala. The statement can be sent to any impalad in the cluster, and that impalad becomes the coordinator for the query.
  2. Impala parses and analyzes the statement to determine which tasks the impalad instances in the cluster need to execute for the most efficient plan (see the EXPLAIN example after this list).
  3. Storage services such as HDFS and HBase supply data to the local impalad instances.
  4. Each impalad instance returns data to the coordinating impalad, which returns the final results to the client.
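
To see the distributed plan a coordinator builds before distributing work, you can prefix any query with EXPLAIN (the table name is hypothetical):

    -- Show the distributed execution plan without running the query.
    EXPLAIN SELECT COUNT(*) FROM orders WHERE order_date >= '2020-01-01';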

Impala architecture

Clients: Hue, ODBC clients, JDBC clients, and the Impala shell can all interact with Impala.

Hive Metastore: Impala can use Hive's metadata. Through the metastore, Impala learns which databases exist and what the structures of their tables are. When you create, drop, or alter schema objects, or load data into tables through Impala SQL, the related metadata changes are automatically broadcast to all Impala nodes.

Impala: The impalad process runs on the DataNodes to coordinate and execute queries. Each instance can receive, plan, and coordinate queries from Impala clients. Query work is distributed among the Impala nodes, which then act as workers, executing the query fragments in parallel.

HBase and HDFS: store the data being queried.

Impala components

Impalad

Impala is a distributed, massively parallel processing (MPP) database engine. It has a single core process, impalad, instances of which run on designated hosts across the cluster.

Impalad is the Impala daemon. As the core component of Impala, it performs the following important functions:

  1. Reads and writes data files.
  2. Accepts queries submitted from the impala-shell command, Hue, JDBC, or ODBC.
  3. Parallelizes queries and distributes work across the cluster.
  4. Transmits intermediate query results back to the central coordinator.

Impalad can be deployed in the following ways:

  1. If HDFS and Impala are in the same cluster, impalad and the DataNode are deployed on the same machines.
  2. Impala is deployed separately in a compute cluster and reads data remotely from HDFS, S3, ADLS, etc.

Impalad communicates continuously with the StateStore to confirm which daemons are healthy and can accept new work. The daemons also receive broadcast messages from the catalogd process (introduced in Impala 1.2): whenever any Impala daemon in the cluster creates, alters, or drops a database or table, or executes an INSERT or LOAD DATA statement through Impala, catalogd broadcasts the resulting metadata change to every node, so no manual REFRESH or INVALIDATE METADATA is needed for changes made through Impala. Prior to Impala 1.2, this coordination had to be performed manually across the Impala daemons.

Statestore

The StateStore component checks the health of all Impala daemons in the cluster and continuously relays its findings to each daemon.

If an impalad goes offline because of a hardware failure, network error, software problem, or other reason, the StateStore notifies all the other impalad processes, so that subsequent queries avoid sending requests to the unreachable daemon.

Catalog

The Catalog component relays metadata changes to the impalad processes in the cluster and is represented by a daemon called catalogd. This service only needs to run on one host. Because requests are passed through the StateStore process, catalogd and the StateStore should run on the same machine.

The --load_catalog_in_background option controls when table metadata is loaded.

  • If set to false, a table's metadata is loaded when the table is first referenced (see the example after this list). This means the first query against a table can be slower. Starting with Impala 2.2, the default value of --load_catalog_in_background is false.
  • If set to true, the Catalog service tries to load a table's metadata even when no query requires it yet, so the metadata may already be cached when the first query that needs it runs. However, it is recommended not to set this option to true, for the following reasons:
    • Background loading can interfere with the metadata loading needed by queries. For example, at startup or after invalidating metadata, the metadata a query needs may not yet be loaded. How long this takes depends on the amount of metadata, and it can randomly lengthen query run times in ways that are difficult to diagnose.
    • Impala may load metadata for tables that are never used, which can increase the catalog size and therefore the memory usage of both the Catalog service and the impalad processes.
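
To make the "first reference" behavior concrete (the table name is hypothetical): with the default setting of false, the first statement that touches a table pays the metadata-loading cost:

    -- With --load_catalog_in_background=false, this first reference to the
    -- table triggers metadata loading and may be noticeably slower...
    SELECT COUNT(*) FROM sales;

    -- ...while subsequent statements reuse the already-cached metadata.
    SELECT MAX(amount) FROM sales;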

 
