Impala learning summary

 

Impala is a query system developed by Cloudera. It provides SQL semantics and can query petabyte-scale data stored in Hadoop's HDFS and HBase. Although the existing Hive system also provides SQL semantics, Hive executes queries on the MapReduce engine, so it remains a batch-processing system and struggles to deliver interactive query latency. In contrast, Impala's biggest feature and biggest selling point is its speed. So how does Impala achieve fast queries over big data? Before answering this question, we need to introduce Google's Dremel system, because Impala's original design was modeled on Dremel.

1. Impala core components



 Figure 1: Impala instance

Impala Daemon

The core component of Impala is the impalad daemon (Impala daemon) running on each node. It is responsible for reading and writing data files; receiving query statements sent from interfaces such as impala-shell, Hue, JDBC, and ODBC; parallelizing queries and distributing work tasks to the nodes of the Impala cluster; and sending locally computed partial results back to the coordinator node.

You can submit a query to the Impala daemon running on any node; that node acts as the coordinator for the query, and the other nodes stream their partial result sets to it. The coordinator then assembles the final result set. For convenience during experiments or tests, we often connect to the same Impala daemon for every query, but for production-level applications queries should be submitted to different nodes in round-robin fashion so that the load is balanced across the cluster.
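The round-robin coordinator selection described above can be sketched as follows. This is a minimal illustration, not Impala code: the host names and port are hypothetical placeholders for a client-side rotation over the cluster's impalad endpoints.

```python
from itertools import cycle

# Hypothetical impalad endpoints; names and port are illustrative only.
IMPALAD_HOSTS = ["impalad-1:21050", "impalad-2:21050", "impalad-3:21050"]

_next_host = cycle(IMPALAD_HOSTS)

def pick_coordinator():
    """Return the next impalad to use as this query's coordinator,
    rotating through the cluster so no single node bears all the
    coordination work."""
    return next(_next_host)

# Four consecutive queries land on different nodes, wrapping around.
chosen = [pick_coordinator() for _ in range(4)]
print(chosen)
# ['impalad-1:21050', 'impalad-2:21050', 'impalad-3:21050', 'impalad-1:21050']
```

In practice this rotation is usually done by a load balancer in front of the cluster rather than in application code, but the effect is the same: coordination work is spread evenly.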

The Impala daemon communicates continuously with the statestore to learn which nodes are healthy and can accept new work. It also receives broadcast messages from the catalogd daemon (introduced in Impala 1.2) to keep its metadata up to date; catalogd broadcasts whenever any node in the cluster creates, alters, or drops an object, or executes an INSERT or LOAD DATA statement.

Impala Statestore

The Impala statestore checks the health of the Impala daemon on each node of the cluster and continuously reports the results back to every Impala daemon. The physical process name of this service is statestored, and only one such process is needed in the entire cluster. If an Impala node goes offline due to a hardware fault, software error, or any other reason, the statestore notifies the remaining nodes so that they stop sending requests to the offline node.

Because the statestore's role is to notify the cluster when a node has a problem, it is not critical to the Impala cluster's operation. If the statestore is not running or has failed, the other nodes and distributed tasks continue to run as usual; the cluster is simply less robust when nodes go offline. When the statestore comes back up, it resumes communicating with and monitoring the other nodes.
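The statestore's health-tracking role can be modeled as a heartbeat registry: each daemon periodically checks in, and any node whose last heartbeat is too old is considered offline. This is a toy sketch of the idea only; the class name, timeout value, and node names are all illustrative, not Impala internals.

```python
class StateStore:
    """Toy model of statestored: track the last heartbeat time of each
    node and report which nodes are healthy enough to receive work."""

    def __init__(self, timeout_s=5.0):
        self.timeout_s = timeout_s
        self.last_heartbeat = {}

    def heartbeat(self, node, now):
        # A daemon checks in; record the time of its latest heartbeat.
        self.last_heartbeat[node] = now

    def healthy_nodes(self, now):
        # A node is healthy if its last heartbeat is within the timeout.
        return sorted(n for n, t in self.last_heartbeat.items()
                      if now - t <= self.timeout_s)

ss = StateStore(timeout_s=5.0)
ss.heartbeat("node-a", now=0.0)
ss.heartbeat("node-b", now=0.0)
ss.heartbeat("node-a", now=4.0)   # node-b misses its heartbeat window
print(ss.healthy_nodes(now=6.0))  # ['node-a']
```

Note how losing the registry itself (statestored going down) stops updates to the health view but does not stop the nodes from working, matching the behavior described above.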

Impala Catalog

The Impala catalog service notifies every node in the cluster of metadata changes made by SQL statements. The physical process name of the catalog service is catalogd, and only one such process is needed in the entire cluster. Because its requests go through the statestore daemon, it is best to run the statestored and catalogd processes on the same node.

The catalog service, added in Impala 1.2, reduces the need for REFRESH and INVALIDATE METADATA statements. In earlier versions, after executing a CREATE DATABASE, DROP DATABASE, CREATE TABLE, ALTER TABLE, or DROP TABLE statement on one node, you had to run INVALIDATE METADATA on the other nodes to ensure their metadata was updated. Similarly, after executing an INSERT statement on one node, you had to run REFRESH table_name on any other node before querying there, so that the newly added data files could be recognized. Note that for metadata changes made through Impala itself, the catalog service makes REFRESH and INVALIDATE METADATA unnecessary; but if a table is created or data is loaded through Hive, REFRESH or INVALIDATE METADATA must still be executed to tell Impala to update its metadata.
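The distinction above, between changes broadcast automatically by catalogd and changes made through Hive that bypass it, can be sketched with a toy model of per-node metadata caches. Everything here (class, method, and table names) is illustrative; it models the behavior described in the text, not Impala's actual implementation.

```python
class CatalogModel:
    """Toy model: DDL executed through any impalad is broadcast by
    catalogd to every node's metadata cache, while DDL done directly in
    Hive only touches the metastore and needs a manual INVALIDATE."""

    def __init__(self, nodes):
        # Each node keeps its own cached set of known tables.
        self.caches = {n: set() for n in nodes}

    def ddl_via_impala(self, table):
        # Change made through Impala: catalogd pushes it to all caches.
        for cache in self.caches.values():
            cache.add(table)

    def ddl_via_hive(self, table, metastore):
        # Change made directly in Hive bypasses catalogd entirely.
        metastore.add(table)

    def invalidate_metadata(self, node, metastore):
        # Manual INVALIDATE METADATA reloads one node's cache.
        self.caches[node] = set(metastore)

metastore = set()
cat = CatalogModel(["n1", "n2"])

cat.ddl_via_impala("t_impala")          # visible on every node at once
cat.ddl_via_hive("t_hive", metastore)   # invisible to Impala nodes...
print("t_hive" in cat.caches["n1"])     # False
cat.invalidate_metadata("n1", metastore)
print("t_hive" in cat.caches["n1"])     # True
```

The model makes the rule easy to remember: Impala-side DDL propagates by itself; Hive-side changes stay invisible until a REFRESH or INVALIDATE METADATA pulls them in.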

2. Impala system architecture ( Impala Daemon )



 

Figure 2: Impala daemon system architecture diagram

Impala is effectively the Dremel of Hadoop, and the column storage format used by Impala is Parquet. Parquet implements the columnar storage described in Dremel, will support Hive in the future, and adds features such as dictionary encoding and run-length encoding. The system architecture of Impala is shown in Figure 2.

Impala uses Hive's SQL interface (including SELECT, INSERT, join, and other operations), but currently implements only a subset of Hive's SQL semantics (for example, UDFs are not yet supported), and table metadata is stored in Hive's Metastore. The StateStore is a sub-service of Impala that monitors the health of each node in the cluster and provides functions such as node registration and error detection.

Impala runs a background service, impalad, on each node; impalad responds to external requests and performs the actual query processing. Impalad consists of three main modules: the Query Planner, the Query Coordinator, and the Query Exec Engine. The Query Planner receives queries from SQL applications and ODBC and breaks each query into many sub-queries. The Query Coordinator distributes these sub-queries to the nodes, the Query Exec Engine on each node executes its sub-queries and returns the partial results, and the coordinator aggregates these intermediate results into the final result returned to the user.
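The planner/coordinator/exec-engine pipeline described above is a scatter-gather pattern, which can be sketched as follows. This is a minimal illustration under assumed data: the node names, data blocks, and function names are all hypothetical, and the "query" is reduced to a distributed SUM.

```python
# Hypothetical HDFS data blocks held locally by each node.
DATA_BLOCKS = {
    "node-1": [3, 1, 4],
    "node-2": [1, 5, 9],
    "node-3": [2, 6],
}

def plan(query_nodes):
    """Planner: split the query into one scan fragment per node,
    each covering only that node's local data."""
    return [(node, DATA_BLOCKS[node]) for node in query_nodes]

def exec_engine(fragment):
    """Exec engine: compute a partial SUM over local blocks only."""
    node, blocks = fragment
    return sum(blocks)

def coordinator(query_nodes):
    """Coordinator: scatter fragments to the nodes, gather the
    partial results, and merge them into the final answer."""
    partials = [exec_engine(f) for f in plan(query_nodes)]
    return sum(partials)

print(coordinator(["node-1", "node-2", "node-3"]))  # 31
```

The key property this models is that each exec engine touches only data local to its node, so the coordinator moves small partial results across the network rather than the raw data.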

 

3. Why Impala is faster than Hive

 

1. Developed in C++ instead of Java, reducing runtime overhead.
2. Impala does not write intermediate data to disk; all processing is completed in memory.
3. A brand-new execution engine: query tasks are executed immediately instead of being compiled into MapReduce jobs, saving a lot of initialization time.
4. Impala's query planner uses smarter algorithms to execute each query step in a distributed manner across multiple nodes, while avoiding sort and shuffle, two very time-consuming stages that are often unnecessary.

 

4. Why Impala cannot replace Hive for the time being

 

1. Impala is still immature and its performance can be unstable; on very large datasets, Impala's join performance is poor.
2. Metadata frequently goes stale, making queries fail.
3. Impala's SQL does not fully cover Hive's SQL.
4. Impala queries sometimes get stuck in the executing state and cannot be killed.
5. Hive is widely adopted, and switching away from it is not easy.
